Processor Benchmark Results


  1. Introduction
  2. GEANT Tests
  3. Livermore Loops Tests
  4. Conclusions

INTRODUCTION

This page collects my personal benchmark results. The code I use to benchmark the various processors I have been able to get access to is called GSIM, a GEANT-based Monte Carlo code that the CLAS collaboration uses to model its detector.
The GSIM code is relatively demanding on both the INTEGER and FLOATING POINT units of the processor. It has low I/O requirements and relatively low memory requirements: the code fits in about 20 MB of memory, and it is not very sensitive to memory access times. Most of the computations are on integers, floats (real) and doubles. The random number generator is given a fixed seed, so every run should reproduce exactly the same computations. This benchmark is thus a realistic measure of the performance of a processor for "real" scientific computing. The variation in the results from run to run on the same system is minimal.
It should be noted, however, that the GEANT code is not optimized for very modern processors such as the Pentium 4, x86_64 (AMD Opteron) or Itanium. The GEANT libraries are basically i386 code that does not make use of the special features of these processors. Recompiling the GEANT libraries to make use of these CPUs is rather cumbersome and not likely to happen anytime soon. In fact, these libraries cannot be converted to 64-bit, due to inherent programming choices (errors?) in the libraries. A version of the code recompiled with a newer compiler, but still in 32-bit mode, will still provide some speed enhancements.
The GSIM test is a straightforward 32-bit test that uses no SSE or 3DNow! instructions. It is thus more sensitive to raw clock speed.
The current version of the code and the tests reported here are all on RedHat 7.1 with the EGCS compiler, version 2.91.66. This code used a standard optimization level of -O2 for both the Fortran (g77) and C (egcs) compilers.

I also report some results obtained from the "Livermore Loops" program in C. This program unfortunately does not use a very accurate timing system, and thus suffers from rather large variations in the results from run to run. Numbers quoted in the table are the average of at least 3 runs of the code. The "raw" output for the runs is in this file.

DISCLAIMERS ETC:
It should be noted that
1) none of the codes used here were specially optimized to use the advanced features of the processor. This is what occurs in a realistic scientific environment, where the same executable is used for several different Intel-based processors. Better results (especially on the newer high-end processors) can be obtained using special compilers that make use of the special features of the newer processors.
2) I make no claims that these tests are "fair", that this is the "best measure" of the processor, or anything like that. I can only say that for scientific calculations, especially GEANT based calculations, these results can be useful when making purchasing decisions.



GEANT TESTS

The GSIM test was run with the September 11, 2001 version of the GSIM code. The input file consisted of a set of events, each containing one electron, one proton and one eta particle. I ran 600 of these events through the code, which takes between 8 and 15 minutes. I took care to make sure that this process received close to 99% of the CPU time. The load time of the code is not included in the seconds/event estimate.

The following table shows the results. "CPU Clock Speed" is the BIOS's nominal rating of the core frequency of the CPU, in MHz. An overclocked CPU will give a higher number than the one the factory stamped on the chip. "BogoMips" is the CPU speed rating provided by the Linux kernel (see "cat /proc/cpuinfo"). "GSIM seconds/event" is the CPU time spent to fully propagate the 3 particles through the detector, including all secondary particles. "Relative Speed" is the "GSIM seconds/event" normalized to a Celeron 300A at 300 MHz. The final column, "Processor Efficiency", is a measure of how well the core of the CPU performs. It is the relative speed divided by the clock speed, normalized to 100% for a Celeron 300A. A CPU with a more advanced core should be able to execute more instructions per clock cycle, and thus score higher on this measure.
 
 

GSIM Release September 11, 2001 vs Processor +
Processor                           CPU Clock Speed (MHz)  BogoMips  GSIM seconds/event  Relative Speed*  Processor Efficiency***
Opteron 250 2.4 GHz                 2392                   4702      0.2359              969%             122%
P4 2.4 GHz 533 MHz FSB RDRAM-PC800  2405.5                 4797.2    0.3656              625%             78%
Athlon 1800+ MP                     1526.5                 3047.4    0.3973              575.2%           113.0%
Athlon 1.3 GHz                      1333.3                 2660.7    0.4569              500.13%          112.53%
Athlon MP 1 GHz                     995.5                  1985.7    0.6047              377.89%          113.9%
P4 1.5 GHz                          1495.5                 2981.9    0.5978              382.3%           76.68%
Pentium III 1 GHz                   996                    1985.7    0.661               345.7%           104.1%
Pentium III 850                     851.9                  1697.4    0.77285             295.7%           104.1%
Pentium II 450                      450                    901.1     1.5427              148.1%           98.75%
Celeron 300A **                     450                    458       1.5216              150.2%           100.1%
Celeron 300A                        300                    300       2.2851              100%             100%
*   Speed of this CPU as a percentage of the speed of a 300 MHz Celeron, which costs less than $60. Accurate to +/- a few percent.
** Overclocked CPU.
*** The Processor efficiency measures how well a processor performed relative to its CORE CLOCK FREQUENCY. It is the Relative Speed/CPU Clock Speed normalized to a CELERON 300A.
+ These tests were performed with a "release-1-28-gsim-fix" copy of the GSIM code with identical input generated by aao_rad, for H(e,e',Pi0,Pi+), Eb=1.515 GeV, Torus=0.3886, "run 10" geometry. Each test ran 5000 events (approx. 2 hours). Seconds/event is determined from the GSIM log file and verified with the Unix "time" command.
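
The two derived columns can be recomputed from the measured ones. The following is only a minimal sketch in Python (it is not part of GSIM, and the function names are made up for this illustration), assuming the definitions given in the text: relative speed is the Celeron 300A seconds/event divided by the tested CPU's seconds/event, and efficiency is that ratio divided by the ratio of clock speeds.

```python
# Reference machine: Celeron 300A at its stock 300 MHz (values from the table above).
REF_CLOCK_MHZ = 300.0
REF_SEC_PER_EVENT = 2.2851


def relative_speed(sec_per_event):
    """Speed as a percentage of the reference Celeron 300A (100 means equal speed)."""
    return 100.0 * REF_SEC_PER_EVENT / sec_per_event


def processor_efficiency(sec_per_event, clock_mhz):
    """Relative speed divided by relative clock speed, in percent."""
    return relative_speed(sec_per_event) / (clock_mhz / REF_CLOCK_MHZ)


# Athlon 1.3 GHz row: measured clock 1333.3 MHz, 0.4569 seconds/event.
print(round(relative_speed(0.4569), 2))                # 500.13
print(round(processor_efficiency(0.4569, 1333.3), 2))  # 112.53
```

Both printed values agree with the Athlon 1.3 GHz row of the table, which suggests the sketch matches how those columns were derived.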

Comments: The most striking result is the poor performance of Intel's P4: one only receives about 77% of the increased clock speed! A similar result has been obtained by other benchmarkers, such as Tom's Hardware. The other striking thing is that the P4 makes up for this lack by sheer clock frequency. The opposite is true for the Athlons, which perform relatively better at about 112%. The better Athlon core thus gives a 1 GHz Athlon a speed similar to that of a 1.5 GHz P4, while the Athlon at 1.3 GHz is 30% FASTER than a 1.5 GHz P4. (Yes, this result would be different with a P4-optimized compiler; however, such a compiler is not (yet?) available for Linux. NOTE: This optimized compiler is now available from INTEL and it performs VERY WELL. However, I have not yet managed to compile the CERNLIBS using this compiler. Maybe g77 will soon optimize for the P4.)



LIVERMORE LOOPS TESTS

As mentioned before, the Livermore Loops results are not very reproducible (see The Parkbench Benchmark Collection for more detail). I used the code anyway, since Linpack (either the d or c version) would not give useful results at all, and I wanted at least one other measure of the CPUs tested.
I ran the Livermore Loops code (the version ported to C) at least 3 times per machine. The full results table is here. The table below shows, for each machine, the "Geometric Mean" and the "Average" reported by each run, averaged over the runs.

The nice thing about this test is that the units are MFlops, a figure that is widely quoted for processors.
 
Processor                                                Speed (MHz)  <Geometric>    <Average>      Relative Speed  Efficiency
AMD Opteron 244 (gcc 3.2.2 x86_64)                       1794.4       544.7 MFlops   750.2 MFlops   962%            160%
P4 2.4 GHz 533 MHz FSB RDRAM-PC800 (gcc 3.2.2 pentium4)  2405.5       398 MFlops     606 MFlops     703%            87.9%
Athlon 1800+ MP (gcc 3.2.2)                              1526.5       393.1 MFlops   628.9 MFlops   694%            115%
P4 2.4 GHz 533 MHz FSB RDRAM-PC800                       2405.5       322 MFlops     526 MFlops     568%            71.1%
Athlon 1800+ MP                                          1526.5       346.5 MFlops   511.2 MFlops   612%            102%
Athlon 1.3 GHz                                           1333.3       302.6 MFlops   461.0 MFlops   535%            123%
Athlon MP 1 GHz                                          995.5        228.9 MFlops   326.5 MFlops   404%            121%
P4 1.5 GHz                                               1495.5       186.9 MFlops   291.4 MFlops   330%            66%
PIII 1 GHz                                               996          182.1 MFlops   228.1 MFlops   322%            96%
PIII 850                                                 851.9        164.3 MFlops   220.9 MFlops   290%            102.1%
PII 450                                                  450          77.89 MFlops   93.66 MFlops   138%            91.7%
Celeron 300A                                             450          85.49 MFlops   103.5 MFlops   151%            100%
Celeron 300A                                             300          56.6 MFlops    69.1 MFlops    100%            100%
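
To make the <Geometric> and <Average> columns concrete, here is a small Python sketch of how such figures could be formed. It assumes that each run reports a geometric mean and an arithmetic average over its kernel rates, and that the table entry is the plain average of those per-run figures. The kernel rates below are made-up placeholders, not measured data, and the helper names are hypothetical.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# HYPOTHETICAL per-kernel MFlops for three runs (placeholders, NOT measured data).
runs = [
    [40.0, 90.0, 160.0],
    [42.0, 88.0, 150.0],
    [38.0, 92.0, 170.0],
]

geo_per_run = [geometric_mean(r) for r in runs]
avg_per_run = [arithmetic_mean(r) for r in runs]

# Average each per-run figure over the runs, as assumed for the table columns.
print("<Geometric>:", arithmetic_mean(geo_per_run))
print("<Average>:  ", arithmetic_mean(avg_per_run))
```

The geometric mean is never larger than the arithmetic average and is less dominated by a few very fast kernels, which is consistent with the <Geometric> column being the smaller of the two in every row.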

I found it somewhat striking that the Livermore Loops results track the GSIM scores fairly well. The P4 does slightly worse in comparison to the Athlon in this test. The Intel compiler (icc) is supposed to give better results, but unfortunately its optimization disturbs the timing routines, making the results useless.



CONCLUSIONS

For raw speed in scientific computing, nothing so far seems to beat the Athlon processors, at least until we get g77 to compile for the P4. A dual Athlon performs very nicely, and there seems to be no penalty for running the GSIM code on both processors simultaneously. The Pentium 4, though much hyped by Intel, is an underperformer relative to its clock speed. This is most likely due to the lack of a specially optimized compiler that can compile the CERNLIB code. Now that such a compiler exists, the P4 becomes more attractive; however, it will require recompiling all your code with the INTEL compiler. The large extra premium you pay for the P4 (especially for dual P4 systems) may not be worth it yet. (See Note below.)
As far as BANG FOR THE BUCK (BFB) goes, the general rule should be: take the performance of the machine relative to a 300A, then divide by the cost of the machine. Now choose the machine with the highest BFB.
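
The rule above is simple enough to write down as a short Python sketch. The relative speeds are taken from the GSIM table; the costs are purely hypothetical placeholders in arbitrary units (no actual prices are given on this page), and the function name is made up for this illustration.

```python
def bang_for_buck(relative_speed_percent, cost):
    """Performance relative to a Celeron 300A, divided by the cost of the machine."""
    return relative_speed_percent / cost


# (relative speed in % from the GSIM table, HYPOTHETICAL cost in arbitrary units)
candidates = {
    "Athlon 1.3 GHz": (500.13, 1000.0),
    "P4 1.5 GHz": (382.3, 1500.0),
}

# Choose the machine with the highest BFB.
best = max(candidates, key=lambda name: bang_for_buck(*candidates[name]))
print(best)
```

With real quotes substituted for the placeholder costs, the same comparison extends to any number of candidate machines.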

Note: Recently I saw a benchmark by a colleague who runs a code for very large matrix inversion using complex numbers. This application is very sensitive to memory access times. Using the Portland f90 compiler, he achieved a 30% to 50% speed gain for a Pentium 4 over a Pentium III, after correcting for the difference in clock speed. Thus, on the 1.4 GHz P4 he used, his code ran about 2.5 times faster than on a PIII. This is the first real test where I've seen the P4 shine. Most likely this is due to the RDRAM. He has not yet tried the code on an Athlon with DDR memory.


For comments or questions: holtrop(at)jlab.org