This page is my personal benchmark results page.
The code I use to benchmark the various processors I have been able to get
access to is called GSIM, which is a GEANT based Monte Carlo code that is
used by the CLAS collaboration to model their detector.
The GSIM code is relatively demanding on both the
INTEGER and FLOATING POINT units of the processor. It has low I/O requirements
and relatively low memory requirements. The code fits in about 20 MB of memory
and is not very sensitive to memory access times. Most of the computations are on
integers, floats (real) and doubles. The random number generator is given
a fixed seed so that every run performs exactly the same computations.
This benchmark is thus a realistic measure of the performance
of a processor for "real" scientific computing. The variation in the
results from run to run on the same system is minimal.
It should be noted however that for very modern processors, such as the
Pentium4 or x86_64 (AMD Opteron) or Itanium processors, the GEANT code will
not be optimized for these processors. The GEANT libraries are basically i386
code that does not make use of the special features of these processors.
Recompiling the GEANT libraries to make use of these CPUs is rather cumbersome
and not likely to happen anytime soon. In fact, these libraries cannot be converted
to 64-bit, due to inherent programming choices (errors?) in the libraries.
A version of the code recompiled with a newer compiler, but still in 32-bit
mode, will still provide some speed enhancements.
The current version of the code and the tests reported
here are all on RedHat 7.1 with the EGCS compiler, version 2.91.66. This code
used a standard optimization level of -O2 for both the Fortran (g77) and
C (egcs) compilers.
I also report some results obtained from the "Livermore Loops" program in C. This program unfortunately does not use a very accurate timing system, and thus suffers from rather large variations in the results from run to run. Numbers quoted in the table are the average of at least 3 runs of the code. The "raw" output of the runs is in this file.
DISCLAIMERS ETC:
It should be noted that
1) none of the codes used here were specially optimized
to use the advanced features of the processor. This is what occurs in a realistic
scientific environment, where the same executable is used for several different
Intel based processors. Better results (especially on the newer high end
processors) can be obtained using special compilers that make use of the
special features of the newer processors.
2) I make no claims that these tests are "fair", that
this is the "best measure" of the processor, or anything like that. I can
only say that for scientific calculations, especially GEANT based calculations,
these results can be useful when making purchasing decisions.
The GSIM test was run with the September 11, 2001 version of the GSIM code. The input file consisted of a set of events, each containing one electron, one proton and one eta particle. I ran 600 of these events through the code, which takes between 8 and 15 minutes. I took care to make sure that this process received near 99% cpu time. Also the load time of the code is not included in the seconds/event estimate.
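The seconds/event figure described above can be sketched as follows. This is only an illustration in Python, not the GSIM code itself (which is built with g77 and egcs); `process_event` is a hypothetical stand-in for whatever propagates one event. The CPU-time clock is started only after any setup, so load time is excluded.

```python
import time


def seconds_per_event(process_event, events):
    """Return the CPU seconds spent per event.

    process_event is a hypothetical callable that handles one event.
    The clock starts here, after the program and its inputs are loaded,
    so load time is not part of the estimate.
    """
    start = time.process_time()  # CPU time of this process, not wall time
    count = 0
    for event in events:
        process_event(event)
        count += 1
    elapsed = time.process_time() - start
    if count == 0:
        raise ValueError("no events were processed")
    return elapsed / count


if __name__ == "__main__":
    # Dummy workload standing in for one simulated event (an assumption).
    rate = seconds_per_event(lambda e: sum(range(10000)), range(600))
    print(f"{rate:.6f} CPU seconds/event")
```

Using CPU time rather than wall-clock time keeps the number meaningful even when the process does not get the full processor.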
The following table shows the results. "CPU Clock Speed" is the nominal
rating by the BIOS of the core frequency of the CPU, in MHz. An overclocked CPU
will give a higher number than the rating the factory stamped on the chip. The "Bogo
Mips" is the CPU speed rating provided by the Linux kernel (see "cat /proc/cpuinfo").
The "GSIM seconds/event" shows the CPU time spent to fully propagate the
3 particles through the detector, including all secondary particles. The
"Relative Speed" is the "GSIM seconds/event" normalized to a Celeron 300A
at 300 MHz. The final column, processor efficiency, is a measure of how well
the core of the CPU performs. It is the relative speed divided by the clock
speed, normalized to 100% for a Celeron 300A. A CPU with a more advanced
core should be able to execute more instructions per clock cycle, and thus
score higher on this measure.
Processor | CPU Clock Speed (MHz) | BogoMips | GSIM seconds/event | Relative Speed* | Processor Efficiency*** |
Opteron 250 2.4 GHz | 2392 | 4702 | 0.2359 | 969% | 122% |
P4 2.4 GHz 533 MHz FSB RDRAM-PC800 | 2405.5 | 4797.2 | 0.3656 | 625% | 78% |
Athlon 1800+ MP | 1526.5 | 3047.4 | 0.3973 | 575.2% | 113.0% |
Athlon 1.3 GHz | 1333.3 | 2660.7 | 0.4569 | 500.13% | 112.53% |
Athlon MP 1 GHz | 995.5 | 1985.7 | 0.6047 | 377.89% | 113.9% |
P4 1.5 GHz | 1495.5 | 2981.9 | 0.5978 | 382.3% | 76.68% |
Pentium III 1 GHz | 996 | 1985.7 | 0.661 | 345.7% | 104.1% |
Pentium III 850 | 851.9 | 1697.4 | 0.77285 | 295.7% | 104.1% |
Pentium II 450 | 450 | 901.1 | 1.5427 | 148.1% | 98.75% |
Celeron 300A ** | 450 | 458 | 1.5216 | 150.2% | 100.1% |
Celeron 300A | 300 | 300 | 2.2851 | 100% | 100% |
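The two derived columns in the table above can be recomputed from the clock speed and the seconds/event. The sketch below is in Python and is only my reading of the definitions given in the text; the function names are made up for illustration, and the numbers are copied from a few rows of the table.

```python
# Baseline from the table: Celeron 300A at 300 MHz, 2.2851 seconds/event.
BASE_CLOCK_MHZ = 300.0
BASE_SEC_PER_EVENT = 2.2851


def relative_speed(sec_per_event):
    """Speed relative to the Celeron 300A, in percent."""
    return 100.0 * BASE_SEC_PER_EVENT / sec_per_event


def efficiency(sec_per_event, clock_mhz):
    """Relative speed divided by relative clock, in percent."""
    return relative_speed(sec_per_event) / (clock_mhz / BASE_CLOCK_MHZ)


# (processor, clock in MHz, GSIM seconds/event), copied from the table.
rows = [
    ("Opteron 250 2.4 GHz", 2392.0, 0.2359),
    ("Athlon 1.3 GHz", 1333.3, 0.4569),
    ("P4 1.5 GHz", 1495.5, 0.5978),
    ("Celeron 300A", 300.0, 2.2851),
]

if __name__ == "__main__":
    for name, clock, sec in rows:
        rel = relative_speed(sec)
        eff = efficiency(sec, clock)
        print(f"{name:22s} {rel:8.2f}% {eff:8.2f}%")
```

The same functions give the direct comparison between two machines: the ratio of their seconds/event values, for instance 0.5978 / 0.4569 for the 1.5 GHz P4 against the 1.3 GHz Athlon.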
Comments: The most striking result is the poor
performance of Intel's P4. One only receives about 77% of the increased
clock speed! A similar result has been obtained by other benchmarkers like
Tom's Hardware. The other striking thing
is that the P4 makes up for this lack by sheer clock frequency.
The opposite is true for the Athlons, which
perform relatively better at 112% to 114%. The better Athlon core thus gives a similar
speed result for a 1 GHz Athlon compared to a 1.5 GHz P4, while the Athlon
at 1.3 GHz is 30% FASTER than a 1.5 GHz P4. (Yes, this result would be different
with a P4 optimized compiler; however, such a compiler is not (yet?) available
for Linux. NOTE: This optimized compiler is now available from INTEL and it
performs VERY WELL. However, I have not yet managed to compile the CERNLIB code using
this compiler. Maybe g77 will soon optimize for the P4.)
As mentioned before, the Livermore Loops code is not too reproducible
(see The Parkbench Benchmark Collection for more detail).
I used it anyway since Linpack (either d or c versions)
would not give useful results at all, and I wanted at least one other measure
of the CPUs tested.
I ran the Livermore Loops code (the one ported to C) at least 3 times per
machine. The full results table is here.
The table below shows the "Geometric Mean" and the "Average" reported by
the program, each averaged over the runs on that machine.
The nice thing about this test is that the units are MFlops, which is
a widely claimed number for processors.
Processor | Speed (MHz) | <Geometric> | <Average> | Relative Speed | Efficiency |
AMD Opteron 244 (gcc 3.2.2 x86_64) | 1794.4 | 544.7 MFlops | 750.2 MFlops | 962% | 160% |
P4 2.4 GHz 533 MHz FSB RDRAM-PC800 (gcc 3.2.2 pentium4) | 2405.5 | 398 MFlops | 606 MFlops | 703% | 87.9% |
Athlon 1800+ MP (gcc 3.2.2) | 1526.5 | 393.1 MFlops | 628.9 MFlops | 694% | 115% |
P4 2.4 GHz 533 MHz FSB RDRAM-PC800 | 2405.5 | 322 MFlops | 526 MFlops | 568% | 71.1% |
Athlon 1800+ MP | 1526.5 | 346.5 MFlops | 511.2 MFlops | 612% | 102% |
Athlon 1.3 GHz | 1333.3 | 302.6 MFlops | 461.0 MFlops | 535% | 123% |
Athlon MP 1 GHz | 995.5 | 228.9 MFlops | 326.5 MFlops | 404% | 121% |
P4 1.5 GHz | 1495.5 | 186.9 MFlops | 291.4 MFlops | 330% | 66% |
PIII 1 GHz | 996 | 182.1 MFlops | 228.1 MFlops | 322% | 96% |
PIII 850 | 851.9 | 164.3 MFlops | 220.9 MFlops | 290% | 102.1% |
PII 450 | 450 | 77.89 MFlops | 93.66 MFlops | 138% | 91.7% |
Celeron 300A | 450 | 85.49 MFlops | 103.5 MFlops | 151% | 100% |
Celeron 300A | 300 | 56.6 MFlops | 69.1 MFlops | 100% | 100% |
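The averaging over runs behind the table above can be sketched like this in Python. The per-run numbers in the example are HYPOTHETICAL placeholders, not values from the raw output file. From the rows above, the "Relative Speed" column appears to be the geometric column divided by the Celeron 300A value of 56.6 MFlops (for example 302.6 / 56.6 gives about 535%), so the sketch uses that as an assumption.

```python
from statistics import mean

# Geometric-mean MFlops of the Celeron 300A at 300 MHz, from the table.
BASE_GEOMETRIC_MFLOPS = 56.6


def average_runs(runs):
    """Average the per-run (geometric mean, average) MFlops pairs."""
    geometric = mean(run[0] for run in runs)
    average = mean(run[1] for run in runs)
    return geometric, average


def relative_speed(geometric_mflops):
    """Percent of the Celeron 300A geometric-mean rate (an assumption)."""
    return 100.0 * geometric_mflops / BASE_GEOMETRIC_MFLOPS


if __name__ == "__main__":
    # HYPOTHETICAL output of three runs on one machine, in MFlops.
    runs = [(100.0, 130.0), (104.0, 133.0), (102.0, 136.0)]
    geometric, average = average_runs(runs)
    print(f"<Geometric> {geometric:.1f} MFlops, <Average> {average:.1f} MFlops")
    print(f"Relative speed {relative_speed(geometric):.0f}%")
```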
I found it somewhat striking that the Livermore Loops results track the GSIM scores fairly well. The P4 does slightly worse in comparison to the Athlon in this test. The Intel compiler (icc) is supposed to give better results, but unfortunately the optimization disturbs the timing routines, making the results useless.
For raw speed in scientific computing, nothing so far seems to beat the
Athlon processors, at least until we get g77 to compile for the P4.
A dual Athlon performs very nicely, and there seems to be no penalty for
running the GSIM code on both processors simultaneously. The Pentium 4, though
much hyped by Intel, is an underperformer relative to clock speed. This is most likely due to the lack of
a specially optimized compiler that can compile the CERNLIB code.
Now that this compiler exists, the P4 becomes more attractive; however, it will require
recompiling all your code with the INTEL compiler. The large extra premium you pay
for the P4 (especially for dual P4 systems) may not be worth it yet.
(See Note below.)
As far as BANG FOR THE BUCK (BFB) goes, the general rule should be: take the performance
of the machine relative to a 300A, then divide by the
cost of the machine. Now choose the machine with the highest BFB.
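As a minimal sketch of this rule in Python: the relative speeds below are taken from the GSIM table, but the prices are HYPOTHETICAL placeholders that you would replace with real quotes for complete machines.

```python
def bang_for_buck(relative_speed_pct, cost):
    """Relative speed (percent of a Celeron 300A) per unit of cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return relative_speed_pct / cost


def best_buy(candidates):
    """Return the name of the machine with the highest BFB.

    candidates maps a machine name to (relative speed in %, cost).
    """
    return max(candidates, key=lambda name: bang_for_buck(*candidates[name]))


if __name__ == "__main__":
    # Relative speeds from the GSIM table; prices are made up.
    candidates = {
        "Athlon 1.3 GHz": (500.13, 1000.0),
        "P4 1.5 GHz": (382.3, 1200.0),
    }
    for name, (speed, cost) in candidates.items():
        print(f"{name:16s} BFB = {bang_for_buck(speed, cost):.3f}")
    print("Best:", best_buy(candidates))
```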
Note: Recently I saw a benchmark by a colleague
who runs a code for very large matrix inversion using complex numbers. This
application is very sensitive to memory access times. Using the Portland
f90 compiler, he achieved a 30% to 50% speed gain for a Pentium4 over a PentiumIII,
after correcting for the difference in clock speed. Thus, on the 1.4 GHz P4
he used, his code ran about 2.5 times faster than on a PIII. This is the
first real test where I've seen the P4 shine. Most likely this is due
to the RDRAM. He has not yet tried the code on an Athlon with DDR memory.