As part of the ongoing work of benchmarking different GPUs, functions, etc. for Torben's Corner, I am currently testing the floating-point performance of different measurement procedures and of different GPUs. I found some results that may be relevant to others as well. All the data is still preliminary - and all of it in single precision - but it might be useful anyway.

As for the theoretical GFLOPS number, the way to calculate it for most modern NVIDIA GPUs is (as far as I have seen):

GFLOPS_theoretical = [shader clock speed in GHz] * [# cores] * 3

The factor "3" comes from the following: the GPU can perform a multiply and an add in the same clock cycle, and it should also be able to issue another multiply alongside them - hence the factor of 3. As far as I have seen, that extra multiply is only rarely exploited by programmers, and it may be difficult to actually get the benefit from it. I am not sure whether this holds for the Fermi architecture, but it should be OK for the other GPUs mentioned below.
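As a sanity check on the formula, here is a small Python sketch (not from the original tests, which ran in MATLAB/Jacket). The 112 shader cores at 1.375 GHz for the GTX 260M are my assumed specs for that card, not numbers from this post; they reproduce the 462 GFLOPS figure quoted below.

```python
def theoretical_gflops(shader_clock_ghz, num_cores, ops_per_cycle=3):
    """GFLOPS = shader clock (GHz) * number of cores * ops per core per cycle."""
    return shader_clock_ghz * num_cores * ops_per_cycle

# Assumed GTX 260M specs: 112 cores at 1.375 GHz
print(theoretical_gflops(1.375, 112))  # 462.0
```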

I have tested various algorithms for estimating the GFLOPS count. I have used the standard SGEMM as well as various versions of multiple matrix multiplication. When measuring from MATLAB/Jacket, I have found that the results from the different algorithms do not vary much. To me, this indicates that Jacket is very robust.
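To illustrate the kind of measurement described above, here is a minimal Python/NumPy sketch that times a single-precision matrix multiply and converts the elapsed time to GFLOPS. It runs on the CPU and is only meant to show the bookkeeping; the actual benchmarks were done through MATLAB/Jacket on the GPU.

```python
import time
import numpy as np

def measure_sgemm_gflops(n, repeats=3):
    """Time an N x N single-precision matrix multiply; return best-case GFLOPS."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    flops = 2.0 * n ** 3          # ~2N^3 operations for an N x N multiply
    return flops / best / 1e9     # convert to GFLOPS

print(measure_sgemm_gflops(512))
```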

Some results are:

Asus G51J laptop

This has a powerful GTX260M with a theoretical peak performance of 462 GFLOPS. I measured a peak performance of 110 GFLOPS corresponding to 24% of the theoretical performance.

The Asus has a Core i7-720QM CPU with a theoretical peak performance of 25.6 GFLOPS. I measured 21.1 GFLOPS.

Sony Vaio laptop

The Sony has a GT330M GPU with a theoretical peak performance of 182 GFLOPS. I measured a peak performance of 67 GFLOPS corresponding to 37% of the theoretical performance.

The Sony also has a Core i7-720QM CPU, like the Asus G51J, with a theoretical peak performance of 25.6 GFLOPS. Here too I measured 21.1 GFLOPS.

Colfax CXT2000 #1

This Colfax has three FX3800 GPUs, each with a theoretical peak performance of 693 GFLOPS. I measured a peak performance of 254 GFLOPS. This is 37% of the theoretical performance.

The CPU in this computer is a Core i7-975 Extreme. It has a theoretical peak performance of around 50 GFLOPS. I measured 26.6 GFLOPS.

Colfax CXT2000 #2

This Colfax has one C1060 and one C2050.

The C1060 has a theoretical peak performance of 933 GFLOPS, of which I measured a peak performance of 343 GFLOPS. This is 37% of the theoretical performance.

The C2050 has a theoretical peak performance of 1030 GFLOPS, of which I measured a peak performance of 524 GFLOPS. This corresponds to 51% of the theoretical performance.

The CPU in this computer is also a Core i7-975 Extreme, as in the other Colfax, with a theoretical peak performance of around 50 GFLOPS. Here too I measured 26.6 GFLOPS.

Conclusions:

Without saying too much too early, I would say that the findings point in the following direction:

- The measured performance from Jacket/MATLAB is significantly below the theoretical peak performance. This could easily be because the theoretical peak is too optimistic - at least when I test as I do, where fetching data from and storing data in memory also plays a role. The overhead from MATLAB/Jacket is not yet clear to me, but I will look into that.
- On paper, the new Fermi architecture does not seem to be a big step forward in single-precision performance (from 933 to 1030 GFLOPS going from the C1060 to the C2050 - a 10% improvement). But in practice the C2050 seems much faster than the theoretical numbers indicate, beating the C1060 by more than 50%.
- Jacket seems to be a very robust performer. For example, the C2050 delivers more than 450 GFLOPS for all complexity values above 10 GFLOP, and above 400 GFLOPS from approx. 1 GFLOP complexity. Complexity here is approximately 2N^3, where N is the number of rows/columns of the square matrices involved. So a 1000x1000 matrix multiplication has a complexity of 2 GFLOP, and a matrix size of approx. 800x800 gives a complexity of approx. 1 GFLOP.
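The complexity arithmetic in the last bullet can be checked with a couple of lines of Python (a sketch of the 2N^3 rule stated above):

```python
def complexity_gflop(n):
    """Approximate cost of an N x N matrix multiply, in GFLOP (2*N^3 operations)."""
    return 2.0 * n ** 3 / 1e9

print(complexity_gflop(1000))  # 2.0 GFLOP, as stated above
print(complexity_gflop(800))   # 1.024 GFLOP, i.e. approx. 1 GFLOP
```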

If you are interested, you can follow the findings at http://wiki.accelereyes.com/wiki/index.php/Jacket_Floating_Point_Performance_(GFlops), where the C2050 results are shown (more to follow). The code is available, and it would be great if we could get results for other GPUs as well. The code consists of a master file (where you define file names, problem size, etc.) and a function file that performs the actual test.

BR Torben