- mobo: ASRock X58 SC
- cpu: Intel Core i7 920
- ram: Kingston DDR3 12288MB PC3-8500 1066MHz (6x2048)
- gpu: 2x Asus ENGTX295 Geforce GTX295 1792MB GDDR3
- psu: Tagan 1100W PipeRock
- hdd: 2x WD Caviar 1.5TB SATA300 Green Power 32MB
- case: Chieftec BX-02B-B-B (black) ATX
- net: Mellanox InfiniHost III Lx (10Gb/s) (borrowed from N. Copernicus Astronomical Centre), 2x 1Gb/s
While googling today I found the site of NCSA's GPU cluster, along with results of a standard test from the CUDA SDK:
../../bin/linux/release/reduction --kernel=5 --n=16384
Reducing array of type int.
Using Device 0: "Tesla C1060"
16384 elements
128 threads (max)
64 blocks
Average time: 0.025320 ms
Bandwidth: 2.588309 GB/s
which we can compare with Janus:
../../bin/linux/release/reduction --kernel=5 --n=16384
Reducing array of type int.
Using Device 0: "GeForce GTX 295"
16384 elements
128 threads (max)
64 blocks
Average time: 0.021630 ms
Bandwidth: 3.029865 GB/s
It's better! (This is the moment when we can give a big yay for Nehalem technology :] )
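As a quick sanity check, the two bandwidth figures above look consistent with simply dividing the size of the input array by the average time. The sketch below assumes 4-byte ints and 1 GB = 1e9 bytes. The helper name `reduction_bandwidth_gb_s` is made up for this post and is not part of the CUDA SDK.

```python
def reduction_bandwidth_gb_s(n_elements, avg_time_ms, bytes_per_element=4):
    """Effective bandwidth in GB/s, assuming every element is read once.

    Assumptions (not confirmed by the SDK output itself):
    ints are 4 bytes wide and 1 GB means 1e9 bytes.
    """
    total_bytes = n_elements * bytes_per_element
    seconds = avg_time_ms * 1e-3
    return total_bytes / seconds / 1e9


# Figures taken from the two benchmark runs quoted above.
tesla = reduction_bandwidth_gb_s(16384, 0.025320)
gtx295 = reduction_bandwidth_gb_s(16384, 0.021630)

print(f"Tesla C1060: {tesla:.4f} GB/s")
print(f"GTX 295:     {gtx295:.4f} GB/s")
print(f"GTX 295 / Tesla: {gtx295 / tesla:.3f}")
```

Both values land within rounding of what the benchmark printed, and the ratio works out to roughly a 17% advantage for the GTX 295 on this small test.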
If we compare sheer power: the Tesla is capable of 936 Gflops while drawing 180W under load, whereas the GTX295 delivers 2*894=1788 Gflops at 330W! Furthermore, the difference in price is enormous: the Tesla costs 1500€, while the GTX295 costs 400€.
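To put those numbers side by side, here is a small sketch that turns them into Gflops per watt and Gflops per euro. It uses only the peak figures, power draws and prices quoted above. The dictionary layout and the `efficiency` helper are just illustrative names.

```python
# Peak Gflops, power under load (W) and price (EUR) as quoted in this post.
cards = {
    "Tesla C1060": {"gflops": 936, "watts": 180, "price_eur": 1500},
    "GTX 295": {"gflops": 2 * 894, "watts": 330, "price_eur": 400},
}


def efficiency(card):
    """Return (Gflops per watt, Gflops per euro) for one card entry."""
    return card["gflops"] / card["watts"], card["gflops"] / card["price_eur"]


for name, card in cards.items():
    per_watt, per_eur = efficiency(card)
    print(f"{name}: {per_watt:.2f} Gflops/W, {per_eur:.3f} Gflops/EUR")
```

With these inputs the two cards come out close on Gflops per watt (about 5.2 vs 5.4), so the real gap is in Gflops per euro, where the GTX295 is ahead by roughly a factor of seven. These are peak numbers, so real codes will of course see less.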
I'm slowly beginning to wonder why people use the Tesla C1060 at all. Maybe because it's easier to program a single GPU card with lots of memory (the Tesla has 4GB of GDDR3) than to put a little effort into developing MPI+CUDA codes... Time will tell which strategy prevails.