Xarthisius' blog: Low-end vs High-End

Over the past few days I've been configuring and playing with my new toy, Janus (all our clusters bear the names of mythological polycephalous creatures). Janus will serve as a test bed and development platform for our MPI+CUDA codes. It consists of two nodes, each equipped with:

mobo: ASRock X58 SC
cpu: Intel Core i7 920
ram: Kingston DDR3 12288MB PC3-8500 1066MHz (6x2048)
gpu: 2x Asus ENGTX295 GeForce GTX295 1792MB GDDR3
psu: Tagan 1100W PipeRock
hdd: 2x WD Caviar 1.5TB SATA300 Green Power 32MB
case: Chieftec BX-02B-B-B (black) ATX
net: Mellanox InfiniHost III Lx (10Gb/s) (borrowed from N. Copernicus Astronomical Centre), 2x 1Gb/s

As you can see, these are mostly low-end parts (in HPC terms), built with gamers, not computing, in mind. The main advantage is of course the price: 2100€ per node, which is fairly cheap even if it comes with worse performance... So the question is: is Janus much worse than a high-end solution? Amazingly, the answer is: not by much.
While googling today I found the site of NCSA's GPU cluster, along with results of a standard test from the CUDA SDK:

../../bin/linux/release/reduction --kernel=5 --n=16384
Reducing array of type  int.
Using Device 0:  "Tesla C1060"
16384 elements
128 threads (max)
64 blocks
Average time: 0.025320 ms
Bandwidth:    2.588309 GB/s

which we can compare with Janus:

../../bin/linux/release/reduction --kernel=5 --n=16384
Reducing array of type int.
Using Device 0: "GeForce GTX 295"
16384 elements
128 threads (max)
64 blocks
Average time: 0.021630 ms
Bandwidth:    3.029865 GB/s

It's better! (This is the moment when we can give a big yay for Nehalem technology :] )
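
The bandwidth line in both runs seems to be nothing more than bytes read divided by average time, so it can be cross-checked by hand. Here is a minimal sketch in Python, assuming the sample's "int" is 4 bytes and that 1 GB means 1e9 bytes (neither is stated in the output; the function and variable names are mine):

```python
# Numbers copied from the two reduction runs quoted above.
ELEMENTS = 16384
BYTES_PER_INT = 4  # assumption: 32-bit int


def bandwidth_gb_per_s(elements, avg_time_ms, bytes_per_element=BYTES_PER_INT):
    """Bytes read divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    seconds = avg_time_ms * 1e-3
    return elements * bytes_per_element / seconds / 1e9


tesla = bandwidth_gb_per_s(ELEMENTS, 0.025320)   # Tesla C1060 run
gtx295 = bandwidth_gb_per_s(ELEMENTS, 0.021630)  # GeForce GTX 295 run

print(f"Tesla C1060: {tesla:.6f} GB/s")
print(f"GTX 295:     {gtx295:.6f} GB/s")
print(f"GTX 295 / Tesla: {gtx295 / tesla:.3f}")
```

Under those assumptions the computed values agree with the reported ones to within the rounding of the printed times, and the GTX 295 run comes out roughly 17% faster. Keep in mind that an array of 16384 ints is only 64 KiB, so this test probably says more about launch and transfer overhead than about peak memory bandwidth.
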
If we compare sheer power: the Tesla is capable of 936 Gflops while drawing 180W under load, whereas the GTX295 delivers 2*894=1788 Gflops at 330W! Furthermore, the difference in price is enormous: the Tesla costs 1500€, the GTX295 only 400€.
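
To put the numbers above on a common footing, here is a small Python sketch that turns them into Gflops per watt and Gflops per euro. It uses only the figures quoted in this post (peak Gflops, power under load, card price); the dictionary layout and function names are just my own illustration:

```python
# Figures as quoted in the post: peak Gflops, watts under load, price in EUR.
cards = {
    "Tesla C1060": {"gflops": 936.0, "watts": 180.0, "eur": 1500.0},
    "GTX295": {"gflops": 2 * 894.0, "watts": 330.0, "eur": 400.0},
}


def gflops_per_watt(card):
    """Peak throughput per watt of power drawn under load."""
    return card["gflops"] / card["watts"]


def gflops_per_eur(card):
    """Peak throughput per euro of purchase price."""
    return card["gflops"] / card["eur"]


for name, card in cards.items():
    print(f"{name}: {gflops_per_watt(card):.2f} Gflops/W, "
          f"{gflops_per_eur(card):.2f} Gflops/EUR")
```

With these inputs the two cards are nearly tied on power efficiency (about 5.2 vs 5.4 Gflops/W), while the GTX295 offers roughly seven times more Gflops per euro. These are peak figures, so real codes will see less on both cards.
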
I'm slowly beginning to wonder why people use the Tesla C1060 at all. Maybe because it's easier to program a single GPU card with lots of memory (the Tesla has 4GB of GDDR3) than to put a little effort into developing MPI+CUDA codes... Time will tell which strategy prevails.