Parallel performance of VASP

 


The above graph gives a very approximate measure of the scalability of VASP as a function of the number of nodes, N.  The test system was a 192-atom bulk Al cell, using 388 bands and 82,944 plane waves.  K-point sampling was restricted to the Gamma point.  These tests used the ScaLAPACK parallel linear algebra package.  The wall-clock run time to reach SCF convergence for the serial job was slightly less than 2 hours.

The Speed-up is defined as the serial time divided by the parallel time (which is a function of N).

The Efficiency is the Speed-up divided by N.
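As a minimal sketch of the two definitions above, the following Python snippet computes the Speed-up and the Efficiency from a pair of wall-clock times.  The function names and the example timings are hypothetical and chosen only for illustration; they are not measurements from these tests.

```python
def speedup(t_serial, t_parallel):
    """Speed-up: serial wall-clock time divided by parallel wall-clock time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_nodes):
    """Efficiency: Speed-up divided by the number of nodes N."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return speedup(t_serial, t_parallel) / n_nodes


# Hypothetical timings in seconds (NOT taken from the benchmark runs).
t_serial = 1000.0
t_parallel = 200.0
n_nodes = 8

print(f"speed-up   = {speedup(t_serial, t_parallel):.2f}")
print(f"efficiency = {efficiency(t_serial, t_parallel, n_nodes):.3f}")
```

An Efficiency of 1.0 would correspond to ideal (linear) scaling; values below 1.0 indicate that some of the added nodes' capacity is lost to communication or other parallel overhead.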

These results should be considered only as a rough guide for choosing job sizes, and do not necessarily reflect the optimal performance of the code.  (Your mileage may vary.)  In particular, since the Origin uses a physically distributed but logically shared memory architecture (cc-NUMA), run times for identical jobs can vary depending on how a given data set is distributed across the nodes.  Unfortunately, I haven't had the time to investigate this effect further, but I have seen instances where run times differed by as much as 40%.

 


 

Comparison of VASP Performance:

Alliance RoadRunner Linux Cluster vs. NCSA Origin2000

 

The table below compares the run times (in seconds) obtained for the 192-atom Al cell (also used above) on the Origin2000 and the RoadRunner Linux Cluster for three different job sizes.  The percentage by which each RoadRunner run time exceeded that of the corresponding Origin job is given in parentheses.
 
 

              Number of Nodes
              4              8               32
Origin2000    2454           1160            463
RoadRunner    2509 (2.2%)    1422 (22.6%)    562 (21.4%)
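The percentages in the table can be reproduced from the run times with a few lines of Python.  This is only a sketch: the helper names are hypothetical, and the second helper measures scaling relative to the 4-node run rather than to a serial run (no serial time is given for this comparison), so it is not the same quantity as the Efficiency defined earlier.

```python
# Run times in seconds, copied from the table above, keyed by number of nodes.
origin = {4: 2454, 8: 1160, 32: 463}
roadrunner = {4: 2509, 8: 1422, 32: 562}


def percent_slower(t_cluster, t_origin):
    """Percentage by which the cluster run time exceeds the Origin run time."""
    return 100.0 * (t_cluster - t_origin) / t_origin


def relative_efficiency(times, base_nodes, n_nodes):
    """Scaling relative to the base_nodes run (assumption: no serial baseline).

    Computed as the observed speed-up over the base run divided by the
    ideal speed-up n_nodes / base_nodes.
    """
    observed = times[base_nodes] / times[n_nodes]
    ideal = n_nodes / base_nodes
    return observed / ideal


for n in sorted(origin):
    pct = percent_slower(roadrunner[n], origin[n])
    print(f"{n:2d} nodes: RoadRunner slower by {pct:.1f}%")

for name, times in (("Origin2000", origin), ("RoadRunner", roadrunner)):
    for n in (8, 32):
        rel = relative_efficiency(times, 4, n)
        print(f"{name}: 4 -> {n} nodes, relative efficiency {rel:.2f}")
```

Looking at relative scaling from the smallest job size is a rough way to separate how well each machine scales from how fast it is at a given node count.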

 

Notes:
- I was unable to get VASP to compile with ScaLAPACK on the RoadRunner.  Since I'm told that this will degrade performance for jobs using 8 or more nodes, it's possible that the difference in performance between these two machines is mainly due to ScaLAPACK issues.
- For this particular model, VASP crashed when I attempted to run it on 1, 2, and 16 nodes, which is puzzling since it seems to work fine on 4, 8, and 32 nodes.  For a smaller model using 50 Cu atoms, the code ran correctly on 1, 2, 4, and 8 nodes before crashing at 16.
- Until the above problems are resolved, it's difficult to endorse use of the cluster for large production-scale work.  However, for smaller jobs (where use of ScaLAPACK is not crucial) the cluster is nearly as fast as the Origin (and much less expensive).