wiki:intern/HELIOS

Benchmarks on Helios showed strange behaviour. Some information and parameters:

Machine:

  • Each node consists of 16 cores with 58 GB available memory.
  • The total system is composed of 4410 nodes, with a peak performance of 1.52 PF and 256 TB of available memory.
  • The interconnection network is InfiniBand QDR, non-blocking.
  • The topology is fat-tree and the connection socket is PCI Express gen3, bi-directional.

Helios hardware

Jobscript

relevant parts of the jobscript (example for the job shown in bold in the tables below):
#SBATCH -N 16 # number of nodes
#SBATCH -n 32 # number of tasks
#SBATCH -c 8 # number of cores per task
...
NP=${SLURM_NTASKS} # the total number of tasks
...
export OMP_NUM_THREADS=8 # number of threads
export KMP_AFFINITY=compact # thread-to-core binding of the Intel OpenMP runtime
export KMP_STACKSIZE=1G # per-thread stack size of the Intel OpenMP runtime
...
mpirun -np ${NP} ./${BIN}.local params
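
For a quick sanity check of the layout, the following lines can be added to the jobscript (a sketch; only standard SLURM environment variables are used, CORES_TOTAL and TASKS_PER_NODE are illustrative names). With -N 16, -n 32 and -c 8, each node runs 32/16 = 2 tasks of 8 cores, so all 16 cores of a node are in use:

echo "nodes: ${SLURM_NNODES}, tasks: ${SLURM_NTASKS}, cores per task: ${SLURM_CPUS_PER_TASK}"
CORES_TOTAL=$(( SLURM_NTASKS * SLURM_CPUS_PER_TASK ))   # 32 * 8 = 256 cores for the bold job
TASKS_PER_NODE=$(( SLURM_NTASKS / SLURM_NNODES ))       # 32 / 16 = 2 ranks per node
echo "total cores: ${CORES_TOTAL}, ranks per node: ${TASKS_PER_NODE}"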

tests with pepc-mini: runtimes in s, examples/mini-tube, filtering off, all output and diagnostics turned off

compiler: intel/13.0.079 mpi: bullxmpi/1.1.16.2 (also tested with intelmpi/4.1 and intelmpi/4.0.3)
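
(the compiler and MPI are given as module names; presumably the environment was set up along these lines before building and running, exact module names on Helios may differ)

module load intel/13.0.079
module load bullxmpi/1.1.16.2       # alternatively intelmpi/4.1 or intelmpi/4.0.3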

320,000 particles

nodes:                 2     4     8    16    32
num_walk_threads=2   518   266   156   103     -
num_walk_threads=4   266   188   158   128     -
num_walk_threads=6   186   180   184   158     -
num_walk_threads=7   157   172   180   159     -

640,000 particles

nodes:                  2      4      8     16     32
num_walk_threads=2  ~1660   ~700    350    201    126
num_walk_threads=4   ~650    396    227    191    151
num_walk_threads=6    456    358    251    241    158
num_walk_threads=7    389    374    265    247    166

1,280,000 particles

nodes:                  2      4      8     16     32
num_walk_threads=2      -  ~2000  ~1070    490    246
num_walk_threads=4      -   ~970    530    318    215
num_walk_threads=6  ~1850   ~640    460    300    237
num_walk_threads=7  ~1580    579    414    313    250

tests with pepc-f and pepc-mini, num_walk_threads=7, 2 ranks per node

1,280,000 particles

nodes:                                                       2       4       8      16      32      64     128
pepc-mini (compare to runs with 1280000 part. above)         -     588     407     315     246     155     125
pepc-f, no wall, no periodic bc                              -   ~3300     288      54      39      61      69
pepc-f, no wall, periodic bc, 2 mirror layers                -  ~27500    ~681     266     126      98      97
pepc-f, wall, no periodic bc                               585     467     149      40      37      62      67
pepc-f, wall, periodic bc, 2 mirror layers               ~2500   ~1500     569     144     117      95      97
pepc-f, wall, periodic bc, 2 mirror layers, hpcff            -   ~1600     625     106      65      56       -

total runtime of the pepc-f runs with wall, periodic bc and 1280000 particles

This showed reasonable results, at least for node counts below 64.

50 steps of a typical pepc-f production run (2,250,000 particles) showed the following:

nodes:        2      4      8     16     32     64    128
helios        -    620    392  ~2427    148    122     95
hpcff         -    722    419  ~2504    100     75      -

strong scaling for hpcff and helios: runtime per timestep for different numbers of nodes on helios and hpcff, typical production case with pepc-f

Further tests showed that the problem appears for the chosen particle configuration with 32 MPI ranks on both machines. See the attached plot.
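
To reproduce the 32-rank case, the jobscript above can be reused unchanged except for the input; a minimal sketch (binary and parameter file names are placeholders for the actual pepc-f production setup):

#SBATCH -N 16                 # 16 nodes with 2 ranks per node ...
#SBATCH -n 32                 # ... give the 32 MPI ranks in question
#SBATCH -c 8
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact
export KMP_STACKSIZE=1G
mpirun -np ${SLURM_NTASKS} ./pepc-f.local params_production   # placeholder names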

With 32 MPI ranks, the runtime goes up after 16 steps with my production starting configuration

tests with pepc-f, runtimes in s, 20 timesteps

compiler: intel/13.0.079 mpi: intelmpi/4.1
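
The two per-node layouts below differ only in the task/core split and the thread counts. A minimal sketch of the corresponding jobscript lines, extrapolated from the jobscript above (node counts are examples; how num_walk_threads reaches pepc-f is assumed to happen via the params file and is not shown):

# layout 1: 1 MPI task per node (example for 8 nodes)
#SBATCH -N 8
#SBATCH -n 8                  # one rank per node
#SBATCH -c 16                 # a rank gets all 16 cores of its node
export OMP_NUM_THREADS=16     # the first table uses num_walk_threads=15

# layout 2: 2 MPI tasks per node (as in the jobscript above)
#SBATCH -N 8
#SBATCH -n 16                 # two ranks per node
#SBATCH -c 8                  # 8 cores per rank
export OMP_NUM_THREADS=8      # the second table uses num_walk_threads=7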

1 MPI task per node, OMP_NUM_THREADS=16, num_walk_threads=15

nodes:              1      2      4      8
10000 part.         6      -      -      -
20000 part.         7      -      -      -
40000 part.        12      -      -      -
80000 part.        21      -      -      -
160000 part.       42      -      -      -
320000 part.       84     49     27     17
640000 part.      171    107    336    359
1280000 part.     375    353  ~6000  ~5000

2 MPI tasks per node, OMP_NUM_THREADS=8, num_walk_threads=7

nodes:               1       2       4       8
10000 part.          4       -       -       -
20000 part.          6       -       -       -
40000 part.         11       -       -       -
80000 part.         19       -       -       -
160000 part.        36       -       -       -
320000 part.        77      41      25      17
640000 part.       178    ~740    ~920     134
1280000 part.     ~600  >12000  ~11000     414