Benchmarks on Helios showed strange behaviour. Some information and parameters:
Machine:
- Each node consists of 16 cores with 58 GB of available memory.
- The total system comprises 4410 nodes, with a peak performance of 1.52 PF and 256 TB of total memory.
- The interconnection network is InfiniBand QDR, non-blocking.
- The topology is fat-tree; the host connection is PCI Express gen3, bi-directional.
Jobscript
Relevant parts of the jobscript (example for the bold job in the tables):
#SBATCH -N 16 # number of nodes
#SBATCH -n 32 # number of tasks
#SBATCH -c 8 # number of cores per task
...
NP=${SLURM_NTASKS} # the total number of tasks
...
export OMP_NUM_THREADS=8 # number of threads
export KMP_AFFINITY=compact # Intel OpenMP runtime: pack threads onto adjacent cores
export KMP_STACKSIZE=1G # OpenMP per-thread stack size
...
mpirun -np ${NP} ./${BIN}.local params
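As a quick sanity check on the layout above (an illustrative Python sketch, not part of the jobscript), the SLURM parameters imply 2 MPI ranks per node, whose 8 OpenMP threads each exactly fill the 16 cores of a Helios node:

```python
# Sanity-check the SLURM/OpenMP layout from the jobscript above (illustrative sketch).
nodes = 16          # SBATCH -N
ntasks = 32         # SBATCH -n
cpus_per_task = 8   # SBATCH -c, matches OMP_NUM_THREADS
cores_per_node = 16 # Helios node size (see machine description above)

tasks_per_node = ntasks // nodes
print(tasks_per_node)                  # 2 MPI ranks per node
print(tasks_per_node * cpus_per_task)  # 16 threads per node: node fully occupied
assert tasks_per_node * cpus_per_task <= cores_per_node
```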
Tests with pepc-mini: runtimes in s, examples/mini-tube, filtering off, all output and diagnostics turned off.
compiler: intel/13.0.079
mpi: bullxmpi/1.1.16.2 (also tested with intelmpi/4.1 and intelmpi/4.0.3)
320,000 particles
nodes:             |   2 |   4 |   8 |  16 |  32
num_walk_threads=2 | 518 | 266 | 156 | 103 |   -
num_walk_threads=4 | 266 | 188 | 158 | 128 |   -
num_walk_threads=6 | 186 | 180 | 184 | 158 |   -
num_walk_threads=7 | 157 | 172 | 180 | 159 |   -
640,000 particles
nodes:             |     2 |    4 |   8 |  16 |  32
num_walk_threads=2 | ~1660 | ~700 | 350 | 201 | 126
num_walk_threads=4 |  ~650 |  396 | 227 | 191 | 151
num_walk_threads=6 |   456 |  358 | 251 | 241 | 158
num_walk_threads=7 |   389 |  374 | 265 | 247 | 166
1,280,000 particles
nodes:             |     2 |     4 |     8 |  16 |  32
num_walk_threads=2 |     - | ~2000 | ~1070 | 490 | 246
num_walk_threads=4 |     - |  ~970 |   530 | 318 | 215
num_walk_threads=6 | ~1850 |  ~640 |   460 | 300 | 237
num_walk_threads=7 | ~1580 |   579 |   414 | 313 | 250
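The node-count dependence above can be condensed into parallel efficiencies. A small Python sketch, using the num_walk_threads=2 row of the 320,000-particle table relative to the 2-node run:

```python
# Strong-scaling efficiency relative to the 2-node run,
# for the num_walk_threads=2 row of the 320,000-particle table.
times = {2: 518, 4: 266, 8: 156, 16: 103}  # nodes -> runtime in s

base_nodes = 2
for n in sorted(times):
    speedup = times[base_nodes] / times[n]
    efficiency = speedup / (n / base_nodes)
    print(f"{n:3d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
# Efficiency drops from ~0.97 at 4 nodes to ~0.63 at 16 nodes.
```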
Tests with pepc-f and pepc-mini, num_walk_threads=7, 2 ranks per node
1,280,000 particles
nodes:                                               |     2 |      4 |    8 |  16 |  32 |  64 | 128
pepc-mini (compare to the 1,280,000-part. runs above)|     - |    588 |  407 | 315 | 246 | 155 | 125
pepc-f, no wall, no periodic bc                      |     - |  ~3300 |  288 |  54 |  39 |  61 |  69
pepc-f, no wall, periodic bc, 2 mirror layers        |     - | ~27500 | ~681 | 266 | 126 |  98 |  97
pepc-f, wall, no periodic bc                         |   585 |    467 |  149 |  40 |  37 |  62 |  67
pepc-f, wall, periodic bc, 2 mirror layers           | ~2500 |  ~1500 |  569 | 144 | 117 |  95 |  97
pepc-f, wall, periodic bc, 2 mirror layers, hpcff    |     - |  ~1600 |  625 | 106 |  65 |  56 |   -
This showed reasonable results, at least for node counts below 64.
50 steps of a typical pepc-f production run (2,250,000 particles) showed the following:
nodes:  |   2 |   4 |   8 |    16 |  32 |  64 | 128
helios  |   - | 620 | 392 | ~2427 | 148 | 122 |  95
hpcff   |   - | 722 | 419 | ~2504 | 100 |  75 |   -
Further tests showed that the problem appears for the chosen particle configuration with 32 MPI ranks on both machines. See the attached plot.
Tests with pepc-f: runtimes in s, 20 timesteps.
compiler: intel/13.0.079
mpi: intelmpi/4.1
1 MPI task per node, OMP_NUM_THREADS=16, num_walk_threads=15
nodes:        |    1 |   2 |     4 |     8
10000 part.   |    6 |   - |     - |     -
20000 part.   |    7 |   - |     - |     -
40000 part.   |   12 |   - |     - |     -
80000 part.   |   21 |   - |     - |     -
160000 part.  |   42 |   - |     - |     -
320000 part.  |   84 |  49 |    27 |    17
640000 part.  |  171 | 107 |   336 |   359
1280000 part. |  375 | 353 | ~6000 | ~5000
2 MPI tasks per node, OMP_NUM_THREADS=8, num_walk_threads=7
nodes:        |    1 |      2 |      4 |   8
10000 part.   |    4 |      - |      - |   -
20000 part.   |    6 |      - |      - |   -
40000 part.   |   11 |      - |      - |   -
80000 part.   |   19 |      - |      - |   -
160000 part.  |   36 |      - |      - |   -
320000 part.  |   77 |     41 |     25 |  17
640000 part.  |  178 |   ~740 |   ~920 | 134
1280000 part. | ~600 | >12000 | ~11000 | 414
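The anomaly is easy to flag automatically: for a fixed problem size, the runtime should decrease monotonically as nodes are added. A small Python sketch over the "1 MPI task per node" table above (approximate "~" values entered as plain numbers for illustration):

```python
# Flag non-monotonic strong scaling: runtime should not grow with node count.
# Data from the "1 MPI task per node" table; '~' values entered as plain numbers.
runs = {
    320000:  {1: 84,  2: 49,  4: 27,   8: 17},
    640000:  {1: 171, 2: 107, 4: 336,  8: 359},
    1280000: {1: 375, 2: 353, 4: 6000, 8: 5000},
}

for particles, times in runs.items():
    ns = sorted(times)
    # anomalous if any larger run is slower than the next-smaller one
    anomalous = any(times[a] < times[b] for a, b in zip(ns, ns[1:]))
    print(f"{particles:8d} particles: {'ANOMALOUS' if anomalous else 'ok'}")
# 320000 scales cleanly; 640000 and 1280000 blow up beyond 2 nodes.
```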