Small system ranks/threads optimisation
2D sheath with 2M particles, theta=0.55, 10 timesteps (loop+diags value)
#nodes | #ranks | #threads/rank | wallclock (s) |
32 | 32 | 1 | 949 |
32 | 32 | 4 | 246 |
32 | 32 | 16 | 79 |
32 | 64 | 16 | 37 |
32 | 128 | 1 | 173 |
32 | 128 | 4 | 50 |
32 | 128 | 8 | 31 |
32 | 256 | 1 | 97 |
PEPC Architekturvergleich (architecture comparison) by Lukas
Google Docs; a copy is also included in the attachments.
Weak scaling test
2D Kelvin-Helmholtz benchmark, 4000 timesteps, no particle I/O, 400x400 gridded data every 100 dt. 2 ranks/node, 16 walk-threads
N (10^6) | #ranks | #tasks (ranks × threads) | wallclock (h) | per timestep (s) |
40 | 2048 | 32768 | 3.1 | 2.8 |
80 | 4096 | 65536 | 3.8 | 3.4 |
160 | 8192 | 131072 | 5.44 | 4.9 |
320 | 8192 (a) | 262144 | - | 10 |
(a) Whole-machine test on 28.8.: 1 rank/node, 32 threads, 100 dt
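For orientation, the per-timestep column appears to be simply the total wallclock divided by the 4000 timesteps; e.g. for the first row, 3.1 h × 3600 s/h / 4000 ≈ 2.8 s.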
Towards higher numbers of worker threads with atomic operations
b.steinbusch, 28.9.2012
The following figure (strong_scaling_locks.png in the attachments) shows a strong scaling of the 2D sheath setup for pepc-b using 32 Juqueen nodes, no I/O. Each node ran one MPI process while the number of worker threads was increased from 4 to 63. The individual curves have the following meaning: total is the execution time as reported by the frontend, t_fields_tree is the single-threaded tree build phase, and t_fields_passes is the tree traversal; together, the last two constitute the total time spent in the treecode. The tree traversal is further subdivided into t_process_particle and t_walk_single_particle. With an increasing number of threads, more and more time is spent in the process_particle routine, which essentially acquires a lock, increments a global counter, and releases the lock again. The walk_single_particle routine, on the other hand, does the numerical work, evaluating the MAC and the interactions.
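As a minimal sketch of the pattern described above (not PEPC's actual source; the counter name and locking details are illustrative, assuming POSIX threads):

{{{
#!c
#include <pthread.h>

/* Illustrative only: PEPC's real bookkeeping differs. Every worker
 * thread funnels through this one mutex for each finished particle,
 * so the critical section becomes a serialisation point as the
 * thread count grows. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int particles_processed = 0;

void process_particle(void)
{
    pthread_mutex_lock(&counter_lock);
    particles_processed++;              /* global progress counter */
    pthread_mutex_unlock(&counter_lock);
}
}}}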
As of changeset r3432, the process_particle routine has been changed to increment the counter using atomic operations provided by the OpenPA library; this way, the lock is no longer needed. The next figure (strong_scaling_opa.png in the attachments) shows the same strong scaling with the changes applied. While the execution time is nearly unchanged for four or eight threads, teams of 16 and more threads perform significantly better. At 32 threads, every core of the BG/Q compute chip is running two threads, and very little is gained by increasing the number of threads further. Problems seem to arise when using 63 worker threads; first results point to the communicator thread, which might not get to run often enough. For the moment, this number of threads should be avoided.
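For comparison, a minimal sketch of the lock-free variant (again illustrative; OPA_int_t, OPA_store_int, OPA_incr_int and OPA_load_int are actual OpenPA primitives, but the surrounding names are not taken from PEPC):

{{{
#!c
#include "opa_primitives.h"

/* The counter is now an OpenPA atomic integer; the mutex disappears
 * and with it the serialisation point. */
static OPA_int_t particles_processed;

void init_counter(void)
{
    OPA_store_int(&particles_processed, 0);     /* explicit initialisation */
}

void process_particle(void)
{
    OPA_incr_int(&particles_processed);         /* atomic increment, no lock */
}

int particles_processed_so_far(void)
{
    return OPA_load_int(&particles_processed);  /* atomic read */
}
}}}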
Attachments (6)
- PEPC Architekturvergleich(1).xls (39.0 KB)
- PEPC Architekturvergleich.pdf (82.6 KB)
- pepc-p-vs-q.pdf (213.1 KB) - P vs Q comparison - Scicomp12
- pepc-p-vs-q.png (105.0 KB) - P vs Q comparison - Scicomp12
- strong_scaling_locks.png (69.5 KB)
- strong_scaling_opa.png (59.7 KB)