
Small system ranks/threads optimisation

2D sheath with 2M particles, theta = 0.55, 10 timesteps (timing includes loop + diagnostics)

#nodes  #ranks  #threads/rank  wallclock (s)
32      32      1              949
32      32      4              246
32      32      16             79
32      64      16             37
32      128     1              173
32      128     4              50
32      128     8              31
32      256     1              97

PEPC architecture comparison by Lukas

Google Docs; a copy is also included in the attachments.

Summary (PNG or PDF)

Weak scaling test

2D Kelvin-Helmholtz benchmark, 4000 timesteps, no particle I/O, 400x400 gridded data written every 100 timesteps; 2 ranks/node, 16 walk threads.

N (10^6)  #ranks    #tasks   wallclock (h)  per timestep (s)
40        2048      32768    3.1            2.8
80        4096      65536    3.8            3.4
160       8192      131072   5.44           4.9
320       8192 (a)  262144   -              10

(a) Whole-machine test on 28.8.: 1 rank/node, 32 threads, 100 timesteps.

Towards higher numbers of worker threads with atomic operations

b.steinbusch, 28.9.2012

The following figure shows a strong scaling of the 2D sheath setup for pepc-b on 32 Juqueen nodes, with no I/O. Each node ran one MPI process while the number of worker threads was increased from 4 to 63. The individual curves are: total, the execution time as reported by the frontend; t_fields_tree, the single-threaded tree build phase; and t_fields_passes, the tree traversal. Together, the last two make up the total time spent in the treecode. The tree traversal is further subdivided into t_process_particle and t_walk_single_particle. With an increasing number of threads, more and more time is spent in the process_particle routine, which essentially acquires a lock, increments a global counter and releases the lock again. The walk_single_particle routine, on the other hand, does the numerical work: evaluating the MAC and computing the interactions.
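The contention can be illustrated with a minimal C sketch of the lock-protected counter pattern described above. The names (counter_lock, particles_processed, process_particle) are hypothetical and do not match PEPC's actual Fortran identifiers; the sketch only shows the structure of the bottleneck.

{{{#!c
#include <pthread.h>

/* Hypothetical sketch, not PEPC code: a shared counter guarded by a mutex. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int particles_processed = 0;

void process_particle(void)
{
    /* Every worker thread serialises on this lock just to bump the shared
     * counter; at high thread counts this serialisation dominates. */
    pthread_mutex_lock(&counter_lock);
    particles_processed++;
    pthread_mutex_unlock(&counter_lock);
}
}}}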

As of changeset r3432, the process_particle routine increments the counter using atomic operations provided by the OpenPA library, so the lock is no longer needed. The next figure shows the same strong scaling with these changes applied. While the execution time is nearly unchanged for four or eight threads, teams of 16 and more threads perform significantly better. At 32 threads, every core of the BG/Q compute chip runs two threads, and very little is gained by increasing the number of threads further. Problems arise when using 63 worker threads; first results point to the communicator thread, which might not get to run often enough. For the moment, this thread count should be avoided.
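For comparison, here is a minimal sketch of the lock-free variant using OpenPA. Only the OpenPA calls (OPA_int_t, OPA_store_int, OPA_fetch_and_incr_int) are real library API; the surrounding names are illustrative assumptions, not the r3432 code itself.

{{{#!c
#include <opa_primitives.h>

/* Hypothetical sketch, not PEPC code: the shared counter is an OpenPA
 * atomic integer and is incremented without taking a lock. */
static OPA_int_t particles_processed;

void init_counters(void)
{
    /* Initialise once, before the worker threads start the walk. */
    OPA_store_int(&particles_processed, 0);
}

void process_particle(void)
{
    /* Atomic fetch-and-increment replaces the lock/increment/unlock
     * sequence, so worker threads no longer serialise here. */
    OPA_fetch_and_incr_int(&particles_processed);
}
}}}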
