
Small system ranks/threads optimisation

2D sheath with 2M particles, theta = 0.55, 10 timesteps (timing includes loop + diagnostics)

#nodes  #ranks  #threads/rank  wallclock (s)
32      32      1              949
32      32      4              246
32      32      16             79
32      64      16             37
32      128     1              173
32      128     4              50
32      128     8              31
32      256     1              97

PEPC architecture comparison by Lukas

Google Docs; a copy is also included in the attachments.

Summary (PNG or PDF)

Weak scaling test

2D Kelvin-Helmholtz benchmark, 4000 timesteps, no particle I/O, 400x400 gridded data written every 100 timesteps; 2 ranks/node, 16 walk threads.

N (10^6)  #ranks    #tasks   wallclock (h)  per timestep (s)
40        2048      32768    3.1            2.8
80        4096      65536    3.8            3.4
160       8192      131072   5.44           4.9
320       8192 (a)  262144   -              10

(a) Whole-machine test on 28.8.: 1 rank/node, 32 threads, 100 timesteps.

Towards higher numbers of worker threads with atomic operations

b.steinbusch, 28.9.2012

The following figure shows a strong scaling of the 2D sheath setup for pepc-b on 32 Juqueen nodes, with no I/O. Each node ran one MPI process while the number of worker threads was increased from 4 to 63. The individual curves are: total, the execution time as reported by the frontend; t_fields_tree, the single-threaded tree build phase; and t_fields_passes, the tree traversal. Together, the last two make up the total time spent in the treecode. The tree traversal is further subdivided into t_process_particle and t_walk_single_particle. With an increasing number of threads, more and more time is spent in the process_particle routine, which essentially acquires a lock, increments a global counter and releases the lock again. The walk_single_particle routine, on the other hand, does the numerical work: evaluating the MAC and computing the interactions.
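The contention can be illustrated with a minimal C sketch of the lock-protected counter pattern described above. The names (counter_lock, particles_processed, process_particle) are hypothetical and do not match PEPC's actual Fortran identifiers; the sketch only shows the structure of the bottleneck.

{{{#!c
#include <pthread.h>

/* Hypothetical sketch, not PEPC code: a shared counter guarded by a mutex. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int particles_processed = 0;

void process_particle(void)
{
    /* Every worker thread serialises on this lock just to bump the shared
     * counter; at high thread counts this serialisation dominates. */
    pthread_mutex_lock(&counter_lock);
    particles_processed++;
    pthread_mutex_unlock(&counter_lock);
}
}}}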

As of changeset r3432, the process_particle routine increments the counter using atomic operations provided by the OpenPA library, so the lock is no longer needed. The next figure shows the same strong scaling with these changes applied. While the execution time is nearly unchanged for four or eight threads, teams of 16 and more threads perform significantly better. At 32 threads, every core of the BG/Q compute chip runs two threads, and very little is gained by increasing the number of threads further. Problems arise when using 63 worker threads; first results point to the communicator thread, which might not get to run often enough. For the moment, this thread count should be avoided.
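For comparison, here is a minimal sketch of the lock-free variant using OpenPA. Only the OpenPA calls (OPA_int_t, OPA_store_int, OPA_fetch_and_incr_int) are real library API; the surrounding names are illustrative assumptions, not the r3432 code itself.

{{{#!c
#include <opa_primitives.h>

/* Hypothetical sketch, not PEPC code: the shared counter is an OpenPA
 * atomic integer and is incremented without taking a lock. */
static OPA_int_t particles_processed;

void init_counters(void)
{
    /* Initialise once, before the worker threads start the walk. */
    OPA_store_int(&particles_processed, 0);
}

void process_particle(void)
{
    /* Atomic fetch-and-increment replaces the lock/increment/unlock
     * sequence, so worker threads no longer serialise here. */
    OPA_fetch_and_incr_int(&particles_processed);
}
}}}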
