wiki:intern/GPU

Plans for PEPC with GPU/Accelerator support

Global ideas

Ideally, all plans sketched here should be applicable/transferable to GPUs as well as to other accelerators, e.g. MIC processors, FPGAs, etc.

Finally, we want to have three sets of threads:

  1. Communicators that are not walk- but hashtable-specific and are to be interpreted as tree node servers that answer requests from other MPI ranks and receive remote tree nodes.
  2. Walker-threads that perform the tree traversal.
  3. Interaction-threads that actually evaluate the interaction law for partners that were identified by some walker.

In particular, the ratio of walker threads to interaction threads allows for good load balancing, keeping the CPU(s) as well as the available accelerator(s) busy. In a final stage, this ratio could even be adjusted dynamically: at the beginning of a walk many walkers are necessary; towards its end only a few are needed and more interaction threads could be started.
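The walker/interaction-thread split can be viewed as a producer-consumer pattern. Below is a minimal Python sketch (PEPC itself is not Python, and all names here are invented for illustration): walkers build interaction lists during a stubbed-out traversal and hand them to interaction threads via a queue, so the walker:worker ratio can be chosen freely.

```python
import threading
import queue

def run_walk(particles, interact, n_walkers=2, n_workers=2):
    """Toy model of the three-thread design (communicators omitted):
    walkers produce interaction lists, workers consume and evaluate them."""
    lists = queue.Queue()            # walker -> interaction-thread hand-off
    results, lock = {}, threading.Lock()

    def walker(chunk):
        for p in chunk:
            # stand-in for the tree traversal: here every other particle
            # is simply accepted as an interaction partner
            lists.put((p, [q for q in particles if q != p]))

    def worker():
        while True:
            item = lists.get()
            if item is None:         # poison pill: terminate this worker
                return
            p, partners = item
            val = sum(interact(p, q) for q in partners)
            with lock:
                results[p] = val

    chunks = [particles[i::n_walkers] for i in range(n_walkers)]
    ws = [threading.Thread(target=walker, args=(c,)) for c in chunks]
    ks = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in ws + ks:
        t.start()
    for t in ws:
        t.join()
    for _ in ks:
        lists.put(None)              # all work queued: shut workers down
    for t in ks:
        t.join()
    return results
```

Dynamically re-balancing the ratio would then amount to stopping some walkers and starting additional workers on the same queue.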

Roadmap

DONE

  1. [Benedikt] Code cleanup:
    • implement support for distinct trees with individual hashtables
    • remove (almost) all global fields
    The support for more than one tree will (besides the code cleanup itself) prepare for an even larger dynamic range of densities in the future. In addition, it will be necessary for later modifications.
  1. [Benedikt] Surgically remove the communicator thread from the tree traversal and attach it to the tree data structure. As a tree node server, it can be started as soon as the tree has been constructed at the end of pepc_grow_tree() and terminated in pepc_timber_tree(). This will simplify the code of the tree traversal dramatically, clean up dependencies, and is the natural way of implementing it. In addition, with such an implementation the communicator does not have to be started and stopped with every individual walk (e.g. when traversing the neighbour boxes), and the global synchronization overhead for terminating a walk is reduced.
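The intended lifecycle can be sketched as follows, with hypothetical Python stand-ins for pepc_grow_tree()/pepc_timber_tree() (the real code differs; this only illustrates tying the node server's lifetime to the tree rather than to a walk):

```python
import threading
import queue

class Tree:
    """Toy model: the communicator (tree node server) belongs to the
    tree data structure, not to an individual traversal."""
    def __init__(self, nodes):
        self.nodes = dict(nodes)        # key -> tree node (the "hashtable")
        self.requests = queue.Queue()   # stand-in for incoming MPI requests
        self.server = None

    def grow_tree(self):
        # tree construction finished: start the node server immediately
        self.server = threading.Thread(target=self._serve)
        self.server.start()

    def _serve(self):
        while True:
            req = self.requests.get()
            if req is None:             # shutdown sentinel from timber_tree
                return
            key, reply = req
            reply.put(self.nodes.get(key))

    def request_node(self, key):
        # what a remote rank's request/receive would amount to
        reply = queue.Queue()
        self.requests.put((key, reply))
        return reply.get()

    def timber_tree(self):
        # tree is torn down: terminate the node server along with it
        self.requests.put(None)
        self.server.join()
```

Any number of walks can now run between grow_tree() and timber_tree() without touching the server.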
  1. [Mathias] Node-Prefetching/sending
    • When requesting data from a remote rank, we could send the current particle's position and other information instead of only the requested key. Using a very short local traversal, the remote rank can then decide which higher-level nodes will be definitely needed anyway and send them together with the originally requested ones.
    • This modification will reduce the number of requests and prevent walkers and interaction threads from stalling due to missing nodes.
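The serving side of this prefetching scheme might look roughly as follows, in a 1D toy model. The opening criterion, theta, and the node layout here are assumptions for illustration, not PEPC's actual MAC or data structures:

```python
def serve_request(tree, key, pos, theta=0.6, max_extra=8):
    """Besides the requested node, run a short local traversal from it
    and piggyback descendants that the requesting particle at position
    `pos` will open anyway (i.e. for which the acceptance criterion
    fails). A node is a hypothetical (center, size, children) tuple."""
    out = {key: tree[key]}
    stack = [key]
    while stack and len(out) <= max_extra:
        k = stack.pop()
        center, size, children = tree[k]
        dist = abs(pos - center)
        if children and (dist == 0 or size / dist >= theta):
            for c in children:
                # this node must be opened, so its children will be
                # requested next: send them along right away
                out[c] = tree[c]
                stack.append(c)
    return out
```

The max_extra cap keeps the reply message bounded.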

TODO

  1. [Dirk] First experiments on GPU support
    • derive an (additional) backend from the Coulomb module that
      • collects interactions in individual interaction lists
      • feeds these to a GPU
    • If possible, stick to existing structures. Interaction lists can be added either as static lists in the t_particle_data or as a pointer to lists that are allocated as soon as the first interaction is encountered.
    With these tests we can identify possible technical problems and opportunities as well as parameters such as list lengths/memory demands etc.
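The list-collecting backend could be sketched as follows (class and method names are invented for illustration; in PEPC this would be a backend of the Coulomb module, and flush() is where a GPU kernel would be launched):

```python
class ListCollectingBackend:
    """Toy model of the proposed backend variant: instead of evaluating
    each interaction immediately, partners are collected on per-particle
    interaction lists and evaluated in one batch."""
    def __init__(self, kernel):
        self.kernel = kernel    # the pairwise interaction law
        self.lists = {}         # particle -> list of interaction partners

    def record(self, particle, partner):
        # allocate the list lazily, on the first interaction encountered
        self.lists.setdefault(particle, []).append(partner)

    def flush(self):
        # stand-in for the GPU launch: evaluate all lists in one sweep
        results = {p: sum(self.kernel(p, q) for q in partners)
                   for p, partners in self.lists.items()}
        self.lists.clear()
        return results
```

Instrumenting record() in such a backend would directly yield the list lengths and memory demands the tests are after.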
  1. [ ?? ] Perform the tree traversal for particle groups instead of individual particles from a list.
    • Organizationally, this will be easiest if the target particles are also inserted into a tree. The tree nodes can then hold pointers to their first particle in the local particle list; together with num_leaves, the length of each node's particle list is known.
    • Using some criterion such as the tree node size, its number of leaves, or a minimum level, particle groups are identified. The particles in each of these nodes collectively traverse the tree (i.e. the tree is traversed once per such node) and receive the same interaction lists.
    • For feeding lists to accelerators, these will now have to be attached to the nodes of the target tree instead of to the individual particles, which reduces memory requirements etc.
    Thus, the work done in the traversal will be reduced significantly, with the tradeoff of computing a few more interactions than are strictly necessary.
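The group selection can be sketched as a descent through the target tree. A minimal Python illustration using a leaf-count threshold as the criterion (the node structure here is invented; PEPC's tree nodes are more elaborate):

```python
def find_groups(node, max_leaves):
    """Descend the target tree until a node's leaf count drops below the
    threshold; every particle below such a node shares one traversal and
    one interaction list. A node is a dict {'leaves': [...], 'children': [...]}."""
    if len(node['leaves']) <= max_leaves or not node['children']:
        return [node['leaves']]      # this node becomes one particle group
    groups = []
    for child in node['children']:
        groups += find_groups(child, max_leaves)
    return groups
```

Raising max_leaves trades traversal work against extra interactions, since larger groups share coarser interaction lists.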
  1. [ ?? ] Node-Node interactions
    • Although the walk is performed for particle groups, the previous modification still sticks to particle-node interactions. Using the ideas of Dehnen, node-node interactions can finally be implemented which renders the treecode FMM-like and will lead to something like O(N) scaling.
    • Using a two-sided MAC, the runtime can be reduced even further (as Dehnen states).
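The core of such a Dehnen-style dual tree traversal can be sketched as follows (node layout and splitting rule are illustrative assumptions; the `mac` argument is where a two-sided acceptance criterion would plug in):

```python
def dual_traverse(t, s, mac, out):
    """Walk the target tree t and source tree s simultaneously. A node
    is a hypothetical (size, leaves, children) tuple; `mac` takes both
    nodes, so a two-sided criterion fits naturally."""
    if mac(t, s) or (not t[2] and not s[2]):
        out.append((t, s))           # accepted: one node-node interaction
        return
    # otherwise split the larger of the two nodes (a leaf cannot split)
    split_target = t[2] and (not s[2] or t[0] >= s[0])
    if split_target:
        for child in t[2]:
            dual_traverse(child, s, mac, out)
    else:
        for child in s[2]:
            dual_traverse(t, child, mac, out)
```

With a MAC that never accepts, this degenerates to direct leaf-leaf summation; the more pairs the MAC accepts high up in both trees, the closer the cost gets to the FMM-like O(N) behaviour.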
Last modified on 04/18/13 00:17:49