Ryoanji - a distributed N-body solver for AMD and NVIDIA GPUs

Ryoanji is a Barnes-Hut N-body solver for gravity and electrostatics. It employs EXAFMM multipole kernels and a Barnes-Hut tree-traversal algorithm inspired by Bonsai. Octrees and domain decomposition are handled by Cornerstone Octree, see Ref. [1].

Ryoanji is optimized to run efficiently on both AMD and NVIDIA GPUs, though a CPU implementation is provided as well.
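As an illustration of the Barnes-Hut idea, the following is a minimal sketch of the classical geometric multipole acceptance criterion (MAC): a tree node may be replaced by its multipole expansion when its size, divided by its distance to the target, is below the opening angle theta. The `Node` struct and the `acceptMultipole` function are hypothetical names chosen for this example and are not taken from the Ryoanji sources, whose actual criterion and data layout may differ.

```cpp
// Hypothetical tree node for illustration only (not Ryoanji's data layout):
// a cubic cell with edge length `size` and an expansion center.
struct Node
{
    double cx, cy, cz; // expansion center, e.g. the center of mass
    double size;       // edge length of the cubic cell
};

// Classical geometric MAC: accept the multipole approximation of `node`
// for a target at (x, y, z) if size / distance < theta.
inline bool acceptMultipole(const Node& node, double x, double y, double z, double theta)
{
    double dx = x - node.cx;
    double dy = y - node.cy;
    double dz = z - node.cz;
    double dist2 = dx * dx + dy * dy + dz * dz;

    // Compare squared quantities to avoid a square root:
    // size < theta * dist  <=>  size^2 < theta^2 * dist^2
    return node.size * node.size < theta * theta * dist2;
}
```

A traversal would evaluate the multipole (M2P) for accepted nodes and otherwise descend into the children, falling back to direct particle-particle (P2P) sums at the leaves. Smaller values of theta reject more nodes, which increases both accuracy and cost.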

Folder structure

Ryoanji.git
├── README.md
├── cstone          - Cornerstone library: octree building and domain decomposition
│                     (git subtree of https://github.com/sekelle/cornerstone-octree)
│
└── ryoanji         - Ryoanji: N-body solver
    ├── src
    └── test
        ├── demo.cu                     - single-rank demonstrator app
        ├── demo_mpi.cpp                - multi-rank demonstrator app
        ├── interface
        │   └── global_forces_gpu.cpp   - multi-rank correctness check vs. direct sum
        ├── nbody
        └── test_main.cpp

Compilation

Ryoanji is written in C++ and CUDA. The host .cpp translation units require a C++20 compiler (GCC 11 and later, Clang 14 and later), while .cu translation units are compiled in the C++17 standard. CUDA version: 11.6 or later, HIP version 5.2 or later.

NVIDIA CUDA, A100

CC=mpicc CXX=mpicxx cmake -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_FLAGS=-ccbin=mpicxx -DGPU_DIRECT=<ON/OFF> <GIT_SOURCE_DIR>
make -j

AMD HIP, MI250x

The code can directly be built with HIP, no hipification needed:

CC=mpicc CXX=mpicxx cmake -DCMAKE_HIP_ARCHITECTURES=gfx90a -DCSTONE_WITH_GPU_AWARE_MPI=<ON/OFF> <GIT_SOURCE_DIR> && make -j

Performance

One particle-particle (P2P) interaction counts as 23 flops, a multipole-particle (M2P) interaction with spherical hexadecapoles (P=4) counts as 2 * P^3 = 128 flops. The performance numbers given below only take P2P and M2P into account. Additional floating point operations due to tree node evaluations (multipole acceptance criteria, MAC) or warp-padding overheads are not taken into account. The opening angle theta was set to 0.5.

1x NVIDIA A100: 10.4 TFlop/s (FP32) per GPU, 62.2 million particles / second per GPU, 67 million particles total
4x NVIDIA A100: 10.9 TFlop/s (FP32) per GPU, 35.5 million particles / second per GPU, 3 billion particles total
1x AMD MI250X: 15.1 TFlop/s (FP32) per GPU (2 GCDs), 60.0 million particles / second per GPU, 67 million particles total
4x AMD MI250X: 15.4 TFlop/s (FP32) per GPU (2 GCDs), 50.0 million particles / second per GPU, 3 billion particles total
8208x AMD MI250X (LUMI-G): ~107 PFlop/s (FP64), 44.3 million particles / second per GPU (2 GCDs), 8 trillion particles total (in 22.2 seconds) [1]
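The flop accounting described above can be sketched as follows. This is only an illustration of the arithmetic: the function names are hypothetical, and the interaction counts would have to come from the traversal itself, which is not shown here.

```cpp
#include <cstdint>

// Flop weight of one particle-particle interaction, as stated in the text.
constexpr double flopsPerP2P = 23.0;

// Flop weight of one multipole-particle interaction for expansion order P:
// 2 * P^3, which gives 128 for P = 4.
constexpr double flopsPerM2P(int P) { return 2.0 * P * P * P; }

// Sustained rate in flop/s from interaction counts and elapsed wall time.
// Only P2P and M2P are counted; MAC evaluations and padding are ignored.
inline double flopRate(std::uint64_t numP2P, std::uint64_t numM2P, int P, double seconds)
{
    double flops = flopsPerP2P * double(numP2P) + flopsPerM2P(P) * double(numM2P);
    return flops / seconds;
}

// Throughput in particles per second per GPU.
inline double particlesPerSecondPerGpu(double numParticles, double seconds, int numGpus)
{
    return numParticles / seconds / numGpus;
}
```

With the rounded LUMI-G figures (8 trillion particles, 22.2 seconds, 8208 GPUs), `particlesPerSecondPerGpu(8e12, 22.2, 8208)` evaluates to roughly 44 million particles per second per GPU, in line with the value listed above.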

Note: the multi-rank demonstrator app provided here initializes random particles on all MPI ranks for the same spatial domain. This requires all-to-all communication to construct the sub-domains of each rank and is not feasible for a large number of ranks. In order to construct domains for trillions of particles such as in Ref. [1], optimized initialization strategies are required that place particles into the correct sub-domains. This is possible for Space-Filling-Curve (SFC) sorted input files or for in-situ initialization for particle ensembles with known (density) distribution functions. An application front-end that implements this capability, in addition to I/O and a time-stepping loop, is available as part of the SPH-EXA project.
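To make the notion of SFC-sorted particles more concrete, here is a minimal sketch of a 3D Morton (Z-order) key, one common space-filling curve. It is an assumption made for illustration: the curve and key layout actually used by Cornerstone may differ, and the names `toGrid`, `expandBits` and `mortonKey` are hypothetical. Coordinates are assumed to be normalized to the unit cube.

```cpp
#include <cstdint>

// Map a coordinate in [0, 1) to an integer grid index with 21 bits.
// Values outside the unit interval are clamped.
inline std::uint64_t toGrid(double x)
{
    const double cells = 2097152.0; // 2^21 cells per dimension
    double scaled = x * cells;
    if (scaled < 0.0) { return 0; }
    if (scaled >= cells) { return 2097151; }
    return static_cast<std::uint64_t>(scaled);
}

// Spread the lowest 21 bits of x so that two zero bits
// separate each pair of consecutive input bits.
inline std::uint64_t expandBits(std::uint64_t x)
{
    x &= 0x1fffffULL;
    x = (x | (x << 32)) & 0x1f00000000ffffULL;
    x = (x | (x << 16)) & 0x1f0000ff0000ffULL;
    x = (x | (x << 8)) & 0x100f00f00f00f00fULL;
    x = (x | (x << 4)) & 0x10c30c30c30c30c3ULL;
    x = (x | (x << 2)) & 0x1249249249249249ULL;
    return x;
}

// Interleave the bits of three grid indices into one 63-bit Morton key.
inline std::uint64_t mortonKey(std::uint64_t ix, std::uint64_t iy, std::uint64_t iz)
{
    return (expandBits(ix) << 2) | (expandBits(iy) << 1) | expandBits(iz);
}
```

Sorting particles by such a key places spatially close particles close together in memory, so that each rank can be assigned a contiguous key range as its sub-domain. Input that is already sorted this way can be read by each rank in contiguous chunks, without an initial all-to-all exchange.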

Accuracy and correctness

The demonstrator apps are configured by default to use an opening angle of theta = 0.5 and cartesian quadrupole expansions. This yields a 1st-percentile error of ~5e-4 in the accelerations.

$ mpiexec -np 8 ./interface/global_forces_gpu
rank 0 1st-percentile acc error 0.000410922, max acc error 0.00267019
rank 1 1st-percentile acc error 0.000501579, max acc error 0.00327092
rank 2 1st-percentile acc error 0.000362208, max acc error 0.00280561
rank 3 1st-percentile acc error 0.000481996, max acc error 0.0251728
rank 4 1st-percentile acc error 0.000579059, max acc error 0.0110242
rank 5 1st-percentile acc error 0.000442119, max acc error 0.00426394
rank 6 1st-percentile acc error 0.000470549, max acc error 0.00187002
rank 7 1st-percentile acc error 0.000527458, max acc error 0.00332407
global reference potential -0.706933, BH global potential -0.706931
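A check of this kind can be sketched as below: accelerations from a direct O(N^2) sum serve as the reference, and the per-particle relative error is summarized by a tail value and the maximum. This is a simplified, single-process illustration and not the code of global_forces_gpu. It assumes G = 1 and no softening, and it assumes that the "1st-percentile error" denotes the error exceeded by only 1% of the particles. All names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3
{
    double x, y, z;
};

// Reference accelerations by direct summation (G = 1, no softening).
inline std::vector<Vec3> directSum(const std::vector<Vec3>& pos, const std::vector<double>& mass)
{
    std::vector<Vec3> acc(pos.size(), Vec3{0.0, 0.0, 0.0});
    for (std::size_t i = 0; i < pos.size(); ++i)
    {
        for (std::size_t j = 0; j < pos.size(); ++j)
        {
            if (i == j) { continue; }
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double dz = pos[j].z - pos[i].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            double invR3 = 1.0 / (r2 * std::sqrt(r2));
            acc[i].x += mass[j] * dx * invR3;
            acc[i].y += mass[j] * dy * invR3;
            acc[i].z += mass[j] * dz * invR3;
        }
    }
    return acc;
}

// Relative error per particle: |a - a_ref| / |a_ref|.
inline std::vector<double> relativeErrors(const std::vector<Vec3>& ref, const std::vector<Vec3>& test)
{
    std::vector<double> err(ref.size());
    for (std::size_t i = 0; i < ref.size(); ++i)
    {
        double dx = test[i].x - ref[i].x;
        double dy = test[i].y - ref[i].y;
        double dz = test[i].z - ref[i].z;
        double norm2 = ref[i].x * ref[i].x + ref[i].y * ref[i].y + ref[i].z * ref[i].z;
        err[i] = std::sqrt((dx * dx + dy * dy + dz * dz) / norm2);
    }
    return err;
}

// Error exceeded by only a fraction `tail` of the particles;
// tail = 0.01 for the assumed "1st percentile", tail = 0 for the maximum.
inline double tailError(std::vector<double> err, double tail)
{
    if (err.empty()) { return 0.0; }
    std::sort(err.begin(), err.end());
    std::size_t idx = static_cast<std::size_t>(double(err.size()) * (1.0 - tail));
    return err[std::min(idx, err.size() - 1)];
}
```

In the output above, the maximum error varies by more than an order of magnitude between ranks while the 1st-percentile values stay close to 5e-4, which suggests that the tail value is the more stable accuracy indicator.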

References

[1] S. Keller et al. 2023, Cornerstone: Octree Construction Algorithms for Scalable Particle Simulations

Authors

Sebastian Keller
Rio Yokota
