exafmm/ryoanji

Ryoanji is a Barnes-Hut N-body solver for gravity and electrostatics. It employs EXAFMM multipole kernels and a Barnes-Hut tree-traversal algorithm inspired by Bonsai. Octrees and domain decomposition are handled by Cornerstone Octree, see Ref. [1].

Ryoanji is optimized to run efficiently on both AMD and NVIDIA GPUs, though a CPU implementation is provided as well.

Folder structure

Ryoanji.git
├── README.md
├── cstone          - Cornerstone library: octree building and domain decomposition
│                     (git subtree of https://github.com/sekelle/cornerstone-octree)
│
└── ryoanji                            - Ryoanji: N-body solver
    ├── src
    └── test
        ├── demo.cu                     - single-rank demonstrator app
        ├── demo_mpi.cpp                - multi-rank demonstrator app
        ├── interface
        │   └── global_forces_gpu.cpp   - multi-rank correctness check vs. direct sum
        ├── nbody
        └── test_main.cpp

Compilation

Ryoanji is written in C++ and CUDA. The host .cpp translation units require a C++20 compiler (GCC 11 and later, Clang 14 and later), while .cu translation units are compiled in the C++17 standard. CUDA version: 11.6 or later, HIP version 5.2 or later.

NVIDIA CUDA, A100

CC=mpicc CXX=mpicxx cmake -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_FLAGS=-ccbin=mpicxx -DGPU_DIRECT=<ON/OFF> <GIT_SOURCE_DIR>
make -j

AMD HIP, MI250x

The code can directly be built with HIP, no hipification needed:

CC=mpicc CXX=mpicxx cmake -DCMAKE_HIP_ARCHITECTURES=gfx90a -DCSTONE_WITH_GPU_AWARE_MPI=<ON/OFF> <GIT_SOURCE_DIR> && make -j

Performance

One particle-particle (P2P) interaction counts as 23 flops, a multipole-particle (M2P) interaction with spherical hexadecapoles (P=4) counts as 2 * P^3 = 128 flops. The performance numbers given below only take P2P and M2P into account. Additional floating point operations due to tree node evaluations (multipole acceptance criteria, MAC) or warp-padding overheads are not taken into account. The opening angle theta was set to 0.5.
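
The multipole acceptance criterion (MAC) mentioned above decides whether a tree node is far enough from a target particle to be evaluated with M2P instead of being opened. The following is a minimal sketch of the textbook Barnes-Hut form of that test, assuming a node of edge length s at distance d is accepted when s / d < theta. The names Vec3 and acceptMultipole are hypothetical, and the criterion that Ryoanji actually applies may differ.

```cpp
#include <cassert>

// Hypothetical 3D point type, for illustration only.
struct Vec3
{
    double x, y, z;
};

// Textbook Barnes-Hut opening test (an assumption, not Ryoanji's API):
// accept the multipole of a node if nodeSize / distance < theta,
// where distance is measured from the target to the node's expansion center.
inline bool acceptMultipole(Vec3 target, Vec3 nodeCenter, double nodeSize, double theta)
{
    assert(theta > 0.0);
    double dx = nodeCenter.x - target.x;
    double dy = nodeCenter.y - target.y;
    double dz = nodeCenter.z - target.z;
    double d2 = dx * dx + dy * dy + dz * dz;
    // Compare squared quantities to avoid a square root:
    // s / d < theta  <=>  s^2 < theta^2 * d^2
    return nodeSize * nodeSize < theta * theta * d2;
}
```

A smaller theta rejects more nodes, which shifts work from M2P to P2P and lowers the approximation error.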

  • 1x NVIDIA A100: 10.4 TFlop/s (FP32) per GPU, 62.2 million particles / second per GPU, 67 million particles total

  • 4x NVIDIA A100: 10.9 TFlop/s (FP32) per GPU, 35.5 million particles / second per GPU, 3 billion particles total

  • 1x AMD MI250X: 15.1 TFlop/s (FP32) per GPU (2 GCDs), 60.0 million particles / second per GPU, 67 million particles total

  • 4x AMD MI250X: 15.4 TFlop/s (FP32) per GPU (2 GCDs), 50.0 million particles / second per GPU, 3 billion particles total

  • 8208x AMD MI250X (LUMI-G): ~107 PFlop/s (FP64), 44.3 million particles / second per GPU (2 GCDs), 8 trillion particles total (in 22.2 seconds) [1]
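
The flop accounting behind these numbers can be sketched as follows. The weights (23 flops per P2P, 2 * P^3 flops per M2P) are taken from the text above. The helper names and the idea of passing raw interaction counts are assumptions made for illustration and are not part of the Ryoanji interface.

```cpp
#include <cassert>
#include <cstdint>

// Flop weight of one particle-particle interaction, as stated in the text.
constexpr double flopsPerP2P = 23.0;

// Flop weight of one multipole-particle interaction of expansion order P.
constexpr double flopsPerM2P(int P) { return 2.0 * P * P * P; }

// Counted flops for the given interaction counts (hypothetical helper).
// MAC evaluations and warp-padding overheads are deliberately not counted.
constexpr double countedFlops(std::uint64_t numP2P, std::uint64_t numM2P, int P)
{
    return flopsPerP2P * double(numP2P) + flopsPerM2P(P) * double(numM2P);
}

// Sustained rate in TFlop/s for a traversal that took `seconds`.
inline double teraFlopsPerSecond(std::uint64_t numP2P, std::uint64_t numM2P, int P, double seconds)
{
    assert(seconds > 0.0);
    return countedFlops(numP2P, numM2P, P) / seconds * 1e-12;
}
```

Because only P2P and M2P are counted, a rate computed this way is a lower bound on the arithmetic the hardware actually performs.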

Note: the multi-rank demonstrator app provided here initializes random particles on all MPI ranks for the same spatial domain. This requires all-to-all communication to construct the sub-domains of each rank and is not feasible for large numbers of ranks. In order to construct domains for trillions of particles such as in Ref. [1], optimized initialization strategies are required that place particles into the correct sub-domains. This is possible for Space-Filling-Curve (SFC) sorted input files or for in-situ initialization for particle ensembles with known (density) distribution functions. An application front-end that implements this capability, in addition to I/O and a time-stepping loop, is available as part of the SPH-EXA project.
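
To illustrate what SFC-sorted input means, here is a minimal sketch that orders particles in the unit cube along a Morton (Z-order) curve. It is an assumption-laden example: the type Particle and the functions spreadBits, mortonKey, toGrid, keyOf and sortAlongCurve are hypothetical, and the curve and key layout used by Cornerstone may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical particle with coordinates in the unit cube [0, 1)^3.
struct Particle
{
    double x, y, z;
};

// Move bit b of the lower 21 bits of v to bit position 3 * b.
inline std::uint64_t spreadBits(std::uint64_t v)
{
    std::uint64_t out = 0;
    for (int b = 0; b < 21; ++b)
    {
        out |= ((v >> b) & 1ULL) << (3 * b);
    }
    return out;
}

// Interleave three 21-bit grid indices into one 63-bit Morton key.
inline std::uint64_t mortonKey(std::uint64_t ix, std::uint64_t iy, std::uint64_t iz)
{
    return (spreadBits(ix) << 2) | (spreadBits(iy) << 1) | spreadBits(iz);
}

// Map a coordinate in [0, 1) to an integer grid index in [0, 2^21).
inline std::uint64_t toGrid(double c)
{
    assert(c >= 0.0 && c < 1.0);
    return static_cast<std::uint64_t>(c * double(1u << 21));
}

inline std::uint64_t keyOf(const Particle& p)
{
    return mortonKey(toGrid(p.x), toGrid(p.y), toGrid(p.z));
}

// Sort particles by their position along the curve.
inline void sortAlongCurve(std::vector<Particle>& particles)
{
    std::sort(particles.begin(), particles.end(),
              [](const Particle& a, const Particle& b) { return keyOf(a) < keyOf(b); });
}
```

Once particles are ordered this way, each rank can be handed a contiguous key range, so particles land in a spatially compact sub-domain without an all-to-all exchange.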

Accuracy and correctness

The demonstrator apps are configured by default to use an opening angle of theta = 0.5 and cartesian quadrupole expansions. This yields a 1st-percentile error of ~5e-4 in the accelerations.

$ mpiexec -np 8 ./interface/global_forces_gpu
rank 0 1st-percentile acc error 0.000410922, max acc error 0.00267019
rank 1 1st-percentile acc error 0.000501579, max acc error 0.00327092
rank 2 1st-percentile acc error 0.000362208, max acc error 0.00280561
rank 3 1st-percentile acc error 0.000481996, max acc error 0.0251728
rank 4 1st-percentile acc error 0.000579059, max acc error 0.0110242
rank 5 1st-percentile acc error 0.000442119, max acc error 0.00426394
rank 6 1st-percentile acc error 0.000470549, max acc error 0.00187002
rank 7 1st-percentile acc error 0.000527458, max acc error 0.00332407
global reference potential -0.706933, BH global potential -0.706931
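
A check of this kind compares tree-code accelerations against a direct sum. The sketch below shows one way such error statistics could be computed, assuming G = 1, a relative error |a - a_ref| / |a_ref| per particle, and that a percentile error means the value exceeded by a given fraction of particles. All names are hypothetical, and the actual test in global_forces_gpu.cpp may define its statistics differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 3D vector type, for illustration only.
struct Vec3
{
    double x, y, z;
};

// Direct O(N^2) summation of gravitational accelerations with G = 1.
// Assumes no two particles share the same position.
inline std::vector<Vec3> directSum(const std::vector<Vec3>& pos, const std::vector<double>& mass)
{
    assert(pos.size() == mass.size());
    std::vector<Vec3> acc(pos.size(), Vec3{0.0, 0.0, 0.0});
    for (std::size_t i = 0; i < pos.size(); ++i)
    {
        for (std::size_t j = 0; j < pos.size(); ++j)
        {
            if (i == j) { continue; }
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double dz = pos[j].z - pos[i].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            double invR3 = 1.0 / (r2 * std::sqrt(r2));
            acc[i].x += mass[j] * dx * invR3;
            acc[i].y += mass[j] * dy * invR3;
            acc[i].z += mass[j] * dz * invR3;
        }
    }
    return acc;
}

// Relative error |a - aRef| / |aRef| per particle; assumes |aRef| > 0.
inline std::vector<double> relativeErrors(const std::vector<Vec3>& a, const std::vector<Vec3>& aRef)
{
    assert(a.size() == aRef.size());
    std::vector<double> err(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        double dx = a[i].x - aRef[i].x, dy = a[i].y - aRef[i].y, dz = a[i].z - aRef[i].z;
        double ref2 = aRef[i].x * aRef[i].x + aRef[i].y * aRef[i].y + aRef[i].z * aRef[i].z;
        err[i] = std::sqrt((dx * dx + dy * dy + dz * dz) / ref2);
    }
    return err;
}

// Error value exceeded by roughly `topFraction` of all particles;
// topFraction = 0 returns the maximum error.
inline double errorExceededBy(std::vector<double> errors, double topFraction)
{
    assert(!errors.empty() && topFraction >= 0.0 && topFraction < 1.0);
    std::sort(errors.begin(), errors.end());
    std::size_t idx = static_cast<std::size_t>((1.0 - topFraction) * double(errors.size()));
    return errors[std::min(idx, errors.size() - 1)];
}
```

Under these assumptions, errorExceededBy(err, 0.01) would correspond to the per-rank percentile figure and errorExceededBy(err, 0.0) to the maximum error.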

References

[1] S. Keller et al. 2023, Cornerstone: Octree Construction Algorithms for Scalable Particle Simulations

Authors

  • Sebastian Keller
  • Rio Yokota
