ProjectPhysX/FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.


(click on images to show videos on YouTube)

Update History
  • v1.0 (04.08.2022) (public release)
    • public release
  • v1.1 (29.09.2022) (GPU voxelization)
    • added solid voxelization on GPU (slow algorithm)
    • added tool to print current camera position (key G)
    • minor bug fix (workaround for Intel iGPU driver bug with triangle rendering)
  • v1.2 (24.10.2022) (force/torque computation)
    • added functions to compute force/torque on objects
    • added function to translate Mesh
    • added Stokes drag validation setup
  • v1.3 (10.11.2022) (minor bug fixes)
    • added unit conversion functions for torque
    • FORCE_FIELD and VOLUME_FORCE can now be used independently
    • minor bug fix (workaround for AMD legacy driver bug with binary number literals)
  • v1.4 (14.12.2022) (Linux graphics)
    • complete rewrite of C++ graphics library to minimize API dependencies
    • added interactive graphics mode on Linux with X11
    • fixed streamline visualization bug in 2D
  • v2.0 (09.01.2023) (multi-GPU upgrade)
    • added (cross-vendor) multi-GPU support on a single node (PC/laptop/server)
  • v2.1 (15.01.2023) (fast voxelization)
    • made solid voxelization on GPU lightning fast (new algorithm, from minutes to milliseconds)
  • v2.2 (20.01.2023) (velocity voxelization)
    • added option to voxelize moving/rotating geometry on GPU, with automatic velocity initialization for each grid point based on center of rotation, linear velocity and rotational velocity
    • cells that are converted from solid->fluid during re-voxelization now have their DDFs properly initialized
    • added option to not auto-scale mesh during read_stl(...), with negative size parameter
    • added kernel for solid boundary rendering with marching-cubes
  • v2.3 (30.01.2023) (particles)
    • added particles with immersed-boundary method (either passive or 2-way-coupled, only supported with single-GPU)
    • minor optimization to GPU voxelization algorithm (workgroup threads outside mesh bounding-box return after ray-mesh intersections have been found)
    • displayed GPU memory allocation size is now fully accurate
    • fixed bug in write_line() function in src/utilities.hpp
    • removed .exe file extension for Linux/macOS
  • v2.4 (11.03.2023) (UI improvements)
    • added a help menu with key H that shows keyboard/mouse controls, visualization settings and simulation stats
    • improvements to keyboard/mouse control (+/- for zoom, mouse click frees/locks cursor)
    • added suggestion of largest possible grid resolution if resolution is set larger than memory allows
    • minor optimizations in multi-GPU communication (insignificant performance difference)
    • fixed bug in temperature equilibrium function for temperature extension
    • fixed erroneous double literal for Intel iGPUs in skybox color functions
    • fixed bug in make.sh where multi-GPU device IDs would not get forwarded to the executable
    • minor bug fixes in graphics engine (free cursor not centered during rotation, labels in VR mode)
    • fixed bug in LBM::voxelize_stl() size parameter standard initialization
  • v2.5 (11.04.2023) (raytracing overhaul)
    • implemented light absorption in fluid for raytracing graphics (no performance impact)
    • improved raytracing framerate when camera is inside fluid
    • fixed skybox pole flickering artifacts
    • fixed bug where moving objects during re-voxelization would leave an erroneous trail of solid grid cells behind
  • v2.6 (16.04.2023) (Intel Arc patch)
    • patched OpenCL issues of Intel Arc GPUs: now VRAM allocations >4GB are possible and correct VRAM capacity is reported
  • v2.7 (29.05.2023) (visualization upgrade)
    • added slice visualization (key 2 / key 3 modes, then switch through slice modes with key T, move slice with keys Q/E)
    • made flag wireframe / solid surface visualization kernels toggleable with key 1
    • added surface pressure visualization (key 1 when FORCE_FIELD is enabled and lbm.calculate_force_on_boundaries(); is called)
    • added binary .vtk export function for meshes with lbm.write_mesh_to_vtk(Mesh* mesh);
    • added time_step_multiplicator for integrate_particles() function in PARTICLES extension
    • made correction of wrong memory reporting on Intel Arc more robust
    • fixed bug in write_file() template functions
    • reverted back to separate cl::Context for each OpenCL device, as the shared Context otherwise would allocate extra VRAM on all other unused Nvidia GPUs
    • removed Debug and x86 configurations from Visual Studio solution file (one less complication for compiling)
    • fixed bug that particles could get too close to walls and get stuck, or leave the fluid phase (added boundary force)
  • v2.8 (24.06.2023) (documentation + polish)
    • finally added more documentation
    • cleaned up all sample setups in setup.cpp for more beginner-friendliness, and added required extensions in defines.hpp as comments to all setups
    • improved loading of composite .stl geometries, by adding an option to omit automatic mesh repositioning, added more functionality to Mesh struct in utilities.hpp
    • added uint3 resolution(float3 box_aspect_ratio, uint memory) function to compute simulation box resolution based on box aspect ratio and VRAM occupation in MB
    • added bool lbm.graphics.next_frame(...) function to export images for a specified video length in the main_setup compute loop
    • added VIS_... macros to ease setting visualization modes in headless graphics mode in lbm.graphics.visualization_modes
    • simulation box dimensions are now automatically made equally divisible by domains for multi-GPU simulations
    • fixed Info/Warning/Error message formatting for loading files and made Info/Warning/Error message labels colored
    • added Ahmed body setup as an example on how body forces and drag coefficient are computed
    • added Cessna 172 and Bell 222 setups to showcase loading composite .stl geometries and revoxelization of moving parts
    • added optional semi-transparent rendering mode (#define GRAPHICS_TRANSPARENCY 0.7f in defines.hpp)
    • fixed flickering of streamline visualization in interactive graphics
    • improved smooth positioning of streamlines in slice mode
    • fixed bug where mass and massex in SURFACE extension were also allocated in CPU RAM (not required)
    • fixed bug in Q-criterion rendering of halo data in multi-GPU mode, reduced gap width between domains
    • removed shared memory optimization from mesh voxelization kernel, as it crashes on Nvidia GPUs with new GPU drivers and is incompatible with old OpenCL 1.0 GPUs
    • fixed raytracing attenuation color when no surface is at the simulation box walls with periodic boundaries
  • v2.9 (31.07.2023) (multithreading)
    • added cross-platform parallel_for implementation in utilities.hpp using std::thread
    • significantly (>4x) faster simulation startup with multithreaded geometry initialization and sanity checks
    • faster calculate_force_on_object() and calculate_torque_on_object() functions with multithreading
    • added total runtime and LBM runtime to lbm.write_status()
    • fixed bug in voxelization ray direction for re-voxelizing rotating objects
    • fixed bug in Mesh::get_bounding_box_size()
    • fixed bug in print_message() function in utilities.hpp
  • v2.10 (05.11.2023) (frustum culling)
    • improved rasterization performance via frustum culling when only part of the simulation box is visible
    • improved switching between centered/free camera mode
    • refactored OpenCL rendering library
    • unit conversion factors are now automatically printed in console when units.set_m_kg_s(...) is used
    • faster startup time for FluidX3D benchmark
    • minor bug fix in voxelize_mesh(...) kernel
    • fixed bug in shading(...)
    • replaced std::rand(), which is slow with multithreading, with a standard C99 LCG
    • more robust correction of wrong VRAM capacity reporting on Intel Arc GPUs
    • fixed some minor compiler warnings
  • v2.11 (07.12.2023) (improved Linux graphics)
    • interactive graphics on Linux are now in fullscreen mode too, fully matching Windows
    • made CPU/GPU buffer initialization significantly faster with std::fill and enqueueFillBuffer (overall ~8% faster simulation startup)
    • added operating system info to OpenCL device driver version printout
    • fixed flickering with frustum culling at very small field of view
    • fixed bug where rendered/exported frame was not updated when visualization_modes changed
  • v2.12 (18.01.2024) (faster startup)
    • ~3x faster source code compiling on Linux using multiple CPU cores if make is installed
    • significantly faster simulation initialization (~40% single-GPU, ~15% multi-GPU)
    • minor bug fix in Memory_Container::reset() function
  • v2.13 (11.02.2024) (improved .vtk export)
    • data in exported .vtk files is now automatically converted to SI units
    • ~2x faster .vtk export with multithreading
    • added unit conversion functions for TEMPERATURE extension
    • fixed graphical artifacts with axis-aligned camera in raytracing
    • fixed get_exe_path() for macOS
    • fixed X11 multi-monitor issues on Linux
    • workaround for Nvidia driver bug: enqueueFillBuffer is broken for large buffers on Nvidia GPUs
    • fixed slow numeric drift issues caused by -cl-fast-relaxed-math
    • fixed wrong Maximum Allocation Size reporting in LBM::write_status()
    • fixed missing scaling of coordinates to SI units in LBM::write_mesh_to_vtk()
  • v2.14 (03.03.2024) (visualization upgrade)
    • coloring can now be switched between velocity/density/temperature with key Z
    • uniform improved color palettes for velocity/density/temperature visualization
    • color scale with automatic unit conversion can now be shown with key H
    • slice mode for field visualization now draws fully filled-in slices instead of only lines for velocity vectors
    • shading in VIS_FLAG_SURFACE and VIS_PHI_RASTERIZE modes is smoother now
    • make.sh now automatically detects operating system and X11 support on Linux and only runs FluidX3D if last compilation was successful
    • fixed compiler warnings on Android
    • fixed make.sh failing on some systems due to nonstandard interpreter path
    • fixed that make would not compile with multiple cores on some systems
  • v2.15 (09.04.2024) (framerate boost)
    • eliminated one frame memory copy and one clear frame operation in rendering chain, for 20-70% higher framerate on both Windows and Linux
    • enabled g++ compiler optimizations for faster startup and higher rendering framerate
    • fixed bug in multithreaded sanity checks
    • fixed wrong unit conversion for thermal expansion coefficient
    • fixed density to pressure conversion in LBM units
    • fixed bug that raytracing kernel could lock up simulation
    • fixed minor visual artifacts with raytracing
    • fixed that console sometimes was not cleared before INTERACTIVE_GRAPHICS_ASCII rendering starts
  • v2.16 (02.05.2024) (bug fixes)
    • simplified 10% faster marching-cubes implementation with 1D interpolation on edges instead of 3D interpolation, allowing to get rid of edge table
    • added faster, simplified marching-cubes variant for solid surface rendering where edges are always halfway between grid cells
    • refactoring in OpenCL rendering kernels
    • fixed that voxelization failed in Intel OpenCL CPU Runtime due to array out-of-bounds access
    • fixed that voxelization did not always produce binary identical results in multi-GPU compared to single-GPU
    • fixed that velocity voxelization failed for free surface simulations
    • fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (fma) with a*b+c
    • fixed that Y/Z keys were incorrect for QWERTY keyboard layout in Linux
    • fixed that free camera movement speed in help overlay was not updated in stationary image when scrolling
    • fixed that cursor would sometimes flicker when scrolling on trackpads with Linux-X11 interactive graphics
    • fixed flickering of interactive rendering with multi-GPU when camera is not moved
    • fixed missing XInitThreads() call that could crash Linux interactive graphics on some systems
    • fixed z-fighting between graphics_rasterize_phi() and graphics_flags_mc() kernels
  • v2.17 (05.06.2024) (unlimited domain resolution)
    • domains are no longer limited to 4.29 billion (2³², 1624³) grid cells or 225 GB memory; if more are used, the OpenCL code will automatically compile with 64-bit indexing
    • new, faster raytracing-based field visualization for single-GPU simulations
    • added GPU Driver and OpenCL Runtime installation instructions to documentation
    • refactored INTERACTIVE_GRAPHICS_ASCII
    • fixed memory leak in destructors of floatN, floatNxN, doubleN, doubleNxN (all unused)
    • made camera movement/rotation/zoom behavior independent of framerate
    • fixed that smart_device_selection() would print a wrong warning if device reports 0 MHz clock speed
  • v2.18 (21.07.2024) (more bug fixes)
    • added support for high refresh rate monitors on Linux
    • more compact OpenCL Runtime installation scripts in Documentation
    • driver/runtime installation instructions will now be printed to console if no OpenCL devices are available
    • added domain information to LBM::write_status()
    • added LBM::index function for uint3 input parameter
    • fixed that very large simulations sometimes wouldn't render properly by increasing maximum render distance from 10k to 2.1M
    • fixed mouse input stuttering at high screen refresh rate on Linux
    • fixed graphical artifacts in free surface raytracing on Intel CPU Runtime for OpenCL
    • fixed runtime estimation printed in console for setups with multiple lbm.run(...) calls
    • fixed density oscillations in sample setups (too large lbm_u)
    • fixed minor graphical artifacts in raytrace_phi()
    • fixed minor graphical artifacts in ray_grid_traverse_sum()
    • fixed wrong printed time step count on raindrop sample setup
  • v2.19 (07.09.2024) (camera splines)
    • the camera can now fly along a smooth path through a list of provided keyframe camera placements, using Catmull-Rom splines
    • more accurate remaining runtime estimation that includes time spent on rendering
    • enabled FP16S memory compression by default
    • printed camera placement using key G is now formatted for easier copy/paste
    • added benchmark chart in Readme using mermaid gantt chart
    • placed memory allocation info during simulation startup at better location
    • fixed threading conflict between INTERACTIVE_GRAPHICS and lbm.graphics.write_frame();
    • fixed maximum buffer allocation size limit for AMD GPUs and in Intel CPU Runtime for OpenCL
    • fixed wrong Re<Re_max info printout for 2D simulations
    • minor fix in bandwidth_bytes_per_cell_device()
  • v3.0 (16.11.2024) (larger CPU/iGPU simulations)
    • reduced memory footprint on CPUs and iGPUs from 72 to 55 Bytes/cell (fused OpenCL host+device buffers for rho/u/flags), allowing 31% higher resolution in the same RAM capacity
    • faster hardware-supported and faster fallback emulation atomic floating-point addition for PARTICLES extension
    • hardened calculate_f_eq() against bad user input for D2Q9
    • fixed velocity voxelization for overlapping geometry with different velocity
    • fixed Remaining Time printout during paused simulation
    • fixed CPU/GPU memory printout for CPU/iGPU simulations
  • v3.1 (08.02.2025) (more bug fixes)
    • faster enqueueReadBuffer() on modern CPUs with 64-Byte-aligned host_buffer
    • hardened ray intersection functions against planar ray edge case
    • updated OpenCL headers
    • better OpenCL device specs detection using vendor ID and Nvidia compute capability
    • better VRAM capacity reporting correction for Intel dGPUs
    • improved styling of performance mermaid gantt chart in Readme
    • added multi-GPU performance mermaid gantt chart in Readme
    • updated driver install guides
    • fixed voxelization being broken on some GPUs
    • added workaround for compiler bug in Intel CPU Runtime for OpenCL that causes Q-criterion isosurface rendering corruption
    • fixed TFlops estimate for Intel Battlemage GPUs
    • fixed wrong device name reporting for AMD GPUs
  • v3.2 (09.03.2025) (fast force/torque summation)
    • implemented GPU-accelerated force/torque summation (~20x faster than the previous CPU-multithreaded implementation)
    • simplified calculating object force/torque in setups
    • improved coloring in VIS_FIELD/ray_grid_traverse_sum()
    • updated OpenCL-Wrapper now compiles OpenCL C code with -cl-std=CL3.0 if available
    • fixed compiling on macOS with new OpenCL headers

How to get started?

Read the FluidX3D Documentation!

Compute Features - Getting the Memory Problem under Control

  • CFD model: lattice Boltzmann method (LBM)
    • streaming (part 2/2)

      f0^temp(x,t) = f0(x,t)
      fi^temp(x,t) = f_(t%2 ? i : (i%2 ? i+1 : i-1))(i%2 ? x : x-e_i, t)   for  i ∈ [1, q-1]

    • collision

      ρ(x,t) = (Σ_i fi^temp(x,t)) + 1

      u(x,t) = (1/ρ(x,t)) Σ_i c_i fi^temp(x,t)

      fi^eq-shifted(x,t) = w_i ρ · ((u∘c_i)²/(2c⁴) − (u∘u)/(2c²) + (u∘c_i)/c²) + w_i (ρ−1)

      fi^temp(x, t+Δt) = fi^temp(x,t) + Ω_i(fi^temp(x,t), fi^eq-shifted(x,t), τ)

    • streaming (part 1/2)

      f0(x, t+Δt) = f0^temp(x, t+Δt)
      f_(t%2 ? (i%2 ? i+1 : i-1) : i)(i%2 ? x+e_i : x, t+Δt) = fi^temp(x, t+Δt)   for  i ∈ [1, q-1]

    • variables and notation

      | variable | SI units | defining equation | description |
      | :---: | :---: | :---: | :--- |
      | x | m | x = (x,y,z)ᵀ | 3D position in Cartesian coordinates |
      | t | s | - | time |
      | ρ | kg/m³ | ρ = (Σ_i f_i) + 1 | mass density of fluid |
      | p | kg/(m s²) | p = c² ρ | pressure of fluid |
      | u | m/s | u = (1/ρ) Σ_i c_i f_i | velocity of fluid |
      | ν | m²/s | ν = μ/ρ | kinematic shear viscosity of fluid |
      | μ | kg/(m s) | μ = ρ ν | dynamic viscosity of fluid |
      | f_i | kg/m³ | - | shifted density distribution functions (DDFs) |
      | Δx | m | Δx = 1 | lattice constant (in LBM units) |
      | Δt | s | Δt = 1 | simulation time step (in LBM units) |
      | c | m/s | c = (1/√3) (Δx/Δt) | lattice speed of sound (in LBM units) |
      | i | 1 | 0 ≤ i < q | LBM streaming direction index |
      | q | 1 | q ∈ { 9, 15, 19, 27 } | number of LBM streaming directions |
      | e_i | m | D2Q9 / D3Q15/19/27 | LBM streaming directions |
      | c_i | m/s | c_i = e_i/Δt | LBM streaming velocities |
      | w_i | 1 | Σ_i w_i = 1 | LBM velocity set weights |
      | Ω_i | kg/m³ | SRT or TRT | LBM collision operator |
      | τ | s | τ = ν/c² + Δt/2 | LBM relaxation time |
    • velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27

    • collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)

    • DDF-shifting and other algebraic optimization to minimize round-off error

  • optimized to minimize VRAM footprint to 1/6 of other LBM codes
    • traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell

      • 🟧🟧🟧🟧🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
        (density 🟧, velocity 🟦, flags 🟨, 2 copies of DDFs 🟩/🟥; each square = 1 Byte)
      • allows for 3 Million cells per 1 GB VRAM
    • FluidX3D (D3Q19) requires only 55 Bytes/cell with Esoteric-Pull+FP16

      • 🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
        (density 🟧, velocity 🟦, flags 🟨, DDFs 🟩; each square = 1 Byte)

      • allows for 19 Million cells per 1 GB VRAM

      • in-place streaming withEsoteric-Pull: eliminates redundant copy of density distribution functions (DDFs) in memory; almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries; offers optimal memory access patterns for single-cell in-place streaming

      • decoupled arithmetic precision (FP32) and memory precision (FP32 or FP16S or FP16C): all arithmetic is done in FP32 for compatibility on all hardware, but DDFs in memory can be compressed to FP16S or FP16C: almost cuts memory demand in half again and almost doubles performance, without impacting overall accuracy for most setups

      • only 8 flag bits per lattice point (can be used independently / at the same time)
        • TYPE_S (stationary or moving) solid boundaries
        • TYPE_E equilibrium boundaries (inflow/outflow)
        • TYPE_T temperature boundaries
        • TYPE_F free surface (fluid)
        • TYPE_I free surface (interface)
        • TYPE_G free surface (gas)
        • TYPE_X remaining for custom use or further extensions
        • TYPE_Y remaining for custom use or further extensions
    • large cost saving: comparison of maximum single-GPU grid resolution for D3Q19 LBM

      | GPU VRAM capacity | example GPU (approximate price) | traditional LBM (FP64) | FluidX3D (FP32/FP32) | FluidX3D (FP32/FP16) |
      | :---: | :--- | :---: | :---: | :---: |
      | 1 GB | GT 210 ($25) | 144³ | 224³ | 266³ |
      | 2 GB | GTX 950 ($25) | 182³ | 282³ | 336³ |
      | 3 GB | GTX 1060 ($12) | 208³ | 322³ | 384³ |
      | 4 GB | GT 730 ($50) | 230³ | 354³ | 424³ |
      | 6 GB | GTX 1060 ($35) | 262³ | 406³ | 484³ |
      | 8 GB | RX 470 ($70) | 288³ | 448³ | 534³ |
      | 10 GB | RTX 3080 ($500) | 312³ | 482³ | 574³ |
      | 11 GB | GTX 1080 Ti ($240) | 322³ | 498³ | 594³ |
      | 12 GB | Tesla M40 ($75) | 330³ | 512³ | 610³ |
      | 16 GB | Instinct MI25 ($75) | 364³ | 564³ | 672³ |
      | 20 GB | RX 7900 XT ($900) | 392³ | 608³ | 724³ |
      | 24 GB | Tesla P40 ($205) | 418³ | 646³ | 770³ |
      | 32 GB | Instinct MI60 ($600) | 460³ | 710³ | 848³ |
      | 40 GB | A100 ($5500) | 494³ | 766³ | 912³ |
      | 48 GB | RTX 8000 ($2400) | 526³ | 814³ | 970³ |
      | 64 GB | Instinct MI210 ($10k) | 578³ | 896³ | 1068³ |
      | 80 GB | A100 ($11k) | 624³ | 966³ | 1150³ |
      | 94 GB | H100 NVL (>$40k) | 658³ | 1018³ | 1214³ |
      | 128 GB | GPU Max 1550 (?) | 730³ | 1130³ | 1346³ |
      | 192 GB | MI300X (~$10k) | 836³ | 1292³ | 1540³ |
      | 256 GB | - | 920³ | 1422³ | 1624³ |
  • cross-vendor multi-GPU support on a single computer/server
    • domain decomposition allows pooling VRAM from multiple GPUs for much larger grid resolution
    • GPUs don't have to be identical (not even from the same vendor), but similar VRAM capacity/bandwidth is recommended
    • domain communication architecture (simplified)
```
++   .-----------------------------------------------------------------.   ++
++   |                              GPU 0                              |   ++
++   |                          LBM Domain 0                           |   ++
++   '-----------------------------------------------------------------'   ++
++              |                 selective                /|\             ++
++             \|/               in-VRAM copy               |              ++
++        .-------------------------------------------------------.        ++
++        |               GPU 0 - Transfer Buffer 0               |        ++
++        '-------------------------------------------------------'        ++
!!                            |     PCIe     /|\                           !!
!!                           \|/    copy      |                            !!
@@        .-------------------------.   .-------------------------.        @@
@@        | CPU - Transfer Buffer 0 |   | CPU - Transfer Buffer 1 |        @@
@@        '-------------------------'\ /'-------------------------'        @@
@@                           pointer  X   swap                             @@
@@        .-------------------------./ \.-------------------------.        @@
@@        | CPU - Transfer Buffer 1 |   | CPU - Transfer Buffer 0 |        @@
@@        '-------------------------'   '-------------------------'        @@
!!                           /|\    PCIe      |                            !!
!!                            |     copy     \|/                           !!
++        .-------------------------------------------------------.        ++
++        |               GPU 1 - Transfer Buffer 1               |        ++
++        '-------------------------------------------------------'        ++
++             /|\                selective                 |              ++
++              |                in-VRAM copy              \|/             ++
++   .-----------------------------------------------------------------.   ++
++   |                              GPU 1                              |   ++
++   |                          LBM Domain 1                           |   ++
++   '-----------------------------------------------------------------'   ++
##                                    |                                    ##
##                      domain synchronization barrier                     ##
##                                    |                                    ##
||   -------------------------------------------------------------> time   ||
```
    • domain communication architecture (detailed)
```
++   .-----------------------------------------------------------------.   ++
++   |                              GPU 0                              |   ++
++   |                          LBM Domain 0                           |   ++
++   '-----------------------------------------------------------------'   ++
++     |  selective in- /|\  |  selective in- /|\  |  selective in- /|\    ++
++    \|/ VRAM copy (X)  |  \|/ VRAM copy (Y)  |  \|/ VRAM copy (Z)  |     ++
++   .---------------------.---------------------.---------------------.   ++
++   |    GPU 0 - TB 0X+   |    GPU 0 - TB 0Y+   |    GPU 0 - TB 0Z+   |   ++
++   |    GPU 0 - TB 0X-   |    GPU 0 - TB 0Y-   |    GPU 0 - TB 0Z-   |   ++
++   '---------------------'---------------------'---------------------'   ++
!!          | PCIe /|\            | PCIe /|\            | PCIe /|\         !!
!!         \|/ copy |            \|/ copy |            \|/ copy |          !!
@@   .---------. .---------.---------. .---------.---------. .---------.   @@
@@   | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- |   @@
@@   | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ |   @@
@@   '---------\ /---------'---------\ /---------'---------\ /---------'   @@
@@      pointer X swap (X)    pointer X swap (Y)    pointer X swap (Z)     @@
@@   .---------/ \---------.---------/ \---------.---------/ \---------.   @@
@@   | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ |   @@
@@   | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- |   @@
@@   '---------' '---------'---------' '---------'---------' '---------'   @@
!!         /|\ PCIe |            /|\ PCIe |            /|\ PCIe |          !!
!!          | copy \|/            | copy \|/            | copy \|/         !!
++   .--------------------..---------------------..--------------------.   ++
++   |   GPU 1 - TB 1X-   ||    GPU 3 - TB 3Y-   ||   GPU 5 - TB 5Z-   |   ++
++   :====================::=====================::====================:   ++
++   |   GPU 2 - TB 2X+   ||    GPU 4 - TB 4Y+   ||   GPU 6 - TB 6Z+   |   ++
++   '--------------------''---------------------''--------------------'   ++
++    /|\ selective in-  |  /|\ selective in-  |  /|\ selective in-  |     ++
++     |  VRAM copy (X) \|/  |  VRAM copy (Y) \|/  |  VRAM copy (Z) \|/    ++
++   .--------------------..---------------------..--------------------.   ++
++   |        GPU 1       ||        GPU 3        ||        GPU 5       |   ++
++   |    LBM Domain 1    ||    LBM Domain 3     ||    LBM Domain 5    |   ++
++   :====================::=====================::====================:   ++
++   |        GPU 2       ||        GPU 4        ||        GPU 6       |   ++
++   |    LBM Domain 2    ||    LBM Domain 4     ||    LBM Domain 6    |   ++
++   '--------------------''---------------------''--------------------'   ++
##              |                     |                     |              ##
##              |      domain synchronization barriers      |              ##
##              |                     |                     |              ##
||   -------------------------------------------------------------> time   ||
```
  • peak performance on GPUs (datacenter/gaming/professional/laptop)
  • powerful model extensions
    • boundary types
      • stationary mid-grid bounce-back boundaries (stationary solid boundaries)
      • moving mid-grid bounce-back boundaries (moving solid boundaries)
      • equilibrium boundaries (non-reflective inflow/outflow)
      • temperature boundaries (fixed temperature)
    • global force per volume (Guo forcing), can be modified on-the-fly
    • local force per volume (force field)
      • optional computation of forces from the fluid on solid boundaries
    • state-of-the-art free surface LBM (FSLBM) implementation
    • thermal LBM to simulate thermal convection
    • Smagorinsky-Lilly subgrid turbulence LES model to keep simulations with very large Reynolds number stable

      Π_αβ = Σ_i e_iα e_iβ (f_i − f_i^eq-shifted)

      Q = Σ_αβ Π_αβ²

      τ = ½ (τ₀ + √(τ₀² + (16√2)/(3π²) √Q/ρ))

    • particles with immersed-boundary method (either passive or 2-way-coupled, single-GPU only)

Solving the Visualization Problem

  • FluidX3D can do simulations so large that storing the volumetric data for later rendering becomes unmanageable (like 120 GB for a single frame, hundreds of TeraByte for a video)
  • instead, FluidX3D allows rendering raw simulation data directly in VRAM, so no large volumetric files have to be exported to the hard disk (see my technical talk)
  • the rendering is so fast that it works interactively in real time for both rasterization and raytracing
  • rasterization and raytracing are done in OpenCL and work on all GPUs, even the ones without RTX/DXR raytracing cores or without any rendering hardware at all (like A100, MI200, ...)
  • if no monitor is available (like on a remote Linux server), there is an ASCII rendering mode to interactively visualize the simulation in the terminal (even in WSL and/or through SSH)
  • rendering is fully multi-GPU-parallelized via seamless domain decomposition rasterization
  • with interactive graphics mode disabled, image resolution can be as large as VRAM allows for (4K/8K/16K and above)
  • (interactive) visualization modes:
    • flag wireframe / solid surface (and force vectors on solid cells or surface pressure if the extension is used)
    • velocity field (with slice mode)
    • streamlines (with slice mode)
    • velocity-colored Q-criterion isosurface
    • rasterized free surface with marching-cubes
    • raytraced free surface with fast ray-grid traversal and marching-cubes, either 1-4 rays/pixel or 1-10 rays/pixel

Solving the Compatibility Problem

Single-GPU/CPU Benchmarks

Here are performance benchmarks on various hardware in MLUPs/s, or how many million lattice cells are updated per second. The settings used for the benchmark are D3Q19 SRT with no extensions enabled (only LBM with implicit mid-grid bounce-back boundaries) and the setup consists of an empty cubic box with sufficient size (typically 256³). Without extensions, a single lattice cell requires:

  • a memory capacity of 93 (FP32/FP32) or 55 (FP32/FP16) Bytes
  • a memory bandwidth of 153 (FP32/FP32) or 77 (FP32/FP16) Bytes per time step
  • 363 (FP32/FP32) or 406 (FP32/FP16S) or 1275 (FP32/FP16C) FLOPs per time step (FP32+INT32 operations counted combined)

In consequence, the arithmetic intensity of this implementation is 2.37 (FP32/FP32) or 5.27 (FP32/FP16S) or 16.56 (FP32/FP16C) FLOPs/Byte, far below the compute-to-bandwidth ratio of any modern GPU, so performance is limited only by memory bandwidth. The left 3 columns of the table show the hardware specs as found in the data sheets (theoretical peak FP32 compute performance, memory capacity, theoretical peak memory bandwidth). The right 3 columns show the measured FluidX3D performance for the FP32/FP32, FP32/FP16S and FP32/FP16C floating-point precision settings, with the (roofline model efficiency) in round brackets, indicating what percentage of theoretical peak memory bandwidth is being used.
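As a sanity check, the arithmetic intensities quoted above and a bandwidth-bound roofline estimate can be reproduced from the per-cell numbers alone (a sketch using only values stated in this section; `mlups_roofline` is an illustrative helper name, not part of FluidX3D):

```python
# Arithmetic intensity = FLOPs per cell / Bytes transferred per cell and time step
assert round(363 / 153, 2) == 2.37   # FP32/FP32
assert round(406 / 77, 2) == 5.27    # FP32/FP16S
assert round(1275 / 77, 2) == 16.56  # FP32/FP16C

def mlups_roofline(bandwidth_GB_s, bytes_per_cell_per_step):
    """Bandwidth-bound upper performance bound in MLUPs/s."""
    return bandwidth_GB_s * 1e9 / bytes_per_cell_per_step / 1e6

# e.g. a GPU with 1008 GB/s peak bandwidth at FP32/FP32 (153 Bytes/cell/step);
# measured results in the table below typically reach ~80-90% of this bound
print(round(mlups_roofline(1008, 153)))
```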

If your GPU/CPU is not on the list yet, you can report your benchmarks here.

(Bar chart: FluidX3D performance in MLUPs/s per device, FP32 arithmetic with the fastest of FP32/FP16S/FP16C memory storage; the same data is listed in the benchmark table below.)
Single-GPU/CPU Benchmark Table

Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, ⚪ Apple, 🟡 ARM, 🟤 Glenfly

Device | FP32 [TFlops/s] | Mem [GB] | BW [GB/s] | FP32/FP32 [MLUPs/s] | FP32/FP16S [MLUPs/s] | FP32/FP16C [MLUPs/s]
🔴 Instinct MI300X163.40192530022867 (66%)41327 (60%)31670 (46%)
🔴 Instinct MI250 (1 GCD)45.266416385638 (53%)9030 (42%)8506 (40%)
🔴 Instinct MI21045.266416386517 (61%)9547 (45%)8829 (41%)
🔴 Instinct MI10046.143212285093 (63%)8133 (51%)8542 (54%)
🔴 Instinct MI6014.753210243570 (53%)5047 (38%)5111 (38%)
🔴 Radeon VII13.831610244898 (73%)7778 (58%)5256 (40%)
🔵 Data Center GPU Max 110022.224812293487 (43%)6209 (39%)3252 (20%)
🟢 GH200 94GB GPU66.9194400020595 (79%)34689 (67%)19407 (37%)
🟢 H100 NVL60.3294393820018 (78%)32613 (64%)17605 (34%)
🟢 H100 SXM5 80GB HBM366.9180335017602 (80%)29561 (68%)20227 (46%)
🟢 H100 PCIe 80GB HBM2e51.0180200011128 (85%)20624 (79%)13862 (53%)
🟢 A100 SXM4 80GB19.4980203910228 (77%)18448 (70%)11197 (42%)
🟢 A100 PCIe 80GB19.498019359657 (76%)17896 (71%)10817 (43%)
🟢 PG506-243 / PG506-24222.146416388195 (77%)15654 (74%)12271 (58%)
🟢 A100 SXM4 40GB19.494015558522 (84%)16013 (79%)11251 (56%)
🟢 A100 PCIe 40GB19.494015558526 (84%)16035 (79%)11088 (55%)
🟢 CMP 170HX6.32814937684 (79%)12392 (64%)6859 (35%)
🟢 A3010.32249335004 (82%)9721 (80%)5726 (47%)
🟢 Tesla V100 SXM2 32GB15.67329004471 (76%)8947 (77%)7217 (62%)
🟢 Tesla V100 PCIe 16GB14.13169005128 (87%)10325 (88%)7683 (66%)
🟢 Quadro GV10016.66328703442 (61%)6641 (59%)5863 (52%)
🟢 Titan V14.90126533601 (84%)7253 (86%)6957 (82%)
🟢 Tesla P100 16GB9.52167323295 (69%)5950 (63%)4176 (44%)
🟢 Tesla P100 12GB9.52125492427 (68%)4141 (58%)3999 (56%)
🟢 GeForce GTX TITAN4.7162881460 (77%)2500 (67%)1113 (30%)
🟢 Tesla K40m4.29122881131 (60%)1868 (50%)912 (24%)
🟢 Tesla K80 (1 GPU)4.1112240916 (58%)1642 (53%)943 (30%)
🟢 Tesla K20c3.525208861 (63%)1507 (56%)720 (27%)
🔴 Radeon RX 7900 XTX61.44249603665 (58%)7644 (61%)7716 (62%)
🔴 Radeon PRO W790061.30488643107 (55%)5939 (53%)5780 (52%)
🔴 Radeon RX 7900 XT51.61208003013 (58%)5856 (56%)5986 (58%)
🔴 Radeon PRO W780045.20325761872 (50%)4426 (59%)4145 (55%)
🔴 Radeon RX 7900 GRE42.03165761996 (53%)4570 (61%)4463 (60%)
🔴 Radeon PRO W770028.30165761547 (41%)2943 (39%)2899 (39%)
🔴 Radeon RX 760021.7582881250 (66%)2561 (68%)2512 (67%)
🔴 Radeon PRO W760020.0082881179 (63%)2263 (61%)2287 (61%)
🔴 Radeon PRO W750012.208172856 (76%)1630 (73%)1682 (75%)
🔴 Radeon RX 6900 XT23.04165121968 (59%)4227 (64%)4207 (63%)
🔴 Radeon RX 6800 XT20.74165122008 (60%)4241 (64%)4224 (64%)
🔴 Radeon PRO W680017.83325121620 (48%)3361 (51%)3180 (48%)
🔴 Radeon RX 6700 XT13.21123841408 (56%)2883 (58%)2908 (58%)
🔴 Radeon RX 6800M11.78123841439 (57%)3190 (64%)3213 (64%)
🔴 Radeon RX 6700M10.60103201194 (57%)2388 (57%)2429 (58%)
🔴 Radeon RX 66008.938224963 (66%)1817 (62%)1839 (63%)
🔴 Radeon RX 6500 XT5.774144459 (49%)1011 (54%)1030 (55%)
🔴 Radeon RX 5700 XT9.7584481368 (47%)3253 (56%)3049 (52%)
🔴 Radeon RX 57007.7284481521 (52%)3167 (54%)2758 (47%)
🔴 Radeon RX 5600 XT6.7362881136 (60%)2214 (59%)2148 (57%)
🔴 Radeon RX Vega 6413.3584841875 (59%)2878 (46%)3227 (51%)
🔴 Radeon RX 5905.5382561257 (75%)1573 (47%)1688 (51%)
🔴 Radeon RX 580 4GB6.504256946 (57%)1848 (56%)1577 (47%)
🔴 Radeon RX 580 2048SP 8GB4.948224868 (59%)1622 (56%)1240 (43%)
🔴 Radeon R9 390X5.9183841733 (69%)2217 (44%)1722 (35%)
🔴 Radeon HD 78501.842154112 (11%)120 ( 6%)635 (32%)
🔵 Arc B580 LE14.59124562598 (87%)4443 (75%)4979 (84%)
🔵 Arc A770 LE19.66165602663 (73%)4568 (63%)4519 (62%)
🔵 Arc A750 LE17.2085122555 (76%)4314 (65%)4047 (61%)
🔵 Arc A58012.2985122534 (76%)3889 (58%)3488 (52%)
🔵 Arc Pro A405.026192594 (47%)985 (40%)927 (37%)
🔵 Arc A3804.206186622 (51%)1097 (45%)1115 (46%)
🟢 GeForce RTX 5090104.883217929522 (81%)18459 (79%)19141 (82%)
🟢 GeForce RTX 508056.34169605174 (82%)10252 (82%)10304 (83%)
🟢 GeForce RTX 507030.84126723658 (83%)7238 (83%)7107 (81%)
🟢 GeForce RTX 409082.582410085624 (85%)11091 (85%)11496 (88%)
🟢 RTX 6000 Ada91.10489604997 (80%)10249 (82%)10293 (83%)
🟢 L40S91.61488643788 (67%)7637 (68%)7617 (68%)
🟢 L4090.52488643870 (69%)7778 (69%)7945 (71%)
🟢 GeForce RTX 4080 Super52.22167364089 (85%)7660 (80%)8218 (86%)
🟢 GeForce RTX 408055.45167173914 (84%)7626 (82%)7933 (85%)
🟢 GeForce RTX 4070 Ti Super44.10166723694 (84%)6435 (74%)7295 (84%)
🟢 GeForce RTX 4090M28.31165763367 (89%)6545 (87%)6901 (92%)
🟢 GeForce RTX 4070 Super35.55125042751 (83%)5149 (79%)5554 (85%)
🟢 GeForce RTX 407029.15125042646 (80%)4548 (69%)5016 (77%)
🟢 GeForce RTX 4080M33.85124322577 (91%)5086 (91%)5114 (91%)
🟢 RTX 4000 Ada26.73203602130 (91%)3964 (85%)4221 (90%)
🟢 GeForce RTX 406015.1182721614 (91%)3052 (86%)3124 (88%)
🟢 GeForce RTX 4070M18.2582561553 (93%)2945 (89%)3092 (93%)
🟢 RTX 2000 Ada12.00162241351 (92%)2452 (84%)2526 (87%)
🟢 GeForce RTX 3090 Ti40.002410085717 (87%)10956 (84%)10400 (79%)
🟢 GeForce RTX 309039.05249365418 (89%)10732 (88%)10215 (84%)
🟢 GeForce RTX 3080 Ti37.17129125202 (87%)9832 (87%)9347 (79%)
🟢 GeForce RTX 3080 12GB32.26129125071 (85%)9657 (81%)8615 (73%)
🟢 RTX A600040.00487684421 (88%)8814 (88%)8533 (86%)
🟢 GeForce RTX 3080 10GB29.77107604230 (85%)8118 (82%)7714 (78%)
🟢 GeForce RTX 3070 Ti21.7586083490 (88%)6807 (86%)5926 (75%)
🟢 GeForce RTX 3080M Ti23.61165122985 (89%)5908 (89%)5780 (87%)
🟢 GeForce RTX 307020.3184482578 (88%)5096 (88%)5060 (87%)
🟢 GeForce RTX 3060 Ti16.4984482644 (90%)5129 (88%)4718 (81%)
🟢 RTX A400019.17164482500 (85%)4945 (85%)4664 (80%)
🟢 RTX A5000M16.59164482228 (76%)4461 (77%)3662 (63%)
🟢 GeForce RTX 306013.17123602108 (90%)4070 (87%)3566 (76%)
🟢 GeForce RTX 3060M10.9463362019 (92%)4012 (92%)3572 (82%)
🟢 GeForce RTX 3050M Ti7.6041921181 (94%)2341 (94%)2253 (90%)
🟢 GeForce RTX 3050M7.1341921180 (94%)2339 (94%)2016 (81%)
🟢 Titan RTX16.31246723471 (79%)7456 (85%)7554 (87%)
🟢 Quadro RTX 600016.31246723307 (75%)6836 (78%)6879 (79%)
🟢 Quadro RTX 8000 Passive14.93486242591 (64%)5408 (67%)5607 (69%)
🟢 GeForce RTX 2080 Ti13.45116163194 (79%)6700 (84%)6853 (86%)
🟢 GeForce RTX 2080 Super11.3484962434 (75%)5284 (82%)5087 (79%)
🟢 Quadro RTX 500011.15164482341 (80%)4766 (82%)4773 (82%)
🟢 GeForce RTX 208010.0784482318 (79%)4977 (86%)4963 (85%)
🟢 GeForce RTX 2070 Super9.2284482255 (77%)4866 (84%)4893 (84%)
🟢 GeForce RTX 20707.4784482444 (83%)4387 (75%)5017 (86%)
🟢 GeForce RTX 2060 Super7.1884482503 (85%)5035 (87%)4463 (77%)
🟢 Quadro RTX 40007.1284162284 (84%)4584 (85%)4062 (75%)
🟢 GeForce RTX 2060 KO6.7463361643 (75%)3376 (77%)3266 (75%)
🟢 GeForce RTX 20606.7463361681 (77%)3604 (83%)3571 (82%)
🟢 GeForce GTX 1660 Super5.0363361696 (77%)3551 (81%)3040 (70%)
🟢 Tesla T48.14153001356 (69%)2869 (74%)2887 (74%)
🟢 GeForce GTX 1660 Ti5.4862881467 (78%)3041 (81%)3019 (81%)
🟢 GeForce GTX 16605.0761921016 (81%)1924 (77%)1992 (80%)
🟢 GeForce GTX 1650M 896C2.724192963 (77%)1836 (74%)1858 (75%)
🟢 GeForce GTX 1650M 1024C3.204128706 (84%)1214 (73%)1400 (84%)
🟢 T5003.04480339 (65%)578 (56%)665 (64%)
🟢 Titan Xp12.15125482919 (82%)5495 (77%)5375 (76%)
🟢 GeForce GTX 1080 Ti12.06114842631 (83%)4837 (77%)4877 (78%)
🟢 GeForce GTX 10809.7883201623 (78%)3100 (75%)3182 (77%)
🟢 GeForce GTX 1060 6GB4.576192997 (79%)1925 (77%)1785 (72%)
🟢 GeForce GTX 1060M4.446192983 (78%)1882 (75%)1803 (72%)
🟢 GeForce GTX 1050M Ti2.494112631 (86%)1224 (84%)1115 (77%)
🟢 Quadro P10001.89482426 (79%)839 (79%)778 (73%)
🟢 GeForce GTX 980 Ti6.0563361509 (69%)2703 (62%)2381 (55%)
🟢 GeForce GTX 9804.9842241018 (70%)1965 (68%)1872 (64%)
🟢 GeForce GTX 9704.174224980 (67%)1721 (59%)1623 (56%)
🟢 Quadro M40002.578192899 (72%)1519 (61%)1050 (42%)
🟢 Tesla M60 (1 GPU)4.828160853 (82%)1571 (76%)1557 (75%)
🟢 GeForce GTX 960M1.51480442 (84%)872 (84%)627 (60%)
🟢 GeForce GTX 7703.332224800 (55%)1215 (42%)876 (30%)
🟢 GeForce GTX 680 4GB3.334192783 (62%)1274 (51%)814 (33%)
🟢 Quadro K20000.73264312 (75%)444 (53%)171 (21%)
🟢 GeForce GT 630 (OEM)0.46229151 (81%)185 (50%)78 (21%)
🟢 Quadro NVS 2900.031/469 (22%)4 ( 5%)4 ( 5%)
🟤 Arise 10201.502196 ( 5%)6 ( 2%)6 ( 2%)
⚪ M2 Ultra GPU 76CU 192GB19.461478004629 (89%)8769 (84%)7972 (77%)
⚪ M2 Max GPU 38CU 32GB9.73224002405 (92%)4641 (89%)2444 (47%)
⚪ M1 Ultra GPU 64CU 128GB16.38988004519 (86%)8418 (81%)6915 (67%)
⚪ M1 Max GPU 24CU 32GB6.14224002369 (91%)4496 (87%)2777 (53%)
⚪ M1 Pro GPU 16CU 16GB4.10112001204 (92%)2329 (90%)1855 (71%)
⚪ M1 GPU 8CU 16GB2.051168384 (86%)758 (85%)759 (86%)
🔴 Radeon 780M (Z1 Extreme)8.298102443 (66%)860 (65%)820 (62%)
🔴 Radeon Graphics (7800X3D)0.5612102338 (51%)498 (37%)283 (21%)
🔴 Radeon Vega 8 (4750G)2.152757263 (71%)511 (70%)501 (68%)
🔴 Radeon Vega 8 (3500U)1.23738157 (63%)282 (57%)288 (58%)
🔵 Arc 140V GPU (16GB)3.9916137636 (71%)1282 (72%)773 (44%)
🔵 Arc Graphics (Ultra 9 185H)4.811490271 (46%)710 (61%)724 (62%)
🔵 Iris Xe Graphics (i7-1265U)1.921377342 (68%)621 (62%)574 (58%)
🔵 UHD Graphics Xe 32EUs0.742551128 (38%)245 (37%)216 (32%)
🔵 UHD Graphics 7700.823090342 (58%)475 (41%)278 (24%)
🔵 UHD Graphics 6300.46751151 (45%)301 (45%)187 (28%)
🔵 UHD Graphics P6300.465142177 (65%)288 (53%)137 (25%)
🔵 HD Graphics 55000.3532675 (45%)192 (58%)108 (32%)
🔵 HD Graphics 46000.38226105 (63%)115 (35%)34 (10%)
🟡 Mali-G610 MP4 (Orange Pi 5)0.061634130 (58%)232 (52%)93 (21%)
🟡 Mali-G72 MP18 (Samsung S9+)0.24429110 (59%)230 (62%)21 ( 6%)
🔴 2x EPYC 975450.7930729223276 (54%)5077 (42%)5179 (43%)
🔴 2x EPYC 965443.6215369221381 (23%)1814 (15%)1801 (15%)
🔴 2x EPYC 955430.723849222552 (42%)2127 (18%)2144 (18%)
🔴 2x EPYC 73523.53512410739 (28%)106 ( 2%)412 ( 8%)
🔴 2x EPYC 73133.07128410498 (19%)367 ( 7%)418 ( 8%)
🔴 2x EPYC 73023.07128410784 (29%)336 ( 6%)411 ( 8%)
🔵 2x Xeon 6980P98.30614416907875 (71%)5112 (23%)5610 (26%)
🔵 2x Xeon 6979P92.16307216908135 (74%)4175 (19%)4622 (21%)
🔵 2x Xeon Platinum 8592+31.1310247173135 (67%)2359 (25%)2466 (26%)
🔵 2x Xeon CPU Max 948027.242566142037 (51%)1520 (19%)1464 (18%)
🔵 2x Xeon Platinum 8480+28.675126142162 (54%)1845 (23%)1884 (24%)
🔵 2x Xeon Platinum 847025.2920486141865 (46%)1909 (24%)2068 (26%)
🔵 2x Xeon Platinum 838023.5520484101410 (53%)1159 (22%)1298 (24%)
🔵 2x Xeon Platinum 835821.302564101285 (48%)1007 (19%)1120 (21%)
🔵 2x Xeon Platinum 82563.891536282396 (22%)158 ( 4%)175 ( 5%)
🔵 2x Xeon Platinum 81538.19384256691 (41%)290 ( 9%)328 (10%)
🔵 2x Xeon Gold 6248R18.43384282755 (41%)566 (15%)694 (19%)
🔵 2x Xeon Gold 61285.22192256254 (15%)185 ( 6%)193 ( 6%)
🔵 Xeon Phi 72105.32192102415 (62%)193 (15%)223 (17%)
🔵 4x Xeon E5-4620 v42.69512273460 (26%)275 ( 8%)239 ( 7%)
🔵 2x Xeon E5-2630 v41.4164137264 (30%)146 ( 8%)129 ( 7%)
🔵 2x Xeon E5-2623 v40.6764137125 (14%)66 ( 4%)59 ( 3%)
🔵 2x Xeon E5-2680 v31.92128137304 (34%)234 (13%)291 (16%)
🟢 GH200 Neoverse-V2 CPU7.884803841323 (53%)853 (17%)683 (14%)
🔴 Threadripper PRO 7995WX15.362563331134 (52%)1697 (39%)1715 (40%)
🔴 Threadripper 3970X3.79128102376 (56%)103 ( 8%)463 (35%)
🔴 Threadripper 1950X0.8712885273 (49%)43 ( 4%)151 (14%)
🔴 Ryzen 7 7800X3D1.0832102296 (44%)361 (27%)363 (27%)
🔴 Ryzen 7 5700X3D0.873251229 (68%)135 (20%)173 (26%)
🔴 FX-61000.16162611 ( 7%)11 ( 3%)22 ( 7%)
🔴 Athlon X2 QL-650.034113 ( 4%)2 ( 2%)3 ( 2%)
🔵 Core Ultra 7 258V0.5632137287 (32%)123 ( 7%)167 ( 9%)
🔵 Core Ultra 9 185H1.791690317 (54%)267 (23%)288 (25%)
🔵 Core i9-14900K3.743296443 (71%)453 (36%)490 (39%)
🔵 Core i7-13700K2.516490504 (86%)398 (34%)424 (36%)
🔵 Core i7-1265U1.233277128 (26%)62 ( 6%)58 ( 6%)
🔵 Core i9-11900KB0.843251109 (33%)195 (29%)208 (31%)
🔵 Core i9-10980XE3.2312894286 (47%)251 (21%)223 (18%)
🔵 Xeon E-2288G0.953243196 (70%)182 (33%)198 (36%)
🔵 Core i7-97000.776443103 (37%)62 (11%)95 (17%)
🔵 Core i5-96000.601643146 (52%)127 (23%)147 (27%)
🔵 Core i7-8700K0.711651152 (45%)134 (20%)116 (17%)
🔵 Xeon E-2176G0.716442201 (74%)136 (25%)148 (27%)
🔵 Core i7-7700HQ0.36123881 (32%)82 (16%)108 (22%)
🔵 Xeon E3-1240 v50.503234141 (63%)75 (17%)88 (20%)
🔵 Core i7-47700.441626104 (62%)69 (21%)59 (18%)
🔵 Core i7-4720HQ0.33162680 (48%)23 ( 7%)60 (18%)
🔵 Celeron N28070.014117 (10%)3 ( 2%)3 ( 2%)

Multi-GPU Benchmarks

Multi-GPU benchmarks are done at the largest possible grid resolution with cubic domains, using either 2x1x1, 2x2x1 or 2x2x2 of these domains together. The (percentages in round brackets) are single-GPU roofline model efficiency, and the (multipliers in round brackets) are scaling factors relative to benchmarked single-GPU performance.

(Bar chart: multi-GPU FluidX3D performance in MLUPs/s, FP32 arithmetic with the fastest of FP32/FP16S/FP16C memory storage; the same data is listed in the multi-GPU benchmark table below.)
Multi-GPU Benchmark Table

Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, ⚪ Apple, 🟡 ARM, 🟤 Glenfly

Device | FP32 [TFlops/s] | Mem [GB] | BW [GB/s] | FP32/FP32 [MLUPs/s] | FP32/FP16S [MLUPs/s] | FP32/FP16C [MLUPs/s]
🔴 8x Instinct MI300X1307.20153642400152835 (6.7x)192297 (4.7x)204924 (6.5x)
🔴 4x Instinct MI300X653.607682120083678 (3.7x)103200 (2.5x)109546 (3.5x)
🔴 2x Instinct MI300X326.803841060046673 (2.0x)61053 (1.5x)57391 (1.8x)
🔴 1x Instinct MI300X163.40192530022867 (66%)41327 (60%)31670 (46%)
🔴 4x Instinct MI250 (8 GCD)362.085121310727350 (4.9x)52258 (5.8x)53521 (6.3x)
🔴 2x Instinct MI250 (4 GCD)181.04256655416925 (3.0x)29163 (3.2x)29627 (3.5x)
🔴 1x Instinct MI250 (2 GCD)90.5212832779460 (1.7x)14313 (1.6x)17338 (2.0x)
🔴 1x Instinct MI250 (1 GCD)45.266416385638 (53%)9030 (42%)8506 (40%)
🔴 32x Instinct MI210 GigaIO1448.3220485242923881 (3.8x)50952 (6.0x)48848 (5.4x)
🔴 24x Instinct MI210 GigaIO1086.2415363932222056 (3.5x)45033 (5.3x)44631 (4.9x)
🔴 16x Instinct MI210 GigaIO724.1610242621418094 (2.9x)37360 (4.4x)37922 (4.2x)
🔴   8x Instinct MI210 GigaIO362.085121310713546 (2.1x)27996 (3.3x)27820 (3.1x)
🔴   4x Instinct MI210 GigaIO181.0425665548816 (1.4x)17232 (2.0x)16892 (1.9x)
🔴   2x Instinct MI210 GigaIO90.5212832777245 (1.1x)12050 (1.4x)13539 (1.5x)
🔴   1x Instinct MI210 GigaIO45.266416386347 (59%)8486 (40%)9105 (43%)
🔴 4x Instinct MI210181.04256655417075 (2.6x)31408 (3.6x)30643 (3.5x)
🔴 2x Instinct MI21090.5212832779624 (1.5x)15909 (1.8x)16156 (1.8x)
🔴 1x Instinct MI21045.266416386454 (60%)8757 (41%)8751 (41%)
🔴 8x Radeon VII110.64128819221946 (4.5x)30826 (4.0x)24572 (4.7x)
🔴 4x Radeon VII55.3264409612911 (2.6x)24273 (3.1x)17080 (3.2x)
🔴 2x Radeon VII27.663220488113 (1.7x)15591 (2.0x)10352 (2.0x)
🔴 1x Radeon VII13.831610244898 (73%)7778 (58%)5256 (40%)
🔵 4x DC GPU Max 110088.88192491512162 (3.5x)22777 (3.7x)11759 (3.6x)
🔵 2x DC GPU Max 110044.449624586301 (1.8x)11815 (1.9x)5970 (1.8x)
🔵 1x DC GPU Max 110022.224812293487 (43%)6209 (39%)3252 (20%)
🟢 4x H100 SXM5 80GB HBM3267.633201340046442 (2.7x)78462 (2.8x)60490 (3.0x)
🟢 2x H100 SXM5 80GB HBM3133.82160670026838 (1.6x)46189 (1.6x)34147 (1.7x)
🟢 1x H100 SXM5 80GB HBM366.9180335017262 (79%)28522 (66%)20065 (46%)
🟢 4x A100 PCIe 80GB77.96320774025957 (2.7x)52056 (2.9x)33283 (3.1x)
🟢 2x A100 PCIe 80GB38.98160387015742 (1.6x)27165 (1.5x)17510 (1.6x)
🟢 1x A100 PCIe 80GB19.498019359657 (76%)17896 (71%)10817 (43%)
🟢 4x PG506-243 / PG506-24288.57256655423097 (2.8x)41088 (2.6x)36130 (2.9x)
🟢 2x PG506-243 / PG506-24244.28128327713885 (1.7x)24168 (1.5x)20906 (1.7x)
🟢 1x PG506-243 / PG506-24222.146416388195 (77%)15654 (74%)12271 (58%)
🟢 8x A100 SXM4 40GB155.923201244037619 (4.4x)72965 (4.6x)63009 (7.2x)
🟢 4x A100 SXM4 40GB77.96160622023411 (2.7x)42400 (2.7x)29017 (3.3x)
🟢 2x A100 SXM4 40GB38.9880311014311 (1.7x)23707 (1.5x)15512 (1.8x)
🟢 1x A100 SXM4 40GB19.494015558543 (84%)15917 (79%)8748 (43%)
🟢 4x Tesla V100 SXM2 32GB62.68128360013135 (2.9x)26527 (3.0x)22686 (3.1x)
🟢 2x Tesla V100 SXM2 32GB31.346418007953 (1.8x)15469 (1.7x)12932 (1.8x)
🟢 1x Tesla V100 SXM2 32GB15.67329004471 (76%)8947 (77%)7217 (62%)
🟢 3x K40m + 1x Titan Xp17.164811543117 (2.8x)5174 (2.8x)3127 (3.4x)
🟢 2x Tesla K40m8.58245771971 (1.7x)3300 (1.8x)1801 (2.0x)
🟢 1x Tesla K40m4.29122881131 (60%)1868 (50%)912 (24%)
🟢 1x Tesla K80 (2 GPU)8.22244802086 (2.3x)3448 (2.1x)2174 (2.3x)
🟢 1x Tesla K80 (1 GPU)4.1112240916 (58%)1642 (53%)943 (30%)
🟢 2x L40181.049617287137 (1.8x)13547 (1.7x)14164 (1.8x)
🟢 1x L4090.52488643870 (69%)7778 (69%)7945 (71%)
🟢 8x RTX A6000320.00384614419311 (4.4x)40063 (4.5x)39004 (4.6x)
🟢 4x RTX A6000160.00192307214314 (3.2x)27915 (3.2x)27227 (3.2x)
🟢 2x RTX A600080.009615368041 (1.8x)15026 (1.7x)14795 (1.7x)
🟢 1x RTX A600040.00487684421 (88%)8814 (88%)8533 (86%)
🟢 2x Quadro RTX 8000 Pa.29.869612484767 (1.8x)9607 (1.8x)10214 (1.8x)
🟢 1x Quadro RTX 8000 Pa.14.93486242591 (64%)5408 (67%)5607 (69%)
🟢 7x 2080 Ti + 1x A100 40GB107.6088492816146 (5.1x)33732 (5.0x)33857 (4.9x)
🟢 4x GeForce RTX 2080 Ti53.804424649117 (2.9x)18415 (2.7x)18598 (2.7x)
🟢 2x GeForce RTX 2080 Ti26.902212325085 (1.6x)10770 (1.6x)10922 (1.6x)
🟢 1x GeForce RTX 2080 Ti13.45116163194 (79%)6700 (84%)6853 (86%)
🔵 1x A770 + 🟢 1x Titan Xp24.302410954717 (1.7x)8380 (1.7x)8026 (1.6x)

FAQs

General

  • How can I learn to use FluidX3D?
    Follow the FluidX3D Documentation!

  • What physical model does FluidX3D use?
    FluidX3D implements the lattice Boltzmann method, a type of direct numerical simulation (DNS), the most accurate but also the most computationally demanding type of fluid simulation. Optional extension models include a volume force (Guo forcing), free surface (volume-of-fluid and PLIC), a temperature model, and the Smagorinsky-Lilly subgrid turbulence model.
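For reference, the core stream-and-collide update with SRT (BGK) collision, written in standard textbook LBM notation (not code copied from FluidX3D):

```math
f_i(\vec{x}+\vec{c}_i\,\Delta t,\; t+\Delta t) \;=\; f_i(\vec{x},t) \;-\; \frac{\Delta t}{\tau}\left(f_i(\vec{x},t)-f_i^{\mathrm{eq}}(\vec{x},t)\right)
```

with the equilibrium populations

```math
f_i^{\mathrm{eq}} \;=\; w_i\,\rho\left(1+\frac{\vec{c}_i\cdot\vec{u}}{c_s^2}+\frac{(\vec{c}_i\cdot\vec{u})^2}{2c_s^4}-\frac{\vec{u}\cdot\vec{u}}{2c_s^2}\right),
```

where the relaxation time τ sets the kinematic shear viscosity via ν = c_s²(τ/Δt − 1/2)Δt, and c_s = 1/√3 is the lattice speed of sound in lattice units.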

  • FluidX3D only uses FP32 or even FP32/FP16, in contrast to FP64. Are simulation results physically accurate?
    Yes, in all but extreme edge cases. The code has been specially optimized to minimize arithmetic round-off errors and make the most out of lower precision. With these optimizations, accuracy in most cases is indistinguishable from FP64 double-precision, even with FP32/FP16 mixed-precision. Details can be found in this paper.

  • Compared to the benchmark numbers stated here, efficiency seems much lower but performance is slightly better for most devices. How can this be?
    In that paper, the One-Step-Pull swap algorithm is implemented, using only misaligned reads and coalesced writes. On almost all GPUs, the performance penalty for misaligned writes is much larger than for misaligned reads, and sometimes there is almost no penalty for misaligned reads at all. Because of this, One-Step-Pull runs at peak bandwidth and thus peak efficiency.
    Here, a different swap algorithm termed Esoteric-Pull is used, a type of in-place streaming. This makes the LBM require much less memory (93 vs. 169 (FP32/FP32) or 55 vs. 93 (FP32/FP16) Bytes/cell for D3Q19), and also less memory bandwidth (153 vs. 171 (FP32/FP32) or 77 vs. 95 (FP32/FP16) Bytes/cell per time step for D3Q19) due to so-called implicit bounce-back boundaries. However, memory access is now half coalesced and half misaligned for both reads and writes, so memory access efficiency is lower. For overall performance, these two effects approximately cancel out. The benefit of Esoteric-Pull, being able to simulate domains twice as large with the same amount of memory, clearly outweighs the cost of slightly lower memory access efficiency, especially since overall performance is not reduced.
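The quoted Bytes/cell figures can be reproduced from the D3Q19 data layout (a sketch assuming 19 DDFs per copy plus 4 Bytes density, 12 Bytes velocity and 1 flag Byte per cell; `capacity_bytes` is an illustrative name, not a FluidX3D function):

```python
def capacity_bytes(ddf_bytes, ddf_copies):
    # 19 DDFs per copy + rho (4 B) + 3-component velocity (12 B) + flags (1 B)
    return 19 * ddf_bytes * ddf_copies + 4 + 12 + 1

assert capacity_bytes(4, 1) == 93    # Esoteric-Pull, FP32 DDFs (one copy)
assert capacity_bytes(2, 1) == 55    # Esoteric-Pull, FP16 DDFs (one copy)
assert capacity_bytes(4, 2) == 169   # One-Step-Pull, FP32 DDFs (two copies)
assert capacity_bytes(2, 2) == 93    # One-Step-Pull, FP16 DDFs (two copies)
```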

  • Why don't you use CUDA? Wouldn't that be more efficient?
    No, that is a myth. OpenCL is exactly as efficient as CUDA on Nvidia GPUs if optimized properly. Here I did a roofline model analysis of OpenCL performance on various hardware. OpenCL efficiency on modern Nvidia GPUs can reach 100% of theoretical peak memory bandwidth with the right memory access pattern, so CUDA can't possibly be any more efficient. Without a performance advantage, there is no reason to use proprietary CUDA over OpenCL, since OpenCL is compatible with a lot more hardware.

  • Why no multi-relaxation-time (MRT) collision operator?
    The idea of MRT is to linearly transform the DDFs into "moment space" by matrix multiplication and relax these moments individually, promising better stability and accuracy. In practice, in the vast majority of cases it has zero or even negative effect on stability and accuracy, and simple SRT is much superior. Apart from the kinematic shear viscosity and the conserved terms, the remaining moments are non-physical quantities and their tuning is a black box. Although MRT can be implemented efficiently with only a single matrix-vector multiplication in registers, leading to identical performance compared to SRT by remaining bandwidth-bound, storing the matrices vastly elongates and over-complicates the code for no real benefit.
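Schematically, the MRT collision described above reads (standard form from the literature, not FluidX3D code):

```math
f^{\mathrm{post}} \;=\; f \;-\; M^{-1}\,S\,M\left(f-f^{\mathrm{eq}}\right)
```

where the matrix M transforms the DDFs into moment space and S = diag(s₀, …, s₁₈) holds the per-moment relaxation rates. SRT is recovered as the special case S = (Δt/τ)·I, i.e. all moments relaxed at the same rate.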

Hardware

  • Can FluidX3D run on multiple GPUs at the same time?
    Yes. The simulation grid is then split into domains, one for each GPU (domain decomposition method). The GPUs essentially pool their memory, enabling much larger grid resolution and higher performance. Rendering is parallelized across multiple GPUs as well: each GPU renders its own domain with a 3D offset, then the rendered frames from all GPUs are overlaid using their z-buffers. Communication between domains is done over PCIe, so no SLI/Crossfire/NVLink/InfinityFabric is required. All GPUs must, however, be installed in the same node (PC/laptop/server). Even unholy combinations of Nvidia/AMD/Intel GPUs will work, although it is recommended to only use GPUs with similar memory capacity and bandwidth together. Using a fast gaming GPU and a slow integrated GPU together would only decrease performance due to communication overhead.
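A minimal sketch of the domain split, assuming a 1-cell-thick halo layer on every partitioned axis for inter-GPU communication (the actual FluidX3D memory layout may differ; `domain_size` is an illustrative name):

```python
def domain_size(n_global, n_domains):
    """Per-GPU domain extent along one axis: an equal slice of the global grid,
    plus one halo cell on each side if the axis is actually split."""
    halo = 2 if n_domains > 1 else 0
    return n_global // n_domains + halo

# 512^3 grid on 2x2x2 GPUs -> eight domains of (258, 258, 258) cells each
print(tuple(domain_size(512, 2) for _ in range(3)))
```

The halo cells hold copies of the neighboring domain's boundary layer, which is exchanged over PCIe after each streaming step.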

  • I'm on a budget and have only a cheap computer. Can I run FluidX3D on my toaster PC/laptop?
    Absolutely. Today even the most inexpensive hardware, like integrated GPUs or entry-level gaming GPUs, supports OpenCL. You might be a bit more limited on memory capacity and grid resolution, but you should be good to go. I've tested FluidX3D on very old and inexpensive hardware and even on my Samsung S9+ smartphone, and it runs just fine, although admittedly a bit slower.

  • I don't have an expensive workstation GPU, but only a gaming GPU. Will performance suffer?
    No. Efficiency on gaming GPUs is exactly as good as on their "professional"/workstation counterparts. Performance is often even better, as gaming GPUs have higher boost clocks.

  • Do I need a GPU with ECC memory?
    No. Gaming GPUs work just fine. Some Nvidia GPUs automatically reduce memory clocks for compute applications to almost entirely eliminate memory errors.

  • My GPU does not support CUDA. Can I still use FluidX3D?
    Yes. FluidX3D uses OpenCL 1.2 and not CUDA, so it runs on any GPU from any vendor since around 2012.

  • I don't have a dedicated graphics card at all. Can I still run FluidX3D on my PC/laptop?
    Yes. FluidX3D also runs on all integrated GPUs since around 2012, and also on CPUs.

  • I need more memory than my GPU can offer. Can I run FluidX3D on my CPU as well?
    Yes. You only need to install the Intel OpenCL CPU Runtime.

  • In the benchmarks you list some very expensive hardware. How do you get access to that?
    As a PhD candidate in computational physics, I used FluidX3D for my research, so I had access to BZHPC, SuperMUC-NG and JSC JURECA-DC supercomputers.

Graphics

  • I don't have an RTX/DXR GPU that supports raytracing. Can I still use raytracing graphics in FluidX3D?
    Yes, and at full performance. FluidX3D does not use a bounding volume hierarchy (BVH) to accelerate raytracing, but fast ray-grid traversal instead, implemented directly in OpenCL C. This is much faster than BVH for moving isosurfaces in the LBM grid (~N vs. ~N²+log(N) runtime; LBM itself is ~N³), and it does not require any dedicated raytracing hardware. Raytracing in FluidX3D runs on any GPU that supports OpenCL 1.2.
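The ray-grid traversal idea can be sketched on the CPU. Below is a generic Amanatides-Woo style DDA over an N³ voxel grid, not FluidX3D's actual OpenCL C kernel; function and parameter names are made up for illustration. The key property is that no acceleration structure exists, so a moving isosurface needs no BVH rebuild:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One traversal result: whether a solid voxel was found and its grid coordinates
struct Hit { bool hit; int x, y, z; };

// Amanatides-Woo style ray-grid traversal (DDA): march the ray voxel by voxel
// through an N^3 grid (1 = solid) until a solid voxel is hit or the ray exits
Hit traverse(const std::vector<int>& grid, const int N,
             const float ox, const float oy, const float oz,   // ray origin (grid coordinates)
             const float dx, const float dy, const float dz) { // ray direction
    int x=(int)ox, y=(int)oy, z=(int)oz; // current voxel
    const int sx=dx>0.0f?1:-1, sy=dy>0.0f?1:-1, sz=dz>0.0f?1:-1; // step direction per axis
    const float tdx = std::fabs(dx)>1e-8f ? std::fabs(1.0f/dx) : 1e30f; // ray length per voxel step
    const float tdy = std::fabs(dy)>1e-8f ? std::fabs(1.0f/dy) : 1e30f;
    const float tdz = std::fabs(dz)>1e-8f ? std::fabs(1.0f/dz) : 1e30f;
    float tmx = (sx>0 ? (x+1-ox) : (ox-x))*tdx; // ray length to the next x-boundary
    float tmy = (sy>0 ? (y+1-oy) : (oy-y))*tdy;
    float tmz = (sz>0 ? (z+1-oz) : (oz-z))*tdz;
    while(x>=0&&x<N && y>=0&&y<N && z>=0&&z<N) {
        if(grid[(size_t)(z*N+y)*N+x]) return {true, x, y, z}; // solid voxel hit
        if(tmx<tmy && tmx<tmz) { x+=sx; tmx+=tdx; } // step into the voxel with the
        else if(tmy<tmz)       { y+=sy; tmy+=tdy; } // nearest boundary crossing
        else                   { z+=sz; tmz+=tdz; }
    }
    return {false, -1, -1, -1}; // ray left the grid without hitting anything
}
```

Each ray visits at most O(N) voxels along its path, which is why this approach scales so well against the O(N³) cost of the LBM time step itself.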

  • I have a datacenter/mining GPU without any video output or graphics hardware. Can FluidX3D still render simulation results?
    Yes. FluidX3D does all rendering (rasterization and raytracing) in OpenCL C, so no display output and no graphics features like OpenGL/Vulkan/DirectX are required. Rendering is just another form of compute after all. Rendered frames are passed to the CPU over PCIe and then the CPU can either draw them on screen through dedicated/integrated graphics or write them to the hard drive.

  • I'm running FluidX3D on a remote (super-)computer and only have an SSH terminal. Can I still use graphics somehow?
    Yes, either directly as interactive ASCII graphics in the terminal or by storing rendered frames on the hard drive and then copying them over via `scp -r user@server.url:"~/path/to/images/folder" .`.

Licensing

  • I want to learn about programming/software/physics/engineering. Can I use FluidX3D for free?
    Yes. Anyone can use FluidX3D for free for public research, education or personal use. Use by scientists, students and hobbyists is free of charge and well encouraged.

  • I am a scientist/teacher with a paid position at a public institution. Can I use FluidX3D for my research/teaching?
    Yes, you can use FluidX3D free of charge. This is considered research/education, not commercial use. To give credit, the references listed below should be cited. If you publish data/results generated by altered source versions, the altered source code must be published as well.

  • I work at a company in CFD/consulting/R&D or related fields. Can I use FluidX3D commercially?
    No. Commercial use is not allowed with the current license.

  • Is FluidX3D open-source?
    No. "Open-source" as a technical term is defined as freely available without any restriction on use, and I am not comfortable with that. I have written FluidX3D in my spare time and no one should milk it for profits while I remain uncompensated, especially considering what other CFD software sells for. The technical term for the type of license I chose is "source-available no-cost non-commercial". The source code is freely available, and you are free to use, to alter and to redistribute it, as long as you do not sell it or make a profit from derived products/services, and as long as you do not use it for any military purposes (see the license for details).

  • Will FluidX3D at some point be available with a commercial license?
    Maybe I will add the option for a second, commercial license later on. If you are interested in commercial use, let me know. For non-commercial use in science and education, FluidX3D is and will always be free.

External Code/Libraries/Images used in FluidX3D

References

Contact

Support

I'm developing FluidX3D in my spare time, to make computational fluid dynamics lightning fast, accessible on all hardware, and free for everyone.

  • You can support FluidX3D by reporting any bugs or things that don't work in the issues. I'm welcoming feedback!
  • If you like FluidX3D, share it with friends and colleagues. Spread the word that CFD is now lightning fast, accessible and free.
  • If you want to support FluidX3D financially, you cansponsor me on GitHub orbuy me a coffee. Thank you!
