Installation#
Install nvmath-python#
nvmath-python, like most modern Python packages, provides pre-built binaries (wheels and conda packages) to end users. The full source code is hosted in the NVIDIA/nvmath-python repository.
In terms of CUDA Toolkit (CTK) choices, nvmath-python is designed and implemented to allow building and running against 1. pip wheel, 2. conda, or 3. system installations of CTK. Having a full CTK installation at either build time or run time is not necessary; only a small subset, as explained below, is enough.
Host & device APIs (see Overview) have different run-time dependencies and requirements. Even among host APIs the needed underlying libraries differ (for example, fft() on GPUs only needs cuFFT and not cuBLAS). Libraries are loaded only when needed. Therefore, nvmath-python is designed to have most of its dependencies optional, but provides convenient installation commands so users can quickly spin up a working Python environment.
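As an illustration of this optional-dependency design, the stdlib-only sketch below probes which optional backends are importable without actually importing (and therefore loading) any of them. The package list is illustrative, not exhaustive:

```python
import importlib.util

# Illustrative (not exhaustive) optional dependencies of nvmath-python.
OPTIONAL_DEPS = ("cupy", "numba", "torch", "mpi4py")

def available_optional_deps(names=OPTIONAL_DEPS):
    """Return the subset of `names` importable in this environment,
    without importing (and therefore loading) any of them."""
    return [n for n in names if importlib.util.find_spec(n) is not None]

print(available_optional_deps())
```

nvmath-python applies the same idea internally: a missing optional backend only matters if you call an API that needs it.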
The cheatsheet below captures nvmath-python's required and optional build-time and run-time dependencies. Using the installation commands from the sections below should cover most needs.
Install from PyPI#
The pre-built wheels can be pip-installed from the public PyPI. There are several optional dependencies expressible in the standard "extras" bracket notation. The following assumes that CTK components are also installed via pip (so no extra step is needed from users; the dependencies are pulled in via extras).
Important
Using at least one of the pip extras described below is required for all pip installs to ensure that nvmath-python's dependencies are correctly constrained by pip.
`pip install nvmath-python` (no extras) is a bare installation (very lightweight) for system admins or expert users. It requires the user to manage all dependencies.
| Command | Description |
|---|---|
| `pip install nvmath-python[cu12]` | Install nvmath-python along with all CUDA 12 optional dependencies (wheels for cuBLAS/cuFFT/… and CuPy) to support nvmath host APIs. |
| `pip install nvmath-python[cu12-dx]` | Install nvmath-python along with all CUDA 12 optional dependencies (wheels for cuBLAS/cuFFT/…, CuPy, Numba, …) to support nvmath host & device APIs (which only support CUDA 12)[8]. |
| `pip install nvmath-python[cpu]` | Install nvmath-python along with all CPU optional dependencies (wheels for NVPL or MKL) to support optimized CPU FFT APIs.[1] |
| … | Install nvmath-python along with all MGMN optional dependencies (wheels for mpi4py, NVSHMEM, cuFFTMp, …) to support multi-GPU multi-node APIs. Note: users must provide an MPI implementation. |
| … | Install nvmath-python along with all CUDA 12 optional dependencies to support nvmath.device APIs and a PyTorch built with CTK 12.8. Note: PyTorch has strict pins for some CUDA components, and builds of PyTorch lag behind the latest CUDA component releases. We must therefore explicitly require that the nvcc and nvrtc components installed for the device extensions match; otherwise, pip will create a mismatched environment with CTK components from different releases. Verbose installation commands such as the one here are necessary because of limitations of the current wheel format and Python package index. Please see the PyTorch installation instructions for releases built with other CTK versions. |
The options below are for adventurous users who want to manage most of the dependencies themselves. The following assumes that the system CTK is installed.
| Command | Description |
|---|---|
| … | Install nvmath-python along with CuPy for CUDA 12 to support nvmath host APIs. |
| … | Install nvmath-python along with CuPy for CUDA 12 to support nvmath host & device APIs. |
| … | Install nvmath-python and mpi4py (with no MGMN optional dependencies) to support multi-GPU multi-node APIs. Note: users must provide an MPI implementation and the required MGMN libraries and their dependencies (NVSHMEM, cuFFTMp, …). |
Hint
To install extras for CUDA 13, use `13` in the extra names instead of `12`.
Install from conda#
Conda packages can be installed from the conda-forge channel.
| Command | Description |
|---|---|
| … | Install nvmath-python along with all CUDA 12 optional dependencies (packages for cuBLAS/cuFFT/… and CuPy) to support nvmath host APIs. |
| … | Install nvmath-python along with all CUDA 12 optional dependencies (packages for cuBLAS/cuFFT/…, CuPy, Numba, …) to support nvmath host & device APIs (which only support CUDA 12). |
| … | Install nvmath-python along with all CPU optional dependencies (NVPL or other) to support optimized CPU FFT APIs.[1] |
| … | Install nvmath-python along with all MGMN optional dependencies (packages for mpi4py, NVSHMEM, cuFFTMp, MPI, …) to support multi-GPU multi-node APIs. Note: conda-forge provides a pass-through MPI package variant that can be used to rely on a system-installed MPI instead of the conda-forge-provided MPI implementations. |
Notes:
- For expert users, `conda install -c conda-forge nvmath-python=*=core*` is a bare, minimal installation (very lightweight). It allows fully explicit control of all dependencies.
- If you installed conda from miniforge, the conda-forge channel is most likely already set as the default, and the `-c conda-forge` part of the instruction above can be omitted.
Build from source#
Once you clone the repository and go into the root directory, you can build the project from source. There are several ways to build it, since we need some CUDA headers at build time.
| Command | Description |
|---|---|
| `pip install .` | Set up build isolation (as per PEP 517), install CUDA wheels and other build-time dependencies into the build environment, build the project, and install it into the current user environment together with the run-time dependencies. Note: in this case the CUDA headers come from pip wheels installed into the isolated build environment. |
| `pip install --no-build-isolation .` | Skip creating build isolation (it would use CUDA headers from the current environment, e.g. a system CTK; see `CUDA_HOME` below). |
Notes:
- If you add the "extras" notation after the dot `.` (for example `.[cpu]`, `.[cu12-dx]`, …), it has the same meaning as explained in the previous section.
- If you don't want the run-time dependencies to be automatically handled, add `--no-deps` after the `pip install` command above; in this case, however, it's your responsibility to make sure that all the run-time requirements are met.
- By replacing `install` with `wheel`, a wheel can be built targeting the current OS and CPython version.
- If you want an in-place/editable install, add the `-e` flag to the command above (before the dot `.`). This is suitable for local development with a system-installed CTK. However, our wheels rely on non-editable builds so that the RPATH hack can kick in. DO NOT pass the `-e` flag when building wheels!
- All optional run-time dependencies as listed below need to be manually installed.
Cheatsheet#
Below we provide a summary of the requirements to support all nvmath-python functionalities. A dependency is required unless stated otherwise.
| | When Building | When Running - host APIs | When Running - device APIs | When Running - host API callbacks | When Running - distributed APIs |
|---|---|---|---|---|---|
| CPU architecture & OS | linux-64, linux-aarch64, win-64 | linux-64, linux-aarch64, win-64 | linux-64, linux-aarch64[1] | linux-64, linux-aarch64 | linux-64, linux-aarch64 |
| GPU hardware | | All hardware supported by the underlying CUDA Toolkit[5]. Optional: needed if the execution space is GPU. | Compute Capability 7.0+ (Volta and above) | Compute Capability 7.0+ (Volta and above) | Data Center GPU with Compute Capability 7.0+ (Volta and above). GPU connectivity: P2P or GPUDirect RDMA over IB requirements |
| CUDA driver[2] | | 525.60.13+ (Linux) / 527.41+ (Windows) with CUDA >=12.0; 580+ with CUDA >=13.0. Optional: needed if the execution space is GPU or for loading any CUDA library. | 525.60.13+ (Linux) with CUDA >=12.0 | 525.60.13+ (Linux) with CUDA >=12.0 | 525.60.13+ (Linux) with CUDA >=12.0 |
| Python | 3.10-3.13 | 3.10-3.13 | 3.10-3.13 | 3.10-3.13 | 3.10-3.13 |
| pip | 22.3.1+ | | | | |
| setuptools | >=77.0.3 | | | | |
| wheel | >=0.34.0 | | | | |
| Cython | >=3.0.4,<3.1 | | | | |
| CUDA | CUDA >=12.0 (only need headers from NVCC & CUDART[6]) | CUDA >=12.0. Optional: depending on the math operations in use | CUDA >=12.0 | CUDA >=12.0 | CUDA >=12.0 |
| cuda-pathfinder | | >=1.2.1 | >=1.2.1 | >=1.2.1 | >=1.2.1 |
| cuda-core | | >=0.3.2 | >=0.3.2 | >=0.3.2 | >=0.3.2 |
| NumPy | | >=1.25 | >=1.25 | >=1.25 | >=1.25 |
| CuPy | | >=12.1[4] | >=12.1[4] | >=12.1[4] | |
| PyTorch | | >=1.12 (optional)[10] | >=1.12 (optional) | >=1.12 (optional) | |
| libmathdx (cuBLASDx, cuFFTDx, …) | | | >=0.2.3,<0.3 | | |
| numba-cuda | | | >=0.18.1 | >=0.18.1 | |
| Math Kernel Library (MKL) | | >=2024 (optional) | | | |
| NVIDIA Performance Libraries (NVPL) | | 24.7 (optional) | | | |
Test Configuration#
nvmath-python is tested in the following environments:
| | |
|---|---|
| CUDA | 12.0, 12.9, 13.0 |
| Driver | R525, R575, R580 |
| GPU model | H100, B200, RTX 4090, CG1 (Grace-Hopper) |
| Python | 3.10, 3.11, 3.12, 3.13 |
| CPU architecture | x86_64, aarch64 |
| Operating system | Ubuntu 22.04, Ubuntu 20.04, RHEL 9, Windows 11 |
Run nvmath-python#
As mentioned earlier, nvmath-python can be run with all methods of CUDA installation, including wheels, conda packages, and the system CTK. As a result, there is detection logic to discover shared libraries (for host APIs) and headers (for device APIs, which do JIT compilation).
Shared libraries#
- pip wheels: will be auto-discovered if installed
- conda packages: will be auto-discovered if installed, after wheels
- system CTK: on Linux, users need to ensure the shared libraries are discoverable by the dynamic linker, for example by setting `LD_LIBRARY_PATH` or updating the system search paths to include the DSO locations
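For the system-CTK case, a quick stdlib-only check of whether the dynamic linker can resolve a given library might look like the sketch below (the library names used are just examples; this is not part of nvmath-python's own detection logic):

```python
import ctypes.util

def linker_can_find(libname):
    """Return the resolved soname (e.g. 'libcufft.so.11') if the dynamic
    linker can locate `libname`, else None."""
    return ctypes.util.find_library(libname)

for lib in ("cufft", "cublas"):
    print(lib, "->", linker_can_find(lib) or "not found on the linker search path")
```

If a library resolves here but nvmath-python still cannot load it, double-check that the resolved version matches your CTK major version.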
Headers#
This includes libraries such as CCCL and MathDx.
- pip wheels: will be auto-discovered if installed
- conda packages: will be auto-discovered if installed, after wheels
- system CTK: need to set `CUDA_HOME` (or `CUDA_PATH`) and `MATHDX_HOME` (for MathDx headers)
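The system-CTK header lookup can be pictured with a simplified stdlib-only sketch; nvmath-python's actual detection logic is more involved, so treat this only as an illustration of what `CUDA_HOME`/`CUDA_PATH` is expected to point at:

```python
import os

def cuda_include_dir():
    """Return '<toolkit root>/include' if CUDA_HOME or CUDA_PATH points at a
    directory that has an include/ subdirectory, else None."""
    root = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if not root:
        return None
    inc = os.path.join(root, "include")
    return inc if os.path.isdir(inc) else None

print(cuda_include_dir() or "CUDA_HOME / CUDA_PATH not set, or toolkit root has no include/")
```

`MATHDX_HOME` works analogously for the MathDx headers.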
Host APIs#
This terminology is explained in the Host APIs section.
Examples#
See the examples directory in the repo. Currently we have:

- `examples/fft`
- `examples/linalg`
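A minimal host-API usage sketch in the spirit of those examples (hypothetical; it assumes nvmath-python plus NumPy and a suitable FFT backend are installed per the tables above, and reports the missing module otherwise):

```python
def run_fft_demo():
    """Run a tiny 1-D FFT through the function-form host API, or report
    which optional dependency is missing."""
    try:
        import numpy as np
        import nvmath
    except ImportError as exc:
        return f"optional dependency missing: {exc.name}"
    a = np.ones(64, dtype=np.complex64)
    b = nvmath.fft.fft(a)  # execution space follows the operand/installed backend
    return f"fft output shape: {b.shape}"

print(run_fft_demo())
```

The guarded import mirrors the optional-dependency design described earlier: nothing fails until an API that needs the dependency is actually exercised.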
Tests#
The `pyproject.toml` file lists the dependencies required for pip-controlled environments to run tests. These requirements are installed via the `dev` dependency group, e.g. `pip install --group dev`.
Running functionality tests#
    pytest tests/example_tests tests/nvmath_tests/fft tests/nvmath_tests/linalg
Running performance tests#
This will currently run two tests for FFT and one test for linalg:
    pytest -v -s -k 'perf' tests/nvmath_tests/fft/
    pytest -v -s -k 'perf' tests/nvmath_tests/linalg/
Device APIs#
This terminology is explained in the Device APIs section.
Examples#
See the examples/device directory in the repo.
Tests#
Running functionality tests#
    pytest tests/nvmath_tests/device examples/device
Running performance tests#
    pytest -v -s -k 'perf' tests/nvmath_tests/device/
Troubleshooting#
For pip users, there are known limitations in Python packaging tools (many of which are nicely captured in the pypackaging-native community project). For a complex library such as nvmath-python, which interacts with many native libraries, there are user-visible caveats.
- Be sure that there are no packages with both the `-cu12` suffix (for CUDA 12) and the suffix-less naming (for CUDA 13) coexisting in your Python environment. For example, this is a corrupted environment:

      $ pip list
      Package             Version
      ------------------  ---------
      nvidia-cublas-cu12  12.5.2.13
      nvidia-cublas       13.0.2.14
      pip                 24.0
      setuptools          70.0.0
      wheel               0.43.0

  Sometimes such conflicts come from a dependency of the libraries that you use, so pay extra attention to what's installed.

- pip does not attempt to check whether the installed packages can actually run against the installed GPU driver (the CUDA GPU driver cannot be installed by pip), so make sure your GPU driver is new enough to support the installed `-cuXX` packages[2]. The driver version can be checked by executing `nvidia-smi` and inspecting the `Driver Version` field in the output table.

- CuPy installed from pip currently (as of v13.3.0) only supports conda and system CTK, not pip-installed CUDA wheels. nvmath-python can help CuPy use the CUDA libraries installed to `site-packages` (where wheels are installed) if `nvmath` is imported. From beta 2 (v0.2.0) onwards, the libraries are "soft-loaded" (no error is raised if a library is not installed) when `import nvmath` happens. This behavior may change in a future release.

- Numba installed from pip currently (as of v0.60.0) only supports conda and system CTK, not pip-installed CUDA wheels. nvmath-python can also help Numba use the CUDA compilers installed to `site-packages` if `nvmath` is imported. As above, this behavior may change in a future release.

- PyTorch installed from pip pins some CUDA wheel packages to v12.6 (or v12.8, depending on the installation method). However, nvmath-python does not pin CUDA wheel packages, so they will float up to the latest version. This can cause a mismatch between compiler components when using the `dx` extra. In this case, it's recommended to manually constrain the `cuda-cccl`, `cuda-nvcc`, `cuda-nvrtc`, and `cuda-runtime` packages to match the variant of PyTorch installed.
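The driver check mentioned above can be automated with a small stdlib-only sketch that falls back gracefully on machines without an NVIDIA driver (the `nvidia-smi` query flags used here are standard):

```python
import shutil
import subprocess

def driver_version():
    """Return the GPU driver version string, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    lines = out.stdout.strip().splitlines()
    return lines[0] if lines else None

print(driver_version() or "no NVIDIA driver detected")
```

Compare the reported version against the driver requirements in the cheatsheet above.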
In general, mixing and matching CTK packages from pip, conda, and the system is possible but can be very fragile, so it's important to understand what you're doing. The nvmath-python internals are designed to work with everything installed either via pip, conda, or the local system (the system CTK, including tarball extractions, is the fallback solution in the detection logic), but mix-and-match makes the detection logic impossible to get right.
To help you perform an integrity check, the rule of thumb is that every single package should come from only one place (either pip, conda, or the local system). For example, if both `nvidia-cufft-cu12` (which is from pip) and `libcufft` (from conda) appear in the output of `conda list`, something is almost certainly wrong. Below is the package name mapping between pip and conda, with `XX` denoting CUDA's major version (for example, `XX=12`):
| pip (CUDA 12) | pip (CUDA 13) | conda |
|---|---|---|
| `nvidia-cublas-cu12` | `nvidia-cublas` | `libcublas` |
| `nvidia-cuda-cccl-cu12` | `nvidia-cuda-cccl` | `cuda-cccl` |
| `nvidia-cuda-nvcc-cu12` | `nvidia-cuda-nvcc` | `cuda-nvcc` |
| `nvidia-cuda-nvrtc-cu12` | `nvidia-cuda-nvrtc` | `cuda-nvrtc` |
| `nvidia-cuda-runtime-cu12` | `nvidia-cuda-runtime` | `cuda-cudart` |
| `nvidia-cufft-cu12` | `nvidia-cufft` | `libcufft` |
| `nvidia-curand-cu12` | `nvidia-curand` | `libcurand` |
| `nvidia-cusolver-cu12` | `nvidia-cusolver` | `libcusolver` |
| `nvidia-cusparse-cu12` | `nvidia-cusparse` | `libcusparse` |
| `nvidia-nvjitlink-cu12` | `nvidia-nvjitlink` | `libnvjitlink` |
Note that system packages (by design) do not show up in the output of `conda list` or `pip list`. Linux users should check the installation list from their distro package manager (`apt`, `yum`, `dnf`, …). See also the Linux Package Manager Installation Guide for additional information.
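As a pip-side complement to this rule of thumb, the stdlib-only sketch below flags one common corruption pattern: a `-cu12`-suffixed wheel coexisting with its suffix-less counterpart, as in the `pip list` example earlier. It only inspects pip metadata and does not cover conda or system packages:

```python
from importlib import metadata

def mixed_cuda_wheels():
    """Return (suffixed, suffix-less) pairs of NVIDIA wheels that coexist
    in this environment, e.g. [('nvidia-cublas-cu12', 'nvidia-cublas')]."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.add(name.lower())
    suffix = "-cu12"
    return sorted(
        (n, n[: -len(suffix)])
        for n in names
        if n.startswith("nvidia-") and n.endswith(suffix) and n[: -len(suffix)] in names
    )

print(mixed_cuda_wheels() or "no -cu12 / suffix-less conflicts detected")
```

An empty result does not prove the environment is healthy (conda/system overlap is not checked), but a non-empty result is a strong signal that something needs cleaning up.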
For more information regarding the new CUDA 12+ package layout on conda-forge, see the CUDA recipe README.
Footnotes
[1] (1,2,3) Windows support will be added in a future release.

[2] (1,2) nvmath-python relies on CUDA minor version compatibility.

[4] (1,2,3) As of Beta 7.0 (v0.7.0), CuPy is an optional run-time dependency. It is not included in the extras/meta-packages, and must be installed separately if desired.

[5] For example, Blackwell GPUs are supported starting with CUDA 12.8, so they would not work with libraries from CUDA 12.6 or below (there is no CUDA 12.7).

[6] While we need some CUDA headers at build time, there is no limitation on the CUDA version seen at build time.

[7] These versions are not supported due to a known compiler bug; the `[dx]` extras already take care of this.

[8] If CCCL is installed via pip manually, it needs to be constrained with `"nvidia-cuda-cccl-cu12>=12.4.127"` due to a packaging issue; the `[dx]` extras already take care of this.

[9] The library must ship FFTW3 symbols for single- and double-precision transforms in a single `.so` file.

[10] To use `matmul` with FP8 or MXFP8, you need a PyTorch version built with CUDA 12.8 (>=2.7.0 or a nightly version).