Overview#
The primary goal of nvmath-python is to bring the power of the NVIDIA math libraries to the Python ecosystem. The package aims to provide intuitive Pythonic APIs giving users full access to all the features offered by our libraries in a variety of execution spaces.
We hope to empower a wide range of Python users by providing easy access to high-performance core math operations such as FFT, dense and sparse linear algebra, and more. This includes the following groups of users:
Practitioners: Researchers and application programmers who require robust, high-performance mathematical tools.
Library Package Developers: Developers crafting libraries that rely on advanced mathematical operations.
CUDA Kernel Authors: Programmers who write CUDA kernels and need customized mathematical functionality.
The APIs provided by nvmath-python can be categorized into:
Host APIs: Invoked from the host and executed in the chosen space. While all host APIs support the GPU execution space, select APIs also support CPU and distributed (multi-node multi-GPU) execution spaces.
Device APIs: Called directly from within CUDA kernels.
The nvmath-python library is dedicated to delivering the following key features and commitments:
Interoperability with array and tensor libraries: Instead of providing a native array or tensor data structure, nvmath-python provides seamless interoperability with widely-used array libraries such as NumPy, CuPy, and PyTorch, through APIs compatible with their data representations. nvmath-python should not be regarded as a replacement, but rather as a complementary tool to these libraries.
Logical Feature Parity: While the Pythonic API surface (the number of APIs and the complexity of each) is more concise compared to that of the C libraries, it provides access to their complete functionality.
Consistent Design Patterns: Uniform design across all modules to simplify the user experience.
Transparency and Explicitness: Avoiding implicit, costly operations such as copying data within the same memory space, automatic type promotion, and alterations to the user environment or state (current device, current stream, etc.). This allows users to perform the required conversion once for use in all subsequent operations instead of incurring hidden costs on each call.
Clear, Actionable Error Messages: Ensuring that errors are informative and helpful in resolving the problem.
DRY Principle Compliance: Automatically utilizing available information such as the current stream and memory pool to avoid redundant specification (“don’t repeat yourself”).
With nvmath-python, a few lines of code are sufficient to unlock the extensive performance capabilities of the NVIDIA math libraries. Explore our sample Python codes and more detailed examples in the examples directory on GitHub.
Architecture#
nvmath-python is designed to support integration at any level desired by the user. This flexibility allows:
Alice, a Python package developer, to compose core math operations into higher-level algorithms or adapt these operations into her preferred interfaces.
Bob, an application developer, to use core operations directly from nvmath-python or indirectly through other libraries that leverage nvmath-python.
Carol, a researcher, to write kernels entirely in Python that call core math operations such as FFT.

Additionally, we offer Python bindings that provide a 1:1 mapping with the C APIs. These bindings, which serve as wrappers with API signatures similar to their C counterparts, are ideal for library developers looking to integrate the capabilities of the NVIDIA Math Libraries in a customized manner, in the event that the Pythonic APIs don’t meet their specific requirements. Conversely, our high-level Pythonic APIs deliver a fully integrated solution suitable for native Python users as well as library developers, encompassing both host and device APIs. Select host APIs accept callback functions written in Python, which are compiled into supported formats such as LTO-IR, using compilers like Numba.
Host APIs#
nvmath-python provides a collection of APIs that can be directly invoked from the CPU (host). At present, these APIs encompass a selection of functionalities within the following categories:
Fast Fourier Transform in nvmath.fft. Refer to Fast Fourier Transform for details.
Dense Linear Algebra in nvmath.linalg. Refer to Linear Algebra for details.
Sparse Linear Algebra in nvmath.sparse. Refer to Sparse Linear Algebra for details.
Tensor Algebra in nvmath.tensor. Refer to Tensor Operations for details.
Effortless Interoperability#
All host APIs support input arrays/tensors from NumPy, CuPy, and PyTorch while returning output operands using the same package, thus offering effortless interoperability with these frameworks. One example of this interoperability is shown below:
import numpy as np
import nvmath

# Create a numpy.ndarray as input
a = np.random.random(128) + 1.j * np.random.random(128)

# Call nvmath-python Pythonic APIs
b = nvmath.fft.fft(a)

# Verify that output is also a numpy.ndarray
assert isinstance(b, np.ndarray)
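The same call with a CuPy array residing on the GPU returns a CuPy array; the sketch below illustrates this "same package in, same package out" behavior:

import cupy as cp
import nvmath

# Create a cupy.ndarray as input (resides on the GPU)
a = cp.random.random(128) + 1.j * cp.random.random(128)

# The output operand is returned using the same package as the input
b = nvmath.fft.fft(a)
assert isinstance(b, cp.ndarray)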
Stateless and Stateful APIs#
The host APIs within nvmath-python can be generally categorized into two types: stateless function-form APIs and stateful class-form APIs.
The function-form APIs, such as nvmath.fft.fft() and nvmath.linalg.advanced.matmul(), are designed to deliver quick, end-to-end results with a single function call. These APIs are ideal for instances where a user needs to perform a single computation without the need for intermediate steps, customization of algorithm selection, or cost amortization of preparatory steps. Conversely, the stateful class-form APIs, like nvmath.fft.FFT and nvmath.linalg.advanced.Matmul, offer a more comprehensive and flexible approach. They not only encompass the functionality found in their function-form counterparts but also allow for amortization of one-time costs, potentially enhancing performance significantly.
The design pattern for all stateful APIs in nvmath-python consists of several key phases:
Problem Specification: This initial phase involves defining the operation and setting options that affect its execution. It’s designed to be as lightweight as possible, ensuring the problem is well-defined and supported by the current implementation.
Preparation: Using FFT as an example, this phase includes a planning step to select the optimal algorithm for the defined FFT operation. An optional autotuning operation, when available, also falls within the preparation phase. The preparation phase is generally the most resource-intensive and may incorporate user-specified planning and autotuning options.
Execution: This phase allows for repeated execution, where the operand can be either modified in-place or explicitly reset using the reset_operand/reset_operands method. The costs associated with the first two phases are therefore amortized over these multiple executions.
Resource Release: Users are advised to use stateful objects from within a context using the with statement, which automatically handles the release of internal resources upon exit. If the object is not used as a context manager using with, it is necessary to explicitly call the free method to ensure all resources are properly released.
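A minimal sketch of these phases, assuming a CuPy input array and the stateful nvmath.fft.FFT class:

import cupy as cp
import nvmath

a = cp.random.rand(256, 1024) + 1j * cp.random.rand(256, 1024)

# Problem specification; the context manager also handles resource release on exit.
with nvmath.fft.FFT(a, axes=[-1]) as f:
    # Preparation: plan once, optionally with planning/autotuning options.
    f.plan()
    # Execution: run the planned FFT; repeated executions amortize the planning cost.
    r = f.execute()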
Note
By design, nvmath-python does NOT cache plans with stateless function-form APIs. This is to enable library developers and others to use their own caching mechanisms with nvmath-python. Therefore, users should use the stateful object APIs for repeated use as well as for benchmarking to avoid incurring repeated preparatory costs, or use a cached API (see caching.py for an example implementation).
Note
The decision to require explicit free calls for resource release is driven by the fact that Python’s garbage collector may delay freeing object resources when the object goes out of scope or its reference count drops to zero. For details, refer to the Python documentation for the __del__ method.
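When the with statement is not used, a try/finally pattern is one way to guarantee the explicit call to free; a hedged sketch using the stateful FFT class:

import cupy as cp
import nvmath

a = cp.random.rand(256, 1024) + 1j * cp.random.rand(256, 1024)

f = nvmath.fft.FFT(a, axes=[-1])
try:
    f.plan()
    r = f.execute()
finally:
    # Explicitly release internal resources since no context manager is used.
    f.free()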
Generic and Specialized APIs#
Another way of categorizing the host APIs within nvmath-python is by splitting them into generic and specialized APIs, based on their flexibility and the scope of their functionality:
Generic APIs are designed to accommodate a broad range of operands, and customization with these APIs is confined to options that are universally applicable across all supported operand types. For instance, the generic matrix multiplication API can handle structured matrices (such as triangular and banded, in full or packed form) in addition to dense full matrices, but the available options are limited to those applicable to all these matrix types.
Specialized APIs, on the other hand, are tailored for specific types of operands, allowing for the full customization available for that kind of operand. A prime example is the specialized matrix multiplication API for dense matrices, which provides numerous options specifically suited to dense matrices.
It should be noted that the notion of generic and specialized APIs is orthogonal to the notion of stateful versus stateless APIs. Currently, nvmath-python offers the specialized interface for dense matrix multiplication, in stateful and stateless forms.
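For instance, the specialized dense matrix multiplication interface can be used in either form; a brief sketch using default options:

import cupy as cp
import nvmath

a = cp.random.rand(1024, 512, dtype=cp.float32)
b = cp.random.rand(512, 256, dtype=cp.float32)

# Stateless, function-form specialized API.
c = nvmath.linalg.advanced.matmul(a, b)

# Stateful, class-form specialized API, amortizing planning over repeated executions.
with nvmath.linalg.advanced.Matmul(a, b) as mm:
    mm.plan()
    c2 = mm.execute()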
Full Logging Support#
nvmath-python provides integration with the Python standard library logger from the logging module to offer full logging of the computational details at various levels, for example debug, information, warning, and error. An example illustrating the use of the global Python logger is shown below:
import logging
import nvmath

# Turn on logging with level set to "debug" and use a custom format for the log
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)-8s %(message)s', datefmt='%m-%d %H:%M:%S')

# Call nvmath-python Pythonic APIs
out = nvmath.linalg.advanced.matmul(...)
Alternatively, for APIs that accept the options argument, users can set a custom logger by directly passing it inside a dictionary or as part of the corresponding Options object, for example nvmath.fft.FFTOptions.logger for nvmath.fft.fft() and nvmath.fft.FFT. An example based on FFT is shown below:
import logging

# Create a custom logger
logger = logging.getLogger('userlogger')
...

# Call nvmath-python Pythonic APIs
out = nvmath.fft.fft(..., options={'logger': logger})
For the complete examples, refer to the global logging example and the custom user logging example.
Note
The Python logging is orthogonal to the logging provided by certain NVIDIA math libraries, which encapsulates low-level implementation details and can be activated either via specific environment variables (for example CUBLASLT_LOG_LEVEL for cuBLASLt) or programmatically through the Python bindings (for example the cuSOLVER logger APIs in nvmath.bindings.cusolverDn).
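As a small illustration of the environment-variable route, cuBLASLt's own low-level trace can be enabled before the library is first used; the level value shown is illustrative:

import os

# Enable cuBLASLt's low-level logging; must be set before the first cuBLASLt call.
os.environ["CUBLASLT_LOG_LEVEL"] = "2"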
Call Blocking Behavior#
By default, calls to all Pythonic host APIs that require GPU execution are not blocking if the input operands reside on the device. This means that functions like nvmath.fft.fft() and nvmath.linalg.advanced.matmul() will return immediately after the operation is launched on the GPU without waiting for it to complete. Users are therefore responsible for properly synchronizing the stream when needed. The default behavior can be modified by setting the blocking attribute (default 'auto') of the relevant Options object to True. For example, users may set nvmath.fft.FFTOptions.blocking to True and pass this options object to the corresponding FFT API calls. If the input operands are on the host, the Pythonic API calls will always block, since the computation yields an output operand that will also reside on the host. Meanwhile, APIs that execute on the host (such as FFTs with the CPU execution space) always block.
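A hedged sketch of both behaviors, assuming a CuPy input array on the GPU:

import cupy as cp
import nvmath

a = cp.random.rand(1024) + 1j * cp.random.rand(1024)

# Default ('auto'): the call returns once the FFT is launched on the GPU;
# synchronize the stream before consuming the result on the host.
b = nvmath.fft.fft(a)
cp.cuda.get_current_stream().synchronize()

# Request blocking behavior through the options object instead.
c = nvmath.fft.fft(a, options=nvmath.fft.FFTOptions(blocking=True))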
Stream Semantics#
The stream semantics depend on whether the behavior of the execution APIs is chosen to be blocking or non-blocking (see Call Blocking Behavior).
For blocking behavior, stream ordering is automatically handled by the nvmath-python high-level APIs for operations that are performed within the package. A stream can be provided for two reasons:
When the computation that prepares the input operands is not already complete by the time the execution APIs are called. This is a correctness requirement for user-provided data.
To enable parallel computations across multiple streams if the device has sufficient resources and the current stream (which is the default) has concomitant operations. This can be done for performance reasons.
For non-blocking behavior, it is the user’s responsibility to ensure correct stream ordering between the execution API calls.
The execution APIs are always launched on the provided stream.
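A minimal sketch of providing a stream to an execution API, assuming CuPy streams and the stream option of nvmath.fft.fft:

import cupy as cp
import nvmath

s = cp.cuda.Stream()

# Prepare the input operand on stream s.
with s:
    a = cp.random.rand(1024) + 1j * cp.random.rand(1024)

# Launch the FFT on the same stream to preserve ordering with the preparation above.
b = nvmath.fft.fft(a, stream=s)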
For examples on stream ordering, refer to FFT with multiple streams.
Memory Management#
By default, the host APIs use the memory pool from the package that their operands belong to. This ensures that there is no contention for memory or spurious out-of-memory errors. However, the user also has the ability to provide their own memory allocator if they choose to do so. In our Pythonic APIs, we support an EMM-like interface as proposed and supported by Numba for users to set their Python memory pool. Taking FFT as an example, users can set the option nvmath.fft.FFTOptions.allocator to a Python object complying with the nvmath.memory.BaseCUDAMemoryManager protocol, and pass the options to the high-level APIs like nvmath.fft.fft() or nvmath.fft.FFT. Temporary memory allocations will then be done through this interface. Internally, we use the same interface to use CuPy's or PyTorch's memory pool depending on the operands.
Note
nvmath’s BaseCUDAMemoryManager protocol is slightly different from Numba’s EMM interface (numba.cuda.BaseCUDAMemoryManager), but duck typing with an existing EMM instance (not type!) at runtime should be possible.
Host APIs with Callbacks#
Certain host APIs (such as nvmath.fft.fft() and nvmath.fft.FFT) allow the user to provide prolog or epilog functions written in Python, resulting in a fused kernel. This improves performance by avoiding extra roundtrips to global memory and effectively increases the arithmetic intensity of the operation.
import cupy as cp
import nvmath

# Create the data for the batched 1-D FFT.
B, N = 256, 1024
a = cp.random.rand(B, N, dtype=cp.float64) + 1j * cp.random.rand(B, N, dtype=cp.float64)

# Compute the normalization factor.
scale = 1.0 / N

# Define the epilog function for the FFT.
def rescale(data_out, offset, data, user_info, unused):
    data_out[offset] = data * scale

# Compile the epilog to LTO-IR (in the context of the execution space).
with a.device:
    epilog = nvmath.fft.compile_epilog(rescale, "complex128", "complex128")

# Perform the forward FFT, applying the filter as an epilog...
r = nvmath.fft.fft(a, axes=[-1], epilog={"ltoir": epilog})
Device APIs#
The device APIs enable the user to call core mathematical operations in their Python CUDA kernels, resulting in a fully fused kernel. Fusion is essential for performance in latency-dominated cases to reduce the number of kernel launches, and in memory-bound operations to avoid the extra roundtrip to global memory.
We currently offer support for calling FFT, matrix multiplication, and random number generation APIs in kernels written using Numba, with plans to offer more core operations and support other compilers in the future. The design of the device APIs closely mimics that of the C++ APIs from the corresponding NVIDIA Math Libraries (MathDx libraries cuFFTDx and cuBLASDx for FFT and matrix multiplication, and cuRAND device APIs for random number generation).
Compatibility Policy#
nvmath-python is no different from any Python package, in that we would not succeed without depending on, collaborating with, and evolving alongside the Python community. Given these considerations, we strive to meet the following commitments:
For the low-level Python bindings,
if the library to be bound is part of CUDA Toolkit, we support the library from the most recent two CUDA major versions (currently CUDA 12/13)
otherwise, we support the library within its major version
Note that all bindings are currently experimental.
For the high-level Pythonic APIs, we maintain backward compatibility to the greatest extent feasible. When a breaking change is necessary, we issue a runtime warning to alert users of the upcoming changes in the next major release. This practice ensures that breaking changes are clearly communicated and reserved for major version updates, allowing users to prepare and adapt without surprises.
We comply with NEP-29 and support a community-defined set of core dependencies (CPython, NumPy, etc.).
Note
The policy on backwards compatibility will apply starting with release 1.0.0.