GPU Support in NuMojo
Motivation
NuMojo aims to serve as a drop-in replacement for NumPy while leveraging Mojo's performance characteristics and native compilation. With GPU backends (e.g., Metal, CUDA, ROCm), NuMojo can match or exceed C++/CUDA-level performance. The challenge lies in designing a unified and ergonomic device model that lets users transparently scale their code from CPU to GPU without major API changes or performance regressions.
Proposed approaches
I outline three main architectural approaches:
Option 1: Unified NDArray with Device-Aware Storage
This approach extends the existing NDArray to include device specialization at compile time.
It provides a PyTorch-like `.to[device]()` API for explicit device transfers, while keeping a single unified interface across backends, similar to `torch.Tensor`. Users can create an NDArray on either the CPU or the GPU simply by providing a `Device` parameter.
Key Properties:
- Unified API for CPU, GPU, and future devices
- Compile-time device specialization
- Minimal breaking changes to current codebase
- Simple integration path for future devices
Cons:
- Requires many ugly compile-time `if` branches to differentiate CPU and GPU code paths (illustrated in the sketch below).
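A minimal sketch of what that branching looks like, assuming a device-parameterized NDArray and hypothetical `_matmul_cpu` / `_matmul_gpu` helpers (none of these exist in NuMojo today):

```mojo
# Hypothetical: a single device-aware NDArray forces compile-time branches
# like this into every operation that has separate CPU and GPU kernels.
fn matmul[
    dtype: DType, device: Device
](a: NDArray[dtype, device], b: NDArray[dtype, device]) raises -> NDArray[dtype, device]:
    @parameter
    if device == Device.CPU:
        return _matmul_cpu(a, b)   # vectorized/parallelized CPU kernel
    else:
        return _matmul_gpu(a, b)   # dispatch to a Metal/CUDA/ROCm kernel
```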
Example Usage:
```mojo
fn main() raises:
    alias SIZE: Int = 1024
    alias cpu: Device = Device.CPU
    alias mps: Device = Device.MPS

    # Create CPU arrays
    var arr_cpu_1 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_cpu_2 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_cpu = arr_cpu_1 @ arr_cpu_2

    # Create GPU arrays (Metal backend)
    var arr_gpu_1 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_gpu_2 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_gpu = arr_gpu_1 @ arr_gpu_2

    # Matrix API variant
    var mat_cpu_1 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=1.0)
    var mat_cpu_2 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_cpu = mat_cpu_1 @ mat_cpu_2

    var mat_gpu_1 = Matrix[f32, mps]((SIZE, SIZE), fill_value=1.0)
    var mat_gpu_2 = Matrix[f32, mps]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_gpu = mat_gpu_1 @ mat_gpu_2

    # Transfer between devices
    var gpu_from_cpu_1 = mat_cpu_1.to[mps]()
    var gpu_from_cpu_2 = mat_cpu_2.to[mps]()
    var matmul_gpu_from_cpu = gpu_from_cpu_1 @ gpu_from_cpu_2
```
Option 2: Separate Device-Specific Classes
This design introduces explicit device-specific classes, e.g. `NDArrayCPU` and `NDArrayGPU`. Each type directly manages its own memory layout and compute kernels.
Pros:
- Zero device abstraction overhead
- Enables backend-specific optimizations
Cons:
- Significant code duplication for function overloading
- Poor ergonomics for users switching between CPU/GPU
Example:
```mojo
alias mps = Device.MPS

var x_cpu_1 = NDArrayCPU[f32](Shape(1024, 1024))
var x_cpu_2 = NDArrayCPU[f32](Shape(1024, 1024))
var result_cpu = x_cpu_1 @ x_cpu_2

var x_gpu_1 = NDArrayGPU[f32](Shape(1024, 1024))
var x_gpu_2 = NDArrayGPU[f32](Shape(1024, 1024))
var result_gpu = x_gpu_1 @ x_gpu_2

var x_cpu_to_gpu = x_cpu_1.to[mps]()
```
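To make the duplication cost concrete, here is a hedged sketch of the library-side overloads such a split would require; the `_tiled_matmul_cpu` / `_tiled_matmul_gpu` helpers are assumptions standing in for real kernels, not existing NuMojo functions:

```mojo
# Hypothetical: every routine needs one overload per device-specific array
# type, even when the call-site semantics are identical.
fn matmul(a: NDArrayCPU[f32], b: NDArrayCPU[f32]) raises -> NDArrayCPU[f32]:
    return _tiled_matmul_cpu(a, b)   # SIMD/parallelized CPU path

fn matmul(a: NDArrayGPU[f32], b: NDArrayGPU[f32]) raises -> NDArrayGPU[f32]:
    return _tiled_matmul_gpu(a, b)   # Metal/CUDA tiled kernel path
```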
This model may be more suitable for low-level or embedded contexts, but it is less well suited to NuMojo's NumPy-compatibility goals.
Option 3: Static Shape GPU Arrays
This approach introduces a StaticNDArray type with compile-time known shapes and dtypes, enabling aggressive optimizations such as loop unrolling and vectorization.
Pros:
- Maximum performance and compile-time safety
- Enables highly optimized kernels for fixed-size data
Cons:
- Limited flexibility for dynamic workloads
- Increased API and implementation complexity
- Requires separate type definitions (NDArray vs StaticNDArray)
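A hypothetical usage sketch, in the spirit of the examples above; `StaticNDArray` and its parameter list are assumptions for illustration, not an existing API:

```mojo
alias mps = Device.MPS

# Shape and dtype are compile-time parameters, so kernels can be fully
# specialized (unrolled, tiled, vectorized) for this exact size.
var a = StaticNDArray[f32, Shape(1024, 1024), mps](fill_value=1.0)
var b = StaticNDArray[f32, Shape(1024, 1024), mps](fill_value=2.0)
var c = a @ b                     # dispatches a size-specialized GPU kernel
var c_cpu = c.to[Device.CPU]()    # transfer back to host for inspection
```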
This model could coexist with the dynamic NDArray, targeting scientific computing and ML inference workloads where shapes are known ahead of time.
Note:
- Many of these limitations may be mitigated in the future as Mojo evolves (e.g., with trait parameters and advanced compile-time metaprogramming features).
- While NuMojo aims to be largely NumPy-compatible, we shouldn’t hesitate to improve the API design where it makes sense, even if it introduces intentional deviations from NumPy’s behavior.
Preliminary Results:
Using Approaches 1 and 2:
- Observed near-zero abstraction overhead with the unified approach (Option 1)
- Achieved ~15× speedup on Apple Silicon GPU (MPS backend) for matmul with SIZE = 2048 using basic GPU kernels
See the attached figure for a CPU vs GPU comparison of matmul using Approaches 1 and 2 with basic tiled GPU kernels.