GPU Support in NuMojo
Motivation
NuMojo aims to serve as a drop-in replacement for NumPy while leveraging Mojo's performance characteristics and native compilation. With GPU backends (e.g., Metal, CUDA, ROCm), NuMojo can match or exceed C++/CUDA-level performance. The challenge lies in designing a unified and ergonomic device model that lets users transparently scale their code from CPU to GPU without major API changes or performance regressions.
Proposed approaches
I outline three main architectural approaches:
Option 1: Unified NDArray with Device-Aware Storage
This approach extends the existing NDArray to include device specialization at compile time.
It provides a PyTorch-like `.to[device]()` API for explicit device transfers, while keeping a single unified interface across backends, similar to `torch.Tensor`. Users can create an NDArray on either the CPU or the GPU simply by providing a `Device` parameter.
Key Properties:
- Unified API for CPU, GPU, and future devices
- Compile-time device specialization
- Minimal breaking changes to current codebase
- Simple integration path for future devices
Cons:
- Requires many ugly compile-time `if` branches to differentiate CPU and GPU code paths (illustrated in the sketch below).
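A minimal sketch of what that branching looks like, assuming a device-parameterized NDArray and hypothetical `_matmul_cpu` / `_matmul_gpu` helpers (none of these exist in NuMojo today):

```mojo
# Hypothetical: a single device-aware NDArray forces compile-time branches
# like this into every operation that has separate CPU and GPU kernels.
fn matmul[
    dtype: DType, device: Device
](a: NDArray[dtype, device], b: NDArray[dtype, device]) raises -> NDArray[dtype, device]:
    @parameter
    if device == Device.CPU:
        return _matmul_cpu(a, b)   # vectorized/parallelized CPU kernel
    else:
        return _matmul_gpu(a, b)   # dispatch to a Metal/CUDA/ROCm kernel
```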
Example Usage:
```mojo
fn main() raises:
    alias SIZE: Int = 1024
    alias cpu: Device = Device.CPU
    alias mps: Device = Device.MPS

    # Create CPU arrays
    var arr_cpu_1 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_cpu_2 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_cpu = arr_cpu_1 @ arr_cpu_2

    # Create GPU arrays (Metal backend)
    var arr_gpu_1 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_gpu_2 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_gpu = arr_gpu_1 @ arr_gpu_2

    # Matrix API variant
    var mat_cpu_1 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=1.0)
    var mat_cpu_2 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_cpu = mat_cpu_1 @ mat_cpu_2

    var mat_gpu_1 = Matrix[f32, mps]((SIZE, SIZE), fill_value=1.0)
    var mat_gpu_2 = Matrix[f32, mps]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_gpu = mat_gpu_1 @ mat_gpu_2

    # Transfer between devices
    var gpu_from_cpu_1 = mat_cpu_1.to[mps]()
    var gpu_from_cpu_2 = mat_cpu_2.to[mps]()
    var matmul_gpu_from_cpu = gpu_from_cpu_1 @ gpu_from_cpu_2
```
Option 2: Separate Device-Specific Classes
This design introduces explicit device-specific classes, e.g. `NDArrayCPU` and `NDArrayGPU`. Each type directly manages its own memory layout and compute kernels.
Pros:
- Zero device abstraction overhead
- Enables backend-specific optimizations
Cons:
- Significant code duplication for function overloading
- Poor ergonomics for users switching between CPU/GPU
Example:
```mojo
alias mps = Device.MPS

var x_cpu_1 = NDArrayCPU[f32](Shape(1024, 1024))
var x_cpu_2 = NDArrayCPU[f32](Shape(1024, 1024))
var result_cpu = x_cpu_1 @ x_cpu_2

var x_gpu_1 = NDArrayGPU[f32](Shape(1024, 1024))
var x_gpu_2 = NDArrayGPU[f32](Shape(1024, 1024))
var result_gpu = x_gpu_1 @ x_gpu_2

var x_cpu_to_gpu = x_cpu_1.to[mps]()
```
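To make the duplication cost concrete, here is a hedged sketch of the library-side overloads such a split would require; the `_tiled_matmul_cpu` / `_tiled_matmul_gpu` helpers are assumptions standing in for real kernels, not existing NuMojo functions:

```mojo
# Hypothetical: every routine needs one overload per device-specific array
# type, even when the call-site semantics are identical.
fn matmul(a: NDArrayCPU[f32], b: NDArrayCPU[f32]) raises -> NDArrayCPU[f32]:
    return _tiled_matmul_cpu(a, b)   # SIMD/parallelized CPU path

fn matmul(a: NDArrayGPU[f32], b: NDArrayGPU[f32]) raises -> NDArrayGPU[f32]:
    return _tiled_matmul_gpu(a, b)   # Metal/CUDA tiled kernel path
```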
This model may be more suitable for low-level or embedded contexts, but it is less well suited to NuMojo's NumPy-compatibility goals.
Option 3: Static Shape GPU Arrays
This approach introduces a StaticNDArray type with compile-time known shapes and dtypes, enabling aggressive optimizations such as loop unrolling and vectorization.
Pros:
- Maximum performance and compile-time safety
- Enables highly optimized kernels for fixed-size data
Cons:
- Limited flexibility for dynamic workloads
- Increased API and implementation complexity
- Requires separate type definitions (NDArray vs StaticNDArray)
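A hypothetical usage sketch, in the spirit of the examples above; `StaticNDArray` and its parameter list are assumptions for illustration, not an existing API:

```mojo
alias mps = Device.MPS

# Shape and dtype are compile-time parameters, so kernels can be fully
# specialized (unrolled, tiled, vectorized) for this exact size.
var a = StaticNDArray[f32, Shape(1024, 1024), mps](fill_value=1.0)
var b = StaticNDArray[f32, Shape(1024, 1024), mps](fill_value=2.0)
var c = a @ b                     # dispatches a size-specialized GPU kernel
var c_cpu = c.to[Device.CPU]()    # transfer back to host for inspection
```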
This model could coexist with the dynamic NDArray, targeting scientific computing and ML inference workloads where shapes are known ahead of time.
Note:
- Many of these limitations may be mitigated in the future as Mojo evolves (e.g., with trait parameters and advanced compile-time metaprogramming features).
- While NuMojo aims to be largely NumPy-compatible, we shouldn’t hesitate to improve the API design where it makes sense, even if it introduces intentional deviations from NumPy’s behavior.
Preliminary Results:
Using Approaches 1 and 2:
- Observed near-zero abstraction overhead with the unified approach (Option 1)
- Achieved ~15× speedup on Apple Silicon GPU (MPS backend) for matmul with SIZE = 2048 using basic GPU kernels
See the attached figure for a CPU vs GPU comparison of matmul using Approaches 1 and 2 with basic tiled GPU kernels.