GPU support in NuMojo #273

Open

@shivasankarka

Description

GPU Support in NuMojo

Motivation

NuMojo aims to serve as a drop-in replacement for NumPy while leveraging Mojo's performance characteristics and native compilation. With GPU backends (e.g., Metal, CUDA, ROCm), we can reach performance on par with, or better than, C++/CUDA. The challenge lies in designing a unified and ergonomic device model that allows users to transparently scale their code from CPU to GPU without major API changes or performance regressions.

Proposed approaches

I outline three main architectural approaches:

Option 1: Unified NDArray with Device-Aware Storage

This approach extends the existing NDArray to include device specialization at compile time.
It provides a PyTorch-like .to[device]() API for explicit device transfers, while keeping a single unified interface across backends, similar to torch.tensor. Users can create an NDArray on either CPU or GPU simply by providing a Device parameter.

Key Properties:

  • Unified API for CPU, GPU, and future devices
  • Compile-time device specialization
  • Minimal breaking changes to current codebase
  • Simple integration path for future devices

Cons:

  • Many ugly compile-time if branches are needed to separate the CPU and GPU implementations of each method (see the sketch below).
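
A minimal sketch of what that branching looks like, assuming a hypothetical integer-backed device parameter (the real Device type is part of this proposal, so all names here are illustrative):

alias CPU = 0
alias MPS = 1

# Hypothetical sketch: the device parameter picks the code path when the
# function is instantiated, so each specialization compiles only one branch,
# but every kernel-backed method needs this scaffolding.
fn matmul[device: Int](a: List[Float32], b: List[Float32], n: Int) -> List[Float32]:
    var c = List[Float32]()
    for _ in range(n * n):
        c.append(0.0)

    @parameter
    if device == CPU:
        # Plain CPU triple loop.
        for i in range(n):
            for k in range(n):
                for j in range(n):
                    c[i * n + j] += a[i * n + k] * b[k * n + j]
    else:
        # Here we would enqueue a Metal/CUDA kernel instead.
        pass
    return c

Because @parameter if resolves at compile time, the unused branch is discarded at instantiation; the cost is source-level clutter rather than runtime overhead, which is consistent with the near-zero abstraction overhead reported under Preliminary Results.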

Example Usage:

fn main() raises:
    alias SIZE: Int = 1024
    alias cpu: Device = Device.CPU
    alias mps: Device = Device.MPS

    # Create CPU arrays
    var arr_cpu_1 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_cpu_2 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_cpu = arr_cpu_1 @ arr_cpu_2

    # Create GPU arrays (Metal backend)
    var arr_gpu_1 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_gpu_2 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_gpu = arr_gpu_1 @ arr_gpu_2

    # Matrix API variant
    var mat_cpu_1 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=1.0)
    var mat_cpu_2 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_cpu = mat_cpu_1 @ mat_cpu_2

    var mat_gpu_1 = Matrix[f32, mps]((SIZE, SIZE), fill_value=1.0)
    var mat_gpu_2 = Matrix[f32, mps]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_gpu = mat_gpu_1 @ mat_gpu_2

    # Transfer between devices
    var gpu_from_cpu_1 = mat_cpu_1.to[mps]()
    var gpu_from_cpu_2 = mat_cpu_2.to[mps]()
    var matmul_gpu_from_cpu = gpu_from_cpu_1 @ gpu_from_cpu_2

Option 2: Separate Device-Specific Classes

This design introduces explicit device-specific classes, e.g. NDArrayCPU and NDArrayGPU. Each type directly manages its own memory layout and compute kernels.

Pros:

  • Zero device abstraction overhead
  • Enables backend-specific optimizations

Cons:

  • Significant code duplication: every array function needs per-device overloads (see the sketch after the example)
  • Poor ergonomics for users switching between CPU and GPU

Example:

alias mps = Device.MPS

var x_cpu_1 = NDArrayCPU[f32](Shape(1024, 1024))
var x_cpu_2 = NDArrayCPU[f32](Shape(1024, 1024))
var result_cpu = x_cpu_1 @ x_cpu_2

var x_gpu_1 = NDArrayGPU[f32](Shape(1024, 1024))
var x_gpu_2 = NDArrayGPU[f32](Shape(1024, 1024))
var result_gpu = x_gpu_1 @ x_gpu_2

var x_cpu_to_gpu = x_cpu_1.to[mps]()
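
To make the duplication concrete, here is a minimal sketch, reduced to a scalar reduction over hypothetical stand-in types (none of these names exist in NuMojo today):

# Hypothetical stand-ins for the two device-specific array types.
struct ArrCPU:
    var data: List[Float32]

    fn __init__(out self, data: List[Float32]):
        self.data = data

struct ArrGPU:
    # In a real backend this would hold a device buffer handle.
    var data: List[Float32]

    fn __init__(out self, data: List[Float32]):
        self.data = data

# Every routine in the library needs one overload per device type,
# even when only the dispatch differs.
fn total(a: ArrCPU) -> Float32:
    var s: Float32 = 0.0
    for i in range(len(a.data)):
        s += a.data[i]
    return s

fn total(a: ArrGPU) -> Float32:
    # Would launch a reduction kernel; the duplicated signature
    # and boilerplate remain regardless.
    var s: Float32 = 0.0
    for i in range(len(a.data)):
        s += a.data[i]
    return s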

This model may be more suitable for low-level or embedded contexts, but less ideal for NuMojo’s NumPy-compatibility goals.

Option 3: Static Shape GPU Arrays

This approach introduces a StaticNDArray type with compile-time known shapes and dtypes, enabling aggressive optimizations such as loop unrolling and vectorization.

Pros:

  • Maximum performance and compile-time safety
  • Enables highly optimized kernels for fixed-size data

Cons:

  • Limited flexibility for dynamic workloads
  • Increased API and implementation complexity
  • Requires separate type definitions (NDArray vs StaticNDArray)

This model could coexist with the dynamic NDArray, targeting scientific computing and ML inference workloads where shapes are known ahead of time.
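
A rough sketch of what such a type could look like, assuming stack storage via the standard library's InlineArray and compile-time unrolling with @parameter for (the StaticNDArray name and its API are hypothetical):

from collections import InlineArray

# Hypothetical sketch of a shape-parameterized array. Shape mismatches
# become compile errors, and element loops can be fully unrolled because
# rows and cols are compile-time parameters.
struct StaticNDArray[dtype: DType, rows: Int, cols: Int]:
    var data: InlineArray[Scalar[dtype], rows * cols]

    fn __init__(out self, fill: Scalar[dtype]):
        self.data = InlineArray[Scalar[dtype], rows * cols](fill=fill)

    fn trace(self) -> Scalar[dtype]:
        var s: Scalar[dtype] = 0

        # Unrolled at compile time since the bound is a parameter expression.
        @parameter
        for i in range(min(rows, cols)):
            s += self.data[i * cols + i]
        return s

fn main():
    var m = StaticNDArray[DType.float32, 4, 4](fill=2.0)
    print(m.trace())  # 8.0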

Note:

  1. Many of these limitations may be mitigated in the future as Mojo evolves (e.g., with trait parameters and advanced compile-time metaprogramming features).
  2. While NuMojo aims to be largely NumPy-compatible, we shouldn’t hesitate to improve the API design where it makes sense, even if it introduces intentional deviations from NumPy’s behavior.

Preliminary Results:

Using options 1 and 2:

  • Observed near-zero abstraction overhead with the unified approach (option 1)
  • Achieved ~15× speedup on the Apple Silicon GPU (MPS backend) for matmul with SIZE = 2048 using basic GPU kernels

See the attached figure for a CPU vs. GPU comparison of matmul using options 1 and 2 with basic tiled GPU kernels.

[Figure: matmul_benchmark_plots_matrix (CPU vs. GPU matmul benchmarks)]
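
For context, the blocking structure behind such tiled kernels, sketched here as a plain CPU loop nest rather than the actual GPU code (on the GPU, each TILE x TILE block of the output maps to a threadgroup, with tiles staged in shared memory):

# Illustrative CPU version of tiled matmul over flat row-major buffers.
fn tiled_matmul(a: List[Float32], b: List[Float32], n: Int) -> List[Float32]:
    alias TILE = 32
    var c = List[Float32]()
    for _ in range(n * n):
        c.append(0.0)
    for ii in range(0, n, TILE):
        for jj in range(0, n, TILE):
            for kk in range(0, n, TILE):
                # Accumulate this kk strip's contribution to the (ii, jj) tile.
                for i in range(ii, min(ii + TILE, n)):
                    for k in range(kk, min(kk + TILE, n)):
                        var a_ik = a[i * n + k]
                        for j in range(jj, min(jj + TILE, n)):
                            c[i * n + j] += a_ik * b[k * n + j]
    return c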
