Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild
At ZML, we are creating exciting AI products on top of our high-performance AI inference stack. Our stack is built for production, using the amazing Zig language, MLIR, and the power of Bazel.
We're very happy to share our inference stack with the world and hope it allows you, too, to build cool and exciting AI projects.
To give you a glimpse of what you can do with ZML, here is an early demo. It shows a prototype running a LLaMA2 model sharded on 1 NVIDIA RTX 4090, 1 AMD 6800XT, and 1 Google Cloud TPU v2. All accelerators were hosted in different locations, with activations being passed over a VPN.

All processes used the same model code, cross-compiled on a Mac, and copied onto the servers.
For more inspiration, see also the examples below or check out the examples folder.
We use `bazel` to build ZML and its dependencies. The only prerequisite is `bazel`, which we recommend downloading through `bazelisk`, a version manager for `bazel`.

Please note: if you do not wish to install `bazel` system-wide, we provide `examples/bazel.sh`, which downloads it to your home folder and runs it.
Install Bazel (recommended):
```
curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.25.0/bazelisk-linux-amd64'
chmod +x /usr/local/bin/bazel
```
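To check that the install worked, you can ask `bazelisk` (installed as `bazel` above) to print its version; on first run it downloads the actual `bazel` binary, so the version printed depends on your checkout:

```
# first run downloads the real bazel; prints e.g. "Build label: ..."
bazel version
```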
We have implemented a variety of example models in ZML. See our reference implementations in the examples folder.
MNIST is the classic handwritten digit recognition task. The model is tasked to recognize a handwritten digit, which has been converted to a 28x28 pixel monochrome image. Bazel will download a pre-trained model and the test dataset. The program will load the model, compile it, and classify a randomly picked example from the test dataset.
On the command line:
```
cd examples
bazel run -c opt //mnist
# or
./bazel.sh run -c opt //mnist
```
The Llama models are gated: they require approval from Meta on Hugging Face, which can take a few hours to be granted. While waiting, you can already generate an access token to log into Hugging Face from `bazel`; see the token setup below.

Once you've been granted access, you're ready to download a gated model like `Meta-Llama-3.1-8B-Instruct`!
```
# requires the token in $HOME/.cache/huggingface/token, as created by the
# `huggingface-cli login` command, or the `HUGGINGFACE_TOKEN` environment variable
cd examples
bazel run -c opt //llama:Llama-3.1-8B-Instruct
bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="What is the capital of France?"
```
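If you haven't stored a token yet, either of the ways mentioned in the comment above should work; the token value below is a placeholder you obtain from your Hugging Face account settings:

```
# interactive login; writes the token to $HOME/.cache/huggingface/token
huggingface-cli login
# or export it for the current shell session only
export HUGGINGFACE_TOKEN=<your-token>
```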
You can also try `Llama-3.1-70B-Instruct` if you have enough memory.
Like the 8B model above, the Llama 3.2 models also require approval on Hugging Face before they can be downloaded.
```
cd examples
bazel run -c opt //llama:Llama-3.2-1B-Instruct
bazel run -c opt //llama:Llama-3.2-1B-Instruct -- --prompt="What is the capital of France?"
```
For a larger 3.2 model, you can also try `Llama-3.2-3B-Instruct`; a sketch of that invocation follows.
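This assumes the Bazel target follows the same naming pattern as the 1B target above; the target name is inferred, not verified here:

```
cd examples
# hypothetical target name, patterned after //llama:Llama-3.2-1B-Instruct
bazel run -c opt //llama:Llama-3.2-3B-Instruct
```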
You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:
- NVIDIA CUDA: `--@zml//runtimes:cuda=true`
- AMD ROCm: `--@zml//runtimes:rocm=true`
- Google TPU: `--@zml//runtimes:tpu=true`
- AWS Trainium/Inferentia 2: `--@zml//runtimes:neuron=true`
- AVOID CPU: `--@zml//runtimes:cpu=false`
The latter, avoiding compilation for CPU, cuts down compilation time.
So, to run the Llama 3.2 model from above on a host with an NVIDIA GPU, run the following:
```
cd examples
bazel run -c opt //llama:Llama-3.2-1B-Instruct \
  --@zml//runtimes:cuda=true \
  -- --prompt="What is the capital of France?"
```
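The runtime flags can be combined. As a sketch, here is the same run targeting CUDA only, with the CPU compile skipped to cut down compilation time:

```
cd examples
bazel run -c opt //llama:Llama-3.2-1B-Instruct \
  --@zml//runtimes:cuda=true \
  --@zml//runtimes:cpu=false \
  -- --prompt="What is the capital of France?"
```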
To run ZML's unit tests:

```
bazel test //zml:test
```
As a taste of the API, here is the MNIST model definition in ZML:

```zig
const std = @import("std");
const zml = @import("zml");

/// Model definition
const Mnist = struct {
    fc1: Layer,
    fc2: Layer,

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.matmul(input).add(self.bias).relu();
        }
    };

    /// just two linear layers + relu activation
    pub fn forward(self: Mnist, input: zml.Tensor) zml.Tensor {
        std.log.info("Compiling for target: {s}", .{@tagName(input.getContext().target())});
        var x = input.flattenAll().convert(.f32);
        const layers: []const Layer = &.{ self.fc1, self.fc2 };
        for (layers) |layer| {
            x = zml.call(layer, .forward, .{x});
        }
        return x.argMax(0, .u8).indices;
    }
};
```
And a scaled dot-product attention (SDPA) layer, using ZML's tagged tensors:

```zig
const Sdpa = struct {
    pub fn forward(_: Sdpa, ctx: *zml.Context, q_: zml.Tensor, k_: zml.Tensor, v_: zml.Tensor) zml.Tensor {
        // tag the axes (batch, heads, query/key sequence, head dim) so they
        // can be referred to by name instead of positional index
        const q = q_.withTags(.{ .b, .h, .q, .hd });
        const k = k_.withTags(.{ .b, .h, .k, .hd });
        const v = v_.withTags(.{ .b, .h, .k, .hd });
        const attn_mask = zml.nn.causalAttnMask(ctx, .{ .q = q.dim(.q), .k = k.dim(.k) }, q.dtype(), null);
        return zml.nn.sdpa(ctx, q, k, v, .{ .attn_mask = attn_mask });
    }
};
```
You might want to check out more examples, read through the documentation directly on GitHub, or, for the full rendering experience, browse the online documentation with included API reference.
For contribution guidelines, see here.
ZML is licensed under the Apache 2.0 license.