WWDC 24: Running Mistral 7B with Core ML
Apple Intelligence features can only work this well because of the vertically integrated software stack that harnesses Apple Silicon's capabilities to the fullest. Apple also offers a platform for developers to run models on-device, known as Core ML. This software stack allows you to run ML models across all 3 compute units (CPU, GPU & Neural Engine) available on Apple Silicon hardware.
In this blog post, we'll be exploring some of the best new Core ML features to replicate the Mistral 7B example Apple showcased in the WWDC'24 session Deploy machine learning and AI models on-device with Core ML, where they use a fork of swift-transformers to run a state-of-the-art LLM on a Mac. This is a high-quality model with more than 7 billion parameters that pushes the capabilities of consumer hardware today. You can also check out the WWDC'24 session Bring your machine learning and AI models to Apple silicon, where part of the Mistral 7B conversion process is shown.
Let’s see what steps to take to run it as efficiently as possible, andlearn the new tools available in iOS 18 & macOS Sequoia.
This is what we’ll be building today:
TL;DR
By the end of this blog post, you will have learnt all the new goodies accompanying the latest macOS release AND you will have successfully run a 7B parameter model using less than 4GB of memory on your Mac.

Step 1: Clone the preview branch of the swift-transformers repo:

```bash
git clone -b preview https://github.com/huggingface/swift-transformers
```

Step 2: Download the converted Core ML models from this Hugging Face repo.

Step 3: Run inference using Swift:

```bash
swift run transformers "Best recommendations for a place to visit in Paris in August 2024:" --max-length 200 Mistral7B-CoreML/StatefulMistralInstructInt4.mlpackage
```
Best new Core ML features from WWDC'24

Here are some of the most impactful Core ML features from WWDC'24 we will use to run Mistral 7B on a Mac.
Swift Tensor
The first feature we want to highlight is an entirely new Swift type to work with ML tensors. These are multi-dimensional data structures every ML framework uses. Python developers working on ML are familiar with numpy arrays or torch tensors, which provide convenient, high-level interfaces to manipulate these large multi-dimensional matrices easily. The new MLTensor type provides a high-level abstraction that mimics the ones available in Python frameworks, greatly simplifying working with tensor data in Swift.
Core ML already had multi-dimensional data types in the form of MLMultiArray and MLShapedArray. However, they were only meant for data storage and simple operations like wrapping your data and sending it as input to a Core ML model, or unwrapping results from a Core ML model. Manipulating tensor data with these APIs is difficult: only a few primitive operations are provided, and you may have to write your own by accessing the underlying storage as an opaque pointer to numeric data. This is time-consuming and error-prone.
Consider a language model like the one we want to port to Core ML. Language models take in an input sequence of tokens, and they output an estimation of the probabilities of all the tokens in the vocabulary, meaning that tokens with a high probability have a high chance of being plausible continuations of the input. The application's job is to select the best next token to append to the sequence based on those probabilities. MLTensor makes it easy to handle these operations without custom code.
When we released swift-transformers, we wrote a lot of code (later extended by the community, thanks! ❤️) to help with input preparations (convert words to tokens) and output post-processing. For example, check out our softmax operation using Accelerate. All this can be removed when using MLTensor, as softmax is provided out of the box!
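To make the kind of post-processing concrete, here is a minimal Python sketch of the softmax and greedy next-token selection described above (PyTorch is just a stand-in tensor library here, and the logits are random placeholders rather than real model output); MLTensor lets you express the same operations just as directly in Swift:

```python
import torch

# Minimal sketch: pick the next token from the logits for the last position.
vocab_size = 32_000
logits = torch.randn(1, vocab_size)       # placeholder for the model's output

probs = torch.softmax(logits, dim=-1)     # probabilities over the vocabulary
next_token = torch.argmax(probs, dim=-1)  # greedy choice of the next token
print(next_token.item())
```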
Stateful Buffers
Before WWDC'24, a Core ML model was essentially a pure stateless function where you provide inputs and get some outputs back. However, sometimes you need to keep a state that depends on previous computations. The functional programming method for maintaining state is to add an additional input/output pair. So, based on your inputs and state, the model computes the output and the new state. There is nothing wrong with this approach, and in fact, that's the way high-performance frameworks like JAX work.
However, there are practical limitations: the stateful data needs to be sent to the model as an input and retrieved as an output every time you call the model. If the stateful data is large, then all this going back and forth increases overhead and slows things down. This is particularly important for LLMs because you have to run many iterations to generate a sequence. The performance bottleneck is usually your computer's memory bandwidth (i.e., how fast you can move things to your GPU and back). Stateful models solve this problem by reserving a block of memory for state data and keeping it on the GPU so you don't have to send and receive it every time you use the model.
Stateful buffers were introduced in this WWDC'24 session using a toy example that is easy to understand but not representative of practical uses with big models such as LLMs. An LLM performance trick for transformers-based models is key-value caching (known as kv-caching). As shown in the following illustration, it avoids costly matrix multiplications in the crucial attention block by caching the result of operations performed in previous steps. We won't go into details, but the takeaways are: kv-cache dramatically increases performance, and it requires a large block of memory that is the perfect candidate for using stateful buffers. Here is a coremltools user guide update about stateful models.
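To make the contrast concrete, here is a tiny toy sketch of ours (plain Python, not Core ML API): a stateless step where the state travels in and out on every call, versus a stateful object that owns a pre-allocated buffer and updates it in place, which is essentially what Core ML stateful buffers enable for the kv-cache.

```python
import numpy as np

def stateless_step(x, cache):
    # The cache is an explicit input and output: it has to cross the model
    # boundary (and the memory bus) on every single call.
    cache = np.concatenate([cache, x[None]])
    return cache.sum(), cache

class StatefulModel:
    """The buffer is allocated once and mutated in place, like a Core ML stateful buffer."""

    def __init__(self, max_steps, dim):
        self.cache = np.zeros((max_steps, dim), dtype=np.float32)
        self.steps = 0

    def step(self, x):
        self.cache[self.steps] = x   # in-place update, nothing shuttled back and forth
        self.steps += 1
        return self.cache[: self.steps].sum()

x = np.ones(4, dtype=np.float32)
out, cache = stateless_step(x, np.zeros((0, 4), dtype=np.float32))

model = StatefulModel(max_steps=8, dim=4)
out = model.step(x)
```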
New Quantization Techniques
In WWDC 23, we explored a very cool technique called palettization, and we showed how it could help bring text-to-image models, such as Stable Diffusion, to Macs and iPhones.
Whilst these techniques allow you to reduce the size considerably, if pushed too far, the impact on quality is drastic. Bigger models suffer more from this, as the weight data has an extensive dynamic range. Trying to create a small lookup table (LUT) that captures all possible values becomes increasingly difficult. The solution introduced in WWDC 24 is to focus on a smaller portion of the data at a time, and create multiple lookup tables for different areas of the same tensor.
These methods (block-wise quantization) allow us to compress models to as low as 4-bit precision. Instead of using 4 bytes (the size of a float32 number) to represent each model parameter, we can get away with half a byte (a nibble) for each. This is an 8-fold reduction in model size (minus some overhead to account for the block-wise quantization tables), or 4 times smaller when compared to float16 precision.
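As a rough sanity check of those numbers, here is a back-of-the-envelope estimate, not an exact accounting: the parameter count is approximate, not every tensor is necessarily quantized, and we assume the per-block linear scheme used later in this post (one float16 scale per 32-weight block).

```python
params = 7.25e9              # Mistral 7B, approximate parameter count
bits_per_weight = 4          # int4 payload
scale_bits_per_block = 16    # one float16 scale per block (linear quantization)
block_size = 32              # weights per block

effective_bits = bits_per_weight + scale_bits_per_block / block_size  # 4.5 bits/weight
print(f"float16 weights   : {params * 16 / 8 / 2**30:.1f} GiB")       # ≈ 13.5 GiB
print(f"int4, block-wise  : {params * effective_bits / 8 / 2**30:.1f} GiB")  # ≈ 3.8 GiB
```

The result is in the same ballpark as the package sizes reported further down in this post.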
Multifunction Support
We won't use this feature for this example, but we wanted to mention it here as it was introduced at WWDC 24, and we will be showcasing it in some upcoming work. Multifunction support essentially allows you to package LoRA adapters into generative models to use the same model (with a small set of additional parameters, called adapters) for different tasks. LoRA is the preferred community technique for large model fine-tuning. In diffusion models, for example, you can use LoRA to generate images with different styles, such as photorealistic or cartoonish. We believe LoRA is part of the solution that powers Apple's Genmoji implementation. For language models, LoRA adapters can be used to adapt a generic LLM to specific tasks or domains.

To read more about LoRA, you can check this post.

To read more about Multifunction, you can check out the Apple coremltools user guide here.
Converting Mistral 7B to Core ML
The single most important component for running a large language model efficiently is the kv-cache. As mentioned above, this is a great candidate for the new stateful model feature released at WWDC'24. Models in the transformers library already use efficient attention implementations that rely heavily on kv-caching. However, the default implementations are optimized for Nvidia GPUs, and this hardware has a different set of constraints than Apple Silicon does. In the case of Core ML, we need to pre-allocate the full cache buffer beforehand and ensure that each time we call the model, we update the buffer in place. This avoids inefficient memory allocations and tensor concatenations and is also a requirement for Core ML stateful buffers.

To achieve this goal, we have to use a different attention implementation that considers these factors. This requires modifying the transformers modeling code for the Mistral architecture, and it's done in this fragment of code.
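We won't reproduce the patched modeling code here, but the core idea looks roughly like the simplified sketch below (illustrative only; the names and shapes are ours, not the ones in the linked fragment): the key/value buffers are pre-allocated for the maximum context length, and each decoding step writes its new entries into the right slice in place.

```python
import torch

# Illustrative shapes; the real model keeps one cache per layer.
max_context, num_kv_heads, head_dim = 2048, 8, 128
key_cache = torch.zeros(1, num_kv_heads, max_context, head_dim)
value_cache = torch.zeros(1, num_kv_heads, max_context, head_dim)

def update_cache(key_states, value_states, position):
    # Write the new keys/values for the current step(s) in place,
    # instead of concatenating ever-growing tensors.
    seq_len = key_states.shape[2]
    key_cache[:, :, position : position + seq_len, :] = key_states
    value_cache[:, :, position : position + seq_len, :] = value_states
    # Attention then reads only the valid prefix of the buffers.
    return (
        key_cache[:, :, : position + seq_len, :],
        value_cache[:, :, : position + seq_len, :],
    )

# Example: process a 5-token prompt starting at position 0.
k = torch.randn(1, num_kv_heads, 5, head_dim)
v = torch.randn(1, num_kv_heads, 5, head_dim)
keys, values = update_cache(k, v, position=0)
```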
Note: If you want to follow along and replicate the conversion (or convert another Mistral-based model, like a different fine-tune), you can use this script to run all the conversion steps.
Tracing & Conversion
The first step is to load the model. We'll use the patched implementation with the in-place cache method.

```python
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
torch_model = StatefulMistralForCausalLM(MODEL_ID)
torch_model.eval()
```

Before running Core ML conversion, we need to trace the model with example inputs. This process records the tensor operations performed on those inputs, and the traced graph will be translated to Core ML operations during conversion. We use sample inputs to trace the model; we don't need real data.
```python
input_ids = torch.zeros((1, 2), dtype=torch.int32)
causal_mask = torch.zeros((1, 1, 2, 5), dtype=torch.float32)
traced_model = torch.jit.trace(torch_model, [input_ids, causal_mask])
```

The input to a language model is a sequence of tokens of varying length. We'll allow the input to grow from a single token to a maximum context length of 2048. We can use coremltools range dimensions to specify these bounds.
```python
query_length = ct.RangeDim(lower_bound=1, upper_bound=2048, default=1)
end_step_dim = ct.RangeDim(lower_bound=1, upper_bound=2048, default=1)
inputs = [
    ct.TensorType(shape=(1, query_length), dtype=np.int32, name="inputIds"),
    ct.TensorType(shape=(1, 1, query_length, end_step_dim), dtype=np.float16, name="causalMask"),
]
outputs = [ct.TensorType(dtype=np.float16, name="logits")]
```

In addition to the sequence tokens (called inputIds in the example above), there's another input called causalMask, which specifies the tokens the model needs to pay attention to. This is mostly used when generating multiple sequences at the same time using batching. Check out how these inputs are used in an example runner here.

In this situation, all the input sequences inside a batch must have the same length, so we use padding tokens and the causal mask to tell the model that the padding tokens are not to be considered as inputs.
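For illustration, this is what a causal mask can look like under the common additive convention, where 0 keeps a position visible and a large negative value hides it; the exact convention expected by the converted model is defined in the example runner linked above, so treat this as a hedged sketch rather than the project's implementation.

```python
import torch

# A 4x4 causal mask: positions above the diagonal (future tokens) are hidden.
query_length, end_step = 4, 4
mask = torch.full((1, 1, query_length, end_step), float("-inf"), dtype=torch.float16)
mask = torch.triu(mask, diagonal=1)   # keep -inf strictly above the diagonal, 0 elsewhere

# With batching, columns corresponding to padding tokens would also be set to
# the large negative value so the model ignores them.
print(mask)
```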
State Preparation
The PyTorch modeling code uses keyCache and valueCache as the names of the cache buffers to hold the kv-cache. Those blocks are allocated for the maximum context length (2048). We use coremltools' new StateType to specify that those blocks must be converted to a stateful Core ML buffer during conversion.

```python
# Specify kv-cache states by using `StateType`.
states = [
    ct.StateType(
        wrapped_type=ct.TensorType(shape=torch_model.kv_cache_shape, dtype=np.float16),
        name="keyCache",
    ),
    ct.StateType(
        wrapped_type=ct.TensorType(shape=torch_model.kv_cache_shape, dtype=np.float16),
        name="valueCache",
    ),
]
```

Core ML Conversion
To convert the model to Core ML, we need to specify the input and output types, as well as the states. The converted model will use float16 precision because that's what we specified for the input data. We also need to indicate the minimum deployment target as iOS18, as that's where these features are available. (We can also use macOS15, which refers to the same conversion target.)

```python
mlmodel_fp16 = ct.convert(
    traced_model,
    inputs=inputs,
    states=states,
    outputs=outputs,
    minimum_deployment_target=ct.target.iOS18,
    skip_model_load=True,
)
```

Model Compression
Using the new block-wise quantization strategies described above, we use 4-bit linear quantization with block size 32. This will greatly reduce model size and make the model run faster. Even though computation will still be performed in float16, weights are transferred in 4-bit mode and decompressed on the fly, which is more efficient than transferring a large amount of 16-bit weights.
The quantization parameters are configured as follows:
```python
op_config = ct.optimize.coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)
config = ct.optimize.coreml.OptimizationConfig(global_config=op_config)
```

Let's use that configuration to quantize the model. The following line will take a few minutes to run:

```python
mlmodel_int4 = ct.optimize.coreml.linear_quantize_weights(mlmodel_fp16, config=config)
mlmodel_int4.save("StatefulMistral7BInstructInt4.mlpackage")
```

There's a final step after conversion and quantization are done. We need to include a piece of additional metadata that indicates the model identifier we used (mistralai/Mistral-7B-Instruct-v0.3). The Swift code will use this to download the tokenizer files from the Hub. Tokenization is converting text data to the numerical representations used by models, and it's different for every model.
```python
mlmodel_int4._spec.description.metadata.userDefined.update({"co.huggingface.exporters.name": MODEL_ID})
```

The generated model is an mlpackage of about 3.8G, compared with the 14G that a float16 conversion would produce. You can find it here on the Hub.
Running Mistral 7B with Swift
If you followed the steps above or downloaded the model from the Hub, you can run it locally using the preview branch of swift-transformers. Apple engineers contributed this support to the project, including the following important features:
- Full MLTensor support, which greatly simplifies pre- and post-processing tasks, and allows us to delete many lines of low-level, confusing and fragile code.
- Support for the Swift counterpart of the Stateful API.
Since adopting these features is a breaking change and requires iOS 18 or macOS 15, we'll keep them in a preview branch for now.

To run the model from the command line, please first clone the preview branch from the GitHub repo:

```bash
git clone -b preview https://github.com/huggingface/swift-transformers
```

And then run the CLI to test the model:
```bash
# To run in release mode, pass -c release
swift run transformers "Best recommendations for a place to visit in Paris in August 2024:" --max-length 128 Examples/Mistral7B/StatefulMistral7BInstructInt4.mlpackage
```

For easier testing, you can also use swift-chat, a simple app we wrote to show how to integrate the swift-transformers package. You have to use the preview branch as well. An example of swift-chat running the converted Mistral model was shown at the beginning of this post.
Running Mistral 7B with Python
For those of you who are more familiar with Python, it’s just as easy!
```bash
python3 generate.py Examples/Mistral7B/StatefulMistral7BInstructInt4.mlpackage --prompt "Best recommendations for a place to visit in Paris in August 2024:"
```

coremltools makes it just as easy to run Core ML models with Python.
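If you'd rather write your own loop than use the example script, the sketch below shows the general shape of stateful prediction with coremltools, based on our reading of the stateful-models user guide. Treat it as a hedged sketch: verify the exact method names (make_state and the state argument to predict), the placeholder token id, and the mask convention against the guide and the example runner.

```python
import numpy as np
import coremltools as ct

model = ct.models.MLModel("StatefulMistral7BInstructInt4.mlpackage")

# A stateful model owns its kv-cache; create the state once and reuse it
# across calls so the cache never has to leave the Core ML runtime.
state = model.make_state()

# Hypothetical single-step call with an already-tokenized input. A real loop
# would feed the prompt first, then one new token per iteration, growing the mask.
input_ids = np.array([[1]], dtype=np.int32)            # placeholder token id
causal_mask = np.zeros((1, 1, 1, 1), dtype=np.float16)  # single visible position
outputs = model.predict(
    {"inputIds": input_ids, "causalMask": causal_mask},
    state=state,
)
logits = outputs["logits"]
next_token = int(np.argmax(logits[0, -1]))
```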
What's Next?
We are extremely excited about the progress in Core ML and coremltools this year, and we are looking forward to seeing lots of third-party apps leveraging ML models to solve real tasks people need. On our side, we are committed to making this as easy as possible so developers can concentrate on creating cool apps. There are a few things on our drawing board:
- The model updates presented here are excellent for GPUs on Mac computers. Core ML can also use the Neural Engine, which is particularly efficient on iPhones. Getting the most performance out of the Neural Engine requires some additional adaptations, which we plan to carry out on a few example models. This work will be based on the learnings discussed in this 2022 (and still very relevant) article by Apple. We won't run Mistral 7B on iPhone, but there are several smaller models, like Apple's OpenELM or DCLM, that make for great candidates to explore!
- The code presented here is highly experimental. As summer goes on, we plan to adopt these methods and incorporate them into exporters, a Python tool designed to convert transformers models to Core ML. Hopefully, you'll soon be able to convert many interesting model architectures very easily.
- We'll keep working on the preview branch of swift-transformers to incorporate new features or API changes as they are released. If you are interested, keep an eye on it!
How can you help?
The tools released by Apple at WWDC help us on our long-term goal to make AI easy and accessible to all, and we'd love to see where you can take them. The example we showed is experimental, but you can use it to convert any Mistral fine-tune to Core ML – please let us know if you do! If you want to try other model architectures, please feel free to open issues or PRs to the preview branch of swift-transformers – we'll try to help you get going!
There’s never been a better time than today to apply your creativity tosolve problems that interest you! Go try things, have fun, and tell ushow we can help.


