Transformers documentation

Llama4

This model was released on 2025-04-05 and added to Hugging Face Transformers on 2025-04-05.

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:

The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.
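
To make the "active versus total" distinction concrete, here is a rough back-of-the-envelope sketch. It only uses the approximate parameter counts quoted above; the dictionary layout and the helper name `active_fraction` are illustrative and not part of any library.

```python
# Approximate parameter counts quoted above, in billions.
MODELS = {
    "Llama 4 Maverick": {"active": 17, "total": 400, "experts": 128},
    "Llama 4 Scout": {"active": 17, "total": 109, "experts": 16},
}


def active_fraction(active_b, total_b):
    """Share of the total parameters that is active at a time."""
    return active_b / total_b


for name, spec in MODELS.items():
    share = active_fraction(spec["active"], spec["total"])
    print(f"{name}: {share:.1%} of parameters active ({spec['experts']} experts)")
```

Both models activate the same number of parameters, so the difference between them lies in how much total capacity sits behind that active subset.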

Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
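
As a rough illustration of why precision matters for deployment, the sketch below estimates the weights-only memory footprint at different bit widths. It assumes the approximate parameter counts quoted above and that every parameter is stored at the given width; it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Approximate totals from the text above (assumption: all weights share one precision).
APPROX_PARAMS = {"Scout": 109e9, "Maverick": 400e9}
PRECISIONS = {"BF16": 16, "8-bit": 8, "4-bit": 4}

for model, num_params in APPROX_PARAMS.items():
    for label, bits in PRECISIONS.items():
        print(f"{model} @ {label}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions, Scout's weights shrink from roughly 218 GB in BF16 to roughly 54.5 GB at 4 bits.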

You can find all the original Llama checkpoints under the meta-llama organization.

The Llama 4 family of models comes in two flavors: 109B and 402B parameters. Both flavors are extremely large and won’t fit on your run-of-the-mill device. See below for some examples to reduce the memory usage of the model.
For the download to be faster and more resilient, we recommend installing the hf_xet dependency as follows:
pip install transformers[hf_xet]

The examples below demonstrate how to generate with Pipeline or the AutoModel. We additionally add an example showcasing how to toggle the right attributes to enable very long-context generations, as some flavors of Llama 4 have context lengths going up to 10 million tokens.

Pipeline

from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

messages = [
    {"role": "user", "content": "what is the recipe of mayonnaise?"},
]

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
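
The final indexing expression in the pipeline example can look opaque. For chat-style input, the pipeline returns the whole conversation with the new assistant message appended, and the expression picks the text of that last message. The sketch below mimics this with a hand-written stand-in for the pipeline output; the assistant text and the helper name `last_reply` are invented for illustration.

```python
# Hand-written stand-in for what `pipe(messages, ...)` returns for chat input.
# The assistant text below is made up; a real run produces the model's answer.
mock_output = [
    {
        "generated_text": [
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
            {"role": "assistant", "content": "Whisk an egg yolk, then slowly add oil..."},
        ]
    }
]


def last_reply(output):
    """Return the content of the final message of the first generated conversation."""
    return output[0]["generated_text"][-1]["content"]


print(last_reply(mock_output))
```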

Efficiency: how to get the best out of Llama 4

The Attention methods

Updating the default attention function can significantly improve compute performance as well as memory usage. Refer to the Attention Interface overview for an in-depth explanation of our interface.

As of release, the Llama 4 model supports the following attention methods: eager, flex_attention, and sdpa. We recommend using flex_attention for best results. The attention mechanism is selected at the model initialization step:

Flex Attention

Setting Flex Attention ensures the best results with the very long context the model can handle.

Beware: the example below uses both device_map="auto" and flex-attention. Please use torchrun to run this example in tensor-parallel mode.
We will work to enable running with device_map="auto" and flex-attention without tensor-parallel in the future.

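from transformers import Llama4ForConditionalGeneration
import torch

# Same checkpoint as in the other examples on this page
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    dtype=torch.bfloat16,
)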
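
Since the attention method is just a string passed at load time, switching between the three supported options can be wrapped in a small helper. The following is a minimal sketch, not part of the library: `SUPPORTED_ATTENTION`, `attention_kwargs`, and `load_llama4` are hypothetical names, and the heavy imports are deferred so the validation logic can be tried without loading a model.

```python
# The three attention methods listed above as supported at release.
SUPPORTED_ATTENTION = ("eager", "flex_attention", "sdpa")


def attention_kwargs(attn_implementation="flex_attention"):
    """Build the load-time keyword arguments for a given attention method."""
    if attn_implementation not in SUPPORTED_ATTENTION:
        raise ValueError(
            f"Unsupported attention method {attn_implementation!r}; "
            f"expected one of {SUPPORTED_ATTENTION}"
        )
    return {"attn_implementation": attn_implementation, "device_map": "auto"}


def load_llama4(model_id, attn_implementation="flex_attention"):
    """Load the model with the chosen attention method (downloads the checkpoint)."""
    # Imported lazily so the helper above stays usable without torch installed.
    import torch
    from transformers import Llama4ForConditionalGeneration

    return Llama4ForConditionalGeneration.from_pretrained(
        model_id,
        dtype=torch.bfloat16,
        **attention_kwargs(attn_implementation),
    )


print(attention_kwargs("sdpa"))
```

Calling `load_llama4("meta-llama/Llama-4-Scout-17B-16E-Instruct", "sdpa")` would then load the same checkpoint with SDPA instead of Flex Attention. The torchrun caveat above still applies when flex_attention is combined with device_map="auto".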

Quantization

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for available quantization backends. At time of release, both FBGEMM and LLM-Compressor are supported; more quantization methods will be supported in the days that follow the release.
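
To give an intuition for what "representing the weights in a lower precision" means, here is a deliberately simplified sketch of symmetric round-to-nearest int8 quantization on a handful of made-up weights. It is a generic illustration only; it is not the FP8 scheme used by FBGEMM or LLM-Compressor, and the function names are invented for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0]  # made-up values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight now needs 8 bits instead of 16 or 32, at the cost of a small error.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is the trade-off quantization makes in exchange for the smaller footprint.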

See below for examples using both:

Here is an example loading a BF16 model in FP8 using the FBGEMM approach:

FBGEMM

from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    quantization_config=FbgemmFp8Config()
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
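
The slice applied to `outputs` before decoding is there because generation returns the prompt tokens followed by the newly generated ones; dropping the first `input_ids` length worth of positions keeps only the new tokens. The sketch below reproduces that step with plain Python lists and made-up token ids instead of tensors; `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` tokens from every generated sequence."""
    return [sequence[prompt_length:] for sequence in sequences]


prompt_ids = [101, 7, 8, 9]               # made-up token ids for the prompt
generated = [prompt_ids + [42, 43, 44]]   # prompt followed by new tokens

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```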

Offloading

Enabling CPU-offloading means that components of the model might be moved to CPU instead of GPU when the available GPU memory isn’t sufficient to load the entire model. At inference, different components will be loaded/unloaded from/to the GPU on the fly. This ensures that the model can be loaded on smaller machines as long as the CPU memory is sufficient. However, this also slows down inference as it adds communication overhead.

In order to enable CPU-offloading, you simply need to specify the device_map to auto at model load:

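from transformers import Llama4ForConditionalGeneration
import torch

# Same checkpoint as in the other examples on this page
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)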
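
After loading with device_map="auto", it can be useful to check how much of the model actually ended up off the GPU. The placement is recorded as a mapping from module names to devices (exposed as `model.hf_device_map`). The sketch below summarizes such a mapping; the module names and placements are invented for illustration, and `summarize_device_map` is a hypothetical helper rather than a library function.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules were placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))


# Hypothetical placement; with a real model, pass `model.hf_device_map` instead.
mock_device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "model.layers.2": "cpu",
    "lm_head": "disk",
}

print(summarize_device_map(mock_device_map))
```

A large share of modules on "cpu" or "disk" is a hint that inference will be dominated by the transfer overhead described above.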
