This model was released on 2025-04-05 and added to Hugging Face Transformers on 2025-04-05.
Llama4
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:
- The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
- The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).
For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit (int4) or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
You can find all the original Llama checkpoints under the meta-llama organization.
The Llama 4 family of models comes in two flavors: 109B and 402B parameters. Both of these flavors are extremely large and won’t fit on your run-of-the-mill device. See below for some examples to reduce the memory usage of the model.
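To make the size claim concrete, the following is a rough back-of-the-envelope sketch of the memory needed just to hold the weights at different precisions. The byte-per-parameter table and the helper name `weight_memory_gb` are illustrative assumptions, and the estimate ignores activations, the KV cache, and framework overhead.

```python
# Rough weights-only estimate; real usage is higher (activations, KV cache, overhead).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Approximate total parameter counts quoted in this document.
MODELS = {"Scout": 109e9, "Maverick": 402e9}

for name, num_params in MODELS.items():
    for dtype in ("bf16", "int8", "int4"):
        print(f"{name:9s} {dtype:5s} ~{weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions, even the smaller model needs on the order of hundreds of gigabytes in BF16, which is why the quantization and offloading options below matter.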
For the download to be faster and more resilient, we recommend installing the hf_xet dependency as follows:
pip install transformers[hf_xet]
The examples below demonstrate how to generate with Pipeline or the AutoModel. We additionally add an example showcasing how to toggle the right attributes to enable very long-context generations, as some flavors of Llama 4 have context lengths going up to 10 million tokens.
from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

messages = [
    {"role": "user", "content": "what is the recipe of mayonnaise?"},
]

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
Efficiency: how to get the best out of Llama 4
The Attention methods
Updating the default attention function can significantly improve compute performance as well as memory usage. Refer to the Attention Interface overview for an in-depth explanation of our interface.
As of release, the Llama 4 model supports the following attention methods: eager, flex_attention, sdpa. We recommend using flex_attention for best results. Switching the attention mechanism is done at the model initialization step:
Setting Flex Attention ensures the best results with the very long context the model can handle.
Beware: the example below uses both device_map="auto" and flex-attention. Please use torchrun to run this example in tensor-parallel mode. We will work to enable running with device_map="auto" and flex-attention without tensor-parallel in the future.
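from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Select the attention implementation when the model is instantiated.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    dtype=torch.bfloat16,
)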
Quantization
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for available quantization backends. At the time of release, both FBGEMM and LLM-Compressor are supported; more quantization methods will be supported in the days that follow the release.
See below for examples using both:
Here is an example loading a BF16 model in FP8 using the FBGEMM approach:
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)

# Load the BF16 checkpoint and quantize it to FP8 on the fly.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    quantization_config=FbgemmFp8Config(),
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
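To illustrate what "representing the weights in a lower precision" means, here is a toy sketch of symmetric absmax int8 quantization in plain Python. This is not the FBGEMM FP8 scheme used above and not how any particular backend is implemented; the function names are hypothetical and the example only shows the general trade-off between storage size and rounding error.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"scale={scale:.6f}", f"max_error={max_error:.6f}")
```

Each weight is stored as one small integer plus a shared scale, which is where the memory saving comes from.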
Offloading
Enabling CPU-offloading means that components of the model might be moved to CPU instead of GPU in case the GPU memory available isn’t sufficient to load the entire model. At inference, different components will be loaded/unloaded from/to the GPU on the fly. This ensures that the model can be loaded on smaller machines as long as the CPU memory is sufficient. However, this also slows down inference as it adds communication overhead.
In order to enable CPU-offloading, you simply need to set device_map to "auto" at model load:
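from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# device_map="auto" places what fits on the GPU(s) and offloads the rest.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)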
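The idea behind offloading can be sketched as a simple placement plan: put each component on the GPU while it still fits in the memory budget, and send the rest to the CPU. The helper `plan_placement` and the layer sizes below are hypothetical and only illustrate the concept; the actual placement logic behind device_map="auto" is more involved.

```python
def plan_placement(layer_sizes_gb, gpu_budget_gb):
    """Place each layer on the GPU if it still fits in the budget, else on the CPU."""
    placement = {}
    used_gb = 0.0
    for index, size_gb in enumerate(layer_sizes_gb):
        if used_gb + size_gb <= gpu_budget_gb:
            placement[index] = "gpu"
            used_gb += size_gb
        else:
            placement[index] = "cpu"
    return placement


# Four hypothetical 4 GB layers with a 10 GB GPU budget.
plan = plan_placement([4.0, 4.0, 4.0, 4.0], gpu_budget_gb=10.0)
print(plan)
```

Layers placed on the CPU have to be moved to the GPU when they are needed, which is the communication overhead mentioned above.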
Llama4Config
class transformers.Llama4Config
(vision_config = None, text_config = None, boi_token_index = 200080, eoi_token_index = 200081, image_token_index = 200092, tie_word_embeddings = False, **kwargs)
Parameters
- vision_config (Llama4VisionConfig, optional) — The Llama4 Vision config.
- text_config (Llama4TextConfig, optional) — The Llama4 Text config.
- boi_token_index (int, optional, defaults to 200080) — The begin-of-image token index to wrap the image prompt.
- eoi_token_index (int, optional, defaults to 200081) — The end-of-image token index to wrap the image prompt.
- image_token_index (int, optional, defaults to 200092) — The image token index to encode the image prompt.
- tie_word_embeddings (bool, optional, defaults to False) — Whether the model’s input and output word embeddings should be tied.
This is the configuration class to store the configuration of a Llama4Model. It is used to instantiate a Llama4 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.
e.g. meta-llama/Llama-4-Scout-17B-16E
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
>>> from transformers import Llama4Model, Llama4Config

>>> # Initializing a Llama4 109B style configuration
>>> configuration = Llama4Config()

>>> # Initializing a model from the Llama4 109B style configuration
>>> model = Llama4Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
Llama4TextConfig
class transformers.Llama4TextConfig
(vocab_size = 202048, hidden_size = 5120, intermediate_size = 8192, intermediate_size_mlp = 16384, num_hidden_layers = 48, num_attention_heads = 40, num_key_value_heads = 8, head_dim = 128, hidden_act = 'silu', max_position_embeddings = 131072, initializer_range = 0.02, rms_norm_eps = 1e-05, use_cache = True, pad_token_id = None, bos_token_id = 1, eos_token_id = 2, tie_word_embeddings = False, attention_dropout = 0.0, num_experts_per_tok = 1, num_local_experts = 16, moe_layers = None, interleave_moe_layer_step = 1, use_qk_norm = True, output_router_logits = False, router_aux_loss_coef = 0.001, router_jitter_noise = 0.0, rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None, no_rope_layers = None, no_rope_layer_interval = 4, attention_chunk_size = 8192, layer_types = None, attn_temperature_tuning = True, floor_scale = 8192, attn_scale = 0.1, **kwargs)
Parameters
- vocab_size (int, optional, defaults to 202048) — Vocabulary size of the Llama4 text model. Defines the maximum number of different tokens that can be represented by the input_ids passed when calling Llama4TextModel.
- hidden_size (int, optional, defaults to 5120) — Dimensionality of the embeddings and hidden states.
- intermediate_size (int, optional, defaults to 8192) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
- intermediate_size_mlp (int, optional, defaults to 16384) — TODO
- num_hidden_layers (int, optional, defaults to 48) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 40) — Number of attention heads for each attention layer in the Transformer encoder.
- num_key_value_heads (int, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If not specified, will default to num_attention_heads.
- head_dim (int, optional, defaults to 128) — TODO
- hidden_act (str or Callable, optional, defaults to "silu") — The non-linear activation function (function or string) in the encoder and pooler.
- max_position_embeddings (int, optional, defaults to 131072) — The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions.
- pad_token_id (int, optional, defaults to 128004) — The id of the padding token.
- bos_token_id (int, optional, defaults to 1) — The id of the beginning of sentence token.
- eos_token_id (int, optional, defaults to 2) — The id of the end of sentence token.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings.
- attention_dropout (float, optional, defaults to 0.0) — TODO
- num_experts_per_tok (int, optional, defaults to 1) — TODO
- num_local_experts (int, optional, defaults to 16) — TODO
- moe_layers (int, optional) — TODO
- interleave_moe_layer_step (int, optional, defaults to 1) — TODO
- use_qk_norm (bool, optional, defaults to True) — TODO
- output_router_logits (bool, optional, defaults to False) — TODO
- router_aux_loss_coef (float, optional, defaults to 0.001) — TODO
- router_jitter_noise (float, optional, defaults to 0.0) — TODO
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
- no_rope_layers (list[int], optional) — List with at least the same length as the number of layers in the model. A 1 at an index position indicates that the corresponding layer will use RoPE, while a 0 indicates that it’s a NoPE layer.
- no_rope_layer_interval (int, optional, defaults to 4) — If no_rope_layers is None, it will be created using a NoPE layer every no_rope_layer_interval layers.
- attention_chunk_size (int, optional, defaults to 8192)
- layer_types (list, optional) — Attention pattern for each layer.
- attn_temperature_tuning (bool, optional, defaults to True) — Whether to dynamically scale the attention temperature for each query token based on sequence length. Recommended for long sequences (e.g., >32k tokens) to maintain stable output results.
- floor_scale (int, optional, defaults to 8192) — TODO
- attn_scale (float, optional, defaults to 0.1) — TODO
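The relationship between no_rope_layers and no_rope_layer_interval described above can be sketched in a few lines. This is a minimal illustration that assumes layers are counted from 1 and that every no_rope_layer_interval-th layer is the NoPE layer; the helper name `default_no_rope_layers` is hypothetical, and the library may pick the offset differently.

```python
def default_no_rope_layers(num_hidden_layers: int, no_rope_layer_interval: int) -> list[int]:
    """Build a per-layer flag list: 1 means the layer uses RoPE, 0 means NoPE.

    Assumption: counting layers from 1, every `no_rope_layer_interval`-th
    layer is a NoPE layer.
    """
    return [
        int((layer_idx + 1) % no_rope_layer_interval != 0)
        for layer_idx in range(num_hidden_layers)
    ]


# Small example: 8 layers with a NoPE layer every 4 layers.
print(default_no_rope_layers(8, 4))

# With the documented defaults (48 layers, interval 4).
flags = default_no_rope_layers(48, 4)
print(f"RoPE layers: {sum(flags)}, NoPE layers: {flags.count(0)}")
```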
This is the configuration class to store the configuration of a Llama4TextModel. It is used to instantiate a Llama4 text model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.
e.g. meta-llama/Llama-4-Scout-17B-16E
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
Llama4VisionConfig
class transformers.Llama4VisionConfig
(hidden_size: typing.Optional[int] = 768, hidden_act: typing.Optional[str] = 'gelu', num_hidden_layers: typing.Optional[int] = 34, num_attention_heads: typing.Optional[int] = 16, num_channels: typing.Optional[int] = 3, intermediate_size: typing.Optional[int] = 5632, vision_output_dim: typing.Optional[int] = 7680, image_size: typing.Optional[int] = 448, patch_size: typing.Optional[int] = 14, norm_eps: typing.Optional[float] = 1e-05, vision_feature_select_strategy: typing.Optional[str] = 'default', initializer_range: typing.Optional[float] = 0.02, pixel_shuffle_ratio: typing.Optional[float] = 0.5, projector_input_dim: typing.Optional[int] = 4096, projector_output_dim: typing.Optional[int] = 4096, multi_modal_projector_bias: typing.Optional[bool] = False, projector_dropout: typing.Optional[float] = 0.0, attention_dropout: typing.Optional[float] = 0.0, rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None, **kwargs)
Parameters
- hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
- hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.
- num_hidden_layers (int, optional, defaults to 34) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
- num_channels (int, optional, defaults to 3) — Number of channels in the input image.
- intermediate_size (int, optional, defaults to 5632) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
- vision_output_dim (int, optional, defaults to 7680) — Dimensionality of the vision model output. Includes output of transformer encoder with intermediate layers and global transformer encoder.
- image_size (int, optional, defaults to 448) — The size (resolution) of each image tile.
- patch_size (int, optional, defaults to 14) — The size (resolution) of each patch.
- norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
- vision_feature_select_strategy (str, optional, defaults to "default") — TODO
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- pixel_shuffle_ratio (float, optional, defaults to 0.5) — TODO
- projector_input_dim (int, optional, defaults to 4096) — TODO
- projector_output_dim (int, optional, defaults to 4096) — TODO
- multi_modal_projector_bias (bool, optional, defaults to False) — TODO
- projector_dropout (float, optional, defaults to 0.0) — TODO
- attention_dropout (float, optional, defaults to 0.0) — TODO
- rope_parameters (RopeParameters, optional) — RoPE Parameters
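As a rough illustration of how image_size, patch_size, and pixel_shuffle_ratio interact, the sketch below counts the patches in one image tile and the number of positions left after a pixel-shuffle step. It assumes that pixel shuffle shrinks each spatial side by pixel_shuffle_ratio, so the patch count shrinks by the ratio squared; that behavior and the helper name `tokens_per_tile` are assumptions, not something stated in this reference.

```python
def tokens_per_tile(image_size: int, patch_size: int, pixel_shuffle_ratio: float) -> tuple[int, int]:
    """Return (patches per tile, positions per tile after pixel shuffle).

    Assumption: pixel shuffle reduces each spatial side by `pixel_shuffle_ratio`,
    so the number of positions is multiplied by the ratio squared.
    """
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    num_tokens = int(num_patches * pixel_shuffle_ratio ** 2)
    return num_patches, num_tokens


# Documented defaults: 448-pixel tiles, 14-pixel patches, ratio 0.5.
num_patches, num_tokens = tokens_per_tile(448, 14, 0.5)
print(f"{num_patches} patches per tile, {num_tokens} positions after pixel shuffle")
```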
This is the configuration class to store the configuration of a Llama4VisionModel. It is used to instantiate a Llama4 vision model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llama4 109B.
e.g. meta-llama/Llama-4-Scout-17B-16E
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Llama4Processor
class transformers.Llama4Processor
<source>(image_processor = Nonetokenizer = Nonepatch_size: int = 14pixel_shuffle_ratio: float = 0.5fake_image_token = '<|image|>'image_token = '<|image|>'start_of_image_token = '<|image_start|>'end_of_image_token = '<|image_end|>'patch_token = '<|patch|>'tile_x_separator_token = '<|tile_x_separator|>'tile_y_separator_token = '<|tile_y_separator|>'chat_template = '{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now("%d %b %Y") %}\n {%- else %}\n {%- set date_string = "26 Jul 2024" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0][\'role\'] == \'system\' %} \n {%- if messages[0][\'content\'] is string %}\n {%- set system_message = messages[0][\'content\']|trim %}\n {%- else %}\n {#- FIXME: The processor requires an array, always. #}\n {%- set system_message = messages[0][\'content\'][0][\'text\']|trim %}\n {%- endif %}\n {%- set messages = messages[1:] %}\n {%- set user_supplied_system_message = true %}\n{%- else %}\n {%- set system_message = "" %}\n {%- set user_supplied_system_message = false %}\n{%- endif %}\n\n{#- System message if the user supplied one #}\n{%- if user_supplied_system_message %}\n {{- "<|header_start|>system<|header_end|>\n\n" }}\n {%- if tools is not none %}\n {{- "Environment: ipython\n" }}\n {%- endif %}\n {%- if tools is not none and not tools_in_user_message %}\n {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." 
}}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {%- endif %}\n {{- system_message }}\n {{- "<|eot|>" }}\n{%- endif %}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0][\'content\']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception("Cannot put tools in the first user message when there\'s no first user message!") }}\n{%- endif %}\n {{- \'<|header_start|>user<|header_end|>\n\n\' -}}\n {{- "Given the following functions, please respond with a JSON for a function call " }}\n {{- "with its proper arguments that best answers the given prompt.\n\n" }}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {{- first_user_message + "<|eot|>"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == \'ipython\' or message.role == \'tool\' or \'tool_calls\' in message) %}\n {{- \'<|header_start|>\' + message[\'role\'] + \'<|header_end|>\n\n\' }}\n {%- if message[\'content\'] is string %}\n {{- message[\'content\'] }}\n {%- else %}\n {%- for content in message[\'content\'] %}\n {%- if content[\'type\'] == \'image\' %}\n {{- \'<|image|>\' }}\n {%- elif content[\'type\'] == \'text\' %}\n {{- content[\'text\'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- "<|eot|>" }}\n {%- elif \'tool_calls\' in message and message.tool_calls|length > 0 %}\n {{- \'<|header_start|>assistant<|header_end|>\n\n\' -}}\n {{- 
\'<|python_start|>\' }}\n {%- if message[\'content\'] is string %}\n {{- message[\'content\'] }}\n {%- else %}\n {%- for content in message[\'content\'] %}\n {%- if content[\'type\'] == \'image\' %}\n {{- \'<|image|>\' }}\n {%- elif content[\'type\'] == \'text\' %}\n {{- content[\'text\'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- \'<|python_end|>\' }}\n {%- for tool_call in message.tool_calls %}\n {{- \'{"name": "\' + tool_call.function.name + \'", \' }}\n {{- \'"parameters": \' }}\n {{- tool_call.function.arguments | tojson }}\n {{- "}" }}\n {%- endfor %}\n {{- "<|eot|>" }}\n {%- elif message.role == "tool" or message.role == "ipython" %}\n {{- "<|header_start|>ipython<|header_end|>\n\n" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- "<|eot|>" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|header_start|>assistant<|header_end|>\n\n\' }}\n{%- endif %}\n'**kwargs)
Parameters
- image_processor (AutoImageProcessor, optional) — The image processor is a required input.
- tokenizer ([PreTrainedTokenizer, PreTrainedTokenizerFast], optional) — The tokenizer is a required input.
- patch_size (int, optional, defaults to 28) — The size of image patches for tokenization.
- img_size (int, optional, defaults to 364) — The size of the image to be tokenized. This should correspond to the size given to the image processor.
- image_token (str, optional, defaults to "<|image|>") — The token to be used to represent an image in the text.
- downsample_factor (int, optional, defaults to 1) — The factor by which to scale the patch size.
- start_of_img_token (str, optional, defaults to "<|START_OF_IMG|>") — The token to be used to represent the start of an image in the text.
- end_of_img_token (str, optional, defaults to "<|END_OF_IMG|>") — The token to be used to represent the end of an image in the text.
- img_patch_token (str, optional, defaults to "<|IMG_PATCH|>") — The token to be used to represent an image patch in the text.
- img_line_break_token (str, optional, defaults to "<|IMG_LINE_BREAK|>") — The token to be used to represent a line break in the text.
- tile_token (str, optional, defaults to "TILE") — The token to be used to represent an image patch in the text.
- tile_global_token (str, optional, defaults to "TILE_GLOBAL") — The token to be used to represent the cover image in the text.
- chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
Constructs a Llama4 processor which wraps an AutoImageProcessor and PretrainedTokenizerFast tokenizer into a single processor that inherits both the image processor and tokenizer functionalities. See the __call__() and decode() for more information.
Llama4ImageProcessorFast
class transformers.Llama4ImageProcessorFast
(**kwargs: typing_extensions.Unpack[transformers.models.llama4.image_processing_llama4_fast.Llama4ImageProcessorKwargs])
Constructs a fast Llama4 image processor.
preprocess
(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], **kwargs: typing_extensions.Unpack[transformers.models.llama4.image_processing_llama4_fast.Llama4ImageProcessorKwargs]) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
- do_resize (bool, optional) — Whether to resize the image.
- size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
- crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
- resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
- do_rescale (bool, optional) — Whether to rescale the image.
- rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional) — Whether to normalize the image.
- image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
- image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
- do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
- pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size in {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
- do_center_crop (bool, optional) — Whether to center crop the image.
- data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
- input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - "none" or ChannelDimension.NONE: image in (height, width) format.
- device (Annotated[str, None], optional) — The device to process the images on. If unset, the device is inferred from the input images.
- return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
- disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
- max_patches (int, optional, defaults to 16) — The maximum number of patches to be extracted from the image. Can be overridden by the max_patches parameter in the preprocess method.
- resize_to_max_canvas (bool, optional, defaults to False) — Whether to resize the image to the maximum canvas size. If True, picks the canvas that allows the largest resizing without distortion. If False, downsample as little as possible, including no resizing at all, but never upsample, unless the image is smaller than the patch size.
Returns
<class 'transformers.image_processing_base.BatchFeature'>
- data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.
rescale_and_normalize
(images: torch.Tensor, do_rescale: bool, rescale_factor: float, do_normalize: bool, image_mean: typing.Union[float, list[float]], image_std: typing.Union[float, list[float]])
Rescale and normalize images. Override to rescale and normalize the images in torch.bfloat16 as in the original implementation.
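The arithmetic behind rescaling and normalization can be sketched for a single pixel in plain Python. This is only an illustration of the standard formula (multiply by rescale_factor, then subtract the mean and divide by the standard deviation per channel); the helper name and the mean and standard deviation values are hypothetical, and the sketch uses ordinary Python floats rather than torch.bfloat16.

```python
def rescale_and_normalize_pixel(
    pixel, rescale_factor, image_mean, image_std, do_rescale=True, do_normalize=True
):
    """Apply rescaling and per-channel normalization to one [R, G, B] pixel."""
    values = [float(v) for v in pixel]
    if do_rescale:
        # Map raw values (for example 0..255) into a smaller range such as 0..1.
        values = [v * rescale_factor for v in values]
    if do_normalize:
        # Center and scale each channel with its own mean and standard deviation.
        values = [(v - m) / s for v, m, s in zip(values, image_mean, image_std)]
    return values


# Hypothetical statistics: mean 0.5 and std 0.5 for every channel.
result = rescale_and_normalize_pixel(
    [255, 0, 127.5], rescale_factor=1 / 255, image_mean=[0.5] * 3, image_std=[0.5] * 3
)
print(result)
```

With these illustrative statistics, the brightest value maps to roughly 1, the darkest to roughly -1, and the midpoint to roughly 0.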
Llama4ForConditionalGeneration
class transformers.Llama4ForConditionalGeneration
(config: Llama4Config)
forward
(input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, vision_feature_layer: typing.Union[int, list[int], NoneType] = None, vision_feature_select_strategy: typing.Optional[str] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensorof shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.Indices can be obtained usingAutoTokenizer. SeePreTrainedTokenizer.encode() andPreTrainedTokenizer.call() for details.
- pixel_values (
torch.FloatTensorof shape(batch_size, num_channels, image_size, image_size),optional) —The tensors corresponding to the input images. Pixel values can be obtained usingimage_processor_class. Seeimage_processor_class.__call__for details (processor_classusesimage_processor_classfor processing images). - attention_mask (
torch.Tensorof shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that arenot masked,
- 0 for tokens that aremasked.
- position_ids (
torch.LongTensorof shape(batch_size, sequence_length),optional) —Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]. - past_key_values (
~cache_utils.Cache,optional) —Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attentionblocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.OnlyCache instance is allowed as input, see ourkv cache guide.If no
past_key_valuesare passed,DynamicCache will be initialized by default.The model will output the same cache format that is fed as input.
If
past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- vision_feature_layer (Union[int, list[int], NoneType]) — The index of the layer to select the vision feature. If multiple indices are provided, the vision features of the corresponding layers are concatenated to form the vision features.
- vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or large vocabulary sizes. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
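The logits_to_keep behavior described above can be pictured as a slice over the sequence dimension. The sketch below is a toy illustration in plain Python, not the library's implementation: select_positions is a hypothetical helper, and integer positions stand in for per-token hidden states.

```python
# Toy illustration of the logits_to_keep semantics; no real model is involved.
# select_positions is a hypothetical helper, not part of transformers.

def select_positions(positions, logits_to_keep=0):
    """Return the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # -0 == 0, so slice(0, None) keeps every position: the "special case".
        return positions[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as a 1D collection of indices to keep.
    return [positions[i] for i in logits_to_keep]

positions = list(range(6))                      # a sequence of 6 tokens
all_positions = select_positions(positions, 0)  # every position
last_only = select_positions(positions, 1)      # only the final position
picked = select_positions(positions, [0, 5])    # explicit indices
print(all_positions, last_only, picked)
```

During generation only the final position is needed, which is why keeping a single position avoids materializing a (sequence_length, vocab_size) block of scores.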
Returns
transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.llama4.modeling_llama4.Llama4CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
The Llama4ForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, LlavaForConditionalGeneration

>>> model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

>>> prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(images=image, text=prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=15)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"USER: \nWhat's the content of the image? ASSISTANT: The image features a busy city street with a stop sign prominently displayed"
Llama4ForCausalLM
class transformers.Llama4ForCausalLM(config: Llama4TextConfig)
forward(input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or large vocabulary sizes. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
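To make the labels entry above concrete, here is a minimal pure-Python sketch of a next-token cross-entropy that skips positions labelled -100. It assumes the usual causal-LM convention in which the scores at position t are compared with the label at position t + 1, and it averages over the non-ignored positions; causal_lm_loss and IGNORE_INDEX are illustrative names, not library API.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy for one sequence, skipping IGNORE_INDEX.

    logits: list of per-position score lists (each of length vocab_size).
    labels: list of token ids with the same length as logits.
    """
    total, count = 0.0, 0
    # Position t predicts the token at t + 1: pair logits[:-1] with labels[1:].
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked position contributes nothing
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
full = causal_lm_loss(logits, [0, 2, 1])       # both targets are scored
masked = causal_lm_loss(logits, [0, -100, 1])  # the poorly predicted target is ignored
print(full, masked)
```

A common use of this masking is to set the prompt or padding positions to -100 so that only the tokens of interest contribute to the loss.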
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Llama4ForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, Llama4ForCausalLM

>>> model = Llama4ForCausalLM.from_pretrained("meta-llama4/Llama4-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama4/Llama4-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
Llama4TextModel
class transformers.Llama4TextModel(config: Llama4TextConfig)
Parameters
- config (Llama4TextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Llama4 Text Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward(input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
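The interplay between past_key_values, unprocessed input_ids, and cache_position described above can be sketched with a toy cache. Everything here is hypothetical scaffolding (ToyCache and toy_forward are not library classes or functions); it only illustrates the bookkeeping: once a cache holds the earlier tokens, each call receives just the new tokens, and their cache positions continue from the cached length.

```python
class ToyCache:
    """Stand-in for a key/value cache: it only remembers which tokens it has seen."""

    def __init__(self):
        self.seen = []

    def get_seq_length(self):
        return len(self.seen)


def toy_forward(input_ids, cache):
    """Process only the new tokens and return the cache positions they occupy."""
    start = cache.get_seq_length()
    cache_position = list(range(start, start + len(input_ids)))
    cache.seen.extend(input_ids)  # a real cache would store keys and values here
    return cache_position


cache = ToyCache()
prompt = [11, 12, 13, 14]
prefill_positions = toy_forward(prompt, cache)  # first call: the whole prompt
step_positions = toy_forward([15], cache)       # later calls: only the new token
print(prefill_positions, step_positions, cache.get_seq_length())
```

In this picture the second call has unprocessed_length equal to 1, while the cache accounts for the full sequence length of 5.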
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Llama4Config) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Llama4TextModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
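The note above about calling the Module instance rather than forward can be illustrated with a small imitation of the pattern. ToyModule below is a hypothetical class written for this sketch, not torch.nn.Module; it assumes only that calling the instance wraps forward with pre- and post-processing steps, which is the behavior the note describes.

```python
class ToyModule:
    """Minimal imitation of the call/forward split; not torch.nn.Module."""

    def __init__(self):
        self.pre_hooks = []   # run on the input before forward
        self.post_hooks = []  # run on the output after forward

    def forward(self, x):
        # The "recipe" for the computation lives here.
        return x * 2

    def __call__(self, x):
        # Calling the instance wraps forward with the registered hooks.
        for hook in self.pre_hooks:
            x = hook(x)
        out = self.forward(x)
        for hook in self.post_hooks:
            out = hook(out)
        return out


module = ToyModule()
module.pre_hooks.append(lambda x: x + 1)
module.post_hooks.append(lambda out: out - 3)

via_call = module(5)             # hooks run: (5 + 1) * 2 - 3
via_forward = module.forward(5)  # hooks are skipped: 5 * 2
print(via_call, via_forward)
```

The two results differ because the direct forward call bypasses both hooks, which is the silent omission the note warns about.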
Llama4VisionModel
class transformers.Llama4VisionModel(config: Llama4VisionConfig)
forward(pixel_values: Tensor, attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None)
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, MllamaVisionModel

>>> checkpoint = "meta-llama/Llama-3.2-11B-Vision"
>>> model = MllamaVisionModel.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, return_tensors="pt")

>>> output = model(**inputs)

>>> print(output.last_hidden_state.shape)
torch.Size([1, 1, 4, 1025, 7680])