This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06.
Aria
Aria is a multimodal mixture-of-experts (MoE) model. The goal of this model is to open-source a training recipe for creating a multimodal native model from scratch. Aria activates 3.9B parameters per visual token and 3.5B parameters per text token. Text is handled by a MoE decoder and visual inputs are handled by a lightweight visual encoder. It is trained in four stages: language pretraining, multimodal pretraining, multimodal long-context pretraining, and multimodal post-training.
You can find all the original Aria checkpoints under the Aria organization.
Click on the Aria models in the right sidebar for more examples of how to apply Aria to different multimodal tasks.
The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.
```python
import torch
from transformers import pipeline

pipeline = pipeline(
    "image-to-text",
    model="rhymes-ai/Aria",
    device=0,
    dtype=torch.bfloat16,
)
pipeline(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    text="What is shown in this image?",
)
```
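The same task can be run without Pipeline by loading the model and processor directly. The snippet below is a minimal sketch that mirrors the quantized example further down but uses AriaForConditionalGeneration without quantization; the generation settings are illustrative.

```python
import torch
from transformers import AriaForConditionalGeneration, AutoProcessor

model = AriaForConditionalGeneration.from_pretrained(
    "rhymes-ai/Aria", dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, torch.bfloat16)

# Illustrative generation settings; tune max_new_tokens and sampling as needed.
output = model.generate(**inputs, max_new_tokens=15, stop_strings=["<|im_end|>"], tokenizer=processor.tokenizer)
output_ids = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(output_ids, skip_special_tokens=True))
```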
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses torchao to quantize only the weights to int4, together with the rhymes-ai/Aria-sequential_mlp checkpoint. This checkpoint replaces grouped GEMM with torch.nn.Linear layers for easier quantization.
```python
# pip install torchao
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoProcessor

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria-sequential_mlp")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, torch.bfloat16)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
```
AriaImageProcessor
class transformers.AriaImageProcessor
<source>( image_mean: typing.Optional[list[float]] = None, image_std: typing.Optional[list[float]] = None, max_image_size: int = 980, min_image_size: int = 336, split_resolutions: typing.Optional[list[tuple[int, int]]] = None, split_image: typing.Optional[bool] = False, do_convert_rgb: typing.Optional[bool] = True, do_rescale: bool = True, rescale_factor: typing.Union[int, float] = 0.00392156862745098, do_normalize: typing.Optional[bool] = True, resample: Resampling = <Resampling.BICUBIC: 3>, **kwargs )
Parameters
- image_mean (list, optional, defaults to [0.5, 0.5, 0.5]) — Mean values for normalization.
- image_std (list, optional, defaults to [0.5, 0.5, 0.5]) — Standard deviation values for normalization.
- max_image_size (int, optional, defaults to 980) — Maximum image size.
- min_image_size (int, optional, defaults to 336) — Minimum image size.
- split_resolutions (list, optional, defaults to a list of optimal resolutions as tuples) — The optimal resolutions for splitting the image.
- split_image (bool, optional, defaults to False) — Whether to split the image.
- do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
- rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image.
- resample (PILImageResampling, optional, defaults to BICUBIC) — The resampling filter to use if resizing the image.
A vision processor for the Aria model that handles image preprocessing. Initializes the AriaImageProcessor.
get_image_patches
<source>( image: ndarray, grid_pinpoints: list, patch_size: int, resample: Resampling, data_format: ChannelDimension, input_data_format: ChannelDimension ) → list[np.ndarray]
Parameters
- image (np.ndarray) — The input image to be processed.
- grid_pinpoints (list[tuple[int, int]]) — A list of possible resolutions as tuples.
- patch_size (int) — Size of the patches to divide the image into.
- resample (PILImageResampling) — Resampling filter to use if resizing the image.
- data_format (ChannelDimension or str) — The channel dimension format for the output image.
- input_data_format (ChannelDimension or str) — The channel dimension format of the input image.
Returns
list[np.ndarray]
A list of NumPy arrays containing the processed image patches.
Process an image with variable resolutions by dividing it into patches.
get_number_of_image_patches
<source>( height: int, width: int, images_kwargs = None ) → int
A utility that returns the number of image patches for a given image size.
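A hypothetical call is sketched below; the image dimensions are made up, and it assumes images_kwargs accepts the same keyword arguments as preprocess().

```python
from transformers import AriaImageProcessor

image_processor = AriaImageProcessor()

# Hypothetical sizes; images_kwargs is assumed to mirror the preprocess() keyword arguments.
num_patches = image_processor.get_number_of_image_patches(
    height=1080, width=1920, images_kwargs={"split_image": True, "max_image_size": 980}
)
print(num_patches)
```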
pad
<source>( image: ndarray, padding: typing.Union[int, tuple[int, int], collections.abc.Iterable[tuple[int, int]]], mode: PaddingMode = <PaddingMode.CONSTANT: 'constant'>, constant_values: typing.Union[float, collections.abc.Iterable[float]] = 0.0, data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) → np.ndarray
Parameters
- image (np.ndarray) — The image to pad.
- padding (int or tuple[int, int] or Iterable[tuple[int, int]]) — Padding to apply to the edges of the height and width axes. Can be one of three formats: ((before_height, after_height), (before_width, after_width)) for unique pad widths for each axis; ((before, after),) yields the same before and after pad for height and width; (pad,) or int is a shortcut for before = after = pad width for all axes.
- mode (PaddingMode) — The padding mode to use. Can be one of: "constant": pads with a constant value; "reflect": pads with the reflection of the vector mirrored on the first and last values of the vector along each axis; "replicate": pads with the replication of the last value on the edge of the array along each axis; "symmetric": pads with the reflection of the vector mirrored along the edge of the array.
- constant_values (float or Iterable[float], optional) — The value to use for the padding if mode is "constant".
- data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the same as the input image.
- input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the inferred format of the input image.
Returns
np.ndarray
The padded image.
Pads the image with the specified padding and mode. Padding can be in the (height, width) dimension or in the (num_patches) dimension. In the second case, an iterable of tuples is expected as input.
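As a quick illustration of the (height, width) case, the sketch below pads a channels-first NumPy array on the bottom and right edges; the array shape and padding amounts are arbitrary.

```python
import numpy as np
from transformers import AriaImageProcessor

image_processor = AriaImageProcessor()
image = np.zeros((3, 100, 120), dtype=np.float32)  # (num_channels, height, width)

# Pad 28 rows at the bottom and 8 columns on the right.
padded = image_processor.pad(image, padding=((0, 28), (0, 8)))
print(padded.shape)  # expected (3, 128, 128)
```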
preprocess
<source>( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], list[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]]], image_mean: typing.Union[float, list[float], NoneType] = None, image_std: typing.Union[float, list[float], NoneType] = None, max_image_size: typing.Optional[int] = None, min_image_size: typing.Optional[int] = None, split_image: typing.Optional[bool] = None, do_convert_rgb: typing.Optional[bool] = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] = None, do_normalize: typing.Optional[bool] = None, resample: typing.Optional[PIL.Image.Resampling] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = 'pt', data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) → BatchFeature
Parameters
- images (ImageInput or list of ImageInput) — The input image or a list of images.
- image_mean (list, optional, defaults to [0.5, 0.5, 0.5]) — Mean values for normalization.
- image_std (list, optional, defaults to [0.5, 0.5, 0.5]) — Standard deviation values for normalization.
- max_image_size (int, optional, defaults to self.max_image_size (980)) — Maximum image size.
- min_image_size (int, optional, defaults to self.min_image_size (336)) — Minimum image size.
- split_image (bool, optional, defaults to self.split_image (False)) — Whether to split the image.
- do_convert_rgb (bool, optional, defaults to self.do_convert_rgb (True)) — Whether to convert the image to RGB.
- do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
- rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional, defaults to self.do_normalize (True)) — Whether to normalize the image.
- resample (PILImageResampling, optional, defaults to self.resample (BICUBIC)) — The resampling filter to use if resizing the image.
- return_tensors (str or TensorType, optional, defaults to "pt") — The type of tensor to return.
- data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the same as the input image.
- input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the inferred format of the input image.
Returns
BatchFeature
A BatchFeature object containing:
- 'pixel_values': Tensor of processed image pixel values.
- 'pixel_mask': Boolean pixel mask. This mask is a 2D tensor of shape (max_image_size, max_image_size) where:
  - True (1) values indicate pixels that belong to the original resized image.
  - False (0) values indicate pixels that are part of the padding. The mask helps distinguish between actual image content and padded areas in subsequent processing steps.
- 'num_crops': The maximum number of crops across all images.
Process a list of images.
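The sketch below runs the image processor end to end on a dummy PIL image with the default settings; the input size is arbitrary and the shapes in the comments assume the default max_image_size of 980 with split_image=False.

```python
import numpy as np
from PIL import Image
from transformers import AriaImageProcessor

image_processor = AriaImageProcessor()
image = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))  # dummy RGB image

outputs = image_processor(images=[image], return_tensors="pt")
print(outputs["pixel_values"].shape)  # expected (1, 3, 980, 980) with the defaults
print(outputs["pixel_mask"].shape)    # expected (1, 980, 980)
print(outputs["num_crops"])
```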
AriaProcessor
class transformers.AriaProcessor
<source>( image_processor = None, tokenizer: typing.Union[transformers.models.auto.tokenization_auto.AutoTokenizer, str] = None, chat_template: typing.Optional[str] = None, size_conversion: typing.Optional[dict[typing.Union[float, int], int]] = None )
Parameters
- image_processor (AriaImageProcessor, optional) — The AriaImageProcessor to use for image preprocessing.
- tokenizer (PreTrainedTokenizerBase, optional) — An instance of PreTrainedTokenizerBase. This should correspond with the model's text model. The tokenizer is a required input.
- chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
- size_conversion (Dict, optional) — A dictionary indicating size conversions for images.
AriaProcessor is a processor for the Aria model which wraps the Aria image preprocessor and the LLaMA slow tokenizer.
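In practice the processor is loaded from a checkpoint and builds model inputs from a chat prompt and images, as in the sketch below; the image URL and prompt are placeholders.

```python
from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")

# Placeholder image and prompt, following the chat format expected by the model.
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
print(inputs["input_ids"].shape, inputs["pixel_values"].shape)
```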
AriaTextConfig
class transformers.AriaTextConfig
<source>( vocab_size: typing.Optional[int] = 32000, hidden_size: typing.Optional[int] = 4096, intermediate_size: int = 4096, num_hidden_layers: typing.Optional[int] = 32, num_attention_heads: typing.Optional[int] = 32, num_key_value_heads: typing.Optional[int] = None, hidden_act: typing.Optional[str] = 'silu', max_position_embeddings: typing.Optional[int] = 2048, initializer_range: typing.Optional[float] = 0.02, rms_norm_eps: typing.Optional[int] = 1e-06, use_cache: typing.Optional[bool] = True, pad_token_id = 2, bos_token_id: typing.Optional[int] = 1, eos_token_id: typing.Optional[int] = 2, pretraining_tp: typing.Optional[int] = 1, tie_word_embeddings: typing.Optional[bool] = False, rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None, attention_bias: typing.Optional[bool] = False, attention_dropout: typing.Optional[float] = 0.0, mlp_bias: typing.Optional[bool] = False, head_dim: typing.Optional[int] = None, moe_num_experts: int = 8, moe_topk: int = 2, moe_num_shared_experts: int = 2, **kwargs )
Parameters
- vocab_size (int, optional, defaults to 32000) — Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LlamaModel.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 4096) — The size of the MLP representations.
- num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If it is not specified, will default to num_attention_heads.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- pad_token_id (int, optional, defaults to 2) — Padding token id.
- bos_token_id (int, optional, defaults to 1) — Beginning of stream token id.
- eos_token_id (int, optional, defaults to 2) — End of stream token id.
- pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
- attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- mlp_bias (bool, optional, defaults to False) — Whether to use a bias in the up_proj, down_proj and gate_proj layers in the MLP layers.
- head_dim (int, optional) — The attention head dimension. If None, it will default to hidden_size // num_heads.
- moe_num_experts (int, optional, defaults to 8) — The number of experts in the MoE layer.
- moe_topk (int, optional, defaults to 2) — The number of top experts to route to for each token.
- moe_num_shared_experts (int, optional, defaults to 2) — The number of shared experts.
This class handles the configuration for the text component of the Aria model. Instantiating a configuration with the defaults will yield a configuration similar to that of the rhymes-ai/Aria architecture. This class extends LlamaConfig to include additional parameters specific to the Mixture of Experts (MoE) architecture.
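A configuration can be instantiated directly and used to build a randomly initialized text model; the reduced sizes below are purely illustrative.

```python
from transformers import AriaTextConfig, AriaTextModel

# Deliberately small values for illustration; real checkpoints ship their own configuration.
config = AriaTextConfig(
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=2,
    num_attention_heads=4,
    moe_num_experts=4,
    moe_topk=2,
)
model = AriaTextModel(config)  # randomly initialized weights
print(model.config.moe_num_experts)
```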
AriaConfig
class transformers.AriaConfig
<source>( vision_config = None, vision_feature_layer: int = -1, text_config: AriaTextConfig = None, projector_patch_to_query_dict: typing.Optional[dict] = None, image_token_index: int = 9, initializer_range: float = 0.02, **kwargs )
Parameters
- vision_config (AriaVisionConfig or dict, optional) — Configuration for the vision component.
- vision_feature_layer (int, optional, defaults to -1) — The index of the layer to select the vision feature.
- text_config (AriaTextConfig or dict, optional) — Configuration for the text component.
- projector_patch_to_query_dict (dict, optional) — Mapping of patch sizes to query dimensions.
- image_token_index (int, optional, defaults to 9) — Index used to represent image tokens.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer for initializing all weight matrices.
- model_type (str) — Type of the model, set to "aria".
- image_token_index (int) — Index used to represent image tokens.
- projector_patch_to_query_dict (dict) — Mapping of patch sizes to query dimensions.
- vision_config (AriaVisionConfig) — Configuration for the vision component.
- text_config (AriaTextConfig) — Configuration for the text component.
This class handles the configuration for both the vision and text components of the Aria model, as well as additional parameters for image token handling and projector mapping. Instantiating a configuration with the defaults will yield a configuration similar to that of the rhymes-ai/Aria architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
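The sketch below assembles an AriaConfig from a hypothetical, deliberately small text sub-configuration; the vision sub-configuration falls back to its defaults.

```python
from transformers import AriaConfig, AriaTextConfig

# Hypothetical small text backbone; real checkpoints define their own sub-configurations.
text_config = AriaTextConfig(hidden_size=256, intermediate_size=512, num_hidden_layers=2, num_attention_heads=4)
config = AriaConfig(text_config=text_config)

print(config.model_type)                  # "aria"
print(config.text_config.moe_num_experts)
```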
AriaTextModel
class transformers.AriaTextModel
<source>(config: AriaTextConfig)
Parameters
- config (AriaTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Aria Text Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, cache_position: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The AriaTextModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
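As a quick sanity check of the forward signature, the sketch below runs a tiny, randomly initialized text model on dummy token ids; the configuration values are made up for illustration.

```python
import torch
from transformers import AriaTextConfig, AriaTextModel

# Tiny, randomly initialized model; values are illustrative only.
config = AriaTextConfig(hidden_size=128, intermediate_size=256, num_hidden_layers=2, num_attention_heads=4, moe_num_experts=4)
model = AriaTextModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 8))
with torch.no_grad():
    outputs = model(input_ids=input_ids)
print(outputs.last_hidden_state.shape)  # expected torch.Size([1, 8, 128])
```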
AriaModel
class transformers.AriaModel
<source>(config: AriaConfig)
Parameters
- config (AriaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Aria model which consists of a vision backbone and a language model, without a language modeling head.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, pixel_mask: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.models.aria.modeling_aria.AriaModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using AriaImageProcessor. See AriaImageProcessor.__call__() for details (AriaProcessor uses AriaImageProcessor for processing images).
- pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.models.aria.modeling_aria.AriaModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.aria.modeling_aria.AriaModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the model.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
The AriaModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
get_image_features
<source>( pixel_values: FloatTensor, pixel_mask: typing.Optional[torch.FloatTensor] = None, vision_feature_layer: int = -1 ) → image_features (torch.Tensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, channels, height, width)) — The tensors corresponding to the input images.
- pixel_mask (torch.FloatTensor, optional) — The tensors corresponding to the input image mask.
- vision_feature_layer (Union[int, list[int]], optional) — The index of the layer to select the vision feature. If multiple indices are provided, the vision features of the corresponding indices will be concatenated to form the vision features.
Returns
image_features (torch.Tensor)
Image feature tensor of shape (num_images, image_length, embed_dim).
Obtains the image last hidden states from the vision tower and applies the multimodal projection.
get_placeholder_mask
<source>( input_ids: LongTensor, inputs_embeds: FloatTensor, image_features: FloatTensor )
Obtains the multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of the multimodal features. If the lengths are different, an error is raised.
AriaTextForCausalLM
class transformers.AriaTextForCausalLM
<source>(config: AriaTextConfig)
Parameters
- config (AriaTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Aria Model for causal language modeling.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The AriaTextForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
```python
>>> from transformers import AutoTokenizer, AriaTextForCausalLM

>>> model = AriaTextForCausalLM.from_pretrained("meta-aria_text/AriaText-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-aria_text/AriaText-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```
AriaForConditionalGeneration
class transformers.AriaForConditionalGeneration
<source>(config: AriaConfig)
Parameters
- config (AriaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Aria model for conditional generation tasks.
This model combines a vision tower, a multi-modal projector, and a language model to perform tasks that involve both image and text inputs.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, pixel_mask: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.aria.modeling_aria.AriaCausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using AriaImageProcessor. See AriaImageProcessor.__call__() for details (AriaProcessor uses AriaImageProcessor for processing images).
- pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or model.image_token_id (where model is your instance of AriaForConditionalGeneration). Tokens with indices set to model.image_token_id are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.models.aria.modeling_aria.AriaCausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.aria.modeling_aria.AriaCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
The AriaForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
```python
>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO

>>> from transformers import AutoProcessor, AutoModel
>>> from transformers.image_utils import load_image

>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

>>> processor = AutoProcessor.from_pretrained("Rhymes-AI/Aria")
>>> model = AutoModel.from_pretrained("Rhymes-AI/Aria", dtype=torch.bfloat16, device_map="auto")

>>> # Create inputs
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image"},
...             {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
...             {"type": "image"},
...             {"type": "text", "text": "What can we see in this image?"},
...         ]
...     },
...     {
...         "role": "user",
...         "content": [
...             {"type": "image"},
...             {"type": "text", "text": "In which city is that bridge located?"},
...         ]
...     }
... ]

>>> prompts = [processor.apply_chat_template([message], add_generation_prompt=True) for message in messages]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(model.device)

>>> # Generate
>>> generated_ids = model.generate(**inputs, max_new_tokens=256)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> print(generated_texts[0])
Assistant: There are buildings, trees, lights, and water visible in this image.

>>> print(generated_texts[1])
Assistant: The bridge is in San Francisco.
```