This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06.
Aria
Aria is a multimodal mixture-of-experts (MoE) model. The goal of this model is to open-source a training recipe for creating a multimodal native model from scratch. Aria activates 3.9B parameters per visual token and 3.5B parameters per text token. Text is handled by a MoE decoder and visual inputs are handled by a lightweight visual encoder. It is trained in four stages: language pretraining, multimodal pretraining, multimodal long-context pretraining, and multimodal post-training.
You can find all the original Aria checkpoints under the Aria organization.
Click on the Aria models in the right sidebar for more examples of how to apply Aria to different multimodal tasks.
The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.
```python
import torch
from transformers import pipeline

pipeline = pipeline(
    "image-to-text",
    model="rhymes-ai/Aria",
    device=0,
    dtype=torch.bfloat16,
)
pipeline(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    text="What is shown in this image?",
)
```
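The same task can be run without Pipeline by loading the model and processor directly. The snippet below is a minimal sketch that mirrors the quantized example further down but uses AriaForConditionalGeneration without quantization; the generation settings are illustrative.

```python
import torch
from transformers import AriaForConditionalGeneration, AutoProcessor

model = AriaForConditionalGeneration.from_pretrained(
    "rhymes-ai/Aria", dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, torch.bfloat16)

# Illustrative generation settings; tune max_new_tokens and sampling as needed.
output = model.generate(**inputs, max_new_tokens=15, stop_strings=["<|im_end|>"], tokenizer=processor.tokenizer)
output_ids = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(output_ids, skip_special_tokens=True))
```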
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses torchao to quantize only the weights to int4, together with the rhymes-ai/Aria-sequential_mlp checkpoint. This checkpoint replaces grouped GEMM with torch.nn.Linear layers for easier quantization.
```python
# pip install torchao
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoProcessor

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria-sequential_mlp")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, torch.bfloat16)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
```
AriaImageProcessor
class transformers.AriaImageProcessor
<source>( image_mean: typing.Optional[list[float]] = None, image_std: typing.Optional[list[float]] = None, max_image_size: int = 980, min_image_size: int = 336, split_resolutions: typing.Optional[list[tuple[int, int]]] = None, split_image: typing.Optional[bool] = False, do_convert_rgb: typing.Optional[bool] = True, do_rescale: bool = True, rescale_factor: typing.Union[int, float] = 0.00392156862745098, do_normalize: typing.Optional[bool] = True, resample: Resampling = <Resampling.BICUBIC: 3>, **kwargs )
Parameters
- image_mean (list, optional, defaults to [0.5, 0.5, 0.5]) — Mean values for normalization.
- image_std (list, optional, defaults to [0.5, 0.5, 0.5]) — Standard deviation values for normalization.
- max_image_size (int, optional, defaults to 980) — Maximum image size.
- min_image_size (int, optional, defaults to 336) — Minimum image size.
- split_resolutions (list, optional, defaults to a list of optimal resolutions as tuples) — The optimal resolutions for splitting the image.
- split_image (bool, optional, defaults to False) — Whether to split the image.
- do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
- rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image.
- resample (PILImageResampling, optional, defaults to BICUBIC) — The resampling filter to use if resizing the image.
A vision processor for the Aria model that handles image preprocessing. Initializes the AriaImageProcessor.
get_image_patches
<source>( image: ndarray, grid_pinpoints: list, patch_size: int, resample: Resampling, data_format: ChannelDimension, input_data_format: ChannelDimension ) → list[np.ndarray]
Parameters
- image (np.ndarray) — The input image to be processed.
- grid_pinpoints (list[tuple[int, int]]) — A list of possible resolutions as tuples.
- patch_size (int) — Size of the patches to divide the image into.
- resample (PILImageResampling) — Resampling filter to use if resizing the image.
- data_format (ChannelDimension or str) — The channel dimension format for the output image.
- input_data_format (ChannelDimension or str) — The channel dimension format of the input image.
Returns
list[np.ndarray]
A list of NumPy arrays containing the processed image patches.
Process an image with variable resolutions by dividing it into patches.
get_number_of_image_patches
<source>( height: int, width: int, images_kwargs = None ) → int
A utility that returns the number of image patches for a given image size.
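A hypothetical call is sketched below; the image dimensions are made up, and it assumes images_kwargs accepts the same keyword arguments as preprocess().

```python
from transformers import AriaImageProcessor

image_processor = AriaImageProcessor()

# Hypothetical sizes; images_kwargs is assumed to mirror the preprocess() keyword arguments.
num_patches = image_processor.get_number_of_image_patches(
    height=1080, width=1920, images_kwargs={"split_image": True, "max_image_size": 980}
)
print(num_patches)
```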
pad
<source>( image: ndarray, padding: typing.Union[int, tuple[int, int], collections.abc.Iterable[tuple[int, int]]], mode: PaddingMode = <PaddingMode.CONSTANT: 'constant'>, constant_values: typing.Union[float, collections.abc.Iterable[float]] = 0.0, data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) → np.ndarray
Parameters
- image (np.ndarray) — The image to pad.
- padding (int or tuple[int, int] or Iterable[tuple[int, int]]) — Padding to apply to the edges of the height and width axes. Can be one of three formats: ((before_height, after_height), (before_width, after_width)) for unique pad widths for each axis; ((before, after),) yields the same before and after pad for height and width; (pad,) or int is a shortcut for before = after = pad width for all axes.
- mode (PaddingMode) — The padding mode to use. Can be one of: "constant": pads with a constant value; "reflect": pads with the reflection of the vector mirrored on the first and last values of the vector along each axis; "replicate": pads with the replication of the last value on the edge of the array along each axis; "symmetric": pads with the reflection of the vector mirrored along the edge of the array.
- constant_values (float or Iterable[float], optional) — The value to use for the padding if mode is "constant".
- data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the same as the input image.
- input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the inferred format of the input image.
Returns
np.ndarray
The padded image.
Pads the image with the specified padding and mode. Padding can be in the (height, width) dimension or in the (num_patches) dimension. In the second case, an iterable of tuples is expected as input.
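As a quick illustration of the (height, width) case, the sketch below pads a channels-first NumPy array on the bottom and right edges; the array shape and padding amounts are arbitrary.

```python
import numpy as np
from transformers import AriaImageProcessor

image_processor = AriaImageProcessor()
image = np.zeros((3, 100, 120), dtype=np.float32)  # (num_channels, height, width)

# Pad 28 rows at the bottom and 8 columns on the right.
padded = image_processor.pad(image, padding=((0, 28), (0, 8)))
print(padded.shape)  # expected (3, 128, 128)
```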
preprocess
<source>( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], list[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]]], image_mean: typing.Union[float, list[float], NoneType] = None, image_std: typing.Union[float, list[float], NoneType] = None, max_image_size: typing.Optional[int] = None, min_image_size: typing.Optional[int] = None, split_image: typing.Optional[bool] = None, do_convert_rgb: typing.Optional[bool] = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] = None, do_normalize: typing.Optional[bool] = None, resample: typing.Optional[PIL.Image.Resampling] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = 'pt', data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) → BatchFeature
Parameters
- images (ImageInput or list of ImageInput) — The input image or a list of images.
- image_mean (list, optional, defaults to [0.5, 0.5, 0.5]) — Mean values for normalization.
- image_std (list, optional, defaults to [0.5, 0.5, 0.5]) — Standard deviation values for normalization.
- max_image_size (int, optional, defaults to self.max_image_size (980)) — Maximum image size.
- min_image_size (int, optional, defaults to self.min_image_size (336)) — Minimum image size.
- split_image (bool, optional, defaults to self.split_image (False)) — Whether to split the image.
- do_convert_rgb (bool, optional, defaults to self.do_convert_rgb (True)) — Whether to convert the image to RGB.
- do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
- rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional, defaults to self.do_normalize (True)) — Whether to normalize the image.
- resample (PILImageResampling, optional, defaults to self.resample (BICUBIC)) — The resampling filter to use if resizing the image.
- return_tensors (str or TensorType, optional, defaults to "pt") — The type of tensor to return.
- data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the same as the input image.
- input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use the inferred format of the input image.
Returns
BatchFeature
A BatchFeature object containing:
- 'pixel_values': Tensor of processed image pixel values.
- 'pixel_mask': Boolean pixel mask. This mask is a 2D tensor of shape (max_image_size, max_image_size) where:
  - True (1) values indicate pixels that belong to the original resized image.
  - False (0) values indicate pixels that are part of the padding. The mask helps distinguish between actual image content and padded areas in subsequent processing steps.
- 'num_crops': The maximum number of crops across all images.
Process a list of images.
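The sketch below runs the image processor end to end on a dummy PIL image with the default settings; the input size is arbitrary and the shapes in the comments assume the default max_image_size of 980 with split_image=False.

```python
import numpy as np
from PIL import Image
from transformers import AriaImageProcessor

image_processor = AriaImageProcessor()
image = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))  # dummy RGB image

outputs = image_processor(images=[image], return_tensors="pt")
print(outputs["pixel_values"].shape)  # expected (1, 3, 980, 980) with the defaults
print(outputs["pixel_mask"].shape)    # expected (1, 980, 980)
print(outputs["num_crops"])
```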
AriaProcessor
class transformers.AriaProcessor
<source>( image_processor = None, tokenizer: typing.Union[transformers.models.auto.tokenization_auto.AutoTokenizer, str] = None, chat_template: typing.Optional[str] = None, size_conversion: typing.Optional[dict[typing.Union[float, int], int]] = None )
Parameters
- image_processor (AriaImageProcessor, optional) — The AriaImageProcessor to use for image preprocessing.
- tokenizer (PreTrainedTokenizerBase, optional) — An instance of PreTrainedTokenizerBase. This should correspond with the model's text model. The tokenizer is a required input.
- chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
- size_conversion (Dict, optional) — A dictionary indicating size conversions for images.
AriaProcessor is a processor for the Aria model which wraps the Aria image preprocessor and the LLaMA slow tokenizer.
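In practice the processor is loaded from a checkpoint and builds model inputs from a chat prompt and images, as in the sketch below; the image URL and prompt are placeholders.

```python
from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")

# Placeholder image and prompt, following the chat format expected by the model.
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
print(inputs["input_ids"].shape, inputs["pixel_values"].shape)
```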
AriaTextConfig
class transformers.AriaTextConfig
<source>( vocab_size: typing.Optional[int] = 32000, hidden_size: typing.Optional[int] = 4096, intermediate_size: int = 4096, num_hidden_layers: typing.Optional[int] = 32, num_attention_heads: typing.Optional[int] = 32, num_key_value_heads: typing.Optional[int] = None, hidden_act: typing.Optional[str] = 'silu', max_position_embeddings: typing.Optional[int] = 2048, initializer_range: typing.Optional[float] = 0.02, rms_norm_eps: typing.Optional[int] = 1e-06, use_cache: typing.Optional[bool] = True, pad_token_id = 2, bos_token_id: typing.Optional[int] = 1, eos_token_id: typing.Optional[int] = 2, pretraining_tp: typing.Optional[int] = 1, tie_word_embeddings: typing.Optional[bool] = False, rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None, attention_bias: typing.Optional[bool] = False, attention_dropout: typing.Optional[float] = 0.0, mlp_bias: typing.Optional[bool] = False, head_dim: typing.Optional[int] = None, moe_num_experts: int = 8, moe_topk: int = 2, moe_num_shared_experts: int = 2, **kwargs )
Parameters
- vocab_size (int, optional, defaults to 32000) — Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LlamaModel.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 4096) — The size of the MLP representations.
- num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If it is not specified, will default to num_attention_heads.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- pad_token_id (int, optional, defaults to 2) — Padding token id.
- bos_token_id (int, optional, defaults to 1) — Beginning of stream token id.
- eos_token_id (int, optional, defaults to 2) — End of stream token id.
- pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
- attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- mlp_bias (bool, optional, defaults to False) — Whether to use a bias in the up_proj, down_proj and gate_proj layers in the MLP layers.
- head_dim (int, optional) — The attention head dimension. If None, it will default to hidden_size // num_heads.
- moe_num_experts (int, optional, defaults to 8) — The number of experts in the MoE layer.
- moe_topk (int, optional, defaults to 2) — The number of top experts to route to for each token.
- moe_num_shared_experts (int, optional, defaults to 2) — The number of shared experts.
This class handles the configuration for the text component of the Aria model. Instantiating a configuration with the defaults will yield a configuration similar to that of the rhymes-ai/Aria architecture. This class extends LlamaConfig to include additional parameters specific to the Mixture of Experts (MoE) architecture.
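A configuration can be instantiated directly and used to build a randomly initialized text model; the reduced sizes below are purely illustrative.

```python
from transformers import AriaTextConfig, AriaTextModel

# Deliberately small values for illustration; real checkpoints ship their own configuration.
config = AriaTextConfig(
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=2,
    num_attention_heads=4,
    moe_num_experts=4,
    moe_topk=2,
)
model = AriaTextModel(config)  # randomly initialized weights
print(model.config.moe_num_experts)
```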
AriaConfig
class transformers.AriaConfig
<source>( vision_config = None, vision_feature_layer: int = -1, text_config: AriaTextConfig = None, projector_patch_to_query_dict: typing.Optional[dict] = None, image_token_index: int = 9, initializer_range: float = 0.02, **kwargs )
Parameters
- vision_config (AriaVisionConfig or dict, optional) — Configuration for the vision component.
- vision_feature_layer (int, optional, defaults to -1) — The index of the layer to select the vision feature.
- text_config (AriaTextConfig or dict, optional) — Configuration for the text component.
- projector_patch_to_query_dict (dict, optional) — Mapping of patch sizes to query dimensions.
- image_token_index (int, optional, defaults to 9) — Index used to represent image tokens.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer for initializing all weight matrices.
- model_type (str) — Type of the model, set to "aria".
- image_token_index (int) — Index used to represent image tokens.
- projector_patch_to_query_dict (dict) — Mapping of patch sizes to query dimensions.
- vision_config (AriaVisionConfig) — Configuration for the vision component.
- text_config (AriaTextConfig) — Configuration for the text component.
This class handles the configuration for both the vision and text components of the Aria model, as well as additional parameters for image token handling and projector mapping. Instantiating a configuration with the defaults will yield a configuration similar to that of the rhymes-ai/Aria architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
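The sketch below assembles an AriaConfig from a hypothetical, deliberately small text sub-configuration; the vision sub-configuration falls back to its defaults.

```python
from transformers import AriaConfig, AriaTextConfig

# Hypothetical small text backbone; real checkpoints define their own sub-configurations.
text_config = AriaTextConfig(hidden_size=256, intermediate_size=512, num_hidden_layers=2, num_attention_heads=4)
config = AriaConfig(text_config=text_config)

print(config.model_type)                  # "aria"
print(config.text_config.moe_num_experts)
```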
AriaTextModel
class transformers.AriaTextModel
<source>(config: AriaTextConfig)
Parameters
- config (AriaTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Aria Text Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, cache_position: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The AriaTextModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
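As a quick sanity check of the forward signature, the sketch below runs a tiny, randomly initialized text model on dummy token ids; the configuration values are made up for illustration.

```python
import torch
from transformers import AriaTextConfig, AriaTextModel

# Tiny, randomly initialized model; values are illustrative only.
config = AriaTextConfig(hidden_size=128, intermediate_size=256, num_hidden_layers=2, num_attention_heads=4, moe_num_experts=4)
model = AriaTextModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 8))
with torch.no_grad():
    outputs = model(input_ids=input_ids)
print(outputs.last_hidden_state.shape)  # expected torch.Size([1, 8, 128])
```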
AriaModel
class transformers.AriaModel
<source>(config: AriaConfig)
Parameters
- config (AriaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Aria model which consists of a vision backbone and a language model, without a language modeling head.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, pixel_mask: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.models.aria.modeling_aria.AriaModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using AriaImageProcessor. See AriaImageProcessor.__call__() for details (AriaProcessor uses AriaImageProcessor for processing images).
- pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.models.aria.modeling_aria.AriaModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.aria.modeling_aria.AriaModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the model.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
The AriaModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
get_image_features
<source>( pixel_values: FloatTensor, pixel_mask: typing.Optional[torch.FloatTensor] = None, vision_feature_layer: int = -1 ) → image_features (torch.Tensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, channels, height, width)) — The tensors corresponding to the input images.
- pixel_mask (torch.FloatTensor, optional) — The tensors corresponding to the input image mask.
- vision_feature_layer (Union[int, list[int]], optional) — The index of the layer to select the vision feature. If multiple indices are provided, the vision features of the corresponding indices will be concatenated to form the vision features.
Returns
image_features (torch.Tensor)
Image feature tensor of shape (num_images, image_length, embed_dim).
Obtains the image last hidden states from the vision tower and applies the multimodal projection.
get_placeholder_mask
<source>( input_ids: LongTensor, inputs_embeds: FloatTensor, image_features: FloatTensor )
Obtains the multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of the multimodal features. If the lengths are different, an error is raised.
AriaTextForCausalLM
class transformers.AriaTextForCausalLM
<source>(config: AriaTextConfig)
Parameters
- config (AriaTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Aria Model for causal language modeling.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The AriaTextForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
```python
>>> from transformers import AutoTokenizer, AriaTextForCausalLM

>>> model = AriaTextForCausalLM.from_pretrained("meta-aria_text/AriaText-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-aria_text/AriaText-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```
AriaForConditionalGeneration
class transformers.AriaForConditionalGeneration
<source>(config: AriaConfig)
Parameters
- config (AriaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Aria model for conditional generation tasks.
This model combines a vision tower, a multi-modal projector, and a language model to perform tasks that involve both image and text inputs.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>( input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, pixel_mask: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.aria.modeling_aria.AriaCausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using AriaImageProcessor. See AriaImageProcessor.__call__() for details (AriaProcessor uses AriaImageProcessor for processing images).
- pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or model.image_token_id (where model is your instance of AriaForConditionalGeneration). Tokens with indices set to model.image_token_id are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.models.aria.modeling_aria.AriaCausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.aria.modeling_aria.AriaCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AriaConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
The AriaForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
```python
>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO

>>> from transformers import AutoProcessor, AutoModel
>>> from transformers.image_utils import load_image

>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

>>> processor = AutoProcessor.from_pretrained("Rhymes-AI/Aria")
>>> model = AutoModel.from_pretrained("Rhymes-AI/Aria", dtype=torch.bfloat16, device_map="auto")

>>> # Create inputs
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image"},
...             {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
...             {"type": "image"},
...             {"type": "text", "text": "What can we see in this image?"},
...         ]
...     },
...     {
...         "role": "user",
...         "content": [
...             {"type": "image"},
...             {"type": "text", "text": "In which city is that bridge located?"},
...         ]
...     }
... ]

>>> prompts = [processor.apply_chat_template([message], add_generation_prompt=True) for message in messages]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(model.device)

>>> # Generate
>>> generated_ids = model.generate(**inputs, max_new_tokens=256)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> print(generated_texts[0])
Assistant: There are buildings, trees, lights, and water visible in this image.

>>> print(generated_texts[1])
Assistant: The bridge is in San Francisco.
```