This model was released on 2023-10-17 and added to Hugging Face Transformers on 2023-10-19.
Fuyu
Overview
The Fuyu model was created by ADEPT and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar.
The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.
By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.
The Fuyu models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the hub use dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.

The dtype of the online weights is mostly irrelevant unless you are using dtype="auto" when initializing a model with model = AutoModelForCausalLM.from_pretrained("path", dtype="auto"). The reason is that the model will first be downloaded (using the dtype of the checkpoints online) and then cast to the default dtype of torch (which is torch.float32). Users should specify the dtype they want; if they don't, it will be torch.float32.

Finetuning the model in float16 is not recommended and is known to produce nan; as such, the model should be fine-tuned in bfloat16.
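For example, to fine-tune or run the model in bfloat16, the checkpoint can be loaded with an explicit dtype. This is a minimal sketch using the adept/fuyu-8b checkpoint referenced further below:

```python
import torch
from transformers import FuyuForCausalLM

# Load the hub checkpoint and cast the weights to bfloat16, the dtype the model was trained in.
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b", dtype=torch.bfloat16)
```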
Tips:
- To convert the model, you need to clone the original repository using git clone https://github.com/persimmon-ai-labs/adept-inference, then get the checkpoints:
```bash
git clone https://github.com/persimmon-ai-labs/adept-inference
wget path/to/fuyu-8b-model-weights.tar
tar -xvf fuyu-8b-model-weights.tar
python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path \
    --pt_model_path /path/to/fuyu_8b_release/iter_0001251/mp_rank_00/model_optim_rng.pt --ada_lib_path /path/to/adept-inference
```

For the chat model:
```bash
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
```
Then, the model can be loaded via:
```python
from transformers import FuyuConfig, FuyuForCausalLM

model_config = FuyuConfig()
model = FuyuForCausalLM(model_config).from_pretrained('/output/path')
```
Inputs need to be passed through a specific Processor to have the correct formats. A processor requires an image_processor and a tokenizer. Hence, inputs can be loaded via:
```python
# io and requests are needed to fetch and open the example image.
import io

import requests
from PIL import Image
from transformers import AutoTokenizer
from transformers.models.fuyu.processing_fuyu import FuyuProcessor
from transformers.models.fuyu.image_processing_fuyu_fast import FuyuImageProcessorFast

tokenizer = AutoTokenizer.from_pretrained('adept-hf-collab/fuyu-8b')
image_processor = FuyuImageProcessorFast()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)

text_prompt = "Generate a coco-style caption.\n"
bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))
inputs_to_model = processor(images=bus_image_pil, text=text_prompt)
```
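The resulting inputs can then be passed to the model. Below is a minimal, hedged sketch of caption generation; it assumes a model loaded with FuyuForCausalLM.from_pretrained("adept/fuyu-8b") as in the dtype snippet above, and requests return_tensors="pt" so the processor returns PyTorch tensors:

```python
# Request PyTorch tensors so the encoding can be fed directly to the model.
inputs_to_model = processor(images=bus_image_pil, text=text_prompt, return_tensors="pt")

# Generate a short caption and decode only the newly produced tokens.
generated_ids = model.generate(**inputs_to_model, max_new_tokens=7)
caption = processor.batch_decode(generated_ids[:, -7:], skip_special_tokens=True)[0]
print(caption)
```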
This model was contributed by Molbap. The original code can be found here.
Fuyu uses a sentencepiece-based tokenizer with a Unigram model. It supports bytefallback, which is only available in tokenizers==0.14.0 for the fast tokenizer. The LlamaTokenizer is used as it is a standard wrapper around sentencepiece.

The authors suggest using the following prompt for image captioning:

f"Generate a coco-style caption.\n"
FuyuConfig
class transformers.FuyuConfig
<source>(vocab_size: typing.Optional[int] = 262144, hidden_size: typing.Optional[int] = 4096, intermediate_size: typing.Optional[int] = 16384, num_hidden_layers: typing.Optional[int] = 36, num_attention_heads: typing.Optional[int] = 64, hidden_act: typing.Optional[str] = 'relu2', max_position_embeddings: typing.Optional[int] = 16384, image_size: typing.Optional[int] = 300, patch_size: typing.Optional[int] = 30, num_channels: typing.Optional[int] = 3, initializer_range: typing.Optional[float] = 0.02, layer_norm_eps: typing.Optional[int] = 1e-05, use_cache: typing.Optional[bool] = True, tie_word_embeddings: typing.Optional[bool] = False, rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None, qk_layernorm: typing.Optional[bool] = True, hidden_dropout: typing.Optional[float] = 0.0, attention_dropout: typing.Optional[float] = 0.0, partial_rotary_factor: typing.Optional[float] = 0.5, pad_token_id: typing.Optional[int] = None, bos_token_id: typing.Optional[int] = 1, eos_token_id: typing.Optional[int] = 2, image_token_id: typing.Optional[int] = 71011, text_config: typing.Optional[dict] = None, **kwargs)
Parameters
- vocab_size (int, optional, defaults to 262144) — Vocabulary size of the Fuyu model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling FuyuForCausalLM.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 16384) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 36) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 64) — Number of attention heads for each attention layer in the Transformer decoder.
- hidden_act (str or function, optional, defaults to "relu2") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 16384) — The maximum sequence length that this model might ever be used with.
- image_size (int, optional, defaults to 300) — The input image size.
- patch_size (int, optional, defaults to 30) — The input vision transformer encoding patch size.
- num_channels (int, optional, defaults to 3) — The number of input image channels.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie input and output embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and, optionally, parameters used for scaling in case you want to use RoPE with a longer max_position_embeddings.
- qk_layernorm (bool, optional, defaults to True) — Whether or not to normalize the Queries and Keys after projecting the hidden states.
- hidden_dropout (float, optional, defaults to 0.0) — The dropout ratio after applying the MLP to the hidden states.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio after computing the attention scores.
- partial_rotary_factor (float, optional, defaults to 0.5) — Percentage of the query and keys which will have rotary embedding.
- pad_token_id (int, optional) — The id of the padding token.
- bos_token_id (int, optional, defaults to 1) — The id of the beginning-of-sequence token.
- eos_token_id (Union[int, list[int]], optional, defaults to 2) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
- image_token_id (int, optional, defaults to 71011) — The id of the image placeholder token.
- text_config (dict, optional) — Dictionary of configuration options used to initialize the language model.
This is the configuration class to store the configuration of a FuyuForCausalLM. It is used to instantiate a Fuyu model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of adept/fuyu-8b.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
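A short usage sketch, following the usual configuration pattern in the library (the model is initialized with random weights, not the pretrained checkpoint):

```python
from transformers import FuyuConfig, FuyuForCausalLM

# Initializing a Fuyu configuration with default values (similar in spirit to adept/fuyu-8b).
configuration = FuyuConfig()

# Initializing a model with random weights from that configuration.
model = FuyuForCausalLM(configuration)

# Accessing the model configuration.
configuration = model.config
```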
FuyuModel
class transformers.FuyuModel
<source>(config: FuyuConfig)
Parameters
- config (FuyuConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Fuyu model which consists of a vision backbone and a language model, without a language modeling head.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>(input_ids: typing.Optional[torch.LongTensor] = None, image_patches: typing.Optional[torch.Tensor] = None, image_patches_indices: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, **kwargs) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- image_patches (torch.FloatTensor of shape (batch_size, num_total_patches, patch_size x patch_size x num_channels), optional) — Image patches to be used as continuous embeddings. The patches are flattened and then projected to the hidden size of the model.
- image_patches_indices (torch.LongTensor of shape (batch_size, sequence_length), optional) — Tensor of indices of the image patches in the input_ids tensor.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (FuyuConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The FuyuModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
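A minimal sketch of calling the base FuyuModel directly to inspect intermediate hidden states (illustrative only; it reuses the bus image and captioning prompt from the snippets above):

```python
import io

import requests
import torch
from PIL import Image
from transformers import FuyuModel, FuyuProcessor

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuModel.from_pretrained("adept/fuyu-8b")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
image = Image.open(io.BytesIO(requests.get(url).content))
inputs = processor(images=image, text="Generate a coco-style caption.\n", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Hidden states of the final decoder layer: (batch_size, sequence_length, hidden_size)
print(outputs.hidden_states[-1].shape)
```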
gather_continuous_embeddings
<source>(word_embeddings: Tensor, continuous_embeddings: list, image_patch_input_indices: Tensor)
Parameters
- word_embeddings (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Tensor of word embeddings.
- continuous_embeddings (list of torch.FloatTensor) — Continuous embeddings. The length of the list is the batch size. Each entry has shape (num_image_embeddings, hidden), and num_image_embeddings needs to match the number of non-negative indices in image_patch_input_indices for that batch element.
- image_patch_input_indices (torch.LongTensor of shape (batch_size, sequence_length)) — Tensor of indices of the image patches in the input_ids tensor.
This function places the continuous_embeddings into the word_embeddings at the locations indicated by image_patch_input_indices. Different batch elements can have different numbers of continuous embeddings.
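A toy illustration of that placement logic with hypothetical tensors (not the actual implementation): non-negative entries of image_patch_input_indices say which continuous embedding replaces the word embedding at that sequence position.

```python
import torch

batch_size, seq_len, hidden = 1, 5, 4
word_embeddings = torch.zeros(batch_size, seq_len, hidden)
# Two image-patch embeddings for the single batch element.
continuous_embeddings = [torch.ones(2, hidden)]
# Positions 1 and 2 hold image patches 0 and 1; -1 marks ordinary text tokens.
image_patch_input_indices = torch.tensor([[-1, 0, 1, -1, -1]])

output = word_embeddings.clone()
for batch_idx in range(batch_size):
    dst = torch.nonzero(image_patch_input_indices[batch_idx] >= 0, as_tuple=True)[0]
    src = image_patch_input_indices[batch_idx][dst]
    output[batch_idx, dst] = continuous_embeddings[batch_idx][src]
```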
get_image_features
<source>(pixel_values: FloatTensor, **kwargs)
Encodes images into continuous embeddings that can be forwarded to the language model.
get_placeholder_mask
<source>(input_ids: LongTensor, inputs_embeds: FloatTensor, image_features: FloatTensor)
Obtains the multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of the multimodal features. If the lengths are different, an error is raised.
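As a simplified, hedged sketch of the check this performs, assuming placeholder positions are marked in input_ids with config.image_token_id (hypothetical tensors):

```python
import torch

image_token_id = 71011  # default image placeholder id in FuyuConfig
input_ids = torch.tensor([[5, image_token_id, image_token_id, 9]])
image_features = torch.randn(2, 4096)  # one feature vector per placeholder token

placeholder_mask = input_ids == image_token_id
# The number of placeholder tokens must equal the number of image features;
# otherwise the real method raises an error.
assert placeholder_mask.sum().item() == image_features.shape[0]
```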
FuyuForCausalLM
class transformers.FuyuForCausalLM
<source>(config: FuyuConfig)
Parameters
- config (FuyuConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Fuyu Model with a language modeling head on top, for causal language modeling conditioned on image patches and text.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>(input_ids: typing.Optional[torch.LongTensor] = None, image_patches: typing.Optional[torch.Tensor] = None, image_patches_indices: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, logits_to_keep: typing.Optional[int] = 0, **kwargs) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- image_patches (torch.FloatTensor of shape (batch_size, num_total_patches, patch_size x patch_size x num_channels), optional) — Image patches to be used as continuous embeddings. The patches are flattened and then projected to the hidden size of the model.
- image_patches_indices (torch.LongTensor of shape (batch_size, sequence_length), optional) — Tensor of indices of the image patches in the input_ids tensor.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.text_config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.text_config.vocab_size].
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- logits_to_keep (int, optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (FuyuConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The FuyuForCausalLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
```python
>>> from transformers import FuyuProcessor, FuyuForCausalLM
>>> from PIL import Image
>>> import requests

>>> processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
>>> model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")

>>> url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a coco-style caption.\n"

>>> inputs = processor(images=image, text=prompt, return_tensors="pt")
>>> outputs = model(**inputs)

>>> generated_ids = model.generate(**inputs, max_new_tokens=7)
>>> generation_text = processor.batch_decode(generated_ids[:, -7:], skip_special_tokens=True)
>>> print(generation_text[0])
A blue bus parked on the side of a road.
```
FuyuImageProcessor
class transformers.FuyuImageProcessor
<source>(do_resize: bool = True, size: typing.Optional[dict[str, int]] = None, resample: Resampling = <Resampling.BILINEAR: 2>, do_pad: bool = True, padding_value: float = 1.0, padding_mode: str = 'constant', do_normalize: bool = True, image_mean: typing.Union[float, list[float]] = 0.5, image_std: typing.Union[float, list[float]] = 0.5, do_rescale: bool = True, rescale_factor: float = 0.00392156862745098, patch_size: typing.Optional[dict[str, int]] = None, **kwargs)
Parameters
- do_resize (bool, optional, defaults to True) — Whether to resize the image to size.
- size (dict[str, int], optional, defaults to {"height": 1080, "width": 1920}) — Dictionary in the format {"height": int, "width": int} specifying the size of the output image.
- resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — PILImageResampling filter to use when resizing the image, e.g. PILImageResampling.BILINEAR.
- do_pad (bool, optional, defaults to True) — Whether to pad the image to size.
- padding_value (float, optional, defaults to 1.0) — The value to pad the image with.
- padding_mode (str, optional, defaults to "constant") — The padding mode to use when padding the image.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image.
- image_mean (float, optional, defaults to 0.5) — The mean to use when normalizing the image.
- image_std (float, optional, defaults to 0.5) — The standard deviation to use when normalizing the image.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image.
- rescale_factor (float, optional, defaults to 1 / 255) — The factor to use when rescaling the image.
- patch_size (dict[str, int], optional, defaults to {"height": 30, "width": 30}) — Dictionary in the format {"height": int, "width": int} specifying the size of the patches.
This class should handle the image processing part before the main FuyuForCausalLM. In particular, it should handle:

Processing Images: Taking a batch of images as input. If the images are variable-sized, it resizes them based on the desired patch dimensions. The image output is always img_h, img_w of (1080, 1920). Then, it patches up these images using the patchify_image function.

Creating Image Input IDs: For each patch, a placeholder ID is given to identify where these patches belong in a token sequence. For variable-sized images, each line of patches is terminated with a newline ID.

Image Patch Indices: For each image patch, the code maintains an index where these patches should be inserted in a token stream.
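A minimal sketch of running the image processor on its own (the exact keys of the returned batch follow from the steps above; printing them avoids hard-coding assumptions):

```python
import io

import requests
from PIL import Image
from transformers import FuyuImageProcessor

image_processor = FuyuImageProcessor()

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
image = Image.open(io.BytesIO(requests.get(url).content))

# Resize/pad the image onto the fixed 1080 x 1920 canvas and return PyTorch tensors.
processed = image_processor(image, return_tensors="pt")
print(processed.keys())
```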
__call__
<source>(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], *args, **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])
Preprocess an image or a batch of images.
FuyuImageProcessorFast
class transformers.FuyuImageProcessorFast
<source>(**kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])
Constructs a fast Fuyu image processor.
__call__
<source>(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], *args, **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])
Preprocess an image or a batch of images.
FuyuProcessor
class transformers.FuyuProcessor
<source>(image_processor, tokenizer, **kwargs)
Parameters
- image_processor (FuyuImageProcessor) — The image processor is a required input.
- tokenizer (LlamaTokenizerFast) — The tokenizer is a required input.
Constructs a Fuyu processor which wraps a Fuyu image processor and a Llama tokenizer into a single processor.
FuyuProcessor offers all the functionalities of FuyuImageProcessor and LlamaTokenizerFast. See __call__() and decode() for more information.
__call__
<source>(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None, text: typing.Union[str, list[str], NoneType] = None, **kwargs: typing_extensions.Unpack[transformers.models.fuyu.processing_fuyu.FuyuProcessorKwargs]) → FuyuBatchEncoding
Parameters
- images (PIL.Image.Image, list[PIL.Image.Image]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported.
- text (str, list[str]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
Returns
FuyuBatchEncoding
A FuyuBatchEncoding with the following fields:
- input_ids — Tensor of token ids to be fed to a model. Returned when text is not None.
- image_patches — List of Tensor of image patches. Returned when images is not None.
- image_patches_indices — Tensor of indices where patch embeddings have to be inserted by the model.
- attention_mask — List of indices specifying which tokens should be attended to by the model when return_attention_mask=True.
Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the text and kwargs arguments to LlamaTokenizerFast's __call__() if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to FuyuImageProcessor's __call__() if images is not None. Please refer to the docstrings of the above two methods for more information.
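Putting the pieces together, a minimal sketch that builds the processor from the hub and inspects the fields listed above:

```python
import io

import requests
from PIL import Image
from transformers import FuyuProcessor

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
image = Image.open(io.BytesIO(requests.get(url).content))

encoding = processor(images=image, text="Generate a coco-style caption.\n", return_tensors="pt")

# Token ids, image patches and patch placement indices, as described above.
print(list(encoding.keys()))
print(encoding["input_ids"].shape)
print(encoding["image_patches"][0].shape)
print(encoding["image_patches_indices"].shape)
```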