This model was released on 2023-10-17 and added to Hugging Face Transformers on 2023-10-19.
Fuyu
Overview
The Fuyu model was created by ADEPT and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar.
The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.
By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.
The Fuyu models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the hub use dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.

The dtype of the online weights is mostly irrelevant unless you are using dtype="auto" when initializing a model with model = AutoModelForCausalLM.from_pretrained("path", dtype="auto"). The reason is that the model will first be downloaded (using the dtype of the checkpoints online) and then cast to the default dtype of torch (which is torch.float32). Users should specify the dtype they want; if they don't, it will be torch.float32.

Finetuning the model in float16 is not recommended and is known to produce nan; as such, the model should be fine-tuned in bfloat16.
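For example, to fine-tune or run the model in bfloat16, the checkpoint can be loaded with an explicit dtype. This is a minimal sketch using the adept/fuyu-8b checkpoint referenced further below:

```python
import torch
from transformers import FuyuForCausalLM

# Load the hub checkpoint and cast the weights to bfloat16, the dtype the model was trained in.
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b", dtype=torch.bfloat16)
```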
Tips:
- To convert the model, you need to clone the original repository using git clone https://github.com/persimmon-ai-labs/adept-inference, then get the checkpoints:
```bash
git clone https://github.com/persimmon-ai-labs/adept-inference
wget path/to/fuyu-8b-model-weights.tar
tar -xvf fuyu-8b-model-weights.tar
python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path \
    --pt_model_path /path/to/fuyu_8b_release/iter_0001251/mp_rank_00/model_optim_rng.pt --ada_lib_path /path/to/adept-inference
```

For the chat model:
```bash
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
```
Then, the model can be loaded via:
```python
from transformers import FuyuConfig, FuyuForCausalLM

model_config = FuyuConfig()
model = FuyuForCausalLM(model_config).from_pretrained('/output/path')
```
Inputs need to be passed through a specific Processor to have the correct formats. A processor requires an image_processor and a tokenizer. Hence, inputs can be loaded via:
```python
# io and requests are needed to fetch and open the example image.
import io

import requests
from PIL import Image
from transformers import AutoTokenizer
from transformers.models.fuyu.processing_fuyu import FuyuProcessor
from transformers.models.fuyu.image_processing_fuyu_fast import FuyuImageProcessorFast

tokenizer = AutoTokenizer.from_pretrained('adept-hf-collab/fuyu-8b')
image_processor = FuyuImageProcessorFast()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)

text_prompt = "Generate a coco-style caption.\n"
bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))
inputs_to_model = processor(images=bus_image_pil, text=text_prompt)
```
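The resulting inputs can then be passed to the model. Below is a minimal, hedged sketch of caption generation; it assumes a model loaded with FuyuForCausalLM.from_pretrained("adept/fuyu-8b") as in the dtype snippet above, and requests return_tensors="pt" so the processor returns PyTorch tensors:

```python
# Request PyTorch tensors so the encoding can be fed directly to the model.
inputs_to_model = processor(images=bus_image_pil, text=text_prompt, return_tensors="pt")

# Generate a short caption and decode only the newly produced tokens.
generated_ids = model.generate(**inputs_to_model, max_new_tokens=7)
caption = processor.batch_decode(generated_ids[:, -7:], skip_special_tokens=True)[0]
print(caption)
```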
This model was contributed by Molbap. The original code can be found here.
Fuyu uses a sentencepiece-based tokenizer with a Unigram model. It supports bytefallback, which is only available in tokenizers==0.14.0 for the fast tokenizer. The LlamaTokenizer is used as it is a standard wrapper around sentencepiece.

The authors suggest using the following prompt for image captioning:

f"Generate a coco-style caption.\n"
FuyuConfig
class transformers.FuyuConfig
<source>(vocab_size: typing.Optional[int] = 262144, hidden_size: typing.Optional[int] = 4096, intermediate_size: typing.Optional[int] = 16384, num_hidden_layers: typing.Optional[int] = 36, num_attention_heads: typing.Optional[int] = 64, hidden_act: typing.Optional[str] = 'relu2', max_position_embeddings: typing.Optional[int] = 16384, image_size: typing.Optional[int] = 300, patch_size: typing.Optional[int] = 30, num_channels: typing.Optional[int] = 3, initializer_range: typing.Optional[float] = 0.02, layer_norm_eps: typing.Optional[int] = 1e-05, use_cache: typing.Optional[bool] = True, tie_word_embeddings: typing.Optional[bool] = False, rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None, qk_layernorm: typing.Optional[bool] = True, hidden_dropout: typing.Optional[float] = 0.0, attention_dropout: typing.Optional[float] = 0.0, partial_rotary_factor: typing.Optional[float] = 0.5, pad_token_id: typing.Optional[int] = None, bos_token_id: typing.Optional[int] = 1, eos_token_id: typing.Optional[int] = 2, image_token_id: typing.Optional[int] = 71011, text_config: typing.Optional[dict] = None, **kwargs)
Parameters
- vocab_size (int, optional, defaults to 262144) — Vocabulary size of the Fuyu model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling FuyuForCausalLM.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 16384) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 36) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 64) — Number of attention heads for each attention layer in the Transformer decoder.
- hidden_act (str or function, optional, defaults to "relu2") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 16384) — The maximum sequence length that this model might ever be used with.
- image_size (int, optional, defaults to 300) — The input image size.
- patch_size (int, optional, defaults to 30) — The input vision transformer encoding patch size.
- num_channels (int, optional, defaults to 3) — The number of input image channels.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie input and output embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and, optionally, parameters used for scaling in case you want to use RoPE with a longer max_position_embeddings.
- qk_layernorm (bool, optional, defaults to True) — Whether or not to normalize the Queries and Keys after projecting the hidden states.
- hidden_dropout (float, optional, defaults to 0.0) — The dropout ratio after applying the MLP to the hidden states.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio after computing the attention scores.
- partial_rotary_factor (float, optional, defaults to 0.5) — Percentage of the query and keys which will have rotary embedding.
- pad_token_id (int, optional) — The id of the padding token.
- bos_token_id (int, optional, defaults to 1) — The id of the beginning-of-sequence token.
- eos_token_id (Union[int, list[int]], optional, defaults to 2) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
- image_token_id (int, optional, defaults to 71011) — The id of the image placeholder token.
- text_config (dict, optional) — Dictionary of configuration options used to initialize the language model.
This is the configuration class to store the configuration of a FuyuForCausalLM. It is used to instantiate a Fuyu model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of adept/fuyu-8b.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
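A short usage sketch, following the usual configuration pattern in the library (the model is initialized with random weights, not the pretrained checkpoint):

```python
from transformers import FuyuConfig, FuyuForCausalLM

# Initializing a Fuyu configuration with default values (similar in spirit to adept/fuyu-8b).
configuration = FuyuConfig()

# Initializing a model with random weights from that configuration.
model = FuyuForCausalLM(configuration)

# Accessing the model configuration.
configuration = model.config
```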
FuyuModel
class transformers.FuyuModel
<source>(config: FuyuConfig)
Parameters
- config (FuyuConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Fuyu model which consists of a vision backbone and a language model, without a language modeling head.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>(input_ids: typing.Optional[torch.LongTensor] = None, image_patches: typing.Optional[torch.Tensor] = None, image_patches_indices: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, **kwargs) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- image_patches (torch.FloatTensor of shape (batch_size, num_total_patches, patch_size x patch_size x num_channels), optional) — Image patches to be used as continuous embeddings. The patches are flattened and then projected to the hidden size of the model.
- image_patches_indices (torch.LongTensor of shape (batch_size, sequence_length), optional) — Tensor of indices of the image patches in the input_ids tensor.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (FuyuConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The FuyuModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
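A minimal sketch of calling the base FuyuModel directly to inspect intermediate hidden states (illustrative only; it reuses the bus image and captioning prompt from the snippets above):

```python
import io

import requests
import torch
from PIL import Image
from transformers import FuyuModel, FuyuProcessor

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuModel.from_pretrained("adept/fuyu-8b")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
image = Image.open(io.BytesIO(requests.get(url).content))
inputs = processor(images=image, text="Generate a coco-style caption.\n", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Hidden states of the final decoder layer: (batch_size, sequence_length, hidden_size)
print(outputs.hidden_states[-1].shape)
```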
gather_continuous_embeddings
<source>(word_embeddings: Tensor, continuous_embeddings: list, image_patch_input_indices: Tensor)
Parameters
- word_embeddings (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Tensor of word embeddings.
- continuous_embeddings (list of torch.FloatTensor) — Continuous embeddings. The length of the list is the batch size. Each entry has shape (num_image_embeddings, hidden), and num_image_embeddings needs to match the number of non-negative indices in image_patch_input_indices for that batch element.
- image_patch_input_indices (torch.LongTensor of shape (batch_size, sequence_length)) — Tensor of indices of the image patches in the input_ids tensor.
This function places the continuous_embeddings into the word_embeddings at the locations indicated by image_patch_input_indices. Different batch elements can have different numbers of continuous embeddings.
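A toy illustration of that placement logic with hypothetical tensors (not the actual implementation): non-negative entries of image_patch_input_indices say which continuous embedding replaces the word embedding at that sequence position.

```python
import torch

batch_size, seq_len, hidden = 1, 5, 4
word_embeddings = torch.zeros(batch_size, seq_len, hidden)
# Two image-patch embeddings for the single batch element.
continuous_embeddings = [torch.ones(2, hidden)]
# Positions 1 and 2 hold image patches 0 and 1; -1 marks ordinary text tokens.
image_patch_input_indices = torch.tensor([[-1, 0, 1, -1, -1]])

output = word_embeddings.clone()
for batch_idx in range(batch_size):
    dst = torch.nonzero(image_patch_input_indices[batch_idx] >= 0, as_tuple=True)[0]
    src = image_patch_input_indices[batch_idx][dst]
    output[batch_idx, dst] = continuous_embeddings[batch_idx][src]
```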
get_image_features
<source>(pixel_values: FloatTensor, **kwargs)
Encodes images into continuous embeddings that can be forwarded to the language model.
get_placeholder_mask
<source>(input_ids: LongTensor, inputs_embeds: FloatTensor, image_features: FloatTensor)
Obtains the multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of the multimodal features. If the lengths are different, an error is raised.
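As a simplified, hedged sketch of the check this performs, assuming placeholder positions are marked in input_ids with config.image_token_id (hypothetical tensors):

```python
import torch

image_token_id = 71011  # default image placeholder id in FuyuConfig
input_ids = torch.tensor([[5, image_token_id, image_token_id, 9]])
image_features = torch.randn(2, 4096)  # one feature vector per placeholder token

placeholder_mask = input_ids == image_token_id
# The number of placeholder tokens must equal the number of image features;
# otherwise the real method raises an error.
assert placeholder_mask.sum().item() == image_features.shape[0]
```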
FuyuForCausalLM
class transformers.FuyuForCausalLM
<source>(config: FuyuConfig)
Parameters
- config (FuyuConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Fuyu Model with a language modeling head on top, for causal language modeling conditioned on image patches and text.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
<source>(input_ids: typing.Optional[torch.LongTensor] = None, image_patches: typing.Optional[torch.Tensor] = None, image_patches_indices: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, logits_to_keep: typing.Optional[int] = 0, **kwargs) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- image_patches (torch.FloatTensor of shape (batch_size, num_total_patches, patch_size x patch_size x num_channels), optional) — Image patches to be used as continuous embeddings. The patches are flattened and then projected to the hidden size of the model.
- image_patches_indices (torch.LongTensor of shape (batch_size, sequence_length), optional) — Tensor of indices of the image patches in the input_ids tensor.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.text_config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.text_config.vocab_size].
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- logits_to_keep (int, optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (FuyuConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The FuyuForCausalLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
```python
>>> from transformers import FuyuProcessor, FuyuForCausalLM
>>> from PIL import Image
>>> import requests

>>> processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
>>> model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")

>>> url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a coco-style caption.\n"

>>> inputs = processor(images=image, text=prompt, return_tensors="pt")
>>> outputs = model(**inputs)

>>> generated_ids = model.generate(**inputs, max_new_tokens=7)
>>> generation_text = processor.batch_decode(generated_ids[:, -7:], skip_special_tokens=True)
>>> print(generation_text[0])
A blue bus parked on the side of a road.
```
FuyuImageProcessor
class transformers.FuyuImageProcessor
<source>(do_resize: bool = True, size: typing.Optional[dict[str, int]] = None, resample: Resampling = <Resampling.BILINEAR: 2>, do_pad: bool = True, padding_value: float = 1.0, padding_mode: str = 'constant', do_normalize: bool = True, image_mean: typing.Union[float, list[float]] = 0.5, image_std: typing.Union[float, list[float]] = 0.5, do_rescale: bool = True, rescale_factor: float = 0.00392156862745098, patch_size: typing.Optional[dict[str, int]] = None, **kwargs)
Parameters
- do_resize (bool, optional, defaults to True) — Whether to resize the image to size.
- size (dict[str, int], optional, defaults to {"height": 1080, "width": 1920}) — Dictionary in the format {"height": int, "width": int} specifying the size of the output image.
- resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — PILImageResampling filter to use when resizing the image, e.g. PILImageResampling.BILINEAR.
- do_pad (bool, optional, defaults to True) — Whether to pad the image to size.
- padding_value (float, optional, defaults to 1.0) — The value to pad the image with.
- padding_mode (str, optional, defaults to "constant") — The padding mode to use when padding the image.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image.
- image_mean (float, optional, defaults to 0.5) — The mean to use when normalizing the image.
- image_std (float, optional, defaults to 0.5) — The standard deviation to use when normalizing the image.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image.
- rescale_factor (float, optional, defaults to 1 / 255) — The factor to use when rescaling the image.
- patch_size (dict[str, int], optional, defaults to {"height": 30, "width": 30}) — Dictionary in the format {"height": int, "width": int} specifying the size of the patches.
This class should handle the image processing part before the main FuyuForCausalLM. In particular, it should handle:

Processing Images: Taking a batch of images as input. If the images are variable-sized, it resizes them based on the desired patch dimensions. The image output is always img_h, img_w of (1080, 1920). Then, it patches up these images using the patchify_image function.

Creating Image Input IDs: For each patch, a placeholder ID is given to identify where these patches belong in a token sequence. For variable-sized images, each line of patches is terminated with a newline ID.

Image Patch Indices: For each image patch, the code maintains an index where these patches should be inserted in a token stream.
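A minimal sketch of running the image processor on its own (the exact keys of the returned batch follow from the steps above; printing them avoids hard-coding assumptions):

```python
import io

import requests
from PIL import Image
from transformers import FuyuImageProcessor

image_processor = FuyuImageProcessor()

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
image = Image.open(io.BytesIO(requests.get(url).content))

# Resize/pad the image onto the fixed 1080 x 1920 canvas and return PyTorch tensors.
processed = image_processor(image, return_tensors="pt")
print(processed.keys())
```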
__call__
<source>(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], *args, **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])
Preprocess an image or a batch of images.
FuyuImageProcessorFast
class transformers.FuyuImageProcessorFast
<source>(**kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])
Constructs a fast Fuyu image processor.
__call__
<source>(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], *args, **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])
Preprocess an image or a batch of images.
FuyuProcessor
class transformers.FuyuProcessor
<source>(image_processor, tokenizer, **kwargs)
Parameters
- image_processor (FuyuImageProcessor) — The image processor is a required input.
- tokenizer (LlamaTokenizerFast) — The tokenizer is a required input.
Constructs a Fuyu processor which wraps a Fuyu image processor and a Llama tokenizer into a single processor.
FuyuProcessor offers all the functionalities of FuyuImageProcessor and LlamaTokenizerFast. See __call__() and decode() for more information.
__call__
<source>(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None, text: typing.Union[str, list[str], NoneType] = None, **kwargs: typing_extensions.Unpack[transformers.models.fuyu.processing_fuyu.FuyuProcessorKwargs]) → FuyuBatchEncoding
Parameters
- images (PIL.Image.Image, list[PIL.Image.Image]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported.
- text (str, list[str]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
Returns
FuyuBatchEncoding
A FuyuBatchEncoding with the following fields:
- input_ids — Tensor of token ids to be fed to a model. Returned when text is not None.
- image_patches — List of Tensor of image patches. Returned when images is not None.
- image_patches_indices — Tensor of indices where patch embeddings have to be inserted by the model.
- attention_mask — List of indices specifying which tokens should be attended to by the model when return_attention_mask=True.
Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the text and kwargs arguments to LlamaTokenizerFast's __call__() if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to FuyuImageProcessor's __call__() if images is not None. Please refer to the docstrings of the above two methods for more information.
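Putting the pieces together, a minimal sketch that builds the processor from the hub and inspects the fields listed above:

```python
import io

import requests
from PIL import Image
from transformers import FuyuProcessor

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
image = Image.open(io.BytesIO(requests.get(url).content))

encoding = processor(images=image, text="Generate a coco-style caption.\n", return_tensors="pt")

# Token ids, image patches and patch placement indices, as described above.
print(list(encoding.keys()))
print(encoding["input_ids"].shape)
print(encoding["image_patches"][0].shape)
print(encoding["image_patches_indices"].shape)
```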