This model was released on 2024-05-16 and added to Hugging Face Transformers on 2024-07-17.

Chameleon

Overview

The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by the Meta AI Chameleon Team. Chameleon is a vision-language model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including in an interleaved format, and generates a textual response. The image generation module has not been released yet.

The abstract from the paper is the following:

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.

Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image generation using an auto-regressive transformer. Taken from the original paper.

This model was contributed by joaogante and RaushanTurganbay. The original code can be found here.

Usage tips

We advise users to use padding_side="left" when computing batched generation, as it leads to more accurate results. Simply make sure to set processor.tokenizer.padding_side = "left" before generating.
Note that Chameleon was tuned for safety alignment. If the model refuses to answer, consider asking a more concrete question instead of an open-ended one.
Chameleon generates in chat format, which means that the generated text will always be the “assistant’s turn”. You can enable text-completion generation by passing return_for_text_completion=True when calling the processor.
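The reason left padding matters for batched generation can be illustrated without loading the model. The sketch below is plain Python, not the tokenizer's implementation; the token ids and the `left_pad` helper are hypothetical. With left padding, the last position of every row is a real token, so newly generated tokens directly continue each prompt.

```python
def left_pad(batch, pad_id):
    """Left-pad lists of token ids to a common length and build attention masks."""
    width = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = width - len(ids)
        padded.append([pad_id] * n_pad + list(ids))   # padding goes in front
        masks.append([0] * n_pad + [1] * len(ids))    # 0 = padding, 1 = real token
    return padded, masks


# Hypothetical token ids for two prompts of different lengths.
batch = [[11, 12, 13, 14], [21, 22]]
padded, masks = left_pad(batch, pad_id=0)
print(padded)  # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(masks)   # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

With right padding, the shorter row would end in padding tokens and generation would have to continue from them, which is what the tip above is meant to avoid.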

The Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. Rather than adding a new token, it reuses one of the reserved tokens: <reserved08707>. You have to add <image> to your prompt at the place where the image should be embedded for correct generation.
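As a rough planning aid, the number of sequence positions taken up by images can be estimated from the number of <image> placeholders. This is a minimal sketch, assuming each placeholder expands to `image_seq_length` positions (1024 by default in ChameleonProcessor) and a context of 4096 positions (the `max_position_embeddings` default); it ignores any extra marker tokens the processor may add around an image, and `image_token_budget` is a hypothetical helper, not a library function.

```python
IMAGE_TOKEN = "<image>"


def image_token_budget(prompt, image_seq_length=1024, max_positions=4096):
    """Return (number of images, positions used by images, positions left over)."""
    n_images = prompt.count(IMAGE_TOKEN)
    image_positions = n_images * image_seq_length
    return n_images, image_positions, max_positions - image_positions


n_images, used, remaining = image_token_budget(
    "What do these images have in common?<image><image>"
)
print(n_images, used, remaining)  # 2 2048 2048
```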

Usage example

Single image inference

Chameleon is a gated model, so make sure you have access and log in to the Hugging Face Hub using a token. Here’s how to load the model and perform inference in half-precision (torch.bfloat16):

from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
import torch
from PIL import Image
import requests

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")

# prepare image and text prompt
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))

Multi image inference

Chameleon can perform inference with multiple images as input, where images either belong to the same prompt or different prompts (in batched inference). Here is how you can do it:

from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
import torch
from PIL import Image
import requests

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")

# Get three different images
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_stop = Image.open(requests.get(url, stream=True).raw)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)

url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image_snowman = Image.open(requests.get(url, stream=True).raw)

# Prepare a batched prompt, where the first one is a multi-image prompt and the second is not
prompts = [
    "What do these images have in common?<image><image>",
    "<image>What is shown in this image?"
]

# We can simply feed images in the order they have to be used in the text prompt
# Each "<image>" token uses one image leaving the next for the subsequent "<image>" tokens
inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(device=model.device, dtype=torch.bfloat16)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=50)
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
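The ordering rule described in the comments above (each "<image>" placeholder consumes the next image from the flat list) can be sketched in plain Python. This is only an illustration of the rule, not the processor's code; `assign_images` is a hypothetical helper and the strings stand in for PIL images.

```python
def assign_images(prompts, images, image_token="<image>"):
    """Split a flat list of images across prompts, in placeholder order."""
    counts = [prompt.count(image_token) for prompt in prompts]
    if sum(counts) != len(images):
        raise ValueError(
            f"Found {sum(counts)} image placeholders but {len(images)} images"
        )
    assigned, start = [], 0
    for count in counts:
        assigned.append(images[start:start + count])
        start += count
    return assigned


prompts = [
    "What do these images have in common?<image><image>",
    "<image>What is shown in this image?",
]
# Placeholder names stand in for the three PIL images loaded above.
images = ["image_stop", "image_cats", "image_snowman"]
print(assign_images(prompts, images))
# [['image_stop', 'image_cats'], ['image_snowman']]
```

Checking the placeholder count against the number of images before calling the processor is a cheap way to catch a mismatched batch early.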

Model optimization

Quantization using Bitsandbytes

The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while largely preserving the performance of the original model. First, make sure to install bitsandbytes (pip install bitsandbytes) and to have access to a GPU/accelerator that is supported by the library.

bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit this link.
We value your feedback to help identify bugs before the full release! Check out these docs for more details and feedback links.

Simply replace the model loading in the snippet above with:

import torch
from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", quantization_config=quantization_config, device_map="auto")
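To get a feel for the memory savings, the weight footprint can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, assuming a nominal 7 billion parameters for the 7b checkpoint; it counts weights only and ignores quantization metadata, activations, the KV cache, and any layers kept in higher precision.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 7e9  # nominal parameter count (assumption)
for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```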

Use Flash-Attention 2 and SDPA to further speed-up generation

The model supports both Flash-Attention 2 and PyTorch’s torch.nn.functional.scaled_dot_product_attention (SDPA), either of which can be enabled for optimization. SDPA is the default option when you load the model. If you want to switch to Flash-Attention 2, first make sure to install flash-attn. Refer to the original repository for instructions on installing that package. Then replace the model loading in the snippet above with:

import torch
from transformers import ChameleonForConditionalGeneration

model_id = "facebook/chameleon-7b"
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).to(0)

ChameleonConfig

class transformers.ChameleonConfig

(
    vocab_size: typing.Optional[int] = 65536,
    hidden_size: typing.Optional[int] = 4096,
    intermediate_size: typing.Optional[int] = 11008,
    num_hidden_layers: typing.Optional[int] = 32,
    num_attention_heads: typing.Optional[int] = 32,
    num_key_value_heads: typing.Optional[int] = 32,
    hidden_act: typing.Optional[int] = 'silu',
    max_position_embeddings: typing.Optional[int] = 4096,
    initializer_range: typing.Optional[float] = 0.02,
    rms_norm_eps: typing.Optional[int] = 1e-05,
    use_cache: typing.Optional[bool] = True,
    pad_token_id: typing.Optional[int] = None,
    bos_token_id: typing.Optional[int] = 1,
    eos_token_id: typing.Optional[int] = 2,
    tie_word_embeddings: typing.Optional[bool] = False,
    rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None,
    attention_bias: typing.Optional[int] = False,
    attention_dropout: typing.Optional[float] = 0.0,
    model_parallel_size: typing.Optional[int] = 1,
    swin_norm: typing.Optional[bool] = False,
    vq_config: typing.Optional[dict] = None,
    vocabulary_map: typing.Optional[dict] = None,
    mlp_bias: typing.Optional[bool] = False,
    **kwargs
)

Parameters

vocab_size (int, optional, defaults to 65536): Vocabulary size of the Chameleon model. Defines the number of different tokens that can be represented by the input_ids passed when calling ChameleonModel; this includes text and image tokens.
hidden_size (int, optional, defaults to 4096): Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 11008): Dimension of the MLP representations.
num_hidden_layers (int, optional, defaults to 32): Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 32): Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional, defaults to 32): The number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper (https://huggingface.co/papers/2305.13245). If it is not specified, it will default to num_attention_heads.
hidden_act (str or function, optional, defaults to "silu"): The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 4096): The maximum sequence length that this model might ever be used with. Chameleon supports up to 4096 tokens.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (float, optional, defaults to 1e-05): The epsilon used by the RMS normalization layers.
use_cache (bool, optional, defaults to True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
pad_token_id (int, optional): Padding token id.
bos_token_id (int, optional, defaults to 1): Beginning of stream token id.
eos_token_id (int, optional, defaults to 2): End of stream token id.
tie_word_embeddings (bool, optional, defaults to False): Whether to tie weight embeddings.
rope_parameters (RopeParameters, optional): Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and, optionally, parameters used for scaling in case you want to use RoPE with a longer max_position_embeddings.
attention_bias (bool, optional, defaults to False): Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
model_parallel_size (int, optional, defaults to 1): Number of shards used when training the model. This is used in the qk layernorm because the original Chameleon inference doesn’t do reduction in those layers and each rank has its own biases.
swin_norm (bool, optional, defaults to False): Use Swin Transformer normalization.
vq_config (dict, optional): ChameleonVQConfig instance containing the configuration for the VQ-VAE model.
vocabulary_map (dict, optional): A dictionary containing the vocabulary map from the tokenizer. Used to obtain tokens from the image inputs.
mlp_bias (bool, optional, defaults to False): Whether to use a bias in the up_proj, down_proj and gate_proj layers in the MLP layers.
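The relationship between num_attention_heads and num_key_value_heads described above can be made concrete with a small sketch. It is illustrative only: the helper names are hypothetical, and the mapping assumes the common convention that consecutive query heads share one key/value head.

```python
def attention_variant(num_attention_heads, num_key_value_heads):
    """Classify the attention scheme implied by the two head counts."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def kv_head_for_query_head(query_head, num_attention_heads, num_key_value_heads):
    """Index of the key/value head shared by a given query head (assumed grouping)."""
    group_size = num_attention_heads // num_key_value_heads
    return query_head // group_size


print(attention_variant(32, 32))  # MHA (the ChameleonConfig defaults)
print(attention_variant(32, 8))   # GQA
print(attention_variant(32, 1))   # MQA
print([kv_head_for_query_head(q, 8, 2) for q in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1]
```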

This is the configuration class to store the configuration of a ChameleonModel. It is used to instantiate a Chameleon model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of meta/chameleon-7B.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

>>> from transformers import ChameleonModel, ChameleonConfig

>>> # Initializing a chameleon chameleon-7b style configuration
>>> configuration = ChameleonConfig()

>>> # Initializing a model from the chameleon-7b style configuration
>>> model = ChameleonModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

ChameleonVQVAEConfig

class transformers.ChameleonVQVAEConfig

(
    embed_dim: int = 256,
    num_embeddings: int = 8192,
    double_latent: bool = False,
    latent_channels: int = 256,
    resolution: int = 512,
    in_channels: int = 3,
    base_channels: int = 128,
    channel_multiplier: list = [1, 1, 2, 2, 4],
    num_res_blocks: int = 2,
    attn_resolutions: typing.Optional[list[int]] = None,
    dropout: float = 0.0,
    attn_type: str = 'vanilla',
    initializer_range = 0.02,
    **kwargs
)

Parameters

embed_dim (int, optional, defaults to 256): Dimensionality of each embedding vector.
num_embeddings (int, optional, defaults to 8192): Number of codebook embeddings.
double_latent (bool, optional, defaults to False): Whether to use double z channels.
latent_channels (int, optional, defaults to 256): Number of channels for the latent space.
resolution (int, optional, defaults to 512): Resolution of the input images.
in_channels (int, optional, defaults to 3): Number of input channels.
base_channels (int, optional, defaults to 128): Base channel count.
channel_multiplier (list[int], optional, defaults to [1, 1, 2, 2, 4]): Channel multipliers for each resolution.
num_res_blocks (int, optional, defaults to 2): Number of residual blocks.
attn_resolutions (list[int], optional): Resolutions to apply attention.
dropout (float, optional, defaults to 0.0): Dropout rate.
attn_type (str, optional, defaults to "vanilla"): Attention type used in the VQ-GAN encoder. Can be "vanilla" or None.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of a ChameleonVQModel. It is used to instantiate a ChameleonVQModel according to the specified arguments, defining the model architecture. Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information. Instantiating a configuration with the defaults will yield a similar configuration to the VQModel of meta/chameleon-7B.
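The default resolution and channel_multiplier values are consistent with the processor's default image_seq_length of 1024. The sketch below shows the arithmetic under one assumption that is not stated on this page: that the encoder follows the usual VQ-GAN layout and halves the spatial resolution between consecutive channel levels. `latent_grid` is a hypothetical helper.

```python
def latent_grid(resolution=512, channel_multiplier=(1, 1, 2, 2, 4)):
    """Return (side length of the latent grid, number of discrete image tokens)."""
    # Assumption: one 2x downsampling step between consecutive channel levels.
    num_downsamples = len(channel_multiplier) - 1
    side = resolution // (2 ** num_downsamples)
    return side, side * side


side, n_tokens = latent_grid()
print(side, n_tokens)  # 32 1024
```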

ChameleonProcessor

class transformers.ChameleonProcessor

(image_processor, tokenizer, image_seq_length: int = 1024, image_token: str = '<image>')

Parameters

image_processor (ChameleonImageProcessor): The image processor is a required input.
tokenizer (LlamaTokenizerFast): The tokenizer is a required input.
image_seq_length (int, optional, defaults to 1024): Sequence length of one image embedding.
image_token (str, optional, defaults to "<image>"): The special token used to indicate an image in the text.

Constructs a Chameleon processor which wraps a Chameleon image processor and a Chameleon tokenizer into a single processor.

ChameleonProcessor offers all the functionalities of ChameleonImageProcessor and LlamaTokenizerFast. See the __call__() and decode() for more information.

ChameleonImageProcessor

class transformers.ChameleonImageProcessor

(
    do_resize: bool = True,
    size: typing.Optional[dict[str, int]] = None,
    resample: Resampling = 1,
    do_center_crop: bool = True,
    crop_size: typing.Optional[dict[str, int]] = None,
    do_rescale: bool = True,
    rescale_factor: typing.Union[int, float] = 0.0078,
    do_normalize: bool = True,
    image_mean: typing.Union[float, list[float], NoneType] = None,
    image_std: typing.Union[float, list[float], NoneType] = None,
    do_convert_rgb: bool = True,
    **kwargs
)

Parameters

do_resize (bool, optional, defaults to True): Whether to resize the image’s (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
size (dict[str, int], optional, defaults to {"shortest_edge": 512}): Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.
resample (PILImageResampling, optional, defaults to 1): Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
do_center_crop (bool, optional, defaults to True): Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
crop_size (dict[str, int], optional, defaults to {"height": 512, "width": 512}): Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
do_rescale (bool, optional, defaults to True): Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
rescale_factor (int or float, optional, defaults to 0.0078): Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
do_normalize (bool, optional, defaults to True): Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.
image_mean (float or list[float], optional, defaults to [1.0, 1.0, 1.0]): Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to [1.0, 1.0, 1.0]): Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_convert_rgb (bool, optional, defaults to True): Whether to convert the image to RGB.

Constructs a Chameleon image processor.
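Taken together, the default rescale_factor, image_mean and image_std map 8-bit pixel values to roughly the range [-1, 1]. The sketch below works through that arithmetic for a single value; it assumes the usual order (rescale first, then subtract the mean and divide by the standard deviation) and is not the library's implementation.

```python
def rescale_and_normalize(pixel, rescale_factor=0.0078, mean=1.0, std=1.0):
    """Apply rescaling followed by normalization to one pixel value."""
    return (pixel * rescale_factor - mean) / std


for pixel in (0, 128, 255):
    print(pixel, round(rescale_and_normalize(pixel), 4))
# 0 -1.0
# 128 -0.0016
# 255 0.989
```

Note that 0.0078 is close to 1/128, which is why a mid-gray value of 128 lands near zero.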

preprocess

(
    images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']],
    do_resize: typing.Optional[bool] = None,
    size: typing.Optional[dict[str, int]] = None,
    resample: typing.Optional[PIL.Image.Resampling] = None,
    do_center_crop: typing.Optional[bool] = None,
    crop_size: typing.Optional[int] = None,
    do_rescale: typing.Optional[bool] = None,
    rescale_factor: typing.Optional[float] = None,
    do_normalize: typing.Optional[bool] = None,
    image_mean: typing.Union[float, list[float], NoneType] = None,
    image_std: typing.Union[float, list[float], NoneType] = None,
    do_convert_rgb: typing.Optional[bool] = None,
    return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None,
    data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>,
    input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None
)

Parameters

images (ImageInput): Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the image.
size (dict[str, int], optional, defaults to self.size): Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
resample (int, optional, defaults to self.resample): Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional, defaults to self.do_center_crop): Whether to center crop the image.
crop_size (dict[str, int], optional, defaults to self.crop_size): Size of the center crop. Only has an effect if do_center_crop is set to True.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor): Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean): Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std): Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb): Whether to convert the image to RGB.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST): The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or batch of images.
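The geometry of the default resize and center-crop steps can be worked out by hand. The following sketch computes the intermediate size and the crop box for a given input size; the rounding convention and the `resize_then_center_crop` helper are assumptions for illustration, so the library's exact pixel values may differ by one.

```python
def resize_then_center_crop(height, width, shortest_edge=512, crop=512):
    """Return the resized (height, width) and the crop box (top, left, bottom, right)."""
    short_side = min(height, width)
    # Scale so that the shortest edge equals `shortest_edge`, keeping the aspect ratio.
    new_h = round(height * shortest_edge / short_side)
    new_w = round(width * shortest_edge / short_side)
    top = (new_h - crop) // 2
    left = (new_w - crop) // 2
    return (new_h, new_w), (top, left, top + crop, left + crop)


print(resize_then_center_crop(480, 640))
# ((512, 683), (0, 85, 512, 597))
```

For a 480x640 input, the image is first scaled to 512x683 and then the central 512x512 region is kept, so content at the left and right edges is discarded.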

ChameleonImageProcessorFast

class transformers.ChameleonImageProcessorFast

(**kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])

Constructs a fast Chameleon image processor.

preprocess

(
    images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']],
    *args,
    **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs]
) → <class 'transformers.image_processing_base.BatchFeature'>

Parameters

images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]): Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_convert_rgb (bool, optional): Whether to convert the image to RGB.
do_resize (bool, optional): Whether to resize the image.
size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]): Describes the maximum input dimensions to the model.
crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]): Size of the output image after applying center_crop.
resample (Annotated[Union[PILImageResampling, int, NoneType], None]): Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional): Whether to rescale the image.
rescale_factor (float, optional): Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional): Whether to normalize the image.
image_mean (Union[float, list[float], tuple[float, ...], NoneType]): Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (Union[float, list[float], tuple[float, ...], NoneType]): Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_pad (bool, optional): Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]): The size in {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
do_center_crop (bool, optional): Whether to center crop the image.
data_format (Union[~image_utils.ChannelDimension, str, NoneType]): Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (Union[~image_utils.ChannelDimension, str, NoneType]): The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (Annotated[str, None], optional): The device to process the images on. If unset, the device is inferred from the input images.
return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]): Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
disable_grouping (bool, optional): Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
image_seq_length (int, optional): The number of image tokens to be used for each image in the input. Added for backward compatibility, but this should be set as a processor attribute in future models.
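The idea behind grouping images by size (the behavior that disable_grouping turns off) is that images with identical shapes can be stacked and processed as one batch. A minimal sketch of that bookkeeping, using a hypothetical `group_by_shape` helper rather than the library's internal code:

```python
from collections import defaultdict


def group_by_shape(shapes):
    """Map each distinct shape to the indices of the images that have it."""
    groups = defaultdict(list)
    for index, shape in enumerate(shapes):
        groups[shape].append(index)
    return dict(groups)


# Hypothetical (channels, height, width) shapes for a batch of three images.
shapes = [(3, 480, 640), (3, 512, 512), (3, 480, 640)]
print(group_by_shape(shapes))
# {(3, 480, 640): [0, 2], (3, 512, 512): [1]}
```

Keeping the original indices makes it possible to restore the input order after each group has been processed.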

Returns

<class 'transformers.image_processing_base.BatchFeature'>

data (dict): Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
tensor_type (Union[None, str, TensorType], optional): You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

ChameleonVQVAE

class transformers.ChameleonVQVAE

(config: ChameleonVQVAEConfig)

Parameters

config (ChameleonVQVAEConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The VQ-VAE model used in Chameleon for encoding/decoding images into discrete tokens. This model follows the “Make-a-scene: Scene-based text-to-image generation with human priors” paper from Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

_forward_unimplemented

(*input: typing.Any)

Define the computation performed at every call.

Should be overridden by all subclasses.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

ChameleonModel

class transformers.ChameleonModel

(config: ChameleonConfig)

Parameters

config (ChameleonConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Chameleon Model outputting raw hidden-states without any specific head on top.

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(
    input_ids: typing.Optional[torch.LongTensor] = None,
    pixel_values: typing.Optional[torch.FloatTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs]
) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using ChameleonImageProcessor. See ChameleonImageProcessor.__call__() for details (ChameleonProcessor uses ChameleonImageProcessor for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].
What are position IDs?
past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.
The model will output the same cache format that is fed as input.
If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
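
The relationship between past_key_values, the unprocessed input_ids, and cache_position described above can be illustrated with a small plain-Python sketch. It does not call the library: nested lists stand in for tensors, the helper name split_for_cache and the token ids are hypothetical, and it assumes the cache already holds the first past_length tokens of every row.

```python
def split_for_cache(input_ids, past_length):
    """Return (unprocessed_ids, cache_position) for one decoding step.

    input_ids: nested list of shape (batch_size, sequence_length).
    past_length: number of leading tokens whose key/value states are
        assumed to be cached already (hypothetical bookkeeping value).
    """
    sequence_length = len(input_ids[0])
    # Only tokens without cached key/value states are passed to the model,
    # giving shape (batch_size, unprocessed_length).
    unprocessed = [row[past_length:] for row in input_ids]
    # Absolute positions of those tokens in the full sequence.
    cache_position = list(range(past_length, sequence_length))
    return unprocessed, cache_position


full_ids = [[5, 17, 42, 8, 99]]  # made-up token ids
new_ids, positions = split_for_cache(full_ids, past_length=4)
print(new_ids, positions)
```

With four tokens already cached, only the final token is fed to the model, and its cache position is its absolute index in the sequence.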

Returns

transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ChameleonConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The ChameleonModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

ChameleonForConditionalGeneration

class transformers.ChameleonForConditionalGeneration

(config)

Parameters

config (ChameleonConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Chameleon Model with a head on top used for outputting logits for next token prediction.

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(
input_ids: typing.Optional[torch.LongTensor] = None,
pixel_values: typing.Optional[torch.FloatTensor] = None,
attention_mask: typing.Optional[torch.Tensor] = None,
position_ids: typing.Optional[torch.LongTensor] = None,
past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
inputs_embeds: typing.Optional[torch.FloatTensor] = None,
labels: typing.Optional[torch.LongTensor] = None,
use_cache: typing.Optional[bool] = None,
output_attentions: typing.Optional[bool] = None,
output_hidden_states: typing.Optional[bool] = None,
cache_position: typing.Optional[torch.LongTensor] = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
**kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]
) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using ChameleonImageProcessor. See ChameleonImageProcessor.__call__() for details (ChameleonProcessor uses ChameleonImageProcessor for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].
What are position IDs?
past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.
The model will output the same cache format that is fed as input.
If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
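
The logits_to_keep semantics described above can be sketched in plain Python. This is an illustration rather than the library's code: a list stands in for the sequence dimension of a tensor, and the helper name select_positions is hypothetical. It shows why 0 is the "keep everything" special case when the integer is treated as a negative slice start.

```python
def select_positions(positions, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed.

    positions: list standing in for the sequence dimension.
    logits_to_keep: int (keep the last N positions, 0 keeps all) or a
        1D collection of indices to keep.
    """
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position,
        # while N > 0 keeps only the last N positions.
        return positions[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit indices along the sequence.
    return [positions[i] for i in logits_to_keep]


tokens = ["a", "b", "c", "d", "e"]
print(select_positions(tokens))          # all positions
print(select_positions(tokens, 1))       # last position only
print(select_positions(tokens, [0, 4]))  # explicit indices
```

During generation only the last position is needed, which is the case where skipping the other positions saves the most memory.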

Returns

transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ChameleonConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
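
To make the interaction between labels, the -100 ignore value, and the returned loss concrete, here is a minimal plain-Python sketch for a single sequence. It is not the library's implementation: the function name next_token_loss is hypothetical, and it assumes the usual next-token convention in which the scores at position t are compared with the label at position t + 1 and the loss is averaged over the non-ignored positions.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def next_token_loss(logits, labels):
    """Mean cross-entropy over the supervised positions of one sequence.

    logits: list of per-position score lists, shape (sequence_length, vocab_size).
    labels: list of ints of length sequence_length.
    """
    total, count = 0.0, 0
    # Assumption: position t predicts the label at position t + 1.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # ignored (masked) positions contribute nothing
        # Numerically stable log-sum-exp of the scores.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]
        count += 1
    # The mean over zero supervised positions is undefined.
    return total / count if count else float("nan")


uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]  # made-up scores, vocab of 4
loss = next_token_loss(uniform, [IGNORE_INDEX, 2, IGNORE_INDEX])
print(loss)
```

With uniform scores over four vocabulary entries and a single supervised position, the loss equals log(4), and the positions labelled -100 do not change it.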

The ChameleonForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
>>> import torch
>>> import requests
>>> from PIL import Image

>>> model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16)
>>> processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")

>>> prompt = "I used to know a lot about constellations when I was younger, but as I grew older, I forgot most of what I knew. These are the only two constellations that I really remember now.<image><image>I would like for you to tell me about 3 more constellations and give me a little bit of history about the constellation."
>>> image = Image.open(requests.get("https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg", stream=True).raw)
>>> image_2 = Image.open(requests.get("https://www.kxan.com/wp-content/uploads/sites/40/2020/10/ORION.jpg", stream=True).raw)

>>> inputs = processor(images=[image, image_2], text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

>>> generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
