This model was released on 2024-05-16 and added to Hugging Face Transformers on 2024-07-17.
Chameleon
Overview
The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by the Meta AI Chameleon Team. Chameleon is a vision-language model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including in an interleaved format, and generates a textual response. The image generation module has not been released yet.
The abstract from the paper is the following:
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.
Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image generation using an auto-regressive transformer. Taken from the original paper.
This model was contributed by joaogante and RaushanTurganbay. The original code can be found here.
Usage tips
- We advise users to set padding_side="left" for batched generation, as it leads to more accurate results. Simply set processor.tokenizer.padding_side = "left" before generating.
- Note that Chameleon was tuned for safety alignment. If the model refuses to answer, consider asking a more concrete question instead of an open-ended one.
- Chameleon generates in chat format, which means that the generated text will always be the “assistant’s turn”. You can enable text completion generation by passing return_for_text_completion=True when calling the processor.
- The Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. Rather than adding a new token, it uses one of the reserved tokens: <reserved08707>. You have to add <image> to your prompt at the place where the image should be embedded for correct generation.
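The padding advice above can be illustrated without loading the model. The following is a minimal sketch, assuming token ids are plain Python lists; the helper pad_batch and the pad id of 0 are hypothetical and not part of Transformers. It shows that with left padding the last position of every row is a real token, which is the position generation continues from.

```python
def pad_batch(sequences, pad_id=0, padding_side="left"):
    """Pad variable-length id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        num_pad = max_len - len(seq)
        pad_ids = [pad_id] * num_pad      # hypothetical pad id
        pad_mask = [0] * num_pad          # 0 marks positions to ignore
        real_mask = [1] * len(seq)        # 1 marks real tokens
        if padding_side == "left":
            input_ids.append(pad_ids + list(seq))
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(list(seq) + pad_ids)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


batch = [[11, 12, 13, 14], [21, 22]]
left_ids, left_mask = pad_batch(batch, padding_side="left")
right_ids, right_mask = pad_batch(batch, padding_side="right")

print("left :", left_ids, left_mask)
print("right:", right_ids, right_mask)
```

With right padding, the shorter row ends in pad tokens, so new tokens would be appended after padding rather than after the prompt.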
Usage example
Single image inference
Chameleon is a gated model, so make sure you have access and are logged in to the Hugging Face Hub using a token. Here’s how to load the model and perform inference in half-precision (torch.bfloat16):
```python
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
import torch
from PIL import Image
import requests

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")

# prepare image and text prompt
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```
Multi image inference
Chameleon can perform inference with multiple images as input, where images either belong to the same prompt or different prompts (in batched inference). Here is how you can do it:
```python
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
import torch
from PIL import Image
import requests

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")

# Get three different images
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_stop = Image.open(requests.get(url, stream=True).raw)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)

url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image_snowman = Image.open(requests.get(url, stream=True).raw)

# Prepare a batched prompt, where the first one is a multi-image prompt and the second is not
prompts = [
    "What do these images have in common?<image><image>",
    "<image>What is shown in this image?",
]

# We can simply feed images in the order they have to be used in the text prompt
# Each "<image>" token uses one image leaving the next for the subsequent "<image>" tokens
inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(device=model.device, dtype=torch.bfloat16)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=50)
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```
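The ordering rule described in the comments above (each <image> placeholder consumes the next image from the flat list) can be sketched in plain Python. The helper assign_images is hypothetical and is not how the processor is implemented; it only mirrors the documented behavior, so that a mismatch between placeholders and images can be caught before calling the processor. Strings stand in for PIL images here.

```python
def assign_images(prompts, images, image_token="<image>"):
    """Group a flat list of images by prompt, following placeholder order."""
    counts = [prompt.count(image_token) for prompt in prompts]
    if sum(counts) != len(images):
        raise ValueError(
            f"Found {sum(counts)} '{image_token}' placeholders but got {len(images)} images."
        )
    grouped, start = [], 0
    for count in counts:
        # Each prompt takes as many images as it has placeholders, in order.
        grouped.append(images[start:start + count])
        start += count
    return grouped


prompts = [
    "What do these images have in common?<image><image>",
    "<image>What is shown in this image?",
]
images = ["image_stop", "image_cats", "image_snowman"]  # stand-ins for PIL images

groups = assign_images(prompts, images)
for prompt, group in zip(prompts, groups):
    print(prompt, "->", group)
```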
Model optimization
Quantization using Bitsandbytes
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while largely maintaining the performance of the original model. First make sure to install bitsandbytes (pip install bitsandbytes) and to have access to a GPU/accelerator that is supported by the library.
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit this link.
We value your feedback to help identify bugs before the full release! Check out these docs for more details and feedback links.
Simply replace the model loading in the snippet above with:
```python
import torch
from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", quantization_config=quantization_config, device_map="auto")
```
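As a rough illustration of the memory savings, the sketch below estimates the size of the weights alone at different precisions. It assumes about 7 billion parameters (inferred only from the checkpoint name) and ignores activations, the KV cache, layers that stay in higher precision, and the overhead of quantization constants, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # assumption based on the "7b" in the checkpoint name

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```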
Use Flash-Attention 2 and SDPA to further speed-up generation
The model supports both Flash-Attention 2 and PyTorch’s torch.nn.functional.scaled_dot_product_attention (SDPA), either of which can be enabled for optimization. SDPA is the default option when you load the model. If you want to switch to Flash-Attention 2, first make sure to install flash-attn. Refer to the original repository for instructions on installing that package. Then replace the model loading in the snippet above with:
```python
import torch
from transformers import ChameleonForConditionalGeneration

model_id = "facebook/chameleon-7b"
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(0)
```
ChameleonConfig
class transformers.ChameleonConfig
(vocab_size: typing.Optional[int] = 65536, hidden_size: typing.Optional[int] = 4096, intermediate_size: typing.Optional[int] = 11008, num_hidden_layers: typing.Optional[int] = 32, num_attention_heads: typing.Optional[int] = 32, num_key_value_heads: typing.Optional[int] = 32, hidden_act: typing.Optional[int] = 'silu', max_position_embeddings: typing.Optional[int] = 4096, initializer_range: typing.Optional[float] = 0.02, rms_norm_eps: typing.Optional[int] = 1e-05, use_cache: typing.Optional[bool] = True, pad_token_id: typing.Optional[int] = None, bos_token_id: typing.Optional[int] = 1, eos_token_id: typing.Optional[int] = 2, tie_word_embeddings: typing.Optional[bool] = False, rope_parameters: typing.Union[transformers.modeling_rope_utils.RopeParameters, dict[str, transformers.modeling_rope_utils.RopeParameters], NoneType] = None, attention_bias: typing.Optional[int] = False, attention_dropout: typing.Optional[float] = 0.0, model_parallel_size: typing.Optional[int] = 1, swin_norm: typing.Optional[bool] = False, vq_config: typing.Optional[dict] = None, vocabulary_map: typing.Optional[dict] = None, mlp_bias: typing.Optional[bool] = False, **kwargs)
Parameters
- vocab_size (int, optional, defaults to 65536) — Vocabulary size of the Chameleon model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling ChameleonModel; this includes text and image tokens.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 11008) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional, defaults to 32) — The number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper (https://huggingface.co/papers/2305.13245). If it is not specified, it will default to num_attention_heads.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model might ever be used with. Chameleon supports up to 4096 tokens.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- pad_token_id (int, optional) — Padding token id.
- bos_token_id (int, optional, defaults to 1) — Beginning of stream token id.
- eos_token_id (int, optional, defaults to 2) — End of stream token id.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
- attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- model_parallel_size (int, optional, defaults to 1) — Number of shards used when training the model. This will be used in qk layernorm because the original Chameleon inference doesn’t do reduction in those layers and each rank has its own biases.
- swin_norm (bool, optional, defaults to False) — Use Swin Transformer normalization.
- vq_config (dict, optional) — ChameleonVQConfig instance containing the configuration for the VQ-VAE model.
- vocabulary_map (dict, optional) — A dictionary containing the vocabulary map from the tokenizer. Used to obtain tokens from the image inputs.
- mlp_bias (bool, optional, defaults to False) — Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
This is the configuration class to store the configuration of a ChameleonModel. It is used to instantiate a Chameleon model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the meta/chameleon-7B.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
```python
>>> from transformers import ChameleonModel, ChameleonConfig

>>> # Initializing a chameleon chameleon-7b style configuration
>>> configuration = ChameleonConfig()

>>> # Initializing a model from the chameleon-7b style configuration
>>> model = ChameleonModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
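The relationship between num_attention_heads and num_key_value_heads described in the parameter list can be written as a small helper. This is an illustrative sketch only; attention_variant is a hypothetical function and not part of Transformers, and the divisibility check is an assumption about what a valid grouping requires.

```python
def attention_variant(num_attention_heads, num_key_value_heads=None):
    """Classify the attention layout implied by the two head counts."""
    if num_key_value_heads is None:
        # Documented fallback: default to the number of attention heads.
        num_key_value_heads = num_attention_heads
    if num_attention_heads % num_key_value_heads != 0:
        # Assumption: query heads must split evenly into key/value groups.
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


print(attention_variant(32, 32))  # default Chameleon configuration
print(attention_variant(32, 8))
print(attention_variant(32, 1))
```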
ChameleonVQVAEConfig
class transformers.ChameleonVQVAEConfig
(embed_dim: int = 256, num_embeddings: int = 8192, double_latent: bool = False, latent_channels: int = 256, resolution: int = 512, in_channels: int = 3, base_channels: int = 128, channel_multiplier: list = [1, 1, 2, 2, 4], num_res_blocks: int = 2, attn_resolutions: typing.Optional[list[int]] = None, dropout: float = 0.0, attn_type: str = 'vanilla', initializer_range = 0.02, **kwargs)
Parameters
- embed_dim (int, optional, defaults to 256) — Dimensionality of each embedding vector.
- num_embeddings (int, optional, defaults to 8192) — Number of codebook embeddings.
- double_latent (bool, optional, defaults to False) — Whether to use double z channels.
- latent_channels (int, optional, defaults to 256) — Number of channels for the latent space.
- resolution (int, optional, defaults to 512) — Resolution of the input images.
- in_channels (int, optional, defaults to 3) — Number of input channels.
- base_channels (int, optional, defaults to 128) — Base channel count.
- channel_multiplier (list[int], optional, defaults to [1, 1, 2, 2, 4]) — Channel multipliers for each resolution.
- num_res_blocks (int, optional, defaults to 2) — Number of residual blocks.
- attn_resolutions (list[int], optional) — Resolutions to apply attention.
- dropout (float, optional, defaults to 0.0) — Dropout rate.
- attn_type (str, optional, defaults to "vanilla") — Attention type used in VQ-GAN encoder. Can be "vanilla" or None.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a ChameleonVQModel. It is used to instantiate a ChameleonVQModel according to the specified arguments, defining the model architecture. Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information. Instantiating a configuration with the defaults will yield a similar configuration to the VQModel of the meta/chameleon-7B.
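The defaults above are consistent with the image_seq_length of 1024 used by ChameleonProcessor. The sketch below works through the arithmetic under one assumption that this page does not confirm: as in typical VQGAN-style encoders, the spatial resolution is halved between consecutive entries of channel_multiplier. The helper names are hypothetical.

```python
def latent_grid(resolution=512, channel_multiplier=(1, 1, 2, 2, 4)):
    """Return (grid side, number of tokens) of the quantized latent map."""
    # Assumption: one 2x downsampling between consecutive resolution levels.
    num_downsamples = len(channel_multiplier) - 1
    side = resolution // (2 ** num_downsamples)
    return side, side * side


def channels_per_level(base_channels=128, channel_multiplier=(1, 1, 2, 2, 4)):
    """Channel width at each resolution level: base_channels * multiplier."""
    return [base_channels * m for m in channel_multiplier]


side, num_tokens = latent_grid()
print(f"latent grid: {side}x{side} -> {num_tokens} image tokens")
print("channels per level:", channels_per_level())
```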
ChameleonProcessor
class transformers.ChameleonProcessor
(image_processor, tokenizer, image_seq_length: int = 1024, image_token: str = '<image>')
Parameters
- image_processor (ChameleonImageProcessor) — The image processor is a required input.
- tokenizer (LlamaTokenizerFast) — The tokenizer is a required input.
- image_seq_length (int, optional, defaults to 1024) — Sequence length of one image embedding.
- image_token (str, optional, defaults to "<image>") — The special token used to indicate image in the text.
Constructs a Chameleon processor which wraps a Chameleon image processor and a Chameleon tokenizer into a single processor.
ChameleonProcessor offers all the functionalities of ChameleonImageProcessor and LlamaTokenizerFast. See the __call__() and decode() for more information.
ChameleonImageProcessor
class transformers.ChameleonImageProcessor
(do_resize: bool = True, size: typing.Optional[dict[str, int]] = None, resample: Resampling = 1, do_center_crop: bool = True, crop_size: typing.Optional[dict[str, int]] = None, do_rescale: bool = True, rescale_factor: typing.Union[int, float] = 0.0078, do_normalize: bool = True, image_mean: typing.Union[float, list[float], NoneType] = None, image_std: typing.Union[float, list[float], NoneType] = None, do_convert_rgb: bool = True, **kwargs)
Parameters
- do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
- size (dict[str, int], optional, defaults to {"shortest_edge": 512}) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.
- resample (PILImageResampling, optional, defaults to 1) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
- do_center_crop (bool, optional, defaults to True) — Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
- crop_size (dict[str, int], optional, defaults to {"height": 512, "width": 512}) — Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
- rescale_factor (int or float, optional, defaults to 0.0078) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.
- image_mean (float or list[float], optional, defaults to [1.0, 1.0, 1.0]) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
- image_std (float or list[float], optional, defaults to [1.0, 1.0, 1.0]) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
- do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
Constructs a Chameleon image processor.
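With the defaults listed above (rescale_factor of 0.0078, image_mean and image_std of 1.0), rescaling followed by normalization maps 8-bit pixel values from [0, 255] to roughly [-1, 1]. The sketch below shows that arithmetic for a single channel value, assuming the usual order of rescale first and then (x - mean) / std. The helper name is hypothetical.

```python
def rescale_and_normalize(pixel, rescale_factor=0.0078, mean=1.0, std=1.0):
    """Apply rescaling, then mean/std normalization, to one pixel value."""
    rescaled = pixel * rescale_factor   # [0, 255] -> roughly [0, 2]
    return (rescaled - mean) / std      # roughly [0, 2] -> roughly [-1, 1]


for value in (0, 128, 255):
    print(value, "->", round(rescale_and_normalize(value), 4))
```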
preprocess
(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], do_resize: typing.Optional[bool] = None, size: typing.Optional[dict[str, int]] = None, resample: typing.Optional[PIL.Image.Resampling] = None, do_center_crop: typing.Optional[bool] = None, crop_size: typing.Optional[int] = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] = None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float, list[float], NoneType] = None, image_std: typing.Union[float, list[float], NoneType] = None, do_convert_rgb: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None)
Parameters
- images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
- size (dict[str, int], optional, defaults to self.size) — Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
- resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
- do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the image.
- crop_size (dict[str, int], optional, defaults to self.crop_size) — Size of the center crop. Only has an effect if do_center_crop is set to True.
- do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
- rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
- image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
- image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
- do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
- return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
  - Unset: Return a list of np.ndarray.
  - TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  - TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input image.
- input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - "none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
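The resize and center-crop steps can be sketched as pure geometry. The assumptions here are that the shortest edge is scaled to size["shortest_edge"] with the aspect ratio preserved, that the longer edge is truncated to an integer, and that the crop box is centered. The exact rounding in the library may differ, and both helper names are hypothetical.

```python
def resize_shortest_edge(height, width, shortest_edge=512):
    """Output (height, width) after scaling the shortest edge to `shortest_edge`."""
    short, long = (height, width) if height <= width else (width, height)
    new_long = int(shortest_edge * long / short)  # assumed rounding: truncation
    return (shortest_edge, new_long) if height <= width else (new_long, shortest_edge)


def center_crop_box(height, width, crop_height=512, crop_width=512):
    """Return (top, left, bottom, right) of a centered crop box."""
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return top, left, top + crop_height, left + crop_width


resized = resize_shortest_edge(480, 640)
box = center_crop_box(*resized)
print("resized (h, w):", resized)
print("crop box (top, left, bottom, right):", box)
```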
ChameleonImageProcessorFast
class transformers.ChameleonImageProcessorFast
(**kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs])
Constructs a fast Chameleon image processor.
preprocess
(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], *args, **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs]) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
- do_resize (bool, optional) — Whether to resize the image.
- size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
- crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
- resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
- do_rescale (bool, optional) — Whether to rescale the image.
- rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional) — Whether to normalize the image.
- image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
- image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
- do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
- pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size in {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
- do_center_crop (bool, optional) — Whether to center crop the image.
- data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
- input_data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - "none" or ChannelDimension.NONE: image in (height, width) format.
- device (Annotated[str, None], optional) — The device to process the images on. If unset, the device is inferred from the input images.
- return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
- disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
- image_seq_length (int, optional) — The number of image tokens to be used for each image in the input. Added for backward compatibility but this should be set as a processor attribute in future models.
Returns
<class 'transformers.image_processing_base.BatchFeature'>
- data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.
ChameleonVQVAE
class transformers.ChameleonVQVAE
(config: ChameleonVQVAEConfig)
Parameters
- config (ChameleonVQVAEConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The VQ-VAE model used in Chameleon for encoding/decoding images into discrete tokens. This model follows the “Make-a-scene: Scene-based text-to-image generation with human priors” paper from Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
_forward_unimplemented
(*input: typing.Any)
Define the computation performed at every call.
Should be overridden by all subclasses.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
ChameleonModel
class transformers.ChameleonModel
(config: ChameleonConfig)
Parameters
- config (ChameleonConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Chameleon Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs]) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using ChameleonImageProcessor. See ChameleonImageProcessor.__call__() for details (ChameleonProcessor uses ChameleonImageProcessor for processing images).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
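The relationship between attention_mask, position_ids and cache_position described above can be illustrated with plain Python lists standing in for tensors. This is a minimal sketch, not the library's implementation: the pad id 0, the left-padded batch and the helper names are assumptions made for illustration, and deriving position_ids from a running count over the mask is one common convention rather than something this page prescribes.

```python
PAD_ID = 0  # hypothetical padding id, chosen only for this illustration


def build_attention_mask(input_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in input_ids]


def position_ids_from_mask(attention_mask):
    """Number the unmasked tokens 0, 1, 2, ... and give padding a dummy position 0."""
    all_positions = []
    for row in attention_mask:
        positions, seen = [], 0
        for m in row:
            positions.append(seen if m == 1 else 0)
            seen += m
        all_positions.append(positions)
    return all_positions


# Two left-padded sequences (batch_size=2, sequence_length=5).
input_ids = [
    [0, 0, 11, 12, 13],
    [21, 22, 23, 24, 25],
]

attention_mask = build_attention_mask(input_ids)
position_ids = position_ids_from_mask(attention_mask)

# cache_position only counts slots in the sequence, so padding does not change it.
cache_position = list(range(len(input_ids[0])))

print(attention_mask)  # [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
print(position_ids)    # [[0, 0, 0, 1, 2], [0, 1, 2, 3, 4]]
print(cache_position)  # [0, 1, 2, 3, 4]
```

Note how the first row's position_ids restart at the first real token, while cache_position is identical for every row.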
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ChameleonConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ChameleonModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
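As a rough illustration of the past_key_values contract described in the parameters above (pass only the tokens the cache has not seen yet), the sketch below uses a toy stand-in for a cache. ToyCache and unprocessed_inputs are hypothetical names invented for this example; they are not part of transformers, and a real Cache stores key/value tensors per layer rather than a token count.

```python
class ToyCache:
    """Hypothetical stand-in for a KV cache: it only tracks how many tokens it has seen."""

    def __init__(self):
        self.seen_tokens = 0

    def update(self, num_new_tokens):
        self.seen_tokens += num_new_tokens


def unprocessed_inputs(input_ids, cache):
    """Keep only the tokens whose key/value states are not in the cache yet."""
    new_ids = [row[cache.seen_tokens:] for row in input_ids]
    # Positions of the new tokens within the full sequence.
    cache_position = list(range(cache.seen_tokens, len(input_ids[0])))
    return new_ids, cache_position


cache = ToyCache()

# Step 1: the whole prompt is unprocessed, shape (batch_size=1, sequence_length=4).
prompt = [[5, 6, 7, 8]]
step1_ids, step1_pos = unprocessed_inputs(prompt, cache)
cache.update(len(step1_ids[0]))

# Step 2: one token was generated; only that token needs to be fed.
extended = [[5, 6, 7, 8, 9]]
step2_ids, step2_pos = unprocessed_inputs(extended, cache)
cache.update(len(step2_ids[0]))

print(step1_ids, step1_pos)  # [[5, 6, 7, 8]] [0, 1, 2, 3]
print(step2_ids, step2_pos)  # [[9]] [4]
```

In the second step the input has shape (batch_size, unprocessed_length) with unprocessed_length equal to 1, which matches the usual one-token-at-a-time decoding loop.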
ChameleonForConditionalGeneration
class transformers.ChameleonForConditionalGeneration
(config)
Parameters
- config (ChameleonConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Chameleon Model with a head on top used for outputting logits for next token prediction.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using ChameleonImageProcessor. See ChameleonImageProcessor.__call__() for details (ChameleonProcessor uses ChameleonImageProcessor for processing images).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or a large vocabulary size. If a torch.Tensor, must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
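The logits_to_keep behaviour described above can be pictured as a slice along the sequence dimension. The sketch below is an illustration under stated assumptions, using nested lists in place of tensors; select_positions is a hypothetical helper, not a transformers function, and it only mimics which positions would receive logits.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed.

    hidden_states: one sequence as a list of per-token vectors.
    logits_to_keep: an int (keep the last N positions, 0 means all)
                    or a list of explicit indices (stand-in for a 1D tensor).
    """
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return hidden_states[slice(-logits_to_keep, None)]
    return [hidden_states[i] for i in logits_to_keep]


# Four token positions, each with a tiny 2-dimensional hidden state.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

print(len(select_positions(hidden, 0)))  # 4: all positions
print(select_positions(hidden, 1))       # [[0.7, 0.8]]: last token only
print(select_positions(hidden, [0, 2]))  # [[0.1, 0.2], [0.5, 0.6]]
```

During generation only the last position is needed, which is why keeping a single position avoids projecting every hidden state onto the full vocabulary.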
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ChameleonConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
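To make the interaction between labels, the -100 ignore index and the returned loss more concrete, here is a small pure-Python sketch of a next-token cross-entropy. It assumes the usual causal-LM convention that the logits at position t are scored against the label at position t + 1; the function name and the toy numbers are invented for illustration, and this is not the library's code.

```python
import math

IGNORE_INDEX = -100


def next_token_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over non-ignored targets for a single sequence.

    logits: list of per-position score lists, shape (sequence_length, vocab_size).
    labels: list of token ids or ignore_index, shape (sequence_length,).
    Returns 0.0 in this sketch when every target is ignored.
    """
    total, count = 0.0, 0
    # Position t predicts the token at position t + 1.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue
        # Numerically stable log of the softmax normalizer.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


# Uniform scores over a vocabulary of 4 tokens, sequence length 3.
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
labels = [0, -100, 1]  # the middle target is excluded from the loss

loss = next_token_loss(logits, labels)
print(round(loss, 4))  # 1.3863, i.e. ln(4) for uniform predictions
```

Only one target contributes here (the final label), so the mean is taken over a single term.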
The ChameleonForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
>>> import torch
>>> import requests
>>> from PIL import Image

>>> model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16)
>>> processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")

>>> prompt = "I used to know a lot about constellations when I was younger, but as I grew older, I forgot most of what I knew. These are the only two constellations that I really remember now.<image><image>I would like for you to tell me about 3 more constellations and give me a little bit of history about the constellation."
>>> image = Image.open(requests.get("https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg", stream=True).raw)
>>> image_2 = Image.open(requests.get("https://www.kxan.com/wp-content/uploads/sites/40/2020/10/ORION.jpg", stream=True).raw)

>>> inputs = processor(images=[image, image_2], text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

>>> generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
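In the example above, the prompt contains two <image> placeholders and two images are passed to the processor. Assuming one placeholder is expected per image, a small sanity check before calling the processor could look like the following sketch. check_image_placeholders is a hypothetical helper written for this illustration; it is not part of transformers, and the shortened prompt and file names are stand-ins.

```python
IMAGE_TOKEN = "<image>"  # placeholder string as it appears in the example prompt


def check_image_placeholders(prompt, images, image_token=IMAGE_TOKEN):
    """Raise ValueError if the prompt and the image list disagree in count."""
    num_placeholders = prompt.count(image_token)
    if num_placeholders != len(images):
        raise ValueError(
            f"Prompt has {num_placeholders} '{image_token}' placeholder(s) "
            f"but {len(images)} image(s) were provided."
        )
    return num_placeholders


prompt = "These are the only two constellations I remember.<image><image>Tell me about 3 more."
images = ["big_dipper.jpg", "orion.jpg"]  # stand-ins for PIL images

print(check_image_placeholders(prompt, images))  # 2
```

Running such a check first gives a readable error message if a placeholder or an image is missing.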