
This model was released on 2022-02-22 and added to Hugging Face Transformers on 2022-06-28.

GroupViT

PyTorch

Overview

The GroupViT model was proposed in GroupViT: Semantic Segmentation Emerges from Text Supervision by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. Inspired by CLIP, GroupViT is a vision-language model that can perform zero-shot semantic segmentation on any given set of vocabulary categories.

The abstract from the paper is the following:

Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision.

This model was contributed by xvjiarui. The original code can be found here.

Usage tips

  • You may specify output_segmentation=True in the forward pass of GroupViTModel to get the segmentation logits of the input texts (see the example below).
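
For example, a minimal sketch of requesting the segmentation logits during a forward pass (the checkpoint and prompts follow the examples later on this page):

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, GroupViTModel

>>> processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")
>>> model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs, output_segmentation=True)
>>> seg_logits = outputs.segmentation_logits  # one score map per text prompt, smaller than the input image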

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GroupViT.

  • The quickest way to get started with GroupViT is by checking the example notebooks (which showcase zero-shot segmentation inference).
  • One can also check out the Hugging Face Spaces demo to play with GroupViT.

GroupViTConfig

class transformers.GroupViTConfig


(text_config = None, vision_config = None, projection_dim = 256, projection_intermediate_dim = 4096, logit_scale_init_value = 2.6592, **kwargs)

Parameters

  • text_config (dict, optional) — Dictionary of configuration options used to initialize GroupViTTextConfig.
  • vision_config (dict, optional) — Dictionary of configuration options used to initialize GroupViTVisionConfig.
  • projection_dim (int, optional, defaults to 256) — Dimensionality of the text and vision projection layers.
  • projection_intermediate_dim (int, optional, defaults to 4096) — Dimensionality of the intermediate layer of the text and vision projection layers.
  • logit_scale_init_value (float, optional, defaults to 2.6592) — The initial value of the logit_scale parameter. The default is used as per the original GroupViT implementation.
  • kwargs (optional) — Dictionary of keyword arguments.

GroupViTConfig is the configuration class to store the configuration of a GroupViTModel. It is used to instantiate a GroupViT model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the GroupViT nvidia/groupvit-gcc-yfcc architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
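
A minimal sketch of building a combined configuration, mirroring the sub-configuration examples below; the text_config and vision_config dictionaries and the values shown here are illustrative only:

>>> from transformers import GroupViTConfig, GroupViTModel

>>> # Initializing a GroupViTConfig with nvidia/groupvit-gcc-yfcc style defaults
>>> configuration = GroupViTConfig()

>>> # The text and vision sub-configurations can also be overridden with plain dictionaries
>>> configuration = GroupViTConfig(
...     text_config={"num_hidden_layers": 12},
...     vision_config={"depths": [6, 3, 3]},
...     projection_dim=256,
... )

>>> model = GroupViTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config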

GroupViTTextConfig

class transformers.GroupViTTextConfig


(vocab_size = 49408, hidden_size = 256, intermediate_size = 1024, num_hidden_layers = 12, num_attention_heads = 4, max_position_embeddings = 77, hidden_act = 'quick_gelu', layer_norm_eps = 1e-05, dropout = 0.0, attention_dropout = 0.0, initializer_range = 0.02, initializer_factor = 1.0, pad_token_id = 1, bos_token_id = 49406, eos_token_id = 49407, **kwargs)

Parameters

  • vocab_size (int, optional, defaults to 49408) — Vocabulary size of the GroupViT text model. Defines the number of different tokens that can be represented by the input_ids passed when calling GroupViTModel.
  • hidden_size (int, optional, defaults to 256) — Dimensionality of the encoder layers and the pooler layer.
  • intermediate_size (int, optional, defaults to 1024) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
  • num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 4) — Number of attention heads for each attention layer in the Transformer encoder.
  • max_position_embeddings (int, optional, defaults to 77) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  • hidden_act (str or function, optional, defaults to "quick_gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.
  • layer_norm_eps (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • initializer_factor (float, optional, defaults to 1.0) — A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

This is the configuration class to store the configuration of a GroupViTTextModel. It is used to instantiate a GroupViT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the GroupViT nvidia/groupvit-gcc-yfcc architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import GroupViTTextConfig, GroupViTTextModel

>>> # Initializing a GroupViTTextModel with nvidia/groupvit-gcc-yfcc style configuration
>>> configuration = GroupViTTextConfig()

>>> model = GroupViTTextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

GroupViTVisionConfig

class transformers.GroupViTVisionConfig


(hidden_size = 384, intermediate_size = 1536, depths = [6, 3, 3], num_hidden_layers = 12, num_group_tokens = [64, 8, 0], num_output_groups = [64, 8, 8], num_attention_heads = 6, image_size = 224, patch_size = 16, num_channels = 3, hidden_act = 'gelu', layer_norm_eps = 1e-05, dropout = 0.0, attention_dropout = 0.0, initializer_range = 0.02, initializer_factor = 1.0, assign_eps = 1.0, assign_mlp_ratio = [0.5, 4], **kwargs)

Parameters

  • hidden_size (int, optional, defaults to 384) — Dimensionality of the encoder layers and the pooler layer.
  • intermediate_size (int, optional, defaults to 1536) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
  • depths (list[int], optional, defaults to [6, 3, 3]) — The number of layers in each encoder block.
  • num_group_tokens (list[int], optional, defaults to [64, 8, 0]) — The number of group tokens for each stage.
  • num_output_groups (list[int], optional, defaults to [64, 8, 8]) — The number of output groups for each stage, 0 means no group.
  • num_attention_heads (int, optional, defaults to 6) — Number of attention heads for each attention layer in the Transformer encoder.
  • image_size (int, optional, defaults to 224) — The size (resolution) of each image.
  • patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
  • hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.
  • layer_norm_eps (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.
  • dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • initializer_factor (float, optional, defaults to 1.0) — A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

This is the configuration class to store the configuration of a GroupViTVisionModel. It is used to instantiate a GroupViT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the GroupViT nvidia/groupvit-gcc-yfcc architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import GroupViTVisionConfig, GroupViTVisionModel

>>> # Initializing a GroupViTVisionModel with nvidia/groupvit-gcc-yfcc style configuration
>>> configuration = GroupViTVisionConfig()

>>> model = GroupViTVisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

GroupViTModel

class transformers.GroupViTModel


(config: GroupViTConfig)

Parameters

  • config (GroupViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare GroupViT Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


(input_ids: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, return_loss: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_segmentation: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.models.groupvit.modeling_groupvit.GroupViTModelOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using CLIPImageProcessor. See CLIPImageProcessor.__call__() for details (CLIPProcessor uses CLIPImageProcessor for processing images).
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • return_loss (bool, optional) — Whether or not to return the contrastive loss.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • output_segmentation (bool, optional) — Whether or not to return the segmentation logits.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.models.groupvit.modeling_groupvit.GroupViTModelOutput or tuple(torch.FloatTensor)

A transformers.models.groupvit.modeling_groupvit.GroupViTModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GroupViTConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when return_loss is True) — Contrastive loss for image-text similarity.

  • logits_per_image (torch.FloatTensor of shape (image_batch_size, text_batch_size)) — The scaled dot product scores between image_embeds and text_embeds. This represents the image-text similarity scores.

  • logits_per_text (torch.FloatTensor of shape (text_batch_size, image_batch_size)) — The scaled dot product scores between text_embeds and image_embeds. This represents the text-image similarity scores.

  • segmentation_logits (torch.FloatTensor of shape (batch_size, config.num_labels, logits_height, logits_width)) — Classification scores for each pixel.

    The logits returned do not necessarily have the same size as the pixel_values passed as inputs. This is to avoid doing two interpolations and losing some quality when a user needs to resize the logits to the original image size as post-processing. You should always check your logits shape and resize as needed.

  • text_embeds (torch.FloatTensor of shape (batch_size, output_dim)) — The text embeddings obtained by applying the projection layer to the pooled output of GroupViTTextModel.

  • image_embeds (torch.FloatTensor of shape (batch_size, output_dim)) — The image embeddings obtained by applying the projection layer to the pooled output of GroupViTVisionModel.

  • text_model_output (BaseModelOutputWithPooling) — The output of the GroupViTTextModel.

  • vision_model_output (BaseModelOutputWithPooling) — The output of the GroupViTVisionModel.

The GroupViTModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, GroupViTModel

>>> model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")
>>> processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
... )

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
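
The segmentation logits mentioned in the returns above are smaller than the input image. A minimal sketch of requesting them and upsampling to the original resolution, continuing the example above; the bilinear interpolation and per-pixel argmax are an assumed post-processing choice, not an API of the model:

>>> import torch

>>> with torch.inference_mode():
...     seg_outputs = model(**inputs, output_segmentation=True)

>>> seg_logits = seg_outputs.segmentation_logits  # (batch_size, num_texts, logits_height, logits_width)
>>> # PIL's image.size is (width, height); interpolate expects (height, width)
>>> upsampled = torch.nn.functional.interpolate(
...     seg_logits, size=image.size[::-1], mode="bilinear", align_corners=False
... )
>>> predicted = upsampled.argmax(dim=1)  # one text-prompt index per pixel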

get_text_features


(input_ids: Tensor, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None) → text_features (torch.FloatTensor of shape (batch_size, output_dim))

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

Returns

text_features (torch.FloatTensor of shape (batch_size, output_dim))

The text embeddings obtained by applying the projection layer to the pooled output of GroupViTTextModel.

Examples:

>>> import torch
>>> from transformers import CLIPTokenizer, GroupViTModel

>>> model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")
>>> tokenizer = CLIPTokenizer.from_pretrained("nvidia/groupvit-gcc-yfcc")

>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
>>> with torch.inference_mode():
...     text_features = model.get_text_features(**inputs)

get_image_features


(pixel_values: Tensor) → image_features (torch.FloatTensor of shape (batch_size, output_dim))

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using CLIPImageProcessor. See CLIPImageProcessor.__call__() for details (CLIPProcessor uses CLIPImageProcessor for processing images).

Returns

image_features (torch.FloatTensor of shape (batch_size, output_dim))

The image embeddings obtained by applying the projection layer to the pooled output of GroupViTVisionModel.

Examples:

>>> import torch
>>> from transformers import AutoProcessor, GroupViTModel
>>> from transformers.image_utils import load_image

>>> model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")
>>> processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = load_image(url)

>>> inputs = processor(images=image, return_tensors="pt")
>>> with torch.inference_mode():
...     image_features = model.get_image_features(**inputs)
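
As a usage note, the two feature methods can be combined to compute image-text similarity by hand; a minimal sketch, where the L2 normalization and the omission of the learned logit_scale are simplifying assumptions rather than the exact computation done inside GroupViTModel:

>>> import torch
>>> from transformers import AutoProcessor, GroupViTModel
>>> from transformers.image_utils import load_image

>>> model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")
>>> processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")

>>> image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> with torch.inference_mode():
...     text_features = model.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
...     image_features = model.get_image_features(pixel_values=inputs["pixel_values"])

>>> # Cosine similarity between L2-normalized embeddings
>>> text_features = text_features / text_features.norm(dim=-1, keepdim=True)
>>> image_features = image_features / image_features.norm(dim=-1, keepdim=True)
>>> similarity = image_features @ text_features.T  # shape (num_images, num_texts)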

GroupViTTextModel

class transformers.GroupViTTextModel


(config: GroupViTTextConfig)

forward


(input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GroupViTConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The GroupViTTextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from transformers import CLIPTokenizer, GroupViTTextModel

>>> tokenizer = CLIPTokenizer.from_pretrained("nvidia/groupvit-gcc-yfcc")
>>> model = GroupViTTextModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled (EOS token) states

GroupViTVisionModel

class transformers.GroupViTVisionModel


(config: GroupViTVisionConfig)

forward


(pixel_values: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using CLIPImageProcessor. See CLIPImageProcessor.__call__() for details (CLIPProcessor uses CLIPImageProcessor for processing images).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GroupViTConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The GroupViTVisionModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, GroupViTVisionModel

>>> processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")
>>> model = GroupViTVisionModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states