
This model was released on 2022-02-07 and added to Hugging Face Transformers on 2022-03-01.

Data2Vec

PyTorch FlashAttention SDPA

Overview

The Data2Vec model was proposed in data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu and Michael Auli. Data2Vec proposes a unified framework for self-supervised learning across different data modalities - text, audio and images. Importantly, predicted targets for pre-training are contextualized latent representations of the inputs, rather than modality-specific, context-independent targets.

The abstract from the paper is the following:

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches. Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/data2vec.

This model was contributed by edugp and patrickvonplaten.

The original code (for NLP and Speech) can be found here. The original code for vision can be found here.

Usage tips

  • Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method.
  • For Data2VecAudio, preprocessing is identical to Wav2Vec2Model, including feature extraction.
  • For Data2VecText, preprocessing is identical to RobertaModel, including tokenization.
  • For Data2VecVision, preprocessing is identical to BeitModel, including feature extraction.
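In practice, this means the preprocessing classes of those models can be reused directly. A minimal sketch, assuming the public base checkpoints on the Hub (facebook/data2vec-audio-base-960h, facebook/data2vec-text-base and facebook/data2vec-vision-base):

from transformers import AutoProcessor, AutoTokenizer, AutoImageProcessor

# Audio: Wav2Vec2-style preprocessing (feature extraction, plus a CTC tokenizer on the 960h checkpoint)
audio_processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")

# Text: RoBERTa-style byte-level BPE tokenization
tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")

# Vision: BEiT-style image preprocessing (resizing and normalization)
image_processor = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")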

Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.

The SDPA implementation is currently available for the Data2VecAudio and Data2VecVision models.

from transformers import Data2VecVisionForImageClassification
import torch

model = Data2VecVisionForImageClassification.from_pretrained(
    "facebook/data2vec-vision-base",
    attn_implementation="sdpa",
    dtype=torch.float16,
)
...

For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or torch.bfloat16).

For the Data2VecVision model, on a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.5.1, OS Ubuntu 20.04) with float16 and the facebook/data2vec-vision-base model, we saw the following improvements during training and inference:

Training

| num_training_steps | batch_size | image_size | is_cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 2 | (1048, 640) | True | 0.996 | 0.754 | 32.147 | 6722.198 | 4264.653 | 57.626 |

Inference

| Image batch size | Eager (s/iter) | Eager CI, % | Eager memory (MB) | SDPA (s/iter) | SDPA CI, % | SDPA memory (MB) | SDPA speedup | SDPA memory saved |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.011 | ±0.3% | 3.76143e+08 | 0.01 | ±0.3% | 3.74397e+08 | 1.101 | 0.466 |
| 4 | 0.014 | ±0.1% | 4.02756e+08 | 0.012 | ±0.2% | 3.91373e+08 | 1.219 | 2.909 |
| 16 | 0.046 | ±0.3% | 4.96482e+08 | 0.035 | ±0.2% | 4.51017e+08 | 1.314 | 10.081 |
| 32 | 0.088 | ±0.1% | 6.23903e+08 | 0.067 | ±0.1% | 5.32974e+08 | 1.33 | 17.061 |
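The exact numbers depend heavily on hardware and input sizes. As a rough, hypothetical sketch of how a similar comparison could be reproduced locally (the batch size, warmup count and timing loop below are arbitrary choices, and a CUDA GPU is assumed):

import time
import torch
from transformers import Data2VecVisionForImageClassification

def average_forward_time(attn_implementation, steps=20):
    # Load the same checkpoint with the requested attention backend
    model = Data2VecVisionForImageClassification.from_pretrained(
        "facebook/data2vec-vision-base",
        attn_implementation=attn_implementation,
        dtype=torch.float16,
    ).to("cuda").eval()
    pixel_values = torch.randn(16, 3, 224, 224, dtype=torch.float16, device="cuda")
    with torch.no_grad():
        for _ in range(3):  # warmup
            model(pixel_values=pixel_values)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(pixel_values=pixel_values)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

print("eager:", average_forward_time("eager"))
print("sdpa:", average_forward_time("sdpa"))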

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Data2Vec.

Image Classification

Data2VecText documentation resources

Data2VecAudio documentation resources

Data2VecVision documentation resources

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Data2VecTextConfig

class transformers.Data2VecTextConfig

(vocab_size = 30522, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, max_position_embeddings = 512, type_vocab_size = 2, initializer_range = 0.02, layer_norm_eps = 1e-12, pad_token_id = 1, bos_token_id = 0, eos_token_id = 2, use_cache = True, classifier_dropout = None, **kwargs)

Parameters

  • vocab_size (int, optional, defaults to 30522) — Vocabulary size of the DATA2VEC model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Data2VecModel.
  • hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
  • num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
  • intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
  • hidden_act (str or Callable, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
  • hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
  • max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  • type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling Data2VecModel.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
  • is_decoder (bool, optional, defaults to False) — Whether the model is used as a decoder or not. If False, the model is used as an encoder.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
  • classifier_dropout (float, optional) — The dropout ratio for the classification head.

This is the configuration class to store the configuration of a Data2VecTextModel. It is used to instantiate a Data2VecText model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Data2VecText facebook/data2vec-text-base architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Examples:

>>> from transformers import Data2VecTextConfig, Data2VecTextModel

>>> # Initializing a Data2VecText facebook/data2vec-text-base style configuration
>>> configuration = Data2VecTextConfig()

>>> # Initializing a model (with random weights) from the facebook/data2vec-text-base style configuration
>>> model = Data2VecTextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Data2VecAudioConfig

class transformers.Data2VecAudioConfig

(vocab_size = 32, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout = 0.1, activation_dropout = 0.1, attention_dropout = 0.1, feat_proj_dropout = 0.0, final_dropout = 0.1, layerdrop = 0.1, initializer_range = 0.02, layer_norm_eps = 1e-05, feat_extract_activation = 'gelu', conv_dim = (512, 512, 512, 512, 512, 512, 512), conv_stride = (5, 2, 2, 2, 2, 2, 2), conv_kernel = (10, 3, 3, 3, 3, 2, 2), conv_bias = False, num_conv_pos_embedding_groups = 16, conv_pos_kernel_size = 19, num_conv_pos_embeddings = 5, mask_time_prob = 0.05, mask_time_length = 10, mask_time_min_masks = 2, mask_feature_prob = 0.0, mask_feature_length = 10, mask_feature_min_masks = 0, ctc_loss_reduction = 'sum', ctc_zero_infinity = False, use_weighted_layer_sum = False, classifier_proj_size = 256, tdnn_dim = (512, 512, 512, 512, 1500), tdnn_kernel = (5, 3, 3, 1, 1), tdnn_dilation = (1, 2, 3, 1, 1), xvector_output_dim = 512, pad_token_id = 0, bos_token_id = 1, eos_token_id = 2, add_adapter = False, adapter_kernel_size = 3, adapter_stride = 2, num_adapter_layers = 3, output_hidden_size = None, **kwargs)

Parameters

  • vocab_size (int, optional, defaults to 32) — Vocabulary size of the Data2VecAudio model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Data2VecAudioModel.
  • hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
  • num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
  • intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
  • hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
  • hidden_dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • activation_dropout (float, optional, defaults to 0.1) — The dropout ratio for activations inside the fully connected layer.
  • attention_dropout (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
  • final_dropout (float, optional, defaults to 0.1) — The dropout probability for the final projection layer of Data2VecAudioForCTC.
  • layerdrop (float, optional, defaults to 0.1) — The LayerDrop probability. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
  • feat_proj_dropout (float, optional, defaults to 0.0) — The dropout probability for the output of the feature encoder.
  • feat_extract_activation (str, optional, defaults to "gelu") — The non-linear activation function (function or string) in the 1D convolutional layers of the feature extractor. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
  • conv_dim (tuple[int] or list[int], optional, defaults to (512, 512, 512, 512, 512, 512, 512)) — A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the feature encoder. The length of conv_dim defines the number of 1D convolutional layers.
  • conv_stride (tuple[int] or list[int], optional, defaults to (5, 2, 2, 2, 2, 2, 2)) — A tuple of integers defining the stride of each 1D convolutional layer in the feature encoder. The length of conv_stride defines the number of convolutional layers and has to match the length of conv_dim.
  • conv_kernel (tuple[int] or list[int], optional, defaults to (10, 3, 3, 3, 3, 2, 2)) — A tuple of integers defining the kernel size of each 1D convolutional layer in the feature encoder. The length of conv_kernel defines the number of convolutional layers and has to match the length of conv_dim.
  • conv_bias (bool, optional, defaults to False) — Whether the 1D convolutional layers have a bias.
  • num_conv_pos_embeddings (int, optional, defaults to 5) — Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional embeddings layer.
  • num_conv_pos_embedding_groups (int, optional, defaults to 16) — Number of groups of 1D convolutional positional embeddings layer.
  • mask_time_prob (float, optional, defaults to 0.05) — Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking procedure generates "mask_time_prob * len(time_axis) / mask_time_length" independent masks over the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector span to be masked, mask_time_prob should be "prob_vector_start * mask_time_length". Note that overlap may decrease the actual percentage of masked vectors.
  • mask_time_length (int, optional, defaults to 10) — Length of vector span along the time axis.
  • mask_time_min_masks (int, optional, defaults to 2) — The minimum number of masks of length mask_time_length generated along the time axis, each time step, irrespectively of mask_time_prob. Only relevant if "mask_time_prob * len(time_axis) / mask_time_length < mask_time_min_masks".
  • mask_feature_prob (float, optional, defaults to 0.0) — Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The masking procedure generates "mask_feature_prob * len(feature_axis) / mask_feature_length" independent masks over the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector span to be masked, mask_feature_prob should be "prob_vector_start * mask_feature_length". Note that overlap may decrease the actual percentage of masked vectors. This is only relevant if apply_spec_augment is True.
  • mask_feature_length (int, optional, defaults to 10) — Length of vector span along the feature axis.
  • mask_feature_min_masks (int, optional, defaults to 0) — The minimum number of masks of length mask_feature_length generated along the feature axis, each time step, irrespectively of mask_feature_prob. Only relevant if "mask_feature_prob * len(feature_axis) / mask_feature_length < mask_feature_min_masks".
  • ctc_loss_reduction (str, optional, defaults to "sum") — Specifies the reduction to apply to the output of torch.nn.CTCLoss. Only relevant when training an instance of Data2VecAudioForCTC.
  • ctc_zero_infinity (bool, optional, defaults to False) — Whether to zero infinite losses and the associated gradients of torch.nn.CTCLoss. Infinite losses mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance of Data2VecAudioForCTC.
  • use_weighted_layer_sum (bool, optional, defaults to False) — Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an instance of Data2VecAudioForSequenceClassification.
  • classifier_proj_size (int, optional, defaults to 256) — Dimensionality of the projection before token mean-pooling for classification.
  • tdnn_dim (tuple[int] or list[int], optional, defaults to (512, 512, 512, 512, 1500)) — A tuple of integers defining the number of output channels of each 1D convolutional layer in the TDNN module of the XVector model. The length of tdnn_dim defines the number of TDNN layers.
  • tdnn_kernel (tuple[int] or list[int], optional, defaults to (5, 3, 3, 1, 1)) — A tuple of integers defining the kernel size of each 1D convolutional layer in the TDNN module of the XVector model. The length of tdnn_kernel has to match the length of tdnn_dim.
  • tdnn_dilation (tuple[int] or list[int], optional, defaults to (1, 2, 3, 1, 1)) — A tuple of integers defining the dilation factor of each 1D convolutional layer in the TDNN module of the XVector model. The length of tdnn_dilation has to match the length of tdnn_dim.
  • xvector_output_dim (int, optional, defaults to 512) — Dimensionality of the XVector embedding vectors.
  • add_adapter (bool, optional, defaults to False) — Whether a convolutional network should be stacked on top of the Data2VecAudio Encoder. Can be very useful for warm-starting Data2VecAudio for SpeechEncoderDecoder models.
  • adapter_kernel_size (int, optional, defaults to 3) — Kernel size of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
  • adapter_stride (int, optional, defaults to 2) — Stride of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
  • num_adapter_layers (int, optional, defaults to 3) — Number of convolutional layers that should be used in the adapter network. Only relevant if add_adapter is True.
  • output_hidden_size (int, optional) — Dimensionality of the encoder output layer. If not defined, this defaults to hidden_size. Only relevant if add_adapter is True.

This is the configuration class to store the configuration of a Data2VecAudioModel. It is used to instantiate a Data2VecAudio model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Data2VecAudio facebook/data2vec-audio-base-960h architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import Data2VecAudioConfig, Data2VecAudioModel

>>> # Initializing a Data2VecAudio facebook/data2vec-audio-base-960h style configuration
>>> configuration = Data2VecAudioConfig()

>>> # Initializing a model (with random weights) from the facebook/data2vec-audio-base-960h style configuration
>>> model = Data2VecAudioModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
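The adapter-related arguments described above can be set in the same way. A minimal sketch (the values simply repeat the defaults listed above, with the adapter switched on, e.g. to warm-start a SpeechEncoderDecoder model):

>>> from transformers import Data2VecAudioConfig, Data2VecAudioModel

>>> # Enabling the optional convolutional adapter on top of the encoder
>>> adapter_configuration = Data2VecAudioConfig(add_adapter=True, adapter_kernel_size=3, adapter_stride=2, num_adapter_layers=3)
>>> model = Data2VecAudioModel(adapter_configuration)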

Data2VecVisionConfig

class transformers.Data2VecVisionConfig

(hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout_prob = 0.0, attention_probs_dropout_prob = 0.0, initializer_range = 0.02, layer_norm_eps = 1e-12, image_size = 224, patch_size = 16, num_channels = 3, use_mask_token = False, use_absolute_position_embeddings = False, use_relative_position_bias = False, use_shared_relative_position_bias = False, layer_scale_init_value = 0.1, drop_path_rate = 0.1, use_mean_pooling = True, out_indices = [3, 5, 7, 11], pool_scales = [1, 2, 3, 6], use_auxiliary_head = True, auxiliary_loss_weight = 0.4, auxiliary_channels = 256, auxiliary_num_convs = 1, auxiliary_concat_input = False, semantic_loss_ignore_index = 255, **kwargs)

Parameters

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
  • num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
  • intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
  • hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
  • hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • attention_probs_dropout_prob (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
  • image_size (int, optional, defaults to 224) — The size (resolution) of each image.
  • patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
  • num_channels (int, optional, defaults to 3) — The number of input channels.
  • use_mask_token (bool, optional, defaults to False) — Whether to use a mask token for masked image modeling.
  • use_absolute_position_embeddings (bool, optional, defaults to False) — Whether to use BERT-style absolute position embeddings.
  • use_relative_position_bias (bool, optional, defaults to False) — Whether to use T5-style relative position embeddings in the self-attention layers.
  • use_shared_relative_position_bias (bool, optional, defaults to False) — Whether to use the same relative position embeddings across all self-attention layers of the Transformer.
  • layer_scale_init_value (float, optional, defaults to 0.1) — Scale to use in the self-attention layers. 0.1 for base, 1e-5 for large. Set 0 to disable layer scale.
  • drop_path_rate (float, optional, defaults to 0.1) — Stochastic depth rate per sample (when applied in the main path of residual layers).
  • use_mean_pooling (bool, optional, defaults to True) — Whether to mean pool the final hidden states of the patches instead of using the final hidden state of the CLS token, before applying the classification head.
  • out_indices (list[int], optional, defaults to [3, 5, 7, 11]) — Indices of the feature maps to use for semantic segmentation.
  • pool_scales (tuple[int], optional, defaults to [1, 2, 3, 6]) — Pooling scales used in Pooling Pyramid Module applied on the last feature map.
  • use_auxiliary_head (bool, optional, defaults to True) — Whether to use an auxiliary head during training.
  • auxiliary_loss_weight (float, optional, defaults to 0.4) — Weight of the cross-entropy loss of the auxiliary head.
  • auxiliary_channels (int, optional, defaults to 256) — Number of channels to use in the auxiliary head.
  • auxiliary_num_convs (int, optional, defaults to 1) — Number of convolutional layers to use in the auxiliary head.
  • auxiliary_concat_input (bool, optional, defaults to False) — Whether to concatenate the output of the auxiliary head with the input before the classification layer.
  • semantic_loss_ignore_index (int, optional, defaults to 255) — The index that is ignored by the loss function of the semantic segmentation model.

This is the configuration class to store the configuration of a Data2VecVisionModel. It is used to instantiate a Data2VecVision model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Data2VecVision facebook/data2vec-vision-base architecture.

Example:

>>> from transformers import Data2VecVisionConfig, Data2VecVisionModel

>>> # Initializing a Data2VecVision data2vec_vision-base-patch16-224-in22k style configuration
>>> configuration = Data2VecVisionConfig()

>>> # Initializing a model (with random weights) from the data2vec_vision-base-patch16-224-in22k style configuration
>>> model = Data2VecVisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
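Image-related arguments such as image_size and patch_size can be overridden in the same way. A minimal sketch (the 384 resolution is only an illustrative value):

>>> from transformers import Data2VecVisionConfig, Data2VecVisionModel

>>> # Configuration for higher-resolution inputs
>>> configuration = Data2VecVisionConfig(image_size=384, patch_size=16)
>>> model = Data2VecVisionModel(configuration)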

Data2VecAudioModel

class transformers.Data2VecAudioModel

(config: Data2VecAudioConfig)

Parameters

  • config (Data2VecAudioConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Data2Vec Audio Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, mask_time_indices: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.Wav2Vec2BaseModelOutput or tuple(torch.FloatTensor)

Parameters

  • input_values (torch.Tensor of shape (batch_size, sequence_length), optional) — Float values of input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See processor_class.__call__ for details.
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • mask_time_indices (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Indices to mask extracted features for contrastive loss. When in training mode, the model learns to predict masked extracted features in config.proj_codevector_dim space.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

A transformers.modeling_outputs.Wav2Vec2BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecAudioConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) — Sequence of extracted feature vectors of the last convolutional layer of the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Data2VecAudioModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
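Example (a minimal sketch in the style of the task-specific examples below, using the same checkpoint and demo dataset):

>>> from transformers import AutoFeatureExtractor, Data2VecAudioModel
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base-960h")
>>> model = Data2VecAudioModel.from_pretrained("facebook/data2vec-audio-base-960h")

>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # (batch_size, sequence_length, hidden_size)
>>> outputs.last_hidden_state.shape
...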

Data2VecAudioForAudioFrameClassification

class transformers.Data2VecAudioForAudioFrameClassification

(config)

Parameters

  • config (Data2VecAudioConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Data2Vec Audio Model with a frame classification head on top for tasks like Speaker Diarization.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

Parameters

  • input_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Float values of input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See Data2VecAudioProcessor.__call__ for details.
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), if config.num_labels > 1 a classification loss is computed (Cross-Entropy).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecAudioConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Data2VecAudioForAudioFrameClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoFeatureExtractor, Data2VecAudioForAudioFrameClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base-960h")
>>> model = Data2VecAudioForAudioFrameClassification.from_pretrained("facebook/data2vec-audio-base-960h")

>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt", sampling_rate=sampling_rate)
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> probabilities = torch.sigmoid(logits[0])
>>> # labels is a one-hot array of shape (num_frames, num_speakers)
>>> labels = (probabilities > 0.5).long()
>>> labels[0].tolist()
...

Data2VecAudioForCTC

class transformers.Data2VecAudioForCTC

(config)

Parameters

  • config (Data2VecAudioConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Data2VecAudio Model with a language modeling head on top for Connectionist Temporal Classification (CTC).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None) → transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)

Parameters

  • input_values (torch.Tensor of shape (batch_size, sequence_length), optional) — Float values of input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See processor_class.__call__ for details.
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
  • labels (torch.LongTensor of shape (batch_size, target_length), optional) — Labels for connectionist temporal classification. Note that target_length has to be smaller or equal to the sequence length of the output logits. Indices are selected in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size - 1].

Returns

transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecAudioConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Data2VecAudioForCTC forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, Data2VecAudioForCTC
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")
>>> model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-base-960h")

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)

>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
...

>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids

>>> # compute loss
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
...

Data2VecAudioForSequenceClassification

class transformers.Data2VecAudioForSequenceClassification

(config)

Parameters

  • config (Data2VecAudioConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Data2VecAudio Model with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

  • input_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Float values of input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See Data2VecAudioProcessor.__call__ for details.
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
  • labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecAudioConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.

  • logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Data2VecAudioForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example of single-label classification:

>>> import torch
>>> from transformers import AutoFeatureExtractor, Data2VecAudioForSequenceClassification
>>> from datasets import load_dataset

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base-960h")
>>> model = Data2VecAudioForSequenceClassification.from_pretrained("facebook/data2vec-audio-base-960h")

>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
...

>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = Data2VecAudioForSequenceClassification.from_pretrained("facebook/data2vec-audio-base-960h", num_labels=num_labels)
>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
...

Example of multi-label classification:

>>> import torch
>>> from transformers import AutoFeatureExtractor, Data2VecAudioForSequenceClassification
>>> from datasets import load_dataset

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base-960h")
>>> model = Data2VecAudioForSequenceClassification.from_pretrained(
...     "facebook/data2vec-audio-base-960h", problem_type="multi_label_classification"
... )

>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]

>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = Data2VecAudioForSequenceClassification.from_pretrained(
...     "facebook/data2vec-audio-base-960h", num_labels=num_labels, problem_type="multi_label_classification"
... )
>>> labels = torch.sum(
...     torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss

Data2VecAudioForXVector

class transformers.Data2VecAudioForXVector

(config)

Parameters

  • config (Data2VecAudioConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Data2VecAudio Model with an XVector feature extraction head on top for tasks like Speaker Verification.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None) → transformers.modeling_outputs.XVectorOutput or tuple(torch.FloatTensor)

Parameters

  • input_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Float values of input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See Data2VecAudioProcessor.__call__ for details.
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
  • labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

transformers.modeling_outputs.XVectorOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.XVectorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecAudioConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.

  • logits (torch.FloatTensor of shape (batch_size, config.xvector_output_dim)) — Classification hidden states before AMSoftmax.

  • embeddings (torch.FloatTensor of shape (batch_size, config.xvector_output_dim)) — Utterance embeddings used for vector similarity-based retrieval.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Data2VecAudioForXVector forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoFeatureExtractor, Data2VecAudioForXVector
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base-960h")
>>> model = Data2VecAudioForXVector.from_pretrained("facebook/data2vec-audio-base-960h")

>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(
...     [d["array"] for d in dataset[:2]["audio"]], sampling_rate=sampling_rate, return_tensors="pt", padding=True
... )
>>> with torch.no_grad():
...     embeddings = model(**inputs).embeddings

>>> embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()

>>> # the resulting embeddings can be used for cosine similarity-based retrieval
>>> cosine_sim = torch.nn.CosineSimilarity(dim=-1)
>>> similarity = cosine_sim(embeddings[0], embeddings[1])
>>> threshold = 0.7  # the optimal threshold is dataset-dependent
>>> if similarity < threshold:
...     print("Speakers are not the same!")
>>> round(similarity.item(), 2)
...

Data2VecTextModel

class transformers.Data2VecTextModel

(config, add_pooling_layer = True)

Parameters

  • config (Data2VecTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • add_pooling_layer (bool, optional, defaults to True) — Whether to add a pooling layer.

The bare Data2Vec Text Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, token_type_ids: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, inputs_embeds: typing.Optional[torch.Tensor] = None, encoder_hidden_states: typing.Optional[torch.Tensor] = None, encoder_attention_mask: typing.Optional[torch.Tensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.Tensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

    • 0 corresponds to a sentence A token,
    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

    What are position IDs?

  • inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
  • encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
  • encoder_attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.
  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • cache_position (torch.Tensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

Returns

transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecTextConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape(batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processingthrough the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returnsthe classification token after processing through a linear layer and a tanh activation function. The linearlayer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

  • cross_attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True andconfig.add_cross_attention=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute theweighted average in the cross-attention heads.

  • past_key_values (Cache,optional, returned whenuse_cache=True is passed or whenconfig.use_cache=True) — It is aCache instance. For more details, see ourkv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally ifconfig.is_encoder_decoder=True in the cross-attention blocks) that can be used (seepast_key_valuesinput) to speed up sequential decoding.

The Data2VecTextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
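
Example (a minimal feature-extraction sketch; the checkpoint name simply follows the other examples on this page):

>>> from transformers import AutoTokenizer, Data2VecTextModel
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> model = Data2VecTextModel.from_pretrained("facebook/data2vec-text-base")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # (batch_size, sequence_length, hidden_size) contextualized token embeddings
>>> last_hidden_state = outputs.last_hidden_state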

Data2VecTextForCausalLM

classtransformers.Data2VecTextForCausalLM

<source>

(config)

Parameters

  • config (Data2VecTextForCausalLM) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

Data2VecText Model with a language modeling head on top for CLM fine-tuning.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[torch.FloatTensor] = None, encoder_attention_mask: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None, use_cache: typing.Optional[bool] = None, cache_position: typing.Optional[torch.Tensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.

    What are attention masks?

  • token_type_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]:

    • 0 corresponds to asentence A token,
    • 1 corresponds to asentence B token.

    What are token type IDs?

  • position_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].

    What are position IDs?

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) —Optionally, instead of passinginput_ids you can choose to directly pass an embedded representation. Thisis useful if you want more control over how to convertinput_ids indices into associated vectors than themodel’s internal embedding lookup matrix.
  • encoder_hidden_states (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) —Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attentionif the model is configured as a decoder.
  • encoder_attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used inthe cross-attention if the model is configured as a decoder. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.
  • labels (torch.LongTensor of shape(batch_size, sequence_length),optional) —Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in[-100, 0, ..., config.vocab_size] (seeinput_ids docstring) Tokens with indices set to-100 areignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size]
  • past_key_values (tuple[tuple[torch.FloatTensor]],optional) —Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attentionblocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=True orconfig.use_cache=True.

    OnlyCache instance is allowed as input, see ourkv cache guide.If nopast_key_values are passed,DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    Ifpast_key_values are used, the user is expected to input only unprocessedinput_ids (those that don’thave their past key value states given to this model) of shape(batch_size, unprocessed_length) instead of allinput_idsof shape(batch_size, sequence_length).

  • use_cache (bool,optional) —If set toTrue,past_key_values key value states are returned and can be used to speed up decoding (seepast_key_values).
  • cache_position (torch.Tensor of shape(sequence_length),optional) —Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids,this tensor is not affected by padding. It is used to update the cache in the correct position and to inferthe complete sequence length.
  • logits_to_keep (Union[int, torch.Tensor], defaults to0) —If anint, compute logits for the lastlogits_to_keep tokens. If0, calculate logits for allinput_ids (special case). Only last token logits are needed for generation, and calculating them only for thattoken can save memory, which becomes pretty significant for long sequences or large vocabulary size.If atorch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension.This is useful when using packed tensor format (single dimension for batch and sequence length).

Returns

transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecTextConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape(batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

  • cross_attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Cross attentions weights after the attention softmax, used to compute the weighted average in thecross-attention heads.

  • past_key_values (Cache,optional, returned whenuse_cache=True is passed or whenconfig.use_cache=True) — It is aCache instance. For more details, see ourkv cache guide.

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (seepast_key_values input) to speed up sequential decoding.

The Data2VecTextForCausalLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Data2VecTextForCausalLM, Data2VecTextConfig
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> config = Data2VecTextConfig.from_pretrained("facebook/data2vec-text-base")
>>> config.is_decoder = True
>>> model = Data2VecTextForCausalLM.from_pretrained("facebook/data2vec-text-base", config=config)

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> prediction_logits = outputs.logits
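
Because past_key_values and use_cache mainly matter during autoregressive decoding, here is a short, hedged sketch of generation with the decoder-configured model from the example above (generate() manages the cache internally; the prompt and length are arbitrary):

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, use_cache=True)
>>> generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)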

Data2VecTextForMaskedLM

classtransformers.Data2VecTextForMaskedLM

<source>

(config)

Parameters

  • config (Data2VecTextForMaskedLM) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

The Data2Vec Text Model with a language modeling head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[torch.FloatTensor] = None, encoder_attention_mask: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.

    What are attention masks?

  • token_type_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]:

    • 0 corresponds to asentence A token,
    • 1 corresponds to asentence B token.

    What are token type IDs?

  • position_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].

    What are position IDs?

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) —Optionally, instead of passinginput_ids you can choose to directly pass an embedded representation. Thisis useful if you want more control over how to convertinput_ids indices into associated vectors than themodel’s internal embedding lookup matrix.
  • encoder_hidden_states (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) —Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attentionif the model is configured as a decoder.
  • encoder_attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used inthe cross-attention if the model is configured as a decoder. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.
  • labels (torch.LongTensor of shape(batch_size, sequence_length),optional) —Labels for computing the masked language modeling loss. Indices should be in[-100, 0, ..., config.vocab_size] (seeinput_ids docstring) Tokens with indices set to-100 are ignored (masked), theloss is only computed for the tokens with labels in[0, ..., config.vocab_size]

Returns

transformers.modeling_outputs.MaskedLMOutput ortuple(torch.FloatTensor)

Atransformers.modeling_outputs.MaskedLMOutput or a tuple oftorch.FloatTensor (ifreturn_dict=False is passed or whenconfig.return_dict=False) comprising variouselements depending on the configuration (Data2VecTextConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Masked language modeling (MLM) loss.

  • logits (torch.FloatTensor of shape(batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecTextForMaskedLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Data2VecTextForMaskedLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> model = Data2VecTextForMaskedLM.from_pretrained("facebook/data2vec-text-base")

>>> inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # retrieve index of <mask>
>>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

>>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
>>> tokenizer.decode(predicted_token_id)
...

>>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
>>> # mask labels of non-<mask> tokens
>>> labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)

>>> outputs = model(**inputs, labels=labels)
>>> round(outputs.loss.item(), 2)
...

Data2VecTextForSequenceClassification

classtransformers.Data2VecTextForSequenceClassification

<source>

(config)

Parameters

  • config (Data2VecTextForSequenceClassification) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

Data2VecText Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.

    What are attention masks?

  • token_type_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]:

    • 0 corresponds to asentence A token,
    • 1 corresponds to asentence B token.

    What are token type IDs?

  • position_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].

    What are position IDs?

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) —Optionally, instead of passinginput_ids you can choose to directly pass an embedded representation. Thisis useful if you want more control over how to convertinput_ids indices into associated vectors than themodel’s internal embedding lookup matrix.
  • labels (torch.LongTensor of shape(batch_size,),optional) —Labels for computing the sequence classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]. Ifconfig.num_labels == 1 a regression loss is computed (Mean-Square loss), Ifconfig.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecTextConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Classification (or regression if config.num_labels==1) loss.

  • logits (torch.FloatTensor of shape(batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecTextForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example of single-label classification:

>>> import torch
>>> from transformers import AutoTokenizer, Data2VecTextForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> model = Data2VecTextForSequenceClassification.from_pretrained("facebook/data2vec-text-base")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
...

>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = Data2VecTextForSequenceClassification.from_pretrained("facebook/data2vec-text-base", num_labels=num_labels)

>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
...

Example of multi-label classification:

>>> import torch
>>> from transformers import AutoTokenizer, Data2VecTextForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> model = Data2VecTextForSequenceClassification.from_pretrained("facebook/data2vec-text-base", problem_type="multi_label_classification")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]

>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = Data2VecTextForSequenceClassification.from_pretrained(
...     "facebook/data2vec-text-base", num_labels=num_labels, problem_type="multi_label_classification"
... )

>>> labels = torch.sum(
...     torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss

Data2VecTextForMultipleChoice

classtransformers.Data2VecTextForMultipleChoice

<source>

(config)

Parameters

  • config (Data2VecTextForMultipleChoice) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

The Data2Vec Text Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax), e.g. for RocStories/SWAG tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(input_ids: typing.Optional[torch.LongTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape(batch_size, num_choices, sequence_length)) —Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • token_type_ids (torch.LongTensor of shape(batch_size, num_choices, sequence_length),optional) —Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]:

    • 0 corresponds to asentence A token,
    • 1 corresponds to asentence B token.

    What are token type IDs?

  • attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.

    What are attention masks?

  • labels (torch.LongTensor of shape(batch_size,),optional) —Labels for computing the multiple choice classification loss. Indices should be in[0, ..., num_choices-1] wherenum_choices is the size of the second dimension of the input tensors. (Seeinput_ids above)
  • position_ids (torch.LongTensor of shape(batch_size, num_choices, sequence_length),optional) —Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].

    What are position IDs?

  • inputs_embeds (torch.FloatTensor of shape(batch_size, num_choices, sequence_length, hidden_size),optional) —Optionally, instead of passinginput_ids you can choose to directly pass an embedded representation. Thisis useful if you want more control over how to convertinput_ids indices into associated vectors than themodel’s internal embedding lookup matrix.

Returns

transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.MultipleChoiceModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecTextConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Classification loss.

  • logits (torch.FloatTensor of shape(batch_size, num_choices)) —num_choices is the second dimension of the input tensors. (seeinput_ids above).

    Classification scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecTextForMultipleChoice forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Data2VecTextForMultipleChoice
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> model = Data2VecTextForMultipleChoice.from_pretrained("facebook/data2vec-text-base")

>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> choice0 = "It is eaten with a fork and a knife."
>>> choice1 = "It is eaten while held in the hand."
>>> labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

>>> encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)  # batch size is 1

>>> # the linear classifier still needs to be trained
>>> loss = outputs.loss
>>> logits = outputs.logits

Data2VecTextForTokenClassification

classtransformers.Data2VecTextForTokenClassification

<source>

(config)

Parameters

  • config (Data2VecTextForTokenClassification) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

The Data2Vec Text transformer with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.

    What are attention masks?

  • token_type_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]:

    • 0 corresponds to asentence A token,
    • 1 corresponds to asentence B token.

    What are token type IDs?

  • position_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].

    What are position IDs?

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) —Optionally, instead of passinginput_ids you can choose to directly pass an embedded representation. Thisis useful if you want more control over how to convertinput_ids indices into associated vectors than themodel’s internal embedding lookup matrix.
  • labels (torch.LongTensor of shape(batch_size, sequence_length),optional) —Labels for computing the token classification loss. Indices should be in[0, ..., config.num_labels - 1].

Returns

transformers.modeling_outputs.TokenClassifierOutput ortuple(torch.FloatTensor)

Atransformers.modeling_outputs.TokenClassifierOutput or a tuple oftorch.FloatTensor (ifreturn_dict=False is passed or whenconfig.return_dict=False) comprising variouselements depending on the configuration (Data2VecTextConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Classification loss.

  • logits (torch.FloatTensor of shape(batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecTextForTokenClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Data2VecTextForTokenClassification
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> model = Data2VecTextForTokenClassification.from_pretrained("facebook/data2vec-text-base")

>>> inputs = tokenizer(
...     "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
... )

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_token_class_ids = logits.argmax(-1)

>>> # Note that tokens are classified rather than input words, which means that
>>> # there might be more predicted token classes than words.
>>> # Multiple token classes might account for the same word
>>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
>>> predicted_tokens_classes
...

>>> labels = predicted_token_class_ids
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
...

Data2VecTextForQuestionAnswering

classtransformers.Data2VecTextForQuestionAnswering

<source>

(config)

Parameters

  • config (Data2VecTextForQuestionAnswering) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

The Data2Vec Text transformer with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, start_positions: typing.Optional[torch.LongTensor] = None, end_positions: typing.Optional[torch.LongTensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.

    What are attention masks?

  • token_type_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]:

    • 0 corresponds to asentence A token,
    • 1 corresponds to asentence B token.

    What are token type IDs?

  • position_ids (torch.LongTensor of shape(batch_size, sequence_length),optional) —Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].

    What are position IDs?

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) —Optionally, instead of passinginput_ids you can choose to directly pass an embedded representation. Thisis useful if you want more control over how to convertinput_ids indices into associated vectors than themodel’s internal embedding lookup matrix.
  • start_positions (torch.LongTensor of shape(batch_size,),optional) —Labels for position (index) of the start of the labelled span for computing the token classification loss.Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequenceare not taken into account for computing the loss.
  • end_positions (torch.LongTensor of shape(batch_size,),optional) —Labels for position (index) of the end of the labelled span for computing the token classification loss.Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequenceare not taken into account for computing the loss.

Returns

transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecTextConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

  • start_logits (torch.FloatTensor of shape(batch_size, sequence_length)) — Span-start scores (before SoftMax).

  • end_logits (torch.FloatTensor of shape(batch_size, sequence_length)) — Span-end scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecTextForQuestionAnswering forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, Data2VecTextForQuestionAnswering
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
>>> model = Data2VecTextForQuestionAnswering.from_pretrained("facebook/data2vec-text-base")

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)
...

>>> # target is "nice puppet"
>>> target_start_index = torch.tensor([14])
>>> target_end_index = torch.tensor([15])

>>> outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)
>>> loss = outputs.loss
>>> round(loss.item(), 2)
...

Data2VecVisionModel

classtransformers.Data2VecVisionModel

<source>

(config: Data2VecVisionConfig, add_pooling_layer: bool = False)

Parameters

  • config (Data2VecVisionConfig) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.
  • add_pooling_layer (bool,optional, defaults toFalse) —Whether to add a pooling layer

The bare Data2Vec Vision Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(pixel_values: Tensor, bool_masked_pos: typing.Optional[torch.BoolTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, interpolate_pos_encoding: bool = False, return_dict: typing.Optional[bool] = None) → transformers.models.data2vec.modeling_data2vec_vision.Data2VecVisionModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.Tensor of shape(batch_size, num_channels, image_size, image_size)) —The tensors corresponding to the input images. Pixel values can be obtained usingBeitImageProcessor. SeeBeitImageProcessor.__call__() for details (processor_class usesBeitImageProcessor for processing images).
  • bool_masked_pos (torch.BoolTensor of shape(batch_size, num_patches),optional) —Boolean masked positions. Indicates which patches are masked (1) and which aren’t (0).
  • output_attentions (bool,optional) —Whether or not to return the attentions tensors of all attention layers. Seeattentions under returnedtensors for more detail.
  • output_hidden_states (bool,optional) —Whether or not to return the hidden states of all layers. Seehidden_states under returned tensors formore detail.
  • interpolate_pos_encoding (bool, defaults toFalse) —Whether to interpolate the pre-trained position encodings.
  • return_dict (bool,optional) —Whether or not to return aModelOutput instead of a plain tuple.

Returns

transformers.models.data2vec.modeling_data2vec_vision.Data2VecVisionModelOutputWithPooling ortuple(torch.FloatTensor)

Atransformers.models.data2vec.modeling_data2vec_vision.Data2VecVisionModelOutputWithPooling or a tuple oftorch.FloatTensor (ifreturn_dict=False is passed or whenconfig.return_dict=False) comprising variouselements depending on the configuration (Data2VecVisionConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size),optional) — Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape(batch_size, hidden_size)) — Average of the last layer hidden states of the patch tokens (excluding the[CLS] token) ifconfig.use_mean_pooling is set to True. If set to False, then the final hidden state of the[CLS] tokenwill be returned.

  • hidden_states (tuple[torch.FloatTensor, ...],optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple[torch.FloatTensor, ...],optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecVisionModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:
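
A minimal feature-extraction sketch (assuming the facebook/data2vec-vision-base checkpoint used by the other vision examples on this page):

>>> from transformers import AutoImageProcessor, Data2VecVisionModel
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")
>>> model = Data2VecVisionModel.from_pretrained("facebook/data2vec-vision-base")

>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # (batch_size, num_patches + 1, hidden_size); the first token is the [CLS] token
>>> last_hidden_state = outputs.last_hidden_state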

Data2VecVisionForImageClassification

classtransformers.Data2VecVisionForImageClassification

<source>

(config: Data2VecVisionConfig)

Parameters

  • config (Data2VecVisionConfig) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

Data2VecVision Model transformer with an image classification head on top (a linear layer on top of the average of the final hidden states of the patch tokens), e.g. for ImageNet.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(pixel_values: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, interpolate_pos_encoding: bool = False, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.Tensor of shape(batch_size, num_channels, image_size, image_size),optional) —The tensors corresponding to the input images. Pixel values can be obtained usingBeitImageProcessor. SeeBeitImageProcessor.__call__() for details (processor_class usesBeitImageProcessor for processing images).
  • labels (torch.LongTensor of shape(batch_size,),optional) —Labels for computing the image classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]. Ifconfig.num_labels == 1 a regression loss is computed (Mean-Square loss), Ifconfig.num_labels > 1 a classification loss is computed (Cross-Entropy).
  • output_attentions (bool,optional) —Whether or not to return the attentions tensors of all attention layers. Seeattentions under returnedtensors for more detail.
  • output_hidden_states (bool,optional) —Whether or not to return the hidden states of all layers. Seehidden_states under returned tensors formore detail.
  • interpolate_pos_encoding (bool, defaults toFalse) —Whether to interpolate the pre-trained position encodings.
  • return_dict (bool,optional) —Whether or not to return aModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.ImageClassifierOutput ortuple(torch.FloatTensor)

Atransformers.modeling_outputs.ImageClassifierOutput or a tuple oftorch.FloatTensor (ifreturn_dict=False is passed or whenconfig.return_dict=False) comprising variouselements depending on the configuration (Data2VecVisionConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Classification (or regression if config.num_labels==1) loss.

  • logits (torch.FloatTensor of shape(batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each stage) of shape(batch_size, sequence_length, hidden_size). Hidden-states(also called feature maps) of the model at the output of each stage.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, patch_size, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecVisionForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoImageProcessor, Data2VecVisionForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")
>>> model = Data2VecVisionForImageClassification.from_pretrained("facebook/data2vec-vision-base")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
...
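
If the images are preprocessed at a resolution different from the one the checkpoint was pre-trained at, the interpolate_pos_encoding argument documented above can be passed directly to the forward call. A hedged sketch (the 480-pixel size is only an illustration):

>>> inputs = image_processor(image, return_tensors="pt", size={"height": 480, "width": 480})
>>> with torch.no_grad():
...     logits = model(**inputs, interpolate_pos_encoding=True).logits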

Data2VecVisionForSemanticSegmentation

classtransformers.Data2VecVisionForSemanticSegmentation

<source>

(config: Data2VecVisionConfig)

Parameters

  • config (Data2VecVisionConfig) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

The Data2Vec Vision Model with a semantic segmentation head on top, e.g. for ADE20K or CityScapes.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

(pixel_values: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, interpolate_pos_encoding: bool = False, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.SemanticSegmenterOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.Tensor of shape(batch_size, num_channels, image_size, image_size),optional) —The tensors corresponding to the input images. Pixel values can be obtained usingBeitImageProcessor. SeeBeitImageProcessor.__call__() for details (processor_class usesBeitImageProcessor for processing images).
  • labels (torch.LongTensor of shape(batch_size, height, width),optional) —Ground truth semantic segmentation maps for computing the loss. Indices should be in[0, ..., config.num_labels - 1]. Ifconfig.num_labels > 1, a classification loss is computed (Cross-Entropy).
  • output_attentions (bool,optional) —Whether or not to return the attentions tensors of all attention layers. Seeattentions under returnedtensors for more detail.
  • output_hidden_states (bool,optional) —Whether or not to return the hidden states of all layers. Seehidden_states under returned tensors formore detail.
  • interpolate_pos_encoding (bool, defaults toFalse) —Whether to interpolate the pre-trained position encodings.
  • return_dict (bool,optional) —Whether or not to return aModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.SemanticSegmenterOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.SemanticSegmenterOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Data2VecVisionConfig) and inputs.

  • loss (torch.FloatTensor of shape(1,),optional, returned whenlabels is provided) — Classification (or regression if config.num_labels==1) loss.

  • logits (torch.FloatTensor of shape(batch_size, config.num_labels, logits_height, logits_width)) — Classification scores for each pixel.

    The logits returned do not necessarily have the same size as thepixel_values passed as inputs. This isto avoid doing two interpolations and lose some quality when a user needs to resize the logits to theoriginal image size as post-processing. You should always check your logits shape and resize as needed.

  • hidden_states (tuple(torch.FloatTensor),optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, patch_size, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor),optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, patch_size, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The Data2VecVisionForSemanticSegmentation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from transformers import AutoImageProcessor, Data2VecVisionForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")
>>> model = Data2VecVisionForSemanticSegmentation.from_pretrained("facebook/data2vec-vision-base")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # logits are of shape (batch_size, num_labels, height, width)
>>> logits = outputs.logits
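
As noted in the returns section, the logits may not match the original image size; a hedged post-processing sketch that upsamples them and takes a per-pixel argmax (the bilinear interpolation settings are illustrative, not the only option):

>>> import torch

>>> # image.size is (width, height) for a PIL image, so reverse it to get (height, width)
>>> upsampled_logits = torch.nn.functional.interpolate(
...     logits, size=image.size[::-1], mode="bilinear", align_corners=False
... )
>>> # (height, width) map of predicted class indices per pixel
>>> segmentation_map = upsampled_logits.argmax(dim=1)[0]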