Transformers documentation

VITS

This model was released on 2021-06-11 and added to Hugging Face Transformers on 2023-09-01.

PyTorch

VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that simplifies the traditional two-stage text-to-speech (TTS) pipeline. It directly synthesizes speech from text using variational inference, adversarial learning, and normalizing flows to produce natural and expressive speech with diverse rhythms and intonations.

You can find all the original VITS checkpoints under the AI at Meta organization.

Click on the VITS models in the right sidebar for more examples of how to apply VITS.

The example below demonstrates how to generate speech from text with the Pipeline or AutoModel class.

Pipeline
AutoModel
import torch
from transformers import pipeline, set_seed
from scipy.io.wavfile import write

set_seed(555)

pipe = pipeline(
    task="text-to-speech",
    model="facebook/mms-tts-eng",
    dtype=torch.float16,
    device=0,
)
speech = pipe("Hello, my dog is cute")

# Extract audio data and sampling rate
audio_data = speech["audio"]
sampling_rate = speech["sampling_rate"]

# Save as WAV file
write("hello.wav", sampling_rate, audio_data.squeeze())
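
For more control over generation, the same checkpoint can be driven through the model classes directly. The snippet below is a minimal sketch, assuming the facebook/mms-tts-eng checkpoint used above and scipy for writing the WAV file; AutoTokenizer and AutoModel resolve to VitsTokenizer and VitsModel for this model type.

import torch
from scipy.io.wavfile import write
from transformers import AutoTokenizer, AutoModel, set_seed

set_seed(555)  # VITS samples speech stochastically, so fix the seed for reproducibility

tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
model = AutoModel.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.waveform has shape (batch_size, num_samples)
waveform = outputs.waveform[0].numpy()
write("hello.wav", model.config.sampling_rate, waveform)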

Notes

  • Set a seed for reproducibility because VITS synthesizes speech non-deterministically.

  • For languages with non-Roman alphabets (Korean, Arabic, etc.), install the uroman package to preprocess the text inputs into the Roman alphabet. You can check whether the tokenizer requires uroman as shown below.

    # pip install -U uroman
    from transformers import VitsTokenizer

    tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
    print(tokenizer.is_uroman)

    If your language requires uroman and you're using Python >= 3.10, the tokenizer automatically applies it to the text inputs and no additional preprocessing is needed. For Python < 3.10, follow the steps below.

    git clone https://github.com/isi-nlp/uroman.git
    cd uroman
    export UROMAN=$(pwd)

    Create a function to preprocess the inputs. You can either use the bash variable UROMAN or pass the directory path directly to the function.

    import torch
    from transformers import VitsTokenizer, VitsModel, set_seed
    import os
    import subprocess

    tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor")
    model = VitsModel.from_pretrained("facebook/mms-tts-kor")

    def uromanize(input_string, uroman_path):
        """Convert non-Roman strings to Roman using the `uroman` perl package."""
        script_path = os.path.join(uroman_path, "bin", "uroman.pl")
        command = ["perl", script_path]
        process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # Execute the perl command
        stdout, stderr = process.communicate(input=input_string.encode())
        if process.returncode != 0:
            raise ValueError(f"Error {process.returncode}: {stderr.decode()}")
        # Return the output as a string and skip the new-line character at the end
        return stdout.decode()[:-1]

    text = "이봐 무슨 일이야"
    uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])

    inputs = tokenizer(text=uromanized_text, return_tensors="pt")

    set_seed(555)  # make deterministic
    with torch.no_grad():
        outputs = model(inputs["input_ids"])

    waveform = outputs.waveform[0]
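
    To listen to the result, the waveform can be written to a WAV file at the model's sampling rate. A minimal sketch, assuming scipy is available as in the pipeline example above (the output filename is illustrative):

    import numpy as np
    from scipy.io.wavfile import write

    # `waveform` is a 1D float tensor; MMS-TTS checkpoints use a 16 kHz sampling rate by default
    write("mms_tts_kor.wav", model.config.sampling_rate, waveform.numpy().astype(np.float32))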

VitsConfig

classtransformers.VitsConfig

<source>

( vocab_size = 38, hidden_size = 192, num_hidden_layers = 6, num_attention_heads = 2, window_size = 4, use_bias = True, ffn_dim = 768, layerdrop = 0.1, ffn_kernel_size = 3, flow_size = 192, spectrogram_bins = 513, hidden_act = 'relu', hidden_dropout = 0.1, attention_dropout = 0.1, activation_dropout = 0.1, initializer_range = 0.02, layer_norm_eps = 1e-05, use_stochastic_duration_prediction = True, num_speakers = 1, speaker_embedding_size = 0, upsample_initial_channel = 512, upsample_rates = [8, 8, 2, 2], upsample_kernel_sizes = [16, 16, 4, 4], resblock_kernel_sizes = [3, 7, 11], resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], leaky_relu_slope = 0.1, depth_separable_channels = 2, depth_separable_num_layers = 3, duration_predictor_flow_bins = 10, duration_predictor_tail_bound = 5.0, duration_predictor_kernel_size = 3, duration_predictor_dropout = 0.5, duration_predictor_num_flows = 4, duration_predictor_filter_channels = 256, prior_encoder_num_flows = 4, prior_encoder_num_wavenet_layers = 4, posterior_encoder_num_wavenet_layers = 16, wavenet_kernel_size = 5, wavenet_dilation_rate = 1, wavenet_dropout = 0.0, speaking_rate = 1.0, noise_scale = 0.667, noise_scale_duration = 0.8, sampling_rate = 16000, **kwargs )

Parameters

  • vocab_size (int,optional, defaults to 38) —Vocabulary size of the VITS model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method of VitsModel.
  • hidden_size (int,optional, defaults to 192) —Dimensionality of the text encoder layers.
  • num_hidden_layers (int,optional, defaults to 6) —Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int,optional, defaults to 2) —Number of attention heads for each attention layer in the Transformer encoder.
  • window_size (int,optional, defaults to 4) —Window size for the relative positional embeddings in the attention layers of the Transformer encoder.
  • use_bias (bool,optional, defaults toTrue) —Whether to use bias in the key, query, value projection layers in the Transformer encoder.
  • ffn_dim (int,optional, defaults to 768) —Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
  • layerdrop (float,optional, defaults to 0.1) —The LayerDrop probability for the encoder. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
  • ffn_kernel_size (int,optional, defaults to 3) —Kernel size of the 1D convolution layers used by the feed-forward network in the Transformer encoder.
  • flow_size (int,optional, defaults to 192) —Dimensionality of the flow layers.
  • spectrogram_bins (int,optional, defaults to 513) —Number of frequency bins in the target spectrogram.
  • hidden_act (str orfunction,optional, defaults to"relu") —The non-linear activation function (function or string) in the encoder and pooler. If string,"gelu","relu","selu" and"gelu_new" are supported.
  • hidden_dropout (float,optional, defaults to 0.1) —The dropout probability for all fully connected layers in the embeddings and encoder.
  • attention_dropout (float,optional, defaults to 0.1) —The dropout ratio for the attention probabilities.
  • activation_dropout (float,optional, defaults to 0.1) —The dropout ratio for activations inside the fully connected layer.
  • initializer_range (float,optional, defaults to 0.02) —The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float,optional, defaults to 1e-05) —The epsilon used by the layer normalization layers.
  • use_stochastic_duration_prediction (bool,optional, defaults toTrue) —Whether to use the stochastic duration prediction module or the regular duration predictor.
  • num_speakers (int,optional, defaults to 1) —Number of speakers if this is a multi-speaker model.
  • speaker_embedding_size (int,optional, defaults to 0) —Number of channels used by the speaker embeddings. Is zero for single-speaker models.
  • upsample_initial_channel (int,optional, defaults to 512) —The number of input channels into the HiFi-GAN upsampling network.
  • upsample_rates (tuple[int] orlist[int],optional, defaults to[8, 8, 2, 2]) —A tuple of integers defining the stride of each 1D convolutional layer in the HiFi-GAN upsampling network.The length ofupsample_rates defines the number of convolutional layers and has to match the length ofupsample_kernel_sizes.
  • upsample_kernel_sizes (tuple[int] orlist[int],optional, defaults to[16, 16, 4, 4]) —A tuple of integers defining the kernel size of each 1D convolutional layer in the HiFi-GAN upsamplingnetwork. The length ofupsample_kernel_sizes defines the number of convolutional layers and has to matchthe length ofupsample_rates.
  • resblock_kernel_sizes (tuple[int] orlist[int],optional, defaults to[3, 7, 11]) —A tuple of integers defining the kernel sizes of the 1D convolutional layers in the HiFi-GANmulti-receptive field fusion (MRF) module.
  • resblock_dilation_sizes (tuple[tuple[int]] orlist[list[int]],optional, defaults to[[1, 3, 5], [1, 3, 5], [1, 3, 5]]) —A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in theHiFi-GAN multi-receptive field fusion (MRF) module.
  • leaky_relu_slope (float,optional, defaults to 0.1) —The angle of the negative slope used by the leaky ReLU activation.
  • depth_separable_channels (int,optional, defaults to 2) —Number of channels to use in each depth-separable block.
  • depth_separable_num_layers (int,optional, defaults to 3) —Number of convolutional layers to use in each depth-separable block.
  • duration_predictor_flow_bins (int,optional, defaults to 10) —Number of channels to map using the unconstrained rational spline in the duration predictor model.
  • duration_predictor_tail_bound (float,optional, defaults to 5.0) —Value of the tail bin boundary when computing the unconstrained rational spline in the duration predictormodel.
  • duration_predictor_kernel_size (int,optional, defaults to 3) —Kernel size of the 1D convolution layers used in the duration predictor model.
  • duration_predictor_dropout (float,optional, defaults to 0.5) —The dropout ratio for the duration predictor model.
  • duration_predictor_num_flows (int,optional, defaults to 4) —Number of flow stages used by the duration predictor model.
  • duration_predictor_filter_channels (int,optional, defaults to 256) —Number of channels for the convolution layers used in the duration predictor model.
  • prior_encoder_num_flows (int,optional, defaults to 4) —Number of flow stages used by the prior encoder flow model.
  • prior_encoder_num_wavenet_layers (int,optional, defaults to 4) —Number of WaveNet layers used by the prior encoder flow model.
  • posterior_encoder_num_wavenet_layers (int,optional, defaults to 16) —Number of WaveNet layers used by the posterior encoder model.
  • wavenet_kernel_size (int,optional, defaults to 5) —Kernel size of the 1D convolution layers used in the WaveNet model.
  • wavenet_dilation_rate (int,optional, defaults to 1) —Dilation rates of the dilated 1D convolutional layers used in the WaveNet model.
  • wavenet_dropout (float,optional, defaults to 0.0) —The dropout ratio for the WaveNet layers.
  • speaking_rate (float,optional, defaults to 1.0) —Speaking rate. Larger values give faster synthesised speech.
  • noise_scale (float,optional, defaults to 0.667) —How random the speech prediction is. Larger values create more variation in the predicted speech.
  • noise_scale_duration (float,optional, defaults to 0.8) —How random the duration prediction is. Larger values create more variation in the predicted durations.
  • sampling_rate (int,optional, defaults to 16000) —The sampling rate at which the output audio waveform is digitized, expressed in hertz (Hz).

This is the configuration class to store the configuration of a VitsModel. It is used to instantiate a VITS model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VITS facebook/mms-tts-eng architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import VitsModel, VitsConfig

>>> # Initializing a "facebook/mms-tts-eng" style configuration
>>> configuration = VitsConfig()

>>> # Initializing a model (with random weights) from the "facebook/mms-tts-eng" style configuration
>>> model = VitsModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
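
Several of these settings (for example speaking_rate, noise_scale, and noise_scale_duration) control synthesis rather than architecture, so they can be overridden without retraining. A minimal sketch, relying on the standard from_pretrained behavior of forwarding unused keyword arguments to the configuration:

from transformers import VitsModel

# Override inference-time settings while loading pretrained weights; the extra
# keyword arguments update the corresponding fields of the loaded VitsConfig.
model = VitsModel.from_pretrained(
    "facebook/mms-tts-eng",
    speaking_rate=1.3,  # faster synthesized speech
    noise_scale=0.8,    # more variation in the predicted speech
)
print(model.config.speaking_rate, model.config.noise_scale)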

VitsTokenizer

classtransformers.VitsTokenizer

<source>

( vocab_file, pad_token = '<pad>', unk_token = '<unk>', language = None, add_blank = True, normalize = True, phonemize = True, is_uroman = False, **kwargs )

Parameters

  • vocab_file (str) —Path to the vocabulary file.
  • language (str,optional) —Language identifier.
  • add_blank (bool,optional, defaults toTrue) —Whether to insert token id 0 in between the other tokens.
  • normalize (bool,optional, defaults toTrue) —Whether to normalize the input text by removing all casing and punctuation.
  • phonemize (bool,optional, defaults toTrue) —Whether to convert the input text into phonemes.
  • is_uroman (bool,optional, defaults toFalse) —Whether theuroman Romanizer needs to be applied to the input text prior to tokenizing.

Construct a VITS tokenizer. Also supports MMS-TTS.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
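
A minimal usage sketch, assuming the facebook/mms-tts-eng checkpoint; the tokenizer produces the tensors consumed by VitsModel:

from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")
print(list(inputs.keys()))        # tensor fields to pass to VitsModel, e.g. input_ids
print(inputs["input_ids"].shape)  # (1, sequence_length); blank tokens are interleaved when add_blank=True
print(tokenizer.is_uroman)        # whether uroman romanization is expected before tokenizing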

__call__

<source>

( text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput], None] = None, text_pair: Optional[Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]]] = None, text_target: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput], None] = None, text_pair_target: Optional[Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]]] = None, add_special_tokens: bool = True, padding: Union[bool, str, PaddingStrategy] = False, truncation: Union[bool, str, TruncationStrategy, None] = None, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, padding_side: Optional[str] = None, return_tensors: Optional[Union[str, TensorType]] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs ) → BatchEncoding

Parameters

  • text (str,list[str],list[list[str]],optional) —The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_pair (str,list[str],list[list[str]],optional) —The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_target (str,list[str],list[list[str]],optional) —The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or alist of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),you must setis_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_pair_target (str,list[str],list[list[str]],optional) —The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or alist of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),you must setis_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • add_special_tokens (bool,optional, defaults toTrue) —Whether or not to add special tokens when encoding the sequences. This will use the underlyingPretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens areautomatically added to the input ids. This is useful if you want to addbos oreos tokensautomatically.
  • padding (bool,str orPaddingStrategy,optional, defaults toFalse) —Activates and controls padding. Accepts the following values:

    • True or'longest': Pad to the longest sequence in the batch (or no padding if only a singlesequence is provided).
    • 'max_length': Pad to a maximum length specified with the argumentmax_length or to the maximumacceptable input length for the model if that argument is not provided.
    • False or'do_not_pad' (default): No padding (i.e., can output a batch with sequences of differentlengths).
  • truncation (bool,str orTruncationStrategy,optional, defaults toFalse) —Activates and controls truncation. Accepts the following values:

    • True or'longest_first': Truncate to a maximum length specified with the argumentmax_length orto the maximum acceptable input length for the model if that argument is not provided. This willtruncate token by token, removing a token from the longest sequence in the pair if a pair ofsequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argumentmax_length or to themaximum acceptable input length for the model if that argument is not provided. This will onlytruncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argumentmax_length or to themaximum acceptable input length for the model if that argument is not provided. This will onlytruncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • False or'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengthsgreater than the model maximum admissible input size).
  • max_length (int,optional) —Controls the maximum length to use by one of the truncation/padding parameters.

    If left unset or set toNone, this will use the predefined model maximum length if a maximum lengthis required by one of the truncation/padding parameters. If the model has no specific maximum inputlength (like XLNet) truncation/padding to a maximum length will be deactivated.

  • stride (int,optional, defaults to 0) —If set to a number along withmax_length, the overflowing tokens returned whenreturn_overflowing_tokens=True will contain some tokens from the end of the truncated sequencereturned to provide some overlap between truncated and overflowing sequences. The value of thisargument defines the number of overlapping tokens.
  • is_split_into_words (bool,optional, defaults toFalse) —Whether or not the input is already pre-tokenized (e.g., split into words). If set toTrue, thetokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)which it will tokenize. This is useful for NER or token classification.
  • pad_to_multiple_of (int,optional) —If set will pad the sequence to a multiple of the provided value. Requirespadding to be activated.This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability>= 7.5 (Volta).
  • padding_side (str,optional) —The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’].Default value is picked from the class attribute of the same name.
  • return_tensors (str orTensorType,optional) —If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'pt': Return PyTorchtorch.Tensor objects.
    • 'np': Return Numpynp.ndarray objects.
  • return_token_type_ids (bool,optional) —Whether to return token type IDs. If left to the default, will return the token type IDs according tothe specific tokenizer’s default, defined by thereturn_outputs attribute.

    What are token type IDs?

  • return_attention_mask (bool,optional) —Whether to return the attention mask. If left to the default, will return the attention mask accordingto the specific tokenizer’s default, defined by thereturn_outputs attribute.

    What are attention masks?

  • return_overflowing_tokens (bool,optional, defaults toFalse) —Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batchof pairs) is provided withtruncation_strategy = longest_first orTrue, an error is raised insteadof returning overflowing tokens.
  • return_special_tokens_mask (bool,optional, defaults toFalse) —Whether or not to return special tokens mask information.
  • return_offsets_mapping (bool,optional, defaults toFalse) —Whether or not to return(char_start, char_end) for each token.

    This is only available on fast tokenizers inheriting fromPreTrainedTokenizerFast, if usingPython’s tokenizer, this method will raiseNotImplementedError.

  • return_length (bool,optional, defaults toFalse) —Whether or not to return the lengths of the encoded inputs.
  • verbose (bool,optional, defaults toTrue) —Whether or not to print more information and warnings.
  • **kwargs — passed to theself.tokenize() method

Returns

A BatchEncoding with the following fields:

  • input_ids — List of token ids to be fed to a model.

    What are input IDs?

  • token_type_ids — List of token type ids to be fed to a model (whenreturn_token_type_ids=True orif“token_type_ids” is inself.model_input_names).

    What are token type IDs?

  • attention_mask — List of indices specifying which tokens should be attended to by the model (whenreturn_attention_mask=True or if“attention_mask” is inself.model_input_names).

    What are attention masks?

  • overflowing_tokens — List of overflowing tokens sequences (when amax_length is specified andreturn_overflowing_tokens=True).

  • num_truncated_tokens — Number of tokens truncated (when amax_length is specified andreturn_overflowing_tokens=True).

  • special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifyingregular sequence tokens (whenadd_special_tokens=True andreturn_special_tokens_mask=True).

  • length — The length of the inputs (whenreturn_length=True)

Main method to tokenize and prepare one or several sequence(s) or one or several pair(s) of sequences for the model.
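
A short sketch of batching several sentences with this method, assuming padding to the longest sequence and PyTorch tensors:

from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

batch = tokenizer(
    ["Hello, my dog is cute", "The quick brown fox jumps over the lazy dog"],
    padding=True,          # pad to the longest sequence in the batch
    return_tensors="pt",
)
print(batch["input_ids"].shape)       # (2, longest_sequence_length)
print(batch["attention_mask"].shape)  # same shape; zeros mark the padded positions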

save_vocabulary

<source>

( save_directory: str, filename_prefix: typing.Optional[str] = None )
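
A brief sketch of writing the vocabulary to disk, assuming the target directory already exists; save_pretrained() is the higher-level call that also stores the tokenizer configuration:

import os
from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

os.makedirs("vits_tokenizer", exist_ok=True)
vocab_files = tokenizer.save_vocabulary("vits_tokenizer")  # returns a tuple of written file paths
print(vocab_files)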

VitsModel

classtransformers.VitsModel

<source>

(config: VitsConfig)

Parameters

  • config (VitsConfig) —Model configuration class with all the parameters of the model. Initializing with a config file does notload the weights associated with the model, only the configuration. Check out thefrom_pretrained() method to load the model weights.

The complete VITS model, for text-to-speech synthesis.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

<source>

( input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, speaker_id: typing.Optional[int] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.FloatTensor] = None ) → transformers.models.vits.modeling_vits.VitsModelOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.Tensor of shape(batch_size, sequence_length),optional) —Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape(batch_size, sequence_length),optional) —Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:

    • 1 for tokens that arenot masked,
    • 0 for tokens that aremasked.

    What are attention masks?

  • speaker_id (int,optional) —Which speaker embedding to use. Only used for multispeaker models.
  • output_attentions (bool,optional) —Whether or not to return the attentions tensors of all attention layers. Seeattentions under returnedtensors for more detail.
  • output_hidden_states (bool,optional) —Whether or not to return the hidden states of all layers. Seehidden_states under returned tensors formore detail.
  • return_dict (bool,optional) —Whether or not to return aModelOutput instead of a plain tuple.
  • labels (torch.FloatTensor of shape(batch_size, config.spectrogram_bins, sequence_length),optional) —Float values of target spectrogram. Timesteps set to-100.0 are ignored (masked) for the losscomputation.

Returns

transformers.models.vits.modeling_vits.VitsModelOutput ortuple(torch.FloatTensor)

A transformers.models.vits.modeling_vits.VitsModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VitsConfig) and inputs.

  • waveform (torch.FloatTensor of shape(batch_size, sequence_length)) — The final audio waveform predicted by the model.

  • sequence_lengths (torch.FloatTensor of shape(batch_size,)) — The length in samples of each element in thewaveform batch.

  • spectrogram (torch.FloatTensor of shape(batch_size, sequence_length, num_bins)) — The log-mel spectrogram predicted at the output of the flow model. This spectrogram is passed to the HiFi-GAN decoder model to obtain the final audio waveform.

  • hidden_states (tuple[torch.FloatTensor],optional, returned whenoutput_hidden_states=True is passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, +one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple[torch.FloatTensor],optional, returned whenoutput_attentions=True is passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor (one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attentionheads.

The VitsModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import VitsTokenizer, VitsModel, set_seed
>>> import torch

>>> tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
>>> model = VitsModel.from_pretrained("facebook/mms-tts-eng")

>>> inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")

>>> set_seed(555)  # make deterministic

>>> with torch.no_grad():
...     outputs = model(inputs["input_ids"])
>>> outputs.waveform.shape
torch.Size([1, 45824])
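
For multi-speaker checkpoints, the speaker_id argument selects which speaker embedding to use. A minimal sketch, assuming the multi-speaker VCTK checkpoint kakao-enterprise/vits-vctk, whose tokenizer phonemizes its input and therefore also requires the phonemizer package:

>>> import torch
>>> from transformers import VitsTokenizer, VitsModel, set_seed

>>> tokenizer = VitsTokenizer.from_pretrained("kakao-enterprise/vits-vctk")
>>> model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")

>>> inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")

>>> set_seed(555)  # make deterministic

>>> with torch.no_grad():
...     # speaker_id must be smaller than model.config.num_speakers
...     outputs = model(inputs["input_ids"], speaker_id=4)
>>> waveform = outputs.waveform[0]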