
This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30.

Autoformer

PyTorch

Overview

The Autoformer model was proposed in Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.

This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.
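Concretely, each decomposition block in the network splits its intermediate series into a smooth trend part (a moving average) and a seasonal remainder. Below is a minimal, self-contained sketch of this idea in PyTorch, assuming edge replication for padding; it is illustrative, not the library's internal implementation (the moving_average configuration value plays the role of kernel_size here):

import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    """Illustrative decomposition block: trend = moving average, seasonal = residual."""

    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, channels); replicate the edges so the trend keeps the input length
        pad = (self.kernel_size - 1) // 2
        front = x[:, :1, :].repeat(1, pad, 1)
        end = x[:, -1:, :].repeat(1, pad, 1)
        padded = torch.cat([front, x, end], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)
        seasonal = x - trend  # the seasonal part is what remains after removing the trend
        return seasonal, trend

seasonal, trend = SeriesDecomposition(kernel_size=25)(torch.randn(2, 96, 1))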

The abstract from the paper is the following:

Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.
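The Auto-Correlation mechanism itself can be summarized in a few lines: autocorrelation scores are estimated efficiently in frequency space (via the Wiener-Khinchin relation), the top-k delays are kept, and copies of the series rolled by those delays are aggregated with softmax-normalized scores. The sketch below is a hedged illustration of that idea, not the library's internal implementation; the function name and the simple batch/channel averaging of scores are assumptions:

import torch

def auto_correlation(x: torch.Tensor, top_k: int = 3) -> torch.Tensor:
    # x: (batch, time, channels)
    length = x.size(1)
    # autocorrelation for every delay at once, via FFT (Wiener-Khinchin)
    fft = torch.fft.rfft(x, dim=1)
    acf = torch.fft.irfft(fft * torch.conj(fft), n=length, dim=1)
    scores = acf.mean(dim=(0, 2))  # average score per delay across batch and channels
    weights, delays = torch.topk(scores, top_k)  # keep the k most correlated delays
    weights = torch.softmax(weights, dim=0)
    # aggregate the series rolled by each selected delay, weighted by its score
    out = torch.zeros_like(x)
    for w, d in zip(weights, delays.tolist()):
        out = out + w * torch.roll(x, shifts=-d, dims=1)
    return out

out = auto_correlation(torch.randn(2, 96, 64), top_k=3)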

This model was contributed by elisim and kashif. The original code can be found here.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

AutoformerConfig

class transformers.AutoformerConfig


( prediction_length: typing.Optional[int] = None, context_length: typing.Optional[int] = None, distribution_output: str = 'student_t', loss: str = 'nll', input_size: int = 1, lags_sequence: list = [1, 2, 3, 4, 5, 6, 7], scaling: bool = True, num_time_features: int = 0, num_dynamic_real_features: int = 0, num_static_categorical_features: int = 0, num_static_real_features: int = 0, cardinality: typing.Optional[list[int]] = None, embedding_dimension: typing.Optional[list[int]] = None, d_model: int = 64, encoder_attention_heads: int = 2, decoder_attention_heads: int = 2, encoder_layers: int = 2, decoder_layers: int = 2, encoder_ffn_dim: int = 32, decoder_ffn_dim: int = 32, activation_function: str = 'gelu', dropout: float = 0.1, encoder_layerdrop: float = 0.1, decoder_layerdrop: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, num_parallel_samples: int = 100, init_std: float = 0.02, use_cache: bool = True, is_encoder_decoder = True, label_length: int = 10, moving_average: int = 25, autocorrelation_factor: int = 3, **kwargs )

Parameters

  • prediction_length (int) — The prediction length for the decoder. In other words, the prediction horizon of the model.
  • context_length (int, optional, defaults to prediction_length) — The context length for the encoder. If unset, the context length will be the same as the prediction_length.
  • distribution_output (string, optional, defaults to "student_t") — The distribution emission head for the model. Could be either “student_t”, “normal” or “negative_binomial”.
  • loss (string, optional, defaults to "nll") — The loss function for the model corresponding to the distribution_output head. For parametric distributions it is the negative log likelihood (nll), which is currently the only supported one.
  • input_size (int, optional, defaults to 1) — The size of the target variable, which by default is 1 for univariate targets. Would be > 1 in the case of multivariate targets.
  • lags_sequence (list[int], optional, defaults to [1, 2, 3, 4, 5, 6, 7]) — The lags of the input time series as covariates, often dictated by the frequency.
  • scaling (bool, optional, defaults to True) — Whether to scale the input targets.
  • num_time_features (int, optional, defaults to 0) — The number of time features in the input time series.
  • num_dynamic_real_features (int, optional, defaults to 0) — The number of dynamic real valued features.
  • num_static_categorical_features (int, optional, defaults to 0) — The number of static categorical features.
  • num_static_real_features (int, optional, defaults to 0) — The number of static real valued features.
  • cardinality (list[int], optional) — The cardinality (number of different values) for each of the static categorical features. Should be a list of integers, having the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
  • embedding_dimension (list[int], optional) — The dimension of the embedding for each of the static categorical features. Should be a list of integers, having the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
  • d_model (int, optional, defaults to 64) — Dimensionality of the transformer layers.
  • encoder_layers (int, optional, defaults to 2) — Number of encoder layers.
  • decoder_layers (int, optional, defaults to 2) — Number of decoder layers.
  • encoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer encoder.
  • decoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer decoder.
  • encoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in the encoder.
  • decoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in the decoder.
  • activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and decoder. If string, "gelu" and "relu" are supported.
  • dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the encoder and decoder.
  • encoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each encoder layer.
  • decoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each decoder layer.
  • attention_dropout (float, optional, defaults to 0.1) — The dropout probability for the attention probabilities.
  • activation_dropout (float, optional, defaults to 0.1) — The dropout probability used between the two layers of the feed-forward networks.
  • num_parallel_samples (int, optional, defaults to 100) — The number of samples to generate in parallel for each time step of inference.
  • init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated normal weight initialization distribution.
  • use_cache (bool, optional, defaults to True) — Whether to use the past key/values attentions (if applicable to the model) to speed up decoding.
  • label_length (int, optional, defaults to 10) — Start token length of the Autoformer decoder, which is used for direct multi-step prediction (i.e. non-autoregressive generation).
  • moving_average (int, optional, defaults to 25) — The window size of the moving average. In practice, it’s the kernel size of the AvgPool1d in the Decomposition Layer.
  • autocorrelation_factor (int, optional, defaults to 3) — “Attention” (i.e. Auto-Correlation mechanism) factor used to find the top k autocorrelation delays. The paper recommends setting it to a value between 1 and 5.

This is the configuration class to store the configuration of an AutoformerModel. It is used to instantiate an Autoformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Autoformer huggingface/autoformer-tourism-monthly architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

>>> from transformers import AutoformerConfig, AutoformerModel

>>> # Initializing a default Autoformer configuration
>>> configuration = AutoformerConfig()

>>> # Randomly initializing a model (with random weights) from the configuration
>>> model = AutoformerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
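Beyond the defaults, the forecasting-specific arguments documented above can be set explicitly. A minimal sketch with hypothetical values for a monthly series (the concrete numbers are assumptions for illustration):

>>> from transformers import AutoformerConfig

>>> # hypothetical setup: forecast 24 steps ahead from a 24-step context,
>>> # with lags suited to a monthly frequency
>>> configuration = AutoformerConfig(
...     prediction_length=24,
...     context_length=24,
...     lags_sequence=[1, 2, 3, 4, 5, 6, 7, 11, 12, 13],
...     num_time_features=2,
...     moving_average=25,
...     autocorrelation_factor=3,
... )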

AutoformerModel

class transformers.AutoformerModel


(config: AutoformerConfig)

Parameters

  • config (AutoformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Autoformer Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( past_values: Tensor, past_time_features: Tensor, past_observed_mask: Tensor, static_categorical_features: typing.Optional[torch.Tensor] = None, static_real_features: typing.Optional[torch.Tensor] = None, future_values: typing.Optional[torch.Tensor] = None, future_time_features: typing.Optional[torch.Tensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, encoder_outputs: typing.Optional[list[torch.FloatTensor]] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, use_cache: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.Tensor] = None ) → transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or tuple(torch.FloatTensor)

Parameters

  • past_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Past values of the time series, which serve as context in order to predict the future. These values may contain lags, i.e. additional values from the past which are added in order to serve as “extra context”. The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features).

    The sequence length here is equal to context_length + max(config.lags_sequence); see the shape sketch after this parameter list.

    Missing values need to be replaced with zeros.

  • past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features), optional) — Optional time features, which the model will internally add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the “positional encodings” of the inputs. So, contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional time features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:

    • 1 for values that are observed,
    • 0 for values that are missing (i.e. NaNs that were replaced by zeros).
  • static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.

    Static categorical features are features which have the same value for all time steps (static over time).

    A typical example of a static categorical feature is a time series ID.

  • static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.

    Static real features are features which have the same value for all time steps (static over time).

    A typical example of a static real feature is promotion information.

  • future_values (torch.FloatTensor of shape (batch_size, prediction_length)) — Future values of the time series, which serve as labels for the model. The future_values is what the Transformer needs to learn to output, given the past_values.

    See the demo notebook and code snippets for details.

    Missing values need to be replaced with zeros.

  • future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features), optional) — Optional time features, which the model will internally add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the “positional encodings” of the inputs. So, contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
  • cache_position (torch.Tensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
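As a shape check for the inputs above, the required length of past_values follows directly from the configuration. A minimal sketch with random placeholder tensors (illustrative shapes only, not a meaningful training batch):

>>> import torch
>>> from transformers import AutoformerConfig

>>> config = AutoformerConfig(prediction_length=24, context_length=24, num_time_features=1)

>>> # the encoder consumes context_length + max(lags_sequence) past time steps
>>> past_length = config.context_length + max(config.lags_sequence)

>>> batch_size = 4
>>> past_values = torch.randn(batch_size, past_length)
>>> past_time_features = torch.randn(batch_size, past_length, config.num_time_features)
>>> past_observed_mask = torch.ones(batch_size, past_length, dtype=torch.bool)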

Returns

transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or tuple(torch.FloatTensor)

A transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AutoformerConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.

    If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • trend (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Trend tensor for each time series.

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional, defaults to None) — Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window, used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.

  • scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window, used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.

  • static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch which are copied to the covariates at inference time.
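The loc and scale outputs make it possible to move between the model’s internal, normalized space and the original magnitude of the series. A minimal sketch with placeholder tensors (in practice loc and scale come from the model output; the names here are stand-ins):

>>> import torch

>>> batch_size, prediction_length = 4, 24
>>> normalized = torch.randn(batch_size, prediction_length)  # any quantity in the scaled space
>>> loc = torch.randn(batch_size)  # stands in for outputs.loc
>>> scale = torch.rand(batch_size) + 0.5  # stands in for outputs.scale

>>> # broadcast the per-series shift and scale over the time dimension
>>> rescaled = normalized * scale.unsqueeze(-1) + loc.unsqueeze(-1)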

The AutoformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import AutoformerModel

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = AutoformerModel.from_pretrained("huggingface/autoformer-tourism-monthly")

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> last_hidden_state = outputs.last_hidden_state

AutoformerForPrediction

class transformers.AutoformerForPrediction


(config: AutoformerConfig)

Parameters

  • config (AutoformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Autoformer Model with a distribution head on top for time-series forecasting.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( past_values: Tensor, past_time_features: Tensor, past_observed_mask: Tensor, static_categorical_features: typing.Optional[torch.Tensor] = None, static_real_features: typing.Optional[torch.Tensor] = None, future_values: typing.Optional[torch.Tensor] = None, future_time_features: typing.Optional[torch.Tensor] = None, future_observed_mask: typing.Optional[torch.Tensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, encoder_outputs: typing.Optional[list[torch.FloatTensor]] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, use_cache: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqTSPredictionOutput or tuple(torch.FloatTensor)

Parameters

  • past_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Past values of the time series, which serve as context in order to predict the future. These values may contain lags, i.e. additional values from the past which are added in order to serve as “extra context”. The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features).

    The sequence length here is equal to context_length + max(config.lags_sequence).

    Missing values need to be replaced with zeros.

  • past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features), optional) — Optional time features, which the model will internally add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the “positional encodings” of the inputs. So, contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional time features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:

    • 1 for values that are observed,
    • 0 for values that are missing (i.e. NaNs that were replaced by zeros).
  • static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.

    Static categorical features are features which have the same value for all time steps (static over time).

    A typical example of a static categorical feature is a time series ID.

  • static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.

    Static real features are features which have the same value for all time steps (static over time).

    A typical example of a static real feature is promotion information.

  • future_values (torch.FloatTensor of shape (batch_size, prediction_length)) — Future values of the time series, which serve as labels for the model. The future_values is what the Transformer needs to learn to output, given the past_values.

    See the demo notebook and code snippets for details.

    Missing values need to be replaced with zeros.

  • future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features), optional) — Optional time features, which the model will internally add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the “positional encodings” of the inputs. So, contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • future_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which future_values were observed and which were missing. Mask values selected in [0, 1]:

    • 1 for values that are observed,
    • 0 for values that are missing (i.e. NaNs that were replaced by zeros).

    This mask is used to filter out missing values for the final loss calculation.

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

    Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default.

    The model will output the same cache format that is fed as input.

    If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.Seq2SeqTSPredictionOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqTSPredictionOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AutoformerConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when future_values is provided) — Distributional loss.

  • params (torch.FloatTensor of shape (batch_size, num_samples, num_params)) — Parameters of the chosen distribution.

  • past_key_values (EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — An EncoderDecoderCache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window, used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.

  • scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window, used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.

  • static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch which are copied to the covariates at inference time.

The AutoformerForPrediction forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import AutoformerForPrediction

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly")

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> loss = outputs.loss
>>> loss.backward()

>>> # during inference, one only provides past values
>>> # as well as possible additional features
>>> # the model autoregressively generates future values
>>> outputs = model.generate(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     future_time_features=batch["future_time_features"],
... )

>>> mean_prediction = outputs.sequences.mean(dim=1)
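Since generate draws config.num_parallel_samples trajectories per series, summary statistics other than the mean are straightforward to compute. Continuing from the outputs of the generation example above (a sketch, not part of the official example):

>>> # outputs.sequences has shape (batch_size, num_parallel_samples, prediction_length)
>>> median_prediction = outputs.sequences.quantile(0.5, dim=1)
>>> lower = outputs.sequences.quantile(0.1, dim=1)
>>> upper = outputs.sequences.quantile(0.9, dim=1)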

The AutoformerForPrediction can also use static_real_features. To do so, set num_static_real_features in AutoformerConfig based on the number of such features in the dataset (in the case of the tourism_monthly dataset it is equal to 1), initialize the model and call it as shown below:

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import AutoformerConfig, AutoformerForPrediction

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> # check number of static real features
>>> num_static_real_features = batch["static_real_features"].shape[-1]

>>> # load configuration of pretrained model and override num_static_real_features
>>> configuration = AutoformerConfig.from_pretrained(
...     "huggingface/autoformer-tourism-monthly",
...     num_static_real_features=num_static_real_features,
... )

>>> # we also need to update feature_size as it is not recalculated
>>> configuration.feature_size += num_static_real_features

>>> model = AutoformerForPrediction(configuration)

>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )