Transformer

class torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation=<function relu>, custom_encoder=None, custom_decoder=None, layer_norm_eps=1e-05, batch_first=False, norm_first=False, bias=True, device=None, dtype=None)

A basic transformer layer.

This Transformer layer implements the original Transformer architecture described in the Attention Is All You Need paper. The intent of this layer is as a reference implementation for foundational understanding, and thus it contains only limited features relative to newer Transformer architectures. Given the fast pace of innovation in transformer-like architectures, we recommend exploring this tutorial to build an efficient transformer layer from building blocks in core or using higher-level libraries from the PyTorch Ecosystem.

Parameters

d_model (int) – the number of expected features in the encoder/decoder inputs (default=512).
nhead (int) – the number of heads in the multi-head attention models (default=8).
num_encoder_layers (int) – the number of sub-encoder-layers in the encoder (default=6).
num_decoder_layers (int) – the number of sub-decoder-layers in the decoder (default=6).
dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
dropout (float) – the dropout value (default=0.1).
activation (Union[str, Callable[[Tensor], Tensor]]) – the activation function of the encoder/decoder intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: relu
custom_encoder (Optional[Any]) – custom encoder (default=None).
custom_decoder (Optional[Any]) – custom decoder (default=None).
layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
norm_first (bool) – If True, encoder and decoder layers will perform LayerNorms before other attention and feedforward operations, otherwise after. Default: False (after).
bias (bool) – If set to False, Linear and LayerNorm layers will not learn an additive bias. Default: True.

Examples

>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)

Note: A full example that applies the nn.Transformer module to a word language model is available in pytorch/examples.
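As an illustration of how the constructor parameters above combine, the following is a minimal sketch rather than a recommended configuration. The small sizes are arbitrary assumptions chosen so that it runs quickly; `d_model` has to be divisible by `nhead`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Deliberately small, arbitrary sizes so the sketch runs quickly.
model = nn.Transformer(
    d_model=32,
    nhead=4,                # d_model must be divisible by nhead
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    activation=F.gelu,      # a unary callable; the string "gelu" is also accepted
    batch_first=True,       # inputs and outputs are (batch, seq, feature)
    norm_first=True,        # LayerNorm before attention and feedforward
)

src = torch.rand(8, 10, 32)  # (N, S, E)
tgt = torch.rand(8, 20, 32)  # (N, T, E)
out = model(src, tgt)
print(out.shape)             # torch.Size([8, 20, 32])
```

With `batch_first=True`, the output keeps the batch dimension first and follows the target sequence length.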

forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, src_is_causal=None, tgt_is_causal=None, memory_is_causal=False)

Take in and process masked source/target sequences.

Note

If a boolean tensor is provided for any of the [src/tgt/memory]_mask arguments, positions with a True value are not allowed to participate in the attention, which is the opposite of the definition for attn_mask in torch.nn.functional.scaled_dot_product_attention().
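The difference between the two boolean conventions can be sketched as follows. This is an illustrative example under assumed, arbitrary tensor sizes; the variable names `blocked` and `allowed` are hypothetical and only label which convention each mask follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, E = 4, 16

# nn.Transformer convention: True marks a position that must NOT be attended to.
blocked = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# scaled_dot_product_attention convention: True marks a position that MAY be attended to.
allowed = ~blocked

# With SDPA, the inverted mask should behave like is_causal=True.
q, k, v = (torch.rand(1, 2, T, 8) for _ in range(3))
sdpa_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)
sdpa_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(sdpa_masked, sdpa_causal, atol=1e-5))

# With nn.Transformer, the un-inverted boolean mask is passed directly.
model = nn.Transformer(
    d_model=E, nhead=2, num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, dropout=0.0,
)
src = torch.rand(5, 1, E)  # (S, N, E)
tgt = torch.rand(T, 1, E)  # (T, N, E)
out = model(src, tgt, tgt_mask=blocked)
print(out.shape)
```

When moving a boolean mask between the two APIs, inverting it with `~` is the only change this sketch relies on.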

Parameters

src (Tensor) – the sequence to the encoder (required).
tgt (Tensor) – the sequence to the decoder (required).
src_mask (Optional[Tensor]) – the additive mask for the src sequence (optional).
tgt_mask (Optional[Tensor]) – the additive mask for the tgt sequence (optional).
memory_mask (Optional[Tensor]) – the additive mask for the encoder output (optional).
src_key_padding_mask (Optional[Tensor]) – the Tensor mask for src keys per batch (optional).
tgt_key_padding_mask (Optional[Tensor]) – the Tensor mask for tgt keys per batch (optional).
memory_key_padding_mask (Optional[Tensor]) – the Tensor mask for memory keys per batch (optional).
src_is_causal (Optional[bool]) – If specified, applies a causal mask as src_mask. Default: None; try to detect a causal mask. Warning: src_is_causal provides a hint that src_mask is the causal mask. Providing incorrect hints can result in incorrect execution, including problems with forward and backward compatibility.
tgt_is_causal (Optional[bool]) – If specified, applies a causal mask as tgt_mask. Default: None; try to detect a causal mask. Warning: tgt_is_causal provides a hint that tgt_mask is the causal mask. Providing incorrect hints can result in incorrect execution, including problems with forward and backward compatibility.
memory_is_causal (bool) – If specified, applies a causal mask as memory_mask. Default: False. Warning: memory_is_causal provides a hint that memory_mask is the causal mask. Providing incorrect hints can result in incorrect execution, including problems with forward and backward compatibility.

Return type

Tensor

Shape:

src: $(S, E)$ for unbatched input, $(S, N, E)$ if batch_first=False or $(N, S, E)$ if batch_first=True.
tgt: $(T, E)$ for unbatched input, $(T, N, E)$ if batch_first=False or $(N, T, E)$ if batch_first=True.
src_mask: $(S, S)$ or $(N \cdot \text{num\_heads}, S, S)$.
tgt_mask: $(T, T)$ or $(N \cdot \text{num\_heads}, T, T)$.
memory_mask: $(T, S)$.
src_key_padding_mask: $(S)$ for unbatched input, otherwise $(N, S)$.
tgt_key_padding_mask: $(T)$ for unbatched input, otherwise $(N, T)$.
memory_key_padding_mask: $(S)$ for unbatched input, otherwise $(N, S)$.

Note: [src/tgt/memory]_mask ensures that position $i$ is allowed to attend only to the unmasked positions. If a BoolTensor is provided, positions with True are not allowed to be attended to, while positions with False are left unchanged. If a FloatTensor is provided, it will be added to the attention weight. [src/tgt/memory]_key_padding_mask specifies elements in the key to be ignored by the attention. If a BoolTensor is provided, positions with the value True will be ignored, while positions with the value False will be unchanged.

output: $(T, E)$ for unbatched input, $(T, N, E)$ if batch_first=False or $(N, T, E)$ if batch_first=True.

Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.

where $S$ is the source sequence length, $T$ is the target sequence length, $N$ is the batch size, and $E$ is the number of features.
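The shape rules above can be checked with a short sketch. The sizes below are arbitrary assumptions, and the model is kept tiny only so that it runs quickly.

```python
import torch
import torch.nn as nn

S, T, N, E = 10, 7, 4, 16  # source length, target length, batch size, features

# Default layout (batch_first=False): (seq, batch, feature).
model = nn.Transformer(
    d_model=E, nhead=2, num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32,
)

# Batched input: src is (S, N, E), tgt is (T, N, E), output is (T, N, E).
out_batched = model(torch.rand(S, N, E), torch.rand(T, N, E))
print(out_batched.shape)    # torch.Size([7, 4, 16])

# Unbatched input: src is (S, E), tgt is (T, E), output is (T, E).
out_unbatched = model(torch.rand(S, E), torch.rand(T, E))
print(out_unbatched.shape)  # torch.Size([7, 16])
```

In both cases the output sequence length follows the target length $T$, not the source length $S$.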

Examples

>>> output = transformer_model(
...     src, tgt, src_mask=src_mask, tgt_mask=tgt_mask
... )
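The key padding masks described in the note above can be sketched in the same way. This is a minimal example under stated assumptions: the per-sequence lengths in `src_lengths` are made up for illustration, `dropout=0.0` is set so that two forward passes are comparable, and the same boolean mask is reused for `memory_key_padding_mask` so that the decoder also ignores the padded source positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, S, T, E = 2, 5, 3, 16

model = nn.Transformer(
    d_model=E, nhead=2, num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, dropout=0.0, batch_first=True,
)

src = torch.rand(N, S, E)
tgt = torch.rand(N, T, E)

# Hypothetical true lengths of each source sequence; the rest is padding.
src_lengths = torch.tensor([5, 3])

# Boolean mask of shape (N, S): True marks a padded key position to ignore.
src_pad_mask = torch.arange(S).unsqueeze(0) >= src_lengths.unsqueeze(1)

out = model(
    src, tgt,
    src_key_padding_mask=src_pad_mask,
    memory_key_padding_mask=src_pad_mask,
)

# Overwrite the padded source positions; the output should not change.
src_changed = src.clone()
src_changed[src_pad_mask] = 123.0
out_changed = model(
    src_changed, tgt,
    src_key_padding_mask=src_pad_mask,
    memory_key_padding_mask=src_pad_mask,
)
print(torch.allclose(out, out_changed, atol=1e-5))
```

If `memory_key_padding_mask` were omitted here, the decoder's cross-attention could still read the encoder outputs at the padded positions.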

static generate_square_subsequent_mask(sz, device=None, dtype=None)

Generate a square causal mask for the sequence.

The masked positions are filled with float('-inf'). Unmasked positions are filled with float(0.0).

Return type: Tensor
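A brief sketch of how the generated mask might be used as `tgt_mask` follows. The sizes are arbitrary assumptions, `dropout=0.0` is set so that the two forward passes are comparable, and the `torch.triu` construction is shown only as an equivalent way to build the same values for comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, T, N, E = 6, 4, 2, 16

# Float mask of shape (T, T): -inf above the diagonal, 0.0 elsewhere.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(T)
print(tgt_mask)

# An equivalent manual construction, for comparison only.
manual = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
same = torch.equal(tgt_mask, manual)

model = nn.Transformer(
    d_model=E, nhead=2, num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, dropout=0.0,
)
src = torch.rand(S, N, E)
tgt = torch.rand(T, N, E)
out = model(src, tgt, tgt_mask=tgt_mask)

# Changing only later target positions should leave earlier outputs unchanged.
tgt_future_changed = tgt.clone()
tgt_future_changed[2:] = torch.rand(T - 2, N, E)
out_changed = model(src, tgt_future_changed, tgt_mask=tgt_mask)
prefix_unchanged = torch.allclose(out[:2], out_changed[:2], atol=1e-5)
print(same, prefix_unchanged)
```

Because the mask is additive, the `-inf` entries drive the corresponding attention weights to zero after the softmax, which is why target position $i$ only depends on positions up to $i$.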
