TransformerEncoderLayer#

class torch.nn.modules.transformer.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation=<function relu>, layer_norm_eps=1e-05, batch_first=False, norm_first=False, bias=True, device=None, dtype=None)[source]#

TransformerEncoderLayer is made up of a self-attention block and a feedforward network.

This TransformerEncoderLayer implements the original architecture described in the Attention Is All You Need paper. The intent of this layer is as a reference implementation for foundational understanding and thus it contains only limited features relative to newer Transformer architectures. Given the fast pace of innovation in transformer-like architectures, we recommend exploring this tutorial to build efficient layers from building blocks in core or using higher level libraries from the PyTorch Ecosystem.

TransformerEncoderLayer can handle either traditional torch.Tensor inputs, or Nested Tensor inputs. Derived classes are expected to similarly accept both input formats. (Not all combinations of inputs are currently supported by TransformerEncoderLayer while Nested Tensor is in prototype state.)

If you are implementing a custom layer, you may derive it either from the Module or TransformerEncoderLayer class. If your custom layer supports both torch.Tensor and Nested Tensor inputs, make its implementation a derived class of TransformerEncoderLayer. If your custom layer supports only torch.Tensor inputs, derive its implementation from Module.
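As a minimal sketch of the subclassing approach above, a custom layer can derive from TransformerEncoderLayer and reuse its attention/feedforward stack. The class name ScaledEncoderLayer and its scale parameter are illustrative, not part of torch:

```python
import torch
import torch.nn as nn

# Hypothetical custom layer: reuses TransformerEncoderLayer's forward pass
# and scales the output. Because it defers to the parent's forward(), it
# inherits support for both torch.Tensor and Nested Tensor inputs.
class ScaledEncoderLayer(nn.TransformerEncoderLayer):
    def __init__(self, *args, scale=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.scale = scale  # illustrative extra parameter

    def forward(self, src, src_mask=None, src_key_padding_mask=None, is_causal=False):
        out = super().forward(src, src_mask=src_mask,
                              src_key_padding_mask=src_key_padding_mask,
                              is_causal=is_causal)
        return self.scale * out

layer = ScaledEncoderLayer(d_model=16, nhead=4, batch_first=True)
src = torch.rand(2, 5, 16)   # (batch, seq, feature)
out = layer(src)
print(out.shape)  # torch.Size([2, 5, 16])
```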

Parameters
  • d_model (int) – the number of expected features in the input (required).

  • nhead (int) – the number of heads in the multiheadattention models (required).

  • dim_feedforward (int) – the dimension of the feedforward network model (default=2048).

  • dropout (float) – the dropout value (default=0.1).

  • activation (Union[str, Callable[[Tensor], Tensor]]) – the activation function of the intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: relu

  • layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).

  • batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).

  • norm_first (bool) – if True, layer norm is done prior to attention and feedforward operations, respectively. Otherwise it’s done after. Default: False (after).

  • bias (bool) – If set to False, Linear and LayerNorm layers will not learn an additive bias. Default: True.

Examples

>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = torch.rand(10, 32, 512)
>>> out = encoder_layer(src)

Alternatively, when batch_first is True:

>>> encoder_layer = nn.TransformerEncoderLayer(
...     d_model=512, nhead=8, batch_first=True
... )
>>> src = torch.rand(32, 10, 512)
>>> out = encoder_layer(src)
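In practice a single encoder layer is usually stacked into a full encoder with nn.TransformerEncoder. A minimal sketch (num_layers=6 follows the original paper; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Stack identical encoder layers into a full encoder.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.rand(32, 10, 512)   # (batch, seq, feature) since batch_first=True
out = encoder(src)
print(out.shape)  # torch.Size([32, 10, 512])
```

nn.TransformerEncoder deep-copies the given layer, so each of the six layers has its own parameters.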
Fast path:

forward() will use a special optimized implementation described in FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness if all of the following conditions are met:

  • Either autograd is disabled (using torch.inference_mode or torch.no_grad) or no tensor argument requires_grad

  • training is disabled (using.eval())

  • batch_first is True and the input is batched (i.e., src.dim() == 3)

  • activation is one of: "relu", "gelu", torch.functional.relu, or torch.functional.gelu

  • at most one of src_mask and src_key_padding_mask is passed

  • if src is a NestedTensor, neither src_mask nor src_key_padding_mask is passed

  • the two LayerNorm instances have a consistent eps value (this will naturally be the case unless the caller has manually modified one without modifying the other)

If the optimized implementation is in use, a NestedTensor can be passed for src to represent padding more efficiently than using a padding mask. In this case, a NestedTensor will be returned, and an additional speedup proportional to the fraction of the input that is padding can be expected.
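A minimal sketch of the NestedTensor usage above, assuming the fast-path conditions hold (eval mode, autograd disabled, batch_first=True, no masks). Variable-length sequences are passed directly instead of padding them and supplying src_key_padding_mask:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True).eval()

# Two sequences of different lengths, bundled without padding.
seqs = [torch.rand(3, 8), torch.rand(5, 8)]
nt = torch.nested.nested_tensor(seqs)

with torch.inference_mode():   # satisfies the "autograd is disabled" condition
    out = layer(nt)

print(out.is_nested)  # True: a NestedTensor comes back out
```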

forward(src, src_mask=None, src_key_padding_mask=None, is_causal=False)[source]#

Pass the input through the encoder layer.

Parameters
  • src (Tensor) – the sequence to the encoder layer (required).

  • src_mask (Optional[Tensor]) – the mask for the src sequence (optional).

  • src_key_padding_mask (Optional[Tensor]) – the mask for the src keys per batch (optional).

  • is_causal (bool) – If specified, applies a causal mask as src mask. Default: False. Warning: is_causal provides a hint that src_mask is the causal mask. Providing incorrect hints can result in incorrect execution, including forward and backward compatibility.

Return type

Tensor

Shape:

see the docs in Transformer.
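As a sketch of calling forward() with a causal mask: nn.Transformer.generate_square_subsequent_mask builds the standard upper-triangular mask, and is_causal=True merely hints that src_mask is causal; the mask itself must still be supplied (the sizes below are arbitrary):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)

src = torch.rand(2, 6, 16)                                # (batch, seq, feature)
mask = nn.Transformer.generate_square_subsequent_mask(6)  # (seq, seq) float mask,
                                                          # -inf above the diagonal
out = layer(src, src_mask=mask, is_causal=True)
print(out.shape)  # torch.Size([2, 6, 16])
```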