This repository provides an overview of SegFormer, its encoder architecture in particular. Some details of SegFormer are easy to misread, so a short description is given here to help understand the model. The code (Keras/TensorFlow) is also provided for support.
In this repository, the structure of the SegFormer model is explained. In many recent blog posts and tutorials, the structure of SegFormer has been misunderstood by many people, even experienced computer vision engineers, partly because of the misleading diagram of the SegFormer structure in the original paper; the model structure is, however, shown clearly in the source code linked in the paper. Therefore, the details of SegFormer, including OverlapPatchEmbedding, Efficient Multihead Attention, Mixed-FeedForward Network, OverlapPatchMerging and the SegFormer block, are elaborated here. If there is any problem, please feel free to raise a complaint, and make contact if convenient.
The code, developed with Keras/TensorFlow, has also been uploaded for reference.
Project:
```
git clone https://github.com/ACSEkevin/An-Overview-of-Segformer-and-Details-Description.git
```

- `ADEChallengeData2016/`: the ADE20K dataset used for training and testing the model; please refer to: ADE20K Dataset.
- `models/`: two styles of implementing the model: structural and class inheritance.
- `adedataset.py`: a dataset batch generator (required by Keras).
- `train.py`: model training script. NOTICE: this is a basic example that uses the Keras `.fit()` API; for detailed model training, please use TensorFlow to build a `train_one_epoch()` loop.
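As a starting point for such a loop, here is a minimal sketch of a `train_one_epoch()` built with a TensorFlow gradient tape; the `model`, `dataset`, `optimizer` and `loss_fn` names are placeholders, not objects from this repository:

```python
import tensorflow as tf

def train_one_epoch(model, dataset, optimizer, loss_fn):
    """Run one epoch of training with a custom gradient-tape loop."""
    epoch_loss = tf.keras.metrics.Mean()
    for images, masks in dataset:
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            loss = loss_fn(masks, logits)
        # backpropagate and update the trainable weights
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss.update_state(loss)
    return epoch_loss.result()
```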
To be continued: a validation script and a prediction script for model output.
Here, a re-drawn architecture diagram replaces the one from the original paper, which might help to gain a better understanding.
To conclude and compare:
- In the encoder, an input image is scaled down to $\frac{1}{32}$ of its original size and then upsampled to $\frac{1}{4}$ of the original size in the decoder. However, the model given in this repository upsamples to the full size in an attempt to get a better result; this can be revised after cloning.
- In the original figure, the OverlapPatchEmbedding layer is only shown at the beginning of the architecture, which can be misleading; in fact, an OverlapPatchEmbedding layer always follows the previous transformer block (shown as SegFormer Block in the figure). Indeed, the paper uses the plural term "OverlapPatchEmbeddings", which implies that there is more than one such layer.
- There is an OverlapPatchMerging layer at the end of each transformer block; this layer reshapes the vector groups back into feature maps. It is easy to confuse these two layers, as many blogs express a "no merging after the block" opinion.
- The feature map $C_1$ goes through the MLP Layer without upsampling. The others are upsampled by $\times 2$, $\times 4$ and $\times 8$ respectively with bilinear interpolation, as sketched after this list.
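A minimal sketch of this decoder upsampling step, assuming `c1`-`c4` are the per-stage feature maps after the MLP projection (hypothetical names, not from the repository code):

```python
from tensorflow.keras.layers import UpSampling2D, Concatenate

# c1 keeps its resolution; c2, c3, c4 are upsampled by x2, x4, x8 respectively
c2 = UpSampling2D(size=2, interpolation='bilinear')(c2)
c3 = UpSampling2D(size=4, interpolation='bilinear')(c3)
c4 = UpSampling2D(size=8, interpolation='bilinear')(c4)
# all four maps now share c1's spatial size and can be fused channel-wise
fused = Concatenate(axis=-1)([c1, c2, c3, c4])
```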
In a basic transformer block, an image is split into patches to form a "sequence", and there is no information interaction between patches (strides = patch_size). In SegFormer, the patch size is larger than the stride, which leads to information sharing between patches (adjacent convolution windows overlap), hence the name "overlapped" patches. At the end, the embedding is followed by a layer normalization.
```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, LayerNormalization

# overlapped patch embedding: a strided convolution with kernel_size > strides,
# then flatten the feature map into a sequence of patch embeddings
x = Conv2D(n_filters, kernel_size=kernel_size, strides=strides, padding='same')(inputs)
batches, height, width, embed_dim = x.shape
x = tf.reshape(x, shape=[-1, height * width, embed_dim])
x = LayerNormalization()(x)
```
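As a concrete example from the original paper, the first embedding uses kernel_size=7 with strides=4, and the subsequent ones use kernel_size=3 with strides=2, so neighboring patches always overlap.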
Below is a diagram that shows the detailed architecture of a SegFormer Block module. A sequence goes through Efficient Self-Attention and Mix-FeedForward Network layers, each preceded by a Layer Normalization.
In the paper, the authors proposed Efficient Self-Attention to reduce the time complexity from $O(N^2)$ to $O(\frac{N^2}{R})$, where $R$ is the reduction ratio (corresponding to $sr^2$ in the code below).
- Like a normal Self-Attention module, each vector of an input sequence produces a $query$, $key$ and $value$ (only one vector is shown in the figure).
- Differently, the $key$ and $value$ matrices go through a reduction layer before participating in the attention transformations. The layer can be implemented by a `Conv2D` that plays the role of downsampling (strides $=$ kernel_size), followed by a Layer Normalization. The `Reshape` layers reconstruct and de-construct the feature maps respectively; a sketch of how the reduced sequence is used follows the code below.
- Shape changes in the reduction layer: $[num_{patches}, dim_{embed}]$ -> $[height, width, dim_{embed}]$ -> $[\frac{height}{sr}, \frac{width}{sr}, dim_{embed}]$ -> $[\frac{height \times width}{sr^2}, dim_{embed}]$.
reduction layer:
```python
batches, n_patches, channels = inputs.shape
if sr_ratio > 1:
    # reshape the sequence back to a feature map (height, width from context)
    inputs = tf.reshape(inputs, shape=[batches, height, width, embed_dim])
    # spatial reduction: a strided conv with strides = kernel_size = sr_ratio
    inputs = Conv2D(embed_dim, kernel_size=sr_ratio, strides=sr_ratio, padding='same')(inputs)
    inputs = LayerNormalization()(inputs)
    # flatten back to a (shorter) sequence
    inputs = tf.reshape(inputs, shape=[batches, (height * width) // (sr_ratio ** 2), embed_dim])
```
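To see why this reduces the attention cost, here is a minimal single-head sketch of how the reduced sequence might be used; `full_sequence`, `reduced_sequence` and `embed_dim` are assumed names, not identifiers from the repository:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense

# queries come from the full sequence, keys/values from the reduced one
q = Dense(embed_dim)(full_sequence)      # [batches, n_patches, embed_dim]
k = Dense(embed_dim)(reduced_sequence)   # [batches, n_patches // sr_ratio**2, embed_dim]
v = Dense(embed_dim)(reduced_sequence)

# attention matrix is [n_patches, n_patches // sr_ratio**2]
# instead of [n_patches, n_patches]
scale = tf.cast(embed_dim, tf.float32) ** -0.5
attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True) * scale, axis=-1)
out = tf.matmul(attn, v)                 # [batches, n_patches, embed_dim]
```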
The Conditional Positional Encoding method addresses the loss of accuracy caused by varying input resolutions in Vision Transformers. In this paper, the authors point out that positional encoding (PE) is not necessary for segmentation tasks; thus there is only a Conv in the Mix-FFN.
- In the code, the layer `DWConv` (depthwise convolution) was adopted rather than the Conv $3 \times 3$ described in the paper, which can be misleading.
- The `Reshape` layers have the same purpose as those in the reduction layer of Efficient Self-Attention.
- Shape changes in the Mix-FFN layer: $[num_{patches}, dim_{embed}]$ -> $[num_{patches}, dim_{embed} \cdot rate_{exp}]$ -> $[height, width, dim_{embed} \cdot rate_{exp}]$ -> $[num_{patches}, dim_{embed} \cdot rate_{exp}]$ -> $[num_{patches}, dim_{embed}]$.
```python
batches, n_patches, channels = inputs.shape
# expand the embedding dimension
x = Dense(int(embed_dim * expansion_rate), use_bias=True)(inputs)
# reshape to a feature map so the depthwise conv can mix spatial information
x = tf.reshape(x, shape=[batches, height, width, int(embed_dim * expansion_rate)])
x = DepthwiseConv2D(kernel_size=3, strides=1, padding='same')(x)
# flatten back to a sequence
x = tf.reshape(x, shape=[batches, n_patches, int(embed_dim * expansion_rate)])
x = Activation('gelu')(x)
# project back to the original embedding dimension
x = Dense(embed_dim, use_bias=True)(x)
x = Dropout(rate=drop_rate)(x)
```
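Note that in the full SegFormer block, both the Efficient Self-Attention output and the Mix-FFN output are added back to their inputs through residual connections, as is standard in transformer blocks.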
This is a simple reshape operation that reconstructs sequences (patches) back into feature maps. Note that this layer is also preceded by a Layer Normalization.
```python
x = LayerNormalization()(x)
feature_Cx = tf.reshape(x, shape=[batches, height_Cx, width_Cx, embed_dims[index]])
```
where `embed_dims` is a list that stores the embedding dimension of each SegFormer block and `index` selects the current one.
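For reference, the per-stage embedding dimensions of the smallest backbone in the original paper (MiT-B0) are 32, 64, 160 and 256, i.e. `embed_dims = [32, 64, 160, 256]`.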
Xie, E. et al. (2021) 'SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers', NeurIPS 2021. doi:10.48550/arXiv.2105.15203
Chu, X. et al. (2021) 'Conditional Positional Encodings for Vision Transformers', ICLR 2023, pp. 1-19. doi:10.48550/arXiv.2102.10882
Zhou, B.et al. (2017). 'Scene Parsing through ADE20K Dataset',Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2017.544