This repository provides an overview of SegFormer, its encoder architecture in particular. Some details of SegFormer are easy to misread, so a short description is given here to help understand the model. The code (Keras/TensorFlow) is also provided for support.
In this repository, the structure of the SegFormer model is explained. In many recent blog posts and tutorials, the structure of SegFormer has been misunderstood by many people, even experienced computer vision engineers, partly because of the misleading diagram of the SegFormer structure in the original paper; the model structure is, however, shown clearly in the source code linked in the paper. Therefore, the details of SegFormer, including OverlapPatchEmbedding, Efficient Multihead Attention, Mixed-FeedForward Network, OverlapPatchMerging and the SegFormer block, are elaborated here. If there is any problem, please feel free to raise a complaint, and make contact if convenient.
The code, developed with Keras/TensorFlow, has also been uploaded for reference.
Project:
```
git clone https://github.com/ACSEkevin/An-Overview-of-Segformer-and-Details-Description.git
```

- `ADEChallengeData2016/`: the ADE20K dataset used for training and testing the model; please refer to: ADE20K Dataset.
- `models/`: two styles of implementing the model: structural and class inheritance.
- `adedataset.py`: a dataset batch generator (required by Keras).
- `train.py`: model training script. NOTICE: this is a basic example that uses the Keras `.fit()` API; for detailed model training, please use TensorFlow to build a `train_one_epoch()` loop.
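As a starting point for such a loop, here is a minimal sketch of a `train_one_epoch()` built with a TensorFlow gradient tape; the `model`, `dataset`, `optimizer` and `loss_fn` names are placeholders, not objects from this repository:

```python
import tensorflow as tf

def train_one_epoch(model, dataset, optimizer, loss_fn):
    """Run one epoch of training with a custom gradient-tape loop."""
    epoch_loss = tf.keras.metrics.Mean()
    for images, masks in dataset:
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            loss = loss_fn(masks, logits)
        # backpropagate and update the trainable weights
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss.update_state(loss)
    return epoch_loss.result()
```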
To be continued: a validation script and a prediction script for model output.
Here, a re-drawn architecture diagram replaces the one from the original paper, which might help to gain a better understanding.
To conclude and compare:
- In the encoder, an input image is scaled down to $\frac{1}{32}$ of its original size and then upsampled to $\frac{1}{4}$ of the original size in the decoder. However, the model given in this repository upsamples to the full size in an attempt to get a better result; this can be revised after cloning.
- In the original figure, the OverlapPatchEmbedding layer is only shown at the beginning of the architecture, which can be misleading; in fact, an OverlapPatchEmbedding layer always follows the previous transformer block (shown as SegFormer Block in the figure). Indeed, the paper uses the plural term "OverlapPatchEmbeddings", which implies that there is more than one such layer.
- There is an OverlapPatchMerging layer at the end of each transformer block; this layer reshapes the vector groups back into feature maps. It is easy to confuse these two layers, as many blogs express a "no merging after the block" opinion.
- The feature map $C_1$ goes through the MLP Layer without upsampling. The others are upsampled by $\times 2$, $\times 4$ and $\times 8$ respectively with bilinear interpolation, as sketched after this list.
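A minimal sketch of this decoder upsampling step, assuming `c1`-`c4` are the per-stage feature maps after the MLP projection (hypothetical names, not from the repository code):

```python
from tensorflow.keras.layers import UpSampling2D, Concatenate

# c1 keeps its resolution; c2, c3, c4 are upsampled by x2, x4, x8 respectively
c2 = UpSampling2D(size=2, interpolation='bilinear')(c2)
c3 = UpSampling2D(size=4, interpolation='bilinear')(c3)
c4 = UpSampling2D(size=8, interpolation='bilinear')(c4)
# all four maps now share c1's spatial size and can be fused channel-wise
fused = Concatenate(axis=-1)([c1, c2, c3, c4])
```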
In a basic transformer block, an image is split into patches to form a "sequence", and there is no information interaction between patches (strides = patch_size). In SegFormer, the patch size is larger than the stride, which leads to information sharing between patches (adjacent convolution windows overlap), hence the name "overlapped" patches. At the end, the embedding is followed by a layer normalization.
```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, LayerNormalization

# overlapped patch embedding: a strided convolution with kernel_size > strides,
# then flatten the feature map into a sequence of patch embeddings
x = Conv2D(n_filters, kernel_size=kernel_size, strides=strides, padding='same')(inputs)
batches, height, width, embed_dim = x.shape
x = tf.reshape(x, shape=[-1, height * width, embed_dim])
x = LayerNormalization()(x)
```
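As a concrete example from the original paper, the first embedding uses kernel_size=7 with strides=4, and the subsequent ones use kernel_size=3 with strides=2, so neighboring patches always overlap.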
Below is a diagram that shows the detailed architecture of a SegFormer Block module. A sequence goes through Efficient Self-Attention and Mix-FeedForward Network layers, each preceded by a Layer Normalization.
In the paper, the authors proposed Efficient Self-Attention to reduce the time complexity from $O(N^2)$ to $O(\frac{N^2}{R})$, where $R$ is the reduction ratio (corresponding to $sr^2$ in the code below).
- Like a normal Self-Attention module, each vector of an input sequence produces a $query$, $key$ and $value$ (only one vector is shown in the figure).
- Differently, the $key$ and $value$ matrices go through a reduction layer before participating in the attention transformations. The layer can be implemented by a `Conv2D` that plays the role of downsampling (strides $=$ kernel_size), followed by a Layer Normalization. The `Reshape` layers reconstruct and de-construct the feature maps respectively; a sketch of how the reduced sequence is used follows the code below.
- Shape changes in the reduction layer: $[num_{patches}, dim_{embed}]$ -> $[height, width, dim_{embed}]$ -> $[\frac{height}{sr}, \frac{width}{sr}, dim_{embed}]$ -> $[\frac{height \times width}{sr^2}, dim_{embed}]$.
reduction layer:
```python
batches, n_patches, channels = inputs.shape
if sr_ratio > 1:
    # reshape the sequence back to a feature map (height, width from context)
    inputs = tf.reshape(inputs, shape=[batches, height, width, embed_dim])
    # spatial reduction: a strided conv with strides = kernel_size = sr_ratio
    inputs = Conv2D(embed_dim, kernel_size=sr_ratio, strides=sr_ratio, padding='same')(inputs)
    inputs = LayerNormalization()(inputs)
    # flatten back to a (shorter) sequence
    inputs = tf.reshape(inputs, shape=[batches, (height * width) // (sr_ratio ** 2), embed_dim])
```
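To see why this reduces the attention cost, here is a minimal single-head sketch of how the reduced sequence might be used; `full_sequence`, `reduced_sequence` and `embed_dim` are assumed names, not identifiers from the repository:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense

# queries come from the full sequence, keys/values from the reduced one
q = Dense(embed_dim)(full_sequence)      # [batches, n_patches, embed_dim]
k = Dense(embed_dim)(reduced_sequence)   # [batches, n_patches // sr_ratio**2, embed_dim]
v = Dense(embed_dim)(reduced_sequence)

# attention matrix is [n_patches, n_patches // sr_ratio**2]
# instead of [n_patches, n_patches]
scale = tf.cast(embed_dim, tf.float32) ** -0.5
attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True) * scale, axis=-1)
out = tf.matmul(attn, v)                 # [batches, n_patches, embed_dim]
```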
The Conditional Positional Encoding method addresses the loss of accuracy caused by varying input resolutions in Vision Transformers. In this paper, the authors point out that positional encoding (PE) is not necessary for segmentation tasks; thus there is only a Conv in the Mix-FFN.
- In the code, the layer `DWConv` (depthwise convolution) was adopted rather than the Conv $3 \times 3$ described in the paper, which can be misleading.
- The `Reshape` layers have the same purpose as those in the reduction layer of Efficient Self-Attention.
- Shape changes in the Mix-FFN layer: $[num_{patches}, dim_{embed}]$ -> $[num_{patches}, dim_{embed} \cdot rate_{exp}]$ -> $[height, width, dim_{embed} \cdot rate_{exp}]$ -> $[num_{patches}, dim_{embed} \cdot rate_{exp}]$ -> $[num_{patches}, dim_{embed}]$.
```python
batches, n_patches, channels = inputs.shape
# expand the embedding dimension
x = Dense(int(embed_dim * expansion_rate), use_bias=True)(inputs)
# reshape to a feature map so the depthwise conv can mix spatial information
x = tf.reshape(x, shape=[batches, height, width, int(embed_dim * expansion_rate)])
x = DepthwiseConv2D(kernel_size=3, strides=1, padding='same')(x)
# flatten back to a sequence
x = tf.reshape(x, shape=[batches, n_patches, int(embed_dim * expansion_rate)])
x = Activation('gelu')(x)
# project back to the original embedding dimension
x = Dense(embed_dim, use_bias=True)(x)
x = Dropout(rate=drop_rate)(x)
```
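Note that in the full SegFormer block, both the Efficient Self-Attention output and the Mix-FFN output are added back to their inputs through residual connections, as is standard in transformer blocks.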
This is a simple reshape operation that reconstructs sequences (patches) back into feature maps. Note that this layer is also preceded by a Layer Normalization.
```python
x = LayerNormalization()(x)
feature_Cx = tf.reshape(x, shape=[batches, height_Cx, width_Cx, embed_dims[index]])
```
where `embed_dims` is a list that stores the embedding dimension of each SegFormer block and `index` selects the current one.
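For reference, the per-stage embedding dimensions of the smallest backbone in the original paper (MiT-B0) are 32, 64, 160 and 256, i.e. `embed_dims = [32, 64, 160, 256]`.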
Xie, E. et al. (2021) 'SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers', NeurIPS 2021. doi:10.48550/arXiv.2105.15203
Chu, X. et al. (2021) 'Conditional Positional Encodings for Vision Transformers', ICLR 2023, pp. 1-19. doi:10.48550/arXiv.2102.10882
Zhou, B.et al. (2017). 'Scene Parsing through ADE20K Dataset',Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2017.544