

Preface

In this repository, the structure of the Segformer model is explained. In many recent blog posts and tutorials, the structure of Segformer has been misunderstood, even by experienced computer vision engineers; one likely reason is the misleading diagram of the Segformer structure in the original paper, although the model structure is shown clearly in the source code referenced in the paper. Therefore, the details of Segformer, including OverlapPatchEmbedding, Efficient Multihead Attention, Mixed-FeedForward Network, OverlapPatchMerging and the Segformer block, are elaborated here. If there is any problem, please feel free to raise a complaint, or make contact if convenient.
The code, developed with Keras/TensorFlow, has also been uploaded for reference.

This is a multi-task segmentation example of a street scene; the images were taken in the city centre of Sheffield, England.

Basics and File Description

Project:

git clone https://github.com/ACSEkevin/An-Overview-of-Segformer-and-Details-Description.git

ADEChallengeData2016/: the ADE20K dataset used for training and testing the model; please refer to: ADE20K Dataset.
models/: two styles of implementing the model: structural and class inheritance.
adedataset.py: a dataset batch generator (required by Keras).
train.py: the model training script. NOTICE: this is a basic example that uses the Keras .fit() API; for more detailed control over training, use TensorFlow to build a custom train_one_epoch() (a minimal sketch follows this list).
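
Below is a minimal sketch of what such a custom loop could look like. It assumes `model` is a built Keras model and `dataset` yields (image, mask) batches; the loss, optimizer and logging choices are illustrative, not this repository's actual training configuration.

```python
import tensorflow as tf

# Hypothetical custom training loop; loss/optimizer are example choices.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(model, images, masks):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = loss_fn(masks, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

def train_one_epoch(model, dataset):
    for step, (images, masks) in enumerate(dataset):
        loss = train_step(model, images, masks)
        if step % 100 == 0:
            tf.print("step", step, "loss", loss)
```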

To be continued: a validation script, and a prediction script for model output.

A General Overview of the Model Architecture

Here, a re-drawn architecture diagram replaces the one from the original paper, which might help to gain a better understanding.

[Figure: re-drawn Segformer architecture diagram]

To conclude and compare:

  • In the encoder, an input image is downscaled to $\frac{1}{32}$ of its original size, and then upsampled to $\frac{1}{4}$ of the original size in the decoder. However, the model given in this repository upsamples to the full size in an attempt to get a better result; this can be revised after cloning.
  • In the original figure, an OverlapPatchEmbedding layer is shown only at the beginning of the architecture, which can be misleading; in fact, every transformer block (shown as SegFormer Block in the figure) is preceded by its own OverlapPatchEmbedding layer. Indeed, the paper uses the plural term `OverlapPatchEmbeddings', which implies that there is more than one such layer.
  • There is an OverlapPatchMerging layer at the end of each transformer block; this layer reshapes the vector groups back into feature maps. It is easy to confuse these two layers, as many blog posts claim there is no merging after a block.
  • The feature map $C_1$ goes through the MLP Layer without upsampling; the others are upsampled by $\times 2$, $\times 4$ and $\times 8$ respectively with bilinear interpolation (see the decoder sketch after this list).
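
To make the fusion step concrete, below is a minimal sketch of the all-MLP decoder, assuming c1..c4 are the four encoder feature maps (with c1 at the $\frac{1}{4}$ scale); mlp_decoder, decoder_dim and num_classes are illustrative names and defaults, not this repository's API.

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv2D, Concatenate, UpSampling2D

# Hypothetical sketch of the all-MLP decoder fusion: unify channels with a
# Dense ("MLP") layer, upsample C2..C4 to C1's scale, concatenate, fuse.
def mlp_decoder(c1, c2, c3, c4, decoder_dim=256, num_classes=150):
    fused = []
    for scale, c in zip([1, 2, 4, 8], [c1, c2, c3, c4]):
        x = Dense(decoder_dim)(c)                 # per-pixel linear layer on channels
        if scale > 1:                             # C1 is kept at its own resolution
            x = UpSampling2D(size=scale, interpolation='bilinear')(x)
        fused.append(x)
    x = Concatenate(axis=-1)(fused)
    x = Conv2D(decoder_dim, kernel_size=1)(x)     # fusion layer
    return Conv2D(num_classes, kernel_size=1)(x)  # class logits (150 = ADE20K classes)
```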

A Single Stage of the Encoder

OverlapPatchEmbedding

In a basic transformer block, an image is split into patches to form a 'sequence', and there is no information interaction between patches (strides = patch_size). In Segformer, by contrast, the patch size is greater than the stride, so neighbouring patches share information (each convolution window overlaps the next); hence the patches are called 'overlapped'. The embedding is followed by a layer normalization.

```python
x = Conv2D(n_filters, kernel_size=kernel_size, strides=strides, padding='same')(inputs)
batches, height, width, embed_dim = x.shape
x = tf.reshape(x, shape=[-1, height * width, embed_dim])
x = LayerNormalization()(x)
```
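
For example, with the stage-1 setting reported in the paper (kernel_size=7, strides=4) and an illustrative embedding dimension of 64, the shapes evolve as follows:

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, LayerNormalization

# Illustrative shape walk-through for the overlapped patch embedding;
# embed_dim=64 is an example value, not a fixed model constant.
inputs = tf.random.normal([2, 512, 512, 3])
x = Conv2D(64, kernel_size=7, strides=4, padding='same')(inputs)
print(x.shape)   # (2, 128, 128, 64): 7 > 4, so neighbouring patches overlap
x = tf.reshape(x, [-1, 128 * 128, 64])
x = LayerNormalization()(x)
print(x.shape)   # (2, 16384, 64): a sequence of overlapped patch embeddings
```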

A Segformer Block

Below is a diagram that shows the detailed architecture of a Segformer Block module. A sequence goes through Efficient Self-Attention and Mix-Feedforward Network layers, each preceded by a Layer Normalization.

[Figure: detailed architecture of a Segformer block]
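
Put together, one block can be sketched in pre-norm form as below; efficient_self_attention and mix_ffn stand for the sub-layers detailed (also as sketches) in the next two subsections, the names and signatures are illustrative, and the residual connections around both sub-layers are made explicit.

```python
from tensorflow.keras.layers import LayerNormalization

# A minimal sketch of one Segformer block: LayerNorm precedes each
# sub-layer, and a residual connection wraps each sub-layer.
def segformer_block(x, height, width, embed_dim, sr_ratio):
    x = x + efficient_self_attention(LayerNormalization()(x), height, width, embed_dim, sr_ratio)
    x = x + mix_ffn(LayerNormalization()(x), height, width, embed_dim)
    return x
```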

Efficient Self-Attention

In the paper, the authors propose Efficient Self-Attention to reduce the time complexity from $O(n^2)$ to $O(\frac{n^2}{sr^2})$, where $sr$ is the spatial reduction ratio (the paper writes this as $O(\frac{N^2}{R})$, with the sequence reduction ratio $R = sr^2$). The module falls back to basic Self-Attention when $sr = 1$.

  • As in a normal Self-Attention module, each vector of an input sequence produces a $query$, a $key$ and a $value$; only one vector is shown in the figure.
  • Differently, the $key$ and $value$ matrices first go through a reduction layer and then participate in the attention transformations. The layer can be implemented with a Conv2D that performs the downsampling (strides $=$ kernel_size), followed by a Layer Normalization. The Reshape operations reconstruct and de-construct the feature maps respectively.
  • Shape changes in the reduction layer: $[num_{patches}, dim_{embed}]$ -> $[height, width, dim_{embed}]$ -> $[\frac{height}{sr}, \frac{width}{sr}, dim_{embed}]$ -> $[\frac{height \times width}{sr^2}, dim_{embed}]$.

reduction layer:

```python
batches, n_patches, channels = inputs.shape
if sr_ratio > 1:
    inputs = tf.reshape(inputs, shape=[batches, height, width, embed_dim])
    inputs = Conv2D(embed_dim, kernel_size=sr_ratio, strides=sr_ratio, padding='same')(inputs)
    inputs = LayerNormalization()(inputs)
    inputs = tf.reshape(inputs, shape=[batches, (height * width) // (sr_ratio ** 2), embed_dim])
```
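
Building on this, the whole module can be sketched as below. A single attention head is used for brevity, and efficient_self_attention with its arguments is an illustrative name rather than this repository's API; the key point is that only $key$ and $value$ are reduced, while $query$ keeps the full sequence length.

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv2D, LayerNormalization

# Single-head sketch of Efficient Self-Attention around the reduction layer.
def efficient_self_attention(inputs, height, width, embed_dim, sr_ratio):
    q = Dense(embed_dim)(inputs)                  # queries: full sequence length
    kv = inputs
    if sr_ratio > 1:                              # reduce K and V spatially only
        kv = tf.reshape(kv, [-1, height, width, embed_dim])
        kv = Conv2D(embed_dim, kernel_size=sr_ratio, strides=sr_ratio, padding='same')(kv)
        kv = LayerNormalization()(kv)
        kv = tf.reshape(kv, [-1, (height * width) // (sr_ratio ** 2), embed_dim])
    k = Dense(embed_dim)(kv)
    v = Dense(embed_dim)(kv)
    scores = tf.matmul(q, k, transpose_b=True) / embed_dim ** 0.5
    attn = tf.nn.softmax(scores, axis=-1)         # shape: [batch, n, n / sr^2]
    return Dense(embed_dim)(tf.matmul(attn, v))   # output projection
```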

Mix-Feedforward Network

The Conditional Positional Encoding method addresses the loss of accuracy caused by varying input resolutions in Vision Transformers. In this paper, the authors point out that positional encoding (PE) is not necessary for segmentation tasks; thus there is only a Conv $3 \times 3$ layer, without PE, in Mix-FFN.

  • In the code, a DWConv (depthwise convolution) layer was adopted rather than the plain Conv $3 \times 3$ described in the paper, which can be misleading.
  • The Reshape operations have the same purpose as those in the reduction layer of Efficient Self-Attention.
  • Shape changes in the Mix-FFN layer: $[num_{patches}, dim_{embed}]$ -> $[num_{patches}, dim_{embed} \cdot rate_{exp}]$ -> $[height, width, dim_{embed} \cdot rate_{exp}]$ -> $[num_{patches}, dim_{embed} \cdot rate_{exp}]$ -> $[num_{patches}, dim_{embed}]$.
```python
batches, n_patches, channels = inputs.shape
x = Dense(int(embed_dim * expansion_rate), use_bias=True)(inputs)
x = tf.reshape(x, shape=[batches, height, width, int(embed_dim * expansion_rate)])
x = DepthwiseConv2D(kernel_size=3, strides=1, padding='same')(x)
x = tf.reshape(x, shape=[batches, n_patches, int(embed_dim * expansion_rate)])
x = Activation('gelu')(x)
x = Dense(embed_dim, use_bias=True)(x)
x = Dropout(rate=drop_rate)(x)
```
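
Since the input and output shapes of Mix-FFN are identical, a residual connection can wrap the layer, as in the block sketch earlier. A self-contained, hypothetical wrapping of the snippet above (names and defaults are illustrative, not this repository's API):

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, DepthwiseConv2D, Activation, Dropout

# Hypothetical wrapper of the Mix-FFN code above for a quick shape check.
def mix_ffn(inputs, height, width, embed_dim, expansion_rate=4, drop_rate=0.1):
    x = Dense(int(embed_dim * expansion_rate), use_bias=True)(inputs)
    x = tf.reshape(x, [-1, height, width, int(embed_dim * expansion_rate)])
    x = DepthwiseConv2D(kernel_size=3, strides=1, padding='same')(x)
    x = tf.reshape(x, [-1, height * width, int(embed_dim * expansion_rate)])
    x = Activation('gelu')(x)
    x = Dense(embed_dim, use_bias=True)(x)
    return Dropout(rate=drop_rate)(x)

inputs = tf.random.normal([2, 128 * 128, 64])     # illustrative stage values
print(mix_ffn(inputs, 128, 128, 64).shape)        # (2, 16384, 64): same as input
```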

OverlapPatchMerging

This is a simple reshape operation that reconstructs the sequences (patches) back into feature maps. There is also a detail that the layer is preceded by a Layer Normalization.

```python
x = LayerNormalization()(x)
feature_Cx = tf.reshape(x, shape=[batches, height_Cx, width_Cx, embed_dims[index]])
```

where embed_dims can be a list that stores the embedding dimension of each Segformer stage, indexed by index.
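
For example, the MiT-B0 encoder configuration from the paper could be stored as plain lists (the values are quoted from the paper; the variable names are illustrative):

```python
# Per-stage settings of the MiT-B0 encoder from the SegFormer paper;
# index 0..3 corresponds to the stages that produce C1..C4.
embed_dims = [32, 64, 160, 256]   # embedding dimension per stage
sr_ratios  = [8, 4, 2, 1]         # attention spatial-reduction ratio per stage
num_blocks = [2, 2, 2, 2]         # number of Segformer blocks per stage
```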

References

  • Xie, E. et al. (2021) 'SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers', NeurIPS 2021. arXiv doi:10.48550/arXiv.2105.15203

  • Chu, X. et al. (2021) 'Conditional Positional Encodings for Vision Transformers', ICLR 2023, pp. 1-19. arXiv doi:10.48550/arXiv.2102.10882

  • Zhou, B. et al. (2017) 'Scene Parsing through ADE20K Dataset', Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2017.544
