- Jiangning Zhang,
- Xiangtai Li,
- Yabiao Wang,
- Chengjie Wang,
- Yibo Yang,
- Yong Liu &
- Dacheng Tao
Abstract
Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA) and derives that both share a consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone built solely from the proposed EA-based Transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information, respectively. Moreover, we design a task-related head docked with the transformer backbone to complete the final information fusion more flexibly, and we improve a modulated deformable MSA to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory studies demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. For example, our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN armed with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP, respectively, with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
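To make the block structure concrete, below is a minimal PyTorch sketch of an EAT-style block with the three residual parts named in the abstract. All internals (module names, depthwise-convolution kernel sizes, the channel split between the attention and convolution paths, head count, and MLP ratio) are illustrative assumptions rather than the authors' implementation; consult the linked repository for the actual code.

```python
# Minimal sketch of an EAT-style block: three residual sub-modules for
# multi-scale, interactive, and individual information. Hyperparameters and
# module internals are illustrative assumptions, not the official design.
import torch
import torch.nn as nn


class MultiScaleRegionAggregation(nn.Module):
    """Aggregates local context at several receptive-field sizes (assumed depthwise convs)."""

    def __init__(self, dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (B, C, H, W)
        return self.proj(sum(branch(x) for branch in self.branches))


class GlobalLocalInteraction(nn.Module):
    """Splits channels into a global (self-attention) path and a local (conv) path."""

    def __init__(self, dim, num_heads=4, local_ratio=0.5):
        super().__init__()
        self.local_dim = int(dim * local_ratio)
        self.global_dim = dim - self.local_dim
        self.attn = nn.MultiheadAttention(self.global_dim, num_heads, batch_first=True)
        self.local = nn.Conv2d(self.local_dim, self.local_dim, 3, padding=1, groups=self.local_dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        xg, xl = torch.split(x, [self.global_dim, self.local_dim], dim=1)
        tokens = xg.flatten(2).transpose(1, 2)            # (B, H*W, Cg)
        xg, _ = self.attn(tokens, tokens, tokens)
        xg = xg.transpose(1, 2).reshape(b, self.global_dim, h, w)
        return self.proj(torch.cat([xg, self.local(xl)], dim=1))


class EATBlock(nn.Module):
    """EAT block sketch: MSRA -> GLI -> FFN, each wrapped in a residual connection."""

    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.msra = MultiScaleRegionAggregation(dim)
        self.gli = GlobalLocalInteraction(dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(), nn.Conv2d(dim * mlp_ratio, dim, 1)
        )

    def forward(self, x):
        x = x + self.msra(x)   # multi-scale information
        x = x + self.gli(x)    # interactive (global + local) information
        x = x + self.ffn(x)    # individual (per-location) information
        return x


if __name__ == "__main__":
    block = EATBlock(dim=64)
    print(block(torch.randn(1, 64, 14, 14)).shape)  # torch.Size([1, 64, 14, 14])
```

The point of the sketch is the residual decomposition: each sub-module refines the feature map for one kind of information, and a pyramid backbone would stack such blocks at progressively lower resolutions.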
Data Availability
All the datasets used in this paper are available online. ImageNet-1K (http://image-net.org), COCO 2017 (https://cocodataset.org), and ADE20K (http://sceneparsing.csail.mit.edu) can be downloaded from their respective official websites.
Acknowledgements
This work was supported by a grant from the National Natural Science Foundation of China (No. 62103363).
Author information
Jiangning Zhang and Xiangtai Li have contributed equally as first authors.
Authors and Affiliations
Institute of Cyber-Systems and Control, Advanced Perception on Robotics and Intelligent Learning Lab (APRIL), Zhejiang University, Hangzhou, China
Jiangning Zhang & Yong Liu
School of Artificial Intelligence, Key Laboratory of Machine Perception (MOE), Peking University, Beijing, China
Xiangtai Li & Yibo Yang
Youtu Lab, Tencent, Shanghai, China
Yabiao Wang & Chengjie Wang
School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW, 2008, Australia
Dacheng Tao
Corresponding author
Correspondence to Yong Liu.
Additional information
Communicated by Suha Kwak.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, J., Li, X., Wang, Y., et al. EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm. Int J Comput Vis 132, 3509–3536 (2024). https://doi.org/10.1007/s11263-024-02034-6