- Yanbin Hao (ORCID: 0000-0002-0695-1566),
- Diansong Zhou,
- Zhicai Wang,
- Chong-Wah Ngo &
- Meng Wang
Abstract
In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding to build pairwise token relations, leveraging small parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the positional gating unit of the image PosMLP to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be flexibly combined into three types of spatio-temporally factorized positional MLP blocks, which not only reduce model complexity but also maintain good performance. Additionally, we enrich relative positional relationships through channel grouping. Experimental results on three video-related tasks demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared with previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400, while requiring far fewer parameters and FLOPs than other models. The code is released at https://github.com/zhouds1918/PosMLP_Video.
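The abstract's core mechanism, replacing dense token-mixing weights with a small table of learnable relative position biases that gates part of the channels, can be sketched in a few lines of PyTorch. The snippet below is only an illustrative sketch of a PoSGU-style spatial gating unit over a single token axis; the class name `PositionalGatingUnit`, the channel grouping, the softmax normalization, and all shapes are assumptions made for the example and are not the released implementation (see the linked repository for the authors' code).

```python
# Hedged sketch of a positional gating unit: the N x N token-mixing weights come
# solely from a learnable relative position bias table, not a dense weight matrix.
# All names, shapes, and the softmax normalization are illustrative assumptions.
import torch
import torch.nn as nn


class PositionalGatingUnit(nn.Module):
    def __init__(self, dim, window_size, num_groups=4):
        super().__init__()
        self.window_size = window_size            # tokens along one spatial axis
        self.num_groups = num_groups              # channel groups, each with its own bias
        num_rel = 2 * window_size - 1             # number of possible relative offsets
        # small bias table per channel group instead of a dense N x N mixing matrix
        self.rel_bias = nn.Parameter(torch.zeros(num_groups, num_rel))
        # precompute the relative-offset index for every (query, key) token pair
        coords = torch.arange(window_size)
        rel_index = coords[None, :] - coords[:, None] + window_size - 1
        self.register_buffer("rel_index", rel_index)   # (N, N)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x):
        # x: (batch, N tokens, dim); split channels into a gate half and a value half
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        n = self.window_size
        # gather pairwise relation scores from the bias table: (groups, N, N)
        rel = self.rel_bias[:, self.rel_index.reshape(-1)].reshape(self.num_groups, n, n)
        # split value channels into groups and mix tokens with the positional biases
        b, _, c = v.shape
        v = v.reshape(b, n, self.num_groups, c // self.num_groups)
        v = torch.einsum("gnm,bmgc->bngc", rel.softmax(dim=-1), v).reshape(b, n, c)
        return u * v                               # positional gating


# usage: 7 tokens along one axis, 64 channels; output keeps half the channels,
# as is typical for gMLP-style units whose enclosing block projects back up
x = torch.randn(2, 7, 64)
print(PositionalGatingUnit(dim=64, window_size=7)(x).shape)  # torch.Size([2, 7, 32])
```

Under this reading, the temporal (PoTGU) and spatio-temporal (PoSTGU) variants described in the abstract would differ mainly in which token axis, or flattened combination of axes, the relative-offset index is built over.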
Acknowledgements
The work was supported by the National Natural Science Foundation of China (No. 62101524). We thank Prof. Xiangnan He for providing thoughtful comments and suggestions on the writing and structural organization of the paper.
Author information
Yanbin Hao and Diansong Zhou have contributed equally to this work.
Authors and Affiliations
School of Information Science and Technology, School of Artificial Intelligence and Data Science, University of Science and Technology of China, No. 96, JinZhai Road, Hefei, 230026, Anhui, China
Yanbin Hao, Diansong Zhou & Zhicai Wang
School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore, 178902, Singapore
Chong-Wah Ngo
School of Computer Science and Information Engineering, Hefei University of Technology, No. 485, Danxia Road, Hefei, 230601, Anhui, China
Meng Wang
Corresponding author
Correspondence to Yanbin Hao.
Additional information
Communicated by Limin Wang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hao, Y., Zhou, D., Wang, Z. et al. PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition. Int J Comput Vis, 132, 5820–5840 (2024). https://doi.org/10.1007/s11263-024-02154-z