- Yanbin Hao (ORCID: 0000-0002-0695-1566),
- Diansong Zhou,
- Zhicai Wang,
- Chong-Wah Ngo &
- Meng Wang
Abstract
In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding to build pairwise token relations, leveraging small parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the positional gating unit of the image PosMLP to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be flexibly combined into three types of spatio-temporally factorized positional MLP blocks, which not only reduce model complexity but also maintain good performance. Additionally, we enrich relative positional relationships through channel grouping. Experimental results on three video-related tasks demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared with previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400, while requiring far fewer parameters and FLOPs than other models. The code is released at https://github.com/zhouds1918/PosMLP_Video.
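The abstract's core mechanism, replacing dense token-mixing weights with a small table of learnable relative position biases that gates part of the channels, can be sketched in a few lines of PyTorch. The snippet below is only an illustrative sketch of a PoSGU-style spatial gating unit over a single token axis; the class name `PositionalGatingUnit`, the channel grouping, the softmax normalization, and all shapes are assumptions made for the example and are not the released implementation (see the linked repository for the authors' code).

```python
# Hedged sketch of a positional gating unit: the N x N token-mixing weights come
# solely from a learnable relative position bias table, not a dense weight matrix.
# All names, shapes, and the softmax normalization are illustrative assumptions.
import torch
import torch.nn as nn


class PositionalGatingUnit(nn.Module):
    def __init__(self, dim, window_size, num_groups=4):
        super().__init__()
        self.window_size = window_size            # tokens along one spatial axis
        self.num_groups = num_groups              # channel groups, each with its own bias
        num_rel = 2 * window_size - 1             # number of possible relative offsets
        # small bias table per channel group instead of a dense N x N mixing matrix
        self.rel_bias = nn.Parameter(torch.zeros(num_groups, num_rel))
        # precompute the relative-offset index for every (query, key) token pair
        coords = torch.arange(window_size)
        rel_index = coords[None, :] - coords[:, None] + window_size - 1
        self.register_buffer("rel_index", rel_index)   # (N, N)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x):
        # x: (batch, N tokens, dim); split channels into a gate half and a value half
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        n = self.window_size
        # gather pairwise relation scores from the bias table: (groups, N, N)
        rel = self.rel_bias[:, self.rel_index.reshape(-1)].reshape(self.num_groups, n, n)
        # split value channels into groups and mix tokens with the positional biases
        b, _, c = v.shape
        v = v.reshape(b, n, self.num_groups, c // self.num_groups)
        v = torch.einsum("gnm,bmgc->bngc", rel.softmax(dim=-1), v).reshape(b, n, c)
        return u * v                               # positional gating


# usage: 7 tokens along one axis, 64 channels; output keeps half the channels,
# as is typical for gMLP-style units whose enclosing block projects back up
x = torch.randn(2, 7, 64)
print(PositionalGatingUnit(dim=64, window_size=7)(x).shape)  # torch.Size([2, 7, 32])
```

Under this reading, the temporal (PoTGU) and spatio-temporal (PoSTGU) variants described in the abstract would differ mainly in which token axis, or flattened combination of axes, the relative-offset index is built over.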
Acknowledgements
The work was supported by the National Natural Science Foundation of China (No. 62101524). We thank Prof. Xiangnan He for providing thoughtful comments and suggestions on the writing and structural organization of the paper.
Author information
Yanbin Hao and Diansong Zhou have contributed equally to this work.
Authors and Affiliations
School of Information Science and Technology, School of Artificial Intelligence and Data Science, University of Science and Technology of China, No. 96, JinZhai Road, Hefei, 230026, Anhui, China
Yanbin Hao, Diansong Zhou & Zhicai Wang
School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore, 178902, Singapore
Chong-Wah Ngo
School of Computer Science and Information Engineering, Hefei University of Technology, No. 485, Danxia Road, Hefei, 230601, Anhui, China
Meng Wang
Corresponding author
Correspondence to Yanbin Hao.
Additional information
Communicated by Limin Wang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hao, Y., Zhou, D., Wang, Z. et al. PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition. Int J Comput Vis, 132, 5820–5840 (2024). https://doi.org/10.1007/s11263-024-02154-z