PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition

  • Published in: International Journal of Computer Vision

Abstract

In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this challenge, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding to build pairwise token relations, leveraging small-sized parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the image PosMLP’s positional gating unit to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be feasibly combined into three types of spatio-temporal factorized positional MLP blocks, which not only decrease model complexity but also maintain good performance. Additionally, we enrich relative positional relationships by using channel grouping. Experimental results on three video-related tasks demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared to previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400, while requiring far fewer parameters and FLOPs than other models. The code is released at https://github.com/zhouds1918/PosMLP_Video.
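
The positional gating idea sketched in the abstract can be made concrete with a short example. The following is a minimal, illustrative sketch of a temporal positional gating unit in the spirit of PoTGU, assuming PyTorch and a token layout of (batch, frames, tokens per frame, channels); the class name, the bias-table indexing, and the channel-split gating are our own simplifications for illustration, not the authors' released code (see the linked repository for the official implementation).

# Minimal sketch (assumptions: PyTorch; input laid out as
# [batch B, frames T, tokens per frame N, channels C]) of a temporal
# positional gating unit in the spirit of PoTGU. Illustrative only.
import torch
import torch.nn as nn

class TemporalPositionalGatingUnit(nn.Module):
    """Gate half of the channels with a learnable T x T relation matrix
    built purely from relative temporal positions (no dense token mixing)."""

    def __init__(self, channels: int, num_frames: int, num_groups: int = 1):
        super().__init__()
        # channels // 2 must be divisible by num_groups.
        self.groups = num_groups
        # One small bias table per channel group: 2T - 1 relative offsets.
        self.rel_bias = nn.Parameter(torch.zeros(num_groups, 2 * num_frames - 1))
        self.norm = nn.LayerNorm(channels // 2)
        # Precompute the relative-offset index for every (i, j) frame pair.
        idx = torch.arange(num_frames)
        self.register_buffer("rel_idx", idx[:, None] - idx[None, :] + num_frames - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C); split channels into a content half and a gate half.
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        B, T, N, Ch = v.shape
        # Build the (groups, T, T) relation matrix from the bias table.
        pos = self.rel_bias[:, self.rel_idx]                    # (G, T, T)
        v = v.view(B, T, N, self.groups, Ch // self.groups)
        # Mix tokens along the temporal axis only, separately per channel group.
        v = torch.einsum("gts,bsngc->btngc", pos, v).reshape(B, T, N, Ch)
        # Element-wise gating; the output keeps C/2 channels, as in gMLP-style units.
        return u * v

A spatial variant (PoSGU) would apply the same scheme within each frame using a table indexed by 2D relative offsets, and a spatio-temporal variant (PoSTGU) would index offsets jointly over space and time. Using multiple channel groups (the num_groups dimension of the bias table and the grouped einsum above) corresponds to what the abstract calls enriching relative positional relationships by channel grouping.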


Acknowledgements

The work was supported by the National Natural Science Foundation of China (No. 62101524). We thank Prof. Xiangnan He for providing thoughtful comments and suggestions on the writing and structural organization of the paper.

Author information

Author notes
  1. Yanbin Hao and Diansong Zhou have contributed equally to this work.

Authors and Affiliations

  1. School of Information Science and Technology, School of Artificial Intelligence and Data Science, University of Science and Technology of China, No. 96, JinZhai Road, Hefei, 230026, Anhui, China

    Yanbin Hao, Diansong Zhou & Zhicai Wang

  2. School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore, 178902, Singapore

    Chong-Wah Ngo

  3. School of Computer Science and Information Engineering, Hefei University of Technology, No. 485, Danxia Road, Hefei, 230601, Anhui, China

    Meng Wang

Authors
  1. Yanbin Hao
  2. Diansong Zhou
  3. Zhicai Wang
  4. Chong-Wah Ngo
  5. Meng Wang

Corresponding author

Correspondence to Yanbin Hao.

Additional information

Communicated by Limin Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hao, Y., Zhou, D., Wang, Z. et al. PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition. Int J Comput Vis 132, 5820–5840 (2024). https://doi.org/10.1007/s11263-024-02154-z
