
Prototypical Transformer for Weakly Supervised Action Segmentation

  • Conference paper
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14430)

Abstract

Weakly supervised action segmentation aims to recognize the sequential actions in a video, with only action orderings as supervision for model training. Existing methods either predict the action labels to construct discriminative losses or segment the video based on action prototypes. In this paper, we propose a novel Prototypical Transformer (ProtoTR) to address the limitations of existing methods. The motivation behind ProtoTR is to further enhance the prototype-based approach with more discriminative power for superior segmentation results. Specifically, the Prediction Decoder of ProtoTR translates the visual input into an action ordering, while its Video Encoder segments the video with action prototypes. As a unified model, both the encoder and decoder are jointly optimized on the same set of action prototypes. The effectiveness of the proposed method is demonstrated by its state-of-the-art performance on different benchmark datasets.
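The prototype-based segmentation idea described above, labeling each frame by its closest action prototype in feature space, can be illustrated with a minimal sketch. The function name, the toy 2-D features, and the use of plain Euclidean distance are illustrative assumptions only; the paper's actual model is a jointly optimized Transformer encoder and decoder, which is not reproduced here.

```python
import math

def nearest_prototype_segmentation(frames, prototypes):
    """Assign each frame feature to the index of its nearest action
    prototype (Euclidean distance). A simplified stand-in for the
    prototype-based segmentation step, not the ProtoTR model itself."""
    labels = []
    for f in frames:
        dists = [math.dist(f, p) for p in prototypes]
        labels.append(dists.index(min(dists)))
    return labels

# Toy 2-D frame features clustered around two "action" prototypes.
frames = [(0.1, 0.2), (0.0, -0.1), (5.1, 4.9), (4.8, 5.2)]
prototypes = [(0.0, 0.0), (5.0, 5.0)]
print(nearest_prototype_segmentation(frames, prototypes))  # [0, 0, 1, 1]
```

In this framing, improving the prototypes (as ProtoTR does jointly with its encoder and decoder) directly improves the per-frame assignments.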



Acknowledgement

This work was supported by the National Science Foundation for Young Scientists of China (62106289).

Author information

Authors and Affiliations

  1. School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China

    Tao Lin & Wei Sun

  2. School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China

    Xiaobin Chang

  3. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China

    Weishi Zheng

  4. Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Beijing, China

    Tao Lin, Xiaobin Chang & Weishi Zheng

  5. Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, People’s Republic of China

    Xiaobin Chang

Authors
  1. Tao Lin
  2. Xiaobin Chang
  3. Wei Sun
  4. Weishi Zheng

Corresponding author

Correspondence to Xiaobin Chang.

Editor information

Editors and Affiliations

  1. Nanjing University of Information Science and Technology, Nanjing, China

    Qingshan Liu

  2. Xiamen University, Xiamen, China

    Hanzi Wang

  3. Beijing University of Posts and Telecommunications, Beijing, China

    Zhanyu Ma

  4. Sun Yat-sen University, Guangzhou, China

    Weishi Zheng

  5. Peking University, Beijing, China

    Hongbin Zha

  6. Chinese Academy of Sciences, Beijing, China

    Xilin Chen

  7. Chinese Academy of Sciences, Beijing, China

    Liang Wang

  8. Xiamen University, Xiamen, China

    Rongrong Ji

Rights and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Lin, T., Chang, X., Sun, W., Zheng, W. (2024). Prototypical Transformer for Weakly Supervised Action Segmentation. In: Liu, Q., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14430. Springer, Singapore. https://doi.org/10.1007/978-981-99-8537-1_16
