Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14430)
Abstract
Weakly supervised action segmentation aims to recognize the sequential actions in a video, with only action orderings as supervision for model training. Existing methods either predict the action labels to construct discriminative losses or segment the video based on action prototypes. In this paper, we propose a novel Prototypical Transformer (ProtoTR) to alleviate the defects of existing methods. The motivation behind ProtoTR is to further enhance the prototype-based method with more discriminative power for superior segmentation results. Specifically, the Prediction Decoder of ProtoTR translates the visual input into action ordering while its Video Encoder segments the video with action prototypes. As a unified model, both the encoder and decoder are jointly optimized on the same set of action prototypes. The effectiveness of the proposed method is demonstrated by its state-of-the-art performance on different benchmark datasets.
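The abstract's core idea, segmenting a video by matching frame features against learned action prototypes, can be illustrated with a minimal sketch. Everything here is hypothetical for illustration (toy 2-D features, cosine similarity, the names `frames` and `prototypes`); the actual ProtoTR uses a Transformer encoder-decoder jointly optimized on the prototypes, which this sketch does not attempt to reproduce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def segment_by_prototypes(frames, prototypes):
    """Label each frame with the index of its most similar action prototype."""
    return [max(range(len(prototypes)),
                key=lambda k: cosine(f, prototypes[k]))
            for f in frames]

# Toy example: 2-D frame features and two action prototypes.
prototypes = [(1.0, 0.0), (0.0, 1.0)]                    # action 0, action 1
frames = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.95), (0.1, 1.0)]
print(segment_by_prototypes(frames, prototypes))  # → [0, 0, 1, 1]
```

The per-frame labels form contiguous runs, which is the segmentation; in the weakly supervised setting, the prototypes themselves would be learned so that the resulting label sequence is consistent with the given action ordering.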
Acknowledgement
This work was supported by the National Science Foundation for Young Scientists of China (62106289).
Author information
Authors and Affiliations
School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
Tao Lin & Wei Sun
School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China
Xiaobin Chang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Beijing, China
Tao Lin, Xiaobin Chang & Weishi Zheng
Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, People’s Republic of China
Xiaobin Chang
Corresponding author
Correspondence to Xiaobin Chang.
Editor information
Editors and Affiliations
Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lin, T., Chang, X., Sun, W., Zheng, W. (2024). Prototypical Transformer for Weakly Supervised Action Segmentation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14430. Springer, Singapore. https://doi.org/10.1007/978-981-99-8537-1_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8536-4
Online ISBN: 978-981-99-8537-1