Exploring challenge and explainable shot type classification using SAM-guided approaches

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

The language of film shots is an important component of cinematic narrative: it visually conveys story, emotion, and theme, and makes film a highly expressive and engaging art form. Previous methods for analyzing film shot attributes have focused mainly on movement and scale, with little interpretable research into the results of shot type analysis. In this study, we build a new dataset that broadens the scope of existing shot attribute analysis tasks, such as distinguishing film composition, and we introduce a new task: recognizing the key objects that determine shot attributes. Specifically, we propose a framework that uses cues from the Detection Transformer (DETR) to guide Segment Anything (SAM) mask segmentation for shot attribute classification. To address the variable number of key objects within a shot, we develop an adaptive weight allocation strategy that improves network training and handles the new task more effectively. Additionally, we extract optical flow magnitude and angle information from each pair of frames to enhance training effectiveness. Experimental results on MovieShots and on our dataset demonstrate that the proposed method surpasses all prior approaches.
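Only the abstract of the method is available on this page, but the pipeline it outlines (DETR detections guiding SAM mask segmentation, plus optical flow magnitude and angle computed per frame pair) follows a pattern that can be sketched with public tooling. The snippet below is a minimal illustration of that pattern, not the authors' implementation: the facebook/detr-resnet-50 and vit_b SAM checkpoints, the 0.7 detection threshold, the Farneback flow estimator, and the frame paths are all illustrative assumptions.

```python
# Minimal sketch of the pipeline pattern described in the abstract:
# DETR boxes prompt SAM for per-object masks, and optical-flow
# magnitude/angle are computed for a frame pair. Checkpoints, the
# detection threshold, flow estimator, and file paths are assumptions.
import cv2
import numpy as np
import torch
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from transformers import DetrForObjectDetection, DetrImageProcessor

# 1) Detect candidate key objects with DETR.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

frame = Image.open("frame_0001.jpg").convert("RGB")  # hypothetical path
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    outputs = detr(**inputs)
target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7
)[0]

# 2) Use each DETR box (xyxy) as a prompt for SAM mask segmentation.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(frame))  # HWC uint8 RGB

masks = []
for box in detections["boxes"]:
    mask, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    masks.append(mask[0])  # boolean mask for one candidate key object

# 3) Optical-flow magnitude and angle for a pair of consecutive frames.
prev_gray = cv2.cvtColor(cv2.imread("frame_0000.jpg"), cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2GRAY)
flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
# `masks`, `magnitude`, and `angle` would then feed a shot-type classifier.
```

The abstract's adaptive weight allocation over the resulting masks depends on training details not given on this page, so it is omitted from the sketch.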


Data Availability

Owing to copyright restrictions on the films involved, we are unable to release our dataset. MovieShots, however, is publicly available from online sources.


Funding

This research received no external funding.

Author information

Authors and Affiliations

  1. Shanghai Film Academy, Shanghai University, Shanghai, 200072, China

    Fengtian Lu, Yuzhi Li & Feng Tian

Authors
  1. Fengtian Lu
  2. Yuzhi Li
  3. Feng Tian

Contributions

Conceptualization, T.L.; methodology, T.L. and Y.L.; software, T.L. and Y.L.; validation, Y.L. and T.L.; formal analysis, T.L.; writing (original draft preparation), T.L.; writing (review and editing), T.L., Y.L., and F.T. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Feng Tian.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lu, F., Li, Y. & Tian, F. Exploring challenge and explainable shot type classification using SAM-guided approaches. SIViP 18, 2533–2542 (2024). https://doi.org/10.1007/s11760-023-02928-x

