Abstract
The language of film shots is an important component of cinematic narrative: it visually conveys story, emotion, and theme, making film a highly expressive and engaging art form. Previous methods for analyzing shot attributes have focused mainly on movement and scale, with little interpretable analysis of the resulting shot type predictions. In this study, we build a new dataset that broadens the scope of existing shot attribute analysis tasks, for example by distinguishing film composition, and introduce a new task: recognizing the key objects that determine shot attributes. Specifically, we propose a framework that uses cues from the Detection Transformer (DETR) to guide the Segment Anything Model (SAM) in mask segmentation for shot attribute classification. To handle the variable number of key objects within a shot, we develop an adaptive weight allocation strategy that improves network training and addresses the new task more effectively. In addition, we extract optical flow magnitude and angle information from each pair of frames to further improve training. Experimental results on MovieShots and our dataset show that the proposed method surpasses all prior approaches.
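To make the framework described above concrete, the sketch below shows one plausible way to prompt SAM with DETR detections: boxes from a pretrained DETR are passed to SAM as box prompts to obtain candidate key-object masks, which would then feed a shot attribute classifier. The checkpoints, file names, and detection threshold are illustrative assumptions rather than the authors' configuration, and the adaptive weight allocation and optical flow branches of the method are omitted.

```python
# Minimal sketch: DETR-guided SAM mask prompting (assumed checkpoints/paths).
import numpy as np
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection
from segment_anything import sam_model_registry, SamPredictor

# 1. Detect candidate key objects with a pretrained DETR.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detector = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("shot_frame.jpg").convert("RGB")  # hypothetical frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
target_size = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_size, threshold=0.7)[0]

# 2. Use the detected boxes as prompts for SAM to segment each object.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))

masks = []
for box in detections["boxes"]:
    mask, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    masks.append(mask[0])  # boolean mask of one candidate key object

# 3. The variable-length list of masks would then be weighted and classified
#    by the shot attribute network described in the paper (not shown here).
```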
Data Availability
Owing to copyright restrictions on the films, we are unable to release our dataset. MovieShots, however, is publicly available online.
Funding
This research received no external funding.
Author information
Authors and Affiliations
Shanghai Film Academy, Shanghai University, Shanghai, 200072, China
Fengtian Lu, Yuzhi Li & Feng Tian
Contributions
Conceptualization, T.L.; methodology, T.L., Y.L.; software, T.L., Y.L.; validation, Y.L. and T.L.; formal analysis, T.L.; writing-original draft preparation, T.L.; writing-review and editing, T.L., Y.L., F.T. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Correspondence to Feng Tian.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, F., Li, Y. & Tian, F. Exploring challenge and explainable shot type classification using SAM-guided approaches. SIViP 18, 2533–2542 (2024). https://doi.org/10.1007/s11760-023-02928-x