Abstract
The language of film shots is an important component of cinematic narrative: it visually conveys story, emotion, and theme, making film a highly expressive and engaging art form. Previous methods for analyzing shot attributes have focused mainly on movement and scale, with little interpretable analysis of the resulting shot type predictions. In this study, we build a new dataset that broadens the scope of existing shot attribute analysis tasks, for example by distinguishing film composition, and introduce a new task: recognizing the key objects that determine shot attributes. Specifically, we propose a framework that uses cues from the Detection Transformer (DETR) to guide the Segment Anything Model (SAM) in mask segmentation for shot attribute classification. To handle the variable number of key objects within a shot, we develop an adaptive weight allocation strategy that improves network training and addresses the new task more effectively. In addition, we extract optical flow magnitude and angle information from each pair of frames to further improve training. Experimental results on MovieShots and our dataset show that the proposed method surpasses all prior approaches.
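To make the framework described above concrete, the sketch below shows one plausible way to prompt SAM with DETR detections: boxes from a pretrained DETR are passed to SAM as box prompts to obtain candidate key-object masks, which would then feed a shot attribute classifier. The checkpoints, file names, and detection threshold are illustrative assumptions rather than the authors' configuration, and the adaptive weight allocation and optical flow branches of the method are omitted.

```python
# Minimal sketch: DETR-guided SAM mask prompting (assumed checkpoints/paths).
import numpy as np
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection
from segment_anything import sam_model_registry, SamPredictor

# 1. Detect candidate key objects with a pretrained DETR.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detector = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("shot_frame.jpg").convert("RGB")  # hypothetical frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
target_size = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_size, threshold=0.7)[0]

# 2. Use the detected boxes as prompts for SAM to segment each object.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))

masks = []
for box in detections["boxes"]:
    mask, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    masks.append(mask[0])  # boolean mask of one candidate key object

# 3. The variable-length list of masks would then be weighted and classified
#    by the shot attribute network described in the paper (not shown here).
```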
Data Availability
Owing to copyright restrictions on the films, we are unable to release our dataset. MovieShots, however, is publicly available online.
Funding
This research received no external funding.
Author information
Authors and Affiliations
Shanghai Film Academy, Shanghai University, Shanghai, 200072, China
Fengtian Lu, Yuzhi Li & Feng Tian
Contributions
Conceptualization, T.L.; methodology, T.L., Y.L.; software, T.L., Y.L.; validation, Y.L. and T.L.; formal analysis, T.L.; writing-original draft preparation, T.L.; writing-review and editing, T.L., Y.L., F.T. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Correspondence to Feng Tian.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, F., Li, Y. & Tian, F. Exploring challenge and explainable shot type classification using SAM-guided approaches. SIViP 18, 2533–2542 (2024). https://doi.org/10.1007/s11760-023-02928-x