Abstract
The success of deep learning in computer vision has made video prediction a prominent research focus. Current deep learning work tends to pursue increasingly elaborate model architectures in order to improve performance metrics. Video prediction is an inherently complex task, and most previously proposed models are correspondingly complex. In this paper, we propose a novel, simple video prediction network based on a three-dimensional convolutional neural network (3D-CNN) and multiple losses, abbreviated as ML3DVP. The network is built entirely on 3D-CNN. In contrast to approaches based on Convolutional Long Short-Term Memory (ConvLSTM), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and their variants, we start from the most basic network structure to reduce complexity and thereby improve prediction speed. In addition, most existing models suffer from quality problems such as insufficient clarity. To address this, we introduce multiple losses for backpropagation: using the quality evaluation metrics Structural Similarity (SSIM) and Peak Signal-to-Noise Ratio (PSNR) as optimization objectives continuously improves prediction quality during training. Evaluations of model complexity, parameter count, and prediction results on four datasets indicate that the proposed model achieves both a simpler structure and improved performance.
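The abstract does not give the exact form of the multi-loss objective, so the following is only a minimal sketch of how a pixel error, SSIM, and PSNR might be combined into one quantity to minimize. The weights `w_mse`, `w_ssim`, `w_psnr`, the cap `psnr_cap`, and the function names are hypothetical, and `global_ssim` uses a single window over the whole frame rather than the local windows of standard SSIM. A real training setup would compute these terms on tensors in a differentiable framework; plain Python lists are used here only to keep the arithmetic visible.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; infinite for identical inputs."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)

def global_ssim(pred, target, max_val=1.0):
    """SSIM statistic over one window covering the whole frame.

    Simplification: standard SSIM averages this value over local windows.
    """
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    c1 = (0.01 * max_val) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * max_val) ** 2
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator

def multi_loss(pred, target, w_mse=1.0, w_ssim=1.0, w_psnr=0.01, psnr_cap=50.0):
    """Hypothetical weighted sum in which every term is lower-is-better."""
    ssim_term = 1.0 - global_ssim(pred, target)               # 0 when SSIM is 1
    psnr_term = psnr_cap - min(psnr(pred, target), psnr_cap)  # 0 at or above the cap
    return w_mse * mse(pred, target) + w_ssim * ssim_term + w_psnr * psnr_term

# Usage: a flattened ground-truth frame and a slightly perturbed prediction.
frame = [0.2, 0.4, 0.6, 0.8]
noisy = [0.25, 0.35, 0.65, 0.75]
loss_value = multi_loss(noisy, frame)
```

SSIM and PSNR are higher-is-better scores, so the sketch turns each into a penalty before summing. Capping PSNR keeps the term finite when the prediction matches the target exactly.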
Availability of Data and Materials
My manuscript has associated data in a data repository.
Code Availability
The custom code used in this study is available in the GitHub repository at https://github.com/okayq/ML3DVP.git.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant no. 62476126.
Funding
This work is supported by the National Natural Science Foundation of China under Grant no. 62476126.
Author information
Authors and Affiliations
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
Ziru Qin & Qun Dai
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, 211106, China
Ziru Qin & Qun Dai
Contributions
Author 1 (First Author): Conceptualization, Data Curation, Investigation, Methodology, Software, Validation, Visualization, Writing-Original Draft. Author 2 (Corresponding Author): Project Administration, Supervision, Funding Acquisition, Conceptualization, Methodology, Writing-Review & Editing.
Corresponding author
Correspondence to Qun Dai.
Ethics declarations
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qin, Z., Dai, Q. A 3D-CNN and multi-loss video prediction architecture. Appl Intell 55, 416 (2025). https://doi.org/10.1007/s10489-025-06328-1
Accepted:
Published: