- Xiaohan Xia (ORCID: 0000-0002-5192-3863)¹,
- Le Yang¹,
- Xiaoyong Wei²,³,
- Hichem Sahli⁴,⁵ &
- Dongmei Jiang¹,³
Abstract
Characterizing spatial information and modelling the temporal dynamics of facial images are key challenges for dynamic facial expression recognition (FER). In this paper, we propose an end-to-end multi-scale multi-attention network (MSMA-Net) for dynamic FER. In our model, spatio-temporal features are encoded at two scales: the entire face and local facial patches. At each scale, a 2D convolutional neural network (CNN) captures frame-level spatial information, and a 3D CNN models the short-term dynamics of the temporal sequence. Moreover, we propose a multi-attention mechanism that combines spatial and temporal attention models. Temporal attention is applied to the image sequence to highlight the expressive frames within the whole sequence, while spatial attention is applied at the patch level to learn salient facial features. Comprehensive experiments on publicly available datasets (Aff-Wild2, RML, and AFEW) show that the proposed MSMA-Net automatically highlights salient expressive frames, within which salient facial features are learned, achieving results that are better than or highly competitive with state-of-the-art methods.
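To make the two-scale, two-attention pipeline concrete, the following is a minimal PyTorch sketch of the structure described in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the layer sizes, the single branch shared across all patches, the late concatenation of face and patch features, and the linear attention scorers are all hypothetical choices.

```python
# Hypothetical sketch of MSMA-Net's structure; all module sizes and the
# fusion scheme are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalBranch(nn.Module):
    """One scale (whole face or a facial patch): a 2D CNN encodes each
    frame spatially, then a 3D CNN models short-term dynamics over frames."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn2d = nn.Sequential(                  # frame-level spatial encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cnn3d = nn.Sequential(                  # short-term temporal encoder
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),      # pool space, keep time axis
        )

    def forward(self, clip):                         # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        x = self.cnn2d(clip.flatten(0, 1))           # (B*T, 64, h, w)
        x = x.unflatten(0, (b, t)).transpose(1, 2)   # (B, 64, T, h, w)
        x = self.cnn3d(x)                            # (B, D, T, 1, 1)
        return x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, D)


class MSMANetSketch(nn.Module):
    def __init__(self, feat_dim=128, n_classes=7):
        super().__init__()
        self.face_branch = SpatioTemporalBranch(feat_dim)
        self.patch_branch = SpatioTemporalBranch(feat_dim)
        self.spatial_attn = nn.Linear(feat_dim, 1)       # scores facial patches
        self.temporal_attn = nn.Linear(2 * feat_dim, 1)  # scores frames
        self.classifier = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, face, patches):
        # face: (B, T, 3, H, W); patches: (B, P, T, 3, h, w)
        b, p = patches.shape[:2]
        f_face = self.face_branch(face)                     # (B, T, D)
        f_patch = self.patch_branch(patches.flatten(0, 1))  # (B*P, T, D)
        f_patch = f_patch.unflatten(0, (b, p))              # (B, P, T, D)

        # Spatial attention: reweight patches at each time step.
        w_s = F.softmax(self.spatial_attn(f_patch), dim=1)  # (B, P, T, 1)
        f_patch = (w_s * f_patch).sum(dim=1)                # (B, T, D)

        # Temporal attention: highlight expressive frames in the sequence.
        f = torch.cat([f_face, f_patch], dim=-1)            # (B, T, 2D)
        w_t = F.softmax(self.temporal_attn(f), dim=1)       # (B, T, 1)
        return self.classifier((w_t * f).sum(dim=1))        # (B, n_classes)
```

In this sketch, the spatial-attention softmax runs over the patch axis, so each time step reweights the facial patches, while the temporal-attention softmax runs over the frame axis, so the sequence-level representation is dominated by the expressive frames, mirroring the roles the abstract assigns to the two attention models.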
Funding
This work was supported by the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the National Natural Science Foundation of China (No. 61872256), and the VUB Interdisciplinary Research Program through the EMO-App project.
Author information
Authors and Affiliations
1. Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University (NPU), Youyi Xilu 127, Xi’an, 710072, China
   Xiaohan Xia, Le Yang & Dongmei Jiang
2. School of Computer Science, Sichuan University, Chengdu, 610065, China
   Xiaoyong Wei
3. Peng Cheng Laboratory, Vanke Cloud City Phase I Building 8, Xili Street, Nanshan District, Shenzhen, 518055, Guangdong, China
   Xiaoyong Wei & Dongmei Jiang
4. Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050, Brussels, Belgium
   Hichem Sahli
5. Interuniversity Microelectronics Centre (IMEC), Kapeldreef 75, 3001, Heverlee, Belgium
   Hichem Sahli
Corresponding author
Correspondence to Xiaohan Xia.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Xia, X., Yang, L., Wei, X. et al. A multi-scale multi-attention network for dynamic facial expression recognition. Multimedia Systems 28, 479–493 (2022). https://doi.org/10.1007/s00530-021-00849-8