A multi-scale multi-attention network for dynamic facial expression recognition

  • Original Research Paper
  • Published in: Multimedia Systems

Abstract

Characterizing spatial information and modelling the temporal dynamics of facial images are key challenges for dynamic facial expression recognition (FER). In this paper, we propose an end-to-end multi-scale multi-attention network (MSMA-Net) for dynamic FER. In our model, spatio-temporal features are encoded at two scales: the entire face and local facial patches. For each scale, we adopt a 2D convolutional neural network (CNN) to capture frame-based spatial information and a 3D CNN to model short-term dynamics in the temporal sequence. Moreover, we propose a multi-attention mechanism that combines spatial and temporal attention models. The temporal attention is applied to the image sequence to highlight expressive frames within the whole sequence, and the spatial attention is applied at the patch level to learn salient facial features. Comprehensive experiments on publicly available datasets (Aff-Wild2, RML, and AFEW) show that the proposed MSMA-Net automatically highlights salient expressive frames, within which salient facial features are learned, yielding results that are better than or highly competitive with state-of-the-art methods.
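Although the implementation itself is not included on this page, the architecture the abstract describes can be sketched at a high level: a global stream that applies a frame-level 2D CNN followed by temporal attention over the sequence, and a local stream that applies a 3D CNN to short clips of facial patches followed by spatial attention over the patches. The minimal PyTorch sketch below (written for this summary, not taken from the paper) illustrates the two attention mechanisms and the two-scale fusion; the backbone depths, feature dimensions, and names such as ToyMSMANet are illustrative assumptions.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Weights frames of a sequence by their learned importance."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):                              # x: (batch, time, feat_dim)
        alpha = torch.softmax(self.score(x), dim=1)    # attention over frames
        return (alpha * x).sum(dim=1), alpha           # attention-weighted pooling

class SpatialPatchAttention(nn.Module):
    """Weights local facial patches by their learned saliency."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):                              # x: (batch, patches, feat_dim)
        beta = torch.softmax(self.score(x), dim=1)     # attention over patches
        return (beta * x).sum(dim=1), beta

class ToyMSMANet(nn.Module):
    """Two-scale toy model: a global face stream (per-frame 2D CNN + temporal
    attention) and a local patch stream (3D CNN on patch clips + spatial attention)."""
    def __init__(self, num_classes=7, feat_dim=128):
        super().__init__()
        # deliberately tiny 2D CNN standing in for a deep frame-level backbone
        self.cnn2d = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # deliberately tiny 3D CNN standing in for a clip-level backbone
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 16, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.temp_att = TemporalAttention(feat_dim)
        self.spat_att = SpatialPatchAttention(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, faces, patches):
        # faces:   (batch, time, 3, H, W)        whole-face frames
        # patches: (batch, patches, 3, T, h, w)  short clips of facial patches
        b, t = faces.shape[:2]
        frame_feats = self.cnn2d(faces.flatten(0, 1)).view(b, t, -1)
        global_feat, _ = self.temp_att(frame_feats)    # highlight expressive frames

        b, p = patches.shape[:2]
        patch_feats = self.cnn3d(patches.flatten(0, 1)).view(b, p, -1)
        local_feat, _ = self.spat_att(patch_feats)     # highlight salient patches

        return self.classifier(torch.cat([global_feat, local_feat], dim=1))

# usage with random tensors standing in for preprocessed video data
model = ToyMSMANet()
faces = torch.randn(2, 8, 3, 112, 112)        # 2 sequences of 8 face frames
patches = torch.randn(2, 4, 3, 8, 32, 32)     # 4 patch clips per sequence
logits = model(faces, patches)                # -> shape (2, 7)

In practice the two toy CNN stacks would be replaced by much deeper backbones, and the returned attention weights can be inspected to check which frames and patches the model treats as salient, mirroring the behaviour the abstract describes.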



Funding

This work is supported by the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the National Natural Science Foundation of China (No. 61872256), and the VUB Interdisciplinary Research Program through the EMO-App project.

Author information

Authors and Affiliations

  1. Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University (NPU), Youyi Xilu 127, Xi’an, 710072, China

    Xiaohan Xia, Le Yang & Dongmei Jiang

  2. School of Computer Science, Sichuan University, Chengdu, 610065, China

    Xiaoyong Wei

  3. Peng Cheng Laboratory, Vanke Cloud City Phase I Building 8, Xili Street, Nanshan District, Shenzhen, 518055, Guangdong, China

    Xiaoyong Wei & Dongmei Jiang

  4. Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050, Brussels, Belgium

    Hichem Sahli

  5. Interuniversity Microelectronics Centre (IMEC), Kapeldreef 75, 3001, Heverlee, Belgium

    Hichem Sahli

Authors
  1. Xiaohan Xia
  2. Le Yang
  3. Xiaoyong Wei
  4. Hichem Sahli
  5. Dongmei Jiang

Corresponding author

Correspondence to Xiaohan Xia.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xia, X., Yang, L., Wei, X. et al. A multi-scale multi-attention network for dynamic facial expression recognition. Multimedia Systems 28, 479–493 (2022). https://doi.org/10.1007/s00530-021-00849-8

