Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13834)

Included in the following conference series: MultiMedia Modeling (MMM)

  • 1952 Accesses

  • 1 Citation

Abstract

Dynamic-static fusion features play an important role in speech emotion recognition (SER). However, dynamic and static features are generally fused by simple addition or serial concatenation, which can lose some of the underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) built around a cross attentional feature fusion mechanism (Cross AFF) that extracts superior deep dynamic-static fusion features. Specifically, the Cross AFF fuses, in parallel, the deep features produced by a CNN/LSTM feature extraction module, which extracts deep static and deep dynamic features from acoustic features (MFCC, Delta, and Delta-delta). In addition to the SD-CAFF framework, we employ multi-task learning during training to further improve the accuracy of emotion recognition. Experimental results on IEMOCAP show that SD-CAFF achieves a weighted accuracy (WA) of 75.78% and an unweighted accuracy (UA) of 74.89%, outperforming current state-of-the-art methods. Furthermore, SD-CAFF achieves competitive performance (WA: 56.77%; UA: 56.30%) in cross-corpus experiments on MSP-IMPROV.
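The abstract outlines a two-branch design: a CNN branch extracts deep static features from MFCCs, an LSTM branch extracts deep dynamic features from the Delta and Delta-delta streams, and a cross attentional mechanism fuses the two branches in parallel before classification, with multi-task learning adding an auxiliary objective. The sketch below is a minimal PyTorch illustration of that kind of architecture, not the authors' implementation: the module names (CrossAttentionFusion, DynamicStaticFusionSER), the feature dimensions, the use of nn.MultiheadAttention for the cross attention, and the auxiliary head (for example gender recognition, as in reference 12) are all illustrative assumptions.

# Minimal sketch (assumptions, not the paper's code) of dynamic-static
# cross attentional feature fusion for SER.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Parallel cross attention: each branch queries the other, then the
    two attended summaries are concatenated and projected."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.static_queries_dynamic = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dynamic_queries_static = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, static_feat, dynamic_feat):
        # both inputs: (batch, frames, dim)
        s, _ = self.static_queries_dynamic(static_feat, dynamic_feat, dynamic_feat)
        d, _ = self.dynamic_queries_static(dynamic_feat, static_feat, static_feat)
        fused = torch.cat([s.mean(dim=1), d.mean(dim=1)], dim=-1)  # pool over frames
        return self.proj(fused)  # (batch, dim)

class DynamicStaticFusionSER(nn.Module):
    """CNN branch over MFCCs (static), LSTM branch over Delta/Delta-delta
    (dynamic), cross attentional fusion, and two output heads to mimic
    multi-task training (emotion plus an assumed auxiliary task)."""
    def __init__(self, n_mfcc=40, dim=128, n_emotions=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(2 * n_mfcc, dim // 2, batch_first=True, bidirectional=True)
        self.fusion = CrossAttentionFusion(dim)
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.aux_head = nn.Linear(dim, 2)  # e.g. binary gender as auxiliary task

    def forward(self, mfcc, deltas):
        # mfcc: (batch, n_mfcc, frames); deltas: (batch, 2*n_mfcc, frames)
        static_feat = self.cnn(mfcc).transpose(1, 2)          # (batch, frames, dim)
        dynamic_feat, _ = self.lstm(deltas.transpose(1, 2))   # (batch, frames, dim)
        fused = self.fusion(static_feat, dynamic_feat)
        return self.emotion_head(fused), self.aux_head(fused)

if __name__ == "__main__":
    model = DynamicStaticFusionSER()
    mfcc = torch.randn(2, 40, 300)    # 40 MFCC coefficients over 300 frames
    deltas = torch.randn(2, 80, 300)  # Delta and Delta-delta stacked
    emotion_logits, aux_logits = model(mfcc, deltas)
    print(emotion_logits.shape, aux_logits.shape)  # torch.Size([2, 4]) torch.Size([2, 2])

In a multi-task setup of this kind, the emotion and auxiliary logits would each receive their own cross-entropy loss, summed (possibly with a weighting factor) into a single training objective; the exact loss composition used in the paper is not specified in this excerpt.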

Supported by National Natural Science Foundation (NNSF) of China (Grant 61867005).


References

  1. Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008). https://doi.org/10.1007/s10579-008-9076-6

  2. Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab, M., Sadoughi, N., Provost, E.M.: MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 67–80 (2016). https://doi.org/10.1109/TAFFC.2016.2515617

  3. Cao, Q., Hou, M., Chen, B., Zhang, Z., Lu, G.: Hierarchical network based on the fusion of static and dynamic features for speech emotion recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6334–6338. IEEE (2021). https://doi.org/10.1109/icassp39728.2021.9414540

  4. Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., Liu, Z.: Dynamic ReLU. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 351–367. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7_21

  5. Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3560–3569 (2021). https://doi.org/10.1109/WACV48630.2021.00360

  6. Huilian, L., Weiping, H., Yan, W.: Speech emotion recognition based on BLSTM and CNN feature fusion. In: Proceedings of the 2020 4th International Conference on Digital Signal Processing, pp. 169–172 (2020). https://doi.org/10.1145/3408127.3408192

  7. Lambrecht, L., Kreifelts, B., Wildgruber, D.: Gender differences in emotion recognition: impact of sensory modality and emotional category. Cogn. Emot. 28(3), 452–469 (2014). https://doi.org/10.1080/02699931.2013.837378

  8. Latif, S., Rana, R., Khalifa, S., Jurdak, R., Epps, J., Schuller, B.W.: Multi-task semi-supervised adversarial autoencoding for speech emotion recognition. IEEE Trans. Affect. Comput. 13(2), 992–1004 (2020). https://doi.org/10.1109/taffc.2020.2983669

  9. Li, Y., Baidoo, C., Cai, T., Kusi, G.A.: Speech emotion recognition using 1D CNN with no attention. In: 2019 23rd International Computer Science and Engineering Conference (ICSEC), pp. 351–356. IEEE (2019). https://doi.org/10.1109/ICSEC47112.2019.8974716

  10. Liu, J., Liu, Z., Wang, L., Guo, L., Dang, J.: Speech emotion recognition with local-global aware deep representation learning. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7174–7178. IEEE (2020). https://doi.org/10.1109/icassp40776.2020.9053192

  11. Liu, L.Y., Liu, W.Z., Zhou, J., Deng, H.Y., Feng, L.: ATDA: attentional temporal dynamic activation for speech emotion recognition. Knowl.-Based Syst. 243, 108472 (2022). https://doi.org/10.1016/j.knosys.2022.108472

  12. Nediyanchath, A., Paramasivam, P., Yenigalla, P.: Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7179–7183. IEEE (2020). https://doi.org/10.1109/icassp40776.2020.9054073

  13. Shirian, A., Guha, T.: Compact graph architecture for speech emotion recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6284–6288. IEEE (2021). https://doi.org/10.1109/icassp39728.2021.9413876

  14. Su, B.H., Chang, C.M., Lin, Y.S., Lee, C.C.: Improving speech emotion recognition using graph attentive bi-directional gated recurrent unit network. In: INTERSPEECH, pp. 506–510 (2020). https://doi.org/10.21437/interspeech.2020-1733

  15. Sun, B., Wei, Q., Li, L., Xu, Q., He, J., Yu, L.: LSTM for dynamic emotion and group emotion recognition in the wild. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 451–457 (2016). https://doi.org/10.1145/2993148.2997640

  16. Sun, S.: A survey of multi-view machine learning. Neural Comput. Appl. 23(7), 2031–2038 (2013). https://doi.org/10.1007/s00521-013-1362-6

  17. Ullah, A., Muhammad, K., Del Ser, J., Baik, S.W., de Albuquerque, V.H.C.: Activity recognition using temporal optical flow convolutional features and multilayer LSTM. IEEE Trans. Industr. Electron. 66(12), 9692–9702 (2018). https://doi.org/10.1109/TIE.2018.2881943

  18. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017). https://doi.org/10.5555/3295222.3295349

  19. Yang, J., Yang, J.Y., Zhang, D., Lu, J.F.: Feature fusion: parallel strategy vs. serial strategy. Pattern Recogn. 36(6), 1369–1381 (2003). https://doi.org/10.1016/S0031-3203(02)00262-5

  20. Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 112–118. IEEE (2018). https://doi.org/10.1109/SLT.2018.8639583


Author information

Authors and Affiliations

  1. Hefei University of Technology, Hefei, China

    Ke Dong & Jie Che

  2. Dalian University of Technology, Dalian, China

    Hao Peng

  3. Newcastle University, Newcastle, UK

    Hao Peng

Corresponding author

Correspondence to Ke Dong.

Editor information

Editors and Affiliations

  1. University of Bergen, Bergen, Norway

    Duc-Tien Dang-Nguyen

  2. Dublin City University, Dublin, Ireland

    Cathal Gurrin

  3. Radboud University Nijmegen, Nijmegen, The Netherlands

    Martha Larson

  4. Dublin City University, Dublin, Ireland

    Alan F. Smeaton

  5. University of Amsterdam, Amsterdam, The Netherlands

    Stevan Rudinac

  6. National Institute of Information and Communications Technology, Tokyo, Japan

    Minh-Son Dao

  7. Department of Information Science and Media Studies, University of Bergen, Bergen, Norway

    Christoph Trattner

  8. La Trobe University, Melbourne, VIC, Australia

    Phoebe Chen


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dong, K., Peng, H., Che, J. (2023). Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition. In: Dang-Nguyen, D.-T., et al. (eds.) MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol. 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_29
