Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13834)
Included in the following conference series: MultiMedia Modeling (MMM)
Abstract
Dynamic-static fusion features play an important role in speech emotion recognition (SER). However, dynamic and static features are usually fused by simple addition or serial concatenation, which can lose part of the underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) built on a cross attentional feature fusion mechanism (Cross AFF) to extract superior deep dynamic-static fusion features. Specifically, Cross AFF fuses, in parallel, the deep features produced by a CNN/LSTM feature extraction module, which derives deep static and deep dynamic features from acoustic features (MFCC, Delta, and Delta-delta). In addition to the SD-CAFF framework, we employ multi-task learning during training to further improve the accuracy of emotion recognition. Experimental results on IEMOCAP show that SD-CAFF achieves a WA of 75.78% and a UA of 74.89%, outperforming current state-of-the-art methods. Furthermore, SD-CAFF achieves competitive cross-corpus performance on MSP-IMPROV (WA: 56.77%; UA: 56.30%).
Supported by National Natural Science Foundation (NNSF) of China (Grant 61867005).
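The acoustic front end named in the abstract (MFCC plus first- and second-order deltas) can be assembled as a three-channel input. The snippet below is a minimal sketch assuming librosa; the file name, sampling rate, and n_mfcc=40 are illustrative choices, not values taken from the paper.

```python
# Hedged sketch: building the 3-channel (MFCC, Delta, Delta-delta) input with librosa.
# "utterance.wav", sr=16000 and n_mfcc=40 are illustrative, not the paper's settings.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # static features, (40, frames)
delta = librosa.feature.delta(mfcc)                   # first-order dynamics
delta2 = librosa.feature.delta(mfcc, order=2)         # second-order dynamics
features = np.stack([mfcc, delta, delta2])            # (3, 40, frames)
```

Likewise, the PyTorch sketch below only illustrates the overall idea described in the abstract: a CNN branch for deep static features, an LSTM branch for deep dynamic features, a cross attentional fusion in which each branch attends to the other, and a multi-task head. It is not the authors' implementation; the module names (CrossAttentionalFusion, SDCAFFSketch), layer sizes, the sigmoid-gating form of the attention, and the auxiliary task are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of dynamic-static cross attentional fusion.
# Input: a (batch, 3, n_mfcc, frames) tensor stacking MFCC, Delta and Delta-delta.
import torch
import torch.nn as nn


class CrossAttentionalFusion(nn.Module):
    """Fuse two same-sized feature vectors with mutual (cross) attention gates."""

    def __init__(self, dim: int):
        super().__init__()
        # Each branch produces an attention gate from the *other* branch's features.
        self.gate_static = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_dynamic = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, static_feat, dynamic_feat):
        # Attend to static features with weights derived from dynamic ones, and vice versa.
        attended_static = static_feat * self.gate_dynamic(dynamic_feat)
        attended_dynamic = dynamic_feat * self.gate_static(static_feat)
        return attended_static + attended_dynamic


class SDCAFFSketch(nn.Module):
    """CNN branch (static) + LSTM branch (dynamic) + cross attentional fusion + multi-task heads."""

    def __init__(self, n_mfcc=40, hidden=128, n_emotions=4, n_aux=2):
        super().__init__()
        # Static branch: 2-D CNN over the 3-channel (MFCC/Delta/Delta-delta) map.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.static_proj = nn.Linear(32 * 4 * 4, hidden)
        # Dynamic branch: LSTM over the frame axis of the stacked features.
        self.lstm = nn.LSTM(input_size=3 * n_mfcc, hidden_size=hidden, batch_first=True)
        self.fusion = CrossAttentionalFusion(hidden)
        # Multi-task heads: emotion plus an assumed auxiliary task (e.g. speaker gender).
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.aux_head = nn.Linear(hidden, n_aux)

    def forward(self, x):
        # x: (batch, 3, n_mfcc, frames)
        static_feat = self.static_proj(self.cnn(x).flatten(1))
        seq = x.permute(0, 3, 1, 2).flatten(2)     # (batch, frames, 3 * n_mfcc)
        _, (h_n, _) = self.lstm(seq)
        dynamic_feat = h_n[-1]                      # last hidden state as dynamic summary
        fused = self.fusion(static_feat, dynamic_feat)
        return self.emotion_head(fused), self.aux_head(fused)


if __name__ == "__main__":
    model = SDCAFFSketch()
    dummy = torch.randn(2, 3, 40, 300)              # two utterances, 300 frames each
    emotion_logits, aux_logits = model(dummy)
    print(emotion_logits.shape, aux_logits.shape)   # torch.Size([2, 4]) torch.Size([2, 2])
```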
References
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008). https://doi.org/10.1007/s10579-008-9076-6
Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab, M., Sadoughi, N., Provost, E.M.: MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 67–80 (2016). https://doi.org/10.1109/TAFFC.2016.2515617
Cao, Q., Hou, M., Chen, B., Zhang, Z., Lu, G.: Hierarchical network based on the fusion of static and dynamic features for speech emotion recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6334–6338. IEEE (2021). https://doi.org/10.1109/icassp39728.2021.9414540
Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., Liu, Z.: Dynamic ReLU. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 351–367. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7_21
Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3560–3569 (2021). https://doi.org/10.1109/WACV48630.2021.00360
Huilian, L., Weiping, H., Yan, W.: Speech emotion recognition based on BLSTM and CNN feature fusion. In: Proceedings of the 2020 4th International Conference on Digital Signal Processing, pp. 169–172 (2020). https://doi.org/10.1145/3408127.3408192
Lambrecht, L., Kreifelts, B., Wildgruber, D.: Gender differences in emotion recognition: impact of sensory modality and emotional category. Cogn. Emot. 28(3), 452–469 (2014). https://doi.org/10.1080/02699931.2013.837378
Latif, S., Rana, R., Khalifa, S., Jurdak, R., Epps, J., Schuller, B.W.: Multi-task semi-supervised adversarial autoencoding for speech emotion recognition. IEEE Trans. Affect. Comput. 13(2), 992–1004 (2020). https://doi.org/10.1109/taffc.2020.2983669
Li, Y., Baidoo, C., Cai, T., Kusi, G.A.: Speech emotion recognition using 1D CNN with no attention. In: 2019 23rd International Computer Science and Engineering Conference (ICSEC), pp. 351–356. IEEE (2019). https://doi.org/10.1109/ICSEC47112.2019.8974716
Liu, J., Liu, Z., Wang, L., Guo, L., Dang, J.: Speech emotion recognition with local-global aware deep representation learning. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7174–7178. IEEE (2020). https://doi.org/10.1109/icassp40776.2020.9053192
Liu, L.Y., Liu, W.Z., Zhou, J., Deng, H.Y., Feng, L.: ATDA: attentional temporal dynamic activation for speech emotion recognition. Knowl.-Based Syst. 243, 108472 (2022). https://doi.org/10.1016/j.knosys.2022.108472
Nediyanchath, A., Paramasivam, P., Yenigalla, P.: Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7179–7183. IEEE (2020). https://doi.org/10.1109/icassp40776.2020.9054073
Shirian, A., Guha, T.: Compact graph architecture for speech emotion recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6284–6288. IEEE (2021). https://doi.org/10.1109/icassp39728.2021.9413876
Su, B.H., Chang, C.M., Lin, Y.S., Lee, C.C.: Improving speech emotion recognition using graph attentive bi-directional gated recurrent unit network. In: INTERSPEECH, pp. 506–510 (2020). https://doi.org/10.21437/interspeech.2020-1733
Sun, B., Wei, Q., Li, L., Xu, Q., He, J., Yu, L.: LSTM for dynamic emotion and group emotion recognition in the wild. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 451–457 (2016). https://doi.org/10.1145/2993148.2997640
Sun, S.: A survey of multi-view machine learning. Neural Comput. Appl. 23(7), 2031–2038 (2013). https://doi.org/10.1007/s00521-013-1362-6
Ullah, A., Muhammad, K., Del Ser, J., Baik, S.W., de Albuquerque, V.H.C.: Activity recognition using temporal optical flow convolutional features and multilayer LSTM. IEEE Trans. Industr. Electron. 66(12), 9692–9702 (2018). https://doi.org/10.1109/TIE.2018.2881943
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017). https://doi.org/10.5555/3295222.3295349
Yang, J., Yang, J.Y., Zhang, D., Lu, J.F.: Feature fusion: parallel strategy vs. serial strategy. Pattern Recogn. 36(6), 1369–1381 (2003). https://doi.org/10.1016/S0031-3203(02)00262-5
Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 112–118. IEEE (2018). https://doi.org/10.1109/SLT.2018.8639583
Author information
Authors and Affiliations
Hefei University of Technology, Hefei, China
Ke Dong & Jie Che
Dalian University of Technology, Dalian, China
Hao Peng
Newcastle University, Newcastle, UK
Hao Peng
Corresponding author
Correspondence to Ke Dong.
Editor information
Editors and Affiliations
University of Bergen, Bergen, Norway
Duc-Tien Dang-Nguyen
Dublin City University, Dublin, Ireland
Cathal Gurrin
Radboud University Nijmegen, Nijmegen, The Netherlands
Martha Larson
Dublin City University, Dublin, Ireland
Alan F. Smeaton
University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
National Institute of Information and Communications Technology, Tokyo, Japan
Minh-Son Dao
Department of Information Science and Media Studies, University of Bergen, Bergen, Norway
Christoph Trattner
La Trobe University, Melbourne, VIC, Australia
Phoebe Chen
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dong, K., Peng, H., Che, J. (2023). Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition. In: Dang-Nguyen, D.T., et al. (eds.) MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol. 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27817-4
Online ISBN: 978-3-031-27818-1
eBook Packages: Computer Science, Computer Science (R0)