CN117765981A

Movatterモバイル変換

Info

Publication number: CN117765981A
Application number: CN202311752836.2A
Authority: CN
Inventors: 李阳; 张晓衡; 王俊杰
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2023-12-19
Filing date: 2023-12-19
Publication date: 2024-03-26

Abstract

The invention discloses a cross-modal fusion emotion recognition method and system based on voice texts. The method comprises the steps of adopting a disclosed voice text multi-modal dialogue data set with emotion marks and preprocessing, and establishing a corresponding training set and a corresponding testing set; secondly, constructing a cross-modal attention interaction fusion emotion recognition model; then, performing model training and performing test evaluation on the performance of the method; and finally, building a multi-mode emotion recognition system to verify the effectiveness of the method. The advantages of the invention include: the multi-modal emotion recognition model fused by cross-modal attention interaction is designed, and the accuracy of emotion recognition results is improved by calculating text features strongly associated with audio and audio features strongly associated with the text; an emotion recognition system based on cross-modal fusion is built, and the emotion expression effect of man-machine natural interaction is enhanced through multi-modal co-emotion interaction. The method of the invention has the average recognition accuracy reaching 76.0% and 64.5% on the public data sets IEMOCAP and MELD, which are superior to the existing optimal method.

Description

Translated fromChinese

一种基于语音文本跨模态融合的情感识别方法及系统An emotion recognition method and system based on cross-modal fusion of speech and text

技术领域Technical field

本发明属于人工智能和情感交互技术领域，尤其是涉及一种基于语音文本跨模态融合的情感识别方法及系统。The invention belongs to the technical fields of artificial intelligence and emotional interaction, and in particular relates to an emotion recognition method and system based on cross-modal fusion of voice and text.

背景技术Background technique

在抑郁症等神经疾病日益严重的背景下，具有情感识别、互动和共情功能的情感交互系统在智能诊断和治疗方面都具有广泛的应用价值。In the context of the increasingly serious neurological diseases such as depression, emotional interaction systems with emotion recognition, interaction and empathy functions have extensive application value in intelligent diagnosis and treatment.

对于情感交互系统来说，识别并分析用户情感是重要的一环。情感通常以多种形式出现在对话中，如语音和文本，然而，现有的情绪识别系统多数仅使用单一模态的特征进行情绪识别，单模态识别系统面临情感解释不全面、分类不准确等局限。对多模态信息交互的忽略，导致现有情感识别能力尚且非常有限。其较低的情感识别准确率也无法在后续的情感交互过程中，给用户带来良好的服务体验。例如：CN202211554888.4公开了一种基于改进注意力机制的语音情感识别方法及装置，其包括对采集的音频信号样本进行数据预处理；对采集的音频信号样本提取声学特征，得到频谱特征图；构建卷积神经网络结合双向门控循环单元网络的CNN-BGRU深度学习网络；构建MSK改进注意力机制模块，将通过CNN-BGRU深度学习网络得到的特征信息进行进一步处理；进而输出语音情感识别结果。但是单模态包含的情感信息有限从而限制了情感识别准确率。For emotional interaction systems, identifying and analyzing user emotions is an important part. Emotions usually appear in conversations in multiple forms, such as speech and text. However, most existing emotion recognition systems only use single-modal features for emotion recognition. Single-modal recognition systems face the problem of incomplete emotional interpretation and inaccurate classification. and other limitations. The neglect of multi-modal information interaction results in existing emotion recognition capabilities being very limited. Its low emotion recognition accuracy cannot bring users a good service experience in the subsequent emotional interaction process. For example: CN202211554888.4 discloses a speech emotion recognition method and device based on an improved attention mechanism, which includes data preprocessing of collected audio signal samples; extracting acoustic features from the collected audio signal samples to obtain a spectrum feature map; Construct a CNN-BGRU deep learning network that combines a convolutional neural network with a bidirectional gated recurrent unit network; construct an MSK improved attention mechanism module to further process the feature information obtained through the CNN-BGRU deep learning network; and then output the speech emotion recognition results . However, the emotional information contained in single modality is limited, which limits the accuracy of emotion recognition.

虽然也有一些文献提出了融合视觉和听觉的多模态情感识别方法，但受简单拼接等融合方法的限制，导致现有情感识别方法在准确率和鲁棒性等方面无法满足动态场景下人机交互、临床诊疗的使用要求，依然限制了情感交互系统的应用与制备。为解决上述问题，需要一种基于语音文本跨模态融合的实时情感识别方法，用于识别对话中蕴含的情感。Although some literature has proposed multi-modal emotion recognition methods that integrate vision and hearing, they are limited by fusion methods such as simple splicing. As a result, existing emotion recognition methods cannot meet the requirements of human-machine situations in dynamic scenarios in terms of accuracy and robustness. The requirements for interaction and clinical diagnosis and treatment still limit the application and preparation of emotional interaction systems. In order to solve the above problems, a real-time emotion recognition method based on cross-modal fusion of speech and text is needed to identify the emotions contained in conversations.

此外，情感交互系统是通过赋予计算机人类的情感，使之具有识别、理解、表达情感的能力，情感表达能力也将直接影响用户的情感交互体验。如何表达丰富的情感状态并根据用户反馈进行自适应调整，是目前情感表达的难点。现有技术中情感交互系统在与用户进行交谈时，一般是从离线或在线语料数据库或聊天数据库中搜索出与用户问题相关的合理回答，图1为现有技术中语音交互系统与用户进行交谈的示意图。诸如图1的现有的大部分情感对话生成交互只考虑利用情感标签等情感表面信息来提高生成响应的质量，却忽略了情感背后更深层次的情感意图等细粒度特征，导致对话模型生成的回复共情性不足，无法满足用户的共情需求。In addition, the emotional interaction system gives computers the ability to recognize, understand, and express emotions by giving them human emotions. The emotional expression ability will also directly affect the user's emotional interaction experience. How to express rich emotional states and make adaptive adjustments based on user feedback is currently a difficulty in emotional expression. When the emotional interaction system in the prior art talks with the user, it generally searches for reasonable answers related to the user's questions from an offline or online corpus database or chat database. Figure 1 shows a conversation between the voice interaction system and the user in the prior art. schematic diagram. Most of the existing emotional dialogue generation interactions such as Figure 1 only consider using emotional surface information such as emotion labels to improve the quality of generated responses, but ignore fine-grained features such as the deeper emotional intention behind the emotion, resulting in responses generated by the dialogue model. Insufficient empathy and unable to meet users’ empathy needs.

发明内容Contents of the invention

为解决现有基于深度学习的情感识别方法的识别准确率不高，缺乏多模态交互融合的问题，本发明提出了一种基于跨模态注意力融合的多模态情感识别方法。经过数据获取、数据处理、模型构建、方法测试和方法验证五个步骤，实现实时对话情感高准确率识别。在公开多模态对话数据集IEMOCAP和MELD数据集上对本方法进行了性能测试，加权准确率分布达到了76.0％和64.5％，均优于其他现有最优方法。此外，为解决现有情感识别系统表达僵硬，缺乏深度情感交流的问题，本发明研发了一个基于语音文本跨模态融合情感识别方法的智能情感交互系统，集成了数据采集模块、情感识别模块和交互模块，通过与5名志愿者进行情感交互对本方法进行了在线有效性验证，具备了向临床推广应用的条件。In order to solve the problems of low recognition accuracy and lack of multi-modal interaction fusion of existing deep learning-based emotion recognition methods, the present invention proposes a multi-modal emotion recognition method based on cross-modal attention fusion. After five steps of data acquisition, data processing, model construction, method testing and method verification, high-accuracy recognition of real-time dialogue emotions is achieved. The performance of this method was tested on the public multi-modal dialogue data sets IEMOCAP and MELD data sets, and the weighted accuracy distribution reached 76.0% and 64.5%, both of which are better than other existing optimal methods. In addition, in order to solve the problems of rigid expression and lack of deep emotional communication in existing emotion recognition systems, the present invention develops an intelligent emotion interaction system based on a voice-text cross-modal fusion emotion recognition method, which integrates a data collection module, an emotion recognition module and In the interactive module, the effectiveness of this method was verified online through emotional interaction with 5 volunteers, and it is qualified for clinical promotion and application.

本发明提供了一种基于跨模态交互多模态情感识别方法及系统。本发明人采用多模态公开数据集和采集数据，对方法进行了性能测试和有效性验证：先对获取的两个公开数据集IEMOCAP和MELD进行预处理，建立各自对应的训练集和测试集；其次构建多模态情感识别模型；然后将预处理好的训练集分别输入情感识别模型中进行模型训练，并将预处理好的测试集分别输入训练好的模型中进行方法的性能测试；最后搭建情感识别系统，进行方法的有效性验证。The present invention provides a multi-modal emotion recognition method and system based on cross-modal interaction. The inventor used multi-modal public data sets and collected data to conduct performance testing and validity verification of the method: first, preprocess the two public data sets IEMOCAP and MELD obtained to establish corresponding training sets and test sets. ; Second, build a multi-modal emotion recognition model; then input the pre-processed training sets into the emotion recognition model for model training, and input the pre-processed test sets into the trained models to test the performance of the method; finally Build an emotion recognition system and verify the effectiveness of the method.

根据本发明的一个实施例的一种基于跨模态融合的情感识别方法包括以下步骤：An emotion recognition method based on cross-modal fusion according to an embodiment of the present invention includes the following steps:

步骤1：获取公开的情感对话数据集IEMOCAP和MELD；Step 1: Obtain the public emotional dialogue data sets IEMOCAP and MELD;

步骤2：对获取的公开数据集进行预处理操作，并建立两个公开数据集各自对应的训练集和测试集；Step 2: Perform preprocessing operations on the obtained public data sets, and establish corresponding training sets and test sets for the two public data sets;

步骤3：构建跨模态融合的情感识别神经网络；Step 3: Construct a cross-modal fusion emotion recognition neural network;

步骤4：将步骤2中预处理后的两个训练集分别输入情感识别模型中，进行模型训练；Step 4: Input the two training sets preprocessed in step 2 into the emotion recognition model respectively for model training;

步骤5：再将步骤2中预处理的两个测试集输入训练好的识别模型中，进行方法的性能测试，并与现有最先进方法进行性能比对；Step 5: Then input the two test sets preprocessed in step 2 into the trained recognition model, conduct a performance test of the method, and compare the performance with the most advanced existing methods;

步骤6：搭建情感识别系统，利用所述系统与5名志愿者进行对话交流，通过记录人机交互过程中的识别准确率和对话流畅度，进行方法的有效性验证。Step 6: Build an emotion recognition system, use the system to conduct conversations and exchanges with 5 volunteers, and verify the effectiveness of the method by recording the recognition accuracy and conversation fluency during human-computer interaction.

其中：in:

在步骤2中，数据预处理包括对音频信号进行预加重、分帧和加窗处理；对每一帧加窗后的特征通过快速傅里叶变换得到频谱，进而转化为梅尔频谱；In step 2, data preprocessing includes pre-emphasis, framing and windowing of the audio signal; the windowed features of each frame are used to obtain the spectrum through fast Fourier transform, and then converted into a Mel spectrum;

在步骤3中，所构建的跨模态融合多模态情感识别神经网络包括：单模态双注意力特征提取模块、跨模态注意力交互融合模块、决策层融合及情感分类模块依次串行连接，其中，单模态双注意力特征提取模块用于初步提取语音文本中的时域和通道维度的情感特征；跨模态注意力交互融合模块包含多模态特征对齐和多模态特征交互，用于提取深度情感语义特征；决策层融合分类模块用于输出情感识别结果，并为生成相应的情感共情交互回复奠定基础；In step 3, the constructed cross-modal fusion multi-modal emotion recognition neural network includes: single-modal dual attention feature extraction module, cross-modal attention interactive fusion module, decision-making layer fusion and emotion classification modules in sequence Connection, in which the single-modal dual-attention feature extraction module is used to initially extract emotional features in the time domain and channel dimensions in speech text; the cross-modal attention interaction fusion module includes multi-modal feature alignment and multi-modal feature interaction , used to extract deep emotional semantic features; the decision-making layer fusion classification module is used to output emotion recognition results and lay the foundation for generating corresponding emotional empathic interactive responses;

在步骤4中，进行模型训练时，先采用早期停止策略进行模型一次训练，保存好模型训练参数后，再进行模型二次训练，采用交叉熵函数计算梯度提升决策融合分类模块的输出与标签的误差，通过误差反向传播与随机梯度下降迭代更新模型参数；进行方法的性能测试时，语音信号采样频率为220Hz，将划分后的数据输入训练好的情感识别模型中，并采用情感识别准确率、F1值来评价方法的识别性能；In step 4, when training the model, first use the early stopping strategy to train the model once, save the model training parameters, and then train the model twice. Use the cross entropy function to calculate the output of the gradient boosting decision fusion classification module and the label. Error, iteratively update the model parameters through error backpropagation and stochastic gradient descent; when performing the performance test of the method, the speech signal sampling frequency is 220Hz, the divided data is input into the trained emotion recognition model, and the emotion recognition accuracy is used , F1 value to evaluate the recognition performance of the method;

在步骤5中，方法的有效性验证流程为：搭建情感识别系统，利用所述系统先进行5名所述志愿者语音数据的采集，其次按照步骤S2的方法进行数据预处理，并建立采集数据的训练集和测试集，再将预处理后的训练集输入步骤S3构建的情感识别模型中，按照步骤S4所述方法进行情感识别模型训练；然后利用所述系统再次进行5名志愿者语音数据的采集，通过记录人机情感交互过程，进行方法的有效性验证；In step 5, the validity verification process of the method is: build an emotion recognition system, use the system to first collect the voice data of the five volunteers, and then perform data preprocessing according to the method in step S2, and establish the collection data The training set and test set, and then input the preprocessed training set into the emotion recognition model constructed in step S3, and perform emotion recognition model training according to the method described in step S4; then use the system to conduct voice data of 5 volunteers again Collection, by recording the human-computer emotional interaction process, the effectiveness of the method is verified;

所述的情感交互系统可完成语音采集、语音预处理、模型构建、模型训练、情感识别和共情交互反馈，其中，交互反馈是基于所述的情感识别的分类结果及其与对应关系，并生成相应的共情回复。The emotion interaction system can complete voice collection, voice preprocessing, model construction, model training, emotion recognition and empathic interactive feedback, where the interactive feedback is based on the classification results of the emotion recognition and their corresponding relationships, and Generate corresponding empathic responses.

本发明提出的一种基于跨模态融合的多模态情感识别方法及系统的主要优点包括：The main advantages of a multi-modal emotion recognition method and system based on cross-modal fusion proposed by this invention include:

1.本发明设计的单模态双注意力特征提取模块，基于双注意力机制的语音情感识别方法在传统时域注意力机制的基础上引入通道维度的注意力，可以大幅度减少多余的注意力得分计算来降低模型耗时，有效提高了语音情感识别的检测准确率，解决了当前用于语音情感识别的神经网络中对不同帧和不同通道关注度相同无法高效捕捉情感信息的问题。1. The single-modal dual-attention feature extraction module designed by the present invention. The speech emotion recognition method based on the dual-attention mechanism introduces channel-dimensional attention on the basis of the traditional time-domain attention mechanism, which can greatly reduce redundant attention. Power score calculation is used to reduce the time consumption of the model, effectively improve the detection accuracy of speech emotion recognition, and solve the problem that the current neural network used for speech emotion recognition pays the same attention to different frames and different channels and cannot efficiently capture emotional information.

2.本发明设计的跨模态注意力交互融合模块，通过充分利用语音和文本模态之间的互补性，计算加权特征融合中的语音权重和文本语义权重，提取高阶情感特征，最后将加权后的语音高阶情感特征和加权后的文本高阶情感特征输入到多层感知机MLP中进行加权特征融合并完成情感分类，有效地提高了多模态情感识别的准确率。在公开数据集IEMOCAP和MELD数据集上对本方法进行了有效性验证，加权准确率达到了76.0％和64.5％，均优于其他现有最优方法；2. The cross-modal attention interactive fusion module designed by the present invention makes full use of the complementarity between speech and text modalities, calculates the speech weight and text semantic weight in weighted feature fusion, extracts high-order emotional features, and finally The weighted high-order emotional features of speech and the weighted high-order emotional features of text are input into the multi-layer perceptron MLP for weighted feature fusion and emotion classification, which effectively improves the accuracy of multi-modal emotion recognition. The effectiveness of this method was verified on the public data sets IEMOCAP and MELD data sets, and the weighted accuracy reached 76.0% and 64.5%, both of which are better than other existing optimal methods;

3.本发明设计的决策层融合分类模块，通过模态间的交互实现语音文本多模态对齐和融合，解决了单一模态情感信息提取不足的问题。3. The decision-making layer fusion classification module designed by the present invention realizes multi-modal alignment and fusion of speech and text through the interaction between modalities, and solves the problem of insufficient extraction of emotional information in a single modality.

4.本发明搭建的情感识别系统，通过将数据采集模块、情感识别模块和交互模块集成，增强了人机情感交互性和丰富度，5名志愿者对本方法和本系统进行了在线有效性验证，满足人机交互系统实用性的要求。4. The emotion recognition system built by this invention enhances the interactivity and richness of human-computer emotion by integrating the data collection module, emotion recognition module and interaction module. Five volunteers conducted online validity verification of this method and system. , to meet the practical requirements of human-computer interaction systems.

附图说明Description of drawings

图1为进行比较的现有技术语音交互聊天的工作流程示意图。Figure 1 is a schematic diagram of the workflow of voice interactive chat in the prior art for comparison.

图2为根据本发明的一个实施例的一种基于跨模态融合的情感识别模型的结构图。Figure 2 is a structural diagram of an emotion recognition model based on cross-modal fusion according to an embodiment of the present invention.

图3为根据本发明的一个实施例的一种基于跨模态融合的情感识别方法及系统构成示意图。Figure 3 is a schematic diagram of an emotion recognition method and system based on cross-modal fusion according to an embodiment of the present invention.

具体实施方式Detailed ways

根据本发明的一个实施例，提出了一种基于跨模态交互的多模态情感识别方法。采集用户语音信号，构建跨模态交互融合神经网络模型，实现对用户实时的情感识别，并通过情感识别系统进行情感意图识别和共情交互。According to an embodiment of the present invention, a multi-modal emotion recognition method based on cross-modal interaction is proposed. Collect user voice signals, build a cross-modal interaction fusion neural network model, realize real-time emotion recognition of users, and perform emotional intent recognition and empathic interaction through the emotion recognition system.

以下结合附图并结合具体实施例，对本发明所提出的一种基于语音文本跨模态融合的情感识别方法进行详细说明，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The emotion recognition method based on cross-modal fusion of speech and text proposed by the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. The described embodiments are only part of the embodiments of the present invention, not all implementations. example. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of the present invention.

如图3所示，根据本发明的一个实施例的一种基于语音文本跨模态融合情感识别方法包括：As shown in Figure 3, a voice-text cross-modal fusion emotion recognition method according to one embodiment of the present invention includes:

步骤S1：获取公开数据集：Step S1: Obtain public data sets:

根据一个具体实施例，所述公开数据集是IEMOCAP(交互式情感二元捕捉数据集)，是由南加州大学的Sail实验室收集的，包含语音、视觉、文本、动作姿势模态信息。该数据集由十位演员分为五组进行录制，记录了十位说话者在双向对话中的行为。对话的媒介都是英语。参与者表演分为即兴表演或固定的脚本场景表演。总共包含了大约为12小时的音视频文件，包含10039段对话，平均持续时间为4.5秒，平均单词数为11.4。注释者对每句话进行标记，分别为中性、快乐、悲伤、愤怒、惊讶、恐惧、厌恶、挫折、兴奋等类别标签。According to a specific embodiment, the public data set is IEMOCAP (Interactive Emotional Binary Capture Dataset), which was collected by the Sail Laboratory of the University of Southern California and contains speech, vision, text, and action posture modal information. This dataset was recorded by ten actors divided into five groups, recording the behavior of ten speakers in a two-way conversation. The medium of conversation was all English. Participant performances are divided into improvisational performances or fixed scripted scene performances. In total, it contains approximately 12 hours of audio and video files, containing 10,039 dialogue segments, with an average duration of 4.5 seconds and an average number of words of 11.4. The annotator labels each sentence with category labels such as neutral, happy, sad, angry, surprised, fear, disgust, frustration, excitement, etc.

根据一个具体实施例，所述MELD(多模态情感线数据集)拥有超过1400个对话和Friends电视连续剧的13000话语，每句话被归类为七个情绪之一：愤怒，厌恶，悲伤，喜悦，中立，惊奇和恐惧，并且每种言语都有一种情感(正面，负面和中性)注释。According to a specific embodiment, the MELD (Multimodal Emotional Lines Dataset) has more than 1,400 dialogues and 13,000 utterances from the Friends TV series, with each sentence classified as one of seven emotions: anger, disgust, sadness, joy, neutrality, surprise and fear, and each utterance has an emotion (positive, negative and neutral) annotation.

步骤S2:对获取的公开数据集进行语音、文本预处理，并建立两个公开数据集各自对应的训练集和测试集，包括：Step S2: Perform speech and text preprocessing on the obtained public data set, and establish corresponding training sets and test sets for the two public data sets, including:

步骤S2.1：进行公开数据集的音频数据预处理，具体包括：Step S2.1: Preprocess the audio data of the public data set, including:

步骤S2.1.1：时频变换，包括：对音频信号依次进行预加重、分帧和加窗处理；对每一帧加窗后的特征s_i(n)通过快速傅里叶变换得到频谱；采用频谱增强技术，将随机的时频掩码应用于频谱图。Step S2.1.1: Time-frequency transformation, including: sequentially pre-emphasizing, framing and windowing the audio signal; obtaining the spectrum of the windowed features s_i (n) of each frame through fast Fourier transform; using Spectrum enhancement technology applies a random time-frequency mask to the spectrogram.

其中，S_i(k)为傅里叶变换得到的频谱。Among them, S_i (k) is the spectrum obtained by Fourier transform.

步骤S2.1.2：梅尔频谱生成，包括：将频谱输入Mel滤波器组，得到Mel频谱；对Mel频谱进行倒谱分析，得到梅尔频率倒谱系数向量。Step S2.1.2: Mel spectrum generation, including: inputting the spectrum into the Mel filter bank to obtain the Mel spectrum; performing cepstrum analysis on the Mel spectrum to obtain the Mel frequency cepstral coefficient vector.

步骤S2.1.3：动态信息提取，包括：MFCC特征向量只描述单个帧的功率谱包络线，为获取语音在动态中的信息，计算logMFB的微分和加速度系数(Δ和ΔΔ系数)。Step S2.1.3: Dynamic information extraction, including: the MFCC feature vector only describes the power spectrum envelope of a single frame. In order to obtain the dynamic information of speech, the differential and acceleration coefficients (Δ and ΔΔ coefficients) of logMFB are calculated.

步骤S2.1.4：深度特征提取，包括采用一个带权重的一维空洞卷积网络，包括三个高级特征学习块(UFLB)，由空洞卷积层、归一化层和Leaky-ReLU层组合，并且通过残差连接把当前信息与之前的信息进行整合。Step S2.1.4: Deep feature extraction, including using a weighted one-dimensional atrous convolutional network, including three advanced feature learning blocks (UFLB), which are composed of atrous convolutional layers, normalization layers and Leaky-ReLU layers, And integrate the current information with the previous information through residual connection.

步骤S2.2：进行公开数据集的文本数据预处理，具体包括：Step S2.2: Preprocess the text data of the public data set, including:

步骤S2.2.1：对于文本情感识别任务，传统词嵌入模型忽略了单词在不同的上下文环境中的影响，模型对语义理解也会存在一些偏差。基于这种情况，采用来自Transformer模型的双向编码器BERT(Bidirectional Encoder Representations fromTansformers，双向特征解码语言模型)，通过联合调节所有层的上下文信息，从未标记的文本中预训练深度双向表示。Step S2.2.1: For text emotion recognition tasks, traditional word embedding models ignore the impact of words in different contexts, and the model will also have some biases in semantic understanding. Based on this situation, the bidirectional encoder BERT (Bidirectional Encoder Representations from Tansformers, bidirectional feature decoding language model) from the Transformer model is used to pre-train deep bidirectional representations from unlabeled text by jointly adjusting the contextual information of all layers.

步骤S2.2.2：通过一个额外的输出层来对预先训练过的BERT模型进行微调，从而为广泛的任务创建最先进的模型，而不需要对实质性的任务特定的体系结构进行修改。其中，BERT模型可以根据上下文信息双向捕获词义信息，从而能够学习到更丰富、更全面的文本特征信息首先对文本根据一个具体实施例，所述采用预先训练的BERT模型并进行微调，提取768维特征向量作为文本情感特征向量。Step S2.2.2: Fine-tuning the pre-trained BERT model with an additional output layer creates state-of-the-art models for a wide range of tasks without requiring substantial task-specific architectural modifications. Among them, the BERT model can bidirectionally capture word meaning information based on contextual information, so that it can learn richer and more comprehensive text feature information. First, according to a specific embodiment, the pre-trained BERT model is used and fine-tuned to extract 768-dimensional text. The feature vector serves as the text emotion feature vector.

步骤S2.2.3：在IEMOCAP数据集上训练，采取五折交叉验证，每次用四组对话作为训练集，剩下一组作为测试集。选取公开数据集MELD中训练集文件夹中的语音文本数据文件作为训练集，选取测试集文件夹中的语音文本数据文件作为测试集；Step S2.2.3: Train on the IEMOCAP data set, adopt five-fold cross-validation, use four groups of conversations as the training set each time, and the remaining group as the test set. Select the speech text data files in the training set folder in the public data set MELD as the training set, and select the speech text data files in the test set folder as the test set;

步骤S3：使用Pytorch深度学习，构建跨模态融合多模态情感识别神经网络。使用Python中Pytorch深度学习框架函数库，建立基于跨模态交互的多模态情感识别神经网络模型，该模型包括依次串行连接的单模态双注意力特征提取模块、跨模态注意力交互融合模块、决策层融合及情感分类模块，该模型的结构如图2所示，具体包括：Step S3: Use Pytorch deep learning to build a cross-modal fusion multi-modal emotion recognition neural network. Use the Pytorch deep learning framework function library in Python to establish a multi-modal emotion recognition neural network model based on cross-modal interaction. The model includes a single-modal dual-attention feature extraction module and a cross-modal attention interaction that are serially connected in sequence. Fusion module, decision-making layer fusion and emotion classification module. The structure of this model is shown in Figure 2, specifically including:

(a)单模态双注意力特征提取模块，通过时间和通道的双重注意力捕捉时间维度和通道维度上的特征依赖来学习高质量的潜在表征；两种类型的注意力分支包括TAB(Temporal-wise Attention Branch)和CAB(Channelwise Attention Branch)用于初步提取语音文本中的时域和通道维度的情感特征，其中通过对语音特征学习进行优化建模，引入多头注意力机制，聚合BiLSTM模型中的隐藏状态信息，学习更具判别性的语音情感特征，聚焦语音的显著情感片段，包括进行如下操作：(a) Single-modal dual-attention feature extraction module learns high-quality latent representations by capturing feature dependencies in the time dimension and channel dimension through dual attention of time and channel; the two types of attention branches include TAB (Temporal -wise Attention Branch) and CAB (Channelwise Attention Branch) are used to initially extract the emotional features in the time domain and channel dimensions in the speech text. By optimizing the modeling of speech feature learning, a multi-head attention mechanism is introduced, and the BiLSTM model is aggregated hidden state information, learn more discriminative speech emotional features, and focus on the salient emotional segments of speech, including the following operations:

其中t_ij表示第i行第j列的注意力分数T为初步提取的文本特征，c_ij表示第i行第j列的通道注意力分数,T_t和T_c分别表示关注时间维度和通道维度得到的特征,T₁,T₂,T₃分别表示三个卷积层之后得到的特征向量。Where t_ij represents the attention score of the i-th row and j-th column, T is the initially extracted text feature, c_ij represents the channel attention score of the i-th row and j-th column, T_t and T_c represent the attention time dimension and channel dimension respectively. The obtained features, T₁ , T₂ , and T₃ respectively represent the feature vectors obtained after three convolutional layers.

(b)跨模态注意力交互融合模块，其包含多模态特征对齐和多模态特征交互操作部分，用于提取深度情感语义特征,为提升跨模态情感辨别能力，通过跨模态多头协同注意力计算不同的帧和词向量之间的对齐分数，其中考虑到文本信息噪声较小、不易受到干扰、语义蕴含丰富的特点，用文本特征作为Q进行引导，进而得到归一化后的注意力权重和与文本对齐后的音频特征，包括进行操作：(b) Cross-modal attention interactive fusion module, which includes multi-modal feature alignment and multi-modal feature interactive operation parts, is used to extract deep emotional semantic features. In order to improve cross-modal emotion discrimination capabilities, cross-modal multi-head Collaborative attention calculates the alignment scores between different frames and word vectors. Taking into account the characteristics of text information that is less noisy, less susceptible to interference, and rich in semantics, text features are used as Q for guidance, and then the normalized Attention weights and audio features after alignment with text, including operations:

a_i,j＝tan h(U^Ts_i+V^Tt_j+b) (5)a_i,j =tan h(U^T s_i +V^T t_j +b) (5)

其中tanh(·)为激活函数，U,V分别为可训练的全连接层权重矩阵。Among them, tanh(·) is the activation function, and U and V are the trainable fully connected layer weight matrices respectively.

并确定跨模态交互模块学习两个模态之间与情感相关的互补与交互信息，首先分别从两个已经对齐的模态特征得到sigmoid函数作用后的激发矩阵E_s和E_t然后交叉进行信息交互：And determine the cross-modal interaction module to learn the complementary and interactive information related to emotion between the two modalities. First, the excitation matrices E_s and E_t after the action of the sigmoid function are obtained from the two aligned modal features respectively, and then they are crossed. Information exchange:

其中σ(·)表示sigmoid激活函数，E_s,E_t表示分别从语音文本模态中提取的异质性特征，W_t,W_s为全连接层权重。Among them, σ(·) represents the sigmoid activation function, E_s and E_t represent the heterogeneous features extracted from the speech text modality respectively, and W_t and W_s are the fully connected layer weights.

(c)决策层融合及情感分类模块，用于输出情感识别结果；所述决策级融合是一种后期融合方法，分别对单个模态进行建模，不同的模态之间视为相互独立；包括：(c) Decision-level fusion and emotion classification modules are used to output emotion recognition results; the decision-level fusion is a late-stage fusion method that models individual modalities separately, and different modalities are considered independent of each other; include:

先对各个模态提取特征，First, extract features for each mode,

再通过情感分类器获得单模态的情感识别结果，Then obtain the single-modal emotion recognition result through the emotion classifier,

再对每个模态的识别结果采用某种决策方法进行融合，得到最终的情感分类结果，使得单模态任务尽可能的取得最佳性能。Then the recognition results of each modality are fused using a certain decision-making method to obtain the final emotion classification result, so that the single-modal task can achieve the best performance possible.

步骤S4：训练神经网络模型：Step S4: Train the neural network model:

将步骤S2中预处理后的IEMOCAP和MELD两个训练集分别输入情感识别模型中，进行模型的第一次训练；再将步骤S2中预处理后的IEMOCAP和MELD两个测试集输入训练好的模型中，进行方法的性能测试，并与现有最先进方法进行性能比对；其中：Input the two training sets of IEMOCAP and MELD preprocessed in step S2 into the emotion recognition model respectively for the first training of the model; then input the two test sets preprocessed in step S2 of IEMOCAP and MELD into the trained emotion recognition model. In the model, the performance of the method is tested and the performance is compared with the most advanced existing methods; among them:

进行情感识别模型第一次训练时，采用早期停止策略，将步骤S2中划分的训练集重新划分为训练集和验证集，When training the emotion recognition model for the first time, an early stopping strategy is used to re-divide the training set divided in step S2 into a training set and a verification set.

当情感识别模型在验证集上的识别准确率经连续N次训练迭代保持稳定后，提前停止训练过程，并保存此时的模型参数；When the recognition accuracy of the emotion recognition model on the verification set remains stable after N consecutive training iterations, the training process is stopped in advance and the model parameters at this time are saved;

然后进行第二次训练，在第一次训练停止时的情感识别模型参数的基础上，开始第二次训练。在第二次训练中，使用步骤S1中划分的训练集作为模型训练集，并沿用上述提前停止策略。Then conduct a second training, and start the second training based on the emotion recognition model parameters when the first training was stopped. In the second training, the training set divided in step S1 is used as the model training set, and the above-mentioned early stopping strategy is used.

在情感识别模型的两次训练过程中均采用交叉熵损失函数计算分类损失，计算式如下：In both training processes of the emotion recognition model, the cross-entropy loss function is used to calculate the classification loss. The calculation formula is as follows:

其中α,β,γ为权重系数，P表示三个分类器分别的预测概率，s,t分别表示语音和文本模态，st表示语音和文本多模态，表示模型预测的情感标签，y_i表示真实情感标签，C代表情感类别总数,N表示批处理的大小，S和S^c分别为原始和融合后的语音特征,T和T^c分别为原始和融合后的文本特征。/>分别表示融合特征调整损失和情感分类损失。Among them, α, β, and γ are weight coefficients, P represents the prediction probability of the three classifiers, s and t represent speech and text modalities respectively, st represents speech and text multi-modality, represents the emotional label predicted by the model, y_i represents the real emotion label, C represents the total number of emotional categories, N represents the size of the batch, S and S^c are the original and fused speech features respectively, T and T^c are the original and fused speech features respectively. subsequent text features. /> represent fusion feature adjustment loss and sentiment classification loss respectively.

步骤S5：公开数据集测试并评估网络模型：Step S5: Public dataset test and evaluation of network model:

对所提出多模态融合方法的准确性测试时，首先对不同模态的单模态情感识别性能进行测试，并与多模态融合方法进行比较；When testing the accuracy of the proposed multi-modal fusion method, the single-modal emotion recognition performance of different modalities was first tested and compared with the multi-modal fusion method;

然后，通过消融实验分别验证了双注意力机制和跨模态交互结构对性能的影响；为验证所提方法的有效性，在IEMOCAP和MELD两个数据集上进行性能测试，并与现有最先进方法进行性能比对。从表1可知，本发明所提出的基于多模态交互的多模态情感识别方法在IEMOCAP上的识别准确率达到了76.0％，在MELD上的识别准确率达到了64.5％，均高于现有最先进识别方法，说明本发明所提出的方法识别率高，具有更好的分类性能，满足了情感识别系统对识别精度的要求。Then, the impact of dual attention mechanism and cross-modal interaction structure on performance was verified through ablation experiments. In order to verify the effectiveness of the proposed method, performance tests were conducted on IEMOCAP and MELD data sets, and compared with the existing state-of-the-art methods. Advanced methods for performance comparison. As can be seen from Table 1, the multi-modal emotion recognition method based on multi-modal interaction proposed by the present invention has a recognition accuracy of 76.0% on IEMOCAP and a recognition accuracy of 64.5% on MELD, both of which are higher than the current state of the art. There are the most advanced recognition methods, which shows that the method proposed by the present invention has a high recognition rate, has better classification performance, and meets the requirements of the emotion recognition system for recognition accuracy.

本发明选择了四种针对语音文本双模态的情感识别算法作为对比方法，FAF(Y.Gu,K.Yang,S.Fu,S.Chen,X.Li,and I.Marsic,Multimodal affective analysisusing hierarchical attention strategy with wordlevel alignment,2018)采用单词级别的对齐忽略了多模态交互，TSIN(B.Chen,Q.Cao,M.Hou,Z.Zhang,G.Lu,and D.Zhang,Multimodal emotion recognition with temporal and semantic consistency,2021),KS-TRM(W.Wu,C.Zhang,and P.C.Woodland,Emotion recognition by fusing timesynchronous and time asynchronous representations,2021)关注了多模态之间的交互同时产生了很多无用的冗余信息，损失了原始的单模态信息。采用对比方法进行实时情感识别时，实验步骤和本发明提出的跨模态融合的多模态情感识别方法一致，并在表1中列出了在IEMOCAP数据集上加权和非加权准确率效果的对比。从表中可以看出，与其他深度学习方法相比，本发明提出的基于多模态情感识别方法具有识别准确率高的优势,将其扩展用于疾病健康监测等其他领域，如抑郁症等神经疾病的早期辅助检测，均有重大意义。This invention selects four dual-modal emotion recognition algorithms for speech and text as comparison methods, FAF (Y.Gu, K.Yang, S.Fu, S.Chen, X.Li, and I.Marsic, Multimodal affective analysis using hierarchical attention strategy with wordlevel alignment, 2018) adopts word-level alignment and ignores multimodal interaction, TSIN (B.Chen, Q.Cao, M.Hou, Z.Zhang, G.Lu, and D.Zhang, Multimodal emotion recognition with temporal and semantic consistency, 2021), KS-TRM (W.Wu, C.Zhang, and P.C.Woodland, Emotion recognition by fusing timesynchronous and time asynchronous representations, 2021) focuses on the interaction between multiple modalities simultaneously. A lot of useless redundant information, and the original single-modal information is lost. When using the comparison method for real-time emotion recognition, the experimental steps are consistent with the cross-modal fusion multi-modal emotion recognition method proposed by the present invention, and the weighted and unweighted accuracy effects on the IEMOCAP data set are listed in Table 1 Compared. As can be seen from the table, compared with other deep learning methods, the multi-modal emotion recognition method proposed by the present invention has the advantage of high recognition accuracy, and can be extended to other fields such as disease and health monitoring, such as depression, etc. Early auxiliary detection of neurological diseases is of great significance.

表1在公开数据集IEMOCAP上情感识别准确率对比Table 1 Comparison of emotion recognition accuracy on the public data set IEMOCAP

表2在公开数据集MELD上情感识别准确率对比Table 2 Comparison of emotion recognition accuracy on the public data set MELD

步骤S6：搭建情感识别系统：包括数据采集模块、存储与预处理模块、情感识别模块和自适应共情交互模块。Step S6: Build an emotion recognition system: including a data collection module, a storage and preprocessing module, an emotion recognition module and an adaptive empathy interaction module.

其中，数据采集模块用于采集测试者的情感语音数据并转化为对应的文本，进行预处理；Among them, the data collection module is used to collect the tester's emotional voice data and convert it into corresponding text for preprocessing;

所述情感识别模块包括：单模态双注意力机制特征提取模块，跨模态交互协同学习模块、跨模态融合注意模块和分类层；The emotion recognition module includes: a single-modal dual attention mechanism feature extraction module, a cross-modal interactive collaborative learning module, a cross-modal fusion attention module and a classification layer;

单模态双注意力机制特征提取模块提供语音文本信号中的时域和通道维度特征的初步提取；The single-modal dual-attention mechanism feature extraction module provides preliminary extraction of time-domain and channel-dimensional features in speech text signals;

跨模态交互协同学习模块提供语音文本交互协同的融合特征；The cross-modal interactive collaborative learning module provides the fusion features of voice-text interactive collaboration;

情感识别分类层提供融合特征的分类结果；The emotion recognition classification layer provides classification results of fused features;

模型训练模块，用于提供情感识别模型的训练，并输出情感识别分类结果，提供模型训练时，先采用早期停止策略，将划分的训练集重新划分为训练集和验证集，当情感识别模型在验证集上的识别准确率经连续N次训练迭代保持稳定后，提前停止训练过程，并保存此时的模型参数；The model training module is used to provide training for emotion recognition models and output emotion recognition classification results. When providing model training, an early stopping strategy is first used to re-divide the divided training set into a training set and a verification set. When the emotion recognition model is After the recognition accuracy on the verification set remains stable for N consecutive training iterations, the training process is stopped in advance and the model parameters at this time are saved;

然后采用二次训练策略，在第一次训练停止时的情感识别模型参数的基础上，开始第二次训练，在第二次训练中，使用步骤S1中划分的训练集作为模型训练集，并沿用上述提前停止策略，识别模型训练过程中，采用交叉熵损失函数计算分类损失，计算式如下：Then use a secondary training strategy to start the second training based on the emotion recognition model parameters when the first training was stopped. In the second training, use the training set divided in step S1 as the model training set, and Following the above-mentioned early stopping strategy, during the recognition model training process, the cross-entropy loss function is used to calculate the classification loss. The calculation formula is as follows:

其中p_i是情感识别模型生成的第i个条件概率，l_i是标签集的第i类，ω(·)是指示函数，表示识别模型中的可学习的参数，‖·‖用于缓解过拟合问题的正则化项，λ是权衡正则化权重,M代表批处理的大小，训练过程中的学习率为1×10^-2；where p_i is the i-th conditional probability generated by the emotion recognition model, l_i is the i-th category of the label set, and ω(·) is the indicator function, Represents the learnable parameters in the recognition model, ‖·‖ is used to alleviate the regularization term of the overfitting problem, λ is the trade-off regularization weight, M represents the size of the batch, and the learning rate during the training process is 1×10^{- 2} ;

共情交互模块，用于根据所述情感识别模块所识别出的情感进行情感意图识别，对话生成并进行共情交互。The empathic interaction module is used to identify emotional intentions based on the emotions recognized by the emotion recognition module, generate dialogues and perform empathic interaction.

以上对本发明所提供的基于语音文本跨模态融合的情感识别的方法及系统进行了详细的说明，但显然本发明的范围并不局限于此。在不脱离所附权利要求书所限定的保护范围的情况下，对上述实施例的各种改变都在本发明的范围之内。The above is a detailed description of the method and system for emotion recognition based on cross-modal fusion of speech and text provided by the present invention, but it is obvious that the scope of the present invention is not limited thereto. Various changes to the above embodiments are within the scope of the present invention without departing from the scope of protection defined by the attached claims.