Technical Field
The present invention relates to the field of recognition and detection technology, and in particular to a multimodal emotion recognition method based on a staged attention mechanism.
Background Art
With the continuous improvement of affective computing theory and the rapid development of artificial intelligence technology, intelligent machine terminals of all kinds are proliferating, and interaction and integration between humans and intelligent machines has become a development trend of future society. Emotion is a concrete manifestation of communication between humans in the complex living and working environments where humans and machines coexist, and emotion recognition is the foundation of affective computing. Studies have shown that 93% of human communication takes place through non-verbal means, including facial expressions, posture and body language, and voice intonation. Therefore, in research on intelligent human-computer interaction, analyzing acquired facial expression and speech information so as to perceive and recognize the emotional state of the interacting party as a human would, and thereby achieve a virtuous cycle of emotional communication and need satisfaction, is of great importance for improving the service quality of service robots.
For facial emotion recognition, methods can be divided into two categories according to the type of features extracted: one relies on hand-crafted low-level facial expression features, and the other obtains features through deep learning, such as deep facial representations extracted by deep convolutional networks. Typical hand-crafted feature extraction methods include the Scale-Invariant Feature Transform (SIFT), Active Appearance Models, Local Binary Patterns, Histograms of Oriented Gradients and the Dense Scale-Invariant Feature Transform (Dense SIFT). However, because the representational power of these low-level emotional features is limited, with the continuous development of deep networks more and more researchers have combined multi-view facial emotion recognition with deep network architectures, using their powerful multi-layer structures to further mine the semantic expression of emotional features. Moreover, when we look at an expressive face, the brain not only extracts information from local appearance details but also reasons and makes decisions based on the global facial features. Local detail feature vectors and the global features of the whole image describe emotional state information at different scales and are strongly complementary, so facial emotion recognition methods based on dual-branch feature fusion are gradually becoming the mainstream.
Using multimodal emotion data sources alleviates the problems of limited information and data scarcity, but it also introduces a new difficulty: the feature analysis and recognition methods used for different modalities differ considerably. Therefore, choosing a suitable feature fusion method is crucial to the final multimodal emotion recognition performance. Multimodal information fusion is the process of jointly processing data collected from multiple modalities; it can provide rich information from multiple aspects and ultimately improve the accuracy of the overall result or decision.
At present, most multimodal emotion recognition methods first extract salient emotional features from each single-modality channel, then use a network model to compute feature matrices directly at the feature level or decision level, processing the multimodal inputs simultaneously to obtain a fused emotion recognition result. In feature-level fusion, the features extracted from the individual modalities are merged in a straightforward way into a single emotion classification vector, ignoring the modality-specific differences between features. Such a strategy therefore has difficulty modeling the temporal synchronization between the audio and visual modalities, and cannot model the complex relationships arising from differences in time scale and measurement level between modalities. In decision-level fusion, the posterior probabilities of several individual classifiers are combined, for example by linear weighting, to obtain the final recognition result. This approach fully considers the differences between modalities and assumes that the modalities are independent, but it is relatively weak at modeling the interactions between them. With the rapid development of deep learning, attention mechanisms have gradually been combined with deep neural networks and are widely used in the field of emotion recognition.
In summary, multimodal emotion recognition based on facial images and speech data is an important research direction in affective computing and helps to advance the field of intelligent, natural human-computer interaction. The present invention therefore aims at a multimodal emotion recognition method based on a staged attention mechanism: on the basis of analyzing facial images and speech information, it obtains emotional features with strong representational power and, according to the multimodal nature of emotional data, builds an emotional information fusion model. This creates the conditions for building an intelligent emotional robot system with emotional interaction capabilities and endows intelligent machines with the cognitive ability to perceive deep communicative information such as emotional states.
Summary of the Invention
The technical problem to be solved by the present invention is mainly that current modality information fusion methods do not fully consider the different roles that the features of each modality play in recognizing different emotions, and cannot simultaneously account for both the correlations and the differences between features of different modalities. To solve this problem, the present invention proposes a multimodal emotion recognition method based on a staged attention mechanism, which combines the characteristics of multi-view facial and speech emotional information and studies the feature extraction method for each modality together with the multimodal feature fusion algorithm. By focusing on the nonlinear correlations between modalities and their influence on feature selection, the feature data of each modality are fused efficiently, thereby improving the utilization of each modality's information and enhancing the recognition performance of the algorithm.
A multimodal emotion recognition method based on a staged attention mechanism comprises the following steps:
S1: use Dense SIFT to capture local detail features of the face, and at the same time input the entire facial image into a global feature extraction network to obtain complete global features;
S2: automatically build a correlation weight matrix between the local detail features and the complete global features based on the ECA-Net attention mechanism, so as to fuse the dual-branch facial features;
S3: use multiple stacked deep residual modules to perform deep learning analysis on the facial emotion features fused in step S2 and mine a high-level facial emotion feature matrix;
S4: convert the speech data into a spectrogram that retains rich original information in both the time and frequency domains, and learn a high-level emotional semantic feature matrix of the speech modality with a ShuffleNetV2 network;
S5: use a multi-head attention module to learn the attention weights between the high-level facial emotion features obtained in step S3 and the high-level emotional semantic features obtained in step S4, so as to capture the dependencies between features of different modalities and achieve fusion and classification of the multimodal emotional data.
An electronic device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the multimodal emotion recognition method when executing the program.
A storage medium stores a computer program which, when executed by a processor, implements the steps of the multimodal emotion recognition method.
The technical solution provided by the present invention has the following beneficial effects. The present invention proposes a multimodal emotion recognition method based on a staged attention mechanism: a dual-branch feature fusion network based on ECA-Net is used to obtain facial emotion features; an improved ShuffleNetV2 network learns the high-level emotional semantic feature matrix of the speech modality from spectrograms; and a multi-head attention module then achieves effective information fusion across modalities. The invention uses Dense SIFT to capture rich local detail features while feeding the entire facial image into a shallow global feature extraction network; based on the ECA-Net attention mechanism, it automatically models the correlation weight matrix between the local detail features and the complete global features, achieving effective complementary fusion of the dual-branch facial features; multiple stacked deep residual modules then perform deep learning analysis on the fused facial emotion features to mine a high-level facial emotion feature matrix. At the same time, the speech data are converted into spectrograms that retain rich original information in both the time and frequency domains, and an improved ShuffleNetV2 network is designed to learn the high-level emotional semantic feature matrix of the speech modality. A multi-head attention module learns different attention weights to capture the dependencies between features of different modalities, achieving fine-grained fusion and accurate classification of the multimodal emotional data. The dual-branch feature fusion network based on an efficient channel attention mechanism takes full account of the characteristics of multi-view facial emotional information and obtains discriminative facial features. The speech data are converted into spectrograms, and a deep network model is designed to obtain high-level emotional semantic features of the speech modality. On this basis, a multi-head attention fusion strategy fuses the multi-view facial and speech emotional modality information: by learning different attention weights to attend to different parts of the input multimodal emotional feature matrix, the emotional feature information of the individual modalities is fused, which enhances the representation learning ability of the model, improves robustness and generalization, and achieves more accurate and comprehensive emotion recognition.
Brief Description of the Drawings
The specific effects of the present invention will be further described below in conjunction with the accompanying drawings and embodiments, in which:
FIG. 1 is a flow chart of multimodal emotion recognition based on multi-view facial images in an embodiment of the present invention;
FIG. 2 is a diagram of the multi-branch feature fusion framework based on the ECA-Net attention mechanism in an embodiment of the present invention;
FIG. 3 is a diagram of the multimodal emotion recognition network structure based on multi-view facial images in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the operation of the hardware device in an embodiment of the present invention.
Detailed Description of the Embodiments
In order to provide a clearer understanding of the technical features, purposes and effects of the present invention, specific embodiments of the present invention are now described in detail with reference to the accompanying drawings.
An embodiment of the present invention provides a multimodal emotion recognition method based on multi-view facial images. Taking into account the differences and complementarity between the multi-view facial and speech modalities, a local-and-global dual-branch feature fusion network based on the ECA-Net attention mechanism is adopted to fully exploit the characteristics of multi-view facial emotional information; the speech data are converted into spectrograms, and a corresponding feature extraction network model based on ShuffleNetV2 modules is designed to obtain high-level emotional semantic features of the speech modality. On this basis, a multi-head attention fusion strategy fuses the multi-view facial and speech emotional modality information, learning different attention weights to attend to different parts of the input multimodal emotional feature matrix, thereby enhancing the representation learning ability of the model and achieving more accurate and comprehensive emotion recognition. Simulations are carried out on three widely used multimodal emotion benchmark datasets: RAVDESS, eNTERFACE'05 and AFEW. The method specifically comprises the following steps:
S1: Use the Dense Scale-Invariant Feature Transform (Dense SIFT) operator to capture the rich local detail features of the face, and at the same time input the entire facial image into a shallow global feature extraction network to obtain complete global features.
The original image is divided into a dense grid consisting of several local blocks, and a Dense SIFT feature vector with fixed scale and orientation is extracted at the center of each block. The gradient histogram of each feature point's neighborhood is computed as that point's descriptor, yielding the local facial feature description matrix. A global feature extraction network consisting of a convolutional layer, a normalization layer, an activation function and a max-pooling layer is used to encode the input Dense SIFT feature vectors and the raw pixels of the entire face.
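As a minimal, non-limiting sketch of step S1, the dense grid of SIFT descriptors and the shallow global branch could be implemented roughly as follows; the grid step, descriptor scale, channel widths and 224×224 input size are illustrative assumptions not specified above.

```python
# Dual-branch feature extraction for step S1: Dense SIFT (local branch) plus a
# shallow conv->BN->ReLU->max-pool network (global branch).
import cv2
import numpy as np
import torch
import torch.nn as nn

def dense_sift(gray: np.ndarray, step: int = 8, size: int = 8) -> np.ndarray:
    """Compute SIFT descriptors on a dense grid of keypoints (local branch)."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(step // 2, gray.shape[0], step)
                 for x in range(step // 2, gray.shape[1], step)]
    _, descriptors = sift.compute(gray, keypoints)   # (num_points, 128)
    return descriptors

# Shallow global branch as described in S1.
global_branch = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

if __name__ == "__main__":
    gray = (np.random.rand(224, 224) * 255).astype(np.uint8)   # stand-in face image
    local_feats = dense_sift(gray)                              # (num_points, 128)
    global_feats = global_branch(torch.randn(1, 3, 224, 224))   # (1, 64, 56, 56)
    print(local_feats.shape, global_feats.shape)
```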
S2: Automatically build the correlation weight matrix between the local detail features and the complete global features based on the Efficient Channel Attention (ECA-Net) mechanism, achieving effective complementary fusion of the dual-branch facial features.
ECA-Net uses global average pooling to obtain aggregated features and thereby avoids dimensionality-reduction operations; it then generates channel weights by performing a fast one-dimensional convolution of size m, which captures local cross-channel information interaction. The channel attention can be obtained as:

g(X) = (1/(W·H)) Σ_{i=1}^{W} Σ_{j=1}^{H} Xij,    ω = σ(Wk·g(X))

where g(X) is the global average pooling expression; X ∈ R^(W×H×C) is the input feature, with W, H and C the width, height and channel dimensions, and Xij the input feature at width index i and height index j; ω is the weight value corresponding to the aggregated feature y = g(X); Wk is the local cross-channel attention interaction matrix; and σ(·) is the Sigmoid function.
At the same time, by letting all channels share the same weight-learning parameters, the weight matrix can be shared across different channels through a one-dimensional convolution with kernel size m, enabling information interaction, as shown below:

ω = σ(C1Dm(y))

where C1D denotes a one-dimensional convolution operation, and the convolution module contains m parameters.
A method for adaptively selecting the one-dimensional convolution kernel size is used, in which the coverage of the interaction is reasonably proportional to the channel dimension C; that is, there is a mapping φ between m and C:

C = φ(m)
The space of possible solutions is enlarged by extending the linear function φ(m) = γ·m − b into a nonlinear one, namely:

C = φ(m) = 2^(γ·m − b)
Then, for a given channel dimension C, the one-dimensional convolution kernel size can be set adaptively as:

m = ψ(C) = | log2(C)/γ + b/γ |odd

where |·|odd denotes the nearest odd number, γ is the coefficient of the one-dimensional convolution, and b is a preset constant; in this embodiment, γ and b are set to 2 and 1, respectively.
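A minimal PyTorch sketch of the ECA channel-attention operation described above is given below, using the adaptive kernel-size rule with γ = 2 and b = 1 from this embodiment; the channel count in the usage example is an assumption.

```python
# ECA channel attention: global average pooling g(X), a fast 1-D convolution of
# adaptive size m across channels, and a Sigmoid to produce channel weights.
import math
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        m = int(abs(math.log2(channels) / gamma + b / gamma))
        m = m if m % 2 == 1 else m + 1               # round to the nearest odd size
        self.avg_pool = nn.AdaptiveAvgPool2d(1)      # g(X): global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=m, padding=m // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.avg_pool(x)                              # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))      # fast 1-D conv over channels
        w = self.sigmoid(y.transpose(1, 2).unsqueeze(-1)) # channel weights ω
        return x * w                                      # re-weight the input channels

# Example: re-weight channel-fused dual-branch features (channel count assumed).
fused = ECAAttention(channels=128)(torch.randn(2, 128, 56, 56))
```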
S3: Use multiple stacked deep residual modules to perform deep learning analysis on the facial emotion features fused in step S2, and mine a high-level facial emotion feature matrix.
Let xi denote the input of the i-th deep residual module and xi+1 its output. Following the principle of the residual network architecture, the structure of the deep residual module is expressed as:

xi+1 = f( F(xi, Wi1, Wi2) + xi ),    F(xi, Wi1, Wi2) = Wi2·f(Wi1·xi)

where Wi1 denotes the weight parameters of the first 3×3 convolutional layer and BN layer in the i-th deep residual module, Wi2 denotes the weight parameters of the second 3×3 convolutional layer and BN layer in the i-th deep residual module, F(·) denotes the learned residual mapping, and f(·) denotes the ReLU activation function.
Therefore, considering the requirements of the actual task, the present invention uses eight stacked deep residual modules to further learn the high-level semantic features of multi-view facial emotion. The feature recognition network consists of the stacked deep residual modules, an average pooling layer, a fully connected layer and a Softmax classifier. The dual-branch fused emotion feature vector obtained in step S2 is fed into the eight consecutively stacked deep residual modules, yielding 512 high-level semantic emotion feature maps of size 7×7. The average pooling layer then produces 512 feature matrices of size 1×1, and the fully connected layer and Softmax classifier are used for the classification in step S5.
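A minimal PyTorch sketch of the deep residual module and an eight-module stack consistent with the description above follows; the channel progression and the 1×1 projection shortcut on down-sampling stages are assumptions.

```python
# Deep residual module: two 3x3 conv+BN layers with ReLU and an identity shortcut,
# i.e. x_{i+1} = f(F(x_i) + x_i); eight modules end in 512 channels at 7x7, then
# global average pooling reduces each map to 1x1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepResidualModule(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection when the shortcut must change shape (an assumption).
        self.shortcut = (nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + self.shortcut(x))

backbone = nn.Sequential(
    DeepResidualModule(64, 64),      DeepResidualModule(64, 64),
    DeepResidualModule(64, 128, 2),  DeepResidualModule(128, 128),
    DeepResidualModule(128, 256, 2), DeepResidualModule(256, 256),
    DeepResidualModule(256, 512, 2), DeepResidualModule(512, 512),
    nn.AdaptiveAvgPool2d(1),
)
face_feats = backbone(torch.randn(2, 64, 56, 56)).flatten(1)   # (2, 512)
```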
S4: Convert the speech data into a spectrogram that retains rich original information in both the time and frequency domains, and learn a high-level emotional semantic feature matrix of the speech modality with a ShuffleNetV2 network.
The original speech signal needs to be preprocessed, which specifically includes pre-emphasis, framing and windowing, and fast Fourier transform processing; the spectrogram is then drawn from the result.
Pre-emphasis is used to emphasize the high-frequency part of the signal, amplify the high-frequency information, improve signal quality and increase transmission efficiency, as shown in the following formula:

y(t) = x(t) − α·x(t−1)

where x(t) is the original speech signal at time t (i.e. the original sample value), x(t−1) is the speech signal at the previous instant (i.e. the previous sample value), α is the pre-emphasis coefficient, and y(t) is the speech signal after pre-emphasis.
The pre-emphasized speech signal is then framed and windowed, with a frame length of 25 ms and a frame shift of 10 ms. A Hamming window, denoted w(n), is used as the window function, and the windowed speech signal y'(n) is expressed as:

y'(n) = y(n)·w(n)

where n = 1, 2, …, N, and N is the total length of the window function.
The framed and windowed speech data are then processed with a fast Fourier transform, producing the FFT coefficients Y(k), i.e. the frequency-domain information of the speech signal. The energy density spectrum P(k) of the speech signal equals the squared magnitude of the Fourier coefficients Y(k):

P(k) = |Y(k)|²
Taking the logarithm of P(k) and applying a multiplicative gain gives the spectral magnitude S(k,t) shown below; it is normalized, and the spectrogram is drawn from it:

S(k,t) = 20·log10(P(k) + ε)

where S(k,t) is the spectral magnitude and ε is a regularization coefficient.
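A minimal NumPy sketch of the spectrogram preprocessing described above (pre-emphasis, 25 ms frames with a 10 ms shift, Hamming window, FFT, logarithmic power spectrum and normalization) is given below; the values of the pre-emphasis coefficient α and the regularization coefficient ε are not specified above and are assumed here.

```python
# Spectrogram preprocessing for step S4.
import numpy as np

def spectrogram(signal: np.ndarray, sr: int, alpha: float = 0.97,
                eps: float = 1e-10) -> np.ndarray:
    # Pre-emphasis: y(t) = x(t) - alpha * x(t-1)
    y = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len, frame_shift = int(0.025 * sr), int(0.010 * sr)   # 25 ms / 10 ms
    window = np.hamming(frame_len)

    frames = []
    for start in range(0, len(y) - frame_len + 1, frame_shift):
        frame = y[start:start + frame_len] * window     # framing + Hamming window
        spectrum = np.fft.rfft(frame)                   # FFT coefficients Y(k)
        power = np.abs(spectrum) ** 2                   # P(k) = |Y(k)|^2
        frames.append(20.0 * np.log10(power + eps))     # S(k,t)

    spec = np.array(frames).T                           # (frequency bins, time frames)
    spec = (spec - spec.min()) / (spec.max() - spec.min() + eps)  # normalization
    return spec
```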
The preprocessed spectrogram is sent to a 3×3 convolutional layer and a max-pooling layer, giving an output feature size of 56×56. It is then passed in sequence through three structural modules, ShuffleNetV2 module 1, ShuffleNetV2 module 2 and ShuffleNetV2 module 3, each consisting of a down-sampling module followed by basic unit modules in series. Finally, after a 1×1 convolutional layer, a linear normalization layer and an adaptive average pooling layer, 512 feature matrices of size 1×1 are obtained for information fusion across the different modality data.
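As an illustrative stand-in for the improved ShuffleNetV2 speech branch, the sketch below reuses the torchvision ShuffleNetV2 backbone and adds a 512-dimensional projection head; it does not reproduce the exact modified modules described above.

```python
# Speech-branch encoder: ShuffleNetV2 backbone (3 stages + 1x1 conv) followed by
# a linear projection to 512-d features for the fusion stage.
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

class SpeechEmotionEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        backbone = shufflenet_v2_x1_0(weights=None)
        backbone.fc = nn.Identity()                 # drop the 1000-way classifier
        self.backbone = backbone                    # conv1 + three ShuffleNetV2 stages + conv5
        self.project = nn.Linear(1024, feat_dim)    # map pooled features to 512-d

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (B, 1, H, W) spectrogram, replicated to 3 channels for the backbone.
        x = self.backbone(spec.repeat(1, 3, 1, 1))
        return self.project(x)

speech_feats = SpeechEmotionEncoder()(torch.randn(2, 1, 224, 224))   # (2, 512)
```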
S5: Use a multi-head attention module to learn the attention weights between the high-level facial emotion features obtained in step S3 and the high-level emotional semantic features obtained in step S4, capturing the dependencies between features of different modalities and achieving fine-grained fusion and accurate classification of the multimodal emotional data.
Based on the similarity computation, the self-attention mechanism can automatically take global and local dependencies into account and learn the associations between different elements, thereby focusing more effectively on the important information in the multimodal emotion feature matrix.
sim = Q·K^T / √dk
Attention(Q, K, V) = Softmax(sim)·V
Y = Attention(Q, K, V)

where sim denotes the similarity, Q is the query vector, K is the key vector, V is the value vector, √dk is the scaling factor with dk the dimension of the K vector, T denotes transposition, and Attention denotes self-attention. For large vector dimensions the Softmax gradients become very small; to offset this effect, the dot-product matrix is divided by √dk.
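A minimal sketch of the scaled dot-product attention defined by the formulas above:

```python
# Scaled dot-product attention: sim = QK^T / sqrt(d_k), output = Softmax(sim) V.
import math
import torch

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor,
                                 V: torch.Tensor) -> torch.Tensor:
    d_k = K.size(-1)
    sim = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(sim, dim=-1) @ V
```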
The multi-head self-attention mechanism enhances the expressive power and generalization ability of the model by using multiple attention heads to learn several different self-attention representations in parallel. Each attention head computes its attention weights independently, so each individual attention mechanism can attend to the interactions and importance between different modalities, thereby capturing the complex associations between them.
MultiHead(Q, K, V) = Concat(head1, head2, …, headh)·Wo
headi = Attention(Qi, Ki, Vi)

where headi = Attention(Qi, Ki, Vi) is the attention output sequence of the i-th attention head, and Wo is a learnable output weight matrix that linearly combines the outputs of the individual attention heads to produce the final attention output; MultiHead(·) denotes multi-head attention and Concat(·) denotes the concatenation operation.
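A minimal PyTorch sketch of the multi-head fusion and classification in step S5 is given below, using the h = 8 heads and dmodel = 512 configuration stated in the following paragraph; the stacking of the two modality features into a two-token sequence and the number of emotion classes are assumptions.

```python
# Multi-head attention fusion of the 512-d face and speech features, followed by
# a fully connected layer and Softmax for emotion classification.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_model: int = 512, heads: int = 8, num_classes: int = 7):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=heads,
                                          batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, face_feat: torch.Tensor, speech_feat: torch.Tensor):
        # Stack the two modality features as a 2-token sequence (an assumption).
        tokens = torch.stack([face_feat, speech_feat], dim=1)   # (B, 2, 512)
        fused, _ = self.attn(tokens, tokens, tokens)            # multi-head self-attention
        logits = self.classifier(fused.mean(dim=1))             # pool + fully connected
        return torch.softmax(logits, dim=-1)

probs = MultimodalFusion()(torch.randn(4, 512), torch.randn(4, 512))   # (4, 7)
```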
The number of attention heads h is set to 8. The multimodal emotion feature input matrix has a feature dimension of dmodel = 512, so the dimension corresponding to each of the 8 heads is dh0 = dh1 = … = dh7 = dmodel/h = 64. Finally, the obtained multi-head attention output matrix Y is linearly transformed and fed into a fully connected layer and a Softmax classifier to classify the emotional state. The simulation results on the RAVDESS, eNTERFACE'05 and AFEW datasets are shown in Tables 1, 2 and 3, respectively.
Table 1 Simulation results on the RAVDESS dataset
Table 2 Simulation results on the eNTERFACE'05 dataset
Table 3 Simulation results on the AFEW dataset
After completing the simulation experiments on the benchmark datasets, the present invention analyzed and validated the experimental results. The results show that the proposed method, on the basis of capturing discriminative high-level emotional semantic features from facial and speech data, achieves complementary fusion of emotional features across the audio-visual modalities. This fusion provides richer emotional information and thus yields accurate multimodal emotion recognition results. On the RAVDESS dataset, the average recognition rates of facial emotion recognition, speech emotion recognition and multimodal emotion recognition are 82.1%, 80.6% and 91.1%, respectively; on the eNTERFACE'05 database they are 80.1%, 73.9% and 92.0%, respectively; and on the AFEW database the average recognition rates of multi-view facial emotion recognition, speech emotion recognition and multimodal emotion recognition are 37.6%, 47.4% and 60.1%, respectively. It can be seen that the multimodal emotion recognition proposed by the present invention achieves the highest accuracy.
Referring to FIG. 4, FIG. 4 is a schematic diagram of the operation of a hardware device according to an embodiment of the present invention. The hardware device specifically includes: an electronic device 401, a processor 402 and a memory 403.
Electronic device 401: the electronic device 401 implements the multimodal emotion recognition method.
Processor 402: the processor 402 loads and executes the instructions and data in the memory 403 to implement the multimodal emotion recognition method.
Memory 403: the memory 403 stores instructions and data and is used to implement the multimodal emotion recognition method.
The above are only preferred embodiments of the present invention and are not intended to limit the patent scope of the present invention. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.