CN119293740B - A multimodal conversation emotion recognition method - Google Patents

A multimodal conversation emotion recognition method

Info

Publication number
CN119293740B
CN119293740B (application CN202411833608.2A)
Authority
CN
China
Prior art keywords
emotion
gesture
expression
features
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411833608.2A
Other languages
Chinese (zh)
Other versions
CN119293740A (en)
Inventor
蔡创新
潘志庚
邹宇
林贤煊
夏先亮
张考
胡志华
李昌利
张婧
王爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202411833608.2A
Publication of CN119293740A
Application granted
Publication of CN119293740B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention discloses a multimodal dialogue emotion recognition method, which relates to the technical fields of multimodal emotion recognition and human-computer interaction. The method obtains expression timing features and gesture timing features with a face recognition model and a gesture recognition tool, respectively; an attention module adaptively weights and fuses the expression and gesture features to obtain fused visual modality features; a new representation of the context information is constructed and an emotion representation is obtained with a prompt-based emotion modeling technique, from which text modality features are extracted by a text encoder; speech modality features of each speaker's utterances are extracted with a data vectorization model; a skip-connection multi-head attention cross-modal fusion method is proposed to align and fuse the multimodal features across modalities; and emotion recognition is then performed by an emotion classifier module. The method effectively alleviates the insufficient recognition of key emotion cues and the insufficient fusion found in traditional multimodal emotion recognition, and improves the accuracy and robustness of emotion recognition.

Description

Multimodal dialogue emotion recognition method
Technical Field
The invention relates to the technical fields of multimodal emotion recognition and human-computer interaction, and in particular to a multimodal dialogue emotion recognition method.
Background
Emotion recognition automatically perceives, recognizes, understands and provides feedback about human emotion, and spans computer science, neuroscience, psychology and social science. It is essential for natural, human-like and personalized human-computer interaction, and is of great value as the intelligent and digital era arrives. Emotion recognition has wide application, including intelligent assistants, virtual reality and intelligent healthcare. Traditional emotion recognition methods mainly depend on data sources from a single modality (such as speech, text or facial expression) and can hardly reflect the full complexity of human emotion.
In practical applications, emotion is often conveyed through multiple signals, such as facial expression, body posture and voice intonation. These modalities complement one another and together convey a complete emotional state. Conventional dialogue emotion recognition methods face the following challenges: (1) relying on a single source of emotion information ignores the multimodal nature of emotion expression and leads to low recognition accuracy; (2) existing multimodal methods cannot accurately integrate information such as expressions and gestures into a unified emotion output; and (3) in multi-party, multi-turn dialogue scenes, the many participants and complex interactions mean that different emotion cues carry different importance at different moments. Traditional multimodal emotion recognition methods lack a flexible attention mechanism and struggle to highlight key emotion cues, which degrades recognition performance.
Disclosure of Invention
The invention aims to provide a multimodal dialogue emotion recognition method in which visual modality features are generated by combining expression features and gesture features, the visual, text and speech modality features are aligned and fused across modalities, and the fused features are fed into an emotion classifier module for emotion recognition. The method effectively alleviates the insufficient recognition of key emotion cues and the insufficient fusion found in traditional multimodal emotion recognition, and improves the accuracy and robustness of emotion recognition. The invention is realized by the following technical scheme.
The invention provides a multimodal dialogue emotion recognition method, which comprises the following steps:
Constructing a multimodal interactive dialogue list U;
Obtaining expression timing features and gesture timing features, respectively, from a video segment of the multimodal interactive dialogue list U by means of the face recognition model SFace and the gesture recognition tool MediaPipe;
Fusing the expression timing features and gesture timing features with a face-pose attention module FPA to obtain the final visual modality features;
Constructing a new representation of the context information, obtaining an emotion representation through the prompt-based emotion modeling technique PEMT, and feeding the emotion representation into the text encoder SimCSE to obtain text modality features;
Sampling at equal intervals the speech segment corresponding to each utterance of the multimodal interactive dialogue list, feeding the sampled data into the data vectorization model data2vec to extract speech features, and aggregating the speech features of all frames to obtain speech modality features;
Feeding the visual modality features, text modality features and speech modality features into a skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output;
Applying a nonlinear transformation to the cross-modal fused attention output and feeding the transformed features into an emotion classifier to generate a predicted probability distribution vector over emotion categories.
In practical application, conventional dialogue emotion recognition methods rely on a single source of emotion information and cannot accurately integrate information such as expressions and gestures into a unified emotion output; this problem is addressed by extracting expression timing features and gesture timing features with the corresponding models and fusing them into visual modality features. During this fusion, several attention mechanisms allow the model to focus on the more important emotional timing information and to strengthen the feature fusion, which addresses the problem that traditional multimodal emotion recognition methods lack a flexible attention mechanism and struggle to highlight key emotion cues, degrading recognition performance.
Through cross-modal alignment and fusion of the visual, text and speech modality features, the invention achieves a multidimensional representation and deep understanding of emotional information. The text modality features capture emotion cues in the language content, such as word choice, sentence structure and implicit semantics, providing rich context for emotion recognition, while the speech modality reflects the dynamics of emotion through acoustic parameters such as intonation, speaking rate and volume, compensating for details that plain text analysis can hardly capture. Fusing the three not only makes emotion recognition more comprehensive but also markedly improves the robustness and accuracy of the system.
In addition, the finally generated predicted emotion probability distribution improves the interpretability and transparency of the emotion recognition system. The user can observe the model's confidence for each category, which helps support decision making. In practical applications, especially in fields such as healthcare, education and psychological counseling, this probability information helps professionals better understand the emotional state of the person being analyzed and supports more targeted intervention or advice.
Optionally, constructing the multimodal interactive dialogue list U comprises collecting multimodal dialogue data from multi-turn dialogues involving multiple participants, preprocessing the multimodal dialogue data, and constructing the multimodal interactive dialogue list,
where the multimodal interactive dialogue list comprises a plurality of utterances, and each utterance comprises a text record, a video segment and a speech segment;
the serial number of an utterance, which may also be called a dialogue unit, ranges from 1 to the total number of utterances in the whole multimodal interactive dialogue list, and each utterance has a corresponding speaker.
Optionally, obtaining the expression timing features and gesture timing features, respectively, from a video segment of the multimodal interactive dialogue list with the face recognition model SFace and the gesture recognition tool MediaPipe comprises the following steps:
recognizing the faces in the video segment with the face recognition model SFace and extracting the face regions to form a sequence of face images, where each element of the sequence is the face image of one frame;
performing human pose recognition on the same video segment with the gesture recognition tool MediaPipe and extracting the body regions to form a sequence of pose images, where each element of the sequence is the pose image of one frame;
feeding the face image sequence and the pose image sequence into the spatio-temporal feature extraction model TimeSformer to extract the expression timing features and gesture timing features, respectively, capturing facial expression dynamics and posture changes, as follows:
,
Optionally, fusing the expression timing features and gesture timing features with the face-pose attention module FPA comprises:
encoding the expression timing features and gesture timing features to obtain rich timing information, where the encoded expression timing features and gesture timing features are obtained through an MLP nonlinear transformation:
,
,
where the two outputs respectively denote the encoded expression timing features and gesture timing features;
adaptively weighting the encoded expression timing features and gesture timing features through the following formulas to obtain an expression attention coefficient and a gesture attention coefficient:
,
,
where each coefficient is computed from the weight matrix and bias vector associated with the encoded expression timing features or the encoded gesture timing features, respectively;
to improve the emotion representation capability of expressions and gestures, a multi-head attention mechanism is used to enhance the emotion characterization of the encoded expression timing features and gesture timing features, with the multi-head attention computed by the following formulas:
,
,
where the two outputs respectively denote the weight coefficients of the expression timing features and gesture timing features obtained by the multi-head attention mechanism, and the projection matrices involved are all learnable parameters. Applying multi-head attention to the encoded expression timing features and gesture timing features separately allows, in practical applications, the multimodal dialogue emotion recognition model to adaptively attend to the timing information that matters most for emotion.
Optionally, fusing the expression timing features and gesture timing features with the face-pose attention module FPA further comprises:
using the encoded gesture timing features as the query Q and the encoded expression timing features as the key K and value V, obtaining the enhanced expression timing features by the following formula:
,
using the encoded expression timing features as the query Q and the encoded gesture timing features as the key K and value V, obtaining the enhanced gesture timing features by the following formula:
The cross-attention mechanism realizes information interaction between the encoded expression timing features and gesture timing features, which further strengthens the feature fusion.
The invention uses the face-pose attention module FPA to fuse the expression timing features and gesture timing features. The face-pose attention module FPA comprises a feature encoding layer, an attention generation layer, an attention calibration layer, a cross-attention layer and a fusion layer connected in sequence. First, the expression timing features and gesture timing features are fed into the feature encoding layer to obtain the encoded expression timing features and gesture timing features. Second, the encoded expression timing features and gesture timing features are fed into the attention generation layer to obtain the expression attention coefficient and gesture attention coefficient. Then, the encoded expression timing features and gesture timing features are fed into the attention calibration layer to obtain the weight coefficients of the expression timing features and gesture timing features. Next, the weight coefficients are fed into the cross-attention layer to obtain the enhanced expression timing features and enhanced gesture timing features. Finally, the expression attention coefficient, the gesture attention coefficient, the enhanced expression timing features and the enhanced gesture timing features are fed into the fusion layer to fuse the expression and gesture features and obtain the final visual modality features. The specific steps are as follows:
Optionally, fusing the expression timing features and gesture timing features with the face-pose attention module FPA to obtain the final visual modality features comprises the following step:
fusing the enhanced expression timing features and enhanced gesture timing features in a weighted manner, using the expression and gesture attention coefficients, to obtain the final visual modality features.
Optionally, constructing a new representation of the context information, obtaining an emotion representation through the prompt-based emotion modeling technique PEMT, and feeding the emotion representation into the text encoder SimCSE to obtain the text modality features comprises the following steps:
context modeling: extracting the text records of the three utterances preceding the current utterance, so as to enhance the text record of the current utterance with historical utterance information and provide richer context for emotion recognition, and constructing a new representation containing the context information as follows:
,
text feature extraction: using the new context representation, the speaker and the text record, the prompt-based emotion modeling technique PEMT is proposed to capture the long-distance dependence between the speaker and the uttered words, yielding an emotion representation as follows:
,
The emotion representation is encoded by the text encoder SimCSE, and the representation of the special token <mask> is taken as the text modality feature, providing the key text-modality information for multimodal emotion recognition.
Optionally, sampling at equal intervals the speech segment corresponding to each utterance of the multimodal interactive dialogue list, feeding the sampled data into the data vectorization model data2vec to extract speech features, and aggregating the speech features of all frames to obtain the speech modality features comprises the following steps:
speech signal segmentation: given the speech segment corresponding to each utterance, dividing the continuous speech segment into frames used as speech data, expressed as follows:
,
where each element denotes one frame of speech data of the speech segment, and the frame index ranges from 1 to the total number of frames;
speech feature extraction: feeding each frame of speech data into the data vectorization model data2vec to obtain a speech feature representation, as follows:
,
aggregating all frame features: using average pooling to aggregate all the speech feature representations into the speech modality feature of the utterance.
Optionally, feeding the visual modality features, text modality features and speech modality features into the skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output comprises the following steps:
splicing the modality features: concatenating the visual modality features and speech modality features to obtain a spliced feature vector, as follows:
,
cross-modal attention computation: using the text modality features as the query and value and the spliced features as the key, calculating the weight score of each attention head by the following formula to obtain the output representation of each attention head:
,
where the attention-head index ranges over the number of heads, each head has its corresponding weight score, and the key/value vector dimension is used to scale the attention score;
the weight scores of all attention heads are concatenated and passed through a linear transformation matrix to obtain the final cross-modal fused attention output.
On the one hand, the skip-connection multi-head attention module SMA fuses the multimodal information while its adaptive weights assign importance to each modality; on the other hand, the skip connection prevents the model from ignoring key emotional information in the different modalities, helping it capture and integrate subtle but important emotion cues across modalities.
Optionally, applying a nonlinear transformation to the cross-modal fused attention output and feeding the transformed features into the emotion classifier to generate the predicted probability distribution vector of emotion categories comprises the following steps:
nonlinear transformation: a fully connected layer and a ReLU activation inside the emotion classifier transform the cross-modal fused attention output into nonlinearly transformed features, enhancing the model's ability to capture complex relationships and high-order interactions between features, calculated by the following formula:
,
where the fully connected layer has a weight matrix and a bias vector, and ReLU denotes the activation function;
emotion classifier output: the nonlinearly transformed features are fed into the emotion classifier, and the predicted probability distribution vector over emotion categories is calculated by the following formula:
,
where the classifier has a weight matrix and a bias term. The emotion classifier described above uses a fully connected layer and generates the predicted probability distribution over emotion categories with a softmax function.
Advantageous effects
(1) The invention achieves efficient fusion of multimodal information by combining attention over expressions and gestures. The method weights different emotion cues according to their importance, so that the emotion recognition model can dynamically capture core emotional features, which markedly improves recognition accuracy. Meanwhile, in practical use, the attention mechanism lets the multimodal dialogue emotion recognition model automatically focus on the expression and gesture information that is emotionally significant in the dialogue and filter out irrelevant or noisy information, improving the accuracy and robustness of emotion recognition.
(2) The predicted emotion probability distribution finally produced by the method not only gives the emotion classification model's final judgment on a certain emotion category but also reveals the likelihood of the other categories. Examining the probability of each category reveals the ambiguity of the input emotion data and possible mixtures of emotions. For example, when a data point has high probability in both the anger and sadness categories, this can assist the user in performing a finer-grained emotion analysis and reflect the true emotional state more accurately.
Drawings
FIG. 1 is a schematic flow chart of a multimodal dialogue emotion recognition method according to an embodiment of the present invention.
Detailed Description
Further description is provided below in connection with the drawings and the specific embodiments. In the description of the present invention, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature.
Example 1
This embodiment provides a multimodal dialogue emotion recognition method, which comprises the following steps:
Constructing a multimodal interactive dialogue list U;
Obtaining expression timing features and gesture timing features, respectively, from a video segment of the multimodal interactive dialogue list by means of the face recognition model SFace and the gesture recognition tool MediaPipe;
Fusing the expression timing features and gesture timing features with the face-pose attention module FPA to obtain the final visual modality features;
Constructing a new representation of the context information, obtaining an emotion representation through the prompt-based emotion modeling technique PEMT, and feeding the emotion representation into the text encoder SimCSE to obtain the text modality features;
Sampling at equal intervals the speech segment corresponding to each utterance of the multimodal interactive dialogue list, feeding the sampled data into the data vectorization model data2vec to extract speech features, and aggregating the speech features of all frames to obtain the speech modality features;
Feeding the visual modality features, text modality features and speech modality features into the skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output;
Applying a nonlinear transformation to the cross-modal fused attention output and feeding the transformed features into the emotion classifier to generate the predicted probability distribution vector over emotion categories.
In practical application, conventional dialogue emotion recognition methods rely on a single source of emotion information and cannot accurately integrate information such as expressions and gestures into a unified emotion output; this problem is addressed by extracting expression timing features and gesture timing features with the corresponding models and fusing them into visual modality features. During this fusion, several attention mechanisms allow the model to focus on the more important emotional timing information and to strengthen the feature fusion, which addresses the problem that traditional multimodal emotion recognition methods lack a flexible attention mechanism and struggle to highlight key emotion cues, degrading recognition performance.
Through cross-modal alignment and fusion of the visual, text and speech modality features, the invention achieves a multidimensional representation and deep understanding of emotional information. In addition, the finally generated predicted emotion probability distribution improves the interpretability and transparency of the emotion recognition system: the user can observe the model's confidence for each category, which helps support decision making. In practical applications, especially in fields such as healthcare, education and psychological counseling, this probability information helps professionals better understand the emotional state of the person being analyzed and supports more targeted intervention or advice.
Example 2
On the basis of embodiment 1, this embodiment describes a specific implementation procedure of a multimodal dialog emotion recognition method, as shown in fig. 1, which specifically includes the following contents:
1. Data preprocessing
Step 1: first, collect the multimodal dialogue data of multi-turn dialogues involving multiple participants, preprocess the data, and finally construct the multimodal interactive dialogue list;
Step 1.1: collect all the multimodal dialogue data involving multiple participants, including the text modality data, speech modality data, video modality data and the speaker name corresponding to each utterance;
Step 1.2: apply standardized preprocessing to the text modality data of each utterance, including removing stop words, correcting spelling errors and sentence segmentation, to ensure the continuity and accuracy of the text content and obtain the text record corresponding to each utterance;
Step 1.3: clean and normalize the speech modality data of each utterance, including noise reduction and equalization, to improve the clarity and consistency of the speech and obtain the speech segment corresponding to each utterance;
Step 1.4: preprocess the video modality data corresponding to each utterance and extract 15 key frames by equidistant sampling to construct a simplified video clip, reducing data redundancy and improving processing efficiency, and finally obtain the video segment corresponding to each utterance, as sketched below;
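The equidistant key-frame sampling in step 1.4 can be illustrated with a short Python sketch. The use of OpenCV, the helper name `sample_keyframes` and the fixed count of 15 frames per clip are illustrative assumptions, not requirements stated by the patent.

```python
import cv2
import numpy as np

def sample_keyframes(video_path: str, num_frames: int = 15):
    """Extract `num_frames` evenly spaced key frames from one video segment."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip (equidistant sampling).
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # list of H x W x 3 RGB arrays
```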
Building the multimodal interactive dialogue list comprises collecting multimodal dialogue data from multi-turn dialogues involving multiple participants, preprocessing the multimodal dialogue data, and constructing the multimodal interactive dialogue list,
where the multimodal interactive dialogue list comprises a plurality of utterances, and each utterance comprises a text record, a video segment and a speech segment;
the serial number of an utterance, which may also be called a dialogue unit, ranges from 1 to the total number of utterances in the whole multimodal interactive dialogue list, and each utterance has a corresponding speaker.
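A minimal sketch of how the multimodal interactive dialogue list could be held in memory is given below; the class name, field names and example values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    speaker: str        # speaker name for this utterance
    text: str           # preprocessed text record
    video_frames: list  # sampled key frames of the video segment
    audio_path: str     # cleaned speech segment

# The dialogue list is the ordered sequence of utterances of one conversation.
dialogue_list: List[Utterance] = [
    Utterance("A", "I can't believe this happened!", [], "utt_001.wav"),
    Utterance("B", "Calm down, let's talk it through.", [], "utt_002.wav"),
]
```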
2. Feature multimodal fusion
Step 2: first, perform face recognition and human body recognition on the video segment with the face recognition model SFace and the gesture recognition tool MediaPipe to obtain the image sequences of the face region and the body region; then extract features from them with the spatio-temporal feature extraction model TimeSformer to obtain the expression timing features and gesture timing features, respectively; finally, a face-pose attention module FPA is proposed to fuse the expression timing features and gesture timing features into the final visual modality features;
Step 2.1: extract the expression timing features and gesture timing features;
Step 2.1.1, face sequence: recognize the faces in the video segment with the face recognition model SFace and extract the face regions to form a sequence of face images, where each element of the sequence is the face image of one frame;
Step 2.1.2, gesture sequence: perform human pose recognition on the same video segment with the gesture recognition tool MediaPipe and extract the body regions to form a sequence of pose images, where each element of the sequence is the pose image of one frame;
Step 2.1.3, timing feature extraction: feed the face image sequence and the pose image sequence into the spatio-temporal feature extraction model TimeSformer to extract the expression timing features and gesture timing features, respectively, capturing facial expression dynamics and posture changes, as follows:
,
,
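A hedged sketch of the timing-feature extraction in step 2.1.3, using a publicly available TimeSformer checkpoint from the transformers library; the checkpoint name, the mean pooling of the token sequence and the 8-frame clip length (a subset of the 15 key frames) are assumptions, since the patent does not specify them.

```python
import torch
from transformers import AutoImageProcessor, TimesformerModel

ckpt = "facebook/timesformer-base-finetuned-k400"  # assumed public checkpoint
processor = AutoImageProcessor.from_pretrained(ckpt)
timesformer = TimesformerModel.from_pretrained(ckpt).eval()

def timing_features(frames):
    """frames: list of RGB arrays whose length matches the checkpoint's
    expected clip length (8 for this checkpoint)."""
    inputs = processor(frames, return_tensors="pt")  # -> pixel_values
    with torch.no_grad():
        out = timesformer(**inputs)
    # Mean-pool the spatio-temporal tokens into one timing feature vector.
    return out.last_hidden_state.mean(dim=1)         # shape (1, hidden_size)

# The same routine is applied once to the cropped face sequence (expression
# timing features) and once to the cropped body sequence (gesture timing features).
```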
Step 2.2: encode the expression timing features and gesture timing features to obtain rich timing information, where the encoded expression timing features and gesture timing features are obtained through an MLP nonlinear transformation:
,
,
where the two outputs respectively denote the encoded expression timing features and gesture timing features;
Step 2.3: adaptively weight the encoded expression timing features and gesture timing features through the following formulas to obtain an expression attention coefficient and a gesture attention coefficient:
,
,
where each coefficient is computed from the weight matrix and bias vector associated with the encoded expression timing features or the encoded gesture timing features, respectively;
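One plausible PyTorch reading of steps 2.2 and 2.3: each stream is encoded by its own MLP, and a per-stream linear layer with a sigmoid produces the attention coefficient. The sigmoid gate, layer sizes and class name are assumptions; the patent only states that a weight matrix and a bias vector are involved for each stream.

```python
import torch
import torch.nn as nn

class ExpressionGestureGate(nn.Module):
    """Sketch of steps 2.2-2.3: MLP encoding plus adaptive attention coefficients."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.enc_f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.enc_p = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate_f = nn.Linear(dim, 1)  # weight matrix and bias for the expression stream
        self.gate_p = nn.Linear(dim, 1)  # weight matrix and bias for the gesture stream

    def forward(self, expr_feat, pose_feat):
        f = self.enc_f(expr_feat)                # encoded expression timing features
        p = self.enc_p(pose_feat)                # encoded gesture timing features
        alpha_f = torch.sigmoid(self.gate_f(f))  # expression attention coefficient
        alpha_p = torch.sigmoid(self.gate_p(p))  # gesture attention coefficient
        return f, p, alpha_f, alpha_p
```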
Step 2.4: to improve the emotion representation capability of expressions and gestures, use a multi-head attention mechanism to enhance the emotion characterization of the encoded expression timing features and gesture timing features, with the multi-head attention computed by the following formulas:
,
,
where the two outputs respectively denote the weight coefficients of the expression timing features and gesture timing features obtained by the multi-head attention mechanism, and the projection matrices involved are all learnable parameters. Applying multi-head attention to the encoded expression timing features and gesture timing features separately allows the multimodal dialogue emotion recognition model to adaptively attend, in practical applications, to the timing information that matters most for emotion;
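A minimal sketch of the per-stream multi-head attention enhancement in step 2.4; the embedding size of 256 and the 4 heads are assumptions.

```python
import torch.nn as nn

# Each stream gets its own multi-head self-attention over the time dimension.
mha_expr = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
mha_pose = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

def enhance(f, p):
    """f, p: (batch, time, 256) encoded expression / gesture timing features."""
    w_f, _ = mha_expr(f, f, f)  # weight coefficients of the expression stream
    w_p, _ = mha_pose(p, p, p)  # weight coefficients of the gesture stream
    return w_f, w_p
```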
Step 2.5: using the encoded gesture timing features as the query Q and the encoded expression timing features as the key K and value V, obtain the enhanced expression timing features by the following formula:
,
using the encoded expression timing features as the query Q and the encoded gesture timing features as the key K and value V, obtain the enhanced gesture timing features by the following formula:
The cross-attention mechanism realizes information interaction between the encoded expression timing features and gesture timing features, which further strengthens the feature fusion;
Step 2.6: fuse the enhanced expression timing features and enhanced gesture timing features in a weighted manner, using the expression and gesture attention coefficients, to obtain the final visual modality features, as sketched below.
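A sketch of steps 2.5 and 2.6: bidirectional cross-attention between the two encoded streams, followed by a weighted fusion with the attention coefficients from step 2.3. The temporal mean pooling at the end and the layer sizes are assumptions.

```python
import torch.nn as nn

cross_fp = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
cross_pf = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

def fuse_visual(f, p, alpha_f, alpha_p):
    """f, p: encoded expression / gesture features, shape (batch, time, 256);
    alpha_f, alpha_p: attention coefficients from the gating step."""
    # Gesture queries attend over expression keys/values, and vice versa.
    f_enh, _ = cross_fp(query=p, key=f, value=f)  # enhanced expression timing features
    p_enh, _ = cross_pf(query=f, key=p, value=p)  # enhanced gesture timing features
    # Weighted fusion into a single visual modality feature vector.
    fused = alpha_f * f_enh + alpha_p * p_enh
    return fused.mean(dim=1)                      # (batch, 256)
```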
Step 3: construct a new representation of the context information, obtain an emotion representation through the prompt-based emotion modeling technique PEMT, and feed the emotion representation into the text encoder SimCSE to obtain the text modality features;
Step 3.1, context modeling: extract the text records of the three utterances preceding the current utterance, so as to enhance the text record of the current utterance with historical utterance information and provide richer context for emotion recognition, and construct a new representation containing the context information as follows:
,
Step 3.2, text feature extraction: using the new context representation, the speaker and the text record, the prompt-based emotion modeling technique PEMT is proposed to capture the long-distance dependence between the speaker and the uttered words, yielding an emotion representation as follows:
,
The emotion representation is encoded by the text encoder SimCSE, and the representation of the special token <mask> is taken as the text modality feature, providing the key text-modality information for multimodal emotion recognition.
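A hedged sketch of steps 3.1-3.2 using a public SimCSE checkpoint; the exact prompt wording is an illustrative assumption, since the patent only requires the three-utterance context, the speaker and a <mask> slot whose encoded representation becomes the text modality feature.

```python
import torch
from transformers import AutoTokenizer, AutoModel

ckpt = "princeton-nlp/sup-simcse-roberta-base"  # assumed SimCSE checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
simcse = AutoModel.from_pretrained(ckpt).eval()

def text_modality_feature(history_texts, speaker, utterance):
    # Step 3.1: context from the three preceding utterances.
    context = " ".join(history_texts[-3:])
    # Step 3.2: prompt-style emotion representation with a <mask> slot (template assumed).
    prompt = f"{context} {speaker} says: {utterance} {speaker} feels {tok.mask_token}."
    enc = tok(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = simcse(**enc).last_hidden_state  # (1, seq_len, hidden)
    # The hidden state at the <mask> position is taken as the text modality feature.
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    return hidden[0, mask_pos].mean(dim=0)        # (hidden,)
```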
Step 4: first, sample the speech segment corresponding to each utterance at equal intervals; then feed the sampled data into the data vectorization model data2vec to extract speech features; finally, aggregate the speech features of all frames to obtain the speech modality features;
Step 4.1, speech signal segmentation: given the speech segment corresponding to each utterance, divide the continuous speech segment into frames used as speech data, expressed as follows:
,
where each element denotes one frame of speech data of the speech segment, and the frame index ranges from 1 to the total number of frames; in this embodiment the total number of frames is set to 8, and in other possible embodiments of the invention a different number may be set;
Step 4.2, speech feature extraction: feed each frame of speech data into the data vectorization model data2vec to obtain a speech feature representation, as follows:
,
Step 4.3, aggregating all frame features: use average pooling to aggregate all the speech feature representations into the speech modality feature of the utterance, as sketched below.
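A hedged sketch of step 4 with a public data2vec audio checkpoint. The checkpoint name and the per-chunk mean pooling are assumptions; the split into 8 frames and the final average pooling follow this embodiment.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Data2VecAudioModel

ckpt = "facebook/data2vec-audio-base-960h"  # assumed public checkpoint
fx = AutoFeatureExtractor.from_pretrained(ckpt)
data2vec = Data2VecAudioModel.from_pretrained(ckpt).eval()

def speech_modality_feature(waveform: np.ndarray, sr: int = 16000, n_frames: int = 8):
    """Split the utterance's speech into `n_frames` equal chunks, encode each
    chunk with data2vec, then average-pool everything into one feature vector."""
    feats = []
    for chunk in np.array_split(waveform, n_frames):
        inputs = fx(chunk, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            hidden = data2vec(**inputs).last_hidden_state  # (1, frames, hidden)
        feats.append(hidden.mean(dim=1))                   # (1, hidden)
    return torch.cat(feats, dim=0).mean(dim=0)             # (hidden,)
```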
Step 5: feed the visual modality features, text modality features and speech modality features into the skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output;
Step 5.1, splicing the modality features: concatenate the visual modality features and speech modality features to obtain a spliced feature vector, as follows:
,
Step 5.2, cross-modal attention computation: using the text modality features as the query and value and the spliced features as the key, calculate the weight score of each attention head by the following formula to obtain the output representation of each attention head:
,
where the attention-head index ranges over the number of heads, each head has its corresponding weight score, and the key/value vector dimension is used to scale the attention score;
the weight scores of all attention heads are concatenated and passed through a linear transformation matrix to obtain the final cross-modal fused attention output, calculated by the following formula:
,
On the one hand, the skip-connection multi-head attention module SMA fuses the multimodal information while its adaptive weights assign importance to each modality; on the other hand, the skip connection prevents the model from ignoring key emotional information in the different modalities, helping it capture and integrate subtle but important emotion cues across modalities. In this embodiment the number of attention heads is set to 3; in other possible embodiments of the invention a different number may be provided.
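A minimal sketch of the SMA module of step 5: the text feature serves as query and value, the concatenated visual and speech features as key, 3 attention heads as in this embodiment, and a skip connection adds the text feature back to the attention output. Treating each modality vector as a one-token sequence and the 768-dimensional feature size are assumptions.

```python
import torch
import torch.nn as nn

class SkipMultiheadAttention(nn.Module):
    """Sketch of the skip-connection multi-head attention module SMA."""

    def __init__(self, dim: int = 768, heads: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          kdim=2 * dim, vdim=dim, batch_first=True)

    def forward(self, visual, text, speech):
        # (batch, dim) -> (batch, 1, dim): each modality becomes a one-token sequence.
        v, t, a = visual.unsqueeze(1), text.unsqueeze(1), speech.unsqueeze(1)
        key = torch.cat([v, a], dim=-1)                # spliced visual + speech feature
        out, _ = self.attn(query=t, key=key, value=t)  # text as query and value
        return (out + t).squeeze(1)                    # skip connection to the text feature
```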
Step 6: apply a nonlinear transformation to the cross-modal fused attention output and feed the transformed features into the emotion classifier to generate the predicted probability distribution vector over emotion categories;
Step 6.1: a fully connected layer and a ReLU activation inside the emotion classifier apply the nonlinear transformation to the cross-modal fused attention output, yielding nonlinearly transformed features that enhance the model's ability to capture complex relationships and high-order interactions between features, calculated by the following formula:
,
where the fully connected layer has a weight matrix and a bias vector, and ReLU denotes the activation function;
Step 6.2, emotion classifier output: feed the nonlinearly transformed features into the emotion classifier and calculate the predicted probability distribution vector over emotion categories by the following formula:
,
where the classifier has a weight matrix and a bias term. The emotion classifier described above uses a fully connected layer and generates the predicted probability distribution over emotion categories with a softmax function.
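A minimal sketch of the emotion classifier of step 6; the hidden size of 256 and the 7-way output are assumptions, since the patent does not fix the number of emotion categories.

```python
import torch.nn as nn

# Fully connected layer with ReLU, then a linear classifier with softmax.
# For training one would normally keep the logits and use CrossEntropyLoss;
# the softmax here mirrors the predicted probability distribution of step 6.2.
emotion_classifier = nn.Sequential(
    nn.Linear(768, 256),  # nonlinear transformation: weight matrix and bias
    nn.ReLU(),
    nn.Linear(256, 7),    # classifier weight matrix and bias term
    nn.Softmax(dim=-1),   # predicted probability distribution over emotion categories
)
```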
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, which are all within the protection of the present invention.

Claims (7)

CN202411833608.2A | 2024-12-13 | 2024-12-13 | A multimodal conversation emotion recognition method | Active | CN119293740B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411833608.2A (CN119293740B) | 2024-12-13 | 2024-12-13 | A multimodal conversation emotion recognition method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411833608.2A (CN119293740B) | 2024-12-13 | 2024-12-13 | A multimodal conversation emotion recognition method

Publications (2)

Publication Number | Publication Date
CN119293740A (en) | 2025-01-10
CN119293740B | 2025-03-07

Family

ID=94158057

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411833608.2A | A multimodal conversation emotion recognition method (CN119293740B, Active) | 2024-12-13 | 2024-12-13

Country Status (1)

Country | Link
CN | CN119293740B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120075544A (en)* | 2025-04-25 | 2025-05-30 | Communication University of China | Sign language broadcasting video generation method, system and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102470273A (en)* | 2009-07-09 | 2012-05-23 | Microsoft Corporation | Visual representation expression based on player expression
CN106919251A (en)* | 2017-01-09 | 2017-07-04 | Chongqing University of Posts and Telecommunications | A collaborative virtual learning environment natural interaction method based on multimodal emotion recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR102697345B1 (en)* | 2018-09-28 | 2024-08-23 | Samsung Electronics Co., Ltd. | An electronic device and method for obtaining emotional information
CN113935435B (en)* | 2021-11-17 | 2025-01-17 | Nanjing University of Posts and Telecommunications | Multimodal emotion recognition method based on spatiotemporal feature fusion
CN114944002B (en)* | 2022-06-16 | 2024-04-16 | University of Science and Technology of China | Text description-assisted gesture-aware facial expression recognition method


Also Published As

Publication number | Publication date
CN119293740A (en) | 2025-01-10


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
