CN119293740B - A multimodal conversation emotion recognition method - Google Patents

A multimodal conversation emotion recognition method

Info

Publication number
CN119293740B
CN119293740B (application CN202411833608.2A)
Authority
CN
China
Prior art keywords
emotion
gesture
expression
features
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411833608.2A
Other languages
Chinese (zh)
Other versions
CN119293740A (en)
Inventor
蔡创新
潘志庚
邹宇
林贤煊
夏先亮
张考
胡志华
李昌利
张婧
王爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202411833608.2A
Publication of CN119293740A
Application granted
Publication of CN119293740B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention discloses a multimodal dialogue emotion recognition method, which relates to the technical fields of multimodal emotion recognition and human-computer interaction. The method obtains expression timing features and gesture timing features with a face recognition model and a gesture recognition tool, respectively; an attention module adaptively weights and fuses the expression and gesture features to obtain fused visual modality features; a new representation of the context information is constructed and an emotion representation is obtained with a prompt-based emotion modeling technique, from which text modality features are extracted by a text encoder; speech modality features of each speaker's utterances are extracted with a data vectorization model; a skip-connection multi-head attention cross-modal fusion method is proposed to align and fuse the multimodal features across modalities; and emotion recognition is then performed by an emotion classifier module. The method effectively alleviates the insufficient recognition of key emotion cues and the insufficient fusion found in traditional multimodal emotion recognition, and improves the accuracy and robustness of emotion recognition.

Description

Multimodal dialogue emotion recognition method
Technical Field
The invention relates to the technical fields of multimodal emotion recognition and human-computer interaction, and in particular to a multimodal dialogue emotion recognition method.
Background
Emotion recognition automatically perceives, recognizes, understands and provides feedback about human emotion, and spans computer science, neuroscience, psychology and social science. It is essential for natural, human-like and personalized human-computer interaction, and is of great value as the intelligent and digital era arrives. Emotion recognition has wide application, including intelligent assistants, virtual reality and intelligent healthcare. Traditional emotion recognition methods mainly depend on data sources from a single modality (such as speech, text or facial expression) and can hardly reflect the full complexity of human emotion.
In practical applications, emotion is often conveyed through multiple signals, such as facial expression, body posture and voice intonation. These modalities complement one another and together convey a complete emotional state. Conventional dialogue emotion recognition methods face the following challenges: (1) relying on a single source of emotion information ignores the multimodal nature of emotion expression and leads to low recognition accuracy; (2) existing multimodal methods cannot accurately integrate information such as expressions and gestures into a unified emotion output; and (3) in multi-party, multi-turn dialogue scenes, the many participants and complex interactions mean that different emotion cues carry different importance at different moments. Traditional multimodal emotion recognition methods lack a flexible attention mechanism and struggle to highlight key emotion cues, which degrades recognition performance.
Disclosure of Invention
The invention aims to provide a multimodal dialogue emotion recognition method in which visual modality features are generated by combining expression features and gesture features, the visual, text and speech modality features are aligned and fused across modalities, and the fused features are fed into an emotion classifier module for emotion recognition. The method effectively alleviates the insufficient recognition of key emotion cues and the insufficient fusion found in traditional multimodal emotion recognition, and improves the accuracy and robustness of emotion recognition. The invention is realized by the following technical scheme.
The invention provides a multimodal dialogue emotion recognition method, which comprises the following steps:
Constructing a multimodal interactive dialogue list U;
Obtaining expression timing features and gesture timing features, respectively, from a video segment of the multimodal interactive dialogue list U by means of the face recognition model SFace and the gesture recognition tool MediaPipe;
Fusing the expression timing features and gesture timing features with a face-pose attention module FPA to obtain the final visual modality features;
Constructing a new representation of the context information, obtaining an emotion representation through the prompt-based emotion modeling technique PEMT, and feeding the emotion representation into the text encoder SimCSE to obtain text modality features;
Sampling at equal intervals the speech segment corresponding to each utterance of the multimodal interactive dialogue list, feeding the sampled data into the data vectorization model data2vec to extract speech features, and aggregating the speech features of all frames to obtain speech modality features;
Feeding the visual modality features, text modality features and speech modality features into a skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output;
Applying a nonlinear transformation to the cross-modal fused attention output and feeding the transformed features into an emotion classifier to generate a predicted probability distribution vector over emotion categories.
In practical application, conventional dialogue emotion recognition methods rely on a single source of emotion information and cannot accurately integrate information such as expressions and gestures into a unified emotion output; this problem is addressed by extracting expression timing features and gesture timing features with the corresponding models and fusing them into visual modality features. During this fusion, several attention mechanisms allow the model to focus on the more important emotional timing information and to strengthen the feature fusion, which addresses the problem that traditional multimodal emotion recognition methods lack a flexible attention mechanism and struggle to highlight key emotion cues, degrading recognition performance.
Through cross-modal alignment and fusion of the visual, text and speech modality features, the invention achieves a multidimensional representation and deep understanding of emotional information. The text modality features capture emotion cues in the language content, such as word choice, sentence structure and implicit semantics, providing rich context for emotion recognition, while the speech modality reflects the dynamics of emotion through acoustic parameters such as intonation, speaking rate and volume, compensating for details that plain text analysis can hardly capture. Fusing the three not only makes emotion recognition more comprehensive but also markedly improves the robustness and accuracy of the system.
In addition, the finally generated predicted emotion probability distribution improves the interpretability and transparency of the emotion recognition system. The user can observe the model's confidence for each category, which helps support decision making. In practical applications, especially in fields such as healthcare, education and psychological counseling, this probability information helps professionals better understand the emotional state of the person being analyzed and supports more targeted intervention or advice.
Optionally, constructing the multimodal interactive dialogue list U comprises collecting multimodal dialogue data from multi-turn dialogues involving multiple participants, preprocessing the multimodal dialogue data, and constructing the multimodal interactive dialogue list,
where the multimodal interactive dialogue list comprises a plurality of utterances, and each utterance comprises a text record, a video segment and a speech segment;
the serial number of an utterance, which may also be called a dialogue unit, ranges from 1 to the total number of utterances in the whole multimodal interactive dialogue list, and each utterance has a corresponding speaker.
Optionally, obtaining the expression timing features and gesture timing features, respectively, from a video segment of the multimodal interactive dialogue list with the face recognition model SFace and the gesture recognition tool MediaPipe comprises the following steps:
recognizing the faces in the video segment with the face recognition model SFace and extracting the face regions to form a sequence of face images, where each element of the sequence is the face image of one frame;
performing human pose recognition on the same video segment with the gesture recognition tool MediaPipe and extracting the body regions to form a sequence of pose images, where each element of the sequence is the pose image of one frame;
feeding the face image sequence and the pose image sequence into the spatio-temporal feature extraction model TimeSformer to extract the expression timing features and gesture timing features, respectively, capturing facial expression dynamics and posture changes, as follows:
,
Optionally, fusing the expression timing features and gesture timing features with the face-pose attention module FPA comprises:
encoding the expression timing features and gesture timing features to obtain rich timing information, where the encoded expression timing features and gesture timing features are obtained through an MLP nonlinear transformation:
,
,
where the two outputs respectively denote the encoded expression timing features and gesture timing features;
adaptively weighting the encoded expression timing features and gesture timing features through the following formulas to obtain an expression attention coefficient and a gesture attention coefficient:
,
,
where each coefficient is computed from the weight matrix and bias vector associated with the encoded expression timing features or the encoded gesture timing features, respectively;
to improve the emotion representation capability of expressions and gestures, a multi-head attention mechanism is used to enhance the emotion characterization of the encoded expression timing features and gesture timing features, with the multi-head attention computed by the following formulas:
,
,
where the two outputs respectively denote the weight coefficients of the expression timing features and gesture timing features obtained by the multi-head attention mechanism, and the projection matrices involved are all learnable parameters. Applying multi-head attention to the encoded expression timing features and gesture timing features separately allows, in practical applications, the multimodal dialogue emotion recognition model to adaptively attend to the timing information that matters most for emotion.
Optionally, fusing the expression timing features and gesture timing features with the face-pose attention module FPA further comprises:
using the encoded gesture timing features as the query Q and the encoded expression timing features as the key K and value V, obtaining the enhanced expression timing features by the following formula:
,
using the encoded expression timing features as the query Q and the encoded gesture timing features as the key K and value V, obtaining the enhanced gesture timing features by the following formula:
The cross-attention mechanism realizes information interaction between the encoded expression timing features and gesture timing features, which further strengthens the feature fusion.
The invention uses the face-pose attention module FPA to fuse the expression timing features and gesture timing features. The face-pose attention module FPA comprises a feature encoding layer, an attention generation layer, an attention calibration layer, a cross-attention layer and a fusion layer connected in sequence. First, the expression timing features and gesture timing features are fed into the feature encoding layer to obtain the encoded expression timing features and gesture timing features. Second, the encoded expression timing features and gesture timing features are fed into the attention generation layer to obtain the expression attention coefficient and gesture attention coefficient. Then, the encoded expression timing features and gesture timing features are fed into the attention calibration layer to obtain the weight coefficients of the expression timing features and gesture timing features. Next, the weight coefficients are fed into the cross-attention layer to obtain the enhanced expression timing features and enhanced gesture timing features. Finally, the expression attention coefficient, the gesture attention coefficient, the enhanced expression timing features and the enhanced gesture timing features are fed into the fusion layer to fuse the expression and gesture features and obtain the final visual modality features. The specific steps are as follows:
Optionally, fusing the expression timing features and gesture timing features with the face-pose attention module FPA to obtain the final visual modality features comprises the following step:
fusing the enhanced expression timing features and enhanced gesture timing features in a weighted manner, using the expression and gesture attention coefficients, to obtain the final visual modality features.
Optionally, constructing a new representation of the context information, obtaining an emotion representation through the prompt-based emotion modeling technique PEMT, and feeding the emotion representation into the text encoder SimCSE to obtain the text modality features comprises the following steps:
context modeling: extracting the text records of the three utterances preceding the current utterance, so as to enhance the text record of the current utterance with historical utterance information and provide richer context for emotion recognition, and constructing a new representation containing the context information as follows:
,
text feature extraction: using the new context representation, the speaker and the text record, the prompt-based emotion modeling technique PEMT is proposed to capture the long-distance dependence between the speaker and the uttered words, yielding an emotion representation as follows:
,
The emotion representation is encoded by the text encoder SimCSE, and the representation of the special token <mask> is taken as the text modality feature, providing the key text-modality information for multimodal emotion recognition.
Optionally, sampling at equal intervals the speech segment corresponding to each utterance of the multimodal interactive dialogue list, feeding the sampled data into the data vectorization model data2vec to extract speech features, and aggregating the speech features of all frames to obtain the speech modality features comprises the following steps:
speech signal segmentation: given the speech segment corresponding to each utterance, dividing the continuous speech segment into frames used as speech data, expressed as follows:
,
where each element denotes one frame of speech data of the speech segment, and the frame index ranges from 1 to the total number of frames;
speech feature extraction: feeding each frame of speech data into the data vectorization model data2vec to obtain a speech feature representation, as follows:
,
aggregating all frame features: using average pooling to aggregate all the speech feature representations into the speech modality feature of the utterance.
Optionally, feeding the visual modality features, text modality features and speech modality features into the skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output comprises the following steps:
splicing the modality features: concatenating the visual modality features and speech modality features to obtain a spliced feature vector, as follows:
,
cross-modal attention computation: using the text modality features as the query and value and the spliced features as the key, calculating the weight score of each attention head by the following formula to obtain the output representation of each attention head:
,
where the attention-head index ranges over the number of heads, each head has its corresponding weight score, and the key/value vector dimension is used to scale the attention score;
the weight scores of all attention heads are concatenated and passed through a linear transformation matrix to obtain the final cross-modal fused attention output.
On the one hand, the skip-connection multi-head attention module SMA fuses the multimodal information while its adaptive weights assign importance to each modality; on the other hand, the skip connection prevents the model from ignoring key emotional information in the different modalities, helping it capture and integrate subtle but important emotion cues across modalities.
Optionally, applying a nonlinear transformation to the cross-modal fused attention output and feeding the transformed features into the emotion classifier to generate the predicted probability distribution vector of emotion categories comprises the following steps:
nonlinear transformation: a fully connected layer and a ReLU activation inside the emotion classifier transform the cross-modal fused attention output into nonlinearly transformed features, enhancing the model's ability to capture complex relationships and high-order interactions between features, calculated by the following formula:
,
where the fully connected layer has a weight matrix and a bias vector, and ReLU denotes the activation function;
emotion classifier output: the nonlinearly transformed features are fed into the emotion classifier, and the predicted probability distribution vector over emotion categories is calculated by the following formula:
,
where the classifier has a weight matrix and a bias term. The emotion classifier described above uses a fully connected layer and generates the predicted probability distribution over emotion categories with a softmax function.
Advantageous effects
(1) The invention achieves efficient fusion of multimodal information by combining attention over expressions and gestures. The method weights different emotion cues according to their importance, so that the emotion recognition model can dynamically capture core emotional features, which markedly improves recognition accuracy. Meanwhile, in practical use, the attention mechanism lets the multimodal dialogue emotion recognition model automatically focus on the expression and gesture information that is emotionally significant in the dialogue and filter out irrelevant or noisy information, improving the accuracy and robustness of emotion recognition.
(2) The predicted emotion probability distribution finally produced by the method not only gives the emotion classification model's final judgment on a certain emotion category but also reveals the likelihood of the other categories. Examining the probability of each category reveals the ambiguity of the input emotion data and possible mixtures of emotions. For example, when a data point has high probability in both the anger and sadness categories, this can assist the user in performing a finer-grained emotion analysis and reflect the true emotional state more accurately.
Drawings
FIG. 1 is a schematic flow chart of a multimodal dialogue emotion recognition method according to an embodiment of the present invention.
Detailed Description
Further description is provided below in connection with the drawings and the specific embodiments. In the description of the present invention, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature.
Example 1
This embodiment provides a multimodal dialogue emotion recognition method, which comprises the following steps:
Constructing a multimodal interactive dialogue list U;
Obtaining expression timing features and gesture timing features, respectively, from a video segment of the multimodal interactive dialogue list by means of the face recognition model SFace and the gesture recognition tool MediaPipe;
Fusing the expression timing features and gesture timing features with the face-pose attention module FPA to obtain the final visual modality features;
Constructing a new representation of the context information, obtaining an emotion representation through the prompt-based emotion modeling technique PEMT, and feeding the emotion representation into the text encoder SimCSE to obtain the text modality features;
Sampling at equal intervals the speech segment corresponding to each utterance of the multimodal interactive dialogue list, feeding the sampled data into the data vectorization model data2vec to extract speech features, and aggregating the speech features of all frames to obtain the speech modality features;
Feeding the visual modality features, text modality features and speech modality features into the skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output;
Applying a nonlinear transformation to the cross-modal fused attention output and feeding the transformed features into the emotion classifier to generate the predicted probability distribution vector over emotion categories.
In practical application, conventional dialogue emotion recognition methods rely on a single source of emotion information and cannot accurately integrate information such as expressions and gestures into a unified emotion output; this problem is addressed by extracting expression timing features and gesture timing features with the corresponding models and fusing them into visual modality features. During this fusion, several attention mechanisms allow the model to focus on the more important emotional timing information and to strengthen the feature fusion, which addresses the problem that traditional multimodal emotion recognition methods lack a flexible attention mechanism and struggle to highlight key emotion cues, degrading recognition performance.
Through cross-modal alignment and fusion of the visual, text and speech modality features, the invention achieves a multidimensional representation and deep understanding of emotional information. In addition, the finally generated predicted emotion probability distribution improves the interpretability and transparency of the emotion recognition system: the user can observe the model's confidence for each category, which helps support decision making. In practical applications, especially in fields such as healthcare, education and psychological counseling, this probability information helps professionals better understand the emotional state of the person being analyzed and supports more targeted intervention or advice.
Example 2
On the basis of embodiment 1, this embodiment describes a specific implementation procedure of a multimodal dialog emotion recognition method, as shown in fig. 1, which specifically includes the following contents:
1. Data preprocessing
Step 1: first, collect the multimodal dialogue data of multi-turn dialogues involving multiple participants, preprocess the data, and finally construct the multimodal interactive dialogue list;
Step 1.1: collect all the multimodal dialogue data involving multiple participants, including the text modality data, speech modality data, video modality data and the speaker name corresponding to each utterance;
Step 1.2: apply standardized preprocessing to the text modality data of each utterance, including removing stop words, correcting spelling errors and sentence segmentation, to ensure the continuity and accuracy of the text content and obtain the text record corresponding to each utterance;
Step 1.3: clean and normalize the speech modality data of each utterance, including noise reduction and equalization, to improve the clarity and consistency of the speech and obtain the speech segment corresponding to each utterance;
Step 1.4: preprocess the video modality data corresponding to each utterance and extract 15 key frames by equidistant sampling to construct a simplified video clip, reducing data redundancy and improving processing efficiency, and finally obtain the video segment corresponding to each utterance, as sketched below;
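The equidistant key-frame sampling in step 1.4 can be illustrated with a short Python sketch. The use of OpenCV, the helper name `sample_keyframes` and the fixed count of 15 frames per clip are illustrative assumptions, not requirements stated by the patent.

```python
import cv2
import numpy as np

def sample_keyframes(video_path: str, num_frames: int = 15):
    """Extract `num_frames` evenly spaced key frames from one video segment."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip (equidistant sampling).
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # list of H x W x 3 RGB arrays
```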
Building the multimodal interactive dialogue list comprises collecting multimodal dialogue data from multi-turn dialogues involving multiple participants, preprocessing the multimodal dialogue data, and constructing the multimodal interactive dialogue list,
where the multimodal interactive dialogue list comprises a plurality of utterances, and each utterance comprises a text record, a video segment and a speech segment;
the serial number of an utterance, which may also be called a dialogue unit, ranges from 1 to the total number of utterances in the whole multimodal interactive dialogue list, and each utterance has a corresponding speaker.
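A minimal sketch of how the multimodal interactive dialogue list could be held in memory is given below; the class name, field names and example values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    speaker: str        # speaker name for this utterance
    text: str           # preprocessed text record
    video_frames: list  # sampled key frames of the video segment
    audio_path: str     # cleaned speech segment

# The dialogue list is the ordered sequence of utterances of one conversation.
dialogue_list: List[Utterance] = [
    Utterance("A", "I can't believe this happened!", [], "utt_001.wav"),
    Utterance("B", "Calm down, let's talk it through.", [], "utt_002.wav"),
]
```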
2. Feature multimodal fusion
Step 2: first, perform face recognition and human body recognition on the video segment with the face recognition model SFace and the gesture recognition tool MediaPipe to obtain the image sequences of the face region and the body region; then extract features from them with the spatio-temporal feature extraction model TimeSformer to obtain the expression timing features and gesture timing features, respectively; finally, a face-pose attention module FPA is proposed to fuse the expression timing features and gesture timing features into the final visual modality features;
Step 2.1: extract the expression timing features and gesture timing features;
Step 2.1.1, face sequence: recognize the faces in the video segment with the face recognition model SFace and extract the face regions to form a sequence of face images, where each element of the sequence is the face image of one frame;
Step 2.1.2, gesture sequence: perform human pose recognition on the same video segment with the gesture recognition tool MediaPipe and extract the body regions to form a sequence of pose images, where each element of the sequence is the pose image of one frame;
Step 2.1.3, timing feature extraction: feed the face image sequence and the pose image sequence into the spatio-temporal feature extraction model TimeSformer to extract the expression timing features and gesture timing features, respectively, capturing facial expression dynamics and posture changes, as follows:
,
,
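A hedged sketch of the timing-feature extraction in step 2.1.3, using a publicly available TimeSformer checkpoint from the transformers library; the checkpoint name, the mean pooling of the token sequence and the 8-frame clip length (a subset of the 15 key frames) are assumptions, since the patent does not specify them.

```python
import torch
from transformers import AutoImageProcessor, TimesformerModel

ckpt = "facebook/timesformer-base-finetuned-k400"  # assumed public checkpoint
processor = AutoImageProcessor.from_pretrained(ckpt)
timesformer = TimesformerModel.from_pretrained(ckpt).eval()

def timing_features(frames):
    """frames: list of RGB arrays whose length matches the checkpoint's
    expected clip length (8 for this checkpoint)."""
    inputs = processor(frames, return_tensors="pt")  # -> pixel_values
    with torch.no_grad():
        out = timesformer(**inputs)
    # Mean-pool the spatio-temporal tokens into one timing feature vector.
    return out.last_hidden_state.mean(dim=1)         # shape (1, hidden_size)

# The same routine is applied once to the cropped face sequence (expression
# timing features) and once to the cropped body sequence (gesture timing features).
```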
Step 2.2: encode the expression timing features and gesture timing features to obtain rich timing information, where the encoded expression timing features and gesture timing features are obtained through an MLP nonlinear transformation:
,
,
where the two outputs respectively denote the encoded expression timing features and gesture timing features;
Step 2.3: adaptively weight the encoded expression timing features and gesture timing features through the following formulas to obtain an expression attention coefficient and a gesture attention coefficient:
,
,
where each coefficient is computed from the weight matrix and bias vector associated with the encoded expression timing features or the encoded gesture timing features, respectively;
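One plausible PyTorch reading of steps 2.2 and 2.3: each stream is encoded by its own MLP, and a per-stream linear layer with a sigmoid produces the attention coefficient. The sigmoid gate, layer sizes and class name are assumptions; the patent only states that a weight matrix and a bias vector are involved for each stream.

```python
import torch
import torch.nn as nn

class ExpressionGestureGate(nn.Module):
    """Sketch of steps 2.2-2.3: MLP encoding plus adaptive attention coefficients."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.enc_f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.enc_p = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate_f = nn.Linear(dim, 1)  # weight matrix and bias for the expression stream
        self.gate_p = nn.Linear(dim, 1)  # weight matrix and bias for the gesture stream

    def forward(self, expr_feat, pose_feat):
        f = self.enc_f(expr_feat)                # encoded expression timing features
        p = self.enc_p(pose_feat)                # encoded gesture timing features
        alpha_f = torch.sigmoid(self.gate_f(f))  # expression attention coefficient
        alpha_p = torch.sigmoid(self.gate_p(p))  # gesture attention coefficient
        return f, p, alpha_f, alpha_p
```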
Step 2.4: to improve the emotion representation capability of expressions and gestures, use a multi-head attention mechanism to enhance the emotion characterization of the encoded expression timing features and gesture timing features, with the multi-head attention computed by the following formulas:
,
,
where the two outputs respectively denote the weight coefficients of the expression timing features and gesture timing features obtained by the multi-head attention mechanism, and the projection matrices involved are all learnable parameters. Applying multi-head attention to the encoded expression timing features and gesture timing features separately allows the multimodal dialogue emotion recognition model to adaptively attend, in practical applications, to the timing information that matters most for emotion;
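A minimal sketch of the per-stream multi-head attention enhancement in step 2.4; the embedding size of 256 and the 4 heads are assumptions.

```python
import torch.nn as nn

# Each stream gets its own multi-head self-attention over the time dimension.
mha_expr = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
mha_pose = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

def enhance(f, p):
    """f, p: (batch, time, 256) encoded expression / gesture timing features."""
    w_f, _ = mha_expr(f, f, f)  # weight coefficients of the expression stream
    w_p, _ = mha_pose(p, p, p)  # weight coefficients of the gesture stream
    return w_f, w_p
```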
Step 2.5: using the encoded gesture timing features as the query Q and the encoded expression timing features as the key K and value V, obtain the enhanced expression timing features by the following formula:
,
using the encoded expression timing features as the query Q and the encoded gesture timing features as the key K and value V, obtain the enhanced gesture timing features by the following formula:
The cross-attention mechanism realizes information interaction between the encoded expression timing features and gesture timing features, which further strengthens the feature fusion;
Step 2.6: fuse the enhanced expression timing features and enhanced gesture timing features in a weighted manner, using the expression and gesture attention coefficients, to obtain the final visual modality features, as sketched below.
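A sketch of steps 2.5 and 2.6: bidirectional cross-attention between the two encoded streams, followed by a weighted fusion with the attention coefficients from step 2.3. The temporal mean pooling at the end and the layer sizes are assumptions.

```python
import torch.nn as nn

cross_fp = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
cross_pf = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

def fuse_visual(f, p, alpha_f, alpha_p):
    """f, p: encoded expression / gesture features, shape (batch, time, 256);
    alpha_f, alpha_p: attention coefficients from the gating step."""
    # Gesture queries attend over expression keys/values, and vice versa.
    f_enh, _ = cross_fp(query=p, key=f, value=f)  # enhanced expression timing features
    p_enh, _ = cross_pf(query=f, key=p, value=p)  # enhanced gesture timing features
    # Weighted fusion into a single visual modality feature vector.
    fused = alpha_f * f_enh + alpha_p * p_enh
    return fused.mean(dim=1)                      # (batch, 256)
```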
Step 3: construct a new representation of the context information, obtain an emotion representation through the prompt-based emotion modeling technique PEMT, and feed the emotion representation into the text encoder SimCSE to obtain the text modality features;
Step 3.1, context modeling: extract the text records of the three utterances preceding the current utterance, so as to enhance the text record of the current utterance with historical utterance information and provide richer context for emotion recognition, and construct a new representation containing the context information as follows:
,
Step 3.2, text feature extraction: using the new context representation, the speaker and the text record, the prompt-based emotion modeling technique PEMT is proposed to capture the long-distance dependence between the speaker and the uttered words, yielding an emotion representation as follows:
,
The emotion representation is encoded by the text encoder SimCSE, and the representation of the special token <mask> is taken as the text modality feature, providing the key text-modality information for multimodal emotion recognition.
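A hedged sketch of steps 3.1-3.2 using a public SimCSE checkpoint; the exact prompt wording is an illustrative assumption, since the patent only requires the three-utterance context, the speaker and a <mask> slot whose encoded representation becomes the text modality feature.

```python
import torch
from transformers import AutoTokenizer, AutoModel

ckpt = "princeton-nlp/sup-simcse-roberta-base"  # assumed SimCSE checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
simcse = AutoModel.from_pretrained(ckpt).eval()

def text_modality_feature(history_texts, speaker, utterance):
    # Step 3.1: context from the three preceding utterances.
    context = " ".join(history_texts[-3:])
    # Step 3.2: prompt-style emotion representation with a <mask> slot (template assumed).
    prompt = f"{context} {speaker} says: {utterance} {speaker} feels {tok.mask_token}."
    enc = tok(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = simcse(**enc).last_hidden_state  # (1, seq_len, hidden)
    # The hidden state at the <mask> position is taken as the text modality feature.
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    return hidden[0, mask_pos].mean(dim=0)        # (hidden,)
```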
Step 4: first, sample the speech segment corresponding to each utterance at equal intervals; then feed the sampled data into the data vectorization model data2vec to extract speech features; finally, aggregate the speech features of all frames to obtain the speech modality features;
Step 4.1, speech signal segmentation: given the speech segment corresponding to each utterance, divide the continuous speech segment into frames used as speech data, expressed as follows:
,
where each element denotes one frame of speech data of the speech segment, and the frame index ranges from 1 to the total number of frames; in this embodiment the total number of frames is set to 8, and in other possible embodiments of the invention a different number may be set;
Step 4.2, speech feature extraction: feed each frame of speech data into the data vectorization model data2vec to obtain a speech feature representation, as follows:
,
Step 4.3, aggregating all frame features: use average pooling to aggregate all the speech feature representations into the speech modality feature of the utterance, as sketched below.
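A hedged sketch of step 4 with a public data2vec audio checkpoint. The checkpoint name and the per-chunk mean pooling are assumptions; the split into 8 frames and the final average pooling follow this embodiment.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Data2VecAudioModel

ckpt = "facebook/data2vec-audio-base-960h"  # assumed public checkpoint
fx = AutoFeatureExtractor.from_pretrained(ckpt)
data2vec = Data2VecAudioModel.from_pretrained(ckpt).eval()

def speech_modality_feature(waveform: np.ndarray, sr: int = 16000, n_frames: int = 8):
    """Split the utterance's speech into `n_frames` equal chunks, encode each
    chunk with data2vec, then average-pool everything into one feature vector."""
    feats = []
    for chunk in np.array_split(waveform, n_frames):
        inputs = fx(chunk, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            hidden = data2vec(**inputs).last_hidden_state  # (1, frames, hidden)
        feats.append(hidden.mean(dim=1))                   # (1, hidden)
    return torch.cat(feats, dim=0).mean(dim=0)             # (hidden,)
```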
Step 5: feed the visual modality features, text modality features and speech modality features into the skip-connection multi-head attention module SMA to fuse the multimodal information and obtain a cross-modal fused attention output;
Step 5.1, splicing the modality features: concatenate the visual modality features and speech modality features to obtain a spliced feature vector, as follows:
,
Step 5.2, cross-modal attention computation: using the text modality features as the query and value and the spliced features as the key, calculate the weight score of each attention head by the following formula to obtain the output representation of each attention head:
,
where the attention-head index ranges over the number of heads, each head has its corresponding weight score, and the key/value vector dimension is used to scale the attention score;
the weight scores of all attention heads are concatenated and passed through a linear transformation matrix to obtain the final cross-modal fused attention output, calculated by the following formula:
,
On the one hand, the skip-connection multi-head attention module SMA fuses the multimodal information while its adaptive weights assign importance to each modality; on the other hand, the skip connection prevents the model from ignoring key emotional information in the different modalities, helping it capture and integrate subtle but important emotion cues across modalities. In this embodiment the number of attention heads is set to 3; in other possible embodiments of the invention a different number may be provided.
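A minimal sketch of the SMA module of step 5: the text feature serves as query and value, the concatenated visual and speech features as key, 3 attention heads as in this embodiment, and a skip connection adds the text feature back to the attention output. Treating each modality vector as a one-token sequence and the 768-dimensional feature size are assumptions.

```python
import torch
import torch.nn as nn

class SkipMultiheadAttention(nn.Module):
    """Sketch of the skip-connection multi-head attention module SMA."""

    def __init__(self, dim: int = 768, heads: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          kdim=2 * dim, vdim=dim, batch_first=True)

    def forward(self, visual, text, speech):
        # (batch, dim) -> (batch, 1, dim): each modality becomes a one-token sequence.
        v, t, a = visual.unsqueeze(1), text.unsqueeze(1), speech.unsqueeze(1)
        key = torch.cat([v, a], dim=-1)                # spliced visual + speech feature
        out, _ = self.attn(query=t, key=key, value=t)  # text as query and value
        return (out + t).squeeze(1)                    # skip connection to the text feature
```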
Step 6: apply a nonlinear transformation to the cross-modal fused attention output and feed the transformed features into the emotion classifier to generate the predicted probability distribution vector over emotion categories;
Step 6.1: a fully connected layer and a ReLU activation inside the emotion classifier apply the nonlinear transformation to the cross-modal fused attention output, yielding nonlinearly transformed features that enhance the model's ability to capture complex relationships and high-order interactions between features, calculated by the following formula:
,
where the fully connected layer has a weight matrix and a bias vector, and ReLU denotes the activation function;
Step 6.2, emotion classifier output: feed the nonlinearly transformed features into the emotion classifier and calculate the predicted probability distribution vector over emotion categories by the following formula:
,
where the classifier has a weight matrix and a bias term. The emotion classifier described above uses a fully connected layer and generates the predicted probability distribution over emotion categories with a softmax function.
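A minimal sketch of the emotion classifier of step 6; the hidden size of 256 and the 7-way output are assumptions, since the patent does not fix the number of emotion categories.

```python
import torch.nn as nn

# Fully connected layer with ReLU, then a linear classifier with softmax.
# For training one would normally keep the logits and use CrossEntropyLoss;
# the softmax here mirrors the predicted probability distribution of step 6.2.
emotion_classifier = nn.Sequential(
    nn.Linear(768, 256),  # nonlinear transformation: weight matrix and bias
    nn.ReLU(),
    nn.Linear(256, 7),    # classifier weight matrix and bias term
    nn.Softmax(dim=-1),   # predicted probability distribution over emotion categories
)
```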
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, which are all within the protection of the present invention.

Claims (7)

CN202411833608.2A | 2024-12-13 | 2024-12-13 | A multimodal conversation emotion recognition method | Active | CN119293740B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411833608.2A (CN119293740B) | 2024-12-13 | 2024-12-13 | A multimodal conversation emotion recognition method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411833608.2A (CN119293740B) | 2024-12-13 | 2024-12-13 | A multimodal conversation emotion recognition method

Publications (2)

Publication Number | Publication Date
CN119293740A (en) | 2025-01-10
CN119293740B | 2025-03-07

Family

ID=94158057

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411833608.2A | A multimodal conversation emotion recognition method (CN119293740B, Active) | 2024-12-13 | 2024-12-13

Country Status (1)

Country | Link
CN | CN119293740B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120075544A (en)* | 2025-04-25 | 2025-05-30 | Communication University of China | Sign language broadcasting video generation method, system and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102470273A (en)* | 2009-07-09 | 2012-05-23 | Microsoft Corporation | Visual representation expression based on player expression
CN106919251A (en)* | 2017-01-09 | 2017-07-04 | Chongqing University of Posts and Telecommunications | A collaborative virtual learning environment natural interaction method based on multimodal emotion recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR102697345B1 (en)* | 2018-09-28 | 2024-08-23 | Samsung Electronics Co., Ltd. | An electronic device and method for obtaining emotional information
CN113935435B (en)* | 2021-11-17 | 2025-01-17 | Nanjing University of Posts and Telecommunications | Multimodal emotion recognition method based on spatiotemporal feature fusion
CN114944002B (en)* | 2022-06-16 | 2024-04-16 | University of Science and Technology of China | Text description-assisted gesture-aware facial expression recognition method


Also Published As

Publication number | Publication date
CN119293740A (en) | 2025-01-10


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
