CN119206424B - Intent recognition method and system based on multi-modal fusion of voice and sight - Google Patents

Intent recognition method and system based on multi-modal fusion of voice and sight

Info

Publication number
CN119206424B
Authority
CN
China
Prior art keywords
features
sight
fusion
feature
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411730078.9A
Other languages
Chinese (zh)
Other versions
CN119206424A (en)
Inventor
孟巍
吴雪霞
宗振国
郭腾炫
孔鹏
朱伟义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Marketing Service Center of State Grid Shandong Electric Power Co Ltd
Original Assignee
Marketing Service Center of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Marketing Service Center of State Grid Shandong Electric Power Co Ltd
Priority to CN202411730078.9A
Publication of CN119206424A
Application granted
Publication of CN119206424B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention relates to an intention recognition method and system based on multi-modal fusion of voice and sight, and belongs to the technical field of intention recognition. The method comprises feature extraction, multi-modal representation, multi-modal fusion and intention recognition. Feature extraction extracts text, voice and sight features from speech and faces. The multi-modal representation comprises 1) a modality-shared representation and 2) a modality-specific representation. The multi-modal fusion comprises 3) intra-modal fusion and 4) cross-modal fusion. Intention recognition inputs the final fused features into a multi-layer perceptron followed by a softmax layer that outputs the classification result. During training of the sight model, one subject in the training set is randomly selected in each training step and a meta-learning method, i.e., a first-order gradient algorithm, is applied to update the model, which alleviates the over-fitting problem and lets the model optimize new parameters. The invention realizes appearance-independent gaze estimation and extraction of key facial features by applying a sight feature extraction method based on full-face appearance and a random-identity adversarial network.

Description

Intent recognition method and system based on multi-modal fusion of voice and sight
Technical Field
The invention relates to an intention recognition method and system based on multi-modal fusion of voice and sight, and belongs to the technical field of intention recognition.
Background
In the digital marketing meta-universe interaction environment, intention recognition is beneficial to improving the efficiency and quality of customer interaction, can help enterprises to know customer demands deeply, provides personalized services, realizes accurate marketing, and enhances customer satisfaction.
Currently, in order to understand user intention accurately, multi-modal fusion technology comprehensively utilizes information from different modalities. In a digital marketing meta-universe environment, speech in human-computer interaction provides the customer's textual content and intonation-borne emotion, while facial gaze reflects the user's attention and points of interest. Fusing the voice and sight information therefore provides richer intention cues and improves the accuracy and robustness of an intention recognition algorithm.
The human face gaze carries rich intention information, and gaze estimation based on deep learning has attracted wide attention. Such methods learn high-level gaze features from high-dimensional video images and markedly improve the accuracy of face gaze estimation. However, because of lighting, occlusion and differences in facial appearance within face images, a model trained on a limited dataset of specific people often over-fits. Applying face gaze estimation to practical intention recognition therefore remains a challenge.
In addition, although multimodal fusion techniques solve these problems to some extent, how to effectively fuse information of different modalities improves accuracy of client intent recognition still faces many challenges. For example, how to deal with heterogeneity, synchronicity issues of different modality data, how to design efficient feature fusion algorithms, etc.
The main causes of these problems include incomplete or improper selection of modal information, limited gaze estimation training data, heterogeneity of modal information, information redundancy and noise, unreasonable design of the feature fusion scheme, weight allocation during fusion, and insufficient interaction among modalities.
Improper modal information selection, over-fitting of the gaze estimation model and an unreasonable multi-modal fusion method all degrade the generalization capability and accuracy of the intent recognition model. To solve these problems, model performance can be improved through careful selection of modal information, meta-learning, suitable regularization techniques, and a well-designed feature fusion method.
Disclosure of Invention
Aiming at the defects and shortcomings in the prior art, the invention provides a multi-mode fusion method based on the combined action of voice and sight for intention recognition, aiming at solving the following technical problems:
1. and selecting the mode information, namely selecting the proper mode information as a key in a human-computer interaction intention recognition task. The face sight information can reflect the attention and the interest points of the user, and the voice can extract text expression information and intonation emotion information of the user. Both line of sight and speech are key information for user intent.
2. Line-of-sight estimation model over-fitting: in a gaze estimation task, over-fitting manifests as highly accurate predictions for a particular scene, individual or gaze direction, while the prediction accuracy degrades significantly when the model faces different scenes or people.
3. The influence of the multi-mode feature fusion method on the intention recognition accuracy is that the multi-mode feature fusion refers to extracting features from a plurality of modes (such as voice, text, images, gestures and the like) and effectively combining the features to improve the understanding capability of the model on the intention of a user in a complex scene. The information complementarity between different modes can be enhanced by selecting a proper fusion method, and the accuracy and the robustness of intention recognition are improved.
The technical scheme of the invention is as follows:
an intention recognition method based on multi-modal fusion of voice and sight comprises the following steps:
Feature extraction, extracting text, speech and line of sight features from speech and faces based on a pre-trained BERT model, a Wav2vec 2.0 model, and a self-training FGEN model;
The multi-modal representation comprises 1) a modality-shared representation: a modality-shared encoder is constructed to learn cross-modal shared features; it converts the text, voice and sight features into a unified feature space to obtain shared features, and the central moment discrepancy is used to minimize the similarity loss between the shared features of different modalities; and 2) a modality-specific representation: a modality-specific encoder is constructed to learn the specific features of each modality; it converts the text, voice and sight features into modality-specific feature spaces to obtain specific features, and a difference loss ensures that the shared features and specific features of the same modality are distributed differently, while the specific feature distributions of different modalities also differ;
The multi-modal fusion comprises 3) intra-modal fusion: the shared and specific features of each modality are fused through a self-attention mechanism to obtain single-modality fusion features; and 4) cross-modal fusion: cross-modal correlated features are learned through a cross-attention mechanism, and the features of different modalities are fused through a gating mechanism to obtain the final fused feature;
And (3) intention recognition, namely inputting the final fusion characteristics into a multi-layer perceptron, connecting a softmax layer to output a classification result, and carrying out intention recognition.
Extracting text features based on the pre-trained BERT model, comprising:
capturing the speech of the interaction environment in an active acquisition mode, and inputting the transcribed text T = {w_1, w_2, ..., w_m} into the BERT model, where w_i is a word vector in the text and m is the total number of word vectors; the output of the last hidden layer of the BERT model represents the text features z_t ∈ R^(l_t×h_t), where l_t is the length of the text utterance sequence and h_t is the feature dimension of each token.
Extracting voice features based on a pre-trained Wav2vec 2.0 model, comprising:
capturing the speech A = {a_1, a_2, ..., a_n} of the interaction environment in an active acquisition mode; each speech segment a_i is input into the Wav2vec 2.0 model, and the output of the last hidden layer of the Wav2vec 2.0 model represents the speech features z_a ∈ R^(l_a×h_a), where l_a is the length of the speech segment sequence and h_a is the feature dimension of each token.
According to the invention, the method for extracting the sight line features based on the pre-trained FGEN model comprises the following steps:
Using SWCNN as the training baseline model, the network architecture is extended by connecting a face identity classifier (face recognition model) to realize appearance-independent gaze estimation and extraction of key facial features. The FGEN model is trained with a meta-learning strategy: in each iterative training, subjects are randomly selected from the face gaze dataset to form a meta-training dataset and a meta-testing dataset, obtaining the sight feature embedding z_v ∈ R^(l_v×h_v), where l_v is the sequence length of the key frames and h_v is the feature dimension of each frame; R^(l_v×h_v) denotes an l_v×h_v real matrix containing the feature embeddings of all key frames, each row corresponding to the features of one key frame and each column to the value of a specific feature dimension, as follows:
z_v = FGEN(C_id, E_r, ML);
wherein C_id denotes the face identity classifier, E_r denotes the gaze prediction model, and ML denotes the meta-learning strategy.
According to the invention, the FGEN model is trained with a meta-learning strategy, and the training uses an adversarial strategy to make the sight features independent of the appearance of the subject, comprising:
updating the face recognition parameters;
updating the parameters of the FGEN model with the gaze loss and the adversarial loss.
The gaze prediction model converts a face image into a gaze direction vector, as follows:
(v_r, h_r) = E_r(D);
wherein v_r denotes the gaze yaw-and-pitch vector label, h_r denotes the facial gaze feature vector output by the FGEN model, E_r denotes the face gaze prediction model, and D denotes the face dataset;
C_id denotes the face identity classifier, and p̂_id denotes the recognition probability vector of each face class among the training subjects;
the parameters of the face identity classifier C_id are updated according to the cross-entropy loss L_id between the recognition probabilities and the labels, as follows:
L_id = -(1/N) Σ_{n=1}^{N} p_id,n · log(p̂_id,n);
wherein p_id,n is the identity label of each subject's face, p̂_id,n is the predicted recognition probability of the n-th subject's face, N is the total number of training data, and log denotes the logarithmic function;
next, the FGEN model continues with adversarial training; the face gaze prediction model E_r is optimized toward the objective opposite to that of the face identity classifier, so the adversarial loss for appearance generalization is defined as follows:
acc = (p̂_id · e) / (||p̂_id|| · ||e||);
L_adv = 1 - acc;
wherein acc denotes the accuracy of the predicted subject-face probability, L_adv denotes the adversarial loss of appearance generalization, e = [1/k, ..., 1/k] is a uniform distribution, k is the number of subjects in the training set, and ||·|| denotes the norm; the loss function is constructed with cosine similarity so that the predicted identity probabilities p̂_id become uniformly distributed;
the gaze direction loss L_r uses the L1 loss and is defined as:
L_r = (1/N) Σ_{n=1}^{N} |v_r,n - v̂_r,n|;
wherein v̂_r denotes the vector predicted by the face gaze prediction model E_r.
A multi-objective loss function L_total is used for training, as follows:
L_total = η_adv · L_adv + L_r;
wherein η_adv is set empirically.
According to the present invention, preferably, one subject in the training set is randomly selected in each training step and a meta-learning method, i.e., an algorithm based on first-order gradients, is applied to update the model, comprising:
first, k subjects are randomly drawn from the training dataset, and their face images are used to construct a new meta-training set D_mtr; the meta-training set D_mtr is sequentially input into the FGEN model, and the gradient vector is computed during optimization as follows:
g_x = U(θ_{x-1}, D_mtr);
wherein θ_0 is the initial weight of the gaze prediction model, x is the meta-training iteration number, g_x denotes the gradient vector computed during optimization of the gaze prediction model, and U(·) is the function that computes the gradient vector;
the initial weights are then updated using the following equations:
θ_x = θ_{x-1} - ε · g_x;
φ = θ_X;
wherein ε is the step size used for the stochastic gradient descent operation, φ is the weight of the gaze prediction model optimized by the meta-training process, and θ_x refers to the weights updated by gradient descent;
next, a meta-adaptation set D_mad with p subjects that do not overlap with the meta-training set D_mtr is constructed, and the weights of the meta-adaptation iteration are updated as:
φ_y = φ_{y-1} - ε · U(φ_{y-1}, D_mad);
after the meta-adaptation iteration, a new meta-adaptation set D_mad is created by selecting a new subject; the meta-adaptation process is repeated δ times and the updated weights φ̃ are computed;
finally, the face gaze prediction model E_r is updated according to:
θ_0 ← θ_0 + ε · (φ̃_y - θ_0);
wherein y is the meta-adaptation iteration number and φ̃_y is the weight of the face gaze prediction model optimized by the meta-adaptation process using D_mad;
sight feature extraction: the processed images and the trained parameters are input into the FGEN model to obtain the sight features z_v = FGEN(C_id, E_r, ML).
According to the invention, preferably, before the multi-modal representation the text features z_t, speech features z_a and sight features z_v are preprocessed with one layer of a multi-head Transformer to obtain u_t, u_a and u_v.
According to a preferred embodiment of the present invention, the modality-shared representation comprises:
a modality-shared encoder E_c(u_(t,a,v); θ_c) is constructed to learn cross-modal shared features; the modality-shared encoder converts u_t, u_a and u_v into a unified feature space, obtaining the shared features of text, speech and sight c_t, c_a and c_v respectively:
c_t = E_c(u_t; θ_c);
c_a = E_c(u_a; θ_c);
c_v = E_c(u_v; θ_c);
the central moment discrepancy CMD is used to minimize the similarity loss; let X and Y be bounded random samples whose probability distributions on the interval [A, B]^N are p and q respectively; the CMD regularizer CMD_k(X, Y) is defined as:
CMD_k(X, Y) = (1/|B-A|) ||E(X) - E(Y)||_2 + Σ_{j=2}^{k} (1/|B-A|^j) ||C_j(X) - C_j(Y)||_2;
wherein X and Y are taken over the three different modalities in a pairwise manner, A and B represent the value ranges of X and Y, E(X) is the empirical expectation vector of the random sample X, and C_j(X) = E((X - E(X))^j) is the vector of all j-order sample central moments of X; the CMD between the shared features of each pair of modalities is computed:
L_sim = (1/3) Σ_{(m1,m2) ∈ {(t,a),(t,v),(a,v)}} CMD_k(c_{m1}, c_{m2});
wherein t, a and v are the text, speech and sight identifiers respectively, c_{m1} and c_{m2} are the shared features of modality m1 and modality m2, and modality m1 and modality m2 are any two of the three modalities; minimizing the loss L_sim forces the shared feature representation distributions of each pair of modalities to be similar.
According to a preferred embodiment of the invention, the modality-specific representation comprises constructing modality-specific encoders E_p^t(u_t; θ_p^t), E_p^a(u_a; θ_p^a) and E_p^v(u_v; θ_p^v) corresponding to text, speech and sight respectively; the modality-specific encoders convert u_t, u_a and u_v into unique feature spaces to obtain the specific features s_t, s_a and s_v:
s_t = E_p^t(u_t; θ_p^t);
s_a = E_p^a(u_a; θ_p^a);
s_v = E_p^v(u_v; θ_p^v);
the following difference loss L_diff is formed, with the calculation formula:
L_diff = Σ_{m ∈ {t,a,v}} ||c_m^T s_m||_F^2 + Σ_{(m1,m2) ∈ {(t,a),(t,v),(a,v)}} ||s_{m1}^T s_{m2}||_F^2;
wherein ||·||_F^2 is the squared L2 (Frobenius) norm; if c_m and s_m, and s_{m1} and s_{m2}, are orthogonal, the difference loss L_diff is minimal; t, a and v are the text, speech and sight identifiers respectively, s_{m1} and s_{m2} are the specific features of modality m1 and modality m2, and modality m1 and modality m2 are any two of the three modalities.
According to the preferred embodiment of the present invention, before the multi-modal fusion operation a decoder D(c_m + s_m; θ_d) is constructed, which takes the shared features and specific features as input, wherein c_m denotes the shared features, s_m denotes the specific features and θ_d denotes the decoder parameters; the original feature space is reconstructed:
û_t = D(c_t + s_t; θ_d);
û_a = D(c_a + s_a; θ_d);
û_v = D(c_v + s_v; θ_d);
wherein û_t, û_a and û_v represent the intrinsic information of the three modality features;
the mean square error is used to estimate the reconstruction error:
L_recon = (1/3) Σ_{m ∈ {t,a,v}} ||u_m - û_m||_2^2 + λ||W||_2^2;
wherein ||·||_2^2 is the squared L2-norm, λ||W||_2^2 is a regularization term that prevents over-fitting, and W is the decoder parameter; L_recon denotes the reconstruction error, i.e., the error or loss produced by the model when it encodes the input data into a latent representation and reconstructs it through the decoder.
According to the invention, the intra-modal fusion comprises:
The calculation formula of the self-attention mechanism is as follows:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V, with Q = W_q·X_1, K = W_k·X_1 and V = W_v·X_1;
wherein Q, K and V denote the query, key and value matrices respectively; W_q, W_k and W_v are parameters to be learned, and d_k is the dimension of K; for the self-attention mechanism, the three matrices Q, K and V come from the same input X_1; the shared and specific features of text, speech and sight are respectively concatenated and input into the self-attention mechanism to obtain the single-modality fusion features h_t, h_a and h_v.
According to the invention, the preferred cross-modal fusion comprises:
after the single-modality fusion features are obtained, a cross-attention mechanism is used to learn the correlated features of text-to-sight CA_{t-v}, sight-to-text CA_{v-t}, text-to-speech CA_{t-a}, speech-to-text CA_{a-t}, speech-to-sight CA_{a-v} and sight-to-speech CA_{v-a}; the cross-attention mechanism asymmetrically combines two sequences, one serving as the input of Q and the other as the input of K and V; the calculation formula of the cross-attention mechanism is as follows:
CA(X_1, X_2) = softmax(Q_1·K_2^T / √d_k)·V_2, with Q_1 = W_q·X_1, K_2 = W_k·X_2 and V_2 = W_v·X_2;
CA_{t-v} and CA_{v-t} are combined and input into the sight gate to obtain the sight feature fusion weight W_rate_v; CA_{t-a} and CA_{a-t} are combined and input into the auditory gate to obtain the speech feature fusion weight W_rate_a; CA_{a-v} and CA_{v-a} are combined and input into the text gate to obtain the text feature fusion weight W_rate_t; the calculation formulas are as follows:
W_rate_v = sigmoid(W_v·[CA_{t-v}; CA_{v-t}] + b_v);
W_rate_a = sigmoid(W_a·[CA_{t-a}; CA_{a-t}] + b_a);
W_rate_t = sigmoid(W_t·[CA_{a-v}; CA_{v-a}] + b_t);
wherein W_v, W_a and W_t are linear layer parameters, and b_v, b_a and b_t are bias terms;
according to the fusion weights, the sight features h_v and the speech features h_a are fused with the text features h_t to obtain the final fused feature h:
h = W_rate_v·h_v + W_rate_a·h_a + W_rate_t·h_t.
according to the invention, the method comprises the following steps:
after the final fusion characteristic h is obtained, h is input into a multi-layer perceptron, and a softmax layer is connected to obtain a classification resultThe following is shown:
;
wherein W and b represent linear layer parameters and bias terms.
According to the invention, the losses are preferably co-optimized,And cross entropy lossThe final optimization objective L is as follows:
;
;
Wherein,,AndIs the weight that determines the contribution to the overall loss L, N is the number of training samples; AndRepresenting the actual tag distribution and the predicted tag distribution of sample i, respectively; Is L2 regularization, W is a model parameter.
A computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the intention recognition method based on multi-modal fusion of voice and sight.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the intention recognition method based on multi-modal fusion of voice and sight.
An intent recognition system based on a multi-modal fusion of speech and gaze comprising:
A feature extraction module configured to extract text, speech and line-of-sight features from speech and faces based on the pre-trained BERT model, the Wav2vec 2.0 model, and the self-training FGEN model;
The multi-modal representation module is configured to comprise a modality-shared representation and a modality-specific representation, wherein the modality-shared representation comprises constructing a modality-shared encoder to learn cross-modal shared features, the modality-shared encoder converting the text, voice and sight features into a unified feature space to obtain shared features, and minimizing the similarity loss among the shared features of different modalities by using the central moment discrepancy; the modality-specific representation comprises constructing a modality-specific encoder to learn the specific features of each modality, the modality-specific encoder converting the text, voice and sight features into specific feature spaces to obtain the specific features, with a difference loss ensuring that the shared features and specific features of the same modality are distributed differently while the specific feature distributions of different modalities also differ;
The multi-mode fusion module is configured to comprise intra-mode fusion, cross-mode fusion, wherein the intra-mode fusion comprises the step of fusing shared features and specific features of each mode through a self-attention mechanism to obtain single-mode fusion features, the step of cross-mode fusion comprises the steps of learning cross-mode related features through a cross-attention mechanism and fusing features of different modes through a gating mechanism to obtain final fusion features;
And the intention recognition module is configured to input the final fusion characteristics into the multi-layer perceptron, and connect the softmax layer to output a classification result to perform intention recognition.
The beneficial effects of the invention are as follows:
1. The over-fitting problem of the gaze estimation model is addressed: one subject in the training set is randomly selected in each training step, and a meta-learning method, i.e., updating the model with a first-order gradient algorithm, is applied, which alleviates over-fitting and allows the model to optimize new parameters.
2. Efficient feature extraction is achieved by applying a sight feature extraction method based on full-face appearance and a random-identity adversarial network, realizing appearance-independent gaze estimation and extraction of key facial features.
3. Multi-modal information fusion is optimized: a fusion method based on an attention-driven gated neural network is proposed, which raises the status of the voice and sight features and helps achieve a more comprehensive understanding of customer intention.
4. The accuracy and robustness of intention recognition are improved by the sight feature extraction method based on full-face appearance and a random-identity adversarial network, together with the fusion method of the attention-based gated neural network.
Drawings
FIG. 1 is a flow chart of an intent recognition method based on multi-modal fusion of speech and gaze;
FIG. 2 is a block flow diagram of training FGEN a model using a meta-learning strategy;
FIG. 3 is an overall block diagram of a line-of-sight feature extraction method based on full-face appearance and random identity challenge network;
FIG. 4 is a schematic diagram of the self-attention computing process;
Fig. 5 is a schematic diagram of a cross-attention computing process.
Detailed Description
The invention is further described, but not limited, by the following examples in conjunction with the drawings of the specification.
Example 1
An intention recognition method based on multi-modal fusion of voice and sight, as shown in fig. 1, comprises the following steps:
Feature extraction, extracting text, speech and line of sight features from speech and faces based on a pre-trained BERT model, a Wav2vec 2.0 model, and a self-training FGEN (Face to gaze encoder) model;
The multi-modal representation comprises 1) a modality-shared representation: a modality-shared encoder is constructed to learn cross-modal shared features; it converts the text, voice and sight features into a unified feature space to obtain shared features, and the central moment discrepancy (CMD) is used to minimize the similarity loss between the shared features of different modalities; and 2) a modality-specific representation: a modality-specific encoder is constructed to learn the specific features of each modality; it converts the text, voice and sight features into specific feature spaces to obtain specific features, and the difference loss ensures that the shared and specific features of the same modality are distributed differently while the specific feature distributions of different modalities also differ;
The multi-modal fusion comprises 3) intra-modal fusion: the shared and specific features of each modality are fused through a self-attention mechanism (shown in figure 4) to obtain single-modality fusion features; and 4) cross-modal fusion: cross-modal correlated features are learned through a cross-attention mechanism (shown in figure 5), and the features of different modalities are fused through a gating mechanism to obtain the final fused feature;
and (3) intention recognition, namely inputting the final fusion characteristics into a multi-layer perceptron (MLP), connecting a softmax layer to output a classification result, and carrying out intention recognition.
Example 2
The intention recognition method based on the multi-modal fusion of voice and sight is characterized in that:
in the present embodiment of the present invention,
The modal sharing encoder is a structure for multi-modal learning, and has the main function of extracting sharing characteristics across different modalities.
The mode-specific encoder is an encoder structure specifically designed for each mode independently, unlike a mode-shared encoder.
The decoder is a module for mapping the features extracted by the encoder back to the required output space. The main task of the decoder is to restore the feature vectors or feature maps extracted by the encoder to a specific output format, e.g. to generate a density map, to partition a map, to reconstruct an image, or to perform a classification task.
The self-attention mechanism is a deep learning mechanism, is widely applied to the fields of natural language processing, computer vision and the like, and is used for capturing the association relationship inside the features. It learns global dependencies within the data by letting each feature location focus on other locations in the input.
The cross-attention mechanism is a deep learning mechanism and is mainly used for processing feature interaction and fusion under multi-mode or multi-input situations. Unlike the self-attention mechanism, the cross-attention mechanism calculates attention weights between different modalities or different inputs in order to capture associations between different data sources to achieve feature complementation.
The face identity classifier (face recognition model) receives the gaze representation vector as input and then predicts the face identity probability. The proposed FGEN architecture aims to extract gaze representations that generalize over appearance by predicting uniform probability values for all faces.
Gaze prediction model: it converts a face image into gaze direction vectors (yaw and pitch), (v_r, h_r) = E_r(D), where D is the normalized face image set, v_r is the gaze vector and h_r is the facial gaze feature vector extracted from the intermediate layer.
Extracting text features based on the pre-trained BERT model, comprising:
the pre-trained BERT model can extract text semantic features and has become an essential module for natural language understanding tasks; therefore, the pre-trained language model BERT is used to extract text features. Compared with Word2vec, the BERT model resolves ambiguity by considering context. Meanwhile, the different semantic features obtained through hierarchical learning provide rich feature choices for downstream tasks. Fine-tuning downstream tasks on the pre-trained BERT model achieves good performance with only a small number of training samples. The speech of the interaction environment is captured in an active acquisition mode, and the transcribed text T = {w_1, w_2, ..., w_m} is input into the BERT model, where w_i is a word vector in the text and m is the total number of word vectors; the output of the last hidden layer of the BERT model represents the text features z_t ∈ R^(l_t×h_t), where l_t is the length of the text utterance sequence and h_t is the feature dimension of each token. The feature extraction process is as follows:
z_t = BERT(T).
extracting voice features based on a pre-trained Wav2vec 2.0 model, comprising:
The Wav2vec 2.0 model is an unsupervised speech pre-training model published by Meta in 2020. Its core idea is to construct a self-supervised training target through vector quantization (VQ), and the model performs better than standard acoustic features; therefore the pre-trained Wav2vec 2.0 model is used to extract speech features. Similarly, the speech A = {a_1, a_2, ..., a_n} of the interaction environment is captured in an active acquisition mode; each speech segment a_i is input into the Wav2vec 2.0 model, and the output of the last hidden layer of the Wav2vec 2.0 model represents the speech features z_a ∈ R^(l_a×h_a), where l_a is the length of the speech segment sequence and h_a is the feature dimension of each token. The feature extraction process is as follows:
z_a = Q(F(P(A)));
where P denotes a preprocessing operation, F denotes the feature extractor, and Q denotes the quantizer.
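The following minimal sketch illustrates this feature extraction step with the HuggingFace transformers library; the checkpoint names ("bert-base-chinese", "facebook/wav2vec2-base-960h") and the helper function names are illustrative assumptions, not part of the patent.

```python
import torch
from transformers import (BertModel, BertTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
wav_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def extract_text_features(text: str) -> torch.Tensor:
    """Return z_t: last hidden layer of BERT, shape (l_t, h_t)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state.squeeze(0)

def extract_speech_features(waveform, sampling_rate: int = 16000) -> torch.Tensor:
    """Return z_a: last hidden layer of Wav2vec 2.0, shape (l_a, h_a).
    waveform: 1-D numpy array or list of float samples."""
    inputs = wav_fe(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        out = wav2vec(inputs.input_values)
    return out.last_hidden_state.squeeze(0)
```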
Extracting sight line features based on a pre-trained FGEN model, including:
People observe things of interest around them through the eyes and receive important information from them; the gaze directed at an object is therefore a non-verbal cue that is important for human intent. Recent studies have shown that the information features of the full-face region predict gaze positions more accurately. The invention therefore provides a sight feature extraction method based on full-face appearance and a random-identity adversarial network, the FGEN (Face to gaze encoder) model, which needs no special infrared or depth equipment and uses only an ordinary camera. The invention uses SWCNN (Switchable Convolutional Neural Network, a convolutional neural network model for multi-modal or multi-task scenarios) as the training baseline model and extends the network architecture by connecting a face identity classifier (face recognition model) to realize appearance-independent gaze estimation and extraction of key facial features. To solve the over-fitting problem caused by an insufficient training set, a meta-learning strategy is adopted to train the FGEN model: in each iterative training, subjects are randomly selected from the face gaze dataset to form a meta-training dataset and a meta-testing dataset, obtaining the sight feature embedding z_v ∈ R^(l_v×h_v), where l_v is the sequence length of the key frames and h_v is the feature dimension of each frame; R^(l_v×h_v) denotes an l_v×h_v real matrix containing the feature embeddings of all key frames, each row corresponding to the features of one key frame and each column to the value of a specific feature dimension, as follows:
z_v = FGEN(C_id, E_r, ML);
wherein C_id denotes the face identity classifier, E_r denotes the gaze prediction model, and ML denotes the meta-learning strategy. The sight feature embedding z_v captures the unique information of the eye gaze direction or gaze point in the feature space, thereby providing richer context and semantic understanding for visual tasks. FGEN(·) denotes the FGEN function used to extract the sight features.
The specific architecture of FGEN model is shown in figure 2.
The FGEN model is trained with a meta-learning strategy, and an adversarial strategy is used to make the sight features independent of the appearance of the subject; as shown in fig. 3, training comprises the following steps:
updating the face recognition parameters;
updating the parameters of the FGEN model with the gaze loss and the adversarial loss.
The gaze prediction model converts a face image into a gaze direction vector, as follows:
(v_r, h_r) = E_r(D);
wherein v_r denotes the gaze yaw-and-pitch vector label, h_r denotes the facial gaze feature vector output by the FGEN model, E_r denotes the face gaze prediction model, and D denotes the face dataset;
C_id denotes the face identity classifier, and p̂_id denotes the recognition probability vector of each face class among the training subjects;
Appearance-invariant face gaze feature learning:
the parameters of the face identity classifier C_id are updated according to the cross-entropy loss L_id between the recognition probabilities and the labels, as follows:
L_id = -(1/N_1) Σ_{n=1}^{N_1} p_id,n · log(p̂_id,n);
wherein p_id,n is the identity label of each subject's face, p̂_id,n is the predicted recognition probability of the n-th subject's face, N_1 is the total number of training data, and log denotes the logarithmic function;
Identity-probability equalization feature learning:
next, the FGEN model continues with adversarial training; the face gaze prediction model E_r is optimized toward the objective opposite to that of the face identity classifier (i.e., the parameters of the gaze prediction model are optimized so that the identity probabilities p̂_id of all samples are equally distributed); the adversarial loss for appearance generalization is therefore defined as follows:
acc = (p̂_id · e) / (||p̂_id|| · ||e||);
L_adv = 1 - acc;
wherein acc denotes the accuracy of the predicted subject-face probability, L_adv denotes the adversarial loss of appearance generalization, e = [1/k_1, ..., 1/k_1] is a uniform distribution, k_1 is the number of subjects in the training set, and ||·|| denotes the norm; the loss function is constructed with cosine similarity so that the predicted identity probabilities p̂_id become uniformly distributed;
the face gaze prediction model is thus promoted on the basis of the appearance-related information of the face images.
The gaze direction loss L_r uses the L1 loss and is defined as:
L_r = (1/N_1) Σ_{n=1}^{N_1} |v_r,n - v̂_r,n|;
wherein v̂_r denotes the vector predicted by the face gaze prediction model E_r.
A multi-objective loss function L_total is used for training, as follows:
L_total = η_adv · L_adv + L_r;
wherein η_adv is set empirically. After the parameters of the face gaze prediction model are updated, the face recognition model is frozen. Through the above adversarial learning, the limitation of appearance-based gaze estimation methods is alleviated by extracting the key information of appearance without additional non-key labels.
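The sketch below illustrates one possible implementation of the multi-objective FGEN loss described above; the cosine-to-uniform form of L_adv, the function names and the default weight eta_adv are assumptions based on the description rather than the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def fgen_losses(gaze_pred, gaze_label, id_logits, id_label, eta_adv=1.0):
    # L_id: cross-entropy used to update the face identity classifier
    l_id = F.cross_entropy(id_logits, id_label)

    # L_adv: push predicted identity probabilities toward the uniform vector e
    p_id = F.softmax(id_logits, dim=-1)
    e = torch.full_like(p_id, 1.0 / p_id.size(-1))
    acc = F.cosine_similarity(p_id, e, dim=-1).mean()
    l_adv = 1.0 - acc

    # L_r: L1 loss between predicted and labelled gaze (yaw, pitch)
    l_r = F.l1_loss(gaze_pred, gaze_label)

    # L_total updates the gaze branch; L_id updates the identity classifier
    l_total = eta_adv * l_adv + l_r
    return l_id, l_total
```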
To improve generalization performance, a new learning strategy is introduced, designed to compensate for the over-fitting problem caused by individual-specific appearance factors. In general, gaze estimation performs leave-one-out cross-validation, which leaves only one subject in the test dataset and uses the remaining subjects for training. In such a training setting, the model is biased by the appearance factors of the limited subjects used in the training dataset and can easily over-fit. To avoid these problems, a new training strategy is constructed in which one subject in the training set is randomly selected in each training step, and a meta-learning method, i.e., an algorithm based on first-order gradients, is applied to update the model, comprising:
first, k subjects are randomly drawn from the training dataset, and their face images are used to construct a new meta-training set D_mtr; the meta-training set D_mtr is sequentially input into the FGEN model, and the gradient vector is computed during optimization as follows:
g_x = U(θ_{x-1}, D_mtr);
wherein θ_0 is the initial weight of the gaze prediction model, x is the meta-training iteration number, g_x denotes the gradient vector computed during optimization of the gaze prediction model, and U(·) is the function that computes the gradient vector;
the initial weights are then updated using the following equations:
θ_x = θ_{x-1} - ε · g_x;
φ = θ_X;
wherein ε is the step size used for the stochastic gradient descent (SGD) operation, φ is the weight of the gaze prediction model optimized by the meta-training process, and θ_x refers to the weights updated by gradient descent;
next, a meta-adaptation set D_mad with p subjects that do not overlap with the meta-training set D_mtr is constructed, and the weights of the meta-adaptation iteration are updated as:
φ_y = φ_{y-1} - ε · U(φ_{y-1}, D_mad);
after the meta-adaptation iteration, a new meta-adaptation set D_mad is created by selecting a new subject; the meta-adaptation process is repeated δ times and the updated weights φ̃ are computed;
finally, the face gaze prediction model E_r is updated according to:
θ_0 ← θ_0 + ε · (φ̃_y - θ_0);
wherein y is the meta-adaptation iteration number and φ̃_y is the weight of the face gaze prediction model optimized by the meta-adaptation process using D_mad; the invention applies this process to alleviate the over-fitting problem, so that the model can optimize new parameters and is prevented from being updated in a direction dominated by a few subjects.
Sight feature extraction: the processed images and the trained parameters are input into the FGEN model to obtain the sight features z_v = FGEN(C_id, E_r, ML).
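A compact sketch of the per-subject, first-order (Reptile-style) meta-update described above is given below; the data helpers meta_train_batches/meta_adapt_batches, the inner step counts and the step sizes are hypothetical placeholders, not values taken from the patent.

```python
import copy
import random
import torch

def meta_train_step(model, subjects, loss_fn, eps=0.01, inner_steps=4, meta_lr=0.1):
    theta0 = copy.deepcopy(model.state_dict())           # initial weights theta_0
    subject = random.choice(subjects)                     # one subject per training step
    opt = torch.optim.SGD(model.parameters(), lr=eps)     # inner SGD with step size eps

    for images, gaze in subject.meta_train_batches(inner_steps):   # hypothetical helper
        opt.zero_grad()
        loss_fn(model(images), gaze).backward()           # meta-training on D_mtr
        opt.step()

    for images, gaze in subject.meta_adapt_batches(inner_steps):   # hypothetical helper
        opt.zero_grad()
        loss_fn(model(images), gaze).backward()           # meta-adaptation on D_mad
        opt.step()

    phi = model.state_dict()                              # meta-adapted weights
    # first-order meta-update: move theta_0 toward the adapted weights
    new_state = {k: (theta0[k] + meta_lr * (phi[k] - theta0[k]))
                    if phi[k].is_floating_point() else phi[k]
                 for k in theta0}
    model.load_state_dict(new_state)
```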
There is complementarity and consistency between the different modalities. When people express themselves, their language, voice and purpose share a common motivation and goal, showing consistency across modalities; at the same time, language, voice and purpose each have unique semantics, intonation and expression, showing complementary characteristics. Therefore, modality-shared and modality-specific encoders are designed to learn the shared and specific features of text, speech and sight. This representation provides an overall view of the modalities and lays the foundation for the subsequent multi-modal fusion. Before the multi-modal representation, the text features z_t, speech features z_a and sight features z_v are preprocessed with one layer of a multi-head Transformer to obtain u_t, u_a and u_v.
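A minimal sketch of this preprocessing step is shown below; the projection layers, the common width d=128, the input widths and the sharing of one Transformer layer across the three modalities are illustrative assumptions.

```python
import torch
import torch.nn as nn

d = 128
proj_t, proj_a, proj_v = nn.Linear(768, d), nn.Linear(768, d), nn.Linear(256, d)
pre_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)

def preprocess(z_t, z_a, z_v):
    """Map raw modality features to u_t, u_a, u_v in a common width d."""
    u_t = pre_layer(proj_t(z_t))      # (batch, l_t, d)
    u_a = pre_layer(proj_a(z_a))      # (batch, l_a, d)
    u_v = pre_layer(proj_v(z_v))      # (batch, l_v, d)
    return u_t, u_a, u_v
```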
The modality-shared representation includes:
a modality-shared encoder E_c(u_(t,a,v); θ_c) is constructed to learn cross-modal shared features; the modality-shared encoder converts u_t, u_a and u_v into a unified feature space, obtaining the shared features of text, speech and sight c_t, c_a and c_v respectively:
c_t = E_c(u_t; θ_c);
c_a = E_c(u_a; θ_c);
c_v = E_c(u_v; θ_c);
to ensure that the shared features of the different modalities of the same sample are similar, the central moment discrepancy CMD is used to minimize the similarity loss; CMD contains moment information of order higher than the KL divergence, and compared with the maximum mean discrepancy (MMD), CMD reduces the amount of computation because no kernel matrix needs to be computed. CMD estimates the difference between two distributions by matching the differences of their order moments. Let X and Y be bounded random samples whose probability distributions on the interval [A, B]^N are p and q respectively; the CMD regularizer CMD_k(X, Y) is defined as:
CMD_k(X, Y) = (1/|B-A|) ||E(X) - E(Y)||_2 + Σ_{j=2}^{k} (1/|B-A|^j) ||C_j(X) - C_j(Y)||_2;
wherein X and Y are taken over the three different modalities in a pairwise manner, A and B represent the value ranges of X and Y, E(X) is the empirical expectation vector of the random sample X, and C_j(X) = E((X - E(X))^j) is the vector of all j-order sample central moments of X; the CMD between the shared features of each pair of modalities is computed:
L_sim = (1/3) Σ_{(m1,m2) ∈ {(t,a),(t,v),(a,v)}} CMD_k(c_{m1}, c_{m2});
wherein t, a and v are the text, speech and sight identifiers respectively, c_{m1} and c_{m2} are the shared features of modality m1 and modality m2, and modality m1 and modality m2 are any two of the three modalities; intuitively, minimizing the loss L_sim forces the shared feature representation distributions of each pair of modalities to be similar.
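The sketch below shows one way to compute the CMD-based similarity loss over the three modality pairs; it follows the standard central-moment-discrepancy form, with the order K=5 and the omission of the explicit interval normalization as simplifying assumptions.

```python
import torch

def cmd(x: torch.Tensor, y: torch.Tensor, k: int = 5) -> torch.Tensor:
    """x, y: (n, d) shared features of two modalities, assumed bounded."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = torch.norm(mx - my, p=2)                        # first-moment term
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):                          # higher central moments
        loss = loss + torch.norm(cx.pow(order).mean(dim=0)
                                 - cy.pow(order).mean(dim=0), p=2)
    return loss

def similarity_loss(c_t, c_a, c_v):
    # average CMD over the three modality pairs, as in L_sim above
    return (cmd(c_t, c_a) + cmd(c_t, c_v) + cmd(c_a, c_v)) / 3.0
```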
The modality-specific representation comprises constructing modality-specific encoders, used to learn the specific features of the different modalities, including E_p^t(u_t; θ_p^t), E_p^a(u_a; θ_p^a) and E_p^v(u_v; θ_p^v) corresponding to text, speech and sight respectively; the modality-specific encoders convert u_t, u_a and u_v into unique feature spaces to obtain the specific features s_t, s_a and s_v:
s_t = E_p^t(u_t; θ_p^t);
s_a = E_p^a(u_a; θ_p^a);
s_v = E_p^v(u_v; θ_p^v);
on the one hand, a good modality-specific feature must ensure that the shared features and the specific features of the same modality are distributed differently; on the other hand, the specific feature distributions of different modalities must also differ. The following difference loss L_diff is therefore formed, with the calculation formula:
L_diff = Σ_{m ∈ {t,a,v}} ||c_m^T s_m||_F^2 + Σ_{(m1,m2) ∈ {(t,a),(t,v),(a,v)}} ||s_{m1}^T s_{m2}||_F^2;
wherein ||·||_F^2 is the squared L2 (Frobenius) norm; intuitively, if c_m and s_m, and s_{m1} and s_{m2}, are orthogonal, the difference loss L_diff is minimal; t, a and v are the text, speech and sight identifiers respectively, s_{m1} and s_{m2} are the specific features of modality m1 and modality m2, and modality m1 and modality m2 are any two of the three modalities.
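A minimal sketch of the orthogonality-style difference loss is given below, assuming the squared-Frobenius-norm form over the modality pairs described above; the exact pairing used in the patent may differ.

```python
import torch

def diff_loss(shared: dict, specific: dict) -> torch.Tensor:
    """shared/specific: {'t': (n, d), 'a': (n, d), 'v': (n, d)} feature matrices."""
    loss = 0.0
    mods = ['t', 'a', 'v']
    for m in mods:                                  # shared vs specific, same modality
        loss = loss + (shared[m].t() @ specific[m]).pow(2).sum()
    for i in range(len(mods)):                      # specific vs specific, across modalities
        for j in range(i + 1, len(mods)):
            loss = loss + (specific[mods[i]].t() @ specific[mods[j]]).pow(2).sum()
    return loss
```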
Before the multi-modal fusion operation, a decoder D(c_m + s_m; θ_d) is constructed to ensure that the shared features and specific features obtained by the encoders keep the basic properties of the original feature space; it takes the shared features and specific features as input, wherein c_m denotes the shared features, s_m denotes the specific features and θ_d denotes the decoder parameters; the original feature space is reconstructed:
û_t = D(c_t + s_t; θ_d);
û_a = D(c_a + s_a; θ_d);
û_v = D(c_v + s_v; θ_d);
wherein û_t, û_a and û_v represent the intrinsic information of the three modality features, and their feature dimensions and distributions are kept consistent with those of the original input features.
The mean square error (MSE) is used to estimate the reconstruction error:
L_recon = (1/3) Σ_{m ∈ {t,a,v}} ||u_m - û_m||_2^2 + λ||W||_2^2;
wherein ||·||_2^2 is the squared L2-norm, λ||W||_2^2 is a regularization term that prevents over-fitting, and W is the decoder parameter; L_recon denotes the reconstruction error, i.e., the error or loss produced by the model when it encodes the input data into a latent representation (e.g., feature embeddings) and reconstructs it through the decoder.
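The sketch below illustrates the shared decoder and the MSE reconstruction term; the layer sizes and the weight-decay coefficient lam are illustrative.

```python
import torch
import torch.nn as nn

class SharedDecoder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, c, s):
        return self.net(c + s)                      # reconstruct u_m from c_m + s_m

def recon_loss(decoder, u, c, s, lam=1e-4):
    """u, c, s: dicts keyed by 't', 'a', 'v' with (n, d) tensors."""
    u_hat = {m: decoder(c[m], s[m]) for m in u}
    mse = sum(torch.mean((u[m] - u_hat[m]) ** 2) for m in u) / len(u)
    l2 = sum(p.pow(2).sum() for p in decoder.parameters())   # weight regularization
    return mse + lam * l2
```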
Intra-modal fusion, comprising:
after the shared features and specific features of each modality are obtained through multi-modal representation learning, they are concatenated and fed into a self-attention mechanism model. The attention mechanism captures the correlation between the shared features and the specific features, yielding the single-modality fusion features. Self-attention achieves parallel computation and long-range dependence well, as shown in fig. 4. The calculation formula of the self-attention mechanism is as follows:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V, with Q = W_q·X_1, K = W_k·X_1 and V = W_v·X_1;
wherein Q, K and V denote the query, key and value matrices respectively; W_q, W_k and W_v are parameters to be learned, and d_k is the dimension of K; for self-attention, the three matrices Q, K and V come from the same input X_1; the shared and specific features of text, speech and sight are respectively concatenated and input into the self-attention mechanism to obtain the single-modality fusion features h_t, h_a and h_v.
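A minimal sketch of intra-modal fusion with multi-head self-attention follows; the embedding width, head count and concatenation along the sequence dimension are illustrative choices.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

def intra_modal_fusion(c_m: torch.Tensor, s_m: torch.Tensor) -> torch.Tensor:
    """c_m, s_m: (batch, seq, 128) shared / specific features of one modality."""
    x = torch.cat([c_m, s_m], dim=1)                # concatenate shared and specific
    h_m, _ = attn(x, x, x)                          # Q = K = V = x (self-attention)
    return h_m                                      # single-modality fusion feature
```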
Cross-modal fusion, comprising:
after the single-modality fusion features are obtained, a cross-attention mechanism is used to learn the correlated features of text-to-sight CA_{t-v}, sight-to-text CA_{v-t}, text-to-speech CA_{t-a}, speech-to-text CA_{a-t}, speech-to-sight CA_{a-v} and sight-to-speech CA_{v-a}. The main difference between the cross-attention mechanism and the self-attention mechanism is that the inputs of the cross-attention mechanism come from different sequences: the cross-attention mechanism asymmetrically combines two sequences, one serving as the input of Q and the other as the input of K and V, as shown in fig. 5. The calculation formula of the cross-attention mechanism is as follows:
CA(X_1, X_2) = softmax(Q_1·K_2^T / √d_k)·V_2, with Q_1 = W_q·X_1, K_2 = W_k·X_2 and V_2 = W_v·X_2;
CA_{t-v} and CA_{v-t} are combined and input into the sight gate to obtain the sight feature fusion weight W_rate_v; CA_{t-a} and CA_{a-t} are combined and input into the auditory gate to obtain the speech feature fusion weight W_rate_a; CA_{a-v} and CA_{v-a} are combined and input into the text gate to obtain the text feature fusion weight W_rate_t; the calculation formulas are as follows:
W_rate_v = sigmoid(W_v·[CA_{t-v}; CA_{v-t}] + b_v);
W_rate_a = sigmoid(W_a·[CA_{t-a}; CA_{a-t}] + b_a);
W_rate_t = sigmoid(W_t·[CA_{a-v}; CA_{v-a}] + b_t);
wherein W_v, W_a and W_t are linear layer parameters, and b_v, b_a and b_t are bias terms;
according to the fusion weights, the sight features h_v and the speech features h_a are fused with the text features h_t to obtain the final fused feature h:
h = W_rate_v·h_v + W_rate_a·h_a + W_rate_t·h_t.
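The sketch below illustrates the cross-attention and gated fusion step; sharing one cross-attention module across all modality pairs, the mean pooling and all dimensions are simplifying assumptions.

```python
import torch
import torch.nn as nn

d = 128
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
gate_v, gate_a, gate_t = nn.Linear(2 * d, d), nn.Linear(2 * d, d), nn.Linear(2 * d, d)

def ca(q_seq, kv_seq):
    out, _ = cross_attn(q_seq, kv_seq, kv_seq)      # Q from one modality, K/V from another
    return out.mean(dim=1)                          # pool to one vector per sample

def cross_modal_fusion(h_t, h_a, h_v):
    """h_t, h_a, h_v: (batch, seq, d) single-modality fusion features."""
    ca_tv, ca_vt = ca(h_t, h_v), ca(h_v, h_t)
    ca_ta, ca_at = ca(h_t, h_a), ca(h_a, h_t)
    ca_av, ca_va = ca(h_a, h_v), ca(h_v, h_a)
    w_v = torch.sigmoid(gate_v(torch.cat([ca_tv, ca_vt], dim=-1)))   # sight gate
    w_a = torch.sigmoid(gate_a(torch.cat([ca_ta, ca_at], dim=-1)))   # auditory gate
    w_t = torch.sigmoid(gate_t(torch.cat([ca_av, ca_va], dim=-1)))   # text gate
    return w_v * h_v.mean(dim=1) + w_a * h_a.mean(dim=1) + w_t * h_t.mean(dim=1)
```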
Intent recognition, comprising:
after the final fused feature h is obtained, h is input into a multi-layer perceptron (MLP), and a softmax layer is connected to obtain the classification result ŷ, as follows:
ŷ = softmax(W_1·h + b);
wherein W_1 and b denote the linear layer parameters and the bias term.
To realize end-to-end training of the multi-modal representation, fusion and prediction, the losses L_sim, L_diff and L_recon are optimized jointly with the cross-entropy loss L_task; the final optimization objective L is as follows:
L = L_task + α·L_sim + β·L_diff + γ·L_recon;
L_task = -(1/N) Σ_{i=1}^{N} y_i·log(ŷ_i) + λ||W||_2^2;
wherein α, β and γ are the weights that determine the contributions to the overall loss L, and N is the number of training samples; y_i and ŷ_i represent the actual label distribution and the predicted label distribution of sample i respectively; λ||W||_2^2 is the L2 regularization, which reduces the degree of over-fitting of the model, and W is a model parameter.
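A minimal sketch of the classification head and the jointly optimized objective follows; the hidden sizes, the number of intent classes and the loss weights are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

mlp = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))  # 10 intent classes

def intent_logits(h: torch.Tensor) -> torch.Tensor:
    return mlp(h)                                    # softmax is applied inside the loss

def total_loss(logits, labels, l_sim, l_diff, l_recon,
               alpha=0.3, beta=0.3, gamma=0.3):
    l_task = F.cross_entropy(logits, labels)         # cross-entropy intent loss
    return l_task + alpha * l_sim + beta * l_diff + gamma * l_recon
```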
Compared with other models, the model provided by the invention achieves a relatively stable performance improvement; in most cases its performance is higher than that of other models, its macro-average accuracy score is higher, and it has better domain adaptation and generalization capability.
Example 3
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of a speech and gaze multimodal fusion based intent recognition method as described in embodiments 1 or 2 when executing the computer program.
Example 4
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a speech and line-of-sight multimodal fusion-based intent recognition method as described in embodiments 1 or 2.
Example 5
An intent recognition system based on a multi-modal fusion of speech and gaze comprising:
A feature extraction module configured to extract text, speech and line-of-sight features from speech and faces based on the pre-trained BERT model, the Wav2vec 2.0 model, and the self-training FGEN model;
The multi-modal representation module is configured to comprise a modality-shared representation and a modality-specific representation, wherein the modality-shared representation comprises constructing a modality-shared encoder to learn cross-modal shared features, the modality-shared encoder converting the text, voice and sight features into a unified feature space to obtain shared features, and minimizing the similarity loss among the shared features of different modalities by using the central moment discrepancy; the modality-specific representation comprises constructing a modality-specific encoder to learn the specific features of each modality, the modality-specific encoder converting the text, voice and sight features into specific feature spaces to obtain the specific features, with a difference loss ensuring that the shared features and specific features of the same modality are distributed differently while the specific feature distributions of different modalities also differ;
The multi-mode fusion module is configured to comprise intra-mode fusion, cross-mode fusion, wherein the intra-mode fusion comprises the step of fusing shared features and specific features of each mode through a self-attention mechanism to obtain single-mode fusion features, the step of cross-mode fusion comprises the steps of learning cross-mode related features through a cross-attention mechanism and fusing features of different modes through a gating mechanism to obtain final fusion features;
And the intention recognition module is configured to input the final fusion characteristics into the multi-layer perceptron, and connect the softmax layer to output a classification result to perform intention recognition.

Claims (5)

Translated fromChinese
1.一种基于语音与视线多模态融合的意图识别方法,其特征在于,包括:1. A method for intention recognition based on multimodal fusion of speech and sight, characterized by comprising:特征提取:基于预训练的BERT模型、Wav2vec 2.0模型和自训练FGEN模型从语音和脸部提取文本、语音和视线特征;Feature extraction: Extract text, voice, and gaze features from voice and face based on the pre-trained BERT model, Wav2vec 2.0 model, and self-trained FGEN model;多模态表示:包括:1)模态共享表示:构建模态共享编码器学习跨模态的共享特征;模态共享编码器将文本、语音和视线特征转换到统一的特征空间,获得共享特征,利用中心矩差异最小化不同模态共享特征之间的相似性损失;2)模态特异表示:构建模态特异编码器学习各模态的特定特征;模态特异编码器将文本、语音和视线特征转换到特定特征空间,获得特定特征;通过差分损失确保同一模态的共享特征和特定特征的分布不同,同时不同模态的特定特征分布也不同;Multimodal representation: including: 1) Modality sharing representation: construct a modality sharing encoder to learn cross-modal shared features; the modality sharing encoder converts text, speech and sight features into a unified feature space to obtain shared features, and uses the central moment difference to minimize the similarity loss between shared features of different modalities; 2) Modality specific representation: construct a modality specific encoder to learn specific features of each modality; the modality specific encoder converts text, speech and sight features into a specific feature space to obtain specific features; through differential loss, ensure that the distribution of shared features and specific features of the same modality is different, and the distribution of specific features of different modalities is also different;多模态融合:包括:3)模态内融合:通过自注意力机制融合每个模态的共享特征和特定特征,获得单模态融合特征;4)跨模态融合:使用交叉注意力机制学习跨模态的相关特征,并通过门控机制融合不同模态的特征,得到最终的融合特征;Multimodal fusion: including: 3) Intra-modal fusion: The shared features and specific features of each modality are fused through the self-attention mechanism to obtain the single-modal fusion features; 4) Cross-modal fusion: Use the cross-attention mechanism to learn the relevant features of cross-modality, and fuse the features of different modalities through the gating mechanism to obtain the final fusion features;意图识别:将最终的融合特征输入多层感知机,并连接softmax层输出分类结果,进行意图识别;Intent recognition: The final fusion features are input into the multi-layer perceptron and connected to the softmax layer to output the classification results for intent recognition;模态共享表示:包括:Modal shared representation: including:在文本特征zt、语音特征za和视线特征zv进行多模态表示之前,利用一层多头的Transformer进行预处理得到ut,ua和uvBefore the text feature zt , speech feature za and sight feature zv are represented in multimodal form, a layer of multi-headed Transformer is used to perform preprocessing to obtain ut , ua and uv ;通过构造一个模态共享编码器Ec(u(t,a,v);θc)来学习跨模式的共享特征,模态共享编码器将ut,ua和uv转换为统一的特征空间,分别获得文本、语音和视线的共享特征By constructing a modality-sharing encoder Ec (u(t, a, v) ; θc ) to learn cross-modal shared features, the modality-sharing encoder convertsut ,ua anduv into a unified feature space to obtain shared features of text, speech and sight, respectively. and利用中心矩差异CMD来最小化相似性损失;设X和Y是有界的随机样本,在区间[A,B]N上的概率分布分别为p和q,CMD正则化矩阵CMDk(X,Y)的定义为:The central moment difference CMD is used to minimize the similarity loss; let X and Y be bounded random samples with probability distributions p and q on the interval [A, B]N respectively, and the CMD regularization matrix CMDk (X, Y) is defined as:式中,X,Y将分别交叉带入不同的三种模态,A,B代表X和Y的取值范围,是随机样本X的经验期望向量,是X中所有k阶样本中心矩的向量,计算每对模式的共享特征之间的CMDkIn the formula, X and Y will be cross-bred into three different modes respectively, A and B represent the value range of X and Y. 
is the empirical expectation vector of random sample X, is the vector of all k-order sample central moments in X, and the CMDk between the shared features of each pair of patterns is calculated:其中,t,a和v分别是文本、语音和视线标识符,分别是模态m1和模态m2的共同特征;模态m1和模态m2分别是三种模态中的任两种;最小化损失L2/3,将迫使每对模态的共享特征表示分布是相似的;Among them, t, a and v are text, voice and sight identifiers respectively. and are the common features of mode m1 and mode m2 respectively; mode m1 and mode m2 are any two of the three modes respectively; minimizing the loss L2/3 will force the shared feature representation distribution of each pair of modes to be similar;模态特异表示:包括:构建模态特异编码器,包括:分别对应文本、语音和视线;模态特异编码器将ut、ua和uv转换为唯一特征空间,以获得特定特征Modality-specific representation: includes: building a modality-specific encoder, including: and Corresponding to text, speech and sight respectively; the modality-specific encoder convertsut ,ua anduv into a unique feature space to obtain specific features and形成以下的差分损失L:/;计算公式:The following differential loss L:/ is formed; calculation formula:其中,||·||"是L2-范数的平方,如果是正交的,则差分损失L:/;最小,t,a和v分别是文本、语音和视线标识符,分别是模态m1和模态m2的共同特征;模态m1和模态m2分别是三种模态中的任两种;where ||·||" is the square of the L2-norm. If and and are orthogonal, then the differential loss Lis minimum, t, a and v are text, speech and sight identifiers respectively, and are the common features of mode m1 and mode m2 respectively; mode m1 and mode m2 are any two of the three modes respectively;多模态融合操作之前,构建一个解码器来输入共享特征和特定特征,其中,表示共享特征,表示特定特征,θ:表示解码器参数;重建原始特征空间:Before the multimodal fusion operation, build a decoder To input shared features and specific features, where represents shared features, Represents a specific feature, θ: represents the decoder parameter; reconstructs the original feature space:其中,分别表示包括三种模态特征本质信息;in, They represent the essential information of three modal features respectively;使用均方误差来估计重建误差:Use the mean square error to estimate the reconstruction error:其中,||·||"为L2-范数的平方,为防止过拟合的正则化项,W为解码器参数;LPQRST表示重建误差,是指模型在将输入数据编码成潜在表示并通过解码器进行重建时,所产生的误差或损失;Among them, ||·||" is the square of the L2-norm, The regularization term is used to prevent overfitting. W is the decoder parameter. 
LPQRST represents the reconstruction error, which refers to the error or loss generated when the model encodes the input data into a potential representation and reconstructs it through the decoder.模态内融合;包括:Intra-modal fusion; including:自注意力机制的计算公式如下:The calculation formula of the self-attention mechanism is as follows:其中,Q=WcX1、K=WkX1和V=WvX1分别表示查询、键和值矩阵;Wc、Wk和Wv是需要学习的参数,dk是K的维数;对于自注意,三个矩阵Q、K和V来自同一个输入;将文本、语音和视线的共享特征和特定特征分别连接起来,并输入到自注意力机制中,获得单模态融合特征ht,ha和hvWhere Q = Wc X1, K = Wk X1 and V = Wv X1 represent query, key and value matrices respectively; Wc , Wk and Wv are parameters to be learned, and dk is the dimension of K; for self-attention, the three matrices Q, K and V come from the same input; the shared features and specific features of text, speech and line of sight are concatenated and input into the self-attention mechanism to obtain the unimodal fusion features ht ,ha and hv ;跨模态融合;包括:Cross-modal fusion; including:在获得单模态融合特征后,使用交叉注意力机制学习文本到视线CAt-v、视线到文本CAv-t、文本到语音CAt-a、语音到文本CAa-t、语音到视线CAa-v、视线到语音CAv-a的相关特征;交叉注意力机制不对称地结合两个序列,一个作为Q的输入,另一个作为K和V的输入;交叉注意力机制的计算公式如下:After obtaining the unimodal fusion features, the cross-attention mechanism is used to learn the relevant features of text-to-view CAtv , view-to-text CAvt , text-to-speech CAta , speech-to-text CAat , speech-to-view CAav , and view-to-speech CAva ; the cross-attention mechanism asymmetrically combines two sequences, one as the input of Q and the other as the input of K and V; the calculation formula of the cross-attention mechanism is as follows:将CAt-v和CAv-t组合输入到视线门,得到视线特征融合权重WratB_v;将CAt-a和CAa-t的组合输入到听觉门,获得语音特征融合权重WratB_a;将CAa-v和CAv-a组合输入到文本门,得到文本特征融合权重WratB_t;计算公式如下:The combination of CAtv and CAvt is input into the sight gate to obtain the sight feature fusion weight WratB_v ; the combination of CAta and CAat is input into the auditory gate to obtain the speech feature fusion weight WratB_a ; the combination of CAav and CAva is input into the text gate to obtain the text feature fusion weight WratB_t ; the calculation formula is as follows:WratB_v=sigmoid(Wv[CAt-v;CAv-t]+bv);WratB_v =sigmoid(Wv [CAtv ; CAvt ]+bv );WratB_a=sigmoid(Wx[CAt-a;CAa-t]+ba);WratB_a =sigmoid(Wx [CAta ; CAat ]+ba );WratB_t=sigmoid(Wt[CAa-v;CAv-a]+by);WratB_t =sigmoid(Wt [CAav ; CAva ]+by );其中,Wv、Wa和Wy为线性层参数,bv、ba和bt为偏置项;Among them, Wv , Wa and Wy are linear layer parameters, bv ,ba and bt are bias terms;根据融合权重,将视线特征hv和语音特征ha与文本特征ht进行融合,得到最终的融合特征h:According to the fusion weight, the sight feature hv and the speech feature ha are fused with the text feature ht to obtain the final fusion feature h:h=WratB_v*hv+WratB_a*ha+WratB_t*hth=WratB_v *hv +WratB_a *ha +WratB_t *ht ;意图识别;包括:Intent recognition; including:在获得最终的融合特征h后,将h输入到多层感知机中,并连接softmax层,得到分类结果如下所示:After obtaining the final fusion feature h, h is input into the multi-layer perceptron and connected to the softmax layer to obtain the classification result As shown below:其中,W1和b表示线性层参数和偏差项;Where W1 and b represent the linear layer parameters and bias terms;模态内融合;共同优化损失的L2/3、L:/;;,LABcCD和交叉熵损失Lta2k;最终的优化目标L如下所示:Intra-modal fusion; jointly optimize the loss L2/3 , L:/; , LABcCD and cross entropy loss Lta2k ; the final optimization target L is as follows:L=Lta2k+αL2/3+βL:/;;+γLABcCDL=Lta2k +αL2/3 +βL:/;; +γLABcCD ;其中,α,β和γ是决定对总体损失L的贡献的权重;N3是训练样本的数量;yl分别表示样本l的实际标签分布和预测的标签分布;是L2正则化,W为解码器参数。Among them, α, β and γ are weights that determine the contribution to the overall loss L; N3 is the number of training samples; yl and Represent the actual label distribution and predicted label distribution of sample 
2. The intent recognition method based on multi-modal fusion of voice and sight according to claim 1, characterized in that the sight features are extracted with a self-trained FGEN model, comprising:

using SWCNN as the training baseline model and extending the network architecture by attaching a face identity classifier, so as to achieve appearance-independent gaze estimation and the extraction of key facial features; the FGEN model is trained with a meta-learning strategy, i.e., at each training iteration subjects are randomly selected from the face-gaze dataset to form a meta-training set and a meta-test set, yielding the sight feature embedding z_v ∈ R^(l_v×h_v), where l_v is the sequence length of the key frames and h_v is the feature dimension of each frame; R^(l_v×h_v) denotes an l_v×h_v real-valued matrix containing the feature embeddings of all key frames, each row corresponding to the features of one key frame and each column to the values of one feature dimension; that is:

z_v = FGEN(C_id, E_r, ML);

where C_id denotes the face identity classifier, E_r denotes the gaze prediction model, and ML denotes the meta-learning strategy.
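As a rough illustration of the structure described in claim 2, the sketch below attaches a gaze regression head and a face identity classifier to one shared backbone. The SWCNN backbone itself is not specified in this text, so a small generic CNN stands in for it; the names GazeFeatureNet, feat_dim and n_subjects are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class GazeFeatureNet(nn.Module):
    """Sketch of an FGEN-style network: a shared backbone feeding
    (1) a gaze regression head (E_r) and (2) a face identity classifier (C_id)."""

    def __init__(self, feat_dim: int = 128, n_subjects: int = 50):
        super().__init__()
        # Stand-in for the SWCNN full-face backbone (its architecture is not given here).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Gaze head: predicts (yaw, pitch) from the gaze feature h_r.
        self.gaze_head = nn.Linear(feat_dim, 2)
        # Identity head: tries to recognise the subject from the same feature.
        self.id_head = nn.Linear(feat_dim, n_subjects)

    def forward(self, face: torch.Tensor):
        h_r = self.backbone(face)        # facial gaze feature vector h_r
        gaze = self.gaze_head(h_r)       # yaw/pitch prediction
        id_logits = self.id_head(h_r)    # subject identity logits for C_id
        return gaze, id_logits, h_r
```

During adversarial training the identity head is optimized to recognise the subject while the shared backbone is pushed toward the opposite goal, which is what removes appearance information from h_r.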
3. The intent recognition method based on multi-modal fusion of voice and sight according to claim 1, characterized in that the FGEN model is trained with a meta-learning strategy and the training uses an adversarial strategy to make the sight features independent of the subject's appearance, comprising:

updating the face recognition parameters;

updating the parameters of the FGEN model with the gaze loss and the adversarial loss;

the gaze prediction model converts a face image into a gaze direction vector:

(v_r, h_r) = E_r(D);

where v_r denotes the gaze yaw/pitch vector label, h_r denotes the facial gaze feature vector output by the FGEN model, E_r denotes the facial gaze prediction model, and D denotes the face dataset;

C_id denotes the face identity classifier and p_hat_id denotes the recognition probability vector over the subject classes in the training samples;

the parameters of the face identity classifier C_id are updated with the cross-entropy loss between the recognition probabilities and the labels; the cross-entropy loss L_id is:

L_id = (1/N_1)·Σ_n CE(p_id^n, p_hat_id^n);

where p_id^n is the identity label of the n-th face sample, p_hat_id^n is its predicted recognition probability, N_1 is the total number of training samples, and CE(·) denotes the (logarithmic) cross-entropy function;

next, the FGEN model proceeds with adversarial training; the facial gaze prediction model E_r is optimized toward the objective opposite to that of the face identity classifier; the adversarial loss for appearance generalization is therefore defined as:

L_adv = 1 − (p_hat_id·e)/(||p_hat_id||·||e||);

where p_hat_id characterizes how accurately the subject's identity can be predicted, L_adv denotes the adversarial loss for appearance generalization, e = [1/k_1, ..., 1/k_1] is a uniform distribution, k_1 is the number of subjects in the training set, and ||·|| denotes the norm; the loss is constructed with cosine similarity so that the predicted identity probability is pushed toward the uniform distribution;

the gaze direction loss uses the L1 loss L_r, defined as:

L_r = (1/N_1)·Σ_n |v_r^n − v_hat_r^n|;

where v_hat_r denotes the vector predicted by the facial gaze prediction model E_r;

training uses the multi-objective loss L_total:

L_total = η_adv·L_adv + L_r.

4. The intent recognition method based on multi-modal fusion of voice and sight according to claim 3, characterized in that at each training step one subject of the training set is randomly selected and a meta-learning method, i.e., a first-order gradient algorithm, is applied to update the model, comprising:

first, k' subjects are randomly sampled from the training dataset and used to build a new meta-training set; the meta-training sequences are fed into the FGEN model, and the gradient vector is computed during optimization as:

g = U(φ_0, D_meta-train, τ);

where φ_0 is the initial weight of the gaze prediction model, τ is the number of meta-training iterations, g denotes the gradient vector computed while optimizing the gaze prediction model, and U(·) is the function that computes it;

then, the initial weights are updated as:

φ = φ_0 + ε·(φ~ − φ_0);

where ε is the step size used by the stochastic gradient descent operation, φ is the weight of the gaze prediction model optimized by the meta-training process, and φ~ denotes the weights updated by gradient descent;

next, a meta-adaptation set of non-overlapping subjects is constructed within the meta-training set, and the weights of the meta-adaptation iteration are updated accordingly; after a meta-adaptation iteration, a new meta-adaptation set is created by selecting new subjects; the meta-adaptation process is repeated δ times and the updated weights are computed;

finally, the facial gaze prediction model E_r is updated from these weights, where y is the number of meta-adaptation iterations and the corresponding weight is that of the facial gaze prediction model optimized by the meta-adaptation process;

sight feature extraction: the processed images and the trained parameters are fed into the FGEN model to obtain the sight features z_v = FGEN(C_id, E_r, ML).
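One possible reading of the training procedure in claims 3 and 4 is sketched below: the identity cross-entropy, a cosine-similarity loss toward the uniform distribution, the L1 gaze loss, and a first-order (Reptile-style) meta-update. The update equations are only partially given above, so the inner/outer loop structure, the iterator meta_batches, and the hyper-parameters eta_adv, inner_lr, eps and inner_steps are illustrative assumptions, not the patented procedure itself.

```python
import copy
import torch
import torch.nn.functional as F

def adversarial_losses(gaze_pred, gaze_gt, id_logits, id_labels, eta_adv=0.5):
    """Identity CE (L_id), uniformity/adversarial loss (L_adv) and L1 gaze loss (L_r)."""
    # L_id: cross-entropy of the identity classifier.
    loss_id = F.cross_entropy(id_logits, id_labels)
    # L_adv: cosine-similarity loss driving the predicted identity
    # distribution toward the uniform distribution e = [1/k, ..., 1/k].
    p_hat = F.softmax(id_logits, dim=-1)
    e = torch.full_like(p_hat, 1.0 / p_hat.size(-1))
    loss_adv = (1.0 - F.cosine_similarity(p_hat, e, dim=-1)).mean()
    # L_r: L1 loss between predicted and labelled yaw/pitch.
    loss_gaze = F.l1_loss(gaze_pred, gaze_gt)
    # L_total = eta_adv * L_adv + L_r (L_id is used to train the identity head itself).
    return loss_id, eta_adv * loss_adv + loss_gaze

def reptile_step(model, meta_batches, inner_lr=1e-3, eps=0.1, inner_steps=5):
    """First-order meta-update: adapt a clone on one subject's data,
    then move the initial weights toward the adapted weights."""
    clone = copy.deepcopy(model)
    opt = torch.optim.SGD(clone.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        face, gaze_gt, id_labels = next(meta_batches)       # batch from one subject
        gaze_pred, id_logits, _ = clone(face)
        _, loss = adversarial_losses(gaze_pred, gaze_gt, id_logits, id_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    # phi <- phi + eps * (phi_adapted - phi): first-order meta-gradient step.
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), clone.parameters()):
            p.add_(eps * (p_adapted - p))
```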
5. An intent recognition system based on multi-modal fusion of voice and sight, characterized by comprising:

a feature extraction module configured to extract text, speech and sight features from the voice and the face based on a pre-trained BERT model, a Wav2vec 2.0 model and a self-trained FGEN model;

a multi-modal representation module configured to comprise: modality-shared representation: constructing a modality-shared encoder to learn features shared across modalities; the modality-shared encoder maps the text, speech and sight features into a unified feature space to obtain the shared features, and the central moment discrepancy is used to minimize the similarity loss between the shared features of different modalities; modality-specific representation: constructing modality-specific encoders to learn the specific features of each modality; the modality-specific encoders map the text, speech and sight features into specific feature spaces to obtain the specific features; a difference loss ensures that the shared and specific features of the same modality follow different distributions, and that the specific features of different modalities are also distributed differently;

a multi-modal fusion module configured to comprise: intra-modal fusion: fusing the shared and specific features of each modality through a self-attention mechanism to obtain unimodal fusion features; cross-modal fusion: learning cross-modal correlated features with a cross-attention mechanism and fusing the features of different modalities through a gating mechanism to obtain the final fusion feature;

an intent recognition module configured to feed the final fusion feature into a multi-layer perceptron followed by a softmax layer that outputs the classification result, thereby performing intent recognition;

the modality-shared representation comprises:

before the multi-modal representation of the text feature z_t, the speech feature z_a and the sight feature z_v, a one-layer multi-head Transformer is used for preprocessing, yielding u_t, u_a and u_v;

the cross-modal shared features are learned by constructing a modality-shared encoder E_c(u_(t,a,v); θ_c); the modality-shared encoder maps u_t, u_a and u_v into a unified feature space, obtaining the shared features z_t^c, z_a^c and z_v^c of text, speech and sight, respectively;
the central moment discrepancy CMD is used to minimize the similarity loss; let X and Y be bounded random samples with probability distributions p and q on the interval [A, B]^N; the CMD regularizer CMD_K(X, Y) is defined as:

CMD_K(X, Y) = (1/|B−A|)·||E(X) − E(Y)||_2 + Σ_{k=2..K} (1/|B−A|^k)·||C_k(X) − C_k(Y)||_2;

where X and Y are instantiated, in turn, with each pair of the three modalities; A and B delimit the value range of X and Y; E(X) is the empirical expectation vector of the random sample X, and C_k(X) is the vector of all k-th-order sample central moments of X; the CMD_K between the shared features of each pair of modalities is then computed:

L_sim = Σ_{(m1,m2)∈{(t,a),(t,v),(a,v)}} CMD_K(z_m1^c, z_m2^c);

where t, a and v are the text, speech and sight identifiers, and z_m1^c and z_m2^c are the shared features of modalities m1 and m2, m1 and m2 being any two of the three modalities; minimizing the loss L_sim forces the shared feature representations of each pair of modalities to follow similar distributions;

the modality-specific representation comprises: constructing modality-specific encoders corresponding to text, speech and sight, respectively; the modality-specific encoders map u_t, u_a and u_v into separate feature spaces to obtain the specific features z_t^p, z_a^p and z_v^p, and the following difference loss L_diff is formed:

L_diff = Σ_{m∈{t,a,v}} ||(z_m^c)^T·z_m^p||^2 + Σ_{(m1,m2)} ||(z_m1^p)^T·z_m2^p||^2;

where ||·||^2 is the squared L2-norm; the difference loss L_diff is minimal when the shared and specific features are orthogonal; t, a and v are the text, speech and sight identifiers, and m1 and m2 are any two of the three modalities;

before the multi-modal fusion operation, a decoder De(·; θ_d) is constructed that takes the shared and specific features as input, where z^c denotes the shared features, z^p denotes the specific features and θ_d denotes the decoder parameters; the original feature space is reconstructed as:

u_hat_m = De(z_m^c + z_m^p; θ_d), m ∈ {t, a, v};

where u_hat_t, u_hat_a and u_hat_v carry the essential information of the three modality features;

the mean squared error is used to estimate the reconstruction error:

L_recon = (1/3)·Σ_{m∈{t,a,v}} ||u_m − u_hat_m||_2^2 + λ·||W||_2^2;

where ||·||^2 is the squared L2-norm, λ·||W||_2^2 is the regularization term that prevents overfitting, and W denotes the decoder parameters; L_recon denotes the reconstruction error, i.e., the error or loss incurred when the model encodes the input features into a latent representation and reconstructs them through the decoder;
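The representation losses of the system claim (CMD similarity loss, difference loss and MSE reconstruction loss) can be sketched as follows. The interval-scaling factors of CMD are omitted for brevity (they do not change what is matched when features are pre-normalized), and the helper names cmd, difference_loss, reconstruction_loss and similarity_loss are illustrative.

```python
import torch
import torch.nn.functional as F

def cmd(x: torch.Tensor, y: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Central moment discrepancy between two (batch, d) feature matrices:
    distance of the means plus distances of the 2nd..k-th central moments."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = torch.norm(mx - my, p=2)
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):
        loss = loss + torch.norm(cx.pow(order).mean(dim=0) - cy.pow(order).mean(dim=0), p=2)
    return loss

def difference_loss(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of shared^T @ specific: zero when the two
    feature subspaces are orthogonal."""
    s = F.normalize(shared, dim=-1)
    p = F.normalize(specific, dim=-1)
    return torch.norm(s.t() @ p, p="fro") ** 2

def reconstruction_loss(decoded: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    """MSE between the decoder output and the original (pre-encoding) feature."""
    return F.mse_loss(decoded, original)

def similarity_loss(z_t_c, z_a_c, z_v_c) -> torch.Tensor:
    """L_sim: CMD over the three pairs of shared modality features."""
    return cmd(z_t_c, z_a_c) + cmd(z_t_c, z_v_c) + cmd(z_a_c, z_v_c)
```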
the intra-modal fusion comprises:

the self-attention mechanism is computed as:

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V;

where Q = W_q·X_1, K = W_k·X_1 and V = W_v·X_1 are the query, key and value matrices, respectively; W_q, W_k and W_v are parameters to be learned and d_k is the dimension of K; for self-attention, the three matrices Q, K and V come from the same input; the shared and specific features of text, speech and sight are concatenated per modality and fed into the self-attention mechanism to obtain the unimodal fusion features h_t, h_a and h_v;

the cross-modal fusion comprises:

after obtaining the unimodal fusion features, a cross-attention mechanism learns the correlated features text-to-sight CA_t-v, sight-to-text CA_v-t, text-to-speech CA_t-a, speech-to-text CA_a-t, speech-to-sight CA_a-v and sight-to-speech CA_v-a; cross-attention combines two sequences asymmetrically, one serving as the input for Q and the other as the input for K and V:

CA(X_1, X_2) = softmax(Q_1·K_2^T / sqrt(d_k))·V_2;

the combination of CA_t-v and CA_v-t is fed into the sight gate to obtain the sight feature fusion weight W_gate_v; the combination of CA_t-a and CA_a-t is fed into the auditory gate to obtain the speech feature fusion weight W_gate_a; the combination of CA_a-v and CA_v-a is fed into the text gate to obtain the text feature fusion weight W_gate_t:

W_gate_v = sigmoid(W_v[CA_t-v; CA_v-t] + b_v);
W_gate_a = sigmoid(W_a[CA_t-a; CA_a-t] + b_a);
W_gate_t = sigmoid(W_t[CA_a-v; CA_v-a] + b_t);

where W_v, W_a and W_t are linear-layer parameters and b_v, b_a and b_t are bias terms;

according to the fusion weights, the sight feature h_v and the speech feature h_a are fused with the text feature h_t to obtain the final fusion feature h:

h = W_gate_v*h_v + W_gate_a*h_a + W_gate_t*h_t;

the intent recognition comprises:

after obtaining the final fusion feature h, h is fed into a multi-layer perceptron followed by a softmax layer to obtain the classification result:

y_hat = softmax(W_1·h + b);

where W_1 and b are the linear-layer parameters and bias term;

the similarity loss L_sim, the difference loss L_diff, the reconstruction loss L_recon and the cross-entropy task loss L_task are optimized jointly; the final objective L is:

L = L_task + α·L_sim + β·L_diff + γ·L_recon;

L_task = -(1/N_3)·Σ_l y_l·log(y_hat_l);
where α, β and γ are weights that determine the contribution of each term to the overall loss L; N_3 is the number of training samples; y_l and y_hat_l denote the true and predicted label distributions of sample l, respectively; ||W||_2^2 is the L2 regularization, with W the decoder parameters.
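Finally, the joint objective L = L_task + α·L_sim + β·L_diff + γ·L_recon can be assembled as in the short sketch below, reusing the loss helpers sketched earlier; the weight values and the optional decoder-parameter regularization term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels,
               loss_sim, loss_diff, loss_recon,
               alpha=0.3, beta=0.3, gamma=0.3,
               decoder_params=None, weight_decay=1e-4):
    """L = L_task + alpha*L_sim + beta*L_diff + gamma*L_recon,
    with optional L2 regularization on the decoder parameters W."""
    loss_task = F.cross_entropy(logits, labels)   # cross-entropy over the batch
    loss = loss_task + alpha * loss_sim + beta * loss_diff + gamma * loss_recon
    if decoder_params is not None:
        loss = loss + weight_decay * sum(p.pow(2).sum() for p in decoder_params)
    return loss
```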
