Intent recognition method and system based on multi-modal fusion of voice and sight
Technical Field
The invention relates to an intention recognition method and system based on multi-modal fusion of voice and sight, and belongs to the technical field of intention recognition.
Background
In the digital marketing metaverse interaction environment, intention recognition improves the efficiency and quality of customer interaction: it helps enterprises understand customer demands in depth, provide personalized services, realize precise marketing, and enhance customer satisfaction.
Currently, to accurately understand user intent, multi-modal fusion technology comprehensively exploits information from different modalities. In a digital marketing metaverse environment, the speech produced during human-computer interaction carries both the customer's textual content and intonation/emotion information, while facial gaze reflects the user's attention and points of interest. Fusing voice and gaze information therefore provides richer intention cues and improves the accuracy and robustness of intention recognition algorithms.
Facial gaze contains rich information about human intent, and deep-learning-based gaze estimation has attracted wide attention. Such methods can learn high-level gaze features from high-dimensional video images, markedly improving the accuracy of facial gaze estimation. However, because of lighting, occlusion, and differences in facial appearance, models trained on limited datasets of specific subjects tend to overfit. Applying facial gaze estimation to practical intent recognition therefore remains a challenge.
In addition, although multi-modal fusion techniques alleviate these problems to some extent, effectively fusing information from different modalities to improve the accuracy of customer intent recognition still faces many challenges, for example handling the heterogeneity and synchronization of data from different modalities and designing efficient feature fusion algorithms.
The main causes of these problems include incomplete or improper selection of modal information, limited gaze estimation training data, heterogeneity of modal information, information redundancy and noise, unreasonable design of the feature fusion scheme, poor weight allocation during fusion, and insufficient interaction among modalities.
Improper modal information selection, overfitting of the gaze estimation model, and an unreasonable multi-modal fusion method all degrade the generalization ability and accuracy of an intent recognition model. To solve these problems, model performance can be improved by careful selection of modal information, meta-learning, appropriate regularization, and a well-designed feature fusion method.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the invention provides a multi-modal fusion method for intention recognition based on the combined action of voice and gaze, aiming to solve the following technical problems:
1. Modal information selection: selecting appropriate modal information is the key to a human-computer interaction intention recognition task. Facial gaze information reflects the user's attention and points of interest, while speech provides the user's textual expression and intonation/emotion information. Both gaze and speech are key cues to user intent.
2. Overfitting of the gaze estimation model: in a gaze estimation task, overfitting manifests as very accurate predictions for a particular scene, individual, or gaze direction, but significantly degraded accuracy when the model faces different scenes or people.
3. Influence of the multi-modal feature fusion method on intention recognition accuracy: multi-modal feature fusion extracts features from multiple modalities (such as speech, text, images, and gestures) and combines them effectively to improve the model's ability to understand user intent in complex scenes. A suitable fusion method strengthens the complementarity between modalities and improves the accuracy and robustness of intention recognition.
The technical scheme of the invention is as follows:
an intention recognition method based on multi-modal fusion of voice and sight comprises the following steps:
Feature extraction, extracting text, speech and line of sight features from speech and faces based on a pre-trained BERT model, a Wav2vec 2.0 model, and a self-training FGEN model;
The multi-modal representation comprises 1) modal sharing representation, wherein a modal sharing encoder is constructed to learn cross-modal sharing characteristics, the modal sharing encoder converts text, voice and sight characteristics into a unified characteristic space to obtain sharing characteristics, the central moment difference is utilized to minimize similarity loss among different modal sharing characteristics, 2) modal specific representation, wherein a modal specific encoder is constructed to learn specific characteristics of each mode, the modal specific encoder converts the text, voice and sight characteristics into a specific characteristic space to obtain specific characteristics, and the difference loss ensures that the sharing characteristics and the specific characteristics of the same mode are distributed differently, and meanwhile, the specific characteristic distribution of different modes is also different;
The multi-mode fusion comprises 3) intra-mode fusion, namely fusing the shared characteristic and the specific characteristic of each mode through a self-attention mechanism to obtain a single-mode fusion characteristic, and 4) cross-mode fusion, namely learning the related characteristic of the cross-mode through a cross-attention mechanism and fusing the characteristics of different modes through a gating mechanism to obtain a final fusion characteristic;
Intention recognition: the final fusion features are input into a multi-layer perceptron, followed by a softmax layer that outputs the classification result, so as to perform intention recognition.
Extracting text features based on the pre-trained BERT model, comprising:
capturing speech in the interaction environment in an active acquisition mode; the transcribed text T = {w_1, w_2, ..., w_m} is input into the BERT model, where w_i is a word vector in the text and m is the total number of word vectors; the output of the last hidden layer of the BERT model represents the text feature F_t ∈ R^(l_t×d_t), where l_t is the length of the text utterance sequence and d_t is the feature dimension of each token.
Extracting voice features based on a pre-trained Wav2vec 2.0 model, comprising:
capturing speech A = {a_1, a_2, ..., a_n} of the interaction environment in an active acquisition mode; each speech segment a_i is input into the Wav2vec 2.0 model, and the output of the last hidden layer of the Wav2vec 2.0 model represents the speech feature F_a ∈ R^(l_a×d_a), where l_a is the length of the speech segment sequence and d_a is the feature dimension of each token.
According to the invention, the method for extracting the sight line features based on the pre-trained FGEN model comprises the following steps:
SWCNN is used as the training baseline model, and the network architecture is extended by attaching a face identity classifier (face recognition model) to realize appearance-independent gaze estimation and extraction of key facial features. The FGEN model is trained with a meta-learning strategy: in each training iteration, subjects are randomly selected from the face gaze dataset to form a meta-training set and a meta-testing set, yielding the gaze feature embedding F_g ∈ R^(l_g×d_g), where l_g is the sequence length of the key frames and d_g is the feature dimension of each frame; R^(l_g×d_g) denotes a real matrix containing the feature embeddings of all key frames, each row corresponding to the features of one key frame and each column to a specific feature dimension:
F_g = FGEN(C_id, F_gaze, ML);
wherein C_id denotes the face identity classifier, F_gaze denotes the gaze prediction model, and ML denotes the meta-learning strategy.
According to the invention, the FGEN model is trained with a meta-learning strategy, and the training further uses an adversarial strategy to make the gaze features independent of the appearance of the subject, comprising:
updating the face recognition parameters;
updating the parameters of the FGEN model with the gaze loss and the adversarial loss.
The gaze prediction model converts the face image into a gaze direction vector, as follows:
ĝ = F_gaze(I), I ∈ D;
wherein g denotes the gaze yaw and pitch vector label, e_g denotes the facial gaze feature vector output by the FGEN model, F_gaze denotes the facial gaze prediction model, and D denotes the face dataset;
C_id denotes the face identity classifier, and p denotes the recognition probability vector over the faces of the training subjects;
The parameters of the face identity classifier C_id are updated according to the cross-entropy loss L_id between the recognition probabilities and the labels, as follows:
L_id = -(1/N) Σ_{n=1}^{N} y_n · log(p_n);
wherein y_n is the identity label of the n-th subject's face, p_n is the predicted recognition probability of the n-th subject's face, N is the total number of training data, and log denotes the logarithm function;
Next, the FGEN model continues with adversarial training: the facial gaze prediction model F_gaze is optimized against an objective opposite to that of the face identity classifier, whereby the adversarial loss for appearance generalization is defined as follows:
s_n = (u · p_n) / (||u|| · ||p_n||);
L_adv = (1/N) Σ_{n=1}^{N} (1 − s_n);
wherein s_n denotes the cosine similarity between the predicted identity probability of the subject's face and the uniform distribution, and L_adv denotes the adversarial loss for appearance generalization; u is the uniform distribution and k is the number of subjects in the training set; ||·|| denotes the norm; the loss function is constructed from cosine similarity so that the predicted identity probabilities become uniformly distributed;
The gaze direction loss L_gaze uses the L1 loss and is defined as:
L_gaze = (1/N) Σ_{n=1}^{N} ||g_n − ĝ_n||_1;
wherein ĝ_n denotes the vector predicted by the facial gaze prediction model F_gaze.
Training uses the multi-objective loss function L_FGEN, as follows:
L_FGEN = L_gaze + λ_1 · L_id + λ_2 · L_adv;
wherein the weights λ_1 and λ_2 are set empirically.
According to the present invention, preferably, one subject in the training set is randomly selected in each training step, and a meta learning method, i.e., an algorithm based on a first order gradient is applied to update the model, comprising:
first, k subjects are randomly drawn from the training dataset and their face images form a new meta-training set D_mtr; the meta-training set D_mtr is fed into the FGEN model in sequence, and the gradient vector is calculated during optimization as follows:
g_x = ∇_θ L_FGEN(θ_x, D_mtr);
wherein θ_0 is the initial weight of the gaze prediction model, x is the meta-training iteration number, g_x denotes the gradient vector calculated during optimization of the gaze prediction model, and ∇ denotes the gradient computation;
The initial weights are then updated using the following equations:
θ_{x+1} = θ_x − ε · g_x;
θ̃ = θ_{x+1};
wherein ε is the step size used for the stochastic gradient descent operation, θ̃ is the weight of the gaze prediction model optimized by the meta-training process, and θ_{x+1} refers to the weights updated by gradient descent;
next, a meta-adaptation set D_mad is constructed with p subjects that do not overlap the meta-training set D_mtr, and the weights θ̃ are updated in the meta-adaptation iterations as follows:
θ̂_{y+1} = θ̂_y − ε · ∇_θ L_FGEN(θ̂_y, D_mad), with θ̂_0 = θ̃;
after the meta-adaptation iteration, a new meta-adaptation set D_mad is created by selecting a new subject; the meta-adaptation process is repeated δ times and the updated weights are calculated;
Finally, the facial gaze prediction model F_gaze is updated according to the following formulas:
θ* = θ̂_y;
θ ← θ + ε · (θ* − θ);
wherein y is the meta-adaptation iteration number, and θ* is the weight of the facial gaze prediction model optimized with D_mad in the meta-adaptation process;
gaze feature extraction: the processed image and the trained parameters are input into the FGEN model to obtain the gaze features F_g = FGEN(C_id, F_gaze, ML).
According to the invention, preferably, the text features F_t, speech features F_a and gaze features F_g are preprocessed with a layer of multi-head transformer before the multi-modal representation, giving u_t, u_a and u_g.
According to a preferred embodiment of the present invention, the modal sharing representation comprises:
A modality-shared encoder E_s is constructed to learn cross-modal shared features; the modality-shared encoder converts u_t, u_a and u_g into a unified feature space, obtaining the shared features h_t^s, h_a^s and h_g^s of text, speech and gaze respectively:
h_t^s = E_s(u_t; θ_s);
h_a^s = E_s(u_a; θ_s);
h_g^s = E_s(u_g; θ_s);
The central moment discrepancy (CMD) is used to minimize the similarity loss. Let X and Y be bounded random samples with probability distributions p and q on the interval [a, b]; the CMD regularizer CMD_K is defined as:
CMD_K(X, Y) = (1/|b−a|) ||E(X) − E(Y)||_2 + Σ_{k=2}^{K} (1/|b−a|^k) ||C_k(X) − C_k(Y)||_2;
wherein X and Y are taken over pairs of the three modalities, a and b denote the bounds of the value ranges of X and Y, E(X) is the empirical expectation vector of the sample X, and C_k(X) is the vector of all k-th order sample central moments of X; the similarity loss over the shared features of each pair of modalities is computed as:
L_sim = (1/3) Σ_{(m1,m2)} CMD_K(h_{m1}^s, h_{m2}^s);
wherein t, a and g are the text, speech and gaze identifiers respectively, h_{m1}^s and h_{m2}^s are the shared features of modality m1 and modality m2 respectively, and modality m1 and modality m2 are any two of the three modalities; minimizing the loss L_sim forces the shared feature representation distributions of each pair of modalities to be similar.
According to a preferred embodiment of the invention, the modality-specific representation comprises constructing modality-specific encoders E_p^t, E_p^a and E_p^g, corresponding to text, speech and gaze respectively; the modality-specific encoders convert u_t, u_a and u_g into their own feature spaces to obtain the specific features h_t^p, h_a^p and h_g^p:
h_t^p = E_p^t(u_t; θ_p^t);
h_a^p = E_p^a(u_a; θ_p^a);
h_g^p = E_p^g(u_g; θ_p^g);
The difference loss L_diff is then formed, calculated as follows:
L_diff = Σ_{m ∈ {t,a,g}} ||h_m^{s⊤} h_m^p||_F^2 + Σ_{(m1,m2)} ||h_{m1}^{p⊤} h_{m2}^p||_F^2;
wherein ||·||_F^2 is the squared L2 (Frobenius) norm; if h_m^s and h_m^p, and h_{m1}^p and h_{m2}^p, are orthogonal, the difference loss L_diff is minimal; t, a and g are the text, speech and gaze identifiers respectively, h_{m1}^p and h_{m2}^p are the specific features of modality m1 and modality m2 respectively, and modality m1 and modality m2 are any two of the three modalities.
According to a preferred embodiment of the present invention, a decoder D(h_m^s ⊕ h_m^p; θ_d) is constructed prior to the multi-modal fusion operation to take the shared feature and the specific feature as input, wherein h_m^s denotes the shared features, h_m^p denotes the specific features, and θ_d denotes the decoder parameters; the original feature space is reconstructed as:
û_t = D(h_t^s ⊕ h_t^p; θ_d);
û_a = D(h_a^s ⊕ h_a^p; θ_d);
û_g = D(h_g^s ⊕ h_g^p; θ_d);
wherein û_t, û_a and û_g respectively represent the reconstructions carrying the intrinsic information of the three modal features;
The reconstruction error is estimated using the mean square error:
L_recon = (1/3) Σ_{m ∈ {t,a,g}} ||u_m − û_m||_2^2 + λ||W||_2^2;
wherein ||·||_2^2 is the squared L2 norm, λ||W||_2^2 is a regularization term to prevent overfitting, and W are the decoder parameters; the reconstruction error refers to the error or loss the model produces when encoding the input data into a latent representation and reconstructing it through the decoder.
According to the invention, the intra-modal fusion comprises:
The self-attention mechanism is calculated as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V;
wherein Q, K and V respectively denote the query, key and value matrices, W_Q, W_K and W_V are the parameters to be learned, and d_k is the dimension of K; for the self-attention mechanism, the three matrices Q, K and V come from the same input: the shared and specific features of text, speech and gaze are respectively concatenated and input into the self-attention mechanism to obtain the single-modality fusion features h_t^f, h_a^f and h_g^f.
According to the invention, the preferred cross-modal fusion comprises:
after obtaining the single-modality fusion features, a cross-attention mechanism is used to learn the cross-modal correlation features h_{t→g}, h_{g→t}, h_{t→a}, h_{a→t}, h_{a→g} and h_{g→a} for text-to-gaze, gaze-to-text, text-to-speech, speech-to-text, speech-to-gaze and gaze-to-speech; the cross-attention mechanism asymmetrically combines two sequences, one serving as the input to Q and the other as the input to K and V; the cross-attention mechanism is calculated as follows:
CrossAttention(Q, K, V) = softmax(QK^T / √d_k) · V;
h_{t→g} and h_{a→g} are combined and input to the gaze gate to obtain the gaze feature fusion weight w_g; h_{t→a} and h_{g→a} are combined and input to the auditory gate to obtain the speech feature fusion weight w_a; h_{g→t} and h_{a→t} are combined and input to the text gate to obtain the text feature fusion weight w_t; the calculation formulas are as follows:
w_g = σ(W_g · [h_{t→g} ⊕ h_{a→g}] + b_g);
w_a = σ(W_a · [h_{t→a} ⊕ h_{g→a}] + b_a);
w_t = σ(W_t · [h_{g→t} ⊕ h_{a→t}] + b_t);
wherein W_g, W_a and W_t are the linear layer parameters, and b_g, b_a and b_t are the bias terms;
according to the fusion weights, the gaze features and the speech features are fused with the text features to obtain the final fusion feature h:
h = w_t ⊙ h_t^f + w_a ⊙ h_a^f + w_g ⊙ h_g^f.
According to the invention, the intention recognition comprises the following steps:
after the final fusion feature h is obtained, h is input into a multi-layer perceptron, followed by a softmax layer, to obtain the classification result ŷ, as follows:
ŷ = softmax(W · MLP(h) + b);
wherein W and b represent the linear layer parameters and the bias term.
According to the invention, the losses L_sim, L_diff, L_recon and the cross-entropy loss L_task are preferably optimized jointly; the final optimization objective L is as follows:
L_task = -(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i);
L = L_task + α·L_sim + β·L_diff + γ·L_recon + λ||W||_2^2;
wherein α, β and γ are the weights determining the contribution of each term to the overall loss L, and N is the number of training samples; y_i and ŷ_i respectively represent the actual label distribution and the predicted label distribution of sample i; λ||W||_2^2 is L2 regularization, and W are the model parameters.
A computer device comprising a memory storing a computer program and a processor implementing steps of a speech and gaze multimodal fusion based intent recognition method when executing the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a speech and gaze multimodal fusion based intent recognition method.
An intent recognition system based on a multi-modal fusion of speech and gaze comprising:
A feature extraction module configured to extract text, speech and line-of-sight features from speech and faces based on the pre-trained BERT model, the Wav2vec 2.0 model, and the self-training FGEN model;
The multi-mode representation module is configured to comprise a mode sharing representation, a mode specific encoder, a differential loss and a mode analysis module, wherein the mode sharing representation comprises the steps of constructing a mode sharing encoder to learn cross-mode sharing characteristics, converting text, voice and sight characteristics into a unified characteristic space by the mode sharing encoder to obtain sharing characteristics, minimizing similarity loss among sharing characteristics of different modes by utilizing a central moment difference, constructing a mode specific encoder to learn specific characteristics of each mode, converting the text, voice and sight characteristics into a specific characteristic space by the mode specific encoder to obtain the specific characteristics, ensuring that the sharing characteristics and the specific characteristics of the same mode are distributed differently by the differential loss, and meanwhile, ensuring that the specific characteristics of different modes are distributed differently;
The multi-mode fusion module is configured to comprise intra-mode fusion, cross-mode fusion, wherein the intra-mode fusion comprises the step of fusing shared features and specific features of each mode through a self-attention mechanism to obtain single-mode fusion features, the step of cross-mode fusion comprises the steps of learning cross-mode related features through a cross-attention mechanism and fusing features of different modes through a gating mechanism to obtain final fusion features;
And the intention recognition module is configured to input the final fusion characteristics into the multi-layer perceptron, and connect the softmax layer to output a classification result to perform intention recognition.
The beneficial effects of the invention are as follows:
1. The method alleviates overfitting of the gaze estimation model: one subject in the training set is randomly selected in each training step, and a meta-learning method (a first-order-gradient-based algorithm) is applied to update the model, which mitigates overfitting and allows the model to optimize new parameters.
2. Efficient feature extraction: a gaze feature extraction method based on the full-face appearance and a random identity adversarial network realizes appearance-independent gaze estimation and extraction of key facial features.
3. Optimized multi-modal information fusion: a fusion method using an attention-based gated neural network raises the status of the speech and gaze features and helps achieve a more comprehensive understanding of customer intent.
4. The gaze feature extraction method based on the full-face appearance and random identity adversarial network, together with the attention-based gated neural network fusion method, improves the accuracy and robustness of intention recognition.
Drawings
FIG. 1 is a flow chart of an intent recognition method based on multi-modal fusion of speech and gaze;
FIG. 2 is a block flow diagram of training FGEN a model using a meta-learning strategy;
FIG. 3 is an overall block diagram of a line-of-sight feature extraction method based on full-face appearance and random identity challenge network;
FIG. 4 is a schematic diagram of a self-attention computing process;
Fig. 5 is a schematic diagram of a cross-attention computing process.
Detailed Description
The invention is further defined by, but is not limited to, the following drawings and examples in conjunction with the specification.
Example 1
An intention recognition method based on multi-modal fusion of voice and sight, as shown in fig. 1, comprises the following steps:
Feature extraction, extracting text, speech and line of sight features from speech and faces based on a pre-trained BERT model, a Wav2vec 2.0 model, and a self-training FGEN (Face to gaze encoder) model;
The multi-modal representation comprises 1) a modal sharing representation, wherein the modal sharing representation comprises the steps of constructing a modal sharing encoder to learn cross-modal sharing characteristics, converting text, voice and sight characteristics into uniform characteristic space by the modal sharing encoder to obtain sharing characteristics, minimizing similarity loss among different modal sharing characteristics by using a Central Moment Difference (CMD), 2) the modal specific representation comprises the steps of constructing a modal specific encoder to learn specific characteristics of each modality, converting the text, voice and sight characteristics into specific characteristic space by the modal specific encoder to obtain specific characteristics, ensuring different distributions of the sharing characteristics and the specific characteristics of the same modality by the differential loss, and simultaneously ensuring different specific characteristic distributions of different modalities;
The multi-modal fusion comprises 3) intra-modal fusion, namely, fusing the sharing characteristic and the specific characteristic of each mode through a self-attention mechanism (shown in figure 4) to obtain a single-modal fusion characteristic, and 4) cross-modal fusion, namely, learning the cross-modal related characteristic through a cross-attention mechanism (shown in figure 5) and fusing the different-modal characteristics through the gating mechanism to obtain a final fusion characteristic;
and (3) intention recognition, namely inputting the final fusion characteristics into a multi-layer perceptron (MLP), connecting a softmax layer to output a classification result, and carrying out intention recognition.
Example 2
The intention recognition method based on the multi-modal fusion of voice and sight is characterized in that:
in the present embodiment of the present invention,
The modal sharing encoder is a structure for multi-modal learning, and has the main function of extracting sharing characteristics across different modalities.
The mode-specific encoder is an encoder structure specifically designed for each mode independently, unlike a mode-shared encoder.
The decoder is a module for mapping the features extracted by the encoder back to the required output space. The main task of the decoder is to restore the feature vectors or feature maps extracted by the encoder to a specific output format, e.g. to generate a density map, to partition a map, to reconstruct an image, or to perform a classification task.
The self-attention mechanism is a deep learning mechanism, is widely applied to the fields of natural language processing, computer vision and the like, and is used for capturing the association relationship inside the features. It learns global dependencies within the data by letting each feature location focus on other locations in the input.
The cross-attention mechanism is a deep learning mechanism and is mainly used for processing feature interaction and fusion under multi-mode or multi-input situations. Unlike the self-attention mechanism, the cross-attention mechanism calculates attention weights between different modalities or different inputs in order to capture associations between different data sources to achieve feature complementation.
Face identity classifier (face recognition model): receives the gaze representation vector as input and predicts the face identity probability. The proposed FGEN architecture aims to extract gaze representations that generalize across appearance by predicting uniform probability values for all faces.
Gaze prediction model: converts a face image into a gaze direction vector (yaw and pitch), ĝ = F_gaze(I), I ∈ D, where D is the normalized face image set, ĝ is the gaze vector, and e_g is the facial gaze feature vector extracted from the intermediate layer.
Extracting text features based on the pre-trained BERT model, comprising:
The pre-trained BERT model can extract textual semantic features and has become an essential module for natural language understanding tasks; therefore, the pre-trained language model BERT is used to extract text features. In contrast to Word2vec, the BERT model resolves ambiguity by considering context. Meanwhile, the different semantic features obtained through hierarchical learning provide rich feature choices for downstream tasks. Fine-tuning downstream tasks on the pre-trained BERT model achieves good performance with only a small number of training samples. Speech in the interaction environment is captured in an active acquisition mode; the transcribed text T = {w_1, w_2, ..., w_m} is input into the BERT model, where w_i is a word vector in the text and m is the total number of word vectors; the output of the last hidden layer of the BERT model represents the text feature F_t ∈ R^(l_t×d_t), where l_t is the length of the text utterance sequence and d_t is the feature dimension of each token. The feature extraction process is as follows:
F_t = BERT(T; θ_BERT).
extracting voice features based on a pre-trained Wav2vec 2.0 model, comprising:
The Wav2vec 2.0 model is an unsupervised speech pre-training model published by Meta in 2020. Its core idea is to construct a self-supervised training target through vector quantization (VQ). The model performs better than standard acoustic features, so the pre-trained Wav2vec 2.0 model is used to extract speech features. Similarly, speech A = {a_1, a_2, ..., a_n} of the interaction environment is captured in an active acquisition mode; each speech segment a_i is input into the Wav2vec 2.0 model, and the output of the last hidden layer of the Wav2vec 2.0 model represents the speech feature F_a ∈ R^(l_a×d_a), where l_a is the length of the speech segment sequence and d_a is the feature dimension of each token. The feature extraction process is as follows:
F_a = Q(F(P(A)));
where P denotes the preprocessing operation, F denotes the feature extractor, and Q denotes the quantizer.
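As a purely illustrative aid, the following minimal sketch shows how such last-hidden-layer features could be extracted, assuming the Hugging Face transformers library and its public bert-base-uncased and facebook/wav2vec2-base-960h checkpoints; the function and variable names are placeholders and are not part of the claimed method.
```python
# Hedged sketch: last-hidden-layer feature extraction with pre-trained BERT and Wav2vec 2.0
# (assumes the Hugging Face `transformers` checkpoints named below; names are illustrative).
import torch
from transformers import BertTokenizer, BertModel, Wav2Vec2FeatureExtractor, Wav2Vec2Model

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
wav_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def extract_text_features(utterance: str) -> torch.Tensor:
    """Text feature F_t: output of BERT's last hidden layer, shape (l_t, d_t)."""
    inputs = tokenizer(utterance, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state.squeeze(0)      # (l_t, 768)

def extract_speech_features(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Speech feature F_a: output of Wav2vec 2.0's last hidden layer, shape (l_a, d_a)."""
    inputs = wav_extractor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = wav2vec(**inputs)
    return out.last_hidden_state.squeeze(0)      # (l_a, 768)
```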
Extracting sight line features based on a pre-trained FGEN model, including:
People observe things of interest around them through the eyes and receive important information; the gaze directed at an object is therefore a non-verbal cue that is important to human intent. Recent studies have shown that the information in the full-face region predicts gaze positions more accurately. The invention therefore provides a gaze feature extraction method based on the full-face appearance and a random identity adversarial network, the FGEN (Face to gaze encoder) model, which requires no special infrared or depth equipment and uses only an ordinary camera. The invention uses SWCNN (Switchable Convolutional Neural Network, a convolutional neural network model applied in multi-modal or multi-task scenarios) as the training baseline model and extends the network architecture by attaching a face identity classifier (face recognition model) to realize appearance-independent gaze estimation and extraction of key facial features. To alleviate model overfitting caused by the insufficient training set, the FGEN model is trained with a meta-learning strategy: in each training iteration, subjects are randomly selected from the face gaze dataset to form a meta-training set and a meta-testing set, yielding the gaze feature embedding F_g ∈ R^(l_g×d_g), where l_g is the sequence length of the key frames and d_g is the feature dimension of each frame; R^(l_g×d_g) denotes a real matrix containing the feature embeddings of all key frames, each row corresponding to the features of one key frame and each column to a specific feature dimension:
F_g = FGEN(C_id, F_gaze, ML);
wherein C_id denotes the face identity classifier, F_gaze denotes the gaze prediction model, and ML denotes the meta-learning strategy. The gaze feature embedding F_g captures, in the feature space, the unique information of eye gaze direction or gaze point, thereby providing richer context and semantic understanding for visual tasks. FGEN(·) denotes the FGEN function used to extract the gaze features.
The specific architecture of FGEN model is shown in figure 2.
The FGEN model is trained with a meta-learning strategy, and an adversarial strategy is used during training to make the gaze features independent of the appearance of the subject; as shown in fig. 3, the training comprises the following steps:
updating face recognition parameters;
Parameters of the FGEN model are updated with line of sight loss and contrast loss.
The gaze prediction model converts the face image into a gaze direction vector, as follows:
ĝ = F_gaze(I), I ∈ D;
wherein g denotes the gaze yaw and pitch vector label, e_g denotes the facial gaze feature vector output by the FGEN model, F_gaze denotes the facial gaze prediction model, and D denotes the face dataset;
C_id denotes the face identity classifier, and p denotes the recognition probability vector over the faces of the training subjects;
face gaze invariant feature learning:
The parameters of the face identity classifier C_id are updated according to the cross-entropy loss L_id between the recognition probabilities and the labels, as follows:
L_id = -(1/N) Σ_{n=1}^{N} y_n · log(p_n);
wherein y_n is the identity label of the n-th subject's face, p_n is the predicted recognition probability of the n-th subject's face, N is the total number of training data, and log denotes the logarithm function;
identity probability distribution equality feature learning:
Next, the FGEN model continues with adversarial training: the facial gaze prediction model F_gaze is optimized against an objective opposite to that of the face identity classifier (i.e., the parameters of the gaze prediction model are optimized so that the identity probabilities of all samples become equally distributed), whereby the adversarial loss for appearance generalization is defined as follows:
s_n = (u · p_n) / (||u|| · ||p_n||);
L_adv = (1/N) Σ_{n=1}^{N} (1 − s_n);
wherein s_n denotes the cosine similarity between the predicted identity probability of the subject's face and the uniform distribution, and L_adv denotes the adversarial loss for appearance generalization; u is the uniform distribution and k is the number of subjects in the training set; ||·|| denotes the norm; the loss function is constructed from cosine similarity so that the predicted identity probabilities become uniformly distributed;
In this way, the facial gaze prediction model is driven away from relying on the appearance-related information of the face image.
The gaze direction loss L_gaze uses the L1 loss and is defined as:
L_gaze = (1/N) Σ_{n=1}^{N} ||g_n − ĝ_n||_1;
wherein ĝ_n denotes the vector predicted by the facial gaze prediction model F_gaze.
Training uses the multi-objective loss function L_FGEN, as follows:
L_FGEN = L_gaze + λ_1 · L_id + λ_2 · L_adv;
wherein the weights λ_1 and λ_2 are set empirically. After the parameters of the facial gaze prediction model are updated, the face recognition model is frozen. Through this adversarial learning, the limitation of appearance-based gaze estimation methods is alleviated by extracting the key appearance information without additional non-key labels.
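For illustration only, the sketch below shows one way the identity cross-entropy loss, the cosine-similarity adversarial term toward a uniform identity distribution, and the L1 gaze loss described above could be combined in PyTorch; the function names and the weights lam1/lam2 are hypothetical placeholders rather than values fixed by the invention.
```python
# Hedged sketch of the FGEN adversarial training losses (PyTorch assumed; names are illustrative).
import torch
import torch.nn.functional as F

def identity_ce_loss(identity_logits: torch.Tensor, identity_labels: torch.Tensor) -> torch.Tensor:
    """L_id: cross-entropy between predicted identity probabilities and identity labels."""
    return F.cross_entropy(identity_logits, identity_labels)

def appearance_adversarial_loss(identity_logits: torch.Tensor) -> torch.Tensor:
    """L_adv: push predicted identity probabilities toward a uniform distribution
    by maximizing cosine similarity with the uniform vector (hypothetical form)."""
    probs = identity_logits.softmax(dim=-1)                 # (N, k)
    uniform = torch.full_like(probs, 1.0 / probs.size(-1))  # (N, k), each entry 1/k
    cos = F.cosine_similarity(probs, uniform, dim=-1)       # (N,)
    return (1.0 - cos).mean()

def gaze_l1_loss(pred_gaze: torch.Tensor, true_gaze: torch.Tensor) -> torch.Tensor:
    """L_gaze: L1 loss between predicted and labeled (yaw, pitch) vectors."""
    return F.l1_loss(pred_gaze, true_gaze)

def fgen_total_loss(pred_gaze, true_gaze, identity_logits, identity_labels,
                    lam1: float = 1.0, lam2: float = 1.0) -> torch.Tensor:
    # Multi-objective loss; lam1/lam2 are empirical weights (placeholders).
    return (gaze_l1_loss(pred_gaze, true_gaze)
            + lam1 * identity_ce_loss(identity_logits, identity_labels)
            + lam2 * appearance_adversarial_loss(identity_logits))
```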
To improve generalization performance, a new learning strategy is introduced, designed to compensate for the overfitting caused by individual-specific appearance factors. Gaze estimation is usually evaluated with leave-one-out cross-validation, which leaves only one subject in the test dataset and uses the remaining subjects for training. Under such a training setting, the model is biased by the appearance factors of the limited subjects in the training dataset and overfitting easily occurs. To avoid these problems, a new training strategy is constructed: one subject in the training set is randomly selected in each training step, and a meta-learning method, i.e., a first-order-gradient-based algorithm, is applied to update the model, comprising:
first, k subjects are randomly drawn from the training dataset and their face images form a new meta-training set D_mtr; the meta-training set D_mtr is fed into the FGEN model in sequence, and the gradient vector is calculated during optimization as follows:
g_x = ∇_θ L_FGEN(θ_x, D_mtr);
wherein θ_0 is the initial weight of the gaze prediction model, x is the meta-training iteration number, g_x denotes the gradient vector calculated during optimization of the gaze prediction model, and ∇ denotes the gradient computation;
The initial weights are then updated using the following equations:
θ_{x+1} = θ_x − ε · g_x;
θ̃ = θ_{x+1};
wherein ε is the step size used for the stochastic gradient descent (SGD) operation, θ̃ is the weight of the gaze prediction model optimized by the meta-training process, and θ_{x+1} refers to the weights updated by gradient descent;
next, a meta-adaptation set D_mad is constructed with p subjects that do not overlap the meta-training set D_mtr, and the weights θ̃ are updated in the meta-adaptation iterations as follows:
θ̂_{y+1} = θ̂_y − ε · ∇_θ L_FGEN(θ̂_y, D_mad), with θ̂_0 = θ̃;
after the meta-adaptation iteration, a new meta-adaptation set D_mad is created by selecting a new subject; the meta-adaptation process is repeated δ times and the updated weights are calculated;
Finally, the facial gaze prediction model F_gaze is updated according to the following formulas:
θ* = θ̂_y;
θ ← θ + ε · (θ* − θ);
wherein y is the meta-adaptation iteration number, and θ* is the weight of the facial gaze prediction model optimized with D_mad in the meta-adaptation process; the invention applies this procedure to alleviate the overfitting problem, so that the model can optimize new parameters and is prevented from being updated in a subject-dominated direction.
Gaze feature extraction: the processed image and the trained parameters are input into the FGEN model to obtain the gaze features F_g.
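The following sketch outlines, under stated assumptions, how the described first-order meta-update could look as a Reptile-style interpolation toward the meta-adapted weights; the subject sampler, the loss callable, and the constants eps/beta/inner_steps are hypothetical and not prescribed by the invention. In practice the inner loop would alternate the face-identity-classifier update and the gaze/adversarial update described above.
```python
# Hedged sketch of the first-order meta-learning update for the gaze model (PyTorch assumed).
import copy
import torch

def meta_train_step(model, loss_fn, sample_subjects, eps: float = 1e-3, beta: float = 0.1,
                    inner_steps: int = 5):
    """One meta-iteration: inner SGD on a randomly sampled meta-training set of subjects,
    then interpolate the original weights toward the adapted weights (first-order update)."""
    adapted = copy.deepcopy(model)                  # work on a copy of the current weights
    opt = torch.optim.SGD(adapted.parameters(), lr=eps)
    for _ in range(inner_steps):
        images, gaze_labels, id_labels = sample_subjects()   # hypothetical sampler for k subjects
        opt.zero_grad()
        loss = loss_fn(adapted, images, gaze_labels, id_labels)
        loss.backward()
        opt.step()
    # Reptile-style outer update: theta <- theta + beta * (theta_adapted - theta)
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p.add_(beta * (p_adapted - p))
```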
There is complementarity and consistency between the different modalities. When people express themselves, language, sound and gaze share a common motivation and goal, showing consistency across modalities; at the same time, each carries unique semantics, intonation and expression, exhibiting complementarity. Therefore, modality-shared and modality-specific encoders are designed to learn the shared and specific characteristics of text, speech and gaze. This representation provides an overall view of each modality and lays the foundation for the subsequent multi-modal fusion. The text features F_t, speech features F_a and gaze features F_g are preprocessed with a layer of multi-head transformer before the multi-modal representation, giving u_t, u_a and u_g.
The modality sharing representation includes:
A modality-shared encoder E_s is constructed to learn cross-modal shared features; the modality-shared encoder converts u_t, u_a and u_g into a unified feature space, obtaining the shared features h_t^s, h_a^s and h_g^s of text, speech and gaze respectively:
h_t^s = E_s(u_t; θ_s);
h_a^s = E_s(u_a; θ_s);
h_g^s = E_s(u_g; θ_s);
To ensure that the shared features of different modalities of the same sample are similar, the central moment discrepancy (CMD) is used to minimize the similarity loss; CMD captures higher-order moment information than the KL divergence. Compared with the maximum mean discrepancy (MMD), CMD reduces the amount of computation because no kernel matrix needs to be computed. CMD estimates the difference between two distributions by matching their order-wise moment differences. Let X and Y be bounded random samples with probability distributions p and q on the interval [a, b]; the CMD regularizer CMD_K is defined as:
CMD_K(X, Y) = (1/|b−a|) ||E(X) − E(Y)||_2 + Σ_{k=2}^{K} (1/|b−a|^k) ||C_k(X) − C_k(Y)||_2;
wherein X and Y are taken over pairs of the three modalities, a and b denote the bounds of the value ranges of X and Y, E(X) is the empirical expectation vector of the sample X, and C_k(X) is the vector of all k-th order sample central moments of X; the similarity loss over the shared features of each pair of modalities is computed as:
L_sim = (1/3) Σ_{(m1,m2)} CMD_K(h_{m1}^s, h_{m2}^s);
wherein t, a and g are the text, speech and gaze identifiers respectively, h_{m1}^s and h_{m2}^s are the shared features of modality m1 and modality m2 respectively, and modality m1 and modality m2 are any two of the three modalities; intuitively, minimizing the loss L_sim forces the shared feature representation distributions of each pair of modalities to be similar.
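A minimal sketch of the CMD-based similarity loss under the standard definition given above is shown below; the number of moments K = 5 and the omission of the interval scaling factors are simplifying assumptions, not requirements of the invention.
```python
# Hedged sketch of a central moment discrepancy (CMD) similarity loss (PyTorch assumed).
import torch

def cmd(x: torch.Tensor, y: torch.Tensor, n_moments: int = 5) -> torch.Tensor:
    """CMD between two batches of shared features of shape (batch, dim);
    interval scaling factors 1/|b-a|^k are omitted for brevity."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = torch.norm(mx - my, p=2)          # first-order (mean) term
    cx, cy = x - mx, y - my
    for k in range(2, n_moments + 1):
        loss = loss + torch.norm((cx ** k).mean(dim=0) - (cy ** k).mean(dim=0), p=2)
    return loss

def similarity_loss(shared_t, shared_a, shared_g) -> torch.Tensor:
    """L_sim: average CMD over the three modality pairs (text, speech, gaze)."""
    pairs = [(shared_t, shared_a), (shared_t, shared_g), (shared_a, shared_g)]
    return sum(cmd(a, b) for a, b in pairs) / len(pairs)
```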
The modality-specific representation comprises constructing modality-specific encoders to learn the specific features of the different modalities, comprising E_p^t, E_p^a and E_p^g, corresponding to text, speech and gaze respectively; the modality-specific encoders convert u_t, u_a and u_g into their own feature spaces to obtain the specific features h_t^p, h_a^p and h_g^p:
h_t^p = E_p^t(u_t; θ_p^t);
h_a^p = E_p^a(u_a; θ_p^a);
h_g^p = E_p^g(u_g; θ_p^g);
On the one hand, a good modality-specific feature must ensure that the shared and specific features of the same modality have different distributions; on the other hand, the distributions of the specific features must also differ between modalities. The difference loss L_diff is therefore formed, calculated as follows:
L_diff = Σ_{m ∈ {t,a,g}} ||h_m^{s⊤} h_m^p||_F^2 + Σ_{(m1,m2)} ||h_{m1}^{p⊤} h_{m2}^p||_F^2;
wherein ||·||_F^2 is the squared L2 (Frobenius) norm; intuitively, if h_m^s and h_m^p are orthogonal, and h_{m1}^p and h_{m2}^p are orthogonal, the difference loss L_diff is minimal; t, a and g are the text, speech and gaze identifiers respectively, h_{m1}^p and h_{m2}^p are the specific features of modality m1 and modality m2 respectively, and modality m1 and modality m2 are any two of the three modalities.
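As an illustrative sketch only, the difference loss can be written in a squared-Frobenius orthogonality form as below; the dictionary-based interface and the absence of normalization are assumptions.
```python
# Hedged sketch of the difference (orthogonality) loss between shared and specific features.
import torch

def squared_frobenius(m: torch.Tensor) -> torch.Tensor:
    return torch.sum(m ** 2)

def difference_loss(shared: dict, specific: dict) -> torch.Tensor:
    """L_diff over modalities 't', 'a', 'g': shared vs. specific of the same modality,
    plus specific vs. specific across modality pairs (features of shape (batch, dim))."""
    mods = ["t", "a", "g"]
    loss = sum(squared_frobenius(shared[m].t() @ specific[m]) for m in mods)
    for i in range(len(mods)):
        for j in range(i + 1, len(mods)):
            loss = loss + squared_frobenius(specific[mods[i]].t() @ specific[mods[j]])
    return loss
```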
Before the multi-modal fusion operation, in order to ensure that the shared and specific features obtained by the encoders retain the basic properties of the original feature space, a decoder D(h_m^s ⊕ h_m^p; θ_d) is constructed that takes the shared feature and the specific feature as input, wherein h_m^s denotes the shared features, h_m^p denotes the specific features, and θ_d denotes the decoder parameters; the original feature space is reconstructed as:
û_t = D(h_t^s ⊕ h_t^p; θ_d);
û_a = D(h_a^s ⊕ h_a^p; θ_d);
û_g = D(h_g^s ⊕ h_g^p; θ_d);
wherein û_t, û_a and û_g respectively represent the reconstructions carrying the intrinsic information of the three modal features, whose feature dimensions and distributions are kept consistent with the original input features;
The reconstruction error is estimated using the mean square error (MSE):
L_recon = (1/3) Σ_{m ∈ {t,a,g}} ||u_m − û_m||_2^2 + λ||W||_2^2;
wherein ||·||_2^2 is the squared L2 norm, λ||W||_2^2 is a regularization term to prevent overfitting, and W are the decoder parameters; the reconstruction error refers to the error or loss the model produces when encoding the input data into a latent representation (e.g., a feature embedding) and reconstructing it through the decoder.
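A short, hedged sketch of this reconstruction check follows; the two-layer decoder architecture and the weight-decay coefficient lam are placeholders.
```python
# Hedged sketch of the decoder reconstruction loss (MSE + L2 weight regularization).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpecificDecoder(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # maps the concatenated [shared ; specific] representation back to the input space
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([shared, specific], dim=-1))

def reconstruction_loss(decoder, inputs: dict, shared: dict, specific: dict,
                        lam: float = 1e-4) -> torch.Tensor:
    """L_recon: MSE between each original feature u_m and its reconstruction, plus L2 on decoder weights."""
    mse = sum(F.mse_loss(decoder(shared[m], specific[m]), inputs[m]) for m in ("t", "a", "g")) / 3
    l2 = sum(p.pow(2).sum() for p in decoder.parameters())
    return mse + lam * l2
```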
Intra-modality fusion, comprising:
After the shared and specific features of each modality are obtained through multi-modal representation learning, they are concatenated and fed into a self-attention model. The attention mechanism captures the correlation between the shared and specific features, yielding the single-modality fusion feature. Self-attention handles parallel computation and long-range dependencies well, as shown in fig. 4. The self-attention mechanism is calculated as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V;
wherein Q, K and V respectively denote the query, key and value matrices, W_Q, W_K and W_V are the parameters to be learned, and d_k is the dimension of K; for self-attention, the three matrices Q, K and V come from the same input: the shared and specific features of text, speech and gaze are respectively concatenated and input into the self-attention mechanism to obtain the single-modality fusion features h_t^f, h_a^f and h_g^f.
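An illustrative sketch of intra-modal fusion with a standard multi-head self-attention layer is given below; the use of nn.MultiheadAttention, the head count, and mean pooling of the fused sequence are assumptions.
```python
# Hedged sketch of intra-modal fusion: self-attention over concatenated shared + specific features.
import torch
import torch.nn as nn

class IntraModalFusion(nn.Module):
    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
        """shared/specific: (batch, seq, feat_dim). Q, K and V all come from the same input."""
        x = torch.cat([shared, specific], dim=1)   # concatenate along the sequence axis
        fused, _ = self.attn(x, x, x)              # self-attention: Q = K = V = x
        return fused.mean(dim=1)                   # pooled single-modality fusion feature
```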
Cross-modal fusion, comprising:
After obtaining the single-modality fusion features, a cross-attention mechanism is used to learn the cross-modal correlation features h_{t→g}, h_{g→t}, h_{t→a}, h_{a→t}, h_{a→g} and h_{g→a} for text-to-gaze, gaze-to-text, text-to-speech, speech-to-text, speech-to-gaze and gaze-to-speech. The main difference between the cross-attention mechanism and the self-attention mechanism is that the inputs of cross-attention come from different sequences: the cross-attention mechanism asymmetrically combines two sequences, one serving as the input to Q and the other as the input to K and V, as shown in fig. 5. The cross-attention mechanism is calculated as follows:
CrossAttention(Q, K, V) = softmax(QK^T / √d_k) · V;
h_{t→g} and h_{a→g} are combined and input to the gaze gate to obtain the gaze feature fusion weight w_g; h_{t→a} and h_{g→a} are combined and input to the auditory gate to obtain the speech feature fusion weight w_a; h_{g→t} and h_{a→t} are combined and input to the text gate to obtain the text feature fusion weight w_t; the calculation formulas are as follows:
w_g = σ(W_g · [h_{t→g} ⊕ h_{a→g}] + b_g);
w_a = σ(W_a · [h_{t→a} ⊕ h_{g→a}] + b_a);
w_t = σ(W_t · [h_{g→t} ⊕ h_{a→t}] + b_t);
wherein W_g, W_a and W_t are the linear layer parameters, and b_g, b_a and b_t are the bias terms;
according to the fusion weights, the gaze features and the speech features are fused with the text features to obtain the final fusion feature h:
h = w_t ⊙ h_t^f + w_a ⊙ h_a^f + w_g ⊙ h_g^f.
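The sketch below gives one possible reading of the cross-attention plus gated fusion step; the pairing of cross-attention outputs per gate, the shared attention module, and the element-wise weighted sum are assumptions consistent with the description above, not a definitive implementation.
```python
# Hedged sketch of cross-modal fusion: cross-attention pairs gated into a final fused feature.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # one gate per target modality: text, speech (audio), gaze
        self.gates = nn.ModuleDict({m: nn.Linear(2 * feat_dim, feat_dim) for m in ("t", "a", "g")})

    def cross(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        """Cross-attention: Q from one modality, K and V from the other; returns a pooled vector."""
        out, _ = self.cross_attn(query, context, context)
        return out.mean(dim=1)

    def forward(self, h_t: torch.Tensor, h_a: torch.Tensor, h_g: torch.Tensor) -> torch.Tensor:
        """h_*: single-modality fusion features of shape (batch, seq, feat_dim)."""
        pairs = {"t": (self.cross(h_t, h_g), self.cross(h_t, h_a)),   # features flowing toward text
                 "a": (self.cross(h_a, h_t), self.cross(h_a, h_g)),   # toward speech
                 "g": (self.cross(h_g, h_t), self.cross(h_g, h_a))}   # toward gaze
        pooled = {"t": h_t.mean(dim=1), "a": h_a.mean(dim=1), "g": h_g.mean(dim=1)}
        fused = 0
        for m, (x1, x2) in pairs.items():
            w = torch.sigmoid(self.gates[m](torch.cat([x1, x2], dim=-1)))  # modality fusion weight
            fused = fused + w * pooled[m]
        return fused
```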
intent recognition, including:
after obtaining the final fusion feature h, h is input into a multi-layer perceptron (MLP), followed by a softmax layer, to obtain the classification result ŷ, as follows:
ŷ = softmax(W · MLP(h) + b);
wherein W and b represent the linear layer parameters and the bias term.
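A brief sketch of this prediction head (layer sizes are arbitrary placeholders):
```python
# Hedged sketch of the intent classification head: MLP followed by a softmax output layer.
import torch
import torch.nn as nn

class IntentHead(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, num_intents: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.out = nn.Linear(hidden_dim, num_intents)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """Returns intent class probabilities for the fused feature h of shape (batch, feat_dim)."""
        return torch.softmax(self.out(self.mlp(h)), dim=-1)
```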
To implement end-to-end training of the multi-modal representation, fusion and prediction, the losses L_sim, L_diff, L_recon and the cross-entropy loss L_task are optimized jointly; the final optimization objective L is as follows:
L_task = -(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i);
L = L_task + α·L_sim + β·L_diff + γ·L_recon + λ||W||_2^2;
wherein α, β and γ are the weights determining the contribution of each term to the overall loss L, and N is the number of training samples; y_i and ŷ_i respectively represent the actual label distribution and the predicted label distribution of sample i; λ||W||_2^2 is L2 regularization, which reduces overfitting of the model, and W are the model parameters.
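Finally, an illustrative sketch of how the joint objective could be assembled; the weighting coefficients alpha, beta, gamma and lam are placeholders to be tuned, not values given by the invention.
```python
# Hedged sketch of the joint end-to-end training objective.
import torch
import torch.nn.functional as F

def joint_objective(pred_probs, labels, l_sim, l_diff, l_recon, model,
                    alpha: float = 0.1, beta: float = 0.1, gamma: float = 0.1,
                    lam: float = 1e-5) -> torch.Tensor:
    """L = L_task + alpha*L_sim + beta*L_diff + gamma*L_recon + lam*||W||^2."""
    l_task = F.nll_loss(torch.log(pred_probs + 1e-12), labels)   # cross-entropy on softmax outputs
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return l_task + alpha * l_sim + beta * l_diff + gamma * l_recon + lam * l2
```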
Compared with other models, the model provided by the invention yields relatively stable performance improvements; in most cases its performance is higher than that of other models, its accuracy and macro-average scores are higher, and it has better domain adaptation and generalization capability.
Example 3
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of a speech and gaze multimodal fusion based intent recognition method as described in embodiments 1 or 2 when executing the computer program.
Example 4
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a speech and line-of-sight multimodal fusion-based intent recognition method as described in embodiments 1 or 2.
Example 5
An intent recognition system based on a multi-modal fusion of speech and gaze comprising:
A feature extraction module configured to extract text, speech and line-of-sight features from speech and faces based on the pre-trained BERT model, the Wav2vec 2.0 model, and the self-training FGEN model;
The multi-mode representation module is configured to comprise a mode sharing representation, a mode specific encoder, a differential loss and a mode analysis module, wherein the mode sharing representation comprises the steps of constructing a mode sharing encoder to learn cross-mode sharing characteristics, converting text, voice and sight characteristics into a unified characteristic space by the mode sharing encoder to obtain sharing characteristics, minimizing similarity loss among sharing characteristics of different modes by utilizing a central moment difference, constructing a mode specific encoder to learn specific characteristics of each mode, converting the text, voice and sight characteristics into a specific characteristic space by the mode specific encoder to obtain the specific characteristics, ensuring that the sharing characteristics and the specific characteristics of the same mode are distributed differently by the differential loss, and meanwhile, ensuring that the specific characteristics of different modes are distributed differently;
The multi-mode fusion module is configured to comprise intra-mode fusion, cross-mode fusion, wherein the intra-mode fusion comprises the step of fusing shared features and specific features of each mode through a self-attention mechanism to obtain single-mode fusion features, the step of cross-mode fusion comprises the steps of learning cross-mode related features through a cross-attention mechanism and fusing features of different modes through a gating mechanism to obtain final fusion features;
And the intention recognition module is configured to input the final fusion characteristics into the multi-layer perceptron, and connect the softmax layer to output a classification result to perform intention recognition.