
Virtual reality-oriented multisource fusion emotion support dialogue method

Info

Publication number: CN117370534A
Application number: CN202311561956.4A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: text, representation, feature vector, eye movement, modality
Inventors: 赖雨珊, 余建兴, 印鉴, 陈自豪, 田海川, 蔡泽彬
Current assignee: Sun Yat Sen University
Original assignee: Sun Yat Sen University
Application filed by: Sun Yat Sen University
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Abstract

The invention discloses a virtual-reality-oriented multi-source fusion emotion support dialogue method in the technical field of artificial intelligence. The method acquires the text, audio and eye-movement information of the user's real-time interaction and performs feature extraction on each to obtain a text feature vector, a speech feature vector and an eye-movement feature vector; the feature vectors of the three modalities are then fused to obtain an emotion representation vector; finally, the text feature vector and the emotion representation vector are input into a preset decoder to generate the dialogue reply text.

Description

A virtual-reality-oriented multi-source fusion emotion support dialogue method

Technical field

The present invention relates to the technical field of artificial intelligence, and more specifically to a virtual-reality-oriented multi-source fusion emotion support dialogue method.

Background art

Traditional emotional dialogue systems usually interact through text alone and lack richer channels of interaction such as body language and vocal intonation. As a result, the quality of the emotional dialogue cannot be guaranteed: the system's output tends to be homogeneous and formulaic, and the system cannot properly understand the context or the emotional state of the interlocutor.

In recent years, with the rapid development of human-computer interaction in the field of artificial intelligence, emotion recognition technologies based on different modalities have been improving iteratively. Emotion recognition plays an important role in human-computer interaction in fields such as healthcare and education. By extracting both the user's physiological signals (such as EEG, heart rate, respiration and body temperature) and external expressions (such as voice, text, facial expressions, eye movements and posture), a machine can be trained to recognize the user's emotion, generate emotionally rich dialogue conditioned on a specified emotion, and interact with the user in textual form. With the development of VR technology, a user's emotional characteristics can be collected through VR equipment to help analyze and identify the user's emotion, so that the user obtains a good experience during the interaction.

Current text sentiment analysis techniques mainly include methods based on sentiment dictionaries, on machine learning and on deep learning. Sentiment dictionaries can be constructed either manually or automatically; both approaches suffer from high manual overhead, poor suitability for cross-domain research, high construction cost, the need to preprocess the corpus, and low dictionary accuracy. Machine learning methods include naive Bayes, maximum entropy and support vector machines. Naive Bayes is sensitive to the representation of the input data, requires the computation of prior probabilities and has a non-negligible error rate in classification decisions; maximum entropy often reaches only a local rather than a global optimum, and a large number of samples and constraint functions makes the iterative optimization computationally expensive and hard to apply in practice; support vector machines have a low generalization error rate and modest computational overhead, but are sensitive to parameter tuning and to the choice of kernel function. Among deep learning methods, the most widely adopted is sentiment analysis based on the BERT model. BERT is a pre-trained language model built on the Transformer encoder architecture, which strengthens parallel computation, and its attention mechanism helps each word obtain better contextual information. However, its masking strategy replaces only individual characters rather than whole words, which limits its ability to understand contextual emotion. In addition, the above methods usually operate on a single data source, which easily leads to misjudging the user's emotional state.

Most current emotional dialogue generation techniques generate replies for a pre-specified emotion. For example, once the user's emotion has been labelled as sad, a traditional emotional dialogue system mechanically outputs a reply that fits any sad emotion, such as "Please don't be sad, everything will be fine!". Current techniques cannot generate emotional dialogue that fits the user's actual, current emotion, and they remain far from real conversational scenarios.

The prior art provides a multi-modal sentiment analysis method oriented towards uncertain missing modalities, which includes: obtaining multi-modal data with uncertain missing entries covering three modalities (text, vision and audio); and processing the three modalities with a trained multi-modal sentiment analysis network to generate and output the final sentiment classification. The network includes a modality translation module that extracts the single-modal features of the three modalities with a multi-head self-attention mechanism and uses a Transformer decoder to supervise the Transformer encoder so that the visual and audio features approximate the text features, thereby translating visual and audio features into text features. This prior art focuses on the single-modal features of the three modalities and does not consider the differences and consistency between modalities; the generated text is of low quality and does not fit real conversational scenarios.

Summary of the invention

To overcome the above defects of existing emotional dialogue generation techniques, namely the low quality of the generated emotional dialogue and its poor fit to real conversational scenarios, the present invention provides a virtual-reality-oriented multi-source fusion emotion support dialogue method that extracts emotional features from multi-source information, fully analyzes the user's emotion, generates high-quality emotional dialogue that fits real conversational scenarios, and improves the user's interactive experience.

To solve the above technical problems, the technical solution of the present invention is as follows:

The present invention provides a virtual-reality-oriented multi-source fusion emotion support dialogue method, comprising:

S1: collect the user's voice information and eye-movement information, and extract text information and audio information from the voice information;

S2: perform feature extraction on the text information, the audio information and the eye-movement information respectively, obtaining a text feature vector, a speech feature vector and an eye-movement feature vector;

S3: perform feature fusion on the text feature vector, the speech feature vector and the eye-movement feature vector to obtain an emotion representation vector;

S4: input the text feature vector and the emotion representation vector into a preset decoder to generate the dialogue reply text.

Preferably, before the feature extraction operation is performed on the text information, a preprocessing operation is carried out as follows:

For each piece of text information, a CLS tag is added at the beginning and a SEP tag at the end, giving the corresponding text sequence T = {t_cls, t_2, …, t_SEP, …, t_N}, where t_cls denotes the serialized CLS tag, t_SEP denotes the serialized SEP tag, t_N denotes the N-th word of the text sequence, and N is the maximum length over all text information; when a piece of text is shorter than the maximum length, the positions after t_SEP are padded with zeros.
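For illustration only, the sketch below shows how such a padded sequence could be assembled; the whitespace tokenizer, the toy vocabulary and the special-token IDs are assumptions and are not specified in the patent.

```python
# Minimal sketch of the preprocessing step described above, assuming a simple
# whitespace tokenizer and an integer vocabulary; the token names and the pad
# value are illustrative, not taken from the patent.
CLS, SEP, PAD = "[CLS]", "[SEP]", 0

def build_sequence(text: str, vocab: dict, max_len: int) -> list:
    """Wrap the text with CLS/SEP tags and zero-pad it to max_len."""
    tokens = [CLS] + text.split() + [SEP]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # positions after t_SEP are padded with zeros, as stated in the text
    return (ids + [PAD] * max_len)[:max_len]

vocab = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100, "今天": 7, "很": 8, "开心": 9}
print(build_sequence("今天 很 开心", vocab, max_len=10))
```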

Preferably, the feature extraction operation on the text information obtains the text feature vector as follows:

The text sequence T corresponding to the text information is input into an existing BERT-wwm model for feature extraction, yielding the text feature vector:

X_t = BERT_wwm(T) = {x_cls, x_2, …, x_SEP, …, x_N}

where X_t denotes the text feature vector, BERT_wwm denotes the BERT-wwm model, x_cls denotes the CLS tag in the text feature vector, x_SEP denotes the SEP tag in the text feature vector, and x_N denotes the N-th element of the text feature vector.
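The following hedged sketch shows how X_t could be extracted with a whole-word-masking BERT through the Hugging Face transformers library; the checkpoint name hfl/chinese-bert-wwm and the maximum length are assumptions, since the text only refers to "the existing BERT-wwm model".

```python
# Hedged sketch of extracting X_t with a whole-word-masking BERT; the
# checkpoint and max_len are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert_wwm = AutoModel.from_pretrained("hfl/chinese-bert-wwm")

def extract_text_features(text: str, max_len: int = 64) -> torch.Tensor:
    """Return X_t: one contextual vector per token, zero-padded to max_len."""
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert_wwm(**enc)
    return out.last_hidden_state.squeeze(0)   # shape: (max_len, hidden_size)

X_t = extract_text_features("今天很开心")
print(X_t.shape)
```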

Preferably, the feature extraction operation on the audio information obtains the speech feature vector as follows:

The audio information is input into an existing VGGish model to obtain Mel-frequency cepstral coefficient (MFCC) features;

The Mel-frequency cepstral coefficients are input into an existing bidirectional long short-term memory (BiLSTM) network to obtain the speech feature vector X_a.
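A minimal sketch of this audio branch is given below. The patent specifies a VGGish front end; torchaudio's MFCC transform is used here only as a stand-in for the cepstral-feature stage, and the sampling rate, number of coefficients and hidden size are assumptions.

```python
# Sketch of the audio branch, assuming a 16 kHz mono waveform. torchaudio's
# MFCC transform stands in for the VGGish front end named in the text.
import torch
import torch.nn as nn
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)

class AudioEncoder(nn.Module):
    def __init__(self, n_mfcc=40, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)

    def forward(self, waveform):                     # waveform: (1, samples)
        feats = mfcc(waveform)                       # (1, n_mfcc, frames)
        feats = feats.transpose(1, 2)                # (1, frames, n_mfcc)
        _, (h_n, _) = self.bilstm(feats)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # X_a: (1, 2*hidden)

X_a = AudioEncoder()(torch.randn(1, 16000))
print(X_a.shape)
```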

Preferably, the feature extraction operation on the eye-movement information obtains the eye-movement feature vector as follows:

The eye-movement information includes the pupil diameter, the fixation time, the saccade duration and the saccade amplitude;

Parameter features are extracted from the pupil diameter, including the pupil-diameter mean, the pupil-diameter standard deviation and the pupil-diameter differential entropy;

Parameter features are extracted from the fixation time, including the fixation-time mean and the fixation-time standard deviation;

Parameter features are extracted from the saccade duration, including the saccade-duration mean and the saccade-duration standard deviation;

Parameter features are extracted from the saccade amplitude, including the saccade-amplitude mean and the saccade-amplitude standard deviation;

The pupil-diameter mean, pupil-diameter standard deviation, pupil-diameter differential entropy, fixation-time mean, fixation-time standard deviation, saccade-duration mean, saccade-duration standard deviation, saccade-amplitude mean and saccade-amplitude standard deviation together form the eye-movement feature matrix;

The eye-movement feature matrix is input into an existing bidirectional long short-term memory network to obtain the eye-movement feature vector X_e.
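The sketch below computes the nine statistics listed above per time window, stacks them into the eye-movement feature matrix and feeds it to a BiLSTM; the Gaussian form of the differential entropy and the windowing scheme are assumptions not fixed by the text.

```python
# Sketch of the eye-movement branch: nine statistics per window form the
# feature matrix, which a BiLSTM turns into X_e. The Gaussian differential
# entropy and the window layout are assumptions.
import math
import numpy as np
import torch
import torch.nn as nn

def window_features(pupil, fixation, sacc_dur, sacc_amp):
    """Nine statistics for one window of raw eye-tracking signals."""
    de = 0.5 * math.log(2 * math.pi * math.e * (np.std(pupil) ** 2 + 1e-8))
    return [np.mean(pupil), np.std(pupil), de,
            np.mean(fixation), np.std(fixation),
            np.mean(sacc_dur), np.std(sacc_dur),
            np.mean(sacc_amp), np.std(sacc_amp)]

class EyeEncoder(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.bilstm = nn.LSTM(9, hidden, batch_first=True, bidirectional=True)

    def forward(self, feature_matrix):               # (1, windows, 9)
        _, (h_n, _) = self.bilstm(feature_matrix)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # X_e: (1, 2*hidden)

windows = [window_features(np.random.rand(50), np.random.rand(10),
                           np.random.rand(10), np.random.rand(10))
           for _ in range(20)]
X_e = EyeEncoder()(torch.tensor([windows], dtype=torch.float32))
print(X_e.shape)
```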

Preferably, a multi-subspace shared-private representation model is constructed to perform the feature fusion operation on the text feature vector, the speech feature vector and the eye-movement feature vector, obtaining the emotion representation vector;

The multi-subspace shared-private representation model includes a one-dimensional convolution layer, a private space, a main shared subspace, an auxiliary shared subspace, a concatenation layer and a self-attention layer;

The input of the one-dimensional convolution layer serves as the input of the multi-subspace shared-private representation model, and its output is connected to the inputs of the private space, the main shared subspace and the auxiliary shared subspace; the outputs of the private space, the main shared subspace and the auxiliary shared subspace are all connected to the input of the concatenation layer, the output of the concatenation layer is connected to the input of the self-attention layer, and the output of the self-attention layer serves as the output of the multi-subspace shared-private representation model.
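The following is a schematic PyTorch wiring of the fusion model described above; the hidden sizes, the number of attention heads and the choice of one private BiLSTM per modality are assumptions, and only the layer layout follows the description.

```python
# Schematic wiring of the multi-subspace shared-private fusion model.
# Hyperparameters are assumptions; only the layer layout follows the text.
import torch
import torch.nn as nn

class SharedPrivateFusion(nn.Module):
    def __init__(self, dims=(768, 128, 64), d=128, heads=4):
        super().__init__()
        # one 1-D convolution per modality maps X_t, X_a, X_e to dimension d
        self.convs = nn.ModuleList([nn.Conv1d(c, d, kernel_size=1) for c in dims])
        self.private = nn.ModuleList(              # modality-specific spaces
            [nn.LSTM(d, d // 2, batch_first=True, bidirectional=True) for _ in dims])
        self.main_shared = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.aux_shared = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, modalities):                 # list of (batch, seq_m, dim_m)
        reps = []
        for x, conv, priv in zip(modalities, self.convs, self.private):
            x = conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq_m, d)
            for space in (priv, self.main_shared, self.aux_shared):
                out, _ = space(x)
                reps.append(out[:, -1])            # last step of each representation
        H = torch.stack(reps, dim=1)               # (batch, 9, d) mixed representation
        H_att, _ = self.attn(H, H, H)              # fused with self-attention
        return H_att

fusion = SharedPrivateFusion()
H_att = fusion([torch.randn(1, 64, 768), torch.randn(1, 10, 128), torch.randn(1, 20, 64)])
print(H_att.shape)
```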

Preferably, step S3 specifically comprises:

S3.1: input the text feature vector, the speech feature vector and the eye-movement feature vector into the multi-subspace shared-private representation model and map them to the same dimension through the one-dimensional convolution layer, obtaining the processed text, speech and eye-movement feature vectors;

S3.2: input the processed text, speech and eye-movement feature vectors into the private space, obtaining the text-modality specific representation, the speech-modality specific representation and the eye-movement-modality specific representation;

S3.3: input the processed text, speech and eye-movement feature vectors into the main shared subspace, obtaining the text-modality main shared representation, the speech-modality main shared representation and the eye-movement-modality main shared representation;

S3.4: input the processed text, speech and eye-movement feature vectors into the auxiliary shared subspace, obtaining the text-modality auxiliary shared representation, the speech-modality auxiliary shared representation and the eye-movement-modality auxiliary shared representation;

S3.5: input the text-modality specific representation, the speech-modality specific representation, the eye-movement-modality specific representation, the text-modality main shared representation, the speech-modality main shared representation, the eye-movement-modality main shared representation, the text-modality auxiliary shared representation, the speech-modality auxiliary shared representation and the eye-movement-modality auxiliary shared representation into the concatenation layer, obtaining the modality-mixed representation matrix;

S3.6: input the modality-mixed representation matrix into the self-attention layer, obtaining the modality-mixed representation matrix fused with the self-attention mechanism, which is converted into the emotion representation vector and output.

The feature vectors of the three modalities (text, speech and eye movement) are mapped to the same dimension and then projected into the main shared subspace and the auxiliary shared subspace, yielding modality-shared representations that capture the consistency among the three modal feature vectors; they are also projected into the private space to learn modality-specific representations that capture the differences between modalities. The private space, the main shared subspace and the auxiliary shared subspace are all built on bidirectional long short-term memory networks. Finally, the shared and specific representations of all modalities are concatenated and self-attention is applied, giving a modality-mixed representation matrix fused with the self-attention mechanism, which is finally converted into the emotion representation vector. This extracts the complementary information between modalities to the largest extent and reduces the redundant information between them, making the emotion representation vector more accurate and better aligned with the user's current emotion. In addition, a total loss function composed of a task loss, a similarity loss, a difference loss and a reconstruction loss is set up to train and optimize the multi-subspace shared-private representation model; when the value of the total loss reaches its minimum, the optimized multi-subspace shared-private representation model is obtained, which improves the quality of the emotion representation vector.

Preferably, the preset decoder is a GPT-3 decoder.

The GPT-3 decoder consists of multiple Transformer blocks, each containing a multi-head self-attention mechanism and a fully connected feed-forward network. The text feature vector and the emotion representation vector are concatenated into a single vector, which is input into the GPT-3 decoder; the GPT-3 decoder then generates a dialogue reply text that conforms to the contextual semantics and carries emotional color.

Preferably, the method further includes:

S5: input the dialogue reply text into a virtual reality device, and interact with the user through an avatar and synthesized speech.

Existing virtual reality technology is used to generate a virtual dialogue avatar, turning the traditional pure-text interaction into a multi-source visual and auditory dialogue interaction, which fully mobilizes the user's emotions, strengthens the interactivity of the dialogue system and improves the user's dialogue experience. For example, 3D modelling software is used to design a 3D virtual character for dialogue with the user, and the character is embedded in a head-mounted VR device. When the user wears the device, HMD technology projects the avatar in front of the user's eyes, and the device's head-tracking function updates the viewpoint in real time, so that the user can look around the virtual environment through 360 degrees for an immersive visual experience. At the same time, the dialogue reply text, which conforms to the context and carries emotional color, is converted into synthesized speech by speech synthesis technology, and the virtual character interacts with the user; the user hears the synthesized speech output by the head-mounted VR device through speakers or headphones.

The present invention also provides a virtual-reality-oriented multi-source fusion emotion support dialogue system for implementing the above method, comprising:

a data acquisition module, used to collect the user's voice information and eye-movement information and to extract text information and audio information from the voice information;

a feature extraction module, used to perform feature extraction on the text information, the audio information and the eye-movement information respectively, obtaining a text feature vector, a speech feature vector and an eye-movement feature vector;

a feature fusion module, used to perform feature fusion on the text feature vector, the speech feature vector and the eye-movement feature vector to obtain an emotion representation vector;

a dialogue generation module, used to input the text feature vector and the emotion representation vector into a preset decoder to generate the dialogue reply text.

Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows:

The present invention first obtains the text, audio and eye-movement information of the user's real-time interaction and extracts features from the multi-source information of the three modalities, obtaining the corresponding text, speech and eye-movement feature vectors. The text, speech and eye-movement feature vectors are then fused to obtain an emotion representation vector that integrates the emotion carried by the multi-source information, which overcomes the uncertainty of extracting emotional features from the text modality alone and strengthens the emotionality of the subsequently generated dialogue reply text. Finally, the text feature vector and the emotion representation vector are input into the preset decoder, so that the generated dialogue reply text both conforms to the contextual semantics and carries an emotional color that fits the actual situation. The present invention can fully analyze the user's emotion, generate high-quality emotional dialogue that fits real conversational scenarios, and improve the user's interactive experience.

Description of the drawings

Figure 1 is a flow chart of the virtual-reality-oriented multi-source fusion emotion support dialogue method described in Embodiment 1.

Figure 2 is a schematic flow chart of the virtual-reality-oriented multi-source fusion emotion support dialogue method described in Embodiment 2.

Figure 3 is a schematic structural diagram of the multi-subspace shared-private representation model described in Embodiment 2.

Figure 4 is a schematic structural diagram of the virtual-reality-oriented multi-source fusion emotion support dialogue system described in Embodiment 3.

Detailed description of the embodiments

The drawings are for illustrative purposes only and should not be construed as limiting this patent;

To better illustrate the embodiments, some components in the drawings may be omitted, enlarged or reduced; they do not represent the size of the actual product;

Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the drawings and embodiments.

Embodiment 1

This embodiment provides a virtual-reality-oriented multi-source fusion emotion support dialogue method, as shown in Figure 1, comprising:

S1: collect the user's voice information and eye-movement information, and extract text information and audio information from the voice information;

S2: perform feature extraction on the text information, the audio information and the eye-movement information respectively, obtaining a text feature vector, a speech feature vector and an eye-movement feature vector;

S3: perform feature fusion on the text feature vector, the speech feature vector and the eye-movement feature vector to obtain an emotion representation vector;

S4: input the text feature vector and the emotion representation vector into a preset decoder to generate the dialogue reply text.

In the specific implementation, this embodiment first obtains the user's real-time text, audio and eye-movement information and extracts features from the multi-source information of the three modalities, obtaining the corresponding text, speech and eye-movement feature vectors. The text, speech and eye-movement feature vectors are then fused to obtain an emotion representation vector that integrates the emotion carried by the multi-source information, which overcomes the uncertainty of extracting emotional features from the text modality alone and strengthens the emotionality of the subsequently generated dialogue reply text. Finally, the text feature vector and the emotion representation vector are input into the preset decoder, so that the generated dialogue reply text both conforms to the contextual semantics and carries an emotional color that fits the actual situation. This embodiment can fully analyze the user's emotion, generate high-quality emotional dialogue that fits real conversational scenarios, and improve the user's interactive experience.

Embodiment 2

This embodiment provides a virtual-reality-oriented multi-source fusion emotion support dialogue method, as shown in Figure 2, comprising:

S1: collect the user's voice information and eye-movement information, and extract text information and audio information from the voice information;

S2: perform feature extraction on the text information, the audio information and the eye-movement information respectively, obtaining a text feature vector, a speech feature vector and an eye-movement feature vector;

For the text information, before the feature extraction operation is performed, a preprocessing operation is carried out as follows:

For each piece of text information, a CLS tag is added at the beginning and a SEP tag at the end, giving the corresponding text sequence T = {t_cls, t_2, …, t_SEP, …, t_N}, where t_cls denotes the serialized CLS tag, t_SEP denotes the serialized SEP tag, t_N denotes the N-th word of the text sequence, and N is the maximum length over all text information; when a piece of text is shorter than the maximum length, the positions after t_SEP are padded with zeros;

The text sequence T corresponding to the text information is input into the existing BERT-wwm model for feature extraction, yielding the text feature vector:

X_t = BERT_wwm(T) = {x_cls, x_2, …, x_SEP, …, x_N}

where X_t denotes the text feature vector, BERT_wwm denotes the BERT-wwm model, x_cls denotes the CLS tag in the text feature vector, x_SEP denotes the SEP tag in the text feature vector, and x_N denotes the N-th element of the text feature vector;

For the audio information, the feature extraction operation obtains the speech feature vector as follows:

The audio information is input into the existing VGGish model to obtain Mel-frequency cepstral coefficient features;

The Mel-frequency cepstral coefficients are input into the existing bidirectional long short-term memory network to obtain the speech feature vector X_a;

For the eye-movement information, the feature extraction operation obtains the eye-movement feature vector as follows:

The eye-movement information includes the pupil diameter, the fixation time, the saccade duration and the saccade amplitude;

Parameter features are extracted from the pupil diameter, including the pupil-diameter mean, the pupil-diameter standard deviation and the pupil-diameter differential entropy;

Parameter features are extracted from the fixation time, including the fixation-time mean and the fixation-time standard deviation;

Parameter features are extracted from the saccade duration, including the saccade-duration mean and the saccade-duration standard deviation;

Parameter features are extracted from the saccade amplitude, including the saccade-amplitude mean and the saccade-amplitude standard deviation;

The pupil-diameter mean, pupil-diameter standard deviation, pupil-diameter differential entropy, fixation-time mean, fixation-time standard deviation, saccade-duration mean, saccade-duration standard deviation, saccade-amplitude mean and saccade-amplitude standard deviation together form the eye-movement feature matrix;

The eye-movement feature matrix is input into the existing bidirectional long short-term memory network to obtain the eye-movement feature vector X_e;

S3: construct a multi-subspace shared-private representation model, and perform the feature fusion operation on the text feature vector, the speech feature vector and the eye-movement feature vector to obtain the emotion representation vector;

As shown in Figure 3, the multi-subspace shared-private representation model includes a one-dimensional convolution layer, a private space, a main shared subspace, an auxiliary shared subspace, a concatenation layer and a self-attention layer;

The input of the one-dimensional convolution layer serves as the input of the multi-subspace shared-private representation model, and its output is connected to the inputs of the private space, the main shared subspace and the auxiliary shared subspace; the outputs of the private space, the main shared subspace and the auxiliary shared subspace are all connected to the input of the concatenation layer, the output of the concatenation layer is connected to the input of the self-attention layer, and the output of the self-attention layer serves as the output of the multi-subspace shared-private representation model;

The emotion representation vector is obtained as follows:

S3.1: input the text feature vector, the speech feature vector and the eye-movement feature vector into the multi-subspace shared-private representation model and map them to the same dimension through the one-dimensional convolution layer, obtaining the processed text, speech and eye-movement feature vectors; specifically:

X_T = Conv1D(X_t, k_t)

X_A = Conv1D(X_a, k_a)

X_E = Conv1D(X_e, k_e)

where X_T denotes the processed text feature vector, X_A denotes the processed speech feature vector, X_E denotes the processed eye-movement feature vector, Conv1D(*) denotes the one-dimensional convolution operation, and k_t, k_a and k_e denote the convolution kernels corresponding to the text, speech and eye-movement feature vectors respectively;
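As a small hedged illustration of S3.1, three one-dimensional convolutions (standing for the kernels k_t, k_a and k_e) map the modal feature vectors to a common dimension; the input dimensions, the common dimension d and the kernel size are assumptions.

```python
# Hedged sketch of S3.1: three 1-D convolutions map X_t, X_a, X_e to a common
# dimension d. The dimensions and kernel size are illustrative assumptions.
import torch
import torch.nn as nn

d = 128
conv_t = nn.Conv1d(768, d, kernel_size=1)   # k_t
conv_a = nn.Conv1d(128, d, kernel_size=1)   # k_a
conv_e = nn.Conv1d(64,  d, kernel_size=1)   # k_e

X_t, X_a, X_e = torch.randn(1, 768, 64), torch.randn(1, 128, 10), torch.randn(1, 64, 20)
X_T, X_A, X_E = conv_t(X_t), conv_a(X_a), conv_e(X_e)   # all now have channel dim d
print(X_T.shape, X_A.shape, X_E.shape)
```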

S3.2: input the processed text, speech and eye-movement feature vectors into the private space, obtaining the text-modality specific representation, the speech-modality specific representation and the eye-movement-modality specific representation, specifically:

h_t^p = BiLSTM_1(X_T, θ_t^p), h_a^p = BiLSTM_1(X_A, θ_a^p), h_e^p = BiLSTM_1(X_E, θ_e^p)

where h_t^p is the text-modality specific representation, h_a^p is the speech-modality specific representation, h_e^p is the eye-movement-modality specific representation, BiLSTM_1(*) denotes the private space, and θ_t^p, θ_a^p and θ_e^p denote the network parameters of the private space corresponding to the processed text, speech and eye-movement feature vectors respectively;

S3.3: input the processed text, speech and eye-movement feature vectors into the main shared subspace, obtaining the text-modality main shared representation, the speech-modality main shared representation and the eye-movement-modality main shared representation, specifically:

h_t^c = BiLSTM_2(X_T, θ_f), h_a^c = BiLSTM_2(X_A, θ_f), h_e^c = BiLSTM_2(X_E, θ_f)

where h_t^c is the text-modality main shared representation, h_a^c is the speech-modality main shared representation, h_e^c is the eye-movement-modality main shared representation, BiLSTM_2(*) denotes the main shared subspace, and θ_f denotes the main shared parameters;

S3.4: input the processed text, speech and eye-movement feature vectors into the auxiliary shared subspace, obtaining the text-modality auxiliary shared representation, the speech-modality auxiliary shared representation and the eye-movement-modality auxiliary shared representation, specifically:

h_t^s = BiLSTM_3(X_T, θ_s), h_a^s = BiLSTM_3(X_A, θ_s), h_e^s = BiLSTM_3(X_E, θ_s)

where h_t^s is the text-modality auxiliary shared representation, h_a^s is the speech-modality auxiliary shared representation, h_e^s is the eye-movement-modality auxiliary shared representation, BiLSTM_3(*) denotes the auxiliary shared subspace, and θ_s denotes the auxiliary shared parameters;

S3.5: input the text-modality specific representation, the speech-modality specific representation, the eye-movement-modality specific representation, the text-modality main shared representation, the speech-modality main shared representation, the eye-movement-modality main shared representation, the text-modality auxiliary shared representation, the speech-modality auxiliary shared representation and the eye-movement-modality auxiliary shared representation into the concatenation layer, obtaining the modality-mixed representation matrix H = [h_t^p; h_a^p; h_e^p; h_t^c; h_a^c; h_e^c; h_t^s; h_a^s; h_e^s], where H is the modality-mixed representation matrix;

S3.6: input the modality-mixed representation matrix into the self-attention layer, obtaining the modality-mixed representation matrix fused with the self-attention mechanism, which is converted into the emotion representation vector and output;

Multi-head self-attention is performed on the modality-mixed representation matrix H to obtain the modality-mixed representation matrix fused with the self-attention mechanism:

H̃ = MultiHead(H, θ_att) = {h_1, h_2, …}

where H̃ denotes the modality-mixed representation matrix fused with the self-attention mechanism, MultiHead(*) denotes the multi-head self-attention operation, θ_att denotes the attention parameters, and h_i denotes the i-th element of the modality-mixed representation matrix fused with the self-attention mechanism;

The modality-mixed representation matrix fused with the self-attention mechanism is converted into the emotion representation vector:

o_i = tanh(W·h_i + b)

where o_i denotes the result of applying the tanh activation to the i-th element of H̃, W denotes the weight matrix and b denotes the bias vector; β_i denotes the attention weight of the i-th element of H̃, obtained by softmax-normalizing the scores o_i, and represents the importance of the i-th element in the emotion representation vector E, which is computed as the weighted sum E = Σ_i β_i·h_i.
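A hedged sketch of this pooling step follows; the scalar scoring vector u and the softmax normalization of the scores are assumptions introduced for illustration, since the text only gives o_i = tanh(W·h_i + b) and states that E is the β-weighted combination of the elements.

```python
# Hedged sketch of converting the fused matrix H_att into the emotion
# representation vector E. The scoring vector u and the softmax are assumptions.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d)                  # W h_i + b
        self.u = nn.Linear(d, 1, bias=False)      # assumed scoring vector

    def forward(self, H_att):                     # H_att: (batch, 9, d)
        o = torch.tanh(self.W(H_att))             # o_i = tanh(W h_i + b)
        beta = torch.softmax(self.u(o), dim=1)    # attention weights β_i
        return (beta * H_att).sum(dim=1)          # E = Σ_i β_i h_i

E = AttentionPooling(d=128)(torch.randn(1, 9, 128))
print(E.shape)   # (1, 128)
```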

The feature vectors of the three modalities (text, speech and eye movement) are mapped to the same dimension and then projected into the main shared subspace and the auxiliary shared subspace, yielding modality-shared representations that capture the consistency among the three modal feature vectors; they are also projected into the private space to learn modality-specific representations that capture the differences between modalities. The private space, the main shared subspace and the auxiliary shared subspace are all built on bidirectional long short-term memory networks. Finally, the shared and specific representations of all modalities are concatenated and self-attention is applied, giving a modality-mixed representation matrix fused with the self-attention mechanism, which is finally converted into the emotion representation vector. This extracts the complementary information between modalities to the largest extent and reduces the redundant information between them, making the emotion representation vector more accurate and better aligned with the user's current emotion.

In addition, a total loss function composed of a task loss, a similarity loss, a difference loss and a reconstruction loss is set up to train and optimize the multi-subspace shared-private representation model; when the value of the total loss reaches its minimum, the optimized multi-subspace shared-private representation model is obtained, which improves the quality of the emotion representation vector;

The total loss function is:

ζ = ζ_task + α·ζ_sim + β·ζ_diff + γ·ζ_recon

where ζ denotes the total loss function, ζ_task the task loss, ζ_sim the similarity loss, ζ_diff the difference loss and ζ_recon the reconstruction loss; α, β and γ denote the first, second and third weighting hyperparameters respectively;

For the task loss, a fully connected layer is used to make the task prediction, and the task loss is computed from the predicted distribution ŷ_i and the true distribution y_i; the task loss is used to estimate the quality of the emotion prediction results during training;

When the task is a classification task, the standard cross-entropy loss is used as the task loss:

ζ_task = −(1/N) Σ_{i=1}^{N} y_i·log(ŷ_i)

where y_i is the true distribution for the classification task on the dataset, ŷ_i is the predicted distribution obtained by performing task prediction with the emotion representation vector, and N is the number of samples in the dataset;

When the task is a regression task, the mean squared error loss is used as the task loss:

ζ_task = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²

where y_i is the ground-truth value for the regression task on the dataset, ŷ_i is the prediction obtained by performing task prediction with the emotion representation vector, and N is the number of samples in the dataset;

The similarity loss ζ_sim is computed with the central moment discrepancy between the shared representations of the different modalities, where CMD_K(*) denotes the central moment discrepancy;

The difference loss ζ_diff is expressed in terms of the squared Frobenius norm, where ‖*‖_F denotes the squared Frobenius norm, and drives the modality-specific representations to capture information different from the shared representations;

For the reconstruction loss, a decoder function is used to reconstruct the modal feature vectors X_m, and the mean squared error between X_m and the reconstructed X̂_m is used as the reconstruction loss:

ζ_recon = (1/3) Σ_{m∈{t,a,e}} ‖X_m − X̂_m‖₂² / d_h

where X̂_m denotes the reconstructed modal feature vector obtained by applying the decoder function to X_m, X_m denotes the modal feature vector before reconstruction, d_h denotes the encoding dimension, and ‖*‖₂ denotes the L2 norm;
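The sketch below assembles the four losses into the total loss ζ; the CMD order K, the exact pairing of representations in the similarity and difference terms and the loss weights are assumptions, and only the overall form ζ = ζ_task + α·ζ_sim + β·ζ_diff + γ·ζ_recon follows the text.

```python
# Hedged sketch of the four training losses; pairings, CMD order and weights
# are assumptions, only the overall combination follows the text.
import torch
import torch.nn.functional as F

def cmd(x, y, K=3):
    """Central moment discrepancy between two batches of representations."""
    mx, my = x.mean(0), y.mean(0)
    loss = (mx - my).norm()
    cx, cy = x - mx, y - my
    for k in range(2, K + 1):
        loss = loss + ((cx ** k).mean(0) - (cy ** k).mean(0)).norm()
    return loss

def diff_loss(shared, private):
    """Squared Frobenius norm of the overlap between shared and private parts."""
    return (shared.t() @ private).pow(2).sum()

def total_loss(y, y_hat, shared, private, x, x_rec, alpha=0.3, beta=0.3, gamma=0.3):
    task = F.cross_entropy(y_hat, y)              # classification task loss
    sim = cmd(shared[0], shared[1])               # similarity between shared reps
    diff = diff_loss(shared[0], private[0])       # shared vs. private difference
    recon = F.mse_loss(x_rec, x) / x.size(-1)     # MSE scaled by the encoding dim
    return task + alpha * sim + beta * diff + gamma * recon
```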

When the loss value of the total loss function reaches its minimum, the optimized multi-subspace shared-private representation model is obtained, which improves the quality of the emotion representation vector.

S4: input the text feature vector and the emotion representation vector into the preset GPT-3 decoder to generate the dialogue reply text;

The text feature vector X_t and the emotion representation vector E are concatenated into the vector z = [X_t, E]; the vector z is input into the GPT-3 decoder, which generates a dialogue reply text that conforms to the contextual semantics and carries emotional color;

The GPT-3 decoder consists of multiple Transformer blocks, each containing a multi-head self-attention mechanism and a fully connected feed-forward network; the GPT-3 decoder is used to generate a dialogue reply text that conforms to the contextual semantics and carries emotional color.
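Since GPT-3 weights are not openly distributed, the hedged sketch below conditions an open GPT-2 checkpoint from the transformers library on the concatenated vector z = [X_t, E] through a learned prefix embedding; the projection layer, the checkpoint name and the greedy decoding loop are assumptions.

```python
# Hedged sketch of S4: z = [X_t, E] is mapped into the decoder's embedding
# space and used as a prefix for autoregressive generation; GPT-2 stands in
# for GPT-3, and the projection and greedy decoding are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
decoder = GPT2LMHeadModel.from_pretrained("gpt2")
proj = nn.Linear(768 + 128, decoder.config.n_embd)   # maps z to one prefix embedding

def generate_reply(X_t_pooled, E, max_new_tokens=20):
    z = torch.cat([X_t_pooled, E], dim=-1)            # z = [X_t, E]
    embeds = proj(z).unsqueeze(1)                     # (1, 1, n_embd) prefix
    generated = []
    for _ in range(max_new_tokens):
        logits = decoder(inputs_embeds=embeds).logits[:, -1]
        next_id = logits.argmax(dim=-1)               # greedy decoding
        generated.append(next_id.item())
        next_emb = decoder.transformer.wte(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_emb], dim=1)
    return tokenizer.decode(generated)

print(generate_reply(torch.randn(1, 768), torch.randn(1, 128)))
```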

S5: input the dialogue reply text into a virtual reality device, and interact with the user through an avatar and synthesized speech.

Existing virtual reality technology is used to generate a virtual dialogue avatar, turning the traditional pure-text interaction into a multi-source visual and auditory dialogue interaction, which fully mobilizes the user's emotions, strengthens the interactivity of the dialogue system and improves the user's dialogue experience. For example, 3D modelling software is used to design a 3D virtual character for dialogue with the user, and the character is embedded in a head-mounted VR device. When the user wears the device, HMD technology projects the avatar in front of the user's eyes, and the device's head-tracking function updates the viewpoint in real time, so that the user can look around the virtual environment through 360 degrees for an immersive visual experience. At the same time, the dialogue reply text, which conforms to the context and carries emotional color, is converted into synthesized speech by speech synthesis technology, and the virtual character interacts with the user; the user hears the synthesized speech output by the head-mounted VR device through speakers or headphones.

Embodiment 3

This embodiment provides a virtual-reality-oriented multi-source fusion emotion support dialogue system for implementing the method described in Embodiment 1 or 2, as shown in Figure 4, comprising:

a data acquisition module, used to collect the user's voice information and eye-movement information and to extract text information and audio information from the voice information;

a feature extraction module, used to perform feature extraction on the text information, the audio information and the eye-movement information respectively, obtaining a text feature vector, a speech feature vector and an eye-movement feature vector;

a feature fusion module, used to perform feature fusion on the text feature vector, the speech feature vector and the eye-movement feature vector to obtain an emotion representation vector;

a dialogue generation module, used to input the text feature vector and the emotion representation vector into a preset decoder to generate the dialogue reply text.
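Purely as an illustration of the module decomposition, the schematic class below wires the feature extraction, feature fusion and dialogue generation modules together; the injected encoder callables are assumptions that reuse the hedged sketches above, and data acquisition (speech capture, speech-to-text, eye tracking) is assumed to happen before reply() is called.

```python
# Illustrative end-to-end wiring of the system's modules; the callables are
# assumptions, only the module decomposition follows the description.
class EmotionSupportDialogueSystem:
    def __init__(self, text_encoder, audio_encoder, eye_encoder, fusion, pooling, decoder_fn):
        self.text_encoder, self.audio_encoder, self.eye_encoder = text_encoder, audio_encoder, eye_encoder
        self.fusion, self.pooling, self.decoder_fn = fusion, pooling, decoder_fn

    def reply(self, text, waveform, eye_matrix):
        X_t = self.text_encoder(text)          # feature extraction module
        X_a = self.audio_encoder(waveform)
        X_e = self.eye_encoder(eye_matrix)
        H = self.fusion([X_t, X_a, X_e])       # feature fusion module
        E = self.pooling(H)                    # emotion representation vector
        return self.decoder_fn(X_t, E)         # dialogue generation module
```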

The same or similar reference numerals correspond to the same or similar components;

The terms describing positional relationships in the drawings are for illustrative purposes only and should not be construed as limiting this patent;

Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the present invention and are not intended to limit its implementation. Those of ordinary skill in the art can make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1.一种面向虚拟现实的多源融合情感支持对话方法,其特征在于,包括:1. A multi-source fusion emotional support dialogue method for virtual reality, which is characterized by including:S1:采集用户的语音信息和眼动信息,并从所述语音信息中提取文本信息和音频信息;S1: Collect the user's voice information and eye movement information, and extract text information and audio information from the voice information;S2:分别对所述文本信息、音频信息和眼动信息进行特征提取操作,对应获得文本特征向量、语音特征向量和眼动特征向量;S2: Perform feature extraction operations on the text information, audio information and eye movement information respectively, and obtain text feature vectors, speech feature vectors and eye movement feature vectors accordingly;S3:对所述文本特征向量、语音特征向量和眼动特征向量进行特征融合操作,获得情绪表征向量;S3: Perform a feature fusion operation on the text feature vector, speech feature vector and eye movement feature vector to obtain an emotion representation vector;S4:将所述文本特征向量和情绪特征向量输入预设的解码器中,生成对话回复文本。S4: Input the text feature vector and emotion feature vector into a preset decoder to generate dialogue reply text.2.根据权利要求1所述的面向虚拟现实的多源融合情感支持对话方法,其特征在于,对所述文本信息进行特征提取操作前,还需进行预处理操作,具体方法为:2. The multi-source fusion emotional support dialogue method for virtual reality according to claim 1, characterized in that, before performing feature extraction operations on the text information, a pre-processing operation needs to be performed. The specific method is:对于每段所述文本信息,在开头添加CLS标签,在结尾添加SEP标签,获得对应的文本序列T={tcls,t2,…,tSEP,…,tN};其中,tcls表示CLS标签序列化后的单词,tSEP表示SEP标签序列化后的单词,tN表示文本序列中第N个单词,N表示所有文本信息中的最大长度;当文本信息的长度小于最大长度时,tSEP后的单词用零进行填充。For each paragraph of text information, add a CLS tag at the beginning and a SEP tag at the end to obtain the corresponding text sequence T={tcls ,t2 ,...,tSEP ,...,tN }; where tcls represents CLS tag serialized words, tSEP represents the SEP tag serialized word, tN represents the Nth word in the text sequence, N represents the maximum length of all text information; when the length of the text information is less than the maximum length, Words after tSEP are padded with zeros.3.根据权利要求2所述的面向虚拟现实的多源融合情感支持对话方法,其特征在于,对所述文本信息进行特征提取操作,获得文本特征向量的具体方法为:3. The multi-source fusion emotion support dialogue method for virtual reality according to claim 2, characterized in that the text information is subjected to a feature extraction operation, and the specific method for obtaining the text feature vector is:将所述文本信息对应的文本序列T输入现有的BERT-wwm模型进行特征提取操作,获得文本特征向量:Input the text sequence T corresponding to the text information into the existing BERT-wwm model to perform feature extraction operations to obtain the text feature vector:Xt=BERTwwm(T)={xcls,x2,…,xSEP,…,xN}Xt = BERTwwm (T) = {xcls ,x2 ,…,xSEP ,…,xN }式中,Xt表示文本特征向量,BERTwwm表示BERT-wwm模型,xcls表示文本特征向量中的CLS标签,xSEP表示文本特征向量中的SEP标签的,xN表示文本特征向量中第N个的元素。Intheformula,_ elements.4.根据权利要求1所述的面向虚拟现实的多源融合情感支持对话方法,其特征在于,对所述音频信息进行特征提取操作,获得语音特征向量的具体方法为:4. The multi-source fusion emotion support dialogue method for virtual reality according to claim 1, characterized in that the specific method of performing a feature extraction operation on the audio information and obtaining the speech feature vector is:将所述音频信息输入现有的VGGish模型中,获得梅尔频率倒谱系数特征;Input the audio information into the existing VGGish model to obtain Mel frequency cepstrum coefficient features;将所述梅尔频率倒谱系数输入现有的双向长短期记忆网络中,获得语音特征向量XaThe Mel frequency cepstral coefficient is input into the existing two-way long short-term memory network to obtain the speech feature vector Xa .5.根据权利要求1所述的面向虚拟现实的多源融合情感支持对话方法,其特征在于,对所述眼动信息进行特征提取操作,获得眼动特征向量的具体方法为:5. 
The multi-source fusion emotion support dialogue method for virtual reality according to claim 1, characterized in that the eye movement information is subjected to a feature extraction operation, and the specific method for obtaining the eye movement feature vector is:所述眼动信息包括瞳孔直径、注视时间、扫视持续时间和扫视振幅;The eye movement information includes pupil diameter, fixation time, saccade duration and saccade amplitude;对所述瞳孔直径提取参数特征,包括瞳孔直径均值、瞳孔直径标准差和瞳孔直径微分熵;Extract parameter features for the pupil diameter, including pupil diameter mean, pupil diameter standard deviation and pupil diameter differential entropy;对所述注视时间提取参数特征,包括注视时间均值和注视时间标准差;Extract parameter features from the gaze time, including gaze time mean and gaze time standard deviation;对所述扫视持续时间提取参数特征,包括扫视持续时间均值和扫视持续时间标准差;Extract parameter features for the saccade duration, including saccade duration mean and saccade duration standard deviation;对所述扫视振幅提取参数特征,包括扫视振幅均值和扫视振幅标准差;Extract parameter features from the saccade amplitude, including saccade amplitude mean and saccade amplitude standard deviation;利用所述瞳孔直径均值、瞳孔直径标准差、瞳孔直径微分熵、注视时间均值、注视时间标准差、扫视持续时间均值、扫视持续时间标准差、扫视振幅均值和扫视振幅标准差组成眼动特征矩阵;The eye movement feature matrix is composed of the mean pupil diameter, standard deviation of pupil diameter, differential entropy of pupil diameter, mean fixation time, standard deviation of fixation time, mean saccade duration, standard deviation of saccade duration, mean saccade amplitude and standard deviation of saccade amplitude. ;将所述眼动特征矩阵输入现有的双向长短期记忆网络中,获得眼动特征向量XeThe eye movement feature matrix is input into the existing two-way long short-term memory network to obtain the eye movement feature vector Xe .6.根据权利要求1任一项所述的面向虚拟现实的多源融合情感支持对话方法,其特征在于,构建多子空间共享私有表示模型对所述文本特征向量、语音特征向量和眼动特征向量进行特征融合操作,获得情绪表征向量;6. The multi-source fusion emotional support dialogue method for virtual reality according to any one of claims 1, characterized in that, a multi-subspace shared private representation model is constructed to compare the text feature vector, speech feature vector and eye movement feature. Perform feature fusion operations on vectors to obtain emotional representation vectors;所述多子空间共享私有表示模型包括一维卷积层、私有空间、主共享子空间、辅助共享子空间、拼接层和自注意力机制层;The multi-subspace shared private representation model includes a one-dimensional convolution layer, a private space, a main shared subspace, an auxiliary shared subspace, a splicing layer and a self-attention mechanism layer;所述一维卷积层的输入端作为多子空间共享私有表示模型的输入端,一维卷积层的输出端分别与私有空间、主共享子空间、辅助共享子空间的输入端连接;私有空间、主共享子空间、辅助共享子空间的输出端均与拼接层的输入端连接,拼接层的输出端与自注意力机制层的输入端连接,自注意力机制层的输出端作为多子空间共享私有表示模型的输出端。The input end of the one-dimensional convolution layer serves as the input end of the multi-subspace shared private representation model, and the output end of the one-dimensional convolution layer is connected to the input end of the private space, the main shared subspace, and the auxiliary shared subspace respectively; private The output terminals of the space, main shared subspace, and auxiliary shared subspace are all connected to the input terminal of the splicing layer. The output terminal of the splicing layer is connected to the input terminal of the self-attention mechanism layer. The output terminal of the self-attention mechanism layer serves as a multi-subspace A spatially shared private representation of the output of the model.7.根据权利要求3-6任一项所述的面向虚拟现实的多源融合情感支持对话方法,其特征在于,所述步骤S3的具体方法为:7. 
6. The multi-source fusion emotion support dialogue method for virtual reality according to claim 1, characterized in that a multi-subspace shared-private representation model is constructed to perform the feature fusion on the text feature vector, the speech feature vector and the eye movement feature vector to obtain the emotion representation vector;
the multi-subspace shared-private representation model includes a one-dimensional convolution layer, a private space, a main shared subspace, an auxiliary shared subspace, a concatenation layer and a self-attention layer;
the input of the one-dimensional convolution layer serves as the input of the multi-subspace shared-private representation model, and the output of the one-dimensional convolution layer is connected to the inputs of the private space, the main shared subspace and the auxiliary shared subspace respectively; the outputs of the private space, the main shared subspace and the auxiliary shared subspace are all connected to the input of the concatenation layer; the output of the concatenation layer is connected to the input of the self-attention layer; and the output of the self-attention layer serves as the output of the multi-subspace shared-private representation model.

7. The multi-source fusion emotion support dialogue method for virtual reality according to any one of claims 3 to 6, characterized in that step S3 is specifically:
S3.1: the text feature vector, the speech feature vector and the eye movement feature vector are input into the multi-subspace shared-private representation model and mapped to the same dimension by the one-dimensional convolution layer, to obtain a processed text feature vector, a processed speech feature vector and a processed eye movement feature vector;
S3.2: the processed text, speech and eye movement feature vectors are each input into the private space, to correspondingly obtain a text-modality specific representation, a speech-modality specific representation and an eye-movement-modality specific representation;
S3.3: the processed text, speech and eye movement feature vectors are each input into the main shared subspace, to correspondingly obtain a text-modality main shared representation, a speech-modality main shared representation and an eye-movement-modality main shared representation;
S3.4: the processed text, speech and eye movement feature vectors are each input into the auxiliary shared subspace, to correspondingly obtain a text-modality auxiliary shared representation, a speech-modality auxiliary shared representation and an eye-movement-modality auxiliary shared representation;
S3.5: the text-modality specific representation, speech-modality specific representation, eye-movement-modality specific representation, text-modality main shared representation, speech-modality main shared representation, eye-movement-modality main shared representation, text-modality auxiliary shared representation, speech-modality auxiliary shared representation and eye-movement-modality auxiliary shared representation are input into the concatenation layer to obtain a modality-mixed representation matrix;
S3.6: the modality-mixed representation matrix is input into the self-attention layer to obtain a modality-mixed representation matrix fused with self-attention, which is converted into the emotion representation vector and output.
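Illustrative note on claims 6 and 7 above: a minimal sketch of the multi-subspace shared-private fusion. Each modality is projected to a common dimension by a one-dimensional convolution, encoded by a private space and by main and auxiliary shared subspaces (modelled here simply as linear layers, an assumption), and the nine resulting representations are concatenated and mixed by self-attention to yield the emotion representation vector. All layer types and sizes beyond what the claims state are assumptions.

import torch
import torch.nn as nn

class SharedPrivateFusion(nn.Module):
    def __init__(self, dims, d=128):
        super().__init__()
        # One-dimensional convolution maps each modality to the same dimension d (step S3.1).
        self.proj = nn.ModuleDict({m: nn.Conv1d(dim, d, kernel_size=1) for m, dim in dims.items()})
        self.private = nn.ModuleDict({m: nn.Linear(d, d) for m in dims})  # private space (S3.2)
        self.shared_main = nn.Linear(d, d)   # main shared subspace (S3.3)
        self.shared_aux = nn.Linear(d, d)    # auxiliary shared subspace (S3.4)
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

    def forward(self, feats):  # feats[m]: (batch, seq_m, dim_m)
        reps = []
        for m, x in feats.items():
            h = self.proj[m](x.transpose(1, 2)).transpose(1, 2).mean(dim=1)  # pooled (batch, d)
            reps += [self.private[m](h), self.shared_main(h), self.shared_aux(h)]
        mix = torch.stack(reps, dim=1)      # modality-mixed representation matrix (batch, 9, d) (S3.5)
        out, _ = self.attn(mix, mix, mix)   # self-attention over the nine representations (S3.6)
        return out.mean(dim=1)              # emotion representation vector

fusion = SharedPrivateFusion({"text": 768, "speech": 128, "eye": 128})
emotion_vec = fusion({"text": torch.randn(1, 64, 768),
                      "speech": torch.randn(1, 100, 128),
                      "eye": torch.randn(1, 20, 128)})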
8. The multi-source fusion emotion support dialogue method for virtual reality according to claim 1, characterized in that the preset decoder is a GPT-3 decoder.

9. The multi-source fusion emotion support dialogue method for virtual reality according to claim 1, characterized in that the method further comprises:
S5: inputting the dialogue reply text into a virtual reality device, and interacting with the user through a virtual avatar and synthesized speech.

10. A multi-source fusion emotion support dialogue system for virtual reality, characterized in that it is used to implement the method according to any one of claims 1 to 9, and comprises:
a data acquisition module, configured to collect voice information and eye movement information of the user, and to extract text information and audio information from the voice information;
a feature extraction module, configured to perform feature extraction on the text information, the audio information and the eye movement information respectively, to correspondingly obtain a text feature vector, a speech feature vector and an eye movement feature vector;
a feature fusion module, configured to perform feature fusion on the text feature vector, the speech feature vector and the eye movement feature vector to obtain an emotion representation vector;
a dialogue generation module, configured to input the text feature vector and the emotion representation vector into the preset decoder to generate the dialogue reply text.
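Illustrative note on step S4 and claims 8 to 10 above: one plausible way for the dialogue generation module to condition a decoder on both the text features and the emotion representation vector is to project the emotion vector into the decoder's embedding space and prepend it as a soft prefix. The sketch below uses a locally runnable GPT-2 model as a stand-in for the GPT-3 decoder named in claim 8; the projection size, greedy decoding loop and example inputs are assumptions, not the patent's specified mechanism.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
proj = torch.nn.Linear(128, lm.config.n_embd)  # map the emotion representation vector into embedding space

def generate_reply(user_text, emotion_vec, max_new=30):
    ids = tok(user_text, return_tensors="pt").input_ids
    tok_emb = lm.transformer.wte(ids)          # embeddings of the user's turn (text features)
    prefix = proj(emotion_vec).unsqueeze(1)    # emotion vector as a one-step soft prefix
    embeds = torch.cat([prefix, tok_emb], dim=1)
    out_ids = []
    with torch.no_grad():
        for _ in range(max_new):               # simple greedy decoding
            logits = lm(inputs_embeds=embeds).logits[:, -1]
            nxt = logits.argmax(dim=-1, keepdim=True)
            out_ids.append(nxt)
            embeds = torch.cat([embeds, lm.transformer.wte(nxt)], dim=1)
    return tok.decode(torch.cat(out_ids, dim=1)[0])

reply = generate_reply("I have been feeling anxious and can't sleep.", torch.randn(1, 128))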
Application CN202311561956.4A (Virtual reality-oriented multisource fusion emotion support dialogue method), priority date 2023-11-21, filing date 2023-11-21, status Pending, published as CN117370534A (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311561956.4A | 2023-11-21 | 2023-11-21 | Virtual reality-oriented multisource fusion emotion support dialogue method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311561956.4A | 2023-11-21 | 2023-11-21 | Virtual reality-oriented multisource fusion emotion support dialogue method

Publications (1)

Publication Number | Publication Date
CN117370534A | 2024-01-09

Family

Family ID: 89406045

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311561956.4A (Pending; published as CN117370534A (en)) | Virtual reality-oriented multisource fusion emotion support dialogue method | 2023-11-21 | 2023-11-21

Country Status (1)

Country | Link
CN (1) | CN117370534A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117808011A (en) * | 2024-03-01 | 2024-04-02 | 青岛网信信息科技有限公司 | Chat robot method, medium and system with simulated emotion
CN117808011B (en) * | 2024-03-01 | 2024-06-04 | 青岛网信信息科技有限公司 | Chat robot method, medium and system with simulated emotion
CN118519524A (en) * | 2024-05-15 | 2024-08-20 | 北京师范大学 | Virtual digital person system and method for learning disorder group

Similar Documents

Publication | Title
Eskimez et al. | Speech driven talking face generation from a single image and an emotion condition
Bhattacharya et al. | Speech2affectivegestures: Synthesizing co-speech gestures with generative adversarial affective expression learning
WO2022106654A2 (en) | Methods and systems for video translation
CN113033450B (en) | Multi-mode continuous emotion recognition method, service inference method and system
CN117370534A (en) | Virtual reality-oriented multisource fusion emotion support dialogue method
KR102437039B1 (en) | Learning device and method for generating image
CN115631267A (en) | Method and device for generating animation
WO2023226239A1 (en) | Object emotion analysis method and apparatus and electronic device
CN118519524A (en) | Virtual digital person system and method for learning disorder group
CN117809616A (en) | Server, display equipment and voice interaction method
CN115409923A (en) | Method, device and system for generating three-dimensional virtual image facial animation
WO2025066217A1 (en) | Server, display device, and digital human processing method
CN116522142A (en) | Method for training feature extraction model, feature extraction method and device
CN118897887A (en) | An efficient digital human interaction system integrating multimodal information
CN117809681A (en) | Server, display equipment and digital human interaction method
Ma et al. | A review of human emotion synthesis based on generative technology
Chen et al. | Speaker-independent emotional voice conversion via disentangled representations
CN119205988A (en) | Image generation method, device, electronic device and medium
WO2025001721A9 (en) | Server, display device, and digital human processing method
Ji et al. | 3D facial animation driven by speech-video dual-modal signals
Zainkó et al. | Adaptation of tacotron2-based text-to-speech for articulatory-to-acoustic mapping using ultrasound tongue imaging
Saini et al. | Artificial intelligence inspired fog-cloud-based visual-assistance framework for blind and visually-impaired people
CN117809682A (en) | Server, display equipment and digital human interaction method
DE112022007494T5 (en) | Audio-driven facial animation with emotion support using machine learning
Zhao et al. | Generating diverse gestures from speech using memory networks as dynamic dictionaries

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
