CN118519524A - Virtual digital human system and method for people with learning disabilities - Google Patents

Virtual digital human system and method for people with learning disabilities

Info

Publication number
CN118519524A
Authority
CN
China
Prior art keywords
video
model
recognition result
audio
virtual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410598337.0A
Other languages
Chinese (zh)
Inventor
陈莺
黄荣怀
刘德建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University
Priority to CN202410598337.0A
Publication of CN118519524A
Status: Pending (current)

Abstract

The invention discloses a virtual digital human system and method for people with learning disabilities, relating to the technical field of virtual interaction. The system recognizes the voice and video signals generated when a user asks a question to obtain a speech recognition result, a facial expression recognition result, and a behavior recognition result, and further determines the user's emotional state when asking the question. Based on the speech recognition result and the emotional state, it generates dialogue content that fits the dialogue scenario and the user's emotional needs; the dialogue content is the answer to the question raised by the user. Audio and video of a virtual character answering the question are then generated from the dialogue content and fed back to the user.

Description

Translated from Chinese
A virtual digital human system and method for people with learning disabilities

Technical Field

The present invention relates to the field of virtual interaction technology, and in particular to a virtual digital human system and method for people with learning disabilities.

Background Art

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by difficulties in social communication and interaction, repetitive behaviors, and stereotyped interests. Because people with autism are often uninterested in socializing and frequently display inappropriate communication, atypical social interaction, and repetitive stereotyped behaviors, they often find it difficult to adapt to mainstream learning environments and need special support in learning. Over the past century, more and more children have been diagnosed with autism, while the number of related medical personnel and occupational therapists (OTs) remains limited, which has drawn sustained attention from academia and the market to technology-based interventions for autism diagnosis and treatment.

Currently, the most common technology applied to autism intervention is a learning or game platform built for mobile phones, tablets, or computers: users interact through a screen and improve social and communication skills by completing tasks. However, this approach offers limited immersion and interactivity; learners easily lose interest or become irritated and refuse to use it, and caregivers or therapists must continuously accompany and guide them during use.

In recent years, studies have also applied virtual reality (VR) technology to autism intervention. VR can create a highly interactive, vivid, and immersive environment in which learners can interact safely and without embarrassment. However, as a challenging learning situation, a VR environment requires users to be completely isolated from the outside world, lacks learning guidance and support, and the learning experience is constrained by conditions such as sound and image quality. All of this poses challenges for people with autism, who may resist using it or obtain poor results.

Therefore, there is an urgent need for a new type of technology that provides learning support for people with autism.

Summary of the Invention

The purpose of the present invention is to provide a virtual digital human system and method for people with learning disabilities, which provides learning support for people with autism through a virtual character and improves the learning experience and learning outcomes.

To achieve the above object, the present invention provides the following solutions:

A virtual digital human system for people with learning disabilities, comprising:

an audio and video information processing AI, used to recognize the voice signal and video signal generated when a user asks a question and obtain a speech recognition result, a facial expression recognition result, and a behavior recognition result, wherein the users include people with autism;

an emotion analysis AI, connected to the audio and video information processing AI, used to determine the emotional state of the user when asking the question based on the speech recognition result, the facial expression recognition result, and the behavior recognition result;

a dialogue content generative AI, connected to the audio and video information processing AI and the emotion analysis AI respectively, used to generate, based on the speech recognition result and the emotional state, dialogue content that fits the dialogue scenario and the user's emotional needs, the dialogue content being the answer to the question raised by the user;

a virtual character generative AI, connected to the dialogue content generative AI, used to generate, according to the dialogue content, audio and video of the virtual character answering the question raised by the user, and to feed the audio and video back to the user.

In some embodiments, the audio and video information processing AI specifically includes:

a speech recognition module, used to convert the voice signal into a digital signal through sampling and quantization; extract features from the digital signal using Mel-frequency cepstral coefficients to obtain acoustic features; take the acoustic features as input and output a phoneme probability distribution using an acoustic model; and take the phoneme probability distribution as input and output a text sequence using the Viterbi algorithm, wherein the phoneme probability distribution includes the occurrence probability of each phoneme and the text sequence is the speech recognition result;

a facial recognition module, used to, for each video frame in the video signal, take the video frame as input, output the person's facial information using a face detector, crop the video frame based on the facial information to obtain a facial image, and extract features from the facial image to obtain facial expression features, wherein the facial information includes the position and size of the person's face, the facial expression features of all video frames form a facial expression feature time series, and the facial expression feature time series is the facial expression recognition result;

a behavior recognition module, used to, for each video frame in the video signal, take the video frame as input, output a person bounding box using a person detector, correct the person bounding box using a Kalman filter to obtain a video frame containing the corrected person bounding box, and extract features from that frame to obtain key point position features, wherein the key point position features of all video frames form a position feature time series; the position feature time series is taken as input and an action classification model outputs an action category probability distribution, which includes the occurrence probability of each action category and is the behavior recognition result.

In some embodiments, the emotion analysis AI specifically includes:

a fusion module, used to perform weighted fusion on the speech recognition result, the facial expression recognition result, and the behavior recognition result to obtain fused features;

an emotion classification module, used to take the fused features as input and output an emotion category probability distribution using an affective computing model, wherein the emotion category probability distribution includes the occurrence probability of each emotion category and represents the emotional state of the user when asking the question.

In some embodiments, the dialogue content generative AI specifically includes:

an information key point output module, used to take the speech recognition result as input and output, using a knowledge representation model, the information key points of dialogue content that fits the dialogue scenario;

a dialogue content output module, used to take the information key points and the emotional state as input and generate, using a large language model, dialogue content that fits the dialogue scenario and the user's emotional needs.

In some embodiments, the virtual character generative AI specifically includes:

a virtual character creation module, used to generate a virtual character;

an audio and video synthesis module, used to take the dialogue content as input and convert it using an audio generation model to obtain a converted voice signal; take the dialogue content as input and convert it using a video generation model to obtain a converted video signal; and combine the converted voice signal and the converted video signal into audio and video of the virtual character answering the question raised by the user;

a virtual character driving module, used to drive the virtual character's actions based on the audio and video, so as to feed the audio and video back to the user.

In some embodiments, the speech recognition module is further used to preprocess the voice signal before converting it into a digital signal through sampling and quantization, to obtain a new voice signal; the preprocessing includes noise removal, signal enhancement, and normalization.

The acoustic model is a recurrent neural network.

The face detector is a multi-task convolutional neural network.

Extracting features from the facial image to obtain facial expression features specifically includes: taking the facial image as input, extracting features using the convolutional layers of a convolutional neural network to obtain feature maps, and processing the feature maps using the flatten layer of the convolutional neural network to obtain the facial expression features.

The person detector is a YOLO model or a Faster R-CNN model.

Extracting features from the video frame containing the person bounding box to obtain key point position features specifically includes: taking the video frame containing the person bounding box as input and extracting features using a pose estimation model to obtain the key point position features; the pose estimation model is an OpenPose model, and the key points include the head, shoulders, elbows, wrists, waist, knees, and ankles.

The action classification model is a long short-term memory network or a temporal convolutional neural network.

In some embodiments, performing weighted fusion on the speech recognition result, the facial expression recognition result, and the behavior recognition result specifically includes: performing the weighted fusion using an attention mechanism.

In some embodiments, the knowledge representation model is a BERT model, and the large language model is a GLM-130B model.

In some embodiments, generating a virtual character specifically includes: generating a three-dimensional model of the virtual character in response to the virtual character appearance requirements input by the user, and building the skeleton structure of the three-dimensional model to obtain the virtual character.

The audio generation model is a text-to-speech (TTS) model.

The video generation model is a text-to-video (TTV) model.

Combining the converted voice signal and the converted video signal into audio and video of the virtual character answering the question raised by the user specifically includes: taking the converted voice signal and the converted video signal as input, adjusting the virtual character's lip shape using a synthesis model, and outputting the audio and video of the virtual character answering the question; the synthesis model is a Wav2Lip model.
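The patent does not prescribe concrete tooling for this synthesis step. The following is a minimal, hedged sketch of one way it could be wired together, assuming pyttsx3 as a stand-in TTS engine and the command-line interface of the public Wav2Lip repository's inference.py (run from inside that repository); the file names and checkpoint path are hypothetical.

```python
# Sketch of the audio/video synthesis step: TTS for the converted voice signal,
# then Wav2Lip to adjust the virtual character's lip shape to match it.
# Assumptions: pyttsx3 as a stand-in TTS engine; the Wav2Lip repo's inference.py.
import subprocess
import pyttsx3

def synthesize_speech(dialogue_text: str, wav_path: str = "reply.wav") -> str:
    """Convert the generated dialogue content into a speech waveform (TTS)."""
    engine = pyttsx3.init()
    engine.save_to_file(dialogue_text, wav_path)
    engine.runAndWait()
    return wav_path

def lip_sync(face_video: str, audio_wav: str, out_path: str = "answer.mp4") -> str:
    """Drive the virtual character's lip shape with Wav2Lip (paths are hypothetical)."""
    subprocess.run(
        [
            "python", "inference.py",            # script from the Wav2Lip repository
            "--checkpoint_path", "wav2lip.pth",  # hypothetical checkpoint file
            "--face", face_video,                # rendered virtual-character video
            "--audio", audio_wav,                # TTS output
            "--outfile", out_path,
        ],
        check=True,
    )
    return out_path

if __name__ == "__main__":
    wav = synthesize_speech("Sure, let's look at this picture together.")
    lip_sync("avatar.mp4", wav)                  # "avatar.mp4" is a placeholder clip
```

In a full system the face video would come from the virtual character creation and driving modules; here it is only a placeholder file so the two calls can be shown end to end.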

A virtual digital human method for people with learning disabilities, operating on the basis of the above virtual digital human system for people with learning disabilities, includes:

recognizing the voice signal and video signal generated when a user asks a question to obtain a speech recognition result, a facial expression recognition result, and a behavior recognition result, wherein the users include people with autism;

determining the emotional state of the user when asking the question based on the speech recognition result, the facial expression recognition result, and the behavior recognition result;

generating, based on the speech recognition result and the emotional state, dialogue content that fits the dialogue scenario and the user's emotional needs, the dialogue content being the answer to the question raised by the user;

generating, according to the dialogue content, audio and video of a virtual character answering the question raised by the user, and feeding the audio and video back to the user.

According to the specific embodiments provided herein, the present invention discloses the following technical effects:

The present invention provides a virtual digital human system and method for people with learning disabilities. The system recognizes the voice and video signals generated when a user asks a question to obtain speech recognition, facial expression recognition, and behavior recognition results, and further determines the user's emotional state when asking the question. Based on the speech recognition result and the emotional state, it then generates dialogue content that fits the dialogue scenario and the user's emotional needs; the dialogue content is the answer to the question raised by the user. Audio and video of a virtual character answering the question are generated from the dialogue content and fed back to the user. Because the present invention fully considers the user's emotions when answering questions, generates dialogue content that meets the user's emotional needs, and drives a virtual character to answer on that basis, it can provide better learning guidance and support for the user, thereby improving the learning experience and learning outcomes.

Brief Description of the Drawings

To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic diagram of a virtual digital human system for people with learning disabilities provided in Embodiment 1 of the present invention.

FIG. 2 is a schematic flowchart of a virtual digital human method for people with learning disabilities provided in Embodiment 2 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

The purpose of the present invention is to provide a virtual digital human system and method for people with learning disabilities, which provides learning support for people with autism through a virtual character and improves the learning experience and learning outcomes.

To make the above objects, features, and advantages of the present invention more obvious and easier to understand, the present invention is further described in detail below with reference to the drawings and specific embodiments.

Embodiment 1

As shown in FIG. 1, a virtual digital human system for people with learning disabilities in this embodiment includes:

an audio and video information processing AI, used to recognize the voice and video signals generated when a user asks a question and obtain a speech recognition result, a facial expression recognition result, and a behavior recognition result; the users include people with autism;

an emotion analysis AI, connected to the audio and video information processing AI, used to determine the emotional state of the user when asking the question based on the speech recognition result, the facial expression recognition result, and the behavior recognition result;

a dialogue content generative AI, connected to the audio and video information processing AI and the emotion analysis AI respectively, used to generate dialogue content that fits the dialogue scenario and the user's emotional needs based on the speech recognition result and the emotional state; the dialogue content is the answer to the question raised by the user;

a virtual character generative AI, connected to the dialogue content generative AI, used to generate audio and video of the virtual character answering the question raised by the user based on the dialogue content, and to feed the audio and video back to the user.

This embodiment provides a system that uses a virtual character to provide learning support for people with learning disabilities, i.e., a system for special education. A camera captures the video signal of the ASD user's expressions and body movements, and a microphone or pickup captures the voice signal (also called the audio signal) of the ASD user's speech. The audio and video information processing AI then performs video processing and speech recognition on the audio and video information (the video signal and the voice signal); the emotion analysis AI determines the ASD user's emotional state; the dialogue content generative AI understands the key information in the input and generates reply content (the dialogue content) that meets the ASD user's emotional needs; and the virtual character generative AI synthesizes the virtual character's expressions, movements, and voice from the generated reply content to obtain the audio and video.

The virtual digital human system of this embodiment includes the following parts:

(I) Audio and video information processing AI:

When an ASD user uses the virtual digital human system, the system's built-in microphone/pickup and camera capture the voice information (the voice signal) and video information (the video signal) produced by the user, so the voice and video signals generated when the user asks a question are obtained. For the voice information, after receiving the voice signal the audio and video information processing AI performs preprocessing, encoding, and decoding in sequence to recognize the voice signal and finally outputs the speech recognition result. For the video information, after receiving the video signal it performs preprocessing, feature extraction, and classification in sequence to recognize the video signal and finally outputs the facial expression recognition result and the behavior recognition result.

The audio and video information processing AI of this embodiment includes three parts: speech recognition, facial recognition, and behavior recognition. Specifically, it includes:

(1) Speech recognition module:

The speech recognition module is used to convert the voice signal into a digital signal through sampling and quantization; extract features from the digital signal using Mel-frequency cepstral coefficients to obtain acoustic features; take the acoustic features as input and output a phoneme probability distribution using an acoustic model; and take the phoneme probability distribution as input and output a text sequence using the Viterbi algorithm. The phoneme probability distribution includes the occurrence probability of each phoneme, and the text sequence is the speech recognition result.

The speech recognition module is also used to preprocess the voice signal, before converting it into a digital signal through sampling and quantization, to obtain a new voice signal; the preprocessing includes noise removal, signal enhancement, and normalization.

The acoustic model is a recurrent neural network.

Specifically, the processing flow of the speech recognition module of this embodiment includes:

(1.1) Preprocessing:

After the speech recognition module obtains the voice signal collected by the microphone or pickup, it first preprocesses the signal; the preprocessing includes noise removal, signal enhancement, and audio normalization, to ensure the quality and stability of the input signal.

(1.2) Encoding:

After preprocessing, the speech recognition module converts the voice signal from analog form to digital form through sampling and quantization, obtaining a digital signal. Sampling converts the continuous voice signal into discrete sample points. Quantization predefines a mapping between amplitude ranges and numbers; for each sample point, the number corresponding to the sample point is determined by the amplitude range into which its amplitude value falls, so that the amplitude value is mapped to a finite set of numbers. Through this sampling and quantization process, the original voice signal is converted into a digital signal.
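As an illustration of the quantization just described, the following minimal NumPy sketch (not part of the patent; the sampling rate and bit depth are illustrative) maps continuous amplitudes to a finite set of integer levels.

```python
# Illustrative sketch of quantization: map continuous amplitudes in [-1, 1]
# to a finite set of signed integer levels (here 16-bit).
import numpy as np

def quantize(signal: np.ndarray, n_bits: int = 16) -> np.ndarray:
    """Map amplitudes in [-1, 1] to signed integers with 2**n_bits levels."""
    levels = 2 ** (n_bits - 1) - 1
    clipped = np.clip(signal, -1.0, 1.0)
    return np.round(clipped * levels).astype(np.int16)

# Example: a 440 Hz tone "sampled" at 16 kHz for 10 ms (values are illustrative).
t = np.arange(0, 0.01, 1.0 / 16000)
analog_like = np.sin(2 * np.pi * 440 * t)
digital = quantize(analog_like)
print(digital[:8])
```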

After the digital signal is obtained, Mel-frequency cepstral coefficients (MFCCs) are used to extract features from the digital signal based on spectral analysis, yielding the acoustic features. MFCC is a commonly used acoustic feature extraction method that reflects the spectral characteristics of a speech signal. The specific process is: perform a fast Fourier transform (FFT) on the digital signal to obtain its spectrum; divide the spectrum according to the Mel frequency scale; compute the energy of each frequency band; and normalize the energies to obtain the MFCC coefficients. The MFCC coefficients are the acoustic features of the voice signal, and the acoustic features include pitch, frequency, and the temporal structure of the sound.
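The patent does not name a specific toolkit for MFCC extraction; the sketch below uses librosa as one possible implementation, with illustrative parameter values and a placeholder file name.

```python
# Sketch of MFCC extraction with librosa (an assumption; the patent names no library).
# Returns one MFCC vector per analysis frame of the digital signal.
import librosa
import numpy as np

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)                     # digital signal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # FFT -> Mel bands -> log -> DCT
    # Per-coefficient normalization, in the spirit of the "normalize the energy" step.
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T                                             # shape: (frames, n_mfcc)

features = extract_mfcc("question.wav")                       # "question.wav" is a placeholder
print(features.shape)
```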

After the acoustic features are obtained, they are fed into the acoustic model, which outputs a phoneme probability distribution. This embodiment uses a recurrent neural network (RNN) to model the acoustic characteristics of the voice signal; that is, the acoustic model can be a recurrent neural network. An RNN is a neural network that can process sequential data and can capture the temporal dynamics of a speech signal. During training, the acoustic model learns the probability distribution of the speech signal by minimizing the gap between predicted and actual probabilities. In application, the input of the acoustic model is the acoustic features, and its output is a series of probability values (the phoneme probability distribution) representing the occurrence probability of each phoneme given the acoustic features. A phoneme is the smallest unit of speech divided according to the natural properties of speech; analyzed in terms of the articulatory actions within a syllable, one action constitutes one phoneme.

(1.3) Decoding:

The acoustic model and the language model jointly participate in decoding, and the Viterbi search algorithm (the Viterbi algorithm) finds and outputs the most likely text sequence. The language model handles the grammatical and semantic structure of the text sequence so as to better understand its context. It can adopt a neural network structure such as a conditional random field (CRF) or a long short-term memory network (LSTM); its input is a text sequence, and its output is the probability distribution of the text sequence, i.e., the probability distribution over the next word or symbol, which guides the text generation process toward words or symbols with higher probability and improves the accuracy of the speech recognition result. During training, the language model learns the grammatical and semantic rules of text sequences by maximizing the probability of the input text sequences. The Viterbi algorithm is a dynamic programming algorithm that, given the probability distributions output by the acoustic model and the language model, searches for the highest-scoring path and thus the most likely text sequence. The specific process is as follows. A phoneme is the smallest phonological unit in speech; each phoneme corresponds to a specific sound, and phonemes combine to form words and sentences. According to the phoneme probability distribution output by the acoustic model, multiple candidate phoneme sequences are generated and each is assigned an initial score. Guided by the language model, which indicates how phoneme sequences form words and sentences, the score of each phoneme sequence is computed iteratively, and the sequence with the highest score is selected as the final output text sequence. In this way, with the phoneme probability distribution as input, the Viterbi algorithm outputs the text sequence; this recognized text sequence is the speech recognition result.
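The following toy sketch shows the dynamic-programming core of Viterbi decoding over per-frame phoneme probabilities. The transition matrix stands in for the language-model guidance, and all numbers are made up for illustration.

```python
# Toy Viterbi decoder over per-frame phoneme probabilities.
# The transition matrix is a stand-in for language-model guidance.
import numpy as np

def viterbi(emission: np.ndarray, transition: np.ndarray, prior: np.ndarray) -> list[int]:
    """emission: (T, N) per-frame phoneme probabilities; returns the best state path."""
    T, N = emission.shape
    score = np.log(prior + 1e-12) + np.log(emission[0] + 1e-12)
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + np.log(transition + 1e-12)   # (from_state, to_state)
        back[t] = cand.argmax(axis=0)                        # best predecessor per state
        score = cand.max(axis=0) + np.log(emission[t] + 1e-12)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                            # backtrack the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Three phonemes, four frames of (made-up) acoustic-model output.
emis = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]])
trans = np.array([[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.1, 0.2, 0.7]])
print(viterbi(emis, trans, prior=np.array([0.5, 0.3, 0.2])))
```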

To further improve the quality of the speech recognition result, this embodiment may also perform spell checking, grammatical analysis, and context correction on the text sequence to be output, so as to correct possible recognition errors and ensure that the final output text sequence is accurate.

(2) Facial recognition module:

For each video frame in the video signal, the facial recognition module takes the frame as input, uses a face detector to output the person's facial information, crops the frame based on that facial information to obtain a facial image, and extracts features from the facial image to obtain facial expression features. The facial information includes the position and size of the person's face. The facial expression features of all video frames form a facial expression feature time series, which is the facial expression recognition result.

The face detector can be a multi-task convolutional neural network.

Extracting features from the facial image to obtain facial expression features specifically includes: taking the facial image as input, extracting features using the convolutional layers of a convolutional neural network to obtain feature maps, and processing the feature maps using the flatten layer of the convolutional neural network to obtain the facial expression features.

Specifically, the processing flow of the facial recognition module of this embodiment includes:

(2.1) Preprocessing:

After the facial recognition module obtains the video stream (the video signal) captured by the camera, it first uses a multi-task convolutional neural network (MTCNN) as the face detector to identify and locate the person's face in each video frame. The input of the MTCNN is the video frame, and its output is the position and size of the facial region.

The facial region is represented as a rectangular box; the position and size of the box locate the person's facial region in the frame and are the main basis for cropping the face image. The facial recognition module then uses a portrait cropping algorithm to extract the facial region from the frame, i.e., it crops the video frame based on the facial information to obtain the facial image, which reduces the workload of subsequent processing and improves the stability of feature extraction.

(2.2) Feature extraction:

The facial recognition module uses the convolutional layers of a pre-trained convolutional neural network (such as VGG16, ResNet, or MobileNet) to extract high-level, expression-related features such as texture, shape, and eye and mouth movements, forming feature maps. Feature maps are intermediate outputs of the convolutional neural network; they reflect the characteristics of the input image along different dimensions and are important for affective computing. The module then uses the network's flatten layer to concatenate the pixel values of the feature maps into a flattened one-dimensional feature vector, which contains rich information about the person's face, such as texture, shape, and eye and mouth movements; this vector is the facial expression feature.
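A minimal sketch of this crop-and-extract step is shown below, using a pre-trained ResNet-18 from torchvision as a stand-in for the VGG16/ResNet/MobileNet backbone mentioned above; the detector box format, image size, and example frame are assumptions.

```python
# Sketch of the per-frame face-feature step: crop the detected face box, run the
# convolutional part of a pre-trained CNN, and flatten the resulting feature map.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
conv_layers = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
conv_layers.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def face_expression_feature(frame: torch.Tensor, box: tuple[int, int, int, int]) -> torch.Tensor:
    """frame: HxWx3 uint8 tensor; box: (x, y, w, h) from the face detector."""
    x, y, w, h = box
    face = frame[y:y + h, x:x + w, :]                    # crop the facial region
    inp = preprocess(face.permute(2, 0, 1)).unsqueeze(0)
    with torch.no_grad():
        fmap = conv_layers(inp)                          # feature map
    return fmap.flatten()                                # flattened 1-D feature vector

frame = torch.randint(0, 256, (480, 640, 3), dtype=torch.uint8)   # made-up frame
vec = face_expression_feature(frame, (200, 100, 160, 160))        # made-up face box
print(vec.shape)   # one such vector per frame forms the expression time series
```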

In the video stream, the above preprocessing and feature extraction are performed continuously on every video frame, and one facial expression feature is obtained for each frame, forming a facial expression feature time series. This time series is used to analyze how the person's facial expression features change over time and is used in the subsequent emotion analysis task, i.e., to analyze the trend of the facial expression features across the video stream. For example, for a video frame yielding 10 feature maps, the pixel values of each feature map can be concatenated into one feature vector to obtain the facial expression feature. Over time, the facial expression features of successive video frames are computed continuously, forming a facial expression feature time series whose length equals the number of video frames.

(3) Behavior recognition module:

For each video frame in the video signal, the behavior recognition module takes the frame as input, uses a person detector to output a person bounding box, corrects the bounding box with a Kalman filter to obtain a video frame containing the corrected person bounding box, and extracts features from that frame to obtain key point position features; the key point position features of all video frames form a position feature time series. With the position feature time series as input, an action classification model outputs an action category probability distribution, which includes the occurrence probability of each action category; this distribution is the behavior recognition result.

The person detector is a YOLO model or a Faster R-CNN model.

Extracting features from the video frame containing the person bounding box to obtain key point position features specifically includes: taking the video frame containing the person bounding box as input and extracting features using a pose estimation model to obtain the key point position features; the pose estimation model is an OpenPose model, and the key points include the head, shoulders, elbows, wrists, waist, knees, and ankles.

The action classification model is a long short-term memory network or a temporal convolutional neural network.

Specifically, the processing flow of the behavior recognition module of this embodiment includes:

(3.1) Preprocessing:

After obtaining the video stream captured by the camera, the behavior recognition module first uses a YOLO (You Only Look Once) or Faster R-CNN deep learning model to detect the persons in each video frame. The model's input is the video frame, and its output is the detected person bounding boxes, which represent the position, size, and class probability of each object.

After a person is detected, the behavior recognition module further uses a Kalman filter to track the person. The Kalman filter is a recursive filter that estimates the state of a dynamic system through prediction and update steps. In person tracking, the Kalman filter predicts the person's position in the next frame from the position and velocity information in consecutive frames, and the person bounding box predicted by the YOLO or Faster R-CNN model is updated according to the Kalman filter's prediction, so that the bounding box is corrected, the tracking error is reduced, and a video frame containing the corrected person bounding box is obtained.
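The sketch below implements a minimal constant-velocity Kalman filter over the bounding-box center as one concrete reading of this predict-and-update loop; the noise settings and detector outputs are illustrative.

```python
# Minimal constant-velocity Kalman filter for smoothing the (cx, cy) center of
# the detected person bounding box across frames; all noise values are illustrative.
import numpy as np

class BoxTracker:
    def __init__(self):
        self.x = np.zeros(4)                     # state: [cx, cy, vx, vy]
        self.P = np.eye(4) * 10.0                # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0   # constant-velocity motion
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * 0.01                # process noise
        self.R = np.eye(2) * 1.0                 # measurement noise

    def predict(self) -> np.ndarray:
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                        # predicted center for the next frame

    def update(self, detected_center: np.ndarray) -> np.ndarray:
        y = detected_center - self.H @ self.x                    # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                 # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                        # corrected center

tracker = BoxTracker()
for cx, cy in [(100, 80), (104, 82), (109, 85)]:  # made-up detector outputs
    tracker.predict()
    print(tracker.update(np.array([cx, cy], dtype=float)))
```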

(3.2) Feature extraction:

The behavior recognition module uses a pre-trained pose estimation model, such as the OpenPose model, to extract the key features of the person's movements from the video frames containing the person bounding boxes. The input of the OpenPose model is a frame containing a person bounding box, and the output is the key point coordinates of each person (the key point position features); the key points include the head, shoulders, elbows, wrists, waist, knees, and ankles, and they describe the person's movements and posture.

(3.3) Classification:

The behavior recognition module feeds the extracted key point position features into a deep learning model based on a long short-term memory network (LSTM) or a temporal convolutional network (TCN), i.e., the action classification model, to model the temporal information of the movements. The input of the action classification model is the position feature time series (the key point position features of all video frames in chronological order), and the output is the classification result for the movements (the action category probability distribution). This distribution is a feature vector characterizing the behavior recognition result, in which each element represents the occurrence probability of the corresponding action category; it can be used in the subsequent emotion analysis task to judge whether the behavior in the video reflects positive or negative emotion. Because LSTM and TCN can capture how movements change over time, they improve the accuracy of behavior recognition.
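A minimal PyTorch sketch of such an action classifier is given below; the number of key points, hidden size, and number of action categories are illustrative assumptions, not values fixed by the patent.

```python
# Sketch of the action classification step: an LSTM over the key-point position
# time series, followed by a softmax over action categories.
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, n_keypoints: int = 17, hidden: int = 128, n_actions: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_keypoints * 2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, keypoint_series: torch.Tensor) -> torch.Tensor:
        """keypoint_series: (batch, frames, n_keypoints * 2) position features."""
        _, (h_n, _) = self.lstm(keypoint_series)     # last hidden state summarizes the sequence
        logits = self.head(h_n[-1])
        return torch.softmax(logits, dim=-1)         # action category probability distribution

model = ActionClassifier()
probs = model(torch.randn(1, 30, 34))                # 30 frames of made-up key points
print(probs.shape, float(probs.sum()))
```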

(II) Emotion analysis AI:

After the audio and video processing is completed, the emotion analysis AI performs multimodal feature fusion based on the speech recognition result, the facial expression recognition result, and the behavior recognition result, and then uses the trained emotion classifier (the affective computing model) to output the emotion prediction for the observed subject, thereby determining the emotional state of the user when asking the question.

The emotion analysis AI of this embodiment consists of a single part, affective computing. Specifically, it includes:

a fusion module, used to perform weighted fusion on the speech recognition result, the facial expression recognition result, and the behavior recognition result to obtain fused features;

an emotion classification module, used to take the fused features as input and output an emotion category probability distribution using the affective computing model; the distribution includes the occurrence probability of each emotion category and represents the emotional state of the user when asking the question.

The weighted fusion of the speech recognition result, the facial expression recognition result, and the behavior recognition result specifically includes: performing the weighted fusion using an attention mechanism.

Specifically, the processing flow of the emotion analysis AI of this embodiment includes:

(1) Multimodal feature fusion:

After the fusion module obtains the speech recognition result, the facial expression recognition result, and the behavior recognition result, it completes the multimodal feature fusion using a model-level feature fusion method. During fusion, an attention mechanism dynamically adjusts the weights of the different modal features: it computes the similarity between features and assigns each modality's feature a weight coefficient that represents how important that modal feature is for affective computing in the current context. For the input speech recognition result, facial expression recognition result, and behavior recognition result, the attention mechanism first computes the similarity of the three features to obtain the weight coefficient of each feature, then multiplies each feature by its weight coefficient to obtain three weighted features, and finally adds the three weighted features to obtain the fused feature. The fused feature can therefore be expressed as: fused feature = w1 * speech recognition result + w2 * facial expression recognition result + w3 * behavior recognition result, where w1, w2, and w3 are the weight coefficients of the speech recognition result, the facial expression recognition result, and the behavior recognition result, respectively.
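The sketch below realizes this weighted fusion with a simple learned-query attention in PyTorch; it assumes each modality's recognition result has already been embedded into a common feature dimension, which the patent does not specify.

```python
# Sketch of attention-weighted fusion of the three modality features, matching
# "fused = w1*speech + w2*expression + w3*behavior" with weights from similarity.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))    # learned context vector

    def forward(self, speech: torch.Tensor, expression: torch.Tensor, behavior: torch.Tensor):
        feats = torch.stack([speech, expression, behavior], dim=1)   # (batch, 3, dim)
        scores = feats @ self.query                                  # similarity to the context
        weights = torch.softmax(scores, dim=1)                       # w1, w2, w3 per sample
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)           # weighted sum of modalities
        return fused, weights

fusion = AttentionFusion()
fused, w = fusion(torch.randn(1, 128), torch.randn(1, 128), torch.randn(1, 128))
print(fused.shape, w)    # the three weights sum to 1 for each sample
```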

(2) Emotion classification:

During training of the affective computing model, the input is a labeled multimodal emotion dataset and the output is the emotion classification result. The labeled dataset consists of fused features generated from audio and video data together with the emotion labels corresponding to those fused features; the emotion labels include positive, negative, neutral, and so on. During training, accuracy, recall, and F1 score are used as evaluation metrics to assess the model's performance.
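For concreteness, these three metrics can be computed with scikit-learn as in the sketch below; the labels and predictions shown are made up for illustration.

```python
# Sketch of the evaluation step: accuracy, recall, and F1 on a held-out split.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = ["positive", "negative", "neutral", "positive", "negative"]   # made-up labels
y_pred = ["positive", "neutral",  "neutral", "positive", "negative"]   # made-up predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("recall (macro):", recall_score(y_true, y_pred, average="macro"))
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
```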

When the trained affective computing model is applied, its input is the received fused feature and its output is the affective computing result, i.e., the emotion category probability distribution, in which each element represents the occurrence probability of the corresponding emotion category. The result can be used for further analysis and decision-making, such as judging the emotional tendency in a video or performing emotion monitoring.

(III) Dialogue content generative AI:

From the speech recognition result, the virtual digital human system obtains the question raised by the ASD user, and it then needs to answer that question. The dialogue content generative AI pre-trains a BERT model and a large language model on the basis of a knowledge graph, yielding the knowledge representation model and the large language model. The knowledge representation model generates the information key points of the dialogue content from the question, and the large language model combines the information key points to be conveyed with the ASD user's current emotional state to generate dialogue content that fits the dialogue scenario and emotional needs, i.e., dialogue content that fits the dialogue scenario and the user's emotional needs is generated based on the speech recognition result and the emotional state.

The dialogue content generative AI of this embodiment includes two parts, the BERT model and dialogue content generation. Specifically, it includes:

an information key point output module, used to take the speech recognition result as input and output, using the knowledge representation model, the information key points of dialogue content that fits the dialogue scenario; the information key points can be understood as the answer without emotion;

a dialogue content output module, used to take the information key points and the emotional state as input and generate, using the large language model, dialogue content that fits the dialogue scenario and the user's emotional needs.

The knowledge representation model is a BERT model.

The large language model is a GLM-130B model.

Specifically, the processing flow of the dialogue content generative AI of this embodiment includes:

(1) Information key point generation:

The BERT model used in this embodiment is obtained by fine-tuning the parameters of the original BERT model; specifically, an ASD dataset is incorporated into the original BERT model to add ASD-specific knowledge. The model parameters of BERT include word embedding weights, position embedding weights, hidden layer weights, attention mechanism weights, and so on; these parameters are updated by an optimization algorithm during training to improve the model's performance in the ASD domain. During self-supervised training, the BERT model learns language representations by predicting masked tokens (such as the [MASK] token) in the input sentences (text sequences). This training method requires no labeled data, which reduces the training cost. The trained BERT model can be applied to a question-answering system in the ASD domain and generate accurate, emotion-free answers from the input question.

Before the speech recognition result is input into the BERT model, the data is cleaned and preprocessed, which specifically includes tokenization and assigning each resulting token an identifier and a position embedding. Tokenization converts a text sequence into tokens such as words or subword units; for example, the sentence "ASD children may exhibit atypical behavior" is tokenized as ["ASD", "children", "may", "exhibit", "atypical", "behavior"]. The identifier represents the unique index of each token, and the position embedding represents the token's position in the text sequence; this information helps the BERT model understand the semantics and contextual relationships of the tokens.

The tokenized text sequence must be converted into the tensor form accepted by the BERT model. A tensor is a multidimensional array of a uniform type and can represent a text sequence, including each token's identifier and position embedding. For example, a text sequence containing 6 tokens can be converted into a 6 x hidden_size tensor, where hidden_size is the hidden-layer dimension of the BERT model. To feed data into the BERT model efficiently, a data pipeline is created: during training, the pipeline enables batch processing to improve training efficiency, and during application it enables fast data input to improve prediction efficiency.

The above tensor is input into the BERT model, whose output is a tensor whose length equals the number of tokens in the text sequence and which represents the probability distribution for each token. These distributions reflect the model's prediction for each token of the input sequence; the predicted token with the highest probability in each distribution is the final prediction for that token, and the final predictions of all tokens constitute the information key points.
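The sketch below shows the tokenize-to-tensor-to-prediction flow with the Hugging Face transformers API; "bert-base-uncased" is a public stand-in for the fine-tuned ASD-domain model, which is not released with the patent.

```python
# Sketch of the tokenize -> tensor -> BERT -> per-token prediction flow.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "ASD children may exhibit [MASK] behavior."
inputs = tokenizer(text, return_tensors="pt")        # token ids + positions as tensors
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, vocab_size)

predicted_ids = logits.argmax(dim=-1)[0]             # highest-probability token per position
print(tokenizer.convert_ids_to_tokens(predicted_ids))
```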

(2)对话内容生成:(2) Dialogue content generation:

对话内容的自动生成基于预训练的大语言模型GLM-130B实现,GLM-130B模型的输入为BERT模型的输出和情感分析AI的输出,GLM-130B模型的输出为对话内容,对话内容是文本。在生成对话内容前,GLM-130B模型需要进行预处理和训练,训练时,使用语言模型预训练技术,通过对大量的无监督文本数据进行训练,来学习文本中的语言模式和知识。The automatic generation of dialogue content is based on the pre-trained large language model GLM-130B. The input of the GLM-130B model is the output of the BERT model and the output of the sentiment analysis AI. The output of the GLM-130B model is the dialogue content, which is text. Before generating dialogue content, the GLM-130B model needs to be preprocessed and trained. During training, the language model pre-training technology is used to learn the language patterns and knowledge in the text by training a large amount of unsupervised text data.
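
The embodiment does not specify how the two upstream outputs are passed to GLM-130B; one plausible arrangement is to fold the BERT key points and the detected emotional state into a single prompt. In the sketch below, glm_generate is a hypothetical wrapper around a deployed GLM-130B endpoint, not the model's published API.

# Combining the BERT key points and the emotion analysis result into one prompt (sketch).
def build_prompt(key_points, emotion):
    return (
        "You are a virtual teacher supporting an autistic learner.\n"
        f"The learner currently appears: {emotion}\n"
        f"Key points of the learner's question: {', '.join(key_points)}\n"
        "Answer in simple, calm and predictable language."
    )

def glm_generate(prompt):
    """Hypothetical wrapper around a deployed GLM-130B instance."""
    raise NotImplementedError

prompt = build_prompt(["ASD", "repetitive behavior", "how to respond"], emotion="anxious")
# dialogue_text = glm_generate(prompt)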

当输入一个问题或文本后,大语言模型使用字符级别的编码方式,如独热编码或者字符级别的嵌入等对输入文本进行编码。以独热编码为例,假设词汇表大小为V,那么对于输入文本中的每个词语,可以将其转化为一个长度为V的向量,其中,只有一个元素为1,其余元素为0,这个向量表示了词语在词汇表中的索引。这种方式可将文本中的每个词语映射为一个向量,从而将文本转化为可以被神经网络处理的形式。When a question or text is input, the large language model uses character-level encoding methods, such as one-hot encoding or character-level embedding, to encode the input text. Taking one-hot encoding as an example, assuming that the vocabulary size is V, then for each word in the input text, it can be converted into a vector of length V, in which only one element is 1 and the rest are 0. This vector represents the index of the word in the vocabulary. This method can map each word in the text to a vector, thereby converting the text into a form that can be processed by a neural network.
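
A toy one-hot encoding example, assuming PyTorch and a deliberately tiny vocabulary so the vectors can be printed in full:

# One-hot encoding: each word index becomes a length-V vector with a single 1.
import torch

V = 8                                   # toy vocabulary size
word_indices = torch.tensor([2, 5, 0])  # indices of three words in the vocabulary
one_hot = torch.nn.functional.one_hot(word_indices, num_classes=V).float()
print(one_hot)
# tensor([[0., 0., 1., 0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0., 1., 0., 0.],
#         [1., 0., 0., 0., 0., 0., 0., 0.]])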

大语言模型会使用Transformer架构来处理编码后的文本。Transformer架构是一种基于自注意力机制的神经网络模型,可以同时处理多个序列,捕捉序列中的局部和全局依赖关系。在Transformer架构中,输入文本首先经过编码器(Encoder)处理,输出为一个上下文向量。然后,这个上下文向量被传递给解码器(Decoder),解码器根据上下文向量生成对应的输出文本。大语言模型的解码过程使用生成式对抗网络(GAN)技术,来保证生成的输出文本的质量和合理性。生成式对抗网络包括一个生成器(Generator)和一个判别器(Discriminator)。生成器负责根据输入的上下文向量生成候选文本,判别器负责判断生成的候选文本是否真实。在训练过程中,生成器和判别器相互竞争,生成器试图生成更真实的文本以欺骗判别器,而判别器试图区分真实文本和生成文本。通过这种竞争,生成器能够生成越来越真实的候选文本。在实际应用中,生成器生成的候选文本即为输出文本。最后,大语言模型通过解码神经网络的输出,生成对应的回答或输出文本。The large language model uses the Transformer architecture to process the encoded text. The Transformer architecture is a neural network model based on the self-attention mechanism that can process multiple sequences at the same time and capture local and global dependencies in the sequence. In the Transformer architecture, the input text is first processed by the encoder and the output is a context vector. Then, this context vector is passed to the decoder, which generates the corresponding output text based on the context vector. The decoding process of the large language model uses the generative adversarial network (GAN) technology to ensure the quality and rationality of the generated output text. The generative adversarial network includes a generator and a discriminator. The generator is responsible for generating candidate texts based on the input context vector, and the discriminator is responsible for judging whether the generated candidate texts are real. During the training process, the generator and the discriminator compete with each other. The generator tries to generate more realistic texts to deceive the discriminator, while the discriminator tries to distinguish between real texts and generated texts. Through this competition, the generator can generate more and more realistic candidate texts. In practical applications, the candidate texts generated by the generator are the output texts. Finally, the large language model generates the corresponding answer or output text by decoding the output of the neural network.
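
A minimal sketch of the two ingredients just described, assuming PyTorch: torch.nn.Transformer as the encoder-decoder, a linear layer mapping decoder states to vocabulary scores, and a small discriminator that scores pooled sequence representations as real or generated. Dimensions are toy values; this illustrates the architectural pattern, not GLM-130B itself.

# Encoder-decoder plus a simple discriminator (illustrative dimensions only).
import torch
import torch.nn as nn

d_model, vocab = 64, 1000
embed = nn.Embedding(vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
to_vocab = nn.Linear(d_model, vocab)

src = embed(torch.randint(0, vocab, (1, 10)))      # encoded input question
tgt = embed(torch.randint(0, vocab, (1, 8)))       # partial answer generated so far
logits = to_vocab(transformer(src, tgt))           # (1, 8, vocab): next-token scores

# Discriminator: scores whether a (mean-pooled) sequence representation looks real.
discriminator = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())
fake_score = discriminator(tgt.mean(dim=1))        # value in (0, 1)
print(logits.shape, fake_score.item())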

(四)虚拟角色生成式AI:(IV) Virtual Character Generation AI:

虚拟角色生成式AI依据生成的对话内容进行文本转语音和文本转视频的操作来生成虚拟角色回答对应问题的音视频，以根据对话内容生成虚拟角色回答用户提出问题的音视频，并将音视频反馈至用户，即最终传递至ASD患者。Based on the generated dialogue content, the virtual character generative AI performs text-to-speech and text-to-video operations to produce audio and video of the virtual character answering the user's question, and feeds this audio and video back to the user, i.e., it is ultimately delivered to the ASD patient.

本实施例的虚拟角色生成式AI包括虚拟角色创建和虚拟角色驱动两个部分,虚拟角色生成式AI具体包括:The virtual character generation AI of this embodiment includes two parts: virtual character creation and virtual character driving. The virtual character generation AI specifically includes:

虚拟角色创建模块,用于生成虚拟角色。A virtual character creation module is used to generate a virtual character.

音视频合成模块,用于以对话内容作为输入,利用音频生成模型进行转换,得到转换后语音信号;以对话内容作为输入,利用视频生成模型进行转换,得到转换后视频信号;将转换后语音信号和转换后视频信号组成虚拟角色回答用户提出问题的音视频。The audio and video synthesis module is used to take the conversation content as input, convert it using the audio generation model, and obtain a converted voice signal; take the conversation content as input, convert it using the video generation model, and obtain a converted video signal; and combine the converted voice signal and the converted video signal into audio and video of a virtual character answering questions raised by the user.

虚拟角色驱动模块,用于基于音视频驱动虚拟角色动作,以将音视频反馈至用户。The virtual character driving module is used to drive the virtual character's actions based on audio and video to feed back the audio and video to the user.

其中,生成虚拟角色,具体包括:响应用户输入的虚拟角色形象需求生成虚拟角色三维模型;建立虚拟角色三维模型的骨骼结构,得到虚拟角色。The generation of a virtual character specifically includes: generating a three-dimensional model of the virtual character in response to a virtual character image requirement input by a user; and establishing a skeleton structure of the three-dimensional model of the virtual character to obtain the virtual character.

其中,音频生成模型为TTS模型。Among them, the audio generation model is a TTS model.

其中,视频生成模型为TTV模型。Among them, the video generation model is the TTV model.

其中,将转换后语音信号和转换后视频信号组成虚拟角色回答用户提出问题的音视频,具体包括:以转换后语音信号和转换后视频信号作为输入,利用合成模型调整虚拟角色的唇形,输出虚拟角色回答用户提出问题的音视频,合成模型为Wav2Lip模型。Among them, the converted voice signal and the converted video signal are combined into audio and video of the virtual character answering the questions raised by the user, specifically including: taking the converted voice signal and the converted video signal as input, using the synthesis model to adjust the lip shape of the virtual character, and outputting the audio and video of the virtual character answering the questions raised by the user, and the synthesis model is the Wav2Lip model.

具体的,本实施例的虚拟角色生成式AI的处理过程包括:Specifically, the processing process of the virtual character generation AI in this embodiment includes:

(1)虚拟角色创建:(1) Creation of virtual characters:

ASD患者在使用本实施例的虚拟数字人系统时,可以根据自己的喜好创建虚拟角色的形象。首先,ASD患者在虚拟数字人系统中定义虚拟角色的特征,如肤色、发色等特征。然后,虚拟角色创建模块响应用户输入的虚拟角色形象需求在Blender软件中创建包含头部、身体和四肢等部分的虚拟角色三维模型。虚拟角色创建模块进一步建立虚拟角色三维模型的骨骼结构,具体使用骨骼绑定工具(如Adobe Photoshop或Maya软件)将骨骼分配给虚拟角色三维模型的各个部位,以便在动画中可以使用关键帧动画或骨骼动画来实现站立、行走、奔跑、跳跃等动作,得到虚拟角色。When using the virtual digital human system of the present embodiment, ASD patients can create the image of a virtual character according to their preferences. First, the ASD patient defines the characteristics of the virtual character in the virtual digital human system, such as skin color, hair color and other characteristics. Then, the virtual character creation module responds to the virtual character image requirements input by the user to create a virtual character three-dimensional model including the head, body and limbs in the Blender software. The virtual character creation module further establishes the skeletal structure of the virtual character three-dimensional model, and specifically uses a skeletal binding tool (such as Adobe Photoshop or Maya software) to assign bones to various parts of the virtual character three-dimensional model, so that keyframe animation or skeletal animation can be used in the animation to achieve actions such as standing, walking, running, jumping, etc., to obtain a virtual character.
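
Character scaffolding of this kind can also be scripted through Blender's Python API (bpy); the following is a rough sketch with placeholder meshes and a single-bone armature, to be run inside Blender. Object names and proportions are illustrative only.

# Scripted character scaffolding in Blender (bpy): placeholder body plus an armature.
import bpy

# Placeholder body and head meshes for the character.
bpy.ops.mesh.primitive_cube_add(size=1.0, location=(0, 0, 1.0))
bpy.context.object.name = "Body"
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.3, location=(0, 0, 1.9))
bpy.context.object.name = "Head"

# A single-bone armature as the start of the skeleton; additional bones for the
# limbs would be added in edit mode before parenting the meshes to the rig.
bpy.ops.object.armature_add(location=(0, 0, 0.5))
bpy.context.object.name = "CharacterRig"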

(2)对答视频合成:(2) Answer video synthesis:

音视频合成模块根据大语言模型生成的文本(即对话内容)合成音频和视频动画。首先，音视频合成模块将生成的文本输入音频生成模型，音频生成模型可采用Text to Speech (TTS)模型，通过TTS模型将输入的文本转换成音频输出，得到转换后语音信号。TTS模型使用现有的技术，如Google的Text-to-Speech API和Amazon的Polly等，将文本转换为自然流畅的人类语音。然后，音视频合成模块将生成的文本输入视频生成模型，视频生成模型可采用Text to Video (TTV)模型，通过TTV模型将输入的文本转换成虚拟角色的动作视频，得到转换后视频信号。最后，音视频合成模块将转换后语音信号和转换后视频信号组成虚拟角色回答用户提出问题的音视频，具体的，在得到虚拟角色的动作视频和音频之后，将音频和视频输入到Wav2Lip模型，Wav2Lip模型从音频内容中提取唇形信息，并将其应用到视频中的虚拟角色上，以根据音频内容调整视频中虚拟角色的唇形。此时，Wav2Lip模型的输入为音频和视频，输出为音频和根据音频内容调整唇形后的视频，如此，虚拟角色的唇形可以根据说话的内容动态改变，使得虚拟角色的动作更加自然。The audio and video synthesis module synthesizes audio and video animations based on the text (i.e., the conversation content) generated by the large language model. First, the audio and video synthesis module inputs the generated text into the audio generation model. The audio generation model can use the Text to Speech (TTS) model. The TTS model converts the input text into audio output to obtain the converted voice signal. The TTS model uses existing technologies, such as Google's Text-to-Speech API and Amazon's Polly, to convert text into natural and fluent human speech. Then, the audio and video synthesis module inputs the generated text into the video generation model. The video generation model can use the Text to Video (TTV) model. The TTV model converts the input text into the action video of the virtual character to obtain the converted video signal. Finally, the audio and video synthesis module combines the converted voice signal and the converted video signal into the audio and video of the virtual character answering the questions raised by the user. Specifically, after obtaining the action video and audio of the virtual character, the audio and video are input into the Wav2Lip model. The Wav2Lip model extracts lip shape information from the audio content and applies it to the virtual character in the video to adjust the lip shape of the virtual character in the video according to the audio content. At this time, the input of the Wav2Lip model is audio and video, and the output is audio and video with lip shape adjusted according to the audio content. In this way, the lip shape of the virtual character can be dynamically changed according to the content of the speech, making the movements of the virtual character more natural.
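
A condensed sketch of this chain, in which gTTS stands in for the TTS service, generate_motion_video is a hypothetical placeholder for the TTV model, and Wav2Lip is invoked through the inference script of its public repository (flag names should be checked against the deployed version):

# Dialogue text -> speech (TTS) -> character motion video (TTV) -> Wav2Lip lip sync.
import subprocess
from gtts import gTTS          # one concrete TTS option; any TTS service could be used

def generate_motion_video(text):
    """Placeholder for the Text-to-Video model; returns a path to a rendered clip."""
    raise NotImplementedError

def synthesize_answer(dialogue_text):
    # 1) Text -> speech
    gTTS(dialogue_text, lang="en").save("answer.mp3")

    # 2) Text -> character motion video (hypothetical TTV stand-in)
    video_path = generate_motion_video(dialogue_text)

    # 3) Lip-sync the rendered character to the generated audio
    subprocess.run([
        "python", "Wav2Lip/inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
        "--face", video_path,
        "--audio", "answer.mp3",
        "--outfile", "answer_lipsynced.mp4",
    ], check=True)
    return "answer_lipsynced.mp4"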

(3)虚拟角色驱动:(3) Virtual character drive:

本实施例在Unity中创建动画控制器来管理虚拟角色的不同动画状态,实现对虚拟角色动作的控制,动画的过渡条件和触发方式在动画控制器中以状态机的方式定义。虚拟角色驱动模块基于音视频,利用上述动画控制器来驱动虚拟角色动作,以将音视频反馈至用户。通过这种方法,ASD患者可以与虚拟角色进行互动,提高学习效果和兴趣。In this embodiment, an animation controller is created in Unity to manage different animation states of virtual characters and control the actions of virtual characters. The transition conditions and triggering methods of the animation are defined in the animation controller in the form of a state machine. The virtual character driving module is based on audio and video, and uses the above animation controller to drive the actions of virtual characters to feed back audio and video to the user. In this way, ASD patients can interact with virtual characters to improve learning effects and interests.
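
Unity's Animator Controller is configured in the editor rather than written as code; purely as a language-agnostic illustration of the state-machine logic (states, triggers, transitions), a minimal sketch:

# Minimal state-machine illustration of the animation controller logic.
class AnimationStateMachine:
    def __init__(self):
        self.state = "Idle"
        # (current state, trigger) -> next state
        self.transitions = {
            ("Idle", "start_talking"): "Talking",
            ("Talking", "finished_audio"): "Idle",
            ("Idle", "start_walking"): "Walking",
            ("Walking", "stop"): "Idle",
        }

    def fire(self, trigger):
        self.state = self.transitions.get((self.state, trigger), self.state)
        return self.state

sm = AnimationStateMachine()
print(sm.fire("start_talking"))   # Talking (plays the lip-synced answer clip)
print(sm.fire("finished_audio"))  # Idle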

本实施例可以实现面向自闭症群体的语言、行为、情绪识别,实现多模态自然交互,且可以对ASD数据集进行高效训练,提供专业性的虚拟人设计及行为反应,从而帮助提升ASD群体的学习效果和体验。This embodiment can realize language, behavior, and emotion recognition for autistic groups, achieve multimodal natural interaction, and can efficiently train ASD data sets, provide professional virtual human design and behavioral responses, thereby helping to improve the learning effect and experience of the ASD group.

考虑到通过计算机编程的虚拟数字人相比于人类提供了更强的稳定性，不会出现情绪波动，虚拟数字人可以方便的集成于计算机、虚拟现实等技术环境，因此本实施例建立了一种更简单、可预测的交流形式，有助于克服人与人互动中的固有障碍，在自闭症群体使用相关平台时提供学习引导、情感交流等帮助。本实施例所提出的一种数字孪生教师虚拟数字人系统，虚拟数字人（即虚拟角色）成为教师的替身，借助数据模拟教师在现实环境中的行为，辅助学生利用学习平台，提升学习体验，达到使用虚拟角色为ASD人群提供学习支持的目的。Because a computer-programmed virtual digital human offers greater stability than a human, is not subject to mood swings, and can be conveniently integrated into computer, virtual-reality and other technical environments, this embodiment establishes a simpler, more predictable form of communication that helps overcome the inherent barriers in person-to-person interaction and provides learning guidance, emotional exchange and other assistance when people with autism use the related platforms. In the digital-twin teacher virtual digital human system proposed in this embodiment, the virtual digital human (i.e., the virtual character) acts as a stand-in for the teacher, uses data to simulate the teacher's behavior in the real environment, and assists students in using the learning platform, improving the learning experience and achieving the goal of using virtual characters to provide learning support for the ASD population.

本实施例使用虚拟角色为自闭症群体提供学习支持，虚拟角色以真实教师为蓝本，模拟其在现实中的行为，并适当重建，在学习平台中指导自闭症学习者学习，能够提供基于自然交互的陪伴和情感支持，解决面向自闭症群体学习支持的虚拟人系统构建、情感识别以及内容生成，本实施例是一个有针对性的、自动化程度高的智能虚拟人学习支持系统，可集成于多个技术平台，提高平台使用效果。This embodiment uses a virtual character to provide learning support for the autistic group. The virtual character is modeled on a real teacher, simulates the teacher's real-world behavior with appropriate reconstruction, and guides autistic learners on the learning platform, offering companionship and emotional support based on natural interaction. The embodiment addresses virtual human system construction, emotion recognition and content generation for learning support aimed at the autistic group; it is a targeted, highly automated intelligent virtual human learning support system that can be integrated into multiple technology platforms to improve their usage effect.

实施例2Example 2

如图2所示,本实施例中的一种面向学习障碍群体的虚拟数字人方法,基于实施例1所述的一种面向学习障碍群体的虚拟数字人系统进行工作,包括:As shown in FIG. 2 , a virtual digital human method for a group with learning disabilities in this embodiment works based on a virtual digital human system for a group with learning disabilities described in Embodiment 1, and includes:

S1:对用户提出问题时产生的语音信号和视频信号进行识别,得到语音识别结果、面部表情识别结果和行为识别结果;所述用户包括自闭症患者。S1: Recognize the voice signal and video signal generated when the user asks a question to obtain a voice recognition result, a facial expression recognition result and a behavior recognition result; the user includes an autistic patient.

S2:基于所述语音识别结果、所述面部表情识别结果和所述行为识别结果确定用户提出问题时的情感状态。S2: Determine the emotional state of the user when asking the question based on the speech recognition result, the facial expression recognition result and the behavior recognition result.

S3:基于所述语音识别结果和所述情感状态生成符合对话情景和用户情感需求的对话内容,所述对话内容是对用户提出问题进行回答的答案。S3: Generate dialogue content that meets the dialogue scenario and the user's emotional needs based on the speech recognition result and the emotional state, and the dialogue content is an answer to the question raised by the user.

S4:根据所述对话内容生成虚拟角色回答用户提出问题的音视频,并将所述音视频反馈至用户。S4: Generate audio and video of the virtual character answering the question raised by the user according to the conversation content, and feed the audio and video back to the user.
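
As a high-level illustration of how S1 to S4 chain together, the sketch below wires hypothetical stand-ins for the modules of Embodiment 1 into one pipeline function; every callable passed in is an assumption, not an API defined by this disclosure.

# S1-S4 as a single pipeline; each argument is one of the modules from Embodiment 1.
def answer_user(question_audio, question_video, asr, fer, har, emotion_fuser,
                dialogue_generator, av_synthesizer):
    # S1: recognise speech, facial expression and behaviour
    speech_text = asr(question_audio)
    expression = fer(question_video)
    behavior = har(question_video)
    # S2: fuse the three recognition results into an emotional state
    emotion = emotion_fuser(speech_text, expression, behavior)
    # S3: generate dialogue content fitting the scenario and the emotional state
    dialogue = dialogue_generator(speech_text, emotion)
    # S4: synthesise and return the virtual character's audio/video answer
    return av_synthesizer(dialogue)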

以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。The technical features of the above embodiments may be arbitrarily combined. To make the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处。综上所述,本说明书内容不应理解为对本发明的限制。This article uses specific examples to illustrate the principles and implementation methods of the present invention. The above examples are only used to help understand the method and core ideas of the present invention. At the same time, for those skilled in the art, according to the ideas of the present invention, there will be changes in the specific implementation methods and application scope. In summary, the content of this specification should not be understood as limiting the present invention.

Claims (10)

The behavior recognition module is used for taking the video frames as input, outputting a person boundary frame by using a person detector, correcting the person boundary frame by using a Kalman filter to obtain video frames containing the person boundary frame, extracting features of the video frames containing the person boundary frame to obtain key point position features, and forming a position feature time sequence by the key point position features of all the video frames; taking the position characteristic time sequence as input, and outputting action category probability distribution by using an action classification model, wherein the action category probability distribution comprises occurrence probability of each action category; the action category probability distribution is the behavior recognition result.

