CN115170703A


Info

Publication number: CN115170703A
Application number: CN202210773036.8A
Authority: CN
Inventors: 李�杰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-10-11

Abstract

The present disclosure provides an avatar driving method, apparatus, electronic device and storage medium, relating to artificial intelligence fields such as computer vision, deep learning and augmented reality, and applicable to scenarios such as avatar generation and the metaverse. The method may include: acquiring video data of a target person, where the video data includes first audio data and lip movements corresponding to the first audio data; performing three-dimensional reconstruction on the video data; determining, based on the first audio data and the three-dimensional reconstruction result, an expression-driving mapping relationship between the first audio data and the three-dimensional face corresponding to the target person; and acquiring second audio data and driving a target avatar according to the second audio data and the expression-driving mapping relationship. Applying the solution of the present disclosure can improve the driving effect, among other benefits.

Description

Avatar driving method, apparatus, electronic device and storage medium

Technical Field

The present disclosure relates to the technical field of artificial intelligence, and in particular to an avatar driving method, apparatus, electronic device and storage medium in fields such as computer vision, deep learning and augmented reality.

Background

In practical applications, a virtual world can be created by generating three-dimensional avatars. Making a three-dimensional avatar move in accordance with normal human kinematics is an important factor in creating a rich, dynamic world.

Summary

The present disclosure provides an avatar driving method, apparatus, electronic device and storage medium.

An avatar driving method, comprising:

acquiring video data of a target person, wherein the video data includes first audio data and lip movements corresponding to the first audio data;

performing three-dimensional reconstruction on the video data;

determining, based on the first audio data and the three-dimensional reconstruction result, an expression-driving mapping relationship between the first audio data and the three-dimensional face corresponding to the target person; and

acquiring second audio data, and driving a target avatar according to the second audio data and the expression-driving mapping relationship.

An avatar driving apparatus, comprising: a video acquisition module, a three-dimensional reconstruction module, a relationship determination module, an audio acquisition module and a target driving module;

the video acquisition module is configured to acquire video data of a target person, wherein the video data includes first audio data and lip movements corresponding to the first audio data;

the three-dimensional reconstruction module is configured to perform three-dimensional reconstruction on the video data;

the relationship determination module is configured to determine, based on the first audio data and the three-dimensional reconstruction result, an expression-driving mapping relationship between the first audio data and the three-dimensional face corresponding to the target person;

the audio acquisition module is configured to acquire second audio data;

the target driving module is configured to drive a target avatar according to the second audio data and the expression-driving mapping relationship.

An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described above.

A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method described above.

A computer program product comprising a computer program/instructions which, when executed by a processor, implement the method described above.

It should be understood that what is described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:

FIG. 1 is a flowchart of a first embodiment of the avatar driving method of the present disclosure;

FIG. 2 is a flowchart of a second embodiment of the avatar driving method of the present disclosure;

FIG. 3 is a schematic structural diagram of an embodiment 300 of the avatar driving apparatus of the present disclosure;

FIG. 4 is a schematic block diagram of an electronic device 400 that may be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. They include various details of the embodiments to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.

In addition, it should be understood that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. Furthermore, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

FIG. 1 is a flowchart of the first embodiment of the avatar driving method of the present disclosure. As shown in FIG. 1, it includes the following specific implementation.

In step 101, video data of a target person is acquired, wherein the video data includes first audio data and lip movements corresponding to the first audio data.

In step 102, three-dimensional reconstruction is performed on the video data.

In step 103, based on the first audio data and the three-dimensional reconstruction result, an expression-driving mapping relationship between the first audio data and the three-dimensional face corresponding to the target person is determined.

In step 104, second audio data is acquired, and a target avatar is driven according to the second audio data and the expression-driving mapping relationship.

With the solution of the above method embodiment, an expression-driving mapping relationship from audio data to a three-dimensional face can be obtained from the video data. Accordingly, any target avatar (that is, a three-dimensional avatar) can be driven based on this mapping relationship and audio data, for example driven to speak or sing, so that the presented motion better matches reality, such as real expressions, thereby improving the driving effect.

The video data needs to meet a predetermined standard. For example, meeting the predetermined standard may mean that every frame in the video data includes the same single face, that is, every frame includes the face of the same person; in addition, the person in the video data is speaking, singing, or the like.

Three-dimensional reconstruction may be performed directly on the video data. Alternatively, in one embodiment of the present disclosure, the video data may first be preprocessed to obtain video data with the head centered, and three-dimensional reconstruction may then be performed on the preprocessed video data.

How the preprocessing is performed is not limited, as long as the head ends up centered. For example, face detection and tracking may be performed on each frame of the video data, and each frame may then be cropped based on the result so that the head is centered. Here, detection means detecting the face in an image, tracking means following the same face across frames, and cropping means reducing the image size, for example from 1024*1024 to 512*512, with the head centered in the cropped image.
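
The cropping part of this preprocessing can be illustrated with a minimal sketch. It assumes a face detector or tracker has already produced a bounding box for the frame; the function name `centered_crop_box`, the box layout and the border-clamping behaviour are illustrative assumptions and are not specified by the disclosure.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def centered_crop_box(face_box: Box, image_size: Tuple[int, int], crop_size: int) -> Box:
    """Return a crop_size x crop_size window centred on the face box.

    Assumption: near an image border the window is shifted rather than
    shrunk, so the head is exactly centred only when there is enough margin.
    """
    width, height = image_size
    if crop_size <= 0 or crop_size > width or crop_size > height:
        raise ValueError("crop_size must be positive and fit inside the image")
    left, top, right, bottom = face_box
    center_x = (left + right) // 2
    center_y = (top + bottom) // 2
    # Clamp the top-left corner so the window stays inside the image.
    crop_left = min(max(center_x - crop_size // 2, 0), width - crop_size)
    crop_top = min(max(center_y - crop_size // 2, 0), height - crop_size)
    return (crop_left, crop_top, crop_left + crop_size, crop_top + crop_size)
```

For a 1024*1024 frame and a 512*512 crop, the returned box can be passed to whatever image library is in use to slice the frame; running the same function on the tracked box of every frame yields the head-centered sequence.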

The above processing normalizes the format of the video data and lays a good foundation for subsequent processing, for example by improving the accuracy of subsequent results.

In one embodiment of the present disclosure, a parametric model (FLAME) may be used to perform the three-dimensional reconstruction of the video data. In addition, the reconstruction process may apply at least one of the following: a two-dimensional reprojection error constraint, an expression perception loss constraint.

The two-dimensional reprojection error constraint constrains the error of reprojecting the reconstructed three-dimensional information (such as the three-dimensional face) to two-dimensional information. The expression perception loss constraint, also called the facial expression emotion constraint, constrains the expression perception loss between the texture-rendered image produced during reconstruction and the original image, where the original image is an image in the video data before reconstruction.
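
A two-dimensional reprojection error can be sketched as follows. The disclosure does not give a formula, so this example assumes a simple pinhole camera and a mean squared pixel distance to detected 2D landmarks; the function names and parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def project(point: Vec3, rotation: Sequence[Sequence[float]], translation: Vec3,
            focal: float, principal: Vec2) -> Vec2:
    """Pinhole projection of one 3D point into pixel coordinates (assumed camera model)."""
    # Camera-space coordinates: rotation @ point + translation.
    cam = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
           for r in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (focal * cam[0] / cam[2] + principal[0],
            focal * cam[1] / cam[2] + principal[1])


def reprojection_error(points_3d: List[Vec3], landmarks_2d: List[Vec2],
                       rotation: Sequence[Sequence[float]], translation: Vec3,
                       focal: float, principal: Vec2) -> float:
    """Mean squared pixel distance between projected 3D points and 2D landmarks."""
    if not points_3d or len(points_3d) != len(landmarks_2d):
        raise ValueError("need the same non-zero number of 3D points and 2D landmarks")
    total = 0.0
    for point, (u, v) in zip(points_3d, landmarks_2d):
        pu, pv = project(point, rotation, translation, focal, principal)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(points_3d)
```

In a fitting loop, this value would be one term of the loss minimized over the model and camera parameters; the expression perception loss would be a separate term computed on rendered and original images.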

These constraints improve the accuracy of the reconstruction result and make the expression of the resulting three-dimensional face more natural and realistic.

Performing three-dimensional reconstruction with the FLAME model yields the three-dimensional reconstruction result corresponding to the video data, which may include, for each frame, the face pose fitting result, the per-frame head size (shape), the expression, the texture, the camera intrinsic and extrinsic parameters, and other information.

Afterwards, the expression-driving mapping relationship between the first audio data and the three-dimensional face corresponding to the target person may be determined based on the first audio data corresponding to the video data and the three-dimensional reconstruction result. Specifically, the three-dimensional face corresponding to the target person can be obtained from the three-dimensional reconstruction result, and the expression-driving mapping relationship between the first audio data and the three-dimensional face can then be determined in combination with the first audio data.

In one embodiment of the present disclosure, a spectrum mapping result corresponding to the first audio data may be obtained first, and the expression-driving mapping relationship between the first audio data and the three-dimensional face may then be determined based on the spectrum mapping result and the three-dimensional reconstruction result.

For example, a short-time Fourier transform may be used to obtain the spectrum mapping result corresponding to the first audio data; that is, spectrum mapping discretizes the first audio data and achieves decoupling. The spectrum mapping result may then be used as the network input data, with the face in the video serving as the reconstructed expression label, so that the expression-driving mapping relationship from audio data to the three-dimensional face is learned efficiently and accurately. In other words, the required expression-driving mapping relationship is determined by combining the spectrum mapping result with the information in the three-dimensional reconstruction result.
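
The short-time Fourier transform step can be sketched with the standard library alone. The frame length, hop size, Hann window and the use of magnitudes are assumptions made for illustration; the disclosure does not specify them, and a practical system would use an FFT routine rather than the direct DFT shown here.

```python
import cmath
import math
from typing import List


def stft_magnitude(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Magnitude spectrogram: one row per frame, frame_len // 2 + 1 bins per row."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    # Periodic Hann window reduces leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames
```

Each row of the result is a discrete, frame-level description of the audio that could serve as network input, paired with the expression parameters reconstructed for the video frame at the same time.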

The acquired expression-driving mapping relationship can subsequently be used to drive any target avatar based on predetermined second audio data, for example to drive the target avatar to speak.

That is, second audio data may be acquired, and the target avatar may be driven according to the second audio data and the expression-driving mapping relationship.

The target avatar may be any three-dimensional avatar, that is, any object to be driven.

In one embodiment of the present disclosure, a triangular mesh mapping relationship between the FLAME model and the target avatar may also be acquired.

For example, the triangular mesh mapping relationship may be established by performing dense triangular mesh registration and alignment, warping, and similar operations.

Accordingly, in one embodiment of the present disclosure, the FLAME model may be used to drive the target avatar according to the second audio data, the expression-driving mapping relationship and the triangular mesh mapping relationship.

The FLAME model is relatively mature, so it allows the target avatar to be driven efficiently and accurately, for example driven to speak in accordance with the second audio data. If needed, the FLAME model may also be replaced with another model, depending on actual requirements.

In addition, in practical applications, a deformation transfer (DT) algorithm may be used to drive the target avatar. The DT algorithm is also relatively mature, easy to implement, and gives good results.
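
The idea of mapping one mesh onto another and then carrying motion across can be illustrated with a deliberately simplified sketch. It is not the deformation transfer algorithm itself, which operates on per-triangle deformations, and it is not dense registration: it assumes the two meshes are already aligned in the same coordinate frame, pairs each target vertex with its nearest source vertex, and copies per-vertex offsets. All names are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def nearest_vertex_map(source: List[Vec3], target: List[Vec3]) -> List[int]:
    """For each target vertex, the index of the closest source vertex.

    Assumption: both meshes are already aligned, so Euclidean distance is meaningful.
    """
    if not source:
        raise ValueError("source mesh has no vertices")
    return [min(range(len(source)), key=lambda i: math.dist(source[i], vertex))
            for vertex in target]


def transfer_offsets(source_rest: List[Vec3], source_posed: List[Vec3],
                     target_rest: List[Vec3], mapping: List[int]) -> List[Vec3]:
    """Move each target vertex by the offset of its mapped source vertex."""
    posed = []
    for vertex, src in zip(target_rest, mapping):
        offset = tuple(p - r for p, r in zip(source_posed[src], source_rest[src]))
        posed.append(tuple(v + o for v, o in zip(vertex, offset)))
    return posed
```

Here the source mesh plays the role of the face model in its neutral and driven states, and the target mesh plays the role of the avatar. The mapping is computed once and reused for every driven frame.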

In one embodiment of the present disclosure, the FLAME model may be a reconstructed FLAME model. Through the reconstruction, the FLAME model drives only a predetermined part of the target avatar when performing expression driving, the predetermined part being the part for which driving is effective.

In one embodiment of the present disclosure, reconstructing the FLAME model may include setting predetermined identification information in the FLAME model to a predetermined value. For example, the shape in the identification (ID) information of the FLAME model may be set to 0, and the camera extrinsic parameters may be set to 0. The ID information may include the shape, the expression, the camera intrinsic and extrinsic parameters, and so on, where the camera intrinsic and extrinsic parameters can be understood as the mapping from the three-dimensional world to the two-dimensional world. With these settings, expression driving selects only the driving-effective part of the target avatar, such as the parts related to speech and expression, which improves driving efficiency and the driving effect.
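
Setting the identification information to a predetermined value can be sketched as below. The `FaceParams` container and its field names are hypothetical and are not the FLAME model's actual interface; the sketch only shows the shape and camera extrinsic parameters being zeroed while the expression is left untouched.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical container for per-frame parameters of a parametric face model."""
    shape: Tuple[float, ...]
    expression: Tuple[float, ...]
    camera_extrinsics: Tuple[float, ...]


def neutralize_identity(params: FaceParams) -> FaceParams:
    """Zero the shape and camera extrinsics so only the expression drives the avatar."""
    return replace(
        params,
        shape=(0.0,) * len(params.shape),
        camera_extrinsics=(0.0,) * len(params.camera_extrinsics),
    )
```

Because the dataclass is frozen, the original parameters stay available for reconstruction while the neutralized copy is used for driving.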

Based on the above, FIG. 2 is a flowchart of the second embodiment of the avatar driving method of the present disclosure. As shown in FIG. 2, it includes the following specific implementation.

In step 201, video data of a target person speaking is acquired.

In this embodiment, video data of the target person speaking is used as an example; accordingly, the target avatar is subsequently driven to speak.

In step 202, the video data is preprocessed to obtain video data with the head centered.

In step 203, three-dimensional reconstruction is performed on the preprocessed video data using the FLAME model.

The reconstruction process may apply at least one of the following: a two-dimensional reprojection error constraint, an expression perception loss constraint.

In step 204, based on the first audio data corresponding to the video data and the three-dimensional reconstruction result, an expression-driving mapping relationship between the first audio data and the three-dimensional face corresponding to the target person is determined.

For example, a spectrum mapping result corresponding to the first audio data may be obtained, and the expression-driving mapping relationship may be determined based on the spectrum mapping result and the three-dimensional reconstruction result.

In step 205, a triangular mesh mapping relationship between the FLAME model and the target avatar is acquired.

In step 206, second audio data is acquired.

In step 207, the FLAME model is used to drive the target avatar to speak according to the second audio data, the expression-driving mapping relationship and the triangular mesh mapping relationship.

The FLAME model may be a reconstructed FLAME model; through the reconstruction, the FLAME model drives only the driving-effective part of the target avatar when performing expression driving.

It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions. Those skilled in the art should understand, however, that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in a different order or simultaneously; for example, the order of step 205 and step 206 may be interchanged. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present disclosure. In addition, for parts not described in detail in one embodiment, reference may be made to the related descriptions in other embodiments.

In summary, with the solution of the method embodiments of the present disclosure, the motion presented by the target avatar better matches reality, such as real expressions, which improves the driving effect. The solution is suitable not only for generation and interaction scenarios involving stylized metaverse avatars, but also for avatar generation and interaction scenarios on most current terminals, and it offers considerable advantages in implementation cost, robustness, adaptability and transferability.

The above describes the method embodiments. The solution of the present disclosure is further described below through apparatus embodiments.

FIG. 3 is a schematic structural diagram of an embodiment 300 of the avatar driving apparatus of the present disclosure. As shown in FIG. 3, the apparatus includes: a video acquisition module 301, a three-dimensional reconstruction module 302, a relationship determination module 303, an audio acquisition module 304 and a target driving module 305.

The video acquisition module 301 is configured to acquire video data of a target person, wherein the video data includes first audio data and lip movements corresponding to the first audio data.

The three-dimensional reconstruction module 302 is configured to perform three-dimensional reconstruction on the video data.

The relationship determination module 303 is configured to determine, based on the first audio data and the three-dimensional reconstruction result, an expression-driving mapping relationship between the first audio data and the three-dimensional face corresponding to the target person.

The audio acquisition module 304 is configured to acquire second audio data.

The target driving module 305 is configured to drive a target avatar according to the second audio data and the expression-driving mapping relationship.
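
The way the five modules hand data to one another can be sketched as a small composition. The class name, the callable signatures and the assumption that the first audio data travels inside the video object are all illustrative; the disclosure describes the modules only functionally.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AvatarDrivingApparatus:
    """Hypothetical wiring of modules 301 to 305 as plain callables."""
    acquire_video: Callable[[], Any]              # module 301
    reconstruct_3d: Callable[[Any], Any]          # module 302
    determine_mapping: Callable[[Any, Any], Any]  # module 303
    acquire_audio: Callable[[], Any]              # module 304
    drive_avatar: Callable[[Any, Any], Any]       # module 305

    def run(self) -> Any:
        video = self.acquire_video()
        reconstruction = self.reconstruct_3d(video)
        # Assumption: the first audio data is carried inside the video object.
        mapping = self.determine_mapping(video, reconstruction)
        second_audio = self.acquire_audio()
        return self.drive_avatar(second_audio, mapping)
```

Keeping each module behind a callable makes it straightforward to swap one stage, for example a different reconstruction model, without touching the others.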

With the solution of the above apparatus embodiment, an expression-driving mapping relationship from audio data to a three-dimensional face can be obtained from the video data. Accordingly, any target avatar can be driven based on this mapping relationship and audio data, for example driven to speak, so that the presented motion better matches reality, such as real expressions, thereby improving the driving effect.

Three-dimensional reconstruction may be performed directly on the video data. Alternatively, in one embodiment of the present disclosure, the video acquisition module 301 may first preprocess the video data to obtain video data with the head centered, and the three-dimensional reconstruction module 302 may then perform three-dimensional reconstruction on the preprocessed video data. In one embodiment of the present disclosure, the video acquisition module 301 may perform detection and tracking on each frame of the video data and, based on the result, crop each frame so that the head is centered.

In one embodiment of the present disclosure, the three-dimensional reconstruction module 302 may use the FLAME model to perform the three-dimensional reconstruction of the video data. In addition, the reconstruction process may apply at least one of the following: a two-dimensional reprojection error constraint, an expression perception loss constraint.

Afterwards, the relationship determination module 303 may determine the expression-driving mapping relationship from audio data to the three-dimensional face based on the first audio data corresponding to the video data and the three-dimensional reconstruction result.

In one embodiment of the present disclosure, the relationship determination module 303 may first obtain the spectrum mapping result corresponding to the first audio data, and then determine the expression-driving mapping relationship from audio data to the three-dimensional face based on the spectrum mapping result and the three-dimensional reconstruction result.

Once the expression-driving mapping relationship has been obtained, it can be used to drive the target avatar. Accordingly, the audio acquisition module 304 may acquire second audio data, and the target driving module 305 may drive the target avatar according to the second audio data and the expression-driving mapping relationship.

In one embodiment of the present disclosure, the target driving module 305 may also acquire a triangular mesh mapping relationship between the FLAME model and the target avatar. For example, the triangular mesh mapping relationship may be established by performing dense triangular mesh registration and alignment, warping, and similar operations.

Accordingly, in one embodiment of the present disclosure, the target driving module 305 may use the FLAME model to drive the target avatar according to the second audio data, the expression-driving mapping relationship and the triangular mesh mapping relationship.

In addition, in practical applications, the DT algorithm may be used to drive the target avatar. The DT algorithm is relatively mature, easy to implement, and gives good results.

In one embodiment of the present disclosure, the FLAME model may be a reconstructed FLAME model. Through the reconstruction, the FLAME model drives only a predetermined part of the target avatar when performing expression driving; for example, the FLAME model may be reconstructed by setting predetermined identification information in the FLAME model to a predetermined value.

For the specific workflow of the apparatus embodiment shown in FIG. 3, reference may be made to the related descriptions in the foregoing method embodiments, which are not repeated here.

In summary, with the solution of the apparatus embodiment of the present disclosure, the motion presented by the target avatar better matches reality, such as real expressions, which improves the driving effect. The solution is suitable not only for generation and interaction scenarios involving stylized metaverse avatars, but also for avatar generation and interaction scenarios on most current terminals, and it offers considerable advantages in implementation cost, robustness, adaptability and transferability.

本公开所述方案可应用于人工智能领域，特别涉及计算机视觉、深度学习以及增强现实等领域。人工智能是研究使计算机来模拟人的某些思维过程和智能行为(如学习、推理、思考、规划等)的学科，既有硬件层面的技术也有软件层面的技术，人工智能硬件技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理等技术，人工智能软件技术主要包括计算机视觉技术、语音识别技术、自然语言处理技术以及机器学习/深度学习、大数据处理技术、知识图谱技术等几大方向。The solutions described in the present disclosure can be applied in the field of artificial intelligence, especially in the fields of computer vision, deep learning, and augmented reality. Artificial intelligence is the study of making computers to simulate certain thinking processes and intelligent behaviors of people (such as learning, reasoning, thinking, planning, etc.). There are both hardware-level technologies and software-level technologies. AI hardware technologies generally include Sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing and other technologies, artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology and machine learning/deep learning, big data processing technology, Knowledge graph technology and other major directions.

本公开所述实施例中的视频数据和音频数据并不是针对某一特定用户的，并不能反映出某一特定用户的个人信息，另外，本公开所述方法的执行主体可以通过各种公开、合法合规的方式获取所述视频数据和音频数据。The video data and audio data in the embodiments described in the present disclosure are not aimed at a specific user, and cannot reflect the personal information of a specific user. In addition, the execution subject of the method described in the present disclosure can be implemented through various disclosures, Obtain the video data and audio data in a legal and compliant manner.

本公开的技术方案中，所涉及的用户个人信息的收集、存储、使用、加工、传输、提供和公开等处理，均符合相关法律法规的规定，且不违背公序良俗。In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information involved are all in compliance with relevant laws and regulations, and do not violate public order and good customs.

According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.

FIG. 4 shows a schematic block diagram of an electronic device 400 that may be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.

As shown in FIG. 4, the device 400 includes a computing unit 401, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 402 or a computer program loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store various programs and data required for the operation of the device 400. The computing unit 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

A plurality of components in the device 400 are connected to the I/O interface 405, including: an input unit 406, such as a keyboard or a mouse; an output unit 407, such as various types of displays and speakers; a storage unit 408, such as a magnetic disk or an optical disc; and a communication unit 409, such as a network card, a modem, or a wireless communication transceiver. The communication unit 409 allows the device 400 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The computing unit 401 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 401 performs the methods and processing described above, such as the method described in the present disclosure. For example, in some embodiments, the method described in the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the method described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the method described in the present disclosure in any other suitable manner (for example, by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and may transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the method of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solutions disclosed herein can be achieved; no limitation is imposed herein.
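As a minimal illustration of this point, the Python sketch below runs two mutually independent steps sequentially, in the opposite order, and in parallel, and checks that all three schedules yield the same result. It is not the method of the present disclosure: the functions `extract_audio_features` and `reconstruct_frames` and their toy arithmetic are hypothetical placeholders chosen only so that the example is runnable.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_audio_features(samples, window=4):
    # Hypothetical placeholder step: mean absolute amplitude per window.
    feats = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        feats.append(sum(abs(s) for s in chunk) / len(chunk))
    return feats


def reconstruct_frames(frames):
    # Hypothetical placeholder step: a per-frame sum stands in for a
    # per-frame reconstruction result.
    return [sum(frame) for frame in frames]


def run_sequential(samples, frames):
    # Audio step first, then the frame step.
    feats = extract_audio_features(samples)
    recon = reconstruct_frames(frames)
    return feats, recon


def run_reordered(samples, frames):
    # Same two steps in the opposite order.
    recon = reconstruct_frames(frames)
    feats = extract_audio_features(samples)
    return feats, recon


def run_parallel(samples, frames):
    # Both steps submitted to a thread pool and joined afterwards.
    with ThreadPoolExecutor(max_workers=2) as pool:
        feats_future = pool.submit(extract_audio_features, samples)
        recon_future = pool.submit(reconstruct_frames, frames)
        return feats_future.result(), recon_future.result()


if __name__ == "__main__":
    samples = [1, -2, 3, -4, 5, -6]
    frames = [[1, 2], [3, 4]]
    results = [
        run_sequential(samples, frames),
        run_reordered(samples, frames),
        run_parallel(samples, frames),
    ]
    # All three schedules produce the same output.
    assert results[0] == results[1] == results[2]
    print(results[0])
```

The equivalence holds here only because neither placeholder step reads the other's output; steps with a data dependency would still need to run in dependency order.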

The specific embodiments described above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.