CN118521682A

Movatterモバイル変換

Info

Publication number: CN118521682A
Application number: CN202410479688.XA
Authority: CN
Inventors: 范锡睿; 陈毅; 杜宗财; 王志强; 赵亚飞
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2024-04-19
Filing date: 2024-04-19
Publication date: 2024-08-20

Abstract

The disclosure provides an image generation method, a training device and training equipment for a mouth shape driving model, relates to the technical field of artificial intelligence, and particularly relates to the fields of image processing, deep learning and the like. The specific implementation scheme is as follows: acquiring a mouth shape information image and a plurality of character images; wherein the plurality of person images includes a plurality of reference images of the target person; reconstructing a face based on the mouth shape information image to determine geometrical features of the face; determining the face style characteristics of the target person based on the plurality of reference images and the style characteristic extraction network; and fusing the facial geometric features and the facial style features to obtain a mouth shape driving image corresponding to the mouth shape information image.

Description

Translated fromChinese

图像生成方法、口型驱动模型的训练方法、装置和设备Image generation method, lip-driven model training method, device and equipment

技术领域Technical Field

本公开涉及人工智能技术领域，尤其涉及图像处理和深度学习等领域。The present disclosure relates to the field of artificial intelligence technology, and in particular to the fields of image processing and deep learning.

背景技术Background Art

人脸口型驱动是数字人应用中的一项核心技术。相关技术中，口型驱动方案通常需要采集大量说话视频数据训练通用模型，可以在任意人像上输出通用平均口型。如果要进行特定人的个性化的口型驱动，则需要采集这一人物的数据重新训练专用模型。Lip-shaped driving is a core technology in digital human applications. In related technologies, lip-shaped driving solutions usually require the collection of a large amount of speaking video data to train a general model, which can output a general average lip shape on any portrait. If you want to perform personalized lip-shaped driving for a specific person, you need to collect the data of this person and retrain a dedicated model.

发明内容Summary of the invention

本公开提供了一种图像生成方法、口型驱动模型的训练方法、装置和设备。The present invention provides an image generation method, a lip-driven model training method, a device and a device.

根据本公开的一方面，提供了一种图像生成方法，包括：According to one aspect of the present disclosure, there is provided an image generation method, comprising:

获取口型信息图像以及多个人物图像；其中，多个人物图像包括目标人物的多个参考图像；Acquire a lip shape information image and a plurality of character images; wherein the plurality of character images include a plurality of reference images of a target character;

基于口型信息图像进行人脸重建，以确定人脸几何特征；Reconstruct the face based on the lip information image to determine the geometric features of the face;

基于多个参考图像以及风格特征提取网络，确定目标人物的人脸风格特征；Determine the facial style features of the target person based on multiple reference images and a style feature extraction network;

对人脸几何特征以及人脸风格特征进行融合，得到与口型信息图像对应的口型驱动图像。The facial geometric features and facial style features are fused to obtain a lip-shaped driven image corresponding to the lip-shaped information image.

根据本公开的另一方面，提供了一种口型驱动模型的训练方法，包括：According to another aspect of the present disclosure, a method for training a lip-activated model is provided, comprising:

基于N组样本对预设模型进行训练，得到口型驱动模型；其中，N组样本中的每组样本包括对应于同一参考人物的多个参考图像，N组样本中的M组样本对应于不同的参考人物，N为不小于3的整数，M为不小于2的整数，且M小于N；口型驱动模型用于实现本公开实施例中的任一图像生成方法。The preset model is trained based on N groups of samples to obtain a lip-driven model; wherein each group of the N groups of samples includes multiple reference images corresponding to the same reference person, M groups of samples in the N groups of samples correspond to different reference persons, N is an integer not less than 3, M is an integer not less than 2, and M is less than N; the lip-driven model is used to implement any image generation method in the embodiments of the present disclosure.

根据本公开的另一方面，提供了一种图像生成装置，包括：According to another aspect of the present disclosure, there is provided an image generating device, comprising:

输入模块，用于获取口型信息图像以及多个人物图像；其中，多个人物图像包括目标人物的多个参考图像；An input module, used to obtain a lip shape information image and a plurality of character images; wherein the plurality of character images include a plurality of reference images of a target character;

第一特征确定模块，用于基于口型信息图像进行人脸重建，以确定人脸几何特征；A first feature determination module is used to perform face reconstruction based on the lip shape information image to determine the face geometric features;

第二特征确定模块，用于基于多个参考图像以及风格特征提取网络，确定所述目标人物的人脸风格特征；A second feature determination module, used to determine the facial style features of the target person based on multiple reference images and a style feature extraction network;

融合模块，用于对人脸几何特征以及人脸风格特征进行融合，得到与口型信息图像对应的口型驱动图像。The fusion module is used to fuse the facial geometric features and the facial style features to obtain a lip-shaped driven image corresponding to the lip-shaped information image.

根据本公开的另一方面，提供了一种口型驱动模型的训练装置，包括：According to another aspect of the present disclosure, there is provided a training device for a lip-activated model, comprising:

训练模块，用于基于N组样本对预设模型进行训练，得到口型驱动模型；其中，N组样本中的每组样本包括对应于同一参考人物的多个参考图像，N组样本中的M组样本对应于不同的参考人物，N为不小于3的整数，M为不小于2的整数，且M小于N；口型驱动模型应用于本公开实施例中的任一图像生成装置。A training module is used to train a preset model based on N groups of samples to obtain a lip-driven model; wherein each group of samples in the N groups of samples includes multiple reference images corresponding to the same reference person, M groups of samples in the N groups of samples correspond to different reference persons, N is an integer not less than 3, M is an integer not less than 2, and M is less than N; the lip-driven model is applied to any image generation device in the embodiments of the present disclosure.

根据本公开的另一方面，提供了一种电子设备，包括：According to another aspect of the present disclosure, there is provided an electronic device, comprising:

至少一个处理器；以及at least one processor; and

与该至少一个处理器通信连接的存储器；其中，a memory communicatively connected to the at least one processor; wherein,

该存储器存储有可被该至少一个处理器执行的指令，该指令被该至少一个处理器执行，以使该至少一个处理器能够执行本公开实施例中任一的方法。The memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute any method in the embodiments of the present disclosure.

根据本公开的另一方面，提供了一种存储有计算机指令的非瞬时计算机可读存储介质，其中，该计算机指令用于使该计算机执行根据本公开实施例中任一的方法。According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to enable the computer to execute any method according to the embodiments of the present disclosure.

根据本公开的另一方面，提供了一种计算机程序产品，包括计算机程序，该计算机程序在被处理器执行时实现根据本公开实施例中任一的方法。According to another aspect of the present disclosure, a computer program product is provided, including a computer program, which implements any method according to the embodiments of the present disclosure when executed by a processor.

本公开实施例的技术方案中，可以更有针对性地、更精准地提取到口型驱动所需的信息，提高口型驱动图像的准确性。此外，针对特定人物的驱动，也无需采集该人物的大量图像来对实现图像生成方法的口型驱动模型进行专有的训练，只需要采集多个参考图像进行风格特征提取。如此，提升了口型驱动模型的训练效率并降低了训练成本。In the technical solution of the embodiment of the present disclosure, the information required for lip-driven operation can be extracted more specifically and accurately, thereby improving the accuracy of the lip-driven image. In addition, for the driving of a specific character, there is no need to collect a large number of images of the character to conduct exclusive training on the lip-driven model that implements the image generation method. Instead, it is only necessary to collect multiple reference images for style feature extraction. In this way, the training efficiency of the lip-driven model is improved and the training cost is reduced.

应当理解，本部分所描述的内容并非旨在标识本公开的实施例的关键或重要特征，也不用于限制本公开的范围。本公开的其它特征将通过以下的说明书而变得容易理解。It should be understood that the content described in this section is not intended to identify the key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easily understood through the following description.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

附图用于更好地理解本方案，不构成对本公开的限定。其中：The accompanying drawings are used to better understand the present solution and do not constitute a limitation of the present disclosure.

图1是相关技术中数字人的口型驱动方案的示意图；FIG1 is a schematic diagram of a lip-activated digital human in the related art;

图2是根据本公开一实施例的图像生成方法的流程示意图；FIG2 is a schematic diagram of a flow chart of an image generating method according to an embodiment of the present disclosure;

图3是本公开实施例的图像生成方法的应用示例的示意图；FIG3 is a schematic diagram of an application example of the image generation method according to an embodiment of the present disclosure;

图4是根据本公开一实施例的口型驱动模型的训练方法的流程示意图；FIG4 is a flow chart of a method for training a lip-activated model according to an embodiment of the present disclosure;

图5是本公开一实施例提供的图像生成装置的结构示意图；FIG5 is a schematic diagram of the structure of an image generating device provided by an embodiment of the present disclosure;

图6是根据本公开一实施例的口型驱动模型的训练装置的结构示意图；FIG6 is a schematic diagram of the structure of a training device for a lip-activated model according to an embodiment of the present disclosure;

图7是用来实现本公开实施例的方法的电子设备的框图。FIG. 7 is a block diagram of an electronic device for implementing the method of an embodiment of the present disclosure.

具体实施方式DETAILED DESCRIPTION

以下结合附图对本公开的示范性实施例做出说明，其中包括本公开实施例的各种细节以助于理解，应当将它们认为仅仅是示范性的。因此，本领域普通技术人员应当认识到，可以对这里描述的实施例做出各种改变和修改，而不会背离本公开的范围。同样，为了清楚和简明，以下的描述中省略了对公知功能和结构的描述。The following is a description of exemplary embodiments of the present disclosure in conjunction with the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding, which should be considered as merely exemplary. Therefore, it should be recognized by those of ordinary skill in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope of the present disclosure. Similarly, for the sake of clarity and conciseness, the description of well-known functions and structures is omitted in the following description.

为了便于理解本公开实施例提供的方法，以下对本公开实施例的相关技术进行说明，以下相关技术作为可选方案与本公开实施例的技术方案可以进行任意结合，其均属于本公开实施例的保护范围。In order to facilitate understanding of the method provided by the embodiment of the present disclosure, the related technologies of the embodiment of the present disclosure are described below. The following related technologies can be arbitrarily combined with the technical solutions of the embodiment of the present disclosure as optional solutions, and they all belong to the protection scope of the embodiment of the present disclosure.

图1是相关技术中数字人的口型驱动方案的示意图。口型驱动方案的主要任务是生成口型驱动图像17，即利用包含口型信息的源图像11以及目标人物形象，生成一个人脸图像，该人脸图像即为口型驱动图像17；其中，目标人物形象可以采用目标图像12表征。如图1所示，口型驱动方案的实现过程包括：源图像11经过特征提取网络13，得到源图像人脸特征14；目标图像12经过特征提取网络13，得到目标图像人脸特征15；源图像人脸特征14和目标图像人脸特征15一起输入到人脸驱动网络16，同时也输入目标图像12，人脸驱动网络16输出驱动后的人脸图像，即口型驱动图像17。FIG1 is a schematic diagram of a lip-type drive scheme for a digital human in the related art. The main task of the lip-type drive scheme is to generate a lip-type drive image 17, that is, to generate a face image 17 by using a source image 11 containing lip-type information and a target character image, and the face image is the lip-type drive image 17; wherein the target character image can be represented by a target image 12. As shown in FIG1 , the implementation process of the lip-type drive scheme includes: the source image 11 passes through a feature extraction network 13 to obtain a source image face feature 14; the target image 12 passes through a feature extraction network 13 to obtain a target image face feature 15; the source image face feature 14 and the target image face feature 15 are input into a face drive network 16 together, and the target image 12 is also input, and the face drive network 16 outputs a driven face image, that is, a lip-type drive image 17.

实际应用中，源图像可以是说话视频(口型信息视频)中的一帧，针对口型信息视频中的每帧，分别执行上述过程，则可生成与口型信息视频中的每帧对应的口型驱动图像，利用每帧对应的口型驱动图像，可以合成得到口型驱动视频。In practical applications, the source image can be a frame in a speaking video (lip information video). The above process is performed for each frame in the lip information video, and a lip-type driven image corresponding to each frame in the lip information video can be generated. The lip-type driven video can be synthesized using the lip-type driven image corresponding to each frame.

在上述方案中，口型的准确度依赖特征提取网络13提取的特征。而特征提取网络13用于针对源图像11与目标图像12提取通用的人脸特征，容易存在歧义性，会影响口型的准确度。In the above solution, the accuracy of the lip shape depends on the features extracted by the feature extraction network 13. The feature extraction network 13 is used to extract common facial features for the source image 11 and the target image 12, which is prone to ambiguity and will affect the accuracy of the lip shape.

此外，目标人物形象的信息完全来自于目标图像12，而目标图像12中的人物形象信息有限，难以针对特定人物生成具有较好的身份还原度的口型驱动图像17。相关技术中，在利用不同人物的图像数据训练得到包含特征提取网络13以及人脸驱动网络16的通用口型驱动模型后，往往需要再采集大量目标人物形象的数据对该口型驱动模型进行针对性的训练。In addition, the information of the target person image comes entirely from the target image 12, and the person image information in the target image 12 is limited, making it difficult to generate a lip-driven image 17 with good identity restoration for a specific person. In the related art, after using image data of different characters to train a general lip-driven model including a feature extraction network 13 and a face driving network 16, it is often necessary to collect a large amount of data on the target person image to conduct targeted training on the lip-driven model.

图2示出了本公开一实施例提供的图像生成方法。该方法可以用于解决上述技术问题中的至少之一。可选地，该方法可以应用于图像生成装置，该装置可以部署于电子设备中。电子设备例如是单机或多机的终端、服务器或其他处理设备。其中，终端可以为移动设备、个人数字助理(Personal Digital Assistant，PDA)、手持设备、计算设备、车载设备、可穿戴设备等用户设备(User Equipment，UE)。在一些可能的实现方式中，该方法还可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。如图2所示，该方法可以包括：FIG2 shows an image generation method provided by an embodiment of the present disclosure. The method can be used to solve at least one of the above-mentioned technical problems. Optionally, the method can be applied to an image generation device, which can be deployed in an electronic device. The electronic device is, for example, a single-machine or multi-machine terminal, server or other processing device. Among them, the terminal can be a user equipment (User Equipment, UE) such as a mobile device, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc. In some possible implementations, the method can also be implemented by a processor calling a computer-readable instruction stored in a memory. As shown in FIG2, the method may include:

S210、获取口型信息图像以及多个人物图像；其中，多个人物图像包括目标人物的多个参考图像；S210, acquiring a lip shape information image and a plurality of character images; wherein the plurality of character images include a plurality of reference images of a target character;

S220、基于口型信息图像进行人脸重建，以确定人脸几何特征；S220, performing face reconstruction based on the lip shape information image to determine the face geometric features;

S230、基于多个参考图像以及风格特征提取网络，确定目标人物的人脸风格特征；S230, determining facial style features of a target person based on multiple reference images and a style feature extraction network;

S240、对人脸几何特征以及人脸风格特征进行融合，得到与口型信息图像对应的口型驱动图像。S240: Fusing the facial geometric features and the facial style features to obtain a lip-shaped driven image corresponding to the lip-shaped information image.

本公开实施例中，口型信息图像可以是用于指示目标口型的图像，也可以称为源图像。可选地，口型信息图像也可以是人脸图像，该人脸图像中的人脸具有目标口型，上述图像生成方法主要用于生成在目标人物上做出该目标口型的图像。In the disclosed embodiment, the lip shape information image may be an image for indicating the target lip shape, and may also be referred to as a source image. Optionally, the lip shape information image may also be a face image, and the face in the face image has the target lip shape. The above-mentioned image generation method is mainly used to generate an image of the target person making the target lip shape.

可选地，口型信息图像可以是说话视频中的图像帧。具体地，针对说话视频进行拆帧得到多个图像帧，针对每个图像帧，可以执行上述方法生成与每个图像帧对应的口型驱动图像。基于此，可以利用与多个图像帧一一对应的多个口型驱动图像，合成口型驱动的视频。Optionally, the lip information image may be an image frame in a speech video. Specifically, the speech video is deframed to obtain a plurality of image frames, and for each image frame, the above method may be executed to generate a lip-driven image corresponding to each image frame. Based on this, a plurality of lip-driven images corresponding to a plurality of image frames may be used to synthesize a lip-driven video.

本公开实施例中，多个人物图像用于表征目标人物的形象或者说风格，人物图像中也包含人脸信息。可选地，可以在多个人物图像中选取部分作为用于提取人脸风格特征的多个参考图像。In the disclosed embodiment, multiple character images are used to represent the image or style of the target character, and the character images also contain face information. Optionally, some of the multiple character images can be selected as multiple reference images for extracting face style features.

上述方法可以采用口型驱动模型实现，即可以将口型信息图像以及多个人物图像输入口型驱动模型，口型驱动模型基于口型信息图像进行人脸重建，以确定人脸几何特征，并基于多个参考图像以及风格特征提取网络，确定目标人物的人脸风格特征，对人脸几何特征以及人脸风格特征进行融合，得到与口型信息图像对应的口型驱动图像，最后输出该口型驱动图像。可以理解，口型驱动模型中包含风格特征提取网络，风格特征提取网络为可学习的网络，在口型驱动模型的训练过程中得到优化。The above method can be implemented by using a lip-type driven model, that is, a lip-type information image and multiple character images can be input into the lip-type driven model, and the lip-type driven model reconstructs the face based on the lip-type information image to determine the face geometry features, and determines the face style features of the target person based on multiple reference images and a style feature extraction network, and fuses the face geometry features and face style features to obtain a lip-type driven image corresponding to the lip-type information image, and finally outputs the lip-type driven image. It can be understood that the lip-type driven model includes a style feature extraction network, and the style feature extraction network is a learnable network that is optimized during the training process of the lip-type driven model.

本公开实施例中，针对口型信息图像，通过人脸重建来确定人脸几何特征，即通过重建确定人脸参数，来确定人脸上几何形式的特征(点、线等特征)，例如关键点特征或轮廓特征等。而针对多个参考图像，则基于风格特征提取网络来确定所述目标人物的人脸风格特征。也就是说，将人脸特征中的人脸几何特征和人脸风格特征解耦，针对口型驱动所需的口型(可采用几何特征表征)和目标人物形象(可采用人脸风格特征表征)，分别从口型信息图像和参考图像中提取，从而使提取的信息更精准。此外，由于在图像生成过程中利用了多个参考图像进行目标人物的风格特征提取，因此，针对特定人物的驱动，也无需采集该人物的大量图像来对实现图像生成方法的口型驱动模型进行专有的训练，只需要采集少量的参考图像进行风格特征提取。如此，提升了口型驱动模型的训练效率。In the disclosed embodiment, for the lip information image, the face geometric features are determined by face reconstruction, that is, the face parameters are determined by reconstruction to determine the features of the geometric form on the face (features such as points and lines), such as key point features or contour features. For multiple reference images, the face style features of the target person are determined based on the style feature extraction network. In other words, the face geometric features and face style features in the face features are decoupled, and the lip shape (which can be represented by geometric features) and the target person image (which can be represented by face style features) required for lip driving are extracted from the lip information image and the reference image, respectively, so that the extracted information is more accurate. In addition, since multiple reference images are used to extract the style features of the target person in the image generation process, for the driving of a specific person, there is no need to collect a large number of images of the person to perform exclusive training on the lip driving model that implements the image generation method, and only a small number of reference images need to be collected for style feature extraction. In this way, the training efficiency of the lip driving model is improved.

需要说明的是，在上述方法中，步骤S220和步骤S230的执行顺序不限，可以先执行S220再执行S230，也可以先执行S230再执行S220，还可以并行执行S220与S230，本公开实施例不对此进行限制。It should be noted that in the above method, the execution order of step S220 and step S230 is not limited. S220 can be executed first and then S230, or S230 can be executed first and then S220, or S220 and S230 can be executed in parallel. The embodiments of the present disclosure are not limited to this.

在一些实施例中，多个人物图像还包括目标人物的目标图像。上述步骤S240、对人脸几何特征以及人脸风格特征进行融合，得到与口型信息图像对应的口型驱动图像，包括：In some embodiments, the plurality of person images also include a target image of a target person. The above step S240, fusing the face geometric features and the face style features to obtain a lip-shaped driven image corresponding to the lip-shaped information image, includes:

基于人脸驱动网络对目标人物的目标图像、人脸几何特征以及人脸风格特征进行交叉融合，得到与口型信息对应的口型驱动图像。Based on the face-driven network, the target image, facial geometric features and facial style features of the target person are cross-fused to obtain a lip-driven image corresponding to the lip shape information.

其中，人脸驱动网络可以是预先配置或者是预先训练得到的。示例性地，该人脸驱动网络可以是用于进行多模态信息交叉融合的网络，例如Diffusion(扩散)网络。可选地，人脸驱动网络为可学习的网络，即人脸驱动网络可作为口型驱动模型的一部分，在口型驱动模型训练过程中得到优化。The face drive network may be pre-configured or pre-trained. Exemplarily, the face drive network may be a network for cross-fusion of multimodal information, such as a diffusion network. Optionally, the face drive network is a learnable network, that is, the face drive network may be used as part of the lip-driven model and optimized during the lip-driven model training process.

在上述实施例中，目标人物的目标图像为用于表征目标人物的身份信息的图像。通过将目标图像也输入到人脸驱动网络中，使人脸驱动网络对三路信息进行交叉融合，可以提升口型驱动图像的准确性。In the above embodiment, the target image of the target person is an image used to represent the identity information of the target person. By inputting the target image into the face drive network, the face drive network cross-fuses the three-way information, which can improve the accuracy of the lip-driven image.

在一些实施例中，上述步骤S220、基于口型信息图像进行人脸重建，以确定人脸几何特征，包括：In some embodiments, the above step S220, performing face reconstruction based on the lip shape information image to determine the face geometric features, includes:

基于口型信息图像进行人脸三维重建，得到三维人脸参数；Perform three-dimensional face reconstruction based on the lip information image to obtain three-dimensional face parameters;

采用渲染网络基于三维人脸参数进行渲染，得到目标特征图；The rendering network is used to render based on the three-dimensional face parameters to obtain the target feature map;

基于目标特征图，得到人脸几何特征。Based on the target feature map, the facial geometric features are obtained.

在上述实施例中的人脸三维重建，可以指根据输入的人脸图像(本公开实施例中该人脸图像为上述口型信息图像)，通过计算得到三维人脸模型中的一系列参数的取值，这些参数的取值就是三维人脸参数。In the above embodiment, the three-dimensional reconstruction of the face may refer to obtaining the values of a series of parameters in the three-dimensional face model by calculation based on the input face image (the face image in the disclosed embodiment is the above-mentioned lip shape information image), and the values of these parameters are the three-dimensional face parameters.

在上述实施例中，采用渲染网络基于三维人脸参数进行渲染，得到目标特征图，即将三维人脸参数再渲染为二维特征图。可选地，渲染网络为可学习的网络，即渲染网络可作为口型驱动模型的一部分，在口型驱动模型训练过程中得到优化。In the above embodiment, a rendering network is used to render based on three-dimensional face parameters to obtain a target feature map, that is, the three-dimensional face parameters are rendered into a two-dimensional feature map. Optionally, the rendering network is a learnable network, that is, the rendering network can be used as a part of the lip-driven model and optimized during the training process of the lip-driven model.

在上述实施例中，可以将目标特征图作为人脸几何特征，也可以对目标特征图进行其他处理后得到人脸几何特征。In the above embodiment, the target feature map may be used as the face geometric feature, or the face geometric feature may be obtained by performing other processing on the target feature map.

相关技术中基于特征提取网络提取源图像的人脸特征只能基于二维信息进行处理，而二维信息存在歧义性。比较而言，本公开实施例先得到三维人脸参数，在三维域获取口型形状信息，再进行渲染，可以提高口型信息的准确性，从而提高口型驱动图像的准确性。In the related art, facial features extracted from source images based on feature extraction networks can only be processed based on two-dimensional information, which is ambiguous. In comparison, the disclosed embodiment first obtains three-dimensional facial parameters, obtains lip shape information in the three-dimensional domain, and then renders, which can improve the accuracy of lip shape information, thereby improving the accuracy of lip-driven images.

在一些实施例中，目标特征图包括稠密人脸关键点特征图和/或人脸轮廓图。In some embodiments, the target feature map includes a dense facial key point feature map and/or a facial contour map.

其中，稠密人脸关键点特征图可以用于确定稠密人脸关键点。其中，稠密人脸关键点可以是在常规关键点检测算法中的关键点基础上进行稠密化处理得到的，例如可以插入额外的关键点，如额头区域和脸部外围区域，使其能够覆盖整个脸部区域，相比常规关键点能提升几何特征的信息量。The dense face key point feature map can be used to determine the dense face key points. The dense face key points can be obtained by densifying the key points in the conventional key point detection algorithm. For example, additional key points can be inserted, such as the forehead area and the peripheral area of the face, so that it can cover the entire face area, which can increase the amount of information of geometric features compared to conventional key points.

人脸轮廓图可以包括人脸轮廓线条信息，可选地，可以不包含纹理特征。人脸具有特定的纹理特征，纹理是在特征图上表现为灰度或颜色分布的某种规律性特征。在一些实施例中的基于三维人脸参数进行渲染的过程中可以将轮廓特征与纹理特征分离，以更准确提取出几何线条形式的人脸轮廓图，该人脸轮廓图相比常规关键点也能提升几何特征的信息量，并提升人脸几何特征的准确性。The face contour map may include face contour line information, and optionally, may not include texture features. A face has specific texture features, and texture is a regular feature that is represented as grayscale or color distribution on a feature map. In some embodiments, the contour features and texture features may be separated during the rendering process based on three-dimensional face parameters to more accurately extract a face contour map in the form of geometric lines. Compared with conventional key points, the face contour map can also increase the amount of information of geometric features and improve the accuracy of face geometric features.

在上述实施例中，采用稠密人脸关键点特征图和人脸轮廓图中的一种或两种作为目标特征图，相比提取常规关键点信息，可以提高人脸几何特征的准确性。In the above embodiment, one or both of the dense facial key point feature map and the facial contour map are used as the target feature map, which can improve the accuracy of facial geometric features compared to extracting conventional key point information.

在一些实施例中，基于多个参考图像以及风格特征提取网络，确定所述目标人物的人脸风格特征，包括：In some embodiments, determining the facial style features of the target person based on multiple reference images and a style feature extraction network includes:

采用风格特征提取网络对多个参考图像进行特征提取，得到目标人物的人脸风格特征；其中，风格特征提取网络是基于U型网络(U-net)以及注意力(attention)机制构建的。A style feature extraction network is used to extract features from multiple reference images to obtain the facial style features of the target person. The style feature extraction network is built based on a U-net and an attention mechanism.

示例性地，可以在基于卷积的U-net中加入attention机制，形成attention-net，用于进行人脸风格特征提取。具体地，U-net由一个收缩路径和一个扩展路径组成，收缩路径与扩展路径相对应的网络层之间具有跳跃连接结构。收缩路径中不同网络层包含不同细腻程度位置信息的特征图，扩展路径中不同网络层包含不同语义信息丰富程度的上采样特征图，而跳跃连接结构则结合了不同网络层中具有不同深度的语义抽象信息和不同细腻程度的位置信息，实现了包含密集位置信息的特征图与包含丰富语义信息的特征图之间的融合。在U-net中加入attention机制后得到神经网络可以称为attention-net。具体地，attention-net中通过在U-net中增加注意力模块，对各路径或跳跃连接结构上的不同信息赋予不同的权重，得到注意力信息后再与各路径或跳跃连接结构上的信息进行融合。将attention-net用作人脸风格特征提取网络可以加强对人脸风格信息的关注。相比采用常规的卷积神经网络提取特征，上述实施例采用基于U型网络(U-net)以及注意力(attention)机制构建的风格特征提取，可以提高提取特定特征的精准度，从而提高人脸风格特征的准确性。Exemplarily, an attention mechanism can be added to a convolution-based U-net to form an attention-net for facial style feature extraction. Specifically, the U-net consists of a contraction path and an expansion path, and there is a jump connection structure between the network layers corresponding to the contraction path and the expansion path. Different network layers in the contraction path contain feature maps with different levels of position information, and different network layers in the expansion path contain upsampled feature maps with different levels of semantic information richness. The jump connection structure combines semantic abstract information with different depths and position information with different levels of fineness in different network layers, and realizes the fusion between feature maps containing dense position information and feature maps containing rich semantic information. The neural network obtained by adding the attention mechanism to the U-net can be called attention-net. Specifically, in the attention-net, by adding an attention module to the U-net, different weights are assigned to different information on each path or jump connection structure, and the attention information is obtained and then fused with the information on each path or jump connection structure. Using attention-net as a facial style feature extraction network can strengthen the attention to facial style information. Compared with the conventional convolutional neural network feature extraction, the above embodiment uses style feature extraction based on U-net and attention mechanism, which can improve the accuracy of extracting specific features, thereby improving the accuracy of facial style features.

为了更清楚地理解上述技术方案，图3示出了本公开实施例的图像生成方法的应用示例的示意图。如图3所示，在本应用示例中：In order to more clearly understand the above technical solution, FIG3 shows a schematic diagram of an application example of the image generation method of an embodiment of the present disclosure. As shown in FIG3, in this application example:

1、口型信息图像311经过人脸三维重建312得到三维人脸参数313，经过神经网络渲染314得到人脸几何特征315。人脸几何特征315可以是稠密人脸关键点特征图或是不带纹理的人脸轮廓图。1. The lip information image 311 is subjected to 3D face reconstruction 312 to obtain 3D face parameters 313, and is subjected to neural network rendering 314 to obtain face geometric features 315. The face geometric features 315 can be a dense face key point feature map or a face contour map without texture.

2、引入多个参考图像321，参考图像321和目标图像330来自同一人物，即均来自目标人物。参考图像321输入风格特征提取网络322获取人脸风格特征323。2. Introduce multiple reference images 321, where the reference images 321 and the target image 330 are from the same person, that is, both are from the target person. The reference image 321 is input into the style feature extraction network 322 to obtain the facial style feature 323.

3、将人脸风格特征323、人脸几何特征315和目标图像330一起输入到人脸驱动diffusion网络340进行推理，人脸驱动diffusion网络340通过人脸风格特征323和人脸几何特征315控制输出口型驱动图像350，即口型驱动的人脸图像。3. The facial style features 323, facial geometry features 315 and target image 330 are input together into the face-driven diffusion network 340 for inference. The face-driven diffusion network 340 controls the output of the lip-driven image 350, i.e., the lip-driven facial image, through the facial style features 323 and facial geometry features 315.

对于特定人物的驱动，在本应用示例中，仅需要采集3-5张目标人物的人脸图像作为参考图像提取风格特征，即可进行个性化驱动。For driving of a specific person, in this application example, it is only necessary to collect 3-5 facial images of the target person as reference images to extract style features for personalized driving.

图4是根据本公开一实施例的口型驱动模型的训练方法的流程示意图，该方法可以包括上述实施例的方法的一个或多个特征。该方法包括：FIG4 is a flow chart of a method for training a lip-activated model according to an embodiment of the present disclosure, and the method may include one or more features of the method of the above embodiment. The method includes:

步骤S410、基于N组样本对预设模型进行训练，得到口型驱动模型；其中，N组样本中的每组样本包括同一参考人物的多个参考图像，N组样本中的M组样本对应于不同的参考人物。Step S410, training a preset model based on N groups of samples to obtain a lip-driven model; wherein each group of samples in the N groups of samples includes multiple reference images of the same reference person, and M groups of samples in the N groups of samples correspond to different reference persons.

上述方法中，N为不小于3的整数，M为不小于2的整数且M小于N；该口型驱动模型用于实现前述任意实施例提供的图像生成方法。In the above method, N is an integer not less than 3, M is an integer not less than 2 and M is less than N; the lip-driven model is used to implement the image generation method provided by any of the above embodiments.

可以理解，在上述实施例中，构建多组样本，每组样本包括同一参考人物的多个参考图像，且多组样本中部分组别的样本，是对应于不同的参考人物的。也就是说，多组样本中包括同一参考人物的不同组样本，也包括不同参考人物的不同组样本，从而可以利用不同参考人物的样本，提高口型驱动模型的泛化性，使得口型驱动模型可以应用于不同人物上。It can be understood that in the above embodiment, multiple groups of samples are constructed, each group of samples includes multiple reference images of the same reference person, and some groups of samples in the multiple groups of samples correspond to different reference persons. In other words, the multiple groups of samples include different groups of samples of the same reference person, and also include different groups of samples of different reference persons, so that the samples of different reference persons can be used to improve the generalization of the lip-driven model, so that the lip-driven model can be applied to different persons.

可选地，上述预设模型与口型驱动模型具有相同的网络结构。具体地，当预设模型训练至收敛时，将该预设模型作为口型驱动模型。Optionally, the preset model and the lip-driven model have the same network structure. Specifically, when the preset model is trained to convergence, the preset model is used as the lip-driven model.

可选地，每组样本中还可以包括一个口型信息图像以及参考人物的目标图像。在训练过程中，针对每组样本，可以将口型信息图像、参考人物的目标图像和参考人物的多个参考图像输入到预设模型中，预设模型执行上述图像生成方法，输出口型驱动图像；进而可以基于口型驱动图像计算损失，利用损失更新模型参数。Optionally, each group of samples may also include a lip information image and a target image of a reference person. During the training process, for each group of samples, the lip information image, the target image of the reference person, and multiple reference images of the reference person may be input into a preset model, and the preset model executes the above-mentioned image generation method and outputs a lip-driven image; then, the loss may be calculated based on the lip-driven image, and the model parameters may be updated using the loss.

由于上述口型驱动模型应用于前述实施例中的图像生成方法，而前述实施例中的图像生成方法中，将人脸风格特征和人脸几何特征解耦，可以利用多个参考图像提取特定人物的人脸风格特征，因此无需采集大量目标人物的数据对该口型驱动模型进行针对性的训练，即可应用在特定的目标人物上取得较好的口型驱动效果，提升了训练的效率，并降低了训练成本。Since the above-mentioned lip-driven model is applied to the image generation method in the aforementioned embodiment, and in the image generation method in the aforementioned embodiment, the facial style features and facial geometric features are decoupled, multiple reference images can be used to extract the facial style features of a specific person. Therefore, there is no need to collect a large amount of target person data to conduct targeted training on the lip-driven model. It can be applied to the specific target person to achieve better lip-driven effect, thereby improving the training efficiency and reducing the training cost.

在一些实施例中，预设模型包括多个神经网络，例如上述风格特征提取网络、人脸驱动网络和渲染网络。上述步骤S410中，基于N组样本对预设模型进行训练，包括：In some embodiments, the preset model includes multiple neural networks, such as the style feature extraction network, face driving network and rendering network. In the above step S410, the preset model is trained based on N groups of samples, including:

将N组样本中的第i组样本输入预设模型，得到预设模型中的各个神经网络针对第i组样本输出的信息；其中，i为不大于N的正整数；Input the i-th group of samples from the N groups of samples into the preset model to obtain information output by each neural network in the preset model for the i-th group of samples; wherein i is a positive integer not greater than N;

基于各个神经网络针对第i组样本输出的信息，得到各个神经网络分别对应的损失函数；Based on the information output by each neural network for the i-th group of samples, the loss function corresponding to each neural network is obtained;

基于各个神经网络分别对应的损失函数，对预设模型的参数进行更新。Based on the loss functions corresponding to each neural network, the parameters of the preset model are updated.

根据上述实施例，在模型训练过程中，会利用到模型中的各个神经网络的输出信息，分别计算损失函数，以利用各个神经网络对应的损失函数对模型进行更新，如此，可以针对模型中的各个节点进行针对性的优化，提高模型的精度，从而提高口型驱动的准确性。According to the above embodiment, during the model training process, the output information of each neural network in the model will be used to calculate the loss function respectively, so as to update the model using the loss function corresponding to each neural network. In this way, targeted optimization can be performed on each node in the model to improve the accuracy of the model, thereby improving the accuracy of lip-driven operation.

可选地，上述多个神经网络可以包括前述实施例中的风格特征提取网络、渲染网络以及人脸驱动网络中的一个或多个。Optionally, the above-mentioned multiple neural networks may include one or more of the style feature extraction network, the rendering network and the face driving network in the aforementioned embodiments.

在一些实施例中，上述多个神经网络包括风格特征提取网络。该风格特征网络用于基于多个参考图像，输出人脸风格特征。上述口型驱动模型的训练方法中，基于各个神经网络针对第i组样本输出的信息，得到各个神经网络分别对应的损失函数，包括：In some embodiments, the plurality of neural networks include a style feature extraction network. The style feature network is used to output facial style features based on a plurality of reference images. In the training method of the lip-driven model, based on the information output by each neural network for the i-th group of samples, the loss functions corresponding to each neural network are obtained, including:

将与第i组样本对应于同一参考人物的样本作为正例样本，并将与第i组样本对应于不同参考人物的样本作为负例样本；The samples corresponding to the same reference person as the i-th group of samples are regarded as positive samples, and the samples corresponding to different reference persons as the i-th group of samples are regarded as negative samples;

基于风格特征提取网络针对第i组样本输出的人脸风格特征、风格特征提取网络针对正例样本输出的人脸风格特征以及风格特征提取网络针对负例样本输出的人脸风格特征，计算对比学习的损失函数；Calculate the loss function of contrastive learning based on the facial style features output by the style feature extraction network for the i-th group of samples, the facial style features output by the style feature extraction network for the positive samples, and the facial style features output by the style feature extraction network for the negative samples;

基于对比学习的损失函数，得到风格特征提取网络的损失函数。Based on the loss function of contrastive learning, the loss function of the style feature extraction network is obtained.

可以看到，在上述实施例中，风格特征提取网络是通过对比学习获得风格提取能力的。具体地，针对每组样本，将来自同一参考人物的样本作为正例样本，将来自不同参考人物的样本作为负例样本；在构建正负例样本后，基于该组样本与正例样本、负例样本对应的不同输出结果，计算对比学习损失函数。可以理解，在模型更新过程中，要求来自正例样本的对比损失尽量小，来自负例样本的对比损失尽量大，使风格特征提取网络对同一人物输出相近的人脸风格特征，对不同人物输出差距较大的人脸风格特征，从而实现提取特定人物的风格特征。It can be seen that in the above embodiment, the style feature extraction network obtains the style extraction capability through contrastive learning. Specifically, for each group of samples, samples from the same reference person are used as positive samples, and samples from different reference persons are used as negative samples; after constructing positive and negative samples, the contrastive learning loss function is calculated based on the different output results corresponding to the group of samples and the positive samples and negative samples. It can be understood that in the process of model updating, it is required that the contrast loss from the positive samples is as small as possible, and the contrast loss from the negative samples is as large as possible, so that the style feature extraction network outputs similar facial style features for the same person, and outputs facial style features with large differences for different people, thereby realizing the extraction of style features of specific people.

在上述过程中，可以通过计算第i组样本与正例样本之间的余弦距离、第i组样本与反例样本之间的余弦距离，分别计算得到与正例样本的对比学习损失以及与负例样本的对比学习损失，从而基于两者确定第i组样本对应的损失函数。In the above process, by calculating the cosine distance between the i-th group of samples and the positive sample, and the cosine distance between the i-th group of samples and the negative sample, the contrastive learning loss with the positive sample and the contrastive learning loss with the negative sample can be calculated respectively, so as to determine the loss function corresponding to the i-th group of samples based on the two.

根据上述实施例，可以使风格特征提取网络具备准确的风格特征提取能力，从而提高模型对特定人物的处理能力，只需要少量参考图像即可实现高准确度的个性化口型驱动，提升训练效率，也提升数字人的用户交互体验。According to the above embodiments, the style feature extraction network can be equipped with accurate style feature extraction capability, thereby improving the model's processing capability for specific characters. Only a small number of reference images are required to achieve highly accurate personalized lip-sync driving, thereby improving training efficiency and user interaction experience of digital humans.

根据本公开的实施例，本公开还提供了一种图像生成装置，图5示出了本公开一实施例提供的图像生成装置500的结构示意图，如图5所示，图像生成装置500包括：According to an embodiment of the present disclosure, the present disclosure further provides an image generating device. FIG5 shows a schematic diagram of the structure of an image generating device 500 provided by an embodiment of the present disclosure. As shown in FIG5 , the image generating device 500 includes:

输入模块510，用于获取口型信息图像以及多个人物图像；其中，多个人物图像包括目标人物的多个参考图像；An input module 510 is used to obtain a lip shape information image and a plurality of character images; wherein the plurality of character images include a plurality of reference images of a target character;

第一特征确定模块520，用于基于口型信息图像进行人脸重建，以确定人脸几何特征；A first feature determination module 520, configured to perform face reconstruction based on the lip shape information image to determine face geometric features;

第二特征确定模块530，用于基于多个参考图像以及风格特征提取网络，确定所述目标人物的人脸风格特征；A second feature determination module 530, configured to determine the facial style features of the target person based on a plurality of reference images and a style feature extraction network;

融合模块540，用于对人脸几何特征以及人脸风格特征进行融合，得到与口型信息图像对应的口型驱动图像。The fusion module 540 is used to fuse the facial geometric features and the facial style features to obtain a lip-shaped driving image corresponding to the lip-shaped information image.

其中，口型信息图像可以是用于指示目标口型的图像，也可以称为源图像。可选地，口型信息图像也可以是人脸图像，该人脸图像中的人脸具有目标口型，上述图像生成装置主要用于生成在目标人物上做出该目标口型的图像。The lip shape information image may be an image for indicating the target lip shape, and may also be referred to as a source image. Optionally, the lip shape information image may also be a face image, in which the face has the target lip shape, and the image generating device is mainly used to generate an image of the target person making the target lip shape.

可选地，口型信息图像可以是说话视频中的图像帧。具体地，图像生成装置500可以配置为先针对说话视频进行拆帧得到多个图像帧，再针对每个图像帧，采用上述各模块生成与每个图像帧对应的口型驱动图像。基于此，图像生成装置500可以利用与多个图像帧一一对应的多个口型驱动图像，合成口型驱动的视频。Optionally, the lip information image may be an image frame in a speech video. Specifically, the image generation device 500 may be configured to first deframe the speech video to obtain a plurality of image frames, and then for each image frame, use the above modules to generate a lip-driven image corresponding to each image frame. Based on this, the image generation device 500 may synthesize a lip-driven video using a plurality of lip-driven images corresponding to a plurality of image frames.

多个人物图像用于表征目标人物的形象或者说风格，人物图像中也包含人脸信息。可选地，输入模块510可以在多个人物图像中选取部分作为用于提取人脸风格特征的多个参考图像。The multiple character images are used to represent the image or style of the target character, and the character images also contain face information. Optionally, the input module 510 can select some of the multiple character images as multiple reference images for extracting face style features.

上述图像生成装置500中可以部署有口型驱动模型，即可以将口型信息图像以及多个人物图像输入口型驱动模型，口型驱动模型包括上述各模块，用于基于口型信息图像进行人脸重建，以确定人脸几何特征，并基于多个参考图像以及风格特征提取网络，确定目标人物的人脸风格特征，对人脸几何特征以及人脸风格特征进行融合，得到与口型信息图像对应的口型驱动图像，最后输出该口型驱动图像。可以理解，口型驱动模型中包含风格特征提取网络，风格特征提取网络为可学习的网络，在口型驱动模型的训练过程中得到优化。The above-mentioned image generation device 500 may be deployed with a lip-type driven model, that is, the lip-type information image and multiple character images may be input into the lip-type driven model. The lip-type driven model includes the above-mentioned modules, which are used to reconstruct the face based on the lip-type information image to determine the face geometry features, and determine the face style features of the target person based on multiple reference images and the style feature extraction network, fuse the face geometry features and the face style features, obtain the lip-type driven image corresponding to the lip-type information image, and finally output the lip-type driven image. It can be understood that the lip-type driven model includes a style feature extraction network, which is a learnable network and is optimized during the training process of the lip-type driven model.

在一些实施例中，多个人物图像还包括目标人物的目标图像；In some embodiments, the plurality of person images further includes a target image of a target person;

融合模块540用于：The fusion module 540 is used to:

其中，人脸驱动网络可以是预先配置或者是预先训练得到的。示例性地，该人脸驱动网络可以是用于进行多模态信息交叉融合的网络，例如Diffusion网络。可选地，人脸驱动网络为可学习的网络，即人脸驱动网络可作为口型驱动模型的一部分，在口型驱动模型训练过程中得到优化。The face drive network may be pre-configured or pre-trained. Exemplarily, the face drive network may be a network for cross-fusion of multimodal information, such as a Diffusion network. Optionally, the face drive network is a learnable network, that is, the face drive network may be used as part of the lip-driven model and optimized during the lip-driven model training process.

在一些实施例中，第一特征确定模块520用于：In some embodiments, the first feature determination module 520 is used to:

在一些实施例中，第二特征确定模块530用于：In some embodiments, the second feature determination module 530 is used to:

采用风格特征提取网络对多个参考图像进行特征提取，得到目标人物的人脸风格特征；其中，风格特征提取网络是基于U型网络以及注意力机制构建的。A style feature extraction network is used to extract features from multiple reference images to obtain the facial style features of the target person; wherein, the style feature extraction network is constructed based on a U-shaped network and an attention mechanism.

示例性地，可以在基于卷积的U-net中加入attention机制，形成attention-net，用于进行人脸风格特征提取。具体地，U-net由一个收缩路径和一个扩展路径组成，收缩路径与扩展路径相对应的网络层之间具有跳跃连接结构。收缩路径中不同网络层包含不同细腻程度位置信息的特征图，扩展路径中不同网络层包含不同语义信息丰富程度的上采样特征图，而跳跃连接结构则结合了不同网络层中具有不同深度的语义抽象信息和不同细腻程度的位置信息，实现了包含密集位置信息的特征图与包含丰富语义信息的特征图之间的融合。在U-net中加入attention机制后得到神经网络可以称为attention-net。具体地，attention-net中通过在U-net中增加注意力模块，对各路径或跳跃连接结构上的不同信息赋予不同的权重，得到注意力信息后再与各路径或跳跃连接结构上的信息进行融合。将attention-net用作人脸风格特征提取网络可以加强对人脸风格信息的关注。Exemplarily, an attention mechanism can be added to a convolution-based U-net to form an attention-net for facial style feature extraction. Specifically, the U-net consists of a contraction path and an expansion path, and there is a jump connection structure between the network layers corresponding to the contraction path and the expansion path. Different network layers in the contraction path contain feature maps with different levels of position information, and different network layers in the expansion path contain upsampled feature maps with different levels of semantic information richness. The jump connection structure combines semantic abstract information with different depths and position information with different levels of fineness in different network layers, and realizes the fusion between feature maps containing dense position information and feature maps containing rich semantic information. The neural network obtained by adding the attention mechanism to the U-net can be called attention-net. Specifically, in the attention-net, by adding an attention module to the U-net, different weights are assigned to different information on each path or jump connection structure, and the attention information is obtained and then fused with the information on each path or jump connection structure. Using attention-net as a facial style feature extraction network can strengthen the attention to facial style information.

上述图像生成装置500在图像生成过程中利用了多个参考图像进行目标人物的风格特征提取，因此，针对特定人物的驱动，也无需采集该人物的大量图像来对口型驱动模型进行专有的训练，只需要采集少量的参考图像进行风格特征提取。如此，提升了口型驱动模型的训练效率。The image generation device 500 uses multiple reference images to extract the style features of the target person during the image generation process. Therefore, for the driving of a specific person, it is not necessary to collect a large number of images of the person to perform dedicated training on the lip-driven model, but only a small number of reference images need to be collected for style feature extraction. In this way, the training efficiency of the lip-driven model is improved.

图6是根据本公开一实施例的口型驱动模型的训练装置600的结构示意图，口型驱动模型的训练装置600包括：FIG6 is a schematic diagram of a structure of a lip-activated model training device 600 according to an embodiment of the present disclosure. The lip-activated model training device 600 includes:

训练模块610，用于基于N组样本对预设模型进行训练，得到口型驱动模型；其中，N组样本中的每组样本包括对应于同一参考人物的多个参考图像，N组样本中的M组样本对应于不同的参考人物，N为不小于3的整数，M为不小于2的整数，且M小于N；口型驱动模型应用于前述任一实施例提供的图像生成装置。The training module 610 is used to train the preset model based on N groups of samples to obtain a lip-driven model; wherein each group of samples in the N groups of samples includes multiple reference images corresponding to the same reference person, and M groups of samples in the N groups of samples correspond to different reference persons, N is an integer not less than 3, M is an integer not less than 2, and M is less than N; the lip-driven model is applied to the image generation device provided in any of the aforementioned embodiments.

可选地，上述预设模型与口型驱动模型具有相同的网络结构。具体地，训练模块610在预设模型训练至收敛时，将该预设模型作为口型驱动模型。Optionally, the preset model and the lip-driven model have the same network structure. Specifically, when the preset model is trained to convergence, the training module 610 uses the preset model as the lip-driven model.

可选地，每组样本中还可以包括一个口型信息图像以及参考人物的目标图像。训练模块610针对每组样本，可以将口型信息图像、参考人物的目标图像和参考人物的多个参考图像输入到预设模型中，预设模型输出口型驱动图像；进而可以基于口型驱动图像计算损失，利用损失更新模型参数。Optionally, each group of samples may also include a lip information image and a target image of a reference character. For each group of samples, the training module 610 may input the lip information image, the target image of the reference character, and multiple reference images of the reference character into a preset model, and the preset model outputs a lip-driven image; then, the loss may be calculated based on the lip-driven image, and the model parameters may be updated using the loss.

由于上述口型驱动模型应用于前述实施例中的图像生成装置，而前述实施例中的图像生成装置中，将人脸风格特征和人脸几何特征解耦，可以利用多个参考图像提取特定人物的人脸风格特征，因此无需采集大量目标人物的数据对该口型驱动模型进行针对性的训练，即可应用在特定的目标人物上取得较好的口型驱动效果，提升了训练的效率，并降低了训练成本。Since the above-mentioned lip-type driving model is applied to the image generating device in the aforementioned embodiment, and in the image generating device in the aforementioned embodiment, the facial style features and the facial geometric features are decoupled, the facial style features of a specific person can be extracted using multiple reference images. Therefore, there is no need to collect a large amount of target person data to conduct targeted training on the lip-type driving model. It can be applied to the specific target person to achieve better lip-type driving effect, thereby improving the training efficiency and reducing the training cost.

在一些实施例中，预设模型包括多个神经网络；In some embodiments, the preset model includes a plurality of neural networks;

训练模块610用于：The training module 610 is used to:

在一些实施例中，多个神经网络包括风格特征提取网络；In some embodiments, the plurality of neural networks includes a style feature extraction network;

训练模块610用于：The training module 610 is used to:

在上述实施例中，风格特征提取网络是通过对比学习获得风格提取能力的。具体地，针对每组样本，将来自同一参考人物的样本作为正例样本，将来自不同参考人物的样本作为负例样本；在构建正负例样本后，基于该组样本与正例样本、负例样本对应的不同输出结果，计算对比学习损失函数。可以理解，在模型更新过程中，要求来自正例样本的对比损失尽量小，来自负例样本的对比损失尽量大，使风格特征提取网络对同一人物输出相近的人脸风格特征，对不同人物输出差距较大的人脸风格特征，从而实现提取特定人物的风格特征。In the above embodiment, the style feature extraction network obtains the style extraction capability through contrastive learning. Specifically, for each group of samples, samples from the same reference person are used as positive samples, and samples from different reference persons are used as negative samples; after constructing positive and negative samples, the contrastive learning loss function is calculated based on the different output results corresponding to the group of samples and the positive samples and negative samples. It can be understood that in the process of model updating, the contrast loss from the positive samples is required to be as small as possible, and the contrast loss from the negative samples is required to be as large as possible, so that the style feature extraction network outputs similar facial style features for the same person, and outputs facial style features with large differences for different people, thereby realizing the extraction of style features of specific people.

本公开实施例的装置的各模块、子模块的具体功能和示例的描述，可以参见上述方法实施例中对应步骤的相关描述，在此不再赘述。For the description of specific functions and examples of each module and submodule of the device in the embodiment of the present disclosure, reference can be made to the relevant description of the corresponding steps in the above method embodiment, which will not be repeated here.

本公开的技术方案中，所涉及的用户个人信息的获取，存储和应用等，均符合相关法律法规的规定，且不违背公序良俗。In the technical solution disclosed herein, the acquisition, storage and application of user personal information involved are in compliance with the provisions of relevant laws and regulations and do not violate public order and good morals.

根据本公开的实施例，本公开还提供了一种电子设备、一种可读存储介质和一种计算机程序产品。According to an embodiment of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

图7示出了可以用来实施本公开的实施例的示例电子设备700的结构示意图。电子设备700旨在表示各种形式的数字计算机，诸如，膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备700还可以表示各种形式的移动装置，诸如，个人数字助理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例，并且不意在限制本文中描述的和/或者要求的本公开的实现。FIG. 7 shows a block diagram of an example electronic device 700 that can be used to implement an embodiment of the present disclosure. The electronic device 700 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device 700 can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the present disclosure described and/or required herein.

如图7所示，电子设备700包括计算单元701，其可以根据存储在只读存储器(ROM)702中的计算机程序或者从存储单元708加载到随机访问存储器(RAM)703中的计算机程序，来执行各种适当的动作和处理。在RAM 703中，还可存储电子设备700操作所需的各种程序和数据。计算单元701、ROM 702以及RAM 703通过总线704彼此相连。输入/输出(I/O)接口705也连接至总线704。As shown in Figure 7, electronic device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. In RAM 703, various programs and data required for the operation of electronic device 700 can also be stored. Computing unit 701, ROM 702 and RAM 703 are connected to each other via bus 704. Input/output (I/O) interface 705 is also connected to bus 704.

电子设备700中的多个部件连接至I/O接口705，包括：输入单元706，例如键盘、鼠标等；输出单元707，例如各种类型的显示器、扬声器等；存储单元708，例如磁盘、光盘等；以及通信单元709，例如网卡、调制解调器、无线通信收发机等。通信单元709允许电子设备700通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。Multiple components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard, a mouse, etc.; an output unit 707, such as various types of displays, speakers, etc.; a storage unit 708, such as a disk, an optical disk, etc.; and a communication unit 709, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

计算单元701可以是各种具有处理和计算能力的通用和/或专用处理组件。计算单元701的一些示例包括但不限于中央处理单元(CPU)、图形处理单元(GPU)、各种专用的人工智能(AI)计算芯片、各种运行机器学习模型算法的计算单元、数字信号处理器(DSP)、以及任何适当的处理器、控制器、微控制器等。计算单元701执行上文所描述的各个方法和处理，例如图像生成方法或口型驱动模型的训练方法。例如，在一些实施例中，图像生成方法或口型驱动模型的训练方法可被实现为计算机软件程序，其被有形地包含于机器可读介质，例如存储单元708。在一些实施例中，计算机程序的部分或者全部可以经由ROM 702和/或通信单元709而被载入和/或安装到电子设备700上。当计算机程序加载到RAM 703并由计算单元701执行时，可以执行上文描述的图像生成方法或口型驱动模型的训练方法的一个或多个步骤。备选地，在其他实施例中，计算单元701可以通过其他任何适当的方式(例如，借助于固件)而被配置为执行图像生成方法或口型驱动模型的训练方法。The computing unit 701 may be a variety of general and/or special processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (DSPs), and any appropriate processors, controllers, microcontrollers, etc. The computing unit 701 performs the various methods and processes described above, such as an image generation method or a training method for a lip-driven model. For example, in some embodiments, the image generation method or the training method for a lip-driven model may be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as a storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the image generation method or the training method for the lip-driven model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to execute the image generation method or the lip-driven model training method in any other appropriate manner (eg, by means of firmware).

本文中以上描述的系统和技术的各种实施方式可以在数字电子电路系统、集成电路系统、现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、芯片上系统的系统(SOC)、负载可编程逻辑设备(CPLD)、计算机硬件、固件、软件、和/或它们的组合中实现。这些各种实施方式可以包括：实施在一个或者多个计算机程序中，该一个或者多个计算机程序可在包括至少一个可编程处理器的可编程系统上执行和/或解释，该可编程处理器可以是专用或者通用可编程处理器，可以从存储系统、至少一个输入装置、和至少一个输出装置接收数据和指令，并且将数据和指令传输至该存储系统、该至少一个输入装置、和该至少一个输出装置。Various implementations of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include: being implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which can be a special purpose or general purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

用于实施本公开的方法的程序代码可以采用一个或多个编程语言的任何组合来编写。这些程序代码可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器或控制器，使得程序代码当由处理器或控制器执行时使流程图和/或框图中所规定的功能/操作被实施。程序代码可以完全在机器上执行、部分地在机器上执行，作为独立软件包部分地在机器上执行且部分地在远程机器上执行或完全在远程机器或服务器上执行。The program code for implementing the method of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that the program code, when executed by the processor or controller, implements the functions/operations specified in the flow chart and/or block diagram. The program code may be executed entirely on the machine, partially on the machine, partially on the machine and partially on a remote machine as a stand-alone software package, or entirely on a remote machine or server.

在本公开的上下文中，机器可读介质可以是有形的介质，其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备，或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, device, or equipment. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or equipment, or any suitable combination of the foregoing. A more specific example of a machine-readable storage medium may include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

为了提供与用户的交互，可以在计算机上实施此处描述的系统和技术，该计算机具有：用于向用户显示信息的显示装置(例如，CRT(阴极射线管)或者LCD(液晶显示器)监视器)；以及键盘和指向装置(例如，鼠标或者轨迹球)，用户可以通过该键盘和该指向装置来将输入提供给计算机。其它种类的装置还可以用于提供与用户的交互；例如，提供给用户的反馈可以是任何形式的传感反馈(例如，视觉反馈、听觉反馈、或者触觉反馈)；并且可以用任何形式(包括声输入、语音输入、或者触觉输入)来接收来自用户的输入。To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including acoustic input, voice input, or tactile input).

可以将此处描述的系统和技术实施在包括后台部件的计算系统(例如，作为数据服务器)、或者包括中间件部件的计算系统(例如，应用服务器)、或者包括前端部件的计算系统(例如，具有图形用户界面或者网络浏览器的用户计算机，用户可以通过该图形用户界面或者该网络浏览器来与此处描述的系统和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算系统中。可以通过任何形式或者介质的数字数据通信(例如，通信网络)来将系统的部件相互连接。通信网络的示例包括：局域网(LAN)、广域网(WAN)和互联网。The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communications networks include: a local area network (LAN), a wide area network (WAN), and the Internet.

计算机系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。服务器可以是云服务器，也可以为分布式系统的服务器，或者是结合了区块链的服务器。A computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship of client and server is generated by computer programs running on respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

应该理解，可以使用上面所示的各种形式的流程，重新排序、增加或删除步骤。例如，本公开中记载的各步骤可以并行地执行也可以顺序地执行也可以不同的次序执行，只要能够实现本公开公开的技术方案所期望的结果，本文在此不进行限制。It should be understood that the various forms of processes shown above can be used to reorder, add or delete steps. For example, the steps recorded in this disclosure can be executed in parallel, sequentially or in different orders, as long as the desired results of the technical solutions disclosed in this disclosure can be achieved, and this document does not limit this.

上述具体实施方式，并不构成对本公开保护范围的限制。本领域技术人员应该明白的是，根据设计要求和其他因素，可以进行各种修改、组合、子组合和替代。任何在本公开的原则之内所作的修改、等同替换和改进等，均应包含在本公开保护范围之内。The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions can be made according to design requirements and other factors. Any modification, equivalent substitution and improvement made within the principles of the present disclosure shall be included in the protection scope of the present disclosure.