Technical Field
The present application relates to the field of data processing technologies, and in particular, to a training method and apparatus for an image generation model, an electronic device, and a storage medium.
Background
In the related art, when generating planar images of a virtual object performing different actions, a motion transfer model is usually trained so that, based on a reference image of the virtual object, an image of the virtual object in a target pose can be generated.
At present, when a motion transfer model is used to generate an image corresponding to a target pose, the target image is usually generated based on a reference image and a two-dimensional skeleton map indicating the target pose, where the two-dimensional skeleton map only marks the positions of key points on the head and limbs.
However, when generating images with existing motion transfer models, the target pose can only be indicated by two-dimensional key-point coordinates, which makes it difficult to effectively distinguish between similar actions and reduces the accuracy of target image generation. In addition, limbs in different regions can only be considered at the same image scale, so the sharpness of different limbs in the generated target image varies. Moreover, existing motion transfer models treat only the face region in detail, so the poses of the limb extremities in the target pose are restored inaccurately, making it difficult to guarantee the generation quality of the target image.
Summary
Embodiments of the present application provide a training method and apparatus for an image generation model, an electronic device, and a storage medium, which are used to improve the accuracy of generating a target image corresponding to a target pose and to guarantee the generation quality of the target image.
In a first aspect, a training method for an image generation model is provided, including:
obtaining a training sample set, where one training sample includes: a sample reference map containing a target object, a sample skeleton map and a sample depth map indicating the positions of key points of the target object in a target pose, and a sample standard map of the target pose, the sample skeleton map at least including the limb-extremity skeletons;
using the training sample set to perform multiple rounds of iterative training on a pre-trained image generation model, and outputting a trained target image generation model, where in one round of iteration the following operations are performed:
performing, based on the sample skeleton map and the sample depth map contained in a selected training sample, motion transfer processing on the target object in the contained sample reference map according to the corresponding target pose, to obtain a predicted standard map;
adjusting the model parameters of the image generation model based on a multi-scale global comprehensive difference loss between the predicted standard map and the sample standard map, in combination with a local difference loss, within specified image regions, between the predicted standard map and the sample standard map.
In a second aspect, a training apparatus for an image generation model is provided, including:
an obtaining unit, configured to obtain a training sample set, where one training sample includes: a sample reference map containing a target object, a sample skeleton map and a sample depth map indicating the positions of key points of the target object in a target pose, and a sample standard map of the target pose, the sample skeleton map at least including the limb-extremity skeletons;
a training unit, configured to use the training sample set to perform multiple rounds of iterative training on a pre-trained image generation model and output a trained target image generation model, where in one round of iteration the following operations are performed:
performing, based on the sample skeleton map and the sample depth map contained in a selected training sample, motion transfer processing on the target object in the contained sample reference map according to the corresponding target pose, to obtain a predicted standard map;
adjusting the model parameters of the image generation model based on a multi-scale global comprehensive difference loss between the predicted standard map and the sample standard map, in combination with a local difference loss, within specified image regions, between the predicted standard map and the sample standard map.
Optionally, the image generation model includes: a first encoding network configured with a convolutional attention layer, a second encoding network configured with a convolutional attention layer and an image fusion layer, and a multi-scale decoding network configured with a convolutional attention layer;
then, when performing motion transfer processing on the target object in the contained sample reference map according to the corresponding target pose, based on the sample skeleton map and sample depth map contained in the selected training sample, to obtain the predicted standard map, the training unit is configured to:
input the sample reference map contained in the selected training sample into the first encoding network to obtain encoded reference-image features;
concatenate the sample skeleton map and the sample depth map contained in the training sample along the channel dimension, and input the result into the second encoding network to obtain encoded and fused skeletal-action features;
use the multi-scale decoding network to decode the reference-image features based on the skeletal-action features, obtaining a predicted standard map after motion transfer is completed.
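As an illustration of the data flow just described, the following minimal PyTorch sketch encodes the reference map and the channel-concatenated skeleton and depth maps separately, then decodes; the three sub-network arguments are hypothetical stand-ins supplied by the caller, not networks defined by the present application:

```python
import torch
import torch.nn as nn

class ImageGenerationSketch(nn.Module):
    """Minimal sketch of the two-branch forward pass; only the data flow
    follows the text, the three sub-networks are assumed to be supplied."""
    def __init__(self, ref_encoder: nn.Module, pose_encoder: nn.Module,
                 decoder: nn.Module):
        super().__init__()
        self.ref_encoder = ref_encoder    # first encoding network (conv. attention)
        self.pose_encoder = pose_encoder  # second encoding network (conv. attention + fusion)
        self.decoder = decoder            # multi-scale decoding network

    def forward(self, reference, skeleton, depth):
        ref_feat = self.ref_encoder(reference)         # encoded reference-image features
        pose_in = torch.cat([skeleton, depth], dim=1)  # concatenate along the channel dimension
        pose_feat = self.pose_encoder(pose_in)         # encoded, fused skeletal-action features
        return self.decoder(ref_feat, pose_feat)       # predicted standard map
```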
Optionally, the training sample set is generated in the following manner:
obtaining sample standard maps and three-dimensional coordinate sets of the target object in different poses, where one three-dimensional coordinate set includes: the three-dimensional coordinates corresponding to each key-point position in one pose;
processing each three-dimensional coordinate set with a preset two-dimensional reprojection technique, to obtain a sample skeleton map generated from the pixel coordinates of each key-point position in the image coordinate system, and a sample depth map generated from the pixel depth values corresponding to the key-point positions;
generating the training sample set based on the sample standard maps, sample skeleton maps, and sample depth maps corresponding to the different poses.
Optionally, when obtaining the sample skeleton map generated from the two-dimensional coordinates of each key-point position in the image coordinate system, the obtaining unit is configured to:
obtain the pixel coordinates resulting from projecting each key-point position in the three-dimensional coordinate set onto the image coordinate system;
connect the pixels corresponding to the respective pixel coordinates to restore the bone distribution in the corresponding pose, obtaining a sample skeleton map of the same size as the corresponding sample standard map.
Optionally, when obtaining the sample depth map generated from the pixel depth values of the key-point positions, the obtaining unit is configured to:
obtain the pixel coordinates and pixel depth values corresponding to the key-point positions after projecting each key-point position in the three-dimensional coordinate set onto the image coordinate system;
construct an initial depth map matching the image coordinate system, and adjust the pixel value of each pixel in the initial depth map based on the pixel depth values, in combination with the pixel-value differences determined for the pixel ranges to which the respective pixel coordinates belong, to obtain the sample depth map.
Optionally, when the image generation model is trained as the generator in a generator-discriminator structure, after obtaining the predicted standard map, the training unit is further configured to:
obtain a corresponding adversarial loss based on the predicted standard map and the corresponding sample standard map, using a preset generative adversarial loss function;
adjust the model parameters of the image generation model based on the adversarial loss and the global comprehensive difference loss between the predicted standard map and the sample standard map, in combination with the local difference loss, within the specified image regions, between the predicted standard map and the sample standard map.
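The text leaves the adversarial loss function unspecified; one common instantiation, sketched here under that assumption, is the non-saturating GAN loss with a binary discriminator:

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, predicted, real):
    """Non-saturating GAN losses; one possible form of the 'preset
    generative adversarial loss function', not the application's own."""
    d_real = discriminator(real)                # discriminator on sample standard maps
    d_fake = discriminator(predicted.detach())  # discriminator on predicted standard maps
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    g_fake = discriminator(predicted)           # generator pushes fakes toward 'real'
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss
```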
Optionally, the local difference loss is determined in the following manner:
determining, in the predicted standard map and the sample standard map respectively, the target key-point positions used for locating sub-image regions, and cropping, from the predicted standard map and the sample standard map respectively, specified image regions containing multiple sub-image regions based on the determined target key-point positions;
obtaining the corresponding local difference loss based on the pixel-value differences and image-feature differences within each sub-image region.
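A minimal sketch of this local term, assuming PyTorch tensors in NCHW layout, axis-aligned crop boxes derived from the target key points, and an L1 metric for both the pixel and feature comparisons (none of which the text fixes):

```python
def local_difference_loss(pred, target, boxes, feature_extractor):
    """Crop key-point-located sub-regions from both maps and compare pixel
    values and deep features inside each crop. `boxes` holds one
    (top, left, height, width) tuple per sub-image region; the box format
    and the feature extractor are assumptions of this sketch."""
    loss = 0.0
    for top, left, h, w in boxes:
        p = pred[..., top:top + h, left:left + w]    # crop from predicted standard map
        t = target[..., top:top + h, left:left + w]  # matching crop from sample standard map
        loss = loss + (p - t).abs().mean()           # pixel-value difference
        loss = loss + (feature_extractor(p)
                       - feature_extractor(t)).abs().mean()  # image-feature difference
    return loss
```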
Optionally, the global comprehensive difference loss is determined in the following manner:
obtaining a global pixel-value loss based on the pixel-value differences of all pixels between the predicted standard map and the sample standard map, and obtaining a multi-scale feature loss based on the image-feature differences between the predicted standard map and the sample standard map at multiple preset scales;
combining the global pixel-value loss and the multi-scale feature loss to obtain the corresponding global comprehensive difference loss.
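A corresponding sketch of the global term, again under assumed choices (L1 metrics, bilinear downscaling, three preset scales, a plain sum as the combination) that the text does not fix:

```python
import torch.nn.functional as F

def global_comprehensive_loss(pred, target, feature_extractor,
                              scales=(1.0, 0.5, 0.25)):
    """Image-wide pixel loss plus feature differences at several preset
    scales, combined into one global comprehensive difference loss."""
    pixel_loss = (pred - target).abs().mean()  # global pixel-value loss
    feat_loss = 0.0
    for s in scales:
        if s != 1.0:
            p = F.interpolate(pred, scale_factor=s, mode='bilinear', align_corners=False)
            t = F.interpolate(target, scale_factor=s, mode='bilinear', align_corners=False)
        else:
            p, t = pred, target
        feat_loss = feat_loss + (feature_extractor(p)
                                 - feature_extractor(t)).abs().mean()
    return pixel_loss + feat_loss  # combined as a plain sum here
```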
Optionally, the training unit completes the pre-training of the image generation model in the following manner:
obtaining a specified dataset, and performing monocular depth estimation on each sample skeleton map in the dataset to obtain the sample depth map corresponding to each sample skeleton map, where the dataset includes sample standard maps and sample skeleton maps of various sample objects in different poses;
constructing a pre-training sample set based on the sample standard maps, sample skeleton maps, and sample depth maps obtained from the dataset, performing multiple rounds of iterative training on an initial image generation model based on the pre-training sample set, and outputting the pre-trained image generation model.
Optionally, the training unit determines the learning rate used in each round of iteration on the pre-trained image generation model in any one of the following ways:
using a preset cosine annealing algorithm to determine, based on a preset initial learning rate, the learning-rate value corresponding to each training epoch, and determining the target learning rate for the current iteration according to the training epoch to which the current iteration belongs, where one training epoch includes at least one round of iteration;
determining, based on a preset initial learning rate and a learning-rate decay coefficient, the learning-rate value corresponding to each training epoch, and determining the target learning rate for the current iteration according to the training epoch to which the current iteration belongs, where one training epoch includes at least one round of iteration.
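Both per-epoch schedules can be sketched in a few lines; the minimum learning rate and the decay coefficient value below are assumptions, not values given by the text:

```python
import math

def cosine_annealing_lr(initial_lr, epoch, total_epochs, min_lr=0.0):
    """First option: cosine annealing from initial_lr down to min_lr."""
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))

def decayed_lr(initial_lr, epoch, decay=0.95):
    """Second option: multiply by a decay coefficient each training epoch."""
    return initial_lr * decay ** epoch
```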
Optionally, the apparatus further includes a generation unit, configured to:
obtain a reference image of the target object performing a reference action, as well as a planar skeleton map and a planar depth map of the target object in a specified pose, where the planar skeleton map includes the hand bones;
use the target image generation model to perform motion transfer processing on the reference image based on the planar skeleton map and the planar depth map, obtaining a target image of the target object in the specified pose.
In a third aspect, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above method when executing the program.
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the above method.
In a fifth aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the above method.
The beneficial effects of the present application are as follows:
In the embodiments of the present application, a training method and apparatus for an image generation model, an electronic device, and a storage medium are provided. By constructing training samples that include a sample skeleton map and a sample depth map, depth values at different key-point positions can be introduced during training of the image generation model. This not only provides more reference information for image generation, but also effectively distinguishes similar poses and handles self-occlusion in different regions; in the process of training the model to perform motion transfer on the sample reference map according to the sample skeleton map and sample depth map to obtain the predicted standard map, the accuracy of image generation can also be improved.
Moreover, by considering the limb-extremity skeletons in the sample skeleton map, detailed handling of limb-extremity poses can be learned during model training, so that extremity movements can be effectively restored in the generated predicted standard map, guaranteeing the image generation quality.
In addition, by considering the multi-scale global comprehensive difference loss together with the local difference loss during model training, image presentation differences in different regions can be evaluated at different image scales, guaranteeing the sharpness of the presented image. This also improves the training effect of the model to a certain extent and better guides the model to learn motion transfer, providing a guarantee for training a target image generation model that restores action details and presents poses accurately, and improving the accuracy of images subsequently generated by the target image generation model.
Brief Description of the Drawings
Figure 1 is a schematic diagram of a possible application scenario in an embodiment of the present application;
Figure 2A is a schematic diagram of the training process of the image generation model in an embodiment of the present application;
Figure 2B is a schematic diagram of the process of generating the training sample set in an embodiment of the present application;
Figure 2C is a schematic diagram describing the action details of the target object in one pose in an embodiment of the present application;
Figure 2D is a schematic diagram of the process of generating a sample skeleton map in an embodiment of the present application;
Figure 2E is a schematic diagram of a sample depth map generated in an embodiment of the present application;
Figure 2F is a schematic diagram of the initially constructed image generation model in an embodiment of the present application;
Figure 2G is a schematic diagram of one round of model training in an embodiment of the present application;
Figure 3A is a schematic diagram of the processing in the training phase and the application phase of the target image generation model in an embodiment of the present application;
Figure 3B is a schematic diagram of a single round of iterative training in an embodiment of the present application;
Figure 3C is a schematic diagram of the overall structure of the trained target image generation model in an embodiment of the present application;
Figure 4 is a schematic diagram of the logical structure of the training apparatus for the image generation model in an embodiment of the present application;
Figure 5 is a schematic diagram of a hardware composition of an electronic device according to an embodiment of the present application;
Figure 6 is a schematic structural diagram of a computing device in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the technical solution of the present application. Based on the embodiments described in this application document, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the technical solution of the present application.
The terms "first", "second", and the like in the description and claims of the present application and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described here can be implemented in orders other than those illustrated or described here.
Some terms used in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
Virtual object: a virtual character created in a virtual space after three-dimensional modeling. In possible embodiments of the present application, the target object refers to a virtual object.
Three-dimensional coordinates: in the embodiments of the present application, when the target object is a virtual object, the three-dimensional coordinates refer to the XYZ coordinates in a world coordinate system established in the created virtual space; when the target object is a physical object, the three-dimensional coordinates refer to the XYZ coordinates in the real-world coordinate system.
Image coordinate system: a coordinate system established in a planar image, denoted the UV coordinate system, where the U value represents the pixel coordinate of a pixel along the horizontal axis of the planar image, and the V value represents the pixel coordinate of that pixel along the vertical axis of the planar image.
UVZ coordinates: after the image coordinate system is established in a planar image, the UV values represent the pixel coordinates in the horizontal and vertical directions on the image, and the Z value represents the depth distance of a pixel on the image plane relative to the origin of the camera coordinate system.
Pixel coordinates: the coordinates of a pixel in the image coordinate system.
Key-point positions: the positions of the key points selected to describe the bone distribution in different poses. For the selected key points, it is common practice to use eye key points, nose key points, shoulder joints, elbow joints, wrist joints, hip joints, knee joints, and ankle joints to locate different poses. In the embodiments of the present application, to describe local details in different poses, the finger joints are inventively introduced, making it possible to describe the posture of the extremity regions in different poses and to introduce detailed learning of these regions during the learning process, where an extremity region can be either of, or a combination of, the hand region and the foot region.
Sample skeleton map: used to describe the bone distribution in one pose. In the embodiments of the present application, it is a two-dimensional image generated by connecting the key-point positions in the corresponding pose. In the technical solution proposed by the present application, for each pose there is a skeleton map and a depth map describing the action of the target object in that pose; during training they are called the sample skeleton map and the sample depth map, and during application they are called the planar skeleton map and the planar depth map. In UVZ coordinates, the sample skeleton map is determined from the UV values of the key-point positions.
Sample depth map: used, together with the sample skeleton map, to describe the pose of the target object; it has the same size as the corresponding sample skeleton map. In the embodiments of the present application, for a sample depth map and sample skeleton map corresponding to one pose, the sample skeleton map describes the bone shapes and distribution of the target object in that pose, while the pixel value of a pixel in the sample depth map represents the depth of that pixel position from the origin of the camera coordinate system. In other words, the sample depth map describes the depth values corresponding to the key points used to locate the skeleton, making it possible to describe, for that pose, the differences in distance from the camera coordinate system origin among bones at different positions. In UVZ coordinates, the sample depth map is determined from the Z values corresponding to the key-point positions.
Motion transfer algorithm: a deep learning algorithm that, based on an image of a target object and a two-dimensional key-point skeleton of a target pose, converts the target object image into a new image in the target pose.
Human pose estimation algorithm: a deep learning algorithm capable of detecting key points of the human body.
Monocular depth estimation algorithm: a deep learning algorithm that estimates pixel depth from a single-view image.
Super-resolution algorithm: a deep learning algorithm for converting low-resolution images into high-resolution, high-definition images.
Skeleton retargeting: used to transfer the motion of one three-dimensional skeleton to another three-dimensional skeleton of a different body shape. For example, based on the three-dimensional coordinates of the key points of target object A in action 1, the motion is transferred to target object B of a different body shape, yielding the three-dimensional coordinates of the key points of target object B in action 1.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and how to reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and demonstration-based learning.
The design ideas of the embodiments of the present application are briefly introduced below:
In the related art, when generating planar images of a virtual human in different poses, in a possible implementation, the sets of three-dimensional key-point coordinates of the virtual human under different actions in the constructed virtual space can first be obtained; then, based on these coordinate sets, the virtual human is driven to perform the corresponding actions; the virtual human is then dressed by performing cloth simulation on it; finally, the dressed virtual human is art-rendered to obtain planar images in different poses.
However, in this image generation approach, each generated planar image requires repeating the process of motion driving, cloth simulation, and art rendering, and the cloth simulation consumes a large amount of computing resources. This not only increases the cost of image generation but also increases the generation time, greatly limiting the efficiency of image generation.
Furthermore, the prior art proposes training a motion transfer model to generate an image of the virtual human in a target pose, based on a reference image of the virtual human and a two-dimensional skeleton map of the target pose.
However, with the existing processing methods, it is difficult to effectively distinguish different actions among similar actions, which reduces the accuracy of target image generation. Moreover, limbs in different regions can only be considered at the same image scale, so the sharpness of different limbs in the generated target image varies. In addition, the poses of the limb extremities in the target pose are restored inaccurately, making it difficult to guarantee the generation quality of the target image.
In view of this, the embodiments of the present application provide a training method and apparatus for an image generation model, an electronic device, and a storage medium. A training sample set is obtained, where one training sample includes: a sample reference map containing a target object, a sample skeleton map and a sample depth map indicating the positions of the key points of the target object in a target pose, and a sample standard map of the target pose; the sample skeleton map at least includes the limb-extremity skeletons. The training sample set is then used to perform multiple rounds of iterative training on a pre-trained image generation model, and a trained target image generation model is output. In one round of iteration, the following operations are performed: based on the sample skeleton map and sample depth map contained in the selected training sample, motion transfer processing is performed on the target object in the contained sample reference map according to the corresponding target pose, to obtain a predicted standard map; the model parameters of the image generation model are adjusted based on the multi-scale global comprehensive difference loss between the predicted standard map and the sample standard map, in combination with the local difference loss, within the specified image regions, between the predicted standard map and the sample standard map.
In this way, with the constructed training samples that include the sample skeleton map and the sample depth map, depth values at different key-point positions can be introduced during training of the image generation model. This not only provides more reference information for image generation but also effectively distinguishes similar poses and handles self-occlusion in different regions; in the process of training the model to perform motion transfer on the sample reference map according to the sample skeleton map and sample depth map to obtain the predicted standard map, the accuracy of image generation can also be improved.
Moreover, by considering the limb-extremity skeletons in the sample skeleton map, detailed handling of limb-extremity poses can be learned during model training, so that extremity movements can be effectively restored in the generated predicted standard map, guaranteeing the image generation quality.
In addition, by considering the multi-scale global comprehensive difference loss together with the local difference loss during model training, image presentation differences in different regions can be evaluated at different image scales, guaranteeing the sharpness of the presented image. This also improves the training effect of the model to a certain extent and better guides the model to learn motion transfer, providing a guarantee for training a target image generation model that restores action details and presents poses accurately, and improving the accuracy of images subsequently generated by the target image generation model.
The preferred embodiments of the present application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present application, not to limit it, and, provided no conflict arises, the embodiments of the present application and the features in the embodiments can be combined with each other.
Refer to Figure 1, which is a schematic diagram of a possible application scenario in an embodiment of the present application. This application scenario includes an image acquisition device 110 and a processing device 120.
In the embodiments of the present application, the image acquisition device 110 can, depending on actual processing needs, provide images for generating the training sample set, or be used to generate the training sample set in the process of training the image generation model, where the image types in the generated training sample set include: sample standard maps, sample depth maps, and sample skeleton maps in different poses. When processing based on the trained target image generation model, it provides the reference image on which the processing is based, as well as the planar skeleton map and planar depth map in the specified pose.
When the target object is a virtual object, the devices corresponding to the image acquisition device 110 include, but are not limited to, electronic devices with corresponding processing capabilities such as desktop computers, mobile phones, laptop computers, and tablet computers. When the target object is a physical object, the image acquisition device 110 may specifically be a device such as a depth camera with processing capabilities, or an electronic device capable of processing images provided by a depth camera.
The processing device 120 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. It may also be an electronic device such as a desktop computer, laptop computer, or tablet computer.
In the embodiments of the present application, the image acquisition device 110 and the processing device 120 establish a communication connection through a communication network, using a wired or wireless connection.
In possible technical solutions of the present application, the image acquisition device 110 can provide the processing device 120 with the images required for training; then, according to actual processing needs, the processing device 120 generates a pre-training sample set and a training sample set. After completing the pre-training of the initial image generation model using the pre-training sample set and obtaining the pre-trained image generation model, the processing device 120 continues to train the image generation model using the training sample set, obtaining the trained target image generation model.
It should be noted that, in the embodiments of the present application, depending on actual processing needs, different target image generation models can be trained for different target objects. Specifically, after pre-training is completed on the dataset and the pre-trained image generation model is obtained, fine-tuning training can be performed for each target object that has image generation needs, using images of that target object in different poses, to obtain the target image generation model corresponding to that target object.
For example, suppose there are virtual objects A and B. To generate images in different poses for each of them, corresponding target image generation models need to be trained separately for virtual object A and virtual object B.
The technical solution proposed by the present application can implement the training of image generation models in a variety of application scenarios; possible application scenarios are described below:
Scenario 1: generating an image generation model for a virtual object.
After the processing device obtains a pre-training sample set generated from a specified dataset, it performs multiple rounds of iterative training on the constructed image generation model based on the pre-training sample set to obtain the pre-trained image generation model. Then, using cloth simulation and art rendering techniques, it generates sample standard maps, sample skeleton maps, and sample depth maps of the virtual object in different poses to construct a training sample set; it then performs multiple rounds of iterative training on the pre-trained image generation model based on the training sample set, finally obtaining the trained target image generation model.
It should be noted that, in the embodiments of the present application, the virtual object may specifically be a game character or animal in a game scene, a virtual character or animal in a virtual-streamer live-broadcast scene, or a virtual character or animal image from authorized film and television works such as cartoons or emoticon packs. When processing with the target image generation model trained for the virtual object, target images can be generated for the different poses configured for the virtual object, and the dynamic image of the virtual object can then be displayed by playing the target images in sequence.
Scenario 2: generating an image generation model for a physical object.
After the processing device obtains the pre-training sample set generated from the specified dataset, it performs multiple rounds of iterative training on the constructed image generation model based on the pre-training sample set to obtain the pre-trained image generation model. Then, a depth camera is used to capture images of the physical object in different poses; from the images captured by the depth camera, sample standard maps, sample skeleton maps, and sample depth maps of the physical object in different poses can be extracted and determined, yielding the constructed training sample set. The pre-trained image generation model is then trained for multiple rounds of iteration based on the training sample set, finally obtaining the trained target image generation model.
It should be noted that, in the embodiments of the present application, the physical object may specifically be a real person or animal. Taking a real person as an example, with that person's authorization, a target image generation model can be trained for that person; then, when processing with the target image generation model, target images can be generated for the different poses configured for the physical object in an entertaining way, and the dynamic image of the physical object can be displayed by playing the target images in sequence.
The related processing is described schematically below with reference to the drawings, from the perspective of the processing device 120, taking as an example the case where the processing device performs sample generation and the training of the image generation model:
Refer to Figure 2A, which is a schematic diagram of the training process of the image generation model in an embodiment of the present application. The related training process is described below with reference to Figure 2A:
Step 201: The processing device obtains a training sample set, where one training sample includes: a sample reference map containing the target object, a sample skeleton map and a sample depth map indicating the positions of the key points of the target object in a target pose, and a sample standard map of the target pose; the sample skeleton map at least includes the limb-extremity skeletons.
In the embodiments of the present application, before the processing device performs fine-tuning training on the pre-trained image generation model for the target object, it needs to obtain the training sample set to be used, where one training sample includes: a sample reference map containing the target object, a sample skeleton map and a sample depth map used to indicate the positions of the key points of the target object in the target pose, and a sample standard map of the target object in the target pose; the sample skeleton map at least includes the limb-extremity skeletons.
It should be noted that, when the target object is a physical object, images captured by a depth camera can be obtained and the key-point positions used to describe the pose can be determined in advance; then, based on the images captured by the depth camera, the UV coordinates of the key-point positions of the target object in the corresponding pose and the depth value Z at each key-point position are determined, and the corresponding sample skeleton map and sample depth map are generated accordingly.
When the target object is a virtual object, refer to Figure 2B, which is a schematic diagram of the process of generating the training sample set in an embodiment of the present application. The process of generating a training sample set for a virtual object is described below with reference to Figure 2B:
Step 2011: The processing device obtains sample standard maps and three-dimensional coordinate sets of the target object in different poses, where one three-dimensional coordinate set includes: the three-dimensional coordinates corresponding to each key-point position in one pose.
In the embodiments of the present application, since a corresponding target image generation model needs to be trained separately for each target object with image generation needs, when generating the training sample set for each target object, the training sample set needs to be constructed from images of that target object in different poses.
Specifically, for the virtual object used in the business, the processing device performs cloth simulation on the virtual object in a variety of different poses in the constructed virtual space to obtain rendered images of the virtual object, and then exports the three-dimensional world coordinates, in the virtual space, of the selected key-point positions in each pose.
It should be noted that, in the embodiments of the present application, the selected key positions at least include: eye key points, nose key points, shoulder joints, elbow joints, wrist joints, hip joints, knee joints, ankle joints, finger joints, and some facial key points. In addition, exporting the three-dimensional coordinates of the different key-point positions of a virtual object from the constructed virtual space is a conventional technique in this field, and the present application does not describe it in detail.
Step 2012: The processing device processes each three-dimensional coordinate set with a preset two-dimensional reprojection technique, obtaining a sample skeleton map generated from the pixel coordinates of each key-point position in the image coordinate system, and a sample depth map generated from the pixel depth values corresponding to the key-point positions.
After obtaining the three-dimensional coordinate sets from the three-dimensional coordinates of the key-point positions of the target object in different poses, the processing device processes each three-dimensional coordinate set with the preset two-dimensional reprojection technique, transforming each three-dimensional coordinate in the set into pixel coordinates in the image coordinate system and a pixel depth value.
Specifically, when processing with the two-dimensional reprojection technique, the following formulas can be used:
First, the camera intrinsic matrix and extrinsic matrix used for the data conversion are calculated from the camera parameters in the virtual engine. Assume that the XYZ coordinates of the camera in the world coordinate system of the virtual space, the rotation angle α about the X axis, the rotation angle β about the Y axis, the rotation angle γ about the Z axis, the camera focal length f, and the physical size of the photosensitive sensor are available. Moreover, in the actual calculation, the variables need to be adjusted according to the coordinate-axis order and directions of the virtual engine.
A reference formula for the camera extrinsic matrix is as follows:

$$C_e = \begin{bmatrix} R & t \end{bmatrix}$$

where the 3×3 rotation R is composed from the rotation angles α, β, and γ about the three axes, and the translation t is determined from the camera's world-coordinate position (X, Y, Z).
A reference formula for the camera intrinsic matrix, in the standard pinhole form consistent with the symbols defined below, is as follows:

$$C_i = \begin{bmatrix} f/d_x & 0 & c_x \\ 0 & f/d_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where f is the camera focal length, dx and dy are the physical lengths of each pixel of the photosensitive sensor, and cx and cy are the pixel coordinates of the image center.
After the world coordinates (x0, y0, z0) of each key position of the virtual object are obtained, two-dimensional reprojection can be performed with the following formula, yielding the UV coordinates of the key position on the image and the corresponding pixel depth value Z:

$$Z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = C_i\, C_e \begin{bmatrix} x_0 \\ y_0 \\ z_0 \\ 1 \end{bmatrix}$$

After this equation is solved, the pixel coordinates in the image coordinate system and the depth value at the corresponding pixel can be obtained by projection from the three-dimensional coordinates (x0, y0, z0) of each key-point position.
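For illustration, the projection above can be carried out as follows; the rotation R and camera position C are stand-ins for the extrinsics, and, as noted below, the application's own virtual-space derivation of C_e differs from this real-world convention:

```python
import numpy as np

def reproject(points_xyz, R, C, f, dx, dy, cx, cy):
    """Project world-space key points to (u, v) pixel coordinates plus a
    depth value Z via the pinhole formulation above; a sketch, not the
    application's exact extrinsic derivation."""
    K = np.array([[f / dx, 0.0, cx],   # intrinsic matrix C_i
                  [0.0, f / dy, cy],
                  [0.0, 0.0, 1.0]])
    C = np.asarray(C, dtype=float)
    results = []
    for p in points_xyz:
        cam = R @ (np.asarray(p, dtype=float) - C)  # world -> camera coordinates
        z = cam[2]                                  # pixel depth value Z
        u, v, _ = K @ (cam / z)                     # perspective divide + intrinsics
        results.append((u, v, z))
    return results
```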
It should be noted that, in the embodiments of the present application, when performing the two-dimensional reprojection calculation, considering that the virtual objects involved in the present application exist in a virtual space, the way Ce is solved in the camera extrinsics differs from the transformation used in the real-world coordinate system. The present application inventively determines the way Ce is solved so that it better adapts to the transformation needs of the virtual space; in practice this achieves a very good conversion effect and improves the effectiveness of transforming the world coordinate system of the virtual space into the image coordinate system.
Further, for each three-dimensional coordinate set, the processing device generates a sample skeleton map based on the two-dimensional coordinates, in the image coordinate system, of the key-point positions contained in the set.
Specifically, after obtaining the pixel coordinates resulting from projecting the key-point positions in the three-dimensional coordinate set onto the image coordinate system, the processing device connects the pixels corresponding to the pixel coordinates to restore the bone distribution in the corresponding pose, obtaining a sample skeleton map of the same size as the corresponding sample standard map.
In the embodiments of the present application, in the process of generating the two-dimensional sample skeleton map, the processing device first marks, on the image, the pixels corresponding to the key-point positions according to their UV coordinates (i.e., their pixel coordinates in the image coordinate system); it then draws each bone segment by connecting the pixels that restore the bone distribution, finally obtaining the sample skeleton map.
It should be noted that, in the embodiments of the present application, when connecting the pixels to generate the sample skeleton map, the processing device can, depending on actual processing needs, directly connect the relevant pixels with line segments to obtain each bone segment, or establish connecting lines between the relevant pixels, such as double-arc connections, so as to highlight the bone distribution.
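A minimal rendering sketch using straight line segments (the simpler of the two options above); the bone topology, i.e. which key-point pairs are connected, is an assumption supplied by the caller:

```python
import cv2
import numpy as np

def draw_skeleton(keypoints_uv, bones, height, width, thickness=2):
    """Mark each key-point pixel, then connect the pixel pairs forming each
    bone segment. `bones` is a list of key-point index pairs."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)  # black background
    pts = [(int(round(u)), int(round(v))) for u, v in keypoints_uv]
    for pt in pts:
        cv2.circle(canvas, pt, 3, (255, 255, 255), -1)     # key-point marker
    for i, j in bones:
        cv2.line(canvas, pts[i], pts[j], (255, 255, 255), thickness)  # bone segment
    return canvas
```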
For example, refer to FIG. 2C, which is a schematic diagram describing the movement details of the target object in one pose in an embodiment of the present application. As shown in FIG. 2C, after performing cloth simulation and art rendering on the target object in pose 1, the processing device obtains the corresponding sample standard image; at the same time, from the distribution of the key-point positions of the target object as it exists in the virtual space in pose 1, it exports the three-dimensional coordinate set describing pose 1.
As another example, refer to FIG. 2D, which is a schematic diagram of the process of generating a sample skeleton map in an embodiment of the present application. As shown in FIG. 2D, after the three-dimensional coordinate set of the target object in one pose in the virtual space is obtained, two-dimensional reprojection is used to reproject the key-point positions of that pose onto the image plane, yielding the distribution of the key-point positions in the corresponding image coordinate system shown in FIG. 2D; that is, the pixel corresponding to each key-point position can be determined in the image coordinate system. The processing device then connects the pixels at the different positions into bones, restoring the distribution of the bones under the corresponding pose and obtaining the corresponding sample skeleton map.
In this way, based on the three-dimensional coordinate set indicating the key-point positions of the target object in one pose, the planar distribution of the key-point positions after the target object in that pose is projected into the image coordinate system can be determined, and the bone distribution under that pose can be restored by connecting the pixels corresponding to the key-point positions; moreover, by taking limb-end skeletons such as the hands into account, hand-pose details can be restored effectively, improving the pose-restoration effect for the target object. A minimal sketch of the drawing step follows.
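The following is a minimal sketch, not the patented implementation, of rendering a skeleton map from projected key-point pixel coordinates with OpenCV; the bone connectivity list BONES and the image size are illustrative assumptions.

```python
import cv2
import numpy as np

# Assumed (parent, child) key-point index pairs defining the bone segments;
# a fuller key-point set would also include limb-end bones such as finger joints.
BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def draw_skeleton(uv, height=1024, width=512):
    """uv: (K, 2) array of projected key-point pixel coordinates (U, V)."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)  # same size as the standard image
    for a, b in BONES:
        pa = (int(uv[a][0]), int(uv[a][1]))
        pb = (int(uv[b][0]), int(uv[b][1]))
        cv2.line(canvas, pa, pb, color=(255, 255, 255), thickness=3)  # one bone segment
    for u, v in uv.astype(int):
        cv2.circle(canvas, (int(u), int(v)), 4, (0, 255, 0), -1)  # mark each key point
    return canvas
```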
While generating the sample skeleton map for the target object in one pose, the processing device can also generate the corresponding sample depth map from the pixel depth values of the key-point positions determined by the two-dimensional reprojection.
Specifically, the processing device obtains the pixel coordinates and pixel depth values produced for each key-point position after projecting the key-point positions in the three-dimensional coordinate set into the image coordinate system; it then constructs an initial depth map matching the image coordinate system and, based on the pixel depth values, combined with the pixel-value differences determined for the pixel range to which each pixel coordinate belongs, adjusts the pixel value of each pixel in the initial depth map to obtain the sample depth map.
In a possible implementation, when generating the sample depth map, the processing device may first create a black-background image (that is, all pixel values are 0) of the same size as the two-dimensional sample skeleton map, and initialize the pixel value at each key-point position to the corresponding pixel depth value; it then generates, at the pixel position of each key point, a Gaussian distribution with a radius of N pixels and a mean of M, yielding a Gaussian distribution of pixel-value coefficients over the corresponding pixel range, where the values of N and M are set according to actual processing needs, for example N = 25 and M = 1, and the difference between the coefficients of different pixels characterizes the pixel-value difference between them. The pixel-value coefficients within the pixel range are then multiplied by the pixel depth value at the key-point pixel from which the range was generated, giving the pixel values at the different positions within the range and thereby the corresponding sample depth map. The pixel depth values may be in metres, and the pixel values at different positions in the sample depth map represent the differentiated depth values of those positions.
Optionally, considering that the value range of the pixel depth values may differ from the value range of the pixel values, the pixel depth values may first be normalized; the normalized pixel depth value is then multiplied by the pixel-value coefficient at the corresponding position, and the product is multiplied by the value range of the pixel values, finally giving the pixel value at the corresponding position.
For example, suppose that for the target object in pose 2, key-point position 1 corresponds to pixel 1 and the pixel depth value at pixel 1 is 1.5 metres. Considering that pixel depth values can be at most about 10 metres, all depth values are normalized by dividing by 10, giving the processed pixel value at pixel 1; a Gaussian kernel with a mean of 1 and a radius of 25 pixels is then generated centred on pixel 1, the whole kernel is multiplied by the depth value at pixel 1, and the kernel is drawn into the image, giving the depth-information map for that key point.
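A minimal sketch of this construction follows, assuming the example's values (radius N = 25, M = 1, a 10 m normalization range); reading the example's "mean of 1" as the kernel's central amplitude is an interpretation, not a value fixed by the application.

```python
import numpy as np

N, M, MAX_DEPTH = 25, 1.0, 10.0  # assumed radius, amplitude, normalization range

def make_depth_map(uv, z, height=1024, width=512):
    """uv: (K, 2) key-point pixel coords; z: (K,) pixel depth values in metres."""
    depth_map = np.zeros((height, width), dtype=np.float32)  # all-black initial map
    yy, xx = np.mgrid[-N:N + 1, -N:N + 1]
    kernel = M * np.exp(-(xx ** 2 + yy ** 2) / (2 * (N / 3.0) ** 2))  # amplitude M at centre
    for (u, v), depth in zip(uv.astype(int), z):
        patch = kernel * (depth / MAX_DEPTH)  # normalized depth scales the coefficients
        y0, y1 = max(v - N, 0), min(v + N + 1, height)
        x0, x1 = max(u - N, 0), min(u + N + 1, width)
        ky, kx = y0 - (v - N), x0 - (u - N)  # clip the patch at the image borders
        region = patch[ky:ky + (y1 - y0), kx:kx + (x1 - x0)]
        depth_map[y0:y1, x0:x1] = np.maximum(depth_map[y0:y1, x0:x1], region)
    return depth_map
```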
As another example, refer to FIG. 2E, which is a schematic diagram of a sample depth map generated in an embodiment of the present application. As shown in FIG. 2E, for each sample standard image the processing device generates a sample skeleton map of the same size and, at the same time, a sample depth map of the same size as the sample skeleton map. When the sample depth map is generated, the pixels corresponding to the key-point positions in the all-black initial depth map are first set to the corresponding pixel depth values, and a pixel range is determined centred on each key-point pixel; the Gaussian distribution of pixel-value coefficients is then generated according to the preset Gaussian radius and mean, and the final pixel value at each position is obtained by multiplying the pixel-value coefficient by the pixel value at the corresponding position.
In this way, the pixel depth values of the pixels corresponding to the projected key-point positions not only characterize each key-point position's distance from the origin of the camera coordinate system but also express the relative depth differences between pixels, so the key-point distributions of similar movements can be distinguished effectively and the accuracy of the pose indication is improved. In addition, by determining the pixel corresponding to each key-point position and then a pixel range around it, the influence of the key-point position is enlarged in the generated sample depth map; this amounts to magnifying the key point's location, avoids single-pixel recognition, and reduces the detection difficulty.
Step 2013: The processing device generates a training sample set based on the sample standard images, sample skeleton maps, and sample depth maps corresponding to the different poses.
After obtaining the sample standard images, sample skeleton maps, and sample depth maps of the target object in the different poses, the processing device first selects a sample reference image from among the sample standard images, and then combines that sample reference image with the sample standard images, sample skeleton maps, and sample depth maps of the poses other than the one the reference image corresponds to, obtaining the individual training samples; the generated training samples then make up the training sample set.
Optionally, the processing device may take the sample standard image of each pose in turn as the sample reference image and combine it with the sample standard image, sample skeleton map, and sample depth map of every other pose, obtaining the individual training samples.
For example, suppose there are sample standard images, sample skeleton maps, and sample depth maps of target object 1 in poses 1 to 5. When generating training samples, the sample standard image of pose 1 may be selected as the sample reference image and combined with the sample standard image, sample skeleton map, and sample depth map of each of the other poses, giving 4 training samples, as in the sketch below.
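A minimal sketch of this pairing rule; `poses` is an assumed list of per-pose records, and taking every pose as the reference over 5 poses yields 5 × 4 = 20 samples in total (4 per fixed reference, as in the example).

```python
def build_training_set(poses):
    """poses: list of dicts with keys 'standard', 'skeleton', 'depth'."""
    samples = []
    for i, ref in enumerate(poses):
        for j, tgt in enumerate(poses):
            if i == j:
                continue  # skip the reference image's own pose
            samples.append({
                "reference": ref["standard"],  # sample reference image
                "skeleton": tgt["skeleton"],   # target-pose sample skeleton map
                "depth": tgt["depth"],         # target-pose sample depth map
                "standard": tgt["standard"],   # ground-truth sample standard image
            })
    return samples
```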
In this way, a training sample set can be established in which the sample skeleton map and the sample depth map jointly indicate the key-point positions under the target pose, and the limb-end skeletons are taken into account in the sample skeleton maps; this is equivalent to incorporating more learnable factors into the training samples and provides a training basis for obtaining an effective image generation model.
Step 202: The processing device performs multiple rounds of iterative training on the pre-trained image generation model using the training sample set, and outputs the trained target image generation model.
In an embodiment of the present application, to save training time according to actual processing needs, the processing device may first perform multiple rounds of iterative pre-training on the initial image generation model to obtain the pre-trained image generation model, then perform multiple rounds of iterative training on the pre-trained image generation model and output the trained target image generation model.
Refer to FIG. 2F, which is a schematic diagram of the initially constructed image generation model in an embodiment of the present application. As shown in FIG. 2F, the present application makes algorithmic and structural adjustments on top of a motion transfer model with the Neural-Texture-Extraction-Distribution (NTED) structure. The image generation model constructed in the present application includes: a first encoding network configured with a convolutional attention layer, a second encoding network configured with a convolutional attention layer and an image fusion layer, and a multi-scale decoding network configured with a convolutional attention layer, where:
1) First encoding network configured with a convolutional attention layer.
This corresponds to the skeleton encoder (The Skeleton Encoder) in FIG. 2F, to which a lightweight attention module (Convolutional Block Attention Module, CBAM) is connected; the CBAM module is also called the convolutional attention layer.
2) Second encoding network configured with a convolutional attention layer and an image fusion layer.
This corresponds to the reference image encoder (The Reference Encoder) in FIG. 2F, which is connected to a CBAM and has a built-in image fusion layer; the image fusion layer, also called the depth-map fusion convolutional layer, is used to fuse the planar skeleton map and the key-point depth map.
Specifically, during training, the sample skeleton map and the sample depth map, which have the same image size, are concatenated along the channel dimension and input into the second encoding network, inside which the fusion of the sample skeleton map and the sample depth map is carried out.
3) Multi-scale decoding network configured with a convolutional attention layer.
This corresponds to the content enclosed by the thick dashed box in FIG. 2F, a network structure including the target image generator (The Target Image Renderer), NTED, and the convolution modules (Conv Blocks). NTED extracts the spatial texture features of the input image and maps them to the feature distribution corresponding to the target pose; a Conv Block is a module of stacked convolutional layers, and the constructed image generation model includes convolutional modules at multiple sizes such as 16×8, 32×16, …, 512×256, 1024×512, and 2048×1024, used to extract and fuse deep features for images of different sizes; tRGB converts the deep feature matrix into a 3-channel RGB image with a convolutional layer, and the Upsample part upsamples the image.
Continuing with FIG. 2F, in the image generation model constructed in the present application, adding CBAM layers to the first and second encoding networks helps the model extract features of the target object at different scales; in the multi-scale decoding network, adding CBAM to the last two layers near the image output helps training improve the detail of the generated images.
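The following PyTorch sketch is an assumption about what these components could look like, not the patented implementation: a CBAM block (channel attention followed by spatial attention) and a second encoder whose first convolution acts as the image fusion layer over the channel-concatenated skeleton and depth maps.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel then spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # channel attention from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # channel attention from max pooling
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))  # spatial attention map

class SkeletonDepthEncoder(nn.Module):
    """Second encoder sketch: fuses a 3-channel skeleton map and a 1-channel depth map."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(4, 64, kernel_size=3, padding=1)  # image fusion layer
        self.attn = CBAM(64)

    def forward(self, skeleton, depth):
        x = torch.cat([skeleton, depth], dim=1)  # concatenate along the channel dimension
        return self.attn(self.fuse(x))
```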
In an embodiment of the present application, when pre-training the initial image generation model, the processing device obtains a specified data set and performs monocular depth estimation processing on each sample skeleton map in the data set to obtain the sample depth map corresponding to each sample skeleton map, the data set including sample standard images and sample skeleton maps of each sample object in different poses; based on the sample standard images, sample skeleton maps, and sample depth maps obtained from the data set, a pre-training sample set is constructed, the initial image generation model is iteratively trained on it for multiple rounds, and the pre-trained image generation model is output.
Specifically, considering that the amount of data rendered for the target object is limited, a large amount of planar human-body image data can be used in advance to pre-train the initial image generation model, so as to improve the generalization of an image generation model capable of motion transfer. To this end, the present application may obtain an existing large data set for human pose estimation and filter out of it single-person planar images in which the person's proportion exceeds a set threshold to generate training data; the selected data set may be the COCO data set, Human3.6M, or the like, which is not specifically limited in the present application. The images in the data set include images of persons in different poses (that is, sample standard images containing the sample objects) and the planar skeleton maps under the corresponding poses (that is, the sample skeleton maps).
In an embodiment of the present application, considering that the data set contains planar images, a monocular depth estimation algorithm can be used when generating the corresponding depth maps to obtain the depth value of each labelled key-point position, thereby obtaining the UVZ data of the key-point positions in each person image; a pre-training sample set is generated from these data, and a pre-trained model with stronger generalization is trained on it.
It should be noted that the reason a monocular depth estimation algorithm is used in the embodiments of the present application is that data sets with rich human poses currently provide, in essence, only two-dimensional key-point labels, i.e., UV coordinates, whereas the present application considers the UVZ coordinates of the key points; a monocular depth estimation algorithm is therefore needed to predict the distance between the camera and the pixel corresponding to each key-point position of the person in the image, so that a Z value can be obtained for each key-point position and the required UVZ data can be synthesized.
In addition, for the monocular depth estimation algorithm used in the present application, a training set consisting of RGB images and depth maps captured by an RGBD camera can be obtained and used to train a deep convolutional neural network capable of predicting the depth value of every pixel, so that the monocular depth estimation function implemented by this network can complete the depth information of an RGB image well.
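A hedged sketch of this data-preparation step follows, using the off-the-shelf MiDaS model via torch.hub purely as an illustrative choice; the application does not name a specific network, and MiDaS predicts relative inverse depth, so a metric model or a calibration step would be needed for true distances.

```python
import cv2
import torch
import torch.nn.functional as F

# Illustrative stand-in for the depth network described above.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def keypoint_uvz(image_bgr, uv):
    """uv: iterable of labelled (U, V) key-point coordinates; returns UVZ triples."""
    img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(img))  # (1, H', W') relative inverse depth
        depth = F.interpolate(pred.unsqueeze(1), size=img.shape[:2],
                              mode="bicubic", align_corners=False).squeeze()
    return [(int(u), int(v), float(depth[int(v), int(u)])) for u, v in uv]
```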
In an embodiment of the present application, by processing the obtained data set, the processing device obtains, for each sample object in each pose in the data set, the corresponding sample standard image, sample skeleton map, and sample depth map; pre-training samples are constructed from the sample standard images, sample skeleton maps, and sample depth maps of the same sample object in different poses, generating the corresponding pre-training sample set; multiple rounds of iterative training of the initial image generation model are then carried out on the generated pre-training sample set, giving the pre-trained image generation model.
It should be noted that the model processing performed during pre-training is the same as that performed on the pre-trained image generation model, so the specific pre-training procedure is not described separately here.
In this way, pre-training the constructed image generation model improves its generalization, reduces the annotation requirements for training samples in the subsequent process of training the target image generation model for the target object, and helps increase the training speed of the model.
Further, after obtaining the pre-trained image generation model, the processing device performs multiple rounds of iterative fine-tuning on it until a preset convergence condition is met, and outputs the trained target image generation model; the preset convergence condition may be, for example, that the number of training rounds reaches a set value, which is not specifically limited in the present application.
Refer to FIG. 2G, which is a schematic diagram of one round of model training in an embodiment of the present application. The training process is described below with reference to FIG. 2G, taking one round of training on the pre-trained image generation model as an example:
Step 2021: Based on the sample skeleton map and sample depth map contained in the selected training sample, the processing device performs motion transfer on the target object in the contained sample reference image according to the corresponding target pose, obtaining a predicted standard image.
Specifically, in the case where the image generation model includes the first encoding network configured with a convolutional attention layer, the second encoding network configured with a convolutional attention layer and an image fusion layer, and the multi-scale decoding network configured with a convolutional attention layer, when performing step 2021 the processing device inputs the sample reference image of the selected training sample into the first encoding network to obtain encoded reference image features; it concatenates the sample skeleton map and sample depth map of the training sample along the channel dimension and inputs them into the second encoding network to obtain fused, encoded skeletal motion features; the multi-scale decoding network then decodes the reference image features based on the skeletal motion features, giving the predicted standard image after motion transfer.
In an embodiment of the present application, the processing device uses the first encoding network of the image generation model to encode the sample reference image of the target object, obtaining the reference image features corresponding to the sample reference image; at the same time, it uses the second encoding network of the image generation model to encode and fuse the sample skeleton map and sample depth map of the target object, obtaining skeletal motion features that describe the target pose; with the help of the multi-scale decoding network, the skeletal motion features then guide the target object from the pose corresponding to the reference image features to the target pose, giving the predicted standard image output by the model.
In this way, with the first encoding network, the second encoding network, and the multi-scale decoding network all containing CBAM, the model can learn to perform motion transfer to the target movement. Moreover, by making the sample depth map part of the model input, the present application allows the two-dimensional sample skeleton map and the sample depth map to be input simultaneously, effectively introducing the three-dimensional information of the key-point positions; this increases the information content of the model's input data while helping the model learn to perform motion transfer more precisely.
Step 2022: The processing device adjusts the model parameters of the image generation model based on a multi-scale global comprehensive difference loss between the predicted standard image and the sample standard image, combined with a local difference loss within specified image regions between the predicted standard image and the sample standard image.
When performing step 2022, the processing device computes a model loss value based on the image difference between the predicted standard image generated by the image generation model and the corresponding sample standard image of the selected training sample, and adjusts the model parameters of the image generation model according to the model loss value.
In some possible implementations, after computing the multi-scale global comprehensive difference loss between the predicted standard image and the sample standard image, and the local difference loss within the specified image regions between them, the processing device takes the weighted sum of the global comprehensive difference loss and the local difference loss as the model loss value used to adjust the model parameters.
In other possible embodiments, when the image generation model is trained as the generator in a generator-discriminator structure, after the predicted standard image is obtained, a preset generative adversarial loss function is applied to the predicted standard image and the corresponding sample standard image, giving the corresponding adversarial loss; the model parameters of the image generation model are then adjusted based on the adversarial loss and the global comprehensive difference loss between the predicted standard image and the sample standard image, combined with the local difference loss within the specified image regions between the two.
In this way, introducing the global comprehensive difference loss and the local difference loss allows both local and overall differences between the images to be considered effectively, helping the model learn to restore pose details in the generated images; additionally introducing the generative adversarial loss allows the generation quality of the images to be evaluated further within the generator-discriminator training framework, helping to improve the training effect of the model.
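A minimal sketch of the weighted combination; the weights w_global, w_local, and w_adv are assumed hyperparameters, not values given in the application.

```python
def total_loss(global_loss, local_loss, adv_loss=None,
               w_global=1.0, w_local=1.0, w_adv=0.1):
    loss = w_global * global_loss + w_local * local_loss  # weighted superposition
    if adv_loss is not None:  # only when training inside a generator-discriminator setup
        loss = loss + w_adv * adv_loss
    return loss
```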
In an embodiment of the present application, when determining the local difference loss between the predicted standard image and the sample standard image, the processing device determines, in each of the two images, the target key-point positions used to locate the sub-image regions, and crops, in the predicted standard image and the sample standard image respectively, the specified image regions containing multiple sub-image regions based on the determined target key-point positions; the corresponding local difference loss is then obtained from the pixel-value differences and image-feature differences within each sub-image region.
Specifically, when computing the local difference loss, the processing device first selects the local regions to be considered in the sample standard image and the predicted standard image, for example the face region and the hand regions; it then locates the corresponding local regions in each image from the human-body key points and crops them out of the sample standard image and the predicted standard image.
For example, suppose the preset local regions are the face image region and the hand image regions. When cropping the face image region, a rectangular box is generated from the line connecting the eye key points among the selected key-point positions, so as to delimit the face region, and the corresponding face image regions (i.e., sub-image regions) are cropped from the sample standard image and the predicted standard image; similarly, the finger joint points can be selected, the hand image regions delimited from them, and the corresponding hand image regions (i.e., sub-image regions) cropped from the sample standard image and the predicted standard image.
The processing device can then, according to actual processing needs, use an L1 loss function to compute the pixel-value difference loss and the image-feature difference loss of the pixels in each sub-image region, and from the per-region losses compute the local difference loss corresponding to the specified regions comprising the sub-image regions, as in the sketch below.
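A hypothetical sketch of this local difference loss: crop the same key-point-derived boxes from both images and sum L1 losses over pixels and features; `feature_net` stands in for whichever fixed feature extractor is used, an assumption here.

```python
import torch
import torch.nn.functional as F

def local_diff_loss(pred, target, boxes, feature_net):
    """boxes: list of (x0, y0, x1, y1) rectangles derived from key-point positions."""
    loss = pred.new_zeros(())
    for x0, y0, x1, y1 in boxes:
        p = pred[..., y0:y1, x0:x1]    # e.g. a face or hand sub-image region
        t = target[..., y0:y1, x0:x1]
        loss = loss + F.l1_loss(p, t)                            # pixel-value difference
        loss = loss + F.l1_loss(feature_net(p), feature_net(t))  # image-feature difference
    return loss
```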
In this way, introducing the local difference loss allows the differences between local regions of the images to be considered effectively, so that during the model's learning and training it can be guided to restore pose details in the generated images, improving the image generation effect.
In an embodiment of the present application, when determining the global comprehensive difference loss between the predicted standard image and the sample standard image, the processing device obtains a global pixel-value loss from the per-pixel pixel-value differences between the two images, and a multi-scale feature loss from the image-feature differences between the two images at multiple preset scales; the global pixel-value loss and the multi-scale feature loss are then combined into the corresponding global comprehensive difference loss.
Specifically, when computing the multi-scale feature loss, the processing device can use a Visual Geometry Group (VGG) network: the predicted standard image and the sample standard image are each input into the VGG network, and the network outputs their image features at multiple preset scales; an L1 loss function is then used to compute the image-feature difference between the predicted standard image and the sample standard image at each scale, finally giving the multi-scale feature loss.
When computing the global pixel-value loss, the processing device uses an L1 loss function to determine the corresponding global pixel-value loss from the pixel-value differences between the predicted standard image and the sample standard image.
The corresponding global comprehensive difference loss is then finally obtained by computing the weighted sum of the multi-scale feature loss and the global pixel-value loss, as in the sketch below.
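A hedged sketch of this global comprehensive difference loss using torchvision's VGG-19; the tapped layer indices (four scales) and the weight w_feat are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

_vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
_taps = {3, 8, 17, 26}  # relu1_2, relu2_2, relu3_4, relu4_4 (assumed scales)

def vgg_features(x):
    feats = []
    for i, layer in enumerate(_vgg):
        x = layer(x)
        if i in _taps:
            feats.append(x)
    return feats

def global_comprehensive_loss(pred, target, w_feat=1.0):
    pixel = F.l1_loss(pred, target)  # global pixel-value loss
    with torch.no_grad():
        t_feats = vgg_features(target)
    p_feats = vgg_features(pred)
    feat = sum(F.l1_loss(p, t) for p, t in zip(p_feats, t_feats))  # multi-scale feature loss
    return pixel + w_feat * feat  # weighted superposition
```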
In this way, the global comprehensive difference loss allows the differences between images to be considered effectively as a whole, taking into account the combined influence of differences at both the pixel-value level and the image-feature level, so that the model can be adjusted toward reducing both the pixel-value differences and the image-feature differences.
It should be noted that, in the embodiments of the present application, an existing real-human image data set is used when pre-training the image generation model that implements the motion transfer function. Since the pre-trained image generation model's fit to a specific target object may not be good enough, after the training sample set is generated from the processed target object data, the pre-trained image generation model needs to be optimized further to improve its fit to the target object data. When optimizing the model, to prevent the model from overfitting the target object data and thereby forgetting the pre-training knowledge, the number of training cycles can be controlled and the model's learning rate reduced; the number of training cycles is set according to actual processing needs and is not specifically limited in the present application.
Specifically, the processing device can determine the learning rate used in each round of iteration on the pre-trained image generation model in either of the following ways:
Method 1: Compute the learning rate with a cosine annealing algorithm.
Specifically, the processing device can use a preset cosine annealing algorithm to determine, starting from a preset initial learning rate, the learning rate for each training cycle, and determine the target learning rate for the current iteration according to the training cycle the current iteration belongs to; one training cycle includes at least one round of iteration.
On this basis, the processing device can determine the learning rate used in each training cycle by cosine annealing, so that the learning rate is adjusted periodically per training cycle, its value gradually decreasing as the training cycles progress.
Method 2: Compute the learning rate based on a preset learning-rate decay function.
Specifically, the processing device determines the learning rate for each training cycle based on a preset initial learning rate and a learning-rate decay coefficient, and determines the target learning rate for the current iteration according to the training cycle the current iteration belongs to; one training cycle includes at least one round of iteration.
For example, if the learning-rate decay coefficient is 0.5, the iterations within each training cycle can all use the same learning rate, and for any two adjacent training cycles the learning rate used in the earlier cycle is twice that used in the later one. Both options map onto standard schedulers, as sketched below.
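A sketch of both options with PyTorch's built-in schedulers; the stand-in model, initial learning rate, cycle length, and decay factor 0.5 are illustrative values, not ones fixed by the application.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3)  # stand-in for the image generation model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Method 1: cosine annealing of the LR over the training cycles.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Method 2: multiply the LR by the decay coefficient 0.5 after every cycle,
# so each cycle's LR is half the previous cycle's.
step = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for cycle in range(10):  # one training cycle = at least one iteration round
    ...                  # run the cycle's training iterations here
    cosine.step()        # or step.step(), depending on the chosen method
```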
In this way, adjusting the learning rate used for model training can, to a certain extent, prevent the model from overfitting the target object data, avoid the image generation model forgetting the pre-training knowledge, and safeguard the training effect of the model.
Further, after the target image generation model has been trained from the pre-trained image generation model, the processing device can use the target image generation model to perform motion transfer on the target object in an image, obtaining a target image of the target object in a specified pose.
The processing device obtains a reference image of the target object under a reference movement, as well as a planar skeleton map and a planar depth map of the target object in a specified pose, the planar skeleton map including the hand bones; the target image generation model then performs motion transfer on the reference image based on the planar skeleton map and planar depth map, giving the target image of the target object in the specified pose.
Specifically, once the optimized target image generation model has been obtained, the processing device can generate images offline according to actual processing needs. In the generation process, it first obtains the three-dimensional coordinate set of the key-point positions of the target object in the specified pose.
The processing device then reprojects the three-dimensional coordinates of each key-point position in the set to obtain UVZ coordinates, from which it synthesizes the corresponding planar skeleton map and planar depth map; the reference image of the target object in the reference pose is input into the optimized target image generation model together with the planar skeleton map and planar depth map of the specified pose, and the model outputs the planar image of the target object in the specified pose. A sketch of the reprojection step follows.
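A minimal sketch of the reprojection under a pinhole camera model, assuming the key points are already expressed in the render camera's frame (otherwise the saved extrinsics would be applied first); K stands for the 3×3 intrinsic matrix saved from the virtual engine's camera.

```python
import numpy as np

def reproject(points_cam, K):
    """points_cam: (N, 3) key-point coordinates in the camera frame (metres)."""
    uvw = (K @ points_cam.T).T           # homogeneous image coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]        # perspective divide -> pixel coordinates (U, V)
    z = points_cam[:, 2]                 # pixel depth values along the optical axis (Z)
    return np.concatenate([uv, z[:, None]], axis=1)  # one UVZ triple per key point
```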
It should be noted that when a three-dimensional coordinate set of another object in the specified pose exists but no such set exists for the target object, skeleton retargeting can be applied to the three-dimensional coordinate set indicating the other object's specified pose to obtain the target object's three-dimensional coordinate set in that specified pose.
In particular, when the resolution of the generated planar image is insufficient, the processing device can use a super-resolution algorithm to increase the image's resolution and output the finished image; super-resolution and retargeting algorithms are conventional techniques in the field and are not described in detail in the present application.
In addition, in some possible implementation scenarios of the present application, an expected pose sequence indicating different poses of the target object can be obtained, where each pose in the sequence corresponds to one reference image and a three-dimensional coordinate set indicating the expected pose; the processing device can then generate the target image corresponding to each expected pose and finally a target image sequence corresponding to the expected pose sequence, so that playing the target images in succession yields a video of the target object's pose changes.
In this way, only a single large batch of target object data needs to be rendered for training the image generation model, after which target-object image material in any specified pose can be generated offline, without further reliance on external art production; this effectively reduces the time and equipment costs of image rendering and cloth simulation and improves the production efficiency of long-term operation. Moreover, introducing the planar depth map effectively handles the self-occlusion problem and restores the target object in the specified pose; the consideration of limb-end skeletons added during training provides, at minimum, the ability to handle the hand bones, making the gestures in the generated images controllable; considering the local difference loss further optimizes local features and alleviates the low image quality caused by too small a scale; and changing the model structure so that it can internally process images of sizes such as 2048×1024 greatly increases the resolution of the generated images, allowing them to reach 1080p.
The training and application processes involved in the embodiments of the present application are illustrated below with reference to a specific application scenario, taking a virtual human as the target object:
Refer to FIG. 3A, which is a schematic diagram of the processing in the training and application stages of the target image generation model in an embodiment of the present application. As shown in FIG. 3A, in the training stage the processing device renders a large number of planar images of the virtual human in virtual space, exports the key-point position coordinates corresponding to the different planar images to obtain the corresponding three-dimensional coordinate sets, and generates the training sample set; the pre-trained image generation model is then trained on the training sample set and the trained target image generation model is output.
Refer to FIG. 3B, which is a schematic diagram of a single round of iterative training in an embodiment of the present application. As shown in FIG. 3B, in one round of iterative training on the pre-trained image generation model, after selecting a training sample the processing device inputs the sample reference image of the training sample into the first encoding network, concatenates the sample skeleton map and sample depth map of the training sample along the channel dimension and inputs them into the second encoding network, and obtains the predicted standard image output by the multi-scale decoding network;
Then, continuing with FIG. 3B, when the image generation model is trained as the generator in a generator-discriminator structure, the processing device inputs the predicted standard image and the sample standard image into the discriminator to obtain the corresponding adversarial loss, the computation of which is conventional in the field and not described in detail here. The processing device also computes an image pixel-difference loss (also called the global pixel-value loss) from the image pixel differences between the predicted standard image and the sample standard image. At the same time, after cropping the face image region and the hand image regions from the predicted standard image and the sample standard image, the processing device computes, by weighting, the pixel-difference loss and image-feature difference loss corresponding to the face image region and the pixel-value difference loss and image-feature difference loss corresponding to the hand image regions, giving the corresponding local difference loss. The processing device can also input the predicted standard image and the sample standard image into the preset VGG network to obtain the multi-scale image features corresponding to each, and from the per-scale image-feature difference losses finally obtain the multi-scale feature loss.
Then, in the training process shown in FIG. 3B, the model loss is obtained by weighting the various computed losses, and the model parameters of the image generation model are adjusted according to the model loss.
Continuing with FIG. 3A, in the application stage the processing device first prepares the virtual human's expected pose sequence, each expected pose being associated with a three-dimensional coordinate set indicating the key-point positions; the processing device then determines the reference image corresponding to the virtual human and, for each expected pose, the corresponding planar skeleton map and planar depth map; the trained target image generation model then produces a target image from the reference image, planar skeleton map, and planar depth map of each expected pose; finally, the processing device obtains the target image sequence corresponding to the expected pose sequence.
Refer to FIG. 3C, which is a schematic diagram of the overall structure of training the target image generation model in an embodiment of the present application. The content shown in FIG. 3C is divided overall into a pre-training stage, a stage of constructing the training sample set for optimization, a model optimization stage, and an application stage.
In the pre-training stage, the processing device processes the data set, saves the UV coordinates of each key-point position in the images, and uses a monocular depth estimation algorithm to obtain the depth value corresponding to each key-point position; the initial image generation model is then trained from the image in the reference pose together with the image, skeleton map, and depth map in the target pose, giving the pre-trained image generation model.
In the stage of constructing the training sample set for optimization, the processing device preprocesses the training data: it performs cloth simulation and art rendering on the virtual human in multiple poses, obtaining the corresponding planar images, and saves the camera parameters used for rendering together with the three-dimensional coordinate sets of the corresponding poses; the three-dimensional skeleton represented by each coordinate set is then reprojected according to the camera parameters to obtain UVZ coordinates, a skeleton map is generated from the pixels at the projected UV positions, and a depth map is generated from the pixel depth values determined by the UVZ values at the pixel positions.
In the model optimization stage, the processing device uses the training sample set to optimize the pre-trained image generation model.
In the application stage, the processing device first prepares the virtual human's expected pose sequence, each expected pose being associated with a three-dimensional coordinate set indicating the key-point positions; it then determines the reference image corresponding to the virtual human and, from the obtained three-dimensional coordinate sets, the corresponding planar skeleton maps and planar depth maps; the target image generation model fitted to the virtual-human data then generates the corresponding planar image for each expected pose in the sequence; finally, according to actual processing needs, a super-resolution algorithm can be used to increase the resolution of each target image.
In this way, a batch of images of different movements with completed cloth simulation is first rendered from the virtual human, and the camera parameters in the virtual engine are used to perform two-dimensional reprojection of the three-dimensional coordinates of the key-point positions in the images corresponding to the virtual human, giving the two-dimensional coordinates and pixel depth value of each key-point position in the image plane; these images, together with the two-dimensional coordinates and pixel depth values of the key-point positions, are then used to train an image generation model based on a deep convolutional network. The virtual human's expected pose sequence can subsequently be converted into a planar image sequence corresponding to the expected poses; detail can be taken into account during image generation, the influence of self-occluded regions is avoided, and both image generation efficiency and generation accuracy are improved.
Based on the same inventive concept, refer to FIG. 4, which is a schematic diagram of the logical structure of the training apparatus for an image generation model in an embodiment of the present application. The training apparatus 400 for an image generation model includes an acquisition unit 401 and a training unit 402, where:
the acquisition unit 401 is configured to obtain a training sample set, one training sample including: a sample reference image containing the target object, a sample skeleton map and a sample depth map indicating the key-point positions of the target object in a target pose, and a sample standard image of the target pose, the sample skeleton map including at least the limb-end skeletons;
the training unit 402 is configured to perform multiple rounds of iterative training on the pre-trained image generation model using the training sample set and output the trained target image generation model, the following operations being performed in one round of iteration:
based on the sample skeleton map and sample depth map contained in the selected training sample, performing motion transfer on the target object in the contained sample reference image according to the corresponding target pose, obtaining a predicted standard image;
adjusting the model parameters of the image generation model based on a multi-scale global comprehensive difference loss between the predicted standard image and the sample standard image, combined with a local difference loss within specified image regions between the predicted standard image and the sample standard image.
Optionally, the image generation model includes: a first encoding network configured with convolutional attention layers, a second encoding network configured with convolutional attention layers and an image fusion layer, and a multi-scale decoding network configured with convolutional attention layers.
Then, when performing motion transfer processing on the target object in the contained sample reference image according to the corresponding target pose, based on the sample skeleton map and sample depth map contained in the selected training sample, to obtain the predicted standard image, the training unit 402 is configured to:
input the sample reference image contained in the selected training sample into the first encoding network to obtain encoded reference image features;
concatenate the sample skeleton map and sample depth map contained in the training sample along the channel dimension, then input the result into the second encoding network to obtain encoded and fused skeletal action features;
use the multi-scale decoding network to decode the reference image features conditioned on the skeletal action features, obtaining the predicted standard image after motion transfer.
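As a rough illustration of the data flow just described, the sketch below wires two encoders and a decoder together in PyTorch. All module shapes, channel counts, and the toy attention gate are assumptions for illustration; this is not the architecture claimed in this application.

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    """Toy channel-attention gate standing in for the convolutional attention layer."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.gate(x)

class Encoder(nn.Module):
    def __init__(self, in_ch, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            ConvAttention(ch * 2))
    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.ref_enc = Encoder(in_ch=3)    # first encoding network: reference image
        self.pose_enc = Encoder(in_ch=4)   # second encoding network: skeleton (3 ch) + depth (1 ch)
        self.decoder = nn.Sequential(      # multi-scale decoder, collapsed to one scale here
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())
    def forward(self, ref_img, skeleton, depth):
        pose_in = torch.cat([skeleton, depth], dim=1)   # channel-dimension concatenation
        feat = torch.cat([self.ref_enc(ref_img), self.pose_enc(pose_in)], dim=1)
        return self.decoder(feat)

pred = Generator()(torch.randn(1, 3, 64, 64),   # sample reference image
                   torch.randn(1, 3, 64, 64),   # sample skeleton map
                   torch.randn(1, 1, 64, 64))   # sample depth map
```

The simple feature concatenation before decoding is one possible fusion choice; the application's image fusion layer may combine the two feature streams differently.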
Optionally, the training sample set is generated as follows:
obtain sample standard images and three-dimensional coordinate sets of the target object in different poses, where one three-dimensional coordinate set includes the three-dimensional coordinates of each keypoint position in one pose;
process each three-dimensional coordinate set with a preset two-dimensional reprojection technique to obtain a sample skeleton map generated from the pixel coordinates of each keypoint position in the image coordinate system, and a sample depth map generated from the pixel depth value of each keypoint position;
generate the training sample set based on the sample standard images, sample skeleton maps, and sample depth maps corresponding to the different poses.
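The reprojection step can be illustrated with the standard pinhole camera model. The intrinsics, extrinsics, and keypoint values below are placeholders; in practice the virtual engine's actual camera parameters would be substituted.

```python
import numpy as np

def reproject(points_3d, K, R, t):
    """Project Nx3 world-space keypoints to pixel coordinates plus per-point depth.

    points_3d: (N, 3) keypoint positions in world coordinates.
    K: (3, 3) camera intrinsics; R, t: world-to-camera rotation and translation.
    Returns (N, 2) pixel coordinates and (N,) pixel depth values.
    """
    cam = points_3d @ R.T + t            # world -> camera coordinates
    depth = cam[:, 2]                    # camera-space z is the pixel depth value
    proj = cam @ K.T                     # pinhole projection
    pixels = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return pixels, depth

# Illustrative intrinsics for a 512x512 render (focal length and principal point assumed).
K = np.array([[500.0, 0.0, 256.0],
              [0.0, 500.0, 256.0],
              [0.0, 0.0, 1.0]])
pixels, depth = reproject(np.array([[0.1, 0.2, 3.0]]), K, np.eye(3), np.zeros(3))
```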
Optionally, when obtaining the sample skeleton map generated from the two-dimensional coordinates of each keypoint position in the image coordinate system, the acquisition unit 401 is configured to:
obtain the pixel coordinates produced by projecting each keypoint position in the three-dimensional coordinate set into the image coordinate system;
connect the pixels corresponding to those pixel coordinates to restore the bone layout in the corresponding pose, obtaining a sample skeleton map the same size as the corresponding sample standard image.
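A minimal sketch of this rasterization step, assuming OpenCV for line drawing; the bone connectivity list is an illustrative placeholder, not the keypoint topology used by the application.

```python
import numpy as np
import cv2

# Illustrative bone list: pairs of keypoint indices to connect.
BONES = [(0, 1), (1, 2), (2, 3)]

def draw_skeleton(pixels, height, width):
    """Rasterize projected keypoints into a skeleton map the same size as the standard image."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in BONES:
        pt_a = tuple(int(round(v)) for v in pixels[a])
        pt_b = tuple(int(round(v)) for v in pixels[b])
        cv2.line(canvas, pt_a, pt_b, color=(255, 255, 255), thickness=2)
    return canvas

skeleton_map = draw_skeleton(
    np.array([[10, 10], [30, 40], [60, 45], [80, 70]]), 128, 128)
```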
Optionally, when obtaining the sample depth map generated from the pixel depth values of the keypoint positions, the acquisition unit 401 is configured to:
obtain, after projecting each keypoint position in the three-dimensional coordinate set into the image coordinate system, the pixel coordinates and pixel depth value corresponding to each keypoint position;
construct an initial depth map matching the image coordinate system, and adjust the pixel value of each pixel in the initial depth map based on the pixel depth values, combined with the pixel-value differences determined for the pixel range to which each pixel coordinate belongs, obtaining the sample depth map.
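One plausible reading of this step, sketched below: start from an all-zero depth map and stamp each keypoint's normalized depth into a small neighbourhood around its projected pixel. The neighbourhood radius and the min-max normalization are assumptions.

```python
import numpy as np

def build_depth_map(pixels, depths, height, width, radius=3):
    """Fill an initially blank depth map by writing each keypoint's depth value
    into the pixel range around its projected coordinate (radius is illustrative)."""
    depth_map = np.zeros((height, width), dtype=np.float32)   # initial depth map
    # Normalize depths to [0, 1] so nearer and farther keypoints get distinct pixel values.
    d = (depths - depths.min()) / max(depths.max() - depths.min(), 1e-6)
    for (x, y), v in zip(pixels, d):
        x, y = int(round(x)), int(round(y))
        y0, y1 = max(y - radius, 0), min(y + radius + 1, height)
        x0, x1 = max(x - radius, 0), min(x + radius + 1, width)
        depth_map[y0:y1, x0:x1] = v
    return depth_map

depth_map = build_depth_map(
    np.array([[10.0, 12.0], [40.0, 50.0]]), np.array([2.5, 3.1]), 128, 128)
```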
Optionally, when the image generation model is trained as the generator in a generative adversarial structure, after the predicted standard image is obtained, the training unit 402 is further configured to:
use a preset generative adversarial loss function to obtain the corresponding adversarial loss based on the predicted standard image and the corresponding sample standard image;
adjust the model parameters of the image generation model based on the adversarial loss and the global comprehensive difference loss between the predicted standard image and the sample standard image, combined with the local difference loss within the specified image regions between the two.
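A hedged sketch of how the three terms might be combined for the generator update, using a non-saturating GAN loss as one common choice of adversarial loss; the loss weights, the toy discriminator, and the placeholder global/local terms are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def global_diff(pred, target):
    return F.l1_loss(pred, target)   # placeholder for the global comprehensive loss

def local_diff(pred, target):
    # Placeholder crop standing in for the keypoint-located sub-image regions.
    return F.l1_loss(pred[..., :16, :16], target[..., :16, :16])

def generator_loss(disc, pred, target, w_adv=0.1, w_global=1.0, w_local=1.0):
    """Adversarial + global + local terms; the generator wants disc(pred) scored as real."""
    logits = disc(pred)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return w_adv * adv + w_global * global_diff(pred, target) + w_local * local_diff(pred, target)

disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))   # toy discriminator
loss = generator_loss(disc, torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```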
Optionally, the local difference loss is determined as follows:
in the predicted standard image and the sample standard image, determine the target keypoint positions used to locate the sub-image regions, and crop each image, based on those target keypoint positions, into a specified image region containing multiple sub-image regions;
obtain the corresponding local difference loss based on the pixel-value differences and image-feature differences within each sub-image region.
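A minimal sketch of such a local loss, assuming fixed-size crops centered on given keypoints and an optional frozen feature extractor; the crop size and keypoint positions are placeholders.

```python
import torch
import torch.nn.functional as F

def local_difference_loss(pred, target, keypoints, crop=32, feat_net=None):
    """Crop a window around each target keypoint (e.g. a hand) in both images and
    sum pixel-value and image-feature differences over the sub-image regions."""
    loss = pred.new_zeros(())
    for x, y in keypoints:
        x0, y0 = max(x - crop // 2, 0), max(y - crop // 2, 0)
        p = pred[..., y0:y0 + crop, x0:x0 + crop]
        t = target[..., y0:y0 + crop, x0:x0 + crop]
        loss = loss + F.l1_loss(p, t)                          # pixel-value difference
        if feat_net is not None:                               # e.g. a frozen VGG extractor
            loss = loss + F.l1_loss(feat_net(p), feat_net(t))  # image-feature difference
    return loss / len(keypoints)

loss = local_difference_loss(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128),
                             keypoints=[(40, 60), (90, 60)])
```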
Optionally, the global comprehensive difference loss is determined as follows:
obtain a global pixel-value loss based on the per-pixel value differences between the predicted standard image and the sample standard image, and obtain a multi-scale feature loss based on the image-feature differences between the two at multiple preset scales;
combine the global pixel-value loss and the multi-scale feature loss to obtain the corresponding global comprehensive difference loss.
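A sketch of one way to realize this combination: an image-wide L1 term plus feature differences computed at several downsampled scales. The preset scales and the identity fallback for the feature extractor are assumptions.

```python
import torch
import torch.nn.functional as F

def global_comprehensive_loss(pred, target, scales=(1.0, 0.5, 0.25), feat_net=None):
    """Global pixel-value loss plus image-feature differences at several preset scales."""
    pixel_loss = F.l1_loss(pred, target)        # per-pixel difference over the whole image
    extract = feat_net if feat_net is not None else (lambda x: x)
    feat_loss = pred.new_zeros(())
    for s in scales:
        if s != 1.0:
            p = F.interpolate(pred, scale_factor=s, mode='bilinear', align_corners=False)
            t = F.interpolate(target, scale_factor=s, mode='bilinear', align_corners=False)
        else:
            p, t = pred, target
        feat_loss = feat_loss + F.l1_loss(extract(p), extract(t))
    return pixel_loss + feat_loss               # combined global comprehensive loss

loss = global_comprehensive_loss(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```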
Optionally, the training unit 402 completes pre-training of the image generation model as follows:
obtain a specified dataset, and perform monocular depth estimation on each sample skeleton map in the dataset to obtain a sample depth map for each, where the dataset includes sample standard images and sample skeleton maps of each sample object in different poses;
construct a pre-training sample set from the sample standard images, sample skeleton maps, and sample depth maps obtained from the dataset, perform multiple rounds of iterative training on the initial image generation model based on the pre-training sample set, and output the pre-trained image generation model.
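For the depth maps used in pre-training, an off-the-shelf monocular depth estimator could be applied to each dataset sample. The sketch below uses MiDaS via torch.hub as one possible choice; the application does not name a specific estimator, so treat the model choice and entry-point names as assumptions (the call downloads weights on first use).

```python
import torch

# One possible off-the-shelf estimator: MiDaS small model via torch.hub (assumed choice).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def estimate_depth(rgb_np):
    """rgb_np: HxWx3 uint8 RGB image; returns an HxW relative-depth prediction."""
    with torch.no_grad():
        pred = midas(transform(rgb_np))            # (1, H', W') raw prediction
        return torch.nn.functional.interpolate(    # resize back to the input resolution
            pred.unsqueeze(1), size=rgb_np.shape[:2],
            mode="bicubic", align_corners=False).squeeze().numpy()
```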
Optionally, the training unit 402 determines the learning rate used in each iteration round of training the pre-trained image generation model in either of the following ways:
using a preset cosine annealing algorithm and a preset initial learning rate, determine the learning rate for each training cycle, and determine the target learning rate for the current iteration according to the training cycle the current iteration belongs to, where one training cycle includes at least one iteration round;
based on a preset initial learning rate and a learning rate decay coefficient, determine the learning rate for each training cycle, and determine the target learning rate for the current iteration according to the training cycle the current iteration belongs to, where one training cycle includes at least one iteration round.
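Both schedule options reduce to simple per-cycle formulas; the initial learning rate, minimum rate, and decay coefficient below are assumed values, not ones specified by the application.

```python
import math

def cosine_lr(cycle, total_cycles, lr0=2e-4, lr_min=1e-6):
    """Cosine annealing: learning rate for the training cycle the current iteration belongs to."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * cycle / total_cycles))

def decayed_lr(cycle, lr0=2e-4, decay=0.95):
    """Per-cycle decay by a fixed learning rate decay coefficient."""
    return lr0 * decay ** cycle

schedule = [(cosine_lr(c, 100), decayed_lr(c)) for c in range(100)]
```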
Optionally, the apparatus further includes a generation unit 403, configured to:
obtain a reference image of the target object under a reference action, and a planar skeleton map and planar depth map of the target object in a specified pose, where the planar skeleton map includes hand bones;
use the target image generation model to perform motion transfer processing on the reference image based on the planar skeleton map and planar depth map, obtaining a target image of the target object in the specified pose.
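A usage sketch of this inference path, reusing the Generator class from the architecture sketch above; the input sizes and random stand-in tensors are placeholders for real reference images and pose maps.

```python
import torch

model = Generator().eval()                        # trained target image generation model (assumed)
ref_img = torch.randn(1, 3, 64, 64)               # reference image of the target object
pose_sequence = [(torch.randn(1, 3, 64, 64),      # planar skeleton map (including hand bones)
                  torch.randn(1, 1, 64, 64))      # planar depth map
                 for _ in range(8)]               # one entry per desired pose
with torch.no_grad():
    frames = [model(ref_img, skel, depth) for skel, depth in pose_sequence]
```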
Having introduced the image generation model training method and apparatus of the exemplary embodiments of the present application, an electronic device according to another exemplary embodiment of the present application is introduced next.
Those skilled in the art will understand that the various aspects of the present application may be implemented as a system, a method, or a program product. Therefore, the aspects of the present application may take the form of a wholly hardware embodiment, a wholly software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software, which may be collectively referred to herein as a "circuit", "module", or "system".
Based on the same inventive concept as the method embodiments above, an embodiment of the present application further provides an electronic device. Refer to Figure 5, a schematic diagram of the hardware structure of an electronic device to which an embodiment of the present application is applied. The electronic device 500 may include at least a processor 501 and a memory 502. The memory 502 stores program code which, when executed by the processor 501, causes the processor 501 to perform the steps of any of the image generation model training methods described above.
In some possible implementations, a computing apparatus according to the present application may include at least one processor and at least one memory. The memory stores program code which, when executed by the processor, causes the processor to perform the image generation model training steps according to the various exemplary embodiments of the present application described above in this specification; for example, the processor may perform the steps shown in Figure 2A.
A computing apparatus 600 according to this embodiment of the present application is described below with reference to Figure 6. As shown in Figure 6, the computing apparatus 600 takes the form of a general-purpose computing device. Its components may include, but are not limited to: the above-mentioned at least one processing unit 601, the above-mentioned at least one storage unit 602, and a bus 603 connecting the different system components (including the storage unit 602 and the processing unit 601).
The bus 603 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 602 may include readable media in the form of volatile memory, such as random access memory (RAM) 6021 and/or cache memory 6022, and may further include read-only memory (ROM) 6023.
The storage unit 602 may also include a program/utility 6025 having a set of (at least one) program modules 6024, including but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The computing apparatus 600 may also communicate with one or more external devices 604 (such as a keyboard or pointing device), with one or more devices that enable a user to interact with the computing apparatus 600, and/or with any device (such as a router or modem) that enables the computing apparatus 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 605. In addition, the computing apparatus 600 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 606. As shown in the figure, the network adapter 606 communicates with the other modules of the computing apparatus 600 via the bus 603. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computing apparatus 600, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Based on the same inventive concept as the method embodiments above, the various aspects of the image generation model training provided by the present application may also be implemented in the form of a program product, which includes program code. When the program product runs on an electronic device, the program code causes the electronic device to perform the steps of the image generation model training methods according to the various exemplary embodiments of the present application described above in this specification; for example, the electronic device may perform the steps shown in Figure 2A.
The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
Although preferred embodiments of the present application have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of the present application.
Obviously, those skilled in the art may make various changes and variations to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to encompass them.