CN118052904A - Pose-guided person image generation method based on a generative adversarial network - Google Patents

Pose-guided person image generation method based on a generative adversarial network

Info

Publication number
CN118052904A
CN118052904A (Application CN202410120166.0A)
Authority
CN
China
Prior art keywords
image
target
texture
source
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410120166.0A
Other languages
Chinese (zh)
Inventor
魏巍
秦超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202410120166.0A
Publication of CN118052904A
Legal status: Pending (Current)


Abstract

The invention provides a pose-guided person image generation method based on a generative adversarial network. It adopts a two-stage design to synthesize person images under pose guidance while improving the agreement between the generated image and the target pose: a coarse target image generation stage and a texture refinement network stage. The coarse target image generation stage produces a coarse image with the target pose. The texture refinement stage lets the coarse image learn texture and style information from the source image through a new texture transfer module and a texture-correlated attention module, turning the coarse image into a realistic image that carries both the target pose and the source-image information. The method comprises the following steps: data preprocessing; training the coarse target image generation stage to obtain a coarse person image; training the texture refinement network stage to obtain a realistic person image with a specific pose; and training with the loss functions. The method yields high image quality and better preserves the style and texture information of the source image.

Description

Translated from Chinese
Pose-guided person image generation method based on a generative adversarial network

Technical Field

The present invention belongs to the field of image synthesis, and in particular to image synthesis techniques for human pose transfer. Specifically, it is a pose-guided person image generation method based on a generative adversarial network.

Background Art

In recent years, with the continuous improvement and development of deep learning, pose-guided person image generation has become a research direction that attracts much attention in computer vision. By combining pose information with image generation techniques, this technology can generate realistic person images with a specified pose while preserving their appearance details. Pose-guided person image generation has therefore received extensive attention and research, and has broad application potential in virtual try-on, video generation, person re-identification, and related fields.

Although existing human pose transfer methods can synthesize fairly realistic person images, they struggle during pose conversion: key-point correspondences are sparse, pose and texture vary widely, and editing ability is limited, so they cannot accurately capture the mapping of texture shape and style information between the source image and the target image. In addition, attention-based methods do not explicitly learn the spatial transformation between different poses, and information may be lost over repeated transfers, leading to blurred details and poor generated person images.

Summary of the Invention

In view of the shortcomings of the prior art, the present invention provides a pose-guided person image generation method based on a generative adversarial network. It can generate structurally accurate, realistic images, produces finer appearance textures and more lifelike results, and better preserves features of the source image such as clothing texture.

The technical solution adopted by the present invention to solve the technical problem is as follows:

A pose-guided person image generation method based on a generative adversarial network, comprising the following steps:

S1, data preprocessing: extract human poses with the joint-point detector HPE, group each person with its corresponding poses, arrange and combine the pictures within each group to form a dataset, and split the dataset into a training set and a test set whose person identities do not overlap;

S2, train the coarse target image generation stage network to obtain a coarse person image: take the source image xs, the source pose ps, and the target pose pt as input, pass them through the encoder En1, residual blocks, and decoder De1, and output a coarse person image x~t consistent with the target pose;

S3, train the texture refinement network stage to generate a realistic person image with a specific pose: three types of input (the source image xs, the combination of the source image xs with the source pose ps, and the combination of the coarse person image x~t with the target pose pt) are passed through the encoders En2 and En3 to obtain the source image texture, the source pose change, and the target pose change; the source image texture, source pose change, and target pose change are then fed into the texture transfer module TTM, and the output of the TTM is passed through the decoder De2 to obtain a realistic person image with the specific pose;

S4, use the loss functions to train the network model that contains the two person image generation stages S2 and S3.

Further, step S2 specifically comprises:

S2_1: first take the source image xs, source pose ps, and target pose pt as input and fuse them row by row with the cat function to generate a new feature-map tensor;

S2_2: pass the fused feature-map tensor through the three-layer encoder En1. Each layer of En1 first applies a 4×4 convolution (with normalization and an activation function) that downsamples the feature map, halving its size, and then applies a 3×3 convolution that extracts local information and enhances features without changing the feature-map size. The channel dimensions of the three encoder layers change as 3->64, 64->128, 128->256, and the encoder En1 finally outputs a feature tensor with resolution 32×32 and 256 channels;

S2_3: feed the feature tensor output by the encoder En1 into residual blocks to obtain a feature tensor with unchanged resolution and dimension. Inside a residual block, the tensor first passes through two 3×3 convolutional layers, and the input is then passed through a 1×1 convolutional layer and added via a skip connection to capture the details of the input features;

S2_4: take the feature tensor obtained from the residual blocks as the input of the decoder De1 and output the coarse person image x~t. Decoder De1 procedure: first apply a 3×3 convolution to the input feature tensor; then apply a 3×3 transposed convolution with stride 2; then skip-connect this with the result of applying a transposed convolution to the input feature tensor; repeat the above operations twice more to obtain a feature map with resolution 256×256 and 64 channels; finally apply one 3×3 convolution to reduce its dimension to 3, converting it into an RGB image.

Further, step S3 specifically comprises:

S3_1: the texture refinement network stage takes three types of input (the source image xs, the combination of the source image xs with the source pose ps, and the combination of the coarse person image x~t with the target pose pt) and passes them through the encoders En2 and En3 to obtain the source image texture, the source pose change, and the target pose change, where the encoder En2 for the source image xs and source pose ps shares weights with the encoder En3 for the coarse person image x~t and the target pose pt;

S3_2: feed the source image texture, source pose change, and target pose change into the texture transfer module TTM. The TTM is built from the multi-head attention mechanism MHA and the multi-layer perceptron MLP, and is divided into two parts: the source pose transformation texture and the target pose transformation texture;

S3_3: pass the target transformed texture Tt obtained by the texture transfer module TTM through the decoder De2; finally, each layer output of the decoder De2 is projected through To-RGB skip connections, upsampled, and summed to obtain a realistic person image with the specific pose.

Further, step S3_2 specifically comprises:

S3_2a: in the source pose transformation texture, the source pose change is processed with a residual connection and multi-head attention to obtain Ts1; Ts1 is then processed with batch normalization and the multi-layer perceptron to obtain Ts2; Ts2 is processed with batch normalization and multi-head attention to obtain Ts3; finally Ts3 is processed with batch normalization and the multi-layer perceptron to obtain the source transformed texture Ts. The formulas for Ts2, Ts3, and Ts are as follows:

Ts2 = BN(Ts1) + MLP(BN(Ts1))

Ts3 = BN(Ts2) + MHA(BN(Ts2), BN(Ts2), BN(Ts2))

Ts = BN(BN(Ts3) + MLP(BN(Ts3)))

where Res denotes residual processing, MHA denotes multi-head attention, BN denotes batch normalization, and MLP denotes the multi-layer perceptron;

S3_2b: in the target pose transformation texture, the target pose change is first processed with a residual connection and multi-head attention to obtain Tt1; Tt1 is then processed with batch normalization and multi-head attention to obtain Tt2; Tt2 is processed with batch normalization and the multi-layer perceptron to obtain Tt3; Tt3 is processed with batch normalization and multi-head attention to obtain Tt4; Tt4 is processed with batch normalization and multi-head attention to obtain Tt5; finally Tt5 is processed with batch normalization and the multi-layer perceptron to obtain the target transformed texture Tt. The formulas for Tt3, Tt4, and Tt are as follows:

Tt3 = BN(Tt2) + MLP(BN(Tt2))

Tt4 = BN(Tt3) + MHA(BN(Tt3), BN(Tt3), BN(Tt3))

Tt = BN(BN(Tt5) + MLP(BN(Tt5)))

where Res denotes residual processing, MHA denotes multi-head attention, BN denotes batch normalization, and MLP denotes the multi-layer perceptron.

Further, step S4 is specifically:

The loss function of the entire network model is expressed by the following formula:

L = LCTIG + LTRN

where LCTIG denotes the loss function of the coarse target image generation stage and LTRN denotes the loss function of the texture refinement network stage;

In the coarse target image generation stage, the l1 loss is used to train the network model, with the following formula:

where the coefficient denotes the weight of the l1 loss in the coarse target image generation stage, xt is the real target person image, and x~t is the coarse person image;

In the texture refinement network stage, four loss functions are used to generate the person image: the reconstruction loss Ll1, the perceptual loss Lperc, the generative adversarial loss Ladv, and the style loss Lstyle. The formula is as follows, where λl1, λperc, λadv, λstyle denote the weights of the different loss functions:

Further, the reconstruction loss computes the l1 loss in pixel space between the generated person image and the real target person image xt, making the generated image consistent with the target image in color and texture and improving the quality of the generated image. The reconstruction loss formula is as follows:

Perceptual loss Lperc: the l1 loss between features extracted in multi-scale space from a pre-trained VGG-19 network. The perceptual loss formula is given below, where the feature term denotes the features extracted from the i-th convolutional layer of the pre-trained VGG-19 model:

Generative adversarial loss Ladv: the generated person image and the real target person image xt are fed to the image discriminator D, and the distance between their distributions is penalized, so that the distribution of the generated person image moves ever closer to that of the target person image xt. The generative adversarial loss formula is as follows:

Ladv = E[log(1 - D(TRN(CTIG(xs, ps, pt), xs, ps, pt)))] + E[log(D(xt))]

Style loss Lstyle: the style loss computes the statistical difference between the activation maps of the generated person image and the real target person image xt, and strengthens the similarity of the generated image's color and style to those of the target person image.

The beneficial effects of the present invention include:

The present invention provides a new human pose transfer network model, HPT-GAN. HPT-GAN is a two-stage network model comprising a coarse target image generation (CTIG) stage and a texture refinement network (TRN) stage. The CTIG stage generates a coarse person image consistent with the target pose. The TRN stage lets the coarse target image produced in the first stage learn texture details and contextual relationships of the source image; semantic entities are extracted, reorganized, and manipulated in an attention module to generate the final target image. The model introduces a series of residual blocks (ResBlocks) between the conventional encoder and decoder, which prevents missing limbs in the coarse image during the CTIG stage. In the TRN stage, the model proposes a new texture transfer module (TTM), which addresses the weak alignment between the initial clothing texture and the target pose and the obvious artifacts that arise under large pose deformations. Also in the TRN stage, the RGB outputs at different resolutions are upsampled and summed, and the decoder is combined with skip connections to generate more realistic images, bringing the generated person images closer to real images.

Brief Description of the Drawings

FIG. 1 is a structural diagram of the network model of the present invention;

FIG. 2 is a structural diagram of the encoder, decoder, and residual block modules;

FIG. 3 shows example outputs of the coarse target image generation stage;

FIG. 4 is a structural diagram of the texture transfer module TTM;

FIG. 5 is a structural diagram of the decoder with skip connections;

FIG. 6 shows a qualitative comparison with other algorithms.

Detailed Description of the Embodiments

To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present application clearer, the application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present application and are not intended to limit it.

To solve the problems in the prior art, the present invention proposes a person image generation method that comes closer to human aesthetic expectations. Previous methods achieved good results on pose-guided person image generation but still have the following flaws: 1) the generated person images can exhibit missing limbs and artifacts; 2) they focus only on training the source-to-target task of transforming the source image from the source pose to the target pose, which makes the generator's layers unwieldy; 3) the generator relies on repeated convolution and pooling to obtain a large receptive field, and repeating this operation many times loses important texture details of the original image, so the network cannot properly capture the texture mapping between the source and target images or important details and spatial context, especially under large pose changes.

Based on the above problems, the human pose transfer network model (HPT-GAN) proposed by the present invention uses the DeepFashion dataset as its processing object. To prevent missing limbs, residual blocks are introduced when generating the target pose; an attention mechanism (the texture transfer TTM module) then extracts and transfers the texture similarity and contextual relationships of corresponding local regions; finally, decoder skip connections (To-RGB) fuse features across different levels. The whole pipeline consists of the following steps: data preprocessing; training the coarse target image generation stage to obtain a coarse person image; training the texture refinement network stage to obtain a realistic person image with a specific pose; and the loss functions. The network model structure is shown in FIG. 1, and the specific technical scheme is as follows:
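Read as a pipeline, the two stages compose as in the small Python sketch below; the module call signatures are assumptions, but the composition itself mirrors the TRN(CTIG(xs, ps, pt), xs, ps, pt) term used later in the adversarial loss.

```python
def hpt_gan_forward(ctig, trn, x_s, p_s, p_t):
    """Two-stage forward pass: the CTIG stage produces a coarse image with the
    target pose, and the TRN stage refines it using the source image and both
    poses, matching the composition TRN(CTIG(x_s, p_s, p_t), x_s, p_s, p_t)."""
    x_coarse = ctig(x_s, p_s, p_t)            # coarse person image in the target pose
    x_refined = trn(x_coarse, x_s, p_s, p_t)  # realistic image carrying source texture/style
    return x_coarse, x_refined
```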

S1: Data preprocessing:

First, the trained joint-point detector HPE is used to extract human poses; each person and its corresponding poses are then grouped, and the pictures within each group are arranged and combined to form the dataset. The dataset used in this scheme is DeepFashion, which contains 52,712 high-quality person images at a resolution of 256×256. It is split into 101,966 training pairs and 8,570 test pairs, and the person identities of the training and test sets do not overlap. A minimal sketch of this pairing step is given below.
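For illustration only, the following Python sketch builds ordered (source, target) pairs within each identity group while keeping training and test identities disjoint; the dictionary layout, split-ratio argument, and function names are assumptions, not details taken from the patent.

```python
from itertools import permutations
import random

def build_pairs(images_by_person, train_identity_ratio=0.9, seed=0):
    """Split person identities disjointly into train/test, then enumerate every
    ordered (source, target) pair of images within each identity group."""
    identities = sorted(images_by_person)
    random.Random(seed).shuffle(identities)
    n_train = int(len(identities) * train_identity_ratio)
    train_ids, test_ids = identities[:n_train], identities[n_train:]

    def pairs_for(ids):
        pairs = []
        for pid in ids:
            # all ordered pairs of two distinct images of the same person
            pairs.extend(permutations(images_by_person[pid], 2))
        return pairs

    return pairs_for(train_ids), pairs_for(test_ids)
```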

S2: Training the coarse target image generation stage network to obtain a coarse person image:

First, the source image xs, source pose ps, and target pose pt are taken as input and fused (concatenated row by row with the cat function) to generate a new feature-map tensor, which then passes through the three-layer encoder En1. Each encoder layer first applies a 4×4 convolution (with normalization and an activation function) that downsamples the feature map, halving its size; it then applies a 3×3 convolution whose main purpose is to extract local information and enhance features without changing the feature-map size. The channel dimensions of the three encoder layers change as 3->64, 64->128, 128->256. The encoder finally outputs a new feature tensor with resolution 32×32 and 256 channels.
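A minimal PyTorch sketch of such a three-stage downsampling encoder is given below; the specific normalization (BatchNorm), activation (LeakyReLU), padding, and class name are assumptions filled in around the description rather than details fixed by the patent.

```python
import torch.nn as nn

class EncoderEn1(nn.Module):
    """Three-stage encoder: each stage halves the spatial size with a 4x4
    stride-2 convolution, then refines features with a same-size 3x3
    convolution. Channel widths follow 3 -> 64 -> 128 -> 256."""
    def __init__(self, in_ch=3, widths=(64, 128, 256)):
        super().__init__()
        layers, c_prev = [], in_ch
        for c in widths:
            layers += [
                nn.Conv2d(c_prev, c, kernel_size=4, stride=2, padding=1),  # downsample (size / 2)
                nn.BatchNorm2d(c),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(c, c, kernel_size=3, stride=1, padding=1),       # local refinement (size kept)
                nn.BatchNorm2d(c),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            c_prev = c
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # x: fused (x_s, p_s, p_t) input with 3 channels; a 256x256 input
        # comes out as a 256-channel feature map at 32x32.
        return self.body(x)
```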

The new feature tensor output by the encoder is fed into residual blocks (ResBlocks) to obtain a new feature tensor with unchanged resolution and dimension. ResBlock procedure: the input feature map is processed by two convolutional layers with 3×3 kernels for feature extraction and transformation; this step includes convolution, normalization (batch normalization), and an activation function (LeakyReLU). The input feature map is then passed through a 1×1 convolutional layer and added via a skip connection to capture the details and low-level information of the input features, which helps mitigate the vanishing-gradient problem and lets the network learn and propagate gradients more effectively.
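Under the same hedged assumptions about layer details, the residual block described above might be sketched as:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block placed between encoder and decoder: two 3x3 convolutions
    (with batch normalization and LeakyReLU) on the main path, a 1x1
    convolution on the skip path; resolution and channel count are preserved."""
    def __init__(self, channels=256):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.skip = nn.Conv2d(channels, channels, 1)  # 1x1 convolution on the skip connection

    def forward(self, x):
        return self.main(x) + self.skip(x)
```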

Finally, the new feature tensor obtained from the ResBlocks is used as the input of the decoder De1, which outputs the coarse person image x~t. Decoder De1 procedure: first, a 3×3 convolution (with normalization and an activation function) is applied to the input feature map to extract local information and enhance features without changing its size; then a 3×3 transposed convolution with stride 2 performs upsampling, doubling the feature-map size; the result is then skip-connected with the result of applying a transposed convolution to the input feature map, which helps the decoder capture the multi-scale features, semantic information, and contextual relationships of the input data and improves the model's representational ability. The above operations are repeated twice more to obtain a feature map with resolution 256×256 and 64 channels, and one final 3×3 convolution reduces its dimension to 3, converting it into an RGB image.
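The decoder De1 could be sketched as follows. The exact wiring of the skip connection ("skip-connect with the result of a transposed convolution of the input") admits more than one reading; the version below upsamples the stage input directly and adds it, and the per-stage channel widths (256 -> 128 -> 64 -> 64) are an assumption.

```python
import torch.nn as nn

class UpStage(nn.Module):
    """One De1 stage: 3x3 refinement convolution, 3x3 stride-2 transposed
    convolution for 2x upsampling, plus a transposed-convolution skip path
    from the stage input (one reading of the described skip connection)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1),
            nn.BatchNorm2d(c_in),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.up = nn.ConvTranspose2d(c_in, c_out, 3, stride=2, padding=1, output_padding=1)
        self.skip_up = nn.ConvTranspose2d(c_in, c_out, 3, stride=2, padding=1, output_padding=1)

    def forward(self, x):
        return self.up(self.refine(x)) + self.skip_up(x)

class DecoderDe1(nn.Module):
    """Decoder De1: three upsampling stages (32x32 -> 256x256), then a final
    3x3 convolution to 3 channels to produce the coarse RGB image."""
    def __init__(self, widths=(256, 128, 64, 64)):
        super().__init__()
        self.stages = nn.Sequential(*[UpStage(widths[i], widths[i + 1]) for i in range(3)])
        self.to_rgb = nn.Conv2d(widths[-1], 3, 3, padding=1)

    def forward(self, x):
        return self.to_rgb(self.stages(x))
```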

S3: Training the texture refinement network stage to generate a realistic person image with a specific pose:

The texture refinement network stage has three types of input: the source image xs, the combination of the source image xs with the source pose ps, and the combination of the coarse person image x~t with the target pose pt. They are passed through the encoders En2 and En3 (where the encoder En2 for the source image xs and source pose ps shares weights with the encoder En3 for the coarse image x~t and the target pose pt) to obtain the source image texture, the source pose change, and the target pose change.

The source image texture, source pose change, and target pose change are then used as inputs to the texture transfer module TTM and the texture-correlated attention module TCAM, respectively. The TCAM module is a prior-art module that is not improved by the present invention and is not described here. The design goal of the texture transfer module TTM is to transfer the texture and style information of the source image effectively during image generation. To this end, an association mechanism is introduced that performs similarity matching and contextual-relationship modeling between local regions of the source domain and the target domain. This association mechanism lets the module accurately capture the correspondence between the source and target images and thus transfer texture and style during generation. The structure of the TTM is shown in FIG. 4; it adopts a multi-head attention mechanism (MHA) and a multi-layer perceptron (MLP) and is divided into two parts: the source pose transformation texture and the target pose transformation texture.

In the source pose transformation texture, to retain key features and textures of the source image during the fusion of pose and feature information, the source pose-change feature is first processed with a ResBlock and MHA to obtain Ts1; Ts1 is then processed with batch normalization (BN) and the multi-layer perceptron (MLP) to obtain Ts2; Ts2 is processed with batch normalization (BN) and multi-head attention (MHA) to obtain Ts3; finally Ts3 is processed with batch normalization (BN) and the multi-layer perceptron (MLP) to obtain the source transformed texture Ts. The formulas for Ts2, Ts3, and Ts are shown below, followed by an illustrative sketch of this branch.

Ts2 = BN(Ts1) + MLP(BN(Ts1))

Ts3 = BN(Ts2) + MHA(BN(Ts2), BN(Ts2), BN(Ts2))

Ts = BN(BN(Ts3) + MLP(BN(Ts3)))
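As an illustration of the source-branch equations above, the following PyTorch sketch treats the features as token sequences; the head count, MLP width, token layout, and the exact form of Ts1 (whose formula is not reproduced above and is approximated here as residual self-attention over the source pose-change feature) are assumptions.

```python
import torch.nn as nn

class TokenBN(nn.Module):
    """Batch normalization over the channel dimension of (B, N, C) tokens."""
    def __init__(self, dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):                       # x: (B, N, C)
        return self.bn(x.transpose(1, 2)).transpose(1, 2)

class TTMSourceBranch(nn.Module):
    """Source branch of the TTM, following Ts2 = BN(Ts1) + MLP(BN(Ts1)),
    Ts3 = BN(Ts2) + MHA(BN(Ts2), BN(Ts2), BN(Ts2)),
    Ts  = BN(BN(Ts3) + MLP(BN(Ts3)))."""
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.bn1, self.bn2, self.bn3, self.bn_out = (TokenBN(dim) for _ in range(4))
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, f_s2t):                   # f_s2t: (B, N, C) source pose-change tokens
        t_s1 = f_s2t + self.attn1(f_s2t, f_s2t, f_s2t)[0]   # assumed form of Ts1 (residual + MHA)
        n1 = self.bn1(t_s1)
        t_s2 = n1 + self.mlp1(n1)
        n2 = self.bn2(t_s2)
        t_s3 = n2 + self.attn2(n2, n2, n2)[0]
        n3 = self.bn3(t_s3)
        return self.bn_out(n3 + self.mlp2(n3))              # Ts
```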

In the target pose transformation texture, the aim is to fuse the feature and texture information contained in the source transformed texture Ts, the source image texture, and the target pose change to generate the target transformed texture Tt, so that Tt, while keeping the target pose, accurately aligns the clothing shape, style, texture, and related information of the source style with the target pose. The target pose change is first processed with a ResBlock and MHA to obtain Tt1; Tt1 is then processed with BN and MHA (to capture the relationships involving Tt1 and generate the corresponding attention weights, which helps pose alignment, feature interaction and integration, and context modeling) to obtain Tt2; Tt2 is processed with BN and MLP to obtain Tt3; Tt3 is processed with BN and MHA to obtain Tt4; Tt4 is processed with BN and MHA to obtain Tt5; finally Tt5 is processed with BN and MLP to obtain the target transformed texture Tt. The formulas for Tt3, Tt4, and Tt are shown below.

Tt3 = BN(Tt2) + MLP(BN(Tt2))

Tt4 = BN(Tt3) + MHA(BN(Tt3), BN(Tt3), BN(Tt3))

Tt = BN(BN(Tt5) + MLP(BN(Tt5)))

In both the source pose transformation texture and the target pose transformation texture, the ResBlock, MHA, and MLP operations help retain important features of the source image and extract the relevant features of the target image. The ResBlock preserves the original information of the source image through skip connections; MHA's multi-head mechanism models the relationships between different positions in the input sequence and captures global context as well as long-range dependencies in the sequence data, which helps understand and model the relationship between the source and target images for more accurate pose transfer; and the MLP extracts nonlinear features. The combination of these operations helps retain and extract key features such as pose and expression, promotes their interaction and integration, and improves the expressiveness and diversity of the features, thereby achieving more accurate pose transfer.

The target transformed texture Tt obtained by the TTM module is then passed through the decoder De2 to obtain a realistic person image with the specific pose. The decoder De2 incorporates residual blocks and image skip connections (To-RGB); the decoder/skip-connection structure is shown in FIG. 5. The residual blocks are added to reduce distortion during image reconstruction while reducing redundant information and improving the quality of the generated image. The image skip connections are added to obtain RGB images at different resolutions, which are then upsampled and summed. Upsampling and summing the RGB images at different resolutions retains more detail in the image and combines high-level features with low-level features.
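The To-RGB aggregation can be sketched as follows: each decoder level is projected to a 3-channel image, upsampled to the output resolution, and summed. The per-level channel widths, the bilinear upsampling, and the final tanh output range are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToRGBHead(nn.Module):
    """Projects each decoder level to RGB, upsamples every projection to the
    target resolution, and sums the contributions into one output image."""
    def __init__(self, level_channels=(256, 128, 64), out_size=256):
        super().__init__()
        self.to_rgb = nn.ModuleList([nn.Conv2d(c, 3, kernel_size=1) for c in level_channels])
        self.out_size = out_size

    def forward(self, level_feats):
        # level_feats: list of per-level decoder feature maps, coarse to fine
        out = 0
        for feat, proj in zip(level_feats, self.to_rgb):
            rgb = proj(feat)
            out = out + F.interpolate(rgb, size=self.out_size,
                                      mode='bilinear', align_corners=False)
        return torch.tanh(out)  # assumed output range [-1, 1]
```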

S4: Loss functions:

The model of this scheme contains two stages of person image generation, and the overall loss function can be expressed by the following formula (LCTIG denotes the loss function of the coarse target image generation stage; LTRN denotes the loss function of the texture refinement network stage).

L = LCTIG + LTRN

In the coarse target image generation stage, the l1 loss is used to train the network, with the formula below (the coefficient denotes the weight of the l1 loss in this stage; since this stage only needs to produce a coarse person image x~t containing the target pose, the l1 loss alone is sufficient):

In the texture refinement stage, four loss functions are used to generate the person image: the reconstruction loss Ll1, the perceptual loss Lperc, the generative adversarial loss Ladv, and the style loss Lstyle. The formula is given below, where λl1, λperc, λadv, λstyle denote the weights of the different loss functions.

Reconstruction loss: the l1 loss in pixel space between the generated person image and the real target person image xt makes the generated image and the target image consistent in color and texture and improves the quality of the generated image. The reconstruction loss formula is as follows.

Perceptual loss: the l1 distance between features extracted in multi-scale space from a pre-trained VGG-19 [43] network. The perceptual loss Lperc is formulated below, where the feature term denotes the features extracted from the i-th convolutional layer of the pre-trained VGG-19 model.

Generative adversarial loss: the generated person image and the real target person image xt are used as the input of the image discriminator D, and the distance between their distributions is penalized so that the distribution of the generated person pose image moves ever closer to that of the target image xt. The generative adversarial loss Ladv is formulated as follows.

Ladv = E[log(1 - D(TRN(CTIG(xs, ps, pt), xs, ps, pt)))] + E[log(D(xt))]

Style loss: the style loss computes the statistical difference between the activation maps of the generated person image and the real target person image xt, and strengthens the similarity of the generated image's color and style to those of the target image.
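A compact sketch of the two-stage objective L = LCTIG + LTRN is given below. The weight values, the non-saturating form of the generator's adversarial term, and the perceptual/style callables (perc_fn, style_fn, e.g. VGG-19 feature and Gram-matrix distances) are placeholders, not values specified in the patent.

```python
import torch
import torch.nn.functional as F

def total_loss(coarse, refined, target, d_fake_logits, perc_fn, style_fn,
               w_coarse_l1=1.0, w_l1=1.0, w_perc=1.0, w_adv=1.0, w_style=1.0):
    """Two-stage objective: coarse-stage weighted l1 plus the refinement-stage
    reconstruction, perceptual, adversarial, and style terms."""
    # Coarse target image generation stage: weighted l1 against the real target.
    l_ctig = w_coarse_l1 * F.l1_loss(coarse, target)

    # Texture refinement stage.
    l_rec = F.l1_loss(refined, target)
    l_perc = perc_fn(refined, target)      # VGG-19 feature distance (placeholder)
    l_style = style_fn(refined, target)    # Gram-matrix style distance (placeholder)
    l_adv = F.binary_cross_entropy_with_logits(            # generator adversarial term
        d_fake_logits, torch.ones_like(d_fake_logits))
    l_trn = w_l1 * l_rec + w_perc * l_perc + w_adv * l_adv + w_style * l_style

    return l_ctig + l_trn
```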

This completes the description of the HPT-GAN network model. We use SSIM, FID, LPIPS, and PSNR to evaluate the network model; these are evaluation methods used in existing image generation tasks. The structural similarity index measure (SSIM) is a structural similarity metric that splits the similarity measurement task into three comparisons: luminance, contrast, and structure. The peak signal-to-noise ratio (PSNR) describes the ratio between the maximum possible power of an image and the power of the corrupting noise. Learned perceptual image patch similarity (LPIPS) uses a human-labeled dataset to train a model that scores the similarity between generated and real images. SSIM and PSNR measure the quality of the generated image at the pixel level, while LPIPS measures the difference between the generated image and the real image in the perceptual domain. In addition, the Fréchet Inception Distance (FID) measures image realism by computing the Fréchet distance between the distributions of real and synthesized images in an Inception network's feature space. For these metrics, higher SSIM and PSNR are better, while lower LPIPS and FID are better.
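For the pixel-level metrics, SSIM and PSNR can be computed directly with scikit-image, as in the sketch below; LPIPS and FID additionally require pretrained networks (for example the lpips and pytorch-fid packages) and are therefore only noted here.

```python
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def pixel_metrics(generated, target):
    """SSIM and PSNR between a generated image and its ground truth,
    both given as uint8 HxWx3 arrays."""
    ssim = structural_similarity(target, generated, channel_axis=2, data_range=255)
    psnr = peak_signal_noise_ratio(target, generated, data_range=255)
    return ssim, psnr
```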

Quantitative comparison: the method of this scheme is compared with open-source methods from top venues in recent years, including DPTN, TCAN, PISE, SPIG, and GFLA. Table 1 shows the quantitative results for image quality and model size. As can be seen from Table 1, among all compared methods, our method ranks first on three metrics, second on one, and third on one, which means that the model proposed in this scheme can generate structurally accurate and realistic images.

Table 1: Quantitative comparison of image quality and model size with several state-of-the-art methods

Qualitative comparison: FIG. 6 compares the results produced by our method with those of the most advanced methods of recent years. It is clear that for pose-guided person image generation, all of these methods achieve realistic results for simple clothing textures and pose changes (e.g., the first row), but for complex clothing textures and pose changes, the method proposed in this scheme produces finer appearance textures and more realistic results, and better preserves features of the source image such as clothing texture (e.g., the fourth, fifth, and sixth rows).

Claims (6)

Translated from Chinese
1. A pose-guided person image generation method based on a generative adversarial network, characterized by comprising the steps of:

S1, data preprocessing: extracting human poses with the joint-point detector HPE, grouping each person with its corresponding poses, arranging and combining the pictures within each group to form a dataset, and splitting the dataset into a training set and a test set whose person identities do not overlap;

S2, training the coarse target image generation stage network to obtain a coarse person image: taking the source image xs, the source pose ps, and the target pose pt as input, passing them through the encoder En1, residual blocks, and decoder De1, and outputting a coarse person image x~t consistent with the target pose;

S3, training the texture refinement network stage to generate a realistic person image with a specific pose: passing three types of input (the source image xs, the combination of the source image xs with the source pose ps, and the combination of the coarse person image x~t with the target pose pt) through the encoders En2 and En3 to obtain the source image texture, the source pose change, and the target pose change; then feeding the source image texture, source pose change, and target pose change into the texture transfer module TTM, and passing the output of the TTM through the decoder De2 to obtain a realistic person image with the specific pose;

S4, using the loss functions to train the network model that contains the two person image generation stages S2 and S3.

2. The pose-guided person image generation method based on a generative adversarial network according to claim 1, characterized in that step S2 specifically comprises:

S2_1: first taking the source image xs, source pose ps, and target pose pt as input and fusing them row by row with the cat function to generate a new feature-map tensor;

S2_2: passing the fused feature-map tensor through the three-layer encoder En1, where each layer of En1 first applies a 4×4 convolution (with normalization and an activation function) that downsamples the feature map, halving its size, and then applies a 3×3 convolution that extracts local information and enhances features without changing the feature-map size; the channel dimensions of the three encoder layers change as 3->64, 64->128, 128->256, and the encoder En1 finally outputs a feature tensor with resolution 32×32 and 256 channels;

S2_3: feeding the feature tensor output by the encoder En1 into residual blocks to obtain a feature tensor with unchanged resolution and dimension, where inside a residual block the tensor first passes through two 3×3 convolutional layers and the input is then passed through a 1×1 convolutional layer and added via a skip connection to capture the details of the input features;

S2_4: taking the feature tensor obtained from the residual blocks as the input of the decoder De1 and outputting the coarse person image x~t, where the decoder De1 first applies a 3×3 convolution to the input feature tensor, then a 3×3 transposed convolution with stride 2, then skip-connects this with the result of applying a transposed convolution to the input feature tensor, repeats the above operations twice more to obtain a feature map with resolution 256×256 and 64 channels, and finally applies one 3×3 convolution to reduce its dimension to 3, converting it into an RGB image.

3. The pose-guided person image generation method based on a generative adversarial network according to claim 1, characterized in that step S3 specifically comprises:

S3_1: the texture refinement network stage takes three types of input (the source image xs, the combination of the source image xs with the source pose ps, and the combination of the coarse person image x~t with the target pose pt) and passes them through the encoders En2 and En3 to obtain the source image texture, the source pose change, and the target pose change, where the encoder En2 for the source image xs and source pose ps shares weights with the encoder En3 for the coarse person image x~t and the target pose pt;

S3_2: feeding the source image texture, source pose change, and target pose change into the texture transfer module TTM, whose structure adopts the multi-head attention mechanism MHA and the multi-layer perceptron MLP and is divided into two parts: the source pose transformation texture and the target pose transformation texture;

S3_3: passing the target transformed texture Tt obtained by the texture transfer module TTM through the decoder De2, and finally projecting each layer output of the decoder De2 through To-RGB skip connections, upsampling and summing them to obtain a realistic person image with the specific pose.

4. The pose-guided person image generation method based on a generative adversarial network according to claim 3, characterized in that step S3_2 specifically comprises:

S3_2a: in the source pose transformation texture, processing the source pose change with a residual connection and multi-head attention to obtain Ts1, then processing Ts1 with batch normalization and the multi-layer perceptron to obtain Ts2, then processing Ts2 with batch normalization and multi-head attention to obtain Ts3, and finally processing Ts3 with batch normalization and the multi-layer perceptron to obtain the source transformed texture Ts, where:

Ts2 = BN(Ts1) + MLP(BN(Ts1))

Ts3 = BN(Ts2) + MHA(BN(Ts2), BN(Ts2), BN(Ts2))

Ts = BN(BN(Ts3) + MLP(BN(Ts3)))

where Res denotes residual processing, MHA denotes multi-head attention, BN denotes batch normalization, and MLP denotes the multi-layer perceptron;

S3_2b: in the target pose transformation texture, first processing the target pose change with a residual connection and multi-head attention to obtain Tt1, then processing Tt1 with batch normalization and multi-head attention to obtain Tt2, then processing Tt2 with batch normalization and the multi-layer perceptron to obtain Tt3, then processing Tt3 with batch normalization and multi-head attention to obtain Tt4, then processing Tt4 with batch normalization and multi-head attention to obtain Tt5, and finally processing Tt5 with batch normalization and the multi-layer perceptron to obtain the target transformed texture Tt, where:

Tt3 = BN(Tt2) + MLP(BN(Tt2))

Tt4 = BN(Tt3) + MHA(BN(Tt3), BN(Tt3), BN(Tt3))

Tt = BN(BN(Tt5) + MLP(BN(Tt5)))

where Res denotes residual processing, MHA denotes multi-head attention, BN denotes batch normalization, and MLP denotes the multi-layer perceptron.

5. The pose-guided person image generation method based on a generative adversarial network according to claim 1, characterized in that step S4 is specifically:

the loss function of the entire network model is expressed by the following formula:

L = LCTIG + LTRN

where LCTIG denotes the loss function of the coarse target image generation stage and LTRN denotes the loss function of the texture refinement network stage;

in the coarse target image generation stage, the l1 loss is used to train the network model, with a weight coefficient on the l1 loss, where xt is the real target person image and x~t is the coarse person image;

in the texture refinement network stage, four loss functions are used to generate the person image: the reconstruction loss Ll1, the perceptual loss Lperc, the generative adversarial loss Ladv, and the style loss Lstyle, with λl1, λperc, λadv, λstyle denoting the weights of the different loss functions.

6. The pose-guided person image generation method based on a generative adversarial network according to claim 5, characterized in that:

the reconstruction loss computes the l1 loss in pixel space between the generated person image and the real target person image xt, making the generated image consistent with the target image in color and texture and improving the quality of the generated image;

the perceptual loss Lperc uses the l1 loss between features extracted in multi-scale space from a pre-trained VGG-19 network, taking the features extracted from the i-th convolutional layer of the pre-trained VGG-19 model;

the generative adversarial loss Ladv feeds the generated person image and the real target person image xt to the image discriminator D and penalizes the distance between their distributions, so that the distribution of the generated person image moves ever closer to that of the target person image xt:

Ladv = E[log(1 - D(TRN(CTIG(xs, ps, pt), xs, ps, pt)))] + E[log(D(xt))]

the style loss Lstyle computes the statistical difference between the activation maps of the generated person image and the real target person image xt, and strengthens the similarity of the generated image's color and style to those of the target person image.
CN202410120166.0A (priority date 2024-01-29, filing date 2024-01-29): Pose-guided person image generation method based on a generative adversarial network, status Pending, published as CN118052904A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410120166.0A | 2024-01-29 | 2024-01-29 | Pose-guided person image generation method based on a generative adversarial network (CN118052904A, en)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410120166.0A | 2024-01-29 | 2024-01-29 | Pose-guided person image generation method based on a generative adversarial network (CN118052904A, en)

Publications (1)

Publication Number | Publication Date
CN118052904A | 2024-05-17

Family

ID=91049269

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410120166.0A | Pose-guided person image generation method based on a generative adversarial network (Pending, CN118052904A, en) | 2024-01-29 | 2024-01-29

Country Status (1)

Country | Link
CN (1) | CN118052904A (en)

Cited By (3)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN119477671A * | 2025-01-14 | 2025-02-18 | Nanjing University of Information Science and Technology | A method and device for generating human body and posture transformation images
CN119648889A * | 2024-11-22 | 2025-03-18 | Wuhan University | Building texture generation method and system based on building texture perception color loss function
CN119648889B * | 2024-11-22 | 2025-06-06 | Wuhan University | Building texture generation method and system based on building texture perception color loss function


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
