CN111524226A


Info

Publication number: CN111524226A
Application number: CN202010316895.5A
Authority: CN
Inventors: 张举勇; 蔡泓锐; 郭玉东; 彭妆
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2020-04-21
Filing date: 2020-04-21
Publication date: 2020-08-11
Anticipated expiration: 2040-04-21
Also published as: CN111524226B

Abstract


The invention discloses a keypoint detection and three-dimensional reconstruction method for caricature portraits, comprising: constructing a convolutional neural network, and collecting a dataset that contains a three-dimensional face template model together with caricature portraits, annotated two-dimensional keypoint coordinates, and three-dimensional exaggerated face models generated by existing methods; and training the network on the dataset so that the convolutional neural network outputs the deformation representation model and camera projection parameters corresponding to an input caricature, from which the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates are predicted. The method removes the need to annotate keypoints on caricatures by hand. With the help of a new face deformation representation and a large dataset, the trained convolutional neural network directly reconstructs the exaggerated three-dimensional face from the predicted deformation representation model, and obtains the two-dimensional keypoint coordinates from the camera projection parameters predicted at the same time.

Description


Keypoint detection and 3D reconstruction method for caricature portraits

Technical field

The invention relates to the fields of image processing and three-dimensional modeling, and in particular to a keypoint detection and three-dimensional reconstruction method for caricature portraits.

Background

A caricature is a form of artistic expression realized in two-dimensional images and three-dimensional models. It creates a humorous visual effect by exaggerating certain features or details of a face, and is often used in film and television, advertising, social networking, and similar settings. Work in computer vision and cognitive psychology has also shown that this form of expression can effectively improve the accuracy of face recognition. Because of its promising research prospects and wide range of uses, caricature-related topics are attracting more and more researchers and companies.

Regarding keypoint detection for caricatures: compared with normal faces, caricatures are more exaggerated and more diverse, so their keypoints are harder to identify. As a result, there are almost no automatic keypoint detection algorithms for caricatures. On the other hand, many caricature research topics rely on keypoints, and annotating them by hand is tedious, time-consuming, and laborious. Developing a keypoint detection algorithm for caricatures is therefore of real significance: it fills a gap in this line of research and supports the development of related topics.

Most currently popular keypoint detection algorithms for normal faces are data-driven and depend on the design of a deep neural network architecture. Such algorithms typically extract visual features of the face, or statistical features of the face image pixels, from a single image and regress the keypoint positions from them; the extraction methods include knowledge-based and algebraic-feature-based approaches. An exaggerated face is rooted in a normal face and must keep the basic characteristics of a face, such as a specific number of eyes, mouths, noses, and ears. However, an exaggerated face usually exaggerates these features relative to a normal face, so a given feature, for example the distribution of keypoints around the eyes, varies widely between images. Because the exaggerated features are so varied and diverse, very few keypoint detection algorithms exist for caricatures.

Regarding three-dimensional reconstruction for caricatures: at present there are two main ways to obtain a 3D exaggerated face model, namely manual modeling and reconstruction based on deformation algorithms. Manual modeling, the earliest 3D modeling technique, is still widely used to produce 3D exaggerated face models. It generally has to be carried out by professionally trained personnel in professional modeling software such as Autodesk Maya. It offers high accuracy but demands a great deal of time and labor, so obtaining 3D exaggerated face models with deformation algorithms has become more popular. Deformation algorithms have the advantage of automatic generation, but the models they produce are often limited in exaggeration style and, compared with the widely varying 3D exaggerated faces obtained by manual modeling, lack diversity and accuracy. Moreover, most existing deformation algorithms rely on keypoints, so time and labor must still be spent on annotation, and if the annotation is inaccurate the generated model may well fail to match the original two-dimensional caricature.

Traditional methods for generating a 3D model of a normal face from an image usually first build 3D models of a number of people, for example by camera capture, then construct a corresponding face database using statistical or dimensionality-reduction methods, and establish a parametric face model (linear or nonlinear). The complex 3D face is thereby parameterized in a low-dimensional parameter space, and the corresponding normal face can be reconstructed from its coordinates in that space. Following this idea, earlier approaches to exaggerated face generation first annotate two-dimensional keypoints on a single image and then generate the corresponding exaggerated face from the keypoint constraints and the constructed parametric model. Such methods depend heavily on keypoints: they require time for annotation, and any inaccuracy in the annotation directly affects the reconstructed 3D model.

Summary of the invention

The purpose of the present invention is to provide a keypoint detection and three-dimensional reconstruction method for caricature portraits that automatically and quickly detects the keypoints of an exaggerated face and generates the corresponding three-dimensional model, and that has important practical value in fields such as face recognition, animation generation, expression transfer, and AR/VR.

The purpose of the invention is achieved through the following technical solution:

A keypoint detection and three-dimensional reconstruction method for caricature portraits, comprising:

constructing a convolutional neural network, and collecting a dataset that contains a three-dimensional face template model together with caricature portraits, annotated two-dimensional keypoint coordinates, and three-dimensional exaggerated face models generated by existing methods, where the three-dimensional face template model and the three-dimensional exaggerated face models have the same topology;

in the training stage, using the three-dimensional face template model as the template face, computing the deformation representation model of each caricature portrait, and outputting camera projection parameters; predicting the corresponding three-dimensional exaggerated face model vertex coordinates and two-dimensional keypoint coordinates from the deformation representation model and the camera projection parameters, and constructing the training loss function from them, so that the network is trained with supervision;

after training, obtaining the corresponding deformation representation model and camera projection parameters for an input caricature portrait, and thereby predicting the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates.

It can be seen from the technical solution provided above that: 1) because the facial deformation is constrained by the deformation representation, the generated face keeps the properties of a face, while the expressive deformation representation model can still generate faces with exaggerated styles; 2) the convolutional neural network regresses the face deformation model and the camera projection parameters directly from a single image; 3) together, the two yield a fairly accurate three-dimensional exaggerated face model, along with fairly accurate two-dimensional keypoint coordinates.

Description of drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a keypoint detection and three-dimensional reconstruction method for caricature portraits provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the working process of the convolutional neural network provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of test results obtained with the trained convolutional neural network according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

In face recognition for caricatures, keypoint detection algorithms designed for normal faces are usually not accurate enough, because the distribution of certain facial features differs too much between images, and considerable time is still needed to adjust the keypoint positions after detection. In three-dimensional reconstruction for caricatures, traditional reconstruction methods produce face models that are not exaggerated enough because the base model lacks expressive power, while some reconstruction algorithms based on optimization and keypoint constraints depend too heavily on keypoint annotation: if the annotation is not accurate enough, the generated 3D model deviates noticeably from the 2D image. To address this, an embodiment of the present invention provides a keypoint detection and three-dimensional reconstruction method for caricature portraits which, as shown in FIG. 1, mainly includes the following steps:

Step 1. Construct a convolutional neural network, and collect a dataset that contains a three-dimensional face template model together with caricature portraits, annotated two-dimensional keypoint coordinates, and three-dimensional exaggerated face models generated by existing methods.

This step covers network construction and data collection. Because the dataset may be acquired and processed in different ways, the three-dimensional exaggerated face models in the dataset are required to have the same topology as the three-dimensional face template model: all meshes share the same number of vertices and the same adjacency relations, and the vertex order is identical across models. In addition, the collected face data are assumed to be sufficiently diverse.
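The topology requirement above can be checked mechanically before training. The following is a minimal sketch, assuming each mesh is given as a vertex count plus a list of triangle index triples; the function names `edge_set` and `same_topology` are hypothetical and not part of the source.

```python
def edge_set(faces):
    """Collect the undirected edges of a triangle mesh as sorted index pairs."""
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return edges


def same_topology(n_vertices_a, faces_a, n_vertices_b, faces_b):
    """True if both meshes have the same vertex count and the same adjacency.

    Because vertices are compared by index, this also assumes the two meshes
    use the same vertex order, as the dataset requirement states.
    """
    return n_vertices_a == n_vertices_b and edge_set(faces_a) == edge_set(faces_b)
```

A check of this kind is cheap compared with training and catches meshes that were exported with a different triangulation.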

Those skilled in the art will understand that a face dataset satisfying these conditions can be obtained by conventional means.

Step 2. In the training stage, the three-dimensional face template model is used as the template face, the deformation representation model of each caricature portrait is computed, and the camera projection parameters are output. The corresponding three-dimensional exaggerated face model vertex coordinates and two-dimensional keypoint coordinates are predicted from the deformation representation model and the camera projection parameters, and the training loss function is constructed from them, so that the network is trained with supervision.

First, the computation of the deformation representation model is introduced.

Denote the vertex set of the three-dimensional face template model as V = {v_i | i = 1, ..., N_v}; that is, V consists of all vertices v_i of a single three-dimensional face, where i is the index subscript and N_v is the total number of vertices. Because all face data in the dataset have the same number of vertices, the same vertex order, and the same adjacency relations, a vertex set V together with an index i identifies the vertex unambiguously.

Take the three-dimensional face template model as the template face and the three-dimensional exaggerated face model corresponding to a caricature as the deformed face. Construct an energy function of the deformation gradient T_i between the vertex v'_i with index i on the deformed face and the vertex v_i with index i on the template face, and solve for T_i by minimizing it:

$$E(T_i) = \sum_{j \in N_i} c_{ij} \left\| e'_{ij} - T_i\, e_{ij} \right\|^2$$

where N_i is the index set of the 1-ring neighbors of the vertex with index i; the vertex with index j in N_i is written v_j on the template face and v'_j on the deformed face; e'_ij is the edge from vertex v'_i to vertex v'_j on the deformed face, and e_ij is the edge from vertex v_i to vertex v_j on the template face; and c_ij is the cosine Laplacian weight of the template face.
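The per-vertex minimization is a small weighted least-squares problem with a closed-form solution. The sketch below is an illustration under stated assumptions rather than the patent's implementation: the function name `deformation_gradient` is hypothetical, edges are taken as `v[i] - v[j]` on both meshes (the same sign convention on both sides), the weights are supplied by the caller, and the 1-ring edges are assumed to span 3D space so that the 3x3 system is invertible.

```python
import numpy as np


def deformation_gradient(v, v_def, i, neighbors, weights):
    """Minimize sum_j c_ij * ||e'_ij - T e_ij||^2 over the 3x3 matrix T.

    v, v_def  : (N, 3) arrays of template and deformed vertex positions
    neighbors : indices j in the 1-ring N_i of vertex i
    weights   : the weights c_ij, in the same order as `neighbors`
    """
    c = np.asarray(weights, dtype=float)[:, None]
    E = np.array([v[i] - v[j] for j in neighbors])              # template edges
    E_def = np.array([v_def[i] - v_def[j] for j in neighbors])  # deformed edges
    A = (c * E).T @ E        # sum_j c_ij * e_ij e_ij^T
    B = (c * E_def).T @ E    # sum_j c_ij * e'_ij e_ij^T
    # Setting the gradient to zero gives T A = B; A is symmetric, so solve A T^T = B^T.
    return np.linalg.solve(A, B.T).T
```

When the 1-ring is planar (as can happen on flat regions of a mesh), `A` becomes singular; a pseudo-inverse or a small regularizer would then be needed, which this sketch omits.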

After the deformation gradient of each vertex is obtained, T_i is decomposed by matrix polar decomposition into T_i = R_i S_i, where R_i is the rotation component of the deformation gradient from vertex v_i to vertex v'_i, and S_i is its scaling component.
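One common way to compute a polar decomposition T = R S is through the singular value decomposition. The sketch below assumes a non-singular 3x3 input and uses `numpy.linalg.svd`; the function name `polar_decompose` is hypothetical, and the sign correction that forces det(R) = +1 is a design choice of this sketch, not something the source specifies.

```python
import numpy as np


def polar_decompose(T):
    """Split a non-singular 3x3 matrix T into a rotation R and a symmetric S with T = R @ S."""
    U, sigma, Vt = np.linalg.svd(T)
    # If U @ Vt is a reflection, flip the last singular direction so that det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    S = Vt.T @ D @ np.diag(sigma) @ Vt
    return R, S
```

Since `D @ D` is the identity and `D` commutes with the diagonal matrix of singular values, `R @ S` reproduces `T` exactly up to floating-point error.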

Through matrix operations, the rotation matrix R_i is written equivalently as exp(log R_i), and the deformation representation model from the template face to the deformed face is then:

$$f_n = \{\log R_i;\ S_i - I \mid i = 1, \dots, N_v\}$$

where I is the identity matrix, introduced to set up a coordinate system in which the undeformed template (R_i = I, S_i = I) corresponds to zero, and V_n = {v'_i | i = 1, ..., N_v} is the vertex set of the three-dimensional exaggerated face model. The logarithm is used so that the product R_i R_j of rotation matrices can be expressed as exp(log R_i + log R_j), which turns multiplication into addition (the identity is exact when the two rotations commute).
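The matrix logarithm and exponential of a rotation have closed forms (Rodrigues' formula). The following sketch, with hypothetical function names `rotation_log` and `rotation_exp`, assumes rotation angles strictly below pi, where the logarithm is well defined by this formula; it is meant to illustrate the additive behaviour described above, not to reproduce the patent's code.

```python
import numpy as np


def rotation_log(R):
    """Matrix logarithm of a 3x3 rotation: a skew-symmetric matrix (angle < pi assumed)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros((3, 3))
    return theta / (2.0 * np.sin(theta)) * (R - R.T)


def rotation_exp(W):
    """Matrix exponential of a 3x3 skew-symmetric matrix via Rodrigues' formula."""
    w = np.array([W[2, 1], W[0, 2], W[1, 0]])  # axis-angle vector
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    K = W / theta
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```

For two rotations about the same axis, `rotation_exp(rotation_log(Ra) + rotation_log(Rb))` equals `Ra @ Rb`; for rotations about different axes the sum in the log domain is only an approximation of the product.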

By encoding the deformation of every deformed face relative to the template face, the set of template-based deformation representations over the three-dimensional exaggerated face dataset is obtained as F = {f_n | n = 1, ..., N}, where N is the number of elements in the set, that is, the number of three-dimensional models in the face dataset. As an example, F contains 7800 elements, i.e. N = 7800.

The set F is written as a matrix of size N × M whose n-th row is the deformation representation f_n of the exaggerated face numbered n relative to the template face. For each f_n, the deformation representation {log R_i; S_i − I} of its i-th vertex v'_i is written as a 9-dimensional vector, so M = N_v × 9, where N_v is, as above, the total number of vertices of the three-dimensional face mesh.
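One way to arrive at 9 numbers per vertex is to keep only the independent entries: log R_i is skew-symmetric (3 values) and S_i − I is symmetric (6 values). The source states the dimension but not the packing, so the entry order below and the function names `encode_vertex` and `decode_vertex` are assumptions made for illustration.

```python
import numpy as np

_UPPER = np.triu_indices(3)  # indices of the 6 upper-triangular entries


def encode_vertex(log_R, S):
    """Pack {log R_i; S_i - I} into a 9-vector: 3 rotation values + 6 scaling values."""
    rot_part = np.array([log_R[2, 1], log_R[0, 2], log_R[1, 0]])
    scale_part = (S - np.eye(3))[_UPPER]
    return np.concatenate([rot_part, scale_part])


def decode_vertex(feat):
    """Inverse of encode_vertex: recover the skew-symmetric log R and the symmetric S."""
    wx, wy, wz = feat[:3]
    log_R = np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])
    D = np.zeros((3, 3))
    D[_UPPER] = feat[3:]
    D = D + D.T - np.diag(np.diag(D))  # mirror the upper triangle
    return log_R, D + np.eye(3)
```

Concatenating the 9-vectors of all N_v vertices gives one row of length M = N_v × 9 in the matrix F.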

As shown in FIG. 2, the convolutional neural network includes an encoder and a decoder. The encoder encodes the caricature into a K-dimensional latent vector, which is split into two parts: a K1-dimensional vector holding the camera projection parameters, and a K2-dimensional vector that the decoder decodes into the deformation representation model, where K1 + K2 = K.

As an example, ResNet34 may be used as the encoder and a 3-layer fully connected neural network as the decoder.

As an example, the input caricature may have a resolution of 224×224, with K = 216, K1 = 6, and K2 = 210.
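The split of the latent vector can be illustrated on a plain array. This sketch uses the example sizes K = 216, K1 = 6, K2 = 210 from the text; placing the camera parameters first, and the function name `split_latent`, are assumptions, since the source does not say which slice comes first.

```python
import numpy as np

K1, K2 = 6, 210   # example sizes from the text
K = K1 + K2       # 216


def split_latent(z):
    """Split a (..., K) latent vector into camera parameters and a deformation code.

    Assumed layout: [scale (1), Euler angles (3), translation (2), deformation code (K2)].
    """
    if z.shape[-1] != K:
        raise ValueError(f"expected last dimension {K}, got {z.shape[-1]}")
    camera, code = z[..., :K1], z[..., K1:]
    scale, euler, translation = camera[..., :1], camera[..., 1:4], camera[..., 4:6]
    return scale, euler, translation, code
```

Only the deformation code would be passed to the decoder; the six camera values are used directly.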

Based on the above, during training the three-dimensional face template model serves as the template face and a caricature is the input. The rotation and scaling components in the predicted deformation representation model give the deformation gradients, from which the vertex coordinates of the three-dimensional exaggerated face model corresponding to the caricature are predicted; combined with the camera projection parameters output by the network, these give the predicted two-dimensional keypoint coordinates. The loss function is then constructed from the annotated two-dimensional keypoint coordinates and the three-dimensional exaggerated face model (ground truth) that correspond to the caricature in the dataset, and training drives the predicted vertex coordinates and two-dimensional keypoint coordinates toward the ground-truth values.

A preferred implementation of network training is as follows:

For a caricature, the convolutional neural network yields a deformation representation model, expressed as:

$$\hat{f} = \{\log \hat{R}_i;\ \hat{S}_i - I \mid i = 1, \dots, N_v\}$$

where $\hat{R}_i$ is the predicted rotation component of the deformation gradient from vertex v_i to vertex v'_i, and $\hat{S}_i$ is the predicted scaling component of that deformation gradient. Let $\hat{T}_i = \hat{R}_i \hat{S}_i$ denote the predicted deformation gradient between the vertex v'_i with index i on the deformed face and the corresponding vertex v_i with index i on the template face.

From the predicted deformation gradients $\hat{T}_i$, the vertex coordinates of the three-dimensional exaggerated face model are predicted by solving the optimization problem:

$$\min_{\{\hat{v}'_i\}} \sum_{i=1}^{N_v} \sum_{j \in N_i} c_{ij} \left\| (\hat{v}'_i - \hat{v}'_j) - \hat{T}_i\,(v_i - v_j) \right\|^2$$

where $\hat{v}'_i$ is the predicted coordinate of the vertex with index i in the three-dimensional exaggerated face model, and $\hat{v}'_j$ is the predicted coordinate of the vertex with index j in the set N_i on the deformed face. Solving this optimization problem is equivalent to solving the following linear system for the vertex coordinates of the three-dimensional exaggerated face:

$$2 \sum_{j \in N_i} c_{ij}\,(\hat{v}'_i - \hat{v}'_j) = \sum_{j \in N_i} c_{ij}\,(\hat{T}_i + \hat{T}_j)\,(v_i - v_j), \qquad i = 1, \dots, N_v$$
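Recovering vertex positions from per-vertex deformation gradients amounts to one sparse linear solve. The dense sketch below assumes symmetric weights c_ij (uniform by default), edges taken as v_i − v_j, and one pinned vertex to remove the translation ambiguity; the function name `reconstruct_vertices` and the pinning choice are assumptions of this sketch, not details given in the source.

```python
import numpy as np


def reconstruct_vertices(v, neighbors, T, weights=None, anchor=0):
    """Solve 2*sum_j c_ij (x_i - x_j) = sum_j c_ij (T_i + T_j)(v_i - v_j) for x.

    v         : (N, 3) template vertices
    neighbors : neighbors[i] lists the 1-ring indices of vertex i
    T         : sequence of N deformation gradients (3x3 each)
    weights   : optional symmetric (N, N) array of c_ij; uniform if None
    """
    n = len(v)
    A = np.zeros((n, n))
    b = np.zeros((n, 3))
    for i in range(n):
        for j in neighbors[i]:
            c = 1.0 if weights is None else weights[i][j]
            A[i, i] += 2.0 * c
            A[i, j] -= 2.0 * c
            b[i] += c * (T[i] + T[j]) @ (v[i] - v[j])
    # The system only fixes vertex differences, so pin one vertex to its template position.
    A[anchor] = 0.0
    A[anchor, anchor] = 1.0
    b[anchor] = v[anchor]
    return np.linalg.solve(A, b)
```

For a real face mesh with thousands of vertices, a sparse matrix with a factorization computed once for the template would be the natural choice, since the left-hand side depends only on the template topology and weights.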

The camera projection parameters P are expressed as:

$$P = \{\hat{s},\ \hat{R},\ \hat{t}\}$$

where $\hat{s}$ is the scale parameter, $\hat{R}$ is the rotation matrix (obtained from an Euler-angle vector), and $\hat{t}$ is the translation parameter. As in the earlier example, K1 = 6, so the scale, the Euler-angle vector, and the translation are 1-, 3-, and 2-dimensional vectors respectively. From the predicted vertex coordinates of the three-dimensional exaggerated face model and the weak perspective projection formula, the two-dimensional keypoint coordinates are obtained:

$$\hat{q} = \hat{s}\,\Pi\,\hat{R}\,\hat{v}' + \hat{t}, \quad \hat{v}' \in L', \qquad \Pi = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

where L' is the set of three-dimensional keypoints selected from the vertex set of the predicted three-dimensional exaggerated face model, $\hat{Q} = \{\hat{q}_t \mid t = 1, \dots, T\}$ is the set of two-dimensional keypoints, and T is the total number of two-dimensional keypoints.
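Weak perspective projection rotates the 3D keypoints, drops the depth coordinate, scales, and translates. In the sketch below the Euler-angle convention (rotations about x, y, z composed as Rz Ry Rx) is an assumption, as the source does not state one, and the function names `euler_to_rotation` and `project_keypoints` are hypothetical.

```python
import numpy as np


def euler_to_rotation(angles):
    """Rotation matrix from Euler angles (rx, ry, rz) in radians, composed as Rz @ Ry @ Rx."""
    rx, ry, rz = angles
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(rx), -np.sin(rx)],
                   [0.0, np.sin(rx), np.cos(rx)]])
    Ry = np.array([[np.cos(ry), 0.0, np.sin(ry)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(ry), 0.0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0.0],
                   [np.sin(rz), np.cos(rz), 0.0],
                   [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx


def project_keypoints(vertices, keypoint_idx, scale, euler, translation):
    """Weak perspective projection q = s * Pi * R * v + t of the selected vertices."""
    Pi = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # drops the depth coordinate
    R = euler_to_rotation(euler)
    pts = np.asarray(vertices, dtype=float)[keypoint_idx]  # the 3D keypoint set L'
    return scale * pts @ (Pi @ R).T + np.asarray(translation, dtype=float)
```

Here `keypoint_idx` plays the role of the selection that forms L' from the model's vertices, for example the indices of a 68-point layout on the shared topology.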

As an example, the keypoints may be the 68 keypoints covering the contour, eyebrows, eyes, nose, and mouth, or keypoints in another layout. According to the chosen keypoint layout, the corresponding three-dimensional keypoints are selected from the vertices of the predicted model to form the set L'.

During training, the data in the dataset serve as the ground truth (supervision). For a single input caricature, the convolutional neural network constructed in step 1, combined with the procedure described in this step, outputs a deformation representation model $\hat{f}$ and camera projection parameters P, from which the predicted three-dimensional model vertex coordinates $\{\hat{v}'_i\}$ and the predicted two-dimensional keypoint coordinates $\{\hat{q}_t\}$ are obtained.

In this embodiment of the present invention, the training loss function has three parts:

1) The vertex-based loss function E_ver.

The vertex coordinates of the three-dimensional exaggerated face corresponding to the caricature in the dataset are used as supervision, and the loss function is expressed as:

$$E_{ver} = \sum_{i=1}^{N_v} \left\| \hat{v}'_i - v'_i \right\|^2$$

where $\hat{v}'_i$ is the predicted coordinate of the vertex with index i in the three-dimensional exaggerated face model, and v'_i is the coordinate of the vertex with index i in the corresponding three-dimensional exaggerated face model in the dataset.

2) The loss function E_lan based on two-dimensional keypoints.

The corresponding two-dimensional keypoint coordinates in the dataset are used as supervision, and the loss function is expressed as:

$$E_{lan} = \sum_{t=1}^{T} \left\| \hat{q}_t - q'_t \right\|^2$$

where $\hat{Q} = \{\hat{q}_t \mid t = 1, \dots, T\}$ is the set of two-dimensional keypoints and T is the total number of two-dimensional keypoints; $\hat{q}_t$ is the predicted two-dimensional keypoint coordinate, and q'_t is the corresponding annotated two-dimensional keypoint coordinate in the dataset.

3) The loss function E_srt based on the camera projection parameters.

Because the keypoint loss involves not only the three-dimensional vertex coordinates but also the camera parameters, additional supervision is needed at the start of training to constrain the camera parameters separately. The corresponding loss function is expressed as:

$$E_{srt} = \left\| \hat{s} - s \right\|^2 + \left\| \hat{R} - R \right\|^2 + \left\| \hat{t} - t \right\|^2$$

where $\hat{s}$ is the scale parameter, $\hat{R}$ is the rotation matrix, and $\hat{t}$ is the translation parameter predicted by the network, and s, R, and t are the corresponding supervision values.

Finally, the loss function of the training stage is:

$$E = \lambda_1 E_{ver} + \lambda_2 E_{lan} + \lambda_3 E_{srt}$$

where {λ_k | k = 1, 2, 3} are weight parameters; as an example, λ₁ = 1, λ₂ = 0.00001, and λ₃ = 0.0001.
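The weighted combination can be sketched on plain arrays. The per-term forms used here (sums of squared differences) are assumptions, since the source's loss equations were lost in extraction; the default weights are the example values from the text, and the function name `training_loss` and the dictionary keys `"s"`, `"R"`, `"t"` are hypothetical.

```python
import numpy as np


def training_loss(v_pred, v_gt, q_pred, q_gt, cam_pred, cam_gt,
                  lambdas=(1.0, 1e-5, 1e-4)):
    """E = l1 * E_ver + l2 * E_lan + l3 * E_srt with squared-error terms (assumed form).

    v_*   : (N_v, 3) vertex coordinates
    q_*   : (T, 2) two-dimensional keypoint coordinates
    cam_* : dicts with keys "s" (scale), "R" (3x3 rotation), "t" (2-vector)
    """
    e_ver = np.sum((np.asarray(v_pred) - np.asarray(v_gt)) ** 2)
    e_lan = np.sum((np.asarray(q_pred) - np.asarray(q_gt)) ** 2)
    e_srt = sum(
        np.sum((np.asarray(cam_pred[k], dtype=float) - np.asarray(cam_gt[k], dtype=float)) ** 2)
        for k in ("s", "R", "t")
    )
    l1, l2, l3 = lambdas
    return l1 * e_ver + l2 * e_lan + l3 * e_srt
```

The small weight on the keypoint term is consistent with keypoint errors being measured in pixels, which are numerically much larger than errors in model-space vertex coordinates.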

In this embodiment of the present invention, the model may be trained with the PyTorch deep learning framework. Multiple samples (for example, 32) are read in at a time for supervised learning, and training ends after a number of epochs (for example, 2000).

Step 3. After training, the corresponding deformation representation model and camera projection parameters are obtained for an input caricature portrait, and the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates are predicted from them.

Testing proceeds in the same way as training: the caricature is fed to the trained convolutional neural network, which yields the deformation representation model and the camera projection parameters, from which the vertex coordinates of the three-dimensional exaggerated face model (and, since the topology is known, the model itself) and the two-dimensional keypoint coordinates are predicted.

FIG. 3 shows some example test results. The first row is the input two-dimensional caricature (224×224), the second row is the predicted three-dimensional exaggerated face model, and the third row is the image annotated with the predicted two-dimensional keypoints.

Compared with traditional image-based keypoint detection and three-dimensional reconstruction algorithms, the above solution has the following main advantages:

1) By using a parametric three-dimensional nonlinear deformation model, the algorithm strengthens the expressive power of the convolutional neural network and accomplishes keypoint detection on exaggerated faces.

2) Through the convolutional neural network, the algorithm reconstructs a three-dimensional face model end to end from a two-dimensional exaggerated face image.

3) Trained on the large constructed dataset, the model's recognition and modeling accuracy on caricatures of different styles and by different artists is greatly improved over previous algorithms.

From the description of the above embodiments, those skilled in the art can clearly understand that the embodiments may be implemented in software, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solutions of the above embodiments may be embodied as a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) and which includes instructions that cause a computer device (such as a personal computer, server, or network device) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.

Claims

1. A key point detection and three-dimensional reconstruction method for caricature portraits, characterized by comprising the following steps:

constructing a convolutional neural network, and collecting a data set comprising a three-dimensional face template model, caricature portraits, annotated two-dimensional key point coordinates, and three-dimensional exaggerated face models generated by an existing method; the three-dimensional face template model and the three-dimensional exaggerated face models have the same topological structure;

in the training stage, the three-dimensional face template model is used as the template face, a deformation representation model is computed for each caricature portrait, and camera projection parameters are output; the vertex coordinates of the corresponding three-dimensional exaggerated face model and the two-dimensional key point coordinates are predicted from the deformation representation model and the camera projection parameters, and the loss function of the training stage is constructed from the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional key point coordinates, so that the network is trained in a supervised manner;

after training, the corresponding deformation representation model and camera projection parameters are obtained for an input caricature portrait, from which the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional key point coordinates are predicted.

2. The key point detection and three-dimensional reconstruction method for caricature portraits of claim 1, wherein the convolutional neural network comprises an encoder and a decoder; the encoder encodes the caricature portrait into a $K$-dimensional latent vector that is split into two parts: one part is a $K_1$-dimensional vector, namely the camera projection parameters; the other part is a $K_2$-dimensional vector, which the decoder decodes into the deformation representation model; wherein $K_1 + K_2 = K$.
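As a minimal sketch of the split described in claim 2, the snippet below slices a K-dimensional latent vector into a K1-dimensional camera part and a K2-dimensional part that would be handed to the decoder. The function name `split_latent`, the example sizes, and the ordering of the two parts are assumptions made for illustration; the claim does not fix any of them.

```python
def split_latent(latent, k1):
    """Split a K-dimensional latent vector into camera and shape parts.

    Assumption for this sketch: the first k1 entries hold the camera
    projection parameters and the remaining K - k1 entries feed the decoder.
    """
    if not 0 < k1 < len(latent):
        raise ValueError("k1 must satisfy 0 < k1 < K")
    camera_part = list(latent[:k1])  # K1-dimensional vector
    shape_part = list(latent[k1:])   # K2-dimensional vector, K2 = K - K1
    return camera_part, shape_part


# Hypothetical sizes: K = 10 and K1 = 6, so K2 = 4.
latent = [0.1 * i for i in range(10)]
camera_part, shape_part = split_latent(latent, k1=6)
```

The length check enforces the constraint K1 + K2 = K with both parts non-empty.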

3. The key point detection and three-dimensional reconstruction method for caricature portraits of claim 1, wherein the three-dimensional face template model and the three-dimensional exaggerated face model having the same topological structure means that the two models have the same number of vertices and the same vertex adjacency, and that the vertex order is the same across models; the vertex set of the three-dimensional face template model is denoted $V$, $V = \{v_i \mid i = 1, \dots, N_v\}$, where $V$ consists of all vertices $v_i$ of a single face's three-dimensional data, $i$ is the index, and $N_v$ is the total number of vertices;

during training, the three-dimensional face template model is used as the template face, and a caricature portrait is input to obtain a deformation representation model $f$ and camera projection parameters $P$.

4. The key point detection and three-dimensional reconstruction method for caricature portraits of claim 3, wherein

the deformation representation model is expressed as:

$$f = \{\log \hat{R}_i;\ \hat{S}_i - I \mid i = 1, \dots, N_v\}$$

wherein $\hat{R}_i$ is the rotation matrix component of the predicted deformation gradient from vertex $v_i$ to vertex $v'_i$, and $\hat{S}_i$ is the scaling matrix component of that deformation gradient; denote by $\hat{T}_i = \hat{R}_i \hat{S}_i$ the deformation gradient from the vertex $v_i$ with index $i$ on the template face to the vertex $v'_i$ with index $i$ on the predicted deformed face;

according to the predicted deformation gradients $\hat{T}_i$, the vertex coordinates of the three-dimensional exaggerated face model are predicted by solving an optimization problem:

$$\min_{\{\hat{v}'_i\}} \sum_{i=1}^{N_v} \sum_{j \in N_i} \left\| (\hat{v}'_i - \hat{v}'_j) - \hat{T}_i (v_i - v_j) \right\|^2$$

wherein $\hat{v}'_i$ is the coordinate of the vertex with index $i$ in the predicted three-dimensional exaggerated face model, and $\hat{v}'_j$ is the coordinate of the vertex with index $j$ in the neighbor set $N_i$ of the predicted face;

the camera projection parameters $P$ are expressed as:

$$P = \{s, R, t\}$$

wherein $s$ is the scaling parameter, $R$ is the rotation matrix, and $t$ is the translation parameter; from the predicted vertex coordinates of the three-dimensional exaggerated face model and the weak perspective projection formula, the two-dimensional key point coordinates are obtained:

$$\hat{q}'_t = s\,\Pi\,R\,l'_t + t, \quad l'_t \in L', \quad t = 1, \dots, T, \quad \Pi = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

wherein $L'$ is the set of three-dimensional key points selected from the predicted vertex set of the three-dimensional exaggerated face model; $\hat{Q}' = \{\hat{q}'_t \mid t = 1, \dots, T\}$ is the two-dimensional key point set, and $T$ is the total number of two-dimensional key points.
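The weak perspective projection step of claim 4 can be sketched as follows: selected three-dimensional key points are rotated, their depth coordinate is dropped, and the result is scaled and translated into the image plane. This is an illustrative sketch in plain Python; the function name `weak_perspective_project`, the landmark indices, and all numeric values are hypothetical and are not taken from the patent.

```python
def weak_perspective_project(points_3d, scale, rotation, translation):
    """Project 3D points to 2D as q = s * Pi * R * v + t.

    points_3d: iterable of (x, y, z); rotation: 3x3 nested list;
    translation: (tx, ty). Pi keeps only the x and y rows, so only the
    first two rows of the rotation matrix are needed.
    """
    points_2d = []
    for x, y, z in points_3d:
        u = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z
        v = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z
        points_2d.append((scale * u + translation[0],
                          scale * v + translation[1]))
    return points_2d


# Hypothetical mesh with three vertices; vertices 0 and 2 act as key points.
mesh_vertices = [(1, 2, 3), (9, 9, 9), (0, 0, 5)]
landmark_indices = [0, 2]
landmarks_3d = [mesh_vertices[i] for i in landmark_indices]

# Rotation by 90 degrees about the z axis, chosen so the result is easy to check.
rotation_z_90 = [[0, -1, 0],
                 [1, 0, 0],
                 [0, 0, 1]]

keypoints_2d = weak_perspective_project(
    landmarks_3d, scale=2, rotation=rotation_z_90, translation=(10, 20))
# keypoints_2d == [(6, 22), (10, 20)]
```

Because the key points are picked by vertex index, the same index list works for any mesh that shares the template's topology.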

5. The key point detection and three-dimensional reconstruction method for caricature portraits of claim 1, 2, 3, or 4, wherein the loss function of the training stage is:

$$E = \lambda_1 E_{ver} + \lambda_2 E_{lan} + \lambda_3 E_{srt}$$

wherein $\{\lambda_k \mid k = 1, 2, 3\}$ are weight parameters;

$E_{ver}$ is the vertex-based loss function:

$$E_{ver} = \frac{1}{N_v} \sum_{i=1}^{N_v} \left\| \hat{v}'_i - v'_i \right\|^2$$

wherein $\hat{v}'_i$ is the coordinate of the vertex with index $i$ in the predicted three-dimensional exaggerated face model, $v'_i$ is the coordinate of the vertex with index $i$ in the corresponding three-dimensional exaggerated face model in the data set, and $N_v$ is the total number of vertices;

$E_{lan}$ is the loss function based on two-dimensional key points:

$$E_{lan} = \frac{1}{T} \sum_{t=1}^{T} \left\| \hat{q}'_t - q'_t \right\|^2$$

wherein $\hat{Q}' = \{\hat{q}'_t \mid t = 1, \dots, T\}$ is the two-dimensional key point set, and $T$ is the total number of two-dimensional key points; $\hat{q}'_t$ is the predicted two-dimensional key point coordinate; $q'_t$ is the corresponding annotated two-dimensional key point coordinate in the data set;

$E_{srt}$ is the loss function based on the camera projection parameters:

wherein $s$ is the scaling parameter, $R$ is the rotation matrix, and $t$ is the translation parameter.
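To illustrate how the three terms of claim 5 combine, the sketch below computes a weighted sum of a vertex term, a key point term, and a camera parameter term. The individual forms used here (mean squared Euclidean distance for vertices and key points, summed squared differences for scale, rotation, and translation) are assumptions made for this sketch, as are the function names, the dictionary layout of the camera parameters, and the example weights.

```python
def mean_squared_distance(predicted, target):
    """Average squared Euclidean distance between paired points."""
    if len(predicted) != len(target):
        raise ValueError("point lists must have the same length")
    total = 0.0
    for p, q in zip(predicted, target):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(predicted)


def camera_loss(pred, target):
    """Assumed form: squared differences over scale, rotation, translation."""
    scale_term = (pred["s"] - target["s"]) ** 2
    rotation_term = sum((a - b) ** 2
                        for row_p, row_t in zip(pred["R"], target["R"])
                        for a, b in zip(row_p, row_t))
    translation_term = sum((a - b) ** 2
                           for a, b in zip(pred["t"], target["t"]))
    return scale_term + rotation_term + translation_term


def total_loss(pred_vertices, gt_vertices, pred_keypoints, gt_keypoints,
               pred_camera, gt_camera, weights):
    """E = lambda_1 * E_ver + lambda_2 * E_lan + lambda_3 * E_srt."""
    e_ver = mean_squared_distance(pred_vertices, gt_vertices)
    e_lan = mean_squared_distance(pred_keypoints, gt_keypoints)
    e_srt = camera_loss(pred_camera, gt_camera)
    return weights[0] * e_ver + weights[1] * e_lan + weights[2] * e_srt


# Hypothetical toy values, chosen only so the arithmetic is easy to follow.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss = total_loss(
    pred_vertices=[(0, 0, 0), (1, 1, 1)], gt_vertices=[(0, 0, 1), (1, 1, 1)],
    pred_keypoints=[(1, 0)], gt_keypoints=[(0, 0)],
    pred_camera={"s": 1.5, "R": identity, "t": (0.0, 2.0)},
    gt_camera={"s": 1.0, "R": identity, "t": (0.0, 0.0)},
    weights=(1.0, 2.0, 4.0),
)
# E_ver = 0.5, E_lan = 1.0, E_srt = 4.25, so loss = 0.5 + 2.0 + 17.0 = 19.5
```

In practice the weights would be tuned so that no single term dominates training; the values above are placeholders.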