Technical Field

The present invention relates to the technical fields of image processing and computer vision, and in particular to a lightweight 3D face keypoint detection method and system.
Background

With the rapid development of deep learning in computer vision, face image processing tasks have found wide application in daily life, and face keypoint detection plays an important role in face recognition, expression recognition, face reconstruction, and related tasks.

Over the past decade, great progress has been made in face keypoint detection, especially in the 2D setting. The ASM (Active Shape Model) algorithm proposed by Cootes et al., based on a point distribution model, is a classic face keypoint detection algorithm: a training set is first annotated manually, a shape model is obtained through training, and a specific object is then located by matching keypoints. The CPR (Cascaded Pose Regression) algorithm proposed by Dollár progressively refines a given initial prediction through a sequence of regressors, each of which performs simple image operations on the output of its predecessor; the whole system learns automatically from training samples. In addition, Zhang et al. proposed the multi-task cascaded convolutional network MTCNN (Multi-task Cascaded Convolutional Networks) to handle face detection and face keypoint localization simultaneously. However, in complex scenarios such as large head poses and facial occlusion, 2D face keypoint detection methods struggle and show clear limitations. To overcome these limitations, more and more researchers have turned to 3D face keypoint detection, since 3D keypoints carry more information than 2D ones and better capture occlusion.

3D face keypoint detection methods can be roughly divided into model-based and non-model-based approaches. First, model-based methods: the 3D Morphable Model (3DMM) proposed by Blanz et al. is a common approach to 3D face keypoint detection. Second, non-model-based methods: Tulyakov et al. proposed locating 3D face keypoints by computing 3D shape features with cascaded regression, extending the cascaded regression framework to the 3D setting. In addition, methods that use deep learning models for face keypoint detection mainly fall into two-stage regression methods and volumetric representation methods: a typical two-stage regression method separates the (x, y) coordinates from the z axis, regressing (x, y) first and then z, while volumetric representation methods extend the traditional 2D heatmap to a 3D volumetric representation and are also widely used in human body keypoint detection.

However, the added spatial dimension poses major challenges to processing speed and model accuracy: existing 3D face keypoint detection algorithms suffer, to varying degrees, from shortcomings in processing speed, model size and complexity, and accuracy.
Summary of the Invention

One objective of the present invention is at least to provide a lightweight 3D face keypoint detection method and system that overcomes the above problems of the prior art.

To achieve this objective, the technical solution adopted by the present invention includes the following aspects.

A lightweight 3D face keypoint detection method, comprising:

Step 101: projecting the N 3D reference coordinate vectors of the face keypoints in a database onto three two-dimensional planes for dimensionality reduction, wherein the three planes are the xy, xz, and yz planes and x, y, z are either all positive or all negative; each plane then contains N 2D reference coordinate vectors corresponding to the N 3D reference coordinate vectors;

Step 102: building a joint encoding sub-network based on a k-order improved hourglass network and training it until its performance stabilizes; using the trained joint encoding sub-network to jointly encode the N 2D reference coordinate vectors of each 2D view into a 2D joint heatmap, wherein the residual unit of the k-order improved hourglass network adopts a Residual+Inception structure;

Step 103: stacking the 2D joint heatmaps of the three 2D views into a 3D joint heatmap with the concat operation;

Step 104: building a decoding sub-network based on a 2D fully convolutional network and training it until its performance stabilizes; using the decoding sub-network to decode the 3D joint heatmap into N 3D detected coordinate vectors.

Preferably, the joint encoding sub-network is a 2-order improved hourglass network.

Preferably, the joint encoding sub-network and the decoding sub-network are trained with a multi-loss-function fusion training method.

Preferably, the multi-loss-function fusion training method trains the network for three iterative rounds with three different loss functions, the optimal weights obtained in each round serving as the initial weights of the next round, and training stops once the three rounds are complete.

Preferably, the three loss functions are the mean squared error (MSE) loss, the least absolute error (L1) loss, and the smoothed least absolute error (Smooth L1) loss.

Preferably, the decoding sub-network comprises four 2D convolutional layers, each followed by batch normalization and a LeakyReLU activation.

A lightweight 3D face keypoint detection system, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
In summary, by adopting the above technical solution, the present invention has at least the following beneficial effects:

1. By combining the advantages of 2D and 3D face keypoint detection, a joint heatmap and coordinate regression method is proposed, together with corresponding lightweight neural networks (a joint encoding sub-network and a decoding sub-network) for joint heatmap generation and 3D coordinate regression. The joint heatmap representation reduces computation and model complexity, cutting the number of model parameters and increasing inference speed while maintaining high detection accuracy. During joint encoding, the residual unit of the original neural network is improved to further strengthen the network's feature extraction ability and detection accuracy.

2. The joint encoding sub-network adopts a 2-order improved hourglass structure, which reduces network depth, speeds up convergence, and lowers the number of network parameters.

3. A multi-loss-function fusion training method is proposed in which the network is trained for three iterative rounds with three different loss functions, making the network's detection more accurate.
Brief Description of the Drawings

Fig. 1 is a flowchart of a lightweight 3D face keypoint detection method according to an exemplary embodiment of the present invention.

Fig. 2 is a schematic diagram of the residual unit of the original hourglass network.

Fig. 3 is a schematic diagram of the improved hourglass network residual unit according to an exemplary embodiment of the present invention.

Fig. 4 is a schematic diagram of the improved 2-order hourglass network (joint encoding sub-network) according to an exemplary embodiment of the present invention.

Fig. 5 shows exemplary heatmaps generated by the joint encoding sub-network according to an exemplary embodiment of the present invention.

Fig. 6 is a schematic diagram of the 3D keypoints produced by the decoding sub-network according to an exemplary embodiment of the present invention.

Fig. 7 is a schematic diagram of the projection onto the image of the 3D keypoints produced by the decoding sub-network according to an exemplary embodiment of the present invention.

Fig. 8 is a schematic diagram of the complete network formed by the joint encoding sub-network and the decoding sub-network according to an exemplary embodiment of the present invention.

Fig. 9 is a schematic structural diagram of a lightweight 3D face keypoint detection system according to an exemplary embodiment of the present invention.
Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments, so that its objectives, technical solutions, and advantages become clearer. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

Fig. 1 shows a lightweight 3D face keypoint detection method according to an exemplary embodiment of the present invention. The method of this embodiment mainly comprises:
Step 101: projecting the N 3D reference coordinate vectors of the face keypoints in a database onto three two-dimensional planes for dimensionality reduction, wherein the three planes are the xy, xz, and yz planes and x, y, z are either all positive or all negative; each plane then contains N 2D reference coordinate vectors corresponding to the N 3D reference coordinate vectors;

Specifically, the 3D reference coordinate vectors of N face keypoints are extracted from a ground-truth (commonly abbreviated GT) dataset. A face typically has 68 keypoints in total, so N = 68 is preferred in this embodiment. Each extracted 3D keypoint reference coordinate vector (x, y, z) is decomposed onto three two-dimensional planes, i.e., into the three 2D reference coordinate vectors (x, y), (y, z), and (x, z). Let Vx,y,z = (x, y, z) denote the 3D reference coordinate vector of a keypoint; the three 2D reference coordinate vectors produced by the decomposition are then Vx,y = (x, y), Vy,z = (y, z), and Vx,z = (x, z).

For example, the 3D point (1, -2, 3) decomposes into (1, -2), (-2, 3), and (1, 3). However, so that a joint 2D heatmap can be formed later, the projection is carried out onto the three coordinate planes xy, yz, xz of an octant in which x, y, and z share the same sign (all positive or all negative). This guarantees that after dimensionality reduction every 3D coordinate yields three 2D reference coordinates with consistent signs. Preferably, the point is projected onto the three faces of the first octant of the spatial coordinate system (x, y, z all positive).
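As an illustration of the decomposition described above, a minimal NumPy sketch is given below. It assumes, as preferred here, that the keypoint coordinates have already been normalized into the first octant so the three planar views share the same sign convention:

```python
import numpy as np

def project_keypoints(points_3d):
    """Decompose an N x 3 array of keypoint coordinates into the three
    planar views (x, y), (y, z), and (x, z).

    Assumes coordinates have already been placed in the first octant
    (x, y, z all positive), as the text prefers.
    """
    p = np.asarray(points_3d, dtype=float)
    xy = p[:, [0, 1]]  # projection onto the xy plane
    yz = p[:, [1, 2]]  # projection onto the yz plane
    xz = p[:, [0, 2]]  # projection onto the xz plane
    return xy, yz, xz

# One keypoint at (1, 2, 3) yields the views (1, 2), (2, 3), (1, 3)
xy, yz, xz = project_keypoints([[1.0, 2.0, 3.0]])
```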
Step 102: building a joint encoding sub-network based on a k-order improved hourglass network and training it until its performance stabilizes; using the trained joint encoding sub-network to jointly encode the N 2D reference coordinate vectors of each 2D view into a 2D joint heatmap, wherein the residual unit of the k-order improved hourglass network adopts a Residual+Inception structure;

Specifically, as shown in Fig. 3, the residual sub-unit inside the hourglass network (original structure in Fig. 2) is improved into a Residual+Inception structure that widens the network: convolution kernels of size n×n and pooling kernels of size n×n are used (n = 2k+1, k a positive integer), after which the resulting multi-scale receptive fields are fused across channels. The fused feature maps cover different receptive fields of the input image and carry different semantic information, so for inputs at different scales the improved hourglass network has stronger feature extraction ability and higher detection accuracy. With the modified residual unit the hourglass network becomes wider, and its representational capacity can be raised through width. If the traditional 4-order hourglass structure were retained, the excessive number of parameters would drive the network into overfitting; to prevent this, only a 2-order hourglass structure is kept. As shown in Fig. 4, the joint encoding sub-network of the present invention adopts the 2-order improved hourglass network to perform efficient feature extraction on the input image, avoiding the overfitting that the wider encoding sub-network and larger parameter count introduced by the Inception structure could otherwise cause.

Using the 2-order improved hourglass network greatly reduces network depth, letting the network converge faster while lowering its parameter count. The green rectangular modules in the figure consist of the improved Residual+Inception sub-units; inside each green rectangle, the first row of numbers gives the input channels and the second row the output channels. In a 1-order hourglass module, the upper branch operates at the original scale while the lower branch is first downsampled and then upsampled, using max pooling for downsampling and nearest-neighbor interpolation for upsampling; the outputs of the two branches are finally added to obtain the module output. Hourglass modules of different orders lead to different network complexity and parameter counts.

The k-order hourglass joint encoding network is trained on face images carrying coordinate values. Since the joint heatmap has size w×h×3, the encoding resolution for a 256×256 face image is usually set to 128×128×3, so that the encoding sub-network E forms a mapping E(I)→H from the face image I to the joint heatmap H. The network input is a 128×128 face image and the output is a w×h joint heatmap (the output heatmap size can be set according to actual needs). Preferably, the generated joint heatmap is set to 64×64, which turns the relative positions of the face keypoints from sparse to compact, reduces the spatial redundancy of the model, and lowers the network's parameter count.

Further, the specific process by which the joint encoding sub-network jointly encodes the N 2D reference coordinate vectors into a 2D joint heatmap is as follows:
For the 2D reference coordinate vector (x, y) of a given keypoint, the vector is first encoded into a set of continuous values, which are then filtered by taking the maximum, i.e., the largest of the encoded values is kept as the heatmap value. Let H^(m)(i, j) denote the value at position (i, j) of the m-th heatmap, m ∈ {1, 2, 3}. For the n-th keypoint of the face image, with positions vx,y, vy,z, vx,z, the (x, y) coordinate vector is encoded as a 2D Gaussian (the other two coordinate vectors are handled identically), as in formula (1), where σ denotes the variance:

Hn^(m)(i, j) = exp(-((i - xn)^2 + (j - yn)^2) / (2σ^2))    (1)

For a face image with N keypoints, the maximum over the values encoded for each keypoint is taken at every position, combining the encoded values of all N keypoints into a single 2D joint heatmap, as in formula (2):

H^(m)(i, j) = max{n = 1, ..., N} Hn^(m)(i, j)    (2)
Through the above joint encoding process, the 2D joint heatmaps of the three views are obtained; each has size w×h and encodes all N keypoints. Fig. 5 shows heatmaps generated by an exemplary joint encoding sub-network of the present invention.
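The Gaussian encoding of formulas (1) and (2) can be sketched in NumPy as follows; the heatmap size and the value of σ are illustrative choices, and the keypoint coordinates are assumed to be given in heatmap pixel units:

```python
import numpy as np

def joint_heatmap(keypoints_2d, w=64, h=64, sigma=1.5):
    """Encode N 2D keypoints into one w x h joint heatmap.

    Each keypoint is rendered as a 2D Gaussian (formula (1)); the N
    per-keypoint maps are merged by a pixel-wise maximum (formula (2)).
    sigma is an illustrative choice, not a value fixed by the text.
    """
    ii, jj = np.meshgrid(np.arange(w), np.arange(h), indexing="ij")
    heat = np.zeros((w, h))
    for x, y in keypoints_2d:
        g = np.exp(-((ii - x) ** 2 + (jj - y) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)  # joint encoding via per-pixel max
    return heat

hm = joint_heatmap([(10, 20), (40, 50)])
```

Each keypoint location carries the peak value 1 of its Gaussian, and nearby keypoints do not sum up but keep the stronger response, which is what makes the single-map joint encoding well defined.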
Step 103: stacking the 2D joint heatmaps of the three 2D views into a 3D joint heatmap with the concat operation;

Specifically, the 2D joint heatmaps of the three two-dimensional planes are stacked with the concat operation to obtain the 3D heatmap. Concat is a vector-joining operation that concatenates two or more arrays. Stacking the three 2D joint heatmaps with concat yields a 3D heatmap of size w×h×3 (where 3 is the number of channels), as in formula (3):

H = concat(p1, p2, p3)    (3)
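A minimal sketch of this stacking step, using random stand-ins for the three view heatmaps p1, p2, p3:

```python
import numpy as np

# Stand-ins for the three w x h joint heatmaps of the xy, yz, xz views
w, h = 64, 64
p1, p2, p3 = (np.random.rand(w, h) for _ in range(3))

# H = concat(p1, p2, p3): concatenate along a new channel axis,
# giving the w x h x 3 "3D joint heatmap" of formula (3)
H = np.stack([p1, p2, p3], axis=-1)
```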
Step 104: building a decoding sub-network based on a 2D fully convolutional network and training it until its performance stabilizes; using the decoding sub-network to decode the 3D joint heatmap into N 3D detected coordinate vectors.

Specifically, after pre-training, the decoding sub-network forms a mapping D(H)→c from the joint heatmap H to the corresponding 3D coordinate vectors c. Since the joint heatmap H has size w×h×3, the decoding sub-network is built as a 2D fully convolutional network that decodes the heatmap, as shown in Fig. 4 (see Appendix 1). The decoding sub-network comprises five 2D convolutional layers with 128, 128, 256, 256, and 512 kernels respectively, all of size 4×4 with stride 2; the last convolutional layer has N×3 channels, each convolutional layer is followed by batch normalization and a LeakyReLU activation, and the final layer is a global average pooling layer. Passing the 3D joint heatmap produced by concat through this decoding sub-network yields the N 3D keypoint coordinate vectors. As shown in Fig. 6, the decoding sub-network thus extracts the detected coordinate vectors of the N 3D face keypoints. Further, for ease of visualization, the 3D keypoint coordinates are projected onto the 2D image, as shown in Fig. 7.
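As a rough sanity check on the decoder's shapes, the spatial size after each 4×4, stride-2 convolution follows the standard formula out = floor((in + 2·pad - kernel) / stride) + 1. The sketch below assumes padding 1 (not specified in the text) and traces a 64×64 joint heatmap through the five layers:

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of one convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 64        # 64 x 64 joint heatmap input (the preferred size)
trace = [size]
for _ in range(5):          # five stride-2 convolutions
    size = conv_out(size)
    trace.append(size)
# trace -> [64, 32, 16, 8, 4, 2]; global average pooling then collapses
# the remaining 2 x 2 map into a single (N*3)-dimensional vector,
# i.e. the N 3D keypoint coordinates.
```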
Further, considering that during training different loss functions converge at different speeds and toward different extrema, the present invention trains the joint encoding sub-network and the decoding sub-network with a multi-loss-function fusion method: the network is trained for three iterative rounds with three different loss functions, the optimal weights of each round serving as the initial weights of the next, and training stops once the three rounds are complete. Each loss function is sensitive to errors of a different magnitude: the mean squared error (MSE) loss is sensitive to large errors and therefore converges quickly when errors are large, while the least absolute error (L1) loss and its smoothed variant (Smooth L1) are sensitive to small errors and converge faster when errors are small. Three loss functions are therefore applied over three rounds of training: MSE in the first round, L1 in the second, and Smooth L1 in the third, with the optimal weights after each round initializing the next. This training scheme makes the network's detection more accurate. The corresponding joint encoding sub-network losses are formulas (4)-(6):
Lhm1 = Σ|E(I) - H|^2    (4)

Lhm2 = Σ|E(I) - H|    (5)

Lhm3 = Σ smoothL1(E(I) - H)    (6)

where smoothL1(a) = 0.5a^2 for |a| < 1 and |a| - 0.5 otherwise. The corresponding decoding sub-network losses are formulas (7)-(9):

Lcoord1 = Σ|D(H) - c|^2    (7)

Lcoord2 = Σ|D(H) - c|    (8)

Lcoord3 = Σ smoothL1(D(H) - c)    (9)

where D denotes the decoding sub-network, c the 3D detected coordinate vectors, H the joint heatmap, E the joint encoding sub-network, and I the face image carrying coordinate vectors.
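The three losses named above have the following standard NumPy forms; the Smooth L1 threshold of 1 is the conventional choice and is an assumption, not a value fixed by the text:

```python
import numpy as np

def mse_loss(pred, target):
    """Formulas (4)/(7): sum of squared errors."""
    return np.sum((pred - target) ** 2)

def l1_loss(pred, target):
    """Formulas (5)/(8): sum of absolute errors."""
    return np.sum(np.abs(pred - target))

def smooth_l1_loss(pred, target):
    """Formulas (6)/(9): smoothed L1, quadratic near zero and
    linear for large errors (threshold 1 assumed)."""
    a = np.abs(pred - target)
    return np.sum(np.where(a < 1.0, 0.5 * a ** 2, a - 0.5))
```

The quadratic-near-zero shape of Smooth L1 is what gives it the small-error sensitivity the text relies on in the final training round.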
In a further embodiment of the present invention, the joint encoding sub-network and the decoding sub-network are first pre-trained separately, and the two networks are then connected and fine-tuned as a whole (Fig. 8 is a schematic of the complete network as implemented; the concat operation is inserted between the two models to stack the 2D joint heatmaps into the 3D heatmap). This proceeds in two main steps:

Step 1: in the pre-training stage, the joint encoding sub-network is trained on face images carrying coordinate vectors, so that it forms a nonlinear mapping whose input layer takes N face images with coordinate vectors and whose output layer produces the joint heatmap. Meanwhile, the decoding sub-network is trained on 3D joint heatmaps, so that it forms a nonlinear mapping whose input layer takes the 3D joint heatmap and whose output layer produces the 3D detected coordinate vectors.

Step 2: in the fine-tuning stage, the pre-trained decoding sub-network is attached after the pre-trained joint encoding sub-network, with the concat operation (implemented in code) inserted between them to form a complete joint-heatmap 3D face keypoint detection network. This complete model is fine-tuned: the input is N original face images with coordinate vectors, and the outputs are, in turn, the corresponding 2D joint heatmaps and the corresponding 3D keypoint coordinate vectors. Finally, the whole network is trained end to end with the multi-loss-function fusion method: the first round uses the MSE loss, with loss value Lhm1 + Lcoord1; the second round uses the L1 loss, with loss value Lhm2 + Lcoord2; the third round uses the Smooth L1 loss, with loss value Lhm3 + Lcoord3. The optimal weights of each round serve as the initial weights of the next, and training stops once the three rounds are complete and the final result is obtained.
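The round structure of this schedule can be illustrated on a deliberately tiny stand-in problem. Only the order of the losses and the carry-over of weights between rounds mirrors the text; the one-parameter model, data, learning rate, and iteration counts are toy assumptions:

```python
import numpy as np

# Toy stand-in: fit w so that w*x matches y = 2*x, running three rounds
# with the MSE, L1, and Smooth L1 gradients in the order the text uses.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x

def grad(w, loss):
    """Gradient of the chosen loss for the 1-parameter model w*x."""
    r = w * x - y
    if loss == "mse":
        return np.sum(2 * r * x)
    if loss == "l1":
        return np.sum(np.sign(r) * x)
    # smooth_l1: quadratic gradient near zero, clipped further out
    return np.sum(np.where(np.abs(r) < 1.0, r, np.sign(r)) * x)

w = 0.0
for loss in ["mse", "l1", "smooth_l1"]:  # three rounds, in order
    for _ in range(200):                 # iterations within a round
        w -= 0.01 * grad(w, loss)        # previous round's w carries over
```

The MSE round closes the large initial gap quickly, the L1 round keeps the estimate near the optimum, and the Smooth L1 round removes the residual oscillation, which is exactly the intuition the text gives for ordering the losses this way.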
In a further embodiment of the present invention, the extracted detected coordinate vectors are compared against the reference coordinate vectors, and the algorithm is validated with concrete experimental data. The accuracy of the present algorithm is compared with that of prior-art algorithms; the results are shown in Tables 1 and 2:

Table 1. GTE performance comparison of 3D-FAN, JVCR, and the present algorithm on the AFLW2000-3D dataset

Table 2. Network parameter size (MB) and per-image processing time (ms) of 3DDFA, 3D-FAN, JVCR, and the present algorithm
图9示出了根据本发明示例性实施例的基于联合热力图的人脸3D关键点检测系统,即电子设备310(例如具备程序执行功能的计算机服务器),其包括至少一个处理器311,电源314,以及与所述至少一个处理器311通信连接的存储器312和输入输出接口313;所述存储器312存储有可被所述至少一个处理器311执行的指令,所述指令被所述至少一个处理器311执行,以使所述至少一个处理器311能够执行前述任一实施例所公开的方法;所述输入输出接口313可以包括显示器、键盘、鼠标、以及USB接口,用于输入输出数据;电源314用于为电子设备310提供电能。Fig. 9 shows the human face 3D key point detection system based on joint heat map according to an exemplary embodiment of the present invention, that is, an electronic device 310 (such as a computer server with program execution function), which includes at least one processor 311, a power supply 314, and a memory 312 and an input/output interface 313 that are communicatively connected to the at least one processor 311; the memory 312 stores instructions that can be executed by the at least one processor 311, and the instructions are processed by the at least one processor The processor 311 executes, so that the at least one processor 311 can execute the method disclosed in any of the foregoing embodiments; the input and output interface 313 can include a display, a keyboard, a mouse, and a USB interface for inputting and outputting data; a power supply 314 is used to provide electric energy for the electronic device 310 .
本领域技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。Those skilled in the art can understand that all or part of the steps for implementing the above-mentioned method embodiments can be completed by hardware related to program instructions, and the aforementioned programs can be stored in computer-readable storage media. The steps of the method embodiment; and the foregoing storage medium includes: a removable storage device, a read only memory (Read Only Memory, ROM), a magnetic disk or an optical disk, and other various media that can store program codes.
当本发明上述集成的单元以软件功能单元的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本发明各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、磁碟或者光盘等各种可以存储程序代码的介质。When the above-mentioned integrated units of the present invention are realized in the form of software function units and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiment of the present invention is essentially or the part that contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for Make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the methods described in various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks.
The above is only a detailed description of specific embodiments of the present invention and does not limit the present invention. Various replacements, modifications, and improvements made by those skilled in the relevant technical fields without departing from the principle and scope of the present invention shall fall within the protection scope of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910818443.4A | 2019-08-30 | 2019-08-30 | A lightweight face 3D key point detection method and system |
| Publication Number | Publication Date |
|---|---|
| CN110516642A | 2019-11-29 |
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN201910818443.4A | Pending | 2019-08-30 | 2019-08-30 | A lightweight face 3D key point detection method and system |
| Country | Link |
|---|---|
| CN (1) | CN110516642A (en) |
Cited By

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN113128436A* | 2021-04-27 | 2021-07-16 | Method and device for detecting key points |
| CN113269862A* | 2021-05-31 | 2021-08-17 | Scene-adaptive fine three-dimensional face reconstruction method, system and electronic equipment |
| CN113297973A* (granted as CN113297973B, 2025-02-25) | 2021-05-25 | 2021-08-24 | Key point detection method, device, equipment and computer readable medium |
| WO2021190664A1* | 2020-11-12 | 2021-09-30 | Multi-face detection method and system based on key point positioning, and storage medium |
| WO2022089360A1* | 2020-10-28 | 2022-05-05 | Face detection neural network and training method, face detection method, and storage medium |
| CN114757822A* (granted as CN114757822B, 2022-11-04) | 2022-06-14 | 2022-07-15 | Binocular-based human body 3D key point detection method and system |
Patent Citations

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030161505A1* | 2002-02-12 | 2003-08-28 | Lawrence Schrank | System and method for biometric data capture and comparison |
| CN108564029A* | 2018-04-12 | 2018-09-21 | | Face attribute recognition method based on a cascaded multi-task learning deep neural network |
| CN109063666A* | 2018-08-14 | 2018-12-21 | | Lightweight face recognition method and system based on depthwise separable convolution |
| CN109241910A* | 2018-09-07 | 2019-01-18 | | Face key point localization method based on cascaded regression with deep multi-feature fusion |
| CN109685023A* | 2018-12-27 | 2019-04-26 | | Facial key point detection method and related apparatus for ultrasound images |
| CN109919097A* | 2019-03-08 | 2019-06-21 | | Joint detection system and method of face and key points based on multi-task learning |
| CN110084221A* | 2019-05-08 | 2019-08-02 | | Sequential face key point detection method with intermediate supervision based on deep learning |
Non-Patent Citations

| Title |
|---|
| ZHENGNING WANG: "A Light-Weighted Network for Facial Landmark Detection via Combined Heatmap and Coordinate Regression", 2019 IEEE International Conference on Multimedia and Expo (ICME)* |
| ZHANG Wei et al.: "A streamlined face key point detection network with global constraints", Signal Processing (《信号处理》)* |
Legal Events

| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-11-29 |