CN105469041A

Info

Publication number: CN105469041A
Application number: CN201510807796.6A
Authority: CN
Inventors: 熊红凯; 倪赛杰
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2015-11-19
Filing date: 2015-11-19
Publication date: 2016-04-06
Anticipated expiration: 2035-11-19
Also published as: CN105469041B

Abstract

The invention discloses a face point detection system based on multi-task regularization and a layer-by-layer supervised neural network. The system comprises a multi-task regularization module and a layer-by-layer supervised network module. The multi-task regularization module includes a main task and related tasks; the main task and the related tasks are learned jointly to obtain a common feature space, and the auxiliary labels of the related tasks then provide an additional regularization term that enhances the generalization ability of the network. The layer-by-layer supervised network module, unlike a traditional convolutional neural network that optimizes only the objective function of the output layer, introduces a supervised objective function into each intermediate layer, thereby enhancing the saliency of the features learned by the intermediate layers. The system thus effectively addresses the overfitting and uncertain feature robustness of traditional convolutional neural networks.

Description

Translated from Chinese

Face point detection system based on multi-task regularization and layer-by-layer supervised neural network

Technical field

The invention relates to a face point detection method in the field of computer vision, and in particular to a face point detection system based on multi-task regularization and a layer-by-layer supervised neural network.

Background

In the field of computer vision, detecting face points such as the eyes, nose and mouth is a basic and important problem that underlies subsequent face recognition, tracking and 3D face modeling. Despite extensive research, face point detection remains challenging under restrictive environmental conditions because of head pose variation and partial occlusion in images.

Existing face point detection methods fall into two main categories: template fitting and regression-based methods. Regression-based methods first extract features from the input image and then map the learned features to the space of facial feature points. A convolutional neural network takes the raw image as input and uses multiple linear filters to compute high-level feature representations automatically; it has achieved remarkable results in practical feature extraction applications.

In the article "Deep convolutional network cascade for facial point detection", published at the "IEEE Computer Vision and Pattern Recognition" (IEEE CVPR) conference in 2013, Y. Sun et al. proposed a face point detection method that cascades multiple convolutional neural networks. It divides the face into several parts in advance and applies a separate convolutional neural network to each part for coarse-to-fine feature point detection. However, the cascade multiplies the number of network parameters, which makes training difficult and incurs a very large computational overhead.

In the article "Facial landmark detection by deep multi-task learning", published at the "European Conference on Computer Vision" in 2014, Z. Zhang et al. proposed a multi-task learning method. The method builds a convolutional neural network model that exploits the correlation between other facial attributes and the feature points in order to aid the main task, face point detection. It reduces model complexity, but does not take into account the specific relationship between the main task and the related tasks.

Summary of the invention

In view of the defects in the prior art, the present invention provides a face point detection system based on multi-task regularization and a layer-by-layer supervised neural network, which can effectively address the overfitting and uncertain feature robustness of traditional convolutional neural networks.

The present invention is achieved through the following technical solutions:

The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to the present invention comprises two parts, a multi-task regularization module and a layer-by-layer supervised network module, wherein:

The layer-by-layer supervised network module extracts features from the input image according to its pixel values. Unlike a traditional convolutional neural network, which optimizes only the objective function of the output layer, this module introduces a supervised objective function into each intermediate layer, thereby strengthening the saliency of the features learned by the intermediate layers. It then passes the output features to the multi-task regularization module for backpropagation of the signal, and this is repeated until the network converges;

The multi-task regularization module includes a main task and related tasks. The main task and the related tasks jointly learn the parameters of the layer-by-layer supervised network module to obtain a feature space shared by all tasks; the auxiliary labels of the related tasks then provide an additional regularization term that strengthens the generalization ability of the network; finally, the predicted coordinate values of the main task are output.

Preferably, the multi-task regularization module includes a main task sub-module and a related task sub-module, wherein:

The main task sub-module detects 5 feature points of the input face image, namely the left eye, right eye, nose, left mouth corner and right mouth corner, and predicts the coordinate values of each point as the final output.

The related task sub-module performs pose estimation, smile detection, glasses detection and gender prediction on the input face image, and predicts the label value of each classification task to improve the prediction accuracy of the main task.

More preferably, the main purpose of the multi-task regularization module is to produce the objective function to be optimized, namely the difference between the predicted values and the true values; this objective function is minimized so that the predicted values approach the true values as closely as possible.

More preferably, the optimization objective function of the multi-task regularization module is a linear combination of the main task loss function and the related task loss functions.

More preferably, the main task loss function and the related task loss functions are expressed by a squared-error regression function and a cross-entropy function, respectively.

Preferably, the layer-by-layer supervised network module adds a regression supervision function after each intermediate convolutional layer, and backpropagates the signal together with the objective function to be optimized in the multi-task regularization module.

Preferably, in the layer-by-layer supervised network module, the regression supervision function is the squared-error function between the coordinate values output at that convolutional layer and the true coordinate values.

Preferably, the layer-by-layer supervised network module supervises only the main task and not the related tasks, so as to ensure the priority of the main task.

Preferably, in the layer-by-layer supervised network module, the backpropagation of the layer-by-layer supervised neural network alleviates the vanishing gradient problem of traditional convolutional neural networks.

Compared with the prior art, the present invention has the following beneficial effects:

The above technical solution improves on the problems of traditional convolutional neural networks. The invention adds a supervision term to each layer of a traditional convolutional neural network to enhance the transparency of the learned features and alleviate the vanishing gradient problem. The four related tasks of the invention (pose detection, smile detection, glasses detection and gender prediction) share a feature space with the main task, face point detection, which improves the accuracy of the main task and the overall generalization ability of the network.

Description of the drawings

Other features, objects and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 is a structural block diagram of an embodiment of the system of the present invention;

Fig. 2 is a schematic diagram of the layer-by-layer supervised network in the method of the present invention.

Detailed description

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any form. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.

To address the problems of traditional convolutional neural networks, the present invention proposes a face point detection system based on multi-task regularization and a layer-by-layer supervised neural network. In the multi-task regularization part, the system targets the overfitting of traditional convolutional neural networks and exploits the labels of related tasks to learn a feature representation shared by high-level recognition tasks. In the layer-by-layer supervised learning part, the system targets the vanishing gradients of traditional convolutional neural networks and the insufficient saliency of the features they learn: it adds a supervision layer to each intermediate layer of the neural network to strengthen the gradient signal backpropagated from the output layer. The invention applies this system to face point detection, which demonstrates the effectiveness of multi-task regularization and the layer-by-layer supervised neural network.

Fig. 1 is a structural block diagram of an embodiment of the system of the present invention, which includes a multi-task regularization module and a layer-by-layer supervised network module.

In this embodiment, the main task in the multi-task regularization module is learned together with several related tasks to obtain a shared feature space, and the auxiliary labels of the related tasks then provide an additional regularization term to strengthen the generalization ability of the network.

In this embodiment, the optimization objective function, formula (1), is a linear combination of the main task loss function and the related task loss functions. In this combination, λ^a is the weight of the loss function of the a-th related task, T is the total number of tasks (so the sum over related tasks runs from a = 1 to T - 1), and w denotes the parameters of each layer of the neural network that are to be solved for.
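One plausible way to write formula (1), consistent with the definitions above and with formula (2), is sketched below. The loss symbols $\ell^{r}$ (main task) and $\ell^{a}$ (a-th related task) and the label superscripts are names assumed here for illustration; they are not taken from the original text.

```latex
\min_{w} \; \ell^{r}\big(y^{r}, f(x; w)\big)
  + \sum_{a=1}^{T-1} \lambda^{a} \, \ell^{a}\big(y^{a}, f(x; w)\big)
  \qquad (1)
```

Substituting a squared error for $\ell^{r}$ and a cross-entropy for $\ell^{a}$ would give an objective of the same form as formula (2).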

The main task in this embodiment is the detection of 5 coordinate points of the face, and the related tasks are pose detection, smile detection, glasses detection and gender prediction.

Consider a set of training samples indexed by i = 1, …, N and t = 1, …, T, where N and T are the total number of samples and the number of tasks, respectively; each sample pairs the original input x of the t-th task with the corresponding ground-truth label y. Detecting the 5 points (left eye, right eye, nose, left mouth corner and right mouth corner) is a regression task, so the target values are the coordinates of the corresponding points. The loss function of the main task is the squared error $\|y - f(x; w)\|^{2}$, where f(x; w) is the predicted coordinate values of the 5 points and $\|\cdot\|^{2}$ is the squared-error function. The loss function of each of the 4 related tasks is the cross-entropy $-y \log(p(y^{a} \mid x))$, where $p(y^{a} \mid x)$ is given by a softmax function that models the posterior probability.

Therefore, corresponding to formula (1), the final optimization objective function is:

$\min_{w} \|y - f(x; w)\|^{2} + \sum_{a=1}^{T-1} \lambda^{a} \left(-y \log\left(p\left(y^{a} \mid x\right)\right)\right) \quad (2)$
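The objective in formula (2) can be illustrated with a minimal NumPy sketch. It assumes one-hot labels for the related tasks, so that each cross-entropy term reduces to the negative log-probability of the true class; the function and argument names (`multitask_loss`, `aux_logits`, `lambdas`) are hypothetical and chosen only for this illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multitask_loss(pred_points, true_points, aux_logits, aux_labels, lambdas):
    """Squared-error landmark loss plus weighted cross-entropy terms.

    pred_points, true_points: arrays of shape (10,), five (x, y) pairs.
    aux_logits: list of 1-D arrays of class scores, one per related task.
    aux_labels: list of integer class indices, one per related task.
    lambdas:    list of weights lambda^a, one per related task.
    """
    # Main task: ||y - f(x; w)||^2
    main = np.sum((np.asarray(true_points) - np.asarray(pred_points)) ** 2)
    aux = 0.0
    for logits, label, lam in zip(aux_logits, aux_labels, lambdas):
        p = softmax(np.asarray(logits, dtype=float))
        # Assumption: one-hot label, so -y log p becomes -log p[true class].
        aux += lam * (-np.log(p[label]))
    return float(main + aux)
```

With perfect landmark predictions the main term vanishes and only the weighted related-task terms remain, which makes the role of each λ^a as a regularization weight easy to see.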

In this example, the layer-by-layer supervised network module differs from a traditional convolutional neural network, which optimizes only the objective function of the output layer: it introduces a supervised objective function into each intermediate layer, as shown in Fig. 2, thereby strengthening the saliency of the features learned by the intermediate layers. The convolutional neural network consists of K convolutional layers alternating with pooling layers to extract hierarchical features, and can be expressed by the following recursion:

$Z_{k} = \mathrm{pool}(Z_{k-1} * W_{k} + b_{k}) \quad (3)$

where Z_k is the feature map of the k-th convolutional layer, Z_{k-1} is the feature map of the (k-1)-th convolutional layer, W_k is the filter weight to be learned, and b_k is the bias term.
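The recursion in formula (3) can be sketched for a single-channel feature map as follows. This is a simplified illustration under stated assumptions: one filter per layer, "valid" filtering without padding, and non-overlapping 2x2 max pooling. The text does not specify the pooling type or stride, and the helper names are hypothetical.

```python
import numpy as np

def conv2d_valid(z, w, b):
    """Single-channel 'valid' 2-D filtering (cross-correlation) plus a bias."""
    kh, kw = w.shape
    out_h, out_w = z.shape[0] - kh + 1, z.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(z[i:i + kh, j:j + kw] * w) + b
    return out

def max_pool(z, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = (z.shape[0] // size) * size, (z.shape[1] // size) * size
    blocks = z[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def forward(z0, weights, biases):
    """Apply Z_k = pool(Z_{k-1} * W_k + b_k) for k = 1..K; return all Z_k."""
    maps = [z0]
    for w_k, b_k in zip(weights, biases):
        maps.append(max_pool(conv2d_valid(maps[-1], w_k, b_k)))
    return maps
```

Returning every intermediate map, rather than only the last one, is a deliberate choice here: the intermediate maps are what a per-layer supervision term would be attached to.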

The present invention adopts deep supervision: regression supervision is added after the response of each intermediate convolutional layer so that formula (2) can be solved more accurately.

The objective function of the final output layer is therefore combined with the companion supervised objective function output at each k-th layer to give the final optimization problem, formula (4).

In formula (4), w and w_k denote the filter parameters of the last layer and of the intermediate layers, respectively, K is the total number of convolutional layers, and α_k is the weight of the regression supervision function of the k-th convolutional layer. Note that, to ensure the priority of the main task, supervision terms are applied only to the main task.
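A minimal sketch of how the companion terms might enter the objective is given below. It assumes that each supervised layer produces its own landmark prediction through some regressor (not shown), that each companion term is a squared error against the true coordinates, and that the related-task cross-entropy terms are passed in as a single precomputed value. All names are hypothetical.

```python
import numpy as np

def squared_error(y_true, y_pred):
    """Sum of squared differences between true and predicted coordinates."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def supervised_objective(y_true, output_pred, companion_preds, alphas, aux_term=0.0):
    """Output-layer loss plus alpha_k-weighted companion losses.

    companion_preds: landmark predictions read out after each supervised layer.
    alphas:          weights alpha_k, one per supervised layer.
    aux_term:        precomputed weighted cross-entropy of the related tasks;
                     companion supervision is applied to the main task only.
    """
    if len(companion_preds) != len(alphas):
        raise ValueError("one alpha_k is needed per supervised layer")
    total = squared_error(y_true, output_pred) + aux_term
    for alpha_k, pred_k in zip(alphas, companion_preds):
        total += alpha_k * squared_error(y_true, pred_k)
    return total
```

Setting every α_k to zero recovers an objective that supervises only the output layer, which is the traditional setting the text contrasts against.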

The final optimization problem, formula (4), which is the output function of the multi-task regularization module, is solved by stochastic gradient descent: the shared feature representation is first learned in a forward pass, the signal is then backpropagated to refine this representation, and these two steps are repeated until the network converges.
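The alternation of forward pass and gradient update can be illustrated on a deliberately small stand-in problem. The sketch below fits a single linear layer to coordinate targets with per-sample stochastic gradient descent on the squared error; it is not the network described here, and the learning rate, epoch count and function name are assumptions made for the illustration.

```python
import numpy as np

def sgd_fit(features, targets, lr=0.05, epochs=200, seed=0):
    """Fit targets ~ features @ w with per-sample SGD on the squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros((features.shape[1], targets.shape[1]))
    for _ in range(epochs):
        for i in rng.permutation(len(features)):
            x, y = features[i], targets[i]
            pred = x @ w                        # forward pass
            grad = 2.0 * np.outer(x, pred - y)  # gradient of ||y - x @ w||^2 in w
            w -= lr * grad                      # parameter update
    return w
```

In a full network the gradient line would be replaced by backpropagation through all layers, with the companion losses contributing additional gradient at each supervised layer.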

Implementation results

Following the steps above, the invention was implemented as described in the summary of the invention. The training data used in the experiment consist of 10,000 images in total, drawn from the LFW dataset and the web. Each image is annotated with 5 points: the left eye, right eye, nose, left mouth corner and right mouth corner. All annotated values are normalized to [0,1] according to the image size. The test data come from the AFLW, AFW and LFPW datasets. The invention uses three convolutional layers, each with 5x5 filters; each convolutional layer is followed by a pooling layer and a regression supervision layer. The fourth layer is a fully connected layer with 64 neurons, and the last is a multi-task network layer covering the main task of face point detection and 4 face-related attributes. The system of this example was compared with the traditional convolutional neural regression network, the layer-by-layer supervised network and the multi-task regularization network; the measured mean error rates over the 5 points were 2.14%, 5.18%, 2.80% and 2.71%, respectively. The experiments show that the proposed system based on multi-task learning and a layer-by-layer supervised neural network performs well on the face point detection problem.
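The label normalization mentioned above can be sketched as follows, assuming that "according to the image size" means dividing x by the image width and y by the image height; the text does not spell out the exact rule, and the function names are hypothetical.

```python
import numpy as np

def normalize_landmarks(points_px, width, height):
    """Scale (x, y) pixel coordinates into [0, 1] by image width and height."""
    pts = np.asarray(points_px, dtype=float)
    return pts / np.array([width, height], dtype=float)

def denormalize_landmarks(points_norm, width, height):
    """Map normalized (x, y) coordinates back to pixel coordinates."""
    pts = np.asarray(points_norm, dtype=float)
    return pts * np.array([width, height], dtype=float)
```

Normalizing in this way makes the regression targets independent of image resolution, so images of different sizes can share one squared-error scale.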

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to these specific embodiments, and that those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the present invention.

Claims

1. A face point detection system based on multi-task regularization and a layer-by-layer supervised neural network, characterized in that it comprises a multi-task regularization module and a layer-by-layer supervised network module, wherein:

The layer-by-layer supervised network module extracts features from the input image according to its pixel values; unlike a traditional convolutional neural network, which optimizes only the objective function of the output layer, the module introduces a supervised objective function into each intermediate layer, thereby strengthening the saliency of the features learned by the intermediate layers; the output features are then passed to the multi-task regularization module for backpropagation of the signal, and this is repeated until the network converges;

The multi-task regularization module comprises a main task and related tasks; the main task and the related tasks jointly learn the parameters of the layer-by-layer supervised network module to obtain a feature space shared by all tasks; the auxiliary labels of the related tasks then provide an additional regularization term to strengthen the generalization ability of the network; finally, the predicted coordinate values of the main task are output.

2. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to claim 1, characterized in that the multi-task regularization module comprises a main task sub-module and a related task sub-module, wherein:

The main task sub-module detects 5 feature points of the input face image, namely the left eye, right eye, nose, left mouth corner and right mouth corner, and predicts the coordinate values of each point as the final output;

The related task sub-module performs pose estimation, smile detection, glasses detection and gender prediction on the input face image, and predicts the label value of each classification task to improve the prediction accuracy of the main task.

3. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to claim 2, characterized in that the main purpose of the multi-task regularization module is to produce the objective function to be optimized, namely the difference between the predicted values and the true values, and this objective function is minimized so that the predicted values approach the true values as closely as possible.

4. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to claim 3, characterized in that the optimization objective function of the multi-task regularization module is a linear combination of the main task loss function and the related task loss functions.

5. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to claim 4, characterized in that the main task loss function and the related task loss functions are expressed by a squared-error regression function and a cross-entropy function, respectively.

6. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to any one of claims 1-5, characterized in that the layer-by-layer supervised network module adds a regression supervision function after each intermediate convolutional layer, and backpropagates the signal together with the objective function to be optimized in the multi-task regularization module.

7. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to claim 6, characterized in that, in the layer-by-layer supervised network module, the regression supervision function is the squared-error function between the coordinate values output at the convolutional layer and the true coordinate values.

8. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to claim 6, characterized in that, in the layer-by-layer supervised network module, the backpropagation of the regression supervision function alleviates the vanishing gradient problem of traditional convolutional neural networks.

9. The face point detection system based on multi-task regularization and a layer-by-layer supervised neural network according to any one of claims 1-5, characterized in that the layer-by-layer supervised network module supervises only the main task and not the related tasks, so as to ensure the priority of the main task.