CN109543606A


Info

Publication number: CN109543606A
Application number: CN201811396296.8A
Authority: CN
Inventors: 郑伟诗; 叶海佳
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2019-03-29
Anticipated expiration: 2038-11-22
Also published as: CN109543606B

Abstract


The invention discloses a face recognition method incorporating an attention mechanism. A cascaded neural network first performs face detection and face alignment on the data set; a deep neural network with an attention mechanism is then constructed and trained; finally, test samples are fed into the trained attention network for face recognition. The attention mechanism is built from STN (spatial transformer network) modules: the output of each stage of the deep neural network is fed into a separate STN module, and the combined output of the STN modules is fused with the output of the deep neural network to form the output feature. So that the network can adaptively learn discriminative region-of-interest features, the STN modules apply affine transformations to their inputs, which strengthens the network's understanding and learning of local information. Built on an existing face recognition network, the method improves recognition accuracy and the robustness of the recognition system.

Description


A face recognition method with an attention mechanism

Technical field

The invention relates to deep learning and to image processing and recognition, and in particular to a face recognition method incorporating an attention mechanism.

Background

Face recognition has been one of the most challenging topics in computer vision and machine learning in recent years and has received wide attention from researchers. Effective face recognition has broad application prospects and can play a major role in national defense and security, video surveillance, human-computer interaction, video indexing and similar scenarios.

At present, most CNN-based feature extraction networks use a classification loss (Softmax Loss) as the supervision signal for training. These networks take classification as the learning objective, and the distance between different classes gradually increases during training. DeepFace uses a classification network together with complex 3D alignment and a large amount of training data. DeepID first divides the face image into patches, then uses multiple classification networks to extract features from the different patches, and finally fuses these features with the Joint Bayesian algorithm. Because features are extracted from separate face patches, the data set grows to several times the size of the original image set, training time increases greatly, and computing resource consumption is high. In addition, the patches are divided in a strictly fixed way, so accuracy drops sharply on profile or irregular face images, and the algorithm is not robust enough.

Summary of the invention

To overcome the shortcomings of the prior art, the invention provides a face recognition method incorporating an attention mechanism. Through the attention module, the neural network automatically learns discriminative face patch features instead of relying on a fixed division into patches; features extracted this way are better suited to improving classification accuracy and are more robust. Because the attention module has a simple structure, it also consumes few computing resources and the network converges quickly.

To achieve the above object, the invention adopts the following technical solution:

The invention discloses a face recognition method incorporating an attention mechanism, comprising the following steps:

S1: Preprocess the images with a cascaded convolutional neural network to obtain aligned face images;

S2: Perform data augmentation on the preprocessed images. The augmentation comprises random cropping and random flipping: a region of a set size is randomly cropped from each image produced by step S1, the image is flipped with a set probability, and finally the image is whitened. A test sample is instead resized directly to the set size, the same size used for random cropping, and then whitened;

S3: Set up an attention mechanism module so that the network automatically learns discriminative face patch features. The module applies convolutions to its input and then regresses M angle values through a fully connected layer, where M is a natural number; a matrix is constructed from the M angle values, and local features of the image are extracted through matrix operations;

S4: Build the attention mechanism network, which uses a deep neural network to extract image features and adds the attention mechanism modules. The network comprises a main path and a branch path. The main path is the output obtained by passing the image through the deep neural network. The branch path is obtained by passing the output of each stage of the deep neural network through a separate attention mechanism module and then combining the results in sequence by element-wise addition. Finally, the outputs of the main path and the branch path are concatenated to give the final image feature, which is used to compute the loss function and as the feature for face recognition;

S5: Train the attention mechanism network with a face recognition loss function and save it;

S6: Extract image features by feeding the test samples into the trained attention mechanism network;

S7: Perform face recognition by classifying the extracted image features with softmax regression, completing the recognition of the test samples.

As a preferred technical solution, the cascaded convolutional neural network in step S1 is MTCNN, comprising P-Net, R-Net and O-Net. Any image to be tested is scaled to different ratios to build an image pyramid, which is then fed through P-Net, R-Net and O-Net in turn to extract face candidate boxes. Training jointly fits three objectives: face/non-face classification, bounding box regression, and facial landmark coordinate regression. The loss functions are as follows.

For face/non-face classification, MTCNN uses cross-entropy as the loss function, denoted L_det:

$$L_{det}^{(i)} = -\left(y_{det}^{(i)} \log p^{(i)} + \left(1 - y_{det}^{(i)}\right) \log\left(1 - p^{(i)}\right)\right)$$

where $p^{(i)}$ is the probability predicted by the model and $y_{det}^{(i)} \in \{0, 1\}$ is the label of sample $x^{(i)}$.

For bounding box regression, MTCNN uses the L2 loss, denoted L_box:

$$L_{box}^{(i)} = \left\lVert \hat{y}_{box}^{(i)} - y_{box}^{(i)} \right\rVert_2^2$$

where $\hat{y}_{box}^{(i)}$ is the regression value predicted by the model, $y_{box}^{(i)}$ is the true coordinate value for sample $x^{(i)}$, and $y_{box}^{(i)} \in \mathbb{R}^4$.

For facial landmark coordinate regression, MTCNN likewise uses the L2 loss, denoted L_landmark:

$$L_{landmark}^{(i)} = \left\lVert \hat{y}_{landmark}^{(i)} - y_{landmark}^{(i)} \right\rVert_2^2$$

where $\hat{y}_{landmark}^{(i)}$ is the regression value predicted by the model, $y_{landmark}^{(i)}$ holds the true facial landmark coordinates of sample $x^{(i)}$, and $y_{landmark}^{(i)} \in \mathbb{R}^{10}$.

As a preferred technical solution, MTCNN introduces an overall objective function that keeps non-face data out of the loss computations to which it does not apply:

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_j^{(i)} \, L_j^{(i)}$$

where N is the total number of training samples, $\beta_j^{(i)} \in \{0, 1\}$ indicates whether sample $x^{(i)}$ takes part in objective j, and α_j is the weight of the corresponding objective in the overall objective. For P-Net and R-Net the weights are (α_det = 1, α_box = 0.5, α_landmark = 0.5); for O-Net they are (α_det = 1, α_box = 0.5, α_landmark = 1).

As a preferred technical solution, the attention mechanism module in step S3 is an STN module comprising a localisation network module, a grid generator and a sampler.

The localisation network module applies convolutions to the input image and then regresses, through a fully connected layer, six affine transformation parameters (referred to here as angle values) that form a 2*3 matrix.

The grid generator uses matrix operations to compute, for each position in the target image V, the corresponding coordinate position in the original image U, generating T_θ(G_i):

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

where $(x_i^s, y_i^s)$ are coordinates in the original image, $(x_i^t, y_i^t)$ are coordinates in the target image, and $A_\theta$ is the matrix of six values regressed by the localisation network module.

The sampler samples the original image U according to the coordinate information in T_θ(G_i) and copies the pixels of U into the target image V.

As a preferred technical solution, in step S4 the base network of the deep neural network is ResNet50, which comprises 5 stages as follows:

Stage0: a convolutional layer and a pooling layer; the convolutional layer has a 7x7 kernel, 64 output channels and a stride of 2, and the pooling layer uses max pooling with a 3x3 window and a stride of 2;

Stage1: 3 blocks with 256 output channels;

Stage2: 4 blocks with 512 output channels;

Stage3: 5 blocks with 1024 output channels;

Stage4: 6 blocks with 2048 output channels;

The branch path feeds the feature maps produced by stages 0, 1, 2, 3 and 4 of the base network ResNet50 into separate STN modules to obtain features L0, L1, L2, L3 and L4. Each of L1 to L4 then undergoes one convolution with a 1x1 kernel, a stride of 1, and as many output channels as the preceding feature has, and the features are added in sequence by element-wise addition:

L0+f(L1)+f(L2)+f(L3)+f(L4)

where "+" denotes element-wise addition and f(·) denotes the convolution.

As a preferred technical solution, each block is structured as follows:

A 1x1 convolution first reduces the channel dimension, a 3x3 convolution follows, and a second 1x1 convolution restores the channel dimension; the block output is the element-wise sum of this result and the block input.

Finally, a 128-dimensional fully connected layer is appended for dimensionality reduction.

As a preferred technical solution, the face recognition loss function in step S5 is based on the Softmax function, and the k-th output of the Softmax-based classification model is:

$$f_k(x) = W_k^{\top} x + b_k$$

where $W_k$ and $b_k$ are the two parameters of the Softmax layer, of which there are K sets of weights and biases.

As a preferred technical solution, the Softmax layer is implemented as a fully connected layer without activation.

As a preferred technical solution, the posterior probability of class k after transforming the Softmax layer output is:

$$p(y = k \mid x) = \frac{\exp\left(f_k(x)\right)}{\sum_{j=1}^{K} \exp\left(f_j(x)\right)}$$

To maximize the probability of the class to which each sample belongs, the Softmax Loss is defined as:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

where θ denotes the model parameters, N is the number of samples, and $y^{(i)}$ is the class to which sample $x^{(i)}$ belongs.

As a preferred technical solution, the Softmax-based classification model further includes an optimizer, which is Adam.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) Starting from the goal of extracting more discriminative local face features, the invention designs an attention mechanism module within the framework of a base neural network and combines it with the deep neural network through a distinctive connection scheme, yielding a face recognition method with an attention mechanism that extracts face features rich in class-relevant information.

(2) The invention augments the preprocessed images by random cropping and random flipping to increase the amount of training data; augmenting the training set strengthens the robustness of the network.

(3) The attention mechanism module is an STN module comprising a localisation network module, a grid generator and a sampler. The STN module has a simple structure, consumes few computing resources, and lets the network converge quickly.

Description of drawings

FIG. 1 is a schematic structural diagram of the face alignment network of the invention;

FIG. 2 is a schematic structural diagram of the STN module of the invention;

FIG. 3 is a schematic structural diagram of the base deep convolutional neural network of the invention;

FIG. 4 is a schematic diagram of the block structure in the base deep convolutional neural network of the invention;

FIG. 5 is a schematic structural diagram of the attention mechanism network of the invention.

Detailed description

To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here serve only to explain the invention and do not limit it.

This embodiment discloses a face recognition algorithm incorporating an attention mechanism, comprising the following steps:

Step 1: Preprocess the data with a cascaded neural network that performs face detection and face alignment. The cascaded convolutional neural network used is MTCNN, whose cascade consists mainly of three convolutional neural networks: P-Net, R-Net and O-Net. Given an image to be processed, the image is first scaled to different ratios to build an image pyramid and then fed through the three networks in turn to extract face candidate boxes. As shown in FIG. 1, the algorithm has three stages: in the first stage, a shallow CNN quickly produces candidate windows; in the second stage, a more complex CNN refines the candidate windows and discards a large number of overlapping ones; in the third stage, a still more powerful CNN decides which candidate windows to keep and outputs the locations of five facial landmarks. During training, to combine the face detection and face alignment tasks, MTCNN fits three objectives simultaneously: face/non-face classification, bounding box regression, and facial landmark coordinate regression. The three loss functions are:
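The image pyramid mentioned in Step 1 can be sketched as a list of scale factors. This is a minimal illustration, not the patent's implementation: the minimum face size of 20 pixels, the 12-pixel network input and the scaling factor of 0.709 are assumptions borrowed from common MTCNN implementations, and `pyramid_scales` is a hypothetical helper name.

```python
def pyramid_scales(height, width, min_face_size=20, net_input=12, factor=0.709):
    """Return the scale factors of an image pyramid for a cascaded detector.

    All three keyword defaults are assumptions; the source text does not
    state them.
    """
    scales = []
    # First scale maps the smallest face of interest onto the network input.
    scale = net_input / min_face_size
    min_side = min(height, width) * scale
    # Keep shrinking until the image would be smaller than the network input.
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


scales = pyramid_scales(250, 250)
print(len(scales), [round(s, 3) for s in scales])
```

Each scale yields one resized copy of the image; the first network of the cascade is run on every copy so that faces of different sizes all appear at roughly the network's input size somewhere in the pyramid.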

(1) Face/non-face classification

Face/non-face classification is a binary classification problem, so MTCNN uses cross-entropy as the loss function, denoted L_det. For each sample $x^{(i)}$:

$$L_{det}^{(i)} = -\left(y_{det}^{(i)} \log p^{(i)} + \left(1 - y_{det}^{(i)}\right) \log\left(1 - p^{(i)}\right)\right)$$

where $p^{(i)}$ is the probability predicted by the model and $y_{det}^{(i)} \in \{0, 1\}$ is the label of sample $x^{(i)}$.

(2) Bounding box regression

The purpose of bounding box regression is to estimate, for each face candidate box, its offset from the nearest ground-truth face region, expressed as left, top, width and height. This is a regression problem with these four values as targets, so MTCNN uses the L2 loss, denoted L_box. For each sample $x^{(i)}$:

$$L_{box}^{(i)} = \left\lVert \hat{y}_{box}^{(i)} - y_{box}^{(i)} \right\rVert_2^2$$

where $\hat{y}_{box}^{(i)}$ is the regression value predicted by the model and $y_{box}^{(i)}$ is the true coordinate value for sample $x^{(i)}$. Because the regression target has 4 values, $y_{box}^{(i)} \in \mathbb{R}^4$.

(3) Facial landmark coordinate regression

Facial landmark coordinate regression is also a regression problem. MTCNN detects only 5 facial landmarks, each with an x and a y coordinate, so there are 10 regression targets in total. The L2 loss is used again, denoted L_landmark. For each sample $x^{(i)}$:

$$L_{landmark}^{(i)} = \left\lVert \hat{y}_{landmark}^{(i)} - y_{landmark}^{(i)} \right\rVert_2^2$$

where $\hat{y}_{landmark}^{(i)}$ is the regression value predicted by the model and $y_{landmark}^{(i)}$ holds the true facial landmark coordinates of sample $x^{(i)}$. Because the regression target has 10 values, $y_{landmark}^{(i)} \in \mathbb{R}^{10}$.

(4) Overall objective function

Fitting several objectives at once requires different types of training data, for example non-face images, partial-face images, and face data annotated with landmarks, but not every type of data is meaningful for every objective; non-face data, for instance, is meaningless for L_landmark. During training, therefore, not every sample takes part in every loss computation. To distinguish the sample types, MTCNN introduces a sample type indicator $\beta_j^{(i)} \in \{0, 1\}$ that states whether sample $x^{(i)}$ belongs to type j, and the overall objective function is

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_j^{(i)} \, L_j^{(i)}$$

where N is the total number of training samples and α_j is the weight of the corresponding objective in the overall objective. For P-Net and R-Net the weights are (α_det = 1, α_box = 0.5, α_landmark = 0.5); for O-Net, to ensure the accuracy of the facial landmarks, the weight of the landmark regression objective is raised, giving (α_det = 1, α_box = 0.5, α_landmark = 1).
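The weighted multi-task objective can be illustrated with a small pure-Python sketch. It assumes the standard MTCNN forms of the three losses (binary cross-entropy for detection, squared L2 distance for the two regressions) and a 0/1 indicator per task that switches a loss off for samples lacking the relevant annotation. The dictionary keys (`beta`, `p`, `y_det`, `box_pred`, and so on) are hypothetical names chosen for this example.

```python
import math

# Task weights alpha_j as stated in the text.
ALPHA_P_R = {"det": 1.0, "box": 0.5, "landmark": 0.5}  # P-Net and R-Net
ALPHA_O = {"det": 1.0, "box": 0.5, "landmark": 1.0}    # O-Net


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one sample; p is the predicted face probability."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def squared_l2(pred, target):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(pred, target))


def sample_loss(sample, alpha):
    """Weighted loss of one sample; tasks whose indicator is 0 are skipped."""
    beta = sample["beta"]
    loss = 0.0
    if beta.get("det"):
        loss += alpha["det"] * cross_entropy(sample["p"], sample["y_det"])
    if beta.get("box"):
        loss += alpha["box"] * squared_l2(sample["box_pred"], sample["box_true"])
    if beta.get("landmark"):
        loss += alpha["landmark"] * squared_l2(sample["lm_pred"], sample["lm_true"])
    return loss


def total_loss(samples, alpha):
    """Sum of alpha_j * beta_j * L_j over all samples and tasks."""
    return sum(sample_loss(s, alpha) for s in samples)


# Hypothetical mini-batch: one non-face image and one landmark-annotated face.
batch = [
    {"beta": {"det": 1}, "p": 0.5, "y_det": 0},
    {"beta": {"landmark": 1}, "lm_pred": [0.0] * 10, "lm_true": [0.1] * 10},
]
print(total_loss(batch, ALPHA_P_R), total_loss(batch, ALPHA_O))
```

With the O-Net weights the landmark term counts twice as much as with the P-Net/R-Net weights, which is the only difference between the two printed totals.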

Step 2: Data augmentation

Data augmentation uses random cropping and random flipping. The former randomly crops a 160x160 region from each image produced by Step 1; the latter flips the image with probability 0.5. Finally the image is whitened. A test sample is instead resized directly to 160x160 and then whitened in the same way.
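The training-time augmentation can be sketched in pure Python on a grayscale image stored as a list of rows. The text does not define the whitening operation; this sketch assumes per-image standardization (subtract the mean, divide by the standard deviation, with a lower bound on the deviation), and the 182x182 input size is likewise an assumption made only so that a 160x160 crop has room to move.

```python
import math
import random


def random_crop(img, size, rng):
    """Cut a size x size window at a uniformly random position."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]


def random_flip(img, rng, p=0.5):
    """Mirror the image horizontally with probability p."""
    return [row[::-1] for row in img] if rng.random() < p else img


def whiten(img):
    """Per-image standardization (an assumed definition of whitening)."""
    pixels = [v for row in img for v in row]
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in pixels) / n)
    std = max(std, 1.0 / math.sqrt(n))  # guard against constant images
    return [[(v - mean) / std for v in row] for row in img]


def augment(img, size=160, rng=None):
    """Random crop, random flip, then whitening, in the order given in the text."""
    rng = rng or random.Random()
    return whiten(random_flip(random_crop(img, size, rng), rng))


# Synthetic 182x182 image standing in for an aligned face.
image = [[float((r * 7 + c * 3) % 256) for c in range(182)] for r in range(182)]
out = augment(image, 160, random.Random(0))
print(len(out), len(out[0]))
```

At test time only the deterministic part would be applied: resize to 160x160 (not shown here) followed by the same `whiten` call.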

Step 3: Design the attention mechanism module

The attention mechanism module is an STN module. As shown in FIG. 2, it consists of three parts: a localisation network, a grid generator and a sampler.

Localisation network: a simple regression network. It applies several convolutions to the input image and then regresses, through a fully connected layer, six affine transformation parameters (referred to here as angle values) that form a 2*3 matrix.

Grid generator: using matrix operations, the grid generator computes, for each position in the target image V, the corresponding coordinate position in the original image U, that is, it generates T_θ(G_i).

For a two-dimensional affine transformation (rotation, translation, scaling), this grid computation is a simple matrix operation:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

In the above formula, $(x_i^s, y_i^s)$ are coordinates in the original image, $(x_i^t, y_i^t)$ are coordinates in the target image, and $A_\theta$ is the matrix of six values regressed by the localisation network.

Sampler: the sampler samples the original image U according to the coordinate information in T_θ(G_i) and copies the pixels of U into the target image V.
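The grid generator and sampler can be sketched as follows for a single-channel image. Two details are assumptions taken from the usual spatial transformer formulation rather than from this text: coordinates are normalized to the range [-1, 1], and the sampler uses bilinear interpolation (the text only says pixels are copied from U to V). The localisation network is omitted; `theta` stands for the 2x3 matrix it would regress.

```python
import math


def affine_grid(theta, out_h, out_w):
    """Map every target pixel to a source coordinate (x_s, y_s) in [-1, 1]."""
    grid = []
    for i in range(out_h):
        y_t = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x_t = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            # (x_s, y_s)^T = A_theta * (x_t, y_t, 1)^T
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(U, grid):
    """Build the target image V by bilinear lookup in the source image U."""
    h, w = len(U), len(U[0])

    def pixel(r, c):
        # Positions outside the source image contribute zero.
        return U[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    V = []
    for row in grid:
        out_row = []
        for x_s, y_s in row:
            # Convert normalized coordinates back to pixel indices.
            x = (x_s + 1.0) * (w - 1) / 2.0
            y = (y_s + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(x), math.floor(y)
            dx, dy = x - x0, y - y0
            out_row.append(
                pixel(y0, x0) * (1 - dx) * (1 - dy)
                + pixel(y0, x0 + 1) * dx * (1 - dy)
                + pixel(y0 + 1, x0) * (1 - dx) * dy
                + pixel(y0 + 1, x0 + 1) * dx * dy
            )
        V.append(out_row)
    return V


U = [[float(c) for c in range(4)] for _ in range(4)]
zoom = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # attend to the central region
V = bilinear_sample(U, affine_grid(zoom, 2, 2))
print(V)
```

A `theta` with diagonal entries below 1 makes the module look at a sub-region of its input, which is how an affine transform can act as a spatial attention window.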

Step 4: Build the attention mechanism network

Features are extracted with a deep neural network. The base network is ResNet50, to which the attention mechanism modules are added. The attention mechanism module is an STN module: it applies several convolutions to the input features and then regresses, through a fully connected layer, six affine transformation parameters (referred to here as angle values) that form a 2*3 matrix. Applying this matrix to the input yields locally meaningful features.

The network is divided into a main path and a branch path. The main path is the output obtained by passing the image through ResNet50; the branch path is the output obtained by passing the stage outputs through separate STN modules and then combining them in sequence by element-wise addition.

Main path: ResNet50, which consists of 5 stages, each comprising several convolution and pooling operations.

As shown in FIG. 3, ResNet50 can be divided by output feature map size into 5 stages, each producing a feature map of a different size.

Stage0 has a convolutional layer and a pooling layer. The convolution has a 7x7 kernel, 64 output channels and a stride of 2. The pooling is max pooling with a 3x3 window and a stride of 2.

Stage1 consists of 3 blocks with 256 output channels.

Stage2 consists of 4 blocks with 512 output channels.

Stage3 consists of 5 blocks with 1024 output channels.

Stage4 consists of 6 blocks with 2048 output channels.

As shown in FIG. 4, every block has the same structure: a 1x1 convolution first reduces the channel dimension, a 3x3 convolution follows, and a second 1x1 convolution restores the channel dimension; the result is added element-wise to the block input to give the block output.

Finally, a 128-dimensional fully connected layer is appended to integrate the information.

Branch path: the feature maps produced by stages 0, 1, 2, 3 and 4 are fed into separate STN modules to obtain their respective features:

The output of Stage0 after its STN is L0;

The output of Stage1 after its STN is L1;

The output of Stage2 after its STN is L2;

The output of Stage3 after its STN is L3;

The output of Stage4 after its STN is L4;

As shown in FIG. 5, every feature except the first undergoes one convolution with a 1x1 kernel, a stride of 1, and as many output channels as the preceding feature has. The features are then fused in sequence by element-wise addition; the purpose of the convolution is to change the feature dimension so that the features can be added. The addition is:

L0+f(L1)+f(L2)+f(L3)+f(L4)

This gives the main path output and the branch path output. Finally, the two outputs are concatenated to obtain the final feature, which is used directly to compute the loss function and as the feature for face recognition.
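The branch fusion and the final concatenation can be sketched with small nested lists. Several points are assumptions, since the text leaves them open: every STN output is taken to have the same number of spatial positions, each 1x1 convolution projects its input to the channel count of L0 so that the sum is well defined, and the fused branch map is reduced to a vector by global average pooling before being concatenated with the main path vector. Features are stored as channels x positions matrices, for which a 1x1 convolution with stride 1 is a plain matrix product.

```python
def conv1x1(feature, weight):
    """1x1 convolution, stride 1.

    feature: C_in x P matrix (channels by flattened spatial positions)
    weight:  C_out x C_in matrix
    """
    n_pos = len(feature[0])
    return [
        [sum(w * feature[c][p] for c, w in enumerate(w_row)) for p in range(n_pos)]
        for w_row in weight
    ]


def elementwise_add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_branch(features, weights):
    """Compute L0 + f(L1) + f(L2) + ... with one 1x1 convolution per extra feature."""
    fused = features[0]
    for feat, weight in zip(features[1:], weights):
        fused = elementwise_add(fused, conv1x1(feat, weight))
    return fused


def global_avg_pool(feature):
    """Average each channel over its spatial positions (assumed reduction)."""
    return [sum(row) / len(row) for row in feature]


def final_feature(main_vec, branch_map):
    """Concatenate the main path vector with the pooled branch feature."""
    return list(main_vec) + global_avg_pool(branch_map)


# Toy example: L0 has 2 channels, L1 has 3 channels, 3 positions each.
L0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
L1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # projects 3 channels down to 2
branch = fuse_branch([L0, L1], [W1])
print(final_feature([0.5, -0.5], branch))
```

In a real network the weights of each 1x1 convolution would be learned together with the rest of the model; here `W1` is fixed only to make the arithmetic easy to follow.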

Step 5: Train the attention mechanism network

In this embodiment, to build the Softmax classification model, the output feature x is fed into a K-way Softmax layer (implemented as a fully connected layer without activation) in order to compute the posterior probability of the sample for each class, where K is the number of classes. The Softmax layer has two parameters, W and b, so its k-th output $f_k(x)$ can be written as:

$$f_k(x) = W_k^{\top} x + b_k$$

Because the output of the fully connected layer can take arbitrary values, the Softmax layer output must be transformed to obtain normalized probabilities over the classes; the resulting posterior probability of class k is:

$$p(y = k \mid x) = \frac{\exp\left(f_k(x)\right)}{\sum_{j=1}^{K} \exp\left(f_j(x)\right)}$$

In this embodiment, to maximize the probability that each sample is assigned to its own class, the Softmax Loss is defined as:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

where θ denotes the model parameters, N is the number of training samples, and $y^{(i)}$ is the class to which sample $x^{(i)}$ belongs.
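The Softmax classification model can be sketched in a few lines of pure Python: a fully connected layer without activation produces K scores, the softmax turns them into posterior probabilities, and the loss is the negative log-probability of the true class. Averaging over the samples is an assumption of this sketch, as are the function names; the predicted class used for recognition is simply the one with the highest probability.

```python
import math


def logits(x, W, b):
    """Fully connected layer without activation: one score per class."""
    return [sum(w * v for w, v in zip(w_k, x)) + b_k for w_k, b_k in zip(W, b)]


def softmax(z):
    """Normalize scores into probabilities; subtracting the max avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(X, y, W, b):
    """Mean negative log-probability of the true class over all samples."""
    nll = [-math.log(softmax(logits(x, W, b))[label]) for x, label in zip(X, y)]
    return sum(nll) / len(nll)


def predict(x, W, b):
    """Index of the class with the highest posterior probability."""
    probs = softmax(logits(x, W, b))
    return max(range(len(probs)), key=probs.__getitem__)


# Toy example: 2-dimensional features, K = 3 classes.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
X = [[2.0, 0.0], [0.0, 2.0]]
y = [0, 1]
print(softmax_loss(X, y, W, b), [predict(x, W, b) for x in X])
```

In the method described here, `x` would be the concatenated main path and branch path feature, and K the number of identities in the training set.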

In this embodiment, the optimizer is Adam with a weight decay of 5e-5 and a batch size of 128, and dropout with a keep probability of 0.8 is applied to the output of the average pooling layer. The learning rate schedule is: train on the training set for 3 epochs at a learning rate of 0.1, then 2 epochs at 0.01, then 2 epochs at 0.001, for 7 epochs in total. After each epoch the classification model is validated on LFW, and the trained classification model is saved at the end.
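The stepwise learning rate schedule can be written as a small lookup function. The epoch counts and rates come from the text; the zero-based epoch index and the names `SCHEDULE` and `learning_rate` are choices made for this sketch.

```python
# (number of epochs, learning rate) pairs, in training order.
SCHEDULE = [(3, 0.1), (2, 0.01), (2, 0.001)]


def learning_rate(epoch, schedule=SCHEDULE):
    """Return the learning rate for a zero-based epoch index."""
    boundary = 0
    for n_epochs, lr in schedule:
        boundary += n_epochs
        if epoch < boundary:
            return lr
    raise ValueError(f"epoch {epoch} is beyond the {boundary}-epoch schedule")


print([learning_rate(e) for e in range(7)])
```

A training loop would call `learning_rate(epoch)` at the start of each epoch and pass the value to the optimizer before iterating over the batches.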

Step 6: Learn high-level and abstract image features

Image features are extracted by feeding the test samples into the trained attention mechanism network.

Step 7: Face recognition

The extracted image features are classified with softmax regression, completing the recognition of the test samples.

The above embodiment is a preferred embodiment of the invention, but embodiments of the invention are not limited to it. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the invention is an equivalent replacement and falls within the protection scope of the invention.

Claims


1. A face recognition method incorporating an attention mechanism, characterized by comprising the following steps:

2. The face recognition method incorporating an attention mechanism according to claim 1, characterized in that the cascaded convolutional neural network in step S1 is MTCNN, comprising P-Net, R-Net and O-Net; any image to be tested is scaled to different ratios to build an image pyramid, which is then fed through P-Net, R-Net and O-Net in turn to extract face candidate boxes; training jointly fits face/non-face classification, bounding box regression and facial landmark coordinate regression, with the loss functions as follows:

3. The face recognition method incorporating an attention mechanism according to claim 2, characterized in that MTCNN introduces an overall objective function used to exclude non-face data from the loss computation, the overall objective function being computed as follows:

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_j^{(i)} \, L_j^{(i)}$$

where N is the total number of training samples, $\beta_j^{(i)} \in \{0, 1\}$ indicates whether sample $x^{(i)}$ takes part in objective j, and α_j is the weight of the corresponding objective in the overall objective; for P-Net and R-Net the weights are (α_det = 1, α_box = 0.5, α_landmark = 0.5); for O-Net they are (α_det = 1, α_box = 0.5, α_landmark = 1).

4. The face recognition method incorporating an attention mechanism according to claim 1, characterized in that the attention mechanism module in step S3 is an STN module comprising a localisation network module, a grid generator and a sampler,

5. The face recognition method incorporating an attention mechanism according to claim 1, characterized in that, in step S4, the base network of the deep neural network is ResNet50, which comprises 5 stages as follows:

L0+f(L1)+f(L2)+f(L3)+f(L4)

6. The face recognition method incorporating an attention mechanism according to claim 5, characterized in that each block is structured as follows:

7. The face recognition method incorporating an attention mechanism according to claim 1, characterized in that the face recognition loss function in step S5 is based on the Softmax function, and the k-th output of the Softmax-based classification model is:

8. The face recognition method incorporating an attention mechanism according to claim 7, characterized in that the Softmax layer is implemented as a fully connected layer without activation.

9. The face recognition method incorporating an attention mechanism according to claim 8, characterized in that the posterior probability of class k after transforming the Softmax layer output is:

10. The face recognition method incorporating an attention mechanism according to claim 7, characterized in that the Softmax-based classification model further includes an optimizer, which is Adam.