CN106372581A - Method for constructing and training human face identification feature extraction network - Google Patents

Method for constructing and training human face identification feature extraction network

Info

Publication number
CN106372581A
CN106372581A
Authority
CN
China
Prior art keywords
network
feature
feature extraction
extraction network
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610726171.1A
Other languages
Chinese (zh)
Other versions
CN106372581B (en)
Inventor
吴晓雨
郭天楚
杨磊
朱贝贝
谭笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN201610726171.1A
Publication of CN106372581A
Application granted
Publication of CN106372581B
Legal status: Expired - Fee Related
Anticipated expiration


Abstract

The invention provides a method for constructing and training a face recognition feature extraction network. The method comprises the steps of: constructing a feature extraction network and a metric learning dimensionality reduction network, wherein the output of the feature extraction network is the input of the metric learning dimensionality reduction network; training the feature extraction network on the full sample set to output a feature set; screening the feature set by semantic sampling to obtain a pure sample set; and training the metric learning dimensionality reduction network on the pure sample set. The natural face recognition network constructed by this method improves the representation ability of the features, so that the feature information in the data can be fully mined and original face pictures can be accurately recognized.

Description

Translated from Chinese
Method for Constructing and Training a Face Recognition Feature Extraction Network

Technical Field

The present invention relates to the technical field of image recognition, in particular to a method for constructing a face recognition network, and specifically to a method for constructing and training a face recognition feature extraction network.

Background Art

Face recognition has long been a hot topic in computer vision. Compared with traditional biometric identification such as iris and fingerprint recognition, face recognition requires no special media to collect data; the recognition task can be completed with image data or pictures captured by an ordinary camera. This gives face recognition a wider range of application scenarios than iris or fingerprint recognition. As a form of biometric identification, face recognition is mostly applied in security, identity authentication, and related fields. With the continuous development of society and advances in science and technology, face recognition has gradually moved from laboratory research into everyday life, and is now applied in areas close to daily life such as access control, attendance, mobile phone unlocking, and financial payment.

However, applying face recognition in daily-life scenarios raises an unavoidable problem: face recognition devices cannot obtain photos with the standard lighting and standard poses of laboratory-collected images. In everyday scenarios, people are likely to capture photos with a mobile phone camera in a natural state, so the data to be recognized tends to consist of natural, arbitrary face pictures with varied illumination and expressions. Compared with the standard-illumination, standard-pose face data collected in the laboratory, faces captured in natural settings contain more noise, and the recognition process must account for uneven illumination, non-frontal poses, expressions, small-area facial occlusion, makeup, and similar factors. These conditions pose a great challenge to traditional face recognition technology. Therefore, developing a face recognition technique that is robust to external interference factors is a problem that urgently needs to be solved.

In the prior art, obtaining a robust face recognition model relies on massive training data, and the training data is expected to have a statistical distribution similar to that of the actual prediction data. With the development of the Internet and the popularity of social networks in the current big-data era, massive training data can be obtained online, but how to use this data so that a face recognition model fully learns the required information has become a research hotspot. With the popularity and development of deep learning, it has been found that, compared with shallow learning, deep learning can better describe the information hidden in data and has stronger representation ability and a better capacity to fit the objective function. Deep learning has therefore made outstanding contributions in the field of natural image recognition. However, face recognition and natural image recognition are two different tasks that share similarities but also have distinct characteristics. They are similar in that both are image recognition tasks with similar reference signals and loss functions, and both exploit the high abstraction and representation-fitting capability of deep networks to process massive data. They differ in that natural images vary widely and have complex backgrounds, so the network may need to consider large-scale context and color-texture information, whereas the structure of a face is simple and the discrimination between different people is much smaller than that between different categories of natural images; face recognition therefore requires more attention to fine detail differences and less attention to color information. Consequently, the training methods and network structures of natural image recognition cannot be applied directly to the face recognition task.

Existing deep neural networks for natural face recognition, such as CASIA-Net, collect all their training data from the Internet and remove identities that overlap with the LFW database, ensuring that the training and test sets do not overlap. CASIA-Net contains 10 convolutional layers and one fully connected classification layer, as shown in Figure 7; its specific parameters are listed in Table a below. As Table a shows, CASIA-Net integrates successful neural network design techniques, including a deep structure, low-dimensional representation, and multiple loss functions. Stacking small convolution kernels not only reduces the number of parameters but also increases the nonlinearity of the network; CASIA-Net uses stacks of 3*3 convolutions throughout. Inspired by the existing VGG-Net, CASIA-Net groups two 3*3 convolutions into one stage, and five stages make up the entire network. CASIA-Net does not use fully connected layers to fuse feature maps into low-dimensional features; feature extraction throughout the network is done with convolution operations. The Pool5 layer is the feature layer, and the low-dimensional representation conforms to the assumption that faces lie on a low-dimensional manifold. Since the low-dimensional representation must contain all the distinguishing information of a face, and ReLU makes neurons sparse, the Conv52 layer does not use ReLU activation. In max pooling, the maximum value within the receptive field is passed to the next layer as the activation; using max pooling in the feature layer easily introduces noise-sensitive regions. Therefore, an average pooling operation is adopted after the Conv52 layer, and the softmax and verification signals are fused at the feature layer to learn more representation information that helps distinguish faces.

Table a

As is well known, faces differ from natural images: the structure of a face is single and fixed. A face classification network needs not only large-scale features but also more attention to image detail, that is, smaller convolution kernels and smaller receptive fields to capture details. However, the convolutional layers of the existing CASIA-Net are stacked in a simple way, the features the network extracts have not been studied in depth, all its convolution kernels are 3*3, and the feature scale is single; the existing CASIA-Net is therefore not adequate for natural face feature recognition.

To further improve existing deep neural networks for natural face recognition, other learning ideas have been introduced to bring the network parameters to a better position; for example, introducing metric learning can improve the characteristics of such networks. A typical loss function for metric learning is triplet loss. However, triplet loss has several problems in neural network training: first, hardware resources are insufficient; second, it cannot be trained jointly with softmax well; and third, the error it provides in feature space is not robust to noise.

Therefore, existing deep neural networks for natural face recognition still cannot accurately recognize natural faces, and how to improve the existing face recognition deep neural network to accurately recognize original face pictures has become a technical problem that those skilled in the art urgently need to solve.

Summary of the Invention

In view of this, the technical problem to be solved by the present invention is to provide a method for constructing and training a face recognition feature extraction network, which solves the problems that existing deep learning networks cannot accurately extract effective features and cannot accurately recognize original face pictures.

To solve the above technical problems, a specific embodiment of the present invention provides a method for constructing and training a face recognition feature extraction network, comprising: constructing a feature extraction network and a metric learning dimensionality reduction network, wherein the output of the feature extraction network is the input of the metric learning dimensionality reduction network; training the feature extraction network on the full sample set to output a feature set; screening the feature set by semantic sampling to obtain a pure sample set; and training the metric learning dimensionality reduction network on the pure sample set.

According to the above specific embodiments, the method for constructing and training a face recognition feature extraction network has at least the following beneficial effects. Feature extraction is performed jointly by the feature extraction network and the metric learning dimensionality reduction network. In the feature extraction network, a deep learning network stacked in stages (Stages) is designed, giving it better feature extraction capability. In the design of each Stage, 1*1, 3*3, and 5*5 convolution kernels are applied simultaneously to the feature map of the previous layer, and the resulting feature maps are stacked to extract multi-scale features; a 3*3 convolution kernel then convolves the multi-scale feature map to fuse the features of the multi-scale kernels. By varying the feature-map dimension, the network first expands to fully learn a more complete set of features and then compresses to remove redundant features. Each Stage can be regarded as a stack of convolution kernels; such stacking obtains a larger receptive field with fewer weights and strengthens the nonlinear representation layers of the deep network. In addition, a metric learning dimensionality reduction network is introduced, whose input is the lower-dimensional features of the image extracted by the feature extraction network. The output feature set of the feature extraction network is screened by semantic sampling into a pure sample set, which is then used to train the metric learning dimensionality reduction network; the metric learning loss function triplet loss is then used to optimize it. Performing feature extraction with both networks improves the representation ability of the features, fully mines the feature information in the data, guides the deep learning network to converge quickly, and allows original face pictures to be accurately recognized.

It should be understood that the above general description and the following specific embodiments are merely exemplary and explanatory, and do not limit the scope claimed by the present invention.

Brief Description of the Drawings

The accompanying drawings, which form part of the specification of the invention, illustrate example embodiments of the invention and, together with the description, serve to explain the principles of the invention.

Figure 1 is a flowchart of Embodiment 1 of a method for constructing and training a face recognition feature extraction network provided by a specific embodiment of the present invention;

Figure 2 is a flowchart of Embodiment 2 of a method for constructing and training a face recognition feature extraction network provided by a specific embodiment of the present invention;

Figure 3A is a schematic diagram of an original face picture provided by a specific embodiment of the present invention;

Figure 3B is a schematic diagram of a standard face picture obtained after processing the original face picture with the natural face recognition network provided by a specific embodiment of the present invention;

Figure 4 is a schematic diagram of the feature extraction network provided by a specific embodiment of the present invention;

Figure 5 is a schematic diagram of a two-dimensional feature residual provided by a specific embodiment of the present invention;

Figure 6A is a schematic diagram of the traditional triplet loss;

Figure 6B is a schematic diagram of the triplet loss provided by a specific embodiment of the present invention.

Figure 7 is a schematic structural diagram of the CASIA-Net deep neural network in the prior art.

Detailed Description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the spirit of the disclosure will be clearly described below with the accompanying drawings and detailed descriptions. After understanding the embodiments of the present invention, any person skilled in the art may change and modify the techniques taught herein without departing from the spirit and scope of the disclosure.

The exemplary embodiments of the present invention and their descriptions are used to explain the present invention, not to limit it. In addition, elements/components with the same or similar reference numbers in the drawings and embodiments represent the same or similar parts.

As used herein, "first", "second", etc. do not denote any particular order or sequence, nor are they used to limit the present invention; they merely distinguish elements or operations described with the same technical term.

Directional terms used herein, such as up, down, left, right, front, or rear, refer only to the directions in the drawings. They are used for illustration, not to limit the invention.

As used herein, "comprise", "include", "have", "contain", and the like are open terms, meaning including but not limited to.

As used herein, "and/or" includes any or all combinations of the stated items.

The terms "approximately", "about", and the like are used herein to modify any quantity or error that may vary slightly without changing its essence. In general, the range of slight variation or error modified by such terms may be 20% in some embodiments, 10% in some embodiments, 5% in some embodiments, or another value. Those skilled in the art should understand that the aforementioned values can be adjusted according to actual needs and are not limited thereto.

Certain terms used to describe the present application are discussed below or elsewhere in this specification to provide those skilled in the art with additional guidance.

Figure 1 is a flowchart of Embodiment 1 of a method for constructing and training a face recognition feature extraction network provided by a specific embodiment of the present invention. As shown in Figure 1, a feature extraction network and a metric learning dimensionality reduction network are constructed, and both networks are trained.

The specific implementation shown in this figure includes:

Step 101: construct a feature extraction network and a metric learning dimensionality reduction network, wherein the output of the feature extraction network is the input of the metric learning dimensionality reduction network. The feature extraction network and the metric learning dimensionality reduction network together form the feature extraction module of the natural face recognition network.

Step 102: train the feature extraction network on the full sample set so as to output a feature set. Specifically, this includes training the feature extraction network with the softmax loss function on the full sample set. In a specific embodiment of the present invention, the feature dimension of the feature set is 320; the full sample set is the CASIA-WebFace database, which contains 10,575 categories and 490,000 pictures. In the feature extraction network, 1*1, 3*3, and 5*5 convolution kernels are applied simultaneously, forming a multi-scale feature fusion scheme.
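The multi-scale fusion described above can be sketched as follows. This is a minimal numpy illustration only, not the patented network: all channel counts and kernel values are hypothetical, and the "same"-padding convolution is a naive reference implementation.

```python
import numpy as np

def conv2d(x, kernels):
    """Naive 'same'-padded 2-D convolution: (C, H, W) input, (N, C, kh, kw) kernels."""
    n, c, kh, kw = kernels.shape
    _, h, w = x.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.empty((n, h, w))
    for o in range(n):
        for i in range(h):
            for j in range(w):
                out[o, i, j] = np.sum(xp[:, i:i + kh, j:j + kw] * kernels[o])
    return out

def multi_scale_stage(x, k1, k3, k5, k_fuse):
    """One 'Stage': convolve the previous layer's feature map with 1*1, 3*3 and
    5*5 kernels in parallel, stack the results along the channel axis
    (expansion), then fuse/compress them with a single 3*3 convolution."""
    branches = np.concatenate([conv2d(x, k1), conv2d(x, k3), conv2d(x, k5)], axis=0)
    return conv2d(branches, k_fuse)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))            # hypothetical 8-channel feature map
k1 = rng.standard_normal((16, 8, 1, 1)) * 0.1
k3 = rng.standard_normal((16, 8, 3, 3)) * 0.1
k5 = rng.standard_normal((16, 8, 5, 5)) * 0.1
k_fuse = rng.standard_normal((12, 48, 3, 3)) * 0.1  # compress 48 -> 12 channels
y = multi_scale_stage(x, k1, k3, k5, k_fuse)
print(y.shape)  # (12, 16, 16)
```

The expand-then-compress pattern (8 channels to 48, then down to 12) mirrors the description: first learn a more complete multi-scale feature, then remove redundancy.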

Step 103: screen the feature set by semantic sampling to obtain a pure sample set. Specifically, semantic sampling selects the 90% of samples in the feature set farthest from the feature plane as the pure sample set. In a specific embodiment of the present invention, the pure sample set is called DataSubset; the feature plane is obtained by logistic regression; and the feature dimension of the pure sample set is 320.
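One plausible reading of this screening step is sketched below, assuming the "feature plane" is the decision plane of a two-class logistic regression (the patent fits it over the full multi-class feature set, which is not reproduced here); the toy data and all hyperparameters are assumptions.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain-numpy logistic regression; returns the separating plane (w, b)."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)        # clip logits for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def semantic_sample(X, y, keep=0.9):
    """Keep the `keep` fraction of samples farthest from the feature plane."""
    w, b = fit_logistic(X, y)
    dist = np.abs(X @ w + b) / np.linalg.norm(w)   # distance to the plane
    thresh = np.quantile(dist, 1.0 - keep)          # drop the closest 10%
    return X[dist >= thresh], y[dist >= thresh]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 320)), rng.normal(2, 1, (100, 320))])
y = np.r_[np.zeros(100), np.ones(100)]
Xp, yp = semantic_sample(X, y)
print(len(Xp))  # 180 of 200 samples survive the screening
```

Samples close to the plane are the ambiguous (noisy) ones; discarding them leaves a "pure" subset for training the dimensionality reduction network.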

Step 104: train the metric learning dimensionality reduction network on the pure sample set; after passing through the metric learning dimensionality reduction network, the feature dimension of the pure sample set is reduced from 320 to 128.

Referring to Figure 1, the present invention introduces a metric learning dimensionality reduction network and processes standard face pictures with the feature extraction network and the metric learning dimensionality reduction network in turn. This improves the representation ability of the features, fully mines the feature information in the data, guides the deep learning network to converge quickly, and allows original face pictures to be accurately recognized.

Figure 2 is a flowchart of Embodiment 2 of a method for constructing and training a face recognition feature extraction network provided by a specific embodiment of the present invention. As shown in Figure 2, after the metric learning dimensionality reduction network is trained on the pure sample set, it must be further optimized with the improved metric learning loss function triplet loss.

In the specific implementation shown in this figure, after step 104 the method further comprises:

Step 105: optimize the metric learning dimensionality reduction network with the metric learning loss function triplet loss. When doing so, a sample point far away from most sample points is taken as the anchor, so that such outlying points are pulled toward the majority of sample points.

The specific formula of the metric learning loss function triplet loss used in the present invention is:

tripletloss = log(1 + z)

where z = exp(||fa − fp||₂² − ||fa − fn||₂² + margin) is an intermediate variable; fa is the feature of the selected anchor sample a; fp is the feature of the selected positive sample p; fn is the feature of the selected negative sample n; and margin is a manually set fixed interval.

In contrast, the metric learning loss function in the prior art is:

Loss = Σᵢᴺ max(0, ||f(xᵢᵃ) − f(xᵢᵖ)||₂² − ||f(xᵢᵃ) − f(xᵢⁿ)||₂² + α)

Obviously, compared with the existing metric learning loss function, the improved metric learning loss function of the present invention adds a balance factor for excessively large residuals, which plays a smoothing role, so that the network parameters of the metric learning dimensionality reduction network reach a better position.
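The two losses can be compared directly in a short numpy sketch. Here z is taken as exp(d_ap − d_an + margin), an assumption consistent with the variables listed in the description; the batch data is synthetic.

```python
import numpy as np

def triplet_loss_hinge(fa, fp, fn, alpha=0.2):
    """Prior-art hinge triplet loss: max(0, d_ap - d_an + alpha)."""
    d_ap = np.sum((fa - fp) ** 2, axis=-1)
    d_an = np.sum((fa - fn) ** 2, axis=-1)
    return np.maximum(0.0, d_ap - d_an + alpha)

def triplet_loss_smooth(fa, fp, fn, margin=0.2):
    """Smoothed form from the description: log(1 + z), z = exp(d_ap - d_an + margin)."""
    d_ap = np.sum((fa - fp) ** 2, axis=-1)
    d_an = np.sum((fa - fn) ** 2, axis=-1)
    return np.log1p(np.exp(d_ap - d_an + margin))

rng = np.random.default_rng(2)
fa, fp = rng.standard_normal((2, 4, 128))        # 4 anchors / positives, 128-d features
fn = fa + 0.1 * rng.standard_normal((4, 128))    # hard negatives near the anchors
h = triplet_loss_hinge(fa, fp, fn)
s = triplet_loss_smooth(fa, fp, fn)
print(h.min() >= 0, s.min() > 0)  # True True
```

Unlike the hinge form, the log(1 + z) form is everywhere smooth and never exactly zero, so easy triplets still contribute a small, bounded gradient rather than being clipped away abruptly.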

Referring to Figure 2, the improved metric learning loss function triplet loss is used to optimize the metric learning dimensionality reduction network, bringing its network parameters to a better position and improving the performance and robustness of the network.

Figure 3A is a schematic diagram of an original face picture provided by a specific embodiment of the present invention; Figure 3B is a schematic diagram of a standard face picture obtained after processing the original face picture with the natural face recognition network provided by a specific embodiment of the present invention. As shown in Figures 3A and 3B, original face pictures captured in a natural state are often disorganized. For example, an original face picture is likely to contain multiple faces and a complex background, and the in-plane rotation angles of the faces may differ. If such pictures were passed directly to the deep networks (such as the feature extraction network or the metric learning dimensionality reduction network), the information the deep network sees could include multiple faces of different sizes and angles, along with much background noise. A deep network can certainly approximate the effective content of a picture through a large number of parameters and complex nonlinear functions, but introducing prior knowledge and preprocessing the input data enables the deep network to learn efficient features in a more fine-grained way. The task of processing the original face picture therefore mainly comprises: introducing prior knowledge, removing the cluttered background from the incoming picture, and removing in-plane rotation of the face. After processing, the data should contain only one face. In this application, five-point calibration is adopted; as shown in the figures, the images before and after alignment can be compared, and after alignment the facial features appear in fixed regions.

An affine transformation is a mapping from one set of two-dimensional coordinates to another, and can be written in the following form:

x' = ax + by + m        (2.1)

y' = cx + dy + n        (2.2)

where x' and y' are the new coordinates, which can be calculated from the coefficients a, b, c, d, m, n and the original coordinates x, y.

As the above formulas show, the parameters of an affine transformation are uniquely determined by three point pairs. To impose additional constraints, five-point detection is used to estimate and constrain the affine transformation parameters; this is referred to here as a "five-point affine transformation". Since the five points mark the positions of the facial features, the facial features of images aligned by the five-point affine transformation lie at roughly the same positions. All pictures are aligned to 128*128 images, where the standard facial feature positions are (32,50), (96,50), (64,75), (43,90), and (86,90), representing the left eye, right eye, nose tip, left mouth corner, and right mouth corner respectively.
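With five point pairs, equations (2.1) and (2.2) give ten equations in six unknowns, so the parameters can be estimated by least squares. The sketch below uses the standard template positions from the description; the least-squares solver and the example landmarks are assumptions, as the patent does not specify the fitting method.

```python
import numpy as np

# Standard five-point template: left eye, right eye, nose tip,
# left mouth corner, right mouth corner on a 128*128 image.
TEMPLATE = np.array([(32, 50), (96, 50), (64, 75), (43, 90), (86, 90)], float)

def five_point_affine(src):
    """Least-squares fit of x' = ax + by + m, y' = cx + dy + n mapping the five
    detected landmarks `src` onto TEMPLATE (10 equations, 6 unknowns)."""
    A = np.zeros((10, 6))
    A[0::2, 0:2] = src; A[0::2, 2] = 1.0   # rows for the x' equations
    A[1::2, 3:5] = src; A[1::2, 5] = 1.0   # rows for the y' equations
    rhs = TEMPLATE.reshape(-1)             # interleaved [x1', y1', x2', y2', ...]
    p, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    a, b, m, c, d, n = p
    return np.array([[a, b, m], [c, d, n]])

# Hypothetical detected landmarks: the template shifted by (10, 5).
src = TEMPLATE + np.array([10.0, 5.0])
M = five_point_affine(src)
mapped = src @ M[:, :2].T + M[:, 2]
print(np.allclose(mapped, TEMPLATE))  # True: the shift is recovered exactly
```

Because the system is overdetermined, noisy landmark detections are averaged out rather than fitted exactly, which is the constraining effect the five points provide.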

Figure 4 is a schematic diagram of the feature extraction network provided by a specific embodiment of the present invention. As shown in Figure 4, the feature extraction network is trained first, and the metric learning dimensionality reduction network is trained next.

The training procedure for the feature extraction network and the metric learning dimensionality reduction network can be summarized as follows:

Step 1: pass the standard face pictures into the feature extraction network and train it with the identity information as the reference (i.e., supervision) signal and softmax as the loss function.

Step 2: pass all standard face pictures through the trained feature extraction network to extract lower-dimensional features.

Step 3: apply semantic sampling to the lower-dimensional features to obtain data A.

Step 4: pre-train the metric learning dimensionality reduction network with the identity information as the reference signal and the softmax function as the loss function.

Step 5: with the category relationships as the reference signal and the improved tripletloss function as the loss function, apply metric sampling to the lower-dimensional features in units of batches to obtain data B, and feed it into the metric learning dimensionality reduction network to obtain low-dimensional features.

At this point the feature extraction module is fully trained. During prediction, the input data is passed through the two trained networks in turn to obtain a low-dimensional representation of the input image.

The feature extraction network has the following characteristics: first, the convolution kernels are stacked in stages and the fully connected layers are removed, so features are extracted using convolutional layers only; second, the feature map produced by the last convolutional layer serves as the feature layer and is not passed through a ReLU activation, which keeps the features low-dimensional and dense; third, to suppress noise, average pooling is applied to the features extracted by the last convolution kernel.

The present invention improves on the stage design: the stages in the feature extraction network provide multi-scale feature fusion, feature decorrelation and feature dimension reduction. Compared with an ordinary stage, the fifth stage (stage5) omits the 5*5 convolution kernel, because by the time the network has convolved down to stage5 the height and width of its input are already small, so a 5*5 convolution is not used. The whole feature extraction network contains 11 non-linear convolutional layers and 1 fully connected classification layer. The network parameters are listed in Table 1.

Table 1

The feature extraction network is trained for 200,000 iterations with an initial learning rate of 0.01, which is scaled by gamma = 0.8 every 10,000 iterations. The learning rate at the current iteration = initial learning rate * gamma^(iterations/10000) (that is, training starts at a learning rate of 0.01, the rate is 0.01*gamma at the 10,000th iteration, 0.01*gamma^2 at the 20,000th, and so on). The weight decay coefficient is 5e-4 (i.e., 5×10^-4), and the batch size is 150.
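The schedule above can be sketched as a one-line staircase function; the `//` integer division reproduces the once-per-10,000-iterations update described in the text:

```python
def learning_rate(iteration, base_lr=0.01, gamma=0.8, step=10_000):
    """Staircase schedule from the text: the rate is multiplied by gamma
    once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)
```

For example, the rate is 0.01 for the first 10,000 iterations, 0.008 for the next 10,000, and so on down the staircase.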

In the feature extraction network training above, all samples of CASIA-WebFace are used as the training set, and the feature set output for this training set is denoted DataSet_All. DataSet_All contains I person identities. A subset of DataSet_All, denoted DataSubset, must now be extracted by semantic sampling to serve as the training data for the metric learning dimensionality reduction network.

In principle, the harder-to-learn samples should be chosen for DataSubset: such samples are difficult to classify and form the sample set that the feature extraction network has not resolved well, so using them to train the metric learning dimensionality reduction network addresses what the feature extraction network left unsolved. However, the CASIA-WebFace database contains 10,575 categories and 490,000 pictures, and after mirroring the whole database holds nearly 1,000,000 pictures. Traversing 1,000,000 pictures for every picture of every category to find the X nearest samples of other categories would be computationally prohibitive. Moreover, this sampling scheme has two problems: first, for data gathered from the web, such as CASIA-WebFace, the labels are often wrong, so sampling in this way easily collects noise and degrades the overall performance of the network; second, it over-weights the hardest-to-classify pictures, producing an unbalanced sample distribution.

To solve these problems, a simple binary classifier (logistic regression) is trained for each identity category. For a given identity, the positive samples are the samples of that identity, and the negative samples are twice as many samples drawn at random from the other identities. The binary classifier trained on such data for that identity is a weak classifier; its performance is not especially high, but it has some tolerance to noise. Logistic regression amounts to finding a hyperplane, with parameters w and b, that separates the positive and negative samples; this hyperplane is taken as the feature plane of the category. After the 10,575 feature planes are computed, the feature plane closest to each one is found. The distance between feature planes is given by formula 2.3 below, where f_i and f_j denote the feature planes of the i-th and j-th categories. For each category's feature plane and its nearest feature plane, the 90% of the category's samples farthest from the feature plane (i.e., the 90% most typical of the category) are selected as positives, and 75% of those are then drawn at random as the sampled set; this procedure is repeated over all 10,575 categories.
For the sampling process only the distance from a sample to the feature plane is needed, not the probability space of the non-linear mapping: in formula 2.4 below, z is the distance from a sample to the feature plane, f_i is the feature plane of the i-th category, and x_i is a sample feature of category i. The specific steps are summarized as follows:

Step 1: for an identity i in I, take all samples of that identity as Pos_i (N samples in total), and from the samples of all identities j with j in I and j ≠ i, select 2N samples as Neg_i.

Step 2: train a logistic regression on Pos_i and Neg_i to obtain the parameters w_i and b_i of the feature plane P_i.

Step 3: repeat steps 1 and 2 to compute all feature planes.

Step 4: for identity i in I, use formula 2.3 to find the feature plane f_j closest to the feature plane f_i.

Step 5: using formula 2.4, compute the distance from every sample x_i of identity i to the feature plane f_i, sort in descending order, take the top 90% as sub_90, and randomly select 75% of the samples in sub_90 into DataSubset.

Step 6: for all samples x_j of identity j (the category semantically closest to identity i), select samples into DataSubset by the method of step 5.

Step 7: repeat steps 4 to 6 until all identities have been traversed, yielding the final DataSubset.
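The per-identity weak classifier and the 90%/75% selection of steps 1 to 5 can be sketched in miniature on 2-D toy features (a minimal stand-in, not the actual CASIA-WebFace pipeline; names and hyperparameters are illustrative):

```python
import math
import random

def train_logistic(pos, neg, lr=0.5, epochs=200):
    """Steps 1-2 in miniature: plain stochastic gradient descent
    logistic regression on 2-D points. Returns the feature-plane
    parameters (w, b)."""
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            s = max(-30.0, min(30.0, w[0] * x[0] + w[1] * x[1] + b))
            g = 1.0 / (1.0 + math.exp(-s)) - y   # sigmoid(s) - label
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

def sample_identity(pos, w, keep_far=0.9, keep_rand=0.75, rng=None):
    """Step 5: rank positives by z = w.x (formula 2.4), keep the 90%
    farthest from the feature plane, then draw 75% of those at random."""
    rng = rng or random.Random(0)
    by_dist = sorted(pos, key=lambda x: -(w[0] * x[0] + w[1] * x[1]))
    sub_90 = by_dist[:int(keep_far * len(pos))]
    return rng.sample(sub_90, int(keep_rand * len(sub_90)))
```

With 40 positives, the top-90% cut keeps 36 samples and the random 75% draw keeps 27 of them, matching the two-stage selection described above.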

dis_ij = (f_i · f_j) / (||f_i|| * ||f_j||)    2.3

z_sample = w_i · x_i    2.4
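The two formulas translate directly into code (a literal sketch; function names are illustrative):

```python
import math

def plane_distance(fi, fj):
    """Formula 2.3: cosine similarity between two feature-plane normals."""
    dot = sum(a * b for a, b in zip(fi, fj))
    ni = math.sqrt(sum(a * a for a in fi))
    nj = math.sqrt(sum(b * b for b in fj))
    return dot / (ni * nj)

def sample_plane_distance(wi, xi):
    """Formula 2.4: signed distance proxy z = w_i . x_i, with no
    non-linear (sigmoid) mapping applied."""
    return sum(a * b for a, b in zip(wi, xi))
```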

The reason the 90% of a category's samples farthest from the feature plane are selected during semantic sampling is that these are the samples most typical of the category; mislabeled samples and poor-quality pictures are thereby excluded.

In addition, the metric learning dimensionality reduction network is a fully connected network. Its input is the output of the feature extraction network, a 320-dimensional feature, which is mapped by a fully connected layer to 128 hidden neurons and by a second fully connected layer to the number of CASIA-WebFace identities, 10,575. The parameter settings are listed in Table 2. The low-dimensional features here are dense, and no ReLU activation is applied after the fully connected layers.
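A minimal numpy sketch of this fully connected architecture, with hypothetical random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer shapes follow the text: 320-d input feature -> 128-d hidden layer
# (the low-dimensional embedding, no ReLU) -> 10575 identity logits.
W1 = rng.standard_normal((320, 128)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal((128, 10575)) * 0.01
b2 = np.zeros(10575)

def embed(x):
    """The 128-d embedding used for metric learning; deliberately linear,
    since the text states no ReLU follows the fully connected layers."""
    return x @ W1 + b1

def logits(x):
    """Identity logits used only during the softmax pre-training stage."""
    return embed(x) @ W2 + b2

batch = rng.standard_normal((150, 320))   # batch size 150, as in training
```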

Table 2

This application improves the traditional tripletloss in the following three respects.

First, regarding the introduction of tripletLoss: a dedicated metric learning dimensionality reduction network is designed, and the loss is used only in this network. With this network in place, the batch samples that tripletloss observes are no longer the original pictures x but the lower-dimensional features of x extracted by the feature extraction network. Furthermore, the metric learning dimensionality reduction network contains only one fully connected hidden layer, so it has few parameters and little intermediate data to store, which greatly reduces memory (or GPU memory) usage and allows the network to be trained on a single GPU.

Second, dataset pre-training is adopted. Samples of random categories are first placed in each batch, so that the batch suits the residual of the softmax loss function for updating the network parameters; training stops once the network reaches a good position. Then, sampling by category, 30 samples are drawn per category for 100 categories, 3,000 samples in total, and placed in the batch, so that the batch now suits the tripletloss residual for updating the network parameters. Using the two training regimes in turn lets the network begin metric learning from a good starting point, which benefits both triplet sampling and the balance between the two loss functions.
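The second-phase batch layout (100 categories, 30 samples each, 3,000 samples per batch) can be sketched as follows; the `dataset` mapping from label to feature list is an assumed layout, not part of the patent:

```python
import random

def make_triplet_batch(dataset, n_classes=100, per_class=30, rng=None):
    """Second-phase batch construction from the text: pick 100 categories
    and 30 samples from each, yielding 3000 (feature, label) pairs."""
    rng = rng or random.Random(0)
    labels = rng.sample(sorted(dataset), n_classes)
    batch = []
    for l in labels:
        feats = dataset[l]
        if len(feats) >= per_class:
            picks = rng.sample(feats, per_class)
        else:  # small classes: sample with replacement to fill the quota
            picks = [rng.choice(feats) for _ in range(per_class)]
        batch.extend((f, l) for f in picks)
    return batch
```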

Finally, the tripletloss loss function is improved by adding a balancing factor against overly large residuals, as in formulas 2.5 and 2.6:

Loss = log(1 + z)    2.5

z = (1/2) max(margin + ||f_a - f_p||_2^2 - ||f_a - f_n||_2^2, 0)    2.6

The log function has a smoothing effect: with it, when the network loss is differentiated with respect to the feature f_a of a selected sample a, a factor of 1/(1+z) appears, and the larger z becomes, the smaller this residual coefficient is, which smooths the update.
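Formulas 2.5 and 2.6 translate directly into code; the margin value below is an illustrative assumption:

```python
import math

def improved_triplet_loss(fa, fp, fn, margin=0.2):
    """Formulas 2.5/2.6: a squared-L2 triplet hinge wrapped in log(1+z).
    The log wrapper produces the 1/(1+z) damping of the gradient that
    the text describes."""
    d_ap = sum((a - p) ** 2 for a, p in zip(fa, fp))
    d_an = sum((a - n) ** 2 for a, n in zip(fa, fn))
    z = 0.5 * max(margin + d_ap - d_an, 0.0)
    return math.log(1.0 + z)
```

When the negative is already farther than the positive by more than the margin, z clamps to 0 and the loss vanishes, so only violating triplets contribute gradients.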

FIG. 5 is a schematic diagram of a two-dimensional feature residual provided by an embodiment of the present invention. As shown in FIG. 5, point a is the selected sample point (the anchor), p is a sample with the same identity as a (a positive sample), and n is a sample with a different identity (a negative sample). The double arrows represent the distances within the positive pair and the negative pair. By formulas 2.5 and 2.6, the residual of the loss function with respect to sample a is f_n - f_p, the difference of the two feature vectors; this is the vector pointing from p to n, so the gradient direction is from p towards n with magnitude equal to the modulus of f_n - f_p. The network is updated by gradient descent, that is, its parameters are adjusted so that the network moves against the gradient residual; in other words, the parameters should change so that point a moves towards point a'. With smoothing added, the gradient magnitude is no longer the modulus of f_n - f_p but is multiplied by the coefficient 1/(1+z): the network still changes its parameters so that a moves towards a', but as z grows the coefficient shrinks, so a moves smoothly against the gradient direction.
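For the active hinge (z > 0), the chain rule makes this damping explicit; written in the document's notation:

∂Loss/∂f_a = (1/(1+z)) · ∂z/∂f_a

∂z/∂f_a = (1/2)(2(f_a - f_p) - 2(f_a - f_n)) = f_n - f_p

so ∂Loss/∂f_a = (f_n - f_p)/(1+z), which is exactly the f_n - f_p residual scaled by the 1/(1+z) smoothing coefficient described above.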

FIG. 6A is a schematic diagram of the traditional tripletLoss, and FIG. 6B is a schematic diagram of the tripletLoss provided by an embodiment of the present invention. As shown in FIG. 6A and FIG. 6B, the traditional tripletLoss requires that the sampled negatives be merely hard rather than the hardest negatives, since the hardest ones would cause gradient collapse early in training. The tripletLoss proposed by the present invention, however, is used in a network with few parameters whose weights are already close to a good position, so triplet sampling need not strictly follow the requirements of the normalized Euclidean distance formulation.

In the improved tripletloss of the present invention, semantic sampling is used for training. Since semantic sampling has already excluded much of the noise, the collected positive samples contain little noise and the network suffers little interference. Experiments show that when each of the 100 categories in a batch contains 30 samples and each sample in turn serves as the anchor, the sample labeled 1 in the batch is always the one selected as the hardest positive for the other anchors. The samples chosen as hardest positives are concentrated: there are always a few "outlier" points far from the other sample points. As shown in FIG. 6A and FIG. 6B, the grey level encodes, for each sample point, how often it is chosen as the hardest positive when the other points act as anchors; the darker the grey, the higher the frequency. Rather than taking every point as an anchor and pulling the majority of points towards the outliers, it is better to take the outliers as anchors and pull them towards the majority.
FIG. 6A shows the original sampling scheme, whose gradients draw most points towards the "outlier" points; FIG. 6B shows the sampling scheme of this work, which draws the "outlier" points towards the majority. Take the point labeled 1 in the batch of FIG. 6B as an example: it is considered the hardest positive for 11 points of the same category, and these 11 points are said to form set_1. With point 1 as the anchor and the 11 points of set_1 as its hardest positives, negative samples satisfying formula 2.7 below are drawn at random. This supplies 11 different feature errors for anchor 1, so the weight updates move point 1 towards the majority of the sample points.

||f(x_i^a) - f(x_i^p)||_2^2 < ||f(x_i^a) - f(x_i^n)||_2^2    2.7

The specific steps are as follows:

Step 1: prepare the data from the DataSubset dataset by sampling 30 samples per category, with each batch containing 100 categories.

Step 2: initialize the network with the pre-trained network coefficients.

Step 3: in each batch, select the hardest positive sample for every input sample, forming a set.

Step 4: take each sample in the set as an anchor, with its corresponding sample as the hardest positive; randomly select a negative sample satisfying formula 2.7 to form a triplet, and update the network parameters.
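The steps above can be sketched as follows (a toy version on raw feature tuples; the hardest-positive-as-anchor choice mirrors the outlier scheme described for FIG. 6B, and all names are illustrative):

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_triplets(feats, labels, rng=None):
    """Sketch of steps 3-4: for each point, find its hardest positive
    (the farthest same-label point); that hardest positive serves as the
    anchor, the original point as its positive, and a random negative
    satisfying formula 2.7 (farther from the anchor than the positive)
    completes the triplet."""
    rng = rng or random.Random(0)
    triplets = []
    for i, f in enumerate(feats):
        pos_idx = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
        if not pos_idx:
            continue
        hardest = max(pos_idx, key=lambda j: sq_dist(f, feats[j]))
        a, p = feats[hardest], f          # outlier as anchor, as in the text
        negs = [feats[j] for j, l in enumerate(labels)
                if l != labels[i] and sq_dist(a, p) < sq_dist(a, feats[j])]
        if negs:
            triplets.append((a, p, rng.choice(negs)))
    return triplets
```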

A specific embodiment of the present invention provides a method for constructing and training a face recognition feature extraction network, in which the feature extraction network and the metric learning dimensionality reduction network are used together for feature extraction. The feature extraction network is a deep learning network built by stacking stages, which gives it stronger feature extraction ability. Within a stage, 1*1, 3*3 and 5*5 convolution kernels are applied simultaneously to the feature map of the previous layer, and the resulting feature maps are stacked to extract multi-scale features; a 3*3 convolution kernel then convolves the multi-scale feature map to fuse the features of the multi-scale kernels, and by varying the feature map dimension the network first expands, to fully learn reasonably complete features, and then compresses, to remove redundant features. Each stage can be viewed as a stacking of convolution kernels, which achieves a large receptive field with fewer weights and strengthens the non-linear representation of the deep learning network. In addition, a metric learning dimensionality reduction network is introduced,
whose input is the lower-dimensional feature of the image extracted by the feature extraction network. The output feature set of the feature extraction network is filtered by semantic sampling into a pure sample set, which is used to train the metric learning dimensionality reduction network; the metric learning loss function tripletloss then optimizes that network. Using the two networks together for feature extraction improves the representational power of the features, fully mines the feature information in the data, guides the deep learning network to a fast solution, and allows original face pictures to be recognized accurately.

The above embodiments of the present invention can be implemented in hardware, in software, or in a combination of the two. For example, an embodiment of the present invention may be program code executing the above method in a digital signal processor (DSP). The invention may also involve various functions performed by a computer processor, a digital signal processor, a microprocessor or a field programmable gate array (FPGA). Such processors can be configured according to the present invention to perform particular tasks by executing machine-readable software code or firmware code that defines the particular methods disclosed herein. The software or firmware code may be developed in different programming languages and in different formats or forms, and may be compiled for different target platforms. However, different code styles, types and languages of software code, and other forms of configuration code, for performing tasks according to the present invention do not depart from the spirit and scope of the present invention.

The above are merely illustrative specific embodiments of the present invention. Any equivalent changes and modifications made by those skilled in the art without departing from the concept and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for constructing and training a face recognition feature extraction network, characterized in that the method comprises:
constructing a feature extraction network and a metric learning dimensionality reduction network, wherein the output of the feature extraction network is the input of the metric learning dimensionality reduction network;
training the feature extraction network based on all sample sets so as to output a feature set;
filtering the feature set by semantic sampling so as to obtain a pure sample set; and
training the metric learning dimensionality reduction network based on the pure sample set.

2. The method for constructing and training a face recognition feature extraction network according to claim 1, characterized in that, after the step of training the metric learning dimensionality reduction network based on the pure sample set, the method further comprises:
optimizing the metric learning dimensionality reduction network using the metric learning loss function tripletLoss.

3. The method for constructing and training a face recognition feature extraction network according to claim 2, characterized in that the specific formula of the metric learning loss function tripletLoss is:
tripletloss = log(1 + z)
z = (1/2) max(margin + ||f_a - f_p||_2^2 - ||f_a - f_n||_2^2, 0)
wherein z is an intermediate variable; f_a is the feature of a selected sample a; f_p is the feature of a selected sample p; f_n is the feature of a selected sample n; and margin is a preset fixed interval.

4. The method for constructing and training a face recognition feature extraction network according to claim 2, characterized in that, when the metric learning loss function tripletLoss is used to optimize the metric learning dimensionality reduction network, sample points far away from most sample points are taken as anchor points, so that the sample points far away from most sample points move closer to most sample points.

5. The method for constructing and training a face recognition feature extraction network according to claim 1, characterized in that the step of training the feature extraction network based on all sample sets so as to output a feature set specifically comprises:
training the feature extraction network based on all sample sets using the loss function softmax so as to output a feature set.

6. The method for constructing and training a face recognition feature extraction network according to claim 1, characterized in that the step of filtering the feature set by semantic sampling so as to obtain a pure sample set specifically comprises:
selecting, by semantic sampling, the 90% of samples in the feature set farthest from the feature plane as the pure sample set.

7. The method for constructing and training a face recognition feature extraction network according to claim 6, characterized in that the feature plane is obtained by logistic regression.

8. The method for constructing and training a face recognition feature extraction network according to claim 1, characterized in that 1*1, 3*3 and 5*5 convolution kernels are used simultaneously in the feature extraction network, thereby forming a multi-scale feature fusion scheme.

9. The method for constructing and training a face recognition feature extraction network according to claim 1, characterized in that the feature dimension of the feature set is 320, and the feature dimension of the pure sample set after the metric learning dimensionality reduction network is reduced to 128.

10. The method for constructing and training a face recognition feature extraction network according to claim 1, characterized in that all the sample sets are the CASIA-WebFace database.
CN201610726171.1A2016-08-252016-08-25 Method for constructing and training face recognition feature extraction networkExpired - Fee RelatedCN106372581B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN201610726171.1ACN106372581B (en)2016-08-252016-08-25 Method for constructing and training face recognition feature extraction network

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN201610726171.1ACN106372581B (en)2016-08-252016-08-25 Method for constructing and training face recognition feature extraction network

Publications (2)

Publication NumberPublication Date
CN106372581Atrue CN106372581A (en)2017-02-01
CN106372581B CN106372581B (en)2020-09-04

Family

ID=57879182

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN201610726171.1AExpired - Fee RelatedCN106372581B (en)2016-08-252016-08-25 Method for constructing and training face recognition feature extraction network

Country Status (1)

CountryLink
CN (1)CN106372581B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101398846A (en) * | 2008-10-23 | 2009-04-01 | Shanghai Jiao Tong University | Image semantic concept detection method based on local color-space features
US20140056509A1 (en) * | 2012-08-22 | 2014-02-27 | Canon Kabushiki Kaisha | Signal processing method, signal processing apparatus, and storage medium
CN103778414A (en) * | 2014-01-17 | 2014-05-07 | Hangzhou Dianzi University | Real-time face recognition method based on a deep neural network
CN104866810A (en) * | 2015-04-10 | 2015-08-26 | Beijing University of Technology | Face recognition method based on a deep convolutional neural network
US20160034786A1 (en) * | 2014-07-29 | 2016-02-04 | Microsoft Corporation | Computerized machine learning of interesting video sections
CN105512620A (en) * | 2015-11-30 | 2016-04-20 | Beijing Techshino Technology Co., Ltd. | Convolutional neural network training method and apparatus for face recognition
CN105608450A (en) * | 2016-03-01 | 2016-05-25 | Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co., Ltd. | Heterogeneous face recognition method based on a deep convolutional neural network
CN105760833A (en) * | 2016-02-14 | 2016-07-13 | Beijing Feisou Technology Co., Ltd. | Face feature recognition method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Florian Schroff et al.: "FaceNet: A Unified Embedding for Face Recognition and Clustering", arXiv *
Sophia Sakellaridi et al.: "Graph-Based Multilevel Dimensionality Reduction with Applications to Eigenfaces and Latent Semantic Indexing", 2008 Seventh International Conference on Machine Learning and Applications *
Cheng Gong et al.: "Face Semantic Subspace Extraction Based on Sparse Learning", Computer Engineering *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108510052A (en) * | 2017-02-27 | 2018-09-07 | Gu Zecang | Construction method of a new artificial-intelligence neural network
CN106919921A (en) * | 2017-03-06 | 2017-07-04 | Chongqing University of Posts and Telecommunications | Gait recognition method and system combining subspace learning and tensor neural networks
CN107123111B (en) * | 2017-04-14 | 2020-01-24 | Huizhou Xuxin Intelligent Technology Co., Ltd. | Deep residual network construction method for mobile phone screen defect detection
CN107123111A (en) * | 2017-04-14 | 2017-09-01 | Zhejiang University | Deep residual network construction method for mobile phone screen defect detection
CN107657223A (en) * | 2017-09-18 | 2018-02-02 | South China University of Technology | Face authentication method based on fast-processing multiple distance metric learning
US10613891B2 (en) | 2017-12-21 | 2020-04-07 | Caterpillar Inc. | System and method for using virtual machine operator model
CN108921106A (en) * | 2018-07-06 | 2018-11-30 | Chongqing University | Capsule-based face recognition method
CN108921106B (en) * | 2018-07-06 | 2021-07-06 | Chongqing University | Capsule-based face recognition method
CN108985231B (en) * | 2018-07-12 | 2021-08-13 | Guangzhou Mailun Information Technology Co., Ltd. | Palm vein feature extraction method based on multi-scale convolution kernels
CN108985231A (en) * | 2018-07-12 | 2018-12-11 | Guangzhou Mailun Information Technology Co., Ltd. | Palm vein feature extraction method based on multi-scale convolution kernels
CN109145986B (en) * | 2018-08-21 | 2021-12-24 | Foshan Nanhai Guangdong University of Technology CNC Equipment Collaborative Innovation Institute | Large-scale face recognition method
CN109145986A (en) * | 2018-08-21 | 2019-01-04 | Foshan Nanhai Guangdong University of Technology CNC Equipment Collaborative Innovation Institute | Large-scale face recognition method
CN109359555A (en) * | 2018-09-21 | 2019-02-19 | Jiangsu Anhuang Lingyu Technology Co., Ltd. | High-precision fast face detection method
CN111091020A (en) * | 2018-10-22 | 2020-05-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Autonomous driving state discrimination method and device
CN109447990B (en) * | 2018-10-22 | 2021-06-22 | Beijing Kuangshi Technology Co., Ltd. | Image semantic segmentation method, apparatus, electronic device and computer-readable medium
CN109447990A (en) * | 2018-10-22 | 2019-03-08 | Beijing Kuangshi Technology Co., Ltd. | Image semantic segmentation method, apparatus, electronic device and computer-readable medium
CN109447897B (en) * | 2018-10-24 | 2023-04-07 | Wenchuang Smart Technology (Wuhan) Co., Ltd. | Real-scene image synthesis method and system
CN109447897A (en) * | 2018-10-24 | 2019-03-08 | Wenchuang Smart Technology (Wuhan) Co., Ltd. | Real-scene image synthesis method and system
CN109784163A (en) * | 2018-12-12 | 2019-05-21 | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences | Lightweight visual question answering system and method
CN109919320A (en) * | 2019-01-23 | 2019-06-21 | Northwestern Polytechnical University | Triplet network learning method based on semantic hierarchy
CN109919320B (en) * | 2019-01-23 | 2022-04-01 | Northwestern Polytechnical University | Triplet network learning method based on semantic hierarchy
CN110070120A (en) * | 2019-04-11 | 2019-07-30 | Tsinghua University | Deep metric learning method and system based on a discriminative sampling strategy
CN110070120B (en) * | 2019-04-11 | 2021-08-27 | Tsinghua University | Deep metric learning method and system based on a discriminative sampling strategy
CN110163261A (en) * | 2019-04-28 | 2019-08-23 | Ping An Technology (Shenzhen) Co., Ltd. | Imbalanced-data classification model training method, device, equipment and storage medium
CN110472494A (en) * | 2019-06-21 | 2019-11-19 | Shenzhen OneConnect Smart Technology Co., Ltd. | Facial feature extraction model training method, facial feature extraction method, device, equipment and storage medium
CN110991296A (en) * | 2019-11-26 | 2020-04-10 | Tencent Technology (Shenzhen) Co., Ltd. | Video annotation method and device, electronic equipment and computer-readable storage medium
CN110991296B (en) * | 2019-11-26 | 2023-04-07 | Tencent Technology (Shenzhen) Co., Ltd. | Video annotation method and device, electronic equipment and computer-readable storage medium
CN111582059A (en) * | 2020-04-20 | 2020-08-25 | Harbin Engineering University | Facial expression recognition method based on a variational autoencoder
CN111582059B (en) * | 2020-04-20 | 2022-07-15 | Harbin Engineering University | Facial expression recognition method based on a variational autoencoder
CN112560635A (en) * | 2020-12-10 | 2021-03-26 | Shenzhen Intellifusion Technologies Co., Ltd. | Face matching acceleration method and device, electronic equipment and storage medium
CN112560635B (en) * | 2020-12-10 | 2024-03-26 | Shenzhen Intellifusion Technologies Co., Ltd. | Face matching acceleration method and device, electronic equipment and storage medium
CN113409157B (en) * | 2021-05-19 | 2022-06-28 | Guilin University of Electronic Technology | Cross-social-network user alignment method and device
CN113409157A (en) * | 2021-05-19 | 2021-09-17 | Guilin University of Electronic Technology | Cross-social-network user alignment method and device
CN113449707B (en) * | 2021-08-31 | 2021-11-30 | Hangzhou Modian Technology Co., Ltd. | Liveness detection method, electronic device, and storage medium
CN113449707A (en) * | 2021-08-31 | 2021-09-28 | Hangzhou Modian Technology Co., Ltd. | Liveness detection method, electronic device, and storage medium
CN113963428A (en) * | 2021-12-23 | 2022-01-21 | Beijing Dilusense Technology Co., Ltd. | Model training method, occlusion detection method, system, electronic device, and medium
CN118230275A (en) * | 2024-05-24 | 2024-06-21 | GAC Aion New Energy Automobile Co., Ltd. | Target object recognition method and device

Also Published As

Publication number | Publication date
CN106372581B (en) | 2020-09-04

Similar Documents

Publication | Title
CN106372581B (en) | Method for constructing and training face recognition feature extraction network
CN107368831B (en) | English word and digit recognition method in natural scene images
CN105912990B (en) | Face detection method and device
CN106096538B (en) | Face recognition method and device based on a ranking neural network model
CN115050064B (en) | Face liveness detection method, device, equipment and medium
CN109522945B (en) | Group emotion recognition method, device, intelligent device and storage medium
Vageeswaran et al. | Blur and illumination robust face recognition via set-theoretic characterization
CN113205002B (en) | Low-resolution face recognition method, device, equipment and medium for unconstrained video surveillance
CN111414862A (en) | Expression recognition method based on a neural network fusing key-point angle changes
CN110414350A (en) | Face anti-spoofing detection method based on an attention-model two-stream convolutional neural network
CN109344856B (en) | Offline signature recognition method based on multilayer discriminative feature learning
CN106339719A (en) | Image recognition method and image recognition device
CN107545243A (en) | Face recognition method for the yellow race based on a deep convolution model
CN109360179A (en) | Image fusion method, device and readable storage medium
CN116311387B (en) | Cross-modal person re-identification method based on feature intersection
CN108985200A (en) | Non-cooperative liveness detection algorithm based on a terminal device
CN104978569B (en) | Incremental face recognition method based on sparse representation
CN110880010A (en) | Visual SLAM loop closure detection algorithm based on a convolutional neural network
CN107491729A (en) | Handwritten digit recognition method using convolutional neural networks with cosine-similarity activation
CN116385832A (en) | Bimodal biometric recognition network model training method
CN111783688A (en) | Remote sensing image scene classification method based on a convolutional neural network
CN103745242A (en) | Cross-device biometric recognition method
US20250278917A1 | Image processing method and apparatus, computer device, and storage medium
CN107122725A (en) | Face recognition method and system based on joint sparse discriminant analysis
CN111639537A (en) | Face action unit recognition method and device, electronic equipment and storage medium

Legal Events

Code | Title | Description
C06 | Publication
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2020-09-04; Termination date: 2021-08-25
