CN111967331A - Face representation attack detection method and system based on fusion feature and dictionary learning - Google Patents

Face representation attack detection method and system based on fusion feature and dictionary learning

Info

Publication number
CN111967331A
CN111967331A
Authority
CN
China
Prior art keywords
dictionary
features
fusion
face
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010696193.4A
Other languages
Chinese (zh)
Other versions
CN111967331B (en)
Inventor
傅予力
黄汉业
向友君
许晓燕
吕玲玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202010696193.4A
Publication of CN111967331A
Application granted
Publication of CN111967331B
Legal status: Active (anticipated expiration not listed)


Abstract

Translated from Chinese

The invention discloses a face representation attack detection method and system based on fusion features and dictionary learning. The method comprises the steps of: extracting image quality features of the complete face image according to the distortion sources of secondary imaging of the face image; constructing a deep convolutional network model and extracting deep network features of face image blocks through the deep convolutional network; cascading the two kinds of features and generating the final fusion features through PCA; initializing dictionary atoms with the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary; and judging the category of a test sample according to the size of the fusion feature reconstruction residual. The invention combines image quality features and deep network features for face representation attack detection for the first time, better utilizes the information provided by a single image frame, and effectively enhances the discriminative power of the extracted features; it strips out the patterns common to genuine and fake samples through a low-rank shared dictionary for the first time, which not only improves the accuracy of attack detection but also generalizes well.


Description

Translated from Chinese
Face Representation Attack Detection Method and System Based on Fusion Features and Dictionary Learning

Technical Field

The invention relates to the technical field of image processing, and in particular to a face representation attack detection method and system based on fusion features and dictionary learning.

Background Art

Today, face recognition technology is widely used in security, payment, entertainment and other scenarios. However, face recognition systems carry certain security risks. With the development of social networks and the popularity of smartphones, more and more people share personal photos and videos online, and criminals can use these media to impersonate other people or deliberately obscure their own identity in order to attack a face recognition system, infringing on the property of others or evading legal sanctions. An attempt to pass a face recognition system by borrowing a legitimate user's identity through that user's photos, videos or similar means is called a face representation attack, and the techniques for detecting such attacks are called face liveness detection.

In face liveness detection, face images fall into two categories. The first consists of images obtained by directly photographing the legitimate user. The second is captured from objects that closely resemble the legitimate user's face, such as the user's photos, videos and wax figures. Images of this second kind are called face representation attack images (attack faces for short) and are the objects that liveness detection technology aims to detect.

The core of a face liveness detection algorithm is to extract the features of a face image that are most discriminative for liveness detection. Traditional detection techniques rely on hand-crafted features such as LBP (Local Binary Patterns) and LPQ (Local Phase Quantization); as the imaging quality of devices keeps improving, manually designing features capable of detecting attack faces has become very difficult. In recent years, automatic feature extraction with convolutional neural networks has become mainstream. Deep convolutional neural networks perform well on image classification tasks, but, limited by the scale of liveness detection datasets, deep networks supervised only by category labels tend to memorize arbitrary features present in the training set, which easily leads to overfitting and poor generalization.

SUMMARY OF THE INVENTION

In order to overcome the defects and deficiencies of the prior art, the present invention provides a face representation attack detection method based on fusion features and dictionary learning. By fusing hand-crafted image quality features with deep network features, it makes full use of the information provided by a single image frame and effectively enhances the discriminative power of the features; it further adopts a dictionary learning method based on a low-rank shared dictionary to classify genuine and fake samples, where the shared dictionary strips out the patterns that genuine and fake samples have in common, thereby improving the accuracy of attack detection.

In order to achieve the above object, the present invention adopts the following technical solutions:

The present invention provides a face representation attack detection method based on fusion features and dictionary learning, comprising the following steps:

performing face detection and cropping on an input video to construct a face image database;

extracting the fusion features of the face images in the face image database, the fusion features comprising image quality features and deep network features;

extracting the image quality features of the complete face image according to the distortion sources of secondary imaging of the face image;

constructing a deep convolutional network model, and extracting the deep network features of face image blocks through the deep convolutional network;

normalizing the image quality features and the deep network features separately and cascading them, and reducing the dimensionality of the cascaded features through PCA to generate the final fusion features;

initializing dictionary atoms based on the fusion features, and training a dictionary learning classifier based on a low-rank shared dictionary;

judging the category of a test sample based on the size of the fusion feature reconstruction residual.

As a preferred technical solution, extracting the image quality features of the complete face image according to the distortion sources of secondary imaging of the face image specifically comprises the steps of extracting specular reflection features, extracting blur features, extracting color moment features and extracting color diversity features, and cascading the extracted features to obtain the image quality features.

As a preferred technical solution, extracting the deep network features of the face image blocks through the deep convolutional network specifically comprises:

generating the face image blocks by randomly scaling and randomly cropping the complete face image; constructing a lightweight deep convolutional network model with the face image blocks as the input of the convolutional network model; training the convolutional network model with the Focal Loss function to extract the deep network features of the face image blocks; and converting one-hot encoded labels into soft labels with a label smoothing method to optimize the training process of the deep convolutional neural network.

As a preferred technical solution, initializing dictionary atoms based on the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary specifically comprises: minimizing the cost function of the dictionary model by alternately optimizing the dictionary and the sparse coefficients, and saving the dictionary after a set number of optimization iterations.

As a preferred technical solution, the cost function of the dictionary model is expressed as:

$$J(D,X)=\sum_{c=1}^{2} r\left(Y_c,D,X_c\right)+\lambda_1\|X\|_1+\lambda_2 f(X)+\eta\|D_0\|_*$$

where the first term is the discriminative fidelity term, the second term is the discriminative coefficient term based on the Fisher criterion, the third term is the L1 regularization term, and the fourth term is the nuclear norm. The discriminative fidelity term realizes the discriminative power of the dictionary; the discriminative coefficient term increases intra-class similarity and reduces inter-class similarity; the L1 regularization term makes the coefficients $X$ sparse; the nuclear norm constrains the size of the subspace spanned by the shared dictionary, guaranteeing the low rank of the shared dictionary; and $\lambda_1$, $\lambda_2$ and $\eta$ weigh the relative contributions of the terms of the cost function;

The discriminative fidelity term is defined as:

$$r(Y_c,D,X_c)=\|Y_c-DX_c\|_F^2+\|Y_c-D_cX_c^c-D_0X_c^0\|_F^2+\sum_{i=1,\,i\neq c}^{2}\|D_iX_c^i\|_F^2$$

where $Y_c\in\mathbb{R}^{m\times n_c}$ denotes the samples of class $c$ (the samples are fusion features), $m$ denotes the dimension of the fusion features, $n_c$ denotes the number of class-$c$ samples, $D$ denotes the total dictionary, $D_c$ denotes the sub-dictionary of class $c$, and $X_c^i$ denotes the coefficients of the class-$c$ samples on the class-$i$ dictionary;

The discriminative coefficient term is defined as:

$$f(X)=\sum_{c=1}^{2}\left(\|X_c-M_c\|_F^2-\|M_c-M\|_F^2\right)+\|X\|_F^2+\|X^0-M^0\|_F^2$$

where $M_c$ denotes the mean of the sparse coefficients of the class-$c$ samples, $M$ denotes the mean of the sparse coefficients over the entire training set, and $M^0$ denotes the mean of the coefficients on the shared dictionary; the term $\|X^0-M^0\|_F^2$ forces the coefficients of all training samples on the shared dictionary to stay close to their mean.

As a preferred technical solution, the method further comprises a step of solving the sparse coefficients of a test sample, specifically: constructing two category dictionaries with the shared dictionary attached from the saved dictionary, and solving the sparse coefficients of the test sample with the category dictionaries fixed.

As a preferred technical solution, judging the category of the test sample based on the size of the fusion feature reconstruction residual specifically comprises:

solving the sparse coefficients of the test sample based on elastic net regularization, reconstructing the fusion features of the test sample from the sparse coefficients, and taking the category with the smallest reconstruction residual as the predicted category of the test sample.

The present invention provides a face representation attack detection system based on fusion features and dictionary learning, comprising: a face image database construction module, a preliminary fusion feature extraction module, a final fusion feature generation module, a dictionary learning classifier training module and a test sample category judgment module;

The preliminary fusion feature extraction module comprises an image quality feature extraction module and a deep network feature extraction module;

The face image database construction module is used for performing face detection and cropping on an input video to construct a face image database;

The preliminary fusion feature extraction module is used for extracting the fusion features of the face images in the face image database, the fusion features comprising image quality features and deep network features;

The image quality feature extraction module is used for extracting the image quality features of the complete face image according to the distortion sources of secondary imaging of the face image;

The deep network feature extraction module is used for constructing a deep convolutional network model and extracting the deep network features of face image blocks through the deep convolutional network;

The final fusion feature generation module is used for normalizing the image quality features and the deep network features separately, cascading them, and reducing the dimensionality of the cascaded features through PCA to generate the final fusion features;

The dictionary learning classifier training module is used for initializing dictionary atoms based on the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary;

The test sample category judgment module is used for judging the category of the test sample based on the size of the fusion feature reconstruction residual.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

(1) By fusing hand-crafted image quality features with deep network features, the present invention makes full use of the information provided by a single image frame and enhances the discriminative power of the features.

(2) The present invention uses a low-rank shared dictionary to strip out what genuine and fake samples have in common, ensuring that the category dictionaries better represent the differences between genuine samples and attack samples; it avoids the overfitting that fully connected layers are prone to, generalizes well, and further improves the accuracy of attack detection.

(3) The present invention replaces traditional L1 regularization with elastic net regularization, solving the problem that a model using L1 regularization alone tends to ignore certain features; this helps preserve fine-grained features and enhances the discriminative power of the sparse coefficients.

(4) The present invention feeds randomly cropped and randomly generated image blocks into the convolutional neural network, making the network focus on learning to extract the effective information related to spoofing patterns; this expands the dataset in an effective way and alleviates the performance degradation caused by small data scale.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of the face representation attack detection method based on fusion features and dictionary learning according to this embodiment.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

Embodiment

As shown in FIG. 1, this embodiment provides a face representation attack detection method based on fusion features and dictionary learning, comprising the following steps:

S1: Perform face detection and cropping on the input video to construct a face image database;

This embodiment uses the published face representation attack video datasets REPLAY-ATTACK, CASIA-FASD and MSU-MFSD. All three datasets include genuine face videos and attack face videos, and all provide a training/test split. The first 30 frames of each video are extracted, a cascade classifier based on Haar features is used to locate the face in each frame, and the face image is cropped out;
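As a non-limiting illustration, this step could be sketched in Python with OpenCV roughly as follows; the dataset layout and the use of OpenCV's bundled frontal-face Haar model are assumptions, not part of the patent text:

```python
# Sketch of S1: crop faces from the first 30 frames of a video
# with a Haar-feature cascade classifier (assumed OpenCV model).
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(video_path, max_frames=30):
    """Return cropped face images from the first `max_frames` frames."""
    cap = cv2.VideoCapture(video_path)
    faces = []
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in boxes[:1]:          # keep the first detection per frame
            faces.append(frame[y:y + h, x:x + w])
    cap.release()
    return faces
```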

S2: Extract the fusion features of the face images in the face image database. The fusion features include image quality features and deep network features. The specific steps are as follows:

S21) Extract the image quality features of the complete face image according to the distortion sources of secondary imaging of the face image.

Based on the face image database, sub-features of each face image are extracted in terms of blur, specular reflection, color distortion and so on; the final image quality feature vector of a sample is the concatenation of all the sub-feature vectors. The specific steps are as follows:

Under the same imaging conditions, a genuine access face is a first-generation image, while a face representation attack is a second-generation (recaptured) image. Analyzing the sources of distortion introduced during recapture helps enhance the discriminative power of the extracted features. This embodiment extracts image quality features from four aspects: specular reflection, blur, color moment distortion and color diversity distortion;

Extracting specular reflection features: iteratively replace the chromaticity at highlight positions of the input face image with the maximum diffuse chromaticity of neighboring pixels, then extract the specular reflection component of the image, and use the percentage, mean and variance of this component to form the specular reflection features;

Extracting blur features: a re-blurring based method is used to extract the blur features of the image. The input image is converted to a grayscale image and low-pass filtered with a Gaussian filter whose kernel size is 3×3; the filtered image is called the blurred image. Image sharpness is measured by comparing the variation between adjacent pixels: the absolute difference images in the horizontal and vertical directions are computed, and the sums of the gray values over all pixels of the absolute difference images, called the horizontal and vertical difference sums, are calculated; the ratios of the horizontal and vertical difference sums of the image before and after filtering constitute the blur features.
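As a non-limiting illustration, the re-blurring computation could be sketched as follows; the epsilon guard against division by zero is an implementation assumption the patent does not mention:

```python
# Sketch of the re-blurring blur feature from S21: ratios of the
# horizontal/vertical absolute-difference sums after vs. before blurring.
import cv2
import numpy as np

def blur_features(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY).astype(np.float64)
    blurred = cv2.GaussianBlur(gray, (3, 3), 0)   # 3x3 Gaussian low-pass filter

    def diff_sums(img):
        d_h = np.abs(np.diff(img, axis=1)).sum()  # horizontal difference sum
        d_v = np.abs(np.diff(img, axis=0)).sum()  # vertical difference sum
        return d_h, d_v

    h0, v0 = diff_sums(gray)
    h1, v1 = diff_sums(blurred)
    # Sharp images lose more high-frequency energy when re-blurred,
    # so these ratios are smaller for sharp inputs than for blurry ones.
    return np.array([h1 / max(h0, 1e-8), v1 / max(v0, 1e-8)])
```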

Extracting color moment features: first convert the input face image from RGB space to HSV space, whose channels are relatively independent; then compute the mean, variance and skewness of each channel, as well as the percentage of pixels falling in the minimum and maximum histogram bins of each channel; these 5 values per channel form the color moment features.
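A corresponding sketch of the color moment computation; the histogram bin count of 32 is an assumption, as the patent does not state it:

```python
# Sketch of the HSV color moment features from S21: per channel the mean,
# variance, skewness, and the pixel fractions in the lowest/highest bins.
import cv2
import numpy as np
from scipy.stats import skew

def color_moment_features(bgr_image, bins=32):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    feats = []
    for c in range(3):
        chan = hsv[:, :, c].ravel().astype(np.float64)
        hist, _ = np.histogram(chan, bins=bins)
        pct = hist / chan.size
        feats += [chan.mean(), chan.var(), skew(chan), pct[0], pct[-1]]
    return np.array(feats)   # 5 values per channel, 15 in total
```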

Extracting color diversity features: quantize the colors of the R, G and B channels of the input image, then use the histogram bin counts of the 100 most frequently occurring colors, together with the number of distinct colors appearing in the face image, to form the color diversity features.

The four kinds of features extracted above are cascaded into what is called the image quality features; the dimension of the image quality features is 121.

S22) Construct a deep convolutional network model and extract the deep network features of the face image blocks through the deep convolutional network, as described below:

The face images obtained in S1 are resized to 112×112, and random image blocks are generated by random scaling and random cropping, with the block size set to 48×48. The face image blocks serve as the input of the convolutional neural network, which is trained to extract features. Using local image blocks enlarges the training set and lets the convolutional network focus on learning to extract the effective information related to spoofing patterns, while preserving the resolution of the original input and preventing the loss of discriminative information;
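One way to realize this block generation with torchvision; the scale range passed to RandomResizedCrop is an assumption, since the patent does not specify it:

```python
# Sketch of the S22 patch pipeline: resize the face to 112x112,
# then take a randomly scaled, randomly positioned 48x48 crop.
from torchvision import transforms

patch_transform = transforms.Compose([
    transforms.Resize((112, 112)),                        # full face at 112x112
    transforms.RandomResizedCrop(48, scale=(0.1, 1.0)),   # random scale + crop
    transforms.ToTensor(),
])
```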

Since existing public datasets are small, a convolutional network model of modest complexity can be used. This embodiment uses a ResNet18 model pre-trained on the ImageNet dataset. The kernel size of the first convolutional layer of ResNet18 is reduced to 3×3 and its stride to 1. The layer after the last convolutional layer is a global pooling layer, which averages each feature map output by the convolutional layers and concatenates the averages into a one-dimensional vector; the layer after the global pooling layer is a fully connected layer, which takes the one-dimensional vector output by the global pooling layer as input and whose output dimension equals the number of categories. In this embodiment the number of categories is set to 2, corresponding to genuine faces and attack faces respectively.
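A sketch of these network changes in PyTorch; note that replacing the first convolution discards its pretrained weights, an implementation detail the patent does not address:

```python
# Sketch of the S22 model: ImageNet-pretrained ResNet18 with a 3x3,
# stride-1 first convolution and a 2-way classifier head.
import torch.nn as nn
from torchvision import models

def build_model():
    net = models.resnet18(pretrained=True)
    # 3x3 kernel, stride 1 (instead of 7x7, stride 2) suits 48x48 patches.
    net.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    # 2 outputs: genuine face vs. attack face.
    net.fc = nn.Linear(net.fc.in_features, 2)
    return net
```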

The loss function used when training the convolutional network is the Focal Loss function. The formula of the Focal Loss function is as follows:

$$\mathrm{FL}(p_t)=-(1-p_t)^{\gamma}\log(p_t)$$

where

$$p_t=\begin{cases}p, & y=1\\ 1-p, & \text{otherwise}\end{cases}$$

$p$ is the probability of a genuine face sample output by the network, and $y$ is the true label of the input image (the label of a genuine face is 1 and the label of an attack face is 0). $\gamma$ is called the focusing parameter and takes a value greater than 0. The modulation factor $(1-p_t)^{\gamma}$ folds the model's prediction score into the loss function, so that the model adapts to the difficulty of each sample; in this embodiment $\gamma$ is set to 2. Typically the number of attack face videos is several times that of genuine face videos, and replacing the conventional cross-entropy loss with the Focal Loss addresses the data imbalance that is common in these datasets.
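This loss maps directly onto a few lines of PyTorch; the epsilon inside the logarithm is a numerical-stability assumption:

```python
# Sketch of the binary Focal Loss from S22:
# FL(p_t) = -(1 - p_t)^gamma * log(p_t), with gamma = 2.
import torch

def focal_loss(p, y, gamma=2.0, eps=1e-8):
    """p: predicted genuine-class probability; y: labels (1 genuine, 0 attack)."""
    p_t = torch.where(y == 1, p, 1.0 - p)
    return (-(1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()
```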

During training of the convolutional neural network, label smoothing is applied, i.e. the traditional one-hot encoded labels are converted into soft labels, as follows:

$$y_{ls}=(1-\alpha)\,y_{oh}+\frac{\alpha}{K}$$

where $y_{oh}$ denotes the conventional one-hot encoded label and $y_{ls}$ denotes the soft label after smoothing. Label smoothing scales the label value of the correct category by $(1-\alpha)$, and the entries that were originally 0 become $\alpha/K$, where $K$ denotes the number of categories and $\alpha\in[0,1]$; in this embodiment $\alpha$ is set to 0.1. By moderately reducing the value of the correct label, label smoothing encourages the model to choose the correct category without becoming overconfident. For face representation attack detection, positive and negative samples are very similar in the image domain, so hard labels in the early phase of training make the network fit too quickly; introducing label smoothing further improves the generalization ability of the convolutional network model.
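The conversion is a one-line tensor operation, sketched here for completeness:

```python
# Sketch of the label smoothing from S22: y_ls = (1 - alpha) * y_oh + alpha / K.
import torch

def smooth_labels(y_onehot, alpha=0.1):
    k = y_onehot.size(-1)                       # number of categories (K = 2 here)
    return (1.0 - alpha) * y_onehot + alpha / k
```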

The optimization method used for training the neural network is stochastic gradient descent; the initial learning rate and weight decay are set to 0.001 and 0.00001 respectively; the learning rate scheduler is cosine annealing with warm restarts, with the minimum learning rate set to 0.00004 and the cosine period set to 5 epochs, for a total of 30 epochs. After training, the final global average pooling layer and fully connected layer are removed, and the preceding convolutional blocks are used to extract the deep network features of the face image blocks; the dimension of the deep network features is 512;
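These settings map onto PyTorch roughly as follows; the momentum value is an assumption, as the patent does not state one:

```python
# Sketch of the S22 training configuration: SGD with the stated learning
# rate/weight decay and cosine annealing with warm restarts (period 5,
# floor 0.00004) over 30 epochs.
import torch
from torchvision import models

net = models.resnet18(num_classes=2)   # stands in for the modified network above
optimizer = torch.optim.SGD(net.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.00001)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=5, eta_min=0.00004)

for epoch in range(30):
    # ... one training pass over the 48x48 patches with the focal loss ...
    scheduler.step()
```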

S23) Normalize the image quality features and the deep network features separately and cascade them; reduce the dimensionality of the cascaded features through PCA to generate the final fusion features, as described below:

The image quality features and the deep network features are extracted for every face image in the face image database, the mean and variance of the two groups of features are computed, and the two groups of features are standardized. For each face image, the standardized image quality features and deep network features are directly cascaded, giving a cascaded length of 633;

PCA (principal component analysis) is applied to reduce the dimensionality of the cascaded features; the reduced features are called the fusion features. To determine a relatively good number of PCA principal components, this embodiment first runs an experiment with a larger number of components to locate a cut-off point: the PCA dimension is first set to 400, the principal components are sorted by variance in descending order, the cumulative variance is computed, and the PCA dimension is then re-determined from the ratio of the cumulative variance to the total variance. This embodiment sets the PCA dimension to 256;
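A sketch of this fusion step with scikit-learn; the transforms are fitted on the training set and reused unchanged for test samples:

```python
# Sketch of S23: standardize the 121-D image quality features and the 512-D
# deep features separately, concatenate to 633-D, then reduce to 256-D by PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_fusion(iq_train, deep_train, n_components=256):
    s_iq = StandardScaler().fit(iq_train)
    s_deep = StandardScaler().fit(deep_train)
    fused = np.hstack([s_iq.transform(iq_train), s_deep.transform(deep_train)])
    pca = PCA(n_components=n_components).fit(fused)
    return s_iq, s_deep, pca   # apply the same three transforms to test samples
```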

S3: Initialize the dictionary atoms with the fusion features of the training samples, and train a dictionary learning classifier based on a low-rank shared dictionary;

This embodiment adopts a dictionary learning method based on a low-rank shared dictionary. The total dictionary is set as $D=[D_1,D_2,D_0]\in\mathbb{R}^{m\times n}$, where $m$ denotes the dimension of the fusion features and $n$ the size of the dictionary. The category dictionaries $D_1$ and $D_2$ correspond to genuine faces and attack faces respectively, and the size of each category dictionary is set to 125. The size of the shared dictionary $D_0$ is set to 20. Fusion features are extracted from the training set images and used to initialize the dictionary atoms: the two category dictionaries draw random samples from the corresponding categories, the shared dictionary draws random samples from the entire training set, and all dictionary atoms are L2-normalized;
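A sketch of this initialization in NumPy; the label convention (1 for genuine, 0 for attack) is an assumption consistent with the Focal Loss convention above:

```python
# Sketch of the S3 dictionary initialization: 125 atoms per class drawn from
# that class, 20 shared atoms drawn from the whole set, all L2-normalized.
import numpy as np

def init_dictionary(feats, labels, n_class_atoms=125, n_shared_atoms=20, seed=0):
    rng = np.random.default_rng(seed)

    def draw(pool, k):
        atoms = pool[rng.choice(len(pool), size=k, replace=False)]
        return atoms / np.linalg.norm(atoms, axis=1, keepdims=True)

    d1 = draw(feats[labels == 1], n_class_atoms)   # genuine faces
    d2 = draw(feats[labels == 0], n_class_atoms)   # attack faces
    d0 = draw(feats, n_shared_atoms)               # shared dictionary
    return np.vstack([d1, d2, d0]).T               # columns are atoms, D in R^{m x n}
```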

The cost function $J$ of the dictionary model is minimized by iteratively optimizing the dictionary $D$ and the coefficients $X$; in this embodiment the number of iterations is set to 25. The cost function $J$ of the dictionary model is defined as follows:

$$J(D,X)=\sum_{c=1}^{2} r\left(Y_c,D,X_c\right)+\lambda_1\|X\|_1+\lambda_2 f(X)+\eta\|D_0\|_*$$

where the first term is the discriminative fidelity term, the second term is the discriminative coefficient term based on the Fisher criterion, the third term is the L1 regularization term, and the fourth term is the nuclear norm. The role of the discriminative fidelity term is to realize the discriminative power of the dictionary; the role of the discriminative coefficient term is to increase intra-class similarity and reduce inter-class similarity; the role of the L1 regularization term is to make the coefficients $X$ sparse; and the role of the nuclear norm is to constrain the size of the subspace spanned by the shared dictionary, guaranteeing its low rank. $\lambda_1$, $\lambda_2$ and $\eta$ weigh the relative contributions of the terms of the cost function; in this embodiment, $\lambda_1$ is set to 0.1, $\lambda_2$ to 0.01 and $\eta$ to 0.0001;

Specifically, the discriminative fidelity term is defined as follows:

$$r(Y_c,D,X_c)=\|Y_c-DX_c\|_F^2+\|Y_c-D_cX_c^c-D_0X_c^0\|_F^2+\sum_{i=1,\,i\neq c}^{2}\|D_iX_c^i\|_F^2$$

where $Y_c\in\mathbb{R}^{m\times n_c}$ denotes the samples of class $c$ (the samples are fusion features), $m$ denotes the dimension of the fusion features, $n_c$ denotes the number of class-$c$ samples, $c$ takes the value 1 or 2, $D$ denotes the total dictionary, $D_c$ denotes the sub-dictionary of class $c$, and $X_c^i$ denotes the coefficients of the class-$c$ samples on the class-$i$ dictionary, with $i$ taking the value 1 or 2;

Specifically, the discriminative coefficient term is defined as follows:

$$f(X)=\sum_{c=1}^{2}\left(\|X_c-M_c\|_F^2-\|M_c-M\|_F^2\right)+\|X\|_F^2+\|X^0-M^0\|_F^2$$

where $M_c$ denotes the mean of the sparse coefficients of the class-$c$ samples, $M$ denotes the mean of the sparse coefficients over the entire training set, and $M^0$ denotes the mean of the coefficients on the shared dictionary; the term $\|X^0-M^0\|_F^2$ forces the coefficients of all training samples on the shared dictionary to stay close to their mean, which prevents the shared dictionary from contributing very differently to samples of different categories and hurting classification performance;

The cost function of the dictionary model is minimized by alternately optimizing the dictionary and the sparse coefficients; after the set number of iterations the dictionary is saved. Two category dictionaries with the shared dictionary attached are then constructed from the saved dictionary, and the sparse coefficients of the test samples are solved with the category dictionaries fixed.
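A deliberately simplified sketch of the alternating scheme: it keeps only the reconstruction, L1 and nuclear-norm pieces of the cost (the Fisher term and the class-wise fidelity terms are omitted for brevity, and sklearn's Lasso scaling differs from $\lambda_1$), with a Lasso solve for the X step, a MOD-style least-squares update for the D step, and singular value thresholding on the shared block for the low-rank constraint:

```python
# Highly simplified sketch of the S3 alternating optimization,
# under the assumptions stated in the text above.
import numpy as np
from sklearn.linear_model import Lasso

def svt(mat, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ np.diag(np.maximum(s - tau, 0.0)) @ vt

def train_dictionary(Y, D, n_shared=20, lam1=0.1, eta=0.0001, iters=25):
    for _ in range(iters):
        coder = Lasso(alpha=lam1, fit_intercept=False, max_iter=2000)
        X = np.column_stack([coder.fit(D, y).coef_ for y in Y.T])          # X step
        D = Y @ X.T @ np.linalg.pinv(X @ X.T + 1e-8 * np.eye(X.shape[0]))  # D step (MOD)
        D[:, -n_shared:] = svt(D[:, -n_shared:], eta)   # low-rank shared block D0
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-8)
    return D
```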

S4: Judge the category of the test sample based on the size of the fusion feature reconstruction residual.

The sparse coefficients of the test sample are solved based on elastic net regularization, the fusion features of the test sample are reconstructed from the sparse coefficients, and the category with the smallest reconstruction residual is taken as the predicted category of the test sample.

Two sub-dictionaries $\hat{D}_1=[D_1,D_0]$ and $\hat{D}_2=[D_2,D_0]$ are constructed from the dictionary $D$ obtained in this embodiment, i.e. the dictionary saved in step S3 yields two category dictionaries with the shared dictionary attached. When solving the sparse coefficients of a test sample $y$, this embodiment adopts elastic net regularization, and the optimization problem of the model is as follows:

$$\hat{x}=\arg\min_{x}\;\frac{1}{2}\|y-\hat{D}_c x\|_2^2+\lambda_a\|x\|_1+\lambda_b\|x\|_2^2$$

where $\hat{D}_c$ denotes a category dictionary with the shared dictionary attached and $x$ denotes the sparse coefficients corresponding to the test sample $y$. The second term is the L1 regularization term and the third term is the L2 regularization term; $\lambda_a$ and $\lambda_b$ weigh the relative contributions of the L1 and L2 regularization terms. In this embodiment, $\lambda_a$ is set to 0.01 and $\lambda_b$ is set to 0.01. Compared with L1 regularization, L2 regularization tends to make the solution $x$ smoother; linearly combining the two therefore produces improved sparse codes.

After the sparse coefficients of the test sample $y$ are obtained, $y$ is reconstructed from the coefficients corresponding to the sub-dictionary of each class. The class with the smallest reconstruction residual is taken as the predicted class, as shown in the following formula:

$$\mathrm{class}(y)=\arg\min_{c\in\{1,2\}}\|y-\hat{D}_c\hat{x}_c\|_2$$
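A sketch of this decision rule with scikit-learn's ElasticNet; its (alpha, l1_ratio) parameterization differs from the patent's $(\lambda_a,\lambda_b)$, so the values below are only indicative:

```python
# Sketch of S4: elastic-net code the test feature y on each category
# dictionary [Dc, D0] and predict the class with the smaller residual.
import numpy as np
from sklearn.linear_model import ElasticNet

def predict(y, d1, d2, d0):
    residuals = []
    for dc in (d1, d2):
        dic = np.hstack([dc, d0])        # category dictionary with shared part
        en = ElasticNet(alpha=0.01, l1_ratio=0.5,
                        fit_intercept=False, max_iter=5000)
        x = en.fit(dic, y).coef_
        residuals.append(np.linalg.norm(y - dic @ x))   # reconstruction residual
    return 1 if residuals[0] < residuals[1] else 0      # 1 = genuine, 0 = attack
```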

As shown in Table 1 below, this embodiment is compared with single features on the three datasets REPLAY-ATTACK, CASIA-FASD and MSU-MFSD; the evaluation metric is HTER (Half Total Error Rate).

Table 1. Performance comparison of different features on three public datasets

Features | REPLAY-ATTACK | CASIA-FASD | MSU-MFSD
Image quality features | 12.85% | 13.99% | 13.71%
Deep network features | 2.37% | 4.81% | 11.13%
Fusion features | 1.92% | 4.41% | 9.39%

Table 1 shows that deep network features cannot automatically capture all the discriminative factors present in hand-crafted features; by fusing image quality features with deep network features, the method of the present invention further exploits the image information and effectively enhances the discriminative power of the features.

As shown in Table 2 below, this embodiment is compared with other methods in the cross-dataset scenario between CASIA-FASD and REPLAY-ATTACK; the evaluation metric is HTER.

Table 2. Performance comparison of different features in cross-dataset scenarios

(Table 2 is provided as an image in the original publication.)

Table 2 shows that, compared with hand-crafted methods such as LBP and single CNN methods, the method of the present invention generalizes better in cross-dataset scenarios.

This embodiment also provides a face representation attack detection system based on fusion features and dictionary learning, comprising: a face image database construction module, a preliminary fusion feature extraction module, a final fusion feature generation module, a dictionary learning classifier training module and a test sample category judgment module;

In this embodiment, the preliminary fusion feature extraction module comprises an image quality feature extraction module and a deep network feature extraction module;

In this embodiment, the face image database construction module is used for performing face detection and cropping on an input video to construct a face image database;

In this embodiment, the preliminary fusion feature extraction module is used for extracting the fusion features of the face images in the face image database, the fusion features comprising image quality features and deep network features;

In this embodiment, the image quality feature extraction module is used for extracting the image quality features of the complete face image according to the distortion sources of secondary imaging of the face image;

In this embodiment, the deep network feature extraction module is used for constructing a deep convolutional network model and extracting the deep network features of face image blocks through the deep convolutional network;

In this embodiment, the final fusion feature generation module is used for normalizing the image quality features and the deep network features separately, cascading them, and reducing the dimensionality of the cascaded features through PCA to generate the final fusion features;

In this embodiment, the dictionary learning classifier training module is used for initializing dictionary atoms based on the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary;

In this embodiment, the test sample category judgment module is used for judging the category of the test sample based on the size of the fusion feature reconstruction residual.

From the description of the above technical solutions, it can be seen that the present invention makes full use of the information provided by a single image frame by combining hand-crafted image quality features with deep network features, enhancing the discriminative power of the features. Targeting the characteristics of face representation attack datasets, the invention optimizes the structure and training procedure of the convolutional neural network, addressing data imbalance with the Focal Loss function and further improving the generalization ability of the deep network with label smoothing. In addition, a low-rank shared dictionary is introduced to strip out what genuine and fake samples have in common, and elastic net regularization improves the sparse coding of test samples, further raising the accuracy of the dictionary learning classifier. The method of the invention generalizes well and is suitable for two-dimensional face representation attack detection in real scenarios.

The above embodiment is a preferred implementation of the present invention, but the implementation of the present invention is not limited to the above embodiment; any other change, modification, substitution, combination or simplification made without departing from the spirit and principles of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (8)

1. A face representation attack detection method based on fusion features and dictionary learning, characterized by comprising the following steps:
performing face detection and cropping on an input video to construct a face image database;
extracting fusion features of the face images in the face image database, wherein the fusion features comprise image quality features and deep network features;
extracting the image quality features of the complete face image according to a distortion source of secondary imaging of the face image;
constructing a deep convolutional network model, and extracting the deep network features of face image blocks through the deep convolutional network;
normalizing the image quality features and the deep network features separately and cascading them, and reducing the dimension of the cascaded features through PCA to generate final fusion features;
initializing dictionary atoms based on the fusion features, and training a dictionary learning classifier based on a low-rank shared dictionary;
and judging the category of a test sample based on the size of the fusion feature reconstruction residual.
2. The face representation attack detection method based on fusion features and dictionary learning according to claim 1, characterized in that extracting the image quality features of the complete face image according to the distortion source of secondary imaging of the face image specifically comprises: extracting specular reflection features, blur features, color moment features and color diversity features, and cascading the extracted features to obtain the image quality features.
3. The face representation attack detection method based on fusion features and dictionary learning according to claim 1, characterized in that extracting the deep network features of the face image blocks through the deep convolutional network specifically comprises:
generating the face image blocks by randomly scaling and randomly cropping the complete face image, constructing a lightweight deep convolutional network model, taking the face image blocks as the input of the convolutional network model, training the convolutional network model with the Focal Loss function to extract the deep network features of the face image blocks, converting one-hot encoded labels into soft labels with a label smoothing method, and optimizing the training process of the deep convolutional neural network.
4. The face representation attack detection method based on fusion features and dictionary learning according to claim 1, characterized in that initializing dictionary atoms based on the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary specifically comprises: minimizing the cost function of the dictionary model by alternately optimizing the dictionary and the sparse coefficients, and saving the dictionary after the set number of iterative optimizations.
5. The face representation attack detection method based on fusion features and dictionary learning according to claim 4, wherein the cost function of the dictionary model is expressed as:
$$J(D,X)=\sum_{c=1}^{2} r\left(Y_c,D,X_c\right)+\lambda_1\|X\|_1+\lambda_2 f(X)+\eta\|D_0\|_*$$
wherein the first term is a discriminative fidelity term, the second term is a discriminative coefficient term based on the Fisher criterion, the third term is an L1 regularization term, and the fourth term is a nuclear norm; the discriminative fidelity term is used for realizing the discriminative power of the dictionary; the discriminative coefficient term is used for increasing intra-class similarity and reducing inter-class similarity; the L1 regularization term is used for realizing the sparsity of the coefficients $X$; the nuclear norm is used for constraining the size of the subspace spanned by the shared dictionary, ensuring the low rank of the shared dictionary; and $\lambda_1$, $\lambda_2$ and $\eta$ are used to trade off the weights of the terms of the cost function;
the discriminative fidelity term is defined as:
$$r(Y_c,D,X_c)=\|Y_c-DX_c\|_F^2+\|Y_c-D_cX_c^c-D_0X_c^0\|_F^2+\sum_{i=1,\,i\neq c}^{2}\|D_iX_c^i\|_F^2$$
wherein $Y_c\in\mathbb{R}^{m\times n_c}$ represents the samples of class $c$, the samples being fusion features, $m$ represents the dimension of the fusion features, $n_c$ represents the number of class-$c$ samples, $D$ represents the total dictionary, $D_c$ represents the sub-dictionary of class $c$, and $X_c^i$ represents the coefficients of the class-$c$ samples on the class-$i$ dictionary;
the discriminative coefficient term is defined as:
$$f(X)=\sum_{c=1}^{2}\left(\|X_c-M_c\|_F^2-\|M_c-M\|_F^2\right)+\|X\|_F^2+\|X^0-M^0\|_F^2$$
wherein $M_c$ represents the mean of the sparse coefficients of the class-$c$ samples, $M$ represents the mean of the sparse coefficients over the entire training set, and $M^0$ represents the mean of the coefficients on the shared dictionary; the term $\|X^0-M^0\|_F^2$ forces the coefficients of all training samples on the shared dictionary to be close to the mean.
6. The face representation attack detection method based on fusion features and dictionary learning according to claim 4, characterized by further comprising a step of solving the sparse coefficients of a test sample, specifically: constructing two category dictionaries with the shared dictionary attached from the saved dictionary, and solving the sparse coefficients of the test sample with the category dictionaries fixed.
7. The face representation attack detection method based on fusion features and dictionary learning according to claim 1, wherein judging the category of the test sample based on the size of the fusion feature reconstruction residual specifically comprises:
solving the sparse coefficients of the test sample based on elastic net regularization, reconstructing the fusion features of the test sample from the sparse coefficients, and taking the category with the smallest reconstruction residual as the predicted category of the test sample.
8. A face representation attack detection system based on fusion features and dictionary learning, comprising: a face image database construction module, a preliminary fusion feature extraction module, a final fusion feature generation module, a dictionary learning classifier training module and a test sample category judgment module;
the preliminary fusion feature extraction module comprises an image quality feature extraction module and a deep network feature extraction module;
the face image database construction module is used for performing face detection and cropping on an input video to construct a face image database;
the preliminary fusion feature extraction module is used for extracting fusion features of the face images in the face image database, the fusion features comprising image quality features and deep network features;
the image quality feature extraction module is used for extracting the image quality features of the complete face image according to the distortion source of secondary imaging of the face image;
the deep network feature extraction module is used for constructing a deep convolutional network model and extracting the deep network features of face image blocks through the deep convolutional network;
the final fusion feature generation module is used for normalizing the image quality features and the deep network features separately and cascading them, and reducing the dimension of the cascaded features through PCA to generate final fusion features;
the dictionary learning classifier training module is used for initializing dictionary atoms based on the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary;
the test sample category judgment module is used for judging the category of the test sample based on the size of the fusion feature reconstruction residual.
CN202010696193.4A | Filed 2020-07-20 | Face representation attack detection method and system based on fusion feature and dictionary learning | Active | Granted as CN111967331B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010696193.4A | 2020-07-20 | 2020-07-20 | Face representation attack detection method and system based on fusion feature and dictionary learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010696193.4A | 2020-07-20 | 2020-07-20 | Face representation attack detection method and system based on fusion feature and dictionary learning

Publications (2)

Publication Number | Publication Date
CN111967331A | 2020-11-20
CN111967331B | 2023-07-21

Family

Family ID: 73362137

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010696193.4A (Active, granted as CN111967331B) | Face representation attack detection method and system based on fusion feature and dictionary learning | 2020-07-20 | 2020-07-20

Country Status (1)

Country | Link
CN | CN111967331B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104281845A* | 2014-10-29 | 2015-01-14 | 中国科学院自动化研究所 | Face recognition method based on rotation invariant dictionary learning model
CN105844223A* | 2016-03-18 | 2016-08-10 | 常州大学 | Face expression algorithm combining class characteristic dictionary learning and shared dictionary learning
CN107194873A* | 2017-05-11 | 2017-09-22 | 南京邮电大学 | Low-rank nuclear norm canonical facial image ultra-resolution method based on coupling dictionary learning
CN107832747A* | 2017-12-05 | 2018-03-23 | 广东技术师范学院 | A kind of face identification method based on low-rank dictionary learning algorithm
US20180225807A1* | 2016-12-28 | 2018-08-09 | Shenzhen China Star Optoelectronics Technology Co., Ltd. | Single-frame super-resolution reconstruction method and device based on sparse domain reconstruction
CN108985177A* | 2018-06-21 | 2018-12-11 | 南京师范大学 | A kind of facial image classification method of the quick low-rank dictionary learning of combination sparse constraint
CN109766813A* | 2018-12-31 | 2019-05-17 | 陕西师范大学 | A Dictionary Learning Face Recognition Method Based on Symmetric Face Augmentation Samples
CN110428392A* | 2019-09-10 | 2019-11-08 | 哈尔滨理工大学 | A kind of Method of Medical Image Fusion based on dictionary learning and low-rank representation


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115512399A* | 2021-06-04 | 2022-12-23 | 长沙理工大学 | Face fusion attack detection method based on local features and lightweight network
CN115565210A* | 2021-06-30 | 2023-01-03 | 长沙理工大学 | A Lightweight Face Fusion Attack Detection Method Based on Feature Cascading
CN113505722A* | 2021-07-23 | 2021-10-15 | 中山大学 | In-vivo detection method, system and device based on multi-scale feature fusion
CN113505722B* | 2021-07-23 | 2024-01-02 | 中山大学 | Living body detection method, system and device based on multi-scale feature fusion
CN113449707A* | 2021-08-31 | 2021-09-28 | 杭州魔点科技有限公司 | Living body detection method, electronic apparatus, and storage medium
CN113449707B* | 2021-08-31 | 2021-11-30 | 杭州魔点科技有限公司 | Living body detection method, electronic apparatus, and storage medium

Also Published As

Publication number | Publication date
CN111967331B | 2023-07-21

Similar Documents

Publication | Title
AU2014368997B2 | System and method for identifying faces in unconstrained media
CN111967331B | Face representation attack detection method and system based on fusion feature and dictionary learning
US7869657B2 | System and method for comparing images using an edit distance
Tian et al. | Ear recognition based on deep convolutional network
CN113205002B | Low-definition face recognition method, device, equipment and medium for unlimited video monitoring
CN111160313A | Face representation attack detection method based on LBP-VAE anomaly detection model
Ding et al. | Noise-resistant network: a deep-learning method for face recognition under noise
Peng et al. | BDC-GAN: Bidirectional conversion between computer-generated and natural facial images for anti-forensics
CN112381987A | Intelligent entrance guard epidemic prevention system based on face recognition
CN114764939A | Heterogeneous face recognition method and system based on identity-attribute decoupling
An | Pedestrian Re-Recognition Algorithm Based on Optimization Deep Learning-Sequence Memory Model
Ma | Improving SAR target recognition performance using multiple preprocessing techniques
Hussain et al. | Few-shot based learning recaptured image detection with multi-scale feature fusion and attention
Yao | A compressed deep convolutional neural networks for face recognition
CN118968637A | A method for detecting deep fake face images based on identity prior
CN117935381B | Face-swapped video detection method and system based on overall forgery traces and local detail information extraction
Gupta et al. | Real-Time Gender Recognition for Juvenile and Adult Faces
Giap et al. | Adaptive multiple layer retinex-enabled color face enhancement for deep learning-based recognition
Nguyen et al. | Convolution autoencoder-based sparse representation wavelet for image classification
Bhattacharya et al. | Simplified face quality assessment (SFQA)
Mohammed et al. | Forensic Facial Reconstruction from Sketch in Crime Investigation
Vepuri | Improving facial emotion recognition with image processing and deep learning
CN116612521A | A face recognition method, device, chip and terminal
Zou et al. | An OCaNet model based on octave convolution and attention mechanism for iris recognition
CN114519678A | Scanning transmission image recovery method, device and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
