Disclosure of Invention
In view of this, the present invention aims to provide a method for face labeling that effectively solves the jitter of labeled points that arises when current face labeling algorithms are applied to a video sequence. Key points of the face in the video sequence are thereby labeled more stably, which provides stable input for subsequent analysis and processing based on the labeled points, especially for applications that place high demands on the positional stability and accuracy of the labeled points, such as expression analysis and recognition.
To achieve the above purpose, the technical solution of the invention is realized as follows:
a method of face labeling comprising a training process and a prediction process, the training process comprising the steps of: A1, inputting training samples and preprocessing the training samples; A2, performing cascade regression training; A3, generating and storing a model file after training;
the prediction process comprises the steps of: B1, denoising and filtering the input face image, and detecting the face in the image; B2, performing model prediction calculation on the processed face image; and B3, calculating the positions of the feature points from the trained model file to obtain the shape of the face.
Further, the training samples input in step A1 are images in which the face region and the key points have been labeled, denoted Li (i ≤ N), where N is the total number of training samples, and the feature point shape is the overall contour formed by the positions of all the feature points;
the true shape of the feature points is denoted Si and represents the coordinate information of all the feature points.
Further, the preprocessing of the training samples in step A1 includes normalization, and the normalization method includes: scaling the feature points of each sample image to a common scale and translating them, so that the face shape of each sample image is mapped into a unified standard matrix, and obtaining the shape information of each sample image in the standard matrix, including the coordinates of each feature point.
Further, normalizing each sample image in the sample image set makes the matrices in which the feature points of the sample images lie consistent, which facilitates training and the construction of the cascade regression learning model.
Further, in step A2, cascade regression training is performed through random forests.
Further, the model file generated in step A3 includes the number of levels of random forests, the number of regression trees in each level of forest, the depth of each regression tree, and the node information of each node of the regression tree.
Further, the node information of each node of the regression tree includes a pixel pair position and a threshold of each node, probability information of left and right branches, and an error estimation value of a leaf node.
Further, in step B1, the input face image is denoised by nonlinear median filtering: for each pixel in the image, the median is computed over an N × N window (where N is an odd number) centered on that pixel, and the pixel value at that position is replaced by the median.
Further, in step B2, the method for performing model prediction calculation on the processed face image includes: parsing the model file generated by training, reconstructing the random forest model generated in the training process, and iterating over each regression tree in each level of random forest of the random forest regression model to finally obtain the detected face shape.
Further, the method for performing model prediction calculation on the processed face image comprises the following steps:
C1, parsing the model file to obtain the average shape S, the pixel pair position and threshold (u, v, th) of each node, and the error estimate of each leaf node;
C2, entering the first regression tree of the first-level forest and, starting from the root node, computing the pixel intensity difference of the image at the positions (u, v) given by the (u, v, th) of the root node, and calculating the left-branch probability and the right-branch probability;
C3, processing the nodes at the next depth level of the tree, and calculating the left-branch and right-branch probabilities of each node;
C4, continuing until the branch probabilities of all leaf nodes have been calculated, then multiplying the branch probabilities along the path to each leaf node to obtain the probability of that leaf node; the sum over all leaf nodes of the product of each leaf node's error estimate and its probability is the shape error estimate of the tree, which is used to update the shape estimate;
repeating steps C2 to C4 on the other trees to obtain the shape estimate of the first-level forest;
and performing iterative calculation by taking each obtained shape estimate as the initial shape for the next regression tree, and repeating steps C2 to C4 for each stage of regression up to the last regression tree in the random forest model, whose estimated shape is the detected face shape.
Compared with the prior art, the method for face labeling has the following advantages:
the method effectively solves the jitter of labeled points that arises when existing face labeling algorithms are applied to a video sequence, so that key points of the face in the video sequence are labeled more stably. This provides stable input for subsequent analysis and processing based on the labeled points, particularly for applications that place high demands on the positional stability and accuracy of the labeled points, such as expression analysis and recognition.
Detailed Description
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or internal between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.
The present invention will be described in detail with reference to examples.
A method of face labeling comprises a training process and a prediction process, the training process comprising the steps of: A1, inputting training samples and preprocessing the training samples; A2, performing cascade regression training; A3, generating and storing a model file after training;
the prediction process comprises the steps of: B1, denoising and filtering the input face image, and detecting the face in the image; B2, performing model prediction calculation on the processed face image; and B3, calculating the positions of the feature points from the trained model file to obtain the shape of the face.
The training samples input in step A1 are images in which the face region and the key points have been labeled, denoted Li (i ≤ N), where N is the total number of training samples, and the feature point shape is the overall contour formed by the positions of all the feature points;
the true shape of the feature points is denoted Si and represents the coordinate information of all the feature points.
The preprocessing of the training samples in step A1 includes normalization, and the normalization method includes: scaling the feature points of each sample image to a common scale and translating them, so that the face shape of each sample image is mapped into a unified standard matrix, and obtaining the shape information of each sample image in the standard matrix, including the coordinates of each feature point.
Normalizing each sample image in the sample image set makes the matrices in which the feature points of the sample images lie consistent, which facilitates training and the construction of the cascade regression learning model.
In step A2, cascade regression training is performed through random forests.
The model file generated in step A3 includes the number of levels of random forests, the number of regression trees in each level of forest, the depth of each regression tree, and the node information of each node of the regression tree.
The node information of each node of the regression tree comprises the pixel pair position and threshold value of each node, probability information of left and right branches and error estimation values of leaf nodes.
In step B1, the input face image is denoised by nonlinear median filtering: for each pixel in the image, the median is computed over an N × N window (where N is an odd number) centered on that pixel, and the pixel value at that position is replaced by the median.
In step B2, the method for performing model prediction calculation on the processed face image includes: parsing the model file generated by training, reconstructing the random forest model generated in the training process, and iterating over each regression tree in each level of random forest of the random forest regression model to finally obtain the detected face shape.
The method for performing model prediction calculation on the processed face image comprises the following steps:
C1, parsing the model file to obtain the average shape S, the pixel pair position and threshold (u, v, th) of each node, and the error estimate of each leaf node;
C2, entering the first regression tree of the first-level forest and, starting from the root node, computing the pixel intensity difference of the image at the positions (u, v) given by the (u, v, th) of the root node, and calculating the left-branch probability and the right-branch probability;
C3, processing the nodes at the next depth level of the tree, and calculating the left-branch and right-branch probabilities of each node;
C4, continuing until the branch probabilities of all leaf nodes have been calculated, then multiplying the branch probabilities along the path to each leaf node to obtain the probability of that leaf node; the sum over all leaf nodes of the product of each leaf node's error estimate and its probability is the shape error estimate of the tree, which is used to update the shape estimate;
repeating steps C2 to C4 on the other trees to obtain the shape estimate of the first-level forest;
and performing iterative calculation by taking each obtained shape estimate as the initial shape for the next regression tree, and repeating steps C2 to C4 for each stage of regression up to the last regression tree in the random forest model, whose estimated shape is the detected face shape.
In a specific implementation, the following example is proposed:
1.1 Image preprocessing:
The training samples are all images in which the face region and the key points have been labeled, denoted Li (i ≤ N), where N is the total number of training samples, and the feature point shape is the overall contour formed by the positions of all the feature points.
The true shape of the feature points is denoted Si (the coordinate information of all feature points).
Because resolution, pose, and the like differ between sample images, each sample image in the sample image set needs to be normalized. By scaling the feature points of each sample image to a common scale, translating them, and similar operations, the face shape of each sample image is mapped into a unified standard matrix, and the shape information of each sample image in the standard matrix, including the coordinates of each feature point, is obtained.
Normalization makes the matrices in which the feature points of the sample images lie consistent, which facilitates training and the construction of the cascade regression learning model.
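The normalization described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the "standard matrix" is taken to be a square frame of side `size`, each shape is a list of (x, y) pairs, and the scaling is a single factor per shape derived from its bounding box. The names `normalize_shape` and `mean_shape` are hypothetical and do not come from the source.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_shape(shape: List[Point], size: float = 1.0) -> List[Point]:
    """Translate and uniformly scale a shape into a size x size frame."""
    xs = [p[0] for p in shape]
    ys = [p[1] for p in shape]
    min_x, min_y = min(xs), min(ys)
    # A single scale factor for both axes preserves the proportions of the face.
    extent = max(max(xs) - min_x, max(ys) - min_y)
    if extent == 0:
        raise ValueError("degenerate shape: all points coincide")
    scale = size / extent
    return [((x - min_x) * scale, (y - min_y) * scale) for x, y in shape]


def mean_shape(shapes: List[List[Point]]) -> List[Point]:
    """Average normalized shapes point by point to get an average shape."""
    n = len(shapes)
    k = len(shapes[0])
    return [
        (sum(s[j][0] for s in shapes) / n, sum(s[j][1] for s in shapes) / n)
        for j in range(k)
    ]
```

With this sketch, two annotations of the same face taken at different resolutions and offsets map to the same normalized coordinates, which is the consistency that the training stage relies on.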
1.2 Probability cascade regression learning:
Training is a regression learning process realized through random forests: the training process generates a cascade of random forests, and each random forest realizes a mapping from image features to shape. The shape prediction error is approximated step by step through the multistage random forests, from coarse to fine. The number of random forests is T, and the number of regression trees in each random forest is M. The image features are the intensity differences (gray-level differences) of pixel pairs, which are cheap to compute and robust to changes in pose, illumination, and the like.
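The pixel-pair feature described above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows and pixel positions given as absolute (row, column) pairs; the function name `pixel_difference` is hypothetical.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int]  # (row, column), an assumed convention


def pixel_difference(image: Sequence[Sequence[int]], u: Pixel, v: Pixel) -> int:
    """Gray-level difference between the pixels at positions u and v."""
    return image[u[0]][u[1]] - image[v[0]][v[1]]


image = [[10, 50], [200, 30]]
# The same scene with every pixel brightened by a constant offset.
brighter = [[value + 40 for value in row] for row in image]

g = pixel_difference(image, (0, 1), (1, 0))
g_brighter = pixel_difference(brighter, (0, 1), (1, 0))
```

Because the feature is a difference, a uniform brightness offset cancels out, so `g` and `g_brighter` are equal. This covers only uniform illumination changes; it is one simple reason the feature tolerates lighting variation, not a full account of its robustness.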
The initial prediction of the feature point shape of each sample is the average shape $\bar{S}$ calculated in the preprocessing process, and the prediction error before the first-level iteration $R^{0}$ is $\Delta S = S(I_i) - \bar{S}$, where $S(I_i)$ is the true feature point shape of sample $I_i$. The random forest regressor at each stage further approximates the error remaining after the previous stage of regression, as in formula (1), where $I$ is the image feature information and $S^{t-1}$ is the shape estimate after the previous stage of regression. After T stages of regression, the accumulated error estimates of all regressors approximate $\Delta S$, so that the true shape is approximated.
$S^{t} = S^{t-1} + R^{t}(I, S^{t-1})$    formula (1)
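The coarse-to-fine behavior of the cascade can be sketched as below, assuming the additive update of formula (1), in which each stage adds its predicted correction to the previous shape estimate. The stage used here is a toy stand-in that is handed the target shape and corrects a fixed fraction of the remaining error; a trained random forest would predict that correction from image features instead. The names `run_cascade` and `make_toy_stage` are hypothetical.

```python
from typing import Callable, List, Sequence

Shape = List[float]  # flattened coordinates [x1, y1, x2, y2, ...]
Stage = Callable[[object, Shape], Shape]


def run_cascade(image: object, initial_shape: Sequence[float], stages: Sequence[Stage]) -> Shape:
    """Apply each stage in turn: new shape = previous shape + predicted correction."""
    shape = list(initial_shape)
    for stage in stages:
        delta = stage(image, shape)
        shape = [s + d for s, d in zip(shape, delta)]
    return shape


def make_toy_stage(target: Sequence[float], rate: float = 0.5) -> Stage:
    """Toy regressor (illustration only): corrects a fixed fraction of the remaining error."""
    def stage(image: object, shape: Shape) -> Shape:
        return [rate * (t - s) for s, t in zip(shape, target)]
    return stage


target = [4.0, 8.0]
stages = [make_toy_stage(target) for _ in range(3)]
estimate = run_cascade(None, [0.0, 0.0], stages)
```

Each toy stage halves the remaining error, so the estimate moves from the initial shape toward the target in progressively smaller steps, which mirrors how the accumulated corrections of the T stages approximate the initial error.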
Each stage regressor $R^{t}$ is a random forest with K regression trees $\{R^{1}, R^{2}, \dots, R^{K}\}$. The depth of each tree is $l$, and layer $n$ of a tree has $2^{n}$ nodes ($0 \le n \le l$). On entering each level of forest, N pairs of pixel points are randomly generated in the face region. The pairs are screened with formula (2), and only pairs whose P value is smaller than the threshold $T_P$ are kept.
$P = e^{-\lambda \lvert u - v \rvert}$    formula (2)
Formula (2) essentially constrains the randomly selected pixels: each randomly selected pixel pair is screened by the value calculated with the formula and must satisfy P < T_P. This is because experiments show that random points selected in this exponential form have better distribution characteristics in the face region; λ is a coefficient of the general exponential form and has no specific physical meaning. The purpose of the formula is to bring the randomly selected points as close as possible to the edges and contours of the facial features. Pixel differences near an edge have a large gradient (difference value), so the gradient needs to be larger than a certain range, and hence the P value needs to be smaller than a certain range.
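The screening of random pixel pairs with formula (2) can be sketched as follows. The sketch assumes that |u - v| is the Euclidean distance between the two pixel positions, that positions are (x, y) integers inside a `width` by `height` face region, and that rejected pairs are simply redrawn. The function name `sample_pixel_pairs` and the `max_tries` guard are hypothetical.

```python
import math
import random
from typing import List, Tuple

Pixel = Tuple[int, int]


def sample_pixel_pairs(
    n_pairs: int,
    width: int,
    height: int,
    lam: float,
    t_p: float,
    rng: random.Random,
    max_tries: int = 100000,
) -> List[Tuple[Pixel, Pixel]]:
    """Draw random pixel pairs and keep those with P = exp(-lam * |u - v|) < t_p."""
    pairs: List[Tuple[Pixel, Pixel]] = []
    tries = 0
    while len(pairs) < n_pairs and tries < max_tries:
        tries += 1
        u = (rng.randrange(width), rng.randrange(height))
        v = (rng.randrange(width), rng.randrange(height))
        distance = math.hypot(u[0] - v[0], u[1] - v[1])  # assumed meaning of |u - v|
        p = math.exp(-lam * distance)
        if p < t_p:  # screening condition stated in the text
            pairs.append((u, v))
    return pairs


pairs = sample_pixel_pairs(20, 100, 100, lam=0.1, t_p=0.5, rng=random.Random(0))
```

Under these assumptions the condition P < T_P is equivalent to requiring the two pixels to be farther apart than ln(1 / T_P) / λ, so the threshold and λ together set a minimum separation for the pairs that survive screening.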
To solve the jitter problem, a probability model is added: when a node is split into a left subtree and a right subtree, each subtree is assigned a probability. The probability of the right subtree is p, given by formula (3), and the probability of the left subtree is 1 - p, where α is a constant, g is the actual gray-level difference of the sample at the positions (u, v), and $th_m$ is the split threshold at the node. It is easy to prove mathematically that the branch of the subtree in which the sample lies has a greater probability than the other subtree.
$p = \dfrac{1}{1 + e^{\alpha (g - th_m)}}$    formula (3)
After the tree is generated, the probability of each leaf node is the product of the probabilities of all branches traversed from the root node to that leaf. The reason the probability model resists jitter is straightforward: each regression tree is in effect a partition (or classification) of the training samples, and a linear combination of these partitions is used to estimate and approximate the shape error. Without the probability model, the shape error estimate for a sample is determined only by the samples in the leaf node into which it falls. With the probability model, it is determined by a weighted sum over the samples in all leaf nodes, in which the other leaf nodes carry lower weights. The result is therefore more stable, while the process still approaches the true error step by step under the globally optimal condition. The predicted shape error for each sample falling into the left or right subtree is taken as the mean, over all samples in that subtree, of the error between the true shape and the shape estimated at the previous stage. The selection of the pixel pair split at each node is therefore a process of minimizing the overall prediction error.
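The stabilizing effect of the probability model can be illustrated on a single split node. The sketch below assumes α > 0, so that under formula (3) a sample with g below the threshold is assigned mostly to the right subtree; the hard split used for comparison follows the same convention. The function names and the leaf values are hypothetical.

```python
import math


def right_probability(g: float, th: float, alpha: float) -> float:
    """Formula (3): probability assigned to the right subtree."""
    # Clamp the exponent so math.exp cannot overflow for extreme inputs.
    z = max(-700.0, min(700.0, alpha * (g - th)))
    return 1.0 / (1.0 + math.exp(z))


def soft_stump(g: float, th: float, alpha: float, left_value: float, right_value: float) -> float:
    """Probability-weighted output of one split node with two leaves."""
    p = right_probability(g, th, alpha)
    return p * right_value + (1.0 - p) * left_value


def hard_stump(g: float, th: float, left_value: float, right_value: float) -> float:
    """Conventional split: the sample falls entirely into one leaf."""
    return right_value if g < th else left_value


# Sensor noise moves the gray difference across the threshold between two frames.
th, alpha = 10.0, 0.5
hard_jump = abs(hard_stump(9.5, th, 2.0, -2.0) - hard_stump(10.5, th, 2.0, -2.0))
soft_jump = abs(soft_stump(9.5, th, alpha, 2.0, -2.0) - soft_stump(10.5, th, alpha, 2.0, -2.0))
```

With the hard split the output jumps by the full distance between the two leaf values when noise pushes g across the threshold, whereas the probability-weighted output changes by roughly an eighth of that in this example. This is the mechanism by which small frame-to-frame intensity changes stop producing large changes in the predicted shape.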
1.3 Generating a model file:
After training ends, the necessary information from the above process is stored to generate a model file. The information contained in the model file mainly comprises the number of stages of random forests, the number of regression trees in each forest, the depth of each regression tree, and the node information of each node of the regression tree, such as the pixel pair position and threshold of each node, the probability information of the left and right branches, and the error estimates of the leaf nodes.
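One possible layout for such a model file is sketched below. The source does not specify a file format, so JSON, the class names, and the field names are all assumptions. The number of stages and the number of trees per stage are implied by the list lengths, and the "probability information" is represented here only by the constant `alpha` of formula (3), which is likewise an assumption.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Node:
    # Split nodes carry a pixel pair and a threshold; leaf nodes carry an error estimate.
    u: Optional[List[int]] = None
    v: Optional[List[int]] = None
    th: Optional[float] = None
    leaf_error: Optional[List[float]] = None


@dataclass
class Tree:
    depth: int
    nodes: List[Node]  # assumed order: root first, then level by level


@dataclass
class Model:
    mean_shape: List[float]
    alpha: float  # assumed representation of the branch probability information
    forests: List[List[Tree]]  # forests[t] holds the trees of stage t


def save_model(model: Model) -> str:
    """Serialize the model to a JSON string."""
    return json.dumps(asdict(model))


def load_model(text: str) -> Model:
    """Rebuild the nested dataclasses from a JSON string."""
    raw = json.loads(text)
    forests = [
        [Tree(t["depth"], [Node(**n) for n in t["nodes"]]) for t in forest]
        for forest in raw["forests"]
    ]
    return Model(raw["mean_shape"], raw["alpha"], forests)


tree = Tree(
    depth=1,
    nodes=[
        Node(u=[0, 0], v=[0, 1], th=5.0),
        Node(leaf_error=[0.1, -0.2]),
        Node(leaf_error=[-0.1, 0.2]),
    ],
)
model = Model(mean_shape=[0.5, 0.5], alpha=0.1, forests=[[tree]])
restored = load_model(save_model(model))
```

Storing every quantity needed at prediction time in one file means the prediction process can rebuild the cascade without access to the training samples.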
2.1 denoising and filtering the image to be detected
Because of hardware variation and changes in ambient illumination, consecutive frames of the image to be detected that are acquired by the camera can differ considerably even when the face is still, and these differences make the positions of the feature points unstable. To reduce the effect of such noise, the image to be detected may be median filtered before prediction is performed. With nonlinear median filtering, the median is computed over an N × N window (where N is an odd number) centered on each pixel in the image, and the pixel value at that position is replaced by the median.
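The median filtering step can be sketched in plain Python as follows. The sketch assumes a grayscale image stored as a list of rows and handles the image border by repeating the nearest edge pixel; the source does not say how borders are treated, so that choice and the function name `median_filter` are assumptions.

```python
from typing import List, Sequence


def median_filter(image: Sequence[Sequence[int]], n: int = 3) -> List[List[int]]:
    """Replace each pixel by the median of the n x n window centered on it."""
    if n < 1 or n % 2 == 0:
        raise ValueError("window size n must be a positive odd number")
    height, width = len(image), len(image[0])
    radius = n // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                # Clamp coordinates so border pixels reuse the nearest edge value.
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            window.sort()
            out[y][x] = window[len(window) // 2]  # n * n is odd, so this is the median
    return out


noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
clean = median_filter(noisy, 3)
```

The isolated bright pixel in `noisy` is replaced by the value of its neighbors, while a uniform region would pass through unchanged. An odd window size guarantees that the window has a single middle element.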
2.2 Model prediction computation using the model file
The model file generated by training is parsed, the random forest model generated in the training process is reconstructed, and each regression tree in each level of random forest of the random forest regression model is evaluated iteratively to finally obtain the detected face shape. Specifically, the feature point shape of the image is predicted according to the following steps:
1) parsing the model file to obtain the average shape S, the pixel pair position and threshold (u, v, th) of each node, and the error estimate of each leaf node;
2) entering the first regression tree of the first-level forest and, starting from the root node, computing the pixel intensity difference of the image at the positions (u, v) given by the (u, v, th) of the root node, and calculating the left-branch and right-branch probabilities from this difference according to formula (3);
3) processing the nodes at the next depth level of the tree in the same way as in step 2), and calculating the left-branch and right-branch probabilities of each node;
4) continuing until the branch probabilities of all leaf nodes have been calculated, then multiplying the branch probabilities along the path to each leaf node to obtain the probability of that leaf node; the sum over all leaf nodes of the product of each leaf node's error estimate and its probability is the shape error estimate of the tree, which is used to update the shape estimate.
Steps 2) to 4) are repeated on the other trees to obtain the shape estimate of the first-level forest.
Iterative calculation is then performed by taking each obtained shape estimate as the initial shape for the next regression tree, and steps 2) to 4) are repeated for each stage of regression up to the last regression tree in the random forest model, whose estimated shape is the detected face shape.
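The prediction steps above can be sketched as follows. The sketch makes several assumptions that are not stated in the source: each tree is complete and its split nodes are stored root first and level by level, pixel positions are absolute (row, column) indices, the shape is a flat list of coordinates, and every tree's correction is added to the running shape estimate. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int]
Split = Tuple[Pixel, Pixel, float]  # (u, v, th)
TreeData = Tuple[Sequence[Split], Sequence[Sequence[float]]]  # (splits, leaf errors)


def right_probability(g: float, th: float, alpha: float) -> float:
    """Formula (3), with the exponent clamped to avoid overflow."""
    z = max(-700.0, min(700.0, alpha * (g - th)))
    return 1.0 / (1.0 + math.exp(z))


def leaf_probabilities(image, splits: Sequence[Split], alpha: float) -> List[float]:
    """Steps 2) to 4): multiply branch probabilities along every root-to-leaf path."""
    n_internal = len(splits)
    prob = [0.0] * (2 * n_internal + 1)
    prob[0] = 1.0
    for i, (u, v, th) in enumerate(splits):  # parents are always visited before children
        g = image[u[0]][u[1]] - image[v[0]][v[1]]
        p = right_probability(g, th, alpha)
        prob[2 * i + 1] = prob[i] * (1.0 - p)  # left child
        prob[2 * i + 2] = prob[i] * p          # right child
    return prob[n_internal:]  # leaves, ordered left to right


def tree_shape_error(image, splits, leaf_errors, alpha: float) -> List[float]:
    """Probability-weighted sum of the leaf error estimates of one tree."""
    probs = leaf_probabilities(image, splits, alpha)
    dim = len(leaf_errors[0])
    return [sum(p * e[k] for p, e in zip(probs, leaf_errors)) for k in range(dim)]


def predict_shape(image, mean_shape: Sequence[float], forests: Sequence[Sequence[TreeData]], alpha: float) -> List[float]:
    """Start from the average shape and add the correction of every tree in every stage."""
    shape = list(mean_shape)
    for forest in forests:
        for splits, leaf_errors in forest:
            delta = tree_shape_error(image, splits, leaf_errors, alpha)
            shape = [s + d for s, d in zip(shape, delta)]
    return shape
```

Because every leaf contributes in proportion to its path probability, the leaf probabilities of a tree always sum to one, and a small change in a gray difference near a threshold shifts weight gradually between leaves instead of switching the output from one leaf to another.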
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.