Disclosure of Invention
In view of the problems in the prior art, the present invention provides a single-stage face detection and key point positioning method, which specifically comprises the following steps:
step S1, acquiring a plurality of face images, and labeling each face image to obtain a labeled image with a real face frame and a real key point position;
step S2, training according to the annotation image to obtain a face detection and key point positioning fusion model;
step S3, inputting the current frame of the face picture to be detected in the video image into the face detection and key point positioning fusion model, and obtaining and outputting the current frame face detection frame and the current frame face key point position corresponding to the current frame of the face picture to be detected;
step S4, performing key point anti-shake processing on the next frame of the face picture to be detected according to the current frame face key point position, and recording the total number of times the key point anti-shake processing has been performed;
step S5, comparing the recorded total number of times with a preset count threshold:
if the total number of times is not greater than the count threshold, go to step S6;
if the total number of times is greater than the count threshold, clearing the total number of times, and then returning to step S3;
step S6, directly obtaining the next frame face detection frame and the next frame face key point position corresponding to the next frame of the face picture to be detected according to the key point anti-shake processing result, outputting them as the current frame face detection frame and the current frame face key point position, and then returning to step S4;
the above process is executed continuously until all frames of the video image have been processed.
Preferably, the face detection and key point positioning fusion model adopts a RetinaNet network structure, and the feature maps output by the last three convolutional layers of the RetinaNet network structure are processed by a feature pyramid network structure.
Preferably, in the training process of the face detection and key point positioning fusion model, anchor point frames with a preset proportion are used for the regression prediction of the face detection frame and the prediction of the face key point position.
Preferably, in the training process of the face detection and key point positioning fusion model, for the feature maps generated by the convolution operations, the receptive field of each pixel in the corresponding face image is twice the size of the anchor point frame.
Preferably, the preset proportion is 1:1.
Preferably, the step S2 specifically includes:
step S21, inputting the annotation image into a pre-generated initial fusion model to obtain a corresponding face detection prediction result and a key point prediction result;
the face detection prediction result comprises a face classification prediction result, a face frame regression prediction result and a face frame proportion prediction result;
step S22, respectively calculating a first loss function between the face classification prediction result and a real face classification result contained in the real face frame, a second loss function between the face frame regression prediction result and a real face region contained in the real face frame, a third loss function between the face frame proportion prediction result and the preset proportion, and a fourth loss function between the key point prediction result and the real key point position;
step S23, performing weighted summation on the first loss function, the second loss function, the third loss function, and the fourth loss function to obtain a total loss function, and comparing the total loss function with a preset loss function threshold:
if the total loss function is not less than the loss function threshold, then go to step S24;
if the total loss function is less than the loss function threshold, then go to step S25;
step S24, adjusting the training parameters in the initial fusion model according to a preset learning rate, and then returning to the step S21 to continue a new training process;
and step S25, taking the initial fusion model as a face detection and key point positioning fusion model and outputting the model.
Preferably, in step S4, the key point anti-shake processing specifically includes:
step A1, according to the face key point positions, enlarging the region of the corresponding face key points in the next frame of the face picture to be detected by a preset multiple to obtain a face region picture;
step A2, verifying the face region picture according to a pre-generated face verification model, and judging whether the face region picture is a face according to the verification result:
if yes, go to step A3;
if not, exiting;
and step A3, tracking the face by adopting a tracking algorithm to obtain the face detection frame and the face key point position corresponding to the next frame of the face picture to be detected.
A single-stage face detection and key point positioning system applies any one of the above single-stage face detection and key point positioning methods, and specifically comprises:
the data annotation module is used for acquiring a plurality of face images and annotating each face image to obtain an annotated image with a real face frame and a real key point position;
the data training module is connected with the data annotation module and used for training according to the annotation image to obtain a face detection and key point positioning fusion model;
the model prediction module is connected with the data training module and used for inputting the current frame of the face picture to be detected into the face detection and key point positioning fusion model to obtain and output the face detection frame and the face key point position corresponding to the current frame of the face picture to be detected;
the anti-shake processing module is connected with the model prediction module and is used for performing key point anti-shake processing on the next frame of the face picture to be detected according to the face detection frame and the face key point position, and for recording the total number of times the key point anti-shake processing has been performed;
the data comparison module is connected with the anti-shake processing module and used for comparing the recorded total number of times with a preset count threshold, generating a first comparison result when the total number of times is not greater than the count threshold, and generating a second comparison result when the total number of times is greater than the count threshold;
the first processing module is connected with the data comparison module, and is used for directly obtaining, according to the first comparison result and the key point anti-shake processing result, the next frame face detection frame and the next frame face key point position corresponding to the next frame of the face picture to be detected, and outputting them as the current frame face detection frame and the current frame face key point position;
and the second processing module is connected with the data comparison module and used for clearing the total number of times according to the second comparison result.
Preferably, the data training module specifically includes:
the data prediction unit is used for inputting the annotation image into a pre-generated initial fusion model to obtain a corresponding face detection prediction result and a key point prediction result;
the face detection prediction result comprises a face classification prediction result, a face frame regression prediction result and a face frame proportion prediction result;
the first processing unit is connected with the data prediction unit and is used for respectively calculating a first loss function between the face classification prediction result and the real face classification result contained in the real face frame, a second loss function between the face frame regression prediction result and the real face region contained in the real face frame, a third loss function between the face frame proportion prediction result and the preset proportion, and a fourth loss function between the key point prediction result and the real key point position;
the second processing unit is connected with the first processing unit and used for carrying out weighted summation on the first loss function, the second loss function, the third loss function and the fourth loss function to obtain a total loss function;
the data comparison unit is connected with the second processing unit and used for comparing the total loss function with a preset loss function threshold value, generating a first comparison result when the total loss function is not smaller than the loss function threshold value, and generating a second comparison result when the total loss function is smaller than the loss function threshold value;
the third processing unit is connected with the data comparison unit and used for adjusting the training parameters in the initial fusion model according to the first comparison result and a preset learning rate so as to continue a new training process;
and the fourth processing unit is connected with the data comparison unit and used for taking the initial fusion model as a face detection and key point positioning fusion model according to the second comparison result and outputting the face detection and key point positioning fusion model.
Preferably, the anti-shake processing module specifically includes:
the image processing unit is used for enlarging the region of the corresponding face key points in the next frame of the face picture to be detected by a preset multiple according to the face key point positions, so as to obtain a face region picture;
the face verification unit is connected with the image processing unit and used for verifying the face region picture according to a pre-generated face verification model and outputting a corresponding face verification result when the verification result shows that the face region picture is a face;
and the face tracking unit is connected with the face verification unit and used for tracking the face by adopting a tracking algorithm to obtain the face detection frame and the face key point position corresponding to the next frame of the face picture to be detected.
The technical scheme has the following advantages or beneficial effects:
1) face detection and key point positioning are fused and trained end to end, so that the two tasks promote each other and the accuracy of both face detection and key point positioning is effectively improved;
2) the key point jitter problem is effectively alleviated by combining the face verification model and the tracking model;
3) face detection and key point positioning are fused into one model, which effectively improves inference speed and makes the method suitable for edge computing devices that only support single-model deployment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present invention is not limited to these embodiments; other embodiments fall within the scope of the present invention as long as they satisfy the gist of the present invention.
In a preferred embodiment of the present invention, based on the above problems in the prior art, a single-stage face detection and key point positioning method is provided, as shown in fig. 1, and specifically includes:
step S1, acquiring a plurality of face images, and labeling each face image to obtain a labeled image with a real face frame and a real key point position;
step S2, training according to the annotation image to obtain a face detection and key point positioning fusion model;
step S3, inputting the current frame of the face picture to be detected in the video image into the face detection and key point positioning fusion model, and obtaining and outputting the current frame face detection frame and the current frame face key point position corresponding to the current frame of the face picture to be detected;
step S4, performing key point anti-shake processing on the next frame of the face picture to be detected according to the current frame face key point position, and recording the total number of times the key point anti-shake processing has been performed;
step S5, comparing the recorded total number of times with a preset count threshold:
if the total number of times is not greater than the count threshold, go to step S6;
if the total number of times is greater than the count threshold, clearing the total number of times, and then returning to step S3;
step S6, directly obtaining the next frame face detection frame and the next frame face key point position corresponding to the next frame of the face picture to be detected according to the key point anti-shake processing result, outputting them as the current frame face detection frame and the current frame face key point position, and then returning to step S4;
the above process is executed continuously until all frames of the video image have been processed.
The method fuses face detection and key point positioning and adopts an end-to-end training mode, so that face detection and key point positioning promote each other and the accuracy of both is effectively improved. Meanwhile, the key point jitter problem is effectively alleviated by combining the face verification model and the tracking model. Furthermore, fusing face detection and key point positioning into one model effectively improves inference speed, makes the method suitable for edge computing devices that only support single-model deployment, and avoids the severe slowdown that occurs when an edge computing device loads more than one model.
Further specifically, the technical scheme of the invention comprises a training process of a face detection and key point positioning fusion model:
firstly, training data are prepared, namely, a plurality of acquired face images are labeled to obtain labeled images with real face frames and real key point positions. In this embodiment, it is preferable that the annotation image is stored in a format of a text document. Specifically, a text document named "in.txt" is newly created as a training set, where each line in the "in.txt" represents a piece of annotation image data. Each piece of marked image data preferably comprises 6 points and a picture path of a face frame (box) and a face key point (landmark), and the specific storage format is as follows:
Path/xxx.jpg x1,y1,x2,y2,ptx1,pty1,ptx2,pty2,...,ptx6,pty6
where Path represents the storage path, xxx.jpg represents the name of the annotation image, x1, y1, x2 and y2 represent the face frame data, and ptx1, pty1 through ptx6, pty6 represent the 6 face key points. The fields x1, y1 through ptx6, pty6 describe one face in the annotation image; if a picture contains several faces, these fields are repeated for each additional face. Fusing the training data of face detection and key point positioning into one annotation file facilitates data processing and reading.
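As an illustration only, the following minimal sketch shows how one line of this annotation format could be parsed; the function name is hypothetical, and the field layout follows the format described above under the assumption that each face contributes 4 face frame values and 6 coordinate pairs.

```python
def parse_annotation_line(line: str):
    """Parse one line of the "in.txt" annotation format described above."""
    fields = line.strip().split()
    image_path = fields[0]
    # Flatten all comma-separated numbers after the path, dropping empties.
    values = [float(v) for field in fields[1:] for v in field.split(",") if v]
    faces = []
    # Each face occupies 4 face frame values plus 6 * 2 key point values.
    for i in range(0, len(values), 16):
        face = values[i:i + 16]
        box = face[:4]                                  # x1, y1, x2, y2
        landmarks = list(zip(face[4::2], face[5::2]))   # 6 (ptx, pty) pairs
        faces.append({"box": box, "landmarks": landmarks})
    return image_path, faces
```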
Secondly, the network framework of the face detection and key point positioning fusion model is preset. A single-stage framework is preferred, specifically the RetinaNet network structure, which is a single-stage detection network with a feature pyramid network (FPN). Due to the particularity of the human face, the proportion of the anchor point frame (anchor) in the training process is preferably set to 1:1, which effectively avoids detected face frames that are overly long or wide and do not match the proportions of a face. To further improve the recall and precision of face detection and key point positioning, for the feature maps generated by the convolution operations during training, the receptive field of each pixel in the feature map, mapped back to the corresponding face image, is preferably set to twice the size of the anchor point frame, avoiding the poor detection precision caused by using the default value. To give the fusion model a good detection effect on small faces, the feature maps of the last three layers of the backbone network (backbone) are preferably upsampled by the feature pyramid network (FPN) to realize feature fusion, which effectively improves the recall of small faces.
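As a hedged illustration of the 1:1 anchor constraint described above, the sketch below generates square anchor boxes over a feature map grid; the stride and base size parameters are illustrative assumptions, since the text fixes only the 1:1 proportion and the receptive field rule.

```python
import numpy as np

def generate_square_anchors(feat_h, feat_w, stride, base_size):
    """Square (1:1) anchors for one feature map level; values illustrative."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # anchor center
            half = base_size / 2.0                           # square: width == height
            anchors.append([cx - half, cy - half, cx + half, cy + half])
    return np.array(anchors, dtype=np.float32)

# Per the receptive field rule above, base_size for each pyramid level would
# be chosen as half the receptive field of that level's feature map pixels.
```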
More preferably, in the training process of the face detection and key point positioning fusion model, after each training iteration, positive and negative samples are balanced according to the prediction result of the current iteration, and the obtained positive and negative samples are sent to the next iteration. In this embodiment, for the positive and negative samples of face detection, an anchor point frame (anchor) is preferably treated as a positive sample when its intersection-over-union (IoU) with the real face frame (gt), computed from the current prediction, is greater than 0.5, and as a negative sample when the IoU is less than 0.3. For the positive and negative samples of key point positioning, the loss function between the predicted key point positions and the real key point positions is preferably computed only when the IoU between the anchor point frame (anchor) and the real face frame (gt) is greater than 0.7, which avoids the difficulty of network convergence and the inaccurate key point positioning caused by an overly small IoU.
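The sample assignment rule above can be summarized in a short sketch using the stated IoU thresholds (0.5, 0.3 and 0.7); the helper names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_anchor(anchor, gt_box):
    """Anchors with IoU > 0.5 are positives, IoU < 0.3 negatives, and the
    key point loss is only computed where IoU > 0.7, per the text above."""
    overlap = iou(anchor, gt_box)
    is_positive = overlap > 0.5
    is_negative = overlap < 0.3
    use_landmark_loss = overlap > 0.7
    return is_positive, is_negative, use_landmark_loss
```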
Further specifically, in the training process of the face detection and key point positioning fusion model, four loss functions are preferably set as return values in the training process to ensure the effectiveness of the network. The four loss functions preferably include a first loss function between the face classification prediction result and the real face classification result contained in the real face frame, a second loss function between the face frame regression prediction result and the real face region contained in the real face frame, a third loss function between the face frame proportion prediction result and the preset proportion, and a fourth loss function between the key point prediction result and the real key point position. The first loss function preferably adopts a softmax function, the second loss function preferably adopts a smooth L1 function, and the fourth loss function preferably adopts an MSE function; the third loss function is set to constrain the face frame proportion to 1:1.
Preferably, the first loss function, the second loss function, the third loss function and the fourth loss function are weighted and summed to obtain the total loss function. The weight of the first loss function is preferably 1, the weight of the second loss function is preferably 1, the weight of the third loss function is preferably 0.5, and the weight of the fourth loss function is preferably 0.1. The weights of the third and fourth loss functions are set small so that enforcing the face frame proportion does not affect the overall performance of the network on face detection and key point positioning. In the training process, the preset learning rate is preferably 0.0001, and after training is finished the face detection and key point positioning fusion model is obtained.
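A minimal sketch of the weighted total loss with the weights stated in this embodiment (1, 1, 0.5 and 0.1); the four component losses are assumed to be precomputed scalar values.

```python
# Weights stated in this embodiment for the first through fourth loss functions.
LOSS_WEIGHTS = (1.0, 1.0, 0.5, 0.1)

def total_loss(cls_loss, box_loss, ratio_loss, landmark_loss):
    """Weighted sum of the four component losses described above."""
    w1, w2, w3, w4 = LOSS_WEIGHTS
    return w1 * cls_loss + w2 * box_loss + w3 * ratio_loss + w4 * landmark_loss
```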
Furthermore, the face detection and key point positioning fusion model is used to perform face detection and key point positioning on the pictures to be detected of the video image. To eliminate the key point jitter caused by the subjectivity of data labeling and by jitter of the face detection frame, the position of the face detection frame is fixed from the perspective of face tracking: when the face does not move, the detection frame does not move; since the input frame for key point positioning is then unchanged, the jitter amplitude of the key points is reduced.
Specifically, when performing face detection and key point positioning on consecutive frames of the video image, the current frame of the picture to be detected is preferably taken as the detection starting node: the face detection and key point positioning fusion model is used to predict the current frame and obtain the corresponding current frame face detection frame and current frame key point positions, while the subsequent consecutive frames are handled by the key point anti-shake processing. Preferably, one fusion model prediction is followed by key point anti-shake processing for the next ten frames, then another fusion model prediction, and so on, as sketched below.
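A hedged sketch of this scheduling, assuming the fusion model and the anti-shake routine are available as callables; the anti-shake routine is assumed to return None when face verification fails, in which case the model is re-run.

```python
def process_video(frames, model, anti_shake, count_threshold=10):
    """Yield (face frame, key points) per frame: one model prediction, then
    anti-shake processing for count_threshold frames (ten here)."""
    counter = 0
    box, landmarks = None, None
    for frame in frames:
        if box is None or counter >= count_threshold:
            box, landmarks = model(frame)          # step S3: model prediction
            counter = 0
        else:
            result = anti_shake(frame, landmarks)  # step S4: anti-shake
            counter += 1
            if result is None:                     # verification failed
                box, landmarks = model(frame)
                counter = 0
            else:
                box, landmarks = result            # step S6
        yield box, landmarks
```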
In a preferred embodiment of the present invention, the face detection and key point positioning fusion model adopts a RetinaNet network structure, and the feature maps output by the last three convolutional layers of the RetinaNet network structure are processed by a feature pyramid network structure.
In a preferred embodiment of the present invention, in the training process of the face detection and key point positioning fusion model, anchor point frames with the preset proportion are used for the regression prediction of the face detection frame and the prediction of the face key point position.
In a preferred embodiment of the present invention, in the training process of the face detection and key point positioning fusion model, for the feature maps generated by the convolution operations, the receptive field of each pixel in the corresponding face image is twice the size of the anchor point frame.
In a preferred embodiment of the present invention, the preset proportion is 1:1.
In a preferred embodiment of the present invention, as shown in fig. 2, step S2 specifically includes:
step S21, inputting the annotation image into a pre-generated initial fusion model to obtain a corresponding face detection prediction result and a key point prediction result;
the face detection prediction result comprises a face classification prediction result, a face frame regression prediction result and a face frame proportion prediction result;
step S22, respectively calculating a first loss function between the face classification prediction result and the real face classification result contained in the real face frame, a second loss function between the face frame regression prediction result and the real face region contained in the real face frame, a third loss function between the face frame proportion prediction result and the preset proportion, and a fourth loss function between the key point prediction result and the real key point position;
step S23, performing weighted summation on the first loss function, the second loss function, the third loss function, and the fourth loss function to obtain a total loss function, and comparing the total loss function with a preset loss function threshold:
if the total loss function is not less than the loss function threshold, go to step S24;
if the total loss function is less than the loss function threshold, go to step S25;
step S24, adjusting the training parameters in the initial fusion model according to the preset learning rate, and then returning to step S21 to continue a new training process;
and step S25, taking the initial fusion model as a face detection and key point positioning fusion model and outputting the model.
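A hedged sketch of this training loop (steps S21 to S25), with the loss weights and learning rate taken from the embodiment above; `compute_losses` and `update_parameters` are illustrative placeholders for the loss calculation of step S22 and the parameter adjustment of step S24.

```python
def train(model, data, compute_losses, update_parameters,
          loss_threshold, learning_rate=0.0001):
    """Iterate until the weighted total loss falls below the preset threshold."""
    while True:
        predictions = model(data)                           # step S21
        l1, l2, l3, l4 = compute_losses(predictions, data)  # step S22
        loss = 1.0 * l1 + 1.0 * l2 + 0.5 * l3 + 0.1 * l4    # step S23: weighted sum
        if loss < loss_threshold:                           # step S25: done
            return model
        update_parameters(model, loss, learning_rate)       # step S24
```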
In a preferred embodiment of the present invention, in step S4, as shown in fig. 3, the key point anti-shake processing specifically includes:
step A1, according to the face key point positions, enlarging the region of the corresponding face key points in the next frame of the face picture to be detected by a preset multiple to obtain a face region picture;
step A2, verifying the face region picture according to a pre-generated face verification model, and judging whether the face region picture is a face according to the verification result:
if yes, go to step A3;
if not, exiting;
and step A3, tracking the face by adopting a tracking algorithm to obtain a face detection frame and face key point positions corresponding to the next frame of face picture to be detected.
Specifically, in this embodiment, the jitter amplitude of the key points can be greatly reduced through the key point anti-shake processing. The key point anti-shake processing mainly comprises two parts. The first is the face verification module, which judges whether the region is a face; it preferably adopts a simple binary classification network and can be realized with only a few convolutional layers. The second is the tracking algorithm, which preferably adopts the KCF algorithm and tracks the face to obtain the face detection frame once the face verification module confirms a face. The face region is enlarged by 1.5 times before being sent to the detection algorithm, which ensures that the output picture does not change greatly and also ensures the accuracy of the key point regression.
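The following sketch illustrates steps A1 to A3 under stated assumptions: OpenCV's KCF tracker is used (`cv2.TrackerKCF_create`, provided by opencv-contrib; the constructor's location varies across OpenCV versions), and `verify_face` and `locate_landmarks` are hypothetical stand-ins for the binary verification network and the key point regression described in this document.

```python
import cv2
import numpy as np

ENLARGE = 1.5  # step A1: enlarge the key point region by 1.5 times

def anti_shake(prev_frame, frame, landmarks, verify_face, locate_landmarks):
    # Step A1: build a face region picture around the previous key points.
    pts = np.asarray(landmarks, dtype=np.float32)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * ENLARGE, (y2 - y1) * ENLARGE
    x, y = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
    region = frame[y:y + int(h), x:x + int(w)]

    # Step A2: binary verification; on failure the caller re-runs the model.
    if not verify_face(region):
        return None

    # Step A3: KCF tracking from the previous frame into the current one.
    tracker = cv2.TrackerKCF_create()
    tracker.init(prev_frame, (x, y, int(w), int(h)))
    ok, box = tracker.update(frame)
    if not ok:
        return None
    return box, locate_landmarks(frame, box)
```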
A single-stage face detection and key point positioning system, which applies any one of the above single-stage face detection and key point positioning methods, as shown in fig. 4, specifically includes:
the data annotation module 1 is used for acquiring a plurality of face images and annotating each face image to obtain an annotated image with a real face frame and a real key point position;
the data training module 2 is connected with the data annotation module 1 and used for training according to the annotation image to obtain a face detection and key point positioning fusion model;
the model prediction module 3 is connected with the data training module 2 and is used for inputting the current frame of the face picture to be detected into the face detection and key point positioning fusion model to obtain and output the face detection frame and the face key point position corresponding to the current frame of the face picture to be detected;
the anti-shake processing module 4 is connected with the model prediction module 3 and is used for performing key point anti-shake processing on the next frame of the face picture to be detected according to the face detection frame and the face key point position, and for recording the total number of times the key point anti-shake processing has been performed;
the data comparison module 5 is connected with the anti-shake processing module 4 and is used for comparing the recorded total number of times with a preset count threshold, generating a first comparison result when the total number of times is not greater than the count threshold, and generating a second comparison result when the total number of times is greater than the count threshold;
the first processing module 6 is connected with the data comparison module 5 and is used for directly obtaining, according to the first comparison result and the key point anti-shake processing result, the next frame face detection frame and the next frame face key point position corresponding to the next frame of the face picture to be detected, and outputting them as the current frame face detection frame and the current frame face key point position;
and the second processing module 7 is connected with the data comparison module 5 and used for clearing the total number of times according to the second comparison result.
In a preferred embodiment of the present invention, the data training module 2 specifically includes:
the data prediction unit 21 is configured to input the annotation image into a pre-generated initial fusion model to obtain a corresponding face detection prediction result and a corresponding key point prediction result;
the face detection prediction result comprises a face classification prediction result, a face frame regression prediction result and a face frame proportion prediction result;
the first processing unit 22 is connected to the data prediction unit 21 and configured to calculate a first loss function between the face classification prediction result and the real face classification result included in the real face frame, a second loss function between the face frame regression prediction result and the real face region included in the real face frame, a third loss function between the face frame proportion prediction result and the preset proportion, and a fourth loss function between the key point prediction result and the real key point position, respectively;
the second processing unit 23 is connected to the first processing unit 22, and configured to perform weighted summation on the first loss function, the second loss function, the third loss function, and the fourth loss function to obtain a total loss function;
the data comparison unit 24 is connected to the second processing unit 23, and configured to compare the total loss function with a preset loss function threshold, generate a first comparison result when the total loss function is not less than the loss function threshold, and generate a second comparison result when the total loss function is less than the loss function threshold;
the third processing unit 25 is connected to the data comparison unit 24, and is configured to adjust the training parameters in the initial fusion model according to the first comparison result and the preset learning rate so as to continue a new training process;
and the fourth processing unit 26 is connected to the data comparison unit 24, and is configured to take the initial fusion model as the face detection and key point positioning fusion model according to the second comparison result and output it.
In a preferred embodiment of the present invention, the anti-shake processing module 4 specifically includes:
the image processing unit 41 is configured to enlarge the region of the corresponding face key points in the next frame of the face picture to be detected by a preset multiple according to the face key point positions, so as to obtain a face region picture;
the face verification unit 42 is connected to the image processing unit 41, and is configured to verify the face region picture according to a pre-generated face verification model, and output a corresponding face verification result when the verification result indicates that the face region picture is a face;
the face tracking unit 43 is connected to the face verification unit 42, and is configured to track the face by using a tracking algorithm to obtain the key point anti-shake processing result, which includes the face detection frame and the face key point position corresponding to the next frame of the face picture to be detected;
and the data recording unit 44 is connected to the face tracking unit 43 and is configured to record, according to the processing result, the total number of times the key point anti-shake processing has been performed.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.