A real-time multi-face detection and tracking method
Technical field
The invention belongs to the field of face detection and tracking, and in particular relates to a real-time multi-face detection and tracking method.
Background art
With the rapid development of science and technology, technologies based on computer vision are widely applied. Among them, face tracking technology is widely used in scenes such as video security, automatic access control, and shopping malls.
Face tracking technology mainly comprises face detection and face tracking. Face detection refers to locating faces in a picture. Face tracking refers to, given an initial face position, continuously predicting the face position in successive video frames. Mainstream face tracking methods fall roughly into three categories by principle: methods based on correlation filtering, methods based on deep learning, and methods based on optical flow.
Representative correlation-filtering trackers are KCF (Kernelized Correlation Filter) and SRDCF (Spatially Regularized Discriminative Correlation Filter). KCF obtains positive and negative samples using a circulant matrix and trains an object detector during tracking; the detector is used to judge whether the tracked target in the next frame is the real target, and the new detection result is then used to update the training set and hence the detector. The defect of this method is that when the object moves quickly, or when boundary effects or motion blur occur, the tracked target is lost. SRDCF addresses the boundary effect with multi-scale, larger detection regions, but it runs very slowly and cannot meet real-time requirements.
A representative deep-learning tracker is MDNet (Multi-Domain Convolutional Neural Network). The network consists of shared layers and multiple domain-specific branches, where each domain corresponds to an independent training sequence and each branch performs a binary classification to identify the target in its domain. The network is trained iteratively over the domains so that the shared layers learn generic target features. When tracking a target in a video sequence, the shared layers of the pre-trained CNN (Convolutional Neural Network) are combined with a new binary classification layer to form a new network, and online tracking is performed: candidate windows obtained by random sampling around the target of the previous frame are evaluated. The accuracy of target feature extraction is high, but because the network has many parameters, real-time tracking is very difficult to achieve on a CPU.
A representative optical-flow tracker is the LK (Lucas-Kanade) differential optical-flow estimation method, a gradient-based differential optical-flow computation method. This optical-flow method must satisfy three assumptions. Assumption 1, constant brightness: the brightness of the same point does not change over time. Assumption 2, small motion: changes over time do not cause drastic changes of position. Assumption 3, spatial consistency: neighbouring points in a scene project to neighbouring points in the image, and neighbouring points have consistent velocities. The optical-flow method can perform target tracking in scenes of arbitrary complexity, can complete target tracking accurately and quickly, and is comparatively well suited to terminals with low computing power.
To run on a CPU with low computing power (such as an i5-6200), the computation of the method must be small and the algorithm design cannot be too complex. Compared with correlation-filtering and deep-learning trackers, the advantage of optical-flow-based methods is that target tracking can be realized faster, with better robustness to face occlusion, complex facial poses and expressions, fast face movement, and complex tracking backgrounds, making them suitable for CPU processors with low computing power.
However, current optical-flow-based methods encounter the following problems when handling face detection and tracking:
1. Occlusion, including person-to-person occlusion and occlusion between people and objects, causes loss of face information, which directly leads to losing the tracked target and a drop in tracking accuracy.
2. Motion blur and boundary effects blur the face information; inaccurate feature extraction directly leads to loss of the tracked target.
3. Complex background environments: illumination conditions, colours, and objects vary widely, and sometimes the colour of the tracked target matches the background. All of these pose great challenges to the face tracking task.
4. Efficiency: existing multi-face tracking techniques struggle to meet real-time requirements, especially on CPU devices with low computing power.
Under conditions such as complex facial poses and expressions, face occlusion, complex external backgrounds, and changeable illumination, tracking accuracy easily decreases and real-time tracking cannot be achieved.
Summary of the invention
In view of the above drawbacks of the prior art, the object of the present invention is to provide a real-time multi-face detection and tracking method, to solve the problems that existing face recognition and tracking is not accurate enough and cannot achieve real-time tracking.
The technical solution adopted by the invention is as follows:
A real-time multi-face detection and tracking method, the method comprising:
obtaining the image of each video frame from an input video stream;
detecting face position coordinates in the acquired video frame with a face detection model, and storing the face position coordinates in a face position coordinate container;
a face tracking initialization operation: extracting the position coordinates of the target faces to be tracked from the face position coordinate container until it is empty, extracting target feature points using a spatial gradient matrix, and storing them in a feature point container for subsequent face tracking updates;
establishing an image pyramid model and predicting the position of the face target in the current video frame according to the model;
counting tracked frames: when the tracked frame count reaches a set tracking frame number threshold, face detection is performed again; otherwise, the distance between the centre point of the detected face position coordinate frame and the centre point of the face position coordinate frame predicted by the face tracking update is calculated; when the calculated distance is less than the set distance threshold, face tracking initialization is not needed; when the calculated distance is greater than the set distance threshold, face tracking initialization is needed; the final result is displayed and output.
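The detect / track / correct loop described above can be sketched as follows. This is a minimal illustrative skeleton, not the patented implementation: the function names `detect`, `track`, `run_pipeline`, and the box format `(x, y, w, h)` are assumptions introduced here, while the re-detection interval of 10 frames and the distance threshold of 15 come from the text.

```python
import math

DETECT_INTERVAL = 10   # re-detect every N tracked frames (threshold from the text)
DIST_THRESHOLD = 15.0  # re-initialise tracking when centres drift further than this

def center(box):
    """Centre point of an (x, y, w, h) face box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def run_pipeline(frames, detect, track):
    """Skeleton of the loop above.

    detect(frame) returns a list of (x, y, w, h) boxes;
    track(frame, boxes) returns the boxes predicted for this frame.
    Both are placeholders for the detection model and optical-flow tracker.
    """
    tracked = detect(frames[0])          # initial detection on the first frame
    frames_since_detect = 0
    results = [list(tracked)]
    for frame in frames[1:]:
        tracked = track(frame, tracked)  # optical-flow prediction
        frames_since_detect += 1
        if frames_since_detect >= DETECT_INTERVAL:
            detected = detect(frame)     # periodic re-detection
            frames_since_detect = 0
            # re-initialise only the boxes where detection and tracking disagree
            for i, (d, t) in enumerate(zip(detected, tracked)):
                if math.dist(center(d), center(t)) > DIST_THRESHOLD:
                    tracked[i] = d
        results.append(list(tracked))
    return results
```

A tracker that drifts by a few pixels per frame is pulled back to the detected position only at the periodic re-detection, which mirrors the correction condition of the method.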
Further, extracting the position coordinates of a target face comprises:
according to the formula G = Σ_{P∈A} [Ax², AxAy; AxAy, Ay²], calculating the spatial gradient matrix G of each pixel P in a given tracking target area A of the input image I, where Ax is the gradient of target area A in the x-axis direction and Ay is the gradient of target area A in the y-axis direction;
calculating the minimal eigenvalue λm of each G and storing the pixels P whose λm is greater than a given eigenvalue threshold λth; then judging whether pixel P is greater than the other pixels in its surrounding 3×3 neighbourhood; if it is greater, pixel P is retained and the maximum value λmax is found among all stored minimal eigenvalues λm; if it is less, it is not retained; the following operation is then performed:
calculating the distance between the retained pixels and comparing it with a distance threshold distance_th; pixels whose distance is greater than distance_th are retained; these retained pixels are the extracted feature points, used for subsequent face tracking and updating.
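The extraction rule above (minimal eigenvalue of the spatial gradient matrix, 3×3 non-maximum suppression, minimum spacing) can be sketched with numpy. This is an illustrative sketch, not the patented code: gradients are taken with `np.gradient`, the gradient matrix is summed over a 3×3 support, and the threshold values are placeholders.

```python
import numpy as np

def window_sum(m, r=1):
    """Sum m over a (2r+1) x (2r+1) window around each pixel (zero-padded)."""
    p = np.pad(m, r)
    out = np.zeros_like(m)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    return out

def extract_feature_points(patch, eig_threshold=0.01, min_distance=3):
    """Keep pixels whose gradient matrix G = [[sum Ax^2, sum AxAy],
    [sum AxAy, sum Ay^2]] has a large minimal eigenvalue, suppress
    non-maxima in a 3x3 neighbourhood, then enforce a minimum spacing."""
    Ay, Ax = np.gradient(patch.astype(np.float64))  # row (y) then column (x)
    a = window_sum(Ax * Ax)
    b = window_sum(Ax * Ay)
    c = window_sum(Ay * Ay)
    # minimal eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]
    lam_min = 0.5 * ((a + c) - np.sqrt((a - c) ** 2 + 4 * b * b))
    h, w = patch.shape
    candidates = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if lam_min[y, x] <= eig_threshold:
                continue
            if lam_min[y, x] < lam_min[y - 1:y + 2, x - 1:x + 2].max():
                continue  # not a 3x3 local maximum
            candidates.append((lam_min[y, x], x, y))
    candidates.sort(reverse=True)  # strongest corners first
    kept = []
    for _, x, y in candidates:
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_distance ** 2
               for kx, ky in kept):
            kept.append((x, y))
    return kept
```

On a patch containing a bright square, the retained points cluster at the corners, where the gradient matrix has two large eigenvalues.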
Further, face tracking according to the image pyramid model specifically comprises:
establishing a pyramid: I0 is defined as the bottom, i.e. 0th, layer image of the pyramid, with the highest resolution; L denotes the number of pyramid layers, L being a natural number greater than 1; IL denotes the Lth layer image;
feeding the optical-flow result of the pyramid top layer back to the next layer down: the initial optical-flow estimate of the top layer (layer L−1), g^{L−1}, is set to 0; with the optical-flow value computed at the top layer being d^{L−1}, the estimate passed down is g^{L−2} = 2(g^{L−1} + d^{L−1}) = 2(0 + d^{L−1}) = 2d^{L−1}; the feedback continues down the pyramid iteratively until the bottom layer is reached, giving the final original-image optical-flow value d = g^0 + d^0; the final optical-flow value is the superposition of the residual flow values of all layers, that is, d = Σ_{l=0}^{L−1} 2^l d^l;
given the target feature point position A(x, y) extracted in target area A of the previous frame image, calculating the current-frame target area B feature point position B(x+vx, y+vy), where vx, vy are the displacement components of the optical-flow value d in the x-axis and y-axis;
displaying the position of the tracked target face in the current frame image.
Further, after the position of the tracked target face is drawn in the current frame image, it is judged whether all feature points have been taken out of the face position coordinate container; if the container is not empty, extraction continues; if it is empty, the following operations are performed:
if the counted tracking frame number equals the set tracking frame number threshold, face detection is performed on the acquired image; if not, the following operations are performed:
according to the face position centre point f_track(x_t_center, y_t_center) predicted by tracking and the face position centre point f_detection(x_d_center, y_d_center) obtained by face detection, the distance l between the two points is calculated as l = √((x_t_center − x_d_center)² + (y_t_center − y_d_center)²);
a distance threshold l_th = 15 is set; when the distance l is greater than l_th, the detected face position and the tracked face position differ greatly, and the face tracking initialization operation should be re-performed according to the detected face position; when the distance l is less than or equal to l_th, the detected and tracked face positions differ little, face tracking initialization need not be re-performed, and the following operations are performed:
performing face detection and multi-face tracking display;
judging whether the video stream has ended, and exiting if it has.
Further, the face detection model comprises a first network module and a second network module; the first network module consists of 2 convolutional layers, 2 activation layers, 2 normalization layers, and 2 pooling layers, and the second network module consists of three Inception structures.
Further, before the face detection model is used, it is trained and tested.
Further, training the face detection model comprises:
obtaining a large number of face sample pictures under natural scenes, annotating face positions on the obtained pictures, and generating annotation documents in xml format;
cleaning the annotated face data, directly removing faces with a resolution below 20 × 20;
generating lmdb-format files directly from the cleaned data, for data reading in the deep learning framework caffe;
building the lightweight network model;
starting model training; the face prediction loss function uses the softmax loss function L(θ) = −(1/m) Σ_{i=1}^{m} [y_i log f(x_i, θ) + (1 − y_i) log(1 − f(x_i, θ))], where y_i denotes the annotated class of the i-th group of data: if the data is actually a face then y_i = 1, and if not then y_i = 0; f(x_i, θ) denotes the predicted probability of being a face, x_i denotes the input of the i-th group of data, θ denotes the learnable parameters, and m denotes the number of samples;
back-propagating using the stochastic gradient descent algorithm, iterating continuously so that the value of the loss function approaches 0 as closely as possible;
ending if the set number of iterations is reached, and continuing training otherwise.
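The loss above can be checked numerically. The following is a minimal numpy sketch of the binary cross-entropy loss together with a few plain gradient-descent steps on a synthetic linearly separable problem; the logistic model, the learning rate of 0.5, and the toy data are illustrative assumptions, not the patented network.

```python
import numpy as np

def face_loss(y, p):
    """Loss from the text: y = 1 for a face, 0 otherwise,
    p = f(x_i, theta) is the predicted face probability."""
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# A few gradient steps on a logistic model p = sigmoid(x . theta): the loss
# value should approach 0 as iterations proceed, as the text requires.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)   # linearly separable toy labels
theta = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    grad = x.T @ (p - y) / len(y)           # gradient of the loss above
    theta -= 0.5 * grad                     # learning rate 0.5 (illustrative)
p = 1.0 / (1.0 + np.exp(-x @ theta))
final_loss = face_loss(y, p)
```

At θ = 0 every prediction is 0.5 and the loss is log 2 ≈ 0.693; after the iterations it is well below that, illustrating the convergence criterion of the training step.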
Compared with the prior art, the real-time multi-face detection and tracking method disclosed in the present invention achieves the following technical effects:
1. The feature points extracted by the present invention represent the main features of the target to be tracked and generalize well even under complex external illumination, complex backgrounds, motion blur and boundary effects, and occlusion. That is, even if the environment is complex or the target is occluded over a small area, the extracted feature points can still characterize the target features and complete tracking.
2. The present invention sets a correction condition for target tracking. Under extreme conditions, for example when the video turns entirely white under very strong illumination or entirely black without light, or when the extracted feature points are completely occluded (note that the feature points are dispersed over the target area, so tracking can continue even if a small portion of them is occluded or missing), the tracked target may be lost, and face detection, face tracking initialization, and face tracking update must then be re-performed.
3. Accurate face detection is realized by the face detection model, and real-time tracking of dynamic faces is realized by tracking.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a model schematic of the Inception structure in the face detection model described in the embodiments of the present invention.
Fig. 2 is the network structure of the face detection model described in the embodiments of the present invention.
Fig. 3 is a schematic diagram of training the face detection model described in the embodiments of the present invention.
Fig. 4 is a schematic diagram of testing the face detection model described in the embodiments of the present invention.
Fig. 5 is a complete flow diagram of the real-time multi-face detection and tracking method described in the embodiments of the present invention.
Specific embodiment
In order to enable those skilled in the art to better understand the technical solution of the present invention, the present invention is further described in detail below with reference to the drawings and specific embodiments.
The present invention processes all faces detected in an input video frame and tracks them, including both tracking of a single face and tracking of multiple faces. The present invention is based on an optical-flow tracking method, which can realize target tracking faster, is more robust to face occlusion, complex facial poses and expressions, fast face movement, and complex tracking backgrounds, and is suitable for CPU processors with low computing power.
The multi-face detection and tracking method disclosed in the embodiments of the present invention is realized based on a face detection model and a face tracking model; face detection is realized using deep learning technology, and face tracking is realized using a tracking method based on optical flow.
Referring to Fig. 1 and Fig. 2, in this embodiment, the face detection model uses an end-to-end structure divided into two sub-modules, a first network module and a second network module. The first network module consists of 2 convolutional layers, 2 activation layers, 2 normalization layers, and 2 pooling layers; its main function is to rapidly extract the feature information of the input image. After a first round of convolution, activation, normalization, and pooling, and then a second round of convolution, activation, normalization, and pooling, the output is passed to the second network module. Since the first network module is simple, with few parameters and a small computation load, it can extract face feature information relatively quickly. The second network module uses 3 Inception structures. Fig. 1 shows the network structure of the Inception structure, which contains several different convolution branches and thus obtains multiple receptive fields, allowing faces of various scales to be detected better. After processing by the three Inception structures, faces of various scales in the input image can be detected more accurately.
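The branch-and-concatenate idea of the Inception structure can be illustrated at the shape level. The sketch below is an assumption-laden stand-in, not the patent's network: the branch widths are invented, and the convolution outputs are represented only by arrays of the correct shapes, to show that 'same'-padded 1×1, 3×3, and 5×5 branches preserve the spatial size and are joined along the channel axis.

```python
import numpy as np

def inception_block(x, ch1, ch3, ch5, chp):
    """Shape-level sketch of an Inception block: parallel 1x1, 3x3, and 5x5
    convolution branches plus a pooled branch, all 'same'-padded so the
    spatial size (h, w) is preserved, concatenated along the channel axis.
    Branch widths ch1/ch3/ch5/chp are illustrative, not from the patent."""
    n, c, h, w = x.shape
    # stand-ins for the branch outputs: only the shapes matter in this sketch
    b1 = np.zeros((n, ch1, h, w))   # 1x1 branch
    b3 = np.zeros((n, ch3, h, w))   # 3x3 branch (pad = 1)
    b5 = np.zeros((n, ch5, h, w))   # 5x5 branch (pad = 2)
    bp = np.zeros((n, chp, h, w))   # 3x3 max-pool followed by 1x1 branch
    return np.concatenate([b1, b3, b5, bp], axis=1)
```

The output channel count is simply the sum of the branch widths, which is why stacking several such blocks lets the detector mix receptive fields of different sizes.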
Through the face detection model, the position coordinates of the face frames and the picture are output; the face tracking module then finds the region of the face target to be tracked in the input image according to the coordinates.
In order to realize accurate detection with the face detection model, the model is trained and tested before use.
The training process of the face detection model is shown in Fig. 3 and specifically comprises the following steps:
1) A large number of face pictures under natural scenes are obtained, from web crawling, customer-provided data, or public data sets; face positions are annotated on the obtained pictures, and xml-format annotation documents are generated; execute 2).
2) The annotated face data is cleaned: faces with a resolution below 20 × 20 are directly removed and not used for training, to prevent the network model from failing to converge; execute 3).
3) lmdb-format files are generated directly from the cleaned data, for data reading in the deep learning framework caffe; execute 4).
4) The lightweight network model is built; execute 5).
5) Model training is started; the face prediction loss function uses the softmax loss, where y_i denotes the annotated class of the i-th group of data: if the data is actually a face then y_i = 1, otherwise y_i = 0; f(x_i, θ) denotes the predicted probability of being a face, x_i denotes the input of the i-th group of data, θ denotes the learnable parameters, and m denotes the number of samples. The formula is: L(θ) = −(1/m) Σ_{i=1}^{m} [y_i log f(x_i, θ) + (1 − y_i) log(1 − f(x_i, θ))].
Back-propagation uses the stochastic gradient descent algorithm, iterating continuously so that the loss value approaches 0 as closely as possible; execute 6).
6) If the set number of iterations is reached, training ends; if not, execute 5). The data format after training is completed is caffemodel, and the model storage path must be specified when it is called.
By training the model, faces can be detected and recognized more accurately and quickly.
After the face detection model training is finished, the model is tested. The test process is shown in Fig. 4:
1) Input the video stream; execute 2).
2) Obtain a video frame; execute 3).
3) Convert the picture format, converting the picture channel order from HWC to CHW; execute 4).
4) Feed the data into the face detection model; execute 5).
5) Output the result: the face position coordinates and face probability values in the current video frame.
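The channel conversion in step 3) is a single axis permutation. A minimal numpy illustration (CHW layout is assumed as the network's input format; the toy 2×3 image is invented for the example):

```python
import numpy as np

# Convert an HWC image of shape (h, w, c) to the CHW layout (c, h, w)
# expected by the detection network; np.transpose only permutes axes,
# it does not copy or reorder pixel values within a channel.
frame_hwc = np.arange(2 * 3 * 3).reshape(2, 3, 3)   # toy 2x3 image, 3 channels
frame_chw = np.transpose(frame_hwc, (2, 0, 1))
```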
After the face detection model passes testing, the face tracking process is entered. The input to the face tracking interface is the face frame coordinates output by the face detection model (top-left starting point (x, y), width, and height) and the picture; face tracking finds the target region to be tracked in the input image according to the coordinates. Pictures are obtained from the video stream; after a picture is obtained, face detection is performed first, then it is judged whether to perform face tracking initialization, then whether to perform a face tracking update, and finally the next frame is obtained or the process ends.
Before the model is formally used, the trained model is tested to improve the accuracy of model use.
Referring to Fig. 5, the flow of a complete face tracking is as follows:
1) The detection model is initialized; execute 2).
2) Input the video stream; execute 3).
3) Obtain a video frame image I from the input video stream; execute 4).
4) Perform face detection on the acquired frame image; execute 5).
5) If a face is detected, execute 6); if no face is detected, execute 22).
6) Store the face position coordinates in the face position coordinate container; execute 7).
7) If face tracking initialization is to be performed, execute 8); if not, execute 12).
8) Obtain one face position coordinate from the face position coordinate container; then execute 9).
9) Extract the target face feature points: compute the spatial gradient matrix G = Σ_{P∈A} [Ax², AxAy; AxAy, Ay²] of each pixel P in the given tracking target area A (i.e. the face position coordinates) of the input image I, where Ax is the gradient of target area A in the x-axis direction and Ay is the gradient of target area A in the y-axis direction.
Compute the minimal eigenvalue λm of each G and store the pixels P whose λm is greater than the given eigenvalue threshold λth; then judge whether pixel P is greater than the other pixels in its surrounding 3×3 neighbourhood; if it is greater, retain pixel P and find the maximum value λmax among all stored minimal eigenvalues λm; if it is less, do not retain it; execute 10). The feature points of the target face are extracted using the spatial gradient matrix method and stored in the feature point container after extraction, for the update process of face tracking.
10) Finally, compute the distance between the retained pixels and compare it with the distance threshold distance_th; retain the pixels whose distance is greater than distance_th; these retained pixels are the extracted feature points used for tracking; execute 11). Steps 9) and 10) realize the extraction of feature points; the extracted feature points can characterize the target face features and generalize well even under complex external illumination, complex backgrounds, motion blur and boundary effects, and small-area occlusion, so subsequent tracking can be completed.
11) If the face position coordinate container is empty, execute 12); if it is not empty, execute 8).
12) If a face tracking update is to be performed, execute 13); if not, execute 22) and go directly to video display.
13) Take the feature points of the initialized target face out of the feature point container; execute 14).
14) Count the tracking frame number, to record how many frames have been tracked; execute 15).
15) Image pyramid processing: establish a pyramid, defining I0 as the bottom, i.e. 0th, layer image of the pyramid, with the highest resolution; L denotes the number of pyramid layers and usually takes 2, 3, or 4; IL denotes the Lth layer image.
The optical-flow result (residual displacement) of the pyramid top layer (layer L−1) is fed back to the next layer down (layer L−2). The initial optical-flow estimate of the top layer, g^{L−1}, is set to 0; the optical-flow value computed at the top layer is d^{L−1}, so the estimate passed down is
g^{L−2} = 2(g^{L−1} + d^{L−1}) = 2(0 + d^{L−1}) = 2d^{L−1}.
The feedback continues down the pyramid iteratively until the bottom layer is reached.
The final original-image optical-flow value d is:
d = g^0 + d^0,
that is, the final optical-flow value is the superposition of the residual flow values of all layers: d = Σ_{l=0}^{L−1} 2^l d^l.
Given the target feature point position A(x, y) extracted in target area A of the previous frame image, compute the current-frame target area B feature point position B(x+vx, y+vy), where vx, vy are the displacement components of the optical-flow value d in the x-axis and y-axis; execute 16).
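The coarse-to-fine accumulation of step 15) can be checked numerically. The sketch below is a pure-Python illustration under the assumption that the per-level residual flows d^l have already been computed (level 0 is full resolution); it only demonstrates the recurrence g^{l−1} = 2(g^l + d^l) and the resulting superposition d = Σ 2^l d^l.

```python
def total_flow(per_level_d):
    """Coarse-to-fine accumulation from the text. per_level_d[l] is the
    residual flow d^l computed at pyramid level l (0 = full resolution).
    Starting from g = 0 at the top level, each level doubles the estimate
    passed down: g^{l-1} = 2 * (g^l + d^l); the result is g^0 + d^0."""
    g = (0.0, 0.0)                        # initial estimate at the top level
    for dl in reversed(per_level_d[1:]):  # walk from the top level down to level 1
        g = (2 * (g[0] + dl[0]), 2 * (g[1] + dl[1]))
    d0 = per_level_d[0]
    return (g[0] + d0[0], g[1] + d0[1])
```

With a residual of (1, 0) at each of three levels the result is (7, 0), i.e. 1 + 2 + 4, matching the superposition formula.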
16) Draw the tracked target position in the current frame image; execute 17).
17) Judge whether all feature points have been taken out of the container; if the feature point container is empty, execute 18); if it is not empty, execute 13).
18) If the counted tracking frame number equals the tracking frame number threshold Δframe = 10, execute 4) and 19); if not, execute 22).
19) Obtain the face detection position coordinates; execute 20).
20) According to the face position centre point f_track(x_t_center, y_t_center) predicted by tracking and the face position centre point f_detection(x_d_center, y_d_center) obtained by face detection, compute the distance l between the two points: l = √((x_t_center − x_d_center)² + (y_t_center − y_d_center)²).
21) Set the distance threshold l_th = 15. When the distance l is greater than l_th, the detected face position and the tracked face position differ greatly, and face tracking initialization should be re-performed according to the detected face position; execute 7). When the distance l is less than or equal to l_th, the detected and tracked face positions differ little, and face tracking initialization need not be re-performed; execute 22). Steps 20) and 21) set the correction condition: under extreme conditions, for example when the video turns entirely white under very strong illumination or entirely black without light, or when the extracted feature points are completely occluded (note that the feature points are dispersed over the target area, so tracking can continue even if a small portion of them is occluded or missing), the tracked target may be lost, and face detection, face tracking initialization, and face tracking update must then be re-performed.
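Steps 20) and 21) reduce to a single Euclidean-distance test. A minimal sketch (the function name `needs_reinit` is an illustrative label; the threshold l_th = 15 is from the text):

```python
import math

L_TH = 15  # distance threshold l_th from step 21)

def needs_reinit(track_center, detect_center, l_th=L_TH):
    """Return True when the tracked centre f_track and detected centre
    f_detection are more than l_th pixels apart, i.e. when face tracking
    initialization must be re-performed."""
    l = math.dist(track_center, detect_center)  # Euclidean distance
    return l > l_th
```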
22) Perform face detection and multi-face tracking display: the positions of the face frames are shown in the video frame. Whether the result comes from face detection or from multi-face tracking, it is a set of face frame position coordinates; finally, the face positions are drawn in the video frame as rectangles; execute 23).
23) Judge whether the video stream has ended; if it has ended, execute 24); if not, execute 3).
24) Exit the entire program.
Through the above steps, the present invention achieves tracking accuracy and the effect of real-time tracking even when facial poses and expressions are complex, faces are occluded, the external background is complex, and illumination conditions are changeable.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present invention, rather than limiting it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced, and that these modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.