Disclosure of Invention
The invention aims to provide an anti-occlusion moving target tracking device and method based on ROI prediction and multi-module learning, which have high accuracy, high real-time performance and high robustness.
The technical scheme for realizing the aim of the invention is that the anti-occlusion moving target tracking device based on ROI prediction and multi-module learning comprises a tracking module, a detection module, a learning module and a synthesis module, wherein:
The tracking module comprises a multi-feature extraction module and a correlation filtering module; a scale filter is added on the basis of the position filter to serve as the algorithm framework of the tracking module, a multi-feature extraction and fusion optimization framework is added, and PCA (principal component analysis) dimension reduction and QR decomposition are carried out on the position filter and the scale filter;
The detection module comprises an ROI prediction module and a cascade classification module, wherein the ROI prediction module performs position estimation by using square root volume Kalman filtering and obtains an ROI region from the estimated position, which serves as the input of the cascade classification module; a fHOG-SVM classifier is adopted in the cascade classification module;
The tracking module and the detection module work synchronously, their parameters are mutually corrected, learned and updated through the learning module, and the synthesis module obtains the final output position through coordinated work among the multiple modules, thereby realizing tracking of a single moving target.
An anti-occlusion moving target tracking method based on ROI prediction and multi-module learning comprises the following steps:
Step 1, multi-feature extraction, namely converting an input RGB three-channel image into a single-channel gray-level image, carrying out color space standardization on the image by adopting a gamma correction method, calculating image gradients including the gradient value and gradient direction of each pixel point, then constructing 9-dimensional HOG feature vectors, obtaining the 36-dimensional feature vector corresponding to each cell through normalization and truncation, extracting 31-dimensional features through PCA dimension reduction, combining the features of each cell to obtain fHOG features of MxNx31 dimensions from one MxN image, and splicing the fHOG features with the gray level features of MxNx1 to obtain fusion features of MxNx32 dimensions;
Step 2, correlation filter tracking, namely setting a position filter and a scale filter; for the position filter, firstly initializing the expected two-dimensional Gaussian output of the target, collecting a sample centered on the target position, reducing the fused feature from 32 dimensions to 18 dimensions by PCA dimension reduction, extracting the 18-dimensional feature for each pixel point of the sample, multiplying it by a two-dimensional Hamming window as the test input, and then determining the new target position by inverse Fourier transform;
Step 3, ROI region prediction, namely taking the position of the target in the image at the previous moment as an observation value, estimating the position of the target in the image at the next moment by utilizing the square root volume Kalman filtering algorithm, delimiting a region with the aspect ratio of the previous frame and four times its area, and sending the region to the detection module as the ROI region of the current frame;
Step 4, cascade classification, namely setting an image element variance classifier, a fHOG-SVM classifier and a nearest neighbor classifier; the ROI region serves as the input of the cascade classification module, namely the region to be detected; firstly a plurality of sliding windows to be detected with different scales are generated in the region to be detected and sent into the image element variance classifier, which calculates the pixel gray variance of each window to be detected against the target frame image and regards a test sample whose variance is smaller than half of the target sample variance as a negative sample; then the positive samples serve as the input of the fHOG-SVM classifier, fHOG features are extracted and sent to the SVM classifier to obtain positive and negative sample classification results; finally the positive sample windows obtained by the first two classifiers serve as the input of the nearest neighbor classifier, the similarity of each window with the online model is matched in sequence and the positive sample space of the online model is updated, thereby obtaining the final positive samples of the cascade classification module, namely the output of the detection module;
Step 5, learning and updating, namely adopting P-N learning: firstly, the tracking module predicts the target position of the current frame; if the predicted position is detected as a negative sample by the detection module, the P expert corrects this sample, incorrectly classified as negative, into a positive sample and sends it into the training set; then the N expert compares the positive samples generated by the detection module with the positive sample obtained by the P expert and selects the most reliable sample as the output position;
Step 6, multi-module synthesis, namely obtaining the final output position through coordinated work among the multiple modules, thereby realizing tracking of a single moving target.
Compared with the prior art, the method has the following remarkable advantages: (1) a scale filter is added on the basis of a position filter to serve as the algorithm framework of the tracking module; tracking robustness is improved by adding a multi-feature extraction and fusion optimization framework, and PCA (principal component analysis) dimension reduction and QR (orthogonal triangular) decomposition are carried out on the position filter and the scale filter to reduce the calculation amount and improve real-time performance; (2) in the detection module, position estimation is carried out by utilizing square root volume Kalman filtering, which improves detection robustness when the target is occluded; an ROI (region of interest) is obtained from the estimated position and used as the input of the cascade classifier, which narrows the search range, reduces the calculation amount and improves real-time performance; in the cascade classifier, a fHOG-SVM classifier is adopted to replace the original random fern classifier; fHOG features keep good invariance to image scale and illumination changes and, relative to HOG features, reduce the calculation amount and increase algorithm speed, so combining fHOG with an SVM improves detection accuracy and real-time performance; (3) the tracking module and the detection module work independently and synchronously, their parameters are mutually learned and updated through the learning module, and the final position is jointly estimated from both outputs, so that even after the target is lost or occluded it can be re-detected and tracking can be resumed by reinitializing the tracking module from the detection result.
Detailed Description
The invention discloses an anti-occlusion moving target tracking device based on ROI prediction and multi-module learning, which comprises a tracking module, a detection module, a learning module and a synthesis module, wherein:
The tracking module comprises a multi-feature extraction module and a correlation filtering module; a scale filter is added on the basis of the position filter to serve as the algorithm framework of the tracking module, a multi-feature extraction and fusion optimization framework is added, and PCA (principal component analysis) dimension reduction and QR decomposition are carried out on the position filter and the scale filter;
The detection module comprises an ROI prediction module and a cascade classification module, wherein the ROI prediction module performs position estimation by using square root volume Kalman filtering and obtains an ROI region from the estimated position, which serves as the input of the cascade classification module; a fHOG-SVM classifier is adopted in the cascade classification module;
The tracking module and the detection module work synchronously, their parameters are mutually corrected, learned and updated through the learning module, and the synthesis module obtains the final output position through coordinated work among the multiple modules, thereby realizing tracking of a single moving target.
As a specific example, the multi-feature extraction module stitches and fuses the gray feature of MxNx1 with the fast histogram of oriented gradients (fHOG) feature of MxNx31, where M and N are positive integers. The 31-dimensional fHOG feature is obtained by normalization and truncation, which yields a 36-dimensional feature vector for each cell, followed by principal component analysis (PCA) dimension reduction; the feature comprises 18 dimensions of signed fHOG gradients, 9 dimensions of unsigned fHOG gradients, and 4 dimensions derived from the normalization operations of the current cell with its 4 diagonally neighboring cells.
As a specific example, the basic framework of the correlation filtering module is the discriminative scale space tracking algorithm framework: the position filter and the scale filter are used to sequentially perform target positioning and scale evaluation, and principal component analysis (PCA) and orthogonal triangular (QR) decomposition are performed on the position filter and the scale filter respectively to optimize the algorithm framework.
As a specific example, the ROI prediction module performs ROI region estimation by square root volume Kalman filtering: with the predicted target position of the current frame as the center, a region is delimited with the aspect ratio of the previous frame and four times its area, and this region is sent to the cascade classification module as the ROI region of the current frame.
As a specific example, the cascade classification module includes an image element variance classifier, a fHOG-SVM classifier, and a nearest neighbor classifier, which are specifically as follows:
The ROI prediction module predicts the region where the current frame target is most likely to appear, and this region serves as the input of the cascade classification module, namely the region to be detected. Firstly, a plurality of sliding windows to be detected with different scales are generated in the region to be detected and sent into the image element variance classifier, which calculates the pixel gray variance of each window to be detected against the target frame image; a test sample whose variance is smaller than half of the target sample variance is regarded as a negative sample. Then the positive samples serve as the input of the fHOG-SVM classifier, fHOG features are extracted and sent into the SVM classifier to obtain positive and negative sample classification results. Finally, the positive sample windows obtained by the first two classifiers serve as the input of the nearest neighbor classifier, the similarity of each window with the online model is matched in sequence and the positive sample space of the online model is updated, thereby obtaining the final positive samples of the cascade classification module, namely the output of the detection module.
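As an illustration of the first cascade stage, the variance test described above can be sketched in Python (the function names `patch_variance` and `variance_filter` are illustrative, not part of the device; patches are flattened lists of gray values):

```python
import statistics

def patch_variance(patch):
    """Gray-level variance of a flattened image patch (list of pixel values)."""
    return statistics.pvariance(patch)

def variance_filter(windows, target_patch):
    """First cascade stage: a window whose gray variance is below half the
    target patch's variance is treated as a negative sample and rejected;
    surviving windows are passed on to the fHOG-SVM stage."""
    threshold = 0.5 * patch_variance(target_patch)
    return [w for w in windows if patch_variance(w) >= threshold]
```

A flat, texture-free window is rejected immediately, while a window with contrast comparable to the target survives to the next stage.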
The invention discloses an anti-shielding moving target tracking method based on ROI prediction and multi-module learning, which comprises the following steps:
Converting the gray level of an input RGB three-channel image into a single-channel image, carrying out color space standardization on the image by adopting a gamma correction method, calculating image gradients including gradient values and gradient directions of each pixel point, then constructing 9-dimensional HOG feature vectors, obtaining 36-dimensional feature vectors corresponding to each cell through normalization and truncation, extracting 31-dimensional features through PCA dimension reduction, combining the features of each cell, obtaining fHOG features of MxNx31 dimensions from one MxN image, and splicing the fHOG features with the gray level features of MxNx1 to obtain fusion features of MxNx32 dimensions;
Setting a position filter and a scale filter, firstly initializing the expected two-dimensional Gaussian output of a target aiming at the position filter, collecting a sample by taking the target position as the center, reducing the fused characteristic from 32 dimensions to 18 dimensions by using PCA dimension reduction, extracting the 18-dimensional characteristic for each pixel point of the sample, multiplying the 18-dimensional characteristic by a two-dimensional Hamming window as a test input, and then determining a new target position by using Fourier inverse transformation;
Step 3, predicting an area of the ROI, namely taking the position of the target in the image at the previous moment as an observation value, estimating the position of the target in the image at the next moment by utilizing a square root volume Kalman filtering algorithm, dividing the area by the length-width ratio and four times of the area of the previous frame, and taking the area as the ROI area of the current frame to send into a detection module;
Step 4, cascade classification, namely setting an image element variance classifier, a fHOG-SVM classifier and a nearest neighbor classifier, taking the ROI area as an input of a cascade classification module, namely a region to be detected, firstly generating a plurality of sliding windows to be detected with different scales in the region to be detected, sending the sliding windows to be detected into the image element variance classifier, calculating pixel gray variances of the window to be detected and a target frame image, considering a test sample with the variance smaller than half of the target sample variance as a negative sample, then taking a positive sample as an input of the fHOG-SVM classifier, extracting fHOG characteristics, sending the positive sample to the SVM classifier to obtain a positive and negative sample class result, finally taking positive sample windows obtained by the first two classifiers as an input of the nearest neighbor classifier, sequentially matching the similarity of each window and the online model, and updating a positive sample space of the online model, thereby obtaining a final positive sample of the cascade classification module, namely an output of the detection module;
In the P-N learning, firstly, predicting the target position of the current frame by using a tracking module, correcting a positive sample which is incorrectly divided into negative samples by a P expert into a positive sample if the predicted position is detected as the negative sample by the detection module, and sending the positive sample into a training set;
And 6, integrating multiple modules, namely obtaining a final output position through coordination work among the multiple modules, and realizing tracking of a single moving target.
As a specific example, in step 2, for a position filter, firstly, initializing an expected two-dimensional gaussian output of a target, taking a sample with the target position as the center, reducing the fused feature from 32 dimensions to 18 dimensions by using PCA dimension reduction, extracting the 18-dimensional feature for each pixel point of the sample, multiplying the feature by a two-dimensional hamming window as a test input, and then determining a new position of the target by using inverse fourier transform, specifically as follows:
Firstly, the target sample selected in the initial image is set as a positive sample f, and a two-dimensional Gaussian function is selected as the desired output g, so that the following cost is minimized:

ε = || Σ_{l=1..d} h^l * f^l − g ||^2 + λ Σ_{l=1..d} || h^l ||^2

Wherein * denotes circular convolution, f^l represents the feature of the l-th channel, h^l represents the filter of the l-th channel, l ∈ {1, 2, ..., d}, d is the dimension of the selected feature, and λ is a regularization parameter;
The above expression is converted into the complex frequency domain and solved using Parseval's formula:

H^l = conj(G) · F^l / ( Σ_{k=1..d} conj(F^k) · F^k + λ )

Wherein H^l, F^l, G are the corresponding variables obtained from h^l, f^l, g by the discrete Fourier transform DFT, and conj(G) is the complex conjugate of G;
The filter parameters are updated with a new training sample ft:

A_t^l = (1 − η) A_{t−1}^l + η · conj(G_t) · F_t^l
B_t = (1 − η) B_{t−1} + η · Σ_{k=1..d} conj(F_t^k) · F_t^k

Wherein A_t^l and B_t are the numerator and denominator of the filter corresponding to the current training sample ft, A_{t−1}^l and B_{t−1} are the filter numerator and denominator of the previous frame, and η is the learning rate;
If zt is an image sample and Z_t^l is the variable obtained from it by the discrete Fourier transform, the output yt is:

y_t = DFT^{−1} { Σ_{l=1..d} conj(A_{t−1}^l) · Z_t^l / ( B_{t−1} + λ ) }

Wherein A_{t−1}^l and B_{t−1} are the numerator and denominator of the filter in the previous frame, yt is the correlation score, and the state estimate of the current target position is obtained by searching for the maximum correlation score.
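The position-filter equations above can be illustrated with a minimal single-channel, one-dimensional sketch in Python (the actual device uses multi-channel fused fHOG features, PCA and two-dimensional FFTs; the naive DFT below is for clarity only, and all function names are illustrative):

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def train_filter(f, g):
    """Numerator A = conj(G)*F and denominator B = conj(F)*F of the filter,
    from a training sample f and the desired Gaussian output g."""
    F, G = dft(f), dft(g)
    A = [G[k].conjugate() * F[k] for k in range(len(f))]
    B = [(F[k].conjugate() * F[k]).real for k in range(len(f))]
    return A, B

def detect(A, B, z, lam=1e-4):
    """Response y = IDFT( conj(A)*Z / (B + lam) ); the argmax of the real
    correlation score gives the new target position."""
    Z = dft(z)
    Y = [A[k].conjugate() * Z[k] / (B[k] + lam) for k in range(len(z))]
    y = [v.real for v in idft(Y)]
    return max(range(len(y)), key=y.__getitem__)
```

If the test sample is a circular shift of the training sample, the peak of the response moves by exactly that shift, which is the mechanism used for target localization.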
As a specific example, in step 3, the position of the target in the image at the previous moment is taken as the observation value, the position of the target in the image at the next moment is estimated by the square root volume Kalman filter algorithm, a region is delimited with the aspect ratio of the previous frame and four times its area, and the region is sent to the detection module as the ROI region of the current frame, specifically as follows:
For a discrete nonlinear dynamic target tracking system with additive noise:

x_t = f(x_{t−1}) + w_{t−1}
y_t = h(x_t) + v_t

Wherein x_t and y_t represent the state and the measurement of the system at time t, f(·) and h(·) are the nonlinear state transition function and the nonlinear measurement function respectively, the process noise w_{t−1} and the measurement noise v_t are mutually independent, and w_{t−1} ~ N(0, Q_t), v_t ~ N(0, R_t);
The state estimation includes a time update and a measurement update. When a tracking failure is detected, the state parameters x_{t−1} and S_{t−1} of the last successful frame are used to initialize the filter, and then the filter gain K_t, the new state estimate x̂_t and the square-root factor S_t of the error covariance are calculated by:

K_t = P_{xy,t} ( S_{yy,t} S_{yy,t}^T )^{−1}
x̂_t = x̂_{t|t−1} + K_t ( y_t − ŷ_{t|t−1} )
S_t = Tria( [ χ_t − K_t γ_t,  K_t S_{R,t} ] )

Wherein P_{xy,t} is the cross-covariance matrix of the measurement prediction, S_{yy,t} is the square root of the auto-covariance matrix of the measurement prediction, x̂_{t|t−1} is the system state prediction, ŷ_{t|t−1} is the measurement prediction, χ_t and γ_t are the centered weight matrices, S_{R,t} is the square root of the covariance matrix of the measurement noise, and Tria(·) denotes triangularization, e.g. by QR decomposition;
Taking the position v = (i, j) of the target in the image at the previous moment as the observation value, the position of the target in the image at the next moment is estimated; a region is then delimited with the aspect ratio of the previous frame and four times its area, and this region is sent to the detection module as the ROI region of the current frame.
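As a simplified stand-in for the square root volume Kalman filter (the square-root covariance propagation is omitted here), the predict-then-ROI idea can be sketched with a linear constant-velocity Kalman filter on one image axis, followed by the four-times-area ROI construction; all names and the noise parameters are illustrative assumptions:

```python
def mat2_mul(A, B):
    """2x2 matrix product."""
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def mat2_t(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

F = [[1.0, 1.0], [0.0, 1.0]]          # constant-velocity transition, dt = 1

def predict(x, P, q=1e-4):
    """Time update: state x = [position, velocity], covariance P."""
    x_pred = [x[0] + x[1], x[1]]
    P_pred = mat2_mul(mat2_mul(F, P), mat2_t(F))
    P_pred[0][0] += q; P_pred[1][1] += q   # process noise
    return x_pred, P_pred

def update(x, P, z, r=1e-2):
    """Measurement update with scalar position measurement z (H = [1, 0])."""
    s = P[0][0] + r                        # innovation covariance
    K = [P[0][0] / s, P[1][0] / s]         # Kalman gain
    innov = z - x[0]
    x_new = [x[0] + K[0]*innov, x[1] + K[1]*innov]
    P_new = [[(1 - K[0])*P[0][0], (1 - K[0])*P[0][1]],
             [P[1][0] - K[1]*P[0][0], P[1][1] - K[1]*P[0][1]]]  # (I - K H) P
    return x_new, P_new

def roi_box(cx, cy, w, h):
    """ROI centered on the predicted position, same aspect ratio as the
    previous frame and four times its area (width and height doubled)."""
    return (cx - w, cy - h, 2*w, 2*h)
```

Feeding in measurements along a constant-velocity trajectory, the predicted position converges to the true next position, and the ROI is then delimited around it.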
As a specific example, in step 4, the SVM solves the nonlinear problem by using a kernel function; its main idea is to create a hyperplane in the feature space as the decision surface, so that the isolation margin between positive samples and negative samples is maximized and the two classes are separated. Assume the hyperplane is:
wx+b=0
Wherein w represents the normal vector, which determines the direction of the hyperplane, and b represents the offset, which determines the distance between the hyperplane and the origin;
Given the training sample set train = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, x_i ∈ R^n, y_i ∈ {+1, −1}, where i denotes the i-th sample and n the sample capacity, the classification surface must satisfy y_i (w · x_i + b) ≥ 1, i = 1, 2, ..., n, and the optimal hyperplane problem translates into:

min_{w,b} (1/2) ||w||^2  subject to  y_i (w · x_i + b) ≥ 1, i = 1, 2, ..., n
A Lagrangian function is introduced:

L(w, b, α) = (1/2) ||w||^2 − Σ_{i=1..n} α_i [ y_i (w · x_i + b) − 1 ]
The optimal solution should satisfy ∂L/∂w = 0 and ∂L/∂b = 0, yielding the optimal Lagrange multipliers α_i*, the optimal weight normal vector w* and the optimal offset b*:

w* = Σ_{i=1..n} α_i* y_i x_i
b* = y_j − w* · x_j  for any support vector x_j
So the optimal hyperplane is w*x + b* = 0, and the optimal classification function is f(x) = sgn{ w*x + b* };
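The optimal classification function can be illustrated on a hand-solved toy problem: for support vectors (1, 1) labeled +1 and (−1, −1) labeled −1, the KKT conditions give α* = (0.25, 0.25), w* = (0.5, 0.5) and b* = 0. This is a sketch with a linear kernel and assumed values, not the fHOG-SVM classifier itself:

```python
def svm_decision(support_vectors, alphas, labels, b, x):
    """f(x) = sgn( sum_i alpha_i * y_i * <x_i, x> + b ), linear kernel."""
    s = sum(a * y * sum(si * xi for si, xi in zip(sv, x))
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1

# Hand-solved toy problem: w* = 0.25*(1,1) - 0.25*(-1,-1)*(-1) = (0.5, 0.5), b* = 0
svs, alphas, labels, b = [(1, 1), (-1, -1)], [0.25, 0.25], [1, -1], 0.0
```

Points on the positive side of w*·x + b* = 0 are classified +1, the rest −1.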
The similarity between the positive samples output by the fHOG-SVM classifier and the online model is then compared to classify the samples and to update the positive sample space of the online model; the similarity calculation is specifically as follows:
S_r = S+ / ( S+ + S− )

Wherein S_r is the correlation similarity, S+ is the positive similarity and S− is the negative similarity, defined as follows:

S+ = max_{p_i+ ∈ M} S(p, p_i+),  S− = max_{p_i− ∈ M} S(p, p_i−)

Wherein M represents the target model of the sample library, p_i+ represents a positive sample, p_i− represents a negative sample, and p represents the sample to be tested;
The calculation formula of S is as follows:
S(p_i, p_j) = 0.5 ( NCC(p_i, p_j) + 1 )
Wherein NCC is defined as follows:

NCC(p_i, p_j) = (1/n) Σ_{k=1..n} ( p_i(k) − μ_i ) ( p_j(k) − μ_j ) / ( σ_i σ_j )

Wherein μ_i, σ_i are the mean and standard deviation of image block p_i, μ_j, σ_j are the mean and standard deviation of image block p_j, and n is the number of pixels in a block;
Finally, the calculated S_r values are compared; the larger S_r is, the greater the possibility that the sample is the target. A threshold γ is set, and samples with S_r > γ are regarded as positive samples; otherwise they are negative samples and are discarded. Meanwhile, new positive samples are added to the positive sample library of the online model for subsequent matching; the size of this library is fixed, so new samples are simply added while the library is not full, and once the upper limit is exceeded some old samples are randomly deleted before new samples are added.
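The nearest-neighbor similarity computation can be sketched as follows (a minimal Python version of S, NCC and S_r on flattened gray patches; the threshold value γ = 0.6 and all function names are assumptions):

```python
import math

def ncc(p, q):
    """Normalized cross-correlation between two flattened image blocks."""
    n = len(p)
    mu_p, mu_q = sum(p) / n, sum(q) / n
    sd_p = math.sqrt(sum((v - mu_p) ** 2 for v in p) / n)
    sd_q = math.sqrt(sum((v - mu_q) ** 2 for v in q) / n)
    return sum((p[i] - mu_p) * (q[i] - mu_q) for i in range(n)) / (n * sd_p * sd_q)

def similarity(p, q):
    """S(p, q) = 0.5 * (NCC(p, q) + 1), mapped into [0, 1]."""
    return 0.5 * (ncc(p, q) + 1.0)

def relative_similarity(p, positives, negatives):
    """S_r = S+ / (S+ + S-), with S+ and S- the best matches against the
    positive and negative samples of the online model."""
    s_pos = max(similarity(p, m) for m in positives)
    s_neg = max(similarity(p, m) for m in negatives)
    return s_pos / (s_pos + s_neg)

def nn_classify(p, positives, negatives, gamma=0.6):
    """Sample is positive when S_r exceeds the threshold gamma (assumed value)."""
    return relative_similarity(p, positives, negatives) > gamma
```

A patch that correlates perfectly with a stored positive sample and anti-correlates with the negatives gets S_r close to 1 and is accepted.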
As a specific example, the step 6 is specifically as follows:
The synthesis module obtains the final output position through coordination among the multiple modules, and four cooperation modes are distinguished according to the operation results of the detection module and the tracking module:
(1) Tracking success, detection success
Successful detection means that at least one sliding window passes the detection module and, after the passing windows are clustered, the final clustering result has only one cluster center; failed detection means that no sliding window passes the detection module, or that several windows pass but the clustering result has multiple cluster centers. Successful tracking means that the tracking module outputs a feature rectangular frame; failed tracking means that no rectangular frame is output;
If tracking succeeds and detection succeeds, the detection results are clustered to obtain the relevant output, and the overlap rate and credibility between the cluster center and the tracking module output are judged: if the overlap rate is lower than the threshold 0.5 and the credibility of the detection module is high, the detection module corrects the result of the tracking module; if the overlap rate is higher than the threshold 0.5, the weighted average of the detection module and tracking module results is used as the final output;
(2) Tracking success, detection failure
If tracking is successful but detection fails, directly taking the output of the tracking module as the final output of the current frame;
(3) Tracking failure, detection success
If tracking fails but detection succeeds, the output sample frames of the detection module are clustered; if the final clustering result has only one cluster center, the clustering result is used as the final output and the tracking module is reinitialized with it, namely the re-detection process re-entered after the target disappears;
(4) Tracking failure, detection failure
If both the tracking module and the detection module fail, the result of the current frame is considered invalid and is discarded.
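The four cooperation modes can be summarized as a decision function (a sketch: the clustering, overlap computation and credibility estimate are supplied by the caller, and the behaviour in the low-overlap, low-credibility case, which the text does not specify, is an assumption):

```python
def integrate(track_box, det_clusters, det_confident, overlap, threshold=0.5):
    """Synthesis module: combine tracker and detector outputs.
    track_box: tracker box or None; det_clusters: list of cluster-center
    boxes from the detector; overlap: function returning the overlap rate."""
    track_ok = track_box is not None
    det_ok = det_clusters is not None and len(det_clusters) == 1
    if track_ok and det_ok:                       # (1) both succeed
        det_box = det_clusters[0]
        if overlap(track_box, det_box) < threshold:
            # Low overlap: trust the detector only when it is credible.
            # (Low-overlap, low-credibility case is an assumption: keep tracker.)
            return det_box if det_confident else track_box
        return tuple(0.5 * (t + d) for t, d in zip(track_box, det_box))
    if track_ok:                                  # (2) tracking only
        return track_box
    if det_ok:                                    # (3) detection only: re-detect
        return det_clusters[0]
    return None                                   # (4) both fail: discard
```

With a high overlap the two boxes are averaged; with only one module succeeding, its output is passed through; with both failing, the frame yields no output.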
The invention is described in further detail below with reference to the accompanying drawings and specific examples.
Examples
Referring to fig. 1, the anti-occlusion moving target tracking method based on ROI prediction and multi-module learning of the present invention involves a feature extraction module, a correlation filtering tracking module, an ROI prediction module, a cascade detection module, a learning module and a synthesis module; the concrete flow of the algorithm is as follows:
Step one, multi-feature extraction
Converting an input RGB three-channel image into a single-channel gray-level image, carrying out color space standardization on the image by adopting a gamma correction method, and calculating image gradients including the gradient value and gradient direction of each pixel point; then constructing a 9-dimensional HOG feature vector. The 36-dimensional feature vector corresponding to each cell is obtained through normalization and truncation, and 31-dimensional features are extracted through PCA dimension reduction. Combining the features of each cell, fHOG features of MxNx31 dimensions are obtained from one MxN image and spliced with the MxNx1 gray level features to obtain MxNx32-dimensional fusion features. Through multi-feature fusion, robustness to illumination and rapid appearance changes is improved.
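The gamma correction and per-pixel gradient computation of step one can be sketched as follows (gray images are lists of rows; the value γ = 0.5 and the edge replication at image borders are assumptions for illustration):

```python
import math

def gamma_correct(gray, gamma=0.5):
    """Color-space standardization: each pixel is normalized to [0, 1] and
    raised to the power gamma."""
    return [[(v / 255.0) ** gamma for v in row] for row in gray]

def gradients(img):
    """Per-pixel gradient magnitude and unsigned orientation (0-180 degrees,
    matching a 9-bin HOG) using centered [-1, 0, 1] differences with edge
    replication at the borders."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang
```

The orientation map would then be binned into the 9 HOG directions per cell before normalization and truncation.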
Step two, relevant filtering tracking
The correlation filter tracking module comprises two parts, namely a position filter and a scale filter. Firstly, the expected two-dimensional Gaussian output of the target is initialized and a sample is collected with the target position as the center; to reduce the calculation amount and improve the running speed, the fused feature is reduced from 32 dimensions to 18 dimensions by PCA dimension reduction. The 18-dimensional feature is extracted for each pixel point of the sample and multiplied by a two-dimensional Hamming window as the test input, and then yt is obtained by inverse Fourier transform; the maximum of yt gives the new position of the target. The scale filter adopts the same design method as the position filter: firstly, the expected one-dimensional Gaussian output of the scale filter is initialized, samples at different scales are extracted with the target position as the center, each sample is passed through a one-dimensional Hamming window and serves as a test input, then yt is obtained by inverse Fourier transform, and the maximum of yt gives the new target scale.
Step three, ROI prediction region
The detection module would otherwise need to generate multi-scale sliding windows as samples over the whole image, which greatly increases the calculation amount and reduces algorithm speed; by predicting the target position at the next moment, the ROI region, namely the region of interest, is determined, which narrows the search range, reduces the detection samples and improves real-time performance. Taking the position v = (i, j) of the target in the image at the previous moment as the observation value, the position of the target in the image at the next moment is estimated by the square root volume Kalman filter algorithm, a region is delimited with the aspect ratio of the previous frame and four times its area, and the region is sent to the detection module as the ROI region of the current frame.
Step four, cascade detection
The cascade detection module comprises an image element variance classifier, a fHOG-SVM classifier and a nearest neighbor classifier.
The ROI prediction module predicts the region where the current frame target is most likely to appear, and this region serves as the input of the cascade classifier, namely the region to be detected. Firstly, samples to be detected are obtained in the region to be detected by multi-scale displacement and sent into the image element variance classifier; the pixel gray variance of each window to be detected and of the target frame image is calculated, and a sample to be detected whose variance is smaller than half the variance of the target frame image is regarded as a negative sample. Variance filtering can reduce the number of scanning windows in the input region by about half.
Then, positive samples obtained by the image element variance classifier are used as input of the fHOG-SVM classifier, fHOG features are extracted, and positive and negative sample class results are obtained by sending the positive and negative samples to the SVM classifier. The SVM solves the problem of nonlinearity using a kernel function, the main idea being to create a hyperplane in the feature space as a decision surface, so that the isolation edge between positive and negative samples is maximized, separating the positive and negative samples.
The similarity between the positive samples output by the fHOG-SVM classifier and the online model is compared to classify the samples and update the positive sample space of the online model. The similarity calculation is specifically as follows:
S_r = S+ / ( S+ + S− )

Wherein S_r is the correlation similarity, S+ is the positive similarity and S− is the negative similarity, defined as follows:

S+ = max_{p_i+ ∈ M} S(p, p_i+),  S− = max_{p_i− ∈ M} S(p, p_i−)

Wherein M represents the target model of the sample library, p_i+ represents a positive sample, p_i− represents a negative sample, and p represents the sample to be tested. The calculation formula of S is as follows:
S(p_i, p_j) = 0.5 ( NCC(p_i, p_j) + 1 )
Wherein NCC is defined as follows:

NCC(p_i, p_j) = (1/n) Σ_{k=1..n} ( p_i(k) − μ_i ) ( p_j(k) − μ_j ) / ( σ_i σ_j )

Wherein μ_i, σ_i are the mean and standard deviation of image block p_i, μ_j, σ_j are the mean and standard deviation of image block p_j, and n is the number of pixels in a block.
Finally, the calculated S_r values are compared; the larger S_r is, the greater the possibility that the sample is the target. A threshold γ is set, and a sample with S_r > γ is considered positive; otherwise it is negative and is discarded. Meanwhile, new positive samples are added to the positive sample library of the online model for subsequent matching; the size of this library is fixed, so new samples are simply added while it is not full, and once the upper limit is exceeded some old samples are randomly deleted before new samples are added.
Step five, learning and updating
The algorithm uses a P-N learning mode and optimizes the performance of the classifiers in the detection module through online learning, thereby improving their generalization ability. In P-N learning, the tracking module first predicts the target position of the current frame; if the predicted position is detected as a negative sample by the detection module, the P expert corrects this sample, incorrectly classified as negative, into a positive sample and sends it into the training set. Then the N expert compares the positive samples generated by the detection module with the positive sample obtained by the P expert and selects the most reliable sample as the output position.
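The P-expert correction described above can be sketched as a single update step (a minimal illustration; the box representation and the label convention, +1 positive and −1 negative, are assumptions):

```python
def pn_learning_step(tracker_box, detector_label, training_set):
    """One P-expert correction: the tracker's prediction is assumed to lie on
    a continuous trajectory, so if the detector labels that position as a
    negative sample (-1), the P expert relabels it positive (+1) and adds
    it to the training set."""
    if tracker_box is not None and detector_label == -1:
        training_set.append((tracker_box, +1))   # correct the false negative
    return training_set
```

When the detector already agrees with the tracker, no correction is made; the N expert (not shown) would then pick the most reliable of the candidate positives as the output position.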
Step six, multi-module synthesis
The synthesis module obtains the final output position through coordinated work among the multiple modules; according to the operation results of the detection module and the tracking module, four cooperation modes can be distinguished:
(1) Tracking success, detection success
If tracking succeeds and detection succeeds, the detection results are clustered to obtain the relevant output, and the overlap rate and credibility between the cluster center and the tracking module output are judged: if the overlap rate is low and the credibility of the detection module is high, the detection module corrects the result of the tracking module; if the overlap rates are close, the weighted average of the detection module and tracking module results is used as the final output.
(2) Tracking success, detection failure
If tracking is successful but detection fails, the output of the tracking module is directly taken as the final output of the current frame.
(3) Tracking failure, detection success
If tracking fails but detection succeeds, the output sample frames of the detection module are clustered; if the final clustering result has only one cluster center, that result is used as the final output and the tracking module is reinitialized with it, namely the re-detection process re-entered after the target disappears. If there are several cluster centers, then although windows passed the detection module they indicate several different positions, and the detection is considered failed.
(4) Tracking failure, detection failure
If both the tracking module and the detection module fail, the result of the current frame is considered invalid and is discarded.
The invention provides a target tracking method aimed at the problems of scale change, illumination change, target occlusion and target disappearance encountered by a single moving target during long-term tracking; it can overcome these problems to re-acquire and re-track the target, and the algorithm has high real-time performance and high robustness. The invention also provides an anti-occlusion moving target tracking device based on ROI prediction and multi-module learning, which comprises a tracking module, a detection module, a learning module and a synthesis module: the tracking module, based on multi-feature extraction and a correlation filtering algorithm, and the detection module, based on ROI prediction and fHOG-SVM, each predict the target position in a single frame; the synthesis module combines the results of the tracking module and the detection module into the output, while the learning and updating module corrects the tracking module and the detection module, improving the classification and generalization level of the detection module and greatly improving the stability of the algorithm. In summary, the invention takes the tracking-detection-learning framework as its background, optimizes and improves the algorithm of each module, solves the four main problems encountered in long-term tracking, and has the advantages of high real-time performance, high robustness and high detection accuracy.