
Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering

Info

Publication number
CN111401178B
Authority
CN
China
Prior art keywords
feature
correlation
typical
filter
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010157649.XA
Other languages
Chinese (zh)
Other versions
CN111401178A (en)
Inventor
蔡晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202010157649.XA
Publication of CN111401178A
Application granted
Publication of CN111401178B
Status: Active
Anticipated expiration

Abstract

The invention discloses a video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering. The feature extraction module extracts multi-layer depth features with a lightweight network model, which preserves the real-time performance of the target tracking task. Because multi-layer features extracted independently do not characterize the target completely, the feature fusion module applies a multi-layer depth feature fusion strategy based on canonical correlation analysis. This strategy improves the expressiveness of the target representation and the separability of target and background, reduces feature redundancy, and lowers the computational cost of the subsequent correlation filter. To address the tracker drift caused by challenges in the tracking task such as target deformation, occlusion, movement out of view, and rotation, the correlation filtering module applies a correlation filter update strategy based on response-value dispersion analysis, which updates the filter template adaptively and alleviates these specific problems.

Description

Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
Technical Field
The invention relates to the technical field of target tracking, in particular to a video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering.
Background
The video target tracking task initializes the target position in the first frame and then predicts the target position by analyzing the video frame by frame with a tracking algorithm. In recent years, methods based on deep learning and correlation filtering have attracted considerable attention in the field of video object tracking and have driven steady improvements in tracking performance.
CF2 [1] extracts multi-layer depth features independently from the deeper VGG-16 network and feeds them into subsequent correlation filters for prediction, setting the precedent for combining deep learning with correlation filtering. However, because it uses a more complex feature extraction model, its speed drops accordingly. For tasks with real-time tracking requirements, a trade-off between performance and speed is needed.
Extracting features from the deep neural network model is the most time-consuming step, and the most straightforward way to accelerate it is to use a lightweight deep neural network model. However, the multi-layer depth features extracted from a lightweight model characterize the target insufficiently, distinguish it from the background poorly, and are redundant with one another, which adds computational cost. Meanwhile, the target tracking task frequently faces specific problems such as target deformation, occlusion, movement out of view, and rotation, and existing correlation filtering algorithms are prone to tracker drift or even tracking failure.
Disclosure of Invention
The invention aims to provide a video target real-time tracking method based on depth feature fusion and adaptive correlation filtering that solves the above problems of current target tracking methods combining deep learning and correlation filtering, and further to provide a system for implementing the method.
The technical scheme is as follows: a video target real-time tracking method based on depth feature fusion and adaptive correlation filtering comprises the following steps:
step 1, extracting original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
step 2, using a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy;
step 3, using a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target.
In a further embodiment, step 1 further includes:
A lightweight VGG-M-2048 deep neural network is selected as the depth feature extractor; when features are extracted from the network, the last three fully-connected layers are removed and only convolutional-layer features are extracted as input to the feature fusion module. The VGG-M network model is pre-trained on the ImageNet large-scale image classification dataset. All parameter-heavy fully-connected layers are deleted during tracking, and only the first five groups of convolutional layers are kept for feature extraction. A lightweight network model with the fully-connected layers removed greatly reduces the depth feature extraction time and improves the tracking speed.
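As an illustration of this step, the sketch below truncates a pretrained convolutional backbone to its convolutional layers and taps two intermediate feature maps with forward hooks. VGG-M-2048 is not bundled with torchvision, so torchvision's VGG-16 is used purely as a stand-in, and the hooked layer indices are illustrative assumptions rather than the exact layers used by the invention.

```python
# Minimal sketch: drop all fully-connected layers of a pretrained backbone and
# read out two intermediate convolutional feature maps via forward hooks.
# Assumptions: torchvision's VGG-16 stands in for VGG-M-2048; the hooked
# indices (conv3_3, conv4_3) are illustrative choices.
import torch
from torchvision import models

backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
conv_stack = backbone.features.eval()   # convolutional layers only; FC layers are never used

features = {}

def grab(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

conv_stack[14].register_forward_hook(grab("conv3"))   # conv3_3 in VGG-16
conv_stack[21].register_forward_hook(grab("conv4"))   # conv4_3 in VGG-16

with torch.no_grad():
    patch = torch.randn(1, 3, 224, 224)  # a 224x224x3 search-region patch
    conv_stack(patch)

print(features["conv3"].shape, features["conv4"].shape)
```

Note that the stand-in backbone yields different spatial sizes than the 13×13×512 maps described below; only the truncation-and-hook mechanics are being illustrated.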
In a further embodiment, the step 2 further includes:
A multi-layer depth feature fusion strategy based on canonical correlation analysis is proposed: the two layers of depth features extracted independently from the network are mapped into a joint feature space by first projecting each into a set of feature vectors; the maximum correlation between the two sets of feature vectors is computed with canonical correlation analysis, two sets of canonical variates are generated accordingly, the canonical variates are fused by point-to-point addition, and the result is mapped back to the original feature space to form a set of canonical discriminative features, which are finally sent to the subsequent correlation filter for calculation. The fused canonical discriminative features express the target more strongly, discriminate target from background better, and contain less redundancy, so the characterization of the moving target improves while the computational cost of the subsequent correlation filter decreases.
In a further embodiment, the step 3 further includes:
A correlation filter update strategy based on response-value dispersion analysis is proposed: during frame-by-frame tracking, the correlation filter response map of the current frame is computed, a coefficient of variation is defined from the information of the response map, the coefficient is normalized along the time dimension, and its relative deviation is obtained. When the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame can be used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept. This strategy can detect whether the target encounters a challenge during tracking and update the filter template adaptively in a more reasonable way, in particular alleviating the tracker drift caused by challenges such as target deformation, occlusion, movement out of the field of view, and rotation.
In a further embodiment, step 2 further comprises:
step 2.1, obtaining the third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) from the feature extraction module and flattening each into two dimensions, denoted U, V ∈ R^(169×512); two sets of linear transformations are then considered, mapping U and V into a joint feature space to give U* and V*, where:
U* = A^T U
V* = B^T V
step 2.2, using the Pearson correlation coefficient to measure the correlation between the canonical variates U* and V*, and finding the optimal matrices A and B that maximize the correlation coefficient:
ρ = cov(U*, V*) / √( var(U*) · var(V*) )
where cov(·) denotes covariance and var(·) denotes variance;
step 2.3, defining the covariance matrices of U and V:
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)
step 2.4, converting the goal of canonical correlation analysis into the constrained optimization problem:
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1
step 2.5, solving the optimization problem with the Lagrange multiplier method to obtain:
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B
step 2.6, performing an eigendecomposition of the above equations, finding the largest eigenvalues and taking their square roots to obtain the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V, where λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV); after A and B are obtained, returning to step 2.1 to obtain the canonical variates U* and V*; the canonical variates are added point to point and fused in the joint feature space:
Z = U* + V*
where Z, U*, V* ∈ R^(169×d); finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion.
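A minimal NumPy sketch of steps 2.1–2.6 follows. The ridge term `eps` added to the covariance matrices and the unscaled computation of B are implementation conveniences assumed here for numerical stability; they are not part of the description above.

```python
# Sketch of CCA-based fusion of two H x W x C feature maps (steps 2.1-2.6).
# Assumption: a small ridge term `eps` keeps the covariance matrices invertible,
# since 169 spatial samples cannot fully determine a 512 x 512 covariance.
import numpy as np

def cca_fuse(c3, c4, eps=1e-3):
    h, w, c = c3.shape
    U = c3.reshape(h * w, c).astype(float)      # step 2.1: flatten to 169 x 512
    V = c4.reshape(h * w, c).astype(float)
    U -= U.mean(axis=0)                         # center before estimating covariances
    V -= V.mean(axis=0)

    n = U.shape[0]
    S_uu = U.T @ U / (n - 1) + eps * np.eye(c)  # step 2.3: covariance matrices
    S_vv = V.T @ V / (n - 1) + eps * np.eye(c)
    S_uv = U.T @ V / (n - 1)

    # Step 2.5: S_UU^-1 S_UV S_VV^-1 S_VU A = lambda^2 A
    M = np.linalg.solve(S_uu, S_uv) @ np.linalg.solve(S_vv, S_uv.T)
    eigvals, vecs = np.linalg.eig(M)            # step 2.6: eigendecomposition
    order = np.argsort(-eigvals.real)
    d = np.linalg.matrix_rank(S_uv)
    A = vecs[:, order[:d]].real                 # transformation for U
    B = np.linalg.solve(S_vv, S_uv.T) @ A       # transformation for V (up to per-component scaling)

    U_star, V_star = U @ A, V @ B               # canonical variates
    Z = U_star + V_star                         # point-to-point fusion
    return Z.reshape(h, w, d)                   # canonical discriminative feature F
```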
In a further embodiment, step 3 further comprises:
step 3.1, during frame-by-frame tracking, computing the correlation filter response map of the current frame and defining a coefficient of variation from the information of the response map:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
step 3.2, normalizing the coefficient of variation along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.
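The sketch below turns steps 3.1–3.2 into a per-frame update test. The threshold value is an assumption chosen only for illustration; the description above does not fix a specific number, and the form of the relative deviation follows the reconstruction given above.

```python
# Sketch of the dispersion-based update test (steps 3.1-3.2). The text defines
# sigma_t as the variance of the response map and mu_t as its mean; the
# threshold below is an assumed illustrative value.
import numpy as np

def coefficient_of_variation(response_map):
    return response_map.var() / response_map.mean()   # gamma_t = sigma_t / mu_t

def should_update(response_map, gamma_history, threshold=0.6):
    """Return True when the current frame is reliable enough to update the filter template."""
    gamma_t = coefficient_of_variation(response_map)
    if not gamma_history:                              # first frame: always reliable
        gamma_history.append(gamma_t)
        return True
    beta_t = gamma_t / np.mean(gamma_history)          # relative deviation over time
    gamma_history.append(gamma_t)
    return beta_t > threshold                          # above threshold -> update template
```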
A video target real-time tracking system based on depth feature fusion and adaptive correlation filtering comprises the following modules: a correlation filtering module for updating the multi-layer kernel correlation filters; a feature fusion module whose output is fed to the subsequent kernel correlation filters; and a feature extraction module whose output is fed to the feature fusion module.
In a further embodiment, the feature extraction module extracts original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
the feature fusion module uses a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy;
the correlation filtering module uses a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target and thereby update the multi-layer kernel correlation filters adaptively.
In a further embodiment, the feature extraction module further selects a lightweight VGG-M-2048 deep neural network as the depth feature extractor, removes the last three fully-connected layers when extracting features from the network, and extracts only convolutional-layer features as input to the feature fusion module;
the feature fusion module further applies the multi-layer depth feature fusion strategy based on canonical correlation analysis: the two layers of depth features extracted independently from the network are mapped into a joint feature space by projecting each into a set of feature vectors; the maximum correlation between the two sets of feature vectors is computed with canonical correlation analysis, two sets of canonical variates are generated accordingly, the canonical variates are fused by point-to-point addition and mapped back to the original feature space to form a set of canonical discriminative features, which are finally sent to the subsequent correlation filter for calculation;
the third and fourth layers of convolution characteristics C are obtained in the characteristic extraction module3 ,C4 ∈R13×13×512 Projecting two sets of original features into two dimensions, called U, V, where U, V ε R169×512 The method comprises the steps of carrying out a first treatment on the surface of the The following two sets of linear transformations are considered, mapping U, V to a joint feature space, resulting in U and V, where:
U* =AT U
V* =BT V
the correlation between U and V is measured using pearson correlation coefficients, and the optimal solution of matrices a and B is found to maximize the correlation coefficients:
Figure SMS_9
where cov (x) represents covariance and var (x) represents variance;
defining covariance matrix of U and V:
Figure SMS_10
the goal of a typical correlation analysis can be translated into a convex optimization problem:
Figure SMS_11
s.t.AT SUU B=1,BT SVV A=1
solving the optimization problem using the lagrangian multiplier method yields the following equation:
Figure SMS_12
Figure SMS_13
performing feature decomposition on the above formula, finding out the maximum feature value and solving the square root to obtain feature vectors A and B corresponding to the maximum feature value matrix, namely a transformation matrix of U and V; where λ represents a diagonal eigenvalue matrix with d non-zero eigenvalues, and d=rank (SUV ) The method comprises the steps of carrying out a first treatment on the surface of the After A, B is obtained, returning to the step 2.1 to obtain typical variables U; the typical variables are added point to point and fused in a joint feature space:
Z=U* +V*
wherein Z,U* ,V* ∈R169×d Finally, mapping Z back to the original feature space to obtain a typical distinguishing feature F, wherein F is E R13×13×d Thus, feature fusion is completed.
In a further embodiment, the correlation filtering module further applies the correlation filter update strategy based on response-value dispersion analysis: during frame-by-frame tracking, the correlation filter response map of the current frame is computed, a coefficient of variation is defined from the information of the response map, the coefficient of variation is normalized along the time dimension, and its relative deviation is obtained; when the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame is used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept.
During frame-by-frame tracking, the correlation filter response map of the current frame is computed, and the coefficient of variation is defined from its information as:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
the coefficient of variation is normalized along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.
Beneficial effects: the invention provides a video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering that extracts multi-layer depth features with a lightweight network model, preserving the real-time performance of the tracking task. To address the two limitations that independently extracted multi-layer features characterize the target incompletely and are highly redundant with one another, a multi-layer depth feature fusion strategy based on canonical correlation analysis is provided; it improves the expressiveness of the target representation and the separability of target and background, reduces feature redundancy, and lowers the computational cost of the subsequent correlation filter. Meanwhile, to address the tracker drift caused by challenges in the tracking task such as target deformation, occlusion, movement out of view, and rotation, the invention provides a correlation filter update strategy based on response-value dispersion analysis that updates the filter template adaptively and alleviates these specific problems. Overall, the video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering achieve a clear improvement in evaluation performance on the benchmark dataset, run in real time, and have good application value.
Drawings
Fig. 1 is a flowchart of the video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of the present invention.
Fig. 2 is a flowchart of the multi-layer depth feature fusion strategy based on canonical correlation analysis of the present invention.
Fig. 3 shows the tracking precision plots, obtained experimentally on 96 standard videos, of the present invention (denoted DFF in the figure) and eight other existing algorithms.
Fig. 4 shows the tracking success-rate plots, obtained experimentally on 96 standard videos, of the present invention (denoted DFF in the figure) and eight other existing algorithms.
Fig. 5 shows the tracking success-rate plots of the present invention (denoted DFF in the figure) and eight other existing algorithms on the target deformation attribute.
Fig. 6 shows the tracking success-rate plot of the present invention (denoted DFF in the figure) on the out-of-view attribute.
Detailed Description
The applicant believes that extracting features from the deep neural network model is the most time-consuming step and that the most straightforward way to accelerate it is to use a lightweight deep neural network model. However, the multi-layer depth features extracted from a lightweight model characterize the target insufficiently, distinguish it from the background poorly, and are redundant with one another, which adds computational cost. Meanwhile, the target tracking task frequently faces specific problems such as target deformation, occlusion, movement out of view, and rotation, and existing correlation filtering algorithms are prone to tracker drift or even tracking failure.
Therefore, on the basis of depth feature fusion and adaptive correlation filtering, the invention provides a multi-layer depth feature fusion strategy based on canonical correlation analysis (CCA), which improves feature expressiveness and reduces redundancy, together with a correlation filter update strategy based on response-value dispersion analysis that updates the filter adaptively. The method was compared with eight existing target tracking algorithms based on deep learning and correlation filtering on all 96 videos of the OTB [2] standard dataset: its success-rate AUC is 0.599, ranking first, and its precision at a location error threshold of 20 pixels is 0.758, ranking second. Among the compared deep-learning-based target tracking algorithms, the invention runs in real time; among the compared correlation-filtering-based algorithms, its tracking success rate ranks first. The invention therefore improves both tracking performance and real-time performance.
The technical scheme of the invention is further specifically described below through specific examples and with reference to the accompanying drawings.
As shown in fig. 1, the invention provides a video target real-time tracking system based on depth feature fusion and adaptive correlation filtering, which comprises the following modules:
1) A feature extraction module, which extracts original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively, and feeds them to the feature fusion module.
2) A feature fusion module, which uses the multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy, and feeds them to the subsequent kernel correlation filters.
3) A correlation filtering module, which uses the correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target and thereby update the multi-layer kernel correlation filters adaptively.
Specifically, in the feature extraction module 1), the lightweight VGG-M-2048 is selected as the depth feature extractor. It has five groups of convolutional layers and requires about 3.58 GFLOPs for a 224×224×3 image, whereas the VGG-16 network selected by the HCF algorithm has sixteen weight layers and requires about 15.47 GFLOPs, so the lightweight feature extraction network speeds up forward propagation by roughly a factor of five. When extracting features from VGG-M-2048 we remove the last three fully-connected layers, which discards a large number of unnecessary parameters; only convolutional-layer features are extracted and fed to the feature fusion module.
In the feature fusion module 2), as shown in Fig. 2, the multi-layer depth feature fusion strategy based on canonical correlation analysis is used. The third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) are obtained from the feature extraction module and flattened into two dimensions, denoted U, V ∈ R^(169×512). Two sets of linear transformations are considered, mapping U and V into a joint feature space to give U* and V*, where:
U* = A^T U
V* = B^T V    (1)
The goal of canonical correlation analysis is to maximize the correlation between the canonical variates of U and V in the joint feature space, so the Pearson correlation coefficient is used to measure the correlation between U* and V*. We look for the optimal matrices A and B that maximize the correlation coefficient, as in equation (2):
ρ = cov(U*, V*) / √( var(U*) · var(V*) )    (2)
where cov(·) denotes covariance and var(·) denotes variance. Next we define the covariance matrices of U and V, as in equation (3):
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)    (3)
The goal of canonical correlation analysis can then be written as the constrained optimization problem of equation (4):
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1    (4)
We solve this optimization problem with the Lagrange multiplier method, which yields equation (5):
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B    (5)
Next we eigendecompose equation (5), find the largest eigenvalues and take their square roots, obtaining the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V. Here λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV). With A and B obtained, the canonical variates U* and V* follow from equation (1). We add the canonical variates point to point to fuse them in the joint feature space:
Z = U* + V*    (6)
where Z, U*, V* ∈ R^(169×d). Finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion. The fused feature F is fed to the correlation filtering module for the subsequent operations.
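For a quick dimensional check, the snippet below applies the `cca_fuse` sketch given after step 2.6 above to random tensors shaped like C_3 and C_4; the random inputs are placeholder stand-ins for real conv3/conv4 activations.

```python
# Usage of the cca_fuse sketch from the method description (random stand-ins
# for the 13 x 13 x 512 conv3/conv4 maps; d is determined by rank(S_UV)).
import numpy as np

c3 = np.random.rand(13, 13, 512)
c4 = np.random.rand(13, 13, 512)
F = cca_fuse(c3, c4)   # defined in the earlier sketch
print(F.shape)         # (13, 13, d) with d <= 512
```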
In the correlation filtering module 3), during frame-by-frame tracking we compute the correlation filter response map of the current frame and, from its information, define the coefficient of variation γ as in equation (7):
γ_t = σ_t / μ_t    (7)
where σ_t is the variance of the filter response map of the t-th (current) frame and μ_t is its mean. Experiments show that when the target is clearly visible, the correlation filter response map is unimodal and both μ_t and σ_t stay at a high level; when the target encounters occlusion, deformation, or similar problems, the response map becomes dispersed and multi-peaked, μ_t decreases slowly but σ_t drops sharply, so γ_t decreases. In other words, a larger γ_t indicates a smaller appearance change in the current frame and a reliable tracking result. We then normalize the coefficient of variation along the time dimension and compute the relative deviation β as in equation (8):
β_t = γ_t / γ̄_t    (8)
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t. Therefore, when β is greater than the threshold, we consider the tracking prediction of the current frame reliable and use the current frame to update the filter template; when β is less than the threshold, we consider the prediction unreliable and keep the filter template of the last reliable frame.
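The following runnable illustration mirrors the observation above: a sharp unimodal response map yields a much larger coefficient of variation than a low, spread-out multi-peak map. The two synthetic maps are assumptions constructed only to show the direction of the effect, not outputs of the invention's filters.

```python
# Synthetic check of the unimodal-vs-dispersed behaviour of gamma_t.
import numpy as np

def gamma(response):
    return response.var() / response.mean()

yy, xx = np.mgrid[0:50, 0:50]
sharp = np.exp(-((xx - 25) ** 2 + (yy - 25) ** 2) / 20.0)            # clear target: one sharp peak
diffuse = 0.3 * np.exp(-((xx - 10) ** 2 + (yy - 40) ** 2) / 200.0) \
        + 0.3 * np.exp(-((xx - 40) ** 2 + (yy - 10) ** 2) / 200.0)    # occlusion: low, spread-out peaks

print(f"gamma(sharp)   = {gamma(sharp):.3f}")    # larger -> frame treated as reliable
print(f"gamma(diffuse) = {gamma(diffuse):.3f}")  # smaller -> keep the previous template
```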
To verify the effectiveness of the invention for target tracking, experiments were performed on the 96 standard videos of the OTB [2] database, and the invention was compared with the eight existing target tracking algorithms based on correlation filtering and deep learning listed in Table 1; our algorithm is denoted DFF.
Table 1. Names, years, and sources of the compared algorithms
According to the test results shown in Fig. 3, the success-rate AUC of the invention on OTB-50 is 0.758 and the precision is 0.608; on OTB-100 the success-rate AUC is 0.599 and the precision is 0.748. As shown in Table 2, the success rate of the invention on the dataset ranks first among the eight compared algorithms; unlike the deep-learning-based CF2 [1] algorithm, the invention can run in real time, and compared with the correlation-filtering-based KCF [5] and DSST [6] it has a clearly higher tracking success rate and precision. Meanwhile, as shown in Figs. 4 to 6, the success-rate results of the invention on the 42 videos with the target deformation attribute and the 13 videos with the out-of-view attribute both rank first.
Table 2. Evaluation results of DFF and the eight compared algorithms on the dataset
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A video target real-time tracking method based on depth feature fusion and adaptive correlation filtering, characterized by comprising the following steps:
step 1, extracting original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
step 2, using a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy; the multi-layer depth feature fusion strategy based on canonical correlation analysis comprises: mapping two layers of depth features extracted independently from the network into a joint feature space by projecting each into a set of feature vectors, computing the maximum correlation between the two sets of feature vectors with canonical correlation analysis, generating two sets of canonical variates accordingly, fusing the canonical variates by point-to-point addition, mapping the result back to the original feature space to form a set of canonical discriminative features, and finally sending the canonical discriminative features to a subsequent correlation filter for calculation;
step 3, using a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target.
2. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 1, wherein step 1 further comprises:
selecting a lightweight VGG-M-2048 deep neural network as the depth feature extractor, removing the last three fully-connected layers when extracting features from the network, and extracting only convolutional-layer features as input to the feature fusion module.
3. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 1, wherein step 3 further comprises:
applying the correlation filter update strategy based on response-value dispersion analysis: computing the correlation filter response map of the current frame during frame-by-frame tracking, defining a coefficient of variation from the information of the response map, normalizing the coefficient of variation along the time dimension, and obtaining its relative deviation; when the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame can be used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept.
4. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 1, wherein step 2 further comprises:
step 2.1, obtaining the third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) from the feature extraction module and flattening each into two dimensions, denoted U, V ∈ R^(169×512); two sets of linear transformations are then considered, mapping U and V into a joint feature space to give U* and V*, where:
U* = A^T U
V* = B^T V
step 2.2, using the Pearson correlation coefficient to measure the correlation between the canonical variates U* and V*, and finding the optimal matrices A and B that maximize the correlation coefficient:
ρ = cov(U*, V*) / √( var(U*) · var(V*) )
where cov(·) denotes covariance and var(·) denotes variance;
step 2.3, defining the covariance matrices of U and V:
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)
step 2.4, converting the goal of canonical correlation analysis into the constrained optimization problem:
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1
step 2.5, solving the optimization problem with the Lagrange multiplier method to obtain:
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B
step 2.6, performing an eigendecomposition of the above equations, finding the largest eigenvalues and taking their square roots to obtain the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V, where λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV); after A and B are obtained, returning to step 2.1 to obtain the canonical variates U* and V*; the canonical variates are added point to point and fused in the joint feature space:
Z = U* + V*
where Z, U*, V* ∈ R^(169×d); finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion.
5. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 3, wherein step 3 further comprises:
step 3.1, during frame-by-frame tracking, computing the correlation filter response map of the current frame and defining a coefficient of variation from the information of the response map:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
step 3.2, normalizing the coefficient of variation along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.
6. A video target real-time tracking system based on depth feature fusion and adaptive correlation filtering, characterized by comprising the following modules:
a correlation filtering module for updating the multi-layer kernel correlation filters; the correlation filtering module uses a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target and thereby update the multi-layer kernel correlation filters adaptively;
a feature fusion module whose output is fed to the subsequent kernel correlation filters; the feature fusion module uses a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy;
a feature extraction module whose output is fed to the feature fusion module; the feature extraction module extracts original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
the feature extraction module further selects a lightweight VGG-M-2048 deep neural network as the depth feature extractor, removes the last three fully-connected layers when extracting features from the network, and extracts only convolutional-layer features as input to the feature fusion module;
the feature fusion module further applies the multi-layer depth feature fusion strategy based on canonical correlation analysis: the two layers of depth features extracted independently from the network are mapped into a joint feature space by projecting each into a set of feature vectors; the maximum correlation between the two sets of feature vectors is computed with canonical correlation analysis, two sets of canonical variates are generated accordingly, the canonical variates are fused by point-to-point addition and mapped back to the original feature space to form a set of canonical discriminative features, which are finally sent to the subsequent correlation filter for calculation;
the third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) are obtained from the feature extraction module and flattened into two dimensions, denoted U, V ∈ R^(169×512); two sets of linear transformations map U and V into a joint feature space, giving U* and V*, where:
U* = A^T U
V* = B^T V
the correlation between the canonical variates U* and V* is measured with the Pearson correlation coefficient, and the optimal matrices A and B that maximize the correlation coefficient are found:
ρ = cov(U*, V*) / √( var(U*) · var(V*) )
where cov(·) denotes covariance and var(·) denotes variance;
the covariance matrices of U and V are defined as:
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)
the goal of canonical correlation analysis is converted into the constrained optimization problem:
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1
which is solved with the Lagrange multiplier method to obtain:
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B
eigendecomposing the above equations, finding the largest eigenvalues and taking their square roots yields the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V, where λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV); after A and B are obtained, the canonical variates U* and V* follow from the transformations above; the canonical variates are added point to point and fused in the joint feature space:
Z = U* + V*
where Z, U*, V* ∈ R^(169×d); finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion.
7. The video target real-time tracking system based on depth feature fusion and adaptive correlation filtering of claim 6, wherein:
the correlation filtering module further applies the correlation filter update strategy based on response-value dispersion analysis: during frame-by-frame tracking, the correlation filter response map of the current frame is computed, a coefficient of variation is defined from the information of the response map, the coefficient of variation is normalized along the time dimension, and its relative deviation is obtained; when the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame is used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept;
during frame-by-frame tracking, the correlation filter response map of the current frame is computed, and the coefficient of variation is defined from its information as:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
the coefficient of variation is normalized along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.

Priority Applications (1)

Application Number — Priority Date — Filing Date — Title
CN202010157649.XA — 2020-03-09 — 2020-03-09 — Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering


Publications (2)

Publication Number — Publication Date
CN111401178A (en) — 2020-07-10
CN111401178B (en) — 2023-06-13

Family

ID=71430587

Family Applications (1)

Application Number — Status — Grant Publication — Priority Date — Filing Date — Title
CN202010157649.XA — Active — CN111401178B (en) — 2020-03-09 — 2020-03-09 — Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering

Country Status (1)

Country — Link
CN — CN111401178B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
CN112270228A (en)* — 2020-10-16 — 2021-01-26 — Xi'an Polytechnic University — A pedestrian re-identification method based on DCCA fusion features
CN112541468B (en)* — 2020-12-22 — 2022-09-06 — National University of Defense Technology — A target tracking method based on dual-template response fusion
CN113538509B (en)* — 2021-06-02 — 2022-09-27 — Tianjin University — Visual tracking method and device based on adaptive correlation filtering feature fusion learning
CN117893574A (en)* — 2024-03-14 — 2024-04-16 — Dalian University of Technology — Infrared unmanned aerial vehicle target tracking method based on correlation filtering convolutional neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
WO2015054688A2 (en)* — 2013-10-11 — 2015-04-16 — Seno Medical Instruments, Inc. — Systems and methods for component separation in medical imaging
CN108288283A (en)* — 2018-01-22 — 2018-07-17 — Yangzhou University — A video tracking method based on correlation filtering
CN108921872A (en)* — 2018-05-15 — 2018-11-30 — Nanjing University of Science and Technology — A robust visual target tracking method suitable for long-range tracking
CN109741366A (en)* — 2018-11-27 — 2019-05-10 — Kunming University of Science and Technology — A correlation filtering target tracking method fusing multi-layer convolution features

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
US11055854B2 (en)* — 2018-08-23 — 2021-07-06 — Seoul National University R&DB Foundation — Method and system for real-time target tracking based on deep learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wei Yongqiang et al., "Kernelized correlation filter visual tracking with deep features," http://kns.cnki.net/kcms/detail/11.2127.TP.20190828.1416.002.html, 2019, pp. 1-12.*

Also Published As

Publication number — Publication date
CN111401178A (en) — 2020-07-10

Similar Documents

Publication — Publication Date — Title
CN111401178B (en)Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
CN111144364B (en)Twin network target tracking method based on channel attention updating mechanism
JP6807471B2 (en) Semantic segmentation model training methods and equipment, electronics, and storage media
CN111080675B (en)Target tracking method based on space-time constraint correlation filtering
US8989442B2 (en)Robust feature fusion for multi-view object tracking
US8423485B2 (en)Correspondence learning apparatus and method and correspondence learning program, annotation apparatus and method and annotation program, and retrieval apparatus and method and retrieval program
CN110210551A (en)A kind of visual target tracking method based on adaptive main body sensitivity
CN113657560A (en)Weak supervision image semantic segmentation method and system based on node classification
CN111429485B (en) Cross-modal filter tracking method based on adaptive regularization and high confidence update
CN108875655A (en)A kind of real-time target video tracing method and system based on multiple features
CN112329784A (en)Correlation filtering tracking method based on space-time perception and multimodal response
CN113052873A (en)Single-target tracking method for on-line self-supervision learning scene adaptation
CN111862167B (en)Rapid robust target tracking method based on sparse compact correlation filter
CN110458784A (en)It is a kind of that compression noise method is gone based on image perception quality
US12307746B1 (en)Nighttime unmanned aerial vehicle object tracking method fusing hybrid attention mechanism
CN107798329B (en)CNN-based adaptive particle filter target tracking method
CN109741364A (en) Target tracking method and device
CN113674218A (en)Weld characteristic point extraction method and device, electronic equipment and storage medium
CN117392176A (en) Pedestrian tracking method and system for video surveillance, computer-readable medium
CN105654518B (en)A kind of trace template adaptive approach
CN110070002A (en)A kind of Activity recognition method based on 3D convolutional neural networks
CN117523626A (en) Pseudo RGB-D face recognition method
CN108280845B (en) A scale-adaptive target tracking method for complex backgrounds
CN110135435A (en) A method and device for saliency detection based on extensive learning system
CN118334610A (en) A method, device and equipment for identifying target objects in an autonomous driving scene

Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
