
Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering

Info

Publication number
CN111401178B
Authority
CN
China
Prior art keywords
feature
correlation
typical
filter
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010157649.XA
Other languages
Chinese (zh)
Other versions
CN111401178A (en)
Inventor
蔡晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202010157649.XA
Publication of CN111401178A
Application granted
Publication of CN111401178B
Status: Active
Anticipated expiration

Abstract

The invention discloses a video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering. The feature extraction module extracts multi-layer depth features with a lightweight network model, which preserves the real-time performance of the target tracking task. Because multi-layer features extracted independently do not characterize the target completely, the feature fusion module applies a multi-layer depth feature fusion strategy based on canonical correlation analysis. This strategy improves the expressiveness of the target representation and the separability of target and background, reduces feature redundancy, and lowers the computational cost of the subsequent correlation filter. To address the tracker drift caused by challenges in the tracking task such as target deformation, occlusion, movement out of view, and rotation, the correlation filtering module applies a correlation filter update strategy based on response-value dispersion analysis, which updates the filter template adaptively and alleviates these specific problems.

Description

Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
Technical Field
The invention relates to the technical field of target tracking, in particular to a video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering.
Background
The video target tracking task initializes the target position in the first frame and then predicts the target position by analyzing the video frame by frame with a tracking algorithm. In recent years, methods based on deep learning and correlation filtering have attracted considerable attention in the field of video object tracking and have driven steady improvements in tracking performance.
CF2 [1] extracts multi-layer depth features independently from the deeper VGG-16 network and feeds them into subsequent correlation filters for prediction, setting the precedent for combining deep learning with correlation filtering. However, because it uses a more complex feature extraction model, its speed drops accordingly. For tasks with real-time tracking requirements, a trade-off between performance and speed is needed.
Extracting features from the deep neural network model is the most time-consuming step, and the most straightforward way to accelerate it is to use a lightweight deep neural network model. However, the multi-layer depth features extracted from a lightweight model characterize the target insufficiently, distinguish it from the background poorly, and are redundant with one another, which adds computational cost. Meanwhile, the target tracking task frequently faces specific problems such as target deformation, occlusion, movement out of view, and rotation, and existing correlation filtering algorithms are prone to tracker drift or even tracking failure.
Disclosure of Invention
The invention aims to provide a video target real-time tracking method based on depth feature fusion and adaptive correlation filtering that solves the above problems of current target tracking methods combining deep learning and correlation filtering, and further to provide a system for implementing the method.
The technical scheme is as follows: a video target real-time tracking method based on depth feature fusion and adaptive correlation filtering comprises the following steps:
step 1, extracting original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
step 2, using a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy;
step 3, using a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target.
In a further embodiment, step 1 further includes:
A lightweight VGG-M-2048 deep neural network is selected as the depth feature extractor; when features are extracted from the network, the last three fully-connected layers are removed and only convolutional-layer features are extracted as input to the feature fusion module. The VGG-M network model is pre-trained on the ImageNet large-scale image classification dataset. All parameter-heavy fully-connected layers are deleted during tracking, and only the first five groups of convolutional layers are kept for feature extraction. A lightweight network model with the fully-connected layers removed greatly reduces the depth feature extraction time and improves the tracking speed.
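As an illustration of this step, the sketch below truncates a pretrained convolutional backbone to its convolutional layers and taps two intermediate feature maps with forward hooks. VGG-M-2048 is not bundled with torchvision, so torchvision's VGG-16 is used purely as a stand-in, and the hooked layer indices are illustrative assumptions rather than the exact layers used by the invention.

```python
# Minimal sketch: drop all fully-connected layers of a pretrained backbone and
# read out two intermediate convolutional feature maps via forward hooks.
# Assumptions: torchvision's VGG-16 stands in for VGG-M-2048; the hooked
# indices (conv3_3, conv4_3) are illustrative choices.
import torch
from torchvision import models

backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
conv_stack = backbone.features.eval()   # convolutional layers only; FC layers are never used

features = {}

def grab(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

conv_stack[14].register_forward_hook(grab("conv3"))   # conv3_3 in VGG-16
conv_stack[21].register_forward_hook(grab("conv4"))   # conv4_3 in VGG-16

with torch.no_grad():
    patch = torch.randn(1, 3, 224, 224)  # a 224x224x3 search-region patch
    conv_stack(patch)

print(features["conv3"].shape, features["conv4"].shape)
```

Note that the stand-in backbone yields different spatial sizes than the 13×13×512 maps described below; only the truncation-and-hook mechanics are being illustrated.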
In a further embodiment, the step 2 further includes:
A multi-layer depth feature fusion strategy based on canonical correlation analysis is proposed: the two layers of depth features extracted independently from the network are mapped into a joint feature space by first projecting each into a set of feature vectors; the maximum correlation between the two sets of feature vectors is computed with canonical correlation analysis, two sets of canonical variates are generated accordingly, the canonical variates are fused by point-to-point addition, and the result is mapped back to the original feature space to form a set of canonical discriminative features, which are finally sent to the subsequent correlation filter for calculation. The fused canonical discriminative features express the target more strongly, discriminate target from background better, and contain less redundancy, so the characterization of the moving target improves while the computational cost of the subsequent correlation filter decreases.
In a further embodiment, the step 3 further includes:
A correlation filter update strategy based on response-value dispersion analysis is proposed: during frame-by-frame tracking, the correlation filter response map of the current frame is computed, a coefficient of variation is defined from the information of the response map, the coefficient is normalized along the time dimension, and its relative deviation is obtained. When the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame can be used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept. This strategy can detect whether the target encounters a challenge during tracking and update the filter template adaptively in a more reasonable way, in particular alleviating the tracker drift caused by challenges such as target deformation, occlusion, movement out of the field of view, and rotation.
In a further embodiment, step 2 further comprises:
step 2.1, obtaining the third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) from the feature extraction module and flattening each into two dimensions, denoted U, V ∈ R^(169×512); two sets of linear transformations are then considered, mapping U and V into a joint feature space to give U* and V*, where:
U* = A^T U
V* = B^T V
step 2.2, using the Pearson correlation coefficient to measure the correlation between the canonical variates U* and V*, and finding the optimal matrices A and B that maximize the correlation coefficient:
ρ = cov(U*, V*) / √( var(U*) · var(V*) )
where cov(·) denotes covariance and var(·) denotes variance;
step 2.3, defining the covariance matrices of U and V:
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)
step 2.4, converting the goal of canonical correlation analysis into the constrained optimization problem:
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1
step 2.5, solving the optimization problem with the Lagrange multiplier method to obtain:
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B
step 2.6, performing an eigendecomposition of the above equations, finding the largest eigenvalues and taking their square roots to obtain the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V, where λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV); after A and B are obtained, returning to step 2.1 to obtain the canonical variates U* and V*; the canonical variates are added point to point and fused in the joint feature space:
Z = U* + V*
where Z, U*, V* ∈ R^(169×d); finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion.
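A minimal NumPy sketch of steps 2.1–2.6 follows. The ridge term `eps` added to the covariance matrices and the unscaled computation of B are implementation conveniences assumed here for numerical stability; they are not part of the description above.

```python
# Sketch of CCA-based fusion of two H x W x C feature maps (steps 2.1-2.6).
# Assumption: a small ridge term `eps` keeps the covariance matrices invertible,
# since 169 spatial samples cannot fully determine a 512 x 512 covariance.
import numpy as np

def cca_fuse(c3, c4, eps=1e-3):
    h, w, c = c3.shape
    U = c3.reshape(h * w, c).astype(float)      # step 2.1: flatten to 169 x 512
    V = c4.reshape(h * w, c).astype(float)
    U -= U.mean(axis=0)                         # center before estimating covariances
    V -= V.mean(axis=0)

    n = U.shape[0]
    S_uu = U.T @ U / (n - 1) + eps * np.eye(c)  # step 2.3: covariance matrices
    S_vv = V.T @ V / (n - 1) + eps * np.eye(c)
    S_uv = U.T @ V / (n - 1)

    # Step 2.5: S_UU^-1 S_UV S_VV^-1 S_VU A = lambda^2 A
    M = np.linalg.solve(S_uu, S_uv) @ np.linalg.solve(S_vv, S_uv.T)
    eigvals, vecs = np.linalg.eig(M)            # step 2.6: eigendecomposition
    order = np.argsort(-eigvals.real)
    d = np.linalg.matrix_rank(S_uv)
    A = vecs[:, order[:d]].real                 # transformation for U
    B = np.linalg.solve(S_vv, S_uv.T) @ A       # transformation for V (up to per-component scaling)

    U_star, V_star = U @ A, V @ B               # canonical variates
    Z = U_star + V_star                         # point-to-point fusion
    return Z.reshape(h, w, d)                   # canonical discriminative feature F
```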
In a further embodiment, step 3 further comprises:
step 3.1, during frame-by-frame tracking, computing the correlation filter response map of the current frame and defining a coefficient of variation from the information of the response map:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
step 3.2, normalizing the coefficient of variation along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.
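The sketch below turns steps 3.1–3.2 into a per-frame update test. The threshold value is an assumption chosen only for illustration; the description above does not fix a specific number, and the form of the relative deviation follows the reconstruction given above.

```python
# Sketch of the dispersion-based update test (steps 3.1-3.2). The text defines
# sigma_t as the variance of the response map and mu_t as its mean; the
# threshold below is an assumed illustrative value.
import numpy as np

def coefficient_of_variation(response_map):
    return response_map.var() / response_map.mean()   # gamma_t = sigma_t / mu_t

def should_update(response_map, gamma_history, threshold=0.6):
    """Return True when the current frame is reliable enough to update the filter template."""
    gamma_t = coefficient_of_variation(response_map)
    if not gamma_history:                              # first frame: always reliable
        gamma_history.append(gamma_t)
        return True
    beta_t = gamma_t / np.mean(gamma_history)          # relative deviation over time
    gamma_history.append(gamma_t)
    return beta_t > threshold                          # above threshold -> update template
```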
A video target real-time tracking system based on depth feature fusion and adaptive correlation filtering comprises the following modules: a correlation filtering module for updating the multi-layer kernel correlation filters; a feature fusion module whose output is fed to the subsequent kernel correlation filters; and a feature extraction module whose output is fed to the feature fusion module.
In a further embodiment, the feature extraction module extracts original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
the feature fusion module uses a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy;
the correlation filtering module uses a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target and thereby update the multi-layer kernel correlation filters adaptively.
In a further embodiment, the feature extraction module further selects a lightweight VGG-M-2048 deep neural network as the depth feature extractor, removes the last three fully-connected layers when extracting features from the network, and extracts only convolutional-layer features as input to the feature fusion module;
the feature fusion module further applies the multi-layer depth feature fusion strategy based on canonical correlation analysis: the two layers of depth features extracted independently from the network are mapped into a joint feature space by projecting each into a set of feature vectors; the maximum correlation between the two sets of feature vectors is computed with canonical correlation analysis, two sets of canonical variates are generated accordingly, the canonical variates are fused by point-to-point addition and mapped back to the original feature space to form a set of canonical discriminative features, which are finally sent to the subsequent correlation filter for calculation;
the third and fourth layers of convolution characteristics C are obtained in the characteristic extraction module3 ,C4 ∈R13×13×512 Projecting two sets of original features into two dimensions, called U, V, where U, V ε R169×512 The method comprises the steps of carrying out a first treatment on the surface of the The following two sets of linear transformations are considered, mapping U, V to a joint feature space, resulting in U and V, where:
U* =AT U
V* =BT V
the correlation between U and V is measured using pearson correlation coefficients, and the optimal solution of matrices a and B is found to maximize the correlation coefficients:
Figure SMS_9
where cov (x) represents covariance and var (x) represents variance;
defining covariance matrix of U and V:
Figure SMS_10
the goal of a typical correlation analysis can be translated into a convex optimization problem:
Figure SMS_11
s.t.AT SUU B=1,BT SVV A=1
solving the optimization problem using the lagrangian multiplier method yields the following equation:
Figure SMS_12
Figure SMS_13
performing feature decomposition on the above formula, finding out the maximum feature value and solving the square root to obtain feature vectors A and B corresponding to the maximum feature value matrix, namely a transformation matrix of U and V; where λ represents a diagonal eigenvalue matrix with d non-zero eigenvalues, and d=rank (SUV ) The method comprises the steps of carrying out a first treatment on the surface of the After A, B is obtained, returning to the step 2.1 to obtain typical variables U; the typical variables are added point to point and fused in a joint feature space:
Z=U* +V*
wherein Z,U* ,V* ∈R169×d Finally, mapping Z back to the original feature space to obtain a typical distinguishing feature F, wherein F is E R13×13×d Thus, feature fusion is completed.
In a further embodiment, the correlation filtering module further applies the correlation filter update strategy based on response-value dispersion analysis: during frame-by-frame tracking, the correlation filter response map of the current frame is computed, a coefficient of variation is defined from the information of the response map, the coefficient of variation is normalized along the time dimension, and its relative deviation is obtained; when the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame is used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept.
During frame-by-frame tracking, the correlation filter response map of the current frame is computed, and the coefficient of variation is defined from its information as:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
the coefficient of variation is normalized along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.
Beneficial effects: the invention provides a video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering that extracts multi-layer depth features with a lightweight network model, preserving the real-time performance of the tracking task. To address the two limitations that independently extracted multi-layer features characterize the target incompletely and are highly redundant with one another, a multi-layer depth feature fusion strategy based on canonical correlation analysis is provided; it improves the expressiveness of the target representation and the separability of target and background, reduces feature redundancy, and lowers the computational cost of the subsequent correlation filter. Meanwhile, to address the tracker drift caused by challenges in the tracking task such as target deformation, occlusion, movement out of view, and rotation, the invention provides a correlation filter update strategy based on response-value dispersion analysis that updates the filter template adaptively and alleviates these specific problems. Overall, the video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering achieve a clear improvement in evaluation performance on the benchmark dataset, run in real time, and have good application value.
Drawings
Fig. 1 is a flowchart of the video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of the present invention.
Fig. 2 is a flowchart of the multi-layer depth feature fusion strategy based on canonical correlation analysis of the present invention.
Fig. 3 shows the tracking precision plots, obtained experimentally on 96 standard videos, of the present invention (denoted DFF in the figure) and eight other existing algorithms.
Fig. 4 shows the tracking success-rate plots, obtained experimentally on 96 standard videos, of the present invention (denoted DFF in the figure) and eight other existing algorithms.
Fig. 5 shows the tracking success-rate plots of the present invention (denoted DFF in the figure) and eight other existing algorithms on the target deformation attribute.
Fig. 6 shows the tracking success-rate plot of the present invention (denoted DFF in the figure) on the out-of-view attribute.
Detailed Description
The applicant believes that extracting features from the deep neural network model is the most time-consuming step and that the most straightforward way to accelerate it is to use a lightweight deep neural network model. However, the multi-layer depth features extracted from a lightweight model characterize the target insufficiently, distinguish it from the background poorly, and are redundant with one another, which adds computational cost. Meanwhile, the target tracking task frequently faces specific problems such as target deformation, occlusion, movement out of view, and rotation, and existing correlation filtering algorithms are prone to tracker drift or even tracking failure.
Therefore, on the basis of depth feature fusion and adaptive correlation filtering, the invention provides a multi-layer depth feature fusion strategy based on canonical correlation analysis (CCA), which improves feature expressiveness and reduces redundancy, together with a correlation filter update strategy based on response-value dispersion analysis that updates the filter adaptively. The method was compared with eight existing target tracking algorithms based on deep learning and correlation filtering on all 96 videos of the OTB [2] standard dataset: its success-rate AUC is 0.599, ranking first, and its precision at a location error threshold of 20 pixels is 0.758, ranking second. Among the compared deep-learning-based target tracking algorithms, the invention runs in real time; among the compared correlation-filtering-based algorithms, its tracking success rate ranks first. The invention therefore improves both tracking performance and real-time performance.
The technical scheme of the invention is further specifically described below through specific examples and with reference to the accompanying drawings.
As shown in fig. 1, the invention provides a video target real-time tracking system based on depth feature fusion and adaptive correlation filtering, which comprises the following modules:
1) A feature extraction module, which extracts original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively, and feeds them to the feature fusion module.
2) A feature fusion module, which uses the multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy, and feeds them to the subsequent kernel correlation filters.
3) A correlation filtering module, which uses the correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target and thereby update the multi-layer kernel correlation filters adaptively.
Specifically, in the feature extraction module 1), the lightweight VGG-M-2048 is selected as the depth feature extractor. It has five groups of convolutional layers and requires about 3.58 GFLOPs for a 224×224×3 image, whereas the VGG-16 network selected by the HCF algorithm has sixteen weight layers and requires about 15.47 GFLOPs, so the lightweight feature extraction network speeds up forward propagation by roughly a factor of five. When extracting features from VGG-M-2048 we remove the last three fully-connected layers, which discards a large number of unnecessary parameters; only convolutional-layer features are extracted and fed to the feature fusion module.
In the feature fusion module 2), as shown in Fig. 2, the multi-layer depth feature fusion strategy based on canonical correlation analysis is used. The third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) are obtained from the feature extraction module and flattened into two dimensions, denoted U, V ∈ R^(169×512). Two sets of linear transformations are considered, mapping U and V into a joint feature space to give U* and V*, where:
U* = A^T U
V* = B^T V    (1)
The goal of canonical correlation analysis is to maximize the correlation between the canonical variates of U and V in the joint feature space, so the Pearson correlation coefficient is used to measure the correlation between U* and V*. We look for the optimal matrices A and B that maximize the correlation coefficient, as in equation (2):
ρ = cov(U*, V*) / √( var(U*) · var(V*) )    (2)
where cov(·) denotes covariance and var(·) denotes variance. Next we define the covariance matrices of U and V, as in equation (3):
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)    (3)
The goal of canonical correlation analysis can then be written as the constrained optimization problem of equation (4):
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1    (4)
We solve this optimization problem with the Lagrange multiplier method, which yields equation (5):
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B    (5)
Next we eigendecompose equation (5), find the largest eigenvalues and take their square roots, obtaining the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V. Here λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV). With A and B obtained, the canonical variates U* and V* follow from equation (1). We add the canonical variates point to point to fuse them in the joint feature space:
Z = U* + V*    (6)
where Z, U*, V* ∈ R^(169×d). Finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion. The fused feature F is fed to the correlation filtering module for the subsequent operations.
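For a quick dimensional check, the snippet below applies the `cca_fuse` sketch given after step 2.6 above to random tensors shaped like C_3 and C_4; the random inputs are placeholder stand-ins for real conv3/conv4 activations.

```python
# Usage of the cca_fuse sketch from the method description (random stand-ins
# for the 13 x 13 x 512 conv3/conv4 maps; d is determined by rank(S_UV)).
import numpy as np

c3 = np.random.rand(13, 13, 512)
c4 = np.random.rand(13, 13, 512)
F = cca_fuse(c3, c4)   # defined in the earlier sketch
print(F.shape)         # (13, 13, d) with d <= 512
```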
In the correlation filtering module 3), during frame-by-frame tracking we compute the correlation filter response map of the current frame and, from its information, define the coefficient of variation γ as in equation (7):
γ_t = σ_t / μ_t    (7)
where σ_t is the variance of the filter response map of the t-th (current) frame and μ_t is its mean. Experiments show that when the target is clearly visible, the correlation filter response map is unimodal and both μ_t and σ_t stay at a high level; when the target encounters occlusion, deformation, or similar problems, the response map becomes dispersed and multi-peaked, μ_t decreases slowly but σ_t drops sharply, so γ_t decreases. In other words, a larger γ_t indicates a smaller appearance change in the current frame and a reliable tracking result. We then normalize the coefficient of variation along the time dimension and compute the relative deviation β as in equation (8):
β_t = γ_t / γ̄_t    (8)
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t. Therefore, when β is greater than the threshold, we consider the tracking prediction of the current frame reliable and use the current frame to update the filter template; when β is less than the threshold, we consider the prediction unreliable and keep the filter template of the last reliable frame.
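The following runnable illustration mirrors the observation above: a sharp unimodal response map yields a much larger coefficient of variation than a low, spread-out multi-peak map. The two synthetic maps are assumptions constructed only to show the direction of the effect, not outputs of the invention's filters.

```python
# Synthetic check of the unimodal-vs-dispersed behaviour of gamma_t.
import numpy as np

def gamma(response):
    return response.var() / response.mean()

yy, xx = np.mgrid[0:50, 0:50]
sharp = np.exp(-((xx - 25) ** 2 + (yy - 25) ** 2) / 20.0)            # clear target: one sharp peak
diffuse = 0.3 * np.exp(-((xx - 10) ** 2 + (yy - 40) ** 2) / 200.0) \
        + 0.3 * np.exp(-((xx - 40) ** 2 + (yy - 10) ** 2) / 200.0)    # occlusion: low, spread-out peaks

print(f"gamma(sharp)   = {gamma(sharp):.3f}")    # larger -> frame treated as reliable
print(f"gamma(diffuse) = {gamma(diffuse):.3f}")  # smaller -> keep the previous template
```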
To verify the effectiveness of the invention for target tracking, experiments were performed on the 96 standard videos of the OTB [2] database, and the invention was compared with the eight existing target tracking algorithms based on correlation filtering and deep learning listed in Table 1; our algorithm is denoted DFF.
Table 1. Names, years, and sources of the compared algorithms
According to the test results shown in Fig. 3, the success-rate AUC of the invention on OTB-50 is 0.758 and the precision is 0.608; on OTB-100 the success-rate AUC is 0.599 and the precision is 0.748. As shown in Table 2, the success rate of the invention on the dataset ranks first among the eight compared algorithms; unlike the deep-learning-based CF2 [1] algorithm, the invention can run in real time, and compared with the correlation-filtering-based KCF [5] and DSST [6] it has a clearly higher tracking success rate and precision. Meanwhile, as shown in Figs. 4 to 6, the success-rate results of the invention on the 42 videos with the target deformation attribute and the 13 videos with the out-of-view attribute both rank first.
Table 2. Evaluation results of DFF and the eight compared algorithms on the dataset
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A video target real-time tracking method based on depth feature fusion and adaptive correlation filtering, characterized by comprising the following steps:
step 1, extracting original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
step 2, using a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy; the multi-layer depth feature fusion strategy based on canonical correlation analysis comprises: mapping two layers of depth features extracted independently from the network into a joint feature space by projecting each into a set of feature vectors, computing the maximum correlation between the two sets of feature vectors with canonical correlation analysis, generating two sets of canonical variates accordingly, fusing the canonical variates by point-to-point addition, mapping the result back to the original feature space to form a set of canonical discriminative features, and finally sending the canonical discriminative features to a subsequent correlation filter for calculation;
step 3, using a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target.
2. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 1, wherein step 1 further comprises:
selecting a lightweight VGG-M-2048 deep neural network as the depth feature extractor, removing the last three fully-connected layers when extracting features from the network, and extracting only convolutional-layer features as input to the feature fusion module.
3. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 1, wherein step 3 further comprises:
applying the correlation filter update strategy based on response-value dispersion analysis: computing the correlation filter response map of the current frame during frame-by-frame tracking, defining a coefficient of variation from the information of the response map, normalizing the coefficient of variation along the time dimension, and obtaining its relative deviation; when the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame can be used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept.
4. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 1, wherein step 2 further comprises:
step 2.1, obtaining the third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) from the feature extraction module and flattening each into two dimensions, denoted U, V ∈ R^(169×512); two sets of linear transformations are then considered, mapping U and V into a joint feature space to give U* and V*, where:
U* = A^T U
V* = B^T V
step 2.2, using the Pearson correlation coefficient to measure the correlation between the canonical variates U* and V*, and finding the optimal matrices A and B that maximize the correlation coefficient:
ρ = cov(U*, V*) / √( var(U*) · var(V*) )
where cov(·) denotes covariance and var(·) denotes variance;
step 2.3, defining the covariance matrices of U and V:
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)
step 2.4, converting the goal of canonical correlation analysis into the constrained optimization problem:
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1
step 2.5, solving the optimization problem with the Lagrange multiplier method to obtain:
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B
step 2.6, performing an eigendecomposition of the above equations, finding the largest eigenvalues and taking their square roots to obtain the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V, where λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV); after A and B are obtained, returning to step 2.1 to obtain the canonical variates U* and V*; the canonical variates are added point to point and fused in the joint feature space:
Z = U* + V*
where Z, U*, V* ∈ R^(169×d); finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion.
5. The video target real-time tracking method based on depth feature fusion and adaptive correlation filtering of claim 3, wherein step 3 further comprises:
step 3.1, during frame-by-frame tracking, computing the correlation filter response map of the current frame and defining a coefficient of variation from the information of the response map:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
step 3.2, normalizing the coefficient of variation along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.
6. A video target real-time tracking system based on depth feature fusion and adaptive correlation filtering, characterized by comprising the following modules:
a correlation filtering module for updating the multi-layer kernel correlation filters; the correlation filtering module uses a correlation filter update strategy based on response-value dispersion analysis to calculate the degree of dispersion of the multi-layer filter response maps, so as to adapt to appearance changes of the target and thereby update the multi-layer kernel correlation filters adaptively;
a feature fusion module whose output is fed to the subsequent kernel correlation filters; the feature fusion module uses a multi-layer depth feature fusion strategy based on canonical correlation analysis to obtain canonical discriminative features with high characterization capability and low redundancy;
a feature extraction module whose output is fed to the feature fusion module; the feature extraction module extracts original multi-layer depth features with a lightweight network model, where the layers carry, from deep to shallow, semantic information and texture information of the target respectively;
the feature extraction module further selects a lightweight VGG-M-2048 deep neural network as the depth feature extractor, removes the last three fully-connected layers when extracting features from the network, and extracts only convolutional-layer features as input to the feature fusion module;
the feature fusion module further applies the multi-layer depth feature fusion strategy based on canonical correlation analysis: the two layers of depth features extracted independently from the network are mapped into a joint feature space by projecting each into a set of feature vectors; the maximum correlation between the two sets of feature vectors is computed with canonical correlation analysis, two sets of canonical variates are generated accordingly, the canonical variates are fused by point-to-point addition and mapped back to the original feature space to form a set of canonical discriminative features, which are finally sent to the subsequent correlation filter for calculation;
the third- and fourth-layer convolution features C_3, C_4 ∈ R^(13×13×512) are obtained from the feature extraction module and flattened into two dimensions, denoted U, V ∈ R^(169×512); two sets of linear transformations map U and V into a joint feature space, giving U* and V*, where:
U* = A^T U
V* = B^T V
the correlation between the canonical variates U* and V* is measured with the Pearson correlation coefficient, and the optimal matrices A and B that maximize the correlation coefficient are found:
ρ = cov(U*, V*) / √( var(U*) · var(V*) )
where cov(·) denotes covariance and var(·) denotes variance;
the covariance matrices of U and V are defined as:
S_UU = cov(U, U),  S_VV = cov(V, V),  S_UV = cov(U, V)
the goal of canonical correlation analysis is converted into the constrained optimization problem:
max_{A,B} A^T S_UV B
s.t. A^T S_UU A = 1,  B^T S_VV B = 1
which is solved with the Lagrange multiplier method to obtain:
S_UU^(-1) S_UV S_VV^(-1) S_VU A = λ^2 A
S_VV^(-1) S_VU S_UU^(-1) S_UV B = λ^2 B
eigendecomposing the above equations, finding the largest eigenvalues and taking their square roots yields the eigenvectors A and B corresponding to the leading eigenvalues, i.e. the transformation matrices of U and V, where λ denotes a diagonal eigenvalue matrix with d non-zero eigenvalues and d = rank(S_UV); after A and B are obtained, the canonical variates U* and V* follow from the transformations above; the canonical variates are added point to point and fused in the joint feature space:
Z = U* + V*
where Z, U*, V* ∈ R^(169×d); finally, Z is mapped back to the original feature space to obtain the canonical discriminative feature F ∈ R^(13×13×d), which completes feature fusion.
7. The video target real-time tracking system based on depth feature fusion and adaptive correlation filtering of claim 6, wherein:
the correlation filtering module further applies the correlation filter update strategy based on response-value dispersion analysis: during frame-by-frame tracking, the correlation filter response map of the current frame is computed, a coefficient of variation is defined from the information of the response map, the coefficient of variation is normalized along the time dimension, and its relative deviation is obtained; when the relative deviation is greater than a threshold, the tracking prediction of the current frame is considered reliable and the current frame is used to update the filter template; when the relative deviation is less than the threshold, the prediction is considered unreliable and the filter template of the last reliable frame is kept;
during frame-by-frame tracking, the correlation filter response map of the current frame is computed, and the coefficient of variation is defined from its information as:
γ_t = σ_t / μ_t
where σ_t is the variance of the t-th frame filter response map and μ_t is the mean of the t-th frame filter response map;
the coefficient of variation is normalized along the time dimension to obtain the relative deviation:
β_t = γ_t / γ̄_t
where
γ̄_t = (1 / (t − 1)) · Σ_{i=1}^{t−1} γ_i
denotes the mean of the coefficients of variation over all frames before frame t.

Priority Applications (1)

Application Number — Priority Date — Filing Date — Title
CN202010157649.XA — 2020-03-09 — 2020-03-09 — Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering


Publications (2)

Publication Number — Publication Date
CN111401178A (en) — 2020-07-10
CN111401178B (en) — 2023-06-13

Family

ID=71430587

Family Applications (1)

Application Number — Status — Grant Publication — Priority Date — Filing Date — Title
CN202010157649.XA — Active — CN111401178B (en) — 2020-03-09 — 2020-03-09 — Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering

Country Status (1)

Country — Link
CN — CN111401178B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
CN112270228A (en)* — 2020-10-16 — 2021-01-26 — Xi'an Polytechnic University — A pedestrian re-identification method based on DCCA fusion features
CN112541468B (en)* — 2020-12-22 — 2022-09-06 — National University of Defense Technology — A target tracking method based on dual-template response fusion
CN113538509B (en)* — 2021-06-02 — 2022-09-27 — Tianjin University — Visual tracking method and device based on adaptive correlation filtering feature fusion learning
CN117893574A (en)* — 2024-03-14 — 2024-04-16 — Dalian University of Technology — Infrared unmanned aerial vehicle target tracking method based on correlation filtering convolutional neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
WO2015054688A2 (en)* — 2013-10-11 — 2015-04-16 — Seno Medical Instruments, Inc. — Systems and methods for component separation in medical imaging
CN108288283A (en)* — 2018-01-22 — 2018-07-17 — Yangzhou University — A video tracking method based on correlation filtering
CN108921872A (en)* — 2018-05-15 — 2018-11-30 — Nanjing University of Science and Technology — A robust visual target tracking method suitable for long-range tracking
CN109741366A (en)* — 2018-11-27 — 2019-05-10 — Kunming University of Science and Technology — A correlation filtering target tracking method fusing multi-layer convolution features

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
US11055854B2 (en)* — 2018-08-23 — 2021-07-06 — Seoul National University R&DB Foundation — Method and system for real-time target tracking based on deep learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wei Yongqiang et al., "Kernelized correlation filter visual tracking with deep features," http://kns.cnki.net/kcms/detail/11.2127.TP.20190828.1416.002.html, 2019, pp. 1-12.*

Also Published As

Publication number — Publication date
CN111401178A (en) — 2020-07-10

Similar Documents

Publication — Publication Date — Title
CN111401178B (en)Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
CN111144364B (en)Twin network target tracking method based on channel attention updating mechanism
JP6807471B2 (en) Semantic segmentation model training methods and equipment, electronics, and storage media
CN111080675B (en)Target tracking method based on space-time constraint correlation filtering
US8989442B2 (en)Robust feature fusion for multi-view object tracking
US8423485B2 (en)Correspondence learning apparatus and method and correspondence learning program, annotation apparatus and method and annotation program, and retrieval apparatus and method and retrieval program
CN110210551A (en)A kind of visual target tracking method based on adaptive main body sensitivity
CN113657560A (en)Weak supervision image semantic segmentation method and system based on node classification
CN111429485B (en) Cross-modal filter tracking method based on adaptive regularization and high confidence update
CN108875655A (en)A kind of real-time target video tracing method and system based on multiple features
CN112329784A (en)Correlation filtering tracking method based on space-time perception and multimodal response
CN113052873A (en)Single-target tracking method for on-line self-supervision learning scene adaptation
CN111862167B (en)Rapid robust target tracking method based on sparse compact correlation filter
CN110458784A (en)It is a kind of that compression noise method is gone based on image perception quality
US12307746B1 (en)Nighttime unmanned aerial vehicle object tracking method fusing hybrid attention mechanism
CN107798329B (en)CNN-based adaptive particle filter target tracking method
CN109741364A (en) Target tracking method and device
CN113674218A (en)Weld characteristic point extraction method and device, electronic equipment and storage medium
CN117392176A (en) Pedestrian tracking method and system for video surveillance, computer-readable medium
CN105654518B (en)A kind of trace template adaptive approach
CN110070002A (en)A kind of Activity recognition method based on 3D convolutional neural networks
CN117523626A (en) Pseudo RGB-D face recognition method
CN108280845B (en) A scale-adaptive target tracking method for complex backgrounds
CN110135435A (en) A method and device for saliency detection based on extensive learning system
CN118334610A (en) A method, device and equipment for identifying target objects in an autonomous driving scene

Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
