Technical Field
The invention belongs to the field of computer vision and relates to a target tracking method, in particular to a long-term occlusion-robust tracking method based on convolutional features and global search detection.
Background Art
The main task of target tracking is to obtain the position and motion information of a specific target in a video sequence; it has a wide range of applications in video surveillance, human-computer interaction and other fields. During tracking, factors such as illumination changes, cluttered backgrounds, and rotation or scaling of the target all increase the difficulty of the problem, and when the target is occluded for a long time, tracking failure becomes especially likely.
The tracking method (TLD for short) proposed in "Tracking-learning-detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(7): 1409-1422" was the first to combine a traditional tracking algorithm with a detection algorithm, using the detection results to refine the tracking results and thereby improving the reliability and robustness of the system. Its tracking algorithm is based on optical flow, and its detection algorithm generates a large number of detection windows, each of which must be accepted by three detectors before it can become part of the final detection result. For the occlusion problem, TLD offers a practical and effective approach that allows long-term tracking of the target. However, TLD uses shallow hand-crafted features, whose ability to represent the target is limited, and the design of its detection algorithm is rather complex, so there is room for improvement.
Summary of the Invention
Technical Problem to Be Solved
To overcome the deficiencies of the prior art, the present invention proposes a long-term occlusion-robust tracking method based on convolutional features and global search detection. It addresses the problem that, while tracking a moving target in video, long-term occlusion, the target moving out of the field of view, and similar factors cause the appearance model to drift, which easily leads to tracking failure.
Technical Solution
A long-term occlusion-robust tracking method based on convolutional features and global search detection, characterized by the following steps:
Step 1: Read the first frame of the video and the initial position information of the target, [x, y, w, h], where x and y are the horizontal and vertical coordinates of the target center and w and h are the width and height of the target. Denote the coordinate point (x, y) as P, denote the initial target region of size w×h centered on P as R_init, and denote the scale of the target as scale, initialized to 1.
Step 2: With P as the center, determine a region R_bkg containing both the target and background information; the size of R_bkg is M×N, with M = 2w and N = 2h. Using VGGNet-19 as the CNN model, extract the convolutional feature map z_target_init of this region at the conv5-4 convolutional layer. Then build the target model of the tracking module (one filter per feature channel t ∈ {1, 2, ..., T}, where T is the number of channels of the CNN feature map) from z_target_init. The calculation is as follows:
Here the uppercase variables are the frequency-domain representations of the corresponding lowercase variables; m and n are the arguments of the Gaussian filter template, with m ∈ {1, 2, ..., M} and n ∈ {1, 2, ..., N}; σ_target is the bandwidth of the Gaussian kernel; ⊙ denotes element-wise multiplication; an overline denotes the complex conjugate; and λ1 is a regularization parameter (introduced to keep the denominator from being zero), set to 0.0001.
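A plausible reconstruction of the target-model equations, assuming the standard multi-channel correlation-filter (MOSSE/DCF-style) formulation and using g for the Gaussian filter template and W_target^t for the per-channel model purely as notational assumptions, is:

```latex
g(m,n) = \exp\!\left(-\frac{(m-M/2)^2 + (n-N/2)^2}{2\,\sigma_{target}^{2}}\right),
\qquad
W_{target}^{t} \;=\; \frac{\bar{G} \odot Z_{target\_init}^{t}}
{\sum_{t'=1}^{T} Z_{target\_init}^{t'} \odot \bar{Z}_{target\_init}^{t'} \;+\; \lambda_{1}}
```

Under this reading, G and Z_target_init^t are the 2-D DFTs of g and of channel t of z_target_init, which is consistent with the symbols defined above.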
Step 3: With P as the center, extract S image sub-blocks at different scales, with S set to 33. Each sub-block has size w×h×s, where the variable s is the scale factor of the sub-block, s ∈ [0.7, 1.4]. Then extract the HOG features of each image sub-block and concatenate them into an S-dimensional HOG feature vector, referred to here as the scale feature vector and denoted z_scale_init. Finally, build the scale model W_scale of the tracking module from z_scale_init; the calculation is analogous to that of the target model in step 2 (with the scale feature vector taking the place of the convolutional feature map), specifically:
Here s' is the argument of the Gaussian function, s' ∈ {1, 2, ..., S}; σ_scale is the bandwidth of the Gaussian kernel; and λ2 is a regularization parameter, set to 0.0001.
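As in step 2, the scale-model equations can be sketched under the same correlation-filter assumption, here applied to the single S-dimensional scale feature vector (the template name g_scale is a notational assumption):

```latex
g_{scale}(s') = \exp\!\left(-\frac{(s'-S/2)^2}{2\,\sigma_{scale}^{2}}\right),
\qquad
W_{scale} \;=\; \frac{\bar{G}_{scale} \odot Z_{scale\_init}}
{Z_{scale\_init} \odot \bar{Z}_{scale\_init} \;+\; \lambda_{2}}
```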
Step 4: Extract grayscale features from the initial target region R_init; the resulting grayscale feature representation is a two-dimensional matrix, referred to here as the target appearance representation matrix and denoted A_k, where the subscript k is the current frame number, initially k = 1. Then initialize the filter model D of the detection module to A_1, i.e., D = A_1, and initialize the set of historical target representation matrices A_his. The purpose of A_his is to store the target appearance representation matrix of the current frame and of every previous frame, i.e., A_his = {A_1, A_2, ..., A_k}; initially A_his = {A_1}.
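As an illustration of step 4, the following minimal Python sketch initializes the detection filter D and the history set A_his from the grayscale appearance of R_init; OpenCV and NumPy are assumed, and the function name and the BGR input format are illustrative choices rather than part of the method:

```python
import cv2
import numpy as np

def init_detection_module(frame_bgr, x, y, w, h):
    """Step 4 sketch: build A_1 from the grayscale patch R_init, set D = A_1,
    and start the history set A_his = {A_1}."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    top, left = int(round(y - h / 2)), int(round(x - w / 2))
    A1 = gray[top:top + h, left:left + w]   # target appearance representation matrix A_k (k = 1)
    D = A1.copy()                           # detection filter model D = A_1
    A_his = [A1]                            # history of appearance matrices
    return D, A_his
```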
Step 5: Read the next frame of the video and, still with P as the center, extract the scaled target search region of size R_bkg × scale. Extract the convolutional features of this search region with the CNN of step 2 and resample them to the size of R_bkg by bilinear interpolation to obtain the convolutional feature map z_target_cur of the current frame, then use the target model to compute the target confidence map f_target as follows:
In this computation, the final operation is the inverse Fourier transform, which returns the response to the spatial domain. Finally, update the coordinates of P by setting (x, y) to the coordinates of the maximum response value in f_target:
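One way to realize the conv5-4 feature extraction and the confidence-map computation of step 5 is sketched below with PyTorch/torchvision and NumPy. The layer index 34 for conv5-4 in torchvision's VGG-19, the omission of ImageNet mean/std normalization, and the assumption that the target model is stored in the frequency domain with any required conjugation already applied are all assumptions of this sketch, not statements of the patent:

```python
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models

# VGG-19 truncated after conv5-4 (index 34 of torchvision's feature stack)
backbone = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:35].eval()

def conv5_4_features(patch_rgb, out_size):
    """Extract the conv5-4 feature map of an RGB patch (H, W, 3, floats in [0, 1])
    and resample it to out_size = (M, N) by bilinear interpolation."""
    x = torch.from_numpy(patch_rgb).permute(2, 0, 1).unsqueeze(0).float()
    with torch.no_grad():
        feat = backbone(x)                                            # (1, T, m', n')
    feat = F.interpolate(feat, size=out_size, mode="bilinear", align_corners=False)
    return feat.squeeze(0).numpy()                                    # (T, M, N)

def target_confidence_map(W_target_freq, z_target_cur):
    """Sum the per-channel filter responses, transform back to the spatial domain,
    and return the confidence map together with its peak location (the new P)."""
    Z = np.fft.fft2(z_target_cur, axes=(-2, -1))
    f_target = np.real(np.fft.ifft2(np.sum(W_target_freq * Z, axis=0)))
    py, px = np.unravel_index(np.argmax(f_target), f_target.shape)
    return f_target, (px, py)
```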
Step 6: With P as the center, extract S image sub-blocks at different scales, extract the HOG features of each sub-block, and concatenate them to obtain the scale feature vector z_scale_cur of the current frame (computed in the same way as z_scale_init in step 3). Then use the scale model W_scale to compute the scale confidence map:
Finally, update the scale of the target, scale, as follows:
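Read as a correlation-filter scale search, the scale equations can be sketched as follows; whether the previous scale is multiplied by the best factor, as written here, is an assumption of this sketch:

```latex
f_{scale} \;=\; \mathcal{F}^{-1}\!\left( W_{scale} \odot Z_{scale\_cur} \right),
\qquad
scale \;\leftarrow\; scale \cdot s\!\left(\arg\max_{s'} f_{scale}(s')\right)
```

Here s(·) denotes the scale factor (in [0.7, 1.4]) of the sub-block whose index gives the largest response in f_scale.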
At this point the output of the tracking module for the current frame (frame k) is available: the image sub-block TPatch_k of size R_init × scale centered on P at coordinates (x, y). In addition, the maximum response value of the already computed f_target is abbreviated TPeak_k, i.e., TPeak_k = f_target(x, y).
Step 7: The detection module convolves the filter model D with the entire image of the current frame in a global search, computing the similarity between the filter model D and every position in the current frame. Take the j values with the highest responses (j is set to 10) and, with the position corresponding to each of these j values as a center, extract j image sub-blocks of size R_init × scale. These j image sub-blocks form a set DPatches_k, which is the output of the detection module for frame k.
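A minimal sketch of the global search in step 7, assuming OpenCV's normalized cross-correlation as the concrete similarity measure (the method only specifies correlating D with the whole frame, so matchTemplate is an illustrative choice) and omitting non-maximum suppression:

```python
import cv2
import numpy as np

def global_search_detect(frame_gray, D, j=10):
    """Step 7 sketch: slide the (rescaled) filter model D over the whole frame,
    take the j highest-response positions and return their window centers.
    frame_gray and D are float32 grayscale arrays, D smaller than the frame."""
    response = cv2.matchTemplate(frame_gray, D, cv2.TM_CCOEFF_NORMED)
    top = np.argsort(response.ravel())[::-1][:j]      # indices of the j largest responses
    ys, xs = np.unravel_index(top, response.shape)
    h, w = D.shape
    return [(int(x) + w // 2, int(y) + h // 2) for x, y in zip(xs, ys)]
```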
Step 8: For each image sub-block in the set DPatches_k output by the detection module, compute the pixel overlap ratio between that sub-block and the TPatch_k output by the tracking module, which yields j values; retain the highest of them. If this highest overlap ratio is smaller than the threshold (set to 0.05), the target is judged to be completely occluded, the learning rate β used by the tracking module when updating its models must be suppressed, and the method proceeds to step 9; otherwise the update is performed with the initial learning rate β_init (set to 0.02) and the method proceeds to step 10. β is computed as follows:
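Step 8 can be illustrated with the sketch below. The overlap is computed as intersection-over-union of the two boxes; because the exact suppression formula for β is not reproduced above, scaling β_init by the overlap is only one plausible form of the suppression:

```python
def overlap_ratio(box_a, box_b):
    """Pixel overlap (intersection over union) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / float(union) if union > 0 else 0.0

def learning_rate(best_overlap, beta_init=0.02, threshold=0.05):
    """Occlusion test of step 8: below the threshold the update is suppressed
    (the scaling by best_overlap is an assumed form of suppression)."""
    return beta_init if best_overlap >= threshold else beta_init * best_overlap
```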
Step 9: Using the center of each image sub-block in DPatches_k, extract j target search regions of size R_bkg × scale, and for each of them extract a convolutional feature map and compute a target confidence map by the method of step 5; this yields the maximum response value over each of the j target search regions. Compare these j response values and denote the largest DPeak_k. If DPeak_k is greater than TPeak_k, update the coordinates of P again, setting (x, y) to the coordinates corresponding to DPeak_k, and recompute the target scale feature vector and the target scale, scale (in the same way as in step 6).
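The decision rule at the heart of step 9 is a comparison between the tracker's peak TPeak_k and the best detector-side peak DPeak_k; a small sketch (function and argument names are illustrative):

```python
import numpy as np

def fuse_tracker_and_detector(t_peak, t_center, d_peaks, d_centers):
    """Step 9 sketch: if the best confidence among the detected candidates (DPeak_k)
    exceeds the tracker's TPeak_k, move P to the detector's estimate; otherwise
    keep the tracker's result."""
    best = int(np.argmax(d_peaks))
    if d_peaks[best] > t_peak:
        return d_centers[best], float(d_peaks[best])
    return t_center, float(t_peak)
```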
Step 10: The optimal position center of the target in the current frame is determined as P, and the optimal scale as scale. Mark the new target region R_new in the image, i.e., a rectangular box centered on P with width w×scale and height h×scale. In addition, abbreviate as z_target the already computed convolutional feature map from which the optimal target position center P was obtained; likewise, abbreviate as z_scale the scale feature vector from which the optimal target scale was obtained.
Step 11: Using z_target and z_scale together with the target model and the scale model W_scale of the tracking module established in the previous frame, update the two models by weighted summation as follows:
Here β is the learning rate computed in step 8.
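Assuming the usual linear-interpolation form of correlation-filter model updating, the weighted summation of step 11 can be written as the following sketch, where the hatted models are the ones computed from z_target and z_scale in the current frame:

```latex
W_{target}^{t} \;\leftarrow\; (1-\beta)\, W_{target}^{t} + \beta\, \widehat{W}_{target}^{t},
\qquad
W_{scale} \;\leftarrow\; (1-\beta)\, W_{scale} + \beta\, \widehat{W}_{scale}
```

With β suppressed under complete occlusion, the models are effectively frozen, which is the drift-prevention behavior described in step 8.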
Step 12: Extract grayscale features from the new target region R_new to obtain the target appearance representation matrix A_k of the current frame, and add A_k to the set of historical target representation matrices A_his. If the number of elements in A_his is greater than c (c is set to 20), randomly select c elements from A_his to form a three-dimensional matrix C_k, each slice of which corresponds to one element of A_his (i.e., one two-dimensional matrix A_k); otherwise form C_k from all elements of A_his. Then average C_k to obtain a two-dimensional matrix, which serves as the new filter model D of the detection module, computed as follows:
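Step 12 can be sketched as follows; the random sampling of at most c stored matrices and the simple mean follow the prose above, while the function and variable names are illustrative:

```python
import random
import numpy as np

def update_detection_filter(A_his, A_k, c=20):
    """Step 12 sketch: append the current appearance matrix A_k to the history,
    stack at most c stored matrices into C_k and average them to obtain the new D."""
    A_his.append(A_k)
    chosen = random.sample(A_his, c) if len(A_his) > c else A_his
    C_k = np.stack(chosen, axis=2)       # three-dimensional matrix C_k; C_k[:, :, i] is one stored A
    D = C_k.mean(axis=2)                 # new filter model D of the detection module
    return D, A_his
```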
Step 13: Check whether all image frames of the video have been processed. If so, the algorithm ends; otherwise return to step 5 and continue.
Beneficial Effects
In the long-term occlusion-robust tracking method based on convolutional features and global search detection proposed by the present invention, a tracking module and a detection module are designed and work together during tracking. The tracking module uses a convolutional neural network (CNN) to extract convolutional features of the target for building a robust target model, builds a scale model from Histogram of Oriented Gradients (HOG) features, and determines the position center and the scale of the target by correlation filtering. The detection module extracts grayscale features to build a filter model of the target and detects the target quickly over the entire image by global search, thereby judging whether occlusion has occurred. Once the target is completely occluded (or its appearance changes drastically for other reasons), the detection module uses the detection results to correct the position of the tracked target and suppresses the model update of the tracking module, preventing the introduction of unnecessary noise that would cause model drift and tracking failure.
Advantages: The use of convolutional features and a multi-scale correlation filtering method in the tracking module strengthens the feature representation of the tracked target's appearance model, making the tracking results highly robust to illumination changes, target scale changes, target rotation and similar factors. In addition, the introduced global search detection mechanism allows the detection module to find the target again when long-term occlusion has caused tracking to fail, so the tracking module can recover from the error and the target can be tracked continuously over a long period even when its appearance changes.
Description of the Drawings
Figure 1: Flow chart of the long-term occlusion-robust tracking method based on convolutional features and global search detection.
Detailed Description of the Embodiments
The present invention is now further described with reference to the embodiments and the accompanying drawing:
In a specific embodiment, the method is carried out according to steps 1 to 13 exactly as set out in the Technical Solution above; the overall processing flow is shown in Figure 1.