CN110728694B

Movatterモバイル変換

Info

Publication number: CN110728694B
Application number: CN201910956780.XA
Authority: CN
Inventors: 张辉; 朱牧; 张菁; 卓力; 齐天卉; 张磊
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2019-10-10
Filing date: 2019-10-10
Publication date: 2023-11-24
Anticipated expiration: 2039-10-10
Also published as: CN110728694A

Abstract

The invention relates to a long-term visual target tracking method based on continuous learning. The method is characterized in that a deep neural network structure is designed for long-time visual target tracking, an initialized network model is obtained through model initialization, then the initialized network model is utilized for online tracking, long-time or short-time model updating is carried out in the tracking process by utilizing a continuous learning method, and the method is suitable for various changes of targets in the tracking process. According to the invention, the online updating process of the traditional visual target tracking model is converted into the continuous learning process, and the complete appearance description of the target is integrally built from all the historical data of the video, so that the robustness of long-term visual tracking is effectively improved. The method can provide an effective solution for long-time visual target tracking for application requirements such as intelligent video monitoring, man-machine interaction, visual navigation and the like.

Description

Translated fromChinese

一种基于持续学习的长时视觉目标跟踪方法A long-term visual target tracking method based on continuous learning

技术领域Technical field

本发明属于计算机视觉和图像视频处理领域，特别涉及一种基于持续学习的长时视觉目标跟踪方法。The invention belongs to the field of computer vision and image and video processing, and particularly relates to a long-term visual target tracking method based on continuous learning.

背景技术Background technique

视觉目标跟踪是计算机视觉、图像视频处理中的一个基础问题，在监控视频自动分析、人机交互、视觉导航等领域有着广泛的应用。按照视频序列帧的长度，跟踪方法大致可分为两大类：短时目标跟踪和长时目标跟踪。一般当被跟踪的视频序列帧长度大于1000帧时，我们称之为长时目标跟踪。目前短时跟踪算法在相对较短的视频数据上已经取得了较好的性能，但是将其直接应用到现实长时视频序列的处理上，跟踪的精度和鲁棒性都还远达不到实际场景的指标需求。Visual target tracking is a basic problem in computer vision and image and video processing. It has been widely used in automatic analysis of surveillance videos, human-computer interaction, visual navigation and other fields. According to the length of video sequence frames, tracking methods can be roughly divided into two categories: short-term target tracking and long-term target tracking. Generally, when the frame length of the tracked video sequence is greater than 1000 frames, we call it long-term target tracking. At present, short-term tracking algorithms have achieved good performance on relatively short video data, but when they are directly applied to the processing of real long-term video sequences, the tracking accuracy and robustness are far from practical. Indicator requirements of the scenario.

在长时跟踪任务中，除了要面对短时场景中如目标尺度变化、光照变化、目标变形等常见挑战外，还需要解决频繁的“消失后再现”目标的稳健锁定难题。因此，与传统的短时跟踪相比，长时跟踪具有更大的挑战性，更符合各种应用场景的实际需求。然而，目前面向这类长时数据的跟踪技术较为欠缺，已有方法的性能也十分有限。一种现有的长时跟踪思路是将传统跟踪和传统目标检测方法相结合，来解决跟踪中目标发生形变、部分遮挡等问题。同时，通过在线学习机制不断更新跟踪模块的“显著特征点”和检测模块的目标模型及相关参数，从而使得跟踪效果更加鲁棒、可靠。此外，也有方法利用关键点匹配跟踪和鲁棒估计技术，能够把长时记忆给集成起来，并且可以为输出控制提供额外的信息。上述跟踪方法可以在整帧图像中搜索目标，但由于仅采用了手工设计的简单特征，其性能并不理想。最近，一些基于相关滤波和深度学习的跟踪方法被提出，虽然有用于长时跟踪的重新检测方案，但都局限于仅在图像局部范围内进行搜索，因此在目标出视野后无法再次捕获它，并不能胜任长时跟踪任务需求。In long-term tracking tasks, in addition to facing common challenges in short-term scenes such as target scale changes, illumination changes, and target deformations, it is also necessary to solve the problem of robust locking of targets that frequently "disappear and then reappear." Therefore, compared with traditional short-term tracking, long-term tracking is more challenging and more in line with the actual needs of various application scenarios. However, tracking technology for this type of long-term data is currently lacking, and the performance of existing methods is also very limited. An existing long-term tracking idea is to combine traditional tracking and traditional target detection methods to solve problems such as target deformation and partial occlusion during tracking. At the same time, the "salient feature points" of the tracking module and the target model and related parameters of the detection module are continuously updated through the online learning mechanism, thereby making the tracking effect more robust and reliable. In addition, there are also methods that use key point matching tracking and robust estimation technology to integrate long-term memory and provide additional information for output control. The above tracking method can search for targets in the entire frame image, but its performance is not ideal because it only uses simple hand-designed features. Recently, some tracking methods based on correlation filtering and deep learning have been proposed. Although there are re-detection schemes for long-term tracking, they are limited to searching only within the local range of the image, so it is impossible to capture the target again after it goes out of the field of view. It is not capable of long-term tracking task requirements.

从技术发展的现状看，基于深度卷积神经网络图像分类的视觉目标跟踪方法具有将目标与杂乱的背景有效区分的巨大潜力，基于此类框架的跟踪方法有着广阔的发展前景。但是，仅采用离线训练的跟踪模型通常难以适应视频的在线变化，而简单地用新数据频繁更新模型又会加速跟踪漂移，导致其在处理长时跟踪问题时很容易失败。本发明则通过持续学习方法平衡模型的历史记忆与在线更新，提出了一种基于持续学习的长时视觉目标跟踪方法。Judging from the current status of technology development, visual target tracking methods based on deep convolutional neural network image classification have great potential to effectively distinguish targets from cluttered backgrounds. Tracking methods based on such frameworks have broad development prospects. However, tracking models that only use offline training are often difficult to adapt to online changes in videos, and simply updating the model frequently with new data will accelerate tracking drift, making it easy to fail when dealing with long-term tracking problems. The present invention balances the historical memory and online updating of the model through a continuous learning method, and proposes a long-term visual target tracking method based on continuous learning.

发明内容Contents of the invention

本发明利用持续学习理论，将视觉目标跟踪方法的模型在线更新转换为一个持续学习过程，在整个视频序列中学习时序图像的有效抽象与表征，建立目标的完整画像。最终适应跟踪过程中目标变形、背景干扰、遮挡和光照变化等情况，达到提升现有跟踪方法在线更新时的适应性与可靠性，降低模型对目标变形、遮挡等噪声的敏感度，达到长时稳健跟踪目标的目的。The present invention uses continuous learning theory to convert the model online update of the visual target tracking method into a continuous learning process, learns effective abstraction and representation of time-series images in the entire video sequence, and establishes a complete portrait of the target. Ultimately, it adapts to target deformation, background interference, occlusion, and illumination changes during the tracking process, thereby improving the adaptability and reliability of online updates of existing tracking methods, reducing the sensitivity of the model to noise such as target deformation, occlusion, etc., and achieving long-term performance. The purpose of robust tracking of targets.

本发明是采用以下技术手段实现的：一种基于持续学习的长时视觉目标跟踪方法，主要包括网络模型设计、模型初始化、在线跟踪和模型更新四部分。The present invention is implemented by using the following technical means: a long-term visual target tracking method based on continuous learning, which mainly includes four parts: network model design, model initialization, online tracking and model update.

网络模型设计：首先根据附图1所示的整体流程设计了深度神经网络结构；然后将该网络各阶段特征图调整成自适应尺寸。Network model design: First, the deep neural network structure is designed according to the overall process shown in Figure 1; then the feature maps of each stage of the network are adjusted to adaptive sizes.

模型初始化：主要包括3个步骤：初始帧分割图像获取；模型初始化训练样本库生成；模型初始化训练及模型获取。其中，模型初始化训练及模型获取阶段包括损失函数、梯度下降法的选取。Model initialization: mainly includes three steps: initial frame segmentation image acquisition; model initialization training sample library generation; model initialization training and model acquisition. Among them, the model initialization training and model acquisition stages include the selection of loss function and gradient descent method.

在线跟踪：主要包括3个步骤：生成候选样本；获取最佳候选样本；使用目标框回归定位目标区域。Online tracking: mainly includes 3 steps: generating candidate samples; obtaining the best candidate samples; and using target frame regression to locate the target area.

模型更新：主要包括3个步骤：更新方式选择；模型更新样本库的生成与更新；持续学习方式模型训练及模型获取。其中，样本库生成中包括在线样本集和记忆感知样本集获取；样本库更新中包括在线样本集和记忆感知样本集更新；持续学习方式模型训练及模型获取阶段包括损失函数、梯度下降法的选取。Model update: mainly includes three steps: update method selection; generation and update of model update sample library; continuous learning method model training and model acquisition. Among them, the sample library generation includes the acquisition of online sample sets and memory-aware sample sets; the sample library update includes the update of online sample sets and memory-aware sample sets; the continuous learning mode model training and model acquisition stages include the selection of loss functions and gradient descent methods. .

所述的网络模型设计，具体步骤如下：The specific steps for the network model design are as follows:

(1)本发明设计的深度神经网络结构：如附图2所示，本发明的网络结构由共享层和分类层组成。其中，共享层包括3个卷积层、2个最大值池化层、2个全连接层和5个非线性激活ReLU层。卷积层与通用VGG-M网络的相应部分相同。接下来的两个完全连接的层各有512个输出单元，并结合了ReLU和Dropouts模块。分类层是包含了Dropouts模块和具有softmax损失的二值分类层，负责区分目标和背景。(1) Deep neural network structure designed by the present invention: As shown in Figure 2, the network structure of the present invention consists of a sharing layer and a classification layer. Among them, the shared layer includes 3 convolutional layers, 2 maximum pooling layers, 2 fully connected layers and 5 nonlinear activation ReLU layers. The convolutional layers are identical to the corresponding parts of the general VGG-M network. The next two fully connected layers have 512 output units each and combine ReLU and Dropouts modules. The classification layer includes the Dropouts module and the binary classification layer with softmax loss, which is responsible for distinguishing the target and the background.

在卷积神经网络CNN的图像处理过程中，卷积层之间需要通过卷积滤波器联系，卷积滤波器的定义表示为N×C×W×H，其中N代表卷积滤波器的种类，C代表被滤波通道的通道数；W、H分别代表滤波范围的宽、高。In the image processing process of convolutional neural network CNN, the convolutional layers need to be connected through convolutional filters. The definition of convolutional filters is expressed as N×C×W×H, where N represents the type of convolutional filter. , C represents the number of filtered channels; W and H represent the width and height of the filtering range respectively.

(2)本发明在持续学习的长时目标跟踪过程中，各卷积层输入和输出特征图的变化如下：(2) In the long-term target tracking process of continuous learning of this invention, the changes in the input and output feature maps of each convolution layer are as follows:

本发明在跟踪过程中，将不同尺寸的图像统一成3×107×107的图像后输入网络，在第一个卷积层中，先经过96个7×7的卷积核后，再经过非线性激活层ReLU和局部响应归一化层输出通道数为96，最后经过最大值池化层得到96×25×25的特征图；在第二卷积层中，输入大小为96×25×25的特征图，先经过256个5×5的卷积核后，再经过非线性激活层ReLU和局部响应归一化层输出通道数为256，最后经过最大值池化层得到256×5×5的特征图；在第三卷积层中，输入大小为256×5×5的特征图，先经过512个3×3的卷积核后，再经过非线性激活层ReLU得到512×3×3的特征图；在第四个全连接层中，输入大小为512×3×3的特征图，先经过512神经单元，再经过非线性激活层ReLU得到512维的特征向量；在第五个全连接层中，输入大小为512维的特征向量，先经过512神经单元，再经过Dropouts层，最后通过非线性激活层ReLU得到512维的特征向量；在分类层中，将大小为512为特征向量，先经过Dropouts层，再输入一个具有softmax损失的二值分类层，最后输出大小为2维的分类得分。During the tracking process, the present invention unifies images of different sizes into 3×107×107 images and then inputs them into the network. In the first convolution layer, it first passes through 96 7×7 convolution kernels, and then passes through non- The number of output channels of the linear activation layer ReLU and the local response normalization layer is 96, and finally the feature map of 96×25×25 is obtained through the maximum pooling layer; in the second convolution layer, the input size is 96×25×25 The feature map of The feature map of The feature map of In the connection layer, a 512-dimensional feature vector is input, which first passes through the 512 neural unit, then through the Dropouts layer, and finally obtains a 512-dimensional feature vector through the nonlinear activation layer ReLU; in the classification layer, the 512-dimensional feature vector is , first passes through the Dropouts layer, then inputs a binary classification layer with softmax loss, and finally outputs a 2-dimensional classification score.

所述的模型初始化，具体步骤如下：The specific steps for initializing the model are as follows:

(1)初始帧分割图像获取：初始帧模板的质量对当前的跟踪结果有着重要的影响。为了增加被跟踪目标的详细表示，通过Simple Linear Iterative Clustering(SLIC)超像素分割方法来应用超像素级分割，使得分割后的图像不仅在颜色和纹理上与目标一致，而且还保留了目标的结构信息，如附图3所示。(1) Initial frame segmentation image acquisition: The quality of the initial frame template has an important impact on the current tracking results. In order to increase the detailed representation of the tracked target, superpixel-level segmentation is applied through the Simple Linear Iterative Clustering (SLIC) superpixel segmentation method, so that the segmented image is not only consistent with the target in color and texture, but also retains the structure of the target. information, as shown in Figure 3.

(2)训练样本库的生成：分别在第一帧原始图像和分割图像的初始目标位置周围随机采样抽取N₁个样本。这些样本根据它们与真实标注框(ground truth)的交并比分数被标记为正样本(0.7～1.0之间)和负样本(0～0.5之间)。(2) Generation of training sample library: randomly sample N₁ samples around the initial target position of the first frame of the original image and the segmented image. These samples are labeled as positive samples (between 0.7 and 1.0) and negative samples (between 0 and 0.5) according to their intersection scores with the ground truth.

(3)模型初始化训练及模型获取：在跟踪序列的初始帧，对网络最后输出的分类得分，采用二分类交叉熵损失作为损失函数求其损失，然后使用梯度下降法对网络全连接层参数进行更新。其中，训练全连接层进行H₁(50次)迭代，全连接FC4-5层的学习率设置为0.0005，分类层FC6层的学习率设置为0.005；动量和权重衰减分别设置为0.9和0.0005；每小批由M⁺(32)个正样本和从M^-(1024)个负样本中选出的(96)个难分负样本组成；最后，经过反复迭代，当达到H₁(50次)迭代时停止训练，获得网络初始化模型。(3) Model initialization training and model acquisition: In the initial frame of the tracking sequence, use the binary cross-entropy loss as the loss function to calculate the classification score finally output by the network, and then use the gradient descent method to calculate the fully connected layer parameters of the network. renew. Among them, the fully connected layer is trained for H₁ (50 times) iterations, the learning rate of the fully connected FC4-5 layer is set to 0.0005, and the learning rate of the classification layer FC6 layer is set to 0.005; the momentum and weight attenuation are set to 0.9 and 0.0005 respectively; Each mini-batch consists of M⁺ (32) positive samples and M^- (1024) negative samples. (96) difficult-to-distinguish negative samples; finally, after repeated iterations, the training is stopped when it reaches H₁ (50 times) iterations, and the network initialization model is obtained.

所述的在线跟踪，具体步骤如下：The specific steps for online tracking are as follows:

(1)目标候选样本生成：给定视频序列中的每一帧，首先围绕前一帧中目标的预测位置绘制N₂个候选样本。(1) Target candidate sample generation: Given each frame in the video sequence, N₂ candidate samples are first drawn around the predicted position of the target in the previous frame.

(2)获取最佳候选样本：将步骤(1)获取的N₂个候选样本送入当前的网络模型中计算分类得分，取分类得分最高的候选样本作为估计的目标位置。(2) Obtain the best candidate sample: Send the N₂ candidate samples obtained in step (1) into the current network model to calculate the classification score, and take the candidate sample with the highest classification score as the estimated target position.

(3)目标框回归：步骤(2)获得估计的目标位置后，使用目标框回归方法定位目标区域获得跟踪结果。(3) Target frame regression: After obtaining the estimated target position in step (2), use the target frame regression method to locate the target area and obtain the tracking result.

所述的模型更新，具体步骤如下：The specific steps for the described model update are as follows:

(1)更新方式选择：综合考虑目标跟踪中的两个互补方面：鲁棒性和自适应性。采用长时更新和短时更新两种模型更新方式。在跟踪过程中，每隔f(8～10)帧执行一次长时更新，当模型将估计的目标位置分类为背景时执行一次短时更新。(1) Update method selection: Comprehensive consideration of two complementary aspects in target tracking: robustness and adaptability. Two model update methods, long-term update and short-term update, are used. During the tracking process, a long-term update is performed every f (8 to 10) frames, and a short-term update is performed when the model classifies the estimated target position as background.

(2)模型更新样本库的生成与更新：模型更新样本库包括在线样本集和记忆感知样本集/>两部分，其中f_l(80～100)和f_s(20～30)分别表示长时收集样本设定帧数和短时收集样本设定帧数。/>和/>分别表示在线样本集中的在线正样本集和在线负样本集，/>和/>分别表示记忆感知样本集中的记忆感知正样本集和记忆感知负样本集。特别地，在线正负样本集中的/>(500)个和/>(5000)个是在初始帧目标位置随机采样产生的正负样本。对在线跟踪中的每一帧，当模型将估计的目标位置分类为前景时表明跟踪成功，就在估计的目标位置周围随机采样，分别收集/>(50)个正样本和/>(200)个负样本添加到/>和/>样本集中，其中t表示在线跟踪视频序列的第t帧。对在线正样本集/>当跟踪成功超过f_l(80～100)帧时删除在最早帧中收集的正样本，然后把删除的正样本添加到记忆感知正样本集/>中，即在线正样本集只收集最新跟踪成功的f_l(80～100)帧样本；对在线负样本集/>当跟踪成功超过f_s(20～30)帧时删除在最早帧中收集的负样本，然后把删除的负样本添加到记忆感知负样本集/>中，即在线负样本集只收集最新跟踪成功的f_s(20～30)帧样本。对记忆感知正样本集/>当其收集超过f_l(80～100)帧时，使用K均值聚类算法将样本集聚成N_C(10～15)个类，当有新样本时，分别计算新样本的特征均值向量与N_C个聚类中心的欧式距离，并将新样本添加到与其欧式距离最小的类中，同时删除此类中最早的与新样本数量相同的样本，确保记忆感知正样本集/>在更前后样本集总数不变；对记忆感知负样本集/>当收集超过f_s(20～30)帧时删除在最早帧中收集的样本，即记忆感知负样本集只收集最新的f_s(20～30)帧样本。(2) Generation and update of the model update sample library: The model update sample library includes online sample sets and memory perception sample set/> Two parts, of which f_l (80 ~ 100) and f_s (20 ~ 30) respectively represent the number of frames set for long-term collection of samples and the number of frames set for short-term collection of samples. /> and/> Represents the online positive sample set and the online negative sample set in the online sample set respectively,/> and/> Represents the memory-aware positive sample set and the memory-aware negative sample set in the memory-aware sample set respectively. In particular, in the online positive and negative sample set/> (500) and/> (5000) are positive and negative samples generated by random sampling at the target position of the initial frame. For each frame in online tracking, when the model classifies the estimated target position as foreground, indicating successful tracking, random samples are taken around the estimated target position and collected separately. (50) positive samples and/> (200) negative samples are added to/> and/> sample set, where t represents the t-th frame of the online tracking video sequence. For online positive sample set/> When the tracking success exceeds f_l (80~100) frames, delete the positive samples collected in the earliest frames, and then add the deleted positive samples to the memory-aware positive sample set/> , that is, the online positive sample set only collects the latest successfully tracked f_l (80~100) frame samples; for the online negative sample set/> When the tracking success exceeds f_s (20~30) frames, delete the negative samples collected in the earliest frame, and then add the deleted negative samples to the memory-aware negative sample set/> , that is, the online negative sample set only collects the latest f_s (20~30) frame samples with successful tracking. Positive sample set for memory perception/> When more than f_l (80-100) frames are collected, the K-means clustering algorithm is used to cluster the samples into N_C (10-15) classes. When there are new samples, the feature mean vector and N of the new samples are calculated respectively. Euclidean distance of_C cluster centers, and add new samples to the class with the smallest Euclidean distance, while deleting the earliest sample in this class with the same number of new samples to ensure memory-aware positive sample set/> The total number of sample sets remains unchanged before and after the update; negative sample sets for memory perception/> When more than f_s (20-30) frames are collected, the samples collected in the earliest frames are deleted, that is, the memory-aware negative sample set only collects the latest f_s (20-30) frame samples.

(3)持续学习方式模型训练及模型获取：持续学习方式模型训练包括预热训练和联合优化训练两阶段。预热训练的目的是为了使模型学会适应当前的目标变化，联合优化训练的目的是为了使模型能记住历史的目标变化，从而在长时目标跟踪过程中建立目标的完整描述，当被跟踪目标出视野后再出现时便可以利用模型的历史记忆迅速找回目标，实现长时稳健的跟踪。在模型长时更新或短时跟新时，若记忆感知样本集还没有收集到样本，利用步骤(2)中收集的在线样本集对模型进行训练，对网络最后输出的分类得分，采用二分类交叉熵损失函数计算其分类损失。最后根据当前的分类损失，使用梯度下降法对网络全连接层参数进行更新，训练全连接层进行H₂(15次)迭代；当记忆感知样本集有样本时，首先，利用步骤(2)中收集的在线样本集/>对模型进行预热训练，采用二分类交叉熵损失函数计算其分类损失，然后使用梯度下降法对网络全连接层参数进行更新，训练全连接层进行H₃(10次)迭代；当模型预热训练结束后，利用步骤(2)中收集的在线样本集和记忆感知样本集/>对模型进行联合优化训练，对在线样本集利用二分类交叉熵损失函数计算其分类损失，对记忆感知样本集利用知识蒸馏损失函数计算其知识蒸馏损失，最后的总损失为分类损失加上λ倍的知识蒸馏损失。计算出总损失后，使用梯度下降法对网络全连接层参数进行更新，训练全连接层进行H₄(15次)迭代。其中在各训练阶段，全连接FC4-5层的学习率都设置为0.001，分类层FC6层的学习率都设置为0.01，动量和权重衰减都是分别设置为0.9和0.0005，训练时，每小批由M⁺(32)个正样本和从M^-(1024)个负样本中选出的/>(96)个难分负样本组成。(3) Continuous learning mode model training and model acquisition: Continuous learning mode model training includes two stages: warm-up training and joint optimization training. The purpose of warm-up training is to make the model learn to adapt to the current target changes. The purpose of joint optimization training is to enable the model to remember historical target changes, thereby establishing a complete description of the target during the long-term target tracking process. When being tracked When the target reappears after leaving the field of view, the model's historical memory can be used to quickly retrieve the target and achieve long-term and robust tracking. During the long-term update or short-term update of the model, if the memory-aware sample set has not yet collected samples, use the online sample set collected in step (2). The model is trained, and the binary cross-entropy loss function is used to calculate the classification loss of the final output classification score of the network. Finally, according to the current classification loss, the gradient descent method is used to update the parameters of the fully connected layer of the network, and the fully connected layer is trained for H₂ (15 times) iterations; when the memory sensing sample set has samples, first, use the method in step (2) Collection of online samples/> Perform preheating training on the model, use the binary cross-entropy loss function to calculate its classification loss, then use the gradient descent method to update the parameters of the fully connected layer of the network, and train the fully connected layer for H₃ (10 times) iterations; when the model is preheated After training, use the online sample set collected in step (2) and memory perception sample set/> The model is jointly optimized and trained. The binary cross-entropy loss function is used to calculate the classification loss for the online sample set. The knowledge distillation loss is calculated for the memory-aware sample set using the knowledge distillation loss function. The final total loss is the classification loss plus λ times. knowledge distillation loss. After calculating the total loss, use the gradient descent method to update the parameters of the fully connected layer of the network, and train the fully connected layer for H₄ (15 times) iterations. In each training stage, the learning rate of the fully connected FC4-5 layer is set to 0.001, the learning rate of the classification layer FC6 layer is set to 0.01, and the momentum and weight attenuation are set to 0.9 and 0.0005 respectively. During training, every hour A batch consists of M⁺ (32) positive samples and selected from M^- (1024) negative samples/> (96) difficult-to-distinguish negative samples.

本发明的特点：Features of the invention:

本发明提出了一种基于持续学习的长时视觉目标跟踪方法。该方法把传统视觉目标跟踪的模型在线更新转换为持续学习的过程，结合动态构建的在线样本集和记忆感知样本集，在长期时间维度内学习目标的遮挡、形态、尺度和光照等变化，从而在整个视频序列，对时序数据进行有效地抽象和表征，建立目标的完整画像。实现在目标长时间被遮挡或出视野后，仍可依据持续学习学到的历史模型，迅速找回重现于视野内的目标。相较于已有视觉目标跟踪技术，该方法通过持续学习方法平衡了模型的历史记忆与在线更新，克服了传统的使用新数据频繁更新所导致的模型“灾难性遗忘”问题，从视频的所有历史数据整体建立目标的完整画像描述，获得对噪声不敏感的目标模型，提升视觉跟踪的鲁棒性，达到长时跟踪的目的。本发明所述的方法可为智能视频监控、人机交互、视觉导航等应用需求提供长时视觉目标跟踪的有效解决方案。The present invention proposes a long-term visual target tracking method based on continuous learning. This method converts the online update of the traditional visual target tracking model into a continuous learning process, combines the dynamically constructed online sample set and the memory perception sample set to learn changes in the target's occlusion, form, scale, and illumination in the long-term time dimension, thereby Throughout the video sequence, the time series data is effectively abstracted and represented to establish a complete portrait of the target. After the target has been blocked or out of the field of view for a long time, the target can still be quickly retrieved based on the historical model learned through continuous learning. Compared with existing visual target tracking technology, this method balances the model's historical memory and online updates through continuous learning methods, overcoming the traditional "catastrophic forgetting" problem of the model caused by frequent updates using new data, from all aspects of the video The historical data is used as a whole to establish a complete portrait description of the target, obtain a target model that is insensitive to noise, improve the robustness of visual tracking, and achieve the purpose of long-term tracking. The method described in the present invention can provide an effective solution for long-term visual target tracking for application requirements such as intelligent video surveillance, human-computer interaction, and visual navigation.

附图说明：Picture description:

图1.整体流程图Figure 1. Overall flow chart

图2.网络结构Figure 2. Network structure

图3.初始帧分割图像Figure 3. Initial frame segmentation image

具体实施方式：Detailed ways:

以下结合说明书附图，对本发明的实施实例加以详细说明：The implementation examples of the present invention will be described in detail below with reference to the accompanying drawings:

一种基于持续学习的长时目标跟踪方法，整体流程如附图1所示；算法分为模型初始化、在线跟踪和模型更新部分。模型初始化部分：对初始帧处理，首先利用超像素分割方法获得只有前景的初始帧分割图像，然后输入初始帧原图像和初始帧分割图像分别提取卷积层特征，再融合两部分特征，即将两部分特征相加，接着通过全连接层和分类层获取分类得分并计算分类损失，然后通过反向传播梯度损失项，优化求解最优初始化模型。在线跟踪部分：在后续帧处理过程中，首先利用前一帧中目标的预测位置产生候选样本，然后将每个候选样本输入网络中计算其分类得分，选择分类得分最高的候选样本，最后使用目标框回归定位目标区域获得跟踪结果。模型更新部分：在跟踪过程中，每隔10帧或模型将估计的目标被分类为背景时，利用持续学习的方法进行长时或短时模型更新，适应目标在跟踪过程中的各种变化。A long-term target tracking method based on continuous learning. The overall process is shown in Figure 1; the algorithm is divided into model initialization, online tracking and model update parts. Model initialization part: For initial frame processing, first use the superpixel segmentation method to obtain the initial frame segmentation image with only the foreground, then input the initial frame original image and the initial frame segmentation image to extract the convolution layer features, and then fuse the two parts of the features, that is, the two parts Partial features are added, and then the classification score is obtained through the fully connected layer and the classification layer and the classification loss is calculated. Then the gradient loss term is backpropagated to optimize and solve the optimal initialization model. Online tracking part: In the subsequent frame processing process, first use the predicted position of the target in the previous frame to generate candidate samples, then input each candidate sample into the network to calculate its classification score, select the candidate sample with the highest classification score, and finally use the target Box regression locates the target area to obtain tracking results. Model update part: During the tracking process, every 10 frames or when the model classifies the estimated target as background, the continuous learning method is used to perform long-term or short-term model updates to adapt to various changes in the target during the tracking process.

所述的模型初始化部分，具体步骤如下：The specific steps for the model initialization part are as follows:

(1)初始帧分割图像获取：初始帧由超像素集组成，其中N是图像中的超像素数，Ο_i表示超像素集中第i个超像素的像素值。完全位于边界框外部的超像素被视为背景，其余超像素未知(背景或前景)。用超像素随机采样的P个像素值x_v，将超像素建模为其中P是随机采样的超像素个数，x_v表示超像素模型m中第v个超像素的像素值。这可以看作是超混合体颜色分布的经验直方图。对于任何已知的超混合模型m^b，如果相似度得分S(m^a,m^b)＞η,η＝0.5，则对应于未知超混合模型m^a的超混合模型标记为背景，其中：(1) Initial frame segmentation image acquisition: The initial frame is composed of a superpixel set Composition, where N is the number of superpixels in the image, Ο_i represents the pixel value of the i-th superpixel in the superpixel set. Superpixels completely outside the bounding box are considered background, and the remaining superpixels are unknown (background or foreground). Using P pixel values x_v randomly sampled by the superpixel, the superpixel is modeled as where P is the number of randomly sampled superpixels, and x_v represents the pixel value of the v-th superpixel in superpixel model m. This can be viewed as an empirical histogram of the color distribution of a supermixture. For any known supermixed model m^b , if the similarity score S(m^a , m^b ) > η, η = 0.5, then the supermixed model corresponding to the unknown supermixed model m^a is marked as background, where:

其中x_k是未知超像素模型m^a中第k个超像素的像素值，score(x_k,m^b)定义为：where x_k is the pixel value of the k-th superpixel in the unknown superpixel model m^a , and score(x_k ,m^b ) is defined as:

其中x_j是已知超像素模型m^b中第j个超像素的像素值。将参数R设置为0.5，它控制以每个模型像素为中心的球体的半径，允许有轻微的误差。附图3展示了分割结果。where x_j is the pixel value of the jth superpixel in the known superpixel model m^b . Set the parameter R to 0.5, which controls the radius of the sphere centered at each model pixel, allowing for slight error. Figure 3 shows the segmentation results.

(2)模型初始化训练样本库的生成：分别在第一帧原始图像和分割图像的初始目标位置周围随机采样抽取500个正样本，只在第一帧原始图像抽取5000个负样本。这些样本根据它们与真实标注框的交并比分数，分数在[0.7,1]之间的样本标记为正样本，分数在[0,0.5]之间的样本标记为负样本。(2) Generation of model initialization training sample library: 500 positive samples are randomly sampled around the initial target position of the first frame of the original image and segmented image, and 5000 negative samples are only extracted from the first frame of the original image. These samples are scored according to their intersection ratio with the true annotation box. Samples with scores between [0.7,1] are marked as positive samples, and samples with scores between [0,0.5] are marked as negative samples.

(3)模型初始化训练及模型获取：对网络最后输出的分类得分，采用二分类交叉熵损失作为损失函数求其损失项，公式为：(3) Model initialization training and model acquisition: For the final output classification score of the network, the binary cross-entropy loss is used as the loss function to find the loss term. The formula is:

其中，X_n/Y_n表示初始化训练样本库的训练样本和训练样本标签，N_n是从X_n抽取的一批样本，是N_n中第i样本对应的标签，/>是N_n中第i个样本/>相应的softmax输出。然后，通过随机梯度下降法求解最优化网络参数，在测试序列的初始帧，训练全连接层进行50次迭代，全连接FC4-5层的学习率设置为0.0005，分类层FC6层的学习率设置为0.005；动量和权重衰减分别设置为0.9和0.0005；每小批由M⁺＝32个正样本和从M^-＝1024个负样本中选出的/>个负难分样本组成。Among them, X_n /Y_n represents the training samples and training sample labels of the initialized training sample library, and N_n is a batch of samples extracted from X_n . is the label corresponding to the i-th sample in N_n ,/> is the i-th sample in N_n /> Corresponding softmax output. Then, the optimal network parameters are solved through the stochastic gradient descent method. In the initial frame of the test sequence, the fully connected layer is trained for 50 iterations. The learning rate of the fully connected FC4-5 layer is set to 0.0005, and the learning rate of the classification layer FC6 is set. is 0.005; momentum and weight decay are set to 0.9 and 0.0005 respectively; each mini-batch is selected from M⁺ = 32 positive samples and M^- = 1024 negative samples/> It consists of negative indistinguishable samples.

所述的在线跟踪部分，具体步骤如下：The specific steps for the online tracking part are as follows:

(1)对在线跟踪中的每一帧，根据前一帧估计的目标位置利用高斯分布产生256个候选样本x_u表示候选样本中的第u个候选样本。高斯分布的均值为r，协方差为对角矩阵diag(0.09r²,0.09r²,0.25)，其中r是前一帧估计的目标位置宽度和高度的平均值。(1) For each frame in online tracking, 256 candidate samples are generated using Gaussian distribution based on the target position estimated in the previous frame. x_u represents the u-th candidate sample among the candidate samples. The mean of the Gaussian distribution is r and the covariance is the diagonal matrix diag (0.09r² ,0.09r² ,0.25), where r is the average of the width and height of the target position estimated in the previous frame.

(2)网络函数的输出是一个二维向量，分别表示输入的候选样本对应目标和背景的分数。选择分类得分最高的那个候选样本作为估计的目标位置：(2) The output of the network function is a two-dimensional vector, representing the scores of the input candidate samples corresponding to the target and background respectively. Select the candidate sample with the highest classification score as the estimated target location:

其中，u是候选样本下标，f⁺(·)表示当前网络函数，x^*表示网络计算出的分类得分最高的候选样本，即估计的目标位置。Among them, u is the candidate sample subscript, f⁺ (·) represents the current network function, and x^* represents the candidate sample with the highest classification score calculated by the network, that is, the estimated target position.

(3)最后对得到目标位置进行目标框回归定位目标区域。目标框回归采用了岭回归方法，岭回归中的参数α设置为1000。(3) Finally, perform target frame regression to locate the target area on the obtained target position. The target box regression uses the ridge regression method, and the parameter α in the ridge regression is set to 1000.

所述的模型更新部分，具体步骤如下：The specific steps for the model update part are as follows:

(1)更新方式选择：采用长时更新和短时更新两种模型更新方式。在跟踪过程中，每隔f＝10帧执行一次长时更新，当模型将估计的目标位置分类为背景时执行一次短时更新。(1) Update method selection: adopt two model update methods: long-term update and short-term update. During tracking, a long update is performed every f = 10 frames, and a short update is performed when the model classifies the estimated target position as background.

(2)模型更新样本库的生成与更新：模型更新样本库包括在线样本集和记忆感知样本集/>两部分，下标f_l和f_s分别表示长时收集样本设定帧数和短时收集样本设定帧数。/>和/>分别表示在线样本集中的在线正样本集和在线负样本集，/>和分别表示记忆感知样本集中的记忆感知正样本集和记忆感知负样本集。特别地，在线正负样本集中的/>个和/>个是在初始帧目标位置随机采样产生的正负样本。对在线跟踪中的每一帧，当模型将估计的目标位置分类为前景时表明跟踪成功，就在估计的目标位置周围随机采样，分别收集50个正样本和200个负样本添加到/>和/>样本集中。对在线正样本集/>当跟踪成功超过100帧时删除在最早帧中收集的正样本，然后把删除的正样本添加到记忆感知正样本集/>中，即在线正样本集只收集最新跟踪成功的100帧样本；对在线负样本集/>当跟踪成功超过30帧时删除在最早帧中收集的负样本，然后把删除的负样本添加到记忆感知负样本集/>中，即在线负样本集只收集最新跟踪成功的30帧样本。(2) Generation and update of the model update sample library: The model update sample library includes online sample sets and memory perception sample set/> The two parts, the subscripts f_l and f_s respectively represent the number of frames set for long-term collection of samples and the number of frames set for short-term collection of samples. /> and/> Represents the online positive sample set and the online negative sample set in the online sample set respectively,/> and Represents the memory-aware positive sample set and the memory-aware negative sample set in the memory-aware sample set respectively. In particular, in the online positive and negative sample set/> individual and/> Each is a positive and negative sample generated by random sampling at the target position of the initial frame. For each frame in online tracking, when the model classifies the estimated target position as foreground, indicating successful tracking, random sampling is performed around the estimated target position, and 50 positive samples and 200 negative samples are collected and added to/> and/> sample concentration. For online positive sample set/> When tracking succeeds for more than 100 frames, delete the positive samples collected in the earliest frames, and then add the deleted positive samples to the memory-aware positive sample set/> , that is, the online positive sample set only collects the latest 100 frame samples with successful tracking; for the online negative sample set/> When tracking is successful for more than 30 frames, delete the negative samples collected in the earliest frames, and then add the deleted negative samples to the memory-aware negative sample set/> , that is, the online negative sample set only collects the latest 30 frame samples with successful tracking.

对记忆感知正样本集当收集的样本超过长时收集样本设定帧100帧时，使用K均值聚类算法将样本集聚成10类：Positive sample set for memory perception When the collected samples exceed 100 frames of the long-term collection sample setting frame, the K-means clustering algorithm is used to cluster the samples into 10 categories:

其中τ表示聚类簇的簇标记下标，表示聚类结果，/>是特征向量计算函数：where τ represents the cluster label subscript of the cluster cluster, Indicates the clustering results,/> is the eigenvector calculation function:

式中，W和b分别表示网络全连接FC5层之前网络权重和偏置，x表示输入样本，表示卷积运算。当有新的记忆感知样本/>时，分别计算新样本的特征均值向量与10个聚类中心的欧式距离，欧式距离计算公式为：In the formula, W and b respectively represent the network weight and bias before the fully connected FC5 layer of the network, x represents the input sample, Represents the convolution operation. When there is a new memory perception sample/> When , calculate the Euclidean distance between the feature mean vector of the new sample and the 10 cluster centers respectively. The Euclidean distance calculation formula is:

d_τ(μ_new-μ_τ)＝||μ_new-μ_τ||,τ＝1,...,10 (7)d_τ (μ_new -μ_τ )＝||μ_new -μ_τ ||,τ＝1,...,10 (7)

式中，μ_new表示新样本特征均值向量，μ_τ表示10个聚类簇中第τ类的特征均值向量。根据与之距离最近的均值向量确定新样本的簇标记：In the formula, μ_new represents the feature mean vector of the new sample, and μ_τ represents the feature mean vector of the τth class among the 10 clusters. Determine the cluster label of the new sample based on the nearest mean vector:

并将新样本划入相应的簇：同时删除此类中最早的与新样本数量相同的样本，确保记忆感知正样本集/>在更前后样本集总数不变；对记忆感知负样本集当收集超过30帧样本时删除最早收集的样本，即记忆感知负样本集只收集最新的30帧样本。And classify new samples into corresponding clusters: At the same time, delete the oldest sample in this category that has the same number as the new sample to ensure memory-aware positive sample set/> The total number of sample sets remains unchanged before and after the update; negative sample sets for memory perception When more than 30 frame samples are collected, the earliest collected samples are deleted, that is, the memory-aware negative sample set only collects the latest 30 frame samples.

(3)持续学习方式训练及模型获取：在模型长时更新或短时跟新时，若记忆感知样本集还没有样本，利用步骤(2)中收集的在线样本集对模型进行训练，对网络最后输出的分类得分，采用二分类交叉熵损失函数公式(3)，计算其分类损失。最后根据当前的分类损失，使用梯度下降法对网络全连接层参数进行更新，梯度下降法公式：(3) Continuous learning method training and model acquisition: When the model is updated for a long time or is updated in a short time, if there are no samples in the memory sensing sample set, use the online sample set collected in step (2) The model is trained, and the classification score finally output by the network is calculated using the binary cross-entropy loss function formula (3) to calculate the classification loss. Finally, based on the current classification loss, the gradient descent method is used to update the parameters of the fully connected layer of the network. The gradient descent method formula is:

式中，θ_n表示网络参数，η是学习率，l(·)表示损失函数。训练全连接层进行15次迭代。当记忆感知样本集有样本时，首先，利用步骤(2)中收集的在线样本集对模型进行预热训练，采用二分类交叉熵损失函数公式(3)，计算其分类损失，然后使用梯度下降法公式(9)对网络全连接层参数进行更新，训练全连接层进行10次迭代；当模型预热训练结束后，利用步骤(2)中收集的在线样本集/>和记忆感知样本集/>对模型进行联合优化训练，对在线样本集采用二分类交叉熵损失函数公式(3)计算其分类损失L_C，对记忆感知样本集采用知识蒸馏损失函数计算其蒸馏损失L_D，知识蒸馏损失函数公式为：In the formula, θ_n represents the network parameters, eta is the learning rate, and l(·) represents the loss function. Train the fully connected layer for 15 iterations. When the memory-aware sample set has samples, first, use the online sample set collected in step (2) Preheat the model and use the binary cross-entropy loss function formula (3) to calculate its classification loss. Then use the gradient descent method formula (9) to update the fully connected layer parameters of the network and train the fully connected layer for 10 iterations. ;After the model warm-up training is completed, use the online sample set collected in step (2)/> and memory perception sample set/> The model is jointly optimized and trained. The two-class cross entropy loss function formula (3) is used to calculate the classification loss L_C for the online sample set. The knowledge distillation loss function is used to calculate the distillation loss L_D for the memory-aware sample set. The knowledge distillation loss function The formula is:

式中，表示记忆感知样本集训练样本和样本标签，与公式(3)不同，/>是由旧网络输出的软标签，N_m是从X_m抽取的一批样本，/>是N_m中的第i样本对应的标签，/>是第i个样本/>相应的softmax输出。最后，总的损失函数为：In the formula, Represents the training samples and sample labels of the memory-aware sample set, which is different from formula (3), /> is the soft label output by the old network, N_m is a batch of samples extracted from X_m ,/> is the label corresponding to the i-th sample in N_m ,/> is the i-th sample/> Corresponding softmax output. Finally, the total loss function is:

L_sum＝L_C+λ·L_D (11)L_sum =L_C +λ·L_D (11)

式中，参数λ设置为0.7。计算出总损失后，使用梯度下降法公式(9)对网络全连接层参数进行更新，训练全连接层进行15次迭代。其中在各个训练阶段，全连接FC4-5层的学习率都设置为0.001，分类层FC6层的学习率都设置为0.01，动量和权重衰减都是分别设置为0.9和0.0005，训练时，每小批由32个正样本和从1024个负样本中选出的96个难分负样本组成。In the formula, the parameter λ is set to 0.7. After calculating the total loss, use the gradient descent method formula (9) to update the parameters of the fully connected layer of the network, and train the fully connected layer for 15 iterations. In each training stage, the learning rate of the fully connected FC4-5 layer is set to 0.001, the learning rate of the classification layer FC6 layer is set to 0.01, and the momentum and weight attenuation are set to 0.9 and 0.0005 respectively. During training, every hour The batch consists of 32 positive samples and 96 hard-to-distinguish negative samples selected from 1024 negative samples.