CN107945210B


Info

Publication number: CN107945210B
Application number: CN201711237457.4A
Authority: CN
Inventors: 周圆; 李孜孜; 曹颖; 杜晓婷; 杨鸿宇
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2017-11-30
Filing date: 2017-11-30
Publication date: 2021-01-05
Anticipated expiration: 2037-11-30
Also published as: CN107945210A

Abstract

The invention discloses a target tracking algorithm based on deep learning and environment adaptation. The algorithm consists of two parts. The first is preprocessing: information is extracted from each frame of the tracked video, and the sampled positive and negative samples are then further screened by saliency detection and a convolutional neural network. The second is a convolutional neural network implementing the VGG model: a three-layer convolutional network first extracts the target features, fully connected layers then classify target and background, and the position of the target to be tracked is finally obtained, after which the tracking process for the next frame begins. Compared with the prior art, the invention (1) makes accurate use of the image preprocessing information while reducing computational complexity, so that tracking is more accurate, and the content of the invention is therefore original; and (2) yields a tracker that adapts to a variety of complex scenes and has broad application prospects.

Description

Target tracking method based on deep learning and environment adaptation

Technical field

The invention relates to target tracking in computer vision and, more particularly, to a target tracking algorithm that adapts to the environment using deep learning.

Background

Humans perceive and communicate with the outside world through their senses, but human attention and field of view are very limited. Human vision is therefore heavily constrained, and often inefficient, in many applications. With the rapid development of digital computer technology, computer vision has attracted growing attention: the aim is to use computers in place of human "eyes", giving them the intelligence to process visual information and to make up for the many shortcomings of human vision. Computer vision is a highly interdisciplinary subject that draws on artificial neural networks, psychology, physics, computer graphics, mathematics and other fields.

Target tracking is currently one of the most active topics in computer vision and is receiving increasing attention. It has a wide range of applications, for example in action analysis, behavior recognition, surveillance and human-computer interaction. It has important research value and great application prospects in science and engineering, and has attracted the interest of a large number of researchers at home and abroad.

Deep learning has been applied successfully to image processing and offers a new approach to target tracking. In target tracking, the deep architecture automatically learns more abstract and essential features from the acquired samples, which are then used to test new sequences. Tracking techniques that incorporate deep learning have gradually surpassed traditional tracking methods in performance and have become a new trend in the field.

To date, no target tracking algorithm based on deep learning and environment adaptation has been reported in papers or literature published at home or abroad.

Summary of the invention

In view of the above prior art, the invention proposes a target tracking method based on deep learning and environment adaptation. It uses a convolutional neural network whose parameters are adjusted adaptively, combined with the preprocessing advantages of saliency detection, so that the tracker achieves high accuracy in a variety of tracking scenarios.

The target tracking method based on deep learning and environment adaptation of the invention comprises the following steps:

Step 1. An image of 107×107 pixels is used as input;

Step 2. Preprocessing, comprising positive sample preprocessing and negative sample preprocessing. Positive sample preprocessing comprises the following. First, a sampling procedure is executed: based on the ground truth, a rectangle larger than the ground-truth box of the target is taken around the target in the positive sample as the sampling box, and the proportion of the sampling box occupied by the saliency map of the positive sample is computed; if the proportion is greater than a set threshold, the sample is treated as a pure positive sample, and if it is smaller than the threshold, the sample is discarded. Then, a saliency detection algorithm detects the shape of the target to produce a saliency map; the saliency map is binarized, the binarized map replaces the original frame, and the whole binarized frame is sampled according to the sampling procedure above. Negative sample preprocessing comprises the following: a hard example mining algorithm screens the negative samples; the sampled samples undergo one forward pass through the convolutional neural network, the samples are ranked by loss, and the top-ranked samples with the largest loss are selected as "hard examples" and used to train the network. In offline multi-domain training, 50 positive samples and 200 negative samples are drawn from each frame, the positive and negative samples having overlap ratios of ≥0.7 and ≤0.5 with the ground-truth box, respectively; positive and negative samples are selected by this criterion. Similarly, for online learning, [formula image not preserved] positive samples and [formula image not preserved] negative samples are collected, following the same overlap criterion;

Step 3. A bounding-box regression model is trained on the first frame. Specifically, for the given first frame of the test video sequence, a three-layer convolutional network is used to extract target features and to train a linear bounding-box regression model that predicts the position of the target; in each subsequent frame of the video sequence, the bounding-box regression model is used to adjust the predicted position of the target's bounding box.

Compared with the prior art, the invention has the following effects:

(1) It makes accurate use of the image preprocessing information while reducing computational complexity, so that tracking is more accurate; the content of the invention is therefore original;

(2) The tracker adapts to a variety of complex scenes and has broad application prospects.

Description of drawings

Fig. 1 shows the overall framework of the target tracking method based on deep learning and environment adaptation of the invention: Fig. 1(a) is the basic model of the tracking algorithm; Fig. 1(b) is the saliency detection model; Fig. 1(c) is the deep learning tracking model;

Fig. 2 shows the tracking test results on the Diving sequence;

Fig. 3 shows the tracking test results on the ball sequence.

Detailed description

The target tracking method based on deep learning and environment adaptation of the invention consists of two parts. The first is preprocessing: information is extracted from each frame of the tracked video, and the sampled positive and negative samples are then further screened by saliency detection and a convolutional neural network. The second is a convolutional neural network implementing the VGG model: a three-layer convolutional network first extracts the target features, fully connected layers then classify target and background, and the position of the target to be tracked is finally obtained, after which the tracking process for the next frame begins.

The procedure is described in detail as follows:

Step 1. An image of 107×107 pixels is used as input. This size is chosen so that the feature map output by the convolutional layers matches the input size expected downstream, and so that the input to the fully connected layers is a one-dimensional vector;

Step 2. Preprocessing, comprising positive sample preprocessing and negative sample preprocessing

(1) Positive sample preprocessing: the positive samples drawn by conventional methods are sometimes in effect negative samples that consist mostly of background. Such "positive samples" introduce error into the training of the convolutional neural network. The invention therefore screens the sampled positive samples so that they are purer. The specific implementation is as follows:

First, based on the ground truth, a rectangle is taken around the target in the positive sample; the rectangle must be larger than the ground-truth box of the target. The proportion of this sampling box occupied by the saliency map is computed. If the proportion is greater than a set threshold, the sample is treated as a pure positive sample and fed into the network; if it is smaller than the threshold, the sample is discarded. This ensures that nearly all of the retained positive samples are pure.
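The ratio test described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the patent's implementation: the saliency map is taken to be already binarized into a list of rows of 0/1 values, boxes are `(x, y, w, h)` tuples with non-negative pixel coordinates, and the function names and the default threshold of 0.5 are hypothetical (the source only speaks of "a set threshold").

```python
def salient_ratio(mask, box):
    """Fraction of the pixels inside `box` that are marked salient (value 1)."""
    x, y, w, h = box  # assumed non-negative pixel coordinates
    # Slicing clips the box to the image, so only pixels that exist are counted.
    rows = [row[x:x + w] for row in mask[y:y + h]]
    total = sum(len(row) for row in rows)
    salient = sum(sum(row) for row in rows)
    return salient / total if total else 0.0


def filter_positive_samples(mask, boxes, threshold=0.5):
    """Keep only the sampling boxes whose salient ratio exceeds the threshold."""
    return [box for box in boxes if salient_ratio(mask, box) > threshold]
```

A box that tightly covers the salient region passes the test, while a box that is mostly background is dropped before it reaches the network.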

Then, saliency detection is performed, that is, detection of the objects that stand out within a region. Specifically, a saliency detection algorithm roughly detects the shape of the target; the resulting saliency map is binarized and inserted back into the original frame, and the whole binarized frame is then sampled according to the sampling procedure above. The saliency method is used later to verify the target.
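The binarization step can be illustrated with a short Python sketch. The saliency detector itself is not specified in the source and is not shown here. The sketch assumes the saliency map is a list of rows of values in [0, 1]; the function name and the threshold of 0.5 are hypothetical.

```python
def binarize_saliency(saliency, threshold=0.5):
    """Map each saliency value to 1 (salient) or 0 (background)."""
    return [[1 if value >= threshold else 0 for value in row] for row in saliency]
```

The resulting 0/1 mask is the form on which the salient-area ratio of a sampling box can be computed.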

The positive sample screening in this step is a general screening method applicable to most tracking algorithms; applying this idea to the pre-trained network influences the parameters of the whole network.

(2) Negative sample preprocessing

In tracking-by-detection, most negative samples are redundant, and only a few representative negative samples are useful for training the tracker. With ordinary SGD training, this easily causes the tracker to drift. The most common remedy is hard example mining, which is applied here to the screening of negative samples: the sampled samples undergo one forward pass through the convolutional neural network, the samples are ranked by loss, and the top-ranked samples with the largest loss are selected. Because these samples are close to the positive samples without being positive, they are called "hard examples". Training the network on them helps it learn the difference between positive and negative samples.
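The selection of hard negatives can be sketched as follows. This is an illustration under assumptions rather than the patent's code: `score_fn` is a hypothetical stand-in for one forward pass of the network that returns the predicted probability that a sample is the target, and the loss of a negative sample is assumed to be the binary cross-entropy, which the source does not specify.

```python
import math


def negative_loss(target_prob):
    """Cross-entropy loss of a negative sample given its predicted target probability."""
    # Clamp to avoid log(0) when the network is fully confident the sample is the target.
    return -math.log(max(1.0 - target_prob, 1e-12))


def mine_hard_negatives(samples, score_fn, keep):
    """Return the `keep` negative samples with the largest loss, hardest first."""
    ranked = sorted(samples, key=lambda s: negative_loss(score_fn(s)), reverse=True)
    return ranked[:keep]
```

Negatives that the network scores as likely targets have the largest loss, so they are the ones retained for training.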

Step 3. A bounding-box regression model is trained on the first frame. Specifically, for the given first frame of the test video sequence, a three-layer convolutional network is used to extract target features and to train a linear regression model that predicts the position of the target. In each subsequent frame of the video sequence, the regression model adjusts the position of the target's bounding box, and the fully connected layers classify target and background in the image to find the image patch with the highest target probability. That patch is taken as the target to be tracked, which gives the target's position, and the tracking process for the next frame then begins.

A long-term and short-term update strategy can also be used in positive sample preprocessing. The long-term update re-trains the network with the positive samples collected over a period of time. If the target is found to be lost during tracking, the short-term update is used, in which the network is updated with the positive samples collected over the short-term period. Both updates use the negative samples collected in the short-term model. T_s and T_l denote the short-term and long-term frame index sets, limited to 20 and 100 frames respectively (τ_s = 20, τ_l = 100). The aim of this strategy is to keep the samples as fresh as possible, which benefits the tracking results.

After the neural network has been trained offline, the video sequences under test are tracked online. The overall tracking algorithm therefore needs an online tracking part, which is implemented as follows:

Input: the filters {w₁, ..., w₅} of the pre-trained convolutional neural network (CNN) and the initial target state x₁

Output: the estimated target state [formula image not preserved]

(1) Randomly initialize the weights w₆ of the sixth (fully connected) layer;

(2) Train a bounding-box regression model;

(3) Draw positive samples [formula image not preserved] and negative samples [formula image not preserved];

(4) Screen the positive samples with the saliency network;

(5) Use the drawn positive samples [formula image not preserved] and negative samples [formula image not preserved] to update the fully connected layer weights {w₄, w₅, w₆}, where w₄, w₅ and w₆ are the weights of fully connected layers 4, 5 and 6;

(6) Initialize the short-term and long-term index sets: T_s ← {1} and T_l ← {1};

(7) Repeat the following operations:

Draw candidate samples of the target [formula image not preserved];

Find the optimal target state [formula image not preserved] by the formula [formula image not preserved], where [formula image not preserved] are the candidate samples; the formula states that the candidate sample scored highest by the convolutional neural network is the optimal target state;

If [formula image not preserved], then draw the training samples [formula image not preserved] and [formula image not preserved];

T_s ← T_s ∪ {t}, T_l ← T_l ∪ {t}

where t denotes the t-th frame, and T_s and T_l denote the short-term and long-term index sets; the current frame index t is added to both sets;

If the short-term index set holds more than the configured 20 entries, that is, |T_s| > τ_s, remove its smallest element: T_s ← T_s \ {min_{v∈T_s} v}, where v denotes a value in the short-term index set;

If the long-term index set holds more than the configured 100 entries, that is, |T_l| > τ_l, remove its smallest element: T_l ← T_l \ {min_{v∈T_l} v};

Adjust the predicted target position using the bounding-box regression model;

If [formula image not preserved], update the weights {w₄, w₅, w₆} using the positive and negative samples in the short-term model;

Otherwise, update the weights {w₄, w₅, w₆} using the positive and negative samples in the short-term model.
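The bookkeeping of the two frame index sets and the choice of the best candidate in the loop above can be sketched in Python. This is a sketch under assumptions, not the patent's implementation: the function names are hypothetical, `score_fn` stands in for the network's score of a candidate, and the condition that decides when a frame is added (lost with the formula images) is left to the caller.

```python
def update_index_sets(t, short_set, long_set, tau_s=20, tau_l=100):
    """Add frame index t to both sets, then drop the oldest index of an oversized set."""
    short_set.add(t)
    long_set.add(t)
    if len(short_set) > tau_s:
        short_set.discard(min(short_set))  # T_s <- T_s \ {min v}
    if len(long_set) > tau_l:
        long_set.discard(min(long_set))    # T_l <- T_l \ {min v}
    return short_set, long_set


def best_candidate(candidates, score_fn):
    """Return the candidate state with the highest network score."""
    return max(candidates, key=score_fn)
```

Starting from T_s = T_l = {1} and calling `update_index_sets` once per frame, the short-term set always holds the 20 most recent frame indices and the long-term set the 100 most recent.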

Embodiments of the invention are described in further detail below with reference to the accompanying drawings.

The target tracking method based on deep learning and environment adaptation proposed in this patent is verified below. Simulation experiments compare the training error of the algorithm with that of the unimproved algorithm, and extensive experimental results confirm the algorithm's effectiveness. The experimental results are presented as tracked target boxes.

Candidate target generation: to generate candidate targets in each frame, N = 256 samples are drawn, [formula image not preserved], where [formula image not preserved] denotes the previous target state; the covariance matrix is a diagonal matrix with parameter 0.09r², and r is the mean of the length and width of the target box in the previous frame. The size of each candidate target box is 1.5 times that of the initial-state target box.
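One way to picture the candidate generation is the Python sketch below. Several points are assumptions, because the sampling formula did not survive in the source: the candidates are drawn from a Gaussian centred on the previous target centre, the variance 0.09r² applies independently to the x and y coordinates, and "1.5 times the size" is read as scaling both width and height of the initial box. The function and parameter names are hypothetical.

```python
import random


def generate_candidates(prev_center, prev_size, init_size, n=256, seed=None):
    """Draw n candidate boxes (cx, cy, w, h) around the previous target centre."""
    rng = random.Random(seed)
    cx, cy = prev_center
    w, h = prev_size
    r = (w + h) / 2.0               # mean of the previous box's length and width
    sigma = (0.09 * r * r) ** 0.5   # standard deviation, equal to 0.3 * r
    box_w, box_h = 1.5 * init_size[0], 1.5 * init_size[1]
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma), box_w, box_h)
            for _ in range(n)]
```

Passing a fixed `seed` makes the draw reproducible, which is convenient when comparing tracking runs.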

Training data: in offline multi-domain training, 50 positive samples and 200 negative samples are drawn from each frame, with overlap ratios of ≥0.7 and ≤0.5 with the ground-truth box respectively; positive and negative samples are selected by this criterion. Similarly, for online learning, [formula image not preserved] positive samples and [formula image not preserved] negative samples are collected under the same overlap criterion. For the first frame, however, [formula image not preserved] positive samples and [formula image not preserved] negative samples are taken. For bounding-box regression, 1000 training samples are used.
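The overlap criterion for choosing positive and negative samples can be sketched as follows. The sketch assumes that the "overlap ratio" is the intersection-over-union (IoU) of two axis-aligned boxes given as `(x, y, w, h)`; the source does not define the measure, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_sample(box, ground_truth, pos_thr=0.7, neg_thr=0.5):
    """Label a box 'positive', 'negative', or None if it falls between the thresholds."""
    overlap = iou(box, ground_truth)
    if overlap >= pos_thr:
        return "positive"
    if overlap <= neg_thr:
        return "negative"
    return None
```

Boxes whose overlap lies strictly between 0.5 and 0.7 are used as neither positive nor negative samples.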

Network learning: for multi-domain learning with K branches, the learning rate is set to 0.0001 for the convolutional layers and 0.001 for the fully connected layers. When the fully connected layers are first trained, 30 iterations are run, with the learning rate set to 0.0001 for fully connected layers 4 and 5 and 0.001 for the sixth fully connected layer.
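The learning-rate settings above can be collected in a small configuration, sketched here in Python. The values come from the text; the phase names, the layer names (`conv1`, `fc4` and so on) and the lookup helper are hypothetical and only illustrate one way to organize them.

```python
LEARNING_RATES = {
    # Offline multi-domain training with K branches.
    "multi_domain": {"conv": 0.0001, "fc": 0.001},
    # Initial training of the fully connected layers.
    "initial_fc": {"fc4": 0.0001, "fc5": 0.0001, "fc6": 0.001},
}
INITIAL_FC_ITERATIONS = 30


def learning_rate(phase, layer):
    """Look up the rate for a layer, falling back to its group name (conv2 -> conv)."""
    rates = LEARNING_RATES[phase]
    if layer in rates:
        return rates[layer]
    return rates[layer.rstrip("0123456789")]
```

For example, `learning_rate("multi_domain", "conv2")` resolves through the `conv` group, while `learning_rate("initial_fc", "fc6")` is looked up directly.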

Table 1 gives the results of the improved algorithm, which adds the saliency preprocessing network; Table 2 gives the results of the unimproved algorithm, without the preprocessing network.

Table 1. Training results of the improved algorithm

Table 2. Training results of the unimproved algorithm

Claims

1. A target tracking method based on deep learning and environment adaptation, characterized by comprising the following steps:

step (1), using an image of 107×107 pixels as input;

step (2), preprocessing, comprising positive sample preprocessing and negative sample preprocessing; wherein the positive sample preprocessing comprises: first, executing a sampling procedure: based on the ground truth, taking a rectangle larger than the ground-truth box of the target around the target in the positive sample as a sampling box, computing the proportion of the sampling box occupied by the saliency map of the positive sample, treating the sample as a pure positive sample if the proportion is greater than a set threshold, and discarding the sample if the proportion is smaller than the set threshold; second, detecting the shape of the target with a saliency detection algorithm to obtain a saliency map, binarizing the saliency map, replacing the original frame with the binarized saliency map, and sampling the whole binarized frame according to the preceding sampling procedure; the negative sample preprocessing comprises: screening the negative samples with a hard example mining algorithm, passing the sampled samples once forward through a convolutional neural network, ranking the samples by loss, taking the top-ranked samples with the largest loss as "hard examples", and training the network with these samples; wherein, in offline multi-domain training, 50 positive samples and 200 negative samples are drawn from each frame, the positive samples and negative samples having overlap ratios of ≥0.7 and ≤0.5 with the ground-truth box respectively, and the positive and negative samples being selected by this criterion; likewise, for online learning, [formula image not preserved] positive samples and [formula image not preserved] negative samples are collected, following the same overlap criterion;

step (3), training a bounding-box regression model on the first frame, specifically comprising: for the given first frame of the test video sequence, using a three-layer convolutional network to extract target features and to train a linear bounding-box regression model that predicts the position of the target; and, in each subsequent frame of the video sequence, using the bounding-box regression model to adjust the predicted position of the target's bounding box.