CN113705325A - Deformable single-target tracking method and device based on dynamic compact memory embedding - Google Patents

Deformable single-target tracking method and device based on dynamic compact memory embedding

Info

Publication number
CN113705325A
Authority
CN
China
Prior art keywords
target
memory
feature
correlation
existing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110736925.2A
Other languages
Chinese (zh)
Other versions
CN113705325B (en)
Inventor
于洪涛
朱鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110736925.2A
Publication of CN113705325A
Application granted
Publication of CN113705325B
Status: Active
Anticipated expiration

Abstract

Translated from Chinese


The invention discloses a deformable single-target tracking method and device based on dynamic compact memory embedding. The method includes: target correlation matching based on compact memory embedding, to obtain the target foreground and background similarity and the target posterior probability; a dynamic adjustment mechanism for the compact memory embedding that, according to feature correlation, selects the high-quality parts of the current target feature and integrates them into the memory; a pixel-to-global association between each query pixel and the whole reference feature that captures the target's deformation state in the current feature, realizing deformable feature extraction; concatenating four features, including the target foreground similarity and the deformation feature, along the channel dimension and feeding them into a decoder to obtain a refined target segmentation mask; and deriving the target's rectangular bounding box from the segmentation mask to achieve visual tracking and localization. The device includes a processor and a memory. The invention addresses challenging problems in the tracking process such as occlusion, target deformation, target appearance change, and interference from similar targets.


Description

Deformable single-target tracking method and device based on dynamic compact memory embedding

TECHNICAL FIELD

The invention relates to the field of single-target visual tracking, and in particular to a deformable single-target tracking method and device based on dynamic compact memory embedding.

BACKGROUND

Visual object tracking is a fundamental and challenging task in computer vision, with many practical applications such as traffic monitoring, human-computer interaction, autonomous robots, and autonomous driving. Although existing tracking methods have improved significantly in both accuracy and robustness, challenging problems remain, such as occlusion, deformation, and background clutter.

The Siamese tracker is a concise and efficient tracking algorithm. During tracking, the location of maximum correlation between the search region and the template is taken as the target position. To achieve better generalization, Siamese trackers are usually trained on large amounts of labeled data. SINT (Siamese instance search for tracking) and SiameseFC (Siamese fully convolutional visual tracking) had a landmark influence on the development of the Siamese tracking field: they were the first attempts to train a Siamese network end-to-end for visual tracking. SiamRPN++ (a tracker built on a large backbone and a Siamese region proposal network) and SiamDW (a wider and deeper Siamese tracking model) improved the structure of the ResNet (residual network) model and successfully applied it to Siamese trackers, significantly improving tracking performance. SiamRPN (a tracking model based on a Siamese region proposal network) applies the region proposal network (RPN) to the Siamese tracking network: the two-branch network has a classification head for background-foreground separation of anchors and a regression head for proposal refinement. Compared with anchor-based methods, anchor-free tracking methods (SiamFC++, an anchor-free-regression Siamese fully convolutional tracker; SiamCAR, a Siamese fully convolutional classification and regression model for visual tracking; SiamBAN, a tracker with adaptive bounding-box regression; and Ocean, an object-aware anchor-free tracking method) avoid large numbers of preset anchors, significantly reducing model hyperparameters and enabling more flexible bounding-box regression. Although Siamese tracking is simple and effective, a fixed template can hardly express changes in target appearance and scale. MOSSE (an adaptive correlation filtering tracker) pioneered the discriminative correlation filter (DCF) family of tracking methods. As online methods, correlation filters adapt better to changes in target appearance and scale. Subsequent improvements such as continuous convolution, dynamically updated training sets, spatial regularization, and temporal smoothing regularization further improved the performance of DCF-based trackers. By combining the online updating of DCF with the target localization refinement of IoU-Net (a regression network that maximizes target overlap), ATOM (accurate tracking by overlap maximization) and DiMP (discriminative model prediction for visual tracking) achieved the then-best performance among template-matching trackers.

CSR-DCF (a tracking method combining correlation filtering and color segmentation) constructs a target mask from the color histograms of the foreground and background and adds this mask to the correlation filter, which effectively suppresses boundary effects. SiamMask (a tracking method combined with target segmentation) extends a segmentation branch onto the Siamese tracking model and uses a segmentation loss to enhance the model's expressiveness. Compared with common video object segmentation (VOS) methods, SiamMask achieves high tracking speed by adopting a lightweight segmentation network. Unlike "tracking by detection", D3S (a discriminative segmentation tracking model) innovatively replaces the target regression branch of the tracking model with a segmentation network; by combining the precise localization of DCF with the segmentation model's robustness to target deformation, D3S achieves advanced tracking performance. DMB (a segmentation tracking model based on dual memory banks) stores the historical appearance and spatial localization of the target, providing a rich reference for segmenting the current target.

Rich memory embeddings provide ample reference information for video analysis tasks. MemTrack (a tracker combined with a dynamic memory network) uses a dynamic memory network to overcome the tracking drift caused by a fixed template. STM (a spatio-temporal-memory video segmentation method) stores dense historical frame features and masks for pixel-level spatio-temporal matching of the current frame; the dense reference information allows it to handle appearance changes and occlusion during VOS well. To avoid redundancy in an ever-growing memory store, AFB-URR (a video segmentation model with an adaptive feature bank and uncertain-region refinement) proposes an adaptive feature bank to organize historical information dynamically: it merges similar memories by weighted averaging and follows a cache-replacement policy to evict the least frequently queried memories.

A fixed matching template has difficulty capturing changing target representations, especially for non-rigid objects. To this end, SiamAttn (a Siamese-attention tracking method) proposes a deformable Siamese attention mechanism that computes deformable self-attention and cross-attention between template features and search features. Self-attention uses spatial attention to learn contextual information, while cross-attention aggregates rich contextual interdependencies between the target template and the search region, implicitly updating the target template. Deformable-DETR (a deformable multi-head-attention object detection model) processes feature maps with a multi-scale deformable attention module in place of the standard Transformer attention mechanism; thanks to the flexibility of its weights, it performs excellently on object detection.

Although DCF-based trackers alleviate, to some extent, the difficulty of fixed templates adapting to scene changes, template-matching methods still have two limitations:

First, the template from a single initial frame cannot provide enough target information for matching. Second, existing matching methods operate only between corresponding pixels; the matching results are too coarse to capture sufficiently fine target deformation information.

SUMMARY OF THE INVENTION

The invention provides a deformable single-target tracking method and device based on dynamic compact memory embedding. It addresses challenging problems in the tracking process such as occlusion, target deformation, target appearance change, interference from similar targets, and complex backgrounds, as described below.

A deformable single-target tracking method based on dynamic compact memory embedding, the method comprising:

The feature affinity matrix generated during target similarity matching represents the correlation between the current target feature and the existing target feature memory; for each row of the affinity matrix, the top K values are selected and averaged to obtain the target foreground similarity and the target posterior probability.
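The row-wise top-K averaging described above can be sketched in plain Python. This is a minimal illustration with made-up numbers; the real model operates on feature tensors, and the affinity values, K, and matrix sizes here are only for demonstration.

```python
def topk_row_mean(affinity, k):
    """For each row of the affinity matrix (one row per query pixel),
    average the k largest correlation values, yielding one similarity
    score per query pixel (as for S_f / S_b in the text)."""
    scores = []
    for row in affinity:
        top_k = sorted(row, reverse=True)[:k]
        scores.append(sum(top_k) / len(top_k))
    return scores

# Toy 3-pixel query against a 4-slot memory.
affinity = [
    [0.9, 0.1, 0.8, 0.2],
    [0.3, 0.4, 0.2, 0.1],
    [0.7, 0.7, 0.1, 0.0],
]
print(topk_row_mean(affinity, k=2))  # approx. [0.85, 0.35, 0.7]
```

Averaging the top K entries, rather than taking only the single maximum, makes the per-pixel similarity less sensitive to one spurious high correlation in the memory.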

According to the obtained feature correlations, the highly correlated parts between the existing target feature memory and the current target feature are merged; parts with moderate similarity to the existing memory are appended to the memory, and parts with low correlation are discarded. This realizes dynamic adaptive adjustment of the memory embedding and yields a compact memory embedding.

A pixel-to-global association between each query pixel and the whole reference feature is adopted: by aggregating the weighted correlations between query pixels and the reference feature, the target's deformation state in the current feature is captured and correspondences between similar target parts are established, realizing deformable feature extraction.

The target foreground similarity and posterior probability, the deformable features, and the target spatial localization obtained from the online discriminative correlation filter are concatenated along the channel dimension and fed into the decoder to obtain a refined target segmentation mask; the target's rectangular bounding box is then derived from the segmentation mask to achieve visual tracking and localization.
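The final mask-to-box step can be illustrated with a small sketch. This is a simplified, axis-aligned stand-in: the actual contour-based conversion used by the method may differ, and the toy mask below is invented for demonstration.

```python
def mask_to_bbox(mask):
    """Convert a binary segmentation mask (list of rows of 0/1) into an
    axis-aligned bounding box (x_min, y_min, x_max, y_max).
    Returns None for an empty mask."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_bbox(mask))  # (1, 1, 3, 2)
```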

Further, obtaining the correlation between the current target feature and the existing memory is specifically:

After a new frame I_{t-1} is segmented by the model, the current target query feature F_{t-1} and the obtained mask are integrated with the historical information. By continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information. The current target feature and the existing target feature memory are first dimension-transformed and then matrix-multiplied to obtain their affinity matrix, which expresses the correlation between the two.
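The dimension-transform-then-multiply step can be sketched as a plain matrix product between flattened query pixels and memory keys. This pure-Python stand-in (with toy 2-channel features) mirrors the tensor operation described above.

```python
def affinity_matrix(query, memory):
    """Affinity between a query feature (hw x c, flattened pixels) and the
    memory keys (Thw x c): the matrix product query @ memory^T, giving an
    hw x Thw correlation matrix."""
    return [
        [sum(q * m for q, m in zip(q_row, m_row)) for m_row in memory]
        for q_row in query
    ]

query = [[1.0, 0.0], [0.5, 0.5]]                # 2 query pixels, c = 2
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 memory slots, c = 2
print(affinity_matrix(query, memory))
# [[1.0, 0.0, 1.0], [0.5, 0.5, 1.0]]
```

In practice the features would first be L2-normalized or scaled so the affinities are comparable across pixels; that detail is omitted here.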

Wherein, according to the obtained feature correlations, the highly correlated parts between the existing target feature memory and the current target feature are merged, parts with moderate correlation to the existing memory are appended to the memory, and parts with low correlation are discarded directly, realizing dynamic adaptive adjustment of the memory embedding. The dynamic adjustment process is specifically:

The memory embedding is initialized with the target information of the first frame of the sequence and serves as the main part of the memory bank; the target query is compared with the existing target feature memory to find their similar parts.

For each element of the target query, the affinity matrix is searched to find its maximum similarity to the target feature memory M_k ∈ R^{Thw};

If the maximum correlation between the two exceeds an upper threshold, records with the same key are inserted into the same storage slot following the single-mapping principle of hashing; that is, weighted fusion updates the parts of the current feature that are highly correlated with the existing memory into the existing memory embedding. Current target features whose correlation with the existing memory is above the overall average are appended directly to the existing memory. The corresponding parts of the target foreground and background masks are processed in the same way.

Further, updating the parts of the current feature that are highly correlated with the existing memory into the existing memory embedding is specifically:

Merge the highly correlated parts between the existing memory and the current target feature:

M_k(j') = β F_{t-1}(i) + (1-β) M_k(j),

M_vf(j') = β Y_f^{t-1}(i) + (1-β) M_vf(j),

M_vb(j') = β Y_b^{t-1}(i) + (1-β) M_vb(j),

where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories respectively, j' is the index of the merged memory slot, β is the fusion weight, F_{t-1} is the target query, Y_f^{t-1} is the foreground, Y_b^{t-1} is the background, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the index of the matched feature point in the current feature, and j is the index of the matched part of the existing memory;

Append current target features whose correlation with the existing memory is above the average directly to the existing memory:

M_k' = Union(M_k, F_{t-1}(i)),

M_vf' = Union(M_vf, Y_f^{t-1}(i)),

M_vb' = Union(M_vb, Y_b^{t-1}(i)),

where Union(.) denotes the union of the current feature and the corresponding memory, M_k' is the union of the existing key memory and the target feature of the previous moment, M_vf' is the union of the existing target foreground value memory and the target foreground mask of the previous moment, and M_vb' is the union of the existing target background value memory and the target background mask of the previous moment.
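The merge/append/discard policy can be sketched end to end. In this minimal illustration scalar values stand in for feature vectors and masks, and the threshold and fusion weight are invented for demonstration; the merge branch follows the weighted-fusion rule M_k(j') = βF_{t-1}(i) + (1-β)M_k(j), and the append branch follows the Union rule.

```python
def update_memory(memory, features, corrs, hi_thresh, beta):
    """Dynamic memory adjustment sketch. For each current feature, take its
    maximum correlation with the memory: above hi_thresh it is merged into
    the best-matching slot by weighted fusion; above the mean of the
    per-feature maxima it is appended; otherwise it is discarded."""
    maxima = [max(c) for c in corrs]
    mean_corr = sum(maxima) / len(maxima)
    for feat, corr, m in zip(features, corrs, maxima):
        j = corr.index(m)                 # best-matching memory slot
        if m > hi_thresh:
            memory[j] = beta * feat + (1 - beta) * memory[j]   # merge
        elif m > mean_corr:
            memory.append(feat)                                # append
        # else: low correlation -> discard
    return memory

mem = [1.0, 2.0]                          # two existing memory slots
feats = [1.2, 5.0, 0.1]                   # three current feature elements
corrs = [[0.95, 0.2], [0.5, 0.6], [0.1, 0.05]]
print(update_memory(mem, feats, corrs, hi_thresh=0.9, beta=0.5))
# approx. [1.1, 2.0, 5.0]: first merged, second appended, third dropped
```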

Wherein, capturing the target deformation state in the current target feature by aggregating the weighted correlations between query pixels and the reference feature, and establishing correspondences between similar target parts, is specifically:

Construct the association between each pixel f_i of the query feature F and the entire template; a non-shared attention mechanism computes the correlation e_ij between f_i and each pixel r_j of the key, giving the similarity function;

The e_ij over all pixels of the template R are normalized by the softmax function, and the normalized similarity weights are used to aggregate the per-pixel features of the template, generating the target deformation feature corresponding to f_i;

A residual connection joins the query feature and the target deformation feature to obtain a feature containing deformation information, and the target foreground probability is used to enhance the confidence of the target deformable feature.
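The softmax-weighted aggregation plus residual connection for one query pixel can be sketched as follows. Scalar values stand in for feature vectors, and the correlation scores are supplied directly since the text does not give the exact form of the similarity function.

```python
import math

def deformable_feature(query_pixel, template, scores):
    """Pixel-to-global aggregation sketch: softmax-normalize the correlation
    scores e_ij of one query pixel f_i against every template pixel r_j,
    take the weighted sum of template features as the deformation feature,
    and add it back to the query pixel (the residual connection)."""
    exp = [math.exp(e) for e in scores]
    total = sum(exp)
    weights = [e / total for e in exp]               # softmax over template
    deform = sum(w * r for w, r in zip(weights, template))
    return query_pixel + deform                      # residual connection

template = [1.0, 2.0, 3.0]      # scalar stand-ins for r_j features
scores = [0.0, 0.0, 0.0]        # uniform similarity -> plain average
print(deformable_feature(0.5, template, scores))  # approx. 0.5 + 2.0 = 2.5
```

With non-uniform scores the weights concentrate on the template parts most similar to f_i, which is how the aggregation tracks deformed target parts.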

In a second aspect, a deformable single-target tracking device based on dynamic compact memory embedding comprises a processor and a memory, the memory storing program instructions, the processor calling the program instructions stored in the memory to cause the device to perform the method steps of any one of the first aspect.

In a third aspect, a computer-readable storage medium stores a computer program; the computer program comprises program instructions that, when executed by a processor, cause the processor to perform the method steps of any one of the first aspect.

The beneficial effects of the technical solution provided by the invention are:

1. During target similarity matching, the invention introduces a dynamically adjusted, compact target memory bank, which provides an effective query reference for target similarity matching under complex tracking conditions such as occlusion and background clutter. Moreover, the hash-based dynamic memory adjustment mechanism effectively keeps the memory compact and of high quality, avoiding memory redundancy and unnecessary memory retrieval;

2. The proposed deformable feature learning module effectively obtains the deformation information of the current target by establishing a global correspondence between each pixel of the query feature and the entire reference template feature, further addressing the problem of deforming targets during tracking;

3. On five challenging tracking benchmarks (VOT2016, VOT2018, VOT2019, GOT-10K and TrackingNet), extensive simulation experiments demonstrate clear advantages of this method over other state-of-the-art trackers; in particular, it obtains the top-ranked EAO score of 0.508 on VOT2018.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the overall model structure of the invention;

The model mainly consists of a DCF-based tracker, a deformation feature extraction module, a similarity matching module based on compact memory embedding, and an upsampling segmentation module.

FIG. 2 is a schematic diagram of the target similarity matching and compact memory embedding structure of the invention;

To obtain the target in the current frame, similarity matching is performed between the query feature F and the target feature memory M_k. Specifically, the dimension-transformed F and M_k are matrix-multiplied to construct the affinity matrix A ∈ R^{hw×Thw} between the current feature and the memory. A then retrieves the value memories M_vf and M_vb, and the retrieved affinity features are Top-K averaged along the column dimension to obtain the foreground and background target similarities S_f and S_b. After target segmentation and tracking are complete, the query feature F and the obtained target foreground and background segmentation masks Y_f and Y_b are updated into the corresponding memories according to the compact memory adjustment mechanism proposed by the invention.

FIG. 3 is a schematic comparison of tracking results with different memory storage schemes;

FIG. 4 is a visualization of how the deformation features improve the model's tracking results;

The embedding of compact memory effectively improves the model's ability to discriminate similar distractors. However, the model still cannot completely segment the edge details of the target; the deformation feature learning module effectively solves this problem and achieves accurate segmentation of the target.

FIG. 5 shows visualized segmentation results of this method on the DAVIS2017 dataset;

FIG. 6 shows visualized experimental results of this method on VOT2016, VOT2018 and VOT2019;

FIG. 7 is a schematic structural diagram of a deformable single-target tracking device based on dynamic compact memory embedding.

DETAILED DESCRIPTION

To make the objectives, technical solutions and advantages of the invention clearer, embodiments of the invention are described in further detail below.

To solve the problems in the background art, the embodiments of the invention propose a compact memory embedding for deformable visual tracking. A single initial reference feature contains only limited target information; in a video sequence in particular, the target undergoes obvious appearance and shape changes. The invention accordingly proposes a dynamic memory embedding adjustment mechanism: by retrieving the feature affinity matrix generated during similarity matching, the embodiments obtain the correlation between the query feature and the existing memory.

The embodiments then merge the highly correlated parts between the existing memory and the current target feature. In addition, parts with moderate similarity to the memory are appended to the memory, while irrelevant parts are discarded directly. A high-quality memory embedding can provide complete target-change information from historical frames, effectively handling problems such as target occlusion and similar distractors. Furthermore, compared with existing matching methods based on correlation operations, the deformable feature extraction method proposed in the embodiments adopts a pixel-to-global association and aggregates the weighted correlations between query pixels and reference features, effectively capturing the target deformation state in the query feature.

FIG. 1 shows the overall flow of the tracking method in the embodiments of the invention. To cope with the challenges during tracking, the model mainly comprises three key components: a target similarity matching module based on compact memory embedding, a deformable feature learning module, and a tracker based on discriminative correlation filtering.

To track and segment the video frame I_t at the current time t, the video frame I_t and the initial frame I_1 are first encoded into query features and reference features by the backbone network (ResNet-50). To improve computational efficiency, the extracted query-frame and reference-frame features are reduced to 64 channels; the two features are denoted F_t and R, respectively.

The target similarity matching module used in the embodiments refers to the self-attention mechanism of the Transformer. The dynamic memory adjustment mechanism extends the current target feature F_t and the target segmentation mask into the compact memory stores M_k and M_v (where M_k denotes the memory store of target features and M_v the memory store of target foreground and background segmentation masks), thereby effectively overcoming occlusion and appearance changes during tracking. The target deformable feature learning module described in the embodiments performs a pixel-by-pixel global comparison between the query feature and the reference feature and establishes the correspondence df(F_t, R) between similar target parts; this global target correspondence can fully capture the target's deformation information. The discriminative tracker is used to extract the target localization information L_t and draws on ATOM, a recent tracker based on deep correlation filters: a 1×1 convolutional layer first reduces the backbone feature to 64 channels, which is then processed by a 4×4 convolutional layer and the continuously differentiable activation function PELU (parameterized exponential linear unit). The maximum of the activated feature is taken as the target's spatial localization. This tracking module is trained online with an efficient backpropagation method. Finally, the target similarity matching results, the deformable features and the target localization information are concatenated along the channel dimension, and the combined features are refined by a three-stage upsampling segmentation module. The segmented target mask is then converted, via its contour, into the target's bounding box.

一、基于紧凑记忆嵌入的目标相似性匹配1. Target similarity matching based on compact memory embedding

1、目标相似性匹配1. Target similarity matching

准确地将目标与复杂的背景分开需要可用的参考信息。基于匹配的VOS方法会充分利用初始帧中的目标标注信息来准确匹配当前的目标。在本发明实施例中,遵循常规注意力机制的做法,目标相似度匹配模块包含有查询、键和值三部分。图2展示目标相似度匹配模块的结构。查询特征Ft∈Rh×w×c是当前帧的目标表示。有所不同的是,模型中的键R∈Rh×w×c是初始帧的目标特征,而值(Yf1∈Rh×w×1和Yb1∈Rh×w×1)是视频第一帧的前景和背景分割掩模,其中,h为目标特征图的像素高度,w为目标特征图的像素宽度,c为目标特征图的通道数,R为多维矩阵的表示符号,Yf1为视频第一帧的前景分割掩模,Yb1为视频第一帧的背景分割掩模,f为目标前景的表示符号,b为目标背景的表示符号。Accurately separating the target from a complex background requires usable reference information. Matching-based VOS methods make full use of the target annotation in the initial frame to match the current target accurately. In the embodiments of the present invention, following the practice of conventional attention mechanisms, the target similarity matching module consists of three parts: query, key and value. Figure 2 shows the structure of the target similarity matching module. The query feature Ft∈Rh×w×c is the target representation of the current frame. Differently, the key R∈Rh×w×c in the model is the target feature of the initial frame, while the values (Yf1∈Rh×w×1 and Yb1∈Rh×w×1) are the foreground and background segmentation masks of the first video frame, where h is the pixel height of the target feature map, w is its pixel width, c is its number of channels, R denotes a real-valued multidimensional array, Yf1 is the foreground segmentation mask of the first frame, Yb1 is the background segmentation mask of the first frame, and the superscripts f and b denote the target foreground and background, respectively.

在视频序列中,目标通常会发生明显的外观和结构变化。仅仅利用目标的固定初始信息进行相似性匹配不能保证目标匹配的质量。当一个新的帧,例如It-1被模型分割时,当前的目标查询特征Ft-1和获得的掩模(Yft-1和Ybt-1)将与历史信息合并。通过将历史信息插入到键和值记忆中,所构成的记忆嵌入(Mk∈RT×h×w×c,Mvf∈RT×h×w×1和Mvb∈RT×h×w×1)将包含丰富的目标外观信息,其中,Mk为目标特征记忆,Mvf为目标前景值记忆,Mvb为目标背景值记忆,T为累计的视频帧数。In video sequences, targets often undergo significant changes in appearance and structure. Using only the fixed initial information of the target for similarity matching cannot guarantee the quality of target matching. When a new frame, e.g., It-1, is segmented by the model, the current target query feature Ft-1 and the obtained masks (Yft-1 and Ybt-1) are merged with the historical information. By inserting historical information into the key and value memories, the resulting memory embeddings (Mk∈RT×h×w×c, Mvf∈RT×h×w×1 and Mvb∈RT×h×w×1) contain rich target appearance information, where Mk is the target feature memory, Mvf is the target foreground value memory, Mvb is the target background value memory, and T is the number of accumulated video frames.

为了在键记忆和查询特征之间建立起目标的像素级关联,本方法首先生成了亲和矩阵A。具体地说,为了更好地匹配,键记忆Mk和查询Ft首先沿每个通道进行逐像素的二范数标准化处理。此外,Mk和Ft的维度还会分别被重塑为Thw×c和hw×c。To establish a pixel-level association of the target between the key memory and the query feature, an affinity matrix A is first generated. Specifically, for better matching, the key memory Mk and the query Ft are first L2-normalized pixel-wise along each channel. In addition, Mk and Ft are reshaped to Thw×c and hw×c, respectively.

A=Ft*(Mk)T, (1)A=Ft *(Mk )T , (1)

其中,*表示矩阵乘法,(.)T表示矩阵转置,A∈Rhw×Thw。where * denotes matrix multiplication, (.)T denotes the matrix transpose, and A∈Rhw×Thw.
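As a toy illustration of Eq. (1), the per-pixel L2 normalization and affinity computation can be sketched in pure Python (lists of lists stand for the reshaped hw×c and Thw×c matrices; the function names and sizes are illustrative assumptions, not the embodiment's actual implementation):

```python
import math

def l2_normalize_rows(x):
    # L2-normalize each pixel vector (one row per pixel).
    out = []
    for row in x:
        n = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / n for v in row])
    return out

def affinity(query, key_memory):
    # A = Ft * (Mk)^T: similarity between every query pixel (hw rows)
    # and every memory pixel (Thw rows) after normalization.
    q = l2_normalize_rows(query)
    k = l2_normalize_rows(key_memory)
    return [[sum(qi * ki for qi, ki in zip(qrow, krow)) for krow in k]
            for qrow in q]

# Toy example: 2 query pixels, 3 memory pixels, c = 2 channels.
A = affinity([[1.0, 0.0], [0.0, 1.0]],
             [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```

After normalization, each entry A[i][j] is a cosine-style similarity in [-1, 1] between query pixel i and memory pixel j.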

亲和矩阵A度量了查询特征Ft和键记忆Mk之间每个像素的相似性。为了得到准确的匹配目标,需要进一步检索值记忆。然后,将前景和背景值记忆图Mvf和Mvb的维度重塑为Thw×1。对于i=1,…,hw,亲和向量ai∈RThw通过点积运算检索值记忆向量Mvf∈RThw×1和Mvb:The affinity matrix A measures the per-pixel similarity between the query feature Ft and the key memory Mk. To obtain an accurate matching target, the value memory is further retrieved. The foreground and background value memory maps Mvf and Mvb are then reshaped to Thw×1. For i = 1, …, hw, the affinity vector ai∈RThw retrieves the value memory vectors Mvf∈RThw×1 and Mvb:

S̃f(i) = ai⊙Mvf, (2)

S̃b(i) = ai⊙Mvb, (3)

其中,⊙表示逐元素乘积,S̃f(i)∈RThw和S̃b(i)∈RThw分别为亲和力向量检索目标前景值和背景值记忆向量的结果,i为在目标特征hw维度上的第i个特征向量的下标索引。where ⊙ denotes element-wise multiplication (so each retrieved score vector keeps dimension Thw), S̃f(i)∈RThw and S̃b(i)∈RThw are the results of retrieving the target foreground and background value memory vectors with the affinity vector, and i is the index of the i-th feature vector along the hw dimension of the target feature.

高置信度的匹配分数保证了目标匹配的准确性。因此,本发明实施例应用top-K平均函数来提取检索到的向量S̃f(i)和S̃b(i)中的目标分数:Matching scores with high confidence guarantee the accuracy of target matching. Therefore, the embodiments of the present invention apply a top-K averaging function to extract the target score from the retrieved vectors S̃f(i) and S̃b(i):

Sf(i) = (1/K)∑j∈Ni S̃f(i,j), (4)

其中,集合Ni表示在匹配分数矩阵S̃f∈Rhw×Thw的第i行中的前K个匹配分数的索引,j为上述前K个匹配分数中第j个分数的下标索引。在本方法中,K被设置为3,背景匹配同上。最后的前景和背景匹配结果Sf∈Rhw和Sb∈Rhw用于生成目标后验概率P,即目标前景相对于背景的概率。where the set Ni contains the indices of the top K matching scores in the i-th row of the matching score matrix S̃f∈Rhw×Thw, and j indexes the j-th of these top-K scores. In this method, K is set to 3, and background matching is handled in the same way. The final foreground and background matching results Sf∈Rhw and Sb∈Rhw are used to generate the target posterior probability P, i.e., the probability of the target foreground relative to the background.
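The top-K averaging step described above can be sketched as follows (a minimal pure-Python illustration; the helper name is assumed, while the default K = 3 follows the text):

```python
def topk_average(retrieved_row, k=3):
    # Average the K largest matching scores in one row of the
    # retrieved score matrix; K = 3 as in the method.
    top = sorted(retrieved_row, reverse=True)[:k]
    return sum(top) / len(top)

# One query pixel scored against Thw = 5 memory positions:
s_f = topk_average([0.9, 0.2, 0.8, 0.7, 0.1])  # mean of {0.9, 0.8, 0.7}
```

Keeping only the K most confident scores suppresses low-quality matches before the foreground/background posterior is formed.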

2、紧凑记忆嵌入2. Compact memory embedding

丰富的记忆嵌入可以有效地提高目标相似性匹配的精度。然而,受计算设备的存储容量的限制,不可能将所有历史帧信息都存储在记忆库中,特别是对于包含超过1K帧的长视频。此外,由于相邻帧中的目标可能过于相似,或者出现目标遮挡,存储这些目标信息会导致记忆嵌入存储的冗余,以及不必要的匹配查询。Rich memory embeddings can effectively improve the accuracy of target similarity matching. However, limited by the storage capacity of computing devices, it is impossible to store the information of all historical frames in the memory bank, especially for long videos containing more than 1K frames. In addition, since the targets in adjacent frames may be overly similar, or the target may be occluded, storing such target information leads to redundant memory embedding storage and unnecessary matching queries.

现有的方法仅仅应用了临近的几个历史帧,或者以相等的间隔选择一部分历史帧。这些方法将丢失一些有效的参考信息。AFB-URR(联合自适应特征记忆和不确定区域细化的视频分割模型)合并了记忆中的相似部分,并删除查询频率最低的部分。但是它仍然可能会引入不相关的历史信息到记忆中,这很容易导致模型误差积累,直到模型漂移。受哈希算法的启发,为本模型开发了一个动态的紧凑记忆调整机制,从而形成了一个更有效的目标参考信息库。图2展示出了动态紧凑记忆嵌入的结构。基于亲和矩阵A,本发明实施例合并了当前特征和现有内存之间的高相似度(高于上限阈值)部分。为了避免低质量记忆造成的误匹配,直接丢弃低相关的目标特征。Existing methods use only a few adjacent historical frames, or select a subset of historical frames at equal intervals; these methods lose some valid reference information. AFB-URR (a video segmentation model with joint adaptive feature memory and uncertain-region refinement) merges similar parts of the memory and removes the least frequently queried parts, but it may still introduce irrelevant historical information into the memory, which easily leads to accumulated model error and eventually model drift. Inspired by hashing algorithms, a dynamic compact memory adjustment mechanism is developed for this model, resulting in a more effective target reference information base. Figure 2 shows the structure of the dynamic compact memory embedding. Based on the affinity matrix A, the embodiment merges the parts of the current feature whose similarity to the existing memory is high (above an upper threshold). To avoid mismatches caused by low-quality memory, low-correlation target features are discarded directly.

基于匹配的VOS方法大多采用初始帧的目标信息作为参考模板,因为具有真值标注的初始特征具有准确且完整的目标描述。因此,本方法利用序列中第一帧的目标信息来初始化记忆嵌入,并将其作为记忆库的主要部分。而最近视频帧中的目标与当前目标最为相似,但考虑到模型本身的计算误差,在目标匹配过程中会减少相应的查询偏好。例如,在模型推断期间,视频帧It-1完成目标分割并获得目标查询Ft-1,以及前景Yft-1和背景Ybt-1的分割掩模。为了提取有用的参考信息,首先将目标查询Ft-1与现有的目标特征记忆Mk进行对比,找出两者的相似部分。在目标相似性匹配过程中生成的亲和力矩阵A∈Rhw×Thw度量了查询特征Ft-1和键记忆Mk之间的相关性。因此,本发明实施例直接使用亲和力矩阵A对记忆嵌入进行动态管理。对于Ft-1中的每一个元素Ft-1(i)(i=1,…,hw),搜索亲和力矩阵A,以获得其与Mk∈RThw的最大相似部分:Matching-based VOS methods mostly use the target information of the initial frame as the reference template, because the initial features with ground-truth annotation provide an accurate and complete target description. Therefore, the method initializes the memory embedding with the target information of the first frame of the sequence and treats it as the main part of the memory bank. The target in the most recent video frame is the most similar to the current target, but, considering the computational error of the model itself, the corresponding query preference is reduced during target matching. For example, during model inference, frame It-1 is segmented, producing the target query Ft-1 as well as the foreground mask Yft-1 and background mask Ybt-1. To extract useful reference information, the target query Ft-1 is first compared with the existing target feature memory Mk to find their similar parts. The affinity matrix A∈Rhw×Thw generated during target similarity matching measures the correlation between the query feature Ft-1 and the key memory Mk, so the embodiment directly uses A to manage the memory embedding dynamically. For each element Ft-1(i) (i = 1, …, hw) of Ft-1, the affinity matrix A is searched for its maximum similarity to Mk∈RThw:

Re(Ft-1(i)) = maxj A(i,j), (5)

其中,Re(·)为亲和力矩阵A的每一行中最大的值,A(i,j)为亲和力矩阵A中的每一个元素。where Re(·) is the largest value in each row of the affinity matrix A, and A(i,j) denotes an element of A.

如果两者之间的最大相关A(i,j)大于某个上限ζ,则认为两者足够相似。在哈希算法中,哈希映射中具有相同键的记录将被插入到相同的存储空间中。因此,本发明实施例只将多个相似特征中的一个保存到记忆中。考虑到记忆的多样性,采用权值对相似特征和相应的记忆进行合并,避免了不必要的检索和内存冗余。根据以上分析,初始参考信息最为准确。因此,本发明实施例采用一个较小的融合权重β,将当前特征更新到现有的记忆嵌入中,以避免模型误差的干扰。在所有实验中,ζ设置为0.95,β设置为0.001。记忆嵌入的在线更新公式如下:If the maximum correlation A(i,j) between the two is greater than an upper bound ζ, they are considered sufficiently similar. In a hashing algorithm, records with the same key in the hash map are inserted into the same storage space; accordingly, the embodiment keeps only one of several similar features in memory. To preserve memory diversity, similar features and the corresponding memories are merged with weights, avoiding unnecessary retrieval and memory redundancy. As analyzed above, the initial reference information is the most accurate, so a small fusion weight β is used when updating the current feature into the existing memory embedding, avoiding interference from model error. In all experiments, ζ is set to 0.95 and β to 0.001. The online update formulas of the memory embedding are:

Mk(j') = βFt-1(i) + (1-β)Mk(j), (6)

Mvf(j') = βYft-1(i) + (1-β)Mvf(j), (7)

Mvb(j') = βYbt-1(i) + (1-β)Mvb(j), (8)

其中,Mk(j')为合并后的目标特征记忆,Mvf(j')和Mvb(j')分别为合并后的目标前景和背景值记忆,j'为记忆存储合并后的下标索引值。where Mk(j') is the merged target feature memory, Mvf(j') and Mvb(j') are the merged target foreground and background value memories, respectively, and j' is the index of the merged memory entry.

对于最大相关性Re(Ft-1(i))<ζ,本发明实施例选择相关性值高于平均值Ā=(1/hw)∑i Re(Ft-1(i))的特征,并将其扩展到现有记忆中,以确保记忆嵌入的多样性。同时,也避免了无关记忆的查询,通过以下操作来合并出高效紧凑的记忆存储:For maximum correlations Re(Ft-1(i)) < ζ, the embodiment selects the features whose correlation is above the average Ā = (1/hw)∑i Re(Ft-1(i)) and appends them to the existing memory, ensuring the diversity of the memory embedding while avoiding queries against irrelevant memory. An efficient and compact memory store is formed by the following operations:

M̃k = Union(Mk, Ft-1(i)), (9)

M̃vf = Union(Mvf, Yft-1(i)), (10)

M̃vb = Union(Mvb, Ybt-1(i)), (11)

其中,Union(.)表示当前特征和相应记忆的取并集操作,M̃k为现有键记忆和前一时刻目标特征的并集,M̃vf为现有目标前景值记忆和前一时刻目标前景掩模的并集,M̃vb为现有目标背景值记忆和前一时刻目标背景掩模的并集。where Union(.) denotes the union of the current feature and the corresponding memory, M̃k is the union of the existing key memory and the target features of the previous moment, M̃vf is the union of the existing target foreground value memory and the foreground mask of the previous moment, and M̃vb is the union of the existing target background value memory and the background mask of the previous moment.
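Putting the merge / extend / discard rules together, the dynamic memory adjustment might be sketched as follows in pure Python (rows are assumed already L2-normalized so a plain dot product stands in for the affinity entries; the foreground/background value memories are omitted for brevity, and all names are illustrative, not the embodiment's implementation):

```python
def update_memory(feature_rows, mem_rows, zeta=0.95, beta=0.001):
    # For each current feature row, find its most similar memory row.
    # Merge near-duplicates with weight beta, append above-average
    # matches, and discard low-correlation rows.
    best = []
    for f in feature_rows:
        sims = [sum(a * b for a, b in zip(f, m)) for m in mem_rows]
        j = max(range(len(sims)), key=sims.__getitem__)
        best.append((sims[j], j))
    avg = sum(s for s, _ in best) / len(best)
    for f, (s, j) in zip(feature_rows, best):
        if s > zeta:          # near-duplicate: weighted merge into slot j
            mem_rows[j] = [beta * a + (1 - beta) * b
                           for a, b in zip(f, mem_rows[j])]
        elif s > avg:         # moderately similar: extend the memory
            mem_rows.append(list(f))
        # else: low correlation, discard
    return mem_rows

mem = update_memory([[1.0, 0.0], [0.6, 0.8], [0.1, 0.2]],
                    [[1.0, 0.0], [0.0, 1.0]])
# slot 0 absorbs the near-duplicate, [0.6, 0.8] is appended,
# and the low-correlation row is dropped
```

With ζ = 0.95 and β = 0.001 as in the text, the memory grows only when a feature adds genuinely new appearance information, keeping the store compact.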

图3给出了本发明实施例中的紧凑记忆嵌入和其他两种相关方法的比较。如第一行所示,存储所有历史记忆和自适应特征库(AFB)在一定程度上提高了目标相似性匹配的判别能力。但是对于复杂的背景杂波,冗余的记忆会导致错误的目标匹配。本发明实施例的记忆嵌入方法充分挖掘了记忆特征的多样性和紧凑性,可以获得更好的目标匹配性能。FIG. 3 compares the compact memory embedding of the embodiment with two related methods. As shown in the first row, storing all historical memories or using an adaptive feature bank (AFB) improves the discriminative ability of target similarity matching to a certain extent, but under complex background clutter the redundant memory can lead to false target matches. The memory embedding method of the embodiment fully exploits the diversity and compactness of memory features and achieves better target matching performance.

二、目标可变性特征学习2. Target variability feature learning

如图4所示,在紧凑记忆嵌入的帮助下,目标相似性匹配具有良好的判别能力,能够很好地解决目标遮挡和背景干扰等跟踪难题。在解决目标变形方面也有一定的优势。但它无法有效地解决严重的目标变形或空间细节问题。受图注意机制的启发,本发明实施例提出了目标可变性特征学习来进一步缓解上述困境。本发明实施例构造了查询特征F中每个像素与整个模板R之间的关联,以获取完整的目标变形信息。由于查询和键包含不同的目标表示,对于查询中的每像素fi,本发明实施例应用非共享注意机制计算键中的每个像素rj和fi的相关性,得到可学习的相似性函数:As shown in Figure 4, with the help of the compact memory embedding, target similarity matching has good discriminative ability and handles tracking difficulties such as target occlusion and background interference well. It also has certain advantages against target deformation, but it cannot effectively handle severe deformation or fine spatial detail. Inspired by the graph attention mechanism, the embodiments of the present invention propose target deformable feature learning to further alleviate this dilemma. The association between each pixel of the query feature F and the entire template R is constructed to obtain complete target deformation information. Since the query and the key contain different target representations, for each pixel fi in the query, a non-shared attention mechanism computes the correlation between fi and each pixel rj in the key, giving the learnable similarity function:

eij = (WF fi)T(WR rj), (12)

其中,WF和WR表示将fi和rj转换为更高维表示的可学习线性变换,eij为查询特征中的每像素fi与键特征中的每个像素rj之间的相关性。where WF and WR are learnable linear transformations that map fi and rj into higher-dimensional representations, and eij is the correlation between pixel fi of the query feature and pixel rj of the key feature.

为了便于比较fi和模板R中不同部分之间的相似性,通过softmax函数对模板R的所有像素上的eij进行归一化,得到归一化的相似性权重:To compare the similarity between fi and different parts of the template R, eij is normalized over all pixels of the template R by the softmax function, yielding the normalized similarity weight:

dij = exp(eij) / ∑j' exp(eij'), (13)

利用归一化的相似度权重dij来聚集模板R中的逐像素特征,从而生成对应于fi的变形特征vi:The normalized similarity weights dij aggregate the pixel-wise features of the template R to generate the deformed feature vi corresponding to fi:

vi = φv(∑j dij rj), (14)

其中,φv(.)表示ReLU(Wv*(.)),Wv*(.)表示对输入的特征进行一种线性变换。然后利用残差连接将查询特征和获取的目标变形特征V(由各vi构成)连接起来,得到包含变形信息的特征F̃:where φv(.) denotes ReLU(Wv*(.)), and Wv*(.) is a linear transformation of the input features. A residual connection then combines the query feature with the obtained target deformation features V (the map formed by all vi) to produce the feature F̃ containing the deformation information:

F̃ = φc(Concat(F, V)), (15)

其中,φc(.)表示ReLU(Wc*(.))旨在降低特征的维数。Concat(.)表示特征连接操作。在 匹配的目标变形特征中不可避免地存在背景误匹配。因此,采用目标前景概率(即初始帧 的目标分割掩模)P来增强变形特征的可信度。where φc (.) denotes that ReLU(Wc *(.)) aims to reduce the dimensionality of features. Concat(.) represents the feature concatenation operation. There are inevitably background mismatches in the matched object deformation features. Therefore, the target foreground probability (i.e. the target segmentation mask of the initial frame) P is adopted to enhance the credibility of the deformed features.

具体地,所获得的目标变形特征F̃通过点积运算检索初始帧中的前景概率Yf1,以生成最终的可变形特征:Specifically, the obtained target deformation feature F̃ retrieves the foreground probability Yf1 of the initial frame by a dot-product operation to generate the final deformable feature:

df(Ft, R) = F̃⊙Yf1, (16)
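The pixel-to-template attention above (similarity scoring, softmax normalization, weighted aggregation) can be sketched in pure Python under the simplifying assumption that WF, WR and the Wv projection are identity maps (so the learned transforms and the ReLU are omitted; names and sizes are illustrative):

```python
import math

def deform_feature(f_i, template_rows):
    # Score one query pixel against every template pixel, normalize
    # the scores with a softmax, then aggregate the template pixel-wise
    # to obtain the deformed feature corresponding to f_i.
    e = [sum(a * b for a, b in zip(f_i, r)) for r in template_rows]
    m = max(e)
    w = [math.exp(x - m) for x in e]          # numerically stable softmax
    z = sum(w)
    d = [x / z for x in w]                    # normalized weights d_ij
    c = len(f_i)
    return [sum(d[j] * template_rows[j][k] for j in range(len(d)))
            for k in range(c)]                # weighted aggregation

# Toy query pixel against a 2-pixel template, c = 2 channels:
v = deform_feature([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the template rows here are unit basis vectors, the output equals the softmax weights themselves, which makes the aggregation easy to verify by hand.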

三、单目标视觉跟踪3. Single target visual tracking

将相似性匹配过程获取的目标前景相似性和目标后验概率,形变特征提取模块获取的目标形变特征,以及在线判别式相关滤波器获取的目标空间定位,共四种特征图沿着通道级联起来,再输入到轻量化的解码器(上采样分割模块)中获得最终精细化的目标分割掩模。利用OpenCV库中的轮廓检测函数在最终的目标分割掩模上获取出目标的矩形包围框,包围框的中心位置即为目标的定位,包围框的大小即为目标的尺度状态。Four feature maps, namely the target foreground similarity and target posterior probability obtained by the similarity matching process, the target deformation feature obtained by the deformable feature extraction module, and the target spatial localization obtained by the online discriminative correlation filter, are concatenated along the channel dimension and input into a lightweight decoder (the upsampling segmentation module) to obtain the final refined target segmentation mask. The contour detection function of the OpenCV library is applied to the final segmentation mask to obtain the rectangular bounding box of the target; the center of the bounding box gives the target location, and its size gives the target scale.
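The mask-to-box step can be illustrated with a simplified axis-aligned stand-in for the OpenCV contour routine (pure Python; the actual pipeline would run OpenCV's contour detection on the predicted mask, and the function name here is an assumption):

```python
def mask_to_bbox(mask):
    # Axis-aligned bounding box of the foreground pixels in a binary
    # segmentation mask; returns (x, y, w, h), or None for an empty mask.
    ys = [i for i, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [j for row in mask for j, v in enumerate(row) if v]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1)

box = mask_to_bbox([[0, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0]])  # → (1, 1, 2, 2)
```

The box center then serves as the target location and the box size as the target scale, as described above.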

四、具体实施步骤和仿真实验Fourth, the specific implementation steps and simulation experiments

在本发明实施例中,利用在ImageNet上预先训练后的ResNet50的前四个阶段作为主 干网络来提取特征。目标分割是一项像素级的分类任务,需要使用具有高置信度语义的特 征。因此,本方法提取了骨干网的最后一层进行目标相似性匹配和变形特征提取。然后,主干特征通过1×1卷积层减少到64个信道,接着是3×3卷积层和ReLU激活。在上采样 分割过程中,利用主干的前三级来补充目标的空间细节信息。本方法将top-K设置为K=3。 而相似记忆合并的上限阈值ζ设置为0.95。在初步实验中,本方法发现内存嵌入模块和top-K 的参数对跟踪性能没有明显的影响。因此,上述设置在所有相关实验中都是固定的。In the embodiment of the present invention, the first four stages of ResNet50 pre-trained on ImageNet are used as the backbone network to extract features. Object segmentation is a pixel-level classification task that requires the use of features with high-confidence semantics. Therefore, this method extracts the last layer of the backbone network for target similarity matching and deformation feature extraction. Then, the backbone features are reduced to 64 channels through a 1×1 convolutional layer, followed by a 3×3 convolutional layer and ReLU activation. During the upsampling segmentation process, the first three stages of the backbone are used to supplement the spatial details of the target. This method sets top-K to K=3. The upper threshold ζ for similar memory merging is set to 0.95. In preliminary experiments, the method found that the parameters of the memory embedding module and top-K had no significant effect on the tracking performance. Therefore, the above settings are fixed in all relevant experiments.

模型训练:紧凑记忆嵌入和基于DCF跟踪的模块都是在线更新的,无需进行预训练。因此,本方法在Youtube-VOS数据集上仅预训练了相似性匹配、可变形特征提取和上采样分割模块。与基于孪生网络的跟踪模型中采用的采样策略类似,从视频序列中采样一对带有掩模的图像来构造训练样本。本方法通过Adam优化器来最小化交叉熵损失,学习率为8×10-4,每15个周期衰减0.2。整个训练过程在单张Nvidia Titan XP显卡上进行,需要60个epochs。Model training: both the compact memory embedding and the DCF-based tracking module are updated online and need no pre-training. Therefore, only the similarity matching, deformable feature extraction and upsampling segmentation modules are pre-trained on the Youtube-VOS dataset. Similar to the sampling strategy used in Siamese-network-based tracking models, a pair of images with masks is sampled from a video sequence to construct a training sample. The cross-entropy loss is minimized with the Adam optimizer at a learning rate of 8×10-4, decayed by 0.2 every 15 epochs. The whole training process runs on a single Nvidia Titan XP graphics card and takes 60 epochs.
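One plausible reading of the schedule above (start at 8e-4 and multiply by a factor of 0.2 every 15 epochs over a 60-epoch run) can be written as a small helper; both this interpretation of "decayed by 0.2" and the function name are assumptions:

```python
def learning_rate(epoch, base_lr=8e-4, decay=0.2, step=15):
    # Step schedule: multiply the base learning rate by `decay`
    # once every `step` epochs.
    return base_lr * (decay ** (epoch // step))

rates = [learning_rate(e) for e in (0, 15, 30, 59)]
```

Under this reading the rate drops from 8e-4 to 1.6e-4, 3.2e-5 and finally 6.4e-6 over the 60 epochs.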

模型推理:在VOT任务中,视频序列包含给定的边界框标签。本方法首先在初始帧中用地面真值框生成一个伪掩模作为伪标签,然后,利用该生成的伪标签对模型进行初始化。在跟踪过程中,利用分割模型对当前帧进行处理,得到目标分割模板。使用查询特征和生成的分割掩模在线更新紧凑型记忆库。最后,将得到的分割掩模转换成一个旋转的目标包围框作为跟踪结果。Model inference: in the VOT task, the video sequence comes with a given bounding-box label. The method first generates a pseudo-mask from the ground-truth box in the initial frame as a pseudo-label, and then initializes the model with this pseudo-label. During tracking, the segmentation model processes the current frame to obtain the target segmentation mask, and the compact memory bank is updated online with the query features and the generated mask. Finally, the obtained segmentation mask is converted into a rotated target bounding box as the tracking result.

仿真实验结果:表1-3展示了本发明实施例的跟踪模型在公开的单目标跟踪数据集基准VOT2016、VOT2018和VOT2019的评测结果,通过将本方法与一些最新的跟踪模型(例如SiamMask、SiamRPN++、ATOM、D3S和Ocean等)进行对比,实验结果显示出了跟踪模型的先进性,从而也验证了本方法的有效性。表4展示了本发明实施例构建的模型在视频目标分割数据基准DAVIS2017上的评测结果,也充分验证了本模型的有效性和先进性。而图5和6展示了本方法在VOT系列和DAVIS2017数据基准上的可视化实验结果。Simulation results: Tables 1-3 show the evaluation results of the tracking model of the embodiment on the public single-target tracking benchmarks VOT2016, VOT2018 and VOT2019. Compared with several recent tracking models (e.g., SiamMask, SiamRPN++, ATOM, D3S and Ocean), the experimental results demonstrate the advancement of the tracking model and thus verify the effectiveness of this method. Table 4 shows the evaluation results of the constructed model on the video object segmentation benchmark DAVIS2017, which also fully verifies the effectiveness and advancement of the model. Figures 5 and 6 show the visualized experimental results of this method on the VOT-series and DAVIS2017 benchmarks.

本发明实施例涉及的对比跟踪方法说明如下:SiamMask(结合目标分割的跟踪方法);D3S(判别式分割跟踪模型);CCOT(基于连续卷积的跟踪方法);CSR-DCF(基于通道和空间可信度的相关滤波跟踪方法);ASRCF(基于自适应空间正则化的相关滤波跟踪方法);SiamDW(基于更宽更深的孪生网络跟踪模型);SiamRPN++(基于大型主干网络和孪生区域建议网络的跟踪模型);SiamRPN(基于孪生区域建议网络的跟踪模型);SiamBAN(基于自适应包围框回归的跟踪模型);Ocean-on/off(基于目标感知的无锚点跟踪方法);Update-Net(基于模板在线更新的孪生跟踪方法);SPM(基于序列并行匹配的实时视觉对象跟踪方法);ATOM(通过目标重叠最大化实现目标精确跟踪);DiMP(基于判别式模型预测的视觉跟踪方法);C-RPN(基于孪生级联区域建议网络的跟踪方法);SiamGraph(基于孪生注意力图模型的跟踪方法);LADCF(基于时序一致性约束的判别式跟踪方法);OSMN(基于网络调制的有效视频对象分割方法);STM(基于时空记忆的视频分割方法);VM(基于目标匹配的视频分割方法);FAVOS(基于跟踪部分的视频分割方法);OnAVOS(基于在线自适应卷积网络的分割方法)。The comparative tracking methods involved in the embodiments of the present invention are as follows: SiamMask (a tracking method combined with target segmentation); D3S (a discriminative segmentation tracking model); CCOT (a tracking method based on continuous convolution); CSR-DCF (a correlation filter tracking method based on channel and spatial reliability); ASRCF (a correlation filter tracking method with adaptive spatial regularization); SiamDW (a tracking model based on a deeper and wider Siamese network); SiamRPN++ (a tracking model based on a large backbone and a Siamese region proposal network); SiamRPN (a tracking model based on a Siamese region proposal network); SiamBAN (a tracking model based on adaptive bounding-box regression); Ocean-on/off (an object-aware anchor-free tracking method); Update-Net (a Siamese tracking method with online template updating); SPM (a real-time visual object tracking method based on series-parallel matching); ATOM (accurate tracking by overlap maximization); DiMP (a visual tracking method based on discriminative model prediction); C-RPN (a tracking method based on Siamese cascaded region proposal networks); SiamGraph (a tracking method based on a Siamese attention graph model); LADCF (a discriminative tracking method based on temporal consistency constraints); OSMN (an efficient video object segmentation method based on network modulation); STM (a video segmentation method based on space-time memory); VM (a video segmentation method based on target matching); FAVOS (a video segmentation method based on tracking parts); OnAVOS (a segmentation method based on online adaptive convolutional networks).

表1多种跟踪方法在VOT2016数据基准上的性能对比Table 1. Performance comparison of various tracking methods on the VOT2016 data benchmark

Figure BDA0003140298980000121
Figure BDA0003140298980000121

Figure BDA0003140298980000131
Figure BDA0003140298980000131

表2多种跟踪方法在VOT2018数据基准上的性能对比Table 2 Performance comparison of various tracking methods on the VOT2018 data benchmark

Figure BDA0003140298980000132
Figure BDA0003140298980000132

表3多种跟踪方法在VOT2019数据基准上的性能对比Table 3. Performance comparison of various tracking methods on the VOT2019 data benchmark

Figure BDA0003140298980000133
Figure BDA0003140298980000133

表4多种跟踪和VOS方法在DAVIS2017数据基准上的性能对比Table 4. Performance comparison of various tracking and VOS methods on the DAVIS2017 data benchmark

Figure BDA0003140298980000134
Figure BDA0003140298980000134

Figure BDA0003140298980000141
Figure BDA0003140298980000141

基于同一发明构思,本发明实施例还提供了一种基于动态紧凑记忆嵌入的可变形单目标跟踪装置,参见图7,该装置包括:处理器1和存储器2,存储器2中存储有程序指令,处理器1调用存储器2中存储的程序指令以使装置执行实施例中的以下方法步骤:Based on the same inventive concept, an embodiment of the present invention also provides a deformable single-target tracking device based on dynamic compact memory embedding. Referring to FIG. 7, the device includes a processor 1 and a memory 2; the memory 2 stores program instructions, and the processor 1 invokes the program instructions stored in the memory 2 to cause the device to perform the following method steps of the embodiment:

目标相似度匹配过程中生成的特征亲和力矩阵,用于表达当前目标特征与现有目标特征记忆之间的相关性,然后对特征亲和力矩阵进行逐行的筛选前K个值并取其平均,以获取目标前景、背景相似性,以及目标后验概率;A feature affinity matrix generated during target similarity matching expresses the correlation between the current target feature and the existing target feature memory; the top K values of each row of the affinity matrix are then selected and averaged to obtain the target foreground and background similarities and the target posterior probability;

依据获取的特征相关性,合并现有目标特征记忆和当前目标特征之间的高相关部分,将与现有目标特征记忆具有中等相似性的部分扩充到记忆中,将低相关的部分丢弃,实现了记忆嵌入的动态自适应调节;According to the acquired feature correlations, the highly correlated parts between the existing target feature memory and the current target feature are merged, the parts moderately similar to the existing memory are appended to the memory, and the low-correlation parts are discarded, realizing dynamic adaptive adjustment of the memory embedding;

采用逐个像素点到全局特征的关联方式,通过聚合查询像素和参考特征之间的加权相关性,捕获当前目标特征中的目标变形状态,在相似的目标部分之间建立对应关系,实现可变形特征的提取;Using a pixel-to-global-feature association, the weighted correlations between query pixels and the reference feature are aggregated to capture the target deformation state in the current target feature and to establish correspondences between similar target parts, realizing deformable feature extraction;

将目标前景相似性和目标后验概率,可变形特征以及在线判别式相关滤波器中获取到的目标空间定位,沿通道级联并输入到解码器中获得精细化的目标分割掩模;基于目标分割掩模获取目标的矩形包围框实现视觉跟踪定位。The target foreground similarity and target posterior probability, the deformable features, and the target spatial localization obtained by the online discriminative correlation filter are concatenated along the channel dimension and input into the decoder to obtain a refined target segmentation mask; the rectangular bounding box of the target is obtained from the segmentation mask to realize visual tracking and localization.

其中,获取当前目标特征与现有记忆之间的相关性具体为:The correlation between the current target feature and the existing memory is obtained as follows:

当一个新的帧It-1被模型分割之后,当前的目标查询特征Ft-1和获得的掩模将与历史信息进行整合,通过不断整合新的目标信息到键和值的记忆中,所构成的记忆嵌入将包含丰富的目标外观信息;对于当前的目标特征,以及已有的目标特征记忆,先将二者进行维度变换,再进行矩阵相乘获得二者的亲和力矩阵,该亲和力矩阵即表达了二者的相关性。After a new frame It-1 is segmented by the model, the current target query feature Ft-1 and the obtained masks are integrated with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information. The current target feature and the existing target feature memory are first dimension-transformed and then matrix-multiplied to obtain their affinity matrix, which expresses their correlation.

进一步地,依据获取的特征相关性,合并现有目标特征记忆和当前目标特征之间的高相关部分,将与现有目标特征记忆具有中等相关性的部分扩充到记忆中,将不相关的部分丢弃,实现了记忆嵌入的动态自适应调节具体为:Further, according to the acquired feature correlations, the highly correlated parts between the existing target feature memory and the current target feature are merged, the parts moderately correlated with the existing memory are appended to the memory, and the irrelevant parts are discarded; the dynamic adaptive adjustment of the memory embedding is specifically:

利用序列中第一帧的目标信息初始化记忆嵌入,并将其作为记忆库的主要部分,将目 标查询与现有的目标特征记忆进行对比,找出两者的相似部分;Use the target information of the first frame in the sequence to initialize the memory embedding, and use it as the main part of the memory bank, compare the target query with the existing target feature memory, and find out the similar parts of the two;

对于目标查询中的每一个元素,搜索亲和力矩阵,以获得其与Mk∈RThw的最大相似部分;For each element of the target query, the affinity matrix is searched to obtain its maximum similarity to Mk∈RThw;

若两者之间的最大相关大于某个上限值,则遵循哈希单映射原则将具有相同键值的记录插入到相同的存储空间中,即采用加权融合方式,将当前特征中与已有记忆具有高相关的部分更新到现有的记忆嵌入中;对于和已有记忆的相关性值高于总体平均值的当前目标特征,会将其直接扩展到现有记忆中;对目标前景和背景掩模的对应部分进行同样处理。If the maximum correlation between the two is greater than a certain upper bound, then, following the hash single-mapping principle of inserting records with the same key into the same storage space, weighted fusion is used to update the parts of the current feature that are highly correlated with the existing memory into the existing memory embedding; current target features whose correlation with the existing memory is above the overall average are directly appended to the existing memory; the corresponding parts of the target foreground and background masks are processed in the same way.

其中,将当前特征中与已有记忆具有高相关的部分更新到现有的记忆嵌入中具体为:Among them, updating the part of the current feature with high correlation with the existing memory to the existing memory embedding is specifically:

合并现有记忆和当前目标特征之间的高相关部分:Merge high-correlation parts between existing memory and current target features:

Mk(j') = βFt-1(i) + (1-β)Mk(j),

Mvf(j') = βYft-1(i) + (1-β)Mvf(j),

Mvb(j') = βYbt-1(i) + (1-β)Mvb(j),

其中,Mk(j')为合并后的目标特征记忆,Mvf(j')和Mvb(j')分别为合并后的目标前景和背景值记忆,j'为记忆存储合并后的下标索引值,β为融合权重,Ft-1为目标查询,Yft-1为前景,Ybt-1为背景,Mk为目标特征记忆,Mvf为目标前景值记忆,Mvb为目标背景值记忆,i为当前特征中和已有记忆具有中等相似性的特征点的下标索引,j为已有记忆中和当前特征具有中等相似性的部分的下标索引;where Mk(j') is the merged target feature memory, Mvf(j') and Mvb(j') are the merged target foreground and background value memories, respectively, j' is the index of the merged memory entry, β is the fusion weight, Ft-1 is the target query, Yft-1 is the foreground, Ybt-1 is the background, Mk is the target feature memory, Mvf is the target foreground value memory, Mvb is the target background value memory, i is the index of the feature point in the current feature that is moderately similar to the existing memory, and j is the index of the part of the existing memory that is moderately similar to the current feature;

extending current target features whose correlation with the existing memory is above the average directly into the existing memory:

M̃k = Union(Mk, Ft-1),

M̃vf = Union(Mvf, Yft-1),

M̃vb = Union(Mvb, Ybt-1),

where Union(.) denotes the union operation between the current feature and the corresponding memory; M̃k is the union of the existing key memory and the target feature at the previous moment; M̃vf is the union of the existing target foreground value memory and the target foreground mask at the previous moment; and M̃vb is the union of the existing target background value memory and the target background mask at the previous moment.

Further, capturing the target deformation state in the current target feature by aggregating the weighted correlations between query pixels and the reference feature, thereby establishing correspondence between similar target parts, specifically comprises:

constructing the association between each pixel of the query feature F and the entire template, and applying a non-shared attention mechanism to compute the correlation between fi and each pixel rj of the key, obtaining the similarity function;

normalizing eij over all pixels of the template R with a softmax function, and using the normalized similarity weights to aggregate the pixel-wise features of the template, generating the target deformation feature corresponding to fi;

connecting the query feature and the target deformation feature through a residual to obtain a feature containing deformation information, and using the target foreground probability to enhance the confidence of the target deformable feature.
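The pixel-to-global aggregation can be sketched as follows. A plain dot product stands in for the patent's learned non-shared attention similarity, which is an assumption; the softmax normalization, weighted aggregation, residual connection, and foreground-probability weighting follow the steps above.

```python
import numpy as np

def deformable_feature(F, R, fg_prob):
    """F: query feature, (N, C); R: template/key feature, (M, C);
    fg_prob: per-pixel target foreground probability, (N,).

    e_ij = <f_i, r_j> is softmax-normalized over all template pixels j,
    the weights aggregate R into a per-pixel deformation feature, which
    is residual-added to F and re-weighted by the foreground probability."""
    e = F @ R.T                              # (N, M) pairwise correlations
    e = e - e.max(axis=1, keepdims=True)     # subtract row max for stability
    w = np.exp(e)
    w = w / w.sum(axis=1, keepdims=True)     # softmax over template pixels
    deform = w @ R                           # (N, C) target deformation feature
    out = F + deform                         # residual connection
    return out * fg_prob[:, None]            # confidence from fg probability
```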

It should be noted here that the device description in the above embodiment corresponds to the method description, and is not repeated in this embodiment of the present invention.

The processor 1 and the memory 2 may be carried on any device with computing capability, such as a computer, single-chip microcomputer, or microcontroller. The embodiment of the present invention does not limit the specific implementation, which is selected according to the needs of the practical application.

Data signals are transmitted between the memory 2 and the processor 1 through the bus 3, which is not described in detail in this embodiment of the present invention.

Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium. The storage medium includes a stored program; when the program runs, the device where the storage medium is located is controlled to execute the method steps of the above embodiments.

The computer-readable storage medium includes, but is not limited to, flash memory, a hard disk, a solid-state drive, and the like.

It should be noted here that the description of the readable storage medium in the above embodiment corresponds to the method description, and is not repeated in this embodiment of the present invention.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present invention are produced in whole or in part.

The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in, or transmitted over, a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium, a semiconductor medium, or the like.

In the embodiments of the present invention, the models of the devices are not limited unless otherwise specified; any device capable of performing the above functions may be used.

Those skilled in the art will understand that the accompanying drawing is only a schematic diagram of a preferred embodiment, and that the above serial numbers of the embodiments are for description only and do not indicate the relative merits of the embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (7)

Translated from Chinese

1. A deformable single-target tracking method based on dynamic compact memory embedding, characterized in that the method comprises:

the feature affinity matrix generated during target similarity matching represents the correlation between the current target feature and the existing target feature memory; selecting the top K values of each row of the feature affinity matrix and averaging them to obtain the target foreground similarity and the target posterior probability;

according to the obtained feature correlations, merging the high-correlation parts between the existing target feature memory and the current target feature, expanding the parts with medium similarity to the existing target feature memory into the memory, and discarding the low-correlation parts, thereby realizing dynamic adaptive adjustment of the memory embedding;

using a pixel-by-pixel to global-feature association, capturing the target deformation state in the current target feature by aggregating the weighted correlations between query pixels and the reference feature, establishing correspondence between similar target parts, and realizing the extraction of deformable features;

concatenating the target foreground similarity and target posterior probability, the deformable features, and the target spatial localization obtained by an online discriminative correlation filter along the channel dimension, and inputting them into a decoder to obtain a refined target segmentation mask; and obtaining the rectangular bounding box of the target based on the target segmentation mask to realize visual tracking and positioning.

2. The deformable single-target tracking method based on dynamic compact memory embedding according to claim 1, characterized in that obtaining the correlation between the current target feature and the existing memory specifically comprises:

building a target similarity matching model, in which the key is the target feature of the initial frame and the value is the foreground and background segmentation masks of the first frame of the video;

after a new frame It-1 is segmented by the model, integrating the current target query feature Ft-1 and the obtained masks with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information; for the current target feature and the existing target feature memory, first transforming the dimensions of the two and then performing matrix multiplication to obtain their affinity matrix, which expresses the correlation between the two.

3. The deformable single-target tracking method based on dynamic compact memory embedding according to claim 1, characterized in that merging the high-correlation parts between the existing target feature memory and the current target feature, expanding the parts with medium correlation to the existing target feature memory into the memory, and discarding the low-correlation parts, thereby realizing dynamic adaptive adjustment of the memory embedding, specifically comprises:

initializing the memory embedding with the target information of the first frame of the sequence and taking it as the main part of the memory bank; comparing the target query feature with the existing target feature memory to find the similar parts of the two;

for each element of the target query, searching the affinity matrix to obtain its maximally similar part with the target feature memory Mk ∈ R^Thw;

if the maximum correlation between the two exceeds an upper threshold, inserting records with the same key into the same storage slot following the single-mapping principle of hashing, i.e., merging the parts of the current feature that are highly correlated with the existing memory into the existing memory embedding by weighted fusion; extending current target features whose correlation with the existing memory is above the overall average directly into the existing memory; and processing the corresponding parts of the target foreground and background masks in the same way.

4. The deformable single-target tracking method based on dynamic compact memory embedding according to claim 3, characterized in that updating the parts of the current feature that are highly correlated with the existing memory into the existing memory embedding specifically comprises:

merging the high-correlation parts between the existing memory and the current target feature:

Mk(j') = βFt-1(i) + (1-β)Mk(j),

Mvf(j') = βYft-1(i) + (1-β)Mvf(j),

Mvb(j') = βYbt-1(i) + (1-β)Mvb(j),

where Mk(j') is the merged target feature memory; Mvf(j') and Mvb(j') are the merged target foreground and background value memories, respectively; j' is the slot index after merging; β is the fusion weight; Ft-1 is the target query; Yft-1 is the foreground; Ybt-1 is the background; Mk is the target feature memory; Mvf is the target foreground value memory; Mvb is the target background value memory; i is the index of the points of the current feature that have medium similarity to the existing memory; and j is the index of the parts of the existing memory that have medium similarity to the current feature;

extending current target features whose correlation with the existing memory is above the average directly into the existing memory:

M̃k = Union(Mk, Ft-1), M̃vf = Union(Mvf, Yft-1), M̃vb = Union(Mvb, Ybt-1),

where Union(.) denotes the union operation between the current feature and the corresponding memory; M̃k is the union of the existing key memory and the target feature at the previous moment; M̃vf is the union of the existing target foreground value memory and the target foreground mask at the previous moment; and M̃vb is the union of the existing target background value memory and the target background mask at the previous moment.

5. The deformable single-target tracking method based on dynamic compact memory embedding according to claim 3, characterized in that capturing the target deformation state in the current target feature by aggregating the weighted correlations between the pixels of the query feature and the reference feature as a whole, and establishing correspondence between similar target parts, specifically comprises:

constructing the association between each pixel of the query feature F and the entire template, and applying a non-shared attention mechanism to compute the correlation between fi and each pixel rj of the key, obtaining the similarity function;

normalizing eij over all pixels of the template R with a softmax function, and using the normalized similarity weights to aggregate the pixel-wise features of the template, generating the target deformation feature corresponding to fi;

connecting the query feature and the target deformation feature through a residual to obtain a feature containing deformation information, and using the target foreground probability to enhance the confidence of the target deformable feature.

6. A deformable single-target tracking device based on dynamic compact memory embedding, characterized in that the device comprises a processor and a memory, the memory storing program instructions, and the processor calling the program instructions stored in the memory to cause the device to execute the method steps of any one of claims 1-5.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to execute the method steps of any one of claims 1-5.
CN202110736925.2A | 2021-06-30 (priority) | 2021-06-30 (filed) | Deformable Single Target Tracking Method and Device Based on Dynamic Compact Memory Embedding | Active | Granted as CN113705325B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110736925.2A | 2021-06-30 | 2021-06-30 | Deformable Single Target Tracking Method and Device Based on Dynamic Compact Memory Embedding

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110736925.2A | 2021-06-30 | 2021-06-30 | Deformable Single Target Tracking Method and Device Based on Dynamic Compact Memory Embedding

Publications (2)

Publication Number | Publication Date
CN113705325A | true | 2021-11-26
CN113705325B (en) | 2021-06-30 | 2022-12-13

Family

ID=78648216

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110736925.2A | Active | CN113705325B (en) | 2021-06-30 | 2021-06-30 | Deformable Single Target Tracking Method and Device Based on Dynamic Compact Memory Embedding

Country Status (1)

CountryLink
CN (1)CN113705325B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN114299613A (en)* | 2021-12-27 | 2022-04-08 | 厦门美图之家科技有限公司 | Dynamic template updating method and system for human body tracking and camera device
CN115082430A (en)* | 2022-07-20 | 2022-09-20 | Institute of Automation, Chinese Academy of Sciences | Image analysis method and device and electronic equipment
CN116740142A (en)* | 2023-06-21 | 2023-09-12 | Huaqiao University | Trajectory attention target tracking method, device and readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
US20190147602A1 (en)* | 2017-11-13 | 2019-05-16 | Qualcomm Technologies, Inc. | Hybrid and self-aware long-term object tracking
CN111612817A (en)* | 2020-05-07 | 2020-09-01 | Guilin University of Electronic Technology | Target Tracking Method Based on Adaptive Fusion of Deep and Shallow Features and Context Information
CN111833378A (en)* | 2020-06-09 | 2020-10-27 | Tianjin University | A single target tracking method and device for multiple UAVs based on agent sharing network
CN111951297A (en)* | 2020-08-31 | 2020-11-17 | Zhengzhou University of Light Industry | A target tracking method based on structured pixel-by-pixel target attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TING ZHANG et al.: "Design and Implementation of Dairy Food Tracking System Based on RFID", 2020 International Wireless Communications and Mobile Computing *
TANG Yiming et al.: "A Survey of Visual Single-Target Tracking Algorithms" (视觉单目标跟踪算法综述), Measurement & Control Technology (测控技术) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN114299613A (en)* | 2021-12-27 | 2022-04-08 | 厦门美图之家科技有限公司 | Dynamic template updating method and system for human body tracking and camera device
CN115082430A (en)* | 2022-07-20 | 2022-09-20 | Institute of Automation, Chinese Academy of Sciences | Image analysis method and device and electronic equipment
CN115082430B (en)* | 2022-07-20 | 2022-12-06 | Institute of Automation, Chinese Academy of Sciences | Image analysis method and device and electronic equipment
CN116740142A (en)* | 2023-06-21 | 2023-09-12 | Huaqiao University | Trajectory attention target tracking method, device and readable medium

Also Published As

Publication Number | Publication Date
CN113705325B (en) | 2022-12-13

Similar Documents

Publication | Publication Date | Title
Li et al. | When object detection meets knowledge distillation: A survey
CN113705325B (en) | Deformable Single Target Tracking Method and Device Based on Dynamic Compact Memory Embedding
CN112132058B (en) | Head posture estimation method, implementation system thereof and storage medium
Zhang et al. | Convmatch: Rethinking network design for two-view correspondence learning
US10685263B2 | System and method for object labeling
CN105844665B (en) | The video object method for tracing and device
Mo et al. | PVDet: Towards pedestrian and vehicle detection on gigapixel-level images
CN113378936B (en) | Faster RCNN-based few-sample target detection method
CN115620393A (en) | A fine-grained pedestrian behavior recognition method and system for automatic driving
Li et al. | Self-supervised monocular depth estimation with frequency-based recurrent refinement
Fan et al. | Complementary tracking via dual color clustering and spatio-temporal regularized correlation learning
WO2023036157A1 | Self-supervised spatiotemporal representation learning by exploring video continuity
Xiao et al. | Single-scale siamese network based RGB-D object tracking with adaptive bounding boxes
Cores et al. | Spatiotemporal tubelet feature aggregation and object linking for small object detection in videos
Tsintotas et al. | The revisiting problem in simultaneous localization and mapping
Li et al. | Super-resolution-based part collaboration network for vehicle re-identification
Liu et al. | Boosting visual recognition in real-world degradations via unsupervised feature enhancement module with deep channel prior
Li et al. | A twofold convolutional regression tracking network with temporal and spatial mechanism
Pang et al. | Multiple templates transformer for visual object tracking
Sun et al. | Joint spatio-temporal modeling for visual tracking
Zhou et al. | Retrieval and localization with observation constraints
Liu et al. | Pseudo-label growth dictionary pair learning for crowd counting
CN110555406B (en) | Video moving target identification method based on Haar-like characteristics and CNN matching
Hu et al. | CurriculumLoc: Enhancing Cross-Domain Geolocalization Through Multistage Refinement
CN112734800A (en) | Multi-target tracking system and method based on joint detection and characterization extraction

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
