CN116563343A - RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought - Google Patents

RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought
Download PDF

Info

Publication number
CN116563343A
CN116563343A (application CN202310575583.XA)
Authority
CN
China
Prior art keywords
rgb
features
network
tracking
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310575583.XA
Other languages
Chinese (zh)
Inventor
秦玉文
陈建明
豆嘉真
钟丽云
邸江磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202310575583.XA
Publication of CN116563343A
Legal status: Pending


Abstract

Translated from Chinese

The invention belongs to the field of image processing and discloses an RGBT target tracking method based on a Siamese network structure and an anchor-box-adaptive idea, which addresses the difficulty traditional RGBT target tracking methods have in achieving robust tracking under low visibility or poor illumination. The model comprises a feature extraction network based on the Siamese network structure, a fusion network with complementary cross-modal information, and a tracking prediction network based on the anchor-box-adaptive idea. The invention exploits the complementarity and consistency of visible-light and thermal infrared image information to design the Siamese feature extraction network, strengthening the network's ability to represent the target; a complementary cross-modal fusion scheme is designed to improve the tracker's robustness in complex scenes; and the anchor-box-adaptive tracking prediction network gives the tracker greater flexibility. The method can track targets against complex backgrounds with higher accuracy and better efficiency.

Description

Translated from Chinese
An RGBT Target Tracking Method Based on a Siamese Network Structure and an Anchor-Box-Adaptive Idea

Technical Field

The invention belongs to the field of image processing, and in particular relates to an RGBT target tracking method based on a Siamese network structure and an anchor-box-adaptive idea.

Background Art

The RGBT tracking task aims to exploit the complementary advantages of visible-light data and thermal infrared data to achieve visual target tracking in complex environments; its purpose is to determine the position and size of a given target in various scenarios. As a basic and challenging task in computer vision, target tracking is now widely applied in intelligent security, traffic control, medical treatment and diagnosis, human-computer interaction, modern military affairs, and many other practical fields. Although related research and applications have made significant progress, most existing trackers are built on single-modality data, and their robustness and reliability are limited in complex environments; for example, trackers based on visible-light data struggle to achieve robust tracking under low visibility or poor illumination. A large number of RGBT tracking methods have been proposed in recent years to address these problems, but because they cannot effectively mine the target feature information contained in the multimodal data, tracking drift usually occurs.

Summary of the Invention

The purpose of the invention is to overcome the deficiencies of the prior art and provide an RGBT target tracking method based on a Siamese network structure and an anchor-box-adaptive idea. The method can track targets against complex backgrounds with higher accuracy and better efficiency.

The technical solution by which the invention solves the above technical problems is as follows:

An RGBT target tracking method based on a Siamese network structure and an anchor-box-adaptive idea, comprising the following steps:

(S1) Constructing the data sets: data are screened as needed from public RGB data sets and RGBT target tracking data sets to obtain the corresponding pre-training data set and training data set.

(S2) Constructing the network, which comprises a feature extraction network based on the Siamese network structure, a fusion network with complementary cross-modal information, and a tracking prediction network based on the anchor-box-adaptive idea.

(S3) Pre-training the Siamese feature extraction network with the pre-training data set obtained in step (S1), using gradient descent until the loss essentially converges; then fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, using stochastic gradient descent until the loss essentially converges, to obtain the trained network.

(S4) Acquiring the template of the target to be tracked from the visible-light image and the thermal infrared image, computing the search region of subsequent frames, and then tracking the visible-light and infrared video sequences with the trained network to obtain the tracking result.

Preferably, in step (S2), the construction of the feature extraction network based on the Siamese network structure, the fusion network with complementary cross-modal information, and the tracking prediction network based on the anchor-box-adaptive idea comprises the following steps:

(S2-1) Constructing the feature extraction network: the feature extraction network is based on the Siamese network structure and adopts a deep, multi-branch design comprising a feature extraction part and a feature enhancement part; the feature extraction part consists of four modified ResNet-50 branches, and the feature enhancement part contains two attention-based image enhancement modules.

(S2-2) Constructing the fusion network with complementary cross-modal information: a cross-modal feature fusion scheme in which four 1×1 convolution modules fuse the cross-modal features; the fused results then undergo cross-correlation to obtain the response maps used to predict the tracking result.

(S2-3) Constructing the tracking prediction network based on the anchor-box-adaptive idea: this network contains two tracking prediction heads with identical structure. Each head contains three branches: a classification branch that predicts the category of each location in the response map; a regression branch that computes the target bounding box at that location; and a centerness branch that computes a centerness score for each location and rejects outliers. A sketch of such a head is given below.
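
For illustration only, the following is a minimal PyTorch sketch of an anchor-free prediction head with the three branches described in (S2-3); the channel widths, tower depth, and activation choices are assumptions and not the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """One tracking prediction head: classification, box regression, centerness."""
    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        def tower():
            # small conv tower per branch (depth is an assumption)
            return nn.Sequential(
                nn.Conv2d(in_channels, mid_channels, 3, padding=1),
                nn.GroupNorm(32, mid_channels),
                nn.ReLU(inplace=True),
            )
        self.cls_tower = tower()
        self.reg_tower = tower()
        self.cls = nn.Conv2d(mid_channels, 2, 3, padding=1)  # per-location class score
        self.cen = nn.Conv2d(mid_channels, 1, 3, padding=1)  # per-location centerness score
        self.reg = nn.Conv2d(mid_channels, 4, 3, padding=1)  # per-location (l, t, r, b) distances

    def forward(self, response_map):
        c = self.cls_tower(response_map)
        r = self.reg_tower(response_map)
        cls = self.cls(c)
        cen = self.cen(c)
        ltrb = torch.exp(self.reg(r))  # distances must be positive
        return cls, cen, ltrb
```

At inference the centerness output is typically used to down-weight locations far from the object center, which is how the centerness branch "rejects outliers" in the description above.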

Preferably, in step (S3), the pre-training of the Siamese feature extraction network and the fine-tuning of the tracking model comprise the following steps:

(S3-1) The feature extraction network contains four ResNet-50 branches with identical structure. One of them is pre-trained on the pre-training data set built from visible-light data; the input image size of the pre-training model is 127×127, the tracking model is optimized with stochastic gradient descent until convergence, and the trained pre-trained model is saved.

(S3-2) The parameters of the feature extraction part of the feature extraction network are initialized with the pre-trained model saved in step (S3-1), the first two layers of every ResNet-50 branch are frozen, the tracking model is fine-tuned with the training data set obtained in step (S1) at a reduced learning rate, and stochastic gradient descent is used until the loss essentially converges, yielding the trained network; a sketch of this setup follows.
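
By way of illustration of step (S3-2), this sketch shows one way to load a pretrained ResNet-50, freeze its first two stages, and fine-tune the remaining parameters with SGD at a reduced learning rate. The checkpoint path, the interpretation of "first two layers" as `layer1`/`layer2`, and the learning-rate values are assumptions, not values from the invention.

```python
import torch
import torchvision

# hypothetical pretrained weights saved in step (S3-1); the path is an assumption
backbone = torchvision.models.resnet50()
state = torch.load("pretrained_siamese_backbone.pth", map_location="cpu")
backbone.load_state_dict(state, strict=False)

# freeze the first two stages of the backbone (repeat for each of the four branches)
for stage in (backbone.layer1, backbone.layer2):
    for p in stage.parameters():
        p.requires_grad = False

# fine-tune the remaining parameters with SGD at a reduced learning rate (values assumed)
trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9, weight_decay=1e-4)
```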

Preferably, in step (S4), acquiring the template of the target to be tracked from the visible-light image and the thermal infrared image and performing tracking comprises the following steps:

(S4-1) The target template is acquired at the start of tracking: the template is the target in the initial frame of the video sequence, and subsequent frames are candidate frames.

(S4-2) The inputs to the model are image patches cropped from the first frame and from the frame to be detected of the two modal video sequences: an RGB template image, an RGB candidate image, a thermal infrared template image, and a thermal infrared candidate image. The input sizes of the template images and candidate images are unified to 127×127 pixels and 255×255 pixels, respectively.

(S4-3) The RGB template image, RGB candidate image, thermal infrared template image, and thermal infrared candidate image obtained in steps (S4-1) and (S4-2) are passed through the feature extraction part and the feature enhancement part of the feature extraction network, yielding RGB template features, RGB candidate features, thermal infrared template features, thermal infrared candidate features, and the corresponding RGB template enhanced features, RGB candidate enhanced features, thermal infrared template enhanced features, and thermal infrared candidate enhanced features.

(S4-4) On the basis of step (S4-3), the fusion network with complementary cross-modal information fuses the RGB template features with the thermal infrared template enhanced features, the RGB candidate features with the thermal infrared candidate enhanced features, the thermal infrared template features with the RGB template enhanced features, and the thermal infrared candidate features with the RGB candidate enhanced features, yielding, after complementary cross-modal enhancement, new RGB template features, RGB candidate features, thermal infrared template features, and thermal infrared candidate features.

(S4-5) The newly generated RGB template features, RGB candidate features, thermal infrared template features, and thermal infrared candidate features are cross-correlated in pairs to obtain the response maps used for tracking prediction.

(S4-6) Finally, the response maps are fed into the tracking prediction network based on the anchor-box-adaptive idea, and the position is predicted by generating a 6D vector t=(cls, cen, l, t, r, b), where cls is the classification score, cen is the centerness score, and l+r and t+b are the predicted width and height of the target in the current frame; a decoding sketch is given below.
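
To make the decoding of the 6D vector concrete, here is a small sketch under the assumption that each response-map location has already been mapped back to search-image coordinates; only the relations w = l + r and h = t + b come from the text above.

```python
def decode_box(l, t, r, b, px, py):
    """Decode one location's regression output into (cx, cy, w, h).

    (px, py): the location (i, j) mapped back to search-image coordinates.
    (l, t, r, b): predicted distances to the left/top/right/bottom box edges.
    """
    w = l + r                  # predicted width, as stated in the text
    h = t + b                  # predicted height, as stated in the text
    cx = px + (r - l) / 2.0    # the location need not be the exact box center
    cy = py + (b - t) / 2.0
    return cx, cy, w, h

# example: a location at (120, 130) in the search image
print(decode_box(20.0, 15.0, 30.0, 25.0, 120.0, 130.0))  # -> (125.0, 135.0, 50.0, 40.0)
```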

Preferably, in step (S2-1), the feature extraction network is designed in combination with a feature pyramid structure, and the ResNet-50 network is modified as necessary: the downsampling operations in the last two convolution blocks (Conv4 and Conv5) of the original ResNet-50 are removed to provide more detailed spatial information for the tracker's prediction, and dilated convolutions with dilation rates of 2 and 4 replace the convolution kernels in Conv4 and Conv5 to enlarge the receptive field. Finally, 1×1 convolutions reduce the number of channels of the output feature maps of the last three convolution modules of ResNet-50 to 256, and these features are aggregated along the channel dimension, as sketched below.
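
A minimal PyTorch sketch of this kind of backbone modification, using torchvision's ResNet-50 with `replace_stride_with_dilation` to drop the downsampling of the last two stages and dilate their convolutions (rates 2 and 4), then reducing the last three stages' outputs to 256 channels with 1×1 convolutions and concatenating them along the channel dimension; the exact layer surgery in the invention may differ.

```python
import torch
import torch.nn as nn
import torchvision

class ModifiedResNet50(nn.Module):
    """ResNet-50 with the last two stages dilated instead of strided; the last three
    stage outputs are reduced to 256 channels and concatenated along the channels."""
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet50(
            replace_stride_with_dilation=[False, True, True]  # dilation 2 in layer3, 4 in layer4
        )
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        # 1x1 convolutions reducing the C3/C4/C5 outputs (512/1024/2048 channels) to 256 each
        self.reduce = nn.ModuleList([nn.Conv2d(c, 256, kernel_size=1) for c in (512, 1024, 2048)])

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        feats = [r(c) for r, c in zip(self.reduce, (c3, c4, c5))]
        return torch.cat(feats, dim=1)  # aggregate along the channel dimension

if __name__ == "__main__":
    out = ModifiedResNet50()(torch.randn(1, 3, 255, 255))
    print(out.shape)  # the three maps share one spatial size because dilation replaces stride
```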

Preferably, in step (S2-1), the feature enhancement part contains a template image feature enhancement module based on channel attention and a candidate image feature enhancement module based on channel-spatial attention.

Preferably, in step (S2-2), the depth-wise cross-correlation operation can be defined as:

M_rgb = X_rgb^n ★ Z_rgb^n (1)

M_t = X_t^n ★ Z_t^n (2)

where ★ denotes the channel-by-channel cross-correlation operation; Z_rgb^n, X_rgb^n, Z_t^n and X_t^n denote the newly generated RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features; and M_rgb and M_t denote the generated visible-light response map and thermal infrared response map, respectively.
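Formulas (1) and (2) are depth-wise (channel-by-channel) cross-correlations between candidate features X and template features Z. Below is a minimal PyTorch sketch using grouped convolution, which is one common way to implement this operation and is only an assumption about the invention's implementation.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(x, z):
    """Channel-by-channel cross-correlation (the ★ operation).

    x: candidate features, shape (B, C, Hx, Wx)
    z: template features,  shape (B, C, Hz, Wz), used as per-channel kernels
    returns a response map of shape (B, C, Hx-Hz+1, Wx-Wz+1)
    """
    b, c, hx, wx = x.shape
    x = x.reshape(1, b * c, hx, wx)             # fold the batch into the channel axis
    kernel = z.reshape(b * c, 1, *z.shape[2:])  # one kernel per channel
    out = F.conv2d(x, kernel, groups=b * c)     # grouped conv = per-channel correlation
    return out.reshape(b, c, *out.shape[2:])

# example with a 7x7 template map and a 31x31 candidate map (sizes are assumptions)
m_rgb = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 7, 7))
print(m_rgb.shape)  # torch.Size([2, 256, 25, 25])
```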

Preferably, in step (S3-1), the total loss function for training can be expressed as:

L_total = L_cls^rgb + L_cls^t + λ1·L_cen + λ2·L_reg (3)

where λ1 and λ2 are hyperparameters used to balance the centerness loss and the regression loss; L_cls^rgb and L_cls^t denote the cross-entropy loss functions used for classification in the visible-light and thermal infrared modalities, respectively; L_reg denotes the regression loss, used to predict the position of the predicted box; and L_cen denotes the centerness loss. The regression loss (4) is an IoU-based loss in which L_iou is the intersection-over-union between the ground truth and the predicted box, computable from g(i,j), the distances from the point (i,j) to the four sides of the ground-truth box. The centerness loss (5) of the centerness branch is expressed in terms of cen(i,j), the centerness score of the location (i,j).
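The following sketch illustrates an IoU-based regression loss and a centerness loss of the kind referred to by formulas (4) and (5); since those formulas are not reproduced above, the exact forms used here (a 1−IoU regression loss and an FCOS-style centerness target trained with binary cross-entropy) are assumptions.

```python
import torch

def iou_loss(pred_ltrb, gt_ltrb, eps=1e-6):
    """pred_ltrb, gt_ltrb: (N, 4) distances (l, t, r, b) from each positive location
    to the four sides of the predicted / ground-truth box."""
    pl, pt, pr, pb = pred_ltrb.unbind(dim=1)
    gl, gt_, gr, gb = gt_ltrb.unbind(dim=1)
    pred_area = (pl + pr) * (pt + pb)
    gt_area = (gl + gr) * (gt_ + gb)
    inter_w = (torch.min(pl, gl) + torch.min(pr, gr)).clamp(min=0)
    inter_h = (torch.min(pt, gt_) + torch.min(pb, gb)).clamp(min=0)
    inter = inter_w * inter_h
    iou = inter / (pred_area + gt_area - inter + eps)
    return (1.0 - iou).mean()  # L_reg averaged over the positive locations

def centerness_target(gt_ltrb):
    """FCOS-style centerness score cen(i, j) for each positive location (assumed form)."""
    l, t, r, b = gt_ltrb.unbind(dim=1)
    return torch.sqrt((torch.min(l, r) / torch.max(l, r)) *
                      (torch.min(t, b) / torch.max(t, b)))

def centerness_loss(pred_cen_logits, gt_ltrb):
    target = centerness_target(gt_ltrb)
    return torch.nn.functional.binary_cross_entropy_with_logits(pred_cen_logits, target)
```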

Preferably, during the tracking in step (S4-6), an aspect-ratio and scale penalty (6) is added to the classification score map to suppress large deformations of the predicted target, where k is a hyperparameter, r is the predicted aspect ratio, r′ is the aspect ratio in the last frame, s is the predicted target scale, and s′ is the target scale in the last frame. At the same time, a cosine window H is used to suppress large displacements. The final cls can then be expressed as:

cls = (1 − α)·cls·penalty + α·H (7)

where α is a hyperparameter. After the cls_rgb and cls_t of the two prediction heads have been obtained through the above steps, the last step is to obtain the index of the position of the highest peak through the peak-adaptive selection module:

P = argmax(cls_rgb, cls_t) (8)

where argmax denotes the operation that compares the peak scores of the two input arrays and returns the position index of the larger peak. A sketch of this post-processing is given below.
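A small NumPy sketch of this post-processing. The penalty of formula (6) is not reproduced above, so the SiamRPN-style form exp(−(max(r/r′, r′/r)·max(s/s′, s′/s) − 1)·k) is assumed here; the cosine window and peak selection follow formulas (7) and (8), and the hyperparameter values are illustrative.

```python
import numpy as np

def change(x):
    return np.maximum(x, 1.0 / x)

def penalized_score(cls, wh_pred, wh_last, k=0.04, alpha=0.44):
    """cls: (H, W) classification scores; wh_pred: (H, W, 2) predicted (w, h) per location;
    wh_last: (w, h) of the target in the last frame. k and alpha are assumed hyperparameters."""
    r = wh_pred[..., 0] / wh_pred[..., 1]            # predicted aspect ratio r
    r_last = wh_last[0] / wh_last[1]                 # last-frame aspect ratio r'
    s = np.sqrt(wh_pred[..., 0] * wh_pred[..., 1])   # predicted scale s
    s_last = np.sqrt(wh_last[0] * wh_last[1])        # last-frame scale s'
    penalty = np.exp(-(change(r / r_last) * change(s / s_last) - 1.0) * k)
    h, w = cls.shape
    window = np.outer(np.hanning(h), np.hanning(w))  # cosine window H
    return (1.0 - alpha) * cls * penalty + alpha * window  # formula (7)

def select_peak(cls_rgb, cls_t):
    """Formula (8): return the flattened index of the larger of the two peak scores."""
    if cls_rgb.max() >= cls_t.max():
        return "rgb", int(np.argmax(cls_rgb))
    return "t", int(np.argmax(cls_t))
```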

Preferably, the model weights with the minimum loss during the training in step (S3) are selected, and the accurate position of the target in the current frame is output.

Compared with the prior art, the invention has the following beneficial effects:

1. The RGBT target tracking method based on a Siamese network structure and an anchor-box-adaptive idea introduces the Siamese network structure into RGBT target tracking and uses it to build a deep, multi-branch feature extraction network, so as to fully mine the semantic information of visible-light and thermal infrared images. At the same time, a template image enhancement module and a candidate image enhancement module are designed to strengthen the network's representation of the target's shallow information and improve tracking accuracy.

2. The method also designs a complementary cross-modal information fusion scheme in the tracking model, which enhances the tracker's robustness in complex scenes such as low visibility, interference from similar objects, or occlusion.

3. The method designs the tracking prediction network based on the anchor-box-adaptive idea, which offers greater flexibility for targets undergoing large scale changes or deformation while reducing the tracker's computational complexity, thereby improving the tracking efficiency of the model.

Brief Description of the Drawings

Fig. 1 is a flowchart of the RGBT target tracking method based on a Siamese network structure and an anchor-box-adaptive idea of the present invention.

Fig. 2 is a schematic diagram of the image enhancement modules in the feature extraction network of the method, comprising the template image feature enhancement module and the candidate image feature enhancement module.

Fig. 3 is a schematic diagram of the tracking prediction network based on the anchor-box-adaptive idea in the method.

Detailed Description

The invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the invention are not limited thereto.

Referring to Fig. 1, the RGBT target tracking method based on a Siamese network structure and an anchor-box-adaptive idea of the present invention comprises the following steps:

(S1) Constructing the data sets: data are screened as needed from public RGB data sets and RGBT target tracking data sets to obtain the corresponding pre-training data set and training data set.

(S2) Constructing the network, which comprises a feature extraction network based on the Siamese network structure, a fusion network with complementary cross-modal information, and a tracking prediction network based on the anchor-box-adaptive idea.

(S3) Pre-training the Siamese feature extraction network with the pre-training data set obtained in step (S1), using gradient descent until the loss essentially converges; then fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, using stochastic gradient descent until the loss essentially converges, to obtain the trained network.

(S4) Acquiring the template of the target to be tracked from the visible-light image and the thermal infrared image, computing the search region of subsequent frames, and then tracking the visible-light and infrared video sequences with the trained network to obtain the tracking result.

Referring to Fig. 1, in step (S2), the construction of the feature extraction network based on the Siamese network structure, the fusion network with complementary cross-modal information, and the tracking prediction network based on the anchor-box-adaptive idea comprises the following steps:

(S2-1) Constructing the feature extraction network: the feature extraction network is based on the Siamese network structure and adopts a deep, multi-branch design comprising a feature extraction part and a feature enhancement part; the feature extraction part consists of four modified ResNet-50 branches, and the feature enhancement part contains two attention-based image enhancement modules.

(S2-2) The fusion network with complementary cross-modal information: a cross-modal feature fusion scheme in which four 1×1 convolution modules fuse the cross-modal features; the fused results then undergo cross-correlation to obtain the response maps used to predict the tracking result.

(S2-3) The tracking prediction network based on the anchor-box-adaptive idea: this network contains two tracking prediction heads with identical structure. Each head contains three branches: a classification branch that predicts the category of each location in the response map; a regression branch that computes the target bounding box at that location; and a centerness branch that computes a centerness score for each location and rejects outliers.

Referring to Fig. 1, in step (S3), the pre-training of the Siamese feature extraction network and the fine-tuning of the tracking model comprise the following steps:

(S3-1) The feature extraction network contains four ResNet-50 branches with identical structure. One of them is pre-trained on the pre-training data set built from visible-light data; the input image size of the pre-training model is 127×127, the tracking model is optimized with stochastic gradient descent until convergence, and the trained pre-trained model is saved.

(S3-2) The parameters of the feature extraction part of the feature extraction network are initialized with the pre-trained model saved in step (S3-1), the first two layers of every ResNet-50 branch are frozen, the tracking model is fine-tuned with the training data set obtained in step (S1) at a reduced learning rate, and stochastic gradient descent is used until the loss essentially converges, yielding the trained network.

Referring to Fig. 1, in step (S4), acquiring the template of the target to be tracked from the visible-light image and the thermal infrared image and performing tracking comprises the following steps:

(S4-1) The target template is acquired at the start of tracking: the template is the target in the initial frame of the video sequence, and subsequent frames are candidate frames.

(S4-2) The inputs to the model are image patches cropped from the first frame and from the frame to be detected of the two modal video sequences: an RGB template image, an RGB candidate image, a thermal infrared template image, and a thermal infrared candidate image. The input sizes of the template images and candidate images are unified to 127×127 pixels and 255×255 pixels, respectively.

(S4-3) The RGB template image, RGB candidate image, thermal infrared template image, and thermal infrared candidate image obtained in steps (S4-1) and (S4-2) are passed through the feature extraction part and the feature enhancement part of the feature extraction network, yielding RGB template features, RGB candidate features, thermal infrared template features, thermal infrared candidate features, and the corresponding RGB template enhanced features, RGB candidate enhanced features, thermal infrared template enhanced features, and thermal infrared candidate enhanced features.

(S4-4) On the basis of step (S4-3), the fusion network with complementary cross-modal information fuses the RGB template features with the thermal infrared template enhanced features, the RGB candidate features with the thermal infrared candidate enhanced features, the thermal infrared template features with the RGB template enhanced features, and the thermal infrared candidate features with the RGB candidate enhanced features, yielding, after complementary cross-modal enhancement, new RGB template features, RGB candidate features, thermal infrared template features, and thermal infrared candidate features.

(S4-5) The newly generated RGB template features, RGB candidate features, thermal infrared template features, and thermal infrared candidate features are cross-correlated in pairs to obtain the response maps used for tracking prediction.

(S4-6) Finally, the response maps are fed into the tracking prediction network based on the anchor-box-adaptive idea, which predicts the target position in each of the two modalities by generating a 6D vector t=(cls, cen, l, t, r, b), where cls is the classification score, cen is the centerness score, and l+r and t+b are the predicted width and height of the target in the current frame.

A Siamese ResNet-50 backbone that only extracts the feature map outputs of the last three convolution blocks does not make full use of the low-level target information captured by the network, such as color and texture, so the algorithm struggles to achieve robust tracking in scenarios involving occlusion, scale change, or interference from similar objects.

Therefore, referring to Fig. 2, this embodiment designs the template image feature enhancement module based on an attention mechanism. The module, built on channel attention, first concatenates the template feature maps from the RGB modality and the thermal infrared modality along the channel dimension to obtain a joint feature; the joint feature then passes through a global average pooling operation and convolution operations for dimensionality reduction to generate a weight matrix. Finally, the weights are applied channel by channel to the previous features through multiplication, completing the recalibration of the original features in the channel dimension.
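
Below is a minimal PyTorch sketch of such a channel-attention template enhancement module (channel-wise concatenation, global average pooling plus convolutions to produce weights, then channel-wise re-weighting). The layer widths, reduction ratio, and the split back into two enhanced template features are assumptions.

```python
import torch
import torch.nn as nn

class TemplateChannelAttention(nn.Module):
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        joint = 2 * channels                        # RGB + thermal concatenated
        self.fc = nn.Sequential(                    # squeeze-and-excitation style weights
            nn.Conv2d(joint, joint // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(joint // reduction, joint, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, z_rgb, z_t):
        joint = torch.cat([z_rgb, z_t], dim=1)            # channel-wise concatenation
        w = self.fc(joint.mean(dim=(2, 3), keepdim=True))  # global average pooling + convs
        joint = joint * w                                   # channel-wise re-weighting
        c = z_rgb.shape[1]
        return joint[:, :c], joint[:, c:]                   # enhanced RGB / thermal template features

# example
m = TemplateChannelAttention()
e_rgb, e_t = m(torch.randn(1, 256, 7, 7), torch.randn(1, 256, 7, 7))
print(e_rgb.shape, e_t.shape)
```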

Referring to Fig. 2, this embodiment designs the candidate image feature enhancement module with a combined spatial-channel attention mechanism. In this module, the candidate feature maps from the two modalities are first concatenated along the channel dimension to obtain a joint feature; the joint feature then passes through a channel attention module to generate a weight matrix, and channel-wise multiplication models the joint feature in the channel dimension; a spatial attention module then completes the spatial modeling of the target feature and generates an enhanced joint feature map; finally, the enhanced joint feature map is split along the channel dimension to generate the enhanced RGB candidate features and the enhanced thermal infrared candidate features.
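
Similarly, a minimal sketch of a channel-then-spatial attention module for the candidate features; the CBAM-style composition (average/max pooled spatial descriptor followed by a 7×7 convolution) is an assumption, not the invention's exact module.

```python
import torch
import torch.nn as nn

class CandidateChannelSpatialAttention(nn.Module):
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        joint = 2 * channels
        self.channel_att = nn.Sequential(
            nn.Conv2d(joint, joint // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(joint // reduction, joint, 1), nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(            # 2-channel (avg, max) -> 1-channel mask
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x_rgb, x_t):
        joint = torch.cat([x_rgb, x_t], dim=1)                                 # channel-wise concat
        joint = joint * self.channel_att(joint.mean(dim=(2, 3), keepdim=True))  # channel modeling
        avg = joint.mean(dim=1, keepdim=True)
        mx, _ = joint.max(dim=1, keepdim=True)
        joint = joint * self.spatial_att(torch.cat([avg, mx], dim=1))           # spatial modeling
        c = x_rgb.shape[1]
        return joint[:, :c], joint[:, c:]  # enhanced RGB / thermal candidate features
```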

With the above design, the ResNet-50 network and the feature pyramid structure fully exploit the deep, multi-scale features of the target, while the attention-based image enhancement modules further strengthen the network's representation of the target's shallow features, which improves the robustness of the algorithm when tracking in complex scenes.

In addition, the total loss function for training can be expressed as formula (3) above:

L_total = L_cls^rgb + L_cls^t + λ1·L_cen + λ2·L_reg (3)

where λ1 and λ2 are hyperparameters used to balance the centerness loss and the regression loss; L_cls^rgb and L_cls^t denote the cross-entropy loss functions used for classification in the visible-light and thermal infrared modalities, respectively; L_reg denotes the regression loss, used to predict the position of the predicted box; and L_cen denotes the centerness loss. The regression loss (4) is an IoU-based loss in which L_iou is the intersection-over-union between the ground truth and the predicted box, computable from g(i,j), the distances from the point (i,j) to the four sides of the ground-truth box. The centerness loss (5) of the centerness branch is expressed in terms of cen(i,j), the centerness score of the location (i,j).

In addition, an aspect-ratio and scale penalty (6) is added to the classification score map to suppress large deformations of the predicted target, where k is a hyperparameter, r is the predicted aspect ratio, r′ is the aspect ratio in the last frame, s is the predicted target scale, and s′ is the target scale in the last frame. At the same time, a cosine window H is used to suppress large displacements. The final cls can then be expressed as:

cls = (1 − α)·cls·penalty + α·H (7)

where α is a hyperparameter. After the cls_rgb and cls_t of the two prediction heads have been obtained through the above steps, the last step is to obtain the index of the position of the highest peak through the peak-adaptive selection module:

P = argmax(cls_rgb, cls_t) (8)

where argmax denotes the operation that compares the peak scores of the two input arrays and returns the position index of the larger peak.

Referring to Fig. 3, the RGBT target tracking method of the invention is described below with a concrete case:

First, the input RGB template image, RGB candidate image, thermal infrared template image, and thermal infrared candidate image are passed through the feature extraction part and the feature enhancement part of the feature extraction network, yielding the RGB template features, RGB candidate features, thermal infrared template features, thermal infrared candidate features, and the corresponding RGB template enhanced features, RGB candidate enhanced features, thermal infrared template enhanced features, and thermal infrared candidate enhanced features. Then the features from the different modalities are concatenated along the channel dimension and reduced in dimensionality to realize the cross-modal feature fusion, and the fused results undergo cross-correlation to obtain the response maps used to predict the tracking results of the RGB modality and the thermal infrared modality. Finally, the two modalities' response maps are each fed into the tracking prediction network based on the anchor-box-adaptive idea to predict the category and locate the position of the target in the two modalities, and the designed peak-adaptive selection module recalibrates the results produced by the two prediction heads to generate the best tracking result.
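
As a sketch of the fusion step just described, the following assumes each of the four fusion modules is a channel-wise concatenation followed by a 1×1 convolution that restores the channel count, wired according to the pairing of step (S4-4); the names and channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class CrossModalFuse(nn.Module):
    """Concatenate one modality's feature with the other modality's enhanced feature,
    then reduce the channel count back with a 1x1 convolution."""
    def __init__(self, channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat, other_enhanced):
        return self.reduce(torch.cat([feat, other_enhanced], dim=1))

# four fusion modules, one per template/candidate branch of each modality
fuse_z_rgb, fuse_x_rgb, fuse_z_t, fuse_x_t = (CrossModalFuse() for _ in range(4))

def fuse_all(z_rgb, x_rgb, z_t, x_t, z_rgb_e, x_rgb_e, z_t_e, x_t_e):
    """Pairing as in step (S4-4): each modality is complemented by the other's enhanced features."""
    return (fuse_z_rgb(z_rgb, z_t_e),   # new RGB template feature
            fuse_x_rgb(x_rgb, x_t_e),   # new RGB candidate feature
            fuse_z_t(z_t, z_rgb_e),     # new thermal template feature
            fuse_x_t(x_t, x_rgb_e))     # new thermal candidate feature
```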

The above is a preferred embodiment of the invention, but the embodiments of the invention are not limited by the above content; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention shall be an equivalent replacement and is included within the protection scope of the invention.

Claims (10)

5. The RGBT target tracking method based on the twin network structure and anchor frame adaptation according to claim 2, characterized in that in step (S2-1), the feature extraction network is designed in combination with the feature pyramid structure and the ResNet-50 network is improved as necessary: the downsampling operations in the last two convolution blocks (Conv4 and Conv5) of the original ResNet-50 are removed to provide more detailed spatial detail for the tracker's predictions, and dilated convolutions with dilation rates of 2 and 4 replace the convolution kernels in Conv4 and Conv5 to enlarge the receptive field; finally, the number of channels of the output feature maps of the last three convolution modules of ResNet-50 is reduced to 256 by 1×1 convolution, and the features are aggregated along the channel dimension.
CN202310575583.XA | 2023-05-22 | 2023-05-22 | RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought | Pending | CN116563343A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310575583.XA | 2023-05-22 | 2023-05-22 | RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310575583.XA | 2023-05-22 | 2023-05-22 | RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Publications (1)

Publication Number | Publication Date
CN116563343A | 2023-08-08

Family

ID=87491347

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310575583.XA | RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought | 2023-05-22 | 2023-05-22 | Pending | CN116563343A (en)

Country Status (1)

Country | Link
CN (1) | CN116563343A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116912649A (en)* | 2023-09-14 | 2023-10-20 | 武汉大学 | Infrared and visible light image fusion method and system based on relevant attention guidance
CN119131371A (en)* | 2024-11-14 | 2024-12-13 | 浙江君同智能科技有限责任公司 | A method and device for fusion of unmanned aerial vehicle search and rescue images based on feature matching
CN119991760A (en)* | 2025-04-17 | 2025-05-13 | 成都浩孚科技有限公司 | A single target tracking method suitable for the terminal side
CN120318644A (en)* | 2025-06-18 | 2025-07-15 | 四川轻化工大学 | RGB-T target tracking method based on multi-feature adaptive fusion and primary and auxiliary dynamic restoration

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114219824A (en)* | 2021-12-17 | 2022-03-22 | 南京理工大学 | Visible light-infrared target tracking method and system based on deep network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114219824A (en)* | 2021-12-17 | 2022-03-22 | 南京理工大学 | Visible light-infrared target tracking method and system based on deep network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DONGYAN GUO ET AL.: "SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5 August 2020 (2020-08-05), pages 1-9 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116912649A (en)* | 2023-09-14 | 2023-10-20 | 武汉大学 | Infrared and visible light image fusion method and system based on relevant attention guidance
CN116912649B (en)* | 2023-09-14 | 2023-11-28 | 武汉大学 | Infrared and visible light image fusion method and system based on relevant attention guidance
CN119131371A (en)* | 2024-11-14 | 2024-12-13 | 浙江君同智能科技有限责任公司 | A method and device for fusion of unmanned aerial vehicle search and rescue images based on feature matching
CN119131371B (en)* | 2024-11-14 | 2025-05-02 | 浙江君同智能科技有限责任公司 | Unmanned aerial vehicle search and rescue image fusion method and device based on feature matching
CN119991760A (en)* | 2025-04-17 | 2025-05-13 | 成都浩孚科技有限公司 | A single target tracking method suitable for the terminal side
CN120318644A (en)* | 2025-06-18 | 2025-07-15 | 四川轻化工大学 | RGB-T target tracking method based on multi-feature adaptive fusion and primary and auxiliary dynamic restoration
CN120318644B (en)* | 2025-06-18 | 2025-08-08 | 四川轻化工大学 | RGB-T target tracking method based on multi-feature self-adaptive fusion and main and auxiliary dynamic recovery

Similar Documents

Publication | Title
CN113393474B (en) | A classification and segmentation method of 3D point cloud based on feature fusion
CN111797716B (en) | A Single Target Tracking Method Based on Siamese Network
CN111626128B (en) | A Pedestrian Detection Method Based on Improved YOLOv3 in Orchard Environment
CN116563343A (en) | RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought
CN111325976B (en) | A method and system for short-term traffic flow prediction
CN110322446A (en) | A kind of domain adaptive semantic dividing method based on similarity space alignment
CN110458844A (en) | A Semantic Segmentation Method for Low Light Scenes
CN110473231B (en) | A target tracking method using twin fully convolutional networks with a predictive learning update strategy
CN112132856A (en) | A Siamese Network Tracking Method Based on Adaptive Template Update
CN115187786A (en) | A Rotation-Based Object Detection Method for CenterNet2
CN115170605A (en) | Real-time RGBT target tracking method based on multi-modal interaction and multi-stage optimization
CN113628244A (en) | Target tracking method, system, terminal and medium based on label-free video training
CN110110689A (en) | A kind of pedestrian's recognition methods again
CN116343334A (en) | Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture
CN116597240B (en) | Autoregressive generation type point cloud converter pre-training method
CN116486203B (en) | Single-target tracking method based on twin network and online template updating
CN115546252A (en) | A target tracking method and system based on cross-correlation matching enhanced Siamese network
CN117576149A (en) | Single-target tracking method based on attention mechanism
CN119648749B (en) | Target tracking method and system based on space channel summation attention
CN115761393B (en) | An anchor-free target tracking method based on template online learning
Weilharter et al. | Atlas-mvsnet: Attention layers for feature extraction and cost volume regularization in multi-view stereo
Xiang et al. | Double-branch fusion network with a parallel attention selection mechanism for camouflaged object detection
CN119068016A (en) | A RGBT target tracking method based on modality-aware feature learning
CN114036969A (en) | 3D human body action recognition algorithm under multi-view condition
CN114863485A (en) | Cross-domain pedestrian re-identification method and system based on deep mutual learning

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
