





Technical Field
The present invention relates to the technical field of image tracking, and more particularly to an attention-enhanced method and device for tracking targets in UAV aerial video.
Background Art
The main task of target tracking is, given the target in its first frame, to continuously locate that target in every subsequent frame. When a UAV tracks a target, the video image is transmitted over the data-link system to the ground station for display, and the operator steers the stabilized UAV platform and camera system with a joystick and other commands to search for reconnaissance targets. When a target of interest appears on the screen, the operator selects it, and the ground computer extracts a series of features of that target and stores them as a template. By computing the similarity between the template image and each subsequent image, the computer determines the position of the target of interest in the subsequent images, achieving continuous tracking of the target.
Compared with a ground platform, a target seen from a UAV is relatively small and covers few pixels, and camera shake and changes in flight speed during flight readily cause motion blur and deformation of the target. All of this places higher demands on the tracking algorithm.
Traditional tracking algorithms are fast, but they represent the target with hand-crafted features whose representational power is limited, so their tracking accuracy is hard to improve. In pursuit of accuracy, most Siamese tracking algorithms resort to a series of cumbersome and complicated operations and neglect tracking speed; an algorithm that cannot track in real time is difficult to deploy on a UAV platform.
Summary of the Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, a first aspect of the present invention proposes an attention-enhanced UAV aerial target tracking method, comprising:
acquiring a UAV aerial video;
feeding the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a Siamese tracking network built on an IResNet backbone, and outputting three groups of template-branch feature maps and search-branch feature maps from the three convolutional blocks layer2, layer3 and layer4 of the IResNet network, wherein the IResNet network builds the template branch and the search branch of the Siamese tracking network on the basis of ResNet50, adds hard downsampling to the projection mapping of the last bottleneck of layer1 in both branches, sums the hard downsampling with the soft downsampling of the residual module whose channel number is 256 and whose convolution kernel is 1×1, and then appends a simple attention module after every bottleneck of the IResNet network, the hard downsampling consisting of a 3×3 max-pooling layer with stride 2 followed in series by a 1×1 convolutional layer with stride 1;
performing weighted fusion of the three groups of template-branch feature maps and search-branch feature maps using the region proposal networks arranged on the convolutional blocks of different depths, layer2, layer3 and layer4, of the template branch and the search branch, and tracking the target in the current frame with the result of the weighted fusion.
Further, performing weighted fusion of the three groups of template-branch feature maps and search-branch feature maps using the region proposal networks arranged on the convolutional blocks of different depths, layer2, layer3 and layer4, of the template branch and the search branch, and tracking the target in the current frame with the result of the weighted fusion, comprises:
inputting the template-branch feature maps and the search-branch feature maps into the region proposal networks between the convolutional blocks of different depths at layer2, layer3 and layer4 of the template branch and the search branch;

performing multi-scale estimation of the target on the search-branch feature maps using k predefined anchor boxes of different scales and sizes;

using each template-branch feature map as a convolution kernel and convolving it with the corresponding search-branch feature map to obtain the regression response maps and classification response maps of the template-branch and search-branch feature maps;

performing a depthwise cross-correlation on the two regression response maps output for the template-branch and search-branch feature maps to obtain the regression result;

performing a depthwise cross-correlation on the two classification response maps output for the template-branch and search-branch feature maps to obtain the classification result;

taking the location of the maximum target score in the classification result as the predicted position of the target;

reading the predicted bounding box corresponding to the predicted position from the regression result and outputting it as the predicted box of the target (see the sketch after this list).
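For concreteness, the read-out described in the last two steps can be sketched as follows. This is a minimal PyTorch illustration under assumed tensor shapes and channel layouts, not the patent's code; the class-major/anchor-major ordering is an assumption.

```python
import torch

def read_out(S: torch.Tensor, B: torch.Tensor):
    """Pick the predicted position and box from the classification/regression results.

    S: classification result, shape (2k, H, W): foreground/background score per anchor.
    B: regression result, shape (4k, H, W): box offsets (dx, dy, dw, dh) per anchor.
    The channel layout is an assumption: scores class-major, offsets anchor-major.
    """
    k = S.shape[0] // 2
    scores = S.view(2, k, S.shape[1], S.shape[2])[1]          # "target" scores per anchor
    flat = int(scores.argmax())                               # maximum over anchors and positions
    a, rem = divmod(flat, scores.shape[1] * scores.shape[2])  # winning anchor index
    y, x = divmod(rem, scores.shape[2])                       # winning spatial position
    box = B.view(k, 4, B.shape[1], B.shape[2])[a, :, y, x]    # its predicted box offsets
    return (y, x), box

# Example with k = 5 anchors on a 17x17 response grid:
pos, box = read_out(torch.randn(10, 17, 17), torch.randn(20, 17, 17))
```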
Another aspect of the present invention provides an attention-enhanced UAV aerial target tracking device, comprising:

an acquisition module, configured to acquire a UAV aerial video;

a processing module, configured to feed the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a Siamese tracking network built on an IResNet backbone and to output three groups of template-branch feature maps and search-branch feature maps from the three convolutional blocks layer2, layer3 and layer4 of the IResNet network, wherein the IResNet network builds the template branch and the search branch of the Siamese tracking network on the basis of ResNet50, adds hard downsampling to the projection mapping of the last bottleneck of layer1 in both branches, sums the hard downsampling with the soft downsampling of the residual module whose channel number is 256 and whose convolution kernel is 1×1, and then appends a simple attention module after every bottleneck of the IResNet network, the hard downsampling consisting of a 3×3 max-pooling layer with stride 2 followed in series by a 1×1 convolutional layer with stride 1;

an output module, configured to perform weighted fusion of the three groups of template-branch feature maps and search-branch feature maps using the region proposal networks arranged on the convolutional blocks of different depths, layer2, layer3 and layer4, of the template branch and the search branch, and to track the target in the current frame with the result of the weighted fusion.
Another aspect of the present invention provides an electronic device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the attention-enhanced UAV aerial target tracking method according to any one of the first aspect.
Another aspect of the present invention provides a computer-readable storage medium storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the attention-enhanced UAV aerial target tracking method according to any one of the first aspect.
Embodiments of the present invention provide an attention-enhanced UAV aerial target tracking method and device. Compared with the prior art, the beneficial effects are as follows:
1) The IResNet deep feature-extraction network is designed by fusing two downsampling schemes: soft downsampling preserves the background information around the target, which helps localization, while hard downsampling uses max pooling to keep the strongest activations, which helps classification; together they strengthen the representation of the target.

2) To filter the features of the tracked target out of the deep features, the parameter-free simple attention module SimAM is added, which improves the discriminative power and interference resistance of the algorithm and effectively handles similar-object distraction, motion blur and partial occlusion arising during UAV tracking.

3) Multi-layer RPNs are applied to feature layers of different depths; the combination of deep semantic information and shallow detail information helps target classification and localization. The RPN modules replace the traditional scale-estimation method to locate the target more accurately, and the fusion weights are reassigned for the characteristics of UAV tracking targets, giving good adaptability to scale changes while the speed reaches 37.5 FPS, satisfying real-time requirements.
Brief Description of the Drawings
In order to describe the technical solution of the present invention more clearly, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an attention-enhanced UAV aerial target tracking method provided by an embodiment of the present invention;

Fig. 2 is a network model diagram of the attention-enhanced UAV aerial target tracking method provided by the present invention;

Fig. 3 is a structural diagram of the hard downsampling and soft downsampling in an attention-enhanced UAV aerial target tracking method provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of SimAM in an attention-enhanced UAV aerial target tracking method provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the RPN in an attention-enhanced UAV aerial target tracking method provided by an embodiment of the present invention;

Fig. 6 is a structural diagram of an attention-enhanced UAV aerial target tracking device provided by an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
This specification provides the method steps as described in the embodiments or flowcharts, but more or fewer steps may be included based on routine or non-inventive work. When an actual system or server product executes the methods, the steps may be executed in the order shown in the embodiments or drawings, or in parallel (for example, in a parallel-processor or multi-threaded environment).
Current target tracking algorithms mainly divide into correlation-filter-based and deep-learning-based trackers. Correlation-filter trackers use correlation filters from signal processing to compute the similarity between the template and the search image and accelerate the computation in the frequency domain via the Fourier transform; the computational load drops sharply, and speeds of hundreds of frames per second are achievable. However, most correlation-filter algorithms represent the tracked target with traditional hand-crafted features, so their robustness and accuracy are insufficient for tracking tasks in complex scenes.
Owing to their great potential in both accuracy and speed, Siamese trackers have gradually become the mainstream in target tracking, and most subsequent trackers have been built on the Siamese structure. The working principle of a Siamese tracker can be expressed as formula (1); the tracker mainly consists of a feature-extraction part $\varphi(\cdot)$, a similarity-computation part ($*$) and a tracking-result-generation part:

$$f(z,x) = \varphi(z) * \varphi(x) + b\mathbb{1} \qquad (1)$$

where $f(z,x)$ is the similarity response map, $\varphi(\cdot)$ is the feature-extraction part, $*$ is the cross-correlation operation, $b$ is the bias at every position, and $\mathbb{1}$ is a response map whose value is 1 at every position.
1) Feature-extraction part: a Siamese neural network extracts the features; its two branches are the template branch and the search branch. The template branch takes the target image $z$ of the initial frame as the template and outputs the template feature map $\varphi(z)$; the search branch takes the search image $x$ of a subsequent frame and outputs the search feature map $\varphi(x)$.

2) Similarity-computation part ($*$): integrates the feature information on the feature maps of the two branches, computes the similarity between the search feature map and the template feature map, and generates the similarity response map $f(z,x)$.

3) Tracking-result-generation part: predicts the target position on the search image from the obtained response map; the position of maximum response is generally taken as the predicted target position, after which target scale estimation and bounding-box regression are performed.
The online tracking procedure of a Siamese tracker consists mainly of the following steps:

1) the video sequence is fed frame by frame into the feature-extraction part;

2) for the first frame, the template branch extracts the target features as the template features;

3) for each later frame, the search branch extracts the features of the current frame as the search features;

4) the similarity-computation part computes the similarity between the feature maps and generates the response map;

5) the tracking-result-generation part uses the similarity response map to predict the target position in the current frame;

6) steps 3 to 5 are repeated until the last frame of the video sequence (the core correlation of this loop is sketched below).
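As a concrete illustration of formula (1) and of steps 3 to 5, the following is a minimal PyTorch sketch, not the patent's implementation; the function name, shapes and the dummy sizes in the usage example are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def siamese_response(phi_z: torch.Tensor, phi_x: torch.Tensor, b: float = 0.0) -> torch.Tensor:
    """f(z, x) = phi(z) * phi(x) + b·1 for a single frame pair.

    phi_z: template feature map of shape (C, Hz, Wz), used as the correlation kernel.
    phi_x: search feature map of shape (C, Hx, Wx).
    Returns a (Hx - Hz + 1, Wx - Wz + 1) similarity response map.
    """
    kernel = phi_z.unsqueeze(0)            # (1, C, Hz, Wz): one kernel
    search = phi_x.unsqueeze(0)            # (1, C, Hx, Wx): one input
    resp = F.conv2d(search, kernel)        # slide the template over the search region
    return resp.squeeze(0).squeeze(0) + b  # add the uniform per-position bias b·1

# Usage with dummy features: the peak of the response is the predicted position.
resp = siamese_response(torch.randn(256, 6, 6), torch.randn(256, 22, 22))
row, col = divmod(int(resp.argmax()), resp.shape[1])
```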
Fig. 1 is a flowchart of an attention-enhanced UAV aerial target tracking method provided by an embodiment of the present invention, and Fig. 2 is its network model diagram.
As shown in Fig. 1, the method includes:
Step 101: acquire a UAV aerial video.
Step 102: feed the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of the Siamese tracking network built on the IResNet backbone, and output three groups of template-branch feature maps and search-branch feature maps from the three convolutional blocks layer2, layer3 and layer4 of the IResNet network; the IResNet network builds the template branch and the search branch of the Siamese tracking network on the basis of ResNet50, adds hard downsampling to the projection mapping of the last bottleneck of layer1 in both branches, sums the hard downsampling with the soft downsampling of the residual module whose channel number is 256 and whose convolution kernel is 1×1, and then appends a simple attention module after every bottleneck of the IResNet network; the hard downsampling consists of a 3×3 max-pooling layer with stride 2 followed in series by a 1×1 convolutional layer with stride 1.
In step 102, the key to a UAV tracker is building a robust target model that makes better use of rich semantic information. ResNet-50 is widely used in other areas of computer vision for its strong feature-extraction ability, and the improved deep residual network (Improved ResNet, IResNet) of the present invention adapts it to the UAV tracking task. Fig. 3(a) shows the hard downsampling in IResNet: it is implemented on the last bottleneck of layer1 of the search and template branches by a max pooling of stride 2 and size 3 followed by a convolution of stride 1 and size 1, splitting spatial downsampling and channel expansion into two steps. First, the max pooling considers the whole spatial neighborhood and keeps the maximum activation as the input of the convolution; second, the stride-1, size-1 convolution fuses information across channels. As shown in Fig. 3(b), the soft downsampling of the residual module on the last bottleneck (channel number 256, convolution kernel 1×1) is a convolution of stride 2 and size 3; compared with max pooling, the computation between input and output is smoother, hence "soft downsampling". Finally, the output of the "hard downsampling" in the projection mapping is added to the output of the "soft downsampling" in the residual part, effectively combining the two downsampling methods and making the network better suited to UAV tracking tasks with few target pixels and motion blur, without increasing the computational cost. Moreover, the convolution considers all the information on the feature map, including background information, which helps the network localize better. The IResNet proposed by the present invention thus improves the original sampling scheme by fusing the complementary "soft" and "hard" downsampling, preserving more detail and background information.
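The following PyTorch sketch shows one plausible reading of this fused downsampling bottleneck of Fig. 3; it is a sketch under assumptions (the batch-norm and ReLU placement and the channel widths are not specified by the text), not the patent's implementation.

```python
import torch
import torch.nn as nn

class IResDownBlock(nn.Module):
    """Downsampling bottleneck that sums a "soft" and a "hard" downsampling path.

    Residual path (soft): 1x1 reduce -> 3x3 stride-2 conv -> 1x1 expand; the strided
    convolution weighs every input value, keeping background context.
    Shortcut path (hard): 3x3 stride-2 max pooling, which keeps only the strongest
    activation in each window, followed by a stride-1 1x1 conv that matches channels.
    """

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.soft = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1, bias=False),  # soft downsampling
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.hard = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),               # hard downsampling
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.soft(x) + self.hard(x))  # sum of the two paths
```

For example, IResDownBlock(256, 128, 512) halves the spatial resolution while expanding the channels; the sum preserves both the smoothed background context of the soft path and the peak activations of the hard path.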
To improve the discriminative ability of the network, a simple attention module is added after every bottleneck, as shown in Fig. 4. To make the network learn more discriminative neurons, the present invention infers three-dimensional attention weights directly from the current neurons and in turn uses them to refine those neurons. To infer the three-dimensional weights efficiently, an energy function is defined on the basis of neuroscience knowledge, and its closed-form solution is then obtained.
First, the present invention defines a simple energy function to measure the linear separability between neurons. For each neuron, the energy function is defined as

$$e_t(w_t, b_t, \mathbf{y}, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}(y_o - \hat{x}_i)^2 \qquad (2)$$

where $t$ and $x_i$ are the target neuron and the other neurons in a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$, $\hat{t} = w_t t + b_t$ and $\hat{x}_i = w_t x_i + b_t$ are linear transforms of $t$ and $x_i$, $i$ indexes the spatial dimension, $M = H \times W$ is the number of neurons in a channel, and $w_t$ and $b_t$ are the weight and bias of the linear transform.
Formula (2) attains its minimum when the target neuron satisfies $\hat{t} = y_t$, all other neurons satisfy $\hat{x}_i = y_o$, and $y_t \neq y_o$. Minimizing formula (2) is therefore equivalent to finding the linear separability between the target neuron and the other neurons of the same channel. For simplicity, binary labels ($y_t = 1$ and $y_o = -1$) are adopted and a regularization term is added; the final energy function is

$$e_t(w_t, b_t, \mathbf{y}, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1 - (w_t x_i + b_t)\bigr)^2 + \bigl(1 - (w_t t + b_t)\bigr)^2 + \lambda w_t^2 \qquad (3)$$
In theory each channel has $M$ energy functions, and solving all of these equations with gradient descent would be computationally very expensive. Fortunately, $w_t$ and $b_t$ in formula (3) have the closed-form solutions of formulas (4) and (5):

$$w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda} \qquad (4)$$

$$b_t = -\frac{1}{2}(t + \mu_t)\,w_t \qquad (5)$$

where $\mu_t = \frac{1}{M-1}\sum_{i} x_i$ and $\sigma_t^2 = \frac{1}{M-1}\sum_{i}(x_i - \mu_t)^2$ are the mean and variance over all neurons of the channel except the target neuron. Since formulas (4) and (5) show that the analytic solutions are obtained on a single channel, it is assumed that the other neurons of the same channel follow the same distribution. Under this assumption, the mean and variance can be computed once over all neurons of a channel and reused by every neuron in it, which greatly reduces the cost of recomputing $\mu$ and $\sigma$ at every position. The minimal energy at each position is finally obtained as

$$e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda} \qquad (6)$$

where $\hat{\mu} = \frac{1}{M}\sum_{i} x_i$ and $\hat{\sigma}^2 = \frac{1}{M}\sum_{i}(x_i - \hat{\mu})^2$. Formula (6) shows that the lower the energy $e_t^*$, the more the target neuron differs from its surrounding neurons, and the more important it is when the later cross-correlation computes the similarity.
Finally, scaling rather than addition is used to refine the features; the refinement performed by the whole SimAM module is

$$\widetilde{X} = \zeta\!\left(\frac{1}{E}\right) \odot X \qquad (7)$$

where $E$ groups all the minimal energies $e_t^*$ across the channel and spatial dimensions, $\zeta$ is the sigmoid function, which restrains overly large values in $E$ without changing the relative importance of the neurons, $\odot$ denotes element-wise multiplication, and $1/E$ indicates the importance, and hence the location, of the target neurons.
In the SimAM module, the weight coefficients are obtained from $E$ through the sigmoid function and multiplied with the input feature map $X$; this points out the location of the target neurons and thereby adaptively guides the network to select the correct channels when computing the similarity.
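Formulas (6) and (7) reduce to a few tensor operations; below is a minimal PyTorch sketch of the parameter-free refinement (the default value of the regularization coefficient λ is an assumption):

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM refinement implementing formulas (6) and (7).

    x: input feature map X of shape (B, C, H, W); lam is the regularizer lambda.
    """
    n = x.shape[2] * x.shape[3] - 1                   # M - 1 neurons besides the target
    d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2   # (t - mu_hat)^2 for every neuron
    v = d.sum(dim=(2, 3), keepdim=True) / n           # sigma_hat^2 for every channel
    e_inv = d / (4 * (v + lam)) + 0.5                 # 1 / e_t^*: per-neuron importance
    return x * torch.sigmoid(e_inv)                   # formula (7): sigmoid(1/E) ⊙ X
```

Because the module has no learnable parameters, inserting it after every bottleneck adds almost no computation or memory, which matters on a UAV platform.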
Step 103: perform weighted fusion of the three groups of template-branch feature maps and search-branch feature maps using the region proposal networks arranged on the convolutional blocks of different depths, layer2, layer3 and layer4, of the template branch and the search branch, and track the target in the current frame with the result of the weighted fusion.
The present invention proposes an attention-enhanced UAV aerial target tracking method. First, the deep residual network (Residual Network, ResNet) is improved by fusing soft and hard downsampling, which helps target classification and localization. Second, to filter features better, the parameter-free simple attention module (SimAM) is used to enhance target features and suppress interfering information. Finally, in the tracking-box generation stage, the outputs of the region proposal networks (Region Proposal Network, RPN) on convolutional layers of different depths are fused with different weights, effectively handling scale variation. Experiments show that the method copes more effectively with scale variation, small targets, motion blur and partial occlusion, improves the tracking of aerial targets, and runs at 37.5 FPS, meeting real-time requirements.
In one possible implementation, performing weighted fusion of the three groups of template-branch and search-branch feature maps using the region proposal networks between the convolutional blocks of different depths at layer2, layer3 and layer4 of the template branch and the search branch, and obtaining the target tracking result of the current frame, includes:

inputting the template-branch feature maps and the search-branch feature maps into the region proposal networks between the convolutional blocks of different depths at layer2, layer3 and layer4 of the template branch and the search branch;

performing multi-scale estimation of the target on the search-branch feature maps using k predefined anchor boxes of different scales and sizes;

using each template-branch feature map as a convolution kernel and convolving it with the corresponding search-branch feature map to obtain the regression response maps and classification response maps of the template-branch and search-branch feature maps;

performing a depthwise cross-correlation on the two regression response maps output for the template-branch and search-branch feature maps to obtain the regression result;

performing a depthwise cross-correlation on the two classification response maps output for the template-branch and search-branch feature maps to obtain the classification result;

taking the location of the maximum target score in the classification result as the predicted position of the target;

reading the predicted bounding box corresponding to the predicted position from the regression result and outputting it as the predicted box of the target.
In the embodiment provided by the present invention, a region proposal network (RPN) module is introduced in place of the traditional pyramid-style scale-estimation method; k anchor boxes of different scales and sizes are predefined to perform multi-scale estimation of the target, as shown in Fig. 5. The module mainly comprises the up-channel cross-correlation (UP-channel Xcorr), the bounding-box regression branch $B_{W \times H \times 4k}$ and the classification branch $S_{W \times H \times 2k}$. The up-channel cross-correlation is the key operation that integrates the feature information of the template branch and the search branch; its core idea is to use the template-branch feature map $\varphi(z)$ as a convolution kernel and convolve it with the search-branch feature map $\varphi(x)$ to obtain the response maps, after which the regression branch and the classification branch derive the predicted position and the classification score of the target from the response maps, and the target with the highest score together with its predicted position is output:

$$S_{W \times H \times 2k} = \varphi(x)_{cls} * \varphi(z)_{cls}$$

$$B_{W \times H \times 4k} = \varphi(x)_{reg} * \varphi(z)_{reg}$$

where $*$ denotes the depthwise cross-correlation, $\varphi(\cdot)$ is the feature-extraction network, $W$, $H$, $2k$ and $4k$ are the width, height and channel numbers of the feature maps, $\varphi(z)_{cls}$ and $\varphi(x)_{cls}$ denote the two classification response maps output for the template-branch and search-branch feature maps, and $\varphi(z)_{reg}$ and $\varphi(x)_{reg}$ denote the two regression response maps output for the template-branch and search-branch feature maps.
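A minimal sketch of the up-channel correlation for the classification branch follows (the regression branch is analogous with 4k kernels); the reshaping convention and the sizes in the usage example are assumptions consistent with the W×H×2k output shape, not the patent's code.

```python
import torch
import torch.nn.functional as F

def up_channel_xcorr(z_cls: torch.Tensor, x_cls: torch.Tensor, k: int) -> torch.Tensor:
    """Up-channel cross-correlation phi(x)_cls * phi(z)_cls.

    z_cls: template features already lifted by a conv layer to (1, 2k*C, Hz, Wz).
    x_cls: search features of shape (1, C, Hx, Wx).
    Returns the classification response map S with 2k channels.
    """
    _, zc, zh, zw = z_cls.shape
    c = zc // (2 * k)
    kernels = z_cls.view(2 * k, c, zh, zw)  # one C-channel kernel per anchor/class pair
    return F.conv2d(x_cls, kernels)         # slide every kernel over the search features

# Example: k = 5 anchors, C = 256 channels.
S = up_channel_xcorr(torch.randn(1, 2 * 5 * 256, 4, 4), torch.randn(1, 256, 20, 20), k=5)
# S.shape == (1, 10, 17, 17); the per-anchor scores live in the 2k channel dimension.
```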
Since no prior information about the tracked target is known, it is difficult for the anchors predefined before tracking to estimate the size and position of the target accurately in one shot. To exploit multi-layer feature information for accurate localization of the target, the outputs of multiple RPNs are fused; the fusion weights of the layers are designed as $\alpha_2 = \beta_2 = 0.5$ and $\alpha_3 = \alpha_4 = \beta_3 = \beta_4 = 0.25$, as shown below:

$$S = \alpha_2 S_2 + \alpha_3 S_3 + \alpha_4 S_4$$

$$B = \beta_2 B_2 + \beta_3 B_3 + \beta_4 B_4$$

where $S_2$, $S_3$ and $S_4$ denote the classification responses output by the RPNs on layer2, layer3 and layer4, and $B_2$, $B_3$ and $B_4$ denote the regression responses output by the RPNs on layer2, layer3 and layer4.
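Putting the fusion together, a short sketch follows; the weight values repeat the scheme assumed above (the original passage indexes them inconsistently), and function names are illustrative.

```python
from typing import Dict, Tuple
import torch

ALPHA = {2: 0.5, 3: 0.25, 4: 0.25}  # assumed fusion weights; beta is taken equal to alpha

def fuse_rpn_outputs(S_layers: Dict[int, torch.Tensor],
                     B_layers: Dict[int, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
    """Weighted fusion S = sum(alpha_l * S_l), B = sum(beta_l * B_l) over l = 2, 3, 4."""
    S = sum(ALPHA[l] * S_layers[l] for l in (2, 3, 4))  # fused classification responses
    B = sum(ALPHA[l] * B_layers[l] for l in (2, 3, 4))  # fused regression responses
    return S, B
```

The fused S and B then feed the read-out sketched earlier: the anchor with the highest fused score selects the box decoded from B.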
Another aspect of the present invention provides an attention-enhanced UAV aerial target tracking device 200; as shown in Fig. 6, the device includes:
an acquisition module 201, configured to acquire a UAV aerial video;

a processing module 202, configured to feed the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of the Siamese tracking network built on the IResNet backbone and to output three groups of template-branch feature maps and search-branch feature maps from the three convolutional blocks layer2, layer3 and layer4 of the IResNet network, wherein the IResNet network builds the template branch and the search branch of the Siamese tracking network on the basis of ResNet50, adds hard downsampling to the projection mapping of the last bottleneck of layer1 in both branches, sums the hard downsampling with the soft downsampling of the residual module whose channel number is 256 and whose convolution kernel is 1×1, and then appends a simple attention module after every bottleneck of the IResNet network, the hard downsampling consisting of a 3×3 max-pooling layer with stride 2 followed in series by a 1×1 convolutional layer with stride 1;

an output module 203, configured to perform weighted fusion of the three groups of template-branch feature maps and search-branch feature maps using the region proposal networks arranged on the convolutional blocks of different depths, layer2, layer3 and layer4, of the template branch and the search branch, and to track the target in the current frame with the result of the weighted fusion.
In yet another embodiment of the present invention, a device is provided that comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the attention-enhanced UAV aerial target tracking method described in the embodiments of the present invention.
In yet another embodiment of the present invention, a computer-readable storage medium is provided that stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the attention-enhanced UAV aerial target tracking method described in the embodiments of the present invention.
The above embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When implemented with software, they may be realized wholly or partly in the form of a computer program product. The computer program product comprises a plurality of computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server or data center to another by wire (for example, coaxial cable, optical fiber or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)).
It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of additional identical elements in the process, method, article or device that comprises it.
The embodiments in this specification are described in a related manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
The above are only preferred embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.