


Technical Field
The invention belongs to the field of digital image processing and relates to object detection, in particular to an improved object detection algorithm based on a feature pyramid network and an attention mechanism.
Background
The task of object detection is to find the objects of interest in an image and determine their categories and locations. It is one of the core problems in computer vision, with wide applications in infrared detection, intelligent video surveillance, remote-sensing image analysis, medical diagnosis, and fire and smoke detection in intelligent buildings. Object detection algorithms can be divided into traditional algorithms and deep-learning-based algorithms. Representative traditional algorithms include the SIFT-based and Viola-Jones (V-J) detectors, but these methods have high time complexity and poor robustness. Classic deep-learning-based algorithms include R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD. Although many excellent detectors exist at this stage, their detection performance still has shortcomings, leading to problems such as missed and false detections.
Summary of the Invention
In view of the above defects or deficiencies in the prior art, the purpose of the present invention is to provide an improved object detection algorithm based on a feature pyramid network and an attention mechanism.
To accomplish the above task, the present invention adopts the following technical solution:
An improved object detection algorithm based on a feature pyramid network and an attention mechanism, characterized by the following steps:
Step 1) Following the principle of the feature pyramid network, the six multi-scale feature maps extracted from the input image by the VGG-16 base network of the original SSD algorithm are fused in order from the smallest to the largest, yielding feature maps that combine different layers and contain both rich semantic information and rich detail information;
In the original SSD algorithm, the scales of the feature maps extracted from the input image by the VGG-16 base network decrease from large to small: the lower-level feature maps have higher resolution and contain more detail, while the higher-level maps have lower resolution and contain more abstract semantic information. The original SSD therefore uses the lower-level feature maps to detect small objects and the higher-level maps to detect medium and large objects;
Step 2) A channel attention mechanism is introduced, and an attention model is added to the two fused feature maps that contain the richest detail and semantic information and are the most sensitive to small objects. The attention mechanism is realized by adding a mask to the feature maps to mark the features of the regions of interest; through continued training, the network learns the regions of interest in each image that deserve focus and suppresses the influence of other, distracting regions, thereby enhancing the algorithm's ability to detect small objects.
According to the present invention, the input image in step 1) is 300×300, and the feature maps for detection obtained from the VGG-16 base network have sizes 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1. Following the feature-pyramid principle, the feature maps are fused in order from the smallest to the largest size, yielding six feature maps whose sizes remain 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1.
Further, in step 2) an attention model is added to the feature maps fused in step 1). Because the fusion proceeds from the smallest feature map to the largest, the most information-rich maps after fusion are the 38×38 and 19×19 maps; compared with the other maps they contain richer detail and semantic information and are more sensitive to small objects. To preserve detection speed and reduce computation, the attention model is added only to these two fused feature maps. The detection algorithm proceeds as follows:
a) Object detection based on a single-stage network model uses the idea of regression: a single convolutional neural network directly regresses the category and bounding box of each object from the input image. First, following the feature-pyramid principle, the multi-scale feature maps extracted by the original SSD algorithm are fused in order from the smallest to the largest size. The maps extracted by the VGG-16 base network have sizes 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1; fusing them yields six feature maps of the same sizes, each containing rich semantic and detail information.
b) Following the principle of the attention mechanism, channel attention is introduced and an attention model is added to the fused feature maps of step a). Since the fused 38×38 and 19×19 maps contain the richest information, and in order to preserve real-time performance, the attention model is added only to these two maps.
c) On each of the six multi-scale feature maps obtained in steps a) and b), candidate boxes of different sizes and aspect ratios are set at every cell. The scale of the candidate boxes is computed by formula (1):

s_k = s_min + ((s_max − s_min)/(m − 1))·(k − 1),  k ∈ [1, m]  (1)

where m is the number of feature layers; s_k is the ratio of the candidate box to the image; and s_max and s_min are the maximum and minimum ratios, set to 0.9 and 0.2 respectively. Formula (1) gives the scale of each candidate box;
For the aspect ratio, the values a_r ∈ {1, 2, 3, 1/2, 1/3} are generally used, and the width and height of the candidate boxes are computed by formula (2):

w_k^a = s_k·√a_r,  h_k^a = s_k/√a_r  (2)
For the candidate box with aspect ratio 1, an additional box of scale s'_k = √(s_k·s_(k+1)) is also added. The center coordinates of the candidate boxes are ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), where |f_k| is the size of the k-th feature layer;
d) A 3×3 convolution kernel is applied to the multi-scale feature maps to predict categories and confidences, and the detection algorithm is trained. During training, the loss function is defined as the weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = (1/N)·(L_conf(x, c) + α·L_loc(x, l, g))
where N is the number of matched candidate boxes; x ∈ {1, 0} indicates whether a candidate box matches a ground-truth box (x = 1 if matched, x = 0 otherwise); c is the predicted category confidence; g is the location parameter of the ground-truth box; l is the predicted location of the predicted box; and α is a weight coefficient, set to 1.
For the localization loss in SSD, Smooth L1 loss is used to regress the offsets of the candidate box center (cx, cy), width (w), and height (h):

L_loc(x, l, g) = Σ_(i∈Pos) Σ_(m∈{cx,cy,w,h}) x_ij · smooth_L1(l_i^m − ĝ_j^m),  where smooth_L1(d) = 0.5d² if |d| < 1, and |d| − 0.5 otherwise.
For the confidence loss in SSD, the typical softmax loss is used:

L_conf(x, c) = −Σ_(i∈Pos) x_ij^p · log(ĉ_i^p) − Σ_(i∈Neg) log(ĉ_i^0),  where ĉ_i^p = exp(c_i^p)/Σ_p exp(c_i^p).
The improved object detection algorithm of the present invention is based on the single-stage SSD detector and takes into account the effect of feature-map resolution on detection performance. Following the idea of the feature pyramid network, it fuses the multi-scale feature maps extracted by the original SSD algorithm into feature maps with rich semantic and detail information; following the principle of the attention mechanism, it then adds an attention model to the fused 38×38 and 19×19 feature maps to strengthen the recognition of small objects.
Description of the Drawings
Figure 1 is a schematic diagram of the network structure of the object detection algorithm combining a feature pyramid network and an attention mechanism;
Figure 2 compares the detection results of the original SSD algorithm and the improved algorithm: figures a1, a2, a3, a4, and a5 on the left show the detections of the original SSD algorithm; figures b1, b2, b3, b4, and b5 on the right show the detections of the improved algorithm.
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Detailed Description
The technical approach of the improved object detection algorithm of the present invention is as follows. Starting from the single-stage detector SSD, the deficiencies of the SSD algorithm are analyzed and improvements are proposed. Following the principle of the feature pyramid network, the six feature maps extracted by the original SSD algorithm are fused to form new feature maps that carry both rich semantic and rich detail information. An attention model is then added to the fused feature maps; to preserve real-time performance, it is added only to the 38×38 and 19×19 maps, which contain the most information and are the most sensitive to small objects. These improvements raise the detection ability of the algorithm and alleviate problems such as missed detections.
This embodiment provides an improved object detection algorithm based on a feature pyramid network and an attention mechanism, comprising the following steps:
Step 1) Following the principle of the feature pyramid network, the six multi-scale feature maps extracted from the input image by the VGG-16 base network of the original SSD algorithm are fused in order from the smallest to the largest, yielding feature maps that combine different layers and contain both rich semantic information and rich detail information;
In the original SSD algorithm, the scales of the feature maps extracted from the input image by the VGG-16 base network decrease from large to small: the lower-level feature maps have higher resolution and contain more detail, while the higher-level maps have lower resolution and contain more abstract semantic information. The original SSD therefore uses the lower-level feature maps to detect small objects and the higher-level maps to detect medium and large objects;
Step 2) A channel attention mechanism is introduced, and an attention model is added to the two fused feature maps that contain the richest detail and semantic information and are the most sensitive to small objects. The attention mechanism is realized by adding a mask to the feature maps to mark the features of the regions of interest; through continued training, the network learns the regions of interest in each image that deserve focus and suppresses the influence of other, distracting regions, thereby enhancing the algorithm's ability to detect small objects.
In step 1), the input image is 300×300, and the feature maps extracted by the VGG-16 base network have sizes 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1. Following the idea of the feature pyramid network, these six maps are fused pairwise from the smallest size to the largest: 1×1 with 3×3, 3×3 with 5×5, 5×5 with 10×10, 10×10 with 19×19, and 19×19 with 38×38. The fused feature maps retain the sizes 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1.
In step 2), following the principle of the attention mechanism, an attention model is added to the fused feature maps. Since the fused 38×38 and 19×19 maps contain the richest information, and in order to preserve real-time detection and reduce computation, the attention model is added only to these two maps; it enhances the extraction of small-object features.
The detection process of the improved object detection algorithm is as follows:
a) Object detection based on a single-stage network model uses the idea of regression: a single convolutional neural network directly regresses the category and bounding box of each object from the input image. First, following the feature-pyramid principle, the multi-scale feature maps extracted by the original SSD algorithm are fused in order from the smallest to the largest size. The maps extracted by the VGG-16 base network have sizes 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1. Taking the 1×1 and 3×3 feature maps as an example:
First, the 1×1 feature map is upsampled by interpolation: on the basis of the original pixels, new elements are inserted between pixel positions using a suitable interpolation algorithm, enlarging the map to the size of the 3×3 feature map. Then a 1×1 convolution is applied to the 3×3 feature map to change its number of channels so that it matches the upsampled map. Finally, the two maps are fused, and a 3×3 convolution kernel is applied to the fused map to eliminate the aliasing effect of upsampling. The fusion of the other adjacent feature-map pairs follows the same procedure. The fusion yields six feature maps of sizes 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1, each containing rich semantic and detail information.
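The fusion step above can be sketched in a few lines of NumPy. This is an illustration, not the patented implementation: the nearest-neighbour upsampling, toy channel counts, and constant weights are assumptions, and the final 3×3 smoothing convolution is omitted.

```python
import numpy as np

def upsample_nearest(x, size):
    """Nearest-neighbour upsampling of a (C, H, W) feature map to (C, size, size)."""
    c, h, w = x.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[:, rows][:, :, cols]

def conv1x1(x, weight):
    """1x1 convolution = per-pixel channel projection. weight: (C_out, C_in)."""
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(weight.shape[0], h, w)

def fuse(small, large, proj):
    """Fuse a smaller (deeper) map into the next larger one: upsample the
    small map, project the large map's channels to match, add elementwise.
    In practice a 3x3 conv would follow to suppress upsampling aliasing."""
    up = upsample_nearest(small, large.shape[1])
    return up + conv1x1(large, proj)

# toy example: fuse an 8-channel 3x3 map into a 4-channel 5x5 map
small = np.ones((8, 3, 3))
large = np.ones((4, 5, 5))
proj = np.ones((8, 4)) * 0.25   # projects 4 channels -> 8
fused = fuse(small, large, proj)
```

With these all-ones inputs, both branches contribute 1.0 everywhere, so the fused map is constant and keeps the larger spatial size with the deeper map's channel count.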
b) Following the principle of the attention mechanism, channel attention is introduced and an attention model is added to the feature maps fused in step a). Since the fused 38×38 and 19×19 maps contain the richest information, and in order to preserve real-time performance, the attention model is added only to these two maps. Adding the attention model proceeds in three steps: squeeze, excitation, and attention.
The squeeze operation is given by formula (1):

y_c = (1/(H × W)) · Σ_(i=1..H) Σ_(j=1..W) u_c(i, j)  (1)

where H and W are the height and width of the input, U is the input, Y is the output, and C is the number of input channels. Formula (1) converts an H×W×C input into a 1×1×C output, which is equivalent to a global average pooling operation.
The excitation operation is given by formula (2):

S = h-Swish(W₂ · ReLU6(W₁ · Y))  (2)

where Y is the output of the squeeze operation, S is the output of the excitation operation, W₁ has dimensions C/r × C, W₂ has dimensions C × C/r, and r is a scaling parameter, set here to 4. Multiplying Y by W₁ represents a fully connected operation, followed by the ReLU6 activation; multiplying by W₂ represents another fully connected operation, followed by the hard-Swish activation, which completes the excitation. The ReLU6 and hard-Swish activation functions are given by formula (3):

ReLU6(x) = min(max(x, 0), 6),  h-Swish(x) = x · ReLU6(x + 3) / 6  (3)
The attention operation is given by formula (4):

X = S × U  (4)

where X is the feature map after the attention mechanism is applied, U is the original input, and S is the output of the excitation operation; the weight of each channel is multiplied with the features of the corresponding feature map.
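The three steps of formulas (1), (2), and (4) can be sketched as a small NumPy function. The weight shapes and r = 4 follow the text; the random toy inputs and the hard-swish definition x·ReLU6(x + 3)/6 are illustrative assumptions (the latter is the standard form).

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # standard hard-swish: x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

def channel_attention(u, w1, w2):
    """Squeeze-excite style channel attention on a (C, H, W) feature map.
    Squeeze: global average pool -> (C,).
    Excite: two fully connected layers, ReLU6 then hard-swish gating.
    Attention: rescale each input channel by its gate value."""
    c = u.shape[0]
    y = u.reshape(c, -1).mean(axis=1)        # squeeze, eq. (1)
    s = h_swish(w2 @ relu6(w1 @ y))          # excite, eq. (2)
    return s[:, None, None] * u              # attention, eq. (4)

# toy check: C = 8 channels, r = 4 -> hidden size C/r = 2
rng = np.random.default_rng(0)
u = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8)) * 0.1       # (C/r, C)
w2 = rng.standard_normal((8, 2)) * 0.1       # (C, C/r)
x = channel_attention(u, w1, w2)
```

Note that the output has the same shape as the input: attention only reweights channels, so the module can be dropped onto the fused 38×38 and 19×19 maps without changing the rest of the detector.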
c) On each of the six multi-scale feature maps obtained in steps a) and b), candidate boxes of different sizes and aspect ratios are set at every cell. The scale of the candidate boxes is computed by formula (5):

s_k = s_min + ((s_max − s_min)/(m − 1))·(k − 1),  k ∈ [1, m]  (5)

where m is the number of feature layers; s_k is the ratio of the candidate box to the image; and s_max and s_min are the maximum and minimum ratios, set to 0.9 and 0.2 respectively. Formula (5) gives the scale of each candidate box;
For the aspect ratio, the values a_r ∈ {1, 2, 3, 1/2, 1/3} are generally used, and the width and height of the candidate boxes are computed by formula (6):

w_k^a = s_k·√a_r,  h_k^a = s_k/√a_r  (6)
For the candidate box with aspect ratio 1, an additional box of scale s'_k = √(s_k·s_(k+1)) is also added. The center coordinates of the candidate boxes are ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), where |f_k| is the size of the k-th feature layer;
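Formulas (5) and (6) can be checked with a few lines of Python, using m = 6 layers and s_min = 0.2, s_max = 0.9 as in the text (the helper names are ours, not from the patent):

```python
import math

def box_scales(m=6, s_min=0.2, s_max=0.9):
    """Scale of the candidate box on each of the m feature layers:
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), k in [1, m]."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def box_sizes(s_k, ratios=(1, 2, 3, 1/2, 1/3)):
    """Width/height per aspect ratio a: w = s_k*sqrt(a), h = s_k/sqrt(a)."""
    return [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]

scales = box_scales()
# scales is approximately [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

These fractions are relative to the 300×300 input, so for example the smallest boxes on the 38×38 layer cover roughly 0.2 × 300 = 60 pixels on a side.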
d) A 3×3 convolution kernel is applied to the multi-scale feature maps to predict categories and confidences, and the detection algorithm is trained. During training, the loss function is defined as the weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = (1/N)·(L_conf(x, c) + α·L_loc(x, l, g))
where N is the number of matched candidate boxes; x ∈ {1, 0} indicates whether a candidate box matches a ground-truth box (x = 1 if matched, x = 0 otherwise); c is the predicted category confidence; g is the location parameter of the ground-truth box; l is the predicted location of the predicted box; and α is a weight coefficient, set to 1.
For the localization loss in SSD, Smooth L1 loss is used to regress the offsets of the candidate box center (cx, cy), width (w), and height (h):

L_loc(x, l, g) = Σ_(i∈Pos) Σ_(m∈{cx,cy,w,h}) x_ij · smooth_L1(l_i^m − ĝ_j^m),  where smooth_L1(d) = 0.5d² if |d| < 1, and |d| − 0.5 otherwise.
For the confidence loss in SSD, the typical softmax loss is used:

L_conf(x, c) = −Σ_(i∈Pos) x_ij^p · log(ĉ_i^p) − Σ_(i∈Neg) log(ĉ_i^0),  where ĉ_i^p = exp(c_i^p)/Σ_p exp(c_i^p).
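The combined loss above can be sketched in NumPy under the simplifying assumption that box matching has already been done, so matched location offsets and class labels are given directly (hard-negative mining and the offset encoding are omitted):

```python
import numpy as np

def smooth_l1(d):
    """Smooth L1: 0.5*d^2 if |d| < 1, else |d| - 0.5, elementwise."""
    d = np.abs(d)
    return np.where(d < 1, 0.5 * d * d, d - 0.5)

def softmax_ce(logits, label):
    """Softmax cross-entropy for one box: -log softmax(logits)[label]."""
    z = logits - logits.max()          # subtract max for numerical stability
    return -(z[label] - np.log(np.exp(z).sum()))

def ssd_loss(loc_pred, loc_target, cls_logits, cls_labels, alpha=1.0):
    """L = (1/N) * (L_conf + alpha * L_loc) over N matched boxes."""
    n = loc_pred.shape[0]
    l_loc = smooth_l1(loc_pred - loc_target).sum()
    l_conf = sum(softmax_ce(cls_logits[i], cls_labels[i]) for i in range(n))
    return (l_conf + alpha * l_loc) / n

# toy check: perfect localization and a near-certain correct class -> loss near 0
loc = np.zeros((2, 4))
logits = np.array([[10.0, 0.0], [10.0, 0.0]])
loss = ssd_loss(loc, loc, logits, [0, 0])
```

Worsening either term (wrong offsets or a low-confidence correct class) raises the value, which is the behavior the weighted sum is meant to balance with α = 1.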
The improved object detection model is then trained.
In this embodiment, the PASCAL VOC2007 and PASCAL VOC2012 datasets are used as the training set for the model, and data augmentation (horizontal flipping, random cropping, and similar operations) is applied to expand the training images.
Data used in the experiments: the PASCAL VOC dataset, a standardized dataset for image recognition and classification containing 20 categories: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and TV.
This embodiment trains on the VOC2007 and VOC2012 datasets and tests on the VOC2007 dataset. Training uses stochastic gradient descent (SGD) with a batch size of 32, an initial learning rate of 0.001, and a momentum of 0.9; the learning rate is reduced by 90% at 100,000 and 150,000 iterations, for a total of 200,000 training iterations.
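The step-decay schedule described above (initial rate 0.001, cut by 90% at 100,000 and 150,000 iterations) can be written as a small helper; this is a sketch of the schedule only, not code from the patent:

```python
def learning_rate(step, base_lr=1e-3, milestones=(100_000, 150_000), gamma=0.1):
    """Step-decay schedule: the learning rate is multiplied by gamma
    (i.e. cut by 90%) at each milestone iteration."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr
```

In a framework such as PyTorch this corresponds to a multi-step scheduler with milestones [100000, 150000] and gamma 0.1.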
To verify the detection performance of the improved single-stage detection algorithm of this embodiment, the applicant selected the test split of the PASCAL VOC2007 dataset and used mAP (mean Average Precision) as the evaluation metric: each detected category yields a curve of precision versus recall (the P-R curve), the area under which is the AP value, and averaging the AP values over all detected categories gives the mAP. The detection performance is compared with other mainstream detection models both subjectively and objectively (see Table 1 and Table 2).
Table 1
Table 2
In the subjective evaluation, the result images of the original SSD algorithm and the improved detection algorithm are compared (see Figure 2, where a1–a5 are detections by the original SSD algorithm and b1–b5 are detections by the improved algorithm). As the figure shows, compared with the original SSD algorithm the improved algorithm markedly alleviates missed detections, detects densely distributed small objects better, and finds more objects. The detection performance is clearly improved over the original SSD algorithm.
| Application Number | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202010710684.XA | CN111914917B | 2020-07-22 | 2020-07-22 | An improved object detection algorithm based on feature pyramid network and attention mechanism |
| Publication Number | Publication Date |
|---|---|
| CN111914917A | 2020-11-10 |
| CN111914917B | 2025-01-17 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010710684.XAActiveCN111914917B (en) | 2020-07-22 | 2020-07-22 | An improved object detection algorithm based on feature pyramid network and attention mechanism |
| Country | Link |
|---|---|
| CN (1) | CN111914917B (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112418345A (en)* | 2020-12-07 | 2021-02-26 | 苏州小阳软件科技有限公司 | Method and device for quickly identifying fine-grained small target |
| CN112465057A (en)* | 2020-12-08 | 2021-03-09 | 中国人民解放军空军工程大学 | Target detection and identification method based on deep convolutional neural network |
| CN112819737A (en)* | 2021-01-13 | 2021-05-18 | 西北大学 | Remote sensing image fusion method of multi-scale attention depth convolution network based on 3D convolution |
| CN112837747A (en)* | 2021-01-13 | 2021-05-25 | 上海交通大学 | A protein binding site prediction method based on attention twin network |
| CN113158738A (en)* | 2021-01-28 | 2021-07-23 | 中南大学 | Port environment target detection method, system, terminal and readable storage medium based on attention mechanism |
| CN113177579A (en)* | 2021-04-08 | 2021-07-27 | 北京科技大学 | Feature fusion method based on attention mechanism |
| CN113255443A (en)* | 2021-04-16 | 2021-08-13 | 杭州电子科技大学 | Pyramid structure-based method for positioning time sequence actions of graph attention network |
| CN113408549A (en)* | 2021-07-14 | 2021-09-17 | 西安电子科技大学 | Few-sample weak and small target detection method based on template matching and attention mechanism |
| CN113409249A (en)* | 2021-05-17 | 2021-09-17 | 上海电力大学 | Insulator defect detection method based on end-to-end algorithm |
| CN113807291A (en)* | 2021-09-24 | 2021-12-17 | 南京莱斯电子设备有限公司 | Airport runway foreign matter detection and identification method based on feature fusion attention network |
| CN113920468A (en)* | 2021-12-13 | 2022-01-11 | 松立控股集团股份有限公司 | Multi-branch pedestrian detection method based on cross-scale feature enhancement |
| CN114220015A (en)* | 2021-12-21 | 2022-03-22 | 一拓通信集团股份有限公司 | Improved YOLOv 5-based satellite image small target detection method |
| CN114387202A (en)* | 2021-06-25 | 2022-04-22 | 南京交通职业技术学院 | 3D target detection method based on vehicle end point cloud and image fusion |
| CN114419530A (en)* | 2021-12-01 | 2022-04-29 | 国电南瑞南京控制系统有限公司 | Helmet wearing detection algorithm based on improved YOLOv5 |
| CN114494870A (en)* | 2022-01-21 | 2022-05-13 | 山东科技大学 | A dual-phase remote sensing image change detection method, model building method and device |
| CN114627368A (en)* | 2020-12-10 | 2022-06-14 | 华东理工大学 | Novel real-time detector for remote sensing image |
| CN114782772A (en)* | 2022-04-08 | 2022-07-22 | 河海大学 | A detection and identification method of floating objects on water based on improved SSD algorithm |
| CN114821347A (en)* | 2021-01-21 | 2022-07-29 | 南京理工大学 | Remote sensing aircraft target identification method based on depth feature fusion |
| CN114937196A (en)* | 2022-04-22 | 2022-08-23 | 广州大学 | Shadow detection method based on random attention mechanism |
| CN114972860A (en)* | 2022-05-23 | 2022-08-30 | 郑州轻工业大学 | Target detection method based on attention-enhanced bidirectional feature pyramid network |
| CN115019169A (en)* | 2022-05-31 | 2022-09-06 | 海南大学 | Single-stage water surface small target detection method and device |
| CN115995042A (en)* | 2023-02-09 | 2023-04-21 | 上海理工大学 | A video SAR moving target detection method and device |
| CN119579968A (en)* | 2024-11-14 | 2025-03-07 | 中国科学院自动化研究所 | Image internal texture classification method and texture classification model training method |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180182109A1 (en)* | 2016-12-22 | 2018-06-28 | TCL Research America Inc. | System and method for enhancing target tracking via detector and tracker fusion for unmanned aerial vehicles |
| CN109344821A (en)* | 2018-08-30 | 2019-02-15 | 西安电子科技大学 | Small target detection method based on feature fusion and deep learning |
| US20190341025A1 (en)* | 2018-04-18 | 2019-11-07 | Sony Interactive Entertainment Inc. | Integrated understanding of user characteristics by multimodal processing |
| CN110533084A (en)* | 2019-08-12 | 2019-12-03 | 长安大学 | A kind of multiscale target detection method based on from attention mechanism |
| CN110674866A (en)* | 2019-09-23 | 2020-01-10 | 兰州理工大学 | Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network |
| CN110705457A (en)* | 2019-09-29 | 2020-01-17 | 核工业北京地质研究院 | Remote sensing image building change detection method |
| CN111179217A (en)* | 2019-12-04 | 2020-05-19 | 天津大学 | A multi-scale target detection method in remote sensing images based on attention mechanism |
| CN111259940A (en)* | 2020-01-10 | 2020-06-09 | 杭州电子科技大学 | Target detection method based on space attention map |
| CN111401201A (en)* | 2020-03-10 | 2020-07-10 | 南京信息工程大学 | A spatial-pyramid-attention-driven multi-scale object detection method for aerial imagery |
| Title |
|---|
| Xu Chengqi; Hong Xuehai: "Feature Pyramid Object Detection Network Based on Function Preservation", Pattern Recognition and Artificial Intelligence, no. 06, 15 June 2020 (2020-06-15)* |
| Shen Wenxiang; Qin Pinle; Zeng Jianchao: "Indoor Crowd Detection Network Based on Multi-level Features and Hybrid Attention Mechanism", Journal of Computer Applications, no. 12* |
| Gao Jianling; Sun Jian; Wang Ziniu; Han Yulu; Feng Jiaojiao: "SSD Object Detection Algorithm Based on Attention Mechanism and Feature Fusion", Software, no. 02, 15 February 2020 (2020-02-15)* |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112418345B (en)* | 2020-12-07 | 2024-02-23 | 深圳小阳软件有限公司 | Method and device for quickly identifying small targets with fine granularity |
| CN112418345A (en)* | 2020-12-07 | 2021-02-26 | 苏州小阳软件科技有限公司 | Method and device for quickly identifying fine-grained small target |
| CN112465057A (en)* | 2020-12-08 | 2021-03-09 | 中国人民解放军空军工程大学 | Target detection and identification method based on deep convolutional neural network |
| CN114627368A (en)* | 2020-12-10 | 2022-06-14 | 华东理工大学 | Novel real-time detector for remote sensing image |
| CN112819737B (en)* | 2021-01-13 | 2023-04-07 | 西北大学 | Remote sensing image fusion method of multi-scale attention depth convolution network based on 3D convolution |
| CN112837747B (en)* | 2021-01-13 | 2022-07-12 | 上海交通大学 | A protein binding site prediction method based on an attention Siamese network |
| CN112819737A (en)* | 2021-01-13 | 2021-05-18 | 西北大学 | Remote sensing image fusion method of multi-scale attention depth convolution network based on 3D convolution |
| CN112837747A (en)* | 2021-01-13 | 2021-05-25 | 上海交通大学 | A protein binding site prediction method based on an attention Siamese network |
| CN114821347B (en)* | 2021-01-21 | 2025-06-13 | 南京理工大学 | A remote sensing aircraft target recognition method based on deep feature fusion |
| CN114821347A (en)* | 2021-01-21 | 2022-07-29 | 南京理工大学 | Remote sensing aircraft target recognition method based on deep feature fusion |
| CN113158738B (en)* | 2021-01-28 | 2022-09-20 | 中南大学 | Port environment target detection method, system, terminal and readable storage medium based on attention mechanism |
| CN113158738A (en)* | 2021-01-28 | 2021-07-23 | 中南大学 | Port environment target detection method, system, terminal and readable storage medium based on attention mechanism |
| CN113177579A (en)* | 2021-04-08 | 2021-07-27 | 北京科技大学 | Feature fusion method based on attention mechanism |
| CN113255443A (en)* | 2021-04-16 | 2021-08-13 | 杭州电子科技大学 | Pyramid structure-based method for positioning time sequence actions of graph attention network |
| CN113255443B (en)* | 2021-04-16 | 2024-02-09 | 杭州电子科技大学 | Pyramid-structure-based temporal action localization method using a graph attention network |
| CN113409249A (en)* | 2021-05-17 | 2021-09-17 | 上海电力大学 | Insulator defect detection method based on end-to-end algorithm |
| CN114387202A (en)* | 2021-06-25 | 2022-04-22 | 南京交通职业技术学院 | 3D target detection method based on vehicle-side point cloud and image fusion |
| CN113408549A (en)* | 2021-07-14 | 2021-09-17 | 西安电子科技大学 | Few-sample weak and small target detection method based on template matching and attention mechanism |
| CN113408549B (en)* | 2021-07-14 | 2023-01-24 | 西安电子科技大学 | Few-sample weak and small target detection method based on template matching and attention mechanism |
| CN113807291B (en)* | 2021-09-24 | 2024-04-26 | 南京莱斯电子设备有限公司 | Airport runway foreign matter detection and identification method based on feature fusion attention network |
| CN113807291A (en)* | 2021-09-24 | 2021-12-17 | 南京莱斯电子设备有限公司 | Airport runway foreign matter detection and identification method based on feature fusion attention network |
| CN114419530A (en)* | 2021-12-01 | 2022-04-29 | 国电南瑞南京控制系统有限公司 | Helmet wearing detection algorithm based on improved YOLOv5 |
| CN113920468A (en)* | 2021-12-13 | 2022-01-11 | 松立控股集团股份有限公司 | Multi-branch pedestrian detection method based on cross-scale feature enhancement |
| CN114220015A (en)* | 2021-12-21 | 2022-03-22 | 一拓通信集团股份有限公司 | Improved YOLOv5-based satellite image small target detection method |
| CN114494870B (en)* | 2022-01-21 | 2025-05-30 | 山东科技大学 | A dual-temporal remote sensing image change detection method, model building method and device |
| CN114494870A (en)* | 2022-01-21 | 2022-05-13 | 山东科技大学 | A dual-temporal remote sensing image change detection method, model building method and device |
| CN114782772A (en)* | 2022-04-08 | 2022-07-22 | 河海大学 | A detection and identification method of floating objects on water based on improved SSD algorithm |
| CN114937196A (en)* | 2022-04-22 | 2022-08-23 | 广州大学 | Shadow detection method based on random attention mechanism |
| CN114937196B (en)* | 2022-04-22 | 2025-04-22 | 广州大学 | A shadow detection method based on random attention mechanism |
| CN114972860A (en)* | 2022-05-23 | 2022-08-30 | 郑州轻工业大学 | Target detection method based on attention-enhanced bidirectional feature pyramid network |
| CN115019169A (en)* | 2022-05-31 | 2022-09-06 | 海南大学 | Single-stage water surface small target detection method and device |
| CN115995042A (en)* | 2023-02-09 | 2023-04-21 | 上海理工大学 | A video SAR moving target detection method and device |
| CN119579968A (en)* | 2024-11-14 | 2025-03-07 | 中国科学院自动化研究所 | Image internal texture classification method and texture classification model training method |
| Publication number | Publication date |
|---|---|
| CN111914917B (en) | 2025-01-17 |
| Publication | Title |
|---|---|
| CN111914917A (en) | Improved target detection algorithm based on a feature pyramid network and an attention mechanism |
| CN113065558B (en) | Lightweight small target detection method combined with an attention mechanism |
| CN112200045B (en) | Remote sensing image target detection model establishment method based on context enhancement, and application |
| CN111126202B (en) | Optical remote sensing image object detection method based on a dilated-convolution feature pyramid network |
| CN114612937B (en) | Pedestrian detection method combining infrared and visible light based on single-modality enhancement |
| CN107220611B (en) | Spatio-temporal feature extraction method based on a deep neural network |
| CN107784288B (en) | Iterative-localization face detection method based on a deep neural network |
| CN108549841A (en) | Deep learning-based method for recognizing fall behavior in the elderly |
| CN111783685B (en) | An improved target detection algorithm based on a single-stage network model |
| CN111160249A (en) | Multi-class target detection method in optical remote sensing images based on cross-scale feature fusion |
| CN111738344A (en) | A fast target detection method based on multi-scale fusion |
| CN113887649B (en) | Target detection method based on fusion of deep and shallow features |
| CN117037004A (en) | Unmanned aerial vehicle image detection method based on multi-scale feature fusion and context enhancement |
| CN107292875A (en) | A saliency detection method based on global and local feature fusion |
| CN111860171A (en) | A method and system for detecting irregularly shaped targets in large-scale remote sensing images |
| CN109767456A (en) | A target tracking method based on the SiameseFC framework and a PFP neural network |
| CN109784278A (en) | Deep learning-based real-time detection method for weak and small moving ships at sea |
| CN109271876A (en) | Video action detection method based on temporal evolution modeling and multi-instance learning |
| CN114743023B (en) | An image detection method for wheat spiders based on the RetinaNet model |
| CN108256496A (en) | A video-based stockyard smoke detection method |
| CN115223009A (en) | Small target detection method and device based on improved YOLOv5 |
| CN111340019A (en) | Granary pest detection method based on Faster R-CNN |
| CN110046595A (en) | A cascade-based multi-scale dense face detection method |
| CN117203678A (en) | Target detection method and device |
| CN111428655A (en) | A scalp detection method based on deep learning |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |