Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a remote sensing image multi-scale target detection method based on an attention mechanism, which effectively alleviates the difficulty that traditional detection methods designed for natural images have in accurately locating targets against the complex backgrounds of remote sensing images.
The purpose of the invention is realized by the following technical scheme:
a remote sensing image multi-scale target detection method based on an attention mechanism comprises the following steps:
step one, extracting features of different scales from a remote sensing image through a feature pyramid network, with shallow features extracted by the lower network layers and deep features extracted by the higher network layers;
step two, introducing a global spatial attention module (GSA), which extracts attention information from the shallow features of the lower layers and uses it to optimize the deep features;
step three, passing the features extracted from the last layer through a pixel feature attention module (PFA), so that the receptive field is enlarged and the loss of detail information is reduced;
step four, sending the feature maps of different scales generated in step one to a candidate frame generation network (RPN) to generate candidate frames;
step five, mapping the targets in the candidate frames to feature maps of different scales, deforming the candidate frames on each feature map to a fixed size through ROI pooling, and extracting the features of the relevant targets;
and step six, classifying the candidate frames and performing frame regression.
Further, step one specifically comprises the following: a feature pyramid network is adopted, comprising a bottom-up part and a top-down part; the bottom-up part is a process of convolution and pooling in which the resolution is continuously reduced, with each block taken as a stage for a total of five stages; the top-down process upsamples the feature map of stage 5 and then makes a horizontal connection with the feature map of the same size generated bottom-up; finally, each fusion result is convolved with a 3x3 convolution kernel to eliminate the aliasing effect of upsampling.
Further, step two is an operation carried out during the bottom-up feature extraction of step one: first, the shallow features are subjected to global average pooling and global maximum pooling, and the two resulting attention maps are fused by pixel-wise addition to generate a global attention map; finally, the global attention map is multiplied with the deep feature map.
Furthermore, step three introduces a pixel feature attention module after stage 5 of the bottom-up path in step one, assigning a weight to each layer of the feature map in the manner of channel attention; global information and detail information are fused, improving the network's ability to locate the target area.
Further, in step four the candidate frame generation network processes the extracted convolution feature maps of different scales with anchors to extract candidate frames of different sizes, searching for a predefined number of areas possibly containing targets.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The remote sensing image multi-scale target detection method based on the attention mechanism provides a new idea for target detection in remote sensing images; by introducing the attention mechanism, the network can locate the position of a target more accurately, and the method generalizes well, also performing well on target detection in natural images.
2. The GSA module effectively reduces the information loss of small targets, improves the training speed, accelerates the convergence of the network, and achieves a better detection effect on targets of different sizes.
3. The PFA module enlarges the receptive field, reduces false detections without increasing the amount of computation, and effectively improves the accuracy of target detection.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a remote sensing image multi-scale target detection model based on an attention mechanism. Figure 1 is a schematic diagram of the overall structure of the invention, and the two attention modules are shown in figures 2 and 3. The overall processing steps are as follows:
Step S1.1: ResNet101 is adopted as the feature extraction network, and shallow and deep features are extracted through convolution and pooling; the shallow features have high resolution and sit in the lower layers of the network, while the deep features have low resolution and sit in the upper layers.
Step S2.1: by extracting the global spatial attention of the shallow layer and fusing the global spatial attention with the deep layer features, the problem of small object information loss caused by low resolution of the deep layer features is optimized. For F ∈ RH×W×CRespectively carrying out global average pooling and global maximum pooling on the dimension C to obtain F' epsilon RH×W×1Then fused by means of pixel-by-pixel addition, as shown in formula (1):
wherein F is a shallow feature map, F
gsaFor the output result of global spatial attention, sigmoid is sigmoid activation function, AvgPool is global average pooling, MaxPool is global maximum pooling,
is added pixel by pixel.
Step S2.2: downsampling the shallow features to the same size as the deep features, and multiplying by them, i.e. transferring more accurate position information to the deep features, as shown in equation (2):
wherein down represents down sampling, F' is a deep feature map, F
gsaIs a global attention map for shallow features,
is pixel-by-pixel multiplication.
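As a concrete illustration, the following is a minimal PyTorch sketch of the GSA computation in formulas (1) and (2); the module and variable names are illustrative, and the bilinear interpolation used for the downsampling in formula (2) is an assumption, since the patent does not specify the downsampling operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn


class GlobalSpatialAttention(nn.Module):
    """Sketch of the GSA computation in formulas (1) and (2)."""

    def forward(self, shallow, deep):
        # Formula (1): pool the shallow map along the channel axis to two
        # H x W x 1 maps, fuse by pixel-wise addition, then apply sigmoid.
        avg_map = shallow.mean(dim=1, keepdim=True)        # global average pooling over C
        max_map = shallow.max(dim=1, keepdim=True).values  # global max pooling over C
        f_gsa = torch.sigmoid(avg_map + max_map)           # F_gsa, shape (N, 1, H, W)

        # Formula (2): downsample F_gsa to the deep map's spatial size and
        # multiply pixel-wise, passing shallow position cues to deep features.
        f_gsa = F_nn.interpolate(f_gsa, size=deep.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return deep * f_gsa
```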
Step S3.1: the channel attention branch of the pixel feature attention is calculated. In order to obtain global information of each channel of the feature map, the feature map is subjected to global average pooling, the size of the feature map is changed to be 1 × 1 × C, then 1 × 1 convolution is performed to integrate the feature map, and the calculation formula is shown as (3):
M_G(F) = σ(f_{1×1}(AvgPool(F)))  (3)

wherein f_{1×1} denotes a 1×1 convolution, AvgPool is global average pooling, σ is the ReLU activation function, and M_G(F) is the output of the channel attention branch.
Step S3.2: the convolution branch of the pixel feature attention is calculated. The feature map is convolved by 1 × 1, as shown in equation (4):
M_C(F) = f_{1×1}(F)  (4)

wherein f_{1×1} denotes a convolution with a 1×1 kernel, and M_C(F) is the output of the convolution branch.
Step S3.3: and calculating the multi-scale convolution branch of the attention of the pixel feature. A convolution kernel of 5x5 with a step size of 2 is applied to the input to obtain a feature map of halved size. Then, using 5x5, convolution kernels withstep size 1 and convolution kernels withsize 3 andstep sizes 2 and 1, respectively, we get the feature maps asinput sizes 1/2 and 1/4, respectively. And finally, upsampling the smaller feature map, fusing the smaller feature map and the smaller feature map together, and then, upsampling again to restore the size of the original feature map, wherein the formula is shown as (5):
where up represents upsampling, f
3×3And f
5×5Respectively denoted as convolutions with convolution kernels of 3 and 5,
for pixel-by-pixel addition, M
m(F) Is the output of the multi-scale convolution branch.
Step S3.4: inputting the feature map F into feature maps obtained from different branches of the pixel feature attention module for fusion, as shown in formulas (6) and (7):
Fout=M(F) (7)
where sigma is the relu activation function,
in order to add the pixels one by one,
for pixel-by-pixel multiplication, M
G(F)、M
C(F)、M
m(F) Respectively obtaining results of a feature graph F through a global channel attention branch, a convolution branch and a multi-scale convolution branch, wherein F is the result of the feature graph F
outIs the output of the pixel feature attention module.
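The following PyTorch sketch gathers the three branches and the fusion of formulas (3)-(7) into one module. The kernel paddings, the bilinear upsampling mode, and the exact fusion order reconstructed in formula (6) are assumptions made for the sketch, not details confirmed by the patent.

```python
import torch.nn as nn
import torch.nn.functional as F_nn


class PixelFeatureAttention(nn.Module):
    """Sketch of the PFA module, formulas (3)-(7)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1x1_g = nn.Conv2d(channels, channels, 1)  # formula (3)
        self.conv1x1_c = nn.Conv2d(channels, channels, 1)  # formula (4)
        # Multi-scale branch, formula (5)
        self.conv5_s2 = nn.Conv2d(channels, channels, 5, stride=2, padding=2)
        self.conv3_s2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.conv5_s1 = nn.Conv2d(channels, channels, 5, stride=1, padding=2)
        self.conv3_s1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, x):
        # Channel attention branch, formula (3): 1 x 1 x C global descriptor.
        g = F_nn.relu(self.conv1x1_g(F_nn.adaptive_avg_pool2d(x, 1)))
        # Convolution branch, formula (4).
        c = self.conv1x1_c(x)
        # Multi-scale branch, formula (5): 1/2- and 1/4-scale context, fused upward.
        half = self.conv5_s2(x)        # 1/2 of the input size
        quarter = self.conv3_s2(half)  # 1/4 of the input size
        m = F_nn.interpolate(self.conv3_s1(quarter), size=half.shape[-2:],
                             mode="bilinear", align_corners=False)
        m = self.conv5_s1(half) + m
        m = F_nn.interpolate(m, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)
        # Fusion, formulas (6)-(7): channel weights g scale the fused spatial maps,
        # and the result re-weights the input feature map pixel by pixel.
        return x * F_nn.relu(g * (c + m))
```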
Step S3.5: the top-down process starts with the feature map ofstage 5 being upsampled and then cross-concatenated with the same size feature map generated from the bottom-up. And performing convolution on each fusion result by adopting a convolution kernel of 3x3 to generate feature maps with different scales.
Step S4.1: the feature maps of different scales are convoluted by adopting a convolution kernel of 3x3, and a plurality of regions with different length-width ratios of different scales are learned by taking the central point of each region of the feature maps as the center, and the regions are called anchors. And calculating the intersection ratio IOU of each anchor and the true value, wherein the IOU is called positive anchor which is more than 0.5 as shown in an equation (8), and the IOU is called negative anchor otherwise. The area size set by the method anchor is 322,642,1282,2242,2562And the length-width ratios of the three different components are {1:1,1:2,2:1 }.
Wherein, box represents the extracted candidate frame, and gt is the real frame of each target.
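A short sketch of the IOU computation of formula (8) follows, assuming boxes given in (x1, y1, x2, y2) corner format:

```python
import torch


def iou(box, gt):
    """Formula (8): intersection over union between candidate frame(s) and
    ground-truth frame(s), both in (x1, y1, x2, y2) corner format."""
    x1 = torch.maximum(box[..., 0], gt[..., 0])
    y1 = torch.maximum(box[..., 1], gt[..., 1])
    x2 = torch.minimum(box[..., 2], gt[..., 2])
    y2 = torch.minimum(box[..., 3], gt[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    union = ((box[..., 2] - box[..., 0]) * (box[..., 3] - box[..., 1])
             + (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1]) - inter)
    return inter / union

# An anchor is labeled positive when iou(anchor, gt) > 0.5, negative otherwise.
```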
Step S4.2: and mapping the candidate frame to the feature map by using the candidate frame generated by the RPN network in the step S4.1 and the feature map with different scales generated in the step S3.
Step S5: and deforming the candidate frame in the feature map in the step S4.2 to a fixed size, and then performing full-connection operation to perform prediction. For each candidate frame, selecting a stage for prediction according to the size of the candidate frame, as shown in formula (9):
where w and h represent the width and height of the candidate box, respectively. Parameter k0Is a positive integer, and represents that w × h is 1282In the case of (2), the target stage to which the candidate box should be mapped is generally fixed to 4 in the experiment. Reference area 128 is determined by the size of the data set. k is the stage intended to detect the target.
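The stage selection of formula (9) can be sketched as follows; the clamping of k to stages 2 through 5 is an assumption, since the text does not state the valid stage range:

```python
import math


def roi_to_stage(w, h, k0=4, ref=128):
    """Formula (9): pick the pyramid stage for a candidate frame of width w
    and height h. k0 is the stage for a frame of the reference area
    ref x ref (128 here, chosen from the dataset); larger frames map to
    deeper stages."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / ref))
    return max(2, min(5, k))  # clamp to the available stages (assumed 2..5)
```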
Step S5.2: the classification of the specific category is performed using softmax, as shown in equation (10):
L_cls(p_i, p_i*) = -log[p_i* · p_i + (1 - p_i*)(1 - p_i)]  (10)

wherein p_i is the probability predicted for the anchor being a target, p_i* is the ground-truth label, and L_cls(p_i, p_i*) is the classification loss function.
Step S5.3: the accurate position of the object is obtained by performing bounding box regression with L1 loss, as shown in formula (11):
wherein t isi={tx,ty,tw,thIs a vector representing the 4 coordinates of the predicted box, ti*Is the coordinate vector, p, of the real box corresponding to posivetachoriPredicting the probability of being the target, p, for the Anchori*Labels for true values, Lreg(ti,ti*) Is the bounding box regression loss function.
Step S5.4: performing combined training of classification and bounding box regression, wherein the total loss function is shown as formula (12):
wherein p isiPredicting the probability of being the target, p, for the Anchori*Tag for true value, ti={tx,ty,tw,thIs a vector representing the 4 coordinates of the predicted box, ti*Is the coordinate vector of the real frame corresponding to the positive anchor, Lcls(pi,pi*) As a function of classification loss, Lreg(ti,ti*) As a function of the regression loss of the bounding box, λ is the balance weight, NclsIs the size of mini-batch,NregIs the number of anchors.
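The joint loss of formulas (10)-(12) can be sketched as follows; the default values of lam, n_cls and n_reg are placeholders, and the smooth L1 form of L_reg follows the reconstruction of formula (11):

```python
import torch
import torch.nn.functional as F_nn


def total_loss(p, p_star, t, t_star, lam=1.0, n_cls=256, n_reg=2400):
    """Formula (12): joint classification + box regression loss.
    p: predicted objectness probabilities in (0, 1); p_star: 0/1 labels;
    t, t_star: predicted and ground-truth box offsets, shape (N, 4).
    lam, n_cls, n_reg follow the text: balance weight, mini-batch size,
    number of anchors (defaults here are placeholders)."""
    # Formula (10): binary cross-entropy written out as in the text.
    l_cls = -(p_star * torch.log(p) + (1 - p_star) * torch.log(1 - p)).sum()
    # Formula (11): smooth L1 regression, counted only for positive anchors.
    l_reg = (p_star.unsqueeze(1) *
             F_nn.smooth_l1_loss(t, t_star, reduction="none")).sum()
    return l_cls / n_cls + lam * l_reg / n_reg
```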
In order to verify that the method generalizes well, experiments were performed on the remote sensing datasets NWPU (10 target classes) and RSOD (4 target classes) and on the natural image dataset PASCAL VOC07 (20 target classes). Each remote sensing dataset was randomly partitioned into training and test sets at a ratio of 8:2.
The Average Precision (AP) of each category and the mean Average Precision (mAP) over all categories are used as evaluation criteria to quantify the detection effect, with larger values being better. As shown in Table 1, the experimental results show that the proposed model achieves the best detection effect on both remote sensing images and natural images. On the NWPU dataset, the detection effect is greatly improved for small targets such as storage tanks and ships and for elongated targets such as bridges and harbors, an improvement of 3.75% over the FPN method; on the RSOD dataset, the detection effect of each category improves to a different degree; on PASCAL VOC07, SSD300 and YOLOv3 use both PASCAL VOC07 and VOC12 as training and test sets while the other models train only on PASCAL VOC07, so the AMOD model achieves the best effect with a smaller training set, 0.8% higher than YOLOv3 and 1.3% higher than FPN.
TABLE 1 Comparison of detection results of different methods on the NWPU-VHR10, RSOD and PASCAL VOC07 datasets
Table 1 compares the mAP of the proposed method with the most advanced current methods, showing that for both remote sensing images and natural images, the attention-based multi-scale remote sensing target detection method of the invention effectively improves the accuracy of target detection. In particular, on the NWPU dataset it improves by 9.2% over Faster R-CNN and by 5.2% over YOLOv3.
In terms of training speed, the attention-based multi-scale target detection method locates target areas more quickly, improving the training speed and accelerating convergence. As shown in Table 2, the improved AMOD method is significantly faster than the baseline FPN method: training speed increased by 37.5% on the NWPU dataset, 20% on RSOD and 35% on PASCAL VOC07.
TABLE 2 comparison of training speeds of different methods on NWPU, RSOD, PASCAL VOC07 data sets
| Dataset | Number of FPN iterations | Number of AMOD iterations |
| NWPU | 40000 | 25000 |
| RSOD | 50000 | 40000 |
| PASCAL VOC 07 | 200000 | 130000 |
Partial detection results on remote sensing images are shown in figures 4a and 4b. It can be seen that the FPN method fails to detect all the bridges and tennis courts, while the improved AMOD method of the invention, owing to the added pixel feature attention module, enlarges the receptive field, reduces missed detections, and effectively improves the target detection accuracy.
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.