Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a remote sensing image multi-scale target detection method based on an attention mechanism, which effectively alleviates the difficulty that traditional detection methods designed for natural images have in accurately locating targets against the complex backgrounds of remote sensing images.
The purpose of the invention is realized by the following technical scheme:
a remote sensing image multi-scale target detection method based on an attention mechanism comprises the following steps:
step one, extracting features of different scales from a remote sensing image through a feature pyramid network, with shallow features extracted by the lower network layers and deep features extracted by the higher network layers;
step two, introducing a global spatial attention module (GSA), which extracts attention information from the shallow features of the lower layers and uses it to optimize the deep features;
step three, passing the features extracted from the last layer through a pixel feature attention module (PFA), so that the receptive field is enlarged and the loss of detail information is reduced;
step four, sending the feature maps of different scales generated in step one to a candidate frame generation network (RPN) to generate candidate frames;
step five, mapping the targets in the candidate frames to feature maps of different scales, deforming the candidate frames on each feature map to a fixed size through ROI pooling, and extracting the features of the relevant targets;
and step six, classifying the candidate frames and performing frame regression.
Further, step one specifically comprises the following: a feature pyramid network is adopted, comprising a bottom-up part and a top-down part; the bottom-up part is a process of convolution and pooling in which the resolution is continuously reduced, with each block taken as a stage for a total of five stages; the top-down process upsamples the feature map of stage 5 and then makes a horizontal connection with the feature map of the same size generated bottom-up; finally, each fusion result is convolved with a 3x3 convolution kernel to eliminate the aliasing effect of upsampling.
Further, step two is an operation carried out during the bottom-up feature extraction of step one: first, the shallow features are subjected to global average pooling and global maximum pooling, and the two resulting attention maps are fused by pixel-wise addition to generate a global attention map; finally, the global attention map is multiplied with the deep feature map.
Furthermore, step three introduces a pixel feature attention module after stage 5 of the bottom-up path in step one, assigning a weight to each layer of the feature map in the manner of channel attention; global information and detail information are fused, improving the network's ability to locate the target area.
Further, in step four the candidate frame generation network processes the extracted convolution feature maps of different scales with anchors to extract candidate frames of different sizes, searching for a predefined number of areas possibly containing targets.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The remote sensing image multi-scale target detection method based on the attention mechanism provides a new idea for target detection in remote sensing images; by introducing the attention mechanism, the network can locate the position of a target more accurately, and the method generalizes well, also performing well on target detection in natural images.
2. The GSA module effectively reduces the information loss of small targets, improves the training speed, accelerates the convergence of the network, and achieves a better detection effect on targets of different sizes.
3. The PFA module enlarges the receptive field, reduces false detections without increasing the amount of computation, and effectively improves the accuracy of target detection.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a remote sensing image multi-scale target detection model based on an attention mechanism. Figure 1 is a schematic diagram of the overall structure of the invention, and the two attention modules are shown in figures 2 and 3. The overall processing steps are as follows:
Step S1.1: ResNet101 is adopted as the feature extraction network, and shallow and deep features are extracted through convolution and pooling; the shallow features have high resolution and sit in the lower layers of the network, while the deep features have low resolution and sit in the upper layers.
Step S2.1: by extracting the global spatial attention of the shallow layer and fusing the global spatial attention with the deep layer features, the problem of small object information loss caused by low resolution of the deep layer features is optimized. For F ∈ RH×W×CRespectively carrying out global average pooling and global maximum pooling on the dimension C to obtain F' epsilon RH×W×1Then fused by means of pixel-by-pixel addition, as shown in formula (1):
wherein F is a shallow feature map, F
gsaFor the output result of global spatial attention, sigmoid is sigmoid activation function, AvgPool is global average pooling, MaxPool is global maximum pooling,
is added pixel by pixel.
Step S2.2: downsampling the shallow features to the same size as the deep features, and multiplying by them, i.e. transferring more accurate position information to the deep features, as shown in equation (2):
wherein down represents down sampling, F' is a deep feature map, F
gsaIs a global attention map for shallow features,
is pixel-by-pixel multiplication.
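As a concrete illustration, the following is a minimal PyTorch sketch of the GSA computation in formulas (1) and (2); the module and variable names are illustrative, and the bilinear interpolation used for the downsampling in formula (2) is an assumption, since the patent does not specify the downsampling operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn


class GlobalSpatialAttention(nn.Module):
    """Sketch of the GSA computation in formulas (1) and (2)."""

    def forward(self, shallow, deep):
        # Formula (1): pool the shallow map along the channel axis to two
        # H x W x 1 maps, fuse by pixel-wise addition, then apply sigmoid.
        avg_map = shallow.mean(dim=1, keepdim=True)        # global average pooling over C
        max_map = shallow.max(dim=1, keepdim=True).values  # global max pooling over C
        f_gsa = torch.sigmoid(avg_map + max_map)           # F_gsa, shape (N, 1, H, W)

        # Formula (2): downsample F_gsa to the deep map's spatial size and
        # multiply pixel-wise, passing shallow position cues to deep features.
        f_gsa = F_nn.interpolate(f_gsa, size=deep.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return deep * f_gsa
```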
Step S3.1: the channel attention branch of the pixel feature attention is calculated. In order to obtain global information of each channel of the feature map, the feature map is subjected to global average pooling, the size of the feature map is changed to be 1 × 1 × C, then 1 × 1 convolution is performed to integrate the feature map, and the calculation formula is shown as (3):
M_G(F) = σ(f_{1×1}(AvgPool(F)))  (3)

wherein f_{1×1} denotes a 1×1 convolution, AvgPool is global average pooling, σ is the ReLU activation function, and M_G(F) is the output of the channel attention branch.
Step S3.2: the convolution branch of the pixel feature attention is calculated. The feature map is convolved by 1 × 1, as shown in equation (4):
M_C(F) = f_{1×1}(F)  (4)

wherein f_{1×1} denotes a convolution with a 1×1 kernel, and M_C(F) is the output of the convolution branch.
Step S3.3: and calculating the multi-scale convolution branch of the attention of the pixel feature. A convolution kernel of 5x5 with a step size of 2 is applied to the input to obtain a feature map of halved size. Then, using 5x5, convolution kernels withstep size 1 and convolution kernels withsize 3 andstep sizes 2 and 1, respectively, we get the feature maps asinput sizes 1/2 and 1/4, respectively. And finally, upsampling the smaller feature map, fusing the smaller feature map and the smaller feature map together, and then, upsampling again to restore the size of the original feature map, wherein the formula is shown as (5):
where up represents upsampling, f
3×3And f
5×5Respectively denoted as convolutions with convolution kernels of 3 and 5,
for pixel-by-pixel addition, M
m(F) Is the output of the multi-scale convolution branch.
Step S3.4: inputting the feature map F into feature maps obtained from different branches of the pixel feature attention module for fusion, as shown in formulas (6) and (7):
Fout=M(F) (7)
where sigma is the relu activation function,
in order to add the pixels one by one,
for pixel-by-pixel multiplication, M
G(F)、M
C(F)、M
m(F) Respectively obtaining results of a feature graph F through a global channel attention branch, a convolution branch and a multi-scale convolution branch, wherein F is the result of the feature graph F
outIs the output of the pixel feature attention module.
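The following PyTorch sketch gathers the three branches and the fusion of formulas (3)-(7) into one module. The kernel paddings, the bilinear upsampling mode, and the exact fusion order reconstructed in formula (6) are assumptions made for the sketch, not details confirmed by the patent.

```python
import torch.nn as nn
import torch.nn.functional as F_nn


class PixelFeatureAttention(nn.Module):
    """Sketch of the PFA module, formulas (3)-(7)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1x1_g = nn.Conv2d(channels, channels, 1)  # formula (3)
        self.conv1x1_c = nn.Conv2d(channels, channels, 1)  # formula (4)
        # Multi-scale branch, formula (5)
        self.conv5_s2 = nn.Conv2d(channels, channels, 5, stride=2, padding=2)
        self.conv3_s2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.conv5_s1 = nn.Conv2d(channels, channels, 5, stride=1, padding=2)
        self.conv3_s1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, x):
        # Channel attention branch, formula (3): 1 x 1 x C global descriptor.
        g = F_nn.relu(self.conv1x1_g(F_nn.adaptive_avg_pool2d(x, 1)))
        # Convolution branch, formula (4).
        c = self.conv1x1_c(x)
        # Multi-scale branch, formula (5): 1/2- and 1/4-scale context, fused upward.
        half = self.conv5_s2(x)        # 1/2 of the input size
        quarter = self.conv3_s2(half)  # 1/4 of the input size
        m = F_nn.interpolate(self.conv3_s1(quarter), size=half.shape[-2:],
                             mode="bilinear", align_corners=False)
        m = self.conv5_s1(half) + m
        m = F_nn.interpolate(m, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)
        # Fusion, formulas (6)-(7): channel weights g scale the fused spatial maps,
        # and the result re-weights the input feature map pixel by pixel.
        return x * F_nn.relu(g * (c + m))
```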
Step S3.5: the top-down process starts with the feature map ofstage 5 being upsampled and then cross-concatenated with the same size feature map generated from the bottom-up. And performing convolution on each fusion result by adopting a convolution kernel of 3x3 to generate feature maps with different scales.
Step S4.1: the feature maps of different scales are convoluted by adopting a convolution kernel of 3x3, and a plurality of regions with different length-width ratios of different scales are learned by taking the central point of each region of the feature maps as the center, and the regions are called anchors. And calculating the intersection ratio IOU of each anchor and the true value, wherein the IOU is called positive anchor which is more than 0.5 as shown in an equation (8), and the IOU is called negative anchor otherwise. The area size set by the method anchor is 322,642,1282,2242,2562And the length-width ratios of the three different components are {1:1,1:2,2:1 }.
Wherein, box represents the extracted candidate frame, and gt is the real frame of each target.
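A short sketch of the IOU computation of formula (8) follows, assuming boxes given in (x1, y1, x2, y2) corner format:

```python
import torch


def iou(box, gt):
    """Formula (8): intersection over union between candidate frame(s) and
    ground-truth frame(s), both in (x1, y1, x2, y2) corner format."""
    x1 = torch.maximum(box[..., 0], gt[..., 0])
    y1 = torch.maximum(box[..., 1], gt[..., 1])
    x2 = torch.minimum(box[..., 2], gt[..., 2])
    y2 = torch.minimum(box[..., 3], gt[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    union = ((box[..., 2] - box[..., 0]) * (box[..., 3] - box[..., 1])
             + (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1]) - inter)
    return inter / union

# An anchor is labeled positive when iou(anchor, gt) > 0.5, negative otherwise.
```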
Step S4.2: and mapping the candidate frame to the feature map by using the candidate frame generated by the RPN network in the step S4.1 and the feature map with different scales generated in the step S3.
Step S5: and deforming the candidate frame in the feature map in the step S4.2 to a fixed size, and then performing full-connection operation to perform prediction. For each candidate frame, selecting a stage for prediction according to the size of the candidate frame, as shown in formula (9):
where w and h represent the width and height of the candidate box, respectively. Parameter k0Is a positive integer, and represents that w × h is 1282In the case of (2), the target stage to which the candidate box should be mapped is generally fixed to 4 in the experiment. Reference area 128 is determined by the size of the data set. k is the stage intended to detect the target.
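The stage selection of formula (9) can be sketched as follows; the clamping of k to stages 2 through 5 is an assumption, since the text does not state the valid stage range:

```python
import math


def roi_to_stage(w, h, k0=4, ref=128):
    """Formula (9): pick the pyramid stage for a candidate frame of width w
    and height h. k0 is the stage for a frame of the reference area
    ref x ref (128 here, chosen from the dataset); larger frames map to
    deeper stages."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / ref))
    return max(2, min(5, k))  # clamp to the available stages (assumed 2..5)
```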
Step S5.2: the classification of the specific category is performed using softmax, as shown in equation (10):
L_cls(p_i, p_i*) = -log[p_i* · p_i + (1 - p_i*)(1 - p_i)]  (10)

wherein p_i is the probability predicted for the anchor being a target, p_i* is the ground-truth label, and L_cls(p_i, p_i*) is the classification loss function.
Step S5.3: the accurate position of the object is obtained by performing bounding box regression with L1 loss, as shown in formula (11):
wherein t isi={tx,ty,tw,thIs a vector representing the 4 coordinates of the predicted box, ti*Is the coordinate vector, p, of the real box corresponding to posivetachoriPredicting the probability of being the target, p, for the Anchori*Labels for true values, Lreg(ti,ti*) Is the bounding box regression loss function.
Step S5.4: performing combined training of classification and bounding box regression, wherein the total loss function is shown as formula (12):
wherein p isiPredicting the probability of being the target, p, for the Anchori*Tag for true value, ti={tx,ty,tw,thIs a vector representing the 4 coordinates of the predicted box, ti*Is the coordinate vector of the real frame corresponding to the positive anchor, Lcls(pi,pi*) As a function of classification loss, Lreg(ti,ti*) As a function of the regression loss of the bounding box, λ is the balance weight, NclsIs the size of mini-batch,NregIs the number of anchors.
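The joint loss of formulas (10)-(12) can be sketched as follows; the default values of lam, n_cls and n_reg are placeholders, and the smooth L1 form of L_reg follows the reconstruction of formula (11):

```python
import torch
import torch.nn.functional as F_nn


def total_loss(p, p_star, t, t_star, lam=1.0, n_cls=256, n_reg=2400):
    """Formula (12): joint classification + box regression loss.
    p: predicted objectness probabilities in (0, 1); p_star: 0/1 labels;
    t, t_star: predicted and ground-truth box offsets, shape (N, 4).
    lam, n_cls, n_reg follow the text: balance weight, mini-batch size,
    number of anchors (defaults here are placeholders)."""
    # Formula (10): binary cross-entropy written out as in the text.
    l_cls = -(p_star * torch.log(p) + (1 - p_star) * torch.log(1 - p)).sum()
    # Formula (11): smooth L1 regression, counted only for positive anchors.
    l_reg = (p_star.unsqueeze(1) *
             F_nn.smooth_l1_loss(t, t_star, reduction="none")).sum()
    return l_cls / n_cls + lam * l_reg / n_reg
```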
In order to verify that the method generalizes well, experiments were performed on the remote sensing datasets NWPU (10 target classes) and RSOD (4 target classes) and on the natural image dataset PASCAL VOC07 (20 target classes). Each remote sensing dataset was randomly partitioned into training and test sets at a ratio of 8:2.
The Average Precision (AP) of each category and the mean Average Precision (mAP) over all categories are used as evaluation criteria to quantify the detection effect, with larger values being better. As shown in Table 1, the experimental results show that the proposed model achieves the best detection effect on both remote sensing images and natural images. On the NWPU dataset, the detection effect is greatly improved for small targets such as storage tanks and ships and for elongated targets such as bridges and harbors, an improvement of 3.75% over the FPN method; on the RSOD dataset, the detection effect of each category improves to a different degree; on PASCAL VOC07, SSD300 and YOLOv3 use both PASCAL VOC07 and VOC12 as training and test sets while the other models train only on PASCAL VOC07, so the AMOD model achieves the best effect with a smaller training set, 0.8% higher than YOLOv3 and 1.3% higher than FPN.
TABLE 1 Comparison of detection results of different methods on the NWPU-VHR10, RSOD and PASCAL VOC07 datasets
Table 1 compares the mAP of the proposed method with the most advanced current methods, showing that for both remote sensing images and natural images, the attention-based multi-scale remote sensing target detection method of the invention effectively improves the accuracy of target detection. In particular, on the NWPU dataset it improves by 9.2% over Faster R-CNN and by 5.2% over YOLOv3.
In terms of training speed, the attention-based multi-scale target detection method locates target areas more quickly, improving the training speed and accelerating convergence. As shown in Table 2, the improved AMOD method is significantly faster than the baseline FPN method: training speed increased by 37.5% on the NWPU dataset, 20% on RSOD and 35% on PASCAL VOC07.
TABLE 2 comparison of training speeds of different methods on NWPU, RSOD, PASCAL VOC07 data sets
| Dataset | Number of FPN iterations | Number of AMOD iterations |
| NWPU | 40000 | 25000 |
| RSOD | 50000 | 40000 |
| PASCAL VOC 07 | 200000 | 130000 |
Partial detection results on remote sensing images are shown in figures 4a and 4b. It can be seen that the FPN method fails to detect all the bridges and tennis courts, while the improved AMOD method of the invention, owing to the added pixel feature attention module, enlarges the receptive field, reduces missed detections, and effectively improves the target detection accuracy.
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.