CN114882352A - Aerial remote sensing image identification method based on multilayer and regional characteristic fusion - Google Patents

Aerial remote sensing image identification method based on multilayer and regional characteristic fusion
Download PDF

Info

Publication number
CN114882352A
CN114882352A
Authority
CN
China
Prior art keywords
feature
fusion
feature map
remote sensing
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210340880.1A
Other languages
Chinese (zh)
Other versions
CN114882352B (en)
Inventor
孙涵
刘宇泽
李明洋
王恩浩
康巨涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202210340880.1A
Publication of CN114882352A
Application granted
Publication of CN114882352B
Active
Anticipated expiration

Links

Images

Classifications

Landscapes

Abstract

The invention discloses an aerial remote sensing image recognition method based on multilayer and regional feature fusion, which belongs to the field of fine-grained image recognition. The backbone network ResNet-50 is improved in a targeted way according to the characteristics of aerial remote sensing images, such as fast update speed, blurred image details and low resolution. The method adopts a novel combination of multilayer fusion and regional feature fusion: attention weights are attached to the multilayer fusion, giving the model the ability to assign the fusion proportions by itself, so that the selection of detail parts in shallow feature maps is flexible and the local detail feature information of the remote sensing image is better preserved and utilized, while the region-of-interest extraction strategy reduces the computational cost and parameter quantity of regional feature fusion.

Description

Translated from Chinese
An aerial remote sensing image recognition method based on multi-layer and regional feature fusion

Technical Field

The invention belongs to the field of fine-grained image recognition, and in particular relates to an aerial remote sensing image recognition method based on multi-layer and regional feature fusion.

Background Art

In recent years, image recognition has been widely applied in all aspects of life, including face-based identity recognition, flower recognition software, similar-commodity search and so on, making people's lives more convenient. However, as image recognition applications deepen, recognition is no longer limited to everyday objects; recognizing images of different viewpoints and types, such as aerial remote sensing images, has become an urgent problem for current research.

At present, feature learning has been shown to achieve better recognition results in the field of fine-grained image recognition, and there are two main research directions. The first relies on image datasets carrying annotation information such as manually labeled bounding boxes. This strongly supervised approach saves a great deal of effort in selecting discriminative image regions, and feature extraction and classification based on already annotated regions and labels often perform better, but manual annotation of feature information incurs an enormous labor cost, and practical application becomes difficult as datasets keep growing. The Part-based RCNN proposed by Zhang Ning et al. processes the annotated part regions in an image: features extracted from the annotated regions are first scored by an SVM classifier with geometric constraints using a geometric scoring function, so as to select the root and part regions, and image recognition is then performed through conventional feature extraction and classification. The second direction is weakly supervised learning without manually annotated bounding boxes or labels, which uses attention mechanisms and similar methods to autonomously learn the discriminative information in randomly extracted feature regions; although the labor cost is saved, letting the model generate regions of interest by itself often leads to excessive overall computational overhead. The context-aware attentional pooling (CAP) method proposed by Ardhendu Behera et al. operates on the deepest feature map extracted by the backbone network, randomly extracts regions, and uses the contextual relationships between the regions together with the spatial position relationships obtained by a long short-term memory (LSTM) network as the basis for classification.

However, aerial remote sensing images differ from conventional datasets: they have low resolution, wide coverage and a large amount of data, which makes features inconspicuous; at the same time, compared with conventional objects they have no concept of parts, suffer large image deformation, are shot from a top-down angle and contain many irrelevant interference areas, so they are difficult to recognize. The reasons why the Part-based RCNN and CAP methods perform poorly on aerial remote sensing image recognition tasks are listed below:

1) For the Part-based RCNN method, aerial images involve a large amount of data and frequent updates, so the demand for manually annotated information is extremely high, leading to high labor costs, which makes it unsuitable for long-term application to this kind of image recognition task.

2) For the CAP method, the large amount of data and frequent updates of aerial remote sensing images mean that the CAP method, whose network structure is itself complex, suffers from excessive overall computation; in the step where the model autonomously extracts regions of interest, a large amount of redundant information is produced, which causes additional computational overhead and easily wastes resources and time when training on the whole dataset.

3) For the CAP method, aerial remote sensing images have low resolution, wide coverage and relatively blurred details; using only the deepest feature map after backbone fusion causes serious loss of shallow detail feature information, which affects the recognition results.

SUMMARY OF THE INVENTION

The invention provides an aerial remote sensing image recognition method based on multi-layer and regional feature fusion, which solves the problems of aerial image recognition tasks in the prior art. Based on the ResNet-50 backbone network, the method mainly uses attention mechanisms to process image features and achieves better recognition results in aerial image recognition tasks.

To achieve the above object, the present invention adopts the following technical solution:

An aerial remote sensing image recognition method based on multi-layer and regional feature fusion, comprising the following steps:

Step (1) Aerial image dataset preparation: preprocess the dataset, pad the images and then randomly crop them, and apply random rotation and horizontal flipping for data augmentation;

Step (2) Building the image recognition model: train the image recognition model on the aerial image dataset;

Step (3) Test image detection process: use the trained image recognition network and its weight parameters to recognize the test images and output the predicted categories.

Among the above steps, step (2) specifically includes the following steps:

Step (2.1) Obtain feature layers from the backbone network:

From the feature maps {O_i | i=1,2,3,4,5} of the backbone network ResNet-50, i.e. the features of the 1st, 2nd, 3rd, 4th and 5th stages of ResNet-50, select {O_i | i=2,3,4,5} as the feature layers for multi-layer feature fusion; the numbers of channels corresponding to these feature layers are {256, 512, 1024, 2048};
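As a rough illustration of step (2.1), the following PyTorch sketch shows one way to pull the four intermediate feature maps out of a standard torchvision ResNet-50 (torchvision ≥ 0.13 API assumed); the module and variable names are chosen here for illustration and are not taken from the patent.

```python
import torch
from torchvision.models import resnet50

class MultiLayerBackbone(torch.nn.Module):
    """Wraps ResNet-50 and returns the outputs of stages 2-5 (O2..O5)."""
    def __init__(self, pretrained=True):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1" if pretrained else None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2   # 256, 512 channels
        self.layer3, self.layer4 = net.layer3, net.layer4   # 1024, 2048 channels

    def forward(self, x):
        x = self.stem(x)
        o2 = self.layer1(x)    # (B,  256, 56, 56) for a 224x224 input
        o3 = self.layer2(o2)   # (B,  512, 28, 28)
        o4 = self.layer3(o3)   # (B, 1024, 14, 14)
        o5 = self.layer4(o4)   # (B, 2048,  7,  7)
        return o2, o3, o4, o5

feats = MultiLayerBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])
```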

Step (2.2) Pre-fusion processing:

Before processing, attention weights {A_i | i=2,3,4,5} are attached to the four feature maps {O_i | i=2,3,4,5} by element-wise multiplication with the feature maps, all initialized to 0.25, because shallow and deep layers contribute differently: on the one hand shallow detail information, on the other hand deep semantic information, so different degrees of utilization need to be considered. First, additive fusion with all weights equal to 1 still makes the network fuse data repeatedly; after weights are added, the fusion result of the four feature maps also aligns with the data form of typical fine-grained image recognition, which uses deep feature maps. Second, the weight distribution for the different feature information of deep and shallow layers should not be fixed by hand; without an attention mechanism, tuning these hyperparameters experiment by experiment for different datasets would be a huge undertaking, so the attention mechanism gives the model more room for autonomous learning.

The number of channels of the {O_i | i=2,3,4,5} feature maps is then converted to 128 by 1×1 convolution; reducing and unifying the number of channels both lowers the cost of subsequent computation and makes the subsequent multi-layer fusion more convenient. Next, the feature maps {O_i | i=2,3,4} are each processed with dilated convolutions using the same set of dilation rates. A dilation rate of 1 means the original 3×3 convolution kernel is used, while a dilation rate of 2 means the 3×3 kernel samples with an interval of 2 pixels, producing the same receptive field as a 5×5 kernel but with fewer parameters, thereby enlarging the receptive field. To avoid increasing the amount of computation, all convolution kernels are fixed at the small size of 3×3. The dilation rates are set to 1, 2, 3 for the first three feature maps and to 1, 3, 5 for the last feature map. As the dilation rate increases, the receptive field also increases. After the independent convolutions with different dilation rates, the number of channels remains 128. The three feature maps convolved with different dilation rates are added element-wise and finally fused through a 3×3 convolution kernel. Compared with before, the feature maps after dilated convolution carry richer semantic information, so the output of the backbone network no longer focuses only on the abstract features of the object but more easily attends to and learns its overall information; the number of output feature channels is 128.
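A minimal sketch of the pre-fusion branch described above, assuming a PyTorch implementation: a learnable scalar attention weight, the 1×1 channel reduction to 128, and a three-branch dilated-convolution block, with the dilation rates passed in as a parameter (1, 2, 3 or 1, 3, 5 depending on the layer). Names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class PreFusionBranch(nn.Module):
    """Attention-weighted channel reduction followed by multi-dilation fusion."""
    def __init__(self, in_channels, dilations=(1, 2, 3), out_channels=128):
        super().__init__()
        self.attn = nn.Parameter(torch.tensor(0.25))           # A_i, initialized to 0.25
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.dilated = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, kernel_size=3,
                      padding=d, dilation=d)                    # keeps spatial size
            for d in dilations
        ])
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.attn * x                                  # element-wise attention weighting
        x = self.reduce(x)                                 # unify channels to 128
        x = sum(branch(x) for branch in self.dilated)      # element-wise add the branches
        return self.fuse(x)                                # final 3x3 fusion, 128 channels out
```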

Step (2.3) Fusion operation:

The four processed feature maps output by step (2.2) are fused by adjacent addition, i.e. O_2 is added to O_3, O_3 to O_4, and O_4 to O_5; after each addition the result is fused through a 3×3 convolution kernel, keeping the number of channels at 128, giving 3 feature maps. The above operation is repeated, fusing adjacent pairs, and finally the total multi-layer fused feature map O is obtained, with dimensions (128, 56, 56);
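The adjacent-addition fusion of step (2.3) can be sketched as follows. This is a simplified reading of the text: the patent does not state how the four inputs are brought to a common spatial size, so the bilinear resize to 56×56 inside the module is an assumption made here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentFusion(nn.Module):
    """Repeatedly fuses neighbouring feature maps: 4 -> 3 -> 2 -> 1."""
    def __init__(self, channels=128):
        super().__init__()
        # one 3x3 fusion conv per pairwise addition (3 + 2 + 1 = 6 in total)
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(6)]
        )

    def forward(self, feats, size=(56, 56)):
        # bring every map to a common spatial size before adding (assumption)
        feats = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                 for f in feats]
        k = 0
        while len(feats) > 1:
            merged = []
            for a, b in zip(feats[:-1], feats[1:]):        # adjacent pairs
                merged.append(self.convs[k](a + b))
                k += 1
            feats = merged
        return feats[0]                                     # (B, 128, 56, 56)
```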

Step (2.4) Resetting the feature map size by bilinear interpolation:

For the feature map of dimensions (128, 56, 56) produced by step (2.3), bilinear interpolation is used to convert it into a feature map of dimensions (128, 42, 42); during the conversion, the pixel value at each specified position in the new feature map is obtained by bilinear interpolation, which is essentially an extension of linear interpolation to two variables.
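In PyTorch the resize of step (2.4) would typically be a single call to F.interpolate; the short sketch below is illustrative only.

```python
import torch
import torch.nn.functional as F

o = torch.randn(1, 128, 56, 56)                 # fused feature map from step (2.3)
o_resized = F.interpolate(o, size=(42, 42),
                          mode="bilinear", align_corners=False)
print(o_resized.shape)                          # torch.Size([1, 128, 42, 42])
```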

Step (2.5) Region-of-interest extraction:

For the feature map resized to (128, 42, 42) in step (2.4), the first two spatial dimensions are divided into a 3×3 plane in units of 14, and regions of interest larger than 1×1 are extracted (a region of interest here is also a feature map or feature vector), while larger regions of interest that can be composed from smaller ones are deleted. Compared with selecting regions of interest in the original image, regions of interest selected in this way avoid running a multi-layer convolutional neural network on each region, which reduces the extra computational overhead of the model, and discarding composable regions of interest reduces the redundancy of the computation. Counting the original (128, 42, 42) feature map, a total of 19 feature maps are obtained, and bilinear interpolation is used to unify the regions of interest of different sizes into feature vectors of shape (128, 7, 7);
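One possible reading of the region-of-interest extraction in step (2.5) is sketched below: the 42×42 plane is divided into a 3×3 grid of 14×14 cells, rectangular groups of cells larger than 1×1 are enumerated, rectangles that can be composed from smaller multi-cell rectangles (i.e. those spanning more than one cell in both directions) are dropped, and the 18 remaining strips plus the full map are resized to 7×7. The composability rule is an interpretation made here so that the count comes out to 19, as stated in the text.

```python
import torch
import torch.nn.functional as F

def extract_rois(feat, cell=14, out_size=(7, 7)):
    """feat: (B, 128, 42, 42) -> list of 19 tensors of shape (B, 128, 7, 7)."""
    grid = feat.shape[-1] // cell                       # 3 cells per side
    rois = [F.interpolate(feat, size=out_size, mode="bilinear",
                          align_corners=False)]        # the whole map counts too
    for h in range(1, grid + 1):
        for w in range(1, grid + 1):
            if h * w <= 1:                              # skip 1x1 cells
                continue
            if h > 1 and w > 1:                         # composable from smaller strips
                continue
            for i in range(grid - h + 1):
                for j in range(grid - w + 1):
                    patch = feat[:, :, i*cell:(i+h)*cell, j*cell:(j+w)*cell]
                    rois.append(F.interpolate(patch, size=out_size,
                                              mode="bilinear", align_corners=False))
    return rois

rois = extract_rois(torch.randn(1, 128, 42, 42))
print(len(rois))                                        # 19
```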

Step (2.6) Regional feature fusion:

The feature vectors of unified dimensions (128, 7, 7) extracted in step (2.5) are converted into a weighted form, which allows the regions of interest to be fused on the basis of correlation weights. The significance of this step is to fuse feature information using the correlations between the individual feature vectors as weights: because the regions of interest randomly extracted in step (2.5) often cannot cover all key information, this operation also completes key feature information that may be missing. The specific calculation is as follows:

$$\beta_{r,r'} = \tanh\!\left(W_\beta \bar{x}_r + W_{\beta'} \bar{x}_{r'} + b_\beta\right)$$

$$\alpha_{r,r'} = \operatorname{softmax}\!\left(W_\alpha \beta_{r,r'} + b_\alpha\right)$$

$$c_r = \sum_{r'} \alpha_{r,r'} \bar{x}_{r'}$$

In the above formulas, x̄_r and x̄_{r'} refer to the feature vectors extracted in step (2.5); W_β and W_β' are initialized weight matrices whose parameters are obtained by learning; b_β is a bias; tanh is the nonlinear activation function; β_{r,r'} is the correlation matrix corresponding to the two feature vectors x̄_r and x̄_{r'}; W_α and b_α are respectively the weight matrix and bias used for initialization; α_{r,r'} is the attention weight; and c_r is the sum of the products of x̄_r and all other feature vectors with their corresponding attention weights, i.e. the new feature vector, which in effect supplements and refines the features on the basis of correlation.

The correlation matrix β_{r,r'} of the two feature vectors is obtained from the query q and key k, and W_α is used for their nonlinear fusion; b_α and b_β are bias values. These matrices and biases {W_β, W_β', W_α, b_α, b_β} ∈ θ_c are learnable parameters. The attention weight α_{r,r'} captures the correlation between the regions r and r' represented by the feature maps x̄_r and x̄_{r'}. The finally generated weighted vector c_r contains the features of x̄_r itself and of its neighboring content; the output is 19 feature vectors of dimensions (128, 7, 7);
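A compact sketch of the correlation-weighted fusion in step (2.6), following the formulas above as reconstructed here: each (128, 7, 7) region vector is flattened, a pairwise compatibility is computed with learnable W_β, W_β', W_α and biases, softmax-normalised over the second region index, and used to re-weight and sum the region vectors. The tensor shapes, the placement of the nonlinearity, and treating the full flattened vector (which makes the linear layers large; a lighter projection could be used in practice) are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class RegionFusion(nn.Module):
    """Correlation-weighted fusion of R region features of dimension D."""
    def __init__(self, dim):
        super().__init__()
        self.w_beta = nn.Linear(dim, dim)                 # W_beta (with bias b_beta)
        self.w_beta_p = nn.Linear(dim, dim, bias=False)   # W_beta'
        self.w_alpha = nn.Linear(dim, 1)                  # W_alpha (with bias b_alpha)

    def forward(self, x):                                 # x: (B, R, D)
        q = self.w_beta(x).unsqueeze(2)                   # (B, R, 1, D)
        k = self.w_beta_p(x).unsqueeze(1)                 # (B, 1, R, D)
        beta = torch.tanh(q + k)                          # (B, R, R, D) correlation
        alpha = torch.softmax(self.w_alpha(beta).squeeze(-1), dim=-1)  # (B, R, R)
        return torch.bmm(alpha, x)                        # c_r = sum_r' alpha_{r,r'} x_r'

regions = torch.randn(2, 19, 128 * 7 * 7)                 # 19 flattened (128,7,7) vectors
fused = RegionFusion(128 * 7 * 7)(regions)
print(fused.shape)                                        # torch.Size([2, 19, 6272])
```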

Step (2.7) Output of the recognition result:

The 19 fused feature regions obtained in step (2.6) are subjected to the final classification operation. First, the shape is adjusted while keeping the number of channels: the data of the 19 feature maps outside the channel dimension are merged into one dimension, i.e. (128, 19×49), that is (128, 931); after average pooling, the result is multiplied by a weight α (here α is a set hyperparameter, 0.01 by default). At the same time, average pooling is applied to the output feature map of the initial multi-layer fusion module. This gives two feature vectors of shapes (128, 1) and (128, 1, 1), whose shape after one-dimensional flattening in the row direction is (128); the two feature vectors are then added element-wise and fed into A-softmax, and the final prediction result is obtained according to the output probabilities.
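A rough sketch of the classification head in step (2.7), under the assumption that the "row-direction flattening" amounts to averaging away everything except the channel dimension; the A-softmax layer is represented by a plain linear classifier here for simplicity, which is an explicit substitution.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, channels=128, num_classes=30, alpha=0.01):
        super().__init__()
        self.alpha = alpha                            # hyperparameter alpha from the text
        self.fc = nn.Linear(channels, num_classes)    # stand-in for the A-softmax layer

    def forward(self, region_feats, fused_map):
        # region_feats: (B, 19, 128, 7, 7)   fused_map: (B, 128, 56, 56)
        b = region_feats.size(0)
        regions = region_feats.permute(0, 2, 1, 3, 4).reshape(b, 128, -1)  # (B, 128, 931)
        v1 = regions.mean(dim=-1) * self.alpha        # average pool + weight alpha
        v2 = fused_map.mean(dim=(-2, -1))             # global average pool -> (B, 128)
        return self.fc(v1 + v2)                       # logits; softmax/loss applied outside

head = ClassificationHead()
logits = head(torch.randn(2, 19, 128, 7, 7), torch.randn(2, 128, 56, 56))
print(logits.shape)                                   # torch.Size([2, 30])
```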

Further, cross-entropy loss is used in step (2.7):

$$L_i = -\sum_{c=1}^{M} y_{ic}\,\log(p_{ic}), \qquad L = \frac{1}{N}\sum_{i=1}^{N} L_i$$

where M is the number of categories; y_ic is an indicator function (0 or 1) that takes 1 if the true category of sample i is c and 0 otherwise; p_ic is the predicted probability that observed sample i belongs to category c; N is the number of samples; L_i is the loss of the i-th sample; and L is the average loss over all samples.

Further, step (3) specifically includes the following steps:

Step (3.1) feed the test image into the network model;

Step (3.2) perform feature extraction through the ResNet-50 backbone network to obtain feature maps;

Step (3.3) use the improved multi-layer fusion network with attention weights from step (2) to fuse the feature maps;

Step (3.4) use the improved regional feature fusion module from step (2) to extract regions of interest and fuse regional features;

Step (3.5) perform category prediction on the fused total feature map and obtain the result.

Beneficial effects: The invention provides an aerial remote sensing image recognition method based on multi-layer and regional feature fusion. Using the backbone network ResNet-50, targeted improvements are made according to the characteristics of aerial remote sensing images, such as fast update speed, blurred image details and low resolution. In the new multi-layer fusion and regional feature fusion adopted by the invention, attention weights are attached to the multi-layer fusion, giving the model the ability to assign the fusion proportions by itself, so that the selection of detail parts in shallow feature maps is flexible and the local detail feature information of the remote sensing image is better preserved and utilized. The region-of-interest extraction strategy reduces the computational overhead and parameter quantity of regional feature fusion. The method was tested on the recent large-scale remote sensing image dataset AID. The images are 600*600 pixels and cover 30 scene classes, each with roughly 220-420 images, 10000 images in total. The final remote sensing image recognition reached a relatively high accuracy of 96.69%.

Brief Description of the Drawings

Figure 1 is a flow chart of the method of the present invention;

Figure 2 is a process diagram of building the aerial remote sensing image recognition model in an embodiment of the present invention;

Figure 3 is an overall process diagram of the MRI model in an embodiment of the present invention;

Figure 4 is an example diagram of bilinear interpolation in an embodiment of the present invention;

Figure 5 is a sample test image input in an embodiment of the present invention;

Figure 6 is a feature map of the third layer of the backbone network in an embodiment of the present invention;

Figure 7 is a feature map after multi-layer fusion in an embodiment of the present invention.

Detailed Description of Embodiments

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments:

As shown in Figure 1, an aerial remote sensing image recognition method based on multi-layer and regional feature fusion is based on the ResNet-50 backbone network and mainly uses attention mechanisms to process image features; it specifically includes the following steps:

Step (1) Aerial image dataset preparation: preprocess the dataset, pad the images and then randomly crop them, and apply random rotation and horizontal flipping for data augmentation;

Step (2) Building the image recognition model: train the image recognition model on the aerial image dataset;

Step (3) Test image detection process: use the trained image recognition network and its weight parameters to recognize the test images and output the predicted categories.

Specifically, in step (1), the dataset is preprocessed: the images are padded and then randomly cropped, and random rotation and horizontal flipping are applied for data augmentation; here 6679 images from the AID dataset are used as the training set and 3321 as the test set.
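A possible torchvision preprocessing pipeline matching the description of step (1) (padding, random crop, random rotation, horizontal flip, and the 224×224 size mentioned later in the training configuration); the exact padding amount, rotation range, and intermediate resize are not specified in the text and are chosen here as placeholders.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                     # assumed intermediate size
    transforms.Pad(8),                          # image padding (amount assumed)
    transforms.RandomCrop(224),                 # random crop after padding
    transforms.RandomRotation(15),              # random rotation (range assumed)
    transforms.RandomHorizontalFlip(),          # horizontal flip
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.Resize((224, 224)),              # all images resized to 224x224
    transforms.ToTensor(),
])
```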

As shown in Figure 2 and Figure 3, step (2) is implemented as follows:

Step (2.1) Obtain feature layers from the backbone network:

From the feature maps {O_i | i=1,2,3,4,5} of the backbone network ResNet-50, i.e. the features of the 1st, 2nd, 3rd, 4th and 5th stages of ResNet-50, select {O_i | i=2,3,4,5} as the feature layers for multi-layer feature fusion; the numbers of channels corresponding to these feature layers are {256, 512, 1024, 2048}. Figure 5 shows a sample test image, used as the input image whose category is to be predicted. The feature map of the 3rd layer is shown in Figure 6.

Step (2.2) Pre-fusion processing:

Before processing, attention weights {A_i | i=2,3,4,5} are attached to the four feature maps {O_i | i=2,3,4,5} by element-wise multiplication with the feature maps, all initialized to 0.25. This is because shallow and deep layers contribute differently: on the one hand shallow detail information, on the other hand deep semantic information, so different degrees of utilization need to be considered. First, additive fusion with all weights equal to 1 still makes the network fuse data repeatedly; after weights are added, the fusion result of the four feature maps also aligns with the data form of typical fine-grained image recognition, which uses deep feature maps. Second, the weight distribution for the different feature information of deep and shallow layers should not be fixed by hand; without an attention mechanism, tuning these hyperparameters experiment by experiment for different datasets would be a huge undertaking, so the attention mechanism gives the model more room for autonomous learning.

The number of channels of the {O_i | i=2,3,4,5} feature maps is then converted to 128 by 1×1 convolution; reducing and unifying the number of channels both lowers the cost of subsequent computation and makes the subsequent multi-layer fusion more convenient. Next, the feature maps {O_i | i=2,3,4} are each processed with dilated convolutions using the same set of dilation rates. A dilation rate of 1 means the original 3×3 convolution kernel is used, while a dilation rate of 2 means the 3×3 kernel samples with an interval of 2 pixels, producing the same receptive field as a 5×5 kernel but with fewer parameters, thereby enlarging the receptive field. To avoid increasing the amount of computation, all convolution kernels are fixed at the small size of 3×3. The dilation rates are set to 1, 2, 3 for the first three feature maps and to 1, 3, 5 for the last feature map. As the dilation rate increases, the receptive field also increases. After the independent convolutions with different dilation rates, the number of channels remains 128. The three feature maps convolved with different dilation rates are added element-wise and finally fused through a 3×3 convolution kernel. Compared with before, the feature maps after dilated convolution carry richer semantic information, so the output of the backbone network no longer focuses only on the abstract features of the object but more easily attends to and learns its overall information. The number of output feature channels is 128.

Step (2.3) Fusion operation: (element-wise addition of pixel values of adjacent feature maps)

As shown in Figure 7, the four processed feature maps output by step (2.2) are fused by adjacent addition, i.e. O_2 is added to O_3, O_3 to O_4, and O_4 to O_5; after each addition the result is fused through a 3×3 convolution kernel, keeping the number of channels at 128, giving 3 feature maps. The above operation is repeated, fusing adjacent pairs, and finally the total multi-layer fused feature map O is obtained, with dimensions (128, 56, 56).

Step (2.4) Resetting the feature map size by bilinear interpolation: (adjusting the size of the fused feature map)

For the feature map of dimensions (128, 56, 56) produced by step (2.3), bilinear interpolation is used to convert it into a feature map of dimensions (128, 42, 42); during the conversion, the pixel value at each specified position in the new feature map is obtained by bilinear interpolation, which is essentially an extension of linear interpolation to two variables.

As shown in Figure 4, suppose we want the value of an unknown function f at the point P = (x, y), and we know the values of f at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2), where x, y, x1, y1, x2, y2 are coordinates and f gives the pixel value at a coordinate. First, linear interpolation is performed in the x direction to obtain the pixel values f(R1) and f(R2) at R1 = (x, y1) and R2 = (x, y2):

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21})$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22})$$

Then linear interpolation is performed in the y direction to obtain the pixel value f(P) at the point P:

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2)$$

Combining the two steps gives the final result of bilinear interpolation, the pixel value f(x, y) at P:

$$f(x,y) \approx \frac{f(Q_{11})(x_2 - x)(y_2 - y) + f(Q_{21})(x - x_1)(y_2 - y) + f(Q_{12})(x_2 - x)(y - y_1) + f(Q_{22})(x - x_1)(y - y_1)}{(x_2 - x_1)(y_2 - y_1)}$$
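A direct transcription of the interpolation formulas above into a small helper function, useful for checking a single output pixel by hand; torch or PIL resizing would normally be used in practice.

```python
def bilinear(x, y, x1, y1, x2, y2, f11, f21, f12, f22):
    """Interpolate f at (x, y) from the four corner values f(Q11), f(Q21), f(Q12), f(Q22)."""
    denom = (x2 - x1) * (y2 - y1)
    return (f11 * (x2 - x) * (y2 - y)
            + f21 * (x - x1) * (y2 - y)
            + f12 * (x2 - x) * (y - y1)
            + f22 * (x - x1) * (y - y1)) / denom

# the value at the midpoint of a unit cell is the mean of the four corners
print(bilinear(0.5, 0.5, 0, 0, 1, 1, f11=1.0, f21=2.0, f12=3.0, f22=4.0))  # 2.5
```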

Step (2.5) Region-of-interest extraction: (after dividing and extracting the regions, adjust them to a uniform size)

For the feature map resized to (128, 42, 42) in step (2.4), the first two spatial dimensions are divided into a 3×3 plane in units of 14, and regions of interest larger than 1×1 are extracted (a region of interest here is also a feature map or feature vector), while larger regions of interest that can be composed from smaller ones are deleted. Compared with selecting regions of interest in the original image, regions of interest selected in this way avoid running a multi-layer convolutional neural network on each region, which reduces the extra computational overhead of the model, and discarding composable regions of interest reduces the redundancy of the computation. Counting the original (128, 42, 42) feature map, a total of 19 feature maps are obtained, and bilinear interpolation is used to unify the regions of interest of different sizes into feature vectors of shape (128, 7, 7).

Step (2.6) Regional feature fusion: (weighted element-wise addition of pixel values of the individual feature maps)

The feature vectors of unified dimensions (128, 7, 7) extracted in step (2.5) are converted into a weighted form, which allows the regions of interest to be fused on the basis of correlation weights. The significance of this step is to fuse feature information using the correlations between the individual feature vectors as weights: because the regions of interest randomly extracted in step (2.5) often cannot cover all key information, this operation also completes key feature information that may be missing. The specific calculation is as follows:

$$\beta_{r,r'} = \tanh\!\left(W_\beta \bar{x}_r + W_{\beta'} \bar{x}_{r'} + b_\beta\right)$$

$$\alpha_{r,r'} = \operatorname{softmax}\!\left(W_\alpha \beta_{r,r'} + b_\alpha\right)$$

$$c_r = \sum_{r'} \alpha_{r,r'} \bar{x}_{r'}$$

In the above formulas, x̄_r and x̄_{r'} refer to the feature vectors extracted in step (2.5); W_β and W_β' are initialized weight matrices whose parameters are obtained by learning; b_β is a bias; tanh is the nonlinear activation function; β_{r,r'} is the correlation matrix corresponding to the two feature vectors x̄_r and x̄_{r'}; W_α and b_α are respectively the weight matrix and bias used for initialization; α_{r,r'} is the attention weight; and c_r is the sum of the products of x̄_r and all other feature vectors with their corresponding attention weights, i.e. the new feature vector, which in effect supplements and refines the features on the basis of correlation.

The correlation matrix β_{r,r'} of the two feature vectors is obtained from the query q and key k, and W_α is used for their nonlinear fusion; b_α and b_β are bias values. These matrices and biases {W_β, W_β', W_α, b_α, b_β} ∈ θ_c are learnable parameters. The attention weight α_{r,r'} captures the correlation between the regions r and r' represented by the feature maps x̄_r and x̄_{r'}. The finally generated weighted vector c_r contains the features of x̄_r itself and of its neighboring content. The output is 19 feature vectors of dimensions (128, 7, 7).

Step (2.7) Output of the recognition result:

Here, the 19 fused feature regions obtained in step (2.6) are subjected to the final classification operation. First, the shape is adjusted while keeping the number of channels: the data of the 19 feature maps outside the channel dimension are merged into one dimension, i.e. (128, 19×49), that is (128, 931). After average pooling, the result is multiplied by a weight α (here α is a set hyperparameter, 0.01 by default). At the same time, average pooling is applied to the output feature map of the initial multi-layer fusion module. This gives two feature vectors of shapes (128, 1) and (128, 1, 1), whose shape after one-dimensional flattening in the row direction is (128); the two feature vectors are then added element-wise and fed into A-softmax, and the final prediction result is obtained according to the output probabilities.

Further, in step (2.7), cross-entropy loss is used:

$$L_i = -\sum_{c=1}^{M} y_{ic}\,\log(p_{ic}), \qquad L = \frac{1}{N}\sum_{i=1}^{N} L_i$$

where M is the number of categories; y_ic is an indicator function (0 or 1) that takes 1 if the true category of sample i is c and 0 otherwise; p_ic is the predicted probability that observed sample i belongs to category c; N is the number of samples; L_i is the loss of the i-th sample; and L is the average loss over all samples.

Step (2.8) Training the recognition model:

The pre-trained model used is the model pre-trained on ImageNet provided officially by PyTorch. All images are uniformly resized to 224×224 before being fed into the model. The number of training iterations is set to 100 and the batch size to 8. The SGD momentum is set to 0.5 and the initial learning rate to 0.008. The first 10 epochs of training use a cosine annealing learning rate schedule with the SGD optimizer, with the number of cosine cycles set to 100. After the first 10 epochs, the first 75% of the remaining epochs use the SGD optimizer, while the last 25% use a fixed higher learning rate of 0.05; the weights obtained during this last 25% of training are recorded, and finally a weighted average of the recorded weights is taken. The network model parameters are obtained with the above training configuration.
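A sketch of the optimizer and learning rate schedule described in step (2.8), assuming PyTorch; the handover between the cosine-annealed phase and the later fixed-learning-rate phase, as well as the weight averaging at the end, are simplified here, and the stand-in model and omitted training loop are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 30)                     # stand-in for the recognition model
optimizer = torch.optim.SGD(model.parameters(), lr=0.008, momentum=0.5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

epochs, snapshots = 100, []
for epoch in range(epochs):
    # ... one training epoch over the AID training set would run here ...
    if epoch < 10:
        scheduler.step()                       # cosine annealing for the first 10 epochs
    elif epoch >= int(10 + 0.75 * (epochs - 10)):
        for g in optimizer.param_groups:       # fixed higher learning rate for the tail
            g["lr"] = 0.05
        snapshots.append({k: v.clone() for k, v in model.state_dict().items()})

# average of the weights recorded during the final 25% of training
avg = {k: sum(s[k] for s in snapshots) / len(snapshots) for k in snapshots[0]}
model.load_state_dict(avg)
```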

Further, step (3) specifically includes the following steps:

Step (3.1) feed the test image into the network model;

Step (3.2) perform feature extraction through the ResNet-50 backbone network to obtain feature maps;

Step (3.3) use the improved multi-layer fusion network with attention weights from step (2) to fuse the feature maps;

Step (3.4) use the improved regional feature fusion module from step (2) to extract regions of interest and fuse regional features;

Step (3.5) perform category prediction on the fused total feature map and obtain the result.

The method was tested on the recent large-scale remote sensing image dataset AID. The images are 600*600 pixels and cover 30 scene classes, each with roughly 220-420 images, 10000 images in total; the final remote sensing image recognition reached a relatively high accuracy of 96.69%. The experimental results are as follows:

Table 1 Experimental results

[Table 1 is provided as an image in the original document; it reports the experimental results on the AID dataset, where the proposed method reaches a recognition accuracy of 96.69%.]

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements without departing from the principles of the present invention, and such improvements should also be regarded as falling within the scope of protection of the present invention.

Claims (7)

1. An aerial remote sensing image identification method based on multilayer and regional feature fusion, characterized by comprising the following steps:
step (1) aerial image dataset preparation: preprocessing the dataset, padding the images and then randomly cropping them, and applying random rotation and horizontal flipping for data augmentation;
step (2) building an image recognition model: training an image recognition model based on the aerial image dataset;
step (3) test image detection process: identifying the images in the test images by using the trained image identification network and the network weight parameters, and outputting the predicted category.
2. The method for identifying the aerial remote sensing image based on the fusion of the multilayer and the regional characteristics according to claim 1, wherein the step (2) specifically comprises the following steps:
step (2.1) obtains the characteristic layer from the backbone network:
from the feature maps {O_i | i=1,2,3,4,5} of the backbone network ResNet-50, selecting {O_i | i=2,3,4,5} as the feature layers for multi-layer feature fusion, the numbers of channels corresponding to the feature layers being {256, 512, 1024, 2048};
step (2.2) fusion pretreatment:
adding attention weights {A_i | i=2,3,4,5} to the four feature maps {O_i | i=2,3,4,5} before processing, multiplying them element-wise with the feature maps, the weights being initialized to 0.25; converting the number of channels of the {O_i | i=2,3,4,5} feature maps to 128 by 1×1 convolution; operating on the three feature maps {O_i | i=2,3,4} with dilated convolutions of the same dilation rates, which finally produce the same receptive fields as 5×5 convolution kernels while reducing the parameters, thereby enlarging the receptive fields; after the independent convolutions with different dilation rates are performed, the number of channels after the dilated convolution is still 128; performing element-wise addition on the three feature maps convolved with different dilation rates, and finally performing fusion through a 3×3 convolution kernel, the number of output feature channels being 128;
step (2.3) fusion operation:
fusing the 4 processed feature maps output in the step (2.2) in an adjacent addition mode, fusing the feature maps respectively through a 3 x 3 convolution kernel after the addition, keeping the number of channels to be 128 to obtain 3 feature maps, repeating the operation, fusing two by two in an adjacent mode, and finally obtaining a multi-layer fused total feature map O with dimensions of (128,56, 56);
step (2.4) bilinear interpolation resets the size of the feature map:
for the feature map with the dimension (128,56,56) generated in the step (2.3), converting the feature map into the feature map with the dimension (128,42,42) by using bilinear interpolation;
and (2.5) extracting the interest region:
for the feature map with the changed size of (128,42,42) in the step (2.4), dividing the first two dimensions of the feature map into 3 × 3 planes by taking 14 as a unit, extracting interest areas according to the size larger than 1 × 1, deleting larger interest areas which can be combined by smaller interest areas, adding the original feature map with the dimension of (128,42,42) to obtain 19 feature maps in total, and unifying the interest areas with different sizes into feature vectors with the shape of (128,7,7) by adopting bilinear interpolation;
and (2.6) regional feature fusion:
converting the feature vectors extracted in the step (2.5) and used for unifying the dimensions (128,7,7) into a weighted form, and outputting 19 feature vectors with the dimensions (128,7, 7);
and (2.7) outputting an identification result:
performing final classification operation on the 19 fused feature regions obtained in the step (2.6), firstly adjusting the shape on the basis of the number of reserved channels, synthesizing data except the 19 feature map channel dimensions into a single dimension, namely (128,19 × 49), namely (128,931), performing average pooling, multiplying the single dimension by a weight alpha, simultaneously performing average pooling on the output feature map of the initial multi-layer fusion module to obtain two feature vectors with the shapes of (128, 1) and (128,1,1), performing one-dimensional flattening in the row direction to obtain the shape of (128), then adding the two feature map elements and sending the sum into Asoftmax, and obtaining a final prediction result according to the output probability;
step (2.8) training a recognition model:
and uniformly adjusting the sizes of all the images before the images are sent into the model, and obtaining network model parameters according to training configuration.
3. The method for identifying aerial remote sensing images based on fusion of multilayer and regional characteristics as claimed in claim 2, wherein in step (2.2), in order to prevent the increase of the calculation amount, the convolution kernels are all fixed on a small amount of 3 x 3 for operation.
4. The method for identifying aerial remote sensing images based on multi-layer and regional feature fusion according to claim 2 or 3, wherein in the step (2.2), the expansion rate is set to be 1,2,3 for the first three layers of feature maps, and is set to be 1,3,5 for the last feature map.
5. The method for identifying the aerial remote sensing image based on the fusion of the multilayer and the regional characteristics as claimed in claim 2, wherein the specific calculation of the step (2.6) is as follows:
$$\beta_{r,r'} = \tanh\!\left(W_\beta \bar{x}_r + W_{\beta'} \bar{x}_{r'} + b_\beta\right)$$

$$\alpha_{r,r'} = \operatorname{softmax}\!\left(W_\alpha \beta_{r,r'} + b_\alpha\right)$$

$$c_r = \sum_{r'} \alpha_{r,r'} \bar{x}_{r'}$$

in the above formulas, x̄_r and x̄_{r'} refer to the feature vectors extracted in step (2.5); W_β and W_β' are initialized weight matrices whose parameters can be obtained by learning; b_β is a bias; tanh is the nonlinear activation function; β_{r,r'} is the correlation matrix of the two feature vectors x̄_r and x̄_{r'}; W_α and b_α are respectively a weight matrix and a bias used for initialization; α_{r,r'} is the attention weight; and c_r is the sum of the products of x̄_r and all other feature vectors with the corresponding attention weights;
the correlation matrix β_{r,r'} of the two feature vectors is derived from q and k, W_α is used for their nonlinear fusion, and b_α and b_β are bias values; these matrices and bias values {W_β, W_β', W_α, b_α, b_β} ∈ θ_c are learnable parameters; the attention weight α_{r,r'} captures the correlation between the regions r and r' represented by the feature maps x̄_r and x̄_{r'}; the finally generated weighted vector c_r contains the features of x̄_r based on itself and its neighboring content.
6. The method for identifying the aerial remote sensing image based on the fusion of the multilayer and the regional characteristics according to claim 2, wherein the step (2.7) adopts cross entropy loss:
$$L_i = -\sum_{c=1}^{M} y_{ic}\,\log(p_{ic}), \qquad L = \frac{1}{N}\sum_{i=1}^{N} L_i$$

where M is the number of classes, y_ic is an indicator function (0 or 1) taking 1 if the true class of sample i is c and 0 otherwise, p_ic is the predicted probability that the observed sample i belongs to class c, N is the number of samples, L_i is the loss of the i-th sample, and L is the average loss of all samples.
7. The method for identifying the aerial remote sensing image based on the fusion of the multilayer and the regional characteristics according to claim 1, wherein the step (3) specifically comprises the following steps:
step (3.1) sending the test image into a network model;
step (3.2) extracting features through the ResNet-50 backbone network to obtain a feature map;
step (3.3) adopting the improved multilayer fusion network with attention weight in step (2) to perform feature map fusion;
step (3.4) adopting the improved region feature fusion module in the step (2) to extract the region of interest and fuse the region features;
and (3.5) performing category prediction on the fused total characteristic graph to obtain a result.
CN202210340880.1A (filed 2022-03-31) — Aerial remote sensing image recognition method based on multi-layer and regional feature fusion — Active — granted as CN114882352B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210340880.1A | 2022-03-31 | 2022-03-31 | Aerial remote sensing image recognition method based on multi-layer and regional feature fusion

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210340880.1A | 2022-03-31 | 2022-03-31 | Aerial remote sensing image recognition method based on multi-layer and regional feature fusion

Publications (2)

Publication Number | Publication Date
CN114882352A | 2022-08-09
CN114882352B | 2025-02-18

Family

ID=82669505

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210340880.1A | Aerial remote sensing image recognition method based on multi-layer and regional feature fusion (Active, granted as CN114882352B) | 2022-03-31 | 2022-03-31

Country Status (1)

Country | Link
CN | CN114882352B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112766409A (en)* | 2021-02-01 | 2021-05-07 | Northwestern Polytechnical University | Feature fusion method for remote sensing image target detection
CN113850825A (en)* | 2021-09-27 | 2021-12-28 | Taiyuan University of Technology | Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN113870281A (en)* | 2021-09-17 | 2021-12-31 | Hainan University | A Pyramid Mechanism-Based Segmentation Method for Ocean and Non-ocean Regions in Remote Sensing Imagery

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112766409A (en)* | 2021-02-01 | 2021-05-07 | Northwestern Polytechnical University | Feature fusion method for remote sensing image target detection
CN113870281A (en)* | 2021-09-17 | 2021-12-31 | Hainan University | A Pyramid Mechanism-Based Segmentation Method for Ocean and Non-ocean Regions in Remote Sensing Imagery
CN113850825A (en)* | 2021-09-27 | 2021-12-28 | Taiyuan University of Technology | Remote sensing image road segmentation method based on context information and multi-scale feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Yuze et al.: "A multi-layer and regional feature fusion model for fine-grained image recognition tasks", Journal of Chinese Agricultural Mechanization, vol. 44, no. 1, 15 January 2023 (2023-01-15), pages 199-207 *

Also Published As

Publication number | Publication date
CN114882352B (en) | 2025-02-18

Similar Documents

Publication | Publication Date | Title
CN111209810B (en) Supervised Deep Neural Network Architecture for Bounding Box Segmentation for Accurate Real-time Pedestrian Detection in Visible and Infrared Images
CN111898736B (en) An Efficient Pedestrian Re-identification Method Based on Attribute Awareness
CN111310666B (en) A high-resolution image feature recognition and segmentation method based on texture features
CN112069896A (en)Video target tracking method based on twin network fusion multi-template features
CN113888547A (en) Unsupervised Domain Adaptive Remote Sensing Road Semantic Segmentation Method Based on GAN Network
CN110458844A (en) A Semantic Segmentation Method for Low Light Scenes
CN114419732B (en) HRNet human posture recognition method based on attention mechanism optimization
CN112052783A (en)High-resolution image weak supervision building extraction method combining pixel semantic association and boundary attention
CN114581560B (en)Multi-scale neural network infrared image colorization method based on attention mechanism
CN114049381A (en) A Siamese Cross-Target Tracking Method Fusing Multi-layer Semantic Information
CN112364931A (en)Low-sample target detection method based on meta-feature and weight adjustment and network model
CN112347970A (en) A method for remote sensing image recognition based on graph convolutional neural network
CN117576567B (en)Remote sensing image change detection method using multi-level difference characteristic self-adaptive fusion
CN116363526B (en) MROCNet model construction and multi-source remote sensing image change detection method and system
CN110705344A (en)Crowd counting model based on deep learning and implementation method thereof
CN112396036B (en) An Occluded Person Re-Identification Method Combining Spatial Transformation Network and Multi-Scale Feature Extraction
CN112785636A (en)Multi-scale enhanced monocular depth estimation method
CN114549891A (en)Foundation cloud picture cloud identification method based on contrast self-supervision learning
CN116681742A (en)Visible light and infrared thermal imaging image registration method based on graph neural network
CN115131637A (en)Multilevel characteristic space-time remote sensing image fusion method based on generation countermeasure network
CN111563528A (en) SAR image classification method based on multi-scale feature learning network and bilateral filtering
CN115439645A (en)Small sample target detection method based on target suggestion box increment
CN114612990A (en)Unmanned aerial vehicle face recognition method based on super-resolution
CN116680435B (en) A similar image retrieval and matching method based on multi-layer feature extraction
CN116612385B (en)Remote sensing image multiclass information extraction method and system based on depth high-resolution relation graph convolution

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
