Disclosure of Invention
Aiming at the problems in the related art, the invention provides a remote sensing image semantic segmentation method based on parallel dilated (atrous) convolution, so as to overcome the technical problems in the prior related art.
The technical scheme of the invention is realized as follows:
a remote sensing image semantic segmentation method based on parallel dilated convolution comprises the following steps:
the method comprises the steps of obtaining a high-resolution remote sensing image in advance, slicing the high-resolution remote sensing image, and normalizing and standardizing it to obtain a source high-resolution remote sensing image;
initializing the lower layers of the feature extraction network resnet101 with resnet101 parameters pretrained on ImageNet, constructing a parallel dilated convolution network, and extracting shallow features of the source high-resolution remote sensing image;
inputting the shallow features into the parallel dilated convolution network to obtain multi-scale information and fusing the multi-scale information, wherein different dilation rates are set to capture the multi-scale information;
and fusing the fused features with the shallow features again, refining image-level information by using a fully connected conditional random field, and obtaining a semantic segmentation result.
Further, slicing the high-resolution remote sensing image comprises slicing it into tiles of 512 pixels in both height and width.
Further, the method also comprises the following step:
extracting the three RGB channels from the sliced high-resolution remote sensing image.
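A minimal sketch of this slicing and channel-extraction step is given below (an illustrative sketch only, assuming the image has already been loaded as a NumPy array of shape (H, W, C) with the RGB channels first; the function name tile_512 is hypothetical and not part of the invention):

import numpy as np

def tile_512(image: np.ndarray, tile: int = 512):
    """Slice a high-resolution remote sensing image into 512x512 RGB tiles.

    Only the first three (RGB) channels are kept, and border regions that do
    not fill a whole tile are simply discarded in this simplified sketch.
    """
    rgb = image[:, :, :3]                       # extract the three RGB channels
    h, w = rgb.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(rgb[top:top + tile, left:left + tile])
    return tiles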
Further, constructing the parallel dilated convolution network includes the following steps:
starting from a standard ordinary convolution, let $F:\mathbb{Z}^2\to\mathbb{R}$ be a discrete function, let $\Omega_r=[-r,r]^2\cap\mathbb{Z}^2$, and let $k:\Omega_r\to\mathbb{R}$ be a discrete convolution kernel of size $(2r+1)^2$; the convolution centered at a position $p$ is computed as:
$(F\ast k)(p)=\sum_{s+t=p}F(s)\,k(t)$
by extending the standard convolution, let $l$ be the dilation rate of the dilated convolution; the dilated convolution is then:
$(F\ast_{l}k)(p)=\sum_{s+lt=p}F(s)\,k(t);$
and applying dilated convolutions with different dilation rates to the shallow features in parallel to obtain multi-scale features, and fusing the multi-scale features by concatenation to form a parallel dilated convolution network layer.
Further, the dilation rates are set to 2, 3, 4, and 5, respectively.
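As a worked example (a direct consequence of the dilated convolution defined above, not an additional limitation), a $3\times 3$ kernel with dilation rate $l$ has an effective kernel size of
$k_{\mathrm{eff}} = 3 + 2(l-1),$
so the four parallel branches with $l = 2, 3, 4, 5$ cover effective kernel sizes of $5\times 5$, $7\times 7$, $9\times 9$ and $11\times 11$ respectively, while each branch still uses only $3\times 3 = 9$ weights.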
Further, refining image-level information with the fully connected conditional random field comprises the following steps:
the energy function used by the fully connected conditional random field is:
$E(x)=\sum_{i}\theta_{i}(x_{i})+\sum_{i<j}\theta_{ij}(x_{i},x_{j})$
its unary potential function describes the relation between the observation and the label:
$\theta_{i}(x_{i})=-\log P(x_{i})$
wherein $i$ indexes the pixels of the image and $P(x_{i})$ is the classification probability assigned by the network to pixel $i$; the pairwise potential function describes the correlation between observed pixels:
$\theta_{ij}(x_{i},x_{j})=\mu(x_{i},x_{j})\sum_{m=1}^{K}w_{m}\,k_{m}(f_{i},f_{j})$
where $\mu(x_{i},x_{j})=1$ when $x_{i}\neq x_{j}$ and $\mu(x_{i},x_{j})=0$ otherwise, $k_{m}(f_{i},f_{j})$ is a Gaussian kernel between $f_{i}$ and $f_{j}$, $f_{i}$ is the colour information corresponding to pixel $i$, i.e. its feature vector, and $w_{m}$ is the weight of the $m$-th Gaussian kernel;
in the process of minimizing the energy function, unreasonably classified pixels in the image are corrected, and a refined semantic segmentation result is obtained.
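As an illustrative note (this particular kernel choice follows the widely used fully connected CRF of Krähenbühl and Koltun and is given here as an assumption, not as a limitation of the method), the Gaussian kernels $k_{m}$ are commonly an appearance kernel plus a smoothness kernel:
$k_{1}(f_{i},f_{j})=\exp\!\left(-\dfrac{\lVert p_{i}-p_{j}\rVert^{2}}{2\theta_{\alpha}^{2}}-\dfrac{\lVert I_{i}-I_{j}\rVert^{2}}{2\theta_{\beta}^{2}}\right),\qquad k_{2}(f_{i},f_{j})=\exp\!\left(-\dfrac{\lVert p_{i}-p_{j}\rVert^{2}}{2\theta_{\gamma}^{2}}\right),$
where $p_{i}$ is the position of pixel $i$, $I_{i}$ is its colour vector, and $\theta_{\alpha}$, $\theta_{\beta}$, $\theta_{\gamma}$ control the kernel bandwidths.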
The invention has the beneficial effects that:
the invention relates to a remote sensing image semantic segmentation method based on parallel cavity convolution, which comprises the steps of obtaining a high-resolution remote sensing image in advance, slicing the high-resolution remote sensing image, normalizing and standardizing to obtain a source high-resolution remote sensing image, initializing the characteristic based on the pretrained resnet101 parameter on ImageNet to extract the lower layer network of the network resnet101, constructing a parallel cavity convolution network, extracting shallow features of the source high-resolution remote sensing image, inputting the shallow features into a parallel cavity convolution network to obtain multi-scale information, and the multi-scale information is fused, the fused features and the shallow features are fused again, the full-connection conditional random field is used for restoring the image-level information, the semantic segmentation result is obtained, and the semantic segmentation is realized without adding extra parameters, the field of experience of convolution is enlarged, and compared with standard convolution which achieves the same field of experience, the parallel hole convolution method can save video memory; the parallel computing structure is adopted, so that the nodes in the neural network computing graph are conveniently distributed on distributed hardware, and the computing speed is improved; the multi-scale information is beneficial to capturing detailed objects and large objects by a network, small target objects are not easy to miss, and the semantic segmentation precision is improved; in addition, the void convolution can widely perceive the adjacent objects of the target object, the pixel-level classification can be more effectively carried out by means of the adjacent information, and the pixel-level classification effect is better than that of the standard convolution.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
According to the embodiment of the invention, a remote sensing image semantic segmentation method based on parallel dilated convolution is provided.
As shown in fig. 1 to fig. 3, the remote sensing image semantic segmentation method based on parallel dilated convolution according to the embodiment of the present invention includes the following steps:
the method comprises the steps of obtaining a high-resolution remote sensing image in advance, slicing the high-resolution remote sensing image, and normalizing and standardizing it to obtain a source high-resolution remote sensing image;
initializing the lower layers of the feature extraction network resnet101 with resnet101 parameters pretrained on ImageNet, constructing a parallel dilated convolution network, and extracting shallow features of the source high-resolution remote sensing image;
inputting the shallow features into the parallel dilated convolution network to obtain multi-scale information and fusing the multi-scale information, wherein different dilation rates are set to capture the multi-scale information;
and fusing the fused features with the shallow features again, refining image-level information by using a fully connected conditional random field, and obtaining a semantic segmentation result.
The high-resolution remote sensing image is sliced into tiles of 512 pixels in both height and width.
Wherein, the method further comprises the following step:
extracting the three RGB channels from the sliced high-resolution remote sensing image.
Constructing the parallel dilated convolution network comprises the following steps:
starting from a standard ordinary convolution, let $F:\mathbb{Z}^2\to\mathbb{R}$ be a discrete function, let $\Omega_r=[-r,r]^2\cap\mathbb{Z}^2$, and let $k:\Omega_r\to\mathbb{R}$ be a discrete convolution kernel of size $(2r+1)^2$; the convolution centered at a position $p$ is computed as:
$(F\ast k)(p)=\sum_{s+t=p}F(s)\,k(t)$
by extending the standard convolution, let $l$ be the dilation rate of the dilated convolution; the dilated convolution is then:
$(F\ast_{l}k)(p)=\sum_{s+lt=p}F(s)\,k(t);$
and applying dilated convolutions with different dilation rates to the shallow features in parallel to obtain multi-scale features, and fusing the multi-scale features by concatenation to form a parallel dilated convolution network layer.
Wherein the dilation rates are set to 2, 3, 4 and 5, respectively.
Wherein refining image-level information with the fully connected conditional random field comprises the following steps:
the energy function used by the fully connected conditional random field is:
$E(x)=\sum_{i}\theta_{i}(x_{i})+\sum_{i<j}\theta_{ij}(x_{i},x_{j})$
its unary potential function describes the relation between the observation and the label:
$\theta_{i}(x_{i})=-\log P(x_{i})$
wherein $i$ indexes the pixels of the image and $P(x_{i})$ is the classification probability assigned by the network to pixel $i$; the pairwise potential function describes the correlation between observed pixels:
$\theta_{ij}(x_{i},x_{j})=\mu(x_{i},x_{j})\sum_{m=1}^{K}w_{m}\,k_{m}(f_{i},f_{j})$
where $\mu(x_{i},x_{j})=1$ when $x_{i}\neq x_{j}$ and $\mu(x_{i},x_{j})=0$ otherwise, $k_{m}(f_{i},f_{j})$ is a Gaussian kernel between $f_{i}$ and $f_{j}$, $f_{i}$ is the colour information corresponding to pixel $i$, i.e. its feature vector, and $w_{m}$ is the weight of the $m$-th Gaussian kernel.
By means of this technical solution, the high-resolution remote sensing image is sliced; the lower layers of a pre-trained resnet101 are transferred as the feature extraction network and the shallow features of the sliced images are extracted; parallel dilated convolutions are constructed with convolution kernel dilation rates from 2 to 5, the shallow features are input into the parallel dilated convolution network, and the information at different scales is concatenated; the output features of the dilated convolution network are fused again with the shallow features, the resolution is restored by upsampling, and the segmentation result is refined by a conditional random field; the segmentation results of the slices are merged, and unreasonable predictions are repaired by simply filling holes and removing small connected components.
Specifically, in one embodiment, the method comprises the following steps:
S1: firstly, the high-resolution remote sensing image is preprocessed; because its resolution is too high for the memory and GPU memory of an ordinary computer to handle the whole image at once, the image is sliced into tiles of 512 pixels in height and width, in line with the 512 resolution commonly used by mainstream semantic segmentation models;
S2: to be compatible with conventional deep convolutional neural networks, the three RGB channels are extracted from the sliced remote sensing image. Conventional data augmentation is performed: random horizontal flipping, random vertical flipping and colour jittering. During augmentation, the annotation image undergoes the same geometric transformations as the RGB image (see the augmentation sketch after this list of steps);
S3: the RGB three-channel tensor is suitably scaled or normalized. Suppose the data set has a total of m RGB images, each of which can be divided into three channel tensors $[x_{1},x_{2},x_{3}]$; normalizing the tensor yields $[y_{1},y_{2},y_{3}]$, and the normalization formula for each channel is:
$y_{c}=\dfrac{x_{c}-\min(x_{c})}{\max(x_{c})-\min(x_{c})}$
S4: then, standardization is performed with the mean $\mu$ and standard deviation $\sigma$ of each channel to obtain the tensor $[z_{1},z_{2},z_{3}]$; the standardization formula is:
$z_{c}=\dfrac{y_{c}-\mu_{c}}{\sigma_{c}}$
S5: the network is initialized based on resnet101, and layer1 to layer4 are retained; the dilation rate of the dilated convolutions in layer4 is 2, while the dilation rates in layer1 to layer3 are all 1, which is equivalent to standard ordinary convolution;
S6: atrous spatial pyramid convolution is applied to the features output by resnet101, with convolutions at different dilation rates performed in parallel; the spatial pyramid convolution does not perform global pooling and instead replaces the global pooling branch with a standard convolution, in order to obtain deeper semantic information and improve classification accuracy;
S7: using a skip connection, the low-level features produced by layer1 of resnet101 are fused with the spatial pyramid convolution result after linear interpolation; the low-level features bring partial positional information to the high-level features, and because global pooling has been removed from the spatial pyramid convolution layer, the network lacks image-level positional information, so the coarse segmentation result output by the network needs to be post-processed with a conditional random field;
S8: the loss is calculated with cross entropy; because the class distribution of remote sensing images is unbalanced, a weight is assigned to each object class in the cross-entropy calculation; gradients are then computed by backpropagating the loss through the computation graph, and the network parameters are updated;
S9: Adadelta is adopted as the optimization method for model training, and the initial learning rate is set to 1e-1.
Adadelta converges faster in the early and middle stages of training. The backbone used for feature extraction is resnet101; although a resnet101 pretrained on ImageNet cannot directly detect the specific ground objects of a remote sensing image, it effectively perceives low-level information such as edges, corners and colours, so the feature extraction layers of the network are initialized with the parameters of the ImageNet-pretrained resnet101 and the network obtains a good initial solution; the parameters of the remaining layers are randomly initialized from a Gaussian distribution;
S10: the model converges after traversing the whole data set 256 times; the batch size is set to 8, and the total number of training iterations is 5e4;
S11: the high-resolution remote sensing image cannot be segmented in a single pass, so it is sliced first and the slices are semantically segmented one by one; when all slices are stitched back together, unreasonable predictions are repaired by simply filling holes and removing small connected components.
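A minimal sketch of the paired augmentation mentioned in step S2 (an illustration assuming PIL images and the torchvision package; the geometric flips are applied to both the image and its annotation, while colour jittering is assumed to apply to the RGB image only):

import random
from torchvision import transforms
import torchvision.transforms.functional as TF

color_jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2)

def augment(image, label):
    """Random horizontal/vertical flips applied to the image-label pair, plus colour jitter."""
    if random.random() < 0.5:                   # random horizontal flip
        image, label = TF.hflip(image), TF.hflip(label)
    if random.random() < 0.5:                   # random vertical flip
        image, label = TF.vflip(image), TF.vflip(label)
    image = color_jitter(image)                 # colour jitter on the RGB image only
    return image, label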
Further, as shown in fig. 1, panels (a), (b) and (c) of fig. 1 show feature sampling by dilated convolutions with dilation rates of 1, 2 and 3, respectively; it can be seen from fig. 1 that the receptive field grows as the dilation rate of the dilated convolution increases. By setting the dilation rate, dilated convolution samples the features sparsely, and any dilation rate can be used, which makes it possible to control the receptive field explicitly and to gather context information in dense prediction tasks. Setting the dilation rate does not change the structure of the original network parameters, which is friendly to transfer learning, so the network can still be fine-tuned from the original parameters after the dilation rate is set.
In addition, according to steps S1 to S4, the GID high-resolution remote sensing images are sliced to a resolution of 512, then normalized and standardized; statistics over the data set give the three RGB channel means required for standardization, 0.3515224, 0.38427463 and 0.35403764, and the corresponding standard deviations, 0.19264674, 0.18325084 and 0.17028946.
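A minimal sketch of the normalization and standardization of steps S3 and S4 using the channel statistics above (illustrative only; the tile is assumed to be an 8-bit (H, W, 3) array, so scaling by 255 plays the role of the per-channel min-max normalization):

import numpy as np

GID_MEAN = np.array([0.3515224, 0.38427463, 0.35403764])   # per-channel means from the data set
GID_STD = np.array([0.19264674, 0.18325084, 0.17028946])   # per-channel standard deviations

def normalize_and_standardize(tile: np.ndarray) -> np.ndarray:
    """Scale an 8-bit RGB tile to [0, 1], then standardize each channel."""
    y = tile.astype(np.float32) / 255.0         # normalization to [0, 1]
    z = (y - GID_MEAN) / GID_STD                # per-channel standardization
    return z.astype(np.float32)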
In addition, as shown in fig. 2, according to steps S5 and S6, the lower layers of the feature extraction network resnet101 are initialized with the resnet101 parameters pretrained on ImageNet; these lower layers effectively detect positional information such as edges and corners. A parallel dilated convolution network is constructed with the dilation rates set to 2, 3, 4 and 5, respectively, and all convolution kernel templates are tensors of size (3, 3). The shallow features are input into the parallel dilated convolution network to obtain multi-scale information, the multi-scale information is fused by concatenation, and the computation process of feeding the features into the dilated convolutions is shown in fig. 2.
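A minimal PyTorch sketch of such a parallel dilated convolution layer is given below (a sketch under stated assumptions: the module name ParallelDilatedConv, the channel widths, and the extra standard 3x3 branch standing in for the removed global pooling branch are illustrative choices, not fixed by the text above):

import torch
import torch.nn as nn

class ParallelDilatedConv(nn.Module):
    """Parallel 3x3 dilated convolutions with rates 2, 3, 4, 5, fused by concatenation."""

    def __init__(self, in_channels: int = 2048, branch_channels: int = 256):
        super().__init__()
        def branch(rate):
            # padding=rate keeps the spatial size unchanged for a 3x3 kernel
            return nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=rate, dilation=rate, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
        # dilation 1 is a standard convolution, here replacing the global pooling branch
        self.branches = nn.ModuleList([branch(rate) for rate in (1, 2, 3, 4, 5)])
        self.project = nn.Sequential(
            nn.Conv2d(branch_channels * 5, branch_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.branches]               # multi-scale features
        return self.project(torch.cat(feats, dim=1))        # fuse by concatenation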
In addition, according to step S7, the fused features are fused again with the shallow features to compensate for positional detail, and the resolution is restored by upsampling; because the resulting semantic segmentation is still coarse, the fully connected conditional random field is used to refine the image-level information and thereby improve the semantic segmentation result.
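A minimal post-processing sketch using the pydensecrf package (an assumption of this sketch: the embodiment only specifies a fully connected conditional random field, not a particular implementation, and the kernel and compatibility parameters shown are illustrative):

import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(softmax_probs: np.ndarray, rgb_image: np.ndarray, n_iters: int = 5) -> np.ndarray:
    """Refine per-pixel class probabilities with a fully connected CRF.

    softmax_probs: (n_classes, H, W) float32 probabilities from the network.
    rgb_image:     (H, W, 3) uint8, C-contiguous, the corresponding slice.
    Returns the refined label map of shape (H, W).
    """
    n_classes, h, w = softmax_probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))      # unary term: -log P(x_i)
    d.addPairwiseGaussian(sxy=3, compat=3)                   # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=13, rgbim=rgb_image, compat=10)  # appearance kernel
    q = d.inference(n_iters)                                 # approximate energy minimization
    return np.argmax(q, axis=0).reshape(h, w).astype(np.int32)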
In addition, according to steps S8 to S10, the data set is traversed and forward computation is performed, with the parameters updated after each batch; the pixels are first counted per class to obtain the proportion of each class, and these proportions are folded into the cross-entropy loss as per-class coefficients. Starting from the loss node, backpropagation is carried out through the computation graph to obtain the gradients and update the model parameters. The optimizer used for the updates is Adadelta, which converges faster in the early and middle stages of training.
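A minimal training-step sketch for steps S8 to S10 (illustrative only: the model and data-loader objects and the inverse-frequency weighting are assumptions; the embodiment only requires per-class coefficients folded into the cross-entropy loss and an Adadelta optimizer with initial learning rate 1e-1):

import torch
import torch.nn as nn

def class_weights_from_counts(pixel_counts: torch.Tensor) -> torch.Tensor:
    """Turn per-class pixel counts into loss coefficients (inverse frequency here)."""
    freq = pixel_counts.float() / pixel_counts.sum()
    return 1.0 / (freq + 1e-6)

def train(model, loader, pixel_counts, device="cuda", lr=1e-1):
    criterion = nn.CrossEntropyLoss(weight=class_weights_from_counts(pixel_counts).to(device))
    optimizer = torch.optim.Adadelta(model.parameters(), lr=lr)   # initial learning rate 1e-1
    model.train()
    for images, labels in loader:                # traverse the data set, batch size 8
        images, labels = images.to(device), labels.to(device)
        logits = model(images)                   # forward computation
        loss = criterion(logits, labels)         # weighted cross entropy
        optimizer.zero_grad()
        loss.backward()                          # backpropagate through the computation graph
        optimizer.step()                         # update the model parameters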
In addition, as shown in fig. 3, the parameters of the semantic segmentation network are obtained after training and are loaded into a network with the corresponding structure at inference time. Semantic segmentation is performed on each slice of the high-resolution remote sensing image; panels (a), (b) and (c) of fig. 3 show the original slice, the ground-truth label of the slice, and the semantic segmentation result of the parallel dilated convolution, respectively, with different land-cover classes rendered as different pixel values. Fig. 3 shows that the remote sensing image semantic segmentation method based on parallel dilated convolution achieves a good result, with a segmentation close to the ground-truth label.
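A minimal inference sketch for one slice (illustrative; build_model stands for constructing a network with the corresponding structure and weights.pth for the trained parameter file, both hypothetical names):

import torch

def predict_slice(model, slice_tensor: torch.Tensor, device="cuda") -> torch.Tensor:
    """Predict the label map of one preprocessed slice of shape (3, 512, 512)."""
    model.eval()
    with torch.no_grad():
        logits = model(slice_tensor.unsqueeze(0).to(device))   # add a batch dimension
        return logits.argmax(dim=1).squeeze(0).cpu()           # per-pixel class indices

# usage: load the trained parameters into a network with the corresponding structure
# model = build_model(); model.load_state_dict(torch.load("weights.pth")); model.to("cuda")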
In addition, according to step S11, the semantic segmentation results of the individual slices are arranged and merged, and unreasonable predictions are repaired by simply filling holes and removing small connected components when the slices are stitched together.
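A minimal sketch of the stitching-time repair (an illustration assuming the SciPy package; holes enclosed by a class region are filled with that class, and small connected components are relabelled with the surrounding majority class):

import numpy as np
from scipy import ndimage
from scipy.ndimage import binary_fill_holes, binary_dilation

def repair_merged_prediction(label_map: np.ndarray, n_classes: int, min_size: int = 64) -> np.ndarray:
    """Fill holes and remove small connected components, class by class."""
    repaired = label_map.copy()
    for cls in range(n_classes):
        mask = repaired == cls
        repaired[binary_fill_holes(mask) & ~mask] = cls        # fill holes inside the class region
        components, n = ndimage.label(repaired == cls)         # connected components of the class
        for comp_id in range(1, n + 1):
            comp = components == comp_id
            if comp.sum() < min_size:                          # small connected component
                border = binary_dilation(comp) & ~comp         # ring of neighbouring pixels
                if border.any():
                    repaired[comp] = np.bincount(repaired[border]).argmax()
    return repaired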
In summary, according to the technical solution of the present invention, a high-resolution remote sensing image is obtained in advance, sliced, normalized and standardized to obtain a source high-resolution remote sensing image; the lower layers of the feature extraction network resnet101 are initialized with parameters pretrained on ImageNet; a parallel dilated convolution network is constructed and shallow features of the source high-resolution remote sensing image are extracted; the shallow features are input into the parallel dilated convolution network to obtain multi-scale information, and the multi-scale information is fused; the fused features are fused again with the shallow features, image-level information is refined with a fully connected conditional random field, and the semantic segmentation result is obtained. The receptive field of the convolution is thereby enlarged without adding extra parameters, and compared with a standard convolution that achieves the same receptive field, the parallel dilated convolution saves GPU memory. The parallel computing structure also makes it convenient to distribute the nodes of the neural network computation graph over distributed hardware, improving computation speed. The multi-scale information helps the network capture both fine objects and large objects, so small target objects are less likely to be missed and the semantic segmentation accuracy is improved. In addition, dilated convolution perceives a wide neighbourhood around the target object, and with the help of this neighbouring information pixel-level classification is carried out more effectively than with standard convolution.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.