Disclosure of Invention
Aiming at the problems in the related art, the invention provides a remote sensing image semantic segmentation method based on parallel dilated (atrous) convolution, so as to overcome the technical problems in the prior related art.
The technical scheme of the invention is realized as follows:
a remote sensing image semantic segmentation method based on parallel dilated convolution comprises the following steps:
the method comprises the steps of obtaining a high-resolution remote sensing image in advance, slicing the high-resolution remote sensing image, and normalizing and standardizing it to obtain a source high-resolution remote sensing image;
initializing the lower layers of the feature extraction network resnet101 with resnet101 parameters pretrained on ImageNet, constructing a parallel dilated convolution network, and extracting shallow features of the source high-resolution remote sensing image;
inputting the shallow features into the parallel dilated convolution network to obtain multi-scale information and fusing the multi-scale information, wherein different dilation rates are set to capture the multi-scale information;
and fusing the fused features with the shallow features again, refining image-level information by using a fully connected conditional random field, and obtaining a semantic segmentation result.
Further, slicing the high-resolution remote sensing image comprises slicing it into tiles of 512 pixels in both height and width.
Further, the method also comprises the following step:
extracting the three RGB channels from the sliced high-resolution remote sensing image.
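A minimal sketch of this slicing and channel-extraction step is given below (an illustrative sketch only, assuming the image has already been loaded as a NumPy array of shape (H, W, C) with the RGB channels first; the function name tile_512 is hypothetical and not part of the invention):

import numpy as np

def tile_512(image: np.ndarray, tile: int = 512):
    """Slice a high-resolution remote sensing image into 512x512 RGB tiles.

    Only the first three (RGB) channels are kept, and border regions that do
    not fill a whole tile are simply discarded in this simplified sketch.
    """
    rgb = image[:, :, :3]                       # extract the three RGB channels
    h, w = rgb.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(rgb[top:top + tile, left:left + tile])
    return tiles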
Further, constructing the parallel dilated convolution network includes the following steps:
starting from a standard ordinary convolution, let $F:\mathbb{Z}^2\to\mathbb{R}$ be a discrete function, let $\Omega_r=[-r,r]^2\cap\mathbb{Z}^2$, and let $k:\Omega_r\to\mathbb{R}$ be a discrete convolution kernel of size $(2r+1)^2$; the convolution centered at a position $p$ is computed as:
$(F\ast k)(p)=\sum_{s+t=p}F(s)\,k(t)$
by extending the standard convolution, let $l$ be the dilation rate of the dilated convolution; the dilated convolution is then:
$(F\ast_{l}k)(p)=\sum_{s+lt=p}F(s)\,k(t);$
and applying dilated convolutions with different dilation rates to the shallow features in parallel to obtain multi-scale features, and fusing the multi-scale features by concatenation to form a parallel dilated convolution network layer.
Further, the dilation rates are set to 2, 3, 4, and 5, respectively.
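As a worked example (a direct consequence of the dilated convolution defined above, not an additional limitation), a $3\times 3$ kernel with dilation rate $l$ has an effective kernel size of
$k_{\mathrm{eff}} = 3 + 2(l-1),$
so the four parallel branches with $l = 2, 3, 4, 5$ cover effective kernel sizes of $5\times 5$, $7\times 7$, $9\times 9$ and $11\times 11$ respectively, while each branch still uses only $3\times 3 = 9$ weights.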
Further, refining image-level information with the fully connected conditional random field comprises the following steps:
the energy function used by the fully connected conditional random field is:
$E(x)=\sum_{i}\theta_{i}(x_{i})+\sum_{i<j}\theta_{ij}(x_{i},x_{j})$
its unary potential function describes the relation between the observation and the label:
$\theta_{i}(x_{i})=-\log P(x_{i})$
wherein $i$ indexes the pixels of the image and $P(x_{i})$ is the classification probability assigned by the network to pixel $i$; the pairwise potential function describes the correlation between observed pixels:
$\theta_{ij}(x_{i},x_{j})=\mu(x_{i},x_{j})\sum_{m=1}^{K}w_{m}\,k_{m}(f_{i},f_{j})$
where $\mu(x_{i},x_{j})=1$ when $x_{i}\neq x_{j}$ and $\mu(x_{i},x_{j})=0$ otherwise, $k_{m}(f_{i},f_{j})$ is a Gaussian kernel between $f_{i}$ and $f_{j}$, $f_{i}$ is the colour information corresponding to pixel $i$, i.e. its feature vector, and $w_{m}$ is the weight of the $m$-th Gaussian kernel;
in the process of minimizing the energy function, unreasonably classified pixels in the image are corrected, and a refined semantic segmentation result is obtained.
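As an illustrative note (this particular kernel choice follows the widely used fully connected CRF of Krähenbühl and Koltun and is given here as an assumption, not as a limitation of the method), the Gaussian kernels $k_{m}$ are commonly an appearance kernel plus a smoothness kernel:
$k_{1}(f_{i},f_{j})=\exp\!\left(-\dfrac{\lVert p_{i}-p_{j}\rVert^{2}}{2\theta_{\alpha}^{2}}-\dfrac{\lVert I_{i}-I_{j}\rVert^{2}}{2\theta_{\beta}^{2}}\right),\qquad k_{2}(f_{i},f_{j})=\exp\!\left(-\dfrac{\lVert p_{i}-p_{j}\rVert^{2}}{2\theta_{\gamma}^{2}}\right),$
where $p_{i}$ is the position of pixel $i$, $I_{i}$ is its colour vector, and $\theta_{\alpha}$, $\theta_{\beta}$, $\theta_{\gamma}$ control the kernel bandwidths.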
The invention has the beneficial effects that:
the invention relates to a remote sensing image semantic segmentation method based on parallel cavity convolution, which comprises the steps of obtaining a high-resolution remote sensing image in advance, slicing the high-resolution remote sensing image, normalizing and standardizing to obtain a source high-resolution remote sensing image, initializing the characteristic based on the pretrained resnet101 parameter on ImageNet to extract the lower layer network of the network resnet101, constructing a parallel cavity convolution network, extracting shallow features of the source high-resolution remote sensing image, inputting the shallow features into a parallel cavity convolution network to obtain multi-scale information, and the multi-scale information is fused, the fused features and the shallow features are fused again, the full-connection conditional random field is used for restoring the image-level information, the semantic segmentation result is obtained, and the semantic segmentation is realized without adding extra parameters, the field of experience of convolution is enlarged, and compared with standard convolution which achieves the same field of experience, the parallel hole convolution method can save video memory; the parallel computing structure is adopted, so that the nodes in the neural network computing graph are conveniently distributed on distributed hardware, and the computing speed is improved; the multi-scale information is beneficial to capturing detailed objects and large objects by a network, small target objects are not easy to miss, and the semantic segmentation precision is improved; in addition, the void convolution can widely perceive the adjacent objects of the target object, the pixel-level classification can be more effectively carried out by means of the adjacent information, and the pixel-level classification effect is better than that of the standard convolution.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
According to the embodiment of the invention, a remote sensing image semantic segmentation method based on parallel dilated convolution is provided.
As shown in fig. 1 to fig. 3, the remote sensing image semantic segmentation method based on parallel dilated convolution according to the embodiment of the present invention includes the following steps:
the method comprises the steps of obtaining a high-resolution remote sensing image in advance, slicing the high-resolution remote sensing image, and normalizing and standardizing it to obtain a source high-resolution remote sensing image;
initializing the lower layers of the feature extraction network resnet101 with resnet101 parameters pretrained on ImageNet, constructing a parallel dilated convolution network, and extracting shallow features of the source high-resolution remote sensing image;
inputting the shallow features into the parallel dilated convolution network to obtain multi-scale information and fusing the multi-scale information, wherein different dilation rates are set to capture the multi-scale information;
and fusing the fused features with the shallow features again, refining image-level information by using a fully connected conditional random field, and obtaining a semantic segmentation result.
The high-resolution remote sensing image is sliced into tiles of 512 pixels in both height and width.
Wherein, the method further comprises the following step:
extracting the three RGB channels from the sliced high-resolution remote sensing image.
Constructing the parallel dilated convolution network comprises the following steps:
starting from a standard ordinary convolution, let $F:\mathbb{Z}^2\to\mathbb{R}$ be a discrete function, let $\Omega_r=[-r,r]^2\cap\mathbb{Z}^2$, and let $k:\Omega_r\to\mathbb{R}$ be a discrete convolution kernel of size $(2r+1)^2$; the convolution centered at a position $p$ is computed as:
$(F\ast k)(p)=\sum_{s+t=p}F(s)\,k(t)$
by extending the standard convolution, let $l$ be the dilation rate of the dilated convolution; the dilated convolution is then:
$(F\ast_{l}k)(p)=\sum_{s+lt=p}F(s)\,k(t);$
and applying dilated convolutions with different dilation rates to the shallow features in parallel to obtain multi-scale features, and fusing the multi-scale features by concatenation to form a parallel dilated convolution network layer.
Wherein the dilation rates are set to 2, 3, 4 and 5, respectively.
Wherein refining image-level information with the fully connected conditional random field comprises the following steps:
the energy function used by the fully connected conditional random field is:
$E(x)=\sum_{i}\theta_{i}(x_{i})+\sum_{i<j}\theta_{ij}(x_{i},x_{j})$
its unary potential function describes the relation between the observation and the label:
$\theta_{i}(x_{i})=-\log P(x_{i})$
wherein $i$ indexes the pixels of the image and $P(x_{i})$ is the classification probability assigned by the network to pixel $i$; the pairwise potential function describes the correlation between observed pixels:
$\theta_{ij}(x_{i},x_{j})=\mu(x_{i},x_{j})\sum_{m=1}^{K}w_{m}\,k_{m}(f_{i},f_{j})$
where $\mu(x_{i},x_{j})=1$ when $x_{i}\neq x_{j}$ and $\mu(x_{i},x_{j})=0$ otherwise, $k_{m}(f_{i},f_{j})$ is a Gaussian kernel between $f_{i}$ and $f_{j}$, $f_{i}$ is the colour information corresponding to pixel $i$, i.e. its feature vector, and $w_{m}$ is the weight of the $m$-th Gaussian kernel.
By means of this technical solution, the high-resolution remote sensing image is sliced; the lower layers of a pre-trained resnet101 are transferred as the feature extraction network and the shallow features of the sliced images are extracted; parallel dilated convolutions are constructed with convolution kernel dilation rates from 2 to 5, the shallow features are input into the parallel dilated convolution network, and the information at different scales is concatenated; the output features of the dilated convolution network are fused again with the shallow features, the resolution is restored by upsampling, and the segmentation result is refined by a conditional random field; the segmentation results of the slices are merged, and unreasonable predictions are repaired by simply filling holes and removing small connected components.
Specifically, in one embodiment, the method comprises the following steps:
S1: firstly, the high-resolution remote sensing image is preprocessed; because its resolution is too high for the memory and GPU memory of an ordinary computer to handle the whole image at once, the image is sliced into tiles of 512 pixels in height and width, in line with the 512 resolution commonly used by mainstream semantic segmentation models;
S2: to be compatible with conventional deep convolutional neural networks, the three RGB channels are extracted from the sliced remote sensing image. Conventional data augmentation is performed: random horizontal flipping, random vertical flipping and colour jittering. During augmentation, the annotation image undergoes the same geometric transformations as the RGB image (see the augmentation sketch after this list of steps);
S3: the RGB three-channel tensor is suitably scaled or normalized. Suppose the data set has a total of m RGB images, each of which can be divided into three channel tensors $[x_{1},x_{2},x_{3}]$; normalizing the tensor yields $[y_{1},y_{2},y_{3}]$, and the normalization formula for each channel is:
$y_{c}=\dfrac{x_{c}-\min(x_{c})}{\max(x_{c})-\min(x_{c})}$
S4: then, standardization is performed with the mean $\mu$ and standard deviation $\sigma$ of each channel to obtain the tensor $[z_{1},z_{2},z_{3}]$; the standardization formula is:
$z_{c}=\dfrac{y_{c}-\mu_{c}}{\sigma_{c}}$
S5: the network is initialized based on resnet101, and layer1 to layer4 are retained; the dilation rate of the dilated convolutions in layer4 is 2, while the dilation rates in layer1 to layer3 are all 1, which is equivalent to standard ordinary convolution;
S6: atrous spatial pyramid convolution is applied to the features output by resnet101, with convolutions at different dilation rates performed in parallel; the spatial pyramid convolution does not perform global pooling and instead replaces the global pooling branch with a standard convolution, in order to obtain deeper semantic information and improve classification accuracy;
S7: using a skip connection, the low-level features produced by layer1 of resnet101 are fused with the spatial pyramid convolution result after linear interpolation; the low-level features bring partial positional information to the high-level features, and because global pooling has been removed from the spatial pyramid convolution layer, the network lacks image-level positional information, so the coarse segmentation result output by the network needs to be post-processed with a conditional random field;
S8: the loss is calculated with cross entropy; because the class distribution of remote sensing images is unbalanced, a weight is assigned to each object class in the cross-entropy calculation; gradients are then computed by backpropagating the loss through the computation graph, and the network parameters are updated;
S9: Adadelta is adopted as the optimization method for model training, and the initial learning rate is set to 1e-1.
Adadelta converges faster in the early and middle stages of training. The backbone used for feature extraction is resnet101; although a resnet101 pretrained on ImageNet cannot directly detect the specific ground objects of a remote sensing image, it effectively perceives low-level information such as edges, corners and colours, so the feature extraction layers of the network are initialized with the parameters of the ImageNet-pretrained resnet101 and the network obtains a good initial solution; the parameters of the remaining layers are randomly initialized from a Gaussian distribution;
S10: the model converges after traversing the whole data set 256 times; the batch size is set to 8, and the total number of training iterations is 5e4;
S11: the high-resolution remote sensing image cannot be segmented in a single pass, so it is sliced first and the slices are semantically segmented one by one; when all slices are stitched back together, unreasonable predictions are repaired by simply filling holes and removing small connected components.
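A minimal sketch of the paired augmentation mentioned in step S2 (an illustration assuming PIL images and the torchvision package; the geometric flips are applied to both the image and its annotation, while colour jittering is assumed to apply to the RGB image only):

import random
from torchvision import transforms
import torchvision.transforms.functional as TF

color_jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2)

def augment(image, label):
    """Random horizontal/vertical flips applied to the image-label pair, plus colour jitter."""
    if random.random() < 0.5:                   # random horizontal flip
        image, label = TF.hflip(image), TF.hflip(label)
    if random.random() < 0.5:                   # random vertical flip
        image, label = TF.vflip(image), TF.vflip(label)
    image = color_jitter(image)                 # colour jitter on the RGB image only
    return image, label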
Further, as shown in fig. 1, panels (a), (b) and (c) of fig. 1 show feature sampling by dilated convolutions with dilation rates of 1, 2 and 3, respectively; it can be seen from fig. 1 that the receptive field grows as the dilation rate of the dilated convolution increases. By setting the dilation rate, dilated convolution samples the features sparsely, and any dilation rate can be used, which makes it possible to control the receptive field explicitly and to gather context information in dense prediction tasks. Setting the dilation rate does not change the structure of the original network parameters, which is friendly to transfer learning, so the network can still be fine-tuned from the original parameters after the dilation rate is set.
In addition, according to steps S1 to S4, the GID high-resolution remote sensing images are sliced to a resolution of 512, then normalized and standardized; statistics over the data set give the three RGB channel means required for standardization, 0.3515224, 0.38427463 and 0.35403764, and the corresponding standard deviations, 0.19264674, 0.18325084 and 0.17028946.
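A minimal sketch of the normalization and standardization of steps S3 and S4 using the channel statistics above (illustrative only; the tile is assumed to be an 8-bit (H, W, 3) array, so scaling by 255 plays the role of the per-channel min-max normalization):

import numpy as np

GID_MEAN = np.array([0.3515224, 0.38427463, 0.35403764])   # per-channel means from the data set
GID_STD = np.array([0.19264674, 0.18325084, 0.17028946])   # per-channel standard deviations

def normalize_and_standardize(tile: np.ndarray) -> np.ndarray:
    """Scale an 8-bit RGB tile to [0, 1], then standardize each channel."""
    y = tile.astype(np.float32) / 255.0         # normalization to [0, 1]
    z = (y - GID_MEAN) / GID_STD                # per-channel standardization
    return z.astype(np.float32)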
In addition, as shown in fig. 2, according to steps S5 and S6, the lower layers of the feature extraction network resnet101 are initialized with the resnet101 parameters pretrained on ImageNet; these lower layers effectively detect positional information such as edges and corners. A parallel dilated convolution network is constructed with the dilation rates set to 2, 3, 4 and 5, respectively, and all convolution kernel templates are tensors of size (3, 3). The shallow features are input into the parallel dilated convolution network to obtain multi-scale information, the multi-scale information is fused by concatenation, and the computation process of feeding the features into the dilated convolutions is shown in fig. 2.
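A minimal PyTorch sketch of such a parallel dilated convolution layer is given below (a sketch under stated assumptions: the module name ParallelDilatedConv, the channel widths, and the extra standard 3x3 branch standing in for the removed global pooling branch are illustrative choices, not fixed by the text above):

import torch
import torch.nn as nn

class ParallelDilatedConv(nn.Module):
    """Parallel 3x3 dilated convolutions with rates 2, 3, 4, 5, fused by concatenation."""

    def __init__(self, in_channels: int = 2048, branch_channels: int = 256):
        super().__init__()
        def branch(rate):
            # padding=rate keeps the spatial size unchanged for a 3x3 kernel
            return nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=rate, dilation=rate, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
        # dilation 1 is a standard convolution, here replacing the global pooling branch
        self.branches = nn.ModuleList([branch(rate) for rate in (1, 2, 3, 4, 5)])
        self.project = nn.Sequential(
            nn.Conv2d(branch_channels * 5, branch_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.branches]               # multi-scale features
        return self.project(torch.cat(feats, dim=1))        # fuse by concatenation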
In addition, according to step S7, the fused features are fused again with the shallow features to compensate for positional detail, and the resolution is restored by upsampling; because the resulting semantic segmentation is still coarse, the fully connected conditional random field is used to refine the image-level information and thereby improve the semantic segmentation result.
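A minimal post-processing sketch using the pydensecrf package (an assumption of this sketch: the embodiment only specifies a fully connected conditional random field, not a particular implementation, and the kernel and compatibility parameters shown are illustrative):

import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(softmax_probs: np.ndarray, rgb_image: np.ndarray, n_iters: int = 5) -> np.ndarray:
    """Refine per-pixel class probabilities with a fully connected CRF.

    softmax_probs: (n_classes, H, W) float32 probabilities from the network.
    rgb_image:     (H, W, 3) uint8, C-contiguous, the corresponding slice.
    Returns the refined label map of shape (H, W).
    """
    n_classes, h, w = softmax_probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))      # unary term: -log P(x_i)
    d.addPairwiseGaussian(sxy=3, compat=3)                   # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=13, rgbim=rgb_image, compat=10)  # appearance kernel
    q = d.inference(n_iters)                                 # approximate energy minimization
    return np.argmax(q, axis=0).reshape(h, w).astype(np.int32)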
In addition, according to steps S8 to S10, the data set is traversed and forward computation is performed, with the parameters updated after each batch; the pixels are first counted per class to obtain the proportion of each class, and these proportions are folded into the cross-entropy loss as per-class coefficients. Starting from the loss node, backpropagation is carried out through the computation graph to obtain the gradients and update the model parameters. The optimizer used for the updates is Adadelta, which converges faster in the early and middle stages of training.
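A minimal training-step sketch for steps S8 to S10 (illustrative only: the model and data-loader objects and the inverse-frequency weighting are assumptions; the embodiment only requires per-class coefficients folded into the cross-entropy loss and an Adadelta optimizer with initial learning rate 1e-1):

import torch
import torch.nn as nn

def class_weights_from_counts(pixel_counts: torch.Tensor) -> torch.Tensor:
    """Turn per-class pixel counts into loss coefficients (inverse frequency here)."""
    freq = pixel_counts.float() / pixel_counts.sum()
    return 1.0 / (freq + 1e-6)

def train(model, loader, pixel_counts, device="cuda", lr=1e-1):
    criterion = nn.CrossEntropyLoss(weight=class_weights_from_counts(pixel_counts).to(device))
    optimizer = torch.optim.Adadelta(model.parameters(), lr=lr)   # initial learning rate 1e-1
    model.train()
    for images, labels in loader:                # traverse the data set, batch size 8
        images, labels = images.to(device), labels.to(device)
        logits = model(images)                   # forward computation
        loss = criterion(logits, labels)         # weighted cross entropy
        optimizer.zero_grad()
        loss.backward()                          # backpropagate through the computation graph
        optimizer.step()                         # update the model parameters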
In addition, as shown in fig. 3, the parameters of the semantic segmentation network are obtained after training and are loaded into a network with the corresponding structure at inference time. Semantic segmentation is performed on each slice of the high-resolution remote sensing image; panels (a), (b) and (c) of fig. 3 show the original slice, the ground-truth label of the slice, and the semantic segmentation result of the parallel dilated convolution, respectively, with different land-cover classes rendered as different pixel values. Fig. 3 shows that the remote sensing image semantic segmentation method based on parallel dilated convolution achieves a good result, with a segmentation close to the ground-truth label.
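A minimal inference sketch for one slice (illustrative; build_model stands for constructing a network with the corresponding structure and weights.pth for the trained parameter file, both hypothetical names):

import torch

def predict_slice(model, slice_tensor: torch.Tensor, device="cuda") -> torch.Tensor:
    """Predict the label map of one preprocessed slice of shape (3, 512, 512)."""
    model.eval()
    with torch.no_grad():
        logits = model(slice_tensor.unsqueeze(0).to(device))   # add a batch dimension
        return logits.argmax(dim=1).squeeze(0).cpu()           # per-pixel class indices

# usage: load the trained parameters into a network with the corresponding structure
# model = build_model(); model.load_state_dict(torch.load("weights.pth")); model.to("cuda")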
In addition, according to step S11, the semantic segmentation results of the individual slices are arranged and merged, and unreasonable predictions are repaired by simply filling holes and removing small connected components when the slices are stitched together.
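A minimal sketch of the stitching-time repair (an illustration assuming the SciPy package; holes enclosed by a class region are filled with that class, and small connected components are relabelled with the surrounding majority class):

import numpy as np
from scipy import ndimage
from scipy.ndimage import binary_fill_holes, binary_dilation

def repair_merged_prediction(label_map: np.ndarray, n_classes: int, min_size: int = 64) -> np.ndarray:
    """Fill holes and remove small connected components, class by class."""
    repaired = label_map.copy()
    for cls in range(n_classes):
        mask = repaired == cls
        repaired[binary_fill_holes(mask) & ~mask] = cls        # fill holes inside the class region
        components, n = ndimage.label(repaired == cls)         # connected components of the class
        for comp_id in range(1, n + 1):
            comp = components == comp_id
            if comp.sum() < min_size:                          # small connected component
                border = binary_dilation(comp) & ~comp         # ring of neighbouring pixels
                if border.any():
                    repaired[comp] = np.bincount(repaired[border]).argmax()
    return repaired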
In summary, according to the technical solution of the present invention, a high-resolution remote sensing image is obtained in advance, sliced, normalized and standardized to obtain a source high-resolution remote sensing image; the lower layers of the feature extraction network resnet101 are initialized with parameters pretrained on ImageNet; a parallel dilated convolution network is constructed and shallow features of the source high-resolution remote sensing image are extracted; the shallow features are input into the parallel dilated convolution network to obtain multi-scale information, and the multi-scale information is fused; the fused features are fused again with the shallow features, image-level information is refined with a fully connected conditional random field, and the semantic segmentation result is obtained. The receptive field of the convolution is thereby enlarged without adding extra parameters, and compared with a standard convolution that achieves the same receptive field, the parallel dilated convolution saves GPU memory. The parallel computing structure also makes it convenient to distribute the nodes of the neural network computation graph over distributed hardware, improving computation speed. The multi-scale information helps the network capture both fine objects and large objects, so small target objects are less likely to be missed and the semantic segmentation accuracy is improved. In addition, dilated convolution perceives a wide neighbourhood around the target object, and with the help of this neighbouring information pixel-level classification is carried out more effectively than with standard convolution.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.