Disclosure of Invention
In view of the above, the present invention aims to provide a method for detecting a salient target in a remote sensing image. The method adopts an encoder-decoder structure, introduces an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, and designs a new loss function to train a target detection model; the trained target detection model is then used to detect salient targets in remote sensing images, thereby effectively improving the detection of salient target edges in remote sensing images and reducing missed detection and false detection of small targets.
The invention discloses a method for detecting a salient target in a remote sensing image, which comprises the following steps:
S1, acquiring remote sensing image data comprising a training set and a test set, and constructing a remote sensing image salient target detection model comprising a detection feature encoder and a cascade feature decoder;
S2, introducing an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, training the remote sensing image salient target detection model based on the remote sensing image data of the training set, and stopping training once a preset loss function converges to obtain a trained remote sensing image salient target detection model;
S3, performing salient target prediction on the remote sensing image data of the test set by using the trained remote sensing image salient target detection model, and outputting the corresponding saliency map.
Further, the detection feature encoder is a dense attention flow encoder obtained by modifying a VGG16 network used as the backbone: the last three fully connected layers of the VGG16 network are removed and the network is truncated before its last pooling layer, yielding the dense attention flow encoder.
Further, the specific implementation manner of the step S2 includes:
S21, introducing an attention mechanism, extracting the output features of the last layer of each stage of the improved VGG16 network, merging the dimensions of the output features, and constructing an operation matrix among pixels based on a preset spatial pixel relation matrix, thereby representing the relations among pixels;
S22, carrying out normalization processing based on the operation matrix among pixels to obtain attention weights, and multiplying the dimension-merged output features by the attention weights to obtain the spatially self-attention-weighted features;
S23, adding the output features and the spatially self-attention-weighted features through a residual connection, and then applying a channel attention mechanism to obtain the output deep features, the process being formulated as:
$F = CA\left(f + \delta \cdot \left(f \ast \mathrm{Re}^{-1}\left(\mathrm{Re}(f) \odot R\right)\right)\right)$
Wherein $\mathrm{Re}^{-1}$ represents the inverse of the dimension-merging operation applied to the output features, R represents the pixel relation matrix, $\ast$ represents element-wise multiplication, δ represents a learnable coefficient, $CA(\cdot)$ represents the channel attention mechanism, and f represents the initial output feature of the backbone network;
S24, performing upsampling and 1×1 convolution on the deep features so that their size and channel number are consistent with those of the current features;
S25, based on a preset gradual splicing module, splicing the upsampled and 1×1-convolved deep features with the current features, starting from the layer next to the current features, in order from shallow to deep;
S26, adjusting the channel number of the spliced features to the channel number of the deep features output by the detection feature encoder, and inputting them into the cascade feature decoder for decoding;
and S27, activating the final output of the cascade feature decoder by using a Sigmoid function, and further completing training of a remote sensing image salient target detection model.
Further, the preset spatial pixel relation matrix in step S21 is expressed as:
$M = \left\{(\mathrm{Re}(f))^{T} \odot \mathrm{Re}(f)\right\}^{T}$
Where $\mathrm{Re}(\cdot)$ represents the operation of combining the last two dimensions of the output feature into one dimension, $\odot$ represents matrix multiplication, and T represents transposition.
Further, the normalization processing based on the operation matrix between pixels in the step S22 is formulated as:
Where r (x, y) represents the degree of importance of the influence of pixel x on pixel y, m (x, y) represents an element in the pixel relationship matrix, and e represents a natural constant.
Further, the method further comprises a step S23', in which multi-level pyramid fusion of multi-scale spatial attention is adopted to extract information from the output features. Specifically, the output features are converted into three branches with different resolutions through 2× and 4× downsampling; the multi-level pyramid fused multi-scale spatial attention refines the features at the different scales; the refined features are fused with the output features through a residual structure; the three levels of features are then fused in order from low resolution to high resolution, yielding deep features weighted by the multi-level pyramid fused multi-scale spatial attention; finally, the deep features output in step S23 are combined with these multi-scale-attention-weighted deep features.
Further, in the step S25, the deep feature after upsampling and 1×1 convolution is spliced with the current feature, and expressed as:
$F_k = \mathrm{Conc}\left(\mathrm{Conv}(\mathrm{Up}(F_5)),\ \ldots,\ \mathrm{Conv}(\mathrm{Up}(F_{k+1})),\ F_k\right)$
Where $\mathrm{Up}(\cdot)$ represents upsampling to align the deep features with the current features, $F_k$ represents the k-th level features fed into the cascade decoder, $F_5$ represents the 5-th level (deepest) features fed into the cascade decoder, and $\mathrm{Conv}$ represents the convolutional layer.
Further, the preset loss function is a combined loss function with different weight coefficients, and is expressed as:
$L = \omega_1 L_P + \omega_2 L_R + \omega_3 L_{MAE} + \omega_4 L_S$
Wherein $L_P$, $L_R$, $L_{MAE}$ and $L_S$ represent a precision loss term, a recall loss term, a mean absolute error loss term and a structural similarity loss term, respectively, and $\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$ represent the weight coefficients of $L_P$, $L_R$, $L_{MAE}$ and $L_S$, respectively, wherein:
$L_S = 1 - S_{measure}$
$S_{measure} = \alpha \times S_o + (1 - \alpha) \times S_r$
Wherein N is the total number of samples, n is the sample index, j is the pixel index along the height of the remote sensing image, i is the pixel index along the width of the remote sensing image, ε is a preset constant, W and H are the width and height of the remote sensing image, S(i, j) ∈ S is the predicted value of each pixel, G(i, j) ∈ G is the true value of each pixel, S is the saliency prediction result, G is the true label, $S_r$ is the region-oriented similarity measure, $S_o$ is the object-structure-oriented similarity measure, and α is a hyper-parameter used to balance the region-oriented and object-structure-oriented similarity measures.
Further, a step S4 is included, comparing the output saliency map with the truth map so as to measure the quality of the saliency map generated by the remote sensing image salient target detection model.
Further, the specific implementation of step S4 is that the saliency map generated by the remote sensing image salient target detection model is evaluated based on preset indices: the PR curve, the F value, the mean absolute error and the S value.
Compared with the prior art, the method for detecting a salient target in a remote sensing image according to the invention has the following advantages:
(1) The invention uses a cascade structure to decode the features, so that higher-level semantic features guide the feature decoding process, effectively alleviating missed detection and false detection of small targets in remote sensing images.
(2) The invention designs a new loss function for training a remote sensing image salient target detection model, improves the prediction confidence of a salient region, and enables the model to predict a more accurate salient target boundary.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
Referring to fig. 1 to 6, the method for detecting a salient target in a remote sensing image of the present invention comprises the following steps:
S1, acquiring remote sensing image data comprising a training set and a test set, and constructing a remote sensing image salient target detection model comprising a detection feature encoder and a cascade feature decoder;
In this step, the detection feature encoder is a dense attention flow encoder obtained by modifying a VGG16 network used as the backbone: the last three fully connected layers of the VGG16 network are removed and the network is truncated before its last pooling layer, yielding the dense attention flow encoder. The feature dimensions of the first four layers of the improved VGG16 network therefore decrease with the backbone layer index k, where W and H are the width and height of the remote sensing image respectively; since the last pooling layer is removed, the feature dimension of the fifth layer is consistent with that of the fourth layer. Meanwhile, after each feature extraction layer, the current-level features are selected and refined before being sent to the next level for extraction.
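As an illustration only, the truncation described above can be sketched as follows (a minimal sketch assuming PyTorch and torchvision; the block boundaries and the weights argument follow torchvision's VGG16 layout and are not part of the claimed method):

```python
import torch.nn as nn
from torchvision.models import vgg16

class TruncatedVGG16(nn.Module):
    """VGG16 backbone with the three fully connected layers dropped and the
    network truncated before its last pooling layer (illustrative sketch)."""
    def __init__(self, pretrained: bool = True):
        super().__init__()
        # In torchvision's VGG16, the convolutional part is `features`
        # (max-pooling layers sit at indices 4, 9, 16, 23 and 30); the three
        # fully connected layers live in `classifier` and are simply unused.
        weights = "IMAGENET1K_V1" if pretrained else None  # torchvision >= 0.13 API
        feats = vgg16(weights=weights).features
        self.block1 = feats[:5]     # conv1_x + pool
        self.block2 = feats[5:10]   # conv2_x + pool
        self.block3 = feats[10:17]  # conv3_x + pool
        self.block4 = feats[17:24]  # conv4_x + pool
        self.block5 = feats[24:30]  # conv5_x, last pooling layer removed

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        f4 = self.block4(f3)
        f5 = self.block5(f4)  # same spatial size as f4: the final pool is gone
        return f1, f2, f3, f4, f5
```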
In this embodiment, EORSSD is used as the remote sensing image dataset, 1400 remote sensing images are randomly selected from the remote sensing image dataset as the training set, and 600 remote sensing images are used as the test set.
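A minimal sketch of such a random 1400/600 split (assuming a local copy of EORSSD; the directory layout and file extension are hypothetical):

```python
import random
from pathlib import Path

image_paths = sorted(Path("EORSSD/images").glob("*.jpg"))  # hypothetical layout
random.seed(0)                      # fix the split for reproducibility
random.shuffle(image_paths)
train_set = image_paths[:1400]      # 1400 images for training
test_set = image_paths[1400:2000]   # 600 images for testing
```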
S2, introducing an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, training the remote sensing image salient target detection model based on the remote sensing image data of the training set, and stopping training once the preset loss function converges to obtain a trained remote sensing image salient target detection model;
This step specifically comprises the following sub-steps:
S21, introducing an attention mechanism, extracting the output features of the last layer of each stage of the improved VGG16 network, merging the dimensions of the output features, and constructing an operation matrix among pixels based on a preset spatial pixel relation matrix, thereby representing the relations among pixels;
Wherein the preset spatial pixel relation matrix is expressed as follows by a formula:
$M = \left\{(\mathrm{Re}(f))^{T} \odot \mathrm{Re}(f)\right\}^{T}$
where $\mathrm{Re}(\cdot)$ represents the operation of combining the last two dimensions of the output feature into one dimension, $\odot$ represents matrix multiplication, and T represents transposition;
S22, carrying out normalization processing based on the operation matrix among pixels to obtain attention weights, multiplying the dimension-merged output features by the attention weights, and then restoring the dimensions of the output features to obtain the spatially self-attention-weighted features, which carry global information;
wherein, the normalization processing based on the operation matrix between pixels is formulated as:
wherein r (x, y) represents the influence importance degree of the pixel x on the pixel y, m (x, y) represents an element in the pixel relation matrix, and e represents a natural constant;
S23, adding the output features and the spatially self-attention-weighted features through a residual connection, and then applying a channel attention mechanism to obtain the output deep features, the process being formulated as:
$F = CA\left(f + \delta \cdot \left(f \ast \mathrm{Re}^{-1}\left(\mathrm{Re}(f) \odot R\right)\right)\right)$
Wherein $\mathrm{Re}^{-1}$ represents the inverse of the dimension-merging operation applied to the output features, R represents the pixel relation matrix, $\ast$ represents element-wise multiplication, δ represents a learnable coefficient, $CA(\cdot)$ represents the channel attention mechanism, and f represents the initial output feature of the backbone network;
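A minimal sketch of steps S21–S23 (assuming PyTorch). The normalization of the pixel operation matrix is implemented here as a softmax, and the channel attention CA(·) as a generic squeeze-and-excitation block; both are assumptions for illustration, as the patent does not reproduce their exact formulas here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Illustrative CA(·): squeeze-and-excitation style channel weighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * w

class PixelSelfAttention(nn.Module):
    """Sketch of F = CA(f + δ·(f * Re⁻¹(Re(f) ⊙ R))) with R the normalized M."""
    def __init__(self, channels: int):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1))  # learnable coefficient δ
        self.ca = ChannelAttention(channels)

    def forward(self, f):
        b, c, h, w = f.shape
        re_f = f.view(b, c, h * w)                      # Re(f): merge last two dims
        m = torch.bmm(re_f.transpose(1, 2), re_f)       # (Re(f))ᵀ ⊙ Re(f); symmetric,
                                                        # so the outer transpose is a no-op
        r = F.softmax(m, dim=-1)                        # assumed softmax normalization -> r(x, y)
        attended = torch.bmm(re_f, r).view(b, c, h, w)  # Re⁻¹(Re(f) ⊙ R)
        out = f + self.delta * (f * attended)           # residual add, element-wise product
        return self.ca(out)                             # channel attention gives deep feature F
```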
In this embodiment, after the improved VGG16 network feature extraction and feature refinement of the remote sensing image, five features with different scales are finally formed, and the deeper layers of the features with different scales contain more semantic features, and the shallower layers retain more detail features.
S24, performing upsampling and 1×1 convolution on the deep features so that their size and channel number are consistent with those of the current features;
S25, based on a preset gradual splicing module, splicing the upsampled and 1×1-convolved deep features with the current features, starting from the layer next to the current features, in order from shallow to deep, the splicing process being formulated as:
$F_k = \mathrm{Conc}\left(\mathrm{Conv}(\mathrm{Up}(F_5)),\ \ldots,\ \mathrm{Conv}(\mathrm{Up}(F_{k+1})),\ F_k\right)$
Where $\mathrm{Up}(\cdot)$ represents upsampling to align the deep features with the current features, $F_k$ represents the k-th level features fed into the cascade decoder, $F_5$ represents the 5-th level (deepest) features fed into the cascade decoder, and $\mathrm{Conv}$ represents the convolutional layer;
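A minimal sketch of the gradual splicing of S24–S26 (assuming PyTorch; the channel counts and the bilinear upsampling mode are placeholders, not values fixed by the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradualSplice(nn.Module):
    """Upsample + 1x1-convolve deeper features, concatenate them with the
    current feature, and adjust the channel count before decoding."""
    def __init__(self, deep_channels, cur_channels):
        super().__init__()
        # one 1x1 convolution per incoming deeper feature (channel alignment)
        self.align = nn.ModuleList(
            [nn.Conv2d(c, cur_channels, kernel_size=1) for c in deep_channels])
        # adjust the spliced result back to the decoder's expected channels
        self.fuse = nn.Conv2d(cur_channels * (len(deep_channels) + 1),
                              cur_channels, kernel_size=1)

    def forward(self, deep_feats, f_cur):
        h, w = f_cur.shape[-2:]
        aligned = [conv(F.interpolate(d, size=(h, w), mode="bilinear",
                                      align_corners=False))      # Up(·), then 1x1 Conv
                   for conv, d in zip(self.align, deep_feats)]
        spliced = torch.cat(aligned + [f_cur], dim=1)             # Conc(..., F_k)
        return self.fuse(spliced)
```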
In this embodiment, in order to extract the image features completely, an attention fusion mode from shallow layers to deep layers is adopted to fuse the multi-level features. Taking the GCA4 module as an example, the GCA4 module receives and splices the output attention maps of the GCA1, GCA2 and GCA3 modules, and reduces the number of channels back to 1 to form the final attention map, expressed as:
$A_4 = \mathrm{Conv}\left(\mathrm{Conc}(A_1, A_2, A_3, A_4)\right)$
The attention map is then multiplied by the refined features, and the deep features fed into the cascade feature decoder are generated through a residual connection.
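A minimal sketch of this shallow-to-deep attention fusion (assuming PyTorch). The sigmoid that bounds the fused attention map and the bilinear resizing are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Splice attention maps A1..A4, reduce them back to one channel, then
    apply the fused map to the refined features with a residual connection."""
    def __init__(self, num_maps: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(num_maps, 1, kernel_size=1)  # channels back to 1

    def forward(self, attn_maps, refined_feat):
        h, w = refined_feat.shape[-2:]
        resized = [F.interpolate(a, size=(h, w), mode="bilinear",
                                 align_corners=False) for a in attn_maps]
        fused = torch.sigmoid(self.reduce(torch.cat(resized, dim=1)))  # Conv(Conc(A1..A4))
        return refined_feat + refined_feat * fused  # multiply, then residual connection
```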
S26, adjusting the channel number of the spliced features to the channel number of the deep features output by the detection feature encoder, and inputting them into the cascade feature decoder for decoding;
and S27, activating the final output of the cascade feature decoder by using a Sigmoid function, and further completing training of a remote sensing image salient target detection model.
In this embodiment, since the deep features contain the richest semantic features, they can guide every level of decoder during decoding; moreover, any feature from a deeper layer carries more semantic information than a feature from a shallower layer, so with the cascade feature decoder the deep features obtained at each encoder level, in addition to the deepest global features, can guide the shallower decoders, which facilitates generation of the final saliency map. Each decoder unit receives the output of the previous-level decoder together with the spliced deep features, and the output of the last decoder is activated with a Sigmoid function to obtain the final predicted saliency map.
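A minimal sketch of the cascade feature decoder just described (assuming PyTorch; the internal layers of each decoder unit and the uniform channel count are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderUnit(nn.Module):
    """One decoder level: previous decoded output + spliced deep features."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, prev_decoded, spliced_feat):
        prev_up = F.interpolate(prev_decoded, size=spliced_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.body(torch.cat([prev_up, spliced_feat], dim=1))

class CascadeDecoder(nn.Module):
    def __init__(self, channels: int = 64, levels: int = 4):
        super().__init__()
        self.units = nn.ModuleList(
            [DecoderUnit(2 * channels, channels) for _ in range(levels)])
        self.head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, deepest_feat, spliced_feats):
        x = deepest_feat
        for unit, feat in zip(self.units, spliced_feats):  # deep -> shallow
            x = unit(x, feat)
        return torch.sigmoid(self.head(x))  # Sigmoid gives the predicted saliency map
```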
S3, performing salient target prediction on the remote sensing image data of the test set by using the trained remote sensing image salient target detection model, and further outputting a corresponding salient map.
In this embodiment, salient target prediction is performed on the remote sensing image data of the test set based on the trained remote sensing image salient target detection model, so that a saliency map with more accurate salient target boundaries can be obtained.
The preset loss function is a combined loss function with different weight coefficients, and is expressed as follows by a formula:
$L = \omega_1 L_P + \omega_2 L_R + \omega_3 L_{MAE} + \omega_4 L_S$
Wherein $L_P$, $L_R$, $L_{MAE}$ and $L_S$ represent a precision loss term, a recall loss term, a mean absolute error loss term and a structural similarity loss term, respectively, and $\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$ represent the weight coefficients of $L_P$, $L_R$, $L_{MAE}$ and $L_S$, respectively, wherein:
$L_S = 1 - S_{measure}$
$S_{measure} = \alpha \times S_o + (1 - \alpha) \times S_r$
Wherein N is the total number of samples, n is the sample index, j is the pixel index along the height of the remote sensing image, i is the pixel index along the width of the remote sensing image, ε is a preset constant, W and H are the width and height of the remote sensing image, S(i, j) ∈ S is the predicted value of each pixel, G(i, j) ∈ G is the true value of each pixel, S is the saliency prediction result, G is the true label, $S_r$ is the region-oriented similarity measure, $S_o$ is the object-structure-oriented similarity measure, and α is a hyper-parameter used to balance the region-oriented and object-structure-oriented similarity measures.
In this embodiment, because structural similarity compares the structural information between images, the measured difference is more consistent with human visual perception; therefore, adopting the combined loss function with different weight coefficients as the preset loss function overcomes the weak edge-detection capability of the cross-entropy loss function in salient target detection.
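A minimal sketch of the combined loss (assuming PyTorch). The soft precision/recall formulations with a small constant ε and the hook for the structural term are assumptions for illustration; the patent's exact per-term formulas are not reproduced here:

```python
import torch

def combined_loss(pred, gt, weights=(1.0, 1.0, 1.0, 1.0), eps=1e-6,
                  s_measure_fn=None):
    """pred, gt: tensors in [0, 1] with shape (N, 1, H, W)."""
    w1, w2, w3, w4 = weights
    tp = (pred * gt).sum(dim=(1, 2, 3))
    l_p = 1.0 - (tp + eps) / (pred.sum(dim=(1, 2, 3)) + eps)  # precision loss term LP
    l_r = 1.0 - (tp + eps) / (gt.sum(dim=(1, 2, 3)) + eps)    # recall loss term LR
    l_mae = (pred - gt).abs().mean(dim=(1, 2, 3))             # mean absolute error term LMAE
    if s_measure_fn is not None:
        l_s = 1.0 - s_measure_fn(pred, gt)                    # LS = 1 - S_measure
    else:
        l_s = torch.zeros_like(l_mae)                         # placeholder if no S-measure given
    return (w1 * l_p + w2 * l_r + w3 * l_mae + w4 * l_s).mean()
```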
In another embodiment, the method further comprises a step S4 of comparing the output saliency map with the truth map so as to measure the quality of the saliency map generated by the remote sensing image salient target detection model; specifically, the generated saliency map is evaluated based on preset indices: the PR curve, the F value, the mean absolute error and the S value.
In this embodiment, the saliency map generated by the model is compared with the truth map to quantitatively measure the quality of the generated saliency map. Four indices are used for the evaluation: the PR curve, the F value, the mean absolute error (MAE) and the S value.
Precision refers to the ratio of correctly predicted positive samples to all predicted positive samples, i.e. the precision ratio; Recall refers to the ratio of correctly predicted positive samples to the positive samples in the label, i.e. the recall ratio. All (Precision, Recall) pairs can be obtained by adjusting the threshold within (0, 1), and connecting them in sequence yields the Precision-Recall (PR) curve. The closer the PR curve is to the point (1, 1) of the coordinate axes, the better the performance of the model. Fig. 4 shows the PR curve of the remote sensing image salient target detection method in this embodiment.
Wherein the F value is defined as:
$F_{\beta} = \dfrac{(1 + \beta^{2}) \times Precision \times Recall}{\beta^{2} \times Precision + Recall}$
wherein $\beta^{2}$ is set to 0.3 to emphasize the importance of Precision;
The mean absolute error (MAE) measures the absolute error between the saliency prediction and the truth map, formulated as:
$L_{MAE} = \dfrac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left| S(i, j) - G(i, j) \right|$
In the formula, S represents the saliency prediction result, and G represents the true label.
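A minimal sketch of how the PR curve, the F value (β² = 0.3) and the mean absolute error can be computed for one prediction/truth pair (assuming NumPy; reporting the maximum F over thresholds is one common convention and an assumption here):

```python
import numpy as np

def pr_f_mae(sal_map, gt, beta2=0.3, num_thresholds=255):
    """sal_map, gt: 2-D arrays in [0, 1]; gt is binarised at 0.5."""
    gt_bin = gt >= 0.5
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds, endpoint=False):
        pred_bin = sal_map > t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precisions.append(tp / (pred_bin.sum() + 1e-8))  # correct positives / predicted positives
        recalls.append(tp / (gt_bin.sum() + 1e-8))       # correct positives / labelled positives
    p, r = np.array(precisions), np.array(recalls)       # points of the PR curve
    f = (1 + beta2) * p * r / (beta2 * p + r + 1e-8)     # F value at each threshold
    mae = np.abs(sal_map - gt).mean()                    # mean absolute error
    return p, r, f.max(), mae
```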
The S value ($S_{measure}$) is an index that measures the generated saliency map at the level of structural similarity, and is expressed as:
$S_{measure} = \alpha \times S_o + (1 - \alpha) \times S_r$
Where Sr is a region-oriented similarity measure, So is an object-oriented similarity measure, and α represents a hyper-parameter.
In this embodiment, salient target detection is performed on the test set by using the remote sensing image salient target detection method, and the detection results are shown in Table 1, fig. 5 and fig. 6.
Table 1 shows the salient target detection results on the remote sensing images
| Evaluation index | F | MAE | S |
| --- | --- | --- | --- |
| Value | 0.9031 | 0.0048 | 0.9189 |
As can be seen from fig. 5 and fig. 6, the method for detecting a salient target in a remote sensing image not only accurately predicts the salient target but also accurately predicts its boundary region; at the same time, prediction in small-target scenes is relatively accurate, reducing missed detection and false detection.
In another embodiment, the method further comprises a step S23', in which multi-level pyramid fusion of multi-scale spatial attention is adopted to extract information from the output features. Specifically, the output features are converted into three branches with different resolutions through 2× and 4× downsampling; the multi-level pyramid fused multi-scale spatial attention refines the features at the different scales; the refined features are fused with the output features through a residual structure; the three levels of features are then fused in order from low resolution to high resolution, yielding deep features weighted by the multi-level pyramid fused multi-scale spatial attention; finally, the deep features output in step S23 are combined with these multi-scale-attention-weighted deep features.
In this embodiment, besides the attention among individual pixels, multi-scale attention over the entire image space can also extract useful information; specifically, after obtaining the self-attention-weighted feature output, the GCA module uses a multi-level pyramid to fuse multi-scale spatial attention.
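A minimal sketch of the multi-level pyramid multi-scale spatial attention of step S23' (assuming PyTorch; the spatial-attention refinement block and average-pooling downsampling are generic placeholders, not the patent's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Illustrative refinement block: a 7x7 convolution over pooled maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class PyramidSpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.refine = nn.ModuleList([SpatialAttention() for _ in range(3)])

    def forward(self, f):
        scales = [f,
                  F.avg_pool2d(f, kernel_size=2),   # 2x downsampling
                  F.avg_pool2d(f, kernel_size=4)]   # 4x downsampling
        refined = [x + att(x) for att, x in zip(self.refine, scales)]  # residual fusion
        out = refined[2]                            # start from the lowest resolution
        for level in (refined[1], refined[0]):      # fuse low -> high resolution
            out = level + F.interpolate(out, size=level.shape[-2:],
                                        mode="bilinear", align_corners=False)
        return out
```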
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.