Disclosure of Invention
The invention provides a satellite cloud image multi-label hash retrieval method based on an interpretable deep network. The extracted fusion feature contains the rich semantic information of the target cloud image, so the target hash code subsequently generated on this basis also carries rich semantics and is interpretable, which improves the reliability of the whole interpretable deep network and enables accurate retrieval of satellite cloud images.
In order to solve the above problems, the invention provides a satellite cloud image multi-label hash retrieval method based on an interpretable deep network, which comprises the following steps:
inputting the obtained target cloud image as an input item into a pre-trained interpretable deep network to obtain a target hash code of the target cloud image, wherein the interpretable deep network comprises a feature learning module for generating an interpretable fusion feature of the target cloud image and a hash learning module for generating the target hash code; the feature learning module comprises a global feature learning module and a local feature learning module, the global feature learning module being a unit that learns a single global semantic feature characterizing the target cloud image, and the local feature learning module being a unit that learns the local semantic features of a plurality of cloud types in different areas of the target cloud image; the output of the global feature learning module and the output of the local feature learning module are combined to obtain the fusion feature;
and carrying out similarity measurement between the target hash code and the hash code of each historical cloud image in the historical cloud image database, so as to retrieve the historical cloud images similar to the target cloud image.
The beneficial effects of the method are that the extracted fusion feature contains the rich semantic information of the target cloud image, and the target hash code subsequently generated on this basis is likewise rich in semantics and interpretable, so that the reliability of the whole interpretable deep network is improved and accurate retrieval of satellite cloud images is realized, which greatly helps meteorological workers carry out further research and benefits practical application.
Further, the hash learning module in the interpretable deep network generates the target hash code according to the fusion feature, including:
inputting the fusion feature as an input item into the pre-trained hash learning module to obtain a hash-like code corresponding to the fusion feature;
determining the target hash code of the target cloud image according to the hash-like code and a first preset relation;
the first preset relation is:
b = sign(d)
wherein d is the hash-like code and b is the target hash code.
In this scheme, the fusion feature contains rich semantic information, so the generated target hash code also contains rich semantic information; this makes the interpretable deep network interpretable and improves the reliability of the network's output results.
Further, the step of generating the fusion feature of the target cloud image by the feature learning module in the interpretable deep network includes:
inputting the target cloud image into a backbone network to obtain a deep feature map;
determining a global feature characterizing the global semantic information of the target cloud image according to the deep feature map and a preset global feature extraction strategy;
determining local features characterizing the local semantic information of a plurality of cloud types in different areas of the target cloud image based on the deep feature map and a preset local feature extraction strategy;
and splicing the global features and the local features to obtain fusion features.
Further, determining the global feature characterizing the global semantic information of the target cloud image according to the deep feature map and the preset global feature extraction strategy includes:
carrying out an average pooling operation on the deep feature map;
and inputting the result of the average pooling operation into a fully connected layer for feature dimension reduction, so as to obtain the global feature characterizing the global semantic information of the target cloud image.
In this scheme, the global feature is determined simply, accurately and reliably through the average pooling operation and the feature dimension reduction of the fully connected layer, which facilitates the determination of the subsequent fusion feature.
Further, determining the local features characterizing the local semantic information of a plurality of cloud types in different areas of the target cloud image based on the deep feature map and the preset local feature extraction strategy includes:
convolving the deep feature map to obtain a first sub-graph;
processing the first sub-graph according to a preset attention strategy to obtain a first attention map characterizing the mining result of the most distinctive region;
generating a first local feature map according to the first attention map and the first sub-graph;
and determining a first local feature characterizing the local semantic information of the cloud type under the first local feature map, so as to splice the global feature and the first local feature and thereby obtain the fusion feature corresponding to the target cloud image.
In this scheme, the design of the preset local feature extraction strategy ensures that the most distinctive region can be mined, so that the first local feature of this region is extracted and added into the fusion feature, thereby enriching the semantics of the fusion feature.
Further, processing the first sub-graph according to a preset attention policy to obtain a first attention map representing a mining result of the most distinctive region, including:
activating the first sub-graph to determine its corresponding activation feature map;
and performing a convolution operation and an aggregation normalization operation on the activation feature map in sequence, then processing the result with a Sigmoid activation function to obtain the first attention map.
In this scheme, the design of the preset attention strategy takes into account the attention mechanism, and can focus on the most distinctive area.
Further, after processing the first sub-graph according to a preset attention policy to obtain a first attention map representing a mining result of the most distinctive region, the method further includes:
S21, let o = 1;
S22, determining second pixel points based on the first attention map corresponding to the first sub-graph and a second preset relation, so as to obtain a suppression map composed of the second pixel points;
the second preset relation is:
wherein μk′ is the pixel value of the k-th second pixel point in the suppression map, μk is the pixel value of the k-th pixel point in the first attention map, a is a preset hyperparameter, μmean is the mean of the pixel values of all pixel points in the first attention map, and μstd is the standard deviation of the pixel values of all pixel points in the first attention map;
S23, determining an o-th second sub-graph according to the suppression map and the first sub-graph;
S24, processing the o-th second sub-graph according to the preset attention strategy to obtain an o-th second attention map, so as to generate an o-th second local feature map according to the o-th second attention map and the o-th second sub-graph, and further determining an o-th second local feature corresponding to the o-th second local feature map;
S25, let o = o + 1;
S26, judging whether o reaches a preset number of branches, the preset number of branches being an integer not less than 2; if not, entering S27, and if so, entering S29;
S27, determining new second pixel points according to the (o-1)-th second attention map and the second preset relation, so as to obtain a new suppression map corresponding to the new second pixel points;
S28, generating an o-th second sub-graph according to the new suppression map and the (o-1)-th second sub-graph, and returning to S24;
and S29, splicing the global feature, the first local feature and the obtained total of o-1 second local features to obtain the fusion feature corresponding to the target cloud image.
In this scheme, the mining of the remaining specific object regions beyond the most distinctive region is completed by generating a plurality of second local features, so that the resulting fusion feature carries rich semantic information, the final retrieval result gains a degree of interpretability, and the complex semantic content of the satellite cloud image is better described.
Further, the interpretable deep network is trained in advance based on a cloud image training set; the cloud image training set comprises N training cloud images, each training cloud image corresponds to at least one real image classification label, the total number of categories of the real image classification labels corresponding to all the training cloud images is C, and N and C are integers not less than 1. The training step of the interpretable deep network comprises:
inputting the training cloud images into the interpretable deep network to be trained to obtain hash-like code features;
inputting the hash-like code features into a classification layer to obtain corresponding image classification label predicted values;
and training the interpretable deep network by using the hash-like code features, the image classification label predicted values and a preset loss function;
The preset loss function is as follows:
Ltotal = λLcls + ηLq + νLb
wherein Ltotal is the training error, Lcls is the multi-label classification loss with first penalty parameter λ, Lq is the quantization loss with second penalty parameter η, and Lb is the bit balance loss with third penalty parameter ν;
wherein Lcls is determined based on a third preset relation, the third preset relation being:
wherein ŷi^n denotes the i-th predicted value among the image classification label predicted values corresponding to the n-th training cloud image, ωi is a weight value, yi^n is the i-th real value in the real image classification label corresponding to the n-th training cloud image, and σ is the Sigmoid activation function;
the Lq is determined based on a fourth preset relation, the fourth preset relation being:
wherein K is the preset hash code length, dn is the hash-like code feature corresponding to the n-th training cloud image, and e is a K-dimensional vector whose entries are all 1;
the Lb is determined based on a fifth preset relation, the fifth preset relation being:
wherein mean(dn) represents the mean value of dn.
In the scheme, multi-label supervision is introduced into the interpretable deep network, so that more compact hash codes can be generated, and the similarity retrieval efficiency is improved.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
Referring to fig. 1, fig. 1 is a flowchart of a method for searching multi-tag hash of satellite cloud images of an interpretable deep network according to the present invention.
The satellite cloud image multi-tag hash retrieval method of the interpretable deep network comprises the following steps:
S11, inputting the obtained target cloud image as an input item into a pre-trained interpretable deep network to obtain a target hash code of the target cloud image, wherein the interpretable deep network comprises a feature learning module for generating an interpretable fusion feature of the target cloud image and a hash learning module for generating the target hash code; the feature learning module comprises a global feature learning module and a local feature learning module, the global feature learning module being a unit that learns a single global semantic feature characterizing the target cloud image, and the local feature learning module being a unit that learns the local semantic features of a plurality of cloud types in different areas of the target cloud image; the output of the global feature learning module and the output of the local feature learning module are spliced to obtain the fusion feature;
and S12, performing similarity measurement between the target hash code and the hash code of each historical cloud image in the historical cloud image database, so as to retrieve the historical cloud images similar to the target cloud image.
Specifically, considering that different areas of a satellite cloud image often contain different cloud types, an interpretable deep network is designed. The network is trained in advance on a multi-label cloud image training set, overcoming the drawback that rich semantic information in the cloud image is ignored when the network is trained with only a single semantic label. The feature learning module of the interpretable deep network extracts a global feature and a plurality of local features to obtain the fusion feature; an attention branch network and a suppression module are applied in the extraction of each local feature, and the layer-by-layer suppression helps find a plurality of different complementary regions. The hash learning module of the interpretable deep network introduces a multi-label supervision mechanism, so the compact target hash code it generates can represent the content of the target cloud image with rich semantic information while reducing time and space costs. Similarity retrieval is then performed by computing the Hamming distance between the target hash code and the historical hash codes, which reduces to simple XOR operations and improves retrieval efficiency.
In addition, the method provided by the application has been extensively tested on the public satellite cloud image dataset LSCIDMR-V2; the mAP on LSCIDMR-V2 reaches 92.27%, showing that the method has excellent performance.
In summary, the application provides a satellite cloud image multi-label hash retrieval method based on an interpretable deep network. The extracted fusion feature contains the rich semantic information of the target cloud image, and the target hash code subsequently generated on this basis also carries rich semantics and is interpretable, so the reliability of the whole interpretable deep network is improved, meteorological workers can trust the retrieval results, accurate retrieval of satellite cloud images is realized, and practical application is facilitated.
As a preferred embodiment, the hash learning module in the interpretable deep network generates a target hash code from the fusion feature, including:
inputting the fusion feature as an input item into the pre-trained hash learning module to obtain a hash-like code corresponding to the fusion feature;
determining the target hash code of the target cloud image according to the hash-like code and the first preset relation;
the first preset relation is:
b = sign(d)
wherein d is the hash-like code and b is the target hash code.
Specifically, the compact target hash code can be generated for subsequent retrieval, and the target hash code also comprises rich semantic information, so that the interpretable deep network has interpretability, and the reliability of the output result of the network is improved.
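A minimal sketch of this quantization step in Python (assuming the standard sign(·) binarization commonly used for hash quantization; the code values below are toy illustrations, not data from the application):

```python
import numpy as np

def to_hash_code(d):
    """Binarize a real-valued hash-like code d into a {-1, +1} target hash code.

    Assumption: quantization is done with the standard sign() mapping,
    consistent with a quantization loss that pulls |d| toward 1.
    """
    d = np.asarray(d, dtype=np.float64)
    return np.where(d >= 0, 1.0, -1.0)

# Toy 16-bit hash-like code as the hash learning module might output it.
d = np.array([0.9, -0.7, 0.2, -0.1] * 4)
b = to_hash_code(d)
```

Because b is binary, comparing two such codes later reduces to cheap bitwise operations.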
As a preferred embodiment, the step of generating the fusion feature of the target cloud image by the feature learning module in the interpretable deep network includes:
inputting the target cloud image into a backbone network to obtain a deep feature map;
determining a global feature characterizing the global semantic information of the target cloud image according to the deep feature map and a preset global feature extraction strategy;
determining local features characterizing the local semantic information of a plurality of cloud types in different areas of the target cloud image based on the deep feature map and a preset local feature extraction strategy;
and splicing the global features and the local features to obtain fusion features.
Specifically, the backbone network includes, but is not limited to, ResNet; extracting features of the target cloud image with ResNet yields the deep feature map. Referring to fig. 3, the module generating the global feature GF is denoted the global feature learning module in fig. 3, and the modules generating the local features (including the first local feature LF1, the first second local feature LF2 and the second second local feature LF3) are denoted the local feature learning module.
As a preferred embodiment, determining the global feature characterizing the global semantic information of the target cloud image according to the deep feature map and the preset global feature extraction strategy includes:
carrying out an average pooling operation on the deep feature map;
and inputting the result of the average pooling operation into a fully connected layer for feature dimension reduction, so as to obtain the global feature characterizing the global semantic information of the target cloud image.
Specifically, the deep feature map passes through the average pooling operation and the fully connected layer in sequence, so the global feature is determined simply, accurately and reliably, which facilitates the determination of the subsequent fusion feature.
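The two steps above can be sketched with toy shapes (the 512-channel feature map and the 512 → 128 fully connected reduction are illustrative assumptions, not values from the application):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deep feature map from the backbone: (channels, height, width).
feature_map = rng.standard_normal((512, 7, 7))

# Step 1: global average pooling over the spatial dimensions.
pooled = feature_map.mean(axis=(1, 2))      # shape (512,)

# Step 2: fully connected layer for feature dimension reduction
# (hypothetical 512 -> 128 weight matrix; real weights are learned).
W = rng.standard_normal((128, 512)) * 0.01
global_feature = W @ pooled                 # shape (128,)
```

The pooled vector summarizes the whole image, which is why this branch captures a single global semantic feature.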
As a preferred embodiment, determining the local features characterizing the local semantic information of a plurality of cloud types in different areas of the target cloud image based on the deep feature map and the preset local feature extraction strategy includes:
convolving the deep feature map to obtain a first sub-graph F1;
Processing the first sub-graph F1 according to a preset attention strategy to obtain a first attention map M1 representing the mining result of the most distinctive region;
generating a first local feature map LFM1 according to the first attention map M1 and the first sub-graph F1;
and determining a first local feature of the local semantic information representing the cloud type under the first local feature map LFM1 so as to splice the global feature and the first local feature and further obtain a fusion feature corresponding to the target cloud map.
Specifically, considering that a satellite cloud image often contains a plurality of cloud types, the features of these local areas need to be captured to generate more meaningful fusion features and hash codes. The preset local feature extraction strategy is therefore designed with an attention mechanism, which focuses more on the foreground of the image and highlights its salient parts, so that the features of the local areas are captured better.
To elaborate, the deep feature map is convolved with a 1×1 convolution to obtain the first sub-graph F1. Referring to fig. 2, fig. 2 is a schematic implementation diagram of the preset local feature extraction strategy of the present invention: the first sub-graph F1 is processed according to the preset attention strategy to obtain the first attention map M1, and an element-wise Hadamard product of the first sub-graph F1 and the first attention map M1 yields the first local feature map LFM1. The process of generating the first attention map M1 is denoted the attention mechanism in fig. 3.
The first local feature characterizing the local semantic information of the cloud type under the first local feature map LFM1 may be determined by carrying out an average pooling operation on the first local feature map LFM1 and inputting the result into a fully connected layer for feature dimension reduction, as shown in fig. 3.
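A shape-level sketch of this first local branch (all dimensions, and the 256 → 128 fully connected layer, are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy first sub-graph F1 (channels, H, W) and first attention map M1 (H, W)
# with values in (0, 1), as a Sigmoid-based attention branch would produce.
F1 = rng.standard_normal((256, 7, 7))
M1 = 1.0 / (1.0 + np.exp(-rng.standard_normal((7, 7))))

# Element-wise (Hadamard) product; M1 broadcasts over the channel axis.
LFM1 = F1 * M1                               # first local feature map

# Average pooling + a hypothetical 256 -> 128 fully connected layer,
# mirroring the global branch, yields the first local feature LF1.
W_local = rng.standard_normal((128, 256)) * 0.01
LF1 = W_local @ LFM1.mean(axis=(1, 2))

# The fusion feature is the concatenation of global and local features
# (GF here is a random stand-in for the global branch output).
GF = rng.standard_normal(128)
fusion = np.concatenate([GF, LF1])
```

The Hadamard product reweights every spatial position of F1 by its attention value, which is what lets the branch focus on one distinctive region.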
As a preferred embodiment, processing the first sub-graph according to the preset attention strategy to obtain the first attention map characterizing the mining result of the most distinctive region includes:
activating the first sub-graph F1 to determine its corresponding activation feature map Fm;
and performing a convolution operation and an aggregation normalization operation on the activation feature map Fm in sequence, then processing the result with a Sigmoid activation function to obtain the first attention map M1.
Specifically, referring to fig. 2, activating the first sub-graph F1 to determine its corresponding activation feature map Fm may proceed as follows: the first sub-graph F1 sequentially passes through a 3×3 convolution layer, a batch normalization layer, a 1×1 convolution layer and another batch normalization layer, and is then activated with the ReLU activation function to obtain the activation feature map Fm; the activation feature map Fm then sequentially passes through a 1×1 convolution layer and a batch normalization layer, and the result is processed with a Sigmoid activation function to obtain the first attention map M1.
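The tail of this attention branch (1×1 convolution, batch normalization, Sigmoid) can be sketched as follows; the channel count is illustrative, batch normalization is reduced to a per-channel standardization, and the 3×3 convolution / ReLU stages that produce Fm are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(x, eps=1e-5):
    # Toy inference-style normalization per channel over the spatial axes.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv1x1(x, w):
    # A 1x1 convolution is a linear map over the channel axis.
    return np.einsum('oc,chw->ohw', w, x)

# Toy activated feature map Fm (channels, H, W), already ReLU-activated.
Fm = np.maximum(rng.standard_normal((64, 7, 7)), 0.0)

# 1x1 conv down to one channel, batch norm, then Sigmoid -> attention map M1.
w = rng.standard_normal((1, 64)) * 0.1
M1 = sigmoid(batch_norm(conv1x1(Fm, w)))[0]   # shape (7, 7), values in (0, 1)
```

The Sigmoid keeps every attention value strictly between 0 and 1, so the subsequent Hadamard product can only attenuate or preserve features, never flip their sign.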
As a preferred embodiment, after processing the first sub-graph according to the preset attention strategy to obtain the first attention map characterizing the mining result of the most distinctive region, the method further comprises:
S21, let o = 1;
S22, determining second pixel points based on the first attention map corresponding to the first sub-graph and a second preset relation, so as to obtain a suppression map composed of the second pixel points;
the second preset relation is:
wherein μk′ is the pixel value of the k-th second pixel point in the suppression map, μk is the pixel value of the k-th pixel point in the first attention map, a is a preset hyperparameter, μmean is the mean of the pixel values of all pixel points in the first attention map, and μstd is the standard deviation of the pixel values of all pixel points in the first attention map;
S23, determining an o-th second sub-graph according to the suppression map and the first sub-graph;
S24, processing the o-th second sub-graph according to the preset attention strategy to obtain an o-th second attention map, so as to generate an o-th second local feature map according to the o-th second attention map and the o-th second sub-graph, and further determining an o-th second local feature corresponding to the o-th second local feature map;
S25, let o = o + 1;
S26, judging whether o reaches a preset number of branches, the preset number of branches being an integer not less than 2; if not, entering S27, and if so, entering S29;
S27, determining new second pixel points according to the (o-1)-th second attention map and the second preset relation, so as to obtain a new suppression map corresponding to the new second pixel points;
S28, generating an o-th second sub-graph according to the new suppression map and the (o-1)-th second sub-graph, and returning to S24;
and S29, splicing the global feature, the first local feature and the obtained total of o-1 second local features to obtain the fusion feature corresponding to the target cloud image.
Specifically, it is further considered that although the most distinctive region can be mined through the first local feature, the remaining specific object regions in the complementary areas of the cloud image are sometimes ignored. That is, a contextual relationship exists between the cloud-like regions of a satellite cloud image, so simply erasing the most distinctive region may harm the mining of the remaining specific object regions. For this reason, the portions of the other specific object regions need to be enhanced while the previously most distinctive region is suppressed; accordingly, after the first attention map is obtained, the plurality of second local features is determined in the manner described above.
The preset hyperparameter a controls the degree to which the salient region is suppressed and the other activated regions are enhanced: the larger a is, the stronger the suppression and enhancement. Through this suppression-enhancement operation, the most distinctive region of the previous stage is partially suppressed, while the other activated complementary regions are correspondingly enhanced, maintaining the relationship between the activation regions of the previous stage and those generated in the subsequent stage.
It should be further noted that the specific value of the preset number of branches is set according to the training effect and is not limited to 3. In S23, the o-th second sub-graph may be determined from the suppression map and the first sub-graph by taking their element-wise Hadamard product. In S24, the o-th second local feature corresponding to the o-th second local feature map may be determined by carrying out an average pooling operation on the o-th second local feature map and inputting the result into a fully connected layer for feature dimension reduction, yielding the o-th second local feature characterizing the local semantic information of the cloud type under the o-th second local feature map.
Referring to fig. 3, where the case of a preset number of branches = 3 is taken as an example, it can be seen that the fusion feature at this point comprises the global feature GF, the first local feature LF1 and a total of 2 second local features (the first second local feature LF2 and the second second local feature LF3); in fig. 3, the process of generating the first second sub-graph F21 is denoted the suppression module.
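A sketch of one suppression-enhancement step. The second preset relation's formula is not reproduced in this text, so the form below, μ′k = 1 − a·(μk − μmean)/μstd, is an assumption chosen to match the description: pixels above the attention mean (the most distinctive region) get a factor below 1, activated pixels below the mean get a factor above 1, and a scales both effects:

```python
import numpy as np

rng = np.random.default_rng(3)

def suppress_enhance(attention, a=0.5):
    """Suppression map from an attention map.

    Assumed form (not from the source text):
        mu'_k = 1 - a * (mu_k - mu_mean) / mu_std
    which damps above-mean pixels and boosts below-mean ones.
    """
    mu_mean = attention.mean()
    mu_std = attention.std()
    return 1.0 - a * (attention - mu_mean) / mu_std

# Toy first attention map (Sigmoid outputs) and first sub-graph.
M1 = 1.0 / (1.0 + np.exp(-rng.standard_normal((7, 7))))
F1 = rng.standard_normal((256, 7, 7))

S1 = suppress_enhance(M1)
# First second sub-graph: Hadamard product of suppression map and F1 (S23).
F2_1 = F1 * S1
```

Repeating this on each stage's attention map (S27–S28) drives every new branch toward a region the previous branches have already down-weighted, which is how the complementary regions are found.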
As a preferred embodiment, the interpretable deep network is trained in advance based on a cloud image training set; the cloud image training set comprises N training cloud images, each training cloud image corresponds to at least one real image classification label, the total number of categories of the real image classification labels corresponding to all the training cloud images is C, and N and C are integers not less than 1. The training step of the interpretable deep network comprises:
inputting the training cloud images into the interpretable deep network to be trained to obtain hash-like code features;
inputting the hash-like code features into a classification layer to obtain corresponding image classification label predicted values;
and training the interpretable deep network by using the hash-like code features, the image classification label predicted values and a preset loss function;
the preset loss function is:
Ltotal=λLcls+ηLq+νLb
wherein Ltotal is the training error, Lcls is the multi-label classification loss with first penalty parameter λ, Lq is the quantization loss with second penalty parameter η, and Lb is the bit balance loss with third penalty parameter ν;
wherein Lcls is determined based on a third preset relation, the third preset relation being:
wherein ŷi^n denotes the i-th predicted value among the image classification label predicted values corresponding to the n-th training cloud image, ωi is a weight value, yi^n is the i-th real value in the real image classification label corresponding to the n-th training cloud image, and σ is the Sigmoid activation function;
Lq is determined based on a fourth preset relation:
wherein K is the preset hash code length, dn is the hash-like code feature corresponding to the n-th training cloud image, and e is a K-dimensional vector whose entries are all 1;
Lb is determined based on a fifth preset relation:
wherein mean(dn) represents the mean value of dn.
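Under the stated variable definitions, the three loss terms and their weighted sum can be sketched as follows. The exact formulas of the third to fifth preset relations are not reproduced in this text, so the weighted binary cross-entropy, L2 quantization and squared-mean bit-balance forms used here are assumptions chosen to be consistent with the descriptions; all tensor shapes are toy values:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N, C, K = 4, 6, 32  # training cloud images, label categories, hash code length

logits = rng.standard_normal((N, C))               # label predicted values
labels = (rng.random((N, C)) > 0.5).astype(float)  # multi-hot real labels
omega = np.ones(C)                                  # per-class weights
d = np.tanh(rng.standard_normal((N, K)))            # hash-like code features

# Multi-label classification loss: assumed weighted binary cross-entropy,
# treating each of the C labels as a yes/no problem via the Sigmoid.
p = sigmoid(logits)
L_cls = -np.mean(
    np.sum(omega * (labels * np.log(p) + (1 - labels) * np.log(1 - p)), axis=1)
)

# Quantization loss: assumed L2 pull of |d_n| toward e, the all-ones K-vector.
e = np.ones(K)
L_q = np.mean(np.sum((np.abs(d) - e) ** 2, axis=1))

# Bit balance loss: assumed squared mean of d_n, pushing each code toward
# an equal share of positive and negative bits.
L_b = np.mean(d.mean(axis=1) ** 2)

lam, eta, nu = 0.5, 0.5, 0.0002   # penalty parameters given in the text
L_total = lam * L_cls + eta * L_q + nu * L_b
```

Minimizing L_q makes the hash-like codes nearly binary before quantization, while L_b keeps the bits informative, which is what yields compact yet discriminative hash codes.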
It should be noted that the training cloud images in the cloud image training set are divided into batches and input to the interpretable deep network for training (each batch containing N training cloud images), with the batch size preferably set to 64. In addition, unlike general natural images, a satellite cloud image is a multispectral image whose different channels reflect different physical properties; in the application, the band images of channels 1, 2, 3 and 5 of the LSCIDMR-V2 dataset are selected as the training cloud images, where channels 1, 2 and 3 are visible light bands providing aerosol physical characteristics, and channel 5 is a near-infrared band providing cloud physical parameter data. Furthermore, in order to retain more details of the training cloud images and enhance the robustness of batch processing, a series of preprocessing operations may be performed on the input cloud images, such as resizing them to 256 × 256 and applying unified data normalization.
It can be understood that any one training cloud image corresponds to at least one real image classification label, so training of the interpretable deep network is completed using the multi-label-annotated training cloud images and the preset loss function to achieve the desired training result. The label vector corresponding to any training cloud image is {z1, z2, ..., zC}, zi ∈ {0, 1}, where 1 indicates that the training cloud image has the i-th real image classification label and 0 indicates that it does not. The first penalty parameter λ, the second penalty parameter η and the third penalty parameter ν balance the multi-label classification loss, the quantization loss and the bit balance loss; the preferred settings in this scheme are λ = 0.5, η = 0.5 and ν = 0.0002.
In addition, Lq, the quantization loss, measures the quantization error produced when the fusion feature is converted into a binary training hash code, and Lb, the bit balance loss, encourages each bit of the training hash code to be 0 or 1 with 50% probability. When designing the multi-label classification loss Lcls, the presence or absence of each possible real image classification label of a training cloud image is treated as a binary (yes/no) classification problem, wherein σ, the Sigmoid activation function, maps its input into the (0, 1) interval.
Therefore, the interpretable deep network is trained with the multi-label cloud image training set, and since each real image classification label carries whole-image semantic information of the cloud image, the labels serve as semantic clues that assist the interpretable deep network, improving the accuracy of the generated hash codes and endowing them with rich semantics. After training of the interpretable deep network is completed, it can be used to determine the hash code of each historical cloud image in the cloud image dataset; once the target hash code of a target cloud image is determined, the Hamming distances between the target hash code and the hash codes of the historical cloud images can be calculated and sorted, thereby realizing similarity retrieval and finding the historical cloud images most similar to the target cloud image.
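The retrieval step reduces to XOR and a per-code popcount; a sketch with a hypothetical 32-bit code and a random database of 1000 historical codes (stored here as 0/1 vectors for simplicity):

```python
import numpy as np

rng = np.random.default_rng(5)

K = 32                                           # hash code length
database = rng.integers(0, 2, size=(1000, K))    # historical cloud image codes
query = rng.integers(0, 2, size=K)               # target cloud image code

# Hamming distance: XOR marks differing bits, the row sum counts them.
dists = np.bitwise_xor(database, query).sum(axis=1)

# Rank historical cloud images by ascending Hamming distance; the top
# entries form the retrieval result.
top10 = np.argsort(dists, kind='stable')[:10]
```

Since each comparison is a fixed number of bit operations rather than a floating-point distance, this is what makes hash-based retrieval cheap in both time and storage.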
Although the present disclosure is described above, the scope of protection of the present disclosure is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the disclosure, and these changes and modifications will fall within the scope of the invention.
It should also be noted that in this specification, relational terms such as first, second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily implying any actual such relationship or order between such entities or actions.