Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides a hypergraph convolution-based unsupervised cross-modal retrieval method, system, medium and device, which can more comprehensively capture the complementary and symbiotic information of multi-modal data, overcome the shortcomings of existing methods in capturing the comprehensive semantic information of multi-modal data, and enhance the model's understanding and measurement of the correlation of multi-modal content.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
 The first aspect of the invention provides an unsupervised cross-modal retrieval method based on hypergraph convolution, which comprises the following steps:
 Acquiring a multi-modal training data set;
 Training the cross-modal retrieval model based on the multi-modal training data set to obtain a trained cross-modal retrieval model, wherein the method specifically comprises the following steps:
 Performing cross-modal fusion on the image features and the text features extracted based on the multi-modal training data set to obtain cross-modal fused image features and text features;
 Constructing an image modality similarity matrix based on the image features and the text features, constructing a text modality similarity matrix based on the cross-modal fused image features and text features, and unifying the image modality similarity matrix and the text modality similarity matrix into a robust similarity matrix;
 Utilizing the robust similarity matrix, introducing a hypergraph to aggregate the common features of similar samples into hyperedges to obtain a hypergraph incidence matrix, performing hypergraph convolution on the image features and the text features by using the incidence matrix, and mining high-order semantic information among the nodes to obtain the hash codes of the hypergraph learning process;
 constructing a reconstruction loss function according to the generated hash code and the robust similarity matrix, and updating parameters of the hash coding network based on the reconstruction loss function;
 And performing retrieval according to the task data to be retrieved and the trained cross-modal retrieval model to obtain a retrieval result.
Further, the cross-modal fusion of the image features and the text features extracted based on the multi-modal training dataset to obtain the cross-modal fused image features and text features includes:
 Extracting image features by using a CLIP image feature extractor, and extracting text features by using a text feature extractor;
 splicing the output result of the CLIP image feature extractor and the output result of the text feature extractor to obtain a spliced tensor;
 Inputting the spliced tensor into a multi-modal fusion Transformer, and capturing the correlation between the modes and the semantic correlation between the features by using a self-attention mechanism to obtain fused image features and text features.
Further, the loss function when cross-modal fusion is performed based on the image features and the text features extracted from the multi-modal training dataset is:
$\mathcal{L}_{c}^{I}=-\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\!\big(\cos(f_{i}^{I},f_{i}^{T})/\tau\big)}{\sum_{j=1}^{m}\exp\!\big(\cos(f_{i}^{I},f_{j}^{T})/\tau\big)},
$\mathcal{L}_{c}^{T}=-\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\!\big(\cos(f_{i}^{T},f_{i}^{I})/\tau\big)}{\sum_{j=1}^{m}\exp\!\big(\cos(f_{i}^{T},f_{j}^{I})/\tau\big)},
$\mathcal{L}_{c}=\mathcal{L}_{c}^{I}+\mathcal{L}_{c}^{T},$
 Wherein $\mathcal{L}_{c}^{I}$ is the contrast loss of the image modality, $\mathcal{L}_{c}^{T}$ is the contrast loss of the text modality, $\mathcal{L}_{c}$ is the cross-modal contrast loss combining the two, $\tau$ is the temperature coefficient, $\cos(\cdot,\cdot)$ denotes the similarity calculation function, $f_{i}^{I}$ and $f_{i}^{T}$ denote the features of a truly aligned image-text pair, the subscripts $i$ and $j$ index the $i$-th and $j$-th samples, and $m$ is the number of training samples in each batch.
Further, the robust similarity matrix is expressed as:
,
,
,
 Wherein $\gamma$ is a weight for measuring the similarity information of different modalities, $S^{I}$ is the image modality similarity matrix, $S^{T}$ is the text modality similarity matrix, $C$ is a symmetric matrix, $S_{ij}$ is the similarity value between the $i$-th sample and the $j$-th sample in the robust similarity matrix $S$, and $m$ is the number of training samples in each batch.
Further, the step of introducing a hypergraph to aggregate the common features of similar samples into hyperedges by using the robust similarity matrix to obtain the incidence matrix of the hypergraph comprises the following steps:
 Taking the image features, the text features and the robust similarity matrix $S$ as input, each feature vector is regarded as a node, denoted $v_{i}$; using the similarity matrix $S$, the nodes most similar to each node $v_{i}$ are identified, and the identified nodes are combined into a hyperedge $e_{i}$, which can be expressed as $e_{i}=\{v_{i}\}\cup N_{K}(v_{i})$, wherein $N_{K}(v_{i})$ denotes the set of the $K$ nodes most similar to $v_{i}$;
 The incidence matrix $H$ of the hypergraph is expressed as:
$H(v,e)=\begin{cases}1, & v\in e\\ 0, & v\notin e.\end{cases}$
 Further, the performing hypergraph convolution on the image features and the text features by using the incidence matrix, and mining high-order semantic information among the nodes to obtain the hash codes of the hypergraph learning process, comprises:
 Introducing a standard Laplacian matrix for the incidence matrix of the constructed hypergraph;
 calculating the hypergraph convolution layer representation by combining the standard Laplacian matrix;
 and constructing a hypergraph convolutional network based on the obtained hypergraph convolutional layer representation, and generating hash codes in the hypergraph learning process.
Further, the expression of the reconstruction loss function is:
,
,
,
,
,
 Wherein $\mathcal{L}_{r}$ is the reconstruction loss, $\alpha$ is a hyperparameter for adjusting the scaling range of the similarity matrix, the symbol $\odot$ represents the Hadamard product, $\lambda$ is a hyperparameter that weighs the different loss terms, $S$ is the robust similarity matrix, and $B^{I}$, $B^{T}$, $G^{I}$, $G^{T}$ are the learned hash codes.
A second aspect of the present invention provides a hypergraph convolution-based unsupervised cross-modal retrieval system, comprising:
 a multi-modal data acquisition module for acquiring a multi-modal training dataset;
 The cross-modal retrieval model training module is used for training the cross-modal retrieval model based on a multi-modal training data set to obtain a trained cross-modal retrieval model, and specifically comprises the following steps:
 Performing cross-modal fusion on the image features and the text features extracted based on the multi-modal training data set to obtain cross-modal fused image features and text features;
 Constructing an image modality similarity matrix based on the image features and the text features, constructing a text modality similarity matrix based on the cross-modal fused image features and text features, and unifying the image modality similarity matrix and the text modality similarity matrix into a robust similarity matrix;
 Utilizing the robust similarity matrix, introducing a hypergraph to aggregate the common features of similar samples into hyperedges to obtain a hypergraph incidence matrix, performing hypergraph convolution on the image features and the text features by using the incidence matrix, and mining high-order semantic information among the nodes to obtain the hash codes of the hypergraph learning process;
 constructing a reconstruction loss function according to the generated hash code and the robust similarity matrix, and updating parameters of the hash coding network based on the reconstruction loss function;
 and the retrieval module is used for retrieving and obtaining a retrieval result according to the task data to be retrieved and the trained cross-modal retrieval model.
A third aspect of the present invention provides a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps in a hypergraph convolution based unsupervised cross-modality retrieval method as described above.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a hypergraph convolution based unsupervised cross-modality retrieval method as described above when the program is executed.
Compared with the prior art, the invention has the beneficial effects that:
 1. After fusing the extracted fine-grained semantic features, the invention constructs a semantically complementary similarity matrix to maximize the potential semantic correlation among instances of different modalities, and then encodes the high-order relations among vertices and the local clustering structure through hypergraph convolution to assist hash code learning, thereby generating more discriminative hash codes and improving retrieval precision.
2. The invention adopts the CLIP multi-modal model to extract fine-grained semantic features, and further enhances the semantic representation of each modality through the multi-modal fusion Transformer. This deep feature extraction and fusion strategy enables the model of the invention to more comprehensively capture the complementary and symbiotic information of multi-modal data.
3. The invention provides an effective fusion method to construct a semantically complementary similarity matrix so as to maximize the potential semantic correlation between instances of different modalities. The method helps to overcome the shortcomings of existing methods in capturing the comprehensive semantic information of multi-modal data, and enhances the model's understanding and measurement of the correlation of multi-modal content.
4. The invention introduces an adaptive hypergraph neural network, which encodes the high-order relations among vertices and the local clustering structure through hypergraph convolution to assist hash code learning, thereby generating more discriminative hash codes.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Interpretation of the terms
CLIP (Contrastive Language-Image Pre-Training) is a multimodal pre-trained neural network model published by OpenAI in 2021. It maps images and text into a unified vector space by means of contrastive learning, so that the model can directly compute the similarity between an image and a text in that vector space. The core idea of the CLIP model is to pre-train on a large amount of paired image-text data to learn the alignment between images and text. Such a model is particularly suited to zero-shot learning tasks, i.e., the model can make predictions without having seen a training example of a new image or text. The CLIP model performs excellently in various fields such as image-text retrieval and image-text generation.
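As a brief illustration of how CLIP scores image-text similarity in its shared embedding space, the following is a minimal sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name, image path and captions are placeholders, not part of the invention.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; any public CLIP checkpoint works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # placeholder image path
texts = ["a photo of a dog", "a photo of a cat"]       # placeholder captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarities in the shared space.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)   # e.g. which caption best matches the image
```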
The Hypergraph Neural Network (HGNN) is an advanced graph representation learning method that handles complex graph structures by encoding high-order data correlations. The network realizes effective representation learning through hyperedge convolution operations and can capture complex relations among nodes. The concept of hypergraphs has been introduced into cross-modal hash retrieval to overcome the limitation of ordinary pairwise graphs in describing high-order relationships between samples. Because a hyperedge can connect any number of samples, a hypergraph can describe the similarity between samples, i.e., the high-order relationship, more comprehensively. This mining of high-order information provides a more robust modal representation for unsupervised cross-modal hashing methods.
Aiming at the limitations of existing unsupervised cross-modal hashing mentioned in the background art, the invention combines the CLIP model with hypergraph learning for the first time, and provides a hypergraph convolution-based unsupervised cross-modal retrieval method, which has the following features:
 (1) The invention adopts the CLIP multi-modal model to extract fine-grained semantic features, and further enhances the semantic representation of each modality through the multi-modal fusion Transformer. This deep feature extraction and fusion strategy enables the model of the invention to more comprehensively capture the complementary and symbiotic information of multi-modal data;
 (2) The invention provides an effective fusion method for constructing a semantic complementary similarity matrix so as to maximize potential semantic correlation among different modal examples. The method is helpful for overcoming the defects of the existing method in the aspect of capturing the comprehensive semantic information of the multi-modal data, and enhancing the understanding and measurement of the model on the correlation of the multi-modal content;
 (3) The invention introduces a self-adaptive hypergraph neural network, which helps to learn the hash codes through the higher-order relation among the vertexes of the hypergraph convolutional codes and the local clustering structure, thereby generating the hash codes with more discriminant;
 (4) The invention adopts an iterative approximate optimization strategy to reduce the information loss in the binarization process.
Example 1
As shown in fig. 1, the embodiment provides an unsupervised cross-modal retrieval method based on hypergraph convolution, which comprises the following steps:
 Step 1, acquiring a multi-mode data set;
 In this embodiment, the acquired multimodal dataset is represented as $D=\{(x_{i}^{I},x_{i}^{T})\}_{i=1}^{N}$, wherein $x_{i}^{I}$ and $x_{i}^{T}$ respectively represent the image data and the text data in the $i$-th sample pair, and $N$ is the number of sample pairs;
 The task of cross-modality retrieval is to use data of one modality to retrieve data of another modality.
In this embodiment, a cross-modal retrieval model is constructed and optimized using image-to-text retrieval and text-to-image retrieval as retrieval tasks.
The training multimodal dataset is likewise represented as $D=\{(x_{i}^{I},x_{i}^{T})\}_{i=1}^{N}$, wherein $x_{i}^{I}$ and $x_{i}^{T}$ respectively represent the image data and the text data in the $i$-th sample pair. In each batch of $m$ training samples, the similarity matrix constructed from the CLIP-extracted image features $F^{I}$ and text features $F^{T}$ is denoted $S_{C}$, and the similarity matrix constructed from the image features $\hat{F}^{I}$ and text features $\hat{F}^{T}$ fused by the cross-modal Transformer is denoted $S_{F}$. The final representations of the image and the text are defined as $E^{I}\in\mathbb{R}^{m\times d_{1}}$ and $E^{T}\in\mathbb{R}^{m\times d_{2}}$, respectively, wherein $d_{1}$ and $d_{2}$ represent the dimensions of the image features and the text features. The hash codes generated by the hash coding network are recorded as $B^{I}$ and $B^{T}$, and the hash codes generated by the hypergraph convolution network are recorded as $G^{I}$ and $G^{T}$, wherein $k$ is the length of the hash codes.
Step 2, fusing the image features and the text features extracted based on the multi-mode dataset to obtain fused image features and text features;
 the step 2 specifically comprises the following steps:
 Step 201, in each batch of $m$ training samples, extracting the image features $F^{I}$ by using the CLIP image feature extractor, and extracting the text features $F^{T}$ by using the text feature extractor.
Step 202, concatenating the output $F^{I}$ of the CLIP image feature extractor and the output $F^{T}$ of the text feature extractor to obtain a new tensor $Z=[F^{I};F^{T}]\in\mathbb{R}^{m\times(d_{I}+d_{T})}$, wherein $d_{I}$ and $d_{T}$ represent the dimensions of the visual features and the text features, respectively.
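The following is a minimal PyTorch sketch of Steps 201-202 under the assumption that the CLIP encoders from the transformers library serve as the feature extractors; the checkpoint name and the variable names F_img, F_txt and Z are illustrative.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_batch_features(images, texts):
    """Step 201: CLIP image/text features; Step 202: concatenation into Z."""
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    with torch.no_grad():
        F_img = clip.get_image_features(pixel_values=inputs["pixel_values"])        # (m, d_I)
        F_txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])     # (m, d_T)
    Z = torch.cat([F_img, F_txt], dim=-1)   # spliced tensor, shape (m, d_I + d_T)
    return F_img, F_txt, Z
```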
Step 203, inputting the concatenated tensor $Z$ into the multi-modal fusion Transformer, and capturing the correlation between modalities and the semantic correlation between features by using the self-attention mechanism to obtain the fused image features and text features, which specifically comprises the following steps:
 In this embodiment, the attention scores are obtained by calculating the similarity between the queries and the keys, and are then used to compute a weighted sum of the values.
This effectively identifies the degree of association between each feature and the other features and integrates these relationships into the feature fusion process. Finally, through the multi-modal fusion Transformer, an effective fusion of the image and text features is achieved, thereby improving the multi-modal learning performance.
Step 2031, constructing queries, keys and values in the self-attention mechanism;
 In the self-attention mechanism, queries, keys, and values are constructed using multi-modal features, expressed as:
$Z=[F^{I};F^{T}],$
$Q=ZW_{Q},$
$K=ZW_{K},$
$V=ZW_{V},$
 Wherein $F^{I}$ is the feature of the image, $F^{T}$ is the feature of the text, $Z$ is the result of concatenating $F^{I}$ and $F^{T}$, $W_{Q}$, $W_{K}$ and $W_{V}$ are all trainable network parameters, $Q$ is the query vector obtained by multiplying the tensor $Z$ with $W_{Q}$, $K$ is the key vector obtained by multiplying $Z$ with $W_{K}$, and $V$ is the value vector obtained by multiplying $Z$ with $W_{V}$.
Step 2032, for any image-text pair, generating multi-modal features with stronger characterization capability through the self-attention mechanism of the multi-modal Transformer, expressed as:
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V,$
 Wherein $d_{k}$ is the dimension of the keys, $\mathrm{Attention}(\cdot)$ is the self-attention mechanism, and $K$ and $V$ are the key matrix and the value matrix, respectively.
Step 2033, generating the fused multi-modal features from the features produced by the self-attention mechanism by using a feed-forward neural network, expressed as:
$Z'=\mathrm{LN}\!\left(Z+\mathrm{Dropout}\big(\mathrm{Attention}(Q,K,V)\big)\right),$
$\mathrm{FFN}(Z')=W_{2}\,\mathrm{ReLU}(W_{1}Z'+b_{1})+b_{2},$
$\hat{Z}=\mathrm{LN}\!\left(Z'+\mathrm{Dropout}\big(\mathrm{FFN}(Z')\big)\right),$
 Wherein $\mathrm{LN}(\cdot)$ and $\mathrm{Dropout}(\cdot)$ denote the normalization layer and the dropout layer, respectively, and $\hat{Z}$, which is split into the fused image features $\hat{F}^{I}$ and the fused text features $\hat{F}^{T}$, is the output of the multi-modal fusion Transformer.
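A minimal sketch of the fusion block of Steps 2031-2033, assuming a single-head formulation with learned projections for the queries, keys and values, residual connections, LayerNorm and dropout; the hidden size, dropout rate and the way the fused output is split back into image and text parts are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiModalFusionBlock(nn.Module):
    def __init__(self, dim, hidden_dim=2048, p_drop=0.1):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V
        self.ffn = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.drop = nn.Dropout(p_drop)

    def forward(self, z):                            # z: (m, d_I + d_T) spliced features
        q, k, v = self.w_q(z), self.w_k(z), self.w_v(z)
        attn = torch.softmax(q @ k.T / math.sqrt(k.size(-1)), dim=-1)  # attention scores
        z1 = self.norm1(z + self.drop(attn @ v))     # attention + residual + LayerNorm
        z2 = self.norm2(z1 + self.drop(self.ffn(z1)))  # feed-forward + residual + LayerNorm
        return z2                                    # fused multi-modal features

# One possible usage (the split of the fused output is an assumption):
# fused = MultiModalFusionBlock(dim=Z.size(-1))(Z)
# F_img_hat, F_txt_hat = fused[:, :d_I], fused[:, d_I:]
```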
Step 2034, in order to ensure that the data representation of the same category in the same modality contains consistent category semantics, defining the modality contrast loss as follows:
$\mathcal{L}_{c}^{I}=-\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\!\big(\cos(f_{i}^{I},f_{i}^{T})/\tau\big)}{\sum_{j=1}^{m}\exp\!\big(\cos(f_{i}^{I},f_{j}^{T})/\tau\big)},
$\mathcal{L}_{c}^{T}=-\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\!\big(\cos(f_{i}^{T},f_{i}^{I})/\tau\big)}{\sum_{j=1}^{m}\exp\!\big(\cos(f_{i}^{T},f_{j}^{I})/\tau\big)},
 Wherein $\mathcal{L}_{c}^{I}$ is the contrast loss of the image modality, $\mathcal{L}_{c}^{T}$ is the contrast loss of the text modality, $\mathcal{L}_{c}$ is the cross-modal contrast loss combining the two, $\tau$ is the temperature coefficient, $\cos(\cdot,\cdot)$ denotes the similarity calculation function, and $f_{i}^{I}$ and $f_{i}^{T}$ denote the features of a truly aligned image-text pair.
Thus, the multi-modal contrast loss can be expressed as:
$\mathcal{L}_{c}=\mathcal{L}_{c}^{I}+\mathcal{L}_{c}^{T};$
 Step 2035, using $F^{I}$ together with $\hat{F}^{I}$, and $F^{T}$ together with $\hat{F}^{T}$, as the final representations $E^{I}$ and $E^{T}$ of the image and text modalities, respectively, which are then fed to the hash coding network. This retains the effective information of the original features while extracting higher-level semantic information, so that the complex associations between images and texts are better captured and utilized, improving the performance and accuracy of the model.
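A sketch of the contrast loss of Step 2034, assuming the standard InfoNCE form over a batch of m aligned image-text pairs with temperature tau, consistent with the equations above; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_img, f_txt, tau=0.07):
    """Symmetric image-text InfoNCE loss over one batch of m aligned pairs."""
    f_img = F.normalize(f_img, dim=-1)
    f_txt = F.normalize(f_txt, dim=-1)
    logits = f_img @ f_txt.T / tau                        # (m, m) cosine similarities / temperature
    targets = torch.arange(f_img.size(0), device=f_img.device)
    loss_i = F.cross_entropy(logits, targets)             # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)           # text -> image direction
    return loss_i + loss_t
```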
Step 3, constructing an image modality similarity matrix based on the image features and the text features, constructing a text modality similarity matrix based on the cross-modal fused image features and text features, and unifying the image modality similarity matrix and the text modality similarity matrix into a robust similarity matrix, which specifically comprises the following steps:
 The unsupervised hashing method cannot construct a multi-label similarity matrix to guide the learning of hash codes due to the lack of sample labels. This embodiment therefore provides a similarity matrix construction scheme based on aggregation and dynamic adjustment. Given the visual features $F^{I}$, an image modality cosine similarity matrix $S^{I}$ is constructed, wherein $S^{I}_{ij}$ is calculated as $S^{I}_{ij}=\cos(F^{I}_{i},F^{I}_{j})$, with $\cos(a,b)=\frac{ab^{\top}}{\|a\|\,\|b\|}$;
using the text features $F^{T}$, a text modality similarity matrix $S^{T}$ is constructed, wherein $S^{T}_{ij}$ is calculated as $S^{T}_{ij}=\cos(F^{T}_{i},F^{T}_{j})$.
Then, the image modality similarity matrix $S^{I}$ and the text modality similarity matrix $S^{T}$ are integrated into a unified similarity matrix, which maintains the semantic relationships among instances of different modalities and allows the two modalities to complement each other.
For this purpose, a joint modality similarity matrix $S$ is constructed to fuse the multi-modal information in $S^{I}$ and $S^{T}$, expressed as:
,
,
 Wherein $\gamma$ is the weight measuring the similarity information of the different modalities; image-text pairs with a larger value of $S_{ij}$ have higher semantic similarity than image-text pairs with a smaller value;
 $C$ is a symmetric matrix, expressed as:
,
 Wherein $S^{I}_{i,:}$ is the $i$-th row of the image modality similarity matrix $S^{I}$, and $S^{T}_{:,j}$ is the $j$-th column of the text modality similarity matrix $S^{T}$.
From experimental observations, it is noted that in the above similarity matrix the distances between unpaired instances are not well separated. One reason is that the features are obtained by contrastive learning over a large amount of data, which keeps the learned distances of unpaired instances within a small range. Previous hashing algorithms focus only on the values on the diagonal of the similarity matrix, i.e., on the pairwise relationship between an image and its paired text. However, in the hash learning process, the relationships between unpaired images and texts should not be ignored. Therefore, a remapping method is adopted to enhance the values of the off-diagonal elements of the similarity matrix (which reflect the relative relationships between unpaired images and texts), so that the similarity matrix becomes more discriminative.
 First, the average, minimum and maximum values of all elements of the similarity matrix in each batch are obtained (denoted $S_{avg}$, $S_{min}$ and $S_{max}$);
 Then, by comparing each element $S_{ij}$ with the average value, the corresponding image-text pair is judged to be a "similar sample pair" or a "dissimilar sample pair", and the element is weighted differently according to the judgment result, expressed as:
,
 wherein the weights $w_{1}$ and $w_{2}$ are calculated from the batch statistics $S_{avg}$, $S_{min}$ and $S_{max}$;
 $w_{1}$ and $w_{2}$ represent the weights of the "similar sample pairs" and the "dissimilar sample pairs", respectively. These weights nonlinearly "stretch" the original elements. They help the hash function learn the common features between similar samples and the distinguishing features between dissimilar samples, thereby generating more discriminative and accurate hash codes. Finally, the new cross-modal similarity matrix is denoted $\tilde{S}$.
Similarity matrices $S_{C}$ and $S_{F}$ are constructed in the above manner, wherein $S_{C}$ is built from the features $F^{I}$ and $F^{T}$ extracted by the CLIP encoders, and $S_{F}$ is built from the cross-modal fused features $\hat{F}^{I}$ and $\hat{F}^{T}$. The two similarity matrices are unified by weighting into a final robust similarity matrix $S$, which is used to supervise and guide the hash code generation process.
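A sketch of the similarity construction of Step 3 under simplifying assumptions: intra-modal cosine matrices are fused with a weight gamma, entries are re-weighted around the batch mean, and the CLIP-level and fusion-level matrices are combined with a weight eta. The embodiment's exact fusion and remapping formulas are those given above; the weightings below are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def cosine_matrix(x):
    x = F.normalize(x, dim=-1)
    return x @ x.T                                       # (m, m) cosine similarities

def joint_similarity(f_img, f_txt, gamma=0.5):
    """Fuse image- and text-modality similarity matrices (illustrative weighting)."""
    return gamma * cosine_matrix(f_img) + (1.0 - gamma) * cosine_matrix(f_txt)

def remap(S, w_sim=1.2, w_dis=0.8):
    """Stretch entries around the batch mean ('similar' vs 'dissimilar' pairs)."""
    mean = S.mean()
    return torch.where(S >= mean, w_sim * S, w_dis * S).clamp(-1.0, 1.0)

def robust_similarity(F_img, F_txt, F_img_hat, F_txt_hat, eta=0.5):
    S_C = remap(joint_similarity(F_img, F_txt))          # from CLIP features
    S_F = remap(joint_similarity(F_img_hat, F_txt_hat))  # from fused features
    return eta * S_C + (1.0 - eta) * S_F                 # final robust matrix S
```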
Step 4, utilizing the robust similarity matrix, introducing a hypergraph to aggregate the common features of similar samples into hyperedges, and obtaining the incidence matrix of the hypergraph;
 Taking the image features, the text features and the robust similarity matrix $S$ as input, each feature vector is regarded as a node, denoted $v_{i}$. Using the similarity matrix $S$, the nodes most similar to each node $v_{i}$ are identified. The identified nodes are then combined into a hyperedge $e_{i}$, which can be expressed as $e_{i}=\{v_{i}\}\cup N_{K}(v_{i})$, wherein $N_{K}(v_{i})$ denotes the set of the $K$ nodes most similar to $v_{i}$;
 Therefore, the incidence matrix $H$ of the hypergraph is expressed as:
$H(v,e)=\begin{cases}1, & v\in e\\ 0, & v\notin e.\end{cases}$
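A sketch of Step 4: each sample becomes a node, and a hyperedge groups the node with its K most similar neighbours according to the robust similarity matrix S; the value of K is an illustrative hyperparameter.

```python
import torch

def build_incidence_matrix(S, K=10):
    """Hypergraph incidence matrix H (m nodes x m hyperedges), one hyperedge per node."""
    m = S.size(0)
    # Indices of the K+1 most similar nodes for every node; the node itself is included
    # because the diagonal of a cosine similarity matrix is maximal.
    knn = S.topk(K + 1, dim=-1).indices              # (m, K+1)
    H = torch.zeros(m, m, device=S.device)
    H.scatter_(0, knn.T, 1.0)                        # H[v, e] = 1 if node v belongs to hyperedge e
    return H
```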
 step 5, performing hypergraph convolution on the image features and the text features by using the incidence matrix, and mining high-order semantic information among all nodes to obtain hash codes in the hypergraph learning process;
 Hypergraph learning requires aggregating feature information from different neighborhood structures and generating node representations. To this end, a standard Laplacian matrix is introduced for the constructed hypergraph, expressed as:
$\Delta=I-D_{v}^{-1/2}HWD_{e}^{-1}H^{\top}D_{v}^{-1/2},$
 Wherein $D_{v}$ is a diagonal matrix whose elements represent the degree of each node, $D_{e}$ is another diagonal matrix whose elements represent the size of each hyperedge, $W$ is the diagonal matrix of hyperedge weights, and $I$ is the identity matrix.
The hypergraph convolution layer is expressed as:
$X^{(l+1)}=\sigma\!\big(D_{v}^{-1/2}HWD_{e}^{-1}H^{\top}D_{v}^{-1/2}X^{(l)}\Theta^{(l)}\big),$
 Wherein $X^{(l)}\in\mathbb{R}^{m\times d_{l}}$ represents the input of the $l$-th layer of the hypergraph network, $\Theta^{(l)}\in\mathbb{R}^{d_{l}\times d_{l+1}}$ represents the weight matrix of the $l$-th layer, $d_{l}$ and $d_{l+1}$ respectively represent the node representation dimensions of the $l$-th layer and the $(l+1)$-th layer, and $\sigma(\cdot)$ is a nonlinear activation function.
The hash codes generated by the hypergraph convolution network are expressed as:
$G=\tanh\!\big(\rho^{t}\,X^{(L)}\big),$
 Wherein $X^{(L)}$ is the output of the last hypergraph convolution layer, $\rho>1$ is a scaling factor and $t$ represents the number of iterations. An iterative approximate optimization strategy is used to optimize the hash codes, i.e., $\tanh(\rho^{t}x)$ approaches $\mathrm{sgn}(x)$ as $t$ increases, so that the discrete problem is converted into a series of continuous optimization problems, which effectively alleviates the information loss and instability of the binarization process.
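A sketch of Step 5, assuming the standard HGNN normalization of the incidence matrix and a two-layer hypergraph convolution whose tanh output is sharpened with the iteration count to approximate sgn; layer sizes and the schedule rho**t are illustrative.

```python
import torch
import torch.nn as nn

def hypergraph_adjacency(H, edge_w=None):
    """D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} computed from the incidence matrix H."""
    m, e = H.shape
    W = torch.diag(edge_w) if edge_w is not None else torch.eye(e, device=H.device)
    Dv = torch.diag((H @ W.diag()).pow(-0.5))        # node degrees (every node is in >= 1 hyperedge)
    De = torch.diag(H.sum(dim=0).pow(-1.0))          # hyperedge sizes
    return Dv @ H @ W @ De @ H.T @ Dv

class HypergraphConvNet(nn.Module):
    def __init__(self, in_dim, hid_dim, code_len):
        super().__init__()
        self.theta1 = nn.Linear(in_dim, hid_dim)     # Theta^{(0)}
        self.theta2 = nn.Linear(hid_dim, code_len)   # Theta^{(1)}

    def forward(self, X, A, rho=1.1, t=1):
        X = torch.relu(A @ self.theta1(X))           # first hypergraph convolution layer
        X = A @ self.theta2(X)                       # second layer maps to code length
        return torch.tanh((rho ** t) * X)            # smooth approximation of sgn, sharpened over iterations
```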
Step 6, inputting the final representations $E^{I}$ and $E^{T}$ of the image and text modalities into a hash coding network composed of multi-layer perceptrons to generate the hash codes $B^{I}$ and $B^{T}$ used for retrieval, and supervising the hash code generation with a contrast loss.
Two modality-specific hash functions $f_{h}^{I}$ and $f_{h}^{T}$, implemented by multi-layer perceptrons, are used to generate a hash code for the samples of each modality, denoted as:
$B^{I}=\mathrm{sgn}\!\big(f_{h}^{I}(E^{I};\theta^{I})\big),$
$B^{T}=\mathrm{sgn}\!\big(f_{h}^{T}(E^{T};\theta^{T})\big),$
 Wherein $\mathrm{sgn}(\cdot)$ is a binarization function and $\theta^{I}$, $\theta^{T}$ are the network parameters. A contrast loss is introduced to optimize the learned common subspace, as follows:
$\mathcal{L}_{h}^{I}=-\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\!\big(\cos(B_{i}^{I},B_{i}^{T})/\tau\big)}{\sum_{j=1}^{m}\exp\!\big(\cos(B_{i}^{I},B_{j}^{T})/\tau\big)},
$\mathcal{L}_{h}^{T}=-\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\!\big(\cos(B_{i}^{T},B_{i}^{I})/\tau\big)}{\sum_{j=1}^{m}\exp\!\big(\cos(B_{i}^{T},B_{j}^{I})/\tau\big)},
$\mathcal{L}_{h}=\mathcal{L}_{h}^{I}+\mathcal{L}_{h}^{T},$
 Wherein $B_{i}^{I}$ and $B_{i}^{T}$ are the hash codes of a truly aligned image-text pair.
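A sketch of Step 6, assuming each modality-specific hash function is a two-layer perceptron whose tanh output is binarized with sign at retrieval time; the contrastive supervision can reuse the contrastive_loss sketch given earlier.

```python
import torch
import torch.nn as nn

class HashEncoder(nn.Module):
    """Modality-specific hash function f_h implemented as a multi-layer perceptron."""
    def __init__(self, in_dim, code_len, hid_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, code_len))

    def forward(self, E):
        return torch.tanh(self.mlp(E))     # relaxed codes in (-1, 1) used during training

    @torch.no_grad()
    def encode(self, E):
        return torch.sign(self.mlp(E))     # binary codes B in {-1, +1} used for retrieval

# Usage (dimensions are illustrative):
# img_hasher, txt_hasher = HashEncoder(d1, k), HashEncoder(d2, k)
# B_img, B_txt = img_hasher(E_img), txt_hasher(E_txt)
# loss_h = contrastive_loss(B_img, B_txt)   # supervise the hash code generation
```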
Step 7, constructing a loss function by using the learned hash codes $B^{I}$, $B^{T}$, $G^{I}$, $G^{T}$ and the integrated similarity matrix $S$, and updating the parameters of the hash coding network based on the loss function;
 The loss function is constructed by using the learned hash codes $B^{I}$, $B^{T}$, $G^{I}$, $G^{T}$ and the integrated similarity matrix $S$:
,
,
,
,
,
 Wherein $\alpha$ is a hyperparameter for adjusting the scaling range of the similarity matrix, the symbol $\odot$ represents the Hadamard product, and $\lambda$ is a hyperparameter that weighs the different loss terms.
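Because the exact loss terms are those given by the equations above, the following is only an illustrative sketch of a similarity-reconstruction objective of the kind commonly used in unsupervised cross-modal hashing: the cosine similarities of the learned codes are pulled toward the scaled robust similarity matrix, and the hypergraph codes are aligned with the hash-network codes; alpha and lam stand for the hyperparameters described above.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(B_img, B_txt, G_img, G_txt, S, alpha=1.5, lam=0.1):
    """Illustrative similarity-reconstruction loss (not the patent's exact formulation)."""
    def cos(a, b):
        return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    target = alpha * S
    loss_rec = (F.mse_loss(cos(B_img, B_txt), target) +      # cross-modal reconstruction
                F.mse_loss(cos(B_img, B_img), target) +      # intra-modal (image)
                F.mse_loss(cos(B_txt, B_txt), target))       # intra-modal (text)
    loss_align = F.mse_loss(B_img, G_img) + F.mse_loss(B_txt, G_txt)  # hypergraph consistency
    return loss_rec + lam * loss_align
```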
Step 8, performing retrieval according to the task data to be retrieved and the trained cross-modal retrieval model to obtain the retrieval result.
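A sketch of the retrieval step: the query is encoded into a binary code by the hash encoder of its modality, and database items of the other modality are ranked by Hamming distance; the names follow the earlier sketches.

```python
import torch

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query (codes in {-1, +1})."""
    k = query_code.numel()
    dist = (k - db_codes @ query_code) / 2.0   # for +/-1 codes: hamming = (k - <q, b>) / 2
    return torch.argsort(dist)                 # database indices, nearest first

# Usage (image -> text retrieval, names follow the earlier sketches):
# q = img_hasher.encode(E_query_img).squeeze(0)   # (k,)
# B_db_txt = txt_hasher.encode(E_db_txt)          # (n, k)
# ranking = hamming_rank(q, B_db_txt)
```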
Example 2
This embodiment provides a CLIP-based hypergraph convolution unsupervised cross-modal hashing system, which comprises:
 a multi-modal data acquisition module for acquiring a multi-modal training dataset;
 The cross-modal retrieval model training module is used for training the cross-modal retrieval model based on a multi-modal training data set to obtain a trained cross-modal retrieval model, and specifically comprises the following steps:
 Performing cross-modal fusion on the image features and the text features extracted based on the multi-modal training data set to obtain cross-modal fused image features and text features;
 Constructing an image modality similarity matrix based on the image features and the text features, constructing a text modality similarity matrix based on the cross-modal fused image features and text features, and unifying the image modality similarity matrix and the text modality similarity matrix into a robust similarity matrix;
 Utilizing the robust similarity matrix, introducing a hypergraph to aggregate the common features of similar samples into hyperedges to obtain a hypergraph incidence matrix, performing hypergraph convolution on the image features and the text features by using the incidence matrix, and mining high-order semantic information among the nodes to obtain the hash codes of the hypergraph learning process;
 constructing a reconstruction loss function according to the generated hash code and the robust similarity matrix, and updating parameters of the hash coding network based on the reconstruction loss function;
 and the retrieval module is used for retrieving and obtaining a retrieval result according to the task data to be retrieved and the trained cross-modal retrieval model.
Example 3
The present embodiment provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps in a hypergraph convolution based unsupervised cross-modality retrieval method as described above.
Example 4
The present embodiment provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a hypergraph convolution based unsupervised cross-modal retrieval method as described above when executing the program.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.