CN118916497B - Unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution - Google Patents

Unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution

Info

Publication number
CN118916497B
Authority
CN
China
Prior art keywords
modal
cross
hypergraph
features
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411425829.6A
Other languages
Chinese (zh)
Other versions
CN118916497A (en)
Inventor
罗昕
张乾
陈振铎
许信顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202411425829.6A
Publication of CN118916497A
Application granted
Publication of CN118916497B
Status: Active (current)
Anticipated expiration

Abstract

The invention belongs to the field of cross-modal information retrieval and provides an unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution. The technical scheme is as follows: fine-grained semantic features are extracted, and the semantic representation of each modality is further enhanced through a multi-modal fusion Transformer; this deep feature-extraction and fusion strategy enables the model to capture the complementary and symbiotic information of multi-modal data more comprehensively. An effective fusion method is provided for constructing a semantically complementary similarity matrix, which maximizes the latent semantic correlation among instances of different modalities, overcomes the shortcoming of existing methods in capturing the comprehensive semantic information of multi-modal data, and enhances the model's understanding and measurement of multi-modal content correlation. An adaptive hypergraph neural network is also introduced, which helps learn hash codes through the higher-order relations and local clustering structure among vertexes encoded by hypergraph convolution, thereby generating more discriminative hash codes.

Description

Hypergraph convolution-based unsupervised cross-modal retrieval method, system, medium and device
Technical Field
The invention belongs to the field of cross-modal information retrieval, and particularly relates to an unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the vigorous development of the internet and social media, massive multi-modal data such as text, images and video are growing at an unprecedented speed, and an efficient and accurate cross-modal retrieval system has become an urgent need in the field of information retrieval. Unsupervised cross-modal hashing techniques have received great attention for their computational and storage advantages in cross-modal data retrieval. The core challenge of this technique is how to reduce the semantic gap between heterogeneous modalities during hash-code learning and efficiently encode the correlation of data from different modalities into binary codes. Although deep learning methods exhibit excellent performance on cross-modal tasks, they still have limitations in inter-modal semantic interaction and in generalization to new data. Existing unsupervised cross-modal hashing methods also suffer from inaccurate similarity measurement and modality imbalance, resulting in non-ideal retrieval performance.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides an unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution. The invention captures the complementary and symbiotic information of multi-modal data more comprehensively, overcomes the shortcoming of existing methods in capturing the comprehensive semantic information of multi-modal data, and enhances the model's understanding and measurement of multi-modal content correlation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The first aspect of the invention provides an unsupervised cross-modal retrieval method based on hypergraph convolution, which comprises the following steps:
Acquiring a multi-modal training data set;
Training the cross-modal retrieval model based on the multi-modal training data set to obtain a trained cross-modal retrieval model, wherein the method specifically comprises the following steps:
Performing cross-modal fusion on the image features and the text features extracted based on the multi-modal training data set to obtain cross-modal fused image features and text features;
Constructing an image mode similarity matrix based on the image features and the text features, constructing a text mode similarity matrix based on the cross-mode fused image features and text features, and unifying the image mode similarity matrix and the text mode similarity matrix to a robust similarity matrix;
Utilizing a robust similarity matrix, introducing a hypergraph to aggregate common features of similar samples into a hyperedge to obtain a hypergraph incidence matrix, utilizing the incidence matrix to carry out hypergraph convolution on image features and text features, and mining high-order semantic information among all nodes to obtain hash codes in a hypergraph learning process;
constructing a reconstruction loss function according to the generated hash code and the robust similarity matrix, and updating parameters of the hash coding network based on the reconstruction loss function;
And searching according to the task data to be searched and the trained cross-modal searching model to obtain a searching result.
Further, the cross-modal fusion of the image features and the text features extracted based on the multi-modal training dataset to obtain the cross-modal fused image features and text features includes:
Extracting image features by using a CLIP image feature extractor, and extracting text features by using a text feature extractor;
splicing the output result of the CLIP image feature extractor and the output result of the text feature extractor to obtain a spliced tensor;
Inputting the spliced tensor into a multi-modal fusion Transformer, and capturing the correlation between the modes and the semantic correlation between the features by using a self-attention mechanism to obtain fused image features and text features.
Further, the loss function used when cross-modal fusion is performed on the image features and text features extracted from the multi-modal training dataset is:
$\mathcal{L}_{I} = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\left(\mathrm{sim}(\hat{F}_{i}^{I},\hat{F}_{i}^{T})/\tau\right)}{\sum_{j=1}^{m}\exp\left(\mathrm{sim}(\hat{F}_{i}^{I},\hat{F}_{j}^{T})/\tau\right)}$,
$\mathcal{L}_{T} = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\left(\mathrm{sim}(\hat{F}_{i}^{T},\hat{F}_{i}^{I})/\tau\right)}{\sum_{j=1}^{m}\exp\left(\mathrm{sim}(\hat{F}_{i}^{T},\hat{F}_{j}^{I})/\tau\right)}$,
$\mathcal{L}_{C} = \mathcal{L}_{I} + \mathcal{L}_{T}$,
wherein $\mathcal{L}_{I}$ is the contrastive loss of the image modality, $\mathcal{L}_{T}$ is the contrastive loss of the text modality, $\mathcal{L}_{C}$ is the cross-modal contrastive loss combining the two, $\tau$ is the temperature coefficient, $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity function, $\hat{F}_{i}^{I}$ and $\hat{F}_{i}^{T}$ denote the features of a truly aligned image-text pair, the subscripts $i$ and $j$ index the $i$-th and $j$-th samples, and $m$ is the number of training samples in each batch.
Further, the robust similarity matrix is obtained as a weighted combination of the modality-specific similarity matrices,
$S = \gamma\, S_{I} + (1-\gamma)\, S_{T}$,
wherein $\gamma$ is a weight for measuring the similarity information of different modalities, $S_{I}$ is the image-modality similarity matrix, $S_{T}$ is the text-modality similarity matrix, the combination further involves a symmetric matrix of the type described above, $S_{ij}$ is the similarity value between the $i$-th sample and the $j$-th sample in the robust similarity matrix, and $m$ is the number of training samples in each batch.
Further, the step of using the robust similarity matrix and introducing the hypergraph to aggregate the common features of similar samples into hyperedges, so as to obtain the incidence matrix of the hypergraph, comprises the following steps:
taking the image features $F^{I}$, the text features $F^{T}$ and the robust similarity matrix $S$ as input, regarding each feature vector as a node $v_{i}$; using the similarity matrix $S$, identifying the $k$ nodes most similar to each node $v_{i}$, and combining the identified nodes into a hyperedge $e_{i}$, which can be expressed as $e_{i} = \{v_{i}\} \cup N_{k}(v_{i})$, wherein $N_{k}(v_{i})$ denotes the set of the $k$ nodes most similar to $v_{i}$;
the incidence matrix $H$ of the hypergraph is expressed as:
$H(v,e) = \begin{cases} 1, & v \in e \\ 0, & v \notin e \end{cases}$
Further, the performing hypergraph convolution on the image features and the text features by using the incidence matrix, and mining the high-order semantic information among the nodes to obtain the hash codes in the hypergraph learning process, comprises:
introducing a standard Laplacian matrix for the incidence matrix of the constructed hypergraph;
calculating the hypergraph convolution layer representation by combining the standard Laplacian matrix;
and constructing a hypergraph convolutional network based on the obtained hypergraph convolution layer representation, and generating the hash codes in the hypergraph learning process.
Further, the reconstruction loss function $\mathcal{L}_{rec}$ is constructed from the generated hash codes and the robust similarity matrix by aligning the similarities computed from the hash codes with the scaled robust similarity matrix, wherein $\mathcal{L}_{rec}$ is the reconstruction loss, $\eta$ is a hyper-parameter for adjusting the scaling range of the similarity matrix, the symbol $\odot$ denotes the Hadamard product, a further hyper-parameter weighs the different loss terms, $S$ is the robust similarity matrix, and $B$ is the learned hash code.
A second aspect of the present invention provides an unsupervised cross-modal retrieval system based on hypergraph convolution, comprising:
a multi-modal data acquisition module for acquiring a multi-modal training dataset;
The cross-modal retrieval model training module is used for training the cross-modal retrieval model based on a multi-modal training data set to obtain a trained cross-modal retrieval model, and specifically comprises the following steps:
Performing cross-modal fusion on the image features and the text features extracted based on the multi-modal training data set to obtain cross-modal fused image features and text features;
Constructing an image mode similarity matrix based on the image features and the text features, constructing a text mode similarity matrix based on the cross-mode fused image features and text features, and unifying the image mode similarity matrix and the text mode similarity matrix to a robust similarity matrix;
Utilizing a robust similarity matrix, introducing a hypergraph to aggregate common features of similar samples into a hyperedge to obtain a hypergraph incidence matrix, utilizing the incidence matrix to carry out hypergraph convolution on image features and text features, and mining high-order semantic information among all nodes to obtain hash codes in a hypergraph learning process;
constructing a reconstruction loss function according to the generated hash code and the robust similarity matrix, and updating parameters of the hash coding network based on the reconstruction loss function;
and the retrieval module is used for retrieving and obtaining a retrieval result according to the task data to be retrieved and the trained cross-modal retrieval model.
A third aspect of the present invention provides a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps in a hypergraph convolution based unsupervised cross-modality retrieval method as described above.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a hypergraph convolution based unsupervised cross-modality retrieval method as described above when the program is executed.
Compared with the prior art, the invention has the beneficial effects that:
1. According to the invention, after the extracted fine-grained semantic features are fused, a semantically complementary similarity matrix is constructed, which maximizes the latent semantic correlation among instances of different modalities; hypergraph convolution then encodes the higher-order relations and local clustering structure among vertexes to help learn the hash codes, so that more discriminative hash codes are generated and retrieval precision is improved.
2. The invention adopts the CLIP multi-modal model to extract fine-grained semantic features, and further enhances the semantic representation of each modality through the multi-modal fusion Transformer. This deep feature-extraction and fusion strategy enables the model of the invention to capture the complementary and symbiotic information of multi-modal data more comprehensively.
3. The invention provides an effective fusion method to construct a semantically complementary similarity matrix so as to maximize the latent semantic correlation among instances of different modalities. The method helps overcome the shortcoming of existing methods in capturing the comprehensive semantic information of multi-modal data, and enhances the model's understanding and measurement of multi-modal content correlation.
4. The invention introduces an adaptive hypergraph neural network, which helps learn the hash codes through the higher-order relations and local clustering structure among the vertexes encoded by hypergraph convolution, thereby generating more discriminative hash codes.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flowchart of an unsupervised cross-modal retrieval method based on hypergraph convolution provided by an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Interpretation of the terms
CLIP (Contrastive Language-Image Pre-Training) is a multi-modal pre-trained neural network model released by OpenAI in 2021. It maps images and text into a unified vector space by means of contrastive learning, so that the model can directly compute the similarity between an image and a text in that vector space. The core idea of the CLIP model is to pre-train on a large amount of paired image-text data to learn the alignment between images and text. Such a model is particularly suited to zero-shot learning tasks, i.e., the model can make predictions without having seen training examples of the new images or texts. The CLIP model performs excellently in various fields such as image-text retrieval and image-text generation.
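As an aside to the terminology above, the snippet below is a minimal sketch of computing image-text similarity with a publicly released CLIP checkpoint through the Hugging Face transformers API; the checkpoint name, file path and prompts are illustrative assumptions and are not specified by the patent.

```python
# Minimal sketch of zero-shot image-text similarity with CLIP (assumed Hugging Face API).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # hypothetical input image
texts = ["a photo of a dog", "a photo of a cat"]       # hypothetical candidate texts

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarities scaled by CLIP's learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)   # e.g. higher probability on "a photo of a dog" for a dog photo
```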
The hypergraph neural network (Hypergraph Neural Network, HGNN) is an advanced graph representation learning method that handles complex graph structures by encoding higher-order data correlations. The network achieves effective representation learning through hyperedge convolution operations and can capture complex relations among nodes. The concept of the hypergraph has been introduced into cross-modal hash retrieval to overcome the limitation of ordinary pairwise graphs in describing higher-order relations between samples. By connecting any number of samples through a hyperedge, a hypergraph can describe the similarity between samples, i.e., the higher-order relation, more completely. This mining of higher-order information provides a more robust modal representation for unsupervised cross-modal hashing methods.
Aiming at the limitations of existing unsupervised cross-modal hashing mentioned in the background, the invention combines the CLIP model with hypergraph learning for the first time and provides an unsupervised cross-modal retrieval method based on hypergraph convolution, which has the following characteristics:
(1) According to the invention, the CLIP multi-mode model is adopted to extract fine granularity semantic features, and semantic representation of each mode is further enhanced through the multi-mode fusion converter. The depth feature extraction and fusion strategy enables the model of the invention to more comprehensively capture complementary and symbiotic information of multi-modal data;
(2) The invention provides an effective fusion method for constructing a semantic complementary similarity matrix so as to maximize potential semantic correlation among different modal examples. The method is helpful for overcoming the defects of the existing method in the aspect of capturing the comprehensive semantic information of the multi-modal data, and enhancing the understanding and measurement of the model on the correlation of the multi-modal content;
(3) The invention introduces an adaptive hypergraph neural network, which helps learn the hash codes through the higher-order relations and local clustering structure among the vertexes encoded by hypergraph convolution, thereby generating more discriminative hash codes;
(4) The invention adopts an iterative approximate optimization strategy to reduce the information loss in the binarization process.
Example 1
As shown in fig. 1, the embodiment provides an unsupervised cross-modal retrieval method based on hypergraph convolution, which comprises the following steps:
Step 1, acquiring a multi-mode data set;
In this embodiment, the acquired multi-modal dataset is represented as $D = \{(x_{i}, y_{i})\}_{i=1}^{n}$, wherein $x_{i}$ and $y_{i}$ respectively represent the image data and the text data in the $i$-th sample pair, and $n$ is the number of sample pairs;
The task of cross-modality retrieval is to use data of one modality to retrieve data of another modality.
In this embodiment, a cross-modal retrieval model is constructed and optimized using image-to-text retrieval and text-to-image retrieval as retrieval tasks.
The training multi-modal dataset is represented as $D = \{(x_{i}, y_{i})\}_{i=1}^{n}$, wherein $x_{i}$ and $y_{i}$ respectively represent the image data and the text data in the $i$-th sample pair. For each batch of $m$ training samples, the similarity matrix constructed from the CLIP-extracted image features $F^{I}$ and text features $F^{T}$ is denoted $S_{F}$, and the similarity matrix constructed from the image features $\hat{F}^{I}$ and text features $\hat{F}^{T}$ fused by the cross-modal Transformer is denoted $S_{H}$. The final representations of the image and the text lie in $\mathbb{R}^{m \times d_{1}}$ and $\mathbb{R}^{m \times d_{2}}$ respectively, wherein $d_{1}$ and $d_{2}$ represent the dimensions of the image features and the text features. The hash codes generated by the hash coding network are denoted $B^{I}$ and $B^{T}$, the hash codes generated by the hypergraph convolution network are denoted $B_{h}^{I}$ and $B_{h}^{T}$, and $c$ is the length of the hash codes.
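For readers who prefer concrete shapes, the following sketch pictures the notation above with illustrative tensor sizes; all dimensions, symbol names and the batch size are assumptions rather than values fixed by the patent.

```python
import torch

# Illustrative sizes only: batch of m pairs, feature dims d1/d2, hash-code length c.
m, d1, d2, c = 128, 512, 512, 64

F_I, F_T = torch.randn(m, d1), torch.randn(m, d2)          # CLIP image / text features
F_I_hat, F_T_hat = torch.randn(m, d1), torch.randn(m, d2)  # features after cross-modal fusion

S_F = torch.empty(m, m)   # similarity matrix built from the CLIP features
S_H = torch.empty(m, m)   # similarity matrix built from the fused features

B_I = torch.sign(torch.randn(m, c))    # hash codes from the hash coding network (image)
B_T = torch.sign(torch.randn(m, c))    # hash codes from the hash coding network (text)
B_hI = torch.sign(torch.randn(m, c))   # hash codes from the hypergraph convolution network (image)
B_hT = torch.sign(torch.randn(m, c))   # hash codes from the hypergraph convolution network (text)
```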
Step 2, fusing the image features and the text features extracted from the multi-modal dataset to obtain the fused image features and text features;
Step 2 specifically comprises the following steps:
Step 201: for each batch of $m$ training samples, extract the image features $F^{I}$ using the CLIP image feature extractor and extract the text features $F^{T}$ using the text feature extractor;
Step 202: splice the output $F^{I}$ of the CLIP image feature extractor and the output $F^{T}$ of the text feature extractor to obtain a new tensor $F = [F^{I}; F^{T}]$, wherein $d_{1}$ and $d_{2}$ represent the dimensions of the visual features and the text features respectively;
Step 203: input the spliced tensor $F$ into the multi-modal fusion Transformer, and use the self-attention mechanism to capture the correlation between modalities and the semantic correlation between features, so as to obtain the fused image features and text features, which specifically comprises the following steps:
In this embodiment, the attention score is obtained by calculating the similarity between the query and the key, and is then used to compute a weighted sum of the values.
The method effectively identifies the degree of association between each feature and the other features and integrates these relationships into the feature-fusion process. Finally, through the multi-modal fusion Transformer, effective fusion of the image and text features is achieved, thereby improving multi-modal learning performance.
Step 2031: construct the queries, keys and values in the self-attention mechanism;
in the self-attention mechanism, the queries, keys and values are constructed from the multi-modal features, expressed as:
$F = [F^{I}; F^{T}]$,
$Q = F\,W_{Q}$,
$K = F\,W_{K}$,
$V = F\,W_{V}$,
wherein $F^{I}$ is the feature of the image, $F^{T}$ is the feature of the text, $F$ is the result of splicing $F^{I}$ and $F^{T}$, $W_{Q}$, $W_{K}$ and $W_{V}$ are all trainable network parameters, $Q$ is the query vector obtained by multiplying the tensor $F$ with $W_{Q}$, $K$ is the key vector obtained by multiplying $F$ with $W_{K}$, and $V$ is the value vector obtained by multiplying $F$ with $W_{V}$.
Step 2032: for any image-text pair, the self-attention mechanism of the multi-modal Transformer generates multi-modal features with stronger representation capability, expressed as:
$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V$,
wherein $d_{k}$ is the dimension of the keys, $\mathrm{Attention}(\cdot)$ is the self-attention mechanism, and $K$ and $V$ are the key matrix and the value matrix respectively.
Step 2033: based on the multi-modal features generated by the self-attention mechanism, generate the fused multi-modal features using a feed-forward neural network, expressed as:
$Z = \mathrm{LN}\!\left(F + \mathrm{Dropout}\!\left(\mathrm{Attention}(Q,K,V)\right)\right)$,
$\mathrm{FFN}(Z) = \max(0,\, Z W_{1} + b_{1})\, W_{2} + b_{2}$,
$\hat{F} = \mathrm{LN}\!\left(Z + \mathrm{Dropout}\!\left(\mathrm{FFN}(Z)\right)\right)$,
wherein $\mathrm{LN}(\cdot)$ and $\mathrm{Dropout}(\cdot)$ denote the normalization layer and the dropout layer respectively, and $\hat{F}$ is the output of the multi-modal fusion Transformer.
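The following PyTorch sketch illustrates one possible fusion block matching the structure of steps 2031-2033 (spliced features, self-attention over Q/K/V, residual connections with LayerNorm and dropout, and a feed-forward network); the layer sizes, head count and the use of nn.MultiheadAttention are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One Transformer-style fusion block over spliced image+text features (a sketch)."""
    def __init__(self, dim: int, heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=p_drop, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.drop = nn.Dropout(p_drop)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, tokens, dim) -- the tokens are the spliced image/text feature vectors.
        a, _ = self.attn(f, f, f)                  # Q, K, V all come from the spliced tensor
        z = self.norm1(f + self.drop(a))           # residual + LayerNorm, as in step 2033
        return self.norm2(z + self.drop(self.ffn(z)))

# Usage: splice CLIP image/text features along the token axis, then fuse.
m, d = 128, 512
f_img, f_txt = torch.randn(m, 1, d), torch.randn(m, 1, d)
fused = FusionBlock(d)(torch.cat([f_img, f_txt], dim=1))   # (m, 2, d)
f_img_hat, f_txt_hat = fused[:, 0], fused[:, 1]
```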
Step 2034: in order to ensure that data representations of the same category within the same modality carry consistent category semantics, the modality contrastive losses are defined as follows:
$\mathcal{L}_{I} = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\left(\mathrm{sim}(\hat{F}_{i}^{I},\hat{F}_{i}^{T})/\tau\right)}{\sum_{j=1}^{m}\exp\left(\mathrm{sim}(\hat{F}_{i}^{I},\hat{F}_{j}^{T})/\tau\right)}$,
$\mathcal{L}_{T} = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{\exp\left(\mathrm{sim}(\hat{F}_{i}^{T},\hat{F}_{i}^{I})/\tau\right)}{\sum_{j=1}^{m}\exp\left(\mathrm{sim}(\hat{F}_{i}^{T},\hat{F}_{j}^{I})/\tau\right)}$,
wherein $\mathcal{L}_{I}$ is the contrastive loss of the image modality, $\mathcal{L}_{T}$ is the contrastive loss of the text modality, $\mathcal{L}_{C}$ is the cross-modal contrastive loss combining the two, $\tau$ is the temperature coefficient, $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity function, and $\hat{F}_{i}^{I}$ and $\hat{F}_{i}^{T}$ denote the features of a truly aligned image-text pair.
Thus, the multi-modal contrastive loss can be expressed as:
$\mathcal{L}_{C} = \mathcal{L}_{I} + \mathcal{L}_{T}$;
Step 2035: use $F^{I}$ together with $\hat{F}^{I}$, and $F^{T}$ together with $\hat{F}^{T}$, as the final representations of the image and text modalities that are fed to the hash coding network. In this way the effective information of the original features is retained while higher-level semantic information is extracted, so that the complex associations between images and texts are better captured and utilized, improving the performance and accuracy of the model.
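A hedged sketch of the symmetric contrastive objective described in step 2034 is given below; the InfoNCE-style formulation, normalization and temperature value are assumptions consistent with the text rather than the patent's verbatim formula.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img: torch.Tensor, txt: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of aligned image-text pairs (a sketch)."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau                  # cosine similarities of all pairs, scaled by temperature
    targets = torch.arange(img.size(0), device=img.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return loss_i + loss_t

# Usage with the fused features from the fusion Transformer:
# loss_c = contrastive_loss(f_img_hat, f_txt_hat)
```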
Step 3, constructing an image mode similarity matrix based on the image features and the text features, constructing a text mode similarity matrix based on the cross-mode fused image features and text features, and unifying the image mode similarity matrix and the text mode similarity matrix to a robust similarity matrix, wherein the method specifically comprises the following steps:
Due to the lack of sample labels, the unsupervised hashing method cannot construct a multi-label similarity matrix to guide the learning of hash codes. This embodiment therefore provides a similarity-matrix construction scheme based on aggregation and dynamic adjustment. Using the visual features $F^{I}$, an image-modality cosine similarity matrix $S_{I}$ is constructed, wherein each element $S_{I}(i,j)$ is calculated as the cosine similarity between the features of the $i$-th and $j$-th images;
using the text features $F^{T}$, a text-modality similarity matrix $S_{T}$ is constructed, wherein $S_{T}(i,j)$ is calculated as the cosine similarity between the features of the $i$-th and $j$-th texts.
Then, the image-modality similarity matrix $S_{I}$ and the text-modality similarity matrix $S_{T}$ are integrated into a unified similarity matrix that maintains the semantic relationships among instances of different modalities, with the two modalities complementing each other.
To this end, a joint modal similarity matrix $S_{J}$ is constructed by fusing the multi-modal information in $S_{I}$ and $S_{T}$ through a weighted combination,
$S_{J} = \gamma\, S_{I} + (1-\gamma)\, S_{T}$,
wherein $\gamma$ is the weight measuring the similarity information of the different modalities; an image-text pair with a larger value of $S_{J}(i,j)$ has higher semantic similarity than one with a smaller value.
The construction also involves a symmetric matrix whose element in row $i$ and column $j$ is computed from the $i$-th row $S_{I}(i,:)$ of the image-modality similarity matrix and the $j$-th column $S_{T}(:,j)$ of the text-modality similarity matrix.
From experimental observation, it is noted that in the above similarity matrix $S_{J}$ the distances between unpaired instances are not well separated. One reason is that the features are obtained by contrastive learning on a large amount of data, which keeps the learned distances of unpaired instances within a small range. Previous hashing algorithms focus only on the values on the diagonal of the similarity matrix, i.e., on the pairwise relationship between an image and its paired text. However, in the hash-learning process, the relationship between unpaired images and texts should not be ignored. Therefore, a remapping method is adopted to enhance the values of the off-diagonal elements of the similarity matrix (which reflect the relative relationships between unpaired images and texts), making the similarity matrix more discriminative.
First, the average, minimum and maximum values of all elements in each batch of $S_{J}$ are obtained (denoted $s_{mean}$, $s_{min}$ and $s_{max}$);
then, by comparing each element of $S_{J}$ with the average value, the corresponding image-text pair is judged to be a "similar sample pair" or a "dissimilar sample pair", and different weights are applied according to the judgment result;
the weights $w_{s}$ and $w_{d}$, computed from the batch statistics above, represent the weights of "similar sample pairs" and "dissimilar sample pairs" respectively. These weights "stretch" the original elements non-linearly; they help the hash function learn the common features between similar samples and the distinguishing features between dissimilar samples, thereby generating more discriminative and accurate hash codes. Finally, the new cross-modal similarity matrix is denoted $\tilde{S}$.
Following the above procedure, two similarity matrices $S_{F}$ and $S_{H}$ are constructed, wherein $S_{F}$ is built from the features $F^{I}$ and $F^{T}$ extracted by the CLIP encoders, and $S_{H}$ is built from the cross-modal fused features $\hat{F}^{I}$ and $\hat{F}^{T}$. The two similarity matrices are unified by weighting into a final robust similarity matrix $S$, which is used to supervise and guide the hash-code generation process.
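The sketch below illustrates one way to realize step 3: per-modality cosine similarity matrices, a weighted fusion, and a mean-based remapping that stretches off-diagonal values. The concrete weights, the linear stretch and the fusion coefficients are illustrative assumptions; the patent's exact remapping formulas are not reproduced here.

```python
import torch
import torch.nn.functional as F

def modality_similarity(feat: torch.Tensor) -> torch.Tensor:
    """Cosine similarity matrix within one modality."""
    feat = F.normalize(feat, dim=-1)
    return feat @ feat.t()

def remap(s: torch.Tensor, w_s: float = 1.5, w_d: float = 1.5) -> torch.Tensor:
    """Stretch similarities around the batch mean: push 'similar' pairs up and
    'dissimilar' pairs down. The weights and the linear stretch are assumptions."""
    mean = s.mean()
    return torch.where(s >= mean, mean + w_s * (s - mean), mean - w_d * (mean - s))

def robust_similarity(f_img, f_txt, f_img_hat, f_txt_hat, gamma=0.5, beta=0.5):
    """Combine modality-specific matrices into one robust matrix S (assumed weighting scheme)."""
    s_f = gamma * modality_similarity(f_img) + (1 - gamma) * modality_similarity(f_txt)          # CLIP features
    s_h = gamma * modality_similarity(f_img_hat) + (1 - gamma) * modality_similarity(f_txt_hat)  # fused features
    return beta * remap(s_f) + (1 - beta) * remap(s_h)
```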
Step 4, utilizing the robust similarity matrix and introducing a hypergraph to aggregate the common features of similar samples into hyperedges, so as to obtain the incidence matrix of the hypergraph;
Taking the image features $F^{I}$, the text features $F^{T}$ and the robust similarity matrix $S$ as input, each feature vector is regarded as a node $v_{i}$. Using the similarity matrix $S$, the $k$ nodes most similar to each node $v_{i}$ are identified. The identified nodes are then combined into a hyperedge $e_{i}$, which can be expressed as $e_{i} = \{v_{i}\} \cup N_{k}(v_{i})$, wherein $N_{k}(v_{i})$ denotes the set of the $k$ nodes most similar to $v_{i}$;
therefore, the incidence matrix $H$ of the hypergraph is expressed as:
$H(v,e) = \begin{cases} 1, & v \in e \\ 0, & v \notin e \end{cases}$
step 5, performing hypergraph convolution on the image features and the text features by using the incidence matrix, and mining high-order semantic information among all nodes to obtain hash codes in the hypergraph learning process;
Hypergraph learning needs to aggregate feature information from different neighborhood structures and generate node representations. To this end, a standard Laplacian matrix is introduced for the constructed hypergraph, expressed as:
$\Delta = D_{v}^{-1/2}\, H\, D_{e}^{-1}\, H^{\top}\, D_{v}^{-1/2}$,
wherein $D_{v}$ is a diagonal matrix whose elements represent the degree of each node, and $D_{e}$ is another diagonal matrix whose elements represent the size of each hyperedge.
The hypergraph convolution layer is expressed as:
$X^{(l+1)} = \sigma\!\left(\Delta\, X^{(l)}\, \Theta^{(l)}\right)$,
wherein $X^{(l)}$ denotes the input of the $l$-th layer in the hypergraph network, $\Theta^{(l)}$ denotes the weight matrix of the $l$-th layer, and $d_{l}$ and $d_{l+1}$ denote the node-representation dimensions of the $l$-th and $(l+1)$-th layers respectively.
The hash codes generated by the hypergraph convolution network are obtained by binarizing the node representations output by the last hypergraph convolution layer, wherein $t$ denotes the number of iterations. An iterative approximate optimization strategy is used to optimize the hash codes, i.e., the discrete $\mathrm{sign}(\cdot)$ operation is approximated by a continuous function that is tightened as the iterations proceed. The discrete problem is thereby converted into a series of continuous optimization problems, which effectively relieves the information-loss and instability problems of the binarization process.
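Below is a sketch of an HGNN-style hypergraph convolution layer consistent with the Laplacian and layer rule above, followed by a tanh relaxation as one common way to realize the iterative approximate optimization mentioned in the text; the ReLU activation, unit hyperedge weights and the tanh schedule are assumptions.

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """One hypergraph convolution layer: X' = sigma(Dv^-1/2 H De^-1 H^T Dv^-1/2 X Theta)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, X: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        dv = H.sum(dim=1)                                        # node degrees
        de = H.sum(dim=0)                                        # hyperedge sizes
        Dv_inv_sqrt = torch.diag(dv.clamp(min=1e-6).pow(-0.5))
        De_inv = torch.diag(de.clamp(min=1e-6).reciprocal())
        lap = Dv_inv_sqrt @ H @ De_inv @ H.t() @ Dv_inv_sqrt     # normalized hypergraph operator
        return torch.relu(lap @ self.theta(X))

def hypergraph_codes(X, H, layers, beta_t: float = 1.0):
    """Stack hypergraph convolutions, then relax sign() with a tanh whose slope beta_t
    can be raised over training iterations (an assumed realization of the iterative
    approximate optimization, not the patent's exact formula)."""
    for layer in layers:
        X = layer(X, H)
    return torch.tanh(beta_t * X)
```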
Step 6: the final representations of the image and text modalities are input into a hash coding network composed of multi-layer perceptrons to generate the hash codes $B^{I}$ and $B^{T}$ used for retrieval, and a contrastive loss is used to supervise the generation of the hash codes.
Two modality-specific hash functions implemented by multi-layer perceptrons are used to generate a hash code for the samples of each modality, denoted as:
$B^{I} = \mathrm{sign}\!\left(f_{I}(\hat{F}^{I})\right)$,
$B^{T} = \mathrm{sign}\!\left(f_{T}(\hat{F}^{T})\right)$,
wherein $f_{I}(\cdot)$ and $f_{T}(\cdot)$ are the modality-specific hash functions and $\mathrm{sign}(\cdot)$ is the binarization function. To learn a common subspace, a contrastive loss of the same form as in step 2034 is introduced for optimization, applied to the hash codes, wherein $B_{i}^{I}$ and $B_{i}^{T}$ denote the hash codes of a truly aligned image-text pair.
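A minimal sketch of a modality-specific hash head as described in step 6 (an MLP whose relaxed outputs are binarized with sign at retrieval time) is shown below; the hidden width and activation are assumptions, since the patent only states "multi-layer perceptron".

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Modality-specific hash function: MLP + tanh relaxation, sign() at retrieval time."""
    def __init__(self, in_dim: int, code_len: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(), nn.Linear(1024, code_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.net(x))   # relaxed codes in (-1, 1) used during training

# Usage: B_I = torch.sign(hash_img(final_image_repr)); the contrastive loss from step 2034
# can be reused on the relaxed codes to learn the common subspace.
```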
Step 7: using the learned hash codes $B$ and the integrated robust similarity matrix $S$, a reconstruction loss function is constructed, and the parameters of the hash coding network are updated based on it. The reconstruction loss aligns the similarities computed from the learned hash codes with the scaled robust similarity matrix, wherein $\eta$ is a hyper-parameter for adjusting the scaling range of the similarity matrix, the symbol $\odot$ denotes the Hadamard product, and a further hyper-parameter weighs the different loss terms.
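The sketch below shows one plausible reading of step 7: penalizing the gap between hash-code similarities and the scaled robust similarity matrix. The exact loss terms, the role of the Hadamard product and the hyper-parameter values are not given by the available text, so everything here is an assumption.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(b_img: torch.Tensor, b_txt: torch.Tensor,
                        S: torch.Tensor, eta: float = 1.5, mu: float = 0.5) -> torch.Tensor:
    """Align hash-code similarities with the scaled robust similarity matrix (a sketch)."""
    c = b_img.size(1)
    sim_cross = b_img @ b_txt.t() / c     # cross-modal code similarities in [-1, 1]
    sim_img = b_img @ b_img.t() / c       # intra-modal code similarities (image)
    sim_txt = b_txt @ b_txt.t() / c       # intra-modal code similarities (text)
    target = eta * S                      # eta scales the robust similarity matrix
    loss_cross = F.mse_loss(sim_cross, target)
    loss_intra = F.mse_loss(sim_img, target) + F.mse_loss(sim_txt, target)
    return loss_cross + mu * loss_intra   # mu weighs the different loss terms
```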
And 8, searching according to the task data to be searched and the trained cross-modal searching model to obtain a searching result.
Example two
This embodiment provides a CLIP-based hypergraph convolution unsupervised cross-modal hashing system, which comprises:
a multi-modal data acquisition module for acquiring a multi-modal training dataset;
The cross-modal retrieval model training module is used for training the cross-modal retrieval model based on a multi-modal training data set to obtain a trained cross-modal retrieval model, and specifically comprises the following steps:
Performing cross-modal fusion on the image features and the text features extracted based on the multi-modal training data set to obtain cross-modal fused image features and text features;
Constructing an image mode similarity matrix based on the image features and the text features, constructing a text mode similarity matrix based on the cross-mode fused image features and text features, and unifying the image mode similarity matrix and the text mode similarity matrix to a robust similarity matrix;
Utilizing a robust similarity matrix, introducing a hypergraph to aggregate common features of similar samples into a hyperedge to obtain a hypergraph incidence matrix, utilizing the incidence matrix to carry out hypergraph convolution on image features and text features, and mining high-order semantic information among all nodes to obtain hash codes in a hypergraph learning process;
constructing a reconstruction loss function according to the generated hash code and the robust similarity matrix, and updating parameters of the hash coding network based on the reconstruction loss function;
and the retrieval module is used for retrieving and obtaining a retrieval result according to the task data to be retrieved and the trained cross-modal retrieval model.
Example III
The present embodiment provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps in a hypergraph convolution based unsupervised cross-modality retrieval method as described above.
Example IV
The present embodiment provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a hypergraph convolution based unsupervised cross-modal retrieval method as described above when executing the program.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

CN202411425829.6A | 2024-10-14 | 2024-10-14 | Unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution | Active | CN118916497B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411425829.6A | 2024-10-14 | 2024-10-14 | CN118916497B (en): Unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411425829.6A | 2024-10-14 | 2024-10-14 | CN118916497B (en): Unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution

Publications (2)

Publication Number | Publication Date
CN118916497A (en) | 2024-11-08
CN118916497B | 2025-02-25

Family

ID=93298272

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411425829.6A (Active) | CN118916497B (en): Unsupervised cross-modal retrieval method, system, medium and device based on hypergraph convolution | 2024-10-14 | 2024-10-14

Country Status (1)

Country | Link
CN | CN118916497B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119830918B (en)* | 2024-12-21 | 2025-07-15 | 北方工业大学 | Multi-modal entity relation extraction method based on hypergraph neural network
CN119904783B (en)* | 2025-03-28 | 2025-05-23 | 山东大学 | Video instance segmentation method, system, medium and device based on hypergraph representation
CN120068009B (en)* | 2025-04-29 | 2025-07-15 | 山东大学 | Multi-modal feature fusion method, system, device, medium and program product
CN120218568B (en)* | 2025-05-27 | 2025-09-02 | 江苏苏美达轻纺科技产业有限公司 | A clothing pattern resource management system and method
CN120371996B (en)* | 2025-06-26 | 2025-09-02 | 中南大学 | Unsupervised cross-modal hash retrieval method, system and device based on implicit characteristics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109299216A (en)* | 2018-10-29 | 2019-02-01 | 山东师范大学 | A cross-modal hash retrieval method and system integrating supervised information
CN115687571A (en)* | 2022-10-28 | 2023-02-03 | 重庆师范大学 | A deep unsupervised cross-modal retrieval method based on modal fusion reconstruction hash

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2022104540A1 (en)* | 2020-11-17 | 2022-05-27 | 深圳大学 | Cross-modal hash retrieval method, terminal device, and storage medium
CN115878757A (en)* | 2022-12-09 | 2023-03-31 | 大连理工大学 | Concept decomposition-based hybrid hypergraph regularization semi-supervised cross-modal hashing method
CN117093730B (en)* | 2023-06-27 | 2025-07-11 | 重庆师范大学 | Self-supervised cross-modal hashing retrieval method based on graph convolution semantic enhancement
CN117540039A (en)* | 2023-11-14 | 2024-02-09 | 大连理工大学 | Data retrieval method based on unsupervised cross-modal hash algorithm

Also Published As

Publication number | Publication date
CN118916497A (en) | 2024-11-08


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
