Technical Field
The present invention relates to an image-text retrieval method and system based on cross-modal semantic parsing, and belongs to the technical fields of cross-modal retrieval and artificial intelligence.
Background
Image-text retrieval systems, built on advanced image recognition and text analysis technologies, have been widely applied in e-commerce, social media analysis, and other scenarios, and have become important tools for information acquisition and data analysis. However, most existing image-text retrieval systems rely on keyword matching over entity information, which often fails to capture the complex relationships between image content and textual descriptions.
Research on natural-language-based cross-modal image-text matching not only deepens traditional image-text retrieval research but also improves the convenience of human-computer interaction. The field currently faces two main challenges:
(1) Effective cross-modal semantic alignment. For a large-scale image library, the query needs of different users are highly diverse. For example, user A may want images of "a child in red clothes playing on the beach", while user B may want images of "a child in red clothes playing on the grass in a park". The target images differ markedly in content. How to achieve effective cross-modal semantic alignment over a large number of images for different users' text queries is therefore an important challenge.
(2) An efficient cross-modal matching mechanism. Existing mainstream image-text matching methods follow two directions. Global-level matching learns global alignment, representing each image or text as a single holistic feature and measuring similarity between the two; local-level matching focuses on fine-grained alignment between local fragments, inferring the overall image-text similarity from the correlations of all word-region pairs. Global matching is efficient but less accurate, whereas local-level matching achieves better accuracy at the cost of excessive computational complexity and slow inference, which severely limits its practical deployment. Exploring a cross-modal image-text matching mechanism that combines retrieval accuracy with efficiency is therefore equally challenging.
Summary of the Invention
In view of the shortcomings of the prior art, the present invention provides an image-text retrieval method based on cross-modal semantic parsing.
The present invention constructs an efficient cross-modal image-text retrieval framework that performs bidirectional matching between semantically similar images and texts. To this end, the invention first proposes an end-to-end two-stage filtering strategy that uses hash codes and quantization codes to screen out, from coarse to fine, a small number of data points with high positive-class confidence scores, thereby covering as many positive samples as possible while greatly reducing the number of samples that must be computed. Second, the invention introduces a cross-modal attention mechanism to perform fine-grained computation on the filtered candidate samples and obtain high-accuracy matching results. Finally, the invention proposes a bidirectional similarity re-ranking technique to mitigate the accuracy loss caused by the difficulty of aligning semantic information in cross-modal retrieval.
The present invention focuses on cross-modal image-text retrieval: on the one hand, given a natural-language query sentence, images that semantically match the query are retrieved from a large collection of image data; on the other hand, given a query image, semantically similar textual descriptions are retrieved from the text data. Taking the text-to-image direction as an example, given a natural-language query ("a child in red clothes playing in the park"), images matching the query semantics are retrieved from the image database and displayed.
The present invention also provides an image-text retrieval system based on cross-modal semantic parsing.
Terminology:
1. ResNet-101: ResNet is a deep neural network architecture proposed by Microsoft Research. ResNet introduces residual modules to enable the training of much deeper networks. In the ResNet series, the number indicates the depth of the model; ResNet-101 is a 101-layer deep neural network that adopts the residual learning mechanism.
2. Visual Genome dataset: the Visual Genome dataset is a large-scale visual reasoning dataset released by the Department of Computer Science and the Intelligent Information Systems Laboratory at Stanford University. It consists of 101,174 images from MSCOCO and 1.7 million question-answer pairs, averaging 17 questions per image, and provides 108K images with dense annotations of objects, attributes, and relationships. It is an important resource for research on visual scene understanding and reasoning.
3. Faster R-CNN: Faster R-CNN is a classic object detection algorithm proposed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in 2015. It is an improved version of the R-CNN family that raises detection speed by introducing a region proposal network. Its core idea is to split object detection into two stages: candidate region generation and object classification. The region proposal network generates candidate object regions, and the subsequent Fast R-CNN head classifies these candidates and regresses their bounding boxes. Compared with the traditional R-CNN algorithms, it achieves higher detection speed and better accuracy, making object detection more efficient in practical applications.
4. Bidirectional gated recurrent neural network: the bidirectional gated recurrent neural network (Bi-GRU) combines the bidirectional structure of the bidirectional recurrent neural network (Bi-RNN) with the gating mechanism of the gated recurrent unit (GRU) to process sequence data and capture contextual information. The GRU is a recurrent neural network variant with an update gate and a reset gate that control the flow of information and the updating of memory. The bidirectional structure gives the Bi-GRU both a forward and a backward hidden state at every time step, so the model considers information before and after the current moment simultaneously and understands the sequence more comprehensively. Bi-GRU is suitable for a wide range of sequence modeling tasks, such as text classification, named entity recognition, and sentiment analysis in natural language processing, as well as time series forecasting and speech recognition.
5. Mini-Batch K-means clustering algorithm: Mini-Batch K-means is an improved version of the K-means clustering algorithm. Unlike standard K-means, Mini-Batch K-means updates the cluster centers with small random batches of samples, which speeds up convergence. The main idea is that at each iteration a small subset of samples (a mini-batch) is drawn from the dataset and the cluster centers are updated from the mean of those samples; this process constitutes one iteration. Compared with standard K-means, Mini-Batch K-means greatly reduces the number of iterations and the computation time, especially on large-scale datasets.
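For readers who want to see the algorithm in practice, the following is a minimal usage sketch with scikit-learn's MiniBatchKMeans; the data shape and the number of clusters are illustrative assumptions rather than values prescribed by the invention.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).standard_normal((10_000, 128))   # assumed: 10k feature vectors of dim 128
km = MiniBatchKMeans(n_clusters=256, batch_size=1024, random_state=0)
labels = km.fit_predict(X)          # cluster index assigned to each vector
centers = km.cluster_centers_       # the 256 learned cluster centers (a "codebook")
```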
The technical solution of the present invention is as follows:
An image-text retrieval method based on cross-modal semantic parsing, comprising:
The image-to-text and text-to-image query directions follow the same procedure, so the image-to-text direction is used as an example in the following description.
Image representation: understanding a given image and generating feature encodings of its salient regions;
Text representation: understanding a given text query and generating context-dependent discrete word encodings;
Using a self-attention mechanism to perform intra-modal feature fusion on the image and text representations;
Using the hash codes and quantization codes produced from the aggregated features to compute the cosine similarities of image-text pairs, screening out a top-ranked candidate set through two rounds of sorting, introducing a cross-modal attention mechanism to compute more accurate fine-grained matching scores over the candidate set, and applying similarity re-ranking to fine-tune the internal ranking, finally achieving high-performance cross-modal image-text retrieval.
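The toy sketch below is only meant to make the order of the stages concrete. It uses random features and a random sign projection as stand-ins for the learned encoders and hashing head, and plain cosine similarity in place of the quantized and attention-based scores, so none of its numerical choices come from the invention itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim, bits = 1000, 64, 32              # assumed corpus size, feature dim, hash length

# Toy aggregated features standing in for the learned global image/text vectors.
img_feat = rng.standard_normal((1, dim))
txt_feat = rng.standard_normal((N, dim))

proj = rng.standard_normal((dim, bits))  # toy stand-in for the learned hashing head
def hash_code(x):
    return np.sign(x @ proj)             # {-1, +1} codes, as produced by sgn

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a @ b.T).ravel()

# Stage 1: coarse filtering of all N texts by Hamming-style similarity over hash codes.
K1 = 100
hamming_sim = (hash_code(img_feat) @ hash_code(txt_feat).T).ravel()
cand1 = np.argsort(-hamming_sim)[:K1]

# Stage 2: finer filtering of the K1 candidates (here plain cosine; the asymmetric
# quantized variant is sketched later in this document).
K = 20
cand2 = cand1[np.argsort(-cosine(img_feat, txt_feat[cand1]))[:K]]

# Stage 3: stand-in for the cross-modal attention score over the K finalists,
# followed by the final ranking (re-ranking would adjust this order further).
final_rank = cand2[np.argsort(-cosine(img_feat, txt_feat[cand2]))]
print(final_rank[:10])
```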
As a further preferred solution, the learning process of the image and text representations includes:
For the i-th image, a Faster R-CNN with a ResNet-101 backbone, pre-trained on the Visual Genome dataset, is used to extract the corresponding region features; the final region features V = [v1, v2, ..., vn] are obtained through average pooling and a fully connected layer;
A bidirectional gated recurrent network with d-dimensional hidden states is used to obtain the forward and backward hidden-state information of the input text, and the average of the forward and backward hidden states of each word is taken as its feature representation, yielding the final word features T = [t1, t2, ..., tm];
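A minimal PyTorch sketch of this text-encoding step is shown below; the vocabulary size, embedding size, and hidden size are illustrative assumptions, and the averaging of the forward and backward GRU states follows the description above.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=300, hidden_dim=512):  # assumed sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # token_ids: (batch, m)
        h, _ = self.gru(self.embed(token_ids))    # h: (batch, m, 2 * hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)             # split forward / backward hidden states
        return (fwd + bwd) / 2                    # per-word features T = [t1, ..., tm]

encoder = TextEncoder()
words = torch.randint(0, 10_000, (2, 12))          # a toy batch of two 12-word sentences
T = encoder(words)                                 # (2, 12, 512)
```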
For visual feature aggregation, the region-average feature obtained in Equation (1) is used as the query; as shown in Equation (2), an attention mechanism learns the attention weights between the region fragments and the average feature, and the original features are weighted accordingly to obtain the global visual feature.
Similarly, the aggregated global text feature is obtained according to Equations (3) and (4).
Here Wvq, Wvk, Wvv, Wtq, Wtk, and Wtv are learnable projection matrices, and dk is the dimensionality of the hidden layer.
A multi-layer perceptron is then used to project the global features of the two modalities into the dimension space of the hash codes, with the tanh function as the activation to constrain the values to [-1, +1]; the aggregated image feature and text feature are given by Equations (5) and (6), respectively.
The sign function sgn discretizes the output vectors of the hash layer into integer hash codes consisting of -1 and +1, as expressed in Equations (7) and (8),
where k denotes the length of the generated hash codes bv and bt.
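Since the original formulas are not reproduced above, the following is a hedged reconstruction of what Equations (5)-(8) likely express; the symbols for the aggregated global image and text features and the generic MLP(·) are assumptions, not the patent's exact notation.

```latex
\begin{aligned}
h_v &= \tanh\!\big(\mathrm{MLP}(\tilde{v})\big) \in [-1,+1]^{k}, &
h_t &= \tanh\!\big(\mathrm{MLP}(\tilde{t})\big) \in [-1,+1]^{k}, \\
b_v &= \operatorname{sgn}(h_v) \in \{-1,+1\}^{k}, &
b_t &= \operatorname{sgn}(h_t) \in \{-1,+1\}^{k}.
\end{aligned}
```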
The relationships between samples are constrained by the triplet ranking loss function shown in Equation (9),
where the omitted symbols denote the hardest-to-distinguish, next-highest-ranked negative samples; γ is the margin between positive- and negative-class labels; Sr denotes the cosine similarity of the embedding vectors; and [a]+ denotes max(0, a).
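For reference, a hinge-based triplet ranking loss with the hardest negatives, which matches the description of Equation (9), can be written as below; the symbols for the matched pair and the hardest negative image and text are assumptions.

```latex
\mathcal{L}_{r}
  = \big[\gamma - S_r(I, E) + S_r(I, \hat{E})\big]_{+}
  + \big[\gamma - S_r(I, E) + S_r(\hat{I}, E)\big]_{+},
\qquad
\hat{E} = \arg\max_{E' \neq E} S_r(I, E'),
\quad
\hat{I} = \arg\max_{I' \neq I} S_r(I', E).
```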
The Mini-Batch K-means clustering algorithm is applied separately to the image-modality vectors and text-modality vectors output by the hash layer: each feature vector is split into an integer number of shorter sub-vectors, so that the entire high-dimensional vector space is partitioned into M low-dimensional subspaces; clustering is then performed within each subspace, and the resulting clusters form the codebook Cl. As in Equations (10) and (11), the cluster centers in the codebook replace the original features to give an approximate refined representation,
where the omitted indices denote the index of the l-th subspace for the image and the text, respectively, and the corresponding short codes are the quantization codes of the image and the text.
This is constrained by the quantization loss in Equation (12),
where D denotes the dataset samples, the omitted symbols denote the aggregated image and text vectors, C denotes the codebook, and the norm is the Frobenius norm.
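A hedged reading of Equation (12) as a standard codebook-reconstruction objective is given below; here z_i stands for the aggregated hash-layer vector of sample i and b_i for its codeword assignment, both assumed symbols rather than the patent's own notation.

```latex
\mathcal{L}_{q} \;=\; \sum_{i \in D} \big\lVert z_i - C\, b_i \big\rVert_F^{2}.
```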
As a further preferred solution, a two-stage screening is performed with the generated hash codes and quantization codes of the images and texts to obtain the candidate set, including:
Coarse hash screening: based on the generated hash codes, the Hamming similarity between the query image hash code bv and all text hash codes is computed directly, and the matching scores are sorted in descending order to obtain L rankings; the top K1 items are selected as the initial candidate set for subsequent fine-grained retrieval;
Fine quantization screening: based on the generated quantization code of the query image and the original vectors corresponding to the texts in the initial candidate set, an asymmetric similarity measure is computed to obtain quantized matching scores; after sorting, the top K texts with the highest priority among the K1 initial rankings are selected as the final candidate set.
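The sketch below illustrates both screening stages on toy data: Hamming-style scores over sign codes for the coarse stage, and an asymmetric comparison in which the query is reconstructed from its per-subspace codebook entries while the candidate texts keep their original vectors. All sizes, and the use of scikit-learn's MiniBatchKMeans to build the codebooks, are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
N, dim, M = 5000, 64, 8                      # assumed corpus size, dim, number of subspaces
txt_vec = rng.standard_normal((N, dim))      # stand-ins for the texts' hash-layer vectors
img_vec = rng.standard_normal((1, dim))      # stand-in for the query image's hash-layer vector

# --- Stage 1: coarse screening with Hamming-style similarity over sign codes ---
b_img, b_txt = np.sign(img_vec), np.sign(txt_vec)
K1 = 200
cand1 = np.argsort(-(b_img @ b_txt.T).ravel())[:K1]

# --- Stage 2: fine screening with an asymmetric similarity ---
# Quantize the query per subspace, then compare its codebook reconstruction
# against the candidates' original (unquantized) vectors.
sub = dim // M
recon = np.empty_like(img_vec)
for l in range(M):
    km = MiniBatchKMeans(n_clusters=64, batch_size=512, random_state=0)
    km.fit(txt_vec[:, l * sub:(l + 1) * sub])            # codebook C_l learned on the database side
    idx = km.predict(img_vec[:, l * sub:(l + 1) * sub])  # quantization code of the query
    recon[:, l * sub:(l + 1) * sub] = km.cluster_centers_[idx]

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a @ b.T).ravel()

K = 50
final_cand = cand1[np.argsort(-cosine(recon, txt_vec[cand1]))[:K]]
print(final_cand[:10])
```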
As a further preferred solution, a backbone network based on a cross-modal attention mechanism is introduced to compute over the final candidate set and obtain more accurate fine-grained matching scores, including:
Given the region features V = [v1, v2, ..., vn] of a query image I in the final candidate set and the word features T = [t1, t2, ..., tm] of a text E, the cosine similarity matrix of all region-word pairs is computed, i.e.:
sij = cosine(vi, tj), i ∈ [1, n], j ∈ [1, m]   (13)
where vi and tj denote the i-th image region feature and the j-th text word feature, respectively.
The similarity matrix of region-word pairs is normalized, and the attention score αij of the i-th region feature with respect to the j-th word feature is obtained, where λ1 is a smoothing coefficient.
According to the attention matrix, the weighted representation of the i-th region feature over all word features of the text is obtained; the cosine similarity between vi and this weighted representation is then computed, yielding the semantic similarity matrix between all regions and the text.
The semantic similarity matrix is average-pooled along the image-to-text direction to obtain the aggregated global matching score of image I and text E.
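Because Equations (14)-(18) are not reproduced above, the following is a hedged reconstruction in the standard stacked-cross-attention form that the description suggests; the intermediate symbols are assumptions.

```latex
\begin{aligned}
\bar{s}_{ij} &= \frac{[s_{ij}]_{+}}{\sqrt{\sum_{i=1}^{n}[s_{ij}]_{+}^{2}}}, \qquad
\alpha_{ij} = \frac{\exp(\lambda_{1}\bar{s}_{ij})}{\sum_{j'=1}^{m}\exp(\lambda_{1}\bar{s}_{ij'})}, \\
a_{i} &= \sum_{j=1}^{m}\alpha_{ij}\, t_{j}, \qquad
R_{i} = \operatorname{cosine}(v_{i}, a_{i}), \qquad
S(I, E) = \frac{1}{n}\sum_{i=1}^{n} R_{i}.
\end{aligned}
```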
The matching scores are sorted in descending order, and the initial refined result is given in Equation (19):
R(I, K) = {E1, E2, ..., Ej, ..., EK}   (19)
where R(I, K) is the initial ranking produced by the cross-modal attention mechanism, I denotes the query image, K denotes the number of selected top candidates, and Ej denotes a candidate text.
As a further preferred solution, bidirectional similarity re-ranking is used to fine-tune the internal ranking, finally achieving high-performance cross-modal image-text retrieval, including:
Re-ranking is performed on the top K2 texts of the initial refined ranking;
Based on the generated representations, the similarity score of each text to be re-ranked (i.e., each of the top K2 texts) with respect to the query image is computed;
The texts are re-ordered in ascending order of their re-assigned order indices to obtain the final retrieval result.
More preferably, the bidirectional similarity re-ranking includes:
For each of the top K2 result texts returned for the query image I, the matching scores against all N images are computed and sorted in descending order:
R(Ej, N) = {I1, I2, ..., IN}, j ∈ [1, K2]   (20)
The order index of each text Ej is re-assigned according to Equation (21):
p(Ej) = s, Is = I   (21)
i.e., s is the position at which the query image I appears in R(Ej, N). According to Equation (22), the texts are re-ordered internally in ascending order of their assigned indices, giving the final ranking result.
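A toy implementation of this re-ranking step is sketched below; the embeddings are random stand-ins and the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K2, dim = 500, 10, 64                      # assumed corpus and re-ranking sizes
img_emb = rng.standard_normal((N, dim))       # stand-ins for image embeddings
txt_emb = rng.standard_normal((N, dim))       # stand-ins for text embeddings
query_idx = 3                                 # index of the query image I
initial_rank = rng.choice(N, size=K2, replace=False)   # stand-in for the top-K2 texts R(I, K2)

def scores(a, B):                             # cosine similarity of one vector against a matrix
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

# For each candidate text E_j, rank all N images and record the position of the query image.
positions = []
for j in initial_rank:
    order = np.argsort(-scores(txt_emb[j], img_emb))             # R(E_j, N), descending
    positions.append(int(np.where(order == query_idx)[0][0]))    # p(E_j) = s with I_s = I

# Re-order the candidate texts by their assigned position index, smallest first.
final_rank = initial_rank[np.argsort(positions, kind="stable")]
print(final_rank)
```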
A computer device, comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the image-text retrieval method based on cross-modal semantic parsing.
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the image-text retrieval method based on cross-modal semantic parsing.
An image-text retrieval system based on cross-modal semantic parsing, comprising:
a multimodal information encoding module configured to: understand a given image and generate region feature encodings; and understand a given text query and generate context-dependent word feature encodings;
an intra-modal feature fusion module configured to: use a self-attention mechanism to perform intra-modal feature fusion on the image and text representations;
a double filtering module configured to: use the hash codes and quantization codes produced from the aggregated features to compute the cosine similarities of image-text pairs and screen out a top-ranked candidate set through two rounds of sorting;
a fine-grained computation and re-ranking module configured to: introduce a cross-modal attention mechanism to compute more accurate fine-grained matching scores over the candidate set, and use similarity re-ranking to fine-tune the internal ranking, finally achieving high-performance cross-modal image-text retrieval.
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention proposes an end-to-end two-stage filtering strategy. The framework divides the retrieval process into coarse screening followed by fine screening, distilling a highly relevant candidate set and greatly reducing the number of samples that must be computed; it directly addresses and improves on the retrieval efficiency problems that existing methods face in practical deployment.
2. The present invention proposes a method that combines a cross-modal attention mechanism with similarity re-ranking. Fine-grained computation over the filtered high-confidence candidate set yields results on par with the backbone attention method; a re-ranking algorithm that exploits the semantic complementarity of bidirectional retrieval then further improves retrieval performance.
3. The proposed method is a lightweight, plug-and-play framework that can be embedded into any existing backbone image-text retrieval framework and has the advantage of transferable model parameters. Extensive performance comparison experiments on open-source benchmark data demonstrate that the model provided by the present invention achieves accurate and efficient cross-modal image-text retrieval.
Brief Description of the Drawings
Figure 1 is a block diagram of an implementation of the image-text retrieval method based on cross-modal semantic parsing of the present invention;
Figure 2 is a flowchart of image and text representation generation in the present invention;
Figure 3 is a flowchart of the two-stage filtering in the present invention;
Figure 4 is a flowchart of the attention refinement computation and re-ranking in the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and embodiments, but is not limited thereto.
Embodiment 1
An image-text retrieval method based on cross-modal semantic parsing, comprising:
As shown in Figure 1, an Advanced Hierarchical Retrieval network (AHR) is constructed to achieve efficient and accurate cross-modal bidirectional image-text retrieval. The image-to-text and text-to-image query directions follow the same procedure, so the image-to-text direction is used as an example.
Image representation: understanding a given image and generating feature encodings of its salient regions;
Text representation: understanding a given text query and generating context-dependent discrete word encodings;
Using a self-attention mechanism to perform intra-modal feature fusion on the image and text representations;
Using the hash codes and quantization codes produced from the aggregated features to compute the cosine similarities of image-text pairs, screening out a top-ranked candidate set through two rounds of sorting, introducing a cross-modal attention mechanism to compute more accurate fine-grained matching scores over the candidate set, and applying similarity re-ranking to fine-tune the internal ranking, finally achieving high-performance cross-modal image-text retrieval.
Embodiment 2
The image-text retrieval method based on cross-modal semantic parsing according to Embodiment 1, differing in the following:
The learning process of the image and text representations, as shown in Figure 2, includes:
For the i-th image, a Faster R-CNN with a ResNet-101 backbone, pre-trained on the Visual Genome dataset, is used to extract the corresponding region features; the final region features V = [v1, v2, ..., vn] are obtained through average pooling and a fully connected layer;
A bidirectional gated recurrent network with d-dimensional hidden states is used to obtain the forward and backward hidden-state information of the input text, and the average of the forward and backward hidden states of each word is taken as its feature representation, yielding the final word features T = [t1, t2, ..., tm];
For visual feature aggregation, the region-average feature obtained in Equation (1) is used as the query; as shown in Equation (2), an attention mechanism learns the attention weights between the region fragments and the average feature, and the original features are weighted accordingly to obtain the global visual feature.
Similarly, the aggregated global text feature is obtained according to Equations (3) and (4).
Here Wvq, Wvk, Wvv, Wtq, Wtk, and Wtv are learnable projection matrices, and dk is the dimensionality of the hidden layer.
A multi-layer perceptron (MLP) is then used to project the global features of the two modalities into the dimension space of the hash codes, with the tanh function as the activation to constrain the values to [-1, +1]; the aggregated image feature and text feature are given by Equations (5) and (6), respectively.
The sign function sgn discretizes the output vectors of the hash layer into integer hash codes consisting of -1 and +1, as expressed in Equations (7) and (8),
where k denotes the length of the generated hash codes bv and bt.
To ensure that the generated hash codes preserve the coarse similarity relations of the heterogeneous data in the original sample space, the relationships between samples are constrained by the triplet ranking loss function shown in Equation (9),
where the omitted symbols denote the hardest-to-distinguish, next-highest-ranked negative samples; γ is the margin between positive- and negative-class labels; Sr denotes the cosine similarity of the embedding vectors; and [a]+ denotes max(0, a).
The Mini-Batch K-means clustering algorithm is applied separately to the image-modality vectors and text-modality vectors output by the hash layer: each feature vector is split into an integer number of shorter sub-vectors, so that the entire high-dimensional vector space is partitioned into M low-dimensional subspaces; clustering is then performed within each subspace, and the resulting clusters form the codebook Cl. As in Equations (10) and (11), the cluster centers in the codebook replace the original features to give an approximate refined representation,
where the omitted indices denote the index of the l-th subspace for the image and the text, respectively, and the corresponding short codes are the quantization codes of the image and the text.
To minimize the discrepancy between the quantization codes and the continuous feature vectors, the quantization loss in Equation (12) is used as a constraint,
where D denotes the dataset samples, the omitted symbols denote the aggregated image and text vectors, C denotes the codebook, and the norm is the Frobenius norm.
The generated hash codes and quantization codes of the images and texts are used to perform a two-stage screening and obtain the candidate set, as shown in Figure 3, including:
Coarse hash screening: in this embodiment, based on the generated hash codes, the Hamming similarity between the query image hash code bv and all text hash codes is computed directly, and the matching scores are sorted in descending order to obtain L rankings; since the hash codes preserve a certain degree of semantic relevance across modalities, only the top K1 items need to be selected as the initial candidate set for subsequent fine-grained retrieval;
Fine quantization screening: in this embodiment, based on the generated quantization code of the query image and the original vectors corresponding to the texts in the initial candidate set, an asymmetric similarity measure is computed to obtain quantized matching scores; after sorting, following a process similar to the first-stage hash screening, the top K texts with the highest priority among the K1 initial rankings are selected as the final candidate set.
In this method, as shown in Figure 4, a backbone network based on a cross-modal attention mechanism is introduced to compute over the final candidate set and obtain more accurate fine-grained matching scores, including:
Given the region features V = [v1, v2, ..., vn] of a query image I in the final candidate set and the word features T = [t1, t2, ..., tm] of a text E, the cosine similarity matrix of all region-word pairs is computed, i.e.:
sij = cosine(vi, tj), i ∈ [1, n], j ∈ [1, m]   (13)
where vi and tj denote the i-th image region feature and the j-th text word feature, respectively.
The similarity matrix of region-word pairs is normalized, and the attention score αij of the i-th region feature with respect to the j-th word feature is obtained, where λ1 is a smoothing coefficient.
According to the attention matrix, the weighted representation of the i-th region feature over all word features of the text is obtained; the cosine similarity between vi and this weighted representation is then computed, yielding the semantic similarity matrix between all regions and the text.
The semantic similarity matrix is average-pooled along the image-to-text direction to obtain the aggregated global matching score of image I and text E.
The matching scores are sorted in descending order, and the initial refined result is given in Equation (19):
R(I, K) = {E1, E2, ..., Ej, ..., EK}   (19)
where R(I, K) is the initial ranking produced by the cross-modal attention mechanism, I denotes the query image, K denotes the number of selected top candidates, and Ej denotes a candidate text.
In this method, bidirectional similarity re-ranking is used to fine-tune the internal ranking, finally achieving high-performance cross-modal image-text retrieval, including:
Re-ranking is performed on the top K2 texts of the initial refined ranking;
Based on the generated representations, the similarity score of each text to be re-ranked (i.e., each of the top K2 texts) with respect to the query image is computed;
The texts are re-ordered in ascending order of their re-assigned order indices to obtain the final retrieval result.
The bidirectional similarity re-ranking includes:
For each of the top K2 result texts returned for the query image I, the matching scores against all N images are computed and sorted in descending order:
R(Ej, N) = {I1, I2, ..., IN}, j ∈ [1, K2]   (20)
The order index of each text Ej is re-assigned according to Equation (21):
p(Ej) = s, Is = I   (21)
i.e., s is the position at which the query image I appears in R(Ej, N). According to Equation (22), the texts are re-ordered internally in ascending order of their assigned indices, giving the final ranking result.
In this method, an optimization function is used to solve for the parameters of the hash encoding network and the cross-modal attention backbone network; the optimizer is the Adam optimizer in PyTorch.
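A minimal example of the optimizer setup mentioned here is given below, with ordinary nn.Module stand-ins for the two sub-networks; the layer sizes and learning rate are illustrative assumptions, not values specified by the invention.

```python
import itertools
import torch
import torch.nn as nn

# Stand-ins for the two sub-networks; the real architectures follow the description above.
hash_net = nn.Sequential(nn.Linear(1024, 512), nn.Tanh())
attn_net = nn.MultiheadAttention(embed_dim=512, num_heads=8)

optimizer = torch.optim.Adam(
    itertools.chain(hash_net.parameters(), attn_net.parameters()),
    lr=2e-4,                                  # assumed learning rate
)

# One illustrative training step: compute a placeholder loss, backpropagate, update.
out = hash_net(torch.randn(4, 1024))
loss = out.pow(2).mean()                      # placeholder for the combined loss (L_r + L_q)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```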
Table 1 introduces the comparison models used in this embodiment.
Table 1
Table 2 compares the retrieval accuracy and efficiency of the present invention.
Table 2
A retrieval-performance comparison with internationally leading image-text matching models, as shown in Table 2, indicates that the retrieval accuracy and efficiency after applying the AHR framework are significantly superior to those of the original backbone models.
Embodiment 3
A computer device, comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the image-text retrieval method based on cross-modal semantic parsing described in Embodiment 1 or 2.
Embodiment 4
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the image-text retrieval method based on cross-modal semantic parsing described in Embodiment 1 or 2.
Embodiment 5
An image-text retrieval system based on cross-modal semantic parsing, comprising:
a multimodal information encoding module configured to: understand a given image and generate region feature encodings; understand a given text query and generate context-dependent word feature encodings; and perform targeted semantic understanding and feature representation for images and texts separately, so as to obtain short-code aggregated representations of the two modalities in the hash and quantization spaces;
an intra-modal feature fusion module configured to: use a self-attention mechanism to perform intra-modal feature fusion on the image and text representations;
a double filtering module configured to: use the hash codes and quantization codes produced from the aggregated features to compute the cosine similarities of image-text pairs and screen out a top-ranked candidate set through two rounds of sorting; with simple inner-product computations over the hash codes and quantization codes, high-quality candidate sets can be screened efficiently from the redundant original dataset, greatly reducing computation, improving model speed, and lowering memory usage;
a fine-grained computation and re-ranking module configured to: introduce a cross-modal attention mechanism to compute more accurate fine-grained matching scores over the candidate set, and use similarity re-ranking to fine-tune the internal ranking, finally achieving high-performance cross-modal image-text retrieval. Based on an existing cross-modal attention backbone, more accurate matching scores are obtained for the filtered candidate set; considering the data asymmetry in unimodal retrieval, similarity re-ranking built on the consistency of bidirectional query responses fine-tunes the results and significantly improves performance.