
A cross-modal matching method and device based on knowledge enhancement

Info

Publication number
CN118427631A
Authority
CN
China
Prior art keywords
image
features
text
semantic
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410513675.XA
Other languages
Chinese (zh)
Inventor
刘安安
杨龙
李文辉
王岚君
田宏硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202410513675.XA
Publication of CN118427631A
Legal status: Pending (Current)


Abstract

The invention discloses a cross-modal matching method and device based on knowledge enhancement. The method comprises the following steps: encoding the input image features and text features with a multi-head attention mechanism, based on exogenous knowledge information from multimodal clustering, to obtain encoded image features and encoded text features; acquiring regional semantic knowledge information based on multimodal aggregation, by aggregating image region features with label information under the guidance of that label information and interacting the aggregated features with the encoded text features through a graph convolutional network with multi-step reasoning; enhancing the encoded image features and encoded text features through a gating mechanism, based on the multimodal-aggregated regional semantic knowledge information, to obtain enhanced image features and enhanced text features; and applying adaptive joint reasoning with both global and local alignment to the enhanced image features and enhanced text features, thereby realizing cross-modal matching of image-text pairs. The device comprises a processor and a memory.

Description

Translated from Chinese
A cross-modal matching method and device based on knowledge enhancement

Technical Field

The present invention relates to the field of cross-modal matching technology, and in particular to a cross-modal matching method and device based on knowledge enhancement.

Background Art

With the continuous development of information technology, a large amount of heterogeneous multimodal data is generated on social media networks, so cross-modal learning has become increasingly important. Among these multimodal data, vision and language are the two main ways in which people obtain information, and interactive learning between the two has received widespread attention. As the basis of vision-language interaction, image-text matching is widely used and supports various challenging cross-modal semantic interaction tasks, such as visual question answering, visual language navigation, and multimedia understanding. Image-text matching is a research field at the intersection of computer vision and natural language processing. Its purpose is to establish semantic associations between images and texts and to compute their similarity, retrieving the most relevant text or image for a given query image or text; this requires understanding complex visual semantics and textual information at the same time. However, the semantic gap caused by the data heterogeneity between images and texts means that accurately learning the semantic correspondence between heterogeneous modalities remains the key challenge in the image-text matching task.

Traditional cross-modal image-text matching methods [1,2] build matching models with statistical analysis and machine learning techniques such as subspace learning and topic models. In recent years, with the development of deep learning and artificial intelligence, more and more researchers have used deep learning methods, such as attention networks and graph neural networks [3,4], to build cross-modal image-text matching models. However, these methods only consider feature interactions that treat images and texts as equivalent, ignoring the abstract semantics that images carry relative to texts, and thus have difficulty bridging the semantic gap between heterogeneous modalities. Aggregating image regions with text information through multimodal knowledge clustering and fusion makes the semantic expression of visual information more concrete and thereby improves the accuracy of cross-modal image-text matching.

Images and text are basic means of describing objects and are often combined to provide a comprehensive description of real-world scenes. An image (or an image region) contains many kinds of information, which makes its referent ambiguous; the semantic information it represents is relatively abstract compared with text.

Therefore, there is an urgent need for a method that addresses the abstract semantics of images, weakens the semantic differences between heterogeneous modalities, and achieves accurate cross-modal matching of images and texts.

Summary of the Invention

The present invention provides a cross-modal matching method and device based on knowledge enhancement. The present invention fully learns the semantic alignment and interaction of images and texts in a joint embedding space and improves the accuracy of cross-modal image-text matching, as described in detail below:

In a first aspect, a cross-modal matching method based on knowledge enhancement comprises:

based on exogenous knowledge information from multimodal clustering, encoding the input image features V and text features B with a multi-head attention mechanism to obtain encoded image features and encoded text features;

acquiring regional semantic knowledge information based on multimodal aggregation: under the guidance of the label information t, aggregating the image region features with the label information t, and interacting the aggregated features G with the encoded text features through a graph convolutional network with multi-step reasoning to obtain the regional semantic knowledge information based on multimodal aggregation;

based on the multimodal-aggregated regional semantic knowledge information, enhancing the encoded image features and the encoded text features through a gating mechanism to obtain enhanced image features V* and enhanced text features B*;

applying adaptive joint reasoning with both global and local alignment to the enhanced image features V* and the enhanced text features B* to realize cross-modal matching of image-text pairs.

Wherein, the exogenous knowledge information based on multimodal clustering is obtained as follows:

first performing one clustering of the exogenous image features in the image space; under the guidance of the exogenous text features from WordNet, fusing the abstract semantics in the image space with the semantics in the text space through similarity judgment; and then performing a second clustering in the joint image-text embedding space.

Wherein, the exogenous knowledge information of multimodal clustering includes:

reversely classifying the nouns into the k image semantic centers through similarity discrimination, where the probability that noun Ti belongs to the l-th image semantic center is:

where sim is computed as cosine similarity; for each image semantic center the top-γ scoring nouns are selected, and noun Ti being selected for the l-th image semantic center is equivalent to:

where the threshold corresponds to the γ-th largest confidence among the nouns belonging to the l-th image semantic center; the set of nouns selected by image semantic center Sl is fixed accordingly; based on the nouns selected for each image semantic center, the feature of each image region mapped into the joint embedding space is calculated as:

where the nouns are those selected by the image semantic center Sl of the class containing image region Oi and τ is a margin that controls the aggregation; the image region features after noun-text semantic enhancement are thereby obtained, where:

by re-applying the k-means clustering algorithm to the fused image region features, the exogenous knowledge based on multimodal clustering is obtained, namely:

Mk = [m1, m2, …, mi, …, mk]

where mi is the semantic center of the image region features of the i-th cluster after the algorithm converges.

Wherein, aggregating the image region features with the label information t under the guidance of the label information t is specifically:

the i-th region being classified to the l-th label information is equivalent to:

where the threshold corresponds to the maximum confidence among the region features belonging to the l-th label information; the set of region features selected by the image label information ti is fixed accordingly, and the selected image regions are aggregated to obtain:

the aggregated image region feature corresponding to the image label information ti; the image region-label fusion information is then expressed as G = [g1, g2, …, gi, …, gn], where gi is obtained by concatenating the label information with the aggregated image region feature.

Wherein, enhancing the encoded image features and the encoded text features through the gating mechanism to obtain the enhanced image features V* and the enhanced text features B* is specifically:

adding the obtained regional semantic knowledge, through the gating mechanism, to the region features of the enhanced image and text respectively:

where the outputs are the enhanced region features of the image and text and the gate parameters are learnable weight matrices; the image and text features after the two-step multimodal knowledge enhancement are V* and B*.

In a second aspect, a cross-modal matching device based on knowledge enhancement comprises a processor and a memory, wherein program instructions are stored in the memory, and the processor calls the program instructions stored in the memory to cause the device to execute the method of any one of the implementations of the first aspect.

In a third aspect, a computer-readable storage medium stores a computer program, wherein the computer program includes program instructions, and when the program instructions are executed by a processor, the processor executes the method of any one of the implementations of the first aspect.

The beneficial effects of the technical solution provided by the present invention are:

1. The present invention obtains multimodal data such as exogenous knowledge information based on multimodal clustering and image region-label fusion information based on multimodal aggregation, and fuses image region features with word text features under the guidance of noun text information and image label information respectively; while clarifying the abstract visual referent, it uses the joint embedding space to focus on the expression of the same semantics in different modalities and to mine the semantic commonalities across heterogeneous modalities.

2. The present invention uses these multimodal data to enhance the semantic features of image-text pairs from two perspectives, outside and inside the image: the exogenous knowledge based on multimodal clustering tends, through the shared features of clustering, to enhance semantic information that appears frequently in image-text pairs, while the image region-label fusion information obtained from object detection enhances rare semantics through the label information. The joint semantic enhancement from both kinds of multimodal knowledge makes the construction of the joint semantic co-embedding space more complete, bridges the semantic gap between heterogeneous modalities, and strengthens the association and interaction between modalities.

Therefore, the present invention can fully learn the semantic alignment and interaction of images and texts in the joint embedding space, thereby improving the accuracy of cross-modal image-text matching.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of a cross-modal matching method based on knowledge enhancement;

FIG. 2 is a schematic diagram of the operation of a cross-modal matching method based on knowledge enhancement.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In order to make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.

Embodiment 1

A cross-modal matching method based on knowledge enhancement, as shown in FIG. 1, includes the following steps:

Step 101: Obtain the image features V to be matched, the text features B, the exogenous image features O, and the exogenous text features T;

Step 102: Obtain the exogenous knowledge information Mk based on multimodal clustering;

First, the exogenous image features O are clustered in the image space; then, guided by the exogenous text features T from WordNet, the abstract semantics in the image space are fused with the more concrete semantics in the text space through similarity judgment; finally, a second clustering is performed in the joint image-text embedding space to obtain the exogenous knowledge information Mk based on multimodal clustering.

Step 103: Using the exogenous knowledge information Mk based on multimodal clustering, encode the input image features V and text features B with a multi-head attention mechanism to obtain encoded image features and encoded text features, so as to enhance the common semantic information that appears frequently in image-text pairs;

Step 104: Obtain the regional semantic knowledge information based on multimodal aggregation: under the guidance of the label information t, aggregate the image region features with the label information t, and interact the aggregated features G with the encoded text features through a graph convolutional network with multi-step reasoning to obtain the regional semantic knowledge information based on multimodal aggregation;

Step 105: Using the regional semantic knowledge information based on multimodal aggregation, enhance the encoded image features and the encoded text features through a gating mechanism to obtain enhanced image features V* and enhanced text features B*, so that the model can also attend to rare semantic features and the construction of the joint semantic co-embedding space becomes more complete;

Step 106: Apply adaptive joint reasoning with both global and local alignment to the enhanced image features V* and the enhanced text features B*, attending to detailed features while avoiding the introduction of redundant information, thereby achieving correct cross-modal matching of image-text pairs.

In summary, through the above steps 101-106 the embodiment of the present invention fuses the content of image regions with word text, making the abstract semantics that the image refers to more specific and explicit; the embodiment fully learns the semantic alignment and interaction of images and texts in the joint embedding space and improves the accuracy of cross-modal image-text matching.

Embodiment 2

The scheme of Embodiment 1 is further described below with specific calculation formulas and examples, as detailed in the following:

201: Obtain the image features V to be matched, the text features B, the exogenous image features O, and the exogenous text features T;

For the input image to be matched:

F = FastRCNN(image)

vi = MLP(fi) + FC(fi)

First, the pre-trained Fast-RCNN is used for object detection on image regions, and the 36 regions with the highest confidence are selected. Then, ResNet101 pre-trained on ImageNet is used for feature extraction to obtain 2048-dimensional feature vectors. The extracted feature vectors are denoted F = [f1, f2, …, fn], where n = 36 is the number of image region features. Finally, the image features are fine-tuned through a fully connected layer and a multi-layer perceptron with a residual connection, mapping them into a 1024-dimensional common embedding space to obtain the final feature representation of the input image to be matched, V = [v1, v2, …, vn], where vi represents a region feature of the image. For the input text to be matched:

C = Bert(text)

bi = FC(ci)

The embodiment of the present invention uses the sequence model BERT to process the text and extract the feature of each word, obtaining C = [c1, c2, …, cm], where m is the number of words in each text and ci is a 768-dimensional word vector. The word vectors are then mapped into the 1024-dimensional common embedding space through a fully connected layer, giving the final feature representation of the input text to be matched, B = [b1, b2, …, bm], where bi is a word feature vector.
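
A minimal sketch of this feature preparation is given below, assuming pre-extracted detector and BERT outputs (the random tensors merely stand in for them); the module names RegionProjector and WordProjector and the exact layer shapes are illustrative assumptions, not the embodiment's actual implementation.

```python
import torch
import torch.nn as nn

class RegionProjector(nn.Module):
    """Maps 2048-d detector region features to a 1024-d joint space (FC plus residual-style MLP)."""
    def __init__(self, in_dim: int = 2048, embed_dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, f):                 # f: (batch, 36, 2048)
        return self.fc(f) + self.mlp(f)   # vi = MLP(fi) + FC(fi), (batch, 36, 1024)

class WordProjector(nn.Module):
    """Maps 768-d BERT token features to the same 1024-d joint space."""
    def __init__(self, in_dim: int = 768, embed_dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, embed_dim)

    def forward(self, c):                 # c: (batch, m, 768)
        return self.fc(c)                 # bi = FC(ci)

# toy usage with random stand-ins for Fast-RCNN / BERT outputs
f = torch.randn(2, 36, 2048)
c = torch.randn(2, 20, 768)
V = RegionProjector()(f)                  # image region features V
B = WordProjector()(c)                    # word features B
print(V.shape, B.shape)                   # torch.Size([2, 36, 1024]) torch.Size([2, 20, 1024])
```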

For the exogenous datasets to be introduced, in order to avoid data leakage from the validation set, the embodiment of the present invention selects the image data of Visual Genome and the text data of WordNet. The pre-trained image encoder Bottom-Up and Top-Down (BUTD) is used to encode the Visual Genome images, obtaining the exogenous image region features O = [O1, O2, …, ON]; similarly, the pre-trained text encoder BERT is used to encode the noun set of WordNet, obtaining the exogenous text features T = [T1, T2, …, TM], where N and M are the numbers of the corresponding samples.

202: Obtain the exogenous knowledge information Mk based on multimodal clustering;

First, the exogenous image features O are clustered in the image space; then, guided by the exogenous text features T from WordNet, the abstract semantics in the image space are fused with the more concrete semantics in the text space through similarity judgment; finally, a second clustering is performed in the joint image-text embedding space to obtain the exogenous knowledge information Mk based on multimodal clustering.

The embodiment of the present invention clusters the exogenous image region set O = [O1, O2, …, ON]. Since the sample data are large enough in scale and the similarity between samples can be described by Euclidean distance, the k-means algorithm is chosen to obtain better clustering results. Image semantics of different granularities can be captured with different values of k in the k-means algorithm: a smaller k corresponds to coarse-grained semantics, whose precision may not be sufficient to cover the image semantics at cluster boundaries, whereas a larger k produces fine-grained semantics, which may fail to distinguish different region features. To find image semantics of appropriate granularity for the large Visual Genome dataset, the embodiment of the present invention determined experimentally that a cluster of about 300 image regions is sufficiently compact, that is:

k = N/300

The clustering process is as follows:

where μj (j ∈ [1, k]) denotes the centroid of each class in the k-means clustering. For each image region sample Oi, the distance to each centroid μj is computed, and Oi is assigned to the class of its nearest centroid μj; the indicator equals 1 when Oi belongs to the j-th cluster and 0 otherwise. The centroid of each class is recomputed repeatedly according to the above formula until the clustering algorithm converges:

where the converged value denotes the centroid of the class containing the current image region Oi. After clustering is completed, the embodiment of the present invention defines the image semantic center of each class as:

为了使图像语义更加具体,弥合图像与文本之间的语义差异。本发明实施例利用WordNet数据集中的名词语义T=[T1,T2…,TM]来描述聚类后的图像区域,即通过相似度判别将名词反向分类到k个图像语义中心当中。具体来说,名词Ti属于第l个图像语义中心的概率为:In order to make the image semantics more specific and bridge the semantic gap between image and text, the present invention uses the noun semantics T = [T1 , T2 …, TM ] in the WordNet dataset to describe the clustered image region, that is, the nouns are reversely classified into k image semantic centers through similarity discrimination. Specifically, the probability that nounTi belongs to the lth image semantic center is:

其中,sim的计算方式为余弦相似度,为了利用具有高度代表性和可区分性的名词,而舍弃那些语义无关的名词,充分挖掘图文异构模态间的语义共性。本发明实施例为每个图像语义中心选择前γ个得分排名的名词,形式上,名词Ti将被选为第l个图像语义中心等价于:Among them, sim is calculated by cosine similarity. In order to utilize nouns with high representativeness and distinguishability, those semantically irrelevant nouns are discarded, and the semantic commonality between heterogeneous image and text modalities is fully explored. The embodiment of the present invention selects the first γ scoring nouns for each image semantic center. Formally, nounTi will be selected as the lth image semantic center, which is equivalent to:

其中,对应于属于第l个图像语义中心的名词的γ-th最大置信度。本发明实施例设定图像语义中心Sl所选定的名词集合为根据每个图像语义中心选定的名词,每个图像区域映射到联合嵌入空间的特征计算为:in, The γ-th maximum confidence of the noun belonging to the l-th image semantic center. The embodiment of the present invention sets the noun set selected by the image semantic center Sl as According to the noun selected by the semantic center of each image, the feature of each image region mapped to the joint embedding space is calculated as:

其中,为图像区域Oi所在类的图像语义中心Sl所选定的名词,τ为控制聚合的裕度,其设计是为了防止不同图像的联合空间嵌入退化到同一点。则经过名词文本语义增强后的图像区域特征为其中:in, is the noun selected by the image semantic center Sl of the class where the image region Oi belongs, and τ is the margin for controlling aggregation, which is designed to prevent the joint spatial embedding of different images from degenerating to the same point. The image region feature after noun text semantic enhancement is in:

在联合嵌入空间构建完成之后,本发明实施例通过对融合的图像区域特征为重新应用k-means聚类算法,即可得到基于多模态聚类的外源知识Mk,用以增强抽象图像语义,挖掘异构模态语义共性:After the joint embedding space is constructed, the embodiment of the present invention performs the fusion of the image region features as follows: By reapplying the k-means clustering algorithm, we can obtain the exogenous knowledge Mk based on multimodal clustering to enhance the abstract image semantics and mine the common semantics of heterogeneous modalities:

Mk=[m1,m2,mi,…,mk]Mk =[m1 ,m2 ,mi ,…,mk ]

其中,mi为算法收敛后图像区域特征的语义中心。Among them, miis the semantic center of the image region feature after the algorithm converges.
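
The two-step multimodal clustering above might look like the following sketch. It is an assumption-laden outline rather than the embodiment's implementation: cosine similarity is taken as a normalized dot product, the margin-controlled fusion whose formula is not reproduced here is replaced by a plain average of the region feature and its selected noun features, and the helper name multimodal_cluster_knowledge is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def multimodal_cluster_knowledge(O, T, gamma=5, regions_per_cluster=300, seed=0):
    """Two-step clustering sketch: k-means on region features, noun-guided fusion, re-clustering.
    O: (N, d) exogenous image region features; T: (M, d) noun text features in the same joint dim."""
    N = len(O)
    k = max(1, N // regions_per_cluster)          # cluster-size heuristic from the text (k = N/300)

    # Step 1: cluster the image regions; take each converged centroid as an image semantic center S_l.
    km1 = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(O)
    S = km1.cluster_centers_                      # (k, d) image semantic centers
    labels = km1.labels_                          # cluster id of each region

    # Assign nouns to centers by cosine similarity and keep the top-gamma nouns per center.
    sim = normalize(T) @ normalize(S).T           # (M, k) cosine similarities
    top_nouns = {l: np.argsort(-sim[:, l])[:gamma] for l in range(k)}

    # Fuse each region with the nouns selected for its center (plain average as a stand-in
    # for the margin-controlled aggregation described in the text).
    O_fused = np.stack([
        0.5 * O[i] + 0.5 * T[top_nouns[labels[i]]].mean(axis=0) for i in range(N)
    ])

    # Step 2: re-cluster the fused features; the converged centers form M_k.
    km2 = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(O_fused)
    return km2.cluster_centers_                   # M_k = [m_1, ..., m_k]

# toy usage with random features standing in for BUTD / BERT outputs
rng = np.random.default_rng(0)
Mk = multimodal_cluster_knowledge(rng.normal(size=(900, 64)), rng.normal(size=(50, 64)))
print(Mk.shape)                                   # (3, 64)
```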

203: Using the exogenous knowledge information Mk based on multimodal clustering, encode the input image features V and text features B with a multi-head attention mechanism to obtain the encoded image and text features, so as to enhance the common semantic information that appears frequently in image-text pairs;

The embodiment of the present invention uses the semantic-center features Mk of the re-clustered image regions to enhance the features of the image-text pair to be matched; this embedding enhancement helps inject multimodal knowledge into the output embeddings. Specifically, a multi-head attention mechanism is adopted, and Mk is used to encode the global/local embeddings of the input image-text pair, as shown below:

MultiHead(X, Y) = Concat(h1, h2, …, hi, …, hH) + X

where X is the region feature vi or bi of the image or text, Y = Mk, Concat(.) denotes concatenation along the feature dimension, and H is the number of attention heads; the scaled dot-product attention Att(.) is used to compute hi as follows:

where dk is the number of channels of Q and K, and WiQ, WiV are learnable matrices. The final enhanced features are expressed as:

where FFN(.) denotes a feed-forward network implemented by a two-layer MLP with a ReLU activation in between, applied to the image region features and the text features respectively. The exogenous knowledge based on multimodal clustering can bridge the abstract semantics of images relative to text, and the semantic embeddings of the two modalities are enhanced through the multi-head attention mechanism, so this enhancement helps the model learn the semantic interaction between the heterogeneous text and image modalities and achieve better image-text matching performance. The image features and text features obtained in this step serve as the input features for the multimodal-aggregation-based knowledge enhancement in step 205, further strengthening the representation of semantic features.
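
A rough sketch of this multi-head-attention enhancement, under stated assumptions, is shown below: the standard PyTorch attention module stands in for the patent's own formulation, and the second residual connection around the FFN is assumed rather than confirmed by this text.

```python
import torch
import torch.nn as nn

class KnowledgeEnhancer(nn.Module):
    """Sketch of the enhancement step: region/word features attend to the clustered knowledge M_k."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, X, Mk):
        # X: (batch, n, dim) image or text features; Mk: (k, dim) exogenous knowledge centers
        mem = Mk.unsqueeze(0).expand(X.size(0), -1, -1)
        attended, _ = self.attn(query=X, key=mem, value=mem)   # MultiHead(X, M_k)
        h = attended + X                                       # residual connection, as in the text
        return self.ffn(h) + h                                 # FFN with an assumed second residual

enhancer = KnowledgeEnhancer()
V = torch.randn(2, 36, 1024)
Mk = torch.randn(64, 1024)
V_hat = enhancer(V, Mk)          # encoded image features; the same module can encode text features B
print(V_hat.shape)               # torch.Size([2, 36, 1024])
```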

204: Obtain the regional semantic knowledge information based on multimodal aggregation: under the guidance of the label information t, aggregate the image region features with the label information t; use the aggregated features G to interact with the text features through a graph convolutional network with multi-step reasoning to obtain the regional semantic knowledge information based on multimodal aggregation;

For the acquisition of the multimodal image region-label fusion information, the embodiment of the present invention takes a single image-text pair as an example. The label information from object detection is encoded by BERT into features t = [t1, t2, …, tn], and the enhanced region features of the image are used together with them. The embodiment replaces the corresponding image region feature with the label information ti, and then uses the remaining region features to describe each label information ti, i.e., all remaining region features are reversely classified into the n object-detection label features ti. Specifically, the probability that the i-th image region belongs to the j-th label semantic information is:

where sim is computed as cosine similarity. In order to exploit image regions with high semantic relevance and distinguishability while discarding regions semantically irrelevant to the label information, the embodiment of the present invention selects the top-γ scoring image region features for each label feature. Formally, the i-th region being classified to the l-th label information is equivalent to:

where the threshold corresponds to the maximum confidence among the region features belonging to the l-th label information. The set of region features selected by the image label information ti is fixed accordingly, and the selected image regions are aggregated to obtain:

the aggregated image region feature corresponding to the image label information ti. Unlike the exogenous knowledge based on multimodal clustering, here the embodiment focuses on the multimodal fusion guided by the label information inside the image. The image semantic fusion information is then expressed as G = [g1, g2, …, gi, …, gn], where gi is obtained by concatenating the label information with the aggregated image region feature.
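
The label-guided aggregation might be sketched as follows. This is only an illustration under assumptions: mean pooling stands in for the aggregation formula that is not reproduced here, top-γ selection uses plain cosine similarity, and the helper name label_guided_fusion is hypothetical.

```python
import torch
import torch.nn.functional as F

def label_guided_fusion(V_hat, t, gamma: int = 3):
    """Sketch of the intra-image aggregation. V_hat: (n, d) enhanced region features,
    t: (n, d) BERT-encoded detection-label features; returns G: (n, 2d)."""
    sim = F.cosine_similarity(V_hat.unsqueeze(1), t.unsqueeze(0), dim=-1)   # (n regions, n labels)
    G = []
    for l in range(t.size(0)):
        top = sim[:, l].topk(gamma).indices          # regions most related to label l
        g_hat = V_hat[top].mean(dim=0)               # aggregated region feature for this label
        G.append(torch.cat([t[l], g_hat], dim=-1))   # g_l = [label feature ; aggregated feature]
    return torch.stack(G)                            # (n, 2d)

G = label_guided_fusion(torch.randn(36, 1024), torch.randn(36, 1024))
print(G.shape)                                       # torch.Size([36, 2048])
```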

After obtaining all the semantic fusion information G fused with the region features, the embodiment of the present invention lets it interact with the input text features to be matched through a graph convolutional network.

First comes the intra-modal graph convolutional reasoning:

where Ai and At denote the adjacency matrices computed from the cosine similarity between features, Wi(l-1) and Wt(l-1) are learnable matrices together with learnable bias matrices, σ(.) is the LeakyReLU activation function, l denotes the current layer of intra-modal graph convolutional semantic reasoning, and ls denotes the total number of such layers.

Next, the embodiment of the present invention reasons over the whole multimodal semantic interaction, performing inter-modal semantic reasoning on the entire graph and mining similar graph-structure relations between the different modalities. The process is as follows:

where A denotes the adjacency matrix that connects the two modalities according to semantic similarity, W(l) and C(l) are learnable matrices, l2 is the l2-norm function, || denotes the concatenation of feature vectors, and G(l) and B(l) are the final reasoning results.
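
A compact sketch of this graph-convolutional reasoning, with several simplifying assumptions, is given below: the adjacency is a row-normalized cosine-similarity matrix, the layer count is arbitrary, and the inter-modal step simply runs one shared layer over the concatenated node set; the exact adjacency construction, normalization, and parameterization in the embodiment may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimGraphConv(nn.Module):
    """One graph-convolution layer whose adjacency is a row-normalized cosine-similarity matrix."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=True)

    def forward(self, X):                               # X: (num_nodes, dim)
        Xn = F.normalize(X, dim=-1)
        A = F.softmax(Xn @ Xn.t(), dim=-1)              # similarity-based adjacency
        return F.leaky_relu(self.W(A @ X)) + X          # propagate, transform, residual

dim = 1024
intra_g, intra_b, cross = SimGraphConv(dim), SimGraphConv(dim), SimGraphConv(dim)

G = torch.randn(36, dim)                                # fused region-label nodes
B_hat = torch.randn(20, dim)                            # encoded word nodes

# multi-step reasoning: a few intra-modal steps, then joint reasoning over the whole graph
for _ in range(2):
    G, B_hat = intra_g(G), intra_b(B_hat)
joint = cross(torch.cat([G, B_hat], dim=0))             # inter-modal reasoning over all nodes
G_out, B_out = joint[:36], joint[36:]                   # split back into the two modalities
print(G_out.shape, B_out.shape)
```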

205: Using the regional semantic knowledge information based on multimodal aggregation, enhance the image features and text features through a gating mechanism to obtain V* and B*; rare semantic features can be enhanced at the same time, making the construction of the joint semantic co-embedding space more complete;

The regional semantic knowledge obtained through the multi-step reasoning is added, through the gating mechanism, to the region features of the image and text enhanced in step 203:

where the outputs are the enhanced region features of the image and text and the gate parameters are learnable weight matrices. The image and text features after the two-step multimodal knowledge enhancement are V* and B*.
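
The gating step might look like the sketch below. The concrete gate formula is not reproduced in this text, so a sigmoid gate over the concatenation of the encoded feature and the reasoned knowledge is assumed here.

```python
import torch
import torch.nn as nn

class KnowledgeGate(nn.Module):
    """Sketch of the gating step: blend the reasoned knowledge K into the encoded features X."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.Wg = nn.Linear(2 * dim, dim)     # learnable gate weights (assumed form)

    def forward(self, X, K):
        # X: (n, dim) encoded image/text features; K: (n, dim) regional semantic knowledge
        gate = torch.sigmoid(self.Wg(torch.cat([X, K], dim=-1)))
        return X + gate * K                   # gated residual injection -> V* or B*

gate = KnowledgeGate()
V_star = gate(torch.randn(36, 1024), torch.randn(36, 1024))
print(V_star.shape)                           # torch.Size([36, 1024])
```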

206: Apply adaptive joint reasoning with both global and local alignment to V* and B*, attending to detailed features while avoiding the introduction of redundant information, thereby achieving correct cross-modal matching of image-text pairs.

For the image-text matching task, simply using a global or a local paradigm cannot handle the variety of semantic scenes well. For simple scenes, the local paradigm introduces redundant information into the construction of image and text features; for complex scenes, the global paradigm loses the detailed features of images and texts. Therefore, the embodiment of the present invention uses both the global and the local paradigm for adaptive joint reasoning of cross-modal image-text matching.

For global alignment, the embodiment of the present invention computes:

Sg = Sima(agg(V*), agg(B*))

where agg(.) is the aggregation operation over region features; here the embodiment chooses GPO [5], which can learn an optimal pooling strategy that automatically adapts to different features. The global similarity is calculated as follows:

where T = agg(B*) is the aggregated global text feature and I = agg(V*) is the aggregated global image feature. This similarity calculation differs from the cosine similarity commonly used in image-text matching methods in that the similarity result becomes a vector instead of a scalar, which facilitates the subsequent joint reasoning. The similarity for local alignment is calculated as follows:

When performing fine-grained semantic alignment, the embodiment of the present invention treats the similarity between a word and its most salient cross-modal fragment as the similarity between the word and the entire image, ignoring all other cross-modal fragments, and sums the similarities of all words with the image as the final local-alignment similarity. The final similarity score of an image-text pair is then learned from the two vectors of the global alignment and the local alignment, so as to automatically adapt to image-text pairs of different semantic scene complexity:

where Wg and Wl are learnable weight matrices and fc is a fully connected layer that downsamples the similarity vector to a similarity score.
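
A sketch of the joint global/local scoring is given below, under explicit assumptions: mean pooling stands in for GPO, the vector-valued global similarity is taken as the element-wise product of the pooled and normalized features, the local term matches each word only to its most similar region, and the class name AdaptiveSimilarity is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSimilarity(nn.Module):
    """Sketch of the joint global/local alignment score for one image-text pair."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)                      # downsample similarity vectors to a score

    def forward(self, V_star, B_star):                       # (n, dim), (m, dim)
        # global alignment: pooled features compared element-wise (mean pooling stands in for GPO)
        I = F.normalize(V_star.mean(dim=0), dim=-1)
        T = F.normalize(B_star.mean(dim=0), dim=-1)
        s_global = I * T                                      # similarity vector, not a scalar

        # local alignment: each word is matched only to its most salient region
        sim = F.normalize(B_star, dim=-1) @ F.normalize(V_star, dim=-1).t()   # (m, n)
        best_regions = sim.argmax(dim=1)
        s_local = (F.normalize(B_star, dim=-1) * F.normalize(V_star[best_regions], dim=-1)).sum(dim=0)

        return self.fc(torch.cat([s_global, s_local], dim=-1)).squeeze(-1)    # final matching score

scorer = AdaptiveSimilarity()
print(scorer(torch.randn(36, 1024), torch.randn(20, 1024)).item())
```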

For the loss function during training, the embodiment of the present invention adopts the triplet ranking loss based on the hardest negative samples, which is commonly used in image-text matching. Specifically:

where the hardest negatives, distinguished from the positive samples, are used to optimize the matching process, α is the margin factor, and [x]+ = max(x, 0).
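
The hardest-negative triplet ranking loss named here is a standard construction; a minimal sketch over a batch similarity matrix (diagonal entries being the matched pairs) might look like this:

```python
import torch

def hardest_negative_triplet_loss(scores: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """scores: (batch, batch) similarity matrix whose diagonal holds the matched image-text pairs."""
    batch = scores.size(0)
    pos = scores.diag().view(batch, 1)
    mask = torch.eye(batch, dtype=torch.bool, device=scores.device)

    # hardest negative caption for each image, and hardest negative image for each caption
    neg_cap = scores.masked_fill(mask, -1e9).max(dim=1).values.view(batch, 1)
    neg_img = scores.masked_fill(mask, -1e9).max(dim=0).values.view(batch, 1)

    cost_cap = (margin + neg_cap - pos).clamp(min=0)   # [alpha + s(i, t-) - s(i, t)]_+
    cost_img = (margin + neg_img - pos).clamp(min=0)   # [alpha + s(i-, t) - s(i, t)]_+
    return (cost_cap + cost_img).sum()

print(hardest_negative_triplet_loss(torch.randn(8, 8)))
```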

In summary, for the image and text data to be matched, the embodiment of the present invention uses, through the above steps 201-206, multimodal data such as the image region-label fusion information based on multimodal aggregation and the exogenous knowledge information based on multimodal clustering to enhance the semantic features of image-text pairs from inside and outside the image to be matched, respectively. The exogenous knowledge based on multimodal clustering enhances semantics from outside the image: the image regions of Visual Genome are fused with WordNet noun texts, multimodal clustering is performed in the joint embedding space, and the fused features of clustered image regions and noun texts are then used, via the multi-head attention mechanism, to enhance the features of the image-text pair to be matched. While clarifying the visual referent, the joint embedding space focuses on the expression of the same semantics in different modalities, so the model better understands the semantic commonality between image and text. The image region-label fusion information based on multimodal aggregation enhances semantics from inside the image: the label information from object detection of region features serves as a semantic supplement to the image region features, and, guided by the label information, the image region features are aggregated around the label features; the fused features of the aggregated image regions and label information then interact with the text features to be matched through a graph convolutional network, similar graph-structure information in the text features and the fused features is extracted by multi-step reasoning, and this knowledge is added to the image region features through the gating mechanism, which increases the association between modalities while strengthening the semantic representation of the image features. Finally, the embodiment of the present invention performs adaptive joint reasoning with both global and local alignment, improving the accuracy of cross-modal image-text matching and meeting various needs of practical applications.

Embodiment 3

The feasibility of Embodiments 1 and 2 is verified below with a specific experiment, as detailed in the following:

The embodiment of the present invention selects the Flickr30k image captioning dataset, which contains 31,783 images, each accompanied by five different text descriptions. The dataset is split so that 29,783 images and their corresponding texts form the training set and 1,000 images and their corresponding texts form the test set. The cross-modal matching technique of the embodiment matches image-text pairs by the cross-modal similarity of image and text features enhanced with multimodal knowledge. First, the exogenous knowledge information based on multimodal clustering and the image-label fusion information based on multimodal aggregation are obtained; guided by noun texts and label information respectively, multimodal semantic clustering and fusion of images and texts are performed on the exogenous datasets and within the samples of the Flickr30k dataset, strengthening the association between the common semantics of images and texts and jointly constructing a more accurate joint embedding semantic space. The embodiment then uses the multi-head attention mechanism and the gating mechanism to enhance the input image and text features with the above knowledge, thereby narrowing the semantic differences between heterogeneous modalities and further mining the characteristics shared by the different modalities; under both local and global alignment, the cross-modal similarity matrix of images and texts is computed and ranked to predict the cross-modal matching relation of image-text pairs. The embodiment evaluates performance with the R@K and rSum metrics: R@K (K = 1, 5, 10) measures whether the correct match is ranked within the top K, and higher R@K indicates better performance; rSum, the sum of the R@K values, serves as an overall evaluation measure of cross-modal matching between text and images. The rSum of the latest existing method is 526.5, while the rSum of the method of this embodiment is 530.1, an improvement of 3.6 in rSum over the latest existing method, which improves the reliability of cross-modal matching.
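
For reference, R@K and rSum can be computed from a similarity matrix roughly as follows. This sketch assumes a one-to-one ground truth per query, whereas Flickr30k actually provides five captions per image, and it sums the R@K values of both retrieval directions into rSum.

```python
import numpy as np

def recall_at_k(sims: np.ndarray, ks=(1, 5, 10)):
    """sims: (num_queries, num_gallery) similarity matrix; ground truth shares the query index."""
    ranks = (-sims).argsort(axis=1)                      # gallery items sorted by similarity
    gt_rank = np.array([np.where(ranks[i] == i)[0][0] for i in range(sims.shape[0])])
    return {k: float((gt_rank < k).mean() * 100.0) for k in ks}

sims_i2t = np.random.rand(1000, 1000)                    # stand-in image-to-text similarities
sims_t2i = np.random.rand(1000, 1000)                    # stand-in text-to-image similarities
r_i2t, r_t2i = recall_at_k(sims_i2t), recall_at_k(sims_t2i)
rsum = sum(r_i2t.values()) + sum(r_t2i.values())         # rSum aggregates all six R@K values
print(r_i2t, r_t2i, round(rsum, 1))
```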

Embodiment 4

A cross-modal matching device based on knowledge enhancement comprises a processor and a memory, wherein program instructions are stored in the memory, and the processor calls the program instructions stored in the memory to cause the device to execute the following method steps of Embodiment 1:

based on exogenous knowledge information from multimodal clustering, encoding the input image features V and text features B with a multi-head attention mechanism to obtain encoded image features and encoded text features;

acquiring regional semantic knowledge information based on multimodal aggregation: under the guidance of the label information t, aggregating the image region features with the label information t, and interacting the aggregated features G with the encoded text features through a graph convolutional network with multi-step reasoning to obtain the regional semantic knowledge information based on multimodal aggregation;

based on the multimodal-aggregated regional semantic knowledge information, enhancing the encoded image features and the encoded text features through a gating mechanism to obtain enhanced image features V* and enhanced text features B*;

applying adaptive joint reasoning with both global and local alignment to the enhanced image features V* and the enhanced text features B* to realize cross-modal matching of image-text pairs.

Wherein, the exogenous knowledge information based on multimodal clustering is obtained as follows:

first performing one clustering in the image space; under the guidance of the noun texts from WordNet, fusing the abstract semantics in the image space with the semantics in the text space through similarity judgment; and then performing a second clustering in the joint image-text embedding space.

Wherein, the exogenous knowledge information of multimodal clustering includes:

reversely classifying the nouns into the k image semantic centers through similarity discrimination, where the probability that noun Ti belongs to the l-th image semantic center is:

where sim is computed as cosine similarity; for each image semantic center the top-γ scoring nouns are selected, and noun Ti being selected for the l-th image semantic center is equivalent to:

where the threshold corresponds to the γ-th largest confidence among the nouns belonging to the l-th image semantic center; the set of nouns selected by image semantic center Sl is fixed accordingly; based on the nouns selected for each image semantic center, the feature of each image region mapped into the joint embedding space is calculated as:

where the nouns are those selected by the image semantic center Sl of the class containing image region Oi and τ is a margin that controls the aggregation; the image region features after noun-text semantic enhancement are thereby obtained, where:

by re-applying the k-means clustering algorithm to the fused image region features, the exogenous knowledge based on multimodal clustering is obtained, namely:

Mk = [m1, m2, …, mi, …, mk]

where mi is the semantic center of the image region features of the i-th cluster after the algorithm converges.

Wherein, aggregating the image region features with the label information t under the guidance of the label information t is specifically:

the i-th region being classified to the l-th label information is equivalent to:

where the threshold corresponds to the maximum confidence among the region features belonging to the l-th label information; the set of region features selected by the image label information ti is fixed accordingly, and the selected image regions are aggregated to obtain:

the aggregated image region feature corresponding to the image label information ti; the image region-label fusion information is then expressed as G = [g1, g2, …, gi, …, gn], where gi is obtained by concatenating the label information with the aggregated image region feature.

Wherein, enhancing the encoded image features and the encoded text features through the gating mechanism to obtain the enhanced image features V* and the enhanced text features B* is specifically:

adding the obtained regional semantic knowledge, through the gating mechanism, to the region features of the enhanced image and text respectively:

where the outputs are the enhanced region features of the image and text and the gate parameters are learnable weight matrices; the image and text features after the two-step multimodal knowledge enhancement are V* and B*.

It should be pointed out here that the device description in the above embodiment corresponds to the method description in the embodiments, and the embodiment of the present invention does not repeat it here.

The execution entities of the above processor and memory may be devices with computing functions such as a computer, a single-chip microcomputer, or a microcontroller. In specific implementations, the embodiment of the present invention does not limit the execution entity, which is selected according to the needs of the actual application. Data signals are transmitted between the memory and the processor through a bus, which is not described in detail in the embodiment of the present invention.

Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium. The storage medium includes a stored program, and when the program runs, the device where the storage medium is located is controlled to execute the method steps in the above embodiments.

The computer-readable storage medium includes but is not limited to a flash memory, a hard disk, a solid-state drive, and the like. It should be pointed out here that the description of the readable storage medium in the above embodiment corresponds to the method description in the embodiments, and the embodiment of the present invention does not repeat it here.

In the above embodiments, the implementation may be wholly or partly realized by software, hardware, firmware, or any combination thereof. When implemented by software, it may be wholly or partly realized in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are generated in whole or in part.

The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer-readable storage medium may be any available medium that the computer can access, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium, a semiconductor medium, or the like.

References

[1] Mahadevan V, Wong C, Pereira J, et al. Maximum covariance unfolding: Manifold learning for bimodal data[C]. In Proceedings of the International Conference on Neural Information Processing Systems, Granada: MIT Press, 2011: 918-926.

[2] Blei D M, Jordan M I. Modeling annotated data[C]. In Proceedings of the International Conference on ACM Special Interest Group on Information Retrieval, Toronto: ACM, 2003: 127-134.

[3] Lee K, Chen X, Hua G, et al. Stacked Cross Attention for Image-Text Matching[C]. In Proceedings of the European Conference on Computer Vision, Munich: Springer, 2018: 212-228.

[4] Liu C, Mao Z, Zhang T, et al. Graph Structured Network for Image-Text Matching[C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle: IEEE Computer Vision Foundation, 2020: 10918-10927.

[5] Chen J, Hu H, Wu H, et al. Learning the best pooling strategy for visual semantic embedding[C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, virtual: IEEE Computer Vision Foundation, 2021: 15789-15798.

Unless otherwise specified, the embodiments of the present invention do not limit the models of the components; any component capable of performing the above functions may be used.

Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (7)

CN202410513675.XA (priority 2024-04-26, filed 2024-04-26): A cross-modal matching method and device based on knowledge enhancement; status: Pending; published as CN118427631A (en)

Priority Applications (1)

Application Number / Priority Date / Filing Date / Title
CN202410513675.XA (CN118427631A, en) / 2024-04-26 / 2024-04-26 / A cross-modal matching method and device based on knowledge enhancement

Applications Claiming Priority (1)

Application Number / Priority Date / Filing Date / Title
CN202410513675.XA / 2024-04-26 / 2024-04-26 / A cross-modal matching method and device based on knowledge enhancement

Publications (1)

Publication Number / Publication Date
CN118427631A / 2024-08-02

Family

ID=92325396

Family Applications (1)

Application Number / Title / Priority Date / Filing Date
CN202410513675.XA (Pending, CN118427631A, en) / A cross-modal matching method and device based on knowledge enhancement / 2024-04-26 / 2024-04-26

Country Status (1)

Country / Link
CN (1): CN118427631A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number / Priority date / Publication date / Assignee / Title
CN119323682A (en)* / 2024-10-12 / 2025-01-17 / 暗物质(北京)智能科技有限公司 / Method, device, computer equipment and readable storage medium for enhancing and identifying potential safety hazards based on knowledge retrieval of pictures
CN119888293A (en)* / 2025-03-31 / 2025-04-25 / 中国科学技术大学 / Image depth clustering method and system based on multi-mode large model guidance

Similar Documents

Publication / Title
WO2021233112A1 (en): Multimodal machine learning-based translation method, device, equipment, and storage medium
CN113886571A (en): Entity identification method, apparatus, electronic device, and computer-readable storage medium
CN113836992B (en): Label identification method, label identification model training method, device and equipment
Abdar et al.: A review of deep learning for video captioning
CN113553418B (en): Visual dialogue generation method and device based on multi-modal learning
CN118427631A (en): A cross-modal matching method and device based on knowledge enhancement
CN116049406A (en): Cross-domain emotion classification method based on contrast learning
CN116933051A (en): Multi-mode emotion recognition method and system for modal missing scene
US20230065965A1 (en): Text processing method and apparatus
WO2025061033A1 (en): Video grounding method and apparatus
CN113704392A (en): Method, device and equipment for extracting entity relationship in text and storage medium
CN113095066A (en): Text processing method and device
CN115934883B (en): A method for joint entity relationship extraction based on multi-feature fusion based on semantic enhancement
Gao et al.: A hierarchical recurrent approach to predict scene graphs from a visual-attention-oriented perspective
CN117079298A (en): Information extraction method, training method of information extraction system and information extraction system
CN111967253A (en): Entity disambiguation method and device, computer equipment and storage medium
CN114332893A (en): Table structure identification method and device, computer equipment and storage medium
CN114998777A (en): Training method and device for cross-modal video retrieval model
CN116662591A (en): A Robust Visual Question Answering Model Training Method Based on Contrastive Learning
CN117592477A (en): Named entity recognition method, system, electronic device and storage medium
CN114359656A (en): A method and storage device for melanoma image recognition based on self-supervised contrastive learning
CN116883886B (en): Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
Zhang et al.: Transformer gate attention model: An improved attention model for visual question answering
CN117573914A (en): Entity alignment method based on multi-granularity multi-mode interaction network
CN117874697A (en): A visual relationship recognition method and device based on knowledge and data collaborative reasoning

Legal Events

Date / Code / Title / Description
PB01: Publication
SE01: Entry into force of request for substantive examination
