Technical Field
The present disclosure relates to the technical field of relation extraction, and in particular to a text-guided multimodal relation extraction method and device.
Background
Relation extraction refers to a text processing technique that extracts specified types of factual information, such as entities and relations, from unstructured natural language text and outputs it as structured data. Multimodal relation extraction aims to identify and extract relations between entities from data of multiple modalities, so as to understand the semantic information in the data more comprehensively.
To use image information more effectively and improve the accuracy of multimodal relation extraction, Chen, Xiang, Ningyu Zhang, Lei Li, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si and Huajun Chen, "Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction," ArXiv abs/2205.03521 (2022) adds visual prefixes to the attention computation of every layer of a BERT model to fuse visual information. That work proposes a visual-prefix-guided fusion mechanism that concatenates object-level visual representations as prefixes for each self-attention layer in BERT and further designs a dynamic gate for each layer to generate image-related paths, so that aggregated hierarchical multi-scale visual features can serve as visual prefixes that enhance NER and RE, thereby improving the accuracy of multimodal relation extraction.
However, the above method assumes that all input information is useful for the task objective. In fact, as shown by the experiments in Bowen Yu, Mengge Xue, Zhenyu Zhang, Tingwen Liu, Yubin Wang, and Bin Wang, "Learning to prune dependency trees with rethinking for neural relation extraction," In Proceedings of COLING, pages 3842–3852 (2020), usually only part of the text helps relational reasoning. The situation is even more serious for visual input, because not all visual information plays a positive role, especially in social media data. As described in the experimental analysis of Alakananda Vempala and Daniel Preoţiuc-Pietro, "Categorizing and inferring the relationship between the text and image of Twitter posts," In Proceedings of ACL, pages 2830–2840 (2019), more than 33% of visual information provides no contextual supplement for multimodal relation extraction and may even introduce considerable noise that degrades relation extraction performance. For the image modality, noise arises at two levels: 1) at the global level, most regions in the image carry no information for recognizing the target entities; 2) at the local level, even the salient regions express more complex visual semantics than are needed. In such cases, redundant information interferes with the model's allocation of attention weights over image regions and thereby hinders the final task prediction. It is therefore necessary to selectively filter the input image object features.
In summary, research on multimodal relation extraction suffers from problems such as modal noise and insufficient modal interaction; multiple modalities of information cannot be used effectively, resulting in unsatisfactory relation extraction performance.
Summary of the Invention
To solve the above problems, the present invention proposes a text-guided multimodal relation extraction method and device. In brief, text information is introduced into the image-encoding process and is used to regulate the output of the image encoder, so that the image encoder's output becomes correlated with the text. This reduces the interference of irrelevant visual information and improves the accuracy of relation extraction. In addition, to achieve multi-level, fine-grained alignment and fusion of visual and text features, a fusion architecture based on cross attention is designed.
To achieve the above purpose, the technical solution of the present invention includes the following.
A text-guided multimodal relation extraction method comprises the following steps:
for a given image, obtaining a plurality of local object images in the global image;
obtaining a text feature encoding representation of a given text and visual feature encoding representations of the image and the local object images;
using the text feature encoding representation as a prior input to a visual encoder and, based on a top-down attention mechanism with backward decoding feedback, further guiding the visual encoder to learn a visual feature encoding representation that is more relevant to the text semantics;
fusing the text feature encoding representation and the visual feature encoding representation that is more relevant to the text semantics through a cross-attention mechanism to obtain a cross-modal text feature encoding representation;
performing relation classification based on the cross-modal text feature encoding representation to obtain the semantic relation type between two entities in the given text.
Further, obtaining, for a given image, a plurality of local object images in the global image includes:
extracting visual objects from the original image using a tool and setting a confidence threshold on the probabilities of the detected objects;
obtaining, based on the confidence threshold, a plurality of local object images in the global image.
Further, obtaining the text feature encoding representation of the given text and the visual feature encoding representations of the image and the local object images includes:
obtaining an initial text encoding representation of the given text using a pre-trained BERT embedding layer;
inputting the initial text encoding representation into a pre-trained text encoder to obtain the text feature encoding representation, wherein the text encoder is composed of several BERT layers;
obtaining initial visual encoding representations of the image and the plurality of local object images using a pre-trained CLIP embedding layer;
inputting the initial visual encoding representations into a pre-trained visual encoder to obtain the visual encoding representations, wherein the visual encoder is composed of several CLIP layers.
Further, using the text feature encoding representation as the prior input to the visual encoder and, based on the top-down attention mechanism with backward decoding feedback, further guiding the visual encoder to learn the visual feature encoding representation that is more relevant to the text semantics includes:
re-weighting the initial visual encoding representation of a target image according to its similarity to the text feature encoding representation to obtain re-weighted visual features, wherein the target image includes the image and each local object image;
feeding the re-weighted visual features into the decoder of the visual encoder to generate a top-down signal xtd, and feeding the signal xtd back as a top-down input to the self-attention module of every layer of the visual encoder to update the Value matrix of that self-attention module;
performing a second forward propagation of the target image with the updated Value matrices to obtain the visual feature encoding representations of the image and of the local object images in the image.
Further, the visual encoder training loss is

Lve = −log[ exp(sim(zL, ξ)) / Σk=1..k′ exp(sim(z̄L(k), ξ)) ] + Σl=1..L−1 ‖sg(zl) − gl(zl+1)‖²

where L is the number of encoding layers of the visual encoder, sg denotes the stop-gradient operation, zl denotes the output of the l-th encoding layer, gl denotes the decoder of the l-th layer, zL denotes the image feature representation output by the visual encoder, ξ denotes the text feature representation, z̄L(k) denotes a negative sample, and k′ denotes the number of images in a batch.
Further, fusing the text feature encoding representation and the visual feature encoding representation that is more relevant to the text semantics through the cross-attention mechanism to obtain the cross-modal text feature encoding representation includes:
projecting the text feature encoding representation and the visual feature encoding representation of the l-th layer into cross-attention query, key and value vectors respectively, to obtain the vector representations Qt^l, Kt^l and Vt^l corresponding to the text feature encoding representation and the vector representations Qv^l, Kv^l and Vv^l corresponding to the visual feature encoding representation that is more relevant to the text semantics;
computing the hidden features of the (l+1)-th layer through cross attention;
iteratively updating layer by layer, with the hidden features of the last layer being the cross-modal text feature encoding representation.
Further, performing relation classification based on the cross-modal text feature encoding representation to obtain the semantic relation type between two entities in the given text includes:
inputting the cross-modal text feature encoding representation into a multi-layer perceptron to obtain an encoded representation of the entity pair in the given text;
performing relation classification on the encoded representation of the entity pair through a softmax classifier;
performing iterative optimization using the cross-entropy loss as the task objective loss.
A text-guided multimodal relation extraction device, the device comprising:
an object extraction module for obtaining, for a given image, a plurality of local object images in the global image;
a text encoder for obtaining a text feature encoding representation of a given text; a visual encoder for obtaining visual feature encoding representations of the image and the local object images, wherein the text feature encoding representation is used as a prior input to the visual encoder and, based on a top-down attention mechanism with backward decoding feedback, further guides the visual encoder to learn a visual feature encoding representation that is more relevant to the text semantics;
a feature fusion module for fusing the text feature encoding representation and the visual feature encoding representation that is more relevant to the text semantics through a cross-attention mechanism to obtain a cross-modal text feature encoding representation;
a relation extraction module for performing relation classification based on the cross-modal text feature encoding representation to obtain the semantic relation type between two entities in the given text.
A computer device, comprising a processor and a memory storing computer program instructions, wherein the processor, when executing the computer program instructions, implements any one of the text-guided multimodal relation extraction methods described above.
A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed, implement any one of the text-guided multimodal relation extraction methods described above.
The technical solution provided by the embodiments of the present disclosure has at least the following beneficial effects.
The output of the text encoder is used to guide and regulate the output of the image encoder, thereby reducing the impact of irrelevant visual information. Specifically, current visual attention algorithms are stimulus-driven and highlight all salient objects in an image; however, some of these objects are not what the multimodal relation extraction task needs, and for that task they are simply noise. Humans, by contrast, guide their attention over a picture according to the high-level task at hand and attend to the local visual objects that are relevant to it. The present method therefore aims to better simulate this human attention-guidance mechanism and to concentrate attention on the visual objects relevant to the multimodal relation extraction task. By introducing the text prior, the visual encoder can more precisely capture the visual features closely related to the given text content, which improves the performance of the multimodal task. This strategy filters out unnecessary information more effectively when processing image features, enhancing the accuracy and efficiency of multimodal relation extraction.
With the text-guided visual encoder, the obtained visual features are correlated with the text features, which improves the accuracy of relation extraction.
In addition, to achieve multi-level, fine-grained alignment and fusion of visual and text features, a fusion mechanism based on cross attention is used.
It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a model diagram of the text-guided multimodal relation extraction method of the present invention.
Fig. 3 is a framework diagram of the visual encoder.
Fig. 4 is a schematic diagram of the multi-level fine-grained alignment of visual features and text features (Cross-Model Fusion).
Fig. 5 shows the detailed results of the present invention for each relation type in the relation set.
Detailed Description
Exemplary embodiments will be described in detail below with reference to the accompanying drawings.
A novel text-guided multimodal relation extraction method includes the following steps.
1. Use a tool (such as Faster R-CNN) to extract the visual objects in the original image and obtain the top k local visual objects of the current image.
In one embodiment, the present invention sets a confidence threshold on the probabilities of the detected objects and keeps at most k objects per image. Finally, if the number of objects detected in an image is less than k, zero padding is applied to bring the number up to k.
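As a concrete illustration of this step, the sketch below uses a torchvision Faster R-CNN detector to keep the top k most confident object crops and zero-pads to k when fewer objects are found; the specific value of k, the 0.5 score threshold and the black-image padding are illustrative assumptions rather than values fixed by this disclosure.

```python
# Illustrative sketch of step 1 (object extraction); detector weights, k and the
# threshold are example choices.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

def extract_top_k_objects(image, k=3, score_threshold=0.5):
    """Return up to k object crops from a PIL image, padded with blank images to k."""
    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    detector.eval()
    with torch.no_grad():
        pred = detector([to_tensor(image)])[0]          # boxes/scores sorted by confidence
    crops = []
    for box, score in zip(pred["boxes"], pred["scores"]):
        if score < score_threshold or len(crops) == k:
            break
        x1, y1, x2, y2 = (int(v) for v in box.tolist())
        crops.append(image.crop((x1, y1, x2, y2)))
    while len(crops) < k:                               # zero padding: all-black crops
        crops.append(Image.new("RGB", image.size))
    return crops
```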
2. Text encoder.
A text encoder is constructed: the input text is processed by a pre-trained BERT embedding layer to obtain the text encoding, which is then fed into the text encoder (composed of 12 BERT layers). The text encoder yields the feature representation of the text.
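A minimal sketch of this text-encoding step, assuming the HuggingFace transformers implementation of BERT; the bert-base-uncased checkpoint (12 layers) is an example choice and not prescribed by the disclosure.

```python
# Text encoder sketch: pre-trained BERT embeddings + 12 BERT layers.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")   # 12 BERT layers

def encode_text(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    # (1, seq_len, 768): token-level text feature representation
    return outputs.last_hidden_state
```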
3. Visual encoder.
For the visual encoder: to improve the expressiveness of the image representation and, at the same time, to reduce the influence of irrelevant visual objects, a Vision Transformer with a text prior is used, so that the resulting image features are more relevant to the text content.
Fig. 3 shows the specific composition of the visual encoder. The network is divided into a forward-propagation part and a backward-propagation part: the forward propagation is a normal Vision Transformer pass that encodes the input image, and the backward-propagation part is a decoder in which each layer contains a linear decoder. The main encoding procedure is as follows:
1) The original image and the extracted top k local visual objects are preprocessed by the pre-trained CLIP embedding layer to obtain initial representations of the original image and the local visual objects.
2) From the initial representations, the image is encoded through forward propagation to obtain the visual features (tokens) corresponding to the image.
3) The output tokens are re-weighted according to their similarity to the prior vector ξ (the text feature representation output by the text encoder), as follows:
zL → α·sim(zL, ξ)·zL
where zL denotes the output (tokens) of the L-th layer, sim denotes cosine similarity, α is a scaling factor that controls the scale of the top-down signal, and L is the number of encoding layers of the visual encoder.
4) The re-weighted tokens are back-propagated, i.e., fed into the decoder, which generates the top-down signal (xtd in Fig. 3). This signal is fed back to the self-attention module of every layer, i.e., it is sent back as a top-down input into the Value matrix of each layer's self-attention, while the other parts remain unchanged.
5) A second forward pass (feedforward) is performed to obtain the image feature representation of the original image or the local visual object; in this pass, the self-attention of every layer receives the additional top-down input.
6) The image feature representations of the original image and of each local visual object are combined to give the output of the visual encoder.
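The following is a simplified sketch of this text-guided encoding loop. The block structure, the per-layer linear decoders gl, and the exact point at which the top-down signal xtd enters the Value path are schematic assumptions intended to illustrate steps 1)–6), not the exact network of Fig. 3.

```python
# Schematic text-guided Vision Transformer with top-down feedback (sketch only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownBlock(nn.Module):
    """Transformer block whose Value input can carry a top-down signal."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.decoder = nn.Linear(dim, dim)          # g_l: linear decoder of this layer

    def forward(self, x, x_td=None):
        h = self.norm1(x)
        v = h if x_td is None else self.norm1(x + x_td)   # top-down signal enters only the Value path
        attn_out, _ = self.attn(h, h, v)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

class TextGuidedViT(nn.Module):
    def __init__(self, dim=768, depth=12, alpha=1.0):
        super().__init__()
        self.blocks = nn.ModuleList([TopDownBlock(dim) for _ in range(depth)])
        self.alpha = alpha

    def forward(self, tokens, xi):
        # tokens: (B, N, dim) CLIP-embedded patches; xi: (B, dim) text prior
        z = tokens
        for blk in self.blocks:                     # 2) bottom-up pass
            z = blk(z)
        sim = F.cosine_similarity(z, xi.unsqueeze(1), dim=-1)   # 3) re-weight by similarity to xi
        z = self.alpha * sim.unsqueeze(-1) * z
        td, x_td = [], z                            # 4) run the layer decoders top-down
        for blk in reversed(self.blocks):
            x_td = blk.decoder(x_td)
            td.append(x_td)
        td.reverse()
        z = tokens                                  # 5) second forward pass with top-down Value input
        for blk, x_td in zip(self.blocks, td):
            z = blk(z, x_td=x_td)
        return z                                    # 6) text-guided visual feature representation
```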
To better align the network output with the prior, a CLIP-like loss is used to align the network output with the language description:

Lprior = −log[ exp(sim(zL, ξ)) / Σk=1..k′ exp(sim(z̄L(k), ξ)) ]

where ξ is the output of the text encoder, i.e. the text guidance vector, zL is the output for the image corresponding to the text, z̄L(k) denotes a negative sample, i.e. the output of another image, and k′ is the number of images in a batch.
At the same time, in order for the decoder of the l-th layer to reconstruct the layer-l features from layer l+1, the following loss is used for optimization:

‖zl − gl(zl+1)‖²

where gl denotes the decoder of the l-th layer and zl denotes the visual features of the l-th layer of the visual encoder.
The final total training loss of the visual encoder is:

Lve = Lprior + Σl=1..L−1 ‖sg(zl) − gl(zl+1)‖²

where sg denotes stop_gradient, i.e. stopping the gradient.
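A hedged sketch of how these two terms could be computed for a batch is given below; the contrastive form with a temperature tau, the use of the pooled (e.g. CLS) token as zL, and mean-squared error for the reconstruction term are assumptions about details the text leaves open.

```python
# Sketch of the visual-encoder training loss: CLIP-like alignment + layer reconstruction.
import torch
import torch.nn.functional as F

def visual_encoder_loss(z_cls, xi, layer_feats, decoders, tau=0.07):
    """
    z_cls:       (B, d)  pooled image features z_L (e.g. CLS token) for the batch
    xi:          (B, d)  text guidance vectors from the text encoder
    layer_feats: list of per-layer features z_1 ... z_L, each (B, N, d)
    decoders:    list of linear decoders; decoders[l] rebuilds z_{l+1}'s input layer
    """
    # contrastive alignment: the other images in the batch act as negatives
    logits = F.cosine_similarity(z_cls.unsqueeze(1), xi.unsqueeze(0), dim=-1) / tau   # (B, B)
    targets = torch.arange(z_cls.size(0), device=z_cls.device)
    align_loss = F.cross_entropy(logits, targets)

    # layer-wise reconstruction with stop-gradient on the target z_l
    recon_loss = 0.0
    for l, g_l in enumerate(decoders):
        recon_loss = recon_loss + F.mse_loss(g_l(layer_feats[l + 1]), layer_feats[l].detach())
    return align_loss + recon_loss
```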
This method aims to better simulate the human attention-guidance mechanism and to focus attention on the visual objects relevant to the multimodal relation extraction task. By introducing the text prior, the Vision Transformer model can more precisely capture the visual features closely related to the given text content, thereby improving the performance of the multimodal task. This strategy filters out unnecessary information more effectively when processing image features, enhancing the accuracy and efficiency of multimodal relation extraction.
4. Multi-scale fine-grained fusion of visual and text features.
Here, the fusion module applies implicit token-object alignment of multi-granularity signals at every level to capture the associations between visual objects and entities and to fuse the multimodal features. Specifically, as shown in Fig. 4:
For the visual features and text features of a given layer l, they are projected into query/key/value vectors to obtain Qt^l, Kt^l, Vt^l and Qv^l, Kv^l, Vv^l, where the attention projection parameters are matrices in R^(d×dh); n denotes the number of data items in a batch, d the encoding dimension and dh the projection dimension.
Then the hidden features of the (l+1)-th layer are computed through cross attention between the two modalities; a sketch of one such fusion layer is given below.
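The following single-head sketch illustrates one cross-modal fusion layer (Cross-Model Fusion, Fig. 4) using scaled dot-product attention in both directions; the residual connections, the symmetric vision-to-text branch and the output projections are assumptions added to make the layer self-contained, not details fixed by the disclosure.

```python
# One cross-modal fusion layer: text attends over visual objects and vice versa.
import math
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    def __init__(self, d_model=768, d_h=64):
        super().__init__()
        # text-side and vision-side query/key/value projections
        self.q_t, self.k_t, self.v_t = (nn.Linear(d_model, d_h) for _ in range(3))
        self.q_v, self.k_v, self.v_v = (nn.Linear(d_model, d_h) for _ in range(3))
        self.out_t = nn.Linear(d_h, d_model)
        self.out_v = nn.Linear(d_h, d_model)
        self.scale = math.sqrt(d_h)

    def forward(self, h_text, h_vis):
        # h_text: (B, n_t, d); h_vis: (B, n_v, d)
        attn_tv = torch.softmax(self.q_t(h_text) @ self.k_v(h_vis).transpose(-2, -1) / self.scale, dim=-1)
        new_text = h_text + self.out_t(attn_tv @ self.v_v(h_vis))       # text queries visual keys/values
        attn_vt = torch.softmax(self.q_v(h_vis) @ self.k_t(h_text).transpose(-2, -1) / self.scale, dim=-1)
        new_vis = h_vis + self.out_v(attn_vt @ self.v_t(h_text))        # visual queries text keys/values
        return new_text, new_vis
```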
5. The output of the multimodal fusion module is used for prediction.
For a given relation extraction dataset, the goal is to predict the relation r ∈ y between the subject and the object of the relation, where y is the set of given relations. Finally, the output OP of the text fusion module of the fusion model is used for relation prediction: it is fed as the input of an MLP (multi-layer perceptron) to complete the relation classification task.
p(r|X) = softmax(MLP(OP))
The cross-entropy loss is used as the training loss of the MRE (multimodal relation extraction) task:

LRE = −Σi=1..M log p(r(i) | CMF(X(i)))

where X(i) denotes the i-th sample in the dataset, CMF (Cross-Model Fusion) denotes the fusion operation, and M is the number of samples in the dataset.
Thus, the total loss is:

L = LRE + Lve
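A sketch of the classification head follows. How the entity-pair representation is pooled from OP (here, by concatenating the fused representations at the head and tail entity positions) and the size of the relation set (23 classes, as in MNRE) are illustrative assumptions.

```python
# Relation classification head: MLP over the fused text output, softmax over relations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    def __init__(self, d_model=768, num_relations=23):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_relations)
        )

    def forward(self, fused_text, head_idx, tail_idx):
        # fused_text: (B, seq_len, d) output O_P of the fusion module
        batch = torch.arange(fused_text.size(0), device=fused_text.device)
        pair = torch.cat([fused_text[batch, head_idx], fused_text[batch, tail_idx]], dim=-1)
        return self.mlp(pair)                         # relation logits

def relation_loss(logits, gold_relations):
    # cross-entropy over the relation set y (softmax folded into the loss)
    return F.cross_entropy(logits, gold_relations)
```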
The multimodal relation extraction method provided by the present invention is experimentally verified below on the MNRE dataset. Accuracy (Acc.), precision (Pre.), recall (Rec.) and F1 are used as the main evaluation metrics. The results are shown in Table 1 below.
Table 1: Experimental results
Here, "w/o prior" means that no prior is used in the visual encoder.
The precision, recall and F1 for each individual relation are shown in Fig. 5. The analysis of the results demonstrates the effectiveness of the proposed method: it mitigates the impact of irrelevant visual information in the image on relation extraction and thus improves relation extraction accuracy, with a clear improvement over existing methods such as HVPNeT.
Those skilled in the art will readily appreciate other embodiments of the present disclosure after considering the specification and practicing the present disclosure. The present disclosure is intended to cover any variations, uses or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only; the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes may be made without departing from its scope.