CN113343982B

Movatterモバイル変換

Info

Publication number: CN113343982B
Application number: CN202110666465.0A
Authority: CN
Inventors: 李煜林; 庾悦晨; 钦夏孟; 章成全; 姚锟; 韩钧宇; 刘经拓; 丁二锐; 吴甜; 王海峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-16
Filing date: 2021-06-16
Publication date: 2023-07-25
Anticipated expiration: 2041-06-16
Also published as: CN113343982A

Abstract

Translated fromChinese

根据本公开的实施例，提供了一种多模态特征融合的实体关系提取的方法、装置、设备、介质和程序产品。涉及人工智能技术领域，具体涉及计算机视觉和深度学习技术领域，可应用于智慧城市和智慧金融场景。方案为：针对包括字符的图像中的多个区域中的每个区域，确定区域的视觉特征和区域的多个字符文本特征，字符文本特征对应于区域中的一个字符；针对每个区域，基于区域的视觉特征和多个字符文本特征，确定区域的区域视觉语义特征；基于区域视觉语义特征，确定多个区域的关系信息，关系信息至少指示多个区域中的任意两个区域之间的关联程度；基于关系信息，将多个区域中的区域相关联；针对获取的实体，提取实体关系。由此能够提升文本识别的准确率。

According to the embodiments of the present disclosure, a method, device, device, medium and program product for entity relationship extraction based on multimodal feature fusion are provided. It involves the field of artificial intelligence technology, specifically computer vision and deep learning technology, and can be applied to smart cities and smart financial scenarios. The solution is: for each region in the plurality of regions in the image including characters, determine the visual feature of the region and a plurality of character text features of the region, the character text feature corresponds to a character in the region; for each region, determine the regional visual semantic feature of the region based on the regional visual feature and the plurality of character text features; This can improve the accuracy of text recognition.

Description

Translated fromChinese

多模态特征融合的实体关系提取方法、装置和设备Entity relationship extraction method, device and equipment for multimodal feature fusion

技术领域technical field

本公开涉及人工智能技术领域，具体涉及计算机视觉和深度学习技术领域，可应用于智慧城市和智慧金融场景，并且更具体地，涉及多模态特征融合的实体关系提取方法、装置、设备、计算机可读存储介质和计算机程序产品。The present disclosure relates to the field of artificial intelligence technology, specifically to the field of computer vision and deep learning technology, which can be applied to smart cities and smart financial scenarios, and more specifically, to an entity relationship extraction method, device, equipment, computer-readable storage medium, and computer program product for multi-modal feature fusion.

背景技术Background technique

随着信息技术的发展，神经网络被广泛用于诸如计算机视觉、语音识别和信息检索等的各种机器学习任务。文档的特定信息提取，是从文档(例如请示、通知函、报告、会议纪要，以及合同、招标书、巡检报告、检修工单)中自动抽取特定信息，包括用户感兴趣的信息实体和关系等。利用神经网络对文档的图像进行处理，以对文档中的信息提取被认为是一种有效的方法。然而，文本识别的准确率还有待提高。With the development of information technology, neural networks are widely used in various machine learning tasks such as computer vision, speech recognition, and information retrieval. The specific information extraction of documents is to automatically extract specific information from documents (such as requests for instructions, notification letters, reports, meeting minutes, and contracts, bidding documents, inspection reports, maintenance work orders), including information entities and relationships that users are interested in. It is considered to be an effective method to use neural network to process the image of the document to extract information from the document. However, the accuracy of text recognition needs to be improved.

发明内容Contents of the invention

根据本公开的示例实施例，提供了一种多模态特征融合的实体关系提取的方法、装置、设备、计算机可读存储介质和计算机程序产品。According to example embodiments of the present disclosure, a method, apparatus, device, computer-readable storage medium and computer program product for entity relationship extraction based on multi-modal feature fusion are provided.

在本公开的第一方面中，提供了一种多模态特征融合的实体关系提取处理的方法。该方法包括：针对包括字符的图像中的多个区域中的每个区域，确定区域的视觉特征和区域的多个字符文本特征，字符文本特征对应于区域中的一个字符；针对每个区域，基于区域的视觉特征和多个字符文本特征，确定区域的区域视觉语义特征；基于区域视觉语义特征，确定多个区域的关系信息，关系信息至少指示多个区域中的任意两个区域之间的关联程度；基于关系信息，将多个区域中的区域相关联；以及针对获取的实体，提取实体关系。In the first aspect of the present disclosure, a method for entity relationship extraction processing of multimodal feature fusion is provided. The method includes: for each region in a plurality of regions in an image including characters, determining a visual feature of the region and a plurality of character text features of the region, the character text feature corresponding to a character in the region; for each region, based on the visual feature of the region and the plurality of character text features, determining the regional visual semantic feature of the region;

在本公开的第二方面中，提供了一种多模态特征融合的实体关系提取处理装置。该装置包括：第一特征确定模块，被配置为针对包括字符的图像中的多个区域中的每个区域，确定区域的视觉特征和区域的多个字符文本特征，字符文本特征对应于区域中的一个字符；第二特征确定模块，被配置为针对每个区域，基于区域的视觉特征和多个字符文本特征，确定区域的区域视觉语义特征；关系信息确定模块，被配置为基于区域视觉语义特征，确定多个区域的关系信息，关系信息至少指示多个区域中的任意两个区域之间的关联程度；第一区域关联模块，被配置为基于关系信息，将多个区域中的区域相关联；以及第一提取模块，被配置为针对获取的实体，提取实体关系。In a second aspect of the present disclosure, an entity relationship extraction processing device for multimodal feature fusion is provided. The device includes: a first feature determination module configured to, for each region in a plurality of regions in an image including characters, determine a visual feature of the region and a plurality of character text features of the region, the character text feature corresponding to a character in the region; a second feature determination module configured to determine a regional visual semantic feature of the region based on the visual feature of the region and the plurality of character text features; a relationship information determination module configured to determine the relationship information of the plurality of regions based on the regional visual semantic characteristics, the relationship information at least indicates the degree of association between any two regions of the plurality of regions; the first region association module , configured to associate regions in the plurality of regions based on the relationship information; and a first extraction module configured to extract entity relationships for the acquired entities.

在本公开的第三方面中，提供了一种电子设备，包括一个或多个处理器；以及存储装置，用于存储一个或多个程序，当一个或多个程序被一个或多个处理器执行，使得一个或多个处理器实现根据本公开的第一方面的方法。In a third aspect of the present disclosure, an electronic device is provided, including one or more processors; and a storage device for storing one or more programs, when the one or more programs are executed by the one or more processors, so that the one or more processors implement the method according to the first aspect of the present disclosure.

在本公开的第四方面中，提供了一种计算机可读介质，其上存储有计算机程序，该程序被处理器执行时实现根据本公开的第一方面的方法。In a fourth aspect of the present disclosure, there is provided a computer-readable medium on which is stored a computer program that implements the method according to the first aspect of the present disclosure when executed by a processor.

在本公开的第五方面中，提供了一种计算机程序产品，包括计算机程序指令，该计算机程序指令被处理器实现如本公开的第一方面的方法。In a fifth aspect of the present disclosure, there is provided a computer program product comprising computer program instructions, the computer program instructions being implemented by a processor as the method of the first aspect of the present disclosure.

应当理解，发明内容部分中所描述的内容并非旨在限定本公开的实施例的关键或重要特征，亦非用于限制本公开的范围。本公开的其它特征将通过以下的描述变得容易理解。It should be understood that what is described in the Summary of the Invention is not intended to limit the key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood through the following description.

附图说明Description of drawings

结合附图并参考以下详细说明，本公开各实施例的上述和其他特征、优点及方面将变得更加明显。在附图中，相同或相似的附图标记表示相同或相似的元素。附图用于更好地理解本方案，不构成对本公开的限定，其中：The above and other features, advantages and aspects of the various embodiments of the present disclosure will become more apparent with reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements. The accompanying drawings are used to better understand the scheme, and do not constitute a limitation to the present disclosure, wherein:

图1示出了本公开的一些实施例能够在其中实现的多模态特征融合的实体关系提取的系统100的示例的示意图；FIG. 1 shows a schematic diagram of an example of a system 100 for extracting entity relationships by multimodal feature fusion in which some embodiments of the present disclosure can be implemented;

图2示出了本公开实施例的包括字符的图像的示例性图像200；FIG. 2 shows an exemplary image 200 of an image including characters according to an embodiment of the present disclosure;

图3示出了根据本公开的一些实施例的多模态特征融合的实体关系提取的过程300的流程图；FIG. 3 shows a flowchart of a process 300 of entity relationship extraction based on multimodal feature fusion according to some embodiments of the present disclosure;

图4示出了根据本公开的一些实施例的用于确定区域视觉语义特征的过程400的流程图；FIG. 4 shows a flowchart of a process 400 for determining regional visual semantic features according to some embodiments of the present disclosure;

图5示出了根据本公开的实施例的多模态特征融合的实体关系提取装置500的示意框图；以及FIG. 5 shows a schematic block diagram of an entity relationship extraction device 500 for multimodal feature fusion according to an embodiment of the present disclosure; and

图6示出了能够实施本公开的多个实施例的设备600的框图。FIG. 6 shows a block diagram of a device 600 capable of implementing various embodiments of the present disclosure.

具体实施方式Detailed ways

下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例，然而应当理解的是，本公开可以通过各种形式来实现，而且不应该被解释为限于这里阐述的实施例，相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是，本公开的附图及实施例仅用于示例性作用，并非用于限制本公开的保护范围。Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided so that the disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only, and are not intended to limit the protection scope of the present disclosure.

在本公开的实施例的描述中，术语“包括”及其类似用语应当理解为开放性包含，即“包括但不限于”。术语“基于”应当理解为“至少部分地基于”。术语“一个实施例”或“该实施例”应当理解为“至少一个实施例”。术语“第一”、“第二”等等可以指代不同的或相同的对象。下文还可能包括其他明确的和隐含的定义。In the description of the embodiments of the present disclosure, the term "comprising" and its similar expressions should be interpreted as an open inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be read as "at least one embodiment". The terms "first", "second", etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.

在本公开的实施例中，术语“模型”能够处理输入并且提供相应输出。以神经网络模型为例，其通常包括输入层、输出层以及在输入层与输出层之间的一个或多个隐藏层。在深度学习应用中使用的模型(也称为“深度学习模型”)通常包括许多隐藏层，从而延长网络的深度。神经网络模型的各个层按顺序相连以使得前一层的输出被用作后一层的输入，其中输入层接收神经网络模型的输入，而输出层的输出作为神经网络模型的最终输出。神经网络模型的每个层包括一个或多个节点(也称为处理节点或神经元)，每个节点处理来自上一层的输入。在本文中，术语“神经网络”、“模型”、“网络”和“神经网络模型”可互换使用。In embodiments of the present disclosure, the term "model" is capable of processing an input and providing a corresponding output. Taking a neural network model as an example, it generally includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. Models used in deep learning applications (also called "deep learning models") typically include many hidden layers, extending the depth of the network. The layers of the neural network model are connected in order so that the output of the previous layer is used as the input of the subsequent layer, wherein the input layer receives the input of the neural network model, and the output of the output layer serves as the final output of the neural network model. Each layer of a neural network model consists of one or more nodes (also known as processing nodes or neurons), each of which processes input from the previous layer. In this document, the terms "neural network", "model", "network" and "neural network model" are used interchangeably.

如以上提及的，需要提高文本识别的准确率。在传统方案中，通常存在如下三种情况：(1)人工手动录入。该方法缺陷在于不适用智能办公系统。无法实现自动化，人力成本较高。(2)通过定位文本实体，进行区域搜索。该方法缺陷在于局限于固定样式的文档，应用范围有局限性。(3)基于命名实体进行关系提取。在纯文本中通过上下文的语义特征进行关系判断。该方法缺陷在于使用纯文本进行实体抽取，忽略了文档中内容的视觉排版，容易导致语义混淆问题。因此，传统方案对图像中的字符的识别的准确率较低。As mentioned above, there is a need to improve the accuracy of text recognition. In the traditional solution, there are usually the following three situations: (1) manual input. The disadvantage of this method is that it is not suitable for intelligent office systems. It cannot be automated and the labor cost is high. (2) Perform region search by locating text entities. The disadvantage of this method is that it is limited to documents with a fixed style, and its application range is limited. (3) Relation extraction based on named entities. Relational judgments in plain text through contextual semantic features. The disadvantage of this method is that it uses plain text for entity extraction, ignoring the visual layout of the content in the document, which easily leads to semantic confusion. Therefore, the accuracy rate of traditional schemes for character recognition in images is low.

本公开的示例实施例提出了一种用于多模态特征融合的实体关系提取的方案。在该方案中，首先获取待处理的图像，图像中包括待识别的字符。图像可以根据字符所在的行或者列被划分为多个区域，可以针对每个区域，确定在该区域中的字符的文本特征以及该区域的视觉特征(图像特征、位置特征等)。然后根据上述确定的区域的视觉特征和该区域中的字符的文本特征，对其进行例如特征融合操作，来确定区域的区域视觉语义特征。接着根据每个区域的区域视觉语义特征，确定上述多个区域中两两区域之间的关联程度，关联程度越高，则两个区域中的字符之间存在关系的可能性却大。之根据上述确定的关联程度，将区域两两相关联。并且最后根据待确定的实体，基于实体的实体名称和实体值，在相关联的区域中进行提取。根据本公开的实施例，通过综合考虑图像中的字符和区域的位置特征、视觉特征和文本特征，可以准确地确定不同区域之间的关系。进而可以准确地将相关联的区域中的实体相关联，提升文本识别的准确率。Example embodiments of the present disclosure propose a scheme for entity relationship extraction for multimodal feature fusion. In this scheme, the image to be processed is acquired first, and the image includes characters to be recognized. The image can be divided into multiple regions according to the row or column where the characters are located, and for each region, the text characteristics of the characters in the region and the visual characteristics (image characteristics, position characteristics, etc.) of the region can be determined. Then, according to the above-mentioned determined visual features of the area and the text features of the characters in the area, for example, a feature fusion operation is performed on them to determine the regional visual semantic features of the area. Then, according to the regional visual semantic features of each region, the degree of association between two regions in the above multiple regions is determined. The higher the degree of association, the greater the possibility of a relationship between the characters in the two regions. According to the degree of association determined above, the regions are associated in pairs. And finally, according to the entity to be determined, extraction is performed in the associated area based on the entity name and entity value of the entity. According to the embodiments of the present disclosure, by comprehensively considering the position features, visual features and text features of characters and areas in an image, the relationship between different areas can be accurately determined. Furthermore, the entities in the associated regions can be accurately associated to improve the accuracy of text recognition.

图1示出本公开的一些实施例能够在其中实现的多模态特征融合的实体关系提取的系统100的示例的示意图。如图1所示，系统100包括计算设备110。计算设备110可以是具有计算能力的任何设备，例如个人计算机、平板计算机、可穿戴设备、云服务器、大型机和分布式计算系统等。FIG. 1 shows a schematic diagram of an example of a system 100 for extracting entity relations by multi-modal feature fusion in which some embodiments of the present disclosure can be implemented. As shown in FIG. 1 , system 100 includes computing device 110 . The computing device 110 may be any device with computing capabilities, such as a personal computer, a tablet computer, a wearable device, a cloud server, a mainframe, a distributed computing system, and the like.

计算设备110获取输入120。例如，输入120可以是图像、视频、音频、文本、和/或多媒体文件等。计算设备110可以将输入120应用于网络模型130，以利用网络模型130，生成与输入120相对应的处理结果140。在一些实施例中，网络模型130可以是但不限于OCR识别模型、图像分类模型、语义分割模型、目标检测模型，或者其他与图像处理相关的神经网络模型。可以利用任何合适的网络结构来实现网络模型130，包括但不限于支持向量机(SVM)模型，贝叶斯模型，随机森林模型，各种深度学习/神经网络模型，诸如卷积神经网络(CNN)、循环神经网络(RNN)、深度神经网络(DNN)、深度强化学习网路(DQN)等。本公开的范围在此方面不受限制。Computing device 110 obtains input 120 . For example, input 120 may be an image, video, audio, text, and/or multimedia file, among others. Computing device 110 may apply input 120 to network model 130 to generate processing result 140 corresponding to input 120 using network model 130 . In some embodiments, the network model 130 may be, but not limited to, an OCR recognition model, an image classification model, a semantic segmentation model, an object detection model, or other neural network models related to image processing. Any suitable network structure can be utilized to implement the network model 130, including but not limited to support vector machine (SVM) model, Bayesian model, random forest model, various deep learning/neural network models, such as convolutional neural network (CNN), recurrent neural network (RNN), deep neural network (DNN), deep reinforcement learning network (DQN), etc. The scope of the present disclosure is not limited in this respect.

系统100还可以包括训练数据获取装置、模型训练装置和模型应用装置(未示出)。在一些实施例中，上述多个装置可以分别实现在不同的物理计算设备中。备选地，上述多个装置中的至少一部分装置可以被实现在同一计算设备中。例如，训练数据获取装置、模型训练装置和可以被实现在同一计算设备中，而模型应用装置可以被实现在另一计算设备中。The system 100 may also include a training data acquisition device, a model training device and a model application device (not shown). In some embodiments, the above multiple means may be respectively implemented in different physical computing devices. Alternatively, at least a part of the above multiple means may be implemented in the same computing device. For example, the training data obtaining means, the model training means and can be implemented in the same computing device, while the model applying means can be implemented in another computing device.

输入120可以是待处理的输入数据(例如，图像数据)，网络模型130是经图像处理(例如，经训练的图像分类模型)，处理结果140可以是与输入120(例如，图像数据)相对应的预测结果(例如，图像的分类结果、语义分割结果或目标识别结果)。The input 120 may be input data to be processed (for example, image data), the network model 130 is image processed (for example, a trained image classification model), and the processing result 140 may be a prediction result corresponding to the input 120 (for example, image data) (for example, a classification result of an image, a semantic segmentation result or an object recognition result).

在一些实施例中，处理结果140可以是待确定的多个实体在文本中所对应的字符，例如实体“姓名”对应于“张三”，实体“日期”对应于“2021年01月01日”，实体“金额”对应于“200”等。在一些实施例中，处理结果140还可以是图像中的多个区域的关联程度。备选地，在一些实施例中，处理结果140还可以是待处理图像中的每个字符的分类结果。可以根据需要应用根据本公开的实施例的方法以获取不同的处理结果140，本公开在此不做限制。In some embodiments, the processing result 140 may be characters corresponding to multiple entities to be determined in the text, for example, the entity "name" corresponds to "Zhang San", the entity "date" corresponds to "January 01, 2021", the entity "amount" corresponds to "200", etc. In some embodiments, the processing result 140 may also be the degree of association of multiple regions in the image. Alternatively, in some embodiments, the processing result 140 may also be a classification result of each character in the image to be processed. The method according to the embodiment of the present disclosure can be applied to obtain different processing results 140 as required, and the present disclosure does not limit it here.

在一些实施例中，为了减少模型的计算量，计算设备110可以对输入120(例如图像)进行进一步处理。例如，计算设备110可以对上述图片进行尺寸重新设定和归一化操作，以形成预处理图像。在一些实施例中，对于形式为图像的输入120，可以通过对其中的图像进行图像裁剪、旋转和翻转。In some embodiments, the computing device 110 may perform further processing on the input 120 (eg, an image) in order to reduce the computational load of the model. For example, the computing device 110 may perform resizing and normalization operations on the above-mentioned pictures to form a pre-processed image. In some embodiments, for the input 120 in the form of an image, image cropping, rotation and flipping may be performed on the image.

应当理解，图1所示的系统100仅仅是本公开的实施例可实现于其中的一种示例，不旨在限制本公开的范围。本公开的实施例同样适用于其他系统或架构。It should be understood that the system 100 shown in FIG. 1 is only an example in which the embodiments of the present disclosure may be implemented, and is not intended to limit the scope of the present disclosure. Embodiments of the present disclosure are equally applicable to other systems or architectures.

图2示出了本公开实施例的包括字符的图像的示例性图像200。FIG. 2 illustrates an exemplary image 200 of an embodiment of the disclosure including images of characters.

为了在下文中清楚地对实施例进行阐述，在描述本公开的各实施例前，首先参考图2对包括字符的图像200进行描述。In order to clearly explain the embodiments below, before describing various embodiments of the present disclosure, an image 200 including characters is first described with reference to FIG. 2 .

如图2所示，图像200包括多个区域210-270(由虚线矩形框所指示)，每个区域可以包括多个字符，例如区域210可以包括多个字符211-217。区域在这里可以是指一行字符或一行文本在图像200中所占据的区域，或者一列字符或一列文本在图像200中所占据的区域。区域可以是任何形状，本公开在此不做限制。字符可以是各种语言形式的文本。下文将参考图2作为示例图像进行描述。As shown in FIG. 2, the image 200 includes a plurality of regions 210-270 (indicated by dashed rectangular boxes), and each region may include a plurality of characters, for example, the region 210 may include a plurality of characters 211-217. The area here may refer to the area occupied by a line of characters or text in the image 200 , or the area occupied by a column of characters or text in the image 200 . The area can be in any shape, and the present disclosure is not limited here. Characters can be text in various languages. Description will be made below with reference to FIG. 2 as an example image.

以下结合图2至图4来进一步描述详细的多模态特征融合的实体关系提取过程。The detailed entity relationship extraction process of multimodal feature fusion will be further described below in conjunction with FIG. 2 to FIG. 4 .

图3图示了根据本公开的实施例的多模态特征融合的实体关系提取的过程300的流程图。FIG. 3 illustrates a flowchart of a process 300 of entity relationship extraction based on multimodal feature fusion according to an embodiment of the present disclosure.

过程300可以由图1中的计算设备110来实施。为便于描述，将参照图1来描述过程300。Process 300 may be implemented by computing device 110 in FIG. 1 . For ease of description, process 300 will be described with reference to FIG. 1 .

在图3的步骤310，计算设备110针对包括字符的图像200中的多个区域中的每个区域，确定区域的视觉特征和区域的多个字符文本特征，字符文本特征对应于区域中的一个字符。例如，计算设备110针对图像200中的多个区域210-270中的每个区域，确定区域的视觉特征和字符211-217、221、223、231、233、241、243、…271、273的字符文本特征。In step 310 of FIG. 3 , computing device 110 , for each of the plurality of regions in image 200 including characters, determines a visual characteristic of the region and a plurality of character textual characteristics of the region, the character textual characteristics corresponding to a character in the region. For example, computing device 110 determines, for each of plurality of regions 210-270 in image 200, visual characteristics of the region and character textual characteristics of characters 211-217, 221, 223, 231, 233, 241, 243, . . . 271, 273.

区域的视觉特征可以表示区域在图像中的图像表观特征和其位置特征，计算设备110可以通过合适的算法或者模型确定区域的图像表观特征，例如通过卷神层对图像200进行处理所获取的特征图。计算设备110可以通过合适的算法或者模型确定区域在图像200中的位置来确定区域的位置特征。计算设备110可以对位置特征和图像表观特征进行加和等处理来确定视觉特征。对于区域中的字符文本特征。计算设备110可以利用光学字符识别技术来确定字符的字符文本特征。The visual features of the region may represent the apparent image characteristics and location characteristics of the region in the image, and the computing device 110 may determine the apparent image characteristics of the region through a suitable algorithm or model, for example, the feature map obtained by processing the image 200 through the convolution layer. The computing device 110 may determine the location characteristics of the region by determining the location of the region in the image 200 through a suitable algorithm or model. The computing device 110 may perform summation and other processing on the position feature and the image appearance feature to determine the visual feature. For character text features in regions. Computing device 110 may utilize optical character recognition techniques to determine the character textual characteristics of a character.

在图3的步骤320，计算设备110针对每个区域，基于区域的视觉特征和多个字符文本特征，确定区域的区域视觉语义特征。例如，在确定了上述区域的视觉特征和字符的字符文本特征后，计算设备110可以对上述特征进行进一步处理来确定区域的区域视觉语义特征，以用于后续的区域关联。In step 320 of FIG. 3 , for each region, the computing device 110 determines regional visual semantic features of the region based on the visual characteristics of the region and a plurality of character text features. For example, after determining the above-mentioned visual features of the region and character text features of the characters, the computing device 110 may further process the above-mentioned features to determine the regional visual semantic features of the region for subsequent region association.

具体上说，计算设备110可以对区域的视觉特征和多个字符文本特征进行融合处理，然后对经融合的特征进行特征增强来确定区域的区域视觉语义特征。区域的区域视觉语义特征不仅可以准确地表示区域所包括的字符的文本特征，还可以表示该区域在图像中的视觉特征、空间特征和位置特征。Specifically, the computing device 110 may perform fusion processing on the visual features of the region and multiple character text features, and then perform feature enhancement on the fused features to determine the regional visual semantic features of the region. The regional visual semantic features of a region can not only accurately represent the textual features of the characters included in the region, but also represent the visual, spatial and positional characteristics of the region in the image.

在图3的步骤330，计算设备110基于区域视觉语义特征，确定多个区域的关系信息，关系信息至少指示多个区域中的任意两个区域之间的关联程度。在准确地表示了区域的区域视觉语义特征后，计算设备110可以设置可学习参数矩阵P，然后根据如下公式(1)确定区域之间的关系信息：In step 330 of FIG. 3 , the computing device 110 determines relationship information of a plurality of regions based on the regional visual semantic features, and the relationship information at least indicates the degree of association between any two regions in the plurality of regions. After accurately representing the regional visual semantic features of the region, the computing device 110 can set the learnable parameter matrix P, and then determine the relationship information between regions according to the following formula (1):

A＝σ(MPM^t) 公式(1)A＝σ(MPM^t ) formula (1)

其中M为区域视觉语义特征，M^t表示M的转置矩阵，可学习参数矩阵P的维度和参数可以根据M进行设置。关系信息可以入下表1所示：Where M is the regional visual semantic feature, M^t represents the transposition matrix of M, and the dimensions and parameters of the learnable parameter matrix P can be set according to M. The relationship information can be entered as shown in Table 1 below:

表1Table 1

关联程度degree of association区域210Area 210区域220Area 220区域230Area 230区域240Area 240区域250Area 250区域260Area 260区域270Area 270区域210Area 210000000000000区域220Area 22000110.10.10.150.150.130.130.240.24区域230Area 23000110.20.20.20.20.30.30.30.3区域240Area 240000.10.10.20.2110.140.140.150.15区域250Area 250000.150.150.20.2110000区域260Area 260000.130.130.30.30.140.140011区域270Area 270000.240.240.30.30.150.150011

其中数字指示关联程度，数字越高代表关联程度越高。数字仅仅是示例性的，其不旨在限制本公开的范围。The number indicates the degree of association, and the higher the number, the higher the degree of association. The numbers are exemplary only and are not intended to limit the scope of the present disclosure.

在图3的步骤340，计算设备110基于关系信息，将多个区域中的区域相关联。例如，在确定了多个区域之间的关联程度后，可以根据待确定的信息对区域中的字符进行识别和提取。At step 340 of FIG. 3 , computing device 110 associates regions of the plurality of regions based on the relationship information. For example, after determining the degree of association between multiple regions, the characters in the regions can be identified and extracted according to the information to be determined.

在一些实施例中，计算设备110可以分别确定多个区域中的第一区域与多个区域中的、除第一区域之外的区域之间的关联程度。并且将具有最高关联程度的目标区域与第一区域相关联。例如，计算设备110可以从上表1(其中每两个区域之间都存在相应的关联程度)中分别确定两两区域中的关联程度，计算设备110从表1中可以确定第一区域220与除第一区域之外的区域210和230-270之间的关联程度分别为0、1、0.1、0.15、0.13和0.24。然后可以确定具有最高关联程度1的区域230作为目标区域。In some embodiments, the computing device 110 may respectively determine a degree of association between a first area of the plurality of areas and an area of the plurality of areas other than the first area. And the target area with the highest degree of association is associated with the first area. For example, the computing device 110 can determine the degree of association in each pair of regions from Table 1 above (where there is a corresponding degree of association between each two regions), and the computing device 110 can determine from Table 1 that the degrees of association between the first region 220 and the regions 210 and 230-270 other than the first region are 0, 1, 0.1, 0.15, 0.13, and 0.24, respectively. The region 230 with the highest degree of association 1 may then be determined as the target region.

备选地，在一些实施例中，计算设备110如果确定第一区域与除第一区域之外的两个区域具有相同的最高关联度，则可以将该两个区域同时作为目标区域，然后从其中确定待提取的实体名称和实体值。可以将多个区域相关联，本公开在此不做限制Alternatively, in some embodiments, if the computing device 110 determines that the first area has the same highest degree of association with the two areas other than the first area, the two areas may be used as target areas at the same time, and then determine the entity name and entity value to be extracted from them. Multiple areas can be associated, the disclosure is not limited here

通过上述确定的包括丰富的图像、文本、空间特征的区域视觉语义特征，可以准确地确定区域与区域之间的关联，进一步为实体的提取打下基础。在确定了相关联的区域之后，计算设备110可以根据所需要的信息提取目标区域和第一区域中的字符。Through the regional visual semantic features determined above including rich images, texts, and spatial features, the association between regions can be accurately determined, further laying a foundation for entity extraction. After determining the associated regions, the computing device 110 may extract characters in the target region and the first region according to required information.

在图3的步骤350，计算设备110针对获取的实体，提取实体关系。在确定了区域间的关联关系后，可以根据待确定的实体从多个区域中确定与实体相关的内容。In step 350 of FIG. 3 , the computing device 110 extracts entity relationships for the acquired entities. After determining the association relationship between the regions, content related to the entity can be determined from multiple regions according to the entity to be determined.

在一些实施例中，在针对已知结构的图像进行实体提取的过程中，计算设备110可以首先获取待确定的实体，实体具有相关联的实体名称和实体值，例如，实体名称为“姓名”，实体值为“张三”。然后计算设备110可以确定包括实体名称的第一区域。计算设备110可以提取所述第一区域中的第一字符。例如，计算设备110提取第一区域220中的第一字符“姓名”。接着，计算设备110基于所确定的第一区域，确定相关联的目标区域。如上述步骤340所述，已经确定第一区域220与目标区域230相关联，最后，计算设备110可以将区域230中包括的字符“张三”作为实体值。In some embodiments, during the process of extracting entities from images with known structures, the computing device 110 may first acquire the entities to be determined, and the entities have associated entity names and entity values, for example, the entity name is "name", and the entity value is "Zhang San". Computing device 110 may then determine a first region that includes the entity name. Computing device 110 may extract the first character in the first area. For example, the computing device 110 extracts the first character “name” in the first area 220 . Computing device 110 then determines an associated target area based on the determined first area. As described in step 340 above, it has been determined that the first area 220 is associated with the target area 230 , and finally, the computing device 110 may use the character "Zhang San" included in the area 230 as an entity value.

在一些实施例中，对于实体名称为“名字”，计算设备110可以确定“名字”的近义词，并且确定其近义词字符“姓名”包括在第一区域220中。计算设备110可以提取如上述步骤确定相关联的目标区域230中的目标字符“张三”，所提取的目标字符“张三”作为实体“姓名”的实体值。In some embodiments, for the entity name being “name”, computing device 110 may determine a synonym of “name” and determine that its synonym character “name” is included in the first region 220 . The computing device 110 may extract the target character "Zhang San" in the associated target area 230 as determined in the above steps, and the extracted target character "Zhang San" as the entity value of the entity "name".

备选地，在一些实施例中，计算设备110获取了待确定的实体的实体名称为“地址”。计算设备110在图像200中并没有找到包括字符“地址”的区域或者包括与字符“地址”的意思相近的字符的区域。则计算设备110例如可以通过用户界面向用户发出未找到该实体的提示。或者计算设备110可以在返回的实体提取结果中将“地址”的实体值标记为0。Alternatively, in some embodiments, the computing device 110 obtains the entity name of the entity to be determined as "address". The computing device 110 does not find an area including the character "address" or an area including characters similar in meaning to the character "address" in the image 200 . Then the computing device 110 may, for example, issue a prompt to the user through a user interface that the entity is not found. Or the computing device 110 may mark the entity value of "address" as 0 in the returned entity extraction result.

对于已知文本结构已知的图像，该识别方法特别有利，其通过确定区域之间的关系来提取实体，节省了算力。此外，由于上述区域关系的准确确定，对文本识别的准确率也得到提高。For images with known text structure, this recognition method is particularly beneficial, which saves computing power by determining the relationship between regions to extract entities. In addition, due to the accurate determination of the above-mentioned regional relationship, the accuracy rate of text recognition is also improved.

根据本公开的实施例，通过将图像中的各个区域的视觉特征和文本特征进行重组和融合，可以准确地确定区域之间的关系，从而可以提升文本识别的准确率。进一步地，可以准确地提取待确定的实体的实体内容。According to the embodiments of the present disclosure, by reorganizing and fusing the visual features and text features of each area in the image, the relationship between the areas can be accurately determined, thereby improving the accuracy of text recognition. Further, the entity content of the entity to be determined can be accurately extracted.

继续参见图2，针对步骤210“计算设备110针对包括字符的图像200中的多个区域中的每个区域，确定区域的视觉特征和区域的多个字符文本特征”，本实施例提供一种可选的实现方式，具体实现如下：Continuing to refer to FIG. 2 , for step 210 "computing device 110 determines the visual features of the area and the multiple character text features of the area for each of the multiple areas in the image 200 including characters", this embodiment provides an optional implementation, and the specific implementation is as follows:

计算设备110可以首先确定图像200的图像特征。然后基于图像特征和图像200中的多个区域的每个区域在图像200中的区域位置信息，确定区域的视觉特征。并且基于区域位置信息和区域中包括的字符，确定多个字符文本特征。例如，计算设备110可以使用Resnet(Residual Network，残差网络)中Resnet50卷积神经网络来提取图像200的特征图，并且将该特征图作为图像200的图像特征。请注意，上述神经网络仅仅是示例性的，还可以应用任何合适的神经网络模型(例如Resnet43、Resnet101)来确定图像200的图像特征。Computing device 110 may first determine image characteristics of image 200 . Then based on the image features and the area position information of each area in the image 200 of the multiple areas in the image 200 , the visual features of the area are determined. And based on the area location information and the characters included in the area, a plurality of character text features are determined. For example, the computing device 110 may use the Resnet50 convolutional neural network in Resnet (Residual Network) to extract the feature map of the image 200 and use the feature map as the image feature of the image 200 . Please note that the above neural network is only exemplary, and any suitable neural network model (such as Resnet43, Resnet101) can also be applied to determine the image features of the image 200 .

备选地，计算设备110可以利用合适的算法分别确定图像200(以及其中所包括的字符)的颜色特征、纹理特征、形状特征和空间关系特征等。然后将上述确定的特征进行融合(例如矩阵形式的拼接和加和)以确定图像200的特征。Alternatively, the computing device 110 may use appropriate algorithms to respectively determine the color features, texture features, shape features, and spatial relationship features of the image 200 (and the characters included therein). The features determined above are then fused (for example, concatenated and summed in a matrix form) to determine the features of the image 200 .

在确定图像200的图像特征后，计算设备110根据该图像特征确定相应区域的视觉特征。区域的视觉特征可以表示区域在图像中的图像表观特征和其位置特征。After determining the image feature of the image 200, the computing device 110 determines the visual feature of the corresponding region according to the image feature. The visual feature of the region can represent the image appearance feature and its position feature of the region in the image.

具体上讲，计算设备110可以确定图像200中的多个区域的每个区域在图像200中的区域位置信息。根据上述确定的图像特征和区域位置信息，确定区域的区域特征。然后将区域位置信息所对应的特征和区域特征进行组合，以确定区域的视觉特征。Specifically, the computing device 110 may determine region location information in the image 200 for each region of the plurality of regions in the image 200 . According to the above determined image features and area location information, the area features of the area are determined. Then the features corresponding to the regional location information are combined with the regional features to determine the visual features of the region.

例如，计算设备110可以首先确定图像200中的各个区域在图像200中的位置以作为区域位置信息。计算设备110可以应用EAST算法预测图像200中的包括字符的多个区域210-270的位置。例如，图像200经过EAST算法后的输出结果可以是图2所示的多个虚线框(多个区域)，每个虚线框中包围多个字符。计算设备110可以根据该多个虚线框确定每个区域在图像200中的区域位置信息。在一些实施例中，区域位置信息可以通过该区域的左上、右上、左下、右下四个点的坐标(虚线矩形框的四个顶点的坐标)来表示。备选地，在一个实施例中，在多个区域的区域大小相同的情况下，区域位置信息可以通过区域的中心点坐标来表示。还可以通过任何合适的模型和算法来确定区域在图像中的位置。在确定该位置的位置信息之后，计算设备110可以将该位置信息编码成向量(例如768维的向量)以作为区域位置信息(下文可以被记为S)。For example, the computing device 110 may first determine the position of each area in the image 200 in the image 200 as area position information. Computing device 110 may apply the EAST algorithm to predict the location of a plurality of regions 210-270 in image 200 that include characters. For example, the output result of the image 200 after being subjected to the EAST algorithm may be a plurality of dotted-line frames (multiple areas) shown in FIG. 2 , each of which encloses multiple characters. The computing device 110 may determine the area position information of each area in the image 200 according to the multiple dotted boxes. In some embodiments, the location information of the area may be represented by the coordinates of four upper left, upper right, lower left, and lower right points of the area (the coordinates of the four vertices of the dotted rectangular box). Alternatively, in an embodiment, in the case that multiple areas have the same area size, the area location information may be represented by the center point coordinates of the areas. The location of the region in the image can also be determined by any suitable model and algorithm. After determining the location information of the location, the computing device 110 may encode the location information into a vector (for example, a 768-dimensional vector) as area location information (hereinafter may be denoted as S).

在一些实施例中，计算设备110可以根据上述确定的图像200的特征和区域位置信息来确定区域的区域特征。例如，计算设备110可以使用ROI(regions of interest)Pooling(感兴趣的区域的池化操作，用于在图像的特征图中确定感兴趣区域的特征)操作在图像200的图像特征图中提取区域所在位置的图像表观特征，以作为区域的区域特征(下文可以被记为F)。In some embodiments, the computing device 110 may determine the regional characteristics of the region according to the above-mentioned determined characteristics of the image 200 and region location information. For example, the computing device 110 may use the ROI (regions of interest) Pooling (pooling operation of the region of interest, for determining the feature of the region of interest in the feature map of the image) operation to extract the image appearance feature of the location of the region in the image feature map of the image 200, as the region feature of the region (hereinafter may be denoted as F).

备选地，计算设备110可以将图像200根据上述确定的位置信息分割成多个子图像，然后利用合适的模型和算法确定多个子图像的图像特征以作为各个区域的区域特征。子图像的图像特征的确定方法参照上文描述(例如参照上文确定图像200的图像特征的方法)，在此不再赘述。Alternatively, the computing device 110 may divide the image 200 into multiple sub-images according to the above determined position information, and then determine the image features of the multiple sub-images using appropriate models and algorithms as regional features of each area. For the method of determining the image feature of the sub-image, refer to the above description (for example, refer to the above method for determining the image feature of the image 200 ), which will not be repeated here.

附加地或备选地，在区域的区域位置信息已经明确的情况下(例如对于预定格式的文件的图像)，可以根据预先确定的位置信息分别识别图像200中的不同区域，以确定各个区域的区域特征。Additionally or alternatively, in the case where the area position information of the area is already clear (for example, for an image of a file in a predetermined format), different areas in the image 200 may be identified according to the predetermined position information, so as to determine the area characteristics of each area.

在确定了图像中的相应区域的区域特征和位置特征后，计算设备110可以将其组合为区域的视觉特征，例如，在F和S为相同维度的特征向量时(例如都为768维的向量)，计算设备110可以利用如下公式(2)确定视觉特征：After determining the regional features and location features of the corresponding regions in the image, the computing device 110 can combine them into the visual features of the region. For example, when F and S are feature vectors of the same dimension (for example, both are 768-dimensional vectors), the computing device 110 can use the following formula (2) to determine the visual features:

视觉特征＝F+S公式(2)Visual features = F+S formula (2)

上述以向量加和的形式对特征进行组合仅仅示例性的，还存在其他合适的组合方式，本公开在此不做限制。可以理解的是，区域的该视觉特征融合了区域的图像表观特征和位置特征，该视觉特征与图像特征相比更加丰富，这为后续的字符识别任务打下基础，使得最终的处理结果更加准确。The above combination of features in the form of vector sum is only exemplary, and there are other suitable combination manners, which are not limited in the present disclosure. It can be understood that the visual feature of the region is integrated with the image appearance feature and position feature of the region, which is more abundant than the image feature, which lays the foundation for the subsequent character recognition task and makes the final processing result more accurate.

接下来，计算设备110可以确定字符的字符文本特征。例如，计算设备110可以根据上述位置信息，对图像200的虚线框内的字符使用光学字符识别技术(OCR)来确定其中每个字符。Next, computing device 110 may determine character text features for the character. For example, the computing device 110 may use optical character recognition technology (OCR) on the characters within the dotted frame of the image 200 to determine each character according to the above position information.

在一些实施例中，对于图像中的字符长短不同，可以考虑将不同长度的字符转化为同样的长度。例如，计算设备140可以从图像200中确定包括最长字符长度的区域210，例如将最长字符长度4作为字符的定长字符。对于其他区域220-270内的字符，可以利用特定符号对长度不足4的字符进行填充。然后对各个区域210-270进行识别。请注意，上述将最长字符长度定为4仅仅是示例性的，还可以根据不同的包括不同字符的不同图像存在其他长度(例如5、6或者模型可以确定的最长字符长度)的字符，本公开在此不做限制。在一些实施例中，计算设备110可以利用特定的不定长字符识别模型，如CRNN字符识别模型直接对各个区域中的字符进行识别。并且将该字符编码为向量以作为字符文本特征。为了方便表示，我们将定位有n个区域，每个区域包括ki个字符，我们得到字符文本特征的序列：In some embodiments, for characters with different lengths in the image, it may be considered to convert characters of different lengths into the same length. For example, the computing device 140 may determine the region 210 including the longest character length from the image 200, for example, taking the longest character length 4 as a fixed-length character of the character. For the characters in other areas 220-270, characters whose length is less than 4 can be filled with specific symbols. Each area 210-270 is then identified. Please note that the above-mentioned setting of the longest character length as 4 is only exemplary, and there may also be characters of other lengths (such as 5, 6 or the longest character length that can be determined by the model) according to different images including different characters, and the present disclosure is not limited here. In some embodiments, the computing device 110 can use a specific variable-length character recognition model, such as a CRNN character recognition model, to directly recognize characters in each region. And encode the character into a vector as a character text feature. For the convenience of representation, we will locate n regions, each region includes ki characters, and we get the sequence of character text features:

T＝(t₁,t₂,…,t_n)＝(c_1.1,c_1.2,…,c_1.k1,c_2.1,c_2.2,…,c_2.k2,…,c_n.1,…,c_n.kn)T=(t₁ ,t₂ ,...,t_n )=(c_1.1 ,c_1.2 ,...,c_1.k1 ,c_2.1 ,c_2.2 ,...,c_2.k2 ,...,c_n.1 ,...,c_n.kn )

其中T表示图像中的所有字符的字符文本特征，t1-tn表示每个区域中的所有字符的字符文本特征，Cij表示单个字符的字符文本特征，i∈n，j∈ki在已经确定了区域的视觉特征的情况下，进一步确定区域内的字符文本特征可以更加准确地表示相应的区域，从而使得对区域内的字符识别和提取更加准确。Where T represents the character text features of all characters in the image, t1-tn represents the character text features of all characters in each region, Cij represents the character text features of a single character, i∈n, j∈ki After the visual features of the region have been determined, further determining the character text features in the region can represent the corresponding region more accurately, so that the character recognition and extraction in the region are more accurate.

备选地，为了节省计算成本，计算设备110可以通过合适的算法或者模型直接确定字符的字符文本特征。而不必预先进行OCR识别再编码为字符文本特征。Alternatively, in order to save calculation costs, the computing device 110 may directly determine the character text features of the characters through a suitable algorithm or model. It is not necessary to carry out OCR recognition in advance and then encode it into character text features.

图4示出了根据本公开的一些实施例的用于确定区域视觉语义特征的过程400的示意图。本实施例针对步骤320“针对每个区域，基于区域的视觉特征和多个字符文本特征，确定区域的区域视觉语义特征”，提供其他可选的实现方式。FIG. 4 shows a schematic diagram of a process 400 for determining regional visual semantic features according to some embodiments of the present disclosure. This embodiment provides other optional implementation manners for step 320 of "determining the regional visual semantic features of the region based on the visual features of the region and multiple character text features for each region".

在图4的步骤410，计算设备110将多个区域的视觉特征和多个字符文本特征进行融合，以获取图像视觉语义特征。In step 410 of FIG. 4 , the computing device 110 fuses visual features of multiple regions and multiple character text features to obtain visual semantic features of the image.

在一些实施例中，计算设备110可以根据如下公式(3)确定图像视觉语义特征：In some embodiments, the computing device 110 may determine the visual semantic feature of the image according to the following formula (3):

V＝concat(T,F+S) 公式(3)V＝concat(T,F+S) formula (3)

也即，将上述确定的视觉特征F+S和图像中的所有字符的字符文本特征T进行拼接，以获取图像200的图像视觉语义特征。That is, the above-mentioned determined visual features F+S and character text features T of all characters in the image are concatenated to obtain image visual semantic features of the image 200 .

在一些实施例中，计算设备110可以对字符文本特征T、区域特征F和区域位置信息S设置不同的权重以根据如下公式(4)确定图像视觉语义特征：In some embodiments, the computing device 110 can set different weights for the character text feature T, the area feature F, and the area position information S to determine the visual semantic feature of the image according to the following formula (4):

V＝concat(αT,βF+γS) 公式(4)V＝concat(αT,βF+γS) formula (4)

其中α、β和γ可以根据测试结果或者应用场景的需求进行设置。Among them, α, β, and γ can be set according to test results or requirements of application scenarios.

备选地，在一些实施例中，计算设备110还可以利用AdaIN算法根据如下公式(5)来将区域特征F和区域位置特征S进行组合：Alternatively, in some embodiments, the computing device 110 can also use the AdaIN algorithm to combine the area feature F and the area position feature S according to the following formula (5):

其中σ是平均值，μ和标准差，可以将x设置为F，将y设置为S(反之亦可)。然后可以根据如下公式(6)来确定图像视觉语义特征：where σ is the mean, μ and standard deviation, you can set x to be F and y to be S (or vice versa). Then the visual semantic features of the image can be determined according to the following formula (6):

V＝concat(T,AdaIN(F,S)) 公式(6)V＝concat(T,AdaIN(F,S)) formula (6)

请注意，上述将字符文本特征T、区域特征F和区域位置信息S融合以确定图像视觉语义特征V仅仅是示例性的，可以采用除加和、拼接、AdaIN以外的其他合适的融合方法或其组合，本公开在此不做限制。Please note that the above-mentioned fusion of character text features T, area features F, and area position information S to determine image visual semantic features V is only exemplary, and other suitable fusion methods other than summation, splicing, and AdaIN or combinations thereof can be used, and the present disclosure is not limited here.

在图4的步骤420，计算设备110对图像视觉语义特征进行增强，以获取增强图像视觉语义特征。为了对图像视觉语义特征进行增强，计算设备110可以利用合适的算法使上述融合的特征V中的视觉特征F+S和字符文本特征T进一步融合。例如，可以利用多层双向转换自编码器(Bidirectional Encoder Representation from Transformers，BERT)增强图像视觉语义特征在空间、视觉、语义等模态上的信息表示。我们定义编码器的初始输入层H₀＝V，并且根据如下公式(7)定义编码器的编码方式：In step 420 of FIG. 4 , the computing device 110 enhances the visual semantic features of the image to obtain the enhanced visual semantic features of the image. In order to enhance the visual semantic features of the image, the computing device 110 may use a suitable algorithm to further fuse the visual features F+S and character text features T in the above-mentioned fused features V. For example, multi-layer Bidirectional Encoder Representation from Transformers (BERT) can be used to enhance the information representation of image visual semantic features in spatial, visual, semantic and other modalities. We define the initial input layer H₀ =V of the encoder, and define the encoding method of the encoder according to the following formula (7):

其中H_l-1,H_l表示分别第l层编码的输入特征和输出特征。模型使用多个全连接层(W_l*)对特征H_l-1进行变换并计算权重矩阵，再与H_l-1进行相乘，得到第l次融合的编码特征H_l。σ是归一化函数sigmoid。通过这样堆叠多次编码，使得视觉特征F+S和字符文本特征T在上述编码过程中交互信息，最后重组成更加丰富的增强图像视觉语义特征H。从上述公式(6)可以看出，H的维度没有变化，H中的每项与V中的每项相对应，区别在于H中的每项融合了相关联的项的特征。请注意，上述编码器和公式仅仅是示例性的，可以利用任何合适的方式融合特征中的信息。Among them, H_l-1 and H_l represent the input features and output features encoded in the first layer respectively. The model uses multiple fully connected layers (W_l* ) to transform the feature H_l-1 and calculate the weight matrix, and then multiply it with H l_-1 to obtain the encoded feature H_l of the l-th fusion. σ is the normalization function sigmoid. By stacking multiple encodings in this way, the visual feature F+S and the character text feature T exchange information during the above encoding process, and finally recombine into a richer enhanced image visual semantic feature H. It can be seen from the above formula (6) that the dimension of H does not change, and each item in H corresponds to each item in V, the difference is that each item in H incorporates the features of the associated items. Please note that the above encoders and formulas are merely exemplary, and information in features can be fused in any suitable way.

在图4的步骤430，计算设备110将增强图像视觉语义特征中的、一个区域中的多个字符文本特征进行平均，以获取相应区域的区域文本特征。上述得到的增强图像视觉语义特征H可以被表示为：In step 430 of FIG. 4 , the computing device 110 averages the textual features of multiple characters in a region in the visual semantic features of the enhanced image, so as to obtain the regional textual features of the corresponding region. The enhanced image visual semantic feature H obtained above can be expressed as:

H＝(x_1，1，x_1，2，…，x_1，k1，x_2，1，x_2，2，…，x_2，k2，…，x_n，1，…，x_n，kn，y_l，…，y_n)H=(x_1,1 , x_1,2 ,...,x_1,k1 ,x_2,1 ,x_2,2 ,...,x_2,k2 ,...,x_n,1 ,...,x_n,kn ,y_l ,...,y_n )

其中X_ij对应于字符文本特征C_ij增强后的特征，y_i对应于视觉特征F+S增强后的特征，i∈n，j∈k_i。计算设备110可以将增强图像视觉语义特征中的属于同一区域的多个字符文本特征X_ij进行平均以获取代表该区域的区域文本特征Q，Among them, X_ij corresponds to the enhanced feature of character text feature C_ij , y_i corresponds to the enhanced feature of visual feature F+S, i∈n,_j∈ki . The computing device 110 may average a plurality of character text features X_ij belonging to the same area in the enhanced image visual semantic features to obtain an area text feature Q representing the area,

在图4的步骤440，计算设备110基于区域文本特征和增强图像视觉语义特征中的、相应的视觉特征，确定相应区域的区域视觉语义特征。In step 440 of FIG. 4 , the computing device 110 determines the regional visual semantic features of the corresponding region based on the corresponding visual features in the regional text features and the enhanced image visual semantic features.

在一些实施例中，计算设备110可以将区域的区域文本特征q_i与该区域的经增强的视觉特征yi进行哈达玛积(Hadamard product)操作以获取该区域的区域视觉语义特征M,M＝{m_i；m_i＝q_i⊙y_i}。在一些实施例中，还可以对q_i和yi进行克罗内克积(Kroneckerproduct)操作。备选地，在一些实施例中，还可以应用标准矢量乘积的方式确定区域视觉语义特征。上述乘积操作仅为了将该区域内的字符的文本特征和区域的视觉、空间、位置特征融合在一起，还可以利用其他合适的操作进行融合，本公开在此不做限制。In some embodiments, the computing device 110 may perform a Hadamard product operation on the regional text feature q_i of the region and the enhanced visual feature yi of the region to obtain the regional visual semantic feature M of the region, M={m_i ; m_i =q_i ⊙y_i }. In some embodiments, a Kronecker product operation can also be performed on q_i and yi. Alternatively, in some embodiments, the regional visual semantic feature may also be determined by applying a standard vector product. The above product operation is only to fuse the text features of the characters in the area with the visual, spatial, and positional features of the area, and other suitable operations can also be used for fusion, which is not limited in the present disclosure.

通过组合(例如加和)、融合(例如拼接、AdaIN)、增强、以及乘积的多种方式，可以将每个区域的空间特征、语义特征以及视觉特征组合在一起，以构成代表该区域的特征，可以显著增加后续实体关系提取的准确率。Through combination (such as summation), fusion (such as splicing, AdaIN), enhancement, and multiplication, the spatial features, semantic features and visual features of each region can be combined to form a representative feature of the region, which can significantly increase the accuracy of subsequent entity relationship extraction.

图5示出了根据本公开的实施例的多模态特征融合的实体关系提取装置500的示意框图。如图5所示，装置500包括：第一特征确定模块510，被配置为针对包括字符的图像中的多个区域中的每个区域，确定区域的视觉特征和区域的多个字符文本特征，字符文本特征对应于区域中的一个字符；第二特征确定模块520，被配置为针对每个区域，基于区域的视觉特征和多个字符文本特征，确定区域的区域视觉语义特征；关系信息确定模块530，被配置为基于区域视觉语义特征，确定多个区域的关系信息，关系信息至少指示多个区域中的任意两个区域之间的关联程度；第一区域关联模块540，被配置为基于关系信息，将多个区域中的区域相关联；以及第一提取模块550，被配置为针对获取的实体，提取实体关系。Fig. 5 shows a schematic block diagram of an entity relationship extraction device 500 for multimodal feature fusion according to an embodiment of the present disclosure. As shown in FIG. 5 , the apparatus 500 includes: a first feature determination module 510 configured to, for each of multiple areas in an image including characters, determine visual features of the area and a plurality of character text features of the area, the character text features corresponding to a character in the area; a second feature determination module 520 configured to determine the regional visual semantic features of the area based on the visual features of the area and multiple character text features for each area; the relationship information determination module 530 is configured to determine the relationship information of the multiple areas based on the visual semantic features of the area, the relationship information indicates at least a plurality of The degree of association between any two areas in the area; the first area association module 540 is configured to associate areas in the plurality of areas based on the relationship information; and the first extraction module 550 is configured to extract entity relationships for the acquired entities.

在一些实施例中，其中第一特征确定模块510可以包括：图像特征确定模块，被配置为确定包括字符的图像的图像特征；第一视觉特征确定模块，被配置为基于图像特征和图像中的多个区域的每个区域在图像中的区域位置信息，确定区域的视觉特征；以及字符文本特征确定模块，被配置为基于区域位置信息和区域中包括的字符，确定多个字符文本特征。In some embodiments, the first feature determination module 510 may include: an image feature determination module configured to determine image features of an image including characters; a first visual feature determination module configured to determine visual features of the region based on the image features and region position information of each region in the image in the plurality of regions; and a character text feature determination module configured to determine a plurality of character text features based on the region position information and the characters included in the region.

在一些实施例中，其中第一视觉特征确定模块可以包括：区域位置信息确定模块，被配置为确定图像中的多个区域的每个区域在图像中的区域位置信息；区域特征确定模块，被配置为基于图像特征和区域位置信息，确定区域的区域特征；以及第二视觉特征确定模块，被配置为将区域位置信息和区域特征进行组合，以确定区域的视觉特征。In some embodiments, the first visual feature determination module may include: an area position information determination module configured to determine area position information in the image of each area of a plurality of areas in the image; an area feature determination module configured to determine area features of the area based on image features and area position information; and a second visual feature determination module configured to combine the area position information and the area features to determine the visual features of the area.

在一些实施例中，其中第二特征确定模块520可以包括：图像视觉语义特征确定模块，被配置为将多个区域的视觉特征和多个字符文本特征进行融合，以获取图像视觉语义特征；增强模块，被配置为对图像视觉语义特征进行增强，以获取增强图像视觉语义特征；区域文本特征确定模块，被配置为将增强图像视觉语义特征中的、一个区域中的多个字符文本特征进行平均，以获取相应区域的区域文本特征；以及区域视觉语义特征确定模块，被配置为基于区域文本特征和增强图像视觉语义特征中的、相应的视觉特征，确定相应区域的区域视觉语义特征。In some embodiments, the second feature determination module 520 may include: an image visual semantic feature determination module configured to fuse visual features of multiple regions and multiple character text features to obtain image visual semantic features; an enhancement module configured to enhance image visual semantic features to obtain enhanced image visual semantic features; a regional text feature determination module configured to average multiple character text features in a region in the enhanced image visual semantic features to obtain regional text features of the corresponding region; feature and enhance the corresponding visual features in the visual semantic features of the image, and determine the regional visual semantic features of the corresponding region.

在一些实施例中，其中第一区域关联模块540可以包括：关联程度确定模块，被配置为分别确定多个区域中的第一区域与多个区域中的、除第一区域之外的区域之间的关联程度；以及第二区域关联模块，被配置为将具有最高关联程度的目标区域与第一区域相关联。In some embodiments, the first area association module 540 may include: an association degree determination module configured to respectively determine the degree of association between the first area in the plurality of areas and areas other than the first area in the plurality of areas; and a second area association module configured to associate the target area with the highest degree of association with the first area.

在一些实施例中，其中第一提取模块550可以包括：实体获取模块，被配置为获取待确定的实体，实体具有相关联的实体名称和实体值；区域确定模块，被配置为确定包括实体名称的第一区域目标区域确定模块，被配置为基于所确定的第一区域，确定相关联的目标区域；以及实体值确定模块，被配置为将目标区域中包括的字符作为实体值。In some embodiments, the first extraction module 550 may include: an entity acquisition module configured to acquire an entity to be determined, the entity having an associated entity name and an entity value; an area determination module configured to determine a first area including the entity name; a target area determination module configured to determine an associated target area based on the determined first area; and an entity value determination module configured to use characters included in the target area as entity values.

图6示出了可以用来实施本公开的实施例的示例电子设备600的示意性框图。电子设备旨在表示各种形式的数字计算机，诸如，膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置，诸如，个人数字处理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例，并且不意在限制本文中描述的和/或者要求的本公开的实现。FIG. 6 shows a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

如图6所示，设备600包括计算单元601，其可以根据存储在只读存储器(ROM)602中的计算机程序或者从存储单元608加载到随机访问存储器(RAM)603中的计算机程序，来执行各种适当的动作和处理。在RAM 603中，还可存储设备600操作所需的各种程序和数据。计算单元601、ROM 602以及RAM 603通过总线604彼此相连。输入/输出(I/O)接口605也连接至总线604。As shown in FIG. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to computer programs stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the device 600 can also be stored. The calculation unit 601 , the ROM 602 and the RAM 603 are connected to each other through a bus 604 . An input/output (I/O) interface 605 is also connected to the bus 604 .

设备600中的多个部件连接至I/O接口605，包括：输入单元606，例如键盘、鼠标等；输出单元607，例如各种类型的显示器、扬声器等；存储单元608，例如磁盘、光盘等；以及通信单元609，例如网卡、调制解调器、无线通信收发机等。通信单元609允许设备600通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。Multiple components in the device 600 are connected to the I/O interface 605, including: an input unit 606, such as a keyboard, a mouse, etc.; an output unit 607, such as various types of displays, speakers, etc.; a storage unit 608, such as a magnetic disk, an optical disk, etc.; and a communication unit 609, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.

计算单元601可以是各种具有处理和计算能力的通用和/或专用处理组件。计算单元601的一些示例包括但不限于中央处理单元(CPU)、图形处理单元(GPU)、各种专用的人工智能(AI)计算芯片、各种运行机器学习模型算法的计算单元、数字信号处理器(DSP)、以及任何适当的处理器、控制器、微控制器等。计算单元601执行上文所描述的各个装置和处理，例如过程300和过程400。例如，在一些实施例中，过程300和过程400可被实现为计算机软件程序，其被有形地包含于机器可读介质，例如存储单元608。在一些实施例中，计算机程序的部分或者全部可以经由ROM 602和/或通信单元609而被载入和/或安装到设备600上。当计算机程序加载到RAM 603并由计算单元601执行时，可以执行上文描述的过程300和过程400的一个或多个步骤。备选地，在其他实施例中，计算单元601可以通过其他任何适当的方式(例如，借助于固件)而被配置为执行过程300和过程400。The computing unit 601 may be various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of computing units 601 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, etc. The computing unit 601 executes the various devices and processes described above, such as the process 300 and the process 400 . For example, in some embodiments, process 300 and process 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608 . In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609 . When the computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of process 300 and process 400 described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to execute the process 300 and the process 400 in any other suitable manner (for example, by means of firmware).

本文中以上描述的系统和技术的各种实施方式可以在数字电子电路系统、集成电路系统、场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、芯片上系统的系统(SOC)、负载可编程逻辑设备(CPLD)、计算机硬件、固件、软件、和/或它们的组合中实现。这些各种实施方式可以包括：实施在一个或者多个计算机程序中，该一个或者多个计算机程序可在包括至少一个可编程处理器的可编程系统上执行和/或解释，该可编程处理器可以是专用或者通用可编程处理器，可以从存储系统、至少一个输入装置、和至少一个输出装置接收数据和指令，并且将数据和指令传输至该存储系统、该至少一个输入装置、和该至少一个输出装置。Various implementations of the systems and techniques described above herein may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs executable and/or interpreted on a programmable system comprising at least one programmable processor, which may be a special purpose or general purpose programmable processor, capable of receiving data and instructions from and transmitting data and instructions to a storage system, at least one input device, and at least one output device.

用于实施本公开的装置的程序代码可以采用一个或多个编程语言的任何组合来编写。这些程序代码可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器或控制器，使得程序代码当由处理器或控制器执行时使流程图和/或框图中所规定的功能/操作被实施。程序代码可以完全在机器上执行、部分地在机器上执行，作为独立软件包部分地在机器上执行且部分地在远程机器上执行或完全在远程机器或服务器上执行。Program code for implementing the apparatus of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to processors or controllers of general-purpose computers, special purpose computers, or other programmable data processing devices, so that the program codes cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented when executed by the processors or controllers. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

在本公开的上下文中，机器可读介质可以是有形的介质，其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备，或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include one or more wire-based electrical connections, a portable computer disk, a hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

为了提供与用户的交互，可以在计算机上实施此处描述的系统和技术，该计算机具有：用于向用户显示信息的显示装置(例如，CRT(阴极射线管)或者LCD(液晶显示器)监视器)；以及键盘和指向装置(例如，鼠标或者轨迹球)，用户可以通过该键盘和该指向装置来将输入提供给计算机。其它种类的装置还可以用于提供与用户的交互；例如，提供给用户的反馈可以是任何形式的传感反馈(例如，视觉反馈、听觉反馈、或者触觉反馈)；并且可以用任何形式(包括声输入、语音输入或者、触觉输入)来接收来自用户的输入。To provide for interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other types of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, voice input, or tactile input.

可以将此处描述的系统和技术实施在包括后台部件的计算系统(例如，作为数据服务器)、或者包括中间件部件的计算系统(例如，应用服务器)、或者包括前端部件的计算系统(例如，具有图形用户界面或者网络浏览器的用户计算机，用户可以通过该图形用户界面或者该网络浏览器来与此处描述的系统和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算系统中。可以通过任何形式或者介质的数字数据通信(例如，通信网络)来将系统的部件相互连接。通信网络的示例包括：局域网(LAN)、广域网(WAN)和互联网。The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, eg, a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.

计算机系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。服务器可以是云服务器，又称为云计算服务器或云主机，是云计算服务体系中的一项主机产品，以解决了传统物理主机与VPS服务("Virtual Private Server"，或简称"VPS")中，存在的管理难度大，业务增广性弱的缺陷。服务器也可以为分布式系统的服务器，或者是结合了区块链的服务器。A computer system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system to solve the defects of difficult management and weak business expansion in traditional physical hosts and VPS services ("Virtual Private Server", or "VPS" for short). The server can also be a server of a distributed system, or a server combined with a blockchain.

应该理解，可以使用上面所示的各种形式的流程，重新排序、增加或删除步骤。例如，本发公开中记载的各步骤可以并行地执行也可以顺序地执行也可以不同的次序执行，只要能够实现本公开公开的技术方案所期望的结果，本文在此不进行限制。It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

上述具体实施方式，并不构成对本公开保护范围的限制。本领域技术人员应该明白的是，根据设计要求和其他因素，可以进行各种修改、组合、子组合和替代。任何在本公开的精神和原则之内所作的修改、等同替换和改进等，均应包含在本公开保护范围之内。The specific implementation manners described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.