技术领域Technical field
本发明涉及OCR识别技术领域,尤其涉及一种基于图神经网络的OCR表格语义识别方法及装置。The present invention relates to the field of OCR recognition technology, and in particular to an OCR table semantic recognition method and device based on a graph neural network.
背景技术Background technique
随着目前计算机领域计算能力和并行计算技术的快速发展,深度学习可训练的参数量日益提升,这也使得深度学习对复杂数据的学习能力逐渐增强,最终在各个领域得以应用。With the rapid development of computing power and parallel computing technology in the computer field, the number of trainable parameters in deep learning models has grown steadily, which has progressively strengthened deep learning's ability to learn from complex data and has led to its application in many fields.
传统OCR(optical character recognition)中通过将模板匹配方法和结构分析法等方法用于文字识别。在OCR文字识别的基础上,研究人员进一步提出了如何识别表格结构和表格文本信息的问题,其在文字识别的工作流程之中增加了表格检测、表格结构分解以及表格结构识别等步骤。如现有技术中的Marcin Namysl等人在《Flexible Table Recognition and Semantic Interpretation System》一文中提出在基于规则的算法上结合语义信息提取表格。Yiren Li等人在《GFTE: Graph-based Financial Table Extraction》中提出一个GFTE模型将图像特征、位置特征和文本特征融合在一起,以提高从非结构化数据文件中提取表格的能力。Traditional OCR (optical character recognition) applies methods such as template matching and structural analysis to text recognition. Building on OCR text recognition, researchers further raised the question of how to recognize table structure and table text information, adding steps such as table detection, table structure decomposition and table structure recognition to the text recognition workflow. For example, in the prior art, Marcin Namysl et al., in "Flexible Table Recognition and Semantic Interpretation System", proposed extracting tables by combining rule-based algorithms with semantic information. Yiren Li et al., in "GFTE: Graph-based Financial Table Extraction", proposed a GFTE model that fuses image features, position features and text features to improve the ability to extract tables from unstructured data files.
但是,在智能制造领域,用户的需求不仅仅满足于对表格结构的识别,对表格键值识别以及其对应关系的识别也越发重视。此外,实际生活中存在着各式各样的表格,譬如,在品质检验报告书审核工作流程中,当用户提交其品质检验报告书(一般以pdf文件提供),作为审核的一方需检验对应的检测指标是否符合行业标准以及产品名称等信息是否符合审核要求,表格中的键值识别及其对应关系的识别能影响识别效率和准确性。目前的OCR技术在表格识别上的实际应用难以满足自动化和工业化的需求,缺乏高效和准确的解决方案。However, in the field of intelligent manufacturing, users are no longer satisfied with recognizing table structure alone; increasing attention is paid to recognizing table key values and their correspondences. In addition, tables of all kinds exist in real life. For example, in the quality inspection report review workflow, when a user submits a quality inspection report (usually provided as a PDF file), the reviewer needs to verify whether the corresponding test indicators comply with industry standards and whether information such as the product name meets the review requirements; the recognition of key values in the table and of their correspondences therefore affects review efficiency and accuracy. The practical application of current OCR technology to table recognition struggles to meet the needs of automation and industrialization, and efficient and accurate solutions are lacking.
发明内容Contents of the invention
本发明提供了一种基于图神经网络的OCR表格语义识别方法及装置,能有效识别表格存在的键值并发现键值之间存在的对应关系,从而满足自动化表格审核等工业上的实际需求。The present invention provides an OCR table semantic recognition method and device based on a graph neural network, which can effectively identify the keys and values present in a table and discover the correspondences between them, thereby meeting practical industrial needs such as automated table review.
为了解决上述技术问题,本发明实施例提供了一种基于图神经网络的OCR表格语义识别方法,包括:In order to solve the above technical problems, embodiments of the present invention provide an OCR table semantic recognition method based on graph neural network, including:
获取待识别的第一PNG表格图片;其中,所述第一PNG表格图片是由PDF表格经过预处理后而获得;Obtain the first PNG table image to be recognized; wherein the first PNG table image is obtained from the PDF table after preprocessing;
将所述第一PNG表格图片输入至训练好的GKVR识别模型,以使所述GKVR识别模型对所述第一PNG表格图片进行OCR识别,获得第一文本信息、表格框信息和各文字节点的位置信息,并根据所述第一文本信息和预设的词汇表,通过GRU网络生成所述第一文本信息对应的句向量特征,再通过卷积神经网络和grid_simple算法将所述表格框信息转换为节点图像特征,继而将各文字节点的位置信息进行归一化处理后,获得位置特征,最后将所述第一文本信息对应的句向量特征和所述位置特征分别输入到图注意力网络,与所述节点图像特征拼接后经过多层感知器MLP,输出所述第一PNG表格图片对应的键值信息集合;其中,所述键值信息集合包括:键信息集合和值信息集合;Input the first PNG table image into the trained GKVR recognition model, so that the GKVR recognition model performs OCR on the first PNG table image to obtain first text information, table frame information and the position information of each text node; generates, through a GRU network and according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information; converts the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; then normalizes the position information of each text node to obtain position features; and finally feeds the sentence vector features corresponding to the first text information and the position features respectively into a graph attention network, concatenates the result with the node image features and passes it through a multi-layer perceptron (MLP), outputting the key-value information set corresponding to the first PNG table image; wherein the key-value information set includes a key information set and a value information set;
根据预设的划分规则树,对所述键值信息集合进行遍历匹配,输出所述键值信息集合中的各键值对。According to the preset division rule tree, the key-value information set is traversed and matched, and each key-value pair in the key-value information set is output.
作为优选方案,所述训练好的GKVR识别模型包括:句向量特征提取模块;As a preferred solution, the trained GKVR recognition model includes: a sentence vector feature extraction module;
所述句向量特征提取模块的训练过程具体为:The training process of the sentence vector feature extraction module is specifically as follows:
根据预设的词汇表,对训练样本中各文本节点的文本内容进行词汇识别,生成字符串,并对每一字符串进行one-hot编码后应用一层单向前馈网络进行词嵌入,获得每个文本节点对应的词序列;According to the preset vocabulary, perform vocabulary recognition on the text content of each text node in the training sample to generate character strings, perform one-hot encoding on each character string, and then apply a single-layer unidirectional feedforward network for word embedding to obtain the word sequence corresponding to each text node;
通过GRU网络对各词序列中的语义进行学习,生成各文本节点的句向量特征。The GRU network is used to learn the semantics in each word sequence and generate the sentence vector features of each text node.
作为优选方案,所述训练好的GKVR识别模型包括:节点图像特征提取模块;As a preferred solution, the trained GKVR recognition model includes: a node image feature extraction module;
所述节点图像特征提取模块的训练过程具体为:The training process of the node image feature extraction module is specifically as follows:
获取训练样本中的多个表格框信息,并通过卷积神经网络对各表格框信息进行图片结构信息提取,获得多个第一特征图;Obtain multiple table box information in the training sample, and extract image structure information from each table box information through a convolutional neural network to obtain multiple first feature maps;
通过grid_simple算法将所述多个第一特征图以双线性插值的方法放缩至网格中,并将每个文字节点对应坐标的网格特征作为文字节点的节点图像特征。The plurality of first feature maps are scaled into a grid using a bilinear interpolation method through the grid_simple algorithm, and the grid features corresponding to the coordinates of each text node are used as the node image features of the text node.
作为优选方案,所述训练好的GKVR识别模型包括:位置特征提取模块;As a preferred solution, the trained GKVR recognition model includes: a location feature extraction module;
所述位置特征提取模块的训练过程具体为:The training process of the location feature extraction module is specifically as follows:
获取训练样本中各文本节点的位置信息;Obtain the position information of each text node in the training sample;
将各位置信息进行坐标转换,并将坐标系归一化至[-1,1]区间内,输出各文本节点对应的位置特征。Coordinate conversion is performed on each position information, and the coordinate system is normalized to the [-1,1] interval, and the position characteristics corresponding to each text node are output.
作为优选方案,所述训练好的识别模型的训练过程具体为:As a preferred solution, the training process of the trained recognition model is specifically:
将训练样本中各文本节点对应的句向量特征、节点图像特征和位置特征作为GKVR识别模型的输入,将各文本节点对应的键信息和值信息作为GKVR识别模型的输出;The sentence vector features, node image features and position features corresponding to each text node in the training sample are used as the input of the GKVR recognition model, and the key information and value information corresponding to each text node are used as the output of the GKVR recognition model;
对于各文本节点,分别将句向量特征、位置特征输入到图注意力网络,与所述节点图像特征进行拼接后,组成各文本节点的节点特征,结合所述GKVR识别模型的输出,训练图注意力网络和多层感知器MLP。For each text node, the sentence vector features and position features are respectively input into the graph attention network and, after being concatenated with the node image features, form the node features of each text node; combined with the output of the GKVR recognition model, the graph attention network and the multi-layer perceptron (MLP) are trained.
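As an illustrative sketch of the concatenate-then-classify step described above (feature dimensions, the hidden layer size, and random weights are assumptions; the graph attention network itself is omitted and its outputs are stood in for by raw feature vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed feature sizes; the patent does not fix these dimensions.
SENT_DIM, POS_DIM, IMG_DIM, HIDDEN, N_CLASSES = 16, 4, 32, 64, 3  # Key / Value / Other

W1 = rng.normal(size=(SENT_DIM + POS_DIM + IMG_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, N_CLASSES))
b2 = np.zeros(N_CLASSES)

def classify_node(sent_feat, pos_feat, img_feat):
    """Concatenate the per-node features (in the full model the GAT outputs
    would replace the raw sentence/position features) and run a two-layer
    MLP to obtain class logits over Key / Value / Other."""
    node_feat = np.concatenate([sent_feat, pos_feat, img_feat])
    hidden = np.maximum(0.0, node_feat @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2                        # unnormalized class logits

logits = classify_node(rng.normal(size=SENT_DIM),
                       rng.normal(size=POS_DIM),
                       rng.normal(size=IMG_DIM))
```

In training, the logits would be compared against the annotated key/value labels with a cross-entropy loss to update the GAT and MLP parameters.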
作为优选方案,所述第一PNG表格图片是由PDF表格经过预处理后而获得,具体为:As a preferred solution, the first PNG table image is obtained from the PDF table after preprocessing, specifically:
获取待处理的PDF文档,并通过KVLabel工具从所述PDF文档中截取表格部分,生成所述第一PNG表格图片。Obtain the PDF document to be processed, intercept the table part from the PDF document through the KVLabel tool, and generate the first PNG table image.
作为优选方案,所述KVLabel工具还用于对所述GKVR识别模型的训练样本进行预处理,具体为:As a preferred solution, the KVLabel tool is also used to preprocess the training samples of the GKVR recognition model, specifically:
通过所述KVLabel工具对初始样本中的PDF文档进行表格框选取,并对表格框中的各文本节点进行键值标注及键值对标注,生成各初始样本对应的PNG表格图片,将所有PNG表格图片、键值标注及键值对标注作为所述训练样本。Use the KVLabel tool to select the table frame of the PDF document in each initial sample, perform key-value annotation and key-value pair annotation on each text node in the table frame, generate the PNG table image corresponding to each initial sample, and use all the PNG table images, key-value annotations and key-value pair annotations as the training samples.
作为优选方案,所述根据预设的划分规则树,对所述键值信息集合进行遍历匹配,输出所述键值信息集合中的各键值对,具体为:As a preferred solution, the key-value information set is traversed and matched according to a preset division rule tree, and each key-value pair in the key-value information set is output, specifically as follows:
通过广度优先遍历划分规则树对键信息集合进行逐步划分,并在到达叶子节点时选取值信息集合中的值,产生若干个键值对。The key information set is progressively divided through breadth-first traversal of the division rule tree, and when a leaf node is reached, a value from the value information set is selected, producing several key-value pairs.
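The internal layout of the division rule tree is not specified here, so the following is only an illustrative sketch (the `children`/`pred`/`select` fields and the row-based matching are all assumptions) of a breadth-first traversal that narrows the key set at internal nodes and emits key-value pairs at the leaves:

```python
from collections import deque

def match_key_values(rule_tree, keys, values):
    """BFS over a division rule tree: internal nodes narrow the candidate
    key set with a predicate; leaves pick a matching value for each
    surviving key, yielding (key text, value text) pairs."""
    pairs = []
    queue = deque([(rule_tree, list(keys))])
    while queue:
        node, candidates = queue.popleft()
        for child in node.get("children", []):
            queue.append((child, [k for k in candidates if child["pred"](k)]))
        if "children" not in node:  # leaf: select values for surviving keys
            for k in candidates:
                v = node["select"](k, values)
                if v is not None:
                    pairs.append((k["text"], v["text"]))
    return pairs

# Toy example: keys and values are matched by a shared row index.
keys = [{"text": "Product", "row": 0}, {"text": "Result", "row": 1}]
values = [{"text": "Widget-A", "row": 0}, {"text": "Pass", "row": 1}]
same_row = lambda k, vs: next((v for v in vs if v["row"] == k["row"]), None)
tree = {"children": [
    {"pred": lambda k: k["row"] == 0, "select": same_row},
    {"pred": lambda k: k["row"] == 1, "select": same_row},
]}
result = match_key_values(tree, keys, values)
```

A real rule tree would encode domain rules (e.g. indicator names versus standard values in a quality inspection report) rather than raw row indices.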
作为优选方案,所述划分规则树设置在所述GKVR识别模型中。As a preferred solution, the division rule tree is set in the GKVR recognition model.
本发明另一实施例对应提供了一种基于图神经网络的OCR表格语义识别装置,包括:获取单元、识别单元和输出单元;Another embodiment of the present invention provides an OCR table semantic recognition device based on a graph neural network, including: an acquisition unit, a recognition unit and an output unit;
其中,所述获取单元用于获取待识别的第一PNG表格图片;其中,所述第一PNG表格图片是由PDF表格经过预处理后而获得;Wherein, the obtaining unit is used to obtain the first PNG table picture to be recognized; wherein the first PNG table picture is obtained from the PDF table after preprocessing;
所述识别单元用于将所述第一PNG表格图片输入至训练好的GKVR识别模型,以使所述GKVR识别模型对所述第一PNG表格图片进行OCR识别,获得第一文本信息、表格框信息和各文字节点的位置信息,并根据所述第一文本信息和预设的词汇表,通过GRU网络生成所述第一文本信息对应的句向量特征,再通过卷积神经网络和grid_simple算法将所述表格框信息转换为节点图像特征,继而将各文字节点的位置信息进行归一化处理后,获得位置特征,最后将所述第一文本信息对应的句向量特征和所述位置特征分别输入到图注意力网络,与所述节点图像特征拼接后经过多层感知器MLP,输出所述第一PNG表格图片对应的键值信息集合;其中,所述键值信息集合包括:键信息集合和值信息集合;The recognition unit is configured to input the first PNG table image into the trained GKVR recognition model, so that the GKVR recognition model performs OCR on the first PNG table image to obtain first text information, table frame information and the position information of each text node; generates, through a GRU network and according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information; converts the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; then normalizes the position information of each text node to obtain position features; and finally feeds the sentence vector features corresponding to the first text information and the position features respectively into a graph attention network, concatenates the result with the node image features and passes it through a multi-layer perceptron (MLP), outputting the key-value information set corresponding to the first PNG table image; wherein the key-value information set includes a key information set and a value information set;
所述输出单元用于根据预设的划分规则树,对所述键值信息集合进行遍历匹配,输出所述键值信息集合中的各键值对。The output unit is configured to traverse and match the key-value information set according to a preset division rule tree, and output each key-value pair in the key-value information set.
相比于现有技术,本发明实施例具有如下有益效果:Compared with the prior art, embodiments of the present invention have the following beneficial effects:
本发明提供了一种基于图神经网络的OCR表格语义识别方法及装置,将PNG表格图片输入至训练好的GKVR识别模型,通过模型中对文本节点的句向量特征、节点图像特征和位置特征,能够准确判断表格节点的属性为键或值;而且通过设置划分规则树的方式实现键值之间的匹配,能够提高表格的键与值之间关系识别的能力。相比于现有技术难以直接提取的便携式文档格式和图像,本发明结合了图神经网络以及门控循环单元等深度学习网络结构,提出了用于进行表格键值识别的GKVR网络模型,能够实现一键识别,满足自动化表格审核等工业上的实际需求。The present invention provides an OCR table semantic recognition method and device based on a graph neural network. A PNG table image is input into a trained GKVR recognition model, and the model's use of the sentence vector features, node image features and position features of text nodes makes it possible to accurately judge whether a table node's attribute is a key or a value; moreover, matching between keys and values is achieved by setting a division rule tree, improving the ability to recognize the relationships between a table's keys and values. For portable document format files and images, from which the prior art has difficulty extracting information directly, the present invention combines deep-learning network structures such as graph neural networks and gated recurrent units to propose the GKVR network model for table key-value recognition, achieving one-click recognition and meeting practical industrial needs such as automated table review.
附图说明Description of drawings
图1:为本发明实施例提供的基于图神经网络的OCR表格语义识别方法的一种实施例的流程示意图;Figure 1: A schematic flow chart of an embodiment of the OCR table semantic recognition method based on graph neural network provided by the embodiment of the present invention;
图2:为现有技术提供的原chunk矩形标注绘制在原PNG图片上效果示意图;Figure 2: Schematic diagram of the effect of drawing the original chunk rectangular annotation on the original PNG image provided by the existing technology;
图3:为本发明实施例提供的由原PDF文件截取表格部分制作PNG图片的效果示意图;Figure 3: A schematic diagram of the effect of producing a PNG image from the intercepted form part of the original PDF file provided by the embodiment of the present invention;
图4:为本发明实施例提供的chunk数据的示意图;Figure 4: Schematic diagram of chunk data provided for the embodiment of the present invention;
图5:为本发明实施例提供的GKVR识别模型的示意图;Figure 5: Schematic diagram of the GKVR recognition model provided by the embodiment of the present invention;
图6:为本发明实施例提供的基于GCN与基于GAT的GKVR模型训练过程损失值表现示意图;Figure 6: A schematic diagram of the loss value performance of the GCN-based and GAT-based GKVR model training process provided by the embodiment of the present invention;
图7:为本发明实施例提供的基于GCN与基于GAT的GKVR模型训练过程准确度表现示意图;Figure 7: A schematic diagram of the accuracy performance of the GCN-based and GAT-based GKVR model training process provided by the embodiment of the present invention;
图8:为本发明实施例提供的SciTSR-Key-Value数据集键值匹配中使用的划分规则树的示意图。Figure 8: A schematic diagram of the partitioning rule tree used in key-value matching of the SciTSR-Key-Value data set provided by an embodiment of the present invention.
具体实施方式Detailed ways
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without making creative efforts fall within the scope of protection of the present invention.
实施例一Embodiment 1
请参照图1,为本发明实施例提供的基于图神经网络的OCR表格语义识别方法的一种实施例的流程示意图,该方法包括步骤101至步骤103,各步骤具体如下:Please refer to Figure 1, which is a schematic flow chart of an embodiment of a graph neural network-based OCR table semantic recognition method provided by an embodiment of the present invention. The method includes steps 101 to 103, and each step is specifically as follows:
步骤101:获取待识别的第一PNG表格图片;其中,所述第一PNG表格图片是由PDF表格经过预处理后而获得。Step 101: Obtain the first PNG table image to be recognized; wherein the first PNG table image is obtained from the PDF table after preprocessing.
在本实施例中,所述第一PNG表格图片是由PDF表格经过预处理后而获得,具体为:获取待处理的PDF文档,并通过KVLabel工具从所述PDF文档中截取表格部分,生成所述第一PNG表格图片。In this embodiment, the first PNG table image is obtained from a PDF table after preprocessing. Specifically, the PDF document to be processed is obtained, and the table part is cropped from the PDF document through the KVLabel tool to generate the first PNG table image.
具体地,KVLabel工具是一款用于对PDF文档进行标注的工具,可实现区域信息标注、节点属性标注(如某一节点为键或值)以及节点键值对关系标注等功能。针对待识别的PDF文档,先通过KVLabel工具截取其表格部分,再转换为第一PNG表格图片。Specifically, the KVLabel tool is a tool used to annotate PDF documents, which can realize functions such as regional information annotation, node attribute annotation (such as a node being a key or a value), and node key-value pair relationship annotation. For the PDF document to be recognized, first intercept the table part through the KVLabel tool, and then convert it into the first PNG table image.
步骤102:将所述第一PNG表格图片输入至训练好的GKVR识别模型,以使所述GKVR识别模型对所述第一PNG表格图片进行OCR识别,获得第一文本信息、表格框信息和各文字节点的位置信息,并根据所述第一文本信息和预设的词汇表,通过GRU网络生成所述第一文本信息对应的句向量特征,再通过卷积神经网络和grid_simple算法将所述表格框信息转换为节点图像特征,继而将各文字节点的位置信息进行归一化处理后,获得位置特征,最后将所述第一文本信息对应的句向量特征和所述位置特征分别输入到图注意力网络,与所述节点图像特征拼接后经过多层感知器MLP,输出所述第一PNG表格图片对应的键值信息集合;其中,所述键值信息集合包括:键信息集合和值信息集合。Step 102: Input the first PNG table image into the trained GKVR recognition model, so that the GKVR recognition model performs OCR on the first PNG table image to obtain first text information, table frame information and the position information of each text node; generates, through a GRU network and according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information; converts the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; then normalizes the position information of each text node to obtain position features; and finally feeds the sentence vector features corresponding to the first text information and the position features respectively into a graph attention network, concatenates the result with the node image features and passes it through a multi-layer perceptron (MLP), outputting the key-value information set corresponding to the first PNG table image; wherein the key-value information set includes a key information set and a value information set.
在本实施例中,在执行步骤102之前,需要使用训练样本对GKVR识别模型进行训练,通过训练好的GKVR识别模型对第一PNG表格图片进行识别,输出对应的键值信息集合。其中,键值信息集合包括:键信息集合、值信息集合和其他信息集合。其他信息集合就是除键和值之外的其他内容的集合,如在表格中的表头等内容。In this embodiment, before executing step 102, the GKVR recognition model needs to be trained using training samples, the first PNG table image is recognized through the trained GKVR recognition model, and the corresponding key value information set is output. Among them, the key-value information set includes: key information set, value information set and other information sets. Other information collections are collections of other content besides keys and values, such as headers in tables.
在本实施例中,可以但不限于从SciTSR数据集提取若干个表格数据,再通过KVLabel工具对若干个表格数据进行预处理后,得到训练样本的数据集SciTSR-Key-Value。对所述GKVR识别模型的训练样本进行预处理,具体为:通过所述KVLabel工具对初始样本中的PDF文档进行表格框选取,并对表格框中的各文本节点进行键值标注及键值对标注,生成各初始样本对应的PNG表格图片,将所有PNG表格图片、键值标注及键值对标注作为所述训练样本。In this embodiment, several tables can be extracted, for example but without limitation, from the SciTSR data set and preprocessed with the KVLabel tool to obtain the training-sample data set SciTSR-Key-Value. The training samples for the GKVR recognition model are preprocessed as follows: the KVLabel tool is used to select the table frame of the PDF document in each initial sample, key-value annotation and key-value pair annotation are performed on each text node in the table frame, the PNG table image corresponding to each initial sample is generated, and all the PNG table images, key-value annotations and key-value pair annotations are used as the training samples.
在本实施例中,相比于现有技术直接将SciTSR数据集中的PDF文档直接转换为PNG文件,再根据图片的矩形标注进行表格框定,本实施例是先从PDF文档中截取表格对应的表格框,再将截取后的表格框转换为PNG表格图片,最后再进行键值标注及键值对标注。这样能够保证表格框的坐标信息与图片保持一致,避免因为文件格式从PDF转换为PNG时造成的错位或不匹配的问题。如图2和图3所示,图2为现有技术造成的错位或不一致问题,图3为采用本实施例技术手段后得到的PNG表格图片。In this embodiment, rather than directly converting the PDF documents in the SciTSR data set into PNG files and then framing the tables according to the images' rectangular annotations as in the prior art, the table frame corresponding to the table is first cropped from the PDF document, the cropped table frame is then converted into a PNG table image, and key-value annotation and key-value pair annotation are performed last. This ensures that the coordinate information of the table frame stays consistent with the image, avoiding the misalignment or mismatch caused by converting the file format from PDF to PNG. As shown in Figures 2 and 3, Figure 2 shows the misalignment or inconsistency caused by the prior art, and Figure 3 shows the PNG table image obtained with the technical means of this embodiment.
作为本实施例的一种举例,用矩形框标注的图片中的键值信息、矩形框的坐标信息存储在chunk数据中。chunk数据的具体示意可以但不限于参见图4。如图4所示,chunk数据还可以用于保存文本节点位置,同时文本节点在列表中的索引为其节点编号。As an example of this embodiment, the key-value information in the picture annotated with rectangular boxes and the coordinate information of the rectangular boxes are stored in chunk data. A specific illustration of the chunk data can be seen, for example but without limitation, in Figure 4. As shown in Figure 4, the chunk data can also be used to store text node positions, and a text node's index in the list serves as its node number.
作为本实施例的一种举例,键值标注及键值对标注时,可通过标注文本节点的类型属性,区分该文本节点为键Key、值Value还是其他信息Other。采用info数据保存文本节点的类型属性,该字典的键为节点编号,值为属性。另外,键值对的标注采用pair数据进行保存,记录文本节点键值对关系,其中元素的含义为[Key节点编号,Value节点编号]。As an example of this embodiment, when annotating key values and key-value pairs, the type attribute of the text node can be annotated to distinguish whether the text node is a key, a value, or other information. Use info data to save the type attribute of the text node. The key of the dictionary is the node number and the value is the attribute. In addition, the annotation of key-value pairs is saved using pair data to record the key-value pair relationship of text nodes, where the meaning of the elements is [Key node number, Value node number].
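For concreteness, the chunk, info and pair structures described above can be mocked up as follows (the field names and values are illustrative assumptions, not the patent's exact schema):

```python
# chunk data: per-node text plus rectangular box coordinates;
# the list index of each entry is its node number.
chunks = [
    {"text": "Product Name", "box": (12, 30, 95, 48)},    # node 0
    {"text": "Widget-A",     "box": (110, 30, 180, 48)},  # node 1
    {"text": "Report",       "box": (12, 5, 180, 22)},    # node 2 (table header)
]
# info data: dictionary mapping node number -> type attribute (Key / Value / Other).
info = {0: "Key", 1: "Value", 2: "Other"}
# pair data: each element is [Key node number, Value node number].
pairs = [[0, 1]]

def resolve_pair(pair, chunks):
    """Turn a [key_id, value_id] annotation into the annotated texts."""
    key_id, value_id = pair
    return chunks[key_id]["text"], chunks[value_id]["text"]

kv = resolve_pair(pairs[0], chunks)
```

This mirrors the description above: node attributes live in the info dictionary keyed by node number, while key-value relationships are stored as [Key node number, Value node number] elements.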
可见,通过本发明实施例开发的KVLabel工具,能够实现导入待标注数据集、选取待标注数据、进行矩形框选标注、设置所框选的矩形结点的属性、设置结点之间的键值关系等功能。It can be seen that through the KVLabel tool developed in the embodiment of the present invention, it is possible to import the data set to be labeled, select the data to be labeled, perform rectangular selection and labeling, set the attributes of the selected rectangular node, and set the key values between the nodes. relationship functions.
在本实施例中,在表格键值识别工作中,作为输入数据的表格实际上具有较强的结构性,表格在通过OCR识别以后得到表格中各个文字区域的位置信息以及文本信息,则其可以被认为是图数据。而且由于OCR技术较为成熟,已经很容易识别文字,本发明实施例主要针对如何识别表格中文字节点的键值类别。In this embodiment, in the task of table key-value recognition, the table serving as input data is in fact highly structured: once the table has been recognized by OCR, the position information and text information of each text region in the table are available, so the table can be regarded as graph data. Moreover, since OCR technology is relatively mature and text can already be recognized easily, the embodiments of the present invention focus on how to identify the key-value category of each text node in the table.
在各模块进行训练前,可以加入若干个扰动,默认的扰动方式有:颜色空间转换(cvtColor)、模糊(blur)、抖动(jitter)、噪声(Gauss noise)、随机切割(random crop)、透视(perspective)、颜色反转(reverse)等。训练时所需的数据如下表所示:Before each module is trained, several perturbations can be added. The default perturbation methods include colour space conversion (cvtColor), blur, jitter, Gaussian noise, random crop, perspective transformation and colour inversion (reverse). The data required for training is as shown in the following table:
在本实施例中,训练好的GKVR识别模型包括:句向量特征提取模块、节点图像特征提取模块和位置特征提取模块。In this embodiment, the trained GKVR recognition model includes: a sentence vector feature extraction module, a node image feature extraction module and a location feature extraction module.
句向量特征提取模块的训练过程具体为:根据预设的词汇表,对训练样本中各文本节点的文本内容进行词汇识别,生成字符串,并对每一字符串进行one-hot编码后应用一层单向前馈网络进行词嵌入,获得每个文本节点对应的词序列;通过GRU网络对各词序列中的语义进行学习,生成各文本节点的句向量特征。The training process of the sentence vector feature extraction module is as follows: according to the preset vocabulary, vocabulary recognition is performed on the text content of each text node in the training sample to generate character strings; one-hot encoding is performed on each character string and a single-layer unidirectional feedforward network is then applied for word embedding to obtain the word sequence corresponding to each text node; the semantics in each word sequence are learned through the GRU network to generate the sentence vector features of each text node.
为了使得模型能够得到表格文字的语义信息,本文中使用了自然语言处理领域对文本信息的常见处理方式。首先建立一张词汇表vocab,此处vocab就是26个字母和各个符号,数据类型为字符串。第一部分较有序("0123456789abcdefghijklmnopqrstuvwxyz"),便是简单的数字加小写字母遍历的结果;第二部分较无序,除了由多种符号和大写字母拼接而成,还加入一些罗马数字等,仍为字符串形式。结构特点为简单的横式"xxxx",无特殊排布。其次将不存在于vocab中的字符转化为vocab中表示未知符号的单词。To allow the model to obtain the semantic information of the table text, this application uses a common way of processing text information from the field of natural language processing. First, a vocabulary vocab is established; here vocab consists of the 26 letters and various symbols, with the data type being a string. The first, more ordered part ("0123456789abcdefghijklmnopqrstuvwxyz") is simply the digits followed by the lowercase letters; the second part is less ordered and, besides a variety of symbols and uppercase letters, also adds some Roman numerals and the like, still in string form. The structure is a simple horizontal "xxxx" with no special arrangement. Next, characters not present in vocab are converted to the vocab token that represents unknown symbols.
然后,对每一字符进行one-hot编码并应用一层单向前馈网络(现有技术不再赘述)进行词嵌入(word embedding)表示。One-Hot编码采用N位状态寄存器来对N个状态进行编码,每个状态都有其独立的寄存器位,并且在任意时候只有一位有效。One-Hot编码首先要求将分类值映射到整数值,每个整数值被表示为二进制向量,除了该整数的索引位被标记为1之外,其余位均为零值。Then one-hot encoding is performed on each character and a single-layer unidirectional feedforward network (prior art, not described further here) is applied to produce the word embedding representation. One-hot encoding uses an N-bit status register to encode N states: each state has its own independent register bit, and only one bit is active at any time. One-hot encoding first requires mapping the categorical values to integer values; each integer value is then represented as a binary vector that is all zeros except at the integer's index, which is set to 1.
最终使用GRU对词序列中的语义信息进行学习,最终获得用于表示图节点文本信息的句向量特征,其过程如下:Finally, GRU is used to learn the semantic information in the word sequence, and finally the sentence vector features used to represent the text information of the graph nodes are obtained. The process is as follows:
word_vector_i[j] = embedding(one_hot(text_i[j]))
sentence_feature_i[j] = GRU(word_vector_i[j])
其中,word_vector_i[j]表示第i个图表中第j个节点文本的字符embedding集合,sentence_feature_i[j]为句向量特征。文本句向量特征表示网络参数如下表所示:Where word_vector_i[j] denotes the set of character embeddings of the j-th node's text in the i-th graph, and sentence_feature_i[j] is the sentence vector feature. The network parameters of the text sentence vector feature representation are shown in the following table:
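Under stated assumptions (randomly initialized weights, small illustrative dimensions, and a hand-rolled GRU cell standing in for a library implementation), the one-hot → embedding → GRU pipeline of the formulas above can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("0123456789abcdefghijklmnopqrstuvwxyz") + ["<unk>"]
char2idx = {c: i for i, c in enumerate(vocab)}

def one_hot(text):
    """One one-hot row per character; characters outside vocab map to <unk>."""
    idx = [char2idx.get(c, char2idx["<unk>"]) for c in text.lower()]
    mat = np.zeros((len(idx), len(vocab)))
    mat[np.arange(len(idx)), idx] = 1.0
    return mat

EMB, HID = 8, 16                            # assumed embedding / hidden sizes
def init(shape):
    return rng.normal(size=shape)

W_emb = init((len(vocab), EMB))             # the single feedforward embedding layer
Wz, Wr, Wh = init((EMB, HID)), init((EMB, HID)), init((EMB, HID))
Uz, Ur, Uh = init((HID, HID)), init((HID, HID)), init((HID, HID))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentence_feature(text):
    """Run the embedded character sequence through one GRU layer and
    return the final hidden state as the node's sentence vector feature."""
    x = one_hot(text) @ W_emb               # word_vector_i[j]
    h = np.zeros(HID)
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ Wz + h @ Uz)     # update gate
        r = sigmoid(x[t] @ Wr + h @ Ur)     # reset gate
        h_tilde = np.tanh(x[t] @ Wh + (r * h) @ Uh)
        h = (1 - z) * h + z * h_tilde
    return h                                # sentence_feature_i[j]

feat = sentence_feature("Value 3.14")
```

In the full model the weights would be learned jointly with the rest of the GKVR network rather than fixed at random.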
所述节点图像特征提取模块的训练过程具体为:获取训练样本中的多个表格框信息,并通过卷积神经网络对各表格框信息进行图片结构信息提取,获得多个第一特征图;通过grid_simple算法将所述多个第一特征图以双线性插值的方法放缩至网格中,并将每个文字节点对应坐标的网格特征作为文字节点的节点图像特征。The training process of the node image feature extraction module is specifically: obtain multiple table box information in the training sample, and extract the picture structure information of each table box information through a convolutional neural network to obtain multiple first feature maps; The grid_simple algorithm scales the plurality of first feature maps into a grid using a bilinear interpolation method, and uses the grid features corresponding to the coordinates of each text node as the node image features of the text node.
对于表格而言,除了各个文字节点的位置和文字信息,表格本身的表框信息对于键值识别也具有一定的价值,键节点与值节点的表框结构显示出的差异性表明其具有一定的参考价值。为了更好地完成键值识别的工作,本发明实施例使用卷积神经网络(Convolutional Neural Networks,CNN)提取表格的图片结构信息,再通过grid_simple算法将通过卷积网络得到的特征图以双线性插值的方法放缩至网格中,并取得每个文字节点对应坐标的网格特征作为文字节点的图像特征,其详细过程如下:For a table, besides the position and text information of each text node, the frame information of the table itself also has value for key-value recognition: the differences exhibited by the frame structures of key nodes and value nodes show that this information has reference value. To better accomplish key-value recognition, the embodiments of the present invention use convolutional neural networks (CNNs) to extract the picture structure information of the table, and then use the grid_simple algorithm to scale the feature map obtained from the convolutional network into a grid by bilinear interpolation, taking the grid feature at each text node's coordinates as that node's image feature. The detailed process is as follows:
img_feature_map_i = CNNs(img_i)
img_feature_box_i[j] = grid_simple(img_feature_map_i, pos_i[j])
其中,img_feature_map_i为第i个表格的图片特征图,img_feature_box_i[j]为第i个表格中第j个文字节点的节点图像特征。CNNs网络参数具体如下表所示:Where img_feature_map_i is the picture feature map of the i-th table, and img_feature_box_i[j] is the node image feature of the j-th text node in the i-th table. The CNN network parameters are shown in the following table:
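The grid_simple step resembles bilinear grid sampling (as in PyTorch's `grid_sample`); a minimal numpy sketch of sampling one node's feature at a normalized coordinate, assuming [-1, 1] coordinates and an align-corners-style mapping, could be:

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Sample feature_map of shape (H, W, C) at normalized coords (x, y)
    in [-1, 1] with bilinear interpolation; returns a length-C vector."""
    H, W, _ = feature_map.shape
    fx = (x + 1) * 0.5 * (W - 1)   # map [-1, 1] -> pixel coordinates
    fy = (y + 1) * 0.5 * (H - 1)
    x0, y0 = int(np.floor(fx)), int(np.floor(fy))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = fx - x0, fy - y0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

fmap = np.arange(4.0).reshape(2, 2, 1)       # toy 2x2 feature map, 1 channel
corner = bilinear_sample(fmap, -1.0, -1.0)   # top-left corner of the table
center = bilinear_sample(fmap, 0.0, 0.0)     # table centre
```

In the model, the sampled vector at each text node's box coordinates becomes that node's image feature img_feature_box_i[j].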
在本实施例中,位置特征提取模块的训练过程具体为:获取训练样本中各文本节点的位置信息;将各位置信息进行坐标转换,并将坐标系归一化至[-1,1]区间内,输出各文本节点对应的位置特征。In this embodiment, the training process of the location feature extraction module is specifically: obtaining the location information of each text node in the training sample; performing coordinate conversion on each location information, and normalizing the coordinate system to the [-1,1] interval Within, the location features corresponding to each text node are output.
在表格数据中,由于表格的大小不同,具有相似结构的表格节点的绝对位置有较大的差距,所以若直接使用绝对定位作为网络的输入可能会导致学习效率较低。为了避免以上问题从而使网络能更好地学习表格结构,本文将节点的绝对位置信息转为相对位置信息,同时将坐标系归一化至-1到1区间。其过程如下:In tabular data, due to different table sizes, there is a large gap in the absolute positions of table nodes with similar structures. Therefore, directly using absolute positioning as the input of the network may result in low learning efficiency. In order to avoid the above problems and enable the network to better learn the table structure, this article converts the absolute position information of the nodes into relative position information, and normalizes the coordinate system to the -1 to 1 interval. The process is as follows:
min_x_i = min(X_i)
min_y_i = min(Y_i)
其中,X_i、Y_i分别表示第i张表格的x坐标值集合和y坐标值集合,设定某一节点j的绝对位置为{(x1,y1),(x2,y2)},table_width表示表格宽度,table_height表示表格高度。Where X_i and Y_i denote the set of x-coordinate values and the set of y-coordinate values of the i-th table respectively; the absolute position of a node j is set to {(x1, y1), (x2, y2)}; table_width denotes the table width and table_height the table height.
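The exact normalization formula is garbled in this passage, so the following is a reconstruction under the usual assumption of shifting by the table's minimum corner and scaling by the table width/height into [-1, 1]:

```python
def normalize_positions(boxes):
    """boxes: list of absolute (x1, y1, x2, y2) node rectangles for one table.
    Shift by the table's minimum corner, then scale by table width/height
    so every coordinate lands in [-1, 1]."""
    xs = [v for b in boxes for v in (b[0], b[2])]
    ys = [v for b in boxes for v in (b[1], b[3])]
    min_x, min_y = min(xs), min(ys)
    table_width, table_height = max(xs) - min_x, max(ys) - min_y
    norm_x = lambda x: 2.0 * (x - min_x) / table_width - 1.0
    norm_y = lambda y: 2.0 * (y - min_y) / table_height - 1.0
    return [(norm_x(x1), norm_y(y1), norm_x(x2), norm_y(y2))
            for x1, y1, x2, y2 in boxes]

norm = normalize_positions([(0, 0, 10, 5), (10, 5, 20, 10)])
```

Because the output depends only on positions relative to the table, tables of different absolute sizes but similar structure produce similar position features, which is the stated motivation for the normalization.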
In this embodiment, the sentence vector features, node image features, and position features of each text node in the training samples are used as the input of the GKVR recognition model, and the key and value information of each text node as its output. For each text node, the sentence vector feature and the position feature are each fed into a graph attention network; the results are concatenated with the node image feature to form the node feature. Combined with the output of the GKVR recognition model, the graph attention networks and the multilayer perceptron (MLP) are trained.
The graph attention network (GAT) uses a self-attention mechanism to decide the weight of each neighbor node's features during aggregation, adapting the weights to different neighbors and avoiding the influence of the number of neighbors on the output features. Since the dataset provides no edge information between nodes, building the edge set as a fully connected graph would raise the complexity to O(|N|^2). To reduce the complexity, and considering that neighboring table nodes tend to be spatially close, the k-nearest-neighbor (KNN) algorithm is used to generate the edge set of the table graph, reducing the complexity to O(K*|N|).
In the overall procedure, the KNN-based complexity reduction takes effect in the transition from the "pos of Node" part to grid_simple in Figure 5. Each node on the graph has its own relative position attribute; the k nodes closest to a given node are selected by nearest-neighbor search, and an edge is created between that node and each of the k selected nodes. This is necessary because graph convolution requires both an edge set and a node set.
The calculation process is as follows:
edges_i[j] = KNN(pos_i[j])
pos_h_feature_i[j] = GAT_θ1(normlized_pos_i[j], edges_i[j])
text_h_feature_i[j] = GAT_θ2(sentence_feature_i[j], edges_i[j])
h_f_i[j] = concat(pos_h_feature_i[j], text_h_feature_i[j], img_feature_box_i[j])
prediction_i[j] = Softmax(MLP(h_f_i[j]))
where edges_i denotes the edges of the i-th table obtained by the KNN algorithm, and edges_i[j] the edges of node j of the i-th table; pos_i[j] denotes the absolute position of node j and normlized_pos_i[j] its relative position; sentence_feature_i[j] denotes the sentence vector of node j's text; pos_h_feature_i[j], text_h_feature_i[j], and img_feature_box_i[j] denote the position feature, text feature, and image feature of the j-th text node of the i-th table, respectively; h_f_i[j] denotes the feature information of node j; and prediction_i[j] denotes the predicted category of node j. GAT_θ1 and GAT_θ2 distinguish the GAT instances that process position features and sentence features, respectively.
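The KNN edge-construction step above can be sketched as follows. This is a minimal NumPy illustration; the source does not specify the distance metric or the value of k, so Euclidean distance and a default k=4 are assumptions:

```python
import numpy as np

def knn_edges(centers, k=4):
    """Edge set of the table graph: connect each text node to its k nearest
    neighbours, giving O(K*|N|) edges instead of the O(|N|^2) edges of a
    fully connected graph.

    centers: (N, 2) node coordinates. Returns a list of (src, dst) pairs.
    """
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # exclude self-loops
    k = min(k, n - 1)
    nbrs = np.argsort(dist, axis=1)[:, :k]   # k closest nodes per row
    return [(i, int(j)) for i in range(n) for j in nbrs[i]]

# three nodes on a line; with k=1 each node links to its single nearest neighbour
print(knn_edges([[0, 0], [1, 0], [3, 0]], k=1))  # [(0, 1), (1, 0), (2, 1)]
```

The resulting edge list is what a GAT layer would consume together with the per-node features.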
To better illustrate the benefits of this embodiment, a comparative experiment between GCN-based and GAT-based variants can be used for verification. In the model design, the graph neural network lets each node on the graph combine information from nearby nodes so that its type can be inferred more reliably. GFTE, a graph-neural-network-based model for deriving key-value row/column relations, uses GCN as the underlying network for node information aggregation and performs well in its setting. However, GCN's fusion of neighbor nodes is governed by their degrees and cannot produce weights that depend on the feature values of individual neighbors. In the table key-value derivation task, we argue that a neighbor's influence on the central node should incorporate that neighbor's feature values; GAT is therefore adopted as the underlying network for node aggregation, which noticeably improves both the accuracy and the convergence stability of the model on the key-value recognition task.
As shown in Figure 6, on the training set the loss of the GCN-based GKVR model converges with essentially the same trend as that of the GAT-based model, but converges to a higher minimum. On the test set, the loss of the GCN-based model fluctuates strongly, which shows that using GAT as the underlying network for node information aggregation improves the convergence stability of the GKVR model.
As shown in Figure 7, in terms of recognition accuracy the GAT-based GKVR model clearly outperforms the GCN-based one: its highest accuracy exceeds the latter's by 6 percentage points on the training set and by 7 percentage points on the test set. Replacing GCN with GAT is therefore a reasonable choice for table node key-value recognition.
In this embodiment, after the GKVR recognition model has been trained, an input first PNG table image is processed to extract the corresponding first text information, table frame information, and the position of each text node. The first text information mainly comprises the table's textual content; the sentence vector feature extraction module generates the sentence vector features of the first text information, the node image feature extraction module converts the table frame information into node image features, and the position feature extraction module normalizes the position of each text node to obtain position features. Finally, the sentence vector features and the position features are each fed into the graph attention network, concatenated with the node image features, and passed through the multilayer perceptron (MLP) to output the key-value information set corresponding to the first PNG table image.
Step 103: traverse and match the key-value information set according to a preset partition rule tree, and output each key-value pair in the key-value information set.
In this embodiment, step 103 is specifically: progressively partition the key information set by a breadth-first traversal of the partition rule tree, and on reaching a leaf node select values from the value information set, producing a number of key-value pairs.
After the key-value attributes of the table nodes have been recognized, step 102 partitions the table node set into three sets: Key = {k_1, k_2, ..., k_n}, Value = {v_1, v_2, ..., v_m}, and Other = {o_1, o_2, ..., o_k}. The remaining question is how to obtain the correspondence in the table between elements of the Key set and elements of the Value set.
One approach treats the presence or absence of a key-value relation between two nodes as two classes, extracts node features with a graph neural network, and reduces the problem to binary classification that predicts whether a key-value relation exists between a pair of nodes. To discover all key-value relations, a reasonable design is to build a complete bipartite graph between the Key set and the Value set and predict the class of every edge <Node1, Node2> on it. This solution belongs to the prior art; a binary classification network designed in this way faces an extremely unbalanced sample label distribution. In our experiments, this imbalance caused the model to classify all edges between nodes as the non-key-value class, thereby achieving high accuracy, while the confusion matrix shows that the model fails to identify the key-value relations actually present in the graph.
In this embodiment, table key-value matching has clear prior knowledge: for example, a Key node and its Value node lie in the same row or column, and a Value node minimizes the distance (in some coordinate system, or the Euclidean distance) to its corresponding Key. To introduce this prior knowledge into the key-value matching problem, the partition rule tree PT is defined as follows:
1. PT is not empty.
2. If a node i of PT is not a leaf node, it contains a partition rule p_i, and its number of children equals the number of subsets into which p_i divides the set.
3. If a node i of PT is a leaf node, it contains a selection rule s_i.
The preceding steps identify key values; the rule tree algorithm is used to discover key-value relations. p_i is a partition rule set on the rule tree that divides the key set into several subsets; s_i is a selection rule defined by the rule tree that matches key-value pairs satisfying the same rule.
By breadth-first traversal of the rule tree PT, the Key set is progressively partitioned, and on reaching a leaf node a key is selected and a key-value pair is produced, so that keys conforming to the same rule tree are matched to values. The matching procedure can be summarized as: traverse breadth-first, take subsets of the key set according to the rules at the tree's nodes, and finally find the key corresponding to a given value. For example, first partition the key set, match it against the value set within the rule tree, and finally pair up keys and values that satisfy the same rule tree.
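The breadth-first partition-and-select matching described above can be sketched as follows. The tree structure and traversal follow the definition of PT given earlier, but the concrete partition and selection rules here (a direction split, then nearest value in the same row or column) are hypothetical illustrations, not the patent's rules:

```python
from collections import deque

class RuleNode:
    """A node of the partition rule tree PT. Internal nodes carry a partition
    rule p_i (a function mapping a key set to one subset per child); leaves
    carry a selection rule s_i (a function mapping a key and the value set to
    the matched value, or None)."""
    def __init__(self, partition=None, children=(), select=None):
        self.partition, self.children, self.select = partition, list(children), select

def match_key_values(root, keys, values):
    """Breadth-first traversal: split the key set at internal nodes; at each
    leaf apply the selection rule to emit key-value pairs."""
    pairs = []
    queue = deque([(root, list(keys))])
    while queue:
        node, key_subset = queue.popleft()
        if node.select is not None:                    # leaf: select values
            for k in key_subset:
                v = node.select(k, values)
                if v is not None:
                    pairs.append((k, v))
        else:                                          # internal: partition keys
            for child, subset in zip(node.children, node.partition(key_subset)):
                queue.append((child, subset))
    return pairs

# Hypothetical rules: split keys into horizontal / vertical layouts, then pick
# the nearest value in the matching direction (nodes carry text and coordinates).
def split_by_direction(keys):
    horiz = [k for k in keys if k["dir"] == "h"]
    return [horiz, [k for k in keys if k not in horiz]]

def nearest_same_row(key, values):
    cands = [v for v in values if v["y"] == key["y"] and v["x"] > key["x"]]
    return min(cands, key=lambda v: v["x"] - key["x"], default=None)

def nearest_same_col(key, values):
    cands = [v for v in values if v["x"] == key["x"] and v["y"] > key["y"]]
    return min(cands, key=lambda v: v["y"] - key["y"], default=None)

pt = RuleNode(partition=split_by_direction,
              children=[RuleNode(select=nearest_same_row),
                        RuleNode(select=nearest_same_col)])
keys = [{"text": "Name", "x": 0, "y": 0, "dir": "h"},
        {"text": "Grade", "x": 2, "y": 0, "dir": "v"}]
values = [{"text": "Alice", "x": 1, "y": 0}, {"text": "A", "x": 2, "y": 1}]
for k, v in match_key_values(pt, keys, values):
    print(k["text"], "->", v["text"])  # Name -> Alice, Grade -> A
```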
As an example of this embodiment, the partition-based key-value matching algorithm may be, but is not limited to, the one shown in the table below.
To better illustrate the application of the partition-based key-value matching algorithm of this embodiment, consider the following example; see Figure 8. The corresponding partition rule tree PT is defined as shown in Figure 8: the root node holds the partition rule, which splits the keys into horizontal and vertical sets within a given interval of its scope; the left child holds the horizontal-set elements and the right child the vertical-set elements, and both obey the nearest-neighbor principle for key-value pairs. Matching key-value pairs through this rule tree identifies the key-value pairs in the SciTSR-Key-Value dataset well. Here D(x, y) denotes the angle between the edge connecting node x to node y and the x-axis.
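D(x, y) as defined here reduces to a two-argument arctangent over the displacement between the two nodes. A small sketch, assuming image coordinates in which y grows downward (an assumption, not stated in the source):

```python
import math

def D(x_node, y_node):
    """Angle, in degrees, between the edge from node x to node y and the
    x-axis. Nodes are (x, y) coordinate pairs; y is assumed to grow
    downward, as in image coordinates (an assumption)."""
    dx, dy = y_node[0] - x_node[0], y_node[1] - x_node[1]
    return math.degrees(math.atan2(dy, dx))

print(D((0, 0), (1, 0)))  # 0.0   -- value directly to the right of the key
print(D((0, 0), (0, 1)))  # ~90.0 -- value directly below the key
```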
As an example of this embodiment, the partition rule tree is set inside the GKVR recognition model. Integrating the partition rule tree into the GKVR recognition model simplifies operation and improves efficiency.
In another aspect, an embodiment of the present invention provides an OCR table semantic recognition device based on a graph neural network, comprising an acquisition unit, a recognition unit, and an output unit.
The acquisition unit is configured to acquire a first PNG table image to be recognized, the first PNG table image being obtained by preprocessing a PDF table.
The recognition unit is configured to input the first PNG table image into the trained GKVR recognition model, so that the GKVR recognition model performs OCR on the first PNG table image to obtain first text information, table frame information, and the position of each text node; generates, from the first text information and a preset vocabulary, the sentence vector features of the first text information through a GRU network; converts the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; normalizes the position of each text node to obtain position features; and finally feeds the sentence vector features and the position features into the graph attention network, concatenates the results with the node image features, and passes them through a multilayer perceptron (MLP) to output the key-value information set corresponding to the first PNG table image, the key-value information set comprising a key information set and a value information set.
The output unit is configured to traverse and match the key-value information set according to the preset partition rule tree, and to output each key-value pair in the key-value information set.
For a more detailed working principle and flow of this device, reference may be made, but is not limited, to the relevant description above.
As can be seen from the above, embodiments of the present invention provide a graph-neural-network-based OCR table semantic recognition method and device. A PNG table image is input into the trained GKVR recognition model, which uses the sentence vector features, node image features, and position features of the text nodes to accurately determine whether a table node is a key or a value; matching between keys and values is then achieved through a partition rule tree, which improves the ability to recognize key-value relations in tables. Compared with the prior art, which struggles with portable document formats and images that are difficult to extract directly, the present invention combines deep learning structures such as graph neural networks and gated recurrent units, proposes the GKVR network model for table key-value recognition, and achieves one-click recognition. It is an important complement to existing, traditional, and widely used table recognition methods, and meets practical industrial needs such as automated form review.
Further, in the prior art's use of graph convolutional networks, the fusion of neighbor nodes is governed by their degrees and cannot produce weights that depend on the feature values of individual neighbors. In the table key-value derivation task, a neighbor's influence on the central node should incorporate that neighbor's feature values; the present invention therefore adopts the graph attention network as the underlying network for node aggregation, improving both the accuracy and the convergence stability of the model on the key-value recognition task.
Further, although some methods have explored recognizing the key values present in tables, tables often come in portable document formats and images that are difficult to extract directly. The present invention combines deep learning structures such as graph neural networks and gated recurrent units and proposes GKVR (Graph-based Key and Value Recognition), a network model for table key-value recognition that classifies a node in a table as key or value using the text information, position information, and image information of the table picture, improving the ability to recognize the relations between the keys and values of a table.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection. In particular, any modification, equivalent replacement, or improvement made by those skilled in the art within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310646731.2A (CN116740743B) | 2023-06-01 | 2023-06-01 | OCR (optical character recognition) form semantic recognition method and device based on graphic neural network |
| Publication Number | Publication Date |
|---|---|
| CN116740743A | 2023-09-12 |
| CN116740743B | 2025-08-19 |