CN116740743B - OCR (optical character recognition) form semantic recognition method and device based on graph neural network - Google Patents

OCR (optical character recognition) form semantic recognition method and device based on graph neural network

Info

Publication number
CN116740743B
CN116740743B
Authority
CN
China
Prior art keywords
node
key
text
information
gkvr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310646731.2A
Other languages
Chinese (zh)
Other versions
CN116740743A (en)
Inventor
姜大志
余玮伦
陈业维
吴昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shantou University
Original Assignee
Shantou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shantou University
Priority to CN202310646731.2A
Publication of CN116740743A
Application granted
Publication of CN116740743B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses an OCR (optical character recognition) form semantic recognition method and device based on a graph neural network. A PNG table picture is input into a trained GKVR recognition model, which uses the sentence vector features, node image features and position features of the text nodes to accurately judge whether the attribute of a table node is a key or a value, and matches keys with values by means of a preset partition rule tree, thereby improving the ability to recognize the relationship between the keys and values of a table. The invention combines deep learning network structures such as a graph neural network and a gated recurrent unit, proposes a GKVR network model for table key-value recognition, enables one-click recognition, and meets practical industrial requirements such as automatic table auditing.

Description

OCR (optical character recognition) form semantic recognition method and device based on graph neural network
Technical Field
The invention relates to the technical field of OCR (optical character recognition), in particular to an OCR form semantic recognition method and device based on a graph neural network.
Background
With the rapid development of computing power and parallel computing technology in the computer field, the number of trainable parameters in deep learning models keeps growing, which gradually strengthens their ability to learn from complex data and has led to applications in many fields.
Conventional OCR (optical character recognition) performs character recognition with template matching, structural analysis and similar methods. Building on OCR character recognition, researchers further raised the problem of recognizing the table structure and table text information, which adds table detection, table structure decomposition, table structure recognition and other steps to the character recognition workflow. For example, Marcin Namysl et al., in Flexible Table Recognition and Semantic Interpretation System, propose extracting tables with rule-based algorithms combined with semantic information. Yiren Li et al., in GFTE: Graph-based Financial Table Extraction, propose the GFTE model, which fuses image features, position features and text features to improve the extraction of tables from unstructured data files.
However, in the field of intelligent manufacturing, users care less about recognizing the table structure itself than about recognizing the table's keys and values and the correspondence between them. In addition, tables of many kinds exist in practice. For example, in a quality-inspection-report review workflow, when a user submits a quality inspection report (usually provided as a PDF file), the reviewer needs to check whether the corresponding detection indexes meet the industry standard and whether information such as the product name meets the review requirements; the recognition of keys and values in the table and of their correspondence therefore affects review efficiency and accuracy. Current practical applications of OCR technology to table recognition can hardly meet the requirements of automation and industrialization, and an efficient and accurate solution is lacking.
Disclosure of Invention
The invention provides an OCR (optical character recognition) form semantic recognition method and device based on a graph neural network, which can effectively recognize the keys and values present in a table and find the correspondences among them, thereby meeting practical industrial requirements such as automatic table auditing.
In order to solve the above technical problems, an embodiment of the present invention provides an OCR form semantic recognition method based on a graph neural network, including:
acquiring a first PNG table picture to be identified, wherein the first PNG table picture is obtained by preprocessing a PDF table;
Inputting the first PNG table picture into a trained GKVR recognition model, so that the GKVR recognition model performs OCR (optical character recognition) on the first PNG table picture to obtain first text information, table frame information and the position information of each text node; generating, according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information through a GRU (gated recurrent unit) network; converting the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; normalizing the position information of each text node to obtain position features; and finally inputting the sentence vector features corresponding to the first text information and the position features into a graph attention network respectively, splicing them with the node image features and outputting, through a multi-layer perceptron MLP, the key-value information set corresponding to the first PNG table picture, wherein the key-value information set comprises a key information set and a value information set;
And traversing and matching the key value information set according to a preset division rule tree, and outputting each key value pair in the key value information set.
As a preferable scheme, the trained GKVR recognition model comprises a sentence vector feature extraction module;
The training process of the sentence vector feature extraction module specifically comprises the following steps:
According to a preset vocabulary, carrying out vocabulary recognition on text contents of each text node in a training sample, generating character strings, carrying out one-hot encoding on each character string, and then carrying out word embedding by applying a layer of unidirectional feed-forward network to obtain a word sequence corresponding to each text node;
and learning the semantics in each word sequence through the GRU network to generate sentence vector characteristics of each text node.
As a preferable scheme, the trained GKVR identification model comprises a node image feature extraction module;
the training process of the node image feature extraction module specifically comprises the following steps:
acquiring a plurality of pieces of table frame information in a training sample, and extracting picture structure information of each piece of table frame information through a convolutional neural network to acquire a plurality of first feature images;
And scaling the plurality of first feature graphs into grids by a bilinear interpolation method through a grid_simple algorithm, and taking grid features of coordinates corresponding to each text node as node image features of the text node.
Preferably, the trained GKVR recognition model comprises a position feature extraction module;
the training process of the position feature extraction module specifically comprises the following steps:
Acquiring position information of each text node in a training sample;
and carrying out coordinate conversion on each piece of position information, normalizing the coordinate system into the [-1, 1] interval, and outputting the position feature corresponding to each text node.
As a preferable scheme, the training process of the trained GKVR recognition model specifically comprises the following steps:
Sentence vector features, node image features and position features corresponding to each text node in the training sample are used as input of GKVR identification models, and key information and value information corresponding to each text node are used as output of GKVR identification models;
and for each text node, respectively inputting the sentence vector features and position features into a graph attention network, splicing them with the node image features to form the node features of each text node, and training the graph attention network and the multi-layer perceptron MLP in combination with the output of the GKVR recognition model.
As a preferred solution, the first PNG table picture is obtained by preprocessing a PDF table, specifically:
and acquiring a PDF document to be processed, intercepting a form part from the PDF document through a KVLabel tool, and generating the first PNG form picture.
Preferably, the KVLabel tool is further configured to preprocess training samples of the GKVR recognition model, specifically:
And selecting a form frame of the PDF document in the initial sample through the KVLabel tool, marking each text node in the form frame with a key value and marking with a key value pair, generating PNG form pictures corresponding to each initial sample, and taking all the PNG form pictures, the key value marks and the key value pair marks as the training sample.
As a preferred solution, the traversing matching is performed on the key value information set according to a preset partition rule tree, and each key value pair in the key value information set is output, which specifically includes:
Gradually dividing the key information set by traversing the dividing rule tree with breadth first, and selecting values in the value information set when the key information set reaches the leaf node to generate a plurality of key value pairs.
Preferably, the division rule tree is set in the GKVR identification model.
The invention correspondingly provides an OCR form semantic recognition device based on a graph neural network, which comprises an acquisition unit, a recognition unit and an output unit;
the acquisition unit is used for acquiring a first PNG table picture to be identified, wherein the first PNG table picture is obtained by preprocessing a PDF table;
The recognition unit is used for inputting the first PNG table picture into a trained GKVR recognition model, so that the GKVR recognition model performs OCR (optical character recognition) on the first PNG table picture to obtain first text information, table frame information and the position information of each text node; generates, according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information through a GRU (gated recurrent unit) network; converts the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; normalizes the position information of each text node to obtain position features; and finally inputs the sentence vector features corresponding to the first text information and the position features into a graph attention network respectively, splices them with the node image features and outputs, through a multi-layer perceptron MLP, the key-value information set corresponding to the first PNG table picture, wherein the key-value information set comprises a key information set and a value information set;
the output unit is used for performing traversal matching on the key value information set according to a preset division rule tree, and outputting each key value pair in the key value information set.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
the invention provides an OCR (optical character recognition) form semantic recognition method and device based on a graph neural network, which are characterized in that PNG form pictures are input into a trained GKVR recognition model, the attribute of a form node can be accurately judged to be a key or a value through sentence vector characteristics, node image characteristics and position characteristics of text nodes in the model, and the matching between key values is realized in a mode of setting a division rule tree, so that the relation recognition capability between the keys and the values of the form can be improved. Compared with the portable document format and image which are difficult to directly extract in the prior art, the invention combines deep learning network structures such as a graph neural network, a gate control circulation unit and the like, provides a GKVR network model for carrying out table key value identification, can realize one-key identification, and meets the actual industrial requirements of automatic table auditing and the like.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of an OCR table semantic recognition method based on a graph neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the effect of drawing the original chunk rectangle label on the original PNG picture provided by the prior art;
FIG. 3 is a schematic diagram showing the effect of making PNG pictures from the original PDF file intercepting form part according to the embodiment of the present invention;
FIG. 4 is a schematic diagram of the chunk data provided in an embodiment of the present invention;
FIG. 5 is a schematic diagram of GKVR identification models provided in an embodiment of the present invention;
FIG. 6 is a diagram showing the performance of the loss values of the GCN-based and GAT-based GKVR model training processes according to an embodiment of the present invention;
FIG. 7 is a diagram showing accuracy performance of a GCN-based and GAT-based GKVR model training process according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a partitioning rule tree used in Key Value matching of SciTSR-Key-Value data sets provided by an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, a schematic flow chart of an embodiment of an OCR form semantic recognition method based on a graph neural network according to an embodiment of the present invention is shown, where the method includes steps 101 to 103, and the steps are as follows:
Step 101, acquiring a first PNG table picture to be identified, wherein the first PNG table picture is obtained by preprocessing a PDF table.
In this embodiment, the first PNG table picture is obtained by preprocessing a PDF table, specifically, a PDF document to be processed is obtained, and a table portion is intercepted from the PDF document by a KVLabel tool to generate the first PNG table picture.
Specifically, KVLabel is a tool for marking PDF documents, and can realize functions such as region information marking, node attribute marking (for example, a certain node is a key or a value), node key value pair relation marking and the like. For a PDF document to be identified, a table part of the PDF document is firstly intercepted by a KVLabel tool, and then converted into a first PNG table picture.
Step 102, inputting the first PNG table picture into a trained GKVR recognition model, so that the GKVR recognition model performs OCR (optical character recognition) on the first PNG table picture to obtain first text information, table frame information and the position information of each text node; generating, according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information through a GRU (gated recurrent unit) network; converting the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; normalizing the position information of each text node to obtain position features; and finally inputting the sentence vector features corresponding to the first text information and the position features into a graph attention network respectively, splicing them with the node image features and outputting, through a multi-layer perceptron MLP, the key-value information set corresponding to the first PNG table picture, wherein the key-value information set comprises a key information set and a value information set.
In this embodiment, before executing step 102, training samples are required to train GKVR the recognition model, the first PNG table picture is recognized by the trained GKVR recognition model, and a corresponding key value information set is output. The key value information set comprises a key information set, a value information set and other information sets. Other information sets are sets of other content than keys and values, such as the header in a table.
In this embodiment, a plurality of table data may be extracted from SciTSR data sets, and then preprocessed by KVLabel tool to obtain a data set SciTSR-Key-Value of the training sample. Preprocessing a training sample of the GKVR recognition model, specifically, selecting a form frame of a PDF document in an initial sample through the KVLabel tool, marking each text node in the form frame with a key value and a key value pair, generating PNG form pictures corresponding to each initial sample, and taking all the PNG form pictures, the key value marks and the key value pair marks as the training sample.
In this embodiment, compared with the prior art, which directly converts the PDF documents in the SciTSR dataset into PNG files and then performs table framing according to the rectangular labels of the pictures, this embodiment first intercepts the table frame corresponding to the table from the PDF document, then converts the intercepted table frame into a PNG table picture, and finally performs key-value annotation and key-value pair annotation. This ensures that the coordinate information of the table frame is consistent with the picture and avoids the misalignment or mismatch caused by converting the file format from PDF to PNG. As shown in fig. 2 and fig. 3, fig. 2 shows the misaligned result produced by the prior-art approach, and fig. 3 shows the PNG table picture obtained with the technical means of this embodiment.
As an example of this embodiment, the key-value information of the picture marked with rectangular frames and the coordinate information of the rectangular frames are stored in the chunk data. A specific illustration of the chunk data may be, but is not limited to, the one in fig. 4. As shown in fig. 4, the chunk data also holds the text node positions, with the text nodes indexed in the list by their node numbers.
As an example of this embodiment, when performing the key-value annotation and key-value pair annotation, a text node can be distinguished as a Key, a Value or Other information by annotating the type attribute of the text node. The type attribute of a text node is stored in the info data, a dictionary whose key is the node number and whose value is the attribute. In addition, the key-value pair annotation is stored in the pair data, which records the key-value pair relations of the text nodes; each element has the form [key node number, value node number].
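The annotation records described above can be illustrated with a minimal sketch. The field layout below (chunk, info, pair) follows the description; the concrete values and the exact on-disk format of the KVLabel tool are illustrative assumptions.

```python
# Illustrative annotation records in the spirit of the description above;
# the field names follow the text (chunk / info / pair), while the concrete
# values and the exact on-disk format of the KVLabel tool are assumptions.

# chunk: text content and rectangle coordinates of each text node,
# indexed in the list by node number.
chunk = [
    {"text": "Product name", "pos": [12.0, 30.0, 96.0, 44.0]},    # node 0 (hypothetical)
    {"text": "Steel pipe",   "pos": [110.0, 30.0, 180.0, 44.0]},  # node 1 (hypothetical)
]

# info: node attribute dictionary, key = node number, value = Key / Value / Other.
info = {0: "Key", 1: "Value"}

# pair: key-value pair relations, each element is [key node number, value node number].
pair = [[0, 1]]
```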
Therefore, the KVLabel tool developed in this embodiment can import a data set to be annotated, select the data to be annotated, perform rectangular box-selection annotation, set the attributes of the box-selected rectangular nodes, and set the key-value relations among the nodes.
In this embodiment, in the table key-value recognition task, the table used as input data is in fact strongly structured: after OCR recognition, the position information and text information of each text region in the table are obtained, and these can be regarded as graph data. Moreover, because OCR technology is mature and characters are easy to recognize, the embodiment of the invention mainly focuses on recognizing the key-value category of the text nodes in the table.
Before each module is trained, several perturbation modes can be applied for data augmentation; the default perturbation modes include color space conversion (cvtColor), blurring (blur), jitter (jitter), Gaussian noise (Gauss noise), random cropping (random crop), perspective transformation (perspective), color inversion (reverse) and the like. The data required for training are as shown in the following table:
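As an illustration of such a perturbation pipeline, the following sketch composes the listed modes with torchvision; the specific parameter values and the use of torchvision are assumptions, not the patent's implementation.

```python
# A sketch of the listed default perturbation modes using torchvision transforms.
# The concrete parameter values (blur kernel, jitter strength, crop scale, noise
# level) are illustrative assumptions, not the patent's settings.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),                   # color space conversion
    transforms.GaussianBlur(kernel_size=3),                        # blurring
    transforms.ColorJitter(brightness=0.2, contrast=0.2),          # jitter
    transforms.RandomResizedCrop(256, scale=(0.8, 1.0)),           # random crop
    transforms.RandomPerspective(distortion_scale=0.2),            # perspective
    transforms.RandomInvert(p=0.5),                                # color inversion
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),   # Gaussian noise
])
# augment can then be applied to each PIL table picture before it enters training.
```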
in the embodiment, the trained GKVR recognition model comprises a sentence vector feature extraction module, a node image feature extraction module and a position feature extraction module.
The training process of the sentence vector feature extraction module comprises the steps of carrying out vocabulary recognition on text contents of each text node in a training sample according to a preset vocabulary, generating character strings, carrying out one-hot encoding on each character string, then carrying out word embedding by using a layer of unidirectional feed-forward network to obtain word sequences corresponding to each text node, and learning semantics in each word sequence through a GRU network to generate sentence vector features of each text node.
In order for the model to obtain the semantic information of the table text, a processing approach common in natural language processing is used here: a vocabulary vocab is first established, consisting of the 26 letters, the digits and a set of symbols, stored as character strings. The first part of the vocabulary is an ordered traversal of the digits and lowercase letters ("0123456789abcdefghijklmnopqrstuvwxyz"); the second part is an unordered concatenation of symbols, capital letters, Roman numerals and the like, also stored as a character string, with no special arrangement. Characters that do not appear in vocab are then mapped to a vocab word representing an unknown symbol.
Then, one-hot encoding is performed on each character, and word embedding is applied through a single-layer feed-forward network (a standard technique not described in detail here). One-hot encoding uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is set at any time. One-hot encoding first maps each categorical value to an integer value; each integer value is then represented as a binary vector that is zero everywhere except at the index of that integer, which is marked as 1.
Finally, the GRU is used for learning semantic information in the word sequence, and sentence vector features used for representing the text information of the graph nodes are finally obtained, and the process is as follows:
word_vector_i[j] = embedding(one_hot(text_i[j]))
sentence_feature_i[j] = GRU(word_vector_i[j])
where word_vector_i[j] denotes the character embedding sequence of the text of the j-th node in the i-th table, and sentence_feature_i[j] is the corresponding sentence vector feature. The network parameters of the sentence vector feature representation are shown in the following table:
Parameter | Value
Vocabulary size | 105
Word vector size after embedding | 64
Sentence vector size | 64
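A minimal PyTorch sketch of this sentence-vector branch (vocabulary ids, embedding as one-hot plus a single feed-forward layer, then a GRU) is given below; the module name and the example input are illustrative assumptions, while the vocabulary size of 105 and the 64-dimensional vectors follow the parameter table above.

```python
# A minimal sketch of the sentence-vector branch, assuming PyTorch as the framework.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=105, embed_dim=64, sent_dim=64):
        super().__init__()
        # nn.Embedding is equivalent to one-hot encoding followed by a
        # single linear (feed-forward) layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, sent_dim, batch_first=True)

    def forward(self, char_ids):                    # char_ids: (batch, seq_len) integer indices
        word_vector = self.embedding(char_ids)      # (batch, seq_len, embed_dim)
        _, h_n = self.gru(word_vector)               # h_n: (1, batch, sent_dim)
        return h_n.squeeze(0)                        # sentence_feature: (batch, sent_dim)

# Usage: encode the text of one node whose characters were mapped to vocab ids.
encoder = SentenceEncoder()
ids = torch.tensor([[10, 11, 12, 0, 1]])            # hypothetical character ids
sentence_feature = encoder(ids)                      # shape (1, 64)
```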
The training process of the node image feature extraction module comprises: obtaining a plurality of pieces of table frame information in the training sample, extracting the picture structure information of each piece of table frame information through a convolutional neural network to obtain a plurality of first feature maps, scaling the plurality of first feature maps into a grid by bilinear interpolation through the grid_simple algorithm, and taking the grid feature at the coordinates corresponding to each text node as the node image feature of that text node.
For a table, besides the position and text information of each text node, the table frame information also has value for key-value recognition: the difference between the frame structures around key nodes and value nodes gives it a certain reference value. In order to better complete the key-value recognition task, the embodiment of the invention uses a convolutional neural network (CNN) to extract the picture structure information of the table, scales the feature map obtained from the convolutional network into a grid by bilinear interpolation through the grid_simple algorithm, and takes the grid feature at the coordinates corresponding to each text node as the image feature of that text node. The detailed process is as follows:
img_feature_map_i = CNNs(img_i)
img_feature_box_i[j] = grid_simple(img_feature_map_i, pos_i[j])
where img_feature_map_i is the picture feature map of the i-th table and img_feature_box_i[j] is the node image feature of the j-th text node in the i-th table. The CNN network parameters are shown in the following table:
Layer | In channels | Out channels | Kernel size | Stride | Padding | Activation
0 | 1 | 64 | 3x3 | 1 | 1 | ReLU
1 | 64 | 64 | 3x3 | 1 | 1 | ReLU
2 | 64 | 64 | 3x3 | 1 | 1 | ReLU
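A minimal PyTorch sketch of this node-image-feature branch follows, with the channel, kernel, stride and padding settings taken from the table above; using torch.nn.functional.grid_sample as the bilinear sampling step corresponding to grid_simple is an assumption.

```python
# A minimal sketch: 3-layer CNN (settings from the table above) followed by
# bilinear sampling at each node's normalized position. grid_sample is used
# here as a stand-in for the grid_simple step described in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

cnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
)

img = torch.rand(1, 1, 256, 256)                   # grayscale table picture (hypothetical size)
img_feature_map = cnn(img)                         # (1, 64, 256, 256)

# node positions already normalized to [-1, 1], one (x, y) point per text node
pos = torch.tensor([[-0.8, -0.9], [0.1, -0.9]])    # hypothetical node centres
grid = pos.view(1, -1, 1, 2)                       # (N=1, H_out=#nodes, W_out=1, 2)
img_feature_box = F.grid_sample(img_feature_map, grid, mode="bilinear",
                                align_corners=True)            # (1, 64, #nodes, 1)
img_feature_box = img_feature_box.squeeze(-1).squeeze(0).t()   # (#nodes, 64)
```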
In this embodiment, the training process of the position feature extraction module specifically comprises: obtaining the position information of each text node in the training sample, carrying out coordinate conversion on each piece of position information, normalizing the coordinate system into the [-1, 1] interval, and outputting the position feature corresponding to each text node.
In the table data, tables of different sizes cause the absolute positions of structurally similar table nodes to differ greatly, so directly using absolute positions as network input may lead to low learning efficiency. To avoid this problem and let the network learn the table structure better, the absolute position information of the nodes is converted into relative position information and the coordinate system is normalized to the interval [-1, 1]. The process is as follows:
min_x_i = min(X_i)
min_y_i = min(Y_i)
where X_i and Y_i denote the sets of x-coordinate values and y-coordinate values of the i-th table, respectively, the absolute position of a node j is given as {(x_1, y_1), (x_2, y_2)}, table_width denotes the table width, and table_height denotes the table height.
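A minimal sketch of this normalization step follows; the exact rescaling formula is an assumption consistent with the min_x/min_y, table_width and table_height quantities defined above.

```python
# A minimal sketch of the coordinate normalization: absolute node positions are
# shifted by the table's minimum x/y and rescaled into [-1, 1]. The scaling
# formula is an assumption consistent with the quantities defined in the text.
import torch

def normalize_positions(pos, table_width, table_height):
    """pos: (#nodes, 2) absolute (x, y) coordinates of node centres."""
    min_xy = pos.min(dim=0).values                 # (min_x_i, min_y_i)
    scale = torch.tensor([table_width, table_height])
    return 2.0 * (pos - min_xy) / scale - 1.0      # relative positions in [-1, 1]

pos = torch.tensor([[20.0, 15.0], [180.0, 15.0], [20.0, 60.0]])   # hypothetical
normalized_pos = normalize_positions(pos, table_width=200.0, table_height=80.0)
```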
In this embodiment, the sentence vector features, node image features and position features corresponding to each text node in the training sample are used as the input of the GKVR recognition model, and the key information and value information corresponding to each text node are used as its output; for each text node, the sentence vector features and position features are respectively input into a graph attention network and spliced with the node image features to form the node features of that text node, and the graph attention network and the multi-layer perceptron MLP are trained in combination with the output of the GKVR recognition model.
The graph attention network (GAT) determines, through a self-attention mechanism, the weights of the neighbor node features during aggregation, so that the weights adapt to different neighbor nodes and the output features are not affected by the number of neighbors. Since the data set cannot provide edge information between nodes, constructing the edge set with full connection would give a complexity of O(|N|^2). To reduce the complexity, and considering that the neighbor nodes of a table node have similar positions, a k-nearest-neighbor algorithm (KNN) is adopted to generate the edge set of the table graph, which reduces the complexity to O(k|N|).
In the overall pipeline, the use of the nearest-neighbor algorithm to reduce complexity corresponds to the step from the "pos of Node" part to GRID SIMPLE in FIG. 5. Each node in the graph has its own relative position attribute; the k nodes nearest to a node are selected by the nearest-neighbor search, and an edge is set between the node and these k nodes. With the edge set and the node set, graph convolution can then be performed.
The calculation process is as follows:
edges_i[j] = KNN(pos_i[j])
pos_h_feature_i[j] = GAT_θ1(normalized_pos_i[j], edges_i[j])
text_h_feature_i[j] = GAT_θ2(sentence_feature_i[j], edges_i[j])
h_f_i[j] = concat(pos_h_feature_i[j], text_h_feature_i[j], img_feature_box_i[j])
prediction_i[j] = Softmax(MLP(h_f_i[j]))
where edges_i denotes the edge set of the i-th table obtained by the KNN algorithm and edges_i[j] the edges of node j of the i-th table; pos_i[j] is the absolute position of node j and normalized_pos_i[j] its relative position; sentence_feature_i[j] is the sentence vector of the text information of node j; pos_h_feature_i[j] and text_h_feature_i[j] are the position feature and text feature of the j-th text node of the i-th table; img_feature_box_i[j] is the image feature of the j-th text node; h_f_i[j] is the feature information of node j; prediction_i[j] is the predicted category of node j; and GAT_θ1 and GAT_θ2 are two graph attention networks with separate parameters, used to process the position features and the sentence features respectively.
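A minimal PyTorch Geometric sketch of this classification head is given below: KNN edges over node positions, two GATs with separate parameters for position and sentence features, concatenation with the node image features, and an MLP with Softmax. The use of torch_geometric's GATConv and the hidden sizes are assumptions, not the patent's exact network.

```python
# A minimal sketch of the node-classification head, assuming torch_geometric's
# GATConv as one possible GAT implementation.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

def knn_edges(pos, k=4):
    """Build a directed edge set connecting each node to its k nearest neighbours."""
    dist = torch.cdist(pos, pos)                       # (#nodes, #nodes)
    dist.fill_diagonal_(float("inf"))
    nbrs = dist.topk(k, largest=False).indices         # (#nodes, k)
    src = torch.arange(pos.size(0)).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])        # edge_index: (2, #nodes*k)

class GKVRHead(nn.Module):
    def __init__(self, pos_dim=2, sent_dim=64, img_dim=64, hidden=64, num_classes=3):
        super().__init__()
        self.gat_pos = GATConv(pos_dim, hidden)        # GAT_θ1 for position features
        self.gat_text = GATConv(sent_dim, hidden)      # GAT_θ2 for sentence features
        self.mlp = nn.Sequential(nn.Linear(2 * hidden + img_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, normalized_pos, sentence_feature, img_feature_box, edge_index):
        pos_h = self.gat_pos(normalized_pos, edge_index)
        text_h = self.gat_text(sentence_feature, edge_index)
        h_f = torch.cat([pos_h, text_h, img_feature_box], dim=-1)
        return torch.softmax(self.mlp(h_f), dim=-1)    # Key / Value / Other probabilities
```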
To better illustrate the benefits of this embodiment, a comparative experiment between GCN-based and GAT-based models can be used for verification. In the model design, a graph neural network is used so that each node on the graph can incorporate information from nearby nodes and its type can be inferred more reliably. In GFTE, a graph-neural-network-based model for deriving key-value row-column relations, GCN is used as the underlying network for node information aggregation and works well for that task. However, the GCN's fusion of neighbor nodes is influenced by the degrees of the neighbors and cannot assign weights according to the different feature values of different nodes. In the table key-value inference task, the influence of a neighbor node on the center node should depend on the neighbor's feature values, so GAT is adopted as the underlying network for node aggregation, which improves the accuracy and convergence stability of the model in the key-value recognition task.
As shown in fig. 6, the loss convergence trend of the GCN-based GKVR model on the training set is basically consistent with that of the GAT-based GKVR model, but the former converges to a larger minimum value. On the test set, the GCN-based GKVR model shows strong jitter in the loss, so using GAT as the underlying network for node information aggregation improves the convergence stability of the GKVR model.
As shown in fig. 7, the GAT-based GKVR model performs significantly better than the GCN-based GKVR model in the accuracy of the recognition task, with the highest accuracy on the training set about 6 percentage points higher and the highest accuracy on the test set about 7 percentage points higher than the latter. Replacing GCN with GAT is therefore a reasonable choice for table node key-value recognition.
In this embodiment, after the GKVR recognition model has been trained, the first PNG table picture is input and the corresponding first text information, table frame information and position information of each text node are extracted. The first text information mainly comprises the text content of the table; the sentence vector features corresponding to the first text information are generated by the sentence vector feature extraction module, the table frame information is converted into node image features by the node image feature extraction module, and the position information of each text node is normalized by the position feature extraction module to obtain the position features. Finally, the sentence vector features corresponding to the first text information and the position features are respectively input into the graph attention network, spliced with the node image features, and passed through the multi-layer perceptron MLP to output the key-value information set corresponding to the first PNG table picture.
Step 103, performing traversal matching on the key value information set according to a preset division rule tree, and outputting each key value pair in the key value information set.
In this embodiment, step 103 includes gradually dividing the key information set by traversing the dividing rule tree with breadth first, and selecting values in the value information set when the leaf node is reached, to generate a plurality of key value pairs.
After the node key-value attribute recognition task in step 102, the table node set can be divided into the sets Key = {k_1, k_2, …, k_n}, Value = {v_1, v_2, …, v_m} and Other = {o_1, o_2, …, o_k}. What remains is how to obtain the correspondence between the elements of the Key set and the Value set in the table, which is mainly discussed below.
One option is to treat the presence or absence of a key-value pair relation between two nodes as a binary classification problem: node features are extracted on the graph through the graph neural network, and the problem is converted into predicting whether a key-value pair relation exists between two nodes. In order to find all key-value pair relations, a straightforward design is to construct a complete bipartite graph between the Key set and the Value set and predict the category of each edge <Node_1, Node_2> on the bipartite graph. This approach is prior art, and a binary classification neural network designed in this way faces an extremely unbalanced label distribution. Experimentally, this imbalance causes the model to classify the edges between all nodes into the non-key-value-pair category to achieve higher accuracy, but the confusion matrix shows that such a model cannot identify the key-value pair relations that actually exist in the graph.
In this embodiment, obvious prior knowledge exists in table key-value matching, for example that key nodes and value nodes lie in the same row or the same column, and that a value node has the smallest distance, along a certain coordinate axis or in Euclidean distance, to its corresponding key node. To introduce this prior knowledge into the key-value matching problem, the partition rule tree PT is defined as follows:
1. PT is not null.
2. If a node i in PT is not a leaf node, it contains a partition rule p_i, and its number of child nodes equals the number of subsets into which p_i divides the set.
3. If a node i in PT is a leaf node, it contains a selection rule s_i.
Here the key values have already been identified, and the rule tree algorithm is used to explore the key-value pair relations. p_i is a partition rule defined on the rule tree, which divides the key set into subsets; s_i is a selection rule defined on the rule tree, which matches key-value pairs conforming to the same rule.
Key-value matching can then be performed on keys conforming to the same rule tree: the key set is gradually divided by breadth-first traversal of the rule tree PT, and when a leaf node is reached, values are selected and key-value pairs are generated. The matching procedure can be described as a breadth-first traversal that takes subsets of the key set according to the rules in the rule tree nodes and finally finds the key corresponding to a certain value: the key set is divided, matched against the value set in the rule tree, and the value conforming to the same rule tree branch is selected to generate the key-value pair.
As an example of this embodiment, a partition-based key-value matching algorithm may be, but is not limited to, that shown in the following table.
To better explain the application of the partition-based key-value matching algorithm of this embodiment, the following example is described with reference to fig. 8. The corresponding partition rule tree PT is defined as shown in fig. 8: the root node holds a partition rule that splits the keys into a horizontal set and a vertical set, with the interval of angles over which each rule applies given in the figure. The left child node holds the elements of the horizontal set and the right child node the elements of the vertical set, and both follow the nearest-neighbor principle for selecting the key-value pair. Finally, key-value pair matching through this rule tree identifies the key-value pairs in the SciTSR-Key-Value data set well. Here D(x, y) is the angle formed between the edge connecting node x and node y and the x axis.
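A minimal sketch of such a partition-based matching procedure follows. The single split into horizontal and vertical keys by the angle D(x, y) to the nearest value node, followed by nearest-neighbor selection at the leaves, is an illustrative assumption in the spirit of fig. 8, not the patent's exact rule tree.

```python
# A minimal sketch of partition-based key-value matching: breadth-first traversal
# of a two-branch rule tree (horizontal / vertical keys by the angle D(x, y)),
# with nearest-neighbor selection at the leaves. The concrete rules are assumptions.
import math
from collections import deque

def nearest_value(key, values):
    return min(values, key=lambda v: math.dist(key["pos"], v["pos"]))

def angle(key, value):          # D(x, y): angle of the key-value edge with the x axis
    dx = value["pos"][0] - key["pos"][0]
    dy = value["pos"][1] - key["pos"][1]
    return abs(math.degrees(math.atan2(dy, dx)))

def match_key_values(keys, values):
    pairs = []
    root = {"children": ("horizontal", "vertical")}   # internal node: partition rule p_i
    queue = deque([(root, keys)])                     # breadth-first traversal of the rule tree
    while queue:
        node, key_subset = queue.popleft()
        if "children" in node:                        # partition rule p_i: divide the key set
            horiz = [k for k in key_subset if angle(k, nearest_value(k, values)) < 45]
            vert = [k for k in key_subset if angle(k, nearest_value(k, values)) >= 45]
            queue.append(({"select": "nearest"}, horiz))
            queue.append(({"select": "nearest"}, vert))
        else:                                         # selection rule s_i at a leaf: pick the value
            for k in key_subset:
                pairs.append((k["text"], nearest_value(k, values)["text"]))
    return pairs

keys = [{"text": "Name", "pos": (10, 10)}]            # hypothetical nodes
values = [{"text": "Steel pipe", "pos": (60, 10)}]
print(match_key_values(keys, values))                 # [('Name', 'Steel pipe')]
```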
As an example of this embodiment, a division rule tree is set in the GKVR recognition model. In the example, the division rule tree is also integrated in the GKVR identification model, so that the operation is simplified, and the efficiency is improved.
On the other hand, the embodiment of the invention provides an OCR table semantic recognition device based on a graph neural network, which comprises an acquisition unit, a recognition unit and an output unit;
the acquisition unit is used for acquiring a first PNG table picture to be identified, wherein the first PNG table picture is obtained by preprocessing a PDF table;
The recognition unit is used for inputting the first PNG table picture into a trained GKVR recognition model, so that the GKVR recognition model performs OCR (optical character recognition) on the first PNG table picture to obtain first text information, table frame information and the position information of each text node; generates, according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information through a GRU (gated recurrent unit) network; converts the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; normalizes the position information of each text node to obtain position features; and finally inputs the sentence vector features corresponding to the first text information and the position features into a graph attention network respectively, splices them with the node image features and outputs, through a multi-layer perceptron MLP, the key-value information set corresponding to the first PNG table picture, wherein the key-value information set comprises a key information set and a value information set;
the output unit is used for performing traversal matching on the key value information set according to a preset division rule tree, and outputting each key value pair in the key value information set.
For a more detailed working principle and flow of the device, reference may be made to, but is not limited to, the related description of the method embodiment above, which is not repeated here.
From the above, the embodiment of the invention provides an OCR (optical character recognition) table semantic recognition method and device based on a graph neural network: a PNG table picture is input into a trained GKVR recognition model, which uses the sentence vector features, node image features and position features of the text nodes to accurately judge whether the attribute of a table node is a key or a value, and matches keys with values by means of a preset partition rule tree, thereby improving the ability to recognize the relationship between the keys and values of a table. Compared with the prior art, in which portable document formats and images are difficult to extract from directly, the invention combines deep learning network structures such as a graph neural network and a gated recurrent unit, proposes a GKVR network model for table key-value recognition, enables one-click recognition, is an important supplement to existing, traditional and widely applied table recognition methods, and meets practical industrial requirements such as automatic table auditing.
Furthermore, in the prior art that applies graph convolutional neural networks, the fusion of neighbor nodes is influenced by the degrees of the neighbors and cannot assign weights according to the different feature values of different nodes. In the table key-value inference task, the influence of a neighbor node on the center node should depend on the neighbor's feature values, so the invention adopts the graph attention network as the underlying network for node aggregation, which improves the accuracy and convergence stability of the model in the key-value recognition task.
Further, although some approaches have explored identifying the key values present in tables, tables often come as portable document formats and images that are difficult to extract from directly. The invention combines deep learning network structures such as a graph neural network and a gated recurrent unit, and proposes a network model GKVR (Graph-based Key and Value Recognition) for table key-value recognition; the model classifies a node in a table as key or value by using the text information, position information and picture information of the text in the table together with the picture information of the table picture, thereby improving the ability to recognize the relationship between the keys and values of a table.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims (9)

Translated from Chinese

1. An OCR table semantic recognition method based on a graph neural network, characterized by comprising:
acquiring a first PNG table picture to be recognized, wherein the first PNG table picture is obtained by preprocessing a PDF table;
inputting the first PNG table picture into a trained GKVR recognition model, so that the GKVR recognition model performs OCR recognition on the first PNG table picture to obtain first text information, table frame information and the position information of each text node; generating, according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information through a GRU network; converting the table frame information into node image features through a convolutional neural network and the grid_simple algorithm; normalizing the position information of each text node to obtain position features; and finally inputting the sentence vector features corresponding to the first text information and the position features into a graph attention network respectively, splicing them with the node image features and outputting, through a multi-layer perceptron MLP, the key-value information set corresponding to the first PNG table picture, wherein the key-value information set comprises a key information set and a value information set;
traversing and matching the key-value information set according to a preset partition rule tree, and outputting each key-value pair in the key-value information set;
wherein the training process of the trained GKVR recognition model is specifically: taking the sentence vector features, node image features and position features corresponding to each text node in the training samples as the input of the GKVR recognition model, and taking the key information and value information corresponding to each text node as the output of the GKVR recognition model; and, for each text node, respectively inputting the sentence vector features and position features into the graph attention network, splicing them with the node image features to form the node features of each text node, and training the graph attention network and the multi-layer perceptron MLP in combination with the output of the GKVR recognition model.

2. The OCR table semantic recognition method based on a graph neural network according to claim 1, characterized in that the trained GKVR recognition model comprises a sentence vector feature extraction module, and the training process of the sentence vector feature extraction module is specifically: performing vocabulary recognition on the text content of each text node in the training samples according to a preset vocabulary to generate character strings, performing one-hot encoding on each character string and then applying a single-layer unidirectional feed-forward network for word embedding to obtain the word sequence corresponding to each text node; and learning the semantics in each word sequence through the GRU network to generate the sentence vector features of each text node.

3. The OCR table semantic recognition method based on a graph neural network according to claim 2, characterized in that the trained GKVR recognition model comprises a node image feature extraction module, and the training process of the node image feature extraction module is specifically: acquiring a plurality of pieces of table frame information in the training samples, and extracting the picture structure information of each piece of table frame information through a convolutional neural network to obtain a plurality of first feature maps; scaling the plurality of first feature maps into a grid by bilinear interpolation through the grid_simple algorithm, and taking the grid feature at the coordinates corresponding to each text node as the node image feature of that text node.

4. The OCR table semantic recognition method based on a graph neural network according to claim 3, characterized in that the trained GKVR recognition model comprises a position feature extraction module, and the training process of the position feature extraction module is specifically: acquiring the position information of each text node in the training samples; and performing coordinate conversion on each piece of position information, normalizing the coordinate system into the [-1, 1] interval, and outputting the position feature corresponding to each text node.

5. The OCR table semantic recognition method based on a graph neural network according to claim 1, characterized in that the first PNG table picture is obtained by preprocessing a PDF table, specifically: acquiring a PDF document to be processed, and intercepting the table part from the PDF document through the KVLabel tool to generate the first PNG table picture.

6. The OCR table semantic recognition method based on a graph neural network according to claim 5, characterized in that the KVLabel tool is further used for preprocessing the training samples of the GKVR recognition model, specifically: performing table frame selection on the PDF documents in the initial samples through the KVLabel tool, performing key-value annotation and key-value pair annotation on each text node in the table frame, generating the PNG table picture corresponding to each initial sample, and taking all the PNG table pictures, key-value annotations and key-value pair annotations as the training samples.

7. The OCR table semantic recognition method based on a graph neural network according to claim 1, characterized in that traversing and matching the key-value information set according to a preset partition rule tree and outputting each key-value pair in the key-value information set is specifically: gradually dividing the key information set through breadth-first traversal of the partition rule tree, and selecting values from the value information set when a leaf node is reached, to generate several key-value pairs.

8. The OCR table semantic recognition method based on a graph neural network according to claim 7, characterized in that the partition rule tree is set in the GKVR recognition model.

9. An OCR table semantic recognition device based on a graph neural network, characterized by comprising an acquisition unit, a recognition unit and an output unit; wherein the acquisition unit is used for acquiring a first PNG table picture to be recognized, wherein the first PNG table picture is obtained by preprocessing a PDF table; the recognition unit is used for inputting the first PNG table picture into a trained GKVR recognition model, so that the GKVR recognition model performs OCR recognition on the first PNG table picture to obtain first text information, table frame information and the position information of each text node, generates, according to the first text information and a preset vocabulary, the sentence vector features corresponding to the first text information through a GRU network, converts the table frame information into node image features through a convolutional neural network and the grid_simple algorithm, normalizes the position information of each text node to obtain position features, and finally inputs the sentence vector features corresponding to the first text information and the position features into a graph attention network respectively, splices them with the node image features and outputs, through a multi-layer perceptron MLP, the key-value information set corresponding to the first PNG table picture, wherein the key-value information set comprises a key information set and a value information set; and the output unit is used for traversing and matching the key-value information set according to a preset partition rule tree and outputting each key-value pair in the key-value information set; wherein the training process of the trained GKVR recognition model is specifically: taking the sentence vector features, node image features and position features corresponding to each text node in the training samples as the input of the GKVR recognition model, and taking the key information and value information corresponding to each text node as the output of the GKVR recognition model; and, for each text node, respectively inputting the sentence vector features and position features into the graph attention network, splicing them with the node image features to form the node features of each text node, and training the graph attention network and the multi-layer perceptron MLP in combination with the output of the GKVR recognition model.
CN202310646731.2A | 2023-06-01 | 2023-06-01 | OCR (optical character recognition) form semantic recognition method and device based on graph neural network | Active | CN116740743B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310646731.2A (CN116740743B (en)) | 2023-06-01 | 2023-06-01 | OCR (optical character recognition) form semantic recognition method and device based on graph neural network

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310646731.2A (CN116740743B (en)) | 2023-06-01 | 2023-06-01 | OCR (optical character recognition) form semantic recognition method and device based on graph neural network

Publications (2)

Publication Number | Publication Date
CN116740743A (en) | 2023-09-12
CN116740743B (en) | 2025-08-19

Family

ID=87912589

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310646731.2A (Active; CN116740743B (en)) | OCR (optical character recognition) form semantic recognition method and device based on graph neural network | 2023-06-01 | 2023-06-01

Country Status (1)

Country | Link
CN | CN116740743B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109902289A (en)* | 2019-01-23 | 2019-06-18 | Shantou University | A news video topic segmentation method for fuzzy text mining
CN115700828A (en)* | 2021-07-30 | 2023-02-07 | 上海爱数信息技术股份有限公司 | Table element identification method and device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112001368A (en)* | 2020-09-29 | 2020-11-27 | 北京百度网讯科技有限公司 | Character structured extraction method, device, equipment and storage medium
CN112528963A (en)* | 2021-01-09 | 2021-03-19 | 江苏拓邮信息智能技术研究院有限公司 | Intelligent arithmetic question reading system based on MixNet-YOLOv3 and convolutional recurrent neural network CRNN

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109902289A (en)* | 2019-01-23 | 2019-06-18 | Shantou University | A news video topic segmentation method for fuzzy text mining
CN115700828A (en)* | 2021-07-30 | 2023-02-07 | 上海爱数信息技术股份有限公司 | Table element identification method and device, computer equipment and storage medium

Also Published As

Publication number | Publication date
CN116740743A (en) | 2023-09-12

Similar Documents

Publication | Publication Date | Title
CN109189925B (en) Word vector model based on point mutual information and text classification method based on CNN
CN114419304B (en) A multimodal document information extraction method based on graph neural network
WO2021212749A1 (en) Method and apparatus for labelling named entity, computer device, and storage medium
WO2022035942A1 (en) Systems and methods for machine learning-based document classification
CN109858015B (en) A Semantic Similarity Calculation Method and Device Based on CTW and KM Algorithms
Sethi et al. DLPaper2Code: Auto-generation of code from deep learning research papers
Julca-Aguilar et al. A general framework for the recognition of online handwritten graphics
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
Ionescu et al. Knowledge transfer between computer vision and text mining
CN110781672A (en) Question bank production method and system based on machine intelligence
Jamieson et al. A review of deep learning methods for digitisation of complex documents and engineering diagrams
CN119557424B (en) Data analysis method, system and storage medium
CN115545036A (en) Reading order detection in documents
Yu et al. Enhancing Label Correlations in multi-label classification through global-local label specific feature learning to Fill Missing labels
Lin et al. Contrastive representation enhancement and learning for handwritten mathematical expression recognition
CN115410185A (en) A method for extracting attributes of specific person names and unit names from multimodal data
CN119206745A (en) Certificate information identification method, medium and equipment
CN116740743B (en) OCR (optical character recognition) form semantic recognition method and device based on graph neural network
CN117971992A (en) Classification method and device for structured data
CN118467747A (en) Knowledge graph construction and completion method based on joint extraction and link prediction
CN118410331A (en) Method for enhancing label correlation based on global-local label specific feature learning
CN113032565B (en) A method for detecting upper and lower relations based on cross-language supervision
Nicolaieff et al. Intelligent Document Processing with Small and Relevant Training Dataset
Kulkarni et al. Digitization of Physical Notes: A Comprehensive Approach Using OCR, CNN, RNN, and NMF
Marinai. Page similarity and classification

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
