Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a table structure identification method, a system, an electronic device and a computer-readable storage medium based on a graph neural network, which can solve the problems that diverse table forms, scene changes and image degradation bring to table structure identification, and effectively improve the accuracy of table structure identification by effectively modeling the table graph and using the graph neural network for feature enhancement.
The first object of the present invention is to provide a table structure identification method based on a graph neural network.
A second object of the present invention is to provide a table structure recognition system based on a graph neural network.
A third object of the present invention is to provide an electronic device.
A fourth object of the present invention is to provide a storage medium.
The first object of the present invention can be achieved by adopting the following technical scheme:
a method for identifying a table structure based on a graph neural network, the method comprising:
detecting and identifying the table image to be identified to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
performing intra-modal and inter-modal feature interaction by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
Further, the performing intra-modal and inter-modal feature interaction by using the adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices includes:
for each of the three modal features, stacking the same modal feature of all vertices into a feature matrix, and inputting the feature matrix and the edge matrix of the table graph into the adaptive graph neural network to complete intra-modal feature interaction and obtain an updated modal feature matrix;
and fusing the three updated modal feature matrices corresponding to the three modal features, inputting the fused feature matrix and the edge matrix of the table graph into the adaptive graph neural network to complete inter-modal feature interaction and obtain an updated fusion feature matrix, wherein each row of the fusion feature matrix represents the updated fusion feature of one vertex, and the edge matrix is the set of all edges in the table graph.
Further, the adaptive graph neural network comprises a graph convolution network and a graph attention network;
the stacking the same modal feature of all vertices into a feature matrix and inputting the feature matrix and the edge matrix of the table graph into the adaptive graph neural network to complete intra-modal feature interaction and obtain an updated modal feature matrix includes:
stacking the same modal feature of all vertices into a feature matrix, and inputting the feature matrix and the edge matrix of the table graph into the graph convolution network and the graph attention network respectively to obtain a first feature matrix and a second feature matrix;
and fusing the first feature matrix and the second feature matrix, the fused feature matrix serving as the updated modal feature matrix.
Further, the table graph is constructed based on a Delaunay triangulation algorithm.
Further, the constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes includes:
uniformly sampling the bounding box of each text line, sampling a plurality of points on each bounding box, and marking the bounding box to which each point belongs;
applying a Delaunay triangulation algorithm to all the sampled points to obtain a Delaunay triangulation graph of all the points;
splitting each triangle in the Delaunay triangulation graph, wherein each side of a triangle represents an edge between the two bounding boxes to which its two sampling points belong, so as to obtain the edge set of all sampling points;
filtering repeated edges and invalid edges in the edge set, and retaining only one valid edge between every two text-line bounding boxes to obtain a valid edge set E;
and constructing a vertex set V with the text lines as vertices, and constructing the table graph G = (V, E) from the vertex set V and the valid edge set E.
Further, six points are sampled on each bounding box.
Further, the obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box includes:
performing feature embedding on the spatial coordinates of the bounding box of the text line by using a multi-layer perceptron to obtain the spatial feature of the text line;
extracting image features of the region corresponding to the text line in the table image to be identified to obtain the image feature of the text line;
encoding the content of the text line to obtain the text feature of the text line;
and performing feature embedding on the spatial feature, the image feature and the text feature of the text line to obtain the three modal features of the vertex corresponding to the text line.
Further, the extracting image features of the region corresponding to the text line in the table image to be identified to obtain the image feature of the text line includes:
performing feature extraction on the table image to be identified by using the deep neural network ResNet-50;
and extracting, from the extracted feature map, the image features of the region corresponding to the text line by using an ROI Align algorithm based on bilinear interpolation to obtain the image feature of the text line.
The second object of the invention can be achieved by adopting the following technical scheme:
a graph neural network based table structure identification system, the system comprising:
the recognition module is used for detecting and recognizing the table image to be recognized to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
the construction module is used for constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
the embedding module is used for obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
the interaction module is used for performing intra-modal and inter-modal feature interaction by using the adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
the classification module is used for obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and the correction module is used for correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
The third object of the present invention can be achieved by adopting the following technical scheme:
An electronic device comprises a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the table structure identification method described above.
The fourth object of the present invention can be achieved by adopting the following technical scheme:
a computer-readable storage medium storing a program which, when executed by a processor, implements the above-described table structure identification method.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a table structure identification method, system, device and storage medium based on a graph neural network. A table image to be identified is detected and recognized to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell and there are at least two text lines. A table graph with the text lines as vertices and edge relations between the vertices is constructed according to the text lines and the corresponding bounding boxes. Three modal features of the vertex corresponding to each text line are obtained according to the text line and the corresponding bounding box. Intra-modal and inter-modal feature interaction is performed by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices. The feature of each edge is obtained according to the updated fusion features of the two vertices corresponding to the edge in the table graph, the feature of each edge is input into a classifier, and the category of each edge is output. The table graph is then corrected according to the category of each edge to obtain the table structure information of the table image to be identified. By effectively constructing the table graph and using the graph neural network for feature enhancement, the accuracy of table structure identification is effectively improved.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the present application. It should be understood that the detailed description is intended to illustrate, not to limit, the application.
Example 1:
As shown in FIG. 1, this embodiment provides a table structure identification method based on a graph neural network, which includes the following steps:
S101, detecting and recognizing all text lines of each cell in the table image to be identified and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell.
The table image to be identified is a scene image containing a table, including electronic documents such as PDF and Excel files, scanned documents, notes such as bills and invoices, and food packaging, promotional pages, personal notes and the like in natural scenes.
Text lines in the table cells are detected and recognized by an OCR engine to obtain all text lines of each cell in the table and the spatial coordinates of the bounding box corresponding to each text line.
Optionally, the OCR engine is the open-source project PaddleOCR, which can detect and recognize the bounding boxes and contents of all text lines in the table cells, obtaining the spatial coordinates of the bounding box and the content of the text line for each text line.
S102, constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes.
A table graph is constructed from the text lines and the corresponding bounding boxes by using a Delaunay triangulation algorithm, yielding a table graph with the text lines as vertices and edge relations between the vertices.
In this embodiment, step S102 specifically includes:
(1) Uniformly sampling the bounding box of each text line, sampling 6 points on each bounding box, and marking the bounding box to which each point belongs;
Specifically, 2 sampling points are taken on each long side of the bounding box and 1 sampling point on each short side, so that edges can be found in the four directions up, down, left and right. In practice, these 6 points are computed from the 4 corner points of the rectangular box; together with the 4 corner points, this gives 10 points in total. Experiments show that sampling 6 points recalls valid edges about as well as sampling more points (such as all 10) while producing fewer erroneous edges.
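The 6-point sampling described above can be sketched as follows. This is a hypothetical placement, since the exact positions are not specified here: two interior points on each long side at the one-third marks, plus the midpoint of each short side, all computed from the box corners.

```python
def sample_box_points(x1, y1, x2, y2):
    """Return 6 sampled points on the border of an axis-aligned box."""
    w, h = x2 - x1, y2 - y1
    if w >= h:  # long sides are the top and bottom edges
        return [
            (x1 + w / 3, y1), (x1 + 2 * w / 3, y1),  # top long side
            (x1 + w / 3, y2), (x1 + 2 * w / 3, y2),  # bottom long side
            (x1, y1 + h / 2), (x2, y1 + h / 2),      # short-side midpoints
        ]
    return [  # tall box: long sides are the left and right edges
        (x1, y1 + h / 3), (x1, y1 + 2 * h / 3),
        (x2, y1 + h / 3), (x2, y1 + 2 * h / 3),
        (x1 + w / 2, y1), (x1 + w / 2, y2),
    ]

# Tag every sampled point with the index of the box it belongs to.
boxes = [(0, 0, 90, 30), (0, 40, 90, 70)]
points = [(p, box_id) for box_id, b in enumerate(boxes)
          for p in sample_box_points(*b)]
```

Tagging each point with the index of its bounding box is what later allows triangle sides between sample points to be mapped back to edges between text-line boxes.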
(2) Applying a Delaunay triangulation algorithm to all the sampled points to obtain a Delaunay triangulation graph GD of all the points, see FIG. 2;
(3) Splitting each triangle in the Delaunay triangulation graph GD, wherein each side of a triangle represents an edge between the two bounding boxes to which its two sampling points belong, so as to obtain the edge set ED of all sampling points;
(4) Filtering repeated edges and invalid edges in the edge set, and retaining only one valid edge between every two text-line bounding boxes to obtain a valid edge set E;
(5) Constructing a vertex set V with the text lines as vertices, and constructing the table graph G = (V, E) from the vertex set V and the valid edge set E.
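Steps (3)-(5) above can be sketched as follows. The triangle list is a toy stand-in for the output of an actual Delaunay triangulation routine; edges whose two sampling points belong to the same box are treated as invalid and discarded, and repeated edges are deduplicated.

```python
def build_edge_set(triangles, point_to_box):
    """Split triangles into sides, map sides to box pairs, dedupe and filter."""
    edges = set()
    for a, b, c in triangles:                      # split each triangle
        for p, q in ((a, b), (b, c), (a, c)):      # into its three sides
            u, v = point_to_box[p], point_to_box[q]
            if u != v:                             # filter invalid self-edges
                edges.add((min(u, v), max(u, v)))  # dedupe repeated edges
    return sorted(edges)

# Toy data: 5 sampling points belonging to 3 text-line boxes.
point_to_box = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
E = build_edge_set(triangles, point_to_box)   # valid edge set over box indices
```

The resulting pairs of box indices are exactly the valid edge set E of the table graph G = (V, E).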
Performing the operations of steps S101 and S102 on the image to be identified yields a table graph modeled by the Delaunay triangulation algorithm, see FIG. 3 (a), where each numbered vertex represents a text line and each connection between vertices represents an edge along which a relation may exist between the two vertices.
In this embodiment, the table graph is modeled by the Delaunay triangulation algorithm, which achieves a good balance between connectivity and sparsity in the table graph.
S103, obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box.
The text lines and the corresponding bounding boxes are input into a feature embedding network, which outputs the three modal features of the vertex corresponding to each text line.
In this embodiment, step S103 specifically includes:
(1) Performing feature embedding on the spatial coordinates of the bounding box of the text line by using a multi-layer perceptron to obtain the spatial feature RG of the text line;
(2) Performing feature extraction on the table image to be identified by using the deep neural network ResNet-50, and extracting the image features of the region corresponding to the text line by using an ROI Align algorithm based on bilinear interpolation to obtain the image feature RI of the text line;
(3) Encoding the content of the text line by using the pre-trained text embedding model Sentence Transformer to obtain the text feature RC of the text line;
The spatial feature, the image feature and the text feature are the three modal features of the text line, namely the three modal features of the vertex corresponding to the text line.
S104, designing an adaptive graph neural network combining a graph convolution network and a graph attention network.
In this embodiment, as shown in fig. 4, step S104 specifically includes:
S1041, taking the feature matrix R of one modality and the edge matrix E of the table graph as input, an updated feature matrix RGCN is obtained through the graph convolution network;
Specifically, the graph convolution network adopts a two-layer convolution structure, and the input data sequentially passes through a graph convolution module, a regularization module, an activation function, a second graph convolution module, a regularization module and an activation function to obtain the updated feature matrix RGCN;
The formula of the graph convolution module is as follows:
h_i^(l+1) = σ( Σ_{j∈N_i} (1/c_ij) · W^(l) · h_j^(l) )
where σ is the activation function, N_i is the set of vertices adjacent to vertex i, c_ij is a normalization weight depending on the number of neighbors of vertex i, and W^(l) is a learnable matrix; h_j^(l) is the feature vector of vertex j output by the l-th layer, R^(l+1) is the feature matrix output by the (l+1)-th layer, formed by stacking the feature vectors of all vertices output by that layer, and R^(0) is the feature matrix R used as the input of the first layer, whose i-th row h_i^(0) is the feature vector of vertex i.
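The graph convolution step described above can be sketched in a few lines. This is a minimal illustration, assuming ReLU as the activation σ and mean aggregation (c_ij equal to the neighbor count of vertex i), with a small fixed W standing in for the learned matrix:

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gcn_step(features, neighbors, W):
    """One graph-convolution update: h_i' = ReLU(sum_j W h_j / |N_i|)."""
    out = []
    for N_i in neighbors:
        agg = [0.0] * len(W)
        for j in N_i:                              # aggregate neighbors
            for k, v in enumerate(matvec(W, features[j])):
                agg[k] += v / len(N_i)             # c_ij = |N_i| normalization
        out.append([max(0.0, v) for v in agg])     # ReLU as the activation
    return out

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # one feature row per vertex
neighbors = [[1, 2], [0, 2], [0, 1]]               # N_i for each vertex i
W = [[0.5, 0.0], [0.0, 0.5]]                       # toy stand-in for W^(l)
updated = gcn_step(features, neighbors, W)
```

A real layer would learn W by back-propagation and typically also include a self-loop term; the sketch only shows the neighbor aggregation of the formula.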
S1042, taking the feature matrix R of one modality and the edge matrix E of the table graph as input, an updated feature matrix RGAT is obtained through the graph attention network;
Specifically, the graph attention network adopts a two-layer attention structure, and the input data sequentially passes through a graph attention module, a regularization module, an activation function, a second graph attention module, a regularization module and an activation function to obtain the updated feature matrix RGAT;
The formula of the graph attention module is as follows:
h_i^(l+1) = σ( Σ_{j∈N_i} α_ij · W^(l) · h_j^(l) )
where σ is the activation function, N_i is the set of vertices adjacent to vertex i, α_ij is the attention weight of vertex i with respect to vertex j, computed from the features of vertices i and j, and W^(l) is a learnable matrix; h_j^(l) is the feature vector of vertex j output by the l-th layer, R^(l+1) is the feature matrix output by the (l+1)-th layer, formed by stacking the feature vectors of all vertices output by that layer, and R^(0) is the feature matrix R used as the input of the first layer, whose i-th row h_i^(0) is the feature vector of vertex i.
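The graph attention step can be sketched similarly. This minimal illustration assumes the common single-vector attention form: a logit from a shared vector a over the concatenated transformed features, passed through LeakyReLU and softmax-normalized over each vertex's neighbors; W, a and the LeakyReLU slope are toy stand-ins for learned parameters.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gat_step(features, neighbors, W, a, slope=0.2):
    """One graph-attention update: h_i' = sum_j alpha_ij * W h_j."""
    h = [matvec(W, x) for x in features]           # transform every vertex
    out = []
    for i, N_i in enumerate(neighbors):
        logits = []
        for j in N_i:
            # attention logit: LeakyReLU(a . [W h_i || W h_j])
            z = sum(av * hv for av, hv in zip(a, h[i] + h[j]))
            logits.append(z if z > 0 else slope * z)
        denom = sum(math.exp(z) for z in logits)
        alphas = [math.exp(z) / denom for z in logits]  # softmax over N_i
        out.append([sum(al * h[j][k] for al, j in zip(alphas, N_i))
                    for k in range(len(h[i]))])
    return out

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
W = [[1.0, 0.0], [0.0, 1.0]]       # identity as a toy W
a = [0.0, 0.0, 0.0, 0.0]           # zero attention vector -> uniform alphas
out = gat_step(features, neighbors, W, a)
```

With the zero attention vector every neighbor gets equal weight, so each output row is simply the mean of the neighbors' transformed features; a trained a would weight neighbors unevenly.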
S1043, the feature matrices obtained from the graph convolution network branch and the graph attention network branch are fused through a learnable weight α into a feature matrix R′, which is used as the output of the adaptive graph neural network; the formula is:
R′=α·RGCN+(1-α)·RGAT;
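The adaptive fusion above can be sketched as an element-wise blend of the two branch outputs. Alpha is fixed here for illustration; in training it is a learnable scalar updated by back-propagation.

```python
def fuse(R_gcn, R_gat, alpha):
    """Blend two feature matrices row by row: R' = alpha*RGCN + (1-alpha)*RGAT."""
    return [[alpha * g + (1 - alpha) * t for g, t in zip(rg, rt)]
            for rg, rt in zip(R_gcn, R_gat)]

R_gcn = [[1.0, 2.0], [3.0, 4.0]]   # output of the graph convolution branch
R_gat = [[3.0, 0.0], [1.0, 2.0]]   # output of the graph attention branch
R_prime = fuse(R_gcn, R_gat, alpha=0.5)
```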
In this embodiment, the adaptive graph neural network combines the advantages of the graph convolution network in smoothing and normalization with the advantages of the graph attention network in capturing global correlation and independence, so that the adaptive graph neural network can adapt to the table structure recognition task through learning and enhance the representation capability of the vertex features.
S105, according to the table graph and the vertex features of each modality, intra-modal and inter-modal feature interaction is performed by using the adaptive graph neural network to obtain the updated fusion feature of each vertex.
In this embodiment, step S105 specifically includes:
(1) Stacking the spatial features of all vertices into a matrix, which is input into the adaptive graph neural network as the feature matrix, completing feature interaction within the spatial modality to obtain an updated spatial feature matrix RG′;
(2) Stacking the image features of all vertices into a matrix, which is input into the adaptive graph neural network as the feature matrix, completing feature interaction within the image modality to obtain an updated image feature matrix RI′;
(3) Stacking the text features of all vertices into a matrix, which is input into the adaptive graph neural network as the feature matrix, completing feature interaction within the text modality to obtain an updated text feature matrix RC′;
(4) The updated modal feature matrices are fused through learnable weights β, γ and δ to obtain the fusion feature matrix RFusion of the vertices:
RFusion = β·RG′ + γ·RI′ + δ·RC′
(5) The fusion feature matrix of the vertices is input into the adaptive graph neural network as the feature matrix, completing feature interaction among the three modalities to obtain an updated fusion feature matrix RFusion′.
Each row in the updated fusion feature matrix represents the updated fusion feature for each vertex.
This embodiment makes full use of the modal information provided by the table: the spatial, image and text features of the table undergo intra-modal and inter-modal feature interaction through the graph neural network, which strengthens intra-modal and inter-modal cooperation and effectively enhances the feature representation of each modality of the table vertices.
S106, obtaining the feature of each edge according to the updated fusion features of its two vertices, classifying each edge by a classifier with the edge features as input, and correcting the table graph according to the edge classification results to obtain the table structure information.
First, the network composed of the feature embedding network, the adaptive graph neural network and the classifier is trained, that is, gradients are back-propagated according to the classification results and the parameters of the feature embedding network, the adaptive graph neural network and the classifier are updated;
In the test stage, the trained network outputs the category of each edge of the table graph in the table image to be identified, and the table graph is corrected according to the edge classification results to directly obtain the table structure information.
In this embodiment, step S106 specifically includes:
(1) For each edge eij in the edge set of the table graph, the feature of the edge is obtained by combining the updated fusion features fi′ and fj′ of its two vertices i and j, taken from the updated fusion feature matrix RFusion′, thereby obtaining the feature of every edge in the edge set;
(2) The edge feature matrix E is input into the classifier for classification, and the category of each edge is output, the categories being same row, same column, same cell, or no relation;
(3) In the training stage, cross entropy is used as the loss function to compute the loss between the classification results and the labels, gradients are back-propagated, and the parameters of the feature embedding network, the adaptive graph neural network and the edge classifier are updated. In the test stage, the table graph constructed in step S102 is corrected according to the edge classification results to obtain the table structure information, see FIG. 3 (b): the edges classified as no relation are deleted from the table graph, thereby obtaining accurate table structure information.
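The inference-time part of steps (1)-(3) can be sketched as follows. The concatenation of the two endpoint features and the threshold-based stub classifier are illustrative assumptions standing in for the trained edge classifier; only the graph-correction logic (dropping no-relation edges) follows the text directly.

```python
LABELS = ("same row", "same column", "same cell", "no relation")

def edge_features(edges, fused):
    """Feature of edge (i, j): concatenation of the two updated vertex features."""
    return {(i, j): fused[i] + fused[j] for i, j in edges}

def classify(feat):
    # Stub classifier: a trained model would map the edge feature to one of
    # LABELS; here we just compare the two halves of the concatenated feature.
    fi, fj = feat[:2], feat[2:]
    if abs(fi[0] - fj[0]) < 0.1:
        return "same column"
    if abs(fi[1] - fj[1]) < 0.1:
        return "same row"
    return "no relation"

def correct_graph(edges, feats):
    """Keep only the edges whose predicted category is an actual relation."""
    kept = {}
    for e in edges:
        label = classify(feats[e])
        if label != "no relation":
            kept[e] = label
    return kept

fused = {0: [0.1, 0.2], 1: [0.1, 0.9], 2: [0.7, 0.2]}  # updated fusion features
edges = [(0, 1), (0, 2), (1, 2)]
feats = edge_features(edges, fused)
table = correct_graph(edges, feats)   # corrected table graph with edge labels
```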
In this embodiment, the public table recognition dataset SciTSR is used to train the network composed of the feature embedding network, the adaptive graph neural network and the edge classifier, and to update the parameters of these three components.
In this embodiment, in terms of task definition, the table recognition task is formulated as an edge classification task in the field of graph neural networks, and the relation between text lines is classified by the classifier as same row, same column, same cell, or no relation.
FIG. 5 shows an effect diagram for an image to be identified: performing steps S101 to S106 yields the corrected table graph, which is the true structure of the table. As shown in FIG. 6, the table has a structure of 10 rows and 2 columns, where each numbered vertex represents a text line, horizontal connection lines between vertices are edges classified as same row, and vertical connection lines are edges classified as same column. Notably, the vertical line between the two vertices numbered 17 and 18 in FIG. 6 is classified as a same-cell edge, because the two text lines in row 9, column 2 of FIG. 5 belong to the same cell.
Those skilled in the art will appreciate that all or part of the steps in a method implementing the above embodiments may be implemented by a program to instruct related hardware, and the corresponding program may be stored in a computer readable storage medium.
It should be noted that although the method operations of the above embodiments are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order or that all illustrated operations be performed in order to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
Example 2:
As shown in fig. 7, the present embodiment provides a table structure recognition system based on a graph neural network, where the system includes a recognition module 701, a construction module 702, an embedding module 703, an interaction module 704, a classification module 705, and a correction module 706, and specific functions of the respective modules are as follows:
The recognition module 701 is configured to detect and recognize the table image to be recognized to obtain text lines and the bounding box corresponding to each text line, where a text line represents one line of text in a cell;
the construction module 702 is configured to construct a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
the embedding module 703 is configured to obtain three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
the interaction module 704 is configured to perform intra-modal and inter-modal feature interaction by using the adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
the classification module 705 is configured to obtain the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, input the feature of each edge into a classifier, and output the category of each edge;
and the correction module 706 is configured to correct the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
For the specific implementation of each module in this embodiment, reference may be made to Embodiment 1 above, which will not be repeated here. It should be noted that the system provided in this embodiment is illustrated only by the division of the above functional modules; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.
Example 3:
This embodiment provides an electronic device, which may be a computer, a server or the like. As shown in FIG. 8, it includes a processor 802, a memory, an input device 803, a display 804 and a network interface 805 connected through a system bus 801. The processor is configured to provide computing and control capabilities; the memory includes a nonvolatile storage medium 806 and an internal memory 807, where the nonvolatile storage medium 806 stores an operating system, a computer program and a database, and the internal memory 807 provides an environment for running the operating system and the computer program in the nonvolatile storage medium. When the processor 802 executes the computer program stored in the memory, the table structure identification method of Embodiment 1 above is implemented, as follows:
detecting and identifying the table image to be identified to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
performing intra-modal and inter-modal feature interaction by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
Example 4:
This embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the table structure identification method of Embodiment 1 above, as follows:
detecting and identifying the table image to be identified to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
performing intra-modal and inter-modal feature interaction by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
The computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In summary, the invention provides a table structure identification method, system, electronic device and computer-readable storage medium based on a graph neural network. Detecting and recognizing text lines of the image to be identified with the PaddleOCR engine makes it easy to extract the text-line information in the table cells effectively. Modeling the table graph of the text lines with the Delaunay triangulation algorithm achieves a good balance between connectivity and sparsity in the table graph and facilitates the subsequent feature interaction of the vertices. The adaptive graph neural network adapts to the table structure recognition task through learning and enhances the representation capability of the vertex features. Multi-modal cooperation and fusion through the graph neural network effectively enhance the feature representation of each modality of the table. Classifying the modeled edges realizes the application of the graph neural network to table structure identification, improves the accuracy of table structure identification, and is suitable for popularization and application. The invention mainly solves the problems that diverse table forms, scene changes and image degradation bring to table structure identification, fills the gap in prior research on how to effectively model a table graph and how to use a graph neural network for feature enhancement, and has important research significance in the field of table recognition.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other changes, modifications, substitutions, combinations and simplifications that do not depart from the spirit and principle of the present invention shall be regarded as equivalent replacements and are included in the protection scope of the present invention.