Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a table structure identification method, a system, an electronic device and a computer-readable storage medium based on a graph neural network, which can solve the problems that diverse table forms, scene changes and image degradation bring to table structure identification, and effectively improve the accuracy of table structure identification by effectively modeling the table graph and using the graph neural network for feature enhancement.
The first object of the present invention is to provide a table structure identification method based on a graph neural network.
A second object of the present invention is to provide a table structure recognition system based on a graph neural network.
A third object of the present invention is to provide an electronic device.
A fourth object of the present invention is to provide a storage medium.
The first object of the present invention can be achieved by adopting the following technical scheme:
a method for identifying a table structure based on a graph neural network, the method comprising:
detecting and identifying the table image to be identified to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
performing intra-modal and inter-modal feature interaction by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
Further, the performing intra-modal and inter-modal feature interaction by using the adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices includes:
for each of the three modal features, stacking the same modal feature of all vertices into a feature matrix, and inputting the feature matrix and the edge matrix of the table graph into the adaptive graph neural network to complete intra-modal feature interaction and obtain an updated modal feature matrix;
and fusing the three updated modal feature matrices corresponding to the three modal features, inputting the fused feature matrix and the edge matrix of the table graph into the adaptive graph neural network to complete inter-modal feature interaction and obtain an updated fusion feature matrix, wherein each row of the fusion feature matrix represents the updated fusion feature of one vertex, and the edge matrix is the set of all edges in the table graph.
Further, the adaptive graph neural network comprises a graph convolution network and a graph attention network;
the stacking the same modal feature of all vertices into a feature matrix and inputting the feature matrix and the edge matrix of the table graph into the adaptive graph neural network to complete intra-modal feature interaction and obtain an updated modal feature matrix includes:
stacking the same modal feature of all vertices into a feature matrix, and inputting the feature matrix and the edge matrix of the table graph into the graph convolution network and the graph attention network respectively to obtain a first feature matrix and a second feature matrix;
and fusing the first feature matrix and the second feature matrix, the fused feature matrix serving as the updated modal feature matrix.
Further, the table graph is constructed based on a Delaunay triangulation algorithm.
Further, the constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes includes:
uniformly sampling the bounding box of each text line, sampling a plurality of points on each bounding box, and marking the bounding box to which each point belongs;
applying a Delaunay triangulation algorithm to all the sampled points to obtain a Delaunay triangulation graph of all the points;
splitting each triangle in the Delaunay triangulation graph, wherein each side of a triangle represents an edge between the two bounding boxes to which its two sampling points belong, so as to obtain the edge set of all sampling points;
filtering repeated edges and invalid edges in the edge set, and retaining only one valid edge between every two text-line bounding boxes to obtain a valid edge set E;
and constructing a vertex set V with the text lines as vertices, and constructing the table graph G = (V, E) from the vertex set V and the valid edge set E.
Further, six points are sampled on each bounding box.
Further, the obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box includes:
performing feature embedding on the spatial coordinates of the bounding box of the text line by using a multi-layer perceptron to obtain the spatial feature of the text line;
extracting image features of the region corresponding to the text line in the table image to be identified to obtain the image feature of the text line;
encoding the content of the text line to obtain the text feature of the text line;
and performing feature embedding on the spatial feature, the image feature and the text feature of the text line to obtain the three modal features of the vertex corresponding to the text line.
Further, the extracting image features of the region corresponding to the text line in the table image to be identified to obtain the image feature of the text line includes:
performing feature extraction on the table image to be identified by using the deep neural network ResNet-50;
and extracting, from the extracted feature map, the image features of the region corresponding to the text line by using an ROI Align algorithm based on bilinear interpolation to obtain the image feature of the text line.
The second object of the invention can be achieved by adopting the following technical scheme:
a graph neural network based table structure identification system, the system comprising:
the recognition module is used for detecting and recognizing the table image to be recognized to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
the construction module is used for constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
the embedding module is used for obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
the interaction module is used for performing intra-modal and inter-modal feature interaction by using the adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
the classification module is used for obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and the correction module is used for correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
The third object of the present invention can be achieved by adopting the following technical scheme:
An electronic device comprises a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the table structure identification method described above.
The fourth object of the present invention can be achieved by adopting the following technical scheme:
a computer-readable storage medium storing a program which, when executed by a processor, implements the above-described table structure identification method.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a table structure identification method, system, device and storage medium based on a graph neural network. A table image to be identified is detected and recognized to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell and there are at least two text lines. A table graph with the text lines as vertices and edge relations between the vertices is constructed according to the text lines and the corresponding bounding boxes. Three modal features of the vertex corresponding to each text line are obtained according to the text line and the corresponding bounding box. Intra-modal and inter-modal feature interaction is performed by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices. The feature of each edge is obtained according to the updated fusion features of the two vertices corresponding to the edge in the table graph, the feature of each edge is input into a classifier, and the category of each edge is output. The table graph is then corrected according to the category of each edge to obtain the table structure information of the table image to be identified. By effectively constructing the table graph and using the graph neural network for feature enhancement, the accuracy of table structure identification is effectively improved.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the present application. It should be understood that the detailed description is intended to illustrate, not to limit, the application.
Example 1:
As shown in FIG. 1, this embodiment provides a table structure identification method based on a graph neural network, which includes the following steps:
S101, detecting and recognizing all text lines of each cell in the table image to be identified and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell.
The table image to be identified is a scene image containing a table, including electronic documents such as PDF and Excel files, scanned documents, notes such as bills and invoices, and food packaging, promotional pages, personal notes and the like in natural scenes.
Text lines in the table cells are detected and recognized by an OCR engine to obtain all text lines of each cell in the table and the spatial coordinates of the bounding box corresponding to each text line.
Optionally, the OCR engine is the open-source project PaddleOCR, which can detect and recognize the bounding boxes and contents of all text lines in the table cells, obtaining the spatial coordinates of the bounding box and the content of the text line for each text line.
S102, constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes.
A table graph is constructed from the text lines and the corresponding bounding boxes by using a Delaunay triangulation algorithm, yielding a table graph with the text lines as vertices and edge relations between the vertices.
In this embodiment, step S102 specifically includes:
(1) Uniformly sampling the bounding box of each text line, sampling 6 points on each bounding box, and marking the bounding box to which each point belongs;
Specifically, 2 sampling points are taken on each long side of the bounding box and 1 sampling point on each short side, so that edges can be found in the four directions up, down, left and right. In practice, these 6 points are computed from the 4 corner points of the rectangular box; together with the 4 corner points, this gives 10 points in total. Experiments show that sampling 6 points recalls valid edges about as well as sampling more points (such as all 10) while producing fewer erroneous edges.
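The 6-point sampling described above can be sketched as follows. This is a hypothetical placement, since the exact positions are not specified here: two interior points on each long side at the one-third marks, plus the midpoint of each short side, all computed from the box corners.

```python
def sample_box_points(x1, y1, x2, y2):
    """Return 6 sampled points on the border of an axis-aligned box."""
    w, h = x2 - x1, y2 - y1
    if w >= h:  # long sides are the top and bottom edges
        return [
            (x1 + w / 3, y1), (x1 + 2 * w / 3, y1),  # top long side
            (x1 + w / 3, y2), (x1 + 2 * w / 3, y2),  # bottom long side
            (x1, y1 + h / 2), (x2, y1 + h / 2),      # short-side midpoints
        ]
    return [  # tall box: long sides are the left and right edges
        (x1, y1 + h / 3), (x1, y1 + 2 * h / 3),
        (x2, y1 + h / 3), (x2, y1 + 2 * h / 3),
        (x1 + w / 2, y1), (x1 + w / 2, y2),
    ]

# Tag every sampled point with the index of the box it belongs to.
boxes = [(0, 0, 90, 30), (0, 40, 90, 70)]
points = [(p, box_id) for box_id, b in enumerate(boxes)
          for p in sample_box_points(*b)]
```

Tagging each point with the index of its bounding box is what later allows triangle sides between sample points to be mapped back to edges between text-line boxes.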
(2) Applying a Delaunay triangulation algorithm to all the sampled points to obtain a Delaunay triangulation graph GD of all the points, see FIG. 2;
(3) Splitting each triangle in the Delaunay triangulation graph GD, wherein each side of a triangle represents an edge between the two bounding boxes to which its two sampling points belong, so as to obtain the edge set ED of all sampling points;
(4) Filtering repeated edges and invalid edges in the edge set, and retaining only one valid edge between every two text-line bounding boxes to obtain a valid edge set E;
(5) Constructing a vertex set V with the text lines as vertices, and constructing the table graph G = (V, E) from the vertex set V and the valid edge set E.
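Steps (3)-(5) above can be sketched as follows. The triangle list is a toy stand-in for the output of an actual Delaunay triangulation routine; edges whose two sampling points belong to the same box are treated as invalid and discarded, and repeated edges are deduplicated.

```python
def build_edge_set(triangles, point_to_box):
    """Split triangles into sides, map sides to box pairs, dedupe and filter."""
    edges = set()
    for a, b, c in triangles:                      # split each triangle
        for p, q in ((a, b), (b, c), (a, c)):      # into its three sides
            u, v = point_to_box[p], point_to_box[q]
            if u != v:                             # filter invalid self-edges
                edges.add((min(u, v), max(u, v)))  # dedupe repeated edges
    return sorted(edges)

# Toy data: 5 sampling points belonging to 3 text-line boxes.
point_to_box = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
E = build_edge_set(triangles, point_to_box)   # valid edge set over box indices
```

The resulting pairs of box indices are exactly the valid edge set E of the table graph G = (V, E).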
Performing the operations of steps S101 and S102 on the image to be identified yields a table graph modeled by the Delaunay triangulation algorithm, see FIG. 3 (a), where each numbered vertex represents a text line and each connection between vertices represents an edge along which a relation may exist between the two vertices.
In this embodiment, the table graph is modeled by the Delaunay triangulation algorithm, which achieves a good balance between connectivity and sparsity in the table graph.
S103, obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box.
The text lines and the corresponding bounding boxes are input into a feature embedding network, which outputs the three modal features of the vertex corresponding to each text line.
In this embodiment, step S103 specifically includes:
(1) Performing feature embedding on the spatial coordinates of the bounding box of the text line by using a multi-layer perceptron to obtain the spatial feature RG of the text line;
(2) Performing feature extraction on the table image to be identified by using the deep neural network ResNet-50, and extracting the image features of the region corresponding to the text line by using an ROI Align algorithm based on bilinear interpolation to obtain the image feature RI of the text line;
(3) Encoding the content of the text line by using the pre-trained text embedding model Sentence Transformer to obtain the text feature RC of the text line;
The spatial feature, the image feature and the text feature are the three modal features of the text line, namely the three modal features of the vertex corresponding to the text line.
S104, designing an adaptive graph neural network combining a graph convolution network and a graph attention network.
In this embodiment, as shown in fig. 4, step S104 specifically includes:
S1041, taking the feature matrix R of one modality and the edge matrix E of the table graph as input, an updated feature matrix RGCN is obtained through the graph convolution network;
Specifically, the graph convolution network adopts a two-layer convolution structure, and the input data sequentially passes through a graph convolution module, a regularization module, an activation function, a second graph convolution module, a regularization module and an activation function to obtain the updated feature matrix RGCN;
The formula of the graph convolution module is as follows:
h_i^(l+1) = σ( Σ_{j∈N_i} (1/c_ij) · W^(l) · h_j^(l) )
where σ is the activation function, N_i is the set of vertices adjacent to vertex i, c_ij is a normalization weight depending on the number of neighbors of vertex i, and W^(l) is a learnable matrix; h_j^(l) is the feature vector of vertex j output by the l-th layer, R^(l+1) is the feature matrix output by the (l+1)-th layer, formed by stacking the feature vectors of all vertices output by that layer, and R^(0) is the feature matrix R used as the input of the first layer, whose i-th row h_i^(0) is the feature vector of vertex i.
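The graph convolution step described above can be sketched in a few lines. This is a minimal illustration, assuming ReLU as the activation σ and mean aggregation (c_ij equal to the neighbor count of vertex i), with a small fixed W standing in for the learned matrix:

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gcn_step(features, neighbors, W):
    """One graph-convolution update: h_i' = ReLU(sum_j W h_j / |N_i|)."""
    out = []
    for N_i in neighbors:
        agg = [0.0] * len(W)
        for j in N_i:                              # aggregate neighbors
            for k, v in enumerate(matvec(W, features[j])):
                agg[k] += v / len(N_i)             # c_ij = |N_i| normalization
        out.append([max(0.0, v) for v in agg])     # ReLU as the activation
    return out

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # one feature row per vertex
neighbors = [[1, 2], [0, 2], [0, 1]]               # N_i for each vertex i
W = [[0.5, 0.0], [0.0, 0.5]]                       # toy stand-in for W^(l)
updated = gcn_step(features, neighbors, W)
```

A real layer would learn W by back-propagation and typically also include a self-loop term; the sketch only shows the neighbor aggregation of the formula.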
S1042, taking the feature matrix R of one modality and the edge matrix E of the table graph as input, an updated feature matrix RGAT is obtained through the graph attention network;
Specifically, the graph attention network adopts a two-layer attention structure, and the input data sequentially passes through a graph attention module, a regularization module, an activation function, a second graph attention module, a regularization module and an activation function to obtain the updated feature matrix RGAT;
The formula of the graph attention module is as follows:
h_i^(l+1) = σ( Σ_{j∈N_i} α_ij · W^(l) · h_j^(l) )
where σ is the activation function, N_i is the set of vertices adjacent to vertex i, α_ij is the attention weight of vertex i with respect to vertex j, computed from the features of vertices i and j, and W^(l) is a learnable matrix; h_j^(l) is the feature vector of vertex j output by the l-th layer, R^(l+1) is the feature matrix output by the (l+1)-th layer, formed by stacking the feature vectors of all vertices output by that layer, and R^(0) is the feature matrix R used as the input of the first layer, whose i-th row h_i^(0) is the feature vector of vertex i.
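The graph attention step can be sketched similarly. This minimal illustration assumes the common single-vector attention form: a logit from a shared vector a over the concatenated transformed features, passed through LeakyReLU and softmax-normalized over each vertex's neighbors; W, a and the LeakyReLU slope are toy stand-ins for learned parameters.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gat_step(features, neighbors, W, a, slope=0.2):
    """One graph-attention update: h_i' = sum_j alpha_ij * W h_j."""
    h = [matvec(W, x) for x in features]           # transform every vertex
    out = []
    for i, N_i in enumerate(neighbors):
        logits = []
        for j in N_i:
            # attention logit: LeakyReLU(a . [W h_i || W h_j])
            z = sum(av * hv for av, hv in zip(a, h[i] + h[j]))
            logits.append(z if z > 0 else slope * z)
        denom = sum(math.exp(z) for z in logits)
        alphas = [math.exp(z) / denom for z in logits]  # softmax over N_i
        out.append([sum(al * h[j][k] for al, j in zip(alphas, N_i))
                    for k in range(len(h[i]))])
    return out

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
W = [[1.0, 0.0], [0.0, 1.0]]       # identity as a toy W
a = [0.0, 0.0, 0.0, 0.0]           # zero attention vector -> uniform alphas
out = gat_step(features, neighbors, W, a)
```

With the zero attention vector every neighbor gets equal weight, so each output row is simply the mean of the neighbors' transformed features; a trained a would weight neighbors unevenly.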
S1043, the feature matrices obtained from the graph convolution network branch and the graph attention network branch are fused through a learnable weight α into a feature matrix R′, which is used as the output of the adaptive graph neural network; the formula is:
R′=α·RGCN+(1-α)·RGAT;
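The adaptive fusion above can be sketched as an element-wise blend of the two branch outputs. Alpha is fixed here for illustration; in training it is a learnable scalar updated by back-propagation.

```python
def fuse(R_gcn, R_gat, alpha):
    """Blend two feature matrices row by row: R' = alpha*RGCN + (1-alpha)*RGAT."""
    return [[alpha * g + (1 - alpha) * t for g, t in zip(rg, rt)]
            for rg, rt in zip(R_gcn, R_gat)]

R_gcn = [[1.0, 2.0], [3.0, 4.0]]   # output of the graph convolution branch
R_gat = [[3.0, 0.0], [1.0, 2.0]]   # output of the graph attention branch
R_prime = fuse(R_gcn, R_gat, alpha=0.5)
```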
In this embodiment, the adaptive graph neural network combines the advantages of the graph convolution network in smoothing and normalization with the advantages of the graph attention network in capturing global correlation and independence, so that the adaptive graph neural network can adapt to the table structure recognition task through learning and enhance the representation capability of the vertex features.
S105, according to the table graph and the vertex features of each modality, intra-modal and inter-modal feature interaction is performed by using the adaptive graph neural network to obtain the updated fusion feature of each vertex.
In this embodiment, step S105 specifically includes:
(1) Stacking the spatial features of all vertices into a matrix, which is input into the adaptive graph neural network as the feature matrix, completing feature interaction within the spatial modality to obtain an updated spatial feature matrix RG′;
(2) Stacking the image features of all vertices into a matrix, which is input into the adaptive graph neural network as the feature matrix, completing feature interaction within the image modality to obtain an updated image feature matrix RI′;
(3) Stacking the text features of all vertices into a matrix, which is input into the adaptive graph neural network as the feature matrix, completing feature interaction within the text modality to obtain an updated text feature matrix RC′;
(4) The updated modal feature matrices are fused through learnable weights β, γ and δ to obtain the fusion feature matrix RFusion of the vertices:
RFusion = β·RG′ + γ·RI′ + δ·RC′
(5) The fusion feature matrix of the vertices is input into the adaptive graph neural network as the feature matrix, completing feature interaction among the three modalities to obtain an updated fusion feature matrix RFusion′.
Each row in the updated fusion feature matrix represents the updated fusion feature for each vertex.
This embodiment makes full use of the modal information provided by the table: the spatial, image and text features of the table undergo intra-modal and inter-modal feature interaction through the graph neural network, which strengthens intra-modal and inter-modal cooperation and effectively enhances the feature representation of each modality of the table vertices.
S106, obtaining the feature of each edge according to the updated fusion features of its two vertices, classifying each edge by a classifier with the edge features as input, and correcting the table graph according to the edge classification results to obtain the table structure information.
First, the network composed of the feature embedding network, the adaptive graph neural network and the classifier is trained, that is, gradients are back-propagated according to the classification results and the parameters of the feature embedding network, the adaptive graph neural network and the classifier are updated;
In the test stage, the trained network outputs the category of each edge of the table graph in the table image to be identified, and the table graph is corrected according to the edge classification results to directly obtain the table structure information.
In this embodiment, step S106 specifically includes:
(1) For each edge eij in the edge set of the table graph, the feature of the edge is obtained by combining the updated fusion features fi′ and fj′ of its two vertices i and j, taken from the updated fusion feature matrix RFusion′, thereby obtaining the feature of every edge in the edge set;
(2) The edge feature matrix E is input into the classifier for classification, and the category of each edge is output, the categories being same row, same column, same cell, or no relation;
(3) In the training stage, cross entropy is used as the loss function to compute the loss between the classification results and the labels, gradients are back-propagated, and the parameters of the feature embedding network, the adaptive graph neural network and the edge classifier are updated. In the test stage, the table graph constructed in step S102 is corrected according to the edge classification results to obtain the table structure information, see FIG. 3 (b): the edges classified as no relation are deleted from the table graph, thereby obtaining accurate table structure information.
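The inference-time part of steps (1)-(3) can be sketched as follows. The concatenation of the two endpoint features and the threshold-based stub classifier are illustrative assumptions standing in for the trained edge classifier; only the graph-correction logic (dropping no-relation edges) follows the text directly.

```python
LABELS = ("same row", "same column", "same cell", "no relation")

def edge_features(edges, fused):
    """Feature of edge (i, j): concatenation of the two updated vertex features."""
    return {(i, j): fused[i] + fused[j] for i, j in edges}

def classify(feat):
    # Stub classifier: a trained model would map the edge feature to one of
    # LABELS; here we just compare the two halves of the concatenated feature.
    fi, fj = feat[:2], feat[2:]
    if abs(fi[0] - fj[0]) < 0.1:
        return "same column"
    if abs(fi[1] - fj[1]) < 0.1:
        return "same row"
    return "no relation"

def correct_graph(edges, feats):
    """Keep only the edges whose predicted category is an actual relation."""
    kept = {}
    for e in edges:
        label = classify(feats[e])
        if label != "no relation":
            kept[e] = label
    return kept

fused = {0: [0.1, 0.2], 1: [0.1, 0.9], 2: [0.7, 0.2]}  # updated fusion features
edges = [(0, 1), (0, 2), (1, 2)]
feats = edge_features(edges, fused)
table = correct_graph(edges, feats)   # corrected table graph with edge labels
```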
In this embodiment, the public table recognition dataset SciTSR is used to train the network composed of the feature embedding network, the adaptive graph neural network and the edge classifier, and to update the parameters of these three components.
In this embodiment, in terms of task definition, the table recognition task is formulated as an edge classification task in the field of graph neural networks, and the relation between text lines is classified by the classifier as same row, same column, same cell, or no relation.
FIG. 5 shows an effect diagram for an image to be identified: performing steps S101 to S106 yields the corrected table graph, which is the true structure of the table. As shown in FIG. 6, the table has a structure of 10 rows and 2 columns, where each numbered vertex represents a text line, horizontal connection lines between vertices are edges classified as same row, and vertical connection lines are edges classified as same column. Notably, the vertical line between the two vertices numbered 17 and 18 in FIG. 6 is classified as a same-cell edge, because the two text lines in row 9, column 2 of FIG. 5 belong to the same cell.
Those skilled in the art will appreciate that all or part of the steps in a method implementing the above embodiments may be implemented by a program to instruct related hardware, and the corresponding program may be stored in a computer readable storage medium.
It should be noted that although the method operations of the above embodiments are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order or that all illustrated operations be performed in order to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
Example 2:
As shown in fig. 7, the present embodiment provides a table structure recognition system based on a graph neural network, where the system includes a recognition module 701, a construction module 702, an embedding module 703, an interaction module 704, a classification module 705, and a correction module 706, and specific functions of the respective modules are as follows:
The recognition module 701 is configured to detect and recognize the table image to be recognized to obtain text lines and the bounding box corresponding to each text line, where a text line represents one line of text in a cell;
the construction module 702 is configured to construct a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
the embedding module 703 is configured to obtain three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
the interaction module 704 is configured to perform intra-modal and inter-modal feature interaction by using the adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
the classification module 705 is configured to obtain the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, input the feature of each edge into a classifier, and output the category of each edge;
and the correction module 706 is configured to correct the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
For the specific implementation of each module in this embodiment, reference may be made to Embodiment 1 above, which will not be repeated here. It should be noted that the system provided in this embodiment is illustrated only by the division of the above functional modules; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.
Example 3:
This embodiment provides an electronic device, which may be a computer, a server or the like. As shown in FIG. 8, it includes a processor 802, a memory, an input device 803, a display 804 and a network interface 805 connected through a system bus 801. The processor is configured to provide computing and control capabilities; the memory includes a nonvolatile storage medium 806 and an internal memory 807, where the nonvolatile storage medium 806 stores an operating system, a computer program and a database, and the internal memory 807 provides an environment for running the operating system and the computer program in the nonvolatile storage medium. When the processor 802 executes the computer program stored in the memory, the table structure identification method of Embodiment 1 above is implemented, as follows:
detecting and identifying the table image to be identified to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
performing intra-modal and inter-modal feature interaction by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
Example 4:
This embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the table structure identification method of Embodiment 1 above, as follows:
detecting and identifying the table image to be identified to obtain text lines and the bounding box corresponding to each text line, wherein a text line represents one line of text in a cell;
constructing a table graph with the text lines as vertices and edge relations between the vertices according to the text lines and the corresponding bounding boxes;
obtaining three modal features of the vertex corresponding to each text line according to the text line and the corresponding bounding box;
performing intra-modal and inter-modal feature interaction by using an adaptive graph neural network according to the table graph and the three modal features of the vertices to obtain updated fusion features of the vertices;
obtaining the feature of each edge according to the updated fusion features of the two vertices corresponding to the edge in the table graph, inputting the feature of each edge into a classifier, and outputting the category of each edge;
and correcting the table graph according to the category of each edge to obtain the table structure information of the table image to be identified.
The computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In summary, the invention provides a table structure identification method, system, electronic device and computer-readable storage medium based on a graph neural network. Detecting and recognizing text lines of the image to be identified with the PaddleOCR engine makes it easy to extract the text-line information in the table cells effectively. Modeling the table graph of the text lines with the Delaunay triangulation algorithm achieves a good balance between connectivity and sparsity in the table graph and facilitates the subsequent feature interaction of the vertices. The adaptive graph neural network adapts to the table structure recognition task through learning and enhances the representation capability of the vertex features. Multi-modal cooperation and fusion through the graph neural network effectively enhance the feature representation of each modality of the table. Classifying the modeled edges realizes the application of the graph neural network to table structure identification, improves the accuracy of table structure identification, and is suitable for popularization and application. The invention mainly solves the problems that diverse table forms, scene changes and image degradation bring to table structure identification, fills the gap in prior research on how to effectively model a table graph and how to use a graph neural network for feature enhancement, and has important research significance in the field of table recognition.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other changes, modifications, substitutions, combinations and simplifications that do not depart from the spirit and principle of the present invention shall be regarded as equivalent replacements and are included in the protection scope of the present invention.