CN114332893B - Table structure recognition method, device, computer equipment and storage medium - Google Patents

Table structure recognition method, device, computer equipment and storage medium

Info

Publication number
CN114332893B
Authority
CN
China
Prior art keywords
features
text
text region
nodes
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111020622.7A
Other languages
Chinese (zh)
Other versions
CN114332893A (en)
Inventor
李鑫
刘皓
刘银松
姜德强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111020622.7A
Publication of CN114332893A
Application granted
Publication of CN114332893B
Status: Active
Anticipated expiration


Abstract

The application relates to a table structure recognition method, a table structure recognition apparatus, a computer device and a storage medium. The method comprises the steps of obtaining a target table image region, recognizing the text regions in the target table image region, determining the image feature and the coordinate feature of each text region, and fusing the image feature and the coordinate feature to obtain the corresponding text region element fusion feature. The adjacent features of each node in the target table image region are determined according to the text region element fusion features, feature splicing is performed on the adjacent features of any two nodes, classification prediction is performed on the spliced adjacent matrix to generate a row-column relation prediction result for the text regions corresponding to the two nodes, and the table structure corresponding to the target table image region is determined based on the row-column relation prediction results. With this method, overall recognition of the target table image region can be realized, the corresponding table structure is obtained from the row-column relation prediction results, and both the accuracy and the efficiency of table recognition are improved.

Description

Table structure identification method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a computer device, and a storage medium for identifying a table structure.
Background
With the development of artificial intelligence technology and the growing demands on the efficiency and accuracy of extracting, organizing and updating data, tables have become a standardized storage form for structured data, making it convenient for users to query, extract, update and enter the data stored in them. At present, however, a table is usually converted into PDF format before being published, so the data in the table cannot be extracted or updated directly, which has given rise to structure and content recognition technologies for PDF tables.
In the conventional table recognition method, text detection is first performed on the PDF file to obtain the text regions in the image, which may include the different text regions involved in the image; a graph neural network is then used to predict the relation between every two text regions, whether the corresponding text regions need to be merged is determined according to that relation, and finally the predicted adjacent matrix is post-processed to reproduce the table structure in the image, after which the content in the table is recognized.
However, the conventional table recognition method cannot directly handle blank fields in a table; the predicted adjacent matrix can only represent whether text regions are merged, only the features of field nodes are considered, the table to be recognized is not covered as a whole, and an additional text detection network is required to locate text positions in the image and further organize in-line information. The conventional method therefore cannot recognize the table globally and as a whole, and the additionally required text detection network makes content recognition errors likely, so table recognition efficiency remains low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, and a storage medium for identifying a table structure, which are capable of integrally and comprehensively identifying a PDF table to improve the accuracy and efficiency of table identification.
A method of identifying a table structure, the method comprising:
Acquiring a target table image area and identifying a text area in the target table image area;
Determining image features and coordinate features of each text region, and respectively fusing the image features and the coordinate features to obtain text region element fusion features corresponding to each text region;
Determining adjacent characteristics of each node in the target form image area according to the text area element fusion characteristics;
performing feature stitching on adjacent features of any two nodes, and performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of a text region corresponding to the two nodes;
and determining a table structure corresponding to the target table image area based on the row-column relation prediction result of each text area.
In one embodiment, the feature aggregation based on the local feature and the global feature generates an adjacent feature of each node in the target table image area, including:
Acquiring each door parameter corresponding to the door mechanism;
and carrying out feature aggregation on the local features and the global features based on a preset activation function and each gate parameter to obtain adjacent features of each node in the target table image area.
A form structure identification device, the device comprising:
The text region identification module is used for acquiring a target table image region and identifying a text region in the target table image region;
The text region element fusion feature generation module is used for determining the image features and the coordinate features of each text region, and respectively fusing the image features and the coordinate features to obtain text region element fusion features corresponding to each text region;
the adjacent feature generation module is used for determining adjacent features of all nodes in the target form image area according to the text area element fusion features;
The row-column relation prediction result generation module is used for carrying out feature splicing on adjacent features of any two nodes, carrying out classification prediction on adjacent matrixes obtained by splicing, and generating a row-column relation prediction result of a text area corresponding to the two nodes;
and the table structure determining module is used for determining a table structure corresponding to the target table image area based on the row-column relation prediction result of each text area.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
Acquiring a target table image area and identifying a text area in the target table image area;
Determining image features and coordinate features of each text region, and respectively fusing the image features and the coordinate features to obtain text region element fusion features corresponding to each text region;
Determining adjacent characteristics of each node in the target form image area according to the text area element fusion characteristics;
performing feature stitching on adjacent features of any two nodes, and performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of a text region corresponding to the two nodes;
and determining a table structure corresponding to the target table image area based on the row-column relation prediction result of each text area.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring a target table image area and identifying a text area in the target table image area;
Determining image features and coordinate features of each text region, and respectively fusing the image features and the coordinate features to obtain text region element fusion features corresponding to each text region;
Determining adjacent characteristics of each node in the target form image area according to the text area element fusion characteristics;
performing feature stitching on adjacent features of any two nodes, and performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of a text region corresponding to the two nodes;
and determining a table structure corresponding to the target table image area based on the row-column relation prediction result of each text area.
In the table structure identification method, the device, the computer equipment and the storage medium, the text region in the target table image region is identified by acquiring the target table image region, the image characteristic and the coordinate characteristic of each text region are determined, the image characteristic and the coordinate characteristic are respectively fused to obtain the text region element fusion characteristic corresponding to each text region, and the fusion of the image characteristic and the coordinate characteristic can be carried out on different text regions in the target table image region so as to achieve the overall identification of the target table image region instead of the local identification of a single text region. And determining adjacent characteristics of each node in the target table image area according to the text area element fusion characteristics, performing characteristic splicing on the adjacent characteristics of any two nodes, performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of the text area corresponding to the two nodes, and determining a table structure corresponding to the target table image area based on the row-column relation prediction result of each text area. The method and the device have the advantages that the corresponding table structure can be determined and obtained according to the prediction result of the row-column relationship of each text region, the additional text detection network is not needed to be used for further identification, unnecessary complicated operation can be reduced, and the accuracy and the efficiency of table identification are improved.
Drawings
FIG. 1 is an application environment diagram of a table structure identification method in one embodiment;
FIG. 2 is a flow diagram of a table structure identification method in one embodiment;
FIG. 3 is a schematic diagram of a target table image area of a table structure identification method in one embodiment;
FIG. 4 is a diagram of text region detection results of a table structure recognition method in one embodiment;
FIG. 5 is a schematic diagram of a row relationship prediction result of a table structure recognition method according to an embodiment;
FIG. 6 is a schematic diagram of a column relationship prediction result of a table structure identification method in one embodiment;
FIG. 7 is a flow diagram of obtaining text region element fusion features corresponding to text regions in one embodiment;
FIG. 8 is a flowchart of an embodiment for obtaining local features of nodes corresponding to fusion features of text region elements;
FIG. 9 is a schematic overall flow diagram of a table structure identification method in one embodiment;
FIG. 10 is a schematic diagram of a FLAG network structure for generating adjacency features in one embodiment;
FIG. 11 is a block diagram showing a structure of a table structure recognition apparatus in one embodiment;
FIG. 12 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The application provides a table structure recognition method that relates to artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML), part of artificial intelligence software technology, is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service and smart classrooms. It is believed that, as the technology develops, artificial intelligence will be applied in ever more fields and become increasingly important.
The table structure identification method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The server 104 identifies text regions in the target table image region by acquiring the target table image region, determines image features and coordinate features of each text region, and fuses the image features and the coordinate features to obtain text region element fusion features corresponding to each text region. And the server 104 determines the adjacent characteristics of each node in the target table image area according to the text area element fusion characteristics, performs characteristic splicing on the adjacent characteristics of any two nodes, performs classification prediction on the spliced adjacent matrix, and generates a row-column relation prediction result of the text area corresponding to the two nodes. And then, based on the row-column relation prediction result of each text region, a table structure corresponding to the target table image region is determined, and the corresponding table structure is fed back to the terminal 102. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 104 may be implemented by a stand-alone server or a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a table structure identifying method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step S202, a target table image area is acquired, and a text area in the target table image area is identified.
Specifically, the target table image area corresponding to the image to be identified is obtained by carrying out target detection on the image to be identified, and then the target table image area is further identified, so that the text area in the target table image area is obtained through identification.
In one embodiment, as shown in fig. 3, a target table image area of a table structure recognition method is provided, and by performing target detection on an image to be recognized to obtain the target table image area shown in fig. 3, and further identifying the target table image area, a text area detection result of the table structure recognition method shown in fig. 4 can be obtained by identification.
Specifically, the text regions in the target table region are determined by performing target detection on the target table region with a Mask-RCNN network. The text regions may be represented by the text region detection result shown in fig. 4; that is, in the target table region there is a text region corresponding to each text box shown in fig. 4, and target detection on the target table region through the Mask-RCNN network determines the positions of the different text boxes.
Further, the Mask-RCNN network is a network that handles both general-purpose target detection and segmentation tasks and comprises a detection branch and a segmentation branch. In this embodiment, since only the text regions need to be identified, only the detection branch of the Mask-RCNN network is used to detect the target table region.
The backbone network of the Mask-RCNN network is a Res50 network with FPN, where FPN denotes a Feature Pyramid Network, a neural network that improves target detection by working at multiple scales, and the Res50 network denotes a 50-layer Deep Residual Network, a basic type of convolutional neural network. Feature maps of the image at different stages can be obtained through the Res50 network, and a feature pyramid is then built from these feature maps, yielding the Res50 network with FPN.
In one embodiment, since many redundant text regions still exist in the RPN (Region Proposal Network) prediction result obtained after target detection with the FPN-equipped Res50 network, an NMS algorithm is further used to filter all text regions and remove the redundant ones, reducing computational complexity. The NMS (Non-Maximum Suppression) algorithm is used in target detection to search for local maxima and filters out values that do not meet the maximum requirement.
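For illustration, the sketch below wires the detection branch and the NMS filtering together. The framework (torchvision), the thresholds, and the availability of a model trained to detect text regions are all assumptions; the patent names no concrete implementation.

```python
# Hypothetical sketch of the text-detection stage: Mask-RCNN with a
# ResNet-50 + FPN backbone, using only its detection outputs, followed by
# NMS filtering of redundant text boxes. "DEFAULT" weights are a COCO
# placeholder; a model fine-tuned for text regions is presumed in practice.
import torch
import torchvision
from torchvision.ops import nms

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_text_regions(table_image: torch.Tensor,
                        score_thresh: float = 0.5,
                        nms_iou_thresh: float = 0.5) -> torch.Tensor:
    # table_image: (3, H, W) float tensor of the target table image area;
    # returns filtered text boxes as an (N, 4) tensor of (x1, y1, x2, y2).
    with torch.no_grad():
        out = model([table_image])[0]          # use the detection branch only
    boxes, scores = out["boxes"], out["scores"]
    keep = scores > score_thresh               # drop low-confidence proposals
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, nms_iou_thresh)  # suppress redundant regions
    return boxes[keep]
```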
Step S204, determining the image features and the coordinate features of each text region, and respectively fusing the image features and the coordinate features to obtain text region element fusion features corresponding to each text region.
Specifically, the coordinate feature after the dimension increase can be obtained by acquiring the position coordinates of each text region determined from the target form image region and performing the dimension increase on the position coordinates of each text region. The image content of the corresponding text region can be further obtained according to the position coordinates of each text region, and then the image characteristic alignment is carried out based on the image content of the text region, so that the aligned image characteristic is obtained. Wherein the dimensions of the aligned image features are the same as the dimensions of the upscaled coordinate features.
Further, the text region element fusion characteristics corresponding to each text region are obtained by fusing the coordinate characteristics after dimension increase and the aligned image characteristics.
Step S206, determining the adjacent characteristics of each node in the target table image area according to the text area element fusion characteristics.
Specifically, local features and global features of nodes corresponding to the element fusion features of each text region are obtained, and feature aggregation is further performed based on the local features and the global features, so that adjacent features of each node in the target table image region are obtained.
K neighbor nodes of the node corresponding to each text region element fusion feature are determined with a K-nearest-neighbor (KNN) algorithm, the text region element fusion features of the K neighbor nodes and of the node itself are fused to obtain the node's adjacent fusion feature, and the adjacent fusion features of the nodes are integrated to obtain the corresponding aggregation features. An FCN (Fully Connected Network, i.e. a fully connected neural network) is then used to reduce the dimensionality of each node's aggregation feature, obtaining the dimension-reduced multi-head graph features. The dimension-reduced multi-head graph features are the local features of the nodes corresponding to the text region element fusion features.
And similarly, according to the multi-head attention mechanism, carrying out context feature aggregation on the text region element fusion features corresponding to each node to obtain the global features of the nodes corresponding to the text region element fusion features.
Further, by acquiring a preset activation function and each gate parameter corresponding to the gate mechanism, and further based on the preset activation function and each gate parameter, feature aggregation is performed on the local features and the global features, and adjacent features of each node in the target table image area are obtained.
In one embodiment, the local features and global features are aggregated using the following equation (1) to obtain the adjacent feature of each node in the target table image region:
F_agg = Sigmoid(gate_i) * F_global + (1 - Sigmoid(gate_i)) * F_local,  i ∈ [1, 2, …, N];  (1)
where F_agg denotes the aggregated adjacent feature, F_global the global feature, F_local the local feature, Sigmoid a preset activation function, and gate_i the gate parameter on the i-th head; a head is one attention perspective in the multi-head attention mechanism (Multi-head Attention), and the number of heads equals the number of gates in the gate mechanism.
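As a concrete illustration, the following is a minimal sketch of equation (1) as a module, assuming one learnable scalar gate per head and per-head features shaped (num_nodes, num_heads, head_dim); the class and parameter names are illustrative, not taken from the patent.

```python
# Minimal sketch of equation (1): per-head gated fusion of local (graph)
# and global (self-attention) features. One learnable scalar gate per head
# is an assumption consistent with "the number of heads is the same as the
# gate number of the gate mechanism".
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    def __init__(self, num_heads: int = 8):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(num_heads))  # gate_1 .. gate_N

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor) -> torch.Tensor:
        # f_local, f_global: (num_nodes, num_heads, head_dim)
        g = torch.sigmoid(self.gate).view(1, -1, 1)       # Sigmoid(gate_i)
        return g * f_global + (1.0 - g) * f_local         # F_agg, equation (1)
```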
And step S208, performing feature splicing on adjacent features of any two nodes, and performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of the text region corresponding to the two nodes.
Specifically, feature stitching is carried out on adjacent features of any two nodes to obtain a spliced adjacent matrix, and then classification prediction is carried out on the spliced adjacent matrix according to the fully-connected neural network to obtain a row-column relation prediction result of a corresponding text region.
Specifically, it is judged whether the two text regions corresponding to the spliced adjacent matrix belong to the same row of the table, and whether they belong to the same column of the table, thereby obtaining the row-column relation prediction result for the corresponding text regions.
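A minimal sketch of this splicing-and-classification step follows, assuming 128-dimensional adjacent features and a small two-layer fully connected classifier per relation type; the layer sizes and names are illustrative assumptions.

```python
# Sketch of step S208: concatenate ("splice") the adjacent features of
# every node pair and classify each pair as same-row / same-column with a
# fully connected network. Hidden sizes are illustrative.
import torch
import torch.nn as nn

class PairwiseRelationHead(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.row_cls = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.col_cls = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, node_feats: torch.Tensor):
        # node_feats: (N, feat_dim) adjacent features of the N text regions
        n = node_feats.size(0)
        a = node_feats.unsqueeze(1).expand(n, n, -1)   # feature of node i
        b = node_feats.unsqueeze(0).expand(n, n, -1)   # feature of node j
        pairs = torch.cat([a, b], dim=-1)              # spliced (N, N, 2*d)
        same_row = self.row_cls(pairs)                 # (N, N, 2) row logits
        same_col = self.col_cls(pairs)                 # (N, N, 2) column logits
        return same_row, same_col
```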
In one embodiment, as shown in fig. 5 and 6, a schematic view of a row relationship prediction result of a table structure identification method and a schematic view of a column relationship prediction result of a table structure identification method are provided, respectively. Referring to fig. 5, it can be seen that the line relation prediction result can determine text regions corresponding to each line in the table, and in fig. 5, text regions of different lines are represented by using gray levels of different shades, where text regions included in a first line are "SE", "POS tagging information", text regions included in a second line are "NT", "adj", "verb", "idiom", "noise", "other", text regions included in a third line are "pos", "1230", "734", "1026", "266", "642", text regions included in a fourth line are "neg", "785", "904", "746", "165", "797", text regions included in a fifth line are "neu", "918", "7569", "2016", "12668", "10214", and text regions included in a sixth line are "sum", "2933", "9207", "3788", "13099", "11653".
Further, referring to fig. 6, it is understood that the column relation prediction result can determine text regions corresponding to each column in the table, and in fig. 6, text regions of different columns are represented by using gray scales of different shades, where the text regions of the first column are "send", "POS", "neg", "neu", "sum", the text regions of the second column are "POS", "adj", "1230", "785", "918", "2933", the text regions of the third column are "tagg", "verb", "734", "904", "7569", "9207", the text regions of the fourth column are "ing in", "idiom", "1026", "746", "2016", "3788", and the text regions of the fifth column are "form", "non", "266", "165", "12668", "13099", and the text regions of the sixth column are "ation", "oth", "642", "797", "10214", "11653".
Step S210, a table structure corresponding to the target table image area is determined based on the row-column relation prediction result of each text area.
Specifically, according to the prediction result of the row-column relationship of every two text regions, it can be further determined which text regions belong to the same row and which text regions belong to the same column, and further analysis and arrangement are performed on the row-column relationship and the position coordinates of different text regions, so that a table structure corresponding to the target table image region can be determined.
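The patent does not fix a concrete algorithm for this analysis and arrangement; one plausible realization, sketched below, treats the predicted same-row and same-column pairs as graph edges and takes connected components (via union-find) as the rows and columns.

```python
# Hypothetical post-processing for step S210: group text regions into rows
# and columns by taking connected components over the predicted pairwise
# same-row / same-column relations. Union-find is one of several ways the
# grouping could be done; the patent does not specify one.
def group_by_relation(num_regions: int, same_pairs: list) -> list:
    parent = list(range(num_regions))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for i, j in same_pairs:                  # union every predicted pair
        parent[find(i)] = find(j)

    groups = {}
    for i in range(num_regions):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# rows = group_by_relation(n, predicted_same_row_pairs)
# cols = group_by_relation(n, predicted_same_col_pairs)
# Each text region then receives a (row, column) index pair, which, together
# with the position coordinates, fixes the reconstructed table grid.
```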
In one embodiment, to evaluate the accuracy of the table recognition results objectively, a dataset is constructed to measure the accuracy of table structure identification, as shown in Table 1. The recall of table structure identification is the proportion of the adjacent relations present in the test-set tables that are correctly predicted, and the accuracy of table structure identification is the proportion of the predicted adjacent relations that are correct.
Table 1  Table structure identification metrics

                                      Recall     Accuracy
  Table structure identification      97.91%     98.14%
In the table structure identification method, the text region in the target table image region is identified by acquiring the target table image region, the image characteristic and the coordinate characteristic of each text region are determined, the image characteristic and the coordinate characteristic are respectively fused to obtain the text region element fusion characteristic corresponding to each text region, and the fusion of the image characteristic and the coordinate characteristic can be carried out on different text regions in the target table image region so as to achieve the overall identification of the target table image region instead of the local identification of a single text region. And determining adjacent characteristics of each node in the target table image area according to the text area element fusion characteristics, performing characteristic splicing on the adjacent characteristics of any two nodes, performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of the text area corresponding to the two nodes, and determining a table structure corresponding to the target table image area based on the row-column relation prediction result of each text area. The method and the device have the advantages that the corresponding table structure can be determined and obtained according to the prediction result of the row-column relationship of each text region, the additional text detection network is not needed to be used for further identification, unnecessary complicated operation can be reduced, and the accuracy and the efficiency of table identification are improved.
In one embodiment, as shown in fig. 7, the step of obtaining a text region element fusion feature corresponding to each text region, that is, determining an image feature and a coordinate feature of each text region, and respectively fusing the image feature and the coordinate feature to obtain the text region element fusion feature corresponding to each text region specifically includes:
Step S702, position coordinates of each text region are determined from the target table image region, and dimension rising is carried out on the position coordinates of the text region, so that coordinate features after dimension rising are obtained.
Specifically, each text region is determined from the target table image region, the position coordinates of each text region are obtained, and further the position coordinates of each text region are respectively subjected to dimension lifting by adopting an FCN network (fully connected network), so that coordinate characteristics after dimension lifting are obtained.
The position coordinates of each text region are 4-dimensional, i.e. the four-dimensional coordinates (x, y, w, h); for subsequent fusion with the image features, an FCN is used to raise the four-dimensional coordinates to the same dimensionality as the image features.
In one embodiment, before acquiring the position coordinates of each text region determined from the target table image region, and performing dimension lifting on the position coordinates of the text region to obtain the coordinate features after dimension lifting, the method further comprises:
Screening the text regions whose intersection ratio with the annotated text regions is larger than a preset intersection ratio threshold.
Specifically, preset annotated text regions are obtained, the intersection-over-union (IoU) ratio of each text region in the target table image region with the preset annotated text regions is calculated, a preset intersection ratio threshold is obtained, and the text regions whose intersection ratio exceeds the preset threshold are selected.
The preset annotated text regions are text regions annotated in advance, which also carry the row-column relations of the corresponding annotated text regions, i.e. which annotated text regions belong to the same row or the same column. In this embodiment, the preset intersection ratio threshold may take values from 0.7 to 0.9; preferably, it may be 0.8.
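A small sketch of this screening step, with boxes in the (x, y, w, h) format used elsewhere in the text and the preferred threshold of 0.8. The text is ambiguous about whether matched regions are kept or discarded; this sketch keeps them, which suits the training-supervision use of the annotated row-column relations.

```python
# Sketch of IoU-based screening of detected text regions against
# pre-annotated regions. Pure Python; box format (x, y, w, h).
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def screen_regions(detected, annotated, thresh=0.8):
    # keep a detected region if it overlaps some annotated region enough
    return [box for box in detected
            if any(iou(box, gt) > thresh for gt in annotated)]
```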
Step S704, according to the position coordinates of each text region, the image content of the corresponding text region is acquired.
Specifically, according to the position coordinates of the text region, determining the specific position of the text region in the target table image region, further acquiring the image content corresponding to the specific position, and determining the image content as the image content corresponding to the text region.
Step S706, performing image feature alignment based on the image content of the text region to obtain an aligned image feature, where the dimension of the aligned image feature is the same as the dimension of the coordinate feature after dimension increase.
Specifically, image feature alignment is performed on the image content of the text region using the RoI Align algorithm (an algorithm that uses bilinear interpolation to produce fixed-size feature outputs for regions of interest of different sizes). The image features of the text region can be obtained from the FPN (Feature Pyramid Network) feature maps produced when the Mask-RCNN network performs target detection on the target table region to determine its text regions.
When the RoI Align algorithm is used to align the image features of the text region's image content, the aligned image features obtained are 128-dimensional. The position coordinates of each text region are the 4-dimensional coordinates (x, y, w, h); for subsequent fusion with the image features, an FCN (fully connected network) raises the four-dimensional coordinates to 128 dimensions, consistent with the dimensionality of the aligned image features.
Step S708, fusing the coordinate features after the dimension rise and the aligned image features to obtain text region element fusion features corresponding to each text region.
Specifically, the up-scaled coordinate features and the aligned image features corresponding to each text region are fused by point-wise addition, obtaining the text region element fusion feature corresponding to each text region.
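Putting steps S702 through S708 together, a hedged sketch: the 4-d coordinates are lifted to 128 dimensions with a fully connected layer, a 128-d image feature is pooled per region with RoI Align, and the two are fused by point-wise addition. The single-level feature map and the pooling size are assumptions; the text only fixes the dimensionalities.

```python
# Hedged sketch of steps S702-S708. Assumes a feature map with 128 channels
# so the pooled image feature is already 128-d; the patent only fixes the
# 4-d -> 128-d coordinate lift and the point-wise addition.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RegionFeatureFusion(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.coord_fc = nn.Linear(4, feat_dim)          # up-scale (x, y, w, h)

    def forward(self, fmap: torch.Tensor, boxes_xywh: torch.Tensor,
                spatial_scale: float = 1.0) -> torch.Tensor:
        # fmap: (1, 128, H, W) feature map; boxes_xywh: (N, 4) region boxes
        x, y, w, h = boxes_xywh.unbind(dim=1)
        boxes_xyxy = torch.stack([x, y, x + w, y + h], dim=1)
        batch_idx = torch.zeros(len(boxes_xyxy), 1)     # all boxes in image 0
        rois = torch.cat([batch_idx, boxes_xyxy], dim=1)
        pooled = roi_align(fmap, rois, output_size=(1, 1),
                           spatial_scale=spatial_scale, aligned=True)
        img_feat = pooled.flatten(1)                    # (N, 128) aligned image feature
        coord_feat = self.coord_fc(boxes_xywh)          # (N, 128) up-scaled coordinates
        return img_feat + coord_feat                    # point-wise addition fusion
```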
In this embodiment, the position coordinates of each text region are determined from the target table image region, and the position coordinates of the text region are up-scaled to obtain coordinate features after up-scaling. And according to the position coordinates of each text region, acquiring the image content of the corresponding text region, and carrying out image feature alignment based on the image content of the text region to obtain aligned image features, wherein the dimensions of the aligned image features are the same as those of the coordinate features after dimension increase. By fusing the coordinate features after dimension increase and the aligned image features, the text region element fusion features corresponding to each text region are obtained, so that the overall recognition of all text regions in the target table image region can be achieved, rather than the local recognition of a single text region, and the table recognition accuracy of the target table image region is improved.
In one embodiment, as shown in fig. 8, the step of obtaining the local feature of the node corresponding to the fusion feature of each text region element specifically includes:
Step S802, obtaining nodes corresponding to the fusion characteristics of the text region elements, and determining the preset number of adjacent nodes of each node.
Specifically, the nodes corresponding to the text region element fusion features are obtained, and a K-nearest-neighbor (KNN) algorithm is used to determine the K neighbor nodes of each such node. The KNN algorithm determines the K neighbor nodes nearest to the current node.
Step S804, fusing each node with the text region element fusion characteristics of each adjacent node corresponding to the node to obtain the adjacent fusion characteristics corresponding to each node.
Specifically, feature fusion is performed on the text region element features of the K adjacent nodes and the text region element fusion features of the nodes, so that adjacent fusion features of each node are obtained.
The text region element features of each node are 128-dimensional; after the text region element features of the K neighbor nodes are fused with the node's text region element fusion feature, the node's adjacent fusion feature grows to 128·K dimensions.
Step S806, integrating adjacent fusion features of all nodes to obtain an aggregation feature.
Specifically, the adjacent fusion features of the nodes are integrated with an FCN (fully connected network) to obtain the corresponding aggregation features. Since a node's adjacent fusion feature is 128·K-dimensional, the FCN aggregates each node's 128·K-dimensional adjacent fusion feature into a 128-dimensional aggregation feature.
And step S808, performing dimension reduction processing on the aggregate characteristics of each node to obtain dimension-reduced multi-head graph characteristics, wherein the dimension-reduced multi-head graph characteristics are local characteristics of the nodes corresponding to the text region element fusion characteristics.
Specifically, the aggregate characteristics of the nodes are subjected to dimension reduction processing by adopting a preset parallel FCN network, so that the multi-head graph characteristics after dimension reduction are obtained. The multi-head graph features after dimension reduction are local features of nodes corresponding to the text region element fusion features.
Further, in this embodiment, 8 parallel FCNs may be used to reduce the dimensionality of a node's aggregation feature, converting the 128-dimensional aggregation feature into 8 multi-head features of 16 dimensions each. When the parallel FCNs perform this dimension reduction, each network's reduction is independent of the others.
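A sketch of the whole local-feature pipeline of steps S802 to S808, assuming K = 4 neighbors measured by box-center distance (the patent fixes neither K nor the metric) and the stated 128·K → 128 → 8×16 dimensionalities. Whether the node's own feature is concatenated as well is left open by the text; this sketch uses the K neighbors only, matching the stated 128·K dimension.

```python
# Sketch of steps S802-S808: KNN neighbors -> 128*K adjacent fusion feature
# -> FCN aggregation to 128-d -> 8 parallel FCNs giving (N, 8, 16) local
# multi-head graph features. K and the distance metric are assumptions.
import torch
import torch.nn as nn

class LocalGraphFeature(nn.Module):
    def __init__(self, k: int = 4, dim: int = 128, heads: int = 8):
        super().__init__()
        self.k = k
        self.aggregate = nn.Linear(dim * k, dim)            # 128*K -> 128
        self.head_fcs = nn.ModuleList(                      # 8 parallel FCNs
            nn.Linear(dim, dim // heads) for _ in range(heads))

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # feats: (N, 128) fusion features; centers: (N, 2) text-box centers
        dist = torch.cdist(centers, centers)                # pairwise distances
        dist.fill_diagonal_(float("inf"))                   # exclude the node itself
        knn_idx = dist.topk(self.k, largest=False).indices  # (N, K) neighbor ids
        neighborhood = feats[knn_idx].flatten(1)            # (N, 128*K)
        agg = self.aggregate(neighborhood)                  # (N, 128) aggregation
        heads = [fc(agg) for fc in self.head_fcs]           # 8 independent 16-d maps
        return torch.stack(heads, dim=1)                    # (N, 8, 16) local feature
```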
In this embodiment, the adjacent fusion features corresponding to each node are obtained by obtaining the node corresponding to the fusion feature of each text region element, determining the preset number of adjacent nodes of each node, and further fusing each node with the text region element fusion feature of each adjacent node corresponding to the node. And integrating adjacent fusion characteristics of each node to obtain an aggregation characteristic, performing dimension reduction processing on the aggregation characteristics of each node to obtain a dimension-reduced multi-head image characteristic, wherein the dimension-reduced multi-head image characteristic is the local characteristic of the node corresponding to the text region element fusion characteristic. The method has the advantages that the further integration and dimension reduction processing of the text region element fusion characteristics are realized, the multi-head diagram characteristics after dimension reduction are obtained, global characteristics obtained by carrying out context characteristic aggregation on the text region element fusion characteristics through a multi-head attention mechanism are facilitated, the further fusion is carried out, the overall recognition of the target table image region is achieved, the local recognition of a single text region is not achieved, and the table recognition accuracy is improved.
In one embodiment, as shown in fig. 9, an overall flow of a table structure recognition method is provided, specifically including a P1 text detection portion, a P2 feature aggregation portion, and a P3 adjacency predicting portion, where:
1. The P1 text detection section specifically includes:
1) Target detection is performed on the target table region with a Mask-RCNN network to obtain the corresponding RPN (Region Proposal Network) result, and the text regions in the target table region are determined from that result. The backbone of the Mask-RCNN network is a Res50 network with FPN.
2) And filtering the text region obtained by recognition by using an NMS algorithm (non-maximum suppression algorithm) to obtain a filtered text region. As shown in fig. 4, the text region detection result includes text regions corresponding to the text boxes shown in fig. 4 in the target table region.
2. The P2 characteristic polymerization part specifically comprises:
1) And determining the image characteristics and the coordinate characteristics of each text region, and respectively fusing the image characteristics and the coordinate characteristics to obtain the text region element fusion characteristics corresponding to each text region.
In one embodiment, determining image features and coordinate features of each text region, and respectively fusing the image features and the coordinate features to obtain text region element fusion features corresponding to each text region, including:
The method comprises the steps of obtaining position coordinates of each text area from a target table image area, carrying out dimension lifting on the position coordinates of each text area to obtain coordinate features after dimension lifting, obtaining image contents of corresponding text areas according to the position coordinates of each text area, carrying out image feature alignment based on the image contents of the text areas to obtain aligned image features, enabling dimensions of the aligned image features to be identical to dimensions of the coordinate features after dimension lifting, and fusing the coordinate features after dimension lifting and the aligned image features to obtain text area element fusion features corresponding to each text area.
Specifically, each text region is determined from the target table image region and its position coordinates are obtained, and an FCN (fully connected network) is used to raise the dimensionality of each text region's position coordinates, giving the up-scaled coordinate features. The specific position of the text region in the target table image region is determined from its position coordinates, the image content at that position is obtained and taken as the image content of the text region, and image feature alignment is performed on that image content using the RoI Align algorithm (an algorithm that uses bilinear interpolation to produce fixed-size feature outputs for regions of interest of different sizes). The up-scaled coordinate features and the aligned image features of each text region are then fused by point-wise addition to obtain the text region element fusion feature corresponding to each text region.
2) And determining the adjacent characteristics of each node in the target table image area according to the text area element fusion characteristics.
Specifically, determining the adjacent characteristics of each node in the target form image area according to the text area element fusion characteristics comprises the following steps:
And performing feature aggregation based on the local features and the global features to obtain adjacent features of each node in the target form image region.
In one embodiment, obtaining local features of nodes corresponding to the fusion features of the text region elements includes:
The method comprises the steps of obtaining nodes corresponding to fusion characteristics of text area elements, determining preset number of adjacent nodes of the nodes, fusing the nodes with the fusion characteristics of the text area elements of the adjacent nodes corresponding to the nodes to obtain the adjacent fusion characteristics of the nodes, integrating the adjacent fusion characteristics of the nodes to obtain aggregation characteristics, performing dimension reduction on the aggregation characteristics of the nodes to obtain dimension reduced multi-head graph characteristics, wherein the dimension reduced multi-head graph characteristics are local characteristics of the nodes corresponding to the fusion characteristics of the text area elements.
Specifically, a K-nearest-neighbor (KNN) algorithm is used to determine the K neighbor nodes of the node corresponding to each text region element fusion feature; the text region element features of the K neighbor nodes are fused with the node's text region element fusion feature to obtain the node's adjacent fusion feature, and the adjacent fusion features of the nodes are integrated with an FCN (fully connected network) to obtain the corresponding aggregation features. Preset parallel FCNs then reduce the dimensionality of the nodes' aggregation features to obtain the dimension-reduced multi-head graph features.
In one embodiment, obtaining global features of nodes corresponding to the fusion features of the text region elements includes:
and according to the multi-head attention mechanism, carrying out context feature aggregation on the text region element fusion features corresponding to each node to obtain the global features of the nodes corresponding to the text region element fusion features.
Here, the text region element fusion features are characterized by the encoder of a Transformer model (a language model based on the self-attention mechanism), where the hidden layer size of the Transformer model is 128 and the dimension of the text region element fusion feature is 128. The multi-head attention mechanism (Multi-head Attention) uses 8 heads; when context feature aggregation is performed on the text region element fusion features of each node according to the multi-head attention mechanism, the global feature of a node is 16-dimensional per head, matching the 16 dimensions per head to which the 8 parallel FCNs reduce the node's 128-dimensional aggregation feature.
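As a sketch of the global branch under these dimensions, nn.MultiheadAttention is used here as an assumed stand-in for the Transformer encoder described in the text; the per-head split of its output is notional, since the module concatenates heads internally.

```python
# Sketch of the global branch: multi-head self-attention over all node
# fusion features (hidden size 128, 8 heads, 16 dims per head).
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)

def global_features(fusion_feats: torch.Tensor) -> torch.Tensor:
    # fusion_feats: (N, 128) text region element fusion features
    x = fusion_feats.unsqueeze(0)        # treat the N nodes as one sequence
    out, _ = attn(x, x, x)               # self-attention: Q = K = V = x
    n = fusion_feats.size(0)
    return out.reshape(n, 8, 16)         # (N, 8, 16) per-head global feature
```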
In one embodiment, feature aggregation is performed based on local features and global features to obtain contiguous features of each node in the target table image area, including:
And carrying out feature aggregation on the local features and the global features based on a preset activation function and the gate parameters to obtain adjacent features of each node in the target table image area.
In one embodiment, as shown in fig. 10, a schematic diagram of a FLAG network structure for generating adjacent features is provided, where the FLAG network structure denotes a structure that incorporates a self-attention mechanism into graph features. Referring to fig. 10, the FLAG network structure is provided with GNN (Graph Neural Network) branches for determining local features, comprising GNN_1, …, GNN_N; self-attention branches for determining global features, comprising self-attention head_1, …, self-attention head_N; a gate mechanism for controlling the fusion of the different feature types, with gate parameters gate_1, …, gate_N; and an FFN (Feed-Forward Network) for enhancing the characterization capability of the model. The numbers of GNN branches, self-attention branches and gates are the same.
Specifically, the self-attention branch is characterized by the encoder of a Transformer model (a language model based on the self-attention mechanism). As can be seen from fig. 10, the attention functions used by the Transformer model comprise Q (query), K (key) and V (value): the query vector (Q) is used to look up the relevance of the target feature to other features, the key vector (K) is matched against the query vector (Q) to obtain relevance scores, and the value vector (V) is weighted and summed by those scores.
Further, since the self-attention branch uses a multi-head attention mechanism, a plurality of heads head_1, head_2, …, head_N are provided, and the attention function used by the Transformer model is provided for each head correspondingly, comprising K_1, Q_1, V_1; K_2, Q_2, V_2; …; K_N, Q_N, V_N.
Specifically, the preset activation function and the gate parameters corresponding to the gate mechanism, comprising gate_1, …, gate_N, are obtained, and feature aggregation is performed on the local and global features of each node based on the preset activation function and the gate parameters to obtain the adjacent feature of each node in the target table image region. An FFN (Feed-Forward Network) then further processes the adjacent features obtained by feature aggregation, improving the characterization capability of the model.
In this embodiment, four layers of FLAG network structures are provided, and the adjacent features output by each layer of FLAG network structure are further subjected to feature fusion, so that the adjacent features of each node in the target table image area are finally determined.
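Combining the pieces above, one FLAG layer might be composed as follows; the exact wiring of FIG. 10 (residual connections, normalization, the cross-layer fusion) is not spelled out in the text, so this is a sketch under stated assumptions, reusing the LocalGraphFeature sketch from earlier.

```python
# Hypothetical composition of one FLAG layer: a GNN (local) branch and a
# self-attention (global) branch per head, fused by the per-head gate of
# equation (1), followed by an FFN.
import torch
import torch.nn as nn

class FlagLayer(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 8, k: int = 4):
        super().__init__()
        self.local = LocalGraphFeature(k=k, dim=dim, heads=heads)  # GNN branch
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(heads))               # gate mechanism
        self.ffn = nn.Sequential(                                  # FFN block
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        n, d = feats.shape
        f_local = self.local(feats, centers)                       # (N, 8, 16)
        g_out, _ = self.attn(feats.unsqueeze(0), feats.unsqueeze(0),
                             feats.unsqueeze(0))
        f_global = g_out.reshape(n, -1, d // 8)                    # (N, 8, 16)
        g = torch.sigmoid(self.gate).view(1, -1, 1)
        f_agg = g * f_global + (1 - g) * f_local                   # equation (1)
        return self.ffn(f_agg.reshape(n, d))                       # (N, 128)

# Four stacked layers, whose per-layer outputs are then fused, per the text:
# layers = nn.ModuleList(FlagLayer() for _ in range(4))
```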
3. The P3 adjacency predicting section specifically includes:
(1) And performing feature splicing on adjacent features of any two nodes, and performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of the text region corresponding to the two nodes.
In one embodiment, feature stitching is performed on adjacent features of any two nodes, classification prediction is performed on an adjacent matrix obtained by stitching, and a row-column relationship prediction result of text regions corresponding to the two nodes is generated, including:
the method comprises the steps of performing feature stitching on adjacent features of any two nodes to obtain a stitched adjacent matrix, and performing classification prediction on the stitched adjacent matrix according to a fully connected neural network to obtain a row-column relationship prediction result of a corresponding text region, wherein the classification prediction comprises row relationship prediction and column relationship prediction.
Specifically, the spliced adjacent matrix is subjected to classification prediction according to the fully-connected network, whether two text areas corresponding to the spliced adjacent matrix belong to the same row in the table is determined, or whether the two text areas corresponding to the spliced adjacent matrix belong to the same column in the table is judged, and then a row-column relation prediction result of the corresponding text area is obtained.
Further, the obtained row-column relationship prediction result of each text region may refer to the row relationship prediction result of the table structure recognition method shown in fig. 5 and the column relationship prediction result of the table structure recognition method shown in fig. 6.
(2) Based on the row-column relation prediction result of each text region, a table structure corresponding to the target table image region is determined.
Specifically, according to the prediction result of the row-column relationship of every two text regions, it can be further determined which text regions belong to the same row and which text regions belong to the same column, and further analysis and arrangement are performed on the row-column relationship and the position coordinates of different text regions, so that a table structure corresponding to the target table image region can be determined.
In the table structure identification method, the text region in the target table image region is identified by acquiring the target table image region, the image characteristic and the coordinate characteristic of each text region are determined, the image characteristic and the coordinate characteristic are respectively fused to obtain the text region element fusion characteristic corresponding to each text region, and the fusion of the image characteristic and the coordinate characteristic can be carried out on different text regions in the target table image region so as to achieve the overall identification of the target table image region instead of the local identification of a single text region. And determining adjacent characteristics of each node in the target table image area according to the text area element fusion characteristics, performing characteristic splicing on the adjacent characteristics of any two nodes, performing classification prediction on the spliced adjacent matrix to generate a row-column relation prediction result of the text area corresponding to the two nodes, and determining a table structure corresponding to the target table image area based on the row-column relation prediction result of each text area. The method and the device have the advantages that the corresponding table structure can be determined and obtained according to the prediction result of the row-column relationship of each text region, the additional text detection network is not needed to be used for further identification, unnecessary complicated operation can be reduced, and the accuracy and the efficiency of table identification are improved.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may comprise multiple steps or stages, which are not necessarily executed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the steps or stages in other steps.
In one embodiment, as shown in fig. 11, a table structure identifying apparatus is provided, which may be a software module, a hardware module, or a combination of the two forming part of a computer device. The apparatus specifically includes a text region identification module 1102, a text region element fusion feature generation module 1104, an adjacency feature generation module 1106, a row-column relationship prediction result generation module 1108, and a table structure determination module 1110, where:
The text region identification module 1102 is configured to acquire a target table image region and identify the text regions in the target table image region.
The text region element fusion feature generation module 1104 is configured to determine an image feature and a coordinate feature of each text region, and fuse the image feature and the coordinate feature to obtain a text region element fusion feature corresponding to each text region.
The adjacency feature generation module 1106 is configured to determine the adjacency features of each node in the target table image region according to the text region element fusion features.
The row-column relationship prediction result generation module 1108 is configured to perform feature splicing on the adjacency features of any two nodes and perform classification prediction on the spliced adjacency matrix, so as to generate the row-column relationship prediction result of the text regions corresponding to the two nodes.
The table structure determining module 1110 is configured to determine a table structure corresponding to the target table image area based on the row-column relationship prediction result of each text area.
In the above table structure identification device, a target table image region is acquired, the text regions within it are identified, the image feature and coordinate feature of each text region are determined, and the two are fused to obtain the text region element fusion feature corresponding to each text region. Because image and coordinate features are fused for the different text regions in the target table image region, the target table image region is recognized as a whole rather than a single text region being recognized locally. The adjacency features of each node in the target table image region are then determined according to the text region element fusion features; feature splicing is performed on the adjacency features of any two nodes, classification prediction is performed on the spliced adjacency matrix to generate the row-column relationship prediction result of the text regions corresponding to the two nodes, and the table structure corresponding to the target table image region is determined based on the row-column relationship prediction result of each text region. Since the corresponding table structure is determined directly from the row-column relationship prediction results of the text regions, no additional text detection network is required for further identification, unnecessary complicated operations are reduced, and both the accuracy and the efficiency of table recognition are improved.
In one embodiment, the text region element fusion feature generation module is further configured to:
The module is configured to obtain the position coordinates of each text region from the target table image region and perform dimension lifting on them to obtain dimension-lifted coordinate features; obtain the image content of the corresponding text region according to the position coordinates of each text region and perform image feature alignment based on that image content to obtain aligned image features, the dimension of the aligned image features being the same as that of the dimension-lifted coordinate features; and fuse the dimension-lifted coordinate features with the aligned image features to obtain the text region element fusion feature corresponding to each text region.
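A minimal sketch of this fusion step follows, assuming a linear layer performs the dimension lifting of the four box coordinates and torchvision's roi_align performs the image feature alignment; the 256-dimensional features, the 7x7 aligned size, and the elementwise-add fusion are illustrative assumptions rather than the application's prescribed choices.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

feat_dim = 256                        # assumed shared feature dimension
coord_up = nn.Linear(4, feat_dim)     # dimension lifting of (x1, y1, x2, y2)

# Backbone feature map of the table image and two region boxes, where the
# first box column is the batch index expected by roi_align.
fmap = torch.randn(1, feat_dim, 64, 64)
boxes = torch.tensor([[0.0, 1.0, 2.0, 20.0, 8.0],
                      [0.0, 1.0, 10.0, 20.0, 16.0]])

# Align every region to the same spatial size, then pool it to one vector so
# the image feature matches the dimension of the lifted coordinate feature.
region_feats = roi_align(fmap, boxes, output_size=(7, 7), spatial_scale=1.0)
img_feat = region_feats.mean(dim=(2, 3))        # (num_regions, feat_dim)

coord_feat = coord_up(boxes[:, 1:])             # (num_regions, feat_dim)
fused = img_feat + coord_feat                   # text region element fusion feature

A concatenation followed by a projection would serve equally well as the fusion; the only requirement stated above is that the two features share the same dimension before they are fused.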
The text region element fusion feature generation module thus fuses the dimension-lifted coordinate features with the aligned image features to obtain the text region element fusion feature corresponding to each text region. This enables overall recognition of all text regions in the target table image region, rather than local recognition of a single text region, and thereby improves the table recognition accuracy for the target table image region.
In one embodiment, the text region element fusion feature generation module further includes:
The intersection-over-union (IoU) calculation unit is used for calculating the IoU between each text region in the target table image region and a preset annotated text region;
The text region screening module is used for retaining the text regions whose IoU is greater than a preset IoU threshold (a minimal sketch of this screening follows below).
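A minimal sketch of this screening, assuming boxes in (x1, y1, x2, y2) form and an illustrative threshold of 0.5:

def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def screen_regions(detected, annotated, threshold=0.5):
    # Keep only detections that sufficiently overlap an annotated region.
    return [d for d in detected
            if any(iou(d, g) > threshold for g in annotated)]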
In one embodiment, the adjacency feature generation module is further configured to:
Perform feature aggregation based on the local features and the global features to obtain the adjacency features of each node in the target table image region.
In one embodiment, the adjacency feature generation module further comprises:
The adjacent node acquisition module is used for acquiring the nodes corresponding to the text region element fusion features and determining a preset number of adjacent nodes for each node;
The adjacency fusion feature generation module is used for fusing each node with the text region element fusion features of its corresponding adjacent nodes to obtain the adjacency fusion feature corresponding to each node;
The aggregation feature generation module is used for integrating the adjacency fusion features of all nodes to obtain aggregated features;
The local feature generation module is used for performing dimension reduction on the aggregated features of each node to obtain dimension-reduced multi-head graph features, which serve as the local features of the nodes corresponding to the text region element fusion features.
The adjacency feature generation module thus further integrates and dimension-reduces the text region element fusion features to obtain the dimension-reduced multi-head graph features. These local features can then be fused with the global features obtained by context feature aggregation of the text region element fusion features through a multi-head attention mechanism, achieving overall recognition of the target table image region rather than local recognition of a single text region and improving table recognition accuracy.
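A minimal sketch of this local branch, assuming the adjacent nodes are the k nearest neighbors by region center, concatenation as the node-neighbor fusion, mean pooling as the aggregation, and a single linear layer as the dimension reduction; all of these specifics, including the function name local_features, are assumptions for illustration.

import torch
import torch.nn as nn

def local_features(node_feats, centers, k=4, out_dim=128):
    # node_feats: (N, D) text region element fusion features
    # centers:    (N, 2) region center coordinates used to pick neighbors
    n, d = node_feats.shape
    dist = torch.cdist(centers, centers)                  # (N, N) distances
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]  # drop self, keep k

    neighbor_feats = node_feats[knn]                      # (N, k, D)
    # Fuse each node with each of its neighbors, then aggregate over them.
    fused = torch.cat([node_feats.unsqueeze(1).expand(-1, k, -1),
                       neighbor_feats], dim=-1)           # (N, k, 2D)
    aggregated = fused.mean(dim=1)                        # (N, 2D)

    # Dimension reduction; a fresh layer here only keeps the sketch short.
    reduce = nn.Linear(2 * d, out_dim)
    return reduce(aggregated)                             # (N, out_dim)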
In one embodiment, the adjacency feature generation module further comprises a global feature generation module configured to:
Perform context feature aggregation on the text region element fusion features corresponding to each node according to a multi-head attention mechanism, obtaining the global features of the nodes corresponding to the text region element fusion features.
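A sketch of this global branch and the subsequent aggregation, assuming PyTorch's nn.MultiheadAttention supplies the multi-head context aggregation and that local and global features are concatenated to form each node's adjacency feature; the head count, dimensions, and concatenation are illustrative assumptions.

import torch
import torch.nn as nn

d_model, n_heads = 128, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

node_feats = torch.randn(1, 10, d_model)   # (batch, N, D) fusion features
# Self-attention over all nodes: each node attends to every other node, so
# its output feature carries table-wide context (the global feature).
global_feats, _ = attn(node_feats, node_feats, node_feats)

local_feats = torch.randn(1, 10, d_model)  # stand-in for the local branch
adjacency_feats = torch.cat([local_feats, global_feats], dim=-1)  # (1, N, 2D)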
In one embodiment, the row-column relationship prediction result generation module is further configured to:
Perform feature splicing on the adjacency features of any two nodes to obtain a spliced adjacency matrix, and perform classification prediction on the spliced adjacency matrix according to a fully connected neural network to obtain the row-column relationship prediction result of the corresponding text regions; the classification prediction includes row relation prediction and column relation prediction.
For specific limitations of the table structure recognition device, reference may be made to the limitations of the table structure recognition method above, which are not repeated here. The modules in the above table structure recognition device may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in hardware form in, or independent of, a processor of the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 12. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing data such as text areas, image features, coordinate features, text area element fusion features, adjacent features, row-column relation prediction results and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a table structure identification method.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of part of the structure related to the present application and does not limit the computer device to which the present application applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. The volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered within the scope of this specification.
The foregoing examples represent only a few embodiments of the application; their description is specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that several variations and improvements can be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (12)

The adjacency feature generation module is used for acquiring the nodes corresponding to the text region element fusion features and determining a preset number of adjacent nodes for each node; fusing each node with the text region element fusion features of its corresponding adjacent nodes to obtain the adjacency fusion feature corresponding to each node; integrating the adjacency fusion features of all nodes to obtain aggregated features; performing dimension reduction on the aggregated features of each node to obtain dimension-reduced multi-head graph features, the dimension-reduced multi-head graph features being the local features of the nodes corresponding to the text region element fusion features; acquiring the global features of the nodes corresponding to the text region element fusion features; and performing feature aggregation based on the local features and the global features to obtain the adjacency features of each node in the target table image region;
the position coordinates of each text region are obtained from the target table image region, and dimension lifting is performed on the position coordinates of each text region to obtain dimension-lifted coordinate features; the image content of the corresponding text region is obtained according to the position coordinates of each text region, and image feature alignment is performed based on the image content of the text region to obtain aligned image features, the dimension of the aligned image features being the same as that of the dimension-lifted coordinate features; and the dimension-lifted coordinate features and the aligned image features are fused to obtain the text region element fusion feature corresponding to each text region.
CN202111020622.7A | 2021-09-01 | 2021-09-01 | Table structure recognition method, device, computer equipment and storage medium | Active | CN114332893B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111020622.7A (granted as CN114332893B) | 2021-09-01 | 2021-09-01 | Table structure recognition method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111020622.7A (granted as CN114332893B) | 2021-09-01 | 2021-09-01 | Table structure recognition method, device, computer equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN114332893A (en) | 2022-04-12
CN114332893B (en) | 2025-07-01

Family

ID=81044844

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111020622.7A (Active, granted as CN114332893B) | Table structure recognition method, device, computer equipment and storage medium | 2021-09-01 | 2021-09-01

Country Status (1)

Country | Link
CN (1) | CN114332893B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114842489A (en)* | 2022-05-13 | 2022-08-02 | 北京百度网讯科技有限公司 | Table analysis method and device
CN115331245B (en)* | 2022-10-12 | 2023-02-03 | 中南民族大学 | Table structure identification method based on image instance segmentation
CN116259064B (en)* | 2023-03-09 | 2024-05-17 | 北京百度网讯科技有限公司 | Table structure recognition method, table structure recognition model training method and device
CN117115139A (en)* | 2023-09-21 | 2023-11-24 | 北京字跳网络技术有限公司 | Endoscopic video detection methods, devices, readable media and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111695517A (en)* | 2020-06-12 | 2020-09-22 | 北京百度网讯科技有限公司 | Table extraction method and device for image, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20160104077A1 (en)* | 2014-10-10 | 2016-04-14 | The Trustees Of Columbia University In The City Of New York | System and Method for Extracting Table Data from Text Documents Using Machine Learning
JP7698626B2 (en)* | 2019-07-16 | 2025-06-25 | エヌフェレンス,インコーポレイテッド | A system and method for inserting data into a database structured based on a pictorial representation of a data table
CN110532436B (en)* | 2019-07-17 | 2021-12-03 | 中国人民解放军战略支援部队信息工程大学 | Cross-social network user identity recognition method based on community structure
CN111382717B (en)* | 2020-03-17 | 2022-09-09 | 腾讯科技(深圳)有限公司 | Table identification method and device and computer readable storage medium
CN112949443B (en)* | 2021-02-24 | 2023-07-25 | 平安科技(深圳)有限公司 | Table structure identification method and device, electronic equipment and storage medium
CN113297975B (en)* | 2021-05-25 | 2024-03-26 | 新东方教育科技集团有限公司 | Method, device, storage medium and electronic equipment for table structure recognition


Also Published As

Publication number | Publication date
CN114332893A (en) | 2022-04-12

Similar Documents

Publication | Title
KR102588894B1 | method, apparatus, computer equipment, computer readable storage medium and computer program for visual qustion answering
US20220092351A1 | Image classification method, neural network training method, and apparatus
CN114332893B (en) | Table structure recognition method, device, computer equipment and storage medium
CN110866140B (en) | Image feature extraction model training method, image searching method and computer equipment
US20220019870A1 | Verification of classification decisions in convolutional neural networks
CN111582409A (en) | Training method of image label classification network, image label classification method and device
CN113705772A (en) | Model training method, device and equipment and readable storage medium
WO2021093435A1 (en) | Semantic segmentation network structure generation method and apparatus, device, and storage medium
CN110363138A (en) | Model training method, image processing method, device, terminal and storage medium
WO2021147325A1 (en) | Object detection method and apparatus, and storage medium
CN113095346A (en) | Data labeling method and data labeling device
CN111783903B (en) | Text processing method, text model processing method and device and computer equipment
Ayyar et al. | Review of white box methods for explanations of convolutional neural networks in image classification tasks
CN114329029B (en) | Object retrieval method, device, equipment and computer storage medium
CN112749737A (en) | Image classification method and device, electronic equipment and storage medium
CN110993037A (en) | Protein activity prediction device based on multi-view classification model
US20240331425A1 | Method, device, computer equipment and storage medium for identifying illegal commodity
WO2023125628A1 (en) | Neural network model optimization method and apparatus, and computing device
CN113761291B (en) | Label classification processing method and device
CN114639101A (en) | Emulsion droplet identification system, method, computer equipment and storage medium
CN112232360A (en) | Image retrieval model optimization method, image retrieval method, device and storage medium
CN113283394B (en) | A method and system for pedestrian re-identification with fusion of context information
CN113590720B (en) | Data classification method, device, computer equipment and storage medium
CN114565752A (en) | A Weakly Supervised Object Detection Method Based on Class-Agnostic Foreground Mining
CN115905608B (en) | Image feature acquisition method, device, computer equipment, and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
