Technical Field
The present invention belongs to the technical field of text recognition, and in particular relates to a PDF document recognition method based on the open-source Paddle framework.
Background Art
Nowadays, the number of people purchasing insurance products has risen significantly. With the rapid development of the insurance industry, the number of insurance products keeps growing, and insurance companies need efficient methods for managing massive volumes of insurance product PDF documents. These documents are rich in content and varied in form, and most insurance companies can only process the data manually. The traditional manual approach to organizing insurance product file data is tedious, involves an enormous workload, and cannot guarantee the accuracy of manual input; after entry, a large amount of manual proofreading is still required, making the process highly inefficient.
The development of OCR technology has greatly improved the efficiency of recognizing and entering insurance product data, but such extraction must achieve high accuracy. How to ensure the accuracy of OCR recognition and thereby reduce manual proofreading costs is a problem that urgently needs to be solved. Some solutions already exist in the related art, but they have the following shortcomings:
(1) Low efficiency. Traditional text detection techniques struggle to adaptively adjust detection boxes for different types of text; although deep learning techniques can adjust via adaptive thresholds, they suffer from missed detection boxes.
(2) It is difficult to adaptively extract structured content from documents with different layouts, resulting in low efficiency.
(3) The recognized text still contains many misrecognized characters, leading to high costs in the final verification.
Summary of the Invention
To solve the above problems in the related art, the present invention provides a PDF document recognition method based on the open-source Paddle framework. The technical problem to be solved by the present invention is addressed through the following technical solutions:
The present invention provides a PDF document recognition method based on the open-source Paddle framework, comprising:
obtaining a PDF document to be recognized;
detecting the text blocks of each page of the PDF document by means of the pre-trained text detection model and the pre-trained text recognition model of the PaddleOCR framework, together with OpenCV morphological operations and projection segmentation, to obtain a text block list of the PDF document;
identifying the category of each text region on each page of the PDF document by means of the pre-trained layout analysis model of the PaddleOCR framework, to obtain a text region list of the PDF document;
determining the OCR-recognized text of the PDF document according to the text block list and the text region list;
determining the PDF-recognized text of the PDF document based on the text region list and a character extraction tool;
generating a comparison file from the OCR-recognized text and the PDF-recognized text, the comparison file describing the differences between the OCR-recognized text and the PDF-recognized text.
The present invention has the following beneficial technical effects:
1. Good text detection performance with complete detection boxes. For text detection, the invention combines the high accuracy and high detection efficiency of DBNet, a lightweight deep learning model of the PaddleOCR framework, with OpenCV morphological methods as a supplement, compensating for the weaknesses of both deep learning models and traditional methods and improving the accuracy and completeness of text detection.
2. High versatility and efficiency. The invention adopts a deep learning pre-trained model of the PaddleDetection framework as the layout analysis model. It requires neither a large parameter structure nor large amounts of training data: a well-performing layout analysis model can be trained with only a small amount of self-annotated insurance product document images. The model can handle insurance product documents with different layouts without predefined matching templates, extracting structured text with higher efficiency.
3. Low text correction cost. The invention compares the two results obtained by the OCR recognition method and the PDF recognition method against each other for error correction, enabling character-level correction over the full text. At the same time, a marked comparison file is generated, which ensures text accuracy while reducing the cost of manual verification, giving the method high practical value.
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments.
Description of the Drawings
Figure 1 is a flow chart of a PDF document recognition method based on the open-source Paddle framework provided by an embodiment of the present invention;
Figure 2 is a schematic flow chart of an exemplary OCR recognition method provided by an embodiment of the present invention;
Figure 3 is a schematic flow chart of an exemplary PDF recognition method provided by an embodiment of the present invention;
Figure 4 is a schematic flow chart of an exemplary comparison and error correction method provided by an embodiment of the present invention;
Figure 5 is an overall structural diagram of an exemplary PDF document recognition method based on the open-source Paddle framework provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, but the implementations of the present invention are not limited thereto.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Accordingly, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present invention, "plurality" means two or more, unless otherwise explicitly and specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification.
Although the present invention has been described herein in conjunction with various embodiments, those skilled in the art can, in practicing the claimed invention, understand and implement other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill several of the functions recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
At present, the solutions in the related art and their shortcomings are set forth as follows:
The patent document with application number 201910436587.3 discloses a PDF file parsing method based on tesseract-ocr. The method uses the fitz toolkit to convert a PDF file into an image sequence, and uses tesseract-ocr to extract, recognize, and correct the figures and tables in the images. The drawback of this solution is that error correction is applied only to tables and not to the other text regions where errors may occur, so the recognition accuracy of the entire PDF file cannot be guaranteed.
The patent document with application number 202010493630.2 provides an insurance policy recognition method, apparatus, and computer device in the field of insurance policy recognition technology, alleviating the technical problem of low recognition accuracy across many different insurance policy layouts. The drawback of this solution is that it recognizes policies of different layouts by matching various preset layouts; when a new layout is encountered, it cannot adapt and the layouts must be preset anew, which is inefficient.
The patent document with application number 202210550788.8 discloses a PDF recognition method based on OCR. It invokes multiple OCR recognition technologies to recognize each byte of a target PDF document, obtains the character strings recognized by each technology for each byte, selects the optimal byte string for the i-th byte according to preset rules, and outputs the optimal byte strings of all bytes in the byte order of the target PDF document to obtain the recognition result. The shortcoming of this patent is that multiple OCR recognition technologies are invoked, so recognition efficiency is low when facing massive data, and there is no effective error correction mechanism to guarantee the reliability of the OCR results.
Figure 1 is a flow chart of a PDF document recognition method based on the open-source Paddle framework provided by an embodiment of the present invention. As shown in Figure 1, the method comprises the following steps:
S101. Obtain the PDF document to be recognized.
Here, the PDF document to be recognized may be any PDF document; for example, it may be a PDF document of an insurance product, a PDF document of a design proposal, and so on.
S102. Detect the text blocks of each page of the PDF document by means of the pre-trained text detection model and the pre-trained text recognition model of the PaddleOCR framework, together with OpenCV morphological operations and projection segmentation, to obtain the text block list of the PDF document.
Here, S102 may be implemented through the following steps:
S1021. Convert the PDF document into images and preprocess each image.
For example, a 20-page insurance product PDF document can be converted into 20 images, after which each image is grayscaled and binarized. The grayscale and binarization operations use OpenCV to convert a color image into an image containing only black and white, eliminating interfering factors in the image such as lighting, color, and watermarks.
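The preprocessing of S1021 would typically be done with OpenCV (e.g. cv2.cvtColor followed by cv2.threshold); the sketch below shows the underlying idea, luminance conversion followed by thresholding, in plain Python on a tiny illustrative pixel grid (the pixel values and threshold are assumptions for illustration only):

```python
# Sketch of the grayscale + binarization idea behind S1021.
# A real pipeline would use OpenCV (cv2.cvtColor, cv2.threshold);
# here the same idea is shown on a tiny RGB pixel grid.

def to_gray(rgb_image):
    """Convert RGB triples to grayscale via the usual luminance weights."""
    return [[int(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

def binarize(gray_image, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if px >= threshold else 0 for px in row]
            for row in gray_image]

page = [[(250, 250, 250), (20, 20, 20)],     # white background, dark glyph
        [(200, 180, 170), (240, 240, 240)]]  # light watermark tint, background
binary = binarize(to_gray(page))
```

Note how the light watermark tint lands above the threshold and is absorbed into the white background, which is exactly the interference-removal effect described above.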
S1022. Detect the text blocks in each preprocessed image using the pre-trained text detection model of the PaddleOCR framework, obtaining the first text blocks and the first text block information.
Here, the pre-trained text detection model may be a pre-trained DBNet model. The pre-trained DBNet text detection model of the PaddleOCR framework is used to perform text detection on each preprocessed image and obtain the coordinates of each text detection box within its image. A text detection box obtained by the DBNet model is referred to as a first text block, and the coordinates of the first text block within its image are referred to as first text block information.
S1023. Detect the text blocks in each preprocessed image through OpenCV morphological operations and projection segmentation, obtaining the second text blocks and the second text block information.
Here, the text regions in each preprocessed image can be dilated through OpenCV morphological operations, and each dilated region is projected horizontally and vertically according to pixel values; the text blocks are segmented by the projection values to obtain the coordinates of each text block. A text block obtained through OpenCV morphological operations and projection segmentation is referred to as a second text block, and its coordinates within its image are referred to as second text block information.
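The projection segmentation of S1023 can be sketched on a toy binary image; in the real pipeline the binary image would come from OpenCV dilation (e.g. cv2.dilate), and the same splitting would also be applied to column projections. The image data here is an illustrative assumption:

```python
# Sketch of projection segmentation (S1023): rows whose foreground pixel
# count is zero separate consecutive text bands.

def horizontal_projection(binary):
    """Foreground (value 1) pixel count of each row."""
    return [sum(row) for row in binary]

def split_bands(projection):
    """Return (start, end) row-index pairs of consecutive non-empty rows."""
    bands, start = [], None
    for i, value in enumerate(projection):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            bands.append((start, i - 1))
            start = None
    if start is not None:
        bands.append((start, len(projection) - 1))
    return bands

image = [[0, 0, 0, 0],
         [0, 1, 1, 0],   # first text line
         [0, 1, 0, 0],
         [0, 0, 0, 0],   # blank gap
         [1, 1, 1, 1]]   # second text line
bands = split_bands(horizontal_projection(image))
```

Each returned band corresponds to one candidate second text block; its bounding coordinates come from the band limits plus the matching vertical projection.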
S1024. Perform text recognition on each first text block and each second text block using the pre-trained text recognition model of the PaddleOCR framework, obtaining recognition results.
Here, the pre-trained text recognition model may be a pre-trained CRNN model, which is applied to each first text block and each second text block separately to obtain their recognition results. The recognition result of each first text block includes the characters contained in the first text block, the coordinates of the first text block within its image, and the confidence of the first text block; the recognition result of each second text block includes the characters contained in the second text block, the coordinates of the second text block within its image, and the confidence of the second text block.
S1025. Compare and align the first text blocks and the second text blocks according to the recognition results, the first text block information, and the second text block information, obtaining the target text blocks and the overall target text block information of each preprocessed image; the overall target text block information of all preprocessed images constitutes the text block list of the PDF document.
Specifically, for each preprocessed image, the difference between the coordinates of each first text block and the coordinates of each second text block in the image is determined. According to the differences, a preset threshold, and the confidence of each first and second text block, first or second text blocks are selected from the set formed by the first and second text blocks of the image as the target text blocks of the image. For each target text block, its coordinates and the characters it contains are taken as its overall target text block information.
Specifically, the above difference is the difference between vertical coordinates, and the target text blocks of each image can be determined as follows. For the first first text block of the image, determine whether there exists, among all second text blocks of the image, a target second text block whose vertical coordinate differs from that of the first first text block by less than the preset threshold. If so, the one of the first first text block and the target second text block with the higher confidence is taken as a target text block of the image, and the remaining second text blocks of the image are obtained; if not, the first first text block itself is taken as a target text block, and the remaining second text blocks are obtained. For the second first text block of the image, determine whether there exists, among the remaining second text blocks, a target second text block whose vertical coordinate differs from that of the second first text block by less than the preset threshold. If so, the one of the second first text block and the target second text block with the higher confidence is taken as a target text block, and the updated remaining second text blocks are obtained; if not, the second first text block itself is taken as a target text block, and the updated remaining second text blocks are obtained. Each first text block of the image is processed in turn in this way; when the last first text block has been processed, all target text blocks of the image are obtained.
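The alignment of S1025 can be sketched in Python as follows. The block dictionaries, threshold value, and sample texts are illustrative assumptions; in the described method the first blocks come from DBNet, the second blocks from the morphological detector, and the confidences from the CRNN recognizer. Leftover second blocks are appended at the end, reflecting the supplementing of DBNet omissions described in the text:

```python
# Sketch of the S1025 alignment: for each first (DBNet) block, look for a
# second (morphological) block whose vertical coordinate differs by less
# than a preset threshold; keep whichever of the pair has the higher
# confidence, else keep the first block itself.  Leftover second blocks
# are appended to supplement detections DBNet may have missed.

def merge_blocks(first_blocks, second_blocks, threshold=5):
    remaining = list(second_blocks)
    targets = []
    for fb in first_blocks:
        match = next((sb for sb in remaining
                      if abs(sb["y"] - fb["y"]) < threshold), None)
        if match is None:
            targets.append(fb)
        else:
            remaining.remove(match)
            targets.append(fb if fb["conf"] >= match["conf"] else match)
    return targets + remaining  # unmatched second blocks fill DBNet gaps

first = [{"y": 100, "conf": 0.95, "text": "Clause 1"},
         {"y": 140, "conf": 0.60, "text": "Preminm"}]   # low-confidence read
second = [{"y": 141, "conf": 0.90, "text": "Premium"},
          {"y": 300, "conf": 0.80, "text": "Footnote"}]
merged = merge_blocks(first, second)
```

In the example the low-confidence DBNet reading is replaced by the overlapping morphological block, and the footnote block missed by DBNet survives into the target list.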
Through the above process, the text blocks detected via OpenCV morphological operations and projection segmentation supplement any omissions of DBNet text detection, while the text block adhesion problem of morphological text detection is also corrected.
S103. Identify the category of each text region on each page of the PDF document using the pre-trained layout analysis model of the PaddleOCR framework, obtaining the text region list of the PDF document.
Here, the layout analysis model is obtained by training the object detection pre-trained model of the PaddleOCR framework on multiple PDF sample files, where each page of each PDF sample file carries annotation information describing at least the position of each heading, the position of the body text, and the position of the footnote definitions in the text content of that page.
Exemplarily, the layout analysis model may be a PP-PicoDet model.
In some embodiments, the above S103 may be implemented through the following steps:
S1031. Identify the category of each text region in each preprocessed image using the pre-trained PP-PicoDet model of the PaddleOCR framework, obtaining the coordinates and category of each text region in that image.
For example, the category of a text region may be heading, body text, footnote definition, and so on.
S1032. Store the coordinates and categories of all text regions of the image in a list in coordinate order, obtaining the text region list of the image.
S1033. According to the order of the images converted from the PDF document, combine the text region lists of the images into the text region list of the PDF document.
S104. Determine the OCR-recognized text of the PDF document according to the text block list and the text region list.
Here, the text block list of the PDF document includes the coordinates of each target text block on each page of the PDF document, and the text region list of the PDF document includes the coordinates and category of each text region on each page. On this basis, S104 may be implemented as follows: for each page of the PDF document, determine the text region to which each target text block of the page belongs according to the coordinates of each target text block and the coordinates of each text region of the page, and take the category of that text region as the category of the target text block; put the target text blocks of the page belonging to the same category into the same first category list in the order of their coordinates, obtaining multiple first category lists of the page; combine the multiple first category lists according to their corresponding categories to obtain the OCR-recognized text of the page; and combine the OCR-recognized text of all pages according to the page numbers of the PDF document to obtain the OCR-recognized text of the PDF document.
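The grouping in S104 can be sketched as follows; the region and block dictionaries, the category names, and the fallback choice for unmatched blocks are illustrative assumptions rather than the exact data structures of the method:

```python
# Sketch of S104: assign each target text block to the text region whose
# bounding box contains its coordinates, group blocks of the same category
# in coordinate order, then concatenate the category lists into the page's
# OCR-recognized text.

def assign_category(block, regions):
    """Category of the first region whose bounding box contains the block."""
    for region in regions:
        x0, y0, x1, y1 = region["box"]
        if x0 <= block["x"] <= x1 and y0 <= block["y"] <= y1:
            return region["category"]
    return "body"  # fallback assumption for unmatched blocks

def page_ocr_text(blocks, regions,
                  category_order=("heading", "body", "footnote")):
    grouped = {cat: [] for cat in category_order}
    for block in blocks:
        grouped[assign_category(block, regions)].append(block)
    lines = []
    for cat in category_order:
        for block in sorted(grouped[cat], key=lambda b: (b["y"], b["x"])):
            lines.append(block["text"])
    return lines

regions = [{"box": (0, 0, 100, 20), "category": "heading"},
           {"box": (0, 20, 100, 200), "category": "body"}]
blocks = [{"x": 10, "y": 50, "text": "Body line"},
          {"x": 10, "y": 5, "text": "Title"}]
text = page_ocr_text(blocks, regions)
```

The same containment-plus-sorting pattern applies per page, after which the per-page lists are concatenated in page order.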
For example, the OCR-recognized text of the PDF document can be obtained as follows:
S1. Separately extract the text blocks whose category is heading into a new list. If the vertical coordinates of a heading text block indicate that it contains multiple lines of text, determine the merging order according to the coordinates of the text within the block, and merge heading text spread over multiple lines within one block into a single line;
S2. Separately extract the text blocks whose category is body text into a new list, add it to the list of heading text blocks obtained in S1 to form a mixed list of headings and body text, and sort it by the coordinates of the text blocks to obtain a text list consistent with the order of the original document;
S3. Separately extract the text blocks whose category is footnote definition into a new list, sort the footnote definition text blocks by their coordinates, and insert the sorted footnote definition text blocks into the text list obtained in S2, obtaining the final complete OCR-recognized text list.
The above steps S101 to S104 may be referred to as the OCR recognition method. Exemplarily, Figure 2 describes the flow of the OCR recognition method of steps S101 to S104. As shown in Figure 2, after the PDF document (i.e., the PDF data) is input, each page of the PDF document is first converted into an image and processed by the image preprocessing method. Then, text detection is performed on each preprocessed image both by the OpenCV morphological operations with projection segmentation and by the DBNet model based on PaddleOCR, and the detection results of the two are fused. Text recognition is then performed on the fused detection results (the aforementioned target text blocks) by the CRNN model of the PaddleOCR framework according to the detection boxes. Next, the PP-PicoDet layout analysis model of the PaddleDetection framework is trained on the PDF sample files, and the trained PP-PicoDet layout analysis model is used to perform layout analysis on each preprocessed image, obtaining the category and coordinates of each text region in each preprocessed image. According to these categories and coordinates, the fused detection results are classified and post-processed to obtain the OCR-recognized text of the PDF document (i.e., the text recognized by the OCR method).
S105. Determine the PDF-recognized text of the PDF document based on the text region list and a character extraction tool.
Exemplarily, the character extraction tool may be the PDFPLUMBER tool.
Specifically, the character extraction tool can be used to extract the characters on each page of the PDF document and the coordinates of each character. For each page of the PDF document, determine the text region to which each character belongs according to the coordinates of each character and the coordinates of each text region of the page, and take the category of that text region as the category of the character; put the characters of the page belonging to the same category into the same second category list in the order of their coordinates, obtaining multiple second category lists of the page; combine the multiple second category lists according to their corresponding categories to obtain the PDF-recognized text of the page; and combine the PDF-recognized text of all pages according to the page numbers of the PDF document to obtain the PDF-recognized text of the PDF document.
Step S105 may be referred to as the PDF recognition method. Exemplarily, Figure 3 describes the flow of the PDF recognition method of step S105. As shown in Figure 3, after the PDF document (i.e., the PDF data) is input, the text is extracted page by page with the PDFPLUMBER tool; the structure of the text extracted by PDFPLUMBER is then adjusted according to the results of the layout analysis in the OCR recognition method, so that the different types of text are classified and post-processed, obtaining the PDF-recognized text of the PDF document (the text recognized by the PDF method).
S106. Generate a comparison file from the OCR-recognized text and the PDF-recognized text; the comparison file describes the differences between the OCR-recognized text and the PDF-recognized text.
Specifically, a comparison tool (for example, the difflib tool) can compute the text similarity between the OCR-recognized text and the PDF-recognized text line by line, obtaining the similarity of each line. For each line, when the similarity of the line is greater than or equal to a preset similarity threshold and there are characters at the same position that differ between the OCR-recognized text and the PDF-recognized text, the character at that position of the line in the OCR-recognized text is replaced with the character at that position in the PDF-recognized text; when the similarity of the line is less than the preset similarity threshold, the text of the line is marked in both the OCR-recognized text and the PDF-recognized text; and the comparison file is generated according to the marks.
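The per-line comparison and correction can be sketched with Python's standard difflib, whose SequenceMatcher.ratio gives a line-similarity score. The threshold and sample lines are illustrative assumptions, and the sketch simplifies the character-level replacement by taking the whole PDF-recognized line when the similarity is high (equivalent in effect when only individual characters differ):

```python
import difflib

# Sketch of the S106 correction rule: if two corresponding lines are
# similar enough, take the PDF-recognized line as the corrected result
# (replacing the differing OCR characters); otherwise keep both versions
# and flag the pair for manual review.

def correct_lines(ocr_lines, pdf_lines, threshold=0.8):
    corrected, flagged = [], []
    for ocr, pdf in zip(ocr_lines, pdf_lines):
        ratio = difflib.SequenceMatcher(None, ocr, pdf).ratio()
        if ratio >= threshold:
            corrected.append(pdf)        # PDF text replaces OCR misreads
        else:
            corrected.append(ocr)
            flagged.append((ocr, pdf))   # low similarity: mark for review
    return corrected, flagged

ocr = ["The premiun is 100 yuan", "Completely different line"]
pdf = ["The premium is 100 yuan", "Unrelated content here"]
corrected, flagged = correct_lines(ocr, pdf)
```

The single-character OCR misread is silently corrected, while the structurally different line pair survives into the flagged list for the marked comparison file.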
Here, when annotating, different marking schemes can be used in the OCR-recognized text and the PDF-recognized text, for example, marking the differing parts in different colors.
Here, the generated comparison file may be in HTML format. For example, in the HTML file the differing parts of the OCR-recognized text and the PDF-recognized text can be highlighted in green and red, respectively, to make the parts where the two text files disagree stand out; during manual verification, only the highlighted parts then need attention, which improves the efficiency of manual verification.
For example, the comparison file can be generated as follows:
S11: Use the difflib tool to compare the texts obtained by the two methods line by line. If the similarity reaches the set threshold, the current lines of the two texts are treated as the same line, on which comparison and error correction can be performed; characters that differ within such a line are marked with the '\0^' marker. If two or more consecutive lines have low similarity, the '\0+' and '\0-' markers are used to mark whole consecutive lines in the texts obtained by the two methods, indicating added and removed lines respectively (such markings are taken to mean that the formats recognized by the two methods differ);
S12: For marked characters within the same line, replace the result of the OCR recognition method with the characters obtained by the PDF recognition method, and remove the '\0^' markers. For consecutive whole-line markings, keep the format of the OCR recognition method and do not remove the markers;
S13: Generate a comparison file in HTML format, in which the remaining '\0+' and '\0-' markers are rendered as green and red highlights, respectively, on the texts obtained by the two methods. This makes the parts where the two texts differ stand out, so that manual verification only needs to focus on the highlighted parts, improving the efficiency of manual verification.
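Steps S11–S13 map closely onto Python's standard `difflib.HtmlDiff`, which internally uses the same '\0+', '\0-' and '\0^' intraline markers and, in its default stylesheet, renders additions in green and deletions in red. A minimal sketch (the column descriptions and output path are illustrative choices):

```python
import difflib

def make_diff_html(ocr_lines, pdf_lines, out_path="diff.html"):
    """Render the remaining differences between the two texts as a
    side-by-side HTML comparison file for manual verification."""
    html = difflib.HtmlDiff(wrapcolumn=80).make_file(
        ocr_lines, pdf_lines,
        fromdesc="OCR-recognized text", todesc="PDF-extracted text",
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

A reviewer opening the resulting file sees the two texts in parallel columns with only the disagreeing spans color-highlighted.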
Step S106 may be referred to as the comparison and error-correction method. Figure 4 gives an exemplary description of the flow of the comparison and error-correction method of step S106. As shown in Figure 4, the OCR-recognized text and the PDF-recognized text of the PDF file are input, and the difflib tool compares the two texts line by line and marks their differences. Then, since the tool used for PDF extraction recovers characters with very high accuracy, the PDF-recognized text is used to correct characters in the OCR-recognized text, improving character accuracy. Finally, an HTML comparison file with the differing parts highlighted is generated, improving the efficiency of manual verification.
From the above, as shown in Figure 5, the PDF document recognition method based on the open-source Paddle framework provided by the present invention consists of an OCR recognition method, a PDF recognition method, and a comparison and error-correction method.
In this embodiment, text detection is based on the deep-learning PaddleOCR framework, with OpenCV as an auxiliary detector to prevent missed detections, ensuring accurate, efficient, and reliable text detection. Furthermore, a layout-analysis model trained with the PaddleDetection framework extracts structured text efficiently and accurately. Finally, the results obtained by the OCR recognition method and the PDF recognition method are compared against each other for error correction, and a marked comparison file is generated. This improves the final recognition accuracy while reducing the cost of manual verification, giving the method high practical application value.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions may be made without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310549326.9A (CN116740723A) | 2023-05-16 | 2023-05-16 | A PDF document recognition method based on the open source Paddle framework |
| Publication Number | Publication Date |
|---|---|
| CN116740723A | 2023-09-12 |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||