CN116758565B - A decision tree-based OCR text restoration method, equipment and storage medium - Google Patents


Publication number
CN116758565B
Authority
CN
China
Prior art keywords
text
text box
decision tree
ocr
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311064174.XA
Other languages
Chinese (zh)
Other versions
CN116758565A (en)
Inventor
刘法
白建亮
阎德劲
郑大安
雷文强
向元新
熊可欣
袁焦
丁栋威
邓欣
顾海燕
奂锐
谢明华
孙国东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 10 Research Institute
Priority to CN202311064174.XA
Publication of CN116758565A
Application granted
Publication of CN116758565B
Active
Anticipated expiration

Abstract

The application provides a decision tree-based OCR text restoration method, equipment and storage medium. The method comprises: preprocessing the text boxes recognized by OCR; extracting text box features and constructing a decision tree based on those features; and classifying and merging the text boxes according to the decision tree to restore the original layout of the text. The application post-processes the OCR recognition result: it applies a decision tree to multiple features of each text box to identify the content category of the box (such as title, chapter, page number, or paragraph), then classifies and merges the boxes to restore the original layout. This avoids text boxes in the OCR result being misclassified, misarranged, or overlapped, and addresses the problems of incoherent text content and easily disordered format and layout.

Description

Translated from Chinese
A decision tree-based OCR text restoration method, equipment and storage medium

Technical field

The present invention relates to the field of text recognition technology, and in particular to a decision tree-based OCR text restoration method, equipment and storage medium.

Background

To further improve the accessibility of document information and facilitate management, the text content of documents needs to be recognized, converting the text in images and scans into editable, searchable text. The earliest document recognition technology was based on OCR, which uses optical character recognition to extract the text from a document. In recent years, with the rapid development of science and technology, document recognition technologies based on deep learning and on computer vision have gradually emerged. Although deep learning-based document recognition has made significant progress in image processing, it requires training on large-scale datasets and consumes substantial computing resources and time. Computer vision-based document recognition is widely used for table parsing, but it also consumes substantial resources for training, and tables with special structures may still be parsed incorrectly or lose part of their information. By contrast, OCR technology is mature and stable and can be used for many types of documents; as algorithms have improved, its recognition results have become highly accurate, it supports multiple languages, and many commercial and open source engines are available. Therefore, OCR remains the most commonly used document recognition technology.

Although the recognition accuracy of OCR has improved significantly, in challenging cases such as complex text, blurred or distorted text, and low-resolution images, the recognized text may still fail to fully preserve the format and layout of the original document, so that the recognition result is inconsistent with the original. This is where post-processing comes into play. Documents with known styles and templates can be restored according to style rules and template information, but this approach cannot handle documents of unknown format. Natural language processing can also be applied to the OCR result for semantic analysis and entity recognition, extracting key information, named entities, relationships and so on, thereby restoring the semantic structure and information of the original document; however, this approach requires substantial resources for model training and must incorporate domain-specific entity knowledge. Therefore, the most commonly used OCR post-processing method is text layout analysis, which analyzes the relative positions of the text blocks in the OCR result and performs distance calculation or clustering on multiple text boxes to restore the layout structure of the original document. However, many current text layout analysis methods consider only the relative position of text boxes and rarely consider other features such as font, digit ratio, and specific keywords.

Given the state of existing research, current post-processing methods for document-oriented OCR have the following problems:

1. Existing post-processing techniques are poor at restoring the structure of the recognized text, so text may be incorrectly classified or merged, which harms the accuracy and continuity of the recognition result;

2. They pay little attention to other features such as font, digit ratio, and specific keywords.

Summary of the invention

In view of the problems in the prior art, a decision tree-based OCR text restoration method, equipment and storage medium are provided. A decision tree analyzes multiple features of each text box, and the text boxes are classified and merged to restore the text, which solves the problem of text boxes being misclassified, misarranged, or overlapped.

The technical solution adopted by the present invention is as follows: a decision tree-based OCR text restoration method, including:

preprocessing the text boxes recognized by OCR;

extracting text box features and constructing a decision tree based on the text box features;

classifying and merging the text boxes according to the decision tree to restore the original layout of the text.

Further, the preprocessing includes:

numbering each text box and recording its initial content;

converting all English characters in the text box to lowercase;

removing special characters from the text box.

Further, the special characters are characters that are not digits, letters, Chinese characters, punctuation, or spaces.

Further, the process of extracting text box features includes:

extracting the character count and line count of each text box and its position in the whole document;

extracting the length, width, and font of each text box;

extracting the digit ratio, the letter ratio, and the keywords contained in each text box.

Further, the keywords are keywords that indicate the meaning of the text box content, such as "Figure 1" (图1), "Table 2" (表2), "1.1", "2.1", and so on. The formats of these keywords are defined by experts based on experience and can be recognized with regular expressions.
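The keyword recognition described above can be sketched with Python's `re` module. The patent only states that experts define the keyword formats, so the three patterns, the category names, and the `find_keyword` helper below are illustrative assumptions, not the patented rule set.

```python
import re

# Hypothetical keyword formats (assumptions for illustration only).
KEYWORD_PATTERNS = {
    "figure": re.compile(r"^(?:图|figure|fig\.?)\s*\d+", re.IGNORECASE),
    "table": re.compile(r"^(?:表|table)\s*\d+", re.IGNORECASE),
    # Dotted section numbers such as "1.1" or "2.1.3".
    "chapter": re.compile(r"^\d+(?:\.\d+)+"),
}


def find_keyword(text):
    """Return (keyword_type, matched_text) for the first matching pattern, else None."""
    stripped = text.strip()
    for kind, pattern in KEYWORD_PATTERNS.items():
        match = pattern.match(stripped)
        if match:
            return kind, match.group(0)
    return None
```

A pattern this simple also matches decimal numbers such as "3.14" at the start of a line, which is one reason the method combines keywords with other features instead of relying on them alone.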

Further, constructing the decision tree includes:

Root node: determine whether the text box contains a keyword; if so, classify the text box according to the keyword type, including:

Chapter node judgment: subdivide the chapter level according to the width, font, and number of keywords of the text box;

Chart node judgment: determine which chart the text box belongs to according to its font, position, and keyword features;

otherwise, classify the text box directly according to its length, width, font, position, and so on:

Title node judgment: the text box is the widest and is at the highest position on the page;

Page-number node judgment: if the text box contains the keyword "页" or "page", the remaining content is all digits; if it contains no keyword, the content is all digits; the length is less than one line, and the box is at the highest or lowest position on the page;

Paragraph node judgment: determine the paragraph type according to the digit ratio and letter ratio features.

Further, the classification and merging process includes:

classifying all text boxes according to the decision tree;

restoring the initial content and positional arrangement of each text box according to its number;

merging text boxes within the same category that are adjacent in position and have the same font and the same width.

A second aspect of the present invention provides an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, and the processor being able to execute the computer program to implement the above decision tree-based OCR text restoration method.

A third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above decision tree-based OCR text restoration method is implemented.

Compared with the prior art, the beneficial effects of the above technical solution are as follows: the present invention considers multiple features of text boxes besides position and uses a decision tree to classify the text boxes before merging them, which avoids misclassifying text boxes that are close in position and allows targeted restoration for each category of text.

Description of drawings

Figure 1 is a flow chart of the decision tree-based OCR text restoration method proposed by the present invention.

Figure 2 is a flow chart of preprocessing in an embodiment of the present invention.

Figure 3 is a flow chart of feature extraction in an embodiment of the present invention.

Figure 4 is a flow chart of decision tree construction in an embodiment of the present invention.

Figure 5 is a flow chart of classification and merging in an embodiment of the present invention.

Detailed description

The embodiments of the present application are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar modules or modules with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present application and are not to be construed as limiting it. On the contrary, the embodiments of the present application include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.

Example 1

An OCR (optical character recognition) algorithm recognizes the text in an image or scanned document as text boxes carrying features such as text content, length, width, and position, but the text boxes still need their formatting restored before the result can be read fluently. Existing OCR post-processing tends to classify or merge text boxes incorrectly, and mostly considers only positional features while neglecting others. To solve this problem, this embodiment of the present invention proposes a decision tree-based OCR text restoration method that post-processes the OCR recognition result: by applying a decision tree to multiple features of each text box, it classifies and merges the text boxes belonging to titles, chapters, page numbers, paragraphs, and diagrams, so as to restore the original layout of the text. This avoids text boxes in the OCR result being misclassified, misarranged, or overlapped, and addresses the problems of incoherent text content and easily disordered format and layout. As shown in Figure 1, the specific scheme is as follows:

Step S101: preprocess the text boxes recognized by OCR.

As shown in Figure 2, in this embodiment preprocessing mainly includes first numbering each text box and recording its initial content to facilitate later restoration.

At the same time, all English characters in the text box are converted to lowercase, and special characters are removed from the text box. This preprocessing effectively removes interference from the text boxes, so that text box features can be extracted more accurately and the accuracy of text box classification improves.

In one embodiment, the special characters are characters that are not digits, letters, Chinese characters, punctuation, or spaces.
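A minimal sketch of the preprocessing in step S101, assuming each text box arrives as a plain string. The `TextBox` record, the `preprocess` function, and the exact Unicode ranges treated as "Chinese" and "punctuation" are assumptions; the patent does not specify them.

```python
import re
import string
from dataclasses import dataclass

# Matches every character that is NOT in one of the kept classes.
SPECIAL = re.compile(
    r"[^0-9a-zA-Z\u4e00-\u9fff\s"                # digits, letters, CJK ideographs, whitespace
    r"\u2000-\u206f\u3000-\u303f\uff00-\uffef"   # general, CJK, and full-width punctuation
    + re.escape(string.punctuation)              # ASCII punctuation
    + "]"
)


@dataclass
class TextBox:
    box_id: int    # number assigned during preprocessing
    original: str  # untouched OCR text, kept for later restoration
    cleaned: str   # lowercased text with special characters removed


def preprocess(texts):
    """Number each box, keep its initial content, and build a cleaned copy."""
    boxes = []
    for box_id, text in enumerate(texts):
        cleaned = SPECIAL.sub("", text.lower())
        boxes.append(TextBox(box_id, text, cleaned))
    return boxes
```

Keeping `original` next to `cleaned` mirrors the numbering step: features are extracted from the cleaned text, while the untouched content is what gets restored after classification.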

Step S102: extract text box features and construct a decision tree based on the text box features.

As shown in Figure 3, in order to classify and merge the text boxes, their features must first be extracted. In this embodiment, this includes:

for each text box, extracting the character count, the line count, and the position in the whole document;

for each text box, extracting the length, width, and font;

for each text box, extracting the digit ratio, the letter ratio, and the keywords it contains.

It should be noted that the keywords in this embodiment are keywords that indicate the meaning of the text box content, such as "Figure 1" (图1), "Table 2" (表2), "1.1", "2.1", and so on. The formats of these keywords are defined by experts based on experience and can be recognized with regular expressions.
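The text-derived features listed above can be computed as in the following sketch. The `BoxFeatures` record and `extract_text_features` function are hypothetical names, and counting non-whitespace characters as the "character count" is an assumption; length, width, font, and position would come from the OCR engine and are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class BoxFeatures:
    char_count: int      # non-whitespace characters in the box
    line_count: int      # number of text lines in the box
    digit_ratio: float   # share of characters that are digits
    letter_ratio: float  # share of characters that are ASCII letters


def extract_text_features(text):
    """Compute count and ratio features from the preprocessed text of one box."""
    chars = [c for c in text if not c.isspace()]
    total = len(chars)
    digits = sum(c.isdigit() for c in chars)
    letters = sum(c.isascii() and c.isalpha() for c in chars)
    return BoxFeatures(
        char_count=total,
        line_count=text.count("\n") + 1 if text else 0,
        digit_ratio=digits / total if total else 0.0,
        letter_ratio=letters / total if total else 0.0,
    )
```

For example, the text "abc 123\n45" has eight non-whitespace characters on two lines, five of them digits and three of them letters.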

After the features of the text boxes have been determined, a decision tree is built from the extracted features. The specific process is as follows:

As shown in Figure 4, this embodiment first collects statistics over the whole document, including the keywords, the font types, and the width ranges of the text boxes.

The decision tree is then constructed from these statistics:

Root node: determine whether the text box contains a keyword (such as "Figure 1" (图1), "Table 2" (表2), "1.1", "2.1"); if so, classify the text box according to the keyword type, including:

Chapter node judgment: further subdivide the chapter level according to features such as the width, font, and number of keywords of the text box.

Chart node judgment: further determine which chart the text box belongs to according to features such as font, position, and keywords.

Otherwise, classify the text box according to its length, width, font, position, and so on:

Title node judgment: the text box is the widest; its position is usually the highest on the page; its length is usually at most one line, although more than one line is not ruled out.

Page-number node judgment: if the text box contains no keyword, its content is all digits; if it contains a keyword such as "页" or "page", everything other than the keyword is digits. The length is less than one line, and the position is usually the lowest on the page.

Also included is the paragraph node judgment: determine the paragraph type (such as body text or quotation) according to features such as the digit ratio and letter ratio.
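The node judgments above can be read as a hand-written rule tree. The sketch below is one possible reading under stated assumptions: the `Box` fields, the 10% page-edge threshold, the `PAGE_NUMBER` pattern, and the category names are all hypothetical, and the finer steps (chapter levels, chart ownership, paragraph types) are left out.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    text: str               # preprocessed text
    width: float            # box width reported by the OCR engine
    top: float              # y coordinate of the top edge (0 = top of page)
    keyword: Optional[str]  # "chapter", "figure", "table", or None

# Digits only, optionally accompanied by "page" or "页" (assumed format).
PAGE_NUMBER = re.compile(r"^(?:page\s*\d+|第?\s*\d+\s*页|\d+)$")


def classify(box, page_height, max_width):
    """Walk the rule tree for one box and return its category."""
    # Root node: does the box contain a keyword?
    if box.keyword == "chapter":
        return "chapter"
    if box.keyword in ("figure", "table"):
        return "chart"
    text = box.text.strip()
    near_top = box.top < 0.1 * page_height     # assumed threshold
    near_bottom = box.top > 0.9 * page_height  # assumed threshold
    # Page-number node: all digits (plus optional keyword) at a page edge.
    if PAGE_NUMBER.match(text) and (near_top or near_bottom):
        return "page_number"
    # Title node: the widest box, near the top of the page.
    if near_top and box.width >= max_width:
        return "title"
    # Everything else is treated as a paragraph.
    return "paragraph"
```

Here `max_width` stands for the document-level width statistic gathered before the tree is built; a box containing only digits in the middle of a page falls through to the paragraph branch because it fails the position test.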

Step S103: classify and merge the text boxes according to the decision tree to restore the original layout of the text.

Referring to Figure 5, in this embodiment the constructed decision tree is applied directly to classify all text boxes; then, according to the text box numbers, the initial content and positional arrangement of each text box are restored; finally, text boxes within the same category that are adjacent in position and have the same font and the same width are merged.
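The merge rule in step S103 can be sketched as follows. The `Box` fields, the `merge_boxes` function, and the definition of "adjacent" as a vertical gap of at most `max_gap` units are assumptions; the patent does not define adjacency numerically, and multi-column layouts would need a more careful ordering than the simple top-to-bottom sort used here.

```python
from dataclasses import dataclass


@dataclass
class Box:
    box_id: int
    category: str  # result of the decision tree classification
    font: str
    width: float
    top: float     # y coordinate of the top edge
    bottom: float  # y coordinate of the bottom edge
    text: str      # restored initial content


def merge_boxes(boxes, max_gap=5.0):
    """Merge vertically adjacent boxes with the same category, font, and width."""
    merged = []
    for box in sorted(boxes, key=lambda b: b.top):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.category == box.category
                and prev.font == box.font
                and prev.width == box.width
                and box.top - prev.bottom <= max_gap):
            # Extend the previous box downward and append the text.
            merged[-1] = Box(prev.box_id, prev.category, prev.font, prev.width,
                             prev.top, box.bottom, prev.text + "\n" + box.text)
        else:
            merged.append(box)
    return merged
```

The merged box keeps the number of its first member, so the restored layout can still be traced back to the original OCR output.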

The present invention considers multiple features of text boxes besides position (such as the digit and letter ratios and specific keywords) and uses a decision tree to classify the text boxes before merging them. This avoids misclassifying text boxes that are close in position and allows targeted restoration for each category of text.

Example 2

This embodiment provides an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, and the processor being able to execute the computer program to implement the decision tree-based OCR text restoration method described in Example 1.

The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

The memory can be used to store the computer program and/or modules, and the processor implements the functions of the decision tree-based OCR text restoration method by running or executing the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one disk storage device, a flash memory device, or another volatile solid-state storage device.

The basic concepts of the present invention have been described above. It is obvious to those skilled in the art that the above detailed disclosure is only an example and does not limit this specification. Although not explicitly stated here, those skilled in the art may make various modifications, improvements, and corrections to this specification. Such modifications, improvements, and corrections are suggested in this specification, so they remain within the spirit and scope of its exemplary embodiments.

At the same time, this specification uses specific terms to describe its embodiments. Terms such as "one embodiment", "an embodiment", and/or "some embodiments" refer to a feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized that "an embodiment", "one embodiment", or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics of one or more embodiments of this specification may be combined as appropriate.

Furthermore, those skilled in the art will understand that aspects of this specification may be illustrated and described in several patentable categories or contexts, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of this specification may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, and so on), or by a combination of hardware and software. Any of the above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component", or "system".

Example 3

This embodiment provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the decision tree-based OCR text restoration method described in Example 1 is implemented.

A computer-readable storage medium may contain a propagated data signal carrying computer program code, for example on baseband or as part of a carrier wave. The propagated signal may take many forms, including electromagnetic, optical, or any suitable combination. A computer storage medium may be any computer-readable medium other than a computer-readable storage medium that can be connected to an instruction execution system, apparatus, or device to communicate, propagate, or transmit a program for use. Program code located on a computer storage medium may be transmitted over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or similar media, or any combination of the foregoing.

The computer program code required for the operation of each part of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic programming languages such as Python, Ruby, and Groovy; or other programming languages. The program code may run entirely on the user's computer, run on the user's computer as a stand-alone software package, run partly on the user's computer and partly on a remote computer, or run entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example, through the Internet), or in a cloud computing environment, or used as a service such as software as a service (SaaS).

此外,除非权利要求中明确说明,本说明书所述处理元素和序列的顺序、数字字母的使用、或其他名称的使用,并非用于限定本说明书流程和方法的顺序。尽管上述披露中通过各种示例讨论了一些目前认为有用的发明实施例,但应当理解的是,该类细节仅起到说明的目的,附加的权利要求并不仅限于披露的实施例,相反,权利要求旨在覆盖所有符合本说明书实施例实质和范围的修正和等价组合。例如,虽然以上所描述的系统组件可以通过硬件设备实现,但是也可以只通过软件的解决方案得以实现,如在现有的服务器或移动设备上安装所描述的系统。In addition, unless explicitly stated in the claims, the order of the processing elements and sequences, the use of numbers and letters, or the use of other names in this specification are not intended to limit the order of the processes and methods in this specification. Although the foregoing disclosure discusses by various examples some embodiments of the invention that are presently considered useful, it is to be understood that such details are for purposes of illustration only and that the appended claims are not limited to the disclosed embodiments. To the contrary, rights The claims are intended to cover all modifications and equivalent combinations consistent with the spirit and scope of the embodiments of this specification. For example, although the system components described above can be implemented through hardware devices, they can also be implemented through software-only solutions, such as installing the described system on an existing server or mobile device.

同理,应当注意的是,为了简化本说明书披露的表述,从而帮助对一个或多个发明实施例的理解,前文对本说明书实施例的描述中,有时会将多种特征归并至一个实施例、附图或对其的描述中。但是,这种披露方法并不意味着本说明书对象所需要的特征比权利要求中提及的特征多。实际上,实施例的特征要少于上述披露的单个实施例的全部特征。Similarly, it should be noted that, in order to simplify the expression disclosed in this specification and thereby help understand one or more embodiments of the invention, in the previous description of the embodiments of this specification, multiple features are sometimes combined into one embodiment. accompanying drawings or descriptions thereof. However, this method of disclosure does not imply that the subject matter of the description requires more features than are mentioned in the claims. In fact, embodiments may have less than all features of a single disclosed embodiment.

针对本说明书引用的每个专利、专利申请、专利申请公开物和其他材料,如文章、书籍、说明书、出版物、文档等,特此将其全部内容并入本说明书作为参考。与本说明书内容不一致或产生冲突的申请历史文件除外,对本说明书权利要求最广范围有限制的文件(当前或之后附加于本说明书中的)也除外。需要说明的是,如果本说明书附属材料中的描述、定义、和/或术语的使用与本说明书所述内容有不一致或冲突的地方,以本说明书的描述、定义和/或术语的使用为准。Each patent, patent application, patent application publication and other material, such as articles, books, instructions, publications, documents, etc. cited in this specification is hereby incorporated by reference into this specification in its entirety. Application history documents that are inconsistent with or conflict with the content of this specification are excluded, as are documents (currently or later appended to this specification) that limit the broadest scope of the claims in this specification. It should be noted that if there is any inconsistency or conflict between the descriptions, definitions, and/or the use of terms in the accompanying materials of this manual and the content described in this manual, the descriptions, definitions, and/or the use of terms in this manual shall prevail. .

最后,应当理解的是,本说明书中所述实施例仅用以说明本说明书实施例的原则。其他的变形也可能属于本说明书的范围。因此,作为示例而非限制,本说明书实施例的替代配置可视为与本说明书的教导一致。相应地,本说明书的实施例不仅限于本说明书明确介绍和描述的实施例。Finally, it should be understood that the embodiments described in this specification are only used to illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Accordingly, by way of example and not limitation, alternative configurations of the embodiments of this specification may be considered consistent with the teachings of this specification. Accordingly, the embodiments of this specification are not limited to those expressly introduced and described in this specification.

尽管已描述了本发明的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例作出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本发明范围的所有变更和修改。Although the preferred embodiments of the present invention have been described, those skilled in the art will be able to make additional changes and modifications to these embodiments once the basic inventive concepts are apparent. Therefore, it is intended that the appended claims be construed to include the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If such changes and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (5)

Translated from Chinese
1. A decision tree-based OCR text restoration method, characterized by comprising:
  preprocessing the text boxes recognized by OCR;
  extracting text box features, and constructing a decision tree based on the text box features;
  classifying and merging the text boxes according to the decision tree, and restoring the original layout of the text;
  wherein the preprocessing comprises:
    numbering each text box and recording its initial content;
    converting all English characters in the text box to lowercase;
    removing special characters from the text box;
  the text box feature extraction comprises:
    extracting the word count, line count, and position in the whole document of each text box;
    extracting the length, width, and font of each text box;
    extracting the digit ratio, letter ratio, and contained keywords of each text box;
  the decision tree construction comprises:
    root node: determining whether the text box contains a keyword; if so, classifying the text box according to the keyword type, including:
      chapter node judgment: subdividing the chapter level according to the width, font, and number of keywords of the text box;
      chart node judgment: determining the chart to which the text box belongs according to its font, position, and keyword features;
    otherwise, classifying the text box directly according to its length, width, font, and position;
    title node judgment: the text box has the widest width and is at the highest position on the page;
    page number node judgment: if the text box contains the keyword "页" or "page", the rest of its content is all digits, and if it contains no keyword, its content is all digits; its length is less than one line, and it is at the highest or lowest position on the page;
    paragraph node judgment: determining the specific paragraph type according to the digit ratio and letter ratio features;
  the classification and merging comprises:
    classifying all text boxes according to the decision tree;
    restoring the initial content and positional arrangement of each text box according to its number;
    merging text boxes within the same category that are adjacent in position and have the same font and the same width.

2. The decision tree-based OCR text restoration method according to claim 1, characterized in that the special characters include characters that are not digits, letters, Chinese characters, punctuation marks, or spaces.

3. The decision tree-based OCR text restoration method according to claim 1, characterized in that the keywords are keywords that can represent the meaning of the text box content, and are identified by regular expressions.

4. An electronic device, characterized by comprising a processor and a memory, wherein the memory stores a computer program executable by the processor, and the processor executes the computer program to implement the decision tree-based OCR text restoration method according to any one of claims 1-3.

5. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the decision tree-based OCR text restoration method according to any one of claims 1-3.
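
As an illustration only, the steps recited in claim 1 could be sketched in Python as below. This is a minimal sketch under stated assumptions, not the patented implementation: the `TextBox` fields, the keyword patterns in `KEYWORDS`, the 10%/90% page-position thresholds, the 0.5 digit-ratio threshold, and the reading of "adjacent" as "consecutive in numbering order" are all hypothetical choices that the claims do not specify. Chapter-level subdivision and chart assignment are omitted.

```python
import re
import unicodedata
from dataclasses import dataclass

# Hypothetical keyword patterns; the claims only say keywords are found by regular expressions.
KEYWORDS = {
    "chapter": re.compile(r"第\s*[一二三四五六七八九十\d]+\s*章|chapter\s*\d+"),
    "chart": re.compile(r"[图表]\s*\d+|figure\s*\d+|table\s*\d+"),
    "page": re.compile(r"页|page"),
}

@dataclass
class TextBox:
    raw: str          # initial content, kept so the layout can be restored
    x: float
    y: float          # top edge; 0 is the top of the page
    width: float
    height: float
    font: str
    index: int = -1   # number assigned during preprocessing
    clean: str = ""   # normalized text used only for classification

def preprocess(text: str) -> str:
    """Lowercase, then keep only letters (incl. Chinese), digits, punctuation, spaces."""
    return "".join(c for c in text.lower()
                   if c.isspace() or unicodedata.category(c)[0] in "LNP")

def features(box: TextBox) -> dict:
    chars = [c for c in box.clean if not c.isspace()]
    n = max(1, len(chars))  # avoid division by zero for empty boxes
    return {
        "n_chars": len(chars),
        "n_lines": box.raw.count("\n") + 1,
        "digit_ratio": sum(c.isdigit() for c in chars) / n,
        "letter_ratio": sum(c.isascii() and c.isalpha() for c in chars) / n,
        "keywords": [k for k, pat in KEYWORDS.items() if pat.search(box.clean)],
    }

def classify(box: TextBox, f: dict, max_width: float, page_height: float) -> str:
    # Root node: keyword-driven branches first.
    if "chapter" in f["keywords"]:
        return "chapter"
    if "chart" in f["keywords"]:
        return "chart"
    at_top = box.y < 0.1 * page_height                  # assumed threshold
    at_bottom = box.y + box.height > 0.9 * page_height  # assumed threshold
    # Title: widest box on the page, at the top.
    if at_top and box.width >= max_width:
        return "title"
    # Page number: single line at the top or bottom; digits only once the
    # "页"/"page" keyword, spaces, and punctuation are stripped.
    rest = "".join(c for c in KEYWORDS["page"].sub("", box.clean) if c.isalnum())
    if f["n_lines"] == 1 and (at_top or at_bottom) and rest.isdigit():
        return "page_number"
    # Paragraph type from the digit ratio (assumed 0.5 cut-off).
    return "paragraph_numeric" if f["digit_ratio"] > 0.5 else "paragraph_text"

def restore(boxes: list, page_height: float) -> list:
    """Number, classify, and merge boxes; returns (label, original text) pairs."""
    for i, b in enumerate(boxes):
        b.index, b.clean = i, preprocess(b.raw)
    max_width = max(b.width for b in boxes)
    merged = []  # entries are (label, last box merged in, accumulated raw text)
    for b in boxes:
        label = classify(b, features(b), max_width, page_height)
        prev = merged[-1] if merged else None
        if (prev and prev[0] == label and prev[1].font == b.font
                and abs(prev[1].width - b.width) < 1e-6):
            merged[-1] = (label, b, prev[2] + "\n" + b.raw)
        else:
            merged.append((label, b, b.raw))
    return [(label, text) for label, _, text in merged]
```

Classification runs on the normalized text, while merging concatenates the recorded initial content, which mirrors the claim's separation between preprocessing for analysis and restoring each box's original content by its number.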
CN202311064174.XA | 2023-08-23 | 2023-08-23 | A decision tree-based OCR text restoration method, equipment and storage medium | Active | CN116758565B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311064174.XA / CN116758565B (en) | 2023-08-23 | 2023-08-23 | A decision tree-based OCR text restoration method, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311064174.XA / CN116758565B (en) | 2023-08-23 | 2023-08-23 | A decision tree-based OCR text restoration method, equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN116758565A (en) | 2023-09-15
CN116758565B (en) | 2023-11-24

Family

ID=87951980

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202311064174.XA | Active | CN116758565B (en) | 2023-08-23 | 2023-08-23 | A decision tree-based OCR text restoration method, equipment and storage medium

Country Status (1)

Country | Link
CN (1) | CN116758565B (en)

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
