CN112183458B


Info

Publication number: CN112183458B
Application number: CN202011119353.5A
Authority: CN
Inventors: 徐�明; 孟宁; 龙启斌
Original assignee: Wanhuilian Intelligent Technology Suzhou Co ltd
Current assignee: Wanhuilian Intelligent Technology Suzhou Co ltd
Priority date: 2020-10-19
Filing date: 2020-10-19
Publication date: 2023-07-28
Anticipated expiration: 2040-10-19
Also published as: CN112183458A

Abstract

The invention relates to the field of data processing, and discloses a data processing system of an electronic document based on an artificial intelligence technology.

Description

Translated from Chinese

A data processing system for electronic documents based on artificial intelligence technology

Technical Field

The present invention relates to the field of data processing, and more specifically to a data processing system for electronic documents based on artificial intelligence technology.

Background

An electronic packing list is an electronic export packing-list document drawn up in the format prescribed by the national Ministry of Transport and adapted to the business operations of the local port. Such a document carries feature characters that never change, and a group of feature characters forms the feature data.

To store and analyze the data on an electronic document more effectively, the data must first be extracted. During extraction the electronic document data are usually input as a PDF or an image, which is often skewed, so the electronic document data need angle correction.

Summary of the Invention

In view of the deficiencies of the prior art, the object of the present invention is to provide a data processing system for electronic documents based on artificial intelligence technology that performs angle correction on electronic document data.

To achieve the above object, the present invention provides the following technical solution:

A data processing system for electronic documents based on artificial intelligence technology comprises an input terminal, a selection terminal, a preprocessing terminal, a text integration terminal and a post-processing terminal.

The input terminal is configured with an input unit, which inputs electronic document data and generates a dynamic database.

The selection terminal is configured with a common database that stores common data. The common data comprise position data and feature data, and the position data represent the positions of the feature characters on a standard electronic document.

The preprocessing terminal is configured with a layout database and a preprocessing unit. The layout database stores layout data representing the layout size of the standard electronic document. The preprocessing unit retrieves the common data from the common database and the electronic document data from the dynamic database.

The text integration terminal is configured with an integration unit, which generates a corresponding image for each character, builds an image mapping table for each character from those images, and builds a dataset base library from the image mapping tables.

The preprocessing unit uses the feature data to retrieve the corresponding images from the dataset base library, determines from those images where the feature data appear in the electronic document data, and, according to the position data, cuts out each character of the electronic document data and stores it in a recognition database.

The post-processing terminal is configured with a correction unit and a recognition unit. The correction unit comprises a check strategy and a correction strategy. The recognition unit retrieves characters from the recognition database and recognizes them with an OCR recognition model to obtain a character recognition result. The check strategy checks the combination logic of the characters in the character recognition result, and the correction unit corrects that combination logic.

Preferably, the preprocessing unit comprises an angle correction strategy. Based on the determined position of the feature data in the electronic document data, the angle correction strategy uses a preset angle correction algorithm to obtain the angular deviation of the feature data from the horizontal, calculates from that deviation where the layout lies in the electronic document data, and adjusts the horizontal angle of the layout.

Preferably, the preprocessing unit further comprises an analysis strategy that detects the header and footer information in the electronic document data and judges the text orientation of the document from the orientation of the characters in that information. The header and footer information comprises the length and width of the header region and of the footer region. The analysis strategy comprises an analysis algorithm, which derives the length and width of the layout from those dimensions, obtains the position of the layout, and selects the layout.

Preferably, the analysis algorithm derives the deflection angle of the layout from the orientation of the characters in the header and footer information, and the analysis strategy adjusts the horizontal angle of the layout according to that deflection angle.

Preferably, the preprocessing unit obtains the start and end positions of each block in the electronic document data from the determined position of the feature data and the position data, calculates the length and width of each block with a preset position algorithm, cuts out each block according to those dimensions, and stores the resulting block data in a block database.

Preferably, based on the start and end positions of each block, the preprocessing unit uses horizontal projection of the image to obtain the upper and lower bounds of every line of characters in the block, cuts out each line, and stores the resulting row data in a row database.

Preferably, the preprocessing unit uses vertical projection to obtain the boundaries of the individual characters in each line and cuts out each character.

Preferably, the system further comprises a search terminal configured with a search unit and an adjustment unit. The search unit retrieves the feature data from the common database and the block data from the block database. The preprocessing unit splits the feature data into individual characters, retrieves the image corresponding to each character from the dataset base library, and stores the resulting decomposed image information in a search database. The search unit retrieves the decomposed image information from the search database and determines the position of each item in the block data, and the adjustment unit adjusts the horizontal angle of the block according to those positions.

Preferably, the search unit retrieves the row data from the row database, and the adjustment unit adjusts the horizontal angle of the row data according to the position of each item of decomposed image information in the row data.

Preferably, the correction unit uses the decomposed image information to retrieve the corresponding characters from the recognition database, and the check strategy skips checking those characters.

Beneficial effects of the present invention: the preprocessing unit uses the feature data to retrieve the corresponding images from the dataset base library and determines from those images where the feature data appear in the electronic document data. This locates the layout of the electronic document accurately, so the layout can be cut with less error. The cut characters are therefore more complete, easier for the OCR recognition model to recognize, and recognized more accurately.

Brief Description of the Drawings

Fig. 1 is a structural block diagram of the present invention.

Reference signs: 1, input terminal; 2, selection terminal; 3, preprocessing terminal; 4, text integration terminal; 5, post-processing terminal.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

It should be noted that when a component is said to be "fixed to" another component, it may be directly on the other component or an intervening component may be present. When a component is said to be "connected to" another component, it may be directly connected to it or an intervening component may be present. When a component is said to be "arranged on" another component, it may be arranged directly on it or an intervening component may be present. The terms "vertical", "horizontal", "left", "right" and similar expressions are used herein for illustration only.

Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by a person skilled in the technical field of the present invention. The terms used in this description serve only to describe specific embodiments and are not intended to limit the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Referring to Fig. 1, the data processing system for electronic documents based on artificial intelligence technology of this embodiment comprises an input terminal 1, a selection terminal 2, a preprocessing terminal 3, a text integration terminal 4 and a post-processing terminal 5.

The input terminal 1 is configured with an input unit, which inputs electronic document data and generates a dynamic database.

The selection terminal 2 is configured with a common database that stores common data. The common data comprise position data and feature data, and the position data represent the positions of the feature characters on a standard electronic document. Taking a packing list as an example, the feature data may be the characters "packing list" and the issuer's company name in Chinese and English. Because these data do not change, the layout position of the whole packing list can be determined from where these characters appear on it, which makes the obtained layout position more accurate.

The preprocessing terminal 3 is configured with a layout database and a preprocessing unit. The layout database stores layout data representing the layout size of the standard electronic document. The preprocessing unit retrieves the common data from the common database and the electronic document data from the dynamic database.

The text integration terminal 4 is configured with an integration unit, which generates a corresponding image for each character. The character can be rotated by some angle, given added noise, moderately eroded and dilated, and sprinkled with random interference points, which can improve the preprocessing unit's recognition rate for characters. The integration unit builds an image mapping table for each character from the images and builds the dataset base library from the image mapping tables.
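The augmentation described for the integration unit can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes glyphs are small binary grids (1 = ink), and the names `rotate`, `dilate`, `erode`, `add_noise` and `build_base_library`, the ±5° rotation range and the 2% noise rate are all hypothetical choices made for the example.

```python
import math
import random

Image = list[list[int]]  # binary glyph: 1 = ink, 0 = background


def rotate(img: Image, degrees: float) -> Image:
    """Rotate about the image centre with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            sx = round(c * (x - cx) + s * (y - cy) + cx)
            sy = round(-s * (x - cx) + c * (y - cy) + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def _neighbourhood(img: Image, y: int, x: int):
    """Yield the 3x3 neighbourhood, treating out-of-bounds pixels as background."""
    h, w = len(img), len(img[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            yield img[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0


def dilate(img: Image) -> Image:
    """A pixel becomes ink if any pixel in its 3x3 neighbourhood is ink."""
    return [[int(any(_neighbourhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img: Image) -> Image:
    """A pixel stays ink only if its whole 3x3 neighbourhood is ink."""
    return [[int(all(_neighbourhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def add_noise(img: Image, rate: float, rng: random.Random) -> Image:
    """Flip each pixel independently with probability `rate`."""
    return [[p ^ 1 if rng.random() < rate else p for p in row] for row in img]


def build_base_library(glyphs: dict[str, Image], variants: int,
                       seed: int = 0) -> dict[str, list[Image]]:
    """Map each character to its clean glyph plus `variants` distorted copies."""
    rng = random.Random(seed)
    library: dict[str, list[Image]] = {}
    for char, glyph in glyphs.items():
        images = [glyph]
        for _ in range(variants):
            img = rotate(glyph, rng.uniform(-5.0, 5.0))
            img = rng.choice((dilate, erode, lambda i: i))(img)
            images.append(add_noise(img, 0.02, rng))
        library[char] = images
    return library
```

The dictionary returned by `build_base_library` plays the role of the image mapping table: a character is the key and its rendered variants are the values.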

The preprocessing unit uses the feature data to retrieve the corresponding images from the dataset base library, determines from those images where the feature data appear in the electronic document data, and, according to the position data, cuts out each character of the electronic document data and stores it in the recognition database. Different feature data can correspond to different electronic documents. Determining where the feature data sit on the electronic document locates the layout of the document accurately, so the layout can be cut with less error.
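The patent does not say how the feature image is located in the document. One simple possibility, shown here only as an assumption, is exhaustive template matching on binary images: slide the feature image over the page and keep the offset where the most pixels agree. The function name `locate` and the pixel-agreement score are hypothetical.

```python
Image = list[list[int]]  # binary image: 1 = ink, 0 = background


def locate(page: Image, template: Image) -> tuple[tuple[int, int], float]:
    """Return ((top, left), score) of the best match of `template` in `page`.

    The score is the fraction of template pixels that agree with the page,
    so 1.0 means a pixel-perfect match. Ties keep the first position found.
    """
    ph, pw = len(page), len(page[0])
    th, tw = len(template), len(template[0])
    best_score, best_pos = -1, (0, 0)
    for top in range(ph - th + 1):
        for left in range(pw - tw + 1):
            score = sum(
                page[top + dy][left + dx] == template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            if score > best_score:
                best_score, best_pos = score, (top, left)
    return best_pos, best_score / (th * tw)
```

Comparing the returned offset with the stored position data for the same feature characters gives the shift between the scanned document and the standard layout.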

The post-processing terminal 5 is configured with a correction unit and a recognition unit. The correction unit comprises a check strategy and a correction strategy. The recognition unit retrieves characters from the recognition database and recognizes them with an OCR recognition model to obtain a character recognition result. The check strategy checks the combination logic of the characters in the character recognition result, and the correction unit corrects that combination logic.

The preprocessing unit comprises an angle correction strategy. Based on the determined position of the feature data in the electronic document data, the angle correction strategy uses a preset angle correction algorithm to obtain the angular deviation of the feature data from the horizontal and calculates from that deviation where the layout lies in the electronic document data. The position of the layout in the document being processed can be derived from the positions of the feature characters on the standard electronic document. The horizontal angle of the layout is then adjusted, which keeps the vertical and horizontal positions of the cut characters accurate and the characters complete.
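The "preset angle correction algorithm" is not specified in the patent. A minimal sketch, assuming the located feature text yields two reference points (its left and right ends) that would lie on a horizontal line in the standard document, is to take the angle of the line between them and rotate by its negative. The names `skew_angle`, `rotate_point` and `layout_origin` are hypothetical.

```python
import math

Point = tuple[float, float]  # (x, y) in image coordinates, y grows downward


def skew_angle(left: Point, right: Point) -> float:
    """Angle in degrees between the line left->right and the horizontal axis."""
    return math.degrees(math.atan2(right[1] - left[1], right[0] - left[0]))


def rotate_point(p: Point, degrees: float, center: Point = (0.0, 0.0)) -> Point:
    """Rotate `p` about `center` by `degrees` in the same coordinate frame."""
    t = math.radians(degrees)
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (center[0] + dx * math.cos(t) - dy * math.sin(t),
            center[1] + dx * math.sin(t) + dy * math.cos(t))


def layout_origin(observed: Point, standard: Point) -> Point:
    """Top-left corner of the layout, given where a feature character was
    observed (after deskewing) and where the position data place it on the
    standard document."""
    return (observed[0] - standard[0], observed[1] - standard[1])
```

Rotating every pixel coordinate by `-skew_angle(left, right)` levels the text line; `layout_origin` then anchors the standard layout on the deskewed page.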

The preprocessing unit further comprises an analysis strategy that detects the header and footer information in the electronic document data. Characters in the header and footer differ from those in the body text (they may be larger, smaller, or lighter in colour), which makes them relatively easy to detect, and they do not change from document to document, so determining the layout from the header and footer information is simpler and more accurate. The text orientation of the electronic document data is judged from the orientation of the characters in the header and footer information. The header and footer information comprises the length and width of the header region and of the footer region. The analysis strategy comprises an analysis algorithm, which derives the length and width of the layout from those dimensions, obtains the position of the layout, and selects the layout. This guards against cut characters that the OCR recognition model cannot recognize.
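The patent leaves the analysis algorithm unspecified. One plausible reading, given here purely as an assumption, is that the layout is the rectangle spanning from the top of the header region to the bottom of the footer region. The `Box` type and `layout_from_header_footer` are hypothetical names, and the sketch assumes the page is already upright with the header above the footer.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def layout_from_header_footer(header: Box, footer: Box) -> Box:
    """Smallest rectangle covering both the header and the footer regions."""
    left = min(header.left, footer.left)
    right = max(header.left + header.width, footer.left + footer.width)
    top = header.top
    bottom = footer.top + footer.height
    return Box(left, top, right - left, bottom - top)
```

For example, a header at `Box(50, 40, 500, 30)` and a footer at `Box(60, 800, 480, 20)` give a layout of `Box(50, 40, 500, 780)`.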

The analysis algorithm derives the deflection angle of the layout from the orientation of the characters in the header and footer information. Those characters rotate together with the layout, so their angular deviation equals the deflection angle of the layout. The analysis strategy adjusts the horizontal angle of the layout according to this deflection angle, so that the layout is more precisely level when it is cut.

The preprocessing unit obtains the start and end positions of each block in the electronic document data from the determined position of the feature data and the position data, calculates the length and width of each block with a preset position algorithm, cuts out each block according to those dimensions, and stores the resulting block data in a block database. Based on the start and end positions of each block, the preprocessing unit uses horizontal projection of the image to obtain the upper and lower bounds of every line of characters in the block, cuts out each line, and stores the resulting row data in a row database. The preprocessing unit then uses vertical projection to obtain the boundaries of the individual characters in each line and cuts out each character. The preprocessing unit thus proceeds in three steps, from block cutting to row cutting to column cutting, which reduces errors in the size or position of the cut characters.
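The row and column cutting steps can be illustrated with projection profiles. This is a minimal sketch assuming a binarized block (1 = ink) in which lines and characters are separated by at least one empty row or column; the helper names `runs`, `split_lines` and `split_chars` are hypothetical.

```python
Image = list[list[int]]  # binary block image: 1 = ink, 0 = background


def runs(profile: list[int]) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def split_lines(block: Image) -> list[tuple[int, int]]:
    """Horizontal projection: sum ink per row, giving (top, bottom) per text line."""
    return runs([sum(row) for row in block])


def split_chars(block: Image, top: int, bottom: int) -> list[tuple[int, int]]:
    """Vertical projection inside one line: sum ink per column,
    giving (left, right) per character."""
    width = len(block[0])
    profile = [sum(block[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(profile)
```

Calling `split_lines` on a block and then `split_chars` on each returned span yields one bounding box per character. Touching or overlapping characters would need a more careful rule than a zero-valued gap.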

The system further comprises a search terminal configured with a search unit and an adjustment unit. The search unit retrieves the feature data from the common database and the block data from the block database. The preprocessing unit splits the feature data into individual characters, retrieves the image corresponding to each character from the dataset base library, and stores the resulting decomposed image information in a search database. The search unit retrieves the decomposed image information from the search database and determines the position of each item in the block data. The adjustment unit adjusts the horizontal angle of the block according to those positions. Determining the horizontal angle of the block more precisely makes the row segmentation more accurate.

The search unit also retrieves the row data from the row database, and the adjustment unit adjusts the horizontal angle of the row data according to the position of each item of decomposed image information in the row data, so that the horizontal angle of the row data is determined more precisely.

According to the decomposed image information, the correction unit retrieves the corresponding characters from the recognition database, and the check strategy skips checking those characters. This reduces the number of characters the check strategy must examine and shortens the checking time.
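The patent does not define the check or correction rules. The sketch below assumes one hypothetical rule for illustration: the variable part of a field should be numeric, so letters that OCR commonly confuses with digits are replaced. Positions already matched to known feature characters are skipped, as the paragraph above describes. `CONFUSABLE_TO_DIGIT` and `check_and_correct` are invented names.

```python
# Hypothetical confusion table: letters often misread in place of digits.
CONFUSABLE_TO_DIGIT = {"O": "0", "I": "1", "l": "1", "S": "5", "B": "8"}


def check_and_correct(chars: str, known_positions: set[int]) -> tuple[str, int]:
    """Return (corrected_text, number_of_characters_checked).

    `known_positions` holds the indices already identified as feature
    characters; they are copied through without being checked.
    """
    out, checked = [], 0
    for i, ch in enumerate(chars):
        if i in known_positions:
            out.append(ch)  # known feature character: skip the check
            continue
        checked += 1
        out.append(ch if ch.isdigit() else CONFUSABLE_TO_DIGIT.get(ch, ch))
    return "".join(out), checked
```

With the label `NO:` marked as known, `check_and_correct("NO:1O2S", {0, 1, 2})` checks only four characters and returns `"NO:1025"`. Without the skip, the `O` in the label would be wrongly rewritten to `0`, so skipping both saves work and protects the fixed text.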

The above are only preferred embodiments of the present invention. The scope of protection of the present invention is not limited to these embodiments, and every technical solution within the inventive concept falls within that scope. It should be noted that improvements and refinements made by a person of ordinary skill in the art without departing from the principles of the present invention are also to be regarded as within the scope of protection.

Claims


1. A data processing system for electronic documents based on artificial intelligence technology, comprising an input terminal, a selection terminal, a preprocessing terminal, a text integration terminal and a post-processing terminal, characterized in that:

2. The data processing system for electronic documents based on artificial intelligence technology according to claim 1, characterized in that: the preprocessing unit comprises an angle correction strategy which, based on the determined position of the feature data in the electronic document data, uses a preset angle correction algorithm to obtain the angular deviation of the feature data from the horizontal, calculates from the angular deviation where the layout lies in the electronic document data, and adjusts the horizontal angle of the layout in the electronic document data.

3. The data processing system for electronic documents based on artificial intelligence technology according to claim 2, characterized in that: the preprocessing unit further comprises an analysis strategy for detecting the header and footer information in the electronic document data and judging the text orientation in the electronic document data from the orientation of the characters in the header and footer information; the header and footer information comprises the length and width of the header region and the length and width of the footer region; the analysis strategy comprises an analysis algorithm which derives the length and width of the layout in the electronic document data from the length and width of the header region and of the footer region, obtains the position of the layout of the electronic document data, and selects the layout of the electronic document data.

4. The data processing system for electronic documents based on artificial intelligence technology according to claim 3, characterized in that: the analysis algorithm derives the deflection angle of the layout of the electronic document data from the orientation of the characters in the header and footer information, and the analysis strategy adjusts the horizontal angle of the layout in the electronic document data according to the deflection angle.

5. The data processing system for electronic documents based on artificial intelligence technology according to claim 1, characterized in that: the preprocessing unit obtains the start position and end position of each block in the electronic document data from the determined position of the feature data in the electronic document data and the position data, calculates the length and width of each block with a preset position algorithm, cuts out each block according to its length and width, and stores the generated block data in a block database.

6. The data processing system for electronic documents based on artificial intelligence technology according to claim 5, characterized in that: based on the start position and end position of each block, the preprocessing unit uses horizontal projection of the image to obtain the upper and lower bounds of every line of characters in the block, cuts out each line of characters, and stores the generated row data in a row database.

7. The data processing system for electronic documents based on artificial intelligence technology according to claim 6, characterized in that: the preprocessing unit uses vertical projection to obtain the boundaries of the individual characters in each line and cuts out each character.

8. The data processing system for electronic documents based on artificial intelligence technology according to claim 7, characterized in that: the system further comprises a search terminal configured with a search unit and an adjustment unit; the search unit retrieves the feature data from the common database and the block data from the block database; the preprocessing unit splits the feature data into individual characters, retrieves the image corresponding to each character from the dataset base library, and stores the generated decomposed image information in a search database; the search unit retrieves the decomposed image information from the search database and determines the position of each item of decomposed image information in the block data; and the adjustment unit adjusts the horizontal angle of the block in the block data according to those positions.

9. The data processing system for electronic documents based on artificial intelligence technology according to claim 8, characterized in that: the search unit retrieves the row data from the row database, and the adjustment unit adjusts the horizontal angle of the row data according to the position of each item of decomposed image information in the row data.

10. The data processing system for electronic documents based on artificial intelligence technology according to claim 9, characterized in that: the correction unit uses the decomposed image information to retrieve the corresponding characters from the recognition database, and the check strategy skips checking those characters.