CN116029280A - Method, device, computing equipment and storage medium for extracting key information of document - Google Patents

Method, device, computing equipment and storage medium for extracting key information of document

Publication number
CN116029280A
Authority
CN
China
Prior art keywords
target attribute
document
key information
clause
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111239393.8A
Other languages
Chinese (zh)
Inventor
唐海庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd
Priority to CN202111239393.8A
Publication of CN116029280A
Legal status: Pending


Abstract

The invention discloses a method, a device, a computing device and a storage medium for extracting key information from a document. A document to be processed and a target attribute determined by a user are acquired, and the document is parsed to obtain its content objects. For a table object, position information of the target attribute in the table object is detected, and the key information corresponding to the target attribute is extracted from the table object according to that position information. For a paragraph text object, a clause set corresponding to the target attribute is obtained from the paragraph text object; the clauses in the clause set are merged according to a preset splicing strategy to obtain context information of the target attribute; and the key information corresponding to the target attribute is determined from the context information. Because table objects and paragraph text objects are handled by different methods, key information extraction becomes more efficient and more accurate.

Description

A document key information extraction method, device, computing device and storage medium

Technical Field

The invention relates to the technical field of data processing, and in particular to a method, device, computing device and storage medium for extracting key information from documents.

Background

Common information carriers include database tables, Excel spreadsheets, TXT files and Word documents. Using artificial intelligence to automatically extract key information from the unstructured text of such carriers and turn it into structured information, so that higher-level business processing logic can be automated, is gradually becoming an industry trend. Taking Word documents as an example, their main constituent objects are tables and paragraphs, so the target key information may appear in either form. Processing both kinds of object with a single extraction method is bound to degrade the extraction result noticeably.

In the prior art, optical character recognition (OCR) is used to obtain the position information of a table, and the cell positions and cell contents serve as the constituent elements of a template. Each type of table produces its own information extraction template, and the template is applied to obtain the values of the table attributes the user is interested in. Because the target key information is extracted directly from the position information in the template, an OCR position error or merged cells in the table lead to incorrect extraction. Moreover, when two tables differ only in the relative position of their attributes, the corresponding templates are still not interchangeable, so the number of table extraction templates that must be maintained grows exponentially.

In the prior art, key information in paragraph form is extracted with paragraph lines as input. Long inputs are usually truncated and part of the input is discarded, so the learned semantic information is incomplete; alternatively, because the input is not focused enough, the model learns the noise in it as correct information, which affects the final extraction output.

Summary of the Invention

In view of the above problems, the present invention is proposed to provide a document key information extraction method, device, computing device and storage medium that overcome, or at least partially solve, those problems.

According to one aspect of the present invention, a document key information extraction method is provided, including:

acquiring a document to be processed and a target attribute determined by a user, and parsing the content of the document to be processed to obtain the content objects in the document to be processed, wherein the content objects include table objects and paragraph text objects;

for the table object, detecting position information of the target attribute in the table object, and extracting the key information corresponding to the target attribute from the table object according to the position information;

for the paragraph text object, obtaining a clause set corresponding to the target attribute from the paragraph text object; merging the clauses in the clause set according to a preset splicing strategy to obtain context information of the target attribute; and determining the key information corresponding to the target attribute according to the context information.

According to another aspect of the present invention, a document key information extraction device is provided, including:

a content parsing module, configured to acquire a document to be processed and a target attribute determined by a user, and to parse the content of the document to be processed to obtain the content objects in the document to be processed, wherein the content objects include table objects and paragraph text objects;

a table information extraction module, configured to detect, for the table object, position information of the target attribute in the table object, and to extract the key information corresponding to the target attribute from the table object according to the position information;

a paragraph information extraction module, configured to obtain, for the paragraph text object, a clause set corresponding to the target attribute from the paragraph text object; to merge the clauses in the clause set according to a preset splicing strategy to obtain context information of the target attribute; and to determine the key information corresponding to the target attribute according to the context information.

According to yet another aspect of the present invention, a computing device is provided, including a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the above document key information extraction method.

According to a further aspect of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, and the executable instruction causes a processor to perform the operations corresponding to the above document key information extraction method.

According to the document key information extraction method, device, computing device and storage medium of the present invention, a document to be processed and a target attribute determined by a user are acquired, and the content of the document is parsed to obtain its content objects, which include table objects and paragraph text objects. For a table object, position information of the target attribute in the table object is detected, and the key information corresponding to the target attribute is extracted from the table object according to the position information. For a paragraph text object, a clause set corresponding to the target attribute is obtained from the paragraph text object, the clauses in the clause set are merged according to a preset splicing strategy to obtain context information of the target attribute, and the key information corresponding to the target attribute is determined according to the context information. By applying a method suited to each of the table objects and paragraph text objects in a document, the invention improves the efficiency of key information extraction and makes the extraction more accurate.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the contents of the description, and so that the above and other objects, features and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.

Description of Drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Fig. 1 shows a flow chart of a document key information extraction method provided by an embodiment of the present invention;

Fig. 2 shows a schematic diagram of the paragraph text object merging process provided by another embodiment of the present invention;

Fig. 3 shows a schematic structural diagram of a document key information extraction device provided by an embodiment of the present invention;

Fig. 4 shows a schematic structural diagram of a computing device provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope conveyed fully to those skilled in the art.

Fig. 1 shows a flow chart of an embodiment of the document key information extraction method of the present invention. As shown in Fig. 1, the method includes the following steps:

Step S110: acquire the document to be processed and the target attribute determined by the user, and parse the content of the document to be processed to obtain the content objects in it, wherein the content objects include table objects and paragraph text objects.

Taking a Word document as an example, its main constituent objects are tables and paragraph text. The document to be processed and the target attribute determined by the user are acquired, where the target attribute is the name corresponding to the field value the user wants to obtain. A document content parsing tool parses the document to be processed and yields the content objects it contains. This embodiment extracts key information with a method chosen according to the type of content object.
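The output of step S110 can be pictured as a list of typed content objects that is then routed to one of two extractors. The following is a minimal sketch under assumptions: the class names `TableObject` and `ParagraphObject` and the helper `split_by_kind` are hypothetical and do not come from the patent, and the parsing tool itself is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TableObject:
    rows: List[List[str]]  # raw cell text, one inner list per table row


@dataclass
class ParagraphObject:
    text: str


ContentObject = Union[TableObject, ParagraphObject]


def split_by_kind(
    objects: List[ContentObject],
) -> Tuple[List[TableObject], List[ParagraphObject]]:
    """Separate parsed content objects so each kind can go to its own extractor."""
    tables = [o for o in objects if isinstance(o, TableObject)]
    paragraphs = [o for o in objects if isinstance(o, ParagraphObject)]
    return tables, paragraphs
```

Tables would then go to the processing of step S120 and paragraphs to that of step S130.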

Step S120: for the table object, detect the position information of the target attribute in the table object, and extract the key information corresponding to the target attribute from the table object according to the position information.

In an optional implementation, step S120 further includes: performing data style preprocessing on the table object to obtain a processed table object, where the processed table object is a table represented by a two-dimensional array; obtaining a title detection area from the processed table object; and searching the title detection area for the target attribute to obtain the position information of the target attribute.

For a table object in the document to be processed, the data output by the document content parsing tool is preprocessed (for example, content rows are padded and data is aligned) to obtain a table represented by a two-dimensional array. Table 1 is an example of the data style of a table object in this embodiment. As shown in Table 1, the title detection area is obtained from the processed table object and searched for the target attribute to obtain the position information of the target attribute. For example, if in step S110 the user wants the payment name of each payment stage, the target attribute can be set to "payment stage" in the table header, and the relative position of the target attribute in the processed table object is obtained (expressed as the index of the corresponding element in the two-dimensional array). Note that the storage type of the example table in Table 1 is determined with respect to the target attribute "payment stage". The key information corresponding to the payment stage ("first payment", "preliminary inspection payment", "final inspection payment", "final payment" and so on) is stored from top to bottom in a one-to-many layout, so the storage type of this table object is set to the top-down type. Conversely, a one-to-many layout stored from left to right is the left-right type.
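The preprocessing described above (padding rows and aligning data into a two-dimensional array) might look like the following minimal sketch. The function name `normalize_table` and the choice of an empty string as padding are assumptions made for illustration; the patent does not specify them.

```python
from typing import List


def normalize_table(rows: List[List[str]], fill: str = "") -> List[List[str]]:
    """Return a rectangular 2D array: strip cell text and pad short rows with `fill`."""
    width = max((len(row) for row in rows), default=0)
    return [
        [str(cell).strip() for cell in row] + [fill] * (width - len(row))
        for row in rows
    ]
```

After this step every row has the same number of columns, so a position can be expressed as an index pair (i, j).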

[Table 1 appears as an image (Figure BDA0003318672980000051) and is not reproduced in this text.]

Table 1. Example of table object data style

In an optional implementation, step S120 further includes: if the sub-area of the title detection area in which the target attribute is located is the first sub-area, determining a reference point from the position information of the target attribute; extracting from the processed table object the elements located in the reference point's column and in the rows following the reference point's row to obtain a first element set; and extracting the key information corresponding to the target attribute from the processed table object according to the topic of each element in the first element set. If the sub-area in which the target attribute is located is the second sub-area, the element located in the reference point's row and in the column following the reference point's column is extracted from the processed table object to obtain a second element set, and the second element set is taken as the key information corresponding to the target attribute.

In an optional implementation, step S120 further includes: judging whether all elements in the first element set share the same topic. If they do, the first element set is taken as the key information corresponding to the target attribute. If they do not, the first element set is searched for elements that share the topic of its first element, and the first element together with those elements is added to a third element set. If the number of elements in the third element set is greater than a first preset threshold, the third element set is taken as the key information corresponding to the target attribute. If the number of elements in the third element set is less than or equal to the first preset threshold, the attribute of each element in the second element set and the third element set is judged, and target elements are selected from the two sets according to the result as the key information corresponding to the target attribute.

Specifically, assuming that the table object represented by the two-dimensional array has size n×m, the title detection area Area_t in which the target attribute Field_t is located is obtained by formula (1).

[Formula (1) appears as an image (Figure BDA0003318672980000061) and is not reproduced in this text.]

Here the title detection area Area_t comprises a first sub-area and a second sub-area. The first sub-area Area_t1 is the cell content of the first k rows of the table (k = 1 when n ≤ 3; otherwise k is given by a floor expression that appears as an image, Figure BDA0003318672980000062, and is not reproduced in this text). The second sub-area Area_t2 is the cell content of all even-numbered columns (counting from 0) in the remaining rows of the table outside the first sub-area. The floor symbol (Figure BDA0003318672980000063) denotes rounding down, that is, taking the largest integer not greater than the value. Taking Table 1 as an example, the table object has size 6×4, i.e., n = 6, so the first sub-area is the cell content of the first two rows of the table, and the second sub-area consists of column 0 and column 2 in rows 3 to 6.
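The two sub-areas can be enumerated as index sets, as in the sketch below. The function name `title_detection_area` is hypothetical. Because the expression for k is not reproduced in this text, k is passed in as a parameter rather than computed; the 6×4 example above corresponds to k = 2.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row index, column index), both counted from 0


def title_detection_area(n_rows: int, n_cols: int, k: int) -> Tuple[List[Cell], List[Cell]]:
    """Return the cell indices of the first and second sub-areas.

    First sub-area: every cell in the first k rows.
    Second sub-area: even-numbered columns (0, 2, ...) of the remaining rows.
    """
    first = [(i, j) for i in range(min(k, n_rows)) for j in range(n_cols)]
    second = [(i, j) for i in range(k, n_rows) for j in range(0, n_cols, 2)]
    return first, second
```

With 0-based row indices, rows 3 to 6 of the example table are indices 2 to 5, so the second sub-area for a 6×4 table with k = 2 starts at cell (2, 0).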

Further, the title detection area Area_t is searched for the target attribute by computing distributed semantic vector similarity. If the target attribute is found in the second sub-area Area_t2, its position (i, j) is output and the storage type of the attribute in the table is determined to be the left-right type. With the position (i, j) as the reference point, the element Cell_{i,j+1}, located in the reference point's row and in the column following the reference point's column, is extracted from the processed table object to obtain the second element set, and the second element set is taken as the key information corresponding to the target attribute.
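The left-right case can be sketched as follows. This is an illustration under assumptions: exact string comparison stands in for the semantic vector similarity search named in the text, the function name `extract_left_right` is hypothetical, and the sample table is invented for the example (it is not Table 1).

```python
from typing import List, Optional, Tuple


def extract_left_right(
    table: List[List[str]], target: str, k: int
) -> Optional[Tuple[Tuple[int, int], List[str]]]:
    """Search the second sub-area (even columns of rows k and below) for `target`.

    Returns the reference point (i, j) and the second element set [Cell(i, j+1)],
    or None if the target is not found there.
    Assumption: exact match replaces the semantic similarity search.
    """
    for i in range(k, len(table)):
        row = table[i]
        # Only even columns that still have a right-hand neighbor.
        for j in range(0, len(row) - 1, 2):
            if row[j] == target:
                return (i, j), [row[j + 1]]
    return None


# Hypothetical table: two header-like rows, then attribute/value pairs side by side.
sample = [
    ["Payment stage", "Ratio", "Condition", "Remark"],
    ["first payment", "30%", "on signing", ""],
    ["Contract No.", "C-001", "Party A", "ACME"],
]
print(extract_left_right(sample, "Party A", k=2))
```

Restricting the search to even columns reflects the layout assumption that attribute names and their values alternate across a row.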

If the target attribute is found in the first sub-area Area_t1, only its position (i, j) can be determined; its storage type in the table cannot, since it may be either top-down or left-right. The position (i, j) of the target attribute is therefore taken as the reference point. First, the cell contents of all rows following the reference point in the reference point's column are collected into the first element set Set_rowValues, and a topic similarity model judges whether all elements of the first element set share the same topic. The topic similarity model may be a Latent Dirichlet Allocation (LDA) topic model.

When the topic is unique, the first element set Set_rowValues is output as the extracted value of the target attribute. When the topic is not unique, the first element set is searched for elements that share the topic of its first element, and the first element together with those elements is added to the third element set. Specifically, the set can be initialized as Set_results = {Cell_{i+1,j}} and the following rows traversed in order, using the topic of the first element as the standard: a cell in a following row with the same topic is added to Set_results; otherwise the traversal stops and Set_results is output as the third element set. If the number of elements in Set_results is greater than the first preset threshold (generally set to 1), the value of the target attribute is Set_results. If the number of elements in Set_results is less than or equal to the first preset threshold, attribute judgment is performed on Cell_{i+1,j} (the element in row i+1, column j of the table) and Cell_{i,j+1} (the element in row i, column j+1 of the table): the contents of both cells are fed to a binary classification model that judges whether each belongs to the target elements of the target attribute, the elements judged "yes" are added to Set_results in turn as target elements, and the final Set_results is output as the key information corresponding to the target attribute.

Taking Table 1 as an example, when the target attribute to be extracted is the payment stage, the above logic first obtains the first element set Set_rowValues = {first payment, preliminary inspection payment, final inspection payment, final payment, total}. The topic similarity model judges that the topic within this set is not unique: first payment, preliminary inspection payment, final inspection payment and final payment form one topic, and total forms another. The elements whose topic matches that of the first element of Set_rowValues, "first payment", are then collected in turn, giving Set_results = {first payment, preliminary inspection payment, final inspection payment, final payment}. Since Set_results has 4 elements, more than the first preset threshold (1), the extraction is complete.
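The decision logic for a target attribute found in the first sub-area can be sketched as below. This is one reading of the text, not the patent's implementation: `topic_of` is a hypothetical callable standing in for the LDA-based topic similarity model, `is_target` is a hypothetical callable standing in for the binary classification model, and the table is assumed to be rectangular. The column values in the usage lines are the ones the text quotes from Table 1; the rest of that table is not reproduced, so the header cell is invented.

```python
from typing import Callable, List


def extract_top_down(
    table: List[List[str]],
    i: int,
    j: int,
    topic_of: Callable[[str], str],
    is_target: Callable[[str], bool],
    threshold: int = 1,
) -> List[str]:
    """Extract values for the attribute at reference point (i, j)."""
    # First element set: cells below the reference point, in the same column.
    column = [row[j] for row in table[i + 1:]]
    if not column:
        return []
    topics = [topic_of(cell) for cell in column]
    if len(set(topics)) == 1:
        return column  # unique topic: the whole column is the answer
    # Third element set: leading run of cells sharing the first cell's topic.
    results = [column[0]]
    for cell, topic in zip(column[1:], topics[1:]):
        if topic != topics[0]:
            break
        results.append(cell)
    if len(results) > threshold:
        return results
    # Ambiguous layout: classify the cell below and the cell to the right.
    candidates = list(results)
    if j + 1 < len(table[i]):
        candidates.append(table[i][j + 1])
    return [c for c in candidates if is_target(c)]


stages = [["payment stage"], ["first payment"], ["preliminary inspection payment"],
          ["final inspection payment"], ["final payment"], ["total"]]
topic = lambda cell: "summary" if cell == "total" else "payment"
print(extract_top_down(stages, 0, 0, topic, is_target=lambda cell: True))
```

Passing the two models in as callables keeps the control flow testable without committing to any particular topic model or classifier.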

Step S130: for the paragraph text object, obtain the clause set corresponding to the target attribute from the paragraph text object; merge the clauses in the clause set according to a preset splicing strategy to obtain the context information of the target attribute; and determine the key information corresponding to the target attribute according to the context information.

In an optional implementation, step S130 further includes: performing semantic screening on the paragraph text objects to obtain the paragraph text set corresponding to the target attribute; and splitting each paragraph in the paragraph text set at the symbols in a preset symbol set to generate the clause set corresponding to the target attribute.

Specifically, if key information is extracted directly from all paragraph text objects of the document, the extraction model is more likely to learn noise, processing is slower, and predictions are less accurate. Therefore, for paragraph text objects, shallow semantic screening is first used to obtain the paragraph text set corresponding to the target attribute. For example, the chapter containing the target attribute can be identified first, and trigger words then searched for within that chapter, to determine the set of paragraph texts related to the target attribute. Trigger words can be customized for the target attribute to be extracted; for example, to extract Party A's account information, the trigger words can be defined as "account number", "Party A" and so on. Each paragraph in the paragraph text set can then be split further by defining a preset symbol set, for example {。,;}. Each paragraph text object phara_i is split at the symbols in the preset symbol set into multiple clauses corresponding to the target attribute, which form the clause set {sent_i1, sent_i2, ..., sent_in} (abbreviated below as {s_1, s_2, ..., s_n}).
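Trigger-word screening and symbol-based splitting can be sketched with the standard `re` module. The function names `select_paragraphs` and `split_clauses` are hypothetical, substring matching is an assumed stand-in for the shallow semantic screening, and the default symbol string is only an example of a preset symbol set.

```python
import re
from typing import Iterable, List


def select_paragraphs(paragraphs: Iterable[str], trigger_words: Iterable[str]) -> List[str]:
    """Keep paragraphs that contain at least one trigger word (substring match)."""
    triggers = list(trigger_words)
    return [p for p in paragraphs if any(t in p for t in triggers)]


def split_clauses(paragraph: str, symbols: str = "。,;") -> List[str]:
    """Split a paragraph at any character in `symbols`, dropping empty pieces."""
    pattern = "[" + re.escape(symbols) + "]"
    return [piece.strip() for piece in re.split(pattern, paragraph) if piece.strip()]


paragraphs = ["Party A account: 123; bank: XYZ.", "This clause covers delivery."]
related = select_paragraphs(paragraphs, ["account", "Party A"])
print([split_clauses(p, symbols=".;") for p in related])
```

Building the pattern as a character class means each symbol in the set acts as a separate delimiter.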

In an optional implementation, step S130 further includes: traversing the clauses in the clause set in a preset order, performing semantic analysis on the current clause and its adjacent clause, and deciding from the result of the semantic analysis whether to merge the current clause with the adjacent clause.

In an optional implementation, before semantic analysis is performed on the current clause and the adjacent clause, the method further includes: judging whether the number of effective words in the current clause is less than a second preset threshold; if so, splicing the current clause with adjacent clauses until the number of effective words in the spliced clause is greater than or equal to the second preset threshold.

The clauses obtained by splitting need to be spliced with their neighbors to obtain the smallest-granularity text line that contains the complete context information of the target attribute. For the clause set {s_1, s_2, ..., s_n}, each clause is traversed in a preset order. Specifically, semantic analysis is first performed on the current clause to determine whether it is directly related to the target attribute; if it is, the current clause becomes the center of the merge operation. The size of the splicing sliding window is then set to w, and the preset order used for traversal is determined. The preset order may be forward splicing (sent_i1 ← sent_i2 ← ... ← sent_in) or backward splicing (sent_i1 → sent_i2 → ... → sent_in). Whichever order is chosen, all clauses in the clause set must be traversed, so the two orders are spliced in a similar way; this embodiment uses backward splicing as the example.

Figure 2 shows a flowchart of the backward splicing order. As shown in Figure 2, let the current clause be $s$ and the following clause to be spliced be $\mathit{next}_i$. Step 1: determine whether the number of valid words in the current clause is less than the second preset threshold $N$. If the number of valid words in $s$ is less than $N$, splice the current clause with the adjacent clause to obtain a new current clause, $s \leftarrow s + \mathit{next}_i$, and repeat until the number of valid words in the spliced clause is greater than or equal to the second preset threshold, or until the following clause falls outside the splicing sliding window $w$ (meaning the backward traversal of the clause set is complete), at which point splicing ends.

Note that the number of valid words in the current clause is the number of words that remain after the clause is segmented into words and the stop words (useless, meaningless words) are removed. A suitable value for the second preset threshold $N$ can be obtained by statistical analysis of the specific application scenario, for example $N = 20$.
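The valid-word count described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-word list `STOP_WORDS` is hypothetical, and the regular-expression tokenizer is an assumption that only suits space-delimited text (Chinese text would need a dedicated word segmenter).

```python
import re

# Hypothetical stop-word list; a real system would load a language-specific one.
STOP_WORDS = frozenset({"the", "a", "an", "of", "is", "to", "and", "in"})


def count_valid_words(clause, stop_words=STOP_WORDS):
    """Count the tokens that remain after tokenizing and dropping stop words."""
    # Assumption: a simple regex tokenizer stands in for real word segmentation.
    tokens = re.findall(r"\w+", clause.lower())
    return sum(1 for token in tokens if token not in stop_words)


def is_too_short(clause, threshold=20):
    """True when the clause has fewer valid words than the threshold N."""
    return count_valid_words(clause) < threshold
```

A clause for which `is_too_short` returns `True` would be spliced with its neighbor before any semantic analysis is attempted.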

Step 2: if the number of valid words in the current clause $s$ is greater than or equal to the second preset threshold $N$, input the current clause $s$ and the adjacent clause $\mathit{next}_i$ into a Bidirectional Encoder Representations from Transformers (BERT) model, and use the model's classification unit (CLS) to decide whether $\mathit{next}_i$ can be spliced onto $s$. If the BERT model predicts that $\mathit{next}_i$ is not the next sentence of $s$, the splicing operation stops and the current clause $s$ is output as the complete context of the target attribute; Step 1 is then carried out for the remaining clauses that have not been spliced.

If the BERT model predicts that $\mathit{next}_i$ is the next sentence of $s$, update $s \leftarrow s + \mathit{next}_i$ and repeat Step 2 until the following clause falls outside the splicing sliding window $w$; splicing then ends and the context information of the target attribute is obtained.
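The two steps of backward splicing can be sketched as a single function. This is a hedged sketch under stated assumptions: `count_valid_words` and `is_next_sentence` are hypothetical callables supplied by the caller (the latter stands in for the BERT next-sentence decision), the window is interpreted as "at most `window` clauses after the starting clause", and the `sep` parameter is added only so that space-delimited text joins cleanly.

```python
def splice_backward(clauses, start, window, threshold,
                    count_valid_words, is_next_sentence, sep=""):
    """Merge clauses after clauses[start] into one context string.

    Returns the merged text and the index of the first clause not consumed.
    """
    current = clauses[start]
    nxt = start + 1
    # Assumption: the sliding window covers the `window` clauses after `start`.
    limit = min(len(clauses), start + 1 + window)

    # Step 1: pad a short clause until it holds at least `threshold` valid words
    # or the window is exhausted.
    while nxt < limit and count_valid_words(current) < threshold:
        current = current + sep + clauses[nxt]
        nxt += 1

    # Step 2: keep splicing while the predictor accepts the next clause
    # as a continuation; stop at the first rejection or at the window edge.
    while nxt < limit and is_next_sentence(current, clauses[nxt]):
        current = current + sep + clauses[nxt]
        nxt += 1

    return current, nxt
```

Passing the predictor in as a callable keeps the loop independent of any particular model, so a trained next-sentence classifier can replace a test stub without changing the splicing logic.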

Note that BERT is a pre-trained model whose training task has two parts: language modeling and predicting whether one sentence is the next sentence of another. This embodiment therefore uses the BERT model to decide whether the current clause and the adjacent clause can be spliced.

The context information of the target attribute is input into a sequence labeling model, which automatically identifies the key information in the input text sequence and thereby completes the extraction of the key information of the target attribute. The sequence labeling model may be, for example, a conditional random field (CRF) model.

With the method of this embodiment, key information is extracted from the document by applying different processing techniques to table objects and paragraph text objects. Compared with a single unified processing mode, this can greatly improve both the accuracy and the recall of extracting the key information corresponding to the target attribute. Specifically, consider the table objects and paragraph text objects produced by the document parsing tool. To extract key information from a table object, the user only needs to specify the target attribute, and distributed semantic similarity is used to output the position of the target attribute in the table. Unlike template-based extraction, which requires the table type to be specified manually, no manual intervention is needed, which reduces the introduction of human error. Taking the position of the target attribute as the reference point, the key information of the target attribute is extracted from the vertical table elements using a topic similarity model; alternatively, the elements adjacent to the reference point in the vertical and horizontal directions of the table are taken, a binary classification model judges the attribute of each cell's content, and the key information corresponding to the target attribute is determined. By combining unsupervised and supervised approaches, the method makes full use of the semantic similarity of key information and the features of cell content values, no longer depends on manually summarized rules or large title dictionaries, and effectively improves the accuracy of content extraction from table objects. For a paragraph text object, the clause set generated by segmenting the original input is examined for semantic relatedness to decide whether clauses can be further spliced and merged, which yields the context information of the target attribute at the smallest granularity. That context information is then used as the input of the sequence labeling model. Compared with the conventional practice of feeding the original paragraph text object into the sequence labeling model, this reduces noise in the input, focuses the model's learning on the target, and effectively improves prediction accuracy.

Figure 3 shows a schematic structural diagram of an embodiment of a document key information extraction apparatus according to the present invention. As shown in Figure 3, the apparatus includes a content parsing module 310, a table information extraction module 320, and a paragraph information extraction module 330.

The content parsing module 310 is configured to acquire a document to be processed and a target attribute determined by a user, and to parse the document content of the document to be processed to obtain the content objects in it, where the content objects include table objects and paragraph text objects.

The table information extraction module 320 is configured to, for a table object, detect the position information of the target attribute in the table object, and extract the key information corresponding to the target attribute from the table object according to the position information.

In an optional manner, the table information extraction module 320 is further configured to: perform data style preprocessing on the table object to obtain a processed table object, where the processed table object is a table represented by a two-dimensional array; and obtain a title detection area from the processed table object and search for the target attribute in the title detection area to obtain the position information of the target attribute.

In an optional manner, the table information extraction module 320 is further configured to: if the sub-area of the title detection area in which the target attribute is located is the first sub-area, determine a reference point according to the position information of the target attribute, extract from the processed table object the elements that are in the column of the reference point and in the rows after the row of the reference point to obtain a first element set, and extract the key information corresponding to the target attribute from the processed table object according to the topic of each element in the first element set; if the sub-area of the title detection area in which the target attribute is located is the second sub-area, extract from the processed table object the elements that are in the row of the reference point and in the columns after the column of the reference point to obtain a second element set, and take the second element set as the key information corresponding to the target attribute.

In an optional manner, the table information extraction module 320 is further configured to: determine whether all elements in the first element set have the same topic; if all elements in the first element set have the same topic, take the first element set as the key information corresponding to the target attribute; if the elements in the first element set have different topics, search the first element set for elements that have the same topic as its first element, and add the first element and the elements with the same topic to a third element set; if the number of elements in the third element set is greater than a first preset threshold, take the third element set as the key information corresponding to the target attribute; if the number of elements in the third element set is less than or equal to the first preset threshold, judge the attribute of each element in the second element set and the third element set, and select target elements from the second element set and the third element set according to the judgment result as the key information corresponding to the target attribute.
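The selection logic for the first sub-area case can be sketched on a table held as a two-dimensional list. The sketch makes several assumptions: `topic_of` is a hypothetical callable standing in for the topic (theme) model, `is_value` is a hypothetical callable standing in for the binary classifier that judges a cell's attribute, and the function names are illustrative only.

```python
def column_below(table, row, col):
    """First element set: cells in the reference column, in the rows after the reference row."""
    return [r[col] for r in table[row + 1:]]


def row_right(table, row, col):
    """Second element set: cells in the reference row, in the columns after the reference column."""
    return list(table[row][col + 1:])


def select_from_column(table, row, col, topic_of, is_value, first_threshold):
    """Pick the key information for a target attribute located at (row, col)."""
    first = column_below(table, row, col)
    topics = [topic_of(cell) for cell in first]

    # All cells share one topic: the whole column is the key information.
    if len(set(topics)) <= 1:
        return first

    # Third element set: the first cell plus every cell sharing its topic.
    third = [cell for cell, topic in zip(first, topics) if topic == topics[0]]
    if len(third) > first_threshold:
        return third

    # Fallback: classify cells from the second and third sets individually.
    candidates = row_right(table, row, col) + third
    return [cell for cell in candidates if is_value(cell)]
```

For a price column that ends in a non-numeric summary cell, a `topic_of` that separates numbers from text would keep the numeric cells and drop the summary cell.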

The paragraph information extraction module 330 is configured to, for a paragraph text object, obtain the clause set corresponding to the target attribute from the paragraph text object; merge the clauses in the clause set according to a preset splicing strategy to obtain the context information of the target attribute; and determine the key information corresponding to the target attribute according to the context information.

In an optional manner, the paragraph information extraction module 330 is further configured to: perform semantic screening on the paragraph text object to obtain the paragraph text set corresponding to the target attribute; and segment each paragraph in the paragraph text set at the symbols in a preset symbol set to generate the clause set corresponding to the target attribute.
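Segmenting paragraphs at a preset symbol set can be sketched with a regular expression. The symbol set `SPLIT_SYMBOLS` below is a hypothetical choice, since the text does not list the actual symbols, and the semantic screening step is assumed to have already produced the paragraph list.

```python
import re

# Hypothetical preset symbol set: full-width and ASCII sentence delimiters.
SPLIT_SYMBOLS = "。；！？.;!?"


def split_into_clauses(paragraphs, symbols=SPLIT_SYMBOLS):
    """Split each paragraph at any delimiter and return the non-empty clauses."""
    pattern = "[" + re.escape(symbols) + "]"
    clauses = []
    for paragraph in paragraphs:
        for part in re.split(pattern, paragraph):
            part = part.strip()
            if part:  # drop the empty pieces left by trailing delimiters
                clauses.append(part)
    return clauses
```

Including the ASCII period in the symbol set also splits decimal numbers such as "3.5", so a production symbol set would need to account for that.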

In an optional manner, the paragraph information extraction module 330 is further configured to: traverse the clauses in the clause set in a preset order, perform semantic analysis on the current clause and the adjacent clause, and determine from the result of the semantic analysis whether to merge the current clause with the adjacent clause.

In an optional manner, the paragraph information extraction module 330 is further configured to: determine whether the number of valid words in the current clause is less than the second preset threshold; if so, splice the current clause with the adjacent clause until the number of valid words in the spliced clause is greater than or equal to the second preset threshold.

The apparatus of this embodiment acquires a document to be processed and a target attribute determined by a user, and parses the document content of the document to be processed to obtain the content objects in it, where the content objects include table objects and paragraph text objects. For a table object, it detects the position information of the target attribute in the table object and extracts the key information corresponding to the target attribute from the table object according to the position information. For a paragraph text object, it obtains the clause set corresponding to the target attribute from the paragraph text object, merges the clauses in the clause set according to a preset splicing strategy to obtain the context information of the target attribute, and determines the key information corresponding to the target attribute according to the context information. By applying a dedicated extraction method to the table objects and to the paragraph text objects in the document, the apparatus improves the efficiency of key information extraction and makes the extraction results more accurate.

An embodiment of the present invention provides a non-volatile computer storage medium. The computer storage medium stores at least one executable instruction, and the computer-executable instruction can perform the document key information extraction method of any of the above method embodiments.

Specifically, the executable instruction can be used to cause a processor to perform the following operations:

acquiring a document to be processed and a target attribute determined by a user, and parsing the document content of the document to be processed to obtain the content objects in it, where the content objects include table objects and paragraph text objects;

for a table object, detecting the position information of the target attribute in the table object, and extracting the key information corresponding to the target attribute from the table object according to the position information;

for a paragraph text object, obtaining the clause set corresponding to the target attribute from the paragraph text object; merging the clauses in the clause set according to a preset splicing strategy to obtain the context information of the target attribute; and determining the key information corresponding to the target attribute according to the context information.

Figure 4 shows a schematic structural diagram of an embodiment of a computing device according to the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.

As shown in Figure 4, the computing device may include:

a processor, a communications interface, a memory, and a communication bus.

The processor, the communications interface, and the memory communicate with one another through the communication bus. The communications interface is used to communicate with network elements of other devices, such as clients or other servers. The processor is used to execute a program, and specifically may perform the relevant steps of the above embodiments of the document key information extraction method.

Specifically, the program may include program code, and the program code includes computer operation instructions.

The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the computing device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs together with one or more ASICs.

The memory is used to store the program. The memory may include high-speed RAM, and may also include non-volatile memory, for example at least one disk storage device.

Specifically, the program can be used to cause the processor to perform the following operations:

acquiring a document to be processed and a target attribute determined by a user, and parsing the document content of the document to be processed to obtain the content objects in it, where the content objects include table objects and paragraph text objects;

for a table object, detecting the position information of the target attribute in the table object, and extracting the key information corresponding to the target attribute from the table object according to the position information;

for a paragraph text object, obtaining the clause set corresponding to the target attribute from the paragraph text object; merging the clauses in the clause set according to a preset splicing strategy to obtain the context information of the target attribute; and determining the key information corresponding to the target attribute according to the context information.

The algorithms and displays presented here are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the embodiments of the present invention are not directed at any particular programming language. It should be understood that the content of the invention described here can be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.

The description provided here sets forth numerous specific details. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the embodiments of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all the features of a single previously disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules of the device in an embodiment can be adaptively changed and placed in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that although some embodiments described here include certain features that are included in other embodiments and not other features, combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components according to the embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program or a computer program product) for performing part or all of the methods described here. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names. Unless otherwise specified, the steps in the above embodiments should not be understood as limiting the order of execution.

Claims (10)

1.一种文档关键信息抽取方法,其特征在于,包括:1. A method for extracting key information of a document, comprising:获取待处理文档以及用户确定的目标属性,对待处理文档进行文档内容解析,得到所述待处理文档中的内容对象;其中,所述内容对象包括表格对象和段落文本对象;Obtain the document to be processed and the target attribute determined by the user, analyze the document content of the document to be processed, and obtain the content object in the document to be processed; wherein, the content object includes a table object and a paragraph text object;针对所述表格对象,检测在所述表格对象中所述目标属性的位置信息;依据所述位置信息从所述表格对象中提取所述目标属性对应的关键信息;For the form object, detecting the position information of the target attribute in the form object; extracting the key information corresponding to the target attribute from the form object according to the position information;针对所述段落文本对象,从所述段落文本对象中获取所述目标属性对应的子句集合;按照预设拼接策略,对所述子句集合中的子句进行合并得到所述目标属性的上下文信息;依据所述上下文信息确定所述目标属性对应的关键信息。For the paragraph text object, obtain the clause set corresponding to the target attribute from the paragraph text object; according to the preset splicing strategy, merge the clauses in the clause set to obtain the context of the target attribute information; determining key information corresponding to the target attribute according to the context information.2.根据权利要求1所述的方法,其特征在于,所述检测在所述表格对象中所述目标属性的位置信息进一步包括:2. The method according to claim 1, wherein the detecting the position information of the target attribute in the form object further comprises:对所述表格对象进行数据样式预处理,得到处理后的表格对象;其中,处理后的表格对象为由二维数组表征的表格;Performing data style preprocessing on the table object to obtain a processed table object; wherein, the processed table object is a table represented by a two-dimensional array;从处理后的表格对象中获取标题检测区域,在所述标题检测区域中查找所述目标属性,得到所述目标属性的位置信息。Obtain the title detection area from the processed table object, search the target attribute in the title detection area, and obtain the position information of the target attribute.3.根据权利要求2所述的方法,其特征在于,所述依据所述位置信息从所述表格对象中提取所述目标属性对应的关键信息进一步包括:3. 
The method according to claim 2, wherein said extracting key information corresponding to said target attribute from said form object according to said location information further comprises:若所述目标属性在所述标题检测区域中所处的子区域为第一子区域,则根据所述目标属性的位置信息确定基准点,从处理后的表格对象中提取位于所述基准点所处列且位于所述基准点所处行的后续各行中的元素得到第一元素集合,根据所述第一元素集合中各个元素对应的主题,从处理后的表格对象中提取所述目标属性对应的关键信息;If the sub-area where the target attribute is located in the title detection area is the first sub-area, determine the reference point according to the position information of the target attribute, and extract the reference point located at the reference point from the processed form object. The first element set is obtained from the elements in the next row of the row where the reference point is located. According to the theme corresponding to each element in the first element set, the corresponding target attribute is extracted from the processed table object. the key information;若所述目标属性在所述标题检测区域中所处的子区域为第二子区域,则从处理后的表格对象中提取位于所述基准点所处行且位于所述基准点所处列的后续列中的元素得到第二元素集合,将所述第二元素集合作为所述目标属性对应的关键信息。If the sub-area where the target attribute is located in the title detection area is the second sub-area, extract the data that is located in the row where the reference point is located and in the column where the reference point is located from the processed table object Elements in subsequent columns obtain a second element set, and the second element set is used as key information corresponding to the target attribute.4.根据权利要求3所述的方法,其特征在于,所述根据所述第一元素集合中各个元素对应的主题,从处理后的表格对象中提取所述目标属性对应的关键信息进一步包括:4. 
The method according to claim 3, wherein extracting the key information corresponding to the target attribute from the processed table object according to the theme corresponding to each element in the first element set further comprises:
determining whether all elements in the first element set have the same theme;
if all elements in the first element set have the same theme, using the first element set as the key information corresponding to the target attribute;
if the elements in the first element set have different themes, searching the first element set for elements having the same theme as the first element in the first element set, and adding the first element and the elements having the same theme as the first element to a third element set;
if the number of elements in the third element set is greater than a first preset threshold, using the third element set as the key information corresponding to the target attribute;
if the number of elements in the third element set is less than or equal to the first preset threshold, determining the attribute of each element in the second element set and the third element set, and selecting, according to the determination result, a target element from the second element set and the third element set as the key information corresponding to the target attribute.

5. The method according to claim 1, wherein obtaining the clause set corresponding to the target attribute from the paragraph text object further comprises:
performing semantic screening on the paragraph text object to obtain a paragraph text set corresponding to the target attribute;
segmenting each paragraph in the paragraph text set according to the symbols in a preset symbol set to generate the clause set corresponding to the target attribute.

6. The method according to any one of claims 1-5, wherein merging the clauses in the clause set according to the preset splicing strategy to obtain the context information of the target attribute further comprises:
traversing the clauses in the clause set in a preset order, performing semantic analysis on the current clause and its adjacent clause, and determining, according to the semantic analysis result, whether to merge the current clause with the adjacent clause.

7. The method according to claim 6, wherein, before the semantic analysis is performed on the current clause and the adjacent clause, the method further comprises:
determining whether the number of effective words in the current clause is less than a second preset threshold;
if so, splicing the current clause with adjacent clauses until the number of effective words in the spliced clause is greater than or equal to the second preset threshold.

8. A document key information extraction device, comprising:
a content parsing module, configured to obtain a document to be processed and a target attribute determined by a user, and to parse the document content of the document to be processed to obtain content objects in the document to be processed, wherein the content objects include a table object and a paragraph text object;
a table information extraction module, configured to, for the table object, detect position information of the target attribute in the table object, and extract key information corresponding to the target attribute from the table object according to the position information;
a paragraph information extraction module, configured to, for the paragraph text object, obtain a clause set corresponding to the target attribute from the paragraph text object, merge the clauses in the clause set according to a preset splicing strategy to obtain context information of the target attribute, and determine key information corresponding to the target attribute according to the context information.

9. A computing device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the document key information extraction method according to any one of claims 1-7.

10. A computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform operations corresponding to the document key information extraction method according to any one of claims 1-7.
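
For illustration only, the clause handling recited in claims 5 and 7 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the symbol set, the threshold value, the whitespace-based notion of an "effective word", and all function names are hypothetical, since the claims only refer to a "preset symbol set" and a "second preset threshold". The semantic screening of claim 5 and the semantic analysis of claim 6 are not modelled.

```python
import re

# Hypothetical values: the claims only say "preset symbol set" and
# "second preset threshold" without fixing either.
SPLIT_SYMBOLS = "。；！？.;!?"
MIN_EFFECTIVE_WORDS = 4


def split_into_clauses(paragraphs, symbols=SPLIT_SYMBOLS):
    """Segment each paragraph at any symbol of the preset symbol set."""
    pattern = "[" + re.escape(symbols) + "]"
    clauses = []
    for paragraph in paragraphs:
        for piece in re.split(pattern, paragraph):
            piece = piece.strip()
            if piece:  # drop empty fragments left by consecutive symbols
                clauses.append(piece)
    return clauses


def count_effective_words(clause):
    """Assumed definition: whitespace-separated tokens containing a letter or digit."""
    return sum(1 for token in clause.split() if any(ch.isalnum() for ch in token))


def splice_short_clauses(clauses, threshold=MIN_EFFECTIVE_WORDS):
    """Splice a too-short clause with the following clauses until the
    spliced clause reaches the threshold number of effective words."""
    merged = []
    buffer = ""
    for clause in clauses:
        buffer = clause if not buffer else buffer + " " + clause
        if count_effective_words(buffer) >= threshold:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A short clause at the very end has no following neighbour,
        # so this sketch attaches it to the preceding clause instead.
        if merged:
            merged[-1] = merged[-1] + " " + buffer
        else:
            merged.append(buffer)
    return merged


if __name__ == "__main__":
    text = ["The contract amount is 500 yuan. Paid. The delivery date is next Monday; thanks."]
    for clause in splice_short_clauses(split_into_clauses(text)):
        print(clause)
```

Joining with a space is a choice suited to English text. For Chinese text, a word segmenter would presumably replace the whitespace split, and the clauses could be joined without a separator.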
CN202111239393.8A | Priority date: 2021-10-25 | Filing date: 2021-10-25 | Method, device, computing equipment and storage medium for extracting key information of document | Pending | CN116029280A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111239393.8A (published as CN116029280A (en)) | 2021-10-25 | 2021-10-25 | Method, device, computing equipment and storage medium for extracting key information of document

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111239393.8A (published as CN116029280A (en)) | 2021-10-25 | 2021-10-25 | Method, device, computing equipment and storage medium for extracting key information of document

Publications (1)

Publication Number | Publication Date
CN116029280A (en) | 2023-04-28

Family

ID=86072814

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202111239393.8A | Pending | CN116029280A (en) | 2021-10-25 | 2021-10-25 | Method, device, computing equipment and storage medium for extracting key information of document

Country Status (1)

Country | Link
CN (1) | CN116029280A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117541359A (en) * | 2024-01-04 | 2024-02-09 | 江西工业贸易职业技术学院(江西省粮食干部学校、江西省粮食职工中等专业学校) | Dining recommendation method and system based on preference analysis
CN118113816A (en) * | 2024-04-26 | 2024-05-31 | 杭州数云信息技术有限公司 | Document knowledge extraction method and device, storage medium, terminal and computer program product
CN118839678A (en) * | 2024-09-20 | 2024-10-25 | 杭州恒生聚源信息技术有限公司 | Document information recall method and device, electronic equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117541359A (en) * | 2024-01-04 | 2024-02-09 | 江西工业贸易职业技术学院(江西省粮食干部学校、江西省粮食职工中等专业学校) | Dining recommendation method and system based on preference analysis
CN117541359B (en) * | 2024-01-04 | 2024-03-29 | 江西工业贸易职业技术学院(江西省粮食干部学校、江西省粮食职工中等专业学校) | Dining recommendation method and system based on preference analysis
CN118113816A (en) * | 2024-04-26 | 2024-05-31 | 杭州数云信息技术有限公司 | Document knowledge extraction method and device, storage medium, terminal and computer program product
CN118113816B (en) * | 2024-04-26 | 2024-08-06 | 杭州数云信息技术有限公司 | Document knowledge extraction method and device, storage medium, terminal and computer program product
CN118839678A (en) * | 2024-09-20 | 2024-10-25 | 杭州恒生聚源信息技术有限公司 | Document information recall method and device, electronic equipment and storage medium

Similar Documents

Publication | Title
CN111444330B (en) | Method, device, equipment and storage medium for extracting short text keywords
CN107229668B (en) | A text extraction method based on keyword matching
CN109726274B (en) | Question generation method, device and storage medium
WO2022222300A1 (en) | Open relationship extraction method and apparatus, electronic device, and storage medium
CN108959431A (en) | Label automatic generation method, system, computer readable storage medium and equipment
CN112347760B (en) | Training method and device of intention recognition model, intention recognition method and device
CN116029280A (en) | Method, device, computing equipment and storage medium for extracting key information of document
CN109446333A (en) | A kind of method that realizing Chinese Text Categorization and relevant device
CN112199499A (en) | Text division method, text classification method, apparatus, equipment and storage medium
CN112699232A (en) | Text label extraction method, device, equipment and storage medium
CN111177375A (en) | Electronic document classification method and device
CN114065749B (en) | A text-oriented Cantonese recognition model and system training and recognition method
WO2022143608A1 (en) | Language labeling method and apparatus, and computer device and storage medium
CN115374786B (en) | Entity and relationship joint extraction method and device, storage medium and terminal
CN119377415B (en) | Chinese bad language theory detection method and system
CN118503454B (en) | Data query method, device, storage medium and computer program product
CN117291192B (en) | Government affair text semantic understanding analysis method and system
CN118298449A (en) | Document structure segmentation method, device, equipment and medium
CN117131159A (en) | Method, device, equipment and storage medium for extracting sensitive information
KR102685135B1 (en) | Video editing automation system
CN110610001A (en) | Short text integrity identification method and device, storage medium and computer equipment
KR101126186B1 (en) | Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof
CN120337937B (en) | Academic opinion extraction method and system applied to academic literature
CN110110190B (en) | Information output method and device
CN119474268A (en) | Information retrieval method, device, equipment, storage medium and product

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
