CN118862843A - A method and system for checking duplicates and automatically annotating scientific and technological project documents

Publication number
CN118862843A
Authority
CN
China
Prior art keywords
paragraph
project
detected
document
similar
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410762065.3A
Other languages
Chinese (zh)
Inventor
王云飞
杨彦飞
杨芷婷
刘志铭
王元丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Liangzhi Data Technology Co ltd
Original Assignee
Hangzhou Liangzhi Data Technology Co ltd
Application filed by Hangzhou Liangzhi Data Technology Co ltd
Priority to CN202410762065.3A
Publication of CN118862843A


Abstract

The present invention discloses a method and system for duplicate checking and automatic annotation of scientific and technological project documents, and belongs to the field of natural language processing. To address the problem of rapid duplicate detection for such documents, the method proposes an improved calculation of paragraph similarity and overall document similarity based on the Jaccard similarity algorithm. The algorithm jointly considers the full-text content, the degree of continuity of similar fragments, and the weights of paragraph keywords, which makes the detection result more objective and effective. In addition, the invention provides an automatic duplicate checking system for scientific and technological project documents built on this similarity comparison algorithm. The system also includes document parsing and automatic annotation modules, which help reviewers quickly locate repeated fragments and the corresponding comparison information and thereby improve their work efficiency. Compared with earlier detection by manual reading and comparison, the invention improves the coverage and timeliness of similar-fragment detection, significantly reduces the waiting time for detection results, and improves the efficiency of duplicate checking.

Description

A method and system for checking duplicates and automatically annotating scientific and technological project documents

Technical Field

The present invention belongs to the field of natural language processing, and in particular relates to a similarity comparison and duplicate detection system for science and technology project application documents.

Background Art

As the scale of science and technology projects and their funding has grown significantly, the various science and technology funding systems have continued to mature. Against this background, an urgent problem is how to detect duplicate or near-identical projects quickly and efficiently among a large number of project application documents, so as to avoid duplicate project approval and wasted funds and to ensure that project approval and funding allocation are fair and reasonable.

In the past, because project applications were scattered across regions and relatively few in number, duplicate projects were usually identified by manual reading, browsing materials, and querying and comparing. In recent years, however, the volume of project application documents has grown sharply, and methods that rely solely on manual review or on simple matching of project keywords can no longer meet practical needs.

Therefore, how to check massive numbers of project application documents for duplicates, and automatically annotate the documents with the similar fragments and their sources, is a technical problem that urgently needs to be solved.

Summary of the Invention

The purpose of the present invention is to solve the problem that automatic duplicate checking and annotation of repeated information are difficult to achieve for massive numbers of project application documents, and to provide a method and system for duplicate checking and automatic annotation of scientific and technological project documents.

The specific technical solutions adopted by the present invention are as follows:

In a first aspect, the present invention provides a method for duplicate checking and automatic annotation of scientific and technological project documents, which comprises:

S1. Parse the project document to be detected to obtain the paragraph text of each project document;

S2. Perform word segmentation on each paragraph text in the project document to be detected to obtain the word segmentation features of each paragraph;

S3. Obtain all paragraphs of all historical project documents from the historical project database as the duplicate checking scope, where each historical project document has been processed in advance by the same parsing and word segmentation to obtain the paragraph text and word segmentation features of each of its paragraphs;

S4. Extract keywords from the full text and from each paragraph of the project document to be detected, obtaining the full-text keywords, the paragraph keywords of each paragraph, and the weight of each keyword; then, for each paragraph, combine its paragraph keywords, the full-text keywords, and the keyword weights into the keyword dictionary of that paragraph; traverse each paragraph to be detected in the project document, generate the database search statement required for keyword matching retrieval, and use an inverted index strategy in the database to retrieve, within the duplicate checking scope, the multiple similar paragraphs whose keywords best match each paragraph to be detected; record the historical project document to which each similar paragraph belongs as a similar project document; the similar project documents of all paragraphs to be detected constitute the similar project library;

S5. Pair each paragraph to be detected in the project document with each of its corresponding similar paragraphs, and compute the paragraph similarity of each pair using a Jaccard-based paragraph similarity comparison method improved with context information, which jointly considers the continuity of similar fragments, the paragraph keyword weights, and the full-text keyword weights; then, for each paragraph to be detected, select the most similar paragraph from all of its corresponding similar paragraphs;

S6. Traverse each similar project document in the similar project library, pair the paragraphs of the currently traversed similar project document with the paragraphs to be detected in the project document, and compute the paragraph similarity of each pair using the same context-improved Jaccard paragraph similarity comparison method, obtaining the most similar paragraph of each paragraph to be detected and its maximum paragraph similarity; then, using both the paragraph position weight and the paragraph length weight as weighting information, take the weighted sum of the maximum paragraph similarities of all paragraphs to be detected to obtain the overall document similarity between the currently traversed similar project document and the project document to be detected, and determine the most similar project document of the project document to be detected;

S7. For the project document to be detected, generate visual automatic annotations in the document based on the paragraph-level and document-level similarity comparison results, according to a preset annotation form and annotation content.

As a preferred option of the first aspect, the generated database search statement takes the paragraph keywords of each paragraph to be detected and the word segmentation features of all paragraphs within the duplicate checking scope as the matching objects, and sets data filters according to the screening conditions entered by the user.

As a preferred option of the first aspect, the context-improved Jaccard paragraph similarity comparison method computes the paragraph similarity as follows:

A1. Look up each word in the word segmentation features of the paragraph to be detected, T1, in the comparison paragraph T2; mark the word 1 if it is present and -1 if it is absent, thereby converting the word segmentation features of T1 into a first mark vector;

A2. Divide the first mark vector into a series of continuous segments, where a continuous segment may contain inconsistent vector values as long as they do not exceed a preset length; if a continuous segment consists entirely of 1s, multiply the corresponding values of the first mark vector by a first weight; if the segment contains -1s but their proportion is below a preset threshold, multiply the corresponding values by a second weight; otherwise multiply the corresponding values by a third weight; after the traversal is complete, the first mark vector has been converted into a second mark vector;

A3. Based on the keyword dictionary of the paragraph to be detected T1, weight the first mark vector and add it to the second mark vector to obtain a third mark vector; if a word in the word segmentation features of T1 exists in the keyword dictionary, its weighting value is the keyword weight of that word, otherwise the weighting value is 0;

A4. Take the average of the values in the third mark vector as the paragraph similarity between the paragraph to be detected T1 and the comparison paragraph T2.
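Steps A1 to A4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the segment-splitting rule (runs of matches absorb gaps of at most `max_gap` mismatches), the linear capped weight function `length_weight`, and all numeric constants are hypothetical choices, since the text only requires that the three weights grow with segment length and stay below their caps. Note that the resulting score is not bounded to [0, 1].

```python
from typing import Dict, List, Sequence, Tuple


def mark_vector(t1: Sequence[str], t2: Sequence[str]) -> List[int]:
    """A1: mark each token of T1 with 1 if it occurs in T2, else -1."""
    t2_set = set(t2)
    return [1 if tok in t2_set else -1 for tok in t1]


def split_segments(marks: Sequence[int], max_gap: int = 1) -> List[Tuple[int, int]]:
    """A2 (assumed rule): half-open (start, end) segments. A run of 1s absorbs
    a following run of -1s no longer than max_gap when another run of 1s follows."""
    runs: List[List[int]] = []  # [value, start, end]
    for i, m in enumerate(marks):
        if runs and runs[-1][0] == m:
            runs[-1][2] = i + 1
        else:
            runs.append([m, i, i + 1])
    segments: List[Tuple[int, int]] = []
    k = 0
    while k < len(runs):
        value, start, end = runs[k]
        if value == 1:
            while k + 2 < len(runs) and runs[k + 1][2] - runs[k + 1][1] <= max_gap:
                end = runs[k + 2][2]
                k += 2
        segments.append((start, end))
        k += 1
    return segments


def length_weight(length: int, base: float, step: float, cap: float) -> float:
    """Hypothetical weight: grows with segment length, never exceeds cap."""
    return min(cap, base + step * length)


def paragraph_similarity(t1: Sequence[str], t2: Sequence[str],
                         keyword_weights: Dict[str, float],
                         max_gap: int = 1, miss_ratio: float = 0.3) -> float:
    if not t1:
        return 0.0
    v1 = mark_vector(t1, t2)
    v2 = [float(m) for m in v1]
    for start, end in split_segments(v1, max_gap):
        length = end - start
        misses = v1[start:end].count(-1)
        if misses == 0:                       # first weight: fully matched segment
            w = length_weight(length, 1.0, 0.1, 2.0)
        elif misses / length < miss_ratio:    # second weight: few mismatches
            w = length_weight(length, 1.0, 0.05, 1.5)
        else:                                 # third weight: mostly mismatched
            w = length_weight(length, 0.5, 0.05, 1.0)
        for i in range(start, end):
            v2[i] = v1[i] * w
    # A3: add the keyword-weighted first vector (weight 0 for non-keywords).
    v3 = [v2[i] + keyword_weights.get(tok, 0.0) * v1[i] for i, tok in enumerate(t1)]
    # A4: the mean of the third vector is the paragraph similarity.
    return sum(v3) / len(v3)
```

In this sketch a long, unbroken run of shared tokens scores higher than the same number of scattered matches, and matched keywords add to the score while missing keywords subtract from it.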

As a preferred option of the first aspect, the values of the first weight, the second weight, and the third weight are all positively correlated with the segment length, but none of them exceeds its respective upper limit.

As a preferred option of the first aspect, when the overall document similarity is calculated, the paragraph position weight of each paragraph is the preset weight value of the structural part of the project document in which the paragraph is located, and the paragraph length weight of each paragraph is the logarithm of the paragraph length.
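The weighted aggregation of paragraph scores into a document score can be illustrated with a short Python sketch. Several details are assumptions rather than statements from the source: the position and length weights are combined by multiplication, the weighted sum is normalised by the total weight, `math.log1p` is used instead of a plain logarithm so that very short paragraphs keep a non-zero weight, and the section names and values in `SECTION_WEIGHTS` are invented for illustration.

```python
import math
from typing import Dict, List, NamedTuple


class ScoredParagraph(NamedTuple):
    section: str           # structural part of the document the paragraph sits in
    length: int            # paragraph length, e.g. in tokens
    max_similarity: float  # best paragraph similarity against the compared document


# Hypothetical preset position weights per structural part.
SECTION_WEIGHTS: Dict[str, float] = {"research_content": 1.0, "background": 0.5}


def document_similarity(paragraphs: List[ScoredParagraph],
                        section_weights: Dict[str, float],
                        default_weight: float = 0.5) -> float:
    """Weighted average of per-paragraph maximum similarities."""
    weighted_sum = 0.0
    total_weight = 0.0
    for p in paragraphs:
        position_w = section_weights.get(p.section, default_weight)
        length_w = math.log1p(p.length)  # logarithmic paragraph length weight
        w = position_w * length_w
        weighted_sum += w * p.max_similarity
        total_weight += w
    return weighted_sum / total_weight if total_weight else 0.0
```

The logarithm dampens the influence of very long paragraphs, so one lengthy boilerplate section does not dominate the document score.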

As a preferred option of the first aspect, when visual automatic annotations are generated in the document, the text that each paragraph shares with its most similar paragraph is marked, and an annotation identifies the historical project document to which the most similar paragraph belongs.

In a second aspect, the present invention provides a system for duplicate checking and automatic annotation of scientific and technological project documents, which comprises:

a document parsing module, configured to parse the project document to be detected and obtain the paragraph text of each project document;

a paragraph feature extraction module, configured to perform word segmentation on each paragraph text in the project document to be detected and obtain the word segmentation features of each paragraph;

a historical project management module, configured to obtain all paragraphs of all historical project documents from the historical project database as the duplicate checking scope, where each historical project document has been processed in advance by the same parsing and word segmentation to obtain the paragraph text and word segmentation features of each of its paragraphs;

a historical similar fragment retrieval and coarse screening module, configured to extract keywords from the full text and from each paragraph of the project document to be detected, obtaining the full-text keywords, the paragraph keywords of each paragraph, and the weight of each keyword; then, for each paragraph, to combine its paragraph keywords, the full-text keywords, and the keyword weights into the keyword dictionary of that paragraph; to traverse each paragraph to be detected in the project document, generate the database search statement required for keyword matching retrieval, and use an inverted index strategy in the database to retrieve, within the duplicate checking scope, the multiple similar paragraphs whose keywords best match each paragraph to be detected; and to record the historical project document to which each similar paragraph belongs as a similar project document, where the similar project documents of all paragraphs to be detected constitute the similar project library;

a paragraph similarity calculation module, configured to pair each paragraph to be detected in the project document with each of its corresponding similar paragraphs, and to compute the paragraph similarity of each pair using a Jaccard-based paragraph similarity comparison method improved with context information, which jointly considers the continuity of similar fragments, the paragraph keyword weights, and the full-text keyword weights; and then, for each paragraph to be detected, to select the most similar paragraph from all of its corresponding similar paragraphs;

a document similarity calculation module, configured to traverse each similar project document in the similar project library, pair the paragraphs of the currently traversed similar project document with the paragraphs to be detected in the project document, and compute the paragraph similarity of each pair using the same context-improved Jaccard paragraph similarity comparison method, obtaining the most similar paragraph of each paragraph to be detected and its maximum paragraph similarity; and then, using both the paragraph position weight and the paragraph length weight as weighting information, to take the weighted sum of the maximum paragraph similarities of all paragraphs to be detected to obtain the overall document similarity between the currently traversed similar project document and the project document to be detected, and to determine the most similar project document of the project document to be detected;

a duplicate detection automatic annotation module, configured to generate, for the project document to be detected, visual automatic annotations in the document based on the paragraph-level and document-level similarity comparison results, according to a preset annotation form and annotation content.

As a preferred option of the second aspect, the repeated text is marked by font color or by highlighting.

In a third aspect, the present invention provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the method for duplicate checking and automatic annotation of scientific and technological project documents according to any solution of the first aspect.

In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for duplicate checking and automatic annotation of scientific and technological project documents according to any solution of the first aspect.

In a fifth aspect, the present invention provides a computer electronic device comprising a memory and a processor;

the memory is configured to store a computer program;

the processor is configured to implement, when executing the computer program, the method for duplicate checking and automatic annotation of scientific and technological project documents according to any solution of the first aspect.

Compared with the prior art, the present invention has the following beneficial effects:

1. For the problem of rapid duplicate detection of scientific and technological project documents, the present invention proposes an improved calculation of paragraph similarity and overall document similarity based on the Jaccard similarity algorithm. The algorithm jointly considers the full-text content, the degree of continuity of similar fragments, and the weights of paragraph keywords, which makes the detection result more objective and effective;

2. Based on the similarity comparison algorithm, the present invention provides an automatic duplicate checking system for scientific and technological project documents. Compared with earlier detection by manual reading and comparison, it improves the coverage and timeliness of similar-fragment detection, significantly reduces the waiting time for detection results, and improves the efficiency of duplicate checking;

3. The automatic duplicate checking system provided by the present invention also includes document parsing and automatic annotation modules, which help reviewers quickly locate repeated fragments and the corresponding comparison information and thereby improve their work efficiency.

Brief Description of the Drawings

FIG. 1 is a flowchart of the steps of a method for duplicate checking and automatic annotation of scientific and technological project documents;

FIG. 2 is a schematic diagram of the module composition of a system for duplicate checking and automatic annotation of scientific and technological project documents;

FIG. 3 is a schematic diagram of the process relationships between the modules of a system for duplicate checking and automatic annotation of scientific and technological project documents.

Detailed Description

To make the above purposes, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. Many specific details are set out in the following description to support a full understanding of the invention. The invention can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar improvements without departing from its substance; the invention is therefore not limited to the specific embodiments disclosed below. The technical features of the various embodiments may be combined with one another provided they do not conflict.

In the description of the present invention, it should be understood that when an element is said to be "connected" to another element, it may be connected directly to the other element or connected indirectly through intermediate elements. Conversely, when an element is said to be "directly" connected to another element, no intermediate elements are present.

In the description of the present invention, it should be understood that the terms "first" and "second" are used only to distinguish between items and should not be read as indicating or implying relative importance or as implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature.

A preferred embodiment of the present invention provides a method for duplicate checking and automatic annotation of scientific and technological project documents, whose specific steps comprise S1 to S7.

S1. Parse the project document to be detected to obtain the paragraph text of each project document.

It should be noted that the specific document parsing method depends on the actual project documents. In an embodiment of the present invention, the project document to be detected may be a newly uploaded project document or an already uploaded project document specified by the user, and the project document is preferably in Word format. During parsing, the project application document may first be converted to a unified format; the document structure is then identified, and all paragraph texts of the project application document are obtained through text parsing. Content unrelated to the key content of the project, such as tables of contents, references, pictures, and project template text, is identified and removed from the paragraph texts. The result is the metadata content of each paragraph of the project text, which is used for subsequent duplicate detection.

S2. Perform word segmentation on each paragraph text in the project document to be detected to obtain the word segmentation features of each paragraph.

It should be noted that paragraph feature extraction obtains the word segmentation features of each paragraph by segmenting the paragraph text on the basis of the paragraph metadata content. Word segmentation of text is prior art in the field of natural language processing. Scientific and technological project documents contain a large amount of specialized vocabulary with strong domain characteristics, so to handle domain vocabulary properly during duplicate checking, word segmentation may be combined with a domain terminology library. Such a library can be built by collecting specialized Chinese and English terms for each field in advance to form a custom domain terminology library. In addition, text cleaning may be performed before word segmentation, for example filtering out stop words and special symbols such as punctuation marks, to further improve accuracy. The word segmentation features extracted in the present invention are the word (token) sequence obtained after segmenting each paragraph text.
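As an illustration of segmentation guided by a domain terminology library, the following Python sketch uses forward maximum matching against a set of domain terms, with stop-word and punctuation filtering. It is a simplified stand-in, not the segmenter used by the invention: falling back to single characters for unmatched Chinese text is an assumption made to keep the example self-contained, and the function name `tokenize` and the sample terms are hypothetical. A production system would normally rely on a full word segmenter loaded with a custom dictionary.

```python
import re
from typing import List, Set

_ASCII_RUN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text: str, domain_terms: Set[str], stop_words: Set[str]) -> List[str]:
    """Forward maximum matching against domain_terms, dropping punctuation and stop words."""
    max_len = max((len(term) for term in domain_terms), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if not ch.isalnum():  # skip punctuation, whitespace, and other symbols
            i += 1
            continue
        match = None
        # Try the longest candidate first so domain terms are kept whole.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in domain_terms:
                match = candidate
                break
        if match is None:
            ascii_run = _ASCII_RUN.match(text, i)
            # Keep ASCII words and numbers whole; fall back to one character otherwise.
            match = ascii_run.group(0) if ascii_run else ch
        i += len(match)
        token = match.lower()
        if token not in stop_words:
            tokens.append(token)
    return tokens
```

For example, `tokenize("基于深度学习的图像识别", {"深度学习", "图像识别"}, {"的"})` keeps both domain terms as single tokens and drops the stop word.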

S3. Obtain all paragraphs of all historical project documents from the historical project database as the duplicate checking scope, where each historical project document has been processed in advance by the same parsing and word segmentation to obtain the paragraph text and word segmentation features of each of its paragraphs.

It should be noted that the historical project database must be built in advance by collecting all historical project document information, so that it can be read directly at application time. Each collected historical project document is parsed and has its features extracted as described in steps S1 and S2, producing the paragraph text and paragraph features of each historical project, which are stored in advance in an unstructured database to form the historical project database.

S4. Extract keywords from the full text and from each paragraph of the project document to be detected, obtaining the full-text keywords, the paragraph keywords of each paragraph, and the weight of each keyword; then, for each paragraph, combine its paragraph keywords, the full-text keywords, and the keyword weights into the keyword dictionary of that paragraph; traverse each paragraph to be detected in the project document, generate the database search statement required for keyword matching retrieval, and use an inverted index strategy in the database to retrieve, within the duplicate checking scope, the multiple similar paragraphs whose keywords best match each paragraph to be detected; record the historical project document to which each similar paragraph belongs as a similar project document; the similar project documents of all paragraphs to be detected constitute the similar project library.

It should be noted that any keyword extraction algorithm may be used for the full text and for each paragraph of the project document to be detected, as long as it yields the keywords and the weight of each keyword. In an embodiment of the present invention, the method and system disclosed in the Chinese invention patent with application number CN202110600989.X, entitled "Chinese key phrase extraction method and system in the field of scientific and technological innovation using semantic features", may be used to extract the keywords and keyword weights.

The present invention treats every paragraph of the project document to be detected as a paragraph to be detected, and a keyword dictionary is built for each one. The keyword dictionary consists of the full-text keywords with their weights and the paragraph keywords of the current paragraph with their weights. The number of keywords can be tuned as needed. In an embodiment of the present invention, the keyword extraction algorithm extracts the Top20 full-text keywords with their weights and the Top10 keywords of each paragraph to be detected with their weights, which together form the keyword dictionary keywords_dict of that paragraph. The keyword dictionary serves as the fused feature of the paragraph to be detected.
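A possible way to assemble keywords_dict is sketched below in Python. The Top20 and Top10 cut-offs come from the embodiment; everything else is an assumption for illustration: the function name `build_keyword_dict`, the input format (plain word-to-weight mappings produced by some keyword extractor), and the rule that a word appearing in both lists keeps the larger of its two weights, which the source does not specify.

```python
from typing import Dict, List, Tuple


def _top_n(weights: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n highest-weighted (word, weight) pairs."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]


def build_keyword_dict(full_text_keywords: Dict[str, float],
                       paragraph_keywords: Dict[str, float],
                       top_full: int = 20,
                       top_paragraph: int = 10) -> Dict[str, float]:
    """Merge the top full-text and top paragraph keywords into one weight dictionary."""
    keywords_dict = dict(_top_n(full_text_keywords, top_full))
    for word, weight in _top_n(paragraph_keywords, top_paragraph):
        # Assumed merge rule: keep the larger weight when a word is in both lists.
        keywords_dict[word] = max(weight, keywords_dict.get(word, 0.0))
    return keywords_dict
```

Merging the full-text keywords into every paragraph dictionary lets a paragraph that mentions the document's overall topic terms score higher, even if those terms are not among its own top keywords.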

It should also be noted that duplicate detection of similar paragraphs uses all paragraphs of all historical project documents in the historical project database as the duplicate-check range. For each paragraph to be detected, a database query statement is therefore built against all paragraphs in that range, and the similarity between two paragraphs is judged by how well the keywords of the paragraph to be detected match the compared paragraph. The inverted index strategy is then used to determine the TopK similar paragraphs with the highest keyword matching degree.
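The coarse retrieval idea can be illustrated with a small in-memory inverted index. This is a sketch under stated assumptions, not the database mechanism of the embodiment: the names build_inverted_index and top_k_similar are hypothetical, and scoring a candidate by the summed weight of its matched keywords is an assumed rule.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_inverted_index(paragraphs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each word to the ids of the historical paragraphs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for para_id, words in paragraphs.items():
        for word in words:
            index[word].add(para_id)
    return index


def top_k_similar(keywords: Dict[str, float],
                  index: Dict[str, Set[str]],
                  k: int = 3) -> List[Tuple[str, float]]:
    """Return the k paragraph ids with the highest summed matched-keyword weight."""
    scores: Dict[str, float] = defaultdict(float)
    for word, weight in keywords.items():
        # Only paragraphs that contain the keyword are touched.
        for para_id in index.get(word, ()):
            scores[para_id] += weight
    # Sort by descending score, breaking ties by paragraph id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

Because only the posting lists of the query keywords are visited, paragraphs that share no keyword with the paragraph to be detected are never scored.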

Therefore, the database query statement for each paragraph is generated automatically from the keywords extracted from the paragraph to be detected, the word segmentation features of the compared paragraphs, and the filter conditions entered in the user's detection request. Constructing database query statements is prior art. In an embodiment of the present invention, the generated statement matches the paragraph keywords of each paragraph to be detected against the word segmentation features of all paragraphs in the duplicate-check range, and sets data filters according to the user's filter conditions, which include comparison form, year range, level, report type, and so on. For example, if the user restricts the comparison to projects of the scientific and technological report type in Province A, the generated query statement query is:

{
  "query": {
    "bool": {
      "must": {"match": {"para": "w1 w2 w3...wn"}},
      "filter": [
        {"term": {"province": "Province A"}},
        {"term": {"project_type": "Scientific and technological report"}}
      ]
    }
  }
}
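A query statement of the above shape could be assembled programmatically as sketched below. The function name build_query is hypothetical, and the field names para, province and project_type are simply taken from the example above. The sketch only builds the dictionary and does not send it to any search engine.

```python
import json
from typing import Dict, List


def build_query(keywords: List[str], filters: Dict[str, str]) -> dict:
    """Build a bool query: keyword match on `para` plus exact-match filters."""
    return {
        "query": {
            "bool": {
                # The paragraph keywords w1..wn are joined into one match string.
                "must": {"match": {"para": " ".join(keywords)}},
                # Each user filter condition becomes one term filter.
                "filter": [{"term": {field: value}} for field, value in filters.items()],
            }
        }
    }


if __name__ == "__main__":
    query = build_query(
        ["w1", "w2", "w3"],
        {"province": "Province A", "project_type": "Scientific and technological report"},
    )
    print(json.dumps(query, ensure_ascii=False, indent=2))
```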

S5. Pair each paragraph to be detected in the project document with each of its corresponding similar paragraphs. Using the context-improved Jaccard paragraph similarity comparison method, which jointly considers the continuity of similar segments, the paragraph keyword weights, and the full-text keyword weights, calculate the paragraph similarity of each pair. Then, for each paragraph to be detected, select the most similar paragraph from all of its similar paragraphs.

It should be noted that the context-improved Jaccard paragraph similarity comparison method proposed above can be used to calculate the similarity between any two paragraphs. Because it jointly considers the continuity of similar segments, the paragraph keyword weights, and the full-text keyword weights, the detection results are more objective and effective. The method can be implemented in steps A1 to A4 as follows:

A1. Look up each word in the word segmentation feature of the paragraph to be detected T1 in the comparison paragraph T2. Mark the word 1 if it is present and -1 if it is absent, thereby converting the word segmentation feature of T1 into the first marker vector.

It should be noted that the first marker vector records whether each word in the word segmentation feature of T1 occurs in the comparison paragraph T2. Suppose T1 is a sequence of N words, as in formula (1), and each word w counts 1 if it occurs in T2 and -1 if it does not. The word segmentation feature of T1 is then converted into the quantifiable first marker vector V1, as in formula (2):

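$$T_1 = w_1 w_2 w_3 w_4 \dots w_N \tag{1}$$

$$V_1 = [v_1, v_2, v_3, \dots, v_N], \quad v_i = \begin{cases} 1, & w_i \in T_2 \\ -1, & w_i \notin T_2 \end{cases} \tag{2}$$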
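Step A1 can be sketched in a few lines of Python. The function name first_marker_vector is hypothetical, and both paragraphs are assumed to be already segmented into word lists.

```python
from typing import List


def first_marker_vector(t1_words: List[str], t2_words: List[str]) -> List[int]:
    """Mark each word of T1 with 1 if it occurs in T2 and -1 otherwise."""
    t2_vocab = set(t2_words)  # set membership keeps each lookup O(1)
    return [1 if word in t2_vocab else -1 for word in t1_words]
```

The vector keeps the word order of T1, which is what allows the later steps to reason about continuous segments.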

A2. Divide the first marker vector into a series of continuous segments, allowing a run of inconsistent vector values no longer than a preset length inside a segment. If a segment consists entirely of 1s, multiply the corresponding values of the first marker vector by the first weight. If the segment contains -1s but their proportion is below a preset threshold, multiply the corresponding values by the second weight. Otherwise, multiply them by the third weight. After the traversal, the first marker vector has been converted into the second marker vector.

It should be noted that dividing the first marker vector into continuous segments is in essence dividing the word segmentation feature of the paragraph to be detected T1 into continuous segments, so the first, second, and third weights represent three degrees of content repetition for a segment. The division does not mechanically cut wherever the vector value changes. Instead, a run of inconsistent values within a certain length is tolerated inside a run of otherwise identical values. For example, suppose the preset length of tolerated inconsistent values is 2 and the first marker vector is [1,1,1,1,-1,1,1,-1,-1,-1,1]. The first continuous segment is [1,1,1,1,-1,1,1]: it contains a single -1, a run of length 1 that does not exceed the preset length 2, so the segment is still regarded as continuous. The following run of three -1 values exceeds the preset length 2, so the first segment ends before these three values and the second continuous segment starts from them.

In the present invention, the first, second, and third weights correspond to three cases. Case (1): all characters in the segment are consistent. Case (2): most characters in the segment are consistent, with one or two inconsistent words. Case (3): most characters in the segment are inconsistent, with one or two consistent words. The plagiarism weight of case (1) is clearly greater than that of case (2), and case (3) can be regarded as not plagiarized. The first weight α, second weight β, and third weight γ corresponding to cases (1), (2), and (3) can therefore decrease in that order. In theory, the longer the segment, the higher the similarity weight, so all three weights can be positively correlated with segment length. To keep the final result within range, however, each weight needs an upper limit: α, β, and γ must not exceed their respective thresholds. In an embodiment of the present invention, the thresholds of α, β, and γ can be set to 1.12, 1.08, and 0.3 respectively. For each case, several weight values no greater than the corresponding threshold can be set, each assigned to a segment length range; when the length of a segment falls in a given range, the weight for that range is used. For example, the overall span of segment lengths can be divided into several ranges, and α can take a corresponding number of values up to 1.12, one per range, with larger weights for longer segments; β and γ are handled in the same way.

Thus, once the weight of every value in the first marker vector V1 has been determined, each value is multiplied by its weight to obtain the second marker vector V1′, as shown in formula (3):

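$$V_1' = [v_1', v_2', v_3', \dots, v_N'], \quad \text{where } v_i' = \begin{cases} \alpha \, v_i, & \text{the segment containing } v_i \text{ is in case (1)} \\ \beta \, v_i, & \text{the segment containing } v_i \text{ is in case (2)} \\ \gamma \, v_i, & \text{the segment containing } v_i \text{ is in case (3)} \end{cases} \tag{3}$$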
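Step A2 can be sketched as follows, assuming one plausible reading of the segmentation rule: a segment takes the value of its first run as its dominant value and absorbs runs of the other value whose length does not exceed the preset length. The lookup tables ALPHA, BETA and GAMMA, their length ranges, and the mismatch threshold 0.3 are hypothetical values chosen only to respect the upper limits 1.12, 1.08 and 0.3 given in the embodiment.

```python
from typing import List, Sequence, Tuple

# Hypothetical tables: (exclusive upper bound on segment length, weight).
ALPHA = [(5, 1.03), (10, 1.06), (20, 1.09), (float("inf"), 1.12)]
BETA = [(5, 1.00), (10, 1.03), (20, 1.06), (float("inf"), 1.08)]
GAMMA = [(5, 0.10), (10, 0.20), (20, 0.25), (float("inf"), 0.30)]


def split_segments(v1: Sequence[int], max_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of the continuous segments."""
    segments = []
    i, n = 0, len(v1)
    while i < n:
        start, dominant = i, v1[i]
        while i < n:
            j = i
            while j < n and v1[j] == v1[i]:
                j += 1  # j marks the end of the current run of equal values
            if v1[i] != dominant and j - i > max_gap:
                break  # inconsistent run is too long: the segment ends before it
            i = j
        segments.append((start, i))
    return segments


def pick_weight(table, length: int) -> float:
    """Select the weight whose length range contains `length`."""
    return next(weight for bound, weight in table if length < bound)


def second_marker_vector(v1: Sequence[int], max_gap: int = 2,
                         mismatch_threshold: float = 0.3) -> List[float]:
    """Scale every value of V1 by the weight of the segment it belongs to."""
    v2: List[float] = []
    for start, end in split_segments(v1, max_gap):
        segment = list(v1[start:end])
        mismatch_ratio = segment.count(-1) / len(segment)
        if mismatch_ratio == 0:
            table = ALPHA   # case (1): every word matches
        elif mismatch_ratio < mismatch_threshold:
            table = BETA    # case (2): mostly matching
        else:
            table = GAMMA   # case (3): mostly not matching
        weight = pick_weight(table, len(segment))
        v2.extend(weight * value for value in segment)
    return v2
```

For the vector [1,1,1,1,-1,1,1,-1,-1,-1,1] with a preset length of 2, split_segments yields the segments (0, 7) and (7, 11), so the first seven values form one segment and the remaining four form the second.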

A3. Based on the keyword dictionary of the paragraph to be detected T1, weight the first marker vector and add it to the second marker vector to obtain the third marker vector. If a word in the word segmentation feature of T1 is present in the keyword dictionary, its weighting value is the keyword weight of that word; otherwise the weighting value is 0.

The weighted fusion of the first and second marker vectors can be expressed by formula (4):

$$V_1'' = [v_1'', v_2'', v_3'', \dots, v_N''], \quad v_i'' = v_i' + kv_i \cdot v_i \tag{4}$$

where kv_i denotes the keyword weight in the keyword table keywords_dict of the word corresponding to v_i, and kv_i = 0 if that word is not in the table.
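Step A3 can be sketched as below. The function name third_marker_vector is hypothetical. Its inputs are the segmented words of T1, the first and second marker vectors, and keywords_dict.

```python
from typing import Dict, List, Sequence


def third_marker_vector(t1_words: Sequence[str],
                        v1: Sequence[float],
                        v2: Sequence[float],
                        keywords_dict: Dict[str, float]) -> List[float]:
    """Add the keyword-weighted first marker vector to the second marker vector."""
    return [
        # Words missing from keywords_dict get a weighting value of 0.
        v2_i + keywords_dict.get(word, 0.0) * v1_i
        for word, v1_i, v2_i in zip(t1_words, v1, v2)
    ]
```

A matched keyword therefore raises its entry, while a keyword that is missing from the comparison paragraph lowers it by the same weight.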

A4. Take the average of the values in the third marker vector as the paragraph similarity between the paragraph to be detected T1 and the comparison paragraph T2.

The calculation of the paragraph similarity can be expressed by formula (5), where v″_i denotes the i-th value of the third marker vector:

$$Sim(T_1, T_2) = \frac{1}{N} \sum_{i=1}^{N} v_i'' \tag{5}$$

Thus, in step S5, if a paragraph to be detected has K similar paragraphs, the paragraph similarity between the paragraph to be detected T1 and the k-th similar paragraph can be calculated according to steps A1 to A4 and is denoted Sim_k. The similar paragraph with the highest similarity is taken as the most similar paragraph, and its paragraph similarity is denoted Sim_final. This calculation can be expressed by formula (6):

$$Sim_{final} = \max_{1 \le k \le K} Sim_k \tag{6}$$
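Step A4 and the selection in formula (6) can be sketched as follows. The function names are hypothetical, and returning 0.0 for an empty paragraph is an assumption of this sketch.

```python
from typing import Dict, Sequence, Tuple


def paragraph_similarity(v3: Sequence[float]) -> float:
    """Average the third marker vector to obtain the paragraph similarity."""
    # Assumption: an empty paragraph is given similarity 0.0.
    return sum(v3) / len(v3) if v3 else 0.0


def most_similar(similarities: Dict[str, float]) -> Tuple[str, float]:
    """Return the id and similarity of the most similar paragraph (Sim_final)."""
    return max(similarities.items(), key=lambda item: item[1])
```

Here similarities maps the id of each of the K similar paragraphs to its Sim_k.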

S6. Traverse each similar project document in the similar project library. Pair the paragraphs of the currently traversed similar project document with the paragraphs to be detected in the project document to be detected, and calculate the paragraph similarity of each pair using the context-improved Jaccard paragraph similarity comparison method, obtaining the most similar paragraph and the maximum paragraph similarity of each paragraph to be detected. Then, using the paragraph position weight and the paragraph length weight together as weighting information, compute the weighted sum of the maximum paragraph similarities of all paragraphs to be detected to obtain the overall document similarity between the currently traversed similar project document and the project document to be detected, and determine the most similar project document of the project document to be detected.

It should be noted that step S6 also uses the context-improved Jaccard paragraph similarity comparison method of step S5 when calculating the overall document similarity. The difference is that S5 takes all paragraphs of all historical project documents in the historical project database as the duplicate-check range, whereas S6 takes only the paragraphs of one similar project document at a time. That is, if the similar project library contains M similar project documents, then for each similar project document Xi in turn, all P paragraph texts of Xi are used as the new duplicate-check range. Each paragraph of the project document to be detected is traversed and paired with every paragraph in the new range, and the paragraph similarity of each pair is calculated with the method of steps A1 to A4. This yields, for each paragraph to be detected, its most similar paragraph within the new range and its maximum paragraph similarity. Finally, the maximum paragraph similarities of all paragraphs to be detected are combined in a weighted sum.

It should be noted that, when calculating the overall document similarity, the maximum paragraph similarities of all paragraphs are further weighted by paragraph position and paragraph length. That is, the paragraph position weight and the paragraph length weight are used together as weighting information for the weighted sum of the maximum paragraph similarities of all paragraphs to be detected. The paragraph position weight is based on heuristic information such as the distribution of paragraphs in project application documents: based on experience, the present invention can raise the weight of paragraphs in the middle part of the project document and lower the weight of paragraphs at the beginning and end. The paragraph length weight is obtained by taking the logarithm of the actual paragraph length, so longer paragraphs receive higher weights. Accordingly, in an embodiment of the present invention, the position weight of a paragraph can be the preset weight value of the structural part of the project document in which the paragraph is located, and the length weight of a paragraph can be the logarithm of its length; based on the characteristics of the project document data, this embodiment uses log200 for the calculation. The final overall document similarity Doc_sim is calculated as shown in formula (7):

where P is the total number of paragraphs extracted from the project document to be detected; the similarity term for index i is the maximum paragraph similarity of the i-th paragraph of the project document to be detected; pos(i) is the position weight of the i-th paragraph in the project document, adjusted according to empirical values; and len_i is the length of the i-th paragraph. For Chinese text the paragraph length is counted in characters, and for English text it can be counted in words.
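The weighted combination of step S6 can be sketched as below. This is only one possible reading of the description: multiplying each maximum paragraph similarity by pos(i) and by the logarithm of len_i, interpreting log200 as a logarithm to base 200, and dividing the sum by P are all assumptions of this sketch, and the function name document_similarity is hypothetical.

```python
import math
from typing import Sequence


def document_similarity(max_sims: Sequence[float],
                        position_weights: Sequence[float],
                        lengths: Sequence[int],
                        log_base: float = 200.0) -> float:
    """Combine per-paragraph maximum similarities into one document-level score."""
    if not max_sims:
        return 0.0
    total = sum(
        # Assumed form: similarity x position weight x log-scaled length weight.
        sim * pos * math.log(length, log_base)
        for sim, pos, length in zip(max_sims, position_weights, lengths)
    )
    # Assumption: normalise by the number of paragraphs P.
    return total / len(max_sims)
```

With base 200, a paragraph of 200 characters has a length weight of 1, shorter paragraphs are weighted below 1, and longer paragraphs above 1. Paragraph lengths must be positive.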

After the overall document similarity Doc_sim has been calculated for every similar project document in the similar project library, the project document with the highest overall document similarity is taken as the most similar project document of the project document to be detected.

S7. For the project document to be detected, generate visual automatic annotations in the document based on the paragraph-level and document-level similarity comparison results, according to a preset annotation form and annotation content.

It should be noted that the annotation form and annotation content can be customized as needed. In an embodiment of the present invention, when the visual automatic annotations are generated, the text that each paragraph repeats from its most similar paragraph can be marked, and an annotation can indicate the historical project document to which the most similar paragraph belongs; the repeated text can be marked by font color, highlighting, or similar means. For example, in one exemplary approach, a document generation tool uses the results of the preceding steps to generate a red-marked annotated document automatically, highlighting the detected repeated text and annotating the corresponding repeated fragments of the historical project. In particular, red marking can follow the cases determined in step S5: in cases (1) and (2) the whole word segment is marked red, while in case (3) only the repeated words are marked red. The annotation content also includes the project document number or name of the comparison paragraph to assist reviewers. In addition, the number or name of the most similar project document can be annotated for the document as a whole.

It should be noted that the method steps S1 to S7 can in essence be implemented as a computer program, and each step can be built as a functional module in program code.

Accordingly, based on the same inventive concept and as shown in FIG. 2, the present invention further provides a duplicate-checking and automatic annotation system for scientific and technological project documents corresponding to the method provided in the above embodiment, which comprises:

a document parsing module, used to parse the project document to be detected and obtain the paragraph texts of the project document;

a paragraph feature extraction module, used to perform word segmentation on each paragraph text of the project document to be detected and obtain the word segmentation feature of each paragraph;

a historical project management module, used to obtain all paragraphs of all historical project documents from the historical project database as the duplicate-check range, each historical project document having been processed in advance by the parsing and word segmentation described above to obtain the paragraph text and word segmentation feature of each paragraph;

a historical similar fragment retrieval coarse screening module, used to extract keywords from the full text and from each paragraph of the project document to be detected, obtaining the full-text keywords, the paragraph keywords of each paragraph, and the weight of each keyword; to form, for each paragraph, the keyword dictionary of that paragraph from its paragraph keywords, the full-text keywords, and their weights; to traverse each paragraph to be detected, generate the database query statement required for keyword-matching retrieval, and retrieve from the database through an inverted index strategy the several similar paragraphs within the duplicate-check range whose keywords best match each paragraph to be detected; and to record the historical project document to which each similar paragraph belongs as a similar project document, the similar project documents of all paragraphs to be detected forming the similar project library;

a paragraph similarity calculation module, used to pair each paragraph to be detected in the project document with each of its corresponding similar paragraphs; to calculate the paragraph similarity of each pair using the context-improved Jaccard paragraph similarity comparison method, jointly considering the continuity of similar segments, the paragraph keyword weights, and the full-text keyword weights; and then to select, for each paragraph to be detected, the most similar paragraph from all of its similar paragraphs;

a document similarity calculation module, used to traverse each similar project document in the similar project library; to pair the paragraphs of the currently traversed similar project document with the paragraphs to be detected and calculate the paragraph similarity of each pair using the context-improved Jaccard paragraph similarity comparison method, obtaining the most similar paragraph and the maximum paragraph similarity of each paragraph to be detected; to compute, with the paragraph position weight and the paragraph length weight together as weighting information, the weighted sum of the maximum paragraph similarities of all paragraphs to be detected, obtaining the overall document similarity between the currently traversed similar project document and the project document to be detected; and to determine the most similar project document of the project document to be detected;

a duplicate-detection automatic annotation module, used to generate visual automatic annotations in the project document to be detected, based on the paragraph-level and document-level similarity comparison results and according to a preset annotation form and annotation content.

The specific implementation of each functional module is similar to the duplicate-checking and automatic annotation method of the foregoing embodiment and is not repeated here. The process relationship among the modules of the system is shown in FIG. 3.

In addition, based on the same inventive concept, the present invention further provides a computer electronic device corresponding to the duplicate-checking and automatic annotation method for scientific and technological project documents provided in the above embodiment, which comprises a memory and a processor;

the memory is used to store a computer program;

the processor is used to implement the aforementioned duplicate-checking and automatic annotation method for scientific and technological project documents when executing the computer program.

Furthermore, when the logic instructions in the memory are implemented as software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention.

Accordingly, based on the same inventive concept, the present invention provides a computer-readable storage medium corresponding to the duplicate-checking and automatic annotation method for scientific and technological project documents. A computer program is stored on the storage medium, and when the computer program is executed by a processor, it implements the aforementioned method.

Accordingly, based on the same inventive concept, the present invention provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the aforementioned duplicate-checking and automatic annotation method for scientific and technological project documents.

Specifically, in the computer-readable storage medium of the above three embodiments, the stored computer program, when executed by the processor, performs the aforementioned steps S1 to S7.

It can be understood that the storage medium may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage. The storage medium may also be a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, or any other medium that can store program code.

It can be understood that the processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

It should also be noted that those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the system described above can be found in the corresponding process of the foregoing method embodiment and is not repeated here. In the embodiments provided in this application, the division of the system and method into steps or modules is only a division by logical function. Other divisions are possible in an actual implementation: for example, several modules or steps may be combined or integrated, and a single module or step may be split.

The embodiments described above are only a preferred solution of the present invention and are not intended to limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, any technical solution obtained by equivalent replacement or equivalent transformation falls within the protection scope of the present invention.

Claims (10)

S4, extracting keywords from the full text and from each paragraph of the project document to be detected to obtain full-text keywords, paragraph keywords of each paragraph, and the weight of each keyword; then, for each paragraph, forming the keyword dictionary corresponding to that paragraph from its paragraph keywords, the full-text keywords, and the weights of the keywords; traversing each paragraph to be detected in the project document to be detected, generating the database query statement required for keyword-matching retrieval, retrieving from the database through an inverted index strategy a plurality of similar paragraphs within the duplicate-check range that have the highest keyword matching degree with each paragraph to be detected, and recording the historical project document to which each similar paragraph belongs as a similar project document; the similar project documents of all paragraphs to be detected forming a similar project library;
S6, traversing each similar project document in the similar project library, pairing each paragraph of the currently traversed similar project document with each paragraph to be detected in the project document to be detected, and calculating the paragraph similarity of each pair using a Jaccard-based paragraph similarity comparison method improved with context information, to obtain the most similar paragraph and the maximum paragraph similarity of each paragraph to be detected; using the paragraph position weight and the paragraph length weight together as weighting information, computing a weighted sum of the maximum paragraph similarities of all paragraphs to be detected to obtain the overall document similarity between the currently traversed similar project document and the project document to be detected, and determining the most similar project document of the project document to be detected;
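The aggregation in step S6 can be illustrated with a short sketch. Several things here are assumptions rather than the claimed method: plain token-set Jaccard stands in for the context-improved paragraph similarity, the length weight is taken as the paragraph's token count, the position weight is a caller-supplied list defaulting to 1.0, and the two weights are combined by multiplication and normalized to sum to 1. The function names are hypothetical.

```python
def jaccard(a, b):
    """Plain token-set Jaccard similarity (stand-in for the improved method)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def best_matches(query_paras, candidate_paras):
    """For each paragraph to be detected: (index of most similar candidate, similarity)."""
    result = []
    for q in query_paras:
        sims = [jaccard(q, c) for c in candidate_paras]
        best = max(range(len(sims)), key=sims.__getitem__)
        result.append((best, sims[best]))
    return result

def document_similarity(query_paras, candidate_paras, position_weights=None):
    """Weighted sum of the maximum paragraph similarities (weights are assumptions)."""
    if not query_paras or not candidate_paras:
        return 0.0
    if position_weights is None:
        position_weights = [1.0] * len(query_paras)
    lengths = [len(p.split()) for p in query_paras]
    raw = [pw * ln for pw, ln in zip(position_weights, lengths)]
    total = sum(raw) or 1.0
    matches = best_matches(query_paras, candidate_paras)
    return sum(w / total * sim for w, (_, sim) in zip(raw, matches))

def most_similar_document(query_paras, library):
    """library: {doc_id: [paragraph, ...]} -> (doc_id, overall similarity)."""
    scored = {d: document_similarity(query_paras, ps) for d, ps in library.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

query = ["a b c d", "x y"]
library = {"doc1": ["a b c d", "p q"], "doc2": ["m n", "o p"]}
best_doc, best_score = most_similar_document(query, library)
```

With these toy inputs the first paragraph matches `doc1` exactly and carries two thirds of the weight because it is twice as long as the second, so the overall similarity to `doc1` is 2/3.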
A2, dividing the first marking vector into a series of continuous segments, wherein a continuous segment is allowed to contain runs of inconsistent vector values that do not exceed a preset length; for each continuous segment, multiplying the corresponding vector values in the first marking vector by a first weight value if the segment consists of 1s, by a second weight value if the segment contains -1 values but their proportion is smaller than a preset threshold, and by a third weight value if the segment does not consist of 1s; after the traversal is finished, the first marking vector has been converted into a second marking vector;
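One possible reading of step A2 is sketched below. It rests on assumptions that the claim text does not settle: the first marking vector is taken to hold 1 and -1 values, a run of 1s is allowed to bridge a gap of -1 values only when the gap is no longer than `max_gap` and is followed by another 1, and the third weight is applied to every segment that meets neither of the first two conditions. The weight values, the threshold, and the function names are all hypothetical.

```python
def split_segments(marks, max_gap=1):
    """Return (start, end) index pairs covering marks; runs of 1s may bridge short gaps."""
    n = len(marks)
    segments = []
    i = 0
    while i < n:
        j = i
        if marks[i] == 1:
            while j < n:
                if marks[j] == 1:
                    j += 1
                    continue
                # Look ahead to the end of this gap of non-1 values.
                k = j
                while k < n and marks[k] != 1:
                    k += 1
                if k < n and k - j <= max_gap:
                    j = k  # gap is short and followed by a 1: bridge it
                else:
                    break
        else:
            while j < n and marks[j] != 1:
                j += 1
        segments.append((i, j))
        i = j
    return segments

def reweight(marks, w1=1.5, w2=1.0, w3=0.5, max_gap=1, threshold=0.3):
    """Convert the first marking vector into the second by per-segment weighting."""
    second = [float(v) for v in marks]
    for start, end in split_segments(marks, max_gap):
        segment = marks[start:end]
        neg_ratio = segment.count(-1) / len(segment)
        if neg_ratio == 0:
            weight = w1      # segment consists only of 1s
        elif neg_ratio < threshold:
            weight = w2      # a few -1 values, below the threshold
        else:
            weight = w3      # every other segment (assumed reading)
        for idx in range(start, end):
            second[idx] = marks[idx] * weight
    return second

first_vector = [1, 1, -1, 1, 1, -1, -1, -1, 1]
segments = split_segments(first_vector)
second_vector = reweight(first_vector)
```

Here the first five values form one segment with a single bridged -1 (proportion 0.2, so the second weight applies), the three consecutive -1 values form their own segment, and the final 1 is a pure run that receives the first weight.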
The historical similar segment retrieval and coarse screening module is configured to extract keywords from the full text of the project document to be detected and from each of its paragraphs, to obtain full-text keywords, the paragraph keywords of each paragraph, and the weight of each keyword; then, for each paragraph, to form a keyword dictionary corresponding to that paragraph from its paragraph keywords, the full-text keywords, and the keyword weights; to traverse each paragraph to be detected in the project document to be detected, generate the database retrieval statement required for keyword-matching retrieval, retrieve from the database, within the search range and by means of an inverted index strategy, the several similar paragraphs that best match the keywords of each paragraph to be detected, and record the historical project document to which each similar paragraph belongs as a similar project document; the similar project documents of all paragraphs to be detected together form a similar project library;
The document similarity calculation module is configured to traverse each similar project document in the similar project library, pair each paragraph of the currently traversed similar project document with each paragraph to be detected in the project document to be detected, and calculate the paragraph similarity of each pair using the Jaccard-based paragraph similarity comparison method improved with context information, to obtain the most similar paragraph and the maximum paragraph similarity of each paragraph to be detected; and, using the paragraph position weight and the paragraph length weight together as weighting information, to compute a weighted sum of the maximum paragraph similarities of all paragraphs to be detected, obtain the overall document similarity between the currently traversed similar project document and the project document to be detected, and determine the most similar project document of the project document to be detected;
CN202410762065.3A | 2024-06-13 | 2024-06-13 | A method and system for checking duplicates and automatically annotating scientific and technological project documents | Pending | CN118862843A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410762065.3A, CN118862843A (en) | 2024-06-13 | 2024-06-13 | A method and system for checking duplicates and automatically annotating scientific and technological project documents

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410762065.3A, CN118862843A (en) | 2024-06-13 | 2024-06-13 | A method and system for checking duplicates and automatically annotating scientific and technological project documents

Publications (1)

Publication Number | Publication Date
CN118862843A (en) | 2024-10-29

Family

ID=93163594

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410762065.3A, Pending, CN118862843A (en) | A method and system for checking duplicates and automatically annotating scientific and technological project documents | 2024-06-13 | 2024-06-13

Country Status (1)

Country | Link
CN (1) | CN118862843A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119272721A (en) * | 2024-12-05 | 2025-01-07 | 北京轻松怡康信息技术有限公司 | Method, device, storage medium, and program product for checking and highlighting duplicate text

Similar Documents

Publication | Publication Date | Title
US20210342404A1 (en) |  | System and method for indexing electronic discovery data
CN108829858B (en) |  | Data query method and device and computer readable storage medium
CN108460014B (en) |  | Enterprise entity identification method and device, computer equipment and storage medium
US20190236102A1 (en) |  | System and method for differential document analysis and storage
WO2019174132A1 (en) |  | Data processing method, server and computer storage medium
CN106649260B (en) |  | Product characteristic structure tree construction method based on comment text mining
US8315997B1 (en) |  | Automatic identification of document versions
CN102918532B (en) |  | To the detection of rubbish in search results ranking
CN113377927A (en) |  | Similar document detection method and device, electronic equipment and storage medium
CN109783787A (en) |  | A kind of generation method of structured document, device and storage medium
CN104063387A (en) |  | Device and method abstracting keywords in text
CN103399901A (en) |  | Keyword extraction method
CN107562843B (en) |  | News hot phrase extraction method based on title high-frequency segmentation
JP2003281186A (en) |  | Example-based search method and search system for similarity determination
CN118820389B (en) |  | Keyword-based data association storage method and device
US8862586B2 (en) |  | Document analysis system
CN115422371A (en) |  | Software test knowledge graph-based retrieval method
CN115329048A (en) |  | Statement retrieval method and device, electronic equipment and storage medium
CN109165373B (en) |  | Data processing method and device
CN114117038A (en) |  | Document classification method, device and system and electronic equipment
CN118862843A (en) |  | A method and system for checking duplicates and automatically annotating scientific and technological project documents
CN111368547A (en) |  | Entity identification method, device, equipment and storage medium based on semantic analysis
CN120046588A (en) |  | Management system and management method for writing of labels
CN115309978A (en) |  | Webpage processing method based on key long sentence and text length pre-classification
CN110909532B (en) |  | User name matching method and device, computer equipment and storage medium

Legal Events

Date | Code | Title | Description
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
