CN118658172A

Info

Publication number: CN118658172A
Application number: CN202410812037.8A
Authority: CN
Inventors: 何学洲; 陈秀法; 舒思齐; 马晨; 李玉龙; 王桂春; 王杨刚; 李雪松; 张伟波; 王秋舒
Original assignee: Development & Research Center Of China Geological Survey Bureau
Current assignee: Development & Research Center Of China Geological Survey Bureau
Priority date: 2024-06-21
Filing date: 2024-06-21
Publication date: 2024-09-17
Anticipated expiration: 2044-06-21
Also published as: CN118658172B

Abstract

The present invention discloses a method and system for extracting, identifying and comparing information in international cooperation visit plans. The method comprises: performing a primary comparison between an international cooperation plan to be compared and the plans in an international cooperation plan application database, to screen out plans similar to the plan to be compared; performing a secondary comparison between the plan to be compared and the plans in the primary screening pool, to screen out similar plans more precisely; and marking and sorting the plans in the secondary screening pool based on the importance of the auxiliary factors identified in the plans and on the comparison method, thereby achieving accurate marking and assisted comparison of similar plans. The invention enables rapid extraction, accurate identification, step-by-step screening, hierarchical comparison and auxiliary marking of international cooperation information, which is of practical value for the routine management and optimization of international cooperation plans.

Description

A method and system for extracting, identifying and comparing information in international cooperation visit plans

Technical Field

The present invention relates to the technical field of information extraction, and in particular to a method and system for extracting, identifying and comparing information in international cooperation visit plans.

Background Art

In recent years, the rapid development of science and technology has driven innovation in the theories, technologies and methods of geological survey. The emergence of many novel geological survey theories, technologies and methods has not only provided stronger support for geological work but has also greatly enriched human knowledge and understanding of the Earth.

To promote China's advanced geological survey technology and practical experience, and to learn from advanced foreign theories and equipment, international cooperation in geological survey has in recent years become progressively more open, with increasingly deep exchanges. The number of international cooperation plans filed each year within the China Geological Survey system, covering both outbound visits and invitations to China, has risen year by year. These plans span multiple disciplines, have rich and varied cooperation content, involve high personnel mobility and include many participating units. While they greatly promote international cooperation and exchange in geological survey, they also pose many challenges for management tasks such as plan approval, information classification, and submission and processing.

International cooperation work has a long cycle and involves many people, methods and pieces of equipment. Bilateral or multilateral cooperation in particular involves factors such as politics, diplomacy and international influence. In the past, approving international cooperation plans required substantial human, material and financial resources as well as strict checking, which made management and service work inefficient. International cooperation information was often identified inaccurately, and the timeliness of submission and approval was seriously affected. This not only delayed the approval of outbound visits and invited visitors; incorrect submissions and approvals could also affect the cooperation itself, damage the rights and interests of the cooperating parties, and even hinder the progress of cooperation. When handling a large volume of international cooperation plan applications, how to sort and classify the plans systematically and efficiently, capture the key information accurately, and rapidly extract, identify, hierarchically compare and mark that information has become a major challenge for international cooperation management.

Therefore, international cooperation management needs an efficient auxiliary processing solution that supports the integration of international cooperation plans and the elimination of redundant plans, so as to strengthen the top-level design of the China Geological Survey system in international cooperation, improve integrated management and the capacity for consolidated outbound visits, and thereby pool efforts to significantly enhance the effectiveness of international cooperation.

Summary of the Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, the present invention proposes a method for extracting, identifying and comparing information in international cooperation visit plans. The method establishes an algorithm model that can accurately identify international cooperation information and achieve rapid extraction, accurate identification, hierarchical comparison and efficient merging of that information. It can thus assist the rapid approval, processing and adjustment of international cooperation plans and comprehensively improve international cooperation management and services.

Another object of the present invention is to propose a system for extracting, identifying and comparing information in international cooperation visit plans.

To achieve the above object, one aspect of the present invention proposes a method for extracting, identifying and comparing information in international cooperation visit plans, comprising:

extracting the international cooperation plan to be compared from an international cooperation plan application database using a preset data retrieval method;

comparing, using a primary similarity comparison algorithm, the plan to be compared with the cooperation plans in the application database to screen out similar international cooperation plans, so as to construct a primary international cooperation plan screening pool and achieve rapid preliminary screening of similar plans;

performing, using a secondary similarity comparison algorithm, a second comparison between the plan to be compared and the plans in the primary screening pool to screen out similar plans, so as to construct a secondary international cooperation plan screening pool and achieve precise screening of similar plans; and

judging, marking and sorting, using an auxiliary-factor similar-plan comparison and marking algorithm, the plans in the secondary screening pool according to the comparison logic and the importance of the auxiliary factors identified in the plans, so as to obtain a final international cooperation plan screening pool from the marking results and achieve accurate marking and assisted comparison of similar plans.
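The three screening stages listed above form a funnel in which each stage only sees the survivors of the previous one. The following is a minimal sketch of that control flow, not the patented implementation: the names `screen`, `three_stage_screen`, `rank_key`, `t1` and `t2` are hypothetical, the similarity functions are passed in as plain callables, and the auxiliary-factor stage is reduced to a sort by a caller-supplied score.

```python
from typing import Callable, List, Sequence

# A similarity function returns a percentage in [0, 100] for two plan texts.
Similarity = Callable[[str, str], float]


def screen(target: str, candidates: Sequence[str],
           similarity: Similarity, threshold: float) -> List[str]:
    """Keep the candidates whose similarity to the target exceeds the threshold."""
    return [c for c in candidates if similarity(target, c) > threshold]


def three_stage_screen(target: str, database: Sequence[str],
                       primary: Similarity, secondary: Similarity,
                       rank_key: Callable[[str, str], float],
                       t1: float, t2: float) -> List[str]:
    """Hypothetical funnel: coarse pool, then fine pool, then ranked final pool."""
    primary_pool = screen(target, database, primary, t1)        # fast, coarse
    secondary_pool = screen(target, primary_pool, secondary, t2)  # slower, precise
    # Stand-in for auxiliary-factor marking: order by a caller-supplied score.
    return sorted(secondary_pool, key=lambda c: rank_key(target, c), reverse=True)
```

The design point this illustrates is cost: the cheap comparison runs against the whole database, while the more expensive comparison runs only against the primary pool.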

The method for extracting, identifying and comparing information in international cooperation visit plans according to embodiments of the present invention may also have the following additional technical features:

In one embodiment of the present invention, extracting the international cooperation plan to be compared from the international cooperation plan application database using a preset data retrieval method comprises:

retrieving and extracting, using Boolean logic operators, the international cooperation plans in the application database that contain the search keywords; and

setting, using a field-limited retrieval method, the fields of an international cooperation plan that can be queried, and setting the content to be searched for in each field, so that the application database is searched by matching on multiple fields and the individual international cooperation plans requiring approval are extracted;

determining the international cooperation plan to be compared based on the individual international cooperation plans requiring approval.

In one embodiment of the present invention, comparing, using the primary similarity comparison algorithm, the plan to be compared with the cooperation plans in the application database to screen out similar plans and construct the primary screening pool comprises:

obtaining the first international cooperation plan, which is the plan to be compared; traversing the international cooperation plans in the application database and obtaining each second international cooperation plan in turn; and deleting all special characters from the first and second plans using a special-character algorithm, to obtain a first plan text string and a second plan text string free of special characters;

converting the first plan text string and the second plan text string using a character conversion algorithm, to obtain a character collection for the first plan text and a character collection for the second plan text;

calculating the length of the first plan's character collection and the length of the second plan's character collection, and comparing the two lengths;

traversing the shorter character collection from left to right, determining for each character whether it is contained in the longer character collection, and recording the total number of contained characters; and

dividing the number of contained characters by the length of the longer character collection and converting the result to a percentage to obtain the similarity percentage; if the similarity percentage is greater than a preset first similarity threshold, placing the second international cooperation plan in the primary international cooperation plan screening pool.
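The primary comparison steps above can be sketched as follows. This is an illustrative reading rather than the patent's code: the function names are hypothetical, "special characters" is assumed to mean anything that is not a letter, digit or CJK character, and the character collection is assumed to keep duplicates (a deduplicated set would be an equally plausible reading of the text). The 50% default is the empirical value the description gives for the first threshold.

```python
import re
from typing import Iterable, List

# Assumption: every non-word character, plus the underscore, counts as "special".
_SPECIAL = re.compile(r"[^\w]|_")


def strip_special(text: str) -> str:
    """Delete special characters (punctuation, whitespace, symbols)."""
    return _SPECIAL.sub("", text)


def primary_similarity(plan_a: str, plan_b: str) -> float:
    """Share of the shorter text's characters found in the longer text, in percent."""
    chars_a = list(strip_special(plan_a))
    chars_b = list(strip_special(plan_b))
    shorter, longer = sorted((chars_a, chars_b), key=len)
    if not longer:
        return 0.0
    lookup = set(longer)  # membership test only; lengths still use the full list
    contained = sum(1 for ch in shorter if ch in lookup)
    return 100.0 * contained / len(longer)


def build_primary_pool(target: str, candidates: Iterable[str],
                       threshold: float = 50.0) -> List[str]:
    """Keep candidates whose similarity to the target exceeds the first threshold."""
    return [c for c in candidates if primary_similarity(target, c) > threshold]
```

Because the measure ignores character order and divides by the longer length, it is cheap and deliberately coarse: it is suited to discarding clearly unrelated plans, not to confirming duplicates.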

In one embodiment of the present invention, performing, using the secondary similarity comparison algorithm, a second comparison between the plan to be compared and the plans in the primary screening pool to screen out similar plans and construct the secondary screening pool comprises:

performing word segmentation on the first international cooperation plan using an NLP corpus to obtain the segmented text of the first plan, and determining whether that text contains a country name; if so, extracting the first country name;

establishing a special-character extraction and filtering function based on regular expressions, filtering the special characters in the first plan string and deleting them, to obtain the text of the first plan free of special characters;

filtering the text using an NLP stop-word corpus and deleting the stop words that meet preset conditions, to obtain a processed text;

deleting year-related character information from the processed text, to obtain a preprocessed text;

traversing the primary screening pool and applying the same analysis to each international cooperation plan in the pool, to obtain the second country name of the second international cooperation plan;

determining whether the first country name is the same as the second country name; if they are the same, deleting the country name from both the first plan and the second plan; if they differ, retaining the country names in the texts of both plans;

establishing, using an NLP word segmentation corpus, a replacement algorithm that maps between the full names and abbreviations of organizations related to geological survey, between the full names and abbreviations of international conferences, and between the Chinese names and English abbreviations of international conferences, so that organization names and conference names are expressed uniformly in the first and second plans; and

calculating, using a cosine vector comparison algorithm, the similarity of the normalized texts of the first and second plans to obtain a similarity percentage; if the similarity percentage is greater than a preset second similarity threshold, placing the second international cooperation plan in the secondary international cooperation plan screening pool.
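The secondary comparison can be illustrated with a short sketch. It is written under several assumptions that are not in the source: tokens are produced by lowercasing and splitting on whitespace, which works for space-delimited text only (Chinese text would need a word segmenter backed by the domain corpus the description mentions); country names are single tokens; alias replacement is a naive substring substitution; and all function and parameter names are hypothetical. The cosine measure is computed over raw term counts.

```python
import math
import re
from collections import Counter
from typing import Dict, FrozenSet, List, Optional, Tuple

# Four-digit years from 1900 to 2099, optionally followed by a Chinese year suffix.
_YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)(?:年度|年)?")
_SPECIAL = re.compile(r"[^\w\s]|_")


def preprocess(text: str, stop_words: FrozenSet[str] = frozenset(),
               aliases: Optional[Dict[str, str]] = None) -> List[str]:
    """Unify names, drop years and special characters, then remove stop words."""
    for short, full in (aliases or {}).items():
        text = text.replace(short, full)  # naive abbreviation-to-full-name mapping
    text = _YEAR.sub(" ", text)
    text = _SPECIAL.sub(" ", text)
    return [t for t in text.lower().split() if t not in stop_words]


def drop_shared_country(tokens_a: List[str], tokens_b: List[str],
                        countries: FrozenSet[str]) -> Tuple[List[str], List[str]]:
    """Remove a country name only when both plans name the same country."""
    shared = set(tokens_a) & set(tokens_b) & countries
    return ([t for t in tokens_a if t not in shared],
            [t for t in tokens_b if t not in shared])


def cosine_similarity(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine of the angle between the two term-count vectors, in percent."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return 100.0 * dot / norm if norm else 0.0


def secondary_similarity(plan_a: str, plan_b: str,
                         stop_words: FrozenSet[str] = frozenset(),
                         countries: FrozenSet[str] = frozenset(),
                         aliases: Optional[Dict[str, str]] = None) -> float:
    tokens_a = preprocess(plan_a, stop_words, aliases)
    tokens_b = preprocess(plan_b, stop_words, aliases)
    tokens_a, tokens_b = drop_shared_country(tokens_a, tokens_b, countries)
    return cosine_similarity(tokens_a, tokens_b)
```

A plan would enter the secondary pool when `secondary_similarity` exceeds the second threshold; this section does not state a value for that threshold, so none is assumed here.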

In one embodiment of the present invention, judging, marking and sorting, using the auxiliary-factor similar-plan comparison algorithm, the plans in the secondary screening pool according to the comparison logic and the importance of the auxiliary factors identified in the plans, so as to obtain the final screening pool from the marking results, comprises:

extracting the auxiliary factors of international cooperation plans from the application database and sorting them by importance, to obtain an auxiliary factor ranking;

traversing the international cooperation plans in the secondary screening pool and, based on the extracted auxiliary factors and the auxiliary-factor comparison logic, determining in turn the relationship between the auxiliary factors of the plan to be compared and those of each traversed plan; and

judging, marking and sorting the traversed plans according to the comparison logic and the relationships between their auxiliary factors, so as to obtain the final international cooperation plan screening pool from the marking results.

To achieve the above object, another aspect of the present invention proposes a system for extracting, identifying and comparing information in international cooperation visit plans, comprising:

an original plan extraction module, configured to extract the international cooperation plan to be compared from the international cooperation plan application database using a preset data retrieval method;

a first screening pool determination module, configured to compare, using the primary similarity comparison algorithm, the plan to be compared with the cooperation plans in the application database to screen out similar plans, so as to construct the primary international cooperation plan screening pool and achieve rapid preliminary screening of similar plans;

a second screening pool determination module, configured to perform, using the secondary similarity comparison algorithm, a second comparison between the plan to be compared and the plans in the primary screening pool to screen out similar plans, so as to construct the secondary international cooperation plan screening pool and achieve precise screening of similar plans; and

a final screening pool determination module, configured to judge, mark and sort, using the auxiliary-factor similar-plan comparison and marking algorithm, the plans in the secondary screening pool according to the comparison logic and the importance of the auxiliary factors identified in the plans, so as to obtain the final international cooperation plan screening pool from the marking results and achieve accurate marking and assisted comparison of similar plans.

The method and system for extracting, identifying and comparing information in international cooperation visit plans according to embodiments of the present invention accurately identify international cooperation information and achieve its rapid extraction, accurate identification, hierarchical comparison and auxiliary marking, thereby assisting the rapid adjustment and approval of international cooperation plans and comprehensively improving international cooperation management and services. Based on the characteristics of international cooperation work, the structure and overall content of international cooperation information are analyzed; a technical workflow for identifying and processing that information is designed; and a dedicated word segmentation corpus for geological survey and international cooperation is created. Combining NLP (natural language processing) algorithms with an auxiliary-factor judgment method, a program for identifying the similarity of international cooperation texts is developed and a three-level model for identifying and processing international cooperation information is established. Using step-by-step comparison, hierarchical screening and gradual refinement, the invention identifies, extracts and compares multiple categories of international cooperation information, and provides an efficient and accurate auxiliary tool for international cooperation management, information services and business approval.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 is a flow chart of a method for extracting, identifying and comparing information in international cooperation visit plans according to an embodiment of the present invention;

FIG. 2 is a model architecture diagram of the method for extracting, identifying and comparing information in international cooperation visit plans according to an embodiment of the present invention;

FIG. 3 is a logic flow chart of the comparison of auxiliary factors of international cooperation plans according to an embodiment of the present invention;

FIG. 4 is a structural diagram of a system for extracting, identifying and comparing information in international cooperation visit plans according to an embodiment of the present invention.

Detailed Description

It should be noted that, where no conflict arises, the embodiments of the present invention and the features of those embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.

A method and system for extracting, identifying and comparing information in international cooperation visit plans according to embodiments of the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a flow chart of a method for extracting, identifying and comparing information in international cooperation visit plans according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

S1: extracting the international cooperation plan to be compared from the international cooperation plan application database using a preset data retrieval method.

It can be understood that each unit dispatching personnel under an international cooperation plan first fills in the details of its annual international cooperation plans. Through the international cooperation system platform, the annual plan information of every unit under the China Geological Survey is entered into storage, forming an international cooperation plan application database for each year.

Further, the data retrieval and query functions provided by the international cooperation system are used to extract the individual international cooperation plans requiring approval, which serve as the plans to be compared. For this purpose the international cooperation system provides an information retrieval function, and the present invention uses the existing Boolean retrieval method and field-limited retrieval method to extract international cooperation plans, as shown in FIG. 2.

It can be understood that Boolean retrieval is a basic and important method in information retrieval. It lets users combine search terms with logical operators so that they can express an information need precisely and select the matching records from a database. The Boolean retrieval process applied in the present invention is as follows:

Specifically, the program uses the Boolean operators AND, OR and NOT to evaluate the search terms and combine them logically, so that relevant information is found precisely. For example, when a user searches for international cooperation plans that include both of the keywords "World Geological Congress" and "France", the program searches with the Boolean AND operator and finds the plans whose content contains both "World Geological Congress" and "France". Likewise, when a user searches for plans that include "zircon experimental testing" but not "Oceania", the result is obtained with Boolean operations in the same way. The Boolean retrieval of the present invention may include the following cases:

1) The user enters search keywords (system setting: at most 3) and asks for all international cooperation plans containing keyword A, keyword B and keyword C; the search uses the Boolean operator AND.

2) The user enters search keywords (system setting: at most 3) and asks for all international cooperation plans containing keyword A and keyword B but not keyword C; the search uses the Boolean operators AND and NOT.

3) The user enters search keywords (system setting: at most 3) and asks for all international cooperation plans containing keyword A, keyword B or keyword C; the search uses the Boolean operator OR.

Boolean operators can therefore be combined to retrieve and extract international cooperation plans.
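The three cases above can be sketched as a single keyword filter. This is a minimal illustration that assumes each plan is available as a plain text string and that a keyword "matches" when it occurs as a substring; the function name `boolean_search` and its parameters are hypothetical, and only the three-keyword limit comes from the description.

```python
from typing import Iterable, List, Sequence

MAX_KEYWORDS = 3  # system setting stated in the description


def boolean_search(plans: Iterable[str],
                   all_of: Sequence[str] = (),
                   any_of: Sequence[str] = (),
                   none_of: Sequence[str] = ()) -> List[str]:
    """Return plans containing every all_of term, at least one any_of term
    (when any are given) and no none_of term."""
    if len(all_of) + len(any_of) + len(none_of) > MAX_KEYWORDS:
        raise ValueError(f"at most {MAX_KEYWORDS} keywords are allowed")
    matches = []
    for plan in plans:
        has_all = all(k in plan for k in all_of)                 # AND
        has_any = not any_of or any(k in plan for k in any_of)   # OR
        has_none = not any(k in plan for k in none_of)           # NOT
        if has_all and has_any and has_none:
            matches.append(plan)
    return matches
```

For instance, `boolean_search(plans, all_of=["zircon experimental testing"], none_of=["Oceania"])` corresponds to the second example in the text.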

It can be understood that field-limited retrieval is a technique for precisely controlling the retrieval process: the user specifies which fields of a database record the search terms must match. This improves retrieval accuracy and efficiency and avoids interference from irrelevant information. The field-limited retrieval process used in the present invention is as follows:

Specifically, the fields on which international cooperation plans can be queried are defined, such as unit, institution, year, and type, and the user sets the content to search for in each field. For example, the visiting unit is "Chengdu Geological Survey Center of China Geological Survey", the visit year is 2023, and the visited institution is "Australian Geological Survey"; the plans that match all three fields are queried and extracted.

The system therefore provides multiple fields for restricting the search conditions when extracting international cooperation plans.
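A minimal sketch of field-limited retrieval follows, assuming plans are held as in-memory records and that a field matches only on exact equality. The `Plan` record, its field names, and `field_search` are hypothetical; the second unit name in the sample data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    # Hypothetical record layout; field names are assumptions.
    unit: str          # visiting (dispatching) unit
    institution: str   # visited institution
    year: int          # visit year
    plan_type: str
    title: str


def field_search(plans, **criteria):
    """Keep the plans whose named fields equal every supplied value."""
    unknown = set(criteria) - set(Plan.__dataclass_fields__)
    if unknown:
        raise KeyError(f"unknown field(s): {sorted(unknown)}")
    return [p for p in plans
            if all(getattr(p, name) == value for name, value in criteria.items())]


plans = [
    Plan("Chengdu Geological Survey Center of China Geological Survey",
         "Australian Geological Survey", 2023, "visit", "Mineral cooperation"),
    Plan("Chengdu Geological Survey Center of China Geological Survey",
         "Australian Geological Survey", 2022, "visit", "Mapping workshop"),
    Plan("Another unit (illustrative)", "Australian Geological Survey",
         2023, "visit", "Data exchange"),
]
matches = field_search(
    plans,
    unit="Chengdu Geological Survey Center of China Geological Survey",
    year=2023,
    institution="Australian Geological Survey",
)
```

Field conditions and the Boolean keyword conditions are independent filters, so they can be applied one after the other to realize the mixed retrieval described in the text.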

Thus, international cooperation plans are retrieved and extracted by combining Boolean logic operations with field-limited retrieval, yielding the plans that need to be compared. A plan to be compared may be compared with a single plan (one-to-one) or with several plans (one-to-many); in this system it is usually compared with multiple other plans in the international cooperation plan declaration database.

S2: using the initial similarity comparison algorithm, compare the international cooperation plan to be compared with the cooperation plans in the international cooperation plan declaration database and screen out similar plans, so as to build the initial international cooperation plan screening pool and achieve rapid preliminary screening of similar plans.

It can be understood that the present invention provides an initial similarity comparison algorithm and sets a first similarity threshold (50% is the empirically optimal value; the user may configure other thresholds such as 60% or 70% according to data volume and requirements) to screen similar plans and form the initial screening pool, achieving rapid coarse filtering. The purpose of the initial screening is to quickly select, from a large number of international cooperation plans, a comparison pool of plans that meet the similarity threshold. The specific steps of the initial similarity comparison algorithm are as follows:

Step 1: Use the retrieval function provided by the system to obtain an international cooperation plan A that satisfies the search conditions and needs to be compared.

Step 2: Develop a regular expression function and use it to filter special characters out of international cooperation plan A (symbols such as - ？ “ ” ！ ‘ ’ 、 ， / & ～), deleting all of them to obtain the text of plan A without special characters.

Step 3: Use a character conversion program to convert the text of plan A obtained in Step 2 into pinyin letters, obtaining the character set of plan A.

Step 4: Traverse all international cooperation plans submitted by the units in the system, selecting each plan J to be compared in turn, and apply Step 2 (special character filtering) and Step 3 (character set conversion) to plan J to obtain the character set of plan J.

Step 5: Develop a character comparison algorithm and begin comparing the pinyin character set of plan A with the character set of plan J, first comparing the lengths of the two character sets.

Step 6: Take the plan with the shorter character set and traverse its characters from left to right, checking whether each character is contained in the longer character set, until all letters have been traversed; record the total number of contained characters as TolCount.

Step 7: Divide TolCount by LenAJ, the length of the longer of the two character sets of plan A and plan J, and express the quotient as a percentage to obtain the similarity percentage SimJ.

Step 8: Compare SimJ with the threshold configured by the user (called the weight factor; 50%, 60%, and so on). If SimJ is greater than the weight factor, place plan J in the initial screening list (the screening pool) to await secondary screening; plans whose SimJ is less than the weight factor are not recorded.

Step 9: Repeat from Step 4 to compare the remaining international cooperation plans until all plans have been traversed.
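Steps 2 to 8 above can be sketched as follows. This is a minimal illustration under stated assumptions: the pinyin conversion of Step 3 is represented by a hypothetical `to_letters` hook that defaults to the identity function (a pinyin library would be plugged in there), and the special-character set contains only the symbols listed in Step 2. The function names are not from the invention.

```python
import re

# Symbols listed in Step 2 (ASCII and full-width forms); extend as needed.
_SPECIAL = re.compile("[-?？!！\"“”'‘’、,，/&~～]")


def clean(text, to_letters=lambda s: s):
    """Step 2 (strip special characters) followed by Step 3 (letter conversion).

    `to_letters` is a hypothetical hook standing in for pinyin conversion.
    """
    return to_letters(_SPECIAL.sub("", text))


def initial_similarity(plan_a, plan_j, to_letters=lambda s: s):
    """Steps 5 to 7: SimJ = 100 * TolCount / LenAJ."""
    ca, cj = clean(plan_a, to_letters), clean(plan_j, to_letters)
    shorter, longer = sorted((ca, cj), key=len)
    if not longer:
        return 0.0
    # TolCount: characters of the shorter text that occur anywhere in the longer one.
    tol_count = sum(1 for ch in shorter if ch in longer)
    return 100.0 * tol_count / len(longer)


def initial_pool(plan_a, candidates, threshold=50.0, to_letters=lambda s: s):
    """Step 8: keep the candidates whose SimJ is strictly above the threshold."""
    return [j for j in candidates
            if initial_similarity(plan_a, j, to_letters) > threshold]


sim = initial_similarity("abc-d", "abcdef!")       # cleaned: "abcd" vs "abcdef"
pool = initial_pool("abc-d", ["abcdef!", "xyz"])
```

Because Step 6 tests only whether a character occurs in the longer text, the measure ignores character order and repetition. That makes it cheap but coarse, which is consistent with its role as a first-pass filter ahead of the secondary comparison.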

Thus, when processing the text of international cooperation plans, and especially when designing the comparison algorithm, it is important to account for textual diversity, language differences, habits in the use of special symbols, and possible noise by adding filters for characters and special content and by setting a reasonable comparison threshold. The initial similarity comparison algorithm of the present invention is tailored to the characteristics of international cooperation plans so that it meets business needs, filters out irrelevant noise, and extracts the useful information with reasonable accuracy, which is key to improving the comparison results.

S3: using the secondary similarity comparison algorithm, compare the international cooperation plan to be compared with the plans in the initial international cooperation plan screening pool a second time and screen out similar plans, so as to build the secondary international cooperation plan screening pool and achieve precise screening of similar plans.

It can be understood that, before the secondary screening, an NLP word segmentation corpus for international cooperation in geological surveys must be established, with sub-corpora for employee lists, countries and regions, international conferences, cooperating institutions, dispatching units, and stop words, to provide the basis for the NLP-based secondary screening and comparison algorithm. The corpus here is a dedicated corpus that must be built and extended in-house. It is based on the characteristics of international cooperation in geological surveys and on domain-specific usage, for example by incorporating geological survey terms, international conference names, international cooperation institutions, and employee names.

Further, using this corpus, a secondary similarity comparison algorithm is provided and a second similarity threshold is set (70% is the empirically optimal value; the user may configure other thresholds such as 60%, 80%, or 90%) to screen similar plans and form the secondary screening pool, achieving more precise matching of international cooperation plans, as shown in Figure 2. The specific steps of the secondary similarity comparison algorithm are as follows:

Step 1: Reuse the international cooperation plan A to be compared that was extracted for the initial screening.

Step 2: Use the NLP corpus (the professional corpus and the general corpus) to segment plan A into words, obtaining the segmented text of plan A.

Step 3: Determine whether the segmented text of plan A contains a country name; if so, extract it and record it in the variable CountryA.

Step 4: Develop a regular expression function and use it to filter special characters out of plan A (symbols such as - ？ “ ” ！ ‘ ’ 、 ， / & ～), deleting all of them to obtain the text of plan A without special characters.

Step 5: Use the NLP stop-word corpus to remove stop words from the text of plan A obtained in Step 4, filtering out modal particles, auxiliary words, and similar stop words such as "的", "地", and "得", to obtain a further cleaned text of plan A.

Step 6: Process the text string obtained in Step 5 to delete year-related characters, such as "2024年", so that the plan text is more objective and carries only content-bearing information.

Step 7: Traverse the initial screening pool, taking one plan J to be compared at a time, and process plan J according to Steps 2 to 6 (word segmentation of the plan name, invalid character filtering, year filtering, and country name extraction) to obtain the name of the country visited under plan J, CountryJ.

Step 8: Determine whether the country name of plan A is the same as the country name of plan J, that is, whether CountryA equals CountryJ.

Step 9: If CountryA and CountryJ are the same, delete the country name from both plan A and plan J and then proceed to Step 10; if they differ, keep each plan's country name in its text and then proceed to Step 10.

Step 10: Use the word segmentation corpus to determine whether plan A and plan J contain dispatching unit names, international conference names, visited institution names, and similar information. By developing replacement algorithms that map between the full and short names of units, the full and short names of international conferences, the Chinese names and English abbreviations of international conferences, and the full and short names of visited institutions, normalize the unit names, conference names, and institution names in plan A and plan J to a common form, so that the text similarity of plans A and J can be compared.

Step 11: Develop a cosine vector similarity comparison program for text. Pass the normalized texts of plan A and plan J obtained in Step 10 as two arguments to the cosine vector similarity function, compute the similarity of the two strings, and express the result as a percentage, denoted SimJ.

Step 12: Compare SimJ with the threshold configured by the user (called the weight factor; 50%, 60%, and so on). If SimJ is greater than the weight factor, place plan J in the secondary international cooperation plan screening pool; plans whose SimJ is less than the weight factor are not recorded.
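The preprocessing in Steps 3, 5, 6, 9, and 10 above can be sketched as follows. This is a minimal illustration that operates on text that has already been segmented into tokens (Step 2) and stripped of special characters (Step 4). The tiny word lists stand in for the dedicated sub-corpora, and the alias entry "AGS" is purely hypothetical.

```python
import re

# Stand-ins for the dedicated sub-corpora; contents are illustrative only.
STOP_WORDS = {"的", "地", "得", "to", "the", "of"}
COUNTRIES = {"Australia", "Canada"}
ALIASES = {"AGS": "Australian Geological Survey"}   # short form -> full form (hypothetical)
YEAR = re.compile(r"^\d{4}(年|年度)?$")               # e.g. "2024" or "2024年"


def normalize(tokens):
    """Return (country, cleaned_tokens) for one segmented plan title.

    Covers Step 3 (country extraction), Step 5 (stop words),
    Step 6 (year tokens) and Step 10 (alias normalization).
    """
    country = next((t for t in tokens if t in COUNTRIES), None)
    cleaned = [ALIASES.get(t, t) for t in tokens
               if t not in STOP_WORDS and not YEAR.match(t)]
    return country, cleaned


def drop_shared_country(tokens_a, country_a, tokens_j, country_j):
    """Step 9: remove the country name from both plans only when it is the same."""
    if country_a is not None and country_a == country_j:
        return ([t for t in tokens_a if t != country_a],
                [t for t in tokens_j if t != country_j])
    return tokens_a, tokens_j


country_a, toks_a = normalize(["2024年", "visit", "to", "Australia", "AGS"])
country_j, toks_j = normalize(["2023", "visit", "Australia",
                               "Australian Geological Survey"])
toks_a, toks_j = drop_shared_country(toks_a, country_a, toks_j, country_j)
```

Removing a shared country name means the remaining tokens, rather than the country itself, decide how similar two plans to the same country are; when the countries differ, the names stay in and lower the similarity.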

The present invention uses NLP word segmentation to split the text into meaningful lexical units for subsequent processing. Based on the stop-word corpus described above, meaningless stop words and special characters are identified and filtered out to reduce interference. Named entity recognition (NER) is used to extract key information from the text, such as international cooperation institutions, unit names, and years, and to normalize it in preparation for the subsequent comparison. A text similarity algorithm (the cosine vector similarity of Step 11) is then applied to the NLP-processed text, which is intended to reflect semantic similarity more accurately than simple character comparison. A reasonable similarity threshold is set according to actual needs and test results to filter out low-similarity matches; the threshold may need to be adjusted dynamically based on early experimental feedback and observed results in use. The screening pool obtained through these steps contains strictly screened international cooperation project information that is highly relevant to the target, providing a high-quality data basis for subsequent analysis and decision-making.
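The similarity computation of Step 11 and the threshold check of Step 12 can be sketched as follows. The description does not specify how the text is vectorized, so this sketch assumes bag-of-tokens count vectors; the function names and the dictionary-based candidate pool are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity_percent(tokens_a, tokens_j):
    """Cosine similarity of two token-count vectors, expressed as a percentage."""
    va, vj = Counter(tokens_a), Counter(tokens_j)
    dot = sum(va[t] * vj[t] for t in va.keys() & vj.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_j = math.sqrt(sum(c * c for c in vj.values()))
    if norm_a == 0 or norm_j == 0:
        return 0.0
    return 100.0 * dot / (norm_a * norm_j)


def secondary_pool(tokens_a, candidates, threshold=70.0):
    """Keep candidates whose SimJ is strictly above the threshold.

    `candidates` maps a plan identifier to its normalized token list.
    """
    pool = {}
    for plan_id, tokens_j in candidates.items():
        sim_j = cosine_similarity_percent(tokens_a, tokens_j)
        if sim_j > threshold:
            pool[plan_id] = sim_j
    return pool


plan_a = ["visit", "survey", "mineral"]
candidates = {
    "J1": ["visit", "survey", "mineral"],   # identical tokens
    "J2": ["visit", "survey", "mapping"],   # two of three tokens shared
    "J3": ["conference", "training"],       # nothing shared
}
pool = secondary_pool(plan_a, candidates)
```

With count vectors, two plans sharing two of three tokens score about 66.7%, which falls below the 70% default, so only the identical plan is retained in this example.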

Thus, using natural language processing (NLP) technology to build a dedicated corpus and an improved comparison algorithm for international cooperation projects in the field of geological surveys can substantially improve the accuracy and efficiency of data processing.

S4: using the auxiliary-factor similar-plan comparison and marking algorithm, judge the plans in the secondary international cooperation plan screening pool according to the comparison logic and the importance of the auxiliary factors identified in the plans, and mark and sort them, so as to obtain the final international cooperation plan screening pool from the marking results and achieve accurate marking and auxiliary comparison of similar plans.

The embodiment of the present invention introduces auxiliary factors for comparing international cooperation plans. The visited country, visit period, visit time, visited institution, organizing unit, dispatching unit, and purpose of the visit are used as auxiliary factors, which are extracted from the plans and automatically identified and compared. This is the third-level (third-round) comparison of plan similarity. The auxiliary factor comparison supplements the precision of the second screening results; the screened plans are marked and sorted to help international cooperation managers identify, assess, optimize, and adjust the plans, as shown in Figure 2.

It can be understood that the auxiliary factors were determined by summarizing historical international cooperation plan data accumulated over many years. They mainly include the cooperating unit, visited country, visit period, visit time, and visited institution, and can be extracted programmatically from each plan. Using the auxiliary factors alongside the text similarity comparison is the last step of the three-level automated similarity comparison. Building on the initial and secondary text comparisons, the content of the cooperation is compared: according to the importance of the auxiliary factors and the similarity logic judgment process, the plans from the second screening are marked and sorted, and a marked pool of similar plans is provided to managers. Managers can then judge whether the plans in the final screening pool are consistent with plan A to be compared, and optimize and adjust the plans according to that judgment.

The specific steps of the auxiliary factor comparison in the embodiment of the present invention are as follows:

Step 1: Determine the importance of the auxiliary factors and their relationships. In descending order of importance, the factors are: visited country, visited institution (or name of the conference attended), visit time, organizing unit, and visit period.

Step 2: Building on the secondary screening, traverse the secondary international cooperation plan screening pool.

Step 3: For each traversed plan, apply the "auxiliary factor logic judgment process algorithm" to determine in turn the relationship (same, contained, or different) between plan A to be compared and the traversed plan for each of the auxiliary factors "visited country, visited institution (conference), visit time, organizing unit".

Step 4: While traversing the secondary screening pool, mark each plan according to the "auxiliary factor logic judgment process algorithm" in Figure 3 (two types of mark: auxiliary judgment mark 1 and auxiliary judgment mark 2). The final marking results give managers a basis for judging, optimizing, and adjusting the international cooperation plans.
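The factor-by-factor comparison of Step 3 and the marking of Step 4 can be sketched as follows. The flowchart of Figure 3 is not reproduced in this text, so the rule that maps relations to mark 1 or mark 2, and the sort key, are placeholder assumptions and not the invention's logic. The field names and functions are likewise hypothetical.

```python
# Factors compared in Step 3, in descending order of importance.
FACTORS = ("country", "institution", "visit_time", "organizing_unit")


def relation(a, b):
    """Classify two factor values as 'same', 'contains' or 'different'."""
    if not a or not b:
        return "different"          # assumption: a missing value never matches
    if a == b:
        return "same"
    if a in b or b in a:
        return "contains"
    return "different"


def compare_factors(plan_a, plan_j):
    return [relation(plan_a.get(f, ""), plan_j.get(f, "")) for f in FACTORS]


def mark(relations):
    # ASSUMPTION: placeholder rule standing in for Figure 3.
    # Mark 1 when no factor differs, otherwise mark 2.
    return 1 if all(r != "different" for r in relations) else 2


def leading_matches(relations):
    """Number of factors, from most important down, before the first mismatch."""
    count = 0
    for r in relations:
        if r == "different":
            break
        count += 1
    return count


def rank(plan_a, pool):
    """Mark every plan in the pool and sort by leading matched factors (assumed key)."""
    rows = [(pj, compare_factors(plan_a, pj)) for pj in pool]
    rows.sort(key=lambda row: leading_matches(row[1]), reverse=True)
    return [(pj, rels, mark(rels)) for pj, rels in rows]


plan_a = {"country": "Australia", "institution": "Australian Geological Survey",
          "visit_time": "2023-06", "organizing_unit": "Chengdu Center"}
plan_1 = dict(plan_a, institution="Geological Survey")   # contained name
plan_2 = dict(plan_a, country="Canada")                  # different country
ranked = rank(plan_a, [plan_2, plan_1])
```

Sorting by the number of leading matched factors is one way to honor the importance order of Step 1: a mismatch on the visited country pushes a plan below one that differs only on a less important factor.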

Further, the final screening pool obtained after the auxiliary factor comparison can help managers merge, split, optimize, and adjust plans, producing the final draft of the international cooperation plan for approval.

It can be seen that the present invention proposes the use of auxiliary factors to judge, compare, and mark the similarity of international cooperation plans, which can serve as a new means of comparing such plans, together with an auxiliary factor logic judgment that provides a basis for subsequent manual assessment and optimization of the plans.

The method for extracting, identifying, and comparing international cooperation visit plan information according to the embodiment of the present invention takes into account the business characteristics and practical needs of international cooperation in geological surveys. It designs a model system for extracting, identifying, and comparing geological survey international cooperation visit information, uses an NLP natural language model together with a professional geological survey word segmentation corpus, and develops a similarity comparison algorithm for international cooperation information. Through extraction, retrieval, analysis, comparison, and auxiliary factor assessment, international cooperation information can be precisely classified and identified, providing a convenient tool for classifying, processing, integrating, and approving international cooperation plans.

To implement the above embodiment, as shown in FIG. 4, this embodiment further provides an international cooperation visit plan information extraction and identification comparison system 10, the system 10 comprising:

an original plan extraction module 100, configured to extract the international cooperation plan to be compared from the international cooperation plan declaration database using a preset data retrieval method;

a first screening pool determination module 200, configured to use the initial similarity comparison algorithm to compare the international cooperation plan to be compared with the cooperation plans in the international cooperation plan declaration database and screen out similar plans, so as to build the initial international cooperation plan screening pool and achieve rapid preliminary screening of similar plans;

a second screening pool determination module 300, configured to use the secondary similarity comparison algorithm to compare the international cooperation plan to be compared with the plans in the initial international cooperation plan screening pool a second time and screen out similar plans, so as to build the secondary international cooperation plan screening pool and achieve precise screening of similar plans;

a final screening pool determination module 400, configured to use the auxiliary-factor similar-plan comparison and marking algorithm to judge the plans in the secondary international cooperation plan screening pool according to the comparison logic and the importance of the auxiliary factors identified in the plans, and to mark and sort them, so as to obtain the final international cooperation plan screening pool from the marking results and achieve accurate marking and auxiliary comparison of similar plans.

Further, the original plan extraction module 100 is further configured to:

Further, the first screening pool determination module 200 is further configured to:

Further, the second screening pool determination module 300 is further configured to:

Further, the final screening pool determination module 400 is further configured to:

The system for extracting, identifying, and comparing international cooperation visit plan information according to the embodiment of the present invention takes into account the business characteristics and practical needs of international cooperation in geological surveys. It designs a model system for extracting, identifying, and comparing geological survey international cooperation visit information, uses an NLP natural language model together with a professional geological survey word segmentation corpus, develops a three-level similarity comparison algorithm for international cooperation information, and establishes a comparison model. Through extraction, retrieval, analysis, and comparison, international cooperation information can be precisely classified and identified, providing a convenient tool for classifying, processing, integrating, and approving international cooperation plans.

In this specification, descriptions that refer to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Such references do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless otherwise expressly and specifically defined.