CN111026834B

Info

Publication number: CN111026834B
Application number: CN201911258482.XA
Authority: CN
Inventors: 许建伟
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2022-07-08
Anticipated expiration: 2039-12-10
Also published as: CN111026834A

Abstract

An embodiment of the invention provides a question-and-answer corpus generation method. The method comprises: receiving a corpus text; detecting the text amount of the corpus text and, when the text amount is smaller than a preset threshold, determining the entity and attribute of the corpus text for use with a knowledge graph; querying a regular expression that matches the corpus text based on the entity and the attribute; determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into the knowledge graph, and determining the corresponding text of the corpus text according to an inverted index; and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-and-answer dialogue corpus entries. An embodiment of the invention also provides a question-and-answer corpus generation system. The embodiments use fuzzy search in the knowledge graph, which improves retrieval recall, and use an inverted index in knowledge graph retrieval, which improves retrieval efficiency. As a result, multiple paired question-and-answer dialogue corpus entries can be generated from both single texts and text passages.

Description

Translated from Chinese

Question-and-answer corpus generation method and system

Technical field

The invention relates to the field of knowledge graph question answering, and in particular to a method and system for generating question-and-answer corpora.

Background

The answer quality of a reading-comprehension question-answering language model depends on a large amount of high-quality paired question-and-answer corpus data. Corpus generation methods are commonly used to obtain such high-quality dialogue corpora.

In the course of realizing the invention, the inventor found at least the following problems in the related art:

Existing corpus generation methods have difficulty producing paired corpora in dialogue question-and-answer form, because the training text that is relatively easy to obtain does not come in pairs but as separate sentences. Generating paired question-and-answer corpora from such separate sentences is difficult:

1. To build a paired question-and-answer corpus from separate sentences, a knowledge graph is needed to answer them (or to generate questions from answers). However, querying a knowledge graph for answers is slow. In addition, sentences phrase their questions in different ways, which makes it hard to use the knowledge graph's fuzzy search, so the same content may fail to retrieve its answer simply because it is phrased differently.

2. For large texts spanning whole paragraphs, a reading comprehension model is used to extract the paired question-and-answer dialogues. However, the reading comprehension model itself requires a large amount of annotated, high-quality paired question-and-answer text to train.

Summary of the invention

The invention aims to solve at least the following problems in the prior art when generating paired question-and-answer corpora: knowledge graph queries are slow, fuzzy search cannot be used, and paired dialogue question-and-answer corpora are difficult to obtain.

In a first aspect, an embodiment of the invention provides a question-and-answer corpus generation method, including:

receiving a corpus text;

detecting the text amount of the corpus text and, when the text amount is less than a preset threshold, determining the entity and attribute of the corpus text for use with a knowledge graph;

querying, based on the entity and the attribute, a regular expression that matches the corpus text;

determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into the knowledge graph, and determining the corresponding text of the corpus text according to an inverted index, wherein the corresponding text includes answer text and/or question text;

performing corpus generation on the corpus text and the corresponding text through the regular expression, so as to construct a plurality of paired question-and-answer dialogue corpus entries.

In a second aspect, an embodiment of the invention provides a question-and-answer corpus generation system, including:

a corpus receiving program module, configured to receive a corpus text;

an information determination program module, configured to detect the text amount of the corpus text and, when the text amount is less than a preset threshold, determine the entity and attribute of the corpus text for use with a knowledge graph;

a regular expression query program module, configured to query, based on the entity and the attribute, a regular expression that matches the corpus text;

a corresponding text determination program module, configured to determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement into the knowledge graph, and determine the corresponding text of the corpus text according to an inverted index, wherein the corresponding text includes answer text and/or question text;

a question-and-answer corpus generation program module, configured to perform corpus generation on the corpus text and the corresponding text through the regular expression, so as to construct a plurality of paired question-and-answer dialogue corpus entries.

In a third aspect, an electronic device is provided, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the question-and-answer corpus generation method of any embodiment of the invention.

In a fourth aspect, an embodiment of the invention provides a storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the question-and-answer corpus generation method of any embodiment of the invention.

The beneficial effects of the embodiments of the invention are as follows. Fuzzy search is used in the knowledge graph, which improves retrieval recall. An inverted index is used in knowledge graph retrieval, which improves retrieval efficiency. Only then can multiple paired question-and-answer dialogue corpus entries be generated from text passages and single texts to train a reading-comprehension question-answering language model.

Description of drawings

To illustrate the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a question-and-answer corpus generation method provided by an embodiment of the invention;

FIG. 2 is a schematic structural diagram of a question-and-answer corpus generation system provided by an embodiment of the invention.

Detailed description

To make the purposes, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.

FIG. 1 is a flowchart of a question-and-answer corpus generation method provided by an embodiment of the invention, including the following steps:

S11: receive a corpus text;

S12: detect the text amount of the corpus text and, when the text amount is less than a preset threshold, determine the entity and attribute of the corpus text for use with a knowledge graph;

S13: based on the entity and the attribute, query a regular expression that matches the corpus text;

S14: determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement into the knowledge graph, and determine the corresponding text of the corpus text according to an inverted index, wherein the corresponding text includes answer text and/or question text;

S15: perform corpus generation on the corpus text and the corresponding text through the regular expression, so as to construct a plurality of paired question-and-answer dialogue corpus entries.

In step S11, a corpus text is received. The corpus text is typically text that is relatively easy to collect, such as conversations posted in forums or sentences from web page content. Such sentences are not only easy to obtain but also often have a latent question-and-answer pattern. For example, someone posts online asking “相对论是谁发明的” ("Who invented the theory of relativity?"). That sentence is then collected as the corpus text for this method.

In step S12, the text amount of the corpus text is detected, that is, the number of characters in the text. Because these corpus texts are easy to obtain, some may be a single sentence and others may be a whole paragraph of sentences. The character count is used to distinguish the two kinds of text. For a single short sentence, its entity and attribute for the knowledge graph are determined. An entity is an abstraction of an objective individual: a person, a film, or a sentence can each be regarded as an entity. An attribute (property) is an abstraction of the relationship between entities. For example, in “相对论是谁发明的”, “相对论” (theory of relativity) is the entity and “发明” (invent) is the corresponding attribute.
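The dispatch in step S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the threshold value `PRESET_THRESHOLD = 30`, the function names, and the choice to ignore whitespace when counting are all hypothetical, and the source does not say how a text exactly at the threshold is handled.

```python
# Hypothetical value; the source only speaks of a "preset threshold".
PRESET_THRESHOLD = 30


def text_amount(text: str) -> int:
    """Count the characters of a corpus text, ignoring whitespace (an assumption)."""
    return sum(1 for ch in text if not ch.isspace())


def route(text: str, threshold: int = PRESET_THRESHOLD) -> str:
    """Return "short" for single-sentence handling, "long" for paragraph handling.

    The source says "less than" for the short path and "greater than" for the
    long path; sending the equal case to the long path is an assumption.
    """
    return "short" if text_amount(text) < threshold else "long"


if __name__ == "__main__":
    print(route("相对论是谁发明的"))
```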

In step S13, a regular expression matching “相对论是谁发明的” is queried based on the determined entity “相对论” and attribute “发明”. For example, the match is: ${#basicconcept}是(谁|什么人)(发明|提出来？)的, where ${#basicconcept} is the entity.
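One way to picture the template in step S13 is to turn the `${#basicconcept}` slot into a named capture group and match the corpus text against the result. The sketch below is illustrative only: the helper `compile_template` is hypothetical, the slot is assumed to accept any non-empty string, and the full-width "？" in the source pattern is assumed to stand for the regex quantifier `?`.

```python
import re

# Template from the step S13 example, with "？" written as the ASCII
# quantifier "?" (an assumption about what the source intends).
TEMPLATE = "${#basicconcept}是(谁|什么人)(发明|提出来?)的"


def compile_template(template: str) -> re.Pattern:
    """Replace each ${#name} slot with a lazy named group and compile the pattern."""
    pattern = re.sub(r"\$\{#(\w+)\}", r"(?P<\1>.+?)", template)
    return re.compile(pattern)


if __name__ == "__main__":
    matcher = compile_template(TEMPLATE)
    match = matcher.fullmatch("相对论是谁发明的")
    if match:
        print(match.group("basicconcept"))
```

A table of such templates could be keyed by attribute (here “发明”) so that only the templates relevant to the extracted attribute are tried; the source does not describe how the lookup is organized.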

In step S14, the fuzzy statement of “相对论是谁发明的” is determined based on the regular expression found in step S13. For example: 相对论是(谁|什么人|哪个人|who|哪位仁兄)(发明|提出来|提出|)(的|呢|的呢|“空”), where “空” (empty) means that the part may be omitted. This fuzzy statement expands into many phrasings, which are input into the knowledge graph. This preserves the diversity of the sentences and ensures that the knowledge graph can be searched through the inverted index, so that the corresponding text of the corpus text is retrieved while retrieval speed improves.

For example, Q: “相对论是谁发明的” ("Who invented the theory of relativity?")

The knowledge graph returns the answer, A: “爱因斯坦” (Einstein).

In step S15, corpus generation is performed on the corpus text and the corresponding text through the regular expression, for example:

“相对论是什么人提出来的” “爱因斯坦” ("Who proposed the theory of relativity?" "Einstein")

“相对论是哪个人发明的呢” “爱因斯坦” ("Which person invented the theory of relativity?" "Einstein")

“相对论是who提出的” “爱因斯坦” ("Who proposed the theory of relativity?", with the English word "who" embedded, "Einstein")

“相对论是哪位仁兄发明的” “爱因斯坦” ("Which fellow invented the theory of relativity?" "Einstein")

In this way, multiple paired question-and-answer dialogue corpus entries are generated, and a sufficient number of such pairs is used to train the reading-comprehension question-answering language model.
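The expansion in steps S14 and S15 amounts to taking the Cartesian product of the alternatives in each group and pairing every resulting phrasing with the retrieved answer. The following sketch assumes a flat pattern (literal text plus non-nested `(a|b|c)` groups) and writes the “空” alternative as an empty alternative; `expand` and `make_pairs` are hypothetical helper names, not part of the source.

```python
import itertools
import re
from typing import List, Tuple

# Fuzzy statement from the step S14 example; a trailing "|" encodes the
# optional ("空") alternative.
FUZZY = "相对论是(谁|什么人|哪个人|who|哪位仁兄)(发明|提出来|提出|)(的|呢|的呢|)"


def expand(pattern: str) -> List[str]:
    """Expand literal text and flat (a|b|c) groups into every concrete phrasing."""
    slots = []
    for group, literal in re.findall(r"\(([^()]*)\)|([^()]+)", pattern):
        # A literal run contributes one choice; a group contributes its alternatives.
        slots.append([literal] if literal else group.split("|"))
    return ["".join(parts) for parts in itertools.product(*slots)]


def make_pairs(pattern: str, answer: str) -> List[Tuple[str, str]]:
    """Pair every expanded question phrasing with the same answer text."""
    return [(question, answer) for question in expand(pattern)]


if __name__ == "__main__":
    for question, answer in make_pairs(FUZZY, "爱因斯坦")[:4]:
        print(question, answer)
```

With five, four, and four alternatives in the three groups, this pattern yields 80 phrasings. Some of them are not natural questions (for instance when both optional groups are empty), so a practical system would likely filter the output; the source does not say whether it does.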

As this embodiment shows, fuzzy search is used in the knowledge graph, which improves retrieval recall. For example, “相对论是谁发明的呢” has an extra particle “呢” at the end of the sentence; if it is searched directly, the knowledge graph may be unable to handle the query and no result is returned. In knowledge graph retrieval, an inverted index is used to improve retrieval efficiency. Only then can multiple paired question-and-answer dialogue corpus entries be generated to train the reading-comprehension question-answering language model.
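The source does not describe how its inverted index is built, so the following is only a generic sketch of the idea: each term (entity name or attribute word) maps to the set of facts that mention it, and a query intersects the posting sets of the terms it contains instead of scanning every fact. The toy facts, the `ALIASES` table, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str, str]  # (entity, attribute, value)

# Toy knowledge graph; the second fact is illustrative filler.
FACTS: List[Fact] = [
    ("相对论", "发明", "爱因斯坦"),
    ("电话", "发明", "贝尔"),
]

# Hypothetical alias table so that different wordings reach the same attribute.
ALIASES: Dict[str, List[str]] = {"发明": ["发明", "提出"]}


def build_index(facts: List[Fact]) -> Dict[str, Set[int]]:
    """Map each entity and attribute term to the ids of the facts containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for fact_id, (entity, attribute, _value) in enumerate(facts):
        index[entity].add(fact_id)
        for term in ALIASES.get(attribute, [attribute]):
            index[term].add(fact_id)
    return index


def lookup(statement: str, index: Dict[str, Set[int]], facts: List[Fact]) -> List[str]:
    """Intersect the posting sets of all indexed terms found in the statement."""
    postings = [ids for term, ids in index.items() if term in statement]
    if not postings:
        return []
    hits = set.intersection(*postings)
    return [facts[i][2] for i in sorted(hits)]


if __name__ == "__main__":
    idx = build_index(FACTS)
    print(lookup("相对论是谁提出来的呢", idx, FACTS))
```

Because the lookup depends only on which indexed terms appear, a trailing particle such as “呢” does not affect the result in this sketch.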

As an implementation, in this embodiment, when the text amount is greater than the preset threshold, the corpus text is divided into a plurality of corpus text segments;

the entity and attribute for the knowledge graph are extracted from each corpus text segment;

based on the entities and attributes, a plurality of regular expressions matching the corpus text segments are queried;

a fuzzy statement of each corpus text segment is determined based on the corresponding regular expression, each fuzzy statement is input into the knowledge graph, and the corresponding text of each corpus text segment is determined according to the inverted index, wherein the corresponding text includes answer text or question text;

a corpus abbreviation of the corpus text is determined, and multiple [question text, corpus abbreviation, answer text] triples are generated from the corpus text segments and their corresponding texts;

corpus generation is performed on the [question text, corpus abbreviation, answer text] triples through the plurality of regular expressions to construct multiple paired question-and-answer dialogue corpus entries.

In this embodiment, when the corpus text is a whole paragraph of sentences, it is divided into a plurality of corpus text segments.

For example: "Qingshui Township is located 30 kilometers northeast of Fugu County. It adjoins Huangfu Township to the east, Haizemiao Township to the south, and Mugua Township and Zhaowujiaman Township to the west, and borders Ha Town to the north. It has a total area of 1,667 square kilometers and 38,265 mu of arable land. The township governs 15 administrative villages and 80 natural villages, with 2,489 households and 10,102 people, of whom 9,868 are agricultural population and 8,408 are land-reserved population. Qingshui Township lies on the Loess Plateau; rainfall is scarce all year, spring is windy and sandy, and natural resources are relatively scarce, with fairly abundant coal resources only in the southeast. The township's economy is based mainly on crop farming, and the main crops are broomcorn millet, foxtail millet, corn, potatoes, and mung beans. In addition, haihong fruit trees are planted in moderation, and animal husbandry has also developed to some extent. The people of Qingshui Township are simple and honest, public security is good, and residents live and work in peace. No criminal case has occurred for many consecutive years, and Yulin City and Fugu County have for many consecutive years rated it a safe township, a civilized market town, and an advanced unit in the comprehensive management of public security."

This paragraph is divided into segments such as “清水乡位于府谷县域东北部30公里处” ("Qingshui Township is located 30 kilometers northeast of Fugu County"), “总面积1667平方公里，总耕地面积为38265亩” ("total area of 1,667 square kilometers, total arable land of 38,265 mu"), and “全乡经济产业主更以农业种植业为主，主要的经济作物有磨子、谷子、玉米，土豆、绿豆等” ("the township's economy is based mainly on crop farming, and the main crops are broomcorn millet, foxtail millet, corn, potatoes, and mung beans"). Because the text is long, only some of the corpus text segments are shown.

The entity and attribute of each corpus text segment are extracted. For example, in “清水乡位于府谷县域东北部30公里处”, the entity is “清水乡” (Qingshui Township) and the attribute is “府谷县域东北部30公里处” (30 kilometers northeast of Fugu County). The remaining segments are handled in the same way and are not described again.

Similarly, “清水乡位于府谷县域东北部30公里处” is matched to the corresponding regular expression ${#basicconcept}(在|位于|处在)府谷县域(东北部|东南部|西南部|西北部)30(公里|千米)处.

Based on this regular expression, the fuzzy statement of “清水乡位于府谷县域东北部30公里处” is determined, each fuzzy statement is input into the knowledge graph, and the corresponding text of each corpus text segment is determined according to the inverted index. Because “清水乡位于府谷县域东北部30公里处” is an answer-type text, the corresponding question text is determined in the knowledge graph, which yields the question text “清水乡地址多少” ("What is the address of Qingshui Township?").

The abbreviation of the corpus text is determined, that is, a short name is extracted to represent the whole paragraph above and to serve as the basis for the question and answer. For example, the abbreviation is “清水镇简介” (Qingshui Town profile). The triple [清水乡地址多少，清水镇简介，清水乡位于府谷县域东北部30公里处] is generated from the corpus text segments and their corresponding texts;

Based on the determined triples, multiple paired question-and-answer dialogue corpus entries are constructed, for example:

“清水乡地址是啥” “清水乡处于府谷县域东北部30公里处” ("What is the address of Qingshui Township?" "Qingshui Township is located 30 kilometers (公里) northeast of Fugu County")

“清水乡地址在何处” “清水乡处于府谷县域东北部30千米处” ("Where is the address of Qingshui Township?" "Qingshui Township is located 30 kilometers (千米) northeast of Fugu County")

As this embodiment shows, high-quality paired question-and-answer dialogue corpus entries can also be extracted from corpus texts with a large amount of text, which further trains the reading-comprehension question-answering language model.
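The long-text path above can be sketched as: split the paragraph into segments, look up a question for each answer-type segment, and emit [question text, corpus abbreviation, answer text] triples. The source does not specify the splitting rule or the lookup interface, so splitting on the Chinese full stop and comma, the `find_question` callback standing in for the knowledge graph, and all names below are assumptions.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class Triple(NamedTuple):
    question: str
    abbreviation: str
    answer: str


def split_segments(text: str) -> List[str]:
    """Split a paragraph on Chinese full stops and commas (an assumed rule)."""
    return [segment for segment in re.split(r"[。，]", text) if segment]


def build_triples(
    segments: List[str],
    abbreviation: str,
    find_question: Callable[[str], Optional[str]],
) -> List[Triple]:
    """Keep the segments for which the lookup returns a question text."""
    triples = []
    for segment in segments:
        question = find_question(segment)
        if question is not None:
            triples.append(Triple(question, abbreviation, segment))
    return triples


if __name__ == "__main__":
    # A dict stands in for the knowledge graph lookup in this sketch.
    known = {"清水乡位于府谷县域东北部30公里处": "清水乡地址多少"}
    text = "清水乡位于府谷县域东北部30公里处，东与黄甫乡毗连。"
    print(build_triples(split_segments(text), "清水镇简介", known.get))
```

Each triple could then be passed through the same alternation expansion used for short texts to produce several question and answer phrasings.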

As an implementation, in this embodiment, after the regular expression matching the corpus text is queried, the method includes preprocessing the regular expression, including:

detecting each character of the regular expression in turn and, when any character is a preset wildcard, replacing that character with a specified character.

In this embodiment, the constraints of the algorithm mean that the regular expression cannot be used directly: wildcard characters such as ".", "*", "+", "$", "{", and "}" must not appear. Because these characters have special meanings in regular expressions, they are replaced. For example:

${#company}(公司)？的(什么|哪一？些|啥|哪一？类|哪一？款|哪一？种)(产品|东西|作品)(用到|用|使用|采用|运用|应用)了(与|跟|和)${@technology}(有关|相关)的技术。

is replaced with

#company(公司)？的(什么|哪一？些|啥|哪一？类|哪一？款|哪一？种)(产品|东西|作品)(用到|用|使用|采用|运用|应用)了(与|跟|和)@technology(有关|相关)的技术。

As an implementation, detecting each character of the regular expression in turn includes:

judging each character of the regular expression one by one through a recursive algorithm.
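A character-by-character recursive pass of the kind described here might look like the sketch below. The wildcard set is taken from the characters listed in the text; using the empty string as the "specified character" is an assumption inferred from the example, where `${#company}` becomes `#company`. The function name is hypothetical.

```python
# Characters the text lists as disallowed wildcards.
PRESET_WILDCARDS = {".", "*", "+", "$", "{", "}"}


def preprocess(pattern: str, replacement: str = "") -> str:
    """Recursively examine one character at a time, replacing preset wildcards."""
    if not pattern:
        return ""
    head, rest = pattern[0], pattern[1:]
    if head in PRESET_WILDCARDS:
        head = replacement
    return head + preprocess(rest, replacement)


if __name__ == "__main__":
    print(preprocess("${#company}(公司)的(产品|东西)"))
```

In Python the recursion depth is bounded (1000 frames by default), so this form only suits patterns shorter than that; an iterative loop or `str.translate` would give the same result without the limit.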

As this embodiment shows, introducing regular expressions into the algorithm imposes certain constraints; adjusting the regular expressions avoids these constraints and improves the stability of the method.

FIG. 2 is a schematic structural diagram of a question-and-answer corpus generation system provided by an embodiment of the invention. The system can execute the question-and-answer corpus generation method described in any of the above embodiments and is configured in a terminal.

The question-and-answer corpus generation system provided in this embodiment includes: a corpus receiving program module 11, an information determination program module 12, a regular expression query program module 13, a corresponding text determination program module 14, and a question-and-answer corpus generation program module 15.

The corpus receiving program module 11 is configured to receive a corpus text. The information determination program module 12 is configured to detect the text amount of the corpus text and, when the text amount is less than a preset threshold, determine the entity and attribute of the corpus text for use with a knowledge graph. The regular expression query program module 13 is configured to query, based on the entity and the attribute, a regular expression that matches the corpus text. The corresponding text determination program module 14 is configured to determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement into the knowledge graph, and determine the corresponding text of the corpus text according to an inverted index, wherein the corresponding text includes answer text and/or question text. The question-and-answer corpus generation program module 15 is configured to perform corpus generation on the corpus text and the corresponding text through the regular expression, so as to construct a plurality of paired question-and-answer dialogue corpus entries.

Further, the information determination program module is also configured to: when the text amount is greater than the preset threshold, divide the corpus text into a plurality of corpus text segments;

the information determination program module is configured to extract the entity and attribute for the knowledge graph from each corpus text segment;

the regular expression query program module is configured to query, based on the entities and attributes, a plurality of regular expressions matching the corpus text segments;

the corresponding text determination program module is configured to determine a fuzzy statement of each corpus text segment based on the corresponding regular expression, input each fuzzy statement into the knowledge graph, and determine the corresponding text of each corpus text segment according to the inverted index, wherein the corresponding text includes answer text or question text;

the question-and-answer corpus generation program module is configured to perform corpus generation on the [question text, corpus abbreviation, answer text] triples through the plurality of regular expressions to construct multiple paired question-and-answer dialogue corpus entries.

Further, the system also includes: a regular expression preprocessing program module, configured to

Further, detecting each character of the regular expression in turn includes:

An embodiment of the invention also provides a non-volatile computer storage medium that stores computer-executable instructions, and the computer-executable instructions can execute the question-and-answer corpus generation method in any of the above method embodiments;

As an implementation, the non-volatile computer storage medium of the invention stores computer-executable instructions, and the computer-executable instructions are set to:

receive a corpus text;

As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the question-and-answer corpus generation method in any of the above method embodiments.

The non-volatile computer-readable storage medium may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the device, and the like. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, and such remote memory may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

本发明实施例还提供一种电子设备，其包括：至少一个处理器，以及与所述至少一个处理器通信连接的存储器，其中，所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行本发明任一实施例的问答语料生成方法的步骤。An embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor , the instructions are executed by the at least one processor, so that the at least one processor can execute the steps of the method for generating a question and answer corpus according to any embodiment of the present invention.

本申请实施例的客户端以多种形式存在，包括但不限于：The clients in the embodiments of the present application exist in various forms, including but not limited to:

(1) Mobile communication devices: devices of this type have mobile communication functions, and their main purpose is to provide voice and data communication. Such terminals include smartphones, multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer devices: devices of this type belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, for example tablet computers.

(3) Portable entertainment devices: devices of this type can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.

(4) Other electronic devices with data processing functions.

In this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Furthermore, the terms "comprise" and "include" cover not only the elements listed, but also other elements not expressly listed, as well as elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course can also be implemented in hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims


1. A question-and-answer corpus generation method, comprising:

receiving a corpus text;

2. The method according to claim 1, wherein, when the text amount is greater than a preset threshold, the corpus text is divided into a plurality of corpus text segments;

3. The method according to claim 1, wherein, after querying the regular expression that matches the corpus text, the method comprises preprocessing the regular expression, including:

4. The method according to claim 3, wherein sequentially detecting each character of the regular expression comprises:

5. A question-and-answer corpus generation system, comprising:

a question-and-answer corpus generation program module, configured to perform corpus generation on the corpus text and the corresponding text through the regular expression, so as to construct a plurality of paired question-and-answer dialogue corpora.

6. The system according to claim 5, wherein the information determination program module is further configured to: when the text amount is greater than a preset threshold, divide the corpus text into a plurality of corpus text segments;

the question-and-answer corpus generation program module is configured to perform corpus generation on the [question text, corpus abbreviation, answer text] triples through the plurality of regular expressions, so as to construct a plurality of paired question-and-answer dialogue corpora.

7. The system according to claim 5, wherein the system further comprises a regular expression preprocessing program module, configured to:

sequentially detect each character of the regular expression and, when any character is a preset wildcard, replace that character with a specified character.

8. The system according to claim 7, wherein sequentially detecting each character of the regular expression comprises:

9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the method according to any one of claims 1-4.

10. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-4.
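
For illustration only, the regular expression preprocessing recited in claim 7 (scan each character in turn and replace any preset wildcard with a specified character) might be sketched as follows. This is a minimal sketch and not the patented implementation: the function name `preprocess_pattern`, the wildcard set `*` and `?`, and the replacement character `%` are all assumptions chosen for the example, since the text does not state which wildcards are preset or which character is specified.

```python
def preprocess_pattern(pattern, wildcards=frozenset("*?"), replacement="%"):
    """Scan a pattern character by character and replace wildcards.

    `wildcards` and `replacement` are hypothetical defaults; the source
    only says "preset wildcard" and "specified character".
    """
    result = []
    for ch in pattern:  # detect each character in order
        if ch in wildcards:
            # the character is a preset wildcard: swap in the specified character
            result.append(replacement)
        else:
            # any other character is kept unchanged
            result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    print(preprocess_pattern("what is the * of ?"))
```

With the assumed defaults, the call above prints `what is the % of %`. Passing the wildcard set and replacement as parameters keeps the sketch independent of whatever concrete values an implementation would preset.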