CN115033669A

Info

Publication number: CN115033669A
Application number: CN202210620506.7A
Authority: CN
Inventors: 陈征宇; 戴文艳; 黄炳裕; 林文国; 倪坤; 黄河; 洪章阳; 王伟宗
Original assignee: Evecom Information Technology Development Co ltd
Current assignee: Evecom Information Technology Development Co ltd
Priority date: 2022-06-01
Filing date: 2022-06-01
Publication date: 2022-09-09

Abstract

The invention discloses a new question mining method and terminal for an FAQ question answering system. The method acquires the number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system. When the number of new questions reaches a second preset value, the new questions are clustered with an improved DEC clustering algorithm according to the similar questions and the pre-trained language model is optimized, yielding a third cluster center vector and an optimized language model. According to the third cluster center vector, a most similar question is determined from the new questions to obtain a new standard question, which is saved to the knowledge base of the FAQ question answering system. The quality of the semantic vector model thus improves continuously. Because the improved DEC clustering algorithm adds the similar questions corresponding to standard questions when clustering the new questions, the original standard question vectors remain usable, and the cluster center vectors of the standard questions do not need a full update after every fine-tuning of the model. New questions can therefore be mined continuously while the semantic vector model is optimized.

Description

A new question mining method and terminal for an FAQ question answering system

Technical Field

The invention relates to the technical field of natural language processing, and in particular to a new question mining method and terminal for an FAQ question answering system.

Background

A question answering system based on question-answer pairs (an FAQ question answering system) is currently the most widely used kind of question answering system. It is essentially retrieval-based: text matching is used to retrieve from the knowledge base the question most similar to the user's input, and the answer to that question is returned. Existing text matching algorithms fall into two groups, traditional text matching algorithms and deep semantic matching algorithms. The former include TF-IDF (Term Frequency-Inverse Document Frequency), BM25 (Best Match 25), and the Jaccard similarity coefficient, and mainly address matching at the lexical level. The latter include the classic DSSM (Deep Structured Semantic Models, a two-tower model) and its derivatives. A two-tower model encodes two pieces of text into fixed-length vectors and judges their similarity by the cosine similarity between the two vectors. The encoder can be as simple as Word2Vec (word vectors) or as complex as LSTM (Long Short-Term Memory), CNN (Convolutional Neural Network), or BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language model).

Commonly used clustering algorithms include K-Means, Gaussian mixture models, and spectral clustering. The first two are fast and widely applicable; spectral clustering allows a more flexible distance metric, adapts better to the data distribution, and gives better clustering performance.

However, the prior art still has the following shortcomings:

1. Traditional text matching algorithms match on lexical overlap and are severely limited in semantic matching, while the DSSM two-tower model must be trained from scratch on a large amount of labeled data, which is costly and unsuitable for low-resource applications;

2. A traditional FAQ question answering system cannot answer questions that are not in the knowledge base;

3. The distance metrics of the K-Means clustering algorithm and Gaussian mixture models apply only in the original data space, so when the feature dimension is high, dimensionality reduction is usually needed first and the results are poor. Spectral clustering needs to compute the Laplacian matrix of the whole graph, which consumes a large amount of memory when the feature dimension is high. In addition, none of K-Means, Gaussian mixture models, or spectral clustering can fine-tune the feature vectors while clustering.

Summary of the Invention

The technical problem to be solved by the invention is to provide a new question mining method and terminal for an FAQ question answering system that can continuously mine new questions and optimize the semantic vector model.

To solve the above technical problem, one technical solution adopted by the invention is:

A new question mining method for an FAQ question answering system, comprising the steps of:

acquiring the number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system;

determining whether the number of new questions reaches a second preset value, and if so, clustering the new questions with an improved DEC clustering algorithm according to the similar questions and optimizing the pre-trained language model, to obtain a third cluster center vector and an optimized language model;

determining, according to the third cluster center vector, a most similar question from the new questions to obtain a new standard question, and saving the new standard question to the knowledge base of the FAQ question answering system.

To solve the above technical problem, another technical solution adopted by the invention is:

A new question mining terminal for an FAQ question answering system, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:

The beneficial effects of the invention are as follows. The number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system are acquired. When the number of new questions reaches the second preset value, the new questions are clustered with the improved DEC clustering algorithm according to the similar questions and the pre-trained language model is optimized, yielding a third cluster center vector and an optimized language model. According to the third cluster center vector, a most similar question is determined from the new questions to obtain a new standard question, which is saved to the knowledge base of the FAQ question answering system. The pre-trained language model is updated incrementally while the new questions are clustered, so the quality of the semantic vector model improves continuously. Because the improved DEC clustering algorithm adds the similar questions corresponding to standard questions when clustering the new questions, the original standard question vectors remain usable after the model is optimized, and the cluster center vectors of the standard questions do not need a full update after every fine-tuning. New questions can thus be mined continuously while the semantic vector model is optimized.

Brief Description of the Drawings

FIG. 1 is a flowchart of the steps of a new question mining method for an FAQ question answering system according to an embodiment of the invention;

FIG. 2 is a schematic structural diagram of a new question mining terminal for an FAQ question answering system according to an embodiment of the invention;

FIG. 3 is a flowchart of new question mining in the new question mining method for an FAQ question answering system according to an embodiment of the invention.

Detailed Description

To describe the technical content, objectives, and effects of the invention in detail, embodiments are described below with reference to the accompanying drawings.

Referring to FIG. 1, an embodiment of the invention provides a new question mining method for an FAQ question answering system, comprising the steps of:

As can be seen from the above description, the beneficial effects of the invention are as follows. The number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system are acquired. When the number of new questions reaches the second preset value, the new questions are clustered with the improved DEC clustering algorithm according to the similar questions and the pre-trained language model is optimized, yielding a third cluster center vector and an optimized language model. According to the third cluster center vector, a most similar question is determined from the new questions to obtain a new standard question, which is saved to the knowledge base of the FAQ question answering system. The pre-trained language model is updated incrementally while the new questions are clustered, so the quality of the semantic vector model improves continuously. Because the improved DEC clustering algorithm adds the similar questions corresponding to standard questions when clustering the new questions, the original standard question vectors remain usable after the model is optimized, and the cluster center vectors of the standard questions do not need a full update after every fine-tuning. New questions can thus be mined continuously while the semantic vector model is optimized.

Further, before acquiring the number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system, the method comprises the steps of:

acquiring the standard questions in the knowledge base of the FAQ question answering system and the answers corresponding to the standard questions;

converting the standard questions into first cluster center vectors using the pre-trained language model, and storing the first cluster center vectors in a vector retrieval library;

receiving a user question, and converting the user question into a second cluster center vector using the pre-trained language model;

computing the first cluster center vectors against the second cluster center vector pairwise, to obtain a plurality of cosine similarities;

determining whether any of the cosine similarities is strictly greater than a first preset value; if so, determining from the knowledge base, according to the first cluster center vector corresponding to that cosine similarity, a target standard question and the answer corresponding to it, and storing the user question in the database marked as a similar question; if not, storing the user question in the database marked as a new question.

As can be seen from the above description, a pre-trained language model is used as the semantic vector model. Because a pre-trained language model has strong language representation ability and yields good vector representations, the FAQ question answering system has good question retrieval ability even without fine-tuning.
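The routing described above (answer from the knowledge base when a cosine similarity exceeds the first preset value, otherwise record a new question) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy 2-D vectors standing in for language-model embeddings, and the threshold value 0.8 are all assumptions made for the example.

```python
import numpy as np

def cosine_similarities(query_vec, standard_vecs):
    """Cosine similarity between one query vector and each row of standard_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    s = standard_vecs / np.linalg.norm(standard_vecs, axis=1, keepdims=True)
    return s @ q

def route_question(query_vec, standard_vecs, threshold):
    """Return ("similar", index of the best standard question) or ("new", None)."""
    sims = cosine_similarities(query_vec, standard_vecs)
    best = int(np.argmax(sims))
    if sims[best] > threshold:  # strictly greater than the first preset value
        return "similar", best
    return "new", None

# Toy vectors; a real system would use embeddings from the language model.
standard_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
print(route_question(np.array([0.9, 0.1]), standard_vecs, threshold=0.8))  # ('similar', 0)
print(route_question(np.array([1.0, 1.0]), standard_vecs, threshold=0.8))  # ('new', None)
```

The second query sits midway between both standard questions (cosine similarity of about 0.707 to each), so it falls below the assumed threshold and is recorded as a new question.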

Further, clustering the new questions with the improved DEC clustering algorithm according to the similar questions and optimizing the pre-trained language model, to obtain the third cluster center vector and the optimized language model, comprises:

randomly sampling a first preset number of similar questions from the similar questions to obtain a similar question set, and obtaining a question set from the similar question set and the new questions;

determining the number of clusters according to the similar question set and the new questions;

initializing the encoding layer of the improved DEC clustering algorithm with the pre-trained language model, and initializing fourth cluster center vectors corresponding to the number of clusters;

computing, through the encoding layer, the fourth cluster center vectors and the question set vectors corresponding to the question set;

computing, with a t-distribution, a first similarity between the question set vectors and the fourth cluster center vectors, and taking the first similarity as the probability distribution of assigning the question set vectors to the fourth cluster center vectors, to obtain a first probability distribution;

raising the first probability distribution to the second power and then normalizing it, to obtain a target distribution;

randomly sampling a second preset number of questions from the question set to obtain a target question set, and computing the target question set vectors;

computing, with the t-distribution, a second similarity between the target question set vectors and the fourth cluster center vectors, and taking the second similarity as the probability distribution of assigning the target question set vectors to the fourth cluster center vectors, to obtain a second probability distribution;

computing the KL divergence between the second probability distribution and the target distribution, and using backpropagated gradients to update the encoding layer and the third cluster center vector among the fourth cluster center vectors, to obtain an updated encoding layer and an updated third cluster center vector;

obtaining the optimized language model from the updated encoding layer.

As can be seen from the above description, the improved DEC clustering algorithm fine-tunes the semantic vector model while clustering. It achieves better clustering than traditional clustering algorithms and can mine new questions that are not in the knowledge base, so the knowledge base keeps growing and the question answering ability of the FAQ question answering system improves.
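The t-distribution assignment, the squared-and-normalized target distribution, and the KL divergence named in the steps above can be illustrated with a small NumPy sketch. It assumes that DEC here refers to Deep Embedded Clustering and follows that algorithm's usual formulation (one degree of freedom, loss KL(P‖Q)); the text does not spell out these details, so `alpha`, the KL direction, the function names, and the toy data are assumptions. The gradient update of the encoder is not shown.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Student's t kernel: q[i, j] is the probability of assigning point i to center j."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Square q, divide by each cluster's soft frequency, then renormalize each row."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q), summed over all points and clusters."""
    return float(np.sum(p * np.log(p / q)))

# Toy embeddings: two points near each of two centers.
z = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])

q = soft_assign(z, centers)
p = target_distribution(q)
print(q.round(3))
print(p.round(3))
print(kl_divergence(p, q))
```

Squaring sharpens confident assignments, so the target distribution pulls each point toward the center it already favors; minimizing the divergence moves the embeddings and the trainable centers accordingly.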

Further, initializing the fourth cluster center vectors corresponding to the number of clusters comprises:

obtaining the first cluster center vectors of the standard questions corresponding to the similar questions;

computing the third cluster center vectors of the new questions using the K-Means algorithm.

As can be seen from the above description, the first cluster center vectors of the standard questions corresponding to the similar questions are obtained and are not updated by backpropagation, while the third cluster center vectors of the new questions are computed with the K-Means algorithm and are updated by backpropagation. Keeping the cluster center vectors of those standard questions fixed means that the vectors the fine-tuned BERT model computes for them differ little from the vectors computed by the original model. A full update of the standard questions' cluster center vectors is therefore not needed after every fine-tuning, which preserves the efficiency of the algorithm.
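The split between frozen and trainable centers can be sketched as below, under stated assumptions: a plain Lloyd's K-Means written in NumPy stands in for whatever K-Means implementation is actually used, and `init_centers`, `apply_center_gradient`, the boolean mask, the learning rate, and the toy vectors are all hypothetical names and values chosen for illustration.

```python
import numpy as np

def kmeans(x, m, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns m centers for the rows of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=m, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(m):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return centers

def init_centers(standard_vecs, new_question_vecs, m):
    """Stack n frozen standard-question vectors with m k-means centers (k = n + m)."""
    new_centers = kmeans(new_question_vecs, m)
    centers = np.vstack([standard_vecs, new_centers])
    trainable = np.array([False] * len(standard_vecs) + [True] * m)
    return centers, trainable

def apply_center_gradient(centers, grad, trainable, lr=0.1):
    """Gradient step that leaves the frozen (standard-question) rows untouched."""
    return centers - lr * grad * trainable[:, None]

# Toy data: 2 standard-question vectors, 6 new-question vectors in two groups.
standard_vecs = np.array([[0.0, 0.0], [5.0, 5.0]])
new_vecs = np.array([[9.0, 0.0], [9.1, 0.2], [8.9, -0.1],
                     [0.0, 9.0], [0.2, 9.1], [-0.1, 8.9]])
centers, trainable = init_centers(standard_vecs, new_vecs, m=2)
updated = apply_center_gradient(centers, np.ones_like(centers), trainable)
print(centers.shape, trainable)
```

Masking the gradient is one simple way to express "not updated by backpropagation"; in an autograd framework the same effect would typically come from excluding the frozen centers from the optimized parameters.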

Further, after clustering the new questions with the improved DEC clustering algorithm according to the similar questions and optimizing the pre-trained language model, to obtain the third cluster center vector and the optimized language model, the method comprises the steps of:

obtaining the model optimization count and incrementing it by one, to obtain an updated model optimization count;

determining whether the updated model optimization count reaches a third preset value, and if so, randomly selecting a standard question, converting the standard question into a fifth cluster center vector using the pre-trained language model, and converting the standard question into a sixth cluster center vector using the optimized language model;

computing the vector similarity between the fifth cluster center vector and the sixth cluster center vector, and determining whether the vector similarity is less than a fourth preset value; if so, returning, with the optimized language model, to the step of converting the standard questions into first cluster center vectors using the pre-trained language model.

As can be seen from the above description, a full update with the optimized language model is performed only after the pre-trained language model has been optimized a certain number of times and differs substantially from the original model. This improves the working efficiency of the system and avoids unstable access caused by overly frequent updates. In addition, the cluster center vectors computed by the optimized language model lie closer to the cluster centers of their classes and are better separated, which further improves question answering.
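The two-stage check above (enough optimization rounds, then a drift test on one randomly chosen standard question) might look like the following sketch. The function name, the parameter names, and the toy vectors are hypothetical; the default similarity threshold of 0.95 is the fourth preset value used in the embodiment, and the count threshold of 5 is an arbitrary example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def needs_full_refresh(opt_count, old_vec, new_vec, min_count, min_similarity=0.95):
    """True only when enough optimization rounds have run AND the sampled
    standard question's vector has drifted below the similarity threshold."""
    if opt_count < min_count:
        return False
    return cosine(old_vec, new_vec) < min_similarity

old_vec = np.array([1.0, 0.0])            # vector from the original model
small_drift = np.array([1.0, 0.1])        # cosine similarity about 0.995
large_drift = np.array([1.0, 1.0])        # cosine similarity about 0.707

print(needs_full_refresh(5, old_vec, small_drift, min_count=5))  # False
print(needs_full_refresh(5, old_vec, large_drift, min_count=5))  # True
print(needs_full_refresh(2, old_vec, large_drift, min_count=5))  # False
```

Only the second case triggers re-encoding of all standard questions, which is what keeps full updates infrequent.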

Referring to FIG. 2, a new question mining terminal for an FAQ question answering system comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:

Further, after clustering the new questions with the improved DEC clustering algorithm according to the similar questions and optimizing the pre-trained language model, to obtain the third cluster center vector and the optimized language model, the steps include:

The new question mining method and terminal described above are applicable to FAQ question answering systems and are described below through specific embodiments:

Embodiment 1

Referring to FIG. 1 and FIG. 3, the new question mining method for an FAQ question answering system of this embodiment comprises the steps of:

S0. Acquire the standard questions in the knowledge base of the FAQ question answering system and the answers corresponding to the standard questions;

S1. Convert the standard questions into first cluster center vectors using the pre-trained language model, and store the first cluster center vectors in a vector retrieval library;

The pre-trained language model may be a BERT model, a RoBERTa model, an ERNIE model, or the like; in this embodiment, the pre-trained language model is a BERT model;

S2. Receive a user question, and convert the user question into a second cluster center vector using the pre-trained language model, as shown in FIG. 3;

S3. Compute the first cluster center vectors against the second cluster center vector pairwise, to obtain a plurality of cosine similarities;

S4. Determine whether any of the cosine similarities is strictly greater than a first preset value; if so, execute S41; if not, execute S42;

The first preset value can be set freely according to the actual situation;

S41. According to the first cluster center vector corresponding to the cosine similarity that is strictly greater than the first preset value, determine from the knowledge base a target standard question and the answer corresponding to it, and store the user question in the database marked as a similar question, as shown in FIG. 3;

S42. Store the user question in the database marked as a new question, as shown in FIG. 3;

In another optional implementation, S4 sorts the cosine similarities from highest to lowest, selects the top preset number of cosine similarities from the sorted list, determines from the knowledge base the target standard questions and their corresponding answers according to those cosine similarities, and stores the user question in the database marked as a similar question;

S5. Acquire the number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system;

S6. Determine whether the number of new questions reaches a second preset value; if so, execute S61; if not, return to S5;

The second preset value can be set freely according to the actual situation;

S61. Cluster the new questions with the improved DEC clustering algorithm according to the similar questions and optimize the pre-trained language model, to obtain a third cluster center vector and an optimized language model, as shown in FIG. 3, specifically including:

S610. Randomly sample a first preset number of similar questions from the similar questions to obtain a similar question set, and obtain a question set from the similar question set and the new questions;

Specifically, n similar questions are randomly sampled from the similar questions to obtain the similar question set N, and the question set D is obtained from the similar question set N and the new questions M;

S611. Determine the number of clusters according to the similar question set and the new questions;

Specifically, the number of clusters k = n + m is determined from the similar question set N and the new questions M, where m is the number of categories of the new questions M and can be user-defined, and n is the number of similar questions sampled above, each similar question counting as one category;

S612. Initialize the encoding layer of the improved DEC clustering algorithm with the pre-trained language model, and initialize fourth cluster center vectors corresponding to the number of clusters, specifically including:

S6121. Initialize the encoding layer of the improved DEC clustering algorithm with the pre-trained language model;

Specifically, the BERT model is used to initialize the encoding layer W_enc of the improved DEC clustering algorithm;

S6122. Obtain the first cluster center vectors of the standard questions corresponding to the similar questions;

Specifically, the cluster center vectors of the standard questions corresponding to the n similar questions are those computed by the BERT model; since they were already computed in S1, the first cluster center vectors V_n of these standard questions can be obtained directly here;

S6123. Compute the third cluster center vectors of the new questions using the K-Means algorithm;

Specifically, the K-Means algorithm is used to compute m third cluster center vectors V_m of the new questions M, and the k fourth cluster center vectors are initialized as V_k = V_n + V_m, that is, the n first cluster center vectors V_n together with the m third cluster center vectors V_m;

S613. Compute, through the encoding layer, the fourth cluster center vectors and the question set vectors corresponding to the question set;

Specifically, the fourth cluster center vectors V_k and the question set vectors V_D corresponding to the question set D are computed through the encoding layer W_enc;

S614. Compute, with a t-distribution, a first similarity between the question set vectors and the fourth cluster center vectors, and take the first similarity as the probability distribution of assigning the question set vectors to the fourth cluster center vectors, to obtain a first probability distribution;

Specifically, a t-distribution is used as the kernel to compute the first similarity between the question set vectors V_D and the fourth cluster center vectors V_k, and the first similarity is taken as the probability distribution of assigning the question set vectors V_D to the fourth cluster center vectors V_k, giving the first probability distribution q;

S615. Raise the first probability distribution to the second power and then normalize it, to obtain a target distribution;

Specifically, the first probability distribution q is raised to the second power and then normalized by the frequency of each cluster, giving the target distribution p_dk;

S616. Randomly sample a second preset number of questions from the question set to obtain a target question set, and compute the target question set vectors;

Specifically, a second preset number of questions are randomly sampled from the question set D to obtain the target question set d, and the target question set vectors V_d are computed;

S617. Compute, with the t-distribution, a second similarity between the target question set vectors and the fourth cluster center vectors, and take the second similarity as the probability distribution of assigning the target question set vectors to the fourth cluster center vectors, to obtain a second probability distribution;

Specifically, the t-distribution is used to compute the second similarity between the target question set vectors V_d and the fourth cluster center vectors V_k, and the second similarity is taken as the probability distribution of assigning the target question set vectors V_d to the fourth cluster center vectors V_k, giving the second probability distribution q_dk;

S618. Compute the KL divergence between the second probability distribution and the target distribution, and use backpropagated gradients to update the encoding layer and the third cluster center vectors among the fourth cluster center vectors, to obtain an updated encoding layer and updated third cluster center vectors;

Specifically, the KL divergence between the second probability distribution q_dk and the target distribution p_dk is computed, and backpropagated gradients are used to update the encoding layer W_enc and the third cluster center vectors V_m among the fourth cluster center vectors V_k, giving the updated encoding layer and the updated third cluster center vectors;

S619. Obtain the optimized language model from the updated encoding layer;

Specifically, the optimized BERT model is obtained from the updated encoding layer;
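Steps S617 and S618 can be illustrated with a minimal sketch in plain Python. It assumes the Student's t kernel with one degree of freedom used in the original DEC algorithm, since the text above only says "t-distribution". The function names `soft_assign` and `kl_divergence` and the sample vectors are hypothetical. The target distribution p_dk is treated as an input here because it is defined in an earlier step.

```python
import math


def soft_assign(vectors, centers, alpha=1.0):
    """Return q[d][k], the probability of assigning vector d to center k.

    Assumption: the Student's t kernel from DEC, where q_dk is
    proportional to (1 + ||v_d - v_k||^2 / alpha) ** (-(alpha + 1) / 2).
    """
    q = []
    for v in vectors:
        weights = []
        for c in centers:
            sq_dist = sum((a - b) ** 2 for a, b in zip(v, c))
            weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        # Normalize so that each row is a probability distribution.
        q.append([w / total for w in weights])
    return q


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all (question, cluster) pairs."""
    return sum(
        p_dk * math.log((p_dk + eps) / (q_dk + eps))
        for p_row, q_row in zip(p, q)
        for p_dk, q_dk in zip(p_row, q_row)
        if p_dk > 0.0
    )


# Hypothetical 2-D vectors: two questions and two cluster centers.
V_d = [[0.0, 0.0], [1.0, 1.0]]
V_k = [[0.0, 0.0], [1.0, 1.0]]
q_dk = soft_assign(V_d, V_k)
```

In a training loop, the KL value would serve as the loss. Following S618, its gradient would be applied to the encoding layer W_enc and to the rows of V_k that belong to V_m only.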

S62. Obtain the model optimization count and increment it by one to obtain an updated model optimization count;

S63. Determine whether the updated model optimization count has reached a third preset value; if so, randomly select a standard question, convert it into a fifth cluster center vector using the pre-trained language model, and convert it into a sixth cluster center vector using the optimized language model; if not, take no action;

The third preset value can be set freely according to actual conditions;

S64. Compute the vector similarity between the fifth cluster center vector and the sixth cluster center vector, and determine whether the vector similarity is less than a fourth preset value; if it is, return to the step of converting the standard questions into first cluster center vectors using the pre-trained language model and execute that step with the optimized language model; otherwise, take no action;

The fourth preset value can be set freely according to actual conditions; in this embodiment, the fourth preset value is 0.95;
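The check in S63 and S64 can be sketched as follows. The text does not name the similarity measure, so cosine similarity is assumed here. The function names and the parameter `count_threshold`, which stands in for the third preset value, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def needs_full_update(optimization_count, old_vec, new_vec,
                      count_threshold, similarity_threshold=0.95):
    """Return True when the standard-question vectors should be re-encoded.

    old_vec: the sampled standard question encoded by the pre-trained model
             (the fifth cluster center vector).
    new_vec: the same question encoded by the optimized model
             (the sixth cluster center vector).
    similarity_threshold: the fourth preset value, 0.95 in this embodiment.
    """
    if optimization_count < count_threshold:
        # S63: too few fine-tuning rounds so far, so the check is skipped.
        return False
    # S64: a low similarity means the optimized model has drifted.
    return cosine_similarity(old_vec, new_vec) < similarity_threshold
```

With this split, the relatively expensive re-encoding of every standard question runs only when both conditions hold, which matches the stated aim of avoiding a full update after each fine-tuning round.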

S7. Determine a most similar question from the new questions according to the third cluster center vector to obtain a new standard question, and save the new standard question to the knowledge base of the FAQ question answering system, specifically including:

S71. Determine a most similar question from the new questions according to the updated third cluster center vector to obtain the new standard question;

S72. Determine the answer corresponding to the new standard question, and save the new standard question and its corresponding answer to the knowledge base of the FAQ question answering system;

As shown in FIG. 3, the answer corresponding to the new standard question is determined through manual review and entry.
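Step S71 can be illustrated with a short sketch that picks, for one updated cluster center, the new question whose vector lies closest to it. Cosine similarity is an assumption, since the text only says "most similar". The function name `pick_new_standard_question` and the sample data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_new_standard_question(center, questions, question_vectors):
    """Return the new question whose vector is most similar to the center."""
    best_index = max(
        range(len(questions)),
        key=lambda i: cosine_similarity(center, question_vectors[i]),
    )
    return questions[best_index]


# Hypothetical new questions and their 2-D vectors.
new_questions = ["How do I reset my password?", "Where is the office?"]
new_vectors = [[0.9, 0.1], [0.2, 0.8]]
updated_center = [1.0, 0.0]
candidate = pick_new_standard_question(updated_center, new_questions, new_vectors)
```

The selected question is only a candidate. Per S72, its answer is still supplied through manual review before both are saved to the knowledge base.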

Embodiment 2

Referring to FIG. 2, the new question mining terminal for an FAQ question answering system of this embodiment includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements each step of the new question mining method for an FAQ question answering system of Embodiment 1.

In summary, in the new question mining method and terminal for an FAQ question answering system provided by the present invention, when the number of new questions reaches the second preset value, the new questions are clustered according to the similar questions using the improved DEC clustering algorithm and the pre-trained language model is optimized, yielding a third cluster center vector and an optimized language model. A most similar question is determined from the new questions according to the third cluster center vector to obtain a new standard question, which is saved to the knowledge base of the FAQ question answering system. When the updated model optimization count reaches the third preset value, a standard question is randomly selected, converted into a fifth cluster center vector using the pre-trained language model, and converted into a sixth cluster center vector using the optimized language model; the vector similarity of the two is computed, and when it is less than the fourth preset value, a full update is performed using the optimized language model. The pre-trained language model is thus incrementally updated while the new questions are clustered, continuously improving the quality of the semantic vector model. Because the improved DEC clustering algorithm adds the similar questions corresponding to the standard questions when clustering the new questions, the original standard question vectors remain applicable after the model is optimized. The cluster center vectors of the standard questions therefore do not need a full update after every fine-tuning; a full update is performed only after multiple rounds of fine-tuning, once the optimized model differs substantially from the original model. This improves the working efficiency of the system and avoids the unstable access caused by overly frequent updates, so that new questions are continuously mined and the semantic vector model is continuously optimized.

The above are merely embodiments of the present invention and do not thereby limit its patent scope. Any equivalent transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in a related technical field, is likewise included within the scope of patent protection of the present invention.

Claims


1. A new question mining method for an FAQ question answering system, characterized by comprising the steps of:

2. The new question mining method for an FAQ question answering system according to claim 1, characterized in that, before acquiring the number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system, the method comprises the steps of:

3. The new question mining method for an FAQ question answering system according to claim 2, characterized in that clustering the new questions according to the similar questions using the improved DEC clustering algorithm and optimizing the pre-trained language model to obtain the third cluster center vector and the optimized language model comprises:

computing the KL divergence between the second probability distribution and the target distribution, and using back-propagated gradients to update the encoding layer and the third cluster center vector within the fourth cluster center vector, obtaining an updated encoding layer and an updated third cluster center vector;

4. The new question mining method for an FAQ question answering system according to claim 3, characterized in that initializing the fourth cluster center vector corresponding to the number of clusters comprises:

5. The new question mining method for an FAQ question answering system according to claim 2, characterized in that, after clustering the new questions according to the similar questions using the improved DEC clustering algorithm and optimizing the pre-trained language model to obtain the third cluster center vector and the optimized language model, the method comprises the steps of:

6. A new question mining terminal for an FAQ question answering system, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer program:

7. The new question mining terminal for an FAQ question answering system according to claim 6, characterized in that, before acquiring the number of new questions, the similar questions, and the pre-trained language model corresponding to the FAQ question answering system, the steps comprise:

8. The new question mining terminal for an FAQ question answering system according to claim 7, characterized in that clustering the new questions according to the similar questions using the improved DEC clustering algorithm and optimizing the pre-trained language model to obtain the third cluster center vector and the optimized language model comprises:

9. The new question mining terminal for an FAQ question answering system according to claim 8, characterized in that initializing the fourth cluster center vector corresponding to the number of clusters comprises:

10. The new question mining terminal for an FAQ question answering system according to claim 7, characterized in that, after clustering the new questions according to the similar questions using the improved DEC clustering algorithm and optimizing the pre-trained language model to obtain the third cluster center vector and the optimized language model, the steps comprise: