CN113641778B - Topic identification method of dialogue text - Google Patents

Topic identification method of dialogue text

Info

Publication number
CN113641778B
CN113641778B (granted from application CN202011191264.1A; other publication: CN113641778A)
Authority
CN
China
Prior art keywords
dialogue
topic
text
sentence
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011191264.1A
Other languages
Chinese (zh)
Other versions
CN113641778A (en)
Inventor
陈杭升
李建红
吴向宏
韩翊
陈耀军
姜炯挺
孙灵
林昊
翁张力
张湘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Zhejiang Huayun Information Technology Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Zhejiang Huayun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd and Zhejiang Huayun Information Technology Co Ltd
Priority to CN202011191264.1A
Publication of CN113641778A
Application granted
Publication of CN113641778B
Legal status: Active
Anticipated expiration


Abstract

Translated from Chinese

The present invention proposes a topic identification method for dialogue text, comprising the following steps: S1, preprocess the dialogue text on the basis of the existing power-domain ontology dictionary and general dictionary, including word segmentation, part-of-speech tagging, and word-frequency feature extraction; S2, add new attribute entries to those dictionaries, covering power-specific vocabulary, supplier-name vocabulary, and event keywords; S3, perform next-sentence prediction on single-round dialogue, using the coherence of adjacent sentences to judge whether they share a topic; S4, perform dialogue-interruption cross processing to obtain all same-topic dialogue sets; S5, identify suppliers. The method screens out redundant content irrelevant to the topic and identifies and groups dialogue texts that share a topic.

Description

Translated from Chinese
A topic identification method for dialogue text

Technical Field

The invention relates to the technical field of power systems, and in particular to a topic identification method for dialogue text.

Background

During daily operation and maintenance, technicians and managers of power-grid enterprises communicate through messaging software such as RTX (Real Time eXpert), WeChat, and DingTalk. These conversations carry a good deal of topic information, such as discussions of a supplier's product quality. Mining such power-dialogue texts can yield information on many different topics. However, dialogue texts often contain redundant content irrelevant to any topic, along with implicit evaluation targets and cross-interruption phenomena, so partitioning dialogue text by topic is very difficult.

Because power-dialogue text is highly specialized, it differs considerably from common vocabularies. To improve the accuracy of text understanding, the ontology dictionary of the dialogue's business domain must be expanded: natural-language-processing techniques, including word segmentation and part-of-speech tagging, are used to extract domain-specific vocabulary from power-dialogue text, and attribute entries are added to the dictionary, including power-specific vocabulary, supplier-name vocabulary, and event keywords, laying the foundation for subsequent topic grouping. The accuracy of topic grouping is the key foundation for, and an important guarantee of, any later research on topic content. No related research currently exists, so studying topic-grouping methods for dialogue text is both necessary and urgent.

Summary of the Invention

The present invention solves the problem that partitioning dialogue text by topic is very difficult. It proposes a topic identification method for dialogue text that can screen out redundant content irrelevant to the topic and identify and group dialogue texts sharing a topic.

To achieve the above objectives, the following technical solution is proposed:

A topic identification method for dialogue text comprises the following steps:

S1, preprocess the dialogue text on the basis of the existing power-domain ontology dictionary and general dictionary, including word segmentation, part-of-speech tagging, and word-frequency feature extraction;

S2, add new attribute entries to the existing power-domain ontology dictionary and general dictionary, including power-specific vocabulary, supplier-name vocabulary, and event keywords;

S3, perform next-sentence prediction on single-round dialogue, using the coherence of adjacent sentences to judge whether they share a topic;

S4, perform dialogue-interruption cross processing to obtain all same-topic dialogue sets;

S5, identify suppliers: on the basis of all same-topic dialogue sets, extract the supplier information in each topic's dialogue set according to the supplier-information category in the power-business ontology dictionary, identify implicit evaluation targets using the upward-proximity principle, and then remove irrelevant redundant topic content.

The proposed method can group dialogue text by topic and identify supplier information, addressing the problems of irrelevant redundant content, implicit evaluation targets, and cross-interruption in dialogue text, and laying a foundation for subsequent dialogue-text analysis.

Preferably, step S3 specifically comprises the following steps:

S301, use the Transformer-based bidirectional-encoder next-sentence-prediction model (BERT-NSP): take two dialogue texts as input, prepend the [CLS] token, transform the input to output the hidden vector corresponding to each character, and compute the next-sentence-prediction matching probability for single-round power-business dialogue text, obtaining the model output:

p = softmax(CW^T)

where p is the next-sentence-prediction matching-probability vector; C is the final hidden state of the first token [CLS] of the BERT model; and W is the weight matrix of the fully connected layer. The model is in fact a binary classifier, so p is a two-dimensional vector whose components are the probabilities that the next-sentence prediction is 0 or 1, i.e., the probabilities that the two sentences are unrelated or related; the next-sentence-prediction probability P_NS takes the component of this vector indicating that the two sentences are related.

S302, compute the cosine similarity of the two sentences of single-round dialogue text as the criterion for judging the coherence of repeated content:

S = (A · B) / (‖A‖ ‖B‖)

where S is the cosine similarity of adjacent dialogue turns, and A = (a1, a2, …, an) and B = (b1, b2, …, bn) are the n-dimensional word-frequency feature vectors obtained from the word-frequency vectorization of the two sentences;

S303, fuse the above two predictions and define the semantic-relevance matching degree of a single-round dialogue:

M = (1 − α)P_NS + αS

where M is the semantic-relevance matching degree of a single-round dialogue and α is the cosine-similarity weight coefficient. M is greater than or equal to 0, and a larger M indicates stronger matching relevance between the two sentences. Following the binary-classification convention for P_NS, when M is greater than or equal to the set threshold, the adjacent sentences are judged related and assigned to the same dialogue topic; when M is less than the set threshold, they are judged unrelated. The significance of this fusion is that it combines deep features with similarity features, jointly considering the linguistic connection between adjacent sentences and improving matching accuracy. The coefficient α balances the weight of deep features against similarity features; optimizing it yields the best model for single-round dialogue-text judgment.

Preferably, the set threshold is 0.5.

Preferably, step S4 specifically comprises the following steps:

S401, let d_i and d_j be two texts taken in order from the dialogue-text set D, and determine the number of turns between them;

S402, if the turn interval is within the allowed range, evaluate the single-round dialogue relevance matching degree M for the two texts;

S403, if the turn interval is outside the allowed range, check d_i for link (@user-ID) information. If it contains a link, evaluate the matching degree M between the linking sentence and each of the linked user's two nearest dialogue texts (the closest preceding and following turns), and group same-topic dialogue sets according to the results; if it contains no link, the topic dialogue set to which this text belongs is complete;

S404, repeat steps S401 to S403 until the dialogue-text set D is empty, obtaining all same-topic dialogue sets.

Preferably, the allowed turn interval is 3 turns or fewer. Experience with dialogue-text research suggests that dialogue turns more than 3 turns apart generally have no direct relationship.

Preferably, step S5 specifically comprises the following steps:

S501, if no supplier information is identified, judge that the dialogue set discusses irrelevant redundant content of no value for equipment-supplier evaluation, and screen it out;

S502, if one supplier, or several mentions of the same supplier, is identified, judge that the evaluation target of the dialogue set is the identified supplier;

S503, if two or more different suppliers appear, label them manufacturers A, B, …, X in order of appearance, and determine the supplier for each text in the set using the upward-proximity principle: the evaluation target from the first sentence up to the sentence where manufacturer B appears is manufacturer A; from the sentence where manufacturer B appears up to the sentence where manufacturer C appears, it is manufacturer B; and so on. If a manufacturer appears repeatedly, its dialogue sets are merged.

The beneficial effects of the present invention are as follows: the method can group dialogue text by topic and identify supplier information, addressing irrelevant redundant content, implicit evaluation targets, and cross-interruption in dialogue text, and laying a foundation for subsequent dialogue-text analysis.

Brief Description of the Drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a plot of the accuracy of the BERT-NSP and cosine-similarity weighted model as a function of α;

Fig. 3 is a flow chart of the dialogue-interruption cross processing of the present invention.

Detailed Description

Embodiment:

This embodiment provides a topic identification method for dialogue text; referring to Fig. 1, it comprises the following steps:

S1, preprocess the dialogue text on the basis of the existing power-domain ontology dictionary and general dictionary, including word segmentation, part-of-speech tagging, and word-frequency feature extraction; these are common steps in text analysis and mining and are not repeated here;

S2, add new attribute entries to the existing power-domain ontology dictionary and general dictionary, including power-specific vocabulary, supplier-name vocabulary, and event keywords; the ontology dictionary is supplemented by a semi-supervised method and then manually checked to decide whether each candidate becomes an ontology word and what its attributes are;

S3, perform next-sentence prediction on single-round dialogue, using the coherence of adjacent sentences to judge whether they share a topic.

Step S3 specifically comprises the following steps:

S301, use the Transformer-based bidirectional-encoder next-sentence-prediction model (BERT-NSP): take two dialogue texts as input, prepend the [CLS] token, transform the input to output the hidden vector corresponding to each character, and compute the next-sentence-prediction matching probability for single-round power-business dialogue text, obtaining the model output:

p = softmax(CW^T)

where p is the next-sentence-prediction matching-probability vector; C is the final hidden state of the first token [CLS] of the BERT model; and W is the weight matrix of the fully connected layer. The model is in fact a binary classifier, so p is a two-dimensional vector whose components are the probabilities that the next-sentence prediction is 0 or 1, i.e., the probabilities that the two sentences are unrelated or related; the next-sentence-prediction probability P_NS takes the component of this vector indicating that the two sentences are related.
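As a numerical sketch, the output p = softmax(CW^T) and the extraction of P_NS can be reproduced as follows. The 768-dimensional hidden state and the 2×768 weight matrix here are random placeholders, not the patent's trained BERT parameters:

```python
import numpy as np

def nsp_probability(C: np.ndarray, W: np.ndarray) -> float:
    """Compute p = softmax(C W^T) and return P_NS, the 'related' component.

    C: final hidden state of the [CLS] token, shape (hidden,)
    W: fully connected classifier weights, shape (2, hidden)
    """
    logits = C @ W.T                  # shape (2,): [unrelated, related]
    logits = logits - logits.max()    # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()  # softmax over the two classes
    return float(p[1])                # P_NS: probability the sentences are related

# Toy example with placeholder weights (illustrative only).
rng = np.random.default_rng(0)
C = rng.standard_normal(768)
W = rng.standard_normal((2, 768))
p_ns = nsp_probability(C, W)
assert 0.0 <= p_ns <= 1.0
```

With a zero hidden state the two logits are equal and P_NS is exactly 0.5; since the softmax is over two classes, the two components of p always sum to 1.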

S302, compute the cosine similarity of the two sentences of single-round dialogue text as the criterion for judging the coherence of repeated content:

S = (A · B) / (‖A‖ ‖B‖)

where S is the cosine similarity of adjacent dialogue turns, and A = (a1, a2, …, an) and B = (b1, b2, …, bn) are the n-dimensional word-frequency feature vectors obtained from the word-frequency vectorization of the two sentences.
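The cosine similarity of two word-frequency vectors can be sketched directly from its definition; the frequency vectors below are hypothetical term counts, not drawn from the patent's corpus:

```python
import math

def cosine_similarity(A: list, B: list) -> float:
    """S = (A . B) / (||A|| ||B||) for two n-dimensional word-frequency vectors."""
    dot = sum(a * b for a, b in zip(A, B))
    norm_a = math.sqrt(sum(a * a for a in A))
    norm_b = math.sqrt(sum(b * b for b in B))
    if norm_a == 0.0 or norm_b == 0.0:  # a sentence with no counted terms
        return 0.0
    return dot / (norm_a * norm_b)

# Parallel frequency vectors score 1; vectors with no shared terms score 0.
assert abs(cosine_similarity([1, 2, 0], [2, 4, 0]) - 1.0) < 1e-9
assert cosine_similarity([1, 0], [0, 1]) == 0.0
```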

S303, fuse the above two predictions and define the semantic-relevance matching degree of a single-round dialogue:

M = (1 − α)P_NS + αS

where M is the semantic-relevance matching degree of a single-round dialogue and α is the cosine-similarity weight coefficient. M is greater than or equal to 0, and a larger M indicates stronger matching relevance between the two sentences. Following the binary-classification convention for P_NS, when M is greater than or equal to the set threshold, the adjacent sentences are judged related and assigned to the same dialogue topic; when M is less than the set threshold, they are judged unrelated. The significance of this fusion is that it combines deep features with similarity features, jointly considering the linguistic connection between adjacent sentences and improving matching accuracy. The coefficient α balances the weight of deep features against similarity features; optimizing it yields the best model for single-round dialogue-text judgment.

The threshold is set to 0.5.
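The fusion and threshold decision can be sketched as follows; α = 0.04 is the optimum reported in the application example below, and 0.5 is the preferred threshold:

```python
def matching_degree(p_ns: float, s: float, alpha: float = 0.04) -> float:
    """M = (1 - alpha) * P_NS + alpha * S."""
    return (1.0 - alpha) * p_ns + alpha * s

def same_topic(p_ns: float, s: float, alpha: float = 0.04, threshold: float = 0.5) -> bool:
    """Adjacent sentences are assigned to the same topic iff M >= threshold."""
    return matching_degree(p_ns, s, alpha) >= threshold

# At small alpha the deep NSP signal dominates; high lexical similarity alone
# is not enough to merge two turns.
assert same_topic(p_ns=0.9, s=0.2)
assert not same_topic(p_ns=0.1, s=0.9)
```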

S4, perform dialogue-interruption cross processing to obtain all same-topic dialogue sets.

Referring to Fig. 3, step S4 specifically comprises the following steps:

S401, let d_i and d_j be two texts taken in order from the dialogue-text set D, and determine the number of turns between them;

S402, if the turn interval is within the allowed range, evaluate the single-round dialogue relevance matching degree M for the two texts;

S403, if the turn interval is outside the allowed range, check d_i for link (@user-ID) information. If it contains a link, evaluate the matching degree M between the linking sentence and each of the linked user's two nearest dialogue texts (the closest preceding and following turns), and group same-topic dialogue sets according to the results; if it contains no link, the topic dialogue set to which this text belongs is complete;

S404, repeat steps S401 to S403 until the dialogue-text set D is empty, obtaining all same-topic dialogue sets.

The allowed turn interval is set to 3 turns or fewer. Experience with dialogue-text research suggests that dialogue turns more than 3 turns apart generally have no direct relationship.
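The S401 to S404 loop can be sketched as a greedy grouping pass. Everything below is a simplified illustration: the `Turn` record, the lexical-overlap `match` function standing in for the M-value model, and the sample turns are all hypothetical; real use would substitute the weighted BERT-NSP matcher:

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_GAP = 3  # allowed turn interval (S402): 3 rounds or fewer

@dataclass
class Turn:
    index: int                         # round number in the chat log
    user: str
    text: str
    linked_user: Optional[str] = None  # @user-ID link, if any (S403)

def group_topics(turns: list, match: Callable) -> list:
    """Greedily assign each turn to the most recent group whose last turn is
    within MAX_GAP rounds and matches it; an @user-ID link lets a turn join
    the linked user's group regardless of the gap."""
    groups = []
    for t in turns:
        placed = False
        for g in reversed(groups):     # prefer the most recent candidate group
            last = g[-1]
            near = t.index - last.index <= MAX_GAP
            linked = t.linked_user is not None and any(x.user == t.linked_user for x in g)
            if (near or linked) and match(last.text, t.text):
                g.append(t)
                placed = True
                break
        if not placed:
            groups.append([t])         # start a new topic set
    return groups

# Toy run with a lexical-overlap matcher standing in for the M-value model.
def match(a: str, b: str) -> bool:
    return bool(set(a.split()) & set(b.split()))

turns = [
    Turn(1, "u1", "meter error from vendor A"),
    Turn(2, "u2", "vendor A meter fails often"),
    Turn(3, "u3", "anyone want coffee"),
    Turn(4, "u1", "the meter error is back"),
]
groups = group_topics(turns, match)
assert len(groups) == 2
```

The interrupting turn (3) starts its own group, while turn (4) rejoins the meter discussion because it is within 3 rounds of turn (2) and matches it.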

S5, identify suppliers: on the basis of all same-topic dialogue sets, extract the supplier information in each topic's dialogue set according to the supplier-information category in the power-business ontology dictionary, identify implicit evaluation targets using the upward-proximity principle, and then remove irrelevant redundant topic content.

Step S5 specifically comprises the following steps:

S501, if no supplier information is identified, judge that the dialogue set discusses irrelevant redundant content of no value for equipment-supplier evaluation, and screen it out;

S502, if one supplier, or several mentions of the same supplier, is identified, judge that the evaluation target of the dialogue set is the identified supplier;

S503, if two or more different suppliers appear, label them manufacturers A, B, …, X in order of appearance, and determine the supplier for each text in the set using the upward-proximity principle: the evaluation target from the first sentence up to the sentence where manufacturer B appears is manufacturer A; from the sentence where manufacturer B appears up to the sentence where manufacturer C appears, it is manufacturer B; and so on. If a manufacturer appears repeatedly, its dialogue sets are merged.
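The upward-proximity attribution of S501 to S503 can be sketched as a single pass over a topic set. The sentences and the supplier lexicon here are hypothetical placeholders for entries from the power-business ontology dictionary:

```python
from typing import Optional

def attribute_suppliers(sentences: list, suppliers: set) -> dict:
    """Assign each sentence to the most recently mentioned supplier (upward
    proximity). Sentences before any mention, or sets with no supplier at all,
    yield nothing, i.e. they are screened out as irrelevant (S501)."""
    result = {}
    current: Optional[str] = None
    for s in sentences:
        mentioned = next((v for v in suppliers if v in s), None)
        if mentioned is not None:
            current = mentioned        # an explicit mention switches the target
        if current is not None:
            result.setdefault(current, []).append(s)  # repeats of a vendor merge (S503)
    return result

sentences = [
    "vendor A meters keep failing",
    "the failure rate is high",        # implicit target: still vendor A
    "vendor B units look better",
    "vendor A replied with a fix",     # vendor A reappears: merged into its set
]
by_vendor = attribute_suppliers(sentences, {"vendor A", "vendor B"})
assert len(by_vendor["vendor A"]) == 3
assert len(by_vendor["vendor B"]) == 1
```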

The following further illustrates the method with a concrete application example.

A verification study was conducted on dialogue text from the power collection and operation-maintenance domain. First, a corpus was built by collecting 23.8 MB of text data, including power collection and operation-maintenance RTX work-group chat dialogues, the guideline "A Handbook of Abnormal Collection for Frontline Employees of Power Grid Enterprises", and collection-anomaly texts. The corpus was then preprocessed, including word segmentation based on a hidden Markov model (HMM). On top of the general dictionary and the existing power-equipment-defect ontology dictionary, new ontology words were identified and, after manual verification, added to form a new domain ontology dictionary. The format and examples are shown in Table 1: each ontology word is annotated with entry attributes, synonyms, and near-synonyms, where the attributes cover topic-grouping-related categories such as domain-specific nouns, supplier names, and event keywords. A total of 752 new collection and operation-maintenance ontology entries were added, comprising 539 domain-specific noun entries that appeared in the corpus but were absent from the original ontology dictionary, 106 supplier-name entries, and 107 event-keyword entries.

Table 1. Example of the power-domain ontology dictionary

The topic-grouping dataset consists of 347 pairs of single-round dialogue texts selected from the power collection and operation-maintenance RTX work-group chats; Table 2 lists 13 of the dialogue turns. The BERT-NSP model uses the BERT-Chinese pre-trained structure with 12 Transformer layers and 110M parameters, 12 self-attention heads, a hidden-layer dimension of 768, a maximum sequence length of 128, a learning rate of 3e-5, and a batch size of 32. Next-sentence-prediction analysis weighted by BERT-NSP and cosine similarity was run on the 347 pairs of supplier-topic dialogues, and the cosine-similarity weight coefficient α in the matching degree M was optimized; the accuracy is shown in Fig. 2. The values α = 0 and α = 1 correspond to the pure BERT-NSP model and the pure cosine-similarity model, respectively. Single-round dialogue judgment accuracy peaks at 80.69% when α = 0.04 and decreases monotonically beyond 0.04; the accuracy metrics are listed in Table 3. The weighted BERT-NSP and cosine-similarity model below therefore uses α = 0.04.

Table 2. Example power dialogue texts

Table 3. Single-round dialogue text judgment accuracy

Table 3 shows that the single-round next-sentence-prediction model adopted here, which weights language-feature similarity on top of BERT-NSP's deep-feature relevance judgment, improves the accuracy of single-round dialogue judgment. The weighted model is also somewhat interpretable: repeated text content within a conversation tends to indicate discussion of the same topic.

Building on the single-round next-sentence prediction, dialogue cross-interruption is then handled. Taking the power dialogue in Table 2 as an example, the passage contains two dialogue topics: a discussion of the supplier of abnormal meters and a discussion of the cost-effectiveness of ceramic cups. Table 4 compares the multi-round topic partitions obtained with no processing, cosine similarity alone, BERT-NSP alone, and the weighted BERT-NSP and cosine-similarity model, with different topics under the same identified supplier separated by slashes.

Table 4. Multi-round dialogue topic partitions under different models

Table 4 shows that for multi-round topic partitioning, the cosine-similarity-only model has the lowest accuracy and the weighted model the highest. The cosine-similarity model relies mainly on text repetition between two turns, judging coherence only from repeated content and ignoring deeper connections. BERT-NSP alone reaches higher accuracy but still leaves partitions incomplete: analysis of the misclassified sentences (4), (7), and (13) shows that the shared keyword "ceramic cup" is what identifies them as one topic. The weighted model combines the advantages of both, compensating for BERT's neglect of word-level connections and delimiting each topic's scope more accurately. The dialogue-set matching results of the weighted model show that the cross-interruption procedure of Fig. 3 partitions same-topic content correctly: not only are same-topic turns within 3 rounds of each other grouped correctly, but distant turns connected by an @user-ID link, such as sentences (4) and (13), are also assigned to the right topic.

Finally, the supplier-identification method correctly identifies, under the abnormal-meter supplier topic, sentences (1)(2)(3)(5)(6)(8)(9)(10)(12) as concerning manufacturer A and only (11) as concerning manufacturer B; the irrelevant content (4)(7)(13) that does not belong to this topic is removed from the dialogue.

Claims (6)

CN202011191264.1A, priority and filing date 2020-10-30: Topic identification method of dialogue text. Active. Granted as CN113641778B (en).

Priority Applications (1)

Application Number: CN202011191264.1A; Priority Date: 2020-10-30; Filing Date: 2020-10-30; Title: Topic identification method of dialogue text

Applications Claiming Priority (1)

Application Number: CN202011191264.1A; Priority Date: 2020-10-30; Filing Date: 2020-10-30; Title: Topic identification method of dialogue text

Publications (2)

CN113641778A, published 2021-11-12
CN113641778B, granted 2024-07-12

Family

Family ID: 78415631

Family Applications (1)

Application Number: CN202011191264.1A; Priority Date: 2020-10-30; Filing Date: 2020-10-30; Status: Active; Granted publication: CN113641778B

Country Status (1)

CN: CN113641778B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party

US11837219B2, priority 2021-11-18, published 2023-12-05, International Business Machines Corporation: Creation of a minute from a record of a teleconference

Citations (1)

CN112632982A (en) *, priority 2020-10-29, published 2021-04-09, 国网浙江省电力有限公司湖州供电公司: Dialogue text emotion analysis method capable of being used for supplier evaluation

Family Cites Families (8)

* Cited by examiner, † Cited by third party

JP2005535007A *, priority 2002-05-28, published 2005-11-17, Vladimir Vladimirovich Nasypny: Synthesizing method of self-learning system for knowledge extraction for document retrieval system
CN101075435B *, priority 2007-04-19, published 2011-05-18, Shenzhen Institute of Advanced Technology: Intelligent chatting system and its realizing method
US9165053B2 *, priority 2013-03-15, published 2015-10-20, Xerox Corporation: Multi-source contextual information item grouping for document analysis
US9575952B2 *, priority 2014-10-21, published 2017-02-21, AT&T Intellectual Property I, L.P.: Unsupervised topic modeling for short texts
JP2019049873A *, priority 2017-09-11, published 2019-03-28, SCREEN Holdings Co., Ltd.: Synonym dictionary creation apparatus, synonym dictionary creation program, and synonym dictionary creation method
US10608968B2 *, priority 2017-12-01, published 2020-03-31, International Business Machines Corporation: Identifying different chat topics in a communication channel using cognitive data science
CN110162787A *, priority 2019-05-05, published 2019-08-23, Xi'an Jiaotong University: A kind of class prediction method and device based on subject information
CN110717339B *, priority 2019-12-12, published 2020-06-30, Beijing Baidu Netcom Science and Technology Co., Ltd.: Method, device, electronic device and storage medium for processing semantic representation model



Similar Documents

CN113704451B (en): Power user appeal screening method and system, electronic device and storage medium
CN108376151B (en): Question classification method and device, computer equipment and storage medium
CN112632982B (en): A conversation text sentiment analysis method for supplier evaluation
CN110347787B (en): Interview method and device based on AI auxiliary interview scene and terminal equipment
US12367345B2 (en): Identifying high effort statements for call center summaries
CN113177164B (en): Multi-platform collaborative new media content monitoring and management system based on big data
Aliero et al.: Systematic review on text normalization techniques and its approach to non-standard words
US20230298615A1 (en): System and method for extracting hidden cues in interactive communications
CN113051886A (en): Test question duplicate checking method and device, storage medium and equipment
CN118964641B (en): Method and system for building AI knowledge base model for enterprises
CN118485046B (en): Labeling data processing method and device, electronic equipment and computer storage medium
CN115934936A (en): Intelligent traffic text analysis method based on natural language processing
CN118194875B (en): Intelligent voice service management system and method driven by natural language understanding
KR20160149050A (en): Apparatus and method for selecting a pure play company by using text mining
KR102661438B1 (en): Web crawler system that collects Internet articles and provides a summary service of issue articles affecting the global value chain
CN111177402A (en): Evaluation method, device, computer equipment and storage medium based on word segmentation processing
CN117195864A (en): A question generation system based on answer awareness
Shahade et al.: Deep learning approach-based hybrid fine-tuned Smith algorithm with Adam optimiser for multilingual opinion mining
CN113641778B (en): Topic identification method of dialogue text
CN119202249A (en): A text element extraction method based on natural language processing
CN118982978A (en): Intelligent evaluation method, system and medium for call voice
CN118395968A (en): A method and system for automatic analysis of data classification and grading standard files
CN115860778B (en): Tourism consumer demand analysis method and system based on improved KANO model
CN111428475A (en): Word segmentation word bank construction method, word segmentation method, device and storage medium
CN107886233B (en): Service quality evaluation method and system for customer service

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
