CN112052684A - Named entity identification method, device, equipment and storage medium for power metering - Google Patents

Named entity identification method, device, equipment and storage medium for power metering

Info

Publication number
CN112052684A
Authority
CN
China
Prior art keywords
model
word
corpus
classification
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010928024.9A
Other languages
Chinese (zh)
Inventor
郑楷洪
李鹏
杨劲锋
周尚礼
张英楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern Power Grid Digital Grid Research Institute Co Ltd
Original Assignee
China Southern Power Grid Co Ltd
Southern Power Grid Digital Grid Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Southern Power Grid Co Ltd, Southern Power Grid Digital Grid Research Institute Co Ltd
Priority to CN202010928024.9A
Publication of CN112052684A
Legal status: Pending

Abstract

The application relates to the technical field of electric power metering, and provides a named entity recognition method and apparatus for electric power metering, a computer device and a storage medium. The method comprises: obtaining an electric power metering corpus to be recognized, inputting it into a pre-trained recognition model, and obtaining the output of the recognition model to produce named entity recognition information for the corpus. The recognition model is obtained by inputting electric power metering corpus training data into a convolutional neural network model and inputting the output of the convolutional neural network model into a NER model, a Chinese word segmentation model and a word classification model respectively for training; it is used to perform named entity recognition on the input electric power metering corpus and improves the accuracy of named entity recognition for electric power metering.

Description

Translated from Chinese

Named entity identification method, device, equipment and storage medium for power metering

TECHNICAL FIELD

The present application relates to the technical field of power metering, and in particular to a named entity recognition method, apparatus, computer device and storage medium for power metering.

BACKGROUND

With the development of a new generation of artificial intelligence technology, structured semantic information is increasingly widely applied in the field of power metering. A power knowledge graph is a large information network formed by associating and combining the different kinds of business information corresponding to power services according to the business structure of the business objects. Users can obtain information related to power business objects through the power knowledge graph, improving the efficiency of information retrieval. Named Entity Recognition (NER) is a key foundational step in knowledge graph construction.

In current technology, the mainstream approach to named entity recognition is based on deep learning and implemented with neural networks. This approach is mainly used for named entity recognition in English; since the linguistic forms of Chinese and English differ considerably, its accuracy is low when applied to named entity recognition in Chinese text.

SUMMARY OF THE INVENTION

In view of this, it is necessary to provide a named entity recognition method, apparatus, computer device and storage medium for power metering to address the technical problem of low named entity recognition accuracy for Chinese text in current technology.

A named entity recognition method for power metering, the method comprising:

obtaining a power metering corpus to be recognized, the power metering corpus comprising text information describing power metering information;

inputting the power metering corpus to be recognized into a pre-trained recognition model, the recognition model comprising at least a convolutional neural network model, a NER model, a Chinese word segmentation model and a word classification model, wherein the recognition model is obtained by inputting power metering corpus training data into the convolutional neural network model and inputting the output of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model respectively for training, and the recognition model is used to perform named entity recognition on the input power metering corpus; and

obtaining the output of the recognition model to obtain named entity recognition information of the power metering corpus.

In one embodiment, the method further comprises:

classifying and labeling the obtained power metering corpus according to a preset classification of power metering named entity terms to obtain a first labeled corpus;

vectorizing the first labeled corpus to obtain a character vector corresponding to each word in the first labeled corpus, and obtaining a second labeled corpus from the character vectors corresponding to the words;

inputting the second labeled corpus into the to-be-trained convolutional neural network model of the recognition model to obtain local context information corresponding to each word, the local context information characterizing the context of each word;

inputting the local context information corresponding to each word into the to-be-trained NER model, Chinese word segmentation model and word classification model of the recognition model for training, and obtaining a current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model; and

updating the convolutional neural network model, NER model, Chinese word segmentation model and word classification model of the recognition model according to the current loss value of the recognition model to obtain the trained recognition model.

In one embodiment, classifying and labeling the obtained power metering corpus according to the preset classification of power metering named entity terms to obtain the first labeled corpus comprises:

obtaining a raw power metering corpus data set;

obtaining a preset classification of power metering named entity terms comprising at least two of a power index category, a power object category, a power phenomenon category and a metering behavior category, as well as a true/false classification of power metering terms; and

labeling the raw corpus data set according to the preset classification of power metering named entity terms using the BIO labeling scheme to obtain the first labeled corpus.

In one embodiment, vectorizing the first labeled corpus to obtain the character vector corresponding to each word in the first labeled corpus comprises:

mapping each word in the first labeled corpus to a one-hot vector and inputting it into a word2vec model; and

obtaining the character vector corresponding to each word from the output of the word2vec model.

In one embodiment, the convolutional neural network model comprises multiple filters with different window sizes, and inputting the second labeled corpus into the to-be-trained convolutional neural network model of the recognition model to obtain the local context information corresponding to each word comprises:

inputting the second labeled corpus into the to-be-trained convolutional neural network model of the recognition model, and learning the context information of each word in the second labeled corpus through the multiple filters to obtain the local context information corresponding to each word.

In one embodiment, inputting the local context information corresponding to each word into the to-be-trained NER model, Chinese word segmentation model and word classification model of the recognition model for training, and obtaining the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model, comprises:

inputting the local context information corresponding to each word into the NER model to obtain a first classification result for each word, and calculating the loss of the NER model from the first classification result and the corresponding classification labels, wherein the NER model comprises a BLSTM layer and a CRF layer, and the first classification result comprises the BIO classification of each word and the corresponding power metering named entity term classification;

inputting the local context information corresponding to each word into the CRF layer of the Chinese word segmentation model to obtain a second classification result for each word, and calculating the loss of the Chinese word segmentation model from the second classification result and the corresponding classification labels, wherein the second classification result comprises the BIO classification of each word;

inputting the local context information corresponding to each word into the word classification model to obtain a third classification result for each word, and calculating the loss of the word classification model from the third classification result and the corresponding classification labels, wherein the third classification result comprises the true/false power metering term classification of each word; and

obtaining the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model, the loss of the word classification model and their respective loss coefficients.

In one embodiment, obtaining the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model, the loss of the word classification model and their respective coefficients comprises:

determining the loss coefficient corresponding to the loss of the NER model from the loss coefficients corresponding to the loss of the Chinese word segmentation model and the loss of the word classification model; and

obtaining the current loss value of the recognition model from the loss of the NER model and its corresponding loss coefficient and the loss of the Chinese word segmentation model and its corresponding loss coefficient.

A named entity recognition apparatus for power metering, the apparatus comprising:

a corpus acquisition module, configured to obtain a power metering corpus to be recognized, the power metering corpus comprising text information describing power metering information;

a model input module, configured to input the power metering corpus to be recognized into a pre-trained recognition model, the recognition model comprising at least a convolutional neural network model, a NER model, a Chinese word segmentation model and a word classification model, wherein the recognition model is obtained by inputting power metering corpus training data into the convolutional neural network model and inputting the output of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model respectively for training, and the recognition model is used to perform named entity recognition on the input power metering corpus; and

a recognition information acquisition module, configured to obtain the output of the recognition model to obtain named entity recognition information of the power metering corpus.

A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the following steps when executing the computer program:

obtaining a power metering corpus to be recognized, the power metering corpus comprising text information describing power metering information; inputting the power metering corpus to be recognized into a pre-trained recognition model, the recognition model comprising at least a convolutional neural network model, a NER model, a Chinese word segmentation model and a word classification model, wherein the recognition model is obtained by inputting power metering corpus training data into the convolutional neural network model and inputting the output of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model respectively for training, and the recognition model is used to perform named entity recognition on the input power metering corpus; and obtaining the output of the recognition model to obtain named entity recognition information of the power metering corpus.

A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:

obtaining a power metering corpus to be recognized, the power metering corpus comprising text information describing power metering information; inputting the power metering corpus to be recognized into a pre-trained recognition model, the recognition model comprising at least a convolutional neural network model, a NER model, a Chinese word segmentation model and a word classification model, wherein the recognition model is obtained by inputting power metering corpus training data into the convolutional neural network model and inputting the output of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model respectively for training, and the recognition model is used to perform named entity recognition on the input power metering corpus; and obtaining the output of the recognition model to obtain named entity recognition information of the power metering corpus.

In the above named entity recognition method, apparatus, computer device and storage medium for power metering, the power metering corpus to be recognized is obtained and input into a pre-trained recognition model, and the output of the recognition model is obtained to produce the named entity recognition information of the power metering corpus, where the recognition model is obtained by inputting power metering corpus training data into a convolutional neural network model and inputting the output of the convolutional neural network model into a NER model, a Chinese word segmentation model and a word classification model respectively for training, and is used to perform named entity recognition on the input power metering corpus. In the solution of the present application, on the basis of the NER model, a recognition model for power metering is jointly trained together with the Chinese word segmentation model and a word classification model based on power-domain dictionary knowledge, and named entity recognition for power metering is performed with this recognition model, which improves the accuracy of power metering named entity recognition.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flowchart of a named entity recognition method for power metering in one embodiment;

FIG. 2 is a schematic flowchart of a named entity recognition method for power metering in one embodiment;

FIG. 3 is a schematic model diagram of a named entity recognition method for power metering in one embodiment;

FIG. 4 is a schematic model diagram of a named entity recognition method for power metering in one embodiment;

FIG. 5 is a structural block diagram of a named entity recognition apparatus for power metering in one embodiment;

FIG. 6 is a diagram of the internal structure of a computer device in one embodiment.

DETAILED DESCRIPTION

To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.

It should be noted that the terms "first" and "second" in the embodiments of the present invention are only used to distinguish similar objects and do not imply a specific ordering of those objects. It should be understood that, where permitted, "first" and "second" may be interchanged in a specific order or sequence, so that the embodiments of the invention described here can be practiced in sequences other than those illustrated or described here.

In one embodiment, as shown in FIG. 1, a named entity recognition method for power metering is provided. This embodiment is illustrated with the method applied to a server; it should be understood that the method can also be applied to a terminal, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps.

Step S101: obtain the power metering corpus to be recognized.

Named Entity Recognition (NER) refers to identifying entities with specific meanings in text, mainly including names of people, places and organizations, proper nouns and the like, that is, identifying the boundaries and categories of entity mentions in natural text. In the field of power metering, entity term categories such as power index (Index), power object (Object), power phenomenon (Phenomenon) and metering behavior (Meter) can be predefined according to task requirements. When performing named entity recognition, a recognition model comprising multiple sub-models can be built to improve recognition accuracy and efficiency. The power metering corpus contains text information describing power metering information. The corpus data sources can include journal articles on various power metering topics from power encyclopedias, semi-structured power metering knowledge from Baidu Baike, and document- and table-type power metering knowledge from Baidu Wenku. Most of the power metering knowledge on encyclopedia websites and in document libraries is unstructured text, containing explanations of basic concepts and introductions to related principles, technologies and applications. Corpus information can also be extracted from metering business reports, metering statistics and other content provided by the power grid. The server can clean the acquired data, remove irrelevant information, and divide the text structure of the corpus using various punctuation marks as delimiters.
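
A minimal sketch of this cleaning step in Python is shown below, assuming plain regular expressions; the punctuation set and the sample sentence are illustrative placeholders, not rules taken from the patent.

```python
import re

def split_corpus(raw_text: str):
    """Strip whitespace noise and split raw power-metering text into sentences."""
    text = re.sub(r"\s+", "", raw_text)           # drop stray whitespace/formatting noise
    sentences = re.split(r"[。；！？;!?]", text)  # split on common sentence-ending punctuation
    return [s for s in sentences if s]            # discard empty fragments

# Example with made-up text: three short sentences about metering
print(split_corpus("电能表运行异常。计量装置检定合格；数据采集正常。"))
```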

In a specific implementation, the server can obtain the power metering corpus to be recognized from a terminal or another input module.

Step S102: input the power metering corpus to be recognized into the pre-trained recognition model.

The recognition model can include at least a convolutional neural network model, a NER model, a Chinese word segmentation model and a word classification model. A convolutional neural network (CNN) is a class of feed-forward neural networks with convolution operations and a deep structure; it has feature-learning capability and can produce shift-invariant classifications of its input according to its hierarchical structure. NER models are commonly used named entity recognition models and include, for example, the spaCy NER model and Stanford NER. In Chinese natural language processing, a Chinese word segmentation model can be used to determine word boundaries in Chinese sentences and improve recognition accuracy. The word classification model, combined with a professional dictionary, classifies a sequence of Chinese characters according to whether it can be a power metering entity term; for example, the character sequence "电量波动异常" ("abnormal power fluctuation") would be classified as true, while the corrupted sequence "电置工动异常" would be classified as false. Introducing the word classification model makes it possible to recognize whether a target word belongs to the power metering domain. The recognition model is obtained by inputting power metering corpus training data into the convolutional neural network model and inputting the output of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model respectively for multi-task training, and is used to perform named entity recognition on the input power metering corpus.
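
As a concrete illustration of how the four sub-models can be wired together, the following is a minimal PyTorch sketch, not the patent's actual implementation: vocabulary size, embedding dimension, filter counts and tag counts are placeholder values, and the NER and CWS heads only produce emission scores for downstream CRF layers.

```python
import torch
import torch.nn as nn

class PowerMeteringRecognizer(nn.Module):
    """Shared Embedding + CNN trunk with NER, CWS and word-classification heads."""
    def __init__(self, vocab_size=5000, emb_dim=128, num_filters=64,
                 windows=(2, 3, 4, 5), ner_tags=9, cws_tags=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # shared Embedding layer
        self.convs = nn.ModuleList(                               # shared CNN layer
            nn.Conv1d(emb_dim, num_filters, k, padding=k // 2) for k in windows)
        hidden = num_filters * len(windows)
        self.ner_lstm = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.ner_emit = nn.Linear(hidden, ner_tags)   # NER head: emissions for a CRF layer
        self.cws_emit = nn.Linear(hidden, cws_tags)   # CWS head: emissions for a CRF layer
        self.word_cls = nn.Linear(hidden, 1)          # word head: true/false term score

    def encode(self, char_ids):                        # char_ids: (batch, seq_len)
        n = char_ids.size(1)
        x = self.embedding(char_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        feats = [torch.relu(conv(x))[..., :n] for conv in self.convs]
        return torch.cat(feats, dim=1).transpose(1, 2) # (batch, seq_len, hidden)

    def forward(self, char_ids):
        c = self.encode(char_ids)                      # shared local-context features
        ner_emissions = self.ner_emit(self.ner_lstm(c)[0])
        cws_emissions = self.cws_emit(c)
        word_score = torch.sigmoid(self.word_cls(c.max(dim=1).values)).squeeze(-1)
        return ner_emissions, cws_emissions, word_score
```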

In a specific implementation, the server inputs the obtained power metering corpus into the trained recognition model, and the recognition model performs recognition processing on the corpus.

Step S103: obtain the output of the recognition model to obtain the named entity recognition information of the power metering corpus.

The output of the recognition model can include the outputs of the NER model, the Chinese word segmentation model and the word classification model; depending on the functions and settings of each sub-model, the form of the output will differ. The named entity recognition information can include the predefined entity term categories and can also be presented visually, for example by displaying different colors according to the category to which an entity belongs.

In a specific implementation, the server obtains the entity term classification of the power metering corpus from the output of the recognition model.

In the above named entity recognition method for power metering, the power metering corpus to be recognized is obtained and input into the pre-trained recognition model, and the output of the recognition model is obtained to produce the named entity recognition information of the corpus, where the recognition model is obtained by inputting power metering corpus training data into the convolutional neural network model and inputting the output of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model respectively for training, and is used to perform named entity recognition on the input power metering corpus. On the basis of the NER model, a recognition model for power metering is jointly trained together with the Chinese word segmentation model and a word classification model based on power-domain dictionary knowledge, and named entity recognition for power metering is performed with this recognition model, which improves the accuracy of power metering named entity recognition.

In one embodiment, the above method further comprises:

classifying and labeling the obtained power metering corpus according to the preset classification of power metering named entity terms to obtain a first labeled corpus; vectorizing the first labeled corpus to obtain a character vector corresponding to each word in the first labeled corpus; obtaining a second labeled corpus from the character vectors corresponding to the words; inputting the second labeled corpus into the to-be-trained convolutional neural network model of the recognition model to obtain local context information corresponding to each word, where the local context information characterizes the context of each word; inputting the local context information corresponding to each word into the to-be-trained NER model, Chinese word segmentation model and word classification model of the recognition model for training, and obtaining the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model; and updating the convolutional neural network model, NER model, Chinese word segmentation model and word classification model of the recognition model according to the current loss value of the recognition model to obtain the trained recognition model.

In this embodiment, the server obtains the power metering corpus, labels and classifies it, vectorizes it, inputs it into the convolutional neural network model for processing, and then inputs the processing results into the to-be-trained NER model, Chinese word segmentation model and word classification model of the recognition model for training until the recognition model is obtained. The power metering entity term classification can include a power index category, a power object category, a power phenomenon category and a metering behavior category, as well as a true/false classification of power metering terms.

The server vectorizes the first labeled corpus, which can be implemented by building an Embedding layer. Embedding converts discrete variables into continuous vector representations and can be implemented with the word2vec algorithm. Through the Embedding layer, the server generates a character vector for each word and thus vectorizes the input text.

The convolutional neural network model (CNN layer) captures local context information from the character vectors provided by the Embedding layer through its feature representation and extraction functions, using convolution windows to learn the features between each character and its associated context characters and obtain the local context information corresponding to each word.

During training of the recognition model, the to-be-trained NER model, Chinese word segmentation model and word classification model share the entity extraction results that the Embedding layer and CNN layer produce from the professional corpus text of the metering domain, so useful information in the word segmentation can be encoded to learn contextual character representations that help with word boundaries, improving the efficiency of predicting entity boundaries. The server inputs the local context information corresponding to each word output by the CNN layer into the to-be-trained NER model, Chinese word segmentation model and word classification model respectively for training, obtains the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model, and combines these losses to obtain the current loss value of the recognition model. According to this current loss value, the convolutional neural network model, NER model, Chinese word segmentation model and word classification model are updated to train the recognition model. The functional designs of the sub-models differ: for example, the word classification model determines whether a given term is a power metering entity, while the Chinese word segmentation model recognizes word sequences. The training results of the sub-models can therefore validate one another, and repeated training improves the text recognition accuracy of the model.
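
A sketch of one joint-training step under the same assumptions as the model skeleton above: the two CRF loss callables and the optimizer are assumed to be created elsewhere, the word-classification branch is applied at the sequence level purely for brevity, and the weights used here are placeholders (the patent's own loss combination is given later in this description).

```python
import torch.nn.functional as F

def joint_training_step(model, ner_crf_loss, cws_crf_loss, optimizer, batch,
                        w_ner=0.7, w_cws=0.2, w_wc=0.1):
    """One multi-task update over the shared Embedding + CNN trunk and three heads."""
    char_ids, ner_tags, cws_tags, word_labels, mask = batch
    ner_emissions, cws_emissions, word_score = model(char_ids)
    loss_ner = ner_crf_loss(ner_emissions, ner_tags, mask)   # CRF negative log-likelihood
    loss_cws = cws_crf_loss(cws_emissions, cws_tags, mask)   # CRF negative log-likelihood
    loss_wc = F.binary_cross_entropy(word_score, word_labels.float())
    loss = w_ner * loss_ner + w_cws * loss_cws + w_wc * loss_wc  # placeholder weighting
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```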

In the solution of the above embodiment, the power metering corpus is obtained, classified, labeled and vectorized, then input into the convolutional neural network model to obtain the local context information corresponding to each word, which is subsequently input into the to-be-trained NER model, Chinese word segmentation model and word classification model for training to obtain the trained recognition model. Because entities are extracted from the professional corpus text of the metering domain through the Embedding layer and the CNN layer, and the NER model, Chinese word segmentation model and word classification model share these entity categories and confidences, the efficiency of model training is improved and the accumulation of recognition errors is reduced; training the recognition model in a multi-task, multi-model joint fashion further improves its named entity recognition accuracy.

In one embodiment, classifying and labeling the obtained power metering corpus according to the preset classification of power metering named entity terms to obtain the first labeled corpus comprises:

obtaining a raw power metering corpus data set; obtaining a preset classification of power metering named entity terms comprising at least two of a power index category, a power object category, a power phenomenon category and a metering behavior category, as well as a true/false classification of power metering terms; and labeling the raw corpus data set according to the preset classification of power metering named entity terms using the BIO labeling scheme to obtain the first labeled corpus.

In this embodiment, when training the recognition model, the collected training corpus needs to be classified and labeled so that names, locations, various electrical quantities and their naming formats follow relatively uniform conventions. For the power metering domain, since there is currently no public data set that can be used directly to train a domain-specific NER model and the various kinds of power metering knowledge are scattered across information resources on the web, the server can build a text corpus for the power metering domain and use it to construct the recognition model. Entity classification in the power metering domain is more complex than in the general domain, and named entity boundaries are often ambiguous and hard to define. Combining the knowledge of power-domain experts with related books and materials, entity term categories such as power index (Index), power object (Object), power phenomenon (Phenomenon) and metering behavior (Meter) can be defined, as well as a true/false classification of power metering terms; the true/false classification can be used to train the word classification model. In named entity recognition, text can be annotated by sequence labeling, for example BIO labeling. BIO labeling marks each element as "B-X", "I-X" or "O", where "B-X" means the segment containing the element is of type X and the element is at the beginning of the segment, "I-X" means the segment containing the element is of type X and the element is inside the segment, and "O" means the element does not belong to any type. The server labels the raw corpus data set with BIO tags according to the determined power metering named entity term classification to obtain the first labeled corpus; for example, the first element of a power metering index term can be labeled B-Index. During labeling, the Chinese entity annotation tool YEDDA can be used: through its visual interface, a large amount of corpus can be labeled efficiently by selecting the text of the entity to be recognized and using shortcut labeling keys, reducing manual errors. The fully annotated corpus can then be processed by a program that restructures the format, splits characters, inserts blank lines between sentences and makes it conform to the basic corpus format. In addition, since batches are split at the sentence level, overly long sentences, such as whole paragraphs written as a single sentence or long passages separated by semicolons, are split up.
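
For illustration, a character-level BIO-labeled fragment might look like the following; the sentence, its segmentation and the category assignments are invented for this example and are not taken from the patent's corpus.

```python
# Character-level BIO tags using the Index / Phenomenon categories defined above.
labeled = [
    ("线", "B-Index"), ("损", "I-Index"), ("率", "I-Index"),        # "line loss rate" as an Index entity
    ("波", "B-Phenomenon"), ("动", "I-Phenomenon"),                 # "fluctuation ..."
    ("异", "I-Phenomenon"), ("常", "I-Phenomenon"),                 # "... abnormal" as a Phenomenon entity
    ("。", "O"),                                                    # punctuation outside any entity
]
```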

In the solution of the above embodiment, dividing the power metering named entity terms into categories and labeling the raw corpus data set with the BIO scheme improves the consistency and standardization of power metering named entity classification.

In one embodiment, vectorizing the first labeled corpus to obtain the character vector corresponding to each word in the first labeled corpus comprises:

mapping each word in the first labeled corpus to a one-hot vector and inputting it into a word2vec model; and obtaining the character vector corresponding to each word from the output of the word2vec model.

In this embodiment, a one-hot vector is a representation obtained by converting categorical variables into a form that machine learning algorithms can easily use. The word2vec model converts the input words into the corresponding character vectors and includes two variants, CBOW and Skip-Gram; the Skip-Gram model is used as an example below. The input text sentence can be expressed as s = [w_1, w_2, ..., w_n], where n is the number of Chinese characters in the text and w_i is the one-hot representation of each character. The output of the word2vec model is the character vector sequence [x_1, x_2, ..., x_n], where x_i is the character vector corresponding to the i-th Chinese character in the text.
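
A minimal sketch of producing the character vectors with the Skip-Gram variant, assuming the gensim library (which handles the discrete-to-vector mapping internally); the sample sentences and dimensions are placeholders.

```python
from gensim.models import Word2Vec

# Each training sample is a sentence split into individual characters,
# mirroring s = [w_1, w_2, ..., w_n] above.
char_sentences = [list("电能表计量误差超差"), list("用户电量波动异常")]

w2v = Word2Vec(sentences=char_sentences, vector_size=128, window=5,
               sg=1,          # sg=1 selects the Skip-Gram model
               min_count=1)
x_i = w2v.wv["电"]            # character vector for the character "电"
```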

In the solution of the above embodiment, each word in the first labeled corpus is mapped to a one-hot vector and input into the word2vec model to obtain the corresponding character vector, so that the power metering word information is converted into low-dimensional vectors and otherwise isolated words acquire numerical relationships that are convenient for deep learning computation.

In one embodiment, the convolutional neural network model comprises multiple filters with different window sizes, and inputting the second labeled corpus into the to-be-trained convolutional neural network model of the recognition model to obtain the local context information corresponding to each word comprises:

inputting the second labeled corpus into the to-be-trained convolutional neural network model of the recognition model, and learning the context information of each word in the second labeled corpus through the multiple filters to obtain the local context information corresponding to each word.

In this embodiment, the CNN layer extracts context information from the text in which the Chinese power metering entities appear. In the CNN, W ∈ R^(K×D) denotes a filter of the convolutional layer, where K is the window size and D is the dimension of each word vector in the second labeled corpus. When the filter processes the i-th character, its context can be expressed as:

X_i = x_(i-⌊K/2⌋) ⊕ x_(i-⌊K/2⌋+1) ⊕ ... ⊕ x_(i+⌊K/2⌋)

Here X with its subscripts denotes the concatenation of the character vectors embedded in window K. Further, ReLU is chosen as the activation function, and multiple filters with different window sizes are used to learn contextual character representations; for example, the window sizes can range from 2 to 5. With the number of filters set to M, the context representation of the i-th character (denoted c_i) is the concatenation of the outputs of all filters at that position, and the output of this layer is c = [c_1, c_2, ..., c_n].
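
The CNN layer just described can be sketched as follows, a PyTorch illustration with placeholder dimensions that expands the trunk of the earlier skeleton: each window size gets its own filter bank, ReLU is applied, and the per-position outputs are concatenated into c_i.

```python
import torch
import torch.nn as nn

class MultiWindowCharCNN(nn.Module):
    def __init__(self, emb_dim=128, filters_per_window=64, windows=(2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, filters_per_window, kernel_size=k, padding=k // 2)
            for k in windows)

    def forward(self, x):                       # x: (batch, seq_len, emb_dim)
        n = x.size(1)
        x = x.transpose(1, 2)                   # (batch, emb_dim, seq_len)
        # c_i is the concatenation of every filter's ReLU output at position i.
        outs = [torch.relu(conv(x))[..., :n] for conv in self.convs]
        return torch.cat(outs, dim=1).transpose(1, 2)   # (batch, seq_len, total filters)
```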

In the solution of the above embodiment, the context information of each word in the second labeled corpus is learned through multiple filters of the convolutional neural network model with different window sizes to obtain the local context information corresponding to each word, which improves the model's recognition accuracy for Chinese named entities.

In one embodiment, inputting the local context information corresponding to each word into the to-be-trained NER model, Chinese word segmentation model and word classification model of the recognition model for training, and obtaining the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model, comprises:

inputting the local context information corresponding to each word into the NER model to obtain a first classification result for each word, and calculating the loss of the NER model from the first classification result and the corresponding classification labels; inputting the local context information corresponding to each word into the CRF layer of the Chinese word segmentation model to obtain a second classification result for each word, and calculating the loss of the Chinese word segmentation model from the second classification result and the corresponding classification labels; inputting the local context information corresponding to each word into the word classification model to obtain a third classification result for each word, and calculating the loss of the word classification model from the third classification result and the corresponding classification labels; and obtaining the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model, the loss of the word classification model and their respective loss coefficients.

In this embodiment, the server trains the NER model, the Chinese word segmentation model and the word classification model using the local context information corresponding to each word output by the convolutional neural network model.

The NER model can include a BLSTM layer and a CRF layer, and the first classification result can include the BIO classification of each word and the corresponding power metering named entity term classification. The bidirectional long short-term memory network (BLSTM) judges entity features from the state information before and after each word in the power metering text: a forward LSTM and a backward LSTM respectively compute the vectors for the left-side and right-side context of each word, and the two vectors of each word are then concatenated to form the word's vector, which is output to the CRF layer. A conditional random field (CRF) can be used for character-level sequence labeling and to model the correlations between adjacent labels of power metering entities. Given a training data set, the CRF layer obtains a conditional probability model through maximum likelihood estimation. The input X = {x_1, x_2, x_3, ..., x_n} is the fully expanded text sequence vector, where each x_i is the Chinese character vector produced by the convolutional neural network model, and Y = {y_1, y_2, y_3, ..., y_n} is defined as the label sequence corresponding to the sequence. The CRF layer can then be viewed as computing the probability P from the corresponding labels y with the following formula:

P(y|X) = ∏_(i=1..n) ψ_i(y_(i-1), y_i, X) / Σ_(y') ∏_(i=1..n) ψ_i(y'_(i-1), y'_i, X)

where ψ_i(y_(i-1), y_i, X) is the potential function, and the parameters W and b respectively represent the weight matrix and bias corresponding to the input of this layer. Further, the loss function of this layer is expressed as:

L_NER = -log p(y|x)

where x is the hidden representation and y is the label sequence of the text sentence; the optimal solution of the formula can be obtained by maximum likelihood estimation. Finally, the label sequence y* corresponding to the optimal solution is obtained by decoding.
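
A sketch of this NER head, assuming PyTorch together with the third-party pytorch-crf package for the CRF layer; tag counts and dimensions are placeholders, and the loss method returns the negative log-likelihood -log p(y|x) described above.

```python
import torch.nn as nn
from torchcrf import CRF   # third-party package: pytorch-crf

class NERHead(nn.Module):
    def __init__(self, in_dim=256, num_tags=9):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, in_dim // 2, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(in_dim, num_tags)      # per-character emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, context, tags, mask):
        # Negative log-likelihood of the gold tag sequence (maximum likelihood training).
        emissions = self.emit(self.blstm(context)[0])
        return -self.crf(emissions, tags, mask=mask, reduction='mean')

    def decode(self, context, mask):
        # Viterbi decoding of the best tag sequence y*.
        emissions = self.emit(self.blstm(context)[0])
        return self.crf.decode(emissions, mask=mask)
```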

The purpose of the Chinese word segmentation (CWS) model is to determine word boundaries in Chinese sentences. CWS is also a character-level sequence labeling problem and can likewise be handled with the CRF method, so the recognition model's ability to recognize entity boundaries can be improved by jointly training the NER and CWS models. The server inputs the local context information corresponding to each word into the CRF layer of the Chinese word segmentation model to obtain the second classification result, and calculates the loss of the Chinese word segmentation model from the second classification result and the corresponding classification labels, where the second classification result includes the BIO classification of each word. The loss function of the CWS model is:

L_CWS(θ_cws) = -log p(y|c; θ_cws)

where θ_cws denotes the parameters of the model and c is the hidden character representation of the sentence output by the CNN layer.

The word classification model brings dictionary information into the training of the recognition model and is used to classify a sequence of Chinese characters according to whether it can be a power metering entity term. For example, the character sequence "电量波动异常" ("abnormal power fluctuation") is classified as true, while the corrupted character sequence "电置工动异常" is classified as false. The terms for true samples are selected from a professional dictionary, while a false sample is generated by randomly drawing a term from the dictionary and replacing each character in that term with another randomly selected character with probability p. When labeling and classifying the corpus, the true/false classification of power metering terms can be annotated. The server inputs the local context information corresponding to each word into the word classification model to obtain the third classification result of each word. The word classification model can be implemented with a neural network, built with a max pooling layer and a sigmoid function layer, so as to perform the true/false classification of power metering terms. The loss function of the word classification model is:

L_WC = -(1/N_w) Σ_(i=1..N_w) [ y_i·log(s_i) + (1 - y_i)·log(1 - s_i) ]

where N_w is the number of training samples used for word classification, s_i is the prediction score of the i-th sample, and y_i is the word classification value expressed as a 0-1 label.
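
A sketch of the word classification model as described (max pooling over the character-level context features followed by a sigmoid), assuming PyTorch; dimensions are placeholders, and the loss in the trailing comment is the binary cross-entropy reading of the formula above.

```python
import torch
import torch.nn as nn

class WordClassifier(nn.Module):
    def __init__(self, in_dim=256):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)

    def forward(self, context):                # context: (num_words, word_len, in_dim)
        pooled = context.max(dim=1).values     # max pooling over the characters of a word
        return torch.sigmoid(self.fc(pooled)).squeeze(-1)   # prediction score s_i in (0, 1)

# Training with 0/1 labels y_i (true/false power-metering term):
#   scores = WordClassifier()(word_contexts)
#   loss_wc = nn.functional.binary_cross_entropy(scores, labels.float())
```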

The server can configure a corresponding loss coefficient for each sub-model according to its importance and its influence on the result, and obtain the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model, the loss of the word classification model and their respective loss coefficients.

In the solution of the above embodiment, the local context information corresponding to each word output by the convolutional neural network model is input into the NER model, the Chinese word segmentation model and the word classification model respectively to obtain the current loss value of the recognition model, and the recognition model is trained in a multi-task, multi-model joint fashion, which improves its named entity recognition accuracy.

In one embodiment, obtaining the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model, the loss of the word classification model and their respective coefficients comprises:

determining the loss coefficient corresponding to the loss of the NER model from the loss coefficients corresponding to the loss of the Chinese word segmentation model and the loss of the word classification model; and obtaining the loss value of the recognition model from the loss of the NER model and its loss coefficient and the loss of the Chinese word segmentation model and its loss coefficient.

In this embodiment, the server determines the loss coefficient corresponding to each sub-model according to the characteristics of each model, its degree of association and its importance within the recognition model, and determines the current loss value of the recognition model from the loss of each model and the corresponding loss coefficient. The current loss value of the recognition model can be a combination of the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model, which can be expressed as:

L = (1 - λ_1 - λ_2)·L_NER + λ_1·L_CWS

where λ_1 is the coefficient of the relative importance of the Chinese word segmentation model's loss in the total loss, and λ_2 is the coefficient of the relative importance of the word classification loss within the loss of the Chinese word segmentation model.
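
Read literally, the combined loss above can be implemented as the short function below; the interpretation that λ_2 only reduces the weight assigned to the NER loss follows the formula as written and is flagged here as an assumption.

```python
def combined_loss(loss_ner, loss_cws, lam1, lam2):
    # L = (1 - lambda1 - lambda2) * L_NER + lambda1 * L_CWS, as in the formula above;
    # lambda2 enters only by lowering the NER weight (assumption based on the text).
    return (1 - lam1 - lam2) * loss_ner + lam1 * loss_cws
```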

In the solution of the above embodiment, the current loss value of the model is obtained by determining the loss coefficients corresponding to the NER model, the Chinese word segmentation model and the word classification model, and the recognition model is trained through multi-task, multi-model joint training, which improves its named entity recognition accuracy.

It should be understood that although the steps in the flowchart of FIG. 1 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

To explain the above solution more clearly, the named entity recognition method for power metering is applied to the training and prediction of a power metering named entity recognition model. In the model training stage, a corpus of 328 text documents related to power metering was built, containing 16,454 sentences and 21,627 power metering technical terms in four categories. The corpus was split at a ratio of roughly 8:1, with 292 documents selected as the training set and 36 as the test set. The recognition model was trained following the flow of the named entity recognition method for power metering shown in FIG. 2, and FIG. 3 shows the model used for training. In the model prediction stage, the displayed result is shown in FIG. 4: the text has been segmented, with words separated by spaces, and the professional technical terms of the power metering field are tagged with their corresponding entity labels (such as Meter and Index) and highlighted with a colored text background.
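A simple sketch of the roughly 8:1 document-level split described above (the shuffling seed and function name are illustrative assumptions):

    import random

    def split_corpus(documents, num_train=292, seed=0):
        # Shuffle the 328 corpus documents and keep 292 for training and the remaining 36 for testing.
        docs = list(documents)
        random.Random(seed).shuffle(docs)
        return docs[:num_train], docs[num_train:]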

In one embodiment, as shown in FIG. 5, a named entity recognition apparatus for power metering is provided. The apparatus 500 includes:

a corpus acquisition module 501, configured to acquire the power metering corpus to be recognized, the power metering corpus including text information describing power metering information;

a model input module 502, configured to input the power metering corpus to be recognized into a pre-trained recognition model, the recognition model including at least a convolutional neural network model, an NER model, a Chinese word segmentation model and a word classification model, the recognition model being obtained by inputting power metering corpus training data into the convolutional neural network model and inputting the output of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model respectively for training, and the recognition model being used to perform named entity recognition on the input power metering corpus; and

a recognition information acquisition module 503, configured to acquire the output of the recognition model to obtain the named entity recognition information of the power metering corpus.
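A minimal sketch of how modules 501-503 could be wired together is given below; the class and method names and the predict() interface of the recognition model are assumptions for illustration only.

    class PowerMeteringNERApparatus:
        # Hypothetical wrapper mirroring modules 501-503 of apparatus 500.
        def __init__(self, recognition_model):
            self.model = recognition_model            # pre-trained recognition model

        def acquire_corpus(self, source_text):
            # corpus acquisition module 501: obtain the power metering corpus to be recognized
            return source_text.strip()

        def recognize(self, corpus_text):
            # model input module 502: feed the corpus into the pre-trained recognition model;
            # recognition information acquisition module 503: return the named entity recognition information
            return self.model.predict(corpus_text)    # assumed predict() interface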

In one embodiment, the above apparatus 500 is further configured to: classify and label the acquired power metering corpus according to the preset power metering named entity term classification to obtain a first labeled corpus; vectorize the first labeled corpus to obtain the character vector corresponding to each word in the first labeled corpus; obtain a second labeled corpus from the character vectors corresponding to the words; input the second labeled corpus into the convolutional neural network model of the recognition model to be trained to obtain the local context information corresponding to each word, the local context information representing the context of each word; input the local context information corresponding to each word into the NER model, the Chinese word segmentation model and the word classification model of the recognition model to be trained for training, and obtain the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model; and update the convolutional neural network model, the NER model, the Chinese word segmentation model and the word classification model of the recognition model according to the current loss value of the recognition model to obtain the trained recognition model.
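A condensed sketch of one step of this training procedure is shown below; every module name, the .loss() interfaces and the optimizer usage are assumptions made for illustration, following PyTorch conventions:

    def train_step(batch, encoder, ner_head, cws_head, cls_head, optimizer, lambda1, lambda2):
        # batch: character vectors of the second labeled corpus plus the three kinds of labels
        local_ctx = encoder(batch['char_vectors'])                  # CNN: local context information
        loss_ner = ner_head.loss(local_ctx, batch['ner_tags'])      # NER model loss
        loss_cws = cws_head.loss(local_ctx, batch['cws_tags'])      # Chinese word segmentation loss
        loss_cls = cls_head.loss(local_ctx, batch['word_labels'])   # word classification loss
        loss = (1 - lambda1 - lambda2) * loss_ner + lambda1 * loss_cws + lambda2 * loss_cls
        optimizer.zero_grad()
        loss.backward()                                             # update all four sub-models jointly
        optimizer.step()
        return loss.item()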

In one embodiment, the above apparatus 500 is further configured to: obtain an original power metering corpus data set; obtain the preset power metering named entity term classification, which includes at least two of a power index category, a power object category, a power phenomenon category and a metering behavior category, as well as a power metering term true/false classification; and label the original corpus data set according to the preset power metering named entity term classification using the BIO labeling scheme to obtain the first labeled corpus.
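As a hedged illustration of the BIO labeling scheme mentioned here, the snippet below tags a made-up character sequence with the entity labels Meter and Index that appear in FIG. 4; the particular segmentation and tags are assumptions, not data from the corpus of this application.

    # Hypothetical BIO annotation of the character sequence "电能表计量误差"
    chars = ["电", "能", "表", "计", "量", "误", "差"]
    tags  = ["B-Meter", "I-Meter", "I-Meter", "B-Index", "I-Index", "I-Index", "I-Index"]

    # B- marks the first character of an entity, I- a character inside it,
    # and O (not shown) would mark characters outside any power metering entity.
    for ch, tag in zip(chars, tags):
        print(ch, tag)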

In one embodiment, the above apparatus 500 is further configured to map each word in the first labeled corpus to a one-hot vector and input it into a word2vec model, and to obtain the character vector corresponding to each word from the output of the word2vec model.
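A minimal sketch of this vectorization step, assuming the gensim library (4.x API) is used as the word2vec model; the toy corpus and the embedding size are illustrative assumptions, and the one-hot mapping corresponds to the vocabulary index lookup that the word2vec model performs internally.

    from gensim.models import Word2Vec

    # Sentences of the first labeled corpus, each given as a list of characters
    corpus = [["电", "量", "波", "动", "异", "常"],
              ["计", "量", "误", "差"]]

    # Train a small skip-gram word2vec model (vector_size/window/min_count are illustrative)
    w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

    # Character vector corresponding to one character of the corpus
    char_vector = w2v.wv["电"]    # 100-dimensional embedding used to build the second labeled corpus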

In one embodiment, the convolutional neural network model includes a plurality of filters with different window sizes, and the above apparatus 500 is further configured to input the second labeled corpus into the convolutional neural network model of the recognition model to be trained and to learn the context information of each word of the second labeled corpus through the plurality of filters, obtaining the local context information corresponding to each word.
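A sketch of such an encoder with filters of several window sizes, assuming PyTorch; the embedding dimension, filter count and window widths are illustrative assumptions:

    import torch
    import torch.nn as nn

    class MultiWindowCNN(nn.Module):
        # Hypothetical convolutional encoder with filters of different window sizes.
        def __init__(self, embed_dim=100, num_filters=64, windows=(2, 3, 4)):
            super().__init__()
            self.convs = nn.ModuleList([
                nn.Conv1d(embed_dim, num_filters, kernel_size=w, padding=w // 2) for w in windows
            ])

        def forward(self, char_vectors):
            # char_vectors: (batch, seq_len, embed_dim) character vectors of the labeled corpus
            seq_len = char_vectors.size(1)
            x = char_vectors.transpose(1, 2)                         # (batch, embed_dim, seq_len)
            feats = [torch.relu(conv(x))[:, :, :seq_len] for conv in self.convs]
            # Concatenate the feature maps of all window sizes per character position
            return torch.cat(feats, dim=1).transpose(1, 2)           # local context information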

In one embodiment, the above apparatus 500 is further configured to: input the local context information corresponding to each word into the NER model to obtain the first classification result for each word, and calculate the loss of the NER model from the first classification result and the corresponding classification labels; input the local context information corresponding to each word into the CRF layer of the Chinese word segmentation model to obtain the second classification result for each word, and calculate the loss of the Chinese word segmentation model from the second classification result and the corresponding classification labels; input the local context information corresponding to each word into the word classification model to obtain the third classification result for each word, and calculate the loss of the word classification model from the third classification result and the corresponding classification labels; and obtain the current loss value of the recognition model from the loss of the NER model, the loss of the Chinese word segmentation model, the loss of the word classification model and their corresponding loss coefficients.
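As one possible realization of the NER branch described here, a sketch of a BLSTM layer producing emission scores followed by a CRF layer is given below; the CRF is assumed to come from the third-party pytorch-crf package, so its import path and call signature are assumptions, and the layer sizes are illustrative.

    import torch.nn as nn
    from torchcrf import CRF   # assumption: provided by the pytorch-crf package

    class NERHead(nn.Module):
        # Hypothetical BLSTM + CRF head for BIO tags with power metering entity categories.
        def __init__(self, feature_dim, hidden_dim, num_tags):
            super().__init__()
            self.blstm = nn.LSTM(feature_dim, hidden_dim, bidirectional=True, batch_first=True)
            self.emissions = nn.Linear(2 * hidden_dim, num_tags)
            self.crf = CRF(num_tags, batch_first=True)

        def loss(self, local_ctx, tags):
            h, _ = self.blstm(local_ctx)
            return -self.crf(self.emissions(h), tags)        # negative log-likelihood of the tag sequence

        def decode(self, local_ctx):
            h, _ = self.blstm(local_ctx)
            return self.crf.decode(self.emissions(h))        # most likely BIO tag sequence per sentence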

In one embodiment, the above apparatus 500 is further configured to determine the loss coefficient corresponding to the loss of the NER model according to the loss coefficients corresponding to the loss of the Chinese word segmentation model and the loss of the word classification model respectively, and to obtain the loss value of the recognition model from the loss of the NER model and its corresponding loss coefficient and the loss of the Chinese word segmentation model and its corresponding loss coefficient.

For the specific limitations of the named entity recognition apparatus for power metering, reference may be made to the limitations of the named entity recognition method for power metering above, which are not repeated here. Each module in the above named entity recognition apparatus for power metering may be implemented in whole or in part by software, hardware or a combination thereof. The above modules may be embedded in or independent of the processor of a computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to each module.

The named entity recognition method for power metering provided in this application can be applied to a computer device, which may be a server whose internal structure is shown in FIG. 6. The computer device includes a processor, a memory and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the acquired power metering corpus and the recognition model. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements a named entity recognition method for power metering.

Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of a partial structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above method embodiments when executing the computer program.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing the steps of the above method embodiments when executed by a processor.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it may include the processes of the above method embodiments. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory or optical memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.

The above embodiments only express several implementations of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

1. A named entity identification method for power metering, characterized in that the method comprises the following steps:
acquiring an electric power measurement corpus to be identified; the electric power metering corpus comprises text information used for describing electric power metering information;
inputting the electric power metering corpus to be recognized into a recognition model which is trained in advance; the recognition model at least comprises a convolutional neural network model, an NER model, a Chinese word segmentation model and a word classification model; the recognition model is obtained by inputting electric power measurement corpus training data into the convolutional neural network model and respectively inputting output results of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model for training; the recognition model is used for carrying out named entity recognition on the input electric power metering corpus;
and acquiring the output of the identification model to obtain named entity identification information of the electric power metering corpus.
2. The method of claim 1, further comprising:
classifying the obtained electric power metering corpus according to preset electric power metering named entity terms to obtain a first labeled corpus;
vectorizing the first labeled corpus to obtain a character vector corresponding to each word in the first labeled corpus; obtaining a second labeled corpus according to the character vectors corresponding to the words;
inputting the second labeled corpus into a convolutional neural network model to be trained of the recognition model to obtain local context information corresponding to each word; the local context information is used for representing the context information of each word;
respectively inputting the local context information corresponding to each word into an NER model to be trained, a Chinese word segmentation model and a word classification model of the recognition model for training, and obtaining the current loss value of the recognition model according to the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model;
and updating the convolutional neural network model, the NER model, the Chinese word segmentation model and the word classification model of the recognition model according to the current loss value of the recognition model to obtain the trained recognition model.
3. The method according to claim 2, wherein the classifying the obtained electric power metering corpus according to a preset electric power metering named entity term to obtain a first labeled corpus comprises:
obtaining an electric power measurement original corpus data set;
acquiring preset electric power measurement named entity term classifications, including at least two of an electric power index classification, an electric power object classification, an electric power phenomenon classification and a measurement behavior classification, and including an electric power measurement term True/false classification;
and labeling the original corpus data set according to the preset electric power metering named entity term classification in a BIO labeling mode to obtain the first labeled corpus set.
4. The method according to claim 2, wherein the vectorizing the first markup corpus to obtain the character vector corresponding to each word in the first markup corpus comprises:
mapping each word in the first labeled corpus set to one-hot vectors, and inputting the words into a word2vec model;
and obtaining the character vector corresponding to each word according to the output result of the word2vec model.
5. The method of claim 2, wherein the convolutional neural network model comprises a plurality of filters with different window sizes; inputting the second labeled corpus into a convolutional neural network model to be trained of the recognition model to obtain local context information corresponding to each word, including:
and inputting the second labeled corpus into a convolutional neural network model to be trained of the recognition model, and learning the context information of each word of the second labeled corpus through the plurality of filters to obtain the local context information corresponding to each word.
6. The method according to claim 3, wherein the inputting the local context information corresponding to each word into a NER model to be trained, a Chinese word segmentation model and a word classification model of the recognition model for training respectively, and obtaining a current loss value of the recognition model according to a loss of the NER model, a loss of the Chinese word segmentation model and a loss of the word classification model comprises:
inputting the local context information corresponding to each word into the NER model to obtain a first classification result of each word; calculating the loss of the NER model according to the first classification result and the corresponding classification label; the NER model comprises a BLSTM layer and a CRF layer; the first classification result comprises BIO classifications of all words and corresponding electric power metering named entity term classifications;
inputting the local context information corresponding to each word into a CRF layer of the Chinese word segmentation model to obtain a second classification result corresponding to each word; calculating the loss of the Chinese word segmentation model according to the second classification result and the corresponding classification label; the second classification result comprises BIO classification of each word;
inputting the local context information corresponding to each word into the word classification model to obtain a third classification result of each word; calculating the loss of the word classification model according to the third classification result and the corresponding classification label; the third classification result comprises a power measurement term True/false classification of the words;
and obtaining the current loss value of the recognition model according to the loss of the NER model, the loss of the Chinese word segmentation model, the loss of the word classification model and the loss coefficients corresponding to the loss models.
7. The method according to claim 6, wherein the obtaining the current loss value of the recognition model according to the loss of the NER model, the loss of the Chinese word segmentation model and the loss of the word classification model and the corresponding coefficients comprises:
determining a loss coefficient corresponding to the loss of the NER model according to the loss coefficients respectively corresponding to the loss of the Chinese word segmentation model and the loss of the word classification model;
and obtaining the current loss value of the recognition model according to the loss and the corresponding loss coefficient of the NER model and the loss and the corresponding loss coefficient of the Chinese word segmentation model.
8. An apparatus for named entity identification for power metering, the apparatus comprising:
the corpus acquiring module is used for acquiring electric power metering corpora to be identified; the electric power metering corpus comprises text information used for describing electric power metering information;
the model input module is used for inputting the electric power metering corpus to be recognized into a recognition model which is trained in advance; the recognition model at least comprises a convolutional neural network model, an NER model, a Chinese word segmentation model and a word classification model; the recognition model is obtained by inputting electric power measurement corpus training data into the convolutional neural network model and respectively inputting output results of the convolutional neural network model into the NER model, the Chinese word segmentation model and the word classification model for training; the recognition model is used for carrying out named entity recognition on the input electric power metering corpus;
and the identification information acquisition module is used for acquiring the output of the identification model to obtain the named entity identification information of the electric power metering corpus.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application number: CN202010928024.9A; priority date and filing date: 2020-09-07; title: Named entity identification method, device, equipment and storage medium for power metering

Publications (1)

Publication number: CN112052684A; publication date: 2020-12-08

Family Applications (1) (family ID: 73607909)

Application number: CN202010928024.9A; filing date: 2020-09-07; status: Pending; title: Named entity identification method, device, equipment and storage medium for power metering

Country Status (1)

Country: CN; link: CN112052684A



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information
    Address after: Room 86, room 406, No. 1, Yichuang street, Zhongxin knowledge city, Huangpu District, Guangzhou, Guangdong 510700 (China)
    Applicants after: Southern Power Grid Digital Grid Research Institute Co.,Ltd.; CHINA SOUTHERN POWER GRID Co.,Ltd.
    Address before: Room 86, room 406, No. 1, Yichuang street, Zhongxin knowledge city, Huangpu District, Guangzhou, Guangdong 510700 (China)
    Applicants before: Southern Power Grid Digital Grid Research Institute Co.,Ltd.; CHINA SOUTHERN POWER GRID Co.,Ltd.
TA01: Transfer of patent application right (effective date of registration: 2024-03-14)
    Address after: Room 86, room 406, No.1, Yichuang street, Zhongxin Guangzhou Knowledge City, Huangpu District, Guangzhou City, Guangdong Province (China)
    Applicant after: Southern Power Grid Digital Grid Research Institute Co.,Ltd.
    Address before: Room 86, room 406, No. 1, Yichuang street, Zhongxin knowledge city, Huangpu District, Guangzhou, Guangdong 510700 (China)
    Applicants before: Southern Power Grid Digital Grid Research Institute Co.,Ltd.; CHINA SOUTHERN POWER GRID Co.,Ltd.
RJ01: Rejection of invention patent application after publication (application publication date: 2020-12-08)

