





TECHNICAL FIELD
The present invention belongs to the fields of artificial intelligence and natural language processing, and in particular relates to a Chinese named entity extraction method based on multiple annotation frameworks and fused features.
BACKGROUND
With the rapid development of Internet technology, data across industries has grown explosively, driving intelligent big-data analysis and mining services and innovative applications, and further advancing the digital economy in China. This data contains large amounts of unstructured text, and extracting structured, useful information from such text has become a focus of industry. Doing so involves a fundamental task in natural language processing: named entity extraction.
Early research on named entity recognition was mainly based on dictionaries and rules, relying on linguists and domain experts to manually construct domain dictionaries and rule templates according to the characteristics of the data set. The advantage of rule-based approaches is that the rules can be continuously updated to extract the target entities as requirements change. Their drawback is that, for complex domains and application scenarios, manually building rules is costly; as the rule base grows, rule conflicts arise easily, making the rule base hard to maintain and extend and unable to adapt to changes in data and domain.
Subsequently, named entity recognition based on statistical machine learning attracted attention. In this setting, named entity recognition is framed as a sequence labeling problem. Statistical machine learning methods applied to NER mainly include the maximum entropy model, the hidden Markov model, the maximum entropy Markov model, and conditional random fields. These methods depend on manually constructed features, which makes the process cumbersome.
In recent years, with the continuous development of deep learning, more and more work in named entity recognition has been based on deep neural networks (DNNs). DNN-based methods require no tedious feature engineering, and their performance far exceeds that of traditional rule-based and statistical machine learning methods.
Chinese named entity recognition is harder than its English counterpart: Chinese lacks delimiters such as the spaces in English text and has no obvious morphological cues, which easily causes boundary ambiguity. In addition, Chinese exhibits polysemy: the same word carries different meanings in different domains or contexts, so contextual information must be fully exploited to interpret word meaning. Chinese also features ellipsis and abbreviation, all of which pose greater challenges for Chinese named entity recognition. Many existing Chinese named entity extraction methods make insufficient use of word information and rely on a single, limited annotation framework, which hurts extraction accuracy.
SUMMARY OF THE INVENTION
Objective: In view of the above problems and deficiencies of the prior art, the present invention aims to provide a Chinese named entity extraction method based on multiple annotation frameworks and fused features, to solve two problems of existing methods: being confined to a single annotation framework, and making insufficient use of word information, which makes entity boundaries difficult to identify.
Technical solution: To achieve the above objective, the present invention adopts a Chinese named entity extraction method based on multiple annotation frameworks and fused features, comprising the following steps:
(1) For each Chinese character in the input character sequence, perform word matching against an external lexicon; map the matched words to word vectors through a word-vector lookup table, and map each character's segmentation tag within a matched word to a segmentation-tag vector through a segmentation-tag vector lookup table; concatenate the segmentation-tag vector with the word vector to form the dictionary feature.
(2) Annotate each character with pinyin according to its meaning in the matched words, and map the pinyin through a pinyin-vector lookup table to obtain the pinyin feature.
(3) Based on a dot-product attention mechanism, fuse the dictionary feature and the pinyin feature into the character encodings produced by the Chinese pre-trained language model BERT, providing subsequent layers with character semantic encodings that combine dictionary and pinyin features.
(4) Feed the character semantic encodings into two independent bidirectional long short-term memory (BiLSTM) networks for feature sequence modeling, yielding a first feature sequence encoding X^a and a second feature sequence encoding X^b.
(5) Take sequence labeling as the auxiliary task and pointer labeling as the main task; use the first feature sequence encoding X^a as the input of the sequence-labeling auxiliary task and the second feature sequence encoding X^b as the input of the pointer-labeling main task, and jointly learn the two tasks with a multi-task learning model.
(6) Compute the log-likelihood loss L_span of the sequence-labeling auxiliary task under a conditional random field, the cross-entropy loss L_start for classifying the entity types of entity-head characters in the pointer-labeling main task, and the cross-entropy loss L_end for entity-tail characters; the weighted sum of the three losses gives the training objective the model minimizes, and training proceeds jointly end to end. At test time, the pointer-labeling main task extracts the entity spans and their types from the sentence.
Further, in step (1), the external lexicon and the word-vector lookup table come from publicly available pre-trained word vectors on the Internet, and the segmentation-tag vector lookup table consists of one-hot vectors.
Further, in step (2), the pinyin-vector lookup table is trained with word2vec on an external Chinese corpus whose text has been converted to pinyin with Chinese pinyin software.
Further, in step (5), the sequence-labeling auxiliary task marks entities in the input sentence with type-free BMOES tags and is responsible for extracting Chinese named entity spans, which carry no types; the pointer-labeling main task marks only the entity types of the head and tail characters of entity spans and is responsible for extracting Chinese named entities, which carry types.
Further, in step (6), at test time the label corresponding to the maximum of each character's predicted entity-type probability distribution is taken as that character's predicted label; each entity-head character is then matched with the nearest entity-tail character of the same entity type, and the text span between them is extracted as an entity.
Beneficial effects: The present invention effectively addresses the difficulty of identifying Chinese named entity boundaries, exploits the advantages of different annotation frameworks, and improves the accuracy of Chinese named entity extraction. First, it strengthens boundary recognition by constructing dictionary and pinyin features, and encodes characters with the Chinese pre-trained language model BERT to provide contextual semantic support for the upper layers. Second, it models feature sequences with the recurrent structure of bidirectional long short-term memory networks, learning sequence position information and alleviating the loss of such information caused by BERT's lack of sequence-dependent modeling. Third, it jointly learns sequence labeling and pointer labeling with a multi-task learning model, combining the strengths of the two annotation frameworks, overcoming the limitations of a single framework, and further improving extraction accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is the overall framework diagram of the method of the present invention;
Fig. 2 is an example diagram of dictionary and pinyin feature construction in the method;
Fig. 3 is an example diagram of sequence labeling in the method;
Fig. 4 is an example diagram of pointer labeling in the method;
Figs. 5(a) and 5(b) show the effect of the lexicon matching window size on accuracy on the Ontonotes4 and MSRA data sets, respectively;
Figs. 6(a) and 6(b) show the effect of the lexicon matching window size on accuracy on the Resume and Weibo data sets, respectively.
DETAILED DESCRIPTION
The present invention is further clarified below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present disclosure, modifications in various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims of this application.
The present invention proposes a Chinese named entity extraction method based on multiple annotation frameworks and fused features, solving the problems that existing methods struggle to identify entity boundaries and are confined to a single annotation framework. As shown in Fig. 1, the complete pipeline comprises six parts: a dictionary feature construction stage, a pinyin feature construction stage, a dictionary and pinyin feature fusion stage, a feature sequence modeling stage, a joint learning stage over multiple annotation frameworks, and an output-layer modeling stage. The specific embodiments are described as follows.
The dictionary feature construction stage corresponds to step (1) of the technical solution. Specifically: let the input be any character sequence X = {c_1, c_2, ..., c_n}, where n is the sequence length and each c_i (1 ≤ i ≤ n) is a single Chinese character from the Chinese character vocabulary. For any character c_i in X, in order to introduce words related to its context, an external lexicon L_x is introduced: with a word matching window l_w, every text span in the sentence that contains c_i and has length at most l_w is matched against the words in L_x. If a span appears in L_x, it is taken as a candidate word contextually related to c_i. Since several spans containing c_i may appear in the lexicon, this yields a candidate matching word set ws(c_i) = {w_1, w_2, ..., w_m} for c_i, where w_j (1 ≤ j ≤ m) denotes a matched word.
After the candidate set ws(c_i) is obtained, it is further filtered: any word that is a substring of another word in the set is removed. The reasons are: 1) a complete word usually better matches the character's context; for example, in "南京市长江大桥" (Nanjing Yangtze River Bridge), "长江大桥" is a better candidate for "长" than "长江"; 2) it reduces interference when fusing dictionary and pinyin features through the attention mechanism, making it more likely that attention selects, from the candidate list, the word that best fits the character's context.
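For illustration, a minimal Python sketch of this matching-and-filtering step follows; `lexicon` and `match_words` are hypothetical names, and a real implementation would index the lexicon (e.g. with a trie) rather than enumerate all spans.

```python
def match_words(sentence: str, i: int, lexicon: set, lw: int) -> list:
    """Collect lexicon words of length <= lw that contain character sentence[i]."""
    candidates = set()
    for start in range(max(0, i - lw + 1), i + 1):
        for end in range(i + 1, min(len(sentence), start + lw) + 1):
            span = sentence[start:end]
            if len(span) >= 2 and span in lexicon:  # single characters skipped here
                candidates.add(span)
    # Drop any candidate that is a substring of another candidate,
    # e.g. "长江" is filtered out when "长江大桥" also matches.
    return sorted(w for w in candidates
                  if not any(w != v and w in v for v in candidates))

lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
print(match_words("南京市长江大桥", 3, lexicon, lw=4))  # "长" -> ['市长', '长江大桥']
```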
The words in the filtered matching set ws(c_i) are mapped to word vectors through the word-vector lookup table e^w, giving the matched-word feature encoding WE(c_i):
WE(c_i) = e^w(ws(c_i))
Here e^w comes from already trained pre-trained word vectors and is kept frozen during training. Next, the position of the character within each matched word is tagged: B means c_i is at the beginning of the word, M in the middle, and E at the end. Different matched words of c_i correspond to different segmentations of the sequence, so it is necessary to fold the segmentation tags of c_i within its matched words into the dictionary feature, further highlighting the differences between matched words. For any word w_j in the candidate set ws(c_i), let seg(w_j) ∈ {B, M, E} denote the segmentation tag of c_i in w_j. With START(w_j) the start index of w_j in X and END(w_j) its end index, seg(w_j) is defined as:
seg(w_j) = B if i = START(w_j); E if i = END(w_j); M if START(w_j) < i < END(w_j)
Applying the above formula to all words in ws(c_i) gives:
segs(c_i) = {seg(w_1), seg(w_2), ..., seg(w_m)}
where segs(c_i) denotes the set of segmentation tags of c_i across all its matched words. The tags in segs(c_i) are mapped to one-hot segmentation-tag encodings SEGE(c_i) through the segmentation-tag vector lookup table e^seg:
SEGE(c_i) = e^seg(segs(c_i))
Each dimension of the one-hot vector corresponds to one element of the set {B, M, E}: [1, 0, 0] corresponds to B, [0, 1, 0] to M, and [0, 0, 1] to E.
The segmentation-tag encoding SEGE(c_i) and the matched-word feature encoding WE(c_i) are concatenated along the encoding dimension to give the final dictionary feature encoding LE(c_i) of character c_i:
LE(c_i) = [SEGE(c_i); WE(c_i)]
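A small sketch of how seg(w_j), the one-hot mapping, and the concatenation could be realized; `word_vec` stands in for the frozen lookup table e^w, and the (word, start, end) triples stand in for the filtered matches, both hypothetical interfaces.

```python
import numpy as np

SEG_ONE_HOT = {"B": [1, 0, 0], "M": [0, 1, 0], "E": [0, 0, 1]}

def seg_tag(i: int, start: int, end: int) -> str:
    """Segmentation tag of character i inside a matched word spanning [start, end]."""
    if i == start:
        return "B"
    if i == end:
        return "E"
    return "M"

def dictionary_feature(i: int, matches, word_vec: dict, d_w: int = 50):
    """LE(c_i): one row [SEGE; WE] per matched word, shape (m, 3 + d_w)."""
    rows = []
    for word, start, end in matches:  # matches: list of (word, start, end)
        seg = np.asarray(SEG_ONE_HOT[seg_tag(i, start, end)], dtype=np.float32)
        wv = word_vec.get(word, np.zeros(d_w, dtype=np.float32))
        rows.append(np.concatenate([seg, wv]))
    return np.stack(rows)
```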
The pinyin feature construction stage corresponds to step (2) of the technical solution. Specifically: counting the neutral tone, pinyin has five tones in total, e.g. "chang", "chāng", "cháng", "chǎng", "chàng". Suppose entities are to be extracted from the sentence "南京市长江大桥". When the "长" in the sentence is pronounced "cháng", the sentence segments as "南京市 | 长江大桥" and "长江大桥" (Yangtze River Bridge) is extracted as a location entity; when "长" is pronounced "zhǎng", the sentence segments as "南京市长 | 江大桥" and "江大桥" (Jiang Daqiao) is extracted as a person entity. This shows that the pinyin of characters in a sentence can affect entity extraction accuracy.
For any character c_i in the input sequence X, after its candidate word set ws(c_i) is obtained, Chinese pinyin software (e.g. pypinyin) annotates c_i with pinyin according to its meaning in each matched word, giving the pinyin set pys(c_i) corresponding to ws(c_i). The pinyin in pys(c_i) is then mapped to pinyin vectors through the pinyin-vector lookup table e^py, giving the pinyin feature encoding PYE(c_i):
PYE(c_i) = e^py(pys(c_i))
The pinyin-vector lookup table e^py is obtained by converting an external Chinese corpus (e.g. the Chinese Wikipedia corpus) to pinyin with the pinyin software and then training with Word2Vec's skip-gram method. Since the external corpus may contain digits, English, or other symbols without pinyin, during the data preprocessing stage before word-vector training the present invention converts English to "[ENG]", digits to "[DIGIT]", and all other characters without pinyin to "[UNK]".
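As a sketch of this step, the pypinyin package resolves heteronyms from its phrase dictionary, which matches the word-level annotation described above, and a gensim skip-gram model can stand in for the pinyin-vector training; the toy corpus below is purely illustrative.

```python
from pypinyin import Style, pinyin

print(pinyin("长江大桥", style=Style.TONE))  # [['cháng'], ['jiāng'], ['dà'], ['qiáo']]
print(pinyin("市长", style=Style.TONE))      # [['shì'], ['zhǎng']]

# Train a pinyin-vector lookup table e^py with skip-gram Word2Vec
# (gensim 4.x API assumed) over a pinyin-converted corpus:
from gensim.models import Word2Vec

corpus = [["nán", "jīng", "shì", "zhǎng", "jiāng", "dà", "qiáo"]]  # toy example
model = Word2Vec(corpus, vector_size=50, sg=1, min_count=1, window=5)
pinyin_vec = model.wv  # maps each pinyin token to a 50-dim vector
```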
An example of dictionary and pinyin feature construction is shown in Fig. 2, which gives the matching results for "市" and "长", where w_{i,j} denotes the word formed by the span {c_i, c_{i+1}, ..., c_j}. Note that "长江" is not included in the matching result for "长", because "长江" is a substring of "长江大桥" and is filtered out.
The dictionary and pinyin feature fusion stage corresponds to step (3) of the technical solution. Specifically: to avoid overfitting on the small annotated entity extraction data sets of some vertical domains, the present invention uses the Chinese pre-trained language model BERT to provide semantic support and improve generalization. The input sequence X = {c_1, c_2, ..., c_n} is fed into BERT, and the output of BERT's last layer is taken as the sequence encoding X^h = [x_1, x_2, ..., x_n], where each x_i ∈ R^{d_x} is a real column vector, X^h ∈ R^{d_x×n} is a real matrix, and d_x denotes the BERT encoding dimension. The dictionary feature and pinyin feature of c_i constructed above are concatenated along the encoding dimension to give the fused feature LPE(c_i):
LPE(c_i) = [LE(c_i); PYE(c_i)]
Let d_w be the encoding dimension of the word-vector lookup table e^w, d_py that of the pinyin-vector lookup table e^py, and m the size of the candidate matching word set ws(c_i). LPE(c_i) is fused into the character encoding x_i by dot-product attention: x_i plays the role of the query, and LPE(c_i) the roles of key and value. First, LPE(c_i) is linearly projected to LPE_i^{kv}, whose encoding dimension matches that of x_i:
LPE_i^{kv} = W_{kv} · LPE(c_i)
where W_{kv} is a trainable parameter matrix and the projected fused feature is LPE_i^{kv} ∈ R^{d_x×m}. Let unsqueeze(M, y) denote expanding dimension y of matrix M and squeeze(M, y) denote compressing dimension y; unsqueeze(x_i, 0) thus turns x_i ∈ R^{d_x} into a matrix in R^{1×d_x}. The attention weights LPE_i^w are then computed as:
LPE_i^w = softmax(unsqueeze(x_i, 0) · LPE_i^{kv})
where the attention weights LPE_i^w ∈ R^{1×m} sum to 1 after the softmax. The attention output LPE_i^o is computed as the weighted sum of LPE_i^{kv} under LPE_i^w:
LPE_i^o = squeeze(LPE_i^w · (LPE_i^{kv})^T, 0)
with LPE_i^o ∈ R^{d_x}. Finally, LPE_i^o is added to the character encoding x_i to give the final semantic encoding of c_i:
x_i ← LPE_i^o + x_i
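A minimal PyTorch sketch of this fusion for a single character; shapes follow the text except that LPE_i^{kv} is stored as (m, d_x) rather than (d_x, m), and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse(x_i: torch.Tensor, lpe: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """Fuse dictionary+pinyin features lpe (m, d_w+d_py+3) into the BERT encoding x_i (d_x,)."""
    lpe_kv = proj(lpe)                        # (m, d_x): keys and values LPE_i^kv
    scores = x_i.unsqueeze(0) @ lpe_kv.t()    # (1, m): query-key dot products
    weights = F.softmax(scores, dim=-1)       # attention weights LPE_i^w, sum to 1
    lpe_o = (weights @ lpe_kv).squeeze(0)     # (d_x,): attention output LPE_i^o
    return x_i + lpe_o                        # residual add into the character encoding

d_x, d_w, d_py, m = 768, 50, 50, 4
proj = torch.nn.Linear(d_w + d_py + 3, d_x)
x_i, lpe = torch.randn(d_x), torch.randn(m, d_w + d_py + 3)
print(fuse(x_i, lpe, proj).shape)             # torch.Size([768])
```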
The feature sequence modeling stage corresponds to step (4) of the technical solution. Specifically: the Transformer self-attention mechanism cannot by itself capture sequence position information; the pre-trained language model BERT mitigates this by adding trainable absolute position encodings to its input, but it still lacks sequence-dependent modeling. The long short-term memory network (LSTM) needs no position encoding: its structure, which encodes recursively in sequence order, is able to learn sequence position information. The character semantic sequence encoding obtained after fusing the dictionary and pinyin features is fed into two bidirectional long short-term memory networks (BiLSTM) for feature sequence modeling: one BiLSTM output serves the sequence-labeling-based Chinese named entity span extraction auxiliary task of step (5), and the other serves the pointer-labeling-based Chinese named entity extraction main task of step (5). A BiLSTM consists of a forward and a backward LSTM, and the BiLSTMs of the two tasks are independent and share no training parameters.
Suppose that at time step t, the forward LSTM of the sequence-labeling auxiliary task outputs the hidden state →h_t^a and the backward LSTM outputs ←h_t^a; adding them gives the auxiliary task's BiLSTM hidden state at time step t:
h_t^a = →h_t^a + ←h_t^a
For the pointer-labeling main task, the forward LSTM hidden state is →h_t^b and the backward hidden state is ←h_t^b; adding them gives the main task's BiLSTM hidden state at time step t:
h_t^b = →h_t^b + ←h_t^b
Finally, the feature sequence modeling output of the sequence-labeling auxiliary task is X^a = [h_1^a, h_2^a, ..., h_n^a] ∈ R^{d_h×n}, and that of the pointer-labeling main task is X^b = [h_1^b, h_2^b, ..., h_n^b] ∈ R^{d_h×n}, where d_h denotes the LSTM encoding dimension.
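A brief PyTorch sketch of the two task-specific BiLSTMs; summing (rather than concatenating) the forward and backward hidden states follows the formulas above, and `TaskBiLSTM` is an illustrative name.

```python
import torch

d_x, d_h = 768, 768

class TaskBiLSTM(torch.nn.Module):
    """One task-specific BiLSTM; forward/backward hidden states are summed."""
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(d_x, d_h, num_layers=1,
                                  batch_first=True, bidirectional=True)

    def forward(self, x):
        out, _ = self.lstm(x)                    # (batch, n, 2*d_h)
        fwd, bwd = out[..., :d_h], out[..., d_h:]
        return fwd + bwd                         # h_t = forward + backward

bilstm_a, bilstm_b = TaskBiLSTM(), TaskBiLSTM()  # independent, no shared weights
x = torch.randn(2, 7, d_x)                       # fused encodings (batch, n, d_x)
x_a, x_b = bilstm_a(x), bilstm_b(x)              # X^a and X^b, (batch, n, d_h)
```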
The joint learning stage over multiple annotation frameworks corresponds to step (5) of the technical solution. Specifically: sequence labeling and pointer labeling are two common annotation frameworks for named entity extraction. Sequence labeling tags each character of the text with its position within an entity; Fig. 3 shows an example using BMOES, where B marks a character at the beginning of a named entity span, M in the middle, E at the end, O outside any span, and S a character that is itself a named entity span. The example sentence contains two entities, "南京市" and "长江大桥". Pointer labeling tags the entity types of the head and tail characters of each entity span, as shown in Fig. 4, where "南京市" and "长江大桥" are both location (Loc) entities. The two schemes on the example sentence are written out below.
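This is a plain illustration of Figs. 3 and 4, with "O" marking characters that carry no pointer label:

```python
sentence = list("南京市长江大桥")            # entities: 南京市, 长江大桥 (both Loc)

# Sequence labeling (BMOES, no entity types):
bmoes = ["B", "M", "E", "B", "M", "M", "E"]

# Pointer labeling (entity types on head/tail characters only):
head = ["Loc", "O", "O", "Loc", "O", "O", "O"]
tail = ["O", "O", "Loc", "O", "O", "O", "Loc"]
```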
By modeling dependencies over the full sequence, sequence labeling extracts more complete entities and usually achieves higher precision; by classifying the entity types of span head and tail characters, pointer labeling is more robust to noise and usually achieves higher recall. To combine the strengths of the two frameworks, X^a is used as the input of the sequence-labeling auxiliary task and X^b as the input of the pointer-labeling main task, and a multi-task learning model, e.g. the Multi-gate Mixture-of-Experts (MMOE) model or the Progressive Layered Extraction (PLE) model, jointly learns the sequence-labeling-based Chinese named entity span extraction auxiliary task and the pointer-labeling-based Chinese named entity extraction main task, yielding the auxiliary task output X_a and the main task output X_b.
The output-layer modeling stage corresponds to step (6) of the technical solution. Specifically: a Dropout layer is applied to the X_a and X_b obtained in the previous step to prevent overfitting. The dropped-out X_a is then fed into a conditional random field (CRF), and the likelihood p(y|X) of a BMOES label index sequence y ∈ Z^n under the sequence-labeling auxiliary task is computed as:
p(y|X) = exp(Σ_{t=1}^{n} (W_CRF^{y_t} · x_t^a + b_CRF^{(y_{t-1}, y_t)})) / Σ_{y′∈Y_X} exp(Σ_{t=1}^{n} (W_CRF^{y′_t} · x_t^a + b_CRF^{(y′_{t-1}, y′_t)}))
where x_t^a denotes the t-th column of the dropped-out X_a, Y_X denotes the set of all possible BMOES label index sequences of X under this task, and y′ ∈ Z^n is any sequence in Y_X. The training parameters are W_CRF ∈ R^{5×d_h} and b_CRF ∈ R^{5×5} (BMOES has 5 tags); W_CRF^{y_t} denotes the row of W_CRF for label y_t, b_CRF^{(y_{t-1}, y_t)} the transition parameter in b_CRF from label y_{t-1} to label y_t, and likewise for y′. Let the gold BMOES label index sequence of the auxiliary task be y^span ∈ Z^n, with Z denoting the integers; substituting it into the formula above gives the auxiliary task's log-likelihood loss:
L_span = -log p(y^span | X)
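For illustration, a sketch of L_span using the third-party pytorch-crf package (assumed available); the linear layer plays the role of W_CRF, and the package keeps the transition table b_CRF internally.

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf (third-party, assumed)

d_h, num_tags = 768, 5                       # 5 BMOES tags
emit = torch.nn.Linear(d_h, num_tags)        # emission scores, the role of W_CRF
crf = CRF(num_tags, batch_first=True)        # holds the transition table b_CRF

x_a = torch.randn(2, 7, d_h)                 # Dropout output X_a, (batch, n, d_h)
y_span = torch.randint(0, num_tags, (2, 7))  # gold BMOES label indices y^span
loss_span = -crf(emit(x_a), y_span)          # L_span = -log p(y^span | X)
```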
Next, the dropped-out X_b is linearly mapped to the label space of the pointer-labeling main task, and a softmax layer computes each character's probability distributions p^start and p^end over the labels:
p^start = softmax(W_start · X_b + b_start)
p^end = softmax(W_end · X_b + b_end)
where the training parameters are W_start, W_end ∈ R^{(c_e+1)×d_h} and b_start, b_end ∈ R^{c_e+1}; c_e + 1 is the sum of the number of entity types c_e and the non-entity type; p^start ∈ R^{(c_e+1)×n} is the predicted probability distribution over the entity types of span head characters, and p^end ∈ R^{(c_e+1)×n} that of span tail characters. Let the gold entity-type label index sequence of span head characters be y^start ∈ Z^n and that of span tail characters be y^end ∈ Z^n; the cross-entropy (CE) losses L_start and L_end of the pointer-labeling main task are:
L_start = -(1/n) Σ_{i=1}^{n} log p^start_{y_i^start, i},  L_end = -(1/n) Σ_{i=1}^{n} log p^end_{y_i^end, i}
where y_i^start denotes the gold entity-type label index of the i-th character, p^start_{y_i^start, i} denotes the predicted probability in p^start that the i-th character takes entity type y_i^start, and likewise for the end loss.
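A matching PyTorch sketch of the two pointer heads; F.cross_entropy applies log-softmax internally, so it is equivalent to the softmax plus cross entropy above, and c_e = 3 is an arbitrary example value.

```python
import torch
import torch.nn.functional as F

d_h, c_e = 768, 3                                  # c_e entity types + non-entity
head_start = torch.nn.Linear(d_h, c_e + 1)         # W_start, b_start
head_end = torch.nn.Linear(d_h, c_e + 1)           # W_end, b_end

x_b = torch.randn(2, 7, d_h)                       # Dropout output X_b
y_start = torch.randint(0, c_e + 1, (2, 7))        # gold head-character types
y_end = torch.randint(0, c_e + 1, (2, 7))          # gold tail-character types

# cross_entropy expects (batch, classes, n) logits and (batch, n) targets
loss_start = F.cross_entropy(head_start(x_b).transpose(1, 2), y_start)
loss_end = F.cross_entropy(head_end(x_b).transpose(1, 2), y_end)
# Overall objective: L = l1*loss_span + l2*loss_start + l3*loss_end
```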
Finally, given the auxiliary task loss L_span and the main task losses L_start and L_end, the three losses are fused into the overall training objective L that the model minimizes through end-to-end joint training:
L = λ_1 · L_span + λ_2 · L_start + λ_3 · L_end
where λ_1, λ_2, λ_3 are hyperparameters controlling the influence of each task on the overall objective. At test time, the index of the maximum of each character's predicted label distribution in p^start and p^end is taken as the predicted label index:
ŷ_i^start = argmax_j p^start_{j, i},  ŷ_i^end = argmax_j p^end_{j, i}
Then, span head and tail characters with the same entity type and the nearest positions are paired, and the entities in the sequence are extracted.
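A sketch of this pairing rule; the per-character predicted types come from the argmax above, with "O" standing for non-entity, and the function name is illustrative.

```python
def decode(start_labels, end_labels):
    """start_labels/end_labels: per-character predicted types, 'O' = none."""
    entities = []
    for i, h in enumerate(start_labels):
        if h == "O":
            continue
        for j in range(i, len(end_labels)):   # nearest tail at or after the head
            if end_labels[j] == h:
                entities.append((i, j, h))    # span [i, j] with entity type h
                break
    return entities

# 南京市长江大桥 -> [(0, 2, 'Loc'), (3, 6, 'Loc')]
print(decode(["Loc", "O", "O", "Loc", "O", "O", "O"],
             ["O", "O", "Loc", "O", "O", "O", "Loc"]))
```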
The present invention proposes a Chinese named entity extraction method based on multiple annotation frameworks and fused features. To test its effectiveness, the method was evaluated on the Ontonotes4, MSRA, Resume, and Weibo data sets in terms of precision (P), recall (R), and F1, and compared with other Chinese named entity extraction methods.
The model optimizer is adaptive moment estimation (Adam); the learning rate of the BERT parameters is set to 3e-5 and that of the other model parameters to 1e-3; the BERT encoding dimension is d_x = 768. The multi-task learning model is the progressive layered extraction model PLE, with the number of Experts in each task-specific group and in the shared group uniformly set to 2, each Expert a single-layer fully connected network, the number of PLE layers set to 2, the number of LSTM layers set to 1, the LSTM encoding dimension d_h = 768, the word-vector dimension d_w = 50, the pinyin-vector dimension d_py = 50, and the loss weights λ_1, λ_2, λ_3.
Tables 1 to 4 show the accuracy comparison results of different Chinese named entity extraction methods on the Ontonotes4, MSRA, Resume, and Weibo data sets, respectively. The experimental results in these tables show that, compared with other Chinese named entity extraction methods, the method proposed by the present invention achieves the best extraction accuracy on the vast majority of data sets and metrics. Figs. 5(a) and 5(b) show the effect of the lexicon matching window size on accuracy on the Ontonotes4 and MSRA data sets, and Figs. 6(a) and 6(b) show the same on the Resume and Weibo data sets; evaluating how the choice of matching window size affects Chinese named entity extraction accuracy provides guidance for choosing the window size in different application scenarios.
Table 1. Accuracy comparison of different entity extraction methods on the Ontonotes4 data set
Table 2. Accuracy comparison of different entity extraction methods on the MSRA data set
Table 3. Accuracy comparison of different entity extraction methods on the Resume data set
Table 4. Accuracy comparison of different entity extraction methods on the Weibo data set