





TECHNICAL FIELD
The present invention belongs to the fields of artificial intelligence and natural language processing, and in particular relates to a Chinese named entity extraction method based on multiple annotation frameworks and fused features.
BACKGROUND
With the rapid development of Internet technology, data across industries has grown explosively, driving intelligent big-data analysis and mining services and innovative applications, and further advancing the digital economy in China. This data contains large amounts of unstructured text, and extracting structured, useful information from such text has become a focus of industry. Doing so involves a fundamental task in natural language processing: named entity extraction.
Early research on named entity recognition was mainly based on dictionaries and rules, relying on linguists and domain experts to manually construct domain dictionaries and rule templates according to the characteristics of the data set. The advantage of rule-based approaches is that the rules can be continuously updated to extract the target entities as requirements change. Their drawback is that, for complex domains and application scenarios, manually building rules is costly; as the rule base grows, rule conflicts arise easily, making the rule base hard to maintain and extend and unable to adapt to changes in data and domain.
Subsequently, named entity recognition based on statistical machine learning attracted attention. In this setting, named entity recognition is framed as a sequence labeling problem. Statistical machine learning methods applied to NER mainly include the maximum entropy model, the hidden Markov model, the maximum entropy Markov model, and conditional random fields. These methods depend on manually constructed features, which makes the process cumbersome.
In recent years, with the continuous development of deep learning, more and more work in named entity recognition has been based on deep neural networks (DNNs). DNN-based methods require no tedious feature engineering, and their performance far exceeds that of traditional rule-based and statistical machine learning methods.
Chinese named entity recognition is harder than its English counterpart: Chinese lacks delimiters such as the spaces in English text and has no obvious morphological cues, which easily causes boundary ambiguity. In addition, Chinese exhibits polysemy: the same word carries different meanings in different domains or contexts, so contextual information must be fully exploited to interpret word meaning. Chinese also features ellipsis and abbreviation, all of which pose greater challenges for Chinese named entity recognition. Many existing Chinese named entity extraction methods make insufficient use of word information and rely on a single, limited annotation framework, which hurts extraction accuracy.
SUMMARY OF THE INVENTION
Objective: In view of the above problems and deficiencies of the prior art, the present invention aims to provide a Chinese named entity extraction method based on multiple annotation frameworks and fused features, to solve two problems of existing methods: being confined to a single annotation framework, and making insufficient use of word information, which makes entity boundaries difficult to identify.
Technical solution: To achieve the above objective, the present invention adopts a Chinese named entity extraction method based on multiple annotation frameworks and fused features, comprising the following steps:
(1) For each Chinese character in the input character sequence, perform word matching against an external lexicon; map the matched words to word vectors through a word-vector lookup table, and map each character's segmentation tag within a matched word to a segmentation-tag vector through a segmentation-tag vector lookup table; concatenate the segmentation-tag vector with the word vector to form the dictionary feature.
(2) Annotate each character with pinyin according to its meaning in the matched words, and map the pinyin through a pinyin-vector lookup table to obtain the pinyin feature.
(3) Based on a dot-product attention mechanism, fuse the dictionary feature and the pinyin feature into the character encodings produced by the Chinese pre-trained language model BERT, providing subsequent layers with character semantic encodings that combine dictionary and pinyin features.
(4) Feed the character semantic encodings into two independent bidirectional long short-term memory (BiLSTM) networks for feature sequence modeling, yielding a first feature sequence encoding X^a and a second feature sequence encoding X^b.
(5) Take sequence labeling as the auxiliary task and pointer labeling as the main task; use the first feature sequence encoding X^a as the input of the sequence-labeling auxiliary task and the second feature sequence encoding X^b as the input of the pointer-labeling main task, and jointly learn the two tasks with a multi-task learning model.
(6) Compute the log-likelihood loss L_span of the sequence-labeling auxiliary task under a conditional random field, the cross-entropy loss L_start for classifying the entity types of entity-head characters in the pointer-labeling main task, and the cross-entropy loss L_end for entity-tail characters; the weighted sum of the three losses gives the training objective the model minimizes, and training proceeds jointly end to end. At test time, the pointer-labeling main task extracts the entity spans and their types from the sentence.
Further, in step (1), the external lexicon and the word-vector lookup table come from publicly available pre-trained word vectors on the Internet, and the segmentation-tag vector lookup table consists of one-hot vectors.
Further, in step (2), the pinyin-vector lookup table is trained with word2vec on an external Chinese corpus whose text has been converted to pinyin with Chinese pinyin software.
Further, in step (5), the sequence-labeling auxiliary task marks entities in the input sentence with type-free BMOES tags and is responsible for extracting Chinese named entity spans, which carry no types; the pointer-labeling main task marks only the entity types of the head and tail characters of entity spans and is responsible for extracting Chinese named entities, which carry types.
Further, in step (6), at test time the label corresponding to the maximum of each character's predicted entity-type probability distribution is taken as that character's predicted label; each entity-head character is then matched with the nearest entity-tail character of the same entity type, and the text span between them is extracted as an entity.
Beneficial effects: The present invention effectively addresses the difficulty of identifying Chinese named entity boundaries, exploits the advantages of different annotation frameworks, and improves the accuracy of Chinese named entity extraction. First, it strengthens boundary recognition by constructing dictionary and pinyin features, and encodes characters with the Chinese pre-trained language model BERT to provide contextual semantic support for the upper layers. Second, it models feature sequences with the recurrent structure of bidirectional long short-term memory networks, learning sequence position information and alleviating the loss of such information caused by BERT's lack of sequence-dependent modeling. Third, it jointly learns sequence labeling and pointer labeling with a multi-task learning model, combining the strengths of the two annotation frameworks, overcoming the limitations of a single framework, and further improving extraction accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is the overall framework diagram of the method of the present invention;
Fig. 2 is an example diagram of dictionary and pinyin feature construction in the method;
Fig. 3 is an example diagram of sequence labeling in the method;
Fig. 4 is an example diagram of pointer labeling in the method;
Figs. 5(a) and 5(b) show the effect of the lexicon matching window size on accuracy on the Ontonotes4 and MSRA data sets, respectively;
Figs. 6(a) and 6(b) show the effect of the lexicon matching window size on accuracy on the Resume and Weibo data sets, respectively.
DETAILED DESCRIPTION
The present invention is further clarified below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present disclosure, modifications in various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims of this application.
The present invention proposes a Chinese named entity extraction method based on multiple annotation frameworks and fused features, solving the problems that existing methods struggle to identify entity boundaries and are confined to a single annotation framework. As shown in Fig. 1, the complete pipeline comprises six parts: a dictionary feature construction stage, a pinyin feature construction stage, a dictionary and pinyin feature fusion stage, a feature sequence modeling stage, a joint learning stage over multiple annotation frameworks, and an output-layer modeling stage. The specific embodiments are described as follows.
The dictionary feature construction stage corresponds to step (1) of the technical solution. Specifically: let the input be any character sequence X = {c_1, c_2, ..., c_n}, where n is the sequence length and each c_i (1 ≤ i ≤ n) is a single Chinese character from the Chinese character vocabulary. For any character c_i in X, in order to introduce words related to its context, an external lexicon L_x is introduced: with a word matching window l_w, every text span in the sentence that contains c_i and has length at most l_w is matched against the words in L_x. If a span appears in L_x, it is taken as a candidate word contextually related to c_i. Since several spans containing c_i may appear in the lexicon, this yields a candidate matching word set ws(c_i) = {w_1, w_2, ..., w_m} for c_i, where w_j (1 ≤ j ≤ m) denotes a matched word.
After the candidate set ws(c_i) is obtained, it is further filtered: any word that is a substring of another word in the set is removed. The reasons are: 1) a complete word usually better matches the character's context; for example, in "南京市长江大桥" (Nanjing Yangtze River Bridge), "长江大桥" is a better candidate for "长" than "长江"; 2) it reduces interference when fusing dictionary and pinyin features through the attention mechanism, making it more likely that attention selects, from the candidate list, the word that best fits the character's context.
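For illustration, a minimal Python sketch of this matching-and-filtering step follows; `lexicon` and `match_words` are hypothetical names, and a real implementation would index the lexicon (e.g. with a trie) rather than enumerate all spans.

```python
def match_words(sentence: str, i: int, lexicon: set, lw: int) -> list:
    """Collect lexicon words of length <= lw that contain character sentence[i]."""
    candidates = set()
    for start in range(max(0, i - lw + 1), i + 1):
        for end in range(i + 1, min(len(sentence), start + lw) + 1):
            span = sentence[start:end]
            if len(span) >= 2 and span in lexicon:  # single characters skipped here
                candidates.add(span)
    # Drop any candidate that is a substring of another candidate,
    # e.g. "长江" is filtered out when "长江大桥" also matches.
    return sorted(w for w in candidates
                  if not any(w != v and w in v for v in candidates))

lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
print(match_words("南京市长江大桥", 3, lexicon, lw=4))  # "长" -> ['市长', '长江大桥']
```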
The words in the filtered matching set ws(c_i) are mapped to word vectors through the word-vector lookup table e^w, giving the matched-word feature encoding WE(c_i):
WE(c_i) = e^w(ws(c_i))
Here e^w comes from already trained pre-trained word vectors and is kept frozen during training. Next, the position of the character within each matched word is tagged: B means c_i is at the beginning of the word, M in the middle, and E at the end. Different matched words of c_i correspond to different segmentations of the sequence, so it is necessary to fold the segmentation tags of c_i within its matched words into the dictionary feature, further highlighting the differences between matched words. For any word w_j in the candidate set ws(c_i), let seg(w_j) ∈ {B, M, E} denote the segmentation tag of c_i in w_j. With START(w_j) the start index of w_j in X and END(w_j) its end index, seg(w_j) is defined as:
seg(w_j) = B if i = START(w_j); E if i = END(w_j); M if START(w_j) < i < END(w_j)
Applying the above formula to all words in ws(c_i) gives:
segs(c_i) = {seg(w_1), seg(w_2), ..., seg(w_m)}
where segs(c_i) denotes the set of segmentation tags of c_i across all its matched words. The tags in segs(c_i) are mapped to one-hot segmentation-tag encodings SEGE(c_i) through the segmentation-tag vector lookup table e^seg:
SEGE(c_i) = e^seg(segs(c_i))
Each dimension of the one-hot vector corresponds to one element of the set {B, M, E}: [1, 0, 0] corresponds to B, [0, 1, 0] to M, and [0, 0, 1] to E.
The segmentation-tag encoding SEGE(c_i) and the matched-word feature encoding WE(c_i) are concatenated along the encoding dimension to give the final dictionary feature encoding LE(c_i) of character c_i:
LE(c_i) = [SEGE(c_i); WE(c_i)]
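A small sketch of how seg(w_j), the one-hot mapping, and the concatenation could be realized; `word_vec` stands in for the frozen lookup table e^w, and the (word, start, end) triples stand in for the filtered matches, both hypothetical interfaces.

```python
import numpy as np

SEG_ONE_HOT = {"B": [1, 0, 0], "M": [0, 1, 0], "E": [0, 0, 1]}

def seg_tag(i: int, start: int, end: int) -> str:
    """Segmentation tag of character i inside a matched word spanning [start, end]."""
    if i == start:
        return "B"
    if i == end:
        return "E"
    return "M"

def dictionary_feature(i: int, matches, word_vec: dict, d_w: int = 50):
    """LE(c_i): one row [SEGE; WE] per matched word, shape (m, 3 + d_w)."""
    rows = []
    for word, start, end in matches:  # matches: list of (word, start, end)
        seg = np.asarray(SEG_ONE_HOT[seg_tag(i, start, end)], dtype=np.float32)
        wv = word_vec.get(word, np.zeros(d_w, dtype=np.float32))
        rows.append(np.concatenate([seg, wv]))
    return np.stack(rows)
```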
The pinyin feature construction stage corresponds to step (2) of the technical solution. Specifically: counting the neutral tone, pinyin has five tones in total, e.g. "chang", "chāng", "cháng", "chǎng", "chàng". Suppose entities are to be extracted from the sentence "南京市长江大桥". When the "长" in the sentence is pronounced "cháng", the sentence segments as "南京市 | 长江大桥" and "长江大桥" (Yangtze River Bridge) is extracted as a location entity; when "长" is pronounced "zhǎng", the sentence segments as "南京市长 | 江大桥" and "江大桥" (Jiang Daqiao) is extracted as a person entity. This shows that the pinyin of characters in a sentence can affect entity extraction accuracy.
For any character c_i in the input sequence X, after its candidate word set ws(c_i) is obtained, Chinese pinyin software (e.g. pypinyin) annotates c_i with pinyin according to its meaning in each matched word, giving the pinyin set pys(c_i) corresponding to ws(c_i). The pinyin in pys(c_i) is then mapped to pinyin vectors through the pinyin-vector lookup table e^py, giving the pinyin feature encoding PYE(c_i):
PYE(c_i) = e^py(pys(c_i))
The pinyin-vector lookup table e^py is obtained by converting an external Chinese corpus (e.g. the Chinese Wikipedia corpus) to pinyin with the pinyin software and then training with Word2Vec's skip-gram method. Since the external corpus may contain digits, English, or other symbols without pinyin, during the data preprocessing stage before word-vector training the present invention converts English to "[ENG]", digits to "[DIGIT]", and all other characters without pinyin to "[UNK]".
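As a sketch of this step, the pypinyin package resolves heteronyms from its phrase dictionary, which matches the word-level annotation described above, and a gensim skip-gram model can stand in for the pinyin-vector training; the toy corpus below is purely illustrative.

```python
from pypinyin import Style, pinyin

print(pinyin("长江大桥", style=Style.TONE))  # [['cháng'], ['jiāng'], ['dà'], ['qiáo']]
print(pinyin("市长", style=Style.TONE))      # [['shì'], ['zhǎng']]

# Train a pinyin-vector lookup table e^py with skip-gram Word2Vec
# (gensim 4.x API assumed) over a pinyin-converted corpus:
from gensim.models import Word2Vec

corpus = [["nán", "jīng", "shì", "zhǎng", "jiāng", "dà", "qiáo"]]  # toy example
model = Word2Vec(corpus, vector_size=50, sg=1, min_count=1, window=5)
pinyin_vec = model.wv  # maps each pinyin token to a 50-dim vector
```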
An example of dictionary and pinyin feature construction is shown in Fig. 2, which gives the matching results for "市" and "长", where w_{i,j} denotes the word formed by the span {c_i, c_{i+1}, ..., c_j}. Note that "长江" is not included in the matching result for "长", because "长江" is a substring of "长江大桥" and is filtered out.
The dictionary and pinyin feature fusion stage corresponds to step (3) of the technical solution. Specifically: to avoid overfitting on the small annotated entity extraction data sets of some vertical domains, the present invention uses the Chinese pre-trained language model BERT to provide semantic support and improve generalization. The input sequence X = {c_1, c_2, ..., c_n} is fed into BERT, and the output of BERT's last layer is taken as the sequence encoding X^h = [x_1, x_2, ..., x_n], where each x_i ∈ R^{d_x} is a real column vector, X^h ∈ R^{d_x×n} is a real matrix, and d_x denotes the BERT encoding dimension. The dictionary feature and pinyin feature of c_i constructed above are concatenated along the encoding dimension to give the fused feature LPE(c_i):
LPE(c_i) = [LE(c_i); PYE(c_i)]
Let d_w be the encoding dimension of the word-vector lookup table e^w, d_py that of the pinyin-vector lookup table e^py, and m the size of the candidate matching word set ws(c_i). LPE(c_i) is fused into the character encoding x_i by dot-product attention: x_i plays the role of the query, and LPE(c_i) the roles of key and value. First, LPE(c_i) is linearly projected to LPE_i^{kv}, whose encoding dimension matches that of x_i:
LPE_i^{kv} = W_{kv} · LPE(c_i)
where W_{kv} is a trainable parameter matrix and the projected fused feature is LPE_i^{kv} ∈ R^{d_x×m}. Let unsqueeze(M, y) denote expanding dimension y of matrix M and squeeze(M, y) denote compressing dimension y; unsqueeze(x_i, 0) thus turns x_i ∈ R^{d_x} into a matrix in R^{1×d_x}. The attention weights LPE_i^w are then computed as:
LPE_i^w = softmax(unsqueeze(x_i, 0) · LPE_i^{kv})
where the attention weights LPE_i^w ∈ R^{1×m} sum to 1 after the softmax. The attention output LPE_i^o is computed as the weighted sum of LPE_i^{kv} under LPE_i^w:
LPE_i^o = squeeze(LPE_i^w · (LPE_i^{kv})^T, 0)
with LPE_i^o ∈ R^{d_x}. Finally, LPE_i^o is added to the character encoding x_i to give the final semantic encoding of c_i:
x_i ← LPE_i^o + x_i
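A minimal PyTorch sketch of this fusion for a single character; shapes follow the text except that LPE_i^{kv} is stored as (m, d_x) rather than (d_x, m), and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse(x_i: torch.Tensor, lpe: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """Fuse dictionary+pinyin features lpe (m, d_w+d_py+3) into the BERT encoding x_i (d_x,)."""
    lpe_kv = proj(lpe)                        # (m, d_x): keys and values LPE_i^kv
    scores = x_i.unsqueeze(0) @ lpe_kv.t()    # (1, m): query-key dot products
    weights = F.softmax(scores, dim=-1)       # attention weights LPE_i^w, sum to 1
    lpe_o = (weights @ lpe_kv).squeeze(0)     # (d_x,): attention output LPE_i^o
    return x_i + lpe_o                        # residual add into the character encoding

d_x, d_w, d_py, m = 768, 50, 50, 4
proj = torch.nn.Linear(d_w + d_py + 3, d_x)
x_i, lpe = torch.randn(d_x), torch.randn(m, d_w + d_py + 3)
print(fuse(x_i, lpe, proj).shape)             # torch.Size([768])
```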
The feature sequence modeling stage corresponds to step (4) of the technical solution. Specifically: the Transformer self-attention mechanism cannot by itself capture sequence position information; the pre-trained language model BERT mitigates this by adding trainable absolute position encodings to its input, but it still lacks sequence-dependent modeling. The long short-term memory network (LSTM) needs no position encoding: its structure, which encodes recursively in sequence order, is able to learn sequence position information. The character semantic sequence encoding obtained after fusing the dictionary and pinyin features is fed into two bidirectional long short-term memory networks (BiLSTM) for feature sequence modeling: one BiLSTM output serves the sequence-labeling-based Chinese named entity span extraction auxiliary task of step (5), and the other serves the pointer-labeling-based Chinese named entity extraction main task of step (5). A BiLSTM consists of a forward and a backward LSTM, and the BiLSTMs of the two tasks are independent and share no training parameters.
Suppose that at time step t, the forward LSTM of the sequence-labeling auxiliary task outputs the hidden state →h_t^a and the backward LSTM outputs ←h_t^a; adding them gives the auxiliary task's BiLSTM hidden state at time step t:
h_t^a = →h_t^a + ←h_t^a
For the pointer-labeling main task, the forward LSTM hidden state is →h_t^b and the backward hidden state is ←h_t^b; adding them gives the main task's BiLSTM hidden state at time step t:
h_t^b = →h_t^b + ←h_t^b
Finally, the feature sequence modeling output of the sequence-labeling auxiliary task is X^a = [h_1^a, h_2^a, ..., h_n^a] ∈ R^{d_h×n}, and that of the pointer-labeling main task is X^b = [h_1^b, h_2^b, ..., h_n^b] ∈ R^{d_h×n}, where d_h denotes the LSTM encoding dimension.
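A brief PyTorch sketch of the two task-specific BiLSTMs; summing (rather than concatenating) the forward and backward hidden states follows the formulas above, and `TaskBiLSTM` is an illustrative name.

```python
import torch

d_x, d_h = 768, 768

class TaskBiLSTM(torch.nn.Module):
    """One task-specific BiLSTM; forward/backward hidden states are summed."""
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(d_x, d_h, num_layers=1,
                                  batch_first=True, bidirectional=True)

    def forward(self, x):
        out, _ = self.lstm(x)                    # (batch, n, 2*d_h)
        fwd, bwd = out[..., :d_h], out[..., d_h:]
        return fwd + bwd                         # h_t = forward + backward

bilstm_a, bilstm_b = TaskBiLSTM(), TaskBiLSTM()  # independent, no shared weights
x = torch.randn(2, 7, d_x)                       # fused encodings (batch, n, d_x)
x_a, x_b = bilstm_a(x), bilstm_b(x)              # X^a and X^b, (batch, n, d_h)
```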
The joint learning stage over multiple annotation frameworks corresponds to step (5) of the technical solution. Specifically: sequence labeling and pointer labeling are two common annotation frameworks for named entity extraction. Sequence labeling tags each character of the text with its position within an entity; Fig. 3 shows an example using BMOES, where B marks a character at the beginning of a named entity span, M in the middle, E at the end, O outside any span, and S a character that is itself a named entity span. The example sentence contains two entities, "南京市" and "长江大桥". Pointer labeling tags the entity types of the head and tail characters of each entity span, as shown in Fig. 4, where "南京市" and "长江大桥" are both location (Loc) entities. The two schemes on the example sentence are written out below.
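This is a plain illustration of Figs. 3 and 4, with "O" marking characters that carry no pointer label:

```python
sentence = list("南京市长江大桥")            # entities: 南京市, 长江大桥 (both Loc)

# Sequence labeling (BMOES, no entity types):
bmoes = ["B", "M", "E", "B", "M", "M", "E"]

# Pointer labeling (entity types on head/tail characters only):
head = ["Loc", "O", "O", "Loc", "O", "O", "O"]
tail = ["O", "O", "Loc", "O", "O", "O", "Loc"]
```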
By modeling dependencies over the full sequence, sequence labeling extracts more complete entities and usually achieves higher precision; by classifying the entity types of span head and tail characters, pointer labeling is more robust to noise and usually achieves higher recall. To combine the strengths of the two frameworks, X^a is used as the input of the sequence-labeling auxiliary task and X^b as the input of the pointer-labeling main task, and a multi-task learning model, e.g. the Multi-gate Mixture-of-Experts (MMOE) model or the Progressive Layered Extraction (PLE) model, jointly learns the sequence-labeling-based Chinese named entity span extraction auxiliary task and the pointer-labeling-based Chinese named entity extraction main task, yielding the auxiliary task output X_a and the main task output X_b.
The output-layer modeling stage corresponds to step (6) of the technical solution. Specifically: a Dropout layer is applied to the X_a and X_b obtained in the previous step to prevent overfitting. The dropped-out X_a is then fed into a conditional random field (CRF), and the likelihood p(y|X) of a BMOES label index sequence y ∈ Z^n under the sequence-labeling auxiliary task is computed as:
p(y|X) = exp(Σ_{t=1}^{n} (W_CRF^{y_t} · x_t^a + b_CRF^{(y_{t-1}, y_t)})) / Σ_{y′∈Y_X} exp(Σ_{t=1}^{n} (W_CRF^{y′_t} · x_t^a + b_CRF^{(y′_{t-1}, y′_t)}))
where x_t^a denotes the t-th column of the dropped-out X_a, Y_X denotes the set of all possible BMOES label index sequences of X under this task, and y′ ∈ Z^n is any sequence in Y_X. The training parameters are W_CRF ∈ R^{5×d_h} and b_CRF ∈ R^{5×5} (BMOES has 5 tags); W_CRF^{y_t} denotes the row of W_CRF for label y_t, b_CRF^{(y_{t-1}, y_t)} the transition parameter in b_CRF from label y_{t-1} to label y_t, and likewise for y′. Let the gold BMOES label index sequence of the auxiliary task be y^span ∈ Z^n, with Z denoting the integers; substituting it into the formula above gives the auxiliary task's log-likelihood loss:
L_span = -log p(y^span | X)
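For illustration, a sketch of L_span using the third-party pytorch-crf package (assumed available); the linear layer plays the role of W_CRF, and the package keeps the transition table b_CRF internally.

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf (third-party, assumed)

d_h, num_tags = 768, 5                       # 5 BMOES tags
emit = torch.nn.Linear(d_h, num_tags)        # emission scores, the role of W_CRF
crf = CRF(num_tags, batch_first=True)        # holds the transition table b_CRF

x_a = torch.randn(2, 7, d_h)                 # Dropout output X_a, (batch, n, d_h)
y_span = torch.randint(0, num_tags, (2, 7))  # gold BMOES label indices y^span
loss_span = -crf(emit(x_a), y_span)          # L_span = -log p(y^span | X)
```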
Next, the dropped-out X_b is linearly mapped to the label space of the pointer-labeling main task, and a softmax layer computes each character's probability distributions p^start and p^end over the labels:
p^start = softmax(W_start · X_b + b_start)
p^end = softmax(W_end · X_b + b_end)
where the training parameters are W_start, W_end ∈ R^{(c_e+1)×d_h} and b_start, b_end ∈ R^{c_e+1}; c_e + 1 is the sum of the number of entity types c_e and the non-entity type; p^start ∈ R^{(c_e+1)×n} is the predicted probability distribution over the entity types of span head characters, and p^end ∈ R^{(c_e+1)×n} that of span tail characters. Let the gold entity-type label index sequence of span head characters be y^start ∈ Z^n and that of span tail characters be y^end ∈ Z^n; the cross-entropy (CE) losses L_start and L_end of the pointer-labeling main task are:
L_start = -(1/n) Σ_{i=1}^{n} log p^start_{y_i^start, i},  L_end = -(1/n) Σ_{i=1}^{n} log p^end_{y_i^end, i}
where y_i^start denotes the gold entity-type label index of the i-th character, p^start_{y_i^start, i} denotes the predicted probability in p^start that the i-th character takes entity type y_i^start, and likewise for the end loss.
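A matching PyTorch sketch of the two pointer heads; F.cross_entropy applies log-softmax internally, so it is equivalent to the softmax plus cross entropy above, and c_e = 3 is an arbitrary example value.

```python
import torch
import torch.nn.functional as F

d_h, c_e = 768, 3                                  # c_e entity types + non-entity
head_start = torch.nn.Linear(d_h, c_e + 1)         # W_start, b_start
head_end = torch.nn.Linear(d_h, c_e + 1)           # W_end, b_end

x_b = torch.randn(2, 7, d_h)                       # Dropout output X_b
y_start = torch.randint(0, c_e + 1, (2, 7))        # gold head-character types
y_end = torch.randint(0, c_e + 1, (2, 7))          # gold tail-character types

# cross_entropy expects (batch, classes, n) logits and (batch, n) targets
loss_start = F.cross_entropy(head_start(x_b).transpose(1, 2), y_start)
loss_end = F.cross_entropy(head_end(x_b).transpose(1, 2), y_end)
# Overall objective: L = l1*loss_span + l2*loss_start + l3*loss_end
```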
Finally, given the auxiliary task loss L_span and the main task losses L_start and L_end, the three losses are fused into the overall training objective L that the model minimizes through end-to-end joint training:
L = λ_1 · L_span + λ_2 · L_start + λ_3 · L_end
where λ_1, λ_2, λ_3 are hyperparameters controlling the influence of each task on the overall objective. At test time, the index of the maximum of each character's predicted label distribution in p^start and p^end is taken as the predicted label index:
ŷ_i^start = argmax_j p^start_{j, i},  ŷ_i^end = argmax_j p^end_{j, i}
Then, span head and tail characters with the same entity type and the nearest positions are paired, and the entities in the sequence are extracted.
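A sketch of this pairing rule; the per-character predicted types come from the argmax above, with "O" standing for non-entity, and the function name is illustrative.

```python
def decode(start_labels, end_labels):
    """start_labels/end_labels: per-character predicted types, 'O' = none."""
    entities = []
    for i, h in enumerate(start_labels):
        if h == "O":
            continue
        for j in range(i, len(end_labels)):   # nearest tail at or after the head
            if end_labels[j] == h:
                entities.append((i, j, h))    # span [i, j] with entity type h
                break
    return entities

# 南京市长江大桥 -> [(0, 2, 'Loc'), (3, 6, 'Loc')]
print(decode(["Loc", "O", "O", "Loc", "O", "O", "O"],
             ["O", "O", "Loc", "O", "O", "O", "Loc"]))
```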
The present invention proposes a Chinese named entity extraction method based on multiple annotation frameworks and fused features. To test its effectiveness, the method was evaluated on the Ontonotes4, MSRA, Resume, and Weibo data sets in terms of precision (P), recall (R), and F1, and compared with other Chinese named entity extraction methods.
The model optimizer is adaptive moment estimation (Adam); the learning rate of the BERT parameters is set to 3e-5 and that of the other model parameters to 1e-3; the BERT encoding dimension is d_x = 768. The multi-task learning model is the progressive layered extraction model PLE, with the number of Experts in each task-specific group and in the shared group uniformly set to 2, each Expert a single-layer fully connected network, the number of PLE layers set to 2, the number of LSTM layers set to 1, the LSTM encoding dimension d_h = 768, the word-vector dimension d_w = 50, the pinyin-vector dimension d_py = 50, and the loss weights λ_1, λ_2, λ_3.
Tables 1 to 4 show the accuracy comparison results of different Chinese named entity extraction methods on the Ontonotes4, MSRA, Resume, and Weibo data sets, respectively. The experimental results in these tables show that, compared with other Chinese named entity extraction methods, the method proposed by the present invention achieves the best extraction accuracy on the vast majority of data sets and metrics. Figs. 5(a) and 5(b) show the effect of the lexicon matching window size on accuracy on the Ontonotes4 and MSRA data sets, and Figs. 6(a) and 6(b) show the same on the Resume and Weibo data sets; evaluating how the choice of matching window size affects Chinese named entity extraction accuracy provides guidance for choosing the window size in different application scenarios.
Table 1. Accuracy comparison of different entity extraction methods on the Ontonotes4 data set
Table 2. Accuracy comparison of different entity extraction methods on the MSRA data set
Table 3. Accuracy comparison of different entity extraction methods on the Resume data set
Table 4. Accuracy comparison of different entity extraction methods on the Weibo data set