
Named entity recognition model training method, named entity recognition method and medium

Info

Publication number: CN111738003B
Application number: CN202010541415.5A
Authority: CN (China)
Prior art keywords: module, model, training, data set, named entity
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN111738003A
Inventors: 程学旗, 郭嘉丰, 范意兴, 张儒清, 刘艺菲
Current Assignee: Institute of Computing Technology of CAS (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN202010541415.5A
Publication of CN111738003A
Application granted
Publication of CN111738003B


Abstract

The embodiment of the invention provides a named entity recognition model training method, a named entity recognition method and a medium.

Description

Translated from Chinese
Named entity recognition model training method, named entity recognition method and medium

Technical Field

The present invention relates to the field of natural language processing technology, specifically to the field of named entity recognition technology, and more specifically to a named entity recognition model training method, a named entity recognition method and a medium.

Background Art

Natural language processing aims to enable computers to understand human language, so as to better realize interaction between people and computers (such as the interaction between people and applications like voice assistants, automatic message replies, and translation software). Natural language processing usually includes word segmentation, part-of-speech tagging, named entity recognition and syntactic analysis. Named entity recognition (NER) is an important part of natural language processing (NLP). Named entity recognition refers to the process of identifying the names or symbols of things with specific meanings in text; named entities mainly include names of people, places and institutions, dates, proper nouns, and so on. Many downstream NLP tasks and applications rely on NER for information extraction, such as question answering, relation extraction, event extraction and entity linking. Identifying the named entities in a text more accurately helps the computer better understand the semantics of the language and perform tasks better, thereby improving the human-computer interaction experience.

Named entity recognition methods based on deep neural networks usually treat named entity recognition as a multi-class classification task or a sequence labeling task and can be divided into three stages: distributed representation of the input, semantic encoding, and label decoding. The distributed representation of the input can be divided into three types according to the encoding object (character level, word level, and hybrid), and yields a vector representation of each word. Semantic encoding usually uses deep neural networks, such as bidirectional long short-term memory networks, Bidirectional Encoder Representations from Transformers (BERT), and transfer learning networks, and uses the word vector of each word in the text to obtain a vector representation of the text. Label decoding is completed by a classifier, which often uses a fully connected neural network with a Softmax layer, or a conditional random field with the Viterbi algorithm, to obtain the label of each word.

Named entity recognition is not currently a hot research topic, because part of the academic community considers it a solved problem. However, many researchers believe that the problem has not been well solved, mainly because named entity recognition has only achieved good results on a limited range of text types (mainly news corpora) and entity categories (mainly names of people, places, and organizations); in other natural language processing domains, the named entity evaluation corpora are small and prone to overfitting, and the performance of general-purpose systems that recognize multiple types of named entities is still poor.

Named entity recognition based on deep learning has achieved good results on English news corpora (F1 values above 90%), but deep learning methods generally require a large amount of labeled data. In the real world, many languages and domains have relatively little labeled data, which gives rise to the problem of low-resource named entity recognition. Transfer learning is currently a common approach to low-resource named entity recognition, but transfer learning as currently applied to this problem suffers from imbalances in data volume and label resources: joint learning is biased toward high-resource data (datasets with larger data volumes), so the recognition performance of the resulting named entity recognition model is poor. It is therefore necessary to improve the existing technology.

Summary of the Invention

Therefore, the purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide a named entity recognition model training method, a named entity recognition method and a medium.

The objective of the present invention is achieved through the following technical solutions:

According to a first aspect of the present invention, a named entity recognition model training method is provided, the method comprising: A1, constructing a first training model, the first training model comprising a feature extraction module, a recognition module and a domain distinction module; A2, performing multiple rounds of training on the first training model, wherein in each round of training the recognition module is trained with a first dataset and the feature extraction module and the domain distinction module are adversarially trained with the first dataset and a second dataset; after each round of training, the parameters of the feature extraction module are adjusted at least according to the loss function of the recognition module and the loss function of the domain distinction module, the first dataset and the second dataset are updated at the same time, and the next round of training is performed with the updated first and second datasets, wherein the first dataset is a source-domain labeled dataset with entity labels represented in the form of word vectors, and the second dataset is a target-domain unlabeled dataset without entity labels represented in the form of word vectors; A3, constructing a second training model, the second training model comprising a feature extraction module and a recognition module, wherein the initial parameters of the feature extraction module of the second training model are set using the parameters of the feature extraction module of the first training model trained in step A2, and the initial parameters of the recognition module are set by random initialization; A4, using a third dataset to fine-tune, in a supervised training manner, the parameters of the feature extraction module and the recognition module of the second training model constructed in step A3, and using the second training model after parameter fine-tuning as the named entity recognition model, wherein the third dataset is a target-domain labeled dataset with entity labels represented in the form of word vectors.

Preferably, the size of the source-domain labeled dataset is the same as or approximately the same as the size of the target-domain unlabeled dataset, and the size of the target-domain labeled dataset is smaller than the size of the target-domain unlabeled dataset.

Preferably, the same or approximately the same size means that the ratio of the data volume of the source-domain labeled dataset to that of the target-domain unlabeled dataset is between 10:14 and 10:9.

In some embodiments of the present invention, the feature extraction module in the first training model includes a preprocessing layer, a CNN model, a Word2Vec model, and a BiLSTM model comprising a forward LSTM and a backward LSTM, where the forward LSTM and the backward LSTM each comprise a plurality of sequentially connected LSTM units. The feature extraction module processes the source-domain labeled dataset, the target-domain unlabeled dataset and the target-domain labeled dataset, each represented in non-word-vector form, as follows to obtain the first dataset, the second dataset and the third dataset: the preprocessing layer preprocesses the words of the dataset, including unifying the case and removing stop words; the CNN model extracts the character-level embedding features of each word in the dataset; the Word2Vec model extracts the word embedding features of each word in the dataset; the character-level embedding features and word embedding features of each word are concatenated to obtain the vector representation of each word; and the vector representation of each word in the dataset is input into the BiLSTM model of the feature extraction module for processing, yielding a dataset represented in the form of word vectors that contain context information.

In some embodiments of the present invention, the recognition modules of the first training model and the second training model both include a BiLSTM-CRF model, where the entity labels of the source-domain labeled dataset are used to set the label value space of the CRF layer of the BiLSTM-CRF model of the recognition module in the first training model, and the entity labels of the target-domain labeled dataset are used to set the label value space of the CRF layer of the BiLSTM-CRF model of the recognition module of the second training model.

In some embodiments of the present invention, the first training model further includes a gradient reversal layer. During adversarial training of the feature extraction module and the domain distinction module, a standard stochastic gradient descent operation is performed on the feature extraction module and the domain distinction module of the first training model through the gradient reversal layer during forward propagation, and during back propagation the parameters of the gradient reversal layer are automatically negated before the loss function of the domain distinction module is passed back to the feature extraction module, so that the feature extraction module extracts features common to the words in the source-domain labeled dataset and the target-domain unlabeled dataset.

In some embodiments of the present invention, the first training model further includes an auto-encoding module, which is trained with the second dataset. After each round of training, the parameters of the feature extraction module are jointly updated according to the loss function of the auto-encoding module, the loss function of the recognition module and the loss function of the domain distinction module.

In some embodiments of the present invention, the auto-encoding module includes an encoder and a decoder, where, in each round of training, the encoder obtains the hidden states of the last LSTM unit of the forward LSTM and the last LSTM unit of the backward LSTM produced by the BiLSTM model of the feature extraction module on the words of the target-domain unlabeled dataset, combines them into the initial state feature of the decoder, and uses this initial state feature together with the previous word embedding feature as the input of the decoder, in order to train the auto-encoding module to extract the private features of the target domain.

In some embodiments of the present invention, the parameters of the feature extraction module of the first training model are adjusted in the following manner:

θf = θ'f − μ·(α·∂Ltask/∂θf + β·∂Ltarget/∂θf − ω·γ·∂Ltype/∂θf)

where θf denotes the parameters of the feature extraction module after this adjustment, θ'f the parameters of the feature extraction module before this adjustment, μ the learning rate, Ltask the loss function of the recognition module, Ltype the loss function of the domain distinction module, Ltarget the loss function of the auto-encoding module, −ω the gradient reversal parameter, and α, β, γ the weights set by the user.

In some embodiments of the present invention, step A2 further includes: after each round of training, adjusting the parameters of the recognition module, the domain distinction module and the auto-encoding module of the first training model in the following manner:

The parameter update of the recognition module is:

θy = θ'y − μ·α·∂Ltask/∂θy

The parameter update of the domain distinction module is:

θd = θ'd − μ·γ·∂Ltype/∂θd

The parameter update of the auto-encoding module is:

θr = θ'r − μ·β·∂Ltarget/∂θr

where θy denotes the parameters of the recognition module after this adjustment, θ'y the parameters of the recognition module before this adjustment, θd the parameters of the domain distinction module after this adjustment, θ'd the parameters of the domain distinction module before this adjustment, θr the parameters of the auto-encoding module after this adjustment, θ'r the parameters of the auto-encoding module before this adjustment, μ the learning rate, and α, β, γ the weights set by the user.

According to a second aspect of the present invention, a named entity recognition method is provided, based on a named entity recognition model trained by the named entity recognition model training method of the first aspect, where the named entity recognition model includes a feature extraction module and a recognition module, and the named entity recognition method includes: B1, obtaining, through the feature extraction module of the named entity recognition model, the character-level embedding features and word embedding features of the text to be recognized and concatenating them to obtain the word vector of each word in the text to be recognized; B2, inputting the text to be recognized, represented in the form of word vectors, into the recognition module of the named entity recognition model to obtain the named entity recognition result of the text to be recognized.

According to a third aspect of the present invention, an electronic device is provided, comprising: one or more processors; and a memory, where the memory is used to store one or more executable instructions; the one or more processors are configured to implement the steps of the method of the first aspect or the second aspect by executing the one or more executable instructions.

Compared with the prior art, the advantages of the present invention are as follows:

The present invention first trains a first training model with a source-domain labeled dataset and a target-domain unlabeled dataset, sets up a second training model based on the parameters of the first training model, and then fine-tunes the second training model with a target-domain labeled dataset, thereby obtaining the final named entity recognition model. This avoids the need for a large number of labeled target-domain samples for training, and the recognition performance of the trained named entity recognition model when performing named entity recognition on words of the target domain is also improved.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

FIG. 1 is a simplified schematic diagram of a named entity recognition model training method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the structural principle of a named entity recognition model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a saddle point according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of named entity recognition by an existing model used as a baseline experiment of the present invention;

FIG. 5 is a schematic diagram of two existing methods used as comparative experiments of the present invention.

DETAILED DESCRIPTION

In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below through specific embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.

As mentioned in the Background Art section, current named entity recognition models are obtained through supervised training on labeled datasets of a specific domain (i.e., datasets in which the words that are named entities have been marked). Such a model can achieve high recognition accuracy within that domain, but when applied directly to other domains it suffers from poor generalization and low recognition accuracy. In the real world, many languages and domains have relatively little labeled data, so it is difficult to obtain the named entity recognition model required for such a domain through supervised training. Manually labeling data in these domains not only requires the annotators to have a clear understanding of the names of the various kinds of named entities in each domain, but also requires them to accurately mark those entities in massive amounts of data, which is labor-intensive and costly. On the other hand, if a relatively large source-domain labeled dataset and a smaller target-domain labeled dataset are used directly for transfer learning, there is a label-resource imbalance problem: joint learning is biased toward the high-resource data, so the trained model performs poorly in the target domain.

Therefore, the present invention first trains a first training model with the source-domain labeled dataset and the target-domain unlabeled dataset, sets up a second training model based on the parameters of the first training model, and then fine-tunes the second training model with the target-domain labeled dataset, thereby obtaining the final named entity recognition model. In this way, the target-domain data can be divided into two parts, a target-domain unlabeled dataset and a target-domain labeled dataset, for training the model, so that the size of the target-domain unlabeled dataset can be set to be approximately the same as that of the source-domain labeled dataset, preventing the final model parameters from being biased toward the dataset with more data. After training, the second training model is fine-tuned with a target-domain labeled dataset that is smaller than the target-domain unlabeled dataset to obtain the final named entity recognition model, avoiding the need for a large number of labeled target-domain samples for training.

Before the embodiments of the present invention are described in detail, some of the terms used therein are explained as follows:

Adversarial training, also known as adversarial learning, was proposed by Goodfellow et al. The basic idea is based on two models: a generative model and a discriminative model. The task of the discriminative model is to determine whether a given picture is real or artificially modified, and the task of the generative model is to generate synthetic pictures similar to the pictures in the collection. Through repeated confrontation during training, the capabilities of both the generative model and the discriminative model keep increasing until a balance is reached. This process can be viewed as a zero-sum game. Adversarial learning has been successfully used in image generation, semi-supervised learning and domain adaptation. The key idea of a domain-adaptive adversarial learning network is to work against the domain distinction module while optimizing the feature extraction module, so as to construct general, invariant features.

Transfer learning is the transfer of knowledge that has already been learned to another, unknown kind of knowledge, i.e., from a source domain to a target domain.

The source-domain labeled dataset is a dataset of the source domain that has been entity-labeled and therefore carries entity labels. In other words, the entity objects in the source-domain labeled dataset carry entity labels of their corresponding types.

The target-domain unlabeled dataset is a dataset of the target domain that has not been entity-labeled. Unlabeled means that it does not need to be labeled before the training process of the present invention. Even if part of the collected data carries entity labels originally assigned by others, it is regarded as a dataset without entity labels, because these entity labels are not considered or used in the adversarial training process.

The target-domain labeled dataset is a dataset of the target domain that has been entity-labeled and therefore carries entity labels.

CNN (Convolutional Neural Network) denotes a convolutional neural network, a class of feed-forward neural networks that include convolution computations and have a deep structure.

The Word2Vec (Word to Vector) model is a natural language processing model that vectorizes vocabulary. The Word2Vec model learns the semantic information of words from a large text corpus in an unsupervised manner and outputs word vectors to represent that semantic information.

LSTM (Long Short-Term Memory) denotes a long short-term memory network, a type of recurrent neural network. LSTM was developed mainly to solve the gradient vanishing and gradient exploding problems in training on long sequences. Compared with an ordinary recurrent neural network (RNN), LSTM performs better at learning long-term dependencies in long sequences.

BiLSTM (Bi-directional Long Short-Term Memory) denotes a bidirectional long short-term memory network.

CRF (Conditional Random Field) denotes a conditional random field, a probabilistic undirected graphical model that, given an input random variable x, solves for the conditional probability P(y|x). The conditional random field model models the conditional probability distribution of the input and output variables. Conditional random fields are often used to label or analyze sequence data, such as natural language text or biological sequences. When used for sequence labeling, the input and output random variables are two sequences of equal length.

MLP (Multilayer Perceptron) denotes a multilayer perceptron, a feed-forward artificial neural network model used to perform multiple layers of linear or non-linear transformations.

Stop words are characters or words without clear meaning that are automatically filtered out before or after processing natural language data (or text), for example modal particles, adverbs, prepositions and conjunctions. Removing stop words saves storage space and improves processing efficiency.

Referring to FIG. 1, the training process of the named entity model training method of the present invention mainly includes the following stages: first, a first training model is trained with a first dataset and a second dataset of the same or approximately the same size; after training is completed, the knowledge of the first training model is transferred to a second training model through transfer learning, and the second training model is then fine-tuned with a third dataset that is smaller than the second dataset, yielding the named entity recognition model.

According to an embodiment of the present invention, a named entity recognition model training method is provided, comprising steps A1, A2, A3 and A4. For a better understanding of the present invention, each step is described in detail below in conjunction with specific embodiments.

Step A1: construct a first training model, which includes a feature extraction module 11, a recognition module 12 and a domain distinction module 13.

According to one embodiment of the present invention, the feature extraction module 11 in the first training model includes a preprocessing layer, a CNN model, a Word2Vec model, and a BiLSTM model comprising a forward LSTM and a backward LSTM, where the forward LSTM and the backward LSTM each comprise a plurality of sequentially connected LSTM units. The feature extraction module 11 processes the source-domain labeled dataset, the target-domain unlabeled dataset and the target-domain labeled dataset, each represented in non-word-vector form, as follows to obtain the first dataset, the second dataset and the third dataset: the preprocessing layer preprocesses the words of the dataset, including unifying the case and removing stop words; the CNN model extracts the character-level embedding features of each word in the dataset; the Word2Vec model extracts the word embedding features of each word in the dataset; the character-level embedding features and word embedding features of each word are concatenated to obtain the vector representation of each word; and the vector representation of each word in the dataset is input into the BiLSTM model of the feature extraction module 11 for processing, yielding a dataset represented in the form of word vectors that contain context information. In short, the feature extraction module 11 extracts character-level embedding features and word embedding features common to the source domain and the target domain, as well as word vectors containing context information. Referring to FIG. 2, the source domain and the target domain provide samples composed of sentences, which are input into the feature extraction module 11. The feature extraction module 11 uses the CNN to extract character-level embedding features, which effectively addresses the problem of words that do not appear in the dictionary (out-of-vocabulary, OOV). The word embedding features and the character-level embedding features are then concatenated as the input of the next BiLSTM layer; the feature extraction module 11 uses the BiLSTM to model the sentence and capture context information. The input word sequence (sample) is denoted x and the i-th word xi; xi ∈ S(x) and xi ∈ T(x) indicate that the input sample comes from the source domain or the target domain, respectively. For convenience in the following description, the parameters of the feature extraction module 11 are denoted θf, the word vector containing context information extracted by the feature extraction module 11 is denoted F(xi), and the word sequence represented in the form of word vectors is denoted F(x).
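The following is a hedged, illustrative sketch in Python/PyTorch of a feature extraction module of this kind; it is an assumption rather than the patent's implementation, and all layer sizes, names and dimensions are illustrative. It combines a character-level CNN, word embeddings that could be initialized from Word2Vec, and a BiLSTM that produces the context-aware representations F(x).

```python
import torch
import torch.nn as nn


class FeatureExtractor(nn.Module):
    """Illustrative feature extractor: char-CNN + word embeddings + BiLSTM."""

    def __init__(self, n_chars, n_words, char_dim=30, char_filters=30,
                 word_dim=100, hidden_dim=200):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.word_emb = nn.Embedding(n_words, word_dim)  # could be initialized from Word2Vec vectors
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, w)).transpose(1, 2)        # (b*s, char_dim, w)
        char_feat = self.char_cnn(chars).max(dim=2).values.view(b, s, -1)     # max-pool over characters
        words = self.word_emb(word_ids)                                       # (b, s, word_dim)
        x = torch.cat([words, char_feat], dim=-1)                             # concatenate word + char features
        out, (h_n, _) = self.bilstm(x)                                        # out: context-aware F(x)
        return out, h_n
```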

According to one embodiment of the present invention, the recognition module 12 of the first training model includes a BiLSTM-CRF model, where the entity labels of the source-domain labeled dataset are used to set the label value space of the CRF layer of the BiLSTM-CRF model of the recognition module 12 in the first training model. The labels are illustrated by the schematic result shown inside the recognition module 12 in FIG. 2: B-GPE denotes a schematic entity label for entities such as countries, cities and states, and O denotes a non-entity label; the recognition module 12 performs named entity recognition tagging. The recognition module 12 takes F(x) as input and uses a CRF layer, maximum likelihood estimation to compute the loss function, and the Viterbi algorithm to map each word vector F(xi) in F(x) to its entity label. The CRF algorithm of the CRF layer uses feature functions to express features more abstractly. The objective function of the CRF algorithm is:

P(y|x) = (1/Z(x)) · exp( Σi=1..M Σj θyj·f(x, i, yi, yi-1) )

where x denotes the input word sequence, y the output entity label sequence, θy the feature function weights, Z(x) the normalization factor, i the position of the current word, M the length of the input word sequence, j the index over the feature functions, θyj the weight of the j-th feature function, f(x, i, yi, yi-1) the feature function, yi the output entity label at the current position, and yi-1 the output entity label at the previous position. The meaning of this objective function is: given the input word sequence x and the feature function weights θy, it gives the conditional probability that the output label sequence y appears; the entity label with the highest probability is taken as the entity label yi of the corresponding word in the word sequence x.

The above normalization factor Z(x) is expressed as:

Z(x) = Σy∈Y exp( Σi=1..M Σj θyj·f(x, i, yi, yi-1) )

where Y denotes the set of all possible output entity label sequences.

The parameters of the recognition module 12 are the feature function weights θy, which are solved by maximum likelihood estimation. Assume the source-domain training set is {(x(k), y(k)) | k = 1, …, NS}, where NS is the number of source-domain samples, x(NS) denotes the NS-th source-domain sample, and y(NS) denotes the output entity label sequence of the NS-th source-domain sample.

The (negative) log-likelihood is computed with the following formula and used as the loss function for training the recognition module 12:

Ltask = − Σk=1..NS log P(y(k) | x(k))

where k denotes the index of the current sample.
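A hedged sketch of the recognition module's training objective follows; it assumes the third-party pytorch-crf package, which is not mentioned in the patent. A linear layer maps the BiLSTM features F(x) to per-tag emission scores, and the CRF layer supplies the log-likelihood above as well as Viterbi decoding.

```python
import torch.nn as nn
from torchcrf import CRF  # assumption: the pytorch-crf package is installed


class RecognitionModule(nn.Module):
    """Illustrative BiLSTM-CRF tagging head over features F(x)."""

    def __init__(self, feature_dim: int, num_tags: int):
        super().__init__()
        self.emission = nn.Linear(feature_dim, num_tags)   # per-tag emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, features, tags, mask):
        # Negative log-likelihood L_task over a batch of labeled source sentences.
        return -self.crf(self.emission(features), tags, mask=mask, reduction='mean')

    def decode(self, features, mask):
        # Viterbi decoding: the most probable tag sequence for each sentence.
        return self.crf.decode(self.emission(features), mask=mask)
```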

According to one embodiment of the present invention, the domain distinction module 13 of the first training model includes a multilayer perceptron (MLP) with a Softmax layer. The domain distinction module 13 takes F(x) as input and is a standard feed-forward network. The training goal is that, as far as possible, samples from the source domain and the target domain should not be distinguishable. The domain distinction module 13 maps the same hidden state h to a domain label, and the parameters of this mapping are denoted θd. The domain distinction module 13 aims to identify the domain label through the following loss function:

Ltype = − Σk=1..(NS+Nt) [ dk·log d̂k + (1 − dk)·log(1 − d̂k) ]

where dk is the ground-truth domain label of sample k, d̂k = Q(F(x(k))) is the output of the domain distinction module 13 (denoted Q) on sample k, and Nt denotes the number of samples from the target domain. By maximizing the loss with respect to the parameters θf of the feature extraction module 11 (denoted F) while minimizing the loss with respect to the parameters θd of the domain distinction module 13 Q, the domain distinction module 13 is trained to a saddle point of the loss function. A saddle point is a point that tilts upward in one dimension and downward in another. As shown in FIG. 3, a saddle point is usually surrounded by a plateau of equal error values, which makes it hard for the algorithm to escape, because the gradient is close to zero in all dimensions. Optimizing the parameters θf of the feature extraction module 11 so that the domain distinction module 13 cannot distinguish the domains means that the feature extraction module 11 F finds the common, general features shared by the source domain and the target domain. During training, after the parameters θd of the domain distinction module 13 and the parameters θy of the recognition module 12 have been updated, the parameters θf of the feature extraction module 11 F are optimized according to the updated θd and θy so as to minimize the classification loss Ltask, which ensures that P(F(xi)) makes accurate predictions on the source domain.

根据本发明的一个实施例,第一训练模型还包括梯度反转层。对特征提取模块11和领域区分模块13进行对抗训练过程中,在正向传播时通过梯度反转层对第一训练模型的特征提取模块11和领域区分模块13执行标准随机梯度下降操作。在反向传播时,在将领域区分模块13的损失函数返回到特征提取模块11之前将梯度反转层的参数自动取反,以使特征提取模块11提取源领域标记数据集和目标领域未标记数据集中单词的通用特征。According to one embodiment of the present invention, the first training model further includes a gradient reversal layer. During the adversarial training of thefeature extraction module 11 and thedomain distinction module 13, a standard stochastic gradient descent operation is performed on thefeature extraction module 11 and thedomain distinction module 13 of the first training model through the gradient reversal layer during forward propagation. During back propagation, the parameters of the gradient reversal layer are automatically reversed before the loss function of thedomain distinction module 13 is returned to thefeature extraction module 11, so that thefeature extraction module 11 extracts the common features of the words in the source domain labeled dataset and the target domain unlabeled dataset.
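A minimal PyTorch sketch of such a gradient reversal layer is shown below; it is an assumption, since the patent describes the layer but provides no code. The forward pass is the identity, while the backward pass multiplies the incoming gradient by −ω, so that minimizing the domain loss pushes the feature extractor toward domain-invariant features.

```python
import torch


class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, omega=1.0):
        ctx.omega = omega
        return x.view_as(x)          # identity in the forward direction

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the gradient flowing back into the feature extractor.
        return -ctx.omega * grad_output, None


def grad_reverse(x, omega=1.0):
    return GradientReversal.apply(x, omega)
```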

According to one embodiment of the present invention, referring again to FIG. 2, the first training model further includes an auto-encoding module 14, which is trained with the second dataset. After each round of training, the parameters of the feature extraction module 11 are jointly updated according to the loss function of the auto-encoding module 14, the loss function of the recognition module 12 and the loss function of the domain distinction module 13. Preferably, the auto-encoding module 14 includes an encoder and a decoder. In each round of training, the encoder obtains the hidden states of the last LSTM unit of the forward LSTM and the last LSTM unit of the backward LSTM produced by the BiLSTM model of the feature extraction module 11 on the words of the target-domain unlabeled dataset, combines them into the initial state feature of the decoder, and uses this initial state feature together with the previous word embedding feature as the input of the decoder, so as to train the auto-encoding module 14 to extract the private features of the target domain. Adversarial learning tries to optimize the hidden representation into a general-purpose representation hcommon; the parameters of the recognition module 12 obtained through this adversarial optimization process are used to initialize the parameters of the recognition module of the second training model, while the target-domain auto-encoding module 14 adjusts the general-purpose representation so that it contains both some features common to the source and target domains and some domain-private features of the target-domain data, yielding a domain feature representation that contains target-domain information; used as the feature extraction module of the final model, this counteracts the tendency of the adversarial learning network to remove target-domain features. In other words, the auto-encoding module 14 performs feature learning for the target domain and preserves its domain characteristics. By training the adversarial learning network composed of the feature extraction module 11 and the domain distinction module 13, the features hcommon common to the source and target domains can be obtained, but this weakens some domain-specific features that are useful for named entity recognition; obtaining only domain-general features would therefore limit the classification ability. The present invention addresses this drawback by introducing an auto-encoding module 14 for the target domain, which attempts to reconstruct the target-domain data. The encoder of the auto-encoding module 14 takes the last hidden states of the forward LSTM and the backward LSTM of the BiLSTM model in the feature extraction module 11 and combines them into the initial state h0(dec) of the decoder LSTM. The present invention therefore does not need to reverse the word order of the input sentence (word sequence), and the model avoids the difficulty of establishing communication between input and output. h0(dec) and the previous word embedding feature are used as the input of the decoder. Let x̂ = (x̂1, …, x̂M) denote the output word sequence and zi the representation of the i-th word, with zi = MLP(hi), where MLP is a multilayer perceptron. The hidden state is hi = LSTM([h0(dec) : zi-1], hi-1), where [· : ·] is the concatenation operation, i.e., h0(dec) is concatenated with the previous word embedding feature zi-1 and, together with the hidden state hi-1 of the previous position, used as the input of the LSTM, whose output is the hidden state of the current position. Given h0(dec), the conditional probability P(x̂ | h0(dec)) of the output word sequence x̂ is given by the following formula:

P(x̂ | h0(dec)) = ∏i=1..M P(x̂i | h0(dec), x̂1, …, x̂i-1)

where each P(x̂i | h0(dec), x̂1, …, x̂i-1) is a softmax probability computed over all words in the dictionary.

The goal of the present invention is to minimize, with respect to the parameters θr of the auto-encoding module 14, the loss function shown in the following formula:

Ltarget = − Σk=1..Nt Σi xi(k)·log P(x̂i(k))

where xi(k) is the one-hot vector of the i-th word of sample k. This makes h0(dec) learn an incomplete yet most salient sentence representation of the target-domain data. The adversarial learning network tries to optimize the hidden representation into the general-purpose representation hcommon; by optimizing the general-purpose representation, the target-domain auto-encoding module 14 adds to it the private features of the target-domain data, counteracting the tendency of the adversarial learning network to erase the private features of the target domain.

Step A2: perform multiple rounds of training on the first training model, wherein in each round of training the recognition module 12 is trained with the first dataset, and the feature extraction module 11 and the domain distinction module 13 are adversarially trained with the first dataset and the second dataset; after each round of training, the parameters of the feature extraction module 11 are adjusted at least according to the loss function of the recognition module 12 and the loss function of the domain distinction module 13, the first dataset and the second dataset are updated at the same time, and the next round of training is performed with the updated first and second datasets, where the first dataset is a source-domain labeled dataset with entity labels represented in the form of word vectors, and the second dataset is a target-domain unlabeled dataset without entity labels represented in the form of word vectors.

Here, the size of the source-domain labeled dataset is the same as or approximately the same as the size of the target-domain unlabeled dataset, and the size of the target-domain labeled dataset is smaller than the size of the target-domain unlabeled dataset. Preferably, the same or approximately the same size means that the ratio of the data volume of the source-domain labeled dataset to that of the target-domain unlabeled dataset is between 10:14 and 10:9. Making the source-domain labeled dataset and the target-domain unlabeled dataset the same or approximately the same size prevents the parameters of the model trained in adversarial training from being biased, due to resource imbalance, toward the domain with more data, so that the final model achieves better named entity recognition performance in the target domain.

Preferably, the process of training the recognition module 12 with the first dataset includes training the recognition module 12 with word sequences represented as word vectors and the entity label of each word in the word sequence, so that it can recognize the entity label to which a word belongs from its word vector.

Preferably, the process of adversarially training the feature extraction module 11 and the domain distinction module 13 with the first dataset and the second dataset includes: in one round of training, the domain distinction module 13 takes the word sequence F(x) containing context information generated by the feature extraction module 11 as input, and the domain distinction module 13 is trained to output a classification result indicating whether the word sequence comes from the source domain or the target domain; in the back propagation process, the parameters of the feature extraction module 11 are adjusted at least according to the loss function of the domain distinction module 13 obtained in the previous training step, so that the feature extraction module 11, with its new parameters, generates a new word sequence F(x) containing context information; based on the new word sequence F(x) containing context information, the above steps are repeated for the next round of training.

Preferably, after each round of training, the parameters of the feature extraction module 11 of the first training model are adjusted in the following manner:

θf = θ'f − μ·(α·∂Ltask/∂θf + β·∂Ltarget/∂θf − ω·γ·∂Ltype/∂θf)

where θf denotes the parameters of the feature extraction module 11 after this adjustment, θ'f the parameters of the feature extraction module 11 before this adjustment, μ the learning rate, Ltask the loss function of the recognition module 12, Ltype the loss function of the domain distinction module 13, Ltarget the loss function of the auto-encoding module 14, −ω the gradient reversal parameter, and α, β, γ the weights set by the user.

Preferably, after each round of training, the parameters of the recognition module 12, the domain distinction module 13 and the auto-encoding module 14 of the first training model are adjusted in the following manner:

The parameter update of the recognition module 12 is:

θy = θ'y − μ·α·∂Ltask/∂θy

The parameter update of the domain distinction module 13 is:

θd = θ'd − μ·γ·∂Ltype/∂θd

The parameter update of the auto-encoding module 14 is:

θr = θ'r − μ·β·∂Ltarget/∂θr

where θy denotes the parameters of the recognition module 12 after this adjustment, θ'y the parameters of the recognition module 12 before this adjustment, θd the parameters of the domain distinction module 13 after this adjustment, θ'd the parameters of the domain distinction module 13 before this adjustment, θr the parameters of the auto-encoding module 14 after this adjustment, θ'r the parameters of the auto-encoding module 14 before this adjustment, μ the learning rate, and α, β, γ the weights set by the user.

Preferably, the goal of training the first training model is to train it to convergence; one criterion is to train until the weighted sum of the loss functions corresponding to the feature extraction module 11, the recognition module 12 and the domain distinction module 13 of the first training model is minimized, i.e., to minimize the following total loss function:

Ltotal = α·Ltask + β·Ltarget + γ·Ltype

where α, β and γ denote weights set by the user. In other words, α, β and γ are user-set weights that balance the influence of the loss functions of the feature extraction module 11, the recognition module 12 and the domain distinction module 13 of the first training model.
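A hedged sketch of one training round of step A2 is given below; it is an assumption that composes the illustrative modules and the grad_reverse helper defined in the earlier sketches, and reconstruction_loss_fn is a hypothetical helper wrapping the decoder sketch. A single backward pass through the weighted total loss, with the gradient reversal layer in front of the domain discriminator, yields the per-module updates given above.

```python
import torch
import torch.nn.functional as F


def train_round(extractor, recognizer, discriminator, reconstruction_loss_fn,
                src_batch, tgt_batch, optimizer, alpha, beta, gamma, omega):
    src_feats, _ = extractor(src_batch["word_ids"], src_batch["char_ids"])
    tgt_feats, tgt_hn = extractor(tgt_batch["word_ids"], tgt_batch["char_ids"])

    # L_task: supervised NER loss on the labeled source-domain batch.
    loss_task = recognizer.loss(src_feats, src_batch["tags"], src_batch["mask"])

    # L_type: domain-classification loss over sentence-level features of both domains;
    # grad_reverse (sketched earlier) flips the gradient flowing into the extractor.
    feats_all = grad_reverse(torch.cat([src_feats.mean(1), tgt_feats.mean(1)]), omega)
    domains = torch.cat([
        torch.zeros(src_feats.size(0), dtype=torch.long, device=feats_all.device),
        torch.ones(tgt_feats.size(0), dtype=torch.long, device=feats_all.device)])
    loss_type = F.cross_entropy(discriminator(feats_all), domains)

    # L_target: reconstruction loss of the target-domain auto-encoder (hypothetical helper).
    loss_target = reconstruction_loss_fn(tgt_hn, tgt_batch)

    optimizer.zero_grad()
    (alpha * loss_task + beta * loss_target + gamma * loss_type).backward()
    optimizer.step()
```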

Step A3: construct a second training model, the second training model including a feature extraction module 21 and a recognition module 22, where the initial parameters of the feature extraction module 21 of the second training model are set using the parameters of the feature extraction module 11 of the first training model trained in step A2, and the initial parameters of the recognition module 22 are set by random initialization. Setting the initial parameters of the recognition module 22 by random initialization generates uniformly distributed parameters, reduces the time needed to train the model to convergence, and also makes it easier to train the model to the optimal rather than a suboptimal result.

第二训练模型的特征提取模块21的结构和第一训练模型的特征提取模块11的结构相同,即,包括预处理层、CNN模型、Word2Vec模型和包含前向LSTM和后向LSTM的BiLSTM模型。在完成对第一训练模型的训练后,通过迁移学习的方式将训练第一训练模型的特征提取模块11得到的参数用于设置第二训练模型的特征提取模块21。此外,第二训练模型的识别模块22包括BiLSTM-CRF模型。其中,与第一训练模型的识别模块12不同,第二训练模型采用的目标领域标记数据集的实体标签设置其识别模块22的BiLSTM-CRF模型的CRF层的标签,由此,以根据目标领域的实体标签对目标领域的数据进行命名实体识别。The structure of thefeature extraction module 21 of the second training model is the same as that of thefeature extraction module 11 of the first training model, that is, it includes a preprocessing layer, a CNN model, a Word2Vec model, and a BiLSTM model including a forward LSTM and a backward LSTM. After completing the training of the first training model, the parameters obtained by training thefeature extraction module 11 of the first training model are used to set thefeature extraction module 21 of the second training model through transfer learning. In addition, therecognition module 22 of the second training model includes a BiLSTM-CRF model. Among them, unlike therecognition module 12 of the first training model, the entity label of the target domain labeled data set used by the second training model sets the label of the CRF layer of the BiLSTM-CRF model of itsrecognition module 22, thereby performing named entity recognition on the data of the target domain according to the entity label of the target domain.
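As a rough illustration of step A3, the sketch below copies the trained feature-extractor parameters into the second training model and randomly re-initialises a freshly built recognition module sized for the target-domain label set. The attribute and factory names (feature_extractor, make_recognizer) and the linear stand-ins are assumptions for illustration only.

```python
import copy

import torch.nn as nn


def build_second_model(trained_first_model, num_target_labels, make_recognizer):
    """trained_first_model is assumed to expose its trained feature extractor as
    .feature_extractor; make_recognizer builds a fresh recognition module (a
    BiLSTM-CRF in the actual method) sized for the target-domain label set."""
    # Transfer the learned feature-extractor parameters unchanged.
    feature_extractor = copy.deepcopy(trained_first_model.feature_extractor)

    # Build the recognition module and randomly initialise its parameters.
    recognizer = make_recognizer(num_target_labels)
    for p in recognizer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
        else:
            nn.init.zeros_(p)

    return nn.ModuleDict({"feature_extractor": feature_extractor,
                          "recognizer": recognizer})


# Toy usage with linear stand-ins for the real sub-networks.
toy_first = nn.ModuleDict({"feature_extractor": nn.Linear(100, 50)})
second_model = build_second_model(toy_first, num_target_labels=10,
                                  make_recognizer=lambda n: nn.Linear(50, n))
```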

Step A4: Using the third data set, fine-tune the parameters of the feature extraction module 21 and the recognition module 22 of the second training model constructed in step A3 in a supervised manner, and take the fine-tuned second training model as the named entity recognition model, where the third data set is a target-domain labeled data set with entity labels represented as word vectors. Because the third data set carries the entity labels of the target domain, it can be used for supervised training of the second training model to adjust the parameters of its feature extraction module 21 and recognition module 22, further improving the accuracy of the resulting named entity recognition model on target-domain data. It should be understood that the named entity recognition model includes a feature extraction module 31 and a recognition module 32: the feature extraction module 31 is obtained from the feature extraction module 21 of the second training model after parameter fine-tuning, and the recognition module 32 is obtained from the recognition module 22 of the second training model after parameter fine-tuning.
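A possible shape of the fine-tuning loop in step A4 is sketched below, assuming the second training model is held as a ModuleDict and the third data set is already available as batches of word vectors and tag ids; the cross-entropy loss merely stands in for the CRF loss, and the low Adam learning rate follows the experimental settings described later.

```python
import torch
import torch.nn as nn


def fine_tune(model, batches, epochs=1, lr=1e-4):
    """model: ModuleDict with 'feature_extractor' and 'recognizer' sub-modules;
    batches: iterable of (word_vectors, tag_ids) pairs from the target-domain
    labelled set (the third data set)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # low learning rate
    for _ in range(epochs):
        for word_vectors, tag_ids in batches:
            logits = model["recognizer"](model["feature_extractor"](word_vectors))
            loss = nn.functional.cross_entropy(logits, tag_ids)  # stand-in for the CRF loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model


# Toy usage: one batch of 4 "words" with 100-dimensional word vectors and 10 target labels.
toy = nn.ModuleDict({"feature_extractor": nn.Linear(100, 50),
                     "recognizer": nn.Linear(50, 10)})
fine_tune(toy, [(torch.randn(4, 100), torch.randint(0, 10, (4,)))])
```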

下面通过一个具体实验示例来说明本发明。The present invention is described below by means of a specific experimental example.

第一部分:数据集设置Part 1: Dataset Setup

源领域标记数据:为了训练对抗学习网络(第一训练模型的特征提取模块和领域区分模块)进行命名实体识别,使用了Ontonotes5.0英文数据集。Source domain labeled data: In order to train the adversarial learning network (the feature extraction module and domain distinction module of the first training model) for named entity recognition, the Ontonotes5.0 English dataset was used.

目标领域标记数据:为了训练和评估提出的模型,使用了Ritter11数据集。Target Domain Labeled Data: To train and evaluate the proposed model, the Ritter11 dataset is used.

目标领域未标记数据:为了训练对抗学习网络保留通用特征,本发明需要使用具有大规模未标记推文的数据集;因此,本发明使用Twitter的接口从Twitter构造了大规模的Twitter领域未标记数据集作为目标领域未标记数据。Target domain unlabeled data: In order to train the adversarial learning network to retain common features, the present invention needs to use a dataset with large-scale unlabeled tweets; therefore, the present invention uses the Twitter interface to construct a large-scale Twitter domain unlabeled dataset from Twitter as the target domain unlabeled data.

Ontonotes5.0和Ritter11数据集的统计信息如表1所示,可以看到Ontonotes5.0的训练数据集单词数量(Token数量)为848,220,Ritter11的训练数据集单词数量为37,098。构造的Twitter领域未标记数据的验证数据集单词数量为1,177,746。The statistics of the Ontonotes5.0 and Ritter11 datasets are shown in Table 1. It can be seen that the number of words (tokens) in the Ontonotes5.0 training dataset is 848,220, and the number of words in the Ritter11 training dataset is 37,098. The number of words in the constructed Twitter domain unlabeled data validation dataset is 1,177,746.

表1 Ontonotes5.0和Ritter11数据集统计信息Table 1 Statistics of Ontonotes5.0 and Ritter11 datasets

                               Ontonotes5.0 dataset    Ritter11 dataset
Training set tokens            848,220                 37,098
Validation set tokens          144,319                 4,461
Test set tokens                49,235                  4,730
Training set sentences         33,908                  1,915
Validation set sentences       5,771                   239
Test set sentences             1,898                   240
Named entity categories        18                      10

In this field, once a data set has been obtained it is usually divided into the three parts shown in Table 1: a training data set (training set), a validation data set (validation set) and a test data set (test set). The training set is used to train the model; during training, its samples are used to train each model or module for multiple rounds until convergence. A model is considered trained to convergence when any one of the following rules is satisfied. Rule 1: the number of training rounds reaches a user-defined upper limit. Rule 2: the change in the F1 value of the named entity recognition model after one round of training, compared with the previous round, is no greater than a preset change threshold. Rule 3: the number of training rounds has reached a user-defined lower limit and the precision of the named entity recognition model on the validation set does not improve after a round of training compared with the previous round. For example, the lower limit may be set to 2 rounds, the upper limit to 30 rounds, and the change threshold to ±0.5%. The validation set is used to compute evaluation metrics, tune parameters and select algorithms; the test set is used for the final overall evaluation of the model.
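The three convergence rules can be expressed as a simple stopping check, sketched below with the example thresholds mentioned above (lower limit 2 rounds, upper limit 30 rounds, ±0.5% F1 change); the function and argument names are illustrative.

```python
def has_converged(epoch, f1_history, precision_history,
                  min_epochs=2, max_epochs=30, f1_delta_threshold=0.005):
    """f1_history / precision_history: per-epoch validation scores, latest last."""
    if epoch >= max_epochs:                       # rule 1: epoch cap reached
        return True
    if (len(f1_history) >= 2
            and abs(f1_history[-1] - f1_history[-2]) <= f1_delta_threshold):
        return True                               # rule 2: F1 barely changed
    if (epoch >= min_epochs and len(precision_history) >= 2
            and precision_history[-1] <= precision_history[-2]):
        return True                               # rule 3: precision stopped improving
    return False


# Example: stop because validation precision stalls after the minimum number of rounds.
print(has_converged(epoch=3, f1_history=[0.40, 0.46],
                    precision_history=[0.52, 0.51]))  # True
```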

对于表1中Ontonotes5.0数据集的18个命名实体类别和Ritter11数据集的10个命名实体类别对应的实体标签可分别参见下面的表2和表3。此外,通常用标签O(Outside)表示非实体。The entity labels corresponding to the 18 named entity categories of the Ontonotes5.0 dataset and the 10 named entity categories of the Ritter11 dataset in Table 1 can be found in Tables 2 and 3 below. In addition, the label O (Outside) is usually used to represent non-entities.

表2 Ontonotes5.0数据集实体标签Table 2 Entity labels of Ontonotes5.0 dataset


表3 Ritter11数据集实体标签Table 3 Entity labels of the Ritter11 dataset


第二部分:实验设置Part II: Experimental Setup

因为标记的Ontonotes5.0数据集的大小是标记的Ritter11数据集的20倍以上,所以如果直接使用合并的数据集训练模型,最终结果会更加偏向于Ontonotes5.0数据集,使得训练结果较差。因此,本发明首先在Ontonotes5.0数据集和Twitter领域未标记数据集上进行对抗训练,然后使用Ritter11数据集来以低学习率微调(fine-tune)第二训练模型的参数。微调(fine-tune)就是使用已经训练好的模型参数当做训练的起点,在新的数据上重新训练一遍的过程。Because the size of the labeled Ontonotes5.0 dataset is more than 20 times that of the labeled Ritter11 dataset, if the merged dataset is used directly to train the model, the final result will be more biased towards the Ontonotes5.0 dataset, resulting in poor training results. Therefore, the present invention first performs adversarial training on the Ontonotes5.0 dataset and the Twitter domain unlabeled dataset, and then uses the Ritter11 dataset to fine-tune the parameters of the second training model at a low learning rate. Fine-tuning is the process of using the trained model parameters as the starting point of training and retraining on new data.

In the experiments of this example, the model uses the following hyperparameters. Commonly used optimizers include AdaGrad, RMSProp, Adam and AdaDelta; after comparing their effects experimentally, the adversarial learning stage uses the AdaGrad optimizer with a learning rate chosen from (0, 1], 0.1 being the default in this example. The fine-tuning stage uses the Adam optimizer with a learning rate chosen from (0, 1], 0.0001 being the default in this example, together with an early-stopping mechanism; the number of training rounds is chosen from a range such as (0, 100] and is set to 100 in this example. Word embedding features are trained with Google's Word2Vec technique, which converts words into multi-dimensional vectors; the Word2Vec dimensionality is set to the default of 200. Character-level embedding features are trained with a CNN, with a dimensionality chosen from (0, 300] and set to 25 in this example. Encoding uses a BiLSTM whose number of hidden units per layer is chosen from a range such as (0, 300]; each layer contains 250 hidden units in this example. Decoding uses a three-layer standard LSTM whose number of hidden units per layer is chosen from a range such as (0, 1000]; each layer consists of 500 hidden units in this example. The weights α, β and γ are all set to 1 in this example. It should be noted that the above ranges for training rounds, dimensionalities and numbers of hidden units are only illustrative and may be set to larger values when sufficient computing resources are available; the invention places no restriction on them.
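For readability, the hyperparameters of this example can be collected into a single configuration, as in the sketch below; the key names are illustrative assumptions and only the values stated above are taken from the experiment.

```python
# Hyperparameters of this experimental example, gathered into one configuration dictionary.
config = {
    "adversarial_optimizer": "AdaGrad",
    "adversarial_learning_rate": 0.1,
    "finetune_optimizer": "Adam",
    "finetune_learning_rate": 1e-4,
    "max_training_rounds": 100,                      # with early stopping
    "word_embedding": {"model": "Word2Vec", "dim": 200},
    "char_embedding": {"model": "CNN", "dim": 25},
    "encoder": {"model": "BiLSTM", "hidden_units_per_layer": 250},
    "decoder": {"model": "LSTM", "layers": 3, "hidden_units_per_layer": 500},
    "loss_weights": {"alpha": 1.0, "beta": 1.0, "gamma": 1.0},
}
```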

基线实验为如图4所示的使用本章提出的特征提取模块的基础BiLSTM-CRF模型在Ritter11数据集上的训练结果,记作In-domain。即,直接将Bob Dylan visited Sweden这类样本的字符级别嵌入特征和单词嵌入特征进行拼接后输入BiLSTM-CRF模型的BiLSTM层和CRF层,得到命名实体识别结果。比如,对样本Bob Dylan visited Sweden的命名实体识别结果分别为:B-PER I-PER O B-GPE。B-PER表示Bob是人名实体(Begin,开始),I-PER表示Dylan是人名实体(Inside,内部),O表示visited是非实体,B-GPE表示Sweden是国家、城市、州这类实体(Begin,开始)。The baseline experiment is the training result of the basic BiLSTM-CRF model on the Ritter11 dataset using the feature extraction module proposed in this chapter, as shown in Figure 4, denoted as In-domain. That is, the character-level embedding features and word embedding features of samples such as Bob Dylan visited Sweden are directly concatenated and input into the BiLSTM layer and CRF layer of the BiLSTM-CRF model to obtain the named entity recognition results. For example, the named entity recognition results of the sample Bob Dylan visited Sweden are: B-PER I-PER O B-GPE. B-PER means that Bob is a person name entity (Begin, start), I-PER means that Dylan is a person name entity (Inside, inside), O means that visited is a non-entity, and B-GPE means that Sweden is an entity such as a country, city, or state (Begin, start).
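The following small helper illustrates how BIO-style tag sequences such as the one above are read back into entity spans; it is an illustrative decoder, not the model's own output layer.

```python
def bio_to_entities(tokens, tags):
    """Group BIO tags (B-XXX / I-XXX / O) into (entity text, entity type) pairs."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "text": [token]}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["text"].append(token)
        else:                       # "O" or an inconsistent I- tag closes the current entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(" ".join(e["text"]), e["type"]) for e in entities]


print(bio_to_entities(["Bob", "Dylan", "visited", "Sweden"],
                      ["B-PER", "I-PER", "O", "B-GPE"]))
# [('Bob Dylan', 'PER'), ('Sweden', 'GPE')]
```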

此外,本发明也使用了如图5所示的现有的参数初始化方法(INIT)和多任务学习方法(MULT)作为对比试验。这两个现有方法的具体描述如下:In addition, the present invention also uses the existing parameter initialization method (INIT) and multi-task learning method (MULT) as shown in Figure 5 as a comparative experiment. The specific descriptions of these two existing methods are as follows:

参数初始化方法:首先使用源领域训练数据DS训练源模型MS。接下来,构造目标模型MT并重建最后的CRF层,以解决输出空间(标签)不同的问题。使用学习到的MS的参数来初始化MT(不包括CRF层)。最后,继续使用目标领域训练数据DT训练MTParameter initialization method: First, train the source modelMS using the source domain training data DS. Next, construct the target modelMT and rebuild the final CRF layer to solve the problem of different output spaces (labels). Use the learned parameters ofMS to initializeMT (excluding the CRF layer). Finally, continue to trainMT using the target domain training data DT.

Multi-task learning method (MULT): multi-task learning uses DS and DT simultaneously to train MS and MT. The parameters of MS and MT (excluding the CRF layers) are shared during training. Some existing schemes use a hyperparameter λ, the probability of selecting an instance from DS rather than DT, to optimize the model parameters; by choosing λ appropriately, multi-task learning performs better in the target domain. However, when the source domain is much larger than the target domain, the model is biased towards the source domain and the result is poor.
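The instance-selection step of the MULT baseline can be sketched as follows: at each training step a batch is drawn from the source set with probability λ and from the target set otherwise. The data handling below is a simplified stand-in for illustration only.

```python
import random


def sample_batch(source_batches, target_batches, lam, rng=random):
    """lam: probability of picking a source-domain batch (the hyperparameter lambda)."""
    if rng.random() < lam:
        return rng.choice(source_batches), "source"
    return rng.choice(target_batches), "target"


# Example: with a small lambda, most steps train on target-domain batches.
source_batches = [["source batch %d" % i] for i in range(5)]
target_batches = [["target batch %d" % i] for i in range(5)]
picks = [sample_batch(source_batches, target_batches, lam=0.2)[1] for _ in range(10)]
print(picks)
```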

第三部分:评价方法和指标Part III: Evaluation methods and indicators

评价方法采用CoNLL03会议规定的完全匹配,即实体的边界和类型均匹配才算作正确匹配(正确标注)。The evaluation method adopts the complete match specified by the CoNLL03 conference, that is, the entity boundary and type must match to be considered a correct match (correct labeling).

评价指标使用精确率(Precision),召回率(Recall)和F1值(F1-score),计算方式如下:The evaluation indicators used are precision, recall and F1-score, which are calculated as follows:

Precision:

Precision = TP / (TP + FP)

Recall:

Recall = TP / (TP + FN)

F1-score:

F1 = 2 × Precision × Recall / (Precision + Recall)

其中,TP表示True Positive(TP),是指被模型预测为正的正样本(实体单词被正确标注);可以称作判断为真的正确率;TP stands for True Positive (TP), which refers to the positive samples predicted by the model as positive (the entity words are correctly labeled); it can be called the accuracy rate of the judgment as true;

FP表示False Positive(FP),是指被模型预测为正的负样本(非实体单词被标注为实体);可以称作误报率;FP stands for False Positive (FP), which refers to negative samples predicted as positive by the model (non-entity words are marked as entities); it can be called the false alarm rate;

FN表示False Negative(FN),是指被模型预测为负的正样本(实体单词被标注为非实体);可以称作漏报率。FN stands for False Negative (FN), which refers to positive samples predicted as negative by the model (entity words are marked as non-entities); it can be called the false negative rate.
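Putting the exact-match rule and the three metrics together, the sketch below computes precision, recall and F1 from gold and predicted entity spans, where an entity counts as a true positive only when both its boundary and its type match; the (start, end, type) span representation is an assumption used for illustration.

```python
def evaluate(gold_entities, pred_entities):
    """Each entity is a (start, end, type) tuple; inputs are sets of entities."""
    tp = len(gold_entities & pred_entities)      # exactly matched entities (boundary + type)
    fp = len(pred_entities - gold_entities)      # predicted entities with no exact gold match
    fn = len(gold_entities - pred_entities)      # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = {(0, 2, "PER"), (3, 4, "GPE")}
pred = {(0, 2, "PER"), (3, 4, "LOC")}            # wrong type, so not a correct match
print(evaluate(gold, pred))                      # (0.5, 0.5, 0.5)
```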

该示例的实验结果如表4所示:The experimental results of this example are shown in Table 4:

表4实验结果Table 4 Experimental results


从表4可以看出,只使用特征提取模块和NER分类模块进行微调(fine-tune)的本发明模型(即表4第5行)与In-domain方法、INIT方法和MULT方法效果相近,因为只使用特征提取模块和NER分类模块进行微调(fine-tune)本质上就是标准迁移学习方法。而在此基础上增加领域区分模块和特征提取模块组成对抗学习网络(即表4第6行)可以提高性能,并且增加领域区分模块和目标域自动编码模块(即表4第7行)比单独增加领域区分模块性能更高,表明通过在目标领域中引入自动编码模块,可以保留特定于域的特征以获得更好的性能。实验结果表明本发明提出的模型可以极大地帮助执行跨域命名实体识别。As can be seen from Table 4, the model of the present invention that uses only the feature extraction module and the NER classification module for fine-tuning (i.e., the 5th row of Table 4) has similar effects to the In-domain method, the INIT method, and the MULT method, because only using the feature extraction module and the NER classification module for fine-tuning (fine-tuning) is essentially a standard transfer learning method. On this basis, adding a domain distinction module and a feature extraction module to form an adversarial learning network (i.e., the 6th row of Table 4) can improve the performance, and adding a domain distinction module and a target domain automatic encoding module (i.e., the 7th row of Table 4) has higher performance than adding a domain distinction module alone, indicating that by introducing an automatic encoding module in the target domain, domain-specific features can be retained to obtain better performance. The experimental results show that the model proposed by the present invention can greatly help perform cross-domain named entity recognition.

根据本发明的一个实施例,提供一种基于前述实施例的命名实体识别模型训练方法训练得到的命名实体识别模型的命名实体识别方法,其特征在于,所述命名实体识别模型包括特征提取模块31和识别模块32,所述命名实体识别方法包括:B1、通过命名实体识别模型的特征提取模块31获取待识别文本的字符级别嵌入特征和单词嵌入特征并进行串联拼接,得到待识别文本中各单词的单词向量;B2、将单词向量的形式表示的待识别文本输入命名实体识别模型的识别模块32,得到所述待识别文本的命名实体识别结果。According to one embodiment of the present invention, there is provided a named entity recognition method based on a named entity recognition model trained by the named entity recognition model training method of the aforementioned embodiment, characterized in that the named entity recognition model includes afeature extraction module 31 and arecognition module 32, and the named entity recognition method includes: B1, obtaining character-level embedding features and word embedding features of a text to be recognized through thefeature extraction module 31 of the named entity recognition model and concatenating them in series to obtain word vectors of each word in the text to be recognized; B2, inputting the text to be recognized represented in the form of word vectors into therecognition module 32 of the named entity recognition model to obtain a named entity recognition result of the text to be recognized.
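A minimal sketch of steps B1-B2 as an inference routine is given below; the whitespace tokenisation, the char_encoder/word_encoder stand-ins and the argmax decoding (in place of CRF decoding) are simplifying assumptions used only to show the data flow from text to per-word tags.

```python
import torch
import torch.nn as nn


def recognize(text, char_encoder, word_encoder, recognizer, id2tag):
    words = text.split()                                        # simplified tokenisation
    char_feats = torch.stack([char_encoder(w) for w in words])  # character-level embeddings (B1)
    word_feats = torch.stack([word_encoder(w) for w in words])  # word embeddings (B1)
    word_vectors = torch.cat([char_feats, word_feats], dim=-1)  # concatenation into word vectors (B1)
    tag_ids = recognizer(word_vectors).argmax(dim=-1)           # stand-in for CRF decoding (B2)
    return list(zip(words, [id2tag[int(i)] for i in tag_ids]))


# Toy usage with random encoders, only to show the data flow, not real predictions.
torch.manual_seed(0)
char_encoder = lambda w: torch.randn(25)
word_encoder = lambda w: torch.randn(200)
recognizer = nn.Linear(225, 5)
id2tag = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-GPE", 4: "I-GPE"}
print(recognize("Bob Dylan visited Sweden", char_encoder, word_encoder, recognizer, id2tag))
```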

需要说明的是,虽然上文按照特定顺序描述了各个步骤,但是并不意味着必须按照上述特定顺序来执行各个步骤,实际上,这些步骤中的一些可以并发执行,甚至改变顺序,只要能够实现所需要的功能即可。It should be noted that although the above describes the various steps in a specific order, it does not mean that the various steps must be executed in the above specific order. In fact, some of these steps can be executed concurrently or even in a different order as long as the required functions can be achieved.

本发明可以是系统、方法和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于使处理器实现本发明的各个方面的计算机可读程序指令。The present invention may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.

计算机可读存储介质可以是保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以包括但不限于电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。Computer readable storage medium can be a tangible device that holds and stores instructions used by an instruction execution device. Computer readable storage medium can include, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (non-exhaustive list) of computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanical encoding device, for example, a punch card or a protruding structure in a groove on which instructions are stored, and any suitable combination thereof.

以上已经描述了本发明的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。The embodiments of the present invention have been described above, and the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The selection of terms used herein is intended to best explain the principles of the embodiments, practical applications, or technical improvements in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

Translated fromChinese
1.一种命名实体识别模型训练方法,其特征在于,所述方法包括:1. A named entity recognition model training method, is characterized in that, described method comprises:A1、构建第一训练模型,所述第一训练模型包括特征提取模块、识别模块和领域区分模块;A1, build the first training model, the first training model includes a feature extraction module, a recognition module and a domain distinction module;A2、对第一训练模型进行多轮训练,其中,每轮训练中,用第一数据集对识别模块进行训练、用第一数据集和第二数据集对特征提取模块和领域区分模块进行对抗训练,每轮训练后至少根据识别模块的损失函数和领域区分模块的损失函数对特征提取模块的参数进行调整,同时更新第一数据集和第二数据集,以更新后的第一数据集和第二数据集进行下一轮训练,其中,第一数据集是以单词向量形式表示的有命名实体标签的源领域标记数据集,第二数据集是以单词向量形式表示的无命名实体标签的目标领域未标记数据集;A2. Perform multiple rounds of training on the first training model, wherein, in each round of training, use the first data set to train the recognition module, and use the first data set and the second data set to confront the feature extraction module and the domain distinction module Training, after each round of training, adjust the parameters of the feature extraction module at least according to the loss function of the identification module and the loss function of the domain distinction module, and update the first data set and the second data set at the same time, with the updated first data set and The second data set is used for the next round of training, where the first data set is a source domain labeled data set with named entity labels represented in the form of word vectors, and the second data set is a data set without named entity labels represented in the form of word vectors target domain unlabeled dataset;A3、构建第二训练模型,所述第二训练模型包括特征提取模块和识别模块,第二训练模型的特征提取模块的初始参数采用经步骤A2训练后的第一训练模型的特征提取模块的参数进行设置,识别模块的初始参数采用随机初始化的方式进行设置;A3, build the second training model, the second training model includes a feature extraction module and a recognition module, the initial parameters of the feature extraction module of the second training model adopt the parameters of the feature extraction module of the first training model after step A2 training To set, the initial parameters of the identification module are set in the way of random initialization;A4、用第三数据集以监督训练的方式对由步骤A3构建的第二训练模型的特征提取模块和识别模块的进行参数微调,将经参数微调后的第二训练模型作为命名实体识别模型,其中,第三数据集是以单词向量形式表示的有命名实体标签的目标领域标记数据集;A4. 
Use the third data set to fine-tune the parameters of the feature extraction module and the recognition module of the second training model constructed in step A3 in a supervised training manner, and use the second training model after parameter fine-tuning as a named entity recognition model, Among them, the third data set is a target field labeled data set with named entity labels represented in the form of word vectors;其中,所述第一训练模型中的特征提取模块包括预处理层、CNN模型、Word2Vec模型、包含前向LSTM和后向LSTM的BiLSTM模型,其中,前向LSTM、后向LSTM分别包括多个依次连接的LSTM单元;Wherein, the feature extraction module in the first training model includes a preprocessing layer, a CNN model, a Word2Vec model, a BiLSTM model including a forward LSTM and a backward LSTM, wherein the forward LSTM and the backward LSTM respectively include multiple connected LSTM units;该特征提取模块分别对非单词向量形式表示的源领域标记数据集、目标领域未标记数据集、目标领域标记数据集进行如下处理以获得第一数据集、第二数据集、第三数据集:The feature extraction module performs the following processing on the source field marked data set represented by the non-word vector form, the target field unlabeled data set, and the target field marked data set to obtain the first data set, the second data set, and the third data set:用所述预处理层对数据集的单词进行包含统一大小写和去除停用词的预处理;Using the preprocessing layer to carry out preprocessing including uniform capitalization and removal of stop words to the words of the data set;用CNN模型提取数据集中各单词的字符级别嵌入特征;Use the CNN model to extract the character-level embedding features of each word in the dataset;用Word2Vec模型提取数据集中各单词的单词嵌入特征;Use the Word2Vec model to extract the word embedding features of each word in the dataset;对数据集中各单词的字符级别嵌入特征和单词嵌入特征进行串联拼接,得到各单词的向量表示;The character-level embedding features and word embedding features of each word in the data set are concatenated and spliced to obtain the vector representation of each word;将数据集中各单词的向量表示输入特征提取模块的BiLSTM模型中进行处理,得到包含上下文信息的以单词向量形式表示的数据集。The vector representation of each word in the data set is input into the BiLSTM model of the feature extraction module for processing, and a data set in the form of word vector containing context information is obtained.2.根据权利要求1所述的命名实体识别模型训练方法,其特征在于,所述源领域标记数据集的规模与所述目标领域未标记数据集的规模相同或者大致相同,所述目标领域标记数据集的规模小于所述目标领域未标记数据集的规模。2. The named entity recognition model training method according to claim 1, wherein the scale of the labeled data set in the source domain is the same or approximately the same as the scale of the unlabeled data set in the target domain, and the labeled data set in the target domain is The size of the dataset is smaller than the size of the unlabeled dataset for the target domain.3.根据权利要求2所述的命名实体识别模型训练方法,其特征在于,规模相同或者大致相同是指源领域标记数据集与目标领域未标记数据集的数据量之比为:10:14~10:9。3. The named entity recognition model training method according to claim 2, wherein the same or roughly the same scale means that the ratio of the amount of data in the source domain labeled dataset to the target domain unlabeled dataset is: 10:14~ 10:9.4.根据权利要求1至3任一所述的命名实体识别模型训练方法,其特征在于,第一训练模型和第二训练模型的识别模块均包括BiLSTM-CRF模型,其中,采用源领域标记数据的命名实体标签设置第一训练模型中识别模块的BiLSTM-CRF模型的CRF层的标签取值空间,采用的目标领域标记数据集的命名实体标签设置第二训练模型的识别模块的BiLSTM-CRF模型的CRF层的标签设置。4. 
The named entity recognition model training method according to any one of claims 1 to 3, wherein the identification modules of the first training model and the second training model all include a BiLSTM-CRF model, wherein the source field label data is used Set the label value space of the CRF layer of the BiLSTM-CRF model of the recognition module in the first training model for the named entity label, and set the BiLSTM-CRF model of the recognition module of the second training model for the named entity label of the target domain labeling data set Label settings for the CRF layer.5.根据权利要求3所述的命名实体识别模型训练方法,其特征在于,所述第一训练模型还包括梯度反转层,对特征提取模块和领域区分模块进行对抗训练过程中,在正向传播时通过梯度反转层对第一训练模型的特征提取模块和领域区分模块执行标准随机梯度下降操作,并且在反向传播时,在将领域区分模块的损失函数返回到特征提取模块之前将梯度反转层的参数自动取反,以使特征提取模块提取源领域标记数据集和目标领域未标记数据集中单词的通用特征。5. The named entity recognition model training method according to claim 3, wherein the first training model also includes a gradient inversion layer, and during the confrontation training process of the feature extraction module and the domain distinction module, in the forward direction During propagation, standard stochastic gradient descent is performed on the feature extraction and domain discrimination modules of the first trained model through the gradient inversion layer, and during backpropagation, the gradient The parameters of the inversion layer are automatically reversed, so that the feature extraction module extracts the common features of words in the source domain labeled dataset and the target domain unlabeled dataset.6.根据权利要求5所述的命名实体识别模型训练方法,其特征在于,所述第一训练模型还包括自动编码模块,用第二数据集对自动编码模块进行训练,每轮训练后,根据自动编码模块的损失函数、识别模块的损失函数和领域区分模块的损失函数共同更新特征提取模块的参数。6. The named entity recognition model training method according to claim 5, wherein the first training model also includes an automatic coding module, and the automatic coding module is trained with the second data set, after each round of training, according to The loss function of the automatic encoding module, the loss function of the identification module and the loss function of the domain discrimination module jointly update the parameters of the feature extraction module.7.根据权利要求6所述的命名实体识别模型训练方法,其特征在于,所述自动编码模块包含编码器和解码器,7. the named entity recognition model training method according to claim 6, is characterized in that, described automatic encoding module comprises encoder and decoder,其中,每轮训练中,编码器获取用特征提取模块的BiLSTM模型提取到目标领域未标记数据集的单词的前向LSTM中最后一个LSTM和后向LSTM中最后一个LSTM的隐藏状态并组合为解码器的初始状态特征,并使用该初始状态特征和其前一个单词嵌入特征作为解码器的输入以训练自动编码模块提取目标领域的私有特征。Among them, in each round of training, the encoder obtains the hidden state of the last LSTM in the forward LSTM and the last LSTM in the backward LSTM of the words extracted by the BiLSTM model of the feature extraction module to the unlabeled data set in the target field and combines them into a decoding The initial state feature of the decoder, and use the initial state feature and its previous word embedding feature as the input of the decoder to train the auto-encoding module to extract the private features of the target domain.8.根据权利要求7所述的命名实体识别模型训练方法,其特征在于,在步骤A2中,按照以下方式对第一训练模型的特征提取模块的参数进行调整:8. The named entity recognition model training method according to claim 7, wherein, in step A2, the parameters of the feature extraction module of the first training model are adjusted in the following manner:
θf = θ'f − μ·(α·∂Ltask/∂θf + β·∂Ltarget/∂θf − ω·γ·∂Ltype/∂θf)
其中,θf表示本次调整后特征提取模块的参数,θ’f表示本次调整前特征提取模块的参数,μ表示学习率,Ltask表示识别模块的损失函数,Ltype表示领域区分模块的损失函数,Ltarget表示自动编码模块的损失函数,-ω表示梯度翻转参数,α、β、γ表示用户设置的权重。Among them, θf represents the parameters of the feature extraction module after this adjustment, θ'f represents the parameters of the feature extraction module before this adjustment, μ represents the learning rate, Ltask represents the loss function of the recognition module, and Ltype represents the domain differentiation module. Loss function, Ltarget represents the loss function of the auto-encoding module, -ω represents the gradient flip parameter, and α, β, γ represent the weights set by the user.9.根据权利要求8所述的命名实体识别模型训练方法,其特征在于,所述步骤A2还包括:在每轮训练后按照以下方式对第一训练模型的识别模块、领域区分模块和自动编码模块的参数进行调整:9. The named entity recognition model training method according to claim 8, characterized in that, said step A2 also includes: after each round of training, the recognition module, domain distinction module and automatic encoding of the first training model are performed in the following manner The parameters of the module are adjusted:识别模块对应的参数调整方式为:
The parameter adjustment method corresponding to the recognition module is:

θy = θ'y − μ·α·∂Ltask/∂θy

The parameter adjustment method corresponding to the domain distinction module is:

θd = θ'd − μ·γ·∂Ltype/∂θd

The parameter adjustment method corresponding to the automatic encoding module is:

θr = θ'r − μ·β·∂Ltarget/∂θr
其中,θy表示本次调整后识别模块的参数,θ’y表示本次调整前识别模块的参数,θd表示本次调整后领域区分模块的参数,θ’d表示本次调整前领域区分模块的参数,θr表示本次调整后自动编码模块的参数,θ’r表示本次调整前自动编码模块的参数。Among them, θy represents the parameters of the recognition module after this adjustment, θ'y represents the parameters of the recognition module before this adjustment, θd represents the parameters of the domain distinction module after this adjustment, and θ'd represents the domain distinction before this adjustment. The parameters of the module, θr represents the parameters of the automatic encoding module after this adjustment, and θ'r represents the parameters of the automatic encoding module before this adjustment.
10.一种基于如前述权利要求1至9任一项所述的命名实体识别模型训练方法训练得到的命名实体识别模型的命名实体识别方法,其特征在于,所述命名实体识别模型包括特征提取模块和识别模块,10. A named entity recognition method based on the named entity recognition model trained by the named entity recognition model training method as described in any one of claims 1 to 9, wherein the named entity recognition model includes feature extraction modules and identification modules,所述命名实体识别方法包括:The named entity recognition method includes:B1、通过命名实体识别模型的特征提取模块获取待识别文本的字符级别嵌入特征和单词嵌入特征并进行串联拼接,得到待识别文本中各单词的单词向量;B1. Obtain the character-level embedding features and word embedding features of the text to be recognized through the feature extraction module of the named entity recognition model and perform serial splicing to obtain the word vector of each word in the text to be recognized;B2、将单词向量的形式表示的待识别文本输入命名实体识别模型的识别模块,得到所述待识别文本的命名实体识别结果。B2. Input the text to be recognized in the form of word vectors into the recognition module of the named entity recognition model, and obtain the named entity recognition result of the text to be recognized.11.一种计算机可读存储介质,其特征在于,其上包含有计算机程序,所述计算机程序可被处理器执行以实现权利要求1至10中任一项所述方法的步骤。11. A computer-readable storage medium, characterized in that a computer program is contained thereon, and the computer program can be executed by a processor to implement the steps of the method according to any one of claims 1 to 10.12.一种电子设备,其特征在于,包括:12. An electronic device, characterized in that it comprises:一个或多个处理器;以及one or more processors; and存储器,其中存储器用于存储一个或多个可执行指令;memory, where the memory is used to store one or more executable instructions;所述一个或多个处理器被配置为经由执行所述一个或多个可执行指令以实现权利要求1至10中任一项所述方法的步骤。The one or more processors are configured to implement the steps of the method of any one of claims 1 to 10 by executing the one or more executable instructions.
CN202010541415.5A2020-06-152020-06-15 Named entity recognition model training method, named entity recognition method and mediumActiveCN111738003B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202010541415.5ACN111738003B (en)2020-06-152020-06-15 Named entity recognition model training method, named entity recognition method and medium

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202010541415.5ACN111738003B (en)2020-06-152020-06-15 Named entity recognition model training method, named entity recognition method and medium

Publications (2)

Publication NumberPublication Date
CN111738003A CN111738003A (en)2020-10-02
CN111738003Btrue CN111738003B (en)2023-06-06

Family

ID=72649103

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202010541415.5AActiveCN111738003B (en)2020-06-152020-06-15 Named entity recognition model training method, named entity recognition method and medium

Country Status (1)

CountryLink
CN (1)CN111738003B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN112215007B (en)*2020-10-222022-09-23上海交通大学 Method and System for Normalization of Institution Named Entity Based on LEAM Model
CN114548103B (en)*2020-11-252024-03-29马上消费金融股份有限公司 A training method for a named entity recognition model and a method for recognizing named entities
CN112597774B (en)*2020-12-142023-06-23山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment
CN113515942A (en)*2020-12-242021-10-19腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN112906857B (en)*2021-01-212024-03-19商汤国际私人有限公司Network training method and device, electronic equipment and storage medium
CN112926665A (en)*2021-03-022021-06-08安徽七天教育科技有限公司Text line recognition system based on domain self-adaptation and use method
CN113761924B (en)*2021-04-192025-01-28腾讯科技(深圳)有限公司 A training method, device, equipment and storage medium for a named entity model
CN113221575B (en)*2021-05-282022-08-02北京理工大学PU reinforcement learning remote supervision named entity identification method
CN113723102B (en)*2021-06-302024-04-26平安国际智慧城市科技股份有限公司Named entity recognition method, named entity recognition device, electronic equipment and storage medium
CN113516196B (en)*2021-07-202024-04-12云知声智能科技股份有限公司Named entity recognition data enhancement method, named entity recognition data enhancement device, electronic equipment and named entity recognition data enhancement medium
CN113887227B (en)*2021-09-152023-05-02北京三快在线科技有限公司Model training and entity identification method and device
CN114580415B (en)*2022-02-252024-03-22华南理工大学 A cross-domain graph matching entity recognition method for educational examinations
CN114626380B (en)*2022-03-252025-09-19北京明略昭辉科技有限公司Entity identification method and device, electronic equipment and storage medium
CN114548109B (en)*2022-04-242022-09-23阿里巴巴达摩院(杭州)科技有限公司Named entity recognition model training method and named entity recognition method
CN114925697A (en)*2022-05-242022-08-19山东理工大学Inspection and detection service method based on semantic analysis
CN115048933A (en)*2022-06-072022-09-13东南大学Semi-supervised named entity identification method for insufficiently marked data
CN116720519B (en)*2023-06-082023-12-19吉首大学Seedling medicine named entity identification method
CN118446218B (en)*2024-05-162024-11-01西南交通大学Method for identifying nested named entities by antagonism type reading understanding

Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN108664589A (en)*2018-05-082018-10-16苏州大学Text message extracting method, device, system and medium based on domain-adaptive
CN109918644A (en)*2019-01-262019-06-21华南理工大学 A Transfer Learning-Based Named Entity Recognition Method for TCM Health Consultation Text
CN110765775A (en)*2019-11-012020-02-07北京邮电大学 A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences
CN111222339A (en)*2020-01-132020-06-02华南理工大学 A Named Entity Recognition Method for Medical Consultation Based on Adversarial Multi-Task Learning
CN111241837A (en)*2020-01-042020-06-05大连理工大学Theft case legal document named entity identification method based on anti-migration learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US9971763B2 (en)*2014-04-082018-05-15Microsoft Technology Licensing, LlcNamed entity recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN108664589A (en)*2018-05-082018-10-16苏州大学Text message extracting method, device, system and medium based on domain-adaptive
CN109918644A (en)*2019-01-262019-06-21华南理工大学 A Transfer Learning-Based Named Entity Recognition Method for TCM Health Consultation Text
CN110765775A (en)*2019-11-012020-02-07北京邮电大学 A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences
CN111241837A (en)*2020-01-042020-06-05大连理工大学Theft case legal document named entity identification method based on anti-migration learning
CN111222339A (en)*2020-01-132020-06-02华南理工大学 A Named Entity Recognition Method for Medical Consultation Based on Adversarial Multi-Task Learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
迁移学习在命名实体识别中的应用;盛剑;《中国优秀硕士学位论文全文数据库信息科技辑》;第I138-2350页*

Also Published As

Publication numberPublication date
CN111738003A (en)2020-10-02

Similar Documents

PublicationPublication DateTitle
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
US12271701B2 (en)Method and apparatus for training text classification model
CN111783462B (en) Chinese Named Entity Recognition Model and Method Based on Double Neural Network Fusion
Logeswaran et al.Sentence ordering and coherence modeling using recurrent neural networks
CN110598713B (en) Intelligent image automatic description method based on deep neural network
CN108733792B (en) An Entity Relationship Extraction Method
CN111061843B (en) A Fake News Detection Method Guided by Knowledge Graph
CN108984526B (en) A deep learning-based document topic vector extraction method
CN116127953B (en)Chinese spelling error correction method, device and medium based on contrast learning
CN109992780B (en)Specific target emotion classification method based on deep neural network
CN113591483A (en)Document-level event argument extraction method based on sequence labeling
CN110196978A (en)A kind of entity relation extraction method for paying close attention to conjunctive word
CN112699216A (en)End-to-end language model pre-training method, system, device and storage medium
CN114064918A (en)Multi-modal event knowledge graph construction method
CN107992597A (en)A kind of text structure method towards electric network fault case
CN110765775A (en) A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences
CN111125367B (en)Multi-character relation extraction method based on multi-level attention mechanism
CN113051886B (en)Test question duplicate checking method, device, storage medium and equipment
CN116049406A (en)Cross-domain emotion classification method based on contrast learning
CN114818717A (en)Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114881038B (en)Chinese entity and relation extraction method and device based on span and attention mechanism
CN114781380A (en)Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN114428850A (en)Text retrieval matching method and system
CN116720519B (en)Seedling medicine named entity identification method
CN115510230A (en)Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination
GR01Patent grant
GR01Patent grant
