



TECHNICAL FIELD
The invention belongs to the field of named entity recognition, and in particular relates to a Chinese entity recognition method based on BERT and Word2Vec vector fusion.
BACKGROUND
Named entity recognition (NER) is the task of identifying and classifying entity mentions of specified types in text; common entity types include person names, place names, and organization names. With the rapid growth of web data, named entity recognition provides strong support for data mining, and it is also an important component of tasks such as information retrieval, question answering, and knowledge graph construction. Commonly used NER methods fall into three main categories: rule- and dictionary-based methods, statistical machine learning methods, and deep learning methods.
Rule- and dictionary-based methods rely on linguistic experts to manually design rule templates and to select features that describe predefined entity types, including statistical information, keywords, indicator words, positional words, and punctuation. Combined with domain dictionaries, entities are then recognized by matching strings against the rule templates.
Statistical machine learning methods treat named entity recognition as a sequence labeling task. These methods do not require experts with deep linguistic knowledge to select and design features; ordinary researchers can choose feature sets that effectively reflect the characteristics of the target entities, including word features, context features, part-of-speech features, and semantic features. Models are usually trained on manually annotated corpora; commonly used models include the hidden Markov model, the maximum entropy model, support vector machines, and conditional random fields.
Deep learning methods support end-to-end model training and avoid manual feature selection and design. With the application of artificial neural networks to word embedding, unsupervised pre-training on large amounts of unlabeled corpora yields low-dimensional, dense word vectors that better capture word meaning; commonly used word vector training models include Word2Vec and GloVe. Deep learning models commonly used for feature extraction include convolutional neural networks and recurrent neural networks, among which the bidirectional long short-term memory (Bi-LSTM) network is the most classic and one of the most effective, while label decoding generally adopts the conditional random field (CRF) model.
Pre-trained language model methods perform unsupervised pre-training of a language model on massive text. The most commonly used pre-trained language model is BERT (Bidirectional Encoder Representations from Transformers); entity recognition is then performed by fine-tuning the parameters of the pre-trained model on an entity recognition dataset.
However, the above techniques have the following drawbacks:
Rule- and dictionary-based methods are strongly domain-specific, and a finite set of rules cannot cover all linguistic phenomena, so they lack robustness and portability.
Statistical machine learning methods require manual selection and combination of features, and since human language use is highly variable, purely statistical methods produce an enormous state search space, resulting in poor entity recognition performance.
Deep learning methods use models such as Word2Vec to obtain fixed, static word vectors to represent word semantics, which cannot resolve polysemy; moreover, word segmentation errors propagate downstream and degrade entity recognition performance.
Methods based on fine-tuning the BERT pre-trained language model usually involve a huge number of model parameters, so both training and prediction take a long time, and training and deployment place high demands on hardware.
SUMMARY OF THE INVENTION
The purpose of the present invention is to solve the above problems in the prior art by providing a Chinese entity recognition scheme based on BERT and Word2Vec vector fusion, which improves the efficiency of model training and prediction while preserving entity recognition performance.
To achieve the above purpose, the technical solution adopted by the present invention is as follows: the BERT model is used to obtain dynamic character vectors containing context information, and the Word2Vec model is used to obtain static word vectors; the candidate word vectors are then fused by one of two word-vector fusion strategies; finally, the character vector and the fused word vector are concatenated as the input vector of the subsequent model, and the classic Bi-LSTM-CRF model is used for feature encoding and label decoding.
The Chinese entity recognition method based on BERT and Word2Vec vector fusion specifically includes the following steps:
Step 1: Obtain a massive Chinese text corpus, segment the text into words with the jieba module in Python, train a Word2Vec model, and obtain a static word-vector table.
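As an illustration of this step, a minimal sketch using jieba and the gensim implementation of Word2Vec; gensim is an assumption, since the patent names only jieba and Word2Vec, not a specific training library:

```python
import jieba
from gensim.models import Word2Vec

# Segment each raw sentence into words with jieba.
raw_sentences = ["广州市长隆公园今天对外开放。", "命名实体识别是信息抽取的基础任务。"]
tokenized = [list(jieba.cut(s)) for s in raw_sentences]

# Train a Skip-gram Word2Vec model (sg=1) on the segmented corpus.
model = Word2Vec(sentences=tokenized, vector_size=100, window=5,
                 min_count=1, sg=1)

# The resulting static word-vector table maps each word to a d-dimensional vector.
word_vector_table = {w: model.wv[w] for w in model.wv.index_to_key}
```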
Step 2: Pre-train the BERT model. The Chinese text is constructed into the input format required by BERT, in the following sub-steps:
2.1 In the original corpus, sentences are separated by line breaks and contexts are separated by blank lines;
2.2 Construct the samples required for BERT's next sentence prediction pre-training task, where a positive sample consists of two consecutive sentences that stand in a contextual relationship, and a negative sample consists of two randomly selected sentences with no semantic relationship;
2.3 For sentences exceeding the set maximum length, randomly truncate from either the beginning or the end of the sentence;
2.4 Join the two input sentences with a [SEP] tag, add a [CLS] tag at the beginning of the whole sequence and a [SEP] tag at its end; if the sequence is shorter than the maximum length, pad it with [PAD] tags;
2.5 Construct the samples required for BERT's masked language model pre-training task: randomly select 15% of the characters in a sentence for masking; each selected character is replaced by [MASK] 80% of the time, replaced by a randomly chosen character 10% of the time, and left unchanged 10% of the time, as in the sketch below.
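A minimal sketch of the 15%/80%/10%/10% masking rule of step 2.5; the function name and its arguments are hypothetical helpers, not part of the invention:

```python
import random

def mask_characters(chars, vocab, mask_rate=0.15):
    """BERT-style masking: 15% of characters are selected; of those,
    80% become [MASK], 10% a random character, 10% stay unchanged."""
    masked, labels = list(chars), [None] * len(chars)
    for i, ch in enumerate(chars):
        if random.random() >= mask_rate:
            continue
        labels[i] = ch  # the model must predict the original character
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # random replacement
        # else: keep the original character unchanged
    return masked, labels
```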
Step 3: Train the BERT model on the above two pre-training tasks. The training objectives are, respectively, to predict whether the input sentence pair stands in a contextual relationship and to predict the original content of the masked characters; a pre-trained BERT model is finally obtained.
Step 4: Acquire, preprocess, and annotate a Chinese named entity recognition dataset. The BIO scheme is generally adopted for annotation, where B marks the first character of an entity, I marks the middle and final characters of an entity, and O marks non-entity characters, as in the example below.
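For illustration, one possible BIO annotation of the sentence "广州市长隆公园" used later in this description, assuming it is annotated as two place-name (LOC) entities, "广州市" and "长隆公园":

```
广 B-LOC   州 I-LOC   市 I-LOC   长 B-LOC   隆 I-LOC   公 I-LOC   园 I-LOC
```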
Step 5: Preprocess the dataset obtained in step 4 by adding a [CLS] tag at the beginning and a [SEP] tag at the end of every sentence, then feed the processed sentences into the pre-trained BERT model obtained in step 3; BERT finally outputs a character vector for every character of the sentence.
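A minimal sketch of this feature-extraction step, assuming the HuggingFace transformers library and the public bert-base-chinese checkpoint (the patent allows any pre-trained Chinese BERT):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()  # feature extraction only: BERT parameters stay frozen

sentence = "广州市长隆公园"
# The tokenizer adds the [CLS] and [SEP] tags automatically.
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# One contextual character vector per token; shape (1, seq_len, hidden_size).
char_vectors = outputs.last_hidden_state
```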
Step 6: For each sentence in the dataset obtained in step 4, obtain all candidate words contained in the sentence by matching against the vocabulary, look up the static word-vector table obtained in step 1 to get the word vector of each candidate word, and fuse the word vectors of the candidate words matched to each character using one of two fusion strategies, so as to represent the word-level semantics of each character. The two word-vector fusion strategies are as follows (a code sketch of both follows the list):
6.1 Word-vector fusion strategy 1: sum and average the candidate word vectors of each character. Taking the sentence "广州市长隆公园" (Guangzhou Changlong Park) as an example, the character "广" is covered by two candidate words, "广州" and "广州市"; the word-vector table is first queried for the vectors of these two words, and their average is taken as the word-vector part of the representation of "广".
6.2 Word-vector fusion strategy 2: compute a weighted sum of each character's candidate word vectors, with word frequency as the weight. Using the same example, first count the total number of occurrences of "广州" and "广州市" in the dataset; then divide each word's count by the combined count of the two words to obtain its weight (for instance, if "广州" occurs 30 times and "广州市" 10 times, the weights are 0.75 and 0.25); finally, multiply each weight by the corresponding word vector and sum the results as the word-vector part of "广". The other characters are handled in the same way. When a character matches no word, the word vector of a [None] tag, of the same dimension as the other word vectors, is used as its word-vector part.
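A minimal sketch of both fusion strategies; numpy is an assumption, and word_vector_table, word_counts, and none_vector stand for the structures described in steps 1 and 6:

```python
import numpy as np

def fuse_mean(candidates, word_vector_table, none_vector):
    """Strategy 1: average the candidate word vectors; fall back to [None]."""
    if not candidates:
        return none_vector
    vecs = [word_vector_table[w] for w in candidates]
    return np.mean(vecs, axis=0)

def fuse_freq_weighted(candidates, word_vector_table, word_counts, none_vector):
    """Strategy 2: word-frequency-weighted sum of the candidate word vectors."""
    if not candidates:
        return none_vector
    total = sum(word_counts[w] for w in candidates)
    return sum((word_counts[w] / total) * word_vector_table[w]
               for w in candidates)

# Example for the character "广" in "广州市长隆公园":
# fuse_freq_weighted(["广州", "广州市"], word_vector_table, word_counts, none_vector)
```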
Step 7: Concatenate the word vector of each character obtained in step 6 with the character vector of the same character obtained in step 5 to obtain the final vector of each character.
Step 8: Input the character vectors obtained in step 7 into the Bi-LSTM-CRF model for training and prediction to obtain the entity recognition results, as in the model sketch below.
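A minimal PyTorch sketch of the downstream Bi-LSTM-CRF model; PyTorch and the pytorch-crf package are assumptions, as the patent does not prescribe an implementation:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_tags):
        super().__init__()
        # A bidirectional LSTM encodes the concatenated character+word vectors.
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2, bidirectional=True,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_tags)    # emission scores P
        self.crf = CRF(num_tags, batch_first=True)   # transition matrix W

    def loss(self, x, tags, mask):
        emissions = self.fc(self.lstm(x)[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, x, mask):
        emissions = self.fc(self.lstm(x)[0])
        return self.crf.decode(emissions, mask=mask)  # best BIO tag sequence
```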
The beneficial effects of the present invention are:
1. To address the weak expressive power of traditional word vectors, the present invention proposes using a pre-trained BERT model to obtain dynamic character vectors containing context information, which enriches the semantic representation of characters and solves the polysemy problem;
2. To solve the word segmentation errors that arise when traditional word vectors are used, and to better introduce word and entity boundary information, a word-vector fusion strategy is proposed, and word frequency information is introduced so that more likely word vectors receive higher weights, reducing the impact of wrong segmentation;
3. By concatenating word vectors with character vectors, characters and words are fused, which enriches the feature representation of the initial vector and improves the precision and recall of entity recognition;
4. The present invention improves the representation of the input vector without modifying the structure of the feature-encoding model, so it is also applicable to other feature-encoding models and is not limited to the Bi-LSTM model, giving it strong flexibility;
5. To reduce model training time, the pre-trained model is not fine-tuned; instead, character vectors are obtained by feature extraction, which greatly reduces the number of trainable parameters and improves training efficiency.
DESCRIPTION OF DRAWINGS
Fig. 1 is a schematic flow chart of the Chinese entity recognition method based on BERT and Word2Vec vector fusion of the present invention;
Fig. 2 is a schematic diagram of the overall structure of the Chinese entity recognition model based on BERT and Word2Vec vector fusion according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the BERT pre-trained language model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the Skip-gram model in Word2Vec according to an embodiment of the present invention.
DETAILED DESCRIPTION
To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
As shown in Fig. 1, the Chinese entity recognition method based on BERT and Word2Vec vector fusion of the present invention specifically includes the following steps:
Step 1: Obtain the training corpus for the Word2Vec model and preprocess it.
Step 2: Train the Skip-gram model of Word2Vec on the corpus preprocessed in step 1. As shown in Fig. 4, the model predicts the context words within a window of specified size from the input center word. The weight matrix of the mapping layer obtained after training is the word-vector table W ∈ R^(|V|×d), where |V| is the vocabulary size and d is the word-vector dimension.
Step 3: Obtain the word vector of each word by looking up the static word-vector table trained in step 2:

e_w(w_i) = v_i W

where v_i is a one-hot vector of length |V| whose entry at the corresponding position is 1 and whose other entries are 0.
Step 4: Pre-train the BERT language model on the corpus preprocessed in step 1, or directly download another already pre-trained Chinese BERT model.
Step 5: Input the entity recognition dataset into the BERT model to obtain context-dependent character vectors e_c(c_i) ∈ R^l, where c_i denotes each character in the sentence and l is the dimension of the character vector.
Step 6: Match the input sentence against the pre-trained vocabulary to obtain the candidate word vectors e_w of each character, as shown in Fig. 2, and then fuse the candidate word vectors with one of the word-vector fusion strategies. Strategy 1 is sum-and-average, computed as follows:

e_w(c) = (1/N) Σ_{w∈S} e_w(w)   if S ≠ ∅;   e_w(c) = e_w(None)   if S = ∅

where e_w(w) is the word vector of word w, S is the set of candidate words matched to the character, N is the number of words in the set, and e_w(None) is the word vector of the [None] tag, used when the set is empty, i.e., when the character matches no word.
Strategy 2 is frequency-weighted summation, computed as follows:

e_w(c) = Σ_{w∈S} ( z(w) / Σ_{w′∈S} z(w′) ) · e_w(w)

where z(w) is the frequency of word w, obtained by counting the occurrences of each word in the training and test sets, and the other symbols are as above.
The fused word vector is concatenated with the character vector output by BERT to obtain the final vector representation of each character:

x_i = e_c(c_i) ⊕ e_w(c_i)

where ⊕ denotes vector concatenation.
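A one-line sketch of this concatenation in PyTorch; the tensor names and shapes are placeholders for the outputs of the earlier steps:

```python
import torch

# char_vectors: (batch, seq_len, l) from BERT; word_vectors: (batch, seq_len, d)
# from the fusion strategies. Concatenate along the feature dimension:
# x_i = e_c(c_i) ⊕ e_w(c_i).
char_vectors = torch.randn(1, 7, 768)   # placeholder shapes for illustration
word_vectors = torch.randn(1, 7, 100)
final_vectors = torch.cat([char_vectors, word_vectors], dim=-1)  # (1, 7, 868)
```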
Step 7: Input the vector of each character of the sentence into the LSTM model to learn long-range dependencies within the sentence. The LSTM controls and maintains the flow of information through an input gate, a forget gate, and an output gate; its parameterization is as follows:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, ⊙ denotes element-wise multiplication, W and U are the weight matrices of the corresponding gates, b is the bias, x_t is the input vector at the current time step obtained in step 6, and h_{t-1} and c_{t-1} are the output and the cell state of the previous time step, respectively.
Step 8: As shown in Fig. 2, the Bi-LSTM consists of a forward pass and a backward pass and can therefore encode bidirectional language information. For the input sentence vector sequence S = {e_1, e_2, …, e_n}, e_i ∈ R^(1×(d+l)), where 1 ≤ i ≤ n and d and l are the dimensions of the word vector and the character vector respectively, the forward pass is:

h_t^{fwd} = LSTM(h_{t-1}^{fwd}, e_t)

and the backward pass is:

h_t^{bwd} = LSTM(h_{t+1}^{bwd}, e_t)

where h_{t-1}^{fwd} is the forward hidden state at time t−1, h_{t+1}^{bwd} is the backward hidden state at time t+1, and e_t is the input vector at time t.
Step 9: Finally, the outputs of the forward and backward LSTMs are concatenated to obtain the hidden state h_t at time t:

h_t = [h_t^{fwd} ⊕ h_t^{bwd}]
Step 10: On top of the Bi-LSTM output, the CRF layer takes the transition information between labels into account and can obtain the globally optimal label sequence. The computation is as follows:

s(x, y) = Σ_{i=0}^{n} W_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where s is the evaluation score, W is the transition matrix between labels, and P is the score of the corresponding label. From the evaluation score, the probability of the label sequence y for the input sequence x is:

p(y|x) = exp(s(x, y)) / Σ_{y′∈Y_x} exp(s(x, y′))

where Y_x is the set of all possible label sequences for x.
Step 11: The training loss function is the negative log-likelihood:

Loss = −log p(y|x) = −s(x, y) + log Σ_{y′∈Y_x} exp(s(x, y′))
At this point, the procedure of the specific embodiment is complete.
Step 12: When training the parameters of the Bi-LSTM-CRF model based on BERT and Word2Vec vector fusion, the present invention takes the annotated text and labels as input and trains the model by gradient descent or another optimization method. During training, only the parameters of the Bi-LSTM layer and the CRF layer are updated, while the BERT model parameters remain unchanged; training is terminated when the loss produced by the model meets the set requirement or the maximum number of iterations is reached.
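A hedged sketch of the training setup of step 12, reusing the bert and BiLSTMCRF objects from the earlier sketches; train_loader stands for a hypothetical DataLoader over the annotated batches, and the dimensions are example values:

```python
import torch

# Freeze BERT (feature extraction only): its parameters receive no gradients
# and are excluded from the optimizer, which is what keeps training cheap.
for p in bert.parameters():
    p.requires_grad = False

# 768 (BERT-base hidden size) + 100 (Word2Vec dimension) are example values.
model = BiLSTMCRF(input_dim=768 + 100, hidden_dim=256, num_tags=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

max_epochs = 10  # or stop earlier once the loss meets the set requirement
for epoch in range(max_epochs):
    for x, tags, mask in train_loader:  # hypothetical DataLoader of batches
        optimizer.zero_grad()
        loss = model.loss(x, tags, mask)
        loss.backward()
        optimizer.step()
```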
The above description shows and describes the preferred embodiments of the present invention. As stated above, it should be understood that the present invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; rather, it can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described herein through the above teachings or through skill or knowledge in the relevant fields. All modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall fall within the protection scope of the appended claims of the present invention.