Disclosure of Invention
The invention aims to provide a method for recognizing text entities in four-risk one-fund policies and regulations.
The purpose of the invention is achieved by the following technical scheme. The method comprises the following steps:
Step 1: inputting the text to be recognized; constructing a four-risk one-fund-domain entity segmentation and labelling dictionary, and pre-training the language model BERT based on this dictionary;
Step 2: performing word segmentation on the text to be recognized;
Step 3: taking part of the segmented text to construct a training set, the remainder forming a test set; labelling the segmented text in the training set according to the four-risk one-fund-domain entity segmentation and labelling dictionary;
Step 4: splitting the labelled words in the training set into single Chinese characters, and performing further BIO entity-boundary labelling according to each word's entity category and the position at which each character appears in the entity;
Step 5: inputting the labelled training set into the pre-trained language model BERT to obtain, for each character, a character vector Wi_charbert with context semantic information, dynamically generated from context features;
Step 6: inputting the labelled training set into the skip-gram model of word2vec for training to obtain a word vector for each word;
Step 7: fusing the character vector Wi_charbert, which carries context semantic information, with the word vector Wi_word of the word containing the character by dimension splicing, obtaining the fused character-word combination vector Wi;
Step 8: inputting the combination vectors Wi of the training set into a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF) model for training, to obtain an entity recognition and classification model;
Specifically, the combination vector Wi is first fed as the input vector into the Bi-LSTM to capture effective context information; the CRF model then serves as the decoder, obtaining the optimal tag sequence of the characters through transition probabilities, thereby assigning each entity a category label and realizing entity recognition and classification;
Step 9: inputting the test set into the trained entity recognition and classification model to obtain the entity recognition result for the text to be recognized; the result comprises each entity, its start position, its end position, and its category label.
The invention has the beneficial effects that:
The character vectors with context semantic information, dynamically generated for each character from context features, are obtained through the pre-trained language model BERT; the word vector of each word is obtained through the skip-gram model of word2vec; and the character vector and the word vector of the containing word are fused by dimension splicing to obtain a combined character-word vector. The method can effectively alleviate the problems of insufficient labelled data and low recognition accuracy in the four-risk one-fund-domain named entity recognition task.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention relates to a method for recognizing text entities in four-risk one-fund policies and regulations, which automatically identifies named entities with domain characteristics from the text of such policies and regulations, in particular named entities related to the four-risk one-fund domain in policy and regulation texts issued from the central government down to local governments.
Named entity recognition in the four-risk one-fund field currently faces the following problems. First, unlike the general domain, entities in four-risk one-fund policy and regulation texts are specialized: they contain a large number of proprietary domain terms that do not necessarily appear in common lexicons, and many multi-word combinations occur. Second, the field lacks a public large-scale labelled data set.
To address these problems, the invention constructs a four-risk one-fund domain dictionary using a rule-based part-of-speech collocation method, and labels the selected original texts with this domain dictionary. This not only reduces a large amount of labour cost but also facilitates subsequent preprocessing work such as rapidly expanding the training data and performing word segmentation and labelling on the original texts. BERT pre-training serves as the feature layer for character vectors; word features are extracted from the segmented four-risk one-fund policy and regulation texts through a Word2Vec model; and the trained character and word vectors are spliced to obtain combined character-word vectors. This compensates both for the insufficient features of a small number of labelled samples and for the insufficient extraction of character semantics, while supplementing the character vectors with word-level phrase information. Finally, the combined vectors are trained with a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF) to obtain the four-risk one-fund-domain entity recognition model.
Addressing the overlong entities and low recognition accuracy caused by word nesting in this field, the invention provides an entity recognition method based on the pre-trained language model BERT. The BERT model enhances the semantic representation of characters in policies and regulations by dynamically generating character vectors from each character's context. Considering also that single Chinese characters are not the most basic unit of Chinese semantics, the generated dynamic character vectors are spliced with the word vectors of the containing words to form combination vectors, which serve as the input of the Bi-LSTM-CRF model: the Bi-LSTM layer encodes, the CRF layer decodes, and the entity recognition result is finally labelled.
A method for recognizing text entities in four-risk one-fund policies and regulations comprises the following steps:
Step 1: inputting the text to be recognized; constructing a four-risk one-fund-domain entity segmentation and labelling dictionary, and pre-training the language model BERT based on this dictionary;
Step 2: performing word segmentation on the text to be recognized;
Step 3: taking part of the segmented text to construct a training set, the remainder forming a test set; labelling the segmented text in the training set according to the four-risk one-fund-domain entity segmentation and labelling dictionary;
Step 4: splitting the labelled words in the training set into single Chinese characters, and performing further BIO entity-boundary labelling according to each word's entity category and the position at which each character appears in the entity;
Step 5: inputting the labelled training set into the pre-trained language model BERT to obtain, for each character, a character vector Wi_charbert with context semantic information, dynamically generated from context features;
Step 6: inputting the labelled training set into the skip-gram model of word2vec for training to obtain a word vector for each word;
Step 7: fusing the character vector Wi_charbert, which carries context semantic information, with the word vector Wi_word of the word containing the character by dimension splicing, obtaining the fused character-word combination vector Wi;
Step 8: inputting the combination vectors Wi of the training set into a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF) model for training, to obtain an entity recognition and classification model;
Specifically, the combination vector Wi is first fed as the input vector into the Bi-LSTM to capture effective context information; the CRF model then serves as the decoder, obtaining the optimal tag sequence of the characters through transition probabilities, thereby assigning each entity a category label and realizing entity recognition and classification;
Step 9: inputting the test set into the trained entity recognition and classification model to obtain the entity recognition result for the text to be recognized; the result comprises each entity, its start position, its end position, and its category label.
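The format of the step-9 output can be sketched as follows (a minimal illustration, not the invention's implementation; the function name is an assumption, and the tag scheme follows the BIO labels of step 4, with inclusive end positions):

```python
def decode_bio(chars, tags):
    """Recover (entity, start, end, type) spans from character-level
    BIO tags, matching the fields of the step-9 recognition result."""
    entities, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, start = tags[i][2:], i
            j = i + 1
            while j < len(tags) and tags[j] == "I-" + etype:
                j += 1
            # end is the index of the entity's last character (inclusive)
            entities.append(("".join(chars[start:j]), start, j - 1, etype))
            i = j
        else:
            i += 1
    return entities

# toy example: "基本养老保险费" tagged as a PRO domain term, followed by an O character
chars = list("基本养老保险费。")
tags = ["B-PRO"] + ["I-PRO"] * 6 + ["O"]
spans = decode_bio(chars, tags)
```

The same decoding applies at test time to turn the CRF's tag sequence into the (word, start, end, type) entries returned to the user.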
Example 1:
Because the four-risk one-fund policy and regulation texts are obtained through a web crawler, they may contain HTML tags, garbled characters, and table symbols. The original texts are therefore uniformly encoded in the UTF-8 format, and garbled fields such as stray spaces are removed with regular expressions. Word segmentation and part-of-speech tagging are then performed on the preprocessed texts.
Domain terms can be divided into word-type and phrase-type domain concepts by the way they form words. A word-type domain concept consists of a single word; it cannot be segmented further and is the smallest independent word unit. A phrase-type domain concept consists of two or more words, which need not themselves be word-type domain concepts. After word segmentation of the corpus, statistics show that four-risk one-fund-domain terms are mostly concentrated in binary, ternary, and quaternary phrases; N-grams are therefore counted to select phrases with high occurrence frequency, and phrases that do not conform to the rules are removed by analysing the characteristics of the domain phrases, formulating a rule table, and manually screening by part of speech. The original policy and regulation texts are segmented with the constructed dictionary via the Jieba segmenter and a user dictionary under the maximum-matching principle, and preprocessing such as automatic entity-category labelling is performed on the segmented texts.
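The N-gram counting described above can be sketched with the standard library alone (a minimal illustration; the function names, the frequency threshold, and the toy token list are assumptions, and the subsequent rule-table and manual screening steps are omitted):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams over a segmented token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def frequent_phrases(tokens, ns=(2, 3, 4), min_freq=2):
    """Collect binary/ternary/quaternary phrases whose frequency meets a
    threshold, as candidates for the domain dictionary."""
    phrases = {}
    for n in ns:
        for gram, freq in ngram_counts(tokens, n).items():
            if freq >= min_freq:
                phrases["".join(gram)] = freq
    return phrases

# toy segmented corpus containing the repeated term 基本养老保险
tokens = ["基本", "养老", "保险", "费", "基本", "养老", "保险"]
cands = frequent_phrases(tokens, ns=(2, 3), min_freq=2)
```

Candidates surviving the frequency cut would then pass through the part-of-speech rule table and manual screening before entering the user dictionary.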
The corpus is built by crawling several knowledge sources, including four-risk one-fund judicial cases, central and local laws and regulations, and encyclopedia entries related to the four-risk one-fund domain. The laws and regulations come mainly from PKULaw (Beida Fabao), and the encyclopedia entries mainly from Baidu Baike. A domain term concept set is obtained from the policy and regulation texts based on rule-driven part-of-speech collocation and partial manual help. (Although current Chinese word segmentation tools achieve high accuracy, their fine segmentation granularity handles some domain concepts poorly; for example, "basic endowment insurance fee" is split into two words rather than treated as a single term entity, so part of the semantic information is lost.) Besides domain technical terms, the invention also manually defines and classifies the common domain entities appearing in the policies and regulations. Finally, five categories of four-risk one-fund-domain entities (domain terms, place names, organization names, person names, and regulation names) are summarized for category labelling, so as to construct the four-risk one-fund-domain entity segmentation and labelling dictionary.
The original policy and regulation texts are segmented and given category labels using the constructed dictionary and the Jieba segmentation tool. The corpus used by the invention comprises judicial cases published by relevant departments in the four-risk one-fund field (endowment insurance, work-injury insurance, medical insurance, unemployment insurance, and housing provident fund), together with central and local laws and regulations, totalling 25554 documents: 7704 on endowment insurance, 1357 on unemployment insurance, 1946 on work-injury insurance, 7749+996=8745 on maternity/medical insurance, and 2969 on the housing provident fund. From this corpus, 1000 documents are extracted in proportion to the insurance types. Words labelled with categories are split into single Chinese characters, and further BIO entity-boundary labelling is carried out according to the entity category and each character's position within the entity; for example, "基本养老保险费" (basic endowment insurance fee) is labelled as {基 B-PRO} {本 I-PRO} {养 I-PRO} {老 I-PRO} {保 I-PRO} {险 I-PRO} {费 I-PRO} (PRO is the domain-term entity tag). Of the 1000 labelled policy and regulation documents, 70% serve as the training set, 20% as the validation set, and 10% as the test set.
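The character splitting and BIO boundary labelling of this step can be sketched as follows (a minimal illustration; the helper name is an assumption, and PRO follows the domain-term tag used in the example above):

```python
def bio_tag(word, entity_type=None):
    """Split a dictionary-labelled word into single Chinese characters with
    BIO boundary tags; characters of non-entity words are tagged O."""
    if entity_type is None:
        return [(ch, "O") for ch in word]
    return [(word[0], "B-" + entity_type)] + \
           [(ch, "I-" + entity_type) for ch in word[1:]]

# "基本养老保险费" as a domain term (PRO), matching the example in the text
tagged = bio_tag("基本养老保险费", "PRO")
```

Applying this to every labelled word in a document yields the character-level training sequences consumed by the models below.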
The character-level features are initialized by the pre-trained BERT language model from the input text information, and the obtained character vectors are recorded as a sequence X = (x1, x2, x3, ..., xn). Using context semantic information in this way overcomes the limitation of traditional static character vectors, which cannot yield different feature vectors for the same character in different contexts, and extracts the semantic features in the text more effectively.
For word-level feature extraction and representation, word features are extracted from the segmented four-risk one-fund policy and regulation texts through a Word2Vec model and trained into word vector representations.
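Skip-gram training predicts the surrounding words of each centre word within a context window; the generation of training pairs can be sketched as follows (a stdlib illustration of the training pairs only, not the full Word2Vec model; the function name and window size are assumptions):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (centre, context) pairs as consumed by skip-gram training."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

# toy segmented sentence
pairs = skipgram_pairs(["缴纳", "基本", "养老", "保险费"], window=1)
```

In practice a library such as gensim trains the skip-gram objective over these pairs to produce the word vectors Wi_word.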
The word vectors and the character vectors obtained by the BERT model are then fused by dimension splicing.
The combined character-word feature vectors are trained with a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF) model for entity recognition and classification, finally yielding a model capable of entity recognition on four-risk one-fund policy and regulation texts. The obtained model is evaluated and tested on its F1 value and applied to the construction of a four-risk one-fund-domain knowledge graph.
Character vectors: the training corpus is trained with the skip-gram model of word2vec to obtain the static character vector Wi_char; the corpus is also input into the pre-trained language model BERT to obtain the character vector Wi_charbert with context semantic information.
Word vectors: for the word vector Wi_word, the Chinese text is first segmented with jieba, and the segmented corpus is then trained with the skip-gram model.
The character vector with context semantic information obtained from the BERT pre-trained language model and the word vector of the word containing the character are fused by dimension splicing, finally obtaining a joint character-word representation whose dimension is the sum of the character-vector and word-vector dimensions, i.e. the fused combination vector.
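The dimension splicing can be sketched with NumPy as follows (a minimal illustration; the 768-dimension character vector matches the BERT-Base hidden size used later, while the 100-dimension word vector is an assumed skip-gram size):

```python
import numpy as np

def fuse(char_vec, word_vec):
    """Dimension splicing: the fused vector Wi has dimension
    dim(Wi_charbert) + dim(Wi_word)."""
    return np.concatenate([char_vec, word_vec])

char_vec = np.zeros(768, dtype=np.float32)  # BERT-Base character vector
word_vec = np.ones(100, dtype=np.float32)   # assumed skip-gram word vector size
wi = fuse(char_vec, word_vec)
```

Every character in a sentence is spliced with the vector of its containing word, so all characters of one word share the same word-vector half.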
LSTM, the long short-term memory network, is a variant of the recurrent network RNN, i.e. a sequence model. Through an input gate, a forget gate, and an output gate it selectively transmits time-sequence information, effectively alleviating the vanishing-gradient problem that an ordinary RNN suffers on overlong sequences. The LSTM structure can be formally represented as:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where x_t is the cell input at time t; i_t, f_t, o_t denote the input gate, forget gate, and output gate at time t; W and b denote the weight parameter matrices and bias vectors of the three gates; the intermediate state \tilde{c}_t obtained for the input at the current time t is used to update the current cell state c_t; and h_t is the output at the current time (\sigma is the sigmoid activation function and \tanh the hyperbolic tangent activation function). The bidirectional LSTM thus effectively collects the context information of each character: for each combined embedding, the hidden output \overrightarrow{h}_i transmitted in sequence order and the hidden output \overleftarrow{h}_i transmitted in reverse order are spliced together to obtain the final hidden-layer representation r_i of the combined embedding.
For the sequence tagging task it is useful to consider the correlation between adjacent tags and to jointly decode the optimal tag sequence for a given sentence. For example, for the NER task with BIO tags, "B-PER I-PER" is a legal sequence, but "B-LOC I-ORG" and "O I-PER" are illegal tag sequences: the tag following B-LOC inside the same entity must be I-LOC rather than I-ORG, and the first tag of an entity must be "B-" rather than "I-". Jointly modelling the tag sequence with a conditional random field (CRF), instead of decoding each tag independently, effectively avoids generating illegal tags. Therefore the combined-embedding hidden-layer representation r_i produced by the encoding layer is input into the CRF layer, which gives the final sequence probability over all possible tag sequences y:
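The sequence probability announced above can be written in the standard linear-chain CRF form (a reconstruction under the usual notation, since the original formula is not preserved: P_{i,y_i} denotes the emission score of tag y_i at position i produced from r_i, A_{y_{i-1},y_i} the transition score between adjacent tags, and Y_X the set of all possible tag sequences for input X):

```latex
\mathrm{score}(X, y) = \sum_{i=1}^{n} \bigl( A_{y_{i-1},\,y_i} + P_{i,\,y_i} \bigr),
\qquad
p(y \mid X) = \frac{\exp\bigl(\mathrm{score}(X, y)\bigr)}
                   {\sum_{y' \in Y_X} \exp\bigl(\mathrm{score}(X, y')\bigr)}
```

At decoding time the Viterbi algorithm selects the y maximizing this score, which is how the transition probabilities rule out illegal tag sequences.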
the evaluation index selected in the experiment is F1 value, which is obtained by calculating the accuracy P and the recall ratio R, and the specific calculation formula is as follows:
where TP indicates a positive case in which the determination is correct, FP indicates that a negative case is determined as a positive case, and FN indicates that a positive case is determined as a negative case.
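The metric computation can be sketched as follows (a minimal illustration of the formulas above; the counts are toy values):

```python
def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# toy counts: 8 correctly recognized entities, 2 spurious, 2 missed
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```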
The entity recognition algorithm model provided by the invention was tested under Python 3.6.8, Keras 2.1.4, and TensorFlow 1.14.0. The batch_size for the training and test sets is 64, epoch is 25, the dropout rate is 0.2 to prevent overfitting, and sequence_length is 100; the early-stopping condition is that the validation-set accuracy does not improve for 2 epochs. Pre-training a BERT model requires a large amount of computation; the pre-trained BERT language model versions are shown in the figure, where L denotes the number of layers, H the hidden size, and A the number of self-attention heads. The experiment uses the BERT-Base-Chinese model version, which has 12 layers in total, a hidden size of 768, 12 attention heads, and 110M parameters. The first training step inputs 64 sentences per batch; the word vector trained for the word containing each character is spliced along the dimension with the character vector obtained from the BERT model to form the joint feature representation. In the network training stage the Adam function is selected as the optimizer for iterative training, each round improving the model parameters by continuously reducing the error. The combination vector is first fed as the input vector into the Bi-LSTM to capture effective context information, and finally the conditional random field serves as the decoder: the optimal tag sequence of the characters is obtained through transition probabilities, so that each entity is assigned a category label, realizing entity recognition and classification.
Through model training, the model reaches 93.8% precision, 90.05% recall, and a 91.3% F value on the validation set. This precision is clearly superior both to that of the model using only the word2vec character vector Wi_char as the feature (87.1%) and to that of the model using only the BERT character vectors as features, without adding word vectors as a word-level phrase supplement (89.2%).
Model testing stage: the user inputs a sentence to be tested and receives a result in JSON format containing the following information: the identified and extracted entity (word), the starting position of the entity (start), the ending position of the entity (end), and the category tag of the entity (type); the actual meaning represented by each category tag can be seen in fig. 3.
For example, the sentence to be tested input by the user on the console is: "Persons insured under the urban-and-rural resident social endowment insurance who take up employment may participate in the basic endowment insurance for enterprise employees while keeping their urban-and-rural resident social endowment insurance relationship; the concrete transfer method follows the Interim Measures on the Connection of the Urban and Rural Endowment Insurance Systems of the Ministry of Human Resources and Social Security and the Ministry of Finance." The recognition result is: 'entities': [{'word': 'urban-and-rural resident social endowment insurance', 'start': 12, 'type': 'PRO'}, {'word': 'basic endowment insurance for enterprise employees', 'start': 21, 'end': 30, 'type': 'PRO'}, {'word': 'urban-and-rural resident social endowment insurance relationship', 'start': 35, 'end': 46, 'type': 'PRO'}, {'word': 'Ministry of Human Resources and Social Security', 'start': 55, 'end': 63, 'type': 'ORG'}, ...], with the Ministry of Finance likewise recognized as an ORG entity and the cited interim measures as a regulation-name entity.
According to the invention, a four-risk one-fund domain dictionary is pre-built and entity categories are defined through part-of-speech combination; the entities in the dictionary are labelled, and the original four-risk one-fund policy and regulation texts can be labelled automatically with the Jieba segmentation tool and related algorithms, yielding a labelled corpus of a certain scale and reducing the cost of manual data labelling. For feature extraction, BERT pre-training serves as the feature layer for character vectors, word features are extracted from the segmented four-risk one-fund policy and regulation texts through a Word2Vec model, and the trained vectors are combined into joint character-word vectors. This compensates for the insufficient features of a small number of labelled samples and the insufficient extraction of character semantics, supplements the character vectors with word-level phrase information, and further improves the accuracy of the model to a certain extent. The method can effectively alleviate the problems of insufficient labelled data and low recognition accuracy in the four-risk one-fund-domain named entity recognition task.
The above description is only a preferred embodiment of the present invention and is not intended to limit it; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in its protection scope.