Disclosure of Invention
The invention aims to provide a method for recognizing text entities in four-risk one-fund policies and regulations.
The purpose of the invention is achieved by the following technical scheme. The method comprises the following steps:
Step 1: inputting the text to be recognized; constructing a four-risk one-fund-domain entity segmentation and labelling dictionary, and pre-training the language model BERT based on this dictionary;
Step 2: performing word segmentation on the text to be recognized;
Step 3: taking part of the segmented text to construct a training set, the remainder forming a test set; labelling the segmented text in the training set according to the four-risk one-fund-domain entity segmentation and labelling dictionary;
Step 4: splitting the labelled words in the training set into single Chinese characters, and performing further BIO entity-boundary labelling according to each word's entity category and the position at which each character appears in the entity;
Step 5: inputting the labelled training set into the pre-trained language model BERT to obtain, for each character, a character vector Wi_charbert with context semantic information, dynamically generated from context features;
Step 6: inputting the labelled training set into the skip-gram model of word2vec for training to obtain a word vector for each word;
Step 7: fusing the character vector Wi_charbert, which carries context semantic information, with the word vector Wi_word of the word containing the character by dimension splicing, obtaining the fused character-word combination vector Wi;
Step 8: inputting the combination vectors Wi of the training set into a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF) model for training, to obtain an entity recognition and classification model;
Specifically, the combination vector Wi is first fed as the input vector into the Bi-LSTM to capture effective context information; the CRF model then serves as the decoder, obtaining the optimal tag sequence of the characters through transition probabilities, thereby assigning each entity a category label and realizing entity recognition and classification;
Step 9: inputting the test set into the trained entity recognition and classification model to obtain the entity recognition result for the text to be recognized; the result comprises each entity, its start position, its end position, and its category label.
The invention has the beneficial effects that:
The character vectors with context semantic information, dynamically generated for each character from context features, are obtained through the pre-trained language model BERT; the word vector of each word is obtained through the skip-gram model of word2vec; and the character vector and the word vector of the containing word are fused by dimension splicing to obtain a combined character-word vector. The method can effectively alleviate the problems of insufficient labelled data and low recognition accuracy in the four-risk one-fund-domain named entity recognition task.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention relates to a method for recognizing text entities in four-risk one-fund policies and regulations, which automatically identifies named entities with domain characteristics from the text of such policies and regulations, in particular named entities related to the four-risk one-fund domain in policy and regulation texts issued from the central government down to local governments.
Named entity recognition in the four-risk one-fund field currently faces the following problems. First, unlike the general domain, entities in four-risk one-fund policy and regulation texts are specialized: they contain a large number of proprietary domain terms that do not necessarily appear in common lexicons, and many multi-word combinations occur. Second, the field lacks a public large-scale labelled data set.
To address these problems, the invention constructs a four-risk one-fund domain dictionary using a rule-based part-of-speech collocation method, and labels the selected original texts with this domain dictionary. This not only reduces a large amount of labour cost but also facilitates subsequent preprocessing work such as rapidly expanding the training data and performing word segmentation and labelling on the original texts. BERT pre-training serves as the feature layer for character vectors; word features are extracted from the segmented four-risk one-fund policy and regulation texts through a Word2Vec model; and the trained character and word vectors are spliced to obtain combined character-word vectors. This compensates both for the insufficient features of a small number of labelled samples and for the insufficient extraction of character semantics, while supplementing the character vectors with word-level phrase information. Finally, the combined vectors are trained with a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF) to obtain the four-risk one-fund-domain entity recognition model.
Addressing the overlong entities and low recognition accuracy caused by word nesting in this field, the invention provides an entity recognition method based on the pre-trained language model BERT. The BERT model enhances the semantic representation of characters in policies and regulations by dynamically generating character vectors from each character's context. Considering also that single Chinese characters are not the most basic unit of Chinese semantics, the generated dynamic character vectors are spliced with the word vectors of the containing words to form combination vectors, which serve as the input of the Bi-LSTM-CRF model: the Bi-LSTM layer encodes, the CRF layer decodes, and the entity recognition result is finally labelled.
A method for recognizing text entities in four-risk one-fund policies and regulations comprises the following steps:
Step 1: inputting the text to be recognized; constructing a four-risk one-fund-domain entity segmentation and labelling dictionary, and pre-training the language model BERT based on this dictionary;
Step 2: performing word segmentation on the text to be recognized;
Step 3: taking part of the segmented text to construct a training set, the remainder forming a test set; labelling the segmented text in the training set according to the four-risk one-fund-domain entity segmentation and labelling dictionary;
Step 4: splitting the labelled words in the training set into single Chinese characters, and performing further BIO entity-boundary labelling according to each word's entity category and the position at which each character appears in the entity;
Step 5: inputting the labelled training set into the pre-trained language model BERT to obtain, for each character, a character vector Wi_charbert with context semantic information, dynamically generated from context features;
Step 6: inputting the labelled training set into the skip-gram model of word2vec for training to obtain a word vector for each word;
Step 7: fusing the character vector Wi_charbert, which carries context semantic information, with the word vector Wi_word of the word containing the character by dimension splicing, obtaining the fused character-word combination vector Wi;
Step 8: inputting the combination vectors Wi of the training set into a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF) model for training, to obtain an entity recognition and classification model;
Specifically, the combination vector Wi is first fed as the input vector into the Bi-LSTM to capture effective context information; the CRF model then serves as the decoder, obtaining the optimal tag sequence of the characters through transition probabilities, thereby assigning each entity a category label and realizing entity recognition and classification;
Step 9: inputting the test set into the trained entity recognition and classification model to obtain the entity recognition result for the text to be recognized; the result comprises each entity, its start position, its end position, and its category label.
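The format of the step-9 output can be sketched as follows (a minimal illustration, not the invention's implementation; the function name is an assumption, and the tag scheme follows the BIO labels of step 4, with inclusive end positions):

```python
def decode_bio(chars, tags):
    """Recover (entity, start, end, type) spans from character-level
    BIO tags, matching the fields of the step-9 recognition result."""
    entities, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, start = tags[i][2:], i
            j = i + 1
            while j < len(tags) and tags[j] == "I-" + etype:
                j += 1
            # end is the index of the entity's last character (inclusive)
            entities.append(("".join(chars[start:j]), start, j - 1, etype))
            i = j
        else:
            i += 1
    return entities

# toy example: "基本养老保险费" tagged as a PRO domain term, followed by an O character
chars = list("基本养老保险费。")
tags = ["B-PRO"] + ["I-PRO"] * 6 + ["O"]
spans = decode_bio(chars, tags)
```

The same decoding applies at test time to turn the CRF's tag sequence into the (word, start, end, type) entries returned to the user.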
Example 1:
Because the four-risk one-fund policy and regulation texts are obtained through a web crawler, they may contain HTML tags, garbled characters, and table symbols. The original texts are therefore uniformly encoded in the UTF-8 format, and garbled fields such as stray spaces are removed with regular expressions. Word segmentation and part-of-speech tagging are then performed on the preprocessed texts.
Domain terms can be divided into word-type and phrase-type domain concepts by the way they form words. A word-type domain concept consists of a single word; it cannot be segmented further and is the smallest independent word unit. A phrase-type domain concept consists of two or more words, which need not themselves be word-type domain concepts. After word segmentation of the corpus, statistics show that four-risk one-fund-domain terms are mostly concentrated in binary, ternary, and quaternary phrases; N-grams are therefore counted to select phrases with high occurrence frequency, and phrases that do not conform to the rules are removed by analysing the characteristics of the domain phrases, formulating a rule table, and manually screening by part of speech. The original policy and regulation texts are segmented with the constructed dictionary via the Jieba segmenter and a user dictionary under the maximum-matching principle, and preprocessing such as automatic entity-category labelling is performed on the segmented texts.
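The N-gram counting described above can be sketched with the standard library alone (a minimal illustration; the function names, the frequency threshold, and the toy token list are assumptions, and the subsequent rule-table and manual screening steps are omitted):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams over a segmented token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def frequent_phrases(tokens, ns=(2, 3, 4), min_freq=2):
    """Collect binary/ternary/quaternary phrases whose frequency meets a
    threshold, as candidates for the domain dictionary."""
    phrases = {}
    for n in ns:
        for gram, freq in ngram_counts(tokens, n).items():
            if freq >= min_freq:
                phrases["".join(gram)] = freq
    return phrases

# toy segmented corpus containing the repeated term 基本养老保险
tokens = ["基本", "养老", "保险", "费", "基本", "养老", "保险"]
cands = frequent_phrases(tokens, ns=(2, 3), min_freq=2)
```

Candidates surviving the frequency cut would then pass through the part-of-speech rule table and manual screening before entering the user dictionary.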
The corpus is built by crawling several knowledge sources, including four-risk one-fund judicial cases, central and local laws and regulations, and encyclopedia entries related to the four-risk one-fund domain. The laws and regulations come mainly from PKULaw (Beida Fabao), and the encyclopedia entries mainly from Baidu Baike. A domain term concept set is obtained from the policy and regulation texts based on rule-driven part-of-speech collocation and partial manual help. (Although current Chinese word segmentation tools achieve high accuracy, their fine segmentation granularity handles some domain concepts poorly; for example, "basic endowment insurance fee" is split into two words rather than treated as a single term entity, so part of the semantic information is lost.) Besides domain technical terms, the invention also manually defines and classifies the common domain entities appearing in the policies and regulations. Finally, five categories of four-risk one-fund-domain entities (domain terms, place names, organization names, person names, and regulation names) are summarized for category labelling, so as to construct the four-risk one-fund-domain entity segmentation and labelling dictionary.
The original policy and regulation texts are segmented and given category labels using the constructed dictionary and the Jieba segmentation tool. The corpus used by the invention comprises judicial cases published by relevant departments in the four-risk one-fund field (endowment insurance, work-injury insurance, medical insurance, unemployment insurance, and housing provident fund), together with central and local laws and regulations, totalling 25554 documents: 7704 on endowment insurance, 1357 on unemployment insurance, 1946 on work-injury insurance, 7749+996=8745 on maternity/medical insurance, and 2969 on the housing provident fund. From this corpus, 1000 documents are extracted in proportion to the insurance types. Words labelled with categories are split into single Chinese characters, and further BIO entity-boundary labelling is carried out according to the entity category and each character's position within the entity; for example, "基本养老保险费" (basic endowment insurance fee) is labelled as {基 B-PRO} {本 I-PRO} {养 I-PRO} {老 I-PRO} {保 I-PRO} {险 I-PRO} {费 I-PRO} (PRO is the domain-term entity tag). Of the 1000 labelled policy and regulation documents, 70% serve as the training set, 20% as the validation set, and 10% as the test set.
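The character splitting and BIO boundary labelling of this step can be sketched as follows (a minimal illustration; the helper name is an assumption, and PRO follows the domain-term tag used in the example above):

```python
def bio_tag(word, entity_type=None):
    """Split a dictionary-labelled word into single Chinese characters with
    BIO boundary tags; characters of non-entity words are tagged O."""
    if entity_type is None:
        return [(ch, "O") for ch in word]
    return [(word[0], "B-" + entity_type)] + \
           [(ch, "I-" + entity_type) for ch in word[1:]]

# "基本养老保险费" as a domain term (PRO), matching the example in the text
tagged = bio_tag("基本养老保险费", "PRO")
```

Applying this to every labelled word in a document yields the character-level training sequences consumed by the models below.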
The character-level features are initialized by the pre-trained BERT language model from the input text information, and the obtained character vectors are recorded as a sequence X = (x1, x2, x3, ..., xn). Using context semantic information in this way overcomes the limitation of traditional static character vectors, which cannot yield different feature vectors for the same character in different contexts, and extracts the semantic features in the text more effectively.
For word-level feature extraction and representation, word features are extracted from the segmented four-risk one-fund policy and regulation texts through a Word2Vec model and trained into word vector representations.
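Skip-gram training predicts the surrounding words of each centre word within a context window; the generation of training pairs can be sketched as follows (a stdlib illustration of the training pairs only, not the full Word2Vec model; the function name and window size are assumptions):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (centre, context) pairs as consumed by skip-gram training."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

# toy segmented sentence
pairs = skipgram_pairs(["缴纳", "基本", "养老", "保险费"], window=1)
```

In practice a library such as gensim trains the skip-gram objective over these pairs to produce the word vectors Wi_word.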
The word vectors and the character vectors obtained by the BERT model are then fused by dimension splicing.
The combined character-word feature vectors are trained with a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF) model for entity recognition and classification, finally yielding a model capable of entity recognition on four-risk one-fund policy and regulation texts. The obtained model is evaluated and tested on its F1 value and applied to the construction of a four-risk one-fund-domain knowledge graph.
Character vectors: the training corpus is trained with the skip-gram model of word2vec to obtain the static character vector Wi_char; the corpus is also input into the pre-trained language model BERT to obtain the character vector Wi_charbert with context semantic information.
Word vectors: for the word vector Wi_word, the Chinese text is first segmented with jieba, and the segmented corpus is then trained with the skip-gram model.
The character vector with context semantic information obtained from the BERT pre-trained language model and the word vector of the word containing the character are fused by dimension splicing, finally obtaining a joint character-word representation whose dimension is the sum of the character-vector and word-vector dimensions, i.e. the fused combination vector.
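The dimension splicing can be sketched with NumPy as follows (a minimal illustration; the 768-dimension character vector matches the BERT-Base hidden size used later, while the 100-dimension word vector is an assumed skip-gram size):

```python
import numpy as np

def fuse(char_vec, word_vec):
    """Dimension splicing: the fused vector Wi has dimension
    dim(Wi_charbert) + dim(Wi_word)."""
    return np.concatenate([char_vec, word_vec])

char_vec = np.zeros(768, dtype=np.float32)  # BERT-Base character vector
word_vec = np.ones(100, dtype=np.float32)   # assumed skip-gram word vector size
wi = fuse(char_vec, word_vec)
```

Every character in a sentence is spliced with the vector of its containing word, so all characters of one word share the same word-vector half.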
LSTM, the long short-term memory network, is a variant of the recurrent network RNN, i.e. a sequence model. Through an input gate, a forget gate, and an output gate it selectively transmits time-sequence information, effectively alleviating the vanishing-gradient problem that an ordinary RNN suffers on overlong sequences. The LSTM structure can be formally represented as:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where x_t is the cell input at time t; i_t, f_t, o_t denote the input gate, forget gate, and output gate at time t; W and b denote the weight parameter matrices and bias vectors of the three gates; the intermediate state \tilde{c}_t obtained for the input at the current time t is used to update the current cell state c_t; and h_t is the output at the current time (\sigma is the sigmoid activation function and \tanh the hyperbolic tangent activation function). The bidirectional LSTM thus effectively collects the context information of each character: for each combined embedding, the hidden output \overrightarrow{h}_i transmitted in sequence order and the hidden output \overleftarrow{h}_i transmitted in reverse order are spliced together to obtain the final hidden-layer representation r_i of the combined embedding.
For the sequence tagging task it is useful to consider the correlation between adjacent tags and to jointly decode the optimal tag sequence for a given sentence. For example, for the NER task with BIO tags, "B-PER I-PER" is a legal sequence, but "B-LOC I-ORG" and "O I-PER" are illegal tag sequences: the tag following B-LOC inside the same entity must be I-LOC rather than I-ORG, and the first tag of an entity must be "B-" rather than "I-". Jointly modelling the tag sequence with a conditional random field (CRF), instead of decoding each tag independently, effectively avoids generating illegal tags. Therefore the combined-embedding hidden-layer representation r_i produced by the encoding layer is input into the CRF layer, which gives the final sequence probability over all possible tag sequences y:
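The sequence probability announced above can be written in the standard linear-chain CRF form (a reconstruction under the usual notation, since the original formula is not preserved: P_{i,y_i} denotes the emission score of tag y_i at position i produced from r_i, A_{y_{i-1},y_i} the transition score between adjacent tags, and Y_X the set of all possible tag sequences for input X):

```latex
\mathrm{score}(X, y) = \sum_{i=1}^{n} \bigl( A_{y_{i-1},\,y_i} + P_{i,\,y_i} \bigr),
\qquad
p(y \mid X) = \frac{\exp\bigl(\mathrm{score}(X, y)\bigr)}
                   {\sum_{y' \in Y_X} \exp\bigl(\mathrm{score}(X, y')\bigr)}
```

At decoding time the Viterbi algorithm selects the y maximizing this score, which is how the transition probabilities rule out illegal tag sequences.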
the evaluation index selected in the experiment is F1 value, which is obtained by calculating the accuracy P and the recall ratio R, and the specific calculation formula is as follows:
where TP indicates a positive case in which the determination is correct, FP indicates that a negative case is determined as a positive case, and FN indicates that a positive case is determined as a negative case.
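The metric computation can be sketched as follows (a minimal illustration of the formulas above; the counts are toy values):

```python
def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# toy counts: 8 correctly recognized entities, 2 spurious, 2 missed
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```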
The entity recognition algorithm model provided by the invention was tested under Python 3.6.8, Keras 2.1.4, and TensorFlow 1.14.0. The batch_size for the training and test sets is 64, epoch is 25, the dropout rate is 0.2 to prevent overfitting, and sequence_length is 100; the early-stopping condition is that the validation-set accuracy does not improve for 2 epochs. Pre-training a BERT model requires a large amount of computation; the pre-trained BERT language model versions are shown in the figure, where L denotes the number of layers, H the hidden size, and A the number of self-attention heads. The experiment uses the BERT-Base-Chinese model version, which has 12 layers in total, a hidden size of 768, 12 attention heads, and 110M parameters. The first training step inputs 64 sentences per batch; the word vector trained for the word containing each character is spliced along the dimension with the character vector obtained from the BERT model to form the joint feature representation. In the network training stage the Adam function is selected as the optimizer for iterative training, each round improving the model parameters by continuously reducing the error. The combination vector is first fed as the input vector into the Bi-LSTM to capture effective context information, and finally the conditional random field serves as the decoder: the optimal tag sequence of the characters is obtained through transition probabilities, so that each entity is assigned a category label, realizing entity recognition and classification.
Through model training, the model reaches 93.8% precision, 90.05% recall, and a 91.3% F value on the validation set. This precision is clearly superior both to that of the model using only the word2vec character vector Wi_char as the feature (87.1%) and to that of the model using only the BERT character vectors as features, without adding word vectors as a word-level phrase supplement (89.2%).
Model testing stage: the user inputs a sentence to be tested and receives a result in JSON format containing the following information: the identified and extracted entity (word), the starting position of the entity (start), the ending position of the entity (end), and the category tag of the entity (type); the actual meaning represented by each category tag can be seen in fig. 3.
For example, the sentence to be tested input by the user on the console is: "Persons insured under the urban-and-rural resident social endowment insurance who take up employment may participate in the basic endowment insurance for enterprise employees while keeping their urban-and-rural resident social endowment insurance relationship; the concrete transfer method follows the Interim Measures on the Connection of the Urban and Rural Endowment Insurance Systems of the Ministry of Human Resources and Social Security and the Ministry of Finance." The recognition result is: 'entities': [{'word': 'urban-and-rural resident social endowment insurance', 'start': 12, 'type': 'PRO'}, {'word': 'basic endowment insurance for enterprise employees', 'start': 21, 'end': 30, 'type': 'PRO'}, {'word': 'urban-and-rural resident social endowment insurance relationship', 'start': 35, 'end': 46, 'type': 'PRO'}, {'word': 'Ministry of Human Resources and Social Security', 'start': 55, 'end': 63, 'type': 'ORG'}, ...], with the Ministry of Finance likewise recognized as an ORG entity and the cited interim measures as a regulation-name entity.
According to the invention, a four-risk one-fund domain dictionary is pre-built and entity categories are defined through part-of-speech combination; the entities in the dictionary are labelled, and the original four-risk one-fund policy and regulation texts can be labelled automatically with the Jieba segmentation tool and related algorithms, yielding a labelled corpus of a certain scale and reducing the cost of manual data labelling. For feature extraction, BERT pre-training serves as the feature layer for character vectors, word features are extracted from the segmented four-risk one-fund policy and regulation texts through a Word2Vec model, and the trained vectors are combined into joint character-word vectors. This compensates for the insufficient features of a small number of labelled samples and the insufficient extraction of character semantics, supplements the character vectors with word-level phrase information, and further improves the accuracy of the model to a certain extent. The method can effectively alleviate the problems of insufficient labelled data and low recognition accuracy in the four-risk one-fund-domain named entity recognition task.
The above description is only a preferred embodiment of the present invention and is not intended to limit it; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in its protection scope.