CN112836046A - A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund - Google Patents

A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund

Info

Publication number
CN112836046A
Authority
CN
China
Prior art keywords
word
entity
vector
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110039836.2A
Other languages
Chinese (zh)
Inventor
范贺添
申林山
黄少滨
李熔盛
吴汉瑜
谷虹润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202110039836.2A
Publication of CN112836046A
Legal status: Pending (current)

Abstract

Translated from Chinese

The invention belongs to the technical field of named entity recognition, and in particular relates to a method for recognizing entities in policy and regulation texts in the field of four insurances and one housing fund (the "four-risk one-gold" domain). A character vector with contextual semantic information, dynamically generated for each character from its contextual features, is obtained through the pre-trained language model BERT, and the word vector of each word is obtained through the skip-gram model in word2vec. The character vector with contextual semantic information and the word vector of the word containing that character are fused by dimension-wise concatenation to obtain a joint character-word vector. This not only compensates for the insufficient features of a small number of labeled samples and the insufficient extraction of character semantics, but also supplements the character vectors with word-level phrase information, thereby improving the accuracy of the model to a certain extent. The invention can effectively solve the problems of insufficient labeled data and low recognition accuracy in the named entity recognition task of the four-risk one-gold domain.

Figure 202110039836

Description

Four-risk one-gold-field policy and regulation text entity identification method
Technical Field
The invention belongs to the technical field of named entity identification, and particularly relates to a four-risk one-gold-field policy and regulation text entity identification method.
Background
With social development, the four-insurances-and-one-fund system has been gradually improved, and the role of adhering to this basic social security system in China has become increasingly prominent. Research work such as question-answering systems and knowledge graph construction in the four-risk one-gold domain is therefore of great significance. Named Entity Recognition (NER), as an important basic unit of the knowledge graph, is a core technology for constructing and completing knowledge graphs. It refers to identifying entities with specific meanings in text, mainly including person names, place names, organization names, proper nouns, and so on. Therefore, when constructing the four-risk one-gold domain knowledge graph, it is also important to identify the domain-specific terms and commonly used named entities (such as organization names and place names) of this domain.
Traditional named entity recognition methods fall into two major categories: algorithms based on rule matching and algorithms based on machine learning. Conventional machine learning models (for example, the conditional random field (CRF), which is still an important component of mainstream NER models: its objective function considers not only the input state feature functions but also the label transition feature functions, so as to obtain the optimal label sequence) share a common drawback: they place high demands on feature engineering. Various features that influence the named entity recognition task must be selected and combined into a vector to represent the words in the text, and a large amount of manual labeling must be performed on the preprocessed data in advance to train a good model, so the modeling cost is high. In recent years, with the growth of computing power and the introduction of word embeddings, deep learning methods have gradually been applied to the named entity recognition task, and neural networks have become models capable of efficiently handling many NLP tasks, mainly because neural-network-based deep learning methods generalize well. To make word representations contain more comprehensive semantic and syntactic information, researchers have proposed further enhancing word vectors with pre-trained language models. The most prominent of these is the BERT model (Bidirectional Encoder Representations from Transformers) proposed by Devlin et al. at Google, which uses the self-attention mechanism and the Transformer encoder to pre-train on large-scale open corpora, yielding word vectors with richer contextual semantic information; Peng M. et al. achieved good entity recognition results in the general domain with this approach.
Disclosure of Invention
The invention aims to provide a method for identifying text entities of four-risk one-gold-field policy and regulation.
The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:
step 1: inputting a text to be recognized; constructing a four-risk one-gold-domain entity segmentation and labeling dictionary, and pre-training a language model BERT based on the four-risk one-gold-domain entity segmentation and labeling dictionary;
step 2: performing word segmentation processing on a text to be recognized;
Step 3: taking part of the segmented text to be recognized to construct a training set, and forming a test set from the remaining segmented text; labeling the segmented text in the training set according to the four-risk one-gold domain entity segmentation and labeling dictionary;
Step 4: splitting the labeled words in the training set into single Chinese characters, and performing further BIO entity boundary labeling according to the entity category of each word and the position of each Chinese character within the entity;
Step 5: inputting the labeled training set into the pre-trained language model BERT to obtain, for each character, a character vector W_i^charbert with contextual semantic information, dynamically generated from its contextual features;
Step 6: inputting the labeled training set into the skip-gram model in word2vec for training to obtain the word vector of each word;
Step 7: fusing the character vector W_i^charbert with contextual semantic information and the word vector W_i^word of the word containing that character by dimension-wise concatenation to obtain the fused joint character-word vector W_i:

W_i = [W_i^charbert ; W_i^word]

Step 8: inputting the joint character-word vectors W_i of the training set into a bidirectional long short-term memory network Bi-LSTM and conditional random field CRF model for training to obtain an entity recognition and classification model;
specifically, the joint character-word vectors W_i of the training set are first fed as input vectors into the bidirectional long short-term memory network Bi-LSTM to capture effective context information, and the model is then decoded with the conditional random field CRF model as the decoder, i.e. the optimal label sequence of each character is obtained through transition probabilities, so that category labels are assigned to entities and entity recognition and classification are achieved;
Step 9: inputting the test set into the trained entity recognition and classification model to obtain the entity recognition result of the text to be recognized; the entity recognition result includes the entity, the start position of the entity, the end position of the entity, and the category label of the entity.
The invention has the beneficial effects that:
the word vector with context semantic information dynamically generated by each character based on context features is obtained through the pre-training language model BERT, the word vector of each word is obtained through a skip-gram model in word2vec, the word vector with the context semantic information and the word vector of the word in which the word is located are subjected to feature fusion in a dimension splicing mode, and a combined word vector is obtained. The method can effectively solve the problems of insufficient labeled data and low identification precision in the four-risk one-gold domain named entity identification task.
Drawings
FIG. 1 is a model diagram of the pre-trained language model BERT of the present invention.
Fig. 2 is a flow chart of the overall implementation of the present invention.
Fig. 3 is an entity tag description table in an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention relates to a method for recognizing entities in four-risk one-gold domain policy and regulation texts, which automatically identifies named entities with domain characteristics from such texts, in particular named entities related to the four-risk one-gold domain in policy and regulation texts issued from the central government down to local governments.
Existing named entity recognition in the four-risk one-gold domain faces the following problems. First, unlike the general domain, the entities in four-risk one-gold policy and regulation texts are special: they contain a large number of domain-specific terms, those terms are not necessarily present in common lexicons, and many multi-word combinations may occur. Second, the four-risk one-gold domain also lacks a public large-scale labeled dataset.
To address these problems, the invention constructs a four-risk one-gold domain dictionary with a rule-based part-of-speech collocation method and labels the selected original texts with this domain dictionary. This not only saves a large amount of labor cost, but also makes it easy to quickly expand the training data and to perform subsequent preprocessing such as word segmentation and labeling on the original texts. BERT pre-training is then used as the feature layer for character vectors, word features are extracted from the segmented four-risk one-gold policy and regulation texts with a Word2Vec model, and the trained word vectors are concatenated with the character vectors to obtain joint character-word vectors. This not only compensates for the insufficient features of a small number of labeled samples and the insufficient extraction of character semantics, but also supplements the character vectors with word-level phrase information. Finally, the joint character-word vectors are trained with a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF) to obtain the four-risk one-gold domain entity recognition model.
The invention thus provides an entity recognition method based on the pre-trained language model BERT, aimed at the problems of overly long entities and low recognition accuracy caused by word nesting in this domain. The BERT model enhances the semantic representation of characters in the policy and regulation texts and dynamically generates character vectors from the contextual features of each character. At the same time, considering that single Chinese characters are not the most basic unit of Chinese semantics, the dynamically generated character vectors are concatenated with the word vectors of their words to obtain joint vectors, which serve as input to the Bi-LSTM-CRF model: the Bi-LSTM layer encodes, the CRF layer decodes, and the entity recognition result is finally labeled.
A method for identifying text entities of four-risk one-gold-field policy and regulation comprises the following steps:
step 1: inputting a text to be recognized; constructing a four-risk one-gold-domain entity segmentation and labeling dictionary, and pre-training a language model BERT based on the four-risk one-gold-domain entity segmentation and labeling dictionary;
step 2: performing word segmentation processing on a text to be recognized;
Step 3: taking part of the segmented text to be recognized to construct a training set, and forming a test set from the remaining segmented text; labeling the segmented text in the training set according to the four-risk one-gold domain entity segmentation and labeling dictionary;
Step 4: splitting the labeled words in the training set into single Chinese characters, and performing further BIO entity boundary labeling according to the entity category of each word and the position of each Chinese character within the entity;
Step 5: inputting the labeled training set into the pre-trained language model BERT to obtain, for each character, a character vector W_i^charbert with contextual semantic information, dynamically generated from its contextual features;
Step 6: inputting the labeled training set into the skip-gram model in word2vec for training to obtain the word vector of each word;
Step 7: fusing the character vector W_i^charbert with contextual semantic information and the word vector W_i^word of the word containing that character by dimension-wise concatenation to obtain the fused joint character-word vector W_i:

W_i = [W_i^charbert ; W_i^word]

Step 8: inputting the joint character-word vectors W_i of the training set into a bidirectional long short-term memory network Bi-LSTM and conditional random field CRF model for training to obtain an entity recognition and classification model;
specifically, the joint character-word vectors W_i of the training set are first fed as input vectors into the bidirectional long short-term memory network Bi-LSTM to capture effective context information, and the model is then decoded with the conditional random field CRF model as the decoder, i.e. the optimal label sequence of each character is obtained through transition probabilities, so that category labels are assigned to entities and entity recognition and classification are achieved;
Step 9: inputting the test set into the trained entity recognition and classification model to obtain the entity recognition result of the text to be recognized; the entity recognition result includes the entity, the start position of the entity, the end position of the entity, and the category label of the entity.
Example 1:
because the four-risk one-fund policy and regulation text is obtained through a web crawler and may contain html tags and some messy codes and table symbols, unified coding should be performed on the original text by adopting an utf-8 coding format, and messy code fields such as spaces and the like are removed by formulating a regular expression. And performing word segmentation and part-of-speech tagging on the preprocessed text.
Domain terms can be divided into word-type concepts and phrase-type concepts according to how they are formed. A word-type domain concept consists of a single word; it cannot be segmented further and is the smallest independent word unit. A phrase-type domain concept consists of two or more words, which need not themselves be word-type domain concepts. After segmenting the corpus, statistics show that terms in the four-risk one-gold domain are mostly two-, three-, and four-word phrases. N-grams are counted to select phrases with high co-occurrence frequency, and phrases that do not match the rules are removed by analyzing the statistical characteristics of domain phrases, formulating a rule table, and manually screening by part of speech. The original policy and regulation texts are then segmented with the constructed dictionary, using the Jieba segmenter together with a user dictionary under the maximum matching principle, and preprocessing such as automatic entity-type labeling is performed on the segmented texts.
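A minimal sketch of this preprocessing step, cleaning crawled text with regular expressions, segmenting it with jieba under a user dictionary, and counting candidate N-gram phrases, is given below; the file name user_dict.txt, the regular expressions, and the sample sentence are illustrative assumptions rather than part of the patent.

```python
import re
from collections import Counter

import jieba

# Hypothetical user dictionary built from the domain term set
# (one term per line, optionally followed by a frequency and a POS tag).
jieba.load_userdict("user_dict.txt")

def clean_text(raw: str) -> str:
    """Normalize crawled policy text: strip HTML tags and stray whitespace."""
    text = re.sub(r"<[^>]+>", "", raw)        # drop HTML tags
    text = re.sub(r"[\s\u3000]+", "", text)   # drop spaces / full-width spaces
    return text

def segment(text: str) -> list:
    """Segment with jieba; the user dictionary biases it toward domain terms."""
    return list(jieba.cut(clean_text(text)))

def ngram_counts(tokens: list, n: int) -> Counter:
    """Count n-gram phrases (n = 2, 3, 4 in the patent) as candidate domain terms."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

if __name__ == "__main__":
    sample = "<p>用人单位应当按时足额缴纳基本养老保险费。</p>"
    tokens = segment(sample)
    print(tokens)
    print(ngram_counts(tokens, 2).most_common(5))
```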
by crawling several knowledge, including four-risk one-gold judicial cases, central laws and regulations and local laws and regulations related and related to four-risk one-gold domain encyclopedia entries. The laws and regulations mainly come from northern great law treasure, and the encyclopedia entries mainly come from Baidu encyclopedia. A domain term concept set is obtained for the corpus by using the text of the policy and regulation based on the regular part of speech collocation and partial manual help. (although the current Chinese word segmentation tool achieves higher accuracy, the concept processing effect on some fields is poor due to fine word segmentation granularity, such as basic endowment insurance cost which is divided into 2 words and is regarded as a term entity, so that partial semantic information is lost.) except for field professional terms, the invention manually defines and classifies the commonly used field entities appearing in the policy and regulation. And finally, summarizing 5 categories (including domain terms, place names, organization names, person names and regulation names) of the four-risk one-gold domain entities to carry out category labeling so as to construct a four-risk one-gold domain entity segmentation and labeling dictionary.
The original policy and regulation texts are segmented and given category labels with the constructed dictionary and the Jieba segmentation tool. The corpus used by the invention consists of judicial cases, central laws and regulations, and local regulations published by relevant departments in the four-risk one-gold field (endowment insurance, work-related injury insurance, medical insurance, unemployment insurance, and the housing provident fund), totaling 25,554 documents: 7,704 on endowment insurance, 1,357 on unemployment insurance, 1,946 on work-related injury insurance, 7,749 + 996 = 8,745 on maternity/medical insurance, and 2,969 on the housing provident fund. From this corpus, 1,000 documents are sampled in proportion to the original distribution over insurance types. The category-labeled words are split into single Chinese characters, and further BIO entity boundary labeling is performed according to the entity category and each character's position within the entity; for example, "基本养老保险费" (basic endowment insurance fee) is labeled as {基 B-PRO}{本 I-PRO}{养 I-PRO}{老 I-PRO}{保 I-PRO}{险 I-PRO}{费 I-PRO}, where PRO is an entity label. Of the 1,000 labeled policy and regulation documents, 70% is used as the training set, 20% as the validation set, and 10% as the test set.
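The character-level BIO labeling just described can be sketched as follows; the helper name and the example words are illustrative, with PRO used as the domain-term label as in the example above.

```python
def bio_tag(words):
    """words: list of (word, entity_label or None); returns (character, tag) pairs.

    The first character of an entity gets 'B-<label>', the remaining characters
    get 'I-<label>', and characters outside any entity get 'O'.
    """
    tagged = []
    for word, label in words:
        for i, ch in enumerate(word):
            if label is None:
                tagged.append((ch, "O"))
            elif i == 0:
                tagged.append((ch, f"B-{label}"))
            else:
                tagged.append((ch, f"I-{label}"))
    return tagged

if __name__ == "__main__":
    # "基本养老保险费" (basic endowment insurance fee) labeled as a domain term (PRO).
    example = [("缴纳", None), ("基本养老保险费", "PRO")]
    for ch, tag in bio_tag(example):
        print(ch, tag)
```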
Character-level features: the character vectors of the input text are initialized with the pre-trained BERT language model, and the resulting character vectors are recorded as a sequence X = (x1, x2, x3, ..., xn). Because they carry contextual semantic information, this solves the problem that traditional character vectors cannot take different feature representations in different contexts, so the semantic features in the text can be extracted more effectively.
Word-level feature extraction and representation: the word features in the segmented four-risk one-gold policy and regulation texts are extracted with a Word2Vec model and trained into word vector representations.
The character vectors obtained from the BERT model and the word vectors are fused by dimension-wise concatenation.
The joint character-word feature vectors are then trained for entity recognition and classification with a bidirectional long short-term memory network (Bi-LSTM) and conditional random field (CRF) model, finally yielding a model capable of entity recognition on four-risk one-gold domain policy and regulation texts. The F1 values of the resulting models are evaluated and tested, and the model is applied to the construction of the four-risk one-gold domain knowledge graph.
BERT character vectors: the training corpus is used to train character vectors W_i^char with the skip-gram model in word2vec, and is input into the pre-trained language model BERT to obtain character vectors W_i^charbert with contextual semantic information.
Word vectors W_i^word: the Chinese text is first segmented with the jieba segmenter, and the skip-gram model is then trained on the segmented corpus.
The character vector with contextual semantic information obtained from the BERT pre-trained language model and the word vector of the word containing that character are fused by dimension-wise concatenation, finally yielding a joint character-word representation whose dimension is the sum of the character-vector and word-vector dimensions, i.e. the fused joint character-word vector:

W_i = [W_i^charbert ; W_i^word]
LSTM (long short-term memory network) is a variant of the recurrent neural network (RNN), i.e. a sequence model. It uses an input gate, a forget gate, and an output gate to selectively pass on temporal information, which effectively alleviates the vanishing-gradient problem that ordinary RNNs suffer from on overly long sequences. The LSTM structure can be formally expressed as:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
where x_t is the cell input at time t, and i_t, f_t, o_t denote the input gate, forget gate, and output gate at time t, respectively. W and b denote the weight matrices and bias vectors of the three gates. c̃_t is the intermediate (candidate) state obtained from the input at the current time t; it is used to update the current cell state c_t, and h_t is the output at the current time. (σ is the sigmoid activation function and tanh is the hyperbolic tangent activation function.) In this way, the bidirectional LSTM can effectively collect the context information of characters: for each joint embedding, the hidden output passed in the forward direction, h_i^f, and the hidden output passed in the reverse direction, h_i^b, are concatenated to obtain the final hidden-layer representation of the joint embedding, h_i = [h_i^f ; h_i^b].
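To make the gate equations above concrete, the following NumPy sketch implements a single LSTM step and a bidirectional pass with hidden-state concatenation; the weight shapes and random initialization are illustrative only and this is not the patent's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the i/f/o gate and candidate-state equations.

    W: dict of weight matrices, b: dict of bias vectors, keyed by 'i', 'f', 'o', 'c'.
    Each W[k] has shape (hidden, hidden + input); each b[k] has shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate state
    c_t = f_t * c_prev + i_t * c_tilde         # cell state update
    h_t = o_t * np.tanh(c_t)                   # hidden output
    return h_t, c_t

def bilstm(xs, W_fwd, b_fwd, W_bwd, b_bwd, hidden):
    """Run a forward and a backward pass and concatenate the hidden states."""
    h_f, c_f = np.zeros(hidden), np.zeros(hidden)
    h_b, c_b = np.zeros(hidden), np.zeros(hidden)
    fwd, bwd = [], []
    for x in xs:
        h_f, c_f = lstm_step(x, h_f, c_f, W_fwd, b_fwd)
        fwd.append(h_f)
    for x in reversed(xs):
        h_b, c_b = lstm_step(x, h_b, c_b, W_bwd, b_bwd)
        bwd.append(h_b)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_i = [h_i^f ; h_i^b]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, hidden = 8, 4
    def init():
        return ({k: rng.normal(size=(hidden, hidden + dim)) for k in "ifoc"},
                {k: np.zeros(hidden) for k in "ifoc"})
    W_fwd, b_fwd = init()
    W_bwd, b_bwd = init()
    xs = [rng.normal(size=dim) for _ in range(5)]
    print(len(bilstm(xs, W_fwd, b_fwd, W_bwd, b_bwd, hidden)), "hidden vectors of size", 2 * hidden)
```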
For the sequence labeling task, it is useful to consider the correlation between adjacent labels and jointly decode the optimal label sequence for a given sentence. For example, for the NER task with BIO labels, "B-PER I-PER" is a legal sequence, but "B-LOC I-ORG" and "O I-PER" are illegal label sequences: the label following "B-LOC" should be "I-LOC" rather than an "I-ORG" label, and the first label of an entity should be a "B-" label rather than an "I-" label. The problem of generating illegal labels can be effectively avoided by modeling the label sequence jointly with a conditional random field (CRF) instead of decoding each label independently. Therefore, when the combined-embedding hidden representations r'_i produced by the encoding layer are input into the CRF layer, the score of a label sequence y and the final sequence probability over all possible label sequences are given by:

s(X, y) = Σ_i (A_{y_i, y_{i+1}} + P_{i, y_i})

P(y | X) = exp(s(X, y)) / Σ_{y'} exp(s(X, y'))

where A is the label-transition score matrix, P_{i, y_i} is the emission score of label y_i at position i produced by the Bi-LSTM encoding layer, and the sum in the denominator runs over all possible label sequences y'.
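The decoding step just described, picking the optimal label sequence from the Bi-LSTM emission scores and the CRF transition scores, can be illustrated with a small Viterbi sketch; the score matrices below are toy values, not parameters learned by the patent's model.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (T, K) scores from the Bi-LSTM; transitions: (K, K) matrix A[y_prev, y_next].

    Returns the highest-scoring label sequence under the CRF score
    s(X, y) = sum_i A[y_i, y_{i+1}] + sum_i P[i, y_i].
    """
    T, K = emissions.shape
    score = emissions[0].copy()                 # best score ending in each label at t = 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[j, k] = best score ending in j at t-1, then moving j -> k at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):               # follow back-pointers to recover the path
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

if __name__ == "__main__":
    labels = ["O", "B-PRO", "I-PRO"]
    P = np.array([[0.1, 2.0, 0.0],              # toy emission scores for a 3-character input
                  [0.2, 0.1, 1.5],
                  [1.0, 0.3, 0.8]])
    A = np.zeros((3, 3))
    A[0, 2] = -5.0                              # discourage the illegal transition O -> I-PRO
    print([labels[i] for i in viterbi_decode(P, A)])
```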
the evaluation index selected in the experiment is F1 value, which is obtained by calculating the accuracy P and the recall ratio R, and the specific calculation formula is as follows:
Figure RE-GDA0002971552760000071
where TP indicates a positive case in which the determination is correct, FP indicates that a negative case is determined as a positive case, and FN indicates that a positive case is determined as a negative case.
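A small helper makes the metric computation explicit; the counts passed in below are placeholders, not the patent's experimental figures.

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

if __name__ == "__main__":
    print(prf1(tp=90, fp=10, fn=15))   # placeholder counts
```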
The entity recognition model proposed by the invention is tested in an environment with Python 3.6.8, Keras 2.1.4, and TensorFlow 1.14.0. The batch_size of the training and test sets is 64, the number of epochs is 25, the dropout rate is 0.2 to prevent overfitting, and the sequence_length is 100; the early-stopping condition is that the validation-set accuracy does not improve for 2 consecutive epochs. Pre-training the BERT model itself requires a large amount of computation; the available pre-trained BERT language model versions are shown in the figure, where L denotes the number of layers, H the hidden size, and A the number of self-attention heads. The experiments use the BERT-Base-Chinese version, which has 12 layers, a hidden size of 768, 12 attention heads, and about 110M parameters.

In the first training step, 64 sentences are input per batch, and the word vector trained for the word containing each character is concatenated with the character vector produced by the BERT model to obtain the joint feature representation. In the network training stage, the Adam function is selected as the optimizer for iterative training; each round of training improves the model parameters by continually reducing the error. The joint vectors are first fed as input vectors into the Bi-LSTM to capture effective context information, and a conditional random field is finally used as the decoder to decode the model, i.e. the optimal label sequence of each character is obtained through transition probabilities, so that category labels are assigned to entities and entity recognition and classification are achieved. After training, the accuracy of the model on the validation set reaches 93.8%, the recall is 90.05%, and the F value is 91.3%; this accuracy is clearly better than that of the model using only the character vectors W_i^char as features (87.1%) and that of the model using only the BERT character vectors as features, without adding word vectors as a supplement of word-level phrase information (89.2%).
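As a rough sketch of the training setup just described (batch size 64, 25 epochs, dropout 0.2, sequence length 100, Adam, early stopping after 2 epochs without validation improvement), assuming the standalone Keras 2.1.x used in the experiments: the CRF decoding layer is replaced here by a per-character softmax for brevity, and the hidden size (128), the tag count (11 = B/I tags for the 5 entity categories plus O), the input dimension (768 + 100), and the toy data are assumptions, not values stated in the patent.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras.callbacks import EarlyStopping

SEQ_LEN, FEAT_DIM, NUM_TAGS = 100, 868, 11   # 868 = 768 (BERT) + 100 (word2vec, assumed)

model = Sequential([
    Bidirectional(LSTM(128, return_sequences=True), input_shape=(SEQ_LEN, FEAT_DIM)),
    Dropout(0.2),                                            # dropout rate from the experiments
    TimeDistributed(Dense(NUM_TAGS, activation="softmax")),  # CRF decoding layer omitted here
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Stop if validation accuracy does not improve for 2 consecutive epochs
# ('val_acc' is the metric name in Keras 2.1.x; newer versions log 'val_accuracy').
early_stop = EarlyStopping(monitor="val_acc", patience=2)

# Toy data standing in for the joint character-word vectors and one-hot BIO tags.
X = np.random.randn(64, SEQ_LEN, FEAT_DIM).astype("float32")
y = np.eye(NUM_TAGS)[np.random.randint(NUM_TAGS, size=(64, SEQ_LEN))].astype("float32")

model.fit(X, y, batch_size=64, epochs=25, validation_split=0.2, callbacks=[early_stop])
```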
Model testing stage: the user inputs a sentence to be tested, and a result in JSON format is returned, containing the following information: the identified and extracted entity (word), the start position of the entity (start), the end position of the entity (end), and the category label of the entity (type); the actual meaning represented by each category label is shown in Fig. 3.
For example, the sentence entered by the user at the console is: "After employment, persons covered by the urban and rural residents' social endowment insurance participate in the basic endowment insurance for enterprise employees; the urban and rural residents' social endowment insurance relationship may be retained, and the specific transfer method follows the Interim Measures for the Urban and Rural Endowment Insurance System of the Ministry of Human Resources and Social Security and the Ministry of Finance." The recognition result includes, among others: "urban and rural residents' social endowment insurance" (start 12, type PRO), "basic endowment insurance for enterprise employees" (start 21, end 30, type PRO), "urban and rural residents' social endowment insurance relationship" (start 35, end 46, type PRO), "Ministry of Human Resources and Social Security" (start 55, end 63, type ORG), "Ministry of Finance" (start 64, end 65, type ORG), and the regulation name "Interim Measures for the Urban and Rural Endowment Insurance System" (starting at position 67).
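For clarity, the returned result has roughly the following shape; the entity strings (reconstructed in Chinese from the translated example) and the offsets are illustrative and not exhaustive.

```python
# Illustrative shape of the JSON result returned in the model testing stage;
# field values depend on the actual input sentence.
result = {
    "entities": [
        {"word": "企业职工基本养老保险", "start": 21, "end": 30, "type": "PRO"},    # enterprise employees' basic endowment insurance
        {"word": "城乡居民社会养老保险关系", "start": 35, "end": 46, "type": "PRO"},  # urban/rural residents' endowment insurance relationship
        {"word": "人力资源社会保障部", "start": 55, "end": 63, "type": "ORG"},      # Ministry of Human Resources and Social Security
    ]
}
```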
According to the invention, a four-risk one-gold domain dictionary is constructed in advance with a part-of-speech collocation approach and entity categories are defined; the entities in the dictionary are labeled, and the original four-risk one-gold policy and regulation texts can be labeled automatically with the Jieba segmentation tool and related algorithms, yielding a labeled corpus of a certain scale and reducing the cost of manual annotation. For feature extraction, the invention uses BERT pre-training as the feature layer for character vectors, extracts word features from the segmented four-risk one-gold policy and regulation texts with a Word2Vec model, and combines the trained word vectors with the character vectors to obtain joint character-word vectors. This not only alleviates the problems of insufficient features from a small number of labeled samples and insufficient extraction of character semantics, but also supplements the character vectors with word-level phrase information, thereby improving the accuracy of the model to a certain extent. The method can effectively solve the problems of insufficient labeled data and low recognition accuracy in the four-risk one-gold domain named entity recognition task.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A method for recognizing text entities of four-risk one-gold-field policy and regulation is characterized by comprising the following steps:
step 1: inputting a text to be recognized; constructing a four-risk one-gold-domain entity segmentation and labeling dictionary, and pre-training a language model BERT based on the four-risk one-gold-domain entity segmentation and labeling dictionary;
step 2: performing word segmentation processing on a text to be recognized;
Step 3: taking part of the segmented text to be recognized to construct a training set, and forming a test set from the remaining segmented text; labeling the segmented text in the training set according to the four-risk one-gold domain entity segmentation and labeling dictionary;
Step 4: splitting the labeled words in the training set into single Chinese characters, and performing further BIO entity boundary labeling according to the entity category of each word and the position of each Chinese character within the entity;
Step 5: inputting the labeled training set into the pre-trained language model BERT to obtain, for each character, a character vector W_i^charbert with contextual semantic information, dynamically generated from its contextual features;
Step 6: inputting the labeled training set into the skip-gram model in word2vec for training to obtain the word vector of each word;
Step 7: fusing the character vector W_i^charbert with contextual semantic information and the word vector W_i^word of the word containing that character by dimension-wise concatenation to obtain the fused joint character-word vector W_i:

W_i = [W_i^charbert ; W_i^word]

Step 8: inputting the joint character-word vectors W_i of the training set into a bidirectional long short-term memory network Bi-LSTM and conditional random field CRF model for training to obtain an entity recognition and classification model;
specifically, the joint character-word vectors W_i of the training set are first fed as input vectors into the bidirectional long short-term memory network Bi-LSTM to capture effective context information, and the model is then decoded with the conditional random field CRF model as the decoder, i.e. the optimal label sequence of each character is obtained through transition probabilities, so that category labels are assigned to entities and entity recognition and classification are achieved;
Step 9: inputting the test set into the trained entity recognition and classification model to obtain the entity recognition result of the text to be recognized; the entity recognition result includes the entity, the start position of the entity, the end position of the entity, and the category label of the entity.
CN202110039836.2A | 2021-01-13 | 2021-01-13 | A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund | Pending | CN112836046A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110039836.2A (CN112836046A, en) | 2021-01-13 | 2021-01-13 | A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110039836.2A (CN112836046A, en) | 2021-01-13 | 2021-01-13 | A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund

Publications (1)

Publication Number | Publication Date
CN112836046A | 2021-05-25

Family

ID=75927981

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110039836.2A | CN112836046A (en), Pending: A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund | 2021-01-13 | 2021-01-13

Country Status (1)

Country | Link
CN (1) | CN112836046A (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113255294A (en)*2021-07-142021-08-13北京邮电大学Named entity recognition model training method, recognition method and device
CN113408287A (en)*2021-06-232021-09-17北京达佳互联信息技术有限公司Entity identification method and device, electronic equipment and storage medium
CN113434695A (en)*2021-06-252021-09-24平安科技(深圳)有限公司Financial event extraction method and device, electronic equipment and storage medium
CN113535976A (en)*2021-07-092021-10-22泰康保险集团股份有限公司Path vectorization representation method and device, computing equipment and storage medium
CN113609857A (en)*2021-07-222021-11-05武汉工程大学Legal named entity identification method and system based on cascade model and data enhancement
CN113627139A (en)*2021-08-112021-11-09平安国际智慧城市科技股份有限公司 Enterprise declaration form generation method, device, equipment and storage medium
CN113626602A (en)*2021-08-192021-11-09支付宝(杭州)信息技术有限公司 Method and apparatus for text classification
CN113627187A (en)*2021-08-122021-11-09平安国际智慧城市科技股份有限公司Named entity recognition method and device, electronic equipment and readable storage medium
CN113657105A (en)*2021-08-312021-11-16平安医疗健康管理股份有限公司Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN113919291A (en)*2021-09-262022-01-11上海犀语科技有限公司Master-slave parallel operation current sharing method based on analog control
CN114064902A (en)*2021-11-262022-02-18中国农业银行股份有限公司重庆市分行Financial innovation patent classification method based on BERT model
CN114281934A (en)*2021-09-162022-04-05腾讯科技(深圳)有限公司 Text recognition method, device, equipment and storage medium
CN114372468A (en)*2022-01-132022-04-19大连海事大学BERT-based maritime mail named entity identification method
CN114385784A (en)*2021-12-232022-04-22沈阳东软智能医疗科技研究院有限公司Named entity identification method, device, medium and electronic equipment
CN114398482A (en)*2021-12-062022-04-26腾讯数码(天津)有限公司 A dictionary construction method, device, electronic device and storage medium
CN114443848A (en)*2022-01-292022-05-06中国建设银行股份有限公司Credit certificate clause identification method and device
CN114528840A (en)*2022-01-212022-05-24深圳大学Chinese entity identification method, terminal and storage medium fusing context information
CN114548095A (en)*2021-12-232022-05-27北京三快在线科技有限公司Entity recognition model training method and device
CN114692635A (en)*2022-02-232022-07-01北京快确信息科技有限公司Information analysis method and device based on vocabulary enhancement and electronic equipment
CN114757184A (en)*2022-04-112022-07-15中国航空综合技术研究所Method and system for realizing knowledge question answering in aviation field
CN114780677A (en)*2022-04-062022-07-22西安电子科技大学Chinese event extraction method based on feature fusion
CN115099230A (en)*2022-07-012022-09-23联洋国融(北京)科技有限公司BERT model-based multi-target task credit risk identification method and system
CN115221265A (en)*2022-06-242022-10-21浙江嘉兴数字城市实验室有限公司Method for identifying event element named entities in social management field based on BilSTM-CRF
CN115270803A (en)*2022-09-302022-11-01北京道达天际科技股份有限公司Entity extraction method based on BERT and fused with N-gram characteristics
CN115292490A (en)*2022-08-022022-11-04福建省科立方科技有限公司Analysis algorithm for policy interpretation semantics
CN115859977A (en)*2022-10-312023-03-28浙江工业大学Named entity identification method based on fusion sequence characteristics
CN115982419A (en)*2021-10-132023-04-18中核核电运行管理有限公司 A method for document string content identification
CN116629267A (en)*2023-07-212023-08-22云筑信息科技(成都)有限公司Named entity identification method based on multiple granularities
CN119046468A (en)*2024-10-302024-11-29之江实验室Vertical domain entity expansion method and device based on large language model

Citations (12)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20060088214A1 (en)*2004-10-222006-04-27Xerox CorporationSystem and method for identifying and labeling fields of text associated with scanned business documents
KR20090004216A (en)*2007-07-062009-01-12주식회사 예스피치 Statistical Meaning Classification System and Method for Speech Recognition
CN105468744A (en)*2015-11-252016-04-06浪潮软件集团有限公司Big data platform for realizing tax public opinion analysis and full text retrieval
CN107885721A (en)*2017-10-122018-04-06北京知道未来信息技术有限公司A kind of name entity recognition method based on LSTM
CN109710756A (en)*2018-11-232019-05-03京华信息科技股份有限公司Document type categorizing system and method based on semantic character labeling
CN109871538A (en)*2019-02-182019-06-11华南理工大学 A Named Entity Recognition Method for Chinese Electronic Medical Records
CN110188340A (en)*2019-04-092019-08-30国金涌富资产管理有限公司One kind grinding message this substantive noun automatic identifying method
CN110297913A (en)*2019-06-122019-10-01中电科大数据研究院有限公司A kind of electronic government documents entity abstracting method
CN110826335A (en)*2019-11-142020-02-21北京明略软件系统有限公司 A method and apparatus for named entity recognition
CN111079377A (en)*2019-12-032020-04-28哈尔滨工程大学Method for recognizing named entities oriented to Chinese medical texts
CN111563383A (en)*2020-04-092020-08-21华南理工大学 A Chinese Named Entity Recognition Method Based on BERT and SemiCRF
CN112131404A (en)*2020-09-192020-12-25哈尔滨工程大学Entity alignment method in four-risk one-gold domain knowledge graph

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113408287A (en)*2021-06-232021-09-17北京达佳互联信息技术有限公司Entity identification method and device, electronic equipment and storage medium
CN113434695A (en)*2021-06-252021-09-24平安科技(深圳)有限公司Financial event extraction method and device, electronic equipment and storage medium
CN113535976A (en)*2021-07-092021-10-22泰康保险集团股份有限公司Path vectorization representation method and device, computing equipment and storage medium
CN113255294A (en)*2021-07-142021-08-13北京邮电大学Named entity recognition model training method, recognition method and device
CN113609857B (en)*2021-07-222023-11-28武汉工程大学 Legal named entity recognition method and system based on cascade model and data enhancement
CN113609857A (en)*2021-07-222021-11-05武汉工程大学Legal named entity identification method and system based on cascade model and data enhancement
CN113627139A (en)*2021-08-112021-11-09平安国际智慧城市科技股份有限公司 Enterprise declaration form generation method, device, equipment and storage medium
CN113627187A (en)*2021-08-122021-11-09平安国际智慧城市科技股份有限公司Named entity recognition method and device, electronic equipment and readable storage medium
CN113627187B (en)*2021-08-122024-09-13平安国际智慧城市科技股份有限公司Named entity recognition method, named entity recognition device, electronic equipment and readable storage medium
CN113626602A (en)*2021-08-192021-11-09支付宝(杭州)信息技术有限公司 Method and apparatus for text classification
CN113657105A (en)*2021-08-312021-11-16平安医疗健康管理股份有限公司Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN114281934A (en)*2021-09-162022-04-05腾讯科技(深圳)有限公司 Text recognition method, device, equipment and storage medium
CN113919291A (en)*2021-09-262022-01-11上海犀语科技有限公司Master-slave parallel operation current sharing method based on analog control
CN115982419A (en)*2021-10-132023-04-18中核核电运行管理有限公司 A method for document string content identification
CN114064902A (en)*2021-11-262022-02-18中国农业银行股份有限公司重庆市分行Financial innovation patent classification method based on BERT model
CN114398482A (en)*2021-12-062022-04-26腾讯数码(天津)有限公司 A dictionary construction method, device, electronic device and storage medium
CN114548095A (en)*2021-12-232022-05-27北京三快在线科技有限公司Entity recognition model training method and device
CN114548095B (en)*2021-12-232025-04-29北京三快在线科技有限公司 A method and device for training entity recognition model
CN114385784B (en)*2021-12-232024-12-24沈阳东软智能医疗科技研究院有限公司 Named entity recognition method, device, medium and electronic device
CN114385784A (en)*2021-12-232022-04-22沈阳东软智能医疗科技研究院有限公司Named entity identification method, device, medium and electronic equipment
CN114372468A (en)*2022-01-132022-04-19大连海事大学BERT-based maritime mail named entity identification method
CN114528840A (en)*2022-01-212022-05-24深圳大学Chinese entity identification method, terminal and storage medium fusing context information
CN114443848A (en)*2022-01-292022-05-06中国建设银行股份有限公司Credit certificate clause identification method and device
CN114692635A (en)*2022-02-232022-07-01北京快确信息科技有限公司Information analysis method and device based on vocabulary enhancement and electronic equipment
CN114780677A (en)*2022-04-062022-07-22西安电子科技大学Chinese event extraction method based on feature fusion
CN114757184B (en)*2022-04-112023-11-10中国航空综合技术研究所Method and system for realizing knowledge question and answer in aviation field
CN114757184A (en)*2022-04-112022-07-15中国航空综合技术研究所Method and system for realizing knowledge question answering in aviation field
CN115221265A (en)*2022-06-242022-10-21浙江嘉兴数字城市实验室有限公司Method for identifying event element named entities in social management field based on BilSTM-CRF
CN115099230A (en)*2022-07-012022-09-23联洋国融(北京)科技有限公司BERT model-based multi-target task credit risk identification method and system
CN115292490A (en)*2022-08-022022-11-04福建省科立方科技有限公司Analysis algorithm for policy interpretation semantics
CN115270803A (en)*2022-09-302022-11-01北京道达天际科技股份有限公司Entity extraction method based on BERT and fused with N-gram characteristics
CN115859977A (en)*2022-10-312023-03-28浙江工业大学Named entity identification method based on fusion sequence characteristics
CN116629267A (en)*2023-07-212023-08-22云筑信息科技(成都)有限公司Named entity identification method based on multiple granularities
CN116629267B (en)*2023-07-212023-12-08云筑信息科技(成都)有限公司Named entity identification method based on multiple granularities
CN119046468A (en)*2024-10-302024-11-29之江实验室Vertical domain entity expansion method and device based on large language model
CN119046468B (en)*2024-10-302025-04-08之江实验室Vertical domain entity expansion method and device based on large language model

Similar Documents

Publication | Publication Date | Title
CN112836046A (en) A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund
CN110427623B (en)Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
Bharadiya: A comprehensive survey of deep learning techniques natural language processing
CN112115238B (en)Question-answering method and system based on BERT and knowledge base
CN111444726B (en)Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN109857990B (en) A financial announcement information extraction method based on document structure and deep learning
Chalkidis et al.: Obligation and prohibition extraction using hierarchical RNNs
Palmer et al.: Adaptive multilingual sentence boundary disambiguation
US20230069935A1 (en)Dialog system answering method based on sentence paraphrase recognition
CN109933796A (en) Method and device for extracting key information from announcement text
CN112528649B (en)English pinyin identification method and system for multi-language mixed text
CN110457690A (en) A Method for Judging the Inventiveness of a Patent
CN102360346B (en) Text Reasoning Approach Based on Restricted Semantic Dependency Analysis
CN116341519B (en) Method, device and storage medium for extracting event causal relationships based on background knowledge
CN113010684B (en)Construction method and system of civil complaint judging map
CN115796182A (en)Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN112613316A (en)Method and system for generating ancient Chinese marking model
CN117194682B (en)Method, device and medium for constructing knowledge graph based on power grid related file
CN112257442A (en)Policy document information extraction method based on corpus expansion neural network
CN113869055A (en) Feature Attribute Recognition Method of Power Grid Project Based on Deep Learning
CN113869054A (en) A feature recognition method of power field project based on deep learning
CN115510230A (en)Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
Tummala: Text Summarization based Named Entity Recognition for Certain Application using BERT
Kim et al.: Inherent risks identification in a contract document through automated rule generation
CN110909547A (en)Judicial entity identification method based on improved deep learning

Legal Events

Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication | Application publication date: 2021-05-25
