CN112632997A - Chinese entity identification method based on BERT and Word2Vec vector fusion - Google Patents

Chinese entity identification method based on BERT and Word2Vec vector fusion

Info

Publication number
CN112632997A
Authority
CN
China
Prior art keywords
word
bert
word vector
sentence
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011462808.3A
Other languages
Chinese (zh)
Inventor
张有强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Engineering
Original Assignee
Hebei University of Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Engineering
Priority to CN202011462808.3A
Publication of CN112632997A
Legal status: Pending


Abstract


The invention discloses a Chinese entity recognition method based on BERT and Word2Vec vector fusion. The method comprises three stages. First, a massive text corpus is preprocessed and then used to train the BERT and Word2Vec models, yielding a pre-trained BERT model and a static word vector table. Next, the text to be recognized is matched against the word vector table to obtain the candidate word vectors of each character; the candidate word vectors of each character are fused through two designed fusion strategies and then concatenated with the character vectors output by BERT. Finally, the concatenated vectors are input into a Bi-LSTM-CRF to train the entity recognition model. The method indirectly introduces word boundary information through word vector fusion and concatenation, and uses BERT to obtain context-dependent character vectors, fully capturing the polysemy of Chinese characters.

Description

Chinese entity recognition method based on BERT and Word2Vec vector fusion

Technical Field

The invention belongs to the field of named entity recognition, and in particular relates to a Chinese entity recognition method based on BERT and Word2Vec vector fusion.

Background Art

Named entity recognition is the task of identifying and classifying entity mentions of specified types in text; common entity types include person names, place names, and organization names. With network data growing rapidly, named entity recognition provides strong support for data mining, and it is also an important component of tasks such as information retrieval, question answering systems, and knowledge graphs. Commonly used named entity recognition methods fall into three categories: rule- and dictionary-based methods, statistical machine learning methods, and deep learning methods.

Rule- and dictionary-based methods rely on linguistic experts to manually design rule templates and to select features that describe entities of predefined types, including statistical information, keywords, indicator words, position words, and punctuation marks. Combined with a domain dictionary, entities are recognized by matching rule templates against strings.

Statistical machine learning methods treat named entity recognition as a sequence labeling task. These methods do not require experts with deep linguistic knowledge to select and design features; ordinary researchers can select feature sets that effectively reflect the characteristics of the target entities, including word features, context features, part-of-speech features, and semantic features. Models are usually trained on manually annotated corpora; commonly used machine learning models include the hidden Markov model, the maximum entropy model, the support vector machine, and the conditional random field.

Deep learning methods enable end-to-end model training and avoid manual feature selection and design. With the application of artificial neural networks to word embedding, unsupervised pre-training on large amounts of unlabeled text yields low-dimensional, dense word vectors that better capture word meaning; commonly used word vector training models include Word2Vec and GloVe. Deep learning models commonly used for feature extraction include convolutional neural networks and recurrent neural networks, among which the bidirectional long short-term memory (Bi-LSTM) network is the most classic and one of the most effective; label decoding generally uses a conditional random field (CRF) model.

Pre-trained language model methods perform unsupervised pre-training of a language model on massive text. The most commonly used pre-trained language model is BERT (Bidirectional Encoder Representations from Transformers); the obtained pre-trained model is then fine-tuned on an entity recognition dataset to perform entity recognition.

However, the above techniques have the following drawbacks:

Rule- and dictionary-based methods are strongly domain-specific, and a finite set of rules cannot cover all linguistic phenomena, so these methods lack robustness and portability.

Statistical machine learning methods require manual selection and combination of features, and human language use is often highly arbitrary; purely statistical methods lead to an enormous state search space and poor entity recognition performance.

Deep learning methods use models such as Word2Vec to train fixed static word vectors to represent word semantics, which cannot resolve polysemy; moreover, word segmentation errors propagate and degrade entity recognition performance.

Methods that fine-tune a pre-trained BERT language model usually involve a huge number of model parameters, so training and prediction take a long time, and high demands are placed on hardware for both training and deployment.

Summary of the Invention

The purpose of the present invention is to solve the problems of the prior art listed above by providing a Chinese entity recognition scheme based on BERT and Word2Vec vector fusion that improves the efficiency of model training and prediction while preserving entity recognition performance.

To achieve the above purpose, the technical solution adopted by the present invention is as follows: the BERT model is used to obtain dynamic character vectors containing context information, and the Word2Vec model is used to obtain static word vectors; the candidate word vectors are then fused through two word vector fusion strategies; finally, the character vectors and the fused word vectors are concatenated as the input vectors of the subsequent model, and the classic Bi-LSTM-CRF model is used for feature encoding and label decoding.

The Chinese entity recognition method based on BERT and Word2Vec vector fusion specifically comprises the following steps:

Step 1. Obtain a massive Chinese text corpus, segment the text with the jieba module in Python, train a Word2Vec model, and obtain a static word vector table.
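For illustration only, step 1 might be sketched as follows with jieba and gensim; the corpus filename and the hyperparameters (vector size, window, minimum count) are assumptions, not values fixed by the invention.

```python
import jieba
from gensim.models import Word2Vec

# Segment each line of the raw corpus into words with jieba.
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    sentences = [list(jieba.cut(line.strip())) for line in f if line.strip()]

# Train Word2Vec; sg=1 selects the Skip-gram architecture described later.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)

# The static word vector table maps each word to a d-dimensional vector.
word_vectors = model.wv
if "广州" in word_vectors:
    print(word_vectors["广州"].shape)  # (100,)
```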

Step 2. Pre-train the BERT model, constructing the Chinese text into the input format required by the BERT model. This is divided into the following sub-steps:

2.1 For the original corpus, separate sentences by line breaks and separate contexts by blank lines.

2.2 Construct the samples required by BERT's next sentence prediction pre-training task: a positive sample consists of two consecutive input sentences that have a contextual relationship; a negative sample consists of two randomly selected sentences with no semantic relationship.

2.3 For sentences exceeding the set maximum length, randomly truncate from either the beginning or the end of the sentence.

2.4 Join the two input sentences with a [SEP] tag, add a [CLS] tag at the beginning of the whole sequence and a [SEP] tag at the end; if the sequence is shorter than the maximum length, pad it with [PAD] tags.

2.5 Construct the samples required by BERT's masked language model pre-training task: randomly select 15% of the characters in a sentence for masking; replace a selected character with [MASK] 80% of the time, with a randomly chosen character 10% of the time, and leave it unchanged 10% of the time.
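A minimal sketch of the 15% / 80% / 10% / 10% masking rule of sub-step 2.5 is given below for illustration; the character-level tokenization and toy vocabulary are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """BERT-style masking: of the ~15% selected positions,
    80% become [MASK], 10% a random character, 10% stay unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)   # original characters the model must predict
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original character unchanged
    return masked, labels

chars = list("广州市长隆公园")
print(mask_tokens(chars, vocab=chars))
```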

Step 3. Train the BERT model on the above two pre-training tasks; the training targets are, respectively, predicting whether the currently input sentence pair consists of contextually related sentences and predicting the original content of the masked characters, finally obtaining the pre-trained BERT model.

Step 4. Acquire, preprocess, and annotate a Chinese named entity recognition dataset. Annotation generally uses the BIO scheme, where B marks the first character of an entity, I marks the middle and final characters of an entity, and O marks non-entity characters.
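For illustration of the BIO scheme just described, one possible character-level labeling follows; treating "广州市" and "长隆公园" as two separate location entities is an assumption of this example.

```python
# "广州市长隆公园" labeled character by character under the BIO scheme.
chars  = ["广", "州", "市", "长", "隆", "公", "园"]
labels = ["B-LOC", "I-LOC", "I-LOC", "B-LOC", "I-LOC", "I-LOC", "I-LOC"]
```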

Step 5. Preprocess the dataset obtained in step 4: add a [CLS] tag to the beginning and a [SEP] tag to the end of each sentence, input the processed sentences into the pre-trained BERT model obtained in step 3, and obtain the character vector output by the BERT model for each character in the sentence.

Step 6. For each sentence in the dataset obtained in step 4, obtain all candidate words contained in the sentence by matching against the vocabulary, query the static word vector table obtained in step 1 for the word vector of each candidate word, and fuse the word vectors of the candidate words corresponding to each character through two vector fusion strategies to represent the semantic meaning of each character at the word level. The two word vector fusion strategies are as follows:

6.1 Word vector fusion strategy one: sum the candidate word vectors of each character and take the mean. Taking the sentence "广州市长隆公园" (Guangzhou Changlong Park) as an example, the character "广" has two candidate words, "广州" and "广州市"; first query the word vector table for the vectors of the two words, then sum and average them as the word vector part of the character "广".

6.2 Word vector fusion strategy two: take word frequency as the weight and compute a weighted sum of the candidate word vectors of each character. With the same example, first count the total numbers of occurrences of "广州" and "广州市" in the dataset, then divide the count of each word by the total of the two words to obtain the weight of its word vector, and finally multiply the weights by the word vectors and sum the results as the word vector part of the character "广". The remaining characters are handled in the same way. When a character has no matching word, the word vector of the [None] tag, with the same dimension as the other word vectors, is used as its word vector part.

Step 7. Concatenate the word vector of each character obtained in step 6 with the character vector of each character obtained in step 5 to obtain the final vector of each character.

Step 8. Input the vectors obtained in step 7 into the Bi-LSTM-CRF model for training and prediction to obtain the entity recognition result.

The beneficial effects of the present invention are:

1. Aiming at the weak expressive power of traditional word vectors, the invention proposes using a pre-trained BERT model to obtain dynamic character vectors containing context information, enhancing the semantic representation of characters and resolving polysemy.

2. To mitigate the word segmentation errors that arise when using traditional word vectors and to better introduce word and entity boundary information, a word vector fusion strategy is proposed, and word frequency information is introduced to give higher weights to more likely word vectors, reducing the impact of incorrect segmentation.

3. By concatenating word vectors with character vectors, character- and word-level information is fused, enriching the feature representation of the initial vectors and improving the precision and recall of entity recognition.

4. The invention improves the representation of the input vector without modifying the structure of the feature encoding model, so it can also be applied to feature encoding models other than Bi-LSTM, giving it strong flexibility.

5. To reduce model training time, the pre-trained model is not fine-tuned; instead, character vectors are obtained by feature extraction, which greatly reduces the number of trainable parameters and improves training efficiency.

Description of the Drawings

Figure 1 is a schematic flowchart of the Chinese entity recognition method based on BERT and Word2Vec vector fusion of the present invention;

Figure 2 is a schematic diagram of the overall structure of the Chinese entity recognition model based on BERT and Word2Vec vector fusion according to an embodiment of the present invention;

Figure 3 is a schematic structural diagram of the BERT pre-trained language model according to an embodiment of the present invention;

Figure 4 is a schematic structural diagram of the Skip-gram model in Word2Vec according to an embodiment of the present invention.

Detailed Description

To make the technical problems, technical solutions, and beneficial effects addressed by the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it.

As shown in Figure 1, the Chinese entity recognition method based on BERT and Word2Vec vector fusion of the present invention specifically comprises the following steps:

Step 1. Obtain the training corpus for the Word2Vec model and preprocess it.

Step 2. Train the Skip-gram model in Word2Vec on the corpus preprocessed in step 1. As shown in Figure 4, the model predicts the context words within a window of specified size from the input center word. After training, the weight matrix of the projection layer is the word vector table W ∈ R^{|V|×d}, where |V| is the vocabulary size and d is the word vector dimension.

Step 3. Obtain the word vector of each word by looking it up in the static word vector table obtained in step 2:

e_w(w_i) = W^T v_i

where v_i is a one-hot vector of length |V| whose value is 1 in the dimension corresponding to w_i and 0 in the other dimensions.
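As a sketch, the lookup e_w(w_i) = W^T v_i is equivalent to selecting row i of W; the sizes below are toy assumptions.

```python
import numpy as np

V, d = 5, 4                    # toy vocabulary size |V| and vector dimension d
W = np.random.randn(V, d)      # word vector table from Skip-gram training

i = 2                          # vocabulary index of the word w_i
v = np.zeros(V)
v[i] = 1.0                     # one-hot vector of length |V|

e = W.T @ v                    # e_w(w_i) = W^T v_i, a d-dimensional vector
assert np.allclose(e, W[i])    # identical to simply taking row i of W
```

In practice the one-hot multiplication is never materialized; the row of W is indexed directly.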

Step 4. Pre-train the BERT language model on the corpus preprocessed in step 1, or alternatively download an existing pre-trained Chinese BERT model.

Step 5. Input the entity recognition dataset into the BERT model to obtain character vectors that reflect the specific context:

e_c(c_i) ∈ R^l, i = 1, 2, …, n

where c_i denotes each character in the sentence and l denotes the dimension of the character vector.
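A minimal sketch of this feature-extraction step using the Hugging Face transformers library follows; the checkpoint name "bert-base-chinese" is an assumption (the invention pre-trains its own model), and BERT is kept frozen as described in step 12 below.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese").eval()

sentence = "广州市长隆公园"
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS] and [SEP]

with torch.no_grad():              # feature extraction: BERT stays frozen
    outputs = bert(**inputs)

# One l-dimensional vector per position, including [CLS] and [SEP].
char_vectors = outputs.last_hidden_state[0]
print(char_vectors.shape)          # (9, 768) for this 7-character sentence
```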

Step 6. Match the input sentence against the pre-trained vocabulary to obtain the candidate word vectors e_w of each character, as shown in Figure 2, and then fuse the candidate word vectors through one of the word vector fusion strategies. Strategy one is summation with averaging, computed as:

e_w(c) = (1/N) Σ_{w ∈ S} e_w(w), if S ≠ ∅
e_w(c) = e_w(None), if S = ∅

where e_w(w) denotes the word vector of word w, S denotes the set of candidate words corresponding to the character, N denotes the number of words in the set, e_w(None) denotes the word vector of the [None] tag, and S = ∅ indicates that the set is empty, i.e., the character has no matching words.

Strategy two is frequency-weighted summation, computed as:

e_w(c) = Σ_{w ∈ S} ( z(w) / Σ_{w' ∈ S} z(w') ) e_w(w), if S ≠ ∅
e_w(c) = e_w(None), if S = ∅

where z(w) denotes the frequency of word w, obtained by counting how often each word occurs in the training and test sets; the other parameters are as above.

The fused word vector is concatenated with the character vector output by BERT to obtain the final vector representation of each character:

e_i = e_c(c_i) ⊕ e_w(c_i)

where ⊕ denotes vector concatenation.
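Both fusion strategies and the final concatenation can be sketched directly from the formulas above; the word vector table, frequency counts, and toy dimensions below are assumptions.

```python
import numpy as np

def fuse_mean(cands, table, none_vec):
    """Strategy one: average the candidate word vectors, e_w(None) if S is empty."""
    if not cands:
        return none_vec
    return np.mean([table[w] for w in cands], axis=0)

def fuse_freq(cands, table, freq, none_vec):
    """Strategy two: frequency-weighted sum of the candidate word vectors."""
    if not cands:
        return none_vec
    total = sum(freq[w] for w in cands)
    return sum((freq[w] / total) * table[w] for w in cands)

d = 4
table = {"广州": np.ones(d), "广州市": 2 * np.ones(d)}  # toy word vectors
freq = {"广州": 30, "广州市": 10}                       # toy corpus counts
none_vec = np.zeros(d)                                  # vector of the [None] tag

cands = ["广州", "广州市"]        # candidate words of the character "广"
print(fuse_mean(cands, table, none_vec))        # [1.5 1.5 1.5 1.5]
print(fuse_freq(cands, table, freq, none_vec))  # [1.25 1.25 1.25 1.25]

e_c = np.ones(3)                        # toy BERT character vector for "广"
e_w = fuse_freq(cands, table, freq, none_vec)
e_final = np.concatenate([e_c, e_w])    # e_i = e_c(c_i) ⊕ e_w(c_i)
print(e_final.shape)                    # (7,) = l + d for the toy sizes
```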

Step 7. Input the character vector of each character in the sentence into the LSTM model to learn long-range dependencies within the sentence. The LSTM controls and maintains the flow of information through an input gate, a forget gate, and an output gate, parameterized as follows:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)

h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid activation function, tanh is the tanh activation function, ⊙ denotes element-wise multiplication, W and U denote the weight matrices corresponding to each gate, b denotes the bias, x_t denotes the input vector at the current time step obtained in step 6, and h_{t-1} and c_{t-1} denote the output and the cell state at the previous time step, respectively.

Step 8. As shown in Figure 2, the Bi-LSTM comprises a forward pass and a backward pass and can encode bidirectional language information. For the input sentence vector sequence S = {e_1, e_2, …, e_n}, e_i ∈ R^{1×(d+l)}, where 1 ≤ i ≤ n and d and l denote the dimensions of the word vector and the character vector, respectively, the forward pass is:

h→_t = LSTM(h→_{t-1}, e_t)

and the backward pass is:

h←_t = LSTM(h←_{t+1}, e_t)

where h→_{t-1} is the forward hidden state at time t-1, h←_{t+1} is the backward hidden state at time t+1, and e_t is the input vector at time t.

Step 9. Finally, the outputs of the forward and backward LSTMs are concatenated to obtain the hidden state h_t at time t:

h_t = [h→_t ; h←_t]
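Steps 8 and 9 can be sketched with PyTorch's built-in bidirectional LSTM, which runs both passes and concatenates their outputs per time step; the dimensions (e.g., l = 768 from BERT plus d = 100 from Word2Vec) are assumptions.

```python
import torch
import torch.nn as nn

n, d_plus_l, hidden = 7, 868, 128   # sentence length, input dim d+l, hidden size

# bidirectional=True performs the forward and backward passes; per time step
# the two hidden states are concatenated, i.e., h_t = [h→_t ; h←_t].
bilstm = nn.LSTM(input_size=d_plus_l, hidden_size=hidden,
                 batch_first=True, bidirectional=True)

S = torch.randn(1, n, d_plus_l)     # the input sequence {e_1, ..., e_n}
H, _ = bilstm(S)
print(H.shape)                      # torch.Size([1, 7, 256]): 2 * hidden per step
```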

Step 10. On top of the Bi-LSTM output, the CRF layer takes the transition information between labels into account and can obtain the globally optimal label sequence. The computation is as follows:

s(x, y) = Σ_{i=0}^{n} W_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where s denotes the evaluation score, W is the transition matrix between labels, and P denotes the score of the corresponding label. Based on the evaluation score, the probability of label sequence y for input sequence x is:

P(y | x) = exp(s(x, y)) / Σ_{y'} exp(s(x, y'))

Step 11. The training loss function is the negative log-likelihood:

L = -log P(y | x) = -s(x, y) + log Σ_{y'} exp(s(x, y'))
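A sketch of the score and loss of steps 10 and 11 for a toy tag set, enumerating all label sequences for clarity; a real CRF layer computes the normalizer with dynamic programming, and the start/end transitions of the full formula are omitted here.

```python
import itertools
import torch

n_tags, n = 3, 4                 # toy: |{B, I, O}| tags, sentence length 4
torch.manual_seed(0)
P = torch.randn(n, n_tags)       # per-position label scores from the Bi-LSTM
W = torch.randn(n_tags, n_tags)  # label transition matrix

def score(y):
    """s(x, y): transition scores plus per-position label scores."""
    s = sum(W[y[i], y[i + 1]] for i in range(n - 1))
    return s + sum(P[i, y[i]] for i in range(n))

y_gold = [0, 1, 1, 2]            # e.g., B, I, I, O
all_scores = torch.stack([score(list(y))
                          for y in itertools.product(range(n_tags), repeat=n)])
loss = -score(y_gold) + torch.logsumexp(all_scores, dim=0)  # L = -log P(y | x)
print(loss.item())
```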

This concludes the flow of the specific embodiment.

Step 12. When training the parameters of the Bi-LSTM-CRF model based on BERT and Word2Vec vector fusion, the present invention takes the annotated text and labels as input and trains the model by gradient descent or other optimization methods. During training, only the parameters of the Bi-LSTM layer and the CRF layer are updated, while the BERT model parameters remain unchanged. Training is terminated when the loss produced by the model meets the set requirement or the maximum number of iterations is reached.
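A sketch of this training regime in PyTorch, with small stand-in modules so the snippet is self-contained; only the Bi-LSTM and CRF parameters receive gradient updates, while the stand-in for BERT stays frozen.

```python
import torch
import torch.nn as nn

# Stand-ins for the real modules, to keep the sketch self-contained.
bert = nn.Linear(16, 8)                      # plays the role of the frozen BERT
bilstm = nn.LSTM(8, 4, batch_first=True, bidirectional=True)
crf_W = nn.Parameter(torch.randn(3, 3))      # CRF transition matrix (3 tags)

for p in bert.parameters():                  # BERT is a fixed feature extractor;
    p.requires_grad = False                  # its parameters are never updated

optimizer = torch.optim.Adam(list(bilstm.parameters()) + [crf_W], lr=1e-3)

x = torch.randn(1, 7, 16)                    # one toy sentence of 7 characters
with torch.no_grad():
    feats = bert(x)                          # frozen character vectors
H, _ = bilstm(feats)                         # only these layers get gradients
loss = H.pow(2).mean() + crf_W.pow(2).mean() # placeholder for the CRF loss
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

In a full run this update would loop over the dataset until the loss threshold or the maximum number of iterations is reached, as described above.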

The foregoing illustrates and describes preferred embodiments of the present invention. As stated above, it should be understood that the invention is not limited to the forms disclosed herein, should not be regarded as excluding other embodiments, and may be used in various other combinations, modifications, and environments; within the scope of the inventive concept described herein, it may be altered through the above teachings or through skill or knowledge in the relevant field. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims (6)

1. A Chinese entity recognition method based on BERT and Word2Vec vector fusion, characterized in that a BERT model is used to obtain a dynamic character vector for each character in a sentence, Word2Vec is used to obtain static word vectors, a plurality of candidate word vectors are fused through two designed fusion strategies and then concatenated with the character vectors, and the result is input into a Bi-LSTM-CRF for model training, whereby entities of specified types are automatically extracted from text.
2. The Chinese entity recognition method based on BERT and Word2Vec vector fusion according to claim 1, characterized in that the method specifically comprises the following steps:
step 1, acquiring a massive Chinese text corpus and preprocessing it, performing word segmentation on the text with the jieba module in Python, training a Word2Vec model, and acquiring a static word vector table;
step 2, pre-training the BERT model and constructing the Chinese text into the input format required by the BERT model, specifically comprising the following sub-steps:
2.1 for the original corpus, segmenting sentences by line breaks and segmenting context paragraphs by blank lines;
2.2 constructing the samples required by BERT's next sentence prediction pre-training task, wherein a positive sample represents two consecutive input sentences that have a contextual relationship, and a negative sample represents two randomly selected sentences without a semantic relationship;
2.3 for a sentence exceeding the set maximum length, randomly truncating it from the beginning or the end of the sentence;
2.4 connecting the two sentences to be input with a [SEP] tag, adding a [CLS] tag at the beginning of the whole sequence and a [SEP] tag at the end of the whole sequence;
2.5 constructing the samples required by the pre-training task of the BERT masked language model, randomly selecting 15% of the characters in a sentence for masking, replacing a selected character with [MASK] 80% of the time, replacing it with a randomly selected character 10% of the time, and keeping the original character unchanged 10% of the time;
step 3, training the BERT model on the two pre-training tasks, wherein the training targets are respectively predicting whether the currently input sentence pair consists of contextually related sentences and predicting the original content of the masked characters, finally obtaining the pre-trained BERT model;
step 4, acquiring, preprocessing, and annotating a Chinese named entity recognition dataset, wherein annotation generally adopts the BIO scheme, in which B denotes the first character of an entity, I denotes a middle or final character of an entity, and O denotes a non-entity character;
step 5, preprocessing the annotated dataset obtained in step 4, adding a [CLS] tag to the beginning and a [SEP] tag to the end of each sentence, inputting the processed sentences into the pre-trained BERT model of step 3, and obtaining the character vector of each character in the sentences output by the BERT model;
step 6, for each sentence in the dataset obtained in step 4, obtaining the word vectors of all candidate words contained in the sentence by matching against the vocabulary, and fusing the candidate word vectors corresponding to each character in the sentence through two word vector fusion strategies to express the semantic meaning of each character at the word level, specifically comprising the following two fusion strategies:
6.1 word vector fusion strategy one: summing and averaging the candidate word vectors of each character; taking the sentence "广州市长隆公园" (Guangzhou Changlong Park) as an example, the word vector table is first queried to obtain the word vectors of the two candidate words of the character "广", namely "广州" and "广州市", and the two word vectors are then summed and averaged as the word vector representation part of the character "广";
6.2 word vector fusion strategy two: taking word frequency as the weight and performing a weighted summation of the candidate word vectors of each character in the sentence; with the same example, first counting the total numbers of occurrences of "广州" and "广州市" in the dataset, then dividing the number of occurrences of each word by the total of the two words as the weight of its word vector, and finally multiplying the weights by the word vectors and summing the results as the word vector representation part of the character "广"; the remaining characters are handled in the same way, and when a character has no candidate word, its word vector part is represented by the word vector of [None], whose dimension is the same as that of the other word vectors;
step 7, concatenating the word vector of each character obtained in step 6 with the character vector of each character obtained in step 5 to obtain the final vector of each character;
step 8, inputting the vectors obtained in step 7 into the Bi-LSTM-CRF model for training and prediction to obtain the entity recognition result.
3. The Chinese entity recognition method based on BERT and Word2Vec vector fusion according to claim 2, characterized in that the Chinese text preprocessing in steps 1 and 2 mainly comprises removing useless symbols and duplicate data from the text data obtained by crawlers or other means and normalizing the data format.
4. The Chinese entity recognition method based on BERT and Word2Vec vector fusion according to claim 3, characterized in that in step 2, sentences of insufficient length are padded with [PAD] tags, and the fixed-length sentences are finally input into the BERT model for training.
5. The Chinese entity recognition method based on BERT and Word2Vec vector fusion according to claim 4, characterized in that the vocabulary in step 6 is likewise the word vector table obtained by Word2Vec training; when a sentence is input, the candidate word vectors of each character are first obtained by querying the word vector table, and one of the two vector fusion strategies is then selected to fuse the word vectors.
6. The Chinese entity recognition method based on BERT and Word2Vec vector fusion according to claim 5, characterized in that in step 8 the whole model can be regarded as three layers, namely an input vector representation layer based on the BERT and Word2Vec models, a Bi-LSTM-based context encoding layer, and a CRF-based label decoding layer; the static word vector obtained with Word2Vec and the dynamic character vector obtained with BERT are concatenated as the input vector, the Bi-LSTM layer is responsible for feature encoding of the input vector, and the CRF layer selects the optimal label sequence by learning the transition probabilities between labels.
Priority Applications (1)

CN202011462808.3A — priority/filing date 2020-12-14 — Chinese entity identification method based on BERT and Word2Vec vector fusion (pending)

Applications Claiming Priority (1)

CN202011462808.3A — priority/filing date 2020-12-14 — Chinese entity identification method based on BERT and Word2Vec vector fusion

Publications (1)

CN112632997A — published 2021-04-09

Family

ID=75312414

Family Applications (1)

CN202011462808.3A — filed 2020-12-14 — status: Pending

Country Status (1)

CN — CN112632997A (en)



Legal Events

PB01 — Publication
WD01 — Invention patent application deemed withdrawn after publication
Application publication date: 2021-04-09

