Technical Field
The present invention relates to the field of information extraction, and in particular to a relation extraction method that combines clause-level distant supervision with semi-supervised ensemble learning.
Background
Information extraction refers to the process of extracting entities, events, relations, and other kinds of information from a piece of text, forming structured data that is stored in a database for users to query and use. Relation extraction, a key part of information extraction, aims to identify the semantic relations that hold between entities. Relation extraction technology has broad application prospects in areas such as automatic question answering, massive-scale information processing, automatic knowledge base construction, search engines, and domain-specific text mining.
Traditional relation extraction research generally adopts supervised machine learning methods. These methods treat relation extraction as a classification problem and train a relation classifier on manually annotated data using extracted lexical and syntactic features, which can achieve reasonable classification performance. However, because costly manual annotation is required, the relation types that supervised methods can recognize are limited to specific domains, and such methods cannot scale to massive web text.
To address the shortage of manually annotated data in supervised relation extraction, researchers proposed distant supervision, a method for automatically generating labeled data. It assumes that if two entities participate in some semantic relation, then every sentence containing both entities expresses that relation to some degree. Based on this assumption, distant supervision exploits the large number of relation triples contained in a knowledge base: by aligning them with the text of a training corpus, it can generate a large amount of labeled data. Distant supervision thus solves the data-shortage problem of supervised relation extraction, but because its assumption does not always hold, the generated data contains a large amount of wrongly labeled data (i.e., noisy data), which harms the relation extraction model.
To deal with this noise, existing approaches generally modify the relation extraction model so as to reduce the negative influence of the noisy data. Although such approaches achieve some improvement, they do not solve the noise problem at its root.
In addition, relation extraction based on distant supervision generally underuses negative-example data. The number of negative relation instances generated by distant supervision far exceeds the number of positive instances, so the feature data set contains far more negative than positive data. To keep the positive and negative training data balanced, the training set is usually built from all of the positive data and only a small portion of the negative data, leaving most of the negative data unused.
Summary of the Invention
To solve the problems of noisy data and underused negative-example data in relation extraction, the present invention provides a relation extraction method that combines clause-level distant supervision with semi-supervised ensemble learning. The method both removes noisy data and makes full use of the negative-example data.
A relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning mainly includes the following steps:
Step 1: Align the relation triples in a knowledge base to a corpus via distant supervision to construct a set of relation instances.
Step 2: Remove noisy data from the relation instance set using clause recognition based on syntactic parsing.
Step 3: Extract the lexical features of each relation instance, convert them into distributed representation vectors, and construct a feature data set.
Step 4: Select all of the positive data and a small portion of the negative data from the feature data set to form a labeled data set; strip the labels from the remaining negative data to form an unlabeled data set; and train relation classifiers with a semi-supervised ensemble learning algorithm.
In Step 1, the relation triples in knowledge base K are aligned to corpus D via distant supervision to construct the relation instance set Q = {q_n | q_n = (s_m, e_i, r_k, e_j), s_m ∈ D},
where q_n is a relation instance, s_m is a sentence, e_i and e_j are entities, and r_k is the relation that holds between e_i and e_j.
If sentence s_m contains both entity e_i and entity e_j, and the relation triple (e_i, r_k, e_j) exists in knowledge base K, then q_n = (s_m, e_i, r_k, e_j) is a positive relation instance; some relation instances that do not satisfy this condition are also selected as negative relation instances.
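The alignment in Step 1 can be sketched as follows. This is a minimal illustration of the distant-supervision assumption, not the invention's implementation: the toy knowledge base, the substring-based entity matching, and the "NA" label for negative instances are all assumptions made for the sketch.

```python
def build_instances(kb_triples, corpus):
    """Return (positive, negative) relation instances.

    A sentence s_m containing both e_i and e_j of a KB triple
    (e_i, r_k, e_j) yields a positive instance (s_m, e_i, r_k, e_j);
    co-occurring entity pairs with no KB relation yield negative
    instances, labeled here with the artificial relation "NA".
    """
    entities = {e for (ei, _, ej) in kb_triples for e in (ei, ej)}
    kb = {(ei, ej): rk for (ei, rk, ej) in kb_triples}
    positives, negatives = [], []
    for sentence in corpus:
        # Naive substring matching stands in for real entity linking.
        present = [e for e in entities if e in sentence]
        for ei in present:
            for ej in present:
                if ei == ej:
                    continue
                if (ei, ej) in kb:
                    positives.append((sentence, ei, kb[(ei, ej)], ej))
                else:
                    negatives.append((sentence, ei, "NA", ej))
    return positives, negatives
```

With a one-triple knowledge base, a sentence mentioning both entities yields a positive instance, while the reversed entity pair (absent from the KB) becomes a negative candidate.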
Step 2 proceeds as follows:
Step 2-1: Parse the sentence s_m of each relation instance q_n with a probabilistic context-free grammar to obtain its parse tree, and divide s_m into clauses according to the structural relations among its words represented by the tree.
Step 2-2: Judge whether q_n is noisy data according to whether its entity pair (e_i, e_j) co-occurs within a single clause of s_m; if q_n is noisy, remove it from the relation instance set Q.
If q_n = (s_m, e_i, r_k, e_j) is a positive relation instance and the entity pair (e_i, e_j) does not appear together in any single clause of s_m, then q_n is considered noisy data and is removed from Q.
If q_n = (s_m, e_i, r_k, e_j) is a negative relation instance and the entity pair (e_i, e_j) does appear together in some clause of s_m, then q_n is considered noisy data and is removed from Q.
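The filtering rule of Step 2-2 can be sketched as below. The sketch assumes clause segmentation has already been produced by a parser (clauses are passed in as strings rather than derived from a PCFG parse tree), and uses naive substring matching; both are simplifications for illustration.

```python
def is_noise(clauses, ei, ej, is_positive):
    """Apply the clause-level heuristic of Step 2-2:

    - a positive instance is noise if NO single clause contains both
      entities (the sentence likely does not express the relation);
    - a negative instance is noise if SOME clause contains both
      entities (it may in fact express a relation).
    """
    together = any(ei in c and ej in c for c in clauses)
    return (not together) if is_positive else together

def filter_instances(instances, clause_fn):
    """Drop noisy instances from a list of (s, e_i, r_k, e_j, is_positive)
    tuples; clause_fn maps a sentence to its list of clauses."""
    return [(s, ei, rk, ej, pos) for (s, ei, rk, ej, pos) in instances
            if not is_noise(clause_fn(s), ei, ej, pos)]
```

A positive instance whose entities share a clause is kept; one whose entities are split across clauses is discarded, and the rule is mirrored for negatives.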
Step 3 proceeds as follows:
Step 3-1: Extract the lexical features lex_n of each relation instance q_n in Q.
Step 3-2: Convert the lexical features lex_n into distributed representation vectors v_n and construct the feature data set M.
In Step 3-1, for a relation instance q_n = (s_m, e_i, r_k, e_j), the lexical features lex_n are the entity pair (e_i, e_j) itself together with the context of (e_i, e_j) in sentence s_m; the specific lexical feature types are listed in Table 1.
Table 1. Lexical feature types
In Step 3-2, each lex_n is converted into a distributed representation vector v_n, and all v_n are collected to form the feature data set M. The vectorized lexical features of positive relation instances in Q become the positive data of M, and those of negative relation instances become the negative data of M.
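One way Step 3-2 can be realized is sketched below. The invention does not fix the exact vectorization scheme here, so this sketch assumes a pretrained word-embedding lookup (a toy dict stands in for it) and concatenation of per-feature embeddings; both choices are assumptions for illustration.

```python
def featurize(lex_features, embeddings, dim):
    """Build v_n by concatenating the embedding of each lexical
    feature token; unknown tokens fall back to a zero vector of the
    same dimension so every v_n has length len(lex_features) * dim."""
    vec = []
    for tok in lex_features:
        vec.extend(embeddings.get(tok, [0.0] * dim))
    return vec
```

In practice the embedding table would come from a model such as word2vec; the fixed output length is what lets all v_n be stacked into a single feature data set M.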
Step 4 proceeds as follows:
Step 4-1: Select all of the positive data and a small portion of the negative data in M to form the labeled data set L; the remaining negative data, with labels removed, forms the unlabeled data set U.
Step 4-2: Draw n initial sample sets L_1, L_2, ..., L_n from L by sampling with replacement.
Step 4-3: Train each relation classifier C_i on its initial sample set L_i together with the high-confidence unlabeled sample set U_{i,t-1} selected in round t-1, where i = 1, 2, ..., n.
Step 4-4: The n relation classifiers C_1, C_2, ..., C_n each predict the class labels of the unlabeled samples x_u in the unlabeled data set U, and a high-confidence unlabeled sample set F_{i,t} is generated by voting.
Step 4-5: According to the filtering criteria, select a certain number of unlabeled samples x_u from the high-confidence set F_{i,t} for the i-th classifier C_i to form U_{i,t}; in the next iteration these are added to the training set of C_i, and C_i is retrained.
Step 4-6: Repeat steps 4-3, 4-4, and 4-5. Training stops when every U_{i,t} is empty, i.e., no new unlabeled samples x_u are added to any training set, or when the preset maximum number of iterations is reached.
In Step 4-3, U_{i,t-1} denotes the set of unlabeled samples selected for the i-th classifier C_i in round t-1; each element consists of an unlabeled sample x_u from U together with the class label obtained in round t-1, where t ≥ 2. When t = 1, U_{i,t-1} is the empty set.
Note that unlabeled samples x_u added to the training set before round t-1 are removed from the training set and returned to the unlabeled sample set F_{i,t}; in each round the training set is extended only by the unlabeled samples added in the previous round.
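The outer loop of Step 4 (steps 4-2 through 4-6) can be sketched as follows. The concrete classifier training and sample selection are left abstract: `train_fn` and `select_fn` are placeholder callables standing in for the training and the voting/filtering procedures described in the text.

```python
import random

def semi_supervised_ensemble(L, U, n, max_rounds, train_fn, select_fn):
    """Iterative ensemble training: n bootstrap samples of L seed n
    classifiers; each round, a fresh high-confidence selection U_{i,t}
    is made for each classifier, and it is retrained on its bootstrap
    sample plus ONLY the newest selection (earlier selections return
    to the pool, as the text specifies)."""
    samples = [[random.choice(L) for _ in L] for _ in range(n)]  # with replacement
    selected = [[] for _ in range(n)]      # U_{i,t-1}, empty when t = 1
    classifiers = [train_fn(samples[i] + selected[i]) for i in range(n)]
    for t in range(1, max_rounds + 1):
        new_selected = [select_fn(i, classifiers, U) for i in range(n)]
        if all(len(s) == 0 for s in new_selected):
            break                          # no U_{i,t} gained new samples: stop
        selected = new_selected            # previous additions are dropped
        classifiers = [train_fn(samples[i] + selected[i]) for i in range(n)]
    return classifiers
```

The two stopping conditions of Step 4-6 appear as the `break` (all U_{i,t} empty) and the bound on `max_rounds`.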
In Step 4-4, F_{i,t} denotes the set of high-confidence unlabeled samples x_u selected in round t when the classifier is C_i; after filtering, the samples that remain form U_{i,t}.
For an unlabeled sample x_u, let h_i(x_u) denote the class label that the i-th classifier C_i predicts for x_u.
Let E denote the set of relation classifiers, and let E_i be E with C_i removed, i.e., E_i = {C_j ∈ E | j ≠ i}.
The class label of an unlabeled sample x_u is decided by a vote among the classifiers in E_i: the label receiving the most votes is taken as the label of x_u.
The degree of agreement among the predictions is the confidence. The classifiers in E_i compute the confidence of a sample from the consistency of their predicted labels, as given by Formula 1-1:

conf_i(x_u) = (1 / |E_i|) · Σ_{C_j ∈ E_i} I(h_j(x_u) = ŷ_u)    (Formula 1-1)

where ŷ_u is the majority-vote label of E_i, conf_i(x_u) denotes the confidence that the true class label of x_u is ŷ_u, and I(·) is an indicator function whose value is 0 if its input is false and 1 otherwise.
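A minimal sketch of the voting and confidence computation of Formula 1-1: the confidence is the fraction of classifiers in E_i that agree with the majority-vote label. The function name and list-based interface are illustrative.

```python
from collections import Counter

def vote_and_confidence(i, predictions):
    """predictions[j] is classifier C_j's predicted label for one
    sample x_u. Returns (y_hat, conf_i) where y_hat is the majority
    label among E_i (all classifiers except C_i) and conf_i is the
    fraction of E_i agreeing with y_hat."""
    others = [p for j, p in enumerate(predictions) if j != i]
    y_hat, votes = Counter(others).most_common(1)[0]
    return y_hat, votes / len(others)
```

For instance, with C_0 excluded and the remaining four classifiers voting "r1", "r1", "r1", "NA", the voted label is "r1" with confidence 3/4.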
High-confidence unlabeled samples x_u can effectively improve the classification accuracy of the relation classifiers. If, while guaranteeing high label confidence, one also considers the disagreement between C_i and E_i on the same sample, and thereby selects an unlabeled sample set F_{i,t} capable of correcting classifier C_i, the classification accuracy can be improved further.
Therefore, in round t, Formula 1-2 selects high-confidence unlabeled samples x_u for the i-th relation classifier:

F_{i,t} = { x_u ∈ U | conf_i(x_u) > θ ∧ h_i(x_u) ≠ ŷ_u }    (Formula 1-2)

where θ is a preset threshold: a sample is added to F_{i,t} only when its confidence exceeds θ and the prediction of C_i disagrees with that of E_i.
In Step 4-5, for an unlabeled sample x_u, let P(h_i(x_u)) denote the probability with which C_i predicts the output h_i(x_u). During filtering, both P(h_i(x_u)) and conf_i(x_u) are considered: the high-confidence unlabeled samples in F_{i,t} are sorted in descending order, first by conf_i(x_u) and then, when conf_i(x_u) is equal, by P(h_i(x_u)), so that samples with larger conf_i(x_u) rank first and ties are broken by larger P(h_i(x_u)). After sorting, the top m_{i,t} samples form U_{i,t}.
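The ranking rule of Step 4-5 can be sketched as a two-key sort. Each candidate is represented here as a (sample, confidence, probability) triple; this tuple layout is an assumption for the sketch.

```python
def pick_top(candidates, m):
    """candidates: list of (x_u, conf_i(x_u), P(h_i(x_u))) triples
    from F_{i,t}. Sort descending by confidence, breaking ties by the
    classifier's own predicted probability, and keep the top m as
    U_{i,t}."""
    ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
    return [x for (x, _, _) in ranked[:m]]
```

With equal confidences, the sample the classifier is more sure about wins the tie, matching the ordering described above.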
The present invention combines clause recognition with a semi-supervised ensemble learning algorithm, removing noise from the relation instances while making full use of the negative-example data. Compared with the prior art, the advantages of the present invention include:
(1) Clause recognition removes noisy data from the training data, improving the labeling accuracy of the training data and thereby the classification accuracy of relation extraction.
(2) Training the relation classifiers with a semi-supervised ensemble learning algorithm turns the negative-example data left unused by traditional relation extraction into unlabeled data (after label removal), raising the utilization of negative data and thereby the classification accuracy of relation extraction.
Brief Description of the Drawings
Fig. 1 is a flowchart of the relation extraction method combining clause recognition with semi-supervised ensemble learning;
Fig. 2 is a flowchart of the t-th iteration.
Detailed Description
To describe the present invention more concretely, the technical solution of the present invention is explained in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of a relation extraction method of the present invention combining clause-level distant supervision with semi-supervised ensemble learning. The method is divided into two stages: data processing and model training.
Data processing stage
The data processing steps are as follows:
Step a-1: Align the relation triples in knowledge base K to corpus D via distant supervision, constructing the relation instance set Q = {q_n | q_n = (s_m, e_i, r_k, e_j), s_m ∈ D}.
If sentence s_m contains both entities e_i and e_j, and the relation triple (e_i, r_k, e_j) exists in knowledge base K, then (s_m, e_i, r_k, e_j) is a positive relation instance; some relation instances that do not satisfy this condition are also selected as negative relation instances.
Step a-2: Parse the sentence s_m of each relation instance q_n with a probabilistic context-free grammar to obtain its parse tree, and divide s_m into clauses according to the structural relations among its words represented by the tree.
Step a-3: Judge whether q_n is noisy data according to whether its entity pair (e_i, e_j) co-occurs within a single clause of s_m; if q_n is noisy, remove it from the relation instance set Q.
If q_n = (s_m, e_i, r_k, e_j) is a positive relation instance and the entity pair (e_i, e_j) does not appear together in any single clause of s_m, then q_n is considered noisy data and is removed from Q.
If q_n = (s_m, e_i, r_k, e_j) is a negative relation instance and the entity pair (e_i, e_j) does appear together in some clause of s_m, then q_n is considered noisy data and is removed from Q.
Step a-4: Extract the lexical features lex_n of each relation instance q_n in Q.
For a relation instance q_n = (s_m, e_i, r_k, e_j), the lexical features lex_n are the entity pair (e_i, e_j) itself together with the context of (e_i, e_j) in sentence s_m; the specific lexical feature types are listed in Table 1.
Table 1. Lexical feature types
Step a-5: Convert the lexical features lex_n into distributed representation vectors v_n and construct the feature data set M.
Each lex_n is converted into a distributed representation vector v_n, and all v_n are collected to form the feature data set M. The vectorized lexical features of positive relation instances in Q become the positive data of M, and those of negative relation instances become the negative data of M.
Model training stage
Model training is an iterative learning process; its t-th iteration is shown in Fig. 2.
Step b-1: Select all of the positive data and a small portion of the negative data in the feature data set M to form the labeled data set, denoted L; the remaining negative data, with labels removed, serves as the unlabeled data set, denoted U.
Step b-2: Draw n initial sample sets L_1, L_2, ..., L_n from L by sampling with replacement.
Step b-3: Train each relation classifier C_i on its initial sample set L_i together with the high-confidence unlabeled sample set U_{i,t-1} selected in round t-1, where i = 1, 2, ..., n.
U_{i,t-1} denotes the set of unlabeled samples selected for the i-th classifier C_i in round t-1; each element consists of an unlabeled sample x_u from U together with the class label obtained in round t-1, where t ≥ 2. When t = 1, U_{i,t-1} is the empty set.
Note that unlabeled samples x_u added to the training set before round t-1 are removed from the training set and returned to the unlabeled sample set F_{i,t}; in each round the training set is extended only by the unlabeled samples added in the previous round.
Step b-4: The n relation classifiers C_1, C_2, ..., C_n each predict the class labels of the unlabeled samples x_u in the unlabeled data set U, and a high-confidence unlabeled sample set F_{i,t} is generated by voting.
F_{i,t} denotes the set of high-confidence unlabeled samples x_u selected in round t when the classifier is C_i; after filtering, the samples that remain form U_{i,t}.
For an unlabeled sample x_u, let h_i(x_u) denote the class label that the i-th classifier C_i predicts for x_u.
Let E denote the set of relation classifiers, and let E_i be E with C_i removed, i.e., E_i = {C_j ∈ E | j ≠ i}.
The class label of an unlabeled sample x_u is decided by a vote among the classifiers in E_i: the label receiving the most votes is taken as the label of x_u.
The degree of agreement among the predictions is the confidence. The classifiers in E_i compute the confidence of a sample from the consistency of their predicted labels, as given by Formula 1-1:

conf_i(x_u) = (1 / |E_i|) · Σ_{C_j ∈ E_i} I(h_j(x_u) = ŷ_u)    (Formula 1-1)

where ŷ_u is the majority-vote label of E_i, conf_i(x_u) denotes the confidence that the true class label of x_u is ŷ_u, and I(·) is an indicator function whose value is 0 if its input is false and 1 otherwise.
High-confidence unlabeled samples x_u can effectively improve the classification accuracy of the relation classifiers. If, while guaranteeing high label confidence, one also considers the disagreement between C_i and E_i on the same sample, and thereby selects an unlabeled sample set F_{i,t} capable of correcting classifier C_i, the classification accuracy can be improved further.
Therefore, in round t, Formula 1-2 selects high-confidence unlabeled samples for the i-th relation classifier:

F_{i,t} = { x_u ∈ U | conf_i(x_u) > θ ∧ h_i(x_u) ≠ ŷ_u }    (Formula 1-2)

where θ is a preset threshold: a sample is added to F_{i,t} only when its confidence exceeds θ and the prediction of C_i disagrees with that of E_i.
Step b-5: According to the filtering criteria, select a certain number of unlabeled samples x_u from the high-confidence set F_{i,t} for the i-th classifier C_i to form U_{i,t}; in the next iteration these are added to the training set of C_i, and C_i is retrained.
For an unlabeled sample x_u, let P(h_i(x_u)) denote the probability with which C_i predicts the output h_i(x_u). During filtering, both P(h_i(x_u)) and conf_i(x_u) are considered: the high-confidence unlabeled samples in F_{i,t} are sorted in descending order, first by conf_i(x_u) and then, when conf_i(x_u) is equal, by P(h_i(x_u)). After sorting, the top m_{i,t} samples form U_{i,t}.
Step b-6: Repeat steps b-3, b-4, and b-5. Training stops when every U_{i,t} is empty, i.e., no new unlabeled samples are added to any training set, or when the preset maximum number of iterations is reached.
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201610615087.2ACN106294593B (en) | 2016-07-28 | 2016-07-28 | A Relation Extraction Method Combining Clause-Level Remote Supervision and Semi-Supervised Ensemble Learning | 
| Publication Number | Publication Date | 
|---|---|
| CN106294593Atrue CN106294593A (en) | 2017-01-04 | 
| CN106294593B CN106294593B (en) | 2019-04-09 | 
Cited By

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106933804A (en)* | 2017-03-10 | 2017-07-07 | 上海数眼科技发展有限公司 | Structured information extraction method based on deep learning |
| CN106933804B (en)* | 2017-03-10 | 2020-03-31 | 上海数眼科技发展有限公司 | Structured information extraction method based on deep learning | 
| CN107292330A (en)* | 2017-05-02 | 2017-10-24 | 南京航空航天大学 | Iterative label noise identification algorithm based on combined information from supervised and semi-supervised learning |
| CN107169079B (en)* | 2017-05-10 | 2019-09-20 | 浙江大学 | A method of domain text knowledge extraction based on Deepdive | 
| CN107169079A (en)* | 2017-05-10 | 2017-09-15 | 浙江大学 | Domain text knowledge extraction method based on Deepdive |
| CN107291828A (en)* | 2017-05-27 | 2017-10-24 | 北京百度网讯科技有限公司 | Spoken language query analysis method, device and storage medium based on artificial intelligence |
| CN107291828B (en)* | 2017-05-27 | 2021-06-11 | 北京百度网讯科技有限公司 | Spoken language query analysis method and device based on artificial intelligence and storage medium | 
| CN108829722B (en)* | 2018-05-08 | 2020-10-02 | 国家计算机网络与信息安全管理中心 | Remote supervision Dual-Attention relation classification method and system | 
| CN108829722A (en)* | 2018-05-08 | 2018-11-16 | 国家计算机网络与信息安全管理中心 | Dual-Attention relation classification method and system based on distant supervision |
| CN108763353A (en)* | 2018-05-14 | 2018-11-06 | 中山大学 | Baidu encyclopedia relation triple extraction method based on rules and distant supervision |
| CN108959252A (en)* | 2018-06-28 | 2018-12-07 | 中国人民解放军国防科技大学 | Semi-supervised Chinese named entity recognition method based on deep learning | 
| CN108959252B (en)* | 2018-06-28 | 2022-02-08 | 中国人民解放军国防科技大学 | Semi-supervised Chinese named entity recognition method based on deep learning | 
| CN110728148A (en)* | 2018-06-29 | 2020-01-24 | 富士通株式会社 | Entity relationship extraction method and device | 
| CN110728148B (en)* | 2018-06-29 | 2023-07-14 | 富士通株式会社 | Entity relationship extraction method and device | 
| CN110032650A (en)* | 2019-04-18 | 2019-07-19 | 腾讯科技(深圳)有限公司 | Training sample data generation method, device and electronic equipment |
| CN111914555B (en)* | 2019-05-09 | 2022-08-23 | 中国人民大学 | Automatic relation extraction system based on Transformer structure | 
| CN111914555A (en)* | 2019-05-09 | 2020-11-10 | 中国人民大学 | Automatic relation extraction system based on Transformer structure | 
| CN110209836A (en)* | 2019-05-17 | 2019-09-06 | 北京邮电大学 | Distant supervision relation extraction method and device |
| CN111191461A (en)* | 2019-06-06 | 2020-05-22 | 北京理工大学 | A Remote Supervision Relation Extraction Method Based on Curriculum Learning | 
| CN111191461B (en)* | 2019-06-06 | 2021-08-03 | 北京理工大学 | A Remote Supervision Relation Extraction Method Based on Curriculum Learning | 
| CN110334355B (en)* | 2019-07-15 | 2023-08-18 | 苏州大学 | Relation extraction method, system and related components | 
| CN110334355A (en)* | 2019-07-15 | 2019-10-15 | 苏州大学 | Method, system and related components for relation extraction | 
| CN110543634B (en)* | 2019-09-02 | 2021-03-02 | 北京邮电大学 | Corpus data set processing method, device, electronic device and storage medium | 
| CN110543634A (en)* | 2019-09-02 | 2019-12-06 | 北京邮电大学 | Corpus data set processing method, device, electronic device and storage medium |
| CN112329463A (en)* | 2020-11-27 | 2021-02-05 | 上海汽车集团股份有限公司 | Training method and related device for remote supervision relation extraction model | 
| WO2022116417A1 (en)* | 2020-12-03 | 2022-06-09 | 平安科技(深圳)有限公司 | Triple information extraction method, apparatus, and device, and computer-readable storage medium | 
| CN113378563A (en)* | 2021-02-05 | 2021-09-10 | 中国司法大数据研究院有限公司 | Case feature extraction method and device based on genetic variation, semi-supervision and reinforcement learning | 
| CN114328942B (en)* | 2021-10-29 | 2025-08-08 | 腾讯科技(深圳)有限公司 | Relationship extraction method, apparatus, device, storage medium and computer program product | 
| CN114328942A (en)* | 2021-10-29 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Relationship extraction method, apparatus, device, storage medium and computer program product | 
| CN114519092A (en)* | 2022-02-24 | 2022-05-20 | 复旦大学 | Large-scale complex relation data set construction framework oriented to Chinese field | 
| CN115048536A (en)* | 2022-07-07 | 2022-09-13 | 南方电网大数据服务有限公司 | Knowledge graph generation method and device, computer equipment and storage medium | 
| CN115270776A (en)* | 2022-08-30 | 2022-11-01 | 陕西师范大学 | Method, system, device and medium for automatically acquiring concepts in domain knowledge base | 
| CN115619192A (en)* | 2022-11-10 | 2023-01-17 | 国网江苏省电力有限公司物资分公司 | A Hybrid Relation Extraction Algorithm Oriented to Demand Planning Rules | 
| CN115619192B (en)* | 2022-11-10 | 2023-10-03 | 国网江苏省电力有限公司物资分公司 | Mixed relation extraction method oriented to demand planning rules | 
| CN116992869A (en)* | 2023-07-18 | 2023-11-03 | 中国中医科学院中医药信息研究所 | Remote supervision relation extraction method and device based on search engine and classifier | 
| CN116992869B (en)* | 2023-07-18 | 2024-08-16 | 中国中医科学院中医药信息研究所 | Remote supervision relation extraction method and device based on search engine and classifier | 
Also Published As

| Publication number | Publication date |
|---|---|
| CN106294593B (en) | 2019-04-09 | 
Similar Documents

| Publication | Publication Date | Title |
|---|---|---|
| CN106294593A (en) | Relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning | |
| CN110597735B (en) | Software defect prediction method based on deep learning of open source software defect features | |
| CN108595632B (en) | A Hybrid Neural Network Text Classification Method Fusing Abstract and Main Features | |
| CN106383877B (en) | Social media online short text clustering and topic detection method | |
| CN110598005B (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
| CN112487143A (en) | Public opinion big data analysis-based multi-label text classification method | |
| CN111027595B (en) | Two-stage semantic word vector generation method | |
| CN108763213A (en) | Text keyword extraction method based on topic features | |
| CN110298032A (en) | Text classification corpus labeling training system | |
| CN106844349B (en) | Spam comment recognition method based on collaborative training | |
| CN117725222B (en) | Method for extracting document complex knowledge object by integrating knowledge graph and large language model | |
| CN103942340A (en) | Microblog user interest recognizing method based on text mining | |
| WO2021128704A1 (en) | Open set classification method based on classification utility | |
| CN108733647B (en) | A word vector generation method based on Gaussian distribution | |
| CN107423339A (en) | Popular microblog prediction method based on extreme gradient boosting and random forest | |
| CN117313849A (en) | Knowledge graph construction method and device for energy industry based on multi-source heterogeneous data fusion technology | |
| CN115510245A (en) | A Domain Knowledge Extraction Method Oriented to Unstructured Data | |
| CN111581368A (en) | Intelligent expert recommendation-oriented user image drawing method based on convolutional neural network | |
| CN105224953A (en) | Method for extracting and evolving process knowledge of mechanical parts | |
| CN111858842A (en) | A Judicial Case Screening Method Based on LDA Topic Model | |
| CN112051986A (en) | Device and method for code search recommendation based on open source knowledge | |
| CN102004796B (en) | A non-blocking hierarchical classification method and device for webpage text | |
| CN109062904A (en) | Logical predicate extracting method and device | |
| CN115269870A (en) | Method for realizing classification and early warning of data link faults in data based on knowledge graph | |
| CN113869054A (en) | Feature recognition method for electric power domain projects based on deep learning |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2019-04-09 | |