Technical Field
The present invention relates to the field of information extraction, and in particular to a relation extraction method that combines clause-level distant supervision with semi-supervised ensemble learning.
Background
Information extraction refers to the process of extracting entities, events, relations, and other kinds of information from a piece of text, forming structured data that is stored in a database for users to query and use. Relation extraction, a key part of information extraction, aims to identify the semantic relations that hold between entities. Relation extraction technology has broad application prospects in areas such as automatic question answering, massive-scale information processing, automatic knowledge base construction, search engines, and domain-specific text mining.
Traditional relation extraction research generally adopts supervised machine learning methods. These methods treat relation extraction as a classification problem and train a relation classifier on manually annotated data using extracted lexical and syntactic features, which can achieve reasonable classification performance. However, because costly manual annotation is required, the relation types that supervised methods can recognize are limited to specific domains, and such methods cannot scale to massive web text.
To address the shortage of manually annotated data in supervised relation extraction, researchers proposed distant supervision, a method for automatically generating labeled data. It assumes that if two entities participate in some semantic relation, then every sentence containing both entities expresses that relation to some degree. Based on this assumption, distant supervision exploits the large number of relation triples contained in a knowledge base: by aligning them with the text of a training corpus, it can generate a large amount of labeled data. Distant supervision thus solves the data-shortage problem of supervised relation extraction, but because its assumption does not always hold, the generated data contains a large amount of wrongly labeled data (i.e., noisy data), which harms the relation extraction model.
To deal with this noise, existing approaches generally modify the relation extraction model so as to reduce the negative influence of the noisy data. Although such approaches achieve some improvement, they do not solve the noise problem at its root.
In addition, relation extraction based on distant supervision generally underuses negative-example data. The number of negative relation instances generated by distant supervision far exceeds the number of positive instances, so the feature data set contains far more negative than positive data. To keep the positive and negative training data balanced, the training set is usually built from all of the positive data and only a small portion of the negative data, leaving most of the negative data unused.
Summary of the Invention
To solve the problems of noisy data and underused negative-example data in relation extraction, the present invention provides a relation extraction method that combines clause-level distant supervision with semi-supervised ensemble learning. The method both removes noisy data and makes full use of the negative-example data.
A relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning mainly includes the following steps:
Step 1: Align the relation triples in a knowledge base to a corpus via distant supervision to construct a set of relation instances.
Step 2: Remove noisy data from the relation instance set using clause recognition based on syntactic parsing.
Step 3: Extract the lexical features of each relation instance, convert them into distributed representation vectors, and construct a feature data set.
Step 4: Select all of the positive data and a small portion of the negative data from the feature data set to form a labeled data set; strip the labels from the remaining negative data to form an unlabeled data set; and train relation classifiers with a semi-supervised ensemble learning algorithm.
In Step 1, the relation triples in knowledge base K are aligned to corpus D via distant supervision to construct the relation instance set Q = {q_n | q_n = (s_m, e_i, r_k, e_j), s_m ∈ D},
where q_n is a relation instance, s_m is a sentence, e_i and e_j are entities, and r_k is the relation that holds between e_i and e_j.
If sentence s_m contains both entity e_i and entity e_j, and the relation triple (e_i, r_k, e_j) exists in knowledge base K, then q_n = (s_m, e_i, r_k, e_j) is a positive relation instance; some relation instances that do not satisfy this condition are also selected as negative relation instances.
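The alignment in Step 1 can be sketched as follows. This is a minimal illustration of the distant-supervision assumption, not the invention's implementation: the toy knowledge base, the substring-based entity matching, and the "NA" label for negative instances are all assumptions made for the sketch.

```python
def build_instances(kb_triples, corpus):
    """Return (positive, negative) relation instances.

    A sentence s_m containing both e_i and e_j of a KB triple
    (e_i, r_k, e_j) yields a positive instance (s_m, e_i, r_k, e_j);
    co-occurring entity pairs with no KB relation yield negative
    instances, labeled here with the artificial relation "NA".
    """
    entities = {e for (ei, _, ej) in kb_triples for e in (ei, ej)}
    kb = {(ei, ej): rk for (ei, rk, ej) in kb_triples}
    positives, negatives = [], []
    for sentence in corpus:
        # Naive substring matching stands in for real entity linking.
        present = [e for e in entities if e in sentence]
        for ei in present:
            for ej in present:
                if ei == ej:
                    continue
                if (ei, ej) in kb:
                    positives.append((sentence, ei, kb[(ei, ej)], ej))
                else:
                    negatives.append((sentence, ei, "NA", ej))
    return positives, negatives
```

With a one-triple knowledge base, a sentence mentioning both entities yields a positive instance, while the reversed entity pair (absent from the KB) becomes a negative candidate.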
Step 2 proceeds as follows:
Step 2-1: Parse the sentence s_m of each relation instance q_n with a probabilistic context-free grammar to obtain its parse tree, and divide s_m into clauses according to the structural relations among its words represented by the tree.
Step 2-2: Judge whether q_n is noisy data according to whether its entity pair (e_i, e_j) co-occurs within a single clause of s_m; if q_n is noisy, remove it from the relation instance set Q.
If q_n = (s_m, e_i, r_k, e_j) is a positive relation instance and the entity pair (e_i, e_j) does not appear together in any single clause of s_m, then q_n is considered noisy data and is removed from Q.
If q_n = (s_m, e_i, r_k, e_j) is a negative relation instance and the entity pair (e_i, e_j) does appear together in some clause of s_m, then q_n is considered noisy data and is removed from Q.
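The filtering rule of Step 2-2 can be sketched as below. The sketch assumes clause segmentation has already been produced by a parser (clauses are passed in as strings rather than derived from a PCFG parse tree), and uses naive substring matching; both are simplifications for illustration.

```python
def is_noise(clauses, ei, ej, is_positive):
    """Apply the clause-level heuristic of Step 2-2:

    - a positive instance is noise if NO single clause contains both
      entities (the sentence likely does not express the relation);
    - a negative instance is noise if SOME clause contains both
      entities (it may in fact express a relation).
    """
    together = any(ei in c and ej in c for c in clauses)
    return (not together) if is_positive else together

def filter_instances(instances, clause_fn):
    """Drop noisy instances from a list of (s, e_i, r_k, e_j, is_positive)
    tuples; clause_fn maps a sentence to its list of clauses."""
    return [(s, ei, rk, ej, pos) for (s, ei, rk, ej, pos) in instances
            if not is_noise(clause_fn(s), ei, ej, pos)]
```

A positive instance whose entities share a clause is kept; one whose entities are split across clauses is discarded, and the rule is mirrored for negatives.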
Step 3 proceeds as follows:
Step 3-1: Extract the lexical features lex_n of each relation instance q_n in Q.
Step 3-2: Convert the lexical features lex_n into distributed representation vectors v_n and construct the feature data set M.
In Step 3-1, for a relation instance q_n = (s_m, e_i, r_k, e_j), the lexical features lex_n are the entity pair (e_i, e_j) itself together with the context of (e_i, e_j) in sentence s_m; the specific lexical feature types are listed in Table 1.
Table 1. Lexical feature types
In Step 3-2, each lex_n is converted into a distributed representation vector v_n, and all v_n are collected to form the feature data set M. The vectorized lexical features of positive relation instances in Q become the positive data of M, and those of negative relation instances become the negative data of M.
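One way Step 3-2 can be realized is sketched below. The invention does not fix the exact vectorization scheme here, so this sketch assumes a pretrained word-embedding lookup (a toy dict stands in for it) and concatenation of per-feature embeddings; both choices are assumptions for illustration.

```python
def featurize(lex_features, embeddings, dim):
    """Build v_n by concatenating the embedding of each lexical
    feature token; unknown tokens fall back to a zero vector of the
    same dimension so every v_n has length len(lex_features) * dim."""
    vec = []
    for tok in lex_features:
        vec.extend(embeddings.get(tok, [0.0] * dim))
    return vec
```

In practice the embedding table would come from a model such as word2vec; the fixed output length is what lets all v_n be stacked into a single feature data set M.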
Step 4 proceeds as follows:
Step 4-1: Select all of the positive data and a small portion of the negative data in M to form the labeled data set L; the remaining negative data, with labels removed, forms the unlabeled data set U.
Step 4-2: Draw n initial sample sets L_1, L_2, ..., L_n from L by sampling with replacement.
Step 4-3: Train each relation classifier C_i on its initial sample set L_i together with the high-confidence unlabeled sample set U_{i,t-1} selected in round t-1, where i = 1, 2, ..., n.
Step 4-4: The n relation classifiers C_1, C_2, ..., C_n each predict the class labels of the unlabeled samples x_u in the unlabeled data set U, and a high-confidence unlabeled sample set F_{i,t} is generated by voting.
Step 4-5: According to the filtering criteria, select a certain number of unlabeled samples x_u from the high-confidence set F_{i,t} for the i-th classifier C_i to form U_{i,t}; in the next iteration these are added to the training set of C_i, and C_i is retrained.
Step 4-6: Repeat steps 4-3, 4-4, and 4-5. Training stops when every U_{i,t} is empty, i.e., no new unlabeled samples x_u are added to any training set, or when the preset maximum number of iterations is reached.
In Step 4-3, U_{i,t-1} denotes the set of unlabeled samples selected for the i-th classifier C_i in round t-1; each element consists of an unlabeled sample x_u from U together with the class label obtained in round t-1, where t ≥ 2. When t = 1, U_{i,t-1} is the empty set.
Note that unlabeled samples x_u added to the training set before round t-1 are removed from the training set and returned to the unlabeled sample set F_{i,t}; in each round the training set is extended only by the unlabeled samples added in the previous round.
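The outer loop of Step 4 (steps 4-2 through 4-6) can be sketched as follows. The concrete classifier training and sample selection are left abstract: `train_fn` and `select_fn` are placeholder callables standing in for the training and the voting/filtering procedures described in the text.

```python
import random

def semi_supervised_ensemble(L, U, n, max_rounds, train_fn, select_fn):
    """Iterative ensemble training: n bootstrap samples of L seed n
    classifiers; each round, a fresh high-confidence selection U_{i,t}
    is made for each classifier, and it is retrained on its bootstrap
    sample plus ONLY the newest selection (earlier selections return
    to the pool, as the text specifies)."""
    samples = [[random.choice(L) for _ in L] for _ in range(n)]  # with replacement
    selected = [[] for _ in range(n)]      # U_{i,t-1}, empty when t = 1
    classifiers = [train_fn(samples[i] + selected[i]) for i in range(n)]
    for t in range(1, max_rounds + 1):
        new_selected = [select_fn(i, classifiers, U) for i in range(n)]
        if all(len(s) == 0 for s in new_selected):
            break                          # no U_{i,t} gained new samples: stop
        selected = new_selected            # previous additions are dropped
        classifiers = [train_fn(samples[i] + selected[i]) for i in range(n)]
    return classifiers
```

The two stopping conditions of Step 4-6 appear as the `break` (all U_{i,t} empty) and the bound on `max_rounds`.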
In Step 4-4, F_{i,t} denotes the set of high-confidence unlabeled samples x_u selected in round t when the classifier is C_i; after filtering, the samples that remain form U_{i,t}.
For an unlabeled sample x_u, let h_i(x_u) denote the class label that the i-th classifier C_i predicts for x_u.
Let E denote the set of relation classifiers, and let E_i be E with C_i removed, i.e., E_i = {C_j ∈ E | j ≠ i}.
The class label of an unlabeled sample x_u is decided by a vote among the classifiers in E_i: the label receiving the most votes is taken as the label of x_u.
The degree of agreement among the predictions is the confidence. The classifiers in E_i compute the confidence of a sample from the consistency of their predicted labels, as given by Formula 1-1:

conf_i(x_u) = (1 / |E_i|) · Σ_{C_j ∈ E_i} I(h_j(x_u) = ŷ_u)    (Formula 1-1)

where ŷ_u is the majority-vote label of E_i, conf_i(x_u) denotes the confidence that the true class label of x_u is ŷ_u, and I(·) is an indicator function whose value is 0 if its input is false and 1 otherwise.
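A minimal sketch of the voting and confidence computation of Formula 1-1: the confidence is the fraction of classifiers in E_i that agree with the majority-vote label. The function name and list-based interface are illustrative.

```python
from collections import Counter

def vote_and_confidence(i, predictions):
    """predictions[j] is classifier C_j's predicted label for one
    sample x_u. Returns (y_hat, conf_i) where y_hat is the majority
    label among E_i (all classifiers except C_i) and conf_i is the
    fraction of E_i agreeing with y_hat."""
    others = [p for j, p in enumerate(predictions) if j != i]
    y_hat, votes = Counter(others).most_common(1)[0]
    return y_hat, votes / len(others)
```

For instance, with C_0 excluded and the remaining four classifiers voting "r1", "r1", "r1", "NA", the voted label is "r1" with confidence 3/4.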
High-confidence unlabeled samples x_u can effectively improve the classification accuracy of the relation classifiers. If, while guaranteeing high label confidence, one also considers the disagreement between C_i and E_i on the same sample, and thereby selects an unlabeled sample set F_{i,t} capable of correcting classifier C_i, the classification accuracy can be improved further.
Therefore, in round t, Formula 1-2 selects high-confidence unlabeled samples x_u for the i-th relation classifier:

F_{i,t} = { x_u ∈ U | conf_i(x_u) > θ ∧ h_i(x_u) ≠ ŷ_u }    (Formula 1-2)

where θ is a preset threshold: a sample is added to F_{i,t} only when its confidence exceeds θ and the prediction of C_i disagrees with that of E_i.
In Step 4-5, for an unlabeled sample x_u, let P(h_i(x_u)) denote the probability with which C_i predicts the output h_i(x_u). During filtering, both P(h_i(x_u)) and conf_i(x_u) are considered: the high-confidence unlabeled samples in F_{i,t} are sorted in descending order, first by conf_i(x_u) and then, when conf_i(x_u) is equal, by P(h_i(x_u)), so that samples with larger conf_i(x_u) rank first and ties are broken by larger P(h_i(x_u)). After sorting, the top m_{i,t} samples form U_{i,t}.
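The ranking rule of Step 4-5 can be sketched as a two-key sort. Each candidate is represented here as a (sample, confidence, probability) triple; this tuple layout is an assumption for the sketch.

```python
def pick_top(candidates, m):
    """candidates: list of (x_u, conf_i(x_u), P(h_i(x_u))) triples
    from F_{i,t}. Sort descending by confidence, breaking ties by the
    classifier's own predicted probability, and keep the top m as
    U_{i,t}."""
    ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
    return [x for (x, _, _) in ranked[:m]]
```

With equal confidences, the sample the classifier is more sure about wins the tie, matching the ordering described above.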
The present invention combines clause recognition with a semi-supervised ensemble learning algorithm, removing noise from the relation instances while making full use of the negative-example data. Compared with the prior art, the advantages of the present invention include:
(1) Clause recognition removes noisy data from the training data, improving the labeling accuracy of the training data and thereby the classification accuracy of relation extraction.
(2) Training the relation classifiers with a semi-supervised ensemble learning algorithm turns the negative-example data left unused by traditional relation extraction into unlabeled data (after label removal), raising the utilization of negative data and thereby the classification accuracy of relation extraction.
Brief Description of the Drawings
Fig. 1 is a flowchart of the relation extraction method combining clause recognition with semi-supervised ensemble learning;
Fig. 2 is a flowchart of the t-th iteration.
Detailed Description
To describe the present invention more concretely, the technical solution of the present invention is explained in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of a relation extraction method of the present invention combining clause-level distant supervision with semi-supervised ensemble learning. The method is divided into two stages: data processing and model training.
Data processing stage
The data processing steps are as follows:
Step a-1: Align the relation triples in knowledge base K to corpus D via distant supervision, constructing the relation instance set Q = {q_n | q_n = (s_m, e_i, r_k, e_j), s_m ∈ D}.
If sentence s_m contains both entities e_i and e_j, and the relation triple (e_i, r_k, e_j) exists in knowledge base K, then (s_m, e_i, r_k, e_j) is a positive relation instance; some relation instances that do not satisfy this condition are also selected as negative relation instances.
Step a-2: Parse the sentence s_m of each relation instance q_n with a probabilistic context-free grammar to obtain its parse tree, and divide s_m into clauses according to the structural relations among its words represented by the tree.
Step a-3: Judge whether q_n is noisy data according to whether its entity pair (e_i, e_j) co-occurs within a single clause of s_m; if q_n is noisy, remove it from the relation instance set Q.
If q_n = (s_m, e_i, r_k, e_j) is a positive relation instance and the entity pair (e_i, e_j) does not appear together in any single clause of s_m, then q_n is considered noisy data and is removed from Q.
If q_n = (s_m, e_i, r_k, e_j) is a negative relation instance and the entity pair (e_i, e_j) does appear together in some clause of s_m, then q_n is considered noisy data and is removed from Q.
Step a-4: Extract the lexical features lex_n of each relation instance q_n in Q.
For a relation instance q_n = (s_m, e_i, r_k, e_j), the lexical features lex_n are the entity pair (e_i, e_j) itself together with the context of (e_i, e_j) in sentence s_m; the specific lexical feature types are listed in Table 1.
Table 1. Lexical feature types
Step a-5: Convert the lexical features lex_n into distributed representation vectors v_n and construct the feature data set M.
Each lex_n is converted into a distributed representation vector v_n, and all v_n are collected to form the feature data set M. The vectorized lexical features of positive relation instances in Q become the positive data of M, and those of negative relation instances become the negative data of M.
Model training stage
Model training is an iterative learning process; its t-th iteration is shown in Fig. 2.
Step b-1: Select all of the positive data and a small portion of the negative data in the feature data set M to form the labeled data set, denoted L; the remaining negative data, with labels removed, serves as the unlabeled data set, denoted U.
Step b-2: Draw n initial sample sets L_1, L_2, ..., L_n from L by sampling with replacement.
Step b-3: Train each relation classifier C_i on its initial sample set L_i together with the high-confidence unlabeled sample set U_{i,t-1} selected in round t-1, where i = 1, 2, ..., n.
U_{i,t-1} denotes the set of unlabeled samples selected for the i-th classifier C_i in round t-1; each element consists of an unlabeled sample x_u from U together with the class label obtained in round t-1, where t ≥ 2. When t = 1, U_{i,t-1} is the empty set.
Note that unlabeled samples x_u added to the training set before round t-1 are removed from the training set and returned to the unlabeled sample set F_{i,t}; in each round the training set is extended only by the unlabeled samples added in the previous round.
Step b-4: The n relation classifiers C_1, C_2, ..., C_n each predict the class labels of the unlabeled samples x_u in the unlabeled data set U, and a high-confidence unlabeled sample set F_{i,t} is generated by voting.
F_{i,t} denotes the set of high-confidence unlabeled samples x_u selected in round t when the classifier is C_i; after filtering, the samples that remain form U_{i,t}.
For an unlabeled sample x_u, let h_i(x_u) denote the class label that the i-th classifier C_i predicts for x_u.
Let E denote the set of relation classifiers, and let E_i be E with C_i removed, i.e., E_i = {C_j ∈ E | j ≠ i}.
The class label of an unlabeled sample x_u is decided by a vote among the classifiers in E_i: the label receiving the most votes is taken as the label of x_u.
The degree of agreement among the predictions is the confidence. The classifiers in E_i compute the confidence of a sample from the consistency of their predicted labels, as given by Formula 1-1:

conf_i(x_u) = (1 / |E_i|) · Σ_{C_j ∈ E_i} I(h_j(x_u) = ŷ_u)    (Formula 1-1)

where ŷ_u is the majority-vote label of E_i, conf_i(x_u) denotes the confidence that the true class label of x_u is ŷ_u, and I(·) is an indicator function whose value is 0 if its input is false and 1 otherwise.
High-confidence unlabeled samples x_u can effectively improve the classification accuracy of the relation classifiers. If, while guaranteeing high label confidence, one also considers the disagreement between C_i and E_i on the same sample, and thereby selects an unlabeled sample set F_{i,t} capable of correcting classifier C_i, the classification accuracy can be improved further.
Therefore, in round t, Formula 1-2 selects high-confidence unlabeled samples for the i-th relation classifier:

F_{i,t} = { x_u ∈ U | conf_i(x_u) > θ ∧ h_i(x_u) ≠ ŷ_u }    (Formula 1-2)

where θ is a preset threshold: a sample is added to F_{i,t} only when its confidence exceeds θ and the prediction of C_i disagrees with that of E_i.
Step b-5: According to the filtering criteria, select a certain number of unlabeled samples x_u from the high-confidence set F_{i,t} for the i-th classifier C_i to form U_{i,t}; in the next iteration these are added to the training set of C_i, and C_i is retrained.
For an unlabeled sample x_u, let P(h_i(x_u)) denote the probability with which C_i predicts the output h_i(x_u). During filtering, both P(h_i(x_u)) and conf_i(x_u) are considered: the high-confidence unlabeled samples in F_{i,t} are sorted in descending order, first by conf_i(x_u) and then, when conf_i(x_u) is equal, by P(h_i(x_u)). After sorting, the top m_{i,t} samples form U_{i,t}.
Step b-6: Repeat steps b-3, b-4, and b-5. Training stops when every U_{i,t} is empty, i.e., no new unlabeled samples are added to any training set, or when the preset maximum number of iterations is reached.
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201610615087.2ACN106294593B (en) | 2016-07-28 | 2016-07-28 | A Relation Extraction Method Combining Clause-Level Remote Supervision and Semi-Supervised Ensemble Learning | 
| Publication Number | Publication Date | 
|---|---|
| CN106294593Atrue CN106294593A (en) | 2017-01-04 | 
| CN106294593B CN106294593B (en) | 2019-04-09 | 
Cited By

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106933804A (en)* | 2017-03-10 | 2017-07-07 | 上海数眼科技发展有限公司 | Structured information extraction method based on deep learning |
| CN106933804B (en)* | 2017-03-10 | 2020-03-31 | 上海数眼科技发展有限公司 | Structured information extraction method based on deep learning | 
| CN107292330A (en)* | 2017-05-02 | 2017-10-24 | 南京航空航天大学 | Iterative label noise identification algorithm based on combined information from supervised and semi-supervised learning |
| CN107169079B (en)* | 2017-05-10 | 2019-09-20 | 浙江大学 | A method of domain text knowledge extraction based on Deepdive | 
| CN107169079A (en)* | 2017-05-10 | 2017-09-15 | 浙江大学 | Domain text knowledge extraction method based on Deepdive |
| CN107291828A (en)* | 2017-05-27 | 2017-10-24 | 北京百度网讯科技有限公司 | Spoken language query analysis method, device and storage medium based on artificial intelligence |
| CN107291828B (en)* | 2017-05-27 | 2021-06-11 | 北京百度网讯科技有限公司 | Spoken language query analysis method and device based on artificial intelligence and storage medium | 
| CN108829722B (en)* | 2018-05-08 | 2020-10-02 | 国家计算机网络与信息安全管理中心 | Remote supervision Dual-Attention relation classification method and system | 
| CN108829722A (en)* | 2018-05-08 | 2018-11-16 | 国家计算机网络与信息安全管理中心 | Dual-Attention relation classification method and system based on distant supervision |
| CN108763353A (en)* | 2018-05-14 | 2018-11-06 | 中山大学 | Baidu encyclopedia relation triple extraction method based on rules and distant supervision |
| CN108959252A (en)* | 2018-06-28 | 2018-12-07 | 中国人民解放军国防科技大学 | Semi-supervised Chinese named entity recognition method based on deep learning | 
| CN108959252B (en)* | 2018-06-28 | 2022-02-08 | 中国人民解放军国防科技大学 | Semi-supervised Chinese named entity recognition method based on deep learning | 
| CN110728148A (en)* | 2018-06-29 | 2020-01-24 | 富士通株式会社 | Entity relationship extraction method and device | 
| CN110728148B (en)* | 2018-06-29 | 2023-07-14 | 富士通株式会社 | Entity relationship extraction method and device | 
| CN110032650A (en)* | 2019-04-18 | 2019-07-19 | 腾讯科技(深圳)有限公司 | Training sample data generation method, device and electronic equipment |
| CN111914555B (en)* | 2019-05-09 | 2022-08-23 | 中国人民大学 | Automatic relation extraction system based on Transformer structure | 
| CN111914555A (en)* | 2019-05-09 | 2020-11-10 | 中国人民大学 | Automatic relation extraction system based on Transformer structure | 
| CN110209836A (en)* | 2019-05-17 | 2019-09-06 | 北京邮电大学 | Distant supervision relation extraction method and device |
| CN111191461A (en)* | 2019-06-06 | 2020-05-22 | 北京理工大学 | A Remote Supervision Relation Extraction Method Based on Curriculum Learning | 
| CN111191461B (en)* | 2019-06-06 | 2021-08-03 | 北京理工大学 | A Remote Supervision Relation Extraction Method Based on Curriculum Learning | 
| CN110334355B (en)* | 2019-07-15 | 2023-08-18 | 苏州大学 | Relation extraction method, system and related components | 
| CN110334355A (en)* | 2019-07-15 | 2019-10-15 | 苏州大学 | Method, system and related components for relation extraction | 
| CN110543634B (en)* | 2019-09-02 | 2021-03-02 | 北京邮电大学 | Corpus data set processing method, device, electronic device and storage medium | 
| CN110543634A (en)* | 2019-09-02 | 2019-12-06 | 北京邮电大学 | Corpus data set processing method, device, electronic device and storage medium |
| CN112329463A (en)* | 2020-11-27 | 2021-02-05 | 上海汽车集团股份有限公司 | Training method and related device for remote supervision relation extraction model | 
| WO2022116417A1 (en)* | 2020-12-03 | 2022-06-09 | 平安科技(深圳)有限公司 | Triple information extraction method, apparatus, and device, and computer-readable storage medium | 
| CN113378563A (en)* | 2021-02-05 | 2021-09-10 | 中国司法大数据研究院有限公司 | Case feature extraction method and device based on genetic variation, semi-supervision and reinforcement learning | 
| CN114328942B (en)* | 2021-10-29 | 2025-08-08 | 腾讯科技(深圳)有限公司 | Relationship extraction method, apparatus, device, storage medium and computer program product | 
| CN114328942A (en)* | 2021-10-29 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Relationship extraction method, apparatus, device, storage medium and computer program product | 
| CN114519092A (en)* | 2022-02-24 | 2022-05-20 | 复旦大学 | Large-scale complex relation data set construction framework oriented to Chinese field | 
| CN115048536A (en)* | 2022-07-07 | 2022-09-13 | 南方电网大数据服务有限公司 | Knowledge graph generation method and device, computer equipment and storage medium | 
| CN115270776A (en)* | 2022-08-30 | 2022-11-01 | 陕西师范大学 | Method, system, device and medium for automatically acquiring concepts in domain knowledge base | 
| CN115619192A (en)* | 2022-11-10 | 2023-01-17 | 国网江苏省电力有限公司物资分公司 | A Hybrid Relation Extraction Algorithm Oriented to Demand Planning Rules | 
| CN115619192B (en)* | 2022-11-10 | 2023-10-03 | 国网江苏省电力有限公司物资分公司 | Mixed relation extraction method oriented to demand planning rules | 
| CN116992869A (en)* | 2023-07-18 | 2023-11-03 | 中国中医科学院中医药信息研究所 | Remote supervision relation extraction method and device based on search engine and classifier | 
| CN116992869B (en)* | 2023-07-18 | 2024-08-16 | 中国中医科学院中医药信息研究所 | Remote supervision relation extraction method and device based on search engine and classifier | 
Also Published As

| Publication number | Publication date |
|---|---|
| CN106294593B (en) | 2019-04-09 | 
Similar Documents

| Publication | Publication Date | Title |
|---|---|---|
| CN106294593A (en) | Relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning | |
| CN110597735B (en) | Software defect prediction method based on deep learning of open source software defect features | |
| CN108595632B (en) | A Hybrid Neural Network Text Classification Method Fusing Abstract and Main Features | |
| CN106383877B (en) | Social media online short text clustering and topic detection method | |
| CN110598005B (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
| CN112487143A (en) | Public opinion big data analysis-based multi-label text classification method | |
| CN111027595B (en) | Two-stage semantic word vector generation method | |
| CN108763213A (en) | Text keyword extraction method based on topic features | |
| CN110298032A (en) | Text classification corpus labeling training system | |
| CN106844349B (en) | Spam comment recognition method based on collaborative training | |
| CN117725222B (en) | Method for extracting document complex knowledge object by integrating knowledge graph and large language model | |
| CN103942340A (en) | Microblog user interest recognizing method based on text mining | |
| WO2021128704A1 (en) | Open set classification method based on classification utility | |
| CN108733647B (en) | A word vector generation method based on Gaussian distribution | |
| CN107423339A (en) | Popular microblog prediction method based on extreme gradient boosting and random forest | |
| CN117313849A (en) | Knowledge graph construction method and device for energy industry based on multi-source heterogeneous data fusion technology | |
| CN115510245A (en) | A Domain Knowledge Extraction Method Oriented to Unstructured Data | |
| CN111581368A (en) | Intelligent expert recommendation-oriented user image drawing method based on convolutional neural network | |
| CN105224953A (en) | Method for extracting and evolving process knowledge of mechanical parts | |
| CN111858842A (en) | A Judicial Case Screening Method Based on LDA Topic Model | |
| CN112051986A (en) | Device and method for code search recommendation based on open source knowledge | |
| CN102004796B (en) | A non-blocking hierarchical classification method and device for webpage text | |
| CN109062904A (en) | Logical predicate extracting method and device | |
| CN115269870A (en) | Method for realizing classification and early warning of data link faults in data based on knowledge graph | |
| CN113869054A (en) | Feature recognition method for electric power domain projects based on deep learning |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2019-04-09 | |