Disclosure of Invention
The invention aims to provide an entity analysis method based on an improved random forest, which performs similarity joins for matching character strings to entities through the random forest, improves the accuracy and efficiency of matching data sets, and overcomes the defects of existing entity analysis techniques.
The invention provides an entity analysis method based on random forest improvement, which comprises the following steps:
S1: providing a random forest F comprising k decision trees, wherein k = 1, 2, ..., N; providing a plurality of character strings Bi, wherein i = 1, 2, ..., N; and performing the following training steps:
S1.1: given a number of sample data tables Ai, wherein i = 1, 2, ..., N;
S1.2: randomly selecting a group of Xp tuples from an Ap table, randomly selecting a group of Xq tuples from an Aq table that may match the Ap table, and pairing the Xp tuples with the Xq tuples to form a sample S, wherein p belongs to i, q belongs to i, and p is not equal to q;
S1.3: examining the schemas of the Ap table and the Aq table, creating a set of features, and converting the tuple pairs in the sample S into feature vectors by using the features;
S1.4: training the random forest F with the feature vectors obtained in S1.3;
S2: performing a pruning step, the pruning step comprising:
S2.1: extracting m decision trees T1, T2, ..., Tm from the k decision trees, and respectively using T1, T2, ..., Tm to execute each character string Bi to obtain outputs C1, C2, ..., Cm, wherein m is the minimum number of decision trees required for correct analysis, and correct analysis means that the random forest F correctly resolves the character string Bi into an entity;
S2.2: establishing a set I = C1 ∩ C2 ∩ ... ∩ Cm;
S3: performing a verification step, the verification step comprising:
S3.1: establishing a set J = (C1 ∪ C2 ∪ ... ∪ Cm) \ (C1 ∩ C2 ∩ ... ∩ Cm);
S3.2: extracting n decision trees R1, R2, ..., Rn from the random forest F, and using R1, R2, ..., Rn to execute the set J to generate sets K1, K2, ..., Kn; and
S4: outputting, by the random forest F, an entity analysis result I ∪ K1 ∪ K2 ∪ ... ∪ Kn.
Further, in S3.2, (R1, R2, ..., Rn) ∪ (T1, T2, ..., Tm) constitutes the random forest F.
Further, in S1.2, an inverted index is constructed on the Aq table, and the inverted index is used to quickly search the Aq table for tuples sharing at least x symbols with the Xp tuples, constituting the Xq tuples, wherein x ≥ 2.
Further, in S2, the k decision trees are pruned before execution.
The invention reduces the number of candidate pairs to be matched and thus the size of the sample, trains the random forest on the reduced sample to shorten training time, decomposes the set of decision trees to be executed into a subset executed in the pruning step and the remaining trees executed in the verification step, and simplifies the executed decision tree set by reusing computations across trees, thereby further shortening execution time, simplifying the data, and producing an accurate analysis result.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms also include the plural forms unless the context clearly dictates otherwise, and further, it is understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
the invention provides an entity analysis method based on random forest improvement, which comprises the following steps:
S1: providing a random forest F comprising k decision trees, wherein k = 1, 2, ..., N; providing a plurality of character strings Bi, wherein i = 1, 2, ..., N;
training the random forest F;
S1.1: given a number of sample data tables Ai, wherein i = 1, 2, ..., N;
S1.2: randomly selecting a group of Xp tuples from the Ap table, and randomly selecting, from the Aq table, a group of Xq tuples that may match the Ap table; the Xp and Xq tuples are paired to constitute a sample S, wherein p belongs to i, q belongs to i, and p is not equal to q. In S1.2, an inverted index is constructed on the Aq table, and the inverted index is used to quickly find, in the Aq table, the tuples sharing at least x symbols with the Xp tuples, constituting the Xq tuples, wherein x ≥ 2.
S1.3: examination ApWatch and AqA table pattern, creating a set of characteristics, and converting tuple pairs in the sample S into feature vectors by using the characteristics;
s1.4: the random forest F is trained using the feature vectors in S1.3.
For example, in S1.1, two sample data tables A1 and A2 are given. In S1.2, a group of X1 tuples is randomly selected from the A1 table, and a group of X2 tuples that may match the X1 tuples of the A1 table is randomly selected from the A2 table; the X1 and X2 tuples are paired to constitute the sample S. In S1.3, the schemas of the A1 table and the A2 table are examined to create a set of features. As an example of feature creation, if the attribute city is detected to be of string type, many features using string similarity measures are created, for example: edit_dist(A1.city, A2.city), jaccard_2gram(A1.city, A2.city).
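The string-similarity features named in this example can be sketched as follows; edit_dist and jaccard_2gram are illustrative implementations of the measures mentioned above, and the city attribute is modeled with plain dictionaries for this sketch:

```python
def edit_dist(a, b):
    """Levenshtein edit distance via dynamic programming (row-by-row)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def jaccard_2gram(a, b):
    """Jaccard similarity of the sets of character 2-grams."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)} or {s}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def feature_vector(t1, t2):
    # One feature per similarity measure on the city attribute, mirroring
    # edit_dist(A1.city, A2.city) and jaccard_2gram(A1.city, A2.city).
    return [edit_dist(t1["city"], t2["city"]),
            jaccard_2gram(t1["city"], t2["city"])]
```

Such vectors, one per tuple pair in the sample S, are what S1.4 feeds to the random forest.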
S2: performing a pruning step, the pruning step comprising:
S2.1: extracting m decision trees T1, T2, ..., Tm from the k decision trees, and respectively using T1, T2, ..., Tm to execute each character string Bi to obtain outputs C1, C2, ..., Cm;
S2.2: establishing a set I = C1 ∩ C2 ∩ ... ∩ Cm;
S3: performing a verification step, the verification step comprising:
S3.1: establishing a set J = (C1 ∪ C2 ∪ ... ∪ Cm) \ (C1 ∩ C2 ∩ ... ∩ Cm);
S3.2: extracting n decision trees R1, R2, ..., Rn from the random forest F, and using R1, R2, ..., Rn to execute the set J to generate sets K1, K2, ..., Kn; and
S4: outputting, by the random forest F, the entity analysis result I ∪ K1 ∪ K2 ∪ ... ∪ Kn.
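Steps S2 through S4 above can be sketched by modeling each decision tree as a predicate over candidate pairs; the resolve function and its signature are hypothetical illustrations of the claimed flow, not the actual implementation:

```python
def resolve(forest, pairs, m):
    """Prune-and-verify execution of a random forest (steps S2-S4).

    forest: list of k predicates pair -> bool (the decision trees);
    m: number of trees executed in the pruning step (1 <= m < k).
    """
    pruning_trees, verify_trees = forest[:m], forest[m:]
    # S2.1: run each pruning tree on every pair; Ci = pairs matched by tree i
    C = [{p for p in pairs if tree(p)} for tree in pruning_trees]
    # S2.2: I = C1 ∩ ... ∩ Cm (matched by all pruning trees, output directly)
    I = set.intersection(*C)
    # S3.1: J = (C1 ∪ ... ∪ Cm) \ I (matched by only some pruning trees)
    J = set.union(*C) - I
    # S3.2: run the remaining trees only on the (typically small) set J
    K = [{p for p in J if tree(p)} for tree in verify_trees]
    # S4: output I ∪ K1 ∪ ... ∪ Kn
    return I | (set.union(*K) if K else set())
```

Only the ambiguous pairs in J ever reach the verification trees; pairs outside C1 ∪ ... ∪ Cm are discarded without further work.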
As shown in fig. 1: for example, in S1, the random forest F includes 3 decision trees T1, T2 and R1. In S2, two character strings B1 and B2 are provided, and T1 and T2 are executed on B1 and B2 to obtain outputs C1 and C2; the set I = C1 ∩ C2 is then established, and I, consisting of all pairs predicted as matches by both T1 and T2, can be output as part of the random forest F output. In S3, the set J = (C1 ∪ C2) \ (C1 ∩ C2) is established, and R1 executes the set J to generate a set K; obviously, every pair in K is also a match of the random forest F. In S4, the output of the random forest F is therefore I ∪ K. Any other pair (in neither I nor J) cannot be predicted as a match by T1 and T2, and therefore cannot be a match of the random forest F.
In S3, since the set J is often small, applying the third tree R1 to J is often much faster than applying it to the original character string sets B1 and B2. When F is large, for example 10 trees, the time saved is considerable. Assuming that in this case at least 5 trees must match for F to match, we can apply 6 trees to B1 and B2 to obtain the sets I and J, and then apply the remaining 4 trees to the relatively smaller set J; the former is the pruning of the trees and the latter is the verification of the trees.
The method specifically comprises the following steps: consider a forest F consisting of 10 trees, of which at least 5 must match for F to match. The pruning step then executes 6 trees to produce the set J. Consider a pair p1 ∈ J that matched 4 trees during pruning: as soon as one of the remaining 4 trees matches p1, we can declare p1 a match. Consider a pair p2 ∈ J that matched only one tree during pruning: as soon as one of the remaining 4 trees predicts p2 as a non-match, p2 can no longer reach 5 matches, and we can declare p2 a non-match.
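The early-exit logic for p1 and p2 described above can be sketched as follows; the function name and signature are illustrative assumptions of this sketch:

```python
def verify_pair(count_pruning, remaining_trees, pair, need):
    """Early-exit verification of one pair p in J.

    count_pruning: number of trees that matched the pair during pruning;
    need: minimum total matches for the forest to declare a match (e.g. 5 of 10).
    Returns as soon as the outcome is decided, skipping the remaining trees.
    """
    matches = count_pruning
    left = len(remaining_trees)
    for tree in remaining_trees:
        if tree(pair):
            matches += 1
        left -= 1
        if matches >= need:           # p1 case: enough matches already
            return True
        if matches + left < need:     # p2 case: can no longer reach `need`
            return False
    return matches >= need
```

In the 10-tree example, p1 (4 pruning matches) is decided by the first matching verification tree, and p2 (1 pruning match) by the first non-matching one.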
Therefore, in S2, the execution of the m decision trees T1, T2, ..., Tm is the pruning process; in S3, the execution of R1, R2, ..., Rn is the verification process, and the verification process shortens the time for data processing and analysis.
In S3.2, (R1, R2, ..., Rn) ∪ (T1, T2, ..., Tm) constitutes the random forest F. All the trees in the random forest F are thus fully utilized, so that the analysis result is more accurate.
In S2.1, m is the minimum number of decision trees required for correct analysis, i.e., for the random forest F to correctly resolve the character string Bi into an entity. When the minimum number m of decision trees that guarantees correct analysis is used in the pruning step and the remaining n decision trees are used in the verification step, the effect of shortening the analysis time is greatest, because the pruning step must execute all the character strings, whereas the verification step only needs to execute the pruned set J.
In S2, the k decision trees are pruned before execution. A subset of the trees is applied for pruning, which is done to avoid overfitting, and the remaining trees are then applied to the set J for verification.
Example 2:
In this example, consider matching two sets of names comprising both long names (e.g., Graphene Nanospheres) and short names (e.g., Golf Ball). The two sets of character strings B1 and B2 are matched by performing the pruning and verification of the random forest F to obtain the desired matched strings.
To match the two sets of character strings B1 and B2, we learn the random forest F and then execute F on B1 and B2. In the execution process of the invention, the execution of F is divided into two steps: pruning and verification. The following describes how to perform these two steps efficiently.
Assume that the random forest F has k trees, of which at least ⌈k/2⌉ trees must match for F to match. Clearly, when the pruning step executes at least (⌊k/2⌋ + 1) trees, any string pair not output by this step cannot be a match.
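The guarantee can be checked arithmetically: a pair matched by none of the ⌊k/2⌋ + 1 pruning trees can collect at most k − (⌊k/2⌋ + 1) = ⌈k/2⌉ − 1 matches from the remaining trees, one short of the ⌈k/2⌉ required. A small sketch verifying this for a range of k:

```python
import math

def pruning_is_safe(k):
    """Check that a pair rejected by all floor(k/2)+1 pruning trees can never
    reach the ceil(k/2) matches needed for the forest F to declare a match."""
    executed = k // 2 + 1            # trees run in the pruning step
    best_possible = k - executed     # matches still obtainable afterwards
    return best_possible < math.ceil(k / 2)
```

For example, with k = 10 the pruning step runs 6 trees and a fully rejected pair can gather at most 4 of the 5 required matches.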
As shown in FIG. 2, assume that the random forest F has three trees and that at least two trees must output the same pair before F outputs that pair (i.e., declares it a match). Put simply, we can execute F on the two sets of character strings B1 and B2 by executing each tree Ti on B1 and B2 to obtain an output Ci, and then outputting all pairs that appear in the outputs of at least two trees (see FIG. 2); this, however, is a time-consuming approach.
As shown in FIG. 1, the present invention has the advantage of first executing only two trees, for example executing T1 and T2 on B1 and B2 to obtain outputs C1 and C2 (see FIG. 1). The set I = C1 ∩ C2 consists of all pairs predicted as matches by both T1 and T2, and can therefore be immediately output as part of the random forest F output.
The implementation process of the pruning step is as follows: the set J = (C1 ∪ C2) \ (C1 ∩ C2) consists of all pairs predicted as a match by only one tree (T1 or T2). It can easily be seen that we only need to apply the remaining tree R1 to the set J. Let K be the set of pairs in J that R1 predicts as matches; clearly, any such pair is also a match of the random forest F, since it is matched by exactly two trees (T1 or T2, and R1). Thus, the output of the random forest F is I ∪ K (see FIG. 1). No other pair (i.e., a pair in neither I nor J) can be predicted as a match by T1 and T2, and therefore cannot be a match of F.
The implementation process of the verification step is as follows: the set J tends to be relatively small, so applying the tree R1 to J tends to be much faster than applying it to the character strings B1 and B2. When F is large (e.g., 10 trees), this time saving is very significant. Suppose in this case at least five trees must match for F to match: we can then apply six trees to B1 and B2 to obtain the sets I and J (i.e., the pruning step), and then apply the remaining four trees to the relatively smaller set J (i.e., the verification step).
In the verification step, suppose that executing m trees in the preceding pruning step produced a set of pairs J; it is then necessary to consider how to execute the remaining trees on J. Let U be the set of remaining trees to be executed on the set J. Similar to the way trees are executed in the pruning step, the above optimization procedure can simply be used to generate a plan P that executes all the trees in U in a combined manner (i.e., reusing computations); a better solution, however, is to apply the trees in order, so as to avoid applying all the trees in U to all pairs in J.
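The in-order application of the remaining trees, removing each pair from J as soon as its outcome is decided so that later trees see ever-smaller sets, can be sketched as follows (names and signature are illustrative assumptions):

```python
def verify_set(J, prior_counts, remaining_trees, need):
    """Apply the remaining trees U to J one tree at a time.

    prior_counts: matches each pair collected during the pruning step;
    need: minimum total matches for the forest to declare a match.
    A pair is dropped as soon as it is decided, so later trees run on
    a shrinking set instead of all of J.
    """
    undecided = {p: prior_counts[p] for p in J}
    matched = set()
    left = len(remaining_trees)
    for tree in remaining_trees:
        left -= 1
        for p in list(undecided):
            if tree(p):
                undecided[p] += 1
            if undecided[p] >= need:            # decided: match
                matched.add(p)
                del undecided[p]
            elif undecided[p] + left < need:    # decided: non-match
                del undecided[p]
        if not undecided:
            break
    return matched
```

The per-pair early exits are the p1/p2 logic above, lifted to the whole set J so that each successive tree in U processes only the still-undecided pairs.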
In summary, the invention trains a random forest comprising k decision trees, takes as input the character strings in a reference set, executes the decision trees in the random forest to construct new sets I and J, constructs the output of the random forest using the sets I and J, completes the matching of the reference set with the real entities, and can verify the matching by using the constructed set J during the execution of the random forest.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.