Disclosure of Invention
The invention aims to provide an entity analysis method based on an improved random forest, which performs similarity joins for matching character strings to entities through the random forest, improves the accuracy and efficiency of matching data sets, and overcomes the defects of existing entity analysis techniques.
The invention provides an entity analysis method based on random forest improvement, which comprises the following steps:
S1: providing a random forest F comprising k decision trees, wherein k = 1, 2, ..., N; providing a plurality of character strings Bi, wherein i = 1, 2, ..., N; and performing the following training steps:
S1.1: given a number of sample data tables Ai, wherein i = 1, 2, ..., N;
S1.2: randomly selecting a group of Xp tuples from an Ap table, randomly selecting a group of Xq tuples from an Aq table that may match the Ap table, and pairing the Xp tuples with the Xq tuples to form a sample S, wherein p belongs to i, q belongs to i, and p is not equal to q;
S1.3: examining the schemas of the Ap table and the Aq table, creating a set of features, and converting the tuple pairs in the sample S into feature vectors by using the features;
S1.4: training the random forest F with the feature vectors obtained in S1.3;
S2: performing a pruning step, the pruning step comprising:
S2.1: extracting m decision trees T1, T2, ..., Tm from the k decision trees, and respectively using T1, T2, ..., Tm to execute each character string Bi to obtain outputs C1, C2, ..., Cm, wherein m is the minimum number of decision trees required for correct analysis, and correct analysis means that the random forest F correctly resolves the character string Bi into an entity;
S2.2: establishing a set I = C1 ∩ C2 ∩ ... ∩ Cm;
S3: performing a verification step, the verification step comprising:
S3.1: establishing a set J = (C1 ∪ C2 ∪ ... ∪ Cm) \ (C1 ∩ C2 ∩ ... ∩ Cm);
S3.2: extracting n decision trees R1, R2, ..., Rn from the random forest F, and using R1, R2, ..., Rn to execute the set J to generate sets K1, K2, ..., Kn; and
S4: outputting, by the random forest F, an entity analysis result I ∪ K1 ∪ K2 ∪ ... ∪ Kn.
Further, in S3.2, (R1, R2, ..., Rn) ∪ (T1, T2, ..., Tm) constitutes the random forest F.
Further, in S1.2, an inverted index is constructed on the Aq table, and the inverted index is used to quickly search the Aq table for tuples sharing at least x symbols with the Xp tuples, constituting the Xq tuples, wherein x ≥ 2.
Further, in S2, the k decision trees are pruned before execution.
The invention reduces the number of candidate pairs to be matched and thus the size of the sample, trains the random forest on the reduced sample to shorten training time, decomposes the set of decision trees to be executed into a subset executed in the pruning step and the remaining trees executed in the verification step, and simplifies the executed decision tree set by reusing computations across trees, thereby further shortening execution time, simplifying the data, and producing an accurate analysis result.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms also include the plural forms unless the context clearly dictates otherwise, and further, it is understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
the invention provides an entity analysis method based on random forest improvement, which comprises the following steps:
S1: providing a random forest F comprising k decision trees, wherein k = 1, 2, ..., N; providing a plurality of character strings Bi, wherein i = 1, 2, ..., N;
training the random forest F;
S1.1: given a number of sample data tables Ai, wherein i = 1, 2, ..., N;
S1.2: randomly selecting a group of Xp tuples from the Ap table, and randomly selecting, from the Aq table, a group of Xq tuples that may match the Ap table; the Xp and Xq tuples are paired to constitute a sample S, wherein p belongs to i, q belongs to i, and p is not equal to q. In S1.2, an inverted index is constructed on the Aq table, and the inverted index is used to quickly find, in the Aq table, the tuples sharing at least x symbols with the Xp tuples, constituting the Xq tuples, wherein x ≥ 2.
S1.3: examination ApWatch and AqA table pattern, creating a set of characteristics, and converting tuple pairs in the sample S into feature vectors by using the characteristics;
s1.4: the random forest F is trained using the feature vectors in S1.3.
For example, in S1.1, two sample data tables A1 and A2 are given. In S1.2, a group of X1 tuples is randomly selected from the A1 table, and a group of X2 tuples that may match the X1 tuples of the A1 table is randomly selected from the A2 table; the X1 and X2 tuples are paired to constitute the sample S. In S1.3, the schemas of the A1 table and the A2 table are examined to create a set of features. As an example of feature creation, if the attribute city is detected to be of string type, many features using string similarity measures are created, for example: edit_dist(A1.city, A2.city), jaccard_2gram(A1.city, A2.city).
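The string-similarity features named in this example can be sketched as follows; edit_dist and jaccard_2gram are illustrative implementations of the measures mentioned above, and the city attribute is modeled with plain dictionaries for this sketch:

```python
def edit_dist(a, b):
    """Levenshtein edit distance via dynamic programming (row-by-row)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def jaccard_2gram(a, b):
    """Jaccard similarity of the sets of character 2-grams."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)} or {s}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def feature_vector(t1, t2):
    # One feature per similarity measure on the city attribute, mirroring
    # edit_dist(A1.city, A2.city) and jaccard_2gram(A1.city, A2.city).
    return [edit_dist(t1["city"], t2["city"]),
            jaccard_2gram(t1["city"], t2["city"])]
```

Such vectors, one per tuple pair in the sample S, are what S1.4 feeds to the random forest.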
S2: performing a pruning step, the pruning step comprising:
S2.1: extracting m decision trees T1, T2, ..., Tm from the k decision trees, and respectively using T1, T2, ..., Tm to execute each character string Bi to obtain outputs C1, C2, ..., Cm;
S2.2: establishing a set I = C1 ∩ C2 ∩ ... ∩ Cm;
S3: performing a verification step, the verification step comprising:
S3.1: establishing a set J = (C1 ∪ C2 ∪ ... ∪ Cm) \ (C1 ∩ C2 ∩ ... ∩ Cm);
S3.2: extracting n decision trees R1, R2, ..., Rn from the random forest F, and using R1, R2, ..., Rn to execute the set J to generate sets K1, K2, ..., Kn; and
S4: outputting, by the random forest F, the entity analysis result I ∪ K1 ∪ K2 ∪ ... ∪ Kn.
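Steps S2 through S4 above can be sketched by modeling each decision tree as a predicate over candidate pairs; the resolve function and its signature are hypothetical illustrations of the claimed flow, not the actual implementation:

```python
def resolve(forest, pairs, m):
    """Prune-and-verify execution of a random forest (steps S2-S4).

    forest: list of k predicates pair -> bool (the decision trees);
    m: number of trees executed in the pruning step (1 <= m < k).
    """
    pruning_trees, verify_trees = forest[:m], forest[m:]
    # S2.1: run each pruning tree on every pair; Ci = pairs matched by tree i
    C = [{p for p in pairs if tree(p)} for tree in pruning_trees]
    # S2.2: I = C1 ∩ ... ∩ Cm (matched by all pruning trees, output directly)
    I = set.intersection(*C)
    # S3.1: J = (C1 ∪ ... ∪ Cm) \ I (matched by only some pruning trees)
    J = set.union(*C) - I
    # S3.2: run the remaining trees only on the (typically small) set J
    K = [{p for p in J if tree(p)} for tree in verify_trees]
    # S4: output I ∪ K1 ∪ ... ∪ Kn
    return I | (set.union(*K) if K else set())
```

Only the ambiguous pairs in J ever reach the verification trees; pairs outside C1 ∪ ... ∪ Cm are discarded without further work.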
As shown in fig. 1: for example, in S1, the random forest F includes 3 decision trees T1, T2 and R1. In S2, two character strings B1 and B2 are provided, and T1 and T2 are executed on B1 and B2 to obtain outputs C1 and C2; the set I = C1 ∩ C2 is then established, and I, consisting of all pairs predicted as matches by both T1 and T2, can be output as part of the random forest F output. In S3, the set J = (C1 ∪ C2) \ (C1 ∩ C2) is established, and R1 executes the set J to generate a set K; obviously, every pair in K is also a match of the random forest F. In S4, the output of the random forest F is therefore I ∪ K. Any other pair (in neither I nor J) cannot be predicted as a match by T1 and T2, and therefore cannot be a match of the random forest F.
In S3, since the set J is often small, applying the third tree R1 to J is often much faster than applying it to the original character string sets B1 and B2. When F is large, for example 10 trees, the time saved is considerable. Assuming that in this case at least 5 trees must match for F to match, we can apply 6 trees to B1 and B2 to obtain the sets I and J, and then apply the remaining 4 trees to the relatively smaller set J; the former is the pruning of the trees and the latter is the verification of the trees.
The method specifically comprises the following steps: consider a forest F consisting of 10 trees, of which at least 5 must match for F to match. The pruning step then executes 6 trees to produce the set J. Consider a pair p1 ∈ J that matched 4 trees during pruning: as soon as one of the remaining 4 trees matches p1, we can declare p1 a match. Consider a pair p2 ∈ J that matched only one tree during pruning: as soon as one of the remaining 4 trees predicts p2 as a non-match, p2 can no longer reach 5 matches, and we can declare p2 a non-match.
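The early-exit logic for p1 and p2 described above can be sketched as follows; the function name and signature are illustrative assumptions of this sketch:

```python
def verify_pair(count_pruning, remaining_trees, pair, need):
    """Early-exit verification of one pair p in J.

    count_pruning: number of trees that matched the pair during pruning;
    need: minimum total matches for the forest to declare a match (e.g. 5 of 10).
    Returns as soon as the outcome is decided, skipping the remaining trees.
    """
    matches = count_pruning
    left = len(remaining_trees)
    for tree in remaining_trees:
        if tree(pair):
            matches += 1
        left -= 1
        if matches >= need:           # p1 case: enough matches already
            return True
        if matches + left < need:     # p2 case: can no longer reach `need`
            return False
    return matches >= need
```

In the 10-tree example, p1 (4 pruning matches) is decided by the first matching verification tree, and p2 (1 pruning match) by the first non-matching one.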
Therefore, in S2, the execution of the m decision trees T1, T2, ..., Tm is the pruning process; in S3, the execution of R1, R2, ..., Rn is the verification process, and the verification process shortens the time for data processing and analysis.
In S3.2, (R1, R2, ..., Rn) ∪ (T1, T2, ..., Tm) constitutes the random forest F. All the trees in the random forest F are thus fully utilized, so that the analysis result is more accurate.
In S2.1, m is the minimum number of decision trees required for correct analysis, i.e., for the random forest F to correctly resolve the character string Bi into an entity. When the minimum number m of decision trees that guarantees correct analysis is used in the pruning step and the remaining n decision trees are used in the verification step, the effect of shortening the analysis time is greatest, because the pruning step must execute all the character strings, whereas the verification step only needs to execute the pruned set J.
In S2, the k decision trees are pruned before execution. A subset of the trees is applied for pruning, which is done to avoid overfitting, and the remaining trees are then applied to the set J for verification.
Example 2:
In this example, consider matching two sets of names comprising both long names (e.g., Graphene Nanospheres) and short names (e.g., Golf Ball). The two sets of character strings B1 and B2 are matched by performing the pruning and verification of the random forest F to obtain the desired matched strings.
To match the two sets of character strings B1 and B2, we learn the random forest F and then execute F on B1 and B2. In the execution process of the invention, the execution of F is divided into two steps: pruning and verification. The following describes how to perform these two steps efficiently.
Assume that the random forest F has k trees, of which at least ⌈k/2⌉ trees must match for F to match. Clearly, when the pruning step executes at least (⌊k/2⌋ + 1) trees, any string pair not output by this step cannot be a match.
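The guarantee can be checked arithmetically: a pair matched by none of the ⌊k/2⌋ + 1 pruning trees can collect at most k − (⌊k/2⌋ + 1) = ⌈k/2⌉ − 1 matches from the remaining trees, one short of the ⌈k/2⌉ required. A small sketch verifying this for a range of k:

```python
import math

def pruning_is_safe(k):
    """Check that a pair rejected by all floor(k/2)+1 pruning trees can never
    reach the ceil(k/2) matches needed for the forest F to declare a match."""
    executed = k // 2 + 1            # trees run in the pruning step
    best_possible = k - executed     # matches still obtainable afterwards
    return best_possible < math.ceil(k / 2)
```

For example, with k = 10 the pruning step runs 6 trees and a fully rejected pair can gather at most 4 of the 5 required matches.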
As shown in FIG. 2, assume that the random forest F has three trees and that at least two trees must output the same pair before F outputs that pair (i.e., declares it a match). Put simply, we can execute F on the two sets of character strings B1 and B2 by executing each tree Ti on B1 and B2 to obtain an output Ci, and then outputting all pairs that appear in the outputs of at least two trees (see FIG. 2); this, however, is a time-consuming approach.
As shown in FIG. 1, the present invention has the advantage of first executing only two trees, for example executing T1 and T2 on B1 and B2 to obtain outputs C1 and C2 (see FIG. 1). The set I = C1 ∩ C2 consists of all pairs predicted as matches by both T1 and T2, and can therefore be immediately output as part of the random forest F output.
The implementation process of the pruning step is as follows: the set J = (C1 ∪ C2) \ (C1 ∩ C2) consists of all pairs predicted as a match by only one tree (T1 or T2). It can easily be seen that we only need to apply the remaining tree R1 to the set J. Let K be the set of pairs in J that R1 predicts as matches; clearly, any such pair is also a match of the random forest F, since it is matched by exactly two trees (T1 or T2, and R1). Thus, the output of the random forest F is I ∪ K (see FIG. 1). No other pair (i.e., a pair in neither I nor J) can be predicted as a match by T1 and T2, and therefore cannot be a match of F.
The implementation process of the verification step is as follows: the set J tends to be relatively small, so applying the tree R1 to J tends to be much faster than applying it to the character strings B1 and B2. When F is large (e.g., 10 trees), this time saving is very significant. Suppose in this case at least five trees must match for F to match: we can then apply six trees to B1 and B2 to obtain the sets I and J (i.e., the pruning step), and then apply the remaining four trees to the relatively smaller set J (i.e., the verification step).
In the verification step, suppose that executing m trees in the preceding pruning step produced a set of pairs J; it is then necessary to consider how to execute the remaining trees on J. Let U be the set of remaining trees to be executed on the set J. Similar to the way trees are executed in the pruning step, the above optimization procedure can simply be used to generate a plan P that executes all the trees in U in a combined manner (i.e., reusing computations); a better solution, however, is to apply the trees in order, so as to avoid applying all the trees in U to all pairs in J.
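The in-order application of the remaining trees, removing each pair from J as soon as its outcome is decided so that later trees see ever-smaller sets, can be sketched as follows (names and signature are illustrative assumptions):

```python
def verify_set(J, prior_counts, remaining_trees, need):
    """Apply the remaining trees U to J one tree at a time.

    prior_counts: matches each pair collected during the pruning step;
    need: minimum total matches for the forest to declare a match.
    A pair is dropped as soon as it is decided, so later trees run on
    a shrinking set instead of all of J.
    """
    undecided = {p: prior_counts[p] for p in J}
    matched = set()
    left = len(remaining_trees)
    for tree in remaining_trees:
        left -= 1
        for p in list(undecided):
            if tree(p):
                undecided[p] += 1
            if undecided[p] >= need:            # decided: match
                matched.add(p)
                del undecided[p]
            elif undecided[p] + left < need:    # decided: non-match
                del undecided[p]
        if not undecided:
            break
    return matched
```

The per-pair early exits are the p1/p2 logic above, lifted to the whole set J so that each successive tree in U processes only the still-undecided pairs.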
In summary, the invention trains a random forest comprising k decision trees, takes as input the character strings in a reference set, executes the decision trees in the random forest to construct new sets I and J, constructs the output of the random forest using the sets I and J, completes the matching of the reference set with the real entities, and can verify the matching by using the constructed set J during the execution of the random forest.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.