A method for extracting liver disease data classification rules based on random forest

Technical field
The invention belongs to the field of data processing for disease and diagnostic information, and more particularly relates to a method for extracting classification rules from liver disease data based on random forests.
Background technique
Liver cancer is the second leading cause of cancer death worldwide, and primary hepatitis can progress to fibrosis, cirrhosis, and even liver cancer. At present, most diagnostic methods for liver diseases are black-box models that concentrate on the classification problem alone; the accuracy and interpretability of the resulting classification rules are difficult to guarantee, and the information hidden in the data cannot be fully revealed. In practical medical applications, although some black-box models achieve very high precision, they cannot explain the reason for a classification, which is extremely important for doctors. Knowledge rules extracted from data are easier to understand than other representations, so a classification can be explained by a few concise and effective rules. Concise and effective rule extraction can provide a detailed explanation of the underlying decision and is becoming increasingly popular in the medical environment, which demands not only high precision but also intelligibility. Rule extraction has always been a hot research topic in artificial intelligence: many experimental studies combine data from multiple sources to understand a potential problem, and it is important to find and explain the most significant information in those sources. Therefore, an effective algorithm is needed that can extract decision rules while retaining predictive performance, select the relationships among key features to explain the risk factors influencing liver disease, and supply the resulting relational expressions to diagnosis.
At present, many classification algorithms have been successfully applied to hepatitis data sets: clustering based on attribute weighting; extreme learning machines; support vector machines; neural networks; fuzzy rule extraction based on support vector machines; classification and regression trees; support vector recognition; and principal component analysis. Hsieh et al. proposed a particle swarm optimization-based fuzzy hyper-rectangular composite neural network, in which the rules generated by the particle swarm optimization algorithm are pruned during training without reducing (and even improving) recognition performance. Barakat, N. and A. P. Bradley et al. proposed rule extraction from the output vector of an SVM model using a decision tree algorithm. In related work, a naive Bayes tree was used to extract rules from the prediction model of an SVM, together with TREPAN, RIPPER and CART. Another work performs rule extraction from support vector models using ANFIS and DENFIS. Recently, T. Marthi Padmaja et al. proposed a new hybrid algorithm in which support vector data description is combined with RIPPER to improve the interpretability of one-class SVM classification. Most of this work concentrates on SVM classifiers. To improve the interpretability of the generated rules, Sheng Liu, Ronak Y. Patel et al. proposed a systematic model of rule extraction and feature selection based on random forests: rules are extracted from the data through a random forest, the features present in the selected rules are fed back into the random forest for classification verification, and classification with the generated rules can reach the precision of classification on the original data. The feature search algorithm may be one of the most important parts of a feature selection method. Several search strategies for feature selection have been proposed: branch and bound, divide and conquer, greedy methods, evolutionary algorithms, annealing algorithms, and so on. Among these, greedy search strategies, such as forward selection (incremental search) or backward elimination, are among the most popular techniques.
From the foregoing, SVMs, neural networks, decision trees and random forests are the basic models for studying rule extraction. Limiting and extracting the rules mainly relies on L1 and L2 norm regularization to achieve sparsity of rules and features, i.e., feature selection and interpretability.
As described above, in the practical diagnosis of liver diseases, an interpretable model with high predictive performance is very important and also allows the underlying problem to be better understood. State-of-the-art algorithms such as support vector machines (SVM), artificial neural networks and random forests (RF) usually achieve high prediction accuracy, but beyond accuracy it is difficult to explain how these models are built, because they are all "black-box models" or contain many decision rules that cannot be clearly interpreted. On the other hand, some algorithms, particularly those based on decision trees, are easy to explain; however, their predictive performance is usually lower than that of SVM, ANN or RF.
Second, in the diagnosis of liver disease, generating an excessive number of diagnostic rules has no practical meaning for doctors. Rule extraction based on a basic decision tree model can generate many rule sets that are meaningless for the user's intuitive interpretation. Moreover, although L1 norm regularization can achieve rule and feature extraction, it directly sets the weights of rules with very little relevance to 0, which easily causes over-fitting; meanwhile, the L2 norm sets the weights of rules with very little relevance to a very small value, which easily causes under-fitting of the data.
Summary of the invention
In order to solve the problems of the prior art, the present invention selects a model that balances classification performance with model interpretability, and applies an elastic-norm iterative implementation during rule extraction.
For liver disease data, the present invention proposes a new elastic norm convergence algorithm combining L1 and L2 to select a small number of effective rules, through a hybrid method of rule extraction and feature selection: the result of rule extraction is used for feature selection, and the features selected from the generated rules are fed back into the random forest and the elastic norm coding step to extract the important rules. This alternating process is iterated continuously until the selected features and rules no longer change. Finally, and most importantly, so that doctors or users can trust the validity and accuracy of the generated rules, the present invention quantifies performance using coverage and precision, reaching an optimal balance between accuracy and classification precision.
The present invention proposes a binary-coded forest generated by a random forest (RF) on liver data, which maps each sample point into the space defined by the entire set of leaf nodes (rules). Representative rules are then extracted by a coding method using binary coding and the elastic norm. The features that reappear in the selected rules serve as the feature subset for the next cycle: they are used to construct a new RF and generate a new group of rules, and this process is repeated until the stop condition is met, i.e., the number of features remains stable and the number of rules converges.
Specifically, the following technical scheme is adopted:
A method for extracting liver disease data classification rules based on random forest, characterized by comprising the following steps:
Step 1: pre-process the imbalanced or irregular liver disease data, obtaining a liver disease data set through SMOTE (Synthetic Minority Oversampling Technique);
Step 2: perform binary sparse coding on the liver disease data set using a random forest model, obtaining a liver disease rule set;
Step 3: perform elastic norm sparse coding rule extraction on the liver disease rule set, obtaining a coded liver disease rule set;
Step 4: perform feature extraction and deletion on the coded liver disease rule set;
Step 5: perform original-data verification and generate the final rule set.
Imbalance in the original liver data set causes many problems in pattern recognition. For example, if the data set is imbalanced, a classifier tends to "learn" the samples of the largest proportion and cluster them with the highest precision. In practical applications this bias is unacceptable. The present invention handles this through SMOTE (Synthetic Minority Oversampling Technique), which can create "synthetic" examples with the few available samples for each minority class.
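The SMOTE interpolation idea described above can be sketched as follows. This is a minimal illustration of the core mechanism only (interpolating between a minority sample and one of its nearest minority neighbours); the function name and the plain Euclidean neighbour search are assumptions for illustration, not the full SMOTE algorithm used by the invention:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest minority neighbours, and interpolate."""
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    k = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbours per sample
    synth = np.empty((n_new, d))
    for j in range(n_new):
        i = rng.integers(n)                   # random minority sample
        neigh = X_min[rng.choice(nn[i])]      # one of its neighbours
        gap = rng.random()                    # interpolation factor in [0, 1)
        synth[j] = X_min[i] + gap * (neigh - X_min[i])
    return synth
```

Because every synthetic point lies on a line segment between two minority samples, the class distribution is balanced without duplicating existing records.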
Further, in step 2, the method of performing binary sparse coding on the liver disease data set using the random forest model comprises the following steps:
Step 2A: train the liver disease data set to obtain a random forest comprising multiple decision trees; in each decision tree, the path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, each sample of the liver disease data set follows the path from the root node to exactly one leaf node;
Step 2C: define a binary feature vector that captures the leaf-node structure of the random forest. For sample Xi, the corresponding binary vector encoding the leaf nodes is defined as:
Xi = [x1, ..., xq]^T, where q is the total number of leaf nodes;
The space of Xi is then the leaf-node space. In this space each sample is mapped to a vertex of a hypercube, and each dimension of the rule space is defined by one decision rule. Such a mapping therefore essentially determines, for a sample, which rules are effective and which are not.
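The leaf-node encoding of step 2 can be sketched with scikit-learn's random forest. The breast-cancer data set here is only a convenient stand-in for a liver disease data set, which the source does not provide:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for liver data
rf = RandomForestClassifier(n_estimators=10, max_depth=3,
                            random_state=0).fit(X, y)

# rf.apply returns, per sample, the index of the leaf reached in each tree
leaves = rf.apply(X)                          # shape (n_samples, n_trees)

# one-hot encode the leaves of every tree into one binary vector per sample,
# i.e. Xi = [x1, ..., xq]^T with q = total number of leaves in the forest
cols = []
for t in range(leaves.shape[1]):
    ids = np.unique(leaves[:, t])
    cols.append((leaves[:, t][:, None] == ids[None, :]).astype(np.uint8))
B = np.hstack(cols)
```

Each row of B has exactly one active bit per tree, so a sample is mapped to a hypercube vertex indicating which rules (root-to-leaf paths) fire for it.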
Further, in step 3, the method of performing elastic norm sparse coding rule extraction on the liver disease rule set comprises the following steps:
Step 3A: according to the mapping result of step 2C, construct new training samples:
{(X1, y1), (X2, y2), ..., (Xp, yp)};
where Xi is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated category. The classifier is defined by the formula:
f(X) = argmax_k (Wk^T X + bk) (1)
where the weight vector Wk and the scalar bk define the linear discriminant function of the k-th class;
Since each binary attribute represents one decision rule, the weight Wk in formula (1) measures the importance of the rule: the magnitude of the weight indicates the rule's degree of significance. Clearly, in the classifier above, a rule whose weight is 0 for all classes can be safely removed. Rule extraction therefore becomes the problem of learning the weight vectors.
Step 3B: carry out elastic norm regularized learning, with the following objective function:
min over W, b, ξ: Σ_k [P||Wk||_1 + (1−P)||Wk||_2^2] + λ Σ_i ξi
s.t. W_yi^T Xi + b_yi ≥ Wk^T Xi + bk + 1 − ξi for all k ≠ yi, and
ξi ≥ 0, i = 1, ..., p (2)
The objective function consists of two parts. The first term is the elastic norm expression combining the L1 and L2 norms, P||Wk||_1 + (1−P)||Wk||_2^2, which controls the number of non-zero weights and of extracted rules; P is the probability factor selecting between the L1 and L2 norms. The second term is the sum of the slack variables ξi, and λ is the regularization parameter. Because a non-zero slack variable represents a misclassified sample, the second term relates to the empirical error. The resulting sparsity and empirical error depend on the regularization parameter, and L1 and L2 norm sparse coding is widely used in statistics and machine learning: the L1 norm can delete unimportant features, while the L2 norm can prevent over-fitting of the data. After step 3B, the present invention chooses the P value with the highest model cross-validation precision and substitutes this P value into formula (2).
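The elastic-norm selection of step 3B can be approximated with an off-the-shelf elastic-net-penalised linear model. This is a sketch under the assumption that scikit-learn's `l1_ratio` plays the role of the probability factor P and `C` the role of 1/λ; the toy binary matrix stands in for the rule encoding of step 2:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(200, 12)).astype(float)  # 200 samples x 12 "rules"
y = ((B[:, 0] + B[:, 3]) > 0).astype(int)             # only rules 0 and 3 matter

# elastic-net penalty = l1_ratio * L1 + (1 - l1_ratio) * L2, cf. the factor P
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.7, C=0.5, max_iter=5000).fit(B, y)

w = clf.coef_.ravel()
kept = np.flatnonzero(np.abs(w) > 1e-3)   # rules whose weights survive
```

The indices in `kept` are the extracted rules; in practice P (`l1_ratio`) and λ (`1/C`) would be chosen by cross-validation as described in step 3B.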
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, calculate its out-of-bag data error using the corresponding OOB (out-of-bag) data, denoted errOOB1;
Step 3D: randomly add noise interference to the feature in all samples of the out-of-bag data, and calculate the out-of-bag data error again, denoted errOOB2;
Step 3E: supposing there are Ntree decision trees in the random forest, define the importance of the feature as:
Σ(errOOB2 − errOOB1)/Ntree (3)
and calculate the importance of all features.
Formula (3) serves as the metric of a feature's importance because, if the out-of-bag accuracy drops greatly after random noise is added to a feature, that feature has a large influence on the classification results of the samples, that is, its degree of importance is relatively high.
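Steps 3C–3E can be sketched as a permutation-importance computation. This is illustrative: a synthetic data set in which only feature 0 is informative stands in for liver data, and a single held-out split stands in for the per-tree out-of-bag sets of formula (3):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)            # only feature 0 determines the label

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:200], y[:200])
Xte, yte = X[200:], y[200:]

base_err = 1.0 - rf.score(Xte, yte)      # analogue of errOOB1
imp = np.empty(X.shape[1])
for f in range(X.shape[1]):
    Xp = Xte.copy()
    Xp[:, f] = rng.permutation(Xp[:, f])  # "add noise" by permuting feature f
    imp[f] = (1.0 - rf.score(Xp, yte)) - base_err   # error increase (errOOB2 - errOOB1)
```

Feature 0 should receive by far the largest importance, since permuting it destroys the only signal in this toy data.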
Further, in step 4, the method of extracting and deleting features from the coded liver disease rule set comprises the following steps:
The feature distribution in the random forest is determined by its learning process, and it usually differs from the feature distribution generated by the rule extraction in the previous formula. Since assumptions about important features are made in the extracted decision rules, feature dissimilarity can be used to select features. If a feature does not appear in the rules extracted by formula (2), it is deleted, because it has no influence on the classifier defined by formula (1). Under this idea, rules and features can be selected simultaneously.
The regularization parameter λ can be selected by cross-validation on the training set. Reconfiguring the random forest with the selected features can further select rules, to obtain an overall compact rule set. This forms an iterative process: the features selected in the previous iteration are used to construct a new random forest, which generates new rules, and the iteration continues until the selected features no longer change.
Thus:
Step 4A: if a feature does not appear in the rules extracted by formula (2), delete it;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconfigure and retrain the random forest;
Step 4C: repeat the iterative process of steps 2A to 4B until the selected features no longer change.
Further, in step 5, performing original-data verification and generating the final rule set comprises the following steps:
Step 5A: given the liver disease data set D with class labels, let ncovers be the number of data items covered and ncorrect be the number of data items accurately classified by rule set R. The coverage and accuracy of rule set R are respectively defined as:
coverage(R) = ncovers/|D|, accuracy(R) = ncorrect/ncovers
The higher the coverage and accuracy of a rule, the greater the credibility of the rule for auxiliary diagnosis; the rules with relatively high coverage and accuracy are taken to generate the final rule set.
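The step-5 metrics can be computed as in the following sketch. The representation of a rule as a (predicate, predicted label) pair, the "alt" measurement name, and the first-match policy are illustrative assumptions only:

```python
def rule_metrics(rules, data):
    """coverage = covered samples / all samples;
    accuracy = correctly classified covered samples / covered samples."""
    covered = correct = 0
    for x, label in data:
        hits = [pred for cond, pred in rules if cond(x)]
        if hits:
            covered += 1
            if hits[0] == label:        # first matching rule decides
                correct += 1
    n = len(data)
    return covered / n, (correct / covered) if covered else 0.0

# two toy diagnostic rules on a single hypothetical "alt" measurement
rules = [(lambda x: x["alt"] > 40, 1),
         (lambda x: x["alt"] <= 20, 0)]
data = [({"alt": 50}, 1), ({"alt": 10}, 0),
        ({"alt": 30}, 1), ({"alt": 45}, 0)]
cov, acc = rule_metrics(rules, data)    # covers 3 of 4 samples, 2 of 3 correct
```

Rules whose coverage and accuracy both exceed chosen thresholds would then be retained in the final rule set.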
The elastic norm rule extraction and feature selection method combining the L1 and L2 norms, proposed by the present invention and its preferred embodiments, enables the method not only to choose relatively few features but also to improve generalization ability and classification precision.
The secondary extraction-and-verification method proposed by the present invention (i.e., performing original-data verification and generating the final rule set) greatly improves the credibility of the generated rules.
The present invention can realize multi-class rule extraction and verification, overcoming the limitation in prior work that rule extraction could only be performed for two-class classification.
An imbalanced training data set causes many problems in pattern recognition. For example, if the data set is imbalanced, a classifier tends to "learn" the samples of the largest proportion and cluster them with the highest precision; in practical applications this bias is unacceptable. In order to achieve a uniform distribution of the sample data, the present invention solves this problem using the Synthetic Minority Oversampling Technique, whose algorithm creates "synthetic" examples for each minority class that has few samples.
The application advantages of the present invention over the prior art in concrete examples are as follows:
1. Existing rule extraction algorithms for liver disease data are mainly based on SVMs or decision trees, and the feature search algorithm may be the most important part of a feature selection method. Several search strategies for feature selection have been proposed: branch and bound, divide and conquer, greedy methods, evolutionary algorithms, annealing algorithms, and so on. Greedy search strategies, such as forward selection (incremental search) or backward elimination, are among the most popular techniques, but their computational efficiency and robustness are poor and they are prone to over-fitting or under-fitting.
The present invention uses the random forest as the basic model for liver disease data, overcoming the disadvantage that an SVM is highly precise but its rules cannot be interpreted. At the same time, it innovatively uses the elastic norm convergence combining L1 and L2, which not only solves the above problems, but also solves the problem that the L1 norm deletes too many rules or features and causes over-fitting, and the problem that the L2 norm keeps too many rules or features and causes under-fitting.
2. At present there is no effective verification algorithm for the rule sets produced by liver disease rule extraction: the generated rules are simply taken as the final rules, and the credibility of such a strategy is insufficient.
The present invention uses a rule verification algorithm as a secondary verification step for the generated rule set. This solves two problems: (1) when the number of rules is small, the credibility of each rule on the original samples can be verified; (2) when the number of rules is large, it can serve both as a means of simplifying the rules and as an algorithm for verifying rule credibility again.
3. In medical data, and especially when the original liver disease data contain noise or missing values, data anomalies bias the precision of the model and the generated rules toward the more normal part of the data.
The present invention uses the missing-value processing in the Synthetic Minority Oversampling Technique: missing values are first filled with the median to keep the data continuous; secondly, resampling keeps the amounts of the different classes of samples consistent, and cross-validation guarantees sufficient training samples.
4. In existing rule extraction algorithms for liver data, most perform rule extraction separately on one class at a time, which obviously increases the computation time and is not practical enough for real application.
Before running sample rule extraction, the random forest model used by the present invention first stores the data by class and reconfirms that the sample sizes of the different classes are consistent, and then performs rule extraction and feature selection simultaneously on the samples whose liver disease classification is complete. Such processing improves the overall computational efficiency.
Description of the drawings
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments:
Fig. 1 is a schematic diagram of the overall procedure of the method of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the binary coding for the random forest in the embodiment of the present invention;
Fig. 3 is a schematic diagram of rule rejection in the embodiment of the present invention;
Fig. 4 is a schematic diagram of the combination of the L1 and L2 norms in the embodiment of the present invention;
Fig. 5 is a flow diagram of the main algorithm of the method of the embodiment of the present invention.
Specific embodiment
So that the features and advantages of this patent can be more clearly understood, specific embodiments are described in detail below:
As shown in Fig. 1, the embodiment of the present invention comprises the following steps:
Step 1: pre-process the imbalanced or irregular liver disease data, obtaining a liver disease data set through SMOTE (Synthetic Minority Oversampling Technique);
Step 2: perform binary sparse coding on the liver disease data set using a random forest model, obtaining a liver disease rule set;
Step 3: perform elastic norm sparse coding rule extraction on the liver disease rule set, obtaining a coded liver disease rule set;
Step 4: perform feature extraction and deletion on the coded liver disease rule set;
Step 5: perform original-data verification and generate the final rule set.
Imbalance in the original liver data set causes many problems in pattern recognition. For example, if the data set is imbalanced, a classifier tends to "learn" the samples of the largest proportion and cluster them with the highest precision. In practical applications this bias is unacceptable. The present invention handles this through SMOTE (Synthetic Minority Oversampling Technique, mainly comprising data balancing processing and missing-data processing), which can create "synthetic" examples with the few available samples for each minority class.
In step 2, the method of performing binary sparse coding on the liver disease data set using the random forest model comprises the following steps:
Step 2A: train the liver disease data set to obtain a random forest comprising multiple decision trees; in each decision tree, the path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: in each decision tree, each sample of the liver disease data set follows the path from the root node to exactly one leaf node;
Step 2C: define a binary feature vector that captures the leaf-node structure of the random forest. For sample Xi, the corresponding binary vector encoding the leaf nodes is defined as:
Xi = [x1, ..., xq]^T, where q is the total number of leaf nodes;
As shown in Fig. 2, an example applied by the present invention illustrates the mapping relations of the nodes.
The space of Xi is then the leaf-node space. In this space each sample is mapped to a vertex of a hypercube, and each dimension of the rule space is defined by one decision rule. Such a mapping therefore essentially determines, for a sample, which rules are effective and which are not.
In step 3, the method of performing elastic norm sparse coding rule extraction on the liver disease rule set comprises the following steps:
Step 3A: according to the mapping result of step 2C, construct new training samples:
{(X1, y1), (X2, y2), ..., (Xp, yp)};
where Xi is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated category. The classifier is defined by the formula:
f(X) = argmax_k (Wk^T X + bk) (1)
where the weight vector Wk and the scalar bk define the linear discriminant function of the k-th class;
Since each binary attribute represents one decision rule, the weight Wk in formula (1) measures the importance of the rule: the magnitude of the weight indicates the rule's degree of significance. Clearly, in the classifier above, a rule whose weight is 0 for all classes can be safely removed. Rule extraction therefore becomes the problem of learning the weight vectors.
Step 3B: carry out elastic norm regularized learning, with the following objective function:
min over W, b, ξ: Σ_k [P||Wk||_1 + (1−P)||Wk||_2^2] + λ Σ_i ξi
s.t. W_yi^T Xi + b_yi ≥ Wk^T Xi + bk + 1 − ξi for all k ≠ yi, and
ξi ≥ 0, i = 1, ..., p (2)
As shown in Fig. 4, the objective function consists of two parts. The first term is the elastic norm expression combining the L1 and L2 norms, P||Wk||_1 + (1−P)||Wk||_2^2, which controls the number of non-zero weights and of extracted rules; P is the probability factor selecting between the L1 and L2 norms. The second term is the sum of the slack variables ξi, and λ is the regularization parameter. Because a non-zero slack variable represents a misclassified sample, the second term relates to the empirical error. The resulting sparsity and empirical error depend on the regularization parameter, and L1 and L2 norm sparse coding is widely used in statistics and machine learning: the L1 norm can delete unimportant features, while the L2 norm can prevent over-fitting of the data. After step 3B, the present invention chooses the P value with the highest model cross-validation precision and substitutes this P value into formula (2).
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, calculate its out-of-bag data error using the corresponding OOB (out-of-bag) data, denoted errOOB1;
Step 3D: randomly add noise interference to the feature in all samples of the out-of-bag data, and calculate the out-of-bag data error again, denoted errOOB2;
Step 3E: supposing there are Ntree decision trees in the random forest, define the importance of the feature as:
Σ(errOOB2 − errOOB1)/Ntree (3)
and calculate the importance of all features.
Formula (3) serves as the metric of a feature's importance because, if the out-of-bag accuracy drops greatly after random noise is added to a feature, that feature has a large influence on the classification results of the samples, that is, its degree of importance is relatively high.
In step 4, the method of extracting and deleting features from the coded liver disease rule set comprises the following steps:
The feature distribution in the random forest is determined by its learning process, and it usually differs from the feature distribution generated by the rule extraction in the previous formula. Since assumptions about important features are made in the extracted decision rules, feature dissimilarity can be used to select features. If a feature does not appear in the rules extracted by formula (2), it is deleted, because it has no influence on the classifier defined by formula (1). Under this idea, rules and features can be selected simultaneously.
The regularization parameter λ can be selected by cross-validation on the training set. Reconfiguring the random forest with the selected features can further select rules, to obtain an overall compact rule set. This forms an iterative process: the features selected in the previous iteration are used to construct a new random forest, which generates new rules, and the iteration continues until the selected features no longer change.
As shown in Fig. 3, which presents the final result of processing the random forest of Fig. 2 in the present invention, there are thus the following steps:
Step 4A: if a feature does not appear in the rules extracted by formula (2), delete it;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconfigure and retrain the random forest;
Step 4C: repeat the iterative process of steps 2A to 4B until the selected features no longer change.
In step 5, performing original-data verification and generating the final rule set comprises the following steps:
Step 5A: given the liver disease data set D with class labels, let ncovers be the number of data items covered and ncorrect be the number of data items accurately classified by rule set R. The coverage and accuracy of rule set R are respectively defined as:
coverage(R) = ncovers/|D|, accuracy(R) = ncorrect/ncovers
The higher the coverage and accuracy of a rule, the greater the credibility of the rule for auxiliary diagnosis; the rules with relatively high coverage and accuracy are taken to generate the final rule set.
As shown in Fig. 5, the operational flow of the pseudocode of the embodiment of the present invention is as follows:
Preconditions:
1: initialize the feature variable F.
2: randomly select training samples X from the total samples D.
Output:
1: selected features Ff;
2: selected rules Rf.
Main steps of the algorithm:
1: features Fi, i = 1;
2: while the selected features still change (Fi ≠ Fi−1), execute:
3: run the random forest model, the data set being the training samples X with features Fi;
4: the random forest generates a series of rule sets Rr;
5: encode the training samples X using rule set Rr;
6: obtain the cross-validation precision Ci and the weight values W with the linear formula (2);
7: keep the weights whose magnitude exceeds a threshold (a preset sufficiently small value), and record the indices of all such parameters;
8: transmit the Rr indices to Ri;
9: transmit the features in Ri to Fi+1, i = i + 1;
10: end loop;
11: transmit the i with the maximum cross-validation precision Ci to i*;
12: transmit Fi* to Ff;
13: transmit Ri* to Rf;
14: return Ff, Rf.
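The main loop of Fig. 5 can be sketched end-to-end as below. This is a simplified, hypothetical rendering: synthetic data stands in for liver data, scikit-learn's elastic-net logistic regression stands in for formula (2), and the mapping from surviving rules back to features is approximated by the features the forest actually splits on, rather than tracing each kept root-to-leaf path:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def encode_leaves(rf, X):
    """Binary rule-activation matrix: one column per leaf of the forest."""
    leaves = rf.apply(X)
    cols = []
    for t in range(leaves.shape[1]):
        ids = np.unique(leaves[:, t])
        cols.append((leaves[:, t][:, None] == ids[None, :]).astype(float))
    return np.hstack(cols)

def rf_rule_iteration(X, y, max_iter=5, l1_ratio=0.7, w_tol=1e-3, seed=0):
    feats = np.arange(X.shape[1])            # F1 = all features
    for _ in range(max_iter):
        rf = RandomForestClassifier(n_estimators=20, max_depth=3,
                                    random_state=seed).fit(X[:, feats], y)
        B = encode_leaves(rf, X[:, feats])   # rules Rr encode the samples X
        lin = LogisticRegression(penalty="elasticnet", solver="saga",
                                 l1_ratio=l1_ratio, C=0.5,
                                 max_iter=5000).fit(B, y)
        # rules whose weight magnitude exceeds the threshold survive (step 7)
        kept_rules = np.flatnonzero(np.abs(lin.coef_).max(axis=0) > w_tol)
        # features feeding the next iteration (simplification: every split
        # feature of the current forest)
        used = set()
        for tree in rf.estimators_:
            used |= {int(feats[f]) for f in tree.tree_.feature if f >= 0}
        new_feats = np.array(sorted(used))
        if np.array_equal(new_feats, feats): # stop: features stable
            break
        feats = new_feats
    return feats, kept_rules
```

The returned pair corresponds to (Ff, Rf) in the pseudocode; a full implementation would additionally track the cross-validation precision Ci of each iteration and return the best one, as in steps 11 to 13.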
This patent is not limited to the above preferred embodiment. Under the inspiration of this patent, anyone can derive other various forms of the method for extracting liver disease data classification rules based on random forest, and effective rule extraction can be carried out under the design scheme of the present invention for other types of sample data with obviously irregular or imbalanced characteristics. All equivalent changes and modifications within the scope of the patent application of the present invention shall fall within the coverage scope of this patent.