CN109409434A - Method for extracting classification rules of liver disease data based on random forest - Google Patents

Method for extracting classification rules of liver disease data based on random forest

Info

Publication number
CN109409434A
Authority
CN
China
Prior art keywords
rule
liver disease
random forest
data
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811292849.5A
Other languages
Chinese (zh)
Other versions
CN109409434B (en)
Inventor
黄立勤
陈宋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Publication of CN109409434A
Application granted
Publication of CN109409434B
Legal status: Active (current)
Anticipated expiration


Abstract


The present invention proposes a method for extracting classification rules of liver disease data based on random forest, including: Step 1: preprocess the unbalanced or irregular data in the liver disease data, and obtain a liver disease data set through SMOTE (synthetic minority oversampling technique); Step 2: perform binary sparse coding on the liver disease data set using a random forest model to obtain a liver disease rule set; Step 3: perform elastic norm sparse coding rule extraction on the liver disease rule set to obtain a coded liver disease rule set; Step 4: perform feature extraction and deletion on the coded liver disease rule set; Step 5: verify against the original data and generate the final rule set. The elastic norm rule extraction and feature selection method combining the L1 and L2 norms proposed by the present invention enables the method not only to select relatively few features but also to improve generalization ability and classification accuracy. The secondary rule extraction and verification method proposed by the present invention greatly improves the credibility of the generated rules.

Description

Method for extracting classification rules of liver disease data based on random forest
Technical field
The invention belongs to the field of data processing for disease and diagnostic information, and more particularly relates to a method for extracting classification rules from liver disease data based on random forest.
Background technique
Liver cancer is the second leading cause of cancer death worldwide, and primary hepatitis can progress to fibrosis, cirrhosis and even liver cancer. Most current diagnostic methods for liver disease are black-box models and still concentrate on the classification problem; it is difficult to reconcile the accuracy and interpretability of liver disease classification rules, and the information hidden in the data cannot be fully revealed. In practical medical applications, although some black-box models achieve very high precision, they cannot give the reason for a classification, which is very important for a physician. Knowledge rules extracted from data are easier to understand than other representations, so the explanation of a classification can be expressed by a few concise and effective rules. Concise and effective rule extraction can provide a detailed explanation of the underlying decision, and it is becoming increasingly popular in medical settings, which require not only high precision but also intelligibility. Rule extraction has always been a hot research topic in artificial intelligence. So-called rule extraction means that many experimental studies combine data from multiple sources in order to understand a potential problem; finding and explaining the most important information from these sources is critical. Therefore, an effective algorithm is needed that can extract decision rules and, while retaining predictive performance, select the relationships between key features to explain the risk factors influencing liver disease and supply the resulting relational expressions of the influencing factors to diagnosis.
At present, many diagnostic methods for hepatitis data sets have successfully applied different classification algorithms: attribute-weighted clustering; extreme learning machines; support vector machines; neural networks; fuzzy rule extraction based on support vector machines; classification and regression trees; support vector recognition; principal component analysis. Hsieh et al. proposed a particle swarm optimization based hypermatrix fuzzy composite neural network, in which the rules generated by the particle swarm optimization algorithm are pruned during training without reducing (and even improving) recognition performance. Barakat, N. and A. P. Bradley et al. proposed rule extraction using the output vector of an SVM model together with a decision tree algorithm. In related work, a naive Bayes tree was used to extract rules from SVM prediction models with TREPAN, RIPPER and CART. Another work extracts rules from support vector models using ANFIS and DENFIS. Recently, T. Marthi Padmaja et al. proposed a new hybrid algorithm in which support vector data plus RIPPER improve the interpretability of one-class SVM classification. Most of this work concentrates on SVM classifiers. In order to improve the interpretability of the generated rules, Sheng Liu, Ronak Y. Patel et al. proposed a systematic model of rule extraction and feature selection based on random forest: the data passes through a random forest for rule extraction, the features present in the selected rules are fed back into the random forest for classification verification, and classification with the generated rules can reach the precision of classifying the original data. The feature search algorithm may be one of the most important parts of a feature selection method. Several search strategies for feature selection have been proposed: branch and bound, divide and conquer, greedy methods, evolutionary algorithms, annealing algorithms, etc. Among them, greedy search strategies, for example forward selection (incremental search) or backward elimination, are among the most popular techniques.
From the foregoing, SVM, neural networks, decision trees and random forests are the basic models used to study rule extraction. The limitation and extraction of the number of rules mainly rely on L1 and L2 norm regularization, which makes the rules and features sparse, i.e. it implements feature selection and interpretability.
As described above, in the practical diagnosis of liver disease, an interpretable model and high predictive performance are both very important, and they also allow the underlying problem to be better understood. State-of-the-art algorithms, such as support vector machines (SVM), artificial neural networks and random forests (RF), usually give highly accurate predictions, but apart from accuracy it is difficult to explain how these models are built, because they are all "black-box" models, or they contain many decision rules that we cannot clearly explain. On the other hand, some algorithms, especially those based on decision trees, are easy to explain; however, compared with SVM, ANN or RF, their predictive performance is usually lower.
Second, in the diagnosis of liver disease, generating too many diagnostic rules has no practical meaning for a physician. Rule extraction based on a decision-tree base model can therefore produce many rule sets, which is meaningless for an intuitive explanation to the user. L1 norm regularization can realize rule and feature extraction, but it directly sets rules with very small relevance to 0, which easily causes over-fitting; meanwhile the L2 norm sets rules with very small relevance to a very small value, which easily causes under-fitting of the data.
Summary of the invention
In order to solve the problems of the prior art, the present invention selects a model that balances classification performance and interpretability, and uses an iterative elastic norm implementation during rule extraction.
For liver disease data, the present invention proposes a new elastic norm convergence algorithm combining L1 and L2 to select an effective and small number of rules, through a mixed method of rule extraction and feature selection: the result of rule extraction is used for feature selection, the features selected in the generated rules are fed back into the random forest and the elastic norm coding step to extract the important rules, and this alternating process is iterated until the selected features and rules no longer change. Finally, and most importantly, so that physicians or users can trust the validity and accuracy of the generated rules, the present invention quantifies performance with coverage and precision, reaching the optimal balance between credibility and classification precision.
The present invention uses the binary-coded forest generated by a random forest (RF) on the liver data: each sample point is mapped into the space defined by the whole set of leaf nodes (rules). A coding method based on binary coding and the elastic norm then extracts the representative rules. The features that appear in the selected rules are used as the sub-feature set for the next cycle: they are used to construct a new RF and generate a new group of rules, and this process is repeated until the stop condition is met, i.e. the number of features stays stable and the number of rules converges.
Specifically, the following technical scheme is used:
A method for extracting classification rules of liver disease data based on random forest, characterized by including the following steps:
Step 1: preprocess the unbalanced or irregular data in the liver disease data, and obtain a liver disease data set through SMOTE (synthetic minority oversampling technique);
Step 2: perform binary sparse coding on the liver disease data set using a random forest model to obtain a liver disease rule set;
Step 3: perform elastic norm sparse coding rule extraction on the liver disease rule set to obtain a coded liver disease rule set;
Step 4: perform feature extraction and deletion on the coded liver disease rule set;
Step 5: perform original data verification and generate the final rule set.
The imbalance existing in the original liver disease data will cause many problems in pattern recognition. For example, if the data set is unbalanced, a classifier tends to "learn" the samples of the majority class and to cluster them with the highest precision. In practical applications, this bias is unacceptable. The present invention handles this through SMOTE (synthetic minority oversampling technique), which can create "synthetic" examples for each minority class that has few samples.
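To make this preprocessing concrete, the following is a minimal sketch assuming scikit-learn and the imbalanced-learn library; the file name and label column are hypothetical placeholders, not taken from the patent.

# Minimal preprocessing sketch: median imputation then SMOTE balancing.
import pandas as pd
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

df = pd.read_csv("liver_disease.csv")            # hypothetical file name
X_raw = df.drop(columns=["label"]).to_numpy()    # hypothetical label column
y = df["label"].to_numpy()

# Fill missing values with the median to keep the data continuous.
X_filled = SimpleImputer(strategy="median").fit_transform(X_raw)

# SMOTE creates "synthetic" examples for each minority class.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_filled, y)
print(X_bal.shape, pd.Series(y_bal).value_counts().to_dict())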
Further, in step 2, the method for performing binary sparse coding on the liver disease data set using the random forest model includes the following steps:
Step 2A: train the liver disease data set to obtain a random forest including multiple decision trees; in each decision tree, the path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: each sample of the liver disease data set corresponds, in each decision tree, to a path from the root node to exactly one leaf node;
Step 2C: define a binary feature vector that captures the random forest leaf node structure: for sample Xi, the corresponding binary vector encoding the leaf nodes is defined as:
Xi = [X1, ..., Xq]^T, where q is the total number of leaf nodes;
The space of Xi is then the leaf node space; in this space each sample is mapped to a vertex of a hypercube, and each dimension of the rule space is defined as a decision rule. Such a mapping therefore essentially defines, for each sample, which rules are effective and which are invalid.
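A minimal sketch of this leaf-node binary encoding, assuming scikit-learn (the patent does not prescribe a specific library): each sample is mapped to the leaf it reaches in every tree, and the leaf indices are one-hot encoded into the binary rule vector Xi.

# Binary sparse coding of samples via random forest leaf nodes (sketch).
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

def encode_leaves(X_train, y_train, X, n_trees=100, seed=0):
    """Train a forest and map samples to binary leaf-indicator vectors."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X_train, y_train)
    # apply() gives, per sample, the index of the leaf reached in each tree.
    leaves_train = rf.apply(X_train)            # shape (n_samples, n_trees)
    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(leaves_train)
    # Each column of the result corresponds to one leaf node, i.e. one decision rule.
    return rf, enc, enc.transform(rf.apply(X))  # sparse binary rule matrix

# Example (assuming X_bal, y_bal from the preprocessing sketch above):
# rf, enc, X_rules = encode_leaves(X_bal, y_bal, X_bal)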
Further, in step 3, the method for performing elastic norm sparse coding rule extraction on the liver disease rule set includes the following steps:
Step 3A: according to the mapping result of step 2C, construct new training samples:
{(X1, y1), (X2, y2), ..., (Xp, yp)};
where Xi is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label; the classification formula is defined as follows:
where the weight vector Wk and the scalar bk define the linear discriminant function of the k-th class;
Since each binary attribute represents a decision rule, the weight Wk in formula (1) measures the importance of a rule: the magnitude of the weight shows the importance level of the rule. Obviously, in the above classifier, a rule whose weights are 0 for all classes can be safely removed. Rule extraction therefore becomes the problem of learning the weight vectors.
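The image carrying formula (1) did not survive extraction; a plausible reconstruction, assuming the standard multi-class linear discriminant implied by the surrounding text, is:

\[ \hat{y}(X_i) \;=\; \arg\max_{k \in \{1,\dots,K\}} \left( W_k^{\top} X_i + b_k \right) \qquad (1) \]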
Step 3B: perform elastic norm regularized learning, where the objective function, subject to the constraints ξi ≥ 0, i = 1, ..., p, is given by formula (2).
The objective function consists of two terms. The first term is the elastic norm formula combining the L1 and L2 norms; it controls the number of non-zero weights and of extracted rules, and P is the probability factor for selecting the L1 or L2 norm, with 0 ≤ P ≤ 1. The second term εk is the sum of the slack variables; λ is the regularization parameter. Because a non-zero slack variable represents a misclassified sample, the second term is related to the empirical error. The resulting sparsity and empirical error depend on the regularization parameter, and L1 and L2 norm sparse coding has been widely used in statistics and machine learning: the L1 norm can delete unimportant features, while the L2 norm can prevent over-fitting of the data. After step 3B, the present invention adopts the P value with the highest model cross-validation precision and substitutes this P value into formula (2).
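The image carrying the objective (2) also did not survive; one consistent reading, assuming an elastic-net-regularized multi-class hinge loss (this exact form is an assumption, not quoted from the patent), is:

\[ \min_{W,\,b,\,\xi} \;\; \sum_{k=1}^{K} \left[ P\,\lVert W_k \rVert_1 + (1-P)\,\lVert W_k \rVert_2^{2} \right] \;+\; \lambda \sum_{i=1}^{p} \xi_i \]
\[ \text{s.t.} \;\; \left( W_{y_i}^{\top} X_i + b_{y_i} \right) - \left( W_k^{\top} X_i + b_k \right) \ge 1 - \xi_i \quad \forall\, k \ne y_i, \qquad \xi_i \ge 0,\; i = 1,\dots,p \qquad (2) \]

with \(0 \le P \le 1\).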
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, use the corresponding OOB (out-of-bag) data to calculate its out-of-bag data error, denoted errOOB1;
Step 3D: randomly add noise interference to the feature in question for all samples of the out-of-bag data OOB, and calculate the out-of-bag data error again, denoted errOOB2;
Step 3E: suppose there are Ntree decision trees in the random forest; the importance of the feature is then defined as:
∑(errOOB1 − errOOB2)/Ntree    (3)
Calculate the importance of all features.
Formula (3) is used as the measure of the importance of the corresponding feature because: if, after random noise is added to a feature, the out-of-bag accuracy drops greatly, this shows that the feature has a large influence on the classification result of the samples, that is, its importance level is relatively high.
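A simplified sketch of this permutation-style importance, assuming scikit-learn; for brevity it uses a held-out validation split in place of each tree's true out-of-bag samples, so it is a simplification of steps 3C-3E rather than an exact reproduction.

# Permutation feature importance in the spirit of formula (3) (simplified sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def permutation_importance_per_tree(rf, X_val, y_val, seed=0):
    """Average (errOOB1 - errOOB2) over trees, with X_val standing in for OOB data."""
    rng = np.random.default_rng(seed)
    importance = np.zeros(X_val.shape[1])
    for tree in rf.estimators_:
        pred = rf.classes_[tree.predict(X_val).astype(int)]
        err1 = np.mean(pred != y_val)                       # errOOB1
        for j in range(X_val.shape[1]):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])    # add "noise" to feature j
            pred2 = rf.classes_[tree.predict(X_perm).astype(int)]
            importance[j] += err1 - np.mean(pred2 != y_val) # errOOB1 - errOOB2
    return importance / len(rf.estimators_)                 # divide by Ntree

# Usage (X, y assumed to be the preprocessed liver data):
# X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
# rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# print(permutation_importance_per_tree(rf, X_val, y_val))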
Further, in step 4, the method for extracting and deleting features of the coded liver disease rule set includes the following steps:
The feature distribution in a random forest is determined by the learning process of the random forest, and it is usually different from the feature distribution generated by the rule extraction of the preceding formula. Since the important features are assumed to appear in the extracted decision rules, feature dissimilarity can be used to select features: if a feature does not appear in the rules extracted by formula (2), it is deleted, because it has no influence on the classifier defined by formula (1). Under this idea, rules and features can be selected simultaneously.
The regularization parameter λ can be selected by cross-validation on the training set. Reconstructing the random forest from the selected features can further select rules and obtain an overall compact rule set. In such an iterative process, the features selected in the previous iteration are used to construct a new random forest, which generates new rules, and the iteration continues until the selected features no longer change.
Thus:
Step 4A: if a certain feature does not appear in the rules extracted by formula (2), it is deleted;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct the random forest and train it;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change.
Further, in step 5, performing original data verification and generating the final rule set includes the following steps:
Step 5A: given the class-labeled liver disease data set D, let ncovers be the number of data points covered and ncorrect the number of data points accurately classified by rule set R; the coverage and accuracy of rule set R are then defined respectively as:
The higher the coverage and accuracy of a rule, the greater its credibility for auxiliary diagnosis; the rules with relatively high coverage and accuracy form the final rule set.
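The coverage and accuracy formulas were carried by images that did not survive; a common definition consistent with the surrounding text (an assumption, not quoted from the patent) is:

\[ \mathrm{coverage}(R) = \frac{n_{\mathrm{covers}}}{\lvert D \rvert}, \qquad \mathrm{accuracy}(R) = \frac{n_{\mathrm{correct}}}{n_{\mathrm{covers}}} \]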
The elastic norm rule extraction and feature selection method combining the L1 and L2 norms proposed by the present invention and its preferred embodiments enables the method not only to select relatively few features but also to improve generalization ability and classification precision.
The secondary rule extraction and verification method proposed by the present invention (i.e. performing original data verification and generating the final rule set) greatly improves the credibility of the generated rules.
The present invention can realize rule extraction and verification for multi-class problems, overcoming the limitation of prior efforts in which rule extraction was confined to the classification step alone.
An unbalanced training data set causes many problems in pattern recognition. For example, if the data set is unbalanced, a classifier tends to "learn" the samples of the majority class and to cluster them with the highest precision. In practical applications, this bias is unacceptable. In order to obtain a uniform distribution of the sample data, the present invention solves this problem with the synthetic minority oversampling technique, whose algorithm creates "synthetic" examples for each minority class that has few samples.
The application advantages of the present invention over the prior art in specific examples are as follows:
1. Existing liver disease data rule extraction algorithms are mainly based on SVM or decision trees, and the feature search algorithm may be the most important part of the feature selection method. Several search strategies for feature selection have been proposed: branch and bound, divide and conquer, greedy methods, evolutionary algorithms, annealing algorithms, etc. Among them, greedy search strategies such as forward selection (incremental search) or backward elimination are among the most popular techniques, but their computational efficiency and robustness easily lead to over-fitting or under-fitting.
The present invention uses random forest as the base model for liver disease data, which overcomes the drawback that SVM achieves high precision but cannot interpret the rules, and innovatively uses the elastic norm convergence combining L1 and L2, which not only addresses the shortcomings of the above methods but also solves the over-fitting caused by the L1 norm deleting too many rules or features and the under-fitting caused by the L2 norm keeping too many rules or features.
2. For the rule sets currently produced by liver disease rule extraction, there is no effective verification algorithm: the generated rules are simply taken as the final rules, and the credibility of such a strategy is not good enough.
The present invention instead uses a rule verification algorithm as a secondary verification step for the generated rule set. It solves two problems: (1) when the number of rules is small, the credibility of every rule on the original samples can be verified; (2) when the number of rules is large, it serves both as a means of simplifying the rules and as an algorithm for verifying rule credibility.
3. In medical data, especially in original liver disease data containing noise or missing values, data anomalies bias the precision of the model and the generated rules towards the more normal part of the data.
The present invention uses the missing-value handling of the synthetic minority oversampling technique: missing values are first filled with the median to keep the data continuous; secondly, resampling is used to keep the amounts of the different classes consistent, and cross-validation is used to guarantee enough training samples.
4. In existing rule extraction algorithms for liver data, most carry out rule extraction separately, one class at a time, which obviously increases the computation time and makes practical application less realistic.
Before running rule extraction on the samples, the random forest model used by the present invention first stores the data by class and re-confirms that the sample sizes of the different classes are consistent; rule extraction and feature selection are then computed simultaneously on the samples whose liver disease classification has been completed. Such processing improves the overall computational efficiency.
Description of the drawings
The present invention is described in more detail below with reference to the accompanying drawings and specific embodiments:
Fig. 1 is an overall flow diagram of the method of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the binary coding of the random forest in the embodiment of the present invention;
Fig. 3 is a schematic diagram of rule rejection in the embodiment of the present invention;
Fig. 4 is a schematic diagram of the combination of the L1 and L2 norms in the embodiment of the present invention;
Fig. 5 is a flow diagram of the main algorithm of the method of the embodiment of the present invention.
Specific embodiment
So that the features and advantages of this patent can be more clearly understood, specific embodiments are described in detail below:
As shown in Fig. 1, the embodiment of the present invention includes the following steps:
Step 1: preprocess the unbalanced or irregular data in the liver disease data, and obtain a liver disease data set through SMOTE (synthetic minority oversampling technique);
Step 2: perform binary sparse coding on the liver disease data set using a random forest model to obtain a liver disease rule set;
Step 3: perform elastic norm sparse coding rule extraction on the liver disease rule set to obtain a coded liver disease rule set;
Step 4: perform feature extraction and deletion on the coded liver disease rule set;
Step 5: perform original data verification and generate the final rule set.
The imbalance existing in the original liver disease data will cause many problems in pattern recognition. For example, if the data set is unbalanced, a classifier tends to "learn" the samples of the majority class and to cluster them with the highest precision. In practical applications, this bias is unacceptable. The present invention handles this through SMOTE (synthetic minority oversampling technique), which mainly includes data balancing processing and missing-data processing, and can create "synthetic" examples for each minority class that has few samples.
In step 2, the method for performing binary sparse coding on the liver disease data set using the random forest model includes the following steps:
Step 2A: train the liver disease data set to obtain a random forest including multiple decision trees; in each decision tree, the path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: each sample of the liver disease data set corresponds, in each decision tree, to a path from the root node to exactly one leaf node;
Step 2C: define a binary feature vector that captures the random forest leaf node structure: for sample Xi, the corresponding binary vector encoding the leaf nodes is defined as:
Xi = [X1, ..., Xq]^T, where q is the total number of leaf nodes;
As shown in Fig. 2, which illustrates the mapping relations of the nodes in an example to which the present invention is applied:
The space of Xi is then the leaf node space; in this space each sample is mapped to a vertex of a hypercube, and each dimension of the rule space is defined as a decision rule. Such a mapping therefore essentially defines, for each sample, which rules are effective and which are invalid.
In step 3, the method for performing elastic norm sparse coding rule extraction on the liver disease rule set includes the following steps:
Step 3A: according to the mapping result of step 2C, construct new training samples:
{(X1, y1), (X2, y2), ..., (Xp, yp)};
where Xi is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label; the classification formula is defined as follows:
where the weight vector Wk and the scalar bk define the linear discriminant function of the k-th class;
Since each binary attribute represents a decision rule, the weight Wk in formula (1) measures the importance of a rule: the magnitude of the weight shows the importance level of the rule. Obviously, in the above classifier, a rule whose weights are 0 for all classes can be safely removed. Rule extraction therefore becomes the problem of learning the weight vectors.
Step 3B: perform elastic norm regularized learning, where the objective function, subject to the constraints ξi ≥ 0, i = 1, ..., p, is given by formula (2).
As shown in Fig. 4, the objective function consists of two terms. The first term is the elastic norm formula combining the L1 and L2 norms; it controls the number of non-zero weights and of extracted rules, and P is the probability factor for selecting the L1 or L2 norm, with 0 ≤ P ≤ 1. The second term εk is the sum of the slack variables; λ is the regularization parameter. Because a non-zero slack variable represents a misclassified sample, the second term is related to the empirical error. The resulting sparsity and empirical error depend on the regularization parameter, and L1 and L2 norm sparse coding has been widely used in statistics and machine learning: the L1 norm can delete unimportant features, while the L2 norm can prevent over-fitting of the data. After step 3B, the present invention adopts the P value with the highest model cross-validation precision and substitutes this P value into formula (2).
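A minimal sketch of choosing P by cross-validation, assuming scikit-learn's SGDClassifier as a stand-in for the elastic-norm learner of formula (2) (the patent names no specific solver); l1_ratio plays the role of P and alpha the role of λ, and the grid values are illustrative.

# Choosing the L1/L2 mixing factor P (here l1_ratio) by cross-validation (sketch).
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def select_P(X_rules, y, P_grid=np.linspace(0.0, 1.0, 11), lam=1e-3, cv=5):
    """Return the P with the highest cross-validation precision, as in step 3B."""
    scores = []
    for P in P_grid:
        clf = SGDClassifier(loss="hinge", penalty="elasticnet",
                            l1_ratio=P, alpha=lam, random_state=0)
        scores.append(cross_val_score(clf, X_rules, y, cv=cv).mean())
    return P_grid[int(np.argmax(scores))], scores

# Usage (X_rules is the binary rule encoding from step 2, y the class labels):
# P_best, cv_scores = select_P(X_rules, y)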
The importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, use the corresponding OOB (out-of-bag) data to calculate its out-of-bag data error, denoted errOOB1;
Step 3D: randomly add noise interference to the feature in question for all samples of the out-of-bag data OOB, and calculate the out-of-bag data error again, denoted errOOB2;
Step 3E: suppose there are Ntree decision trees in the random forest; the importance of the feature is then defined as:
∑(errOOB1 − errOOB2)/Ntree    (3)
Calculate the importance of all features.
Formula (3) is used as the measure of the importance of the corresponding feature because: if, after random noise is added to a feature, the out-of-bag accuracy drops greatly, this shows that the feature has a large influence on the classification result of the samples, that is, its importance level is relatively high.
In step 4, the method for extracting and deleting features of the coded liver disease rule set includes the following steps:
The feature distribution in a random forest is determined by the learning process of the random forest, and it is usually different from the feature distribution generated by the rule extraction of the preceding formula. Since the important features are assumed to appear in the extracted decision rules, feature dissimilarity can be used to select features: if a feature does not appear in the rules extracted by formula (2), it is deleted, because it has no influence on the classifier defined by formula (1). Under this idea, rules and features can be selected simultaneously.
The regularization parameter λ can be selected by cross-validation on the training set. Reconstructing the random forest from the selected features can further select rules and obtain an overall compact rule set. In such an iterative process, the features selected in the previous iteration are used to construct a new random forest, which generates new rules, and the iteration continues until the selected features no longer change.
As shown in Fig. 3, which is the final result obtained by the present invention after processing the random forest of Fig. 2. Thus:
Step 4A: if a certain feature does not appear in the rules extracted by formula (2), it is deleted;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct the random forest and train it;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change.
In step 5, performing original data verification and generating the final rule set includes the following steps:
Step 5A: given the class-labeled liver disease data set D, let ncovers be the number of data points covered and ncorrect the number of data points accurately classified by rule set R; the coverage and accuracy of rule set R are then defined respectively as:
The higher the coverage and accuracy of a rule, the greater its credibility for auxiliary diagnosis; the rules with relatively high coverage and accuracy form the final rule set.
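A small sketch of this validation step, with rules represented as Python predicates over a feature vector together with a predicted class; this representation is a simplified stand-in for the leaf-node rules extracted above, and the example rules are purely hypothetical.

# Coverage / accuracy validation of a rule set on the original data (sketch).
def validate_rules(rules, X, y):
    """rules: list of (predicate, predicted_class); returns (coverage, accuracy)."""
    n_covers, n_correct = 0, 0
    for xi, yi in zip(X, y):
        fired = [cls for pred, cls in rules if pred(xi)]
        if fired:
            n_covers += 1
            if fired[0] == yi:              # first matching rule decides
                n_correct += 1
    coverage = n_covers / len(X)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# Hypothetical example with two toy rules on feature indices 0 and 3:
# rules = [(lambda x: x[0] > 40 and x[3] <= 1.2, 1),
#          (lambda x: x[0] <= 40, 0)]
# cov, acc = validate_rules(rules, X, y)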
As shown in Fig. 5, the pseudocode flow of the embodiment of the present invention is as follows:
Initialization:
1: initialize the feature variable F.
2: randomly select the training sample X from the total sample D.
Output:
1: selected features Ff
2: selected rules Rf
Main steps of the algorithm:
1: features Fi, i = 1;
2: while the stop condition is not met (the selected features still change), execute:
3: run the random forest model, with the training sample X restricted to the features Fi as the data set;
4: the random forest generates a series of rule sets Rr;
5: encode the training sample X with the rule set Rr;
6: obtain the cross-validation precision Ci and the weight values W with the linear equation (2);
7: keep the rules whose weights are greater than a threshold (a preset, sufficiently small value), and record the indices of all such parameters;
8: the indices of Rr are passed to Ri;
9: the features appearing in Ri are passed to Fi+1, i = i + 1;
10: end loop;
11: select the i with the maximum cross-validation precision Ci and pass it to i*;
12: Fi* is passed to Ff;
13: Ri* is passed to Rf;
14: return Ff, Rf.
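Putting the pseudocode together, the following is a compact sketch of the whole loop, assuming scikit-learn as the implementation substrate (the patent names no specific library); SGDClassifier with an elastic-net penalty stands in for the learner of formula (2), and the weight threshold, λ (alpha) and P (l1_ratio) values are illustrative.

# Iterative rule extraction and feature selection loop (sketch of Fig. 5).
# X, y: preprocessed liver data; y holds integer class labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def path_features(tree, leaf_id):
    """Indices of the features tested on the root-to-leaf path ending at leaf_id."""
    t = tree.tree_
    parent = {}
    for node in range(t.node_count):
        for child in (t.children_left[node], t.children_right[node]):
            if child != -1:
                parent[child] = node
    feats, node = set(), leaf_id
    while node in parent:
        node = parent[node]
        feats.add(int(t.feature[node]))
    return feats

def features_in_rules(rf, enc, selected_cols):
    """Map selected one-hot rule columns back to the features used in those rules."""
    offsets = np.cumsum([0] + [len(c) for c in enc.categories_])
    feats = set()
    for col in selected_cols:
        t_idx = int(np.searchsorted(offsets, col, side="right") - 1)
        leaf_id = int(enc.categories_[t_idx][col - offsets[t_idx]])
        feats |= path_features(rf.estimators_[t_idx], leaf_id)
    return np.array(sorted(feats), dtype=int)

def iterate_rules(X, y, P=0.5, lam=1e-3, max_iter=10, weight_threshold=1e-4):
    features = np.arange(X.shape[1])                 # F_1: start from all features
    best = (-np.inf, features, np.array([], dtype=int))
    for _ in range(max_iter):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(X[:, features], y)
        enc = OneHotEncoder(handle_unknown="ignore")
        X_rules = enc.fit_transform(rf.apply(X[:, features]))   # rule set R_r
        clf = SGDClassifier(loss="hinge", penalty="elasticnet",
                            l1_ratio=P, alpha=lam, random_state=0)
        ci = cross_val_score(clf, X_rules, y, cv=5).mean()       # C_i
        clf.fit(X_rules, y)
        # Keep rules whose weight exceeds the threshold for at least one class.
        selected = np.where(np.abs(clf.coef_).max(axis=0) > weight_threshold)[0]
        new_features = features[features_in_rules(rf, enc, selected)]
        if ci > best[0]:
            best = (ci, new_features, selected)
        if len(new_features) == 0 or set(new_features) == set(features):
            break                                    # stop: features no longer change
        features = new_features
    return best                                      # (best C_i, F_f, R_f)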
This patent is not limited to the preferred embodiment described above. Anyone, under the enlightenment of this patent, may derive other various forms of the method for extracting classification rules of liver disease data based on random forest, and effective rule extraction can also be carried out under the design scheme of the present invention for other types of sample data with significant irregularity or imbalance. All equivalent changes and modifications within the scope of the patent claims of the present invention shall belong to the scope covered by this patent.

Claims (7)

1. A method for extracting classification rules of liver disease data based on random forest, characterized by comprising the following steps:
Step 1: preprocess the unbalanced or irregular data in the liver disease data, and obtain a liver disease data set through SMOTE (synthetic minority oversampling technique);
Step 2: perform binary sparse coding on the liver disease data set using a random forest model to obtain a liver disease rule set;
Step 3: perform elastic norm sparse coding rule extraction on the liver disease rule set to obtain a coded liver disease rule set;
Step 4: perform feature extraction and deletion on the coded liver disease rule set;
Step 5: perform original data verification and generate the final rule set.

2. The method for extracting classification rules of liver disease data based on random forest according to claim 1, characterized in that: in step 2, the method for performing binary sparse coding on the liver disease data set using the random forest model comprises the following steps:
Step 2A: train the liver disease data set to obtain a random forest including multiple decision trees; in each decision tree, the path from the root node to a leaf node is interpreted as a decision rule, so the random forest is equivalent to a set of decision rules;
Step 2B: each sample of the liver disease data set corresponds, in each decision tree, to a path from the root node to exactly one leaf node;
Step 2C: define a binary feature vector that captures the random forest leaf node structure: for sample Xi, the corresponding binary vector encoding the leaf nodes is defined as Xi = [X1, ..., Xq]^T, where q is the total number of leaf nodes; the space of Xi is then the leaf node space, in which each sample is mapped to a vertex of a hypercube and each dimension of the rule space is defined as a decision rule.

3. The method for extracting classification rules of liver disease data based on random forest according to claim 2, characterized in that: in step 3, the method for performing elastic norm sparse coding rule extraction on the liver disease rule set comprises the following steps:
Step 3A: according to the mapping result of step 2C, construct new training samples {(X1, y1), (X2, y2), ..., (Xp, yp)}, where Xi is a binary attribute vector and y ∈ {1, 2, ..., K} is the associated class label; the classification formula (1) is defined such that the weight vector Wk and the scalar bk define the linear discriminant function of the k-th class;
Step 3B: perform elastic norm regularized learning, in which the objective function (2), subject to ξi ≥ 0, i = 1, ..., p, consists of two terms: the first term is the elastic norm formula combining the L1 and L2 norms, used to control the number of non-zero weights and of extracted rules, where P is the probability factor for selecting the L1 or L2 norm with 0 ≤ P ≤ 1; the second term εk is the sum of the slack variables; and λ is the regularization parameter.

4. The method for extracting classification rules of liver disease data based on random forest according to claim 3, characterized in that: in step 4, the method for extracting and deleting features of the coded liver disease rule set comprises the following steps:
Step 4A: if a certain feature does not appear in the rules extracted by formula (2), it is deleted;
Step 4B: select the regularization parameter λ by cross-validation on the training set, and return to step 2A to reconstruct the random forest and train it;
Step 4C: repeat the iterative process from step 2A to step 4B until the selected features no longer change.

5. The method for extracting classification rules of liver disease data based on random forest according to claim 4, characterized in that: in step 5, performing original data verification and generating the final rule set comprises the following steps:
Step 5A: given the class-labeled liver disease data set D, let ncovers be the number of data points covered and ncorrect the number of data points accurately classified by rule set R, and define the coverage and accuracy of rule set R respectively; the higher the coverage and accuracy of a rule, the greater its credibility for auxiliary diagnosis, and the rules with relatively high coverage and accuracy form the final rule set.

6. The method for extracting classification rules of liver disease data based on random forest according to claim 3, characterized in that: after step 3B, the P value with the highest model cross-validation precision is adopted and substituted into formula (2).

7. The method for extracting classification rules of liver disease data based on random forest according to claim 3, characterized in that the importance of any sample feature in the random forest is calculated as follows:
Step 3C: for each decision tree in the random forest, use the corresponding OOB (out-of-bag) data to calculate its out-of-bag data error, denoted errOOB1;
Step 3D: randomly add noise interference to the feature in question for all samples of the out-of-bag data OOB, and calculate the out-of-bag data error again, denoted errOOB2;
Step 3E: suppose there are Ntree decision trees in the random forest; the importance of the feature is defined as ∑(errOOB1 − errOOB2)/Ntree (3); calculate the importance of all features.
CN201811292849.5A · 2018-02-05 · 2018-11-01 · Method for extracting classification rules of liver disease data based on random forest · Active · CN109409434B (en)

Applications Claiming Priority (2)

Application Number · Priority Date · Filing Date · Title
CN201810115962X · 2018-02-05
CN201810115962 · 2018-02-05

Publications (2)

Publication Number · Publication Date
CN109409434A (en) · 2019-03-01
CN109409434B (en) · 2021-05-18

Family

ID=65470743

Family Applications (1)

Application Number · Title · Priority Date · Filing Date
CN201811292849.5A · Active · CN109409434B (en) · 2018-02-05 · 2018-11-01 · Method for extracting classification rules of liver disease data based on random forest

Country Status (1)

Country · Link
CN (1) · CN109409434B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
WO2015066564A1 (en) * · 2013-10-31 · 2015-05-07 · Cancer Prevention And Cure, Ltd. · Methods of identification and diagnosis of lung diseases using classification systems and kits thereof
CN104156537A (en) * · 2014-08-19 · 2014-11-19 · 中山大学 · Cellular automaton urban growth simulating method based on random forest
CN104882144A (en) * · 2015-05-06 · 2015-09-02 · 福州大学 · Animal voice identification method based on double sound spectrogram characteristics
CN104915679A (en) * · 2015-05-26 · 2015-09-16 · 浪潮电子信息产业股份有限公司 · Large-scale high-dimensional data classification method based on random forest weighted distance
CN105844300A (en) * · 2016-03-24 · 2016-08-10 · 河南师范大学 · Optimized classification method and optimized classification device based on random forest algorithm
CN106778836A (en) * · 2016-11-29 · 2017-05-31 · 天津大学 · A random forest recommendation algorithm based on constraint conditions
CN108647829A (en) * · 2018-05-16 · 2018-10-12 · 河海大学 · A random forest based method for extracting combined dispatching rules of hydropower stations

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HUI ZOU ET AL: "Regularization and variable selection via the elastic net", Journal of the Royal Statistical Society *
SHENG LIU ET AL: "Combined Rule Extraction and Feature Elimination in Supervised Classification", IEEE Transactions on NanoBioscience *
冯盼峰 et al.: "Two-stage variable selection based on the random forest algorithm" (基于随机森林算法的两阶段变量选择研究), 系统科学与数学 (Journal of Systems Science and Mathematical Sciences) *
王健 et al.: "Fault feature extraction methods for imbalanced samples" (面向样本不平衡的故障特征提取方法), 29 February 2016 *
裴平 et al.: "Research on the international financial crisis triggered by US subprime mortgage risk" (美国次贷风险引发的国际金融危机研究), 31 July 2016 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
CN110046654A (en) * · 2019-03-25 · 2019-07-23 · 东软集团股份有限公司 · Method, apparatus and related device for identifying classification influencing factors
CN110327074A (en) * · 2019-08-02 · 2019-10-15 · 无锡海斯凯尔医学技术有限公司 · Liver evaluation method, device, equipment and computer readable storage medium
CN114566284A (en) * · 2021-06-07 · 2022-05-31 · 医渡云(北京)技术有限公司 · Disease prognosis risk prediction model training method and device and electronic equipment
CN113628697A (en) * · 2021-07-28 · 2021-11-09 · 上海基绪康生物科技有限公司 · Random forest model training method for classification unbalance data optimization

Also Published As

Publication number · Publication date
CN109409434B (en) · 2021-05-18

Similar Documents

Publication · Publication Date · Title
US10713597B2 (en) · Systems and methods for preparing data for use by machine learning algorithms
Pelikan et al. · Estimation of distribution algorithms
Mikenina et al. · Improved feature selection and classification by the 2-additive fuzzy measure
CN109409434A (en) · The method of liver diseases data classification Rule Extraction based on random forest
CN113688869A (en) · Photovoltaic data missing reconstruction method based on generation countermeasure network
CN113095501A (en) · Deep reinforcement learning-based unbalanced classification decision tree generation method
Rabuñal et al. · A new approach to the extraction of ANN rules and to their generalization capacity through GP
CN119150158B (en) · A method for constructing user portraits for O2O platforms based on deep learning
Farooq · Genetic algorithm technique in hybrid intelligent systems for pattern recognition
Cho et al. · Genetic evolution processing of data structures for image classification
CN112541530B (en) · Data preprocessing method and device for clustering model
Dhebar et al. · Evaluating nonlinear decision trees for binary classification tasks with other existing methods
Tarle et al. · Improved artificial neural network for dimension reduction in medical data classification
CN115795382A (en) · Method and system for quickly classifying high-dimensional and class unbalanced time sequence data
Wani · SAFARI: a structured approach for automatic rule
Bengoetxea et al. · Solving graph matching with EDAs using a permutation-based representation
Bej et al. · Combining uniform manifold approximation with localized affine shadowsampling improves classification of imbalanced datasets
Ahmed et al. · A Comprehensive Review on Machine Learning Models for Medical Data Classification
CN118534330B (en) · A method for predicting remaining life of fuel cells based on robust model
CN120087187B (en) · Product collaborative design management method and system based on data driving
Brahmana · Improving Feature Learning Capability and Interpretability of Unsupervised Neural Networks
Huang et al. · Pattern trees
Do et al. · Swift Hydra: Self-Reinforcing Generative Framework for Anomaly Detection with Multiple Mamba Models
CN119993275A (en) · A fungal phenotypic barcode screening method and system based on machine learning
Starkey et al. · Class-Level Feature Selection Method Using Feature Weighted Growing Self-Organising Maps

Legal Events

Date · Code · Title · Description
PB01 · Publication
SE01 · Entry into force of request for substantive examination
GR01 · Patent grant
