
Random forest

From Wikipedia, the free encyclopedia
This article is about the machine learning technique. For other kinds of random tree, see Random tree.
Tree-based ensemble machine learning method

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees.[1][2] Random forests correct for decision trees' habit of overfitting to their training set.[3]: 587–588

The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[1] using the random subspace method,[2] which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.[4][5][6]

An extension of the algorithm was developed by Leo Breiman[7] and Adele Cutler,[8] who registered[9] "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).[10] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho[1] and later independently by Amit and Geman[11] in order to construct a collection of decision trees with controlled variance.

History


The general method of random decision forests was first proposed by Salzberg and Heath in 1993,[12] with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995.[1] Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines[2] concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.[4][5][6]

The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman,[11] who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho[2] was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Thomas G. Dietterich.[13]

The proper introduction of random forests was made in a paper by Leo Breiman,[7] which has become one of the world's most cited papers.[14] The paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, it combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

  1. Using out-of-bag error as an estimate of the generalization error.
  2. Measuring variable importance through permutation.

The report also offers the first theoretical result for random forests, in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.

Algorithm


Preliminaries: decision tree learning

Main article: Decision tree learning

Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".[3]: 352

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[3]: 587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

Bagging

Main article: Bootstrap aggregating
Illustration of training a random forest model. The training dataset (in this case, of 250 rows and 100 columns) is randomly sampled with replacement n times. Then, a decision tree is trained on each sample. Finally, for prediction, the results of all n trees are aggregated to produce a final decision.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x_1, ..., x_n with responses Y = y_1, ..., y_n, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
  1. Sample, with replacement, n training examples from X, Y; call these X_b, Y_b.
  2. Train a classification or regression tree f_b on X_b, Y_b.

After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x':

\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')

or by taking the plurality vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x':

\sigma = \sqrt{ \frac{ \sum_{b=1}^{B} \left( f_b(x') - \hat{f} \right)^2 }{ B - 1 } }.

The number B of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. B can be optimized using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample x_i, using only the trees that did not have x_i in their bootstrap sample.[15]

The training and test error tend to level off after some number of trees have been fit.
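The following minimal sketch (illustrative only; the dataset, B, and other values are arbitrary choices, not part of the original algorithm description) implements the bagging loop above with scikit-learn decision trees as base learners, and also computes the standard-deviation uncertainty estimate described above.

```python
# Minimal sketch of bagging regression trees (illustrative, not Breiman's exact code).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

B = 200                     # number of bootstrap samples / trees (free parameter)
n = len(X)
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)                # sample n examples with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Prediction for unseen points: average the B tree predictions;
# the spread across trees gives the standard-deviation uncertainty estimate.
X_new = X[:5]
per_tree = np.stack([t.predict(X_new) for t in trees])   # shape (B, n_new)
f_hat = per_tree.mean(axis=0)
sigma = per_tree.std(axis=0, ddof=1)
print(f_hat, sigma)
```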

From bagging to random forests

Main article: Random subspace method

The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.[16]

Typically, for a classification problem with p features, √p (rounded down) features are used in each split.[3]: 592 For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default.[3]: 592 In practice, the best values for these parameters should be tuned on a case-by-case basis for every problem.[3]: 592
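As one concrete, hedged illustration, scikit-learn exposes the per-split feature subset size through its max_features parameter. The settings below mirror the √p and p/3 heuristics quoted above; the library's own defaults may differ between versions, and min_samples_leaf is only an approximation of Breiman's "minimum node size".

```python
# Illustrative sketch: per-split feature subsampling via scikit-learn's max_features.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Xc, yc = make_classification(n_samples=300, n_features=25, n_informative=5, random_state=0)
Xr, yr = make_regression(n_samples=300, n_features=25, noise=1.0, random_state=0)

# Classification: consider about sqrt(p) features at each split.
clf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                             random_state=0).fit(Xc, yc)

# Regression: consider about p/3 features; min_samples_leaf=5 stands in for
# the "minimum node size of 5" recommendation.
reg = RandomForestRegressor(n_estimators=300, max_features=1/3,
                            min_samples_leaf=5, random_state=0).fit(Xr, yr)
```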

ExtraTrees


Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of random cut-points are selected, instead of computing the locally optimal cut-point (based on, e.g., information gain or the Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node.

Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are \sqrt{p} for classification and p for regression, where p is the number of features in the model.[17]
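A minimal sketch using scikit-learn's ExtraTreesClassifier follows; its defaults (whole learning sample, random cut-points) match the description above. The dataset and parameter values are illustrative only.

```python
# Illustrative sketch: extremely randomized trees in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bootstrap=False: each tree is fit on the whole learning sample (the ExtraTrees default).
model = ExtraTreesClassifier(n_estimators=300, max_features="sqrt",
                             bootstrap=False, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # accuracy on held-out data
```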

Random forests for high-dimensional data


The basic random forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative. Some methods for accomplishing this are:

  • Prefiltering: Eliminate features that are mostly just noise.[18][19]
  • Enriched Random Forest (ERF): Use weighted random sampling instead of simple random sampling at each node of each tree, giving greater weight to features that appear to be more informative.[20][21][22]
  • Tree-weighted random forest (TWRF): Give more weight to more accurate trees.[23][24]

Properties


Variable importance


Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper[7] and is implemented in the R package randomForest.[8]

Permutation importance


To measure a feature's importance in a data set \mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}, first a random forest is trained on the data. During training, the out-of-bag error for each data point is recorded and averaged over the forest. (If bagging is not used during training, we can instead compute errors on an independent test set.)

After training, the values of the feature are permuted in the out-of-bag samples and the out-of-bag error is again computed on this perturbed data set. The importance for the feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al.[25]
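A small illustrative sketch with scikit-learn's permutation_importance follows. Note that it permutes features on a held-out test set rather than on each tree's out-of-bag samples, so it is a close variant of, not an exact implementation of, Breiman's procedure.

```python
# Illustrative sketch: permutation importance computed on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)

# Mean drop in score after permuting each feature, with its standard deviation.
order = result.importances_mean.argsort()[::-1]
for i in order[:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```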

This method of determining variable importance has some drawbacks:

  • When features have different numbers of values, random forests favor features with more values. Solutions to this problem include partial permutations[26][27][28] and growing unbiased trees.[29][30]
  • If the data contain groups of correlated features of similar relevance, then smaller groups are favored over large groups.[31]
  • If there are collinear features, the procedure may fail to identify important features. A solution is to permute groups of correlated features together.[32]

Mean decrease in impurity feature importance


This approach to feature importance for random forests considers variables to be important if splitting on them produces large decreases in impurity.[33] It is described in the book Classification and Regression Trees by Leo Breiman[34] and is the default implementation in scikit-learn and R. The definition is:

\text{unnormalized average importance}(x) = \frac{1}{n_T} \sum_{i=1}^{n_T} \sum_{\text{node } j \in T_i \,|\, \text{split variable}(j) = x} p_{T_i}(j) \, \Delta i_{T_i}(j),

where x indicates a feature, n_T is the number of trees in the forest, T_i indicates tree i, p_{T_i}(j) is the fraction of samples reaching node j, and \Delta i_{T_i}(j) is the decrease in impurity produced by the split at node j of tree i.

As impurity measure for the samples falling in a node one can use, for example, the Gini impurity or the Shannon entropy for classification trees, and the variance (mean squared error) for regression trees.

The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.
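For illustration, these normalized impurity-based importances are exposed by scikit-learn as the feature_importances_ attribute; the sketch below (with an arbitrary dataset) only shows how to read them and check that they sum to 1.

```python
# Illustrative sketch: impurity-based (MDI) importances as exposed by scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=12, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

mdi = forest.feature_importances_        # normalized mean decrease in impurity
print(mdi.sum())                         # normalized importances sum to 1
print(np.argsort(mdi)[::-1][:5])         # top-ranked features
```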

The scikit-learn default implementation of mean decrease in impurity can report misleading feature importance:[32]

  • it favors high cardinality features
  • it uses training statistics and so does not reflect a feature's usefulness for predictions on a test set[35]

Relationship to nearest neighbors


A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002.[36] Both can be viewed as so-called weighted neighborhood schemes. These are models built from a training set \{(x_i, y_i)\}_{i=1}^{n} that make predictions \hat{y} for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W:

\hat{y} = \sum_{i=1}^{n} W(x_i, x') \, y_i.

Here, W(x_i, x') is the non-negative weight of the i-th training point relative to the new point x' in the same tree. For any x', the weights for the points x_i must sum to 1. The weight functions are as follows: in k-NN, W(x_i, x') = 1/k if x_i is one of the k points closest to x', and zero otherwise; in a tree, W(x_i, x') = 1/k' if x_i is one of the k' points in the leaf containing x', and zero otherwise.

Since a forest averages the predictions of a set of m trees with individual weight functions W_j, its predictions are

\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x') \, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i.

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points x_i sharing the same leaf in any tree j. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.[36]
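This equivalence can be checked numerically. The sketch below assumes scikit-learn's apply() method (which returns each sample's leaf index in each tree) and disables bootstrapping so that every leaf value is exactly the mean of the training targets it contains; it then rebuilds a forest prediction as a weighted average of training responses.

```python
# Sketch: a regression forest prediction rewritten as a weighted average of
# training targets, with weights from leaf co-membership (weighted neighborhood view).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=1.0, random_state=0)
forest = RandomForestRegressor(n_estimators=50, bootstrap=False,
                               max_features=1/3, random_state=0).fit(X, y)

x_new = X[:1]                            # one query point
train_leaves = forest.apply(X)           # (n_samples, n_trees) leaf indices
new_leaves = forest.apply(x_new)         # (1, n_trees)

# W_j(x_i, x') = 1/k'_j if x_i shares x''s leaf in tree j (k'_j points in that leaf), else 0.
same_leaf = (train_leaves == new_leaves)         # (n_samples, n_trees) booleans
W = same_leaf / same_leaf.sum(axis=0)            # per-tree weights, each column sums to 1
weights = W.mean(axis=1)                         # forest weights: average over trees

# The two numbers coincide (up to floating point).
print(np.dot(weights, y), forest.predict(x_new)[0])
```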

Unsupervised learning


As part of their construction, random forest predictors naturally lead to a dissimilarity measure among observations. One can analogously define dissimilarity between unlabeled data, by training a forest to distinguish original "observed" data from suitably generated synthetic data drawn from a reference distribution.[7][37] A random forest dissimilarity is attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. Random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. Random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.[38]

Variants


Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.[39][40][41] In cases where the relationship between the predictors and the target variable is linear, the base learners may have an accuracy as high as that of the ensemble learner.[42][39]

Kernel random forest


In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.[43]

History


Leo Breiman[44] was the first person to notice the link between random forests and kernel methods. He pointed out that random forests trained using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon[45] established the connection between random forests and adaptive nearest neighbors, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani[46] proposed Kernel Random Forest (KeRF) and showed that it can empirically outperform state-of-the-art kernel methods. Scornet[43] first defined KeRF estimates and gave the explicit link between KeRF estimates and random forests. He also gave explicit expressions for kernels based on the centered random forest[47] and the uniform random forest,[48] two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.

Notations and definitions


Preliminaries: Centered forests


The centered forest[47] is a simplified model of Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a full binary tree of level k is built, where k \in \mathbb{N} is a parameter of the algorithm.
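A minimal sketch of a centered forest follows. It only materializes the splits along the path of the query point, which is enough to reproduce the prediction at that point; all names and parameter values are illustrative, not part of the original model description.

```python
# Minimal sketch of a centered forest: each tree repeatedly picks a coordinate
# uniformly at random and splits the current cell at its midpoint, down to level k;
# a cell predicts the mean response of the training points it contains.
import numpy as np

def centered_tree_predict(X, y, x, k, rng, low=None, high=None, mask=None):
    d = X.shape[1]
    if low is None:
        low, high, mask = np.zeros(d), np.ones(d), np.ones(len(X), dtype=bool)
    if k == 0:
        return y[mask].mean() if mask.any() else y.mean()   # empty cell: fall back to global mean
    j = rng.integers(d)                      # attribute chosen uniformly at random
    mid = (low[j] + high[j]) / 2             # split at the center of the current cell
    go_left = x[j] <= mid
    new_mask = mask & ((X[:, j] <= mid) if go_left else (X[:, j] > mid))
    low, high = low.copy(), high.copy()
    if go_left:
        high[j] = mid
    else:
        low[j] = mid
    return centered_tree_predict(X, y, x, k - 1, rng, low, high, new_mask)

def centered_forest_predict(X, y, x, k=4, M=100, seed=0):
    rng = np.random.default_rng(seed)
    return np.mean([centered_tree_predict(X, y, x, k, rng) for _ in range(M)])

# Toy usage on data in [0,1]^2.
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(centered_forest_predict(X, y, x=np.array([0.3, 0.7])))
```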

Uniform forest


Uniform forest[48] is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.

From random forest to KeRF


Given a training sample \mathcal{D}_n = \{(\mathbf{X}_i, Y_i)\}_{i=1}^{n} of [0,1]^p \times \mathbb{R}-valued independent random variables distributed as the independent prototype pair (\mathbf{X}, Y), where \operatorname{E}[Y^2] < \infty, we aim at predicting the response Y, associated with the random variable \mathbf{X}, by estimating the regression function m(\mathbf{x}) = \operatorname{E}[Y \mid \mathbf{X} = \mathbf{x}]. A random regression forest is an ensemble of M randomized regression trees. Denote by m_n(\mathbf{x}, \Theta_j) the predicted value at point \mathbf{x} of the j-th tree, where \Theta_1, \ldots, \Theta_M are independent random variables, distributed as a generic random variable \Theta, independent of the sample \mathcal{D}_n. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(\mathbf{x}, \Theta_j). For regression trees, we have m_n = \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)}, where A_n(\mathbf{x}, \Theta_j) is the cell containing \mathbf{x}, designed with randomness \Theta_j and dataset \mathcal{D}_n, and N_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)} is the number of sample points falling in that cell.

Thus random forest estimates satisfy, for all \mathbf{x} \in [0,1]^d,

m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} \left( \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)} \right).

A random regression forest thus has two levels of averaging: first over the samples in the target cell of a tree, then over all trees. Consequently, the contributions of observations that are in cells with a high density of data points are smaller than those of observations belonging to less populated cells. In order to improve the random forest methods and compensate for this misestimation, Scornet[43] defined KeRF by

\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{\sum_{j=1}^{M} N_n(\mathbf{x}, \Theta_j)} \sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)},

which is equal to the mean of the Y_i's falling in the cells containing \mathbf{x} in the forest. If we define the connection function of the M finite forest as

K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{\mathbf{z} \in A_n(\mathbf{x}, \Theta_j)},

i.e. the proportion of cells shared between \mathbf{x} and \mathbf{z}, then almost surely

\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{\sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{x}_i)}{\sum_{\ell=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{x}_\ell)},

which defines the KeRF.
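The KeRF estimate can be approximated from an ordinary fitted forest by counting leaf co-memberships. The sketch below assumes scikit-learn's apply() and disables bootstrapping so that every tree is built on the full sample; it contrasts the KeRF value with the usual forest prediction.

```python
# Sketch: the KeRF estimate computed from a fitted forest's leaf memberships.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=6, noise=1.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, bootstrap=False,
                               max_features=1/3, random_state=0).fit(X, y)

x_new = X[:1]
same_leaf = (forest.apply(X) == forest.apply(x_new))   # (n, M): is X_i in the cell of x?

# Connection function K_{M,n}(x, X_i): proportion of trees in which X_i shares x's cell.
K = same_leaf.mean(axis=1)

kerf = np.dot(K, y) / K.sum()      # KeRF: one global weighting over all (tree, sample) pairs
rf = forest.predict(x_new)[0]      # ordinary forest: per-tree averaging, then over trees
print(kerf, rf)
```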

Centered KeRF


The construction of a Centered KeRF of level k is the same as for a centered forest, except that predictions are made by \tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M); the corresponding kernel function, or connection function, is

K_k^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{k_1, \ldots, k_d, \; \sum_{j=1}^{d} k_j = k} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{j=1}^{d} \mathbf{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}, \qquad \text{for all } \mathbf{x}, \mathbf{z} \in [0,1]^d.
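The centered-KeRF connection function can be evaluated directly from this formula by enumerating all ways of distributing the k splits among the d coordinates, as in the following illustrative sketch.

```python
# Sketch: direct evaluation of the centered KeRF connection function K_k^cc(x, z).
import math
from itertools import product
import numpy as np

def centered_kerf_kernel(x, z, k):
    x, z = np.asarray(x, float), np.asarray(z, float)
    d = len(x)
    total = 0.0
    for ks in product(range(k + 1), repeat=d):        # all d-tuples of per-coordinate split counts
        if sum(ks) != k:
            continue
        coef = math.factorial(k) / math.prod(math.factorial(kj) for kj in ks)
        same_cell = all(math.ceil(2**kj * xj) == math.ceil(2**kj * zj)
                        for kj, xj, zj in zip(ks, x, z))
        total += coef * (1.0 / d) ** k * same_cell
    return total

print(centered_kerf_kernel([0.30, 0.70], [0.32, 0.69], k=3))
```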

Uniform KeRF


Uniform KeRF is built in the same way as the uniform forest, except that predictions are made by \tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M); the corresponding kernel function, or connection function, is

K_k^{uf}(\mathbf{0}, \mathbf{x}) = \sum_{k_1, \ldots, k_d, \; \sum_{j=1}^{d} k_j = k} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{m=1}^{d} \left( 1 - |x_m| \sum_{j=0}^{k_m - 1} \frac{(-\ln |x_m|)^j}{j!} \right), \qquad \text{for all } \mathbf{x} \in [0,1]^d.

Properties


Relation between KeRF and random forest


Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:

Assume that there exist sequences (a_n), (b_n) such that, almost surely,

a_n \leq N_n(\mathbf{x}, \Theta) \leq b_n \quad \text{and} \quad a_n \leq \frac{1}{M} \sum_{m=1}^{M} N_n(\mathbf{x}, \Theta_m) \leq b_n.

Then, almost surely,

|m_{M,n}(\mathbf{x}) - \tilde{m}_{M,n}(\mathbf{x})| \leq \frac{b_n - a_n}{a_n} \, \tilde{m}_{M,n}(\mathbf{x}).

Relation between infinite KeRF and infinite random forest


When the number of trees M goes to infinity, we obtain the infinite random forest and the infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:

Assume that there exist sequences (\varepsilon_n), (a_n), (b_n) such that, almost surely,

Then, almost surely,

|m_{\infty,n}(\mathbf{x}) - \tilde{m}_{\infty,n}(\mathbf{x})| \leq \frac{b_n - a_n}{a_n} \, \tilde{m}_{\infty,n}(\mathbf{x}) + n \varepsilon_n \left( \max_{1 \leq i \leq n} Y_i \right).

Consistency results


Assume that Y = m(\mathbf{X}) + \varepsilon, where \varepsilon is a centered Gaussian noise, independent of \mathbf{X}, with finite variance \sigma^2 < \infty. Moreover, \mathbf{X} is uniformly distributed on [0,1]^d and m is Lipschitz. Scornet[43] proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.

Consistency of centered KeRF


Providing k \to \infty and n/2^k \to \infty, there exists a constant C_1 > 0 such that, for all n,

\mathbb{E}[\tilde{m}_n^{cc}(\mathbf{X}) - m(\mathbf{X})]^2 \leq C_1 n^{-1/(3 + d \log 2)} (\log n)^2.

Consistency of uniform KeRF


Providing k \to \infty and n/2^k \to \infty, there exists a constant C > 0 such that

\mathbb{E}[\tilde{m}_n^{uf}(\mathbf{X}) - m(\mathbf{X})]^2 \leq C n^{-2/(6 + 3 d \log 2)} (\log n)^2.

Disadvantages


While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability of decision trees. Decision trees are among a fairly small family of machine learning models that are easily interpretable, along with linear models, rule-based models, and attention-based models. This interpretability is one of the main advantages of decision trees. It allows developers to confirm that the model has learned realistic information from the data and allows end-users to have trust and confidence in the decisions made by the model.[39][3] For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function.[39][49][50]

Another limitation of random forests is that, if features are linearly correlated with the target, random forests may not enhance the accuracy of the base learner.[39][42] The same holds in problems with multiple categorical variables.[51]


References

  1. ^abcdHo, Tin Kam (1995).Random Decision Forests(PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282. Archived fromthe original(PDF) on 17 April 2016. Retrieved5 June 2016.
  2. ^abcdHo TK (1998)."The Random Subspace Method for Constructing Decision Forests"(PDF).IEEE Transactions on Pattern Analysis and Machine Intelligence.20 (8):832–844.Bibcode:1998ITPAM..20..832T.doi:10.1109/34.709601.S2CID 206420153.
  3. ^abcdefgHastie, Trevor;Tibshirani, Robert;Friedman, Jerome (2008).The Elements of Statistical Learning (2nd ed.). Springer.ISBN 0-387-95284-5.
  4. ^abKleinberg E (1990)."Stochastic Discrimination"(PDF).Annals of Mathematics and Artificial Intelligence.1 (1–4):207–239.CiteSeerX 10.1.1.25.6750.doi:10.1007/BF01531079.S2CID 206795835. Archived fromthe original(PDF) on 2018-01-18.
  5. ^abKleinberg E (1996)."An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition".Annals of Statistics.24 (6):2319–2349.doi:10.1214/aos/1032181157.MR 1425956.
  6. ^abKleinberg E (2000)."On the Algorithmic Implementation of Stochastic Discrimination"(PDF).IEEE Transactions on Pattern Analysis and Machine Intelligence.22 (5):473–490.Bibcode:2000ITPAM..22..473K.CiteSeerX 10.1.1.33.4131.doi:10.1109/34.857004.S2CID 3563126. Archived fromthe original(PDF) on 2018-01-18.
  7. ^abcdBreiman L (2001)."Random Forests".Machine Learning.45 (1):5–32.Bibcode:2001MachL..45....5B.doi:10.1023/A:1010933404324.
  8. ^abLiaw A (16 October 2012)."Documentation for R package randomForest"(PDF). Retrieved15 March 2013.
  9. ^U.S. trademark registration number 3185828, registered 2006/12/19.
  10. ^"RANDOM FORESTS Trademark of Health Care Productivity, Inc. - Registration Number 3185828 - Serial Number 78642027 :: Justia Trademarks".
  11. ^abAmit Y,Geman D (1997)."Shape quantization and recognition with randomized trees"(PDF).Neural Computation.9 (7):1545–1588.CiteSeerX 10.1.1.57.6069.doi:10.1162/neco.1997.9.7.1545.S2CID 12470146. Archived fromthe original(PDF) on 2018-02-05. Retrieved2008-04-01.
  12. ^Heath, D., Kasif, S. and Salzberg, S. (1993).k-DT: A multi-tree learning method. InProceedings of the Second Intl. Workshop on Multistrategy Learning, pp. 138-149.
  13. ^Dietterich, Thomas (2000)."An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization".Machine Learning.40 (2):139–157.doi:10.1023/A:1007607513941.
  14. ^Helen Pearson; Heidi Ledford; Matthew Hutson; Richard Van Noorden (15 April 2025). "Exclusive: the most-cited papers of the twenty-first century".Nature.640 (8059):588–592.doi:10.1038/D41586-025-01125-9.ISSN 1476-4687.Wikidata Q135104889.
  15. ^Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013).An Introduction to Statistical Learning. Springer. pp. 316–321.
  16. ^Ho, Tin Kam (2002)."A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors"(PDF).Pattern Analysis and Applications.5 (2):102–112.doi:10.1007/s100440200009.S2CID 7415435. Archived fromthe original(PDF) on 2016-04-17. Retrieved2015-11-13.
  17. ^Geurts P, Ernst D, Wehenkel L (2006)."Extremely randomized trees"(PDF).Machine Learning.63:3–42.doi:10.1007/s10994-006-6226-1.
  18. ^Dessi, N. & Milia, G. & Pes, B. (2013). Enhancing random forests performance in microarray data classification. Conference paper, 99-103. 10.1007/978-3-642-38326-7_15.
  19. ^Ye, Y., Li, H., Deng, X., and Huang, J. (2008) Feature weighting random forest for detection of hidden web search interfaces. Journal of Computational Linguistics and Chinese Language Processing, 13, 387–404.
  20. ^Amaratunga, D., Cabrera, J., Lee, Y.S. (2008) Enriched Random Forest. Bioinformatics, 24, 2010-2014.
  21. ^Ghosh D, Cabrera J. (2022) Enriched random forest for high dimensional genomic data. IEEE/ACM Trans Comput Biol Bioinform. 19(5):2817-2828. doi:10.1109/TCBB.2021.3089417.
  22. ^ Amaratunga, D., Cabrera, J., Shkedy, Z. (2014). Exploration and Analysis of DNA Microarray and Other High-Dimensional Data. New York: John Wiley. Second Edition. 0.1002/9781118364505.
  23. ^Winham, Stacey & Freimuth, Robert & Biernacka, Joanna. (2013). A weighted random forests approach to improve predictive performance. Statistical Analysis and Data Mining. 6. 10.1002/sam.11196.
  24. ^ Li, H. B., Wang, W., Ding, H. W., & Dong, J. (2010, 10-12 Nov. 2010). Trees weighting random forest method for classifying high-dimensional noisy data. Paper presented at the 2010 IEEE 7th International Conference on E-Business Engineering.
  25. ^Zhu R, Zeng D, Kosorok MR (2015)."Reinforcement Learning Trees".Journal of the American Statistical Association.110 (512):1770–1784.doi:10.1080/01621459.2015.1036994.PMC 4760114.PMID 26903687.
  26. ^Deng, H.; Runger, G.; Tuv, E. (2011).Bias of importance measures for multi-valued attributes and solutions. Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300.
  27. ^Altmann A, Toloşi L, Sander O, Lengauer T (May 2010)."Permutation importance: a corrected feature importance measure".Bioinformatics.26 (10):1340–7.doi:10.1093/bioinformatics/btq134.PMID 20385727.
  28. ^Piryonesi S. Madeh; El-Diraby Tamer E. (2020-06-01). "Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems".Journal of Transportation Engineering, Part B: Pavements.146 (2): 04020022.doi:10.1061/JPEODX.0000175.S2CID 216485629.
  29. ^Strobl C, Boulesteix AL, Augustin T (2007)."Unbiased split selection for classification trees based on the Gini index"(PDF).Computational Statistics & Data Analysis.52:483–501.CiteSeerX 10.1.1.525.3178.doi:10.1016/j.csda.2006.12.030.
  30. ^Painsky A, Rosset S (2017). "Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance".IEEE Transactions on Pattern Analysis and Machine Intelligence.39 (11):2142–2153.arXiv:1512.03444.Bibcode:2017ITPAM..39.2142P.doi:10.1109/tpami.2016.2636831.PMID 28114007.S2CID 5381516.
  31. ^Tolosi L, Lengauer T (July 2011)."Classification with correlated features: unreliability of feature ranking and solutions".Bioinformatics.27 (14):1986–94.doi:10.1093/bioinformatics/btr300.PMID 21576180.
  32. ^ab"Beware Default Random Forest Importances".explained.ai. Retrieved2023-10-25.
  33. ^Ortiz-Posadas, Martha Refugio (2020-02-29).Pattern Recognition Techniques Applied to Biomedical Problems. Springer Nature.ISBN 978-3-030-38021-2.
  34. ^Breiman, Leo (2017-10-25).Classification and Regression Trees. New York: Routledge.doi:10.1201/9781315139470.ISBN 978-1-315-13947-0.
  35. ^https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html 31. Aug. 2023
  36. ^abLin, Yi; Jeon, Yongho (2002).Random forests and adaptive nearest neighbors (Technical report). Technical Report No. 1055. University of Wisconsin.CiteSeerX 10.1.1.153.9168.
  37. ^Shi, T.; Horvath, S. (2006). "Unsupervised Learning with Random Forest Predictors".Journal of Computational and Graphical Statistics.15 (1):118–138.CiteSeerX 10.1.1.698.2365.doi:10.1198/106186006X94072.JSTOR 27594168.S2CID 245216.
  38. ^Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S (April 2005)."Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma".Modern Pathology.18 (4):547–57.doi:10.1038/modpathol.3800322.PMID 15529185.
  39. ^abcdePiryonesi, S. Madeh; El-Diraby, Tamer E. (2021-02-01)."Using Machine Learning to Examine Impact of Type of Performance Indicator on Flexible Pavement Deterioration Modeling".Journal of Infrastructure Systems.27 (2): 04021005.doi:10.1061/(ASCE)IS.1943-555X.0000602.ISSN 1076-0342.S2CID 233550030.
  40. ^Prinzie, A.; Van den Poel, D. (2008). "Random Forests for multiclass classification: Random MultiNomial Logit".Expert Systems with Applications.34 (3):1721–1732.doi:10.1016/j.eswa.2007.01.029.
  41. ^Prinzie, Anita (2007). "Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB". In Roland Wagner; Norman Revell; Günther Pernul (eds.).Database and Expert Systems Applications: 18th International Conference, DEXA 2007, Regensburg, Germany, September 3-7, 2007, Proceedings. Lecture Notes in Computer Science. Vol. 4653. pp. 349–358.doi:10.1007/978-3-540-74469-6_35.ISBN 978-3-540-74467-2.
  42. ^abSmith, Paul F.; Ganesh, Siva; Liu, Ping (2013-10-01)."A comparison of random forest regression and multiple linear regression for prediction in neuroscience".Journal of Neuroscience Methods.220 (1):85–91.doi:10.1016/j.jneumeth.2013.08.024.PMID 24012917.S2CID 13195700.
  43. ^abcdScornet, Erwan (2015). "Random forests and kernel methods".arXiv:1502.03836 [math.ST].
  44. ^Breiman, Leo (2000)."Some infinity theory for predictor ensembles". Technical Report 579, Statistics Dept. UCB.{{cite journal}}:Cite journal requires|journal= (help)
  45. ^Lin, Yi; Jeon, Yongho (2006). "Random forests and adaptive nearest neighbors".Journal of the American Statistical Association.101 (474):578–590.CiteSeerX 10.1.1.153.9168.doi:10.1198/016214505000001230.S2CID 2469856.
  46. ^Davies, Alex; Ghahramani, Zoubin (2014). "The Random Forest Kernel and other kernels for big data from random partitions".arXiv:1402.4293 [stat.ML].
  47. ^abBreiman L, Ghahramani Z (2004). "Consistency for a simple model of random forests".Statistical Department, University of California at Berkeley. Technical Report (670).CiteSeerX 10.1.1.618.90.
  48. ^abArlot S, Genuer R (2014). "Analysis of purely random forests bias".arXiv:1407.3939 [math.ST].
  49. ^Sagi, Omer; Rokach, Lior (2020)."Explainable decision forest: Transforming a decision forest into an interpretable tree".Information Fusion.61:124–138.doi:10.1016/j.inffus.2020.03.013.S2CID 216444882.
  50. ^Vidal, Thibaut; Schiffer, Maximilian (2020)."Born-Again Tree Ensembles".International Conference on Machine Learning.119. PMLR:9743–9753.arXiv:2003.11132.
  51. ^Piryonesi, Sayed Madeh (November 2019).The Application of Data Analytics to Asset Management: Deterioration and Climate Change Adaptation in Ontario Roads (Doctoral dissertation) (Thesis).

