1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking
Ensemble methods combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
Two very famous examples of ensemble methods are gradient-boosted trees and random forests.
More generally, ensemble models can be applied to any base learner beyond trees, in averaging methods such as Bagging methods, model stacking, or Voting, or in boosting, such as AdaBoost.
1.11.1. Gradient-boosted trees
Gradient Tree Boosting or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differentiable loss functions, see the seminal work of [Friedman2001]. GBDT is an excellent model for both regression and classification, in particular for tabular data.
GradientBoostingClassifier vs HistGradientBoostingClassifier
Scikit-learn provides two implementations of gradient-boosted trees: HistGradientBoostingClassifier vs GradientBoostingClassifier for classification, and the corresponding classes for regression. The former can be orders of magnitude faster than the latter when the number of samples is larger than tens of thousands.
Missing values and categorical data are natively supported by the Hist… version, removing the need for additional preprocessing such as imputation.
GradientBoostingClassifier and GradientBoostingRegressor might be preferred for small sample sizes since binning may lead to split points that are too approximate in this setting.
1.11.1.1. Histogram-Based Gradient Boosting
Scikit-learn 0.21 introduced two new implementations of gradient boosted trees, namely HistGradientBoostingClassifier and HistGradientBoostingRegressor, inspired by LightGBM (see [LightGBM]).
These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier and GradientBoostingRegressor when the number of samples is larger than tens of thousands.
They also have built-in support for missing values, which avoids the need for an imputer.
These fast estimators first bin the input samples X into integer-valued bins (typically 256 bins), which tremendously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees. The API of these estimators is slightly different, and some of the features from GradientBoostingClassifier and GradientBoostingRegressor are not yet supported, for instance some loss functions.
Examples
Partial Dependence and Individual Conditional Expectation Plots
Comparing Random Forests and Histogram Gradient Boosting models
1.11.1.1.1. Usage
Most of the parameters are unchanged from GradientBoostingClassifier and GradientBoostingRegressor. One exception is the max_iter parameter that replaces n_estimators, and controls the number of iterations of the boosting process:
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.8965
Available losses for regression are:
‘squared_error’, which is the default loss;
‘absolute_error’, which is less sensitive to outliers than the squared error;
‘gamma’, which is well suited to model strictly positive outcomes;
‘poisson’, which is well suited to model counts and frequencies;
‘quantile’, which allows for estimating a conditional quantile that can later be used to obtain prediction intervals.
For classification, ‘log_loss’ is the only option. For binary classification it uses the binary log loss, also known as binomial deviance or binary cross-entropy. For n_classes >= 3, it uses the multi-class log loss function, with multinomial deviance and categorical cross-entropy as alternative names. The appropriate loss version is selected based on y passed to fit.
The size of the trees can be controlled through the max_leaf_nodes, max_depth, and min_samples_leaf parameters.
The number of bins used to bin the data is controlled with the max_bins parameter. Using fewer bins acts as a form of regularization. It is generally recommended to use as many bins as possible (255), which is the default.
The l2_regularization parameter acts as a regularizer for the loss function, and corresponds to \(\lambda\) in the following expression (see equation (2) in [XGBoost]):
\[\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \frac12 \sum_k \lambda ||w_k||^2\]
Details on l2 regularization
It is important to notice that the loss term \(l(\hat{y}_i, y_i)\) describes only half of the actual loss function except for the pinball loss and absolute error.
The index \(k\) refers to the k-th tree in the ensemble of trees. In the case of regression and binary classification, gradient boosting models grow one tree per iteration, so \(k\) runs up to max_iter. In the case of multiclass classification problems, the maximal value of the index \(k\) is n_classes \(\times\) max_iter.
If \(T_k\) denotes the number of leaves in the k-th tree, then \(w_k\) is a vector of length \(T_k\), which contains the leaf values of the form w = -sum_gradient / (sum_hessian + l2_regularization) (see equation (5) in [XGBoost]).
The leaf values \(w_k\) are derived by dividing the sum of the gradients of the loss function by the combined sum of hessians. Adding the regularization to the denominator penalizes the leaves with small hessians (flat regions), resulting in smaller updates. Those \(w_k\) values then contribute to the model’s prediction for a given input that ends up in the corresponding leaf. The final prediction is the sum of the base prediction and the contributions from each tree. The result of that sum is then transformed by the inverse link function depending on the choice of the loss function (see Mathematical formulation).
Notice that the original paper [XGBoost] introduces a term \(\gamma\sum_k T_k\) that penalizes the number of leaves (making it a smooth version of max_leaf_nodes) not presented here as it is not implemented in scikit-learn; whereas \(\lambda\) penalizes the magnitude of the individual tree predictions before being rescaled by the learning rate, see Shrinkage via learning rate.
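To make the effect of the regularization on the leaf values concrete, here is a small worked example of the formula w = -sum_gradient / (sum_hessian + l2_regularization). The helper `leaf_value` and the per-leaf statistics are hypothetical, chosen only to illustrate the arithmetic; this is not the library's internal code.

```python
def leaf_value(sum_gradient, sum_hessian, l2_regularization):
    """Leaf value of the form w = -sum_gradient / (sum_hessian + l2)."""
    return -sum_gradient / (sum_hessian + l2_regularization)


# Hypothetical leaves: same gradient sum, small vs. large hessian sum.
for sum_hessian in (0.5, 50.0):
    unregularized = leaf_value(-2.0, sum_hessian, l2_regularization=0.0)
    regularized = leaf_value(-2.0, sum_hessian, l2_regularization=1.0)
    # The shrink factor equals sum_hessian / (sum_hessian + l2_regularization).
    print(sum_hessian, unregularized, regularized, regularized / unregularized)
```

With l2_regularization=1, the leaf with a hessian sum of 0.5 keeps only one third of its unregularized value, while the leaf with a hessian sum of 50 keeps 50/51 of it: leaves in flat regions are shrunk the most.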
Note that early-stopping is enabled by default if the number of samples in the training set is larger than 10,000, in which case the validation loss is used. The early-stopping behaviour is controlled via the early_stopping, scoring, validation_fraction, n_iter_no_change, and tol parameters. It is possible to early-stop using an arbitrary scorer, or just the training or validation loss. Note that for technical reasons, using a callable as a scorer is significantly slower than using the loss.
1.11.1.1.2. Missing values support
HistGradientBoostingClassifier and HistGradientBoostingRegressor have built-in support for missing values (NaNs).
During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child accordingly:
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> import numpy as np
>>> X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
>>> y = [0, 0, 1, 1]
>>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
>>> gbdt.predict(X)
array([0, 0, 1, 1])
When the missingness pattern is predictive, the splits can be performed on whether the feature value is missing or not:
>>> X = np.array([0, np.nan, 1, 2, np.nan]).reshape(-1, 1)
>>> y = [0, 1, 0, 0, 1]
>>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1,
...                                       max_depth=2,
...                                       learning_rate=1,
...                                       max_iter=1).fit(X, y)
>>> gbdt.predict(X)
array([0, 1, 0, 0, 1])
If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples.
1.11.1.1.3. Sample weight support
HistGradientBoostingClassifier and HistGradientBoostingRegressor support sample weights during fit.
The following toy example demonstrates that samples with a sample weight of zero are ignored:
>>> X = [[1, 0],
...      [1, 0],
...      [1, 0],
...      [0, 1]]
>>> y = [0, 0, 1, 0]
>>> # ignore the first 2 training samples by setting their weight to 0
>>> sample_weight = [0, 0, 1, 1]
>>> gb = HistGradientBoostingClassifier(min_samples_leaf=1)
>>> gb.fit(X, y, sample_weight=sample_weight)
HistGradientBoostingClassifier(...)
>>> gb.predict([[1, 0]])
array([1])
>>> gb.predict_proba([[1, 0]])[0, 1]
np.float64(0.999)
As you can see, the sample [1, 0] is comfortably classified as 1 since the first two samples are ignored due to their sample weights.
Implementation detail: taking sample weights into account amounts to multiplying the gradients (and the hessians) by the sample weights. Note that the binning stage (specifically the quantiles computation) does not take the weights into account.
1.11.1.1.4. Categorical Features Support
HistGradientBoostingClassifier and HistGradientBoostingRegressor have native support for categorical features: they can consider splits on non-ordered, categorical data.
For datasets with categorical features, using the native categorical support is often better than relying on one-hot encoding (OneHotEncoder), because one-hot encoding requires more tree depth to achieve equivalent splits. It is also usually better to rely on the native categorical support rather than to treat categorical features as continuous (ordinal), which happens for ordinal-encoded categorical data, since categories are nominal quantities where order does not matter.
To enable categorical support, a boolean mask can be passed to the categorical_features parameter, indicating which feature is categorical. In the following, the first feature will be treated as categorical and the second feature as numerical:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])
Equivalently, one can pass a list of integers indicating the indices of the categorical features:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])
When the input is a DataFrame, it is also possible to pass a list of column names:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=["site", "manufacturer"])
Finally, when the input is a DataFrame we can use categorical_features="from_dtype", in which case all columns with a categorical dtype will be treated as categorical features.
The cardinality of each categorical feature must be less than the max_bins parameter. For an example using histogram-based gradient boosting on categorical features, see Categorical Feature Support in Gradient Boosting.
If there are missing values during training, the missing values will be treated as a proper category. If there are no missing values during training, then at prediction time, missing values are mapped to the child node that has the most samples (just like for continuous features). When predicting, categories that were not seen during fit time will be treated as missing values.
Split finding with categorical features
The canonical way of considering categorical splits in a tree is to consider all of the \(2^{K - 1} - 1\) partitions, where \(K\) is the number of categories. This can quickly become prohibitive when \(K\) is large. Fortunately, since gradient boosting trees are always regression trees (even for classification problems), there exists a faster strategy that can yield equivalent splits. First, the categories of a feature are sorted according to the variance of the target, for each category k. Once the categories are sorted, one can consider continuous partitions, i.e. treat the categories as if they were ordered continuous values (see Fisher [Fisher1958] for a formal proof). As a result, only \(K - 1\) splits need to be considered instead of \(2^{K - 1} - 1\). The initial sorting is a \(\mathcal{O}(K \log(K))\) operation, leading to a total complexity of \(\mathcal{O}(K \log(K) + K)\), instead of \(\mathcal{O}(2^K)\).
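The idea of replacing the exhaustive search by a sorted search can be checked numerically on a toy problem. The sketch below is a simplified illustration, not the library's split finder: it assumes a squared-error split criterion and orders the categories by their mean target value, which is the setting in which the contiguous-partition result is usually stated. All names and data are made up for the example.

```python
from itertools import combinations

import numpy as np

rng = np.random.RandomState(0)
K = 6
categories = rng.randint(K, size=200)
target = rng.normal(loc=categories % 3, scale=1.0)


def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    return float(np.sum((values - values.mean()) ** 2)) if values.size else 0.0


def split_cost(left_categories):
    """Cost of sending `left_categories` to the left child, the rest to the right."""
    mask = np.isin(categories, list(left_categories))
    return sse(target[mask]) + sse(target[~mask])


# Exhaustive search over all non-trivial binary partitions. Each partition is
# visited twice (once per side), which does not change the minimum.
exhaustive = min(
    split_cost(subset)
    for size in range(1, K)
    for subset in combinations(range(K), size)
)

# Sorted search: order the categories by mean target, then try the K - 1 cut points.
order = sorted(range(K), key=lambda k: target[categories == k].mean())
sorted_search = min(split_cost(order[:cut]) for cut in range(1, K))

print(exhaustive, sorted_search)
```

Both searches reach the same minimal cost, while the sorted search evaluates only K - 1 = 5 candidate splits.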
1.11.1.1.5. Monotonic Constraints
Depending on the problem at hand, you may have prior knowledge indicating that a given feature should in general have a positive (or negative) effect on the target value. For example, all else being equal, a higher credit score should increase the probability of getting approved for a loan. Monotonic constraints allow you to incorporate such prior knowledge into the model.
For a predictor \(F\) with two features:
a monotonic increase constraint is a constraint of the form:
\[x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2)\]
a monotonic decrease constraint is a constraint of the form:
\[x_1 \leq x_1' \implies F(x_1, x_2) \geq F(x_1', x_2)\]
You can specify a monotonic constraint on each feature using the monotonic_cst parameter. For each feature, a value of 0 indicates no constraint, while 1 and -1 indicate a monotonic increase and monotonic decrease constraint, respectively:
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> # monotonic increase, monotonic decrease, and no constraint on the 3 features
>>> gbdt = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 0])
In a binary classification context, imposing a monotonic increase (decrease) constraint means that higher values of the feature are supposed to have a positive (negative) effect on the probability of samples to belong to the positive class.
Nevertheless, monotonic constraints only marginally constrain feature effects on the output. For instance, monotonic increase and decrease constraints cannot be used to enforce the following modelling constraint:
\[x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2')\]
Also, monotonic constraints are not supported for multiclass classification.
For a practical implementation of monotonic constraints with the histogram-based gradient boosting, including how they can improve generalization when domain knowledge is available, see Monotonic Constraints.
Note
Since categories are unordered quantities, it is not possible to enforce monotonic constraints on categorical features.
1.11.1.1.6. Interaction constraints
A priori, the histogram gradient boosted trees are allowed to use any feature to split a node into child nodes. This creates so-called interactions between features, i.e. usage of different features as split along a branch. Sometimes, one wants to restrict the possible interactions, see [Mayer2022]. This can be done by the parameter interaction_cst, where one can specify the indices of features that are allowed to interact. For instance, with 3 features in total, interaction_cst=[{0}, {1}, {2}] forbids all interactions. The constraints [{0, 1}, {1, 2}] specify two groups of possibly interacting features. Features 0 and 1 may interact with each other, as well as features 1 and 2. But note that features 0 and 2 are forbidden to interact. The following depicts a tree and the possible splits of the tree:
     1      <- Both constraint groups could be applied from now on
    / \
   1   2    <- Left split still fulfills both constraint groups.
  / \ / \      Right split at feature 2 has only group {1, 2} from now on.
LightGBM uses the same logic for overlapping groups.
Note that features not listed in interaction_cst are automatically assigned an interaction group for themselves. With again 3 features, this means that [{0}] is equivalent to [{0}, {1, 2}].
References
[Mayer2022] M. Mayer, S.C. Bourassa, M. Hoesli, and D.F. Scognamiglio. 2022. Machine Learning Applications to Land and Structure Valuation. Journal of Risk and Financial Management 15, no. 5: 193.
1.11.1.1.7. Low-level parallelism
HistGradientBoostingClassifier and HistGradientBoostingRegressor use OpenMP for parallelization through Cython. For more details on how to control the number of threads, please refer to our Parallelism notes.
The following parts are parallelized:
mapping samples from real values to integer-valued bins (finding the bin thresholds is however sequential)
building histograms is parallelized over features
finding the best split point at a node is parallelized over features
during fit, mapping samples into the left and right children is parallelized over samples
gradient and hessian computations are parallelized over samples
predicting is parallelized over samples
1.11.1.1.8. Why it’s faster
The bottleneck of a gradient boosting procedure is building the decision trees. Building a traditional decision tree (as in the other GBDTs GradientBoostingClassifier and GradientBoostingRegressor) requires sorting the samples at each node (for each feature). Sorting is needed so that the potential gain of a split point can be computed efficiently. Splitting a single node thus has a complexity of \(\mathcal{O}(n_\text{features} \times n \log(n))\) where \(n\) is the number of samples at the node.
HistGradientBoostingClassifier and HistGradientBoostingRegressor, in contrast, do not require sorting the feature values and instead use a data structure called a histogram, where the samples are implicitly ordered. Building a histogram has a \(\mathcal{O}(n)\) complexity, so the node splitting procedure has a \(\mathcal{O}(n_\text{features} \times n)\) complexity, much smaller than the previous one. In addition, instead of considering \(n\) split points, we consider only max_bins split points, which might be much smaller.
In order to build histograms, the input data X needs to be binned into integer-valued bins. This binning procedure does require sorting the feature values, but it only happens once at the very beginning of the boosting process (not at each node, like in GradientBoostingClassifier and GradientBoostingRegressor).
Finally, many parts of the implementation of HistGradientBoostingClassifier and HistGradientBoostingRegressor are parallelized.
References
[Fisher1958] Fisher, W.D. (1958). “On Grouping for Maximum Homogeneity”. Journal of the American Statistical Association, 53, 789-798.
1.11.1.2. GradientBoostingClassifier and GradientBoostingRegressor
The usage and the parameters of GradientBoostingClassifier and GradientBoostingRegressor are described below. The two most important parameters of these estimators are n_estimators and learning_rate.
Classification
GradientBoostingClassifier supports both binary and multi-class classification. The following example shows how to fit a gradient boosting classifier with 100 decision stumps as weak learners:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.913
The number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators. The size of each tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via max_leaf_nodes. The learning_rate is a hyper-parameter in the range (0.0, 1.0] that controls overfitting via shrinkage.
Note
Classification with more than 2 classes requires the induction of n_classes regression trees at each iteration, thus, the total number of induced trees equals n_classes * n_estimators. For datasets with a large number of classes we strongly recommend using HistGradientBoostingClassifier as an alternative to GradientBoostingClassifier.
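The tree count stated in the note above can be inspected on a fitted model through its estimators_ attribute. The choice of the iris dataset (3 classes) and of 20 iterations is an assumption made for this small illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# One row per boosting iteration, one column per class.
print(clf.estimators_.shape)  # (20, 3)
print(clf.estimators_.size)   # n_classes * n_estimators = 60 trees in total
```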
Regression
GradientBoostingRegressor supports a number of different loss functions for regression which can be specified via the argument loss; the default loss function for regression is squared error ('squared_error').
>>> import numpy as np
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
>>> X_train, X_test = X[:200], X[200:]
>>> y_train, y_test = y[:200], y[200:]
>>> est = GradientBoostingRegressor(
...     n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0,
...     loss='squared_error'
... ).fit(X_train, y_train)
>>> mean_squared_error(y_test, est.predict(X_test))
5.00
The figure below shows the results of applying GradientBoostingRegressor with least squares loss and 500 base learners to the diabetes dataset (sklearn.datasets.load_diabetes). The plot shows the train and test error at each iteration. The train error at each iteration is stored in the train_score_ attribute of the gradient boosting model. The test error at each iteration can be obtained via the staged_predict method which returns a generator that yields the predictions at each stage. Plots like these can be used to determine the optimal number of trees (i.e. n_estimators) by early stopping.
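A minimal sketch of how the per-iteration test error can be computed with staged_predict is given below. The train/test split and the values of learning_rate and max_depth are assumptions of this example, not the settings used for the figure.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

est = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=3, random_state=0
).fit(X_train, y_train)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees.
test_errors = np.array(
    [mean_squared_error(y_test, y_pred) for y_pred in est.staged_predict(X_test)]
)
best_n_estimators = int(np.argmin(test_errors)) + 1
print(best_n_estimators, test_errors[best_n_estimators - 1])
```

In a real workflow the number of trees would be selected on a separate validation set rather than on the test set, which is used here only to keep the sketch short.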

1.11.1.2.1. Fitting additional weak-learners
Both GradientBoostingRegressor and GradientBoostingClassifier support warm_start=True which allows you to add more estimators to an already fitted model.
>>> import numpy as np
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
>>> X_train, X_test = X[:200], X[200:]
>>> y_train, y_test = y[:200], y[200:]
>>> est = GradientBoostingRegressor(
...     n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0,
...     loss='squared_error'
... )
>>> est = est.fit(X_train, y_train)  # fit with 100 trees
>>> mean_squared_error(y_test, est.predict(X_test))
5.00
>>> _ = est.set_params(n_estimators=200, warm_start=True)  # set warm_start and increase num of trees
>>> _ = est.fit(X_train, y_train)  # fit additional 100 trees to est
>>> mean_squared_error(y_test, est.predict(X_test))
3.84
1.11.1.2.2. Controlling the tree size
The size of the regression tree base learners defines the level of variable interactions that can be captured by the gradient boosting model. In general, a tree of depth h can capture interactions of order h. There are two ways in which the size of the individual regression trees can be controlled.
If you specify max_depth=h then complete binary trees of depth h will be grown. Such trees will have (at most) 2**h leaf nodes and 2**h - 1 split nodes.
Alternatively, you can control the tree size by specifying the number of leaf nodes via the parameter max_leaf_nodes. In this case, trees will be grown using best-first search where nodes with the highest improvement in impurity will be expanded first. A tree with max_leaf_nodes=k has k - 1 split nodes and thus can model interactions of up to order max_leaf_nodes - 1.
We found that max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to train at the expense of a slightly higher training error. The parameter max_leaf_nodes corresponds to the variable J in the chapter on gradient boosting in [Friedman2001] and is related to the parameter interaction.depth in R’s gbm package where max_leaf_nodes == interaction.depth + 1.
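The two ways of limiting the tree size can be compared by inspecting the fitted trees. The sketch below uses k = 4 and a synthetic dataset, both chosen for illustration; with 4 leaves a tree cannot be deeper than 3, so the default max_depth=3 is not binding for the first model.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, random_state=0)

k = 4
by_leaves = GradientBoostingRegressor(
    n_estimators=20, max_leaf_nodes=k, random_state=0
).fit(X, y)
by_depth = GradientBoostingRegressor(
    n_estimators=20, max_depth=k - 1, random_state=0
).fit(X, y)

# For regression, estimators_ has shape (n_estimators, 1).
max_leaves = max(tree.get_n_leaves() for tree in by_leaves.estimators_[:, 0])
max_depth = max(tree.get_depth() for tree in by_depth.estimators_[:, 0])
print(max_leaves, max_depth)
```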
1.11.1.2.3. Mathematical formulation
We first present GBRT for regression, and then detail the classification case.
Regression
GBRT regressors are additive models whose prediction \(\hat{y}_i\) for a given input \(x_i\) is of the following form:
\[\hat{y}_i = F_M(x_i) = \sum_{m=1}^{M} h_m(x_i)\]
where the \(h_m\) are estimators called weak learners in the context of boosting. Gradient Tree Boosting uses decision tree regressors of fixed size as weak learners. The constant M corresponds to the n_estimators parameter.
Similar to other boosting algorithms, a GBRT is built in a greedy fashion:
\[F_m(x) = F_{m-1}(x) + h_m(x),\]
where the newly added tree \(h_m\) is fitted in order to minimize a sum of losses \(L_m\), given the previous ensemble \(F_{m-1}\):
\[h_m = \arg\min_{h} L_m = \arg\min_{h} \sum_{i=1}^{n} l(y_i, F_{m-1}(x_i) + h(x_i)),\]
where \(l(y_i, F(x_i))\) is defined by the loss parameter, detailed in the next section.
By default, the initial model \(F_{0}\) is chosen as the constant that minimizes the loss: for a least-squares loss, this is the empirical mean of the target values. The initial model can also be specified via the init argument.
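Using a first-order Taylor approximation, the value of \(l\) can be approximated as follows:
\[l(y_i, F_{m-1}(x_i) + h_m(x_i)) \approx l(y_i, F_{m-1}(x_i)) + h_m(x_i) \left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m - 1}}.\]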
Note
Briefly, a first-order Taylor approximation says that \(l(z) \approx l(a) + (z - a) \frac{\partial l}{\partial z}(a)\). Here, \(z\) corresponds to \(F_{m - 1}(x_i) + h_m(x_i)\), and \(a\) corresponds to \(F_{m-1}(x_i)\).
The quantity \(\left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m - 1}}\) is the derivative of the loss with respect to its second parameter, evaluated at \(F_{m-1}(x)\). It is easy to compute for any given \(F_{m - 1}(x_i)\) in a closed form since the loss is differentiable. We will denote it by \(g_i\).
Removing the constant terms, we have:
\[h_m \approx \arg\min_{h} \sum_{i=1}^{n} h(x_i) g_i\]
This is minimized if \(h(x_i)\) is fitted to predict a value that is proportional to the negative gradient \(-g_i\). Therefore, at each iteration, the estimator \(h_m\) is fitted to predict the negative gradients of the samples. The gradients are updated at each iteration. This can be considered as some kind of gradient descent in a functional space.
Note
For some losses, e.g. 'absolute_error' where the gradients are \(\pm 1\), the values predicted by a fitted \(h_m\) are not accurate enough: the tree can only output integer values. As a result, the leaf values of the tree \(h_m\) are modified once the tree is fitted, such that the leaf values minimize the loss \(L_m\). The update is loss-dependent: for the absolute error loss, the value of a leaf is updated to the median of the samples in that leaf.
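The greedy procedure described in this section can be sketched in a few lines for the squared error, where the negative gradient is simply the residual. This is a simplified illustration under stated assumptions (loss \(l(y, F) = (y - F)^2 / 2\), trees of depth 2, no shrinkage, 50 iterations), not the implementation used by GradientBoostingRegressor.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, random_state=0)

M = 50
F = np.full_like(y, y.mean())  # F_0: the constant minimizing the squared error
trees, train_mse = [], []

for m in range(M):
    # For l(y, F) = (y - F)^2 / 2, the gradient is g_i = F(x_i) - y_i,
    # so the negative gradient is the residual y_i - F(x_i).
    negative_gradient = y - F
    h_m = DecisionTreeRegressor(max_depth=2, random_state=0)
    h_m.fit(X, negative_gradient)
    F = F + h_m.predict(X)  # F_m = F_{m-1} + h_m
    trees.append(h_m)
    train_mse.append(float(np.mean((y - F) ** 2)))

print(train_mse[0], train_mse[-1])
```

Each tree predicts the mean residual of its leaves, so the training error cannot increase from one iteration to the next.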
Classification
Gradient boosting for classification is very similar to the regression case. However, the sum of the trees \(F_M(x_i) = \sum_m h_m(x_i)\) is not homogeneous to a prediction: it cannot be a class, since the trees predict continuous values.
The mapping from the value \(F_M(x_i)\) to a class or a probability is loss-dependent. For the log-loss, the probability that \(x_i\) belongs to the positive class is modeled as \(p(y_i = 1 | x_i) = \sigma(F_M(x_i))\) where \(\sigma\) is the sigmoid or expit function.
For multiclass classification, K trees (for K classes) are built at each of the \(M\) iterations. The probability that \(x_i\) belongs to class k is modeled as a softmax of the \(F_{M,k}(x_i)\) values.
Note that even for a classification task, the \(h_m\) sub-estimator is still a regressor, not a classifier. This is because the sub-estimators are trained to predict (negative) gradients, which are always continuous quantities.
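For the binary log-loss, the mapping from the raw sum of trees to a probability can be checked directly on a fitted model. The dataset size and the number of estimators below are assumptions of this sketch.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(n_samples=2000, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

raw = clf.decision_function(X[:5])      # F_M(x_i): one continuous value per sample
proba = clf.predict_proba(X[:5])[:, 1]  # probability of the positive class

same = bool(np.allclose(expit(raw), proba))
print(same)
print(type(clf.estimators_[0, 0]).__name__)  # DecisionTreeRegressor
```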
1.11.1.2.4. Loss Functions
The following loss functions are supported and can be specified using the parameter loss:
Regression
Squared error ('squared_error'): The natural choice for regression due to its superior computational properties. The initial model is given by the mean of the target values.
Absolute error ('absolute_error'): A robust loss function for regression. The initial model is given by the median of the target values.
Huber ('huber'): Another robust loss function that combines least squares and least absolute deviation; use alpha to control the sensitivity with regards to outliers (see [Friedman2001] for more details).
Quantile ('quantile'): A loss function for quantile regression. Use 0 < alpha < 1 to specify the quantile. This loss function can be used to create prediction intervals (see Prediction Intervals for Gradient Boosting Regression).
Classification
Binary log-loss ('log_loss'): The binomial negative log-likelihood loss function for binary classification. It provides probability estimates. The initial model is given by the log odds-ratio.
Multi-class log-loss ('log_loss'): The multinomial negative log-likelihood loss function for multi-class classification with n_classes mutually exclusive classes. It provides probability estimates. The initial model is given by the prior probability of each class. At each iteration n_classes regression trees have to be constructed which makes GBRT rather inefficient for data sets with a large number of classes.
Exponential loss ('exponential'): The same loss function as AdaBoostClassifier. Less robust to mislabeled examples than 'log_loss'; can only be used for binary classification.
1.11.1.2.5. Shrinkage via learning rate
[Friedman2001] proposed a simple regularization strategy that scales the contribution of each weak learner by a constant factor \(\nu\):
\[F_m(x) = F_{m-1}(x) + \nu h_m(x)\]
The parameter \(\nu\) is also called the learning rate because it scales the step length of the gradient descent procedure; it can be set via the learning_rate parameter.
The parameter learning_rate strongly interacts with the parameter n_estimators, the number of weak learners to fit. Smaller values of learning_rate require larger numbers of weak learners to maintain a constant training error. Empirical evidence suggests that small values of learning_rate favor better test error. [HTF] recommend setting the learning rate to a small constant (e.g. learning_rate <= 0.1) and choosing n_estimators large enough that early stopping applies; see Early stopping in Gradient Boosting. For a more detailed discussion of the interaction between learning_rate and n_estimators, see [R2007].
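The interaction between learning_rate and the number of trees can be explored by letting early stopping choose the number of estimators for different learning rates. The dataset, the two learning rates and the early-stopping settings below are assumptions made for this sketch.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

n_trees = {}
for learning_rate in (1.0, 0.1):
    est = GradientBoostingRegressor(
        n_estimators=1000,        # upper bound; early stopping may pick fewer
        learning_rate=learning_rate,
        n_iter_no_change=10,      # stop after 10 iterations without improvement
        validation_fraction=0.2,  # internal hold-out used for early stopping
        random_state=0,
    ).fit(X, y)
    n_trees[learning_rate] = est.n_estimators_

print(n_trees)
```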
1.11.1.2.6. Subsampling
[Friedman2002] proposed stochastic gradient boosting, which combines gradient boosting with bootstrap averaging (bagging). At each iteration the base learner is trained on a fraction subsample of the available training data. The subsample is drawn without replacement. A typical value of subsample is 0.5.
The figure below illustrates the effect of shrinkage and subsampling on the goodness-of-fit of the model. We can clearly see that shrinkage outperforms no-shrinkage. Subsampling with shrinkage can further increase the accuracy of the model. Subsampling without shrinkage, on the other hand, does poorly.

Another strategy to reduce the variance is by subsampling the features analogous to the random splits in RandomForestClassifier. The number of subsampled features can be controlled via the max_features parameter.
Note
Using a small max_features value can significantly decrease the runtime.
Stochastic gradient boosting makes it possible to compute out-of-bag estimates of the test deviance by computing the improvement in deviance on the examples that are not included in the bootstrap sample (i.e. the out-of-bag examples). The improvements are stored in the attribute oob_improvement_. oob_improvement_[i] holds the improvement in terms of the loss on the OOB samples if you add the i-th stage to the current predictions. Out-of-bag estimates can be used for model selection, for example to determine the optimal number of iterations. OOB estimates are usually very pessimistic, so we recommend using cross-validation instead and only using OOB if cross-validation is too time consuming.
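A minimal sketch of how the out-of-bag improvements might be used to pick a number of iterations is shown below. The dataset and hyper-parameter values are assumptions; accumulating the improvements with a cumulative sum is one simple way to read them.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

est = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, subsample=0.5, random_state=0
).fit(X, y)

# oob_improvement_ is only available when subsample < 1.0.
cumulative_oob = np.cumsum(est.oob_improvement_)
oob_best_iter = int(np.argmax(cumulative_oob)) + 1
print(est.oob_improvement_.shape, oob_best_iter)
```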
1.11.1.2.7. Interpretation with feature importance
Individual decision trees can be interpreted easily by simply visualizing the tree structure. Gradient boosting models, however, comprise hundreds of regression trees, so they cannot be easily interpreted by visual inspection of the individual trees. Fortunately, a number of techniques have been proposed to summarize and interpret gradient boosting models.
Often features do not contribute equally to predict the target response; in many situations the majority of the features are in fact irrelevant. When interpreting a model, the first question usually is: what are those important features and how do they contribute in predicting the target response?
Individual decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to measure the importance of each feature; the basic idea is: the more often a feature is used in the split points of a tree the more important that feature is. This notion of importance can be extended to decision tree ensembles by simply averaging the impurity-based feature importance of each tree (see Feature importance evaluation for more details).
The feature importance scores of a fitted gradient boosting model can be accessed via the feature_importances_ property:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X, y)
>>> clf.feature_importances_
array([0.107, 0.105, 0.113, 0.0987, 0.0947, 0.107, 0.0916, 0.0972, 0.0958, 0.0906])
Note that this computation of feature importance is based on impurity, and it is distinct from sklearn.inspection.permutation_importance which is based on permutation of the features.
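To see the two notions of importance side by side, the sketch below computes both for the same fitted model. The train/test split, the number of estimators and the number of permutation repeats are assumptions of this example.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Derived from the splits made on the training data; normalized to sum to 1.
impurity_based = clf.feature_importances_
# Drop in score when a feature is shuffled, measured here on held-out data.
permutation = permutation_importance(
    clf, X_test, y_test, n_repeats=5, random_state=0
)

print(impurity_based.round(3))
print(permutation.importances_mean.round(3))
```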
References
Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189-1232.
Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367-378.
G. Ridgeway (2006). Generalized Boosted Models: A guide to the gbm package
1.11.2.Random forests and other randomized tree ensembles#
The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques [B1998] specifically designed for trees. This means a diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
Like other classifiers, forest classifiers have to be fitted with two arrays: a sparse or dense array X of shape (n_samples, n_features) holding the training samples, and an array Y of shape (n_samples,) holding the target values (class labels) for the training samples:
>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of shape (n_samples, n_outputs)).
1.11.2.1.Random Forests#
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
Furthermore, when splitting each node during the construction of a tree, the best split is found through an exhaustive search of the feature values of either all input features or a random subset of size max_features. (See the parameter tuning guidelines for more details.)
The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yields decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant, hence yielding an overall better model.
In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
Histogram-Based Gradient Boosting (HGBT) models are a competitive alternative to random forests:
Building trees: Random forests typically rely on deep trees (that overfit individually), which use a lot of computational resources because they require many splits and evaluations of candidate splits. Boosting models build shallow trees (that underfit individually), which are faster to fit and predict.
Sequential boosting: In HGBT, the decision trees are built sequentially, where each tree is trained to correct the errors made by the previous ones. This allows them to iteratively improve the model’s performance using relatively few trees. In contrast, random forests average the predictions of independently built trees, which can require a larger number of trees to achieve the same level of accuracy.
Efficient binning: HGBT uses an efficient binning algorithm that can handle large datasets with a high number of features. The binning algorithm can pre-process the data to speed up the subsequent tree construction (see Why it’s faster). In contrast, the scikit-learn implementation of random forests does not use binning and relies on exact splitting, which can be computationally expensive.
Overall, the computational cost of HGBT versus RF depends on the specific characteristics of the dataset and the modeling task. It’s a good idea to try both models and compare their performance and computational efficiency on your specific problem to determine which model is the best fit.
1.11.2.2.Extremely Randomized Trees#
In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias:
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
...     random_state=0)
>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
...     random_state=0)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores.mean()
np.float64(0.98)
>>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
...     min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores.mean()
np.float64(0.999)
>>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
...     min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores.mean() > 0.999
np.True_

1.11.2.3.Parameters#
The main parameters to adjust when using these methods are n_estimators and max_features. The former is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also the greater the increase in bias. Empirical good default values are max_features=1.0 or equivalently max_features=None (always considering all features instead of a random subset) for regression problems, and max_features="sqrt" (using a random subset of size sqrt(n_features)) for classification tasks (where n_features is the number of features in the data). The default value of max_features=1.0 is equivalent to bagged trees and more randomness can be achieved by setting smaller values (e.g. 0.3 is a typical default in the literature). Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees). Bear in mind though that these values are usually not optimal, and might result in models that consume a lot of RAM. The best parameter values should always be cross-validated. In addition, note that in random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False). When using bootstrap sampling the generalization error can be estimated on the left out or out-of-bag samples. This can be enabled by setting oob_score=True.
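The out-of-bag estimate mentioned above can be obtained as in the following sketch. The synthetic dataset and the parameter values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # random subset of sqrt(n_features) per split
    bootstrap=True,       # required for out-of-bag estimation
    oob_score=True,
    random_state=0,
).fit(X, y)

# Accuracy estimated on the samples each tree did not see during fitting.
oob_accuracy = forest.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")
```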
Note
The size of the model with the default parameters is \(O(M \cdot N \cdot \log(N))\), where \(M\) is the number of trees and \(N\) is the number of samples. In order to reduce the size of the model, you can change these parameters: min_samples_split, max_leaf_nodes, max_depth and min_samples_leaf.
1.11.2.4.Parallelization#
Finally, this module also features the parallel construction of the trees and the parallel computation of the predictions through the n_jobs parameter. If n_jobs=k then computations are partitioned into k jobs, and run on k cores of the machine. If n_jobs=-1 then all cores available on the machine are used. Note that because of inter-process communication overhead, the speedup might not be linear (i.e., using k jobs will unfortunately not be k times as fast). Significant speedup can still be achieved though when building a large number of trees, or when building a single tree requires a fair amount of time (e.g., on large datasets).
References
Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
Breiman, “Arcing Classifiers”, Annals of Statistics 1998.
P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.
1.11.2.5.Feature importance evaluation#
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. In scikit-learn, the fraction of samples a feature contributes to is combined with the decrease in impurity from splitting them to create a normalized estimate of the predictive power of that feature.
By averaging the estimates of predictive ability over several randomized trees one can reduce the variance of such an estimate and use it for feature selection. This is known as the mean decrease in impurity, or MDI. Refer to [L2014] for more information on MDI and feature importance evaluation with Random Forests.
Warning
The impurity-based feature importances computed on tree-based models suffer from two flaws that can lead to misleading conclusions. First, they are computed on statistics derived from the training dataset and therefore do not necessarily inform us on which features are most important to make good predictions on a held-out dataset. Secondly, they favor high cardinality features, that is features with many unique values. Permutation feature importance is an alternative to impurity-based feature importance that does not suffer from these flaws. These two methods of obtaining feature importance are explored in: Permutation Importance vs Random Forest Feature Importance (MDI).
In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more important is the contribution of the matching feature to the prediction function.
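A short sketch of reading these estimates from a fitted forest follows. The synthetic dataset is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One MDI value per feature, normalized to sum to 1.0.
importances = forest.feature_importances_

# Feature indices sorted from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```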
References
G. Louppe, “Understanding Random Forests: From Theory to Practice”, PhD Thesis, U. of Liege, 2014.
1.11.2.6.Totally Random Trees Embedding#
RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of completely random trees, RandomTreesEmbedding encodes the data by the indices of the leaves a data point ends up in. This index is then encoded in a one-of-K manner, leading to a high dimensional, sparse binary coding. This coding can be computed very efficiently and can then be used as a basis for other learning tasks. The size and sparsity of the code can be influenced by choosing the number of trees and the maximum depth per tree. For each tree in the ensemble, the coding contains one entry of one. The size of the coding is at most n_estimators * 2 ** max_depth, the maximum number of leaves in the forest.
As neighboring data points are more likely to lie within the same leaf of a tree, the transformation performs an implicit, non-parametric density estimation.
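The properties of the coding described above (one active entry per tree, bounded code size) can be checked with a small sketch. The dataset and the values n_estimators=10 and max_depth=3 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

X, _ = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_coded = embedder.fit_transform(X)  # sparse output by default

# Each sample falls in exactly one leaf per tree, so each row has
# n_estimators entries equal to one.
ones_per_sample = np.asarray(X_coded.sum(axis=1)).ravel()

# Upper bound on the code size: n_estimators * 2 ** max_depth.
max_code_size = 10 * 2 ** 3
print(X_coded.shape, max_code_size)
```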
Examples
Manifold learning on handwritten digits: Locally Linear Embedding, Isomap… compares non-linear dimensionality reduction techniques on handwritten digits.
Feature transformations with ensembles of trees compares supervised and unsupervised tree based feature transformations.
See also
Manifold learning techniques can also be useful to derive non-linear representations of feature space, though these approaches also focus on dimensionality reduction.
1.11.2.7.Fitting additional trees#
RandomForest, Extra-Trees and RandomTreesEmbedding estimators all support warm_start=True which allows you to add more trees to an already fitted model.
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = make_classification(n_samples=100, random_state=1)
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, y)  # fit with 10 trees
>>> len(clf.estimators_)
10
>>> # set warm_start and increase num of estimators
>>> _ = clf.set_params(n_estimators=20, warm_start=True)
>>> _ = clf.fit(X, y)  # fit additional 10 trees
>>> len(clf.estimators_)
20
When random_state is also set, the internal random state is also preserved between fit calls. This means that training a model once with n estimators is the same as building the model iteratively via multiple fit calls, where the final number of estimators is equal to n.
>>> clf = RandomForestClassifier(n_estimators=20)  # set `n_estimators` to 10 + 10
>>> _ = clf.fit(X, y)  # fit `estimators_` will be the same as `clf` above
Note that this differs from the usual behavior of random_state in that it does not produce the same result across different calls.
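The equivalence stated above can be checked with an explicit seed, as in the following sketch. The seed value 0 and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=1)

# Incremental fit: 10 trees first, then 10 more with warm_start=True.
incremental = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
incremental.set_params(n_estimators=20, warm_start=True)
incremental.fit(X, y)

# Single fit with the same seed and the final number of trees.
single = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Compare the averaged class probabilities of both forests.
same_predictions = np.allclose(
    incremental.predict_proba(X), single.predict_proba(X)
)
print(len(incremental.estimators_), same_predictions)
```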
1.11.3.Bagging meta-estimator#
In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction. These methods are used as a way to reduce the variance of a base estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it. In many cases, bagging methods constitute a very simple way to improve with respect to a single model, without making it necessary to adapt the underlying base algorithm. As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).
Bagging methods come in many flavours but mostly differ from each other by the way they draw random subsets of the training set:
When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [B1999].
When samples are drawn with replacement, then the method is known as Bagging [B1996].
When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [H1998].
Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [LG2012].
In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator (resp. BaggingRegressor), taking as input a user-specified estimator along with parameters specifying the strategy to draw random subsets. In particular, max_samples and max_features control the size of the subsets (in terms of samples and features), while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement. When using a subset of the available samples the generalization accuracy can be estimated with the out-of-bag samples by setting oob_score=True. As an example, the snippet below illustrates how to instantiate a bagging ensemble of KNeighborsClassifier estimators, each built on random subsets of 50% of the samples and 50% of the features.
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)
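As a hedged extension of the snippet above, the sketch below fits such an ensemble and inspects the feature subsets and the out-of-bag score. The synthetic dataset, n_estimators=20 and the seed are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

bagging = BaggingClassifier(
    KNeighborsClassifier(),
    n_estimators=20,
    max_samples=0.5,    # each estimator sees 50% of the samples
    max_features=0.5,   # and 50% of the features
    bootstrap=True,     # samples drawn with replacement
    oob_score=True,     # estimate accuracy on out-of-bag samples
    random_state=0,
).fit(X, y)

# Number of features seen by each base estimator.
n_features_per_estimator = [len(f) for f in bagging.estimators_features_]
print(n_features_per_estimator[:3], bagging.oob_score_)
```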
References
L. Breiman, “Pasting small votes for classification in large databases and on-line”, Machine Learning, 36(1), 85-103, 1999.
L. Breiman, “Bagging predictors”, Machine Learning, 24(2), 123-140, 1996.
T. Ho, “The random subspace method for constructing decision forests”, Pattern Analysis and Machine Intelligence, 20(8), 832-844, 1998.
G. Louppe and P. Geurts, “Ensembles on Random Patches”, Machine Learning and Knowledge Discovery in Databases, 346-361, 2012.
1.11.4.Voting Classifier#
The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.
1.11.4.1.Majority Class Labels (Majority/Hard Voting)#
In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier.
E.g., if the prediction for a given sample is
classifier 1 -> class 1
classifier 2 -> class 1
classifier 3 -> class 2
the VotingClassifier (with voting='hard') would classify the sample as “class 1” based on the majority class label.
In the case of a tie, the VotingClassifier will select the class based on the ascending sort order. E.g., in the following scenario
classifier 1 -> class 2
classifier 2 -> class 1
the class label 1 will be assigned to the sample.
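The majority rule and its tie-breaking behavior can be illustrated with plain NumPy. This is a sketch of the rule as described above, not scikit-learn's code; `hard_vote` is a hypothetical helper and assumes class labels are non-negative integers.

```python
import numpy as np

def hard_vote(predicted_labels):
    """Return the most frequent label; ties go to the smallest label."""
    counts = np.bincount(predicted_labels)
    # np.argmax returns the first maximum, i.e. the lowest label on a tie.
    return int(np.argmax(counts))

majority = hard_vote([1, 1, 2])  # two votes for class 1, one for class 2
tie = hard_vote([2, 1])          # one vote each: ascending order wins
print(majority, tie)
```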
1.11.4.2.Usage#
The following example shows how to fit the majority rule classifier:
>>> from sklearn import datasets
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import VotingClassifier
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, 1:3], iris.target
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> eclf = VotingClassifier(
...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...     voting='hard')
>>> for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
...     scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
...     print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.95 (+/- 0.04) [Logistic Regression]
Accuracy: 0.94 (+/- 0.04) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [naive Bayes]
Accuracy: 0.95 (+/- 0.04) [Ensemble]
1.11.4.3.Weighted Average Probabilities (Soft Voting)#
In contrast to majority voting (hard voting), soft voting returns the class label as argmax of the sum of predicted probabilities.
Specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged. The final class label is then derived from the class label with the highest average probability.
To illustrate this with a simple example, let’s assume we have 3 classifiers and a 3-class classification problem where we assign equal weights to all classifiers: w1=1, w2=1, w3=1.
The weighted average probabilities for a sample would then be calculated as follows:
| classifier | class 1 | class 2 | class 3 |
|---|---|---|---|
| classifier 1 | w1 * 0.2 | w1 * 0.5 | w1 * 0.3 |
| classifier 2 | w2 * 0.6 | w2 * 0.3 | w2 * 0.1 |
| classifier 3 | w3 * 0.3 | w3 * 0.4 | w3 * 0.3 |
| weighted average | 0.37 | 0.4 | 0.23 |
Here, the predicted class label is 2, since it has the highest average predicted probability. See the example on Visualizing the probabilistic predictions of a VotingClassifier for a demonstration of how the predicted class label can be obtained from the weighted average of predicted probabilities.
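The arithmetic in the table can be reproduced with a few lines of NumPy. This is only a sketch of the weighted average using the probabilities from the table, not the VotingClassifier implementation.

```python
import numpy as np

# Rows: classifiers 1-3; columns: predicted probabilities for classes 1-3.
probas = np.array([
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])
weights = np.array([1, 1, 1])  # w1, w2, w3

# Weighted average over classifiers, one value per class.
weighted_average = np.average(probas, axis=0, weights=weights)

# Classes are numbered from 1 in the table, hence the +1.
predicted_class = int(np.argmax(weighted_average)) + 1
print(weighted_average.round(2), predicted_class)
```

Changing `weights` (for example to `[2, 5, 1]`) shows how a heavily weighted classifier can shift the predicted class.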
The following figure illustrates how the decision regions may change when a soft VotingClassifier is trained with weights on three linear models:

1.11.4.4.Usage#
In order to predict the class labels based on the predicted class-probabilities (scikit-learn estimators in the VotingClassifier must support the predict_proba method):
>>> eclf = VotingClassifier(
...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...     voting='soft'
... )
Optionally, weights can be provided for the individual classifiers:
>>> eclf = VotingClassifier(
...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...     voting='soft', weights=[2, 5, 1]
... )
Using the VotingClassifier with GridSearchCV#
The VotingClassifier can also be used together with GridSearchCV in order to tune the hyperparameters of the individual estimators:
>>> from sklearn.model_selection import GridSearchCV
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(random_state=1)
>>> clf3 = GaussianNB()
>>> eclf = VotingClassifier(
...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...     voting='soft'
... )
>>> params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}
>>> grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
>>> grid = grid.fit(iris.data, iris.target)
1.11.5.Voting Regressor#
The idea behind the VotingRegressor is to combine conceptually different machine learning regressors and return the average predicted values. Such a regressor can be useful for a set of equally well performing models in order to balance out their individual weaknesses.
1.11.5.1.Usage#
The following example shows how to fit the VotingRegressor:
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import VotingRegressor
>>> # Loading some example data
>>> X, y = load_diabetes(return_X_y=True)
>>> # Training regressors
>>> reg1 = GradientBoostingRegressor(random_state=1)
>>> reg2 = RandomForestRegressor(random_state=1)
>>> reg3 = LinearRegression()
>>> ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)])
>>> ereg = ereg.fit(X, y)
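To make the averaging explicit, the following sketch compares the ensemble prediction with the mean of the fitted sub-estimators' predictions. The choice of two regressors and n_estimators=20 is an assumption made to keep the example small.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

ereg = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, random_state=1)),
        ("lr", LinearRegression()),
    ]
).fit(X, y)

# Predictions of each fitted sub-estimator, one column per regressor.
individual = np.column_stack([est.predict(X[:5]) for est in ereg.estimators_])

# With no weights given, the ensemble prediction is the plain mean.
manual_average = individual.mean(axis=1)
print(np.allclose(manual_average, ereg.predict(X[:5])))
```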

1.11.6.Stacked generalization#
Stacked generalization is a method for combining estimators to reduce their biases [W1992] [HTF]. More precisely, the predictions of each individual estimator are stacked together and used as input to a final estimator to compute the prediction. This final estimator is trained through cross-validation.
The StackingClassifier and StackingRegressor provide such strategies which can be applied to classification and regression problems.
The estimators parameter corresponds to the list of the estimators which are stacked together in parallel on the input data. It should be given as a list of names and estimators:
>>> from sklearn.linear_model import RidgeCV, LassoCV
>>> from sklearn.neighbors import KNeighborsRegressor
>>> estimators = [('ridge', RidgeCV()),
...               ('lasso', LassoCV(random_state=42)),
...               ('knr', KNeighborsRegressor(n_neighbors=20,
...                                           metric='euclidean'))]
The final_estimator will use the predictions of the estimators as input. It needs to be a classifier or a regressor when using StackingClassifier or StackingRegressor, respectively:
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.ensemble import StackingRegressor
>>> final_estimator = GradientBoostingRegressor(
...     n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,
...     random_state=42)
>>> reg = StackingRegressor(
...     estimators=estimators,
...     final_estimator=final_estimator)
To train the estimators and final_estimator, the fit method needs to be called on the training data:
>>> from sklearn.datasets import load_diabetes
>>> X, y = load_diabetes(return_X_y=True)
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42)
>>> reg.fit(X_train, y_train)
StackingRegressor(...)
During training, the estimators are fitted on the whole training data X_train. They will be used when calling predict or predict_proba. To generalize and avoid over-fitting, the final_estimator is trained on out-samples using sklearn.model_selection.cross_val_predict internally.
For StackingClassifier, note that the output of the estimators is controlled by the parameter stack_method and it is called by each estimator. This parameter is either a string, being an estimator method name, or 'auto', which will automatically identify an available method depending on the availability, tested in the order of preference: predict_proba, decision_function and predict.
A StackingRegressor and StackingClassifier can be used like any other regressor or classifier, exposing a predict, predict_proba, or decision_function method, e.g.:
>>> y_pred = reg.predict(X_test)
>>> from sklearn.metrics import r2_score
>>> print('R2 score: {:.2f}'.format(r2_score(y_test, y_pred)))
R2 score: 0.53
Note that it is also possible to get the output of the stacked estimators using the transform method:
>>> reg.transform(X_test[:5])
array([[142, 138, 146],
       [179, 182, 151],
       [139, 132, 158],
       [286, 292, 225],
       [126, 124, 164]])
In practice, a stacking predictor predicts as well as the best predictor of the base layer and sometimes even outperforms it by combining the different strengths of these predictors. However, training a stacking predictor is computationally expensive.
Note
For StackingClassifier, when using stack_method_='predict_proba', the first column is dropped when the problem is a binary classification problem. Indeed, both probability columns predicted by each estimator are perfectly collinear.
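The dropped column can be observed through transform, as in this sketch. The binary synthetic dataset and the choice of base estimators are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Binary classification problem (make_classification default: 2 classes).
X, y = make_classification(n_samples=200, random_state=0)

clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
        ("gnb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
).fit(X, y)

# Each base estimator contributes one column instead of two, because
# the first probability column is dropped in the binary case.
stacked = clf.transform(X[:5])
print(clf.stack_method_, stacked.shape)
```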
Note
Multiple stacking layers can be achieved by assigning final_estimator to a StackingClassifier or StackingRegressor:
>>> from sklearn.ensemble import RandomForestRegressor
>>> final_layer_rfr = RandomForestRegressor(
...     n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
...     n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
>>> final_layer = StackingRegressor(
...     estimators=[('rf', final_layer_rfr),
...                 ('gbrt', final_layer_gbr)],
...     final_estimator=RidgeCV()
... )
>>> multi_layer_regressor = StackingRegressor(
...     estimators=[('ridge', RidgeCV()),
...                 ('lasso', LassoCV(random_state=42)),
...                 ('knr', KNeighborsRegressor(n_neighbors=20,
...                                             metric='euclidean'))],
...     final_estimator=final_layer
... )
>>> multi_layer_regressor.fit(X_train, y_train)
StackingRegressor(...)
>>> print('R2 score: {:.2f}'
...       .format(multi_layer_regressor.score(X_test, y_test)))
R2 score: 0.53
References
Wolpert, David H. “Stacked generalization.” Neural networks 5.2 (1992): 241-259.
1.11.7.AdaBoost#
The module sklearn.ensemble includes the popular boosting algorithm AdaBoost, introduced in 1995 by Freund and Schapire [FS1995].
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights \(w_1\), \(w_2\), …, \(w_N\) to each of the training samples. Initially, those weights are all set to \(w_i = 1/N\), so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence [HTF].

AdaBoost can be used both for classification and regression problems:
For multi-class classification, AdaBoostClassifier implements AdaBoost.SAMME [ZZRH2009].
For regression, AdaBoostRegressor implements AdaBoost.R2 [D1997].
1.11.7.1.Usage#
The following example shows how to fit an AdaBoost classifier with 100 weak learners:
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import AdaBoostClassifier
>>> X, y = load_iris(return_X_y=True)
>>> clf = AdaBoostClassifier(n_estimators=100)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores.mean()
np.float64(0.95)
The number of weak learners is controlled by the parameter n_estimators. The learning_rate parameter controls the contribution of the weak learners in the final combination. By default, weak learners are decision stumps. Different weak learners can be specified through the estimator parameter. The main parameters to tune to obtain good results are n_estimators and the complexity of the base estimators (e.g., its depth max_depth or minimum required number of samples to consider a split min_samples_split).
Examples
Multi-class AdaBoosted Decision Trees shows the performance of AdaBoost on a multi-class problem.
Two-class AdaBoost shows the decision boundary and decision function values for a non-linearly separable two-class problem using AdaBoost-SAMME.
Decision Tree Regression with AdaBoost demonstrates regression with the AdaBoost.R2 algorithm.
References
