Random Forests(TM) in XGBoost

XGBoost is normally used to train gradient-boosted decision trees and other gradient-boosted models. Random Forests use the same model representation and inference as gradient-boosted decision trees, but a different training algorithm. One can use XGBoost to train a standalone random forest, or use a random forest as a base model for gradient boosting. Here we focus on training a standalone random forest.

We have native APIs for training random forests since the early days, and a new Scikit-Learn wrapper after 0.82 (not included in 0.82). Please note that the new Scikit-Learn wrapper is still experimental, which means we might change the interface whenever needed.

Standalone Random Forest With XGBoost API

The following parameters must be set to enable random forest training.

  • booster should be set to gbtree, as we are training forests. Note that as this is the default, this parameter needn’t be set explicitly.

  • subsample must be set to a value less than 1 to enable random selection of training cases (rows).

  • One of the colsample_by* parameters must be set to a value less than 1 to enable random selection of columns. Normally, colsample_bynode would be set to a value less than 1 to randomly sample columns at each tree split.

  • num_parallel_tree should be set to the size of the forest being trained.

  • num_boost_round should be set to 1 to prevent XGBoost from boosting multiple random forests. Note that this is a keyword argument to train(), and is not part of the parameter dictionary.

  • eta (alias: learning_rate) must be set to 1 when training random forest regression.

  • random_state can be used to seed the random number generator.

Other parameters should be set in a similar way they are set for gradient boosting. For instance, objective will typically be reg:squarederror for regression and binary:logistic for classification, lambda should be set according to a desired regularization weight, etc.

If both num_parallel_tree and num_boost_round are greater than 1, training will use a combination of random forest and gradient boosting strategy. It will perform num_boost_round rounds, boosting a random forest of num_parallel_tree trees at each round. If early stopping is not enabled, the final model will consist of num_parallel_tree * num_boost_round trees.

Here is a sample parameter dictionary for training a random forest on a GPU using xgboost:

params = {
    "colsample_bynode": 0.8,
    "learning_rate": 1,
    "max_depth": 5,
    "num_parallel_tree": 100,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "tree_method": "hist",
    "device": "cuda",
}

A random forest model can then be trained as follows:

bst = train(params, dmatrix, num_boost_round=1)

Standalone Random Forest With Scikit-Learn-Like API

XGBRFClassifier and XGBRFRegressor are SKL-like classes that provide random forest functionality. They are basically versions of XGBClassifier and XGBRegressor that train random forest instead of gradient boosting, and have default values and meaning of some of the parameters adjusted accordingly. In particular:

  • n_estimators specifies the size of the forest to be trained; it is converted to num_parallel_tree, instead of the number of boosting rounds

  • learning_rate is set to 1 by default

  • colsample_bynode and subsample are set to 0.8 by default

  • booster is always gbtree

For a simple example, you can train a random forest regressor with:

from sklearn.model_selection import KFold

# Your code ...

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X, y):
    xgb_model = xgb.XGBRFRegressor(random_state=42).fit(
        X[train_index], y[train_index]
    )

Note that these classes have a smaller selection of parameters compared to using train(). In particular, it is impossible to combine random forests with gradient boosting using this API.

Caveats

  • XGBoost uses a second-order approximation to the objective function. This can lead to results that differ from a random forest implementation that uses the exact value of the objective function.

  • XGBoost does not perform replacement when subsampling training cases. Each training case can occur in a subsampled set either 0 or 1 time.