Using the Scikit-Learn Estimator Interface
Overview
In addition to the native interface, XGBoost features a sklearn estimator interface that conforms to the sklearn estimator guideline. It supports regression, classification, and learning to rank. Survival training for the sklearn estimator interface is still a work in progress.
You can find some quick start examples at Collection of examples for using sklearn interface. The main advantage of using the sklearn interface is that it works with most of the utilities provided by sklearn, like sklearn.model_selection.cross_validate(). In addition, many other libraries recognize the sklearn estimator interface thanks to its popularity.
With the sklearn estimator interface, we can train a classification model with only a couple of lines of Python code. Here's an example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=94)
# Use "hist" for constructing the trees, with early stopping enabled.
clf = xgb.XGBClassifier(tree_method="hist", early_stopping_rounds=2)
# Fit the model, test sets are used for early stopping.
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])
# Save model into JSON format.
clf.save_model("clf.json")
```
The tree_method parameter specifies the method to use for constructing the trees, and the early_stopping_rounds parameter enables early stopping. Early stopping can help prevent overfitting and save time during training.
Early Stopping
As demonstrated in the previous example, early stopping can be enabled by the early_stopping_rounds parameter. Alternatively, there's a callback function, xgboost.callback.EarlyStopping, that can be used to specify more details about the behavior of early stopping, including whether XGBoost should return the best model instead of the full stack of trees:
```python
early_stop = xgb.callback.EarlyStopping(
    rounds=2, metric_name="logloss", data_name="validation_0", save_best=True
)
clf = xgb.XGBClassifier(tree_method="hist", callbacks=[early_stop])
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])
```
At present, XGBoost doesn't implement data splitting logic within the estimator and relies on the eval_set parameter of the xgboost.XGBModel.fit() method. If you want to use early stopping to prevent overfitting, you'll need to manually split your data into training and testing sets using the sklearn.model_selection.train_test_split() function from sklearn. Some other machine learning algorithms, like those in sklearn, include early stopping as part of the estimator and may work with cross validation. However, using early stopping during cross validation may not be a perfect approach because it changes the model's number of trees for each validation fold, leading to different models. A better approach is to retrain the model after cross validation using the best hyperparameters along with early stopping. If you want to experiment with the idea of using cross validation with early stopping, here is a snippet to begin with:
```python
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)


def fit_and_score(estimator, X_train, X_test, y_train, y_test):
    """Fit the estimator on the train set and score it on both sets."""
    estimator.fit(X_train, y_train, eval_set=[(X_test, y_test)])

    train_score = estimator.score(X_train, y_train)
    test_score = estimator.score(X_test, y_test)

    return estimator, train_score, test_score


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=94)
clf = xgb.XGBClassifier(tree_method="hist", early_stopping_rounds=3)

results = {}

for train, test in cv.split(X, y):
    X_train = X[train]
    X_test = X[test]
    y_train = y[train]
    y_test = y[test]
    est, train_score, test_score = fit_and_score(
        clone(clf), X_train, X_test, y_train, y_test
    )
    results[est] = (train_score, test_score)
```
Obtaining the native booster object
The sklearn estimator interface primarily facilitates training and doesn't implement all the features available in XGBoost. For instance, in order to have cached predictions, xgboost.DMatrix needs to be used with xgboost.Booster.predict(). One can obtain the booster object from the sklearn interface using xgboost.XGBModel.get_booster():
```python
booster = clf.get_booster()
print(booster.num_boosted_rounds())
```
Prediction
When early stopping is enabled, prediction functions, including the xgboost.XGBModel.predict(), xgboost.XGBModel.score(), and xgboost.XGBModel.apply() methods, will use the best model automatically; that is, xgboost.XGBModel.best_iteration is used to specify the range of trees used in prediction.
To have cached results for incremental prediction, please use thexgboost.Booster.predict() method instead.
Number of parallel threads
When working with XGBoost and other sklearn tools, you can specify the number of threads to use with the n_jobs parameter. By default, XGBoost uses all the available threads on your machine, which can lead to some interesting consequences when combined with other sklearn functions like sklearn.model_selection.cross_validate(). If both XGBoost and sklearn try to use all threads, your machine may slow down significantly due to thread contention (sometimes called "thread thrashing"). To avoid this, you can simply set the n_jobs parameter for XGBoost to None (which uses all threads) and the n_jobs parameter for sklearn to 1. This way, the two libraries can work together smoothly without oversubscribing the CPU.