Categorical Data

Since version 1.5, XGBoost has support for categorical data. For numerical data, the split condition is defined as \(value < threshold\), while for categorical data the split is defined depending on whether partitioning or one-hot encoding is used. For partition-based splits, the splits are specified as \(value \in categories\), where categories is the set of categories in one feature. If one-hot encoding is used instead, then the split is defined as \(value == category\). More advanced categorical split strategies are planned for future releases, and this tutorial details how to inform XGBoost about the data type.
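The three split conditions can be sketched in plain Python. This is only an illustration of the predicates described above; `goes_left` is a hypothetical helper, not XGBoost internals:

```python
# Illustrative sketch of the three split conditions (hypothetical helper,
# not XGBoost internals): numerical, partition-based, and one-hot.

def goes_left(value, split):
    kind, arg = split
    if kind == "numerical":   # split on: value < threshold
        return value < arg
    if kind == "partition":   # split on: value in categories
        return value in arg
    if kind == "onehot":      # split on: value == category
        return value == arg
    raise ValueError(f"unknown split kind: {kind}")

# A numerical split compares against a threshold, while the two
# categorical variants test set membership or equality.
print(goes_left(3.2, ("numerical", 5.0)))                       # True: 3.2 < 5.0
print(goes_left("blues", ("partition", {"blues", "country"})))  # True: in the set
print(goes_left("indie", ("onehot", "acoustic")))               # False: not equal
```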

Training with scikit-learn Interface

The easiest way to pass categorical data into XGBoost is using a dataframe and the scikit-learn interface like XGBClassifier. For preparing the data, users need to specify the data type of the input predictor as category. For a pandas/cuDF Dataframe, this can be achieved by

X["cat_feature"].astype("category")

for all columns that represent categorical features. After that, users can tell XGBoost to enable training with categorical data. Assuming that you are using the XGBClassifier for a classification problem, specify the parameter enable_categorical:

# Supported tree methods are `approx` and `hist`.
clf = xgb.XGBClassifier(tree_method="hist", enable_categorical=True, device="cuda")
# X is the dataframe we created in previous snippet
clf.fit(X, y)
# Must use JSON/UBJSON for serialization, otherwise the information is lost.
clf.save_model("categorical-model.json")

Once training is finished, most of the other features can utilize the model. For instance, one can plot the model and calculate the global feature importance:

# Get a graph
graph = xgb.to_graphviz(clf, num_trees=1)
# Or get a matplotlib axis
ax = xgb.plot_tree(clf, num_trees=1)
# Get feature importances
clf.feature_importances_

The scikit-learn interface from Dask is similar to the single-node version. The basic idea is to create a dataframe with a category feature type, and tell XGBoost to use it by setting the enable_categorical parameter. See Getting started with categorical data for a worked example of using categorical data with the scikit-learn interface with one-hot encoding. A comparison between using one-hot encoded data and XGBoost's categorical data support can be found in Train XGBoost with cat_in_the_dat dataset.

Added in version 3.0: Support for the R package using factor.

Optimal Partitioning

Added in version 1.6.

Optimal partitioning is a technique for partitioning the categorical predictors for each node split; the proof of optimality for numerical output was first introduced by [1]. The algorithm is used in decision trees [2]; later LightGBM [3] brought it to the context of gradient boosting trees, and it is now also adopted in XGBoost as an optional feature for handling categorical splits. More specifically, the proof by Fisher [1] states that, when trying to partition a set of discrete values into groups based on the distances between a measure of these values, one only needs to look at sorted partitions instead of enumerating all possible permutations. In the context of decision trees, the discrete values are categories, and the measure is the output leaf value. Intuitively, we want to group the categories that output similar leaf values. During split finding, we first sort the gradient histogram to prepare the contiguous partitions, then enumerate the splits according to these sorted values. One of the related parameters for XGBoost is max_cat_to_onehot, which controls whether one-hot encoding or partitioning should be used for each feature; see Parameters for Categorical Feature for details.
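The sorted-partition idea can be sketched in plain Python. This is only an illustration of Fisher's observation with a simplified gain formula; `best_partition` is a hypothetical helper, not XGBoost's actual implementation:

```python
# Illustrative sketch (not XGBoost's implementation): sort categories by
# their gradient statistics, then scan only contiguous splits of that order
# instead of enumerating all 2^k category subsets.

def best_partition(grad_sum, hess_sum, lam=1.0):
    """grad_sum/hess_sum: per-category sums of gradients/hessians."""
    # Sort category ids by gradient/hessian ratio (a proxy for the leaf value).
    order = sorted(range(len(grad_sum)), key=lambda c: grad_sum[c] / hess_sum[c])

    def gain(g, h):
        # Simplified split score, in the spirit of g^2 / (h + lambda).
        return g * g / (h + lam)

    total_g, total_h = sum(grad_sum), sum(hess_sum)
    best_score, best_left = float("-inf"), []
    g_left = h_left = 0.0
    # Enumerate contiguous prefixes of the sorted categories.
    for i in range(len(order) - 1):
        c = order[i]
        g_left += grad_sum[c]
        h_left += hess_sum[c]
        score = gain(g_left, h_left) + gain(total_g - g_left, total_h - h_left)
        if score > best_score:
            best_score, best_left = score, order[: i + 1]
    return set(best_left), best_score

# Categories 0 and 2 have similar (negative) gradient sums, so the scan
# groups them together on one side of the split.
left, score = best_partition([-4.0, 3.0, -3.5, 2.5], [2.0, 2.0, 2.0, 2.0])
print(left)  # {0, 2}
```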

Using native interface

The scikit-learn interface is user friendly, but lacks some features that are only available in the native interface. For instance, users cannot compute SHAP values directly. Also, the native interface supports more data types. To use the native interface with categorical data, we need to pass a similar parameter to DMatrix or QuantileDMatrix and the train function. For dataframe input:

# X is a dataframe we created in previous snippet
Xy = xgb.DMatrix(X, y, enable_categorical=True)
booster = xgb.train({"tree_method": "hist", "max_cat_to_onehot": 5}, Xy)
# Must use JSON for serialization, otherwise the information is lost
booster.save_model("categorical-model.json")

SHAP value computation:

SHAP = booster.predict(Xy, pred_interactions=True)
# categorical features are listed as "c"
print(booster.feature_types)

For other types of input, like numpy array, we can tell XGBoost about the feature types by using the feature_types parameter in DMatrix:

# "q" is numerical feature, while "c" is categorical featureft=["q","c","c"]X:np.ndarray=load_my_data()assertX.shape[1]==3Xy=xgb.DMatrix(X,y,feature_types=ft,enable_categorical=True)

For numerical data, the feature type can be "q" or "float", while for categorical features it's specified as "c". The Dask module in XGBoost has the same interface, so dask.Array can also be used for categorical data. Lastly, the sklearn interface XGBRegressor has the same parameter.

Auto-recoding (Data Consistency)

Changed in version 3.1: Starting with XGBoost 3.1, the Python interface can perform automatic re-coding for new inputs.

XGBoost accepts parameters to indicate which feature is considered categorical, either through the dtypes of a dataframe or through the feature_types parameter. However, except for the Python interface, XGBoost doesn't store the information about how categories are encoded in the first place. For instance, given an encoding schema that maps music genres to integer codes:

{"acoustic":0,"indie":1,"blues":2,"country":3}

Aside from the Python interface (R/Java/C, etc.), XGBoost doesn't know this mapping from the input and hence cannot store it in the model. The mapping usually happens in the users' data engineering pipeline. To ensure the correct result from XGBoost, users need to keep the pipeline for transforming data consistent across training and testing data.
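The consistency requirement can be illustrated with a plain-Python sketch. The `fit_encoding`/`encode` helpers here are hypothetical, standing in for a user's data engineering pipeline, not part of XGBoost:

```python
# Hypothetical stand-in for a user's encoding pipeline (not part of XGBoost):
# the same category -> code mapping must be applied to training and test data.

def fit_encoding(values):
    # Assign integer codes in first-seen order.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes

def encode(values, codes):
    # Re-use the mapping fitted on the training data.
    return [codes[v] for v in values]

train = ["acoustic", "indie", "blues", "country"]
codes = fit_encoding(train)
print(codes)  # {'acoustic': 0, 'indie': 1, 'blues': 2, 'country': 3}

test = ["country", "blues"]
print(encode(test, codes))  # [3, 2], consistent with the training codes

# Fitting a fresh encoding on the test data alone would silently produce
# different codes, and hence invalid inputs at prediction time:
print(fit_encoding(test))  # {'country': 0, 'blues': 1}
```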

Starting with 3.1, the Python interface can remember the encoding and perform recoding during inference and training continuation when the input is a dataframe (pandas, cuDF, polars, pyarrow, modin). The feature support focuses on basic usage. It has some restrictions on the types of inputs that can be accepted. First, category names must have one of the following types:

  • string

  • integer, from 8-bit to 64-bit; both signed and unsigned are supported.

  • 32-bit or 64-bit floating point

Other category types are not supported. Second, the input types must be strictly consistent. For example, XGBoost will raise an error if the categorical columns in the training set are unsigned integers whereas the test dataset has signed integer columns. If you have categories that are not one of the supported types, you need to perform the re-coding using a pre-processing data transformer like the sklearn.preprocessing.OrdinalEncoder. See Feature engineering pipeline for categorical data for a worked example using an ordinal encoder. To clarify, the type here refers to the type of the name of categories (called Index in pandas):

# string type
{"acoustic": 0, "indie": 1, "blues": 2, "country": 3}
# integer type
{-1: 0, 1: 1, 3: 2, 7: 3}
# depending on the dataframe implementation, it can be signed or unsigned.
{5: 0, 1: 1, 3: 2, 7: 3}
# floating point type, both 32-bit and 64-bit are supported.
{-1.0: 0, 1.0: 1, 3.0: 2, 7.0: 3}

Internally, XGBoost attempts to extract the categories from the dataframe inputs. For inference (predict), the re-coding happens on the fly and there's no data copy (barring some internal transformations performed by the dataframe itself). For training continuation however, re-coding requires some extra steps if you are using the native interface. The sklearn interface and the Dask interface can handle training continuation automatically. Lastly, please note that using the re-coder with the native interface is still experimental. It's ready for testing, but we want to observe the feature usage for a period of time and might make some breaking changes if needed. The following is a snippet of using the native interface:

import pandas as pd

X = pd.DataFrame()
Xy = xgboost.QuantileDMatrix(X, y, enable_categorical=True)
booster = xgboost.train({}, Xy)

# XGBoost can handle re-coding for inference without user intervention
X_new = pd.DataFrame()
booster.inplace_predict(X_new)

# Get categories saved in the model for training continuation
categories = booster.get_categories()
# Use saved categories as a reference for re-coding.
# Training continuation requires a re-coded DMatrix, pass the categories as feature_types
Xy_new = xgboost.QuantileDMatrix(
    X_new, y_new, feature_types=categories, enable_categorical=True, ref=Xy
)
booster_1 = xgboost.train({}, Xy_new, xgb_model=booster)

No extra step is required for using the scikit-learn interface as long as the inputs are dataframes. During training continuation, XGBoost will either extract the categories from the previous model or use the categories from the new training dataset if the input model doesn't have the information. As a side note, users can inspect the content of the categories by exporting it to arrow arrays. This interface is still experimental:

categories = booster.get_categories(export_to_arrow=True)
print(categories.to_arrow())

For R, the auto-recoding is not yet supported as of 3.1. To provide an example:

> f0 = factor(c("a", "b", "c"))
> as.numeric(f0)
[1] 1 2 3
> f0
[1] a b c
Levels: a b c

In the above snippet, we have the mapping: a -> 1, b -> 2, c -> 3. Assuming the above is the training data, and the next snippet is the test data:

> f1 = factor(c("a", "c"))
> as.numeric(f1)
[1] 1 2
> f1
[1] a c
Levels: a c

Now, we have a -> 1, c -> 2 because b is missing, and the R factor encodes the data differently, resulting in invalid test-time encoding. XGBoost cannot remember the original encoding for the R package. You will have to encode the data explicitly during inference:

> f1 = factor(c("a", "c"), levels = c("a", "b", "c"))
> f1
[1] a c
Levels: a b c
> as.numeric(f1)
[1] 1 3

Miscellaneous

By default, XGBoost assumes input category codes are integers starting from 0 up to the number of categories, i.e. \([0, n\_categories)\). However, users might provide inputs with invalid values due to mistakes or missing values in the training dataset. These can be negative values, integer values that cannot be accurately represented by 32-bit floating point, or values that are larger than the actual number of unique categories. During training this is validated, but for prediction it's treated the same as a not-chosen category for performance reasons.
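The valid-code range can be illustrated with a small sketch. The `is_valid_code` helper is hypothetical and simplified (for instance, it does not check 32-bit floating point representability, which XGBoost also requires):

```python
# Illustrative check (hypothetical helper, not XGBoost's API): valid
# category codes are integers in the half-open range [0, n_categories).

def is_valid_code(code, n_categories):
    return float(code).is_integer() and 0 <= code < n_categories

print(is_valid_code(2, 4))    # True
print(is_valid_code(-1, 4))   # False: negative value
print(is_valid_code(4, 4))    # False: >= number of unique categories
print(is_valid_code(2.5, 4))  # False: not an integer code
```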

References

[1] Walter D. Fisher. “On Grouping for Maximum Homogeneity”. Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.

[2] Trevor Hastie, Robert Tibshirani, Jerome Friedman. “The Elements of Statistical Learning”. Springer Series in Statistics Springer New York Inc. (2001).

[3] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.