Introduction to Model IO
Since 2.1.0, the default model format for XGBoost is the UBJSON format. This option is enabled for serializing models to file, serializing models to buffer, and for memory snapshots (pickle and the like).
JSON and UBJSON have the same document structure with different representations, and we will refer to them collectively as the JSON format. This tutorial aims to share some basic insights into the JSON serialisation method used in XGBoost. Unless explicitly mentioned otherwise, the following sections assume you are using one of the two output formats, which can be enabled by providing a file name with .json (or .ubj for binary JSON) as the file extension when saving/loading the model: booster.save_model('model.json'). More details below.
Before we get started, XGBoost is a gradient boosting library with a focus on tree models, which means that inside XGBoost there are 2 distinct parts:
The model consisting of trees and
Hyperparameters and configurations used for building the model.
If you come from the Deep Learning community, then it should be clear to you that there are differences between the neural network structures composed of weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.
So when one calls booster.save_model (xgb.save in R), XGBoost saves the trees, some model parameters like the number of input columns in trained trees, and the objective function, which combined represent the concept of "model" in XGBoost. As for why we are saving the objective as part of the model, that's because the objective controls the transformation of the global bias (called base_score or the intercept in XGBoost) and task-specific information. Users can share this model with others for inference, evaluation, or to continue the training with a different set of hyper-parameters, etc.
However, this is not the end of the story. There are cases where we need to save something more than just the model itself. For example, in distributed training, XGBoost performs checkpointing operations. Or, for some reason, your favorite distributed computing framework decides to copy the model from one worker to another and continue the training there. In such cases, the serialisation output is required to contain enough information to continue the previous training without the user providing any parameters again. We consider such a scenario a memory snapshot (or memory based serialisation method) and distinguish it from the normal model IO operation. Currently, memory snapshot is used in the following places:
Python package: when the Booster object is pickled with the built-in pickle module.
R package: when the xgb.Booster object is persisted with the built-in functions saveRDS or save.
JVM packages: when the Booster object is serialized with the built-in function saveModel.
To enable JSON format support for model IO (saving only the trees and objective), provide a file name with .json or .ubj as the file extension, the latter being the extension for Universal Binary JSON:
bst.save_model('model_file_name.json')
xgb.save(bst, 'model_file_name.json')
val format = "json"  // or val format = "ubj"
model.write.option("format", format).save("model_directory_path")
Note
Only load models from JSON files that were produced by XGBoost. Attempting to load JSON files that were produced by an external source may lead to undefined behaviors and crashes.
When loading the model back, XGBoost recognizes the file extensions .json and .ubj, and can dispatch accordingly. If the extension is not specified, XGBoost tries to guess the right one.
A note on backward compatibility of models and memory snapshots
We guarantee backward compatibility for models but not for memory snapshots.
Models (trees and objective) use a stable representation, so that models produced in earlier versions of XGBoost are accessible in later versions of XGBoost. If you'd like to store or archive your model for long-term storage, use save_model (Python) and xgb.save (R).
On the other hand, memory snapshot (serialisation) captures many things internal to XGBoost, and its format is not stable and is subject to frequent changes. Therefore, memory snapshot is suitable for checkpointing only, where you persist the complete snapshot of the training configuration so that you can recover robustly from possible failures and resume the training process. Loading a memory snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors. If a model is persisted with pickle.dump (Python) or saveRDS (R), then the model may not be accessible in later versions of XGBoost.
Custom objective and metric
XGBoost accepts user-provided objective and metric functions as an extension. These functions are not saved in the model file, as they are language-dependent features. With Python, users can pickle the model to include these functions in the saved binary. One drawback is that the output from pickle is not a stable serialization format and doesn't work across different Python versions or XGBoost versions, not to mention different language environments. Another way to work around this limitation is to provide these functions again after the model is loaded. If the customized function is useful, please consider making a PR to implement it inside XGBoost; this way we can have your functions working with different language bindings.
Loading pickled file from different version of XGBoost
As noted, a pickled model is neither portable nor stable, but in some cases pickled models are valuable. One way to restore them in the future is to load them back with that specific version of Python and XGBoost, then export the model by calling save_model.
A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are able to install an older version of XGBoost using the remotes package:
library(remotes)
remotes::install_version("xgboost", "0.90.0.1")  # Install version 0.90.0.1
Once the desired version is installed, you can load the RDS file with readRDS and recover the xgb.Booster object. Then call xgb.save to export the model using the stable representation. Now you should be able to use the model in the latest version of XGBoost.
Saving and Loading the internal parameters configuration
XGBoost's C API, Python API and R API support saving and loading the internal configuration directly as a JSON string. In the Python package:
bst = xgboost.train(...)
config = bst.save_config()
print(config)
or in R:
config <- xgb.config(bst)
print(config)
This will print out something similar to (not actual output as it's too long for demonstration):
{
  "Learner": {
    "generic_parameter": {
      "device": "cuda:0",
      "gpu_page_size": "0",
      "n_jobs": "0",
      "random_state": "0",
      "seed": "0",
      "seed_per_iteration": "0"
    },
    "gradient_booster": {
      "gbtree_train_param": {
        "num_parallel_tree": "1",
        "process_type": "default",
        "tree_method": "hist",
        "updater": "grow_gpu_hist",
        "updater_seq": "grow_gpu_hist"
      },
      "name": "gbtree",
      "updater": {
        "grow_gpu_hist": {
          "gpu_hist_train_param": {
            "debug_synchronize": "0"
          },
          "train_param": {
            "alpha": "0",
            "cache_opt": "1",
            "colsample_bylevel": "1",
            "colsample_bynode": "1",
            "colsample_bytree": "1",
            "default_direction": "learn",
            ...
            "subsample": "1"
          }
        }
      }
    },
    "learner_train_param": {
      "booster": "gbtree",
      "disable_default_eval_metric": "0",
      "objective": "reg:squarederror"
    },
    "metrics": [],
    "objective": {
      "name": "reg:squarederror",
      "reg_loss_param": {
        "scale_pos_weight": "1"
      }
    }
  },
  "version": [1, 0, 0]
}
You can load it back into a model generated by the same version of XGBoost by:
bst.load_config(config)
This way users can study the internal representation more closely. Please note that some JSON generators make use of locale-dependent floating point serialization methods, which are not supported by XGBoost.
Difference between saving model and dumping model
XGBoost has a function called dump_model in the Booster class, which lets you export the model in a readable format like text, json or dot (graphviz). The primary use case for it is model interpretation and visualization, and the output is not supposed to be loaded back into XGBoost.
Categories
Since 3.1, the categories encoding from a training dataframe is stored in the booster to provide test-time re-coding support; see Auto-recoding (Data Consistency) for more info about how the re-coder works. We will briefly explain the JSON format for the serialized category index.
The categories are saved in a JSON object named "cats" under the gbtree model. It contains three keys:
feature_segments
This is a CSR-like pointer that stores the number of categories for each feature. It starts with zero and ends with the total number of categories from all features. For example:
feature_segments = [0, 3, 3, 5]
The feature_segments list represents a dataset with two categorical features and one numerical feature. The first feature contains three categories, the second feature is numerical and thus has no categories, and the last feature includes two categories.
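Given such a pointer, the per-feature category counts are just the consecutive differences. A small sketch using the example above:

```python
# CSR-style pointer from the example above.
feature_segments = [0, 3, 3, 5]

# The number of categories in feature i is segments[i + 1] - segments[i].
counts = [b - a for a, b in zip(feature_segments, feature_segments[1:])]
assert counts == [3, 0, 2]  # categorical, numerical, categorical
```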
sorted_idx
This array stores the sorted indices (argsort) of categories across all features, segmented by the feature_segments. Given a feature with categories ["b", "c", "a"], the sorted index is [2, 0, 1].
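The argsort for the example above can be reproduced with plain Python:

```python
categories = ["b", "c", "a"]

# Indices that would sort the categories: "a" (index 2) < "b" (0) < "c" (1).
sorted_idx = sorted(range(len(categories)), key=categories.__getitem__)
assert sorted_idx == [2, 0, 1]
```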
enc
This is an array with a length equal to the number of features, storing all the categories in the same order as the input dataframe. The storage schema depends on whether the categories are strings (XGBoost also supports numerical categories, such as integers). For string categories, we use a schema similar to the arrow format for a string array. The categories of each feature are represented by two arrays, namely offsets and values. The format is also similar to a CSR matrix. The values field is a uint8 array storing characters from all category names. Given a feature with three categories ["bb", "c", "a"], the values field is [98, 98, 99, 97]. Then the offsets field segments the values array similar to a CSR pointer: [0, 2, 3, 4]. We chose not to store the values as a JSON string to avoid handling special characters and string encoding. The string names are stored exactly as given by the dataframe.
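The string layout from the example can be decoded back into category names with a few lines of Python (a sketch for illustration, not part of the XGBoost API):

```python
# values/offsets pair from the example above.
values = [98, 98, 99, 97]  # UTF-8 bytes of "bb", "c", "a" concatenated
offsets = [0, 2, 3, 4]     # CSR-style pointer segmenting values

# Slice the byte array between consecutive offsets to recover each name.
names = [
    bytes(values[offsets[i]:offsets[i + 1]]).decode("utf-8")
    for i in range(len(offsets) - 1)
]
assert names == ["bb", "c", "a"]
```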
As for numerical categories, the enc contains two keys: type and values. The type field is an integer ID that identifies the type of the categories, such as 64-bit integers and 32-bit floating points (note that they are all f32 inside a decision tree). The exact mapping between the type and the integer ID is internal but stable. The values field is an array storing all categories in a feature.
Brief History
The JSON format was introduced in 1.0, aiming to replace the now-removed old binary internal format with an open format that can be easily reused.
Later, in XGBoost 1.6.0, additional support for Universal Binary JSON was introduced as an optimization for more efficient model IO.
UBJSON became the default format in 2.1.
The old binary format was removed in 3.1.
The JSON schema file is no longer maintained and has been removed in 3.2. The underlyingschema of the model is not changed.