XGBoost4J-Spark Tutorial
XGBoost4J-Spark is a project that aims to seamlessly integrate XGBoost and Apache Spark by fitting XGBoost into Apache Spark’s MLlib framework. With this integration, users can not only use the high-performance algorithm implementation of XGBoost, but also leverage the powerful data processing engine of Spark for:
Feature Engineering: feature extraction, transformation, dimensionality reduction, and selection, etc.
Pipelines: constructing, evaluating, and tuning ML Pipelines
Persistence: persist and load machine learning models and even whole Pipelines
This tutorial covers the end-to-end process of building a machine learning pipeline with XGBoost4J-Spark. We will discuss:
Using Spark to preprocess data to fit XGBoost4J-Spark’s data interface
Training an XGBoost model with XGBoost4J-Spark
Serving an XGBoost model (prediction) with Spark
Building a Machine Learning Pipeline with XGBoost4J-Spark
Running XGBoost4J-Spark in Production
Build an ML Application with XGBoost4J-Spark
Refer to XGBoost4J-Spark Dependency
Before we go into the tour of how to use XGBoost4J-Spark, you should first consult Installation from Maven repository in order to add XGBoost4J-Spark as a dependency for your project. We provide both stable releases and snapshots.
Note
XGBoost4J-Spark requires Apache Spark 3.0+
XGBoost4J-Spark now requires Apache Spark 3.0+. The latest versions of XGBoost4J-Spark use facilities of org.apache.spark.ml.param.shared extensively to provide tight integration with the Spark MLlib framework, and these facilities are not fully available in earlier versions of Spark.
Also, make sure to install Spark directly from the Apache website. Upstream XGBoost is not guaranteed to work with third-party distributions of Spark, such as Cloudera Spark. Consult the appropriate third parties to obtain their distribution of XGBoost.
Data Preparation
As mentioned above, XGBoost4J-Spark seamlessly integrates Spark and XGBoost. The integration enables users to apply various types of transformation to the training/test datasets with the convenient and powerful data processing framework, Spark.
In this section, we use the Iris dataset as an example to showcase how we use Spark to transform a raw dataset and make it fit the data interface of XGBoost.
The Iris dataset is shipped in CSV format. Each instance contains 4 features: “sepal length”, “sepal width”, “petal length” and “petal width”. In addition, it contains the “class” column, which is essentially the label with three possible values: “Iris Setosa”, “Iris Versicolour” and “Iris Virginica”.
Read Dataset with Spark’s Built-In Reader
The first thing in data transformation is to load the dataset as Spark’s structured data abstraction, DataFrame.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    val spark = SparkSession.builder().getOrCreate()
    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    val rawInput = spark.read.schema(schema).csv("input_path")
In the first line, we create an instance of SparkSession, which is the entry point of any Spark program working with DataFrame. The schema variable defines the schema of the DataFrame wrapping the Iris data. With this explicitly set schema, we can define the columns’ names as well as their types; otherwise the column names would be the default ones derived by Spark, such as _c0, etc. Finally, we can use Spark’s built-in CSV reader to load the Iris CSV file as a DataFrame named rawInput.
Spark also contains many built-in readers for other formats. The latest version of Spark supports CSV, JSON, Parquet, and LIBSVM.
Transform Raw Iris Dataset
To make the Iris dataset recognizable to XGBoost, we need to:
Transform the String-typed label, i.e. “class”, to a Double-typed label.
Assemble the feature columns as a vector to fit the data interface of the Spark ML framework.
To convert the String-typed label to Double, we can use Spark’s built-in feature transformer StringIndexer.
    import org.apache.spark.ml.feature.StringIndexer

    val stringIndexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("classIndex")
      .fit(rawInput)
    val labelTransformed = stringIndexer.transform(rawInput).drop("class")
With a newly created StringIndexer instance:
we set input column, i.e. the column containing String-typed label.
we set output column, i.e. the column containing the Double-typed label.
Then we
fit the StringIndexer with our input DataFrame rawInput, so that Spark internals can get information like the total number of distinct values, etc.
Now we have a StringIndexer which is ready to be applied to our input DataFrame. To execute the transformation logic of StringIndexer, we transform the input DataFrame rawInput, and to keep the DataFrame concise, we drop the column “class” and keep only the feature columns and the transformed Double-typed label column (in the last line of the above code snippet).
fit and transform are two key operations in MLlib. Basically, fit produces a “transformer”, e.g. StringIndexer, and each transformer applies its transform method on a DataFrame to add new column(s) containing transformed features/labels or prediction results, etc. To understand more about fit and transform, you can find more details here.
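The split between fit and transform can be illustrated without Spark. The following is a toy sketch in plain Scala, not Spark’s implementation: the names FitTransformSketch and LabelIndexModel are hypothetical, and the ordering rule (most frequent label first, ties broken alphabetically) is an assumption chosen to resemble StringIndexer’s default behaviour.

```scala
// Toy illustration of the estimator/transformer split; not Spark's implementation.
object FitTransformSketch {
  // The fitted "transformer": it holds what was learned from the data.
  final case class LabelIndexModel(labels: Vector[String]) {
    private val index: Map[String, Double] =
      labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

    // transform: apply the learned mapping to every row.
    // A label that was not seen during fit throws NoSuchElementException.
    def transform(rows: Seq[String]): Seq[Double] = rows.map(index)
  }

  // fit: scan the data once and return a model.
  // Assumption: labels are ordered by descending frequency, ties alphabetically.
  def fit(rows: Seq[String]): LabelIndexModel = {
    val ordered = rows
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toVector
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
    LabelIndexModel(ordered)
  }
}
```

Calling FitTransformSketch.fit on a column of class names returns a model whose transform can then be applied to any column containing the same labels, which mirrors how the fitted StringIndexer above is applied to rawInput.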
Similarly, we can use another transformer, VectorAssembler, to assemble feature columns “sepal length”, “sepal width”, “petal length” and “petal width” as a vector.
    import org.apache.spark.ml.feature.VectorAssembler

    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
      .setOutputCol("features")
    val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")
Now, we have a DataFrame containing only two columns: “features”, which contains vector-represented “sepal length”, “sepal width”, “petal length” and “petal width”, and “classIndex”, which has Double-typed labels. A DataFrame like this (containing vector-represented features and numeric labels) can be fed to XGBoost4J-Spark’s training engine directly.
Dealing with missing values
XGBoost supports missing values by default (as described here). If given a SparseVector, XGBoost will treat any values absent from the SparseVector as missing. You can also tell XGBoost to treat a specific value in your Dataset as if it were a missing value. By default XGBoost treats NaN as the value representing missing.
Example of setting a missing value (e.g. -999) to the “missing” parameter in XGBoostClassifier:
    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    val xgbParam = Map("eta" -> 0.1f,
      "missing" -> -999,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2)
    val xgbClassifier = new XGBoostClassifier(xgbParam)
      .setFeaturesCol("features")
      .setLabelCol("classIndex")
Note
Missing values
If the feature is of vector type, a single feature instance could be a SparseVector, where “0” will be treated as the missing value. In order to get the correct model, XGBoost4J-Spark converts the SparseVector to an array by restoring the “0” entries. However, we can’t assume that 0 stands for missing values, as it may be meaningful. So in this case, users need to specify the missing value explicitly, even though the missing value is set to Float.NaN by default in XGBoost4J-Spark.
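To make the note above concrete, here is a minimal sketch in plain Scala of what restoring the zeros and then applying a “missing” marker amounts to. It is an illustration only, not XGBoost4J-Spark’s internal code; MissingValueSketch, densify, isMissing and missingMask are hypothetical names.

```scala
// Illustration only; the names below are not part of XGBoost4J-Spark.
object MissingValueSketch {
  // Rebuild a dense array from sparse (index, value) pairs.
  // Entries absent from the sparse representation are restored as 0.0.
  def densify(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // NaN never equals itself, so it needs a dedicated check.
  def isMissing(value: Double, missing: Double): Boolean =
    if (missing.isNaN) value.isNaN else value == missing

  // Mark which entries of a dense row count as missing for a given marker.
  def missingMask(row: Array[Double], missing: Double): Array[Boolean] =
    row.map(isMissing(_, missing))
}
```

With missing set to -999, only the entry holding -999 is masked and the restored zeros stay meaningful; with missing set to 0, every restored zero is masked as well, which is why the marker has to be chosen explicitly.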
Training
XGBoost supports regression, classification and ranking. While we use the Iris dataset in this tutorial to show how we use XGBoost4J-Spark to solve a multi-class classification problem, the usage in regression and ranking is very similar to classification.
To train an XGBoost model for classification, we need to create an XGBoostClassifier first:
    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    val xgbParam = Map("eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3)
    val xgbClassifier = new XGBoostClassifier(xgbParam)
      .setNumRound(100)
      .setNumWorkers(2)
      .setFeaturesCol("features")
      .setLabelCol("classIndex")
The available parameters for training an XGBoost model can be found here. In XGBoost4J-Spark, we support not only the default set of parameters but also the camel-case variant of these parameters to stay consistent with Spark’s MLlib parameters.
Specifically, each parameter in this page has its equivalent form in XGBoost4J-Spark with camel case. For example, to set max_depth for each tree, you can pass the parameter just like what we did in the above code snippet (as max_depth wrapped in a Map), or you can do it through setters in XGBoostClassifier:
    val xgbClassifier = new XGBoostClassifier()
      .setFeaturesCol("features")
      .setLabelCol("classIndex")
    xgbClassifier.setMaxDepth(2)
After we set the XGBoostClassifier parameters and feature/label columns, we can build a transformer, XGBoostClassificationModel, by fitting XGBoostClassifier with the input DataFrame. This fit operation is essentially the training process, and the generated model can then be used in prediction.
    val xgbClassificationModel = xgbClassifier.fit(xgbInput)
Early Stopping
Early stopping is a feature to prevent unnecessary training iterations. By specifying num_early_stopping_rounds or directly calling setNumEarlyStoppingRounds on an XGBoostClassifier or XGBoostRegressor, we can define the number of rounds the evaluation metric may go without improving on the best iteration before training stops early.
When it comes to custom eval metrics, in addition to num_early_stopping_rounds, you also need to define maximize_evaluation_metrics or call setMaximizeEvaluationMetrics to specify whether you want to maximize or minimize the metrics in training. For built-in eval metrics, XGBoost4J-Spark will automatically select the direction.
For example, suppose we need to maximize the evaluation metric (set maximize_evaluation_metrics to true) and set num_early_stopping_rounds to 5. If the evaluation metric of the 10th iteration is the maximum so far and no later iteration produces a greater value, training is stopped early at the 15th iteration.
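The stopping rule in this example can be written down as a small function. The sketch below is plain Scala and only models the rule as described in the text; it is not the code XGBoost runs, and EarlyStoppingSketch, stopIteration and patience are hypothetical names (patience plays the role of num_early_stopping_rounds).

```scala
// Models the early-stopping rule described in the text; not XGBoost's own code.
object EarlyStoppingSketch {
  /** Returns the 1-based iteration at which training ends.
    *
    * @param metrics  evaluation metric recorded after each iteration
    * @param patience rounds allowed without improving on the best iteration
    * @param maximize true if larger metric values are better
    */
  def stopIteration(metrics: Seq[Double], patience: Int, maximize: Boolean): Int = {
    require(metrics.nonEmpty && patience > 0)
    var bestValue = metrics.head
    var bestIter = 1
    var iter = 1
    var stopped = false
    while (!stopped && iter < metrics.size) {
      iter += 1
      val value = metrics(iter - 1)
      val improved = if (maximize) value > bestValue else value < bestValue
      if (improved) {
        bestValue = value
        bestIter = iter
      } else if (iter - bestIter >= patience) {
        stopped = true // no improvement for `patience` rounds in a row
      }
    }
    iter
  }
}
```

Feeding it a metric that peaks at the 10th iteration and never exceeds that value afterwards, with patience 5 and maximize set to true, yields 15, which matches the example above.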
Training with Evaluation Dataset
You can also monitor the performance of the model during training with an evaluation dataset by calling setEvalDataset on an XGBoostClassifier, XGBoostRegressor or XGBoostRanker.
Prediction
XGBoost4j-Spark supports two ways for model serving: batch prediction and single instance prediction.
Batch Prediction
When we get a model, either XGBoostClassificationModel, XGBoostRegressionModel or XGBoostRankerModel, it takes a DataFrame, reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame with the following columns by default:
XGBoostClassificationModel will output margins (rawPredictionCol), probabilities (probabilityCol) and the eventual prediction labels (predictionCol) for each possible label.
XGBoostRegressionModel will output prediction label (predictionCol).
XGBoostRankerModel will output prediction label (predictionCol).
Batch prediction expects the user to pass the test set in the form of a DataFrame. XGBoost4J-Spark starts an XGBoost worker for each partition of the DataFrame for parallel prediction and generates prediction results for the whole DataFrame in a batch.
    val xgbClassificationModel = xgbClassifier.fit(xgbInput)
    val results = xgbClassificationModel.transform(testSet)
With the above code snippet, we get a result DataFrame, results, containing the margin and probability for each class and the prediction for each instance:
+-----------------+----------+--------------------+--------------------+----------+
|         features|classIndex|       rawPrediction|         probability|prediction|
+-----------------+----------+--------------------+--------------------+----------+
|[5.1,3.5,1.4,0.2]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
|[4.9,3.0,1.4,0.2]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
|[4.7,3.2,1.3,0.2]|       0.0|[3.45569849014282...|[0.99643349647521...|       0.0|
|[4.6,3.1,1.5,0.2]|       0.0|[3.45569849014282...|[0.99636095762252...|       0.0|
|[5.0,3.6,1.4,0.2]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
|[5.4,3.9,1.7,0.4]|       0.0|[3.45569849014282...|[0.99428516626358...|       0.0|
|[4.6,3.4,1.4,0.3]|       0.0|[3.45569849014282...|[0.99643349647521...|       0.0|
|[5.0,3.4,1.5,0.2]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
|[4.4,2.9,1.4,0.2]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
|[4.9,3.1,1.5,0.1]|       0.0|[3.45569849014282...|[0.99636095762252...|       0.0|
|[5.4,3.7,1.5,0.2]|       0.0|[3.45569849014282...|[0.99428516626358...|       0.0|
|[4.8,3.4,1.6,0.2]|       0.0|[3.45569849014282...|[0.99643349647521...|       0.0|
|[4.8,3.0,1.4,0.1]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
|[4.3,3.0,1.1,0.1]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
|[5.8,4.0,1.2,0.2]|       0.0|[3.45569849014282...|[0.97809928655624...|       0.0|
|[5.7,4.4,1.5,0.4]|       0.0|[3.45569849014282...|[0.97809928655624...|       0.0|
|[5.4,3.9,1.3,0.4]|       0.0|[3.45569849014282...|[0.99428516626358...|       0.0|
|[5.1,3.5,1.4,0.3]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
|[5.7,3.8,1.7,0.3]|       0.0|[3.45569849014282...|[0.97809928655624...|       0.0|
|[5.1,3.8,1.5,0.3]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
+-----------------+----------+--------------------+--------------------+----------+
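The relation between the rawPrediction, probability and prediction columns can be sketched in plain Scala. This assumes the multi:softprob objective used in this tutorial, where probabilities are the softmax of the per-class margins and the prediction is the index of the largest probability; SoftprobSketch and its functions are hypothetical names, and the code is an illustration rather than the library’s implementation.

```scala
// Illustration of margins -> probabilities -> prediction, assuming multi:softprob.
object SoftprobSketch {
  // Softmax, subtracting the largest margin first for numerical stability.
  def softmax(margins: Seq[Double]): Seq[Double] = {
    val maxMargin = margins.max
    val exps = margins.map(m => math.exp(m - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }

  // The predicted label is the index of the most probable class, as a Double.
  def predict(probabilities: Seq[Double]): Double =
    probabilities.indexOf(probabilities.max).toDouble
}
```

A row whose first margin is clearly the largest therefore gets a first probability close to 1 and a prediction of 0.0, which is the pattern visible in the rows above.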
Single instance prediction
XGBoostClassificationModel, XGBoostRegressionModel and XGBoostRankerModel support making predictions on a single instance as well. Each accepts a single Vector as the feature and outputs the prediction label.
However, the overhead of single-instance prediction is high due to the internal overhead of XGBoost, so use it carefully!
    import org.apache.spark.ml.linalg.Vector

    val features = xgbInput.head().getAs[Vector]("features")
    val result = xgbClassificationModel.predict(features)
Model Persistence
Model and pipeline persistence
A data scientist produces an ML model and hands it over to an engineering team for deployment in a production environment. Conversely, a trained model may be used by data scientists, for example as a baseline, throughout the process of data exploration. So it’s important to support model persistence to make the models available across usage scenarios and programming languages.
XGBoost4J-Spark supports saving and loading XGBoostClassifier/XGBoostClassificationModel, XGBoostRegressor/XGBoostRegressionModel and XGBoostRanker/XGBoostRankerModel to/from the file system. It also supports saving and loading an ML pipeline which includes these estimators and models.
We can save the XGBoostClassificationModel to file system:
    val xgbClassificationModelPath = "/tmp/xgbClassificationModel"
    xgbClassificationModel.write.overwrite().save(xgbClassificationModelPath)
and then load the model in another session:
    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel

    val xgbClassificationModel2 = XGBoostClassificationModel.load(xgbClassificationModelPath)
    xgbClassificationModel2.transform(xgbInput)
Note
Besides dumping the model in raw format, users are able to dump the model in json or ubj format.
    val xgbClassificationModelPath = "/tmp/xgbClassificationModel"
    xgbClassificationModel.write.overwrite().option("format", "json").save(xgbClassificationModelPath)
For saving and loading an ML pipeline, please refer to the next section.
Interact with Other Bindings of XGBoost
After we train a model with XGBoost4J-Spark on a massive dataset, sometimes we want to do model serving on a single machine or integrate it with other single-node libraries for further processing.
After saving the model, we can load this model with single-node Python XGBoost directly.
    val xgbClassificationModelPath = "/tmp/xgbClassificationModel"
    xgbClassificationModel.write.overwrite().save(xgbClassificationModelPath)

    import xgboost as xgb
    bst = xgb.Booster({'nthread': 4})
    bst.load_model("/tmp/xgbClassificationModel/data/model")
Note
Consistency issue between XGBoost4J-Spark and other bindings
There is a consistency issue between XGBoost4J-Spark and other language bindings of XGBoost.
When users use Spark to load training/test data in LIBSVM format with the following code snippet:
spark.read.format("libsvm").load("trainingset_libsvm")
Spark assumes that the dataset is using 1-based indexing (feature indices starting with 1). However, when you do prediction with other bindings of XGBoost (e.g. Python API of XGBoost), XGBoost assumes that the dataset is using 0-based indexing (feature indices starting with 0) by default. This creates a pitfall for users who train a model with Spark but predict with a dataset in the same format in other bindings of XGBoost. The solution is to transform the dataset to 0-based indexing before you predict with, for example, the Python API, or to append ?indexing_mode=1 to your file path when loading with DMatrix. For example in Python:
xgb.DMatrix('test.libsvm?indexing_mode=1')
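For the first option, transforming the dataset to 0-based indexing, a minimal sketch in plain Scala could look as follows. It assumes plain LIBSVM lines of the form “label index:value index:value ...” without comments or qid fields; LibsvmIndexSketch and toZeroBased are hypothetical names.

```scala
// Shifts 1-based LIBSVM feature indices to 0-based ones, line by line.
// Assumption: lines contain only a label followed by index:value pairs.
object LibsvmIndexSketch {
  def toZeroBased(line: String): String = {
    val tokens = line.trim.split("\\s+")
    val shifted = tokens.tail.map { token =>
      val parts = token.split(":", 2)
      s"${parts(0).toInt - 1}:${parts(1)}" // only the index changes
    }
    (tokens.head +: shifted).mkString(" ")
  }
}
```

For instance, the line “1 1:5.1 3:1.4 4:0.2” becomes “1 0:5.1 2:1.4 3:0.2”; the label and the feature values are left untouched.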
Building a ML Pipeline with XGBoost4J-Spark
Basic ML Pipeline
A Spark ML pipeline can combine multiple algorithms or functions into a single pipeline. It covers everything from feature extraction, transformation and selection to model training and prediction. XGBoost4J-Spark makes it feasible to embed XGBoost into such a pipeline seamlessly. The following example shows how to build such a pipeline consisting of Spark MLlib feature transformers and the XGBoostClassifier estimator.
We still use the Iris dataset and the rawInput DataFrame. First we need to split the dataset into training and test datasets.
    val Array(training, test) = rawInput.randomSplit(Array(0.8, 0.2), 123)
Then we build the ML pipeline, which includes 4 stages:
Assemble all features into a single vector column.
Convert the string label to an indexed double label.
Use XGBoostClassifier to train classification model.
Convert indexed double label back to original string label.
We have shown the first three steps in the earlier sections, and the last step is finished with a new transformer, IndexToString:
    import org.apache.spark.ml.feature.IndexToString

    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("realLabel")
      .setLabels(stringIndexer.labels)
We need to organize these steps as a Pipeline in Spark ML framework and evaluate the whole pipeline to get a PipelineModel:
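    import org.apache.spark.ml.feature._
    import org.apache.spark.ml.Pipeline

    // assembler and booster refer to the VectorAssembler and XGBoostClassifier
    // built in the earlier sections (named vectorAssembler and xgbClassifier there).
    val assembler = vectorAssembler
    val booster = xgbClassifier

    val pipeline = new Pipeline()
      .setStages(Array(assembler, stringIndexer, booster, labelConverter))
    val model = pipeline.fit(training)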
After we get the PipelineModel, we can make prediction on the test dataset and evaluate the model accuracy.
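    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    val prediction = model.transform(test)
    // The label column produced by the pipeline is "classIndex", and the
    // evaluator's default metric is f1, so both are set explicitly here.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("classIndex")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(prediction)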
Pipeline with Hyper-parameter Tuning
The most critical operation to maximize the power of XGBoost is selecting the optimal parameters for the model. Tuning parameters manually is a tedious and labor-intensive process. With the latest version of XGBoost4J-Spark, we can utilize Spark’s model selection tools to automate this process.
The following example shows a code snippet utilizing CrossValidator and MulticlassClassificationEvaluator to search for the optimal combination of two XGBoost parameters, max_depth and eta. (See XGBoost Parameters.) The model producing the maximum accuracy defined by MulticlassClassificationEvaluator is selected and used to generate the prediction for the test set.
    import org.apache.spark.ml.tuning._
    import org.apache.spark.ml.PipelineModel
    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel

    val paramGrid = new ParamGridBuilder()
      .addGrid(booster.maxDepth, Array(3, 8))
      .addGrid(booster.eta, Array(0.2, 0.6))
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    val cvModel = cv.fit(training)

    // stages(2) is the XGBoost stage of the pipeline.
    val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel].stages(2)
      .asInstanceOf[XGBoostClassificationModel]
    bestModel.extractParamMap()
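To get a feel for the cost of this search, the following plain-Scala sketch enumerates the parameter combinations and counts the training runs. ParamGridSketch, grid and totalFits are hypothetical names; the count relies on CrossValidator training one model per combination and fold and then refitting the best combination on the whole training set, and the order of the enumerated pairs is illustrative only.

```scala
// Counts the work implied by a grid search with k-fold cross-validation.
object ParamGridSketch {
  // Cartesian product of the candidate values, as (max_depth, eta) pairs.
  def grid(maxDepths: Seq[Int], etas: Seq[Double]): Seq[(Int, Double)] =
    for (depth <- maxDepths; eta <- etas) yield (depth, eta)

  // One fit per (combination, fold), plus one final refit of the best
  // combination on the full training set.
  def totalFits(gridSize: Int, numFolds: Int): Int = gridSize * numFolds + 1
}
```

For the grid above, two values of max_depth and two values of eta give 4 combinations, so with 3 folds the search trains 13 models in total. This grows quickly as more parameters or candidate values are added.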
Run XGBoost4J-Spark in Production
XGBoost4J-Spark is one of the most important steps toward making it easier to bring XGBoost to production environments. In this section, we introduce three key features for running XGBoost4J-Spark in production.
Parallel/Distributed Training
The massive size of the training dataset is one of the most significant characteristics of production environments. To ensure that training in XGBoost scales with the data size, XGBoost4J-Spark bridges the distributed/parallel processing framework of Spark and the parallel/distributed training mechanism of XGBoost.
In XGBoost4J-Spark, each XGBoost worker is wrapped by a Spark task, and the training dataset in Spark’s memory space is fed to XGBoost workers in a way that is transparent to the user.
In the code snippet where we build XGBoostClassifier, we set the parameter num_workers (or numWorkers). This parameter controls how many parallel workers we want to have when training an XGBoostClassificationModel.
Note
Regarding OpenMP optimization
By default, we allocate one core per XGBoost worker. Therefore, the OpenMP optimization within each XGBoost worker does not take effect and the parallelization of training is achieved by running multiple workers (i.e. Spark tasks) at the same time.
If you do want OpenMP optimization, you have to:
set nthread to a value larger than 1 when creating XGBoostClassifier/XGBoostRegressor
set spark.task.cpus in Spark to the same value as nthread
Gang Scheduling
XGBoost uses the AllReduce algorithm to synchronize the stats, e.g. histogram values, of each worker during training. Therefore XGBoost4J-Spark requires that all nthread * numWorkers cores be available before the training runs.
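The core requirement is a simple product, which the plain-Scala sketch below spells out. GangSchedulingSketch, requiredCores and canStartTraining are hypothetical helper names used only for illustration; they are not part of XGBoost4J-Spark.

```scala
// Spells out the gang-scheduling requirement: all cores must be free at once.
object GangSchedulingSketch {
  // Every worker needs nthread cores, and all workers must run together.
  def requiredCores(nthread: Int, numWorkers: Int): Int = nthread * numWorkers

  // Training can only begin once the full requirement is available.
  def canStartTraining(availableCores: Int, nthread: Int, numWorkers: Int): Boolean =
    availableCores >= requiredCores(nthread, numWorkers)
}
```

With nthread set to 4 and 2 workers, 8 cores are needed; a cluster that can offer only 7 free cores cannot start the job, even though it could run one of the two workers.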
In a production environment where many users share the same cluster, it’s hard to guarantee that your XGBoost4J-Spark application can get all requested resources for every run. By default, the communication layer in XGBoost will block the whole application when it requires more resources to be available. This usually wastes resources, as the application holds the resources it already has while trying to claim more. Additionally, this usually happens silently and does not come to the attention of users.
XGBoost4J-Spark allows the user to set up a timeout threshold for claiming resources from the cluster. If the application cannot get enough resources within this time period, it fails instead of wasting resources by hanging for a long time. To enable this feature, you can set it on XGBoostClassifier/XGBoostRegressor/XGBoostRanker:
xgbClassifier.setRabitTrackerTimeout(60000L)
or pass in rabit_tracker_timeout in xgbParamMap when building XGBoostClassifier:
    val xgbParam = Map("eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2,
      "rabit_tracker_timeout" -> 60000L)
    val xgbClassifier = new XGBoostClassifier(xgbParam)
      .setFeaturesCol("features")
      .setLabelCol("classIndex")
If XGBoost4J-Spark cannot get enough resources for running two XGBoost workers, the application will fail. Users can set up an external mechanism to monitor the status of the application and be notified in such a case.
Checkpoint During Training
Transient failures are also common in production environments. To simplify the design of XGBoost, we stop training if any of the distributed workers fails. However, if training fails after running for a long time, a great deal of resources is wasted.
We support creating checkpoints during training to facilitate more efficient recovery from failure. To enable this feature, you can set the number of iterations between checkpoints with setCheckpointInterval and the location of checkpoints with setCheckpointPath:
    xgbClassifier.setCheckpointInterval(2)
    xgbClassifier.setCheckpointPath("/checkpoint_path")
An equivalent way is to pass in parameters in XGBoostClassifier’s constructor:
    val xgbParam = Map("eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2,
      "checkpoint_path" -> "/checkpoints_path",
      "checkpoint_interval" -> 2)
    val xgbClassifier = new XGBoostClassifier(xgbParam)
      .setFeaturesCol("features")
      .setLabelCol("classIndex")
If training fails during these 100 rounds, the next run of training will start by reading the latest checkpoint file in /checkpoints_path and resume from the iteration at which the checkpoint was built, continuing until the next failure or until the specified 100 rounds are completed.
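As a rough illustration of how much work a checkpoint saves, the plain-Scala sketch below computes where a restarted run would resume. It rests on an assumption that is not confirmed by this tutorial: that a checkpoint is written after every checkpoint_interval completed rounds. CheckpointSketch, latestCheckpoint and remainingRounds are hypothetical names.

```scala
// Estimates the resume point after a failure.
// Assumption: a checkpoint is written after every `interval` completed rounds.
object CheckpointSketch {
  // Last round for which a checkpoint exists (0 means no checkpoint yet).
  def latestCheckpoint(completedRounds: Int, interval: Int): Int =
    (completedRounds / interval) * interval

  // Rounds still to be trained when the next run resumes from that checkpoint.
  def remainingRounds(completedRounds: Int, interval: Int, numRound: Int): Int =
    numRound - latestCheckpoint(completedRounds, interval)
}
```

Under this assumption, a failure after 37 completed rounds with an interval of 2 would resume from round 36 and leave 64 of the 100 rounds to train, instead of starting again from scratch.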
External Memory
Added in version 3.0.
Warning
The feature is experimental.
Here we refer to the iterator-based external memory instead of the one that uses special URL parameters. XGBoost-Spark has experimental support for GPU-based external memory training (XGBoost4J-Spark-GPU Tutorial) since 3.0. When it’s used in combination with GPU-based training, data is first cached on disk and then staged on CPU memory. See Using XGBoost External Memory Version for general concept and best practices for the external memory training. In addition, see the doc string of the estimator parameter useExternalMemory. With Spark estimators:
    val xgbClassifier = new XGBoostClassifier(xgbParam)
      .setFeaturesCol(featuresNames)
      .setLabelCol(labelName)
      .setUseExternalMemory(true)
      .setDevice("cuda") // CPU is not yet supported