Distributed XGBoost on Kubernetes

Distributed XGBoost training and batch prediction on Kubernetes are supported via Kubeflow Trainer.

Instructions

To run an XGBoost job in a Kubernetes cluster, perform the following steps:

Install Kubeflow Trainer on the Kubernetes cluster.
1. Kubeflow Trainer is designed to manage the scheduling and monitoring of XGBoost jobs. Follow this installation guide to install it.
Write application code that will be executed by the Kubeflow Trainer.
1. To use Kubeflow Trainer, you’ll have to write a couple of Python scripts that implement the distributed training logic for XGBoost. Please refer to the Iris classification example.
2. Data reader/writer: you need to implement the data reader and writer based on the specific requirements of your chosen data source. For example, if your dataset is stored in a Hive table, you have to write the code that reads from or writes to the Hive table, using the index of the worker to decide which part of the data each worker handles.
3. Model persistence: in the Iris classification example, the model is stored in Alibaba OSS. If you want to store your model in another storage system such as Amazon S3 or Google NFS, you’ll need to implement the model persistence logic based on the requirements of the chosen storage system.
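The data reader and model persistence pieces described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the code used by the Iris classification example: `shard_for_worker`, `ModelStore`, and `LocalDirStore` are hypothetical names, the strided split is just one possible way to divide rows by worker index, and a local directory stands in for a real storage backend such as OSS or S3.

```python
import os
import tempfile
from typing import List, Protocol, Sequence


def shard_for_worker(rows: Sequence, rank: int, world_size: int) -> List:
    """Return the rows handled by worker `rank` out of `world_size` workers.

    Hypothetical helper: a strided split in which worker r takes rows
    r, r + world_size, r + 2 * world_size, and so on.
    """
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(rows[rank::world_size])


class ModelStore(Protocol):
    """Hypothetical interface that each storage backend would implement."""

    def save(self, name: str, payload: bytes) -> None: ...

    def load(self, name: str) -> bytes: ...


class LocalDirStore:
    """Stand-in backend that writes model bytes to a local directory.

    A real backend (object storage, NFS, ...) would replace the file
    operations below with calls to its own client library.
    """

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, name: str, payload: bytes) -> None:
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(payload)

    def load(self, name: str) -> bytes:
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()


if __name__ == "__main__":
    # Ten toy rows split across three workers; every row lands on one worker.
    rows = list(range(10))
    for rank in range(3):
        print(rank, shard_for_worker(rows, rank, 3))

    # Round-trip some placeholder bytes through the stand-in store.
    store = LocalDirStore(tempfile.mkdtemp())
    store.save("model.bin", b"placeholder model bytes")
    print(store.load("model.bin"))
```

In a real job, the worker index and worker count would come from whatever the training runtime exposes to each worker, and the payload could be the serialized booster (for instance the bytes returned by `Booster.save_raw()`) written by a single designated worker.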
Configure the XGBoost job using a YAML file.
1. A YAML file is used to configure the computational resources and environment for your XGBoost job, e.g. the number of workers/masters and the number of CPUs/GPUs. Please refer to this YAML template for an example.
Submit the XGBoost job to a Kubernetes cluster.
1. Use kubectl to submit a distributed XGBoost job as illustrated here.
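If the submission step is driven from a Python script rather than typed by hand, it might look like the sketch below. The manifest file name `xgboost-job.yaml`, the helper names, and the choice of `kubectl apply` are assumptions made for illustration; the linked example may use a different file name or subcommand.

```python
import subprocess
from typing import List


def kubectl_apply_command(manifest_path: str, namespace: str = "default") -> List[str]:
    """Build the kubectl invocation that submits the given manifest."""
    return ["kubectl", "apply", "-n", namespace, "-f", manifest_path]


def submit_job(manifest_path: str, namespace: str = "default") -> None:
    """Run kubectl; raises CalledProcessError if the submission fails."""
    subprocess.run(kubectl_apply_command(manifest_path, namespace), check=True)


if __name__ == "__main__":
    # Hypothetical manifest name; only prints the command, nothing is executed.
    print(" ".join(kubectl_apply_command("xgboost-job.yaml")))
```

Keeping the command construction separate from its execution makes it possible to log or review the exact command before anything is sent to the cluster.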

Please submit an issue on the Kubeflow Trainer repo for any feature requests or problems.