Demo for using data iterator with Quantile DMatrix
Added in version 1.2.0.
This demo defines a customized iterator for passing batches of data into xgboost.QuantileDMatrix and uses this QuantileDMatrix for training. The feature is primarily designed to reduce the GPU memory required for training in a distributed environment.
After going through the demo, one might ask why we don't use a more native Python iterator. That's because XGBoost requires a reset function, while using itertools.tee might incur significant memory usage according to:
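To illustrate the point about reset, here is a minimal pure-Python sketch that does not use XGBoost. It assumes the consumer needs to walk over the batches more than once, which is what the reset requirement suggests. The names `batch_generator`, `ResettableBatches`, and `count_rows` are hypothetical and exist only for this illustration; `count_rows` is a stand-in for a consumer, not XGBoost's actual internals.

```python
from typing import Callable, Iterator, List


def batch_generator(batches: List[List[int]]) -> Iterator[List[int]]:
    """A plain generator: convenient, but it can only be consumed once."""
    for batch in batches:
        yield batch


class ResettableBatches:
    """Hypothetical stand-in that mirrors the reset/next shape of the demo."""

    def __init__(self, batches: List[List[int]]) -> None:
        self._batches = batches
        self._it = 0

    def reset(self) -> None:
        # Rewind to the first batch so another pass can start.
        self._it = 0

    def next(self, input_data: Callable[..., None]) -> bool:
        if self._it == len(self._batches):
            return False  # signals the end of one pass
        input_data(data=self._batches[self._it])
        self._it += 1
        return True


def count_rows(it: ResettableBatches) -> int:
    """Hypothetical consumer: run one full pass and count the rows seen."""
    seen: List[int] = []
    it.reset()
    while it.next(lambda *, data: seen.append(len(data))):
        pass
    return sum(seen)


batches = [[1, 2, 3], [4, 5], [6]]

gen = batch_generator(batches)
first_pass = sum(len(b) for b in gen)
second_pass = sum(len(b) for b in gen)  # the generator is already exhausted

resettable = ResettableBatches(batches)
first_count = count_rows(resettable)
second_count = count_rows(resettable)  # reset makes a second pass possible
```

The generator yields nothing on its second traversal, while the resettable object produces the same rows on every pass without keeping a buffered copy of earlier batches.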
```python
from typing import Callable

import cupy
import numpy

import xgboost

COLS = 64
ROWS_PER_BATCH = 1000  # data is split by rows
BATCHES = 32


class IterForDMatrixDemo(xgboost.core.DataIter):
    """A data iterator for XGBoost DMatrix.

    `reset` and `next` are required for any data iterator, other functions here
    are utilities for demonstration purposes.

    """

    def __init__(self) -> None:
        """Generate some random data for demonstration.

        Actual data can be anything that is currently supported by XGBoost.
        """
        self.rows = ROWS_PER_BATCH
        self.cols = COLS
        rng = cupy.random.RandomState(numpy.uint64(1994))
        self._data = [rng.randn(self.rows, self.cols)] * BATCHES
        self._labels = [rng.randn(self.rows)] * BATCHES
        self._weights = [rng.uniform(size=self.rows)] * BATCHES

        self.it = 0  # set iterator to 0
        super().__init__()

    def as_array(self) -> cupy.ndarray:
        return cupy.concatenate(self._data)

    def as_array_labels(self) -> cupy.ndarray:
        return cupy.concatenate(self._labels)

    def as_array_weights(self) -> cupy.ndarray:
        return cupy.concatenate(self._weights)

    def data(self) -> cupy.ndarray:
        """Utility function for obtaining current batch of data."""
        return self._data[self.it]

    def labels(self) -> cupy.ndarray:
        """Utility function for obtaining current batch of label."""
        return self._labels[self.it]

    def weights(self) -> cupy.ndarray:
        return self._weights[self.it]

    def reset(self) -> None:
        """Reset the iterator"""
        self.it = 0

    def next(self, input_data: Callable) -> bool:
        """Yield the next batch of data."""
        if self.it == len(self._data):
            # Return False to let XGBoost know this is the end of iteration
            return False

        # input_data is a keyword-only function passed in by XGBoost and has a
        # signature similar to the ``DMatrix`` constructor.
        input_data(data=self.data(), label=self.labels(), weight=self.weights())
        self.it += 1
        return True


def main() -> None:
    rounds = 100
    it = IterForDMatrixDemo()

    # Use iterator, must be `QuantileDMatrix`.

    # In this demo, the input batches are created using cupy, and the data processing
    # (quantile sketching) will be performed on GPU. If data is loaded with CPU based
    # data structures like numpy or pandas, then the processing step will be performed
    # on CPU instead.
    m_with_it = xgboost.QuantileDMatrix(it)

    # Use regular DMatrix.
    m = xgboost.DMatrix(
        it.as_array(), it.as_array_labels(), weight=it.as_array_weights()
    )

    assert m_with_it.num_col() == m.num_col()
    assert m_with_it.num_row() == m.num_row()
    # Tree method must be `hist`.
    reg_with_it = xgboost.train(
        {"tree_method": "hist", "device": "cuda"},
        m_with_it,
        num_boost_round=rounds,
        evals=[(m_with_it, "Train")],
    )
    predict_with_it = reg_with_it.predict(m_with_it)

    reg = xgboost.train(
        {"tree_method": "hist", "device": "cuda"},
        m,
        num_boost_round=rounds,
        evals=[(m, "Train")],
    )
    predict = reg.predict(m)


if __name__ == "__main__":
    main()
```
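One detail of the demo data is worth noting: an expression of the form `[array] * BATCHES` repeats a reference to a single array, so every batch in the demo holds identical values. That is harmless for a demonstration. The sketch below uses plain Python lists and the standard `random` module as a stand-in for cupy, and shows how distinct synthetic batches could be generated with a list comprehension instead. The helper `make_batch` and the small sizes are hypothetical choices for illustration.

```python
import random
from typing import List

ROWS_PER_BATCH = 4  # kept small for illustration; the demo uses 1000
BATCHES = 3         # the demo uses 32


def make_batch(rng: random.Random, rows: int) -> List[float]:
    """Hypothetical helper: one batch of normally distributed values."""
    return [rng.gauss(0.0, 1.0) for _ in range(rows)]


rng = random.Random(1994)

# List repetition: the batch is generated once and referenced BATCHES times.
repeated = [make_batch(rng, ROWS_PER_BATCH)] * BATCHES

# List comprehension: make_batch runs once per batch, giving separate objects.
distinct = [make_batch(rng, ROWS_PER_BATCH) for _ in range(BATCHES)]

all_same_object = all(batch is repeated[0] for batch in repeated)
all_different_objects = len({id(batch) for batch in distinct}) == BATCHES
```

Either form yields the same number of batches; they differ only in whether the batches share one underlying object.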