Using XGBoost External Memory Version
Contents
Overview
When working with large datasets, training XGBoost models can be challenging as the entiredataset needs to be loaded into the main memory. This can be costly and sometimesinfeasible.
External memory training is sometimes called out-of-core training. It refers to thecapability that XGBoost can optionally cache data in a location external to the mainprocessor, be it CPU or GPU. XGBoost doesn’t support network file systems by itself. As aresult, for CPU, the external memory usually refers to a harddrive. And for GPU, it refersto either the host memory or a harddrive.
Users can define a custom iterator to load data in chunks for running XGBoostalgorithms. External memory can be used for training and prediction, but training is theprimary use case and it will be our focus in this tutorial. For prediction and evaluation,users can iterate through the data themselves, whereas training requires the entiredataset to be loaded into the memory. During model training, XGBoost fetches the cache inbatches to construct the decision trees, hence avoiding loading the entire dataset intothe main memory and achieve better vertical scaling (scaling within the same node).
Significant progress was made in the 3.0 release for the GPU implementation. We willintroduce the difference between CPU and GPU in the following sections.
Note
Training on data from external memory is not supported by theexact tree method. Werecommend using the defaulthist tree method for performance reasons.
Note
The feature is considered experimental but ready for public testing in 3.0. Vector-leafis not yet supported.
The external memory support has undergone multiple development iterations. See belowsections for a brief history.
Data Iterator
To start using the external memory, users need define a data iterator. The data iteratorinterface was added to the Python and C interfaces in 1.5, and to the R interface in3.0.0. Like theQuantileDMatrix withDataIter,XGBoost loads data batch-by-batch using the custom iterator supplied by the user. However,unlike theQuantileDMatrix, external memory does not concatenate thebatches (unless specified by theextmem_single_page for GPU) . Instead, it caches allbatches in the external memory and fetch them on-demand. Go to the end of the document tosee a comparison betweenQuantileDMatrix and the external memoryversion ofExtMemQuantileDMatrix.
Some examples are in thedemo directory for a quick start. To enable external memorytraining, the custom data iterator needs to have two class methods:next andreset.
importosfromtypingimportList,CallableimportnumpyasnpimportxgboostclassIterator(xgboost.DataIter):"""A custom iterator for loading files in batches."""def__init__(self,device:Literal["cpu","cuda"],file_paths:List[Tuple[str,str]])->None:self.device=deviceself._file_paths=file_pathsself._it=0# XGBoost will generate some cache files under the current directory with the# prefix "cache"super().__init__(cache_prefix=os.path.join(".","cache"))defload_file(self)->Tuple[np.ndarray,np.ndarray]:"""Load a single batch of data."""X_path,y_path=self._file_paths[self._it]# When the `ExtMemQuantileDMatrix` is used, the device must match. GPU cannot# consume CPU input data and vice-versa.ifself.device=="cpu":X=np.load(X_path)y=np.load(y_path)else:importcupyascpX=cp.load(X_path)y=cp.load(y_path)assertX.shape[0]==y.shape[0]returnX,ydefnext(self,input_data:Callable)->bool:"""Advance the iterator by 1 step and pass the data to XGBoost. This function is called by XGBoost during the construction of ``DMatrix`` """ifself._it==len(self._file_paths):# return False to let XGBoost know this is the end of iterationreturnFalse# input_data is a keyword-only function passed in by XGBoost and has the similar# signature to the ``DMatrix`` constructor.X,y=self.load_file()input_data(data=X,label=y)self._it+=1returnTruedefreset(self)->None:"""Reset the iterator to its beginning"""self._it=0
After defining the iterator, we can to pass it into theDMatrix ortheExtMemQuantileDMatrix constructor:
it=Iterator(device="cpu",file_paths=["file_0.npy","file_1.npy","file_2.npy"])# Use the ``ExtMemQuantileDMatrix`` for the hist tree method, recommended.Xy=xgboost.ExtMemQuantileDMatrix(it)booster=xgboost.train({"tree_method":"hist"},Xy)# The ``approx`` tree method also works, but with lower performance and cannot be used# with the quantile DMatrix.Xy=xgboost.DMatrix(it)booster=xgboost.train({"tree_method":"approx"},Xy)
The above snippet is a simplified version ofExperimental support for external memory.For an example in C, please seedemo/c-api/external-memory/. The iterator is thecommon interface for using external memory with XGBoost, you can pass the resultingDMatrix object for training, prediction, and evaluation.
TheExtMemQuantileDMatrix is an external memory version of theQuantileDMatrix. These two classes are specifically designed for thehist tree method for reduced memory usage and data loading overhead. See respectivereferences for more info.
It is important to set the batch size based on the memory available. A good starting pointfor CPU is to set the batch size to 10GB per batch if you have 64GB of memory. It isnotrecommended to set small batch sizes like 32 samples per batch, as this can severely hurtperformance in gradient boosting. See below sections for information about the GPU versionand other best practices.
GPU Version (GPU Hist tree method)
External memory is supported by GPU algorithms (i.e., whendevice is set tocuda). Starting with 3.0, the default GPU implementation is similar to what the CPUversion does. It also supports the use ofExtMemQuantileDMatrix whenthehist tree method is employed (default). For a GPU device, the main memory is thedevice memory, whereas the external memory can be either a disk or the CPU memory. XGBooststages the cache on CPU memory by default. Users can change the backing storage to disk byspecifying theon_host parameter in theDataIter. However, usingthe disk is not recommended as it’s likely to make the GPU slower than the CPU. The optionis here for experimentation purposes only. In addition,ExtMemQuantileDMatrix parametersmin_cache_page_bytes, andmax_quantile_batches can help control the data placement and memory usage.
Inputs to theExtMemQuantileDMatrix (through the iterator) must be onthe GPU. Following is a snippet fromExperimental support for external memory:
importcupyascpimportrmmfromrmm.allocators.cupyimportrmm_cupy_allocator# It's important to use RMM for GPU-based external memory to improve performance.# If XGBoost is not built with RMM support, a warning will be raised.# We use the pool memory resource here for simplicity, you can also try the# `ArenaMemoryResource` for improved memory fragmentation handling.mr=rmm.mr.PoolMemoryResource(rmm.mr.CudaAsyncMemoryResource())rmm.mr.set_current_device_resource(mr)# Set the allocator for cupy as well.cp.cuda.set_allocator(rmm_cupy_allocator)# Make sure XGBoost is using RMM for all allocations.withxgboost.config_context(use_rmm=True):# Construct the iterators for ExtMemQuantileDMatrix# ...# Build the ExtMemQuantileDMatrix and start trainingXy_train=xgboost.ExtMemQuantileDMatrix(it_train,max_bin=n_bins)# Use the training DMatrix as a referenceXy_valid=xgboost.ExtMemQuantileDMatrix(it_valid,max_bin=n_bins,ref=Xy_train)booster=xgboost.train({"tree_method":"hist","max_depth":6,"max_bin":n_bins,"device":device,},Xy_train,num_boost_round=n_rounds,evals=[(Xy_train,"Train"),(Xy_valid,"Valid")])
It’s crucial to useRAPIDS Memory Manager (RMM) withan asynchronous memory resource for all memory allocation when training with externalmemory. XGBoost relies on the asynchronous memory pool to reduce the overhead of datafetching. In addition, the open sourceNVIDIA Linux driveris required forHeterogeneousmemorymanagement(HMM) support. Usually, users need notto changeExtMemQuantileDMatrix parameters likemin_cache_page_bytes, they are automatically configured based on the device and don’tchange model accuracy. However, themax_quantile_batches can be useful ifExtMemQuantileDMatrix is running out of device memory duringconstruction, seeQuantileDMatrix and the following sections for moreinfo. Currently, we focus on devices withNVLink-C2C support for GPU-based externalmemory support.
In addition to the batch-based data fetching, the GPU version supports concatenatingbatches into a single blob for the training data to improve performance. For GPUsconnected via PCIe instead of nvlink, the performance overhead with batch-based trainingis significant, particularly for non-dense data. Overall, it can be at least five timesslower than in-core training. Concatenating pages can be used to get the performancecloser to in-core training. This option should be used in combination with subsampling toreduce the memory usage. During concatenation, subsampling removes a portion of samples,reducing the training dataset size. The GPU hist tree method supportsgradient-basedsampling, enabling users to set a low sampling rate without compromising accuracy. Before3.0, concatenation with subsampling was the only option for GPU-based externalmemory. After 3.0, XGBoost uses the regular batch fetching as the default while the pageconcatenation can be enabled by:
param={"device":"cuda","extmem_single_page":true,'subsample':0.2,'sampling_method':'gradient_based',}
For more information about the sampling algorithm and its use in external memory training,seethis paper. Lastly, see following sections forbest practices.
NVLink-C2C
The newer NVIDIA platforms likeGrace-Hopper useNVLink-C2C, which facilitates a fastinterconnect between the CPU and the GPU. With the host memory serving as the data cache,XGBoost can retrieve data with significantly lower overhead. When the input data is dense,there’s minimal to no performance loss for training, except for the initial constructionof theExtMemQuantileDMatrix. The initial construction iteratesthrough the input data twice, as a result, the most significant overhead compared toin-core training is one additional data read when the data is dense. Please note thatthere are multiple variants of the platform and they come with different C2Cbandwidths. During initial development of the feature, we used the LPDDR5 480G version,which has about 350GB/s bandwidth for host to device transfer. When choosing the variantfor training XGBoost models, one should pay extra attention to the C2C bandwidth.
Here we provide a simple example as a starting point for training with external memory. Weused this example for one of the benchmarks. To train a model with2 ^ 29 32-bitfloating point samples,512 features (total 1TB) on a GH200 (a H200 GPU connected to aGrace CPU by a chip-to-chip link) system. One can start with:- Evenly divide the data into 128 batches with 8GB per batch.- Define a custom iterator as previously described.- Set themax_quantile_batches parameter of theExtMemQuantileDMatrix to 32 (256GB per sub-stream for quantization). Load the data.- Start training withdevice=cuda.
To run experiments on these platforms, the open sourceNVIDIA Linux driverwith version>=565.47 is required, it should come with CTK 12.7 and laterversions. Lastly, there’s a known issue with Linux 6.11 that can lead to CUDA host memoryallocation failure with aninvalidargument error.
Adaptive Cache
Starting with 3.1, XGBoost introduces an adaptive cache for GPU-based external memorytraining. The feature helps split the data cache into a host cache and a device cache. Bykeeping a portion of the cache on the GPU, we can reduce the amount of data transferduring training when there’s sufficient amount of GPU memory. The feature can becontrolled by thecache_host_ratio parameter in thexgboost.ExtMemQuantileDMatrix. It is disabled when the device has full C2Cbandwidth since it’s not needed there. On devices that with reduced bandwidth or deviceswith PCIe connections, unless explicitly specified, the ratio is automatically estimatedbased on device memory size and the size of the dataset.
However, this parameter increases memory fragmentation as XGBoost needs large memory pageswith irregular sizes. As a result, you might see out of memory error after theconstruction of theDMatrix but before the actual training begins.
For reference, we tested the adaptive cache with a 128GB (512 features) dense 32bitfloating dataset using a NVIDIA A6000 GPU, which comes with 48GB device memory. Thecache_host_ratio was estimated to be about 0.3, meaning about 30 percent of thequantized cache was on the host and rest of 70 percent was actually in-core. Given thisratio, the overhead is minimal. However, the estimated ratio increases as the data sizegrows.
Non-Uniform Memory Access (NUMA)
On multi-socket systems,NUMA helps optimize data access byprioritizing memory that is local to each socket. On these systems, it’s essential to setthe correct affinity to reduce the overhead of cross-socket data access. Since the out ofcore training stages the data cache on the host and trains the model using a GPU, thetraining performance is particularly sensitive to the data read bandwidth. To provide somecontext, on a GB200 machine, accessing the wrong NUMA node from a GPU can reduce the C2Cbandwidth by half. Even if you are not using distributed training, you should still payattention to NUMA control since there’s no guarantee that your process will have thecorrect configuration.
We have tested two approaches of NUMA configuration. The first (and recommended) way is touse thenumactl command line available on Linux distributions:
numactl--membind=${NODEID}--cpunodebind=${NODEID}./myapp
To obtain the node ID, you can check the machine topology vianvidia-smi:
nvidia-smitopo-m
The columnNUMAAffinity lists the NUMA node ID for each GPU. In the example outputshown below, theGPU0 is associated with the0 node ID:
GPU0GPU1NIC0NIC1NIC2NIC3CPUAffinityNUMAAffinityGPUNUMAIDGPU0XNV18NODENODENODESYS0-7102GPU1NV18XSYSSYSSYSNODE72-143110NIC0NODESYSXPIXNODESYSNIC1NODESYSPIXXNODESYSNIC2NODESYSNODENODEXSYSNIC3SYSNODESYSSYSSYSX
Alternatively, one can also use thehwloc command line interface, please make sure thestrict flag is used:
hwloc-bind--strict--membindnode:${NODEID}--cpubindnode:${NODEID}./myapp
Another approach is to use the CPU affinity. Thedask-cuda project configures optimal CPU affinity for theDask interface through using thenvml library in addition to the Linux schedroutines. This can help guide the memory allocation policy but does not enforce it. As aresult, when the memory is under pressure, the OS can allocate memory on different NUMAnodes. On the other hand, it’s easier to use since launchers likeLocalCUDACluster have already integrated the solution.
We use the first approach for benchmarks as it has better enforcement.
Distributed Training
Distributed training is similar to in-core learning, but the work for frameworkintegration is still on-going. SeeExperimental support for distributed training with external memoryfor an example for using the communicator to build a simple pipeline. Since users candefine their custom data loader, it’s unlikely that existing distributed frameworksinterface in XGBoost can meet all the use cases, the example can be a starting point forusers who have custom infrastructure.
Best Practices
In previous sections, we demonstrated how to train a tree-based model with data residingon an external memory. In addition, we made some recommendations for batch size andNUMA. Here are some other configurations we find useful. The external memory featureinvolves iterating through data batches stored in a cache during tree construction. Foroptimal performance, we recommend using thegrow_policy=depthwise setting, whichallows XGBoost to build an entire layer of tree nodes with only a few batchiterations. Conversely, using thelossguide policy requires XGBoost to iterate overthe data set for each tree node, resulting in significantly slower performance (tree sizeis exponential to the depth).
In addition, thehist tree method should be preferred over theapprox tree methodas the former doesn’t recreate the histogram bins for every iteration. Creating thehistogram bins requires loading the raw input data, which is prohibitively expensive. TheExtMemQuantileDMatrix designed for thehist tree method can speedup the initial data construction and the evaluation significantly for external memory.
Since the external memory implementation focuses on training where XGBoost needs to accessthe entire dataset, only theX is divided into batches while everything else isconcatenated. As a result, it’s recommended for users to define their own management codeto iterate through the data for inference, especially for SHAP value computation. The sizeof SHAP matrix can be larger than the feature matrixX, making external memory inXGBoost less effective.
When external memory is used, the performance of CPU training is limited by disk IO(input/output) speed. This means that the disk IO speed primarily determines the trainingspeed. Similarly, PCIe bandwidth limits the GPU performance, assuming the CPU memory isused as a cache and address translation services (ATS) is unavailable. During development,we observed that typical data transfer in XGBoost with PCIe4x16 has about 24GB/s bandwidthand about 42GB/s with PCIe5, which is significantly lower than the GPU processingperformance. Whereas with a C2C-enabled machine, the performance of data transfer andprocessing in training are close to each other.
Running inference is much less computation-intensive than training and, hence, muchfaster. As a result, the performance bottleneck of inference is back to data transfer. ForGPU, the time it takes to read the data from host to device completely determines the timeit takes to run inference, even if a C2C link is available.
Xy_train=xgboost.ExtMemQuantileDMatrix(it_train,max_bin=n_bins)Xy_valid=xgboost.ExtMemQuantileDMatrix(it_valid,max_bin=n_bins,ref=Xy_train)
In addition, since the GPU implementation relies on asynchronous memory pool, which issubject to memory fragmentation even if theCudaAsyncMemoryResource isused. You might want to start the training with a fresh pool instead of starting trainingright after the ETL process. If you run into out-of-memory errors and you are convincedthat the pool is not full yet (pool memory usage can be profiled withnsight-system),consider using theArenaMemoryResource memory resource. Alternatively,usingCudaAsyncMemoryResource in conjunction withBinningMemoryResource(mr,21,25) instead ofthe defaultPoolMemoryResource.
During CPU benchmarking, we used an NVMe connected to a PCIe-4 slot. Other types ofstorage can be too slow for practical usage. However, your system will likely perform somecaching to reduce the overhead of the file read. See the following sections for remarks.
Remarks
When using external memory with XGBoost, data is divided into smaller chunks so that onlya fraction of it needs to be stored in memory at any given time. It’s important to notethat this method only applies to the predictor data (X), while other data, like labelsand internal runtime structures are concatenated. This means that memory reduction is mosteffective when dealing with wide datasets whereX is significantly larger in sizecompared to other data likey, while it has little impact on slim datasets.
As one might expect, fetching data on demand puts significant pressure on the storagedevice. Today’s computing devices can process way more data than storage devices can readin a single unit of time. The ratio is in the order of magnitudes. A GPU is capable ofprocessing hundreds of Gigabytes of floating-point data in a split second. On the otherhand, a four-lane NVMe storage connected to a PCIe-4 slot usually has about 6GB/s of datatransfer rate. As a result, the training is likely to be severely bounded by your storagedevice. Before adopting the external memory solution, some back-of-envelop calculationsmight help you determine its viability. For instance, if your NVMe drive can transfer 4GB(a reasonably practical number) of data per second, and you have a 100GB of data in acompressed XGBoost cache (corresponding to a dense float32 numpy array with 200GB, give ortake). A tree with depth 8 needs at least 16 iterations through the data when theparameter is optimal. You need about 14 minutes to train a single tree without accountingfor some other overheads and assume the computation overlaps with the IO. If your datasethappens to have a TB-level size, you might need thousands of trees to get a generalizedmodel. These calculations can help you get an estimate of the expected training time.
However, sometimes, we can ameliorate this limitation. One should also consider that theOS (mainly talking about the Linux kernel) can usually cache the data on host memory. Itonly evicts pages when new data comes in and there’s no room left. In practice, at leastsome portion of the data can persist in the host memory throughout the entire trainingsession. We are aware of this cache when optimizing the external memory fetcher. Thecompressed cache is usually smaller than the raw input data, especially when the input isdense without any missing value. If the host memory can fit a significant portion of thiscompressed cache, the performance should be decent after initialization. Our developmentso far focuses on following fronts of optimization for external memory:
Avoid iterating through the data whenever appropriate.
If the OS can cache the data, the performance should be close to in-core training.
For GPU, the actual computation should overlap with memory copy as much as possible.
Starting with XGBoost 2.0, the CPU implementation of external memory usesmmap. It hasnot been tested against system errors like disconnected network devices (SIGBUS). In theface of a bus error, you will see a hard crash and need to clean up the cache files. Ifthe training session might take a long time and you use solutions like NVMe-oF, werecommend checkpointing your model periodically. Also, it’s worth noting that most testshave been conducted on Linux distributions.
Another important point to keep in mind is that creating the initial cache for XGBoost maytake some time. The interface to external memory is through custom iterators, which we cannot assume to be thread-safe. Therefore, initialization is performed sequentially. Usingtheconfig_context() withverbosity=2 can give you some information onwhat XGBoost is doing during the wait if you don’t mind the extra output.
Compared to the QuantileDMatrix
Passing an iterator to theQuantileDMatrix enables directconstruction ofQuantileDMatrix with data chunks. On the other hand,if it’s passed to theDMatrix or theExtMemQuantileDMatrix, it instead enables the external memoryfeature. TheQuantileDMatrix concatenates the data in memory aftercompression and doesn’t fetch data during training. On the other hand, the external memoryDMatrix (ExtMemQuantileDMatrix) fetches databatches from external memory on demand. Use theQuantileDMatrix (withiterator if necessary) when you can fit most of your data in memory. For many platforms,the training speed can be an order of magnitude faster than external memory.
Brief History
For a long time, external memory support has been an experimental feature and hasundergone multiple development iterations. Here’s a brief summary of major changes:
Gradient-based sampling was introduced to the GPU hist in 1.1.
The iterator interface was introduced in 1.5, along with a major rewrite for theinternal framework.
2.0 introduced the use of
mmap, along with optimization in XBGoost to enablezero-copy data fetching.3.0 reworked the GPU implementation to support caching data on the host and disk,introduced the
ExtMemQuantileDMatrixclass, added quantile-basedobjectives support.In addition, we begin support for distributed training in 3.0
3.1 added support for having divided cache pages. One can have part of a cache page inthe GPU and the rest of the cache in the host memory. In addition, XGBoost works withthe Grace Blackwell hardware decompression engine when data is sparse.
The text file cache format has been removed in 3.1.0.