Tree Methods

For training boosted tree models, there are 2 parameters used for choosing algorithms, namely updater and tree_method. XGBoost has 3 built-in tree methods, namely exact, approx and hist. Along with these tree methods, there are also some free-standing updaters including refresh, prune and sync. The parameter updater is more primitive than tree_method, as the latter is just a pre-configuration of the former. The difference is mostly due to historical reasons: each updater requires some specific configurations and might have missing features. As we are moving forward, the gap between them is becoming more and more irrelevant. We will collectively document them under tree methods.

Exact Solution

Exact means XGBoost considers all candidates from the data for tree splitting, but the underlying objective is still interpreted as a Taylor expansion.

  1. exact: The vanilla gradient boosting tree algorithm described in the reference paper. During split-finding, it iterates over all entries of the input data. It's more accurate (among other greedy methods) but computationally slower compared to other tree methods. Furthermore, its feature set is limited. Features like distributed training and external memory that require approximated quantiles are not supported. This tree method can be used with the parameter tree_method set to exact.
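The snippet below is a minimal sketch of selecting this method through the Python package; the synthetic data and hyperparameter values are placeholders for illustration only.

```python
import numpy as np
import xgboost as xgb

# Placeholder data; any numeric regression dataset works here.
rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.random(1000)

# tree_method="exact" enumerates every feature value as a split candidate.
model = xgb.XGBRegressor(tree_method="exact", n_estimators=10, max_depth=4)
model.fit(X, y)
```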

Approximated Solutions

As the exact tree method is slow in terms of computation performance and difficult to scale, we often employ approximated training algorithms. These algorithms build a gradient histogram for each node and iterate through the histogram instead of the real dataset. Here we introduce the implementations in XGBoost.

  1. approx tree method: An approximation tree method described in the reference paper. It runs sketching before building each tree using all the rows (rows belonging to the root). The hessian is used as weights during the sketch. The algorithm can be accessed by setting tree_method to approx.

  2. hist tree method: An approximation tree method used in LightGBM, with slight differences in implementation. It runs sketching before training using only user-provided weights instead of the hessian. The subsequent per-node histogram is built upon this global sketch. This is the fastest algorithm as it runs sketching only once. The algorithm can be accessed by setting tree_method to hist. A short usage sketch follows this list.
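The following sketch contrasts the two approximated methods through the native interface; the data and the number of boosting rounds are placeholders, not a benchmark.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
dtrain = xgb.DMatrix(rng.random((1000, 10)), label=rng.random(1000))

# approx: re-runs hessian-weighted sketching before building each tree.
booster_approx = xgb.train({"tree_method": "approx"}, dtrain, num_boost_round=10)

# hist: sketches once before training and reuses the global bins for every
# per-node histogram, which is why it is usually the fastest method.
booster_hist = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=10)
```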

Implications

Some objectives like reg:squarederror have a constant hessian. In this case, hist should be preferred as weighted sketching doesn't make sense with constant weights. When using non-constant hessian objectives, sometimes approx yields better accuracy, but with slower computation performance. Most of the time, using hist with a higher max_bin can achieve similar or even superior accuracy while maintaining good performance. However, as XGBoost is largely driven by community effort, the actual implementations have some differences from the pure mathematical description. Results might be slightly different from expectations, which we are currently trying to overcome.
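As a rough illustration of the max_bin trade-off described above, the sketch below trains hist with increasing bin counts on a synthetic regression task and prints the final training RMSE; the dataset and values are placeholders only.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((5000, 20))
y = X @ rng.random(20) + rng.normal(scale=0.1, size=5000)
dtrain = xgb.DMatrix(X, label=y)

for max_bin in (64, 256, 1024):
    # More bins give a finer histogram (closer to exact splits) at the
    # cost of a more expensive sketching step.
    params = {
        "tree_method": "hist",
        "max_bin": max_bin,
        "objective": "reg:squarederror",
    }
    evals_result = {}
    xgb.train(params, dtrain, num_boost_round=20,
              evals=[(dtrain, "train")], evals_result=evals_result,
              verbose_eval=False)
    print(max_bin, evals_result["train"]["rmse"][-1])
```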

Other Updaters

  1. Prune: It prunes the existing trees. prune is usually used as part of other tree methods. To use the pruner independently, one needs to set the process type to update by: {"process_type": "update", "updater": "prune"}. With this set of parameters, during training, XGBoost will prune the existing trees according to 2 parameters, min_split_loss (gamma) and max_depth.

  2. Refresh: Refresh the statistics of built trees on a new training dataset. Like the pruner, to use refresh independently, one needs to set the process type to update: {"process_type": "update", "updater": "refresh"}. During training, the updater will change statistics like cover and weight according to the new training dataset. When refresh_leaf is also set to true (default), XGBoost will update the leaf values according to the new leaf weights, but the tree structure (split condition) itself doesn't change.

    There are examples of both training continuation (adding new trees) and using the update process in demo/guide-python. Also check out the process_type parameter in XGBoost Parameters. A minimal sketch of the update process is also shown after this list.

  3. Sync: Synchronize the tree among workers when running distributed training.
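Below is a minimal sketch combining the prune and refresh updaters on an already-trained booster; the data, depths and penalty values are arbitrary and chosen only to make the example self-contained.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
dtrain = xgb.DMatrix(rng.random((1000, 10)), label=rng.random(1000))

# Train an initial model with the ordinary growing process.
booster = xgb.train({"tree_method": "hist", "max_depth": 6}, dtrain,
                    num_boost_round=10)

# Prune the existing 10 trees in place; no new trees are added.
pruned = xgb.train(
    {"process_type": "update", "updater": "prune",
     "max_depth": 3, "min_split_loss": 1.0},
    dtrain, num_boost_round=10, xgb_model=booster)

# Refresh cover/weight statistics (and leaf values, since refresh_leaf
# defaults to true) on a new dataset; the tree structure stays unchanged.
dnew = xgb.DMatrix(rng.random((1000, 10)), label=rng.random(1000))
refreshed = xgb.train(
    {"process_type": "update", "updater": "refresh", "refresh_leaf": 1},
    dnew, num_boost_round=10, xgb_model=pruned)
```

Note that with the process type set to update, the number of boosting rounds should not exceed the number of trees already in the model being updated.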

Removed Updaters

3 updaters were removed during development due to maintainability concerns. We describe them here solely for the interest of documentation.

  1. Distributed colmaker, which was a distributed version of the exact tree method. It required specialization for a column-based splitting strategy and a different prediction procedure. As the exact tree method is slow by itself and scaling is even less efficient, we removed it entirely.

  2. skmaker. Per-node weighted sketching employed by grow_local_histmaker is slow, while skmaker was unmaintained and seemed to be a workaround trying to eliminate the histogram creation step and use sketching values directly during split evaluation. It was never tested and contained some unknown bugs; we decided to remove it and focus our resources on more promising algorithms instead. For accuracy, most of the time approx and hist are enough with some parameter tuning, so removing it doesn't have any real practical impact.

  3. grow_local_histmaker updater: An approximation tree method described in the reference paper. This updater was rarely used in practice, so it remained an updater rather than a tree method. During split finding, it first runs a weighted GK sketching for the data points belonging to the current node to find split candidates, using the hessian as weights. The histogram is built upon this per-node sketch. It was faster than exact in some applications, but still slow in computation. It was removed because it depended on Rabit's customized reduction function, which handles all data structures that can be serialized/deserialized into a fixed-size buffer. This is not directly supported by NCCL or federated learning gRPC, making it hard to refactor into a common allreduce interface.

Feature Matrix

The following table summarizes some differences in supported features between the tree methods; T means supported while F means unsupported.

|                  | Exact     | Approx              | Approx (GPU)           | Hist                | Hist (GPU)             |
|------------------|-----------|---------------------|------------------------|---------------------|------------------------|
| grow_policy      | Depthwise | depthwise/lossguide | depthwise/lossguide    | depthwise/lossguide | depthwise/lossguide    |
| max_leaves       | F         | T                   | T                      | T                   | T                      |
| sampling method  | uniform   | uniform             | gradient_based/uniform | uniform             | gradient_based/uniform |
| categorical data | F         | T                   | T                      | T                   | T                      |
| External memory  | F         | T                   | P                      | T                   | P                      |
| Distributed      | F         | T                   | T                      | T                   | T                      |

Features/parameters that are not mentioned here are universally supported for all tree methods (for instance, column sampling and constraints). The P in external memory means special handling. Please note that both categorical data and external memory are experimental.
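As a small example tied to the matrix above, the sketch below uses hist with the lossguide grow policy and a max_leaves cap, a combination the exact method does not support; the data and values are placeholders.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
dtrain = xgb.DMatrix(rng.random((1000, 10)), label=rng.random(1000))

params = {
    "tree_method": "hist",
    "grow_policy": "lossguide",  # grow the leaf with the best loss reduction first
    "max_leaves": 31,            # cap the number of leaves per tree
}
xgb.train(params, dtrain, num_boost_round=10)
```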