Advanced Usage of Custom Objectives


Overview

XGBoost allows optimizing custom user-defined functions based on gradients and Hessians provided by the user for the desired objective function.

In order for a custom objective to work as intended:

  • The function to optimize must be smooth and twice differentiable.

  • The function must be additive with respect to rows / observations, such as a likelihood function with i.i.d. assumptions.

  • The range of the scores for the function must be unbounded (i.e. it should not work exclusively with positive numbers, for example).

  • The function must be convex. Note that, if the Hessian has negative values, they will be clipped, which will likely result in a model that does not fit the function well.

  • For multi-output objectives, there should not be dependencies between different targets (i.e. the Hessian should be diagonal for each row).

Some of these limitations can nevertheless be worked around by foregoing the true Hessian of the function, using something else instead such as an approximation with better properties - convergence might be slower when not using the true Hessian of a function, but many theoretical guarantees should still hold and result in usable models. For example, XGBoost’s internal implementation of multinomial logistic regression uses an upper bound on the Hessian with diagonal structure instead of the true Hessian, which is a full square matrix for each row in the data.
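As a concrete illustration of that idea, here is a minimal sketch of such a diagonal bound for the multinomial logistic case (assuming a one-hot label matrix; this mirrors the construction rather than reproducing XGBoost’s actual source code). The true per-row Hessian is diag(p) - p p^T, and summing absolute values across each of its rows gives the diagonal bound 2 * p * (1 - p):

import numpy as np
from scipy.special import softmax

def softmax_grad_diag_hess_bound(pred: np.ndarray, y_onehot: np.ndarray):
    # Gradient of the multinomial negative log-likelihood w.r.t. the raw scores
    p = softmax(pred, axis=1)
    grad = p - y_onehot
    # The true per-row Hessian is diag(p) - p p^T; summing absolute values
    # across each of its rows yields the diagonal bound 2 * p * (1 - p)
    hess_bound = 2.0 * p * (1.0 - p)
    return grad, hess_bound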

This tutorial provides some suggestions for use-cases that do not perfectly fit the criteria outlined above, by showing how to solve a Dirichlet regression parameterized by concentrations.

A Dirichlet regression model poses certain challenges for XGBoost:

  • Concentration parameters must be positive. An easy way to achieve this is by applying an ‘exp’ transform on raw unbounded values, but in such case the objective becomes non-convex. Furthermore, note that this function is not in the exponential family, unlike typical distributions used for GLM models.

  • The Hessian has dependencies between targets - that is, for a Dirichlet distribution with ‘k’ parameters, each row will have a full Hessian matrix of dimensions [k, k].

  • An optimal intercept for this type of model would involve a vector of values rather than the same value for every target.

In order to use this type of model as a custom objective:

  • It’s possible to use the expected Hessian (a.k.a. the Fisher information matrix or expected information) instead of the true Hessian. The expected Hessian is always positive semi-definite for an additive likelihood, even if the true Hessian isn’t.

  • It’s possible to use an upper bound on the expected Hessian with a diagonal structure, such that a second-order approximation under this diagonal bound would always yield greater or equal function values than under the non-diagonal expected Hessian.

  • Since the base_score parameter that XGBoost uses for an intercept is limited to a scalar, one can use the base_margin functionality instead, but note that using it requires a bit more effort.

Dirichlet Regression Formulae

The Dirichlet distribution is a generalization of the Beta distribution to multiple dimensions. It models proportions data in which the values sum to 1, and is typically used as part of composite models (e.g. Dirichlet-multinomial) or as a prior in Bayesian models, but it can also be used on its own for proportions data.

Its likelihood for a given observation with values ‘y’ and a given prediction ‘x’ is given as follows:

\[L(\mathbf{y} | \mathbf{x}) = \frac{1}{\beta(\mathbf{x})} \prod_{i=1}^k y_i^{x_i - 1}\]

Where:

\[\beta(\mathbf{x}) = \frac{ \prod_{i=1}^k \Gamma(x_i) }{\Gamma( \sum_{i=1}^k x_i )}\]
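Since XGBoost works with unbounded raw scores, the implementation below applies the ‘exp’ transform mentioned earlier, modeling the concentrations as the exponential of the predicted scores ‘z’ (i.e. x_i = exp(z_i)). Plugging this into the likelihood above, the negative log-likelihood of a single row works out to:

\[- \log L(\mathbf{y} | \mathbf{z}) = \sum_{i=1}^k \log \Gamma \left( e^{z_i} \right) - \log \Gamma \left( \sum_{i=1}^k e^{z_i} \right) - \sum_{i=1}^k \left( e^{z_i} - 1 \right) \log y_i\]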

In this case, we want to optimize the negative of the log-likelihood summed across rows. The resulting function, gradient and Hessian could be implemented as follows:

import numpy as np
from scipy.special import loggamma, psi as digamma, polygamma

trigamma = lambda x: polygamma(1, x)

# Negative log-likelihood, with the concentrations parameterized as the
# exponential of the raw scores 'pred'
def dirichlet_fun(pred: np.ndarray, Y: np.ndarray) -> float:
    epred = np.exp(pred)
    sum_epred = np.sum(epred, axis=1, keepdims=True)
    return (
        loggamma(epred).sum()
        - loggamma(sum_epred).sum()
        - np.sum(np.log(Y) * (epred - 1))
    )

# Gradient of 'dirichlet_fun' with respect to the raw scores
def dirichlet_grad(pred: np.ndarray, Y: np.ndarray) -> np.ndarray:
    epred = np.exp(pred)
    return epred * (
        digamma(epred)
        - digamma(np.sum(epred, axis=1, keepdims=True))
        - np.log(Y)
    )

# True Hessian of 'dirichlet_fun': one [k, k] matrix per row of the data
def dirichlet_hess(pred: np.ndarray, Y: np.ndarray) -> np.ndarray:
    epred = np.exp(pred)
    grad = dirichlet_grad(pred, Y)
    k = Y.shape[1]
    H = np.empty((pred.shape[0], k, k))
    for row in range(pred.shape[0]):
        H[row, :, :] = (
            - trigamma(epred[row].sum()) * np.outer(epred[row], epred[row])
            + np.diag(grad[row] + trigamma(epred[row]) * epred[row] ** 2)
        )
    return H

Convince yourself that the implementation is correct:

from math import isclose
from scipy import stats
from scipy.optimize import check_grad
from scipy.special import softmax

def gen_random_dirichlet(rng: np.random.Generator, m: int, k: int):
    alpha = np.exp(rng.standard_normal(size=k))
    return rng.dirichlet(alpha, size=m)

def test_dirichlet_fun_grad_hess():
    k = 3
    m = 10
    rng = np.random.default_rng(seed=123)
    Y = gen_random_dirichlet(rng, m, k)
    x0 = rng.standard_normal(size=k)
    for row in range(Y.shape[0]):
        fun_row = dirichlet_fun(x0.reshape((1, -1)), Y[[row]])
        ref_logpdf = stats.dirichlet.logpdf(
            Y[row] / Y[row].sum(),  # <- avoid roundoff error
            np.exp(x0),
        )
        assert isclose(fun_row, -ref_logpdf)

        gdiff = check_grad(
            lambda pred: dirichlet_fun(pred.reshape((1, -1)), Y[[row]]),
            lambda pred: dirichlet_grad(pred.reshape((1, -1)), Y[[row]]),
            x0,
        )
        assert gdiff <= 1e-6

        H_numeric = np.empty((k, k))
        eps = 1e-7
        for ii in range(k):
            x0_plus_eps = x0.reshape((1, -1)).copy()
            x0_plus_eps[0, ii] += eps
            for jj in range(k):
                H_numeric[ii, jj] = (
                    dirichlet_grad(x0_plus_eps, Y[[row]])[0][jj]
                    - dirichlet_grad(x0.reshape((1, -1)), Y[[row]])[0][jj]
                ) / eps
        H = dirichlet_hess(x0.reshape((1, -1)), Y[[row]])[0]
        np.testing.assert_almost_equal(H, H_numeric, decimal=6)

test_dirichlet_fun_grad_hess()

Dirichlet Regression as Objective Function

As mentioned earlier, the Hessian of this function is problematic for XGBoost: it can have a negative determinant, and might even have negative values in the diagonal, which is problematic for optimization methods - in XGBoost, those values would be clipped and the resulting model might not end up producing sensible predictions.

A potential workaround is to use the expected Hessian instead - that is, the expected outer product of the gradient if the response variable were distributed according to what is predicted. See the Wikipedia article for more information:

https://en.wikipedia.org/wiki/Fisher_information

In general, for objective functions in the exponential family, this is easy to obtain from the gradient of the link function and the variance of the probability distribution, but for other functions it might involve other types of calculations (e.g. covariances and covariances of logarithms for Dirichlet).

It nevertheless results in a form very similar to the Hessian. One can also see from the differences here that, at an optimal point (gradient being zero), the expected and true Hessian for Dirichlet will match, which is a nice property for optimization (i.e. the Hessian will be positive at a stationary point, which means it will be a minimum rather than a maximum or saddle point).
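For reference, the expected information of a Dirichlet distribution with concentrations ‘x’ (writing x_0 for their sum, and psi_1 for the trigamma function) has the well-known closed form below. The function that follows is this matrix pre- and post-multiplied by diag(x), which accounts for the ‘exp’ transform applied to the raw scores - the remaining diag-of-gradient term of the true Hessian has expectation zero:

\[\mathcal{I}(\mathbf{x})_{ij} = \psi_1(x_i) \, \delta_{ij} - \psi_1(x_0)\]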

def dirichlet_expected_hess(pred: np.ndarray) -> np.ndarray:
    epred = np.exp(pred)
    k = pred.shape[1]
    Ehess = np.empty((pred.shape[0], k, k))
    for row in range(pred.shape[0]):
        Ehess[row, :, :] = (
            - trigamma(epred[row].sum()) * np.outer(epred[row], epred[row])
            + np.diag(trigamma(epred[row]) * epred[row] ** 2)
        )
    return Ehess

def test_dirichlet_expected_hess():
    k = 3
    rng = np.random.default_rng(seed=123)
    x0 = rng.standard_normal(size=k)
    y_sample = rng.dirichlet(np.exp(x0), size=5_000_000)
    x_broadcast = np.broadcast_to(x0, (y_sample.shape[0], k))
    g_sample = dirichlet_grad(x_broadcast, y_sample)
    ref = (g_sample.T @ g_sample) / y_sample.shape[0]
    Ehess = dirichlet_expected_hess(x0.reshape((1, -1)))[0]
    np.testing.assert_almost_equal(Ehess, ref, decimal=2)

test_dirichlet_expected_hess()

But note that this is still not usable for XGBoost, since the expected Hessian, just like the true Hessian, has shape [nrows, k, k], while XGBoost requires something with shape [nrows, k].

One may use the diagonal of the expected Hessian for each row, but it’s possible to do better: one can use instead an upper bound with diagonal structure, since it should lead to better convergence properties, just like for other Hessian-based optimization methods.

In the absence of any obvious way of obtaining an upper bound, a possibility here is to construct such a bound numerically based directly on the definition of a diagonally dominant matrix:

https://en.wikipedia.org/wiki/Diagonally_dominant_matrix
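To see why this construction yields a valid bound: if ‘D’ is the diagonal matrix whose entries are the row-wise sums of absolute values of a symmetric matrix ‘H’, then ‘D - H’ is diagonally dominant with a non-negative diagonal and therefore positive semi-definite, so that:

\[\mathbf{v}^T D \mathbf{v} \geq \mathbf{v}^T H \mathbf{v} \quad \text{for every } \mathbf{v}\]

which is exactly what is needed for a second-order approximation under ‘D’ to upper-bound the one under ‘H’.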

That is: for each row of the data, take the absolute values of its [k, k]-shaped expected Hessian and sum them across each of that matrix’s rows:

# Diagonal upper bound on the expected Hessian, built by summing absolute
# values across the rows of each [k, k] expected Hessian
def dirichlet_diag_upper_bound_expected_hess(pred: np.ndarray, Y: np.ndarray) -> np.ndarray:
    Ehess = dirichlet_expected_hess(pred)
    diag_bound_Ehess = np.empty((pred.shape[0], Y.shape[1]))
    for row in range(pred.shape[0]):
        diag_bound_Ehess[row, :] = np.abs(Ehess[row, :, :]).sum(axis=1)
    return diag_bound_Ehess

(note: the calculation can be done more efficiently than what is shown here by not calculating the full matrix; and in R, by making the rows be the last dimension and transposing after the fact)
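For example, a hypothetical vectorized variant (not part of the recipe above) could exploit the closed form of the expected Hessian directly, without ever materializing the full [nrows, k, k] array:

def dirichlet_diag_upper_bound_expected_hess_vectorized(pred: np.ndarray) -> np.ndarray:
    # For each row, the expected Hessian has off-diagonal entries
    # -trigamma(a0) * a_i * a_j and diagonal entries
    # trigamma(a_i) * a_i**2 - trigamma(a0) * a_i**2, with a = exp(pred)
    # and a0 = a.sum(); summing absolute values along each row gives:
    epred = np.exp(pred)
    sum_epred = epred.sum(axis=1, keepdims=True)
    tg_sum = trigamma(sum_epred)
    diag_abs = np.abs(trigamma(epred) * epred ** 2 - tg_sum * epred ** 2)
    offdiag_abs = tg_sum * epred * (sum_epred - epred)
    return diag_abs + offdiag_abs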

With all these pieces in place, one can now frame this model into the format required for XGBoost’s custom objectives:

import xgboost as xgb
from typing import Tuple

def dirichlet_xgb_objective(
    pred: np.ndarray, dtrain: xgb.DMatrix
) -> Tuple[np.ndarray, np.ndarray]:
    Y = dtrain.get_label().reshape(pred.shape)
    return (
        dirichlet_grad(pred, Y),
        dirichlet_diag_upper_bound_expected_hess(pred, Y),
    )

And an evaluation metric for monitoring, based on the Dirichlet log-likelihood:

def dirichlet_eval_metric(pred: np.ndarray, dtrain: xgb.DMatrix) -> Tuple[str, float]:
    Y = dtrain.get_label().reshape(pred.shape)
    return "dirichlet_ll", dirichlet_fun(pred, Y)

Practical Example

A good source of test datasets for proportions data is the R package DirichletReg:

https://cran.r-project.org/package=DirichletReg

For this example, we’ll now use the Arctic Lake dataset (Aitchison, J. (2003). The Statistical Analysis of Compositional Data. The Blackburn Press, Caldwell, NJ.), taken from the DirichletReg R package, which consists of 39 rows with one predictor variable ‘depth’ and a three-valued response variable denoting the sediment composition of the measurements in this arctic lake (sand, silt, clay).

The data:

# depth
X = np.array([
    10.4, 11.7, 12.8, 13, 15.7, 16.3, 18, 18.7, 20.7, 22.1,
    22.4, 24.4, 25.8, 32.5, 33.6, 36.8, 37.8, 36.9, 42.2, 47,
    47.1, 48.4, 49.4, 49.5, 59.2, 60.1, 61.7, 62.4, 69.3, 73.6,
    74.4, 78.5, 82.9, 87.7, 88.1, 90.4, 90.6, 97.7, 103.7,
]).reshape((-1, 1))

# sand, silt, clay
Y = np.array([
    [0.775, 0.195, 0.03],
    [0.719, 0.249, 0.032],
    [0.507, 0.361, 0.132],
    [0.522, 0.409, 0.066],
    [0.7, 0.265, 0.035],
    [0.665, 0.322, 0.013],
    [0.431, 0.553, 0.016],
    [0.534, 0.368, 0.098],
    [0.155, 0.544, 0.301],
    [0.317, 0.415, 0.268],
    [0.657, 0.278, 0.065],
    [0.704, 0.29, 0.006],
    [0.174, 0.536, 0.29],
    [0.106, 0.698, 0.196],
    [0.382, 0.431, 0.187],
    [0.108, 0.527, 0.365],
    [0.184, 0.507, 0.309],
    [0.046, 0.474, 0.48],
    [0.156, 0.504, 0.34],
    [0.319, 0.451, 0.23],
    [0.095, 0.535, 0.37],
    [0.171, 0.48, 0.349],
    [0.105, 0.554, 0.341],
    [0.048, 0.547, 0.41],
    [0.026, 0.452, 0.522],
    [0.114, 0.527, 0.359],
    [0.067, 0.469, 0.464],
    [0.069, 0.497, 0.434],
    [0.04, 0.449, 0.511],
    [0.074, 0.516, 0.409],
    [0.048, 0.495, 0.457],
    [0.045, 0.485, 0.47],
    [0.066, 0.521, 0.413],
    [0.067, 0.473, 0.459],
    [0.074, 0.456, 0.469],
    [0.06, 0.489, 0.451],
    [0.063, 0.538, 0.399],
    [0.025, 0.48, 0.495],
    [0.02, 0.478, 0.502],
])

Fitting an XGBoost model and making predictions:

from typing import Dict, List

dtrain = xgb.DMatrix(X, label=Y)
results: Dict[str, Dict[str, List[float]]] = {}
booster = xgb.train(
    params={
        "tree_method": "hist",
        "num_target": Y.shape[1],
        "base_score": 0,
        "disable_default_eval_metric": True,
        "max_depth": 3,
        "seed": 123,
    },
    dtrain=dtrain,
    num_boost_round=10,
    obj=dirichlet_xgb_objective,
    evals=[(dtrain, "Train")],
    evals_result=results,
    custom_metric=dirichlet_eval_metric,
)
yhat = softmax(booster.inplace_predict(X), axis=1)

Should produce an evaluation log as follows (note: the function is decreasing as expected - but unlike other objectives, the minimum value here can reach below zero):

[0] Train-dirichlet_ll:-40.25009
[1] Train-dirichlet_ll:-47.69122
[2] Train-dirichlet_ll:-52.64620
[3] Train-dirichlet_ll:-56.36977
[4] Train-dirichlet_ll:-59.33048
[5] Train-dirichlet_ll:-61.93359
[6] Train-dirichlet_ll:-64.17280
[7] Train-dirichlet_ll:-66.29709
[8] Train-dirichlet_ll:-68.21001
[9] Train-dirichlet_ll:-70.03442

One can confirm that the obtained yhat resembles the actual proportions to a large degree, beyond what would be expected from random predictions, by a simple look at both yhat and Y.
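For instance, a quick side-by-side look at the first few rows (a minimal sketch; the softmax of the raw predictions equals the predicted concentrations divided by their sum, i.e. the mean of the predicted Dirichlet distribution):

# Observed compositions (sand, silt, clay) next to predicted mean proportions
print(np.round(np.hstack([Y[:5], yhat[:5]]), 3))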

For better results, one might want to add an intercept. XGBoost only allows using scalars for intercepts, but for a vector-valued model, the optimal intercept should also have vector form.

This can be done by supplying base_margin instead - unlike the intercept, one must specifically supply values for every row here, and said base_margin must be supplied again at the moment of making predictions (i.e. it does not get added automatically like base_score does).

For the case of a Dirichlet model, the optimal intercept can be obtained efficiently using a general solver (e.g. SciPy’s Newton solver) with dedicated likelihood, gradient and Hessian functions for just the intercept part. Further, note that if one frames it instead as bounded optimization without applying the ‘exp’ transform to the concentrations, it becomes a convex problem, for which the true Hessian can be used without issues in other classes of solvers.
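As a sketch of that alternative (hypothetical, not what this tutorial uses): one could optimize a shared concentration vector directly under positivity bounds, for example with SciPy’s L-BFGS-B solver, and then take logarithms to go back to raw-score space:

import numpy as np
from scipy.optimize import minimize
from scipy.special import loggamma

def get_optimal_intercepts_bounded(Y: np.ndarray) -> np.ndarray:
    # Negative log-likelihood of a single concentration vector 'alpha'
    # shared by all rows, written directly in terms of 'alpha' (convex)
    m, k = Y.shape
    def nll(alpha: np.ndarray) -> float:
        return (
            m * loggamma(alpha).sum()
            - m * loggamma(alpha.sum())
            - np.sum(np.log(Y) * (alpha - 1))
        )
    res = minimize(
        fun=nll,
        x0=np.ones(k),
        method="L-BFGS-B",
        bounds=[(1e-6, None)] * k,
    )
    # Raw scores are the logarithm of the concentrations
    return np.log(res["x"])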

For simplicity, this example will nevertheless reuse the same likelihood and gradient functions that were defined earlier, along with SciPy’s / R’s L-BFGS solver, to obtain the optimal vector-valued intercept:

from scipy.optimize import minimize

def get_optimal_intercepts(Y: np.ndarray) -> np.ndarray:
    k = Y.shape[1]
    res = minimize(
        fun=lambda pred: dirichlet_fun(
            np.broadcast_to(pred, (Y.shape[0], k)),
            Y,
        ),
        x0=np.zeros(k),
        jac=lambda pred: dirichlet_grad(
            np.broadcast_to(pred, (Y.shape[0], k)),
            Y,
        ).sum(axis=0),
    )
    return res["x"]

intercepts = get_optimal_intercepts(Y)

Now fitting a model again, this time with the intercept:

base_margin = np.broadcast_to(intercepts, Y.shape)
dtrain_w_intercept = xgb.DMatrix(X, label=Y, base_margin=base_margin)
results: Dict[str, Dict[str, List[float]]] = {}
booster = xgb.train(
    params={
        "tree_method": "hist",
        "num_target": Y.shape[1],
        "base_score": 0,
        "disable_default_eval_metric": True,
        "max_depth": 3,
        "seed": 123,
    },
    dtrain=dtrain_w_intercept,
    num_boost_round=10,
    obj=dirichlet_xgb_objective,
    evals=[(dtrain, "Train")],
    evals_result=results,
    custom_metric=dirichlet_eval_metric,
)
yhat = softmax(
    booster.predict(xgb.DMatrix(X, base_margin=base_margin)),
    axis=1,
)

[0] Train-dirichlet_ll:-37.01861
[1] Train-dirichlet_ll:-42.86120
[2] Train-dirichlet_ll:-46.55133
[3] Train-dirichlet_ll:-49.15111
[4] Train-dirichlet_ll:-51.02638
[5] Train-dirichlet_ll:-52.53880
[6] Train-dirichlet_ll:-53.77409
[7] Train-dirichlet_ll:-54.88851
[8] Train-dirichlet_ll:-55.95961
[9] Train-dirichlet_ll:-56.95497

For this small example problem, predictions should be very similar between the two, and the version without intercepts achieved a lower objective function value on the training data (for the Python version at least), but for more serious usage with real-world data, one is likely to observe better results when adding the intercepts.