Python API Reference

This page gives the Python API reference of xgboost. Please also refer to Python Package Introduction for more information about the Python package.

Global Configuration

xgboost.config_context(**new_config)

Context manager for global XGBoost configuration.

Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.

Note

All settings, not just those presently modified, will be returned to their previous values when the context manager is exited. This is not thread-safe.

Added in version 1.4.0.

Parameters:

new_config (Dict[str,Any]) – Keyword arguments representing the parameters and their values

Return type:

Iterator[None]

Example

import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored

Nested configuration context is also supported:

Example

with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
    with xgb.config_context(verbosity=2):
        assert xgb.get_config()["verbosity"] == 2

xgb.set_config(verbosity=2)
assert xgb.get_config()["verbosity"] == 2
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3

See also

set_config

Set global XGBoost configuration

get_config

Get current values of the global configuration

xgboost.set_config(**new_config)

Set global configuration.

Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.

Added in version 1.4.0.

Parameters:

new_config (Dict[str,Any]) – Keyword arguments representing the parameters and their values

Return type:

None

Example

import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored

Nested configuration context is also supported:

Example

with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
    with xgb.config_context(verbosity=2):
        assert xgb.get_config()["verbosity"] == 2

xgb.set_config(verbosity=2)
assert xgb.get_config()["verbosity"] == 2
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3

xgboost.get_config()

Get current values of the global configuration.

Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Configuration for the full list of parameters supported in the global configuration.

Added in version 1.4.0.

Returns:

args – The list of global parameters and their values

Return type:

Dict[str, Any]

Example

import xgboost as xgb

# Show all messages, including ones pertaining to debugging
xgb.set_config(verbosity=2)

# Get current value of global configuration
# This is a dict containing all parameters in the global configuration,
# including 'verbosity'
config = xgb.get_config()
assert config['verbosity'] == 2

# Example of using the context manager xgb.config_context().
# The context manager will restore the previous value of the global
# configuration upon exiting.
with xgb.config_context(verbosity=0):
    # Suppress warning caused by model generated with XGBoost version < 1.0.0
    bst = xgb.Booster(model_file='./old_model.bin')
assert xgb.get_config()['verbosity'] == 2  # old value restored

Nested configuration context is also supported:

Example

with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3
    with xgb.config_context(verbosity=2):
        assert xgb.get_config()["verbosity"] == 2

xgb.set_config(verbosity=2)
assert xgb.get_config()["verbosity"] == 2
with xgb.config_context(verbosity=3):
    assert xgb.get_config()["verbosity"] == 3

xgboost.build_info()

Build information of XGBoost. The returned value format is not stable. Also, please note that a build time dependency is not the same as a runtime dependency. For instance, it's possible to build XGBoost with an older CUDA version but run it with the latest one.

Added in version 1.6.0.

Return type:

dict
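For instance, a quick way to inspect the build (a minimal sketch; the exact keys in the returned dict depend on how the library was compiled):

import xgboost as xgb

info = xgb.build_info()
# The result is a plain dict; iterate to inspect compile-time options.
for key, value in info.items():
    print(key, value)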

Core Data Structure

Core XGBoost Library.

class xgboost.DMatrix(data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False, data_split_mode=DataSplitMode.ROW)

Bases: object

Data Matrix used in XGBoost.

DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct a DMatrix from multiple different sources of data.
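For illustration, a minimal sketch of constructing a DMatrix from NumPy arrays (synthetic data; the feature names are arbitrary):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 10)              # 100 samples, 10 features
y = np.random.randint(2, size=100)       # binary labels
dtrain = xgb.DMatrix(X, label=y, feature_names=[f"f{i}" for i in range(10)])
print(dtrain.num_row(), dtrain.num_col())  # 100 10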

Parameters:
  • data (Any) –

    Data source of DMatrix. See Supported data structures for various XGBoost functions for a list of supported input types.

    Note that, if passing an iterator, it will cache data on disk, and note that fields like label will be concatenated in-memory from multiple calls to the iterator.

  • label (Any |None) – Label of the training data.

  • weight (Any |None) –

    Weight for each instance.

    Note

    For ranking task, weights are per-group. In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn't make sense to assign weights to individual data points.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • missing (float | None) – Value in the input data which needs to be present as a missing value. If None, defaults to np.nan.

  • silent (bool) – Whether to print messages during construction.

  • feature_names (Sequence[str]|None) – Set names for features.

  • feature_types (Sequence[str]|None) –

    Set types for features. If data is a DataFrame type and passing enable_categorical=True, the types will be deduced automatically from the column types.

    Otherwise, one can pass a list-like input with the same length as the number of columns in data, with the following possible values:

    • "c", which represents categorical columns.

    • "q", which represents numeric columns.

    • "int", which represents integer columns.

    • "i", which represents boolean columns.

    Note that, while categorical types are treated differently from the rest for model fitting purposes, the other types do not influence the generated model, but have effects in other functionalities such as feature importances.

    For categorical features, the input is assumed to be preprocessed and encoded by the users. The encoding can be done via sklearn.preprocessing.OrdinalEncoder or the pandas dataframe .cat.codes method. This is useful when users want to specify categorical features without having to construct a dataframe as input.

  • nthread (int | None) – Number of threads to use for loading data when parallelization is applicable. If -1, uses the maximum number of threads available on the system.

  • group (Any | None) – Group size for all ranking groups.

  • qid (Any |None) – Query ID for data samples, used for ranking.

  • label_lower_bound (Any |None) – Lower bound for survival training.

  • label_upper_bound (Any |None) – Upper bound for survival training.

  • feature_weights (Any |None) – Set feature weights for column sampling.

  • enable_categorical (bool) –

    Added in version 1.3.0.

    Note

    This parameter is experimental

    Experimental support of specializing for categorical features. See Categorical Data for more info.

    If passing True and data is a data frame (from supported libraries such as Pandas, Modin or cuDF), the DMatrix recognizes categorical columns and automatically sets the feature_types parameter. If data is not a data frame, this argument is ignored.

    If passing False and data is a data frame with categorical columns, it will result in an error.

    See notes in the DataIter for consistency requirements when the input is an iterator.

    Changed in version 3.1.0.

    XGBoost can remember the encoding of categories when the input is a dataframe.

  • data_split_mode (DataSplitMode)

data_split_mode()

Get the data split mode of the DMatrix.

Added in version 2.1.0.

Return type:

DataSplitMode

property feature_names: Sequence[str] | None

Labels for features (column labels).

Setting it to None resets existing feature names.

property feature_types: Sequence[str] | None

Type of features (column types).

This is for displaying the results and categorical data support. See DMatrix for details.

Setting it to None resets existing feature types.

get_base_margin()

Get the base margin of the DMatrix.

Return type:

base_margin

get_categories()

Get the categories in the dataset using pyarrow. Returns None if there are no categorical features.

Warning

This function is still a work in progress.

Added in version 3.1.0.

Return type:

Dict[str, pa.DictionaryArray] | None

get_data()

Get the predictors from DMatrix as a CSR matrix. This getter is mostly for testing purposes. If this is a quantized DMatrix then quantized values are returned instead of input values.

Added in version 1.7.0.

Return type:

csr_matrix

get_float_info(field)

Get float property from the DMatrix.

Parameters:

field (str) – The field name of the information

Returns:

info – a numpy array of float information of the data

Return type:

array

get_group()

Get the group of the DMatrix.

Return type:

group

get_label()

Get the label of the DMatrix.

Returns:

label

Return type:

array

get_quantile_cut()

Get quantile cuts for quantization.

Added in version 2.0.0.

Return type:

Tuple[ndarray,ndarray]

get_uint_info(field)

Get unsigned integer property from the DMatrix.

Parameters:

field (str) – The field name of the information

Returns:

info – a numpy array of unsigned integer information of the data

Return type:

array

get_weight()

Get the weight of the DMatrix.

Returns:

weight

Return type:

array

num_col()

Get the number of columns (features) in the DMatrix.

Return type:

int

num_nonmissing()

Get the number of non-missing values in the DMatrix.

Added in version 1.7.0.

Return type:

int

num_row()

Get the number of rows in the DMatrix.

Return type:

int

save_binary(fname, silent=True)

Save DMatrix to an XGBoost buffer. Saved binary can be later loaded by providing the path to xgboost.DMatrix() as input.

Parameters:
  • fname (string or os.PathLike) – Name of the output buffer file.

  • silent (bool (optional; default: True)) – If set, the output is suppressed.

Return type:

None

set_base_margin(margin)

Set base margin of booster to start from.

This can be used to specify a prediction value of an existing model to be the base_margin. However, remember that the margin is needed, instead of the transformed prediction, e.g. for logistic regression the value before the logistic transformation must be supplied. See also example/demo.py.

Parameters:

margin (array like) – Prediction margin of each datapoint

Return type:

None
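As a sketch of the pattern described above, the raw margin of an existing booster can be obtained with predict(output_margin=True) and used as the starting point for further training (bst and dtrain are assumed to exist from a previous run):

# Raw (untransformed) margin, e.g. the value before the logistic
# transformation for a binary:logistic objective.
margin = bst.predict(dtrain, output_margin=True)
dtrain.set_base_margin(margin)
# Training on dtrain now starts from the previous model's prediction.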

set_float_info(field, data)

Set float type property into the DMatrix.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_float_info_npy2d(field, data)

Set float type property into the DMatrix for numpy 2D array input.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_group(group)

Set group size of DMatrix (used for ranking).

Parameters:

group (array like) – Group size of each group

Return type:

None

set_info(*, label=None, weight=None, base_margin=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_names=None, feature_types=None, feature_weights=None)

Set meta info for DMatrix. See doc string for xgboost.DMatrix.

Parameters:
  • label (Any |None)

  • weight (Any |None)

  • base_margin (Any |None)

  • group (Any |None)

  • qid (Any |None)

  • label_lower_bound (Any |None)

  • label_upper_bound (Any |None)

  • feature_names (Sequence[str]|None)

  • feature_types (Sequence[str]|None)

  • feature_weights (Any |None)

Return type:

None

set_label(label)

Set the label of the DMatrix.

Parameters:

label (array like) – The label information to be set into DMatrix

Return type:

None

set_uint_info(field, data)

Set uint type property into the DMatrix.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_weight(weight)

Set weight of each instance.

Parameters:

weight (array like) –

Weight for each data point

Note

For ranking task, weights are per-group.

In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn't make sense to assign weights to individual data points.

Return type:

None

slice(rindex, allow_groups=False)

Slice the DMatrix and return a new DMatrix that only contains rindex.

Parameters:
  • rindex (List[int]|ndarray) – List of indices to be selected.

  • allow_groups (bool) – Allow slicing of a matrix with a groups attribute

Returns:

A new DMatrix containing only selected indices.

Return type:

res
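For example, selecting the first ten rows (a minimal sketch, assuming dtrain was constructed as in the earlier example):

first_ten = dtrain.slice(list(range(10)))
assert first_ten.num_row() == 10
assert first_ten.num_col() == dtrain.num_col()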

class xgboost.QuantileDMatrix(data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None, max_bin=None, ref=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False, max_quantile_batches=None, data_split_mode=DataSplitMode.ROW)

Bases: DMatrix, _RefMixIn

A DMatrix variant that generates quantilized data directly from input for the hist tree method. This DMatrix is primarily designed to save memory in training by avoiding intermediate storage. Set max_bin to control the number of bins during quantisation, which should be consistent with the training parameter max_bin. When QuantileDMatrix is used for a validation/test dataset, ref should be another QuantileDMatrix or DMatrix (but not recommended as it defeats the purpose of saving memory) constructed from the training dataset. See xgboost.DMatrix for documents on meta info.

Note

Do not use QuantileDMatrix as a validation/test dataset without supplying a reference (the training dataset) QuantileDMatrix using ref, as some information may be lost in quantisation.

Added in version 1.7.0.

Examples

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression()
X_train, X_test, y_train, y_test = train_test_split(X, y)
Xy_train = xgb.QuantileDMatrix(X_train, y_train)
# It's necessary to have the training DMatrix as a reference for valid
# quantiles.
Xy_test = xgb.QuantileDMatrix(X_test, y_test, ref=Xy_train)

Parameters:
  • max_bin (int | None) – The number of histogram bins, which should be consistent with the training parameter max_bin.

  • ref (DMatrix | None) – The training dataset that provides quantile information, needed when creating a validation/test dataset with QuantileDMatrix. Supplying the training DMatrix as a reference means that the same quantisation applied to the training data is applied to the validation/test data.

  • max_quantile_batches (int |None) –

    For GPU-based inputs from an iterator, XGBoost handles incoming batches with multiple growing substreams. This parameter sets the maximum number of batches before XGBoost can cut the sub-stream and create a new one. This can help bound the memory usage. By default, XGBoost grows a sub-stream exponentially until batches are exhausted. This option is only used for the training dataset and the default is None (unbounded). Lastly, if the data is a single batch instead of an iterator, this parameter has no effect.

    Added in version 3.0.0.

    Warning

    This is an experimental parameter and subject to change.

  • data (Any) –

    Data source of DMatrix. See Supported data structures for various XGBoost functions for a list of supported input types.

    Note that, if passing an iterator, it will cache data on disk, and note that fields like label will be concatenated in-memory from multiple calls to the iterator.

  • label (Any |None) – Label of the training data.

  • weight (Any |None) –

    Weight for each instance.

    Note

    For ranking task, weights are per-group. In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn't make sense to assign weights to individual data points.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • missing (float | None) – Value in the input data which needs to be present as a missing value. If None, defaults to np.nan.

  • silent (bool) – Whether to print messages during construction.

  • feature_names (Sequence[str]|None) – Set names for features.

  • feature_types (Sequence[str]|None) –

    Set types for features. If data is a DataFrame type and passing enable_categorical=True, the types will be deduced automatically from the column types.

    Otherwise, one can pass a list-like input with the same length as the number of columns in data, with the following possible values:

    • "c", which represents categorical columns.

    • "q", which represents numeric columns.

    • "int", which represents integer columns.

    • "i", which represents boolean columns.

    Note that, while categorical types are treated differently from the rest for model fitting purposes, the other types do not influence the generated model, but have effects in other functionalities such as feature importances.

    For categorical features, the input is assumed to be preprocessed and encoded by the users. The encoding can be done via sklearn.preprocessing.OrdinalEncoder or the pandas dataframe .cat.codes method. This is useful when users want to specify categorical features without having to construct a dataframe as input.

  • nthread (int | None) – Number of threads to use for loading data when parallelization is applicable. If -1, uses the maximum number of threads available on the system.

  • group (Any | None) – Group size for all ranking groups.

  • qid (Any |None) – Query ID for data samples, used for ranking.

  • label_lower_bound (Any |None) – Lower bound for survival training.

  • label_upper_bound (Any |None) – Upper bound for survival training.

  • feature_weights (Any |None) – Set feature weights for column sampling.

  • enable_categorical (bool) –

    Added in version 1.3.0.

    Note

    This parameter is experimental

    Experimental support of specializing for categorical features. See Categorical Data for more info.

    If passing True and data is a data frame (from supported libraries such as Pandas, Modin or cuDF), the DMatrix recognizes categorical columns and automatically sets the feature_types parameter. If data is not a data frame, this argument is ignored.

    If passing False and data is a data frame with categorical columns, it will result in an error.

    See notes in the DataIter for consistency requirements when the input is an iterator.

    Changed in version 3.1.0.

    XGBoost can remember the encoding of categories when the input is a dataframe.

  • data_split_mode (DataSplitMode)

data_split_mode()

Get the data split mode of the DMatrix.

Added in version 2.1.0.

Return type:

DataSplitMode

property feature_names: Sequence[str] | None

Labels for features (column labels).

Setting it to None resets existing feature names.

property feature_types: Sequence[str] | None

Type of features (column types).

This is for displaying the results and categorical data support. See DMatrix for details.

Setting it to None resets existing feature types.

get_base_margin()

Get the base margin of the DMatrix.

Return type:

base_margin

get_categories()

Get the categories in the dataset using pyarrow. Returns None if there are no categorical features.

Warning

This function is still a work in progress.

Added in version 3.1.0.

Return type:

Dict[str, pa.DictionaryArray] | None

get_data()

Get the predictors from DMatrix as a CSR matrix. This getter is mostly for testing purposes. If this is a quantized DMatrix then quantized values are returned instead of input values.

Added in version 1.7.0.

Return type:

csr_matrix

get_float_info(field)

Get float property from the DMatrix.

Parameters:

field (str) – The field name of the information

Returns:

info – a numpy array of float information of the data

Return type:

array

get_group()

Get the group of the DMatrix.

Return type:

group

get_label()

Get the label of the DMatrix.

Returns:

label

Return type:

array

get_quantile_cut()

Get quantile cuts for quantization.

Added in version 2.0.0.

Return type:

Tuple[ndarray,ndarray]

get_uint_info(field)

Get unsigned integer property from the DMatrix.

Parameters:

field (str) – The field name of the information

Returns:

info – a numpy array of unsigned integer information of the data

Return type:

array

get_weight()

Get the weight of the DMatrix.

Returns:

weight

Return type:

array

num_col()

Get the number of columns (features) in the DMatrix.

Return type:

int

num_nonmissing()

Get the number of non-missing values in the DMatrix.

Added in version 1.7.0.

Return type:

int

num_row()

Get the number of rows in the DMatrix.

Return type:

int

property ref: ReferenceType | None

Internal method for retrieving a reference to the training DMatrix.

save_binary(fname, silent=True)

Save DMatrix to an XGBoost buffer. Saved binary can be later loaded by providing the path to xgboost.DMatrix() as input.

Parameters:
  • fname (string or os.PathLike) – Name of the output buffer file.

  • silent (bool (optional; default: True)) – If set, the output is suppressed.

Return type:

None

set_base_margin(margin)

Set base margin of booster to start from.

This can be used to specify a prediction value of an existing model to be the base_margin. However, remember that the margin is needed, instead of the transformed prediction, e.g. for logistic regression the value before the logistic transformation must be supplied. See also example/demo.py.

Parameters:

margin (array like) – Prediction margin of each datapoint

Return type:

None

set_float_info(field, data)

Set float type property into the DMatrix.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_float_info_npy2d(field, data)

Set float type property into the DMatrix for numpy 2D array input.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_group(group)

Set group size of DMatrix (used for ranking).

Parameters:

group (array like) – Group size of each group

Return type:

None

set_info(*, label=None, weight=None, base_margin=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_names=None, feature_types=None, feature_weights=None)

Set meta info for DMatrix. See doc string for xgboost.DMatrix.

Parameters:
  • label (Any |None)

  • weight (Any |None)

  • base_margin (Any |None)

  • group (Any |None)

  • qid (Any |None)

  • label_lower_bound (Any |None)

  • label_upper_bound (Any |None)

  • feature_names (Sequence[str]|None)

  • feature_types (Sequence[str]|None)

  • feature_weights (Any |None)

Return type:

None

set_label(label)

Set the label of the DMatrix.

Parameters:

label (array like) – The label information to be set into DMatrix

Return type:

None

set_uint_info(field, data)

Set uint type property into the DMatrix.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_weight(weight)

Set weight of each instance.

Parameters:

weight (array like) –

Weight for each data point

Note

For ranking task, weights are per-group.

In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn't make sense to assign weights to individual data points.

Return type:

None

slice(rindex, allow_groups=False)

Slice the DMatrix and return a new DMatrix that only contains rindex.

Parameters:
  • rindex (List[int]|ndarray) – List of indices to be selected.

  • allow_groups (bool) – Allow slicing of a matrix with a groups attribute

Returns:

A new DMatrix containing only selected indices.

Return type:

res

class xgboost.ExtMemQuantileDMatrix(data, *, missing=None, nthread=None, max_bin=None, ref=None, enable_categorical=False, max_quantile_batches=None, cache_host_ratio=None)

Bases: DMatrix, _RefMixIn

The external memory version of the QuantileDMatrix.

See Using XGBoost External Memory Version for explanation and usage examples, and QuantileDMatrix for the parameter document.

Warning

This is an experimental feature and subject to change.

Added in version 3.0.0.

Parameters:
  • data (DataIter) – A user-defined DataIter for loading data.

  • max_quantile_batches (int | None) – See QuantileDMatrix.

  • cache_host_ratio (float |None) –

    Added in version 3.1.0.

    Used by the GPU implementation. For GPU-based inputs, XGBoost can split the cache into host and device caches to reduce the data transfer overhead. This parameter specifies the size of the host cache compared to the size of the entire cache: \(host / (host + device)\).

    See Adaptive Cache for more info.

  • missing (float |None)

  • nthread (int |None)

  • max_bin (int |None)

  • ref (DMatrix |None)

  • enable_categorical (bool)

data_split_mode()

Get the data split mode of the DMatrix.

Added in version 2.1.0.

Return type:

DataSplitMode

property feature_names: Sequence[str] | None

Labels for features (column labels).

Setting it to None resets existing feature names.

property feature_types: Sequence[str] | None

Type of features (column types).

This is for displaying the results and categorical data support. See DMatrix for details.

Setting it to None resets existing feature types.

get_base_margin()

Get the base margin of the DMatrix.

Return type:

base_margin

get_categories()

Get the categories in the dataset using pyarrow. Returns None if there are no categorical features.

Warning

This function is still a work in progress.

Added in version 3.1.0.

Return type:

Dict[str, pa.DictionaryArray] | None

get_data()

Get the predictors from DMatrix as a CSR matrix. This getter is mostly for testing purposes. If this is a quantized DMatrix then quantized values are returned instead of input values.

Added in version 1.7.0.

Return type:

csr_matrix

get_float_info(field)

Get float property from the DMatrix.

Parameters:

field (str) – The field name of the information

Returns:

info – a numpy array of float information of the data

Return type:

array

get_group()

Get the group of the DMatrix.

Return type:

group

get_label()

Get the label of the DMatrix.

Returns:

label

Return type:

array

get_quantile_cut()

Get quantile cuts for quantization.

Added in version 2.0.0.

Return type:

Tuple[ndarray,ndarray]

get_uint_info(field)

Get unsigned integer property from the DMatrix.

Parameters:

field (str) – The field name of the information

Returns:

info – a numpy array of unsigned integer information of the data

Return type:

array

get_weight()

Get the weight of the DMatrix.

Returns:

weight

Return type:

array

num_col()

Get the number of columns (features) in the DMatrix.

Return type:

int

num_nonmissing()

Get the number of non-missing values in the DMatrix.

Added in version 1.7.0.

Return type:

int

num_row()

Get the number of rows in the DMatrix.

Return type:

int

property ref: ReferenceType | None

Internal method for retrieving a reference to the training DMatrix.

save_binary(fname, silent=True)

Save DMatrix to an XGBoost buffer. Saved binary can be later loaded by providing the path to xgboost.DMatrix() as input.

Parameters:
  • fname (string or os.PathLike) – Name of the output buffer file.

  • silent (bool (optional; default: True)) – If set, the output is suppressed.

Return type:

None

set_base_margin(margin)

Set base margin of booster to start from.

This can be used to specify a prediction value of an existing model to be the base_margin. However, remember that the margin is needed, instead of the transformed prediction, e.g. for logistic regression the value before the logistic transformation must be supplied. See also example/demo.py.

Parameters:

margin (array like) – Prediction margin of each datapoint

Return type:

None

set_float_info(field, data)

Set float type property into the DMatrix.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_float_info_npy2d(field, data)

Set float type property into the DMatrix for numpy 2D array input.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_group(group)

Set group size of DMatrix (used for ranking).

Parameters:

group (array like) – Group size of each group

Return type:

None

set_info(*, label=None, weight=None, base_margin=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_names=None, feature_types=None, feature_weights=None)

Set meta info for DMatrix. See doc string for xgboost.DMatrix.

Parameters:
  • label (Any |None)

  • weight (Any |None)

  • base_margin (Any |None)

  • group (Any |None)

  • qid (Any |None)

  • label_lower_bound (Any |None)

  • label_upper_bound (Any |None)

  • feature_names (Sequence[str]|None)

  • feature_types (Sequence[str]|None)

  • feature_weights (Any |None)

Return type:

None

set_label(label)

Set the label of the DMatrix.

Parameters:

label (array like) – The label information to be set into DMatrix

Return type:

None

set_uint_info(field, data)

Set uint type property into the DMatrix.

Parameters:
  • field (str) – The field name of the information

  • data (numpy array) – The array of data to be set

Return type:

None

set_weight(weight)

Set weight of each instance.

Parameters:

weight (array like) –

Weight for each data point

Note

For ranking task, weights are per-group.

In ranking task, one weight is assigned to each group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn't make sense to assign weights to individual data points.

Return type:

None

slice(rindex, allow_groups=False)

Slice the DMatrix and return a new DMatrix that only contains rindex.

Parameters:
  • rindex (List[int]|ndarray) – List of indices to be selected.

  • allow_groups (bool) – Allow slicing of a matrix with a groups attribute

Returns:

A new DMatrix containing only selected indices.

Return type:

res

class xgboost.Booster(params=None, cache=None, model_file=None)

Bases: object

A Booster of XGBoost.

Booster is the model of XGBoost, which contains low level routines for training, prediction and evaluation.

Parameters:
  • params (list or dict, optional) – Parameters for boosters.

  • cache (list, optional) – List of cache items.

  • model_file (string / os.PathLike / Booster / bytearray, optional) – Path to the model file.

__getitem__(val)

Get a slice of the tree-based model. Attributes like best_iteration and best_score are removed in the resulting booster.

Added in version 1.3.0.

Parameters:

val (int |integer |tuple |slice |ellipsis)

Return type:

Booster
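A minimal sketch of model slicing, assuming bst is a tree booster trained for at least 10 rounds:

# Keep only the trees built during rounds [0, 5).
sliced = bst[0:5]
assert sliced.num_boosted_rounds() == 5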

attr(key)

Get attribute string from the Booster.

Parameters:

key (str) – The key to get attribute from.

Returns:

The attribute value of the key, returns None if the attribute does not exist.

Return type:

value

attributes()

Get attributes stored in the Booster as a dictionary.

Returns:

result – Returns an empty dict if there are no attributes.

Return type:

dictionary of attribute_name: attribute_value pairs of strings.

property best_iteration: int

The best iteration during training.

property best_score: float

The best evaluation score during training.

boost(dtrain, iteration, grad, hess)

Boost the booster for one iteration with customized gradient statistics. Like xgboost.Booster.update(), this function should not be called directly by users.

Parameters:
  • dtrain (DMatrix) – The training DMatrix.

  • grad (Any) – The first order of gradient.

  • hess (Any) – The second order of gradient.

  • iteration (int)

Return type:

None

copy()

Copy the booster object.

Returns:

A copied booster model

Return type:

booster

dump_model(fout, fmap='', with_stats=False, dump_format='text')

Dump model into a text or JSON file. Unlike save_model(), the output format is primarily used for visualization or interpretation, hence it's more human readable but cannot be loaded back to XGBoost.

Parameters:
  • fout (str |PathLike) – Output file name.

  • fmap (str |PathLike) – Name of the file containing feature map names.

  • with_stats (bool) – Controls whether the split statistics are output.

  • dump_format (str) – Format of model dump file. Can be ‘text’ or ‘json’.

Return type:

None

eval(data, name='eval', iteration=0)

Evaluate the model on the given data.

Parameters:
  • data (DMatrix) – The dmatrix storing the input.

  • name (str) – The name of the dataset.

  • iteration (int) – The current iteration number.

Returns:

result – Evaluation result string.

Return type:

str

eval_set(evals, iteration=0, feval=None, output_margin=True)

Evaluate a set of data.

Parameters:
  • evals (Sequence[Tuple[DMatrix, str]]) – List of items to be evaluated.

  • iteration (int) – Current iteration.

  • feval (Callable | None) – Custom evaluation function.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

Returns:

result – Evaluation result string.

Return type:

str

property feature_names: Sequence[str] | None

Feature names for this booster. Can be directly set by input data or by assignment.

property feature_types: Sequence[str] | None

Feature types for this booster. Can be directly set by input data or by assignment. See DMatrix for details.

get_categories()

Get the categories in the dataset using pyarrow. Returns None if there are no categorical features.

Warning

This function is still a work in progress.

Added in version 3.1.0.

Return type:

Dict[str, pa.DictionaryArray] | None

get_dump(fmap='', with_stats=False, dump_format='text')

Returns the model dump as a list of strings. Unlike save_model(), the output format is primarily used for visualization or interpretation, hence it's more human readable but cannot be loaded back to XGBoost.

Parameters:
  • fmap (str |PathLike) – Name of the file containing feature map names.

  • with_stats (bool) – Controls whether the split statistics should be included.

  • dump_format (str) – Format of model dump. Can be ‘text’, ‘json’ or ‘dot’.

Return type:

List[str]

get_fscore(fmap='')

Get feature importance of each feature.

Note

Zero-importance features will not be included.

Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split condition.

Parameters:

fmap (str |PathLike) – The name of feature map file

Return type:

Dict[str,float |List[float]]

get_score(fmap='', importance_type='weight')

Get feature importance of each feature. For tree models, the importance type can be defined as:

  • ‘weight’: the number of times a feature is used to split the data across all trees.

  • ‘gain’: the average gain across all splits the feature is used in.

  • ‘cover’: the average coverage across all splits the feature is used in.

  • ‘total_gain’: the total gain across all splits the feature is used in.

  • ‘total_cover’: the total coverage across all splits the feature is used in.

Note

For linear model, only "weight" is defined and it's the normalized coefficients without bias.

Note

Zero-importance features will not be included.

Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split condition.

Parameters:
  • fmap (str |PathLike) – The name of feature map file.

  • importance_type (str) – One of the importance types defined above.

Returns:

A map between feature names and their scores. When gblinear is used for multi-class classification, the scores for each feature are a list with length n_classes; otherwise they're scalars.

Return type:

Dict[str,float |List[float]]
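For example, comparing importance types on a trained tree booster (a sketch; bst is assumed, and zero-importance features are absent from both dicts):

weight = bst.get_score(importance_type="weight")
gain = bst.get_score(importance_type="gain")
# Rank features by average gain.
print(sorted(gain, key=gain.get, reverse=True)[:5])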

get_split_value_histogram(feature, fmap='', bins=None, as_pandas=True)

Get split value histogram of a feature.

Parameters:
  • feature (str) – The name of the feature.

  • fmap (str |PathLike) – The name of feature map file.

  • bins (int | None) – The maximum number of bins. The number of bins equals the number of unique split values n_unique if bins == None or bins > n_unique.

  • as_pandas (bool) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return numpy ndarray.

Returns:

A histogram of used splitting values for the specified feature, either as numpy array or pandas DataFrame.

Return type:

ndarray |DataFrame

inplace_predict(data, *, iteration_range=(0, 0), predict_type='value', missing=nan, validate_features=True, base_margin=None, strict_shape=False)

Run prediction in-place when possible. Unlike the predict() method, in-place prediction does not cache the prediction result.

Calling only inplace_predict in multiple threads is safe and lock free. But the safety does not hold when used in conjunction with other methods. E.g. you can't train the booster in one thread and perform prediction in the other.

Note

If the device ordinal of the input data doesn't match the one configured for the booster, data will be copied to the booster device.

booster.set_param({"device": "cuda:0"})
booster.inplace_predict(cupy_array)
booster.set_param({"device": "cpu"})
booster.inplace_predict(numpy_array)

Added in version 1.1.0.

Parameters:
Returns:

prediction – The prediction result. When the input data is on GPU, the prediction result is stored in a cupy array.

Return type:

numpy.ndarray/cupy.ndarray

load_config(config)

Load configuration returned by save_config.

Added in version 1.0.0.

Parameters:

config (str)

Return type:

None

load_model(fname)

Load the model from a file or a bytearray.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)

Parameters:

fname (PathLike | bytearray | str) – Input file name or memory buffer (see also save_raw).

Return type:

None

num_boosted_rounds()

Get number of boosted rounds. For gblinear this is reset to 0 after serializing the model.

Return type:

int

num_features()

Number of features in booster.

Return type:

int

predict(data, *, output_margin=False, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, training=False, iteration_range=(0, 0), strict_shape=False)

Predict with data. The full model will be used unless iteration_range is specified, meaning the user has to either slice the model or use the best_iteration attribute to get predictions from the best model returned from early stopping.

Note

See Prediction for issues like thread safety and a summary of outputs from this function.

Parameters:
  • data (DMatrix) – The dmatrix storing the input.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.

  • pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.

  • approx_contribs (bool) – Approximate the contributions of each feature. Used when pred_contribs or pred_interactions is set to True. Changing the default of this parameter (False) is not recommended.

  • pred_interactions (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.

  • validate_features (bool) – When this is True, validate that the Booster's and data's feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • training (bool) –

    Whether the prediction value is used for training. This can affect the dart booster, which performs dropouts during training iterations but uses all trees for inference. If you want to obtain results with dropouts, set this parameter to True. Also, the parameter is set to true when obtaining predictions for a custom objective function.

    Added in version 1.0.0.

  • iteration_range (Tuple[int | integer, int | integer]) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during the [10, 20) (half open set) rounds are used in this prediction.

    Added in version 1.4.0.

  • strict_shape (bool) –

    When set to True, output shape is invariant to whether classification is used. For both value and margin prediction, the output shape is (n_samples, n_groups), n_groups == 1 when multi-class is not used. Default to False, in which case the output shape can be (n_samples,) if multi-class is not used.

    Added in version 1.4.0.

Returns:

prediction

Return type:

numpy array
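A short sketch combining iteration_range and strict_shape, assuming bst was trained on dtrain for at least 20 rounds:

# Use only the trees from the first 10 rounds.
pred = bst.predict(dtrain, iteration_range=(0, 10))
# With strict_shape, the output is always (n_samples, n_groups).
pred_2d = bst.predict(dtrain, strict_shape=True)
print(pred.shape, pred_2d.shape)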

reset()

Reset the booster object to release data caches used for training.

Added in version 3.0.0.

Return type:

Booster

save_config()

Output internal parameter configuration of Booster as a JSON string.

Added in version 1.0.0.

Return type:

str

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:

fname (str | PathLike) – Output file name

Return type:

None

save_raw(raw_format='ubj')

Save the model to an in-memory buffer representation instead of a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

Parameters:

raw_format (str) – Format of output buffer. Can be json or ubj.

Return type:

An in-memory buffer representation of the model

set_attr(**kwargs)

Set the attribute of the Booster.

Parameters:

**kwargs (Any |None) – The attributes to set. Setting a value to None deletes an attribute.

Return type:

None

set_param(params, value=None)

Set parameters into the Booster.

Parameters:
  • params (Dict | Iterable[Tuple[str, Any]] | str) – list of key, value pairs, dict of key to value or simply str key

  • value (str | None) – value of the specified parameter, when params is str key

Return type:

None

trees_to_dataframe(fmap='')

Parse a boosted tree model text dump into a pandas DataFrame structure.

This feature is only defined when the decision tree model is chosen as base learner (booster in {gbtree, dart}). It is not defined for other base learner types, such as linear learners (booster=gblinear).

Parameters:

fmap (str |PathLike) – The name of feature map file.

Return type:

DataFrame

update(dtrain, iteration, fobj=None)

Update for one iteration, with objective function calculated internally. This function should not be called directly by users.

Parameters:
  • dtrain (DMatrix) – Training data.

  • iteration (int) – Current iteration number.

  • fobj (Callable | None) – Custom objective function.
Return type:

None

class xgboost.DataIter(cache_prefix=None, release_data=True, *, on_host=True, min_cache_page_bytes=None)

Bases: ABC

The interface for user defined data iterator. The iterator facilitates distributed training, QuantileDMatrix, and external memory support using DMatrix or ExtMemQuantileDMatrix. Most of the time, users don't need to interact with this class directly.

Note

The class caches some intermediate results using the data input (predictor X) as key. Don't repeat the X for multiple batches with different meta data (like label); make a copy if necessary.

Note

When the input for each batch is a DataFrame, we assume categories are consistently encoded for all batches. For example, given two dataframes for two batches, this is invalid:

import pandas as pd

x0 = pd.DataFrame({"a": [0, 1]}, dtype="category")
x1 = pd.DataFrame({"a": [1, 2]}, dtype="category")

This is invalid because x0 has [0, 1] as categories while x1 has [1, 2]. They should share the same set of categories and encoding:

import numpy as np

categories = np.array([0, 1, 2])
x0["a"] = pd.Categorical.from_codes(codes=np.array([0, 1]), categories=categories)
x1["a"] = pd.Categorical.from_codes(codes=np.array([1, 2]), categories=categories)

You can make sure of the consistent encoding in your preprocessing step; be careful that the data is stored in formats that preserve the encoding when chunking the data.

Parameters:
  • cache_prefix (str |None) –

    Prefix to the cache files, only used in external memory.

    Note that using this class for external memory will cache data on disk under the path passed here.

  • release_data (bool) – Whether the iterator should release the data during iteration. Set it to True if the data transformation (converting data to np.float32 type) is memory intensive. Otherwise, if the transformation is computation intensive then we can keep the cache.

  • on_host (bool) –

    Whether the data should be cached on the host memory instead of the file system when using GPU with external memory. When set to true (the default), the "external memory" is the CPU (host) memory. See Using XGBoost External Memory Version for more info.

    Added in version 3.0.0.

    Warning

    This is an experimental parameter and subject to change.

  • min_cache_page_bytes (int |None) –

    The minimum number of bytes of each cached page. Only used for on-host cache with the GPU-based ExtMemQuantileDMatrix. When using GPU-based external memory with the data cached in the host memory, XGBoost can concatenate the pages internally to increase the batch size for the GPU. The default page size is about 1/16 of the total device memory. Users can manually set the value based on the actual hardware and datasets. Set this to 0 to disable page concatenation.

    Added in version 3.0.0.

    Warning

    This is an experimental parameter and subject to change.

get_callbacks(enable_categorical)

Get callback functions for iterating in C. This is an internal function.

Parameters:

enable_categorical (bool)

Return type:

Tuple[Callable,Callable]

abstract next(input_data)

Set the next batch of data.

Parameters:

input_data (Callable) – A function with the same data fields (like data, label) as xgboost.DMatrix.

Return type:

False if there’s no more batch, otherwise True.

property proxy: _ProxyDMatrix

Handle of DMatrix proxy.

reraise()

Reraise the exception thrown during iteration.

Return type:

None

abstract reset()

Reset the data iterator. Prototype for user defined function.

Return type:

None
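A minimal sketch of a concrete iterator over in-memory batches (class and variable names here are illustrative; real use cases would typically load each batch from a file):

import numpy as np
import xgboost as xgb

class BatchIter(xgb.DataIter):
    def __init__(self, batches):
        self._batches = batches  # list of (X, y) pairs
        self._it = 0
        super().__init__(cache_prefix="cache")

    def next(self, input_data):
        if self._it == len(self._batches):
            return False  # no more batches
        X, y = self._batches[self._it]
        input_data(data=X, label=y)  # same fields as xgboost.DMatrix
        self._it += 1
        return True

    def reset(self):
        self._it = 0

batches = [(np.random.rand(32, 4), np.random.rand(32)) for _ in range(4)]
Xy = xgb.DMatrix(BatchIter(batches))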

Learning API

Training Library containing training routines.

xgboost.train(params, dtrain, num_boost_round=10, *, evals=None, obj=None, maximize=None, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, custom_metric=None)

Train a booster with given parameters.

Parameters:
  • params (Dict[str,Any]) – Booster params.

  • dtrain (DMatrix) – Data to be trained.

  • num_boost_round (int) – Number of boosting iterations.

  • evals (Sequence[Tuple[DMatrix, str]] | None) – List of validation sets for which metrics will be evaluated during training. Validation metrics will help us track the performance of the model.

  • obj (Callable[[ndarray,DMatrix],Tuple[ndarray,ndarray]]|None) – Custom objective function. SeeCustom Objective for details.

  • maximize (bool |None) – Whether to maximize custom_metric.

  • early_stopping_rounds (int |None) –

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training.

    Requires at least one item in evals.

    The method returns the model from the last iteration (not the best one). Use custom callback EarlyStopping or model slicing if the best model is desired. If there's more than one item in evals, the last entry will be used for early stopping.

    If there's more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.

    If early stopping occurs, the model will have two additional fields: bst.best_score, bst.best_iteration.

  • evals_result (Dict[str,Dict[str,List[float]|List[Tuple[float,float]]]]|None) –

    This dictionary stores the evaluation results of all the items in watchlist.

    Example: with a watchlist containing [(dtest, 'eval'), (dtrain, 'train')] and a parameter containing ('eval_metric': 'logloss'), the evals_result returns

    {'train': {'logloss': ['0.48253', '0.35953']},
     'eval': {'logloss': ['0.480385', '0.357756']}}

  • verbose_eval (bool |int |None) –

    Requires at least one item in evals.

    If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage.

    If verbose_eval is an integer then the evaluation metric on the validation set is printed at every given verbose_eval boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

    Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.

  • xgb_model (str |PathLike |Booster |bytearray |None) – Xgb model to be loaded before training (allows training continuation).

  • callbacks (Sequence[TrainingCallback]|None) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using Callback API.

    Note

    States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.

    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        xgboost.train(params, Xy, callbacks=callbacks)

  • custom_metric (Callable[[ndarray,DMatrix],Tuple[str,float]]|None) –

    Custom metric function. See Custom Metric for details. The metric receives transformed prediction (after applying the reverse link function) when using a builtin objective, and raw output when using a custom objective.

Returns:

a trained booster model

Return type:

Booster
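A typical call with a validation set and early stopping might look like the following sketch (dtrain and dvalid are assumed DMatrix objects):

evals_result = {}
bst = xgb.train(
    {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 4},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=10,
    evals_result=evals_result,
)
print(bst.best_iteration, bst.best_score)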

xgboost.cv(params, dtrain, num_boost_round=10, *, nfold=3, stratified=False, folds=None, metrics=(), obj=None, maximize=None, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True, custom_metric=None)

Cross-validation with given parameters.

Parameters:
  • params (dict) – Booster params.

  • dtrain (DMatrix) – Data to be trained. Only theDMatrix without external memory issupported.

  • num_boost_round (int) – Number of boosting iterations.

  • nfold (int) – Number of folds in CV.

  • stratified (bool) – Perform stratified sampling.

  • folds (a KFold or StratifiedKFold instance or list of fold indices) – Sklearn KFolds or StratifiedKFolds object. Alternatively may explicitly pass sample indices for each fold. For n folds, folds should be a length n list of tuples. Each tuple is (in, out) where in is a list of indices to be used as the training samples for the n th fold and out is a list of indices to be used as the testing samples for the n th fold.

  • metrics (string or list of strings) – Evaluation metrics to be watched in CV.

  • obj (Callable[[ndarray,DMatrix],Tuple[ndarray,ndarray]]|None) – Custom objective function. SeeCustom Objective for details.

  • maximize (bool) – Whether to maximize the evaluation metric (score or error).

  • early_stopping_rounds (int) – Activates early stopping. Cross-Validation metric (average of validation metric computed over CV folds) needs to improve at least once in every early_stopping_rounds round(s) to continue training. The last entry in the evaluation history will represent the best iteration. If there's more than one metric in the eval_metric parameter given in params, the last metric will be used for early stopping.

  • fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.

  • as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return np.ndarray.

  • verbose_eval (bool, int, or None, default None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at each boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.

  • show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected, and always contain std.

  • seed (int) – Seed used to generate the folds (passed to numpy.random.seed).

  • callbacks (Sequence[TrainingCallback]|None) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using Callback API.

    Note

    States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.

    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        xgboost.train(params, Xy, callbacks=callbacks)

  • shuffle (bool) – Shuffle data before creating folds.

  • custom_metric (Callable[[ndarray,DMatrix],Tuple[str,float]]|None) –

    Custom metric function. See Custom Metric for details.

Returns:

evaluation history

Return type:

list(string)
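For example, 5-fold cross-validation tracking RMSE (a sketch; dtrain is assumed):

history = xgb.cv(
    {"objective": "reg:squarederror", "max_depth": 3},
    dtrain,
    num_boost_round=50,
    nfold=5,
    metrics=("rmse",),
    seed=0,
)
# With pandas installed, history is a DataFrame with per-round
# train/test mean and std columns.
print(history.tail())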

Scikit-Learn API

Scikit-Learn Wrapper interface for XGBoost.

class xgboost.XGBRegressor(*, objective='reg:squarederror', **kwargs)

Bases: RegressorMixin, XGBModel

Implementation of the scikit-learn API for XGBoost regression. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:
  • n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.

  • max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature

  • grow_policy (Optional[str]) –

    Tree growing policy.

    • depthwise: Favors splitting at nodes closest to the root.

    • lossguide: Favors splitting at nodes with highest loss change.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (Union[str,xgboost.sklearn._SklObjWProto,Callable[[Any,Any],Tuple[numpy.ndarray,numpy.ndarray]],NoneType]) –

    Specify the learning task and the corresponding learning objective or a custom objective function to be used.

    For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.

  • booster (Optional[str]) – Specify which booster to use:gbtree,gblinear ordart.

  • tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set todefault, XGBoost will choose the most conservative option available. It’srecommended to study this option from the parameters documenttree method

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • sampling_method (Optional[str]) –

    Sampling method. Used only by the GPU version of the hist tree method.

    • uniform: Select random training instances uniformly.

    • gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –

    Random number seed.

    Note

    Using the gblinear booster with the shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float) – Value in the data which is to be treated as missing. Defaults to numpy.nan.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree models, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear models, only “weight” is defined and it’s the normalized coefficients without bias.

  • device (Optional[str]) –

    Added in version 2.0.0.

    Device ordinal; available options are cpu, cuda, and gpu.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • enable_categorical (bool) – See the same parameter of DMatrix for details.

  • feature_types (Optional[Sequence[str]]) –

    Added in version 1.7.0.

    Used for specifying feature types without constructing a dataframe. See DMatrix for details.

  • feature_weights (Optional[ArrayLike]) – Weight for each feature, defining the probability of each feature being selected when colsample is used. All values must be greater than 0, otherwise a ValueError is thrown.

  • max_cat_to_onehot (Optional[int]) –

    Added in version 1.6.0.

    Note

    This parameter is experimental

    A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • max_cat_threshold (Optional[int]) –

    Added in version 1.7.0.

    Note

    This parameter is experimental

    Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • multi_strategy (Optional[str]) –

    Added in version 2.0.0.

    Note

    This parameter is a work in progress.

    The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.

    • one_output_per_tree: One model for each target.

    • multi_output_tree: Use multi-target trees.

  • eval_metric (Union[str,List[Union[str,Callable]],Callable,NoneType]) –

    Added in version 1.6.0.

    Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (See XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

    If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.

    Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

    For advanced usage on early stopping, like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

    See Custom Objective and Evaluation Metric and Custom objective and metric for more information.

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])

  • early_stopping_rounds (Optional[int]) –

    Added in version 1.6.0.

    • Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

    • If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.

    • If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.

    • If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
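    As a concrete illustration, here is a minimal sketch of early stopping with this estimator; the synthetic data and the train/validation split are assumptions for the example.

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    # Hypothetical data, for illustration only.
    rng = np.random.default_rng(0)
    X, y = rng.random((200, 5)), rng.random(200)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

    reg = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=10)
    reg.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
    # predict() automatically uses the trees up to best_iteration.
    print(reg.best_iteration, reg.best_score)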

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.

    Note

    States in callback are not preserved during training, which means callback objects cannot be reused for multiple training sessions without reinitialization or deepcopy.

    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)

  • kwargs (Optional[Any]) –

    Keyword arguments for the XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    sample_weight :

    Optional sample weights.

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point.

    Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.
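    To make the signature concrete, here is a minimal sketch of a custom squared-error objective; the 0.5 scaling of the loss is an assumption chosen so that the Hessian is constant at one.

    import numpy as np
    import xgboost as xgb

    def squared_error(y_true: np.ndarray, y_pred: np.ndarray):
        # Gradient and Hessian of the loss 0.5 * (y_pred - y_true) ** 2
        grad = y_pred - y_true
        hess = np.ones_like(y_pred)
        return grad, hess

    # The callable is passed in place of a built-in objective string.
    reg = xgb.XGBRegressor(objective=squared_error)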

apply(X, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:
  • X (Any) – Input features matrix. See Supported data structures for various XGBoost functions for a list of supported types.

  • iteration_range (Tuple[int | integer, int | integer] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]
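A minimal sketch of inspecting the leaf indices, using hypothetical synthetic data:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.random((50, 4)), rng.random(50)
reg = xgb.XGBRegressor(n_estimators=3).fit(X, y)

leaves = reg.apply(X)  # shape (n_samples, n_trees)
print(leaves.shape, leaves[0])  # leaf index of the first sample in each tree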

property best_iteration: int

The best iteration obtained by early stopping. This attribute is 0-based; for instance, if the best iteration is the first round, then best_iteration is 0.

property best_score: float

The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:

coef_

Return type:

array of shape [n_features] or [n_classes, n_features]
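A minimal sketch, using hypothetical data; the property is only available after fitting with the linear booster:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.random((50, 4)), rng.random(50)
reg = xgb.XGBRegressor(booster="gblinear").fit(X, y)
print(reg.coef_.shape)  # (n_features,) for single-target regression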

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
Return type:

evals_result
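A minimal sketch, assuming hypothetical train/validation arrays as in the earlier early-stopping example:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.random(200)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

reg = xgb.XGBRegressor(n_estimators=10, eval_metric="rmse")
reg.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_valid, y_valid)], verbose=False)
results = reg.evals_result()
# One entry per element of eval_set, keyed validation_0, validation_1, ...
print(results["validation_1"]["rmse"][:3])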

property feature_importances_: ndarray

Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of loss change for each split from all trees.

Returns:

feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray

Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters:
  • X (Any) –

    Input feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

    When tree_method is set to hist, internally the QuantileDMatrix will be used instead of the DMatrix to conserve memory. However, this has performance implications when the device of the input data does not match that of the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.

  • y (Any) – Labels

  • sample_weight (Any |None) – instance weights

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

  • xgb_model (Booster | XGBModel | str | None) – File name of a stored XGBoost model or a ‘Booster’ instance of an XGBoost model to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing base margin for the i-th validation set.

  • feature_weights (Any | None) –

    Deprecated since version 3.0.0.

    Use feature_weights in __init__() or set_params() instead.

Return type:

XGBModel

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called.

Returns:

booster

Return type:

an xgboost booster of the underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:

int

get_params(deep=True)

Get parameters.

Parameters:

deep (bool)

Return type:

Dict[str,Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:

Dict[str,Any]

property intercept_: ndarray

Intercept (bias) property

For tree-based models, the returned value is the base_score.

Returns:

intercept_

Return type:

array of shape(1,) or[n_classes]

load_model(fname)

Load the model from a file or a bytearray.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using the JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)

Parameters:

fname (PathLike | bytearray | str) – Input file name or memory buffer (see also save_raw)

Return type:

None

property n_features_in_: int

Number of features seen during fit().

predict(X, *, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (Any) – Data to predict with. See Supported data structures for various XGBoost functions for a list of supported types.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • iteration_range (Tuple[int | integer, int | integer] | None) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half open set) are used in this prediction.

    Added in version 1.4.0.

Return type:

prediction
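For instance, a minimal sketch of restricting prediction to an early slice of the model, using hypothetical data:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.random((100, 4)), rng.random(100)
reg = xgb.XGBRegressor(n_estimators=50).fit(X, y)

# Use only the trees built during rounds [0, 20) for this prediction.
preds = reg.predict(X, iteration_range=(0, 20))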

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using the JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:

fname (str | PathLike) – Output file name

Return type:

None

score(X, y, sample_weight=None)

Return the coefficient of determination on test data.

The coefficient of determination, \(R^2\), is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – \(R^2\) of self.predict(X) w.r.t. y.

Return type:

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in fit.

base_margin_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin_eval_set parameter in fit.

eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for eval_set parameter in fit.

feature_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for feature_weights parameter in fit.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

sample_weight_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight_eval_set parameter in fit.

verbose : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for verbose parameter in fit.

xgb_model : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for xgb_model parameter in fit.

self : object

The updated object.

Return type:

XGBRegressor
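A minimal sketch of how this is typically used, assuming scikit-learn >= 1.4 with metadata routing enabled; the data and weights are hypothetical.

import numpy as np
import sklearn
from sklearn.model_selection import cross_validate
import xgboost as xgb

rng = np.random.default_rng(0)
X, y, w = rng.random((100, 4)), rng.random(100), rng.random(100)

sklearn.set_config(enable_metadata_routing=True)
reg = xgb.XGBRegressor(n_estimators=5).set_fit_request(sample_weight=True)
# cross_validate now routes sample_weight through to XGBRegressor.fit.
cross_validate(reg, X, y, params={"sample_weight": w})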

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:

self

Parameters:

params (Any)

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in predict.

iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for iteration_range parameter in predict.

output_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for output_margin parameter in predict.

validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for validate_features parameter in predict.

self : object

The updated object.

Return type:

XGBRegressor

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

self : object

The updated object.

Return type:

XGBRegressor

class xgboost.XGBClassifier(*, objective='binary:logistic', **kwargs)

Bases: ClassifierMixin, XGBModel

Implementation of the scikit-learn API for XGBoost classification. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:
  • n_estimators (Optional[int]) – Number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.

  • max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature.

  • grow_policy (Optional[str]) –

    Tree growing policy.

    • depthwise: Favors splitting at nodes closest to the root.

    • lossguide: Favors splitting at nodes with highest loss change.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –

    Specify the learning task and the corresponding learning objective or a custom objective function to be used.

    For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document tree method.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • sampling_method (Optional[str]) –

    Sampling method. Used only by the GPU version of the hist tree method.

    • uniform: Select random training instances uniformly.

    • gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –

    Random number seed.

    Note

    Using the gblinear booster with the shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float) – Value in the data which is to be treated as missing. Defaults to numpy.nan.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree models, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear models, only “weight” is defined and it’s the normalized coefficients without bias.

  • device (Optional[str]) –

    Added in version 2.0.0.

    Device ordinal; available options are cpu, cuda, and gpu.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • enable_categorical (bool) – See the same parameter of DMatrix for details.

  • feature_types (Optional[Sequence[str]]) –

    Added in version 1.7.0.

    Used for specifying feature types without constructing a dataframe. See DMatrix for details.

  • feature_weights (Optional[ArrayLike]) – Weight for each feature, defining the probability of each feature being selected when colsample is used. All values must be greater than 0, otherwise a ValueError is thrown.

  • max_cat_to_onehot (Optional[int]) –

    Added in version 1.6.0.

    Note

    This parameter is experimental

    A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • max_cat_threshold (Optional[int]) –

    Added in version 1.7.0.

    Note

    This parameter is experimental

    Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • multi_strategy (Optional[str]) –

    Added in version 2.0.0.

    Note

    This parameter is a work in progress.

    The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.

    • one_output_per_tree: One model for each target.

    • multi_output_tree: Use multi-target trees.

  • eval_metric (Union[str,List[Union[str,Callable]],Callable,NoneType]) –

    Added in version 1.6.0.

    Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (See XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

    If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.

    Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

    For advanced usage on early stopping, like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

    See Custom Objective and Evaluation Metric and Custom objective and metric for more information.

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])

  • early_stopping_rounds (Optional[int]) –

    Added in version 1.6.0.

    • Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

    • If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.

    • If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.

    • If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.

    Note

    States in callback are not preserved during training, which means callback objects cannot be reused for multiple training sessions without reinitialization or deepcopy.

    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)

  • kwargs (Optional[Any]) –

    Keyword arguments for the XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    sample_weight :

    Optional sample weights.

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point.

    Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.

apply(X, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:
  • X (Any) – Input features matrix. See Supported data structures for various XGBoost functions for a list of supported types.

  • iteration_range (Tuple[int | integer, int | integer] | None) – See predict().

Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int

The best iteration obtained by early stopping. This attribute is 0-based; for instance, if the best iteration is the first round, then best_iteration is 0.

property best_score: float

The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:

coef_

Return type:

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
Return type:

evals_result

property feature_importances_: ndarray

Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of loss change for each split from all trees.

Returns:

feature_importances_ – array of shape [n_features], except for a multi-class linear model, which returns an array with shape (n_features, n_classes)

property feature_names_in_: ndarray

Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None)

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters:
  • X (Any) –

    Input feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

    When tree_method is set to hist, internally the QuantileDMatrix will be used instead of the DMatrix to conserve memory. However, this has performance implications when the device of the input data does not match that of the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.

  • y (Any) – Labels

  • sample_weight (Any |None) – instance weights

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

  • xgb_model (Booster | str | XGBModel | None) – File name of a stored XGBoost model or a ‘Booster’ instance of an XGBoost model to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing base margin for the i-th validation set.

  • feature_weights (Any | None) –

    Deprecated since version 3.0.0.

    Use feature_weights in __init__() or set_params() instead.

Return type:

XGBClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception when fit was not called.

Returns:

booster

Return type:

an xgboost booster of the underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:

int

get_params(deep=True)

Get parameters.

Parameters:

deep (bool)

Return type:

Dict[str,Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:

Dict[str,Any]

property intercept_: ndarray

Intercept (bias) property

For tree-based models, the returned value is the base_score.

Returns:

intercept_

Return type:

array of shape(1,) or[n_classes]

load_model(fname)

Load the model from a file or a bytearray.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using the JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)

Parameters:

fname (PathLike | bytearray | str) – Input file name or memory buffer (see also save_raw)

Return type:

None

property n_features_in_: int

Number of features seen during fit().

predict(X, *, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (Any) – Data to predict with. See Supported data structures for various XGBoost functions for a list of supported types.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • iteration_range (Tuple[int | integer, int | integer] | None) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half open set) are used in this prediction.

    Added in version 1.4.0.

Return type:

prediction

predict_proba(X, validate_features=True, base_margin=None, iteration_range=None)

Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (Any) – Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • iteration_range (Tuple[int | integer, int | integer] | None) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half open set) are used in this prediction.

Returns:

a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction
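A minimal sketch of the returned shape, using hypothetical synthetic data:

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_classes=3, n_informative=4, random_state=0
)
clf = xgb.XGBClassifier(n_estimators=10)
clf.fit(X, y)
proba = clf.predict_proba(X)
print(proba.shape)  # (200, 3); each row sums to 1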

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using the JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:

fname (str | PathLike) – Output file name

Return type:

None

score(X, y, sample_weight=None)

Return accuracy on provided data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in fit.

base_margin_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin_eval_set parameter in fit.

eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for eval_set parameter in fit.

feature_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for feature_weights parameter in fit.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

sample_weight_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight_eval_set parameter in fit.

verbose : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for verbose parameter in fit.

xgb_model : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for xgb_model parameter in fit.

self : object

The updated object.

Return type:

XGBClassifier

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:

self

Parameters:

params (Any)

set_predict_proba_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', validate_features='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in predict_proba.

iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for iteration_range parameter in predict_proba.

validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for validate_features parameter in predict_proba.

self : object

The updated object.

Return type:

XGBClassifier

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in predict.

iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for iteration_range parameter in predict.

output_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for output_margin parameter in predict.

validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for validate_features parameter in predict.

self : object

The updated object.

Return type:

XGBClassifier

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

self : object

The updated object.

Return type:

XGBClassifier

class xgboost.XGBRanker(*, objective='rank:ndcg', **kwargs)

Bases: XGBRankerMixIn, XGBModel

Implementation of the Scikit-Learn API for XGBoost Ranking.

See Learning to Rank for an introduction.

See Using the Scikit-Learn Estimator Interface for more information.
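Unlike the other estimators, each row must be assigned to a query group via qid (or group) when fitting; the following minimal sketch uses hypothetical data.

import numpy as np
import xgboost as xgb

# Hypothetical data: eight documents spread over two queries.
rng = np.random.default_rng(0)
X = rng.random((8, 4))
y = np.array([0, 1, 2, 0, 1, 0, 2, 1])    # relevance labels
qid = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # query id per row; must be sorted

ranker = xgb.XGBRanker(n_estimators=5)
ranker.fit(X, y, qid=qid)
scores = ranker.predict(X)  # higher score = ranked earlier within its query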

Parameters:
  • n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.

  • max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature.

  • grow_policy (Optional[str]) –

    Tree growing policy.

    • depthwise: Favors splitting at nodes closest to the root.

    • lossguide: Favors splitting at nodes with highest loss change.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –

    Specify the learning task and the corresponding learning objective or a custom objective function to be used.

    For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document tree method.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • sampling_method (Optional[str]) –

    Sampling method. Used only by the GPU version of the hist tree method.

    • uniform: Select random training instances uniformly.

    • gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –

    Random number seed.

    Note

    Using the gblinear booster with the shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float) – Value in the data which is to be treated as missing. Defaults to numpy.nan.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree models, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear models, only “weight” is defined and it’s the normalized coefficients without bias.

  • device (Optional[str]) –

    Added in version 2.0.0.

    Device ordinal; available options are cpu, cuda, and gpu.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • enable_categorical (bool) – See the same parameter of DMatrix for details.

  • feature_types (Optional[Sequence[str]]) –

    Added in version 1.7.0.

    Used for specifying feature types without constructing a dataframe. SeeDMatrix for details.

  • feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selectedwhen colsample is being used. All values must be greater than 0, otherwise aValueError is thrown.

  • max_cat_to_onehot (Optional[int]) –

    Added in version 1.6.0.

    Note

    This parameter is experimental

    A threshold for deciding whether XGBoost should use one-hot encoding based splitfor categorical data. When number of categories is lesser than the thresholdthen one-hot encoding is chosen, otherwise the categories will be partitionedinto children nodes. Also,enable_categorical needs to be set to havecategorical feature support. SeeCategorical Data andParameters for Categorical Feature for details.

  • max_cat_threshold (Optional[int]) –

    Added in version 1.7.0.

    Note

    This parameter is experimental

    Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • multi_strategy (Optional[str]) –

    Added in version 2.0.0.

    Note

    This parameter is a work in progress.

    The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.

    • one_output_per_tree: One model for each target.

    • multi_output_tree: Use multi-target trees.

  • eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –

    Added in version 1.6.0.

    Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (see XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user-defined metric that looks like sklearn.metrics.

    If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.

    Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

    For advanced usage on early stopping, like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

    See Custom Objective and Evaluation Metric and Custom objective and metric for more information.

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])

  • early_stopping_rounds (Optional[int]) –

    Added in version 1.6.0.

    • Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit(). See the sketch below.

    • If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.

    • If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.

    • If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
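
    A minimal sketch of the mechanics described above (synthetic data; XGBRegressor is used for brevity, but the parameter behaves the same for every estimator in this module):

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    X = np.random.rand(500, 10)
    y = np.random.rand(500)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

    reg = xgb.XGBRegressor(
        n_estimators=1000,
        early_stopping_rounds=10,  # stop after 10 rounds without improvement
        eval_metric="rmse",
    )
    # The last item in eval_set drives early stopping.
    reg.fit(X_tr, y_tr, eval_set=[(X_va, y_va)])
    print(reg.best_iteration, reg.best_score)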

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.

    Note

    States in callback are not preserved during training, which means callback objects cannot be reused for multiple training sessions without reinitialization or deepcopy.

    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)

  • kwargs (Optional[Any]) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    A custom objective function is currently not supported by XGBRanker.

    Note

    Query group information is only required for ranking training but not prediction. Multiple groups can be predicted on a single call to predict().

    When fitting the model with the group parameter, your data need to be sorted by the query group first. group is an array that contains the size of each query group.

    Similarly, when fitting the model with the qid parameter, the data should be sorted according to query index and qid is an array that contains the query index for each training sample.

    For example, if your original data look like:

    qid    label    features
    1      0        x_1
    1      1        x_2
    1      0        x_3
    2      0        x_4
    2      1        x_5
    2      1        x_6
    2      1        x_7

    then the fit() method can be called with either the group array as [3, 4] or with qid as [1, 1, 1, 2, 2, 2, 2], that is, the qid column; a sketch follows below. Also, qid can be a special column of the input X instead of a separate parameter, see fit() for more info.
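
    A minimal sketch of the two equivalent calls (random features standing in for x_1 … x_7):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(7, 4)                # x_1 ... x_7
    y = np.array([0, 1, 0, 0, 1, 1, 1])     # labels from the table above

    ranker = xgb.XGBRanker(tree_method="hist", n_estimators=10)

    # Query sizes: 3 documents for qid 1, 4 documents for qid 2 ...
    ranker.fit(X, y, group=[3, 4])
    # ... or, equivalently, one query index per (sorted) row:
    ranker.fit(X, y, qid=np.array([1, 1, 1, 2, 2, 2, 2]))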

apply(X, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:
Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int

The best iteration obtained by early stopping. This attribute is 0-based; for instance, if the best iteration is the first round, then best_iteration is 0.

property best_score: float

The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:

coef_

Return type:

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:

evals_result
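
A minimal usage sketch (toy ranking data; the metric name is chosen for illustration):

import numpy as np
import xgboost as xgb

X = np.random.rand(7, 2)
y = np.array([0, 1, 0, 0, 1, 1, 1])
qid = np.array([1, 1, 1, 2, 2, 2, 2])

ranker = xgb.XGBRanker(n_estimators=5, eval_metric="ndcg")
ranker.fit(X, y, qid=qid, eval_set=[(X, y)], eval_qid=[qid])

# One entry per validation set: "validation_0", "validation_1", ...
for name, metrics in ranker.evals_result().items():
    for metric, values in metrics.items():
        print(name, metric, values)  # one value per boosting round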

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of the loss change for each split from all trees.

Returns:

feature_importances_ – array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes).

property feature_names_in_: ndarray

Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, group=None, qid=None, sample_weight=None, base_margin=None, eval_set=None, eval_group=None, eval_qid=None, verbose=False, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None)

Fit gradient boosting ranker.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters:
  • X (Any) –

    Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

    When this is a pandas.DataFrame or a cudf.DataFrame, it may contain a special column called qid for specifying the query index, as sketched below. Using a special column is the same as using the qid parameter, except for being compatible with sklearn utility functions like sklearn.model_selection.cross_validate(). The same convention applies to XGBRanker.score() and XGBRanker.predict().

    qid    feat_0         feat_1
    0      \(x_{00}\)     \(x_{01}\)
    1      \(x_{10}\)     \(x_{11}\)
    1      \(x_{20}\)     \(x_{21}\)

    When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of the input data does not match the device used by the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.
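
    A minimal sketch of the special-column convention (toy data; the column must be named qid):

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    df = pd.DataFrame(
        {
            "qid": [1, 1, 1, 2, 2, 2, 2],  # query index per row
            "feat_0": np.random.rand(7),
            "feat_1": np.random.rand(7),
        }
    )
    y = np.array([0, 1, 0, 0, 1, 1, 1])

    ranker = xgb.XGBRanker(tree_method="hist", n_estimators=10)
    ranker.fit(df, y)  # no separate qid/group argument needed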

  • y (Any) – Labels

  • group (Any | None) – Size of each query group of training data. Should have as many elements as the query groups in the training data. If this is set to None, then the user must provide qid.

  • qid (Any | None) – Query ID for each training sample. Should have the size of n_samples. If this is set to None, then the user must provide group or a special column in X.

  • sample_weight (Any | None) –

    Query group weights

    Note

    Weights are per-group for ranking tasks

    In ranking task, one weight is assigned to each query group/id (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_group (Sequence[Any] | None) – A list in which eval_group[i] is the list containing the sizes of all query groups in the i-th pair in eval_set.

  • eval_qid (Sequence[Any] | None) – A list in which eval_qid[i] is the array containing the query ID of the i-th pair in eval_set. The special column convention in X applies to validation datasets as well.

  • verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

  • xgb_model (Booster | str | XGBModel | None) – File name of a stored XGBoost model or a ‘Booster’ instance of an XGBoost model to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Sequence[Any] | None) –

    A list of the form [L_1, L_2, …, L_n], where each L_i is a list of group weights on the i-th validation set.

    Note

    Weights are per-group for ranking tasks

    In ranking task, one weight is assigned to each query group (not each data point). This is because we only care about the relative ordering of data points within each group, so it doesn’t make sense to assign weights to individual data points.

  • base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing base margin for the i-th validation set.

  • feature_weights (Any | None) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.

Return type:

XGBRanker

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit() has not been called.

Returns:

booster

Return type:

an xgboost booster of the underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:

int

get_params(deep=True)

Get parameters.

Parameters:

deep (bool)

Return type:

Dict[str,Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:

Dict[str,Any]

propertyintercept_:ndarray

Intercept (bias) property

For tree-based models, the returned value is the base_score.

Returns:

intercept_

Return type:

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or a bytearray.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved, see Model IO for more info.

model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)

Parameters:

fname (PathLike | bytearray | str) – Input file name or memory buffer (see also save_raw)

Return type:

None

propertyn_features_in_:int

Number of features seen during fit().

predict(X, *, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (Any) – Data to predict with. See Supported data structures for various XGBoost functions for a list of supported types.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • iteration_range (Tuple[int | integer, int | integer] | None) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half open set) are used in this prediction; see the sketch below.

    Added in version 1.4.0.
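
    A minimal sketch (toy ranking data) of restricting prediction to a slice of the boosted rounds:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(7, 2)
    y = np.array([0, 1, 0, 0, 1, 1, 1])
    qid = np.array([1, 1, 1, 2, 2, 2, 2])

    ranker = xgb.XGBRanker(n_estimators=100).fit(X, y, qid=qid)

    # Use only the trees built during rounds [10, 20).
    scores = ranker.predict(X, iteration_range=(10, 20))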

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved, see Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:

fname (str | PathLike) – Output file name

Return type:

None

score(X, y)

Evaluate score for data using the last evaluation metric. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:
  • X (Union[pd.DataFrame, cudf.DataFrame]) – Feature matrix. A DataFrame with a special qid column.

  • y (Any) – Labels

Returns:

The result of the last evaluation metric for the ranker.

Return type:

score

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', eval_group='$UNCHANGED$', eval_qid='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', group='$UNCHANGED$', qid='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others. A short sketch follows below.

Added in version 1.3.
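
A minimal sketch of declaring a request (nothing is fitted here; printing the MetadataRequest shows the effect):

import sklearn
import xgboost as xgb

# Metadata routing must be switched on globally.
sklearn.set_config(enable_metadata_routing=True)

# Ask meta-estimators to route qid through to fit(); error out if a
# caller provides group instead.
ranker = xgb.XGBRanker().set_fit_request(qid=True, group=None)
print(ranker.get_metadata_routing())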

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in fit.

base_margin_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin_eval_set parameter in fit.

eval_group : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for eval_group parameter in fit.

eval_qid : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for eval_qid parameter in fit.

eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for eval_set parameter in fit.

feature_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for feature_weights parameter in fit.

group : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for group parameter in fit.

qid : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for qid parameter in fit.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

sample_weight_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight_eval_set parameter in fit.

verbose : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for verbose parameter in fit.

xgb_model : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for xgb_model parameter in fit.

self : object

The updated object.

Parameters:
Return type:

XGBRanker

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search. A short sketch follows below.

Return type:

self

Parameters:

params (Any)
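
A minimal sketch (parameter values are arbitrary): both sklearn-style attributes and native Booster parameters can be set.

import xgboost as xgb

ranker = xgb.XGBRanker(tree_method="hist")
# sklearn-style member variable:
ranker.set_params(max_depth=4)
# Native XGBoost parameter with no sklearn member variable; kept in
# **kwargs and forwarded to the underlying Booster:
ranker.set_params(eta=0.05)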

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in predict.

iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for iteration_range parameter in predict.

output_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for output_margin parameter in predict.

validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for validate_features parameter in predict.

self : object

The updated object.

Parameters:
Return type:

XGBRanker

class xgboost.XGBRFRegressor(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: XGBRegressor

scikit-learn API for XGBoost random forest regression. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:
  • n_estimators (Optional[int]) – Number of trees in random forest to fit.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.

  • max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature

  • grow_policy (Optional[str]) –

    Tree growing policy.

    • depthwise: Favors splitting at nodes closest to the root.

    • lossguide: Favors splitting at nodes with the highest loss change.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –

    Specify the learning task and the corresponding learning objective or a custom objective function to be used.

    For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instances.

  • sampling_method (Optional[str]) –

    Sampling method. Used only by the GPU version of the hist tree method.

    • uniform: Select random training instances uniformly.

    • gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float) – Value in the data which needs to be treated as missing. Default to numpy.nan.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • device (Optional[str]) –

    Added in version 2.0.0.

    Device ordinal, available options are cpu, cuda, and gpu.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • enable_categorical (bool) – See the same parameter of DMatrix for details.

  • feature_types (Optional[Sequence[str]]) –

    Added in version 1.7.0.

    Used for specifying feature types without constructing a dataframe. See DMatrix for details.

  • feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.

  • max_cat_to_onehot (Optional[int]) –

    Added in version 1.6.0.

    Note

    This parameter is experimental

    A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • max_cat_threshold (Optional[int]) –

    Added in version 1.7.0.

    Note

    This parameter is experimental

    Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • multi_strategy (Optional[str]) –

    Added in version 2.0.0.

    Note

    This parameter is a work in progress.

    The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.

    • one_output_per_tree: One model for each target.

    • multi_output_tree: Use multi-target trees.

  • eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –

    Added in version 1.6.0.

    Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (see XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user-defined metric that looks like sklearn.metrics.

    If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.

    Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

    For advanced usage on early stopping, like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

    See Custom Objective and Evaluation Metric and Custom objective and metric for more information.

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])

  • early_stopping_rounds (Optional[int]) –

    Added in version 1.6.0.

    • Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

    • If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.

    • If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.

    • If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.

    Note

    States in callback are not preserved during training, which means callback objects cannot be reused for multiple training sessions without reinitialization or deepcopy.

    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)

  • kwargs (Optional[Any]) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]; a sketch follows below:

    y_true : array_like of shape [n_samples]
        The target values

    y_pred : array_like of shape [n_samples]
        The predicted values

    sample_weight :
        Optional sample weights.

    grad : array_like of shape [n_samples]
        The value of the gradient for each sample point.

    hess : array_like of shape [n_samples]
        The value of the second derivative for each sample point.

    Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.
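
    A minimal sketch of the documented signature, using squared error (whose gradient and Hessian are well known) as the custom objective:

    import numpy as np
    import xgboost as xgb

    def squared_error(y_true: np.ndarray, y_pred: np.ndarray):
        # 0.5 * (y_pred - y_true)**2 differentiated w.r.t. the prediction
        grad = y_pred - y_true
        hess = np.ones_like(y_true)  # second derivative is constant
        return grad, hess

    X = np.random.rand(100, 4)
    y = np.random.rand(100)

    reg = xgb.XGBRFRegressor(n_estimators=10, objective=squared_error)
    reg.fit(X, y)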

apply(X, iteration_range=None)

Return the predicted leaf index of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:
Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]

property best_iteration: int

The best iteration obtained by early stopping. This attribute is 0-based; for instance, if the best iteration is the first round, then best_iteration is 0.

property best_score: float

The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:

coef_

Return type:

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}

Return type:

evals_result

property feature_importances_: ndarray

Feature importances property, return depends on importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is “averaged” over all targets. The “average” is defined based on the importance type. For instance, if the importance type is “total_gain”, then the score is the sum of the loss change for each split from all trees.

Returns:

feature_importances_ – array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes).

property feature_names_in_: ndarray

Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None)

Fit gradient boosting model.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters:
  • X (Any) –

    Input feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

    When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of the input data does not match the device used by the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.

  • y (Any) – Labels

  • sample_weight (Any | None) – instance weights

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • eval_set (Sequence[Tuple[Any, Any]] | None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • verbose (bool | int | None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

  • xgb_model (Booster | str | XGBModel | None) – File name of a stored XGBoost model or a ‘Booster’ instance of an XGBoost model to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Sequence[Any] | None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Sequence[Any] | None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing base margin for the i-th validation set.

  • feature_weights (Any | None) –

    Deprecated since version 3.0.0.

    Use feature_weights in __init__() or set_params() instead.

Return type:

XGBRFRegressor

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit() has not been called.

Returns:

booster

Return type:

an xgboost booster of the underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:

int

get_params(deep=True)

Get parameters.

Parameters:

deep (bool)

Return type:

Dict[str,Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:

Dict[str,Any]

propertyintercept_:ndarray

Intercept (bias) property

For tree-based models, the returned value is the base_score.

Returns:

intercept_

Return type:

array of shape (1,) or [n_classes]

load_model(fname)

Load the model from a file or a bytearray.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved, see Model IO for more info.

model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)

Parameters:

fname (PathLike | bytearray | str) – Input file name or memory buffer (see also save_raw)

Return type:

None

propertyn_features_in_:int

Number of features seen during fit().

predict(X, *, output_margin=False, validate_features=True, base_margin=None, iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don’t match.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (Any) – Data to predict with. See Supported data structures for various XGBoost functions for a list of supported types.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Any | None) – Global bias for each instance. See Intercept for details.

  • iteration_range (Tuple[int | integer, int | integer] | None) –

    Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half open set) are used in this prediction.

    Added in version 1.4.0.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved, see Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")

Parameters:

fname (str | PathLike) – Output file name

Return type:

None

score(X, y, sample_weight=None)

Return coefficient of determination on test data.

The coefficient of determination, \(R^2\), is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – \(R^2\) of self.predict(X) w.r.t. y.

Return type:

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor). A short usage sketch follows below.
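
A minimal usage sketch (synthetic linear data, so a high R^2 is expected):

import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = X @ np.array([1.0, 2.0, 0.0, 0.5, -1.0])

reg = xgb.XGBRFRegressor(n_estimators=50).fit(X, y)
print(reg.score(X, y))  # R^2 of self.predict(X) w.r.t. y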

set_fit_request(*, base_margin='$UNCHANGED$', base_margin_eval_set='$UNCHANGED$', eval_set='$UNCHANGED$', feature_weights='$UNCHANGED$', sample_weight='$UNCHANGED$', sample_weight_eval_set='$UNCHANGED$', verbose='$UNCHANGED$', xgb_model='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in fit.

base_margin_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin_eval_set parameter in fit.

eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for eval_set parameter in fit.

feature_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for feature_weights parameter in fit.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

sample_weight_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight_eval_set parameter in fit.

verbose : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for verbose parameter in fit.

xgb_model : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for xgb_model parameter in fit.

self : object

The updated object.

Parameters:
Return type:

XGBRFRegressor

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:

self

Parameters:

params (Any)

set_predict_request(*, base_margin='$UNCHANGED$', iteration_range='$UNCHANGED$', output_margin='$UNCHANGED$', validate_features='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for base_margin parameter in predict.

iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for iteration_range parameter in predict.

output_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for output_margin parameter in predict.

validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for validate_features parameter in predict.

self : object

The updated object.

Parameters:
Return type:

XGBRFRegressor

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

self : object

The updated object.

Parameters:
Return type:

XGBRFRegressor

class xgboost.XGBRFClassifier(*, learning_rate=1.0, subsample=0.8, colsample_bynode=0.8, reg_lambda=1e-05, **kwargs)

Bases: XGBClassifier

scikit-learn API for XGBoost random forest classification. See Using the Scikit-Learn Estimator Interface for more information.

Parameters:
  • n_estimators (Optional[int]) – Number of trees in random forest to fit.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • max_leaves (Optional[int]) – Maximum number of leaves; 0 indicates no limit.

  • max_bin (Optional[int]) – If using histogram-based algorithm, maximum number of bins per feature

  • grow_policy (Optional[str]) –

    Tree growing policy.

    • depthwise: Favors splitting at nodes closest to the root.

    • lossguide: Favors splitting at nodes with the highest loss change.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (Union[str, xgboost.sklearn._SklObjWProto, Callable[[Any, Any], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) –

    Specify the learning task and the corresponding learning objective or a custom objective function to be used.

    For custom objective, see Custom Objective and Evaluation Metric and Custom objective and metric for more information, along with the end note for function signatures.

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instances.

  • sampling_method (Optional[str]) –

    Sampling method. Used only by the GPU version of the hist tree method.

    • uniform: Select random training instances uniformly.

    • gradient_based: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses the Hogwild algorithm.

  • missing (float) – Value in the data which needs to be treated as missing. Default to numpy.nan.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Union[Dict[str, int], str, NoneType]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Union[str, List[Tuple[str]], NoneType]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • device (Optional[str]) –

    Added in version 2.0.0.

    Device ordinal, available options are cpu, cuda, and gpu.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • enable_categorical (bool) – See the same parameter of DMatrix for details.

  • feature_types (Optional[Sequence[str]]) –

    Added in version 1.7.0.

    Used for specifying feature types without constructing a dataframe. See DMatrix for details.

  • feature_weights (Optional[ArrayLike]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.

  • max_cat_to_onehot (Optional[int]) –

    Added in version 1.6.0.

    Note

    This parameter is experimental

    A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • max_cat_threshold (Optional[int]) –

    Added in version 1.7.0.

    Note

    This parameter is experimental

    Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.

  • multi_strategy (Optional[str]) –

    Added in version 2.0.0.

    Note

    This parameter is a work in progress.

    The strategy used for training multi-target models, including multi-target regression and multi-class classification. See Multiple Outputs for more information.

    • one_output_per_tree: One model for each target.

    • multi_output_tree: Use multi-target trees.

  • eval_metric (Union[str, List[Union[str, Callable]], Callable, NoneType]) –

    Added in version 1.6.0.

    Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metrics in XGBoost (see XGBoost Parameters), one of the metrics in sklearn.metrics, or any other user-defined metric that looks like sklearn.metrics.

    If a custom objective is also provided, then the custom metric should implement the corresponding reverse link function.

    Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

    For advanced usage on early stopping, like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

    See Custom Objective and Evaluation Metric and Custom objective and metric for more information.

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])

  • early_stopping_rounds (Optional[int]) –

    Added in version 1.6.0.

    • Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

    • If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.

    • If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.

    • If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

  • callbacks (Optional[List[xgboost.callback.TrainingCallback]]) –

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.

    Note

    States in callback are not preserved during training, which means callback objects cannot be reused for multiple training sessions without reinitialization or deepcopy.

    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)

  • kwargs (Optional[Any]) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameterscan be foundhere.Attempting to set a parameter via the constructor args and **kwargsdict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> [grad, hess] or objective(y_true, y_pred, *, sample_weight) -> [grad, hess]:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

sample_weight:

    Optional sample weights.

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

The value of the second derivative for each sample point.

Note that, if the custom objective produces negative values for the Hessian, these will be clipped. If the objective is non-convex, one might also consider using the expected Hessian (Fisher information) instead.
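As a concrete sketch of the signature above, a squared-error objective (the gradient and Hessian follow from differentiating 0.5 * (y_pred - y_true)**2):

import numpy as np
import xgboost as xgb

def squared_error(y_true: np.ndarray, y_pred: np.ndarray):
    grad = y_pred - y_true       # first derivative per sample
    hess = np.ones_like(y_pred)  # second derivative is a constant 1
    return grad, hess

reg = xgb.XGBRegressor(objective=squared_error)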

apply(X,iteration_range=None)

Return the predicted leaf of every tree for each sample. If the model is trained with early stopping, then best_iteration is used automatically.

Parameters:
Returns:

X_leaves – For each datapoint x in X and for each tree, return the index of the leaf x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering.

Return type:

array_like, shape=[n_samples, n_trees]
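A short sketch, assuming clf is a fitted estimator such as the one from the early-stopping example above:

leaves = clf.apply(X_valid)         # shape: (n_samples, n_trees)
print(leaves.shape, leaves[0, :5])  # leaf index per (sample, tree) pair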

property best_iteration: int

The best iteration obtained by early stopping. This attribute is 0-based; for instance, if the best iteration is the first round, then best_iteration is 0.

property best_score: float

The best score obtained by early stopping.

property coef_: ndarray

Coefficients property

Note

Coefficients are defined only for linear learners

Coefficients are only defined when the linear model is chosen as the base learner (booster=gblinear). It is not defined for other base learner types, such as tree learners (booster=gbtree).

Returns:

coef_

Return type:

array of shape [n_features] or [n_classes, n_features]

evals_result()

Return the evaluation results.

If eval_set is passed to the fit() function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit() function, the evals_result will contain the eval_metrics passed to the fit() function.

The returned evaluation result is a dictionary:

{'validation_0': {'logloss': ['0.604835', '0.531479']},
 'validation_1': {'logloss': ['0.41965', '0.17686']}}
Return type:

evals_result

property feature_importances_: ndarray

Feature importances property; the return value depends on the importance_type parameter. When the model is trained with a multi-class/multi-label/multi-target dataset, the feature importance is "averaged" over all targets. The "average" is defined based on the importance type. For instance, if the importance type is "total_gain", then the score is the sum of the loss change for each split from all trees.

Returns:

  • feature_importances_ (array of shape [n_features], except for the multi-class linear model, which returns an array with shape (n_features, n_classes))

property feature_names_in_: ndarray

Names of features seen during fit(). Defined only when X has feature names that are all strings.

fit(X,y,*,sample_weight=None,base_margin=None,eval_set=None,verbose=True,xgb_model=None,sample_weight_eval_set=None,base_margin_eval_set=None,feature_weights=None)

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.

Parameters:
  • X (Any) –

    Input feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

    When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of the input data does not match that of the algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU and then transferred to GPU.

  • y (Any) – Labels

  • sample_weight (Any |None) – instance weights

  • base_margin (Any |None) – Global bias for each instance. SeeIntercept for details.

  • eval_set (Sequence[Tuple[Any,Any]]|None) – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • verbose (bool |int |None) – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

  • xgb_model (Booster |str |XGBModel |None) – File name of a stored XGBoost model or a 'Booster' instance of an XGBoost model to be loaded before training (allows training continuation).

  • sample_weight_eval_set (Sequence[Any]|None) – A list of the form [L_1, L_2, …, L_n], where each L_i is an array-like object storing instance weights for the i-th validation set.

  • base_margin_eval_set (Sequence[Any]|None) – A list of the form [M_1, M_2, …, M_n], where each M_i is an array-like object storing base margin for the i-th validation set.

  • feature_weights (Any |None) –

    Deprecated since version 3.0.0.

    Use feature_weights in __init__() or set_params() instead.

Return type:

XGBRFClassifier

get_booster()

Get the underlying xgboost Booster of this model.

This will raise an exception if fit has not been called.

Returns:

booster

Return type:

an xgboost booster of the underlying model

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_num_boosting_rounds()

Gets the number of xgboost boosting rounds.

Return type:

int

get_params(deep=True)

Get parameters.

Parameters:

deep (bool)

Return type:

Dict[str,Any]

get_xgb_params()

Get xgboost specific parameters.

Return type:

Dict[str,Any]

property intercept_: ndarray

Intercept (bias) property

For tree-based models, the returned value is the base_score.

Returns:

intercept_

Return type:

array of shape(1,) or[n_classes]

load_model(fname)

Load the model from a file or a bytearray.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
model.load_model("model.json")
# or
model.save_model("model.ubj")
model.load_model("model.ubj")
# or
buf = model.save_raw()
model.load_model(buf)
Parameters:

fname (PathLike |bytearray |str) – Input file name or memory buffer (see also save_raw)

Return type:

None

property n_features_in_: int

Number of features seen during fit().

predict(X,*,output_margin=False,validate_features=True,base_margin=None,iteration_range=None)

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don't match.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (Any) – Data to predict with. See Supported data structures for various XGBoost functions for a list of supported types.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names areidentical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Any |None) – Global bias for each instance. SeeIntercept for details.

  • iteration_range (Tuple[int |integer,int |integer]|None) –

Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction, as in the sketch below.

    Added in version 1.4.0.

Return type:

prediction
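A short sketch, reusing the fitted clf and X_valid from the early-stopping example above:

margins = clf.predict(X_valid, output_margin=True)       # raw, untransformed scores
first10 = clf.predict(X_valid, iteration_range=(0, 10))  # only the first 10 rounds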

predict_proba(X,validate_features=True,base_margin=None,iteration_range=None)

Predict the probability of each X example being of a given class. If the model is trained with early stopping, then best_iteration is used automatically. The estimator uses inplace_predict by default and falls back to using DMatrix if devices between the data and the estimator don't match.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (Any) – Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names areidentical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Any |None) – Global bias for each instance. SeeIntercept for details.

  • iteration_range (Tuple[int |integer,int |integer]|None) – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means only the forests built during rounds [10, 20) (half-open interval) are used in this prediction.

Returns:

a numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction

save_model(fname)

Save the model to a file.

The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) are only saved when using JSON or UBJSON (default) format. Also, parameters that are not part of the model (like metrics, max_depth, etc.) are not saved; see Model IO for more info.

model.save_model("model.json")
# or
model.save_model("model.ubj")
Parameters:

fname (str |PathLike) – Output file name

Return type:

None

score(X,y,sample_weight=None)

Return accuracy on the provided data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float

set_fit_request(*,base_margin='$UNCHANGED$',base_margin_eval_set='$UNCHANGED$',eval_set='$UNCHANGED$',feature_weights='$UNCHANGED$',sample_weight='$UNCHANGED$',sample_weight_eval_set='$UNCHANGED$',verbose='$UNCHANGED$',xgb_model='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others. A hedged usage sketch follows the version note below.

Added in version 1.3.
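A hedged sketch of routing sample_weight through a meta-estimator; it assumes scikit-learn >= 1.4 for the params argument of cross_validate, and the data below is a placeholder:

import numpy as np
import sklearn
import xgboost as xgb
from sklearn.model_selection import cross_validate

sklearn.set_config(enable_metadata_routing=True)

rng = np.random.default_rng(0)
X, y, w = rng.random((100, 4)), rng.integers(0, 2, 100), rng.random(100)

# request that sample_weight be routed to fit inside the meta-estimator
clf = xgb.XGBClassifier().set_fit_request(sample_weight=True)
results = cross_validate(clf, X, y, params={"sample_weight": w})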

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the base_margin parameter in fit.

base_margin_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the base_margin_eval_set parameter in fit.

eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the eval_set parameter in fit.

feature_weights : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the feature_weights parameter in fit.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the sample_weight parameter in fit.

sample_weight_eval_set : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the sample_weight_eval_set parameter in fit.

verbose : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the verbose parameter in fit.

xgb_model : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the xgb_model parameter in fit.

self : object

The updated object.

Parameters:
Return type:

XGBRFClassifier

set_params(**params)

Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid search.

Return type:

self

Parameters:

params (Any)

set_predict_proba_request(*,base_margin='$UNCHANGED$',iteration_range='$UNCHANGED$',validate_features='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the base_margin parameter in predict_proba.

iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the iteration_range parameter in predict_proba.

validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the validate_features parameter in predict_proba.

self : object

The updated object.

Parameters:
Return type:

XGBRFClassifier

set_predict_request(*,base_margin='$UNCHANGED$',iteration_range='$UNCHANGED$',output_margin='$UNCHANGED$',validate_features='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

base_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the base_margin parameter in predict.

iteration_range : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the iteration_range parameter in predict.

output_margin : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the output_margin parameter in predict.

validate_features : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the validate_features parameter in predict.

self : object

The updated object.

Parameters:
Return type:

XGBRFClassifier

set_score_request(*,sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the sample_weight parameter in score.

self : object

The updated object.

Parameters:
Return type:

XGBRFClassifier

Plotting API

Plotting Library.

xgboost.plot_importance(booster,*,ax=None,height=0.2,xlim=None,ylim=None,title='Feature importance',xlabel='Importance score',ylabel='Features',fmap='',importance_type='weight',max_num_features=None,grid=True,show_values=True,values_format='{v}',**kwargs)

Plot importance based on fitted trees.

Parameters:
  • booster (XGBModel |Booster |dict) – Booster or XGBModel instance, or dict taken by Booster.get_fscore()

  • ax (matplotlib Axes) – Target axes instance. If None, new figure and axes will be created.

  • grid (bool) – Turn the axes grids on or off. Default is True (On).

  • importance_type (str) –

    How the importance is calculated: either “weight”, “gain”, or “cover”

    • "weight" is the number of times a feature appears in a tree

    • "gain" is the average gain of splits which use the feature

    • "cover" is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split

  • max_num_features (int |None) – Maximum number of top features displayed on the plot. If None, all features will be displayed.

  • height (float) – Bar height, passed to ax.barh()

  • xlim (tuple |None) – Tuple passed to axes.xlim()

  • ylim (tuple |None) – Tuple passed to axes.ylim()

  • title (str) – Axes title. To disable, pass None.

  • xlabel (str) – X axis title label. To disable, pass None.

  • ylabel (str) – Y axis title label. To disable, pass None.

  • fmap (str |PathLike) – The name of feature map file.

  • show_values (bool) – Show values on plot. To disable, pass False.

  • values_format (str) – Format string for values. "v" will be replaced by the value of the feature importance. E.g., pass "{v:.2f}" in order to limit the number of digits after the decimal point to two, for each value printed on the graph.

  • kwargs (Any) – Other keywords passed to ax.barh()

Returns:

ax

Return type:

matplotlib Axes
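A minimal sketch, assuming clf is any fitted tree-based XGBoost estimator and matplotlib is installed:

import matplotlib.pyplot as plt
import xgboost as xgb

ax = xgb.plot_importance(
    clf,
    importance_type="gain",
    max_num_features=10,
    values_format="{v:.2f}",  # two decimal places per bar label
)
plt.show()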

xgboost.plot_tree(booster,*,fmap='',num_trees=None,rankdir=None,ax=None,with_stats=False,tree_idx=0,**kwargs)

Plot specified tree.

Parameters:
  • booster (Booster |XGBModel) – Booster or XGBModel instance

  • fmap (str (optional)) – The name of feature map file

  • num_trees (int |None) –

    Deprecated since version 3.0.

  • rankdir (str,default "TB") – Passed to graphviz via graph_attr

  • ax (matplotlib Axes,default None) – Target axes instance. If None, new figure and axes will be created.

  • with_stats (bool) –

    Added in version 3.0.

    See to_graphviz().

  • tree_idx (int) –

    Added in version 3.0.

    See to_graphviz().

  • kwargs (Any) – Other keywords passed to to_graphviz()

Returns:

ax

Return type:

matplotlib Axes

xgboost.to_graphviz(booster,*,fmap='',num_trees=None,rankdir=None,yes_color=None,no_color=None,condition_node_params=None,leaf_node_params=None,with_stats=False,tree_idx=0,**kwargs)

Convert the specified tree to a graphviz instance. IPython can automatically plot the returned graphviz instance. Otherwise, you should call the .render() method of the returned graphviz instance. A short usage sketch follows below.

Parameters:
  • booster (Booster |XGBModel) – Booster or XGBModel instance

  • fmap (str |PathLike) – The name of feature map file

  • num_trees (int |None) –

    Deprecated since version 3.0.

    Specify the ordinal number of target tree

  • rankdir (str |None) – Passed to graphviz via graph_attr

  • yes_color (str |None) – Edge color when the node condition is met.

  • no_color (str |None) – Edge color when the node condition is not met.

  • condition_node_params (dict |None) –

    Condition node configuration for graphviz. Example:

    {'shape': 'box', 'style': 'filled,rounded', 'fillcolor': '#78bceb'}

  • leaf_node_params (dict |None) –

    Leaf node configuration for graphviz. Example:

    {'shape': 'box', 'style': 'filled', 'fillcolor': '#e48038'}

  • with_stats (bool) –

    Added in version 3.0.

    Controls whether the split statistics should be included.

  • tree_idx (int) –

    Added in version 3.0.

    Specify the ordinal index of target tree.

  • kwargs (Any) – Other keywords passed to graphviz graph_attr, e.g. graph[{key}={value}]

Returns:

graph

Return type:

graphviz.Source
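A short sketch, assuming clf is a fitted tree-based model and the graphviz package is installed:

import xgboost as xgb

graph = xgb.to_graphviz(clf, tree_idx=0, with_stats=True)
graph.render("tree_0")  # writes the DOT source plus a rendered tree_0.pdf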

Callback API

Callback library containing training routines. See Callback Functions for a quick introduction.

class xgboost.callback.TrainingCallback

Interface for training callback.

Added in version 1.3.0.

after_iteration(model,epoch,evals_log)

Run after each iteration. Returns True when training should stop.

Parameters:
  • model (Any) – Either a Booster object or a CVPack if the cv function in xgboost is being used.

  • epoch (int) – The current training iteration.

  • evals_log (Dict[str,Dict[str,List[float]|List[Tuple[float,float]]]]) –

    A dictionary containing the evaluation history:

    {"data_name":{"metric_name":[0.5,...]}}

Return type:

bool

after_training(model)

Run after training is finished.

Parameters:

model (Any)

Return type:

Any

before_iteration(model,epoch,evals_log)

Run before each iteration. Returns True when training should stop. See after_iteration() for details.

Parameters:
Return type:

bool

before_training(model)

Run before training starts.

Parameters:

model (Any)

Return type:

Any
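As an illustration of this interface, a hedged sketch of a custom callback that stops training once a monitored metric falls below a threshold; the class name, attributes, and threshold are assumptions for demonstration:

import xgboost as xgb

class ThresholdStop(xgb.callback.TrainingCallback):
    def __init__(self, data_name: str, metric_name: str, threshold: float) -> None:
        self.data_name = data_name
        self.metric_name = metric_name
        self.threshold = threshold
        super().__init__()

    def after_iteration(self, model, epoch, evals_log) -> bool:
        history = evals_log[self.data_name][self.metric_name]
        # returning True stops training
        return history[-1] < self.threshold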

class xgboost.callback.EvaluationMonitor(rank=0,period=1,show_stdv=False,logger=<function communicator_print>)

Bases: TrainingCallback

Print the evaluation result at each iteration.

Added in version 1.3.0.

Parameters:
  • rank (int) – Which worker should be used for printing the result.

  • period (int) – How many epochs between printing.

  • show_stdv (bool) – Used in cv to show standard deviation. Users should not specify it.

  • logger (Callable[[str],None]) – A callable used for logging evaluation result.

after_iteration(model,epoch,evals_log)

Run after each iteration. Returns True when training should stop.

Parameters:
  • model (Any) – Either a Booster object or a CVPack if the cv function in xgboost is being used.

  • epoch (int) – The current training iteration.

  • evals_log (Dict[str,Dict[str,List[float]|List[Tuple[float,float]]]]) –

    A dictionary containing the evaluation history:

    {"data_name":{"metric_name":[0.5,...]}}

Return type:

bool

after_training(model)

Run after training is finished.

Parameters:

model (Any)

Return type:

Any

class xgboost.callback.EarlyStopping(*,rounds,metric_name=None,data_name=None,maximize=None,save_best=False,min_delta=0.0)

Bases: TrainingCallback

Callback function for early stopping.

Added in version 1.3.0.

Parameters:
  • rounds (int) – Early stopping rounds.

  • metric_name (str |None) – Name of metric that is used for early stopping.

  • data_name (str |None) – Name of dataset that is used for early stopping.

  • maximize (bool |None) – Whether to maximize evaluation metric. None means auto (discouraged).

  • save_best (bool |None) – Whether training should return the best model or the last model. If set to True, it will only keep the boosting rounds up to the detected best iteration, discarding the ones that come after. This is only supported with tree methods (not gblinear). Also, since the cv function doesn't return a model, the parameter is not applicable there.

  • min_delta (float) –

    Added in version 1.5.0.

    Minimum absolute change in score to be qualified as an improvement.

Examples

es = xgboost.callback.EarlyStopping(
    rounds=2,
    min_delta=1e-3,
    save_best=True,
    maximize=False,
    data_name="validation_0",
    metric_name="mlogloss",
)
clf = xgboost.XGBClassifier(tree_method="hist", device="cuda", callbacks=[es])
X, y = load_digits(return_X_y=True)
clf.fit(X, y, eval_set=[(X, y)])
after_iteration(model,epoch,evals_log)

Run after each iteration. Returns True when training should stop.

Parameters:
  • model (Any) – Either a Booster object or a CVPack if the cv function in xgboost is being used.

  • epoch (int) – The current training iteration.

  • evals_log (Dict[str,Dict[str,List[float]|List[Tuple[float,float]]]]) –

    A dictionary containing the evaluation history:

    {"data_name":{"metric_name":[0.5,...]}}

Return type:

bool

after_training(model)

Run after training is finished.

Parameters:

model (Any)

Return type:

Any

before_training(model)

Run before training starts.

Parameters:

model (Any)

Return type:

Any

class xgboost.callback.LearningRateScheduler(learning_rates)

Bases: TrainingCallback

Callback function for scheduling learning rate.

Added in version 1.3.0.

Parameters:

learning_rates (Callable[[int],float]|Sequence[float]) – If it's a callable object, then it should accept an integer parameter epoch and return the corresponding learning rate. Otherwise it should be a sequence, such as a list or tuple, with the same length as the number of boosting rounds. A minimal sketch follows.
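A minimal sketch of a callable schedule; the decay factors are illustrative assumptions:

import xgboost as xgb

def custom_rates(epoch: int) -> float:
    # halve the base learning rate every 10 boosting rounds
    return 0.3 * (0.5 ** (epoch // 10))

scheduler = xgb.callback.LearningRateScheduler(custom_rates)
reg = xgb.XGBRegressor(n_estimators=50, callbacks=[scheduler])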

after_iteration(model,epoch,evals_log)

Run after each iteration. Returns True when training should stop.

Parameters:
  • model (Any) – Either a Booster object or a CVPack if the cv function in xgboost is being used.

  • epoch (int) – The current training iteration.

  • evals_log (Dict[str,Dict[str,List[float]|List[Tuple[float,float]]]]) –

    A dictionary containing the evaluation history:

    {"data_name":{"metric_name":[0.5,...]}}

Return type:

bool

class xgboost.callback.TrainingCheckPoint(directory,name='model',as_pickle=False,interval=100)

Bases: TrainingCallback

Checkpointing operation. Users are encouraged to create their own callbacks for checkpointing, as XGBoost doesn't handle distributed file systems. When checkpointing on distributed systems, be sure to know the rank of the worker to avoid multiple workers checkpointing to the same place.

Added in version 1.3.0.

Since XGBoost 2.1.0, the default format is changed to UBJSON.

Parameters:
  • directory (str |PathLike) – Output model directory.

  • name (str) – Pattern of the output model file. Models will be saved as name_0.ubj, name_1.ubj, name_2.ubj ….

  • as_pickle (bool) – When set to True, all training parameters will be saved in pickle format,instead of saving only the model.

  • interval (int) – Interval of checkpointing. Checkpointing is slow, so setting a larger number can reduce the performance hit. A usage sketch follows.
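A hedged sketch of periodic checkpointing; the synthetic data, directory, and interval are assumptions:

import os
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
dtrain = xgb.DMatrix(rng.random((256, 8)), label=rng.random(256))

os.makedirs("checkpoints", exist_ok=True)
# write a checkpoint into ./checkpoints every 10 boosting rounds
ckpt = xgb.callback.TrainingCheckPoint(directory="checkpoints", interval=10)
booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=50, callbacks=[ckpt])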

after_iteration(model,epoch,evals_log)

Run after each iteration. Returns True when training should stop.

Parameters:
  • model (Any) – Either a Booster object or a CVPack if the cv function in xgboost is being used.

  • epoch (int) – The current training iteration.

  • evals_log (Dict[str,Dict[str,List[float]|List[Tuple[float,float]]]]) –

    A dictionary containing the evaluation history:

    {"data_name":{"metric_name":[0.5,...]}}

Return type:

bool

before_training(model)

Run before training starts.

Parameters:

model (Any)

Return type:

Any

Dask API

PySpark API

PySpark XGBoost integration interface

class xgboost.spark.SparkXGBClassifier(*,features_col='features',label_col='label',prediction_col='prediction',probability_col='probability',raw_prediction_col='rawPrediction',pred_contrib_col=None,validation_indicator_col=None,weight_col=None,base_margin_col=None,num_workers=1,device=None,force_repartition=False,repartition_random_shuffle=False,enable_sparse_data_optim=False,launch_tracker_on_driver=True,coll_cfg=None,**kwargs)

Bases: _SparkXGBEstimator, HasProbabilityCol, HasRawPredictionCol

SparkXGBClassifier is a PySpark ML estimator. It implements the XGBoost classification algorithm based on the XGBoost python library, and it can be used in PySpark Pipeline and PySpark ML meta algorithms like CrossValidator/TrainValidationSplit/OneVsRest.

SparkXGBClassifier automatically supports most of the parameters in the xgboost.XGBClassifier constructor and most of the parameters used in the xgboost.XGBClassifier.fit() and xgboost.XGBClassifier.predict() methods.

To enable GPU support, set device to cuda or gpu.

SparkXGBClassifier doesn't support setting base_margin explicitly; instead, it supports another param called base_margin_col. See the doc below for more details.

SparkXGBClassifier doesn't support setting output_margin, but we can get the output margin from the raw prediction column. See the raw_prediction_col param doc below for more details.

SparkXGBClassifier doesn't support the validate_features and output_margin params.

SparkXGBClassifier doesn't support setting the nthread xgboost param; instead, the nthread param for each xgboost worker will be set equal to the spark.task.cpus config value.

Parameters:
  • features_col (str |List[str]) – When the value is string, it requires the features column name to be vector type.When the value is a list of string, it requires all the feature columns to be numeric types.

  • label_col (str) – Label column name. Default to “label”.

  • prediction_col (str) – Prediction column name. Default to “prediction”

  • probability_col (str) – Column name for predicted class conditional probabilities. Default to probabilityCol

  • raw_prediction_col (str) – The output_margin=True is implicitly supported by the rawPredictionCol output column, which is always returned with the predicted margin values.

  • pred_contrib_col (pyspark.ml.param.Param[str]) – Contribution prediction column name.

  • validation_indicator_col (str |None) – For params related to xgboost.XGBClassifier training with an evaluation dataset's supervision, set the xgboost.spark.SparkXGBClassifier.validation_indicator_col parameter instead of setting the eval_set parameter in the xgboost.XGBClassifier fit method.

  • weight_col (str |None) – To specify the weight of the training and validation dataset, set the xgboost.spark.SparkXGBClassifier.weight_col parameter instead of setting the sample_weight and sample_weight_eval_set parameters in the xgboost.XGBClassifier fit method.

  • base_margin_col (str |None) – To specify the base margins of the training and validation dataset, set the xgboost.spark.SparkXGBClassifier.base_margin_col parameter instead of setting base_margin and base_margin_eval_set in the xgboost.XGBClassifier fit method.

  • num_workers (int) – How many XGBoost workers to use for training. Each XGBoost worker corresponds to one spark task.

  • device (str |None) –

    Added in version 2.0.0.

    Device for XGBoost workers, available options are cpu, cuda, and gpu.

  • force_repartition (bool) – Boolean value to specify whether to force the input dataset to be repartitioned before XGBoost training.

  • repartition_random_shuffle (bool) – Boolean value to specify whether to randomly shuffle the dataset when repartitioning is required.

  • enable_sparse_data_optim (bool) – Boolean value to specify whether to enable sparse data optimization. If True, the XGBoost DMatrix object will be constructed from a sparse matrix instead of a dense matrix.

  • launch_tracker_on_driver (bool) – Boolean value to indicate whether the tracker should be launched on the driver side or the executor side.

  • coll_cfg (Config |None) – The collective configuration. See Config.

  • kwargs (Any) – A dictionary of xgboost parameters, please refer to https://xgboost.readthedocs.io/en/stable/parameter.html

Note

The Parameters chart above contains parameters that need special handling. For a full list of parameters, see entries with Param(parent=… below.

This API is experimental.

Examples

>>> from xgboost.spark import SparkXGBClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df_train = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
...     (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
...     (Vectors.dense(4.0, 5.0, 6.0), 0, True, 1.0),
...     (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, True, 2.0),
... ], ["features", "label", "isVal", "weight"])
>>> df_test = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0),),
... ], ["features"])
>>> xgb_classifier = SparkXGBClassifier(max_depth=5, missing=0.0,
...     validation_indicator_col='isVal', weight_col='weight',
...     early_stopping_rounds=1, eval_metric='logloss')
>>> xgb_clf_model = xgb_classifier.fit(df_train)
>>> xgb_clf_model.transform(df_test).show()
clear(param)

Clears a param from the param map if it has been explicitly set.

Parameters:

param (Param)

Return type:

None

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:
  • extra (dict,optional) – Extra parameters to copy to the new instance

  • self (P)

Returns:

Copy of this instance

Return type:

Params

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

Parameters:

param (str |Param)

Return type:

str

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

Return type:

str

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict,optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset,params=None)

Fits a model to the input dataset with optional parameters.

Added in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list ofTransformer

fitMultiple(dataset,paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Added in version 2.3.0.

Parameters:
Returns:

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where the model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator

getFeaturesCol()

Gets the value of featuresCol or its default value.

Return type:

str

getLabelCol()

Gets the value of labelCol or its default value.

Return type:

str

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

Parameters:

param (str |Param[T])

Return type:

Any |T

getParam(paramName)

Gets a param by its name.

Parameters:

paramName (str)

Return type:

Param

getPredictionCol()

Gets the value of predictionCol or its default value.

Return type:

str

getProbabilityCol()

Gets the value of probabilityCol or its default value.

Return type:

str

getRawPredictionCol()

Gets the value of rawPredictionCol or its default value.

Return type:

str

getValidationIndicatorCol()

Gets the value of validationIndicatorCol or its default value.

Return type:

str

getWeightCol()

Gets the value of weightCol or its default value.

Return type:

str

hasDefault(param)

Checks whether a param has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

Parameters:

paramName (str)

Return type:

bool

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

isSet(param)

Checks whether a param is explicitly set by user.

Parameters:

param (str |Param[Any])

Return type:

bool

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

Parameters:

path (str)

Return type:

RL

property params: List[Param]

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Return the reader for loading the estimator.

Return type:

SparkXGBReader

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

Parameters:

path (str)

Return type:

None

set(param,value)

Sets a parameter in the embedded param map.

Parameters:
Return type:

None

setParams(**kwargs)

Set params for the estimator.

Parameters:

kwargs (Any)

Return type:

None

set_coll_cfg(value)

Set collective configuration

Parameters:

value (Config)

Return type:

_SparkXGBParams

set_device(value)

Set device, optional value: cpu, cuda, gpu

Parameters:

value (str)

Return type:

_SparkXGBParams

uid

A unique id for the object.

write()

Return the writer for saving the estimator.

Return type:

SparkXGBWriter

class xgboost.spark.SparkXGBClassifierModel(xgb_sklearn_model=None,training_summary=None)

Bases: _ClassificationModel

The model returned by xgboost.spark.SparkXGBClassifier.fit()

Note

This API is experimental.

Parameters:
  • xgb_sklearn_model (XGBModel |None)

  • training_summary (XGBoostTrainingSummary |None)

clear(param)

Clears a param from the param map if it has been explicitly set.

Parameters:

param (Param)

Return type:

None

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:
  • extra (dict,optional) – Extra parameters to copy to the new instance

  • self (P)

Returns:

Copy of this instance

Return type:

Params

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

Parameters:

param (str |Param)

Return type:

str

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

Return type:

str

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict,optional) – extra param values

Returns:

merged param map

Return type:

dict

getFeaturesCol()

Gets the value of featuresCol or its default value.

Return type:

str

getLabelCol()

Gets the value of labelCol or its default value.

Return type:

str

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

Parameters:

param (str |Param[T])

Return type:

Any |T

getParam(paramName)

Gets a param by its name.

Parameters:

paramName (str)

Return type:

Param

getPredictionCol()

Gets the value of predictionCol or its default value.

Return type:

str

getProbabilityCol()

Gets the value of probabilityCol or its default value.

Return type:

str

getRawPredictionCol()

Gets the value of rawPredictionCol or its default value.

Return type:

str

getValidationIndicatorCol()

Gets the value of validationIndicatorCol or its default value.

Return type:

str

getWeightCol()

Gets the value of weightCol or its default value.

Return type:

str

get_booster()

Return the xgboost.core.Booster instance.

Return type:

Booster

get_feature_importances(importance_type='weight')

Get feature importance of each feature. The importance type can be defined as:

  • ‘weight’: the number of times a feature is used to split the data across all trees.

  • ‘gain’: the average gain across all splits the feature is used in.

  • ‘cover’: the average coverage across all splits the feature is used in.

  • ‘total_gain’: the total gain across all splits the feature is used in.

  • ‘total_cover’: the total coverage across all splits the feature is used in.

Parameters:

importance_type (str,default 'weight') – One of the importance types defined above.

Return type:

Dict[str,float |List[float]]

hasDefault(param)

Checks whether a param has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

Parameters:

paramName (str)

Return type:

bool

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

isSet(param)

Checks whether a param is explicitly set by user.

Parameters:

param (str |Param[Any])

Return type:

bool

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

Parameters:

path (str)

Return type:

RL

property params: List[Param]

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Return the reader for loading the model.

Return type:

SparkXGBModelReader

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

Parameters:

path (str)

Return type:

None

set(param,value)

Sets a parameter in the embedded param map.

Parameters:
Return type:

None

set_coll_cfg(value)

Set collective configuration

Parameters:

value (Config)

Return type:

_SparkXGBParams

set_device(value)

Set device, optional value: cpu, cuda, gpu

Parameters:

value (str)

Return type:

_SparkXGBParams

transform(dataset,params=None)

Transforms the input dataset with optional parameters.

Added in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict,optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

uid

A unique id for the object.

write()

Return the writer for saving the model.

Return type:

SparkXGBModelWriter

class xgboost.spark.SparkXGBRegressor(*,features_col='features',label_col='label',prediction_col='prediction',pred_contrib_col=None,validation_indicator_col=None,weight_col=None,base_margin_col=None,num_workers=1,device=None,force_repartition=False,repartition_random_shuffle=False,enable_sparse_data_optim=False,launch_tracker_on_driver=True,coll_cfg=None,**kwargs)

Bases: _SparkXGBEstimator

SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost regression algorithm based on the XGBoost python library, and it can be used in PySpark Pipeline and PySpark ML meta algorithms like CrossValidator/TrainValidationSplit/OneVsRest.

SparkXGBRegressor automatically supports most of the parameters in the xgboost.XGBRegressor constructor and most of the parameters used in the xgboost.XGBRegressor.fit() and xgboost.XGBRegressor.predict() methods.

To enable GPU support, set device to cuda or gpu.

SparkXGBRegressor doesn't support setting base_margin explicitly; instead, it supports another param called base_margin_col. See the doc below for more details.

SparkXGBRegressor doesn't support the validate_features and output_margin params.

SparkXGBRegressor doesn't support setting the nthread xgboost param; instead, the nthread param for each xgboost worker will be set equal to the spark.task.cpus config value.

Parameters:
  • features_col (str |List[str]) – When the value is string, it requires the features column name to be vector type.When the value is a list of string, it requires all the feature columns to be numeric types.

  • label_col (str) – Label column name. Default to “label”.

  • prediction_col (str) – Prediction column name. Default to “prediction”

  • pred_contrib_col (pyspark.ml.param.Param[str]) – Contribution prediction column name.

  • validation_indicator_col (str |None) – For params related to xgboost.XGBRegressor training with an evaluation dataset's supervision, set the xgboost.spark.SparkXGBRegressor.validation_indicator_col parameter instead of setting the eval_set parameter in the xgboost.XGBRegressor fit method.

  • weight_col (str |None) – To specify the weight of the training and validation dataset, set the xgboost.spark.SparkXGBRegressor.weight_col parameter instead of setting the sample_weight and sample_weight_eval_set parameters in the xgboost.XGBRegressor fit method.

  • base_margin_col (str |None) – To specify the base margins of the training and validation dataset, set the xgboost.spark.SparkXGBRegressor.base_margin_col parameter instead of setting base_margin and base_margin_eval_set in the xgboost.XGBRegressor fit method.

  • num_workers (int) – How many XGBoost workers to use for training. Each XGBoost worker corresponds to one spark task.

  • device (str |None) –

    Added in version 2.0.0.

    Device for XGBoost workers, available options are cpu, cuda, and gpu.

  • force_repartition (bool) – Boolean value to specify whether to force the input dataset to be repartitioned before XGBoost training.

  • repartition_random_shuffle (bool) – Boolean value to specify whether to randomly shuffle the dataset when repartitioning is required.

  • enable_sparse_data_optim (bool) – Boolean value to specify whether to enable sparse data optimization. If True, the XGBoost DMatrix object will be constructed from a sparse matrix instead of a dense matrix.

  • launch_tracker_on_driver (bool) – Boolean value to indicate whether the tracker should be launched on the driver side or the executor side.

  • coll_cfg (Config |None) – The collective configuration. See Config.

  • kwargs (Any) – A dictionary of xgboost parameters, please refer to https://xgboost.readthedocs.io/en/stable/parameter.html

Note

The Parameters chart above contains parameters that need special handling. For a full list of parameters, see entries with Param(parent=… below.

This API is experimental.

Examples

>>> from xgboost.spark import SparkXGBRegressor
>>> from pyspark.ml.linalg import Vectors
>>> df_train = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
...     (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
...     (Vectors.dense(4.0, 5.0, 6.0), 2, True, 1.0),
...     (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 3, True, 2.0),
... ], ["features", "label", "isVal", "weight"])
>>> df_test = spark.createDataFrame([
...     (Vectors.dense(1.0, 2.0, 3.0),),
...     (Vectors.sparse(3, {1: 1.0, 2: 5.5}),)
... ], ["features"])
>>> xgb_regressor = SparkXGBRegressor(max_depth=5, missing=0.0,
...     validation_indicator_col='isVal', weight_col='weight',
...     early_stopping_rounds=1, eval_metric='rmse')
>>> xgb_reg_model = xgb_regressor.fit(df_train)
>>> xgb_reg_model.transform(df_test)
clear(param)

Clears a param from the param map if it has been explicitly set.

Parameters:

param (Param)

Return type:

None

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:
  • extra (dict,optional) – Extra parameters to copy to the new instance

  • self (P)

Returns:

Copy of this instance

Return type:

Params

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

Parameters:

param (str |Param)

Return type:

str

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

Return type:

str

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict,optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset,params=None)

Fits a model to the input dataset with optional parameters.

Added in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list ofTransformer

fitMultiple(dataset,paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Added in version 2.3.0.

Parameters:
Returns:

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where the model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator

getFeaturesCol()

Gets the value of featuresCol or its default value.

Return type:

str

getLabelCol()

Gets the value of labelCol or its default value.

Return type:

str

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

Parameters:

param (str |Param[T])

Return type:

Any |T

getParam(paramName)

Gets a param by its name.

Parameters:

paramName (str)

Return type:

Param

getPredictionCol()

Gets the value of predictionCol or its default value.

Return type:

str

getValidationIndicatorCol()

Gets the value of validationIndicatorCol or its default value.

Return type:

str

getWeightCol()

Gets the value of weightCol or its default value.

Return type:

str

hasDefault(param)

Checks whether a param has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

Parameters:

paramName (str)

Return type:

bool

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

isSet(param)

Checks whether a param is explicitly set by user.

Parameters:

param (str |Param[Any])

Return type:

bool

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

Parameters:

path (str)

Return type:

RL

property params: List[Param]

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Return the reader for loading the estimator.

Return type:

SparkXGBReader

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

Parameters:

path (str)

Return type:

None

set(param,value)

Sets a parameter in the embedded param map.

Parameters:
Return type:

None

setParams(**kwargs)

Set params for the estimator.

Parameters:

kwargs (Any)

Return type:

None

set_coll_cfg(value)

Set collective configuration

Parameters:

value (Config)

Return type:

_SparkXGBParams

set_device(value)

Set device, optional value: cpu, cuda, gpu

Parameters:

value (str)

Return type:

_SparkXGBParams

uid

A unique id for the object.

write()

Return the writer for saving the estimator.

Return type:

SparkXGBWriter

class xgboost.spark.SparkXGBRegressorModel(xgb_sklearn_model=None,training_summary=None)

Bases: _SparkXGBModel

The model returned by xgboost.spark.SparkXGBRegressor.fit()

Note

This API is experimental.

Parameters:
  • xgb_sklearn_model (XGBModel |None)

  • training_summary (XGBoostTrainingSummary |None)

clear(param)

Clears a param from the param map if it has been explicitly set.

Parameters:

param (Param)

Return type:

None

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:
  • extra (dict,optional) – Extra parameters to copy to the new instance

  • self (P)

Returns:

Copy of this instance

Return type:

Params

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

Parameters:

param (str |Param)

Return type:

str

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

Return type:

str

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict,optional) – extra param values

Returns:

merged param map

Return type:

dict

getFeaturesCol()

Gets the value of featuresCol or its default value.

Return type:

str

getLabelCol()

Gets the value of labelCol or its default value.

Return type:

str

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

Parameters:

param (str |Param[T])

Return type:

Any |T

getParam(paramName)

Gets a param by its name.

Parameters:

paramName (str)

Return type:

Param

getPredictionCol()

Gets the value of predictionCol or its default value.

Return type:

str

getValidationIndicatorCol()

Gets the value of validationIndicatorCol or its default value.

Return type:

str

getWeightCol()

Gets the value of weightCol or its default value.

Return type:

str

get_booster()

Return the xgboost.core.Booster instance.

Return type:

Booster

get_feature_importances(importance_type='weight')

Get feature importance of each feature. The importance type can be defined as:

  • ‘weight’: the number of times a feature is used to split the data across all trees.

  • ‘gain’: the average gain across all splits the feature is used in.

  • ‘cover’: the average coverage across all splits the feature is used in.

  • ‘total_gain’: the total gain across all splits the feature is used in.

  • ‘total_cover’: the total coverage across all splits the feature is used in.

Parameters:

importance_type (str,default 'weight') – One of the importance types defined above.

Return type:

Dict[str,float |List[float]]

hasDefault(param)

Checks whether a param has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

Parameters:

paramName (str)

Return type:

bool

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

isSet(param)

Checks whether a param is explicitly set by user.

Parameters:

param (str |Param[Any])

Return type:

bool

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

Parameters:

path (str)

Return type:

RL

property params: List[Param]

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Return the reader for loading the model.

Return type:

SparkXGBModelReader

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

Parameters:

path (str)

Return type:

None
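For instance, a minimal save/load round trip (hypothetical path; the loading class must match the saved model type, here assumed to be a ranker model):

model.save("/tmp/spark_xgb_model")
# Reload later with the matching model class:
from xgboost.spark import SparkXGBRankerModel
loaded = SparkXGBRankerModel.load("/tmp/spark_xgb_model")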

set(param,value)

Sets a parameter in the embedded param map.

Parameters:
  • param (Param)

  • value (Any)

Return type:

None

set_coll_cfg(value)

Set the collective configuration.

Parameters:

value (Config)

Return type:

_SparkXGBParams

set_device(value)

Set the device. Valid values: cpu, cuda, gpu.

Parameters:

value (str)

Return type:

_SparkXGBParams

transform(dataset,params=None)

Transforms the input dataset with optional parameters.

Added in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict,optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame
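For example, a sketch assuming df_test is a DataFrame with the configured features column and predictionCol is the shared Param handle:

preds = model.transform(df_test)
preds.show()
# Optionally override embedded params for this call only:
scored = model.transform(df_test, {model.predictionCol: "score"})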

uid

A unique id for the object.

write()

Return the writer for saving the model.

Return type:

SparkXGBModelWriter

class xgboost.spark.SparkXGBRanker(*,features_col='features',label_col='label',prediction_col='prediction',pred_contrib_col=None,validation_indicator_col=None,weight_col=None,base_margin_col=None,qid_col=None,num_workers=1,device=None,force_repartition=False,repartition_random_shuffle=False,enable_sparse_data_optim=False,launch_tracker_on_driver=True,coll_cfg=None,**kwargs)

Bases: _SparkXGBEstimator

SparkXGBRanker is a PySpark ML estimator. It implements the XGBoost ranking algorithm based on the XGBoost python library, and it can be used in PySpark Pipeline and PySpark ML meta algorithms like CrossValidator/TrainValidationSplit/OneVsRest.

SparkXGBRanker automatically supports most of the parameters in the xgboost.XGBRanker constructor and most of the parameters used in the xgboost.XGBRanker.fit() and xgboost.XGBRanker.predict() methods.

To enable GPU support, set device to cuda or gpu.

SparkXGBRanker doesn’t support setting base_margin explicitly either, but supports another param called base_margin_col. See the doc below for more details.

SparkXGBRanker doesn’t support setting output_margin, but we can get the output margin from the raw prediction column. See the raw_prediction_col param doc below for more details.

SparkXGBRanker doesn’t support the validate_features and output_margin params.

SparkXGBRanker doesn’t support setting the nthread xgboost param; instead, the nthread param for each xgboost worker will be set equal to the spark.task.cpus config value.
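For instance, a minimal sketch of pinning the per-task CPU count via Spark’s own configuration (a standard SparkSession builder is assumed):

from pyspark.sql import SparkSession

# Each XGBoost worker will then train with nthread equal to spark.task.cpus.
spark = (
    SparkSession.builder
    .config("spark.task.cpus", "4")
    .getOrCreate()
)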

Parameters:
  • features_col (str |List[str]) – When the value is a string, it requires the features column name to be of vector type. When the value is a list of strings, it requires all the feature columns to be of numeric types.

  • label_col (str) – Label column name. Defaults to “label”.

  • prediction_col (str) – Prediction column name. Defaults to “prediction”.

  • pred_contrib_col (pyspark.ml.param.Param[str]) – Contribution prediction column name.

  • validation_indicator_col (str |None) – For params related to xgboost.XGBRanker training with evaluation dataset’s supervision, set the xgboost.spark.SparkXGBRanker.validation_indicator_col parameter instead of setting the eval_set parameter in the xgboost.XGBRanker fit method.

  • weight_col (str |None) – To specify the weight of the training and validation dataset, set the xgboost.spark.SparkXGBRanker.weight_col parameter instead of setting the sample_weight and sample_weight_eval_set parameters in the xgboost.XGBRanker fit method.

  • base_margin_col (str |None) – To specify the base margins of the training and validation dataset, set the xgboost.spark.SparkXGBRanker.base_margin_col parameter instead of setting base_margin and base_margin_eval_set in the xgboost.XGBRanker fit method.

  • qid_col (str |None) – Query id column name.

  • num_workers (int) – How many XGBoost workers to use for training. Each XGBoost worker corresponds to one Spark task.

  • device (str |None) –

    Added in version 2.0.0.

    Device for XGBoost workers, available options are cpu, cuda, and gpu.

  • force_repartition (bool) – Boolean value to specify if forcing the input dataset to be repartitioned before XGBoost training.

  • repartition_random_shuffle (bool) – Boolean value to specify if randomly shuffling the dataset when repartitioning is required.

  • enable_sparse_data_optim (bool) – Boolean value to specify if enabling sparse data optimization. If True, the XGBoost DMatrix object will be constructed from a sparse matrix instead of a dense matrix.

  • launch_tracker_on_driver (bool) – Boolean value to indicate whether the tracker should be launched on the driver side or the executor side.

  • coll_cfg (Config |None) – The collective configuration. See Config.

  • kwargs (Any) – A dictionary of xgboost parameters; please refer to https://xgboost.readthedocs.io/en/stable/parameter.html

Note

The Parameters chart above contains parameters that need special handling. For a full list of parameters, see entries with Param(parent=… below.

Note

This API is experimental.

Examples

>>> from xgboost.spark import SparkXGBRanker
>>> from pyspark.ml.linalg import Vectors
>>> ranker = SparkXGBRanker(qid_col="qid")
>>> df_train = spark.createDataFrame(
...     [
...         (Vectors.dense(1.0, 2.0, 3.0), 0, 0),
...         (Vectors.dense(4.0, 5.0, 6.0), 1, 0),
...         (Vectors.dense(9.0, 4.0, 8.0), 2, 0),
...         (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 0, 1),
...         (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, 1),
...         (Vectors.sparse(3, {1: 8.0, 2: 9.5}), 2, 1),
...     ],
...     ["features", "label", "qid"],
... )
>>> df_test = spark.createDataFrame(
...     [
...         (Vectors.dense(1.5, 2.0, 3.0), 0),
...         (Vectors.dense(4.5, 5.0, 6.0), 0),
...         (Vectors.dense(9.0, 4.5, 8.0), 0),
...         (Vectors.sparse(3, {1: 1.0, 2: 6.0}), 1),
...         (Vectors.sparse(3, {1: 6.0, 2: 7.0}), 1),
...         (Vectors.sparse(3, {1: 8.0, 2: 10.5}), 1),
...     ],
...     ["features", "qid"],
... )
>>> model = ranker.fit(df_train)
>>> model.transform(df_test).show()
clear(param)

Clears a param from the param map if it has been explicitly set.

Parameters:

param (Param)

Return type:

None

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:
  • extra (dict,optional) – Extra parameters to copy to the new instance

  • self (P)

Returns:

Copy of this instance

Return type:

Params

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

Parameters:

param (str |Param)

Return type:

str

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

Return type:

str

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict,optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset,params=None)

Fits a model to the input dataset with optional parameters.

Added in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple,optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset,paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Added in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (Sequence of dict) – a sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator
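A hedged sketch (the max_depth Param handle is an assumption for illustration; see the entries with Param(parent=… on your estimator for the actual params):

param_maps = [
    {ranker.getParam("max_depth"): 4},
    {ranker.getParam("max_depth"): 8},
]
models = [None] * len(param_maps)
# The returned iterable is thread safe; indices may arrive out of order.
for index, model in ranker.fitMultiple(df_train, param_maps):
    models[index] = model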

getFeaturesCol()

Gets the value of featuresCol or its default value.

Return type:

str

getLabelCol()

Gets the value of labelCol or its default value.

Return type:

str

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

Parameters:

param (str |Param[T])

Return type:

Any |T

getParam(paramName)

Gets a param by its name.

Parameters:

paramName (str)

Return type:

Param

getPredictionCol()

Gets the value of predictionCol or its default value.

Return type:

str

getValidationIndicatorCol()

Gets the value of validationIndicatorCol or its default value.

Return type:

str

getWeightCol()

Gets the value of weightCol or its default value.

Return type:

str

hasDefault(param)

Checks whether a param has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

Parameters:

paramName (str)

Return type:

bool

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

isSet(param)

Checks whether a param is explicitly set by user.

Parameters:

param (str |Param[Any])

Return type:

bool

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

Parameters:

path (str)

Return type:

RL

property params: List[Param]

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Return the reader for loading the estimator.

Return type:

SparkXGBReader

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

Parameters:

path (str)

Return type:

None

set(param,value)

Sets a parameter in the embedded param map.

Parameters:
  • param (Param)

  • value (Any)

Return type:

None

setParams(**kwargs)

Set params for the estimator.

Parameters:

kwargs (Any)

Return type:

None
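For example (a sketch; the keyword names follow the constructor parameters shown above):

ranker.setParams(num_workers=4, device="cuda")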

set_coll_cfg(value)

Set the collective configuration.

Parameters:

value (Config)

Return type:

_SparkXGBParams

set_device(value)

Set the device. Valid values: cpu, cuda, gpu.

Parameters:

value (str)

Return type:

_SparkXGBParams
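A minimal sketch combining the two setters (assuming the fluent-return pattern implied by the return types above):

from xgboost.collective import Config
from xgboost.spark import SparkXGBRanker

ranker = SparkXGBRanker(qid_col="qid")
ranker.set_coll_cfg(Config(tracker_timeout=120)).set_device("cuda")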

uid

A unique id for the object.

write()

Return the writer for saving the estimator.

Return type:

SparkXGBWriter

class xgboost.spark.SparkXGBRankerModel(xgb_sklearn_model=None,training_summary=None)

Bases: _SparkXGBModel

The model returned by xgboost.spark.SparkXGBRanker.fit().

Note

This API is experimental.

Parameters:
  • xgb_sklearn_model (XGBModel |None)

  • training_summary (XGBoostTrainingSummary |None)

clear(param)

Clears a param from the param map if it has been explicitly set.

Parameters:

param (Param)

Return type:

None

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:
  • extra (dict,optional) – Extra parameters to copy to the new instance

  • self (P)

Returns:

Copy of this instance

Return type:

Params

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

Parameters:

param (str |Param)

Return type:

str

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

Return type:

str

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict,optional) – extra param values

Returns:

merged param map

Return type:

dict

getFeaturesCol()

Gets the value of featuresCol or its default value.

Return type:

str

getLabelCol()

Gets the value of labelCol or its default value.

Return type:

str

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

Parameters:

param (str |Param[T])

Return type:

Any |T

getParam(paramName)

Gets a param by its name.

Parameters:

paramName (str)

Return type:

Param

getPredictionCol()

Gets the value of predictionCol or its default value.

Return type:

str

getValidationIndicatorCol()

Gets the value of validationIndicatorCol or its default value.

Return type:

str

getWeightCol()

Gets the value of weightCol or its default value.

Return type:

str

get_booster()

Return the xgboost.core.Booster instance.

Return type:

Booster

get_feature_importances(importance_type='weight')

Get feature importance of each feature. Importance type can be defined as:

  • ‘weight’: the number of times a feature is used to split the data across all trees.

  • ‘gain’: the average gain across all splits the feature is used in.

  • ‘cover’: the average coverage across all splits the feature is used in.

  • ‘total_gain’: the total gain across all splits the feature is used in.

  • ‘total_cover’: the total coverage across all splits the feature is used in.

Parameters:

importance_type (str,default 'weight') – One of the importance types defined above.

Return type:

Dict[str,float |List[float]]

hasDefault(param)

Checks whether a param has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

Parameters:

paramName (str)

Return type:

bool

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

Parameters:

param (str |Param[Any])

Return type:

bool

isSet(param)

Checks whether a param is explicitly set by user.

Parameters:

param (str |Param[Any])

Return type:

bool

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

Parameters:

path (str)

Return type:

RL

property params: List[Param]

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Return the reader for loading the model.

Return type:

SparkXGBModelReader

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

Parameters:

path (str)

Return type:

None

set(param,value)

Sets a parameter in the embedded param map.

Parameters:
  • param (Param)

  • value (Any)

Return type:

None

set_coll_cfg(value)

Set the collective configuration.

Parameters:

value (Config)

Return type:

_SparkXGBParams

set_device(value)

Set the device. Valid values: cpu, cuda, gpu.

Parameters:

value (str)

Return type:

_SparkXGBParams

transform(dataset,params=None)

Transforms the input dataset with optional parameters.

Added in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict,optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

uid

A unique id for the object.

write()

Return the writer for saving the model.

Return type:

SparkXGBModelWriter

Collective

APIs related to XGBoost collective communication.

class xgboost.collective.Config(retry=None,timeout=None,tracker_host_ip=None,tracker_port=None,tracker_timeout=None)

User configuration for the communicator context. This is used for easier integration with distributed frameworks. Users of the collective module can pass the parameters directly into the tracker and the communicator.

Added in version 3.0.

Parameters:
  • retry (int |None)

  • timeout (int |None)

  • tracker_host_ip (str |None)

  • tracker_port (int |None)

  • tracker_timeout (int |None)

retry

See dmlc_retry in init().

Type:

int | None

timeout

See dmlc_timeout in init(). This is only used for communicators, not the tracker. They are different parameters since the timeout for tracker limits only the time for starting and finalizing the communication group, whereas the timeout for communicators limits the time used for collective operations.

Type:

int | None

tracker_host_ip

See RabitTracker.

Type:

str | None

tracker_port

See RabitTracker.

Type:

int | None

tracker_timeout

See RabitTracker.

Type:

int | None

xgboost.collective.init(**args)

Initialize the collective library with arguments.

Parameters:

args (int |str |None) –

Keyword arguments representing the parameters and their values.

Accepted parameters:
  • dmlc_communicator: The type of the communicator.
    • rabit: Use Rabit. This is the default if the type is unspecified.
    • federated: Use the gRPC interface for Federated Learning.

Only applicable to the Rabit communicator:
  • dmlc_tracker_uri: Hostname of the tracker.

  • dmlc_tracker_port: Port number of the tracker.

  • dmlc_task_id: ID of the current task; can be used to obtain deterministic rank assignment.

  • dmlc_retry: The number of retry when handling network errors.

  • dmlc_timeout: Timeout in seconds.

  • dmlc_nccl_path: Path to load (dlopen) nccl for GPU-based communication.

Only applicable to the Federated communicator:
  • federated_server_address: Address of the federated server.

  • federated_world_size: Number of federated workers.

  • federated_rank: Rank of the current worker.

  • federated_server_cert: Server certificate file path. Only needed for the SSL mode.

  • federated_client_key: Client key file path. Only needed for the SSL mode.

  • federated_client_cert: Client certificate file path. Only needed for the SSL mode.

Use upper case for environment variables, use lower case for runtime configuration.

Return type:

None
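For example, a hedged sketch of worker-side initialization with the Rabit communicator (host and port are placeholders and must match a running tracker):

from xgboost import collective

collective.init(
    dmlc_communicator="rabit",
    dmlc_tracker_uri="127.0.0.1",  # placeholder tracker host
    dmlc_tracker_port=9091,        # placeholder tracker port
    dmlc_retry=3,
    dmlc_timeout=600,
)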

Tracker for XGBoost collective.

class xgboost.tracker.RabitTracker(n_workers,host_ip,port=0,*,sortby='host',timeout=0)

Tracker for the collective used in XGBoost, acting as a coordinator between workers.

Parameters:
  • n_workers (int) – The total number of workers in the communication group.

  • host_ip (str |None) – The IP address of the tracker node. XGBoost can try to guess one by probing with sockets, but it’s best to explicitly pass an address.

  • port (int) – The port this tracker should listen to. XGBoost can query an available port from the OS; this configuration is useful for restricted network environments.

  • sortby (str) –

    How to sort the workers for rank assignment. The default is host, but users can set the DMLC_TASK_ID via arguments of init() and obtain deterministic rank assignment through sorting by task name. Available options are:

    • host

    • task

  • timeout (int) –

    Timeout for constructing (bootstrap) and shutting down the communication group; it doesn’t apply to communication when the group is up and running.

    The timeout value should take the time of data loading and pre-processing into account, due to potential lazy execution. By default the Tracker doesn’t have any timeout to avoid premature aborting.

    The wait_for() method has a different timeout parameter that can stop the tracker even if the tracker is still being used. A ValueError is raised when the timeout is reached.

Examples

from xgboost.tracker import RabitTracker
from xgboost import collective as coll

tracker = RabitTracker(host_ip="127.0.0.1", n_workers=2)
tracker.start()

with coll.CommunicatorContext(**tracker.worker_args()):
    ret = coll.broadcast("msg", 0)
    assert str(ret) == "msg"