Text Input Format of DMatrix
This page briefly describes the text input formats for XGBoost. However, users with access to a supported language environment such as Python or R are advised to use data parsers from that ecosystem instead, for instance sklearn.datasets.load_svmlight_file().
Warning
As stated above, users are encouraged to use third-party data parsers. The text parsers in XGBoost have been deprecated.
Basic Input Format
XGBoost currently supports two text formats for ingesting data: LIBSVM and CSV. The rest of this document describes the LIBSVM format. (See this Wikipedia article for a description of the CSV format.) Note that XGBoost does not interpret file extensions or try to guess the file format, because there is no universal agreement on file extensions for LIBSVM or CSV. Instead, it uses a URI format to specify the precise input file type. For example, if you provide a CSV file ./data.train.csv as input, XGBoost will blindly use the default LIBSVM parser to read it and raise a parser error. Instead, provide a URI of the form train.csv?format=csv or train.csv?format=libsvm. For external memory input, the URI should take a form similar to train.csv?format=csv#dtrain.cache. See also Data Interface and Using XGBoost External Memory Version.
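As a minimal sketch of the URI convention described above, the helper below assembles the `path?format=...#cache` string. The function name `dmatrix_uri` and its parameters are hypothetical and not part of XGBoost; only the resulting string shape comes from the text.

```python
def dmatrix_uri(path, fmt, cache_prefix=None):
    """Build an input URI of the form path?format=fmt[#cache_prefix].

    Hypothetical helper: it only formats the string and does not call XGBoost.
    """
    if fmt not in ("csv", "libsvm"):
        raise ValueError(f"unsupported text format: {fmt!r}")
    uri = f"{path}?format={fmt}"
    if cache_prefix is not None:
        # The fragment after '#' names the cache used for external memory input.
        uri += f"#{cache_prefix}"
    return uri


print(dmatrix_uri("train.csv", "csv"))
print(dmatrix_uri("train.csv", "csv", cache_prefix="dtrain.cache"))
```

Stating the format explicitly avoids relying on the file extension, which XGBoost ignores.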
For training or predicting, XGBoost takes an instance file in the following format:
train.txt
    1 101:1.2 102:0.03
    0 1:2.1 10001:300 10002:400
    0 0:1.3 1:0.3
    1 0:0.01 1:0.3
    0 0:0.2 1:0.3
Each line represents a single instance. In the first line, ‘1’ is the instance label, ‘101’ and ‘102’ are feature indices, and ‘1.2’ and ‘0.03’ are feature values. In the binary classification case, ‘1’ indicates positive samples and ‘0’ indicates negative samples. Probability values in [0,1] are also supported as labels, to indicate the probability of the instance being positive.
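To make the line layout concrete, here is a small sketch that splits one instance line into its label and its index:value pairs. The function `parse_libsvm_line` is a hypothetical illustration, not XGBoost's parser, and it assumes a plain line with no qid token and no weight attached to the label.

```python
def parse_libsvm_line(line):
    """Split one instance line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # the first token is the instance label
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1 101:1.2 102:0.03")
print(label, features)
```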
Auxiliary Files for Additional Information
Note: all information below applies only to the single-node version of the package. To perform distributed training with multiple nodes, skip to the section Embedding additional information inside LIBSVM file.
Group Input Format
For ranking tasks, XGBoost supports the group input format. In real-world ranking tasks, instances are organized into query groups. For example, when learning to rank web pages, the web page instances are grouped by their queries. XGBoost requires a file that indicates the group information. For example, if the instance file is the train.txt shown above, the group file should be named train.txt.group and have the following format:
train.txt.group
    2
    3
This means that the data set contains 5 instances: the first two are in one group and the other three are in another. The numbers in the group file give the number of instances in each group, in the order in which the groups appear in the instance file.
You do not have to specify the path of the group file during configuration. If the instance file name is xxx, XGBoost checks whether a file named xxx.group exists in the same directory.
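The relationship between group sizes and instances can be sketched as follows. The helper `expand_group_sizes` is hypothetical; it assumes the sizes must be positive and must sum to the number of instances, which is how the example above reads.

```python
def expand_group_sizes(sizes, n_instances):
    """Map each instance to a 0-based group number from per-group counts."""
    if any(size <= 0 for size in sizes):
        raise ValueError("group sizes must be positive")
    if sum(sizes) != n_instances:
        raise ValueError(
            f"group sizes sum to {sum(sizes)}, expected {n_instances} instances"
        )
    group_of = []
    for group_number, size in enumerate(sizes):
        # Instances are assigned to groups in file order.
        group_of.extend([group_number] * size)
    return group_of


group_text = "2\n3\n"  # contents of train.txt.group from the example
sizes = [int(line) for line in group_text.splitlines() if line.strip()]
print(expand_group_sizes(sizes, n_instances=5))
```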
Instance Weight File
Instances in the training data may be assigned weights to differentiate their relative importance. For example, we can provide an instance weight file for the train.txt file in the example, as below:
train.txt.weight
    1
    0.5
    0.5
    1
    0.5
This means that XGBoost will put more emphasis on the first and fourth instances (i.e. the positive instances) during training.
The configuration is similar to that of the group information. If the instance file name is xxx, XGBoost looks for a file named xxx.weight in the same directory. If the file exists, the instance weights are extracted and used during training.
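The naming convention can be illustrated with a short sketch that writes a weight file next to an instance file. The helper `write_weight_file` is hypothetical, and the check that there is exactly one weight per instance line is an assumption drawn from the example rather than a documented rule.

```python
import os
import tempfile


def write_weight_file(instance_path, weights):
    """Write one weight per line to '<instance_path>.weight'; return its path."""
    with open(instance_path) as f:
        n_instances = sum(1 for line in f if line.strip())
    if len(weights) != n_instances:
        raise ValueError(f"got {len(weights)} weights for {n_instances} instances")
    weight_path = instance_path + ".weight"
    with open(weight_path, "w") as f:
        for weight in weights:
            f.write(f"{weight}\n")
    return weight_path


with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.txt")
    with open(train_path, "w") as f:
        f.write(
            "1 101:1.2 102:0.03\n"
            "0 1:2.1 10001:300 10002:400\n"
            "0 0:1.3 1:0.3\n"
            "1 0:0.01 1:0.3\n"
            "0 0:0.2 1:0.3\n"
        )
    weight_path = write_weight_file(train_path, [1, 0.5, 0.5, 1, 0.5])
    print(os.path.basename(weight_path))
```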
Note
Binary buffer format and instance weights
If you choose to save the training data as a binary buffer (using save_binary()), keep in mind that the resulting binary buffer file will include the instance weights. To update the weights, use the set_weight() function.
Initial Margin File
XGBoost supports providing an initial margin prediction for each instance. For example, if we have an initial prediction from logistic regression for the train.txt file, we can create the following file:
train.txt.base_margin
    -0.4
    1.0
    3.4
XGBoost will take these values as the initial margin prediction and boost from there. An important note about base_margin is that it should be the margin prediction before transformation, so if you are using logistic loss, you need to supply the value before the logistic transformation. If you are using the XGBoost predictor, use pred_margin=1 to output margin values.
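For logistic loss, "before transformation" means the log-odds rather than the probability. The sketch below shows the two directions of that mapping using only the standard library; the function names `logit` and `sigmoid` are illustrative and not XGBoost API.

```python
import math


def logit(p):
    """Convert a probability in (0, 1) to a margin (log-odds)."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def sigmoid(margin):
    """Convert a margin back to a probability (the logistic transformation)."""
    return 1.0 / (1.0 + math.exp(-margin))


margins = [-0.4, 1.0, 3.4]  # values from train.txt.base_margin
probabilities = [sigmoid(m) for m in margins]
recovered = [logit(p) for p in probabilities]
print(probabilities)
print(recovered)  # matches the margins up to floating-point error
```

If the initial model outputs probabilities, applying `logit` to them gives values suitable for a base margin file under this assumption.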
Embedding additional information inside LIBSVM file
This section is applicable to both single- and multiple-node settings.
Query ID Columns
This is most useful for ranking tasks, where the instances are grouped into query groups. You may embed the query group ID for each instance in the LIBSVM file by adding a token of the form qid:xx to each row:
train.txt
    1 qid:1 101:1.2 102:0.03
    0 qid:1 1:2.1 10001:300 10002:400
    0 qid:2 0:1.3 1:0.3
    1 qid:2 0:0.01 1:0.3
    0 qid:3 0:0.2 1:0.3
    1 qid:3 3:-0.1 10:-0.3
    0 qid:3 6:0.2 10:0.15
Keep in mind the following restrictions:
You are not allowed to specify query IDs for some instances but not for others. Either every row is assigned a query ID or none is.
The rows have to be sorted in ascending order by query ID. For instance, no row may have a larger query ID than any row that follows it.
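The two restrictions above can be checked before handing a file to XGBoost. The following sketch is a hypothetical validator, not XGBoost's own logic; it assumes integer query IDs, as in the example, and also returns the size of each query group.

```python
def group_sizes_from_qids(lines):
    """Check the qid restrictions and return the size of each query group."""
    qids = []
    for line in lines:
        tokens = line.split()
        qid_tokens = [t for t in tokens[1:] if t.startswith("qid:")]
        qids.append(int(qid_tokens[0][4:]) if qid_tokens else None)

    if all(q is None for q in qids):
        return []  # no query IDs at all is allowed
    if any(q is None for q in qids):
        raise ValueError("either every row has a qid or none does")

    sizes = []
    previous = None
    for q in qids:
        if previous is not None and q < previous:
            raise ValueError("rows must be sorted in ascending order by qid")
        if q == previous:
            sizes[-1] += 1
        else:
            sizes.append(1)
        previous = q
    return sizes


rows = [
    "1 qid:1 101:1.2 102:0.03",
    "0 qid:1 1:2.1 10001:300 10002:400",
    "0 qid:2 0:1.3 1:0.3",
    "1 qid:2 0:0.01 1:0.3",
    "0 qid:3 0:0.2 1:0.3",
    "1 qid:3 3:-0.1 10:-0.3",
    "0 qid:3 6:0.2 10:0.15",
]
print(group_sizes_from_qids(rows))
```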
Instance weights
You may specify instance weights in the LIBSVM file by appending the corresponding weight to each instance label, in the form [label]:[weight], as shown in the following example:
train.txt
    1:1.0 101:1.2 102:0.03
    0:0.5 1:2.1 10001:300 10002:400
    0:0.5 0:1.3 1:0.3
    1:1.0 0:0.01 1:0.3
    0:0.5 0:0.2 1:0.3
Here the negative instances are assigned half the weight of the positive instances.
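As an illustration of the [label]:[weight] token, the sketch below reads the label and weight from the first token of a row. The helper `parse_label_and_weight` is hypothetical, and the fallback weight of 1.0 for rows without an embedded weight is an assumption made for this example.

```python
def parse_label_and_weight(line, default_weight=1.0):
    """Return (label, weight) from the first token of an instance line."""
    head = line.split()[0]
    if ":" in head:
        label, weight = head.split(":", 1)
        return float(label), float(weight)
    # No embedded weight: fall back to the assumed default.
    return float(head), default_weight


print(parse_label_and_weight("1:1.0 101:1.2 102:0.03"))
print(parse_label_and_weight("0:0.5 1:2.1 10001:300 10002:400"))
```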