fetch_openml

Fetch a dataset from OpenML by name or dataset ID.

Datasets are uniquely identified by either an integer ID or by a combination of name and version (i.e. there might be multiple versions of the ‘iris’ dataset). Please give either name or data_id (not both). In case a name is given, a version can also be provided.

Read more in the User Guide.

Added in version 0.20.

Note

EXPERIMENTAL

The API is experimental (particularly the return value structure), and might have small backward-incompatible changes without notice or warning in future releases.

Parameters:

name : str, default=None

String identifier of the dataset. Note that OpenML can have multiple datasets with the same name.

version : int or 'active', default='active'

Version of the dataset. Can only be provided if name is also given. If 'active', the oldest version that is still active is used. Since there may be more than one active version of a dataset, and those versions may be fundamentally different from one another, setting an exact version is highly recommended.

data_id : int, default=None

OpenML ID of the dataset. The most specific way of retrieving a dataset. If data_id is not given, name (and potential version) are used to obtain a dataset.
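The rules for combining name, version, and data_id can be sketched as a small validation helper. This is only an illustration of the documented constraints, not scikit-learn's own code; the function name `resolve_dataset_query` and the shape of the returned dictionary are hypothetical.

```python
def resolve_dataset_query(name=None, version="active", data_id=None):
    """Hypothetical helper mirroring the documented argument rules."""
    # Exactly one of name or data_id must identify the dataset.
    if name is not None and data_id is not None:
        raise ValueError("Give either name or data_id, not both.")
    if name is None and data_id is None:
        raise ValueError("Either name or data_id must be given.")

    if data_id is not None:
        # A version only makes sense together with a name.
        if version != "active":
            raise ValueError("version can only be provided together with name.")
        return {"data_id": data_id}

    return {"name": name, "version": version}
```

Pinning an exact version, for example `resolve_dataset_query(name="iris", version=1)`, avoids depending on which version happens to be the oldest active one.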

data_home : str or path-like, default=None

Specify another download and cache folder for the data sets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

target_column : str, list or None, default='default-target'

Specify the column name in the data to use as target. If 'default-target', the standard target column as stored on the server is used. If None, all columns are returned as data and the target is None. If list (of strings), all columns with these names are returned as multi-target (Note: not all scikit-learn classifiers can handle all types of multi-output combinations).
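As a rough illustration of how target_column partitions a dataset's columns, the sketch below separates feature names from target names. It is not the library's implementation: `split_columns` is a hypothetical helper, and its `default_target` argument merely stands in for the target column stored on the server.

```python
def split_columns(columns, target_column="default-target", default_target=None):
    """Hypothetical helper: return (feature_names, target_names)."""
    if target_column == "default-target":
        # Use the server-side default target, if one is known.
        targets = [default_target] if default_target is not None else []
    elif target_column is None:
        # No target: every column is returned as data.
        targets = []
    elif isinstance(target_column, str):
        targets = [target_column]
    else:
        # A list of names requests a multi-target output.
        targets = list(target_column)

    missing = [t for t in targets if t not in columns]
    if missing:
        raise KeyError(f"Unknown target column(s): {missing}")

    features = [c for c in columns if c not in targets]
    return features, targets
```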

cache : bool, default=True

Whether to cache the downloaded datasets into data_home.

return_X_y : bool, default=False

If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target objects.

as_frame : bool or 'auto', default='auto'

If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.

If as_frame is 'auto', the data and target will be converted to DataFrame or Series as if as_frame is set to True, unless the dataset is stored in sparse format.

If as_frame is False, the data and target will be NumPy arrays and the data will only contain numerical values when parser="liac-arff", where the categories are provided in the attribute categories of the Bunch instance. When parser="pandas", no ordinal encoding is made.

Changed in version 0.24: The default value of as_frame changed from False to 'auto' in 0.24.

n_retries : int, default=3

Number of retries when HTTP errors or network timeouts are encountered. Errors with status code 412 won’t be retried as they represent OpenML generic errors.

delay : float, default=1.0

Number of seconds between retries.
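The interplay of n_retries and delay can be pictured with a small retry loop. This is a minimal sketch under stated assumptions, not the code scikit-learn runs: `FakeHTTPError` and `call_with_retries` are hypothetical names, and the `sleep` argument exists only so the waiting can be replaced in tests.

```python
import time


class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code


def call_with_retries(func, n_retries=3, delay=1.0, sleep=time.sleep):
    """Call func(), retrying on HTTP errors and timeouts.

    Status code 412 is re-raised immediately, since the documentation
    states that such errors are not retried.
    """
    retries_left = n_retries
    while True:
        try:
            return func()
        except (FakeHTTPError, TimeoutError) as exc:
            if isinstance(exc, FakeHTTPError) and exc.code == 412:
                raise
            if retries_left == 0:
                raise
            retries_left -= 1
            sleep(delay)  # wait `delay` seconds before the next attempt
```

With the defaults, a call that keeps failing is attempted four times in total (one initial attempt plus three retries) in this sketch.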

parser : {"auto", "pandas", "liac-arff"}, default="auto"

Parser used to load the ARFF file. Two parsers are implemented:

"pandas": this is the most efficient parser. However, it requires pandas to be installed and can only open dense datasets.
"liac-arff": this is a pure Python ARFF parser that is much less memory- and CPU-efficient. It deals with sparse ARFF datasets.

If "auto", the parser is chosen automatically such that "liac-arff" is selected for sparse ARFF datasets, otherwise "pandas" is selected.
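The selection rule for parser="auto" can be written down in a few lines. The helper below is a hypothetical sketch of that rule only; `choose_parser` and its `is_sparse` flag are illustrative names, and raising a ValueError for a sparse dataset with the "pandas" parser is an assumption based on the statement that this parser can only open dense datasets.

```python
def choose_parser(parser="auto", is_sparse=False):
    """Hypothetical helper: resolve the parser name for a dataset."""
    if parser not in ("auto", "pandas", "liac-arff"):
        raise ValueError(f"Unknown parser: {parser!r}")

    if parser == "auto":
        # Sparse ARFF files need the pure Python parser.
        return "liac-arff" if is_sparse else "pandas"

    if parser == "pandas" and is_sparse:
        # Assumption: an explicit "pandas" request cannot handle sparse data.
        raise ValueError("The 'pandas' parser can only open dense datasets.")

    return parser
```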

Added in version 1.2.

Changed in version 1.4: The default value of parser changed from "liac-arff" to "auto".

read_csv_kwargs : dict, default=None

Keyword arguments passed to pandas.read_csv when loading the data from an ARFF file and using the pandas parser. This makes it possible to override some default parameters.

Added in version 1.3.

Returns:

data : Bunch

Dictionary-like object, with the following attributes.

data : np.array, scipy.sparse.csr_matrix of floats, or pandas DataFrame: The feature matrix. Categorical features are encoded as ordinals.
target : np.array, pandas Series or DataFrame: The regression target or classification labels, if applicable. Dtype is float if numeric, and object if categorical. If as_frame is True, target is a pandas object.
DESCR : str: The full description of the dataset.
feature_names : list: The names of the dataset columns.
target_names : list: The names of the target columns.

Added in version 0.22.

categories : dict or None: Maps each categorical feature name to a list of values, such that the value encoded as i is ith in the list. If as_frame is True, this is None.
details : dict: More metadata from OpenML.
frame : pandas DataFrame: Only present when as_frame=True. DataFrame with data and target.

(data, target) : tuple if return_X_y is True

Note

EXPERIMENTAL

This interface is experimental and subsequent releases may change attributes without notice (although there should only be minor changes to data and target).

Missing values in ‘data’ are represented as NaN. Missing values in ‘target’ are represented as NaN (numerical target) or None (categorical target).
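Because missing entries may appear either as NaN or as None, code that inspects a target often needs to check for both. The following sketch works on plain Python values; `is_missing` and `count_missing` are hypothetical helper names, not part of scikit-learn.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN, the two documented markers."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def count_missing(values):
    """Count the missing entries in an iterable of target values."""
    return sum(1 for value in values if is_missing(value))
```

When the target is a pandas object, `pandas.isna` covers both markers in one call.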

Notes

The "pandas" and "liac-arff" parsers can lead to different data types in the output. The notable differences are the following:

The "liac-arff" parser always encodes categorical features as str objects. In contrast, the "pandas" parser infers the type while reading, and numerical categories are cast to integers whenever possible.
The "liac-arff" parser uses float64 to encode numerical features tagged as ‘REAL’ and ‘NUMERICAL’ in the metadata. The "pandas" parser instead infers whether these numerical features correspond to integers and, if so, uses pandas’ Integer extension dtype.
In particular, classification datasets with integer categories are typically loaded as such (0, 1, ...) with the "pandas" parser while "liac-arff" will force the use of string encoded class labels such as "0", "1" and so on.
The "pandas" parser will not strip single quotes - i.e. ' - from string columns. For instance, a string 'mystring' will be kept as is while the "liac-arff" parser will strip the single quotes. For categorical columns, the single quotes are stripped from the values.
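Code that must behave the same with either parser can normalize these differences after loading. The helpers below are a hedged sketch on plain Python lists and strings; `to_int_labels` and `strip_single_quotes` are hypothetical names, and real pipelines would more likely apply the equivalent pandas or NumPy operations.

```python
def to_int_labels(labels):
    """Convert string labels such as "0", "1" to integers when possible."""
    converted = []
    for label in labels:
        if isinstance(label, str):
            try:
                converted.append(int(label))
            except ValueError:
                converted.append(label)  # keep non-numeric labels unchanged
        else:
            converted.append(label)
    return converted


def strip_single_quotes(value):
    """Remove one pair of enclosing single quotes, if present."""
    if len(value) >= 2 and value[0] == "'" and value[-1] == "'":
        return value[1:-1]
    return value


string_labels = ["0", "1", "1"]      # labels as "liac-arff" would encode them
int_labels = to_int_labels(string_labels)  # [0, 1, 1]
```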

In addition, when as_frame=False is used, the "liac-arff" parser returns ordinally encoded data where the categories are provided in the attribute categories of the Bunch instance. Instead, "pandas" returns a NumPy array where the categories are not encoded.
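To make the ordinal encoding concrete, the sketch below maps encoded values back to their category names using a categories mapping of the documented shape (feature name to list of values, where the value encoded as i is the ith entry). The data is invented for illustration and `decode_column` is a hypothetical helper.

```python
import math


def decode_column(encoded, category_values):
    """Map ordinal codes back to category names; NaN becomes None."""
    decoded = []
    for code in encoded:
        if isinstance(code, float) and math.isnan(code):
            decoded.append(None)  # missing value in the encoded data
        else:
            decoded.append(category_values[int(code)])
    return decoded


# Invented example data in the documented shape.
categories = {"color": ["red", "green", "blue"]}
encoded_color = [0.0, 2.0, float("nan"), 1.0]

decoded_color = decode_column(encoded_color, categories["color"])
# decoded_color is ['red', 'blue', None, 'green']
```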

Examples

>>> from sklearn.datasets import fetch_openml
>>> adult = fetch_openml("adult", version=2)
>>> adult.frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             48842 non-null  int64
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64
 11  capital-loss    48842 non-null  int64
 12  hours-per-week  48842 non-null  int64
 13  native-country  47985 non-null  category
 14  class           48842 non-null  category
dtypes: category(9), int64(6)
memory usage: 2.7 MB