8.4. Loading other datasets

8.4.1. Sample images

Scikit-learn also embeds a couple of sample JPEG images published under Creative Commons license by their authors. Those images can be useful to test algorithms and pipelines on 2D data.

load_sample_images()

Load sample images for image manipulation.

load_sample_image(image_name)

Load the numpy array of a single sample image.


Warning

The default coding of images is based on the uint8 dtype to spare memory. Machine learning algorithms often work best if the input is first converted to a floating point representation. Also, if you plan to use matplotlib.pyplot.imshow, don’t forget to scale to the range 0 - 1, for example by dividing the uint8 values by 255.
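The conversion described in the warning can be sketched as follows. The helper names to_unit_float and load_scaled_sample are illustrative and not part of scikit-learn, and loading a bundled image additionally assumes that Pillow is installed, so the demonstration itself runs on a small synthetic array.

```python
import numpy as np
from sklearn.datasets import load_sample_image


def to_unit_float(image):
    """Convert a uint8 image array to float64 values in the range 0 - 1."""
    image = np.asarray(image)
    if image.dtype != np.uint8:
        raise TypeError(f"expected a uint8 image, got {image.dtype}")
    return image.astype(np.float64) / 255.0


def load_scaled_sample(image_name="china.jpg"):
    """Load one bundled sample image and rescale it (requires Pillow)."""
    return to_unit_float(load_sample_image(image_name))


# Synthetic stand-in for a 1x1 RGB image, so the sketch runs without Pillow.
fake_image = np.array([[[0, 128, 255]]], dtype=np.uint8)
scaled = to_unit_float(fake_image)
print(scaled.dtype, scaled.min(), scaled.max())
```

The array returned by load_scaled_sample("china.jpg") can then be passed to matplotlib.pyplot.imshow or to an estimator that expects floating point input.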

8.4.2. Datasets in svmlight / libsvm format

scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> .... This format is especially suitable for sparse datasets. In this module, scipy sparse CSR matrices are used for X and numpy arrays are used for y.
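To make the line format concrete, the sketch below writes a tiny made-up dataset with dump_svmlight_file into an in-memory buffer and prints the resulting lines. The data and variable names are illustrative only. Zero-valued features are left out of each line, which is why the format suits sparse data.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file

# Two samples with three features each; most entries are zero.
X = np.array([[0.5, 0.0, 2.0], [0.0, 1.5, 0.0]])
y = np.array([1, 0])

buffer = BytesIO()
# zero_based=False writes feature ids starting at 1.
dump_svmlight_file(X, y, buffer, zero_based=False)

lines = buffer.getvalue().decode("utf-8").splitlines()
for line in lines:
    print(line)
```

Each printed line starts with the label, followed by one id:value pair per non-zero feature.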

You may load a dataset like this as follows:

>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
...

You may also load two (or more) datasets at once:

>>> from sklearn.datasets import load_svmlight_files
>>> X_train, y_train, X_test, y_test = load_svmlight_files(
...     ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
...

In this case, X_train and X_test are guaranteed to have the same number of features. Another way to achieve the same result is to fix the number of features:

>>> X_test, y_test = load_svmlight_file(
...     "/path/to/test_dataset.txt", n_features=X_train.shape[1])
...
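The effect of loading files together, or of fixing n_features, can be checked with two small temporary files. This is only a sketch: the file contents are made up, and zero_based=False is passed explicitly on the assumption that the feature ids in these files start at 1.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_svmlight_file, load_svmlight_files

with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = Path(tmp_dir) / "train_dataset.txt"
    test_path = Path(tmp_dir) / "test_dataset.txt"
    # The training file uses feature ids up to 3, the test file only id 1.
    train_path.write_text("1 1:0.5 3:2.0\n0 2:1.5\n")
    test_path.write_text("1 1:1.0\n")

    # Loaded on its own, the test file appears to have a single feature.
    X_alone, _ = load_svmlight_file(str(test_path), zero_based=False)

    # Loading both files together aligns the number of columns.
    X_train, y_train, X_test, y_test = load_svmlight_files(
        [str(train_path), str(test_path)], zero_based=False
    )

    # Fixing n_features explicitly gives the same alignment.
    X_fixed, _ = load_svmlight_file(
        str(test_path), n_features=X_train.shape[1], zero_based=False
    )

print(X_alone.shape, X_train.shape, X_test.shape, X_fixed.shape)
```

Without one of these two measures, a model fitted on X_train could not be applied to the separately loaded test matrix, because the column counts would differ.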


8.4.3. Downloading datasets from the openml.org repository

openml.org is a public repository for machine learning data and experiments, which allows everybody to upload open datasets.

The sklearn.datasets package is able to download datasets from the repository using the function sklearn.datasets.fetch_openml.

For example, to download a dataset of gene expressions in mice brains:

>>> from sklearn.datasets import fetch_openml
>>> mice = fetch_openml(name='miceprotein', version=4)

To fully specify a dataset, you need to provide a name and a version, though the version is optional, see Dataset Versions below. The dataset contains a total of 1080 examples belonging to 8 different classes:

>>> import numpy as np
>>> mice.data.shape
(1080, 77)
>>> mice.target.shape
(1080,)
>>> np.unique(mice.target)
array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object)

You can get more information on the dataset by looking at the DESCR and details attributes:

>>> print(mice.DESCR)
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
**Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
Syndrome. PLoS ONE 10(6): e0129126...
>>> mice.details
{'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
'file_id': '17928620', 'default_target_attribute': 'class',
'row_id_attribute': 'MouseID',
'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
'visibility': 'public', 'status': 'active',
'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}

The DESCR contains a free-text description of the data, while details contains a dictionary of meta-data stored by openml, like the dataset id. For more details, see the OpenML documentation. The data_id of the mice protein dataset is 40966, and you can use this (or the name) to get more information on the dataset on the openml website:

>>> mice.url
'https://www.openml.org/d/40966'

The data_id also uniquely identifies a dataset from OpenML:

>>> mice = fetch_openml(data_id=40966)
>>> mice.details
{'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
'file_id': '17928620', 'default_target_attribute': 'class',
'row_id_attribute': 'MouseID',
'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
'visibility': 'public', 'status': 'active',
'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}

8.4.3.1. Dataset Versions

A dataset is uniquely specified by its data_id, but not necessarily by its name. Several different “versions” of a dataset with the same name can exist, and they can contain entirely different data. If a particular version of a dataset has been found to contain significant issues, it might be deactivated. Using a name to specify a dataset will yield the earliest version of a dataset that is still active. That means that fetch_openml(name="miceprotein") can yield different results at different times if earlier versions become inactive. You can see that the dataset with data_id 40966 that we fetched above is version 4 of the “miceprotein” dataset:

>>> mice.details['version']
'4'

The iris dataset likewise has multiple versions:

>>> iris = fetch_openml(name="iris")
>>> iris.details['version']
'1'
>>> iris.details['id']
'61'
>>> iris_61 = fetch_openml(data_id=61)
>>> iris_61.details['version']
'1'
>>> iris_61.details['id']
'61'
>>> iris_969 = fetch_openml(data_id=969)
>>> iris_969.details['version']
'3'
>>> iris_969.details['id']
'969'

Specifying the dataset by the name “iris” yields the lowest version, version 1, with the data_id 61. To make sure you always get this exact dataset, it is safest to specify it by the dataset data_id. The other dataset, with data_id 969, is version 3 (version 2 has become inactive), and contains a binarized version of the data:

>>> np.unique(iris_969.target)
array(['N', 'P'], dtype=object)

You can also specify both the name and the version, which also uniquely identifies the dataset:

>>> iris_version_3 = fetch_openml(name="iris", version=3)
>>> iris_version_3.details['version']
'3'
>>> iris_version_3.details['id']
'969'
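One way to guard against silently receiving a different version is to fetch by data_id and then compare the returned metadata with what you expect. The sketch below is an illustration only: matches_expected and fetch_pinned are hypothetical helpers, not scikit-learn functions, and fetch_pinned needs network access, so only the offline comparison is exercised here.

```python
from sklearn.datasets import fetch_openml


def matches_expected(details, name, version):
    """Return True if OpenML metadata matches the expected name and version."""
    same_name = details["name"].lower() == name.lower()
    same_version = str(details["version"]) == str(version)
    return same_name and same_version


def fetch_pinned(data_id, name, version):
    """Fetch by data_id (needs network access) and verify name and version."""
    bunch = fetch_openml(data_id=data_id)
    if not matches_expected(bunch.details, name, version):
        raise ValueError(
            f"data_id {data_id} is {bunch.details['name']!r} "
            f"version {bunch.details['version']!r}, "
            f"not {name!r} version {version}"
        )
    return bunch


# Offline check against the metadata values shown above for data_id 969.
iris_969_details = {"id": "969", "name": "iris", "version": "3"}
print(matches_expected(iris_969_details, "iris", 3))
```

With network access, fetch_pinned(969, "iris", 3) would return the binarized iris dataset or raise if the metadata did not match.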


8.4.3.2. ARFF parser

From version 1.2, scikit-learn provides a new keyword argument parser that provides several options to parse the ARFF files provided by OpenML. The legacy parser (i.e. parser="liac-arff") is based on the project LIAC-ARFF. This parser is however slow and consumes more memory than required. A new parser based on pandas (i.e. parser="pandas") is both faster and more memory efficient. However, this parser does not support sparse data. Therefore, we recommend using parser="auto" which will use the best parser available for the requested dataset.

The "pandas" and "liac-arff" parsers can lead to different data types in the output. The notable differences are the following:

  • The "liac-arff" parser always encodes categorical features as str objects. By contrast, the "pandas" parser infers the type while reading, and numerical categories will be cast into integers whenever possible.

  • The "liac-arff" parser uses float64 to encode numerical features tagged as ‘REAL’ and ‘NUMERICAL’ in the metadata. The "pandas" parser instead infers if these numerical features correspond to integers and uses pandas’ Integer extension dtype.

  • In particular, classification datasets with integer categories are typically loaded as such (0, 1, ...) with the "pandas" parser while "liac-arff" will force the use of string encoded class labels such as "0", "1" and so on.

  • The "pandas" parser will not strip single quotes - i.e. ' - from string columns. For instance, a string 'mystring' will be kept as is while the "liac-arff" parser will strip the single quotes. For categorical columns, the single quotes are stripped from the values.

In addition, when as_frame=False is used, the "liac-arff" parser returns ordinally encoded data where the categories are provided in the attribute categories of the Bunch instance. Instead, "pandas" returns a NumPy array in which the categories are not encoded. Then it’s up to the user to design a feature engineering pipeline with an instance of OneHotEncoder or OrdinalEncoder typically wrapped in a ColumnTransformer to preprocess the categorical columns explicitly. See for instance: Column Transformer with Mixed Types.
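A minimal sketch of such a pipeline is shown below. The small object array is made up and merely stands in for the unencoded NumPy array that the "pandas" parser returns with as_frame=False. The column positions and transformer names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column 0 holds unencoded string categories, column 1 a numeric feature.
X = np.array([["red", 1.0], ["blue", 2.0], ["red", 3.0]], dtype=object)

preprocessor = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), [0]),
        ("num", StandardScaler(), [1]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_encoded = preprocessor.fit_transform(X)
print(X_encoded.shape)
```

The one-hot columns come first in the output, followed by the scaled numeric column, in the order in which the transformers are listed.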

8.4.4. Loading from external datasets

scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.

Here are some recommended ways to load standard columnar data into a format usable by scikit-learn:

  • pandas.io provides tools to read data from common formats including CSV, Excel, JSON and SQL. DataFrames may also be constructed from lists of tuples or dicts. Pandas handles heterogeneous data smoothly and provides tools for manipulation and conversion into a numeric array suitable for scikit-learn.

  • scipy.io specializes in binary formats often used in scientific computing contexts such as .mat and .arff.

  • numpy/routines.io for standard loading of columnar data into numpy arrays.

  • scikit-learn’s load_svmlight_file for the svmlight or libSVM sparse format.

  • scikit-learn’s load_files for directories of text files where the name of each directory is the name of each category and each file inside of each directory corresponds to one sample from that category.
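As an illustration of the pandas route listed above, the sketch below reads a small made-up CSV from an in-memory string, converts the feature columns to a numeric array and fits an estimator on it. The column names and the choice of DecisionTreeClassifier are arbitrary assumptions for the example.

```python
from io import StringIO

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# In practice this would be a path to a CSV file on disk.
csv_text = StringIO(
    "height,width,label\n"
    "1.0,2.0,0\n"
    "1.5,1.8,0\n"
    "5.0,8.0,1\n"
    "6.0,9.0,1\n"
)
frame = pd.read_csv(csv_text)

# Separate the numeric feature columns from the target column.
X = frame[["height", "width"]].to_numpy()
y = frame["label"].to_numpy()

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(X.shape, clf.predict([[5.5, 8.5]]))
```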

For some miscellaneous data such as images, videos, and audio, you may wish to refer to:

Categorical (or nominal) features stored as strings (common in pandas DataFrames) will need converting to numerical features using OneHotEncoder or OrdinalEncoder or similar. See Preprocessing data.
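For instance, a string column in a DataFrame can be turned into integer codes with OrdinalEncoder. The column name and the category order below are made up for the illustration. Passing categories explicitly fixes the mapping instead of relying on alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

frame = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Explicit order: small -> 0, medium -> 1, large -> 2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = encoder.fit_transform(frame[["size"]])
print(codes.ravel())
```

OneHotEncoder is usually the safer choice for nominal features that have no natural order, since ordinal codes suggest a ranking to many estimators.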

Note: if you manage your own numerical data it is recommended to use an optimized file format such as HDF5 to reduce data load times. Various libraries such as H5Py, PyTables and pandas provide a Python interface for reading and writing data in that format.