The OpenML R Interface

The OpenML R Team

2022-10-19

Introduction

The R package OpenML is an interface that makes interacting with the OpenML server as comfortable as possible. For example, users can download and upload files, run their implementations on specific tasks, and get predictions in the correct form directly via R commands. In this tutorial, we show the most important functions of this package and give examples of standard workflows.

For general information on what OpenML is, please have a look at the README file or visit the official OpenML website.

After installation and before making practical use of the package, it is in most cases desirable to set up a configuration file to simplify further steps. Afterwards, there are different basic stages when using this package or OpenML, respectively:

Listing
- lists which data is available w.r.t. a specific object (DataSets, Tasks, Flows, Runs, RunEvaluations, EvaluationMeasures, and TaskTypes)
- function names begin with listOML
- result is always a data.frame
Downloading
- downloads the specific objects (for DataSets, Tasks, Runs, Predictions, and Flows)
- function names begin with getOML
- result is an object of a specific OpenML class
Running models on tasks
- function runTaskMlr
- input: OMLTask and Learner
- output: OMLMlrRun, OMLRun
Uploading
- function uploadOMLRun

Installation instructions

Installation works as in any other package using

install.packages("OpenML")

To install the current development version, use the devtools package and run

devtools::install_github("openml/openml-r")

Using the OpenML package also requires a reader for the ARFF file format. By default, farff is used. Alternatively, the RWeka package can be used. You can install the packages with the following calls.

install.packages(c("farff", "RWeka"))
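Before installing, it can be handy to see which of the two readers is already available. The following is a minimal sketch using only base R; the variable names are illustrative, and the actual installation call is left commented out so that nothing is installed by accident.

```r
# Check which ARFF reader packages can be loaded, without attaching them.
readers <- c("farff", "RWeka")
installed <- vapply(
  readers,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)
print(installed)

# Names of the readers that are not available yet.
missing_readers <- names(installed)[!installed]

# Install only what is missing (uncomment to run).
# if (length(missing_readers) > 0) install.packages(missing_readers)
```

Note that RWeka additionally depends on a working Java setup, so it may be reported as unavailable even though the R package itself is installed.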

Private key notification

All examples in this tutorial are given with a READ-ONLY API key.

With this key you can read all the information from the server but not write data sets, tasks, flows, and runs to the server. The key lets you emulate uploading to the server but does not actually store any data. If you want to write data to the server, you have to get a personal API key. The process of obtaining a key is shown in the configuration section.

Important: Please do not write meaningless data to the server such as copies of already existing data sets, tasks, or runs (such as the ones from this tutorial)! One instance of the Iris data set should be enough for everyone. :D

Basic example

In this section you can find an example of how to download a task from the server, print some information about it to the console, and produce a run which is then uploaded to the server. For detailed information on OpenML terminology (task, run, etc.) see the OpenML guide.

library("OpenML")
## temporarily set API key to read only key
setOMLConfig(apikey = "c1994bdb7ecb3c6f3c8f3b35f4b47f1f")

## OpenML configuration:
##   server           : https://www.openml.org/api/v1
##   cachedir         : C:\Users\GDaddy\AppData\Local\Temp\RtmpUzg5GZ/working_dir\RtmpMfS7jw/cache
##   verbosity        : 0
##   arff.reader      : farff
##   confirm.upload   : TRUE
##   apikey           : ***************************47f1f

# download a task (whose ID is 1L)
task = getOMLTask(task.id = 1L)

## Warning in getOMLDataSetById(data.id = data.id, cache.only = cache.only, : Data set has been deactivated.

task

## 
## OpenML Task 1 :: (Data ID = 1)
##   Task Type            : Supervised Classification
##   Data Set             : anneal :: (Version = 2, OpenML ID = 1)
##   Target Feature(s)    : class
##   Tags                 : basic, study_1, study_41, study_7, study_73, study_89, test-tagging, testtag,...
##   Estimation Procedure : Stratified crossvalidation (1 x 10 folds)
##   Evaluation Measure(s): predictive_accuracy

The task contains information on the following:

task type: defines the type of the task (regression, classification, clustering, etc.)
data set: which data set belongs to the given task (one task can always only be connected to a single data set)
target feature(s): optional field for all kinds of classification and regression tasks
tags: tags / labels, which might be helpful for further sub-selections
estimation procedure: which estimation procedure has been used when computing the performance

In the next lines, randomForest is used as a classifier and run with the help of the mlr package. Note that one needs to run the algorithm locally and that mlr will automatically load the package that is needed to run the specified classifier.

# define the classifier (usually called "flow" within OpenML)
library("mlr")
lrn = makeLearner("classif.randomForest")

# upload the new flow (with information about the algorithm and settings);
# if this algorithm already exists on the server, one will receive a message
# with the ID of the existing flow
flow.id = uploadOMLFlow(lrn)

# the last step is to perform a run and upload the results
run.mlr = runTaskMlr(task, lrn)
run.id = uploadOMLRun(run.mlr)

Following this very brief example, the next sections explain the individual steps of the OpenML package in more detail.

Configuration

Interacting with the OpenML server requires an API key. For demonstration purposes, we have created a public read-only API key ("c1994bdb7ecb3c6f3c8f3b35f4b47f1f"), which is used in this tutorial to make the examples executable. However, for full-fledged usage of the OpenML package, you need your personal API key.

Generating your own personal API key

In order to receive your own API key

Setting your configuration

You can set your own OpenML configuration either temporarily for the current R session via setOMLConfig or permanently via saveOMLConfig. To create a permanent configuration file that uses default values and at the same time sets your personal API key, run

saveOMLConfig(apikey = "c1994bdb7ecb3c6f3c8f3b35f4b47f1f")

where "c1994bdb7ecb3c6f3c8f3b35f4b47f1f" should be replaced with your personal API key. Note that basically everybody who has access to your computer can read the configuration file and thus see your API key. With your API key, other users have full access to your account via the API, so please handle it with care!
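If you prefer not to keep the key in a file or in your scripts, one option is to read it from an environment variable and pass it to setOMLConfig for the current session only. The sketch below is one possible approach, not a feature of the package: the variable name OPENML_APIKEY and the helper read_api_key are hypothetical and chosen purely for illustration.

```r
# Read the API key from an environment variable instead of hard-coding it.
# "OPENML_APIKEY" is a hypothetical name used in this sketch; the OpenML
# package does not read this variable on its own.
read_api_key <- function(var = "OPENML_APIKEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.")
  }
  key
}

# Demonstration with the public read-only key from this tutorial.
Sys.setenv(OPENML_APIKEY = "c1994bdb7ecb3c6f3c8f3b35f4b47f1f")
key <- read_api_key()

# Session-only configuration; nothing is written to disk (uncomment to run):
# OpenML::setOMLConfig(apikey = key)
```

In practice the variable would be set outside of R, for example in your shell profile, rather than with Sys.setenv inside the script.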

It is also possible to manually create a file ~/.openml/config in your home directory; you can use the R command path.expand("~/.openml/config") to get the full path to the configuration file on your operating system. The config file consists of key = value pairs. Note that the values are not quoted. An exemplary minimal config file might look as follows:

apikey=c1994bdb7ecb3c6f3c8f3b35f4b47f1f

The config file may contain the following information:

server:
- default: https://www.openml.org/api/v1
cachedir:
- directory where the current content of the cache is stored
- the default cache directory can be obtained by the R command file.path(tempdir(), "cache").
verbosity:
- 0: normal output
- 1: info output (default)
- 2: debug output
arff.reader:
- RWeka: this is the standard Java parser used in Weka
- farff: the farff package provides a newer, faster parser without any Java requirements
confirm.upload:
- default decision w.r.t. confirming uploads
- per default (FALSE) one does not need to confirm the upload decision
apikey:
- required to access the server
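To make the key = value format concrete, here is a minimal base R sketch that parses such lines into a named list. It is only an illustration of the file layout described above, not the parser used by the package (which you invoke via loadOMLConfig()); the helper name parse_config and the example values are assumptions of this sketch.

```r
# Parse unquoted "key = value" lines into a named list.
# Each line is split at the first "=" only, so values may contain "=".
parse_config <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]  # skip blank lines
  parts <- regmatches(lines, regexpr("=", lines, fixed = TRUE), invert = TRUE)
  keys <- vapply(parts, function(p) trimws(p[1]), character(1))
  values <- vapply(parts, function(p) trimws(p[2]), character(1))
  as.list(setNames(values, keys))
}

# Example content using the fields listed above.
example_lines <- c(
  "apikey=c1994bdb7ecb3c6f3c8f3b35f4b47f1f",
  "server = https://www.openml.org/api/v1",
  "verbosity = 1",
  "arff.reader = farff"
)

cfg <- parse_config(example_lines)
str(cfg)
```

All values come back as character strings, so a numeric setting such as verbosity would still need an explicit as.integer() conversion.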

If you manually modify the config file, you need to reload it into the current R session using loadOMLConfig(). You can query the current configuration using

getOMLConfig()

## OpenML configuration:
##   server           : https://www.openml.org/api/v1
##   cachedir         : C:\Users\GDaddy\AppData\Local\Temp\RtmpUzg5GZ/working_dir\RtmpMfS7jw/cache
##   verbosity        : 0
##   arff.reader      : farff
##   confirm.upload   : TRUE
##   apikey           : ***************************47f1f

The configuration file and some related topics are also explained in the OpenML Wiki.

Once the config file is set up, you are ready to go!

Listing

In this stage, we want to list basic information about the various OpenML objects:

data sets
tasks
flows
runs
run results
evaluation measures
task types

For each of these objects, we have a function to query the information, beginning with listOML. All of these functions return a data.frame, even in case the result consists of a single column or has zero observations (i.e., rows).

Note that the listOML* functions only list information on the corresponding objects; they do not download the respective objects. Information on actually downloading specific objects is covered in the next section.
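Because every listing is a data.frame, even when nothing matches, downstream code can treat empty and non-empty results the same way. The following base R sketch uses a hand-built, zero-row stand-in for a listing result (the column names are taken from the data set listing; no server call is made) to show that the usual tools keep working.

```r
# A stand-in for a listing result without matches: the usual columns,
# but zero rows. No connection to the server is needed for this sketch.
empty_listing <- data.frame(
  data.id = integer(0),
  name = character(0),
  status = character(0),
  stringsAsFactors = FALSE
)

# subset() and column access behave as for any other data.frame.
iris_rows <- subset(empty_listing, name == "iris")
ids <- iris_rows$data.id  # integer vector, possibly of length zero

if (length(ids) == 0) {
  message("No matching data sets found.")
} else {
  message("Found data IDs: ", paste(ids, collapse = ", "))
}
```

Checking length() or nrow() before continuing is usually enough; no special handling for NULL results is required.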

List data sets

To browse the OpenML database for appropriate data sets, you can use listOMLDataSets() to get basic data characteristics (number of features, instances, classes, missing values, etc.) for each data set. By default, listOMLDataSets() returns only data sets that have an active status on OpenML:

datasets = listOMLDataSets()  # returns active data sets

The resulting data.frame contains the following information for each of the listed data sets:

the data set ID data.id
the status ("active", "in_preparation" or "deactivated") of the data set
the name of the data set
the size of the majority / biggest class (majority.class.size)
etc.

str(datasets)

## 'data.frame':    4390 obs. of  16 variables:
##  $ data.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ name                                   : chr  "anneal" "kr-vs-kp" "labor" "arrhythmia" ...
##  $ version                                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ status                                 : chr  "active" "active" "active" "active" ...
##  $ format                                 : chr  "ARFF" "ARFF" "ARFF" "ARFF" ...
##  $ tags                                   : chr  "" "" "" "" ...
##  $ majority.class.size                    : int  684 1669 37 245 813 57 NA 67 81 288 ...
##  $ max.nominal.att.distinct.values        : int  7 3 3 13 26 24 NA 22 8 3 ...
##  $ minority.class.size                    : int  8 1527 20 2 734 1 NA 3 2 49 ...
##  $ number.of.classes                      : int  5 2 2 13 26 24 0 6 4 3 ...
##  $ number.of.features                     : int  39 37 17 280 17 70 6 26 19 5 ...
##  $ number.of.instances                    : int  898 3196 57 452 20000 226 345 205 148 625 ...
##  $ number.of.instances.with.missing.values: int  898 0 56 384 0 222 0 46 0 0 ...
##  $ number.of.missing.values               : int  22175 0 326 408 0 317 0 59 0 0 ...
##  $ number.of.numeric.features             : int  6 0 8 206 16 0 6 15 3 4 ...
##  $ number.of.symbolic.features            : int  33 37 9 74 1 70 0 11 16 1 ...

head(datasets[, 1:5])

##   data.id       name version status format
## 1       2     anneal       1 active   ARFF
## 2       3   kr-vs-kp       1 active   ARFF
## 3       4      labor       1 active   ARFF
## 4       5 arrhythmia       1 active   ARFF
## 5       6     letter       1 active   ARFF
## 6       7  audiology       1 active   ARFF

To find a specific data set, you can now query the resulting datasets object. Suppose we want to find the iris data set.

subset(datasets, name == "iris")

##      data.id name version status format tags majority.class.size max.nominal.att.distinct.values
## 53        61 iris       1 active   ARFF                       50                               3
## 812      969 iris       3 active   ARFF                      100                               2
## 2602   41510 iris       9 active   ARFF                       NA                               3
## 2603   41511 iris      10 active   ARFF                       50                               3
## 2636   41567 iris      11 active   ARFF                       NA                               3
## 2637   41568 iris      12 active   ARFF                       50                               3
## 2638   41582 iris      13 active   ARFF                       NA                               3
## 2639   41583 iris      14 active   ARFF                       50                               3
## 2889   41996 iris      15 active   ARFF                       NA                               3
## 2890   41997 iris      16 active   ARFF                       50                               3
## 2892   42002 iris      17 active   ARFF                       NA                               3
## 2893   42003 iris      18 active   ARFF                       50                               3
## 2896   42010 iris      19 active   ARFF                       NA                               3
## 2897   42011 iris      20 active   ARFF                       50                               3
## 2898   42015 iris      21 active   ARFF                       NA                               3
## 2899   42016 iris      22 active   ARFF                       50                               3
## 2900   42020 iris      23 active   ARFF                       NA                               3
## 2901   42021 iris      24 active   ARFF                       50                               3
## 2902   42025 iris      25 active   ARFF                       NA                               3
## 2903   42026 iris      26 active   ARFF                       50                               3
## 2904   42030 iris      27 active   ARFF                       NA                               3
## 2905   42031 iris      28 active   ARFF                       50                               3
## 2906   42035 iris      29 active   ARFF                       NA                               3
## 2907   42036 iris      30 active   ARFF                       50                               3
## 2908   42040 iris      31 active   ARFF                       NA                               3
## 2909   42041 iris      32 active   ARFF                       50                               3
## 2910   42045 iris      33 active   ARFF                       NA                               3
## 2911   42046 iris      34 active   ARFF                       50                               3
## 2912   42050 iris      35 active   ARFF                       NA                               3
## 2913   42051 iris      36 active   ARFF                       50                               3
## 2914   42055 iris      37 active   ARFF                       NA                               3
## 2915   42056 iris      38 active   ARFF                       50                               3
## 2920   42065 iris      39 active   ARFF                       NA                               3
## 2921   42066 iris      40 active   ARFF                       50                               3
## 2922   42070 iris      41 active   ARFF                       NA                               3
## 2923   42071 iris      42 active   ARFF                       50                               3
## 2934   42091 iris      43 active   ARFF                       NA                               3
## 2937   42097 iris      44 active   ARFF                       NA                               3
## 2938   42098 iris      45 active   ARFF                       50                               3
## 3190   42661 iris      46 active   arff                       NA                              NA
## 3206   42699 iris      47 active   ARFF                       NA                              NA
## 3207   42700 iris      48 active   ARFF                       50                              NA
## 3292   42851 iris      49 active   ARFF                       NA                              NA
## 3309   42871 iris      50 active   ARFF                       NA                              NA
##      minority.class.size number.of.classes number.of.features number.of.instances
## 53                    50                 3                  5                 150
## 812                   50                 2                  5                 150
## 2602                  NA                NA                  5                 150
## 2603                  50                 3                  5                 150
## 2636                  NA                NA                  5                 150
## 2637                  50                 3                  5                 150
## 2638                  NA                NA                  5                 150
## 2639                  50                 3                  5                 150
## 2889                  NA                NA                  5                 150
## 2890                  50                 3                  5                 150
## 2892                  NA                NA                  5                 150
## 2893                  50                 3                  5                 150
## 2896                  NA                NA                  5                 150
## 2897                  50                 3                  5                 150
## 2898                  NA                NA                  5                 150
## 2899                  50                 3                  5                 150
## 2900                  NA                NA                  5                 150
## 2901                  50                 3                  5                 150
## 2902                  NA                NA                  5                 150
## 2903                  50                 3                  5                 150
## 2904                  NA                NA                  5                 150
## 2905                  50                 3                  5                 150
## 2906                  NA                NA                  5                 150
## 2907                  50                 3                  5                 150
## 2908                  NA                NA                  5                 150
## 2909                  50                 3                  5                 150
## 2910                  NA                NA                  5                 150
## 2911                  50                 3                  5                 150
## 2912                  NA                NA                  5                 150
## 2913                  50                 3                  5                 150
## 2914                  NA                NA                  5                 150
## 2915                  50                 3                  5                 150
## 2920                  NA                NA                  5                 150
## 2921                  50                 3                  5                 150
## 2922                  NA                NA                  5                 150
## 2923                  50                 3                  5                 150
## 2934                  NA                NA                  5                 150
## 2937                  NA                NA                  5                 150
## 2938                  50                 3                  5                 150
## 3190                  NA                NA                  5                 150
## 3206                  NA                NA                  5                 150
## 3207                  50                 3                  5                 150
## 3292                  NA                NA                  7                 150
## 3309                  NA                NA                  7                 150
##      number.of.instances.with.missing.values number.of.missing.values number.of.numeric.features
## 53                                         0                        0                          4
## 812                                        0                        0                          4
## 2602                                       0                        0                          4
## 2603                                       0                        0                          4
## 2636                                       0                        0                          4
## 2637                                       0                        0                          4
## 2638                                       0                        0                          4
## 2639                                       0                        0                          4
## 2889                                       0                        0                          4
## 2890                                       0                        0                          4
## 2892                                       0                        0                          4
## 2893                                       0                        0                          4
## 2896                                       0                        0                          4
## 2897                                       0                        0                          4
## 2898                                       0                        0                          4
## 2899                                       0                        0                          4
## 2900                                       0                        0                          4
## 2901                                       0                        0                          4
## 2902                                       0                        0                          4
## 2903                                       0                        0                          4
## 2904                                       0                        0                          4
## 2905                                       0                        0                          4
## 2906                                       0                        0                          4
## 2907                                       0                        0                          4
## 2908                                       0                        0                          4
## 2909                                       0                        0                          4
## 2910                                       0                        0                          4
## 2911                                       0                        0                          4
## 2912                                       0                        0                          4
## 2913                                       0                        0                          4
## 2914                                       0                        0                          4
## 2915                                       0                        0                          4
## 2920                                       0                        0                          4
## 2921                                       0                        0                          4
## 2922                                       0                        0                          4
## 2923                                       0                        0                          4
## 2934                                       0                        0                          4
## 2937                                       0                        0                          4
## 2938                                       0                        0                          4
## 3190                                       0                        0                          4
## 3206                                       0                        0                          4
## 3207                                       0                        0                          4
## 3292                                       0                        0                          4
## 3309                                       0                        0                          4
##      number.of.symbolic.features
## 53                             1
## 812                            1
## 2602                           1
## 2603                           1
## 2636                           1
## 2637                           1
## 2638                           1
## 2639                           1
## 2889                           1
## 2890                           1
## 2892                           1
## 2893                           1
## 2896                           1
## 2897                           1
## 2898                           1
## 2899                           1
## 2900                           1
## 2901                           1
## 2902                           1
## 2903                           1
## 2904                           1
## 2905                           1
## 2906                           1
## 2907                           1
## 2908                           1
## 2909                           1
## 2910                           1
## 2911                           1
## 2912                           1
## 2913                           1
## 2914                           1
## 2915                           1
## 2920                           1
## 2921                           1
## 2922                           1
## 2923                           1
## 2934                           1
## 2937                           1
## 2938                           1
## 3190                           0
## 3206                           1
## 3207                           1
## 3292                           3
## 3309                           3

As you can see, there are many data sets called iris: the listing contains 44 active entries, 20 of which report three classes (data IDs 61, 41511, 41568, 41583, 41997, 42003, 42011, 42016, 42021, 42026, 42031, 42036, 42041, 42046, 42051, 42056, 42066, 42071, 42098, and 42700). We want to use the original data set with three classes, which is stored under the data set ID (data.id) 61. You can also have a closer look at the data set on the corresponding OpenML web page (https://www.openml.org/d/61).
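Narrowing such a listing down can also be done programmatically. The sketch below rebuilds a few rows of the iris listing by hand (so it runs without a server connection) and filters them with base R. Treating the lowest version number as the original upload is an assumption of this sketch, not a rule stated by OpenML.

```r
# A few rows copied from the iris listing (only the columns needed here).
iris_sets <- data.frame(
  data.id = c(61L, 969L, 41510L, 41511L),
  name = "iris",
  version = c(1L, 3L, 9L, 10L),
  number.of.classes = c(3L, 2L, NA, 3L),
  stringsAsFactors = FALSE
)

# Keep the three-class variants; which() drops rows where the count is NA.
three_class <- iris_sets[which(iris_sets$number.of.classes == 3L), ]

# Assumption of this sketch: the lowest version is the original upload.
original <- three_class[which.min(three_class$version), ]
original$data.id
```

Using which() instead of a plain logical index avoids the all-NA rows that a comparison with NA would otherwise introduce.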

List tasks

Each OpenML task is a bundle that encapsulates information on various objects:

a specific type, e.g., "Supervised Classification" or "Supervised Regression"
a data set
a target feature (which might differ from the data set’s default target)
an estimation/resampling procedure, e.g., a 10-fold cross-validation
data splits for this estimation procedure
one or more (performance) evaluation measures, e.g., "predictive accuracy" for a classification task

Listing the tasks can be done via

tasks = listOMLTasks()

For each of the listed tasks, the resulting data.frame contains information on:

the task ID task.id
the type of the task task.type
information on the data set (analogous to the listing of data sets above), such as the number of features, classes and instances
the name of the target variable target.feature
tags which can be used for labelling the task
the estimation.procedure (aka resampling strategy)
the evaluation.measures used for measuring the performance of the learner / flow on the task

str(tasks)

## 'data.frame':    5000 obs. of  25 variables:
##  $ task.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ task.type                              : chr  "Supervised Classification" "Supervised Classification" "Supervised Classification" "Supervised Classification" ...
##  $ data.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ name                                   : chr  "anneal" "kr-vs-kp" "labor" "arrhythmia" ...
##  $ status                                 : chr  "active" "active" "active" "active" ...
##  $ format                                 : chr  "ARFF" "ARFF" "ARFF" "ARFF" ...
##  $ estimation.procedure                   : chr  "10-fold Crossvalidation" "10-fold Crossvalidation" "10-fold Crossvalidation" "10-fold Crossvalidation" ...
##  $ evaluation.measures                    : chr  "predictive_accuracy" NA "predictive_accuracy" "predictive_accuracy" ...
##  $ target.feature                         : chr  "class" "class" "class" "class" ...
##  $ cost.matrix                            : chr  NA NA NA NA ...
##  $ source.data.labeled                    : chr  NA NA NA NA ...
##  $ target.feature.event                   : chr  NA NA NA NA ...
##  $ target.feature.left                    : chr  NA NA NA NA ...
##  $ target.feature.right                   : chr  NA NA NA NA ...
##  $ quality.measure                        : chr  NA NA NA NA ...
##  $ majority.class.size                    : int  684 1669 37 245 813 57 NA 67 81 288 ...
##  $ max.nominal.att.distinct.values        : int  7 3 3 13 26 24 NA 22 8 3 ...
##  $ minority.class.size                    : int  8 1527 20 2 734 1 NA 3 2 49 ...
##  $ number.of.classes                      : int  5 2 2 13 26 24 0 6 4 3 ...
##  $ number.of.features                     : int  39 37 17 280 17 70 6 26 19 5 ...
##  $ number.of.instances                    : int  898 3196 57 452 20000 226 345 205 148 625 ...
##  $ number.of.instances.with.missing.values: int  898 0 56 384 0 222 0 46 0 0 ...
##  $ number.of.missing.values               : int  22175 0 326 408 0 317 0 59 0 0 ...
##  $ number.of.numeric.features             : int  6 0 8 206 16 0 6 15 3 4 ...
##  $ number.of.symbolic.features            : int  33 37 9 74 1 70 0 11 16 1 ...

For some data sets, there may be more than one task available on the OpenML server. For example, one can look for "Supervised Classification" tasks that are available for data set 61 via

head(subset(tasks, task.type == "Supervised Classification" & data.id == 61L)[, 1:5])

##      task.id                 task.type data.id name status
## 51        59 Supervised Classification      61 iris active
## 263      289 Supervised Classification      61 iris active
## 428     1823 Supervised Classification      61 iris active
## 535     1939 Supervised Classification      61 iris active
## 580     1992 Supervised Classification      61 iris active
## 3300    7306 Supervised Classification      61 iris active

List flows

A flow is the definition and implementation of a specific algorithm workflow or script, i.e., a flow is essentially the code / implementation of the algorithm.

flows = listOMLFlows()
str(flows)

## 'data.frame':    16365 obs. of  6 variables:
##  $ flow.id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ full.name       : chr  "openml.evaluation.EuclideanDistance(1.0)" "openml.evaluation.PolynomialKernel(1.0)" "openml.evaluation.RBFKernel(1.0)" "openml.evaluation.area_under_roc_curve(1.0)" ...
##  $ name            : chr  "openml.evaluation.EuclideanDistance" "openml.evaluation.PolynomialKernel" "openml.evaluation.RBFKernel" "openml.evaluation.area_under_roc_curve" ...
##  $ version         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ external.version: chr  "" "" "" "" ...
##  $ uploader        : int  1 1 1 1 1 1 1 1 1 1 ...

flows[56:63, 1:4]

##    flow.id             full.name               name version
## 56      56         weka.ZeroR(1)         weka.ZeroR       1
## 57      57          weka.OneR(1)          weka.OneR       1
## 58      58    weka.NaiveBayes(1)    weka.NaiveBayes       1
## 59      59          weka.JRip(1)          weka.JRip       1
## 60      60           weka.J48(1)           weka.J48       1
## 61      61       weka.REPTree(1)       weka.REPTree       1
## 62      62 weka.DecisionStump(1) weka.DecisionStump       1
## 63      63 weka.HoeffdingTree(1) weka.HoeffdingTree       1

List runs and run results

A run is an experiment that is executed on a given combination of task, flow and setup (i.e., the explicit parameter configuration of a flow). The corresponding results are stored as a run result. Both objects, i.e., runs and run results, can be listed via listOMLRuns or listOMLRunEvaluations, respectively. As each of those objects is defined by a task, setup and flow, you can extract runs and run results for specific combinations of task.id, setup.id and/or flow.id. For instance, listing all runs for task 59 (supervised classification on iris) can be done with

runs = listOMLRuns(task.id = 59L)  # must be specified with the task, setup and/or implementation ID
head(runs)

##   run.id task.id setup.id flow.id uploader error.message
## 1     81      59       12      67        1          <NA>
## 2    161      59       13      70        1          <NA>
## 3    234      59        1      56        1          <NA>
## 4    447      59        6      61        1          <NA>
## 5    473      59       18      77        1          <NA>
## 6    491      59        7      62        1          <NA>
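The run listing only contains IDs, so it is often convenient to attach readable flow names from the flow listing. The following sketch shows one way to do that with base R's merge(), using small hand-copied excerpts of the two listings rather than live server results; the object names runs_small and flows_small are illustrative.

```r
# Excerpt of the run listing for task 59 (run.id, task.id, flow.id).
runs_small <- data.frame(
  run.id = c(234L, 447L, 491L),
  task.id = 59L,
  flow.id = c(56L, 61L, 62L)
)

# Excerpt of the flow listing (flow.id, full.name).
flows_small <- data.frame(
  flow.id = c(56L, 61L, 62L),
  full.name = c("weka.ZeroR(1)", "weka.REPTree(1)", "weka.DecisionStump(1)"),
  stringsAsFactors = FALSE
)

# Left join: keep every run, add the flow name where one is known.
runs_named <- merge(runs_small, flows_small, by = "flow.id", all.x = TRUE)
runs_named <- runs_named[order(runs_named$run.id),
                         c("run.id", "task.id", "flow.id", "full.name")]
print(runs_named)
```

With all.x = TRUE, runs whose flow is missing from the flow table are kept and simply get NA as their name.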

# one of the IDs (here: task.id) must be supplied
run.results = listOMLRunEvaluations(task.id = 59L)
str(run.results)

## 'data.frame':    4283 obs. of  35 variables:
##  $ run.id                       : int  81 161 234 447 473 491 550 6088 6157 6158 ...
##  $ task.id                      : int  59 59 59 59 59 59 59 59 59 59 ...
##  $ setup.id                     : int  12 13 1 6 18 7 16 11 12 3 ...
##  $ flow.id                      : int  67 70 56 61 77 62 75 66 67 58 ...
##  $ flow.name                    : chr  "weka.BayesNet_K2(1)" "weka.SMO_PolyKernel(1)" "weka.ZeroR(1)" "weka.REPTree(1)" ...
##  $ flow.version                 : chr  "1" "1" "1" "1" ...
##  $ flow.source                  : chr  "weka" "weka" "weka" "weka" ...
##  $ learner.name                 : chr  "BayesNet_K2" "SMO_PolyKernel" "ZeroR" "REPTree" ...
##  $ data.name                    : chr  "iris" "iris" "iris" "iris" ...
##  $ upload.time                  : chr  "2014-04-07 00:05:11" "2014-04-07 00:55:32" "2014-04-07 01:33:24" "2014-04-07 06:26:27" ...
##  $ area.under.roc.curve         : num  0.983 0.977 0.5 0.967 0.978 ...
##  $ average.cost                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ build.cpu.time               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ build.memory                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ f.measure                    : num  0.94 0.96 0.167 0.927 0.947 ...
##  $ kappa                        : num  0.91 0.94 0 0.89 0.92 0.5 0.95 0.93 0.91 0.93 ...
##  $ kb.relative.information.score: num  1.39e+02 9.09e+01 -6.80e-05 1.31e+02 1.38e+02 ...
##  $ mean.absolute.error          : num  0.0384 0.2311 0.4444 0.0671 0.0392 ...
##  $ mean.prior.absolute.error    : num  0.444 0.444 0.444 0.444 0.444 ...
##  $ number.of.instances          : num  150 150 150 150 150 150 150 150 150 150 ...
##  $ precision                    : num  0.94 0.96 0.111 0.927 0.947 ...
##  $ predictive.accuracy          : num  0.94 0.96 0.333 0.927 0.947 ...
##  $ prior.entropy                : num  1.58 1.58 1.58 1.58 1.58 ...
##  $ recall                       : num  0.94 0.96 0.333 0.927 0.947 ...
##  $ relative.absolute.error      : num  0.0863 0.52 1 0.151 0.0881 ...
##  $ root.mean.prior.squared.error: num  0.471 0.471 0.471 0.471 0.471 ...
##  $ root.mean.squared.error      : num  0.16 0.288 0.471 0.211 0.178 ...
##  $ root.relative.squared.error  : num  0.339 0.611 1 0.447 0.377 ...
##  $ scimark.benchmark            : num  1981 1980 2011 1887 1998 ...
##  $ total.cost                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ unweighted.recall            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ usercpu.time.millis          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ usercpu.time.millis.testing  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ usercpu.time.millis.training : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weighted.recall              : num  NA NA NA NA NA NA NA NA NA NA ...
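Since the run results come back as a data.frame, comparing flows on a task is a matter of sorting by one of the measure columns. The sketch below ranks runs by predictive.accuracy. The object run.results.toy is a hand-typed stand-in (an assumption for illustration) that copies the first four runs and three of the columns from the str output above; with a live connection the same order call could be applied to the full listing.

```r
# Hand-typed stand-in: first four runs of the listing above, three columns only
run.results.toy = data.frame(
  run.id = c(81L, 161L, 234L, 447L),
  flow.name = c("weka.BayesNet_K2(1)", "weka.SMO_PolyKernel(1)",
    "weka.ZeroR(1)", "weka.REPTree(1)"),
  predictive.accuracy = c(0.94, 0.96, 0.333, 0.927),
  stringsAsFactors = FALSE
)

# Sort the runs from best to worst predictive accuracy
ranked = run.results.toy[order(run.results.toy$predictive.accuracy, decreasing = TRUE), ]
rownames(ranked) = NULL
ranked
```

In this small excerpt, weka.SMO_PolyKernel(1) ranks first and the baseline weka.ZeroR(1) last.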

List evaluation measures and task types

Analogously to the previous listings, one can list further objects simply by calling the respective functions.

listOMLDataSetQualities()
listOMLEstimationProcedures()
listOMLEvaluationMeasures()
listOMLTaskTypes()

Downloading

Users can download data sets, tasks, flows and runs from the OpenML server. The package provides special representations for each object, which will be discussed here.

Download an OpenML data set

To directly download a data set, e.g., when you want to run a few preliminary experiments, you can use the function getOMLDataSet. The function accepts a data set ID as input and returns the corresponding OMLDataSet:

iris.data = getOMLDataSet(data.id = 61L)  # the iris data set has the data set ID 61

Download an OpenML task

The following call returns an OpenML task object for a supervised classification task on the iris data:

task = getOMLTask(task.id = 59L)
task

## 
## OpenML Task 59 :: (Data ID = 61)
##   Task Type            : Supervised Classification
##   Data Set             : iris :: (Version = 1, OpenML ID = 61)
##   Target Feature(s)    : class
##   Tags                 : basic, study_1, study_41, study_50, study_7, study_89, testsuite, under100k, ...
##   Estimation Procedure : Stratified crossvalidation (1 x 10 folds)
##   Evaluation Measure(s): predictive_accuracy

The corresponding "OMLDataSet" object can be accessed by

task$input$data.set

## 
## Data Set 'iris' :: (Version = 1, OpenML ID = 61)
##   Collection Date         : 1936
##   Creator(s)              : R.A. Fisher
##   Default Target Attribute: class

and the type of the task can be shown with the next line

task$task.type

## [1] "Supervised Classification"

Also, it is possible to extract the data set itself via

iris.data = task$input$data.set$data
head(iris.data)

##   sepallength sepalwidth petallength petalwidth       class
## 0         5.1        3.5         1.4        0.2 Iris-setosa
## 1         4.9        3.0         1.4        0.2 Iris-setosa
## 2         4.7        3.2         1.3        0.2 Iris-setosa
## 3         4.6        3.1         1.5        0.2 Iris-setosa
## 4         5.0        3.6         1.4        0.2 Iris-setosa
## 5         5.4        3.9         1.7        0.4 Iris-setosa

Download an OpenML flow

Aside from tasks and data sets, one can also download flows by calling getOMLFlow with the specific flow.id:

flow = getOMLFlow(flow.id = 2700L)
flow

## 
## Flow 'classif.randomForest' :: (Version = 47, Flow ID = 2700)
##  External Version         : R_3.1.2-734b029d
##  Dependencies             : mlr_2.9, randomForest_4.6.12
##  Number of Flow Parameters: 16
##  Number of Flow Components: 0

Download an OpenML run

To download the results of one run, including all server- and user-computed metrics, you have to specify the corresponding run ID. For all runs related to a task, the corresponding ID can be extracted from the runs object, which was created in the previous section. Here we use a run of task 59, which has the run.id 524027. Single OpenML runs can be downloaded with the function getOMLRun:

task.list = listOMLRuns(task.id = 59L)
task.list[281:285, ]

##      run.id task.id setup.id flow.id uploader error.message
## 281 7244063      59  5275959    6952        1          <NA>
## 282 7245683      59  5277579    6952        1          <NA>
## 283 7245684      59  5277580    6952        1          <NA>
## 284 7245686      59  5277582    6952        1          <NA>
## 285 7245687      59  5277583    6952        1          <NA>

run = getOMLRun(run.id = 524027L)
run

## 
## OpenML Run 524027 :: (Task ID = 59, Flow ID = 2393)
##  User ID  : 970
##  Learner  : classif.randomForest(43)
##  Task type: Supervised Classification

Each OMLRun object is a list object, which stores additional information on the run. For instance, the flow of the previously downloaded run has some non-default settings for hyperparameters, which can be obtained by:

run$parameter.setting  # retrieve the list of parameter settings

## $seed
##  (parameter of component 2393) seed = 1
## 
## $kind
##  (parameter of component 2393) kind = Mersenne-Twister
## 
## $normal.kind
##  (parameter of component 2393) normal.kind = Inversion

If the underlying flow has hyperparameters that differ from the default values of the corresponding learner, they are also shown; otherwise the default hyperparameters are used (but not explicitly listed).

All the data that served as input for the run, including data set IDs and the URL to the data, is stored in input.data:

run$input.data

## 
## ** Data Sets **
##   data.id name                                                          url
## 1      61 iris https://www.openml.org/data/download/61/dataset_61_iris.arff
## 
## ** Files **
## data frame with 0 columns and 0 rows
## 
## ** Evaluations **
## data frame with 0 columns and 0 rows

Predictions made by an uploaded run are stored within the predictions element and can be retrieved via

head(run$predictions, 10)

##    repeat fold row_id      prediction           truth confidence.Iris-setosa confidence.Iris-versicolor
## 1       0    0     43     Iris-setosa     Iris-setosa                      1                          0
## 2       0    0     14     Iris-setosa     Iris-setosa                      1                          0
## 3       0    0     37     Iris-setosa     Iris-setosa                      1                          0
## 4       0    0     23     Iris-setosa     Iris-setosa                      1                          0
## 5       0    0     10     Iris-setosa     Iris-setosa                      1                          0
## 6       0    0     99 Iris-versicolor Iris-versicolor                      0                          1
## 7       0    0     87 Iris-versicolor Iris-versicolor                      0                          1
## 8       0    0     97 Iris-versicolor Iris-versicolor                      0                          1
## 9       0    0     62 Iris-versicolor Iris-versicolor                      0                          1
## 10      0    0     92 Iris-versicolor Iris-versicolor                      0                          1
##    confidence.Iris-virginica
## 1                          0
## 2                          0
## 3                          0
## 4                          0
## 5                          0
## 6                          0
## 7                          0
## 8                          0
## 9                          0
## 10                         0

The output above shows predictions, ground truth information about classes and task-specific information, e.g., about the confidence of a classifier (for every observation) or in which fold a data point has been placed.
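Given the prediction and truth columns, simple performance figures can be recomputed locally. The following sketch computes the overall and the per-fold accuracy. The data frame pred.toy is made up for illustration: it only mimics the column layout of run$predictions (fold, row_id, prediction, truth), and its values, including one deliberate misclassification, are not taken from any real run. With a downloaded run, the same two expressions could be applied to run$predictions.

```r
# Made-up predictions in the same layout as run$predictions (illustration only)
pred.toy = data.frame(
  fold = c(0L, 0L, 0L, 1L, 1L, 1L),
  row_id = c(43L, 14L, 99L, 5L, 60L, 120L),
  prediction = c("Iris-setosa", "Iris-setosa", "Iris-versicolor",
    "Iris-setosa", "Iris-virginica", "Iris-virginica"),
  truth = c("Iris-setosa", "Iris-setosa", "Iris-versicolor",
    "Iris-setosa", "Iris-versicolor", "Iris-virginica"),
  stringsAsFactors = FALSE
)

# Compare as character so that factor columns with different levels also work
correct = as.character(pred.toy$prediction) == as.character(pred.toy$truth)

acc.overall = mean(correct)                     # share of correct predictions
acc.by.fold = tapply(correct, pred.toy$fold, mean)  # one accuracy value per fold
acc.overall
acc.by.fold
```

Comparing such locally computed values with the server-side evaluations is one way to sanity-check a run.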

Running

The modularized structure of OpenML allows one to apply the implementation of an algorithm to a specific task, and there are multiple ways to do this.

Run a task with a specified mlr learner

If one is working with mlr, one can specify an RLearner object and use the function runTaskMlr to create the desired "OMLMlrRun" object. The task is created the same way as in the previous sections:

task = getOMLTask(task.id = 59L)
library("mlr")
lrn = makeLearner("classif.rpart")
run.mlr = runTaskMlr(task, lrn)
run.mlr

## $run
## 
## OpenML Run NA :: (Task ID = 59, Flow ID = NA)
## 
## $bmr
##   task.id    learner.id acc.test.join timetrain.test.sum timepredict.test.sum
## 1    iris classif.rpart          0.94               0.01                 0.03
## 
## $flow
## 
## Flow 'mlr.classif.rpart' :: (Version = NA, Flow ID = NA)
##  External Version         : R_4.2.1-v2.4b8be4e0
##  Dependencies             : R_4.2.1, OpenML_1.12, mlr_2.19.0, rpart_4.1.16
##  Number of Flow Parameters: 14
##  Number of Flow Components: 0
## 
## attr(,"class")
## [1] "OMLMlrRun"

Note that locally created runs don’t have a run ID or flow ID yet.These are assigned by the OpenML server after uploading the run.

Run a task without using mlr

If you are not using mlr, you will have to invest some more time and effort to get things done, since this is not supported yet. So, unless you have good reasons to do otherwise, we strongly encourage you to use mlr. If the algorithm you want to use is not integrated in mlr yet, you can integrate it yourself (see the tutorial) or open an issue on the mlr GitHub repository and hope someone else will do it for you.

Uploading

The following section gives an overview of how one can contribute building blocks (i.e., data sets, flows and runs) to the OpenML server.

Upload a data set

A data set contains information that can be stored on OpenML and used by OpenML tasks and runs. This example shows how a very simple data set can be taken from R, converted to an OpenML data set and afterwards uploaded to the server. The corresponding workflow consists of the following three steps:

makeOMLDataSetDescription: create the description object of an OpenML data set
makeOMLDataSet: convert the data set into an OpenML data set
uploadOMLDataSet: upload the data set to the server

data("airquality")
dsc = "Daily air quality measurements in New York, May to September 1973. This data is taken from R."
cit = "Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth."

## (1) Create the description object
desc = makeOMLDataSetDescription(name = "airquality",
  description = dsc,
  creator = "New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data)",
  collection.date = "May 1, 1973 to September 30, 1973",
  language = "English",
  licence = "GPL-2",
  url = "https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html",
  default.target.attribute = "Ozone",
  citation = cit,
  tags = "R")

## (2) Create the OpenML data set
air.data = makeOMLDataSet(desc = desc,
  data = airquality,
  colnames.old = colnames(airquality),
  colnames.new = colnames(airquality),
  target.features = "Ozone")

## (3) Upload the OpenML data set to the server
## Because this is a simple data set which is generally already available in R
## please do not actually upload it to the server!
## The code would be:
#dataset.id = uploadOMLDataSet(air.data)
#dataset.id

Alternatively, you can enter data directly on the OpenML website.
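Before uploading a data set, it can be worth checking locally that the chosen target column exists and how many values are missing, since this information is useful for the description. The following sketch does this for the airquality data with base R only. It is an optional suggestion and not a step required by the package; the variable names target and na.counts are chosen here for illustration.

```r
data("airquality")
target = "Ozone"  # the column used as default.target.attribute above

# The target column must be present in the data
stopifnot(target %in% colnames(airquality))

# Number of missing values per column
na.counts = colSums(is.na(airquality))
na.counts

# Share of rows in which the target itself is missing
mean(is.na(airquality[[target]]))
```

For airquality this shows that the target Ozone is missing in 37 of the 153 rows, which is worth mentioning in the data set description.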

Upload a flow

A flow is an implementation of a single algorithm or a script. Each mlr learner can be considered an implementation of a flow, which can be uploaded to the server with the function uploadOMLFlow. If the flow has already been uploaded to the server (either by you or someone else), you receive a message that the flow already exists and the function returns the existing flow.id. Otherwise, the flow is uploaded, receives its own flow.id, and that ID is returned.

library("mlr")
lrn = makeLearner("classif.randomForest")
flow.id = uploadOMLFlow(lrn)
flow.id

Upload a run

In addition to uploading data sets or flows, one can also upload runs (which have to be created beforehand, e.g., using mlr):

## choose 2 flows (i.e., mlr-learners)
learners = list(
  makeLearner("classif.kknn"),
  makeLearner("classif.randomForest")
)
## pick 3 random tasks
task.ids = c(57, 59, 2382)
for (lrn in learners) {
  for (id in task.ids) {
    task = getOMLTask(id)
    res = runTaskMlr(task, lrn)$run
    run.id = uploadOMLRun(res)  # upload results
  }
}

Before your run is uploaded to the server, uploadOMLRun checks whether the flow that created this run is already available on the server. If the flow does not exist on the server, it will (automatically) be uploaded as well.
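The loop above overwrites run.id in every iteration, so only the last ID survives. A minimal sketch of one way to keep all IDs is given below. The helper collectRunIds is hypothetical (it is not part of the OpenML package); it takes the run and upload steps as function arguments, so it can be tried with stand-in functions and without contacting the server.

```r
# Hypothetical helper: apply every learner to every task and collect the run IDs.
# `run.fun` creates a run, `upload.fun` uploads it and returns its ID.
collectRunIds = function(learners, task.ids, run.fun, upload.fun) {
  grid = expand.grid(learner = seq_along(learners), task.id = task.ids)
  grid$run.id = NA_integer_
  for (i in seq_len(nrow(grid))) {
    lrn = learners[[grid$learner[i]]]
    grid$run.id[i] = tryCatch(
      upload.fun(run.fun(grid$task.id[i], lrn)),
      error = function(e) NA_integer_  # keep going if a single run or upload fails
    )
  }
  grid
}

# Dry run with stand-in functions: nothing is uploaded and the IDs are fake
res = collectRunIds(
  learners = list("learner.a", "learner.b"),
  task.ids = c(57, 59, 2382),
  run.fun = function(task.id, lrn) list(task.id = task.id, learner = lrn),
  upload.fun = function(run) as.integer(run$task.id)
)
res
```

For real use one would presumably pass run.fun = function(task.id, lrn) runTaskMlr(getOMLTask(task.id), lrn)$run and upload.fun = uploadOMLRun, together with a list of mlr learners, which requires a personal API key with write access.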

Feedback

Now you should have an idea of how to use our package. However, as there is always room for improvement, we are more than happy to receive your feedback. So, in case

there is anything not well documented,
you encounter a bug, or
you are missing functionality

please open an issue in the issue tracker of our GitHub repository.
