The R package OpenML is an interface that makes interacting with the OpenML server as comfortable as possible. For example, users can download and upload files, run their implementations on specific tasks, and get predictions in the correct form directly via R commands. In this tutorial, we show the most important functions of this package and give examples of standard workflows.
For general information on what OpenML is, please have a look at the README file or visit the official OpenML website.
After installation and before making practical use of the package, in most cases it is desirable to set up a configuration file to simplify further steps. Afterwards, there are different basic stages when using this package or OpenML, respectively:
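- Listing: the function names begin with listOML and the result is always a data.frame (available for DataSets, Tasks, Flows, Runs, RunEvaluations, EvaluationMeasures, and TaskTypes)
- Downloading: the function names begin with getOML (available for DataSets, Tasks, Runs, Predictions, and Flows)
- Running: the function runTaskMlr takes an OMLTask and a Learner as input and returns an OMLMlrRun, which contains an OMLRun
- Uploading: the function uploadOMLRun

Installation works as in any other package using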

    install.packages("OpenML")

To install the current development version, use the devtools package and run

    devtools::install_github("openml/openml-r")

Using the OpenML package also requires a reader for the ARFF file format. By default, farff is used. Alternatively, the RWeka package can be used. You can install the packages with the following call.

    install.packages(c("farff", "RWeka"))

All examples in this tutorial are given with a READ-ONLY API key.
With this key you can read all the information from the server but not write data sets, tasks, flows, and runs to the server. The key lets you emulate uploading to the server but does not actually store any data. If you want to write data to the server, you have to get a personal API key. The process of obtaining a key is shown in the configuration section.
Important: Please do not write meaningless data to the server such as copies of already existing data sets, tasks, or runs (such as the ones from this tutorial)! One instance of the Iris data set should be enough for everyone. :D
In this section you can find an example of how to download a task from the server, print some information about it to the console, and produce a run which is then uploaded to the server. For detailed information on OpenML terminology (task, run, etc.), see the OpenML guide.

    library("OpenML")
    ## temporarily set API key to read only key
    setOMLConfig(apikey = "c1994bdb7ecb3c6f3c8f3b35f4b47f1f")
    ## OpenML configuration:
    ##   server           : https://www.openml.org/api/v1
    ##   cachedir         : C:\Users\GDaddy\AppData\Local\Temp\RtmpUzg5GZ/working_dir\RtmpMfS7jw/cache
    ##   verbosity        : 0
    ##   arff.reader      : farff
    ##   confirm.upload   : TRUE
    ##   apikey           : ***************************47f1f

    # download a task (whose ID is 1L)
    task = getOMLTask(task.id = 1L)
    ## Warning in getOMLDataSetById(data.id = data.id, cache.only = cache.only, : Data set has been deactivated.
    task
    ## 
    ## OpenML Task 1 :: (Data ID = 1)
    ##   Task Type            : Supervised Classification
    ##   Data Set             : anneal :: (Version = 2, OpenML ID = 1)
    ##   Target Feature(s)    : class
    ##   Tags                 : basic, study_1, study_41, study_7, study_73, study_89, test-tagging, testtag,...
    ##   Estimation Procedure : Stratified crossvalidation (1 x 10 folds)
    ##   Evaluation Measure(s): predictive_accuracy

As the printed output shows, the task contains information on the following: the task type, the data set, the target feature(s), the tags, the estimation procedure, and the evaluation measure(s).
In the next step, randomForest is used as a classifier and run with the help of the mlr package. Note that one needs to run the algorithm locally and that mlr will automatically load the package that is needed to run the specified classifier.

    # define the classifier (usually called "flow" within OpenML)
    library("mlr")
    lrn = makeLearner("classif.randomForest")

    # upload the new flow (with information about the algorithm and settings);
    # if this algorithm already exists on the server, one will receive a message
    # with the ID of the existing flow
    flow.id = uploadOMLFlow(lrn)

    # the last step is to perform a run and upload the results
    run.mlr = runTaskMlr(task, lrn)
    run.id = uploadOMLRun(run.mlr)

Following this very brief example, we will explain the individual steps of the OpenML package in more detail in the next sections.
Interacting with the OpenML server requires an API key. For demonstration purposes, we have created a public read-only API key ("c1994bdb7ecb3c6f3c8f3b35f4b47f1f"), which will be used in this tutorial to make the examples executable. However, for full-fledged usage of the OpenML package, you need your personal API key.
In order to receive your own API key
You can set your own OpenML configuration either temporarily for the current R session via setOMLConfig or permanently via saveOMLConfig. In order to create a permanent configuration file using default values and at the same time set your personal API key, run

    saveOMLConfig(apikey = "c1994bdb7ecb3c6f3c8f3b35f4b47f1f")

where "c1994bdb7ecb3c6f3c8f3b35f4b47f1f" should be replaced with your personal API key. Note that basically everybody who has access to your computer can read the configuration file and thus see your API key. With your API key, other users have full access to your account via the API, so please handle it with care!
It is also possible to manually create a file ~/.openml/config in your home directory; you can use the R command path.expand("~/.openml/config") to get the full path to the configuration file on your operating system. The config file consists of key = value pairs; note that the values are not quoted. An exemplary minimal config file might look as follows:

    apikey=c1994bdb7ecb3c6f3c8f3b35f4b47f1f

The config file may contain the following information:
- server: the default is https://www.openml.org/api/v1
- cachedir: the default is file.path(tempdir(), "cache")
- verbosity:
  - 0: normal output
  - 1: info output (default)
  - 2: debug output
- arff.reader:
  - RWeka: this is the standard Java parser used in Weka
  - farff: the farff package provides a newer, faster parser without any Java requirements
- confirm.upload: if this is disabled (FALSE), one does not need to confirm the upload decision
- apikey: your API key

If you manually modify the config file, you need to reload the modified config file into the current R session using loadOMLConfig(). You can query the current configuration using

    getOMLConfig()
    ## OpenML configuration:
    ##   server           : https://www.openml.org/api/v1
    ##   cachedir         : C:\Users\GDaddy\AppData\Local\Temp\RtmpUzg5GZ/working_dir\RtmpMfS7jw/cache
    ##   verbosity        : 0
    ##   arff.reader      : farff
    ##   confirm.upload   : TRUE
    ##   apikey           : ***************************47f1f

The configuration file and some related things are also explained in the OpenML Wiki.
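To make the unquoted key = value format more concrete, here is a minimal base-R sketch that writes a small config file to a temporary location and reads it back. The helper readKeyValueFile is hypothetical and only illustrates the file format; it is not how the package itself parses the file (use loadOMLConfig() for that), and the file is deliberately not written to ~/.openml/config.

```r
# Write an example config file to a temporary path (assumption: we do not
# want to touch the real ~/.openml/config in this illustration).
cfg.path = tempfile("openml-config-")
writeLines(c(
  "apikey=c1994bdb7ecb3c6f3c8f3b35f4b47f1f",
  "verbosity = 1",
  "arff.reader = farff"
), cfg.path)

# Hypothetical helper: split every line at its first "=" and trim whitespace.
# Values stay character strings because the file format does not quote them.
readKeyValueFile = function(path) {
  lines = trimws(readLines(path))
  pos = regexpr("=", lines, fixed = TRUE)
  keep = pos > 0                      # ignore lines without a "="
  lines = lines[keep]
  pos = pos[keep]
  keys = trimws(substr(lines, 1L, pos - 1L))
  values = trimws(substr(lines, pos + 1L, nchar(lines)))
  setNames(as.list(values), keys)
}

cfg = readKeyValueFile(cfg.path)
cfg$arff.reader
```

Both "apikey=..." and "verbosity = 1" are accepted here because whitespace around the separator is trimmed; whether the package is equally lenient is not something this sketch asserts.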
Once the config file is set up, you are ready to go!
In this stage, we want to list basic information about the various OpenML objects: data sets, tasks, flows, runs, run results, evaluation measures, and task types.
For each of these objects, we have a function to query the information, beginning with listOML. All of these functions return a data.frame, even if the result consists of a single column or has zero observations (i.e., rows).
Note that the listOML* functions only list information on the corresponding objects; they do not download the respective objects. Information on actually downloading specific objects is covered in the next section.
To browse the OpenML database for appropriate data sets, you can use listOMLDataSets() to get basic data characteristics (number of features, instances, classes, missing values, etc.) for each data set. By default, listOMLDataSets() returns only data sets that have an active status on OpenML:

    datasets = listOMLDataSets()  # returns active data sets

The resulting data.frame contains the following information for each of the listed data sets:

- the data set ID (data.id)
- the status ("active", "in_preparation" or "deactivated") of the data set
- the name of the data set
- further data characteristics (e.g., majority.class.size)

    str(datasets)
    ## 'data.frame':    4390 obs. of  16 variables:
    ##  $ data.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
    ##  $ name                                   : chr  "anneal" "kr-vs-kp" "labor" "arrhythmia" ...
    ##  $ version                                : int  1 1 1 1 1 1 1 1 1 1 ...
    ##  $ status                                 : chr  "active" "active" "active" "active" ...
    ##  $ format                                 : chr  "ARFF" "ARFF" "ARFF" "ARFF" ...
    ##  $ tags                                   : chr  "" "" "" "" ...
    ##  $ majority.class.size                    : int  684 1669 37 245 813 57 NA 67 81 288 ...
    ##  $ max.nominal.att.distinct.values        : int  7 3 3 13 26 24 NA 22 8 3 ...
    ##  $ minority.class.size                    : int  8 1527 20 2 734 1 NA 3 2 49 ...
    ##  $ number.of.classes                      : int  5 2 2 13 26 24 0 6 4 3 ...
    ##  $ number.of.features                     : int  39 37 17 280 17 70 6 26 19 5 ...
    ##  $ number.of.instances                    : int  898 3196 57 452 20000 226 345 205 148 625 ...
    ##  $ number.of.instances.with.missing.values: int  898 0 56 384 0 222 0 46 0 0 ...
    ##  $ number.of.missing.values               : int  22175 0 326 408 0 317 0 59 0 0 ...
    ##  $ number.of.numeric.features             : int  6 0 8 206 16 0 6 15 3 4 ...
    ##  $ number.of.symbolic.features            : int  33 37 9 74 1 70 0 11 16 1 ...

    head(datasets[, 1:5])
    ##   data.id       name version status format
    ## 1       2     anneal       1 active   ARFF
    ## 2       3   kr-vs-kp       1 active   ARFF
    ## 3       4      labor       1 active   ARFF
    ## 4       5 arrhythmia       1 active   ARFF
    ## 5       6     letter       1 active   ARFF
    ## 6       7  audiology       1 active   ARFF

To find a specific data set, you can now query the resulting datasets object. Suppose we want to find the iris data set.

    subset(datasets, name == "iris")
    ##      data.id name version status format tags majority.class.size max.nominal.att.distinct.values
    ## 53        61 iris       1 active   ARFF                       50                               3
    ## 812      969 iris       3 active   ARFF                      100                               2
    ## 2602   41510 iris       9 active   ARFF                       NA                               3
    ## 2603   41511 iris      10 active   ARFF                       50                               3
    ## 2636   41567 iris      11 active   ARFF                       NA                               3
    ## 2637   41568 iris      12 active   ARFF                       50                               3
    ## 2638   41582 iris      13 active   ARFF                       NA                               3
    ## 2639   41583 iris      14 active   ARFF                       50                               3
    ## 2889   41996 iris      15 active   ARFF                       NA                               3
    ## 2890   41997 iris      16 active   ARFF                       50                               3
    ## 2892   42002 iris      17 active   ARFF                       NA                               3
    ## 2893   42003 iris      18 active   ARFF                       50                               3
    ## 2896   42010 iris      19 active   ARFF                       NA                               3
    ## 2897   42011 iris      20 active   ARFF                       50                               3
    ## 2898   42015 iris      21 active   ARFF                       NA                               3
    ## 2899   42016 iris      22 active   ARFF                       50                               3
    ## 2900   42020 iris      23 active   ARFF                       NA                               3
    ## 2901   42021 iris      24 active   ARFF                       50                               3
    ## 2902   42025 iris      25 active   ARFF                       NA                               3
    ## 2903   42026 iris      26 active   ARFF                       50                               3
    ## 2904   42030 iris      27 active   ARFF                       NA                               3
    ## 2905   42031 iris      28 active   ARFF                       50                               3
    ## 2906   42035 iris      29 active   ARFF                       NA                               3
    ## 2907   42036 iris      30 active   ARFF                       50                               3
    ## 2908   42040 iris      31 active   ARFF                       NA                               3
    ## 2909   42041 iris      32 active   ARFF                       50                               3
    ## 2910   42045 iris      33 active   ARFF                       NA                               3
    ## 2911   42046 iris      34 active   ARFF                       50                               3
    ## 2912   42050 iris      35 active   ARFF                       NA                               3
    ## 2913   42051 iris      36 active   ARFF                       50                               3
    ## 2914   42055 iris      37 active   ARFF                       NA                               3
    ## 2915   42056 iris      38 active   ARFF                       50                               3
    ## 2920   42065 iris      39 active   ARFF                       NA                               3
    ## 2921   42066 iris      40 active   ARFF                       50                               3
    ## 2922   42070 iris      41 active   ARFF                       NA                               3
    ## 2923   42071 iris      42 active   ARFF                       50                               3
    ## 2934   42091 iris      43 active   ARFF                       NA                               3
    ## 2937   42097 iris      44 active   ARFF                       NA                               3
    ## 2938   42098 iris      45 active   ARFF                       50                               3
    ## 3190   42661 iris      46 active   arff                       NA                              NA
    ## 3206   42699 iris      47 active   ARFF                       NA                              NA
    ## 3207   42700 iris      48 active   ARFF                       50                              NA
    ## 3292   42851 iris      49 active   ARFF                       NA                              NA
    ## 3309   42871 iris      50 active   ARFF                       NA                              NA
    ##      minority.class.size number.of.classes number.of.features number.of.instances
    ## 53                    50                 3                  5                 150
    ## 812                   50                 2                  5                 150
    ## 2602                  NA                NA                  5                 150
    ## 2603                  50                 3                  5                 150
    ## 2636                  NA                NA                  5                 150
    ## 2637                  50                 3                  5                 150
    ## 2638                  NA                NA                  5                 150
    ## 2639                  50                 3                  5                 150
    ## 2889                  NA                NA                  5                 150
    ## 2890                  50                 3                  5                 150
    ## 2892                  NA                NA                  5                 150
    ## 2893                  50                 3                  5                 150
    ## 2896                  NA                NA                  5                 150
    ## 2897                  50                 3                  5                 150
    ## 2898                  NA                NA                  5                 150
    ## 2899                  50                 3                  5                 150
    ## 2900                  NA                NA                  5                 150
    ## 2901                  50                 3                  5                 150
    ## 2902                  NA                NA                  5                 150
    ## 2903                  50                 3                  5                 150
    ## 2904                  NA                NA                  5                 150
    ## 2905                  50                 3                  5                 150
    ## 2906                  NA                NA                  5                 150
    ## 2907                  50                 3                  5                 150
    ## 2908                  NA                NA                  5                 150
    ## 2909                  50                 3                  5                 150
    ## 2910                  NA                NA                  5                 150
    ## 2911                  50                 3                  5                 150
    ## 2912                  NA                NA                  5                 150
    ## 2913                  50                 3                  5                 150
    ## 2914                  NA                NA                  5                 150
    ## 2915                  50                 3                  5                 150
    ## 2920                  NA                NA                  5                 150
    ## 2921                  50                 3                  5                 150
    ## 2922                  NA                NA                  5                 150
    ## 2923                  50                 3                  5                 150
    ## 2934                  NA                NA                  5                 150
    ## 2937                  NA                NA                  5                 150
    ## 2938                  50                 3                  5                 150
    ## 3190                  NA                NA                  5                 150
    ## 3206                  NA                NA                  5                 150
    ## 3207                  50                 3                  5                 150
    ## 3292                  NA                NA                  7                 150
    ## 3309                  NA                NA                  7                 150
    ##      number.of.instances.with.missing.values number.of.missing.values number.of.numeric.features
    ## 53                                         0                        0                          4
    ## 812                                        0                        0                          4
    ## 2602                                       0                        0                          4
    ## 2603                                       0                        0                          4
    ## 2636                                       0                        0                          4
    ## 2637                                       0                        0                          4
    ## 2638                                       0                        0                          4
    ## 2639                                       0                        0                          4
    ## 2889                                       0                        0                          4
    ## 2890                                       0                        0                          4
    ## 2892                                       0                        0                          4
    ## 2893                                       0                        0                          4
    ## 2896                                       0                        0                          4
    ## 2897                                       0                        0                          4
    ## 2898                                       0                        0                          4
    ## 2899                                       0                        0                          4
    ## 2900                                       0                        0                          4
    ## 2901                                       0                        0                          4
    ## 2902                                       0                        0                          4
    ## 2903                                       0                        0                          4
    ## 2904                                       0                        0                          4
    ## 2905                                       0                        0                          4
    ## 2906                                       0                        0                          4
    ## 2907                                       0                        0                          4
    ## 2908                                       0                        0                          4
    ## 2909                                       0                        0                          4
    ## 2910                                       0                        0                          4
    ## 2911                                       0                        0                          4
    ## 2912                                       0                        0                          4
    ## 2913                                       0                        0                          4
    ## 2914                                       0                        0                          4
    ## 2915                                       0                        0                          4
    ## 2920                                       0                        0                          4
    ## 2921                                       0                        0                          4
    ## 2922                                       0                        0                          4
    ## 2923                                       0                        0                          4
    ## 2934                                       0                        0                          4
    ## 2937                                       0                        0                          4
    ## 2938                                       0                        0                          4
    ## 3190                                       0                        0                          4
    ## 3206                                       0                        0                          4
    ## 3207                                       0                        0                          4
    ## 3292                                       0                        0                          4
    ## 3309                                       0                        0                          4
    ##      number.of.symbolic.features
    ## 53                             1
    ## 812                            1
    ## 2602                           1
    ## 2603                           1
    ## 2636                           1
    ## 2637                           1
    ## 2638                           1
    ## 2639                           1
    ## 2889                           1
    ## 2890                           1
    ## 2892                           1
    ## 2893                           1
    ## 2896                           1
    ## 2897                           1
    ## 2898                           1
    ## 2899                           1
    ## 2900                           1
    ## 2901                           1
    ## 2902                           1
    ## 2903                           1
    ## 2904                           1
    ## 2905                           1
    ## 2906                           1
    ## 2907                           1
    ## 2908                           1
    ## 2909                           1
    ## 2910                           1
    ## 2911                           1
    ## 2912                           1
    ## 2913                           1
    ## 2914                           1
    ## 2915                           1
    ## 2920                           1
    ## 2921                           1
    ## 2922                           1
    ## 2923                           1
    ## 2934                           1
    ## 2937                           1
    ## 2938                           1
    ## 3190                           0
    ## 3206                           1
    ## 3207                           1
    ## 3292                           3
    ## 3309                           3

As you can see, there are many data sets called iris. We want to use the original data set with three classes, which is stored under the data set ID (data.id) 61. The listing also contains later versions that report three classes, with the IDs 41511, 41568, 41583, 41997, 42003, 42011, 42016, 42021, 42026, 42031, 42036, 42041, 42046, 42051, 42056, 42066, 42071, 42098, and 42700. You can also have a closer look at the data set on the corresponding OpenML web page (https://www.openml.org/d/61).
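Because every listOML* function returns a plain data.frame, the usual base-R tools are enough to narrow a listing down. The following sketch uses a small mock data.frame (datasets.mock, an assumption made so that the example runs offline) whose values are copied from the first six rows of the str(datasets) output above; with a server connection, the same expressions could be applied to the real datasets object.

```r
# Offline stand-in for a few columns of listOMLDataSets(); the values are
# taken from the first six data sets shown in the str() output above.
datasets.mock = data.frame(
  data.id = c(2L, 3L, 4L, 5L, 6L, 7L),
  name = c("anneal", "kr-vs-kp", "labor", "arrhythmia", "letter", "audiology"),
  number.of.classes = c(5L, 2L, 2L, 13L, 26L, 24L),
  number.of.instances = c(898L, 3196L, 57L, 452L, 20000L, 226L),
  number.of.missing.values = c(22175L, 0L, 326L, 408L, 0L, 317L),
  stringsAsFactors = FALSE
)

# Binary classification data sets without any missing values
binary.complete = subset(datasets.mock,
  number.of.classes == 2L & number.of.missing.values == 0L)
binary.complete$name

# Data set names ordered from smallest to largest number of instances
by.size = datasets.mock$name[order(datasets.mock$number.of.instances)]
by.size
```

Filtering on the listing first keeps the number of actual downloads small, since only the data.id values that survive the filter need to be fetched afterwards.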
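Each OpenML task is a bundle that encapsulates information on various objects:

- a data set and its target feature
- a task type, e.g., "Supervised Classification" or "Supervised Regression"
- an estimation procedure
- an evaluation measure, e.g., "predictive accuracy" for a classification task

Listing the tasks can be done via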

    tasks = listOMLTasks()

The resulting data.frame contains for each of the listed tasks information on:

- the task.id
- the task.type
- the target.feature
- the tags, which can be used for labelling the task
- the estimation.procedure (aka resampling strategy)
- the evaluation.measures used for measuring the performance of the learner / flow on the task

    str(tasks)
    ## 'data.frame':    5000 obs. of  25 variables:
    ##  $ task.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
    ##  $ task.type                              : chr  "Supervised Classification" "Supervised Classification" "Supervised Classification" "Supervised Classification" ...
    ##  $ data.id                                : int  2 3 4 5 6 7 8 9 10 11 ...
    ##  $ name                                   : chr  "anneal" "kr-vs-kp" "labor" "arrhythmia" ...
    ##  $ status                                 : chr  "active" "active" "active" "active" ...
    ##  $ format                                 : chr  "ARFF" "ARFF" "ARFF" "ARFF" ...
    ##  $ estimation.procedure                   : chr  "10-fold Crossvalidation" "10-fold Crossvalidation" "10-fold Crossvalidation" "10-fold Crossvalidation" ...
    ##  $ evaluation.measures                    : chr  "predictive_accuracy" NA "predictive_accuracy" "predictive_accuracy" ...
    ##  $ target.feature                         : chr  "class" "class" "class" "class" ...
    ##  $ cost.matrix                            : chr  NA NA NA NA ...
    ##  $ source.data.labeled                    : chr  NA NA NA NA ...
    ##  $ target.feature.event                   : chr  NA NA NA NA ...
    ##  $ target.feature.left                    : chr  NA NA NA NA ...
    ##  $ target.feature.right                   : chr  NA NA NA NA ...
    ##  $ quality.measure                        : chr  NA NA NA NA ...
    ##  $ majority.class.size                    : int  684 1669 37 245 813 57 NA 67 81 288 ...
    ##  $ max.nominal.att.distinct.values        : int  7 3 3 13 26 24 NA 22 8 3 ...
    ##  $ minority.class.size                    : int  8 1527 20 2 734 1 NA 3 2 49 ...
    ##  $ number.of.classes                      : int  5 2 2 13 26 24 0 6 4 3 ...
    ##  $ number.of.features                     : int  39 37 17 280 17 70 6 26 19 5 ...
    ##  $ number.of.instances                    : int  898 3196 57 452 20000 226 345 205 148 625 ...
    ##  $ number.of.instances.with.missing.values: int  898 0 56 384 0 222 0 46 0 0 ...
    ##  $ number.of.missing.values               : int  22175 0 326 408 0 317 0 59 0 0 ...
    ##  $ number.of.numeric.features             : int  6 0 8 206 16 0 6 15 3 4 ...
    ##  $ number.of.symbolic.features            : int  33 37 9 74 1 70 0 11 16 1 ...

For some data sets, there may be more than one task available on the OpenML server. For example, one can look for "Supervised Classification" tasks that are available for data set 61 via

    head(subset(tasks, task.type == "Supervised Classification" & data.id == 61L)[, 1:5])
    ##      task.id                 task.type data.id name status
    ## 51        59 Supervised Classification      61 iris active
    ## 263      289 Supervised Classification      61 iris active
    ## 428     1823 Supervised Classification      61 iris active
    ## 535     1939 Supervised Classification      61 iris active
    ## 580     1992 Supervised Classification      61 iris active
    ## 3300    7306 Supervised Classification      61 iris active

A flow is the definition and implementation of a specific algorithm workflow or script, i.e., a flow is essentially the code / implementation of the algorithm.

    flows = listOMLFlows()
    str(flows)
    ## 'data.frame':    16365 obs. of  6 variables:
    ##  $ flow.id         : int  1 2 3 4 5 6 7 8 9 10 ...
    ##  $ full.name       : chr  "openml.evaluation.EuclideanDistance(1.0)" "openml.evaluation.PolynomialKernel(1.0)" "openml.evaluation.RBFKernel(1.0)" "openml.evaluation.area_under_roc_curve(1.0)" ...
    ##  $ name            : chr  "openml.evaluation.EuclideanDistance" "openml.evaluation.PolynomialKernel" "openml.evaluation.RBFKernel" "openml.evaluation.area_under_roc_curve" ...
    ##  $ version         : int  1 1 1 1 1 1 1 1 1 1 ...
    ##  $ external.version: chr  "" "" "" "" ...
    ##  $ uploader        : int  1 1 1 1 1 1 1 1 1 1 ...

    flows[56:63, 1:4]
    ##    flow.id             full.name               name version
    ## 56      56         weka.ZeroR(1)         weka.ZeroR       1
    ## 57      57          weka.OneR(1)          weka.OneR       1
    ## 58      58    weka.NaiveBayes(1)    weka.NaiveBayes       1
    ## 59      59          weka.JRip(1)          weka.JRip       1
    ## 60      60           weka.J48(1)           weka.J48       1
    ## 61      61       weka.REPTree(1)       weka.REPTree       1
    ## 62      62 weka.DecisionStump(1) weka.DecisionStump       1
    ## 63      63 weka.HoeffdingTree(1) weka.HoeffdingTree       1

A run is an experiment, which is executed on a given combination of task, flow and setup (i.e., the explicit parameter configuration of a flow). The corresponding results are stored as a run result. Both objects, i.e., runs and run results, can be listed via listOMLRuns or listOMLRunEvaluations, respectively. As each of those objects is defined by a task, setup and flow, you can extract runs and run results with specific combinations of task.id, setup.id and/or flow.id. For instance, listing all runs for task 59 (supervised classification on iris) can be done with

    runs = listOMLRuns(task.id = 59L)  # must be specified with the task, setup and/or implementation ID
    head(runs)
    ##   run.id task.id setup.id flow.id uploader error.message
    ## 1     81      59       12      67        1          <NA>
    ## 2    161      59       13      70        1          <NA>
    ## 3    234      59        1      56        1          <NA>
    ## 4    447      59        6      61        1          <NA>
    ## 5    473      59       18      77        1          <NA>
    ## 6    491      59        7      62        1          <NA>

    # one of the IDs (here: task.id) must be supplied
    run.results = listOMLRunEvaluations(task.id = 59L)
    str(run.results)
    ## 'data.frame':    4283 obs. of  35 variables:
    ##  $ run.id                       : int  81 161 234 447 473 491 550 6088 6157 6158 ...
    ##  $ task.id                      : int  59 59 59 59 59 59 59 59 59 59 ...
    ##  $ setup.id                     : int  12 13 1 6 18 7 16 11 12 3 ...
    ##  $ flow.id                      : int  67 70 56 61 77 62 75 66 67 58 ...
    ##  $ flow.name                    : chr  "weka.BayesNet_K2(1)" "weka.SMO_PolyKernel(1)" "weka.ZeroR(1)" "weka.REPTree(1)" ...
    ##  $ flow.version                 : chr  "1" "1" "1" "1" ...
    ##  $ flow.source                  : chr  "weka" "weka" "weka" "weka" ...
    ##  $ learner.name                 : chr  "BayesNet_K2" "SMO_PolyKernel" "ZeroR" "REPTree" ...
    ##  $ data.name                    : chr  "iris" "iris" "iris" "iris" ...
    ##  $ upload.time                  : chr  "2014-04-07 00:05:11" "2014-04-07 00:55:32" "2014-04-07 01:33:24" "2014-04-07 06:26:27" ...
    ##  $ area.under.roc.curve         : num  0.983 0.977 0.5 0.967 0.978 ...
    ##  $ average.cost                 : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ build.cpu.time               : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ build.memory                 : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ f.measure                    : num  0.94 0.96 0.167 0.927 0.947 ...
    ##  $ kappa                        : num  0.91 0.94 0 0.89 0.92 0.5 0.95 0.93 0.91 0.93 ...
    ##  $ kb.relative.information.score: num  1.39e+02 9.09e+01 -6.80e-05 1.31e+02 1.38e+02 ...
    ##  $ mean.absolute.error          : num  0.0384 0.2311 0.4444 0.0671 0.0392 ...
    ##  $ mean.prior.absolute.error    : num  0.444 0.444 0.444 0.444 0.444 ...
    ##  $ number.of.instances          : num  150 150 150 150 150 150 150 150 150 150 ...
    ##  $ precision                    : num  0.94 0.96 0.111 0.927 0.947 ...
    ##  $ predictive.accuracy          : num  0.94 0.96 0.333 0.927 0.947 ...
    ##  $ prior.entropy                : num  1.58 1.58 1.58 1.58 1.58 ...
    ##  $ recall                       : num  0.94 0.96 0.333 0.927 0.947 ...
    ##  $ relative.absolute.error      : num  0.0863 0.52 1 0.151 0.0881 ...
    ##  $ root.mean.prior.squared.error: num  0.471 0.471 0.471 0.471 0.471 ...
    ##  $ root.mean.squared.error      : num  0.16 0.288 0.471 0.211 0.178 ...
    ##  $ root.relative.squared.error  : num  0.339 0.611 1 0.447 0.377 ...
    ##  $ scimark.benchmark            : num  1981 1980 2011 1887 1998 ...
    ##  $ total.cost                   : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ unweighted.recall            : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ usercpu.time.millis          : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ usercpu.time.millis.testing  : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ usercpu.time.millis.training : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ weighted.recall              : num  NA NA NA NA NA NA NA NA NA NA ...

Analogously to the previous listings, one can list further objects simply by calling the respective functions.

    listOMLDataSetQualities()
    listOMLEstimationProcedures()
    listOMLEvaluationMeasures()
    listOMLTaskTypes()

Users can download data sets, tasks, flows and runs from the OpenML server. The package provides special representations for each object, which will be discussed here.
To directly download a data set, e.g., when you want to run a few preliminary experiments, one can use the function getOMLDataSet. The function accepts a data set ID as input and returns the corresponding OMLDataSet:

    iris.data = getOMLDataSet(data.id = 61L)  # the iris data set has the data set ID 61

The following call returns an OpenML task object for a supervised classification task on the iris data:

    task = getOMLTask(task.id = 59L)
    task
    ## 
    ## OpenML Task 59 :: (Data ID = 61)
    ##   Task Type            : Supervised Classification
    ##   Data Set             : iris :: (Version = 1, OpenML ID = 61)
    ##   Target Feature(s)    : class
    ##   Tags                 : basic, study_1, study_41, study_50, study_7, study_89, testsuite, under100k, ...
    ##   Estimation Procedure : Stratified crossvalidation (1 x 10 folds)
    ##   Evaluation Measure(s): predictive_accuracy

The corresponding "OMLDataSet" object can be accessed by

    task$input$data.set
    ## 
    ## Data Set 'iris' :: (Version = 1, OpenML ID = 61)
    ##   Collection Date         : 1936
    ##   Creator(s)              : R.A. Fisher
    ##   Default Target Attribute: class

and the type of the task can be shown with the next line

    task$task.type
    ## [1] "Supervised Classification"

Also, it is possible to extract the data set itself via

    iris.data = task$input$data.set$data
    head(iris.data)
    ##   sepallength sepalwidth petallength petalwidth       class
    ## 0         5.1        3.5         1.4        0.2 Iris-setosa
    ## 1         4.9        3.0         1.4        0.2 Iris-setosa
    ## 2         4.7        3.2         1.3        0.2 Iris-setosa
    ## 3         4.6        3.1         1.5        0.2 Iris-setosa
    ## 4         5.0        3.6         1.4        0.2 Iris-setosa
    ## 5         5.4        3.9         1.7        0.4 Iris-setosa

Aside from tasks and data sets, one can also download flows by calling getOMLFlow with the specific flow.id

    flow = getOMLFlow(flow.id = 2700L)
    flow
    ## 
    ## Flow 'classif.randomForest' :: (Version = 47, Flow ID = 2700)
    ##   External Version         : R_3.1.2-734b029d
    ##   Dependencies             : mlr_2.9, randomForest_4.6.12
    ##   Number of Flow Parameters: 16
    ##   Number of Flow Components: 0

To download the results of one run, including all server- and user-computed metrics, you have to specify the corresponding run ID. For all runs that are related to the task, the corresponding ID can be extracted from the runs object, which was created in the previous section. Here we use a run of task 59, which has the run.id 524027. Single OpenML runs can be downloaded with the function getOMLRun:

    task.list = listOMLRuns(task.id = 59L)
    task.list[281:285, ]
    ##      run.id task.id setup.id flow.id uploader error.message
    ## 281 7244063      59  5275959    6952        1          <NA>
    ## 282 7245683      59  5277579    6952        1          <NA>
    ## 283 7245684      59  5277580    6952        1          <NA>
    ## 284 7245686      59  5277582    6952        1          <NA>
    ## 285 7245687      59  5277583    6952        1          <NA>

    run = getOMLRun(run.id = 524027L)
    run
    ## 
    ## OpenML Run 524027 :: (Task ID = 59, Flow ID = 2393)
    ##   User ID  : 970
    ##   Learner  : classif.randomForest(43)
    ##   Task type: Supervised Classification

Each OMLRun object is a list object, which stores additional information on the run. For instance, the flow of the previously downloaded run has some non-default settings for hyperparameters, which can be obtained by:

    run$parameter.setting  # retrieve the list of parameter settings
    ## $seed
    ##  (parameter of component 2393) seed = 1
    ## 
    ## $kind
    ##  (parameter of component 2393) kind = Mersenne-Twister
    ## 
    ## $normal.kind
    ##  (parameter of component 2393) normal.kind = Inversion

If the underlying flow has hyperparameters that differ from the default values of the corresponding learner, they are shown as well; otherwise the default hyperparameters are used (but not explicitly listed).
All the data that served as input for the run, including data set IDs and the URL to the data, is stored in input.data:

    run$input.data
    ## 
    ## ** Data Sets **
    ##   data.id name                                                          url
    ## 1      61 iris https://www.openml.org/data/download/61/dataset_61_iris.arff
    ## 
    ## ** Files **
    ## data frame with 0 columns and 0 rows
    ## 
    ## ** Evaluations **
    ## data frame with 0 columns and 0 rows

Predictions made by an uploaded run are stored within the predictions element and can be retrieved via

    head(run$predictions, 10)
    ##    repeat fold row_id      prediction           truth confidence.Iris-setosa confidence.Iris-versicolor
    ## 1       0    0     43     Iris-setosa     Iris-setosa                      1                          0
    ## 2       0    0     14     Iris-setosa     Iris-setosa                      1                          0
    ## 3       0    0     37     Iris-setosa     Iris-setosa                      1                          0
    ## 4       0    0     23     Iris-setosa     Iris-setosa                      1                          0
    ## 5       0    0     10     Iris-setosa     Iris-setosa                      1                          0
    ## 6       0    0     99 Iris-versicolor Iris-versicolor                      0                          1
    ## 7       0    0     87 Iris-versicolor Iris-versicolor                      0                          1
    ## 8       0    0     97 Iris-versicolor Iris-versicolor                      0                          1
    ## 9       0    0     62 Iris-versicolor Iris-versicolor                      0                          1
    ## 10      0    0     92 Iris-versicolor Iris-versicolor                      0                          1
    ##    confidence.Iris-virginica
    ## 1                          0
    ## 2                          0
    ## 3                          0
    ## 4                          0
    ## 5                          0
    ## 6                          0
    ## 7                          0
    ## 8                          0
    ## 9                          0
    ## 10                         0

The output above shows predictions, ground truth information about classes and task-specific information, e.g., about the confidence of a classifier (for every observation) or in which fold a data point has been placed.
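Since the predictions are an ordinary data.frame, simple summaries can be computed from them directly. The sketch below is an illustration only: it rebuilds the ten rows shown above as a mock object (preds, an assumption so that the code runs without a server connection) and derives the accuracy per fold and a confusion matrix; with a downloaded run, run$predictions could be used in place of preds.

```r
# Offline copy of the ten prediction rows printed above (fold, row_id,
# prediction and truth only; the confidence columns are omitted here).
preds = data.frame(
  fold = rep(0L, 10),
  row_id = c(43L, 14L, 37L, 23L, 10L, 99L, 87L, 97L, 62L, 92L),
  prediction = c(rep("Iris-setosa", 5), rep("Iris-versicolor", 5)),
  truth = c(rep("Iris-setosa", 5), rep("Iris-versicolor", 5)),
  stringsAsFactors = FALSE
)

# Share of correct predictions within each cross-validation fold
acc.per.fold = tapply(preds$prediction == preds$truth, preds$fold, mean)
acc.per.fold

# Confusion matrix: rows are the true classes, columns the predicted classes
conf.mat = table(truth = preds$truth, prediction = preds$prediction)
conf.mat
```

These ten rows all come from fold 0 and are all classified correctly, so they say nothing about the overall performance of the run; the server-side evaluation measures are the better source for that.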
The modularized structure of OpenML makes it possible to apply the implementation of an algorithm to a specific task, and there are multiple ways to do this.
If one is working with mlr, one can specify an RLearner object and use the function runTaskMlr to create the desired "OMLMlrRun" object. The task is created the same way as in the previous sections:

    task = getOMLTask(task.id = 59L)
    library("mlr")
    lrn = makeLearner("classif.rpart")
    run.mlr = runTaskMlr(task, lrn)
    run.mlr
    ## $run
    ## 
    ## OpenML Run NA :: (Task ID = 59, Flow ID = NA)
    ## 
    ## $bmr
    ##   task.id    learner.id acc.test.join timetrain.test.sum timepredict.test.sum
    ## 1    iris classif.rpart          0.94               0.01                 0.03
    ## 
    ## $flow
    ## 
    ## Flow 'mlr.classif.rpart' :: (Version = NA, Flow ID = NA)
    ##   External Version         : R_4.2.1-v2.4b8be4e0
    ##   Dependencies             : R_4.2.1, OpenML_1.12, mlr_2.19.0, rpart_4.1.16
    ##   Number of Flow Parameters: 14
    ##   Number of Flow Components: 0
    ## 
    ## attr(,"class")
    ## [1] "OMLMlrRun"

Note that locally created runs don’t have a run ID or flow ID yet. These are assigned by the OpenML server after uploading the run.
If you are not using mlr, you will have to invest some more time and effort to get things done since this is not supported yet. So, unless you have good reasons to do otherwise, we strongly encourage you to use mlr. If the algorithm you want to use is not integrated in mlr yet, you can integrate it yourself (see the tutorial) or open an issue on the mlr GitHub repository and hope someone else will do it for you.
The following section gives an overview of how one can contribute building blocks (i.e., data sets, flows and runs) to the OpenML server.
A data set contains information that can be stored on OpenML and used by OpenML tasks and runs. This example shows how a very simple data set can be taken from R, converted to an OpenML data set and afterwards uploaded to the server. The corresponding workflow consists of the following three steps:
- makeOMLDataSetDescription: create the description object of an OpenML data set
- makeOMLDataSet: convert the data set into an OpenML data set
- uploadOMLDataSet: upload the data set to the server

    data("airquality")
    dsc = "Daily air quality measurements in New York, May to September 1973. This data is taken from R."
    cit = "Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth."

    ## (1) Create the description object
    desc = makeOMLDataSetDescription(name = "airquality",
      description = dsc,
      creator = "New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data)",
      collection.date = "May 1, 1973 to September 30, 1973",
      language = "English",
      licence = "GPL-2",
      url = "https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html",
      default.target.attribute = "Ozone",
      citation = cit,
      tags = "R")

    ## (2) Create the OpenML data set
    air.data = makeOMLDataSet(desc = desc,
      data = airquality,
      colnames.old = colnames(airquality),
      colnames.new = colnames(airquality),
      target.features = "Ozone")

    ## (3) Upload the OpenML data set to the server
    ## Because this is a simple data set which is generally already available in R
    ## please do not actually upload it to the server!
    ## The code would be:
    #dataset.id = uploadOMLDataSet(air.data)
    #dataset.id

Alternatively, you can enter data directly on the OpenML website.
A flow is an implementation of a single algorithm or a script. Each mlr learner can be considered an implementation of a flow, which can be uploaded to the server with the function uploadOMLFlow. If the flow has already been uploaded to the server (either by you or someone else), one receives a message that the flow already exists and the flow.id is returned from the function. Otherwise, the flow will be uploaded and receive its own flow.id, which is then returned.

    library("mlr")
    lrn = makeLearner("classif.randomForest")
    flow.id = uploadOMLFlow(lrn)
    flow.id

In addition to uploading data sets or flows, one can also upload runs (which have to be created beforehand, e.g., using mlr):

    ## choose 2 flows (i.e., mlr-learners)
    learners = list(
      makeLearner("classif.kknn"),
      makeLearner("classif.randomForest")
    )

    ## pick 3 random tasks
    task.ids = c(57, 59, 2382)

    for (lrn in learners) {
      for (id in task.ids) {
        task = getOMLTask(id)
        res = runTaskMlr(task, lrn)$run
        run.id = uploadOMLRun(res)  # upload results
      }
    }

Before your run is uploaded to the server, uploadOMLRun checks whether the flow that created this run is already available on the server. If the flow does not exist on the server, it will (automatically) be uploaded as well.
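When looping over several tasks, a single failed download or upload would normally abort the whole loop. One possible way around this, sketched below with base R only, is to wrap each iteration in tryCatch and collect the outcomes in a data.frame. The function processTask is a hypothetical stand-in for the getOMLTask / runTaskMlr / uploadOMLRun sequence (it simulates a failure for one task so that the example runs offline), and the returned run IDs are made up for illustration.

```r
# Hypothetical stand-in for: task = getOMLTask(id);
# res = runTaskMlr(task, lrn)$run; uploadOMLRun(res).
# It fails on purpose for task 59 and otherwise returns a fake run ID.
processTask = function(id) {
  if (id == 59) stop("simulated server error")
  id * 10
}

task.ids = c(57, 59, 2382)

# Run every task, recording either the run ID or the error message
results = lapply(task.ids, function(id) {
  tryCatch(
    list(task.id = id, run.id = processTask(id), error = NA_character_),
    error = function(e) {
      list(task.id = id, run.id = NA_real_, error = conditionMessage(e))
    }
  )
})

# One row per task; failed tasks have run.id = NA and a message in 'error'
summary.df = do.call(rbind, lapply(results, as.data.frame))
summary.df
```

The rows with a missing run.id show which tasks need to be retried, while the remaining tasks are processed regardless of the failure.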
By now, you should have an idea of how to use our package. However, as there is always room for improvement, we are more than happy to receive your feedback. So, in case you have any, please open an issue in the issue tracker of our GitHub repository.