Analysis and tools for benchmarking in mlr3.
Do you have a large benchmark experiment with many tasks, learners, and measures, and don't know where to begin with analysis? Do you want to perform a complete quantitative analysis of benchmark results to determine which learner truly is the 'best'? Do you want to visualise complex results for benchmark experiments in one line of code?
Then mlr3benchmark is the answer, or at least it will be once it's finished maturing.
mlr3benchmark enables fast and efficient analysis of benchmark experiments in just a few lines of code. As long as you can coerce your results into a format fitting our classes (which have very few requirements), then you can perform your benchmark analysis with mlr3benchmark.
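As a hypothetical sketch of that coercion (the constructor call and column conventions below are assumptions; consult the `BenchmarkAggr` help page for the exact interface), externally produced aggregated scores might be wrapped like so:

```r
# Hypothetical sketch: coercing externally produced, already-aggregated
# scores into a BenchmarkAggr. The constructor call below is an
# assumption -- see ?BenchmarkAggr for the exact interface.
library(mlr3benchmark)

# one row per (task, learner) pair, plus one column per measure
df <- data.frame(
  task_id    = rep(c("taskA", "taskB"), each = 2),
  learner_id = rep(c("lrn1", "lrn2"), times = 2),
  acc        = c(0.80, 0.85, 0.70, 0.75)
)

ba <- BenchmarkAggr$new(df)  # assumed: task/learner columns found by name
```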
Install the latest release from CRAN:

```r
install.packages("mlr3benchmark")
```

Install the development version from GitHub:

```r
remotes::install_github("mlr-org/mlr3benchmark")
```
Currently mlr3benchmark only supports analysis of multiple learners over multiple tasks. The currently implemented features are best demonstrated by example!
First we run an mlr3 benchmark experiment:
```r
library(mlr3)
library(mlr3learners)
library(ggplot2)

set.seed(1)
task = tsks(c("iris", "sonar", "wine", "zoo"))
learns = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"))
bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 3)))
```
Now we create a `BenchmarkAggr` object for our analysis. These objects store measure results after being aggregated over all resamplings:
```r
# these measures are the same but we'll continue for the example
ba = as_benchmark_aggr(bm, measures = msrs(c("classif.acc", "classif.ce")))
ba
```
```
## <BenchmarkAggr> of 12 rows with 4 tasks, 3 learners and 2 measures
##     task_id  learner_id       acc         ce
##      <fctr>      <fctr>     <num>      <num>
##  1:    iris featureless 0.2800000 0.72000000
##  2:    iris       rpart 0.9466667 0.05333333
##  3:    iris     xgboost 0.9600000 0.04000000
##  4:   sonar featureless 0.5334023 0.46659765
##  5:   sonar       rpart 0.6537612 0.34623879
##  6:   sonar     xgboost 0.6394755 0.36052450
##  7:    wine featureless 0.3990584 0.60094162
##  8:    wine       rpart 0.8652542 0.13474576
##  9:    wine     xgboost 0.9048023 0.09519774
## 10:     zoo featureless 0.4058229 0.59417706
## 11:     zoo       rpart 0.8309566 0.16904337
## 12:     zoo     xgboost 0.9099822 0.09001783
```

Now we can begin our analysis! In mlr3benchmark, analysis of multiple learners over multiple independent tasks follows the guidelines of Demšar (2006). So we begin by checking if the global Friedman test is significant: is there a significant difference in the rankings of the learners over all the tasks?
```r
ba$friedman_test()
```
```
##      X2 df    p.value p.signif
## acc 6.5  2 0.03877421        *
## ce  6.5  2 0.03877421        *
```

Both measures are significant, so we can proceed with the post-hoc tests, comparing each pair of learners with post-hoc Friedman-Nemenyi tests:
```r
ba$friedman_posthoc(meas = "acc")
```
```
## Pairwise comparisons using Nemenyi-Wilcoxon-Wilcox all-pairs test
## for a two-way balanced complete block design
## data: acc and learner_id and task_id
##         featureless rpart
## rpart   0.181       -
## xgboost 0.036       0.759
##
## P value adjustment method: single-step
```

The results tell us that xgboost is significantly different from the featureless model, but all other comparisons are non-significant. This doesn't tell us *which* of xgboost and featureless is better, though. The most detailed information is given in a critical difference diagram; note we include `minimize = FALSE` as accuracy should be maximised:
```r
autoplot(ba, type = "cd", meas = "acc", minimize = FALSE)
```
We read the diagram from left to right: learners to the left have the highest rank and are the best performing, with rank decreasing to the right. The thick horizontal lines connect learners that are *not* significantly different in ranked performance, so this tells us:
- xgboost is significantly better than featureless
- xgboost is not significantly better than rpart
- rpart is not significantly better than featureless
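The mean ranks behind the diagram can be recomputed by hand from the aggregated table above. The following sketch (using data.table; the rounded accuracies are copied from the earlier output) ranks learners within each task, reproduces the global Friedman statistic, and compares the rank gaps to the Nemenyi critical difference of Demšar (2006):

```r
# Recompute per-task ranks from the aggregated accuracies shown above
# (values rounded to four decimal places; rank 1 = highest accuracy).
library(data.table)

dt <- data.table(
  task_id    = rep(c("iris", "sonar", "wine", "zoo"), each = 3),
  learner_id = rep(c("featureless", "rpart", "xgboost"), times = 4),
  acc        = c(0.2800, 0.9467, 0.9600,
                 0.5334, 0.6538, 0.6395,
                 0.3991, 0.8653, 0.9048,
                 0.4058, 0.8310, 0.9100)
)

dt[, rank := frank(-acc), by = task_id]
dt[, .(mean_rank = mean(rank)), by = learner_id]
## featureless 3.00, rpart 1.75, xgboost 1.25

# The base-R Friedman test on the same data reproduces X2 = 6.5 above:
friedman.test(acc ~ learner_id | task_id, data = dt)

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
k <- 3; N <- 4
q_alpha <- 2.343  # critical value for k = 3, alpha = 0.05 (Demsar 2006)
cd <- q_alpha * sqrt(k * (k + 1) / (6 * N))  # ~1.66
# Only featureless vs xgboost (rank gap 3.00 - 1.25 = 1.75) exceeds the
# CD; featureless vs rpart (gap 1.25) does not, matching the post-hoc table.
```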
Now we visualise two much simpler plots which display similar information: the first shows the mean and standard error of the results across all tasks, the second is a boxplot across all tasks:
```r
autoplot(ba, meas = "acc")
autoplot(ba, type = "box", meas = "acc")
```
We conclude that xgboost is significantly better than the baseline but not significantly better than the decision tree; the decision tree in turn is not significantly better than the baseline. So we recommend xgboost for now.
The analysis is complete!
mlr3benchmark is in its early stages and the interface is still maturing; near-future updates will include:
- Extending `BenchmarkAggr` to non-independent tasks
- Extending `BenchmarkAggr` to single tasks
- Adding `BenchmarkScore` for non-aggregated measures, e.g. observation-level scores
- Bayesian methods for analysis
mlr3benchmark is a free and open source software project that encourages participation and feedback. If you have any issues, questions, suggestions or feedback, please do not hesitate to open an "issue" about it on the GitHub page! In case of problems/bugs, it is often helpful if you provide a "minimum working example" that showcases the behaviour (but don't worry about this if the bug is obvious).