Analysis and tools for benchmarking in mlr3.
Do you have a large benchmark experiment with many tasks, learners, and measures, and don't know where to begin with analysis? Do you want to perform a complete quantitative analysis of benchmark results to determine which learner truly is the 'best'? Do you want to visualise complex results for benchmark experiments in one line of code?
Then mlr3benchmark is the answer, or at least it will be once it's finished maturing.
mlr3benchmark enables fast and efficient analysis of benchmark experiments in just a few lines of code. As long as you can coerce your results into a format fitting our classes (which have very few requirements), then you can perform your benchmark analysis with mlr3benchmark.
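As a hypothetical sketch of that coercion (the constructor call and column conventions below are assumptions; consult the `BenchmarkAggr` help page for the exact interface), externally produced aggregated scores might be wrapped like so:

```r
# Hypothetical sketch: coercing externally produced, already-aggregated
# scores into a BenchmarkAggr. The constructor call below is an
# assumption -- see ?BenchmarkAggr for the exact interface.
library(mlr3benchmark)

# one row per (task, learner) pair, plus one column per measure
df <- data.frame(
  task_id    = rep(c("taskA", "taskB"), each = 2),
  learner_id = rep(c("lrn1", "lrn2"), times = 2),
  acc        = c(0.80, 0.85, 0.70, 0.75)
)

ba <- BenchmarkAggr$new(df)  # assumed: task/learner columns found by name
```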
Install the latest release from CRAN:

```r
install.packages("mlr3benchmark")
```

Install the development version from GitHub:

```r
remotes::install_github("mlr-org/mlr3benchmark")
```
Currently mlr3benchmark only supports analysis of multiple learners over multiple tasks. The currently implemented features are best demonstrated by example!
First we run an mlr3 benchmark experiment:
```r
library(mlr3)
library(mlr3learners)
library(ggplot2)

set.seed(1)
task = tsks(c("iris", "sonar", "wine", "zoo"))
learns = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"))
bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 3)))
```
Now we create a `BenchmarkAggr` object for our analysis. These objects store measure results after being aggregated over all resamplings:
```r
# these measures are the same but we'll continue for the example
ba = as_benchmark_aggr(bm, measures = msrs(c("classif.acc", "classif.ce")))
ba
```
```
## <BenchmarkAggr> of 12 rows with 4 tasks, 3 learners and 2 measures
##     task_id  learner_id       acc         ce
##      <fctr>      <fctr>     <num>      <num>
##  1:    iris featureless 0.2800000 0.72000000
##  2:    iris       rpart 0.9466667 0.05333333
##  3:    iris     xgboost 0.9600000 0.04000000
##  4:   sonar featureless 0.5334023 0.46659765
##  5:   sonar       rpart 0.6537612 0.34623879
##  6:   sonar     xgboost 0.6394755 0.36052450
##  7:    wine featureless 0.3990584 0.60094162
##  8:    wine       rpart 0.8652542 0.13474576
##  9:    wine     xgboost 0.9048023 0.09519774
## 10:     zoo featureless 0.4058229 0.59417706
## 11:     zoo       rpart 0.8309566 0.16904337
## 12:     zoo     xgboost 0.9099822 0.09001783
```

Now we can begin our analysis! In mlr3benchmark, analysis of multiple learners over multiple independent tasks follows the guidelines of Demšar (2006). So we begin by checking if the global Friedman test is significant: is there a significant difference in the rankings of the learners over all the tasks?
```r
ba$friedman_test()
```
```
##      X2 df    p.value p.signif
## acc 6.5  2 0.03877421        *
## ce  6.5  2 0.03877421        *
```

Both measures are significant, so we can proceed with the post-hoc tests, comparing each pair of learners with post-hoc Friedman-Nemenyi tests:
```r
ba$friedman_posthoc(meas = "acc")
```
```
## Pairwise comparisons using Nemenyi-Wilcoxon-Wilcox all-pairs test
## for a two-way balanced complete block design
## data: acc and learner_id and task_id
##         featureless rpart
## rpart   0.181       -
## xgboost 0.036       0.759
##
## P value adjustment method: single-step
```

The results tell us that xgboost is significantly different from the featureless model, but all other comparisons are non-significant. This doesn't tell us *which* of xgboost and featureless is better, though. The most detailed information is given in a critical difference diagram; note we include `minimize = FALSE` as accuracy should be maximised:
```r
autoplot(ba, type = "cd", meas = "acc", minimize = FALSE)
```
We read the diagram from left to right: learners to the left have the highest rank and are the best performing, with rank decreasing to the right. The thick horizontal lines connect learners that are *not* significantly different in ranked performance, so this tells us:
- xgboost is significantly better than featureless
- xgboost is not significantly better than rpart
- rpart is not significantly better than featureless
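The mean ranks behind the diagram can be recomputed by hand from the aggregated table above. The following sketch (using data.table; the rounded accuracies are copied from the earlier output) ranks learners within each task, reproduces the global Friedman statistic, and compares the rank gaps to the Nemenyi critical difference of Demšar (2006):

```r
# Recompute per-task ranks from the aggregated accuracies shown above
# (values rounded to four decimal places; rank 1 = highest accuracy).
library(data.table)

dt <- data.table(
  task_id    = rep(c("iris", "sonar", "wine", "zoo"), each = 3),
  learner_id = rep(c("featureless", "rpart", "xgboost"), times = 4),
  acc        = c(0.2800, 0.9467, 0.9600,
                 0.5334, 0.6538, 0.6395,
                 0.3991, 0.8653, 0.9048,
                 0.4058, 0.8310, 0.9100)
)

dt[, rank := frank(-acc), by = task_id]
dt[, .(mean_rank = mean(rank)), by = learner_id]
## featureless 3.00, rpart 1.75, xgboost 1.25

# The base-R Friedman test on the same data reproduces X2 = 6.5 above:
friedman.test(acc ~ learner_id | task_id, data = dt)

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
k <- 3; N <- 4
q_alpha <- 2.343  # critical value for k = 3, alpha = 0.05 (Demsar 2006)
cd <- q_alpha * sqrt(k * (k + 1) / (6 * N))  # ~1.66
# Only featureless vs xgboost (rank gap 3.00 - 1.25 = 1.75) exceeds the
# CD; featureless vs rpart (gap 1.25) does not, matching the post-hoc table.
```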
Now we visualise two much simpler plots which display similar information: the first shows the mean and standard error of the results across all tasks, the second is a boxplot across all tasks:
```r
autoplot(ba, meas = "acc")
autoplot(ba, type = "box", meas = "acc")
```
We conclude that xgboost is significantly better than the baseline but not significantly better than the decision tree; the decision tree in turn is not significantly better than the baseline. So we recommend xgboost for now.
The analysis is complete!
mlr3benchmark is in its early stages and the interface is still maturing; near-future updates will include:
- Extending `BenchmarkAggr` to non-independent tasks
- Extending `BenchmarkAggr` to single tasks
- Adding `BenchmarkScore` for non-aggregated measures, e.g. observation-level scores
- Bayesian methods for analysis
mlr3benchmark is a free and open source software project that encourages participation and feedback. If you have any issues, questions, suggestions or feedback, please do not hesitate to open an "issue" about it on the GitHub page! In case of problems/bugs, it is often helpful if you provide a "minimum working example" that showcases the behaviour (but don't worry about this if the bug is obvious).