pmlb is an R interface to the Penn Machine Learning Benchmarks data repository
pmlbr is an R interface to the Penn Machine Learning Benchmarks (PMLB) data repository, a large collection of curated benchmark datasets for evaluating and comparing supervised machine learning algorithms. These datasets cover a broad range of applications including binary/multi-class classification and regression problems as well as combinations of categorical, ordinal, and continuous features.
This repository was originally forked from makeyourownmaker/pmlblite. We thank the author of pmlblite for releasing the source code under the GPL-2 license so that others could reuse the software.
This package works for any recent version of R.
You can install the released version of pmlbr from CRAN with:

    install.packages("pmlbr")

Or the development version from GitHub with remotes:

    # install.packages('remotes') # uncomment to install remotes
    library(remotes)
    remotes::install_github("EpistasisLab/pmlbr")
The core function of this package is `fetch_data`, which downloads data from the PMLB repository. For example:

    library(pmlbr)

    # Download features and labels for penguins dataset in single data frame
    penguins <- fetch_data("penguins")

    ## Download successful.

    str(penguins)

    ## 'data.frame':    333 obs. of  8 variables:
    ##  $ island           : int  2 2 2 2 2 2 2 2 2 2 ...
    ##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
    ##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
    ##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
    ##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
    ##  $ sex              : int  1 0 0 0 1 0 1 0 1 1 ...
    ##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
    ##  $ target           : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
    ##   ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

    # Download features and labels for penguins dataset in separate data structures
    penguins <- fetch_data("penguins", return_X_y = TRUE)

    ## Download successful.

    head(penguins$x) # data frame

    ##   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
    ## 1      2           39.1          18.7               181        3750   1 2007
    ## 2      2           39.5          17.4               186        3800   0 2007
    ## 3      2           40.3          18.0               195        3250   0 2007
    ## 4      2             NA            NA                NA          NA  NA 2007
    ## 5      2           36.7          19.3               193        3450   0 2007
    ## 6      2           39.3          20.6               190        3650   1 2007

    head(penguins$y) # vector

    ## [1] 0 0 0 0 0 0

Let's check other available datasets and their summary statistics:

    # Dataset names
    sample(classification_datasets(), 9)

    ## [1] "heart_disease_hungarian"       "fars"
    ## [3] "allrep"                        "_deprecated_colic"
    ## [5] "Hill_Valley_without_noise"     "_deprecated_german"
    ## [7] "sleep"                         "_deprecated_cleveland_nominal"
    ## [9] "analcatdata_happiness"

    sample(regression_datasets(), 9)

    ## [1] "527_analcatdata_election2000" "1089_USCrime"
    ## [3] "feynman_III_8_54"             "225_puma8NH"
    ## [5] "657_fri_c2_250_10"            "strogatz_glider2"
    ## [7] "611_fri_c3_100_5"             "586_fri_c3_1000_25"
    ## [9] "650_fri_c0_500_50"

    # Dataset summaries
    sum_stats <- summary_stats()
    head(sum_stats)

    ##                dataset n_instances n_features n_binary_features
    ## 1             1027_ESL         488          4                 0
    ## 2             1028_SWD        1000         10                 1
    ## 3             1029_LEV        1000          4                 0
    ## 4             1030_ERA        1000          4                 0
    ## 5         1089_USCrime          47         13                 1
    ## 6 1096_FacultySalaries          50          4                 1
    ##   n_categorical_features n_continuous_features endpoint_type n_classes
    ## 1                      4                     0    continuous         9
    ## 2                      9                     0    continuous         4
    ## 3                      4                     0    continuous         5
    ## 4                      0                     4    continuous         9
    ## 5                      0                    12    continuous        42
    ## 6                      0                     3    continuous        39
    ##     imbalance       task
    ## 1 0.099363200 regression
    ## 2 0.108290667 regression
    ## 3 0.111245000 regression
    ## 4 0.031251250 regression
    ## 5 0.002970111 regression
    ## 6 0.004063158 regression

Selecting a subset of datasets that satisfy certain conditions is straightforward with `dplyr`. For example, if we need datasets with fewer than 100 observations for a classification task:

    library(dplyr)
    sum_stats %>%
      filter(n_instances < 100, task == "classification") %>%
      pull(dataset)

    ##  [1] "analcatdata_aids"           "analcatdata_asbestos"
    ##  [3] "analcatdata_bankruptcy"     "analcatdata_cyyoung8092"
    ##  [5] "analcatdata_cyyoung9302"    "analcatdata_fraud"
    ##  [7] "analcatdata_happiness"      "analcatdata_japansolvent"
    ##  [9] "confidence"                 "labor"
    ## [11] "lupus"                      "parity5"
    ## [13] "postoperative_patient_data"

All data sets are stored in a common format:
- First row is the column names
- Each following row corresponds to an individual observation
- The target column is named `target`
- All columns are tab (`\t`) separated
- All files are compressed with `gzip` to conserve space
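
To make the layout above concrete, here is a minimal sketch in base R that writes a small file in this format and reads it back. It does not use pmlbr or touch the PMLB repository. The temporary file and the two-feature toy table are assumptions made for illustration (the values are copied from the first three penguins rows shown earlier), and this is not necessarily how `fetch_data` reads files internally.

```r
# Toy data in the common layout: header row, one observation per row,
# label column named `target`. Values copied from the penguins output above.
toy <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm  = c(18.7, 17.4, 18.0),
  target         = c(0L, 0L, 0L)
)

# Write it tab-separated and gzip-compressed to a temporary (hypothetical) file.
path <- tempfile(fileext = ".tsv.gz")
con <- gzfile(path, open = "wt")
write.table(toy, con, sep = "\t", row.names = FALSE, quote = FALSE)
close(con)

# Read it back: read.delim() expects a header row and tab separators,
# and gzfile() decompresses on the fly.
dat <- read.delim(gzfile(path))

# Split features and labels by the `target` column name.
x <- dat[, setdiff(names(dat), "target"), drop = FALSE]
y <- dat$target
```

Because the label column always has the same name, the split into `x` and `y` needs no per-dataset configuration.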
This R library includes summaries of the classification and regression data sets but does not store any of the PMLB data sets. The data sets can be downloaded using the `fetch_data` function, which is similar to the corresponding PMLB Python function.
Further info:

    ?fetch_data
    ?summary_stats
If you use PMLB in a scientific publication, please consider citing one of the following papers:
Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore. PMLB v1.0: an open source dataset collection for benchmarking machine learning methods. arXiv preprint arXiv:2012.00058 (2020).
Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, page 36.
- Add tests
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Integration of other data repositories is particularly welcome.
- Penn Machine Learning Benchmarks
- OpenML: approximately 2,500 datasets, available for download using an R module
- UC Irvine Machine Learning Repository
- mlbench: Machine Learning Benchmark Problems
- Rdatasets: An archive of datasets distributed with R
- datasets.load: Visual interface for loading datasets in RStudio from all installed (unloaded) packages
- Stack Overflow: How do I get a list of built-in data sets in R?
