EpistasisLab/pmlbr

Forked from trangdata/pmlblite

pmlbr is an R interface to the Penn Machine Learning Benchmarks data repository

epistasislab.github.io/pmlb/using-r.html

pmlbr

pmlbr is an R interface to the Penn Machine Learning Benchmarks (PMLB) data repository, a large collection of curated benchmark datasets for evaluating and comparing supervised machine learning algorithms. These datasets cover a broad range of applications including binary/multi-class classification and regression problems as well as combinations of categorical, ordinal, and continuous features.

This repository was originally forked from makeyourownmaker/pmlblite. We thank pmlblite's author for releasing the source code under the GPL-2 license so that others could reuse the software.

Install

This package works for any recent version of R.

You can install the released version of pmlbr from CRAN with:

install.packages("pmlbr")

Or the development version from GitHub with remotes:

# install.packages('remotes') # uncomment to install remotes
library(remotes)
remotes::install_github("EpistasisLab/pmlbr")

Usage

The core function of this package is fetch_data, which downloads data from the PMLB repository. For example:

library(pmlbr)
# Download features and labels for penguins dataset in single data frame
penguins <- fetch_data("penguins")

## Download successful.

str(penguins)

## 'data.frame':    333 obs. of  8 variables:
##  $ island           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ target           : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
##   ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

# Download features and labels for penguins dataset in separate data structures
penguins <- fetch_data("penguins", return_X_y = TRUE)

## Download successful.

head(penguins$x) # data frame

##   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## 1      2           39.1          18.7               181        3750   1 2007
## 2      2           39.5          17.4               186        3800   0 2007
## 3      2           40.3          18.0               195        3250   0 2007
## 4      2             NA            NA                NA          NA  NA 2007
## 5      2           36.7          19.3               193        3450   0 2007
## 6      2           39.3          20.6               190        3650   1 2007

head(penguins$y) # vector

## [1] 0 0 0 0 0 0
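
To make the two return shapes concrete, here is a minimal offline sketch of how a single data frame with a target column relates to the separate x and y pieces. It uses a small mock data frame (a few columns and rows copied from the printed output above) rather than a real download, and split_x_y is a hypothetical helper written for illustration; it is not claimed to be how fetch_data works internally.

```r
# Mock data frame shaped like the single-data-frame result above.
# Only a subset of columns is included; values come from the printed output.
penguins_df <- data.frame(
  island         = c(2L, 2L, 2L),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm  = c(18.7, 17.4, 18.0),
  target         = c(0L, 0L, 0L)
)

# Hypothetical helper: split a data frame into features and labels by name.
split_x_y <- function(df, target_col = "target") {
  stopifnot(target_col %in% names(df))
  list(
    x = df[, names(df) != target_col, drop = FALSE],  # features, data frame
    y = df[[target_col]]                              # labels, plain vector
  )
}

parts <- split_x_y(penguins_df)
str(parts$x)  # data frame without the target column
parts$y       # integer vector of labels
```

The `drop = FALSE` argument keeps x a data frame even when only one feature column remains.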

Let’s check other available datasets and their summary statistics:

# Dataset names
sample(classification_datasets(), 9)

## [1] "heart_disease_hungarian"       "fars"
## [3] "allrep"                        "_deprecated_colic"
## [5] "Hill_Valley_without_noise"     "_deprecated_german"
## [7] "sleep"                         "_deprecated_cleveland_nominal"
## [9] "analcatdata_happiness"

sample(regression_datasets(), 9)

## [1] "527_analcatdata_election2000" "1089_USCrime"
## [3] "feynman_III_8_54"             "225_puma8NH"
## [5] "657_fri_c2_250_10"            "strogatz_glider2"
## [7] "611_fri_c3_100_5"             "586_fri_c3_1000_25"
## [9] "650_fri_c0_500_50"

# Dataset summaries
sum_stats <- summary_stats()
head(sum_stats)

##                dataset n_instances n_features n_binary_features
## 1             1027_ESL         488          4                 0
## 2             1028_SWD        1000         10                 1
## 3             1029_LEV        1000          4                 0
## 4             1030_ERA        1000          4                 0
## 5         1089_USCrime          47         13                 1
## 6 1096_FacultySalaries          50          4                 1
##   n_categorical_features n_continuous_features endpoint_type n_classes
## 1                      4                     0    continuous         9
## 2                      9                     0    continuous         4
## 3                      4                     0    continuous         5
## 4                      0                     4    continuous         9
## 5                      0                    12    continuous        42
## 6                      0                     3    continuous        39
##     imbalance       task
## 1 0.099363200 regression
## 2 0.108290667 regression
## 3 0.111245000 regression
## 4 0.031251250 regression
## 5 0.002970111 regression
## 6 0.004063158 regression

Selecting a subset of datasets that satisfy certain conditions is straightforward with dplyr. For example, if we need datasets with fewer than 100 observations for a classification task:

library(dplyr)
sum_stats %>%
  filter(n_instances < 100, task == "classification") %>%
  pull(dataset)

##  [1] "analcatdata_aids"           "analcatdata_asbestos"
##  [3] "analcatdata_bankruptcy"     "analcatdata_cyyoung8092"
##  [5] "analcatdata_cyyoung9302"    "analcatdata_fraud"
##  [7] "analcatdata_happiness"      "analcatdata_japansolvent"
##  [9] "confidence"                 "labor"
## [11] "lupus"                      "parity5"
## [13] "postoperative_patient_data"
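
If dplyr is not available, the same kind of selection can be sketched with base R logical indexing. The example below runs offline on a small stand-in data frame; toy_stats and its dataset names and instance counts are made up for illustration, and only the column names (dataset, n_instances, task) are taken from the summary output above.

```r
# Illustrative stand-in for summary_stats(); names and numbers are made up.
toy_stats <- data.frame(
  dataset     = c("toy_small_clf", "toy_large_clf", "toy_small_reg"),
  n_instances = c(50L, 5000L, 47L),
  task        = c("classification", "classification", "regression"),
  stringsAsFactors = FALSE
)

# Same conditions as the dplyr pipeline, written with base R indexing:
# keep names of classification datasets with fewer than 100 observations.
small_clf <- toy_stats$dataset[
  toy_stats$n_instances < 100 & toy_stats$task == "classification"
]
small_clf
```

With the real summary table, replacing toy_stats by sum_stats should apply the same filter, assuming the column names shown above.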

Dataset format

All data sets are stored in a common format:

First row is the column names
Each following row corresponds to an individual observation
The target column is named target
All columns are tab (\t) separated
All files are compressed with gzip to conserve space
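
The format described above can be illustrated with base R alone. The following sketch writes a tiny, made-up data frame as a gzip-compressed, tab-separated file with a header row and a target column, then reads it back. The file path and all values are hypothetical; no PMLB file is touched.

```r
# Toy data frame with a column named "target"; values are illustrative only.
toy <- data.frame(
  feature_a = c(1.5, 2.5, 3.5),
  feature_b = c(0L, 1L, 0L),
  target    = c(0L, 1L, 1L)
)

# Write tab-separated text with a header row through a gzip connection.
path <- tempfile(fileext = ".tsv.gz")
con <- gzfile(path, open = "wt")
write.table(toy, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Read it back: the first row becomes the column names, and gzfile()
# decompresses on the fly.
round_trip <- read.delim(gzfile(path), header = TRUE, sep = "\t")
str(round_trip)
```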

This R library includes summaries of the classification and regression data sets but does not store any of the PMLB data sets. The datasets can be downloaded using the fetch_data function, which is similar to the corresponding PMLB Python function.
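
Because the data sets are downloaded rather than shipped with the package, repeated runs may fetch the same file again. One way to avoid that is a small on-disk cache, sketched below. cached_fetch, its arguments, and the stub downloader are all hypothetical names introduced for this example; check ?fetch_data first, since the package may already offer a built-in way to do this.

```r
# Hypothetical helper: keep a fetched dataset as an .rds file on disk.
# `fetch` is any function that takes a dataset name and returns the data.
cached_fetch <- function(name, fetch, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the local copy, no download
  }
  data <- fetch(name)
  saveRDS(data, path)
  data
}

# Offline demonstration with a stub standing in for a real downloader.
n_calls <- 0
stub_fetch <- function(name) {
  n_calls <<- n_calls + 1
  data.frame(x = 1:3, target = c(0L, 1L, 0L))
}

demo_dir <- tempfile("pmlb_cache_")  # fresh directory for this demo
dir.create(demo_dir)
first  <- cached_fetch("toy", stub_fetch, cache_dir = demo_dir)
second <- cached_fetch("toy", stub_fetch, cache_dir = demo_dir)
n_calls  # the stub ran only for the first call
```

With pmlbr installed, passing fetch_data as the fetch argument in place of the stub would be the intended use of this sketch.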

Further info:

?fetch_data
?summary_stats

Citing

If you use PMLB in a scientific publication, please consider citing one of the following papers:

Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore. PMLB v1.0: an open source dataset collection for benchmarking machine learning methods. arXiv preprint arXiv:2012.00058 (2020).

Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, page 36.

Roadmap

Add tests

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Integration of other data repositories is particularly welcome.

Alternatives

Penn Machine Learning Benchmarks
OpenML: approximately 2,500 datasets, available for download using an R module
UC Irvine Machine Learning Repository
mlbench: Machine Learning Benchmark Problems
Rdatasets: An archive of datasets distributed with R
datasets.load: Visual interface for loading datasets in RStudio from all installed (unloaded) packages
stackoverflow: How do I get a list of built-in data sets in R?

License

GPL-2

Releases

Latest release: Dynamic dataset names (Feb 27, 2025), one of 4 releases.
