pmlb is an R interface to the Penn Machine Learning Benchmarks data repository


EpistasisLab/pmlbr

 
 


pmlbr

Requires R >= 3.1.0.

pmlbr is an R interface to the Penn Machine Learning Benchmarks (PMLB) data repository, a large collection of curated benchmark datasets for evaluating and comparing supervised machine learning algorithms. These datasets cover a broad range of applications including binary/multi-class classification and regression problems as well as combinations of categorical, ordinal, and continuous features.

This repository was originally forked from makeyourownmaker/pmlblite. We thank the pmlblite author for releasing the source code under the GPL-2 license so that others could reuse the software.

Install

This package works for any recent version of R.

You can install the released version of pmlbr from CRAN with:

install.packages("pmlbr")

Or the development version from GitHub with remotes:

# install.packages('remotes') # uncomment to install remotes
library(remotes)
remotes::install_github("EpistasisLab/pmlbr")

Usage

The core function of this package is fetch_data, which downloads data from the PMLB repository. For example:

library(pmlbr)

# Download features and labels for penguins dataset in single data frame
penguins <- fetch_data("penguins")
## Download successful.
str(penguins)
## 'data.frame':    333 obs. of  8 variables:
##  $ island           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ target           : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
##   ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

# Download features and labels for penguins dataset in separate data structures
penguins <- fetch_data("penguins", return_X_y = TRUE)
## Download successful.
head(penguins$x) # data frame
##   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## 1      2           39.1          18.7               181        3750   1 2007
## 2      2           39.5          17.4               186        3800   0 2007
## 3      2           40.3          18.0               195        3250   0 2007
## 4      2             NA            NA                NA          NA  NA 2007
## 5      2           36.7          19.3               193        3450   0 2007
## 6      2           39.3          20.6               190        3650   1 2007
head(penguins$y) # vector
## [1] 0 0 0 0 0 0
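
In the output above, the feature data frame returned with return_X_y = TRUE still contains a row of missing values (row 4). If a model cannot handle NA, incomplete rows can be dropped from the features and labels in step. The following is a minimal sketch in base R; the `toy` list is a made-up stand-in that only assumes the `x` (data frame) and `y` (vector) elements shown above, and it does not call pmlbr or download anything.

```r
# Hypothetical stand-in for the list returned with return_X_y = TRUE:
# `x` is a data frame of features, `y` is a vector of labels.
toy <- list(
  x = data.frame(
    bill_length_mm = c(39.1, NA, 40.3),
    sex            = c(1L, NA, 0L)
  ),
  y = c(0L, 0L, 1L)
)

# TRUE for rows of x that have no missing feature values
keep <- complete.cases(toy$x)

# Apply the same row mask to features and labels so they stay aligned
x_clean <- toy$x[keep, , drop = FALSE]
y_clean <- toy$y[keep]
```

Using one logical mask for both objects keeps each label paired with its feature row.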

Let’s check other available datasets and their summary statistics:

# Dataset names
sample(classification_datasets(), 9)
## [1] "heart_disease_hungarian"       "fars"                         
## [3] "allrep"                        "_deprecated_colic"            
## [5] "Hill_Valley_without_noise"     "_deprecated_german"           
## [7] "sleep"                         "_deprecated_cleveland_nominal"
## [9] "analcatdata_happiness"
sample(regression_datasets(), 9)
## [1] "527_analcatdata_election2000" "1089_USCrime"                
## [3] "feynman_III_8_54"             "225_puma8NH"                 
## [5] "657_fri_c2_250_10"            "strogatz_glider2"            
## [7] "611_fri_c3_100_5"             "586_fri_c3_1000_25"          
## [9] "650_fri_c0_500_50"

# Dataset summaries
sum_stats <- summary_stats()
head(sum_stats)
##                dataset n_instances n_features n_binary_features
## 1             1027_ESL         488          4                 0
## 2             1028_SWD        1000         10                 1
## 3             1029_LEV        1000          4                 0
## 4             1030_ERA        1000          4                 0
## 5         1089_USCrime          47         13                 1
## 6 1096_FacultySalaries          50          4                 1
##   n_categorical_features n_continuous_features endpoint_type n_classes
## 1                      4                     0    continuous         9
## 2                      9                     0    continuous         4
## 3                      4                     0    continuous         5
## 4                      0                     4    continuous         9
## 5                      0                    12    continuous        42
## 6                      0                     3    continuous        39
##     imbalance       task
## 1 0.099363200 regression
## 2 0.108290667 regression
## 3 0.111245000 regression
## 4 0.031251250 regression
## 5 0.002970111 regression
## 6 0.004063158 regression

Selecting a subset of datasets that satisfy certain conditions is straightforward with dplyr. For example, if we need datasets with fewer than 100 observations for a classification task:

library(dplyr)
sum_stats %>%
  filter(n_instances < 100, task == "classification") %>%
  pull(dataset)
##  [1] "analcatdata_aids"           "analcatdata_asbestos"      
##  [3] "analcatdata_bankruptcy"     "analcatdata_cyyoung8092"   
##  [5] "analcatdata_cyyoung9302"    "analcatdata_fraud"         
##  [7] "analcatdata_happiness"      "analcatdata_japansolvent"  
##  [9] "confidence"                 "labor"                     
## [11] "lupus"                      "parity5"                   
## [13] "postoperative_patient_data"
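
If dplyr is not available, the same kind of selection can be written with base R indexing. The sketch below uses a small made-up table, `toy_stats`, that borrows only the column names `dataset`, `n_instances`, and `task` from the summary output above. The dataset names and values in it are hypothetical, not real PMLB entries.

```r
# Made-up summary table with the same column names as the summary output;
# the dataset names and values are hypothetical.
toy_stats <- data.frame(
  dataset     = c("toy_a", "toy_b", "toy_c", "toy_d"),
  n_instances = c(50, 500, 80, 47),
  task        = c("classification", "classification",
                  "classification", "regression"),
  stringsAsFactors = FALSE
)

# Logical mask: fewer than 100 observations AND a classification task
is_small_clf <- toy_stats$n_instances < 100 & toy_stats$task == "classification"

# Names of the datasets that satisfy both conditions
small_clf <- toy_stats$dataset[is_small_clf]
small_clf
```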

Dataset format

All data sets are stored in a common format:

  • First row is the column names
  • Each following row corresponds to an individual observation
  • The target column is named target
  • All columns are tab (\t) separated
  • All files are compressed with gzip to conserve space

This R library includes summaries of the classification and regression data sets but does not store any of the PMLB data sets. The data sets can be downloaded using the fetch_data function, which is similar to the corresponding PMLB Python function.
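
To make the file format concrete, here is a minimal sketch in base R that writes a tiny made-up data set in that layout (header row, tab-separated columns, a target column, gzip compression) and reads it back. The file path, column names, and values are hypothetical, and this is not a description of how fetch_data is implemented.

```r
# Tiny made-up data set with a `target` column (values are hypothetical)
toy <- data.frame(
  feature_a = c(1.5, 2.0, 3.5),
  feature_b = c(0L, 1L, 0L),
  target    = c(0L, 1L, 1L)
)

# Write it as a gzip-compressed, tab-separated file with a header row
path <- tempfile(fileext = ".tsv.gz")  # hypothetical local file
con <- gzfile(path, open = "wt")
write.table(toy, con, sep = "\t", row.names = FALSE, quote = FALSE)
close(con)

# Read it back through a gzfile connection; read.delim opens and closes it
dat <- read.delim(gzfile(path), header = TRUE, sep = "\t")

# Separate the features from the target column
x <- dat[, setdiff(names(dat), "target"), drop = FALSE]
y <- dat$target
```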

Further info:

?fetch_data
?summary_stats

Citing

If you use PMLB in a scientific publication, please consider citing one of the following papers:

Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore. PMLB v1.0: an open source dataset collection for benchmarking machine learning methods. arXiv preprint arXiv:2012.00058 (2020).

Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, page 36.

Roadmap

  • Add tests

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Integration of other data repositories is particularly welcome.

Alternatives

License

GPL-2
