suiji/Arborist

Scalable decision tree training and inference.

The Arborist project hosts fast, open-source implementations of several decision-tree algorithms. Breiman and Cutler's Random Forest algorithm is implemented, and Friedman's Stochastic Gradient Boosting is available as an alpha release. A derivative providing Friedman and Fisher's PRIM ("Patient Rule Induction Method") has been developed by Decision Patterns, LLC. Arborist derivatives achieve their speed through parallelized and vectorized inner loops. Parallel, distributed training is also possible for independently-trained trees. Considerable attention has been devoted to minimizing and regularizing data movement, a key challenge to accelerating these algorithms.

Bindings are provided for R. A language-agnostic bridge design supports development of bindings for additional front ends, such as Python and Julia.

R

CRAN hosts the released package Rborist, which implements the Random Forest algorithm.

Installation of the released version using R:

> install.packages('Rborist')

Installation of the development version, hosted on this archive, from the top-level directory:

> ./Rborist/Package/Rborist.CRAN.sh
> R CMD INSTALL Rborist_*.*-*.tar.gz

A CRAN-friendly snapshot of the development source is mirrored by the neighboring archive Rborist.CRAN. This archive is intended for remote access by R utilities such as devtools.
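As a hedged sketch of the two installation routes, the following assumes the Rborist.CRAN mirror lives under the same GitHub owner as this archive (the exact repository path is an assumption; adjust it to the actual archive location):

```r
# Released version, from CRAN:
install.packages('Rborist')

# Development snapshot, via the Rborist.CRAN mirror intended for
# devtools-style remote access (repository path assumed):
devtools::install_github('suiji/Rborist.CRAN')
```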

Notes

  • Rborist version 0.3-7 is hosted on CRAN.

Python

  • Version 0.1-0 has been archived.
  • Test cases are sought.

Performance

Performance metrics have been measured using benchm-ml. Partial results can be found here.

Some users have reported diminished performance when running single-threaded. We recommend running with at least two cores, as frequently-executed inner loops have been written specifically to take advantage of multiple cores. In particular, when using a scaffold such as caret, please prefer to let Rborist claim more cores than the scaffold does.
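A minimal sketch of that advice under caret, assuming caret's built-in "Rborist" method and an nThread thread-count argument (argument name and pass-through behavior are assumptions; consult the package documentation for your version):

```r
library(caret)
library(Rborist)

# Keep caret's resampling loop single-threaded (allowParallel = FALSE)
# and let Rborist parallelize its own inner loops instead.
fit <- train(Species ~ ., data = iris,
             method = "Rborist",
             trControl = trainControl(method = "cv", number = 5,
                                      allowParallel = FALSE),
             nThread = 4)   # assumed thread-count argument, passed through
```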

In a 2015 paper (https://www.jstatsoft.org/article/view/v077i01/v77i01.pdf), Wright and Ziegler compare several R-language implementations of the Random Forest algorithm, including Ranger and Rborist. A key finding was that Rborist failed to achieve the $O(\sqrt{p})$ scaling expected for classification under high predictor (feature) count, p. Although the default behavior for classification is to select $\sqrt{p}$ splitting candidates at a given node, the repartitioning scheme then in use eclipsed any benefit of restricting to such a sparse candidate set: repartitioning had been implemented greedily, taking place on every node/predictor cell regardless of whether the cell was a candidate. As a result of the authors' findings, repartitioning was revised the following year to employ a lazy scheme, applied only to cells either about to undergo splitting or overflowing a fixed-size deque. Using an 8-bit repartitioning history, the lazy scheme appears to scale appropriately beyond p = $10^4$; a 16-bit history can be substituted to extend scaling well beyond p = $10^6$. Although not explicitly stated, their work illustrates that data movement (in this case, repartitioning) poses a greater challenge to scaling than does splitting. Benchmarks used in the study are provided alongside the paper at https://www.jstatsoft.org/article/view/v077i01.

A 2019 paper (https://doi.org/10.1109/ACCESS.2019.2933261) compares several categories of regression tools, including Random Forests. Rborist is among the faster packages offering high prediction accuracy. Based on the findings, we have updated the package's default settings: fixed-number predictor sampling (mtry) appears to provide more accurate predictions at low dimension than Bernoulli sampling.


News/Changes

  • New archive sgbArb.CRAN mirrors the sgbArb package for stochastic gradient boosting.
  • New archive Rborist.CRAN mirrors the Rborist package source in a form directly amenable to utilities such as devtools.
  • New command rfTrain exposes the training component of the compound format/sample/train/validate task performed by rfArb. This provides separate training of sampled, preformatted data.
  • New prediction option keyedFrame accesses prediction columns by name, bypassing a previous requirement that training and prediction frames have the same column ordering. In addition to arbitrary ordering, the prediction frame may now include columns not submitted to training.
  • New command forestWeight computes Meinshausen's forest-wide weights. Nonterminals are weighted in addition to leaves, both to facilitate post-pruning and to accommodate early exit under prediction with trap-and-bail.
  • New prediction option indexing=TRUE records final node indices of tree walks.
  • Training ignores missing predictor values, splitting over appropriately reduced subnodes.
  • Quantile estimation supports both leaf and nonterminal (i.e., trap-and-bail) prediction modes.
  • Prediction and validation support large (> 32 bits) observation counts.
  • Support for training more than 2^32 observations may be enabled by recompiling.
  • New option impPermute introduces permutation-based variable importance.
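As a hedged sketch, the options noted above might be combined as follows; the command and argument names (rfArb, impPermute, indexing, keyedFrame) are taken from these notes, but exact signatures and defaults may differ by version, so consult the package help pages before relying on this:

```r
library(Rborist)

# Train with permutation-based variable importance enabled
# (impPermute, per the notes above); iris is purely illustrative.
rf <- rfArb(iris[, -5], iris$Species, impPermute = 1)

# Predict with name-keyed columns and final-node indexing
# (keyedFrame and indexing options, per the notes above).
pr <- predict(rf, iris[, -5], keyedFrame = TRUE, indexing = TRUE)
```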

Known issues in Rborist 0.3-7

  • Following the introduction of standalone sampling, a break in backward compatibility appears in versions 0.3-0 and higher of the Rborist package. Prediction with models trained using earlier versions throws an unidentified-index exception from within the Rcpp glue layer. Older models should therefore be retrained in order to use version 0.3-0 and above.
  • Documentation for the validate command is out of date and may lead to errors. The documentation has been revised with version 0.3-8.
  • Classification was reporting the oobErr value as accuracy instead of out-of-bag error. This has been repaired in 0.3-8.
  • Non-binary classification with moderately-sized factor predictors (>= 64 levels) was failing due to a wraparound shift. Repaired in 0.3-8.

Correctness and runtime errors are addressed as received. With reproducible test cases, repairs are typically uploaded to the Rborist.CRAN repository within several days.

Feature requests are addressed on a case-by-case basis.
