suiji/Arborist

Scalable decision tree training and inference.

The Arborist project hosts fast, open-source implementations of several decision-tree algorithms. Breiman and Cutler's Random Forest algorithm is implemented, and Friedman's Stochastic Gradient Boosting is available as an alpha release. A derivative providing Friedman and Fisher's PRIM ("Patient Rule Induction Method") has been developed by Decision Patterns, LLC. Arborist derivatives achieve their speed through parallelized and vectorized inner loops. Parallel, distributed training is also possible for independently-trained trees. Considerable attention has been devoted to minimizing and regularizing data movement, a key challenge to accelerating these algorithms.

Bindings are provided for R. A language-agnostic bridge design supports development of bindings for additional front ends, such as Python and Julia.

R

CRAN hosts the released package Rborist, which implements the Random Forest algorithm.

Installation of the released version using R:

> install.packages('Rborist')

Installation of the development version, hosted on this archive, from the top-level directory:

> ./Rborist/Package/Rborist.CRAN.sh
> R CMD INSTALL Rborist_*.*-*.tar.gz

A CRAN-friendly snapshot of the development source is mirrored by the neighboring archive Rborist.CRAN. This archive is intended for remote access by R utilities such as devtools.
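As a hedged sketch of the two installation routes, the following assumes the Rborist.CRAN mirror lives under the same GitHub owner as this archive (the exact repository path is an assumption; adjust it to the actual archive location):

```r
# Released version, from CRAN:
install.packages('Rborist')

# Development snapshot, via the Rborist.CRAN mirror intended for
# devtools-style remote access (repository path assumed):
devtools::install_github('suiji/Rborist.CRAN')
```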

Notes

  • Rborist version 0.3-7 is hosted on CRAN.

Python

  • Version 0.1-0 has been archived.
  • Test cases are sought.

Performance

Performance metrics have been measured using benchm-ml. Partial results can be found here.

Some users have reported diminished performance when running single-threaded. We recommend running with at least two cores, as frequently-executed inner loops have been written specifically to take advantage of multiple cores. In particular, when using a scaffold such as caret, please prefer to let Rborist claim more cores than the scaffold does.
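A minimal sketch of that advice under caret, assuming caret's built-in "Rborist" method and an nThread thread-count argument (argument name and pass-through behavior are assumptions; consult the package documentation for your version):

```r
library(caret)
library(Rborist)

# Keep caret's resampling loop single-threaded (allowParallel = FALSE)
# and let Rborist parallelize its own inner loops instead.
fit <- train(Species ~ ., data = iris,
             method = "Rborist",
             trControl = trainControl(method = "cv", number = 5,
                                      allowParallel = FALSE),
             nThread = 4)   # assumed thread-count argument, passed through
```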

In a 2015 paper (https://www.jstatsoft.org/article/view/v077i01/v77i01.pdf), Wright and Ziegler compare several R-language implementations of the Random Forest algorithm, including Ranger and Rborist. A key finding was that Rborist failed to achieve the $O(\sqrt{p})$ scaling expected for classification under high predictor (feature) count, p. Although the default behavior for classification is to select $\sqrt{p}$ splitting candidates at a given node, the repartitioning scheme then in use eclipsed any benefit of restricting to such a sparse candidate set: repartitioning had been implemented greedily, taking place on every node/predictor cell regardless of whether the cell was a candidate. As a result of the authors' findings, repartitioning was revised the following year to employ a lazy scheme, applied only to cells either about to undergo splitting or overflowing a fixed-size deque. Using an 8-bit repartitioning history, the lazy scheme appears to scale appropriately beyond p = $10^4$; a 16-bit history can be substituted to extend scaling well beyond p = $10^6$. Although not explicitly stated, their work illustrates that data movement (in this case, repartitioning) poses a greater challenge to scaling than does splitting. Benchmarks used in the study are provided alongside the paper at https://www.jstatsoft.org/article/view/v077i01.

A 2019 paper (https://doi.org/10.1109/ACCESS.2019.2933261) compares several categories of regression tools, including Random Forests. Rborist is among the faster packages offering high prediction accuracy. Based on the findings, we have updated the package's default settings: fixed-number predictor sampling (mtry) appears to provide more accurate predictions at low dimension than Bernoulli sampling.


News/Changes

  • New archive sgbArb.CRAN mirrors the sgbArb package for stochastic gradient boosting.
  • New archive Rborist.CRAN mirrors the Rborist package source in a form directly amenable to utilities such as devtools.
  • New command rfTrain exposes the training component of the compound format/sample/train/validate task performed by rfArb. This provides separate training of sampled, preformatted data.
  • New prediction option keyedFrame accesses prediction columns by name, bypassing a previous requirement that training and prediction frames have the same column ordering. In addition to arbitrary ordering, the prediction frame may now include columns not submitted to training.
  • New command forestWeight computes Meinshausen's forest-wide weights. Nonterminals are weighted in addition to leaves, both to facilitate post-pruning and to accommodate early exit under prediction with trap-and-bail.
  • New prediction option indexing=TRUE records final node indices of tree walks.
  • Training ignores missing predictor values, splitting over appropriately reduced subnodes.
  • Quantile estimation supports both leaf and nonterminal (i.e., trap-and-bail) prediction modes.
  • Prediction and validation support large (> 32 bits) observation counts.
  • Support for training more than 2^32 observations may be enabled by recompiling.
  • New option impPermute introduces permutation-based variable importance.
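As a hedged sketch, the options noted above might be combined as follows; the command and argument names (rfArb, impPermute, indexing, keyedFrame) are taken from these notes, but exact signatures and defaults may differ by version, so consult the package help pages before relying on this:

```r
library(Rborist)

# Train with permutation-based variable importance enabled
# (impPermute, per the notes above); iris is purely illustrative.
rf <- rfArb(iris[, -5], iris$Species, impPermute = 1)

# Predict with name-keyed columns and final-node indexing
# (keyedFrame and indexing options, per the notes above).
pr <- predict(rf, iris[, -5], keyedFrame = TRUE, indexing = TRUE)
```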

Known issues in Rborist 0.3-7

  • Following the introduction of standalone sampling, a break in backward compatibility appears in versions 0.3-0 and higher of the Rborist package. Prediction with models trained using earlier versions throws an unidentified-index exception from within the Rcpp glue layer. Older models should therefore be retrained in order to use version 0.3-0 and above.
  • Documentation for the validate command is out of date and may lead to errors. The documentation has been revised with version 0.3-8.
  • Classification was reporting the oobErr value as accuracy instead of out-of-bag error. This has been repaired in 0.3-8.
  • Non-binary classification with moderately-sized factor predictors (>= 64 levels) was failing due to a wraparound shift. Repaired in 0.3-8.

Correctness and runtime errors are addressed as received. With reproducible test cases, repairs are typically uploaded to the Rborist.CRAN repository within several days.

Feature requests are addressed on a case-by-case basis.
