ML in Rust #306 (Merged)
pgml-docs/docs/blog/oxidizing-machine-learning.md (120 additions, 0 deletions)
---
author: Lev Kokotov
description: Machine learning in Python is slow and error-prone, while Rust makes it fast and reliable.
---

# Oxidizing Machine Learning

<p class="author">
  <img width="54px" height="54px" src="/images/team/lev.jpg" alt="Author" />
  Lev Kokotov<br/>
  September 7, 2022
</p>
Machine learning in Python can be hard to deploy at scale. We all love Python, but it's no secret that its overhead is large:

* Load data from large CSV files
* Do some post-processing with NumPy
* Move and join data into a Pandas dataframe
* Load data into the algorithm

Each step incurs at least one copy of the data in memory. Paying 4x the storage and compute to train a single model already sounds inefficient, and once you add Python's per-object memory overhead, the price tag climbs even higher.

Even if you could find the money to pay for the compute, fitting the dataset we want into the RAM we have becomes difficult.
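The copy tax is easy to observe with nothing but the standard library. Here is a toy sketch (the step names and sizes are illustrative stand-ins, not measurements from a real pipeline):

```python
import tracemalloc
from array import array

tracemalloc.start()

# Step 1: "load" 100k float samples (stand-in for a CSV read).
data = [float(i) for i in range(100_000)]
# Step 2: post-process into a second full copy.
scaled = [x * 0.5 for x in data]
# Step 3: copy again, as a dataframe build would.
joined = list(scaled)

_, peak = tracemalloc.get_traced_memory()

# The same values packed into one contiguous 8-bytes-per-value buffer.
packed = array("d", joined)

print(f"peak during pipeline: {peak / 1e6:.1f} MB")
print(f"packed buffer: {len(packed.tobytes()) / 1e6:.1f} MB")
```

Every intermediate list holds full boxed float objects, so peak usage is several times the size of the raw values.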
The status quo needs a shake-up, and along comes Rust.

## The State of ML in Rust
Doing machine learning in anything but Python sounds wild, but if one looks under the hood, ML algorithms are mostly written in C++: `libtorch` (Torch), XGBoost, large parts of Tensorflow, `libsvm` (Support Vector Machines), and the list goes on. A linear regression can be (and is) written in about 10 lines of for-loops.
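That claim is easy to check. A from-scratch sketch, gradient descent on `y = w*x + b` with made-up data, really is just two nested for-loops:

```rust
// Linear regression by gradient descent: no libraries, just loops.
fn fit(xs: &[f64], ys: &[f64], lr: f64, epochs: usize) -> (f64, f64) {
    let (mut w, mut b) = (0.0, 0.0);
    let n = xs.len() as f64;
    for _ in 0..epochs {
        let (mut gw, mut gb) = (0.0, 0.0);
        for (x, y) in xs.iter().zip(ys) {
            let err = w * x + b - y; // prediction error on one sample
            gw += 2.0 * err * x / n; // gradient of mean squared error w.r.t. w
            gb += 2.0 * err / n;     // ... and w.r.t. b
        }
        w -= lr * gw;
        b -= lr * gb;
    }
    (w, b)
}

fn main() {
    let xs = [1.0, 2.0, 3.0, 4.0];
    let ys = [3.0, 5.0, 7.0, 9.0]; // generated from y = 2x + 1
    let (w, b) = fit(&xs, &ys, 0.05, 10_000);
    println!("w = {w:.2}, b = {b:.2}");
}
```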
It should then come as no surprise that the Rust ML community is alive and doing well:

* SmartCore[^1] is rivaling Scikit for commodity algorithms
* XGBoost bindings[^2] work great for gradient boosted trees
* Torch bindings[^3] are first class for building any kind of neural network
* Tensorflow bindings[^4] are also in the mix, although parts of them are still Python (e.g. Keras)
If you start missing NumPy, don't worry: the Rust version[^5] has you covered, and the list of available tools keeps growing.

When you only need 4 bytes to represent a floating point number instead of Python's 26 bytes[^6], suddenly you can do more.
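You can check the per-value overhead yourself with the standard library (exact sizes vary by CPython build; these are for a typical 64-bit interpreter):

```python
import sys
from array import array

# Every Python float is a full heap object: refcount, type pointer, payload.
boxed = sys.getsizeof(1.5)

# A packed single-precision buffer spends exactly 4 bytes per value.
packed = array("f", [1.5] * 1_000)

print(f"one boxed Python float: {boxed} bytes")
print(f"one packed float32:     {packed.itemsize} bytes")
```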
## XGBoost, Rustified

Let's do a quick example to illustrate our point.

XGBoost is a popular decision tree algorithm which uses gradient boosting, a fancy optimization technique, to train algorithms on data that could confuse simpler linear models. It comes with a Python interface, which calls into its C++ primitives, but now it has a Rust interface as well.
_Cargo.toml_

```toml
[dependencies]
xgboost = "0.1"
```
_src/main.rs_

```rust
use xgboost::{parameters, Booster, DMatrix};

fn main() {
    // Data is read directly into the C++ data structure.
    let train = DMatrix::load("train.txt").unwrap();
    let test = DMatrix::load("test.txt").unwrap();

    // Task (regression or classification)
    let learning_params = parameters::learning::LearningTaskParametersBuilder::default()
        .objective(parameters::learning::Objective::BinaryLogistic)
        .build()
        .unwrap();

    // Tree parameters (e.g. depth)
    let tree_params = parameters::tree::TreeBoosterParametersBuilder::default()
        .max_depth(2)
        .eta(1.0)
        .build()
        .unwrap();

    // Gradient boosting parameters
    let booster_params = parameters::BoosterParametersBuilder::default()
        .booster_type(parameters::BoosterType::Tree(tree_params))
        .learning_params(learning_params)
        .build()
        .unwrap();

    // Train on train data, test accuracy on test data
    let evaluation_sets = &[(&train, "train"), (&test, "test")];

    // Final algorithm configuration
    let params = parameters::TrainingParametersBuilder::default()
        .dtrain(&train)
        .boost_rounds(2) // n_estimators
        .booster_params(booster_params)
        .evaluation_sets(Some(evaluation_sets))
        .build()
        .unwrap();

    // Train!
    let model = Booster::train(&params).unwrap();

    // Save and load later in any language that has XGBoost bindings.
    model.save("/tmp/xgboost_model.bin").unwrap();
}
```
<small>Example created from `rust-xgboost`[^7] documentation and my own experiments.</small>

That's it! You just trained an XGBoost model in Rust, in just a few lines of efficient and ergonomic code.

Unlike Python, Rust compiles and verifies your code, so you'll know that it's likely to work before you even run it. When it can take several hours to train a model, it's great to know you don't have a syntax error on your last line.
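The `.unwrap()` calls above trade some of that safety back for brevity: they panic if, say, `train.txt` is missing. In production Rust you would propagate failures as values instead. Here is a small standard-library sketch of the pattern (`parse_labels` is a hypothetical helper, not part of the XGBoost API):

```rust
use std::num::ParseFloatError;

// Errors are values: the compiler forces every caller to handle the Result.
fn parse_labels(raw: &str) -> Result<Vec<f64>, ParseFloatError> {
    raw.split_whitespace().map(str::parse::<f64>).collect()
}

fn main() {
    // Good input parses; bad input is an Err you can inspect, not a crash.
    assert_eq!(parse_labels("1.0 0.0 1.0").unwrap(), vec![1.0, 0.0, 1.0]);
    assert!(parse_labels("1.0 oops").is_err());
    println!("ok");
}
```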
[^1]: [SmartCore](https://smartcorelib.org/)
[^2]: [XGBoost bindings](https://github.com/davechallis/rust-xgboost)
[^3]: [Torch bindings](https://github.com/LaurentMazare/tch-rs)
[^4]: [Tensorflow bindings](https://github.com/tensorflow/rust)
[^5]: [rust-ndarray](https://github.com/rust-ndarray/ndarray)
[^6]: [Python floating points](https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/floatobject.h#L15)
[^7]: [`rust-xgboost`](https://docs.rs/xgboost/latest/xgboost/)
pgml-docs/mkdocs.yml (1 addition, 0 deletions)