The agua package provides a tidymodels interface to the H2O platform and the h2o R package. It has two main components:
- A new parsnip engine, `'h2o'`, for the following models:

  - `linear_reg()`, `logistic_reg()`, `poisson_reg()`, `multinom_reg()`: all fit penalized generalized linear models. If the model parameters `penalty` and `mixture` are not specified, h2o will search internally for the optimal regularization settings (see the sketch after this list).
  - `boost_tree()`: boosted trees via xgboost. Use `h2o::h2o.xgboost.available()` to check whether h2o's xgboost is supported on your machine. For classical gradient boosting, use the `'h2o_gbm'` engine.
  - `rand_forest()`: random forest models.
  - `naive_Bayes()`: naive Bayes models.
  - `rule_fit()`: RuleFit models.
  - `mlp()`: multi-layer feedforward neural networks.
  - `auto_ml()`: automatic machine learning.

- Infrastructure for the tune package; see Tuning with agua for more details.
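As a minimal sketch of the first component (the specification below is illustrative and not from the original text), an h2o model is declared like any other parsnip model:

``` r
library(tidymodels)
library(agua)
tidymodels_prefer()

# Leaving penalty and mixture unspecified lets h2o search internally
# for the optimal regularization settings.
glm_spec <- linear_reg() %>%
  set_engine("h2o") %>%
  set_mode("regression")
```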
All supported models accept an additional engine argument, `validation`, a number between 0 and 1 specifying the proportion of the training data reserved as a validation set. This set can be used by h2o for performance assessment and potential early stopping.
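For instance, a minimal sketch (the model type and the value `0.1` are illustrative assumptions):

``` r
# Reserve 10% of the training data as a validation set, which h2o can
# use for performance assessment and potential early stopping.
rf_val_spec <- rand_forest(trees = 500) %>%
  set_engine("h2o", validation = 0.1) %>%
  set_mode("regression")
```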
## 'h2o' engine

As an example, we will fit a random forest model to the concrete data. This will be a regression model with the outcome being the compressive strength of concrete mixtures.
``` r
library(tidymodels)
library(agua)
library(ggplot2)
tidymodels_prefer()
theme_set(theme_bw())

# start h2o server
h2o_start()

data(concrete, package = "modeldata")
concrete <- concrete %>%
  group_by(across(-compressive_strength)) %>%
  summarize(compressive_strength = mean(compressive_strength),
            .groups = "drop")
concrete
#> # A tibble: 992 × 9
#>    cement blast_furn…¹ fly_ash water super…² coars…³ fine_…⁴   age compr…⁵
#>     <dbl>        <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <int>   <dbl>
#>  1   102          153        0  192        0    887     942      3    4.57
#>  2   102          153        0  192        0    887     942      7    7.68
#>  3   102          153        0  192        0    887     942     28   17.3
#>  4   102          153        0  192        0    887     942     90   25.5
#>  5   108.         162.       0  204.       0    938.    849      3    2.33
#>  6   108.         162.       0  204.       0    938.    849      7    7.72
#>  7   108.         162.       0  204.       0    938.    849     28   20.6
#>  8   108.         162.       0  204.       0    938.    849     90   29.2
#>  9   116          173        0  192        0    910.    892.     3    6.28
#> 10   116          173        0  192        0    910.    892.     7   10.1
#> # … with 982 more rows, and abbreviated variable names
#> #   ¹blast_furnace_slag, ²superplasticizer, ³coarse_aggregate,
#> #   ⁴fine_aggregate, ⁵compressive_strength
```

Note that we need to call `h2o_start()` or `h2o::h2o.init()` to start the h2o instance. The h2o server handles computations related to estimation and prediction, and passes the results back to R. agua takes care of data conversion and error handling, and tries to store as few objects on the server as possible. The h2o server terminates automatically once the R session is closed. You can use `h2o::h2o.removeAll()` to remove all server-side objects and `h2o::h2o.shutdown()` to manually stop the server.
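The server housekeeping calls mentioned above look like this (a sketch; run them only when you are finished with all h2o objects):

``` r
# Remove all objects stored on the h2o server
h2o::h2o.removeAll()

# Manually stop the server; prompt = FALSE skips the interactive
# confirmation prompt
h2o::h2o.shutdown(prompt = FALSE)
```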
The rest of the syntax for model fitting and prediction is identical to that of any other engine in tidymodels.
``` r
set.seed(1501)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test  <- testing(concrete_split)

rf_spec <- rand_forest(mtry = 3, trees = 500) %>%
  set_engine("h2o", histogram_type = "Random") %>%
  set_mode("regression")

normalized_rec <- recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_predictors())

rf_wflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(normalized_rec)

rf_fit <- fit(rf_wflow, data = concrete_train)
rf_fit
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_normalize()
#>
#> ── Model ─────────────────────────────────────────────────────────────────
#> Model Details:
#> ==============
#>
#> H2ORegressionModel: drf
#> Model ID:  DRF_model_R_1665503649643_6
#> Model Summary:
#>   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
#> 1             500                      500             2652880        15
#>   max_depth mean_depth min_leaves max_leaves mean_leaves
#> 1        20   17.97600        375        450   417.48000
#>
#>
#> H2ORegressionMetrics: drf
#> ** Reported on training data. **
#> ** Metrics reported on Out-Of-Bag training samples **
#>
#> MSE:  26.5
#> RMSE:  5.15
#> MAE:  3.7
#> RMSLE:  0.169
#> Mean Residual Deviance :  26.5

predict(rf_fit, new_data = concrete_test)
#> # A tibble: 249 × 1
#>    .pred
#>    <dbl>
#>  1  6.42
#>  2  9.54
#>  3  9.20
#>  4 25.5
#>  5  6.60
#>  6 28.6
#>  7 10.0
#>  8 31.9
#>  9 12.1
#> 10 11.4
#> # … with 239 more rows
```

Here, we specify the engine argument `histogram_type = "Random"` to use the extremely randomized trees (XRT) algorithm. For all available engine arguments, consult the engine-specific "h2o" help page of that model. For instance, the h2o link in the help page of `rand_forest()` shows that it uses `h2o::h2o.randomForest()`, whose arguments can be passed in as engine arguments to `set_engine()`.
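Other arguments of `h2o::h2o.randomForest()` can be passed the same way. A sketch, where `sample_rate` (h2o's per-tree row sampling rate) and the value `0.8` are illustrative assumptions:

``` r
# sample_rate is passed through to h2o::h2o.randomForest()
rf_sampled_spec <- rand_forest(trees = 500) %>%
  set_engine("h2o", sample_rate = 0.8) %>%
  set_mode("regression")
```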
You can also use `fit_resamples()` with h2o models.
``` r
concrete_folds <- vfold_cv(concrete_train, strata = compressive_strength)
fit_resamples(rf_wflow, resamples = concrete_folds)
#> # Resampling results
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 4
#>    splits           id     .metrics         .notes
#>    <list>           <chr>  <list>           <list>
#>  1 <split [667/76]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
#>  2 <split [667/76]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
#>  3 <split [667/76]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
#>  4 <split [667/76]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
#>  5 <split [667/76]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
#>  6 <split [668/75]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
#>  7 <split [671/72]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
#>  8 <split [671/72]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
#>  9 <split [671/72]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
#> 10 <split [671/72]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
```

Variable importance scores can be visualized with the vip package.
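A sketch of that last step, assuming vip's support for h2o model objects and using the `rf_fit` workflow from above:

``` r
library(vip)

# Extract the underlying h2o model from the fitted workflow and plot
# its variable importance scores
rf_fit %>%
  extract_fit_engine() %>%
  vip()
```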