
mrIML: Multi-Response (Multivariate) Interpretable Machine Learning


Overview

This package aims to enable users to build and interpret multivariate machine learning models harnessing the tidyverse (tidy model syntax in particular). This package builds off ideas from Gradient Forests (Ellis et al., 2012), ecological genomic approaches (Fitzpatrick & Keller, 2015), and multi-response stacking algorithms (Xing et al., 2020).

This package can be of use for any multi-response machine learning problem, but was designed to handle data common to community ecology (site by species data) and ecological genomics (individual or population by SNP loci).

How to Install

You can install the released version of mrIML with install.packages(), or the development version using devtools:

    install.packages("mrIML")

    # Install development version
    devtools::install_github('nickfountainjones/mrIML')

Using mrIML

To get started, load mrIML and tidymodels:

    library(mrIML)
    library(tidymodels)
    #> ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
    #> ✔ broom        1.0.8     ✔ recipes      1.3.0
    #> ✔ dials        1.4.0     ✔ rsample      1.3.0
    #> ✔ dplyr        1.1.4     ✔ tibble       3.2.1
    #> ✔ ggplot2      3.5.2     ✔ tidyr        1.3.1
    #> ✔ infer        1.0.8     ✔ tune         1.3.0
    #> ✔ modeldata    1.4.0     ✔ workflows    1.2.0
    #> ✔ parsnip      1.3.1     ✔ workflowsets 1.1.0
    #> ✔ purrr        1.0.4     ✔ yardstick    1.3.2
    #> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
    #> ✖ purrr::discard() masks scales::discard()
    #> ✖ dplyr::filter()  masks stats::filter()
    #> ✖ dplyr::lag()     masks stats::lag()
    #> ✖ recipes::step()  masks stats::step()

Many functions in mrIML benefit from parallel processing, which can be enabled by setting a future plan:

    future::plan("multisession", workers = 2)
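As a minimal sketch of how the plan might be scoped, the snippet below caps the worker count at the number of detected cores and restores the previous plan afterwards. The variable names (`n_cores`, `n_workers`, `old_plan`) are illustrative, and the cap of two workers simply mirrors the example above; neither is an mrIML requirement.

```r
# Detect available cores; parallel::detectCores() can return NA.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Use at most 2 workers (as in the example above), but never fewer than 1.
n_workers <- max(1L, min(2L, n_cores))

if (requireNamespace("future", quietly = TRUE)) {
  # plan() returns the previous plan invisibly, so it can be restored later.
  old_plan <- future::plan("multisession", workers = n_workers)

  # ... run mrIML functions here ...

  # Restore whatever plan was active before.
  future::plan(old_plan)
}
```

Restoring the previous plan keeps background R sessions from lingering once the parallel work is finished.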

The core function of mrIML is mrIMLpredicts(), which is a wrapper around the tidymodels workflow that fits a provided model to each response variable in a multi-response data set.

    # Load example multi-response data
    data <- MRFcov::Bird.parasites

    # Split into response and predictor data
    Y <- data %>%
      select(-c("scale.prop.zos"))
    X <- data %>%
      select(scale.prop.zos)

    # Define tidymodel
    model <- rand_forest(
      trees = 100,
      mode = "classification",
      mtry = tune(),
      min_n = tune()
    ) %>%
      set_engine("randomForest")

    # Fit multi-response model
    mrIML_model <- mrIMLpredicts(
      X = X,
      Y = Y,
      Model = model,
      prop = 0.7,
      k = 5
    )
    #>   |======================================================================| 100%
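To make the idea of "one model per response variable" concrete, here is a conceptual sketch in base R. It is not how mrIMLpredicts() is implemented: the data are simulated, the column names (`env`, `sp1`, `sp2`, `sp3`) are hypothetical, and a plain logistic regression stands in for the tidymodels workflow. The 70/30 split is only analogous in spirit to `prop = 0.7`.

```r
# Conceptual sketch only; NOT the mrIMLpredicts() implementation.
set.seed(42)

# Simulated stand-in for a site-by-species data set (hypothetical names).
n <- 200
X <- data.frame(env = rnorm(n))
Y <- data.frame(
  sp1 = rbinom(n, 1, plogis(1.5 * X$env)),   # responds positively to env
  sp2 = rbinom(n, 1, plogis(-1.0 * X$env)),  # responds negatively to env
  sp3 = rbinom(n, 1, 0.5)                    # unrelated to env
)

# Hold out 30% of rows for testing.
train_rows <- sample(n, size = round(0.7 * n))

# Fit one logistic regression per response column.
fits <- lapply(names(Y), function(resp) {
  train <- cbind(X, y = Y[[resp]])[train_rows, , drop = FALSE]
  glm(y ~ env, data = train, family = binomial())
})
names(fits) <- names(Y)

# Test-set accuracy for each response.
test_accuracy <- sapply(names(Y), function(resp) {
  test_X <- X[-train_rows, , drop = FALSE]
  prob <- predict(fits[[resp]], newdata = test_X, type = "response")
  mean((prob > 0.5) == (Y[[resp]][-train_rows] == 1))
})
print(round(test_accuracy, 2))
```

The result is a list of fitted models and a vector of metrics, both indexed by response name, which is the general shape of output that per-response tools work with.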

The object mrIML_model can be investigated using:

mrIMLperformance() to get performance metrics for each response variable,
mrvip() to get variable importance for each response variable,
mrFlashlight() to get partial dependence plots for each response variable,
mrCovar() to get covariate importance for each predictor variable, and
mrInteractions() to get interaction importance for each predictor variable in the response models.
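For readers unfamiliar with per-response variable importance, the following base R sketch illustrates one common, model-agnostic approach: permutation importance. It is an assumption-laden illustration rather than the method used by mrvip(); the data are simulated, and the helpers `log_loss()` and `permutation_importance()` are hypothetical names written for this example.

```r
# Conceptual sketch only; NOT the mrvip() implementation.
set.seed(1)

# Simulated predictors and two binary responses (hypothetical names).
n <- 300
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- data.frame(
  resp_a = rbinom(n, 1, plogis(2 * X$x1)),  # driven by x1
  resp_b = rbinom(n, 1, plogis(2 * X$x2))   # driven by x2
)

# Binary log loss, with probabilities clipped away from 0 and 1.
log_loss <- function(y, p) {
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Importance of a predictor = increase in log loss after shuffling it.
permutation_importance <- function(y, predictors) {
  fit <- glm(y ~ ., data = cbind(predictors, y = y), family = binomial())
  baseline <- log_loss(y, predict(fit, newdata = predictors, type = "response"))
  sapply(names(predictors), function(v) {
    shuffled <- predictors
    shuffled[[v]] <- sample(shuffled[[v]])
    log_loss(y, predict(fit, newdata = shuffled, type = "response")) - baseline
  })
}

# One column of importances per response variable.
importance <- sapply(Y, permutation_importance, predictors = X)
print(round(importance, 3))
```

Each column of the resulting matrix describes a single response, so predictors that matter for one response but not another are easy to spot.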

Two multi-response models can be compared using mrPerformance().

Bootstrapping can be implemented using mrBootstrap(), which can then be used to quantify uncertainty around partial dependence plots, mrPdPlotBootstrap(), and variable importance, mrvipBootstrap(), as well as build co-occurrence networks using mrCoOccurNet().
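The general idea behind bootstrapped uncertainty can be sketched in base R as follows. This is a generic nonparametric bootstrap and not the mrBootstrap() implementation: the data are simulated, a logistic regression coefficient stands in for a variable importance score, and `n_boot`, `boot_coefs`, and `ci` are illustrative names.

```r
# Conceptual sketch only; NOT the mrBootstrap() implementation.
set.seed(123)

# Simulated data: one predictor, one binary response.
n <- 150
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.2 * dat$x))

# Resample rows with replacement, refit, and keep the statistic of interest.
n_boot <- 200
boot_coefs <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)
  coef(glm(y ~ x, data = dat[idx, ], family = binomial()))[["x"]]
})

# Percentile interval summarising uncertainty in the effect of x.
ci <- quantile(boot_coefs, probs = c(0.025, 0.975))
print(round(ci, 2))
```

The same resample-and-refit loop applies to any per-response summary, such as an importance score or a partial dependence curve.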

Recent mrIML publications

Fountain-Jones, N. M., Kozakiewicz, C. P., Forester, B. R., Landguth, E. L., Carver, S., Charleston, M., Gagne, R. B., Greenwell, B., Kraberger, S., Trumbo, D. R., Mayer, M., Clark, N. J., & Machado, G. (2021). MrIML: Multi-response interpretable machine learning to model genomic landscapes. Molecular Ecology Resources, 21, 2766–2781. https://doi.org/10.1111/1755-0998.13495
Sykes, A. L., Silva, G. S., Holtkamp, D. J., Mauch, B. W., Osemeke, O., Linhares, D. C. L., & Machado, G. (2021). Interpretable machine learning applied to on-farm biosecurity and porcine reproductive and respiratory syndrome virus. Transboundary and Emerging Diseases, 00, 1–15. https://doi.org/10.1111/tbed.14369
Fountain-Jones, N. M., Appaw, R., Alkhamis, M., Baker, S., Clark, N., Powell-Romero, F., Mayer, M., Machado, G., & Videvall, E. (2024). Advancing ecological community analysis with MrIML 2.0: Unravelling taxa associations through interpretable machine learning. Authorea [preprint]. https://doi.org/10.22541/au.172676147.77148600/v1

References

Ellis, N., Smith, S. J., & Pitcher, C. R. (2012). Gradient forests: calculating importance gradients on physical predictors. Ecology, 93, 156–168. https://doi.org/10.1890/11-0252.1

Fitzpatrick, M. C., & Keller, S. R. (2015). Ecological genomics meets community-level modelling of biodiversity: Mapping the genomic landscape of current and future environmental adaptation. Ecology Letters, 18, 1–16. https://doi.org/10.1111/ele.12376

Xing, L., Lesperance, M. L., & Zhang, X. (2020). Simultaneous prediction of multiple outcomes using revised stacking algorithms. Bioinformatics, 36, 65–72. https://doi.org/10.1093/bioinformatics/btz531
