| Title: | Feature Selection Using Supervised Filter-Based Methods |
| Version: | 0.2.0 |
| Description: | Tidy tools to apply filter-based supervised feature selection methods. These methods score and rank feature relevance using metrics such as p-values, correlation, and importance scores (Kuhn and Johnson (2019) <doi:10.1201/9781315108230>). |
| License: | MIT + file LICENSE |
| URL: | https://github.com/tidymodels/filtro, https://filtro.tidymodels.org/ |
| BugReports: | https://github.com/tidymodels/filtro/issues |
| Depends: | R (≥ 4.1) |
| Imports: | cli, desirability2 (≥ 0.1.0), dplyr, generics, pROC, purrr, rlang (≥ 1.1.0), S7, stats, tibble, tidyr, vctrs |
| Suggests: | aorsf, FSelectorRcpp, knitr, modeldata, partykit, quarto, ranger, rmarkdown, testthat (≥ 3.0.0), titanic |
| Config/Needs/website: | tidyverse/tidytemplate |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Collate: | 'aaa.R' 'class_score.R' 'data.R' 'desirability2.R' 'filtro-package.R' 'import-standalone-obj-type.R' 'import-standalone-types-check.R' 'misc.R' 'score-aov.R' 'utilities.R' 'score-cor.R' 'score-cross_tab.R' 'score-forest_imp.R' 'score-info_gain.R' 'score-roc_auc.R' 'zzz.R' |
| LazyData: | true |
| VignetteBuilder: | quarto, knitr |
| NeedsCompilation: | no |
| Packaged: | 2025-08-26 21:19:41 UTC; franceslin |
| Author: | Frances Lin [aut, cre], Max Kuhn |
| Maintainer: | Frances Lin <franceslinyc@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-08-26 21:40:02 UTC |
filtro: Feature Selection Using Supervised Filter-Based Methods
Description

Tidy tools to apply filter-based supervised feature selection methods. These methods score and rank feature relevance using metrics such as p-values, correlation, and importance scores (Kuhn and Johnson (2019) <doi:10.1201/9781315108230>).
Author(s)
Maintainer: Frances Lin <franceslinyc@gmail.com>
Authors:
Max Kuhn <max@posit.co> (ORCID)
Emil Hvitfeldt <emil.hvitfeldt@posit.co>
Other contributors:
Posit Software, PBC (03wc8by49) [copyright holder, funder]
See Also
Useful links:
Report bugs at https://github.com/tidymodels/filtro/issues
Ames example score results
Description
This data set contains example score results for the Ames housing data. It is used to demonstrate show_best_desirability_prop() and was created by the examples in fill_safe_values().
Value
ames_scores_results | a tibble |
Examples
data(ames_scores_results)
Arrange score
Description
Arrange score
Arguments
x | A score class object (e.g., |
... | Further arguments passed to or from other methods. |
target | A numeric value specifying the target value. The default of |
Value
A tibble of score results.
Examples
library(dplyr)
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
# Arrange score
ames_aov_pval_res |> arrange_score()
Bind score class object, including their associated metadata and scores
Description
Binds multiple score class objects (e.g., score_*), including their associated metadata and scores. See fill_safe_values() for binding with safe-value handling.
Arguments
x | A list. |
... | Further arguments passed to or from other methods. |
Value
A tibble of scores results.
Examples
library(dplyr)
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
# ANOVA p-value
ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
# Pearson correlation
ames_cor_pearson_res <- score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_cor_pearson_res@results
# Forest importance
set.seed(42)
ames_imp_rf_reg_res <- score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_imp_rf_reg_res@results
# Information gain
ames_info_gain_reg_res <- score_info_gain |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_reg_res@results
# Create a list
class_score_list <- list(
  ames_aov_pval_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
# Bind scores
class_score_list |> bind_scores()
General S7 classes for scoring objects
Description
class_score is an S7 object that contains slots for the characteristics of predictor importance scores. More specific classes for individual methods are based on this object (shown below).
Usage
class_score(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame()
)
class_score_aov(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  neg_log10 = TRUE
)
class_score_cor(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame()
)
class_score_xtab(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  neg_log10 = TRUE
)
class_score_imp_rf(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  engine = "ranger"
)
class_score_info_gain(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  mode = "classification"
)
class_score_roc_auc(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame()
)
S7 subclass of base R's list for method dispatch
Description
class_score_list is an S7 subclass of S3 base R's list, used for method dispatch in bind_scores() and fill_safe_values().
Usage
class_score_list
Format
An object of class S7_S3_class of length 3.
Value
A list of S7 objects.
Examples
library(dplyr)
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
# ANOVA p-value
ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
# Pearson correlation
ames_cor_pearson_res <- score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_cor_pearson_res@results
# Create a list
class_score_list <- list(
  ames_aov_pval_res,
  ames_cor_pearson_res
)
Disable -log10 transformation of p-values
Description
Disable -log10 transformation of p-values
Usage
dont_log_pvalues(x)
Arguments
x | A score class object. |
Value
The modified score class object with neg_log10 set to FALSE.
Fill safe value (singular)
Description
Fills in a safe value for a missing score, with an option to apply a transformation. This is a singular scoring method. See fill_safe_values() for the plural scoring method.
Arguments
x | A score class object (e.g., |
return_results | A logical value indicating whether to return results. |
Details
If transform = TRUE, by default, all score objects use the identity transformation, except the correlation score object, which uses the absolute transformation.
Value
A tibble of score results.
Examples
library(dplyr)
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
# Fill safe value
ames_aov_pval_res |> fill_safe_value(return_results = TRUE)
# Fill safe value, option to transform
ames_aov_pval_res |> fill_safe_value(return_results = TRUE, transform = TRUE)
Fill safe values (plural)
Description
Wraps bind_scores() and fills in safe values for missing scores, with an option to apply a transformation. This is a plural scoring method. See fill_safe_value() for the singular scoring method.
Arguments
x | A list. |
... | Further arguments passed to or from other methods. |
Details
If transform = TRUE, by default, all score objects use the identity transformation, except the correlation score object, which uses the absolute transformation.
Value
A tibble of scores results.
Examples
library(dplyr)
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
# ANOVA p-value
ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
# Pearson correlation
ames_cor_pearson_res <- score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_cor_pearson_res@results
# Forest importance
set.seed(42)
ames_imp_rf_reg_res <- score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_imp_rf_reg_res@results
# Information gain
ames_info_gain_reg_res <- score_info_gain |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_reg_res@results
# Create a list
class_score_list <- list(
  ames_aov_pval_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
# Fill safe values
class_score_list |> fill_safe_values()
# Fill safe value, option to transform
class_score_list |> fill_safe_values(transform = TRUE)
Rank score based on dplyr::dense_rank(), where tied values receive the same rank and ranks are without gaps (singular)
Description
Rank score based on dplyr::dense_rank(), where tied values receive the same rank and ranks are without gaps (singular)
Usage
rank_best_score_dense(x, ...)
Arguments
x | A score class object (e.g., |
... | Further arguments passed to or from other methods. |
Value
A tibble of score results.
Examples
library(dplyr)
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
# Rank score
ames_aov_pval_res |> rank_best_score_dense()
Rank score based on dplyr::min_rank(), where tied values receive the same rank and ranks are with gaps (singular)
Description
Rank score based on dplyr::min_rank(), where tied values receive the same rank and ranks are with gaps (singular)
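The difference between the two tie-handling rules can be sketched in base R. This is an illustration only, not the package's implementation: the scores are made up, and it assumes that larger scores are better and are therefore ranked first.

```r
# Illustration only: base R equivalents of the two tie-handling rules,
# assuming that larger scores are better (so they are ranked first).
scores <- c(a = 3.2, b = 5.1, c = 5.1, d = 0.7)

# Dense rank: ties share a rank and the next rank follows without a gap
dense <- match(-scores, sort(unique(-scores)))
dense
# 2 1 1 3

# Min rank: ties share the smallest rank and the next rank skips ahead
min_gap <- rank(-scores, ties.method = "min")
unname(min_gap)
# 3 1 1 4
```

With the dense rule the ranks run 1, 2, 3; with the min rule the tie at rank 1 pushes the next predictor to rank 3.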
Usage
rank_best_score_min(x, ...)
Arguments
x | A score class object (e.g., |
... | Further arguments passed to or from other methods. |
Value
A tibble of score results.
Examples
library(dplyr)
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results
# Rank score
ames_aov_pval_res |> rank_best_score_min()
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- generics
Scoring via analysis of variance hypothesis tests
Description
These two objects can be used to compute importance scores based on Analysis of Variance techniques.
Usage
score_aov_pval
score_aov_fstat
Format
An object of class filtro::class_score_aov (inherits from filtro::class_score, S7_object) of length 1.
An object of class filtro::class_score_aov (inherits from filtro::class_score, S7_object) of length 1.
Details
These objects are used when either:
The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.
In either case, a linear model (via stats::lm()) is created with the proper variable roles, and the overall p-value for the hypothesis that all means are equal is computed via the standard F-statistic. The p-value that is returned is transformed to be -log10(p_value) so that larger values are associated with more important predictors.
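As a minimal sketch of the computation described above, the overall F-test and its transformed p-value can be reproduced with base R alone. The built-in iris data and the variable names are assumptions made for illustration; this is not the package's internal code.

```r
# Illustration only: base R stand-in for the ANOVA score, not filtro internals.
# The numeric column is modelled as a function of the factor column.
fit_lm <- stats::lm(Sepal.Length ~ Species, data = iris)

# Overall F-statistic and its p-value for "all group means are equal"
aov_tab <- stats::anova(fit_lm)
f_stat <- aov_tab[["F value"]][1]
p_value <- aov_tab[["Pr(>F)"]][1]

# Larger values of -log10(p) indicate stronger evidence of a relationship
neg_log10_p <- -log10(p_value)
neg_log10_p
```

The transformation only changes the scale: a smaller p-value always maps to a larger score, which lets p-value scores be ranked in the same direction as the other scores.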
Estimating the scores
In filtro, the score_* objects define a scoring method (e.g., data input requirements, package dependencies, etc.). To compute the scores for a specific data set, the fit() method is used. The main arguments for these functions are:
object: A score class object (e.g., score_aov_pval).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.
Missing values are removed for each predictor/outcome combination being scored.
In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
Value
An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:
name: The name of the score (e.g., aov_fstat or aov_pval).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.
These data are accessed using object@results (see examples below).
See Also
Other class score metrics: score_cor_pearson, score_imp_rf, score_info_gain, score_roc_auc, score_xtab_pval_chisq
Examples
# Analysis of variance where `class` is the class predictor and the numeric
# predictors are the outcomes/responses
cell_data <- modeldata::cells
cell_data$case <- NULL
# ANOVA p-value
cell_p_val_res <- score_aov_pval |>
  fit(class ~ ., data = cell_data)
cell_p_val_res@results
# ANOVA raw p-value
natural_units <- score_aov_pval |> dont_log_pvalues()
cell_pval_natural_res <- natural_units |>
  fit(class ~ ., data = cell_data)
cell_pval_natural_res@results
# ANOVA t/F-statistic
cell_t_stat_res <- score_aov_fstat |>
  fit(class ~ ., data = cell_data)
cell_t_stat_res@results
# ---------------------------------------------------------------------------
library(dplyr)
# Analysis of variance where `chem_fp_*` are the class predictors and
# `permeability` is the numeric outcome/response
permeability <- modeldata::permeability_qsar |>
  # Make the problem a little smaller for time; use 50 predictors
  select(1:51) |>
  # Make the binary predictor columns into factors
  mutate(across(starts_with("chem_fp"), as.factor))
perm_p_val_res <- score_aov_pval |>
  fit(permeability ~ ., data = permeability)
perm_p_val_res@results
# Note that some `lm()` calls failed and are given NA score values. For
# example:
table(permeability$chem_fp_0007)
perm_t_stat_res <- score_aov_fstat |>
  fit(permeability ~ ., data = permeability)
perm_t_stat_res@results
Scoring via correlation coefficient
Description
These two objects can be used to compute importance scores based on the correlation coefficient.
Usage
score_cor_pearson
score_cor_spearman
Format
An object of class filtro::class_score_cor (inherits from filtro::class_score, S7_object) of length 1.
An object of class filtro::class_score_cor (inherits from filtro::class_score, S7_object) of length 1.
Details
These objects are used when:
The predictors are numeric and the outcome is numeric.
In this case, a correlation coefficient (via stats::cov.wt()) is computed with the proper variable roles. Values closer to 1 or -1 (i.e., abs(cor_pearson) closer to 1) are associated with more important predictors.
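A small base R sketch of the idea, assuming the built-in mtcars data and equal case weights (both chosen only for illustration), shows how stats::cov.wt() yields a correlation and why the absolute value is what matters for importance:

```r
# Illustration only: Pearson correlation via stats::cov.wt(),
# using the built-in mtcars data with equal case weights.
xy <- mtcars[, c("mpg", "wt")]
wts <- rep(1 / nrow(xy), nrow(xy))

cor_mat <- stats::cov.wt(xy, wt = wts, cor = TRUE)$cor
cor_pearson <- cor_mat["mpg", "wt"]

# With equal weights this agrees with stats::cor()
same_as_cor <- isTRUE(all.equal(cor_pearson, stats::cor(mtcars$mpg, mtcars$wt)))

# The sign is dropped when judging importance
abs_cor <- abs(cor_pearson)
```

Here the raw correlation is strongly negative, yet the predictor is highly informative, which is why the magnitude rather than the signed value is used for ranking.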
Estimating the scores
In filtro, the score_* objects define a scoring method (e.g., data input requirements, package dependencies, etc.). To compute the scores for a specific data set, the fit() method is used. The main arguments for these functions are:
object: A score class object (e.g., score_cor_pearson).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.
Missing values are removed for each predictor/outcome combination being scored.
In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
Value
An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:
name: The name of the score (e.g., score_cor_pearson or score_cor_spearman).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.
These data are accessed using object@results (see examples below).
See Also
Other class score metrics: score_aov_pval, score_imp_rf, score_info_gain, score_roc_auc, score_xtab_pval_chisq
Examples
library(dplyr)
ames <- modeldata::ames
# Pearson correlation
ames_cor_pearson_res <- score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)
ames_cor_pearson_res@results
# Spearman correlation
ames_cor_spearman_res <- score_cor_spearman |>
  fit(Sale_Price ~ ., data = ames)
ames_cor_spearman_res@results
Scoring via random forests
Description
Three different random forest models can be used to measure predictor importance.
Usage
score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique
Format
An object of class filtro::class_score_imp_rf (inherits from filtro::class_score, S7_object) of length 1.
An object of class filtro::class_score_imp_rf (inherits from filtro::class_score, S7_object) of length 1.
An object of class filtro::class_score_imp_rf (inherits from filtro::class_score, S7_object) of length 1.
Details
These objects are used when either:
The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.
In either case, a random forest, conditional random forest, or oblique random forest (via ranger::ranger(), partykit::cforest(), or aorsf::orsf()) is created with the proper variable roles, and the feature importance scores are computed. Larger values are associated with more important predictors.
When a predictor's importance score is 0, partykit::cforest() may omit its name from the results. In cases like these, a score of 0 is assigned to the missing predictors.
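The zero-filling step for omitted predictors can be pictured with a short base R sketch. The vectors `predictors` and `imp` are hypothetical stand-ins with made-up values, not filtro objects or real results.

```r
# Illustration only: fill in predictors that an importance routine omitted.
# `predictors` and `imp` are hypothetical stand-ins, not filtro objects.
predictors <- c("angle_ch_1", "area_ch_1", "avg_inten_ch_1")
imp <- c(area_ch_1 = 0.042, avg_inten_ch_1 = 0.017)  # angle_ch_1 is missing

# Start every predictor at 0, then overwrite the ones that were reported
scores <- setNames(rep(0, length(predictors)), predictors)
scores[names(imp)] <- imp
scores
```

Every predictor in the formula then has a row in the results, so downstream ranking and binding see a complete set.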
Estimating the scores
In filtro, the score_* objects define a scoring method (e.g., data input requirements, package dependencies, etc.). To compute the scores for a specific data set, the fit() method is used. The main arguments for these functions are:
object: A score class object (e.g., score_imp_rf).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.
Missing values are removed by case-wise deletion.
In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
Value
An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:
name: The name of the score (e.g., imp_rf).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.
These data are accessed using object@results (see examples below).
See Also
Other class score metrics: score_aov_pval, score_cor_pearson, score_info_gain, score_roc_auc, score_xtab_pval_chisq
Examples
library(dplyr)
# Random forests for classification task
cells_subset <- modeldata::cells |>
  # Use a small example for efficiency
  dplyr::select(
    class, angle_ch_1, area_ch_1, avg_inten_ch_1, avg_inten_ch_2, avg_inten_ch_3
  ) |>
  slice(1:50)
# Random forest
set.seed(42)
cells_imp_rf_res <- score_imp_rf |>
  fit(class ~ ., data = cells_subset)
cells_imp_rf_res@results
# Conditional random forest
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 10)
cells_imp_rf_conditional_res@results
# Oblique random forest
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset)
cells_imp_rf_oblique_res@results
# ----------------------------------------------------------------------------
# Random forests for regression task
ames_subset <- modeldata::ames |>
  # Use a small example for efficiency
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  ) |>
  slice(1:50)
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
set.seed(42)
ames_imp_rf_regression_task_res <- score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_imp_rf_regression_task_res@results
Scoring via entropy-based filters
Description
Three different information theory (entropy) scores can be computed.
Usage
score_info_gain
score_gain_ratio
score_sym_uncert
Format
An object of class filtro::class_score_info_gain (inherits from filtro::class_score, S7_object) of length 1.
An object of class filtro::class_score_info_gain (inherits from filtro::class_score, S7_object) of length 1.
An object of class filtro::class_score_info_gain (inherits from filtro::class_score, S7_object) of length 1.
Details
These objects are used when either:
The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.
In either case, an entropy-based filter (via FSelectorRcpp::information_gain()) is applied with the proper variable roles. Depending on the chosen method, information gain, gain ratio, or symmetrical uncertainty is computed. Larger values are associated with more important predictors.
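To make the three quantities concrete, here is a base R sketch using the usual textbook definitions for two factors. It is an illustration under stated assumptions, not the FSelectorRcpp implementation (which, among other things, discretizes numeric columns first); the helper `entropy()` and the toy data are hypothetical.

```r
# Illustration only: entropy-based scores for two factors in base R.
# `entropy()` is a hypothetical helper, not part of filtro or FSelectorRcpp.
entropy <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]
  -sum(p * log(p))
}

outcome <- factor(c("yes", "yes", "no", "no", "yes", "no"))
predictor <- factor(c("a", "a", "b", "b", "a", "b"))

h_y <- entropy(outcome)
h_x <- entropy(predictor)
h_xy <- entropy(interaction(predictor, outcome, drop = TRUE))

# Information gain: reduction in outcome entropy from knowing the predictor
info_gain <- h_y + h_x - h_xy
# Gain ratio: information gain normalized by the predictor's own entropy
gain_ratio <- info_gain / h_x
# Symmetrical uncertainty: normalized to lie between 0 and 1
sym_uncert <- 2 * info_gain / (h_x + h_y)
```

In this toy case the predictor determines the outcome exactly, so the information gain equals the outcome entropy and both normalized scores equal 1.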
Estimating the scores
In filtro, the score_* objects define a scoring method (e.g., data input requirements, package dependencies, etc.). To compute the scores for a specific data set, the fit() method is used. The main arguments for these functions are:
object: A score class object (e.g., score_info_gain).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.
Missing values are removed for each predictor/outcome combination being scored.
In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
Value
An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:
name: The name of the score (e.g., info_gain).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.
These data are accessed using object@results (see examples below).
See Also
Other class score metrics: score_aov_pval, score_cor_pearson, score_imp_rf, score_roc_auc, score_xtab_pval_chisq
Examples
library(dplyr)
# Entropy-based filter for classification tasks
cells_subset <- modeldata::cells |>
  dplyr::select(
    class, angle_ch_1, area_ch_1, avg_inten_ch_1, avg_inten_ch_2, avg_inten_ch_3
  )
# Information gain
cells_info_gain_res <- score_info_gain |>
  fit(class ~ ., data = cells_subset)
cells_info_gain_res@results
# Gain ratio
cells_gain_ratio_res <- score_gain_ratio |>
  fit(class ~ ., data = cells_subset)
cells_gain_ratio_res@results
# Symmetrical uncertainty
cells_sym_uncert_res <- score_sym_uncert |>
  fit(class ~ ., data = cells_subset)
cells_sym_uncert_res@results
# ----------------------------------------------------------------------------
# Entropy-based filter for regression tasks
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
regression_task <- score_info_gain
regression_task@mode <- "regression"
ames_info_gain_regression_task_res <- regression_task |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_regression_task_res@results
Scoring via area under the Receiver Operating Characteristic curve (ROC AUC)
Description
The area under the ROC curve can be used to measure predictor importance.
Usage
score_roc_auc
Format
An object of class filtro::class_score_roc_auc (inherits from filtro::class_score, S7_object) of length 1.
Details
This object is used when either:
The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.
In either case, a ROC curve (via pROC::roc() or pROC::multiclass.roc()) is created with the proper variable roles, and the area under the ROC curve is computed (via pROC::auc()). Values higher than 0.5 (i.e., max(roc_auc, 1 - roc_auc) > 0.5) are associated with more important predictors.
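The folding around 0.5 can be illustrated without pROC by using the rank-sum (Mann-Whitney) identity for the binary AUC. The helper `auc_rank()` and the toy data are hypothetical and serve only as a sketch of the idea.

```r
# Illustration only: AUC from the rank-sum (Mann-Whitney) identity in base R,
# not a call to pROC. `auc_rank()` is a hypothetical helper.
auc_rank <- function(x, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(x)  # average ranks handle ties
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

x <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2)
is_positive <- c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)

roc_auc <- auc_rank(x, is_positive)

# A predictor that separates the classes in the opposite direction is just
# as informative, so the value is folded around 0.5
folded <- max(roc_auc, 1 - roc_auc)
reversed <- auc_rank(-x, is_positive)
folded_reversed <- max(reversed, 1 - reversed)
```

Negating the predictor flips the AUC to one minus its value, yet both versions fold to the same number, so direction does not affect the importance ordering.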
Estimating the scores
In filtro, the score_* objects define a scoring method (e.g., data input requirements, package dependencies, etc.). To compute the scores for a specific data set, the fit() method is used. The main arguments for these functions are:
object: A score class object (e.g., score_roc_auc).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights. NOTE: case weights cannot be used when a multiclass ROC is computed.
Missing values are removed for each predictor/outcome combination being scored.
In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
Value
An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:
name: The name of the score (e.g., roc_auc).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.
These data are accessed using object@results (see examples below).
See Also
Other class score metrics: score_aov_pval, score_cor_pearson, score_imp_rf, score_info_gain, score_xtab_pval_chisq
Examples
library(dplyr)
# ROC AUC where the numeric predictors are the predictors and
# `class` is the class outcome/response
cells_subset <- modeldata::cells |>
  dplyr::select(
    class, angle_ch_1, area_ch_1, avg_inten_ch_1, avg_inten_ch_2, avg_inten_ch_3
  )
cells_roc_auc_res <- score_roc_auc |>
  fit(class ~ ., data = cells_subset)
cells_roc_auc_res@results
# ----------------------------------------------------------------------------
# ROC AUC where `Sale_Price` is the numeric predictor and the class predictors
# are the outcomes/responses
ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price, MS_SubClass, MS_Zoning, Lot_Frontage, Lot_Area, Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
ames_roc_auc_res <- score_roc_auc |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_roc_auc_res@results
# TODO Add multiclass example
Scoring via the chi-squared test or Fisher's exact test
Description
These two objects can be used to compute importance scores based on the chi-squared test or Fisher's exact test.
Usage
score_xtab_pval_chisq
score_xtab_pval_fisher
Format
An object of class filtro::class_score_xtab (inherits from filtro::class_score, S7_object) of length 1.
An object of class filtro::class_score_xtab (inherits from filtro::class_score, S7_object) of length 1.
Details
These objects are used when:
The predictors are factors and the outcome is a factor.
In this case, a contingency table (via table()) is created with the proper variable roles, and the cross tabulation p-value is computed using either the chi-squared test (via stats::chisq.test()) or Fisher's exact test (via stats::fisher.test()). The p-value that is returned is transformed to be -log10(p_value) so that larger values are associated with more important predictors.
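The same steps can be walked through in base R. This sketch uses two binary columns from the built-in mtcars data purely for illustration; it is not the package's internal code.

```r
# Illustration only: cross-tabulation p-values in base R with mtcars.
tab <- table(am = factor(mtcars$am), vs = factor(mtcars$vs))

# p-values from the two tests named above
p_chisq <- stats::chisq.test(tab)$p.value
p_fisher <- stats::fisher.test(tab)$p.value

# Transform so that larger values mean more important predictors
score_chisq <- -log10(p_chisq)
score_fisher <- -log10(p_fisher)
c(chisq = score_chisq, fisher = score_fisher)
```

Fisher's exact test avoids the large-sample approximation behind the chi-squared test, which can matter when some cells of the table have small counts.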
Estimating the scores
In filtro, the score_* objects define a scoring method (e.g., data input requirements, package dependencies, etc.). To compute the scores for a specific data set, the fit() method is used. The main arguments for these functions are:
object: A score class object (e.g., score_xtab_pval_chisq).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.
Missing values are removed for each predictor/outcome combination being scored.
In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
Value
An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:
name: The name of the score (e.g., pval_chisq).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.
These data are accessed using object@results (see examples below).
See Also
Other class score metrics: score_aov_pval, score_cor_pearson, score_imp_rf, score_info_gain, score_roc_auc
Examples
# Binary factor example
library(titanic)
library(dplyr)
titanic_subset <- titanic_train |>
  mutate(across(c(Survived, Pclass, Sex, Embarked), as.factor)) |>
  select(Survived, Pclass, Sex, Age, Fare, Embarked)
# Chi-squared test
titanic_xtab_pval_chisq_res <- score_xtab_pval_chisq |>
  fit(Survived ~ ., data = titanic_subset)
titanic_xtab_pval_chisq_res@results
# Chi-squared test adjusted p-values
titanic_xtab_pval_chisq_p_adj_res <- score_xtab_pval_chisq |>
  fit(Survived ~ ., data = titanic_subset, adjustment = "BH")
# Fisher's exact test
titanic_xtab_pval_fisher_res <- score_xtab_pval_fisher |>
  fit(Survived ~ ., data = titanic_subset)
titanic_xtab_pval_fisher_res@results
# Chi-squared test where `class` is the multiclass outcome/response
hpc_subset <- modeldata::hpc_data |>
  dplyr::select(class, protocol, hour)
hpc_xtab_pval_chisq_res <- score_xtab_pval_chisq |>
  fit(class ~ ., data = hpc_subset)
hpc_xtab_pval_chisq_res@results
Show best desirability scores, based on number of predictors (plural)
Description
Similar to show_best_desirability_prop(), which can simultaneously optimize multiple scores using desirability functions. See show_best_score_num() for a single scoring method.
Usage
show_best_desirability_num(x, ..., num_terms = 5)
Arguments
x | A tibble or data frame returned by |
... | One or more desirability selectors to configure the optimization. |
num_terms | An integer value specifying the number of predictors to consider. |
Details
See show_best_desirability_prop() for details.
Value
A tibble with num_terms rows. When showing the results, the metrics are presented in "wide format" (one column per metric) and there are new columns for the corresponding desirability values (each starts with .d_).
Examples
library(desirability2)
library(dplyr)

# Remove outcome
ames_scores_results <- ames_scores_results |>
  dplyr::select(-outcome)
ames_scores_results

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1)
)

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf)
)

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain)
)

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain),
  num_terms = 2
)

show_best_desirability_num(
  ames_scores_results,
  target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
)

show_best_desirability_num(
  ames_scores_results,
  constrain(cor_pearson, low = 0.2, high = 1)
)

Show best desirability scores, based on proportion of predictors (plural)
Description
Analogous to, and adapted from, desirability2::show_best_desirability(), which can simultaneously optimize multiple scores using desirability functions. See show_best_score_prop() for a single filtering method.
Usage
show_best_desirability_prop(x, ..., prop_terms = 1)
Arguments
x | A tibble or data frame returned by |
... | One or more desirability selectors to configure the optimization. |
prop_terms | A numeric value specifying the proportion of predictors to consider. |
Details
Desirability functions might help when selecting the best model based on more than one performance metric. The user creates a desirability function to map values of a metric to a [0, 1] range where 1.0 is most desirable and zero is unacceptable. After constructing these for the metric of interest, the overall desirability is computed using the geometric mean of the individual desirabilities.
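As a worked form of the geometric mean mentioned above: the symbols here are illustrative assumptions, not notation used by the package. Suppose q metrics are selected and d_j is the desirability of the j-th metric for a given predictor, with each d_j in [0, 1].

```latex
D = \left( \prod_{j=1}^{q} d_j \right)^{1/q},
\qquad 0 \le d_j \le 1 .
```

Because the individual values are multiplied, a single d_j of zero forces the overall desirability D to zero, however well the predictor does on the other metrics.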
The verbs that can be used in ... (and their arguments) are:
maximize() when larger values are better, such as the area under the ROC score.
minimize() for metrics such as RMSE or the Brier score.
target() for cases when a specific value of the metric is important.
constrain() is used when there is a range of values that are equally desirable.
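The following base R sketch illustrates the idea behind a "larger is better" desirability combined through a geometric mean. It is not the implementation used by filtro or desirability2: the helper d_max_sketch() is hypothetical, it assumes a simple linear ramp between low and high, and the score values are made up for illustration.

```r
# Hypothetical helper (not part of filtro or desirability2): a linear
# "larger is better" desirability that is 0 at or below `low` and
# 1 at or above `high`.
d_max_sketch <- function(x, low, high) {
  pmin(pmax((x - low) / (high - low), 0), 1)
}

# Made-up scores for three predictors, for illustration only.
cor_pearson <- c(0.10, 0.50, 0.90)
imp_rf <- c(2, 8, 5)

# Map each metric to the [0, 1] range.
d_cor <- d_max_sketch(cor_pearson, low = 0, high = 1)
d_imp <- d_max_sketch(imp_rf, low = min(imp_rf), high = max(imp_rf))

# Overall desirability: geometric mean of the two individual desirabilities.
d_overall <- (d_cor * d_imp)^(1 / 2)

# Index of the predictor with the highest overall desirability.
best <- which.max(d_overall)
```

In this sketch the first predictor has the lowest importance, so its importance desirability is zero and its overall desirability is zero as well, regardless of its correlation.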
Value
A tibble containing the proportion of rows given by prop_terms. When showing the results, the metrics are presented in "wide format" (one column per metric) and there are new columns for the corresponding desirability values (each starts with .d_).
Examples
library(desirability2)
library(dplyr)

# Remove outcome
ames_scores_results <- ames_scores_results |>
  dplyr::select(-outcome)
ames_scores_results

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1)
)

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf)
)

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain)
)

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain),
  prop_terms = 0.2
)

show_best_desirability_prop(
  ames_scores_results,
  target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
)

show_best_desirability_prop(
  ames_scores_results,
  constrain(cor_pearson, low = 0.2, high = 1)
)

Show best score, based on cutoff value (singular)
Description
Show best score, based on cutoff value (singular)
Arguments
x | A score class object (e.g., |
... | Further arguments passed to or from other methods. |
cutoff | A numeric value specifying the cutoff value. |
target | A numeric value specifying the target value. The default of |
Value
A tibble of score results.
Examples
library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_cutoff(cutoff = 130)

Show best score, based on number or proportion of predictors with optional cutoff value (singular)
Description
Show best score, based on number or proportion of predictors with optional cutoff value (singular)
Arguments
x | A score class object (e.g., |
... | Further arguments passed to or from other methods. |
prop_terms | A numeric value specifying the proportion of predictors to consider. |
num_terms | An integer value specifying the number of predictors to consider. |
cutoff | A numeric value specifying the cutoff value. |
Value
A tibble of score results.
Examples
library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_dual(prop_terms = 0.5)
ames_aov_pval_res |> show_best_score_dual(prop_terms = 0.5, cutoff = 130)
ames_aov_pval_res |> show_best_score_dual(num_terms = 2)
ames_aov_pval_res |> show_best_score_dual(num_terms = 2, cutoff = 130)

Show best score, based on number of predictors (singular)
Description
Show best score, based on number of predictors (singular)
Arguments
x | A score class object (e.g., |
... | Further arguments passed to or from other methods. |
num_terms | An integer value specifying the number of predictors to consider. |
Value
A tibble of score results.
Examples
library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_num(num_terms = 2)

Show best score, based on proportion of predictors (singular)
Description
Show best score, based on proportion of predictors (singular)
Arguments
x | A score class object (e.g., |
... | Further arguments passed to or from other methods. |
prop_terms | A numeric value specifying the proportion of predictors to consider. |
Value
A tibble of score results.
Examples
library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <- score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)