Title:

Feature Selection Using Supervised Filter-Based Methods

Version:

0.2.0

Description:

Tidy tools to apply filter-based supervised feature selection methods. These methods score and rank feature relevance using metrics such as p-values, correlation, and importance scores (Kuhn and Johnson (2019) <doi:10.1201/9781315108230>).

License:

MIT + file LICENSE

URL:

https://github.com/tidymodels/filtro, https://filtro.tidymodels.org/

BugReports:

https://github.com/tidymodels/filtro/issues

Depends:

R (≥ 4.1)

Imports:

cli, desirability2 (≥ 0.1.0), dplyr, generics, pROC, purrr, rlang (≥ 1.1.0), S7, stats, tibble, tidyr, vctrs

Suggests:

aorsf, FSelectorRcpp, knitr, modeldata, partykit, quarto, ranger, rmarkdown, testthat (≥ 3.0.0), titanic

Config/Needs/website:

tidyverse/tidytemplate

Config/testthat/edition:

Encoding:

UTF-8

RoxygenNote:

7.3.2

Collate:

'aaa.R' 'class_score.R' 'data.R' 'desirability2.R' 'filtro-package.R' 'import-standalone-obj-type.R' 'import-standalone-types-check.R' 'misc.R' 'score-aov.R' 'utilities.R' 'score-cor.R' 'score-cross_tab.R' 'score-forest_imp.R' 'score-info_gain.R' 'score-roc_auc.R' 'zzz.R'

LazyData:

true

VignetteBuilder:

quarto, knitr

NeedsCompilation:

Packaged:

2025-08-26 21:19:41 UTC; franceslin

Author:

Frances Lin [aut, cre], Max Kuhn [aut], Emil Hvitfeldt [aut], Posit Software, PBC [cph, fnd]

Maintainer:

Frances Lin <franceslinyc@gmail.com>

Repository:

CRAN

Date/Publication:

2025-08-26 21:40:02 UTC

filtro: Feature Selection Using Supervised Filter-Based Methods

Author(s)

Maintainer: Frances Lin <franceslinyc@gmail.com>

Authors:

Max Kuhn <max@posit.co> (ORCID)
Emil Hvitfeldt <emil.hvitfeldt@posit.co>

Other contributors:

Posit Software, PBC (03wc8by49) [copyright holder, funder]

Ames example score results

Description

This data set contains example score results for the Ames data. It is used to demonstrate show_best_desirability_prop() and was created by the examples in fill_safe_values().

Value

ames_scores_results

a tibble

Examples

data(ames_scores_results)

Arrange score

Description

Arrange score

Arguments

x

A score class object (e.g., score_cor_pearson).

...

Further arguments passed to or from other methods.

target

A numeric value specifying the target value. The default of NULL indicates that there is no target value.

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Arrange score
ames_aov_pval_res |> arrange_score()

Bind score class objects, including their associated metadata and scores

Description

Binds multiple score class objects (e.g., score_*), including their associated metadata and scores. See fill_safe_values() for binding with safe-value handling.

Arguments

x

A list.

...

Further arguments passed to or from other methods.

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ANOVA p-value
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_cor_pearson_res@results

# Forest importance
set.seed(42)
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_imp_rf_reg_res@results

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_reg_res@results

# Create a list
class_score_list <- list(
  ames_aov_pval_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)

# Bind scores
class_score_list |> bind_scores()

General S7 classes for scoring objects

Description

class_score is an S7 object that contains slots for the characteristics of predictor importance scores. More specific classes for individual methods are based on this object (shown below).

Usage

class_score(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame()
)

class_score_aov(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  neg_log10 = TRUE
)

class_score_cor(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame()
)

class_score_xtab(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  neg_log10 = TRUE
)

class_score_imp_rf(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  engine = "ranger"
)

class_score_info_gain(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame(),
  mode = "classification"
)

class_score_roc_auc(
  outcome_type = c("numeric", "factor"),
  predictor_type = c("numeric", "factor"),
  case_weights = logical(0),
  range = integer(0),
  inclusive = logical(0),
  fallback_value = integer(0),
  score_type = character(0),
  transform_fn = function() NULL,
  direction = character(0),
  deterministic = logical(0),
  tuning = logical(0),
  calculating_fn = function() NULL,
  label = character(0),
  packages = character(0),
  results = data.frame()
)

S7 subclass of base R's `list` for method dispatch

Description

class_score_list is an S7 subclass of base R's S3 list, used for method dispatch in bind_scores() and fill_safe_values().

Usage

class_score_list

Format

An object of class S7_S3_class of length 3.

Value

A list of S7 objects.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ANOVA p-value
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_cor_pearson_res@results

# Create a list
class_score_list <- list(
  ames_aov_pval_res,
  ames_cor_pearson_res
)

Disable -log10 transformation of p-values

Description

Disable -log10 transformation of p-values

Usage

dont_log_pvalues(x)

Arguments

x

A score class object.

Value

The modified score class object with neg_log10 set to FALSE.

Fill safe value (singular)

Description

Fills in a safe value for a missing score, with an option to apply a transformation. This is a singular scoring method. See fill_safe_values() for the plural scoring method.

Arguments

x

A score class object (e.g., score_cor_pearson).

return_results

A logical value indicating whether to return results.

Details

If transform = TRUE, by default, all score objects use the identity transformation, except the correlation score object, which uses the absolute value transformation.
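The idea can be sketched in base R. This is only an illustration under stated assumptions: fill_and_transform() and the fallback value of 1 are hypothetical, and the order of operations (fill first, then transform) is assumed rather than taken from the package source.

```r
# Hypothetical helper, not part of filtro: replace missing scores with a
# fallback value, then optionally apply a transformation.
fill_and_transform <- function(score, fallback, transform_fn = identity) {
  score[is.na(score)] <- fallback
  transform_fn(score)
}

# Correlation-like scores with one failed computation (NA)
cor_scores <- c(-0.8, NA, 0.3)

# Identity transformation: only the NA is replaced
filled <- fill_and_transform(cor_scores, fallback = 1)

# Absolute value transformation, as described for correlation scores
filled_abs <- fill_and_transform(cor_scores, fallback = 1, transform_fn = abs)
```

In the package itself, the fallback comes from the score object (see the fallback_value property of class_score), so no value needs to be supplied by hand.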

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Fill safe value
ames_aov_pval_res |> fill_safe_value(return_results = TRUE)

# Fill safe value, option to transform
ames_aov_pval_res |> fill_safe_value(return_results = TRUE, transform = TRUE)

Fill safe values (plural)

Description

Wraps bind_scores() and fills in safe values for missing scores, with an option to apply a transformation. This is a plural scoring method. See fill_safe_value() for the singular scoring method.

Arguments

x

A list.

...

Further arguments passed to or from other methods.

Details

If transform = TRUE, by default, all score objects use the identity transformation, except the correlation score object, which uses the absolute value transformation.

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ANOVA p-value
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_cor_pearson_res@results

# Forest importance
set.seed(42)
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_imp_rf_reg_res@results

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_reg_res@results

# Create a list
class_score_list <- list(
  ames_aov_pval_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)

# Fill safe values
class_score_list |> fill_safe_values()

# Fill safe values, option to transform
class_score_list |> fill_safe_values(transform = TRUE)

Rank score based on `dplyr::dense_rank()`, where tied values receive the same rank and ranks are without gaps (singular)

Description

Rank score based on dplyr::dense_rank(), where tied values receive the same rank and ranks are without gaps (singular)

Usage

rank_best_score_dense(x, ...)

Arguments

x

A score class object (e.g., score_cor_pearson).

...

Further arguments passed to or from other methods.

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Rank score
ames_aov_pval_res |> rank_best_score_dense()

Rank score based on `dplyr::min_rank()`, where tied values receive the same rank and ranks are with gaps (singular)

Description

Rank score based on dplyr::min_rank(), where tied values receive the same rank and ranks are with gaps (singular)
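The difference between ranking with gaps and ranking without gaps is easiest to see on a small vector with a tie. The sketch below uses base R equivalents instead of dplyr so that it runs on its own. It assumes that larger scores are better, so the scores are negated before ranking. The example values are made up.

```r
# Example scores, assuming larger values are better; two values are tied
scores <- c(a = 0.9, b = 0.7, c = 0.7, d = 0.2)

# Min rank: ties share the smallest rank, and the following rank is skipped
rank_min <- rank(-scores, ties.method = "min")

# Dense rank: ties share a rank, and the ranks stay consecutive
rank_dense <- match(-scores, sort(unique(-scores)))
names(rank_dense) <- names(scores)

rank_min
rank_dense
```

Here the tied predictors b and c share rank 2 in both schemes. Predictor d then receives rank 4 under min ranking and rank 3 under dense ranking.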

Usage

rank_best_score_min(x, ...)

Arguments

x

A score class object (e.g., score_cor_pearson).

...

Further arguments passed to or from other methods.

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Rank score
ames_aov_pval_res |> rank_best_score_min()

Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

generics: fit, required_pkgs

Scoring via analysis of variance hypothesis tests

Description

These two objects can be used to compute importance scores based on Analysis of Variance techniques.

Usage

score_aov_pval
score_aov_fstat

Format

An object of class filtro::class_score_aov (inherits from filtro::class_score, S7_object) of length 1.

Details

These objects are used when either:

The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.

In either case, a linear model (via stats::lm()) is created with the proper variable roles, and the overall p-value for the hypothesis that all means are equal is computed via the standard F-statistic. The p-value that is returned is transformed to be -log10(p_value) so that larger values are associated with more important predictors.
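As a rough illustration of this computation, the following base R sketch fits a linear model for one factor predictor and one numeric outcome, extracts the overall F-test p-value, and applies the -log10 transformation. It uses the built-in iris data and is not the package's internal code. The object names are chosen for the example only.

```r
# One factor predictor and a numeric outcome from a built-in data set
mod <- stats::lm(Sepal.Length ~ Species, data = iris)

# Overall F-statistic with its numerator and denominator degrees of freedom
f <- summary(mod)$fstatistic

# The upper-tail probability of the F distribution gives the overall p-value
p_value <- stats::pf(
  f[["value"]], f[["numdf"]], f[["dendf"]],
  lower.tail = FALSE
)

# Larger scores correspond to smaller p-values
score <- -log10(p_value)
score
```

On this scale a p-value of 0.05 corresponds to a score of about 1.3, and each additional unit is a tenfold smaller p-value.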

Estimating the scores

In filtro, the score_* objects define a scoring method (e.g., data input requirements, package dependencies, etc.). To compute the scores for a specific data set, the fit() method is used. The main arguments for these functions are:

object: A score class object (e.g., score_aov_pval).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.

Missing values are removed for each predictor/outcome combination being scored.

In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
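The two behaviours described above (pairwise removal of missing values and a missing score when a computation fails) can be sketched in base R as follows. The helper score_pair() is hypothetical and only mimics the described behaviour. It is not filtro's implementation.

```r
# Hypothetical helper, not part of filtro
score_pair <- function(outcome, predictor) {
  # Drop rows where either the outcome or this predictor is missing
  keep <- stats::complete.cases(outcome, predictor)
  y <- outcome[keep]
  x <- predictor[keep]
  tryCatch(
    {
      mod <- stats::lm(y ~ x)
      -log10(stats::anova(mod)[["Pr(>F)"]][1])
    },
    # A failed computation yields a missing score instead of an error
    error = function(e) NA_real_
  )
}

y <- iris$Sepal.Length
y[1:5] <- NA

# Works: the five incomplete rows are dropped for this pair only
ok_score <- score_pair(y, iris$Species)

# Fails inside lm(): a factor with a single level has no contrasts
failed_score <- score_pair(y, factor(rep("only_level", length(y))))
```

Handling each predictor separately means that one problematic column does not stop the remaining predictors from being scored.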

Value

An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:

name: The name of the score (e.g., aov_fstat or aov_pval).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.

These data are accessed using object@results (see examples below).

Examples

# Analysis of variance where `class` is the class predictor and the numeric
# predictors are the outcomes/responses

cell_data <- modeldata::cells
cell_data$case <- NULL

# ANOVA p-value
cell_p_val_res <-
  score_aov_pval |>
  fit(class ~ ., data = cell_data)
cell_p_val_res@results

# ANOVA raw p-value
natural_units <- score_aov_pval |> dont_log_pvalues()
cell_pval_natural_res <-
  natural_units |>
  fit(class ~ ., data = cell_data)
cell_pval_natural_res@results

# ANOVA t/F-statistic
cell_t_stat_res <-
  score_aov_fstat |>
  fit(class ~ ., data = cell_data)
cell_t_stat_res@results

# ---------------------------------------------------------------------------
library(dplyr)

# Analysis of variance where `chem_fp_*` are the class predictors and
# `permeability` is the numeric outcome/response

permeability <-
  modeldata::permeability_qsar |>
  # Make the problem a little smaller for time; use 50 predictors
  select(1:51) |>
  # Make the binary predictor columns into factors
  mutate(across(starts_with("chem_fp"), as.factor))

perm_p_val_res <-
  score_aov_pval |>
  fit(permeability ~ ., data = permeability)
perm_p_val_res@results

# Note that some `lm()` calls failed and are given NA score values. For
# example:
table(permeability$chem_fp_0007)

perm_t_stat_res <-
  score_aov_fstat |>
  fit(permeability ~ ., data = permeability)
perm_t_stat_res@results

Scoring via correlation coefficient

Description

These two objects can be used to compute importance scores based on correlation coefficients.

Usage

score_cor_pearson
score_cor_spearman

Format

An object of class filtro::class_score_cor (inherits from filtro::class_score, S7_object) of length 1.

Details

These objects are used when:

The predictors are numeric and the outcome is numeric.

In this case, a correlation coefficient (via stats::cov.wt()) is computed with the proper variable roles. Values closer to 1 or -1 (i.e., abs(cor_pearson) closer to 1) are associated with more important predictors.
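A minimal sketch of a (weighted) Pearson correlation with stats::cov.wt() is shown below. It uses the built-in mtcars data with equal case weights, so the result should match stats::cor(). The variable names are assumptions made for the example, and how the package prepares its inputs (for instance for the Spearman variant) is not shown here.

```r
# Outcome and one numeric predictor from a built-in data set
xy <- cbind(outcome = mtcars$mpg, predictor = mtcars$wt)

# Equal case weights; unequal weights could be supplied here instead
w <- rep(1, nrow(xy))

# cor = TRUE asks cov.wt() to return the weighted correlation matrix as well
wt_fit <- stats::cov.wt(xy, wt = w / sum(w), cor = TRUE)
r <- wt_fit$cor["outcome", "predictor"]

# The sign is irrelevant for importance, so compare absolute values
importance <- abs(r)
importance
```

Taking the absolute value is what lets a strongly negative correlation rank as highly as a strongly positive one.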

Estimating the scores

object: A score class object (e.g., score_cor_pearson).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.

Missing values are removed for each predictor/outcome combination being scored.

In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.

Value

An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:

name: The name of the score (e.g., cor_pearson or cor_spearman).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.

These data are accessed using object@results (see examples below).

Examples

library(dplyr)

ames <- modeldata::ames

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)
ames_cor_pearson_res@results

# Spearman correlation
ames_cor_spearman_res <-
  score_cor_spearman |>
  fit(Sale_Price ~ ., data = ames)
ames_cor_spearman_res@results

Scoring via random forests

Description

Three different random forest models can be used to measure predictor importance.

Usage

score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique

Format

An object of class filtro::class_score_imp_rf (inherits from filtro::class_score, S7_object) of length 1.

Details

These objects are used when either:

The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.

In either case, a random forest, conditional random forest, or oblique random forest (via ranger::ranger(), partykit::cforest(), or aorsf::orsf()) is created with the proper variable roles, and the feature importance scores are computed. Larger values are associated with more important predictors.

When a predictor's importance score is 0, partykit::cforest() may omit its name from the results. In cases like these, a score of 0 is assigned to the missing predictors.

Estimating the scores

object: A score class object (e.g., score_imp_rf).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.

Missing values are removed by case-wise deletion.

In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.

Value

An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:

name: The name of the score (e.g., imp_rf).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.

These data are accessed using object@results (see examples below).

Examples

library(dplyr)

# Random forests for classification task

cells_subset <- modeldata::cells |>
  # Use a small example for efficiency
  dplyr::select(
    class,
    angle_ch_1,
    area_ch_1,
    avg_inten_ch_1,
    avg_inten_ch_2,
    avg_inten_ch_3
  ) |>
  slice(1:50)

# Random forest
set.seed(42)
cells_imp_rf_res <- score_imp_rf |>
  fit(class ~ ., data = cells_subset)
cells_imp_rf_res@results

# Conditional random forest
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 10)
cells_imp_rf_conditional_res@results

# Oblique random forest
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset)
cells_imp_rf_oblique_res@results

# ----------------------------------------------------------------------------

# Random forests for regression task

ames_subset <- modeldata::ames |>
  # Use a small example for efficiency
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  ) |>
  slice(1:50)
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

set.seed(42)
ames_imp_rf_regression_task_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_imp_rf_regression_task_res@results

Scoring via entropy-based filters

Description

Three different information theory (entropy) scores can be computed.

Usage

score_info_gain
score_gain_ratio
score_sym_uncert

Format

An object of class filtro::class_score_info_gain (inherits from filtro::class_score, S7_object) of length 1.

Details

These objects are used when either:

The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.

In either case, an entropy-based filter (via FSelectorRcpp::information_gain()) is applied with the proper variable roles. Depending on the chosen method, information gain, gain ratio, or symmetrical uncertainty is computed. Larger values are associated with more important predictors.
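For intuition, the three quantities can be written in terms of Shannon entropies of a predictor X and a class Y. The base R sketch below uses two categorical columns from mtcars and natural logarithms. The helper entropy() is hypothetical, and the sketch skips the discretization that would be needed for numeric predictors, so it is not a substitute for FSelectorRcpp::information_gain().

```r
# Hypothetical helper: Shannon entropy (natural log) of one or more factors
entropy <- function(...) {
  p <- prop.table(table(...))
  p <- p[p > 0]
  -sum(p * log(p))
}

x <- factor(mtcars$cyl)  # predictor (attribute)
y <- factor(mtcars$am)   # class (outcome)

h_x <- entropy(x)
h_y <- entropy(y)
h_xy <- entropy(x, y)    # joint entropy

# Information gain: reduction in class entropy from knowing the predictor
info_gain <- h_x + h_y - h_xy

# Gain ratio: information gain normalized by the predictor's entropy
gain_ratio <- info_gain / h_x

# Symmetrical uncertainty: normalized to the range [0, 1]
sym_uncert <- 2 * info_gain / (h_x + h_y)
```

The two normalized versions reduce the tendency of plain information gain to favour predictors with many distinct levels.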

Estimating the scores

object: A score class object (e.g., score_info_gain).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.

Missing values are removed for each predictor/outcome combination being scored.

In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.

Value

An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:

name: The name of the score (e.g., info_gain).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.

These data are accessed using object@results (see examples below).

Examples

library(dplyr)

# Entropy-based filter for classification tasks

cells_subset <- modeldata::cells |>
  dplyr::select(
    class,
    angle_ch_1,
    area_ch_1,
    avg_inten_ch_1,
    avg_inten_ch_2,
    avg_inten_ch_3
  )

# Information gain
cells_info_gain_res <- score_info_gain |>
  fit(class ~ ., data = cells_subset)
cells_info_gain_res@results

# Gain ratio
cells_gain_ratio_res <- score_gain_ratio |>
  fit(class ~ ., data = cells_subset)
cells_gain_ratio_res@results

# Symmetrical uncertainty
cells_sym_uncert_res <- score_sym_uncert |>
  fit(class ~ ., data = cells_subset)
cells_sym_uncert_res@results

# ----------------------------------------------------------------------------

# Entropy-based filter for regression tasks

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

regression_task <- score_info_gain
regression_task@mode <- "regression"

ames_info_gain_regression_task_res <-
  regression_task |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_regression_task_res@results

Scoring via area under the Receiver Operating Characteristic curve (ROC AUC)

Description

The area under the ROC curve can be used to measure predictor importance.

Usage

score_roc_auc

Format

An object of class filtro::class_score_roc_auc (inherits from filtro::class_score, S7_object) of length 1.

Details

This object is used when either:

The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.

In either case, a ROC curve (via pROC::roc() or pROC::multiclass.roc()) is created with the proper variable roles, and the area under the ROC curve is computed (via pROC::auc()). Values higher than 0.5 (i.e., max(roc_auc, 1 - roc_auc) > 0.5) are associated with more important predictors.
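The folding expression max(roc_auc, 1 - roc_auc) treats a predictor that separates the classes in the "wrong" direction as just as informative as one that separates them in the "right" direction. The base R sketch below illustrates this with a rank-based AUC for a two-class outcome. The helper auc_rank() is hypothetical and is not how pROC computes the curve.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic,
# treating the second factor level as the "positive" class
auc_rank <- function(x, y) {
  pos <- x[y == levels(y)[2]]
  neg <- x[y == levels(y)[1]]
  r <- rank(c(pos, neg))
  u <- sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2
  u / (length(pos) * length(neg))
}

y <- factor(c("a", "a", "b", "b"))

# Larger predictor values go with class "b": perfect separation
auc_up <- auc_rank(c(1, 2, 3, 4), y)

# Same separation, opposite direction
auc_down <- auc_rank(c(4, 3, 2, 1), y)

# Folding makes both directions equally important
folded <- max(auc_down, 1 - auc_down)
```

A folded value near 0.5 indicates a predictor that does not separate the classes in either direction.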

Estimating the scores

object: A score class object (e.g., score_roc_auc).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights. Note that case weights cannot be used when a multiclass ROC is computed.

Missing values are removed for each predictor/outcome combination being scored.

In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.

Value

An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:

name: The name of the score (e.g., roc_auc).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.

These data are accessed using object@results (see examples below).

Examples

library(dplyr)

# ROC AUC where the numeric predictors are the predictors and
# `class` is the class outcome/response

cells_subset <- modeldata::cells |>
  dplyr::select(
    class,
    angle_ch_1,
    area_ch_1,
    avg_inten_ch_1,
    avg_inten_ch_2,
    avg_inten_ch_3
  )

cells_roc_auc_res <- score_roc_auc |>
  fit(class ~ ., data = cells_subset)
cells_roc_auc_res@results

# ----------------------------------------------------------------------------

# ROC AUC where `Sale_Price` is the numeric predictor and the class predictors
# are the outcomes/responses

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_roc_auc_res <- score_roc_auc |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_roc_auc_res@results

# TODO Add multiclass example

Scoring via the chi-squared test or Fisher's exact test

Description

These two objects can be used to compute importance scores based on the chi-squared test or Fisher's exact test.

Usage

score_xtab_pval_chisq
score_xtab_pval_fisher

Format

An object of class filtro::class_score_xtab (inherits from filtro::class_score, S7_object) of length 1.

Details

These objects are used when:

The predictors are factors and the outcome is a factor.

In this case, a contingency table (via table()) is created with the proper variable roles, and the cross tabulation p-value is computed using either the chi-squared test (via stats::chisq.test()) or Fisher's exact test (via stats::fisher.test()). The p-value that is returned is transformed to be -log10(p_value) so that larger values are associated with more important predictors.
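The steps in this paragraph can be sketched directly with base R on two categorical columns of the built-in mtcars data. This is an illustration of the described computation under that assumption, not the package's internal code.

```r
# Cross-tabulate a factor predictor against a factor outcome
tab <- table(cyl = mtcars$cyl, am = mtcars$am)

# Chi-squared test; small cell counts trigger an approximation warning
p_chisq <- suppressWarnings(stats::chisq.test(tab))$p.value

# Fisher's exact test on the same table
p_fisher <- stats::fisher.test(tab)$p.value

# Transform so that larger values indicate more important predictors
scores <- -log10(c(pval_chisq = p_chisq, pval_fisher = p_fisher))
scores
```

Fisher's exact test is often preferred when expected cell counts are small, which is the situation that produces the chi-squared approximation warning suppressed above.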

Estimating the scores

object: A score class object (e.g., score_xtab_pval_chisq).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.

Missing values are removed for each predictor/outcome combination being scored.

In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.

Value

An S7 object. The primary property of interest is in results. This is a data frame of results that is populated by the fit() method and has columns:

name: The name of the score (e.g., pval_chisq).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.

These data are accessed using object@results (see examples below).

Examples

# Binary factor example
library(titanic)
library(dplyr)

titanic_subset <- titanic_train |>
  mutate(across(c(Survived, Pclass, Sex, Embarked), as.factor)) |>
  select(Survived, Pclass, Sex, Age, Fare, Embarked)

# Chi-squared test
titanic_xtab_pval_chisq_res <- score_xtab_pval_chisq |>
  fit(Survived ~ ., data = titanic_subset)
titanic_xtab_pval_chisq_res@results

# Chi-squared test adjusted p-values
titanic_xtab_pval_chisq_p_adj_res <- score_xtab_pval_chisq |>
  fit(Survived ~ ., data = titanic_subset, adjustment = "BH")

# Fisher's exact test
titanic_xtab_pval_fisher_res <- score_xtab_pval_fisher |>
  fit(Survived ~ ., data = titanic_subset)
titanic_xtab_pval_fisher_res@results

# Chi-squared test where `class` is the multiclass outcome/response
hpc_subset <- modeldata::hpc_data |>
  dplyr::select(
    class,
    protocol,
    hour
  )

hpc_xtab_pval_chisq_res <- score_xtab_pval_chisq |>
  fit(class ~ ., data = hpc_subset)
hpc_xtab_pval_chisq_res@results

Show best desirability scores, based on number of predictors (plural)

Description

Similar to show_best_desirability_prop(), which can simultaneously optimize multiple scores using desirability functions. See show_best_score_num() for the singular scoring method.

Usage

show_best_desirability_num(x, ..., num_terms = 5)

Arguments

x

A tibble or data frame returned by fill_safe_values().

...

One or more desirability selectors to configure the optimization.

num_terms

An integer value specifying the number of predictors to consider.

Details

See show_best_desirability_prop() for details.

Value

A tibble with num_terms rows. When showing the results, the metrics are presented in "wide format" (one column per metric) and there are new columns for the corresponding desirability values (each starts with .d_).

Examples

library(desirability2)
library(dplyr)

# Remove outcome
ames_scores_results <- ames_scores_results |>
  dplyr::select(-outcome)
ames_scores_results

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1)
)

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf)
)

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain)
)

show_best_desirability_num(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain),
  num_terms = 2
)

show_best_desirability_num(
  ames_scores_results,
  target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
)

show_best_desirability_num(
  ames_scores_results,
  constrain(cor_pearson, low = 0.2, high = 1)
)

Show best desirability scores, based on proportion of predictors (plural)

Description

Analogous to, and adapted from, desirability2::show_best_desirability(), which can simultaneously optimize multiple scores using desirability functions. See show_best_score_prop() for a singular filtering method.

Usage

show_best_desirability_prop(x, ..., prop_terms = 1)

Arguments

x

A tibble or data frame returned by fill_safe_values().

...

One or more desirability selectors to configure the optimization.

prop_terms

A numeric value specifying the proportion of predictors to consider.

Details

Desirability functions might help when selecting the best model based on more than one performance metric. The user creates a desirability function to map values of a metric to a [0, 1] range where 1.0 is most desirable and zero is unacceptable. After constructing these for the metrics of interest, the overall desirability is computed using the geometric mean of the individual desirabilities.

The verbs that can be used in ... (and their arguments) are:

maximize() when larger values are better, such as the area under the ROC curve.
minimize() for metrics such as RMSE or the Brier score.
target() for cases when a specific value of the metric is important.
constrain() is used when there is a range of values that are equally desirable.
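
The mapping and the geometric-mean step described above can be sketched in base R. This is only an illustration under stated assumptions, not the desirability2 implementation: the helper names (clamp01, d_larger, d_smaller, d_toward, d_within), the linear shapes, and the score values are all hypothetical.

```r
# Hypothetical helpers (not part of filtro or desirability2): simple linear
# desirability shapes that map a metric onto [0, 1].
clamp01 <- function(x) pmin(pmax(x, 0), 1)

# Larger is better: 0 at `low`, 1 at `high`.
d_larger <- function(x, low, high) clamp01((x - low) / (high - low))

# Smaller is better: 1 at `low`, 0 at `high`.
d_smaller <- function(x, low, high) clamp01((high - x) / (high - low))

# A specific value is best: rises from `low` to `target`, falls to `high`.
d_toward <- function(x, low, target, high) {
  ifelse(
    x <= target,
    clamp01((x - low) / (target - low)),
    clamp01((high - x) / (high - target))
  )
}

# Any value inside [low, high] is equally desirable.
d_within <- function(x, low, high) as.numeric(x >= low & x <= high)

# Made-up scores for three predictors.
scores <- data.frame(
  predictor   = c("a", "b", "c"),
  cor_pearson = c(0.80, 0.40, 0.10),
  imp_rf      = c(10, 40, 0)
)

d_cor <- d_larger(scores$cor_pearson, low = 0, high = 1)
d_imp <- d_larger(scores$imp_rf, low = 0, high = 40)

# Overall desirability: geometric mean of the individual desirabilities.
scores$d_all <- (d_cor * d_imp)^(1 / 2)

# Rank predictors from most to least desirable.
scores[order(scores$d_all, decreasing = TRUE), ]
```

Because the geometric mean multiplies the individual values, a predictor with a desirability of zero on any one metric (predictor c here) receives an overall desirability of zero.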

Value

A tibble with the prop_terms proportion of rows. When showing the results, the metrics are presented in "wide format" (one column per metric) and there are new columns for the corresponding desirability values (each starts with .d_).

Examples

library(desirability2)
library(dplyr)

# Remove outcome
ames_scores_results <- ames_scores_results |>
  dplyr::select(-outcome)
ames_scores_results

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1)
)

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf)
)

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain)
)

show_best_desirability_prop(
  ames_scores_results,
  maximize(cor_pearson, low = 0, high = 1),
  maximize(imp_rf),
  maximize(infogain),
  prop_terms = 0.2
)

show_best_desirability_prop(
  ames_scores_results,
  target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
)

show_best_desirability_prop(
  ames_scores_results,
  constrain(cor_pearson, low = 0.2, high = 1)
)

Show best score, based on cutoff value (singular)

Description

Show best score, based on cutoff value (singular)

Arguments

x

A score class object (e.g., score_cor_pearson).

...

Further arguments passed to or from other methods.

cutoff

A numeric value specifying the cutoff value.

target

A numeric value specifying the target value. The default of NULL indicates that there is no target value.

Value

A tibble of score results.
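
As a rough illustration of cutoff-based filtering, the base R sketch below keeps rows whose score is at or above the cutoff. The comparison rule (larger scores are better, inclusive bound), the helper name keep_by_cutoff, and the score table are assumptions made for this example; the function documented here may treat the score direction and the target argument differently.

```r
# Hypothetical score table; the predictor names and values are made up.
results <- data.frame(
  predictor = c("x1", "x2", "x3", "x4"),
  score     = c(250, 131, 90, 12)
)

# Assumed rule when larger scores are better: keep scores at or above `cutoff`.
keep_by_cutoff <- function(results, cutoff) {
  results[results$score >= cutoff, , drop = FALSE]
}

keep_by_cutoff(results, cutoff = 130)
```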

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_cutoff(cutoff = 130)

Show best score, based on number or proportion of predictors with optional cutoff value (singular)

Description

Show best score, based on number or proportion of predictors with optional cutoff value (singular)

Arguments

x

A score class object (e.g., score_cor_pearson).

...

Further arguments passed to or from other methods.

prop_terms

A numeric value specifying the proportion of predictors to consider.

num_terms

An integer value specifying the number of predictors to consider.

cutoff

A numeric value specifying the cutoff value.

Value

A tibble of score results.
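
One plausible reading of how these arguments combine is sketched below in base R: rank by score, keep the top rows by count or by proportion, then optionally drop rows below the cutoff. The helper top_terms, the round-down conversion from proportion to row count, the "larger is better" ordering, and the data are all assumptions for illustration rather than the package's actual logic.

```r
# Hypothetical helper (not filtro's implementation): rank by score, keep the
# top rows by count or proportion, then optionally drop rows below `cutoff`.
top_terms <- function(results, num_terms = NULL, prop_terms = NULL, cutoff = NULL) {
  ranked <- results[order(results$score, decreasing = TRUE), , drop = FALSE]
  if (!is.null(prop_terms)) {
    # Assumption: a proportion is converted to a row count by rounding down.
    num_terms <- floor(prop_terms * nrow(ranked))
  }
  if (!is.null(num_terms)) {
    ranked <- head(ranked, num_terms)
  }
  if (!is.null(cutoff)) {
    ranked <- ranked[ranked$score >= cutoff, , drop = FALSE]
  }
  ranked
}

# Made-up scores for four predictors.
results <- data.frame(
  predictor = c("x1", "x2", "x3", "x4"),
  score     = c(90, 250, 12, 131)
)

top_terms(results, num_terms = 2)
top_terms(results, prop_terms = 0.5, cutoff = 200)
```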

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_dual(prop_terms = 0.5)
ames_aov_pval_res |> show_best_score_dual(prop_terms = 0.5, cutoff = 130)
ames_aov_pval_res |> show_best_score_dual(num_terms = 2)
ames_aov_pval_res |> show_best_score_dual(num_terms = 2, cutoff = 130)

Show best score, based on number of predictors (singular)

Description

Show best score, based on number of predictors (singular)

Arguments

x

A score class object (e.g., score_cor_pearson).

...

Further arguments passed to or from other methods.

num_terms

An integer value specifying the number of predictors to consider.

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_num(num_terms = 2)

Show best score, based on proportion of predictors (singular)

Description

Show best score, based on proportion of predictors (singular)

Arguments

x

A score class object (e.g., score_cor_pearson).

...

Further arguments passed to or from other methods.

prop_terms

A numeric value specifying the proportion of predictors to consider.

Value

A tibble of score results.

Examples

library(dplyr)

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_aov_pval_res@results

# Show best score
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)

