
Title:

Forward Stepwise Discriminant Analysis with Pillai's Trace

Version:

0.2.0

Description:

A novel forward stepwise discriminant analysis framework that integrates Pillai's trace with Uncorrelated Linear Discriminant Analysis (ULDA), providing an improvement over traditional stepwise LDA methods that rely on Wilks' Lambda. A stand-alone ULDA implementation is also provided, offering a more general solution than the one available in the 'MASS' package. It automatically handles missing values and provides visualization tools. For more details, see Wang (2024) <doi:10.48550/arXiv.2409.03136>.

License:

MIT + file LICENSE

Encoding:

UTF-8

RoxygenNote:

7.3.2

Imports:

ggplot2, grDevices, Rcpp, stats

URL:

https://github.com/Moran79/folda, http://iamwangsiyu.com/folda/

BugReports:

https://github.com/Moran79/folda/issues

Suggests:

knitr, rmarkdown, testthat (≥ 3.0.0)

Config/testthat/edition:

LinkingTo:

Rcpp, RcppEigen

VignetteBuilder:

knitr

NeedsCompilation:

yes

Packaged:

2024-10-29 21:20:54 UTC; moran

Author:

Siyu Wang [aut, cre, cph]

Maintainer:

Siyu Wang <iamwangsiyu@gmail.com>

Repository:

CRAN

Date/Publication:

2024-10-29 22:20:02 UTC

Check and Normalize Prior Probabilities and Misclassification Costs

Description

This function verifies and normalizes the provided prior probabilities and misclassification cost matrix for a given response variable. It ensures that the lengths of the prior and the dimensions of the misclassification cost matrix match the number of levels in the response variable. If prior or misClassCost are not provided, default values are used: the prior is set to the observed frequencies of the response, and the misclassification cost matrix is set to 1 for all misclassifications and 0 for correct classifications.

Usage

checkPriorAndMisClassCost(prior, misClassCost, response)

Arguments

prior

A numeric vector representing the prior probabilities for each class in the response variable. If NULL, the observed frequencies of the response are used as the default prior.

misClassCost

A square matrix representing the misclassification costs for each pair of classes in the response variable. If NULL, a default misclassification matrix is created where all misclassifications have a cost of 1 and correct classifications have a cost of 0.

response

A factor representing the response variable with multiple classes.

Value

A list containing:

prior

A normalized vector of prior probabilities for each class.

misClassCost

A square matrix representing the misclassification costs, with rows and columns labeled by the levels of the response variable.

Examples

# Example 1: Using default prior and misClassCost
response <- factor(c('A', 'B', 'A', 'B', 'C', 'A'))
checkPriorAndMisClassCost(NULL, NULL, response)

# Example 2: Providing custom prior and misClassCost
prior <- c(A = 1, B = 1, C = 2)
misClassCost <- matrix(c(0, 2, 10,
                         1, 0, 10,
                         1, 2, 0), nrow = 3, byrow = TRUE)
checkPriorAndMisClassCost(prior, misClassCost, response)
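The checks and defaults described for this function can be sketched in base R. This is only an illustration of the documented behavior, not the package's implementation: the helper name normalizePriorSketch is hypothetical, and it assumes the prior is given in the order of levels(response).

```r
# Hypothetical helper (not part of folda): mimics the documented defaults.
normalizePriorSketch <- function(prior = NULL, misClassCost = NULL, response) {
  lev <- levels(response)
  J <- length(lev)

  # Default prior: observed class frequencies
  if (is.null(prior)) prior <- as.numeric(table(response))
  stopifnot(length(prior) == J)
  prior <- setNames(as.numeric(prior) / sum(prior), lev)  # normalize to sum 1

  # Default cost: 1 for every misclassification, 0 on the diagonal
  if (is.null(misClassCost)) misClassCost <- 1 - diag(J)
  stopifnot(all(dim(misClassCost) == c(J, J)))
  dimnames(misClassCost) <- list(lev, lev)

  list(prior = prior, misClassCost = misClassCost)
}

response <- factor(c("A", "B", "A", "B", "C", "A"))
res <- normalizePriorSketch(prior = c(1, 1, 2), response = response)
res$prior         # A = 0.25, B = 0.25, C = 0.50
res$misClassCost  # 3 x 3 matrix, zeros on the diagonal, ones elsewhere
```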

Forward Uncorrelated Linear Discriminant Analysis

Description

This function fits a ULDA (Uncorrelated Linear Discriminant Analysis) model to the provided data, with an option for forward selection of variables based on Pillai's trace or Wilks' Lambda. It can also handle missing values, perform downsampling, and compute the linear discriminant scores and group means for classification. The function returns a fitted ULDA model object.

Usage

folda(
  datX,
  response,
  subsetMethod = c("forward", "all"),
  testStat = c("Pillai", "Wilks"),
  correction = TRUE,
  alpha = 0.1,
  prior = NULL,
  misClassCost = NULL,
  missingMethod = c("medianFlag", "newLevel"),
  downSampling = FALSE,
  kSample = NULL
)

Arguments

datX

A data frame of predictor variables.

response

A factor representing the response variable with multiple classes.

subsetMethod

A character string specifying the method for variable selection. Options are "forward" for forward selection or "all" for using all variables. Default is "forward".

testStat

A character string specifying the test statistic to use for forward selection. Options are "Pillai" or "Wilks". Default is "Pillai".

correction

A logical value indicating whether to apply a multiple comparison correction during forward selection. Default is TRUE.

alpha

A numeric value between 0 and 1 specifying the significance level for the test statistic during forward selection. Default is 0.1.

prior

A numeric vector representing the prior probabilities for each class in the response variable. If NULL, the observed class frequencies are used as the prior. Default is NULL.

misClassCost

A square matrix C, where each element C_{ij} represents the cost of classifying an observation into class i given that it truly belongs to class j. If NULL, a default matrix with equal misclassification costs for all class pairs is used. Default is NULL.

missingMethod

A character vector of length 2 specifying how to handle missing values for numerical and categorical variables, respectively. Default is c("medianFlag", "newLevel").

downSampling

A logical value indicating whether to perform downsampling to balance the class distribution in the training data or to improve computational efficiency. Default is FALSE. Note that if downsampling is applied and the prior is NULL, the class prior will be calculated based on the downsampled data. To retain the original prior, please specify it explicitly using the prior parameter.

kSample

An integer specifying the maximum number of samples to take from each class during downsampling. If NULL, the number of samples is limited to the size of the smallest class. Default is NULL.
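The interaction between downSampling, kSample, and prior may be easier to see in a small base R sketch. The helper downSampleSketch is hypothetical and only follows the description above (at most kSample rows per class, defaulting to the smallest class size); the package's own sampling code may differ.

```r
# Hypothetical helper (not part of folda): keep at most kSample rows per class.
downSampleSketch <- function(response, kSample = NULL) {
  counts <- table(response)
  # Documented default: cap every class at the size of the smallest class
  if (is.null(kSample)) kSample <- min(counts)
  idxByClass <- split(seq_along(response), response)
  keep <- lapply(idxByClass, function(idx) {
    idx[sample.int(length(idx), min(kSample, length(idx)))]
  })
  sort(unlist(keep, use.names = FALSE))
}

set.seed(1)
y <- factor(rep(c("a", "b", "c"), times = c(50, 20, 5)))
idx <- downSampleSketch(y)

table(y[idx])                         # balanced classes after downsampling
originalPrior <- prop.table(table(y)) # frequencies before downsampling
```

Because the downsampled classes are balanced, a prior computed from them would be uniform. Passing something like originalPrior through the prior argument is how the original class frequencies would be retained.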

Value

A list of class ULDA containing the following components:

scaling

The matrix of scaling coefficients for the linear discriminants.

groupMeans

The group means of the linear discriminant scores.

prior

The prior probabilities for each class.

misClassCost

The misclassification cost matrix.

misReference

A reference for handling missing values.

terms

The terms used in the model formula.

xlevels

The levels of the factors used in the model.

varIdx

The indices of the selected variables.

varSD

The standard deviations of the selected variables.

varCenter

The means of the selected variables.

statPillai

The Pillai's trace statistic.

pValue

The p-value associated with Pillai's trace.

predGini

The Gini index of the predictions on the training data.

confusionMatrix

The confusion matrix for the training data predictions.

forwardInfo

Information about the forward selection process, if applicable.

stopInfo

A message indicating why forward selection stopped, if applicable.

References

Howland, P., Jeon, M., & Park, H. (2003). Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications.

Wang, S. (2024). A New Forward Discriminant Analysis Framework Based On Pillai's Trace and ULDA. arXiv preprint arXiv:2409.03136. Available at https://arxiv.org/abs/2409.03136.

Examples

# Fit the ULDA model
fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "all")

# Fit the ULDA model with forward selection
fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "forward")
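For readers unfamiliar with the two options of testStat, the following base R sketch computes Pillai's trace and Wilks' Lambda for the iris data from the between-group (H) and within-group (E) sums of squares and cross-products, then cross-checks them against stats::manova. It illustrates the statistics themselves using their textbook definitions; it is not a reproduction of the forward selection procedure in folda.

```r
X <- as.matrix(iris[, 1:4])
g <- iris$Species

# Total and within-group sums of squares and cross-products (SSCP)
totalSS   <- crossprod(sweep(X, 2, colMeans(X)))
groupFit  <- apply(X, 2, function(col) ave(col, g))  # group mean for each row
withinSS  <- crossprod(X - groupFit)                 # E
betweenSS <- totalSS - withinSS                      # H

pillai <- sum(diag(betweenSS %*% solve(totalSS)))    # trace(H (H + E)^-1)
wilks  <- det(withinSS) / det(totalSS)               # |E| / |H + E|

# Cross-check against the MANOVA summaries in the stats package
fit <- manova(X ~ g)
pillaiRef <- summary(fit, test = "Pillai")$stats["g", "Pillai"]
wilksRef  <- summary(fit, test = "Wilks")$stats["g", "Wilks"]

c(pillai = pillai, pillaiRef = pillaiRef, wilks = wilks, wilksRef = wilksRef)
```

Pillai's trace lies between 0 and min(p, J - 1) for p variables and J classes, with larger values indicating stronger separation, whereas Wilks' Lambda lies between 0 and 1, with smaller values indicating stronger separation.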

Compute Chi-Squared Statistics for Variables

Description

This function calculates the chi-squared statistic for each column of datX against the response variable response. It supports both numerical and categorical predictors in datX. For numerical variables, it automatically discretizes them into factor levels based on standard deviations and mean, using different splitting criteria depending on the sample size.

Usage

getChiSqStat(datX, response)

Arguments

datX

A matrix or data frame containing predictor variables. It can consist of both numerical and categorical variables.

response

A factor representing the class labels. It must have at least two levels for the chi-squared test to be applicable.

Details

For each variable in datX, the function first checks if the variable is numerical. If so, it is discretized into factor levels using either two or three split points, depending on the sample size and the number of levels in the response. Missing values are handled by assigning them to a new factor level.

The chi-squared statistic is then computed between each predictor and the response. If the chi-squared test has more than one degree of freedom, the Wilson-Hilferty transformation is applied to adjust the statistic to a 1-degree-of-freedom chi-squared distribution.

Value

A vector of chi-squared statistics, one for each predictor variable in datX. For numerical variables, the chi-squared statistic is computed after binning the variable.

References

Loh, W. Y. (2009). Improving the precision of classification trees. The Annals of Applied Statistics, 1710–1737. JSTOR.

Examples

datX <- data.frame(
  var1 = rnorm(100),
  var2 = factor(sample(letters[1:3], 100, replace = TRUE))
)
y <- factor(sample(c("A", "B"), 100, replace = TRUE))
getChiSqStat(datX, y)
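The steps in the Details section (bin a numerical variable, treat NA as its own level, run a chi-squared test, map the statistic to one degree of freedom) can be sketched in base R as follows. The helper wilsonHilferty1df is hypothetical and uses the form of the transformation given in Loh (2009); the cut points mean - sd, mean, and mean + sd are assumptions chosen for illustration, not necessarily those used by getChiSqStat.

```r
# Hypothetical helper: Wilson-Hilferty mapping of a chi-squared statistic
# with `df` degrees of freedom to an approximate 1-df chi-squared value.
wilsonHilferty1df <- function(stat, df) {
  if (df <= 1) return(stat)
  max(0, 7 / 9 + sqrt(df) * ((stat / df)^(1 / 3) - 1 + 2 / (9 * df)))^3
}

set.seed(42)
x <- c(rnorm(98), NA, NA)
y <- factor(sample(c("A", "B"), 100, replace = TRUE))

# Assumed cut points, for illustration only
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
xBinned <- cut(x, breaks = c(-Inf, m - s, m, m + s, Inf))
xBinned <- addNA(xBinned)  # missing values become their own factor level

# Small cell counts trigger an approximation warning, suppressed here
test <- suppressWarnings(chisq.test(table(xBinned, y)))
stat1df <- wilsonHilferty1df(unname(test$statistic), unname(test$parameter))
stat1df
```

Mapping every statistic to one degree of freedom puts predictors with different numbers of levels on a common scale, so they can be ranked directly.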

Align Data with a Missing Reference

Description

This function aligns a given dataset (data) with a reference dataset (missingReference). It ensures that the structure, column names, and factor levels in data match the structure of missingReference. If necessary, missing columns are initialized with NA, and factor levels are adjusted to match the reference. Additionally, it handles the imputation of missing values based on the reference and manages flag variables for categorical or numerical columns.

Usage

getDataInShape(data, missingReference)

Arguments

data

A data frame to be aligned and adjusted according to the missingReference.

missingReference

A reference data frame that provides the structure (column names, factor levels, and missing value reference) for aligning data.

Value

A data frame where the structure, column names, and factor levels of data are aligned with missingReference. Missing values in data are imputed based on the first row of the missingReference, and flag variables are updated accordingly.

Examples

data <- data.frame(
  X1_FLAG = c(0, 0, 0),
  X1 = factor(c(NA, "C", "B"), levels = LETTERS[2:3]),
  X2_FLAG = c(NA, 0, 1),
  X2 = c(2, NA, 3)
)
missingReference <- data.frame(
  X1_FLAG = 1,
  X1 = factor("A", levels = LETTERS[1:2]),
  X2 = 1,
  X2_FLAG = 1
)
getDataInShape(data, missingReference)
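The core alignment idea can be sketched in base R, leaving aside the flag-variable logic, which this sketch does not attempt to reproduce. The helper alignSketch is hypothetical: it adds columns missing from the data, re-levels factors to the reference levels, fills NA values from the first reference row, and returns the columns in reference order.

```r
# Hypothetical helper (not part of folda): align `data` to a one-row reference.
alignSketch <- function(data, ref) {
  for (nm in names(ref)) {
    # Create columns that the reference has but the data lacks
    if (!nm %in% names(data)) data[[nm]] <- NA
    if (is.factor(ref[[nm]])) {
      # Re-level factors; values outside the reference levels become NA
      data[[nm]] <- factor(as.character(data[[nm]]), levels = levels(ref[[nm]]))
    }
    # Fill missing values with the value in the first reference row
    isMissing <- is.na(data[[nm]])
    data[[nm]][isMissing] <- ref[[nm]][1]
  }
  data[, names(ref), drop = FALSE]  # reference column order, extras dropped
}

data <- data.frame(X1 = factor(c(NA, "C", "B")), X2 = c(2, NA, 3))
ref  <- data.frame(X1 = factor("A", levels = c("A", "B")), X2 = 1, X3 = 0)
aligned <- alignSketch(data, ref)
aligned  # X1 = A, A, B; X2 = 2, 1, 3; X3 = 0, 0, 0
```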

Calculate the Mode of a Factor Variable with Optional Priors

Description

This function calculates the mode of a given factor or vector that can be coerced into a factor. You can optionally provide prior weights for each level of the factor.

Usage

getMode(v, prior)

Arguments

v

A factor or vector that can be coerced into a factor. The mode will be calculated from the levels of this factor.

prior

A numeric vector of prior weights for each level of the factor. If not provided, all levels will be given equal weight.

Value

The mode of the factor v as a character string. If all values are NA, the function returns NA.

Examples

# Example 1: Mode without priors
v <- factor(c("apple", "banana", "apple", "orange", NA))
getMode(v)

# Example 2: Mode with priors
v <- factor(c("apple", "banana", "apple", "orange", NA))
prior <- c(apple = 0.5, banana = 1.5, orange = 1)
getMode(v, prior)
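One plausible reading of "prior weights" is that each level's count is multiplied by its weight before the maximum is taken. The sketch below implements that reading in base R; weightedModeSketch is a hypothetical helper, and the actual weighting inside getMode may differ.

```r
# Hypothetical helper: mode of a factor with level counts multiplied by weights.
# Assumption: `prior` is a numeric vector named by factor level.
weightedModeSketch <- function(v, prior = NULL) {
  v <- as.factor(v)
  counts <- table(v)  # NA values are not counted
  if (sum(counts) == 0) return(NA_character_)
  if (is.null(prior)) prior <- setNames(rep(1, nlevels(v)), levels(v))
  weighted <- counts * prior[names(counts)]
  names(weighted)[which.max(weighted)]
}

v <- factor(c("apple", "banana", "apple", "orange", NA))
weightedModeSketch(v)                                            # "apple"
weightedModeSketch(v, c(apple = 0.5, banana = 1.5, orange = 1))  # "banana"
```

Under this reading, the weighted counts in the second call are 1.0 for apple, 1.5 for banana, and 1.0 for orange, so the weights change the result from "apple" to "banana".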

Impute Missing Values and Add Missing Flags to a Data Frame

Description

This function imputes missing values in a data frame based on specified methods for numerical and categorical variables. Additionally, it can add flag columns to indicate missing values. For numerical variables, missing values can be imputed using the mean or median. For categorical variables, missing values can be imputed using the mode or a new level. This function also removes constant columns (all NAs or all observed but the same value).

Usage

missingFix(data, missingMethod = c("medianFlag", "newLevel"))

Arguments

data

A data frame containing the data to be processed. Missing values (NA) will be imputed based on the methods provided in missingMethod.

missingMethod

A character vector of length 2 specifying the methods for imputing missing values. The first element specifies the method for numerical variables ("mean", "median", "meanFlag", or "medianFlag"), and the second element specifies the method for categorical variables ("mode", "modeFlag", or "newLevel"). If "Flag" is included, a flag column will be added for the corresponding variable type.

Value

A list with two elements:

data

The original data frame with missing values imputed, and flag columns added if applicable.

ref

A reference row containing the imputed values and flag levels, which can be used for future predictions or reference.

Examples

dat <- data.frame(
  X1 = rep(NA, 5),
  X2 = factor(rep(NA, 5), levels = LETTERS[1:3]),
  X3 = 1:5,
  X4 = LETTERS[1:5],
  X5 = c(NA, 2, 3, 10, NA),
  X6 = factor(c("A", NA, NA, "B", "B"), levels = LETTERS[1:3])
)
missingFix(dat)
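To make the default c("medianFlag", "newLevel") concrete, here is a base R sketch of what median imputation with a flag column and a new factor level amount to. The column names, the 0/1 flag coding, and the level label "Missing" are assumptions for illustration; missingFix may name and encode these differently.

```r
dat <- data.frame(
  num = c(NA, 2, 3, 10, NA),
  cat = factor(c("A", NA, NA, "B", "B"))
)

# Numerical column: record where values were missing, then fill with the median
dat$num_FLAG <- as.integer(is.na(dat$num))
dat$num[is.na(dat$num)] <- median(dat$num, na.rm = TRUE)

# Categorical column: turn NA into an explicit level (label is hypothetical)
levels(dat$cat) <- c(levels(dat$cat), "Missing")
dat$cat[is.na(dat$cat)] <- "Missing"

dat
```

The flag column preserves the information that a value was imputed, which a model can use as an additional predictor when missingness itself is informative.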

Plot Decision Boundaries and Linear Discriminant Scores

Description

This function plots the decision boundaries and linear discriminant (LD) scores for a given ULDA model. If it is a binary classification problem, a density plot is created. Otherwise, a scatter plot with decision boundaries is generated.

Usage

## S3 method for class 'ULDA'
plot(x, datX, response, ...)

Arguments

x

A fitted ULDA model object.

datX

A data frame containing the predictor variables.

response

A factor representing the response variable (training labels) corresponding to datX.

...

Additional arguments.

Value

A ggplot2 plot object, either a density plot or a scatter plot with decision boundaries.

Examples

fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "all")
plot(fit, iris[, -5], iris[, 5])

Predict Method for ULDA Model

Description

This function predicts the class labels or class probabilities for new data using a fitted ULDA model. The prediction can return either the most likely class ("response") or the posterior probabilities for each class ("prob").

Usage

## S3 method for class 'ULDA'
predict(object, newdata, type = c("response", "prob"), ...)

Arguments

object

A fitted ULDA model object.

newdata

A data frame containing the new predictor variables for which predictions are to be made.

type

A character string specifying the type of prediction to return. "response" returns the predicted class labels, while "prob" returns the posterior probabilities for each class. Default is "response".

...

Additional arguments.

Value

If type = "response", the function returns a vector of predicted class labels. If type = "prob", it returns a matrix of posterior probabilities, where each row corresponds to a sample and each column to a class.

Examples

fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "all")

# Predict class labels
predictions <- predict(fit, iris, type = "response")

# Predict class probabilities
prob_predictions <- predict(fit, iris, type = "prob")
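When a misclassification cost matrix is involved, the predicted class need not be the one with the highest posterior probability. The base R sketch below applies the standard minimum-expected-cost rule to a hand-made posterior matrix, using the C_{ij} convention from the folda() arguments (cost of predicting class i when the truth is class j). The numbers are made up for illustration, and whether predict() applies exactly this rule internally is an assumption.

```r
# Posterior probabilities for three observations (rows) and classes A, B, C
posterior <- matrix(c(0.60, 0.30, 0.10,
                      0.45, 0.35, 0.20,
                      0.20, 0.20, 0.60),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(NULL, c("A", "B", "C")))

# misClassCost[i, j]: cost of predicting class i when the true class is j
misClassCost <- matrix(c(0, 1, 5,
                         1, 0, 5,
                         1, 1, 0),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Expected cost of predicting class i for each row: sum over j of C[i, j] * p[j]
expectedCost <- posterior %*% t(misClassCost)
byCost <- colnames(expectedCost)[unname(apply(expectedCost, 1, which.min))]
byProb <- colnames(posterior)[unname(apply(posterior, 1, which.max))]

byProb  # "A" "A" "C"
byCost  # "A" "C" "C"
```

In the second row, class A has the highest posterior probability, but missing a true C is five times as costly as other errors, so predicting C has the lowest expected cost.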