| Title: | Forward Stepwise Discriminant Analysis with Pillai's Trace |
| Version: | 0.2.0 |
| Description: | A novel forward stepwise discriminant analysis framework that integrates Pillai's trace with Uncorrelated Linear Discriminant Analysis (ULDA), providing an improvement over traditional stepwise LDA methods that rely on Wilks' Lambda. A stand-alone ULDA implementation is also provided, offering a more general solution than the one available in the 'MASS' package. It automatically handles missing values and provides visualization tools. For more details, see Wang (2024) <doi:10.48550/arXiv.2409.03136>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Imports: | ggplot2, grDevices, Rcpp, stats |
| URL: | https://github.com/Moran79/folda, http://iamwangsiyu.com/folda/ |
| BugReports: | https://github.com/Moran79/folda/issues |
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| LinkingTo: | Rcpp, RcppEigen |
| VignetteBuilder: | knitr |
| NeedsCompilation: | yes |
| Packaged: | 2024-10-29 21:20:54 UTC; moran |
| Author: | Siyu Wang |
| Maintainer: | Siyu Wang <iamwangsiyu@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2024-10-29 22:20:02 UTC |
Check and Normalize Prior Probabilities and Misclassification Costs
Description
This function verifies and normalizes the provided prior probabilities and misclassification cost matrix for a given response variable. It ensures that the lengths of the prior and the dimensions of the misclassification cost matrix match the number of levels in the response variable. If prior or misClassCost are not provided, default values are used: the prior is set to the observed frequencies of the response, and the misclassification cost matrix is set to 1 for all misclassifications and 0 for correct classifications.
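The defaults described above can be sketched in base R. This is an illustration only, not the package's implementation: the helper name default_prior_and_cost is hypothetical, and the sketch assumes that normalizing a prior means rescaling it to sum to 1.

```r
# Hypothetical helper (not part of folda): builds the documented defaults.
default_prior_and_cost <- function(response, prior = NULL) {
  lv <- levels(response)
  if (is.null(prior)) {
    # Default prior: observed class frequencies
    prior <- as.numeric(table(response))
  }
  stopifnot(length(prior) == length(lv))
  prior <- prior / sum(prior)  # assumption: normalize to sum to 1
  names(prior) <- lv
  # Default cost: 1 for every misclassification, 0 on the diagonal
  cost <- matrix(1, nrow = length(lv), ncol = length(lv),
                 dimnames = list(lv, lv))
  diag(cost) <- 0
  list(prior = prior, misClassCost = cost)
}

response <- factor(c("A", "B", "A", "B", "C", "A"))
res <- default_prior_and_cost(response)
```

Here res$prior holds the class proportions of the response and res$misClassCost is a 3 x 3 matrix labeled by the class levels.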
Usage
checkPriorAndMisClassCost(prior, misClassCost, response)
Arguments
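prior | A numeric vector representing the prior probabilities for each class in the response variable. If not provided, the observed frequencies of the response are used. |
misClassCost | A square matrix representing the misclassification costs for each pair of classes in the response variable. If not provided, every misclassification costs 1 and every correct classification costs 0. |
response | A factor representing the response variable with multiple classes. |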
Value
A list containing:
prior | A normalized vector of prior probabilities for each class. |
misClassCost | A square matrix representing the misclassification costs, with rows and columns labeled by the levels of the response variable. |
Examples
# Example 1: Using default prior and misClassCost
response <- factor(c('A', 'B', 'A', 'B', 'C', 'A'))
checkPriorAndMisClassCost(NULL, NULL, response)

# Example 2: Providing custom prior and misClassCost
prior <- c(A = 1, B = 1, C = 2)
misClassCost <- matrix(c(0, 2, 10, 1, 0, 10, 1, 2, 0), nrow = 3, byrow = TRUE)
checkPriorAndMisClassCost(prior, misClassCost, response)

Forward Uncorrelated Linear Discriminant Analysis
Description
This function fits a ULDA (Uncorrelated Linear Discriminant Analysis) model to the provided data, with an option for forward selection of variables based on Pillai's trace or Wilks' Lambda. It can also handle missing values, perform downsampling, and compute the linear discriminant scores and group means for classification. The function returns a fitted ULDA model object.
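For orientation, both test statistics named above can be computed for a fixed set of variables with base R's manova(). The sketch below is independent of folda and does not reproduce its forward selection; it only shows the two statistics on the iris data.

```r
# Base R only: MANOVA of the four iris measurements on Species
fit_manova <- manova(
  cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
  data = iris
)

# summary.manova() reports the requested statistic in its $stats matrix
pillai <- summary(fit_manova, test = "Pillai")$stats["Species", "Pillai"]
wilks  <- summary(fit_manova, test = "Wilks")$stats["Species", "Wilks"]

# Pillai's trace grows with group separation; Wilks' Lambda shrinks toward 0
c(Pillai = pillai, Wilks = wilks)
```

With three classes and four predictors, Pillai's trace is bounded above by 2 (the smaller of the number of predictors and the number of classes minus one), while Wilks' Lambda lies between 0 and 1.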
Usage
folda(
  datX,
  response,
  subsetMethod = c("forward", "all"),
  testStat = c("Pillai", "Wilks"),
  correction = TRUE,
  alpha = 0.1,
  prior = NULL,
  misClassCost = NULL,
  missingMethod = c("medianFlag", "newLevel"),
  downSampling = FALSE,
  kSample = NULL
)
Arguments
datX | A data frame of predictor variables. |
response | A factor representing the response variable with multiple classes. |
subsetMethod | A character string specifying the method for variable selection. Options are "forward" (the default) and "all". |
testStat | A character string specifying the test statistic to use for forward selection. Options are "Pillai" (the default) and "Wilks". |
correction | A logical value indicating whether to apply a multiple comparison correction during forward selection. Default is TRUE. |
alpha | A numeric value between 0 and 1 specifying the significance level for the test statistic during forward selection. Default is 0.1. |
prior | A numeric vector representing the prior probabilities for each class in the response variable. If NULL (the default), the observed frequencies of the response are used. |
misClassCost | A square matrix representing the misclassification costs for each pair of classes in the response variable. |
missingMethod | A character vector of length 2 specifying how to handle missing values for numerical and categorical variables, respectively. Default is c("medianFlag", "newLevel"). |
downSampling | A logical value indicating whether to perform downsampling to balance the class distribution in the training data or to improve computational efficiency. Default is FALSE. |
kSample | An integer specifying the maximum number of samples to take from each class during downsampling. If |
Value
A list of class ULDA containing the following components:
scaling | The matrix of scaling coefficients for the linear discriminants. |
groupMeans | The group means of the linear discriminant scores. |
prior | The prior probabilities for each class. |
misClassCost | The misclassification cost matrix. |
misReference | A reference for handling missing values. |
terms | The terms used in the model formula. |
xlevels | The levels of the factors used in the model. |
varIdx | The indices of the selected variables. |
varSD | The standard deviations of the selected variables. |
varCenter | The means of the selected variables. |
statPillai | The Pillai's trace statistic. |
pValue | The p-value associated with Pillai's trace. |
predGini | The Gini index of the predictions on the training data. |
confusionMatrix | The confusion matrix for the training data predictions. |
forwardInfo | Information about the forward selection process, if applicable. |
stopInfo | A message indicating why forward selection stopped, if applicable. |
References
Howland, P., Jeon, M., & Park, H. (2003). Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications.
Wang, S. (2024). A New Forward Discriminant Analysis Framework Based On Pillai's Trace and ULDA. arXiv preprint arXiv:2409.03136. Available at https://arxiv.org/abs/2409.03136.
Examples
# Fit the ULDA model
fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "all")

# Fit the ULDA model with forward selection
fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "forward")

Compute Chi-Squared Statistics for Variables
Description
This function calculates the chi-squared statistic for each column of datX against the response variable response. It supports both numerical and categorical predictors in datX. For numerical variables, it automatically discretizes them into factor levels based on standard deviations and mean, using different splitting criteria depending on the sample size.
Usage
getChiSqStat(datX, response)
Arguments
datX | A matrix or data frame containing predictor variables. It can consist of both numerical and categorical variables. |
response | A factor representing the class labels. It must have at least two levels for the chi-squared test to be applicable. |
Details
For each variable in datX, the function first checks if the variable is numerical. If so, it is discretized into factor levels using either two or three split points, depending on the sample size and the number of levels in the response. Missing values are handled by assigning them to a new factor level.
The chi-squared statistic is then computed between each predictor and the response. If the chi-squared test has more than one degree of freedom, the Wilson-Hilferty transformation is applied to adjust the statistic to a 1-degree-of-freedom chi-squared distribution.
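The adjustment to one degree of freedom can be sketched from the standard Wilson-Hilferty cube-root approximation. The helper name wilson_hilferty_1df is hypothetical, and the package's exact formula and edge-case handling may differ; this is only one plausible reading of the paragraph above.

```r
# Hypothetical helper: map a chi-squared statistic with `df` degrees of
# freedom to an approximately equivalent 1-df chi-squared value.
wilson_hilferty_1df <- function(chi, df) {
  # Wilson-Hilferty: (chi / df)^(1/3) is roughly normal with
  # mean 1 - 2 / (9 * df) and variance 2 / (9 * df)
  z <- ((chi / df)^(1 / 3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
  # Invert the same approximation at df = 1 (mean 7/9, variance 2/9)
  pmax(0, (7 / 9 + z * sqrt(2 / 9))^3)
}

set.seed(1)
x <- factor(sample(letters[1:3], 200, replace = TRUE))
y <- factor(sample(c("A", "B", "C"), 200, replace = TRUE))
test <- suppressWarnings(chisq.test(x, y))  # 3 x 3 table: 4 degrees of freedom
stat_1df <- wilson_hilferty_1df(unname(test$statistic), unname(test$parameter))
```

Putting every predictor on a common 1-df scale makes statistics from tables of different sizes directly comparable. When df is already 1, the transformation returns the statistic unchanged.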
Value
A vector of chi-squared statistics, one for each predictor variable in datX. For numerical variables, the chi-squared statistic is computed after binning the variable.
References
Loh, W. Y. (2009). Improving the precision of classification trees. The Annals of Applied Statistics, 1710–1737. JSTOR.
Examples
datX <- data.frame(
  var1 = rnorm(100),
  var2 = factor(sample(letters[1:3], 100, replace = TRUE))
)
y <- factor(sample(c("A", "B"), 100, replace = TRUE))
getChiSqStat(datX, y)

Align Data with a Missing Reference
Description
This function aligns a given dataset (data) with a reference dataset (missingReference). It ensures that the structure, column names, and factor levels in data match the structure of missingReference. If necessary, missing columns are initialized with NA, and factor levels are adjusted to match the reference. Additionally, it handles the imputation of missing values based on the reference and manages flag variables for categorical or numerical columns.
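The structural part of this alignment (columns and factor levels) can be sketched in base R. The helper align_to_reference is hypothetical and deliberately omits the imputation and flag handling that getDataInShape also performs.

```r
# Hypothetical helper (not folda's implementation): align `data` to the
# columns and factor levels of a one-row reference data frame `ref`.
align_to_reference <- function(data, ref) {
  for (nm in names(ref)) {
    if (!nm %in% names(data)) {
      data[[nm]] <- NA  # add columns the reference expects
    }
    if (is.factor(ref[[nm]])) {
      # Re-level; values outside the reference levels become NA
      data[[nm]] <- factor(as.character(data[[nm]]),
                           levels = levels(ref[[nm]]))
    }
  }
  data[, names(ref), drop = FALSE]  # drop extras, use the reference order
}

ref <- data.frame(X1 = factor("A", levels = c("A", "B")), X2 = 1)
new_data <- data.frame(X2 = c(2, NA), X1 = factor(c("B", "C")), extra = 1:2)
aligned <- align_to_reference(new_data, ref)
```

In this sketch the unseen level "C" turns into NA, which a later imputation step could then fill from the reference row.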
Usage
getDataInShape(data, missingReference)
Arguments
data | A data frame to be aligned and adjusted according to the missingReference. |
missingReference | A reference data frame that provides the structure (column names, factor levels, and missing value reference) for aligning data. |
Value
A data frame where the structure, column names, and factor levels of data are aligned with missingReference. Missing values in data are imputed based on the first row of the missingReference, and flag variables are updated accordingly.
Examples
data <- data.frame(
  X1_FLAG = c(0, 0, 0),
  X1 = factor(c(NA, "C", "B"), levels = LETTERS[2:3]),
  X2_FLAG = c(NA, 0, 1),
  X2 = c(2, NA, 3)
)
missingReference <- data.frame(
  X1_FLAG = 1,
  X1 = factor("A", levels = LETTERS[1:2]),
  X2 = 1,
  X2_FLAG = 1
)
getDataInShape(data, missingReference)

Calculate the Mode of a Factor Variable with Optional Priors
Description
This function calculates the mode of a given factor or vector that can be coerced into a factor. You can optionally provide prior weights for each level of the factor.
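One plausible reading of a mode with prior weights is sketched below in base R. The helper weighted_mode is hypothetical, and it assumes that each level's count is multiplied by its weight before the maximum is taken, with weights supplied in level order; the package may weight differently.

```r
# Hypothetical helper: weighted mode, assuming each level's count is
# multiplied by its prior weight before taking the maximum.
weighted_mode <- function(v, prior = NULL) {
  v <- as.factor(v)
  counts <- table(v)  # NA values are excluded by default
  if (sum(counts) == 0) return(NA_character_)
  if (is.null(prior)) prior <- rep(1, nlevels(v))
  names(which.max(counts * prior))
}

v <- factor(c("apple", "banana", "apple", "orange", NA))
mode_plain    <- weighted_mode(v)
mode_weighted <- weighted_mode(v, prior = c(apple = 0.5, banana = 1.5, orange = 1))
```

Without weights the most frequent level, "apple", wins. With the weights above, the weighted counts are 1, 1.5, and 1, so the result shifts to "banana".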
Usage
getMode(v, prior)
Arguments
v | A factor or vector that can be coerced into a factor. The mode will be calculated from the levels of this factor. |
prior | A numeric vector of prior weights for each level of the factor. If not provided, all levels will be given equal weight. |
Value
The mode of the factor v as a character string. If all values are NA, the function returns NA.
Examples
# Example 1: Mode without priors
v <- factor(c("apple", "banana", "apple", "orange", NA))
getMode(v)

# Example 2: Mode with priors
v <- factor(c("apple", "banana", "apple", "orange", NA))
prior <- c(apple = 0.5, banana = 1.5, orange = 1)
getMode(v, prior)

Impute Missing Values and Add Missing Flags to a Data Frame
Description
This function imputes missing values in a data frame based on specified methods for numerical and categorical variables. Additionally, it can add flag columns to indicate missing values. For numerical variables, missing values can be imputed using the mean or median. For categorical variables, missing values can be imputed using the mode or a new level. This function also removes constant columns (all NAs or all observed but the same value).
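Median imputation with a flag column, for a single numeric variable, can be sketched as follows. The helper impute_median_with_flag is hypothetical; the "_FLAG" suffix follows the column names used in the getDataInShape example, and the coding of the flag (1 = value was missing) is an assumption.

```r
# Hypothetical helper: median-impute one numeric column and add a flag column.
impute_median_with_flag <- function(data, col) {
  missing_idx <- is.na(data[[col]])
  # Assumed coding: 1 marks a value that was originally missing
  data[[paste0(col, "_FLAG")]] <- as.integer(missing_idx)
  data[[col]][missing_idx] <- median(data[[col]], na.rm = TRUE)
  data
}

dat <- data.frame(X5 = c(NA, 2, 3, 10, NA))
fixed <- impute_median_with_flag(dat, "X5")
```

The flag column preserves the information that a value was missing, so a downstream model can still use missingness as a signal after imputation.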
Usage
missingFix(data, missingMethod = c("medianFlag", "newLevel"))
Arguments
data | A data frame containing the data to be processed. Missing values ( |
missingMethod | A character vector of length 2 specifying the methods for imputing missing values. The first element specifies the method for numerical variables ( |
Value
A list with two elements:
data | The original data frame with missing values imputed, and flag columns added if applicable. |
ref | A reference row containing the imputed values and flag levels, which can be used for future predictions or reference. |
Examples
dat <- data.frame(
  X1 = rep(NA, 5),
  X2 = factor(rep(NA, 5), levels = LETTERS[1:3]),
  X3 = 1:5,
  X4 = LETTERS[1:5],
  X5 = c(NA, 2, 3, 10, NA),
  X6 = factor(c("A", NA, NA, "B", "B"), levels = LETTERS[1:3])
)
missingFix(dat)

Plot Decision Boundaries and Linear Discriminant Scores
Description
This function plots the decision boundaries and linear discriminant (LD) scores for a given ULDA model. If it is a binary classification problem, a density plot is created. Otherwise, a scatter plot with decision boundaries is generated.
Usage
## S3 method for class 'ULDA'
plot(x, datX, response, ...)
Arguments
x | A fitted ULDA model object. |
datX | A data frame containing the predictor variables. |
response | A factor representing the response variable (training labels) corresponding to datX. |
... | Additional arguments. |
Value
A ggplot2 plot object, either a density plot or a scatter plot with decision boundaries.
Examples
fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "all")
plot(fit, iris[, -5], iris[, 5])

Predict Method for ULDA Model
Description
This function predicts the class labels or class probabilities for new data using a fitted ULDA model. The prediction can return either the most likely class ("response") or the posterior probabilities for each class ("prob").
Usage
## S3 method for class 'ULDA'
predict(object, newdata, type = c("response", "prob"), ...)
Arguments
object | A fitted ULDA model object. |
newdata | A data frame containing the new predictor variables for whichpredictions are to be made. |
type | A character string specifying the type of prediction to return. |
... | Additional arguments. |
Value
If type = "response", the function returns a vector of predicted class labels. If type = "prob", it returns a matrix of posterior probabilities, where each row corresponds to a sample and each column to a class.
Examples
fit <- folda(datX = iris[, -5], response = iris[, 5], subsetMethod = "all")

# Predict class labels
predictions <- predict(fit, iris, type = "response")

# Predict class probabilities
prob_predictions <- predict(fit, iris, type = "prob")
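Posterior probabilities and a misclassification cost matrix are commonly combined by choosing the class with the lowest expected cost. The base R sketch below illustrates that general decision rule on made-up numbers. Whether predict() applies exactly this rule internally is an assumption, as is the cost-matrix orientation (rows = true class, columns = predicted class).

```r
# Made-up posterior probabilities: rows = samples, columns = classes
posterior <- matrix(
  c(0.6, 0.3, 0.1,
    0.2, 0.5, 0.3),
  nrow = 2, byrow = TRUE, dimnames = list(NULL, c("A", "B", "C"))
)

# Assumed orientation: cost[i, j] = cost of predicting class j when truth is i
cost <- matrix(
  c( 0,  1, 1,
     1,  0, 1,
    10, 10, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)

expected_cost <- posterior %*% cost  # samples x predicted classes
pred_cost <- colnames(cost)[apply(expected_cost, 1, which.min)]
pred_map  <- colnames(posterior)[apply(posterior, 1, which.max)]
```

Because missing a true "C" is ten times as costly here, the minimum-cost rule predicts "C" for both samples, whereas the most probable classes are "A" and "B".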