Type: Package
Title: Linear Optimal Low-Rank Projection
Version: 2.1
Date: 2020-06-20
Maintainer: Eric Bridgeford <ericwb95@gmail.com>
Description: Supervised learning techniques designed for the situation when the dimensionality exceeds the sample size have a tendency to overfit as the dimensionality of the data increases. To remedy this high dimensionality, low sample size (HDLSS) situation, we attempt to learn a lower-dimensional representation of the data before learning a classifier. That is, we project the data to a space where the dimensionality is more manageable, and are then able to better apply standard classification or clustering techniques, since there are fewer dimensions to overfit. A number of previous works have focused on how to strategically reduce dimensionality in the unsupervised case, yet in the supervised HDLSS regime, few works have attempted to devise dimensionality reduction techniques that leverage the labels associated with the data. In this package and the associated manuscript Vogelstein et al. (2017) <doi:10.48550/arXiv.1709.01233>, we provide several methods for feature extraction, some utilizing labels and some not, along with easily extensible utilities to simplify cross-validative efforts to identify the best feature extraction method. Additionally, we include a series of adaptable benchmark simulations to serve as a standard for future investigative efforts into supervised HDLSS. Finally, we produce a comprehensive comparison of the included algorithms across a range of benchmark simulations and real data applications.
Depends: R (≥ 3.4.0)
License: GPL-2
URL: https://github.com/neurodata/lol
Imports: ggplot2, abind, MASS, irlba, pls, robust, robustbase
Encoding: UTF-8
LazyData: true
VignetteBuilder: knitr
RoxygenNote: 7.1.0
Suggests: knitr, rmarkdown, parallel, randomForest, latex2exp, testthat, covr
NeedsCompilation: no
Packaged: 2020-06-25 18:56:31 UTC; eric
Author: Eric Bridgeford [aut, cre], Minh Tang [ctb], Jason Yim [ctb], Joshua Vogelstein [ths]
Repository: CRAN
Date/Publication: 2020-06-26 22:30:03 UTC

Nearest Centroid Classifier Training

Description

A function that trains a classifier based on the nearest centroid.

Usage

lol.classify.nearestCentroid(X, Y, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the n samples.

...

optional args.

Value

A list of class nearestCentroid, with the following attributes:

centroids

[K, d] the centroids of each class with K classes in d dimensions.

ylabs

[K] the labels for each of the K unique classes, ordered.

priors

[K] the priors for each of the K classes.

Details

For more details see the help vignette: vignette("centroid", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.nearestCentroid(X, Y)
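The idea behind nearest-centroid training can be sketched in base R. This is a minimal illustration under stated assumptions, not the lolR implementation: the helper names train_nearest_centroid and predict_nearest_centroid are hypothetical, the data are synthetic, and Euclidean distance is assumed for prediction. The returned list mirrors the centroids, ylabs, and priors attributes described above.

```r
# Hypothetical helpers for illustration only; base R, synthetic data.
set.seed(1)
n <- 200; d <- 5
Y <- rep(c(1, 2), each = n / 2)
X <- matrix(rnorm(n * d), nrow = n, ncol = d)
X[Y == 2, 1] <- X[Y == 2, 1] + 3   # shift class 2 along the first dimension

train_nearest_centroid <- function(X, Y) {
  ylabs <- sort(unique(Y))                                  # [K] ordered class labels
  centroids <- t(sapply(ylabs, function(y) {
    colMeans(X[Y == y, , drop = FALSE])                     # mean of each class
  }))                                                       # [K, d]
  priors <- sapply(ylabs, function(y) mean(Y == y))         # [K] class proportions
  list(centroids = centroids, ylabs = ylabs, priors = priors)
}

predict_nearest_centroid <- function(model, X) {
  # squared Euclidean distance from every sample to every centroid: [n, K]
  dists <- sapply(seq_along(model$ylabs), function(k) {
    rowSums(sweep(X, 2, model$centroids[k, ], "-")^2)
  })
  model$ylabs[apply(dists, 1, which.min)]
}

model <- train_nearest_centroid(X, Y)
Yhat <- predict_nearest_centroid(model, X)
accuracy <- mean(Yhat == Y)
```

The priors are stored but not used by this distance-only prediction rule; whether the package uses them at prediction time is not stated in this manual.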

Random Classifier Utility

Description

A function for random classifiers.

Usage

lol.classify.rand(X, Y, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the n samples.

...

optional args.

Value

A structure, with the following attributes:

ylabs

[K] the labels for each of the K unique classes, ordered.

priors

[K] the priors for each of the K classes.

Author(s)

Eric Bridgeford


Random Chance Classifier Training

Description

A function that predicts the maximally present class in the dataset. Its functionality is consistent with the standard R prediction interface, so that one can compute the "chance" accuracy with minimal modification of other classification scripts.

Usage

lol.classify.randomChance(X, Y, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the n samples.

...

optional args.

Value

A list of class randomGuess, with the following attributes:

ylabs

[K] the labels for each of the K unique classes, ordered.

priors

[K] the priors for each of the K classes.

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomChance(X, Y)

Randomly Guessing Classifier Training

Description

A function that predicts by randomly guessing based on the pmf of the class priors. Its functionality is consistent with the standard R prediction interface, so that one can compute the "guess" accuracy with minimal modification of other classification scripts.

Usage

lol.classify.randomGuess(X, Y, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the n samples.

...

optional args.

Value

A list of class randomGuess, with the following attributes:

ylabs

[K] the labels for each of the K unique classes, ordered.

priors

[K] the priors for each of the K classes.

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomGuess(X, Y)
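The "chance" and "guess" baselines can be contrasted with a few lines of base R. The sketch below is only an illustration of the two ideas as described in this manual (predict the most frequent class, versus sample each prediction from the class priors); the toy labels and variable names are assumptions and nothing here calls lolR.

```r
# Illustrative baselines on toy labels; base R only.
set.seed(1)
Y <- c(rep("a", 70), rep("b", 30))
ylabs <- sort(unique(Y))                                         # [K] ordered labels
priors <- as.numeric(table(factor(Y, levels = ylabs))) / length(Y)  # [K] class proportions

# "chance": always predict the most frequent class
chance_pred <- rep(ylabs[which.max(priors)], length(Y))

# "guess": draw each prediction from the pmf given by the priors
guess_pred <- sample(ylabs, size = length(Y), replace = TRUE, prob = priors)

chance_acc <- mean(chance_pred == Y)   # equals the largest prior on the training labels
guess_acc <- mean(guess_pred == Y)     # varies with the random draw
```

On the training labels the chance accuracy equals the largest prior, while the guess accuracy fluctuates around the sum of squared priors.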

Embedding

Description

A function that embeds points in high dimensions to a lower dimensionality.

Usage

lol.embed(X, A, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

A

[d, r] the embedding matrix from d to r dimensions.

...

optional args.

Value

an array [n, r] the original n points embedded into r dimensions.

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lol(X=X, Y=Y, r=5)  # use lol to project into 5 dimensions
Xr <- lol.embed(X, model$A)
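Given the shapes documented above ([n, d] data, [d, r] embedding matrix, [n, r] result), the embedding is presumably a matrix product. The base R sketch below illustrates that reading with an arbitrary orthonormal matrix standing in for a learned projection; it is an assumption about the operation, not a copy of the package code.

```r
# Shape-level sketch of embedding; the matrix A here is arbitrary, not learned.
set.seed(1)
n <- 10; d <- 6; r <- 2
X <- matrix(rnorm(n * d), nrow = n)             # [n, d] data
A <- qr.Q(qr(matrix(rnorm(d * r), nrow = d)))   # [d, r] orthonormal columns
Xr <- X %*% A                                   # [n, r] embedded points
```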

Bayes Optimal

Description

A function for recovering the Bayes Optimal Projection, which optimizes Bayes classification.

Usage

lol.project.bayes_optimal(X, Y, mus, Sigmas, priors, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

...

optional args.

Value

A list of class embedding containing the following:

A

[d, K] the projection matrix from d to K dimensions.

d

the eigenvalues associated with the eigendecomposition.

ylabs

[K] vector containing the K unique, ordered class labels.

centroids

[K, d] centroid matrix of the K unique, ordered classes in native d dimensions.

priors

[K] vector containing the K prior probabilities for the unique, ordered classes.

Xr

[n, K] the n data points in reduced dimensionality K.

cr

[K, K] the K centroids in reduced dimensionality K.

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
# obtain bayes-optimal projection of the data
model <- lol.project.bayes_optimal(X=X, Y=Y, mus=data$mus,
                                   Sigmas=data$Sigmas, priors=data$priors)

Data Piling

Description

A function for implementing the Maximal Data Piling (MDP) Algorithm.

Usage

lol.project.dp(X, Y, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

...

optional args.

Value

A list containing the following:

A

[d, r] the projection matrix from d to r dimensions.

ylabs

[K] vector containing the K unique, ordered class labels.

centroids

[K, d] centroid matrix of the K unique, ordered classes in native d dimensions.

priors

[K] vector containing the K prior probabilities for the unique, ordered classes.

Xr

[n, r] the n data points in reduced dimensionality r.

cr

[K, r] the K centroids in reduced dimensionality r.

Details

For more details see the help vignette: vignette("dp", package = "lolR")

Author(s)

Minh Tang and Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.dp(X=X, Y=Y)  # use mdp to project into maximal data piling

Linear Optimal Low-Rank Projection (LOL)

Description

A function for implementing the Linear Optimal Low-Rank Projection (LOL) Algorithm. This algorithm allows users to find an optimal projection from 'd' to 'r' dimensions, where 'r << d', by combining information from the first and second moments in the data.

Usage

lol.project.lol(
  X,
  Y,
  r,
  second.moment.xfm = FALSE,
  second.moment.xfm.opts = list(),
  first.moment = "delta",
  second.moment = "linear",
  orthogonalize = FALSE,
  robust = FALSE,
  ...
)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

r

the rank of the projection. Note that r >= K, and r < d.

second.moment.xfm

whether to use extraneous options in estimation of the second moment component. The transforms specified should be a numbered list of transforms you wish to apply, and will be applied in accordance with second.moment.

second.moment.xfm.opts

optional arguments to pass to the second.moment.xfm option specified. Should be a numbered list of lists, where second.moment.xfm.opts[[i]] corresponds to the optional arguments for second.moment.xfm[[i]]. Defaults to the default options for each transform scheme.

first.moment

the function to capture the first moment. Defaults to 'delta'.

  • 'delta': capture the first moment with the hyperplane separating the per-class means.

  • FALSE: do not capture the first moment.

second.moment

the function to capture the second moment. Defaults to 'linear'.

  • 'linear': performs PCA on the class-conditional data to capture the second moment, retaining the vectors with the top singular values. Transform options for second.moment.xfm and arguments in second.moment.xfm.opts should be in accordance with the trailing arguments for lol.project.lrlda.

  • 'quadratic': performs PCA on the data for each class separately to capture the second moment, retaining the vectors with the top singular values from each class's PCA. Transform options for second.moment.xfm and arguments in second.moment.xfm.opts should be in accordance with the trailing arguments for lol.project.pca.

  • 'pls': performs PLS on the data to capture the second moment, retaining the vectors that maximize the correlation between the different classes. Transform options for second.moment.xfm and arguments in second.moment.xfm.opts should be in accordance with the trailing arguments for lol.project.pls.

  • FALSE: do not capture the second moment.

orthogonalize

whether to orthogonalize the projection matrix. Defaults to FALSE.

robust

whether to perform PCA on a robust estimate of the covariance matrix or not. Defaults to FALSE.

...

trailing args.

Value

A list containing the following:

A

[d, r] the projection matrix from d to r dimensions.

ylabs

[K] vector containing the K unique, ordered class labels.

centroids

[K, d] centroid matrix of the K unique, ordered classes in native d dimensions.

priors

[K] vector containing the K prior probabilities for the unique, ordered classes.

Xr

[n, r] the n data points in reduced dimensionality r.

cr

[K, r] the K centroids in reduced dimensionality r.

second.moment

the method used to estimate the second moment.

first.moment

the method used to estimate the first moment.

Details

For more details see the help vignette: vignette("lol", package = "lolR")

Author(s)

Eric Bridgeford

References

Joshua T. Vogelstein, et al. "Supervised Dimensionality Reduction for Big Data" arXiv (2020).

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lol(X=X, Y=Y, r=5)  # use lol to project into 5 dimensions
# use lol to project into 5 dimensions, and produce an orthogonal basis for the projection matrix
model <- lol.project.lol(X=X, Y=Y, r=5, orthogonalize=TRUE)
# use LRQDA to estimate the second moment by performing PCA on each class
model <- lol.project.lol(X=X, Y=Y, r=5, second.moment='quadratic')
# use PLS to estimate the second moment
model <- lol.project.lol(X=X, Y=Y, r=5, second.moment='pls')
# use LRLDA to estimate the second moment, and apply a unit transformation
# (according to scale function) with no centering
model <- lol.project.lol(X=X, Y=Y, r=5, second.moment='linear', second.moment.xfm='unit',
                         second.moment.xfm.opts=list(center=FALSE))
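To make "combining information from the first and second moments" concrete, here is a simplified two-class sketch in base R. It assumes the first moment is the difference of the class means and the second moment is the top right singular vectors of the class-centered data, stacked column-wise. This is an illustration of the idea with the default 'delta' and 'linear' options, not the package's implementation, and all variable names are hypothetical.

```r
# Simplified two-class illustration of the LOL idea; base R, synthetic data.
set.seed(1)
n <- 100; d <- 10; r <- 3
Y <- rep(c(1, 2), each = n / 2)
X <- matrix(rnorm(n * d), nrow = n)
X[Y == 2, 1] <- X[Y == 2, 1] + 2           # classes differ in the first dimension

mu1 <- colMeans(X[Y == 1, ])
mu2 <- colMeans(X[Y == 2, ])
delta <- mu1 - mu2                         # first moment: difference of class means

Xc <- X                                    # class-centered copy of the data
Xc[Y == 1, ] <- sweep(X[Y == 1, ], 2, mu1, "-")
Xc[Y == 2, ] <- sweep(X[Y == 2, ], 2, mu2, "-")
V <- svd(Xc, nu = 0, nv = r - 1)$v         # second moment: [d, r - 1] top directions

A <- cbind(delta, V)                       # [d, r] combined projection matrix
A_orth <- qr.Q(qr(A))                      # optional: orthonormal basis of the same span
Xr <- X %*% A                              # [n, r] projected data
```

The QR step plays the role that orthogonalize = TRUE is described as playing above; how the package itself orthogonalizes is not specified in this manual.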

Low-rank Canonical Correlation Analysis (LR-CCA)

Description

A function for implementing the Low-rank Canonical Correlation Analysis (LR-CCA) Algorithm.

Usage

lol.project.lrcca(X, Y, r, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

r

the rank of the projection.

...

trailing args.

Value

A list containing the following:

A

[d, r] the projection matrix from d to r dimensions.

d

the eigenvalues associated with the eigendecomposition.

ylabs

[K] vector containing the K unique, ordered class labels.

centroids

[K, d] centroid matrix of the K unique, ordered classes in native d dimensions.

priors

[K] vector containing the K prior probabilities for the unique, ordered classes.

Xr

[n, r] the n data points in reduced dimensionality r.

cr

[K, r] the K centroids in reduced dimensionality r.

Details

For more details see the help vignette: vignette("lrcca", package = "lolR")

Author(s)

Eric Bridgeford and Minh Tang

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lrcca(X=X, Y=Y, r=5)  # use lrcca to project into 5 dimensions

Low-Rank Linear Discriminant Analysis (LRLDA)

Description

A function that performs LRLDA on the class-centered data. Same as class-conditional PCA.

Usage

lol.project.lrlda(X, Y, r, xfm = FALSE, xfm.opts = list(), robust = FALSE, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

r

the rank of the projection.

xfm

whether to transform the variables before taking the SVD.

  • FALSE: apply no transform to the variables.

  • 'unit': unit transform the variables, defaulting to centering and scaling to mean 0, variance 1. See scale for details and optional args.

  • 'log': log-transform the variables, for use-cases such as having high variance in larger values. Defaults to natural logarithm. See log for details and optional args.

  • 'rank': rank-transform the variables. Defaults to breaking ties with the average rank of the tied values. See rank for details and optional args.

  • c(opt1, opt2, etc.): apply the transform specified in opt1, followed by opt2, etc.

xfm.opts

optional arguments to pass to the xfm option specified. Should be a numbered list of lists, where xfm.opts[[i]] corresponds to the optional arguments for xfm[i]. Defaults to the default options for each transform scheme.

robust

whether to use a robust estimate of the covariance matrix when taking PCA. Defaults to FALSE.

...

trailing args.

Value

A list containing the following:

A

[d, r] the projection matrix from d to r dimensions.

d

the eigenvalues associated with the eigendecomposition.

ylabs

[K] vector containing the K unique, ordered class labels.

centroids

[K, d] centroid matrix of the K unique, ordered classes in native d dimensions.

priors

[K] vector containing the K prior probabilities for the unique, ordered classes.

Xr

[n, r] the n data points in reduced dimensionality r.

cr

[K, r] the K centroids in reduced dimensionality r.

Details

For more details see the help vignette: vignette("lrlda", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.lrlda(X=X, Y=Y, r=2)  # use lrlda to project into 2 dimensions
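Class-conditional PCA, as the description puts it, can be sketched in base R by subtracting each sample's class centroid and then taking the top right singular vectors. The sketch below assumes plain SVD with no transform and no robust covariance estimate; variable names mirror the Value section but the code is illustrative, not the package's.

```r
# Illustrative class-conditional PCA; base R, synthetic three-class data.
set.seed(1)
n <- 60; d <- 8; r <- 2
Y <- rep(1:3, each = n / 3)
X <- matrix(rnorm(n * d), nrow = n) + outer(Y, rep(1, d))  # class k has mean k in every dimension

ylabs <- sort(unique(Y))                                   # [K] ordered labels
centroids <- t(sapply(ylabs, function(y) {
  colMeans(X[Y == y, , drop = FALSE])
}))                                                        # [K, d]

# subtract from every sample the centroid of its own class
Xc <- X - centroids[match(Y, ylabs), , drop = FALSE]

A <- svd(Xc, nu = 0, nv = r)$v                             # [d, r] projection matrix
Xr <- X %*% A                                              # [n, r] projected data
cr <- centroids %*% A                                      # [K, r] projected centroids
```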

Principal Component Analysis (PCA)

Description

A function that performs PCA on data.

Usage

lol.project.pca(X, r, xfm = FALSE, xfm.opts = list(), robust = FALSE, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

r

the rank of the projection.

xfm

whether to transform the variables before taking the SVD.

  • FALSE: apply no transform to the variables.

  • 'unit': unit transform the variables, defaulting to centering and scaling to mean 0, variance 1. See scale for details and optional arguments to be passed with xfm.opts.

  • 'log': log-transform the variables, for use-cases such as having high variance in larger values. Defaults to natural logarithm. See log for details and optional arguments to be passed with xfm.opts.

  • 'rank': rank-transform the variables. Defaults to breaking ties with the average rank of the tied values. See rank for details and optional arguments to be passed with xfm.opts.

  • c(opt1, opt2, etc.): apply the transform specified in opt1, followed by opt2, etc.

xfm.opts

optional arguments to pass to the xfm option specified. Should be a numbered list of lists, where xfm.opts[[i]] corresponds to the optional arguments for xfm[i]. Defaults to the default options for each transform scheme.

robust

whether to perform PCA on a robust estimate of the covariance matrix or not. Defaults to FALSE.

...

trailing args.

Value

A list containing the following:

A

[d, r] the projection matrix from d to r dimensions.

d

the eigenvalues associated with the eigendecomposition.

Xr

[n, r] the n data points in reduced dimensionality r.

Details

For more details see the help vignette: vignette("pca", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.pca(X=X, r=2)  # use pca to project into 2 dimensions
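For readers who want to see what a rank-r PCA projection amounts to, here is a textbook version in base R using the SVD of the column-centered data. It is a sketch under the assumption of no transform and no robust estimate; whether the package projects the centered or the raw data is not stated here, so the centered form is an assumption.

```r
# Textbook PCA via SVD; base R, synthetic data.
set.seed(1)
n <- 50; d <- 8; r <- 2
X <- matrix(rnorm(n * d), nrow = n)

Xc <- sweep(X, 2, colMeans(X), "-")   # center each column
sv <- svd(Xc, nu = 0, nv = r)
A <- sv$v                             # [d, r] top-r principal directions
eigvals <- sv$d^2 / (n - 1)           # eigenvalues of the sample covariance
Xr <- Xc %*% A                        # [n, r] scores (centered data assumed)
```

Up to the sign of each column, A should agree with the first r columns of prcomp(X)$rotation, which makes a convenient sanity check.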

Partial Least-Squares (PLS)

Description

A function for implementing the Partial Least-Squares (PLS) Algorithm.

Usage

lol.project.pls(X, Y, r, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

r

the rank of the projection.

...

trailing args.

Value

A list containing the following:

A

[d, r] the projection matrix from d to r dimensions.

ylabs

[K] vector containing the K unique, ordered class labels.

centroids

[K, d] centroid matrix of the K unique, ordered classes in native d dimensions.

priors

[K] vector containing the K prior probabilities for the unique, ordered classes.

Xr

[n, r] the n data points in reduced dimensionality r.

cr

[K, r] the K centroids in reduced dimensionality r.

Details

For more details see the help vignette: vignette("pls", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.pls(X=X, Y=Y, r=5)  # use pls to project into 5 dimensions

Random Projections (RP)

Description

A function for implementing Gaussian random projections (RP).

Usage

lol.project.rp(X, r, scale = TRUE, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

r

the rank of the projection. Note that r >= K, and r < d.

scale

whether to scale the random projection by sqrt(1/d). Defaults to TRUE.

...

trailing args.

Value

A list containing the following:

A

[d, r] the projection matrix from d to r dimensions.

Xr

[n, r] the n data points in reduced dimensionality r.

Details

For more details see the help vignette: vignette("rp", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.project.rp(X=X, r=5)  # use rp to project into 5 dimensions
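A Gaussian random projection is simple enough to write out directly. The sketch below draws a [d, r] matrix of independent standard normal entries and applies the sqrt(1/d) scaling mentioned for scale = TRUE; that the package draws its entries exactly this way is an assumption.

```r
# Illustrative Gaussian random projection; base R, synthetic data.
set.seed(1)
n <- 20; d <- 100; r <- 5
X <- matrix(rnorm(n * d), nrow = n)              # [n, d] data

A <- matrix(rnorm(d * r), nrow = d, ncol = r)    # [d, r] i.i.d. standard normal entries
A <- A * sqrt(1 / d)                             # scaling described for scale = TRUE
Xr <- X %*% A                                    # [n, r] projected data
```

Unlike the supervised projections above, this matrix ignores both the labels and the data, which is why it serves as a cheap baseline.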

Stacked Cigar

Description

A simulation for the stacked cigar experiment.

Usage

lol.sims.cigar(n, d, rotate = FALSE, priors = NULL, a = 0.15, b = 4)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

priors

the priors for each class. If NULL, class priors are all equal. If not null, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

a

scalar for all dimensions of mu1 except the 2nd. Defaults to 0.15.

b

scalar for the 2nd dimension value of mu2 and the 2nd variance term of S. Defaults to 4.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.cigar(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

Cross

Description

A simulation for the cross experiment, in which the two classes have orthogonal covariant dimensions and the same means.

Usage

lol.sims.cross(n, d, rotate = FALSE, priors = NULL, a = 1, b = 0.25, K = 2)

Arguments

n

the number of samples of simulated data.

d

the dimensionality of the simulated data.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

priors

the priors for each class. If NULL, class priors are all equal. If not null, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

a

scalar for the magnitude of the variance that is high within the particular class. Defaults to 1.

b

scalar for the magnitude of the variance that is not high within the particular class. Defaults to 0.25.

K

the number of classes. Defaults to 2.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.cross(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

Fat Tails Simulation

Description

A function for simulating from 2 classes with differing means each with 2 sub-clusters, where one sub-cluster has a narrow tail and the other sub-cluster has a fat tail.

Usage

lol.sims.fat_tails(
  n,
  d,
  rotate = FALSE,
  f = 15,
  s0 = 10,
  rho = 0.2,
  t = 0.8,
  priors = NULL
)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

f

the fatness scaling of the tail. S2 = f*S1, where S1_ij = rho if i != j, and 1 if i == j. Defaults to 15.

s0

the number of dimensions with a difference in the means. s0 should be < d. Defaults to 10.

rho

the scaling of the off-diagonal covariance terms, should be < 1. Defaults to 0.2.

t

the fraction of each class from the narrower-tailed distribution. Defaults to 0.8.

priors

the priors for each class. If NULL, class priors are all equal. If not null, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.fat_tails(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

Mean Difference Simulation

Description

A function for simulating data in which a difference in the means is present only in a subset of dimensions, and equal covariance.

Usage

lol.sims.mean_diff(
  n,
  d,
  rotate = FALSE,
  priors = NULL,
  K = 2,
  md = 1,
  subset = c(1),
  offdiag = 0,
  s = 1
)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

priors

the priors for each class. If NULL, class priors are all equal. If not null, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

K

the number of classes. Defaults to 2.

md

the magnitude of the difference in the means in the specified subset of dimensions. Defaults to 1.

subset

the dimensions to have a difference in the means, with max(subset) < d. Defaults to c(1), i.e., only the first dimension.

offdiag

the off-diagonal elements of the covariance matrix. Should be < 1. S_{ij} = offdiag if i != j, or 1 if i == j. Defaults to 0.

s

the scaling parameter of the covariance matrix. S_ij = s*1 if i == j, or s*offdiag if i != j. Defaults to 1.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.mean_diff(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
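The covariance structure described by offdiag and s, and the mean difference described by md and subset, can be written out explicitly in base R. The sketch below follows the formulas in the argument descriptions for K = 2; placing the whole shift on the second class (with the first class at the origin) is an assumption made for illustration, since the manual only specifies the difference.

```r
# Illustrative parameters for a two-class mean-difference setting; base R only.
d <- 4; offdiag <- 0.3; s <- 2; md <- 1; subset <- c(1)

# shared covariance: s on the diagonal, s * offdiag everywhere else
S <- matrix(s * offdiag, nrow = d, ncol = d)
diag(S) <- s

# [d, K] class means; assumption: class 1 at the origin, class 2 shifted by md
mus <- matrix(0, nrow = d, ncol = 2)
mus[subset, 2] <- md

mean_gap <- mus[, 2] - mus[, 1]   # nonzero only in the chosen subset of dimensions
```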

Quadratic Discriminant Toeplitz Simulation

Description

A function for simulating data generalizing the Toeplitz setting, where each class has a different covariance matrix. This results in a Quadratic Discriminant.

Usage

lol.sims.qdtoep(
  n,
  d,
  rotate = FALSE,
  priors = NULL,
  D1 = 10,
  b = 0.4,
  rho = 0.5
)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

priors

the priors for each class. If NULL, class priors are all equal. If not null, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

D1

the dimensionality for the non-equal covariance terms. Defaults to 10.

b

a scaling parameter for the means. Defaults to 0.4.

rho

the scaling of the covariance terms, should be < 1. Defaults to 0.5.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.qdtoep(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

Random Rotation

Description

A helper function for applying a random rotation to a Gaussian parameter set.

Usage

lol.sims.random_rotate(mus, Sigmas, Q = NULL)

Arguments

mus

means per class.

Sigmas

covariances per class.

Q

rotation to use, if any

Author(s)

Eric Bridgeford


Reverse Random Trunk

Description

A simulation for the reversed random trunk experiment, in which the maximal covariant directions are the same as the directions with the maximal mean difference.

Usage

lol.sims.rev_rtrunk(
  n,
  d,
  robust = FALSE,
  rotate = FALSE,
  priors = NULL,
  b = 4,
  K = 2,
  maxvar = b^3,
  maxvar.outlier = maxvar^3
)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

robust

the number of outlier points to add, where outliers have opposite covariance of inliers. Defaults to FALSE, which will not add any outliers.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

priors

the priors for each class. If NULL, class priors are all equal. If not null, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

b

scalar for mu scaling. Defaults to 4.

K

number of classes, should be < 4. Defaults to 2.

maxvar

the maximum covariance between the two classes. Defaults to b^3.

maxvar.outlier

the maximum covariance for the outlier points. Defaults to maxvar^3.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

robust

If robust is not false, a list containing inlier, a boolean array indicating which points are inliers; s.outlier, the covariance structure of outliers; and mu.outlier, the means of the outliers.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rev_rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

Sample Random Rotation

Description

A helper function for estimating a random rotation matrix.

Usage

lol.sims.rotation(d)

Arguments

d

dimensions to generate a rotation matrix for.

Value

the rotation matrix

Author(s)

Eric Bridgeford
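One common way to sample a random orthogonal matrix, and then apply it to a mean and covariance, is sketched below in base R. The QR-of-a-Gaussian-matrix construction is an assumption (this manual does not say how the package samples Q), and the covariance is rotated as Q S t(Q) so that it stays symmetric. Note that a matrix sampled this way is orthogonal but may include a reflection.

```r
# Illustrative random rotation of Gaussian parameters; base R only.
set.seed(1)
d <- 4

# orthogonal matrix from the QR decomposition of a Gaussian matrix (assumed construction)
Q <- qr.Q(qr(matrix(rnorm(d * d), nrow = d)))

mu <- c(1, 0, 0, 0)              # a class mean
S <- diag(c(4, 1, 1, 1))         # a class covariance

mu_rot <- as.vector(Q %*% mu)    # rotated mean
S_rot <- Q %*% S %*% t(Q)        # rotated covariance, still symmetric
```

The rotation preserves the length of the mean and the eigenvalues of the covariance, so the difficulty of the classification problem is unchanged while the informative directions are no longer axis-aligned.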


Random Trunk

Description

A simulation for the random trunk experiment, in which the maximal covariant dimensions are the reverse of the maximal mean differences.

Usage

lol.sims.rtrunk(
  n,
  d,
  rotate = FALSE,
  priors = NULL,
  b = 4,
  K = 2,
  maxvar = 100
)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

priors

the priors for each class. If NULL, class priors are all equal. If not null, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

b

scalar for mu scaling. Defaults to 4.

K

number of classes, should be < 4. Defaults to 2.

maxvar

the maximum covariance between the two classes. Defaults to 100.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

robust

If robust is not false, a list containing inlier, a boolean array indicating which points are inliers; s.outlier, the covariance structure of outliers; and mu.outlier, the means of the outliers.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

GMM Simulate

Description

A helper function for simulating from a Gaussian mixture.

Usage

lol.sims.sim_gmm(mus, Sigmas, n, priors)

Arguments

mus

[d, K] the mus for each class.

Sigmas

[d,d,K] the Sigmas for each class.

n

the number of examples.

priors

[K] the priors for each class.

Value

A list with the following:

X

[n, d] the simulated data.

Y

[n] the labels for each data point.

priors

[K] the priors for each class.

Author(s)

Eric Bridgeford
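
Examples

The following is a minimal base-R sketch of what sampling from a Gaussian mixture with these inputs can look like. It is not the lolR implementation: the helper name sim_gmm_sketch and the Cholesky-based sampler are assumptions made for illustration, chosen only to match the documented argument and return shapes.

```r
# Hypothetical helper, NOT the lolR implementation: draw n points from a
# K-component Gaussian mixture using only base R.
sim_gmm_sketch <- function(mus, Sigmas, n, priors) {
  d <- nrow(mus)
  K <- ncol(mus)
  # sample a class label for every point according to the priors
  Y <- sample(seq_len(K), size = n, replace = TRUE, prob = priors)
  X <- matrix(0, nrow = n, ncol = d)
  for (k in seq_len(K)) {
    idx <- which(Y == k)
    if (length(idx) == 0) next
    # chol() returns upper-triangular R with t(R) %*% R == Sigma,
    # so the rows of Z %*% R have covariance Sigma
    R <- chol(matrix(Sigmas[, , k], nrow = d, ncol = d))
    Z <- matrix(rnorm(length(idx) * d), nrow = length(idx), ncol = d)
    X[idx, ] <- sweep(Z %*% R, 2, mus[, k], "+")
  }
  list(X = X, Y = Y, priors = priors)
}

set.seed(1)
mus <- cbind(c(0, 0), c(3, 3))               # [d = 2, K = 2] class means
Sigmas <- array(diag(2), dim = c(2, 2, 2))   # identity covariance per class
sim <- sim_gmm_sketch(mus, Sigmas, n = 100, priors = c(0.5, 0.5))
```

The returned list mirrors the documented Value fields X, Y, and priors.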


Toeplitz Simulation

Description

A function for simulating data in which the covariance is a non-symmetric Toeplitz matrix.

Usage

lol.sims.toep(n, d, rotate = FALSE, priors = NULL, D1 = 10, b = 0.4, rho = 0.5)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

rotate

whether to apply a random rotation to the mean and covariance. With random rotation matrix Q, mu = Q*mu, and S = Q*S*Q. Defaults to FALSE.

priors

the priors for each class. If NULL, class priors are all equal. If not NULL, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

D1

the dimensionality for the non-equal covariance terms. Defaults to 10.

b

a scaling parameter for the means. Defaults to 0.4.

rho

the scaling of the covariance terms, should be < 1. Defaults to 0.5.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.toep(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

Xor Problem

Description

A function to simulate from the 2-class xor problem.

Usage

lol.sims.xor2(n, d, priors = NULL, fall = 100)

Arguments

n

the number of samples of the simulated data.

d

the dimensionality of the simulated data.

priors

the priors for each class. If NULL, class priors are all equal. If not NULL, should be |priors| = K, a length K vector for K classes. Defaults to NULL.

fall

the falloff for the covariance structuring. Sigma declines by ndim/fall across the variance terms. Defaults to 100.

Value

A list of class simulation with the following:

X

[n, d] the n data points in d dimensions as a matrix.

Y

[n] the n labels as an array.

mus

[d, K] the K class means in d dimensions.

Sigmas

[d, d, K] the K class covariance matrices in d dimensions.

priors

[K] the priors for each of the K classes.

simtype

The name of the simulation.

params

Any extraneous parameters the simulation was created with.

Details

For more details see the help vignette: vignette("sims", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.xor2(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y

A utility to use irlba when necessary

Description

A utility to use irlba when necessary

Usage

lol.utils.decomp(X, xfm = FALSE, xfm.opts = list(), ncomp = 0, t = 0.05, robust = FALSE)

Arguments

X

the data to compute the svd of.

xfm

whether to transform the variables before taking the SVD.

  • FALSE: apply no transform to the variables.

  • 'unit': unit transform the variables, defaulting to centering and scaling to mean 0, variance 1. See scale for details and optional args.

  • 'log': log-transform the variables, for use-cases such as having high variance in larger values. Defaults to natural logarithm. See log for details and optional args.

  • 'rank': rank-transform the variables. Defaults to breaking ties with the average rank of the tied values. See rank for details and optional args.

  • c(opt1, opt2, etc.): apply the transform specified in opt1, followed by opt2, etc.

xfm.opts

optional arguments to pass to the xfm option specified. Should be a numbered list of lists, where xfm.opts[[i]] corresponds to the optional arguments for xfm[i]. Defaults to the default options for each transform scheme.

ncomp

the number of left singular vectors to retain.

t

the threshold of percent of singular values/vectors to use irlba.

robust

whether to use a robust estimate of the covariance matrix when taking PCA. Defaults to FALSE.

Value

the svd of X.

Author(s)

Eric Bridgeford
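
Examples

The xfm and xfm.opts arguments describe a chain of column-wise transforms applied before the SVD. The sketch below illustrates that idea with base R only. The helper apply_xfm_sketch is hypothetical and is not how lolR implements the option; it only uses scale, log, and rank, the functions the argument description points to.

```r
# Hypothetical helper (NOT part of lolR): apply a chain of column-wise
# transforms, mirroring the 'unit', 'log', and 'rank' options of xfm.
apply_xfm_sketch <- function(X, xfm = FALSE, xfm.opts = list()) {
  if (identical(xfm, FALSE)) return(X)
  for (i in seq_along(xfm)) {
    # optional arguments for the i-th transform; empty list means defaults
    opts <- if (length(xfm.opts) >= i) xfm.opts[[i]] else list()
    X <- switch(xfm[i],
      unit = do.call(scale, c(list(X), opts)),
      log  = do.call(log, c(list(X), opts)),
      rank = apply(X, 2, function(col) do.call(rank, c(list(col), opts))),
      stop("unknown transform: ", xfm[i])
    )
  }
  X
}

set.seed(1)
X <- matrix(rexp(50 * 4), nrow = 50, ncol = 4)      # strictly positive data
Xt <- apply_xfm_sketch(X, xfm = c("log", "unit"))   # log first, then center and scale
decomp <- svd(Xt, nu = 2, nv = 0)                   # retain 2 left singular vectors
```

Here the transforms are applied in the order given, so each column of Xt has mean 0 and variance 1 after the log transform.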


A function that performs a utility computation of information about the differences of the classes.

Description

A function that performs a utility computation of information about the differences of the classes.

Usage

lol.utils.deltas(centroids, priors, ...)

Arguments

centroids

[d, K] centroid matrix of the unique, ordered classes.

priors

[K] vector containing prior probability for the unique, ordered classes.

...

optional args.

Value

deltas [d, K] the K difference vectors.

Author(s)

Eric Bridgeford


A function that performs basic utilities about the data.

Description

A function that performs basic utilities about the data.

Usage

lol.utils.info(X, Y, robust = FALSE, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples.

robust

whether to perform PCA on a robust estimate of the covariance matrix or not. Defaults to FALSE.

...

optional args.

Value

n the number of samples.

d the number of dimensions.

ylabs [K] vector containing the unique, ordered class labels.

priors [K] vector containing prior probability for the unique, ordered classes.

Author(s)

Eric Bridgeford
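
Examples

As a rough illustration of the returned quantities, the base-R sketch below computes n, d, the ordered unique labels, and the class priors. The helper info_sketch is hypothetical, and taking the priors to be the empirical class frequencies is an assumption, not something this page states.

```r
# Hypothetical helper (NOT the lolR implementation): summarize a labeled
# data set. Priors are ASSUMED to be the empirical class frequencies.
info_sketch <- function(X, Y) {
  ylabs <- sort(unique(Y))                            # unique, ordered labels
  priors <- sapply(ylabs, function(y) mean(Y == y))   # fraction of samples per class
  list(n = nrow(X), d = ncol(X), ylabs = ylabs, priors = priors)
}

X <- matrix(0, nrow = 6, ncol = 3)
Y <- c(2, 1, 2, 2, 1, 3)
info <- info_sketch(X, Y)
```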


A function for one-hot encoding categorical response vectors.

Description

A function for one-hot encoding categorical response vectors.

Usage

lol.utils.ohe(Y)

Arguments

Y

[n] a vector of the categorical responses, with K unique categories.

Value

a list containing the following:

Yh

[n, K] the one-hot encoded Y response variable.

ylabs

[K] a vector of the y names corresponding to each response column.

Author(s)

Eric Bridgeford
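
Examples

One-hot encoding maps each of the n labels to a length K indicator row. The base-R sketch below shows one way to build such a matrix. The helper ohe_sketch is hypothetical and is only meant to match the documented shapes of Yh and ylabs.

```r
# Hypothetical helper (NOT the lolR implementation): one-hot encode a
# categorical vector using base R.
ohe_sketch <- function(Y) {
  ylabs <- sort(unique(Y))
  # Yh[i, k] is 1 when sample i belongs to class ylabs[k], and 0 otherwise
  Yh <- outer(Y, ylabs, "==") * 1
  list(Yh = Yh, ylabs = ylabs)
}

res <- ohe_sketch(c("b", "a", "c", "a"))
```

Each row of res$Yh contains exactly one 1, in the column of the matching entry of res$ylabs.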


Embedding Cross Validation

Description

A function for performing leave-one-out cross-validation for a given embedding model. This function produces fold-wise cross-validated misclassification rates for standard embedding techniques. Users can optionally specify custom embedding techniques with proper configuration of alg.* parameters and hyperparameters. Optional classifiers implementing the S3 predict function can be used for classification, with hyperparameters to classifiers for determining misclassification rate specified in classifier.* parameters and hyperparameters.

Usage

lol.xval.eval(X, Y, r, alg, sets = NULL, alg.dimname = "r", alg.opts = list(), alg.embedding = "A", classifier = lda, classifier.opts = list(), classifier.return = "class", k = "loo", rank.low = FALSE, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

r

the number of embedding dimensions desired, where r <= d.

alg

the algorithm to use for embedding. Should be a function that accepts inputs X, Y, and has a parameter for alg.dimname if alg is supervised, or just X and alg.dimname if alg is unsupervised. This algorithm should return a list containing a matrix that embeds from d to r <= d dimensions.

sets

a user-defined cross-validation set. Defaults to NULL.

  • is.null(sets) randomly partition the inputs X and Y into training and testing sets.

  • !is.null(sets) use a user-defined partitioning of the inputs X and Y into training and testing sets. Should be in the format of the outputs from lol.xval.split. That is, a list with each element containing X.train, an [n-k][d] subset of data to train on; Y.train, an [n-k] subset of class labels for X.train; X.test, a [k][d] subset of data to test the model on; and Y.test, a [k] subset of class labels for X.test.

alg.dimname

the name of the parameter accepted by alg for indicating the embedding dimensionality desired. Defaults to "r".

alg.opts

the hyper-parameter options you want to pass into your algorithm, as a keyworded list. Defaults to list(), or no hyper-parameters.

alg.embedding

the attribute returned by alg containing the embedding matrix. Defaults to assuming that alg returns an embedding matrix as "A".

  • !is.nan(alg.embedding) Assumes that alg will return a list containing an attribute, alg.embedding, a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.

  • is.nan(alg.embedding) Assumes that alg returns a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.

classifier

the classifier to use for assessing performance. The classifier should accept X, an [n, d] array, and Y, an [n] array of labels, as the first 2 arguments. The class should implement a predict function, predict.classifier, that is compatible with the stats::predict S3 method. Defaults to MASS::lda.

classifier.opts

any extraneous options to be passed to the classifier function, as a list. Defaults to an empty list.

classifier.return

if the return type is a list, class encodes the attribute containing the prediction labels from stats::predict. Defaults to the return type of MASS::lda, class.

  • !is.nan(classifier.return) Assumes that predict.classifier will return a list containing an attribute, classifier.return, that encodes the predicted labels.

  • is.nan(classifier.return) Assumes that predict.classifier returns an [n] vector/array containing the prediction labels for [n, d] inputs.

k

the cross-validated method to perform. Defaults to 'loo'. If sets is provided, this option is ignored. See lol.xval.split for details.

  • 'loo' Leave-one-out cross validation

  • isinteger(k) perform k-fold cross-validation with k as the number of folds.

rank.low

whether to force the training set to low-rank. Defaults to FALSE. If sets is provided, this option is ignored. See lol.xval.split for details.

  • if rank.low == FALSE, uses default cross-validation method with standard k-fold validation. Training sets are k-1 folds, and testing sets are 1 fold, where the fold held-out for testing is rotated to ensure no dependence of potential downstream inference in the cross-validated misclassification rates.

  • if rank.low == TRUE, uses a cross-validation method with ntrain = min((k-1)/k*n, d) sample training sets, where d is the number of dimensions in X. This ensures that the training data is always low-rank, ntrain < d + 1. Note that the resulting training sets may have ntrain < (k-1)/k*n, but the resulting testing sets will always be properly rotated with ntest = n/k to ensure no dependencies in fold-wise testing.

...

trailing args.

Value

Returns a list containing:

lhat

the mean cross-validated error.

model

The model returned by alg computed on all of the data.

classifier

The classifier trained on all of the embedded data.

lhats

the cross-validated error for each of the k folds.

Details

For more details see the help vignette: vignette("xval", package = "lolR")

For extending cross-validation techniques shown here to arbitrary embedding algorithms, see the vignette: vignette("extend_embedding", package = "lolR")

For extending cross-validation techniques shown here to arbitrary classification algorithms, see the vignette: vignette("extend_classification", package = "lolR")

Author(s)

Eric Bridgeford

Examples

# train model and analyze with loo validation using the nearestCentroid classifier
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
r <- 5  # embed into r=5 dimensions
# run cross-validation with the nearestCentroid method and
# leave-one-out cross-validation, which returns only
# prediction labels so we specify classifier.return as NaN
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol,
                          classifier=lol.classify.nearestCentroid,
                          classifier.return=NaN, k='loo')

# train model and analyze with 5-fold validation using lda classifier
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol, k=5)

# pass in existing cross-validation sets
sets <- lol.xval.split(X, Y, k=2)
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol, sets=sets)
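
To make the flow described above concrete (embed the training fold, train a classifier on the embedded data, then score the held-out fold), here is a self-contained base-R sketch. It does not call lolR and is not its implementation. The functions embed_pca_sketch, fit_centroids, predict_centroids, and xval_sketch are hypothetical stand-ins: PCA plays the role of alg and returns its embedding matrix as A, and a nearest-centroid rule plays the role of the classifier.

```r
# Hypothetical embedding (stand-in for alg): top-r right singular vectors
# of the centered data, returned in a list under the name A.
embed_pca_sketch <- function(X, r) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  list(A = svd(Xc, nu = 0, nv = r)$v)   # [d, r] embedding matrix
}

# Hypothetical nearest-centroid classifier on embedded data.
fit_centroids <- function(X, Y) {
  ylabs <- sort(unique(Y))
  centroids <- do.call(rbind, lapply(ylabs, function(y) {
    colMeans(X[Y == y, , drop = FALSE])
  }))
  list(centroids = centroids, ylabs = ylabs)
}
predict_centroids <- function(model, X) {
  K <- length(model$ylabs)
  # squared distance from every row of X to every centroid, as [n, K]
  dists <- matrix(sapply(seq_len(K), function(k) {
    rowSums(sweep(X, 2, model$centroids[k, ], "-")^2)
  }), nrow = nrow(X), ncol = K)
  model$ylabs[max.col(-dists, ties.method = "first")]
}

# k-fold loop: fit the embedding and classifier on the training folds only.
xval_sketch <- function(X, Y, r, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = nrow(X)))  # random fold labels
  lhats <- sapply(seq_len(k), function(f) {
    train <- which(fold != f)
    test <- which(fold == f)
    A <- embed_pca_sketch(X[train, , drop = FALSE], r)$A
    model <- fit_centroids(X[train, , drop = FALSE] %*% A, Y[train])
    Yhat <- predict_centroids(model, X[test, , drop = FALSE] %*% A)
    mean(Yhat != Y[test])   # fold-wise misclassification rate
  })
  list(lhat = mean(lhats), lhats = lhats)
}

set.seed(1)
n <- 60; d <- 10
Y <- rep(c(1, 2), each = n / 2)
X <- matrix(rnorm(n * d), nrow = n, ncol = d)
X[Y == 2, 1] <- X[Y == 2, 1] + 4   # shift class 2 along the first dimension
fit <- xval_sketch(X, Y, r = 2, k = 5)
```

The sketch returns lhat and lhats in the same spirit as the Value section: one error per fold and their mean.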

Optimal Cross-Validated Number of Embedding Dimensions

Description

A function for performing leave-one-out cross-validation for a given embedding model, that allows users to determine the optimal number of embedding dimensions for their algorithm-of-choice. This function produces fold-wise cross-validated misclassification rates for standard embedding techniques across a specified selection of embedding dimensions. Optimal embedding dimension is selected as the dimension with the lowest average misclassification rate across all folds. Users can optionally specify custom embedding techniques with proper configuration of alg.* parameters and hyperparameters. Optional classifiers implementing the S3 predict function can be used for classification, with hyperparameters to classifiers for determining misclassification rate specified in classifier.*.

Usage

lol.xval.optimal_dimselect(X, Y, rs, alg, sets = NULL, alg.dimname = "r", alg.opts = list(), alg.embedding = "A", alg.structured = TRUE, classifier = lda, classifier.opts = list(), classifier.return = "class", k = "loo", rank.low = FALSE, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

rs

[r.n] the embedding dimensions to investigate over, where max(rs) <= d.

alg

the algorithm to use for embedding. Should be a function that accepts inputs X and Y and embedding dimension r if alg is supervised, or just X and embedding dimension r if alg is unsupervised. This algorithm should return a list containing a matrix that embeds from d to r < d dimensions.

sets

a user-defined cross-validation set. Defaults to NULL.

  • is.null(sets) randomly partition the inputs X and Y into training and testing sets.

  • !is.null(sets) use a user-defined partitioning of the inputs X and Y into training and testing sets. Should be in the format of the outputs from lol.xval.split. That is, a list with each element containing X.train, an [n-k][d] subset of data to train on; Y.train, an [n-k] subset of class labels for X.train; X.test, a [k][d] subset of data to test the model on; and Y.test, a [k] subset of class labels for X.test.

alg.dimname

the name of the parameter accepted by alg for indicating the embedding dimensionality desired. Defaults to "r".

alg.opts

the hyper-parameter options to pass to your algorithm as a keyworded list. Defaults to list(), or no hyper-parameters. This should not include the number of embedding dimensions, r, which are passed separately in the rs vector.

alg.embedding

the attribute returned by alg containing the embedding matrix. Defaults to assuming that alg returns an embedding matrix as "A".

  • !is.nan(alg.embedding) Assumes that alg will return a list containing an attribute, alg.embedding, a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.

  • is.nan(alg.embedding) Assumes that alg returns a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.

alg.structured

a boolean to indicate whether the embedding matrix is structured. Provides performance increase by not having to compute the embedding matrix xv times if unnecessary. Defaults to TRUE.

  • TRUE assumes that if Ar: R^d -> R^r embeds from d to r dimensions and Aq: R^d -> R^q from d to q > r dimensions, that Aq[, 1:r] == Ar.

  • FALSE assumes that if Ar: R^d -> R^r embeds from d to r dimensions and Aq: R^d -> R^q from d to q > r dimensions, that Aq[, 1:r] != Ar.

classifier

the classifier to use for assessing performance. The classifier should accept X, an [n, d] array, and Y, an [n] array of labels, as the first 2 arguments. The class should implement a predict function, predict.classifier, that is compatible with the stats::predict S3 method. Defaults to MASS::lda.

classifier.opts

any extraneous options to be passed to the classifier function, as a list. Defaults to an empty list.

classifier.return

if the return type is a list, class encodes the attribute containing the prediction labels from stats::predict. Defaults to the return type of MASS::lda, class.

  • !is.nan(classifier.return) Assumes that predict.classifier will return a list containing an attribute, classifier.return, that encodes the predicted labels.

  • is.nan(classifier.return) Assumes that predict.classifier returns an [n] vector/array containing the prediction labels for [n, d] inputs.

k

the cross-validated method to perform. Defaults to 'loo'. If sets is provided, this option is ignored. See lol.xval.split for details.

  • 'loo' Leave-one-out cross validation

  • isinteger(k) perform k-fold cross-validation with k as the number of folds.

rank.low

whether to force the training set to low-rank. Defaults to FALSE. If sets is provided, this option is ignored. See lol.xval.split for details.

  • if rank.low == FALSE, uses default cross-validation method with standard k-fold validation. Training sets are k-1 folds, and testing sets are 1 fold, where the fold held-out for testing is rotated to ensure no dependence of potential downstream inference in the cross-validated misclassification rates.

  • if rank.low == TRUE, uses a cross-validation method with ntrain = min((k-1)/k*n, d) sample training sets, where d is the number of dimensions in X. This ensures that the training data is always low-rank, ntrain < d + 1. Note that the resulting training sets may have ntrain < (k-1)/k*n, but the resulting testing sets will always be properly rotated with ntest = n/k to ensure no dependencies in fold-wise testing.

...

trailing args.

Value

Returns a list containing:

folds.data

the results, as a data-frame, of the per-fold classification accuracy.

foldmeans.data

the results, as a data-frame, of the average classification accuracy for each r.

optimal.lhat

the classification error of the optimal r.

optimal.r

the optimal number of embedding dimensions from rs.

model

the model trained on all of the data at the optimal number of embedding dimensions.

classifier

the classifier trained on all of the data at the optimal number of embedding dimensions.

Details

For more details see the help vignette: vignette("xval", package = "lolR")

For extending cross-validation techniques shown here to arbitrary embedding algorithms, see the vignette: vignette("extend_embedding", package = "lolR")

For extending cross-validation techniques shown here to arbitrary classification algorithms, see the vignette: vignette("extend_classification", package = "lolR")

Author(s)

Eric Bridgeford

Examples

# train model and analyze with loo validation using the nearestCentroid classifier
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
# run cross-validation with the nearestCentroid method and
# leave-one-out cross-validation, which returns only
# prediction labels so we specify classifier.return as NaN
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol,
                          classifier=lol.classify.nearestCentroid,
                          classifier.return=NaN, k='loo')

# train model and analyze with 5-fold validation using lda classifier
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, k=5)

# pass in existing cross-validation sets
sets <- lol.xval.split(X, Y, k=2)
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, sets=sets)
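
The alg.structured argument relies on a nesting property: for a structured embedding, the first r columns of the q-dimensional embedding matrix already form the r-dimensional embedding. The base-R sketch below illustrates that property using PCA computed with svd as an assumed example of a structured embedding. It does not show how lolR exploits the property internally.

```r
# PCA as an example of a "structured" embedding: the top-r directions are
# the first r columns of the top-q directions, for any r < q.
set.seed(1)
X <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6)
Xc <- scale(X, center = TRUE, scale = FALSE)
rs <- c(1, 2, 4)

# compute the embedding once, at the largest dimension of interest
Aq <- svd(Xc, nu = 0, nv = max(rs))$v

# reuse its leading columns for every smaller dimension
embeddings <- lapply(rs, function(r) Xc %*% Aq[, seq_len(r), drop = FALSE])

# recomputing directly at r = 2 gives the same directions (up to sign)
A2 <- svd(Xc, nu = 0, nv = 2)$v
same <- isTRUE(all.equal(abs(Aq[, 1:2]), abs(A2)))
```

For an embedding without this property, each dimension in rs would need its own fit, which is the case alg.structured = FALSE describes.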

Cross-Validation Data Splitter

Description

A function to split a dataset into training and testing sets for cross validation. The procedure for cross-validation is to split the data into k folds. The k folds are then rotated individually to form a single held-out testing set the model will be validated on, and the remaining (k-1) folds are used for training the developed model. Note that this cross-validation function includes functionality to be used for low-rank cross-validation. In that case, instead of using the full (k-1) folds for training, we subset min((k-1)/k*n, d) samples to ensure that the resulting training sets are all low-rank. We still rotate properly over the held-out fold to ensure that the resulting testing sets do not have any shared examples, which would add a complicated dependence structure to any inference performed on the testing sets.

Usage

lol.xval.split(X, Y, k = "loo", rank.low = FALSE, ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

k

the cross-validated method to perform. Defaults to 'loo'.

  • if k == round(k), performs k-fold cross-validation.

  • if k == 'loo', performs leave-one-out cross-validation.

rank.low

whether to force the training set to low-rank. Defaults to FALSE.

  • if rank.low == FALSE, uses default cross-validation method with standard k-fold validation. Training sets are k-1 folds, and testing sets are 1 fold, where the fold held-out for testing is rotated to ensure no dependence of potential downstream inference in the cross-validated misclassification rates.

  • if rank.low == TRUE, uses a cross-validation method with ntrain = min((k-1)/k*n, d) sample training sets, where d is the number of dimensions in X. This ensures that the training data is always low-rank, ntrain < d + 1. Note that the resulting training sets may have ntrain < (k-1)/k*n, but the resulting testing sets will always be properly rotated with ntest = n/k to ensure no dependencies in fold-wise testing.

...

optional args.

Value

sets the cross-validation sets as an object of class "XV" containing the following:

train

length [ntrain] vector indicating the indices of the training examples.

test

length [ntest] vector indicating the indices of the testing examples.

Author(s)

Eric Bridgeford

Examples

# prepare data for 10-fold validation
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
sets.xval.10fold <- lol.xval.split(X, Y, k=10)

# prepare data for loo validation
sets.xval.loo <- lol.xval.split(X, Y, k='loo')
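
The splitting procedure described above can be sketched in base R as follows. This is an illustrative stand-in, not the lolR implementation: the helper split_sketch is hypothetical, it takes n and d directly instead of X and Y, and it draws the low-rank training subset uniformly at random, which is an assumption.

```r
# Hypothetical helper (NOT the lolR implementation): build k train/test
# index sets, optionally capping each training set at d samples.
split_sketch <- function(n, d, k = 5, rank.low = FALSE) {
  fold <- sample(rep(seq_len(k), length.out = n))   # random fold per sample
  lapply(seq_len(k), function(f) {
    test <- which(fold == f)     # the held-out fold, rotated over f
    train <- which(fold != f)    # the remaining k-1 folds
    if (rank.low && length(train) > d) {
      # keep only d training samples so that ntrain < d + 1
      train <- sort(train[sample.int(length(train), d)])
    }
    list(train = train, test = test)
  })
}

set.seed(1)
sets <- split_sketch(n = 50, d = 10, k = 5, rank.low = TRUE)
```

With n = 50, k = 5, and d = 10, each training set has min((k-1)/k*n, d) = 10 indices, while the five test sets still partition all 50 samples.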

Nearest Centroid Classifier Prediction

Description

A function that predicts the class of points based on the nearest centroid

Usage

## S3 method for class 'nearestCentroid'
predict(object, X, ...)

Arguments

object

An object of class nearestCentroid, with the following attributes:

  • centroids [K, d] the centroids of each class with K classes in d dimensions.

  • ylabs [K] the ylabels for each of the K unique classes, ordered.

  • priors [K] the priors for each of the K classes.

X

[n, d] the data to classify with n samples in d dimensions.

...

optional args.

Value

Yhat [n] the predicted class of each of the n data points in X.

Details

For more details see the help vignette: vignette("centroid", package = "lolR")

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.nearestCentroid(X, Y)
Yh <- predict(model, X)

Random Chance Classifier Prediction

Description

A function that predicts the maximally present class in the dataset. Functionality is consistent with the standard R prediction interface so that one can compute the "chance" accuracy with minimal modification of other classification scripts.

Usage

## S3 method for class 'randomChance'
predict(object, X, ...)

Arguments

object

An object of class randomChance, with the following attributes:

  • ylabs [K] the ylabels for each of the K unique classes, ordered.

  • priors [K] the priors for each of the K classes.

X

[n, d] the data to classify with n samples in d dimensions.

...

optional args.

Value

Yhat [n] the predicted class of each of the n data points in X.

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomChance(X, Y)
Yh <- predict(model, X)

Randomly Guessing Classifier Prediction

Description

A function that predicts by randomly guessing based on the pmf of the class priors. Functionality is consistent with the standard R prediction interface so that one can compute the "guess" accuracy with minimal modification of other classification scripts.

Usage

## S3 method for class 'randomGuess'
predict(object, X, ...)

Arguments

object

An object of class randomGuess, with the following attributes:

  • ylabs [K] the ylabels for each of the K unique classes, ordered.

  • priors [K] the priors for each of the K classes.

X

[n, d] the data to classify with n samples in d dimensions.

...

optional args.

Value

Yhat [n] the predicted class of each of the n data points in X.

Author(s)

Eric Bridgeford

Examples

library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
model <- lol.classify.randomGuess(X, Y)
Yh <- predict(model, X)
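
The two baseline predictors, randomChance and randomGuess, differ only in how they use the class priors: one always returns the most common class, the other samples a class per point from the prior pmf. The base-R sketch below contrasts the two rules. The function names are hypothetical and this is not the lolR code, only an illustration of the behavior described in the two Description sections.

```r
# Hypothetical stand-ins for the two baseline rules (NOT the lolR code).
predict_chance_sketch <- function(ylabs, priors, n) {
  # "chance": always predict the class with the largest prior
  rep(ylabs[which.max(priors)], n)
}
predict_guess_sketch <- function(ylabs, priors, n) {
  # "guess": draw each prediction from the pmf given by the class priors
  ylabs[sample.int(length(ylabs), size = n, replace = TRUE, prob = priors)]
}

set.seed(1)
ylabs <- c("a", "b", "c")
priors <- c(0.2, 0.5, 0.3)
chance <- predict_chance_sketch(ylabs, priors, n = 10)
guess <- predict_guess_sketch(ylabs, priors, n = 1000)
```

Comparing a trained classifier against both rules gives two reference error rates for the same labels.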
