
Type: Package
Title: Feature Ordering by Conditional Independence
Version: 0.1.3
Maintainer: Mona Azadkia <monaazadkia@gmail.com>
Description: Feature Ordering by Conditional Independence (FOCI) is a variable selection algorithm based on the measure of conditional dependence. For more information, see the paper: Azadkia and Chatterjee (2019), "A simple measure of conditional dependence" <doi:10.48550/arXiv.1910.12327>.
License: GPL-3
Encoding: UTF-8
LazyData: true
VignetteBuilder: knitr
RoxygenNote: 7.1.0
Suggests: knitr, rmarkdown, testthat
Depends: R (≥ 3.6.0), data.table
Imports: RANN, proxy, parallel, gmp
NeedsCompilation:
Packaged: 2021-02-17 10:13:45 UTC; monaazadkia
Author: Mona Azadkia [aut, cre], Sourav Chatterjee [aut, ctb], Norman Matloff [aut, ctb]
Repository: CRAN
Date/Publication: 2021-03-18 23:00:07 UTC

Estimate the conditional dependence coefficient (CODEC)

Description

The conditional dependence coefficient (CODEC) is a measure of the amount of conditional dependence between a random variable Y and a random vector Z given a random vector X, based on an i.i.d. sample of (Y, Z, X). The coefficient is asymptotically guaranteed to be between 0 and 1.

Usage

codec(Y, Z, X = NULL, na.rm = TRUE)

Arguments

Y

Vector (length n)

Z

Matrix (n by q)

X

Matrix (n by p), default is NULL

na.rm

Remove NAs if TRUE

Details

The value returned by codec can be positive or negative. Asymptotically, it is guaranteed to be between 0 and 1. A small value indicates low conditional dependence between Y and Z given X, and a high value indicates strong conditional dependence. The codec function is used by the foci function for variable selection.
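To make the range statement concrete, here is a minimal base-R sketch of the unconditional case (X = NULL). It assumes the nearest-neighbor rank formula as we read it in Azadkia and Chatterjee (2019), ignores tie-breaking, and uses a brute-force distance matrix; the name codec_marginal_sketch is hypothetical and this is not the package's implementation.

```r
# Sketch only: unconditional dependence of y on z, assuming the rank-based
# nearest-neighbor formula from Azadkia and Chatterjee (2019).
codec_marginal_sketch <- function(y, z) {
  z <- as.matrix(z)
  n <- length(y)
  d <- as.matrix(dist(z))          # brute-force pairwise distances, O(n^2)
  diag(d) <- Inf                   # a point is not its own neighbor
  nn <- apply(d, 1, which.min)     # index of the nearest neighbor of each z_i
  r <- rank(y, ties.method = "max")    # R_i = #{j : y_j <= y_i}
  l <- rank(-y, ties.method = "max")   # L_i = #{j : y_j >= y_i}
  sum(n * pmin(r, r[nn]) - l^2) / sum(l * (n - l))
}

set.seed(42)
n <- 500
z <- runif(n, -1, 1)
t_dep <- codec_marginal_sketch(z^2, z)       # y is a function of z
t_ind <- codec_marginal_sketch(rnorm(n), z)  # y is independent of z
```

In a finite sample the independent case can come out slightly negative, which is the behavior the paragraph above describes.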

Value

The conditional dependence coefficient (CODEC) of Y and Z given X. If X == NULL, this is just a measure of the dependence between Y and Z.

Author(s)

Mona Azadkia, Sourav Chatterjee, Norman Matloff

References

Azadkia, M. and Chatterjee, S. (2019). A simple measure of conditional dependence. https://arxiv.org/pdf/1910.12327.pdf.

Examples

n = 1000
x <- matrix(runif(n * 2), nrow = n)
y <- (x[, 1] + x[, 2]) %% 1
# given x[, 1], y is a function of x[, 2]
codec(y, x[, 2], x[, 1])
# y is a function of x
codec(y, x)
z <- rnorm(n)
# y is a function of x given z
codec(y, x, z)
# y is independent of z given x
codec(y, z, x)

Variable selection by the FOCI algorithm

Description

FOCI is a variable selection algorithm based on the measure of conditional dependence codec.

Usage

foci(
  Y,
  X,
  num_features = NULL,
  stop = TRUE,
  na.rm = TRUE,
  standardize = "scale",
  numCores = parallel::detectCores(),
  parPlat = "none",
  printIntermed = TRUE
)

Arguments

Y

Vector of responses (length n)

X

Matrix of predictors (n by p)

num_features

Number of variables to be selected, cannot be larger than p. The default value is NULL and in that case it will be set equal to p. If stop == TRUE (see below), then num_features is irrelevant.

stop

Stops at the first instance of negative codec, if TRUE.

na.rm

Removes NAs if TRUE.

standardize

Standardizes covariates if set equal to "scale" or "bounded"; otherwise the raw inputs are used. The default value is "scale", which normalizes each column of X to have mean zero and variance 1. If set equal to "bounded", the values of each column of X are mapped to [0, 1].

numCores

Number of cores to be used for parallelizing the variable selection process.

parPlat

Specifies the parallel platform to chunk data by rows. It can take three values:
1- The default value is set to 'none', in which case no row chunking is done;
2- the parallel cluster to be used for row chunking;
3- "locThreads", specifying that row chunking will be done via threads on the host machine.

printIntermed

The default value is TRUE, in which case intermediate results from the cluster nodes are printed before final processing.
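The two standardize options can be illustrated in base R. This is a sketch of the transformations as described for the standardize argument, not the package's internal code; in particular, min-max rescaling is an assumption about how "bounded" maps each column to [0, 1].

```r
# Toy predictor matrix with columns on very different scales.
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 40, 80))

# "scale": each column gets mean zero and variance 1.
x_scaled <- scale(x)

# "bounded" (assumed here to be min-max rescaling): each column mapped to [0, 1].
x_bounded <- apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
```

Standardizing matters because nearest-neighbor distances would otherwise be dominated by whichever column has the largest raw scale.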

Details

FOCI is a forward stepwise algorithm that uses the conditional dependence coefficient (codec) at each step, instead of the multiple correlation coefficient as in ordinary forward stepwise. If stop == TRUE, the process is stopped at the first instance of nonpositive codec, thereby selecting a subset of variables. Otherwise, a set of covariates of size num_features, ordered according to predictive power (as measured by codec) is produced.
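The forward stepwise logic can be sketched generically in base R. Everything named below (forward_select_sketch, toy_score) is hypothetical: the score argument stands in for a conditional dependence measure, and the toy score used here is an R-squared gain minus a small threshold, not codec. The package's actual selection and stopping details may differ.

```r
# Generic forward stepwise selection with a pluggable score.
# score(y, z, cond) should return a conditional-dependence-like value for
# candidate z given the already selected columns cond (NULL at the first step).
forward_select_sketch <- function(y, x, score, num_features = ncol(x), stop = TRUE) {
  selected <- integer(0)
  values <- numeric(0)
  while (length(selected) < num_features) {
    candidates <- setdiff(seq_len(ncol(x)), selected)
    cond <- if (length(selected) > 0) x[, selected, drop = FALSE] else NULL
    s <- vapply(candidates,
                function(j) score(y, x[, j, drop = FALSE], cond),
                numeric(1))
    best <- which.max(s)
    if (stop && s[best] <= 0) break   # stop at the first nonpositive score
    selected <- c(selected, candidates[best])
    values <- c(values, s[best])
  }
  list(selectedVar = selected, stepT = values)
}

# Toy stand-in for codec: gain in R-squared from adding z, minus 0.01.
toy_score <- function(y, z, cond) {
  base <- if (is.null(cond)) 0 else summary(lm(y ~ cond))$r.squared
  full <- summary(lm(y ~ cbind(cond, z)))$r.squared
  full - base - 0.01
}

set.seed(1)
n <- 200
x <- matrix(rnorm(n * 5), nrow = n)
y <- 2 * x[, 3] + x[, 1] + rnorm(n, sd = 0.1)

res_stop <- forward_select_sketch(y, x, toy_score)                # uses the stopping rule
res_all <- forward_select_sketch(y, x, toy_score, num_features = 3, stop = FALSE)
```

With stop = TRUE the loop ends as soon as no remaining candidate has a positive score; with stop = FALSE it returns exactly num_features columns in the order they were picked.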

Parallel computation:

The computation can be lengthy, so the package offers two kinds of parallel computation.

The first, controlled by the argument numCores, specifies the number of cores to be used on the host machine. If at a given step there are k candidate variables under consideration for inclusion, these k tasks are assigned to the various cores.
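The idea of spreading the k candidate evaluations over cores can be sketched with parallel::mclapply. The helper names are hypothetical and absolute correlation stands in for codec; this only illustrates the task assignment, not how the package schedules its work.

```r
library(parallel)

# Hypothetical per-candidate score: absolute correlation stands in for codec.
score_candidate <- function(j, y, x) abs(cor(y, x[, j]))

score_all_candidates <- function(y, x, candidates, num_cores = 2L) {
  # mclapply relies on forking, which is unavailable on Windows,
  # so fall back to a single core there.
  if (.Platform$OS.type == "windows") num_cores <- 1L
  scores <- mclapply(candidates,
                     function(j) score_candidate(j, y, x),
                     mc.cores = num_cores)
  unlist(scores)
}

set.seed(1)
x <- matrix(rnorm(200 * 4), nrow = 200)
y <- x[, 3] + rnorm(200, sd = 0.1)
scores <- score_all_candidates(y, x, candidates = seq_len(ncol(x)))
best <- which.max(scores)   # candidate that would be added at this step
```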

The second approach, controlled by the argument parPlat ("parallel platform"), involves the user first setting up a cluster via the parallel package. The data are divided into chunks by rows, with each cluster node applying FOCI to its data chunk. The union of the results is then formed, and fed through FOCI one more time to adjust the discrepancies. The idea is that this last step will not be too lengthy, as the number of candidate variables has already been reduced. A cluster size of r may actually produce a speedup factor of more than r (Matloff 2016).

Potentially the best speedup is achieved by using the two approaches together.

The first approach cannot be used on Windows platforms, as parallel::mclapply has no effect there. Windows users should thus use the second approach only.

In addition to speed, the second approach is useful for diagnostics, as the results from the different chunks give the user an idea of the degree of sampling variability in the FOCI results.

In the second approach, a random permutation is applied to the rows of the dataset, as many datasets are sorted by one or more columns.
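The chunk-then-pool scheme can be sketched in base R without a cluster. All names here are hypothetical, the chunks are processed sequentially with lapply rather than on cluster nodes, and a simple correlation threshold stands in for FOCI as the per-chunk selector.

```r
# Randomly permute the row indices, then deal them into r chunks.
chunk_rows_sketch <- function(n, r) {
  perm <- sample.int(n)
  split(perm, rep_len(seq_len(r), n))
}

# Run a selector on each chunk, pool the selected columns, then rerun the
# selector once on the full data restricted to the pooled columns.
chunked_select_sketch <- function(y, x, r, select_fn) {
  chunks <- chunk_rows_sketch(nrow(x), r)
  per_chunk <- lapply(chunks, function(idx) select_fn(y[idx], x[idx, , drop = FALSE]))
  pooled <- sort(unique(unlist(per_chunk)))
  final <- select_fn(y, x[, pooled, drop = FALSE])
  list(per_chunk = per_chunk, selected = pooled[final])
}

# Toy selector (stand-in for FOCI): columns with |correlation| above 0.3.
toy_select <- function(y, x) which(as.vector(abs(cor(x, y))) > 0.3)

set.seed(1)
n <- 300
x <- matrix(rnorm(n * 5), nrow = n)
y <- x[, 2] + rnorm(n, sd = 0.5)
chunks <- chunk_rows_sketch(n, 3)
out <- chunked_select_sketch(y, x, r = 3, select_fn = toy_select)
```

Comparing the entries of out$per_chunk gives a rough sense of how much the selection varies from chunk to chunk.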

Note that if a certain value of a feature is rare in the full dataset, it may be absent entirely in some chunk.

Value

An object of class "foci", with attributes selectedVar, showing the selected variables in decreasing order of (conditional) predictive power, and stepT, listing the 'codec' values. Typically the latter will begin to level off at some point, with additional marginal improvements being small.

Author(s)

Mona Azadkia, Sourav Chatterjee, and Norman Matloff

References

Azadkia, M. and Chatterjee, S. (2019). A simple measure of conditional dependence. https://arxiv.org/pdf/1910.12327.pdf.

Matloff, N. (2016). Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones. J. of Stat. Software.

Examples

# Example 1
n = 1000
p = 100
x <- matrix(rnorm(n * p), nrow = n)
colnames(x) = paste0(rep("x", p), seq(1, p))
y <- x[, 1] * x[, 10] + x[, 20]^2
# with num_features equal to 3 and stop equal to FALSE, foci will give a list of
# three selected features
result1 = foci(y, x, num_features = 3, stop = FALSE, numCores = 1)
result1
# Example 2
# same example, but stop according to the stopping rule
result2 = foci(y, x, numCores = 1)
result2
## Not run: 
# Windows use of multicore
library(parallel)
cls <- makeCluster(parallel::detectCores())
foci(y, x, parPlat = cls)
stopCluster(cls)
# run on physical cluster
# host names are passed as a single character vector
cls <- makePSOCKcluster(c('machineA', 'machineB'))
foci(y, x, parPlat = cls)
stopCluster(cls)
## End(Not run)
