| Type: | Package |
| Title: | Feature Ordering by Conditional Independence |
| Version: | 0.1.3 |
| Maintainer: | Mona Azadkia <monaazadkia@gmail.com> |
| Description: | Feature Ordering by Conditional Independence (FOCI) is a variable selection algorithm based on the measure of conditional dependence. For more information, see the paper: Azadkia and Chatterjee (2019), "A simple measure of conditional dependence" <doi:10.48550/arXiv.1910.12327>. |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.1.0 |
| Suggests: | knitr, rmarkdown, testthat |
| Depends: | R (≥ 3.6.0), data.table |
| Imports: | RANN, proxy, parallel, gmp |
| NeedsCompilation: | no |
| Packaged: | 2021-02-17 10:13:45 UTC; monaazadkia |
| Author: | Mona Azadkia [aut, cre], Sourav Chatterjee [aut, ctb], Norman Matloff [aut, ctb] |
| Repository: | CRAN |
| Date/Publication: | 2021-03-18 23:00:07 UTC |
Estimate the conditional dependence coefficient (CODEC)
Description
The conditional dependence coefficient (CODEC) is a measure of the amount of conditional dependence between a random variable Y and a random vector Z given a random vector X, based on an i.i.d. sample of (Y, Z, X). The coefficient is asymptotically guaranteed to be between 0 and 1.
Usage
codec(Y, Z, X = NULL, na.rm = TRUE)
Arguments
Y | Vector (length n) |
Z | Matrix (n by q) |
X | Matrix (n by p), default is NULL |
na.rm | Remove NAs if TRUE |
Details
The value returned by codec can be positive or negative. Asymptotically, it is guaranteed to be between 0 and 1. A small value indicates low conditional dependence between Y and Z given X, and a high value indicates strong conditional dependence. The codec function is used by the foci function for variable selection.
Value
The conditional dependence coefficient (CODEC) of Y and Z given X. If X == NULL, this is just a measure of the dependence between Y and Z.
Author(s)
Mona Azadkia, Sourav Chatterjee, Norman Matloff
References
Azadkia, M. and Chatterjee, S. (2019). A simple measure of conditional dependence. https://arxiv.org/pdf/1910.12327.pdf.
See Also
Examples
n = 1000
x <- matrix(runif(n * 2), nrow = n)
y <- (x[, 1] + x[, 2]) %% 1
# given x[, 1], y is a function of x[, 2]
codec(y, x[, 2], x[, 1])
# y is a function of x
codec(y, x)
z <- rnorm(n)
# y is a function of x given z
codec(y, x, z)
# y is independent of z given x
codec(y, z, x)
Variable selection by the FOCI algorithm
Description
FOCI is a variable selection algorithm based on the measure of conditional dependence, codec.
Usage
foci(
  Y,
  X,
  num_features = NULL,
  stop = TRUE,
  na.rm = TRUE,
  standardize = "scale",
  numCores = parallel::detectCores(),
  parPlat = "none",
  printIntermed = TRUE
)
Arguments
Y | Vector of responses (length n) |
X | Matrix of predictors (n by p) |
num_features | Number of variables to be selected, cannot be larger than p. The default value is NULL and in that case it will be set equal to p. If stop == TRUE (see below), then num_features is irrelevant. |
stop | If TRUE, stops at the first instance of nonpositive codec (see Details). |
na.rm | Removes NAs if TRUE. |
standardize | Standardizes the covariates if set equal to "scale" or "bounded"; otherwise the raw inputs are used. The default value is "scale", which normalizes each column of X to have mean zero and variance 1. If set equal to "bounded", the values of each column of X are mapped to [0, 1]. |
numCores | Number of cores to be used for parallelizing the variable selection process. |
parPlat | Specifies the parallel platform to chunk data by rows. It can take three values: 1- The default value is set to 'none', in which case no row chunking is done; 2- the |
printIntermed | The default value is TRUE, in which case intermediate results from the cluster nodes are printed before final processing. |
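The two standardization modes named under standardize can be illustrated in base R. This is only a sketch of the transformations as described above, not the package's internal code; the helper name standardize_columns is hypothetical, and constant columns (zero variance or zero range) are not handled.

```r
# Hypothetical helper illustrating the "scale" and "bounded" options.
# Not part of the FOCI package; constant columns are not handled.
standardize_columns <- function(X, method = c("scale", "bounded", "none")) {
  method <- match.arg(method)
  X <- as.matrix(X)
  if (method == "scale") {
    # center each column to mean 0 and rescale it to variance 1
    X <- apply(X, 2, function(col) (col - mean(col)) / sd(col))
  } else if (method == "bounded") {
    # map each column linearly onto [0, 1]
    X <- apply(X, 2, function(col) (col - min(col)) / (max(col) - min(col)))
  }
  X  # "none" returns the raw inputs unchanged
}

set.seed(1)
X <- matrix(rnorm(200, mean = 5, sd = 3), ncol = 2)
X_scaled <- standardize_columns(X, "scale")
X_bounded <- standardize_columns(X, "bounded")
```

Rescaling of this kind matters for methods that compare distances between rows, since a column measured on a large scale would otherwise dominate the others.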
Details
FOCI is a forward stepwise algorithm that uses the conditional dependence coefficient (codec) at each step, instead of the multiple correlation coefficient as in ordinary forward stepwise. If stop == TRUE, the process is stopped at the first instance of nonpositive codec, thereby selecting a subset of variables. Otherwise, a set of covariates of size num_features, ordered according to predictive power (as measured by codec) is produced.
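The forward stepwise scheme described above can be sketched generically in base R. The names forward_select, score_fn and toy_score are hypothetical and not part of the package. The toy score below is a thresholded gain in R-squared from a linear model, used only so the sketch runs without FOCI installed; it is not the codec estimator.

```r
# Generic forward stepwise skeleton (hypothetical, for illustration only).
# score_fn(y, z, x) should return a conditional dependence score of y and z
# given x, where x may be NULL when nothing has been selected yet.
forward_select <- function(y, X, score_fn, num_features = ncol(X), stop = TRUE) {
  selected <- integer(0)
  scores <- numeric(0)
  remaining <- seq_len(ncol(X))
  while (length(selected) < num_features && length(remaining) > 0) {
    given <- if (length(selected) > 0) X[, selected, drop = FALSE] else NULL
    # score each remaining candidate conditionally on the current selection
    cand <- vapply(
      remaining,
      function(j) score_fn(y, X[, j, drop = FALSE], given),
      numeric(1)
    )
    best <- which.max(cand)
    # stopping rule: quit at the first nonpositive score
    if (stop && cand[best] <= 0) break
    selected <- c(selected, remaining[best])
    scores <- c(scores, cand[best])
    remaining <- remaining[-best]
  }
  list(selected = selected, scores = scores)
}

# Toy stand-in score (an assumption, NOT codec): gain in R-squared minus 0.01.
toy_score <- function(y, z, x) {
  r2 <- function(M) if (is.null(M)) 0 else summary(lm(y ~ M))$r.squared
  r2(cbind(x, z)) - r2(x) - 0.01
}

set.seed(42)
n <- 500
X <- matrix(rnorm(n * 5), ncol = 5)
y <- 2 * X[, 1] + X[, 3] + rnorm(n, sd = 0.1)
fit <- forward_select(y, X, toy_score)
fit$selected
```

With the package installed, one could presumably pass function(y, z, x) codec(y, z, x) as score_fn, but the foci function itself also handles standardization and parallelism, so this skeleton should not be read as its implementation.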
Parallel computation:
The computation can be lengthy, so the package offers two kinds of parallel computation.
The first, controlled by the argument numCores, specifies the number of cores to be used on the host machine. If at a given step there are k candidate variables under consideration for inclusion, these k tasks are assigned to the various cores.
The second approach, controlled by the argument parPlat ("parallel platform"), involves the user first setting up a cluster via the parallel package. The data are divided into chunks by rows, with each cluster node applying FOCI to its data chunk. The union of the results is then formed, and fed through FOCI one more time to adjust the discrepancies. The idea is that this last step will not be too lengthy, as the number of candidate variables has already been reduced. A cluster size of r may actually produce a speedup factor of more than r (Matloff 2016).
Potentially the best speedup is achieved by using the two approaches together.
The first approach cannot be used on Windows platforms, as parallel::mclapply has no effect. Windows users should thus use the second approach only.
In addition to speed, the second approach is useful for diagnostics, as the results from the different chunks give the user an idea of the degree of sampling variability in the FOCI results.
In the second approach, a random permutation is applied to the rows of the dataset, as many datasets are sorted by one or more columns.
Note that if a certain value of a feature is rare in the full dataset, it may be absent entirely in some chunk.
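The row-chunking idea (permute the rows, split them into r chunks, select on each chunk, then rerun the selection on the union) can be sketched serially in base R. The names chunked_select, select_fn and toy_select are hypothetical, and the toy selector is a simple correlation threshold used only so the sketch runs on its own; it is not FOCI.

```r
# Serial sketch of the row-chunking scheme (hypothetical helper names).
# select_fn(y, X) must return the column indices of X that it selects.
chunked_select <- function(y, X, r, select_fn) {
  n <- nrow(X)
  perm <- sample(n)                              # random permutation of rows
  chunk_id <- rep(seq_len(r), length.out = n)    # near-equal chunk sizes
  chunks <- split(perm, chunk_id)
  # selection on each chunk; on a cluster this loop would be distributed
  per_chunk <- lapply(chunks, function(idx) {
    select_fn(y[idx], X[idx, , drop = FALSE])
  })
  candidates <- sort(unique(unlist(per_chunk, use.names = FALSE)))
  # final pass on the full data, restricted to the union of candidates
  final <- candidates[select_fn(y, X[, candidates, drop = FALSE])]
  list(per_chunk = per_chunk, candidates = candidates, final = final)
}

# Toy selector (an assumption, NOT FOCI): columns with |correlation| > 0.3.
toy_select <- function(y, X) which(abs(as.vector(cor(X, y))) > 0.3)

set.seed(7)
n <- 600
X <- matrix(rnorm(n * 6), ncol = 6)
y <- X[, 2] + X[, 5] + rnorm(n, sd = 0.5)
res <- chunked_select(y, X, r = 3, select_fn = toy_select)
res$per_chunk   # per-chunk selections hint at sampling variability
res$final
```

Comparing the entries of per_chunk gives the diagnostic view mentioned above: chunks that disagree suggest the selection is sensitive to sampling variability.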
Value
An object of class "foci", with attributes selectedVar, showing the selected variables in decreasing order of (conditional) predictive power, and stepT, listing the 'codec' values. Typically the latter will begin to level off at some point, with additional marginal improvements being small.
Author(s)
Mona Azadkia, Sourav Chatterjee, and Norman Matloff
References
Azadkia, M. and Chatterjee, S. (2019). A simple measure of conditional dependence. https://arxiv.org/pdf/1910.12327.pdf.
Matloff, N. (2016). Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones. J. of Stat. Software.
See Also
Examples
# Example 1
n = 1000
p = 100
x <- matrix(rnorm(n * p), nrow = n)
colnames(x) = paste0(rep("x", p), seq(1, p))
y <- x[, 1] * x[, 10] + x[, 20]^2
# with num_features equal to 3 and stop equal to FALSE, foci will give a list of
# three selected features
result1 = foci(y, x, num_features = 3, stop = FALSE, numCores = 1)
result1
# Example 2
# same example, but stop according to the stopping rule
result2 = foci(y, x, numCores = 1)
result2
## Not run: 
# Windows use of multicore
library(parallel)
cls <- makeCluster(parallel::detectCores())
foci(y, x, parPlat = cls)
# run on physical cluster
# host names are passed as a single character vector
cls <- makePSOCKcluster(c('machineA', 'machineB'))
foci(y, x, parPlat = cls)
## End(Not run)