Title:

Joint Feature Screening via Sparse MLE

Version:

2.2-2

Description:

Feature screening is a powerful tool in processing ultrahigh dimensional data. It attempts to screen out most irrelevant features in preparation for a more elaborate analysis. Xu and Chen (2014)<doi:10.1080/01621459.2013.879531> proposed an effective screening method SMLE, which naturally incorporates the joint effects among features in the screening process. This package provides an efficient implementation of SMLE-screening for high-dimensional linear, logistic, and Poisson models. The package also provides a function for conducting accurate post-screening feature selection based on an iterative hard-thresholding procedure and a user-specified selection criterion.

License:

GPL-3

Depends:

R(≥ 4.0.0)

Imports:

glmnet, matrixcalc, mvnfast

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.2.3

NeedsCompilation:

Author:

Qianxiang Zang [aut, cre], Chen Xu [aut], Kelly Burkett [aut]

Maintainer:

Qianxiang Zang <SMLEmaintainer@gmail.com>

Repository:

CRAN

Packaged:

2025-01-28 22:31:07 UTC; mac

Date/Publication:

2025-01-29 00:10:06 UTC

Suggests:

knitr, rmarkdown, testthat (≥ 3.0.0)

Config/testthat/edition:

VignetteBuilder:

knitr

Joint SMLE-screening for generalized linear models

Description

Feature screening is a powerful tool in processing ultrahigh dimensional data. It attempts to screen out most irrelevant features in preparation for a more elaborate analysis. This package provides an efficient implementation of SMLE-screening for linear, logistic, and Poisson models, where the joint effects among features are naturally incorporated in the screening process. The package also provides a function for conducting accurate post-screening feature selection based on an iterative hard-thresholding procedure and a user-specified selection criterion.

Details

Package:	smle
Type:	Package
Version:	2.1-1
Date:	2024-02-12
License:	GPL-3

Input an n \times 1 response vector Y and an n \times p predictor (feature) matrix X. The package outputs a set of k < n features that seem to be most relevant for joint regression. Moreover, the package provides a data simulator that generates synthetic datasets from high-dimensional GLMs, which accommodate both numerical and categorical features with commonly used correlation structures.

Key functions:
Gen_Data
SMLE
smle_select

Author(s)

Qianxiang Zang, Chen Xu, Kelly Burkett
Maintainer: Qianxiang Zang <qzang023@uottawa.ca>

References

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.

Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, 33(1), 1-22.

Examples

set.seed(1)
# Generate correlated data
Data <- Gen_Data(n = 200, p = 5000, correlation = "MA", family = "gaussian")
print(Data)
# Joint feature screening via SMLE
fit <- SMLE(Y = Data$Y, X = Data$X, k = 10, family = "gaussian")
print(fit)
summary(fit)
plot(fit)
# Are there any features missed after screening?
setdiff(Data$subset_true, fit$ID_retained)
# Elaborative selection after screening
fit_s <- smle_select(fit, gamma_ebic = 0.5, vote = FALSE)
# Are there any features missed after selection?
setdiff(Data$subset_true, fit_s$ID_selected)
print(fit_s)
summary(fit_s)
plot(fit_s)

Data simulator for high-dimensional GLMs

Description

This function generates synthetic datasets from GLMs with a user-specified correlation structure. It permits both numerical and categorical features, whose quantity can be larger than the sample size.

Usage

Gen_Data(
  n = 200,
  p = 1000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.2,
  family = c("gaussian", "binomial", "poisson")
)

Arguments

n

Sample size, number of rows for the feature matrix to be generated.

p

Number of columns for the feature matrix to be generated.

sigma

Parameter for noise level.

num_ctgidx

The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.

pos_ctgidx

Vector of indices denoting which columns are categorical.

num_truecoef

The number of features (columns) that affect response. Default is 5.

pos_truecoef

Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information.

level_ctgidx

Vector to indicate the number of levels for the categorical features in pos_ctgidx. Default is 2.

effect_truecoef

Effect size corresponding to the features in pos_truecoef. If not specified, effect size is sampled based on a uniform distribution and direction is randomly sampled. See Details.

correlation

Correlation structure among features. correlation = "ID" for independent, correlation = "MA" for moving average, correlation = "CS" for compound symmetry, correlation = "AR" for auto regressive. Default is "ID". For more information see Details.

rho

Parameter controlling the correlation strength, default is 0.2. See Details.

family

Model type for the response variable. "gaussian" for normally distributed data, "poisson" for non-negative counts, "binomial" for binary (0-1).

Details

Simulated data (y_i, x_i) where x_i = (x_{i1}, x_{i2}, ..., x_{ip}) are generated as follows: First, we generate a p by 1 model coefficient vector beta with all entries being zero, except for the positions specified in pos_truecoef, on which effect_truecoef is used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows:

(0.5+U) Z,

where U is sampled from a uniform distribution from 0 to 1, and Z is a random sign with P(Z=1)=1/2, P(Z=-1)=1/2.
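The coefficient step can be illustrated with a short base R sketch. The variable names mirror the arguments of Gen_Data() for readability only; this is an assumption-laden illustration of the formula (0.5+U) Z, not the package's internal code.

```r
set.seed(1)
p <- 1000            # number of features
num_truecoef <- 5    # number of non-zero coefficients

# Randomly choose the positions of the true coefficients.
pos_truecoef <- sort(sample(p, num_truecoef))

# Effect size (0.5 + U) * Z with U ~ Uniform(0, 1) and Z = +1 or -1
# with equal probability.
U <- runif(num_truecoef)
Z <- sample(c(-1, 1), num_truecoef, replace = TRUE)
effect_truecoef <- (0.5 + U) * Z

# Assemble the p by 1 coefficient vector: zero everywhere else.
beta <- numeric(p)
beta[pos_truecoef] <- effect_truecoef
```

Every non-zero entry of beta therefore has absolute value between 0.5 and 1.5.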

Next, we generate an n by p feature matrix X according to the model selected with correlation and specified as follows.

Independent (ID): all features are independently generated from N(0, 1).

Moving average (MA): candidate features x_1,..., x_p are joint normal, marginally N(0, 1), with

cov(x_j, x_{j-1}) = \rho, cov(x_j, x_{j-2}) = \rho/2 and cov(x_j, x_h) = 0 for |j-h| \geq 3.

Compound symmetry (CS): candidate features x_1,..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = \rho/2 if j and h are both in the set of important features and cov(x_j, x_h) = \rho when only one of j or h is in the set of important features.

Auto-regressive (AR): candidate features x_1,..., x_p are joint normal, marginally N(0, 1), with

cov(x_j, x_h) = \rho^{|j-h|} for all j and h. The correlation strength \rho is controlled by the argument rho.
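As a rough illustration of the AR and MA structures, the base R sketch below builds the two covariance matrices and draws an AR-correlated feature matrix through a Cholesky factor. It is only a sketch: the package imports mvnfast and may generate X differently, and the MA matrix here assumes zero covariance for all lags of 3 or more.

```r
set.seed(1)
n <- 200; p <- 6; rho <- 0.2

# Matrix of lags |j - h|.
lag <- abs(outer(seq_len(p), seq_len(p), "-"))

# AR covariance: Sigma[j, h] = rho^|j - h|.
Sigma_ar <- rho^lag

# MA covariance: rho at lag 1, rho / 2 at lag 2, zero for lags of 3 or more.
Sigma_ma <- diag(p)
Sigma_ma[lag == 1] <- rho
Sigma_ma[lag == 2] <- rho / 2

# Draw n rows from N(0, Sigma_ar): if R = chol(Sigma_ar), then
# t(R) %*% R = Sigma_ar, so Z %*% R has covariance Sigma_ar.
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_ar)
```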

Then, we generate the response variable Y according to its response type, which is controlled by the argument family. For the Gaussian model, y_i = x_i\beta + \epsilon_i where \epsilon_i is N(0, 1) for i from 1 to n. For the binary model let \pi_i = P(Y = 1|x_i). We sample y_i from Bernoulli(\pi_i) where logit(\pi_i) = x_i \beta. Finally, for the Poisson model, y_i is generated from the Poisson distribution with the link \pi_i = exp(x_i\beta). For more details see the reference below.
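The three response models can be sketched in base R as follows. The feature matrix and coefficient vector are made up for the illustration, and the code follows the formulas in this section rather than the package source.

```r
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, rep(0, p - 2))    # hypothetical coefficient vector
eta <- drop(X %*% beta)            # linear predictor x_i beta

# Gaussian: y_i = x_i beta + epsilon_i with epsilon_i ~ N(0, 1).
y_gaussian <- eta + rnorm(n)

# Binary: y_i ~ Bernoulli(pi_i) with logit(pi_i) = x_i beta.
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))

# Poisson: y_i ~ Poisson with mean exp(x_i beta).
y_poisson <- rpois(n, lambda = exp(eta))
```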

Value

call

The call that produced this object.

Y

Response variable vector of length n.

X

Feature matrix or data.frame (matrix if num_ctgidx = FALSE and data.frame otherwise).

subset_true

Vector of column indices of X for the features that affect the response variables (relevant features).

coef_true

Vector of effects for the features that affect the response variables.

categorical

Logical flag whether the model contains categorical features.

CI

Indices of categorical features when categorical = TRUE.

rho, family and correlation are returned as passed in the function call.

References

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.

Examples

# Simulating data with binomial response and auto-regressive structure.
set.seed(1)
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")
cor(Data$X[,1:5])
print(Data)

Joint feature screening via sparse maximum likelihood estimation for GLMs

Description

Input an n by 1 response Y and an n by p feature matrix X; the function uses SMLE to retain only a set of k < n features that seem to be most related to the response variable. It thus serves as a pre-processing step for an elaborative analysis. In SMLE, the joint effects between features are naturally accounted for; this makes the screening more reliable. The function uses the efficient iterative hard thresholding (IHT) algorithm with step parameter adaptively tuned for fast convergence. Users can choose to further conduct an elaborative selection after SMLE-screening. See smle_select() for more details.

Usage

SMLE(formula = NULL, ...)

## Default S3 method:
SMLE(
  formula = NULL,
  X = NULL,
  Y = NULL,
  data = NULL,
  k = NULL,
  family = c("gaussian", "binomial", "poisson"),
  keyset = NULL,
  intercept = TRUE,
  categorical = TRUE,
  group = TRUE,
  codingtype = NULL,
  coef_initial = NULL,
  max_iter = 500,
  tol = 10^(-3),
  selection = F,
  standardize = TRUE,
  fast = FALSE,
  U = 1,
  U_rate = 0.5,
  penalize_mod = TRUE,
  ...
)

## S3 method for class 'formula'
SMLE(formula, data, k = NULL, keyset = NULL, categorical = NULL, ...)

Arguments

formula

An object of class 'formula' (or one that can be coerced to that class): a symbolic description of the model to be fitted. It should be NULL when X and Y are used.

...

Additional arguments to be passed to smle_select() if selection = TRUE. See smle_select() documentation for more details.

X

The n by p feature matrix X with each column denoting a feature (covariate) and each row denoting an observation vector. The input should be a 'matrix' object for numerical data, and 'data.frame' for categorical data (or a mixture of numerical and categorical data). The algorithm will treat covariates having class 'factor' as categorical data and extend the data frame dimension by the dummy columns needed for coding the categorical features.

Y

The response vector Y of dimension n by 1. Quantitative for family = "gaussian", non-negative counts for family = "poisson", binary (0-1) for family = "binomial". Input Y should be 'numeric'.

data

An optional data frame, list or environment (or object coercible by as.data.frame() to a 'data.frame') containing the features in the model. It is required if 'formula' is used.

k

Total number of features (including keyset) to be retained after screening. Default is the largest integer not exceeding 0.5 log(n) n^{1/3}.

family

Model assumption between Y and X; either a character string representing one of the built-in families, or else a glm() family object. The default model is Gaussian linear.

keyset

A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. The column indices for the key features should be from data if 'formula' is used or in X if X and Y are provided. The class of keyset can be 'numeric', 'integer' or 'character'. Default is NULL.

intercept

A logical flag to indicate whether an intercept is to be used in the model. An intercept will not participate in screening.

categorical

A logical flag for whether the input feature matrix includes categorical features (either 'factor' or 'character'). FALSE treats all features as numerical and does not check whether there are categorical features; TRUE treats the data as having some categorical features, and the algorithm determines which columns contain them. If all features are known to be numerical, it will be faster to run SMLE with this argument set to FALSE. Default is TRUE.

group

Logical flag for whether to treat the dummy covariates of a categorical feature as a group. (Only for categorical data, see Details). Default is TRUE.

codingtype

Coding types for categorical features; default is "DV". codingtype = "all" converts each level to a 0-1 vector. codingtype = "DV" conducts deviation coding for each level in comparison with the grand mean. codingtype = "standard" conducts standard dummy coding for each level in comparison with the reference level (first level).

coef_initial

A p-dimensional vector for the initial coefficient value of the IHT algorithm. The default is to use Lasso with the sparsity closest to n-1.

max_iter

Maximum number of iteration steps. Default is 500.

tol

A tolerance level to stop the iterations, when the squared sum of differences between two successive coefficient updates is below it. Default is 10^{-3}.

selection

A logical flag to indicate whether an elaborate selection is to be conducted by smle_select() after screening. If TRUE, the function will return a 'selection' object, see smle_select() documentation. Default is FALSE.

standardize

A logical flag for feature standardization, prior to performing feature screening. The resulting coefficients are always returned on the original scale. If features are in the same units already, you might not wish to standardize. Default is standardize = TRUE.

fast

Set to TRUE to enable early stop for SMLE-screening. It may help to boost the screening efficiency with a little sacrifice of accuracy. Default is FALSE, see Details.

U

A numerical multiplier of initial tuning step parameter in IHT algorithm. Default is 1. For binomial model, a larger initial value is recommended; a smaller one is recommended for Poisson model.

U_rate

Decreasing rate in tuning step parameter 1/u in IHT algorithm. See Details.

penalize_mod

A logical flag to indicate whether adjustment is used in ranking groups of features. This argument is applicable only when categorical = TRUE with group = TRUE. When penalize_mod = TRUE, a factor of \sqrt J is divided from the L_2 effect of a group with J members. Default is TRUE.
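The three coding schemes named under codingtype have close analogues in base R contrasts. The sketch below shows those analogues on a small made-up factor; the mapping to "all", "DV" and "standard" is an assumption based on the descriptions above, and the package may build its dummy columns differently.

```r
f <- factor(c("a", "b", "c", "a", "c"))   # hypothetical categorical feature

# One 0-1 indicator column per level (analogous to codingtype = "all").
coding_all <- model.matrix(~ f - 1)

# Deviation coding against the grand mean (analogous to codingtype = "DV").
coding_dv <- model.matrix(~ f, contrasts.arg = list(f = "contr.sum"))[, -1]

# Dummy coding against the first level (analogous to codingtype = "standard").
coding_std <- model.matrix(~ f, contrasts.arg = list(f = "contr.treatment"))[, -1]
```

With three levels, the first scheme produces three columns and the other two produce two columns each.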

Details

With the input Y and X, SMLE() conducts joint feature screening by running the iterative hard thresholding algorithm (IHT), where the default initial value is set to be the Lasso estimate with the sparsity closest to the sample size minus one.

In SMLE(), the initial value for step size parameter 1/u is determined as follows. When coef_initial = 0, we set 1/u = U / \sqrt{p}. When coef_initial != 0, we generate a sub-matrix X_0 using the columns of X corresponding to the non-zero positions of coef_initial and set 1/u = U/\sqrt{p}||X||^2_{\infty} and recursively decrease the value of step size by U_rate to guarantee the likelihood increment. This strategy is called u-search.

SMLE() terminates IHT iterations when either tol or max_iter is satisfied. When fast = TRUE, the algorithm also stops when the non-zero members of the coefficient estimates remain the same for 10 successive iterations or the log-likelihood difference between coefficient estimates is less than 0.01 times the log-likelihood increase of the first step, or tol \sqrt k is satisfied.
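To make the iteration concrete, here is a minimal base R sketch of iterative hard thresholding for the Gaussian model. It is a simplification under stated assumptions: it starts from zero instead of a Lasso estimate, uses 1/sqrt(p) as the starting step at every iteration, shrinks the step by U_rate until the residual sum of squares does not increase, and stops on tol or max_iter. The function iht_gaussian() is hypothetical and is not the package's implementation.

```r
iht_gaussian <- function(X, Y, k, max_iter = 500, tol = 1e-3, U_rate = 0.5) {
  p <- ncol(X)
  beta <- numeric(p)
  # For the Gaussian model, a smaller RSS means a larger likelihood.
  rss <- function(b) sum((Y - X %*% b)^2)
  for (iter in seq_len(max_iter)) {
    step <- 1 / sqrt(p)                       # illustrative starting step 1/u
    grad <- drop(crossprod(X, Y - X %*% beta))
    repeat {
      cand <- beta + step * grad              # gradient step
      keep <- order(abs(cand), decreasing = TRUE)[seq_len(k)]
      cand[-keep] <- 0                        # hard threshold: keep k largest
      if (rss(cand) <= rss(beta) || step < 1e-12) break
      step <- step * U_rate                   # shrink the step and retry
    }
    done <- sum((cand - beta)^2) < tol        # squared change below tol
    beta <- cand
    if (done) break
  }
  list(ID_retained = sort(which(beta != 0)), coef = beta, steps = iter)
}

set.seed(1)
n <- 200; p <- 50
X <- scale(matrix(rnorm(n * p), n, p))
Y <- drop(X[, 1:3] %*% c(3, -3, 2)) + rnorm(n)
k <- floor(0.5 * log(n) * n^(1/3))   # default k from the Arguments section
fit <- iht_gaussian(X, Y, k)
fit$ID_retained
```

In this toy setting the three signal features have strong effects, so they should appear among the retained indices.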

In SMLE(), categorical features are coded by dummy covariates with the method specified in codingtype. Users can use group to specify whether to treat those dummy covariates as a single group feature or as individual features. When group = TRUE with penalize_mod = TRUE, the effect for a group of J dummy covariates is computed by

\beta_i = \sqrt{(\beta_1)^2+...+(\beta_J)^2}/\sqrt J,

which will be treated as a single feature in IHT iterations. When group = FALSE, a group of J dummy covariates will be treated as J individual features in the IHT iterations; in this case, a categorical feature is retained after screening when at least one of the corresponding dummy covariates is retained.
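A small numeric illustration of the group effect, using made-up dummy coefficients. The unpenalized version is an assumption about what penalize_mod = FALSE computes, inferred from the description of that argument.

```r
beta_dummy <- c(0.6, -0.8, 0.0)   # hypothetical coefficients of J = 3 dummies
J <- length(beta_dummy)

# L2 effect of the group divided by sqrt(J), as in the formula above.
group_effect_penalized <- sqrt(sum(beta_dummy^2)) / sqrt(J)

# Plain L2 effect (assumed behaviour without the sqrt(J) adjustment).
group_effect_unpenalized <- sqrt(sum(beta_dummy^2))
```

Here the plain L2 effect is 1 and the adjusted effect is 1/sqrt(3), so larger groups are ranked less favourably for the same overall norm.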

Since feature screening is usually a preprocessing step, users may wish to further conduct an elaborative feature selection after screening. This can be done by setting selection = TRUE in SMLE() or applying any existing selection method on the output of SMLE().

Value

call

The call that produced this object.

ID_retained

A vector indicating the features retained after SMLE-screening. The output includes both features retained by SMLE() and the features specified in keyset.

coef_retained

The vector of coefficients estimated by IHT for the retained features. When the retained set contains a categorical feature, the value returns a group effect if group = TRUE, or returns the strongest dummy covariate effect if group = FALSE.

path_retained

IHT iteration path with columns recording the coefficient updates.

num_retained

Number of retained features after screening.

intercept

The estimated intercept value by IHT, if intercept = TRUE.

steps

Number of IHT iterations.

likelihood_iter

A list of log-likelihood updates over the IHT iterations.

Usearch

A vector giving the number of attempts to find a proper 1/u at each iteration step.

modified_data

A list containing data objects generated by SMLE.

CM: Design matrix of class 'matrix' for numeric features (or 'data.frame' with categorical features).

DM: A matrix with dummy variable features added (only if there are categorical features).

dum_col: Number of levels for all categorical features.

CI: Indices of categorical features in CM.

DFI: Indices of categorical features in IM.

iteration_data

A list containing data objects that track the coefficients over iterations.

IM: Iteration path matrix with columns recording IHT coefficient updates.

beta0: Initial value of regression coefficient for IHT.

feature_name: A list containing the names of selected features.

FD: A matrix that contains feature indices retained at each iteration step.

X, Y, data, family, categorical and codingtype are returned as passed in the function call.

References

UCLA Statistical Consulting Group. Coding systems for categorical variables in regression analysis. https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/. Retrieved May 28, 2020.

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.

Examples

# Example 1:
set.seed(1)
Data <- Gen_Data(n = 200, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 9, family = "gaussian")
summary(fit)
Data$subset_true %in% fit$ID_retained # Sure screening check.
plot(fit)

# Example 2:
set.seed(1)
Data_sim2 <- Gen_Data(n = 420, p = 1000, family = "gaussian", num_ctgidx = 5,
                      pos_ctgidx = c(1,3,5,7,9), effect_truecoef = c(1,2,3,-4,-5),
                      pos_truecoef = c(1,3,5,7,8), level_ctgidx = c(3,3,3,4,5))
train_X <- Data_sim2$X[1:400,]; test_X <- Data_sim2$X[401:420,]
train_Y <- Data_sim2$Y[1:400]; test_Y <- Data_sim2$Y[401:420]
fit <- SMLE(Y = train_Y, X = train_X, family = "gaussian", group = TRUE, k = 15)
predict(fit, newdata = test_X)
test_Y

# Example 3:
library(datasets)
data("attitude")
set.seed(1)
noise <- matrix(rnorm(30*100, mean = mean(attitude$rating), sd = 1), ncol = 100)
colnames(noise) <- paste("Noise", seq(100), sep = ".")
df <- data.frame(cbind(attitude, noise))
fit <- SMLE(rating ~ ., data = df)
fit

Extract coefficients from fitted model

Description

Extract coefficients from fitted model for either a 'smle' or 'selection' object.

Usage

## S3 method for class 'smle'
coef(object, refit = TRUE, ...)

## S3 method for class 'selection'
coef(object, refit = TRUE, ...)

Arguments

object

Returned object from either the function SMLE() or smle_select().

refit

A logical flag that controls which coefficients are returned. Default is TRUE.

...

This argument is not used and listed for method consistency.

Value

Fitted coefficients based on the screened or selected model specified in the object. If refit = TRUE, the coefficients are estimated by re-fitting the final screened/selected model with glm(). If refit = FALSE, the coefficients estimated by the IHT algorithm are returned.
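The idea behind refit = TRUE can be sketched with base R alone: restrict the feature matrix to the retained columns and fit an ordinary glm(). The data and the retained index set below are made up for the illustration, and the sketch does not reproduce how the method assembles its model internally.

```r
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, c(2, 7)] %*% c(2, -1.5)) + rnorm(n)

ID_retained <- c(2, 7, 11)   # hypothetical output of a screening step

# Refit an unpenalized GLM on the retained columns only.
refit <- glm(Y ~ ., data = data.frame(Y = Y, X[, ID_retained]),
             family = gaussian())
coef(refit)   # intercept plus one coefficient per retained feature
```

Refitted coefficients are free of the shrinkage or truncation effects of the screening iterations, which is why they can differ from the IHT estimates.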

Examples

set.seed(1)
Data <- Gen_Data(n = 100, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 15, family = "gaussian")
coef(fit)
fit_s <- smle_select(fit)
coef(fit_s)

Extract log-likelihood

Description

This is a method written to extract the log-likelihood from 'smle' and 'selection' objects. It refits the model by glm() based on the response and the features selected after screening or selection, and returns an object of class 'logLik' from the generic.

Usage

## S3 method for class 'smle'
logLik(object, ...)

## S3 method for class 'selection'
logLik(object, ...)

Arguments

object

An object of class 'smle' or 'selection'.

...

Forwarded arguments.

Value

Returns an object of class 'logLik'. This is a number with at least one attribute, "df" (degrees of freedom), giving the number of (estimated) parameters in the model. For more details, see the generic logLik() in stats.
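For orientation, this is what a 'logLik' object looks like for a plain Gaussian glm() fit in base R. The data are simulated for the example; the sketch shows the generic from stats, not the smle method itself.

```r
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
Y <- drop(X %*% c(1, 0.5, -1)) + rnorm(n)

refit <- glm(Y ~ X, family = gaussian())
ll <- logLik(refit)

class(ll)        # "logLik"
attr(ll, "df")   # intercept + 3 slopes + error variance
```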

Examples

set.seed(1)
Data <- Gen_Data(n = 100, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 9, family = "gaussian")
logLik(fit)

Plots to visualize the post-screening selection

Description

This function constructs a sparsity vs. selection criterion curve for a 'selection' object. When EBIC is used with voting, it also constructs a histogram showing the voting result.

Usage

## S3 method for class 'selection'
plot(x, ...)

Arguments

x

A 'selection' object as the output from smle_select().

...

Additional arguments to the plot() function.

Value

No return value.

Examples

set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_s <- smle_select(fit, vote = TRUE)
plot(fit_s)

Plots to visualize SMLE screening

Description

This function returns two plot windows. By default, the first shows 1) the solution path (estimated coefficient by iteration step) for the retained features. By default, the second plot contains 4 plots to assess convergence: 2) log-likelihood, 3) Euclidean distance between the current and the previous coefficient estimates, 4) the number of tries in u-search (see details of SMLE()), and 5) the number of features changed in the current active set.

Usage

## S3 method for class 'smle'
plot(x, num_path = NULL, label = TRUE, which_path = NULL, out_plot = 1, ...)

Arguments

x

A 'smle' object as the output from SMLE().

num_path

The number of top coefficients to be shown. Default is equal to the number of features retained in the model.

label

Logical flag for whether to label each curve with the feature index. Default is TRUE.

which_path

A vector to control which features are shown in addition to the paths for the most significant coefficients.

out_plot

A number from 1 to 5 indicating which plot is to be shown in the separate window; the default is 1, the solution path plot. See Description for plot labels 2-5.

...

Additional arguments passed to the second plot.

Value

No return value.

Examples

set.seed(1)
Data <- Gen_Data(correlation = "CS")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
plot(fit)

Prediction based on SMLE screening and selection

Description

For a model object of class 'smle' or 'selection', this function returns the predicted response values after re-fitting the model using glm().

Usage

## S3 method for class 'smle'
predict(object, newdata = NULL, type = c("link", "response", "terms"), ...)

## S3 method for class 'selection'
predict(object, newdata = NULL, type = c("link", "response", "terms"), ...)

Arguments

object

A 'smle' or 'selection' object.

newdata

Matrix of new values for the features at which predictions are to be made. If omitted, the fitted linear predictors are used.

type

The type of prediction required by predict.glm().

...

Further arguments passed to predict.glm().

Value

A prediction vector with length equal to the number of rows of newdata.

Examples

set.seed(1)
Data_sim <- Gen_Data(n = 420, p = 1000, sigma = 0.5, family = "gaussian")
train_X <- Data_sim$X[1:400,]; test_X <- Data_sim$X[401:420,]
train_Y <- Data_sim$Y[1:400]; test_Y <- Data_sim$Y[401:420]
fit1 <- SMLE(Y = train_Y, X = train_X, family = "gaussian", k = 10)
# Fitted responses vs true responses in training data
predict(fit1)[1:10]
train_Y[1:10]
# Predicted responses vs true responses in testing data
predict(fit1, newdata = test_X)
test_Y

Print an object

Description

This function prints information about the fitted model from a call to SMLE() or smle_select(), or about the simulated data from a call to Gen_Data(). The object passed as an argument to print is returned invisibly.

Usage

## S3 method for class 'smle'
print(x, ...)

## S3 method for class 'selection'
print(x, ...)

## S3 method for class 'summary.smle'
print(x, ...)

## S3 method for class 'summary.selection'
print(x, ...)

## S3 method for class 'sdata'
print(x, ...)

Arguments

x

Fitted object.

...

This argument is not used and listed for method consistency.

Value

Return argument invisibly.

Examples

set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
Data
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
print(fit)
summary(fit)

p values of synthetic genetic association study data set

Description

The first column is the chromosome number. The second column is the SNP name. The third column is the genomic position of the SNP on the whole data set. The marginal p-value of each SNP is pre-calculated and saved in the fourth column.

Usage

data(pvals)

Format

An object of class 'data.frame' with 10031 rows and 4 columns.

Elaborative post-screening selection with SMLE

Description

The features retained after screening are still likely to contain some that are not related to the response. The function smle_select() is designed to further identify the relevant features using SMLE(). Given a response and a set of K features, this function first runs SMLE(fast = TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series based on a selection criterion.

When criterion EBIC is used, users can choose to repeat the selection with different values of the tuning parameter \gamma, and conduct importance voting for each feature. When vote = TRUE, this function fits all the models with \gamma specified in gamma_seq and features with frequency higher than vote_threshold will be selected in ID_voted.
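The selection step can be sketched in base R under two assumptions: that the criterion takes the form EBIC = -2 logL + k log(n) + 2 \gamma log C(p, k) given in Chen and Chen (the exact form used by the package is not stated here), and that candidate models are nested along a simple marginal-correlation ordering instead of being produced by SMLE(fast = TRUE). The helper ebic() is hypothetical.

```r
# Hypothetical helper: EBIC for a fitted model with k features out of p.
ebic <- function(fit, n, p, k, gamma = 0.5) {
  -2 * as.numeric(logLik(fit)) + k * log(n) + 2 * gamma * lchoose(p, k)
}

set.seed(1)
n <- 150; p <- 30
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, 1:3] %*% c(2, -2, 1.5)) + rnorm(n)

# Stand-in for the series of sub-models: rank features by absolute
# marginal correlation and grow the model one feature at a time.
ord <- order(abs(cor(X, Y)), decreasing = TRUE)
k_grid <- 1:8                       # plays the role of k_min:k_max
crit <- sapply(k_grid, function(k) {
  ebic(lm(Y ~ X[, ord[seq_len(k)], drop = FALSE]), n, p, k)
})

# Pick the sparsity with the smallest criterion value.
k_best <- k_grid[which.min(crit)]
ID_selected <- sort(ord[seq_len(k_best)])
```

Larger values of gamma penalize model size more heavily, which is why repeating the selection over several gamma values and voting can stabilize the result.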

Usage

smle_select(object, ...)

## S3 method for class 'sdata'
smle_select(
  object,
  k_min = 1,
  k_max = NULL,
  subset = NULL,
  gamma_ebic = 0.5,
  vote = FALSE,
  keyset = NULL,
  criterion = "ebic",
  codingtype = c("DV", "standard", "all"),
  gamma_seq = c(seq(0, 1, 0.2)),
  vote_threshold = 0.6,
  parallel = FALSE,
  num_clusters = NULL,
  ...
)

## Default S3 method:
smle_select(
  object = NULL,
  Y = NULL,
  X = NULL,
  family = "gaussian",
  keyset = NULL,
  ...
)

## S3 method for class 'smle'
smle_select(object, ...)

Arguments

object

Object of class 'smle' or 'sdata'. Users can also input a response vector and a feature matrix.

...

Further arguments passed to or from other methods.

k_min

The lower bound of candidate model sparsity. Default is 1.

k_max

The upper bound of candidate model sparsity. Default is the number of columns in feature matrix.

subset

An index vector indicating which features (columns of the feature matrix) are to be selected. Not applicable if a 'smle' object is the input.

gamma_ebic

The EBIC tuning parameter, in [0, 1]. Default is 0.5.

vote

The logical flag for whether to perform the voting procedure. Only available when criterion = "ebic".

keyset

A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. See SMLE for details.

criterion

Selection criterion. One of "ebic","bic","aic". Default is "ebic".

codingtype

Coding types for categorical features; for more details see SMLE() documentation.

gamma_seq

The sequence of values for gamma_ebic when vote = TRUE.

vote_threshold

A relative voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold. Default is 0.6.

parallel

A logical flag to use parallel computing to do voting selection. Default is FALSE. See Details.

num_clusters

The number of compute clusters to use when parallel = TRUE. The default will be 2 times cores detected.

Y

Input response vector (when object = NULL).

X

Input features matrix (when object = NULL).

family

Model assumption; see SMLE() documentation. Default is Gaussian linear.

When input is a 'smle' or 'sdata' object, the same model will be used in the selection.

Details

This function accepts three types of input objects: 1) 'smle' object, as the output from SMLE(); 2) 'sdata' object, as the output from Gen_Data(); 3) other response and feature matrix input by users.

Note that this function is mainly designed to conduct an elaborative selection after feature screening. We do not recommend using it directly for ultra-high-dimensional data without screening.

Value

call

The call that produced this object.

ID_selected

A list of selected features.

coef_selected

Fitted model coefficients.

intercept

Fitted model intercept.

criterion_value

Values of selection criterion for the candidate models with various sparsity.

categorical

A logical flag whether the input feature matrix includes categorical features.

ID_pool

A vector containing all features selected during voting.

ID_voted

A vector containing the features selected when vote = TRUE.

CI

Indices of categorical features when categorical = TRUE.

X, Y, family, gamma_ebic, gamma_seq, criterion, vote, codingtype and vote_threshold are returned as passed in the function call.

References

Chen, J. and Chen, Z. (2012). Extended BIC for small-n-large-p sparse GLM, Statistica Sinica, 22(2), 555-574.

Examples

set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_bic <- smle_select(fit, criterion = "bic")
summary(fit_bic)
fit_ebic <- smle_select(fit, criterion = "ebic", vote = TRUE)
summary(fit_ebic)
plot(fit_ebic)

Summarize SMLE-screening and selection

Description

This function prints a summary of a 'smle' (or a 'selection') object. In particular, it shows the features retained after SMLE-screening (or selection) with the related convergence information.

Usage

## S3 method for class 'smle'
summary(object, ...)

## S3 method for class 'selection'
summary(object, ...)

Arguments

object

A 'smle' or 'selection' object.

...

This argument is not used and listed for method consistency.

Value

No return value.

Examples

set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
summary(fit)
fit_s <- smle_select(fit)
summary(fit_s)

Synthetic genetic association study data set

Description

This simulated data set consists of 10,031 genetic variants (SNPs) and a continuous response variable measured on 800 individuals. The genotypes were sampled from genotypic distributions derived from the 1000 Genomes project using the R package sim1000G. The genotype is coded as 0, 1, or 2 by counting the number of minor alleles (the allele that is less common in the sample). The continuous response variable was simulated from a normal distribution with mean that depends additively on the causal SNPs.

Usage

data(synSNP)

Format

An object of class 'data.frame' with 800 rows and 10,032 columns.

References

The 1000 Genomes Project Consortium (2015). Global reference for human genetic variation, Nature, 526(7571), 68-74.

Examples

data(synSNP)
Y_SNP <- synSNP[,1]
X_SNP <- synSNP[,-1]
fit <- SMLE(Y = Y_SNP, X = X_SNP, k = 40)
summary(fit)
plot(fit)

Extract and adjust voting from SMLE selection

Description

When smle_select() is used with criterion = "ebic" and vote = TRUE, users can use vote_update() to adjust the voting threshold without needing to rerun smle_select().

Usage

vote_update(object, ...)

## S3 method for class 'selection'
vote_update(object, vote_threshold = 0.6, ...)

Arguments

object

A 'selection' object as the output from smle_select().

...

This argument is not used and listed for method consistency.

vote_threshold

A voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold. Default is 0.6.

Value

The function returns a vector indicating the features selected by EBIC voting with the specified vote_threshold.
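Re-thresholding is cheap because only the vote frequencies are needed. The base R sketch below uses made-up per-gamma selections and a hypothetical helper revote(); it illustrates the idea and does not read from a real 'selection' object.

```r
# Hypothetical selections, one per value of gamma in gamma_seq.
selected_by_gamma <- list(
  c(3, 8, 15, 21),
  c(3, 8, 15),
  c(3, 8, 15),
  c(3, 8),
  c(3, 8)
)

# Share of gamma values for which each feature was selected.
vote_freq <- table(unlist(selected_by_gamma)) / length(selected_by_gamma)

# Keep the features whose vote frequency is higher than the threshold.
revote <- function(freq, vote_threshold) {
  as.integer(names(freq)[freq > vote_threshold])
}

revote(vote_freq, 0.5)   # lenient threshold
revote(vote_freq, 0.9)   # strict threshold
```

Lowering the threshold can only add features to the voted set, so it is easy to explore several thresholds after a single selection run.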

Examples

set.seed(1)
Data <- Gen_Data(n = 100, p = 3000, correlation = "MA", rho = 0.7, family = "gaussian")
colnames(Data$X) <- paste("X.", seq(3000), sep = "")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_s <- smle_select(fit, criterion = "ebic", vote = TRUE)
plot(fit_s)
fit_s
vote_update(fit_s, vote_threshold = 0.4)