| Title: | Joint Feature Screening via Sparse MLE |
| Version: | 2.2-2 |
| Description: | Feature screening is a powerful tool in processing ultrahigh dimensional data. It attempts to screen out most irrelevant features in preparation for a more elaborate analysis. Xu and Chen (2014)<doi:10.1080/01621459.2013.879531> proposed an effective screening method SMLE, which naturally incorporates the joint effects among features in the screening process. This package provides an efficient implementation of SMLE-screening for high-dimensional linear, logistic, and Poisson models. The package also provides a function for conducting accurate post-screening feature selection based on an iterative hard-thresholding procedure and a user-specified selection criterion. |
| License: | GPL-3 |
| Depends: | R(≥ 4.0.0) |
| Imports: | glmnet, matrixcalc, mvnfast |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.2.3 |
| NeedsCompilation: | no |
| Author: | Qianxiang Zang [aut, cre], Chen Xu [aut], Kelly Burkett [aut] |
| Maintainer: | Qianxiang Zang <SMLEmaintainer@gmail.com> |
| Repository: | CRAN |
| Packaged: | 2025-01-28 22:31:07 UTC; mac |
| Date/Publication: | 2025-01-29 00:10:06 UTC |
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
Joint SMLE-screening for generalized linear models
Description
Feature screening is a powerful tool in processing ultrahigh dimensional data. It attempts to screen out most irrelevant features in preparation for a more elaborate analysis. This package provides an efficient implementation of SMLE-screening for linear, logistic, and Poisson models, where the joint effects among features are naturally incorporated in the screening process. The package also provides a function for conducting accurate post-screening feature selection based on an iterative hard-thresholding procedure and a user-specified selection criterion.
Details
| Package: | smle |
| Type: | Package |
| Version: | 2.1-1 |
| Date: | 2024-02-12 |
| License: | GPL-3 |
Input an n \times 1 response vector Y and an n \times p predictor (feature) matrix X. The package outputs a set of k < n features that seem to be most relevant for joint regression. Moreover, the package provides a data simulator that generates synthetic datasets from high-dimensional GLMs, which accommodate both numerical and categorical features with commonly used correlation structures.
Key functions: Gen_Data, SMLE, smle_select
Author(s)
Qianxiang Zang, Chen Xu, Kelly Burkett
Maintainer: Qianxiang Zang <qzang023@uottawa.ca>
References
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, 33(1), 1-22.
Examples
set.seed(1)
# Generate correlated data
Data <- Gen_Data(n = 200, p = 5000, correlation = "MA", family = "gaussian")
print(Data)
# Joint feature screening via SMLE
fit <- SMLE(Y = Data$Y, X = Data$X, k = 10, family = "gaussian")
print(fit)
summary(fit)
plot(fit)
# Are there any features missed after screening?
setdiff(Data$subset_true, fit$ID_retained)
# Elaborative selection after screening
fit_s <- smle_select(fit, gamma_ebic = 0.5, vote = FALSE)
# Are there any features missed after selection?
setdiff(Data$subset_true, fit_s$ID_selected)
print(fit_s)
summary(fit_s)
plot(fit_s)
Data simulator for high-dimensional GLMs
Description
This function generates synthetic datasets from GLMs with a user-specified correlation structure. It permits both numerical and categorical features, whose quantity can be larger than the sample size.
Usage
Gen_Data(
  n = 200,
  p = 1000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.2,
  family = c("gaussian", "binomial", "poisson")
)
Arguments
n | Sample size, number of rows for the feature matrix to be generated. |
p | Number of columns for the feature matrix to be generated. |
sigma | Parameter for noise level. |
num_ctgidx | The number of features that are categorical. Set to |
pos_ctgidx | Vector of indices denoting which columns are categorical. |
num_truecoef | The number of features (columns) that affect response. Default is 5. |
pos_truecoef | Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information. |
level_ctgidx | Vector to indicate the number of levels for the categorical features in |
effect_truecoef | Effect size corresponding to the features in |
correlation | Correlation structure among features. |
rho | Parameter controlling the correlation strength, default is |
family | Model type for the response variable. |
Details
Simulated data (y_i, x_i), where x_i = (x_{i1}, x_{i2}, ..., x_{ip}), are generated as follows. First, we generate a p by 1 model coefficient vector beta with all entries being zero, except for the positions specified in pos_truecoef, on which effect_truecoef is used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows:
(0.5+U) Z,
where U is sampled from a uniform distribution on (0, 1), and Z is a random sign taking the values 1 and -1 with P(Z=1) = 1/2 and P(Z=-1) = 1/2.
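The default effect sizes described above can be sketched in a few lines of base R. This is only an illustration of the formula (0.5+U) Z; the helper name `draw_effects` is hypothetical and is not a function of the package.

```r
# Illustrative sketch only: `draw_effects` is a hypothetical helper,
# not part of the package.
draw_effects <- function(num_truecoef) {
  U <- runif(num_truecoef, min = 0, max = 1)           # U ~ Uniform(0, 1)
  Z <- sample(c(-1, 1), num_truecoef, replace = TRUE)  # random sign, 1/2 each
  (0.5 + U) * Z
}

set.seed(1)
p <- 20
pos_truecoef <- sample(p, 5)   # random positions when none are supplied
beta <- numeric(p)             # all-zero p by 1 coefficient vector
beta[pos_truecoef] <- draw_effects(5)
beta
```

Every non-zero entry produced this way has absolute value between 0.5 and 1.5, so the simulated signals are bounded away from zero.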
Next, we generate an n by p feature matrix X according to the model selected with correlation and specified as follows.
Independent (ID): all features are independently generated from N(0, 1).
Moving average (MA): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_{j-1}) = \rho, cov(x_j, x_{j-2}) = \rho/2 and cov(x_j, x_h) = 0 for |j-h| > 3.
Compound symmetry (CS): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = \rho/2 if j and h are both in the set of important features and cov(x_j, x_h) = \rho when only one of j or h is in the set of important features.
Auto-regressive (AR): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = \rho^{|j-h|} for all j and h.
The correlation strength \rho is controlled by the argument rho.
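As a rough illustration of the auto-regressive structure, the sketch below builds the covariance matrix with entries \rho^{|j-h|} and draws features through a Cholesky factor in base R. The helper `ar_features` is hypothetical; the package itself may generate the features differently.

```r
# Illustrative sketch only: `ar_features` is a hypothetical helper.
ar_features <- function(n, p, rho) {
  # cov(x_j, x_h) = rho^|j - h|
  Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
  R <- chol(Sigma)                    # upper triangular, Sigma = t(R) %*% R
  matrix(rnorm(n * p), n, p) %*% R    # each row is N(0, Sigma)
}

set.seed(1)
X <- ar_features(n = 2000, p = 5, rho = 0.5)
round(cor(X), 2)   # neighbouring columns should correlate at roughly rho
```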
Then, we generate the response variable Y according to its response type, which is controlled by the argument family. For the Gaussian model, y_i = x_i \beta + \epsilon_i, where \epsilon_i is N(0, 1) for i from 1 to n. For the binary model, let \pi_i = P(Y = 1 | x_i). We sample y_i from Bernoulli(\pi_i), where logit(\pi_i) = x_i \beta. Finally, for the Poisson model, y_i is generated from the Poisson distribution with mean \pi_i = exp(x_i \beta), that is, with the log link. For more details see the reference below.
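The three response models can be written compactly in base R. The sketch below follows the formulas in this section; the function `simulate_response` is a hypothetical name used for illustration and is not exported by the package.

```r
# Illustrative sketch only: `simulate_response` is a hypothetical helper.
simulate_response <- function(X, beta,
                              family = c("gaussian", "binomial", "poisson")) {
  family <- match.arg(family)
  eta <- drop(X %*% beta)   # linear predictor x_i beta
  n <- length(eta)
  switch(family,
    gaussian = eta + rnorm(n),                           # N(0, 1) noise
    binomial = rbinom(n, size = 1, prob = plogis(eta)),  # logit(pi_i) = eta
    poisson  = rpois(n, lambda = exp(eta))               # log link
  )
}

set.seed(1)
X <- matrix(rnorm(50 * 4), 50, 4)
beta <- c(1, 0, -1, 0)
y_gauss <- simulate_response(X, beta, "gaussian")
y_binom <- simulate_response(X, beta, "binomial")
y_pois  <- simulate_response(X, beta, "poisson")
```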
Value
call | The call that produced this object. |
Y | Response variable vector of length |
X | Feature matrix or data.frame (matrix if |
subset_true | Vector of column indices of X for the features that affect the response variables (relevant features). |
coef_true | Vector of effects for the features that affect the response variables. |
categorical | Logical flag whether the model contains categorical features. |
CI | Indices of categorical features when |
rho, family and correlation are returned as passed in the function call.
References
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.
Examples
# Simulating data with binomial response and auto-regressive structure.
set.seed(1)
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")
cor(Data$X[, 1:5])
print(Data)
Joint feature screening via sparse maximum likelihood estimation for GLMs
Description
Input an n by 1 response Y and an n by p feature matrix X; the function uses SMLE to retain only a set of k < n features that seem to be most related to the response variable. It thus serves as a pre-processing step for an elaborative analysis. In SMLE, the joint effects between features are naturally accounted for; this makes the screening more reliable. The function uses the efficient iterative hard thresholding (IHT) algorithm with step parameter adaptively tuned for fast convergence. Users can choose to further conduct an elaborative selection after SMLE-screening. See smle_select() for more details.
Usage
SMLE(formula = NULL, ...)

## Default S3 method:
SMLE(
  formula = NULL,
  X = NULL,
  Y = NULL,
  data = NULL,
  k = NULL,
  family = c("gaussian", "binomial", "poisson"),
  keyset = NULL,
  intercept = TRUE,
  categorical = TRUE,
  group = TRUE,
  codingtype = NULL,
  coef_initial = NULL,
  max_iter = 500,
  tol = 10^(-3),
  selection = F,
  standardize = TRUE,
  fast = FALSE,
  U = 1,
  U_rate = 0.5,
  penalize_mod = TRUE,
  ...
)

## S3 method for class 'formula'
SMLE(formula, data, k = NULL, keyset = NULL, categorical = NULL, ...)
Arguments
formula | An object of class |
... | Additional arguments to be passed to |
X | The |
Y | The response vector |
data | An optional data frame, list or environment (or object coercible by |
k | Total number of features (including |
family | Model assumption between |
keyset | A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. The column indices for the key features should be from |
intercept | A logical flag to indicate whether an intercept is to be used in the model. An intercept will not participate in screening. |
categorical | A logical flag for whether the input feature matrix includes categorical features (either |
group | Logical flag for whether to treat the dummy covariates of a categorical feature as a group. (Only for categorical data, see Details). Default is |
codingtype | Coding types for categorical features; default is |
coef_initial | A |
max_iter | Maximum number of iteration steps. Default is 500. |
tol | A tolerance level to stop the iterations, when the squared sum of differences between two successive coefficient updates is below it. Default is |
selection | A logical flag to indicate whether an elaborate selection is to be conducted by |
standardize | A logical flag for feature standardization, prior to performing feature screening. The resulting coefficients are always returned on the original scale. If features are in the same units already, you might not wish to standardize. Default is |
fast | Set to |
U | A numerical multiplier of the initial tuning step parameter in the IHT algorithm. Default is 1. For the binomial model, a larger initial value is recommended; a smaller one is recommended for the Poisson model. |
U_rate | Decreasing rate in tuning step parameter |
penalize_mod | A logical flag to indicate whether adjustment is used in ranking groups of features. This argument is applicable only when |
Details
With the input Y and X, SMLE() conducts joint feature screening by running the iterative hard thresholding algorithm (IHT), where the default initial value is set to be the Lasso estimate with the sparsity closest to the sample size minus one.
In SMLE(), the initial value for the step size parameter 1/u is determined as follows. When coef_initial = 0, we set 1/u = U / \sqrt{p}. When coef_initial != 0, we generate a sub-matrix X_0 using the columns of X corresponding to the non-zero positions of coef_initial and set 1/u = U/\sqrt{p}||X||^2_{\infty} and recursively decrease the value of the step size by U_rate to guarantee the likelihood increment. This strategy is called u-search.
SMLE() terminates IHT iterations when either tol or max_iter is satisfied. When fast = TRUE, the algorithm also stops when the non-zero members of the coefficient estimates remain the same for 10 successive iterations or the log-likelihood difference between coefficient estimates is less than 0.01 times the log-likelihood increase of the first step, or tol \sqrt{k} is satisfied.
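To make the interplay of hard thresholding, u-search and the tol rule concrete, here is a minimal base R sketch of IHT for a Gaussian model. It assumes a zero initial value (rather than the Lasso default), no intercept and no standardization, and it is not the package's implementation; `hard_threshold` and `iht_gaussian` are hypothetical names.

```r
# Illustrative sketch only; simplified relative to SMLE().
# Keep the k largest entries of b in absolute value, zero out the rest.
hard_threshold <- function(b, k) {
  keep <- order(abs(b), decreasing = TRUE)[seq_len(k)]
  out <- numeric(length(b))
  out[keep] <- b[keep]
  out
}

iht_gaussian <- function(X, Y, k, U = 1, U_rate = 0.5,
                         max_iter = 500, tol = 1e-3) {
  beta <- numeric(ncol(X))                          # zero start (assumption)
  loglik <- function(b) -sum((Y - X %*% b)^2) / 2   # up to constants
  for (iter in seq_len(max_iter)) {
    grad <- drop(crossprod(X, Y - X %*% beta))      # Gaussian score
    step <- U / sqrt(ncol(X))                       # initial 1/u
    repeat {                                        # u-search
      cand <- hard_threshold(beta + step * grad, k)
      if (loglik(cand) >= loglik(beta) || step < 1e-12) break
      step <- step * U_rate                         # shrink the step
    }
    converged <- sum((cand - beta)^2) < tol         # squared-difference rule
    beta <- cand
    if (converged) break
  }
  beta
}

set.seed(1)
n <- 150; p <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, 1:3] %*% c(3, -3, 3)) + rnorm(n)
beta_hat <- iht_gaussian(X, Y, k = 3)
which(beta_hat != 0)   # indices of the retained features
```

The inner loop restarts from the initial step at every iteration and shrinks it by U_rate until the log-likelihood does not decrease, which mirrors the u-search idea described above in a simplified form.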
In SMLE(), categorical features are coded by dummy covariates with the method specified in codingtype. Users can use group to specify whether to treat those dummy covariates as a single group feature or as individual features. When group = TRUE with penalize_mod = TRUE, the effect for a group of J dummy covariates is computed by
\beta_i = \sqrt{(\beta_1)^2+...+(\beta_J)^2}/\sqrt{J},
which will be treated as a single feature in IHT iterations. When group = FALSE, a group of J dummy covariates will be treated as J individual features in the IHT iterations; in this case, a categorical feature is retained after screening when at least one of the corresponding dummy covariates is retained.
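The group effect formula above is simple enough to check by hand. The following base R sketch evaluates it for a group of J = 2 dummy covariates; `group_effect` is a hypothetical helper written for this illustration.

```r
# Illustrative sketch only: `group_effect` is a hypothetical helper.
# Computes sqrt(beta_1^2 + ... + beta_J^2) / sqrt(J) for one group.
group_effect <- function(beta_dummies) {
  J <- length(beta_dummies)
  sqrt(sum(beta_dummies^2)) / sqrt(J)
}

effect <- group_effect(c(3, 4))   # sqrt(9 + 16) / sqrt(2) = 5 / sqrt(2)
effect
```

Dividing by \sqrt{J} keeps categorical features with many levels from being favored purely because they contribute more dummy covariates.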
Since feature screening is usually a preprocessing step, users may wish to further conduct an elaborative feature selection after screening. This can be done by setting selection = TRUE in SMLE() or applying any existing selection method on the output of SMLE().
Value
call | The call that produced this object. |
ID_retained | A vector indicating the features retained after SMLE-screening. The output includes both features retained by |
coef_retained | The vector of coefficients estimated by IHT for the retained features. When the retained set contains a categorical feature, the value returns a group effect if |
path_retained | IHT iteration path with columns recording the coefficient updates. |
num_retained | Number of retained features after screening. |
intercept | The estimated intercept value by IHT, if |
steps | Number of IHT iterations. |
likelihood_iter | A list of log-likelihood updates over the IHT iterations. |
Usearch | A vector giving the number of attempts to find a proper |
modified_data | A list containing data objects generated by SMLE. |
iteration_data | A list containing data objects that track the coefficients over iterations. |
X, Y, data, family, categorical and codingtype are returned as passed in the function call.
References
UCLA Statistical Consulting Group. Coding systems for categorical variables in regression analysis. https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/. Retrieved May 28, 2020.
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.
Examples
# Example 1:
set.seed(1)
Data <- Gen_Data(n = 200, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 9, family = "gaussian")
summary(fit)
Data$subset_true %in% fit$ID_retained # Sure screening check.
plot(fit)
# Example 2:
set.seed(1)
Data_sim2 <- Gen_Data(n = 420, p = 1000, family = "gaussian", num_ctgidx = 5,
                      pos_ctgidx = c(1, 3, 5, 7, 9), effect_truecoef = c(1, 2, 3, -4, -5),
                      pos_truecoef = c(1, 3, 5, 7, 8), level_ctgidx = c(3, 3, 3, 4, 5))
train_X <- Data_sim2$X[1:400, ]; test_X <- Data_sim2$X[401:420, ]
train_Y <- Data_sim2$Y[1:400]; test_Y <- Data_sim2$Y[401:420]
fit <- SMLE(Y = train_Y, X = train_X, family = "gaussian", group = TRUE, k = 15)
predict(fit, newdata = test_X)
test_Y
# Example 3:
library(datasets)
data("attitude")
set.seed(1)
noise <- matrix(rnorm(30 * 100, mean = mean(attitude$rating), sd = 1), ncol = 100)
colnames(noise) <- paste("Noise", seq(100), sep = ".")
df <- data.frame(cbind(attitude, noise))
fit <- SMLE(rating ~ ., data = df)
fit
Extract coefficients from fitted model
Description
Extract coefficients from fitted model for either a 'smle' or 'selection' object.
Usage
## S3 method for class 'smle'
coef(object, refit = TRUE, ...)

## S3 method for class 'selection'
coef(object, refit = TRUE, ...)
Arguments
object | Returned object from either the function |
refit | A logical flag that controls which coefficients are returned. Default is |
... | This argument is not used and listed for method consistency. |
Value
Fitted coefficients based on the screened or selected model specified in the object. If refit = TRUE, the coefficients are estimated by re-fitting the final screened/selected model with glm(). If refit = FALSE, the coefficients estimated by the IHT algorithm are returned.
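The re-fitting step can be pictured as an ordinary glm() call restricted to the retained columns. The base R sketch below uses made-up data and a hand-picked index vector `ID_retained` standing in for a screening result; it only illustrates the idea and does not reproduce the internals of coef().

```r
# Illustrative sketch only: data and retained indices are made up.
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
Y <- 2 * X[, 2] - X[, 7] + rnorm(100)
ID_retained <- c(2, 7)   # stand-in for a screening result

# Refit a Gaussian GLM on the retained columns only
dat <- data.frame(Y = Y, X[, ID_retained, drop = FALSE])
names(dat)[-1] <- paste0("X", ID_retained)
refit <- glm(Y ~ ., data = dat, family = gaussian())
coef(refit)
```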
Examples
set.seed(1)
Data <- Gen_Data(n = 100, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 15, family = "gaussian")
coef(fit)
fit_s <- smle_select(fit)
coef(fit_s)
Extract log-likelihood
Description
This is a method written to extract the log-likelihood from 'smle' and 'selection' objects. It refits the model by glm() based on the response and the features selected after screening or selection, and returns an object of class 'logLik' from the generic.
Usage
## S3 method for class 'smle'
logLik(object, ...)

## S3 method for class 'selection'
logLik(object, ...)
Arguments
object | An object of class |
... | Forwarded arguments. |
Value
Returns an object of class 'logLik'. This is a number with at least one attribute, "df" (degrees of freedom), giving the number of (estimated) parameters in the model. For more details, see the generic logLik() in stats.
Examples
set.seed(1)
Data <- Gen_Data(n = 100, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 9, family = "gaussian")
logLik(fit)
Plots to visualize the post-screening selection
Description
This function constructs a sparsity vs. selection criterion curve for a 'selection' object. When EBIC is used with voting, it also constructs a histogram showing the voting result.
Usage
## S3 method for class 'selection'
plot(x, ...)
Arguments
x | A |
... | Additional arguments to the |
Value
No return value.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_s <- smle_select(fit, vote = TRUE)
plot(fit_s)
Plots to visualize SMLE screening
Description
This function returns two plot windows. By default, the first window shows 1) the solution path (estimated coefficient by iteration step) for the retained features. The second window contains four plots to assess convergence: 2) log-likelihood, 3) Euclidean distance between the current and the previous coefficient estimates, 4) the number of tries in u-search (see Details of SMLE()), and 5) the number of features changed in the current active set.
Usage
## S3 method for class 'smle'
plot(x, num_path = NULL, label = TRUE, which_path = NULL, out_plot = 1, ...)
Arguments
x | A |
num_path | The number of top coefficients to be shown. Default is equal to the number of features retained in the model. |
label | Logical flag for whether to label each curve with the feature index. Default is |
which_path | A vector to control which features are shown in addition to the paths for the most significant coefficients. |
out_plot | A number from 1 to 5 indicating which plot is to be shown in the separate window; the default for solution path plot is "1". See Description for plot labels 2-5. |
... | Additional arguments passed to the second plot. |
Value
No return value.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "CS")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
plot(fit)
Prediction based on SMLE screening and selection
Description
For a model object of class 'smle' or 'selection', this function returns the predicted response values after re-fitting the model using glm().
Usage
## S3 method for class 'smle'
predict(object, newdata = NULL, type = c("link", "response", "terms"), ...)

## S3 method for class 'selection'
predict(object, newdata = NULL, type = c("link", "response", "terms"), ...)
Arguments
object | A |
newdata | Matrix of new values for the features at which predictions are to be made. If omitted, the fitted linear predictors are used. |
type | The type of prediction required by |
... | Further arguments passed to |
Value
A prediction vector with length equal to the number of rows of newdata.
Examples
set.seed(1)
Data_sim <- Gen_Data(n = 420, p = 1000, sigma = 0.5, family = "gaussian")
train_X <- Data_sim$X[1:400, ]; test_X <- Data_sim$X[401:420, ]
train_Y <- Data_sim$Y[1:400]; test_Y <- Data_sim$Y[401:420]
fit1 <- SMLE(Y = train_Y, X = train_X, family = "gaussian", k = 10)
# Fitted responses vs true responses in training data
predict(fit1)[1:10]
train_Y[1:10]
# Predicted responses vs true responses in testing data
predict(fit1, newdata = test_X)
test_Y
Print an object
Description
This function prints information about the fitted model from a call to SMLE() or smle_select(), or about the simulated data from a call to Gen_Data(). The object passed as an argument to print is returned invisibly.
Usage
## S3 method for class 'smle'
print(x, ...)

## S3 method for class 'selection'
print(x, ...)

## S3 method for class 'summary.smle'
print(x, ...)

## S3 method for class 'summary.selection'
print(x, ...)

## S3 method for class 'sdata'
print(x, ...)
Arguments
x | Fitted object. |
... | This argument is not used and listed for method consistency. |
Value
Return argument invisibly.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
Data
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
print(fit)
summary(fit)
p values of synthetic genetic association study data set
Description
The first column is the chromosome number. The second column is the SNP name. The third column is the genomic position of the SNP on the whole data set. The marginal p-value of each SNP is pre-calculated and saved in the fourth column.
Usage
data(pvals)
Format
An object of class data.frame with 10031 rows and 4 columns.
Elaborative post-screening selection with SMLE
Description
The features retained after screening are still likely to contain some that are not related to the response. The function smle_select() is designed to further identify the relevant features using SMLE(). Given a response and a set of K features, this function first runs SMLE(fast = TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series based on a selection criterion.
When criterion EBIC is used, users can choose to repeat the selection with different values of the tuning parameter \gamma, and conduct importance voting for each feature. When vote = TRUE, this function fits all the models with \gamma specified in gamma_seq, and features with frequency higher than vote_threshold will be selected in ID_voted.
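The voting rule can be illustrated with a toy example in base R. The selections below are made up, one per value of \gamma in the default gamma_seq, and the strict inequality follows the wording "frequency higher than vote_threshold"; the package's internal bookkeeping may differ.

```r
# Illustrative sketch only: made-up selections, one per gamma value.
gamma_seq <- seq(0, 1, 0.2)
selected_by_gamma <- list(c(1, 4, 9), c(1, 4), c(1, 4, 9),
                          c(1, 9), c(1, 4), c(1))

# Share of gamma values under which each feature was selected
votes <- table(unlist(selected_by_gamma)) / length(gamma_seq)

vote_threshold <- 0.6
ID_pool  <- as.integer(names(votes))                          # ever selected
ID_voted <- as.integer(names(votes)[votes > vote_threshold])  # pass the vote
ID_voted
```

Here feature 1 is chosen under every \gamma, feature 4 under four of six and feature 9 under three of six, so only features 1 and 4 pass a threshold of 0.6.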
Usage
smle_select(object, ...)

## S3 method for class 'sdata'
smle_select(
  object,
  k_min = 1,
  k_max = NULL,
  subset = NULL,
  gamma_ebic = 0.5,
  vote = FALSE,
  keyset = NULL,
  criterion = "ebic",
  codingtype = c("DV", "standard", "all"),
  gamma_seq = c(seq(0, 1, 0.2)),
  vote_threshold = 0.6,
  parallel = FALSE,
  num_clusters = NULL,
  ...
)

## Default S3 method:
smle_select(
  object = NULL,
  Y = NULL,
  X = NULL,
  family = "gaussian",
  keyset = NULL,
  ...
)

## S3 method for class 'smle'
smle_select(object, ...)
Arguments
object | Object of class |
... | Further arguments passed to or from other methods. |
k_min | The lower bound of candidate model sparsity. Default is 1. |
k_max | The upper bound of candidate model sparsity. Default is the number of columns in feature matrix. |
subset | An index vector indicating which features (columns of the feature matrix) are to be selected. Not applicable if a |
gamma_ebic | The EBIC tuning parameter, in |
vote | The logical flag for whether to perform the voting procedure. Only available when |
keyset | A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. See SMLE for details. |
criterion | Selection criterion. One of " |
codingtype | Coding types for categorical features; for more details see |
gamma_seq | The sequence of values for |
vote_threshold | A relative voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold. Default is 0.6. |
parallel | A logical flag to use parallel computing to do voting selection. Default is |
num_clusters | The number of compute clusters to use when |
Y | Input response vector (when |
X | Input features matrix (when |
family | Model assumption; see When input is a |
Details
This function accepts three types of input objects: 1) a 'smle' object, as the output from SMLE(); 2) a 'sdata' object, as the output from Gen_Data(); 3) other response and feature matrix input by users.
Note that this function is mainly designed to conduct an elaborative selection after feature screening. We do not recommend using it directly for ultra-high-dimensional data without screening.
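For readers unfamiliar with EBIC, the criterion of Chen and Chen adds a model-space penalty to BIC: for a model with k features out of p, sample size n and tuning parameter \gamma, it is commonly written as -2 logLik + k log(n) + 2 \gamma log choose(p, k). The base R sketch below evaluates that expression; the helper `ebic` is hypothetical, and how the package counts degrees of freedom (for example, the intercept) is not specified here.

```r
# Illustrative sketch only: `ebic` is a hypothetical helper.
ebic <- function(loglik, k, n, p, gamma = 0.5) {
  # lchoose(p, k) is log(choose(p, k)), computed stably for large p
  -2 * loglik + k * log(n) + 2 * gamma * lchoose(p, k)
}

bic_value  <- ebic(loglik = -100, k = 3, n = 200, p = 1000, gamma = 0)
ebic_value <- ebic(loglik = -100, k = 3, n = 200, p = 1000, gamma = 0.5)
c(bic = bic_value, ebic = ebic_value)
```

With \gamma = 0 the expression reduces to the usual BIC, while larger values of \gamma penalize model size more heavily when p is large.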
Value
call | The call that produced this object. |
ID_selected | A list of selected features. |
coef_selected | Fitted model coefficients. |
intercept | Fitted model intercept. |
criterion_value | Values of selection criterion for the candidate models with various sparsity. |
categorical | A logical flag whether the input feature matrix includes categorical features |
ID_pool | A vector containing all features selected during voting. |
ID_voted | A vector containing the features selected when |
CI | Indices of categorical features when |
X, Y, family, gamma_ebic, gamma_seq, criterion, vote, codingtype and vote_threshold are returned as passed in the function call.
References
Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-p sparse GLM." Statistica Sinica, 22(2), 555-574.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_bic <- smle_select(fit, criterion = "bic")
summary(fit_bic)
fit_ebic <- smle_select(fit, criterion = "ebic", vote = TRUE)
summary(fit_ebic)
plot(fit_ebic)
Summarize SMLE-screening and selection
Description
This function prints a summary of a 'smle' (or a 'selection') object. In particular, it shows the features retained after SMLE-screening (or selection) with the related convergence information.
Usage
## S3 method for class 'smle'
summary(object, ...)

## S3 method for class 'selection'
summary(object, ...)
Arguments
object | A |
... | This argument is not used and listed for method consistency. |
Value
No return value.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
summary(fit)
fit_s <- smle_select(fit)
summary(fit_s)
Synthetic genetic association study data set
Description
This simulated data set consists of 10,031 genetic variants (SNPs) and a continuous response variable measured on 800 individuals. The genotypes were sampled from genotypic distributions derived from the 1000 Genomes project using the R package sim1000G. The genotype is coded as 0, 1, or 2 by counting the number of minor alleles (the allele that is less common in the sample). The continuous response variable was simulated from a normal distribution with mean that depends additively on the causal SNPs.
Usage
data(synSNP)
Format
An object of class'data.frame' with 800 rows and 10,032 columns.
References
The 1000 Genomes Project Consortium (2015). Global reference for human genetic variation, Nature, 526(7571), 68-74.
Examples
data(synSNP)
Y_SNP <- synSNP[, 1]
X_SNP <- synSNP[, -1]
fit <- SMLE(Y = Y_SNP, X = X_SNP, k = 40)
summary(fit)
plot(fit)
Extract and adjust voting from SMLE selection
Description
When smle_select() is used with criterion = "ebic" and vote = TRUE, users can use vote_update() to adjust the voting threshold without needing to rerun smle_select().
Usage
vote_update(object, ...)

## S3 method for class 'selection'
vote_update(object, vote_threshold = 0.6, ...)
Arguments
object | A |
... | This argument is not used and listed for method consistency. |
vote_threshold | A voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold. Default is 0.6. |
Value
The function returns a vector indicating the features selected by EBIC voting with the specified vote_threshold.
Examples
set.seed(1)
Data <- Gen_Data(n = 100, p = 3000, correlation = "MA", rho = 0.7, family = "gaussian")
colnames(Data$X) <- paste("X.", seq(3000), sep = "")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_s <- smle_select(fit, criterion = "ebic", vote = TRUE)
plot(fit_s)
fit_s
vote_update(fit_s, vote_threshold = 0.4)