| Version: | 1.6.1 |
| Date: | 2025-03-05 |
| Title: | Extending Lasso Model Fitting to Big Data |
| Maintainer: | Patrick Breheny <patrick-breheny@uiowa.edu> |
| Description: | Extend lasso and elastic-net model fitting for large data sets that cannot be loaded into memory. Designed to be more memory- and computation-efficient than existing lasso-fitting packages like 'glmnet' and 'ncvreg', thus allowing the user to analyze big data with limited RAM <doi:10.32614/RJ-2021-001>. |
| License: | GPL-3 |
| URL: | https://pbreheny.github.io/biglasso/, https://github.com/pbreheny/biglasso |
| BugReports: | https://github.com/pbreheny/biglasso/issues |
| Depends: | R (≥ 3.2.0), bigmemory (≥ 4.5.0), Matrix, ncvreg |
| Imports: | Rcpp (≥ 0.12.1), methods |
| LinkingTo: | Rcpp, RcppArmadillo (≥ 0.8.600), bigmemory, BH |
| VignetteBuilder: | knitr |
| Suggests: | glmnet, knitr, parallel, rmarkdown, survival, tinytest |
| RoxygenNote: | 7.3.2 |
| Encoding: | UTF-8 |
| NeedsCompilation: | yes |
| Packaged: | 2025-03-05 17:32:37 UTC; pbreheny |
| Author: | Yaohui Zeng [aut], Chuyi Wang [aut], Tabitha Peter [aut], Patrick Breheny [aut, cre] |
| Repository: | CRAN |
| Date/Publication: | 2025-03-05 18:10:14 UTC |
Extending Lasso Model Fitting to Big Data
Description
Extend lasso and elastic-net linear, logistic, and Cox regression models to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into available RAM. This package utilizes memory-mapped files to store the massive data on disk, reading them into memory only when necessary during model fitting. Moreover, some advanced feature screening rules are proposed and implemented to accelerate the model fitting. As a result, this package is much more memory- and computation-efficient and highly scalable compared to existing lasso-fitting packages such as glmnet and ncvreg, thus allowing for powerful big data analysis even on an ordinary laptop.
Details
| Package: | biglasso |
| Type: | Package |
| Version: | 1.6.1 |
| Date: | 2025-03-05 |
| License: | GPL-3 |
Penalized regression models, in particular the lasso, have been extensively applied to analyzing high-dimensional data sets. However, due to memory limits, existing R packages are not capable of fitting lasso models for ultrahigh-dimensional, multi-gigabyte data sets, which are increasingly seen in areas such as genetics, biomedical imaging, genome sequencing, and high-frequency finance.
This package aims to fill the gap by extending lasso model fitting to Big Data in R. Version >= 1.2-3 represents a major redesign: the source code was converted to C++ (previously in C), and new feature screening rules, as well as OpenMP parallel computing, were implemented. Some key features of biglasso are summarized below:
it utilizes memory-mapped files to store the massive data on disk, loading data into memory only when necessary during model fitting. Consequently, it is able to seamlessly handle data-larger-than-RAM cases.
it is built upon the pathwise coordinate descent algorithm with warm starts, active set cycling, and feature screening strategies, which has been proven to be one of the fastest lasso solvers.
it incorporates our newly developed hybrid and adaptive screening rules that outperform state-of-the-art screening rules such as the sequential strong rule (SSR) and the sequential EDPP rule (SEDPP), with an additional 1.5x to 4x speedup.
the implementation is designed to be as memory-efficient as possible by eliminating extra copies of the data created by other R packages, making it at least 2x more memory-efficient than glmnet.
the underlying computation is implemented in C++, and parallel computing with OpenMP is also supported.
For more information:
Benchmarking results: https://github.com/pbreheny/biglasso
Tutorial: https://pbreheny.github.io/biglasso/articles/biglasso.html
Technical paper: https://arxiv.org/abs/1701.05936
Note
The input design matrix X must be a bigmemory::big.matrix() object. This can be created by the function as.big.matrix in the R package bigmemory. If the data (design matrix) is very large (e.g. 10 GB) and stored in an external file, which is often the case for big data, X can be created by calling the function setupX(). In this case, there are several restrictions on the data file:
the data file must be a well-formatted ASCII file, with each row corresponding to an observation and each column to a variable;
the data file must contain only a single data type; the current version supports only the double type;
the data file must contain only numeric variables. If there are categorical variables, the user needs to create dummy variables for each categorical variable (by adding additional columns); see the sketch after this list.
Future versions will try to address these restrictions.
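For the categorical-variable restriction, the variables can be dummy-coded before conversion. Below is a minimal sketch of one way to do this; the data frame df and its factor column group are hypothetical, with model.matrix() doing the dummy coding and as.big.matrix() the conversion:

library(bigmemory)

# Hypothetical data frame with one numeric and one categorical variable
df <- data.frame(x1 = rnorm(10), group = factor(rep(c("a", "b"), 5)))

# Dummy-code the factor; drop the intercept column that model.matrix() adds
X <- model.matrix(~ x1 + group, data = df)[, -1]

# Convert the all-numeric matrix to a double-type big.matrix for biglasso()
X.bm <- as.big.matrix(X, type = "double")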
Denote the number of observations and variables by n and p, respectively. It is worth noting that the package is more suitable for wide data (ultrahigh-dimensional, p >> n) than for long data (n >> p), because the model fitting algorithm takes advantage of the sparsity assumption of high-dimensional data. To give the user some idea, below are benchmarking results of the total computing time (in seconds) for solving lasso-penalized linear regression along a sequence of 100 values of the tuning parameter. In all cases, 20 non-zero coefficients equal to +/- 2 are assumed in the true model. (Based on version 1.2-3; screening rule "SSR-BEDPP" is used.)
For the wide data case (p > n), n = 1,000:
| p | 1,000 | 10,000 | 100,000 | 1,000,000 |
| Size of X | 9.5 MB | 95 MB | 950 MB | 9.5 GB |
| Elapsed time (s) | 0.11 | 0.83 | 8.47 | 85.50 |
For the long data case (n >> p), p = 1,000:
| n | 1,000 | 10,000 | 100,000 | 1,000,000 |
| Size of X | 9.5 MB | 95 MB | 950 MB | 9.5 GB |
| Elapsed time (s) | 2.50 | 11.43 | 83.69 | 1090.62 |
Author(s)
Yaohui Zeng, Chuyi Wang, Tabitha Peter, and Patrick Breheny
References
Zeng Y and Breheny P (2021). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. R Journal, 12: 6-19. doi:10.32614/RJ-2021-001
Wang C and Breheny P (2022). Adaptive hybrid screening for efficient lasso optimization. Journal of Statistical Computation and Simulation, 92: 2233-2256. doi:10.1080/00949655.2021.2025376
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2), 245-266.
Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pp. 1070-1078.
Xiang, Z. J., and Ramadge, P. J. (2012). Fast lasso screening tests based on correlations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 2137-2140). IEEE.
Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pp. 1053-1061.
Examples
## Not run:
## Example of reading data from an external big data file, fitting a lasso model,
## and running cross-validation in parallel

# simulated design matrix, 1000 observations, 500,000 variables, ~ 5GB
# there are 10 true variables with non-zero coefficient 2
xfname <- 'x_e3_5e5.txt'
yfname <- 'y_e3_5e5.txt'  # response vector
time <- system.time(
  X <- setupX(xfname, sep = '\t')  # create backing files (.bin, .desc)
)
print(time)  # ~ 7 minutes; this is just a one-time operation
dim(X)

# the big.matrix can then be retrieved by its descriptor file (.desc)
# in any new R session
rm(X)
xdesc <- 'x_e3_5e5.desc'
X <- attach.big.matrix(xdesc)
dim(X)

y <- as.matrix(read.table(yfname, header = FALSE))
time.fit <- system.time(
  fit <- biglasso(X, y, family = 'gaussian', screen = 'Hybrid')
)
print(time.fit)  # ~ 44 seconds for fitting a lasso model along the entire solution path

# cross-validation in parallel
seed <- 1234
time.cvfit <- system.time(
  cvfit <- cv.biglasso(X, y, family = 'gaussian', screen = 'Hybrid',
                       seed = seed, ncores = 4, nfolds = 10)
)
print(time.cvfit)  # ~ 3 minutes for 10-fold cross-validation
plot(cvfit)
summary(cvfit)
## End(Not run)

Fit lasso penalized regression path for big data
Description
Extend lasso model fitting to big data that cannot be loaded into memory. Fit solution paths for linear, logistic, or Cox regression models penalized by lasso, ridge, or elastic-net over a grid of values for the regularization parameter lambda.
Usage
biglasso(
  X,
  y,
  row.idx = 1:nrow(X),
  penalty = c("lasso", "ridge", "enet"),
  family = c("gaussian", "binomial", "cox", "mgaussian"),
  alg.logistic = c("Newton", "MM"),
  screen = c("Adaptive", "SSR", "Hybrid", "None"),
  safe.thresh = 0,
  update.thresh = 1,
  ncores = 1,
  alpha = 1,
  lambda.min = ifelse(nrow(X) > ncol(X), 0.001, 0.05),
  nlambda = 100,
  lambda.log.scale = TRUE,
  lambda,
  eps = 1e-07,
  max.iter = 1000,
  dfmax = ncol(X) + 1,
  penalty.factor = rep(1, ncol(X)),
  warn = TRUE,
  output.time = FALSE,
  return.time = TRUE,
  verbose = FALSE
)

Arguments
X | The design matrix, without an intercept. It must be a double-type big.matrix object. |
y | The response vector. |
row.idx | The integer vector of row indices of X that are used for fitting the model. |
penalty | The penalty to be applied to the model. Either "lasso" (the default), "ridge", or "enet" (elastic net). |
family | Either "gaussian", "binomial", "cox", or "mgaussian", depending on the response. |
alg.logistic | The algorithm used in logistic regression. If "Newton", then the exact Hessian is used (default); if "MM", then a majorization-minimization algorithm is used to set an upper bound on the Hessian matrix. This can be faster, particularly in the data-larger-than-RAM case. |
screen | The feature screening rule used at each lambda value: "Adaptive" (the default), "SSR" (sequential strong rule), "Hybrid", or "None". |
safe.thresh | The threshold value between 0 and 1 that controls when to stop the safe test. For example, 0.01 means to stop the safe test at the next lambda iteration if the number of features rejected by the safe test at the current lambda iteration is not larger than 1% of the total number of features. A value of 1 means to always turn off the safe test, whereas 0 (default) means to turn off the safe test only if the number of features rejected by the safe test is 0 at the current lambda. |
update.thresh | The non-negative threshold value that controls how often to update the reference of the safe rules for the "Adaptive" method. Smaller values mean more frequent updates. |
ncores | The number of OpenMP threads used for parallel computing. |
alpha | The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as \alpha||\beta||_1 + (1-\alpha)/2||\beta||_2^2. alpha = 1 is the lasso penalty, and 0 < alpha < 1 gives the elastic-net penalty. |
lambda.min | The smallest value for lambda, as a fraction of lambda.max. Default is 0.001 if the number of observations is larger than the number of covariates and 0.05 otherwise. |
nlambda | The number of lambda values. Default is 100. |
lambda.log.scale | Whether to compute the grid values of lambda on the log scale (default) or on a linear scale. |
lambda | A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed automatically, equally spaced on the log scale. |
eps | Convergence threshold for inner coordinate descent. The algorithm iterates until the maximum change in the objective after any coefficient update is less than eps times the null deviance. Default is 1e-7. |
max.iter | Maximum number of iterations. Default is 1000. |
dfmax | Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, the computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor | A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty.factor must be a numeric vector of length equal to the number of columns of X. |
warn | Return warning messages for failures to converge and modelsaturation? Default is TRUE. |
output.time | Whether to print out the start and end time of the modelfitting. Default is FALSE. |
return.time | Whether to return the computing time of the modelfitting. Default is TRUE. |
verbose | Whether to output the timing of each lambda iteration. Default is FALSE. |
Details
The objective function for linear regression or multiple-response linear regression (family = "gaussian" or family = "mgaussian") is
\frac{1}{2n}\textrm{RSS} + \lambda*\textrm{penalty},
where for family = "mgaussian" a group-lasso-type penalty is applied. For logistic regression (family = "binomial") it is
-\frac{1}{n}\textrm{loglike} + \lambda*\textrm{penalty}.
For Cox regression (family = "cox"), the Breslow approximation for ties is applied.
Several advanced feature screening rules are implemented. For lasso-penalized linear regression, all the options of screen are applicable. Our proposed adaptive rule, "Adaptive", achieves the highest speedup, so it is the recommended one, especially for ultrahigh-dimensional, large-scale data sets. For Cox regression and/or the elastic-net penalty, only "SSR" is applicable for now. More efficient rules are under development.
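As an illustration, the sketch below fits the same lasso path on small simulated data under two screening rules and inspects the number of features each rule rejects; the data, dimensions, and any timings observed are purely illustrative:

library(biglasso)
set.seed(1)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:5] %*% rep(2, 5) + rnorm(n))
X.bm <- as.big.matrix(X)

# Fit the same lasso path under two screening rules
system.time(fit.ssr <- biglasso(X.bm, y, screen = "SSR"))
system.time(fit.ada <- biglasso(X.bm, y, screen = "Adaptive"))

# Number of features rejected at each value of lambda
head(fit.ssr$rejections)
head(fit.ada$rejections)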
Value
An object with S3 class "biglasso" for the "gaussian", "binomial", and "cox" families, or an object with S3 class "mbiglasso" for the "mgaussian" family, with the following variables.
beta | The fitted matrix of coefficients, stored in sparse matrix representation. The number of rows is equal to the number of coefficients, whereas the number of columns is equal to nlambda. |
iter | A vector of length nlambda containing the number of iterations until convergence at each value of lambda. |
lambda | The sequence of regularization parameter values in the path. |
penalty | Same as above. |
family | Same as above. |
alpha | Same as above. |
loss | A vector containing either the residual sum of squares (for family = "gaussian") or the negative log-likelihood (for other families) of the fitted model at each value of lambda. |
penalty.factor | Same as above. |
n | The number of observations used in the model fitting. It is equal to length(row.idx). |
center | The sample mean vector of the variables, i.e., the column means of the sub-matrix of X used for model fitting. |
scale | The sample standard deviations of the variables, i.e., the column standard deviations of the sub-matrix of X used for model fitting. |
y | The response vector used in the model fitting. Depending on row.idx, it could be a subset of the original input. |
screen | Same as above. |
col.idx | The indices of features that have a 'scale' value greater than 1e-6. Features with 'scale' less than 1e-6 are removed from model fitting. |
rejections | The number of features rejected at each value of lambda. |
safe_rejections | The number of features rejected by the safe rules at each value of lambda. |
Author(s)
Yaohui Zeng, Chuyi Wang and Patrick Breheny
References
Zeng Y and Breheny P (2021). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. R Journal, 12: 6-19. doi:10.32614/RJ-2021-001
See Also
biglasso-package, setupX(), cv.biglasso(), plot.biglasso(), ncvreg::ncvreg()
Examples
## Linear regression
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X)
# lasso, default
par(mfrow = c(1, 2))
fit.lasso <- biglasso(X.bm, y, family = 'gaussian')
plot(fit.lasso, log.l = TRUE, main = 'lasso')
# elastic net
fit.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = 'gaussian')
plot(fit.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5')

## Logistic regression
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X)
# lasso, default
par(mfrow = c(1, 2))
fit.bin.lasso <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial")
plot(fit.bin.lasso, log.l = TRUE, main = 'lasso')
# elastic net
fit.bin.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = "binomial")
plot(fit.bin.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5')

## Cox regression
set.seed(10101)
N <- 1000; p <- 30; nzc <- p / 3
X <- matrix(rnorm(N * p), N, p)
beta <- rnorm(nzc)
fx <- X[, seq(nzc)] %*% beta / 3
hx <- exp(fx)
ty <- rexp(N, hx)
tcens <- rbinom(n = N, prob = 0.3, size = 1)  # censoring indicator
y <- cbind(time = ty, status = 1 - tcens)  # y <- Surv(ty, 1 - tcens) with library(survival)
X.bm <- as.big.matrix(X)
fit <- biglasso(X.bm, y, family = "cox")
plot(fit, main = "cox")

## Multiple responses linear regression
set.seed(10101)
n <- 300; p <- 300; m <- 5; s <- 10; b <- 1
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(seq(from = -b, to = b, length.out = s * m), s, m)
y <- x[, 1:s] %*% beta + matrix(rnorm(n * m, 0, 1), n, m)
x.bm <- as.big.matrix(x)
fit <- biglasso(x.bm, y, family = "mgaussian")
plot(fit, main = "mgaussian")

Direct interface to biglasso fitting, no preprocessing
Description
This function is intended for users who know exactly what they're doing and want complete control over the fitting process. It:
does NOT add an intercept
does NOT standardize the design matrix
does NOT set up a path for lambda (the lasso tuning parameter)
All of the above are critical steps in data analysis. However, a direct API has been provided for use in situations where the lasso fitting process is an internal component of a more complicated algorithm and standardization must be handled externally; see the sketch below.
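For instance, one might standardize the design matrix and center the response by hand before calling biglasso_fit(). The sketch below, on simulated data, is one way to do this, with xtx and r constructed as described under Arguments:

library(biglasso)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1] * 2 + rnorm(n))

# Standardize X and center y by hand, since biglasso_fit() will not
Xs <- scale(X)
ys <- y - mean(y)

X.bm <- as.big.matrix(Xs)
xtx <- apply(Xs, 2, crossprod) / n  # jth element: crossprod(Xs[, j]) / n
r <- ys                             # residuals at init = 0
fit <- biglasso_fit(X = X.bm, y = ys, r = r, init = rep(0, p),
                    xtx = xtx, lambda = 0.1)
fit$beta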
Usage
biglasso_fit(
  X,
  y,
  r,
  init = rep(0, ncol(X)),
  xtx,
  penalty = "lasso",
  lambda,
  alpha = 1,
  gamma,
  ncores = 1,
  max.iter = 1000,
  eps = 1e-05,
  dfmax = ncol(X) + 1,
  penalty.factor = rep(1, ncol(X)),
  warn = TRUE,
  output.time = FALSE,
  return.time = TRUE
)

Arguments
X | The design matrix, without an intercept. It must be a double-type big.matrix object. |
y | The response vector |
r | Residuals (length n vector) corresponding to init. |
init | Initial values for beta. Default: zero (length p vector) |
xtx | X scales: the jth element should equal crossprod(X[, j]) / n. |
penalty | String specifying which penalty to use. Default is 'lasso'; other options are 'SCAD' and 'MCP' (the latter two are nonconvex). |
lambda | A single value for the lasso tuning parameter. |
alpha | The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as \alpha||\beta||_1 + (1-\alpha)/2||\beta||_2^2. alpha = 1 is the lasso penalty, and 0 < alpha < 1 gives the elastic-net penalty. |
gamma | Tuning parameter value for the nonconvex penalty. Defaults are 3.7 for 'SCAD' and 3 for 'MCP'. |
ncores | The number of OpenMP threads used for parallel computing. |
max.iter | Maximum number of iterations. Default is 1000. |
eps | Convergence threshold for inner coordinate descent. The algorithm iterates until the maximum change in the objective after any coefficient update is less than eps times the null deviance. Default is 1e-5. |
dfmax | Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, the computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor | A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty.factor must be a numeric vector of length equal to the number of columns of X. |
warn | Return warning messages for failures to converge and modelsaturation? Default is TRUE. |
output.time | Whether to print out the start and end time of the modelfitting. Default is FALSE. |
return.time | Whether to return the computing time of the modelfitting. Default is TRUE. |
Details
Note:
Hybrid safe-strong rules are turned off for biglasso_fit(), as these rely on standardization.
Currently, the function only works with linear regression (family = 'gaussian').
Value
An object with S3 class "biglasso" with the following variables.
beta | The vector of estimated coefficients |
iter | A vector of length nlambda containing the number of iterations until convergence. |
resid | Vector of residuals calculated from estimated coefficients. |
lambda | The sequence of regularization parameter values in the path. |
alpha | Same as in biglasso(). |
loss | A vector containing the residual sum of squares of the fitted model at each value of lambda. |
penalty.factor | Same as in biglasso(). |
n | The number of observations used in the model fitting. |
y | The response vector used in the model fitting. |
Author(s)
Tabitha Peter and Patrick Breheny
Examples
data(Prostate)
X <- cbind(1, Prostate$X)
xtx <- apply(X, 2, crossprod) / nrow(X)
y <- Prostate$y
X.bm <- as.big.matrix(X)
init <- rep(0, ncol(X))
fit <- biglasso_fit(X = X.bm, y = y, r = y, init = init, xtx = xtx,
                    lambda = 0.1, penalty.factor = c(0, rep(1, ncol(X) - 1)),
                    max.iter = 10000)
fit$beta

fit <- biglasso_fit(X = X.bm, y = y, r = y, init = init, xtx = xtx,
                    penalty = 'MCP', lambda = 0.1,
                    penalty.factor = c(0, rep(1, ncol(X) - 1)), max.iter = 10000)
fit$beta

Direct interface to biglasso fitting, no preprocessing, path version
Description
This function is intended for users who know exactly what they're doing and want complete control over the fitting process. It:
does NOT add an intercept
does NOT standardize the design matrix
Both of the above are critical steps in data analysis. However, a direct API has been provided for use in situations where the lasso fitting process is an internal component of a more complicated algorithm and standardization must be handled externally.
Usage
biglasso_path(
  X,
  y,
  r,
  init = rep(0, ncol(X)),
  xtx,
  penalty = "lasso",
  lambda,
  alpha = 1,
  gamma,
  ncores = 1,
  max.iter = 1000,
  eps = 1e-05,
  dfmax = ncol(X) + 1,
  penalty.factor = rep(1, ncol(X)),
  warn = TRUE,
  output.time = FALSE,
  return.time = TRUE
)

Arguments
X | The design matrix, without an intercept. It must be a double-type big.matrix object. |
y | The response vector |
r | Residuals (length n vector) corresponding to init. |
init | Initial values for beta. Default: zero (length p vector) |
xtx | X scales: the jth element should equal crossprod(X[, j]) / n. |
penalty | String specifying which penalty to use. Default is 'lasso'; other options are 'SCAD' and 'MCP' (the latter two are nonconvex). |
lambda | A vector of numeric values for the lasso tuning parameter. |
alpha | The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as \alpha||\beta||_1 + (1-\alpha)/2||\beta||_2^2. alpha = 1 is the lasso penalty, and 0 < alpha < 1 gives the elastic-net penalty. |
gamma | Tuning parameter value for the nonconvex penalty. Defaults are 3.7 for 'SCAD' and 3 for 'MCP'. |
ncores | The number of OpenMP threads used for parallel computing. |
max.iter | Maximum number of iterations. Default is 1000. |
eps | Convergence threshold for inner coordinate descent. The algorithm iterates until the maximum change in the objective after any coefficient update is less than eps times the null deviance. Default is 1e-5. |
dfmax | Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, the computational burden may be heavy for models with a large number of nonzero coefficients. |
penalty.factor | A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty.factor must be a numeric vector of length equal to the number of columns of X. |
warn | Return warning messages for failures to converge and modelsaturation? Default is TRUE. |
output.time | Whether to print out the start and end time of the modelfitting. Default is FALSE. |
return.time | Whether to return the computing time of the modelfitting. Default is TRUE. |
Details
biglasso_path() works identically to biglasso_fit() except that it offers the additional option of fitting models across a path of tuning parameter values.
Note:
Hybrid safe-strong rules are turned off for biglasso_path(), as these rely on standardization.
Currently, the function only works with linear regression (family = 'gaussian').
Value
An object with S3 class "biglasso" with the following variables.
beta | A sparse matrix in which each row gives the estimates for a given coefficient across all values of lambda. |
iter | A vector of length nlambda containing the number of iterations until convergence at each value of lambda. |
resid | Vector of residuals calculated from estimated coefficients. |
lambda | The sequence of regularization parameter values in the path. |
alpha | Same as in biglasso(). |
loss | A vector containing the residual sum of squares of the fitted model at each value of lambda. |
penalty.factor | Same as in biglasso(). |
n | The number of observations used in the model fitting. |
y | The response vector used in the model fitting. |
Author(s)
Tabitha Peter and Patrick Breheny
Examples
data(Prostate)
X <- cbind(1, Prostate$X)
xtx <- apply(X, 2, crossprod) / nrow(X)
y <- Prostate$y
X.bm <- as.big.matrix(X)
init <- rep(0, ncol(X))
fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx,
                     lambda = c(0.5, 0.1, 0.05, 0.01, 0.001),
                     penalty.factor = c(0, rep(1, ncol(X) - 1)), max.iter = 2000)
fit$beta

fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx,
                     lambda = c(0.5, 0.1, 0.05, 0.01, 0.001), penalty = 'MCP',
                     penalty.factor = c(0, rep(1, ncol(X) - 1)), max.iter = 2000)
fit$beta

Gene expression data from colon-cancer patients
Description
The data file contains gene expression data of 62 samples (40 tumor samples, 22 normal samples) from colon-cancer patients analyzed with an Affymetrix oligonucleotide Hum6000 array.
Format
A list of 2 variables included in colon:
X: a 62-by-2000 matrix that records the gene expression data. Used as the design matrix.
y: a binary vector of length 62 recording the sample status: 1 = tumor; 0 = normal. Used as the response vector.
Source
The raw data can be found on Bioconductor: https://bioconductor.org/packages/release/data/experiment/html/colonCA.html.
References
U. Alon et al. (1999): Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745-6750. https://www.pnas.org/doi/abs/10.1073/pnas.96.12.6745.
Examples
data(colon)
X <- colon$X
y <- colon$y
str(X)
dim(X)
X.bm <- as.big.matrix(X, backingfile = "")  # convert to a big.matrix object
str(X.bm)
dim(X.bm)

Cross-validation for biglasso
Description
Perform k-fold cross validation for penalized regression models over a gridof values for the regularization parameter lambda.
Usage
cv.biglasso(
  X,
  y,
  row.idx = 1:nrow(X),
  family = c("gaussian", "binomial", "cox", "mgaussian"),
  eval.metric = c("default", "MAPE", "auc", "class"),
  ncores = parallel::detectCores(),
  ...,
  nfolds = 5,
  seed,
  cv.ind,
  trace = FALSE,
  grouped = TRUE
)

Arguments
X | The design matrix, without an intercept, as in biglasso(). |
y | The response vector, as in biglasso(). |
row.idx | The integer vector of row indices of X that are used for fitting the model. |
family | Either "gaussian", "binomial", "cox", or "mgaussian", depending on the response. |
eval.metric | The evaluation metric for the cross-validated error and for choosing the optimal lambda: "default", "MAPE" (mean absolute percentage error, linear regression only), "auc", or "class" (misclassification error). |
ncores | The number of cores to use for parallel execution of the cross-validation folds, run on a cluster created by the parallel package. |
... | Additional arguments to biglasso(). |
nfolds | The number of cross-validation folds. Default is 5. |
seed | The seed of the random number generator in order to obtainreproducible results. |
cv.ind | Which fold each observation belongs to. By default, the observations are randomly assigned. |
trace | If set to TRUE, cv.biglasso will inform the user of its progress by announcing the beginning of each CV fold. Default is FALSE. |
grouped | Whether to calculate the CV standard error (cvse) over the CV folds (TRUE, the default) or over all cross-validated predictions. |
Details
The function calls biglasso nfolds times, each time leaving out 1/nfolds of the data. The cross-validation error is based on the residual sum of squares when family = "gaussian" and the binomial deviance when family = "binomial".
The S3 class object cv.biglasso inherits from class ncvreg::cv.ncvreg(), so S3 functions such as "summary" and "plot" can be applied directly to the cv.biglasso object. Custom fold assignments can also be supplied, as in the sketch below.
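A minimal sketch of supplying hand-constructed fold assignments via the cv.ind argument (here the folds are random, but stratified or grouped schemes can be encoded the same way):

library(biglasso)
data(colon)
X.bm <- as.big.matrix(colon$X)
y <- colon$y

# Assign each observation to one of 5 folds by hand
set.seed(1234)
folds <- sample(rep(1:5, length.out = length(y)))
cvfit <- cv.biglasso(X.bm, y, family = 'binomial', cv.ind = folds, ncores = 2)
summary(cvfit)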
Value
An object with S3 class "cv.biglasso", which inherits from class "cv.ncvreg". The following variables are contained in the class (adopted from ncvreg::cv.ncvreg()).
cve | The error for each value of lambda, averaged across the cross-validation folds. |
cvse | The estimated standard error associated with each value of cve. |
lambda | The sequence of regularization parameter values along which the cross-validation error was calculated. |
fit | The fitted biglasso object for the whole data. |
min | The index of lambda corresponding to lambda.min. |
lambda.min | The value of lambda with the minimum cross-validation error. |
lambda.1se | The largest value of lambda whose cross-validation error is within one standard error of the minimum. |
null.dev | The deviance for the intercept-only model. |
pe | If family = "binomial", the cross-validation prediction error (misclassification error) for each value of lambda. |
cv.ind | Same as above. |
Author(s)
Yaohui Zeng and Patrick Breheny
See Also
biglasso(), plot.cv.biglasso(), summary.cv.biglasso(), setupX()
Examples
## Not run:
## cv.biglasso
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X)

## logistic regression
cvfit <- cv.biglasso(X.bm, y, family = 'binomial', seed = 1234, ncores = 2)
par(mfrow = c(2, 2))
plot(cvfit, type = 'all')
summary(cvfit)
## End(Not run)

Internal biglasso functions
Description
Internal biglasso functions
Usage
loss.biglasso(y, yhat, family, eval.metric, grouped = TRUE)

Arguments
y | The observed response vector. |
yhat | The predicted response vector. |
family | Either "gaussian" or "binomial", depending on the response. |
eval.metric | The evaluation metric for the cross-validated error and for choosing the optimal lambda; see cv.biglasso(). |
grouped | Whether to calculate the loss over the entire CV fold (TRUE, the default) or for each observation separately. |
Details
These functions are not intended for use by users. loss.biglasso calculates the value of the loss function for the given predictions (used for cross-validation).
Author(s)
Yaohui Zeng and Patrick Breheny
Plot coefficients from a "biglasso" object
Description
Produce a plot of the coefficient paths for a fitted biglasso() object.
Usage
## S3 method for class 'biglasso'
plot(x, alpha = 1, log.l = TRUE, ...)

Arguments
x | Fitted biglasso model object. |
alpha | Controls alpha-blending, helpful when the number of covariates is large. Default is alpha = 1. |
log.l | Should horizontal axis be on the log scale? Default is TRUE. |
... | Other graphical parameters passed to the plot. |
Author(s)
Yaohui Zeng and Patrick Breheny
See Also
biglasso()
Examples
## See examples in "biglasso"Plots the cross-validation curve from a "cv.biglasso" object
Description
Plot the cross-validation curve from a cv.biglasso() object, along with standard error bars.
Usage
## S3 method for class 'cv.biglasso'
plot(
  x,
  log.l = TRUE,
  type = c("cve", "rsq", "scale", "snr", "pred", "all"),
  selected = TRUE,
  vertical.line = TRUE,
  col = "red",
  ...
)

Arguments
x | A cv.biglasso object. |
log.l | Should horizontal axis be on the log scale? Default is TRUE. |
type | What to plot on the vertical axis: "cve" (cross-validation error), "rsq", "scale", "snr", "pred", or "all" (plot all of them). |
selected | If TRUE (the default), annotate the plot with the number of variables selected at each value of lambda. |
vertical.line | If TRUE (the default), draw a vertical line at the value of lambda for which the cross-validation error is minimized. |
col | Controls the color of the dots (CV estimates). |
... | Other graphical parameters passed to the plot. |
Details
Error bars representing approximate 68% confidence intervals are plotted along with the estimates at each value of lambda. For rsq and snr, these confidence intervals are quite crude, especially near zero.
Author(s)
Yaohui Zeng and Patrick Breheny
See Also
cv.biglasso()
Examples
## See examples in "cv.biglasso"Plot coefficients from a "mbiglasso" object
Description
Produce a plot of the coefficient paths for a fitted multiple-response mbiglasso object.
Usage
## S3 method for class 'mbiglasso'
plot(x, alpha = 1, log.l = TRUE, norm.beta = TRUE, ...)

Arguments
x | Fitted mbiglasso model object. |
alpha | Controls alpha-blending, helpful when the number of covariates is large. Default is alpha = 1. |
log.l | Should horizontal axis be on the log scale? Default is TRUE. |
norm.beta | Should the vertical axis be the l2 norm of the coefficients for each variable? Default is TRUE. If FALSE, the vertical axis shows the coefficients. |
... | Other graphical parameters passed to the plot. |
Author(s)
Chuyi Wang
See Also
biglasso()
Examples
## See examples in "biglasso"Model predictions based on a fittedbiglasso object
Description
Extract predictions (fitted response, coefficients, etc.) from a fitted biglasso() object.
Usage
## S3 method for class 'biglasso'
predict(
  object,
  X,
  row.idx = 1:nrow(X),
  type = c("link", "response", "class", "coefficients", "vars", "nvars"),
  lambda,
  which = 1:length(object$lambda),
  ...
)

## S3 method for class 'mbiglasso'
predict(
  object,
  X,
  row.idx = 1:nrow(X),
  type = c("link", "response", "coefficients", "vars", "nvars"),
  lambda,
  which = 1:length(object$lambda),
  k = 1,
  ...
)

## S3 method for class 'biglasso'
coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)

## S3 method for class 'mbiglasso'
coef(object, lambda, which = 1:length(object$lambda), intercept = TRUE, ...)

Arguments
object | A fitted biglasso or mbiglasso model object. |
X | Matrix of values at which predictions are to be made. It must be a big.matrix object. |
row.idx | Similar to that in biglasso(). |
type | Type of prediction: "link" returns the linear predictors; "response" gives the fitted values; "class" returns the binomial outcome with the highest probability; "coefficients" returns the coefficients; "vars" returns the indices of the nonzero coefficients; "nvars" returns the number of nonzero coefficients at each value of lambda. |
lambda | Values of the regularization parameter lambda at which predictions are requested. |
which | Indices of the penalty parameter lambda at which predictions are requested. If lambda is specified, it overrides which. |
... | Not used. |
k | Index of the response to predict in multiple-response regression (family = "mgaussian"). Default is 1. |
drop | If coefficients for a single value of lambda are to be returned, reduce dimensions to a vector? Setting drop = FALSE returns a 1-column matrix. |
intercept | Whether the intercept should be included in the returned coefficients. For family = "mgaussian" only. |
Value
The object returned depends on type.
Author(s)
Yaohui Zeng and Patrick Breheny
See Also
biglasso(), cv.biglasso()
Examples
## Logistic regression
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X, backingfile = "")
fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial")
coef <- coef(fit, lambda = 0.05, drop = TRUE)
coef[which(coef != 0)]
predict(fit, X.bm, type = "link", lambda = 0.05)[1:10]
predict(fit, X.bm, type = "response", lambda = 0.05)[1:10]
predict(fit, X.bm, type = "class", lambda = 0.1)[1:10]
predict(fit, type = "vars", lambda = c(0.05, 0.1))
predict(fit, type = "nvars", lambda = c(0.05, 0.1))

Model predictions based on a fitted cv.biglasso object
Description
Extract predictions from a fitted cv.biglasso() object.
Usage
## S3 method for class 'cv.biglasso'
predict(
  object,
  X,
  row.idx = 1:nrow(X),
  type = c("link", "response", "class", "coefficients", "vars", "nvars"),
  lambda = object$lambda.min,
  which = object$min,
  ...
)

## S3 method for class 'cv.biglasso'
coef(object, lambda = object$lambda.min, which = object$min, ...)

Arguments
object | A fitted cv.biglasso model object. |
X | Matrix of values at which predictions are to be made. It must be a big.matrix object. |
row.idx | Similar to that in biglasso(). |
type | Type of prediction: "link" returns the linear predictors; "response" gives the fitted values; "class" returns the binomial outcome with the highest probability; "coefficients" returns the coefficients; "vars" returns the indices of the nonzero coefficients; "nvars" returns the number of nonzero coefficients at each value of lambda. |
lambda | Values of the regularization parameter lambda at which predictions are requested. Default is lambda.min, the value selected by cross-validation. |
which | Indices of the penalty parameter lambda at which predictions are requested. Default is the index of lambda.min. |
... | Not used. |
Value
The object returned depends on type.
Author(s)
Yaohui Zeng and Patrick Breheny
See Also
cv.biglasso(), biglasso()
Examples
## Not run:
## predict.cv.biglasso
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X, backingfile = "")
fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial")
cvfit <- cv.biglasso(X.bm, y, penalty = 'lasso', family = "binomial",
                     seed = 1234, ncores = 2)
coef <- coef(cvfit)
coef[which(coef != 0)]
predict(cvfit, X.bm, type = "response")
predict(cvfit, X.bm, type = "link")
predict(cvfit, X.bm, type = "class")
predict(cvfit, X.bm, lambda = "lambda.1se")
## End(Not run)

Set up design matrix X by reading data from big data file
Description
Set up the design matrix X as a big.matrix object based on an external massive data file stored on disk that cannot be fully loaded into memory. The data file must be a well-formatted ASCII file and contain only a single data type; the current version supports only the double type. Other restrictions on the data file are described in biglasso-package. This function reads the massive data and creates a big.matrix object. By default, the resulting big.matrix is file-backed and can be shared across processors or nodes of a cluster.
Usage
setupX(
  filename,
  dir = getwd(),
  sep = ",",
  backingfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".bin"),
  descriptorfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".desc"),
  type = "double",
  ...
)

Arguments
filename | The name of the data file. For example, "dat.txt". |
dir | The directory used to store the binary and descriptor files associated with the big.matrix. Default is the current working directory. |
sep | The field separator character. For example, "," forcomma-delimited files (the default); "\t" for tab-delimited files. |
backingfile | The binary file associated with the file-backed big.matrix. By default, it is the name of the data file with the extension replaced by ".bin". |
descriptorfile | The descriptor file used for the description of the file-backed big.matrix. By default, it is the name of the data file with the extension replaced by ".desc". |
type | The data type. Only "double" is supported for now. |
... | Additional arguments that can be passed into the function bigmemory::read.big.matrix(). |
Details
For a given data set, this function needs to be called only once to set up the big.matrix object, with two backing files (.bin, .desc) created in the current working directory. Once set up, the data can be "loaded" into any (new) R session by calling attach.big.matrix(descriptorfile); see the sketch below.
This function is a simple wrapper of bigmemory::read.big.matrix(). See bigmemory for more details.
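A minimal end-to-end sketch of this workflow (the file name dat.csv is hypothetical): write a numeric table to disk, set it up once with setupX(), and re-attach it later from the descriptor file:

library(biglasso)

# One-time setup: creates dat.bin and dat.desc in the working directory
X <- matrix(rnorm(50 * 10), 50, 10)
write.table(X, "dat.csv", sep = ",", row.names = FALSE, col.names = FALSE)
X.bm <- setupX("dat.csv")
dim(X.bm)

# In any new R session, re-attach via the descriptor file
X.bm <- attach.big.matrix("dat.desc")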
Value
A big.matrix object corresponding to a file-backed bigmemory::big.matrix(). It is ready to be used as the design matrix X in biglasso() and cv.biglasso().
Author(s)
Yaohui Zeng and Patrick Breheny
See Also
biglasso(), ncvreg::cv.ncvreg(), biglasso-package
Examples
## see the example in "biglasso-package"

Summarizing inferences based on cross-validation
Description
Summary method for cv.biglasso objects.
Usage
## S3 method for class 'cv.biglasso'
summary(object, ...)

## S3 method for class 'summary.cv.biglasso'
print(x, digits, ...)

Arguments
object | A cv.biglasso object. |
... | Further arguments passed to or from other methods. |
x | A summary.cv.biglasso object. |
digits | Number of digits past the decimal point to print out. Can be a vector specifying different display digits for each of the five non-integer printed values. |
Value
summary.cv.biglasso produces an object with S3 class "summary.cv.biglasso". The class has its own print method and contains the following list elements:
penalty | The penalty used by the fitted biglasso model. |
model | Either "linear" or "logistic", depending on the family of the fitted model. |
n | Number of observations |
p | Number of regression coefficients (not including the intercept). |
min | The index of lambda with the smallest cross-validation error. |
lambda | The sequence of lambda values used for cross-validation. |
cve | Cross-validation error (deviance). |
r.squared | Proportion of variance explained by the model, as estimated by cross-validation. |
snr | Signal to noise ratio, as estimated by cross-validation. |
sigma | For linear regression models, the scale parameter estimate. |
pe | For logistic regression models, the prediction error (misclassification error). |
Author(s)
Yaohui Zeng and Patrick Breheny
See Also
biglasso(), cv.biglasso(), plot.cv.biglasso(), biglasso-package
Examples
## See examples in "cv.biglasso" and "biglasso-package"