
Version:

1.6.1

Date:

2025-03-05

Title:

Extending Lasso Model Fitting to Big Data

Maintainer:

Patrick Breheny <patrick-breheny@uiowa.edu>

Description:

Extend lasso and elastic-net model fitting for large data sets that cannot be loaded into memory. Designed to be more memory- and computation-efficient than existing lasso-fitting packages like 'glmnet' and 'ncvreg', thus allowing the user to analyze big data with limited RAM <doi:10.32614/RJ-2021-001>.

License:

GPL-3

URL:

https://pbreheny.github.io/biglasso/, https://github.com/pbreheny/biglasso

BugReports:

https://github.com/pbreheny/biglasso/issues

Depends:

R (≥ 3.2.0), bigmemory (≥ 4.5.0), Matrix, ncvreg

Imports:

Rcpp (≥ 0.12.1), methods

LinkingTo:

Rcpp, RcppArmadillo (≥ 0.8.600), bigmemory, BH

VignetteBuilder:

knitr

Suggests:

glmnet, knitr, parallel, rmarkdown, survival, tinytest

RoxygenNote:

7.3.2

Encoding:

UTF-8

NeedsCompilation:

yes

Packaged:

2025-03-05 17:32:37 UTC; pbreheny

Author:

Yaohui Zeng [aut], Chuyi Wang [aut], Tabitha Peter [aut], Patrick Breheny [aut, cre]

Repository:

CRAN

Date/Publication:

2025-03-05 18:10:14 UTC

Extending Lasso Model Fitting to Big Data

Description

Extend lasso and elastic-net linear, logistic and Cox regression models for ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into available RAM. This package uses memory-mapped files to store the massive data on disk and reads them into memory only when necessary during model fitting. Moreover, some advanced feature screening rules are proposed and implemented to accelerate the model fitting. As a result, this package is much more memory- and computation-efficient and highly scalable compared to existing lasso-fitting packages such as glmnet and ncvreg, thus allowing for powerful big data analysis even with only an ordinary laptop.

Details

Package:	biglasso
Type:	Package
Version:	1.4-1
Date:	2021-01-29
License:	GPL-3

Penalized regression models, in particular the lasso, have been extensively applied to analyzing high-dimensional data sets. However, due to memory limits, existing R packages are not capable of fitting lasso models for ultrahigh-dimensional, multi-gigabyte data sets, which are increasingly seen in many areas such as genetics, biomedical imaging, genome sequencing and high-frequency finance.

This package aims to fill the gap by extending lasso model fitting to Big Data in R. Version >= 1.2-3 represents a major redesign where the source code is converted into C++ (previously in C), and new feature screening rules, as well as OpenMP parallel computing, are implemented. Some key features of biglasso are summarized below:

It utilizes memory-mapped files to store the massive data on the disk, only loading data into memory when necessary during model fitting. Consequently, it's able to seamlessly handle data-larger-than-RAM cases.
It is built upon a pathwise coordinate descent algorithm with warm start, active set cycling, and feature screening strategies, which has been proven to be one of the fastest lasso solvers.
It incorporates our newly developed hybrid and adaptive screening rules that outperform state-of-the-art screening rules such as the sequential strong rule (SSR) and the sequential EDPP rule (SEDPP) with an additional 1.5x to 4x speedup.
The implementation is designed to be as memory-efficient as possible by eliminating extra copies of the data created by other R packages, making it at least 2x more memory-efficient than glmnet.
The underlying computation is implemented in C++, and parallel computing with OpenMP is also supported.

For more information:

Benchmarking results: https://github.com/pbreheny/biglasso
Tutorial: https://pbreheny.github.io/biglasso/articles/biglasso.html
Technical paper: https://arxiv.org/abs/1701.05936

Note

The input design matrix X must be a bigmemory::big.matrix() object. This can be created by the function as.big.matrix in the R package bigmemory. If the data (design matrix) is very large (e.g. 10 GB) and stored in an external file, which is often the case for big data, X can be created by calling the function setupX(). In this case, there are several restrictions about the data file:

the data file must be a well-formatted ASCII file, with each row corresponding to an observation and each column a variable;
the data file must contain only one single type. The current version only supports the double type;
the data file must contain only numeric variables. If there are categorical variables, the user needs to create dummy variables for each categorical variable (by adding additional columns).

Future versions will try to address these restrictions.
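The restrictions above imply that categorical columns have to be expanded into dummy variables before the file is written. A minimal base-R sketch of that preparation follows; the data frame `dat` and the temporary output file are hypothetical and not part of the package.

```r
# Hypothetical raw data with one categorical column (not from the package).
dat <- data.frame(
  age   = c(34, 51, 27, 45),
  group = factor(c("a", "b", "c", "b"))
)

# model.matrix() expands the factor into 0/1 dummy columns;
# "- 1" drops the intercept column, since the design matrix should not contain one.
X_num <- model.matrix(~ . - 1, data = dat)
storage.mode(X_num) <- "double"   # a single numeric type throughout

# Write a plain tab-delimited ASCII file: one row per observation,
# one column per variable, no header and no row names.
out_file <- tempfile(fileext = ".txt")
write.table(X_num, out_file, sep = "\t", row.names = FALSE, col.names = FALSE)

# Read it back to confirm every field is numeric.
check <- as.matrix(read.table(out_file, sep = "\t"))
```

A file prepared this way could then be handed to setupX(out_file, sep = '\t'), as in the package example further down.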

Denote the number of observations and variables by n and p, respectively. It's worth noting that the package is more suitable for wide data (ultrahigh-dimensional, p >> n) than for long data (n >> p). This is because the model fitting algorithm takes advantage of the sparsity assumption of high-dimensional data. To give the user some idea, below are some benchmarking results of the total computing time (in seconds) for solving lasso-penalized linear regression along a sequence of 100 values of the tuning parameter. In all cases, assume 20 non-zero coefficients equal to +/- 2 in the true model. (Based on Version 1.2-3; screening rule "SSR-BEDPP" is used.)

For wide data case (p > n), n = 1,000:

p	1,000	10,000	100,000	1,000,000
Size of X	9.5 MB	95 MB	950 MB	9.5 GB
Elapsed time (s)	0.11	0.83	8.47	85.50

For long data case (n >> p), p = 1,000:

n	1,000	10,000	100,000	1,000,000
Size of X	9.5 MB	95 MB	950 MB	9.5 GB
Elapsed time (s)	2.50	11.43	83.69	1090.62

Author(s)

Yaohui Zeng, Chuyi Wang, Tabitha Peter, and Patrick Breheny

References

Zeng Y and Breheny P. (2021) The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. R Journal, 12: 6-19. doi:10.32614/RJ-2021-001
Wang C and Breheny P. (2022) Adaptive hybrid screening for efficient lasso optimization. Journal of Statistical Computation and Simulation, 92: 2233-2256. doi:10.1080/00949655.2021.2025376
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2), 245-266.
Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pp. 1070-1078.
Xiang, Z. J., and Ramadge, P. J. (2012). Fast lasso screening tests based on correlations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 2137-2140). IEEE.
Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pp. 1053-1061.

Examples

## Not run:
## Example of reading data from external big data file, fit lasso model,
## and run cross validation in parallel

# simulated design matrix, 1000 observations, 500,000 variables, ~ 5GB
# there are 10 true variables with non-zero coefficient 2.
xfname <- 'x_e3_5e5.txt'
yfname <- 'y_e3_5e5.txt' # response vector
time <- system.time(
  X <- setupX(xfname, sep = '\t') # create backing files (.bin, .desc)
)
print(time) # ~ 7 minutes; this is just one-time operation
dim(X)

# the big.matrix then can be retrieved by its descriptor file (.desc) in any new R session.
rm(X)
xdesc <- 'x_e3_5e5.desc'
X <- attach.big.matrix(xdesc)
dim(X)

y <- as.matrix(read.table(yfname, header = FALSE))
time.fit <- system.time(
  fit <- biglasso(X, y, family = 'gaussian', screen = 'Hybrid')
)
print(time.fit) # ~ 44 seconds for fitting a lasso model along the entire solution path

# cross validation in parallel
seed <- 1234
time.cvfit <- system.time(
  cvfit <- cv.biglasso(X, y, family = 'gaussian', screen = 'Hybrid',
                       seed = seed, ncores = 4, nfolds = 10)
)
print(time.cvfit) # ~ 3 minutes for 10-fold cross validation
plot(cvfit)
summary(cvfit)

## End(Not run)

Fit lasso penalized regression path for big data

Description

Extend lasso model fitting to big data that cannot be loaded into memory. Fit solution paths for linear, logistic or Cox regression models penalized by lasso, ridge, or elastic-net over a grid of values for the regularization parameter lambda.

Usage

biglasso(
  X,
  y,
  row.idx = 1:nrow(X),
  penalty = c("lasso", "ridge", "enet"),
  family = c("gaussian", "binomial", "cox", "mgaussian"),
  alg.logistic = c("Newton", "MM"),
  screen = c("Adaptive", "SSR", "Hybrid", "None"),
  safe.thresh = 0,
  update.thresh = 1,
  ncores = 1,
  alpha = 1,
  lambda.min = ifelse(nrow(X) > ncol(X), 0.001, 0.05),
  nlambda = 100,
  lambda.log.scale = TRUE,
  lambda,
  eps = 1e-07,
  max.iter = 1000,
  dfmax = ncol(X) + 1,
  penalty.factor = rep(1, ncol(X)),
  warn = TRUE,
  output.time = FALSE,
  return.time = TRUE,
  verbose = FALSE
)

Arguments

X

The design matrix, without an intercept. It must be a double type bigmemory::big.matrix() object. The function standardizes the data and includes an intercept internally by default during the model fitting.

y

The response vector for family="gaussian" or family="binomial". For family="cox", y should be a two-column matrix with columns 'time' and 'status'. The latter is a binary variable, with '1' indicating death, and '0' indicating right censored. For family="mgaussian", y should be an n*m matrix where n is the sample size and m is the number of responses.

row.idx

The integer vector of row indices of X that are used for fitting the model. 1:nrow(X) by default.

penalty

The penalty to be applied to the model. Either "lasso" (the default), "ridge", or "enet" (elastic net).

family

Either "gaussian", "binomial", "cox" or "mgaussian" depending on the response.

alg.logistic

The algorithm used in logistic regression. If "Newton" then the exact Hessian is used (default); if "MM" then a majorization-minimization algorithm is used to set an upper bound on the Hessian matrix. This can be faster, particularly in the data-larger-than-RAM case.

screen

The feature screening rule used at each lambda that discards features to speed up computation: "SSR" (default if penalty="ridge" or penalty="enet") is the sequential strong rule; "Hybrid" is our newly proposed hybrid screening rule, which combines the strong rule with a safe rule; "Adaptive" (default for penalty="lasso" without penalty.factor) is our newly proposed adaptive rule, which reuses the screening reference for multiple lambda values. Note that: (1) for linear regression with elastic net penalty, both "SSR" and "Hybrid" are applicable since version 1.3-0; (2) only "SSR" is applicable to elastic-net-penalized logistic regression or Cox regression; (3) active set cycling strategy is incorporated with these screening rules.

safe.thresh

The threshold value between 0 and 1 that controls when to stop the safe test. For example, 0.01 means to stop the safe test at the next lambda iteration if the number of features rejected by the safe test at the current lambda iteration is not larger than 1%. A value of 1 means to always turn off the safe test, whereas 0 (default) means to turn off the safe test if the number of features rejected by the safe test is 0 at the current lambda.

update.thresh

The non-negative threshold value that controls how often to update the reference of safe rules for "Adaptive" methods. A smaller value means updating more often.

ncores

The number of OpenMP threads used for parallel computing.

alpha

The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as

\alpha||\beta||_1 + (1-\alpha)/2||\beta||_2^2.

alpha=1 is the lasso penalty, alpha=0 the ridge penalty, alpha in between 0 and 1 is the elastic-net ("enet") penalty.

lambda.min

The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise.

nlambda

The number of lambda values. Default is 100.

lambda.log.scale

Whether to compute the grid values of lambda on the log scale (default) or the linear scale.

lambda

A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale.

eps

Convergence threshold for inner coordinate descent. The algorithm iterates until the maximum change in the objective after any coefficient update is less than eps times the null deviance. Default value is 1e-7.

max.iter

Maximum number of iterations. Default is 1000.

dfmax

Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients.

penalty.factor

A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty.factor must be a numeric vector of length equal to the number of columns of X. The purpose of penalty.factor is to apply differential penalization if some coefficients are thought to be more likely than others to be in the model. The current package doesn't allow unpenalized coefficients; that is, penalty.factor cannot be 0. penalty.factor is only supported for the "SSR" screen.

warn

Return warning messages for failures to converge and model saturation? Default is TRUE.

output.time

Whether to print out the start and end time of the model fitting. Default is FALSE.

return.time

Whether to return the computing time of the model fitting. Default is TRUE.

verbose

Whether to output the timing of each lambda iteration. Default is FALSE.
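The interplay of lambda.min, nlambda and lambda.log.scale can be illustrated with a small base-R sketch. It assumes the usual definition of lambda.max for lasso-penalized linear regression (the smallest lambda at which every coefficient is zero, computed on standardized columns); that definition and the simulated data are assumptions made for illustration, not code taken from the package.

```r
# Sketch of a lambda grid for lasso-penalized linear regression.
# Assumption: lambda.max = max_j |x_j' (y - mean(y))| / n for standardized columns x_j.
set.seed(1)
n <- 50; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# Standardize columns (centered, unit variance with denominator n).
Xc <- sweep(X, 2, colMeans(X))
Xs <- sweep(Xc, 2, sqrt(colMeans(Xc^2)), "/")

lambda.max <- max(abs(crossprod(Xs, y - mean(y)))) / n
lambda.min <- ifelse(n > p, 0.001, 0.05)   # same default rule as in Usage
nlambda    <- 100

# Equally spaced on the log scale (lambda.log.scale = TRUE) ...
lambda_log <- exp(seq(log(lambda.max), log(lambda.min * lambda.max),
                      length.out = nlambda))
# ... or on the linear scale (lambda.log.scale = FALSE).
lambda_lin <- seq(lambda.max, lambda.min * lambda.max, length.out = nlambda)
```

Both grids run from lambda.max down to lambda.min * lambda.max; the log-scale grid places more points near the small-lambda end, where the solution path tends to change fastest.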

Details

The objective function for linear regression or multiple responses linear regression (family = "gaussian" or family = "mgaussian") is

\frac{1}{2n}\textrm{RSS} + \lambda*\textrm{penalty},

where for family = "mgaussian", a group-lasso type penalty is applied. For logistic regression (family = "binomial") it is

-\frac{1}{n}\textrm{loglike} + \lambda*\textrm{penalty}.

For Cox regression, the Breslow approximation for ties is applied.
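To make the objective functions concrete, the following base-R sketch evaluates them directly from the formulas above, using the elastic-net penalty given under the alpha argument. The function names are hypothetical helpers written for illustration; they are not exported by the package and say nothing about how the C++ code computes these quantities.

```r
# Elastic-net penalty: alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2
enet_penalty <- function(beta, alpha = 1) {
  alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2)
}

# Gaussian objective: RSS / (2n) + lambda * penalty
objective_gaussian <- function(X, y, beta, intercept, lambda, alpha = 1) {
  rss <- sum((y - intercept - X %*% beta)^2)
  rss / (2 * length(y)) + lambda * enet_penalty(beta, alpha)
}

# Binomial objective: -loglik / n + lambda * penalty, with y coded 0/1
objective_binomial <- function(X, y, beta, intercept, lambda, alpha = 1) {
  eta <- as.numeric(intercept + X %*% beta)
  loglik <- sum(y * eta - log1p(exp(eta)))
  -loglik / length(y) + lambda * enet_penalty(beta, alpha)
}
```

With alpha = 1 the penalty reduces to the l1 norm (lasso), and with alpha = 0 to half the squared l2 norm (ridge).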

Several advanced feature screening rules are implemented. For lasso-penalized linear regression, all the options of screen are applicable. Our proposed adaptive rule, "Adaptive", achieves the highest speedup, so it's the recommended one, especially for ultrahigh-dimensional large-scale data sets. For Cox regression and/or the elastic net penalty, only "SSR" is applicable for now. More efficient rules are under development.
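As an illustration of what a screening rule does, here is a base-R sketch of the sequential strong rule from Tibshirani et al. (2012), written for the objective RSS/(2n) + lambda * ||beta||_1. The function name `ssr_keep` is hypothetical, and the sketch covers only the textbook rule for the lasso, not the package's hybrid or adaptive rules.

```r
# Sequential strong rule for the lasso.
# r_prev: residuals at the previous lambda value (lambda_prev).
# Returns TRUE for features that survive screening at lambda_new;
# features with |x_j' r_prev| / n < 2 * lambda_new - lambda_prev are discarded.
ssr_keep <- function(X, r_prev, lambda_prev, lambda_new) {
  z <- abs(as.numeric(crossprod(X, r_prev))) / nrow(X)
  z >= 2 * lambda_new - lambda_prev
}

# Tiny worked example: feature 1 is correlated with the residuals, feature 2 is not.
X <- cbind(c(1, -1, 1, -1), c(1, 1, -1, -1))
r_prev <- c(2, -2, 2, -2)
keep <- ssr_keep(X, r_prev, lambda_prev = 2, lambda_new = 1.5)
```

The strong rule is a heuristic and can occasionally discard a feature that belongs in the model, so solvers that use it check the optimality (KKT) conditions for the discarded features afterwards and re-add any violators.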

Value

An object with S3 class "biglasso" for "gaussian", "binomial", "cox" families, or an object with S3 class "mbiglasso" for "mgaussian" family, with the following variables.

beta

The fitted matrix of coefficients, stored in sparse matrix representation. The number of rows is equal to the number of coefficients, whereas the number of columns is equal to nlambda. For the "mgaussian" family with m responses, it is a list of m such matrices.

iter

A vector of length nlambda containing the number of iterations until convergence at each value of lambda.

lambda

The sequence of regularization parameter values in the path.

penalty

Same as above.

family

Same as above.

alpha

Same as above.

loss

A vector containing either the residual sum of squares (for "gaussian", "mgaussian") or negative log-likelihood (for "binomial", "cox") of the fitted model at each value of lambda.

penalty.factor

Same as above.

n

The number of observations used in the model fitting. It's equal to length(row.idx).

center

The sample mean vector of the variables, i.e., column mean of the sub-matrix of X used for model fitting.

scale

The sample standard deviation of the variables, i.e., column standard deviation of the sub-matrix of X used for model fitting.

y

The response vector used in the model fitting. Depending on row.idx, it could be a subset of the raw input of the response vector y.

screen

Same as above.

col.idx

The indices of features that have 'scale' value greater than 1e-6. Features with 'scale' less than 1e-6 are removed from model fitting.

rejections

The number of features rejected at each value of lambda.

safe_rejections

The number of features rejected by safe rules at each value of lambda.
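The relationship between row.idx, center, scale and col.idx can be sketched in base R as follows. The use of denominator n for the standard deviation is an assumption made here for illustration; the returned scale values are the authoritative ones.

```r
set.seed(2)
# Three random columns plus one column with no variation at all.
X <- cbind(matrix(rnorm(20 * 3), 20, 3), constant = 5)
row.idx <- 1:15                        # rows used for fitting

Xsub   <- X[row.idx, , drop = FALSE]   # sub-matrix actually used
center <- colMeans(Xsub)
# Assumption: standard deviation with denominator n (not n - 1).
scale  <- sqrt(colMeans(sweep(Xsub, 2, center)^2))

# Near-constant features are dropped from model fitting.
col.idx <- which(scale > 1e-6)

# Standardized design restricted to the retained columns.
Xstd <- sweep(sweep(Xsub[, col.idx, drop = FALSE], 2, center[col.idx]),
              2, scale[col.idx], "/")
```

Here the constant column has scale 0 and is excluded, so col.idx contains only the first three columns.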

Author(s)

Yaohui Zeng, Chuyi Wang and Patrick Breheny

References

Zeng Y and Breheny P. (2021) The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. R Journal, 12: 6-19. doi:10.32614/RJ-2021-001

Examples

## Linear regression
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X)
# lasso, default
par(mfrow = c(1, 2))
fit.lasso <- biglasso(X.bm, y, family = 'gaussian')
plot(fit.lasso, log.l = TRUE, main = 'lasso')
# elastic net
fit.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = 'gaussian')
plot(fit.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5')

## Logistic regression
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X)
# lasso, default
par(mfrow = c(1, 2))
fit.bin.lasso <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial")
plot(fit.bin.lasso, log.l = TRUE, main = 'lasso')
# elastic net
fit.bin.enet <- biglasso(X.bm, y, penalty = 'enet', alpha = 0.5, family = "binomial")
plot(fit.bin.enet, log.l = TRUE, main = 'elastic net, alpha = 0.5')

## Cox regression
set.seed(10101)
N <- 1000; p <- 30; nzc <- p/3
X <- matrix(rnorm(N * p), N, p)
beta <- rnorm(nzc)
fx <- X[, seq(nzc)] %*% beta/3
hx <- exp(fx)
ty <- rexp(N, hx)
tcens <- rbinom(n = N, prob = 0.3, size = 1)  # censoring indicator
y <- cbind(time = ty, status = 1 - tcens)  # y <- Surv(ty, 1 - tcens) with library(survival)
X.bm <- as.big.matrix(X)
fit <- biglasso(X.bm, y, family = "cox")
plot(fit, main = "cox")

## Multiple responses linear regression
set.seed(10101)
n <- 300; p <- 300; m <- 5; s <- 10; b <- 1
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(seq(from = -b, to = b, length.out = s * m), s, m)
y <- x[, 1:s] %*% beta + matrix(rnorm(n * m, 0, 1), n, m)
x.bm <- as.big.matrix(x)
fit <- biglasso(x.bm, y, family = "mgaussian")
plot(fit, main = "mgaussian")

Direct interface to biglasso fitting, no preprocessing

Description

This function is intended for users who know exactly what they're doing and want complete control over the fitting process. It

does NOT add an intercept
does NOT standardize the design matrix
does NOT set up a path for lambda (the lasso tuning parameter)

All of the above are critical steps in data analysis. However, a direct API has been provided for use in situations where the lasso fitting process is an internal component of a more complicated algorithm and standardization must be handled externally.

Usage

biglasso_fit(
  X,
  y,
  r,
  init = rep(0, ncol(X)),
  xtx,
  penalty = "lasso",
  lambda,
  alpha = 1,
  gamma,
  ncores = 1,
  max.iter = 1000,
  eps = 1e-05,
  dfmax = ncol(X) + 1,
  penalty.factor = rep(1, ncol(X)),
  warn = TRUE,
  output.time = FALSE,
  return.time = TRUE
)

Arguments

X

The design matrix, without an intercept. It must be a double type bigmemory::big.matrix() object.

y

The response vector

r

Residuals (length n vector) corresponding to init. WARNING: If you supply an incorrect value of r, the solution will be incorrect.

init

Initial values for beta. Default: zero (length p vector)

xtx

X scales: the jth element should equal crossprod(X[,j])/n. In particular, if X is standardized, one should pass xtx = rep(1, p). WARNING: If you supply an incorrect value of xtx, the solution will be incorrect. (length p vector)

penalty

String specifying which penalty to use. Default is 'lasso'. Other options are 'SCAD' and 'MCP' (the latter are non-convex).

lambda

A single value for the lasso tuning parameter.

alpha

The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as:

\alpha||\beta||_1 + (1-\alpha)/2||\beta||_2^2.

alpha=1 is the lasso penalty, alpha=0 the ridge penalty, alpha in between 0 and 1 is the elastic-net ("enet") penalty.

gamma

Tuning parameter value for nonconvex penalty. Defaults are 3.7 for penalty = 'SCAD' and 3 for penalty = 'MCP'.

ncores

The number of OpenMP threads used for parallel computing.

max.iter

Maximum number of iterations. Default is 1000.

eps

Convergence threshold for inner coordinate descent. The algorithm iterates until the maximum change in the objective after any coefficient update is less than eps times the null deviance. Default value is 1e-5.

dfmax

Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients.

penalty.factor

A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty.factor must be a numeric vector of length equal to the number of columns of X.

warn

Return warning messages for failures to converge and model saturation? Default is TRUE.

output.time

Whether to print out the start and end time of the model fitting. Default is FALSE.

return.time

Whether to return the computing time of the model fitting. Default is TRUE.
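Because incorrect values of r and xtx silently produce an incorrect solution, it can help to compute both explicitly from X, y and init. A minimal base-R sketch on simulated data (an ordinary matrix is used here purely for illustration; the function itself expects a big.matrix):

```r
set.seed(3)
n <- 30; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

init <- rep(0, p)                    # starting coefficients
r    <- as.numeric(y - X %*% init)   # residuals consistent with init (equals y when init is 0)
xtx  <- colSums(X^2) / n             # jth element equals crossprod(X[, j]) / n

# For a standardized matrix (centered, variance 1 with denominator n), xtx is all ones.
Xc <- sweep(X, 2, colMeans(X))
Xs <- sweep(Xc, 2, sqrt(colMeans(Xc^2)), "/")
xtx_std <- colSums(Xs^2) / n
```

If init is nonzero, for example when warm-starting from a previous solution, r must be recomputed as y - X %*% init rather than passed as y.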

Details

Note:

Hybrid safe-strong rules are turned off for biglasso_fit(), as these rely on standardization.
Currently, the function only works with linear regression (family = 'gaussian').
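To show why r and xtx matter, here is a textbook cyclic coordinate descent for the lasso without standardization, written in base R. It is a sketch of the general technique under simplifying assumptions (lasso penalty only, no intercept, no screening, convergence judged by the largest coefficient change rather than the change in the objective); `cd_lasso` and `soft_threshold` are hypothetical names, and this is not the package's C++ implementation.

```r
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

cd_lasso <- function(X, y, lambda, init = rep(0, ncol(X)),
                     max.iter = 1000, eps = 1e-8) {
  n <- nrow(X)
  beta <- init
  r <- as.numeric(y - X %*% beta)   # residuals must match the starting coefficients
  xtx <- colSums(X^2) / n           # same quantity as the xtx argument
  for (iter in seq_len(max.iter)) {
    max_change <- 0
    for (j in seq_len(ncol(X))) {
      # Partial-residual correlation for feature j
      z <- sum(X[, j] * r) / n + xtx[j] * beta[j]
      b_new <- soft_threshold(z, lambda) / xtx[j]
      delta <- b_new - beta[j]
      if (delta != 0) {
        r <- r - delta * X[, j]     # keep residuals in sync with beta
        beta[j] <- b_new
        max_change <- max(max_change, abs(delta))
      }
    }
    if (max_change < eps) break
  }
  list(beta = beta, resid = r, iter = iter)
}

set.seed(4)
X <- matrix(rnorm(200 * 5), 200, 5)
y <- 2 * X[, 1] + rnorm(200)
fit <- cd_lasso(X, y, lambda = 0.1)
```

Every update reads r and divides by xtx[j], so a residual vector that does not correspond to init, or a wrong scale, propagates into every coefficient. At convergence the absolute value of crossprod(X, fit$resid) / n does not exceed lambda for any feature, which is the optimality condition for the lasso.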

Value

An object with S3 class "biglasso" with the following variables.

beta

The vector of estimated coefficients

iter

A vector of length nlambda containing the number of iterations until convergence

resid

Vector of residuals calculated from estimated coefficients.

lambda

The sequence of regularization parameter values in the path.

alpha

Same as in biglasso()

loss

A vector containing the residual sum of squares of the fitted model at each value of lambda.

penalty.factor

Same as in biglasso().

n

The number of observations used in the model fitting.

y

The response vector used in the model fitting.

Author(s)

Tabitha Peter and Patrick Breheny

Examples

data(Prostate)
X <- cbind(1, Prostate$X)
xtx <- apply(X, 2, crossprod)/nrow(X)
y <- Prostate$y
X.bm <- as.big.matrix(X)
init <- rep(0, ncol(X))
fit <- biglasso_fit(X = X.bm, y = y, r = y, init = init, xtx = xtx,
  lambda = 0.1, penalty.factor = c(0, rep(1, ncol(X)-1)), max.iter = 10000)
fit$beta

fit <- biglasso_fit(X = X.bm, y = y, r = y, init = init, xtx = xtx, penalty = 'MCP',
  lambda = 0.1, penalty.factor = c(0, rep(1, ncol(X)-1)), max.iter = 10000)
fit$beta

Direct interface to biglasso fitting, no preprocessing, path version

Description

This function is intended for users who know exactly what they're doing and want complete control over the fitting process. It

does NOT add an intercept
does NOT standardize the design matrix

Both of the above are critical steps in data analysis. However, a direct API has been provided for use in situations where the lasso fitting process is an internal component of a more complicated algorithm and standardization must be handled externally.

Usage

biglasso_path(
  X,
  y,
  r,
  init = rep(0, ncol(X)),
  xtx,
  penalty = "lasso",
  lambda,
  alpha = 1,
  gamma,
  ncores = 1,
  max.iter = 1000,
  eps = 1e-05,
  dfmax = ncol(X) + 1,
  penalty.factor = rep(1, ncol(X)),
  warn = TRUE,
  output.time = FALSE,
  return.time = TRUE
)

Arguments

X

The design matrix, without an intercept. It must be a double type bigmemory::big.matrix() object.

y

The response vector

r

Residuals (length n vector) corresponding to init. WARNING: If you supply an incorrect value of r, the solution will be incorrect.

init

Initial values for beta. Default: zero (length p vector)

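xtx

X scales, as in biglasso_fit(): the jth element should equal crossprod(X[,j])/n. (length p vector)

penalty

String specifying which penalty to use. Default is 'lasso'. Other options are 'SCAD' and 'MCP' (the latter are non-convex).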

lambda

A numeric vector of values for the lasso tuning parameter.

alpha

The elastic-net mixing parameter that controls the relative contribution from the lasso (l1) and the ridge (l2) penalty. The penalty is defined as:

\alpha||\beta||_1 + (1-\alpha)/2||\beta||_2^2.

alpha=1 is the lasso penalty, alpha=0 the ridge penalty, alpha in between 0 and 1 is the elastic-net ("enet") penalty.

gamma

Tuning parameter value for nonconvex penalty. Defaults are 3.7 for penalty = 'SCAD' and 3 for penalty = 'MCP'.

ncores

The number of OpenMP threads used for parallel computing.

max.iter

Maximum number of iterations. Default is 1000.

eps

Convergence threshold for inner coordinate descent. The algorithm iterates until the maximum change in the objective after any coefficient update is less than eps times the null deviance. Default value is 1e-5.

dfmax

Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients.

penalty.factor

A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty.factor must be a numeric vector of length equal to the number of columns of X.

warn

Return warning messages for failures to converge and model saturation? Default is TRUE.

output.time

Whether to print out the start and end time of the model fitting. Default is FALSE.

return.time

Whether to return the computing time of the model fitting. Default is TRUE.

Details

biglasso_path() works identically to biglasso_fit() except it offers the additional option of fitting models across a path of tuning parameter values.

Note:

Hybrid safe-strong rules are turned off for biglasso_fit(), as these rely on standardization.
Currently, the function only works with linear regression (family = 'gaussian').

Value

An object with S3 class "biglasso" with the following variables.

beta

A sparse matrix in which each row contains the estimates of a given coefficient across all values of lambda

iter

A vector of length nlambda containing the number of iterations until convergence

resid

Vector of residuals calculated from estimated coefficients.

lambda

The sequence of regularization parameter values in the path.

alpha

Same as in biglasso()

loss

A vector containing the residual sum of squares of the fitted model at each value of lambda.

penalty.factor

Same as in biglasso().

n

The number of observations used in the model fitting.

y

The response vector used in the model fitting.

Author(s)

Tabitha Peter and Patrick Breheny

Examples

data(Prostate)
X <- cbind(1, Prostate$X)
xtx <- apply(X, 2, crossprod)/nrow(X)
y <- Prostate$y
X.bm <- as.big.matrix(X)
init <- rep(0, ncol(X))
fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx,
  lambda = c(0.5, 0.1, 0.05, 0.01, 0.001),
  penalty.factor = c(0, rep(1, ncol(X)-1)), max.iter = 2000)
fit$beta

fit <- biglasso_path(X = X.bm, y = y, r = y, init = init, xtx = xtx,
  lambda = c(0.5, 0.1, 0.05, 0.01, 0.001), penalty = 'MCP',
  penalty.factor = c(0, rep(1, ncol(X)-1)), max.iter = 2000)
fit$beta

Gene expression data from colon-cancer patients

Description

The data file contains gene expression data of 62 samples (40 tumor samples, 22 normal samples) from colon-cancer patients analyzed with an Affymetrix oligonucleotide Hum6000 array.

Format

A list of 2 variables included in colon:

X: a 62-by-2000 matrix that records the gene expression data. Used as design matrix.
y: a binary vector of length 62 recording the sample status: 1 = tumor; 0 = normal. Used as response vector.

Source

The raw data can be found on Bioconductor: https://bioconductor.org/packages/release/data/experiment/html/colonCA.html.

References

U. Alon et al. (1999): Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745-6750. https://www.pnas.org/doi/abs/10.1073/pnas.96.12.6745.

Examples

data(colon)
X <- colon$X
y <- colon$y
str(X)
dim(X)
X.bm <- as.big.matrix(X, backingfile = "") # convert to big.matrix object
str(X.bm)
dim(X.bm)

Cross-validation for biglasso

Description

Perform k-fold cross validation for penalized regression models over a grid of values for the regularization parameter lambda.

Usage

cv.biglasso(
  X,
  y,
  row.idx = 1:nrow(X),
  family = c("gaussian", "binomial", "cox", "mgaussian"),
  eval.metric = c("default", "MAPE", "auc", "class"),
  ncores = parallel::detectCores(),
  ...,
  nfolds = 5,
  seed,
  cv.ind,
  trace = FALSE,
  grouped = TRUE
)

Arguments

X

The design matrix, without an intercept, as in biglasso().

y

The response vector, as in biglasso().

row.idx

The integer vector of row indices of X that are used for fitting the model, as in biglasso().

family

Either "gaussian", "binomial", "cox" or "mgaussian" depending on the response. "cox" and "mgaussian" are not supported yet.

eval.metric

The evaluation metric for the cross-validated error and for choosing optimal lambda. "default" for linear regression is MSE (mean squared error), for logistic regression is binomial deviance. "MAPE", for linear regression only, is the Mean Absolute Percentage Error. "auc", for binary classification, is the area under the receiver operating characteristic curve (ROC). "class", for binary classification, gives the misclassification error.

ncores

The number of cores to use for parallel execution of the cross-validation folds, run on a cluster created by the parallel package. (This is also supplied to the ncores argument in biglasso(), which is the number of OpenMP threads, but only for the first call of biglasso() that is run on the entire data. The individual calls of biglasso() for the CV folds are run without the ncores argument.)

...

Additional arguments to biglasso().

nfolds

The number of cross-validation folds. Default is 5.

seed

The seed of the random number generator in order to obtain reproducible results.

cv.ind

Which fold each observation belongs to. By default the observations are randomly assigned by cv.biglasso().

trace

If set to TRUE, cv.biglasso will inform the user of its progress by announcing the beginning of each CV fold. Default is FALSE.

grouped

Whether to calculate CV standard error (cvse) over CV folds (TRUE), or over all cross-validated predictions. Ignored when eval.metric is 'auc'.

Details

The function calls biglasso() nfolds times, each time leaving out 1/nfolds of the data. The cross-validation error is based on the residual sum of squares when family="gaussian" and the binomial deviance when family="binomial".
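The fold mechanics can be sketched in base R. In the sketch below the per-fold "model" is only a placeholder (the training mean), chosen so that the example runs without the package; the random assignment scheme shown is likewise an assumption about one reasonable way to build cv.ind, not a copy of the package's code.

```r
set.seed(1234)
n <- 23; nfolds <- 5

# Random fold labels with sizes as equal as possible (plays the role of cv.ind).
cv.ind <- sample(rep_len(seq_len(nfolds), n))

y <- rnorm(n)
fold_err <- numeric(nfolds)
for (k in seq_len(nfolds)) {
  test  <- which(cv.ind == k)          # the 1/nfolds of the data left out
  train <- setdiff(seq_len(n), test)
  yhat  <- mean(y[train])              # placeholder "model": training mean
  fold_err[k] <- mean((y[test] - yhat)^2)
}
cve <- mean(fold_err)                  # error averaged across folds
```

Supplying the same cv.ind to repeated cross-validation runs keeps the fold assignment fixed, which makes results comparable across, for example, different penalties.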

The S3 class object cv.biglasso inherits class ncvreg::cv.ncvreg(). So S3 functions such as "summary", "plot" can be directly applied to the cv.biglasso object.

Value

An object with S3 class "cv.biglasso" which inherits from class "cv.ncvreg". The following variables are contained in the class (adopted from ncvreg::cv.ncvreg()).

cve

The error for each value of lambda, averaged across the cross-validation folds.

cvse

The estimated standard error associated with each value of cve.

lambda

The sequence of regularization parameter values along which the cross-validation error was calculated.

fit

The fitted biglasso object for the whole data.

min

The index of lambda corresponding to lambda.min.

lambda.min

The value of lambda with the minimum cross-validation error.

lambda.1se

The largest value of lambda for which the cross-validation error is at most one standard error larger than the minimum cross-validation error.

null.dev

The deviance for the intercept-only model.

pe

If family="binomial", the cross-validation prediction error for each value of lambda.

cv.ind

Same as above.
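The definitions of min, lambda.min and lambda.1se can be checked by hand on a toy example. The numbers below are made up for illustration and do not come from a fitted model.

```r
# Toy cross-validation results (illustrative values only).
lambda <- c(1.00, 0.50, 0.25, 0.10)
cve    <- c(2.00, 1.20, 1.00, 1.10)
cvse   <- c(0.30, 0.25, 0.25, 0.20)

min_idx    <- which.min(cve)        # index of the smallest CV error
lambda.min <- lambda[min_idx]

# Largest lambda whose CV error is within one standard error of the minimum.
lambda.1se <- max(lambda[cve <= cve[min_idx] + cvse[min_idx]])
```

Here the minimum error is 1.00 at lambda = 0.25, the one-standard-error threshold is 1.25, and the largest lambda below that threshold is 0.50, which corresponds to a sparser model than the one at lambda.min.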

Author(s)

Yaohui Zeng and Patrick Breheny

Examples

## Not run:
## cv.biglasso
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X)

## logistic regression
cvfit <- cv.biglasso(X.bm, y, family = 'binomial', seed = 1234, ncores = 2)
par(mfrow = c(2, 2))
plot(cvfit, type = 'all')
summary(cvfit)

## End(Not run)

Internal biglasso functions

Description

Internal biglasso functions

Usage

loss.biglasso(y, yhat, family, eval.metric, grouped = TRUE)

Arguments

y

The observed response vector.

yhat

The predicted response vector.

family

Either "gaussian" or "binomial", depending on the response.

eval.metric

The evaluation metric for the cross-validated error andfor choosing optimallambda. "default" for linear regression is MSE(mean squared error), for logistic regression is misclassification error."MAPE", for linear regression only, is the Mean Absolute Percentage Error."auc", for logistic regression, is the area under the receiver operatingcharacteristic curve (ROC).

grouped

Whether to calculate loss for the entire CV fold (TRUE), or for predictions individually. Must be TRUE when eval.metric is 'auc'.

Details

These are not intended for use by users. loss.biglasso calculates the value of the loss function for the given predictions (used for cross-validation).
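For orientation, the metrics named under eval.metric can be written out in a few lines of base R. The helper sketch_loss below is hypothetical and is not the internal loss.biglasso; it assumes that yhat holds predicted probabilities for the binomial family, that classes are assigned with a 0.5 cutoff, and that MAPE is reported as a fraction rather than multiplied by 100.

```r
# Hypothetical helper for illustration only; NOT the package's loss.biglasso.
sketch_loss <- function(y, yhat, family = c("gaussian", "binomial"),
                        eval.metric = c("default", "MAPE")) {
  family <- match.arg(family)
  eval.metric <- match.arg(eval.metric)
  if (family == "gaussian") {
    if (eval.metric == "MAPE") {
      mean(abs((y - yhat) / y))      # mean absolute percentage error (assumes y != 0)
    } else {
      mean((y - yhat)^2)             # mean squared error
    }
  } else {
    # Misclassification error; yhat assumed to be predicted probabilities.
    mean((yhat > 0.5) != (y == 1))
  }
}

mse  <- sketch_loss(c(1, 2, 4), c(1.5, 2, 3), "gaussian")
mape <- sketch_loss(c(1, 2, 4), c(1.5, 2, 3), "gaussian", "MAPE")
mis  <- sketch_loss(c(0, 1, 1, 0), c(0.2, 0.7, 0.4, 0.1), "binomial")
```

The sketch averages over all supplied predictions, which corresponds to the grouped = TRUE case; per-prediction losses and the "auc" metric are left out.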

Author(s)

Yaohui Zeng and Patrick Breheny

Plot coefficients from a "biglasso" object

Description

Produce a plot of the coefficient paths for a fitted biglasso() object.

Usage

## S3 method for class 'biglasso'
plot(x, alpha = 1, log.l = TRUE, ...)

Arguments

x

Fitted biglasso() model.

alpha

Controls alpha-blending, helpful when the number of covariates is large. Default is alpha=1.

log.l

Should horizontal axis be on the log scale? Default is TRUE.

...

Other graphical parameters to plot()

Author(s)

Yaohui Zeng and Patrick Breheny

Examples

## See examples in "biglasso"

Plots the cross-validation curve from a "cv.biglasso" object

Description

Plot the cross-validation curve from a cv.biglasso() object, along with standard error bars.

Usage

## S3 method for class 'cv.biglasso'
plot(
  x,
  log.l = TRUE,
  type = c("cve", "rsq", "scale", "snr", "pred", "all"),
  selected = TRUE,
  vertical.line = TRUE,
  col = "red",
  ...
)

Arguments

x

A "cv.biglasso" object.

log.l

Should horizontal axis be on the log scale? Default is TRUE.

type

What to plot on the vertical axis. cve plots the cross-validation error (deviance); rsq plots an estimate of the fraction of the deviance explained by the model (R-squared); snr plots an estimate of the signal-to-noise ratio; scale plots, for family="gaussian", an estimate of the scale parameter (standard deviation); pred plots, for family="binomial", the estimated prediction error; all produces all of the above.

selected

If TRUE (the default), places an axis on top of the plot denoting the number of variables in the model (i.e., that have a nonzero regression coefficient) at that value of lambda.

vertical.line

If TRUE (the default), draws a vertical line at the value where cross-validation error is minimized.

col

Controls the color of the dots (CV estimates).

...

Other graphical parameters to plot

Details

Error bars representing approximate 68% confidence intervals are plotted along with the estimates at each value of lambda. For rsq and snr, these confidence intervals are quite crude, especially near zero.
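The 68% figure matches an interval of plus or minus one standard error under a normal approximation. The following base R sketch uses made-up cve and cvse values and assumes the bars are drawn as cve ± cvse; it is not taken from the plotting code itself.

```r
# Hypothetical CV error estimates and their standard errors.
cve  <- c(1.40, 1.10, 0.92, 0.90, 0.97)
cvse <- c(0.06, 0.05, 0.05, 0.05, 0.06)

# Normal coverage of a +/- 1 standard error interval (about 0.683).
coverage <- pnorm(1) - pnorm(-1)

# Assumed end points of the error bars at each value of lambda.
lower <- cve - cvse
upper <- cve + cvse
```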

Author(s)

Yaohui Zeng and Patrick Breheny

Examples

## See examples in "cv.biglasso"

Plot coefficients from a "mbiglasso" object

Description

Produce a plot of the coefficient paths for a fitted multiple-response mbiglasso object.

Usage

## S3 method for class 'mbiglasso'
plot(x, alpha = 1, log.l = TRUE, norm.beta = TRUE, ...)

Arguments

x

Fitted mbiglasso model.

alpha

Controls alpha-blending, helpful when the number of covariates is large. Default is alpha=1.

log.l

Should horizontal axis be on the log scale? Default is TRUE.

norm.beta

Should the vertical axis be the l2 norm of the coefficients for each variable? Default is TRUE. If FALSE, the vertical axis shows the coefficients themselves.

...

Other graphical parameters to plot()
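As a small illustration of what norm.beta = TRUE summarizes, the l2 norm of each variable's coefficients across responses can be computed as below. The matrix B is a made-up set of coefficients at a single value of lambda (variables in rows, responses in columns); how an mbiglasso object actually stores its coefficients is not assumed here.

```r
# Hypothetical coefficients at one lambda: 3 variables (rows) x 2 responses (columns).
B <- matrix(c(0, 0,
              3, 4,
              1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("x1", "x2", "x3"), c("y1", "y2")))

# l2 norm of each variable's coefficient vector across the responses.
l2 <- sqrt(rowSums(B^2))
```

A variable is inactive for all responses exactly when its norm is zero, which is why a single curve per variable is enough to show when it enters the model.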

Author(s)

Chuyi Wang

Examples

## See examples in "biglasso"

Model predictions based on a fitted `biglasso` object

Description

Extract predictions (fitted response, coefficients, etc.) from a fitted biglasso() object.

Usage

## S3 method for class 'biglasso'
predict(
  object,
  X,
  row.idx = 1:nrow(X),
  type = c("link", "response", "class", "coefficients", "vars", "nvars"),
  lambda,
  which = 1:length(object$lambda),
  ...
)

## S3 method for class 'mbiglasso'
predict(
  object,
  X,
  row.idx = 1:nrow(X),
  type = c("link", "response", "coefficients", "vars", "nvars"),
  lambda,
  which = 1:length(object$lambda),
  k = 1,
  ...
)

## S3 method for class 'biglasso'
coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)

## S3 method for class 'mbiglasso'
coef(object, lambda, which = 1:length(object$lambda), intercept = TRUE, ...)

Arguments

object

A fitted "biglasso" model object.

X

Matrix of values at which predictions are to be made. It must be a bigmemory::big.matrix() object. Not used for type="coefficients".

row.idx

Similar to that in biglasso(), a vector of the row indices of X that are used for the prediction. 1:nrow(X) by default.

type

Type of prediction:

"link" returns the linear predictors
"response" gives the fitted values
"class" returns the binomial outcome with the highest probability
"coefficients" returns the coefficients
"vars" returns a list containing the indices and names of the nonzero variables at each value of lambda
"nvars" returns the number of nonzero coefficients at each value of lambda

lambda

Values of the regularization parameter lambda at which predictions are requested. Linear interpolation is used for values of lambda not in the sequence of lambda values in the fitted models.

which

Indices of the penalty parameter lambda at which predictions are required. By default, all indices are returned. If lambda is specified, this will override which.

...

Not used.

k

Index of the response to predict in multiple-response regression (family="mgaussian").

drop

If coefficients for a single value of lambda are to be returned, reduce dimensions to a vector? Setting drop=FALSE returns a 1-column matrix.

intercept

Whether the intercept should be included in the returned coefficients. For family="mgaussian" only.
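The linear interpolation mentioned for lambda can be pictured with a two-point example. The numbers are made up, and interpolating on the lambda scale (rather than, say, the log scale) is an assumption of this sketch, not something the text above specifies.

```r
# Two adjacent lambda values on a hypothetical fitted path.
lambda <- c(0.10, 0.05)
beta   <- cbind(c(0.0, 1.0),    # coefficients at lambda = 0.10
                c(0.4, 1.6))    # coefficients at lambda = 0.05

new_lambda <- 0.08              # requested value, not on the path

# Weight given to the smaller lambda; the remainder goes to the larger one.
w <- (lambda[1] - new_lambda) / (lambda[1] - lambda[2])
beta_new <- (1 - w) * beta[, 1] + w * beta[, 2]
```

Here w is 0.4, so the interpolated coefficients lie 40% of the way from the solution at lambda = 0.10 toward the solution at lambda = 0.05.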

Value

The object returned depends on type.
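For a logistic model, the "link", "response" and "class" types are related through the inverse logit. The sketch below uses made-up linear predictors and assumes that "class" corresponds to a 0.5 probability cutoff with outcomes coded 0/1, which is the usual reading of "the binomial outcome with the highest probability".

```r
# Hypothetical linear predictors for three observations (type = "link").
eta <- c(-2, 0.3, 1.5)

# Fitted probabilities via the inverse logit (type = "response").
prob <- plogis(eta)

# Most probable outcome under an assumed 0.5 cutoff (type = "class").
cls <- as.integer(prob > 0.5)
```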

Author(s)

Yaohui Zeng and Patrick Breheny

Examples

## Logistic regression
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X, backingfile = "")
fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial")
coef <- coef(fit, lambda=0.05, drop = TRUE)
coef[which(coef != 0)]
predict(fit, X.bm, type="link", lambda=0.05)[1:10]
predict(fit, X.bm, type="response", lambda=0.05)[1:10]
predict(fit, X.bm, type="class", lambda=0.1)[1:10]
predict(fit, type="vars", lambda=c(0.05, 0.1))
predict(fit, type="nvars", lambda=c(0.05, 0.1))

Model predictions based on a fitted `cv.biglasso()` object

Description

Extract predictions from a fitted cv.biglasso() object.

Usage

## S3 method for class 'cv.biglasso'
predict(
  object,
  X,
  row.idx = 1:nrow(X),
  type = c("link", "response", "class", "coefficients", "vars", "nvars"),
  lambda = object$lambda.min,
  which = object$min,
  ...
)

## S3 method for class 'cv.biglasso'
coef(object, lambda = object$lambda.min, which = object$min, ...)

Arguments

object

A fitted "cv.biglasso" model object.

X

Matrix of values at which predictions are to be made. It must be a bigmemory::big.matrix() object. Not used for type="coefficients".

row.idx

Similar to that in biglasso(), a vector of the row indices of X that are used for the prediction. 1:nrow(X) by default.

type

Type of prediction:

"link" returns the linear predictors
"response" gives the fitted values
"class" returns the binomial outcome with the highest probability
"coefficients" returns the coefficients
"vars" returns a list containing the indices and names of the nonzero variables at each value of lambda
"nvars" returns the number of nonzero coefficients at each value of lambda

lambda

Values of the regularization parameter lambda at which predictions are requested. The default value is the one corresponding to the minimum cross-validation error. The strings "lambda.min" (the lambda with the minimum cross-validation error) and "lambda.1se" (the largest value of lambda for which the cross-validation error is at most one standard error larger than the minimum) are also accepted.

which

Indices of the penalty parameter lambda at which predictions are requested. The default value is the index of lambda corresponding to lambda.min. Note: this is overridden if lambda is specified.

...

Not used.

Value

The object returned depends on type.

Author(s)

Yaohui Zeng and Patrick Breheny

Examples

## Not run: 
## predict.cv.biglasso
data(colon)
X <- colon$X
y <- colon$y
X.bm <- as.big.matrix(X, backingfile = "")
fit <- biglasso(X.bm, y, penalty = 'lasso', family = "binomial")
cvfit <- cv.biglasso(X.bm, y, penalty = 'lasso', family = "binomial", seed = 1234, ncores = 2)
coef <- coef(cvfit)
coef[which(coef != 0)]
predict(cvfit, X.bm, type = "response")
predict(cvfit, X.bm, type = "link")
predict(cvfit, X.bm, type = "class")
predict(cvfit, X.bm, lambda = "lambda.1se")

## End(Not run)

Set up design matrix X by reading data from big data file

Description

Set up the design matrix X as a big.matrix object based on an external massive data file stored on disk that cannot be fully loaded into memory. The data file must be a well-formatted ASCII file and contain only a single data type. The current version supports only the double type. Other restrictions on the data file are described in biglasso-package. This function reads the massive data and creates a big.matrix object. By default, the resulting big.matrix is file-backed and can be shared across processors or nodes of a cluster.

Usage

setupX(
  filename,
  dir = getwd(),
  sep = ",",
  backingfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".bin"),
  descriptorfile = paste0(unlist(strsplit(filename, split = "\\."))[1], ".desc"),
  type = "double",
  ...
)

Arguments

filename

The name of the data file. For example, "dat.txt".

dir

The directory used to store the binary and descriptor files associated with the big.matrix. The default is the current working directory.

sep

The field separator character. For example, "," for comma-delimited files (the default); "\t" for tab-delimited files.

backingfile

The binary file associated with the file-backed big.matrix. By default, its name is the same as filename with the extension replaced by ".bin".

descriptorfile

The descriptor file used for the description of the file-backed big.matrix. By default, its name is the same as filename with the extension replaced by ".desc".

type

The data type. Only "double" is supported for now.

...

Additional arguments that can be passed to bigmemory::read.big.matrix().

Details

For a given data set, this function needs to be called only once to set up the big.matrix object, with two backing files (.bin, .desc) created in dir (the current working directory by default). Once set up, the data can be "loaded" into any (new) R session by calling attach.big.matrix(descriptorfile).

This function is a simple wrapper of bigmemory::read.big.matrix(). See bigmemory for more details.
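The default backingfile and descriptorfile names come directly from the expressions shown under Usage. The base R sketch below wraps that expression in a hypothetical helper, default_backing, to show what it produces; note that only the part of filename before the first dot is kept.

```r
# Hypothetical helper reproducing the default-name expression from the Usage section.
default_backing <- function(filename, ext) {
  paste0(unlist(strsplit(filename, split = "\\."))[1], ext)
}

bin_name   <- default_backing("dat.txt", ".bin")      # "dat.bin"
desc_name  <- default_backing("dat.txt", ".desc")     # "dat.desc"
multi_name <- default_backing("my.data.csv", ".bin")  # "my.bin"
```

For file names containing more than one dot, it may therefore be safer to pass backingfile and descriptorfile explicitly.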

Value

A big.matrix object corresponding to a file-backed bigmemory::big.matrix(). It is ready to be used as the design matrix X in biglasso() and cv.biglasso().

Author(s)

Yaohui Zeng and Patrick Breheny

Examples

## see the example in "biglasso-package"

Summarizing inferences based on cross-validation

Description

Summary method for cv.biglasso objects.

Usage

## S3 method for class 'cv.biglasso'
summary(object, ...)

## S3 method for class 'summary.cv.biglasso'
print(x, digits, ...)

Arguments

object

A cv.biglasso object.

...

Further arguments passed to or from other methods.

x

A "summary.cv.biglasso" object.

digits

Number of digits past the decimal point to print out. Can be a vector specifying different display digits for each of the five non-integer printed values.

Value

summary.cv.biglasso produces an object with S3 class "summary.cv.biglasso". The class has its own print method and contains the following list elements:

penalty

The penalty used by biglasso.

model

Either "linear" or "logistic", depending on the family option in biglasso.

n

Number of observations

p

Number of regression coefficients (not including the intercept).

min

The index of lambda with the smallest cross-validation error.

lambda

The sequence of lambda values used by cv.biglasso.

cve

Cross-validation error (deviance).

r.squared

Proportion of variance explained by the model, as estimated by cross-validation.

snr

Signal to noise ratio, as estimated by cross-validation.

sigma

For linear regression models, the scale parameter estimate.

pe

For logistic regression models, the prediction error (misclassification error).
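One common way to relate cve, r.squared and snr is through the null deviance, as in the base R sketch below. The numbers are made up, and the formulas are an assumption for illustration; they are not taken from the summary.cv.biglasso source and may differ from it in detail.

```r
# Hypothetical values: null (intercept-only) deviance and CV error along the path.
null_dev <- 2.0
cve      <- c(2.2, 1.5, 1.0, 0.8)

# Assumed definition: fraction of the null deviance explained, floored at zero.
rsq <- pmax(1 - cve / null_dev, 0)

# Assumed definition: explained over unexplained variation.
snr <- rsq / (1 - rsq)
```

Under these definitions, a model whose cross-validation error exceeds the null deviance is reported as explaining nothing rather than receiving a negative R-squared.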

Author(s)

Yaohui Zeng and Patrick Breheny

Examples

## See examples in "cv.biglasso" and "biglasso-package"

`p`	1,000	10,000	100,000	1,000,000
Size of`X`	9.5 MB	95 MB	950 MB	9.5 GB
Elapsed time (s)	0.11	0.83	8.47	85.50

