L0Learn Vignette

Hussein Hazimeh, Rahul Mazumder, and Tim Nonet

Massachusetts Institute of Technology (MIT)

2023-03-04

Introduction

L0Learn is a fast toolkit for L0-regularized learning. L0 regularization selects the best subset of features and can outperform commonly used feature selection methods (e.g., L1 and MCP) under many sparse learning regimes. The toolkit can (approximately) solve the following three problems:

\[\min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0 + \langle x_i, \beta \rangle) + \lambda ||\beta||_0 \quad \quad (L0)\]

\[\min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0 + \langle x_i, \beta \rangle) + \lambda ||\beta||_0 + \gamma||\beta||_1 \quad (L0L1)\]

\[\min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0 + \langle x_i, \beta \rangle) + \lambda ||\beta||_0 + \gamma||\beta||_2^2 \quad (L0L2)\]

where \(\ell\) is the loss function, \(\beta_0\) is the intercept, \(\beta\) is the vector of coefficients, and \(||\beta||_0\) denotes the L0 norm of \(\beta\), i.e., the number of non-zeros in \(\beta\). We support both regression and classification using any one of the following loss functions:

  • Squared error loss
  • Logistic loss (logistic regression)
  • Squared hinge loss (smoothed version of SVM).

The parameter \(\lambda\) controls the strength of the L0 regularization (larger \(\lambda\) leads to fewer non-zeros). The parameter \(\gamma\) controls the strength of the shrinkage component (the L1 norm in the case of L0L1, or the squared L2 norm in the case of L0L2); adding a shrinkage term to L0 can be very effective in avoiding overfitting and typically leads to better predictive models. The fitting is done over a grid of \(\lambda\) and \(\gamma\) values to generate a regularization path.

The algorithms provided in L0Learn are based on cyclic coordinate descent and local combinatorial search. Many computational tricks and heuristics are used to speed up the algorithms and improve the solution quality. These heuristics include warm starts, active set convergence, correlation screening, greedy cycling order, and efficient methods for updating the residuals by exploiting sparsity and problem dimensions. Moreover, we employ a new computationally efficient method for dynamically selecting the regularization parameter \(\lambda\) in the path. For more details on the algorithms used, please refer to our paper Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms.

The toolkit is implemented in C++ along with an easy-to-use R interface. In this vignette, we provide a tutorial on using the R interface. In particular, we will demonstrate how to use L0Learn's main functions for fitting models, cross-validation, and visualization.

Installation

L0Learn can be installed directly from CRAN by executing:

install.packages("L0Learn")

If you face installation issues, please refer to the Installation Troubleshooting Wiki. If the issue is not resolved, you can submit an issue on L0Learn's GitHub repo.

Tutorial

To demonstrate how L0Learn works, we will first generate a synthetic dataset and then proceed to fitting L0-regularized models. The synthetic dataset (y, X) will be generated from a sparse linear model as follows:

  • X is a 500x1000 design matrix with iid standard normal entries
  • B is a 1000x1 vector with the first 10 entries set to 1 and the rest set to zero.
  • e is a 500x1 vector with iid standard normal entries
  • y is a 500x1 response vector such that y = XB + e

This dataset can be generated in R as follows:

set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000), nrow=500, ncol=1000)
B = c(rep(1,10), rep(0,990))
e = rnorm(500)
y = X%*%B + e

We will use L0Learn to estimate B from the data (y, X). First, we load L0Learn:

library(L0Learn)

We will start by fitting a simple L0 model and then proceed to the cases of L0L2 and L0L1.

Fitting L0 Regression Models

To fit a path of solutions for the L0-regularized model with at most 20 non-zeros using coordinate descent (CD), we use the L0Learn.fit function as follows:

fit <- L0Learn.fit(X, y, penalty="L0", maxSuppSize=20)

This will generate solutions for a sequence of \(\lambda\) values (chosen automatically by the algorithm). To view the sequence of \(\lambda\) values along with the associated support sizes (i.e., the number of non-zeros), we use the print method as follows:

print(fit)
#>         lambda gamma suppSize
#> 1  0.068285500     0        0
#> 2  0.067602600     0        1
#> 3  0.055200200     0        2
#> 4  0.049032300     0        3
#> 5  0.040072500     0        6
#> 6  0.038602900     0        7
#> 7  0.037264900     0        8
#> 8  0.032514200     0       10
#> 9  0.001142920     0       11
#> 10 0.000821227     0       13
#> 11 0.000702292     0       14
#> 12 0.000669520     0       15
#> 13 0.000489938     0       17

To extract the estimated B for particular values of \(\lambda\) and \(\gamma\), we use the function coef(fit, lambda, gamma). For example, the solution at \(\lambda = 0.0325142\) (which corresponds to a support size of 10) can be extracted using

coef(fit, lambda=0.0325142, gamma=0)
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>                     
#> Intercept 0.01052346
#> V1        1.01601401
#> V2        1.01829953
#> V3        1.00606260
#> V4        0.98308999
#> V5        0.97389545
#> V6        0.96148014
#> V7        1.00990709
#> V8        1.08535033
#> V9        1.02685765
#> V10       0.94236638
#> V11       .         
#> V12       .         
...

The output is a sparse vector of type dgCMatrix. The first element in the vector is the intercept and the rest are the B coefficients. Aside from the intercept, the only non-zeros in the above solution are coordinates 1, 2, 3, ..., 10, which are the non-zero coordinates in the true support (used to generate the data). Thus, this solution successfully recovers the true support. Note that on some BLAS implementations, the lambda value we used above (i.e., 0.0325142) might be slightly different due to the limitations of numerical precision. Moreover, all the solutions in the regularization path can be extracted at once by calling coef(fit).

The sequence of \(\lambda\) values generated by L0Learn is stored in the object fit. Specifically, fit$lambda is a list, where each element of the list is a sequence of \(\lambda\) values corresponding to a single value of \(\gamma\). Since L0 has only one value of \(\gamma\) (i.e., 0), we can access the sequence of \(\lambda\) values using fit$lambda[[1]]. Thus, the value \(\lambda=0.0325142\) we used previously can be accessed using fit$lambda[[1]][7] (since it is the 7th value in the output of print). So the previous solution can also be extracted using coef(fit, lambda=fit$lambda[[1]][7], gamma=0).
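The same lookup can also be written as a short sketch; the index 7 follows from the print output above and, as noted, may shift slightly across BLAS implementations:

# Access the lambda sequence for the single gamma value (gamma = 0)
lambda_seq <- fit$lambda[[1]]
# Extract the same solution by indexing into that sequence
coef(fit, lambda=lambda_seq[7], gamma=0)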

We can make predictions using a specific solution in the grid with the function predict(fit, newx, lambda, gamma), where newx is a testing sample (vector or matrix). For example, to predict the response for the samples in the data matrix X using the solution with \(\lambda=0.0325142\), we call the prediction function as follows:

predict(fit, newx=X, lambda=0.0325142, gamma=0)
#> 500 x 1 Matrix of class "dgeMatrix"
#>               [,1]
#>   [1,]  0.44584683
#>   [2,] -1.52213221
#>   [3,] -1.11755544
#>   [4,] -0.93180249
#>   [5,] -4.02095821
#>   [6,]  2.02300763
#>   [7,]  2.03371819
#>   [8,]  0.92234198
#>   [9,]  2.33359737
#>  [10,]  0.92194909
#>  [11,] -4.33165265
#>  [12,] -2.13282518
#>  [13,] -7.21962511
...

We can also visualize the regularization path by plotting the coefficients of the estimated B versus the support size (i.e., the number of non-zeros) using the plot(fit, gamma) method as follows:

plot(fit,gamma=0)

The legend of the plot presents the variables in the order they entered the regularization path. For example, variable 7 is the first variable to enter the path, and variable 6 is the second to enter. Thus, roughly speaking, we can view the first \(k\) variables in the legend as the best subset of size \(k\). To show the lines connecting the points in the plot, we can set the parameter showLines=TRUE in the plot function, i.e., call plot(fit, gamma=0, showLines=TRUE). Moreover, we note that the output of the plot function above is a ggplot object, which can be further customized using the ggplot2 package.
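Since the plot is a ggplot object, it can be stored and modified with standard ggplot2 layers. The snippet below is a minimal sketch assuming the ggplot2 package is installed; the title and theme are arbitrary choices for illustration:

library(ggplot2)
# plot() returns a ggplot object, so we can add layers to it
p <- plot(fit, gamma=0, showLines=TRUE)
p + ggtitle("L0 regularization path") + theme_bw()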

Fitting L0L2 and L0L1 Regression Models

We have demonstrated the simple case of using an L0 penalty. We can also fit more elaborate models that combine L0 regularization with shrinkage-inducing penalties like the L1 norm or squared L2 norm. Adding shrinkage helps in avoiding overfitting and typically improves the predictive performance of the models. Next, we will discuss how to fit a model using the L0L2 penalty for a two-dimensional grid of \(\lambda\) and \(\gamma\) values. Recall that by default, L0Learn automatically selects the \(\lambda\) sequence, so we only need to specify the \(\gamma\) sequence. Suppose we want to fit an L0L2 model with a maximum of 20 non-zeros and a sequence of 5 \(\gamma\) values ranging between 0.0001 and 10. We can do so by calling L0Learn.fit with penalty="L0L2", nGamma=5, gammaMin=0.0001, and gammaMax=10 as follows:

fit <- L0Learn.fit(X, y, penalty="L0L2", nGamma=5, gammaMin=0.0001, gammaMax=10, maxSuppSize=20)

L0Learn will generate a grid of 5 \(\gamma\) values equi-spaced on the logarithmic scale between 0.0001 and 10. Similar to the case of L0, we can print a summary of the regularization path using the print function as follows:

print(fit)
#>          lambda        gamma suppSize
#> 1   0.003251690 10.000000000        0
#> 2   0.003219170 10.000000000        1
#> 3   0.002776000 10.000000000        2
#> 4   0.002687010 10.000000000        3
#> 5   0.002577800 10.000000000        3
#> 6   0.002062240 10.000000000        6
#> 7   0.001861720 10.000000000        7
#> 8   0.001710400 10.000000000        8
#> 9   0.001644570 10.000000000        9
#> 10  0.001153900 10.000000000       10
#> 11  0.000482462 10.000000000       10
#> 12  0.000385970 10.000000000       12
#> 13  0.000374391 10.000000000       14
#> 14  0.000363159 10.000000000       15
#> 15  0.000352264 10.000000000       15
#> 16  0.000281811 10.000000000       19
#> 17  0.000273357 10.000000000       19
#> 18  0.032139100  0.562341325        0
#> 19  0.031817800  0.562341325        1
#> 20  0.027437500  0.562341325        1
#> 21  0.021950000  0.562341325        4
#> 22  0.021291500  0.562341325        4
#> 23  0.017033200  0.562341325        9
#> 24  0.011404900  0.562341325       10
#> 25  0.004768580  0.562341325       10
#> 26  0.003814860  0.562341325       10
#> 27  0.003051890  0.562341325       10
#> 28  0.002441510  0.562341325       10
#> 29  0.001953210  0.562341325       10
...

The sequence of \(\gamma\) values can be accessed using fit$gamma. To extract a solution, we use the coef method. For example, the solution at \(\lambda=0.0011539\) and \(\gamma=10\) can be extracted using

coef(fit, lambda=0.0011539, gamma=10)
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>                      
#> Intercept -0.02424084
#> V1         0.04267751
#> V2         0.04765638
#> V3         0.04958919
#> V4         0.05100251
#> V5         0.04411968
#> V6         0.05385520
#> V7         0.05597895
#> V8         0.04374915
#> V9         0.05429957
#> V10        0.03644110
#> V11        .         
#> V12        .         
...

Similarly, we can predict the response at this pair of \(\lambda\) and \(\gamma\) for the matrix X using

predict(fit, newx=X, lambda=0.0011539, gamma=10)

The regularization path can also be plotted at a specific \(\gamma\) using plot(fit, gamma). Finally, we note that an L0L1 model can be fit by just changing the penalty to "L0L1" in the above (in this case gammaMax will be ignored since it is automatically selected by the toolkit; see the reference manual for more details).
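For instance, a minimal sketch of an L0L1 fit, mirroring the earlier L0L2 call (the maxSuppSize and gamma grid values below are illustrative choices, not defaults):

# Fit an L0L1 path; gammaMax is ignored for L0L1 and chosen automatically
fit_l0l1 <- L0Learn.fit(X, y, penalty="L0L1", nGamma=5, gammaMin=0.0001, maxSuppSize=20)
print(fit_l0l1)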

Higher-quality Solutions using Local Search

By default, L0Learn uses coordinate descent (CD) to fit models. Since the objective function is non-convex, the choice of the optimization algorithm can have a significant effect on the solution quality (different algorithms can lead to solutions with very different objective values). A more elaborate algorithm based on combinatorial search can be used by setting the parameter algorithm="CDPSI" in the call to L0Learn.fit. CDPSI typically leads to higher-quality solutions compared to CD, especially when the features are highly correlated. CDPSI is slower than CD; however, for typical applications it terminates in the order of seconds.
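As a minimal sketch (reusing the earlier data, with maxSuppSize=20 purely for illustration), switching to local search only requires changing the algorithm argument:

# Same fit as before, but using combinatorial local search (CDPSI) instead of plain CD
fit_cdpsi <- L0Learn.fit(X, y, penalty="L0", algorithm="CDPSI", maxSuppSize=20)
print(fit_cdpsi)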

Cross-validation

We will demonstrate how to use K-fold cross-validation (CV) to select the optimal values of the tuning parameters \(\lambda\) and \(\gamma\). To perform CV, we use the L0Learn.cvfit function, which takes the same parameters as L0Learn.fit, in addition to the number of folds (the nFolds parameter) and a seed value (the seed parameter, used when randomly shuffling the data before performing CV).

For example, to perform 5-fold CV using the L0L2 penalty (over a range of 5 gamma values between 0.0001 and 0.1) with a maximum of 50 non-zeros, we run:

cvfit = L0Learn.cvfit(X, y, nFolds=5, seed=1, penalty="L0L2", nGamma=5, gammaMin=0.0001, gammaMax=0.1, maxSuppSize=50)

Note that the object cvfit has a member fit (accessed using cvfit$fit), which is the output of running L0Learn.fit on (y, X). The cross-validation errors can be accessed using the cvMeans attribute of cvfit: cvfit$cvMeans is a list where the ith element, cvfit$cvMeans[[i]], stores the cross-validation errors for the ith value of gamma (cvfit$fit$gamma[i]). To find the minimum cross-validation error for every gamma, we call the min function on every element of the list cvfit$cvMeans, as follows:

lapply(cvfit$cvMeans, min)
#> [[1]]
#> [1] 1.283026
#> 
#> [[2]]
#> [1] 1.000699
#> 
#> [[3]]
#> [1] 0.9900166
#> 
#> [[4]]
#> [1] 0.9899541
#> 
#> [[5]]
#> [1] 0.9900055

The above output indicates that the 4th value of gamma achieves the lowest CV error (=0.9899541). We can plot the CV errors against the support size for the 4th value of gamma, i.e., gamma = cvfit$fit$gamma[4], using:

plot(cvfit, gamma=cvfit$fit$gamma[4])

The above plot is produced using the ggplot2 package and can be further customized by the user. To extract the optimal \(\lambda\) (i.e., the one with minimum CV error) in this plot, we execute the following:

optimalGammaIndex = 4 # index of the optimal gamma identified previously
optimalLambdaIndex = which.min(cvfit$cvMeans[[optimalGammaIndex]])
optimalLambda = cvfit$fit$lambda[[optimalGammaIndex]][optimalLambdaIndex]
optimalLambda
#> [1] 0.00809626

To print the solution corresponding to the optimal gamma/lambda pair:

coef(cvfit, lambda=optimalLambda, gamma=cvfit$fit$gamma[4])
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>                     
#> Intercept 0.01044999
#> V1        1.01472611
#> V2        1.01711155
#> V3        1.00494630
#> V4        0.98208003
#> V5        0.97274703
#> V6        0.96057359
#> V7        1.00895157
#> V8        1.08392099
#> V9        1.02580620
#> V10       0.94108163
#> V11       .         
#> V12       .         
...

The optimal solution (above) selected by cross-validation correctly recovers the support of the true vector of coefficients used to generate the data.

Fitting Classification Models

All the commands and plots we have seen in the case of regression extend to classification. We currently support logistic regression (using the parameter loss="Logistic") and a smoothed version of SVM (using the parameter loss="SquaredHinge"). To give some examples, we first generate a synthetic classification dataset (similar to the one we generated in the case of regression):

set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000), nrow=500, ncol=1000)
B = c(rep(1,10), rep(0,990))
e = rnorm(500)
y = sign(X%*%B + e)

An L0-regularized logistic regression model can be fit by specifying loss="Logistic" as follows:

fit = L0Learn.fit(X, y, loss="Logistic")
print(fit)
#>          lambda gamma suppSize
#> 1   2.03415e+01 1e-07        1
#> 2   1.82815e+01 1e-07        2
#> 3   1.80478e+01 1e-07        2
#> 4   1.44382e+01 1e-07        5
#> 5   1.40051e+01 1e-07        5
#> 6   1.12041e+01 1e-07        7
#> 7   1.08679e+01 1e-07        7
#> 8   8.69435e+00 1e-07       10
#> 9   4.45223e+00 1e-07       10
#> 10  3.56178e+00 1e-07       10
#> 11  2.84943e+00 1e-07       10
#> 12  2.27954e+00 1e-07       10
#> 13  1.82363e+00 1e-07       10
#> 14  1.45891e+00 1e-07       10
...

The output above indicates that \(\gamma=10^{-7}\); by default, we use a small ridge regularization (with \(\gamma=10^{-7}\)) to ensure the existence of a solution. To extract the coefficients of the solution with \(\lambda = 8.69435\):

coef(fit, lambda=8.69435, gamma=1e-7)
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>                     
#> Intercept -0.1797751
#> V1         2.0617431
#> V2         2.2437744
#> V3         1.9869710
#> V4         2.1627060
#> V5         2.1201722
#> V6         2.1177628
#> V7         1.8449653
#> V8         2.4375011
#> V9         2.3808994
#> V10        2.1134238
#> V11        .        
#> V12        .        
...

The above indicates that the 10 non-zeros in the estimated model match those we used in generating the data (i.e., L0 regularization correctly recovered the true support). We can also make predictions at the latter \(\lambda\) using:

predict(fit, newx=X, lambda=8.69435, gamma=1e-7)
#> 500 x 1 Matrix of class "dgeMatrix"
#>                [,1]
#>   [1,] 5.786002e-01
#>   [2,] 2.067591e-02
#>   [3,] 5.357189e-02
#>   [4,] 7.589991e-02
#>   [5,] 1.460869e-04
#>   [6,] 9.886345e-01
#>   [7,] 9.620193e-01
#>   [8,] 9.201535e-01
#>   [9,] 9.912170e-01
#>  [10,] 7.672292e-01
#>  [11,] 3.756393e-05
#>  [12,] 5.631011e-03
#>  [13,] 2.491432e-07
...

Each row in the above is the probability that the corresponding sample belongs to class \(1\). Other models (i.e., L0L2 and L0L1) can be fit similarly by keeping loss="Logistic" and specifying the corresponding penalty.
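For example, a minimal sketch of an L0L2 logistic fit (the gamma grid and support-size limit below are illustrative choices, not defaults):

# Logistic regression with a combined L0 + squared-L2 penalty
fit_logistic_l2 <- L0Learn.fit(X, y, loss="Logistic", penalty="L0L2", nGamma=5, gammaMin=0.0001, gammaMax=10, maxSuppSize=20)
print(fit_logistic_l2)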

Finally, we note that L0Learn also supports a smoothed version of SVM via the squared hinge loss (loss="SquaredHinge"). The only difference from logistic regression is that the predict function returns \(\beta_0 + \langle x, \beta \rangle\) (where \(x\) is the testing sample), instead of returning probabilities. The latter predictions can be assigned to the appropriate classes by using a thresholding function (e.g., the sign function).
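A hedged sketch of this workflow (the lambda index below is purely illustrative and depends on the fitted path):

# Fit a smoothed SVM via the squared hinge loss
fit_svm <- L0Learn.fit(X, y, loss="SquaredHinge", maxSuppSize=20)
# predict() returns the linear scores beta_0 + <x, beta>; threshold them with sign()
scores <- predict(fit_svm, newx=X, lambda=fit_svm$lambda[[1]][5], gamma=fit_svm$gamma[1])
yhat <- sign(as.vector(scores))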

Advanced Options

Sparse Matrix Support

Starting in version 2.0.0, L0Learn supports sparse matrices of type dgCMatrix. If your sparse matrix uses a different storage format, please convert it to dgCMatrix before using it in L0Learn. L0Learn keeps the matrix sparse internally and thus is highly efficient if the matrix is sufficiently sparse. The API for sparse matrices is the same as that of dense matrices, so all the demonstrations in this vignette also apply to sparse matrices. For example, we can fit an L0-regularized model on a sparse matrix as follows:

# As an example, we generate a random, sparse matrix with
# 500 samples, 1000 features, and 10% nonzero entries.
X_sparse <- Matrix::rsparsematrix(nrow=500, ncol=1000, density=0.1, rand.x = rnorm)
# Below we generate the response using the same linear model as before,
# but with the sparse data matrix X_sparse.
y_sparseX <- as.vector(X_sparse%*%B + e)
# Call L0Learn.
fit_sparse <- L0Learn.fit(X_sparse, y_sparseX, penalty="L0")
# Note: In the setup above, X_sparse is of type dgCMatrix.
# If your sparse matrix is of a different type, convert it
# to dgCMatrix before calling L0Learn, e.g., using: X_sparse <- as(X_sparse, "dgCMatrix").

Selection on Subset of Variables

In certain applications, it is desirable to always include some of the variables in the model and perform variable selection on the others. L0Learn supports this option through the excludeFirstK parameter. Specifically, setting excludeFirstK = K (where K is a non-negative integer) instructs L0Learn to exclude the first K variables in the data matrix X from the L0-norm penalty (those K variables will still be penalized by the L2 or L1 norm penalties). For example, below we fit an L0 model and exclude the first 3 variables from selection by setting excludeFirstK = 3:

fit <- L0Learn.fit(X, y, penalty="L0", maxSuppSize=10, excludeFirstK=3)

Plotting the regularization path:

plot(fit,gamma=0)

We can see in the plot above that the first 3 variables are included in all the solutions along the path.

Coefficient Bounds

Starting in version 2.0.0, L0Learn supports bounds for the CD algorithm for all losses and penalties. (We plan to support bound constraints for the CDPSI algorithm in the future.) By default, L0Learn does not apply bounds, i.e., it assumes \(-\infty \le \beta_i \le \infty\) for all i. Users can supply the same bounds for all coefficients by setting the parameters lows and highs to scalar values (these should satisfy: lows <= 0, lows != highs, and highs >= 0). To use different bounds for the coefficients, lows and highs can both be set to vectors of length p (where the i-th entry corresponds to the bound on coefficient i).

All of the following examples are valid.

L0Learn.fit(X, y, penalty="L0", lows=-0.5)
L0Learn.fit(X, y, penalty="L0", highs=0.5)
L0Learn.fit(X, y, penalty="L0", lows=-0.5, highs=0.5)
low_vector <- c(rep(-0.1, 500), rep(-0.125, 500))
fit <- L0Learn.fit(X, y, penalty="L0", lows=low_vector, highs=0.25)

We can see the coefficients are subject to the bounds.

print(max(fit$beta[[1]]))
#> [1] 0.25
print(min(fit$beta[[1]][1:500, ]))
#> [1] -0.1
print(min(fit$beta[[1]][501:1000, ]))
#> [1] -0.1222599

User-specified Lambda Grids

By default, L0Learn selects the sequence of lambda values in an efficient manner to avoid wasted computation (since close \(\lambda\) values typically lead to the same solution). Advanced users of the toolkit can change this default behavior and supply their own sequence of \(\lambda\) values through the parameter lambdaGrid. L0Learn versions before 2.0.0 also required setting the autoLambda parameter to FALSE. This parameter remains in version 2.0.0 for backwards compatibility, but it is no longer needed or used.

Specifically, the value assigned to lambdaGrid should be a list of lists of decreasing positive values (doubles). The length of lambdaGrid (the number of sub-lists it stores) specifies the number of gamma parameters that will be filled in between gammaMin and gammaMax. In the case of the L0 penalty, lambdaGrid must be a list of length 1. In the case of L0L2/L0L1, lambdaGrid can contain any number of sub-lists. The ith element of lambdaGrid should be a decreasing sequence of positive lambda values which are used by the algorithm for the ith value of gamma. For example, to fit an L0 model with the user-specified sequence of lambda values 1, 1e-1, 1e-2, 1e-3, 1e-4, we run the following:

userLambda <- list()
userLambda[[1]] <- c(1, 1e-1, 1e-2, 1e-3, 1e-4)
fit <- L0Learn.fit(X, y, penalty="L0", lambdaGrid=userLambda, maxSuppSize=1000)

To verify the results, we print the fit object:

print(fit)
#>   lambda gamma suppSize
#> 1  1e+00     0        0
#> 2  1e-01     0        0
#> 3  1e-02     0       10
#> 4  1e-03     0       62
#> 5  1e-04     0      257

Note that the \(\lambda\) values above are exactly the user-specified values. For the L0L2 and L0L1 penalties, the same can be done by supplying a lambdaGrid with one sub-list per gamma value:

userLambda <- list()
userLambda[[1]] <- c(1, 1e-1, 1e-2, 1e-3, 1e-4)
userLambda[[2]] <- c(10, 2, 1, 0.01, 0.002, 0.001, 1e-5)
userLambda[[3]] <- c(1e-4, 1e-5)
# userLambda[[i]] must be a vector of positive decreasing reals.
fit <- L0Learn.fit(X, y, penalty="L0L2", lambdaGrid=userLambda, maxSuppSize=1000)
print(fit)
#>    lambda       gamma suppSize
#> 1   1e+00 10.00000000        0
#> 2   1e-01 10.00000000        0
#> 3   1e-02 10.00000000        0
#> 4   1e-03 10.00000000        8
#> 5   1e-04 10.00000000      132
#> 6   1e+01  0.03162278        0
#> 7   2e+00  0.03162278        0
#> 8   1e+00  0.03162278        0
#> 9   1e-02  0.03162278       10
#> 10  2e-03  0.03162278       26
#> 11  1e-03  0.03162278       65
#> 12  1e-05  0.03162278      625
#> 13  1e-04  0.00010000      311
#> 14  1e-05  0.00010000      409

References

Hussein Hazimeh and Rahul Mazumder. Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms. Operations Research (2020).

Antoine Dedieu, Hussein Hazimeh, and Rahul Mazumder. Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives. JMLR (to appear).

