Regularization (mathematics)

From Wikipedia, the free encyclopedia
Technique to make a model more generalizable and transferable
The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting $\lambda$, the weight of the regularization term.

In mathematics, statistics, finance,[1] and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem to a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.[2]

Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:

  • Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
  • Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees).

In explicit regularization, independent of the problem or model, there is always a data term that corresponds to a likelihood of the measurement, and a regularization term that corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more aligned to the data or to enforce regularization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to that regularization to justify the choice. The choice can also be physically motivated by common sense or intuition.

In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error score with the trained model on the evaluation set (testing data) and not the training data.[3]

One of the earliest uses of regularization is Tikhonov regularization (ridge regression), related to the method of least squares.

Regularization in machine learning


In machine learning, a key challenge is enabling models to accurately predict outcomes on unseen data, not just on familiar training data. Regularization is crucial for addressing overfitting, where a model memorizes training data details but cannot generalize to new data. The goal of regularization is to encourage models to learn the broader patterns within the data rather than memorizing them. Techniques like early stopping, L1 and L2 regularization, and dropout are designed to prevent overfitting and underfitting, thereby improving the model's ability to generalize to new data.[4]

Early stopping


Stops training when validation performance deteriorates, preventing overfitting by halting before the model memorizes training data.[4]
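A minimal sketch of the idea in Python (the training loop, the `step` and `val_loss` callables, and the patience value are illustrative assumptions, not a specific library's API):

```python
def train_with_early_stopping(step, val_loss, max_epochs=500, patience=10):
    """Run step() (one training epoch) until val_loss() stops improving.

    step and val_loss are callables supplied by the caller; patience is
    how many epochs without improvement are tolerated before halting.
    """
    best, since_best = float("inf"), 0
    for _ in range(max_epochs):
        step()                      # one pass over the training data
        loss = val_loss()           # loss on the held-out validation set
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:  # validation stopped improving: halt
            break
    return best
```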

L1 and L2 regularization


Adds penalty terms to the cost function to discourage complex models (a sketch follows the list):

  • L1 regularization (also called LASSO) leads to sparse models by adding a penalty based on the absolute value of coefficients.
  • L2 regularization (also called ridge regression) encourages smaller, more evenly distributed weights by adding a penalty based on the square of the coefficients.[4]
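A minimal NumPy sketch of how the two penalties modify a least-squares cost (the data, weights, and penalty weight `lam` are assumed placeholders):

```python
import numpy as np

def cost(w, X, y, lam, kind="l2"):
    """Least-squares cost with an optional L1 (lasso) or L2 (ridge) penalty."""
    residual = X @ w - y
    data_term = residual @ residual / len(y)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))  # absolute values: encourages exact zeros
    else:
        penalty = lam * np.sum(w ** 2)     # squares: shrinks all weights smoothly
    return data_term + penalty
```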

Dropout


In the context of neural networks, the Dropout technique repeatedly ignores random subsets of neurons during training, which simulates the training of multiple neural network architectures at once to improve generalization.[4]
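A sketch of the common "inverted" variant in NumPy (the drop probability and the scaling convention are one standard choice, assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Zero each unit with probability p during training (inverted dropout).

    Scaling the survivors by 1/(1-p) keeps the expected activation
    unchanged, so no adjustment is needed at test time.
    """
    if not training or p == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```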

Classification


Empirical learning of classifiers (from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any $x$ given only examples $x_1, x_2, \dots, x_n$.

A regularization term (or regularizer) $R(f)$ is added to a loss function:
$$\min_f \sum_{i=1}^n V(f(x_i), y_i) + \lambda R(f)$$
where $V$ is an underlying loss function that describes the cost of predicting $f(x)$ when the label is $y$, such as the square loss or hinge loss; and $\lambda$ is a parameter which controls the importance of the regularization term. $R(f)$ is typically chosen to impose a penalty on the complexity of $f$. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.[5][page needed]

A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution (as depicted in the figure above, where the green function, the simpler one, may be preferred). From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.[6]

Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse, and introducing group structure[clarification needed] into the learning problem.

The same idea arose in many fields of science. A simple form of regularization applied to integral equations (Tikhonov regularization) is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.

Generalization

Main article: Generalization error

Regularization can be motivated as a technique to improve the generalizability of a learned model.

The goal of this learning problem is to find a function that fits or predicts the outcome (label) and minimizes the expected error over all possible inputs and labels. The expected error of a function $f_n$ is:
$$I[f_n] = \int_{X \times Y} V(f_n(x), y) \, \rho(x, y) \, dx \, dy$$
where $X$ and $Y$ are the domains of input data $x$ and their labels $y$ respectively.

Typically in learning problems, only a subset of input data and labels are available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the $n$ available samples:
$$I_S[f_n] = \frac{1}{n} \sum_{i=1}^n V(f_n(\hat{x}_i), \hat{y}_i)$$
Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements (e.g. of $x_i$) were made with noise, this model may suffer from overfitting and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.

Tikhonov regularization (ridge regression)

Main article: Tikhonov regularization

These techniques are named for Andrey Nikolayevich Tikhonov, who applied regularization to integral equations and made important contributions in many other areas.

When learning a linear function $f$, characterized by an unknown vector $w$ such that $f(x) = w \cdot x$, one can add the $L_2$-norm of the vector $w$ to the loss expression in order to prefer solutions with smaller norms. Tikhonov regularization is one of the most common forms. It is also known as ridge regression. It is expressed as:
$$\min_w \sum_{i=1}^n V(\hat{x}_i \cdot w, \hat{y}_i) + \lambda \|w\|_2^2,$$
where $(\hat{x}_i, \hat{y}_i), \, 1 \leq i \leq n,$ would represent samples used for training.

In the case of a general function, the norm of the function in its reproducing kernel Hilbert space is:
$$\min_f \sum_{i=1}^n V(f(\hat{x}_i), \hat{y}_i) + \lambda \|f\|_{\mathcal{H}}^2$$

As the $L_2$ norm is differentiable, learning can be advanced by gradient descent.

Tikhonov-regularized least squares


The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal $w$ is the one for which the gradient of the loss function with respect to $w$ is 0.
$$\min_w \frac{1}{n} (\hat{X} w - Y)^\mathsf{T} (\hat{X} w - Y) + \lambda \|w\|_2^2$$
$$\nabla_w = \frac{2}{n} \hat{X}^\mathsf{T} (\hat{X} w - Y) + 2 \lambda w$$
$$0 = \hat{X}^\mathsf{T} (\hat{X} w - Y) + n \lambda w$$
$$w = \left( \hat{X}^\mathsf{T} \hat{X} + \lambda n I \right)^{-1} \left( \hat{X}^\mathsf{T} Y \right)$$
where the third statement is a first-order condition.

By construction of the optimization problem, other values of $w$ give larger values for the loss function. This can be verified by examining the second derivative $\nabla_{ww}$.

During training, this algorithm takes $O(d^3 + nd^2)$ time. The terms correspond to the matrix inversion and calculating $X^\mathsf{T} X$, respectively. Testing takes $O(nd)$ time.
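The closed form above is straightforward to implement. A NumPy sketch (the synthetic data is an assumed illustration; solving the linear system is preferred over forming the inverse explicitly):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Tikhonov-regularized least squares in closed form:
    w = (X^T X + lam * n * I)^(-1) X^T y, as derived above."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# Illustrative data, not from the article
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.1))  # regularized estimate of w_true
```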

Early stopping

Main article: Early stopping

Early stopping can be viewed as regularization in time. Intuitively, a training procedure such as gradient descent tends to learn more and more complex functions with increasing iterations. By regularizing for time, model complexity can be controlled, improving generalization.

Early stopping is implemented using one data set for training, one statistically independent data set for validation and another for testing. The model is trained until performance on the validation set no longer improves and then applied to the test set.

Theoretical motivation in least squares


Consider the finite approximation of the Neumann series for an invertible matrix $A$ where $\|I - A\| < 1$:
$$\sum_{i=0}^{T-1} (I - A)^i \approx A^{-1}$$

This can be used to approximate the analytical solution of unregularized least squares, if $\gamma$ is introduced to ensure the norm is less than one:
$$w_T = \frac{\gamma}{n} \sum_{i=0}^{T-1} \left( I - \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{X} \right)^i \hat{X}^\mathsf{T} \hat{Y}$$

The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail. By limitingT, the only free parameter in the algorithm above, the problem is regularized for time, which may improve its generalization.

The algorithm above is equivalent to restricting the number of gradient descent iterations for the empirical risk
$$I_s[w] = \frac{1}{2n} \left\| \hat{X} w - \hat{Y} \right\|_{\mathbb{R}^n}^2$$
with the gradient descent update:
$$\begin{aligned} w_0 &= 0 \\ w_{t+1} &= \left( I - \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{X} \right) w_t + \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{Y} \end{aligned}$$

The base case is trivial. The inductive case is proved as follows:
$$\begin{aligned} w_T &= \left( I - \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{X} \right) \frac{\gamma}{n} \sum_{i=0}^{T-2} \left( I - \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{X} \right)^i \hat{X}^\mathsf{T} \hat{Y} + \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{Y} \\ &= \frac{\gamma}{n} \sum_{i=1}^{T-1} \left( I - \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{X} \right)^i \hat{X}^\mathsf{T} \hat{Y} + \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{Y} \\ &= \frac{\gamma}{n} \sum_{i=0}^{T-1} \left( I - \frac{\gamma}{n} \hat{X}^\mathsf{T} \hat{X} \right)^i \hat{X}^\mathsf{T} \hat{Y} \end{aligned}$$
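The equivalence is easy to check numerically. A small NumPy sketch (the dimensions, step size $\gamma$, and data are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T, gamma = 50, 3, 25, 0.1
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.05 * rng.normal(size=n)

# T gradient descent steps on the empirical risk, starting from w = 0
w = np.zeros(d)
for _ in range(T):
    w = (np.eye(d) - (gamma / n) * X.T @ X) @ w + (gamma / n) * X.T @ y

# Truncated Neumann series with the same T and gamma
M = np.eye(d) - (gamma / n) * X.T @ X
w_series = sum(np.linalg.matrix_power(M, i) for i in range(T)) @ ((gamma / n) * X.T @ y)

print(np.allclose(w, w_series))  # True: the two computations coincide
```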

Regularizers for sparsity


Assume that a dictionary $\phi_j$ with dimension $p$ is given such that a function in the function space can be expressed as:
$$f(x) = \sum_{j=1}^p \phi_j(x) w_j$$

A comparison between the L1 ball and the L2 ball in two dimensions gives an intuition on how L1 regularization achieves sparsity.

Enforcing a sparsity constraint on $w$ can lead to simpler and more interpretable models. This is useful in many real-life applications such as computational biology. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.

A sensible sparsity constraint is the $L_0$ norm $\|w\|_0$, defined as the number of non-zero elements in $w$. Solving an $L_0$ regularized learning problem, however, has been demonstrated to be NP-hard.[7]

The $L_1$ norm (see also Norms) can be used to approximate the optimal $L_0$ norm via convex relaxation. It can be shown that the $L_1$ norm induces sparsity. In the case of least squares, this problem is known as LASSO in statistics and basis pursuit in signal processing.
$$\min_{w \in \mathbb{R}^p} \frac{1}{n} \left\| \hat{X} w - \hat{Y} \right\|^2 + \lambda \|w\|_1$$

Elastic net regularization

$L_1$ regularization can occasionally produce non-unique solutions. A simple example is provided in the figure when the space of possible solutions lies on a 45 degree line. This can be problematic for certain applications, and is overcome by combining $L_1$ with $L_2$ regularization in elastic net regularization, which takes the following form:
$$\min_{w \in \mathbb{R}^p} \frac{1}{n} \left\| \hat{X} w - \hat{Y} \right\|^2 + \lambda \left( \alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2 \right), \quad \alpha \in [0, 1]$$

Elastic net regularization tends to have a grouping effect, where correlated input features are assigned equal weights.

Elastic net regularization is commonly used in practice and is implemented in many machine learning libraries.
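For instance, scikit-learn exposes this as ElasticNet (its parameterization scales the penalty terms slightly differently from the formula above; the data here is an assumed illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]               # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=200)

# l1_ratio plays the role of alpha in the formula above
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)                          # most spurious coefficients near zero
```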

Proximal methods

Main article: Proximal gradient method

While the $L_1$ norm does not result in an NP-hard problem, it is convex but not strictly differentiable due to the kink at $x = 0$. Subgradient methods, which rely on the subderivative, can be used to solve $L_1$ regularized learning problems. However, faster convergence can be achieved through proximal methods.

For a problem $\min_{w \in H} F(w) + R(w)$ such that $F$ is convex, continuous, differentiable, with Lipschitz continuous gradient (such as the least squares loss function), and $R$ is convex, continuous, and proper, the proximal method to solve the problem is as follows. First define the proximal operator
$$\operatorname{prox}_R(v) = \operatorname{argmin}_{w \in \mathbb{R}^D} \left\{ R(w) + \frac{1}{2} \|w - v\|^2 \right\},$$
and then iterate
$$w_{k+1} = \operatorname{prox}_{\gamma, R} \left( w_k - \gamma \nabla F(w_k) \right)$$

The proximal method iteratively performs gradient descent and then projects the result back into the space permitted byR{\displaystyle R}.

When $R$ is the $L_1$ regularizer, the proximal operator is equivalent to the soft-thresholding operator,
$$S_\lambda(v)_i = \begin{cases} v_i - \lambda, & \text{if } v_i > \lambda \\ 0, & \text{if } v_i \in [-\lambda, \lambda] \\ v_i + \lambda, & \text{if } v_i < -\lambda \end{cases}$$

This allows for efficient computation.
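A NumPy sketch of the resulting algorithm (ISTA) for the lasso problem above; the step size $\gamma$ and the iteration count are assumed choices:

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1: the soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(X, y, lam, gamma, iters=500):
    """Proximal gradient for lasso: a gradient step on the least-squares
    term, then soft-thresholding with threshold gamma * lam."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w = soft_threshold(w - gamma * grad, gamma * lam)
    return w
```

For convergence, $\gamma$ would typically be chosen no larger than the reciprocal of the Lipschitz constant of the gradient, here the largest eigenvalue of $\tfrac{2}{n} X^\mathsf{T} X$.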

Group sparsity without overlaps


Groups of features can be regularized by a sparsity constraint, which can be useful for expressing certain prior knowledge in an optimization problem.

In the case of a linear model with non-overlapping known groups, a regularizer can be defined:
$$R(w) = \sum_{g=1}^G \|w_g\|_2, \quad \text{where} \quad \|w_g\|_2 = \sqrt{\sum_{j=1}^{|G_g|} \left( w_g^j \right)^2}$$

This can be viewed as inducing an $L_2$ norm over the members of each group followed by an $L_1$ norm over the groups.

This can be solved by the proximal method, where the proximal operator is a block-wise soft-thresholding function:

$$\operatorname{prox}_{\lambda, R, g}(w_g) = \begin{cases} \left( 1 - \dfrac{\lambda}{\|w_g\|_2} \right) w_g, & \text{if } \|w_g\|_2 > \lambda \\ 0, & \text{if } \|w_g\|_2 \leq \lambda \end{cases}$$
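A NumPy sketch of this block-wise operator, assuming the groups are given as lists of indices that partition the coordinates (non-overlapping, as above):

```python
import numpy as np

def group_prox(w, groups, lam):
    """Block-wise soft-thresholding: shrink each group's subvector toward
    zero by lam in norm, zeroing the whole group when its norm <= lam."""
    out = np.zeros_like(w)
    for idx in groups:                  # e.g. groups = [[0, 1], [2, 3, 4]]
        norm = np.linalg.norm(w[idx])
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * w[idx]
        # else: the whole group stays zero
    return out
```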

Group sparsity with overlaps


The algorithm described for group sparsity without overlaps can be applied to the case where groups do overlap, in certain situations. This will likely result in some groups with all zero elements, and other groups with some non-zero and some zero elements.

If it is desired to preserve the group structure, a new regularizer can be defined:
$$R(w) = \inf \left\{ \sum_{g=1}^G \|\bar{w}_g\|_2 : w = \sum_{g=1}^G \bar{w}_g \right\}$$

For each $w_g$, $\bar{w}_g$ is defined as the vector such that the restriction of $\bar{w}_g$ to the group $g$ equals $w_g$ and all other entries of $\bar{w}_g$ are zero. The regularizer finds the optimal decomposition of $w$ into parts; it can be viewed as duplicating all elements that exist in multiple groups. Learning problems with this regularizer can also be solved with the proximal method, with a complication: the proximal operator cannot be computed in closed form, but it can be solved effectively in an iterative manner, inducing an inner iteration within the proximal method iteration.

Regularizers for semi-supervised learning

Main article: Semi-supervised learning

When labels are more expensive to gather than input examples, semi-supervised learning can be useful. Regularizers have been designed to guide learning algorithms to learn models that respect the structure of unsupervised training samples. If a symmetric weight matrix $W$ is given, a regularizer can be defined:
$$R(f) = \sum_{i,j} w_{ij} \left( f(x_i) - f(x_j) \right)^2$$

If $W_{ij}$ encodes the result of some distance metric for points $x_i$ and $x_j$, it is desirable that $f(x_i) \approx f(x_j)$. This regularizer captures this intuition, and is equivalent to:
$$R(f) = \bar{f}^\mathsf{T} L \bar{f}$$
where $L = D - W$ is the Laplacian matrix of the graph induced by $W$.

The optimization problem $\min_{f \in \mathbb{R}^m} R(f), \, m = u + l$ can be solved analytically if the constraint $f(x_i) = y_i$ is applied for all supervised samples. The labeled part of the vector $f$ is therefore fixed by the constraint. The unlabeled part of $f$ is solved for by:
$$\min_{f_u \in \mathbb{R}^u} f^\mathsf{T} L f = \min_{f_u \in \mathbb{R}^u} \left\{ f_u^\mathsf{T} L_{uu} f_u + f_l^\mathsf{T} L_{lu} f_u + f_u^\mathsf{T} L_{ul} f_l \right\}$$
$$\nabla_{f_u} = 2 L_{uu} f_u + 2 L_{ul} Y$$
$$f_u = -L_{uu}^\dagger \left( L_{ul} Y \right)$$
(setting the gradient to zero introduces the minus sign). The pseudo-inverse can be taken because $L_{ul}$ has the same range as $L_{uu}$.
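A NumPy sketch of this solution (the similarity matrix and the boolean labeled mask are assumed inputs for illustration):

```python
import numpy as np

def laplacian_interpolate(W, y_labeled, labeled):
    """Minimize f^T L f with f fixed to the given labels on labeled points.

    W: symmetric similarity matrix; labeled: boolean mask of labeled points;
    y_labeled: labels for the points where labeled is True.
    """
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian L = D - W
    u = ~labeled
    L_uu = L[np.ix_(u, u)]
    L_ul = L[np.ix_(u, labeled)]
    f = np.empty(len(W), dtype=float)
    f[labeled] = y_labeled
    f[u] = -np.linalg.pinv(L_uu) @ (L_ul @ y_labeled)  # f_u = -L_uu^+ L_ul y
    return f
```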

Regularizers for multitask learning

Main article: Multi-task learning

In the case of multitask learning, $T$ problems are considered simultaneously, each related in some way. The goal is to learn $T$ functions with predictive power, ideally borrowing strength from the relatedness of the tasks. This is equivalent to learning the matrix $W: T \times D$.

Sparse regularizer on columns


$$R(W) = \|W\|_{2,1} = \sum_{i=1}^D \|w_i\|_2$$
where $w_i$ denotes the $i$-th column of $W$.

This regularizer defines an L2 norm on each column and an L1 norm over all columns. It can be solved by proximal methods.
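Computing this penalty is a one-liner in NumPy (assuming tasks along the rows of $W$, as above):

```python
import numpy as np

def l21_norm(W):
    """L2 norm of each column of W, summed (an L1 norm across columns)."""
    return float(np.sum(np.linalg.norm(W, axis=0)))
```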

Nuclear norm regularization


$$R(W) = \|\sigma(W)\|_1$$
where $\sigma(W)$ is the vector of singular values in the singular value decomposition of $W$; this sum of singular values is known as the nuclear norm.

Mean-constrained regularization


R(f1fT)=t=1Tft1Ts=1TfsHk2{\displaystyle R(f_{1}\cdots f_{T})=\sum _{t=1}^{T}\left\|f_{t}-{\frac {1}{T}}\sum _{s=1}^{T}f_{s}\right\|_{H_{k}}^{2}}

This regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks. This is useful for expressing prior information that each task is expected to share similarities with each other task. An example is predicting blood iron levels measured at different times of the day, where each task represents an individual.
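If each task's function is represented by a finite parameter vector (an assumption for this sketch), the penalty is just the summed squared deviation from the across-task mean:

```python
import numpy as np

def mean_constrained_penalty(F):
    """F has one row of parameters per task; penalize each row's squared
    distance from the mean row, as in the formula above."""
    return float(np.sum((F - F.mean(axis=0)) ** 2))
```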

Clustered mean-constrained regularization


R(f1fT)=r=1CtI(r)ft1I(r)sI(r)fsHk2{\displaystyle R(f_{1}\cdots f_{T})=\sum _{r=1}^{C}\sum _{t\in I(r)}\left\|f_{t}-{\frac {1}{I(r)}}\sum _{s\in I(r)}f_{s}\right\|_{H_{k}}^{2}} whereI(r){\displaystyle I(r)} is a cluster of tasks.

This regularizer is similar to the mean-constrained regularizer, but instead enforces similarity between tasks within the same cluster. This can capture more complex prior information. This technique has been used to predict Netflix recommendations. A cluster would correspond to a group of people who share similar preferences.

Graph-based similarity


More generally than above, similarity between tasks can be defined by a function. The regularizer encourages the model to learn similar functions for similar tasks.
$$R(f_1 \cdots f_T) = \sum_{t, s = 1, \, t \neq s}^T \|f_t - f_s\|^2 M_{ts}$$
for a given symmetric similarity matrix $M$.

Other uses of regularization in statistics and machine learning


Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting not involving regularization include cross-validation.

Examples of applications of different methods of regularization to the linear model are:

Model | Fit measure | Entropy measure[5][8]
AIC/BIC | $\|Y - X\beta\|_2$ | $\|\beta\|_0$
Lasso[9] | $\|Y - X\beta\|_2$ | $\|\beta\|_1$
Ridge regression[10] | $\|Y - X\beta\|_2$ | $\|\beta\|_2$
Basis pursuit denoising | $\|Y - X\beta\|_2$ | $\lambda \|\beta\|_1$
Rudin–Osher–Fatemi model (TV) | $\|Y - X\beta\|_2$ | $\lambda \|\nabla \beta\|_1$
Potts model | $\|Y - X\beta\|_2$ | $\lambda \|\nabla \beta\|_0$
RLAD[11] | $\|Y - X\beta\|_1$ | $\|\beta\|_1$
Dantzig Selector[12] | $\|X^\mathsf{T}(Y - X\beta)\|_\infty$ | $\|\beta\|_1$
SLOPE[13] | $\|Y - X\beta\|_2$ | $\sum_{i=1}^p \lambda_i |\beta|_{(i)}$


Notes

  1. ^ Kratsios, Anastasis (2020). "Deep Arbitrage-Free Learning in a Generalized HJM Framework via Arbitrage-Regularization Data". Risks. 8 (2). arXiv:1710.05114. doi:10.3390/risks8020040. hdl:20.500.11850/456375. "Term structure models can be regularized to remove arbitrage opportunities."
  2. ^ Bühlmann, Peter; Van De Geer, Sara (2011). Statistics for High-Dimensional Data. Springer Series in Statistics. p. 9. doi:10.1007/978-3-642-20192-9. ISBN 978-3-642-20191-2. "If p > n, the ordinary least squares estimator is not unique and will heavily overfit the data. Thus, a form of complexity regularization will be necessary."
  3. ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning Book. Retrieved 2021-01-29.
  4. ^ a b c d Guo, Jingru. "AI Notes: Regularizing neural networks". deeplearning.ai. Retrieved 2024-02-04.
  5. ^ a b Bishop, Christopher M. (2007). Pattern Recognition and Machine Learning (Corr. printing ed.). New York: Springer. ISBN 978-0-387-31073-2.
  6. ^ For the connection between maximum a posteriori estimation and ridge regression, see Weinberger, Kilian (July 11, 2018). "Linear / Ridge Regression". CS4780 Machine Learning Lecture 13. Cornell.
  7. ^ Natarajan, B. (1995-04-01). "Sparse Approximate Solutions to Linear Systems". SIAM Journal on Computing. 24 (2): 227–234. doi:10.1137/S0097539792240406. ISSN 0097-5397. S2CID 2072045.
  8. ^ Duda, Richard O. (2004). Pattern Classification + Computer Manual: Hardcover Set (2nd ed.). New York: Wiley. ISBN 978-0-471-70350-1.
  9. ^ Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B. 58 (1): 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x. MR 1379242. Archived from the original (PostScript) on 2008-10-31. Retrieved 2009-03-19.
  10. ^ Hoerl, Arthur E.; Kennard, Robert W. (1970). "Ridge regression: Biased estimation for nonorthogonal problems". Technometrics. 12 (1): 55–67. doi:10.2307/1267351. JSTOR 1267351.
  11. ^ Wang, Li; Gordon, Michael D.; Zhu, Ji (2006). "Regularized Least Absolute Deviations Regression and an Efficient Algorithm for Parameter Tuning". Sixth International Conference on Data Mining. pp. 690–700. doi:10.1109/ICDM.2006.134. ISBN 978-0-7695-2701-7.
  12. ^ Candes, Emmanuel; Tao, Terence (2007). "The Dantzig selector: Statistical estimation when p is much larger than n". Annals of Statistics. 35 (6): 2313–2351. arXiv:math/0506081. doi:10.1214/009053606000001523. MR 2382644. S2CID 88524200.
  13. ^ Bogdan, Małgorzata; van den Berg, Ewout; Su, Weijie; Candes, Emmanuel J. (2013). "Statistical estimation and testing via the ordered L1 norm". arXiv:1310.1969 [stat.ME].
