Regularization by spectral filtering

From Wikipedia, the free encyclopedia

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. In the email classification example, for instance, spectral regularization reduces the impact of noise and prevents overfitting when a machine learning system is trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

Spectral regularization algorithms rely on methods that were originally defined and studied in the theory of ill-posed inverse problems (for instance, see [1]), focusing on the inversion of a linear operator (or a matrix) that possibly has a bad condition number or an unbounded inverse. In this context, regularization amounts to substituting the original operator by a bounded operator called the "regularization operator" that has a condition number controlled by a regularization parameter,[2] a classical example being Tikhonov regularization. To ensure stability, this regularization parameter is tuned based on the level of noise.[2] The main idea behind spectral regularization is that each regularization operator can be described using spectral calculus as an appropriate filter on the eigenvalues of the operator that defines the problem, and the role of the filter is to "suppress the oscillatory behavior corresponding to small eigenvalues".[2] Therefore, each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function (which needs to be derived for that particular algorithm). Three of the most commonly used regularization algorithms for which spectral filtering is well studied are Tikhonov regularization, Landweber iteration, and truncated singular value decomposition (TSVD). As for choosing the regularization parameter, examples of candidate methods to compute this parameter include the discrepancy principle, generalized cross-validation, and the L-curve criterion.[3]

The notion of spectral filtering studied in the context of machine learning is closely connected to the literature on function approximation in signal processing.

Notation


The training set is defined as $S=\{(x_1,y_1),\dots,(x_n,y_n)\}$, where $X$ is the $n\times d$ input matrix and $Y=(y_1,\dots,y_n)$ is the output vector. Where applicable, the kernel function is denoted by $k$, and the $n\times n$ kernel matrix is denoted by $K$, which has entries $K_{ij}=k(x_i,x_j)$; $\mathcal{H}$ denotes the reproducing kernel Hilbert space (RKHS) with kernel $k$. The regularization parameter is denoted by $\lambda$.

(Note: For $g\in G$ and $f\in F$, with $G$ and $F$ being Hilbert spaces, given a linear, continuous operator $L$, assume that $g=Lf$ holds. In this setting, the direct problem would be to solve for $g$ given $f$, and the inverse problem would be to solve for $f$ given $g$. If the solution exists, is unique, and is stable, the inverse problem (i.e., the problem of solving for $f$) is well-posed; otherwise, it is ill-posed.)
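As a concrete illustration of the notation above, the sketch below builds the kernel matrix $K$ from a synthetic training set. The Gaussian kernel, the helper name gaussian_kernel, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(X1, X2, width=1.0):
    """K_ij = k(x_i, x_j) for a Gaussian (RBF) kernel; the kernel choice is an assumption."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * width ** 2))

# Training set S = {(x_1, y_1), ..., (x_n, y_n)}, with X of shape (n, d).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # synthetic noisy outputs

K = gaussian_kernel(X, X)   # the n x n kernel matrix
```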

Relation to the theory of ill-posed inverse problems


The connection between the regularized least squares (RLS) estimation problem (the Tikhonov regularization setting) and the theory of ill-posed inverse problems illustrates how spectral regularization algorithms are related to that theory.

The RLS estimator solves
$$\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda\|f\|_{\mathcal{H}}^2,$$
and the RKHS allows for expressing this RLS estimator as $f_S^\lambda(X)=\sum_{i=1}^{n}c_i k(x,x_i)$, where $(K+n\lambda I)c=Y$ with $c=(c_1,\dots,c_n)$.[4] The penalization term is used for controlling smoothness and preventing overfitting. Since the solution of empirical risk minimization
$$\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$$
can be written as $f_S^\lambda(X)=\sum_{i=1}^{n}c_i k(x,x_i)$ such that $Kc=Y$, adding the penalty function amounts to the following change in the system that needs to be solved:[5]
$$\left\{\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \;\rightarrow\; \min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda\|f\|_{\mathcal{H}}^2\right\} \;\equiv\; \Bigl\{Kc=Y \;\rightarrow\; (K+n\lambda I)c=Y\Bigr\}.$$
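A minimal sketch of solving the resulting linear system is shown below, assuming NumPy and the hypothetical gaussian_kernel helper and data from the notation sketch above; the function names are illustrative, not a standard API.

```python
import numpy as np

def rls_fit(K, Y, lam):
    """Solve (K + n*lam*I) c = Y for the RLS coefficients c."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), Y)

def rls_predict(X_train, X_new, c, kernel):
    """Evaluate f(x) = sum_i c_i k(x, x_i) at new points."""
    return kernel(X_new, X_train) @ c

# Example usage with K, X, Y from the notation sketch:
# c = rls_fit(K, Y, lam=0.1)
# y_hat = rls_predict(X, X_new, c, gaussian_kernel)
```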

In this learning setting, the kernel matrix can be decomposed as $K=Q\Sigma Q^T$, with
$$\Sigma = \operatorname{diag}(\sigma_1,\dots,\sigma_n), \qquad \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0,$$
where $q_1,\dots,q_n$ (the columns of $Q$) are the corresponding eigenvectors. Therefore, in the initial learning setting, the following holds:
$$c = K^{-1}Y = Q\Sigma^{-1}Q^T Y = \sum_{i=1}^{n} \frac{1}{\sigma_i}\langle q_i, Y\rangle q_i.$$

Thus, for small eigenvalues, even small perturbations in the data can lead to considerable changes in the solution. Hence, the problem is ill-conditioned, and solving this RLS problem amounts to stabilizing a possibly ill-conditioned matrix inversion problem, which is studied in the theory of ill-posed inverse problems; in both problems, a main concern is to deal with the issue of numerical stability.
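The sensitivity to small eigenvalues can be checked numerically. The sketch below uses a synthetic symmetric positive semi-definite matrix with a rapidly decaying spectrum (the spectrum and noise level are assumptions chosen only for illustration) and shows how a tiny perturbation of $Y$ produces a much larger relative change in $c = K^{-1}Y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "kernel" matrix with eigenvalues decaying from 1 down to 1e-8.
n = 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.logspace(0, -8, n)
K = Q @ np.diag(sigma) @ Q.T

Y = rng.standard_normal(n)
Y_noisy = Y + 1e-6 * rng.standard_normal(n)   # tiny perturbation of the data

c = np.linalg.solve(K, Y)
c_noisy = np.linalg.solve(K, Y_noisy)

# The 1/sigma_i factors amplify the perturbation along the small-eigenvalue directions.
print("relative change in Y:", np.linalg.norm(Y_noisy - Y) / np.linalg.norm(Y))
print("relative change in c:", np.linalg.norm(c_noisy - c) / np.linalg.norm(c))
```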

Implementation of algorithms


Each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function, denoted here by $G_\lambda(\cdot)$. If the kernel matrix is denoted by $K$, then $\lambda$ should control the magnitude of the smaller eigenvalues of $G_\lambda(K)$. In a filtering setup, the goal is to find estimators $f_S^\lambda(X) := \sum_{i=1}^{n}c_i k(x,x_i)$, where $c = G_\lambda(K)Y$. To do so, a scalar filter function $G_\lambda(\sigma)$ is defined using the eigen-decomposition of the kernel matrix:
$$G_\lambda(K) = Q\,G_\lambda(\Sigma)\,Q^T,$$
which yields
$$G_\lambda(K)Y = \sum_{i=1}^{n} G_\lambda(\sigma_i)\langle q_i, Y\rangle q_i.$$
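A minimal sketch of this filtering recipe, assuming NumPy, is given below. The helper name spectral_filter_coefficients is an assumption (it is reused in the later sketches), and the concrete scalar filter is supplied by the caller.

```python
import numpy as np

def spectral_filter_coefficients(K, Y, filter_fn):
    """Compute c = G_lambda(K) Y by applying a scalar filter to K's eigenvalues.

    `filter_fn` maps an eigenvalue sigma to G_lambda(sigma); the concrete
    filter (Tikhonov, Landweber, TSVD, ...) is chosen by the caller.
    """
    sigma, Q = np.linalg.eigh(K)          # eigen-decomposition K = Q diag(sigma) Q^T
    filtered = filter_fn(sigma)           # G_lambda(sigma_i), applied elementwise
    return Q @ (filtered * (Q.T @ Y))     # sum_i G_lambda(sigma_i) <q_i, Y> q_i
```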

Typically, an appropriate filter function should have the following properties:[5]

  1. As $\lambda$ goes to zero, $G_\lambda(\sigma) \rightarrow 1/\sigma$.
  2. The magnitude of the (smaller) eigenvalues of $G_\lambda$ is controlled by $\lambda$.

While the above items give a rough characterization of the general properties of filter functions for all spectral regularization algorithms, the derivation of the filter function (and hence its exact form) varies depending on the specific regularization method that spectral filtering is applied to.

Filter function for Tikhonov regularization


In the Tikhonov regularization setting, the filter function for RLS is described below. As shown in [4], in this setting, $c = (K + n\lambda I)^{-1}Y$. Thus,
$$c = (K + n\lambda I)^{-1}Y = Q(\Sigma + n\lambda I)^{-1}Q^T Y = \sum_{i=1}^{n}\frac{1}{\sigma_i + n\lambda}\langle q_i, Y\rangle q_i.$$

The undesired components are filtered out using regularization: if $\sigma \gg n\lambda$, then $1/(\sigma + n\lambda) \approx 1/\sigma$, so the component is essentially kept, whereas if $\sigma \ll n\lambda$, then $1/(\sigma + n\lambda) \approx 1/(n\lambda)$, so the contribution of small eigenvalues is damped rather than amplified.

The filter function for Tikhonov regularization is therefore defined as:[5]
$$G_\lambda(\sigma) = \frac{1}{\sigma + n\lambda}.$$
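Under the same assumptions as before, plugging this filter into the generic helper sketched earlier might look as follows; the function names are illustrative, not a standard API.

```python
# Tikhonov filter: G_lambda(sigma) = 1 / (sigma + n*lambda), used with the
# spectral_filter_coefficients helper sketched above (an assumed helper name).
def tikhonov_filter(sigma, lam, n):
    return 1.0 / (sigma + n * lam)

# Example usage (K, Y as in the earlier sketches):
# c = spectral_filter_coefficients(K, Y, lambda s: tikhonov_filter(s, lam=0.1, n=K.shape[0]))
```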

Filter function for Landweber iteration


The idea behind the Landweber iteration is gradient descent:[5]

    c_0 := 0
    for i = 1, ..., t − 1
        c_i := c_{i−1} + η(Y − Kc_{i−1})
    end

In this setting, if $n$ is larger than $K$'s largest eigenvalue, the above iteration converges by choosing $\eta = 2/n$ as the step size.[5] The above iteration is equivalent to minimizing $\frac{1}{n}\|Y - Kc\|_2^2$ (i.e., the empirical risk) via gradient descent; using induction, it can be proved that at the $t$-th iteration, the solution is given by[5]
$$c = \eta\sum_{i=0}^{t-1}(I - \eta K)^i Y.$$
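A minimal sketch of the iteration, assuming NumPy, is shown below. The default step size is one conservative choice satisfying the convergence condition $\eta < 2/\sigma_{\max}(K)$, not part of the original algorithm, and the function name is an assumption.

```python
import numpy as np

def landweber_coefficients(K, Y, t, eta=None):
    """Run t Landweber (gradient-descent) steps for the system K c = Y."""
    n = K.shape[0]
    if eta is None:
        eta = 1.0 / np.linalg.eigvalsh(K).max()   # assumption: a safe step size
    c = np.zeros(n)
    for _ in range(t):
        c = c + eta * (Y - K @ c)                 # c_i = c_{i-1} + eta (Y - K c_{i-1})
    return c
```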

Thus, the appropriate filter function is defined by:
$$G_\lambda(\sigma) = \eta\sum_{i=0}^{t-1}(1 - \eta\sigma)^i.$$

It can be shown that this filter function corresponds to a truncated power expansion of $K^{-1}$;[5] to see this, note that the relation $\sum_{i\geq 0} x^i = 1/(1-x)$ (valid for $|x| < 1$) would still hold if $x$ is replaced by a matrix; thus, if $K$ (the kernel matrix), or rather $I - \eta K$, is considered, the following holds:
$$K^{-1} = \eta\sum_{i=0}^{\infty}(I - \eta K)^i \approx \eta\sum_{i=0}^{t-1}(I - \eta K)^i.$$
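The truncated-expansion view can be checked numerically. The sketch below builds the partial sum $\eta\sum_{i=0}^{t-1}(I-\eta K)^i Y$ on a small synthetic positive-definite matrix and compares it with the exact solve; the matrix, data, and parameter values are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small symmetric positive-definite matrix standing in for a kernel matrix.
n = 5
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)
Y = rng.standard_normal(n)

eta = 1.0 / np.linalg.eigvalsh(K).max()
t = 200

# Truncated power expansion: c = eta * sum_{i=0}^{t-1} (I - eta*K)^i Y
c_series = np.zeros(n)
term = Y.copy()
for _ in range(t):
    c_series += eta * term
    term = (np.eye(n) - eta * K) @ term

# For large t this approaches the exact solution K^{-1} Y.
print(np.linalg.norm(c_series - np.linalg.solve(K, Y)))
```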

In this setting, the number of iterations gives the regularization parameter; roughly speaking, $t \sim 1/\lambda$.[5] If $t$ is large, overfitting may be a concern. If $t$ is small, oversmoothing may be a concern. Thus, choosing an appropriate time for early stopping of the iterations provides a regularization effect.

Filter function for TSVD


In the TSVD setting, given the eigen-decomposition $K = Q\Sigma Q^T$ and using a prescribed threshold $\lambda n$, a regularized inverse can be formed for the kernel matrix by discarding all the eigenvalues that are smaller than this threshold.[5] Thus, the filter function for TSVD can be defined as
$$G_\lambda(\sigma) = \begin{cases} 1/\sigma, & \text{if } \sigma \geq \lambda n \\ 0, & \text{otherwise.} \end{cases}$$
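As with the other filters, a sketch of the TSVD filter under the earlier assumptions might look like this; the small guard against division by zero is an implementation detail, not part of the definition, and the helper name spectral_filter_coefficients is again an assumption.

```python
import numpy as np

# TSVD filter: keep 1/sigma for eigenvalues at or above the threshold lam*n,
# and zero out the rest.
def tsvd_filter(sigma, lam, n):
    # np.where evaluates both branches, so guard the division for zero eigenvalues.
    return np.where(sigma >= lam * n, 1.0 / np.maximum(sigma, 1e-300), 0.0)

# Example usage (K, Y as in the earlier sketches):
# c = spectral_filter_coefficients(K, Y, lambda s: tsvd_filter(s, lam=0.01, n=K.shape[0]))
```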

It can be shown that TSVD is equivalent to the (unsupervised) projection of the data using (kernel) Principal Component Analysis (PCA), and that it is also equivalent to minimizing the empirical risk on the projected data (without regularization).[5] Note that the number of components kept for the projection is the only free parameter here.

References

  1. H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, 1996.
  2. L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. "Spectral Algorithms for Supervised Learning", Neural Computation, 20(7), 2008.
  3. P. C. Hansen, J. G. Nagy, and D. P. O'Leary. Deblurring Images: Matrices, Spectra, and Filtering, Fundamentals of Algorithms 3, SIAM, Philadelphia, 2006.
  4. L. Rosasco. Lecture 6 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class06/class06_RLSSVM.pdf
  5. L. Rosasco. Lecture 7 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class07/class07_spectral.pdf