Sparse dictionary learning

From Wikipedia, the free encyclopedia

Sparse dictionary learning (also known as sparse coding or SDL) is a representation learning method which aims to find a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms, and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may form an over-complete spanning set. This problem setup also allows the dimensionality of the representation to be higher than that of the signals being observed. These two properties lead to seemingly redundant atoms that allow multiple representations of the same signal, but also provide an improvement in the sparsity and flexibility of the representation.

One of the most important applications of sparse dictionary learning is in the field of compressed sensing or signal recovery. In compressed sensing, a high-dimensional signal can be recovered with only a few linear measurements, provided that the signal is sparse or nearly sparse. Since not all signals satisfy this condition, it is crucial to find a sparse representation of the signal, such as the wavelet transform or the directional gradient of a rasterized matrix. Once a matrix or a high-dimensional vector is transformed into a sparse space, different recovery algorithms like basis pursuit, CoSaMP,[1] or fast non-iterative algorithms[2] can be used to recover the signal.

One of the key principles of dictionary learning is that the dictionary has to be inferred from the input data. The emergence of sparse dictionary learning methods was stimulated by the fact that in signal processing one typically wants to represent the input data using as few components as possible. Before this approach, the general practice was to use predefined dictionaries such as Fourier or wavelet transforms. However, in certain cases a dictionary that is trained to fit the input data can significantly improve the sparsity, which has applications in data decomposition, compression, and analysis, and has been used in the fields of image denoising and classification, and video and audio processing. Sparsity and overcomplete dictionaries have immense applications in image compression, image fusion, and inpainting.

[Figure: Image denoising by dictionary learning]

Problem statement


Given the input dataset $X=[x_{1},...,x_{K}]$, $x_{i}\in \mathbb{R}^{d}$, we wish to find a dictionary $\mathbf{D}\in \mathbb{R}^{d\times n}$, $\mathbf{D}=[d_{1},...,d_{n}]$, and a representation $R=[r_{1},...,r_{K}]$, $r_{i}\in \mathbb{R}^{n}$, such that $\|X-\mathbf{D}R\|_{F}^{2}$ is minimized and the representations $r_{i}$ are sparse enough. This can be formulated as the following optimization problem:

$$\underset{\mathbf{D}\in\mathcal{C},\,r_{i}\in\mathbb{R}^{n}}{\operatorname{argmin}}\ \sum_{i=1}^{K}\|x_{i}-\mathbf{D}r_{i}\|_{2}^{2}+\lambda\|r_{i}\|_{0}, \qquad \mathcal{C}\equiv\{\mathbf{D}\in\mathbb{R}^{d\times n}:\|d_{i}\|_{2}\leq 1\ \forall i=1,...,n\},\ \lambda>0$$

$\mathcal{C}$ is required to constrain $\mathbf{D}$ so that its atoms cannot reach arbitrarily high values, which would allow arbitrarily low (but non-zero) values of $r_{i}$. $\lambda$ controls the trade-off between sparsity and the minimization error.

The minimization problem above is not convex because of the $\ell_{0}$-"norm", and solving this problem is NP-hard.[3] In some cases the $\ell_{1}$-norm is known to ensure sparsity,[4] and so the above becomes a convex optimization problem with respect to each of the variables $\mathbf{D}$ and $\mathbf{R}$ when the other one is fixed, but it is not jointly convex in $(\mathbf{D},\mathbf{R})$.
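
To make the objective concrete, the following NumPy sketch evaluates the $\ell_{0}$-penalized cost for a given dictionary and representation. The sizes, the random data, and the thresholding used to produce a sparse $R$ are illustrative assumptions, not part of the problem statement itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, K = 16, 32, 100                  # illustrative sizes: signal dim, atoms, samples
lam = 0.5                              # sparsity weight (lambda)

X = rng.standard_normal((d, K))        # input data, one signal per column
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)         # unit-norm atoms, i.e. D in the constraint set C
R = rng.standard_normal((n, K))
R[np.abs(R) < 1.5] = 0.0               # crude way to get sparse representations

# objective: sum_i ||x_i - D r_i||_2^2 + lambda * ||r_i||_0
cost = np.sum((X - D @ R) ** 2) + lam * np.count_nonzero(R)
print(cost)
```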

Properties of the dictionary


The dictionary $\mathbf{D}$ defined above can be "undercomplete" if $n<d$ or "overcomplete" if $n>d$, with the latter being a typical assumption for a sparse dictionary learning problem. The case of a complete dictionary does not provide any improvement from a representational point of view and thus is not considered.

Undercomplete dictionaries represent the setup in which the actual input data lies in a lower-dimensional space. This case is strongly related to dimensionality reduction and techniques like principal component analysis, which require the atoms $d_{1},...,d_{n}$ to be orthogonal. The choice of these subspaces is crucial for efficient dimensionality reduction, but it is not trivial. Dimensionality reduction based on dictionary representation can be extended to address specific tasks such as data analysis or classification. However, the main downside is the limited choice of atoms.

Overcomplete dictionaries, however, do not require the atoms to be orthogonal (they never form a basis anyway), thus allowing for more flexible dictionaries and richer data representations.

An overcomplete dictionary which allows for a sparse representation of a signal can be a well-known transform matrix (e.g., the wavelet transform or the Fourier transform), or it can be learned so that its elements are adapted to represent the given signal as sparsely as possible. Learned dictionaries are capable of giving sparser solutions than predefined transform matrices.

Algorithms


As the optimization problem described above can be solved as a convex problem with respect to either the dictionary or the sparse coding while the other of the two is fixed, most algorithms are based on the idea of iteratively updating one and then the other.

The problem of finding an optimal sparse coding $R$ with a given dictionary $\mathbf{D}$ is known as sparse approximation (or sometimes just the sparse coding problem). A number of algorithms have been developed to solve it (such as matching pursuit and LASSO) and are incorporated in the algorithms described below.
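
As an illustration of the sparse approximation subproblem, the sketch below implements a minimal orthogonal matching pursuit for a single signal and a fixed dictionary. It is a simplified, hedged example (dense NumPy linear algebra, no stopping tolerance), not a reference implementation of any particular library.

```python
import numpy as np

def omp(D, x, T):
    """Greedy orthogonal matching pursuit: at most T non-zero coefficients."""
    n = D.shape[1]
    r = np.zeros(n)
    support = []
    residual = x.copy()
    for _ in range(T):
        j = int(np.argmax(np.abs(D.T @ residual)))   # atom most correlated with the residual
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)  # least-squares refit on the support
        r[:] = 0.0
        r[support] = coef
        residual = x - D @ r
    return r
```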

Method of optimal directions (MOD)


The method of optimal directions (or MOD) was one of the first methods introduced to tackle the sparse dictionary learning problem.[5] Its core idea is to solve the minimization problem subject to a limited number of non-zero components of the representation vector:

$$\min_{\mathbf{D},R}\ \|X-\mathbf{D}R\|_{F}^{2} \quad \text{s.t.} \quad \forall i,\ \|r_{i}\|_{0}\leq T$$

Here, $\|\cdot\|_{F}$ denotes the Frobenius norm. MOD alternates between obtaining the sparse coding using a method such as matching pursuit and updating the dictionary by computing the analytical solution of the problem given by $\mathbf{D}=XR^{+}$, where $R^{+}$ is the Moore–Penrose pseudoinverse. After this update, $\mathbf{D}$ is renormalized to fit the constraints and the new sparse coding is obtained again. The process is repeated until convergence (or until the residue is sufficiently small).

MOD has proved to be a very efficient method for low-dimensional input data $X$, requiring just a few iterations to converge. However, due to the high complexity of the matrix-inversion operation, computing the pseudoinverse in high-dimensional cases is often intractable. This shortcoming has inspired the development of other dictionary learning methods.
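
A minimal sketch of the MOD iteration is shown below. The sparse coding step here is a crude correlation thresholding used only as a stand-in for matching pursuit, and the sizes and iteration count are arbitrary assumptions.

```python
import numpy as np

def mod(X, n, T, n_iter=20, seed=0):
    """Method of optimal directions: alternate sparse coding and the pseudoinverse
    dictionary update D = X R^+ (a simplified sketch, not a tuned implementation)."""
    rng = np.random.default_rng(seed)
    d, K = X.shape
    D = rng.standard_normal((d, n))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # sparse coding: keep the T largest correlations per signal
        # (a crude stand-in for matching pursuit, for illustration only)
        R = D.T @ X
        smallest = np.argsort(np.abs(R), axis=0)[:-T, :]
        np.put_along_axis(R, smallest, 0.0, axis=0)
        # dictionary update via the Moore-Penrose pseudoinverse, then renormalization
        D = X @ np.linalg.pinv(R)
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D, R
```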

K-SVD

Main article: K-SVD

K-SVD is an algorithm that performs SVD at its core to update the atoms of the dictionary one by one, and it is essentially a generalization of K-means. It enforces that each element of the input data $x_{i}$ is encoded by a linear combination of not more than $T_{0}$ elements, in a way identical to the MOD approach:

$$\min_{\mathbf{D},R}\ \|X-\mathbf{D}R\|_{F}^{2} \quad \text{s.t.} \quad \forall i,\ \|r_{i}\|_{0}\leq T_{0}$$

The essence of this algorithm is to first fix the dictionary, find the best possible $R$ under the above constraint (using orthogonal matching pursuit), and then iteratively update the atoms of the dictionary $\mathbf{D}$ by rewriting the objective as follows:

$$\|X-\mathbf{D}R\|_{F}^{2}=\left\|X-\sum_{i=1}^{n}d_{i}x_{T}^{i}\right\|_{F}^{2}=\|E_{k}-d_{k}x_{T}^{k}\|_{F}^{2},$$

where $x_{T}^{i}$ denotes the $i$-th row of $R$ and $E_{k}=X-\sum_{i\neq k}d_{i}x_{T}^{i}$ is the approximation error with the $k$-th atom removed.

The next steps of the algorithm include a rank-1 approximation of the residual matrix $E_{k}$, updating $d_{k}$, and enforcing the sparsity of $x_{T}^{k}$ after the update. This algorithm is considered standard for dictionary learning and is used in a variety of applications. However, it shares weaknesses with MOD, being efficient only for signals with relatively low dimensionality and having the possibility of getting stuck at local minima.
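
The atom update at the heart of K-SVD can be sketched as follows: the residual is restricted to the signals that actually use atom $d_{k}$, a rank-1 SVD approximation gives the new atom, and the existing support of the coefficients is preserved. This is a hedged illustration of one inner step, not the full algorithm.

```python
import numpy as np

def ksvd_atom_update(X, D, R, k):
    """Update atom d_k and the k-th row of R via a rank-1 approximation of E_k."""
    omega = np.nonzero(R[k, :])[0]                 # signals that use atom k
    if omega.size == 0:
        return D, R                                # unused atom: leave unchanged
    E_k = X - D @ R + np.outer(D[:, k], R[k, :])   # error with atom k's contribution removed
    U, s, Vt = np.linalg.svd(E_k[:, omega], full_matrices=False)
    D[:, k] = U[:, 0]                              # new atom: leading left singular vector
    R[k, omega] = s[0] * Vt[0, :]                  # new coefficients on the same support
    return D, R
```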

Stochastic gradient descent

Main article: Stochastic gradient descent

One can also apply a widespread stochastic gradient descent method with iterative projection to solve this problem.[6] The idea of this method is to update the dictionary using the first-order stochastic gradient and project it onto the constraint set $\mathcal{C}$. The step that occurs at the $i$-th iteration is described by this expression:

$$\mathbf{D}_{i}=\operatorname{proj}_{\mathcal{C}}\left\{\mathbf{D}_{i-1}-\delta_{i}\nabla_{\mathbf{D}}\sum_{j\in S}\left(\|x_{j}-\mathbf{D}r_{j}\|_{2}^{2}+\lambda\|r_{j}\|_{1}\right)\right\},$$

where $S$ is a random subset of $\{1,...,K\}$ and $\delta_{i}$ is the gradient step size.
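
A single projected gradient step on the dictionary (with the sparse codes held fixed) might look like the sketch below; the mini-batch selection and step size are assumptions of the example. Since the $\ell_{1}$ term does not depend on $\mathbf{D}$, only the quadratic data term contributes to the gradient.

```python
import numpy as np

def sgd_dictionary_step(D, X, R, batch, step):
    """One projected stochastic gradient step on the dictionary, sparse codes fixed."""
    Xb, Rb = X[:, batch], R[:, batch]          # mini-batch S of signals and their codes
    grad = -2.0 * (Xb - D @ Rb) @ Rb.T         # gradient of sum_j ||x_j - D r_j||_2^2
    D = D - step * grad
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms                           # projection onto C: shrink atoms with norm > 1
```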

Lagrange dual method


An algorithm based on solving a dual Lagrangian problem provides an efficient way to solve for the dictionary without the complications induced by the sparsity function.[7] Consider the following Lagrangian:

$$\mathcal{L}(\mathbf{D},\Lambda)=\operatorname{tr}\left((X-\mathbf{D}R)^{T}(X-\mathbf{D}R)\right)+\sum_{j=1}^{n}\lambda_{j}\left(\sum_{i=1}^{d}\mathbf{D}_{ij}^{2}-c\right),$$

where $c$ is a constraint on the norm of the atoms and the $\lambda_{j}$ are the so-called dual variables forming the diagonal matrix $\Lambda$.

We can then provide an analytical expression for the Lagrange dual after minimization over $\mathbf{D}$:

$$\mathcal{D}(\Lambda)=\min_{\mathbf{D}}\mathcal{L}(\mathbf{D},\Lambda)=\operatorname{tr}\left(X^{T}X-XR^{T}(RR^{T}+\Lambda)^{-1}(XR^{T})^{T}-c\Lambda\right).$$

After applying one of the optimization methods to the value of the dual (such as Newton's method or conjugate gradient), we obtain the value of $\mathbf{D}$:

$$\mathbf{D}^{T}=(RR^{T}+\Lambda)^{-1}(XR^{T})^{T}$$

Solving this problem is computationally cheaper because the number of dual variables $n$ is often much smaller than the number of variables in the primal problem.
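
Given the sparse codes $R$ and a value of the dual variables, the dictionary follows in closed form from the expression above. The sketch below assumes the dual variables are already available (e.g., from Newton's method applied to the dual); it only shows the final linear solve.

```python
import numpy as np

def dictionary_from_dual(X, R, lam_diag):
    """Closed-form dictionary: D^T = (R R^T + Lambda)^{-1} (X R^T)^T."""
    Lam = np.diag(lam_diag)                        # dual variables on the diagonal of Lambda
    D_T = np.linalg.solve(R @ R.T + Lam, (X @ R.T).T)
    return D_T.T
```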

LASSO

Main article: Lasso (statistics)

In this approach, the optimization problem is formulated as:

$$\min_{r\in\mathbb{R}^{n}}\ \|r\|_{1} \quad \text{subject to} \quad \|x-\mathbf{D}r\|_{2}^{2}<\epsilon,$$

where $\epsilon$ is the permitted reconstruction error.

LASSO finds an estimate of $r_{i}$ by minimizing the least-squares error subject to an $\ell_{1}$-norm penalty on the solution vector, formulated as:

$$\min_{r\in\mathbb{R}^{n}}\ \frac{1}{2}\|x-\mathbf{D}r\|_{2}^{2}+\lambda\|r\|_{1},$$

where $\lambda>0$ controls the trade-off between sparsity and the reconstruction error. This gives the globally optimal solution.[8] See also the online dictionary learning (LASSO approach) section below.
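
The penalized problem above can be solved, for example, by proximal gradient descent (ISTA). The sketch below is one simple choice of solver for this LASSO formulation, not the only one; the iteration count is an arbitrary assumption.

```python
import numpy as np

def lasso_sparse_code(D, x, lam, n_iter=200):
    """Solve min_r 0.5*||x - D r||_2^2 + lam*||r||_1 by ISTA (proximal gradient)."""
    L = np.linalg.norm(D, 2) ** 2                # Lipschitz constant of the smooth term
    r = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = r - D.T @ (D @ r - x) / L            # gradient step on the least-squares term
        r = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return r
```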

Parametric training methods


Parametric training methods aim to incorporate the best of both worlds: analytically constructed dictionaries and learned ones.[9] This makes it possible to construct more powerful generalized dictionaries that can potentially be applied to signals of arbitrary size. Notable approaches include:

  • Translation-invariant dictionaries.[10] These dictionaries are composed of the translations of the atoms originating from the dictionary constructed for a finite-size signal patch. This allows the resulting dictionary to provide a representation for a signal of arbitrary size.
  • Multiscale dictionaries.[11] This method focuses on constructing a dictionary that is composed of differently scaled dictionaries to improve sparsity.
  • Sparse dictionaries.[12] This method focuses on not only providing a sparse representation but also constructing a sparse dictionary, which is enforced by the expression $\mathbf{D}=\mathbf{B}\mathbf{A}$, where $\mathbf{B}$ is some pre-defined analytical dictionary with desirable properties such as fast computation and $\mathbf{A}$ is a sparse matrix. This formulation directly combines the fast implementation of analytical dictionaries with the flexibility of sparse approaches (see the sketch after this list).
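
The sketch below illustrates the double-sparsity construction $\mathbf{D}=\mathbf{B}\mathbf{A}$: a fixed DCT-like base dictionary $\mathbf{B}$ combined with a sparse matrix $\mathbf{A}$. The sizes and the number of non-zeros per atom are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 128                             # hypothetical signal dimension and atom count

# B: a fixed analytical base dictionary (an overcomplete DCT-like matrix)
t, k = np.arange(d), np.arange(n)
B = np.cos(np.pi * np.outer(t + 0.5, k) / n)
B /= np.linalg.norm(B, axis=0)

# A: sparse atom representations (each learned atom mixes only a few base atoms)
A = np.zeros((n, n))
for j in range(n):
    support = rng.choice(n, size=5, replace=False)
    A[support, j] = rng.standard_normal(5)

D = B @ A                                  # effective dictionary D = B A
D /= np.linalg.norm(D, axis=0) + 1e-12     # renormalize the atoms

# Applying D to a sparse code r costs two cheap products: B @ (A @ r)
r = np.zeros(n)
r[rng.choice(n, size=3, replace=False)] = 1.0
x_hat = B @ (A @ r)
```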

Online dictionary learning (LASSO approach)


Many common approaches to sparse dictionary learning rely on the fact that the whole input data $X$ (or at least a large enough training dataset) is available to the algorithm. However, this might not be the case in real-world scenarios, as the input data might be too big to fit into memory. The other case where this assumption cannot be made is when the input data arrives in the form of a stream. Such cases lie in the field of study of online learning, which essentially suggests iteratively updating the model as new data points $x$ become available.

A dictionary can be learned in an online manner the following way:[13]

  1. For $t=1,...,T$:
  2. Draw a new sample $x_{t}$
  3. Find a sparse coding using LARS: $r_{t}={\underset{r\in\mathbb{R}^{n}}{\operatorname{argmin}}}\left({\frac{1}{2}}\|x_{t}-\mathbf{D}_{t-1}r\|_{2}^{2}+\lambda\|r\|_{1}\right)$
  4. Update the dictionary using a block-coordinate approach: $\mathbf{D}_{t}={\underset{\mathbf{D}\in{\mathcal{C}}}{\operatorname{argmin}}}\ {\frac{1}{t}}\sum_{i=1}^{t}\left({\frac{1}{2}}\|x_{i}-\mathbf{D}r_{i}\|_{2}^{2}+\lambda\|r_{i}\|_{1}\right)$

This method allows us to gradually update the dictionary as new data become available for sparse representation learning and helps drastically reduce the amount of memory needed to store the dataset (which is often very large).
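
A compact sketch of this online scheme is given below, in the style of Mairal et al.: sufficient statistics are accumulated per sample and the atoms are updated one at a time. The inner proximal-gradient loop is a stand-in for LARS, and the sizes, iteration counts, and regularization value are assumptions of the example.

```python
import numpy as np

def online_dictionary_learning(stream, d, n, lam=0.1, seed=0):
    """One-pass online dictionary learning from an iterable of length-d signals."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((d, n))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((n, n))                   # accumulates r r^T
    B = np.zeros((d, n))                   # accumulates x r^T
    for x in stream:
        # step 3: sparse coding (proximal-gradient iterations as a stand-in for LARS)
        L = np.linalg.norm(D, 2) ** 2
        r = np.zeros(n)
        for _ in range(50):
            z = r - D.T @ (D @ r - x) / L
            r = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        # accumulate sufficient statistics for the surrogate objective
        A += np.outer(r, r)
        B += np.outer(x, r)
        # step 4: block-coordinate dictionary update, one atom at a time
        for j in range(n):
            if A[j, j] > 1e-10:
                u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)   # project onto the constraint set
    return D
```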

Applications


The dictionary learning framework, namely the linear decomposition of an input signal using a few basis elements learned from the data itself, has led to state-of-the-art[citation needed] results in various image and video processing tasks. This technique can be applied to classification problems: if specific dictionaries have been built for each class, the input signal can be classified by finding the dictionary corresponding to the sparsest representation. Dictionary learning also has properties that are useful for signal denoising, since usually one can learn a dictionary to represent the meaningful part of the input signal sparsely, while the noise in the input will have a much less sparse representation.[14]

Sparse dictionary learning has been successfully applied to various image, video, and audio processing tasks as well as to texture synthesis[15] and unsupervised clustering.[16] In evaluations with the bag-of-words model,[17][18] sparse coding was found empirically to outperform other coding approaches on object category recognition tasks.

Dictionary learning is used to analyze medical signals in detail. Such medical signals include those from electroencephalography (EEG), electrocardiography (ECG), magnetic resonance imaging (MRI), functional MRI (fMRI), continuous glucose monitors[19] and ultrasound computer tomography (USCT), where different assumptions are used to analyze each signal.

Dictionary learning has also been applied to passive detection of unknown signals in complex environments. In particular, it enables blind signal detection in time-spreading distortion (TSD) channels, without prior knowledge of the source signal.[20] This approach has shown effectiveness in both simulated and experimental conditions, offering robust performance in low signal-to-noise ratio scenarios.


References

  1. ^ Needell, D.; Tropp, J.A. (2009). "CoSaMP: Iterative signal recovery from incomplete and inaccurate samples". Applied and Computational Harmonic Analysis. 26 (3): 301–321. arXiv:0803.2392. doi:10.1016/j.acha.2008.07.002.
  2. ^ Lotfi, M.; Vidyasagar, M. "A Fast Non-iterative Algorithm for Compressive Sensing Using Binary Measurement Matrices".
  3. ^ Tillmann, A. M. (2015). "On the Computational Intractability of Exact and Approximate Dictionary Learning". IEEE Signal Processing Letters. 22 (1): 45–49.
  4. ^ Donoho, David L. (2006-06-01). "For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution". Communications on Pure and Applied Mathematics. 59 (6): 797–829. doi:10.1002/cpa.20132. ISSN 1097-0312. S2CID 8510060.
  5. ^ Engan, K.; Aase, S.O.; Hakon Husoy, J. (1999-01-01). "Method of optimal directions for frame design". 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258). Vol. 5. pp. 2443–2446. doi:10.1109/ICASSP.1999.760624. ISBN 978-0-7803-5041-0. S2CID 33097614.
  6. ^ Aharon, Michal; Elad, Michael (2008). "Sparse and Redundant Modeling of Image Content Using an Image-Signature-Dictionary". SIAM Journal on Imaging Sciences. 1 (3): 228–247. CiteSeerX 10.1.1.298.6982. doi:10.1137/07070156x.
  7. ^ Lee, Honglak; et al. (2006). "Efficient sparse coding algorithms". Advances in Neural Information Processing Systems.
  8. ^ Kumar, Abhay; Kataria, Saurabh. "Dictionary Learning Based Applications in Image Processing using Convex Optimisation" (PDF).
  9. ^ Rubinstein, R.; Bruckstein, A.M.; Elad, M. (2010-06-01). "Dictionaries for Sparse Representation Modeling". Proceedings of the IEEE. 98 (6): 1045–1057. CiteSeerX 10.1.1.160.527. doi:10.1109/JPROC.2010.2040551. ISSN 0018-9219. S2CID 2176046.
  10. ^ Engan, Kjersti; Skretting, Karl; Husøy, John Håkon (2007-01-01). "Family of Iterative LS-based Dictionary Learning Algorithms, ILS-DLA, for Sparse Signal Representation". Digital Signal Processing. 17 (1): 32–49. Bibcode:2007DSP....17...32E. doi:10.1016/j.dsp.2006.02.002. ISSN 1051-2004.
  11. ^ Mairal, J.; Sapiro, G.; Elad, M. (2008-01-01). "Learning Multiscale Sparse Representations for Image and Video Restoration". Multiscale Modeling & Simulation. 7 (1): 214–241. CiteSeerX 10.1.1.95.6239. doi:10.1137/070697653. ISSN 1540-3459.
  12. ^ Rubinstein, R.; Zibulevsky, M.; Elad, M. (2010-03-01). "Double Sparsity: Learning Sparse Dictionaries for Sparse Signal Approximation". IEEE Transactions on Signal Processing. 58 (3): 1553–1564. Bibcode:2010ITSP...58.1553R. CiteSeerX 10.1.1.183.992. doi:10.1109/TSP.2009.2036477. ISSN 1053-587X. S2CID 7193037.
  13. ^ Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro, Guillermo (2010-03-01). "Online Learning for Matrix Factorization and Sparse Coding". Journal of Machine Learning Research. 11: 19–60. arXiv:0908.0050. Bibcode:2009arXiv0908.0050M. ISSN 1532-4435.
  14. ^ Aharon, M.; Elad, M.; Bruckstein, A. (2006). "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation". IEEE Transactions on Signal Processing. 54 (11): 4311–4322.
  15. ^ Peyré, Gabriel (2008-11-06). "Sparse Modeling of Textures" (PDF). Journal of Mathematical Imaging and Vision. 34 (1): 17–31. doi:10.1007/s10851-008-0120-3. ISSN 0924-9907. S2CID 15994546.
  16. ^ Ramirez, Ignacio; Sprechmann, Pablo; Sapiro, Guillermo (2010-01-01). "Classification and clustering via dictionary learning with structured incoherence and shared features". 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE Computer Society. pp. 3501–3508. doi:10.1109/CVPR.2010.5539964. ISBN 978-1-4244-6984-0. S2CID 206591234.
  17. ^ Koniusz, Piotr; Yan, Fei; Mikolajczyk, Krystian (2013-05-01). "Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection". Computer Vision and Image Understanding. 117 (5): 479–492. CiteSeerX 10.1.1.377.3979. doi:10.1016/j.cviu.2012.10.010. ISSN 1077-3142.
  18. ^ Koniusz, Piotr; Yan, Fei; Gosselin, Philippe Henri; Mikolajczyk, Krystian (2017-02-24). "Higher-order occurrence pooling for bags-of-words: Visual concept detection" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (2): 313–326. Bibcode:2017ITPAM..39..313K. doi:10.1109/TPAMI.2016.2545667. hdl:10044/1/39814. ISSN 0162-8828. PMID 27019477. S2CID 10577592.
  19. ^ AlMatouq, Ali; Laleg-Kirati, Taous-Meriem; Novara, Carlo; Ivana, Rabbone; Vincent, Tyrone (2019-03-15). "Sparse Reconstruction of Glucose Fluxes Using Continuous Glucose Monitors". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 17 (5): 1797–1809. doi:10.1109/TCBB.2019.2905198. hdl:10754/655914. ISSN 1545-5963. PMID 30892232. S2CID 84185121.
  20. ^ Rashid, Rami; Abdi, Ali; Michalopoulou, Zoi-Heleni (2025). "Blind weak signal detection via dictionary learning in time-spreading distortion channels using vector sensors". JASA Express Letters. 5 (6): 064803. doi:10.1121/10.0036919.