Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Levenberg–Marquardt algorithm

From Wikipedia, the free encyclopedia
Algorithm used to solve non-linear least squares problems

Inmathematics and computing, theLevenberg–Marquardt algorithm (LMA or justLM), also known as thedamped least-squares (DLS) method, is used to solvenon-linear least squares problems. These minimization problems arise especially inleast squarescurve fitting. The LMA interpolates between theGauss–Newton algorithm (GNA) and the method ofgradient descent. The LMA is morerobust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed asGauss–Newton using atrust region approach.

The algorithm was first published in 1944 byKenneth Levenberg,[1] while working at theFrankford Army Arsenal. It was rediscovered in 1963 byDonald Marquardt,[2] who worked as astatistician atDuPont, and independently by Girard,[3] Wynne[4] and Morrison.[5]

The LMA is used in many software applications for solving generic curve-fitting problems. By using the Gauss–Newton algorithm it often converges faster than first-order methods.[6] However, like other iterative optimization algorithms, the LMA finds only alocal minimum, which is not necessarily theglobal minimum.

The problem

[edit]

The primary application of the Levenberg–Marquardt algorithm is in the least-squares curve fitting problem: given a set ofm{\displaystyle m} empirical pairs(xi,yi){\displaystyle \left(x_{i},y_{i}\right)} of independent and dependent variables, find the parametersβ{\displaystyle {\boldsymbol {\beta }}} of the model curvef(x,β){\displaystyle f\left(x,{\boldsymbol {\beta }}\right)} so that the sum of the squares of the deviationsS(β){\displaystyle S\left({\boldsymbol {\beta }}\right)} is minimized:

β^argminβS(β)argminβi=1m[yif(xi,β)]2,{\displaystyle {\hat {\boldsymbol {\beta }}}\in \operatorname {argmin} \limits _{\boldsymbol {\beta }}S\left({\boldsymbol {\beta }}\right)\equiv \operatorname {argmin} \limits _{\boldsymbol {\beta }}\sum _{i=1}^{m}\left[y_{i}-f\left(x_{i},{\boldsymbol {\beta }}\right)\right]^{2},} which is assumed to be non-empty.

The solution

[edit]

Like other numeric minimization algorithms, the Levenberg–Marquardt algorithm is aniterative procedure. To start a minimization, the user has to provide an initial guess for the parameter vectorβ{\displaystyle {\boldsymbol {\beta }}}. In cases with only one minimum, an uninformed standard guess likeβT=(1, 1, , 1){\displaystyle {\boldsymbol {\beta }}^{\text{T}}={\begin{pmatrix}1,\ 1,\ \dots ,\ 1\end{pmatrix}}} will work fine; in cases withmultiple minima, the algorithm converges to the global minimum only if the initial guess is already somewhat close to the final solution.

In each iteration step, the parameter vectorβ{\displaystyle {\boldsymbol {\beta }}} is replaced by a new estimateβ+δ{\displaystyle {\boldsymbol {\beta }}+{\boldsymbol {\delta }}}. To determineδ{\displaystyle {\boldsymbol {\delta }}}, the functionf(xi,β+δ){\displaystyle f\left(x_{i},{\boldsymbol {\beta }}+{\boldsymbol {\delta }}\right)} is approximated by itslinearization:

f(xi,β+δ)f(xi,β)+Jiδ,{\displaystyle f\left(x_{i},{\boldsymbol {\beta }}+{\boldsymbol {\delta }}\right)\approx f\left(x_{i},{\boldsymbol {\beta }}\right)+\mathbf {J} _{i}{\boldsymbol {\delta }},}

where

Ji=f(xi,β)β{\displaystyle \mathbf {J} _{i}={\frac {\partial f\left(x_{i},{\boldsymbol {\beta }}\right)}{\partial {\boldsymbol {\beta }}}}}

is thegradient (row-vector in this case) off{\displaystyle f} with respect toβ{\displaystyle {\boldsymbol {\beta }}}.

The sumS(β){\displaystyle S\left({\boldsymbol {\beta }}\right)} of square deviations has its minimum at a zerogradient with respect toβ{\displaystyle {\boldsymbol {\beta }}}. The above first-order approximation off(xi,β+δ){\displaystyle f\left(x_{i},{\boldsymbol {\beta }}+{\boldsymbol {\delta }}\right)} gives

S(β+δ)i=1m[yif(xi,β)Jiδ]2,{\displaystyle S\left({\boldsymbol {\beta }}+{\boldsymbol {\delta }}\right)\approx \sum _{i=1}^{m}\left[y_{i}-f\left(x_{i},{\boldsymbol {\beta }}\right)-\mathbf {J} _{i}{\boldsymbol {\delta }}\right]^{2},}

or in vector notation,

S(β+δ)yf(β)Jδ2=[yf(β)Jδ]T[yf(β)Jδ]=[yf(β)]T[yf(β)][yf(β)]TJδ(Jδ)T[yf(β)]+δTJTJδ=[yf(β)]T[yf(β)]2[yf(β)]TJδ+δTJTJδ.{\displaystyle {\begin{aligned}S\left({\boldsymbol {\beta }}+{\boldsymbol {\delta }}\right)&\approx \left\|\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)-\mathbf {J} {\boldsymbol {\delta }}\right\|^{2}\\&=\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)-\mathbf {J} {\boldsymbol {\delta }}\right]^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)-\mathbf {J} {\boldsymbol {\delta }}\right]\\&=\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]-\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]^{\mathrm {T} }\mathbf {J} {\boldsymbol {\delta }}-\left(\mathbf {J} {\boldsymbol {\delta }}\right)^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]+{\boldsymbol {\delta }}^{\mathrm {T} }\mathbf {J} ^{\mathrm {T} }\mathbf {J} {\boldsymbol {\delta }}\\&=\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]-2\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]^{\mathrm {T} }\mathbf {J} {\boldsymbol {\delta }}+{\boldsymbol {\delta }}^{\mathrm {T} }\mathbf {J} ^{\mathrm {T} }\mathbf {J} {\boldsymbol {\delta }}.\end{aligned}}}

Taking the derivative of this approximation ofS(β+δ){\displaystyle S\left({\boldsymbol {\beta }}+{\boldsymbol {\delta }}\right)} with respect toδ{\displaystyle {\boldsymbol {\delta }}} and setting the result to zero gives

(JTJ)δ=JT[yf(β)],{\displaystyle \left(\mathbf {J} ^{\mathrm {T} }\mathbf {J} \right){\boldsymbol {\delta }}=\mathbf {J} ^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right],}

whereJ{\displaystyle \mathbf {J} } is theJacobian matrix, whosei{\displaystyle i}-th row equalsJi{\displaystyle \mathbf {J} _{i}}, and wheref(β){\displaystyle \mathbf {f} \left({\boldsymbol {\beta }}\right)} andy{\displaystyle \mathbf {y} } are vectors withi{\displaystyle i}-th componentf(xi,β){\displaystyle f\left(x_{i},{\boldsymbol {\beta }}\right)} andyi{\displaystyle y_{i}} respectively. The above expression obtained forβ{\displaystyle {\boldsymbol {\beta }}} comes under the Gauss–Newton method. The Jacobian matrix as defined above is not (in general) a square matrix, but a rectangular matrix of sizem×n{\displaystyle m\times n}, wheren{\displaystyle n} is the number of parameters (size of the vectorβ{\displaystyle {\boldsymbol {\beta }}}). The matrix multiplication(JTJ){\displaystyle \left(\mathbf {J} ^{\mathrm {T} }\mathbf {J} \right)} yields the requiredn×n{\displaystyle n\times n} square matrix and the matrix-vector product on the right hand side yields a vector of sizen{\displaystyle n}. The result is a set ofn{\displaystyle n} linear equations, which can be solved forδ{\displaystyle {\boldsymbol {\delta }}}.

Levenberg's contribution is to replace this equation by a "damped version":

(JTJ+λI)δ=JT[yf(β)],{\displaystyle \left(\mathbf {J} ^{\mathrm {T} }\mathbf {J} +\lambda \mathbf {I} \right){\boldsymbol {\delta }}=\mathbf {J} ^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right],}

whereI{\displaystyle \mathbf {I} } is the identity matrix, giving as the incrementδ{\displaystyle {\boldsymbol {\delta }}} to the estimated parameter vectorβ{\displaystyle {\boldsymbol {\beta }}}.

The (non-negative) damping factorλ{\displaystyle \lambda } is adjusted at each iteration. If reduction ofS{\displaystyle S} is rapid, a smaller value can be used, bringing the algorithm closer to theGauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual,λ{\displaystyle \lambda } can be increased, giving a step closer to the gradient-descent direction. Note that thegradient ofS{\displaystyle S} with respect toβ{\displaystyle {\boldsymbol {\beta }}} equals2(JT[yf(β)])T{\displaystyle -2\left(\mathbf {J} ^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]\right)^{\mathrm {T} }}. Therefore, for large values ofλ{\displaystyle \lambda }, the step will be taken approximately in the direction opposite to the gradient. If either the length of the calculated stepδ{\displaystyle {\boldsymbol {\delta }}} or the reduction of sum of squares from the latest parameter vectorβ+δ{\displaystyle {\boldsymbol {\beta }}+{\boldsymbol {\delta }}} fall below predefined limits, iteration stops, and the last parameter vectorβ{\displaystyle {\boldsymbol {\beta }}} is considered to be the solution.

When the damping factorλ{\displaystyle \lambda } is large relative toJTJ{\displaystyle \|\mathbf {J} ^{\mathrm {T} }\mathbf {J} \|}, invertingJTJ+λI{\displaystyle \mathbf {J} ^{\mathrm {T} }\mathbf {J} +\lambda \mathbf {I} } is not necessary, as the update is well-approximated by the small gradient stepλ1JT[yf(β)]{\displaystyle \lambda ^{-1}\mathbf {J} ^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right]}.

To make the solution scale invariant Marquardt's algorithm solved a modified problem with each component of the gradient scaled according to the curvature. This provides larger movement along the directions where the gradient is smaller, which avoids slow convergence in the direction of small gradient. Fletcher in his 1971 paperA modified Marquardt subroutine for non-linear least squares simplified the form, replacing the identity matrixI{\displaystyle \mathbf {I} } with the diagonal matrix consisting of the diagonal elements ofJTJ{\displaystyle \mathbf {J} ^{\text{T}}\mathbf {J} }:

[JTJ+λdiag(JTJ)]δ=JT[yf(β)].{\displaystyle \left[\mathbf {J} ^{\mathrm {T} }\mathbf {J} +\lambda \operatorname {diag} \left(\mathbf {J} ^{\mathrm {T} }\mathbf {J} \right)\right]{\boldsymbol {\delta }}=\mathbf {J} ^{\mathrm {T} }\left[\mathbf {y} -\mathbf {f} \left({\boldsymbol {\beta }}\right)\right].}

A similar damping factor appears inTikhonov regularization, which is used to solve linearill-posed problems, as well as inridge regression, anestimation technique instatistics.

Choice of damping parameter

[edit]

Various more or less heuristic arguments have been put forward for the best choice for the damping parameterλ{\displaystyle \lambda }. Theoretical arguments exist showing why some of these choices guarantee local convergence of the algorithm; however, these choices can make the global convergence of the algorithm suffer from the undesirable properties ofsteepest descent, in particular, very slow convergence close to the optimum.

The absolute values of any choice depend on how well-scaled the initial problem is. Marquardt recommended starting with a valueλ0{\displaystyle \lambda _{0}} and a factorν>1{\displaystyle \nu >1}. Initially settingλ=λ0{\displaystyle \lambda =\lambda _{0}} and computing the residual sum of squaresS(β){\displaystyle S\left({\boldsymbol {\beta }}\right)} after one step from the starting point with the damping factor ofλ=λ0{\displaystyle \lambda =\lambda _{0}} and secondly withλ0/ν{\displaystyle \lambda _{0}/\nu }. If both of these are worse than the initial point, then the damping is increased by successive multiplication byν{\displaystyle \nu } until a better point is found with a new damping factor ofλ0νk{\displaystyle \lambda _{0}\nu ^{k}} for somek{\displaystyle k}.

If use of the damping factorλ/ν{\displaystyle \lambda /\nu } results in a reduction in squared residual, then this is taken as the new value ofλ{\displaystyle \lambda } (and the new optimum location is taken as that obtained with this damping factor) and the process continues; if usingλ/ν{\displaystyle \lambda /\nu } resulted in a worse residual, but usingλ{\displaystyle \lambda } resulted in a better residual, thenλ{\displaystyle \lambda } is left unchanged and the new optimum is taken as the value obtained withλ{\displaystyle \lambda } as damping factor.

An effective strategy for the control of the damping parameter, calleddelayed gratification, consists of increasing the parameter by a small amount for each uphill step, and decreasing by a large amount for each downhill step. The idea behind this strategy is to avoid moving downhill too fast in the beginning of optimization, therefore restricting the steps available in future iterations and therefore slowing down convergence.[7] An increase by a factor of 2 and a decrease by a factor of 3 has been shown to be effective in most cases, while for large problems more extreme values can work better, with an increase by a factor of 1.5 and a decrease by a factor of 5.[8]

Geodesic acceleration

[edit]

When interpreting the Levenberg–Marquardt step as the velocityvk{\displaystyle {\boldsymbol {v}}_{k}} along ageodesic path in the parameter space, it is possible to improve the method by adding a second order term that accounts for the accelerationak{\displaystyle {\boldsymbol {a}}_{k}} along the geodesic

vk+12ak{\displaystyle {\boldsymbol {v}}_{k}+{\frac {1}{2}}{\boldsymbol {a}}_{k}}

whereak{\displaystyle {\boldsymbol {a}}_{k}} is the solution of

Jkak=fvv.{\displaystyle {\boldsymbol {J}}_{k}{\boldsymbol {a}}_{k}=-f_{vv}.}

Since this geodesic acceleration term depends only on thedirectional derivativefvv=μνvμvνμνf(x){\displaystyle f_{vv}=\sum _{\mu \nu }v_{\mu }v_{\nu }\partial _{\mu }\partial _{\nu }f({\boldsymbol {x}})} along the direction of the velocityv{\displaystyle {\boldsymbol {v}}}, it does not require computing the full second order derivative matrix, requiring only a small overhead in terms of computing cost.[9] Since the second order derivative can be a fairly complex expression, it can be convenient to replace it with afinite difference approximation

fvvifi(x+hδ)2fi(x)+fi(xhδ)h2=2h(fi(x+hδ)fi(x)hJiδ){\displaystyle {\begin{aligned}f_{vv}^{i}&\approx {\frac {f_{i}({\boldsymbol {x}}+h{\boldsymbol {\delta }})-2f_{i}({\boldsymbol {x}})+f_{i}({\boldsymbol {x}}-h{\boldsymbol {\delta }})}{h^{2}}}\\&={\frac {2}{h}}\left({\frac {f_{i}({\boldsymbol {x}}+h{\boldsymbol {\delta }})-f_{i}({\boldsymbol {x}})}{h}}-{\boldsymbol {J}}_{i}{\boldsymbol {\delta }}\right)\end{aligned}}}

wheref(x){\displaystyle f({\boldsymbol {x}})} andJ{\displaystyle {\boldsymbol {J}}} have already been computed by the algorithm, therefore requiring only one additional function evaluation to computef(x+hδ){\displaystyle f({\boldsymbol {x}}+h{\boldsymbol {\delta }})}. The choice of the finite difference steph{\displaystyle h} can affect the stability of the algorithm, and a value of around 0.1 is usually reasonable in general.[8]

Since the acceleration may point in opposite direction to the velocity, to prevent it to stall the method in case the damping is too small, an additional criterion on the acceleration is added in order to accept a step, requiring that

2akvkα{\displaystyle {\frac {2\left\|{\boldsymbol {a}}_{k}\right\|}{\left\|{\boldsymbol {v}}_{k}\right\|}}\leq \alpha }

whereα{\displaystyle \alpha } is usually fixed to a value lesser than 1, with smaller values for harder problems.[8]

The addition of a geodesic acceleration term can allow significant increase in convergence speed and it is especially useful when the algorithm is moving through narrow canyons in the landscape of the objective function, where the allowed steps are smaller and the higher accuracy due to the second order term gives significant improvements.[8]

Example

[edit]
Poor fit
Better fit
Best fit

In this example we try to fit the functiony=acos(bX)+bsin(aX){\displaystyle y=a\cos \left(bX\right)+b\sin \left(aX\right)} using the Levenberg–Marquardt algorithm implemented inGNU Octave as theleasqr function. The graphs show progressively better fitting for the parametersa=100{\displaystyle a=100},b=102{\displaystyle b=102} usedin the initial curve. Only when the parameters in the last graph are chosen closest to the original, are the curves fitting exactly. This equationis an example of very sensitive initial conditions for the Levenberg–Marquardt algorithm. One reason for this sensitivity is the existence of multiple minima — the functioncos(βx){\displaystyle \cos \left(\beta x\right)} has minima at parameter valueβ^{\displaystyle {\hat {\beta }}} andβ^+2nπ{\displaystyle {\hat {\beta }}+2n\pi }.

See also

[edit]

References

[edit]
  1. ^Levenberg, Kenneth (1944)."A Method for the Solution of Certain Non-Linear Problems in Least Squares".Quarterly of Applied Mathematics.2 (2):164–168.doi:10.1090/qam/10666.
  2. ^Marquardt, Donald (1963). "An Algorithm for Least-Squares Estimation of Nonlinear Parameters".SIAM Journal on Applied Mathematics.11 (2):431–441.doi:10.1137/0111030.hdl:10338.dmlcz/104299.
  3. ^Girard, André (1958). "Excerpt fromRevue d'optique théorique et instrumentale".Rev. Opt.37:225–241,397–424.
  4. ^Wynne, C. G. (1959). "Lens Designing by Electronic Digital Computer: I".Proc. Phys. Soc. Lond.73 (5):777–787.Bibcode:1959PPS....73..777W.doi:10.1088/0370-1328/73/5/310.
  5. ^Morrison, David D. (1960). "Methods for nonlinear least squares problems and convergence proofs".Proceedings of the Jet Propulsion Laboratory Seminar on Tracking Programs and Orbit Determination:1–9.
  6. ^Wiliamowski, Bogdan; Yu, Hao (June 2010)."Improved Computation for Levenberg–Marquardt Training"(PDF).IEEE Transactions on Neural Networks and Learning Systems.21 (6).
  7. ^Transtrum, Mark K; Machta, Benjamin B; Sethna, James P (2011). "Geometry of nonlinear least squares with applications to sloppy models and optimization".Physical Review E.83 (3) 036701. APS.arXiv:1010.1449.Bibcode:2011PhRvE..83c6701T.doi:10.1103/PhysRevE.83.036701.PMID 21517619.S2CID 15361707.
  8. ^abcdTranstrum, Mark K; Sethna, James P (2012). "Improvements to the Levenberg-Marquardt algorithm for nonlinear least-squares minimization".arXiv:1201.5885 [physics.data-an].
  9. ^"Nonlinear Least-Squares Fitting". GNU Scientific Library. Archived fromthe original on 2020-04-14.
  10. ^Kanzow, Christian; Yamashita, Nobuo; Fukushima, Masao (2004)."Levenberg–Marquardt methods with strong local convergence properties for solving nonlinear equations with convex constraints".Journal of Computational and Applied Mathematics.172 (2):375–397.Bibcode:2004JCoAM.172..375K.doi:10.1016/j.cam.2004.02.013.

Further reading

[edit]

External links

[edit]
Functions
Gradients
Convergence
Quasi–Newton
Other methods
Hessians
Graph of a strictly concave quadratic function with unique maximum.
Optimization computes maxima and minima.
General
Differentiable
Convex
minimization
Linear and
quadratic
Interior point
Basis-exchange
Paradigms
Graph
algorithms
Minimum
spanning tree
Shortest path
Network flows
Retrieved from "https://en.wikipedia.org/w/index.php?title=Levenberg–Marquardt_algorithm&oldid=1314260172"
Categories:
Hidden categories:

[8]ページ先頭

©2009-2025 Movatter.jp