Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Limited-memory BFGS

From Wikipedia, the free encyclopedia
Optimization algorithm

Limited-memory BFGS (L-BFGS orLM-BFGS) is anoptimizationalgorithm in the collection ofquasi-Newton methods that approximates theBroyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount ofcomputer memory.[1] It is a popular algorithm for parameter estimation inmachine learning.[2][3] The algorithm's target problem is to minimizef(x){\displaystyle f(\mathbf {x} )} over unconstrained values of the real-vectorx{\displaystyle \mathbf {x} } wheref{\displaystyle f} is a differentiable scalar function.

Like the original BFGS, L-BFGS uses an estimate of the inverseHessian matrix to steer its search through variable space, but where BFGS stores a densen×n{\displaystyle n\times n} approximation to the inverse Hessian (n being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with many variables. Instead of the inverse HessianHk, L-BFGS maintains a history of the pastm updates of the positionx and gradient ∇f(x), where generally the history sizem can be small (oftenm<10{\displaystyle m<10}). These updates are used to implicitly do operations requiring theHk-vector product.

Algorithm

[edit]

The algorithm starts with an initial estimate of the optimal value,x0{\displaystyle \mathbf {x} _{0}}, and proceeds iteratively to refine that estimate with a sequence of better estimatesx1,x2,{\displaystyle \mathbf {x} _{1},\mathbf {x} _{2},\ldots }. The derivatives of the functiongk:=f(xk){\displaystyle g_{k}:=\nabla f(\mathbf {x} _{k})} are used as a key driver of the algorithm to identify the direction of steepest descent, and also to form an estimate of the Hessian matrix (second derivative) off(x){\displaystyle f(\mathbf {x} )}.

L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplicationdk=Hkgk{\displaystyle d_{k}=-H_{k}g_{k}} is carried out, wheredk{\displaystyle d_{k}} is the approximate Newton's direction,gk{\displaystyle g_{k}} is the current gradient, andHk{\displaystyle H_{k}} is the inverse of the Hessian matrix. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."[4][5]

We take as givenxk{\displaystyle x_{k}}, the position at thek-th iteration, andgkf(xk){\displaystyle g_{k}\equiv \nabla f(x_{k})} wheref{\displaystyle f} is the function being minimized, and all vectors are column vectors. We also assume that we have stored the lastm updates of the form

sk=xk+1xk{\displaystyle s_{k}=x_{k+1}-x_{k}}
yk=gk+1gk{\displaystyle y_{k}=g_{k+1}-g_{k}}.

We defineρk=1yksk{\displaystyle \rho _{k}={\frac {1}{y_{k}^{\top }s_{k}}}}, andHk0{\displaystyle H_{k}^{0}} will be the 'initial' approximate of the inverse Hessian that our estimate at iterationk begins with.

The algorithm is based on the BFGS recursion for the inverse Hessian as

Hk+1=(Iρkskyk)Hk(Iρkyksk)+ρksksk.{\displaystyle H_{k+1}=(I-\rho _{k}s_{k}y_{k}^{\top })H_{k}(I-\rho _{k}y_{k}s_{k}^{\top })+\rho _{k}s_{k}s_{k}^{\top }.}

For a fixedk we define a sequence of vectorsqkm,,qk{\displaystyle q_{k-m},\ldots ,q_{k}} asqk:=gk{\displaystyle q_{k}:=g_{k}} andqi:=(Iρiyisi)qi+1{\displaystyle q_{i}:=(I-\rho _{i}y_{i}s_{i}^{\top })q_{i+1}}. Then a recursive algorithm for calculatingqi{\displaystyle q_{i}} fromqi+1{\displaystyle q_{i+1}} is to defineαi:=ρisiqi+1{\displaystyle \alpha _{i}:=\rho _{i}s_{i}^{\top }q_{i+1}} andqi=qi+1αiyi{\displaystyle q_{i}=q_{i+1}-\alpha _{i}y_{i}}. We also define another sequence of vectorszkm,,zk{\displaystyle z_{k-m},\ldots ,z_{k}} aszi:=Hiqi{\displaystyle z_{i}:=H_{i}q_{i}}. There is another recursive algorithm for calculating these vectors which is to definezkm=Hk0qkm{\displaystyle z_{k-m}=H_{k}^{0}q_{k-m}} and then recursively defineβi:=ρiyizi{\displaystyle \beta _{i}:=\rho _{i}y_{i}^{\top }z_{i}} andzi+1=zi+(αiβi)si{\displaystyle z_{i+1}=z_{i}+(\alpha _{i}-\beta _{i})s_{i}}. The value ofzk{\displaystyle z_{k}} is then our ascent direction.

Thus we can compute thedescent direction as follows:

q=gkFor i=k1,k2,,kmαi=ρisiqq=qαiyiγk=skmykmykmykmHk0=γkIz=Hk0qFor i=km,km+1,,k1βi=ρiyizz=z+si(αiβi)z=z{\displaystyle {\begin{array}{l}q=g_{k}\\{\mathtt {For}}\ i=k-1,k-2,\ldots ,k-m\\\qquad \alpha _{i}=\rho _{i}s_{i}^{\top }q\\\qquad q=q-\alpha _{i}y_{i}\\\gamma _{k}={\frac {s_{k-m}^{\top }y_{k-m}}{y_{k-m}^{\top }y_{k-m}}}\\H_{k}^{0}=\gamma _{k}I\\z=H_{k}^{0}q\\{\mathtt {For}}\ i=k-m,k-m+1,\ldots ,k-1\\\qquad \beta _{i}=\rho _{i}y_{i}^{\top }z\\\qquad z=z+s_{i}(\alpha _{i}-\beta _{i})\\z=-z\end{array}}}

This formulation gives the search direction for the minimization problem, i.e.,z=Hkgk{\displaystyle z=-H_{k}g_{k}}. For maximization problems, one should thus take-z instead. Note that the initial approximate inverse HessianHk0{\displaystyle H_{k}^{0}} is chosen as adiagonal matrix or even a multiple of the identity matrix since this is numerically efficient.

The scaling of the initial matrixγk{\displaystyle \gamma _{k}} ensures that the search direction is well scaled and therefore the unit step length is accepted in most iterations. AWolfe line search is used to ensure that the curvature condition is satisfied and the BFGS updating is stable. Note that some software implementations use an Armijobacktracking line search, but cannot guarantee that the curvature conditionyksk>0{\displaystyle y_{k}^{\top }s_{k}>0} will be satisfied by the chosen step since a step length greater than1{\displaystyle 1} may be needed to satisfy this condition. Some implementations address this by skipping the BFGS update whenyksk{\displaystyle y_{k}^{\top }s_{k}} is negative or too close to zero, but this approach is not generally recommended since the updates may be skipped too often to allow the Hessian approximationHk{\displaystyle H_{k}} to capture important curvature information. Some solvers employ so called damped (L)BFGS update which modifies quantitiessk{\displaystyle s_{k}} andyk{\displaystyle y_{k}} in order to satisfy the curvature condition.

The two-loop recursion formula is widely used by unconstrained optimizers due to its efficiency in multiplying by the inverse Hessian. However, it does not allow for the explicit formation of either the direct or inverse Hessian and is incompatible with non-box constraints. An alternative approach is thecompact representation, which involves a low-rank representation for the direct and/or inverse Hessian.[6] This represents the Hessian as a sum of a diagonal matrix and a low-rank update. Such a representation enables the use of L-BFGS in constrained settings, for example, as part of the SQP method.

Applications

[edit]

L-BFGS has been called "the algorithm of choice" for fittinglog-linear (MaxEnt) models andconditional random fields with2{\displaystyle \ell _{2}}-regularization.[2][3]

Variants

[edit]

Since BFGS (and hence L-BFGS) is designed to minimizesmooth functions withoutconstraints, the L-BFGS algorithm must be modified to handle functions that include non-differentiable components or constraints. A popular class of modifications are called active-set methods, based on the concept of theactive set. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.

L-BFGS-B

[edit]

TheL-BFGS-B algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the formlixiui whereli andui are per-variable constant lower and upper bounds, respectively (for eachxi, either or both bounds may be omitted).[7][8] The method works by identifying fixed and free variables at every step (using a simple gradient method), and then using the L-BFGS method on the free variables only to get higher accuracy, and then repeating the process.

OWL-QN

[edit]

Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting1{\displaystyle \ell _{1}}-regularized models, exploiting the inherentsparsity of such models.[3]It minimizes functions of the form

f(x)=g(x)+Cx1{\displaystyle f({\vec {x}})=g({\vec {x}})+C\|{\vec {x}}\|_{1}}

whereg{\displaystyle g} is adifferentiableconvexloss function. The method is an active-set type method: at each iterate, it estimates thesign of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiablex1{\displaystyle \|{\vec {x}}\|_{1}} term becomes a smooth linear term which can be handled by L-BFGS. After an L-BFGS step, the method allows some variables to change sign, and repeats the process.

O-LBFGS

[edit]

Schraudolphet al. present anonline approximation to both BFGS and L-BFGS.[9] Similar tostochastic gradient descent, this can be used to reduce the computational complexity by evaluating theerror function and gradient on a randomly drawn subset of the overall dataset in each iteration. It has been shown that O-LBFGS has a global almost sure convergence[10] while the online approximation of BFGS (O-BFGS) is not necessarily convergent.[11]

Implementation of variants

[edit]

Notable open source implementations include:


Notable non open source implementations include:

  • The L-BFGS-B variant also exists as ACM TOMS algorithm 778.[8][13] In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0).
  • A reference implementation inFortran 77 (and with aFortran 90 interface).[14][15] This version, as well as older versions, has been converted to many other languages.
  • An OWL-QN C++ implementation by its designers.[3][16]

Works cited

[edit]
  1. ^Liu, D. C.; Nocedal, J. (1989)."On the Limited Memory Method for Large Scale Optimization".Mathematical Programming B.45 (3):503–528.CiteSeerX 10.1.1.110.6443.doi:10.1007/BF01589116.S2CID 5681609.
  2. ^abMalouf, Robert (2002)."A comparison of algorithms for maximum entropy parameter estimation".Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002). pp. 49–55.doi:10.3115/1118853.1118871.
  3. ^abcdAndrew, Galen; Gao, Jianfeng (2007)."Scalable training of L₁-regularized log-linear models".Proceedings of the 24th International Conference on Machine Learning.doi:10.1145/1273496.1273501.ISBN 9781595937933.S2CID 5853259.
  4. ^Matthies, H.; Strang, G. (1979). "The solution of non linear finite element equations".International Journal for Numerical Methods in Engineering.14 (11):1613–1626.Bibcode:1979IJNME..14.1613M.doi:10.1002/nme.1620141104.
  5. ^Nocedal, J. (1980)."Updating Quasi-Newton Matrices with Limited Storage".Mathematics of Computation.35 (151):773–782.doi:10.1090/S0025-5718-1980-0572855-7.
  6. ^Byrd, R. H.; Nocedal, J.; Schnabel, R. B. (1994). "Representations of Quasi-Newton Matrices and their use in Limited Memory Methods".Mathematical Programming.63 (4):129–156.doi:10.1007/BF01582063.S2CID 5581219.
  7. ^Byrd, R. H.; Lu, P.; Nocedal, J.; Zhu, C. (1995)."A Limited Memory Algorithm for Bound Constrained Optimization".SIAM J. Sci. Comput.16 (5):1190–1208.Bibcode:1995SJSC...16.1190B.doi:10.1137/0916069.S2CID 6398414.
  8. ^abZhu, C.; Byrd, Richard H.; Lu, Peihuang; Nocedal, Jorge (1997)."L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization".ACM Transactions on Mathematical Software.23 (4):550–560.doi:10.1145/279232.279236.S2CID 207228122.
  9. ^Schraudolph, N.; Yu, J.; Günter, S. (2007).A stochastic quasi-Newton method for online convex optimization. AISTATS.
  10. ^Mokhtari, A.; Ribeiro, A. (2015)."Global convergence of online limited memory BFGS"(PDF).Journal of Machine Learning Research.16:3151–3181.arXiv:1409.2045.
  11. ^Mokhtari, A.; Ribeiro, A. (2014). "RES: Regularized Stochastic BFGS Algorithm".IEEE Transactions on Signal Processing.62 (23):6089–6104.arXiv:1401.7625.Bibcode:2014ITSP...62.6089M.CiteSeerX 10.1.1.756.3003.doi:10.1109/TSP.2014.2357775.S2CID 15214938.
  12. ^"Official Documentation of Optim.jl".Documentation Optim.jl.
  13. ^"TOMS Home".toms.acm.org.
  14. ^Morales, J. L.; Nocedal, J. (2011). "Remark on "algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization"".ACM Transactions on Mathematical Software.38:1–4.doi:10.1145/2049662.2049669.S2CID 16742561.
  15. ^"L-BFGS-B Nonlinear Optimization Code".users.iems.northwestern.edu.
  16. ^"Orthant-Wise Limited-memory Quasi-Newton Optimizer for L1-regularized Objectives".Microsoft Download Center.

Further reading

[edit]
Functions
Gradients
Convergence
Quasi–Newton
Other methods
Hessians
Graph of a strictly concave quadratic function with unique maximum.
Optimization computes maxima and minima.
General
Differentiable
Convex
minimization
Linear and
quadratic
Interior point
Basis-exchange
Paradigms
Graph
algorithms
Minimum
spanning tree
Shortest path
Network flows
Retrieved from "https://en.wikipedia.org/w/index.php?title=Limited-memory_BFGS&oldid=1326793624"
Category:
Hidden categories:

[8]ページ先頭

©2009-2026 Movatter.jp