Movatterモバイル変換

Nonlinear conjugate gradient method

From Wikipedia, the free encyclopedia

Concept in mathematics

Innumerical optimization, thenonlinear conjugate gradient method generalizes theconjugate gradient method tononlinear optimization. For a quadratic function $\displaystyle f(x)$

\displaystyle f(x)=\|Ax-b\|^{2},

the minimum of $f {\displaystyle f}$ is obtained when thegradient is 0:

\nabla _{x}f=2A^{T}(Ax-b)=0

Whereas linear conjugate gradient seeks a solution to the linear equation $\displaystyle A^{T}Ax=A^{T}b$ , the nonlinear conjugate gradient method is generally used to find thelocal minimum of a nonlinear function using itsgradient $\nabla _{x}f$ alone. It works when the function is approximately quadratic near the minimum, which is the case when the function is twice differentiable at the minimum and the second derivative is non-singular there.

Given a function $\displaystyle f(x)$ of $N {\displaystyle N}$ variables to minimize, its gradient $\nabla _{x}f$ indicates the direction of maximum increase.One simply starts in the opposite (steepest descent) direction:

\Delta x_{0}=-\nabla _{x}f(x_{0})

with an adjustable step length $\displaystyle \alpha$ and performs aline search in this direction until it reaches the minimum of $\displaystyle f$ :

\displaystyle \alpha _{0}:=\arg \min _{\alpha }f(x_{0}+\alpha \Delta x_{0})

\displaystyle x_{1}=x_{0}+\alpha _{0}\Delta x_{0}

After this first iteration in the steepest direction $\displaystyle \Delta x_{0}$ , the following steps constitute one iteration of moving along a subsequent conjugate direction $\displaystyle s_{n}$ , where $\displaystyle s_{0}=\Delta x_{0}$ :

Calculate the steepest direction: $\Delta x_{n}=-\nabla _{x}f(x_{n})$ ,
Compute $\displaystyle \beta _{n}$ according to one of the formulas below,
Update the conjugate direction: $\displaystyle s_{n}=\Delta x_{n}+\beta _{n}s_{n-1}$
Perform a line search: optimize $\displaystyle \alpha _{n}=\arg \min _{\alpha }f(x_{n}+\alpha s_{n})$ ,
Update the position: $\displaystyle x_{n+1}=x_{n}+\alpha _{n}s_{n}$ ,

With a pure quadratic function the minimum is reached withinN iterations (excepting roundoff error), but a non-quadratic function will make slower progress. Subsequent search directions lose conjugacy requiring the search direction to be reset to the steepest descent direction at least everyN iterations, or sooner if progress stops. However, resetting every iteration turns the method intosteepest descent. The algorithm stops when it finds the minimum, determined when no progress is made after a direction reset (i.e. in the steepest descent direction), or when some tolerance criterion is reached.

Within a linear approximation, the parameters $\displaystyle \alpha$ and $\displaystyle \beta$ are the same as in thelinear conjugate gradient method but have been obtained with line searches.The conjugate gradient method can follow narrow (ill-conditioned) valleys, where thesteepest descent method slows down and follows a criss-cross pattern.

Four of the best known formulas for $\displaystyle \beta _{n}$ are named after their developers:

Fletcher–Reeves:^[1]

\beta _{n}^{FR}={\frac {\Delta x_{n}^{T}\Delta x_{n}}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}.

Polak–Ribière:^[2]

\beta _{n}^{PR}={\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}.

Hestenes–Stiefel:^[3]

\beta _{n}^{HS}={\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{-s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}.

Dai–Yuan:^[4]

\beta _{n}^{DY}={\frac {\Delta x_{n}^{T}\Delta x_{n}}{-s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}.

These formulas are equivalent for a quadratic function, but for nonlinear optimization the preferred formula is a matter of heuristics or taste. A popular choice is $\displaystyle \beta =\max\{0,\beta ^{PR}\}$ , which provides a direction reset automatically.^[5]

Algorithms based onNewton's method potentially converge much faster. There, both step direction and length are computed from the gradient as the solution of a linear system of equations, with the coefficient matrix being the exactHessian matrix (for Newton's method proper) or an estimate thereof (in thequasi-Newton methods, where the observed change in the gradient during the iterations is used to update the Hessian estimate). For high-dimensional problems, the exact computation of the Hessian is usually prohibitively expensive, and even its storage can be problematic, requiring $O(N^{2})$ memory (but see the limited-memoryL-BFGS quasi-Newton method).

The conjugate gradient method can also be derived usingoptimal control theory.^[6] In this accelerated optimization theory, the conjugate gradient method falls out as a nonlinearoptimal feedback controller,

$u=k(x,{\dot {x}}):=-\gamma _{a}\nabla _{x}f(x)-\gamma _{b}{\dot {x}}$

for thedouble integrator system,

${\ddot {x}}=u$

The quantities $\gamma _{a}>0$ and $\gamma _{b}>0$ are variable feedback gains.^[6]

References

[edit]

^Fletcher, R.; Reeves, C. M. (1964)."Function minimization by conjugate gradients".The Computer Journal.7 (2):149–154.doi:10.1093/comjnl/7.2.149.
^Polak, E.; Ribière, G. (1969). "Note sur la convergence de méthodes de directions conjuguées".Revue Française d'Automatique, Informatique, Recherche Opérationnelle.3 (1):35–43.
^Hestenes, M. R.; Stiefel, E. (1952)."Methods of Conjugate Gradients for Solving Linear Systems".Journal of Research of the National Bureau of Standards.49 (6):409–436.doi:10.6028/jres.049.044.
^Dai, Y.-H.; Yuan, Y. (1999). "A nonlinear conjugate gradient method with a strong global convergence property".SIAM Journal on Optimization.10 (1):177–182.doi:10.1137/S1052623497318992.
^Shewchuk, J. R. (August 1994)."An Introduction to the Conjugate Gradient Method Without the Agonizing Pain"(PDF).
^^a ^bRoss, I. M. (2019). "An Optimal Control Theory for Accelerated Optimization".arXiv:1902.09004 [math.OC].

Optimization:Algorithms,methods, andheuristics

Unconstrained nonlinear

Functions

Gradients

Convergence	Trust region Wolfe conditions
Quasi–Newton	Berndt–Hall–Hall–Hausman Broyden–Fletcher–Goldfarb–Shanno andL-BFGS Davidon–Fletcher–Powell Symmetric rank-one (SR1)
Other methods	Conjugate gradient Gauss–Newton Gradient Mirror Levenberg–Marquardt Powell's dog leg method Truncated Newton

Hessians

Newton's method

Graph of a strictly concave quadratic function with unique maximum. — Optimization computes maxima and minima.

Constrained nonlinear

General	Barrier methods Penalty methods
Differentiable	Augmented Lagrangian methods Sequential quadratic programming Successive linear programming

Convex optimization

Convex
minimization

Linear and
quadratic

Interior point	Affine scaling Ellipsoid algorithm of Khachiyan Projective algorithm of Karmarkar
Basis-exchange	Simplex algorithm of Dantzig Revised simplex algorithm Criss-cross algorithm Principal pivoting algorithm of Lemke Active-set method