Simultaneous perturbation stochastic approximation

From Wikipedia, the free encyclopedia
Optimization algorithm

Simultaneous perturbation stochastic approximation (SPSA) is an algorithmic method for optimizing systems with multiple unknown parameters. It is a type of stochastic approximation algorithm. As an optimization method, it is well suited to large-scale population models, adaptive modeling, simulation optimization, and atmospheric modeling. Many examples are presented at the SPSA website http://www.jhuapl.edu/SPSA. A comprehensive book on the subject is Bhatnagar et al. (2013). An early paper on the subject is Spall (1987), and the foundational paper providing the key theory and justification is Spall (1992).

SPSA is a descent method capable of finding global minima, sharing this property with other methods such as simulated annealing. Its main feature is a gradient approximation that requires only two measurements of the objective function, regardless of the dimension of the optimization problem. Recall that we want to find the optimal control $u^*$ with loss function $J(u)$:

$u^* = \arg\min_{u \in U} J(u).$

Both finite-difference stochastic approximation (FDSA) and SPSA use the same iterative process:

$u_{n+1} = u_n - a_n \hat{g}_n(u_n),$

where $u_n = ((u_n)_1, (u_n)_2, \ldots, (u_n)_p)^T$ represents the $n$th iterate, $\hat{g}_n(u_n)$ is the estimate of the gradient of the objective function $g(u) = \frac{\partial}{\partial u} J(u)$ evaluated at $u_n$, and $\{a_n\}$ is a sequence of positive numbers converging to 0. If $u_n$ is a $p$-dimensional vector, the $i$th component of the symmetric finite-difference gradient estimator is:

FD: $(\hat{g}_n(u_n))_i = \dfrac{J(u_n + c_n e_i) - J(u_n - c_n e_i)}{2 c_n},$

$1 \le i \le p$, where $e_i$ is the unit vector with a 1 in the $i$th place, and $c_n$ is a small positive number that decreases with $n$. With this method, $2p$ evaluations of $J$ are needed for each $\hat{g}_n$. When $p$ is large, this estimator loses efficiency.
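As a concrete sketch, the FD estimator can be written in a few lines of NumPy; the loss `J`, the point `u`, and the difference interval `c` are generic placeholders rather than anything prescribed by the article:

```python
import numpy as np

def fd_gradient(J, u, c):
    """Symmetric finite-difference gradient estimate.

    Costs 2p evaluations of J for a p-dimensional u.
    """
    p = len(u)
    g = np.zeros(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0  # unit vector with a 1 in the i-th place
        g[i] = (J(u + c * e) - J(u - c * e)) / (2.0 * c)
    return g
```

For a quadratic loss the central difference is exact; in general the estimate carries an $O(c^2)$ truncation error.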

Now let $\Delta_n$ be a random perturbation vector. The $i$th component of the stochastic perturbation gradient estimator is:

SP: $(\hat{g}_n(u_n))_i = \dfrac{J(u_n + c_n \Delta_n) - J(u_n - c_n \Delta_n)}{2 c_n (\Delta_n)_i}.$

Note that FD perturbs only one direction at a time, while the SP estimator perturbs all directions at once (the numerator is identical in all $p$ components). The number of loss-function measurements needed by SPSA for each $\hat{g}_n$ is always 2, independent of the dimension $p$. Thus SPSA uses $p$ times fewer function evaluations than FDSA, which makes it far more efficient.
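By contrast, a sketch of the SP estimator needs only one pair of loss evaluations, shared by all $p$ components; the Rademacher ($\pm 1$) perturbation used here anticipates the distributional conditions discussed later:

```python
import numpy as np

def spsa_gradient(J, u, c, rng):
    """Simultaneous perturbation gradient estimate.

    Costs exactly 2 evaluations of J, regardless of the dimension of u.
    """
    delta = rng.choice([-1.0, 1.0], size=len(u))  # Rademacher perturbation
    diff = J(u + c * delta) - J(u - c * delta)    # same numerator for every component
    return diff / (2.0 * c * delta)               # divide elementwise by (Delta_n)_i
```

A single call is a noisy estimate; averaged over many independent perturbations it approaches the true gradient.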

Simple experiments with $p = 2$ showed that SPSA converges in the same number of iterations as FDSA. FDSA follows approximately the steepest-descent direction, behaving like the gradient method. SPSA, with its random search direction, does not follow the gradient path exactly. On average, though, it tracks it closely, because the gradient approximation is a nearly unbiased estimator of the gradient, as shown in the following lemma.

Convergence lemma


Denote by

$b_n = E[\hat{g}_n \mid u_n] - \nabla J(u_n)$

the bias in the estimator $\hat{g}_n$. Assume that the $\{(\Delta_n)_i\}$ are mutually independent with zero mean and bounded second moments, and that $E(|(\Delta_n)_i|^{-1})$ is uniformly bounded. Then $b_n \to 0$ w.p. 1.

Sketch of the proof


The main idea is to condition on $\Delta_n$ to express $E[(\hat{g}_n)_i]$ and then to use a second-order Taylor expansion of $J(u_n + c_n \Delta_n)$ and $J(u_n - c_n \Delta_n)$. After algebraic manipulations using the zero mean and the independence of the $\{(\Delta_n)_i\}$, we get

$E[(\hat{g}_n)_i] = (g_n)_i + O(c_n^2).$

The result follows from the hypothesis that $c_n \to 0$.
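For a small example the lemma can be checked exactly: with Rademacher perturbations and $p = 2$ there are only $2^p = 4$ sign patterns, so the conditional expectation $E[\hat{g}_n \mid u_n]$ can be computed by enumeration rather than sampling. The cubic loss below is purely illustrative:

```python
import itertools
import numpy as np

def J(u):
    """Illustrative smooth, non-quadratic loss: sum of cubes."""
    return np.sum(u ** 3)

def grad_J(u):
    return 3.0 * u ** 2

def expected_spsa(u, c):
    """Average the SP estimator over all 2^p Rademacher sign patterns."""
    p = len(u)
    total = np.zeros(p)
    patterns = list(itertools.product([-1.0, 1.0], repeat=p))
    for pattern in patterns:
        delta = np.array(pattern)
        total += (J(u + c * delta) - J(u - c * delta)) / (2.0 * c * delta)
    return total / len(patterns)

u = np.array([1.0, 2.0])
bias1 = expected_spsa(u, 0.10) - grad_J(u)  # bias at c = 0.10
bias2 = expected_spsa(u, 0.05) - grad_J(u)  # bias at c = 0.05
# halving c quarters the bias, consistent with the O(c_n^2) term
```

For this loss the bias works out to exactly $c^2$ in each component, so halving $c$ divides it by four.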

Next we summarize some of the hypotheses under which $u_n$ converges in probability to the set of global minima of $J(u)$. The efficiency of the method depends on the shape of $J(u)$, the values of the parameters $a_n$ and $c_n$, and the distribution of the perturbation terms $\Delta_{ni}$. First, the algorithm parameters must satisfy the following conditions:

  • $a_n > 0$, $c_n > 0$, with $a_n \to 0$ and $c_n \to 0$;
  • $\sum_n a_n = \infty$, so the iterates can reach any point in the domain;
  • $\sum_n a_n^2 / c_n^2 < \infty$, so the accumulated gradient noise remains summable.

A good choice for $\Delta_{ni}$ is the Rademacher distribution, i.e., Bernoulli $\pm 1$ with probability 0.5. Other choices are possible too, but note that the uniform and normal distributions cannot be used because they do not satisfy the finite inverse moment condition.

The loss function $J(u)$ must be thrice continuously differentiable, and the individual elements of the third derivative must be bounded: $|J^{(3)}(u)| < a_3 < \infty$. Also, $|J(u)| \to \infty$ as $u \to \infty$.

In addition, $\nabla J$ must be Lipschitz continuous and bounded, and the ODE $\dot{u} = g(u)$ must have a unique solution for each initial condition. Under these conditions and a few others, $u_n$ converges in probability to the set of global minima of $J(u)$ (see Maryak and Chin, 2008).

It has been shown that differentiability is not required: continuity and convexity are sufficient for convergence.[1]
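Putting the pieces together, a minimal SPSA loop might look like the sketch below. The gain exponents 0.602 and 0.101 follow the practical guidance in Spall (1998); the specific coefficients `a`, `c`, `A`, and the quadratic test loss are illustrative assumptions, not prescribed values:

```python
import numpy as np

def spsa_minimize(J, u0, n_iter=1000, a=0.1, c=0.1, A=100.0, seed=0):
    """Minimize J with SPSA, using Rademacher perturbations and
    decaying gain sequences a_n and c_n."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u0, dtype=float)
    for n in range(n_iter):
        a_n = a / (n + 1 + A) ** 0.602       # step size: a_n -> 0, sum a_n = inf
        c_n = c / (n + 1) ** 0.101           # perturbation size: c_n -> 0
        delta = rng.choice([-1.0, 1.0], size=u.size)
        g_hat = (J(u + c_n * delta) - J(u - c_n * delta)) / (2.0 * c_n * delta)
        u = u - a_n * g_hat                  # two loss evaluations per iteration
    return u

# illustrative run on a simple quadratic with minimum at (3, 3)
u_star = spsa_minimize(lambda u: np.sum((u - 3.0) ** 2), np.zeros(2))
```

The stability constant `A` keeps early steps small; in practice the coefficients are tuned so that the first steps move the iterate by a modest fraction of the distance to the optimum.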

Extension to second-order (Newton) methods


It is known that a stochastic version of the standard (deterministic) Newton–Raphson algorithm (a "second-order" method) provides an asymptotically optimal or near-optimal form of stochastic approximation. SPSA can also be used to efficiently estimate the Hessian matrix of the loss function, based on either noisy loss measurements or noisy gradient measurements (stochastic gradients). As with the basic SPSA method, only a small fixed number of loss or gradient measurements is needed at each iteration, regardless of the problem dimension $p$. See the brief discussion in Stochastic gradient descent.

References

  • Bhatnagar, S., Prasad, H. L., and Prashanth, L. A. (2013), Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods, Springer.
  • Hirokami, T., Maeda, Y., and Tsukada, H. (2006), "Parameter Estimation Using Simultaneous Perturbation Stochastic Approximation," Electrical Engineering in Japan, vol. 154(2), pp. 30–3.
  • Maryak, J. L., and Chin, D. C. (2008), "Global Random Optimization by Simultaneous Perturbation Stochastic Approximation," IEEE Transactions on Automatic Control, vol. 53, pp. 780–783.
  • Spall, J. C. (1987), "A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates," Proceedings of the American Control Conference, Minneapolis, MN, June 1987, pp. 1161–1167.
  • Spall, J. C. (1992), "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation," IEEE Transactions on Automatic Control, vol. 37(3), pp. 332–341.
  • Spall, J. C. (1998), "Overview of the Simultaneous Perturbation Method for Efficient Optimization," Johns Hopkins APL Technical Digest, vol. 19(4), pp. 482–492.
  • Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley. ISBN 0-471-33052-3 (Chapter 7).
  1. ^ He, Ying; Fu, Michael C.; Marcus, Steven I. (August 2003), "Convergence of Simultaneous Perturbation Stochastic Approximation for Nondifferentiable Optimization," IEEE Transactions on Automatic Control, vol. 48(8), pp. 1459–1463. doi:10.1109/TAC.2003.815008.
Retrieved from "https://en.wikipedia.org/w/index.php?title=Simultaneous_perturbation_stochastic_approximation&oldid=1317306566"