Stochastic approximation

From Wikipedia, the free encyclopedia
Family of iterative methods

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In a nutshell, stochastic approximation algorithms deal with a function of the form $f(\theta) = \operatorname{E}_{\xi}[F(\theta, \xi)]$, which is the expected value of a function depending on a random variable $\xi$. The goal is to recover properties of such a function $f$ without evaluating it directly. Instead, stochastic approximation algorithms use random samples of $F(\theta, \xi)$ to efficiently approximate properties of $f$ such as zeros or extrema.

Recently, stochastic approximations have found extensive applications in the fields of statistics and machine learning, especially in settings with big data. These applications range from stochastic optimization methods and algorithms to online forms of the EM algorithm, reinforcement learning via temporal differences, and deep learning, among others.[1] Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.[2]

The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz algorithms, introduced in 1951 and 1952 respectively.

Robbins–Monro algorithm


The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro,[3] presented a methodology for solving a root-finding problem in which the function is represented as an expected value. Assume that we have a function $M(\theta)$ and a constant $\alpha$ such that the equation $M(\theta) = \alpha$ has a unique root at $\theta^*$. It is assumed that while we cannot directly observe the function $M(\theta)$, we can instead obtain measurements of the random variable $N(\theta)$, where $\operatorname{E}[N(\theta)] = M(\theta)$. The structure of the algorithm is then to generate iterates of the form:

$$\theta_{n+1} = \theta_n - a_n (N(\theta_n) - \alpha)$$

Here, $a_1, a_2, \dots$ is a sequence of positive step sizes. Robbins and Monro proved ([3], Theorem 2) that $\theta_n$ converges in $L^2$ (and hence also in probability) to $\theta^*$, and Blum[4] later proved that the convergence actually holds with probability one, provided that:

$$\sum_{n=0}^{\infty} a_n = \infty \quad \text{ and } \quad \sum_{n=0}^{\infty} a_n^2 < \infty$$

A particular sequence of steps which satisfies these conditions, suggested by Robbins and Monro, has the form $a_n = a/n$ for $a > 0$. Other series, such as $a_n = \frac{1}{n \ln n}, \frac{1}{n \ln n \ln \ln n}, \dots$, are possible, but in order to average out the noise in $N(\theta)$, the above condition must be met.
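
As a minimal sketch (not from the source), the iteration can be implemented in a few lines of Python; the oracle `noisy_measurement`, the linear test function, and all parameter values below are illustrative assumptions:

```python
import numpy as np

def robbins_monro(noisy_measurement, alpha, theta0, a=1.0, n_iter=10_000):
    """Approximate the root of M(theta) = alpha from noisy draws N(theta)
    with E[N(theta)] = M(theta), using step sizes a_n = a/n."""
    theta = theta0
    for n in range(1, n_iter + 1):
        a_n = a / n  # satisfies sum a_n = inf and sum a_n^2 < inf
        theta = theta - a_n * (noisy_measurement(theta) - alpha)
    return theta

# Toy problem: M(theta) = 2*theta observed under Gaussian noise; the
# unique root of M(theta) = 4 is theta* = 2.
rng = np.random.default_rng(0)
print(robbins_monro(lambda t: 2 * t + rng.normal(), alpha=4.0, theta0=0.0))
```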

Example


Consider the problem of estimating the mean $\theta^*$ of a probability distribution from a stream of independent samples $X_1, X_2, \dots$.

Let $N(\theta) := \theta - X$; then the unique solution to $\operatorname{E}[N(\theta)] = 0$ is the desired mean $\theta^*$. The RM algorithm gives us

$$\theta_{n+1} = \theta_n - a_n(\theta_n - X_n)$$

This is equivalent to stochastic gradient descent with loss function $L(\theta) = \frac{1}{2}\|X - \theta\|^2$. It is also equivalent to a weighted average:

$$\theta_{n+1} = (1 - a_n)\theta_n + a_n X_n$$

In general, if there exists some function $L$ such that $\nabla L(\theta) = N(\theta) - \alpha$, then the Robbins–Monro algorithm is equivalent to stochastic gradient descent with loss function $L(\theta)$. However, the RM algorithm does not require $L$ to exist in order to converge.
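
A quick way to see the weighted-average form in action: with $a_n = 1/n$ the recursion reproduces the running sample mean exactly, regardless of the starting point. A short sketch (the data stream is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=3.0, scale=2.0, size=5)  # hypothetical data stream

theta = 0.0  # arbitrary start; with a_n = 1/n it is forgotten after one step
for n, x in enumerate(samples, start=1):
    theta = (1 - 1 / n) * theta + (1 / n) * x     # RM update with a_n = 1/n

print(theta, samples.mean())  # both equal the sample mean (up to rounding)
```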

Complexity results

  1. If $f(\theta)$ is twice continuously differentiable and strongly convex, and the minimizer of $f(\theta)$ belongs to the interior of $\Theta$, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate with respect to the objective function, namely $\operatorname{E}[f(\theta_n) - f^*] = O(1/n)$, where $f^*$ is the minimal value of $f(\theta)$ over $\theta \in \Theta$.[5][6]
  2. Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin[7] have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is $O(1/\sqrt{n})$. They have also proven that this rate cannot be improved.

Subsequent developments and Polyak–Ruppert averaging


While the Robbins–Monro algorithm is theoretically able to achieve $O(1/n)$ under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily because the algorithm is very sensitive to the choice of the step size sequence, and the supposedly asymptotically optimal step size policy can be quite harmful in the beginning.[6][8]

Chung (1954)[9] and Fabian (1968)[10] showed that we would achieve the optimal convergence rate $O(1/\sqrt{n})$ with $a_n = \nabla^2 f(\theta^*)^{-1}/n$ (or $a_n = \frac{1}{n M'(\theta^*)}$). Lai and Robbins[11][12] designed adaptive procedures to estimate $M'(\theta^*)$ such that $\theta_n$ has minimal asymptotic variance. However, the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak (1991)[13] and Ruppert (1988)[14] independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky[15] also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps and averaging of the iterates. The algorithm has the following structure:

$$\theta_{n+1} - \theta_n = a_n(\alpha - N(\theta_n)), \qquad \bar{\theta}_n = \frac{1}{n}\sum_{i=0}^{n-1}\theta_i$$

The convergence of $\bar{\theta}_n$ to the unique root $\theta^*$ relies on the condition that the step sequence $\{a_n\}$ decreases sufficiently slowly. That is,

A1) $a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)$

Therefore, the sequence $a_n = n^{-\alpha}$ with $0 < \alpha < 1$ satisfies this restriction, but $\alpha = 1$ does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate $O(1/\sqrt{n})$ yet with a more robust step size policy.[15] Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin[16] for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate $O(1/\sqrt{n})$.
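
A minimal sketch of the averaging scheme (an illustration, not Polyak and Juditsky's exact setup), with the admissible step sequence $a_n = n^{-2/3}$ and the same toy root-finding problem as in the earlier sketch:

```python
import numpy as np

def polyak_ruppert(noisy_measurement, alpha, theta0, n_iter=10_000):
    """Robbins-Monro with longer steps a_n = n^(-2/3), 0 < 2/3 < 1, plus
    Polyak-Ruppert averaging of the iterates."""
    theta = theta0
    total = 0.0
    for n in range(1, n_iter + 1):
        a_n = n ** (-2.0 / 3.0)  # decays more slowly than 1/n, per A1)
        theta = theta + a_n * (alpha - noisy_measurement(theta))
        total += theta
    return total / n_iter        # averaged iterate bar(theta)_n

# Toy problem: M(theta) = 2*theta with Gaussian noise, alpha = 4, root 2.
rng = np.random.default_rng(0)
print(polyak_ruppert(lambda t: 2 * t + rng.normal(), alpha=4.0, theta0=0.0))
```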

A more general result is given in Chapter 11 of Kushner and Yin[17] by defining the interpolated time $t_n = \sum_{i=0}^{n-1} a_i$, the interpolated process $\theta^n(\cdot)$ and the interpolated normalized process $U^n(\cdot)$ as

$$\theta^n(t) = \theta_{n+i}, \quad U^n(t) = (\theta_{n+i} - \theta^*)/\sqrt{a_{n+i}} \quad \text{for} \quad t \in [t_{n+i} - t_n,\, t_{n+i+1} - t_n),\ i \geq 0$$

Let the iterate average be $\Theta_n = \frac{a_n}{t}\sum_{i=n}^{n+t/a_n-1}\theta_i$ and the associated normalized error be $\hat{U}^n(t) = \frac{\sqrt{a_n}}{t}\sum_{i=n}^{n+t/a_n-1}(\theta_i - \theta^*)$.

With assumption A1) and the following A2)

A2) There is a Hurwitz matrix $A$ and a symmetric and positive-definite matrix $\Sigma$ such that $\{U^n(\cdot)\}$ converges weakly to $U(\cdot)$, where $U(\cdot)$ is the stationary solution to

$$dU = AU\,dt + \Sigma^{1/2}\,dw$$

where $w(\cdot)$ is a standard Wiener process.

satisfied, define $\bar{V} = (A^{-1})'\Sigma(A')^{-1}$. Then for each $t$,

$$\hat{U}^n(t) \,{\stackrel {\mathcal {D}}{\longrightarrow }}\, \mathcal{N}(0, V_t), \quad \text{where} \quad V_t = \bar{V}/t + O(1/t^2).$$

The success of the averaging idea is due to the time-scale separation of the original sequence $\{\theta_n\}$ and the averaged sequence $\{\Theta_n\}$, with the time scale of the former being faster.

Application in stochastic optimization


Suppose we want to solve the following stochastic optimization problem

$$g(\theta^*) = \min_{\theta \in \Theta} \operatorname{E}[Q(\theta, X)],$$

where $g(\theta) = \operatorname{E}[Q(\theta, X)]$ is differentiable and convex; then this problem is equivalent to finding the root $\theta^*$ of $\nabla g(\theta) = 0$. Here $Q(\theta, X)$ can be interpreted as some "observed" cost as a function of the chosen $\theta$ and random effects $X$. In practice, it might be hard to get an analytical form of $\nabla g(\theta)$, but the Robbins–Monro method manages to generate a sequence $(\theta_n)_{n \geq 0}$ to approximate $\theta^*$ if one can generate $(X_n)_{n \geq 0}$ such that the conditional expectation of $H(\theta_n, X_n)$ given $\theta_n$ is exactly $\nabla g(\theta_n)$, i.e. $X_n$ is simulated from a conditional distribution defined by

$$\operatorname{E}[H(\theta, X) \mid \theta = \theta_n] = \nabla g(\theta_n).$$

Here $H(\theta, X)$ is an unbiased estimator of $\nabla g(\theta)$. If $X$ depends on $\theta$, there is in general no natural way of generating a random outcome $H(\theta, X)$ that is an unbiased estimator of the gradient. In some special cases, when either infinitesimal perturbation analysis (IPA) or likelihood ratio methods are applicable, one is able to obtain an unbiased gradient estimator $H(\theta, X)$. If $X$ is viewed as some "fundamental" underlying random process that is generated independently of $\theta$, and under some regularization conditions for derivative-integral interchange operations so that $\operatorname{E}\left[\frac{\partial}{\partial \theta} Q(\theta, X)\right] = \nabla g(\theta)$, then $H(\theta, X) = \frac{\partial}{\partial \theta} Q(\theta, X)$ gives the fundamental unbiased gradient estimate. However, for some applications we have to use finite-difference methods in which $H(\theta, X)$ has a conditional expectation close to $\nabla g(\theta)$ but not exactly equal to it.

We then define a recursion analogously to Newton's method in the deterministic algorithm:

$$\theta_{n+1} = \theta_n - \varepsilon_n H(\theta_n, X_{n+1}).$$
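
As a minimal generic sketch (an illustration, not the source's code), this recursion can be written as a driver that accepts a caller-supplied unbiased gradient oracle `H` and a sampler for $X_{n+1}$; the choice $\varepsilon_n = 1/n$ anticipates conditions C1)–C3) below:

```python
def stochastic_gradient(H, sample_X, theta0, n_iter=100_000):
    """Iterate theta_{n+1} = theta_n - eps_n * H(theta_n, X_{n+1})
    with eps_n = 1/n."""
    theta = theta0
    for n in range(1, n_iter + 1):
        theta = theta - (1.0 / n) * H(theta, sample_X())
    return theta
```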

Convergence of the algorithm


The following result gives sufficient conditions for $\theta_n$ to converge:[18]

C1) $\varepsilon_n \geq 0,\ \forall\, n \geq 0.$

C2) $\sum_{n=0}^{\infty} \varepsilon_n = \infty$

C3) $\sum_{n=0}^{\infty} \varepsilon_n^2 < \infty$

C4) $|X_n| \leq B$ for a fixed bound $B$.

C5) $g(\theta)$ is strictly convex, i.e.

$$\inf_{\delta \leq |\theta - \theta^*| \leq 1/\delta} \langle \theta - \theta^*, \nabla g(\theta) \rangle > 0 \text{ for every } 0 < \delta < 1.$$

Then $\theta_n$ converges to $\theta^*$ almost surely.

Here are some intuitive explanations of these conditions. Suppose $H(\theta_n, X_{n+1})$ is a uniformly bounded sequence of random variables. If C2) is not satisfied, i.e. $\sum_{n=0}^{\infty} \varepsilon_n < \infty$, then

$$\theta_n - \theta_0 = -\sum_{i=0}^{n-1} \varepsilon_i H(\theta_i, X_{i+1})$$

is a bounded sequence, so the iteration cannot converge to $\theta^*$ if the initial guess $\theta_0$ is too far away from $\theta^*$. As for C3), note that if $\theta_n$ converges to $\theta^*$ then

$$\theta_{n+1} - \theta_n = -\varepsilon_n H(\theta_n, X_{n+1}) \rightarrow 0 \text{ as } n \rightarrow \infty,$$

so we must have $\varepsilon_n \downarrow 0$, and condition C3) ensures it. A natural choice would be $\varepsilon_n = 1/n$. Condition C5) is a fairly stringent condition on the shape of $g(\theta)$; it ensures that the expected update direction points toward $\theta^*$.

Example (where the stochastic gradient method is appropriate)[8]


Suppose $Q(\theta, X) = f(\theta) + \theta^T X$, where $f$ is differentiable and $X \in \mathbb{R}^p$ is a random variable independent of $\theta$. Then $g(\theta) = \operatorname{E}[Q(\theta, X)] = f(\theta) + \theta^T \operatorname{E} X$ depends on the mean of $X$, and the stochastic gradient method is appropriate for this problem. We can choose

$$H(\theta, X) = \frac{\partial}{\partial \theta} Q(\theta, X) = \frac{\partial}{\partial \theta} f(\theta) + X.$$
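
A minimal concrete sketch of this example, under the illustrative assumption $f(\theta) = \frac{1}{2}\|\theta\|^2$, so that $g(\theta) = \frac{1}{2}\|\theta\|^2 + \theta^T \operatorname{E} X$ is minimized at $\theta^* = -\operatorname{E} X$:

```python
import numpy as np

rng = np.random.default_rng(0)
mean_X = np.array([1.0, -2.0])

theta = np.zeros(2)
for n in range(1, 100_001):
    x = rng.normal(loc=mean_X, scale=1.0)    # X_{n+1}, independent of theta
    theta = theta - (1.0 / n) * (theta + x)  # H(theta, X) = theta + X

print(theta)  # approaches theta* = -E[X] = (-1, 2)
```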

Kiefer–Wolfowitz algorithm


The Kiefer–Wolfowitz algorithm was introduced in 1952 by Jacob Wolfowitz and Jack Kiefer,[19] and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function.

Let $M(x)$ be a function which has a maximum at the point $\theta$. It is assumed that $M(x)$ is unknown; however, certain observations $N(x)$, where $\operatorname{E}[N(x)] = M(x)$, can be made at any point $x$. The structure of the algorithm follows a gradient-like method, with the iterates being generated as

$$x_{n+1} = x_n + a_n \cdot \left(\frac{N(x_n + c_n) - N(x_n - c_n)}{2 c_n}\right)$$

where $N(x_n + c_n)$ and $N(x_n - c_n)$ are independent. At every step, the gradient of $M(x)$ is approximated akin to a central difference method with $h = 2c_n$. So the sequence $\{c_n\}$ specifies the sequence of finite difference widths used for the gradient approximation, while the sequence $\{a_n\}$ specifies a sequence of positive step sizes taken along that direction.

Kiefer and Wolfowitz proved that, if $M(x)$ satisfies certain regularity conditions, then $x_n$ will converge to $\theta$ in probability as $n \to \infty$, and later Blum[4] in 1954 showed that $x_n$ converges to $\theta$ almost surely, provided that the sequences satisfy

$$c_n \rightarrow 0, \qquad \sum_{n} a_n = \infty, \qquad \sum_{n} a_n c_n < \infty, \qquad \sum_{n} \frac{a_n^2}{c_n^2} < \infty.$$

A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be $a_n = 1/n$ and $c_n = n^{-1/3}$.
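
A minimal one-dimensional sketch of the Kiefer–Wolfowitz iteration with these sequences; the concave test function and the noise model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_M(x):
    """N(x): noisy observation of M(x) = -(x - 3)^2, maximized at theta = 3."""
    return -(x - 3.0) ** 2 + rng.normal()

x = 0.0
for n in range(1, 100_001):
    a_n = 1.0 / n
    c_n = n ** (-1.0 / 3.0)
    # Central-difference gradient estimate from two independent observations.
    grad_est = (noisy_M(x + c_n) - noisy_M(x - c_n)) / (2.0 * c_n)
    x = x + a_n * grad_est

print(x)  # should end close to the maximizer theta = 3
```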

Subsequent developments and important issues

  1. The Kiefer–Wolfowitz algorithm requires that, for each gradient computation, at least $d+1$ different parameter values be simulated in every iteration of the algorithm, where $d$ is the dimension of the search space. This means that when $d$ is large, the Kiefer–Wolfowitz algorithm will require substantial computational effort per iteration, leading to slow convergence.
    1. To address this problem, Spall proposed the use of simultaneous perturbations to estimate the gradient. This method requires only two simulations per iteration, regardless of the dimension $d$ (see the sketch after this list).[20]
  2. The conditions required for convergence assume a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution; such a set can be difficult to specify. In real-world applications, if the domain is quite large, these assumptions can be fairly restrictive and unrealistic.
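
A minimal sketch of the simultaneous-perturbation idea for a minimization problem; the quadratic loss, the Rademacher perturbation, and the gain sequences (using Spall's commonly cited exponents 0.602 and 0.101) are assumptions for illustration, not Spall's exact algorithm settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
theta_star = np.ones(d)

def noisy_loss(theta):
    """Noisy observation of L(theta) = ||theta - theta_star||^2."""
    return np.sum((theta - theta_star) ** 2) + rng.normal()

theta = np.zeros(d)
for n in range(1, 50_001):
    a_n = 0.2 / (n + 100) ** 0.602
    c_n = 0.1 / n ** 0.101
    delta = rng.choice([-1.0, 1.0], size=d)  # simultaneous random perturbation
    # Only two loss evaluations per iteration, regardless of d:
    diff = noisy_loss(theta + c_n * delta) - noisy_loss(theta - c_n * delta)
    theta = theta - a_n * diff / (2.0 * c_n * delta)

print(np.round(theta, 2))  # should approach theta_star = (1, ..., 1)
```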

Further developments


An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on.[21][22] These methods are also applied in control theory, in which case the unknown function we wish to optimize or find the zero of may vary in time. In this case, the step size $a_n$ should not converge to zero but should be chosen so as to track the function ([21], 2nd ed., chapter 3).

C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust estimation.[23]

The main tool for analyzing stochastic approximation algorithms (including the Robbins–Monro and the Kiefer–Wolfowitz algorithms) is a theorem by Aryeh Dvoretzky published in 1956.[24]

References

  1. Toulis, Panos; Airoldi, Edoardo (2015). "Scalable estimation strategies based on stochastic approximations: classical results and new insights". Statistics and Computing. 25 (4): 781–795. doi:10.1007/s11222-015-9560-y. PMC 4484776. PMID 26139959.
  2. Le Ny, Jerome. "Introduction to Stochastic Approximation Algorithms" (PDF). Polytechnique Montreal. Teaching notes. Retrieved 16 November 2016.
  3. Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
  4. Blum, Julius R. (1954). "Approximation Methods which Converge with Probability one". The Annals of Mathematical Statistics. 25 (2): 382–386. doi:10.1214/aoms/1177728794. ISSN 0003-4851.
  5. Sacks, J. (1958). "Asymptotic Distribution of Stochastic Approximation Procedures". The Annals of Mathematical Statistics. 29 (2): 373–405. doi:10.1214/aoms/1177706619. JSTOR 2237335.
  6. Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. (2009). "Robust Stochastic Approximation Approach to Stochastic Programming". SIAM Journal on Optimization. 19 (4): 1574. doi:10.1137/070704277.
  7. Nemirovski, A.; Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics 15. New York: John Wiley.
  8. Spall, J. C. (2003). Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Hoboken, NJ: John Wiley.
  9. Chung, K. L. (1954). "On a Stochastic Approximation Method". The Annals of Mathematical Statistics. 25 (3): 463–483. doi:10.1214/aoms/1177728716. ISSN 0003-4851.
  10. Fabian, Vaclav (1968). "On Asymptotic Normality in Stochastic Approximation". The Annals of Mathematical Statistics. 39 (4): 1327–1332. doi:10.1214/aoms/1177698258. ISSN 0003-4851.
  11. Lai, T. L.; Robbins, Herbert (1979). "Adaptive Design and Stochastic Approximation". The Annals of Statistics. 7 (6): 1196–1221. doi:10.1214/aos/1176344840. ISSN 0090-5364.
  12. Lai, Tze Leung; Robbins, Herbert (1981). "Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes". Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 56 (3): 329–360. doi:10.1007/BF00536178. ISSN 0044-3719. S2CID 122109044.
  13. Polyak, B. T. (1991). "New stochastic approximation type procedures" (in Russian). Automation and Remote Control. 7 (7).
  14. Ruppert, David (1988). Efficient Estimators from a Slowly Converging Robbins–Monro Process (Technical Report 781). Cornell University School of Operations Research and Industrial Engineering.
  15. Polyak, B. T.; Juditsky, A. B. (1992). "Acceleration of Stochastic Approximation by Averaging". SIAM Journal on Control and Optimization. 30 (4): 838. doi:10.1137/0330046.
  16. Nemirovski, A.; Yudin, D. (1978). "On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions". Dokl. Akad. Nauk SSR 2939 (in Russian); Soviet Math. Dokl. 19 (in English).
  17. Kushner, Harold; Yin, G. George (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer. ISBN 9780387008943.
  18. Bouleau, N.; Lepingle, D. (1994). Numerical Methods for Stochastic Processes. New York: John Wiley. ISBN 9780471546412.
  19. Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462. doi:10.1214/aoms/1177729392.
  20. Spall, J. C. (2000). "Adaptive stochastic approximation by the simultaneous perturbation method". IEEE Transactions on Automatic Control. 45 (10): 1839–1853. doi:10.1109/TAC.2000.880982.
  21. Kushner, H. J.; Yin, G. G. (1997). Stochastic Approximation Algorithms and Applications. Springer. doi:10.1007/978-1-4899-2696-8. ISBN 978-1-4899-2698-2.
  22. Nevel'son, Mikhail Borisovich; Has'minskiĭ, Rafail Zalmanovich (1973). Stochastic Approximation and Recursive Estimation. Translated by the Israel Program for Scientific Translations and B. Silver. Providence, RI: American Mathematical Society. ISBN 0-8218-1597-0.
  23. Martin, R.; Masreliez, C. (1975). "Robust estimation via stochastic approximation". IEEE Transactions on Information Theory. 21 (3): 263. doi:10.1109/TIT.1975.1055386.
  24. Dvoretzky, Aryeh (1956). "On stochastic approximation". In Neyman, Jerzy (ed.). Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I. University of California Press. pp. 39–55. MR 0084911.