
Policy gradient method


Policy gradient methods are a class of reinforcement learning algorithms.

Policy gradient methods are a sub-class of policy optimization methods. Unlike value-based methods, which learn a value function and derive a policy from it, policy optimization methods directly learn a policy function $\pi$ that selects actions without consulting a value function. For policy gradient methods to apply, the policy function $\pi_\theta$ must be parameterized by a differentiable parameter $\theta$.[1]

Overview


In policy-based RL, the actor is a parameterized policy function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$.

If the action space is discrete, then $\sum_a \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_a \pi_\theta(a \mid s)\,\mathrm{d}a = 1$.

The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t \in 0:T} \gamma^t R_t \,\Big|\, S_0 = s_0\right]$$
where $\gamma$ is the discount factor, $R_t$ is the reward at step $t$, $s_0$ is the starting state, and $T$ is the time horizon (which can be infinite).
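
For concreteness, here is a minimal sketch of the quantity inside the expectation, the discounted episodic return, computed from one episode's reward sequence. The reward values and the choice of $\gamma$ below are arbitrary illustrations, not from the source.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * R_t over one episode (the term inside J's expectation)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps of reward 1.0 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```
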

The policy gradient is defined as $\nabla_\theta J(\theta)$. Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize $J(\theta)$ by gradient ascent. Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, such methods are also studied under the title of "Monte Carlo gradient estimation".[2]

REINFORCE


Policy gradient


The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first policy gradient method.[3] It is based on the following identity for the policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t \in 0:T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t)\;\sum_{t \in 0:T} \gamma^t R_t \,\Big|\, S_0 = s_0\right]$$
which can be improved via the "causality trick":[1]
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t \in 0:T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t)\sum_{\tau \in t:T} \gamma^\tau R_\tau \,\Big|\, S_0 = s_0\right]$$

Lemma. The expectation of the score function is zero, conditional on any present or past state. That is, for any $0 \leq i \leq j \leq T$ and any state $s_i$, we have
$$\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\big|\, S_i = s_i\right] = 0.$$

Further, if $\Psi_i$ is a random variable that is independent of $A_i, S_{i+1}, A_{i+1}, \dots$, then
$$\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \cdot \Psi_i \,\big|\, S_i = s_i\right] = 0.$$

Proofs
Proof of the lemma

Use the log-derivative trick, that is, the identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \ln \pi_\theta$.

$$\begin{aligned}\mathbb{E}_{\pi_\theta}[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \mid S_i = s_i]&= \sum_s \Pr(S_j = s \mid S_i = s_i) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \ln \pi_\theta(a \mid s) \\&= \sum_s \Pr(S_j = s \mid S_i = s_i) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\&= \sum_s \Pr(S_j = s \mid S_i = s_i) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\&= \sum_s \Pr(S_j = s \mid S_i = s_i)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)\end{aligned}$$
Since the policy $\pi_\theta(a \mid s)$ is a probability distribution over actions for a given state, $\sum_a \pi_\theta(a \mid s) = 1$, so
$$\mathbb{E}_{\pi_\theta}[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \mid S_i = s_i] = \sum_s \Pr(S_j = s \mid S_i = s_i)\, \nabla_\theta(1) = \sum_s \Pr(S_j = s \mid S_i = s_i) \cdot 0 = 0.$$

By the tower law and the first part of the lemma:

$$\begin{aligned}\mathbb{E}_{\pi_\theta}\left[\Psi_i \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\Big|\, S_i = s_i\right]&= \mathbb{E}_{\pi_\theta}\left[\mathbb{E}_{\pi_\theta}[\Psi_i \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \mid S_j] \,\Big|\, S_i = s_i\right] \\&= \mathbb{E}_{\pi_\theta}\left[\Psi_i\, \mathbb{E}_{\pi_\theta}[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \mid S_j] \,\Big|\, S_i = s_i\right] \\&= \mathbb{E}_{\pi_\theta}\left[\Psi_i \cdot 0 \,\Big|\, S_i = s_i\right] \\&= 0\end{aligned}$$

Proof of the two identities

Applying the log-derivative trick,

$$\begin{aligned}\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\pi_\theta}\left[\sum_{i \in 0:T} \gamma^i R_i \,\Big|\, S_0 = s_0\right] \\&= \mathbb{E}_{\pi_\theta}\left[\left(\sum_{i \in 0:T} \gamma^i R_i\right) \nabla_\theta \ln \pi_\theta(A_0, A_1, \dots, A_T \mid S_0, S_1, \dots, S_T) \,\Big|\, S_0 = s_0\right] \\&= \mathbb{E}_{\pi_\theta}\left[\left(\sum_{i \in 0:T} \gamma^i R_i\right) \sum_{j \in 0:T} \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\Big|\, S_0 = s_0\right] \\&= \mathbb{E}_{\pi_\theta}\left[\sum_{i,j \in 0:T} \gamma^i R_i\, \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\Big|\, S_0 = s_0\right]\end{aligned}$$
which is the first equation.

By the lemma, $\mathbb{E}_{\pi_\theta}\left[\gamma^i R_i\, \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\big|\, S_0 = s_0\right] = 0$ for any $0 \leq i < j \leq T$. Plugging this into the previous formula zeros out a whole triangle of terms, giving
$$\begin{aligned}\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq i \leq T} \gamma^i R_i\, \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\Big|\, S_0 = s_0\right] \\&= \mathbb{E}_{\pi_\theta}\left[\sum_{j \in 0:T} \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \sum_{i \in j:T} \gamma^i R_i \,\Big|\, S_0 = s_0\right]\end{aligned}$$
which is the second equation.

Thus, we have an unbiased estimator of the policy gradient:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^N \left[\sum_{t \in 0:T} \nabla_\theta \ln \pi_\theta(A_{t,n} \mid S_{t,n}) \sum_{\tau \in t:T} \gamma^{\tau - t} R_{\tau,n}\right]$$
where the index $n$ ranges over $N$ rollout trajectories generated by the policy $\pi_\theta$.

The score function $\nabla_\theta \ln \pi_\theta(A_t \mid S_t)$ can be interpreted as the direction in parameter space that increases the probability of taking action $A_t$ in state $S_t$. The policy gradient is then a weighted average over all possible directions that increase the probability of taking some action in some state, weighted by the reward signal: if taking a certain action in a certain state is associated with high reward, that direction is strongly reinforced, and vice versa.

Algorithm


The REINFORCE algorithm is a loop:

  1. Roll out $N$ trajectories in the environment, using $\pi_{\theta_i}$ as the policy function.
  2. Compute the policy gradient estimate: $g_i \leftarrow \frac{1}{N} \sum_{n=1}^N \left[\sum_{t \in 0:T} \nabla_\theta \ln \pi_{\theta_i}(A_{t,n} \mid S_{t,n}) \sum_{\tau \in t:T} \gamma^\tau R_{\tau,n}\right]$
  3. Update the policy by gradient ascent: $\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i$

Here, $\alpha_i$ is the learning rate at update step $i$.
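
As a concrete illustration, here is a minimal numerical sketch of this loop for a tabular softmax policy. The three-state chain environment, the horizon of 10 steps, and all hyperparameters are illustrative assumptions of this sketch, not part of the source; the credit term uses the $\gamma^\tau$ weighting of step 2 above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 3, 2, 0.99, 0.1
theta = np.zeros((n_states, n_actions))          # policy parameters

def policy(s):
    z = theta[s] - theta[s].max()                # softmax over theta[s]
    p = np.exp(z)
    return p / p.sum()                           # pi_theta(. | s)

def step(s, a):
    # Hypothetical toy dynamics: action 1 moves right; reward for reaching the last state.
    s_next = min(s + a, n_states - 1)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for iteration in range(200):
    grad, N = np.zeros_like(theta), 16
    for _ in range(N):                           # step 1: roll out N trajectories
        s, traj = 0, []
        for t in range(10):                      # finite horizon T
            a = rng.choice(n_actions, p=policy(s))
            s_next, r = step(s, a)
            traj.append((s, a, r))
            s = s_next
        for t, (s_t, a_t, _) in enumerate(traj): # step 2: score * discounted reward after t
            G_t = sum(gamma ** tau * traj[tau][2] for tau in range(t, len(traj)))
            score = -policy(s_t)
            score[a_t] += 1.0                    # grad of ln pi(a_t|s_t) w.r.t. theta[s_t]
            grad[s_t] += score * G_t
    theta += alpha * grad / N                    # step 3: gradient ascent
```
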

Variance reduction


REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy $\pi_\theta$. This can lead to high variance in the updates, as the returns $R(\tau)$ can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of variance reduction.

REINFORCE with baseline


A common way of reducing variance is the REINFORCE with baseline algorithm, based on the following identity:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t \in 0:T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \left(\sum_{\tau \in t:T} \gamma^\tau R_\tau - b(S_t)\right) \,\Big|\, S_0 = s_0\right]$$
valid for any function $b : \text{States} \to \mathbb{R}$. This can be proven by applying the previous lemma.

The algorithm uses the modified gradient estimator
$$g_i \leftarrow \frac{1}{N} \sum_{n=1}^N \left[\sum_{t \in 0:T} \nabla_\theta \ln \pi_{\theta_i}(A_{t,n} \mid S_{t,n}) \left(\sum_{\tau \in t:T} \gamma^\tau R_{\tau,n} - b_i(S_{t,n})\right)\right]$$
The original REINFORCE algorithm is the special case where $b_i \equiv 0$.
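
A minimal sketch of the baselined credit term, reusing the trajectory format of the earlier sketch. The per-state baseline table and its running-mean update rule are illustrative assumptions; only the subtraction itself comes from the identity above.

```python
import numpy as np

baseline = np.zeros(n_states)                    # b_i(s), carried over between updates

def baselined_credit(traj, t, gamma, lr=0.05):
    """Discounted reward following step t minus the state baseline, as in the estimator above."""
    s_t = traj[t][0]
    G_t = sum(gamma ** tau * traj[tau][2] for tau in range(t, len(traj)))
    credit = G_t - baseline[s_t]                 # reward-to-go minus baseline
    baseline[s_t] += lr * credit                 # slowly track the expected return (assumption)
    return credit
```
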

Actor-critic methods

Main article: Actor-critic algorithm

If $b_i$ is chosen well, such that $b_i(S_t) \approx \mathbb{E}_{\pi_{\theta_i}}\!\left[\sum_{\tau \in t:T} \gamma^\tau R_\tau \,\Big|\, S_t\right] = \gamma^t V^{\pi_{\theta_i}}(S_t)$, this can significantly decrease the variance of the gradient estimate. That is, the baseline should be as close as possible to the (discounted) value function $V^{\pi_{\theta_i}}(S_t)$, approaching the ideal:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t \in 0:T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \left(\sum_{\tau \in t:T} \gamma^\tau R_\tau - \gamma^t V^{\pi_\theta}(S_t)\right) \,\Big|\, S_0 = s_0\right]$$
Note that, as the policy $\pi_{\theta_i}$ updates, the value function $V^{\pi_{\theta_i}}(S_t)$ changes as well, so the baseline must also be updated. One common approach is to train a separate function that estimates the value function and use it as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic.

The Q-function $Q^\pi$ can also be used as the critic, since
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \gamma^t\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot Q^{\pi_\theta}(S_t, A_t) \,\Big|\, S_0 = s_0\right]$$
by a similar argument using the tower law.

Subtracting the value function as a baseline, we find that the advantage function $A^\pi(S, A) = Q^\pi(S, A) - V^\pi(S)$ can be used as the critic as well:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \gamma^t\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot A^{\pi_\theta}(S_t, A_t) \,\Big|\, S_0 = s_0\right]$$
In summary, there are many unbiased estimators of $\nabla_\theta J(\theta)$, all of the form
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot \Psi_t \,\Big|\, S_0 = s_0\right]$$
where $\Psi_t$ is any linear sum of the following terms:

  • $\sum_{0 \leq \tau \leq T} \gamma^\tau R_\tau$: the total discounted reward of the trajectory.
  • $\sum_{t \leq \tau \leq T} \gamma^\tau R_\tau$: the discounted reward following the action (the causality trick).
  • $\sum_{t \leq \tau \leq T} \gamma^\tau R_\tau - b(S_t)$: the previous term with a baseline subtracted.
  • $\gamma^t Q^{\pi_\theta}(S_t, A_t)$: the Q-function critic.
  • $\gamma^t A^{\pi_\theta}(S_t, A_t)$: the advantage-function critic.

Some further choices of $\Psi_t$, with very similar proofs, include the temporal-difference residual $\gamma^t\left(R_t + \gamma V^{\pi_\theta}(S_{t+1}) - V^{\pi_\theta}(S_t)\right)$ and its exponentially weighted combinations, as in generalized advantage estimation.[4]
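
A minimal sketch of one such choice: $\Psi_t$ taken to be the discounted one-step TD residual computed from a learned state-value table. The trajectory format and the value table `V` are illustrative assumptions carried over from the earlier sketches.

```python
import numpy as np

V = np.zeros(n_states)                           # learned critic (here just a table)

def psi_td_residual(traj, t, gamma):
    """gamma^t * (R_t + gamma * V(S_{t+1}) - V(S_t)), an advantage estimate when V ~ V^pi."""
    s_t, _, r_t = traj[t]
    v_next = V[traj[t + 1][0]] if t + 1 < len(traj) else 0.0
    delta_t = r_t + gamma * v_next - V[s_t]
    return (gamma ** t) * delta_t
```
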

Natural policy gradient


The natural policy gradient method is a variant of the policy gradient method, proposed by Sham Kakade in 2001.[5] Unlike standard policy gradient methods, which depend on the choice of parameters $\theta$ (making updates coordinate-dependent), the natural policy gradient aims to provide a coordinate-free update, which is geometrically "natural".

Motivation


Standard policy gradient updates $\theta_{i+1} = \theta_i + \alpha \nabla_\theta J(\theta_i)$ solve a constrained optimization problem:
$$\begin{cases} \max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i) \\ \|\theta_{i+1} - \theta_i\| \leq \alpha \cdot \|\nabla_\theta J(\theta_i)\| \end{cases}$$
While the objective (the linearized improvement) is geometrically meaningful, the Euclidean constraint on $\|\theta_{i+1} - \theta_i\|$ introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a Kullback–Leibler divergence (KL) constraint:
$$\begin{cases} \max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i) \\ \bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \leq \epsilon \end{cases}$$
where the KL divergence between two policies is averaged over the state distribution under policy $\pi_{\theta_i}$. That is,
$$\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) := \mathbb{E}_{s \sim \pi_{\theta_i}}\left[D_{KL}\left(\pi_{\theta_{i+1}}(\cdot \mid s) \,\|\, \pi_{\theta_i}(\cdot \mid s)\right)\right]$$
This ensures that updates are invariant to invertible affine transformations of the parameters.

Fisher information approximation


For small $\epsilon$, the KL divergence is approximated by the Fisher information metric:
$$\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \approx \frac{1}{2} (\theta_{i+1} - \theta_i)^T F(\theta_i) (\theta_{i+1} - \theta_i)$$
where $F(\theta)$ is the Fisher information matrix of the policy, defined as
$$F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a \mid s)\, \left(\nabla_\theta \ln \pi_\theta(a \mid s)\right)^T\right]$$
This turns the problem into a quadratic program, yielding the natural policy gradient update:
$$\theta_{i+1} = \theta_i + \alpha\, F(\theta_i)^{-1} \nabla_\theta J(\theta_i)$$
The step size $\alpha$ is typically adjusted to maintain the KL constraint, with
$$\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_i))^T F(\theta_i)^{-1} \nabla_\theta J(\theta_i)}}.$$
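
A minimal sketch of one natural-gradient step with an explicitly estimated Fisher matrix, which is feasible only when $\theta$ is low-dimensional. Here `scores` is an assumed $(N, d)$ array of sampled score vectors $\nabla_\theta \ln \pi_\theta(a \mid s)$, `g` the estimated policy gradient, and the damping term is an added numerical safeguard, not part of the source formula.

```python
import numpy as np

def natural_gradient_step(theta, scores, g, epsilon=0.01, damping=1e-3):
    d = theta.size
    F = scores.T @ scores / scores.shape[0] + damping * np.eye(d)   # Fisher estimate E[score score^T]
    direction = np.linalg.solve(F, g)                               # F^{-1} g
    step = np.sqrt(2.0 * epsilon / (g @ direction))                 # keep the quadratic KL near epsilon
    return theta + step * direction
```
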

Inverting $F(\theta)$ is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations therefore often rely on approximations.
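
One widely used approximation (also used by TRPO below) solves $F(\theta_i)\,x = g$ iteratively with the conjugate gradient method, which needs only Fisher-vector products and never forms or inverts $F$ explicitly. A minimal sketch, where `fisher_vector_product` is an assumed callable, for instance built from Hessian-vector products of the averaged KL divergence in an automatic-differentiation framework:

```python
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products v -> F v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - F x (x starts at zero)
    p = g.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        step = rs_old / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                          # approximately F^{-1} g
```
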

Trust Region Policy Optimization (TRPO)


Trust Region Policy Optimization (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a trust region constraint on policy updates.[6] It was developed by Schulman et al. in 2015.

The natural gradient update is theoretically optimal if the objective is truly quadratic, but this is only an approximation. TRPO's line search and KL constraint attempt to restrict the solution to a "trust region" within which this approximation does not break down, making TRPO more robust in practice.

Formulation


Like the natural policy gradient, TRPO iteratively updates the policy parameters $\theta$ by solving a constrained optimization problem, specified coordinate-free:
$$\begin{cases} \max_\theta L(\theta, \theta_i) \\ \bar{D}_{KL}(\pi_\theta \| \pi_{\theta_i}) \leq \epsilon \end{cases}$$
where the surrogate advantage is
$$L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_i}(a \mid s)}\, A^{\pi_{\theta_i}}(s, a)\right]$$
and $\bar{D}_{KL}$ is the averaged KL divergence defined above.

Note that, in general, other surrogate advantages are possible:
$$L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_i}(a \mid s)}\, \Psi^{\pi_{\theta_i}}(s, a)\right]$$
where $\Psi$ is any linear sum of the previously mentioned type. Indeed, OpenAI recommended using the generalized advantage estimate instead of the plain advantage $A^{\pi_\theta}$.

The surrogate advantage $L(\theta, \theta_i)$ is designed to align with the policy gradient $\nabla_\theta J(\theta)$. Specifically, when $\theta = \theta_i$, $\nabla_\theta L(\theta, \theta_i)$ equals the policy gradient derived from the advantage function:
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a \mid s) \cdot A^{\pi_\theta}(s, a)\right] = \nabla_\theta L(\theta, \theta_i)\big|_{\theta = \theta_i}$$
However, when $\theta \neq \theta_i$, this no longer necessarily holds; hence it is only a "surrogate" of the real objective.

As with the natural policy gradient, for small policy updates TRPO approximates the surrogate advantage and the KL divergence using Taylor expansions around $\theta_i$:
$$\begin{aligned} L(\theta, \theta_i) &\approx g^T (\theta - \theta_i), \\ \bar{D}_{\text{KL}}(\pi_\theta \| \pi_{\theta_i}) &\approx \frac{1}{2} (\theta - \theta_i)^T H (\theta - \theta_i), \end{aligned}$$
where $g = \nabla_\theta L(\theta, \theta_i)\big|_{\theta = \theta_i}$ is the policy gradient and $H$ is the Hessian of the averaged KL divergence at $\theta_i$, which equals the Fisher information matrix $F(\theta_i)$.

This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
$$\theta_{i+1} = \theta_i + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}}\, F^{-1} g.$$
So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it with two modifications:

  • The product $F^{-1} g$ is computed approximately with the conjugate gradient method, using only Fisher-vector products, so that $F$ is never formed or inverted explicitly.
  • A backtracking line search shrinks the proposed step until the exact (not approximated) KL constraint is satisfied and the surrogate advantage actually improves.
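
A minimal sketch of the second modification, the backtracking line search. Here `surrogate` and `kl` are assumed callables returning $L(\theta, \theta_i)$ and the averaged KL divergence for a candidate $\theta$, and `full_step` is the natural-gradient step computed above; the halving schedule and iteration cap are illustrative choices.

```python
import numpy as np

def backtracking_line_search(theta_i, full_step, surrogate, kl, epsilon, max_backtracks=10):
    L_old = surrogate(theta_i)
    for k in range(max_backtracks):
        theta_new = theta_i + (0.5 ** k) * full_step      # shrink the step geometrically
        if kl(theta_new) <= epsilon and surrogate(theta_new) > L_old:
            return theta_new                              # first feasible, improving step
    return theta_i                                        # otherwise reject the update
```
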

Proximal Policy Optimization (PPO)


A further improvement is proximal policy optimization (PPO), which avoids even computing $F(\theta)$ and $F(\theta)^{-1}$ by using a first-order approximation with clipped probability ratios.[7]

Specifically, instead of maximizing the surrogate advantage
$$\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}\, A^{\pi_{\theta_t}}(s, a)\right]$$
under a KL divergence constraint, PPO inserts the constraint directly into the surrogate advantage:
$$\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\begin{cases} \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)},\, 1 + \epsilon\right) A^{\pi_{\theta_t}}(s, a) & \text{if } A^{\pi_{\theta_t}}(s, a) > 0 \\ \max\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)},\, 1 - \epsilon\right) A^{\pi_{\theta_t}}(s, a) & \text{if } A^{\pi_{\theta_t}}(s, a) < 0 \end{cases}\right]$$
and maximizes this clipped surrogate by stochastic gradient ascent, as usual.

In words, gradient-ascending the new surrogate objective means that, at some state–action pair $s, a$, if the advantage is positive ($A^{\pi_{\theta_t}}(s, a) > 0$), the gradient should push $\theta$ in the direction that increases the probability of performing action $a$ in state $s$. However, as soon as $\theta$ has changed so much that $\pi_\theta(a \mid s) \geq (1 + \epsilon)\, \pi_{\theta_t}(a \mid s)$, the gradient stops pointing in that direction, and similarly if $A^{\pi_{\theta_t}}(s, a) < 0$. Thus, PPO avoids pushing the parameter update too hard and avoids changing the policy too much.
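
A minimal sketch of the clipped surrogate for one batch of samples gathered under $\pi_{\theta_t}$. The arrays `logp_new`, `logp_old`, and `adv` are assumed per-sample values of $\log \pi_\theta(a \mid s)$, $\log \pi_{\theta_t}(a \mid s)$, and advantage estimates; the min/clip form below is equivalent to the case analysis above.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, adv, eps=0.2):
    ratio = np.exp(logp_new - logp_old)               # pi_theta / pi_theta_t
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))    # maximize this quantity
```
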

To be more precise, updating $\theta_t$ to $\theta_{t+1}$ requires multiple update steps on the same batch of data. The algorithm initializes $\theta = \theta_t$, then repeatedly applies gradient steps (for example with the Adam optimizer) to update $\theta$ until the surrogate advantage has stabilized. It then assigns $\theta$ to $\theta_{t+1}$ and repeats.

During this inner loop, the first update to $\theta$ does not hit the $1 - \epsilon, 1 + \epsilon$ bounds, but as $\theta$ moves further and further away from $\theta_t$, it eventually starts hitting them. For each bound that is hit, the corresponding gradient becomes zero, and thus PPO avoids updating $\theta$ too far away from $\theta_t$.

This matters because the surrogate loss assumes that the state–action pairs $s, a$ are sampled from what the agent would see when running the policy $\pi_{\theta_t}$, whereas the policy gradient should be on-policy. As $\theta$ changes, the surrogate loss becomes more and more off-policy, which is why keeping $\theta$ proximal to $\theta_t$ is necessary.

If there is a reference policy $\pi_{\text{ref}}$ from which the trained policy should not diverge too far, an additional KL divergence penalty can be added:
$$-\beta\, \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\log\left(\frac{\pi_\theta(a \mid s)}{\pi_{\text{ref}}(a \mid s)}\right)\right]$$
where $\beta$ adjusts the strength of the penalty. This has been used in training reasoning language models with reinforcement learning from human feedback.[8] The KL divergence penalty term can be estimated with lower variance using the equivalent form (see f-divergence for details):[9]
$$-\beta\, \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\log\left(\frac{\pi_\theta(a \mid s)}{\pi_{\text{ref}}(a \mid s)}\right) + \frac{\pi_{\text{ref}}(a \mid s)}{\pi_\theta(a \mid s)} - 1\right]$$
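
A minimal sketch of the two penalty estimators, written over a batch of sampled actions. Here `logp` and `logp_ref` are assumed per-sample arrays of $\log \pi_\theta(a \mid s)$ and $\log \pi_{\text{ref}}(a \mid s)$ for actions in the batch; both functions mirror the two expressions above.

```python
import numpy as np

def kl_penalty_naive(logp, logp_ref, beta):
    return -beta * np.mean(logp - logp_ref)

def kl_penalty_low_variance(logp, logp_ref, beta):
    log_ratio = logp - logp_ref
    # log(pi/pi_ref) + pi_ref/pi - 1 is nonnegative per sample, which lowers the variance.
    return -beta * np.mean(log_ratio + np.exp(-log_ratio) - 1.0)
```
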

Group Relative Policy Optimization (GRPO)


Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator $V$. Instead, for each state $s$, it samples multiple actions $a_1, \dots, a_G$ from the policy $\pi_{\theta_t}$ and computes the group-relative advantage[9]
$$A^{\pi_{\theta_t}}(s, a_j) = \frac{r(s, a_j) - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of $r(s, a_1), \dots, r(s, a_G)$. That is, it is the standard score of the rewards.
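
A minimal sketch of the group-relative advantage: the rewards of the $G$ actions sampled for the same state are standardized. The small constant added to the denominator is an illustrative numerical safeguard, not part of the source formula.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """Standard score of each sampled action's reward within its group."""
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)

# Example: rewards [1, 0, 0, 1] give advantages [+1, -1, -1, +1].
print(group_relative_advantage(np.array([1.0, 0.0, 0.0, 1.0])))
```
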

Then it maximizes the PPO objective, averaged over all actions:
$$\max_\theta \frac{1}{G} \sum_{i=1}^G \mathbb{E}_{(s, a_1, \dots, a_G) \sim \pi_{\theta_t}}\left[\begin{cases} \min\left(\frac{\pi_\theta(a_i \mid s)}{\pi_{\theta_t}(a_i \mid s)},\, 1 + \epsilon\right) A^{\pi_{\theta_t}}(s, a_i) & \text{if } A^{\pi_{\theta_t}}(s, a_i) > 0 \\ \max\left(\frac{\pi_\theta(a_i \mid s)}{\pi_{\theta_t}(a_i \mid s)},\, 1 - \epsilon\right) A^{\pi_{\theta_t}}(s, a_i) & \text{if } A^{\pi_{\theta_t}}(s, a_i) < 0 \end{cases}\right]$$
Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than the other actions tried at that state, and less likely to respond with one that performed relatively worse.

As before, the KL penalty term can be applied to encourage the trained policy to stay close to a reference policy. GRPO was first proposed in the context of training reasoning language models by researchers at DeepSeek.[9]

Policy Optimization and the Mirror Descent perspective (MDPO)


Methods like TRPO, PPO, and the natural policy gradient share a common idea: while the policy should be updated in the direction of the policy gradient, the update should be done in a safe and stable manner, typically measured by some distance to the policy before the update.

A similar notion of update stability is found in proximal convex optimization techniques such as mirror descent.[10] There, $\mathbf{x}$, the proposed minimizer of $f$ over some constraint set $\mathcal{C}$, is iteratively updated in the direction of the gradient $\nabla f$, with a proximity penalty with respect to the current iterate $\mathbf{x}_t$ measured by some Bregman divergence $B_\omega$. This can be formalized as
$$\mathbf{x}_{t+1} \in \arg\min_{\mathbf{x} \in \mathcal{C}} \nabla f(\mathbf{x}_t)^T (\mathbf{x} - \mathbf{x}_t) + \frac{1}{\eta_t} B_\omega(\mathbf{x}, \mathbf{x}_t),$$
where $\eta_t$ controls the proximity between consecutive iterates, similar to the learning rate in gradient descent.

This leads to reconsidering the policy update procedure as an optimization procedure that seeks an optimal policy in the (non-convex) optimization landscape of the underlying Markov decision process (MDP). This optimization viewpoint of the policy gradient is termed Mirror Descent Policy Optimization (MDPO),[11][12] leading to the following update when the KL divergence is the chosen Bregman divergence:
$$\pi_{t+1} \in \arg\max_\pi \mathbb{E}_{s, a \sim \pi}\left[A^{\pi_t}(s, a)\right] - \frac{1}{\eta_t} D_{KL}(\pi \,\|\, \pi_t)$$
With a parameterized policy $\pi_\theta$, the MDPO loss becomes
$$\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}\, A^{\pi_{\theta_t}}(s, a)\right] - \frac{1}{\eta_t} D_{KL}(\pi_\theta \,\|\, \pi_{\theta_t})$$
This objective can be combined with other common techniques, such as the clipping used in PPO. In fact, a KL divergence penalty also appears in the original PPO paper,[7] suggesting the MDPO perspective as a theoretical unification of the main derivation ideas behind many concurrent policy gradient techniques.
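
A minimal sketch of a sampled MDPO-style objective for one batch: the importance-weighted advantage minus a KL proximity term weighted by $1/\eta_t$. The arrays `logp_new`, `logp_old`, and `adv` are assumed per-sample values gathered under $\pi_{\theta_t}$, and the KL term is estimated from those same samples via the identity $D_{KL}(\pi_\theta \| \pi_{\theta_t}) = \mathbb{E}_{\pi_{\theta_t}}[\rho \log \rho]$ with $\rho = \pi_\theta / \pi_{\theta_t}$.

```python
import numpy as np

def mdpo_objective(logp_new, logp_old, adv, eta_t):
    log_ratio = logp_new - logp_old
    ratio = np.exp(log_ratio)
    surrogate = np.mean(ratio * adv)              # E[(pi_theta / pi_theta_t) * A]
    kl_estimate = np.mean(ratio * log_ratio)      # sample estimate of KL(pi_theta || pi_theta_t)
    return surrogate - kl_estimate / eta_t        # maximize this quantity
```
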


References

  1. Sutton, Richard S.; McAllester, David; Singh, Satinder; Mansour, Yishay (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation". Advances in Neural Information Processing Systems. 12. MIT Press.
  2. Mohamed, Shakir; Rosca, Mihaela; Figurnov, Michael; Mnih, Andriy (2020). "Monte Carlo Gradient Estimation in Machine Learning". Journal of Machine Learning Research. 21 (132): 1–62. arXiv:1906.10652. ISSN 1533-7928.
  3. Williams, Ronald J. (May 1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning. 8 (3–4): 229–256. doi:10.1007/BF00992696. ISSN 0885-6125.
  4. Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (2018-10-20). "High-Dimensional Continuous Control Using Generalized Advantage Estimation". arXiv:1506.02438 [cs.LG].
  5. Kakade, Sham M. (2001). "A Natural Policy Gradient". Advances in Neural Information Processing Systems. 14. MIT Press.
  6. Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael; Abbeel, Pieter (2015-07-06). "Trust region policy optimization". Proceedings of the 32nd International Conference on Machine Learning. 37. Lille, France: JMLR.org: 1889–1897.
  7. Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg (2017-08-28). "Proximal Policy Optimization Algorithms". arXiv:1707.06347 [cs.LG].
  8. Stiennon, Nisan; Ouyang, Long; Wu, Jeffrey; Ziegler, Daniel; Lowe, Ryan; Voss, Chelsea; Radford, Alec; Amodei, Dario; Christiano, Paul F. (2020). "Learning to summarize with human feedback". Advances in Neural Information Processing Systems. 33.
  9. Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Bi, Xiao; Zhang, Haowei; Zhang, Mingchuan; Li, Y. K. (2024-04-27). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". arXiv:2402.03300 [cs.CL].
  10. Nemirovsky, Arkadi; Yudin, David (1983). Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons.
  11. Shani, Lior; Efroni, Yonathan; Mannor, Shie (2020-04-03). "Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs". Proceedings of the AAAI Conference on Artificial Intelligence. 34 (4): 5668–5675. arXiv:1909.02769. doi:10.1609/aaai.v34i04.6021. ISSN 2374-3468.
  12. Tomar, Manan; Shani, Lior; Efroni, Yonathan; Ghavamzadeh, Mohammad (2020-05-20). "Mirror Descent Policy Optimization". arXiv:2005.09814v5 [cs.LG].
  • Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series (2nd ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.
  • Bertsekas, Dimitri P. (2019). Reinforcement Learning and Optimal Control (2nd ed.). Belmont, Massachusetts: Athena Scientific. ISBN 978-1-886529-39-7.
  • Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1st ed.). Cham: Springer International Publishing. ISBN 978-3-031-00423-0.
