Policy Gradient: A Summary!

The mathematical foundations of policy gradient algorithms.

Policy Gradient preliminaries

Policy gradient methods estimate the policy gradient with estimators of the form: [10]

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

where $\Psi_t$ may take one of the following forms:

$$\begin{aligned}
&(1)\ \sum_{t=0}^{\infty} r_t && \text{total reward of the trajectory}\\
&(2)\ \sum_{t'=t}^{\infty} r_{t'} && \text{reward following action } a_t\\
&(3)\ \sum_{t'=t}^{\infty} r_{t'} - b(s_t) && \text{baselined version of (2)}\\
&(4)\ Q^\pi(s_t, a_t) && \text{state-action value function}\\
&(5)\ A^\pi(s_t, a_t) && \text{advantage function}\\
&(6)\ r_t + V^\pi(s_{t+1}) - V^\pi(s_t) && \text{TD residual}
\end{aligned}$$

Among these, the advantage function (5) yields almost the lowest possible variance, and the TD residual (6) can be viewed as an estimate of it, where

$$V^\pi(s_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right], \qquad Q^\pi(s_t, a_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right]$$

The advantage function

$$A^\pi(s_t, a_t) := Q^\pi(s_t, a_t) - V^\pi(s_t)$$
  • Intuitive interpretation: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function measures whether an action is better or worse than the policy's default behavior (its expectation).

Vanilla Policy Gradient

The goal of RL

Objective

$$\theta^* = \arg\max_\theta\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
  • Infinite horizon: $\theta^* = \arg\max_\theta\, \mathbb{E}_{(s,a)\sim p_\theta(s,a)}\left[r(s,a)\right]$
  • Finite horizon: $\theta^* = \arg\max_\theta\, \sum_{t=1}^{T}\mathbb{E}_{(s_t,a_t)\sim p_\theta(s_t,a_t)}\left[r(s_t,a_t)\right]$

Evaluating the objective

$$\theta^* = \arg\max_\theta\, \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]}_{J(\theta)}, \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] \approx \underbrace{\frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t})}_{\text{sum over samples from } \pi_\theta}$$

Direct differentiation

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[r(\tau)\right] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau, \qquad r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$$
$$\begin{aligned}
\nabla_\theta J(\theta) &= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau && (7)\\
&= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau && (8)\\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right] && (9)
\end{aligned}$$

A convenient identity:

$$\pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau)\, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau)$$
$$\begin{aligned}
\log \pi_\theta(\tau) &= \log \pi_\theta(s_1, a_1, \ldots, s_T, a_T) && (10)\\
&= \log\left[p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\right] && (11)\\
&= \underbrace{\log p(s_1)}_{\text{initial state prob., derivative } 0} + \sum_{t=1}^{T}\Big(\underbrace{\log \pi_\theta(a_t \mid s_t)}_{\text{policy}} + \underbrace{\log p(s_{t+1} \mid s_t, a_t)}_{\text{transition prob., derivative } 0}\Big) && (12)
\end{aligned}$$

Overall,

$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right] && (13)\\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right] && (14)\\
&\approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right) && (15)
\end{aligned}$$
  • Comparison to maximum likelihood: $\nabla_\theta J_{ML}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)$
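To make (15) concrete, here is a minimal PyTorch sketch of a surrogate loss whose gradient is the sample-based policy gradient, assuming a hypothetical categorical policy network `policy_net` and trajectories already collected and flattened into tensors; dropping `returns` recovers the maximum-likelihood gradient above.

```python
import torch
from torch.distributions import Categorical

def vanilla_pg_loss(policy_net, states, actions, returns):
    """Surrogate loss whose gradient matches the estimator in (15).

    states:  (N*T, obs_dim) float tensor of visited states
    actions: (N*T,) long tensor of sampled actions
    returns: (N*T,) float tensor; each entry is r(tau) of the trajectory
             the step belongs to (no reward-to-go or baseline yet).
    """
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    # Maximizing J(theta) == minimizing the negative return-weighted log-likelihood.
    return -(log_probs * returns).mean()

# Toy usage with random data (4-dim observations, 2 actions):
policy_net = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)
states, actions, returns = torch.randn(10, 4), torch.randint(0, 2, (10,)), torch.randn(10)
loss = vanilla_pg_loss(policy_net, states, actions, returns)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```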

Drawbacks: variance

  • Reducing variance
  • The future does not affect the past
  • Causality: the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$

where $\hat{Q}_{i,t}$ denotes the reward-to-go:

$$\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$$
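A minimal NumPy sketch of the reward-to-go computation for a single trajectory (the names are illustrative; γ = 1 matches the undiscounted sum above, γ < 1 gives the discounted variant):

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Q_hat_t = sum_{t'=t}^{T} gamma^(t'-t) * r_t', computed backwards in O(T)."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # accumulate from the end of the episode
        q[t] = running
    return q

print(rewards_to_go([1.0, 0.0, 2.0]))   # -> [3. 2. 2.]
```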

Baselines

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta \log \pi_\theta(\tau_i)\left[r(\tau_i) - b\right]$$

where $b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i)$

proof:

  • Subtracting a baseline is unbiased in expectation.
  • The average reward is not the best baseline, but it's pretty good.

    $$\mathbb{E}\left[\nabla_\theta \log \pi_\theta(\tau)\, b\right] = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, b\, d\tau = \int \nabla_\theta \pi_\theta(\tau)\, b\, d\tau = b\, \nabla_\theta \int \pi_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$$
  • Optimal baseline:

    $$\mathrm{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2, \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\,(r(\tau) - b)\right]$$
$$\mathrm{var} = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\nabla_\theta \log \pi_\theta(\tau)\,(r(\tau) - b)\right)^2\right] - \Big(\underbrace{\mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\,(r(\tau) - b)\right]}_{\text{this is just the unbiased baseline in expectation}}\Big)^2$$

Hence,

Writing $g(\tau) = \nabla_\theta \log \pi_\theta(\tau)$,
$$\frac{d\,\mathrm{var}}{db} = \frac{d}{db}\mathbb{E}\left[g(\tau)^2 (r(\tau) - b)^2\right] = \frac{d}{db}\left(\mathbb{E}\left[g(\tau)^2 r(\tau)^2\right] - 2b\,\mathbb{E}\left[g(\tau)^2 r(\tau)\right] + b^2\,\mathbb{E}\left[g(\tau)^2\right]\right) = -2\,\mathbb{E}\left[g(\tau)^2 r(\tau)\right] + 2b\,\mathbb{E}\left[g(\tau)^2\right] = 0$$

We get:

$$b = \frac{\mathbb{E}\left[g(\tau)^2 r(\tau)\right]}{\mathbb{E}\left[g(\tau)^2\right]}$$

This is just the expected reward, weighted by gradient magnitudes.
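A small NumPy sketch of this variance-minimizing baseline, applied per gradient component (a reasonable reading of the derivation, since it holds independently for each component of $g(\tau)$); all names are hypothetical:

```python
import numpy as np

def optimal_baseline(grad_log_probs, returns):
    """b* = E[g(tau)^2 r(tau)] / E[g(tau)^2], computed per parameter component.

    grad_log_probs: (N, d) array, g_i = grad_theta log pi_theta(tau_i) for trajectory i
    returns:        (N,)  array, r(tau_i)
    """
    g2 = grad_log_probs ** 2
    num = (g2 * returns[:, None]).mean(axis=0)   # E[g^2 r], componentwise
    den = g2.mean(axis=0) + 1e-8                 # E[g^2], guard against division by zero
    return num / den

rng = np.random.default_rng(0)
print(optimal_baseline(rng.normal(size=(5, 3)), rng.normal(size=5)))
```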

Deriving the simplest policy gradient (Spinning Up)

Consider the case of a stochastic, parameterized policy, πθ. We maximize the expected return

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

Optimize the policy by gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha\, \underbrace{\nabla_\theta J(\pi_\theta)\big|_{\theta_k}}_{\text{policy gradient}}$$

Step by step:

  1. Probability of a trajectory. The probability of a trajectory $\tau = (s_0, a_0, \ldots, s_{T+1})$ given that actions come from $\pi_\theta$ is:

    $$P(\tau \mid \theta) = \rho_0(s_0)\prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
  2. The log-derivative trick.

    $$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$$
  3. Log-probability of a trajectory:

    $$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T}\Big(\log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\Big)$$
  4. Gradients of environment functions.
    The environment has no dependence on $\theta$, so the gradients of $\rho_0(s_0)$, $P(s_{t+1} \mid s_t, a_t)$ and $R(\tau)$ are zero.
  5. Grad-log-prob of a trajectory.
    The gradient of the log-probability of a trajectory is thus

    $$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Overall:

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] && (16)\\
&= \nabla_\theta \int_\tau P(\tau \mid \theta)\, R(\tau) && (17)\ \text{expand expectation}\\
&= \int_\tau \nabla_\theta P(\tau \mid \theta)\, R(\tau) && (18)\ \text{bring gradient under integral}\\
&= \int_\tau P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) && (19)\ \text{log-derivative trick}\\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\right] && (20)\ \text{return to expectation form}
\end{aligned}$$
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

We can estimate the expectation with a sample mean. If we collect a set of trajectories D={τi}i=1,,N where each trajectory is obtained by letting the agent act in the environment using the policy πθ, the policy gradient can be estimated as:

$$\hat{g} = \frac{1}{|\mathcal{D}|}\sum_{\tau \in \mathcal{D}}\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$

where |D| is the # of trajectories in D

REINFORCE

REINFORCE (Monte-Carlo policy gradient) is a Monte-Carlo algorithm that uses the complete return from time $t$, which includes all future rewards up until the end of the episode. [3]

$$\begin{aligned}
\nabla J(\theta) &= \mathbb{E}_\pi\left[\sum_a \pi(a \mid S_t, \theta)\, q_\pi(S_t, a)\, \frac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\right] && (21)\\
&= \mathbb{E}_\pi\left[q_\pi(S_t, A_t)\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right] && (22)\\
&= \mathbb{E}_\pi\bigg[G_t\, \underbrace{\frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}}_{\text{eligibility vector: } \nabla \ln \pi(A_t \mid S_t, \theta)}\bigg] && (23)
\end{aligned}$$
  • Intuition: each update increments the parameter vector in the direction of the gradient of the probability of the action actually taken, proportionally to the return and inversely proportionally to the action probability.

REINFORCE algorithm:

  1. Initialize policy parameter θ at random;
  2. For each episode:
    1. Generate an episode $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$, following $\pi(\cdot \mid \cdot, \theta)$
    2. Loop for each step of the episode $t = 0, 1, \ldots, T-1$:
      1. $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$
      2. $\theta \leftarrow \theta + \alpha\, \gamma^t\, G\, \nabla \ln \pi(A_t \mid S_t, \theta)$
  • Drawbacks: REINFORCE has a high variance and thus produces slow learning.
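A minimal PyTorch sketch of one REINFORCE update from a completed episode, for a categorical policy; `policy_net`, `optimizer` and the episode tensors are assumed to be provided by the surrounding training loop (all names are hypothetical):

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update from a single episode.

    states:  (T, obs_dim) float tensor, S_0..S_{T-1}
    actions: (T,) long tensor, A_0..A_{T-1}
    rewards: sequence of length T, R_1..R_T
    """
    T = len(rewards)
    returns = torch.zeros(T)
    g = 0.0
    for t in reversed(range(T)):          # G_t = sum_{k=t+1}^{T} gamma^{k-t-1} R_k
        g = rewards[t] + gamma * g
        returns[t] = g
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    discounts = gamma ** torch.arange(T, dtype=torch.float32)   # the gamma^t factor in the update
    loss = -(discounts * returns * log_probs).sum()             # gradient ascent via negative loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```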

REINFORCE with Baseline

A variant of REINFORCE subtracts a baseline value from the return $G_t$ to reduce the variance of the policy gradient while keeping the bias unchanged. [3]

$$\theta_{t+1} = \theta_t + \alpha\left(G_t - b(S_t)\right)\nabla \ln \pi(A_t \mid S_t, \theta_t)$$

Actor-Critic

Actor-Critic consists of two components:
  • Actor: updates the policy parameters $\theta$ of $\pi_\theta(a \mid s)$ in the direction suggested by the critic.
  • Critic: updates the value-function parameters $w$; the value function can be the action-value $Q_w(a \mid s)$ or the state-value $V_w(s)$.

Actor-Critic algorithms:

  1. Initialize the policy parameters $\theta \in \mathbb{R}^{d'}$ and the state-value weights $w \in \mathbb{R}^{d}$
  2. For each episode:
    1. Initialize the first state of the episode $S$
    2. $I \leftarrow 1$
    3. While $S$ is not terminal (for each time step):
      1. $A \sim \pi(\cdot \mid S, \theta)$
      2. Take action $A$, observe $S'$, $R$
      3. $\delta \leftarrow R + \gamma\, \hat{v}(S', w) - \hat{v}(S, w)$ (if $S'$ is terminal, then $\hat{v}(S', w) = 0$)
      4. $w \leftarrow w + \alpha_w\, \delta\, \nabla \hat{v}(S, w)$
      5. $\theta \leftarrow \theta + \alpha_\theta\, I\, \delta\, \nabla \ln \pi(A \mid S, \theta)$
      6. $I \leftarrow \gamma I$
      7. $S \leftarrow S'$
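A minimal PyTorch sketch of the inner-loop update for a single transition, with a categorical actor and a scalar critic; the two networks, their optimizers and the input tensors are hypothetical, and the TD error is treated as a constant (semi-gradient), matching the update rules above:

```python
import torch
from torch.distributions import Categorical

def actor_critic_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, done,
                      gamma=0.99, I=1.0):
    """One-step actor-critic update for a single transition (S, A, R, S')."""
    v_s = critic(s)                                             # v(S, w), shape (1,)
    with torch.no_grad():
        v_next = torch.zeros(1) if done else critic(s_next)     # v(S', w) = 0 if S' is terminal
        delta = r + gamma * v_next - v_s                        # TD error, treated as a constant
    # Critic: w <- w + alpha_w * delta * grad v(S, w)  (surrogate: minimize -delta * v)
    critic_loss = -(delta * v_s).sum()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: theta <- theta + alpha_theta * I * delta * grad ln pi(A|S, theta)
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(I * delta * log_prob).sum()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return delta.item()
```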

Asynchronous Advantage Actor-Critic (A3C)

Mnih et al. (2016) [1] proposed asynchronous gradient descent for the optimization of deep neural networks, showing that parallel actor-learners have a stabilizing effect on training and greatly reduce training time on a single multi-core CPU instead of a GPU.

  • Instead of experience replay, they asynchronously execute multiple agents in parallel on multiple instances of the environment. This parallelism also decorrelates the agents’ data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states.
  • Apply different exploration policies in each actor-learner to maximize the diversity. By running different exploration policies in different threads, the overall updates of parameters are likely to be less correlated in time than a single agent applying online updates.
  • The gradient accumulation across parallel threads can be seen as a parallelized minibatch of stochastic gradient updates, where each thread independently updates the parameters in its own direction. [2]

A3C pseudocode for each actor-learner thread:

  1. Assume global shared parameter vectors $\theta$ and $\theta_v$, a global shared counter $T = 0$, and thread-specific parameter vectors $\theta'$ and $\theta'_v$
  2. Initialize the thread step counter $t \leftarrow 1$
  3. While $T \le T_{max}$:
    1. Reset gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$
    2. Synchronize thread-specific parameters: $\theta' = \theta$ and $\theta'_v = \theta_v$
    3. $t_{start} = t$
    4. Sample a state $s_t$
    5. While ($s_t$ is not terminal and $t - t_{start} \le t_{max}$):
      1. Perform action $a_t \sim \pi(a_t \mid s_t; \theta')$
      2. Receive reward $r_t$ and the new state $s_{t+1}$
      3. $t \leftarrow t + 1$
      4. $T \leftarrow T + 1$
    6. Estimate the return: $R = \begin{cases} 0 & \text{if } s_t \text{ is terminal} \\ V(s_t; \theta'_v) & \text{otherwise} \end{cases}$
    7. For $i \in \{t-1, \ldots, t_{start}\}$ do (n-step returns computed backwards; see the sketch after this pseudocode):
      1. $R \leftarrow r_i + \gamma R$; here $R$ is a Monte-Carlo estimate of $G_i$
      2. Accumulate gradients w.r.t. $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta')\left(R - V(s_i; \theta'_v)\right)$
      3. Accumulate gradients w.r.t. $\theta'_v$: $d\theta_v \leftarrow d\theta_v + \partial\left(R - V(s_i; \theta'_v)\right)^2 / \partial \theta'_v$
    8. Perform an asynchronous update of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$
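A minimal NumPy sketch of the backward return/advantage accumulation in steps 6-7, for a single rollout segment (single-threaded; the asynchronous parameter handling is omitted and the value estimates are assumed to be given):

```python
import numpy as np

def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Backward recursion R <- r_i + gamma * R from the A3C pseudocode.

    rewards:         r_{t_start}..r_{t-1} collected during the rollout
    values:          V(s_i; theta_v') for the same states
    bootstrap_value: 0 if the rollout ended at a terminal state, else V(s_t; theta_v')
    Returns n-step returns R_i and advantages R_i - V(s_i), in time order.
    """
    R = bootstrap_value
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):   # i = t-1, ..., t_start
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)
    return np.array(returns[::-1]), np.array(advantages[::-1])

print(nstep_returns_and_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.3], bootstrap_value=0.2))
```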

Advantage Actor-Critic (A2C)

Removing the first "A" (Asynchronous) from A3C, we get Advantage Actor-Critic (A2C). A3C updates the global parameters asynchronously, so thread-specific agents use different versions of the policy and the aggregated update may not be optimal.

A2C waits for each actor to finish its segment of experience before performing an update, averaging over all of the actors. In the next iteration, the parallel actors start from the same policy. A2C is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies. [4]

Trust Region Policy Optimization (TRPO)

Trust Region Policy Optimization (TRPO) enforces a KL divergence constraint at every point in the state space.

TRPO minimizes a certain surrogate objective function that guarantees policy improvement with non-trivial step sizes, giving monotonic improvement with little tuning of hyperparameters at each update [5].

  • Let $\eta(\tilde\pi)$ denote the expected return of another policy $\tilde\pi$, expressed in terms of its advantage over the current policy $\pi$.
  • $\rho_\pi$ denotes the discounted visitation frequency: $\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$

$$\begin{aligned}
\eta(\tilde\pi) &= \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t A_\pi(s_t, a_t)\right] && (24)\\
&= \eta(\pi) + \sum_s \rho_{\tilde\pi}(s)\sum_a \tilde\pi(a \mid s)\, A_\pi(s, a) && (25)\ \text{rewrite with a sum over states}\\
L_\pi(\tilde\pi) &:= \eta(\pi) + \sum_s \rho_\pi(s)\sum_a \tilde\pi(a \mid s)\, A_\pi(s, a) && (26)\ \text{replace } \rho_{\tilde\pi} \text{ with } \rho_\pi \text{ to approximate}
\end{aligned}$$

The aforementioned update does not give any guidance on the step size to update.

  • Conservative policy iteration provides explicit lower bounds on the improvements of η.

Let $\pi_{old}$ denote the current policy and $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$; the new policy is:

$$\pi_{new}(a \mid s) = (1 - \alpha)\, \pi_{old}(a \mid s) + \alpha\, \pi'(a \mid s)$$

where the lower bound is:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2 \quad (27), \qquad \text{where } \epsilon = \max_s \left|\mathbb{E}_{a \sim \pi'(a \mid s)}\left[A_\pi(s, a)\right]\right| \quad (28)$$

Replace $\alpha$ with a distance measure between $\pi$ and $\tilde\pi$, the total variation divergence: $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_i |p_i - q_i|$ for discrete distributions $p, q$.
Define $D_{TV}^{\max}(\pi, \tilde\pi) = \max_s D_{TV}\left(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\right)$ and let $\alpha = D_{TV}^{\max}(\pi_{old}, \pi_{new})$. Then:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2 \quad (29)\ \text{replace } \alpha \text{ with the total variation divergence}, \qquad \text{where } \epsilon = \max_{s,a}\left|A_\pi(s, a)\right| \quad (30)$$

Since $D_{TV}(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q)$, let $D_{KL}^{\max}(\pi, \tilde\pi) = \max_s D_{KL}\left(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\right)$; we get:

$$\eta(\tilde\pi) \ge L_\pi(\tilde\pi) - C\, D_{KL}^{\max}(\pi, \tilde\pi) \quad (31)\ \text{replace } \alpha^2 \text{ with } D_{KL}^{\max}(\pi, \tilde\pi), \qquad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2} \quad (32)$$

Let $M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{KL}^{\max}(\pi_i, \pi)$, then:

$$\eta(\pi_{i+1}) \ge M_i(\pi_{i+1}) \quad (33), \qquad \eta(\pi_i) = M_i(\pi_i) \quad (34)$$

Therefore,

$$\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$$

This guarantees that the true objective η is non-decreasing.

Thus, by maximizing $M_i$ at each iteration we improve the true objective $\eta$. In the parameterized setting, let $\theta_{old}$ denote the parameters of $\pi_{\theta_{old}}$ and $\theta$ those of $\pi_\theta$:

$$\underset{\theta}{\text{maximize}}\ \left[L_{\theta_{old}}(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta)\right]$$

In practice, the penalty coefficient $C$ recommended by the theory leads to very small step sizes. We therefore use a constraint on the KL divergence between the new policy and the old policy instead, i.e., a trust region constraint:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta) \quad (35), \qquad \text{s.t.}\ D_{KL}^{\max}(\theta_{old}, \theta) \le \delta \quad (36)$$

As a heuristic approximation, we use the average KL divergence instead of the maximum KL divergence:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{old}}(\theta) \quad (37), \qquad \text{s.t.}\ \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta \quad (38)$$
  • Expand $L_{\theta_{old}}$: $\underset{\theta}{\text{maximize}}\ \sum_s \rho_{\theta_{old}}(s)\sum_a \pi_\theta(a \mid s)\, A_{\theta_{old}}(s, a)$
  • Replace the sum over actions by an importance sampling estimator: $\sum_a \pi_\theta(a \mid s_n)\, A_{\theta_{old}}(s_n, a) = \mathbb{E}_{a \sim q}\left[\frac{\pi_\theta(a \mid s_n)}{q(a \mid s_n)} A_{\theta_{old}}(s_n, a)\right]$
  • Replace $\sum_s \rho_{\theta_{old}}$ with the expectation $\mathbb{E}_{s \sim \rho_{\theta_{old}}}[\cdot]$, and replace the advantage values $A_{\theta_{old}}$ by the Q-values $Q_{\theta_{old}}$. Finally, we get (see the sketch below):

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{old}}(s, a)\right] \quad (39), \qquad \text{s.t.}\ \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right] \le \delta \quad (40)$$
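A minimal PyTorch sketch of evaluating the surrogate (39) and the constraint (40) for a categorical policy, taking $q = \pi_{\theta_{old}}$ and using advantage estimates in place of the Q-values; the conjugate-gradient step and line search that TRPO uses to actually solve the constrained problem are omitted, and `policy_net` is a hypothetical network:

```python
import torch
from torch.distributions import Categorical, kl_divergence

def trpo_surrogate_and_kl(policy_net, old_logits, states, actions, advantages):
    """Importance-sampled surrogate objective and average-KL constraint on a sample batch."""
    new_dist = Categorical(logits=policy_net(states))
    old_dist = Categorical(logits=old_logits.detach())          # sampling distribution q = pi_old
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()                     # E[ pi_theta / q * A_hat ]
    avg_kl = kl_divergence(old_dist, new_dist).mean()           # E_s[ KL(pi_old || pi_theta) ]
    return surrogate, avg_kl
```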

Proximal Policy Optimization (PPO)

  • Problems: TRPO is relatively complicated, and is not compatible with architectures that include noise (e.g. dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks).[8]

  • PPO with clipped surrogate objective performs better than that with KL penalty.

PPO-clip

Clipped surrogate objective

  • Let $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so $r_t(\theta_{old}) = 1$. TRPO maximizes a surrogate objective from conservative policy iteration (CPI). Without the constraint, maximization of $L^{CPI}$ would lead to excessively large policy updates.

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t\right] \quad (41), \qquad \text{s.t.}\ \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta \quad (42)$$
  • PPO penalizes changes to the policy that move $r_t(\theta)$ away from 1:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where ϵ is a hyperparameter, say, ϵ=0.2. The intuition is to take the minimum of the clipped and unclipped objective, thus the final objective is a lower bound (i.e. a pessimistic bound) on the unclipped objective.
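A minimal PyTorch sketch of the clipped surrogate objective, negated so a standard optimizer can minimize it; the tensor names are hypothetical:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """-L^CLIP: clipped surrogate objective as a loss to minimize.

    log_probs_new: log pi_theta(a_t|s_t) from the current policy (requires grad)
    log_probs_old: log pi_theta_old(a_t|s_t) recorded at rollout time
    advantages:    advantage estimates A_hat_t
    """
    ratio = torch.exp(log_probs_new - log_probs_old.detach())          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                       # pessimistic (lower) bound
```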

PPO-penalty

Adaptive KL penalty coefficient

  • Variant: add a penalty on the KL divergence
  • With mini-batch SGD, optimize the KL-penalized objective: $$L^{KL\text{-}penalty}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$
  • Compute $d = \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$ (the update rule is sketched after this list):
    • If d<dtarget/1.5,ββ/2
    • If d>dtarget×1.5,ββ×2
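A minimal plain-Python sketch of the adaptive coefficient update (the 1.5 and 2 factors are the heuristic values above):

```python
def update_kl_coefficient(beta, kl, kl_target, factor=1.5, scale=2.0):
    """Adapt the KL penalty coefficient beta after each policy update."""
    if kl < kl_target / factor:
        beta /= scale      # policy changed too little: weaken the penalty
    elif kl > kl_target * factor:
        beta *= scale      # policy changed too much: strengthen the penalty
    return beta

# e.g. update_kl_coefficient(beta=1.0, kl=0.05, kl_target=0.01) -> 2.0
```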

PPO algorithms

Finally, the objective function is augmented with an error term on the value estimation and an entropy term to encourage sufficient exploration.

$$L_t^{\text{Clip+SE+Entropy}}(\theta) = \hat{\mathbb{E}}_t\Big[L_t^{CLIP}(\theta) - c_1\underbrace{\left(V_\theta(s_t) - V_t^{\text{target}}\right)^2}_{\text{squared-error value loss}} + c_2\underbrace{H(s_t, \pi_\theta)}_{\text{entropy term}}\Big] \quad (43)$$

where c1, c2 are constant coefficients.
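A minimal PyTorch sketch of the combined objective (43) as a loss to minimize, reusing a clipped-loss term such as the `ppo_clip_loss` sketch above; the coefficient values are illustrative:

```python
import torch

def ppo_total_loss(clip_loss, values, value_targets, entropy, c1=0.5, c2=0.01):
    """Negated Eq. (43): clipped term + c1 * value loss - c2 * entropy bonus.

    clip_loss:     -L^CLIP, e.g. from the ppo_clip_loss sketch above
    values:        V_theta(s_t) from the value head
    value_targets: V_t^target (e.g. bootstrapped or empirical returns)
    entropy:       per-state policy entropy, e.g. dist.entropy()
    """
    value_loss = (values - value_targets.detach()).pow(2).mean()   # squared-error value loss
    return clip_loss + c1 * value_loss - c2 * entropy.mean()
```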

  • Settings:
    • RNN
    • Adam
    • mini-batch SGD

PPO algorithms with Actor-Critic style:

  1. for iteration = 1, 2, …:
    1. for actor = 1, 2, …, N:
      1. Run policy $\pi_{\theta_{old}}$ in the environment for $T$ timesteps;
      2. Compute advantage estimates $\hat{A}_1, \ldots, \hat{A}_T$;
    2. Optimize the surrogate $L$ w.r.t. $\theta$, with $K$ epochs and mini-batch size $M \le NT$
    3. $\theta_{old} \leftarrow \theta$

Distributed PPO

Let $W$ denote the # of workers; $D$ sets a threshold for the # of workers whose gradients must be available to update the parameters; $M$ and $B$ are the # of sub-iterations of policy and baseline updates given a batch of data points. [9]

The distributed PPO-penalty pseudocode is given in [9].


Experiments indicate that averaging gradients and applying them synchronously leads to better results in practice than asynchronous updates.

Generalized Advantage Estimation (GAE)

Challenges:

  1. Requires a large number of samples;
  2. Difficulty of obtaining stable and steady improvement despite the non-stationarity of the incoming data.
  3. The credit assignment problem in RL (a.k.a. the distal reward problem in the behavioral literature): rewards arrive with long time delays.

Solution:[10]

  1. Use value functions to reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of advantage function that is analogous to TD(λ);
  2. Use TRPO for both policy and value function with NNs.

Advantage function estimation

Define the TD residual $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. If $V = V^{\pi,\gamma}$, the TD residual is an unbiased estimator of the advantage:

$$\begin{aligned}
\mathbb{E}_{s_{t+1}}\left[\delta_t^{V^{\pi,\gamma}}\right] &= \mathbb{E}_{s_{t+1}}\left[r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)\right] && (44)\\
&= \mathbb{E}_{s_{t+1}}\left[Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t)\right] && (45)\\
&= A^{\pi,\gamma}(s_t, a_t) && (46)
\end{aligned}$$

Now consider the sum of $k$ of these $\delta$ terms, denoted $\hat{A}_t^{(k)}$:

$$\begin{aligned}
\hat{A}_t^{(1)} &:= \delta_t^V = -V(s_t) + r_t + \gamma V(s_{t+1}) && (47)\\
\hat{A}_t^{(2)} &:= \delta_t^V + \gamma\delta_{t+1}^V = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) && (48)\\
\hat{A}_t^{(3)} &:= \delta_t^V + \gamma\delta_{t+1}^V + \gamma^2\delta_{t+2}^V = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3}) && (49)\\
&\ \,\vdots && (50)\\
\hat{A}_t^{(k)} &:= \sum_{l=0}^{k-1}\gamma^l\delta_{t+l}^V = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) && (51)
\end{aligned}$$

As $k \to \infty$, the bias generally becomes smaller:

$$\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty}\gamma^l\delta_{t+l}^V = -\underbrace{V(s_t)}_{\text{value function baseline}} + \underbrace{\sum_{l=0}^{\infty}\gamma^l r_{t+l}}_{\text{empirical returns}}$$

which is simply the empirical returns minus the value function baseline.

GAE

GAE is defined as the exponentially-weighted average of these k-step estimators:

$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &:= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + \cdots\right) && (52)\\
&= (1-\lambda)\left(\delta_t^V + \lambda\left(\delta_t^V + \gamma\delta_{t+1}^V\right) + \lambda^2\left(\delta_t^V + \gamma\delta_{t+1}^V + \gamma^2\delta_{t+2}^V\right) + \cdots\right) && (53)\\
&= (1-\lambda)\left(\delta_t^V\left(1 + \lambda + \lambda^2 + \cdots\right) + \gamma\delta_{t+1}^V\left(\lambda + \lambda^2 + \lambda^3 + \cdots\right) + \gamma^2\delta_{t+2}^V\left(\lambda^2 + \lambda^3 + \lambda^4 + \cdots\right) + \cdots\right) && (54)\\
&= (1-\lambda)\left(\delta_t^V\frac{1}{1-\lambda} + \gamma\delta_{t+1}^V\frac{\lambda}{1-\lambda} + \gamma^2\delta_{t+2}^V\frac{\lambda^2}{1-\lambda} + \cdots\right) && (55)\\
&= \sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l}^V && (56)
\end{aligned}$$

Now consider two special cases:

  • when $\lambda = 0$: $\hat{A}_t^{GAE(\gamma,0)} := \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$
  • when $\lambda = 1$: $\hat{A}_t^{GAE(\gamma,1)} := \sum_{l=0}^{\infty}\gamma^l\delta_{t+l}^V = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t)$

It shows that GAE(γ,1) has high variance due to the sum of terms; GAE(γ,0) induces bias but with lower variance. The GAE with λ(0,1) reaches a tradeoff between the bias and variance.
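A minimal NumPy sketch of computing GAE with the backward recursion $\hat{A}_t = \delta_t^V + \gamma\lambda\hat{A}_{t+1}$ implied by (56); the names and the `dones` handling are illustrative:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95, dones=None):
    """A_hat_t = sum_l (gamma*lam)^l * delta_{t+l}, via a backward recursion.

    rewards:    r_0..r_{T-1} for one rollout segment
    values:     V(s_0)..V(s_{T-1}) from the current value function
    last_value: V(s_T), used to bootstrap the final TD residual
    dones:      optional terminal flags; a terminal step cuts the recursion
    """
    T = len(rewards)
    dones = np.zeros(T) if dones is None else np.asarray(dones, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    adv, gae = np.zeros(T), 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]   # TD residual
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv, adv + values[:-1]   # advantages and value-function targets

# lam=0 recovers the one-step TD residual; lam=1 the empirical-return estimator.
```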

Interpretation as Reward Shaping

  • Reward shaping refers to the following reward transformation of MDP:

    $$\tilde{r}(s, a, s') = r(s, a, s') + \gamma\Phi(s') - \Phi(s)$$

    where Φ:SR is an arbitrary scalar-valued function on the state space.

  • Let $\tilde{Q}^{\pi,\gamma}$, $\tilde{V}^{\pi,\gamma}$, $\tilde{A}^{\pi,\gamma}$ be the value and advantage functions of the transformed MDP:

    $$\begin{aligned}
    \tilde{Q}^{\pi,\gamma}(s, a) &= Q^{\pi,\gamma}(s, a) - \Phi(s)\\
    \tilde{V}^{\pi,\gamma}(s) &= V^{\pi,\gamma}(s) - \Phi(s)\\
    \tilde{A}^{\pi,\gamma}(s, a) &= \left(Q^{\pi,\gamma}(s, a) - \Phi(s)\right) - \left(V^{\pi,\gamma}(s) - \Phi(s)\right) = A^{\pi,\gamma}(s, a)
    \end{aligned}$$

Let $\Phi = V$, then

$$\sum_{l=0}^{\infty}(\gamma\lambda)^l\,\tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l}^V = \hat{A}_t^{GAE(\gamma,\lambda)}$$

Value function estimation

Fit the value function with a trust region: constrain the average KL divergence between the previous value function and the new value function (treating both as conditional Gaussians with variance $\sigma^2$) to be smaller than $\epsilon$:

$$\underset{\phi}{\text{minimize}}\ \sum_{n=1}^{N}\left\|V_\phi(s_n) - \hat{V}_n\right\|^2 \quad (57), \qquad \text{s.t.}\ \frac{1}{N}\sum_{n=1}^{N}\frac{\left\|V_\phi(s_n) - V_{\phi_{old}}(s_n)\right\|^2}{2\sigma^2} \le \epsilon \quad (58)$$

Actor-Critic with Experience Replay (ACER)

Actor-Critic with Experience Replay (ACER) [12] employs truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.

Importance weight truncation with bias correction

$$\hat{g}_t^{ACER} = \bar\rho_t\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\left[Q^{ret}(x_t, a_t) - V_{\theta_v}(x_t)\right] + \mathbb{E}_{a \sim \pi}\left(\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_\theta \log \pi_\theta(a \mid x_t)\left[Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t)\right]\right)$$

where $\bar\rho_t = \min\{c, \rho_t\}$, with importance weight $\rho_t = \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}$

Efficient TRPO

ACER maintains an average policy network $\phi_{\theta_a}$ that represents a running average of past policies, and forces the updated policy to not deviate far from this average.

Update the parameters $\theta_a$ of the average policy network "softly" after each update (see the sketch below):

$$\theta_a \leftarrow \alpha\theta_a + (1-\alpha)\theta$$
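A minimal PyTorch sketch of this soft (Polyak-style) update, assuming `avg_net` and `policy_net` are two networks with identical architectures:

```python
import torch

@torch.no_grad()
def soft_update(avg_net, policy_net, alpha=0.99):
    """theta_a <- alpha * theta_a + (1 - alpha) * theta, parameter by parameter."""
    for p_avg, p in zip(avg_net.parameters(), policy_net.parameters()):
        p_avg.mul_(alpha).add_(p, alpha=1.0 - alpha)
```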

The policy gradient w.r.t. the policy network statistics $\phi_\theta(x)$:

$$\hat{g}_t^{ACER} = \bar\rho_t\, \nabla_{\phi_\theta(x_t)} \log \pi_\theta(a_t \mid \phi_\theta(x_t))\left[Q^{ret}(x_t, a_t) - V_{\theta_v}(x_t)\right] + \mathbb{E}_{a \sim \pi}\left(\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_{\phi_\theta(x_t)} \log \pi_\theta(a \mid \phi_\theta(x_t))\left[Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t)\right]\right)$$

ACKTR

Soft Q-learning

Soft Q-learning (SQL) expresses the optimal policy via a Boltzmann distribution (a.k.a. Gibbs distribution).

  • Soft Q-learning formulates a stochastic policy as a (conditional) energy-based model (EBM), with the energy function corresponding to the "soft" Q-function obtained when optimizing the maximum entropy objective.
  • "The entropy regularized actor-critic algorithms can be viewed as approximate Q-learning methods, with the actor serving the role of an approximate sampler from an intractable posterior." [14]

Contributions:

  1. Improved exploration in multi-modal reward landscapes, where conventional deterministic or unimodal methods are at high risk of falling into suboptimal local optima.
  2. Stochastic energy-based policies can provide a much better initialization for learning new skills than either random policies or policies pretrained with maximum expected reward objectives.

Maximum Entropy RL

  • Conventional RL learns a policy $\pi(a_t \mid s_t)$ that maximizes the expected sum of rewards: $\pi^*_{std} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t)\right]$
  • Maximum entropy RL augments this objective with an entropy term, so that the policy also maximizes its entropy at each visited state: $\pi^*_{MaxEnt} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\,\mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right]$, where $\alpha$ determines the relative importance of the entropy term and the reward.

Energy-based models (EBM)

Conventional RL specifies a unimodal policy distribution, centered at the maximal Q-value and extending to the neighboring actions to provide noise for exploration. Exploration is then biased towards that single mode, and any other, possibly also promising, mode is ignored completely. [15]

To ensure the agent explores all promising states while prioritizing the more promising ones, Soft Q-learning defines the policy w.r.t. exponentiated Q-values:

$$\pi(a_t \mid s_t) \propto \exp\left(Q(s_t, a_t)\right)$$

where the Q-function plays the role of a (negative) energy function and can be represented by a neural network.

Soft value functions

The soft Bellman equation:

$$Q(s_t, a_t) = \mathbb{E}\left[r_t + \gamma\, \underset{a}{\mathrm{softmax}}\ Q(s_{t+1}, a)\right]$$

where

$$\underset{a}{\mathrm{softmax}}\ f(a) := \log \int \exp f(a)\, da$$

The soft Q-function is:

$$Q^*_{soft}(s_t, a_t) = r_t + \mathbb{E}_{(s_{t+1}, \ldots) \sim \rho_\pi}\left[\sum_{l=1}^{\infty}\gamma^l\left(r_{t+l} + \alpha\,\mathcal{H}\left(\pi^*_{MaxEnt}(\cdot \mid s_{t+l})\right)\right)\right]$$

The soft value function:

$$V^*_{soft}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha}Q^*_{soft}(s_t, a')\right) da'$$

Then the optimal policy is:

$$\begin{aligned}
\pi^*_{MaxEnt}(a_t \mid s_t) &= \exp\Big(\frac{1}{\alpha}\underbrace{\left(Q^*_{soft}(s_t, a_t) - V^*_{soft}(s_t)\right)}_{\text{soft advantage}}\Big) && (59)\\
&= \frac{\exp\left(\frac{1}{\alpha}Q^*_{soft}(s_t, a_t)\right)}{\int_{\mathcal{A}}\exp\left(\frac{1}{\alpha}Q^*_{soft}(s_t, a')\right) da'} && (60)\\
&= \frac{\exp\left(\frac{1}{\alpha}Q^*_{soft}(s_t, a_t)\right)}{\exp\left(\frac{1}{\alpha}V^*_{soft}(s_t)\right)} && (61)
\end{aligned}$$
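To make the soft value and the resulting Boltzmann policy concrete, here is a minimal NumPy sketch for a discrete action set (the paper works with continuous actions, where the integral has to be approximated by sampling); the log-sum-exp is computed in a numerically stable way:

```python
import numpy as np

def soft_value(q_values, alpha=1.0):
    """V_soft(s) = alpha * log sum_a exp(Q_soft(s,a) / alpha), via stable log-sum-exp."""
    q = np.asarray(q_values, dtype=float) / alpha
    m = q.max()
    return alpha * (m + np.log(np.exp(q - m).sum()))

def boltzmann_policy(q_values, alpha=1.0):
    """pi(a|s) = exp((Q_soft(s,a) - V_soft(s)) / alpha), i.e. a softmax over Q / alpha."""
    q = np.asarray(q_values, dtype=float)
    return np.exp((q - soft_value(q, alpha)) / alpha)

probs = boltzmann_policy([1.0, 2.0, 0.5], alpha=0.5)
print(probs, probs.sum())   # the probabilities sum to 1
```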

Proofs

  • Define the optimal policy in the EBM form: $\pi(a \mid s) = \frac{\exp\left(Q(s, a)\right)}{Z}$, where $Z$ is the normalizing constant (the integral of the numerator over actions).
  • Minimize the KL divergence:

$$\begin{aligned}
\min\ \mathrm{KL}\left(\pi \,\|\, \tilde\pi\right) &= \sum_a \pi(a \mid s)\log\frac{\pi(a \mid s)}{\tilde\pi(a \mid s)} && (62)\\
&= \underbrace{\sum_a \pi(a \mid s)\log\pi(a \mid s)}_{-\mathcal{H}(\pi)} - \sum_a \pi(a \mid s)\log\tilde\pi(a \mid s) && (63)\ \text{expand KL divergence}\\
&= -\mathcal{H}(\pi) - \sum_a \pi(a \mid s)\Big[Q(s_t, a_t) - \underbrace{\log Z}_{V(s)}\Big] && (64)
\end{aligned}$$
here,
$$V(s) = \log Z = \log \int_{\mathcal{A}}\exp\left(\frac{1}{\alpha}Q(s, a)\right) da \quad (65)\text{-}(67)$$

Soft Q-learning

  • Soft Bellman backup:
$$Q_{soft}(s_t, a_t) \leftarrow r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\left[V_{soft}(s_{t+1})\right], \qquad V_{soft}(s_t) \leftarrow \alpha \log \int_{\mathcal{A}}\exp\left(\frac{1}{\alpha}Q_{soft}(s_t, a')\right) da'$$

This cannot be performed exactly in continuous or large state and action spaces, and sampling from the energy-based model is intractable in general.

Sampling

Importance sampling

$$V_{soft}^\theta(s_t) = \alpha \log \mathbb{E}_{q_{a'}}\left[\frac{\exp\left(\frac{1}{\alpha}Q_{soft}^\theta(s_t, a')\right)}{q_{a'}(a')}\right]$$

where $q_{a'}$ can be an arbitrary distribution over the action space.
Equivalently, the soft Q-function can be trained by minimizing:

$$J_Q(\theta) = \mathbb{E}_{s_t \sim q_{s_t},\, a_t \sim q_{a_t}}\left[\frac{1}{2}\left(\hat{Q}_{soft}^{\bar\theta}(s_t, a_t) - Q_{soft}^\theta(s_t, a_t)\right)^2\right]$$

where $\hat{Q}_{soft}^{\bar\theta}(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\left[V_{soft}^{\bar\theta}(s_{t+1})\right]$ is the target Q-value.

Stein Variational Gradient Descent (SVGD)

  • How can we approximately sample from the soft Q-function?
    1. MCMC-based sampling
    2. Learn a stochastic sampling network trained to output approximate samples from the target distribution

Soft Q-learning applies a sampling network based on Stein variational gradient descent (SVGD) and amortized SVGD.

  1. Learn a state-conditioned stochastic neural network $a_t = f^\phi(\xi; s_t)$ that maps noise samples $\xi$ drawn from an arbitrary distribution into unbiased action samples from the target EBM of $Q_{soft}^\theta$.
  2. The induced distribution of the actions $\pi^\phi(a_t \mid s_t)$ should approximate the energy-based distribution in terms of the KL divergence: $$J_\pi(\phi; s_t) = \mathrm{D_{KL}}\left(\pi^\phi(\cdot \mid s_t) \,\Big\|\, \exp\left(\frac{1}{\alpha}\left(Q_{soft}^\theta(s_t, \cdot) - V_{soft}^\theta(s_t)\right)\right)\right)$$
  3. SVGD provides the greediest directions as a functional: $$\Delta f^\phi(\cdot; s_t) = \mathbb{E}_{a_t \sim \pi^\phi}\left[\kappa\left(a_t, f^\phi(\cdot; s_t)\right)\nabla_{a'}Q_{soft}^\theta(s_t, a')\Big|_{a'=a_t} + \alpha\, \nabla_{a'}\kappa\left(a', f^\phi(\cdot; s_t)\right)\Big|_{a'=a_t}\right]$$ and the policy network is updated with $$\frac{\partial J_\pi(\phi; s_t)}{\partial \phi} \propto \mathbb{E}_\xi\left[\Delta f^\phi(\xi; s_t)\frac{\partial f^\phi(\xi; s_t)}{\partial \phi}\right]$$

Algorithms

$\mathcal{D} \leftarrow$ empty replay memory
Assign target parameters: $\bar\theta \leftarrow \theta$, $\bar\phi \leftarrow \phi$

  1. for each epoch:
    1. for each t do:
      1. Collect experience
        Sample an action for $s_t$ using $f^\phi$: $a_t \leftarrow f^\phi(\xi; s_t)$ where $\xi \sim \mathcal{N}(0, I)$
        Sample the next state from the environment: $s_{t+1} \sim p_s(s_{t+1} \mid s_t, a_t)$
        Save the new experience in the replay memory: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$
      2. Sample a mini-batch from the replay memory: $\{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)}, s_{t+1}^{(i)})\}_{i=0}^{N} \sim \mathcal{D}$
      3. Update the soft Q-function parameters:
        Sample $\{a'^{(i,j)}\}_{j=0}^{M} \sim q_{a'}$ for each $s_{t+1}^{(i)}$
        Compute the empirical soft values $\hat{V}_{soft}^{\bar\theta}(s_{t+1}^{(i)})$ using the importance-sampled estimate of $V_{soft}^\theta$ above
        Compute the empirical gradient $\hat\nabla_\theta J_Q$ of the Bellman residual $J_Q(\theta)$
        Update $\theta$ according to $\hat\nabla_\theta J_Q$ using Adam
      4. Update the policy:
        Sample $\{\xi^{(i,j)}\}_{j=0}^{M} \sim \mathcal{N}(0, I)$ for each $s_t^{(i)}$
        Compute actions $a_t^{(i,j)} = f^\phi(\xi^{(i,j)}; s_t^{(i)})$
        Compute $\Delta f^\phi$ using the empirical estimate of the SVGD direction above
        Compute the empirical estimate $\hat\nabla_\phi J_\pi$ of $\partial J_\pi(\phi; s_t) / \partial \phi$
        Update $\phi$ according to $\hat\nabla_\phi J_\pi$ using Adam
    2. If epoch mod update_interval == 0:
      $\bar\theta \leftarrow \theta$, $\bar\phi \leftarrow \phi$

Benefits

  • Better exploration. SQL provides an implicit exploration strategy by assigning each action a non-zero probability, shaped by the current belief about its value, effectively combining exploration and exploitation in a natural way. [15]
  • Fine-tuning maximum entropy policies: general-to-specific transfer. Pre-train policies for general-purpose tasks, then use them as templates or initializations for more specific tasks.
  • Compositionality. Compose new skills from existing policies (even without any fine-tuning) by intersecting different skills. [15]
  • Robustness. The maximum entropy formulation encourages the policy to try all promising solutions, so agents learn to explore a large portion of the state space. They therefore learn to act in a wide range of situations and are more robust against perturbations in the environment.

Soft Actor-Critic (SAC)

Soft Actor-Critic (SAC)[13] is an off-policy actor-critic algorithm based on the maximum entropy framework. Like SQL, SAC augments the conventional RL objectives with a maximum entropy objective:

$$J(\pi) = \sum_{t=0}^{T}\mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\,\mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right]$$

where $\alpha$ determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy.

The maximum entropy term:

  1. Encourages the policy to explore more widely, while giving up on clearly unpromising avenues;
  2. Captures multiple modes of near-optimal behavior;
  3. Improves learning speed and exploration.

Soft policy iteration

Policy evaluation

The soft Q-value of a policy $\pi$ can be computed by repeatedly applying the modified Bellman backup operator $\mathcal{T}^\pi$:

$$\mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V(s_{t+1})\right]$$

where the soft state value function:

$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \log\pi(a_t \mid s_t)\right]$$

Policy Improvement

For each state, we update the policy according to:

$$\pi_{new} = \arg\min_{\pi'}\ \mathrm{D_{KL}}\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\left(Q^{\pi_{old}}(s_t, \cdot)\right)}{Z^{\pi_{old}}(s_t)}\right)$$

where the partition function $Z^{\pi_{old}}(s_t)$ normalizes the distribution.

SAC

Consider a parameterized state value function $V_\psi(s_t)$, a soft Q-function $Q_\theta(s_t, a_t)$ and a tractable policy $\pi_\phi(a_t \mid s_t)$.

The soft value function is trained to minimize the squared residual error:

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\frac{1}{2}\Big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[Q_\theta(s_t, a_t) - \log\pi_\phi(a_t \mid s_t)\right]\Big)^2\right]$$

The soft Q-function can be trained to minimize the soft Bellman residual:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^2\right]$$

with

$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\bar\psi}(s_{t+1})\right]$$

The policy can be optimized by the expected KL-divergence:

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\mathrm{D_{KL}}\left(\pi_\phi(\cdot \mid s_t) \,\Big\|\, \frac{\exp\left(Q_\theta(s_t, \cdot)\right)}{Z_\theta(s_t)}\right)\right]$$

$J_\pi$ is minimized with the reparameterization trick.

  • Reparameterize the policy with a neural network transformation: $a_t = f_\phi(\epsilon_t; s_t)$, where $\epsilon_t$ is an input noise vector sampled from some fixed distribution.
  • Rewrite the previous objective as:

    $$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[\log\pi_\phi\left(f_\phi(\epsilon_t; s_t) \mid s_t\right) - Q_\theta\left(s_t, f_\phi(\epsilon_t; s_t)\right)\right]$$
  • Employ two Q-functions to mitigate positive bias in the policy improvement step, using the minimum of the Q-functions for the value gradient and policy gradient.
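A minimal PyTorch sketch of the reparameterized policy loss with the twin-Q minimum, for a diagonal Gaussian policy; the tanh squashing (and its log-probability correction) used in the full SAC implementation is omitted, `policy_net` is assumed to return `(mean, log_std)`, `q1` and `q2` are hypothetical critics, and the explicit temperature `alpha` anticipates the sections below:

```python
import torch
from torch.distributions import Normal

def sac_policy_loss(policy_net, q1, q2, states, alpha=0.2):
    """E[ alpha * log pi(a|s) - min(Q1, Q2)(s, a) ] with a = f_phi(eps; s)."""
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    actions = dist.rsample()                          # reparameterization trick
    log_prob = dist.log_prob(actions).sum(dim=-1)     # sum over action dimensions
    q_min = torch.min(q1(states, actions), q2(states, actions)).squeeze(-1)   # twin-Q minimum
    return (alpha * log_prob - q_min).mean()
```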

Automatically adjusted temperature

SAC is brittle w.r.t. the temperature parameter. Choosing the optimal temperature $\alpha$ is non-trivial, and the temperature needs to be tuned for each task. [17]

SAC finds a stochastic policy with maximal expected return that satisfies a minimum expected entropy constraint.

$$\max_{\pi_{0:T}}\ \mathbb{E}_{\rho_\pi}\left[\sum_{t=0}^{T}r(s_t, a_t)\right] \qquad \text{s.t.}\quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[-\log\left(\pi_t(a_t \mid s_t)\right)\right] \ge \mathcal{H}\quad \forall t$$

where H is a desired minimum expected entropy.

  • Rewrite the objective as an iterated maximization: $$\max_{\pi_0}\left(\mathbb{E}\left[r(s_0, a_0)\right] + \max_{\pi_1}\left(\mathbb{E}\left[\ldots\right] + \cdots + \max_{\pi_T}\mathbb{E}\left[r(s_T, a_T)\right]\right)\right)$$
  • Finally we get: $$\alpha_t^* = \arg\min_{\alpha_t}\ \mathbb{E}_{a_t \sim \pi_t^*}\left[-\alpha_t\log\pi_t^*(a_t \mid s_t; \alpha_t) - \alpha_t\bar{\mathcal{H}}\right]$$

Algorithms

Update $\alpha$ with the following objective (see the sketch below):

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\alpha\log\pi_t(a_t \mid s_t) - \alpha\bar{\mathcal{H}}\right]$$
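A minimal PyTorch sketch of the temperature update; optimizing `log_alpha` keeps α positive, and the target entropy of −dim(action space) is a common heuristic (both choices are assumptions, not prescribed by the text above):

```python
import torch

def temperature_loss(log_alpha, log_probs, target_entropy):
    """J(alpha) = E[ -alpha * log pi(a_t|s_t) - alpha * H_bar ]."""
    alpha = log_alpha.exp()
    return -(alpha * (log_probs.detach() + target_entropy)).mean()

log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)
loss = temperature_loss(log_alpha, log_probs=torch.tensor([-1.2, -0.8]), target_entropy=-1.0)
alpha_opt.zero_grad(); loss.backward(); alpha_opt.step()
```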

Deterministic Policy Gradient (DPG)

Deep DPG (DDPG)

Distributed Distributional DDPG (D4PG)

Multi-Agent DDPG (MADDPG)

Twin Delayed Deep Deterministic PG (TD3)

References


  1. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.
  2. Policy Gradient Algorithms (blog post), #A3C.
  3. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  4. A2C.
  5. Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., & Moritz, P. (2015). Trust Region Policy Optimization. ICML.
  6. TRPO blog 1.
  7. TRPO blog 2.
  8. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  9. Heess, N., Dhruva, T., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S. M., Riedmiller, M. A., & Silver, D. (2017). Emergence of Locomotion Behaviours in Rich Environments. arXiv:1707.02286.
  10. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
  11. Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017). Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation. NeurIPS.
  12. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & de Freitas, N. (2016). Sample Efficient Actor-Critic with Experience Replay. arXiv:1611.01224.
  13. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290.
  14. Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. ICML.
  15. Soft Q-learning, UC Berkeley (BAIR) blog.
  16. Notes on the Generalized Advantage Estimation Paper.
  17. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., et al. (2018). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905.