The mathematical foundations of policy gradient algorithms.
Policy Gradient preliminaries
Policy gradient methods estimate the gradient of the expected return in the form [10]:
$$ g = \mathbb{E}\Big[\sum_{t=0}^{\infty} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big], $$
where $\Psi_t$ may be, for example, the total trajectory reward, the reward following action $a_t$, a baselined version of it, the state-action value $Q^\pi(s_t, a_t)$, the advantage $A^\pi(s_t, a_t)$, or the TD residual $r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$.
Among these choices, the advantage function yields (almost) the lowest possible variance:
$$ A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t). $$
- Intuitive interpretation: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function measures whether an action is better or worse than the policy's default behavior (its expectation).
Vanilla Policy Gradient
The goal of RL is to find the policy parameters that maximize the expected return:
$$ \theta^\ast = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]. $$
Objective
- Infinite horizon: $\theta^\ast = \arg\max_\theta \mathbb{E}_{(s,a) \sim p_\theta(s,a)}[r(s,a)]$, where $p_\theta(s,a)$ is the stationary state-action distribution.
- Finite horizon: $\theta^\ast = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t,a_t) \sim p_\theta(s_t,a_t)}[r(s_t,a_t)]$.
Evaluating the objective
$$ J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t,a_t)\Big] \approx \frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t}). $$
Direct differentiation
A convenient identity: $p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$.
Overall,
$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\Big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_t r(s_t, a_t)\Big)\Big] \approx \frac{1}{N}\sum_i \Big(\sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_t r(s_{i,t}, a_{i,t})\Big). $$
- Comparison to maximum likelihood: the maximum likelihood gradient is $\frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})$; the policy gradient is the same expression with each trajectory weighted by its return, so good trajectories become more likely and bad ones less likely.
Drawback: the estimator has high variance.
- Reducing variance: exploit the fact that the future does not affect the past.
- Causality: the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$, so each log-probability term only needs to be weighted by the rewards that come after it (the "reward-to-go"):
$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\Big) = \frac{1}{N}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}, $$
where $\hat{Q}_{i,t}$ is the reward-to-go from step $t$ of trajectory $i$. (A sketch of this computation follows below.)
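A minimal sketch of the reward-to-go weights $\hat{Q}_{i,t}$, assuming the rewards of one trajectory are stored in a NumPy array; the function name is illustrative:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Compute Q_hat[t] = sum_{t' >= t} gamma^(t'-t) * r[t'] by a backward pass."""
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

# Example: the weight applied to grad log pi(a_t|s_t) at each step.
print(rewards_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.5))  # [1.5, 1.0, 2.0]
```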
Baselines
Subtract a constant baseline $b$ from the return:
$$ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \nabla_\theta \log p_\theta(\tau_i)\,[r(\tau_i) - b], \qquad b = \frac{1}{N}\sum_i r(\tau_i). $$
Proof that subtracting a baseline is unbiased in expectation:
$$ \mathbb{E}\big[\nabla_\theta \log p_\theta(\tau)\, b\big] = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b \, d\tau = \int \nabla_\theta p_\theta(\tau)\, b \, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0. $$
The average reward is not the best baseline, but it works pretty well in practice.
Optimal baseline:
The variance of the estimator is $\mathrm{Var} = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\big(\nabla_\theta \log p_\theta(\tau)\,(r(\tau) - b)\big)^2\big] - \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\,(r(\tau) - b)\big]^2$, and the second term does not depend on $b$ (unbiasedness). Hence, setting $\frac{d\,\mathrm{Var}}{db} = 0$ and writing $g(\tau) = \nabla_\theta \log p_\theta(\tau)$, we get:
$$ b = \frac{\mathbb{E}\big[g(\tau)^2\, r(\tau)\big]}{\mathbb{E}\big[g(\tau)^2\big]}. $$
This is just the expected reward, weighted by (squared) gradient magnitudes; in practice it is computed per parameter dimension.
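A small NumPy sketch of this per-dimension optimal baseline; the array shapes are assumptions for illustration:

```python
import numpy as np

def optimal_baseline(grad_log_probs, returns):
    """Per-dimension optimal baseline b_i = E[g_i^2 r] / E[g_i^2].

    grad_log_probs: (N, D) array, g_i = grad_theta log p_theta(tau_i)
    returns:        (N,)  array, r(tau_i)
    """
    g2 = grad_log_probs ** 2                      # (N, D)
    num = (g2 * returns[:, None]).mean(axis=0)    # E[g^2 r] per dimension
    den = g2.mean(axis=0) + 1e-8                  # E[g^2] per dimension
    return num / den                              # (D,)

# Toy check: with identical gradients, the optimal baseline is the average return.
print(optimal_baseline(np.ones((4, 3)), np.array([1.0, 2.0, 3.0, 4.0])))  # ~[2.5 2.5 2.5]
```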
Deriving the simplest policy gradient (Spinning Up)
Consider the case of a stochastic, parameterized policy $\pi_\theta$, and aim to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
Optimize the policy by gradient ascent:
$$ \theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J(\pi_\theta)\big|_{\theta_k}. $$
Step by step:
- Probability of a trajectory. The probability of a trajectory $\tau = (s_0, a_0, \ldots, s_{T+1})$, given that actions come from $\pi_\theta$, is $P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$.
- The log-derivative trick: $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$.
- Log-probability of a trajectory: $\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T}\big[\log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\big]$.
- Gradients of environment functions: the environment has no dependence on $\theta$, so the gradients of $\rho_0(s_0)$, $P(s_{t+1} \mid s_t, a_t)$, and $R(\tau)$ are zero.
- Grad-log-prob of a trajectory: the gradient of the log-probability of a trajectory is thus $\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
Overall:
$$ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]. $$
We can estimate the expectation with a sample mean. If we collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,\ldots,N}$ by letting the agent act in the environment with $\pi_\theta$, the estimate is
$$ \hat{g} = \frac{1}{|\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau). $$
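A minimal PyTorch sketch of this estimator for a discrete action space; `policy_net` and the collected `obs`/`actions`/`returns` tensors are hypothetical names:

```python
import torch
from torch.distributions import Categorical

# obs: (T, obs_dim), actions: (T,), returns: (T,) holding R(tau) of the trajectory
# each step belongs to. Minimizing this loss ascends the sample-mean gradient hat{g}.
def pg_loss(policy_net, obs, actions, returns):
    logits = policy_net(obs)                           # (T, n_actions)
    logp = Categorical(logits=logits).log_prob(actions)
    return -(logp * returns).mean()
```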
REINFORCE
REINFORCE (Monte Carlo policy gradient) is a Monte Carlo algorithm that uses the complete return from time $t$, $G_t$, as an unbiased sample of $Q^\pi(S_t, A_t)$ in the policy gradient update.
- Intuition: the update moves the parameter vector in the direction that increases the probability of the taken action, proportional to the return and inversely proportional to the action probability (so frequently selected actions do not win out merely because they are selected often).
REINFORCE algorithm:
- Initialize the policy parameter $\theta$ at random.
- For each episode:
  - Generate an episode $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$, following $\pi_\theta$.
  - Loop for each step of the episode $t = 0, 1, \ldots, T-1$:
    - $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$
    - $\theta \leftarrow \theta + \alpha\, \gamma^t\, G\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t)$
- Drawbacks: REINFORCE has high variance and thus learns slowly. (A sketch of one update is given below.)
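A minimal sketch of one REINFORCE update in PyTorch, assuming the `log_probs` were recorded (inside the autograd graph) while sampling actions during the episode; the return normalization is an optional variance-reduction trick, not part of the algorithm above:

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single complete episode.

    log_probs: list of scalar tensors log pi_theta(A_t | S_t)
    rewards:   list of floats R_{t+1}
    """
    returns, g = [], 0.0
    for r in reversed(rewards):          # G_t computed backwards through the episode
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Optional: normalize returns for lower variance (common in practice).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```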
REINFORCE with Baseline
A variant of REINFORCE subtracts a baseline value from the return $G_t$, e.g. a learned state-value estimate $\hat{v}(S_t, w)$, which reduces variance while keeping the estimate unbiased: $\theta \leftarrow \theta + \alpha\, \gamma^t\, \big(G_t - \hat{v}(S_t, w)\big)\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t)$.
Actor-Critic
Actor-Critic methods consist of two components [3]:
- Actor: updates the policy parameters $\theta$ in the direction suggested by the critic.
- Critic: updates the value-function parameters $w$; the value function could be action-value $Q_w$ or state-value $V_w$.
One-step Actor-Critic algorithm:
- Initialize the policy parameter $\theta$ and the state-value weights $w$.
- For each episode:
  - Initialize the first state $S$ of the episode; $I \leftarrow 1$.
  - While $S$ is not terminal (for each time step):
    - Sample $A \sim \pi_\theta(\cdot \mid S)$; take action $A$, observe $R$, $S'$ (if $S'$ is terminal, then $\hat{v}(S', w) \doteq 0$).
    - $\delta \leftarrow R + \gamma\, \hat{v}(S', w) - \hat{v}(S, w)$
    - $w \leftarrow w + \alpha_w\, \delta\, \nabla_w \hat{v}(S, w)$
    - $\theta \leftarrow \theta + \alpha_\theta\, I\, \delta\, \nabla_\theta \ln \pi_\theta(A \mid S)$
    - $I \leftarrow \gamma I$; $S \leftarrow S'$
(A minimal one-step update in code follows below.)
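A sketch of one online update for a single transition, assuming a hypothetical `value_net` that returns a scalar value, separate optimizers for actor and critic, and the action's `log_prob` kept in the graph; the $\gamma^t$ factor $I$ is dropped, as is common in deep-RL implementations:

```python
import torch
import torch.nn.functional as F

def actor_critic_step(actor_opt, critic_opt, value_net, log_prob, s, r, s_next, done, gamma=0.99):
    v_s = value_net(s).squeeze()
    with torch.no_grad():
        v_next = torch.zeros(()) if done else value_net(s_next).squeeze()
        target = r + gamma * v_next
    td_error = target - v_s                       # delta = R + gamma*V(S') - V(S)

    critic_loss = F.mse_loss(v_s, target)         # move V(S) towards the TD target
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -td_error.detach() * log_prob    # increase log pi(A|S) in proportion to delta
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```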
Asynchronous Advantage Actor-Critic (A3C)
Mnih et al. (2016) [1] proposed asynchronous gradient descent for optimizing deep neural network controllers, showing that parallel actor-learners have a stabilizing effect on training and greatly reduce training time while using a single multi-core CPU instead of a GPU.
- Instead of experience replay, they asynchronously execute multiple agents in parallel on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time step the parallel agents experience a variety of different states.
- Different exploration policies are used in each actor-learner to maximize diversity. By running different exploration policies in different threads, the overall parameter updates are likely to be less correlated in time than those of a single agent applying online updates.
- The gradient accumulation across threads can be seen as a parallelized minibatch of stochastic gradient updates, where the parameters are updated thread by thread in each thread's direction independently [2].
A3C pseudocode for each actor-learner thread:
- Assume global shared parameter vectors $\theta$ and $\theta_v$, a global shared counter $T = 0$, and thread-specific parameter vectors $\theta'$ and $\theta'_v$.
- Initialize the thread step counter $t \leftarrow 1$.
- While $T \le T_{\max}$:
  - Reset gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$.
  - Synchronize thread-specific parameters: $\theta' = \theta$, $\theta'_v = \theta_v$.
  - Set $t_{\text{start}} = t$; get state $s_t$.
  - While $s_t$ is not terminal and $t - t_{\text{start}} < t_{\max}$:
    - Perform action $a_t$ according to policy $\pi(a_t \mid s_t; \theta')$.
    - Receive reward $r_t$ and new state $s_{t+1}$; $t \leftarrow t + 1$, $T \leftarrow T + 1$.
  - The return estimate: $R = 0$ for terminal $s_t$, otherwise $R = V(s_t; \theta'_v)$ (bootstrap from the last state).
  - For $i \in \{t-1, \ldots, t_{\text{start}}\}$ do:
    - $R \leftarrow r_i + \gamma R$; here $R$ is a Monte Carlo estimate of $Q(s_i, a_i)$.
    - Accumulate gradients w.r.t. $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta')\,\big(R - V(s_i; \theta'_v)\big)$.
    - Accumulate gradients w.r.t. $\theta'_v$: $d\theta_v \leftarrow d\theta_v + \partial\big(R - V(s_i; \theta'_v)\big)^2 / \partial \theta'_v$.
  - Perform an asynchronous update of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$.
(A single-worker code sketch of the inner loop follows below.)
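A single-worker sketch of the gradient-accumulation loop in PyTorch; the asynchronous, lock-free sharing of $\theta$, $\theta_v$ across threads is omitted, and `policy_value_net` (returning action logits and a value estimate) is a hypothetical module:

```python
import torch
from torch.distributions import Categorical

def a3c_segment_loss(policy_value_net, states, actions, rewards, last_state, done, gamma=0.99):
    """Accumulate gradients for one n-step segment of a single actor-learner."""
    with torch.no_grad():
        R = torch.zeros(()) if done else policy_value_net(last_state)[1].squeeze()

    policy_loss, value_loss = 0.0, 0.0
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = r + gamma * R                                  # MC estimate of Q(s_i, a_i)
        logits, v = policy_value_net(s)
        advantage = R - v.squeeze()
        logp = Categorical(logits=logits).log_prob(a)
        policy_loss = policy_loss - logp * advantage.detach()   # d_theta term
        value_loss = value_loss + advantage.pow(2)                # d_theta_v term
    # Calling .backward() accumulates the gradients in .grad; an optimizer step
    # (or, in A3C proper, an asynchronous update of the shared parameters) applies them.
    (policy_loss + 0.5 * value_loss).backward()
```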
Advantage Actor-Critic (A2C)
Removing the first "A" (asynchronous) from A3C gives advantage actor-critic (A2C). A3C updates the global parameters asynchronously, so thread-specific agents use different versions of the policy, and the aggregated update may not be optimal.
A2C waits for each actor to finish its segment of experience before performing an update, averaging over all of the actors; in the next iteration the parallel actors start from the same policy. A2C is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies [4].
Trust Region Policy Optimization (TRPO)
In its theoretically justified form, Trust Region Policy Optimization (TRPO) enforces a KL divergence bound at every point in the state space; the practical algorithm constrains an average KL divergence (see below).
TRPO shows that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes, giving monotonic improvements with little tuning of hyperparameters at each update [5].
- Let $\eta(\tilde{\pi})$ denote the expected return of another policy $\tilde{\pi}$ in terms of the advantage over the current policy $\pi$:
$$ \eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a), $$
where $\rho_{\tilde{\pi}}$ denotes the discounted visitation frequency: $\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$.
The complex dependency of $\rho_{\tilde{\pi}}$ on $\tilde{\pi}$ makes this hard to optimize directly, so a local approximation $L_\pi(\tilde{\pi})$ replaces $\rho_{\tilde{\pi}}$ with $\rho_\pi$. This approximation matches $\eta$ to first order, but it does not give any guidance on the step size of the update.
Conservative policy iteration provides explicit lower bounds on the improvement of $\eta$.
Let $\pi_{\text{new}}(a \mid s) = (1 - \alpha)\, \pi_{\text{old}}(a \mid s) + \alpha\, \pi'(a \mid s)$, where the lower bound is:
$$ \eta(\pi_{\text{new}}) \ge L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \qquad \epsilon = \max_s \big|\mathbb{E}_{a \sim \pi'}[A_\pi(s, a)]\big|. $$
Replace $\alpha$ with a distance measure between $\pi$ and $\tilde{\pi}$: the total variation divergence.
Define $D_{\mathrm{TV}}^{\max}(\pi, \tilde{\pi}) = \max_s D_{\mathrm{TV}}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big)$.
Since $D_{\mathrm{TV}}(p \,\|\, q)^2 \le D_{\mathrm{KL}}(p \,\|\, q)$, let $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$. Therefore,
$$ \eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}). $$
This guarantees that the true objective $\eta$ is non-decreasing: maximizing the surrogate $M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \pi)$ at each iteration (a minorization-maximization procedure) improves the true objective, since $\eta(\pi_{i+1}) \ge M_i(\pi_{i+1})$ and $\eta(\pi_i) = M_i(\pi_i)$.
In practice, the penalty coefficient $C$ recommended by the theory leads to very small steps. Thus, we use a constraint on the KL divergence between the new policy and the old policy, i.e. a trust region constraint:
$$ \max_\theta\; L_{\theta_{\text{old}}}(\theta) \quad \text{s.t.} \quad D_{\mathrm{KL}}^{\max}(\theta_{\text{old}}, \theta) \le \delta. $$
By a heuristic approximation, we use the average KL divergence in place of the maximum: $\bar{D}_{\mathrm{KL}}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\big[D_{\mathrm{KL}}\big(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\big)\big]$.
- Expand $L_{\theta_{\text{old}}}(\theta) = \sum_s \rho_{\theta_{\text{old}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_{\text{old}}}(s, a)$.
- Replace the sum over states with an expectation $\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}[\cdot]$; replace the sum over actions with an importance sampling estimator (sampling distribution $q$); replace the advantage values $A_{\theta_{\text{old}}}$ by the $Q$-values $Q_{\theta_{\text{old}}}$. Finally, we get the practical optimization problem:
$$ \max_\theta\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\Big[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a)\Big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\big] \le \delta. $$
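A small PyTorch sketch of the surrogate and the average KL for a discrete policy; the conjugate-gradient solve and line search that make up the rest of TRPO are omitted, and `policy_net`/`old_logits` are hypothetical:

```python
import torch
from torch.distributions import Categorical
from torch.distributions.kl import kl_divergence

def surrogate_and_kl(policy_net, old_logits, obs, actions, advantages):
    """Surrogate objective E[pi_theta/pi_old * A] and average KL(theta_old || theta)."""
    new_dist = Categorical(logits=policy_net(obs))
    old_dist = Categorical(logits=old_logits.detach())
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()
    mean_kl = kl_divergence(old_dist, new_dist).mean()
    return surrogate, mean_kl
```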
Proximal Policy Optimization (PPO)
Problems: TRPO is relatively complicated, and it is not compatible with architectures that include noise (e.g. dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks) [8].
In the paper's experiments, PPO with the clipped surrogate objective performs better than the KL-penalty variant.
PPO-clip
Clipped surrogate objective
- Let $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, so $r_t(\theta_{\text{old}}) = 1$. TRPO maximizes a "conservative policy iteration" (CPI) surrogate objective, $L^{CPI}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta)\, \hat{A}_t\big]$. Without a constraint, maximizing $L^{CPI}$ would lead to excessively large policy updates.
- PPO penalizes changes to the policy that move $r_t(\theta)$ away from 1:
$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t\big)\Big], $$
where $\epsilon$ is a hyperparameter (e.g. $\epsilon = 0.2$).
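A minimal PyTorch sketch of the clipped objective as a loss to minimize (tensor names are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative of E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]."""
    ratio = torch.exp(logp_new - logp_old)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```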
PPO-penalty
Adaptive KL penalty coefficient
- Variant: add a penalty on the KL divergence and adapt the penalty coefficient $\beta$ so that some target KL divergence $d_{\text{targ}}$ is achieved at each policy update.
- With minibatch SGD, optimize the KL-penalized objective:
$$ L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\Big[r_t(\theta)\, \hat{A}_t - \beta\, D_{\mathrm{KL}}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big]\Big]. $$
- Compute $d = \hat{\mathbb{E}}_t\big[D_{\mathrm{KL}}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big]\big]$:
  - If $d < d_{\text{targ}} / 1.5$: $\beta \leftarrow \beta / 2$.
  - If $d > d_{\text{targ}} \times 1.5$: $\beta \leftarrow \beta \times 2$.
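The coefficient update as a tiny helper (names are illustrative):

```python
def update_kl_coef(beta, measured_kl, kl_target):
    """Adapt the KL penalty coefficient beta after each policy update."""
    if measured_kl < kl_target / 1.5:
        beta = beta / 2.0     # policy moved too little; relax the penalty
    elif measured_kl > kl_target * 1.5:
        beta = beta * 2.0     # policy moved too much; strengthen the penalty
    return beta
```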
PPO algorithms
Finally, the objective function is augmented with an error term on the value estimation and an entropy term to encourage sufficient exploration:
$$ L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\Big[L_t^{CLIP}(\theta) - c_1\, L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t)\Big], $$
where $c_1$, $c_2$ are coefficients, $S$ denotes the entropy bonus, and $L_t^{VF} = \big(V_\theta(s_t) - V_t^{\text{targ}}\big)^2$ is a squared-error value loss.
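A sketch of the combined loss in PyTorch; the coefficients and tensor names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(logp_new, logp_old, advantages, values, value_targets, entropy,
                   clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Negated L^{CLIP+VF+S}, so it can be minimized by a single optimizer."""
    ratio = torch.exp(logp_new - logp_old)
    clip_obj = torch.min(ratio * advantages,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages).mean()
    value_loss = F.mse_loss(values, value_targets)           # L^VF
    return -(clip_obj - vf_coef * value_loss + ent_coef * entropy.mean())
```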
- Settings:
- RNN
- Adam
- mini-batch SGD
PPO algorithm, actor-critic style:
- for iteration = 1, 2, ... do:
  - for actor = 1, 2, ..., N do:
    - Run policy $\pi_{\theta_{\text{old}}}$ in the environment for $T$ timesteps;
    - Compute advantage estimates $\hat{A}_1, \ldots, \hat{A}_T$;
  - Optimize the surrogate $L$ w.r.t. $\theta$, with $K$ epochs and minibatch size $M \le NT$;
  - $\theta_{\text{old}} \leftarrow \theta$.
Distributed PPO
Let $W$ be the number of workers, $D$ a threshold on the number of workers whose gradients must be available before the parameters are updated, and $M$, $B$ the number of sub-iterations of policy and baseline updates given a batch of data points [9].
Distributed PPO (the PPO-penalty variant) distributes data collection and gradient computation over the workers and averages the gradients centrally; the full pseudocode is given in [9].
Experiments indicate that averaging gradients and applying them synchronously leads to better results in practice than applying them asynchronously.
Generalized Advantage Estimation (GAE)
Challenges:
- Policy gradient methods require a large number of samples;
- It is difficult to obtain stable and steady improvement despite the non-stationarity of the incoming data;
- The credit assignment problem in RL (a.k.a. the distal reward problem in the behavioral literature): rewards may arrive long after the actions that caused them.
Solution [10]:
- Use value functions to reduce the variance of policy gradient estimates, at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD($\lambda$);
- Use a trust region optimization (TRPO-style) procedure for both the policy and the value function, each represented by a neural network.
Advantage function estimation
The policy gradient uses an estimator $\hat{A}_t$ of the advantage. Define the TD residual of $V$ with discount $\gamma$:
$$ \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t). $$
Now take the sum of $k$ of these residuals, which gives the $k$-step advantage estimator:
$$ \hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^V = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}). $$
With $k \to \infty$,
$$ \hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = -V(s_t) + \sum_{l=0}^{\infty} \gamma^l r_{t+l}, $$
which is simply the empirical return minus the value function baseline.
GAE
GAE($\gamma$, $\lambda$) is defined as the exponentially-weighted average of these $k$-step estimators:
$$ \hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V. $$
Now consider two special cases:
- when $\lambda = 0$: $\hat{A}_t = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$, which has low variance but is biased unless $V = V^{\pi,\gamma}$;
- when $\lambda = 1$: $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$, which is unbiased regardless of the accuracy of $V$, but has high variance due to the sum of rewards.
It shows that GAE($\gamma$, $\lambda$) with $0 < \lambda < 1$ makes a compromise between bias and variance, controlled by $\lambda$. (A reference implementation sketch follows below.)
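A NumPy sketch of the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$ over one rollout segment; array names are illustrative:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) over a segment.

    rewards, dones: length-T arrays; values: V(s_0..s_{T-1}); last_value: V(s_T).
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]   # TD residual
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
        next_value = values[t]
    returns = adv + values       # common choice of value-function targets
    return adv, returns
```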
Interpretation as Reward Shaping
Reward shaping refers to the following reward transformation of the MDP:
$$ \tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s), $$
where $\Phi$ is an arbitrary scalar-valued function on the state space. Let $\tilde{Q}^{\pi,\gamma}$, $\tilde{V}^{\pi,\gamma}$, $\tilde{A}^{\pi,\gamma}$ be the value and advantage functions of the transformed MDP:
$$ \tilde{Q}^{\pi,\gamma}(s,a) = Q^{\pi,\gamma}(s,a) - \Phi(s), \qquad \tilde{V}^{\pi,\gamma}(s) = V^{\pi,\gamma}(s) - \Phi(s), \qquad \tilde{A}^{\pi,\gamma}(s,a) = A^{\pi,\gamma}(s,a). $$
Let $\Phi = V$; then the shaped reward equals the TD residual, $\tilde{r}(s_t, a_t, s_{t+1}) = \delta_t^V$, and taking a discounted sum of shaped rewards with discount $\gamma\lambda$ recovers the GAE estimator: $\sum_{l=0}^{\infty} (\gamma\lambda)^l\, \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \hat{A}_t^{GAE(\gamma,\lambda)}$.
Value function estimation
The value function is fit by minimizing $\sum_n \|V_\phi(s_n) - \hat{V}_n\|^2$ subject to a trust region: constrain the average KL divergence between the previous value function and the new value function to be smaller than $\epsilon$, where the value function is interpreted as parameterizing a conditional Gaussian with mean $V_\phi(s)$ and fixed variance $\sigma^2$, so the constraint becomes $\frac{1}{N}\sum_n \frac{\|V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n)\|^2}{2\sigma^2} \le \epsilon$.
Actor-Critic with Experience Replay (ACER)
Actor-Critic with Experience Replay (ACER) [12] employs truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.
Importance weight truncation with bias correction
With importance weights $\rho_t = \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}$ (behavior policy $\mu$), ACER decomposes the off-policy policy gradient into a truncated term and a bias-correction term:
$$ \hat{g}_t^{\text{acer}} = \bar{\rho}_t\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\,\big[Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t)\big] + \mathbb{E}_{a \sim \pi}\Big[\Big[\tfrac{\rho_t(a) - c}{\rho_t(a)}\Big]_+ \nabla_\theta \log \pi_\theta(a \mid x_t)\,\big[Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t)\big]\Big], $$
where $\bar{\rho}_t = \min\{c, \rho_t\}$ is the truncated importance weight, $[x]_+ = \max(0, x)$, and $Q^{\text{ret}}$ is the Retrace estimate of the action value.
Efficient TRPO
ACER maintains an average policy network, a running average of past policies, and constrains the updated policy to stay close to this average. The KL constraint is imposed on the output distribution of the network rather than on the parameters directly, and the constrained problem is solved in closed form as a linear projection of the policy gradient before backpropagating through the rest of the network. The average network parameters are updated softly after each update of the policy parameters.
ACKTR
Soft Q-learning
Soft Q-learning (SQL) expresses the optimal policy via a Boltzmann distribution (a.k.a. Gibbs distribution).
- Soft Q-learning formulates a stochastic policy as a (conditional) energy-based model (EBM), with the energy function corresponding to the "soft" Q-function obtained when optimizing the maximum entropy objective.
- "The entropy regularized actor-critic algorithms can be viewed as approximate Q-learning methods, with the actor serving the role of an approximate sampler from an intractable posterior" [14].
Contributions:
- Improved exploration in multi-modal reward landscapes, where conventional deterministic or unimodal methods are at high risk of falling into suboptimal local optima.
- Stochastic energy-based policies can provide a much better initialization for learning new skills than either random policies or policies pretrained with maximum expected reward objectives.
Maximum Entropy RL
- Conventional RL learns a policy that maximizes the expected sum of rewards: $\pi^\ast = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}[r(s_t, a_t)]$.
- Maximum entropy RL augments the objective with the policy's entropy at each visited state:
$$ \pi^\ast_{\text{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big], $$
where $\alpha$ is used to determine the relative importance of entropy and reward.
Energy-based models (EBM)
In a multimodal Q-value landscape, conventional RL specifies a unimodal policy distribution, centered at the maximal Q-value and extending to the neighboring actions to provide noise for exploration. Exploration is then biased towards that single mode, and the other modes are ignored completely [15].
To make the agent explore all promising states while prioritizing the more promising ones, Soft Q-learning defines the policy in terms of exponentiated Q-values:
$$ \pi(a_t \mid s_t) \propto \exp\big(Q_{\text{soft}}(s_t, a_t)\big), $$
where $Q_{\text{soft}}$ is the soft Q-function defined below.
Soft value functions
$$ Q_{\text{soft}}(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\big[V_{\text{soft}}(s_{t+1})\big], $$
where the soft value function is a "soft maximum" over actions.
The soft Q-function is:
$$ Q^\ast_{\text{soft}}(s_t, a_t) = r_t + \mathbb{E}_{(s_{t+1}, \ldots) \sim \rho_\pi}\Big[\sum_{l=1}^{\infty} \gamma^l \big(r_{t+l} + \alpha\, \mathcal{H}(\pi^\ast_{\text{MaxEnt}}(\cdot \mid s_{t+l}))\big)\Big]. $$
The soft value function:
$$ V^\ast_{\text{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\Big(\frac{1}{\alpha}\, Q^\ast_{\text{soft}}(s_t, a')\Big)\, da'. $$
Then the optimal policy is:
$$ \pi^\ast_{\text{MaxEnt}}(a_t \mid s_t) = \exp\Big(\frac{1}{\alpha}\big(Q^\ast_{\text{soft}}(s_t, a_t) - V^\ast_{\text{soft}}(s_t)\big)\Big). $$
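A discrete-action illustration of the soft value (a log-sum-exp) and the induced Boltzmann policy; in SQL the action space is continuous and the integral has to be approximated by sampling:

```python
import numpy as np

def soft_value_and_policy(q_values, alpha=1.0):
    """V = alpha * log sum_a exp(Q(s,a)/alpha) and pi(a|s) = exp((Q(s,a) - V)/alpha)."""
    q = np.asarray(q_values, dtype=np.float64)
    # log-sum-exp with the max subtracted for numerical stability
    v = alpha * np.log(np.sum(np.exp((q - q.max()) / alpha))) + q.max()
    policy = np.exp((q - v) / alpha)
    return v, policy

v, pi = soft_value_and_policy([1.0, 2.0, 0.5], alpha=0.5)
print(v, pi, pi.sum())   # the policy sums to 1
```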
Proofs
- Define the optimal policy in EBM form: $\pi(a_t \mid s_t) = \frac{\exp(Q(s_t, a_t)/\alpha)}{Z(s_t)}$, where the partition function $Z(s_t) = \int \exp(Q(s_t, a')/\alpha)\, da'$ is the normalizer (the integral of the numerator over actions), so that $V(s_t) = \alpha \log Z(s_t)$.
- Minimizing the KL divergence $D_{\mathrm{KL}}\big(\pi(\cdot \mid s_t) \,\|\, \exp\big(\tfrac{1}{\alpha}(Q_{\text{soft}}(s_t, \cdot) - V_{\text{soft}}(s_t))\big)\big)$ over policies recovers exactly this energy-based form.
Soft Q-learning
- Soft Bellman backup: $Q_{\text{soft}}(s_t, a_t) \leftarrow r_t + \gamma\, \mathbb{E}_{s_{t+1}}[V_{\text{soft}}(s_{t+1})]$.
This backup cannot be performed exactly in continuous or large state and action spaces, and sampling from the energy-based model is intractable in general.
Sampling
Importance sampling
The soft value function can be estimated with importance sampling:
$$ V_{\text{soft}}^\theta(s_t) = \alpha \log \mathbb{E}_{q_{a'}}\Big[\frac{\exp\big(\frac{1}{\alpha} Q_{\text{soft}}^\theta(s_t, a')\big)}{q_{a'}(a')}\Big], $$
where $q_{a'}$ is an arbitrary sampling distribution over actions.
The soft Bellman backup can equivalently be expressed as minimizing the squared soft Bellman residual:
$$ J_Q(\theta) = \mathbb{E}_{s_t \sim q_{s_t},\, a_t \sim q_{a_t}}\Big[\tfrac{1}{2}\big(\hat{Q}^{\bar{\theta}}_{\text{soft}}(s_t, a_t) - Q^\theta_{\text{soft}}(s_t, a_t)\big)^2\Big], $$
where $\hat{Q}^{\bar{\theta}}_{\text{soft}}(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1}}\big[V^{\bar{\theta}}_{\text{soft}}(s_{t+1})\big]$ is a target value computed with target parameters $\bar{\theta}$.
Stein Variational Gradient Descent (SVGD)
- How do we approximately sample from the soft Q-function? Two options:
  - MCMC-based sampling;
  - learning a stochastic sampling network trained to output approximate samples from the target distribution.
Soft Q-learning uses a sampling network based on Stein variational gradient descent (SVGD) and amortized SVGD.
- Learn a state-conditioned stochastic neural network $a_t = f^\phi(\xi; s_t)$ that maps noise samples $\xi$, drawn from an arbitrary distribution (e.g. Gaussian), into action samples from the target EBM given by $Q^\theta_{\text{soft}}$.
- The induced distribution of the actions, $\pi^\phi(a_t \mid s_t)$, should approximate the energy-based distribution in terms of KL divergence:
$$ J_\pi(\phi; s_t) = D_{\mathrm{KL}}\Big(\pi^\phi(\cdot \mid s_t) \,\Big\|\, \exp\big(\tfrac{1}{\alpha}\big(Q^\theta_{\text{soft}}(s_t, \cdot) - V^\theta_{\text{soft}}(s_t)\big)\big)\Big). $$
- SVGD provides the greedy direction of steepest descent on this KL as a functional $\Delta f^\phi(\cdot; s_t)$ (a kernel-smoothed gradient of $Q_{\text{soft}}$ plus a repulsive term), and the policy network is updated through the chain rule:
$$ \frac{\partial J_\pi(\phi; s_t)}{\partial \phi} \propto \mathbb{E}_{\xi}\Big[\Delta f^\phi(\xi; s_t)\, \frac{\partial f^\phi(\xi; s_t)}{\partial \phi}\Big]. $$
Algorithms
- Initialize the Q-network and policy-network parameters $\theta$, $\phi$; assign target parameters $\bar{\theta} \leftarrow \theta$, $\bar{\phi} \leftarrow \phi$; start with an empty replay memory $\mathcal{D}$.
- For each epoch:
  - For each t do:
    - Collect experience:
      - Sample an action $a_t$ for $s_t$ using $f^\phi$: $a_t \leftarrow f^\phi(\xi; s_t)$, where $\xi \sim \mathcal{N}(0, I)$.
      - Sample the next state from the environment: $s_{t+1} \sim p_s(s_{t+1} \mid s_t, a_t)$.
      - Save the new experience in the replay memory: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r_t, s_{t+1})\}$.
    - Sample a minibatch from the replay memory.
    - Update the soft Q-function parameters:
      - Sample actions $\{a^{(i,j)}\} \sim q_{a'}$ for each next state in the minibatch;
      - Compute the empirical soft values $\hat{V}^{\bar{\theta}}_{\text{soft}}$;
      - Compute the empirical gradient of $J_Q$ and update $\theta$ using Adam.
    - Update the policy:
      - Sample noise $\{\xi^{(i,j)}\} \sim \mathcal{N}(0, I)$ for each state in the minibatch;
      - Compute actions $a^{(i,j)} = f^\phi(\xi^{(i,j)}; s^{(i)})$;
      - Compute $\Delta f^\phi$ and the empirical estimate of $\partial J_\pi / \partial \phi$, and update $\phi$ using Adam.
  - If epoch mod update_interval == 0: update the target parameters $\bar{\theta} \leftarrow \theta$, $\bar{\phi} \leftarrow \phi$.
Benefits
- Better exploration. SQL provides an implicit exploration strategy by assigning each action a non-zero probability, shaped by the current belief about its value, effectively combining exploration and exploitation in a natural way [15].
- Fine-tuning maximum entropy policies: general-to-specific transfer. Pre-train policies for general-purpose tasks, then use them as templates or initializations for more specific tasks.
- Compositionality. Compose new skills from existing policies, even without any fine-tuning, by intersecting different skills [15].
- Robustness. The maximum entropy formulation encourages the agent to try all promising solutions, so it learns to explore a large portion of the state space. Agents trained this way can act in a variety of situations and are more robust against perturbations in the environment.
Soft Actor-Critic (SAC)
Soft Actor-Critic (SAC) [13] is an off-policy actor-critic algorithm based on the maximum entropy framework. Like SQL, SAC augments the conventional RL objective with an entropy term:
$$ J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\big], $$
where the temperature parameter $\alpha$ determines the relative importance of the entropy term against the reward.
The maximum entropy objective helps to:
- encourage the policy to explore more widely, while giving up on clearly unpromising avenues;
- capture multiple modes of near-optimal behavior;
- improve learning speed and exploration.
Soft policy iteration
Policy evaluation
The soft Q-value of a fixed policy $\pi$ can be computed by repeatedly applying the modified Bellman backup operator $\mathcal{T}^\pi$:
$$ \mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big], $$
where the soft state value function is:
$$ V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q(s_t, a_t) - \log \pi(a_t \mid s_t)\big]. $$
Policy Improvement
For each state, we update the policy towards the exponential of the new soft Q-function:
$$ \pi_{\text{new}} = \arg\min_{\pi' \in \Pi}\; D_{\mathrm{KL}}\Big(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{\text{old}}}(s_t, \cdot)\big)}{Z^{\pi_{\text{old}}}(s_t)}\Big), $$
where the partition function $Z^{\pi_{\text{old}}}(s_t)$ normalizes the distribution; it is intractable in general, but it does not contribute to the gradient with respect to the new policy, so it can be ignored.
SAC
Consider a parameterized state value function $V_\psi(s_t)$, a soft Q-function $Q_\theta(s_t, a_t)$, and a tractable policy $\pi_\phi(a_t \mid s_t)$.
The soft value function is trained to minimize the squared residual error:
$$ J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)]\big)^2\Big]. $$
The soft Q-function is trained to minimize the soft Bellman residual:
$$ J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\big)^2\Big], $$
with the target $\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V_{\bar{\psi}}(s_{t+1})\big]$, where $V_{\bar{\psi}}$ is a target value network whose weights are an exponentially moving average of $\psi$.
The policy is optimized by minimizing the expected KL divergence:
$$ J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[D_{\mathrm{KL}}\Big(\pi_\phi(\cdot \mid s_t) \,\Big\|\, \frac{\exp(Q_\theta(s_t, \cdot))}{Z_\theta(s_t)}\Big)\Big]. $$
Minimizing $J_\pi$ requires gradients through the sampled actions:
- Reparameterize the policy with a neural network transformation $a_t = f_\phi(\epsilon_t; s_t)$, where $\epsilon_t$ is an input noise vector sampled from some fixed distribution (e.g. a spherical Gaussian). Rewrite the previous objective as:
$$ J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\big[\log \pi_\phi\big(f_\phi(\epsilon_t; s_t) \mid s_t\big) - Q_\theta\big(s_t, f_\phi(\epsilon_t; s_t)\big)\big]. $$
SAC employs two Q-functions to mitigate positive bias in the policy improvement step, using the minimum of the two Q-functions for the value gradient and the policy gradient.
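A sketch of the three losses in PyTorch. The modules `value_net`, `target_value_net`, `q1`, `q2`, and `policy` (whose `sample` uses the reparameterization trick and returns an action with its log-probability) are assumptions; each loss would be minimized with its own Adam optimizer, and the target value network tracks $\psi$ by Polyak averaging:

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, value_net, target_value_net, q1, q2, policy, gamma=0.99):
    s, a, r, s_next, done = batch            # done: float tensor of 0/1, shapes assumed consistent

    # Value loss: V_psi(s) should match E_a[min Q(s,a) - log pi(a|s)].
    with torch.no_grad():
        a_v, logp_v = policy.sample(s)
        v_target = torch.min(q1(s, a_v), q2(s, a_v)) - logp_v
    value_loss = F.mse_loss(value_net(s), v_target)

    # Soft Bellman residual: Q_theta(s,a) should match r + gamma * V_psibar(s').
    with torch.no_grad():
        q_target = r + gamma * (1.0 - done) * target_value_net(s_next)
    q_loss = F.mse_loss(q1(s, a), q_target) + F.mse_loss(q2(s, a), q_target)

    # Policy loss: E[log pi(f(eps;s)|s) - Q(s, f(eps;s))], gradients flow through a_pi.
    a_pi, logp_pi = policy.sample(s)
    q_pi = torch.min(q1(s, a_pi), q2(s, a_pi))
    policy_loss = (logp_pi - q_pi).mean()

    return value_loss, q_loss, policy_loss
```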
Automatically adjusted temperature
SAC is brittle with respect to the temperature parameter: the optimal temperature $\alpha$ varies across tasks and over the course of training, and choosing it is non-trivial. A later variant adjusts $\alpha$ automatically [17].
The constrained formulation finds a stochastic policy with maximal expected return that satisfies a minimum expected entropy constraint:
$$ \max_{\pi_{0:T}}\; \mathbb{E}\Big[\sum_{t=0}^{T} r(s_t, a_t)\Big] \quad \text{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[-\log \pi_t(a_t \mid s_t)\big] \ge \bar{\mathcal{H}} \;\;\forall t, $$
where $\bar{\mathcal{H}}$ is a desired minimum expected entropy.
- Rewrite the objective as an iterated maximization (dynamic programming backwards in time), using Lagrangian duality at each step with the temperature $\alpha_t$ as the dual variable.
- Finally, the temperature is obtained by minimizing the dual objective by gradient descent:
$$ J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\big[-\alpha \log \pi_t(a_t \mid s_t) - \alpha\, \bar{\mathcal{H}}\big]. $$
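A small PyTorch sketch of the dual update for $\alpha$; optimizing $\log\alpha$ and using a target entropy of $-|\mathcal{A}|$ are common implementation choices, not requirements of the derivation:

```python
import torch

def make_temperature_updater(action_dim, lr=3e-4):
    log_alpha = torch.zeros(1, requires_grad=True)   # optimize log(alpha) so alpha > 0
    opt = torch.optim.Adam([log_alpha], lr=lr)
    target_entropy = -float(action_dim)              # heuristic target H_bar = -|A|

    def update(log_prob_batch):
        # J(alpha) = E[-alpha * (log pi(a|s) + H_bar)]
        loss = -(log_alpha.exp() * (log_prob_batch + target_entropy).detach()).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return log_alpha.exp().item()                # current temperature

    return update
```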
Algorithms
In each gradient step, sample a minibatch from the replay buffer $\mathcal{D}$ and update the value parameters $\psi$ with $\hat{\nabla}_\psi J_V$, the Q-function parameters $\theta_i$ with $\hat{\nabla}_{\theta_i} J_Q$, the policy parameters $\phi$ with $\hat{\nabla}_\phi J_\pi$ (and, in the automatic-temperature variant, $\alpha$ with $\hat{\nabla}_\alpha J(\alpha)$), then update the target network weights $\bar{\psi}$ by an exponential moving average.
Deterministic Policy Gradient (DPG)
Deep DPG (DDPG)
Distributed Distributional DDPG (D4PG)
Multi-Agent DDPG (MADDPG)
Twin Delayed Deep Deterministic PG (TD3)
References
- 1. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML.
- 2. Policy gradient algorithms, A3C section.
- 3. Actor: updates the policy parameters θ in the direction suggested by the critic. Critic: updates the value-function parameters w; the value function could be action-value or state-value.
- 3. Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.
- 4. A2C.
- 5. Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., & Moritz, P. (2015). Trust Region Policy Optimization. ICML.
- 6. TRPO blog 1.
- 7. TRPO blog 2.
- 8. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
- 9. Heess, N., Dhruva, T., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S. M., Riedmiller, M. A., & Silver, D. (2017). Emergence of Locomotion Behaviours in Rich Environments. arXiv:1707.02286.
- 10. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
- 11. Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017). Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation. NeurIPS.
- 12. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & de Freitas, N. (2016). Sample Efficient Actor-Critic with Experience Replay. arXiv:1611.01224.
- 13. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290.
- 14. Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. ICML.
- 15. Soft Q-learning, UC Berkeley blog.
- 16. Notes on the Generalized Advantage Estimation Paper.
- 17. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., ... & Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905.