Notes on lectures by D. Silver. A brief introduction to MDPs.
Markov Decision Processes
Intro to MDPs
MDPs formally describe an environment for RL in which the environment is fully observable, i.e. the current state completely characterizes the process.
Almost all RL problems can be formalized as MDPs, e.g.
- Optimal control primarily deals with continuous MDPs
- Partially observable problems can be converted into MDPs
- Bandits are MDPs with one state
Markov Process
Markov property
“The future is independent of the past given the present”
Definition: a state $S_t$ is Markov iff
$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$
- The state captures all relevant information from the history.
- Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future.
State Transition Matrix
For a Markov state $s$ and successor state $s'$, the state transition probability is:
$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
The state transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$, with entry $(s, s')$ equal to $\mathcal{P}_{ss'}$,
where each row of the matrix sums to 1.
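As a quick illustration (not part of the lecture), here is a small hypothetical transition matrix in NumPy; the states and probabilities are made up:

```python
import numpy as np

# Hypothetical 3-state Markov chain: P[s, s'] = probability of moving
# from state s to state s' in one step.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
])

# Each row is a probability distribution over successor states.
assert np.allclose(P.sum(axis=1), 1.0)

# Distribution over states after 10 steps, starting from state 0.
mu0 = np.array([1.0, 0.0, 0.0])
print(mu0 @ np.linalg.matrix_power(P, 10))
```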
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states $S_1$, $S_2$, … with the Markov property.
Definition: a Markov Process (or Markov Chain) is a tuple $<\mathcal{S}, \mathcal{P}>$, where $\mathcal{S}$ is a (finite) set of states and $\mathcal{P}$ is a state transition probability matrix.
Markov Reward Process
A Markov reward process is a Markov chain with values.
Definition: an MRP is a tuple $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Return
Definition: the return $G_t$ is the total discounted reward from time-step $t$:
$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ (a small code sketch follows the list below)
- The discount $\gamma \in [0,1]$ is the present value of future rewards.
- The value of receiving reward $R$ after $k+1$ time steps is $\gamma^k R$.
- This values immediate reward above delayed reward:
- $\gamma$ close to 0 leads to “myopic” evaluation
- $\gamma$ close to 1 leads to “far-sighted” evaluation
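A minimal sketch of computing a (finite-horizon) return from a sampled reward sequence; the rewards and discount below are made-up values:

```python
def discounted_return(rewards, gamma):
    """G = R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.0))   # myopic: only the first reward
```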
Why discount?
Most Markov reward and decision processes are discounted. Why?
- Mathematically convenient to discount rewards.
- Avoids infinite returns in cyclic Markov processes.
- Uncertainty about the future may not be fully represented.
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behavior shows preference for immediate reward
- It is sometimes possible to use undiscounted MRPs (i.e. $\gamma = 1$), e.g. if all sequences terminate.
Value Function
The value function $v(s)$ gives the long-term value of state $s$.
Definition: the state-value function $v(s)$ of an MRP is the expected return starting from state $s$:
$v(s) = \mathbb{E}[G_t \mid S_t = s]$
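As an illustration (hypothetical MRP, not from the lecture), $v(s)$ can be estimated by averaging discounted returns over truncated sample rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state MRP (made-up dynamics and rewards).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, 0.0, -1.0])   # expected reward received on leaving each state
gamma = 0.9

def sample_return(s, horizon=200):
    """One truncated rollout: accumulate discounted rewards starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(len(R), p=P[s])
    return g

# Monte-Carlo estimate of v(s) = E[G_t | S_t = s] for each state.
print([np.mean([sample_return(s) for _ in range(2000)]) for s in range(3)])
```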
Bellman Equation for MRPs
The value function can be decomposed into two parts:
- immediate reward $R_{t+1}$
- discounted value of the successor state $\gamma v(S_{t+1})$
$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$
Bellman equation in matrix form
$v = \mathcal{R} + \gamma \mathcal{P} v$
where $v$ is a column vector with one entry per state.
Solving the Bellman equation
The Bellman equation is a linear equation, so it can be solved directly:
$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$
The computational complexity is $O(n^3)$ for $n$ states (a code sketch follows this list).
- Direct solution is only possible for small MRPs
- For large MRPs, there are iterative methods, e.g.
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
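A sketch of the direct solution for a small hypothetical MRP (same made-up dynamics as above), using NumPy's linear solver rather than an explicit matrix inverse:

```python
import numpy as np

# Hypothetical 3-state MRP.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R, solved as a linear system.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```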
Markov Decision Process
Markov Decision Process (MDP): a Markov reward process with decisions.
An MDP is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$ is a discount factor $\gamma \in [0,1]$
Policies
A policy $\pi$ is a distribution over actions given states:
$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
- A policy fully defines the behavior of an agent.
- MDP policies depend on the current state (not the history), i.e. policies are stationary (time-independent).
Given an MDP $\mathcal{M} = <\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and a policy $\pi$
- The state sequence $S_1, S_2,…$ is a Markov process $<\mathcal{S}, \mathcal{P}^{\pi}>$
- The state and reward sequence $S_1, R_2, S_2, \ldots$ is a Markov reward process $<\mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma>$
where
$\mathcal{P}^{\pi}_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{P}^a_{ss'}$
$\mathcal{R}^{\pi}_s = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}^a_s$
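A sketch of how a policy induces $\mathcal{P}^{\pi}$ and $\mathcal{R}^{\pi}$ from the MDP dynamics; the tensor shapes, values, and the uniform-random policy below are assumptions for illustration:

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical MDP dynamics P[a, s, s'] and rewards R[s, a] (made-up values).
P = np.array([[[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.0, 0.5, 0.5]],
              [[0.2, 0.8, 0.0], [0.0, 0.4, 0.6], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])

pi = np.full((n_states, n_actions), 0.5)   # uniform-random policy pi(a|s)

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

assert np.allclose(P_pi.sum(axis=1), 1.0)   # P^pi is again a valid transition matrix
print(P_pi, R_pi, sep="\n")
```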
Value Function
The state-value function $V^{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$:
$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$
The action-value function $Q^{\pi}(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:
$V^{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma V^{\pi}(S_{t+1}) \mid S_t = s]$
The action-value function can similarly be decomposed:
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
Bellman Expectation equation for $V^{\pi}$
$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) Q^{\pi}(s, a)$ (Eq. 1)
$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V^{\pi}(s') \right)$ (Eq. 2)
Bellman Expectation equation for $Q^{\pi}$
$Q^{\pi}(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V^{\pi}(s')$ (Eq. 1)
$Q^{\pi}(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') Q^{\pi}(s', a')$ (Eq. 2)
Matrix form
The Bellman expectation equation can be expressed concisely using the induced MRP:
$V^{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} V^{\pi}$
with direct solution
$V^{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi}$
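Continuing the same kind of hypothetical example, a sketch of this direct solution via the induced MRP (made-up dynamics, uniform-random policy):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# Hypothetical MDP: P[a, s, s'] dynamics, R[s, a] rewards (made-up values).
P = np.array([[[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.0, 0.5, 0.5]],
              [[0.2, 0.8, 0.0], [0.0, 0.4, 0.6], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])
pi = np.full((n_states, n_actions), 0.5)            # uniform-random policy

# MRP induced by pi.
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

# V^pi = (I - gamma * P^pi)^{-1} R^pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Q^pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} V^pi(s')
Q_pi = R + gamma * np.einsum('ast,t->sa', P, V_pi)
print(V_pi, Q_pi, sep="\n")
```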
Optimal Value Function
The optimal state-value function $V^{*}(s)$ is the maximum value function over all policies:
$V^{*}(s) = \max_{\pi} V^{\pi}(s)$
The optimal action-value function $Q^{*}(s, a)$ is the maximum action-value function over all policies:
$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$
An MDP is solved when we know the optimal value function.
Optimal Policy
Define a partial ordering over policies:
$\pi \geq \pi'$ if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all $s$
An optimal policy can be found by maximizing over $Q^{*}(s, a)$:
$\pi^{*}(a \mid s) = 1$ if $a = \arg\max_{a \in \mathcal{A}} Q^{*}(s, a)$, and $0$ otherwise
Solving the Bellman Optimality Equation
- The Bellman optimality equation is non-linear: $V^{*}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V^{*}(s') \right)$
- No closed-form solution (in general)
- Many iterative solution methods (a value-iteration sketch follows this list):
- Value iteration
- Policy iteration
- Q-learning
- Sarsa
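As an illustration of one such method (not the lecture's own code), a minimal value-iteration sketch on a hypothetical MDP; the tensor layout and values are assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a [ R(s, a) + gamma * sum_s' P(a, s, s') V(s') ].

    P has shape (n_actions, n_states, n_states); R has shape (n_states, n_actions).
    Returns the optimal value function and a greedy deterministic policy.
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # one Q backup per (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # pi*(s) = argmax_a Q(s, a)
        V = V_new

# Hypothetical 3-state, 2-action MDP (made-up values).
P = np.array([[[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.0, 0.5, 0.5]],
              [[0.2, 0.8, 0.0], [0.0, 0.4, 0.6], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])

V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```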
Extensions to MDPs
Infinite MDPs
- Countably infinite state and/or action spaces: straightforward
- Continuous state and/or action spaces
- Closed form for linear quadratic model (LQR)
- Continuous time
- Requires partial differential equations
- Hamilton-Jacobi-Bellman (HJB) equation
- Limiting case of the Bellman equation as the time-step $\rightarrow 0$
POMDPs
A POMDP is an MDP with hidden states. It is a hidden Markov model with actions.
A POMDP is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma>$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{O}$ is a finite set of observations
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\mathcal{Z}$ is an observation function, $\mathcal{Z}^a_{s'o} = \mathbb{P}[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$
- $\gamma$ is a discount factor $\gamma \in [0,1]$