Example: Grid World
- Maze-Like Problem: The agent lives in a grid; walls block the agent's path.
- Noisy Movement: Actions do not always go as planned (see the sketch after this list).
  - Example: 80% of the time, the action North takes the agent North.
  - 10% of the time, North takes the agent West; 10% of the time, it takes the agent East.
- Goal: Maximize the sum of rewards.
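As a concrete illustration, here is a minimal sketch of the 80/10/10 noise model in Python. The direction names and the helper function are hypothetical, not from the original slides:

```python
import random

# Noise model for grid-world movement: the intended direction succeeds
# 80% of the time; otherwise the agent slips to a perpendicular direction.
NOISE = {
    "N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
    "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
    "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
    "W": [("W", 0.8), ("S", 0.1), ("N", 0.1)],
}

def sample_outcome(intended):
    """Sample the direction the agent actually moves in."""
    directions, probs = zip(*NOISE[intended])
    return random.choices(directions, weights=probs)[0]
```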
Markov Decision Process
An MDP is defined by (a concrete encoding is sketched after this list):
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s′)
  - Probability that a from s leads to s′, i.e., T(s, a, s′) = P(s′ | s, a)
- A reward function R(s, a, s′)
  - Sometimes just R(s) or R(s′)
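One minimal way to write the definition down in code — a hypothetical racing-style toy MDP, not the grid world above. T maps (s, a) to a list of (s′, probability) pairs, so P(s′ | s, a) is just a lookup:

```python
# States S, actions A, transition function T, and reward function R.
states = ["cool", "warm", "overheated"]   # hypothetical toy states
actions = ["slow", "fast"]

# T[(s, a)] lists the possible next states s' with their probabilities.
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}

def R(s, a, s_next):
    """Reward for taking action a in state s and landing in s_next."""
    return -10.0 if s_next == "overheated" else (2.0 if a == "fast" else 1.0)
```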
What is Markov about MDPs?
- “Markov” generally means that given the present state, the future and the past are independent
- For Markov decision processes, “Markov” means action outcomes depend only on the current state (the sketch after this list spells this out in code):
- P(S_{t+1} = s′ | S_t = s_t, A_t = a_t, S_{t−1} = s_{t−1}, A_{t−1} = a_{t−1}, …, S_0 = s_0) = P(S_{t+1} = s′ | S_t = s_t, A_t = a_t)
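The property can be read off a simulator's signature: even if we hand it the whole history, a Markov model only ever consults the most recent state. A sketch reusing the T encoding from the MDP section:

```python
import random

def sample_next_state(history, action, T):
    """history = [s_0, s_1, ..., s_t].  Under the Markov assumption the
    distribution of s_{t+1} depends only on the last state and the action;
    everything earlier in the history is ignored."""
    s_t = history[-1]
    next_states, probs = zip(*T[(s_t, action)])
    return random.choices(next_states, weights=probs)[0]
```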
Policies
- In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to goal
- For MDPs, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy is one that maximizes expected utility if followed (the rollout sketch after this list estimates that expected utility)
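A policy is just a state-to-action mapping, and its expected utility can be estimated by averaging sampled rollouts. A sketch reusing T and R from the MDP encoding above; the policy entries, discount, and horizon are assumptions for illustration:

```python
import random

policy = {"cool": "fast", "warm": "slow"}  # hypothetical policy pi: S -> A

def rollout(policy, T, R, s0, gamma=0.9, horizon=100):
    """Discounted sum of rewards from one episode that follows the policy."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy.get(s)
        if a is None or (s, a) not in T:   # treat missing entries as terminal
            break
        next_states, probs = zip(*T[(s, a)])
        s_next = random.choices(next_states, weights=probs)[0]
        total += discount * R(s, a, s_next)
        discount *= gamma
        s = s_next
    return total

# Monte Carlo estimate of the policy's expected utility from state "cool":
estimate = sum(rollout(policy, T, R, "cool") for _ in range(10_000)) / 10_000
```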
Infinite Utilities
Problem: What if the game lasts forever? Solutions:
- Finite Horizon: similar to depth-limited search
  - Terminate episodes after a fixed number of steps T (e.g., life)
  - Gives nonstationary policies (π depends on the time left)
- Discounting: use 0 < γ < 1 (the sketch after this list shows why this keeps utilities finite)
- Absorbing State: guarantee that for every policy, a terminal state will eventually be reached
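Why discounting works: with 0 < γ < 1 and rewards bounded by R_max, the discounted sum is a geometric series bounded by R_max / (1 − γ). A quick numeric check; the specific γ and R_max values are illustrative:

```python
gamma, R_max = 0.9, 1.0

bound = R_max / (1 - gamma)            # geometric-series bound: 10.0

# Even an infinite stream of maximal rewards cannot exceed the bound;
# 1000 terms already get within floating-point distance of it.
partial = sum(gamma**t * R_max for t in range(1000))
print(f"{partial:.6f} <= {bound}")     # 9.999... <= 10.0
```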
Optimal Quantities
The value (utility) of a state s:
- V*(s) = expected utility starting in s and acting optimally

The value (utility) of a state-action pair (s, a):
- Q*(s, a) = expected utility starting out having taken action a from state s and thereafter acting optimally
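A standard way to compute both quantities is value iteration, which repeatedly applies the Bellman update. A minimal sketch, reusing the hypothetical states, actions, T, and R from the MDP encoding above:

```python
def value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Compute V*(s) and Q*(s, a) via the Bellman update:
    V(s) <- max_a sum_{s'} P(s' | s, a) * [R(s, a, s') + gamma * V(s')]"""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                (sum(p * (R(s, a, s2) + gamma * V[s2])
                     for s2, p in T[(s, a)])
                 for a in actions if (s, a) in T),
                default=0.0,           # states with no actions stay at 0
            )
            for s in states
        }
    # Q*(s, a): take a once, then act optimally (i.e., follow V* afterwards).
    Q = {(s, a): sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
         for s in states for a in actions if (s, a) in T}
    return V, Q
```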