Example: Grid World
- Maze-Like Problem: The agent lives in a grid; walls block the agent's path.
- Noisy Movement: Actions do not always go as planned (see the sketch after this list).
  - Example: 80% of the time, the action North takes the agent North.
  - 10% of the time, North takes the agent West; 10% of the time, it takes the agent East.
- Goal: Maximize the sum of rewards.
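As a concrete illustration, here is a minimal sketch of the 80/10/10 noise model in Python. The direction names and the helper function are hypothetical, not from the original slides:

```python
import random

# Noise model for grid-world movement: the intended direction succeeds
# 80% of the time; otherwise the agent slips to a perpendicular direction.
NOISE = {
    "N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
    "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
    "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
    "W": [("W", 0.8), ("S", 0.1), ("N", 0.1)],
}

def sample_outcome(intended):
    """Sample the direction the agent actually moves in."""
    directions, probs = zip(*NOISE[intended])
    return random.choices(directions, weights=probs)[0]
```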
Markov Decision Process
An MDP is defined by (a concrete encoding is sketched after this list):
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s′)
  - Probability that a from s leads to s′, i.e., T(s, a, s′) = P(s′ | s, a)
- A reward function R(s, a, s′)
  - Sometimes just R(s) or R(s′)
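One minimal way to write the definition down in code — a hypothetical racing-style toy MDP, not the grid world above. T maps (s, a) to a list of (s′, probability) pairs, so P(s′ | s, a) is just a lookup:

```python
# States S, actions A, transition function T, and reward function R.
states = ["cool", "warm", "overheated"]   # hypothetical toy states
actions = ["slow", "fast"]

# T[(s, a)] lists the possible next states s' with their probabilities.
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}

def R(s, a, s_next):
    """Reward for taking action a in state s and landing in s_next."""
    return -10.0 if s_next == "overheated" else (2.0 if a == "fast" else 1.0)
```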
What is Markov about MDPs?
- “Markov” generally means that given the present state, the future and the past are independent
- For Markov decision processes, “Markov” means action outcomes depend only on the current state (the sketch after this list spells this out in code):
- P(S_{t+1} = s′ | S_t = s_t, A_t = a_t, S_{t−1} = s_{t−1}, A_{t−1} = a_{t−1}, …, S_0 = s_0) = P(S_{t+1} = s′ | S_t = s_t, A_t = a_t)
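The property can be read off a simulator's signature: even if we hand it the whole history, a Markov model only ever consults the most recent state. A sketch reusing the T encoding from the MDP section:

```python
import random

def sample_next_state(history, action, T):
    """history = [s_0, s_1, ..., s_t].  Under the Markov assumption the
    distribution of s_{t+1} depends only on the last state and the action;
    everything earlier in the history is ignored."""
    s_t = history[-1]
    next_states, probs = zip(*T[(s_t, action)])
    return random.choices(next_states, weights=probs)[0]
```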
Policies
- In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to goal
- For MDPs, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy is one that maximizes expected utility if followed (the rollout sketch after this list estimates that expected utility)
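A policy is just a state-to-action mapping, and its expected utility can be estimated by averaging sampled rollouts. A sketch reusing T and R from the MDP encoding above; the policy entries, discount, and horizon are assumptions for illustration:

```python
import random

policy = {"cool": "fast", "warm": "slow"}  # hypothetical policy pi: S -> A

def rollout(policy, T, R, s0, gamma=0.9, horizon=100):
    """Discounted sum of rewards from one episode that follows the policy."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy.get(s)
        if a is None or (s, a) not in T:   # treat missing entries as terminal
            break
        next_states, probs = zip(*T[(s, a)])
        s_next = random.choices(next_states, weights=probs)[0]
        total += discount * R(s, a, s_next)
        discount *= gamma
        s = s_next
    return total

# Monte Carlo estimate of the policy's expected utility from state "cool":
estimate = sum(rollout(policy, T, R, "cool") for _ in range(10_000)) / 10_000
```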
Infinite Utilities
Problem: What if the game lasts forever? Solutions:
- Finite Horizon: similar to depth-limited search
  - Terminate episodes after a fixed number of steps T (e.g., life)
  - Gives nonstationary policies (π depends on the time left)
- Discounting: use 0 < γ < 1 (the sketch after this list shows why this keeps utilities finite)
- Absorbing State: guarantee that for every policy, a terminal state will eventually be reached
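Why discounting works: with 0 < γ < 1 and rewards bounded by R_max, the discounted sum is a geometric series bounded by R_max / (1 − γ). A quick numeric check; the specific γ and R_max values are illustrative:

```python
gamma, R_max = 0.9, 1.0

bound = R_max / (1 - gamma)            # geometric-series bound: 10.0

# Even an infinite stream of maximal rewards cannot exceed the bound;
# 1000 terms already get within floating-point distance of it.
partial = sum(gamma**t * R_max for t in range(1000))
print(f"{partial:.6f} <= {bound}")     # 9.999... <= 10.0
```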
Optimal Quantities
The value (utility) of a state s:
- V*(s) = expected utility starting in s and acting optimally

The value (utility) of a state-action pair (s, a):
- Q*(s, a) = expected utility starting out having taken action a from state s and thereafter acting optimally
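A standard way to compute both quantities is value iteration, which repeatedly applies the Bellman update. A minimal sketch, reusing the hypothetical states, actions, T, and R from the MDP encoding above:

```python
def value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Compute V*(s) and Q*(s, a) via the Bellman update:
    V(s) <- max_a sum_{s'} P(s' | s, a) * [R(s, a, s') + gamma * V(s')]"""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                (sum(p * (R(s, a, s2) + gamma * V[s2])
                     for s2, p in T[(s, a)])
                 for a in actions if (s, a) in T),
                default=0.0,           # states with no actions stay at 0
            )
            for s in states
        }
    # Q*(s, a): take a once, then act optimally (i.e., follow V* afterwards).
    Q = {(s, a): sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
         for s in states for a in actions if (s, a) in T}
    return V, Q
```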