#notes#cs471

Example: Grid World

Maze-Like Problem: The agent lives in a grid; walls block the agent's path. Noisy Movement: Actions do not always go as planned

  • Example: 80% of the time, North goes North
  • 10% of the time, North goes West; 10% of the time, North goes East (sampled in the sketch below)
  • Goal: Maximize the sum of rewards
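
A minimal sketch of this noisy movement, assuming a coordinate-tuple state and a set of wall cells (both made up for illustration):

```python
import random

# Headings you end up facing when the intended action slips sideways.
LEFT_OF  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}
DELTA    = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def sample_next_state(state, action, walls, rng=random):
    """Noisy grid-world step: 80% intended direction, 10% each perpendicular.
    A move that would enter a wall leaves the agent in place."""
    r = rng.random()
    if r < 0.8:
        heading = action            # e.g. North actually goes North
    elif r < 0.9:
        heading = LEFT_OF[action]   # e.g. North slips West
    else:
        heading = RIGHT_OF[action]  # e.g. North slips East
    dx, dy = DELTA[heading]
    nxt = (state[0] + dx, state[1] + dy)
    return state if nxt in walls else nxt
```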

Markov Decision Process

An MDP is defined by:

  • Set of states s ∈ S
  • Set of actions a ∈ A
  • Transition function T(s, a, s')
    • Probability that action a from s leads to s', i.e., P(s' | s, a)
  • Reward function R(s, a, s')
    • Sometimes just R(s) or R(s')
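
One way to bundle these pieces in code; the container and field names below are assumptions for illustration, not a standard API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]                                         # set of states S
    actions: Callable[[State], List[Action]]                    # A(s): legal actions in s
    transition: Callable[[State, Action], Dict[State, float]]   # T(s, a) -> {s': P(s' | s, a)}
    reward: Callable[[State, Action, State], float]             # R(s, a, s')
```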

What is Markov about MDPs?

  • “Markov” generally means that given the present state, the future and the past are independent
  • For Markov decision processes, “Markov” means action outcomes depend only on the current state
    • P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)

Policies

  • When the world is deterministic, we want an optimal plan from start to goal
  • For MDPs, we want an optimal policy π*: S → A
    • A policy π gives an action for each state (see the sketch below)
    • An optimal policy maximizes expected utility if followed
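
As a sketch, a deterministic policy can be stored as a plain lookup from state to action; the states and actions here are made up for illustration:

```python
# A policy maps every state to a single action (deterministic policy).
policy = {
    (0, 0): "N",
    (0, 1): "N",
    (0, 2): "E",
    (1, 2): "E",
    (2, 2): "E",
}

def act(policy, state):
    """Follow the policy: return the action prescribed for this state."""
    return policy[state]
```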

Infinite Utilities

Problem: What if the game lasts forever? Solutions:

  • Finite Horizon: Similar to depth-limited search.
    • Terminate episodes after a fixed T steps (e.g., life)
    • Gives nonstationary policies (π depends on the time left)
  • Discounting: use 0 < γ < 1 (worked sketch below)
  • Absorbing State: Guarantee that for every policy, a terminal state will eventually be reached.
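
A small worked sketch of discounting, assuming γ = 0.9 and a constant reward of 1 per step:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over a (finite prefix of a) reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With 0 < gamma < 1, even an endless stream of reward 1 per step stays bounded:
# sum over t of gamma^t = 1 / (1 - gamma) = 10 when gamma = 0.9.
print(discounted_return([1.0] * 1000, gamma=0.9))  # approaches 10.0
```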

Optimal Quantities

The value (utility) of a state s:

  • V*(s) = expected utility starting in s and acting optimally

The value (utility) of a q-state (s, a):

  • Q*(s, a) = expected utility starting out having taken action a from state s and thereafter acting optimally
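
One common way to compute these quantities (not spelled out in these notes) is value iteration; a minimal sketch, assuming the MDP container from the earlier sketch and a discount of γ = 0.9:

```python
def q_value(mdp, s, a, V, gamma=0.9):
    """Q(s, a): expected immediate reward plus discounted value of the successor."""
    return sum(p * (mdp.reward(s, a, s2) + gamma * V[s2])
               for s2, p in mdp.transition(s, a).items())

def value_iteration(mdp, tol=1e-6):
    """Estimate V*(s) for every state by repeatedly taking the best q-value."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            acts = mdp.actions(s)
            if not acts:                 # terminal state: value stays 0
                continue
            best = max(q_value(mdp, s, a, V) for a in acts)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```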