#notes#cs471

  • Assume a Markov decision process (MDP):
    • Set of states
    • Set of actions (per state)
    • A transition model T(s, a, s')
    • A reward function R(s, a, s')
  • Still looking for a policy
    • New twist: T or R (or both) is unknown.
    • Must try actions out and learn from experience (a minimal sketch of these MDP components follows this list)
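
A minimal sketch of how these MDP components might be written down as data. The tiny two-state example (the state names "A"/"B", the "stay"/"go" actions, and all the numbers) is purely illustrative, not anything from the course:

```python
# A minimal MDP sketch (illustrative two-state example; names and numbers are made up).
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay"]}

# Transition model T: (s, a) -> {s': probability}
T = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"):   {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
}

# Reward function R: (s, a, s') -> reward
R = {
    ("A", "stay", "A"): 0.0,
    ("A", "go", "A"):   0.0,
    ("A", "go", "B"):   1.0,
    ("B", "stay", "B"): 0.5,
}
```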

Offline (MDPs) vs Online (RL)

  • Online (RL): we don’t know the reward or transition function, unlike offline MDP solving where both are given.
  • Must take potentially bad actions in order to learn.

Model-Based Learning

Model-Based Idea:

  • Learn approximate model based on experiences
  • Solve for values as if the learned model were correct

Step 1: Learn empirical MDP model

  • Count outcomes s’ for each state-action pair (s, a)
  • Normalize the counts to give an estimate of T(s, a, s’)
  • Record the observed rewards to estimate R(s, a, s’) (see the sketch below)
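
A sketch of the counting-and-normalizing step. It assumes experience is available as a list of episodes, each a list of (s, a, r, s_next) tuples; that format and the function name are assumptions made for illustration:

```python
from collections import Counter, defaultdict

def estimate_model(episodes):
    """Estimate T(s, a, s') and R(s, a, s') from observed transitions.

    `episodes` is assumed to be a list of trajectories, each a list of
    (s, a, r, s_next) tuples collected while acting in the environment.
    """
    counts = defaultdict(Counter)      # counts[(s, a)][s'] = times s' was the observed outcome
    reward_sums = defaultdict(float)   # sum of rewards observed for each (s, a, s')

    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            reward_sums[(s, a, s_next)] += r

    T_hat, R_hat = {}, {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T_hat[(s, a, s_next)] = n / total                        # normalize counts
            R_hat[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # average observed reward
    return T_hat, R_hat
```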

Direct Evaluation

Goal: Compute values for each state under the fixed policy π. Idea: Average together observed sample values.

  • Act according to π
  • Every time you visit a state s, write down what the sum of discounted rewards from that point turned out to be
  • Average those samples for each state (a sketch follows this list)
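
A sketch of direct evaluation under the same assumed episode format as above: walk each episode, compute the discounted return from every visit, and average the returns observed from each state:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the observed discounted returns from each visited state.

    `episodes` is assumed to be a list of trajectories of (s, a, r, s_next)
    tuples gathered while acting according to the fixed policy pi.
    """
    returns = defaultdict(list)

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so the return from each visit is built up in one pass.
        for s, a, r, s_next in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)

    # Value estimate for each state = average of the returns observed from it.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```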

Temporal Difference (TD) Learning

  • Big Idea: Learn from every experience

  • Update V(s) each time we experience a transition (s, a, s’, r)

  • Likely outcomes s’ will contribute updates more often.

  • Temporal difference learning of values

    • Policy is still fixed; we are still doing evaluation
    • Move values toward the value of whatever successor actually occurs: a running average (sketched below)
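
A sketch of that TD value update: move V(s) toward the sample r + γ·V(s’) with a running average. The learning rate alpha = 0.1, discount gamma = 0.9, and the example episode are illustrative choices, not values from the course:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD update after experiencing the transition (s, a, r, s').

    sample = r + gamma * V(s')                     # value of whatever successor occurred
    V(s)  <- (1 - alpha) * V(s) + alpha * sample   # running average toward that sample
    """
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample

# Example: update after every observed transition of an (illustrative) episode.
V = {}
episode = [("A", "go", 1.0, "B"), ("B", "stay", 0.5, "B")]
for s, a, r, s_next in episode:
    td_update(V, s, r, s_next)
```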

Exponential Moving Average

  • Forget about past values by weighting them less and less; recent samples count more (see the sketch below)
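
A sketch of the exponential moving average itself, the same running interpolation used by the TD update above; alpha = 0.5 is an arbitrary illustrative choice:

```python
def exponential_moving_average(samples, alpha=0.5):
    """Running interpolation: x_bar <- (1 - alpha) * x_bar + alpha * x.

    A sample seen k updates ago ends up weighted by roughly alpha * (1 - alpha)**k,
    so distant past values are forgotten geometrically while recent ones dominate.
    """
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar
```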