- Assume a Markov decision process (MDP):
- Set of states
- Set of actions (per state)
- A model T(s, a, s')
- A reward function R(s, a, s')
- Still looking for a policy
- New Twist: T or R is unknown.
- Must try actions and learn
Offline (MDPs) vs Online (RL)
- Don’t know reward or transition function.
- Must take potentially bad actions to learn
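As a concrete illustration of the online setting, here is a minimal sketch of gathering experience by acting in the environment. The `collect_experience` helper and the `env.reset()` / `env.step()` interface are illustrative assumptions, not part of any particular library; the point is that T and R are never queried directly, the agent only sees sampled transitions.

```python
def collect_experience(env, policy, num_steps=1000):
    """Gather (s, a, r, s') transitions by acting in the environment.

    Assumes `env` exposes reset() -> s and step(a) -> (s_next, r, done),
    and `policy` maps a state to an action. T and R are never accessed
    directly -- only sampled through interaction.
    """
    transitions = []
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)                        # must actually try actions to learn
        s_next, r, done = env.step(a)
        transitions.append((s, a, r, s_next))
        s = env.reset() if done else s_next  # start a new episode when one ends
    return transitions
```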
Model-Based Learning
Model-Based Idea:
- Learn approximate model based on experiences
- Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
- Count outcomes s’ for each s, a
- Normalize to give an estimate of the transition probabilities T̂(s, a, s') (see the sketch below)
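A minimal sketch of Step 1 under the assumption that experience arrives as (s, a, r, s') tuples (e.g., from a collection loop like the one above); the name `estimate_model`, the dictionary layout, and the per-transition reward recording are illustrative choices.

```python
from collections import Counter, defaultdict

def estimate_model(transitions):
    """Estimate T-hat(s, a, s') and R-hat(s, a, s') from observed transitions.

    `transitions` is assumed to be a list of (s, a, r, s_next) tuples
    gathered while acting in the environment.
    """
    counts = defaultdict(Counter)   # (s, a) -> Counter of observed outcomes s'
    R_hat = {}                      # (s, a, s') -> reward seen on that transition

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1      # count outcomes s' for each s, a
        R_hat[(s, a, s_next)] = r        # record the reward when experienced

    # Normalize the counts to estimate the transition probabilities T-hat
    T_hat = {}
    for (s, a), outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        T_hat[(s, a)] = {s_next: n / total
                         for s_next, n in outcome_counts.items()}
    return T_hat, R_hat
```

The learned MDP (T̂, R̂) can then be solved with standard offline methods (e.g., value iteration), as if the learned model were correct.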
Direct Evaluation
Goal: Compute values for each state under π
Idea: Average together observed sample values
- Act according to π
- Every time you visit a state s, write down what the sum of discounted rewards turned out to be
- Average those samples
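A sketch of direct evaluation under these assumptions: each episode is a list of (s, a, r) steps generated by the fixed policy π, and every visit to a state contributes one sample of the discounted return. The function name, the every-visit convention, and the default discount are illustrative.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Estimate V^pi(s) by averaging observed discounted returns.

    `episodes` is assumed to be a list of trajectories, each a list of
    (s, a, r) tuples produced by following the fixed policy pi.
    """
    returns = defaultdict(list)   # s -> list of sampled discounted returns

    for trajectory in episodes:
        G = 0.0
        # Walk backwards so each visit's return is the discounted sum of
        # rewards from that visit to the end of the episode.
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            returns[s].append(G)

    # Average the samples collected for each visited state
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```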
Temporal Difference (TD) Learning
- Big Idea: Learn from every experience
- Update V(s) each time we experience a transition (s, a, r, s')
- Likely outcomes s' will contribute updates more often
- Temporal difference learning of values
- Policy still fixed, still doing evaluation
- Move values towards the value of whatever successor occurs: a running average (see the sketch below)
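The running-average update can be sketched as follows; `td_update`, the dict-based value table, and the default learning rate alpha and discount gamma are illustrative assumptions.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the value estimate for state s.

    V maps states to current value estimates; alpha is the learning rate
    and gamma the discount (both example choices).
    """
    # Sample of V(s) from this one transition: r + gamma * V(s')
    sample = r + gamma * V.get(s_next, 0.0)
    # Move V(s) towards the sample: a running (exponential) average.
    # Equivalently: V(s) <- V(s) + alpha * (sample - V(s))
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```

Applied to every transition in a stream of experience, this converges to the values of the fixed policy without ever building T or R explicitly.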
Exponential Moving Average
- Forget about past values (factor them in less)
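Written out, the running interpolation that both the TD update and the exponential moving average use is

$$\bar{x}_n = (1 - \alpha)\,\bar{x}_{n-1} + \alpha\, x_n,$$

so a sample seen k steps ago carries weight α(1 − α)^k: recent samples matter most, and older ones are gradually forgotten.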