- Assume a Markov decision process (MDP):
- Set of states
- Set of actions (per state)
- A model T(s, a, s')
- A reward function R(s, a, s')
- Still looking for a policy
- New Twist: T or R is unknown.
- Must try actions and learn
Offline (MDPs) vs Online (RL)
- Don’t know reward or transition function.
- Must take potentially bad actions to learn
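As a concrete illustration of the online setting, here is a minimal sketch of gathering experience by acting in the environment. The `collect_experience` helper and the `env.reset()` / `env.step()` interface are illustrative assumptions, not part of any particular library; the point is that T and R are never queried directly, the agent only sees sampled transitions.

```python
def collect_experience(env, policy, num_steps=1000):
    """Gather (s, a, r, s') transitions by acting in the environment.

    Assumes `env` exposes reset() -> s and step(a) -> (s_next, r, done),
    and `policy` maps a state to an action. T and R are never accessed
    directly -- only sampled through interaction.
    """
    transitions = []
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)                        # must actually try actions to learn
        s_next, r, done = env.step(a)
        transitions.append((s, a, r, s_next))
        s = env.reset() if done else s_next  # start a new episode when one ends
    return transitions
```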
Model-Based Learning
Model-Based Idea:
- Learn approximate model based on experiences
- Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
- Count outcomes s’ for each s, a
- Normalize to give an estimate of the transition probabilities T̂(s, a, s') (see the sketch below)
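A minimal sketch of Step 1 under the assumption that experience arrives as (s, a, r, s') tuples (e.g., from a collection loop like the one above); the name `estimate_model`, the dictionary layout, and the per-transition reward recording are illustrative choices.

```python
from collections import Counter, defaultdict

def estimate_model(transitions):
    """Estimate T-hat(s, a, s') and R-hat(s, a, s') from observed transitions.

    `transitions` is assumed to be a list of (s, a, r, s_next) tuples
    gathered while acting in the environment.
    """
    counts = defaultdict(Counter)   # (s, a) -> Counter of observed outcomes s'
    R_hat = {}                      # (s, a, s') -> reward seen on that transition

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1      # count outcomes s' for each s, a
        R_hat[(s, a, s_next)] = r        # record the reward when experienced

    # Normalize the counts to estimate the transition probabilities T-hat
    T_hat = {}
    for (s, a), outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        T_hat[(s, a)] = {s_next: n / total
                         for s_next, n in outcome_counts.items()}
    return T_hat, R_hat
```

The learned MDP (T̂, R̂) can then be solved with standard offline methods (e.g., value iteration), as if the learned model were correct.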
Direct Evaluation
Goal: Compute values for each state under π
Idea: Average together observed sample values
- Act according to π
- Every time you visit a state s, write down what the sum of discounted rewards turned out to be
- Average those samples
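A sketch of direct evaluation under these assumptions: each episode is a list of (s, a, r) steps generated by the fixed policy π, and every visit to a state contributes one sample of the discounted return. The function name, the every-visit convention, and the default discount are illustrative.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Estimate V^pi(s) by averaging observed discounted returns.

    `episodes` is assumed to be a list of trajectories, each a list of
    (s, a, r) tuples produced by following the fixed policy pi.
    """
    returns = defaultdict(list)   # s -> list of sampled discounted returns

    for trajectory in episodes:
        G = 0.0
        # Walk backwards so each visit's return is the discounted sum of
        # rewards from that visit to the end of the episode.
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            returns[s].append(G)

    # Average the samples collected for each visited state
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```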
Temporal Difference (TD) Learning
- Big Idea: Learn from every experience
- Update V(s) each time we experience a transition (s, a, r, s')
- Likely outcomes s' will contribute updates more often
- Temporal difference learning of values
- Policy still fixed, still doing evaluation
- Move values towards the value of whatever successor occurs: a running average (see the sketch below)
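The running-average update can be sketched as follows; `td_update`, the dict-based value table, and the default learning rate alpha and discount gamma are illustrative assumptions.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the value estimate for state s.

    V maps states to current value estimates; alpha is the learning rate
    and gamma the discount (both example choices).
    """
    # Sample of V(s) from this one transition: r + gamma * V(s')
    sample = r + gamma * V.get(s_next, 0.0)
    # Move V(s) towards the sample: a running (exponential) average.
    # Equivalently: V(s) <- V(s) + alpha * (sample - V(s))
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```

Applied to every transition in a stream of experience, this converges to the values of the fixed policy without ever building T or R explicitly.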
Exponential Moving Average
- Forget about past values (factor them in less)
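Written out, the running interpolation that both the TD update and the exponential moving average use is

$$\bar{x}_n = (1 - \alpha)\,\bar{x}_{n-1} + \alpha\, x_n,$$

so a sample seen k steps ago carries weight α(1 − α)^k: recent samples matter most, and older ones are gradually forgotten.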