Policy Evaluation
Fixed Policies
- Expectimax trees take a max over all actions to compute the optimal values
- If we fix some policy π(s), the tree becomes simpler: only one action per state.
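- Recursive relation (one-step lookahead / Bellman equation), in the usual MDP notation with transition function T, reward R and discount γ: V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]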
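The fixed-policy lookahead can be sketched in code as follows. This is only an illustrative sketch: the two-state MDP, the names `MDP` and `evaluate_policy`, and the values of `gamma` and `tol` are assumptions made up for this example, not part of the notes.

```python
# Hypothetical two-state MDP, invented purely for illustration.
# MDP[state][action] is a list of (next_state, probability, reward) triples.
MDP = {
    "cool": {
        "slow": [("cool", 1.0, 1.0)],
        "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    },
    "warm": {
        "slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
        "fast": [("warm", 1.0, -10.0)],
    },
}


def evaluate_policy(mdp, policy, gamma=0.9, tol=1e-8):
    """Iterate the fixed-policy one-step lookahead until the values stop changing."""
    values = {s: 0.0 for s in mdp}
    while True:
        # Only the action chosen by the policy is considered: no max over actions.
        new_values = {
            s: sum(p * (r + gamma * values[s2]) for s2, p, r in mdp[s][policy[s]])
            for s in mdp
        }
        delta = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        if delta < tol:
            return values


always_slow = {"cool": "slow", "warm": "slow"}
print(evaluate_policy(MDP, always_slow))  # both values approach 10.0
```

Because the policy fixes one action per state, each pass costs one sum per state rather than one sum per state-action pair.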
Value Iteration
- Each iteration updates both the values and (implicitly) the policy
- We don't track the policy explicitly, but taking the max over actions implicitly recomputes it
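A minimal value iteration sketch shows how the max over actions stands in for an explicit policy. The two-state MDP and the helper names (`MDP`, `q_value`, `value_iteration`, `extract_policy`) are hypothetical and chosen only for illustration; the policy is read off once at the end rather than stored during the updates.

```python
# Hypothetical two-state MDP, invented purely for illustration.
# MDP[state][action] is a list of (next_state, probability, reward) triples.
MDP = {
    "cool": {
        "slow": [("cool", 1.0, 1.0)],
        "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    },
    "warm": {
        "slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
        "fast": [("warm", 1.0, -10.0)],
    },
}


def q_value(mdp, values, state, action, gamma):
    """Expected return of taking `action` in `state`, then following `values`."""
    return sum(p * (r + gamma * values[s2]) for s2, p, r in mdp[state][action])


def value_iteration(mdp, gamma=0.9, tol=1e-8):
    """Repeat the max-over-actions update until the values stop changing."""
    values = {s: 0.0 for s in mdp}
    while True:
        # The max implicitly picks the current best action; no policy is stored.
        new_values = {
            s: max(q_value(mdp, values, s, a, gamma) for a in mdp[s]) for s in mdp
        }
        delta = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        if delta < tol:
            return values


def extract_policy(mdp, values, gamma=0.9):
    """Recover the greedy policy from converged values with one lookahead."""
    return {
        s: max(mdp[s], key=lambda a: q_value(mdp, values, s, a, gamma)) for s in mdp
    }


optimal_values = value_iteration(MDP)
print(optimal_values)  # approaches {"cool": 15.5, "warm": 14.5}
print(extract_policy(MDP, optimal_values))  # {'cool': 'fast', 'warm': 'slow'}
```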
Policy Iteration
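- Several passes update the utilities under a fixed policy (each pass is cheap because only one action per state is considered)
- After the policy is evaluated, a new policy is chosen by one-step lookahead on the evaluated utilities.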
- The new policy will be better (if not, we are done)
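The evaluate-then-improve loop described above might look like the sketch below. The MDP, the function names, the starting policy, and the small `1e-9` improvement margin are all assumptions made for this example; the margin is there only so that numerical noise in the evaluated utilities is not mistaken for an improvement.

```python
# Hypothetical two-state MDP, invented purely for illustration.
# MDP[state][action] is a list of (next_state, probability, reward) triples.
MDP = {
    "cool": {
        "slow": [("cool", 1.0, 1.0)],
        "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    },
    "warm": {
        "slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
        "fast": [("warm", 1.0, -10.0)],
    },
}


def q_value(mdp, values, state, action, gamma):
    """Expected return of taking `action` in `state`, then following `values`."""
    return sum(p * (r + gamma * values[s2]) for s2, p, r in mdp[state][action])


def evaluate_policy(mdp, policy, gamma, tol=1e-8):
    """Several passes that update the utilities with the policy held fixed."""
    values = {s: 0.0 for s in mdp}
    while True:
        new_values = {s: q_value(mdp, values, s, policy[s], gamma) for s in mdp}
        delta = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        if delta < tol:
            return values


def policy_iteration(mdp, gamma=0.9):
    """Alternate evaluation and improvement until the policy stops improving."""
    # Arbitrary starting policy: the first listed action in every state.
    policy = {s: next(iter(mdp[s])) for s in mdp}
    while True:
        values = evaluate_policy(mdp, policy, gamma)
        improved = False
        for s in mdp:
            best = max(mdp[s], key=lambda a: q_value(mdp, values, s, a, gamma))
            gain = q_value(mdp, values, s, best, gamma) - values[s]
            if best != policy[s] and gain > 1e-9:
                policy[s] = best
                improved = True
        if not improved:
            # No state has a strictly better action, so we are done.
            return policy, values


final_policy, final_values = policy_iteration(MDP)
print(final_policy)  # {'cool': 'fast', 'warm': 'slow'}
```

On this toy MDP the loop stops after two evaluations: the first improvement switches the "cool" state to "fast", and the second finds nothing better.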