#notes#cs471

Recap

Direct Evaluation
Temporal Difference Learning
  • Big Idea: Learn from every experience using a running average
  • Factor in the value of the result state via a running average (see the sketch below)
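
A minimal sketch of the TD running-average update, assuming a learning rate alpha and discount gamma (parameter names here are illustrative, not from the notes):

```python
# Temporal-difference learning: nudge V(s) toward the sample r + gamma * V(s')
# using a running (exponentially weighted) average.
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    sample = r + gamma * V.get(s_next, 0.0)               # value suggested by this one experience
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running average update
```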

Active Reinforcement Learning

Learner has to choose between exploitation and exploration

  • Learn utilities for state/actions
  • Compute optimal policy for current learned model

How to Explore?

  • Simplest: Random actions (ε-greedy); see the sketch after this list
    • Every time step, flip a coin
      • With (small) probability ε, act randomly
      • With (large) probability 1 − ε, act on the current policy
    • Problem?
      • Will keep acting randomly even once learning is done
      • Can lower ε over time
      • Can use an exploration function
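
A minimal sketch of ε-greedy action selection, assuming Q is a dict of Q-value estimates keyed by (state, action) and actions is the list of legal actions (both hypothetical names, not from the notes):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With small probability epsilon, explore: pick a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: act on the current policy (greedy w.r.t. the learned Q-values).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

To stop acting randomly once learning is done, the caller can decay epsilon over time, e.g. epsilon = epsilon0 / (1 + t).
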
Exploration Functions

When to explore?

  • Random: explore a fixed amount
  • Better Idea: Explore areas whose values have not been established
  • Exploration Function
    • Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n for some k > 0 (sketched below)
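
A sketch of that exploration function (k is a tunable optimism constant; the guard for n = 0 is my addition to handle never-visited entries):

```python
def exploration_f(u, n, k=1.0):
    # Optimistic utility: boost the estimate u when the visit count n is small,
    # so rarely visited state-actions look attractive until their values are established.
    return u + k / max(n, 1)
```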

Q-Learning Properties

  • Q-Learning converges to the optimal policy even if you're acting sub-optimally (off-policy learning)
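
For reference, a sketch of the standard tabular Q-learning update; the max over next actions in the target is what makes it off-policy, so it learns optimal Q-values no matter how the behavior action was chosen (alpha, gamma, and the terminal-state default of 0 are assumptions here):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Target uses the best next action, not the action the agent will actually take.
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    target = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
```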

Regret

Even if you learn the optimal policy, you will make mistakes along the way. Regret: the total mistake cost, i.e. the difference between your (expected) rewards, including youthful suboptimality, and the optimal (expected) rewards.

Linear Value Functions

  • Using a feature representation, can write a Q-function for any state using a few weights (sketched after this list)
  • Advantage: Experience summed up in a few powerful numbers
  • Disadvantage: States that share features may actually be very different in value
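
A sketch of a linear Q-function; feature_fn is a hypothetical feature extractor returning the feature vector f(s, a):

```python
def linear_q(weights, state, action, feature_fn):
    # Q(s, a) = w_1 * f_1(s, a) + ... + w_n * f_n(s, a)
    return sum(w * f for w, f in zip(weights, feature_fn(state, action)))
```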

Approximate Q-Learning

  • On each transition (s, a, r, s'):
    • difference = [r + γ max_a' Q(s', a')] − Q(s, a)
    • Update weights: w_i ← w_i + α · difference · f_i(s, a)
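
A sketch of that weight update, reusing the hypothetical linear_q and feature_fn from above:

```python
def approx_q_update(weights, state, action, reward, next_state, actions,
                    feature_fn, alpha=0.05, gamma=0.9):
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    best_next = max((linear_q(weights, next_state, a2, feature_fn) for a2 in actions),
                    default=0.0)
    difference = (reward + gamma * best_next) - linear_q(weights, state, action, feature_fn)
    # w_i <- w_i + alpha * difference * f_i(s, a): features active in (s, a) get the credit/blame.
    return [w + alpha * difference * f
            for w, f in zip(weights, feature_fn(state, action))]
```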