Next: Temporal Difference Learning
Up: Reinforcement Learning
Previous: Updating Sequence
- The functions and can be viewed as
first producing a probability distribution over outcomes based on
and and then drawing an outcome at random according to
this distribution - nondeterministic Markov decision process
- , but is not guaranteed to converge
- decaying weighted average of the current and the
revised estimate
- , where
- convergence long = 1.5 million games in Tesauro's backgammon program
Patricia Riddle
Fri May 15 13:00:36 NZST 1998