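- The functions $\delta(s,a)$ and $r(s,a)$ can be viewed as first producing a probability distribution over outcomes based on $s$ and $a$, and then drawing an outcome at random according to this distribution.
- This is a nondeterministic Markov decision process.
- The deterministic training rule $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$ can still be applied, but is not guaranteed to converge.
- Instead, use a decaying weighted average of the current $\hat{Q}$ and the revised estimate:
  $\hat{Q}_n(s,a) \leftarrow (1-\alpha_n)\,\hat{Q}_{n-1}(s,a) + \alpha_n\,[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')]$, where $\alpha_n = \frac{1}{1 + \mathit{visits}_n(s,a)}$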
- Convergence can take a long time: 1.5 million games in Tesauro's backgammon program
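
The decaying weighted-average update can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the names `q_update` and `run_demo`, the tabular dictionary representation, and the step size 1 / (1 + visits(s, a)) (the usual textbook choice) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One nondeterministic Q-learning update with a decaying step size.

    Q and visits are dictionaries keyed by (state, action); both are
    hypothetical data structures chosen for this sketch.
    """
    visits[(s, a)] += 1
    # Assumed step size: shrinks as the pair (s, a) is visited more often.
    alpha = 1.0 / (1 + visits[(s, a)])
    # Revised estimate from the sampled reward and sampled next state.
    revised = r + gamma * max(Q[(s_next, b)] for b in actions)
    # Weighted average of the current estimate and the revised estimate.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * revised
    return Q[(s, a)]


def run_demo(steps=5000, seed=0):
    """Toy problem: one state, one action, reward 1 or 0 with equal chance.

    With gamma = 0 the estimate should drift toward the expected reward.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(steps):
        reward = 1.0 if rng.random() < 0.5 else 0.0
        q_update(Q, visits, "s", "a", reward, "s", ["a"], gamma=0.0)
    return Q[("s", "a")]
```

Because the step size shrinks with each visit, a single noisy reward moves the estimate less and less over time, which is what lets the estimate settle despite the random outcomes.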
Patricia Riddle
Fri May 15 13:00:36 NZST 1998