Next: Temporal Difference Learning
Up: Reinforcement Learning
Previous: Updating Sequence
- The functions
and
can be viewed as
first producing a probability distribution over outcomes based on
and
and then drawing an outcome at random according to
this distribution - nondeterministic Markov decision process -
, but is not guaranteed to converge - decaying weighted average of the current
and the
revised estimate -
, where
- convergence long = 1.5 million games in Tesauro's backgammon program
Patricia Jean Riddle
Wed Jun 23 13:06:34 NZST 1999