Nondeterministic Rewards and Actions

Next: Temporal Difference Learning Up: Reinforcement Learning Previous: Updating Sequence

The functions and can be viewed as first producing a probability distribution over outcomes based on and and then drawing an outcome at random according to this distribution - nondeterministic Markov decision process
, but is not guaranteed to converge
decaying weighted average of the current and the revised estimate
, where
convergence long = 1.5 million games in Tesauro's backgammon program

Patricia Riddle
Fri May 15 13:00:36 NZST 1998