Next: Temporal Difference Learning
Up: Reinforcement Learning
 Previous: Updating Sequence
 
-  The functions  
  and  
  can be viewed as
first producing a probability distribution over outcomes based on
 
  and  
  and then drawing an outcome at random according to
this distribution - nondeterministic Markov decision process -   
 , but is not guaranteed to converge -  decaying weighted average of the current  
  and the
revised estimate -   
 , where
 
  -  convergence long = 1.5 million games in Tesauro's backgammon program
 
 
Patricia Riddle 
Fri May 15 13:00:36 NZST 1998