 -  The previous algorithms make no attempt to estimate the Q value for
    unseen state-action pairs; this is unrealistic in large or infinite
    state spaces, or when the cost of executing actions is high
 -  One approach substitutes an ANN for the table lookup and uses each
    $\hat{Q}(s,a)$ update as a training example (see the sketches after
    this list)
 -  A more successful alternative is to train a separate ANN for each
    action, using the state as input and $\hat{Q}$ as output (also
    sketched below)
 -  Another common alternative is to train one network with the state as
    input and one $\hat{Q}$ output for each action
 -  with a function approximator, however, the convergence theorems no
    longer hold!
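
A minimal sketch of the last alternative (not part of the original notes): a small
numpy MLP takes the state vector as input and produces one $\hat{Q}$ output per
action, and each tabular update $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'}
\hat{Q}(s',a')$ becomes a single training example for the network. The class name,
layer sizes, learning rate, and discount factor are illustrative assumptions.

import numpy as np

class NeuralQ:
    """MLP substitute for the Q table: state in, one Q-hat output per action."""

    def __init__(self, n_state, n_actions, n_hidden=32, lr=0.01, gamma=0.9):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_state, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_actions))
        self.b2 = np.zeros(n_actions)
        self.lr, self.gamma = lr, gamma

    def q_values(self, s):
        """Forward pass: the vector of Q-hat(s, a) estimates, one per action."""
        h = np.tanh(s @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

    def train_on_update(self, s, a, r, s_next, done):
        """Treat one Q-learning update as one training example: push output
        unit a toward the bootstrapped target r + gamma * max_a' Q-hat(s', a')."""
        target = r if done else r + self.gamma * np.max(self.q_values(s_next))
        h = np.tanh(s @ self.W1 + self.b1)        # forward pass for state s
        q = h @ self.W2 + self.b2
        err = q[a] - target                       # gradient of 0.5*(q[a]-target)^2
        # Backpropagate only through the output unit of the chosen action.
        dW2 = np.zeros_like(self.W2); dW2[:, a] = err * h
        db2 = np.zeros_like(self.b2); db2[a] = err
        dpre = (err * self.W2[:, a]) * (1.0 - h ** 2)   # tanh derivative
        dW1 = np.outer(s, dpre)
        db1 = dpre
        for p, g in ((self.W1, dW1), (self.b1, db1),
                     (self.W2, dW2), (self.b2, db2)):
            p -= self.lr * g

The agent would pick actions from q_values (for example epsilon-greedily) and call
train_on_update on every observed (s, a, r, s') transition in place of the tabular
update.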
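
The separate-network-per-action alternative can be sketched by reusing the NeuralQ
class above with a single output per network; the wrapper below is likewise an
illustrative assumption, not a prescribed implementation.

class PerActionQ:
    """One single-output network per action; query every network for Q-hat(s, .)."""

    def __init__(self, n_state, n_actions, **net_kwargs):
        self.gamma = net_kwargs.get("gamma", 0.9)
        self.nets = [NeuralQ(n_state, n_actions=1, **net_kwargs)
                     for _ in range(n_actions)]

    def q_values(self, s):
        return np.array([net.q_values(s)[0] for net in self.nets])

    def train_on_update(self, s, a, r, s_next, done):
        # Compute the bootstrapped target here, then train only action a's network.
        target = r if done else r + self.gamma * np.max(self.q_values(s_next))
        # done=True makes the underlying network use `target` directly as its target.
        self.nets[a].train_on_update(s, 0, target, s_next, done=True)

One motivation for this design is that an update to one action's estimate cannot
directly disturb the others, though selecting the greedy action now requires a
forward pass through every network.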
 
 
Patricia Riddle 
Fri May 15 13:00:36 NZST 1998