- the previous algorithms make no attempt to estimate the Q value for
unseen state-action pairs; this is unrealistic in large or infinite
state-action spaces, or when the cost of executing actions is high
- one remedy is to substitute an ANN for the table lookup and use each
update to the estimated Q value as a training example for the network
- a more successful alternative is to train a separate ANN for
each action, using the state as input and the estimated Q value for
that action as output
- another common alternative is to train one network with the state as
input and one Q-value output for each action (sketched below)
- once an ANN replaces the table, the convergence theorems for Q learning no longer hold!
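As a rough illustration of the last two points, here is a minimal sketch (not from the original notes) of Q learning with a small network in place of the table: one network maps the state to a vector of Q values, one output per action, and each one-step backup target r + gamma * max_a' Q(s',a') is treated as a single training example. The toy chain environment, network sizes, and learning rates are illustrative assumptions; the separate-network-per-action variant would simply use one such network (with a single output) for each action.

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, HIDDEN = 6, 2, 16   # toy chain world: actions move left/right
GAMMA, ALPHA, EPSILON = 0.9, 0.01, 0.1   # illustrative hyperparameters

# two-layer network: state (one-hot) -> hidden (tanh) -> one Q value per action
W1 = rng.normal(scale=0.1, size=(HIDDEN, N_STATES))
W2 = rng.normal(scale=0.1, size=(N_ACTIONS, HIDDEN))

def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def q_values(s):
    """Forward pass: vector of estimated Q(s, a) for all actions a."""
    h = np.tanh(W1 @ one_hot(s))
    return W2 @ h, h

def train_step(s, a, r, s_next, done):
    """Use one Q-learning backup as a single supervised training example."""
    global W1, W2
    q, h = q_values(s)
    q_next, _ = q_values(s_next)
    target = r if done else r + GAMMA * np.max(q_next)
    # squared error on the chosen action's output only; backprop done by hand
    err = q[a] - target                        # d(loss)/d(q[a])
    dW2 = np.zeros_like(W2)
    dW2[a] = err * h
    dh = err * W2[a]
    dW1 = np.outer(dh * (1.0 - h ** 2), one_hot(s))
    W2 -= ALPHA * dW2
    W1 -= ALPHA * dW1

# toy episode loop: reward 1 for reaching the rightmost state
for episode in range(500):
    s = 0
    for t in range(50):
        q, _ = q_values(s)
        a = rng.integers(N_ACTIONS) if rng.random() < EPSILON else int(np.argmax(q))
        s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        done = s_next == N_STATES - 1
        train_step(s, a, float(done), s_next, done)
        s = s_next
        if done:
            break

Because each update chases a target computed from the network's own (changing) estimates, this is exactly the setting in which the table-based convergence guarantees no longer apply.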
Patricia Riddle
Fri May 15 13:00:36 NZST 1998