- the previous algorithms make no attempt to estimate the Q value for
unseen state-action pairs; this is unrealistic in large or infinite
state-action spaces, or when the cost of executing actions is high
- one remedy is to substitute an ANN for the table lookup and use each
update to the estimated Q value as a training example for the network
- a more successful alternative is to train a separate ANN for
each action, using the state as input and the estimated Q value for
that action as output
- another common alternative is to train one network with the state as
input and one Q-value output for each action (sketched below)
- once an ANN replaces the table, the convergence theorems for Q learning no longer hold!
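As a rough illustration of the last two points, here is a minimal sketch (not from the original notes) of Q learning with a small network in place of the table: one network maps the state to a vector of Q values, one output per action, and each one-step backup target r + gamma * max_a' Q(s',a') is treated as a single training example. The toy chain environment, network sizes, and learning rates are illustrative assumptions; the separate-network-per-action variant would simply use one such network (with a single output) for each action.

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, HIDDEN = 6, 2, 16   # toy chain world: actions move left/right
GAMMA, ALPHA, EPSILON = 0.9, 0.01, 0.1   # illustrative hyperparameters

# two-layer network: state (one-hot) -> hidden (tanh) -> one Q value per action
W1 = rng.normal(scale=0.1, size=(HIDDEN, N_STATES))
W2 = rng.normal(scale=0.1, size=(N_ACTIONS, HIDDEN))

def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def q_values(s):
    """Forward pass: vector of estimated Q(s, a) for all actions a."""
    h = np.tanh(W1 @ one_hot(s))
    return W2 @ h, h

def train_step(s, a, r, s_next, done):
    """Use one Q-learning backup as a single supervised training example."""
    global W1, W2
    q, h = q_values(s)
    q_next, _ = q_values(s_next)
    target = r if done else r + GAMMA * np.max(q_next)
    # squared error on the chosen action's output only; backprop done by hand
    err = q[a] - target                        # d(loss)/d(q[a])
    dW2 = np.zeros_like(W2)
    dW2[a] = err * h
    dh = err * W2[a]
    dW1 = np.outer(dh * (1.0 - h ** 2), one_hot(s))
    W2 -= ALPHA * dW2
    W1 -= ALPHA * dW1

# toy episode loop: reward 1 for reaching the rightmost state
for episode in range(500):
    s = 0
    for t in range(50):
        q, _ = q_values(s)
        a = rng.integers(N_ACTIONS) if rng.random() < EPSILON else int(np.argmax(q))
        s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        done = s_next == N_STATES - 1
        train_step(s, a, float(done), s_next, done)
        s = s_next
        if done:
            break

Because each update chases a target computed from the network's own (changing) estimates, this is exactly the setting in which the table-based convergence guarantees no longer apply.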
Patricia Riddle
Fri May 15 13:00:36 NZST 1998