- Q-learning need not train on optimal action sequences in order to
converge to the optimal policy
- If all Q values start at zero and reward arrives only at the goal,
then after the first full episode only one entry in the table changes:
the entry for the final transition, the step that received the reward.
If the agent follows the same sequence of actions again, the second
episode updates the entry one step earlier, and so on. So perform the
updates in reverse chronological order instead (see the sketch below):
the reward then propagates through the whole episode in a single pass,
and training converges in fewer iterations, although the agent has to
use more memory to store the entire episode.
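A minimal sketch of the reverse-order replay, assuming a tiny
4-state corridor world, a hypothetical "left"/"right" action set, a
discount factor of 0.9, and a dict-based Q table; none of these
specifics come from the notes themselves.

    from collections import defaultdict

    GAMMA = 0.9                  # discount factor (assumed value)
    ACTIONS = ["left", "right"]  # hypothetical action set

    def update_in_reverse(q, episode):
        # Deterministic Q-learning update, applied from the final
        # transition backwards: Q(s, a) <- r + GAMMA * max_a' Q(s', a')
        for s, a, r, s_next in reversed(episode):
            best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
            q[(s, a)] = r + GAMMA * best_next

    # One episode in the corridor; only the final step is rewarded.
    episode = [(0, "right", 0, 1), (1, "right", 0, 2), (2, "right", 1, 3)]
    q = defaultdict(float)
    update_in_reverse(q, episode)
    print(q[(0, "right")])  # 0.81: the goal reward already reached state 0

Applying the same episode in forward order would leave Q(0, "right")
at zero, since the reward has not yet reached the earlier entries.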
- Another strategy: store past state-action transitions and their
immediate rewards, and retrain on them periodically (a sketch follows
below). Whether this is a real win depends on the relative costs of
acquiring versus replaying experience; a robot acting in the world is
very slow in comparison to replaying stored transitions.
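One way this could look, continuing the sketch above (it reuses q,
GAMMA, and ACTIONS; the buffer structure and batch size are
illustrative assumptions):

    import random

    replay_buffer = []  # stored (state, action, reward, next_state) tuples

    def remember(transition):
        replay_buffer.append(transition)

    def replay(q, batch_size=32):
        # Retrain on a random sample of stored transitions; each
        # replayed transition costs far less than re-executing it.
        batch = random.sample(replay_buffer,
                              min(batch_size, len(replay_buffer)))
        for s, a, r, s_next in batch:
            best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
            q[(s, a)] = r + GAMMA * best_next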
- Many more efficient techniques are possible when the system knows
the state-transition function delta(s, a) and the reward function
r(s, a)
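The notes do not name a specific technique; one example is to compute
Q offline by dynamic programming, sweeping the update over all
state-action pairs instead of waiting for sampled experience. A sketch
over the same assumed corridor world (again reusing GAMMA, ACTIONS,
and defaultdict from above; delta, r, and the sweep count are
hypothetical):

    STATES = [0, 1, 2, 3]

    def delta(s, a):  # known transition function; state 3 is absorbing
        if s == 3:
            return 3
        return s + 1 if a == "right" else max(s - 1, 0)

    def r(s, a):      # known reward function: entering state 3 pays 1
        return 1 if (s, a) == (2, "right") else 0

    q = defaultdict(float)
    for _ in range(50):  # Q(s,a) <- r(s,a) + GAMMA * max_a' Q(delta(s,a), a')
        for s in STATES:
            for a in ACTIONS:
                q[(s, a)] = r(s, a) + GAMMA * max(
                    q[(delta(s, a), a2)] for a2 in ACTIONS)

Because delta and r are known, every sweep updates every entry of the
table, rather than only the entries along one experienced trajectory.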