- Q-learning need not train on optimal action sequences in order to
converge to the optimal policy
- If all Q values start at zero and reward arrives only at the goal,
then after the first full episode only one entry in the table changes:
the entry for the final transition, the step that received the reward.
If the agent follows the same sequence of actions again, the second
episode updates the entry one step earlier, and so on. So perform the
updates in reverse chronological order instead (see the sketch below):
the reward then propagates through the whole episode in a single pass,
and training converges in fewer iterations, although the agent has to
use more memory to store the entire episode.
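A minimal sketch of the reverse-order replay, assuming a tiny
4-state corridor world, a hypothetical "left"/"right" action set, a
discount factor of 0.9, and a dict-based Q table; none of these
specifics come from the notes themselves.

    from collections import defaultdict

    GAMMA = 0.9                  # discount factor (assumed value)
    ACTIONS = ["left", "right"]  # hypothetical action set

    def update_in_reverse(q, episode):
        # Deterministic Q-learning update, applied from the final
        # transition backwards: Q(s, a) <- r + GAMMA * max_a' Q(s', a')
        for s, a, r, s_next in reversed(episode):
            best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
            q[(s, a)] = r + GAMMA * best_next

    # One episode in the corridor; only the final step is rewarded.
    episode = [(0, "right", 0, 1), (1, "right", 0, 2), (2, "right", 1, 3)]
    q = defaultdict(float)
    update_in_reverse(q, episode)
    print(q[(0, "right")])  # 0.81: the goal reward already reached state 0

Applying the same episode in forward order would leave Q(0, "right")
at zero, since the reward has not yet reached the earlier entries.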
- Another strategy: store past state-action transitions and their
immediate rewards, and retrain on them periodically (a sketch follows
below). Whether this is a real win depends on the relative costs of
acquiring versus replaying experience; a robot acting in the world is
very slow in comparison to replaying stored transitions.
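One way this could look, continuing the sketch above (it reuses q,
GAMMA, and ACTIONS; the buffer structure and batch size are
illustrative assumptions):

    import random

    replay_buffer = []  # stored (state, action, reward, next_state) tuples

    def remember(transition):
        replay_buffer.append(transition)

    def replay(q, batch_size=32):
        # Retrain on a random sample of stored transitions; each
        # replayed transition costs far less than re-executing it.
        batch = random.sample(replay_buffer,
                              min(batch_size, len(replay_buffer)))
        for s, a, r, s_next in batch:
            best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
            q[(s, a)] = r + GAMMA * best_next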
- Many more efficient techniques are possible when the system knows
the state-transition function delta(s, a) and the reward function
r(s, a)
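The notes do not name a specific technique; one example is to compute
Q offline by dynamic programming, sweeping the update over all
state-action pairs instead of waiting for sampled experience. A sketch
over the same assumed corridor world (again reusing GAMMA, ACTIONS,
and defaultdict from above; delta, r, and the sweep count are
hypothetical):

    STATES = [0, 1, 2, 3]

    def delta(s, a):  # known transition function; state 3 is absorbing
        if s == 3:
            return 3
        return s + 1 if a == "right" else max(s - 1, 0)

    def r(s, a):      # known reward function: entering state 3 pays 1
        return 1 if (s, a) == (2, "right") else 0

    q = defaultdict(float)
    for _ in range(50):  # Q(s,a) <- r(s,a) + GAMMA * max_a' Q(delta(s,a), a')
        for s in STATES:
            for a in ACTIONS:
                q[(s, a)] = r(s, a) + GAMMA * max(
                    q[(delta(s, a), a2)] for a2 in ACTIONS)

Because delta and r are known, every sweep updates every entry of the
table, rather than only the entries along one experienced trajectory.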