- Based on Markov decision processes (MDPs)
- At each time step $t$, the agent senses the current state $s_t$,
chooses an action $a_t$, and performs it. The environment responds
with a reward $r_t = r(s_t, a_t)$ and by producing the succeeding
state $s_{t+1} = \delta(s_t, a_t)$.
- The functions $\delta$ and $r$ are part of the environment and are
not necessarily known to the agent. They depend only on the current
state and action.
- We consider only finite sets $S$ and $A$ and deterministic functions
$\delta$ and $r$, although neither restriction is required.
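The setting above can be sketched in code. This is a minimal illustration, not anything from the text: it uses a hypothetical 4-state chain environment, with `delta` and `reward` standing in for the deterministic functions $\delta$ and $r$.

```python
# A deterministic MDP sketch: a hypothetical 4-state chain.
# STATES, ACTIONS, delta, and reward are illustrative names only.

STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]

def delta(s, a):
    """Deterministic transition function: the succeeding state s_{t+1}."""
    if a == "right":
        return min(s + 1, 3)
    return max(s - 1, 0)

def reward(s, a):
    """Deterministic reward function r(s_t, a_t): 1 on reaching state 3."""
    return 1.0 if delta(s, a) == 3 else 0.0

# One interaction step: the agent senses s_t and picks a_t; the
# environment (delta, reward) responds. The agent never inspects
# delta or reward directly -- it only observes r_t and s_{t+1}.
s_t = 0
a_t = "right"
r_t = reward(s_t, a_t)
s_next = delta(s_t, a_t)
```

Keeping `delta` and `reward` as separate functions mirrors the point above: they belong to the environment, and the agent sees only their outputs.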
- Learn a policy $\pi : S \rightarrow A$ that yields the greatest
cumulative reward over time.
- discounted cumulative reward:
$V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots
\equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
- where the sequence of rewards $r_{t+i}$ is generated by beginning at
state $s_t$ and repeatedly using policy $\pi$ to select actions
($a_t = \pi(s_t)$, $a_{t+1} = \pi(s_{t+1})$, ...).
- $\gamma$, with $0 \le \gamma < 1$, is a constant that determines the
relative value of delayed versus immediate rewards: if $\gamma = 0$,
only immediate reward is considered; as $\gamma$ moves closer to 1,
future rewards are given greater emphasis.
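The discounted sum can be approximated by rolling a policy forward for a finite number of steps. A sketch, with `delta`, `reward`, and `policy` as hypothetical stand-ins for the functions named above:

```python
# Sketch: approximate V^pi(s_t) = sum_{i>=0} gamma^i * r_{t+i}
# with a finite rollout. For 0 <= gamma < 1 the truncated tail
# vanishes as the number of steps grows.

def discounted_return(s, policy, delta, reward, gamma=0.9, steps=100):
    """Approximate the discounted cumulative reward from state s."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        a = policy(s)                  # a_{t+i} = pi(s_{t+i})
        total += discount * reward(s, a)
        discount *= gamma              # gamma^i gains one power per step
        s = delta(s, a)                # move to the succeeding state
    return total

# With a constant reward of 1 every step, the infinite sum is
# 1/(1 - gamma); for gamma = 0.9 that is 10.
v = discounted_return(0,
                      policy=lambda s: 0,
                      delta=lambda s, a: s,
                      reward=lambda s, a: 1.0,
                      gamma=0.9)
```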
- finite horizon reward: $\sum_{i=0}^{h} r_{t+i}$
- average reward:
$\lim_{h \rightarrow \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}$
- We will focus only on discounted cumulative reward!
Patricia Riddle
Fri May 15 13:00:36 NZST 1998