- based on Markov decision processes (MDPs)
- At each time step t, the agent senses the current state s_t, chooses an action a_t, and performs it. The environment responds with a reward r_t = r(s_t, a_t) and produces the succeeding state s_{t+1} = δ(s_t, a_t).
- The functions δ and r are part of the environment and are not necessarily known to the agent. They also depend only on the current state and action.
- We consider only finite sets of states S and actions A and deterministic functions δ and r, but these restrictions are not required.
- Learn a policy π: S → A that achieves the greatest cumulative reward over time.
- discounted cumulative reward: V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^{∞} γ^i r_{t+i}
- where the sequence of rewards r_t, r_{t+1}, ... is generated by beginning at state s_t and repeatedly using the policy π to select actions (a_t = π(s_t), a_{t+1} = π(s_{t+1}), ...)
- γ, with 0 ≤ γ < 1, is a constant that determines the relative value of delayed versus immediate rewards
- if γ = 0, only immediate reward is considered; as γ moves closer to 1, future rewards are given more emphasis
- finite horizon reward: Σ_{i=0}^{h} r_{t+i}
- average reward: lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
- We will only focus on discounted cumulative reward!
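The definitions above can be sketched as a short program: a minimal deterministic MDP given as tables for δ and r, a fixed policy π, and the discounted cumulative reward accumulated along the trajectory the policy generates. The specific states, actions, and reward values below are invented purely for illustration.

```python
# Hypothetical 3-state deterministic MDP (states and rewards invented for illustration).

# Transition function delta: (state, action) -> succeeding state s_{t+1}.
DELTA = {
    ("s0", "right"): "s1",
    ("s0", "stay"):  "s0",
    ("s1", "right"): "s2",
    ("s1", "stay"):  "s1",
    ("s2", "right"): "s0",
    ("s2", "stay"):  "s2",
}

# Reward function r: (state, action) -> immediate reward r_t.
REWARD = {
    ("s0", "right"): 0.0,
    ("s0", "stay"):  0.0,
    ("s1", "right"): 0.0,
    ("s1", "stay"):  0.0,
    ("s2", "right"): 10.0,
    ("s2", "stay"):  1.0,
}

# A fixed policy pi: state -> action.
PI = {"s0": "right", "s1": "right", "s2": "stay"}

def discounted_return(s, pi, gamma, steps):
    """Roll out `pi` from state `s`, summing gamma^i * r_{t+i} over `steps` steps
    (a truncation of the infinite discounted sum)."""
    total = 0.0
    for i in range(steps):
        a = pi[s]                             # agent chooses action a_t = pi(s_t)
        total += gamma ** i * REWARD[(s, a)]  # environment responds with reward r_t
        s = DELTA[(s, a)]                     # environment produces s_{t+1}
    return total

# gamma = 0: only the immediate reward r(s0, "right") = 0 counts.
print(discounted_return("s0", PI, gamma=0.0, steps=50))  # → 0.0
# gamma closer to 1: the later rewards earned in s2 are given more emphasis.
print(discounted_return("s0", PI, gamma=0.9, steps=50))
```

Note the rollout never needs to know δ and r globally; it only queries them at the current state and action, which matches the assumption that the agent observes rewards and transitions one step at a time.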
Patricia Riddle
Fri May 15 13:00:36 NZST 1998