Q-Learning Algorithm
Q(x,a) ← Q(x,a) + b*(r + γ*E(y) - Q(x,a))
x is state, a is action
b is learning rate
r is reward
γ is discount factor, in (0,1)
y is the state reached after taking action a in state x
E(y) is the utility of the state y, computed as E(y) = max(Q(y,a')) over all actions a'
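The update rule can be sketched in Python as follows. This is only an illustrative sketch: the function name q_update, the dictionary-based Q table, and the state and action labels are assumptions, not part of the original notes. Here gamma stands for the discount factor and y for the state reached after taking a in x.

```python
from collections import defaultdict

def q_update(Q, x, a, r, y, actions, b=0.1, gamma=0.9):
    """Apply one Q-learning update to Q in place and return the new Q(x,a)."""
    # E(y): utility of the next state y, the best Q value over all actions
    E_y = max(Q[(y, a2)] for a2 in actions)
    # Move Q(x,a) a fraction b toward the target r + gamma*E(y)
    Q[(x, a)] += b * (r + gamma * E_y - Q[(x, a)])
    return Q[(x, a)]

# Hypothetical example: unseen (state, action) pairs start at 0
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# 0 + 0.5 * (1.0 + 0.9*2.0 - 0) = 1.4
new_value = q_update(Q, "s0", "right", r=1.0, y="s1",
                     actions=actions, b=0.5, gamma=0.9)
```

With b = 0.5 the old estimate and the new target r + gamma*E(y) are averaged equally; a smaller b makes each update more conservative.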
Guaranteed to converge to the optimal Q values, given infinite trials of every state-action pair and a suitably decaying learning rate b
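A small experiment can make the convergence claim concrete. The sketch below is an assumption-laden toy, not from the original notes: a hypothetical deterministic three-state chain (states 0, 1 and a terminal goal 2), two made-up actions STAY and ADVANCE, uniformly random action choice, and a constant b, which is enough here only because the environment is deterministic. A finite run can only approximate the limit.

```python
import random

def run_q_learning(episodes=2000, b=0.5, gamma=0.9, seed=0):
    """Q-learning on a toy chain 0 -> 1 -> 2 (goal) with random exploration."""
    rng = random.Random(seed)
    STAY, ADVANCE = 0, 1
    GOAL = 2
    # Q table initialised to 0; entries for the terminal state are never
    # updated, so E(GOAL) stays 0
    Q = {(x, a): 0.0 for x in (0, 1, GOAL) for a in (STAY, ADVANCE)}
    for _ in range(episodes):
        x = 0
        while x != GOAL:
            a = rng.choice((STAY, ADVANCE))      # try every action repeatedly
            y = x + 1 if a == ADVANCE else x     # deterministic transition
            r = 1.0 if y == GOAL else 0.0        # reward only on reaching the goal
            E_y = max(Q[(y, STAY)], Q[(y, ADVANCE)])
            Q[(x, a)] += b * (r + gamma * E_y - Q[(x, a)])
            x = y
    return Q

Q = run_q_learning()
```

For this toy chain the fixed point of the update rule can be worked out by hand with gamma = 0.9: Q(1,ADVANCE) = 1, Q(0,ADVANCE) = Q(1,STAY) = 0.9, and Q(0,STAY) = 0.81. The learned table should approach these values as the number of episodes grows.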