What is: Q-Learning?

Year: 1984
Data Source: CC BY-SA - https://paperswithcode.com

Q-Learning is an off-policy temporal difference control algorithm:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$

The learned action-value function $Q$ directly approximates $q_{*}$, the optimal action-value function, independent of the policy being followed.
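To make the update rule concrete, here is a minimal tabular sketch in Python. It is illustrative only and not taken from the source: the function names `q_update` and `train_chain`, the five-state chain environment, and the values of `alpha`, `gamma`, and `seed` are all assumptions chosen for the example. The behavior policy is deliberately uniformly random, to illustrate that the learned values can still approach the optimal ones when the agent does not act greedily.

```python
import random

def q_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one Q-learning update to Q[s][a] in place and return the new value."""
    # Bootstrap from the greedy (max) value of the next state; use 0 if the
    # episode ended, since no further reward can follow a terminal state.
    best_next = 0.0 if terminal else max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

def train_chain(n_states=5, episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q on a hypothetical chain: action 0 moves left, action 1 moves right.

    The episode starts in state 0 and ends on reaching the last state,
    which yields a reward of 1. All other transitions yield 0.
    """
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.randrange(2)  # behavior policy: uniformly random
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0
            q_update(Q, s, a, r, s_next, alpha, gamma, terminal=(s_next == goal))
            s = s_next
    return Q

Q = train_chain()
# Greedy action in each non-terminal state, read off the learned table.
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(len(Q) - 1)]
print(greedy)
```

In this toy setting the optimal value of moving right from state 0 is $\gamma^{3} = 0.729$, so `Q[0][1]` should end up close to that number, and the greedy policy read from the table should move right in every non-terminal state even though the agent never followed that policy during training.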

Source: Sutton and Barto, Reinforcement Learning, 2nd Edition