**Sarsa** is an on-policy TD control algorithm:

$$Q\left(S\_{t}, A\_{t}\right) \leftarrow Q\left(S\_{t}, A\_{t}\right) + \alpha\left[R_{t+1} + \gamma{Q}\left(S\_{t+1}, A\_{t+1}\right) - Q\left(S\_{t}, A\_{t}\right)\right] $$

This update is done after every transition from a nonterminal state $S\_{t}$. If $S\_{t+1}$ is terminal, then $Q\left(S\_{t+1}, A\_{t+1}\right)$ is defined as zero.
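A minimal tabular sketch of the update rule may help make the terms concrete. The function name `sarsa_update`, the dictionary-based table, the state and action labels, and the step-size and discount values are illustrative assumptions, not part of the source.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one Sarsa update to the table Q in place and return the TD error."""
    # Q(S_{t+1}, A_{t+1}) is defined as zero when S_{t+1} is terminal.
    next_q = 0.0 if terminal else Q[(s_next, a_next)]
    td_error = r + gamma * next_q - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical two-state example: unseen (state, action) pairs start at 0.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
delta = sarsa_update(Q, "s0", "right", 1.0, "s1", "right", alpha=0.5, gamma=0.9)
print(delta, Q[("s0", "right")])
```

In this example the TD error is 1 + 0.9 · 2 − 0 = 2.8, so the entry for `("s0", "right")` moves halfway towards the target, to about 1.4.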

To design an on-policy control algorithm using Sarsa, we estimate $q\_{\pi}$ for a behaviour policy $\pi$ and then change $\pi$ towards greediness with respect to $q\_{\pi}$.
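One common way to realise this idea is to behave epsilon-greedily with respect to the current estimate, so that the policy being evaluated is also the one drifting towards greediness. The sketch below assumes a hypothetical four-state chain environment; `step`, `epsilon_greedy`, `sarsa_control`, and all hyperparameter values are invented for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")  # hypothetical action set
GOAL = 3                     # hypothetical terminal state of a 4-state chain


def step(state, action):
    """Toy chain dynamics: reaching GOAL yields reward 1 and ends the episode."""
    nxt = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    """Act greedily w.r.t. Q, exploring uniformly with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa_control(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            # The next action comes from the same policy that is being improved.
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q


Q = sarsa_control()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(greedy)
```

Because the next action is sampled from the same epsilon-greedy policy that generates behaviour, the learned values reflect the exploring policy rather than the purely greedy one.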

Source: Sutton and Barto, Reinforcement Learning, 2nd Edition

This method proposes first discretizing observations and calculating the action distribution distance under comparable cases (intersection states).

Playstyle Distance

An Unsupervised Video Game Playstyle Metric via State Discretization

Sarsa

**Retrace** is an off-policy Q-value estimation algorithm with guaranteed convergence for any pair of target and behaviour policies $\left(\pi, \beta\right)$. When TD learning uses off-policy rollouts, the update must be corrected with importance sampling:

$$ \Delta{Q}^{\text{imp}}\left(S\_{t}, A\_{t}\right) = \gamma^{t}\prod\_{1\leq{\tau}\leq{t}}\frac{\pi\left(A\_{\tau}\mid{S\_{\tau}}\right)}{\beta\left(A\_{\tau}\mid{S\_{\tau}}\right)}\delta\_{t} $$

Here $\delta\_{t}$ is the TD error at step $t$. The product of importance ratios can lead to high variance, so Retrace modifies $\Delta{Q}$ to truncate each importance weight at a constant $c$:

$$ \Delta{Q}^{\text{ret}}\left(S\_{t}, A\_{t}\right) = \gamma^{t}\prod\_{1\leq{\tau}\leq{t}}\min\left(c, \frac{\pi\left(A\_{\tau}\mid{S\_{\tau}}\right)}{\beta\left(A\_{\tau}\mid{S\_{\tau}}\right)}\right)\delta\_{t} $$
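The effect of the truncation can be seen in a small numerical sketch. The helper names `truncated_trace` and `retrace_delta`, the probability values, and the choice `c = 1` are assumptions made for illustration; setting `c` to infinity recovers the plain importance-sampling product.

```python
import math


def truncated_trace(pi_probs, beta_probs, c=1.0):
    """Product of min(c, pi/beta) over the visited (state, action) pairs."""
    return math.prod(min(c, p / b) for p, b in zip(pi_probs, beta_probs))


def retrace_delta(pi_probs, beta_probs, td_error, gamma=0.99, c=1.0):
    """gamma^t * prod_{1<=tau<=t} min(c, pi/beta) * delta_t, with t = len(pi_probs)."""
    t = len(pi_probs)
    return (gamma ** t) * truncated_trace(pi_probs, beta_probs, c) * td_error


# Hypothetical probabilities of the taken actions under pi and beta for tau = 1..3.
pi_probs = [0.9, 0.8, 0.5]
beta_probs = [0.3, 0.4, 0.5]

# Untruncated ratios are 3, 2 and 1, so the plain product is 6.
plain = retrace_delta(pi_probs, beta_probs, td_error=0.5, gamma=1.0, c=float("inf"))
# With c = 1 every ratio is capped at 1, so the product is 1.
retrace = retrace_delta(pi_probs, beta_probs, td_error=0.5, gamma=1.0, c=1.0)
print(plain, retrace)
```

With these numbers the untruncated update is roughly 3.0 while the truncated one is 0.5, which shows how capping each ratio keeps a long product from amplifying a single TD error.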

Year	1994
Data Source	CC BY-SA - https://paperswithcode.com