
What is: Retrace?

Source: Safe and Efficient Off-Policy Reinforcement Learning
Year: 2016
Data Source: CC BY-SA - https://paperswithcode.com

Retrace is an off-policy Q-value estimation algorithm with guaranteed convergence for any pair of target and behaviour policies $(\pi, \beta)$. With off-policy rollouts for TD learning, the update must be corrected with importance sampling:

$$\Delta Q^{\text{imp}}(S_t, A_t) = \gamma^{t} \prod_{1 \leq \tau \leq t} \frac{\pi(A_\tau \mid S_\tau)}{\beta(A_\tau \mid S_\tau)} \, \delta_t,$$

where $\delta_t$ is the temporal-difference error at step $t$.
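As a concrete illustration, here is a minimal NumPy sketch of this importance-sampled update for a single recorded trajectory. The function name and the arrays `pi_probs`, `beta_probs`, and `deltas` are illustrative assumptions, not names from the paper:

```python
import numpy as np

def importance_sampled_updates(pi_probs, beta_probs, deltas, gamma):
    """Off-policy TD corrections Delta Q^imp(S_t, A_t) along one trajectory.

    pi_probs[t]   -- pi(A_t | S_t), target-policy probability of the action taken
    beta_probs[t] -- beta(A_t | S_t), behaviour-policy probability of that action
    deltas[t]     -- TD error delta_t at step t
    """
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(beta_probs, dtype=float)
    # The product in the formula runs over 1 <= tau <= t, so the weight at
    # t = 0 is the empty product (1), and each later step multiplies in one
    # more ratio. This growing product is the high-variance term in the text.
    weights = np.ones_like(ratios)
    weights[1:] = np.cumprod(ratios[1:])
    t = np.arange(len(ratios))
    return (gamma ** t) * weights * np.asarray(deltas, dtype=float)
```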

This product of importance weights can grow without bound, which leads to very high variance. Retrace therefore modifies $\Delta Q$ by truncating each importance weight at a constant $c$:

$$\Delta Q^{\text{ret}}(S_t, A_t) = \gamma^{t} \prod_{1 \leq \tau \leq t} \min\left(c, \frac{\pi(A_\tau \mid S_\tau)}{\beta(A_\tau \mid S_\tau)}\right) \delta_t$$
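The truncated update is the same sketch with each ratio clipped at `c` (again, the function and argument names are illustrative assumptions):

```python
import numpy as np

def retrace_updates(pi_probs, beta_probs, deltas, gamma, c=1.0):
    """Like importance_sampled_updates, but each importance weight is
    clipped at c before entering the product, bounding its variance."""
    ratios = np.minimum(
        c, np.asarray(pi_probs, dtype=float) / np.asarray(beta_probs, dtype=float)
    )
    weights = np.ones_like(ratios)
    weights[1:] = np.cumprod(ratios[1:])
    t = np.arange(len(ratios))
    return (gamma ** t) * weights * np.asarray(deltas, dtype=float)
```

With $c = 1$ each factor is at most $1$, so the product can never explode; this is the clipping behind the trace coefficients of Retrace($\lambda$), where it is additionally scaled by a factor $\lambda$.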