
What is: V-trace?

Source: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

V-trace is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $(x_t, a_t, r_t)_{t=s}^{t=s+n}$ generated by the actor following some policy $\mu$. We can define the $n$-step V-trace target for $V(x_s)$, our value approximation at state $x_s$, as:

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V$$

where $\delta_t V = \rho_t \left( r_t + \gamma V(x_{t+1}) - V(x_t) \right)$ is a temporal difference for $V$, and $\rho_t = \min\left(\bar{\rho}, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right)$ and $c_i = \min\left(\bar{c}, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
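
To make the definition concrete, here is a minimal NumPy sketch of how the targets $v_s$ could be computed for a single trajectory. It assumes the per-step rewards, value estimates $V(x_t)$, a bootstrap value $V(x_{s+n})$, and the action probabilities under $\pi$ and $\mu$ are already available; the function name and signature are illustrative, not the IMPALA reference implementation. It relies on the fact that the sum above can be accumulated backwards with the recursion $v_s = V(x_s) + \delta_s V + \gamma c_s \left(v_{s+1} - V(x_{s+1})\right)$.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value,
                   target_probs, behaviour_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute v_s for each step of a length-n trajectory.

    rewards, values, target_probs, behaviour_probs: 1-D arrays of length n,
    holding r_t, V(x_t), pi(a_t|x_t) and mu(a_t|x_t) respectively.
    bootstrap_value: scalar V(x_{s+n}) used to bootstrap the last step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(rewards)

    # Truncated importance sampling weights rho_t and c_t.
    ratios = np.asarray(target_probs, dtype=float) / np.asarray(behaviour_probs, dtype=float)
    rhos = np.minimum(rho_bar, ratios)
    cs = np.minimum(c_bar, ratios)

    # Append the bootstrap value so that values_ext[t + 1] = V(x_{t+1}).
    values_ext = np.append(values, bootstrap_value)

    # Temporal differences: delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t)).
    deltas = rhos * (rewards + gamma * values_ext[1:] - values_ext[:-1])

    # Backward accumulation: acc_t = delta_t V + gamma * c_t * acc_{t+1},
    # so that v_t = V(x_t) + acc_t.
    acc = 0.0
    corrections = np.zeros(n)
    for t in reversed(range(n)):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc

    return values + corrections
```

Note that in the on-policy case with $\bar{\rho} \geq 1$ and $\bar{c} \geq 1$ (so all weights equal 1), the sum telescopes and $v_s$ reduces to the standard $n$-step Bellman target.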