
What is: Generalized State-Dependent Exploration?

Source: Smooth Exploration for Robotic Reinforcement Learning
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Generalized State-Dependent Exploration, or gSDE, is an exploration method for reinforcement learning that uses more general features and re-samples the noise periodically.

State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists in adding noise as a function of the state $\mathbf{s}_t$ to the deterministic action $\mu(\mathbf{s}_t)$. At the beginning of an episode, the parameters $\theta_\epsilon$ of that exploration function are drawn from a Gaussian distribution. The resulting action $\mathbf{a}_t$ is as follows:

$$\mathbf{a}_t = \mu\left(\mathbf{s}_t ; \theta_\mu\right) + \epsilon\left(\mathbf{s}_t ; \theta_\epsilon\right), \quad \theta_\epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$$
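As a rough illustration, here is a minimal NumPy sketch of this episode-based scheme. The linear deterministic policy `W_mu`, the dimensions, and the constant exploration standard deviation are all made up for the example; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, action_dim = 4, 2
sigma = 0.3 * np.ones((state_dim, action_dim))   # exploration std (learned in practice)

# hypothetical linear deterministic policy mu(s) = W_mu @ s (stand-in for a network)
W_mu = rng.normal(size=(action_dim, state_dim))

def mu(s):
    return W_mu @ s

# at the START of an episode: draw theta_eps once, element-wise from N(0, sigma_ij^2)
theta_eps = rng.normal(loc=0.0, scale=sigma)     # shape (state_dim, action_dim)

def sde_action(s):
    # theta_eps is fixed for the whole episode -> same action for the same state
    return mu(s) + theta_eps.T @ s

s = rng.normal(size=state_dim)
print(sde_action(s))   # identical on repeated calls with the same state s
print(sde_action(s))
```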

This episode-based exploration is smoother and more consistent than unstructured step-based exploration. Thus, during one episode, instead of oscillating around a mean value, the action $\mathbf{a}$ for a given state $\mathbf{s}$ will be the same.

In the case of a linear exploration function $\epsilon\left(\mathbf{s} ; \theta_\epsilon\right) = \theta_\epsilon \mathbf{s}$, by operation on Gaussian distributions, Rückstieß et al. show that the action element $\mathbf{a}_j$ is normally distributed:

$$\pi_j\left(\mathbf{a}_j \mid \mathbf{s}\right) \sim \mathcal{N}\left(\mu_j(\mathbf{s}), \hat{\sigma}_j^2\right)$$

where $\hat{\sigma}$ is a diagonal matrix with elements $\hat{\sigma}_j = \sqrt{\sum_i \left(\sigma_{ij} \mathbf{s}_i\right)^2}$.
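For concreteness, a small sketch of this $\hat{\sigma}_j$ computation, with arbitrary made-up values for $\sigma$ and $\mathbf{s}$:

```python
import numpy as np

sigma = np.array([[0.2, 0.5],
                  [0.1, 0.3],
                  [0.4, 0.2]])       # sigma_{ij}: state_dim x action_dim
s = np.array([1.0, -2.0, 0.5])       # state

# hat{sigma}_j = sqrt( sum_i (sigma_{ij} * s_i)^2 )
sigma_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))
print(sigma_hat)                     # per-action-dimension std of the resulting policy
```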

Because we know the policy distribution, we can obtain the derivative of the log-likelihood $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to the variance $\sigma$:

$$\frac{\partial \log \pi(\mathbf{a} \mid \mathbf{s})}{\partial \sigma_{ij}} = \frac{\left(\mathbf{a}_j - \mu_j\right)^2 - \hat{\sigma}_j^2}{\hat{\sigma}_j^3} \frac{\mathbf{s}_i^2 \sigma_{ij}}{\hat{\sigma}_j}$$
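As a quick numerical sanity check of this expression (not part of the original article), one can compare the analytic gradient against a central finite difference of the Gaussian log-likelihood, using arbitrary values:

```python
import numpy as np

def log_pi(a, mu_s, sigma, s):
    """Log-density of N(mu(s), diag(sigma_hat^2)) for the linear noise function."""
    sigma_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))
    return np.sum(-0.5 * ((a - mu_s) / sigma_hat) ** 2
                  - np.log(sigma_hat) - 0.5 * np.log(2 * np.pi))

rng = np.random.default_rng(1)
state_dim, action_dim = 3, 2
s = rng.normal(size=state_dim)
sigma = rng.uniform(0.1, 0.5, size=(state_dim, action_dim))
mu_s = rng.normal(size=action_dim)
a = rng.normal(size=action_dim)

# analytic gradient from the formula above, shape (state_dim, action_dim)
sigma_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))
grad = ((a - mu_s) ** 2 - sigma_hat ** 2) / sigma_hat ** 3 \
       * (s[:, None] ** 2 * sigma) / sigma_hat

# check one entry (i, j) against a central finite difference
i, j, eps = 1, 0, 1e-6
sigma_p = sigma.copy(); sigma_p[i, j] += eps
sigma_m = sigma.copy(); sigma_m[i, j] -= eps
fd = (log_pi(a, mu_s, sigma_p, s) - log_pi(a, mu_s, sigma_m, s)) / (2 * eps)
print(grad[i, j], fd)   # the two values should agree closely
```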

This can be easily plugged into the likelihood ratio gradient estimator, which allows $\sigma$ to be adapted during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration.

For gSDE, two improvements are suggested:

  1. We sample the parameters $\theta_\epsilon$ of the exploration function every $n$ steps instead of every episode.
  2. Instead of the state $\mathbf{s}$, we can in fact use any features. We choose policy features $\mathbf{z}_\mu\left(\mathbf{s} ; \theta_{\mathbf{z}_\mu}\right)$ (the last layer before the deterministic output, $\mu(\mathbf{s}) = \theta_\mu \mathbf{z}_\mu\left(\mathbf{s} ; \theta_{\mathbf{z}_\mu}\right)$) as input to the noise function: $\epsilon\left(\mathbf{s} ; \theta_\epsilon\right) = \theta_\epsilon \mathbf{z}_\mu(\mathbf{s})$. Both changes are illustrated in the sketch after this list.
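A minimal sketch of gSDE under assumed details: a hypothetical one-hidden-layer policy with a `tanh` feature layer standing in for $\mathbf{z}_\mu$, arbitrary dimensions, and an illustrative resampling interval of 16 steps. It is a sketch of the two modifications above, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, feature_dim, action_dim = 4, 8, 2
resample_every_n = 16                       # "n" in improvement 1 (value chosen arbitrarily)

# hypothetical policy: features z_mu(s), then mu(s) = theta_mu @ z_mu(s)
W1 = rng.normal(size=(feature_dim, state_dim))
theta_mu = rng.normal(size=(action_dim, feature_dim))
sigma = 0.2 * np.ones((feature_dim, action_dim))   # exploration std over the features

def z_mu(s):
    return np.tanh(W1 @ s)                  # last layer before the deterministic output

theta_eps = rng.normal(scale=sigma)         # initial noise parameters

def gsde_action(s, step):
    global theta_eps
    if step % resample_every_n == 0:        # improvement 1: resample every n steps
        theta_eps = rng.normal(scale=sigma)
    # improvement 2: the noise is a linear function of the policy features z_mu(s)
    return theta_mu @ z_mu(s) + theta_eps.T @ z_mu(s)

for step in range(4):
    print(gsde_action(rng.normal(size=state_dim), step))
```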