Distributed training has become a pervasive and effective approach for training large neural network
(NN) models on massive data. However, it is very challenging to satisfy the requirements
of various NN models, diverse computing resources, and their dynamic changes during a training
job. In this study, we design our distributed training framework from a systematic end-to-end view to
provide built-in adaptive ability for different scenarios, especially industrial applications and
production environments, by fully considering resource allocation, model partition, task placement,
and distributed execution. Based on the unified distributed graph and the unified cluster object,
our adaptive framework is equipped with a global cost model and a global planner, which can
enable arbitrary parallelism, resource-aware placement, multi-mode execution, fault tolerance, and
elastic distributed training. The experiments demonstrate that our framework can satisfy various
requirements from the diversity of applications and the heterogeneity of resources with highly
competitive performance.

An **ECA-Net** is a type of convolutional neural network that utilises an [Efficient Channel Attention](https://paperswithcode.com/method/efficient-channel-attention) module.

ECA-Net

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

E2EAdaptiveDistTraining

End-to-end Adaptive Distributed Training on PaddlePaddle

**Generalized State-Dependent Exploration**, or **gSDE**, is an exploration method for reinforcement learning that uses more general features and re-samples the noise periodically.

State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists of adding noise, as a function of the state $\mathbf{s}\_{t}$, to the deterministic action $\mu\left(\mathbf{s}\_{t}\right)$. At the beginning of an episode, the parameters $\theta\_{\epsilon}$ of that exploration function are drawn from a Gaussian distribution. The resulting action $\mathbf{a}\_{t}$ is as follows:

$$
\mathbf{a}\_{t}=\mu\left(\mathbf{s}\_{t} ; \theta\_{\mu}\right)+\epsilon\left(\mathbf{s}\_{t} ; \theta\_{\epsilon}\right), \quad \theta\_{\epsilon} \sim \mathcal{N}\left(0, \sigma^{2}\right)
$$

This episode-based exploration is smoother and more consistent than unstructured step-based exploration: during one episode, instead of oscillating around a mean value, the action $\mathbf{a}$ for a given state $\mathbf{s}$ will be the same.
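
To make the contrast concrete, here is a minimal sketch of episode-based noise, assuming a linear noise function $\epsilon(\mathbf{s} ; \theta\_{\epsilon})=\theta\_{\epsilon} \mathbf{s}$ and a single scalar $\sigma$ for illustration. The policy `mu`, the helper names, and all numeric values are hypothetical and are not taken from the source.

```python
import random


def sample_theta_eps(state_dim, action_dim, sigma, rng):
    """Draw theta_eps ~ N(0, sigma^2), one entry per (state, action) pair."""
    return [[rng.gauss(0.0, sigma) for _ in range(action_dim)]
            for _ in range(state_dim)]


def sde_action(state, mu, theta_eps):
    """a = mu(s) + eps(s; theta_eps), with the linear noise eps_j = sum_i theta_ij * s_i."""
    mean = mu(state)
    noise = [sum(theta_eps[i][j] * state[i] for i in range(len(state)))
             for j in range(len(mean))]
    return [m + e for m, e in zip(mean, noise)]


def mu(state):
    """Hypothetical deterministic policy: a fixed linear map to a 1-D action."""
    return [0.5 * state[0] - 0.2 * state[1]]


rng = random.Random(0)
# Drawn once, at the beginning of the episode.
theta_eps = sample_theta_eps(state_dim=2, action_dim=1, sigma=0.3, rng=rng)

s = [1.0, -2.0]
a_first = sde_action(s, mu, theta_eps)
a_again = sde_action(s, mu, theta_eps)  # same state, later in the same episode
same_action = a_first == a_again        # noise is a fixed function of the state

# Unstructured step-based exploration, for contrast: fresh noise at every step.
step_actions = [mu(s)[0] + rng.gauss(0.0, 0.3) for _ in range(3)]
```

Within an episode the perturbation is a fixed function of the state, so revisiting `s` reproduces the same action, whereas the step-based variant yields a different action on every visit.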

In the case of a linear exploration function $\epsilon\left(\mathbf{s} ; \theta\_{\epsilon}\right)=\theta\_{\epsilon} \mathbf{s}$, Rückstieß et al. show, using standard operations on Gaussian distributions, that the action element $\mathbf{a}\_{j}$ is normally distributed:

$$
\pi\_{j}\left(\mathbf{a}\_{j} \mid \mathbf{s}\right) \sim \mathcal{N}\left(\mu\_{j}(\mathbf{s}), \hat{\sigma}\_{j}^{2}\right)
$$

where $\hat{\sigma}$ is a diagonal matrix with elements $\hat{\sigma}\_{j}=\sqrt{\sum\_{i}\left(\sigma\_{i j} \mathbf{s}\_{i}\right)^{2}}$.
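
The expression for $\hat{\sigma}\_{j}$ can be checked numerically. The sketch below assumes each entry of $\theta\_{\epsilon}$ is drawn independently with its own standard deviation $\sigma\_{i j}$ (as the indices in the formula suggest) and compares the closed form with a Monte Carlo estimate. The values of `sigma` and `state` are arbitrary illustrations.

```python
import math
import random


def sigma_hat(sigma, state):
    """hat_sigma_j = sqrt(sum_i (sigma_ij * s_i)^2), one value per action dimension j."""
    n_actions = len(sigma[0])
    return [math.sqrt(sum((sigma[i][j] * state[i]) ** 2 for i in range(len(state))))
            for j in range(n_actions)]


sigma = [[0.3, 0.1],   # sigma[i][j]: std of theta_eps for state dim i, action dim j
         [0.2, 0.4]]
state = [1.0, -2.0]
predicted = sigma_hat(sigma, state)

# Monte Carlo estimate: sample theta_eps many times and measure the spread of
# the noise eps_j = sum_i theta_ij * s_i (which has zero mean).
rng = random.Random(0)
n_samples = 50_000
sq_sums = [0.0, 0.0]
for _ in range(n_samples):
    for j in range(2):
        noise_j = sum(rng.gauss(0.0, sigma[i][j]) * state[i] for i in range(2))
        sq_sums[j] += noise_j ** 2
empirical = [math.sqrt(total / n_samples) for total in sq_sums]
```

For these values the closed form gives $\hat{\sigma}\_{1}=\sqrt{0.09+0.16}=0.5$ and $\hat{\sigma}\_{2}=\sqrt{0.01+0.64}=\sqrt{0.65}$, and the sampled estimates in `empirical` should lie close to them.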

Because we know the policy distribution, we can obtain the derivative of the log-likelihood $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to the variance $\sigma$:

$$
\frac{\partial \log \pi(\mathbf{a} \mid \mathbf{s})}{\partial \sigma\_{i j}}=\frac{\left(\mathbf{a}\_{j}-\mu\_{j}\right)^{2}-\hat{\sigma}\_{j}^{2}}{\hat{\sigma}\_{j}^{3}} \frac{\mathbf{s}\_{i}^{2} \sigma\_{i j}}{\hat{\sigma}\_{j}}
$$
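
The two factors come from the chain rule: the first is $\partial \log \pi / \partial \hat{\sigma}\_{j}$ for a Gaussian, and the second is $\partial \hat{\sigma}\_{j} / \partial \sigma\_{i j}$. As a sanity check, the sketch below compares this analytic expression against a central finite difference of the Gaussian log-likelihood. The function names and test values are illustrative assumptions, not part of the source.

```python
import math


def sigma_hat_j(sigma, state, j):
    """hat_sigma_j = sqrt(sum_i (sigma_ij * s_i)^2)."""
    return math.sqrt(sum((sigma[i][j] * state[i]) ** 2 for i in range(len(state))))


def log_prob(action, mean, sigma, state):
    """log pi(a | s) for independent Gaussians N(mu_j, hat_sigma_j^2)."""
    total = 0.0
    for j in range(len(action)):
        sh = sigma_hat_j(sigma, state, j)
        total += (-0.5 * math.log(2.0 * math.pi) - math.log(sh)
                  - (action[j] - mean[j]) ** 2 / (2.0 * sh ** 2))
    return total


def grad_log_prob(action, mean, sigma, state, i, j):
    """Analytic derivative of log pi(a | s) with respect to sigma_ij."""
    sh = sigma_hat_j(sigma, state, j)
    d_logp_d_sh = ((action[j] - mean[j]) ** 2 - sh ** 2) / sh ** 3
    d_sh_d_sigma = state[i] ** 2 * sigma[i][j] / sh
    return d_logp_d_sh * d_sh_d_sigma


sigma = [[0.3, 0.1], [0.2, 0.4]]
state = [1.0, -2.0]
mean = [0.0, 1.0]
action = [0.7, 0.4]

h = 1e-6  # finite-difference step
max_abs_error = 0.0
for i in range(2):
    for j in range(2):
        plus = [row[:] for row in sigma]
        minus = [row[:] for row in sigma]
        plus[i][j] += h
        minus[i][j] -= h
        numeric = (log_prob(action, mean, plus, state)
                   - log_prob(action, mean, minus, state)) / (2.0 * h)
        analytic = grad_log_prob(action, mean, sigma, state, i, j)
        max_abs_error = max(max_abs_error, abs(numeric - analytic))
```

If the formula is implemented correctly, `max_abs_error` stays at the level of finite-difference round-off.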

This can be easily plugged into the likelihood ratio gradient estimator, which allows $\sigma$ to be adapted during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration.

For gSDE, two improvements are suggested:

1. We sample the parameters $\theta\_{\epsilon}$ of the exploration function every $n$ steps instead of every episode.
2. Instead of the state $\mathbf{s}$, we can in fact use any features. We chose policy features $\mathbf{z}\_{\mu}\left(\mathbf{s} ; \theta\_{\mathbf{z}\_{\mu}}\right)$ (the last layer before the deterministic output $\mu(\mathbf{s})=\theta\_{\mu} \mathbf{z}\_{\mu}\left(\mathbf{s} ; \theta\_{\mathbf{z}\_{\mu}}\right)$) as input to the noise function $\epsilon\left(\mathbf{s} ; \theta\_{\epsilon}\right)=\theta\_{\epsilon} \mathbf{z}\_{\mu}(\mathbf{s})$.
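
The two modifications can be sketched together as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the class `GSDENoise`, the parameter name `sample_freq` (standing in for $n$), the feature function `policy_features`, and the weights `theta_mu` are all hypothetical.

```python
import math
import random


class GSDENoise:
    """Noise eps_j = sum_i theta_ij * z_i, with theta_eps re-drawn every `sample_freq` steps."""

    def __init__(self, feature_dim, action_dim, sigma, sample_freq, seed=0):
        self.feature_dim = feature_dim
        self.action_dim = action_dim
        self.sigma = sigma
        self.sample_freq = sample_freq
        self.rng = random.Random(seed)
        self.steps_since_sample = 0
        self.theta_eps = self._sample()

    def _sample(self):
        return [[self.rng.gauss(0.0, self.sigma) for _ in range(self.action_dim)]
                for _ in range(self.feature_dim)]

    def __call__(self, features):
        # Improvement 1: re-sample after n steps rather than once per episode.
        if self.steps_since_sample == self.sample_freq:
            self.theta_eps = self._sample()
            self.steps_since_sample = 0
        self.steps_since_sample += 1
        return [sum(self.theta_eps[i][j] * features[i] for i in range(self.feature_dim))
                for j in range(self.action_dim)]


def policy_features(state):
    """Hypothetical last hidden layer z_mu(s) of the policy network."""
    return [math.tanh(state[0] + state[1]), math.tanh(state[0] - state[1])]


theta_mu = [[0.5], [-0.2]]  # hypothetical output weights: mu_j(s) = sum_i theta_mu[i][j] * z_i


def act(state, noise):
    z = policy_features(state)
    mean = [sum(theta_mu[i][j] * z[i] for i in range(len(z))) for j in range(1)]
    eps = noise(z)  # Improvement 2: the noise is a function of the features, not the raw state.
    return [m + e for m, e in zip(mean, eps)]


noise = GSDENoise(feature_dim=2, action_dim=1, sigma=0.3, sample_freq=4)
s = [1.0, -2.0]
actions = [act(s, noise)[0] for _ in range(8)]
```

With `sample_freq=4`, the first four actions for the fixed state `s` coincide, and the next four coincide with each other but differ from the first group, because $\theta\_{\epsilon}$ is re-drawn on the fifth step.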

Source	End-to-end Adaptive Distributed Training on PaddlePaddle
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com