
What is: AdaShift?

Source: AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

AdaShift is an adaptive stochastic optimizer that decorrelates $v_{t}$ and $g_{t}$ in Adam by temporal shifting, i.e., it uses the temporally shifted gradient $g_{t-n}$ to calculate $v_{t}$. The authors argue that an inappropriate correlation exists between the gradient $g_{t}$ and the second-moment term $v_{t}$ in Adam, so that a large gradient is likely to receive a small step size while a small gradient may receive a large step size. They argue that these biased step sizes are the fundamental cause of Adam's non-convergence.

The AdaShift updates, based on the idea of temporal independence between gradients, are as follows:

$$g_{t} = \nabla f_{t}\left(\theta_{t}\right)$$

$$m_{t} = \frac{\sum_{i=0}^{n-1}\beta_{1}^{i}\,g_{t-i}}{\sum_{i=0}^{n-1}\beta_{1}^{i}}$$

Then, for $i = 1$ to $M$:

$$v_{t}\left[i\right] = \beta_{2}v_{t-1}\left[i\right] + \left(1-\beta_{2}\right)\phi\left(g^{2}_{t-n}\left[i\right]\right)$$

$$\theta_{t}\left[i\right] = \theta_{t-1}\left[i\right] - \frac{\alpha_{t}}{\sqrt{v_{t}\left[i\right]}}\cdot m_{t}\left[i\right]$$
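
To make the update concrete, here is a minimal NumPy sketch of these equations, assuming the elementwise case where the spatial operation $\phi$ is the identity and $i$ simply ranges over the individual parameter entries. The function name, default hyperparameters, the small `eps` added for numerical stability, and the warm-up handling of the first $n$ steps are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from collections import deque

def adashift(grad_fn, theta0, alpha=0.01, beta1=0.9, beta2=0.999,
             n=10, eps=1e-8, num_steps=1000):
    """Minimal AdaShift sketch with phi = identity (elementwise version)."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    window = deque(maxlen=n + 1)            # holds g_{t-n}, ..., g_t
    weights = beta1 ** np.arange(n)         # beta1^0, ..., beta1^{n-1}
    weights = weights / weights.sum()       # normalized first-moment weights

    for t in range(num_steps):
        window.append(grad_fn(theta))
        if len(window) < n + 1:
            continue                        # warm-up: wait until n+1 gradients are stored
        grads = list(window)                # grads[0] = g_{t-n}, grads[-1] = g_t
        # First moment: weighted average of the n most recent gradients,
        # where the newest gradient g_t gets weight beta1^0.
        recent = np.stack(grads[1:][::-1])  # recent[i] = g_{t-i}
        m = np.tensordot(weights, recent, axes=1)
        # Second moment: built from the temporally shifted gradient g_{t-n},
        # which decorrelates v_t from the current gradient g_t.
        v = beta2 * v + (1.0 - beta2) * grads[0] ** 2
        theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta_hat = adashift(lambda th: 2.0 * th, theta0=np.ones(5), num_steps=2000)
```

In the block-wise variant described in the paper, the parameters are split into $M$ blocks and $\phi$ is a spatial operation (for example, a per-block max) applied to $g^{2}_{t-n}$, so that all entries of a block share the same adaptive learning rate.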