
What is: Adaptive Smooth Optimizer?

Source: AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio
Year: 2022
Data Source: CC BY-SA - https://paperswithcode.com

AdaSmooth is a stochastic optimization technique that provides a per-dimension learning rate for SGD. It is an extension of AdaGrad and AdaDelta that seeks to reduce AdaGrad's aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size $w$, while AdaSmooth adaptively selects the size of the window.

Given the window size $M$, the effective ratio is calculated as

$$e_t = \frac{\left| \sum_{i=0}^{M-1} \Delta x_{t-1-i} \right|}{\sum_{i=0}^{M-1} \left| \Delta x_{t-1-i} \right|}.$$

Given the effective ratio, the scaled smoothing constant is obtained by

$$c_t = (\rho_2 - \rho_1) \times e_t + (1 - \rho_2).$$

The running average $E\left[g^{2}\right]_{t}$ at time step $t$ then depends only on the previous average and the current gradient:

$$E\left[g^{2}\right]_{t} = c_t^2 \odot g_{t}^2 + \left(1 - c_t^2\right) \odot E\left[g^{2}\right]_{t-1}.$$

Usually $\rho_1$ is set to around $0.5$ and $\rho_2$ to around $0.99$. When recent updates consistently point in the same direction, $e_t$ is close to $1$ and $c_t$ approaches $1 - \rho_1$, so the running average reacts quickly to new gradients; when updates oscillate, $e_t$ is close to $0$ and $c_t$ approaches $1 - \rho_2$, giving a longer memory. The update step then follows:

$$\Delta x_t = -\frac{\eta}{\sqrt{E\left[g^{2}\right]_{t} + \epsilon}} \odot g_{t},$$

which is incorporated into the final update:

$$x_{t+1} = x_{t} + \Delta x_t.$$

The main advantages of AdaSmooth are its faster convergence and its insensitivity to hyperparameters.
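
Below is a minimal NumPy sketch of a single AdaSmooth step under the formulas above, assuming element-wise (per-dimension) operations. The function name `adasmooth_step`, the `state` dictionary, the default `window=10`, and the small `eps` added to the effective-ratio denominator are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def adasmooth_step(x, grad, state, lr=0.001, rho1=0.5, rho2=0.99,
                   eps=1e-6, window=10):
    # Effective ratio e_t: |sum of recent updates| / sum of |recent updates|,
    # computed per dimension over the last `window` steps.
    past = state.setdefault("past_updates", [])
    if past:
        recent = np.stack(past)                      # shape (<= window, dim)
        e_t = np.abs(recent.sum(axis=0)) / (np.abs(recent).sum(axis=0) + eps)
    else:
        e_t = np.zeros_like(x)                       # no history yet

    # Scaled smoothing constant: c_t = (rho2 - rho1) * e_t + (1 - rho2).
    c_t = (rho2 - rho1) * e_t + (1.0 - rho2)

    # Running average of squared gradients, weighted element-wise by c_t^2.
    Eg2 = state.get("Eg2", np.zeros_like(x))
    Eg2 = c_t ** 2 * grad ** 2 + (1.0 - c_t ** 2) * Eg2
    state["Eg2"] = Eg2

    # Per-dimension update and parameter step.
    delta_x = -lr / np.sqrt(Eg2 + eps) * grad
    past.append(delta_x)
    if len(past) > window:
        past.pop(0)                                  # keep only the last M updates
    return x + delta_x, state

# Usage sketch: minimize f(x) = x1^2 + x2^2, whose gradient is 2x.
x, state = np.array([3.0, -2.0]), {}
for _ in range(500):
    x, state = adasmooth_step(x, 2.0 * x, state, lr=0.1)
```

Keeping the window of past updates per coordinate is what lets $e_t$, and hence the smoothing constant, differ across dimensions, matching the element-wise $\odot$ in the formulas above.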