
What is: Adam?

Source: Adam: A Method for Stochastic Optimization
Year: 2014
Data Source: CC BY-SA - https://paperswithcode.com

Adam is an adaptive learning rate optimization algorithm that utilises both momentum and scaling, combining the benefits of RMSProp and SGD with Momentum. The optimizer is designed to be appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients.

The weight updates are performed as:

$$w_{t} = w_{t-1} - \eta\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$$

with

$$\hat{m}_{t} = \frac{m_{t}}{1-\beta^{t}_{1}}$$

$$\hat{v}_{t} = \frac{v_{t}}{1-\beta^{t}_{2}}$$

$$m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t}$$

$$v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2}$$

$\eta$ is the step size/learning rate, around 1e-3 in the original paper. $\epsilon$ is a small number, typically 1e-8 or 1e-10, to prevent division by zero. $\beta_{1}$ and $\beta_{2}$ are forgetting parameters, with typical values 0.9 and 0.999, respectively.
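
To make the update concrete, below is a minimal NumPy sketch of a single Adam step that follows the equations above. The function name `adam_update` and the toy quadratic objective are illustrative choices, not part of the original paper, and the defaults mirror the hyperparameters listed above.

```python
import numpy as np

def adam_update(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameters w, given gradient g at step t (t starts at 1)."""
    # Update biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias-corrected moment estimates
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Parameter update
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 201):
    g = 2 * w
    w, m, v = adam_update(w, g, m, v, t)
print(w)  # values move toward zero
```

Note that the bias correction matters most in the first few steps, when $m_{t}$ and $v_{t}$ are still close to their zero initialization; without it, the effective step size would be much smaller early in training.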