
What is: AMSGrad?

Source: On the Convergence of Adam and Beyond
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

AMSGrad is a stochastic optimization method that seeks to fix a convergence issue with Adam-based optimizers. AMSGrad uses the maximum of past squared gradients $v_t$ rather than the exponential moving average to update the parameters:

$$m_t = \beta_1 m_{t-1} + \left(1 - \beta_1\right) g_t$$

$$v_t = \beta_2 v_{t-1} + \left(1 - \beta_2\right) g_t^2$$

$$\hat{v}_t = \max\left(\hat{v}_{t-1}, v_t\right)$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} m_t$$
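
The update can be implemented directly from the four equations above. Below is a minimal NumPy sketch; the function name, the default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), and the toy objective are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def amsgrad_update(theta, m, v, v_hat, grad, lr=0.001,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step. Hyperparameter defaults follow common
    Adam settings and are illustrative assumptions."""
    m = beta1 * m + (1 - beta1) * grad               # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2            # second-moment estimate
    v_hat = np.maximum(v_hat, v)                     # keep the max of past v_t
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v, v_hat

# Usage: minimize f(x) = x^2 starting from x = 5
theta = np.array([5.0])
m = v = v_hat = np.zeros_like(theta)
for _ in range(2000):
    grad = 2 * theta                                 # gradient of x^2
    theta, m, v, v_hat = amsgrad_update(theta, m, v, v_hat, grad, lr=0.05)
print(theta)  # converges close to 0
```

Because $\hat{v}_t$ is non-decreasing, the effective per-parameter step size $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ can only shrink over time, which is what restores the convergence guarantee that plain Adam lacks.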