
What is: Global and Sliding Window Attention?

Source: Longformer: The Long-Document Transformer
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Global and Sliding Window Attention is an attention pattern for attention-based models. It is motivated by the fact that non-sparse attention in the original Transformer formulation has a self-attention component with $O(n^2)$ time and memory complexity, where $n$ is the input sequence length, and thus does not scale efficiently to long inputs.

Since windowed and dilated attention patterns are not flexible enough to learn task-specific representations, the authors of the Longformer add “global attention” on a few pre-selected input locations. This attention operation is symmetric: a token with global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. The accompanying figure shows an example of sliding window attention with global attention at a few tokens at custom locations. For classification, for example, global attention is used on the [CLS] token, while for question answering it is provided on all question tokens.
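To make the pattern concrete, here is a minimal NumPy sketch that builds the combined sliding-window plus global attention mask. The function name `longformer_style_mask`, its one-sided `window_size` parameter, and the example sizes are illustrative assumptions, not the Longformer implementation, which computes banded attention efficiently rather than materializing a dense mask.

```python
import numpy as np


def longformer_style_mask(seq_len, window_size, global_idx):
    """Boolean attention mask: mask[i, j] == True means query i may attend to key j.

    window_size is one-sided (each token attends to that many neighbours on each side);
    global_idx lists the positions given global attention.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Sliding-window attention: each token attends to its local neighbourhood.
    for i in range(seq_len):
        lo = max(0, i - window_size)
        hi = min(seq_len, i + window_size + 1)
        mask[i, lo:hi] = True

    # Global attention is symmetric: global tokens attend to every token,
    # and every token attends to the global tokens.
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True

    return mask


if __name__ == "__main__":
    # Classification-style setup: global attention on the [CLS] token at position 0.
    m = longformer_style_mask(seq_len=8, window_size=1, global_idx=[0])
    print(m.astype(int))
```

Note that a dense $n \times n$ mask like this is only for visualizing which query-key pairs are allowed; building it explicitly would forfeit the memory savings that the sparse pattern is designed to provide.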