
What is: Adaptive Masking?

Source: Adaptive Attention Span in Transformers
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Adaptive Masking is a type of attention mechanism that allows a model to learn its own context size to attend over. For each head in Multi-Head Attention, a masking function is added to control the span of the attention. A masking function is a non-increasing function that maps a distance to a value in $[0, 1]$. Adaptive Masking uses the following soft masking function $m_z$, parametrized by a real value $z$ in $[0, S]$:

$$m_z(x) = \min\left[\max\left[\frac{1}{R}\left(R + z - x\right), 0\right], 1\right]$$

where $R$ is a hyper-parameter that controls the softness of the mask: as a function of the distance, $m_z$ is piecewise linear, equal to 1 up to distance $z$, decreasing linearly over the next $R$ positions, and 0 beyond $z + R$. This soft masking function is inspired by Jernite et al. (2017). The attention weights are then computed on the masked span:
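As a rough illustration, here is a minimal sketch of the soft masking function in PyTorch; the function name `soft_mask` and the tensor layout are assumptions for this example, not taken from the paper's reference implementation.

```python
import torch

def soft_mask(distance: torch.Tensor, z: torch.Tensor, R: float) -> torch.Tensor:
    """m_z(x) = min(max((R + z - x) / R, 0), 1), applied element-wise.

    `distance` holds the distances x between the query position and each key
    position; `z` is the learnable span parameter of one attention head.
    """
    return torch.clamp((R + z - distance) / R, min=0.0, max=1.0)

# Example: with z = 2 and R = 2, positions at distance <= 2 are fully
# attended, distance 3 is half-masked, and distance >= 4 is masked out.
print(soft_mask(torch.arange(6.0), z=torch.tensor(2.0), R=2.0))
# tensor([1.0000, 1.0000, 1.0000, 0.5000, 0.0000, 0.0000])
```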

$$a_{tr} = \frac{m_z(t - r)\exp\left(s_{tr}\right)}{\sum^{t-1}_{q=t-S} m_z(t - q)\exp\left(s_{tq}\right)}$$
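The sketch below shows this masked renormalization, assuming the raw scores $s_{tq}$ for the span $q = t-S, \dots, t-1$ are already computed and stored in a tensor ordered from the most distant to the closest position; `masked_attention_weights` is a hypothetical helper, not the paper's code.

```python
import torch

def masked_attention_weights(scores: torch.Tensor,
                             z: torch.Tensor,
                             R: float) -> torch.Tensor:
    """Weight exp(s_{tq}) by m_z(t - q) and renormalize over the span.

    `scores` has shape (..., S), one raw score per span position, ordered
    from distance S (oldest) down to distance 1 (most recent).
    """
    S = scores.size(-1)
    # Distances t - q for q = t - S, ..., t - 1.
    distances = torch.arange(S, 0, -1, dtype=scores.dtype, device=scores.device)
    mask = torch.clamp((R + z - distances) / R, min=0.0, max=1.0)  # m_z(t - q)
    # Subtracting the max score only stabilizes the exponentials; the constant
    # cancels in the normalization below.
    weights = mask * torch.exp(scores - scores.max(dim=-1, keepdim=True).values)
    return weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-12)
```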

An $\ell_1$ penalty on the parameters $z_i$ of each attention head $i$ of the model is added to the loss function:

$$L = -\log P\left(w_1, \dots, w_T\right) + \frac{\lambda}{M}\sum_i z_i$$

where $\lambda > 0$ is the regularization hyperparameter, and $M$ is the number of heads in each layer. This formulation is differentiable in the parameters $z_i$, which are learnt jointly with the rest of the model.
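As one way to picture the regularized objective, the sketch below adds the $\ell_1$ term to a standard language-modelling loss; `adaptive_span_loss`, the argument names, and the way the span parameters are collected are illustrative assumptions rather than the paper's API.

```python
from typing import Iterable

import torch

def adaptive_span_loss(nll: torch.Tensor,
                       span_params: Iterable[torch.Tensor],
                       lam: float,
                       heads_per_layer: int) -> torch.Tensor:
    """Total loss L = -log P(w_1, ..., w_T) + (lambda / M) * sum_i z_i.

    `nll` is the negative log-likelihood of the sequence, `span_params`
    collects the z_i of every attention head, `lam` is lambda > 0, and
    `heads_per_layer` is M.
    """
    l1 = sum(z.sum() for z in span_params)
    return nll + (lam / heads_per_layer) * l1
```

Because the penalty is linear in each $z_i$ and the mask is differentiable in $z$, gradients flow through both terms, so the span parameters can be trained with the same optimizer step as the rest of the model.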