
What is: Dense Synthesized Attention?

Source: Synthesizer: Rethinking Self-Attention in Transformer Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Dense Synthesized Attention, introduced with the Synthesizer architecture, is a type of synthetic attention mechanism that replaces the notion of query-key-values in the self-attention module and directly synthesizes the alignment matrix instead. Dense attention is conditioned on each input token. The method accepts an input $X \in \mathbb{R}^{l \times d}$ and produces an output $Y \in \mathbb{R}^{l \times d}$, where $l$ is the sequence length and $d$ is the dimensionality of the model. We first adopt a parameterized function $F(\cdot)$ for projecting the input $X_i$ from $d$ dimensions to $l$ dimensions:

$$B_i = F\left(X_i\right)$$

where $F(\cdot)$ is a parameterized function that maps $\mathbb{R}^{d}$ to $\mathbb{R}^{l}$ and $i$ indexes the $i$-th token of $X$. Intuitively, this can be interpreted as learning a token-wise projection to the sequence length $l$: each token predicts attention weights over every token in the input sequence. In practice, a simple two-layer feed-forward network with a ReLU activation is adopted for $F(\cdot)$:

$$F\left(X\right) = W_2\left(\sigma_R\left(W_1\left(X\right) + b_1\right)\right) + b_2$$
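
As a concrete illustration, here is a minimal PyTorch sketch of $F(\cdot)$ as a token-wise two-layer MLP. The class name, the choice of $d$ as the hidden width, and the use of PyTorch are assumptions made for this example, not code from the paper.

```python
import torch
import torch.nn as nn

class DenseProjection(nn.Module):
    """Sketch of F(.): maps each d-dimensional token to l attention logits."""

    def __init__(self, d_model: int, max_seq_len: int):
        super().__init__()
        # Two linear layers with a ReLU in between; the hidden width d_model
        # is an assumption, the paper only specifies a two-layer network.
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, max_seq_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d)  ->  B: (batch, l, l), one row of logits per token
        return self.w2(torch.relu(self.w1(x)))
```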

where $\sigma_R$ is the ReLU activation function. Hence, $B$ is now of $\mathbb{R}^{l \times l}$. Given $B$, we now compute:

$$Y = \text{Softmax}\left(B\right) G\left(X\right)$$

where $G(\cdot)$ is another parameterized function of $X$, analogous to $V$ (the values) in the standard Transformer model. This approach eliminates the dot product altogether, replacing $QK^{T}$ in standard Transformers with the synthesizing function $F(\cdot)$.
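
Putting the pieces together, the sketch below wires $F(\cdot)$ and $G(\cdot)$ into a complete Dense Synthesized Attention layer, again in PyTorch. Treating $G(\cdot)$ as a single linear (value-like) projection and fixing the sequence length at construction time are simplifying assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class DenseSynthesizerAttention(nn.Module):
    """Sketch of Y = Softmax(B) G(X), with B = F(X) and no query-key dot product."""

    def __init__(self, d_model: int, max_seq_len: int):
        super().__init__()
        # F(.): two-layer MLP that synthesizes the l x l alignment matrix B.
        self.f = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_seq_len),
        )
        # G(.): value-like projection of the input (assumed to be a single linear map).
        self.g = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d), with l equal to max_seq_len in this simplified sketch.
        b = self.f(x)                    # (batch, l, l): synthesized alignment logits
        attn = torch.softmax(b, dim=-1)  # row-wise attention weights
        return attn @ self.g(x)          # (batch, l, d): Y = Softmax(B) G(X)


# Usage with hypothetical sizes: batch of 2, sequence length 16, model dimension 64.
x = torch.randn(2, 16, 64)
layer = DenseSynthesizerAttention(d_model=64, max_seq_len=16)
print(layer(x).shape)  # torch.Size([2, 16, 64])
```

Note that the final linear layer of $F(\cdot)$ produces exactly $l$ logits per token, so the dense variant is tied to a maximum sequence length chosen in advance.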