
What is: Factorized Dense Synthesized Attention?

Source: Synthesizer: Rethinking Self-Attention in Transformer Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Factorized Dense Synthesized Attention is a synthesized attention mechanism, similar to dense synthesized attention, but with the outputs factorized to reduce the parameter count and prevent overfitting. It was proposed as part of the Synthesizer architecture. The factorized variant of the dense synthesizer can be expressed as follows:

$$A, B = F_A(X_i), F_B(X_i)$$

where $F_A(\cdot)$ projects input $X_i$ into $a$ dimensions, $F_B(\cdot)$ projects $X_i$ to $b$ dimensions, and $a \times b = l$. The output of the factorized module is now written as:
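For example, with sequence length $l = 64$ one could choose $a = b = 8$ (illustrative values), so each token is projected to two 8-dimensional vectors instead of a single 64-dimensional one, which is where the saving in parameters over the dense synthesizer comes from.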

$$Y = \text{Softmax}(C)\,G(X)$$

where $C = H_A(A) * H_B(B)$, where $H_A$, $H_B$ are tiling functions and $C \in \mathbb{R}^{l \times l}$. The tiling function simply duplicates the vector $k$ times, i.e., $\mathbb{R}^{l} \rightarrow \mathbb{R}^{lk}$. In this case, $H_A(\cdot)$ is a projection of $\mathbb{R}^{a} \rightarrow \mathbb{R}^{ab}$ and $H_B(\cdot)$ is a projection of $\mathbb{R}^{b} \rightarrow \mathbb{R}^{ba}$. To avoid having similar values within the same block, we compose the outputs of $H_A$ and $H_B$.
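As a concrete illustration, the following PyTorch sketch implements a single-head factorized dense synthesizer following the equations above. The class and layer names, the use of plain linear layers for $F_A$, $F_B$, and $G$, and the specific tiling scheme (block-wise repeat for $H_A$, element-wise repeat for $H_B$ as one way to compose the two tilings) are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedDenseSynthesizer(nn.Module):
    """Single-head factorized dense synthesized attention (sketch).

    Layer names (f_a, f_b, g) and the choice of nn.Linear for the
    projections F_A, F_B, G are illustrative assumptions.
    """

    def __init__(self, d_model: int, seq_len: int, a: int, b: int):
        super().__init__()
        assert a * b == seq_len, "factor dimensions must satisfy a * b = l"
        self.a, self.b = a, b
        self.f_a = nn.Linear(d_model, a)       # F_A: X_i -> R^a
        self.f_b = nn.Linear(d_model, b)       # F_B: X_i -> R^b
        self.g = nn.Linear(d_model, d_model)   # G: value projection of X

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d_model)
        A = self.f_a(x)                               # (batch, l, a)
        B = self.f_b(x)                               # (batch, l, b)
        # H_A tiles each a-dim vector b times: R^a -> R^{ab}.
        # H_B tiles each b-dim vector a times: R^b -> R^{ba}.
        # Repeating A block-wise and B element-wise composes the two
        # tilings so values within a block are not all identical.
        h_a = A.repeat(1, 1, self.b)                  # (batch, l, a*b) = (batch, l, l)
        h_b = B.repeat_interleave(self.a, dim=-1)     # (batch, l, b*a) = (batch, l, l)
        C = h_a * h_b                                 # synthesized scores, (batch, l, l)
        return F.softmax(C, dim=-1) @ self.g(x)       # Y = Softmax(C) G(X)
```

A quick usage check with illustrative shapes (sequence length $l = 64$, $a = b = 8$, model width 512):

```python
layer = FactorizedDenseSynthesizer(d_model=512, seq_len=64, a=8, b=8)
y = layer(torch.randn(2, 64, 512))   # -> (2, 64, 512)
```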