What is: Multiplicative Attention?

Multiplicative Attention is an attention mechanism where the alignment score function is calculated as:

$f_{att}\left(\mathbf{h}_{i}, \mathbf{s}_{j}\right) = \mathbf{h}_{i}^{T}\mathbf{W}_{a}\mathbf{s}_{j}$

Here $\mathbf{h}_{i}$ denotes the hidden states of the encoder (source), $\mathbf{s}_{j}$ denotes the hidden states of the decoder (target), and $\mathbf{W}_{a}$ is a learned weight matrix. A matrix of these alignment scores can be used to show the correlation between source and target words, as the Figure to the right shows. Within a neural network, once we have the alignment scores, we pass them through a softmax function to obtain the final attention weights, which ensures that the weights sum to 1.
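As a rough illustration of the score function and the softmax step, the following minimal sketch computes $\mathbf{h}_{i}^{T}\mathbf{W}_{a}\mathbf{s}_{j}$ for every source/target pair in plain Python. The helper names (`multiplicative_scores`, `attention_weights`), the toy vectors, and the identity matrix used for $\mathbf{W}_{a}$ are all assumptions made for this example. It also assumes the usual convention of normalising over source positions for each target position.

```python
import math

def multiplicative_scores(H, W, S):
    """Return scores[i][j] = h_i^T W s_j (hypothetical helper)."""
    # Project each decoder state once: W s_j.
    WS = [[sum(w * x for w, x in zip(row, s)) for row in W] for s in S]
    # Dot every encoder state with every projected decoder state.
    return [[sum(a * b for a, b in zip(h, ws)) for ws in WS] for h in H]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores):
    """Softmax over source positions i for each target position j (assumed convention)."""
    n_src, n_tgt = len(scores), len(scores[0])
    cols = [softmax([scores[i][j] for i in range(n_src)]) for j in range(n_tgt)]
    # Transpose back so weights[i][j] lines up with scores[i][j].
    return [[cols[j][i] for j in range(n_tgt)] for i in range(n_src)]

# Toy data (assumed): 3 encoder states and 2 decoder states of dimension 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = [[0.5, -0.5], [2.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity W_a, so the score reduces to a dot product

scores = multiplicative_scores(H, W, S)
weights = attention_weights(scores)
```

With the identity matrix standing in for $\mathbf{W}_{a}$, the scores are plain dot products. A trained model would learn $\mathbf{W}_{a}$, which also allows the encoder and decoder states to have different dimensions.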

Additive and multiplicative attention are similar in theoretical complexity, although multiplicative attention is faster and more space-efficient in practice because it can be implemented with highly optimized matrix multiplication. Both variants perform similarly for small dimensionality $d_{h}$ of the decoder states, but additive attention performs better for larger dimensions, where unscaled scores tend to grow large in magnitude and push the softmax into regions with very small gradients. One way to mitigate this is to scale $f_{att}\left(\mathbf{h}_{i}, \mathbf{s}_{j}\right)$ by $1/\sqrt{d_{h}}$, as in scaled dot-product attention.
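The effect of the $1/\sqrt{d_{h}}$ scaling can be sketched with a small experiment. The function name `score`, the `scale` flag, the dimension `d = 64`, and the constant toy vectors are assumptions chosen for illustration. The sketch only shows that scaling shrinks the scores and makes the softmax less peaked. It does not reproduce any reported result.

```python
import math

def score(h, W, s, scale=False):
    """h^T W s, optionally divided by sqrt(d_h) with d_h = len(s) (hypothetical helper)."""
    Ws = [sum(w * x for w, x in zip(row, s)) for row in W]
    raw = sum(a * b for a, b in zip(h, Ws))
    return raw / math.sqrt(len(s)) if scale else raw

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d = 64  # assumed state dimensionality
W = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]  # identity W_a
H = [[1.0] * d, [0.5] * d]  # two toy encoder states
s = [0.1] * d               # one toy decoder state

raw_weights = softmax([score(h, W, s) for h in H])
scaled_weights = softmax([score(h, W, s, scale=True) for h in H])

# The scaled distribution is flatter: its largest weight is smaller.
print(max(raw_weights), max(scaled_weights))
```

In this toy setting the unscaled scores grow linearly with `d`, so the softmax concentrates almost all of its mass on one source position. Dividing by $\sqrt{d_{h}}$ keeps the scores in a range where the weights stay more evenly spread.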

Source	Deep Learning for NLP Best Practices by Sebastian Ruder
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com