
What is: Scaled Dot-Product Attention?

Source: Attention Is All You Need
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

Scaled dot-product attention is an attention mechanism in which the dot products are scaled down by $\sqrt{d_k}$. Formally, given a query $Q$, a key $K$ and a value $V$, the attention is calculated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
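As a minimal sketch of this formula (NumPy, a single head, no masking; the function name and the shapes used in the example are illustrative assumptions, not part of the original paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    Returns an array of shape (n_queries, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_queries, n_keys)
    # Numerically stable softmax over the key dimension
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: 4 queries attending over 6 keys/values (d_k = 8, d_v = 16)
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```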

If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$.
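A quick numerical check of this argument (sampling standard normal components; the choice of $d_k = 64$ and the sample count are arbitrary):

```python
import numpy as np

# With d_k-dimensional q and k whose components are i.i.d. with mean 0 and
# variance 1, the raw dot product has variance ~ d_k, while dividing by
# sqrt(d_k) brings the variance back to ~ 1.
rng = np.random.default_rng(0)
d_k, n_samples = 64, 100_000
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))
dots = (q * k).sum(axis=1)
print(dots.var())                    # close to 64
print((dots / np.sqrt(d_k)).var())   # close to 1
```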