
What is: Attention-augmented Convolution?

Source: Attention Augmented Convolutional Networks
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Attention-augmented Convolution is a type of convolution with a two-dimensional relative self-attention mechanism that can replace convolutions as a stand-alone computational primitive for image classification. It employs scaled dot-product attention and multi-head attention, as in Transformers.

It works by concatenating the convolutional and attentional feature maps. To see this, consider an original convolution operator with kernel size $k$, $F_{in}$ input filters and $F_{out}$ output filters. The corresponding attention-augmented convolution can be written as:

$$\text{AAConv}\left(X\right) = \text{Concat}\left[\text{Conv}(X), \text{MHA}(X)\right]$$

$X$ originates from an input tensor of shape $\left(H, W, F_{in}\right)$. This is flattened to become $X \in \mathbb{R}^{HW \times F_{in}}$, which is passed into a multi-head attention module as well as a convolution (see above).
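The two branches can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the convolution branch is simplified to a 1×1 (pointwise) convolution, the relative position embeddings are omitted, and all weight names (`W_conv`, `W_q`, `W_k`, `W_v`, `W_o`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aa_conv(X, W_conv, W_q, W_k, W_v, W_o, num_heads=2):
    """Sketch of an attention-augmented convolution (no relative positions).

    X:       input of shape (H, W, F_in)
    W_conv:  (F_in, F_conv) -- a 1x1 conv standing in for the k x k conv
    W_q/W_k/W_v: (F_in, d) query/key/value projections
    W_o:     (d, F_attn) output projection of the attention branch
    Returns an output of shape (H, W, F_conv + F_attn).
    """
    H, Wd, F_in = X.shape
    conv_out = X @ W_conv                      # convolutional branch

    # Flatten spatial dims: (H, W, F_in) -> (HW, F_in), as in the text.
    flat = X.reshape(H * Wd, F_in)
    d = W_q.shape[1]
    dh = d // num_heads                        # per-head depth

    # Split projections into heads: (HW, d) -> (num_heads, HW, dh).
    Q = (flat @ W_q).reshape(H * Wd, num_heads, dh).transpose(1, 0, 2)
    K = (flat @ W_k).reshape(H * Wd, num_heads, dh).transpose(1, 0, 2)
    V = (flat @ W_v).reshape(H * Wd, num_heads, dh).transpose(1, 0, 2)

    # Scaled dot-product attention over all HW positions.
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    heads = attn @ V                           # (num_heads, HW, dh)

    # Re-merge heads, project, and restore the spatial layout.
    mha_out = (heads.transpose(1, 0, 2).reshape(H * Wd, d) @ W_o)
    mha_out = mha_out.reshape(H, Wd, -1)

    # Concatenate the two feature maps along the channel axis.
    return np.concatenate([conv_out, mha_out], axis=-1)
```

Because the attention weights are computed from dot products between all $HW$ positions, the same parameter matrices apply unchanged to inputs of any spatial size.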

Like the convolution, the attention-augmented convolution 1) is equivariant to translation and 2) can readily operate on inputs of different spatial dimensions.