
What is: Global-and-Local attention?

Source: Learning what and where to attend
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Most attention mechanisms learn where to focus using only weak supervisory signals from class labels, which led Linsley et al. to investigate how explicit human supervision affects the performance and interpretability of attention models. As a proof of concept, they proposed the global-and-local attention (GALA) module, which extends an SE block with a spatial attention mechanism.

Given an input feature map $X$, GALA uses an attention mask that combines global and local attention to tell the network where and on what to focus. As in SE blocks, global attention aggregates global information by global average pooling and then produces a channel-wise attention weight vector using a multilayer perceptron. Local attention applies two consecutive $1\times 1$ convolutions to the input to produce a positional weight map. The outputs of the local and global pathways are then combined by addition and multiplication. Formally, GALA can be written as:

\begin{align}
s_g &= W_{2} \, \delta (W_{1}\,\text{GAP}(X)) \\
s_l &= \text{Conv}_2^{1\times 1} (\delta(\text{Conv}_1^{1\times 1}(X))) \\
s_g^* &= \text{Expand}(s_g) \\
s_l^* &= \text{Expand}(s_l) \\
s &= \tanh\big(a(s_g^* + s_l^*) + m \cdot (s_g^* s_l^*)\big) \\
Y &= sX
\end{align}

where $a, m \in \mathbb{R}^{C}$ are learnable parameters representing channel-wise weight vectors, $\delta$ denotes a nonlinearity (ReLU), and $\text{Expand}$ broadcasts the channel-wise and positional weights to the shape of $X$.
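To make the equations concrete, here is a minimal PyTorch sketch of a GALA block following the formulation above. The class name, the reduction ratio, the use of ReLU for $\delta$, and the single-channel local map are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal GALA sketch (assumed hyperparameters, not the official implementation).
import torch
import torch.nn as nn


class GALA(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Global pathway: GAP + two-layer MLP (as in an SE block),
        # producing a channel-wise weight vector s_g of shape N x C x 1 x 1.
        self.global_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # Local pathway: two consecutive 1x1 convolutions producing a
        # positional weight map s_l of shape N x 1 x H x W.
        self.local_convs = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
        )
        # Learnable channel-wise vectors a, m weighting the additive and
        # multiplicative combination of the two pathways.
        self.a = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.m = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s_g = self.global_mlp(x)   # N x C x 1 x 1; broadcasting plays the role of Expand
        s_l = self.local_convs(x)  # N x 1 x H x W; broadcasting plays the role of Expand
        # Combine additively and multiplicatively, then squash with tanh.
        s = torch.tanh(self.a * (s_g + s_l) + self.m * (s_g * s_l))
        return s * x               # Y = sX (element-wise)


# Usage: wrap a feature map produced by any CNN backbone.
if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    gala = GALA(channels=64)
    print(gala(x).shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch the Expand operations are implemented implicitly through broadcasting: the global vector is broadcast over spatial positions and the local map over channels, so their sum and product have the same shape as the input.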

When supervised with human-provided feature importance maps, GALA significantly improves representational power and interpretability, and it can be combined with any CNN backbone.