
What is: Bottleneck Attention Module?

Source: BAM: Bottleneck Attention Module
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

Park et al. proposed the bottleneck attention module (BAM), aiming to efficiently improve the representational capability of networks. It uses dilated convolutions to enlarge the receptive field of the spatial attention sub-module, and builds a bottleneck structure, as suggested by ResNet, to save computational cost.

For a given input feature map $X$, BAM infers the channel attention $s_c \in \mathbb{R}^C$ and the spatial attention $s_s \in \mathbb{R}^{H\times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\mathbb{R}^{C\times H\times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as

\begin{align}
s_c &= \text{BN}(W_2(W_1\,\text{GAP}(X)+b_1)+b_2) \\
s_s &= \text{BN}(\text{Conv}_2^{1\times 1}(\text{DC}_2^{3\times 3}(\text{DC}_1^{3\times 3}(\text{Conv}_1^{1\times 1}(X))))) \\
s &= \sigma(\text{Expand}(s_s)+\text{Expand}(s_c)) \\
Y &= s X + X
\end{align}

where $W_i$ and $b_i$ denote the weights and biases of the fully connected layers, $\text{Conv}_1^{1\times 1}$ and $\text{Conv}_2^{1\times 1}$ are convolution layers used for channel reduction, $\text{DC}_i^{3\times 3}$ denotes a dilated convolution with a $3\times 3$ kernel, applied to utilize contextual information effectively, and $\text{Expand}$ expands the attention maps $s_s$ and $s_c$ to $\mathbb{R}^{C\times H\times W}$.
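The following is a minimal PyTorch sketch of these equations, not the authors' reference implementation. The reduction ratio (16) and dilation rate (4) are assumed hyperparameters chosen to mirror common BAM configurations; broadcasting of the two branch outputs plays the role of $\text{Expand}$.

```python
import torch
import torch.nn as nn


class BAM(nn.Module):
    """Sketch of a bottleneck attention module (assumed hyperparameters)."""

    def __init__(self, channels: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        mid = channels // reduction

        # Channel branch: GAP -> MLP with channel reduction (W1, b1, W2, b2) -> BN
        self.channel_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),        # GAP(X)
            nn.Flatten(),
            nn.Linear(channels, mid),       # W1, b1
            nn.ReLU(inplace=True),
            nn.Linear(mid, channels),       # W2, b2
        )
        self.channel_bn = nn.BatchNorm1d(channels)

        # Spatial branch: 1x1 reduction -> two dilated 3x3 convs -> 1x1 to one map -> BN
        self.spatial_branch = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),                                   # Conv1 1x1
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=dilation, dilation=dilation),   # DC1 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=dilation, dilation=dilation),   # DC2 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, kernel_size=1),                                          # Conv2 1x1
            nn.BatchNorm2d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s_c = self.channel_bn(self.channel_branch(x)).view(b, c, 1, 1)  # expands over H, W
        s_s = self.spatial_branch(x)                                    # expands over C
        s = torch.sigmoid(s_c + s_s)     # broadcasting acts as Expand(.)
        return x * s + x                 # Y = sX + X


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(BAM(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```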

BAM can emphasize or suppress features in both the spatial and channel dimensions, as well as improve representational power. Dimensionality reduction in both the channel and spatial attention branches allows it to be integrated into any convolutional neural network at little extra computational cost; a placement sketch follows below. However, although dilated convolutions enlarge the receptive field effectively, they still fail to capture long-range contextual information or to encode cross-domain relationships.
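As a usage sketch of that integration claim, the snippet below attaches the `BAM` module defined above to a torchvision ResNet-18. Inserting it after `layer1`-`layer3` is an assumption that mirrors the paper's idea of placing BAM at the bottlenecks between network stages; 64/128/256 are the ResNet-18 stage widths.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Hypothetical placement: BAM after the first three ResNet-18 stages,
# i.e. at the points where the spatial resolution is about to drop.
net = resnet18()
net.layer1 = nn.Sequential(net.layer1, BAM(64))
net.layer2 = nn.Sequential(net.layer2, BAM(128))
net.layer3 = nn.Sequential(net.layer3, BAM(256))

x = torch.randn(1, 3, 224, 224)
print(net(x).shape)  # torch.Size([1, 1000])
```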