
What is: Residual Attention Network?

Source: Residual Attention Network for Image Classification
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

Inspired by the success of ResNet, Wang et al. proposed the Residual Attention Network (RAN), a very deep convolutional network that combines an attention mechanism with residual connections.

Each attention module stacked in a Residual Attention Network can be divided into a mask branch and a trunk branch. The trunk branch processes features and can be implemented with any state-of-the-art structure, such as a pre-activation residual unit or an inception block. The mask branch uses a bottom-up top-down structure to learn a mask of the same size that softly weights the output features of the trunk branch. After two $1\times 1$ convolution layers, a sigmoid layer normalizes the mask values to $[0, 1]$. Overall, the residual attention mechanism can be written as

\begin{align} s &= \sigma(\mathrm{Conv}_{2}^{1\times 1}(\mathrm{Conv}_{1}^{1\times 1}(h_\text{up}(h_\text{down}(X))))) \end{align}

\begin{align} X_\text{out} &= s\, f(X) + f(X) \end{align}

where $h_\text{down}$ is the bottom-up part, which applies max-pooling several times after residual units to enlarge the receptive field, and $h_\text{up}$ is the top-down part, which uses linear interpolation to restore the output to the same size as the input feature map. Skip-connections between the two parts are omitted from the formulation. $f$ denotes the trunk branch, which can be any state-of-the-art structure.
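
The following is a minimal PyTorch sketch of one such module, intended only to illustrate the two equations above: the trunk branch $f$ is approximated with two pre-activation residual units, and the mask branch uses a single max-pooling/interpolation pair instead of the several stacked in the paper. All layer widths and unit counts are assumptions, not the paper's configuration.

```python
# Minimal sketch of a residual attention module (illustrative, not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreActResidualUnit(nn.Module):
    """Simple pre-activation residual unit, used here for both trunk and mask branches."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out


class ResidualAttentionModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.trunk = nn.Sequential(PreActResidualUnit(channels),
                                   PreActResidualUnit(channels))  # trunk branch f
        self.mask_down = PreActResidualUnit(channels)  # bottom-up part (after max-pooling)
        self.mask_up = PreActResidualUnit(channels)    # top-down part (before interpolation)
        self.conv1 = nn.Conv2d(channels, channels, 1)  # Conv_1^{1x1}
        self.conv2 = nn.Conv2d(channels, channels, 1)  # Conv_2^{1x1}

    def forward(self, x):
        t = self.trunk(x)                              # f(X)
        # h_down: max-pooling to enlarge the receptive field of the mask branch.
        m = F.max_pool2d(x, kernel_size=2, stride=2)
        m = self.mask_down(m)
        m = self.mask_up(m)
        # h_up: linear interpolation back to the input feature-map size.
        m = F.interpolate(m, size=x.shape[-2:], mode='bilinear', align_corners=False)
        s = torch.sigmoid(self.conv2(self.conv1(m)))   # soft mask in [0, 1]
        return s * t + t                               # X_out = s f(X) + f(X)
```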

Inside each attention module, a bottom-up top-down feedforward structure models both spatial and cross-channel dependencies, leading to a consistent performance improvement. Residual attention can be incorporated into any deep network structure and trained end-to-end. However, the proposed bottom-up top-down structure fails to leverage global spatial information, and directly predicting a full 3D attention map has a high computational cost.
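
As a usage illustration of the end-to-end point above, the sketch below interleaves the `ResidualAttentionModule` from the previous snippet with ordinary strided convolutions to form a tiny classifier; the layer widths are hypothetical and chosen only to keep the example short.

```python
# Toy classifier reusing ResidualAttentionModule from the previous sketch.
# Layer widths are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


def tiny_ran(num_classes=10):
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=1, padding=1),
        ResidualAttentionModule(32),            # attention dropped between ordinary stages
        nn.Conv2d(32, 64, 3, stride=2, padding=1),
        ResidualAttentionModule(64),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, num_classes),
    )


if __name__ == "__main__":
    model = tiny_ran()
    logits = model(torch.randn(2, 3, 32, 32))   # trains end-to-end like any nn.Module
    print(logits.shape)                         # torch.Size([2, 10])
```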