What is: Scale-wise Feature Aggregation Module?

SFAM, or Scale-wise Feature Aggregation Module, is a feature extraction block from the M2Det architecture. It aggregates the multi-level, multi-scale features generated by the Thinned U-shape Modules (TUMs) into a multi-level feature pyramid.

The first stage of SFAM concatenates features of the same scale along the channel dimension. The aggregated feature pyramid can be written as $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_i]$, where $\mathbf{X}_i = \text{Concat}(\mathbf{x}_i^1, \mathbf{x}_i^2, \dots, \mathbf{x}_i^L) \in \mathbb{R}^{W_i \times H_i \times C}$ refers to the features of the $i$-th largest scale and $L$ is the number of levels. Each scale in the aggregated pyramid therefore contains features from multiple depths.
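The concatenation stage can be sketched in a few lines of NumPy. This is a minimal illustration, not the M2Det implementation: the function name `aggregate_by_scale`, the toy sizes (3 levels, 3 scales, 4 channels per level), and the random inputs are all assumptions chosen for readability. Arrays use a $W \times H \times C$ layout to match the notation above.

```python
import numpy as np

def aggregate_by_scale(level_pyramids):
    """Concatenate same-scale features from every level along the channel axis.

    level_pyramids[l][i] is the feature map of level l at scale i,
    with shape (W_i, H_i, C_level). Hypothetical helper for illustration.
    """
    num_scales = len(level_pyramids[0])
    return [
        np.concatenate([pyramid[i] for pyramid in level_pyramids], axis=-1)
        for i in range(num_scales)
    ]

# Toy sizes (assumed, not the paper's configuration).
rng = np.random.default_rng(0)
num_levels, channels_per_level = 3, 4
scale_sizes = [(8, 8), (4, 4), (2, 2)]  # largest scale first

level_pyramids = [
    [rng.standard_normal((w, h, channels_per_level)) for (w, h) in scale_sizes]
    for _ in range(num_levels)
]

aggregated = aggregate_by_scale(level_pyramids)
print([x.shape for x in aggregated])  # [(8, 8, 12), (4, 4, 12), (2, 2, 12)]
```

Each aggregated map keeps its spatial size, while its channel count becomes the sum of the per-level channel counts (here $3 \times 4 = 12$).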

However, simple concatenation is not adaptive enough. In the second stage, we introduce a channel-wise attention module to encourage features to focus on the channels from which they benefit most. Following Squeeze-and-Excitation, we use global average pooling to generate channel-wise statistics $\mathbf{z} \in \mathbb{R}^C$ at the squeeze step. To fully capture channel-wise dependencies, the subsequent excitation step learns the attention weights via two fully connected layers:

$$\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z},\mathbf{W}) = \sigma(\mathbf{W}_{2}\,\delta(\mathbf{W}_{1}\mathbf{z})),$$

where $\sigma$ refers to the sigmoid function, $\delta$ refers to the ReLU function, $\mathbf{W}_{1} \in \mathbb{R}^{\frac{C}{r}\times C}$, $\mathbf{W}_{2} \in \mathbb{R}^{C\times \frac{C}{r}}$, and $r$ is the reduction ratio ($r=16$ in our experiments). The final output is obtained by reweighting the input $\mathbf{X}$ with the activation $\mathbf{s}$:

$$\tilde{\mathbf{X}}_i^c = \mathbf{F}_{scale}(\mathbf{X}_i^c, s_c) = s_c \cdot \mathbf{X}_i^c,$$

where $\tilde{\mathbf{X}}_i = [\tilde{\mathbf{X}}_i^1, \tilde{\mathbf{X}}_i^2, \dots, \tilde{\mathbf{X}}_i^C]$, and each feature is enhanced or weakened by the rescaling operation.
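The squeeze, excitation, and rescaling steps for a single scale can be sketched as follows. This is an illustrative NumPy version under stated assumptions: it follows the Squeeze-and-Excitation convention of a ReLU after the first fully connected layer and a sigmoid gate after the second, omits bias terms as the formula does, and uses randomly initialised weights in place of learned ones. The names `se_reweight`, `w1`, and `w2` and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_reweight(x, w1, w2):
    """Channel-wise attention on one aggregated feature map.

    x:  (W, H, C) feature map for a single scale
    w1: (C // r, C) weights of the first fully connected layer
    w2: (C, C // r) weights of the second fully connected layer
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = x.mean(axis=(0, 1))
    # Excitation: ReLU bottleneck followed by a sigmoid gate -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Scale: multiply every channel c by its weight s_c (broadcast over W, H)
    return x * s, s

# Toy sizes (assumed); r = 16 matches the reduction ratio quoted above.
rng = np.random.default_rng(0)
C, r = 32, 16
x = rng.standard_normal((4, 4, C))
w1 = rng.standard_normal((C // r, C)) * 0.1  # untrained stand-in weights
w2 = rng.standard_normal((C, C // r)) * 0.1

x_tilde, s = se_reweight(x, w1, w2)
print(x_tilde.shape, s.shape)  # (4, 4, 32) (32,)
```

Because the sigmoid keeps every $s_c$ strictly between 0 and 1, the rescaling in this sketch can only attenuate a channel relative to its input; what changes is the relative emphasis between channels.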

Source	M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network
Year	2018
Data Source	CC BY-SA - https://paperswithcode.com