
What is: Adaptively Spatial Feature Fusion?

Source: Learning Spatial Fusion for Single-Shot Object Detection
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

ASFF, or Adaptively Spatial Feature Fusion, is a method for pyramidal feature fusion. It learns how to spatially filter conflicting information so as to suppress inconsistencies across feature scales, thereby improving the scale invariance of the features.

ASFF enables the network to directly learn how to spatially filter features from other levels so that only useful information is kept for combination. For the features at a given level, features from the other levels are first resized to the same resolution and integrated, and the network is then trained to find the optimal fusion. At each spatial location, features from different levels are fused adaptively, i.e., some features may be filtered out because they carry contradictory information at that location, while others may dominate with more discriminative clues. ASFF offers several advantages: (1) since the search for the optimal fusion is differentiable, it can be conveniently learned through back-propagation; (2) it is agnostic to the backbone model and can be applied to any single-shot detector with a feature pyramid structure; and (3) its implementation is simple and the added computational cost is marginal.
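As a concrete illustration of the resizing step, the sketch below brings a feature map from one pyramid level to the spatial resolution of another. The choices here (nearest-neighbor upsampling, max-pool downsampling, no channel-matching convolution) are simplifications for illustration; the paper's actual implementation differs in these details.

```python
import torch.nn.functional as F


def resize_feature(x, target_hw):
    """Resize a feature map x of shape (N, C, H, W) to target_hw = (H_t, W_t).

    Features coming from a coarser level are upsampled with nearest-neighbor
    interpolation; features coming from a finer level are downsampled with
    max-pooling. Channel adaptation (e.g. a 1x1 convolution) is omitted.
    """
    h, w = x.shape[-2:]
    th, tw = target_hw
    if (th, tw) == (h, w):
        return x
    if th > h:
        # coarser source level -> upsample to the target resolution
        return F.interpolate(x, size=(th, tw), mode="nearest")
    # finer source level -> downsample by an integer stride
    stride = h // th
    return F.max_pool2d(x, kernel_size=stride, stride=stride)
```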

Let $\mathbf{x}_{ij}^{n\rightarrow l}$ denote the feature vector at position $(i,j)$ on the feature maps resized from level $n$ to level $l$. Following a feature resizing stage, we fuse the features at the corresponding level $l$ as follows:

$$\mathbf{y}_{ij}^{l} = \alpha_{ij}^{l} \cdot \mathbf{x}_{ij}^{1\rightarrow l} + \beta_{ij}^{l} \cdot \mathbf{x}_{ij}^{2\rightarrow l} + \gamma_{ij}^{l} \cdot \mathbf{x}_{ij}^{3\rightarrow l},$$

where $\mathbf{y}_{ij}^{l}$ denotes the $(i,j)$-th vector of the output feature maps $\mathbf{y}^{l}$ across channels. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ are the spatial importance weights for the feature maps from the three different levels to level $l$, which are adaptively learned by the network. Note that $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ can be simple scalar variables, shared across all channels. Inspired by ACNet, we force $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$ and $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0,1]$, and define

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}.$$

Here $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ are defined using the softmax function with $\lambda_{\alpha_{ij}}^{l}$, $\lambda_{\beta_{ij}}^{l}$ and $\lambda_{\gamma_{ij}}^{l}$ as control parameters, respectively. We use $1\times1$ convolution layers to compute the weight scalar maps $\mathbf{\lambda}_{\alpha}^{l}$, $\mathbf{\lambda}_{\beta}^{l}$ and $\mathbf{\lambda}_{\gamma}^{l}$ from $\mathbf{x}^{1\rightarrow l}$, $\mathbf{x}^{2\rightarrow l}$ and $\mathbf{x}^{3\rightarrow l}$ respectively, and they can thus be learned through standard back-propagation.
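The PyTorch sketch below illustrates this fusion step for one output level, assuming the three inputs have already been resized to the target level's resolution and share the same channel count; the module name and the compressed channel width `inter_channels` are illustrative choices, not the paper's exact implementation. The $1\times1$ convolutions produce the $\lambda$ maps, and a softmax across the three levels yields $\alpha$, $\beta$ and $\gamma$ for the weighted sum:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASFFFusion(nn.Module):
    """Sketch of adaptive fusion for one output level of the pyramid.

    Expects three inputs of identical shape (N, C, H, W), i.e. already
    resized to the target level's resolution.
    """

    def __init__(self, channels, inter_channels=16):
        super().__init__()
        # 1x1 convolutions produce a compressed weight map (lambda) per level.
        self.weight_level_1 = nn.Conv2d(channels, inter_channels, kernel_size=1)
        self.weight_level_2 = nn.Conv2d(channels, inter_channels, kernel_size=1)
        self.weight_level_3 = nn.Conv2d(channels, inter_channels, kernel_size=1)
        # A final 1x1 convolution maps the concatenated lambdas to 3 channels,
        # one per source level.
        self.weight_levels = nn.Conv2d(3 * inter_channels, 3, kernel_size=1)

    def forward(self, x1, x2, x3):
        w = torch.cat(
            [self.weight_level_1(x1), self.weight_level_2(x2), self.weight_level_3(x3)],
            dim=1,
        )
        # Softmax over the level dimension so alpha + beta + gamma = 1
        # at every spatial position.
        w = F.softmax(self.weight_levels(w), dim=1)  # (N, 3, H, W)
        alpha, beta, gamma = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return alpha * x1 + beta * x2 + gamma * x3
```

For example, `ASFFFusion(256)(x1, x2, x3)` fuses three resized maps of shape `(N, 256, H, W)` into a single map of the same shape.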

With this method, the features from all levels are adaptively aggregated at each scale. The outputs are then used for object detection following the same pipeline as YOLOv3.