What is: Split Attention?

A Split Attention block enables attention across feature-map groups. As in ResNeXt blocks, the input features can be divided into several groups, and the number of feature-map groups is given by a cardinality hyperparameter $K$. The resulting feature-map groups are called cardinal groups. Split Attention blocks introduce a new radix hyperparameter $R$ that indicates the number of splits within a cardinal group, so the total number of feature groups is $G = KR$. A series of transformations $\{\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_G\}$ is applied, one per group, so the intermediate representation of each group is $U_i = \mathcal{F}_i(X)$ for $i \in \{1, 2, \ldots, G\}$.
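The grouping above can be sketched in NumPy. This is a minimal illustration, not the block's actual layers: the form of each $\mathcal{F}_i$ is not specified here, so the sketch assumes a per-pixel linear projection followed by ReLU. The function name `apply_group_transforms`, the `weights` array, and all sizes are hypothetical.

```python
import numpy as np

def apply_group_transforms(X, weights):
    """Compute U_i = F_i(X) for each of the G = K * R groups.

    X:       input feature map, shape (H, W, C_in).
    weights: hypothetical per-group projection matrices, shape (G, C_in, C_out).
             ASSUMPTION: each F_i is a per-pixel linear map followed by ReLU.
    Returns U with shape (G, H, W, C_out).
    """
    # Apply group g's matrix at every spatial position (h, w).
    U = np.einsum("hwc,gcd->ghwd", X, weights)
    return np.maximum(U, 0.0)

K, R = 2, 3                      # cardinality and radix
G = K * R                        # total number of feature groups
H, W, C_in, C_out = 4, 5, 8, 6   # C_out plays the role of C / K

rng = np.random.default_rng(0)
X = rng.standard_normal((H, W, C_in))
weights = rng.standard_normal((G, C_in, C_out))
U = apply_group_transforms(X, weights)   # shape (G, H, W, C_out)
```

Stacking the $G$ group outputs along a leading axis keeps the $R$ splits of each cardinal group adjacent, which makes later per-group operations a simple reshape.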

A combined representation for each cardinal group is obtained by fusing its splits via element-wise summation. The representation of the $k$-th cardinal group is $\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j$, where $\hat{U}^k \in \mathbb{R}^{H\times W\times C/K}$ for $k \in \{1, 2, \ldots, K\}$, and $H$, $W$ and $C$ are the block output feature-map sizes. Global contextual information with embedded channel-wise statistics $s^k\in\mathbb{R}^{C/K}$ can be gathered with global average pooling across the spatial dimensions. The $c$-th component is calculated as:

$$s^k_c = \frac{1}{H\times W} \sum_{i=1}^H\sum_{j=1}^W \hat{U}^k_c(i, j).$$
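As a rough sketch of the fusion and pooling steps, the NumPy snippet below sums the $R$ splits of each cardinal group and then averages over the spatial dimensions. The function name `fuse_and_pool` and the storage layout (all $G$ group outputs stacked along the first axis, with the splits of a cardinal group stored contiguously) are assumptions made for illustration.

```python
import numpy as np

def fuse_and_pool(U, K, R):
    """Fuse splits by summation, then apply global average pooling.

    U: stacked group representations, shape (K * R, H, W, C/K).
       ASSUMPTION: the 1-based groups R(k-1)+1 .. Rk of cardinal group k
       occupy 0-based rows R*(k-1) .. R*k - 1.
    Returns (U_hat, s) with shapes (K, H, W, C/K) and (K, C/K).
    """
    G, H, W, Ck = U.shape
    if G != K * R:
        raise ValueError("first axis of U must have length K * R")
    # Element-wise sum over the R splits of each cardinal group.
    U_hat = U.reshape(K, R, H, W, Ck).sum(axis=1)
    # Average over the spatial axes to get per-channel statistics s^k.
    s = U_hat.mean(axis=(1, 2))
    return U_hat, s

K, R = 2, 3
rng = np.random.default_rng(0)
U = rng.standard_normal((K * R, 4, 5, 2))   # H=4, W=5, C/K=2
U_hat, s = fuse_and_pool(U, K, R)
```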

The weighted fusion $V^k\in\mathbb{R}^{H\times W\times C/K}$ of the cardinal group representation is aggregated using channel-wise soft attention, where each feature-map channel is produced by a weighted combination over splits. The $c$-th channel is calculated as:

$$V^k_c = \sum_{i=1}^R a^k_i(c)\, U_{R(k-1)+i},$$

where $a_i^k(c)$ denotes a (soft) assignment weight given by:

$$a_i^k(c) = \begin{cases} \dfrac{\exp(\mathcal{G}^c_i(s^k))}{\sum_{j=1}^R \exp(\mathcal{G}^c_j(s^k))} & \quad\textrm{if } R>1, \\ \dfrac{1}{1+\exp(-\mathcal{G}^c_i(s^k))} & \quad\textrm{if } R=1, \end{cases}$$

and the mapping $\mathcal{G}_i^c$ determines the weight of each split for the $c$-th channel based on the global context representation $s^k$.
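The attention step can be sketched end to end in NumPy: pool each cardinal group, compute per-channel weights with a softmax over the $R$ splits (or a sigmoid when $R = 1$), and take the weighted sum. The form of $\mathcal{G}$ is not given above, so this sketch assumes a single linear map per split, shared across cardinal groups; `split_attention` and `G_weights` are hypothetical names.

```python
import numpy as np

def split_attention(U, K, R, G_weights):
    """Channel-wise soft attention over the R splits of each cardinal group.

    U:         stacked group representations, shape (K * R, H, W, C/K),
               with the splits of each cardinal group stored contiguously.
    G_weights: hypothetical parameters of the mapping G, shape (R, C/K, C/K).
               ASSUMPTION: G_i(s) = s @ G_weights[i], a single linear map.
    Returns (V, a) with shapes (K, H, W, C/K) and (K, R, C/K).
    """
    G, H, W, Ck = U.shape
    if G != K * R:
        raise ValueError("first axis of U must have length K * R")
    U = U.reshape(K, R, H, W, Ck)
    # Global context s^k: sum over splits, then spatial average.
    s = U.sum(axis=1).mean(axis=(1, 2))                  # (K, C/K)
    logits = np.einsum("kc,rcd->krd", s, G_weights)      # (K, R, C/K)
    if R > 1:
        # Softmax over the split axis, shifted for numerical stability.
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        a = e / e.sum(axis=1, keepdims=True)
    else:
        # With a single split, a sigmoid gate replaces the softmax.
        a = 1.0 / (1.0 + np.exp(-logits))
    # Weighted combination over splits, broadcast across H and W.
    V = (a[:, :, None, None, :] * U).sum(axis=1)
    return V, a

K, R = 2, 3
rng = np.random.default_rng(0)
U = rng.standard_normal((K * R, 4, 5, 2))        # H=4, W=5, C/K=2
G_weights = rng.standard_normal((R, 2, 2))
V, a = split_attention(U, K, R, G_weights)
```

For $R > 1$ the weights of each channel sum to one across splits, so $V^k$ is a per-channel convex combination of the split representations; for $R = 1$ each channel is instead scaled by a gate in $(0, 1)$.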

Source	ResNeSt: Split-Attention Networks
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com