
What is: Spatially Separable Self-Attention?

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Spatially Separable Self-Attention, or SSSA, is an attention module used in the Twins-SVT architecture that aims to reduce the computational complexity of vision transformers for dense prediction tasks (given high-resolution inputs). SSSA is composed of locally-grouped self-attention (LSA) and global sub-sampled attention (GSA).
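
For a rough sense of the savings (notation assumed here, matching the equations below: a feature map of size $H \times W$ with $C$ channels is split into $m \times n$ sub-windows of size $k_1 \times k_2$, so $m n k_1 k_2 = HW$), standard global self-attention costs $\mathcal{O}(H^2 W^2 C)$, whereas

$$
\text{LSA: } \mathcal{O}\big(m n (k_1 k_2)^2 C\big) = \mathcal{O}(k_1 k_2 H W C), \qquad
\text{GSA: } \mathcal{O}(H W \cdot m n \cdot C) = \mathcal{O}\!\left(\frac{H^2 W^2 C}{k_1 k_2}\right)
$$

Both terms are smaller than the global cost for reasonable window sizes, and their sum is minimized when $k_1 k_2$ is on the order of $\sqrt{HW}$.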

Formally, spatially separable self-attention (SSSA) can be written as:

$$
\hat{\mathbf{z}}_{ij}^{l} = \text{LSA}\left(\text{LayerNorm}\left(\mathbf{z}_{ij}^{l-1}\right)\right) + \mathbf{z}_{ij}^{l-1}
$$
$$
\mathbf{z}_{ij}^{l} = \text{FFN}\left(\text{LayerNorm}\left(\hat{\mathbf{z}}_{ij}^{l}\right)\right) + \hat{\mathbf{z}}_{ij}^{l}
$$
$$
\hat{\mathbf{z}}^{l+1} = \text{GSA}\left(\text{LayerNorm}\left(\mathbf{z}^{l}\right)\right) + \mathbf{z}^{l}
$$
$$
\mathbf{z}^{l+1} = \text{FFN}\left(\text{LayerNorm}\left(\hat{\mathbf{z}}^{l+1}\right)\right) + \hat{\mathbf{z}}^{l+1}
$$
$$
i \in \{1, 2, \ldots, m\}, \quad j \in \{1, 2, \ldots, n\}
$$

where LSA denotes locally-grouped self-attention within a sub-window, and GSA denotes global sub-sampled attention, which interacts with the representative keys (generated by the sub-sampling function) from each sub-window $\hat{\mathbf{z}}_{ij} \in \mathcal{R}^{k_1 \times k_2 \times C}$. Both LSA and GSA use multiple heads, as in standard self-attention.
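
Below is a minimal PyTorch sketch of the two attention operators and of one LSA/GSA block pair mirroring the equations above. This is not the official Twins-SVT implementation: the class names, the use of `nn.MultiheadAttention`, the window size, and the strided-convolution sub-sampling are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LocallyGroupedSelfAttention(nn.Module):
    """LSA: multi-head self-attention applied independently inside each
    non-overlapping sub-window (square windows of size ws x ws here)."""

    def __init__(self, dim, num_heads=8, window_size=7):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, H*W, C); H and W must be divisible by the window size.
        B, N, C = x.shape
        ws = self.ws
        # Partition the token grid into (H/ws) x (W/ws) windows of ws*ws tokens.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)          # attention within each window
        # Undo the window partition back to (B, H*W, C).
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x


class GlobalSubsampledAttention(nn.Module):
    """GSA: every token attends to a sub-sampled set of keys/values, one
    representative per sub-window (sub-sampling via a strided convolution)."""

    def __init__(self, dim, num_heads=8, sr_ratio=7):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        B, N, C = x.shape
        # Sub-sample the feature map: (B, H*W, C) -> (B, m*n, C) representatives.
        kv = self.sr(x.transpose(1, 2).reshape(B, C, H, W))
        kv = kv.flatten(2).transpose(1, 2)
        out, _ = self.attn(x, kv, kv)      # queries: all tokens; keys/values: m*n
        return out


# One LSA block followed by one GSA block, mirroring the equations above
# (distinct LayerNorm/FFN modules per sub-layer are collapsed here for brevity).
dim, H, W = 64, 14, 14
x = torch.randn(2, H * W, dim)
lsa = LocallyGroupedSelfAttention(dim)
gsa = GlobalSubsampledAttention(dim)
norm = nn.LayerNorm(dim)
ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

x = x + lsa(norm(x), H, W)   # z_hat^l     = LSA(LN(z^{l-1})) + z^{l-1}
x = x + ffn(norm(x))         # z^l         = FFN(LN(z_hat^l)) + z_hat^l
x = x + gsa(norm(x), H, W)   # z_hat^{l+1} = GSA(LN(z^l)) + z^l
x = x + ffn(norm(x))         # z^{l+1}     = FFN(LN(z_hat^{l+1})) + z_hat^{l+1}
print(x.shape)               # torch.Size([2, 196, 64])
```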