
What is: Spatial-Reduction Attention?

Source: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Spatial-Reduction Attention, or SRA, is a multi-head attention module used in the Pyramid Vision Transformer architecture which reduces the spatial scale of the key $K$ and value $V$ before the attention operation. This reduces the computational/memory overhead. Details of the SRA in Stage $i$ can be formulated as follows:

$$\text{SRA}(Q, K, V) = \text{Concat}\left(\text{head}_0, \ldots, \text{head}_{N_i}\right) W^O$$

$$\text{head}_j = \text{Attention}\left(Q W_j^Q, \text{SR}(K) W_j^K, \text{SR}(V) W_j^V\right)$$

where $\text{Concat}(\cdot)$ is the concatenation operation. $W_j^Q \in \mathbb{R}^{C_i \times d_{\text{head}}}$, $W_j^K \in \mathbb{R}^{C_i \times d_{\text{head}}}$, $W_j^V \in \mathbb{R}^{C_i \times d_{\text{head}}}$, and $W^O \in \mathbb{R}^{C_i \times C_i}$ are linear projection parameters. $N_i$ is the number of attention heads in Stage $i$; therefore, the dimension of each head, $d_{\text{head}}$, is equal to $\frac{C_i}{N_i}$. $\text{SR}(\cdot)$ is the operation that reduces the spatial dimension of the input sequence ($K$ or $V$), which is written as:

$$\text{SR}(\mathbf{x}) = \text{Norm}\left(\text{Reshape}(\mathbf{x}, R_i)\, W^S\right)$$

Here, $\mathbf{x} \in \mathbb{R}^{(H_i W_i) \times C_i}$ represents an input sequence, and $R_i$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}(\mathbf{x}, R_i)$ is an operation that reshapes the input sequence $\mathbf{x}$ to a sequence of size $\frac{H_i W_i}{R_i^2} \times (R_i^2 C_i)$. $W^S \in \mathbb{R}^{(R_i^2 C_i) \times C_i}$ is a linear projection that reduces the dimension of the input sequence to $C_i$. $\text{Norm}(\cdot)$ refers to layer normalization.
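
To make the shapes concrete, below is a minimal PyTorch sketch of SRA for a single stage, written to follow the equations above. The module and argument names (`SpatialReductionAttention`, `sr_ratio`, `spatial_reduce`) are illustrative rather than the official PVT API, and the reference PVT code commonly realizes $\text{SR}(\cdot)$ with an equivalent strided convolution instead of the explicit reshape-plus-linear shown here.

```python
# Minimal sketch of Spatial-Reduction Attention (SRA) for one PVT stage.
# Assumptions: tokens come from an H x W feature map flattened row-major,
# and H, W are divisible by the reduction ratio R_i.
import torch
import torch.nn as nn


class SpatialReductionAttention(nn.Module):
    def __init__(self, dim, num_heads, sr_ratio):
        super().__init__()
        assert dim % num_heads == 0, "C_i must be divisible by N_i"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads          # d_head = C_i / N_i
        self.scale = self.head_dim ** -0.5
        self.sr_ratio = sr_ratio                  # R_i

        self.q = nn.Linear(dim, dim)              # stacks all W_j^Q
        self.kv = nn.Linear(dim, 2 * dim)         # stacks all W_j^K and W_j^V
        self.proj = nn.Linear(dim, dim)           # W^O

        # SR(x) = Norm(Reshape(x, R_i) W^S)
        self.sr = nn.Linear(sr_ratio * sr_ratio * dim, dim)   # W^S
        self.norm = nn.LayerNorm(dim)

    def spatial_reduce(self, x, H, W):
        # (B, H*W, C) -> (B, H*W / R^2, R^2 * C) -> (B, H*W / R^2, C)
        B, N, C = x.shape
        R = self.sr_ratio
        assert H % R == 0 and W % R == 0
        x = x.reshape(B, H // R, R, W // R, R, C)             # split into R x R patches
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // R) * (W // R), R * R * C)
        return self.norm(self.sr(x))

    def forward(self, x, H, W):
        # x: (B, H*W, C); H, W are the spatial size of the current stage.
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Reduce the spatial scale of the sequence used for keys and values.
        x_sr = self.spatial_reduce(x, H, W) if self.sr_ratio > 1 else x
        kv = self.kv(x_sr).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)          # each: (B, heads, N/R^2, d_head)

        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N/R^2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)                     # concat heads, then W^O
```

Because $K$ and $V$ are reduced to $\frac{H_i W_i}{R_i^2}$ tokens, the attention matrix shrinks from $(H_i W_i) \times (H_i W_i)$ to $(H_i W_i) \times \frac{H_i W_i}{R_i^2}$, which is where the computational and memory savings of SRA come from.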