What is: Shuffle Transformer?

The Shuffle Transformer block consists of the Shuffle Multi-Head Self-Attention module (ShuffleMHSA), the Neighbor-Window Connection module (NWC), and the MLP module. To introduce cross-window connections while keeping the efficient computation of non-overlapping windows, consecutive Shuffle Transformer blocks alternate between WMSA and Shuffle-WMSA: the first block uses the regular window partition strategy, and the second uses window-based self-attention with spatial shuffle. In addition, an NWC module is added to each block to strengthen connections among neighboring windows. Together, these let the Shuffle Transformer block build rich cross-window connections and augment the representation. Two consecutive Shuffle Transformer blocks are computed as:

$x^{l}=\text{WMSA}\left(\text{BN}\left(z^{l-1}\right)\right)+z^{l-1}$

$y^{l}=\text{NWC}\left(x^{l}\right)+x^{l}$

$z^{l}=\text{MLP}\left(\text{BN}\left(y^{l}\right)\right)+y^{l}$

$x^{l+1}=\text{Shuffle-WMSA}\left(\text{BN}\left(z^{l}\right)\right)+z^{l}$

$y^{l+1}=\text{NWC}\left(x^{l+1}\right)+x^{l+1}$

$z^{l+1}=\text{MLP}\left(\text{BN}\left(y^{l+1}\right)\right)+y^{l+1}$

where $x^l$, $y^l$, and $z^l$ denote the output features of the (Shuffle-)WMSA module, the Neighbor-Window Connection module, and the MLP module for block $l$, respectively; WMSA and Shuffle-WMSA denote window-based multi-head self-attention without and with spatial shuffle, respectively; and BN denotes batch normalization.
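
The residual structure of the equations can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the names `window_groups`, `residual`, `block`, and `block_pair` are hypothetical, features are plain lists of floats, and the attention, NWC, MLP, and BN modules are passed in as arbitrary callables. The 1-D token grouping in `window_groups` assumes that spatial shuffle gathers tokens at a fixed stride (by analogy with channel shuffle), which the text above does not spell out.

```python
from typing import Callable, List

Features = List[float]                # toy stand-in for a feature map
Op = Callable[[Features], Features]   # any module mapping features to features


def window_groups(n_tokens: int, window_size: int, shuffle: bool) -> List[List[int]]:
    """Token indices that attend to each other in each window (1-D illustration)."""
    assert n_tokens % window_size == 0, "tokens must split evenly into windows"
    n_windows = n_tokens // window_size
    if shuffle:
        # Assumed spatial shuffle: window w takes every n_windows-th token from w.
        return [[w + k * n_windows for k in range(window_size)]
                for w in range(n_windows)]
    # Regular partition: window w takes window_size contiguous tokens.
    return [[w * window_size + k for k in range(window_size)]
            for w in range(n_windows)]


def residual(op: Op, inp: Features, skip: Features) -> Features:
    """Return op(inp) + skip, element-wise."""
    return [a + b for a, b in zip(op(inp), skip)]


def block(z_prev: Features, attn: Op, nwc: Op, mlp: Op, bn: Op) -> Features:
    """One block: attention, neighbor-window connection, then MLP, each with a skip."""
    x = residual(attn, bn(z_prev), z_prev)  # x = Attn(BN(z_prev)) + z_prev
    y = residual(nwc, x, x)                 # y = NWC(x) + x
    z = residual(mlp, bn(y), y)             # z = MLP(BN(y)) + y
    return z


def block_pair(z_prev: Features, wmsa: Op, shuffle_wmsa: Op,
               nwc: Op, mlp: Op, bn: Op) -> Features:
    """Two consecutive blocks: WMSA first, then Shuffle-WMSA.

    Simplification: nwc, mlp, and bn are shared here; in a real model each
    block would hold its own parameters.
    """
    z = block(z_prev, wmsa, nwc, mlp, bn)
    return block(z, shuffle_wmsa, nwc, mlp, bn)
```

With 8 tokens and a window size of 4, `window_groups(8, 4, shuffle=False)` groups contiguous tokens, while `shuffle=True` groups tokens that originally sat in different windows. This is how alternating the two partitions lets information cross window boundaries.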

Source	Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com