
What is: Spatial Gating Unit?

Source: Pay Attention to MLPs
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, it is necessary for the layer $s(\cdot)$ to contain a contraction operation over the spatial dimension. With $f_{W,b}(Z) = WZ + b$ denoting a linear projection along the spatial (token) dimension, the layer $s(\cdot)$ is formulated as the output of linear gating:

$$s(Z) = Z \odot f_{W,b}(Z)$$

where $\odot$ denotes element-wise multiplication. For training stability, the authors find it critical to initialize $W$ with near-zero values and $b$ with ones, so that $f_{W,b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures that each gMLP block behaves like a regular FFN at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens over the course of learning.
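To make this concrete, here is a minimal PyTorch sketch of the basic SGU (the class name, `init_eps` value, and tensor layout are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Sketch of s(Z) = Z ⊙ f_{W,b}(Z), where f_{W,b}(Z) = WZ + b
    is a linear projection over the spatial (token) dimension."""

    def __init__(self, seq_len: int, init_eps: float = 1e-3):
        super().__init__()
        # W is (n x n) over the sequence length n; b has one entry per token.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Near-zero W and b = 1, so f_{W,b}(Z) ≈ 1 and s(Z) ≈ Z at init.
        nn.init.uniform_(self.spatial_proj.weight, -init_eps, init_eps)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq_len, d_model)
        # Project along the token axis, then gate multiplicatively.
        gate = self.spatial_proj(z.transpose(1, 2)).transpose(1, 2)
        return z * gate

# Usage: output has the same shape as the input.
sgu = SpatialGatingUnit(seq_len=128)
z = torch.randn(4, 128, 256)
out = sgu(z)  # (4, 128, 256), ≈ z right after initialization
```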

The authors further find it effective to split $Z$ into two independent parts $(Z_1, Z_2)$ along the channel dimension, one for the gating function and one for the multiplicative bypass:

$$s(Z) = Z_1 \odot f_{W,b}(Z_2)$$

They also normalize the input to $f_{W,b}$, which empirically improves the stability of large NLP models.
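A sketch of this split variant, again in PyTorch and under the same illustrative assumptions (LayerNorm as the normalization, `d_ffn` assumed even), might look like this:

```python
import torch
import torch.nn as nn

class SpatialGatingUnitSplit(nn.Module):
    """Sketch of the split variant: s(Z) = Z1 ⊙ f_{W,b}(norm(Z2)),
    where Z is split in half along the channel dimension."""

    def __init__(self, d_ffn: int, seq_len: int, init_eps: float = 1e-3):
        super().__init__()
        assert d_ffn % 2 == 0, "channels must split evenly into (Z1, Z2)"
        # Only the gating half (d_ffn // 2 channels) is normalized.
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.uniform_(self.spatial_proj.weight, -init_eps, init_eps)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq_len, d_ffn); split channels into bypass and gate halves.
        z1, z2 = z.chunk(2, dim=-1)
        z2 = self.norm(z2)
        gate = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)
        return z1 * gate  # (batch, seq_len, d_ffn // 2)
```

Note that with this split the output has half as many channels as the input, so the projection that follows the SGU inside a gMLP block must expect `d_ffn // 2` input channels.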