
What is: gMLP?

Source: Pay Attention to MLPs
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

gMLP is an MLP-based alternative to Transformers without self-attention, which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of $L$ blocks with identical size and structure. Let $X \in \mathbb{R}^{n \times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:

Z=σ(XU),Z~=s(Z),Y=Z~VZ=\sigma(X U), \quad \tilde{Z}=s(Z), \quad Y=\tilde{Z} V

where σ\sigma is an activation function such as GeLU. UU and VV define linear projections along the channel dimension - the same as those in the FFNs of Transformers (e.g., their shapes are 768×3072768 \times 3072 and 3072×7683072 \times 768 for BERTbase \text{BERT}_{\text {base }}).

A key ingredient is $s(\cdot)$, a layer that captures spatial interactions. When $s$ is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. One of the major focuses is therefore to design a good $s$ capable of capturing complex spatial interactions across tokens. This leads to the use of a Spatial Gating Unit, which involves a modified linear gating.
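Below is a sketch of such a Spatial Gating Unit, following the paper's description: $Z$ is split in half along the channel dimension, one half is normalized and projected along the sequence (token) dimension, and the result gates the other half element-wise. The near-zero weight and unit bias initialization (so that $s(Z) \approx Z_1$ early in training) is from the paper; the class and argument names here are illustrative.

```python
import torch
import torch.nn as nn


class SpatialGatingUnit(nn.Module):
    """s(Z) = Z1 * f(Z2), where f mixes information across tokens (illustrative sketch)."""

    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Spatial projection: an n x n linear map applied across the token dimension.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial_proj.weight)  # initialize so the gate starts near 1,
        nn.init.ones_(self.spatial_proj.bias)     # i.e. s(Z) ~ Z1 at the start of training

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (batch, n, d_ffn)
        z1, z2 = z.chunk(2, dim=-1)                       # split channels into two halves
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return z1 * z2                                    # element-wise gating
```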

The overall block layout is inspired by inverted bottlenecks, which define $s(\cdot)$ as a spatial depthwise convolution. Note that, unlike Transformers, gMLP does not require position embeddings, because such information is captured in $s(\cdot)$.
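As an illustrative usage of the two sketches above, a model is simply a stack of $L$ identical blocks applied to token embeddings, with no positional embeddings added; the hyperparameters here are examples only.

```python
# Reuses the GMLPBlock and SpatialGatingUnit classes sketched above.
L, d_model, d_ffn, seq_len = 6, 768, 3072, 128
blocks = nn.Sequential(*[
    GMLPBlock(d_model, d_ffn, spatial_unit=SpatialGatingUnit(d_ffn, seq_len))
    for _ in range(L)
])
x = torch.randn(2, seq_len, d_model)  # token embeddings only, no position embeddings
y = blocks(x)                         # shape: (2, 128, 768)
```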