
What is: Mix-FFN?

Source: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Mix-FFN is a feed-forward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce location information, but the resolution of the PE is fixed. When the test resolution differs from the training resolution, the positional code must therefore be interpolated, which often causes a drop in accuracy. To alleviate this problem, CPVT uses a 3×3 convolution together with the PE to implement a data-driven PE. The authors of SegFormer argue that positional encoding is actually not necessary for semantic segmentation. Instead, they use Mix-FFN, which exploits the fact that zero padding leaks location information, by using a 3×3 convolution directly in the feed-forward network (FFN). Mix-FFN can be formulated as:

\mathbf{x}_{\text{out}} = \operatorname{MLP}\left(\operatorname{GELU}\left(\operatorname{Conv}_{3 \times 3}\left(\operatorname{MLP}\left(\mathbf{x}_{in}\right)\right)\right)\right) + \mathbf{x}_{in}

where \mathbf{x}_{in} is the feature from a self-attention module. Mix-FFN mixes a 3×3 convolution and an MLP into each FFN.
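The formula above can be sketched in plain NumPy to make the data flow concrete. This is a minimal illustration, not the SegFormer implementation: it assumes a single image whose N = H×W tokens are folded back to a 2D grid for the convolution, and it uses a depthwise 3×3 convolution with zero padding (the padding is what leaks location information). All function and parameter names here are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mix_ffn(x, W1, b1, conv_k, conv_b, W2, b2, H, W):
    """Mix-FFN forward pass for an (N, C) token sequence, N = H * W.

    Computes x_out = MLP(GELU(Conv3x3(MLP(x_in)))) + x_in, where the
    3x3 convolution is depthwise and zero-padded.
    """
    h = x @ W1 + b1                      # first MLP: (N, C) -> (N, Ch)
    N, Ch = h.shape
    fmap = h.reshape(H, W, Ch)           # fold tokens back onto the 2D grid
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))  # zero padding at borders
    out = np.zeros_like(fmap)
    for i in range(H):                   # depthwise 3x3 convolution
        for j in range(W):
            patch = padded[i:i + 3, j:j + 3, :]          # (3, 3, Ch)
            out[i, j] = np.einsum('klc,klc->c', patch, conv_k) + conv_b
    h = gelu(out.reshape(N, Ch))
    return h @ W2 + b2 + x               # second MLP + residual connection
```

Because the convolution's zero padding behaves differently at image borders than in the interior, the output depends on token position even though no explicit positional encoding is added, and the module works at any test resolution without interpolating positional codes.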