The **MLP-Mixer** architecture (or “Mixer” for short) is an image architecture that doesn't use convolutions or self-attention. Instead, Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.

It accepts a sequence of linearly projected image patches (also referred to as tokens) shaped as a “patches × channels” table as an input, and maintains this dimensionality. Mixer makes use of two types of MLP layers: channel-mixing MLPs and token-mixing MLPs. The channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. The token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. These two types of layers are interleaved to enable interaction of both input dimensions.
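
The interleaving described above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the layer normalization, skip connections, and GELU nonlinearity follow the MLP-Mixer paper rather than the text above, biases and learned normalization parameters are omitted, and all names (`mixer_block`, `tok_w1`, `ch_w1`, and so on) are hypothetical.

```python
import math
import random


def gelu(x):
    # tanh approximation of GELU, used here as the scalar nonlinearity
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(table, eps=1e-6):
    # Normalize each row (token) over its channels; no learned scale or shift.
    out = []
    for row in table:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def mlp(rows, w1, w2):
    # Two-layer MLP applied to every row independently (biases omitted).
    hidden = [[gelu(v) for v in r] for r in matmul(rows, w1)]
    return matmul(hidden, w2)


def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    # x is the "patches x channels" table.
    # Token mixing: transpose so that each channel (a column of x) becomes a row.
    y = transpose(layer_norm(x))            # channels x patches
    y = transpose(mlp(y, tok_w1, tok_w2))   # back to patches x channels
    x = add(x, y)                           # skip connection
    # Channel mixing: the MLP sees each token (a row of x) independently.
    y = mlp(layer_norm(x), ch_w1, ch_w2)
    return add(x, y)                        # skip connection


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_patches, channels = 4, 6
tok_hidden, ch_hidden = 8, 12
x = random_matrix(num_patches, channels, rng, scale=1.0)
tok_w1 = random_matrix(num_patches, tok_hidden, rng)
tok_w2 = random_matrix(tok_hidden, num_patches, rng)
ch_w1 = random_matrix(channels, ch_hidden, rng)
ch_w2 = random_matrix(ch_hidden, channels, rng)

out = mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2)
print(len(out), len(out[0]))  # 4 6
```

The output keeps the "patches × channels" shape of the input, so blocks of this kind can be stacked. The token-mixing weights are sized by the number of patches, while the channel-mixing weights are sized by the number of channels.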

**Zoneout** is a method for regularizing [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks). At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like [dropout](https://paperswithcode.com/method/dropout), zoneout uses random noise to train a pseudo-ensemble, improving generalization. But because it preserves hidden units instead of dropping them, gradient and state information propagate more readily through time, as in feedforward [stochastic depth](https://paperswithcode.com/method/stochastic-depth) networks.
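
The per-unit update can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `zoneout` and its parameters are hypothetical, a single zoneout rate is shared by all units, and the inference branch uses the expected value of the random mask, as the Zoneout paper describes by analogy with dropout (the text above does not cover inference).

```python
import random


def zoneout(h_prev, h_new, rate, training=True, rng=None):
    """Mix the previous and candidate hidden states of one RNN timestep.

    rate is the probability that a unit keeps its previous value.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = rng or random
    if training:
        # Each unit independently keeps its old value with probability `rate`.
        return [p if rng.random() < rate else n for p, n in zip(h_prev, h_new)]
    # Assumption taken from the paper: at inference, use the expectation of the mask.
    return [rate * p + (1.0 - rate) * n for p, n in zip(h_prev, h_new)]


h_prev = [1.0, 2.0, 3.0]
h_new = [10.0, 20.0, 30.0]
mixed = zoneout(h_prev, h_new, rate=0.5, rng=random.Random(0))
print(mixed)  # each entry comes from either h_prev or h_new
```

Unlike dropout, which would set a dropped unit to zero, a zoned-out unit carries its previous value forward unchanged, which is what keeps the state path through time intact.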

Zoneout

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

MLP-Mixer

MLP-Mixer: An all-MLP Architecture for Vision

**GHM-C** is a loss function designed to balance the gradient flow for anchor classification. The gradient harmonizing mechanism (GHM) first estimates the gradient density, that is, the number of examples in a mini-batch whose gradient norms are similar, and then attaches a harmonizing parameter to the gradient of each example according to that density. This modification of the gradient can be implemented equivalently by reformulating the loss function; the classification loss with GHM embedded in it is called the GHM-C loss. Because the gradient density is a statistic that depends on the distribution of examples in a mini-batch, GHM-C is a dynamic loss that adapts both to changes in the data distribution from batch to batch and to updates of the model.
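
The reweighting can be illustrated with a small sketch for binary classification. It assumes the formulation in the GHM paper, which the text above does not spell out: the gradient norm of an example is g = |p − p*| (predicted probability minus target), the density is approximated by counting examples in `num_bins` equal-width bins of g, and the harmonizing parameter is the batch size divided by the density. Function and parameter names are hypothetical, and practical implementations may normalize differently.

```python
import math


def ghm_c_weights(probs, targets, num_bins=10):
    """Harmonizing parameter (beta) for each example in a mini-batch."""
    n = len(probs)
    # Gradient norm of sigmoid cross-entropy w.r.t. the logit: g = |p - p*|.
    grads = [abs(p - t) for p, t in zip(probs, targets)]
    # Bin index of each gradient norm; g == 1.0 is placed in the last bin.
    bins = [min(int(g * num_bins), num_bins - 1) for g in grads]
    counts = [0] * num_bins
    for b in bins:
        counts[b] += 1
    # Density approximated as (examples in the bin) / (bin width), width = 1 / num_bins.
    return [n / (counts[b] * num_bins) for b in bins]


def ghm_c_loss(probs, targets, num_bins=10, eps=1e-12):
    """Cross-entropy averaged over the batch, with each term scaled by its beta."""
    weights = ghm_c_weights(probs, targets, num_bins)
    total = 0.0
    for p, t, w in zip(probs, targets, weights):
        ce = -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        total += w * ce
    return total / len(probs)


# Three easy positives share one bin; one hard positive sits alone in another.
probs = [0.95, 0.96, 0.97, 0.2]
targets = [1, 1, 1, 1]
weights = ghm_c_weights(probs, targets)
print(weights)
print(ghm_c_loss(probs, targets))
```

Examples in crowded regions of gradient norm (here the three easy ones) receive a smaller weight than the isolated hard example. Because the bin counts are recomputed for every batch, the weights change as the model and the data distribution change.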

Source	MLP-Mixer: An all-MLP Architecture for Vision
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com