A Mixer layer is the building block of the MLP-Mixer architecture proposed by Tolstikhin et al. (2021) for computer vision. Mixer layers consist purely of MLPs, without convolutions or attention. A Mixer layer takes a sequence of embedded image patches (tokens) as input and produces an output of the same shape, much like a Vision Transformer encoder block. As the name suggests, a Mixer layer "mixes" information across tokens and across channels through the "token mixing" and "channel mixing" MLPs it contains. It also reuses techniques from earlier architectures, such as layer normalization, skip connections, and regularization methods.

Image credit: Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., ... & Dosovitskiy, A. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34, 24261-24272.
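
The token-mixing and channel-mixing steps of a Mixer layer can be illustrated with a minimal, dependency-free sketch. It is not the reference implementation: biases, the learned scale and shift of layer normalization, and dropout are omitted, the weights are random, and all function names and sizes (`mixer_layer`, `tok_hidden`, `ch_hidden`, and so on) are assumptions made for the example.

```python
import math
import random

def layer_norm(rows, eps=1e-6):
    """Normalize each row (one token) to zero mean and unit variance.
    Simplification: no learned scale or shift."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mlp(rows, w1, w2):
    """Two-layer MLP applied to every row independently (biases omitted)."""
    hidden = [[gelu(v) for v in row] for row in matmul(rows, w1)]
    return matmul(hidden, w2)

def mixer_layer(x, tok_w1, tok_w2, ch_w1, ch_w2):
    """x has shape (tokens, channels); the output has the same shape."""
    # Token mixing: transpose so each row holds one channel across all tokens,
    # apply the MLP, transpose back, and add the skip connection.
    mixed = transpose(mlp(transpose(layer_norm(x)), tok_w1, tok_w2))
    x = add(x, mixed)
    # Channel mixing: the MLP acts on each token's channel vector.
    return add(x, mlp(layer_norm(x), ch_w1, ch_w2))

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_tokens, num_channels, tok_hidden, ch_hidden = 4, 6, 8, 12
x = random_matrix(num_tokens, num_channels, rng, scale=1.0)
out = mixer_layer(
    x,
    random_matrix(num_tokens, tok_hidden, rng),    # token-mixing MLP, layer 1
    random_matrix(tok_hidden, num_tokens, rng),    # token-mixing MLP, layer 2
    random_matrix(num_channels, ch_hidden, rng),   # channel-mixing MLP, layer 1
    random_matrix(ch_hidden, num_channels, rng),   # channel-mixing MLP, layer 2
)
print(len(out), len(out[0]))  # same shape as the input
```

Because both MLPs sit on a residual branch, setting all weights to zero makes the layer return its input unchanged, which is a quick sanity check for the skip connections.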

A **Memory Network** combines a memory component that can be read from and written to with the inference capabilities of a neural network model. The motivation is that many neural networks lack a long-term memory component, and their existing memory, encoded by states and weights, is too small and not compartmentalized enough to accurately remember facts from the past (RNNs, for example, have difficulty memorizing and performing tasks like copying).

A memory network consists of a memory $\textbf{m}$ (an array of objects indexed by $\textbf{m}\_{i}$) and four potentially learned components:

- Input feature map $I$ - feature representation of the data input.
- Generalization $G$ - updates old memories given the new input.
- Output feature map $O$ - produces a new output feature map given the new input and the current memory state.
- Response $R$ - converts output into the desired response. 

Given an input $x$ (e.g., an input character, word or sentence depending on the granularity chosen, an image or an audio signal) the flow of the model is as follows:

1. Convert $x$ to an internal feature representation $I\left(x\right)$.
2. Update memories $m\_{i}$ given the new input: $m\_{i} = G\left(m\_{i}, I\left(x\right), m\right)$, $\forall{i}$.
3. Compute output features $o$ given the new input and the memory: $o = O\left(I\left(x\right), m\right)$.
4. Finally, decode output features $o$ to give the final response: $r = R\left(o\right)$.

This process is applied at both train and test time, if there is a distinction between such phases; that is, memories are also stored at test time, but the model parameters of $I$, $G$, $O$ and $R$ are not updated. Memory networks cover a wide class of possible implementations. The components $I$, $G$, $O$ and $R$ can potentially use any existing ideas from the machine learning literature.
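
The flow through $I$, $G$, $O$ and $R$ can be made concrete with a toy sketch. Everything here is a hand-written assumption rather than a learned component: $I$ tokenizes text, $G$ appends to the next free slot, $O$ retrieves the memory with the largest word overlap, and $R$ joins the retrieved words back into a string. As a further simplification, facts only run steps 1 and 2 (write), while the question only runs steps 1, 3 and 4 (read), so the question does not retrieve itself.

```python
def I(x):
    """Input feature map: lowercase word tokens with end punctuation stripped."""
    return tuple(x.lower().strip(".?!").split())

def G(m, features):
    """Generalization: store the new features in the next free slot.
    Simplification: existing memories m_i are left unchanged."""
    m.append(features)
    return m

def O(features, m):
    """Output feature map: the stored memory sharing the most words with the input."""
    query = set(features)
    return max(m, key=lambda m_i: len(query & set(m_i)))

def R(o):
    """Response: decode the output features into text."""
    return " ".join(o)

memory = []  # the memory m, an array of objects m_i

# Write phase: steps 1 and 2 for each incoming fact.
for fact in ["Joe went to the kitchen.",
             "Fred went to the garden.",
             "Joe picked up the milk."]:
    G(memory, I(fact))

# Read phase: steps 1, 3 and 4 for a question.
question = I("Where is the milk?")
response = R(O(question, memory))
print(response)  # joe picked up the milk
```

In a trained memory network the word-overlap score in `O` would be replaced by a learned scoring function, and `R` could be a learned decoder rather than a string join.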

Image Source: [Adrian Colyer](https://blog.acolyer.org/2016/03/10/memory-networks/)

Memory Network

Memory Networks

Mixer Layer

MLP-Mixer: An all-MLP Architecture for Vision

**Region-based Fully Convolutional Networks**, or **R-FCNs**, are a type of region-based object detector. In contrast to previous region-based object detectors such as Fast/[Faster R-CNN](https://paperswithcode.com/method/faster-r-cnn) that apply a costly per-region subnetwork hundreds of times, R-FCN is fully convolutional with almost all computation shared on the entire image.

To achieve this, R-FCN utilises position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.
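
A rough sketch of how position-sensitive score maps can be pooled for one region is given below. It assumes a $k \times k$ grid of bins, one score map per (bin, class) pair, average pooling inside each bin, and averaging over bins as the vote. The channel layout `(i * k + j) * num_classes + c`, the function name `ps_roi_pool`, and the integer box coordinates are illustrative assumptions, not details taken from the text above.

```python
import math

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Pool a region over position-sensitive score maps.

    score_maps: list of k*k*num_classes 2D maps (lists of rows).
    roi: (x0, y0, x1, y1) in feature-map coordinates, end-exclusive.
    Returns one score per class, averaged over the k*k bins.
    """
    x0, y0, x1, y1 = roi
    w, h = x1 - x0, y1 - y0
    scores = [0.0] * num_classes
    for i in range(k):  # bin row
        ys = range(y0 + (i * h) // k, y0 + math.ceil((i + 1) * h / k))
        for j in range(k):  # bin column
            xs = range(x0 + (j * w) // k, x0 + math.ceil((j + 1) * w / k))
            for c in range(num_classes):
                # Each bin reads only from the map dedicated to its position.
                fmap = score_maps[(i * k + j) * num_classes + c]
                bin_mean = sum(fmap[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
                scores[c] += bin_mean / (k * k)
    return scores

# Toy maps: class 1 fires in the top-left map only in the top-left quadrant, etc.
k, num_classes, size = 2, 2, 4
score_maps = [[[0.0] * size for _ in range(size)] for _ in range(k * k * num_classes)]
for i in range(k):
    for j in range(k):
        fmap = score_maps[(i * k + j) * num_classes + 1]
        for y in range(i * 2, i * 2 + 2):
            for x in range(j * 2, j * 2 + 2):
                fmap[y][x] = 1.0

aligned = ps_roi_pool(score_maps, (0, 0, 4, 4), k, num_classes)
misaligned = ps_roi_pool(score_maps, (0, 0, 2, 2), k, num_classes)
print(aligned, misaligned)  # [0.0, 1.0] [0.0, 0.25]
```

The box that lines up with the object gets the full class score, while the smaller, misaligned box gets only a quarter of it. This is the translation-variant behaviour the score maps are meant to provide, even though the maps themselves come from shared convolutional computation.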

Source	MLP-Mixer: An all-MLP Architecture for Vision
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com