
What is: Sparse Transformer?

Source: Generating Long Sequences with Sparse Transformers
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

A Sparse Transformer is a Transformer-based architecture that uses sparse factorizations of the attention matrix to reduce the time and memory cost of attention from O(n²) to O(n√n). Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels which efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backwards pass to reduce memory usage.
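To make the factorization idea concrete, here is a minimal NumPy sketch of one of the patterns described in the paper, strided attention: each query attends to a local window of the previous `stride` positions plus every `stride`-th earlier position. With `stride ≈ √n`, each row of the mask has O(√n) entries, giving the O(n√n) total. This sketch computes the mask densely for clarity; the function names and the dense masked-softmax are illustrative, not the paper's custom kernels, which avoid materializing the full n×n matrix.

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean mask: query i may attend to key j (j <= i) if j is in the
    local window of the previous `stride` positions, or i - j is a
    multiple of `stride` (the strided head)."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            if i - j < stride:            # local window
                mask[i, j] = True
            elif (i - j) % stride == 0:   # strided positions
                mask[i, j] = True
    return mask

def sparse_attention(q, k, v, mask):
    """Masked scaled dot-product attention (dense compute for clarity)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # disallowed pairs get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 16, 8
stride = int(np.ceil(np.sqrt(n)))  # stride ~ sqrt(n)
mask = strided_sparse_mask(n, stride)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sparse_attention(q, k, v, mask)
print(mask.sum(), "of", n * n, "query-key pairs attended")
```

For n = 16 and stride = 4 the mask keeps 82 of the 256 possible query-key pairs, versus 136 for full causal attention; the gap widens as n grows, since each row stays O(√n) instead of O(n).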