
What is: Factorized Random Synthesized Attention?

Source: Synthesizer: Rethinking Self-Attention in Transformer Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Factorized Random Synthesized Attention, introduced with the Synthesizer architecture, is similar to Factorized Dense Synthesized Attention, but for Random Synthesizers. Letting $R$ be a randomly initialized matrix, we factorize $R$ into low-rank matrices $R_{1}, R_{2} \in \mathbb{R}^{l \times k}$ used in the attention function:

$$Y = \text{Softmax}\left(R_{1}R_{2}^{T}\right)G\left(X\right)$$

Here $G\left(\cdot\right)$ is a parameterized function that is equivalent to $V$ in Scaled Dot-Product Attention.
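
As a concrete illustration, below is a minimal sketch of a single-head layer with this factorization, assuming PyTorch. The class name `FactorizedRandomSynthesizer` and the shapes and hyperparameters are illustrative choices, not taken from the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedRandomSynthesizer(nn.Module):
    """Single-head factorized random synthesized attention (illustrative sketch)."""

    def __init__(self, seq_len: int, d_model: int, rank: int = 8):
        super().__init__()
        # R is factorized into two low-rank matrices R1, R2 of shape (l, k),
        # replacing a full (l, l) randomly initialized attention matrix.
        self.R1 = nn.Parameter(torch.randn(seq_len, rank))
        self.R2 = nn.Parameter(torch.randn(seq_len, rank))
        # G(.) plays the role of the value projection V in scaled dot-product attention.
        self.G = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # The attention weights depend only on the learned R1 and R2,
        # not on the input tokens themselves.
        attn = F.softmax(self.R1 @ self.R2.T, dim=-1)   # (l, l)
        return attn @ self.G(x)                         # (batch, l, d_model)


# Usage sketch
x = torch.randn(2, 64, 128)  # batch of 2, l = 64, d_model = 128
layer = FactorizedRandomSynthesizer(seq_len=64, d_model=128, rank=8)
y = layer(x)                 # (2, 64, 128)
```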

For each head, the factorization reduces the parameter cost from $l^{2}$ to $2(lk)$, where $k \ll l$, and hence helps prevent overfitting. In practice, we use a small value of $k = 8$.
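
For instance, assuming an illustrative sequence length of $l = 512$ with $k = 8$, the per-head count drops from $l^{2} = 262{,}144$ to $2lk = 8{,}192$ parameters, roughly a $32\times$ reduction.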

The basic idea of a Random Synthesizer is to not rely on pairwise token interactions or any information from individual tokens, but rather to learn a task-specific alignment that works well globally across many samples.