
What is: Random Synthesized Attention?

Source: Synthesizer: Rethinking Self-Attention in Transformer Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Random Synthesized Attention is a form of synthesized attention where the attention weights are not conditioned on any input tokens. Instead, the attention weights are initialized to random values. It was introduced with the Synthesizer architecture. Random Synthesized Attention contrasts with Dense Synthesized Attention, which conditions on each token independently, as opposed to the pairwise token interactions of the vanilla Transformer model.

Let $R$ be a randomly initialized matrix. Random Synthesized Attention is defined as:

$$Y = \text{Softmax}(R)G(X)$$

where $R \in \mathbb{R}^{l \times l}$. Notably, each head adds $l^2$ parameters to the overall network. The basic idea of the Random Synthesizer is to not rely on pairwise token interactions or any information from individual tokens, but rather to learn a task-specific alignment that works well globally across many samples. This is a direct generalization of the recently proposed fixed self-attention patterns of Raganato et al. (2020).
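As a concrete illustration, here is a minimal single-head sketch in PyTorch. The class name `RandomSynthesizedAttention`, the `max_len` and `trainable` arguments, and the choice of a single linear layer for $G$ are illustrative assumptions, not the paper's reference implementation; the key point is that the attention logits $R$ are a learned parameter independent of the input, and only the value projection $G(X)$ depends on the tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomSynthesizedAttention(nn.Module):
    """Single-head Random Synthesized Attention (illustrative sketch).

    The attention matrix R is an l x l parameter that does not depend on the
    input tokens; only the value projection G(X) is computed from the input.
    """

    def __init__(self, max_len: int, d_model: int, trainable: bool = True):
        super().__init__()
        # R: l x l attention logits, randomly initialized.
        # This adds l^2 parameters per head (assuming R is trainable).
        self.R = nn.Parameter(torch.randn(max_len, max_len), requires_grad=trainable)
        # G(X): a simple linear value projection, analogous to V in vanilla attention.
        self.G = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d_model); the sequence length must not exceed max_len.
        l = x.size(1)
        attn = F.softmax(self.R[:l, :l], dim=-1)  # (l, l), shared across all samples
        return attn @ self.G(x)                   # (batch, l, d_model)


# Usage: a batch of 2 sequences of length 16 with model dimension 64.
x = torch.randn(2, 16, 64)
layer = RandomSynthesizedAttention(max_len=32, d_model=64)
y = layer(x)
print(y.shape)  # torch.Size([2, 16, 64])
```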