What is: Performer?

Source: Rethinking Attention with Performers
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Performer is a Transformer architecture that can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, using only linear (as opposed to quadratic) space and time complexity and without relying on priors such as sparsity or low-rankness. Performers are linear architectures fully compatible with regular Transformers and come with strong theoretical guarantees: unbiased or nearly unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. To approximate softmax attention kernels, Performers use Fast Attention Via positive Orthogonal Random features (FAVOR+), which leverages new methods for approximating softmax and Gaussian kernels.
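
To make the mechanism concrete, here is a minimal NumPy sketch of the FAVOR+ idea: queries and keys are passed through a positive random-feature map phi(x) = exp(w·x - ||x||²/2) / sqrt(m) built from block-orthogonal Gaussian directions, so that phi(q)·phi(k) estimates the softmax kernel exp(q·k / sqrt(d)) and attention can be computed in linear time. This is a simplified illustration, not the paper's reference implementation; the function names and the feature count m are assumptions chosen for clarity.

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """m random directions: block-orthogonal rows with Gaussian-distributed norms."""
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) square blocks
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))      # orthonormal rows
        norms = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
        blocks.append(q * norms[:, None])  # restore chi-distributed row lengths
    return np.vstack(blocks)[:m]           # (m, d)

def positive_random_features(x, omega):
    """Positive feature map phi(x) = exp(omega @ x - ||x||^2 / 2) / sqrt(m)."""
    m = omega.shape[0]
    projection = x @ omega.T                                   # (n, m)
    norm_sq = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0     # (n, 1)
    return np.exp(projection - norm_sq) / np.sqrt(m)

def favor_attention(Q, K, V, m=128, seed=0):
    """Linear-time approximation of softmax attention via FAVOR+-style features."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    omega = orthogonal_gaussian(m, d, rng)
    # Scale so phi(q) . phi(k) estimates exp(q . k / sqrt(d)), as in softmax attention.
    q_prime = positive_random_features(Q / d ** 0.25, omega)   # (n, m)
    k_prime = positive_random_features(K / d ** 0.25, omega)   # (n, m)
    # Compute phi(Q) (phi(K)^T V) instead of softmax(Q K^T) V: O(n m d), not O(n^2 d).
    kv = k_prime.T @ V                                         # (m, d_v)
    numerator = q_prime @ kv                                   # (n, d_v)
    normalizer = q_prime @ k_prime.sum(axis=0)                 # (n,)
    return numerator / normalizer[:, None]
```

For small sequence lengths one can sanity-check this sketch against exact softmax attention, softmax(QKᵀ/√d)V; the approximation tightens as the number of random features m grows.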