
What is: PermuteFormer?

Source: PermuteFormer: Efficient Relative Position Encoding for Long Sequences
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

PermuteFormer is a Performer-based model with relative position encoding that scales linearly with sequence length. PermuteFormer applies a position-dependent transformation to queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by the absolute positions of tokens, only by their relative positions.
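The key property can be checked numerically: a permutation is an orthogonal transform, so if the query at position i is permuted by π applied i times and the key at position j by π applied j times, their dot product depends only on the relative offset j - i. The sketch below (a toy NumPy illustration, not the authors' code; `pi`, `apply`, and the vector sizes are assumptions) demonstrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
pi = rng.permutation(d)  # base permutation, shared across positions

def apply(x, times):
    """Apply the base permutation pi to x the given number of times."""
    for _ in range(times):
        x = x[pi]
    return x

q = rng.standard_normal(d)
k = rng.standard_normal(d)
i, j = 3, 5

# Score computed with absolute positions i and j ...
lhs = apply(q, i) @ apply(k, j)
# ... equals the score with both positions shifted so the query sits at 0.
rhs = q @ apply(k, j - i)
assert np.isclose(lhs, rhs)
```

Because only the difference j - i matters, the transformation acts as a relative position encoding while adding negligible cost on top of Performer's linear attention.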

Each token’s query/key feature is illustrated as a row of blocks in the figure, with its elements marked in different colors. The position-aware permutation shuffles the elements of each token’s query/key feature along the head-size dimension within each attention head. The permutation applied to a query/key feature differs depending on the token’s position.