**Cross-Covariance Image Transformers**, or **XCiT**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) that aims to combine the accuracy of [conventional transformers](https://paperswithcode.com/methods/category/transformers) with the scalability of [convolutional architectures](https://paperswithcode.com/methods/category/convolutional-neural-networks). 

The [self-attention operation](https://paperswithcode.com/method/scaled) underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with time and memory complexity that is quadratic in the number of tokens, hindering application to long sequences and high-resolution images. The authors propose a “transposed” version of self-attention called [cross-covariance attention](https://paperswithcode.com/method/cross-covariance-attention) that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.
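To make the “transposed” attention concrete, here is a minimal single-head sketch in dependency-free Python. It assumes queries, keys and values arrive as N x d lists of lists (N tokens, d channels). The L2 normalisation of each channel, the temperature parameter and the choice of softmax axis are assumptions drawn from how XCiT is commonly implemented, not details stated in the text above, and all function names are illustrative rather than taken from any library. The point it shows is that the attention map is d x d, so its size does not grow with the number of tokens.

```python
import math
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def l2_normalize_rows(m, eps=1e-12):
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row)) + eps
        out.append([x / norm for x in row])
    return out


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_covariance_attention(q, k, v, temperature=1.0):
    """Single-head sketch. q, k, v are N x d; returns (N x d output, d x d map)."""
    # Work channel-major: each row holds one channel across all N tokens.
    q_ch = l2_normalize_rows(transpose(q))  # d x N (normalisation is an assumption)
    k_ch = l2_normalize_rows(transpose(k))  # d x N
    # Channel-to-channel similarities: a d x d matrix, independent of N.
    scores = matmul(q_ch, transpose(k_ch))
    scores = [[s / temperature for s in row] for row in scores]
    attn = softmax_rows(scores)  # each row sums to 1 (axis choice is an assumption)
    # Mix the value channels with the d x d map, then return to token-major layout.
    out_ch = matmul(attn, transpose(v))  # d x N
    return transpose(out_ch), attn


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


tokens, channels = 6, 3
q, k, v = (random_matrix(tokens, channels) for _ in range(3))
out, attn = cross_covariance_attention(q, k, v)
print(len(attn), len(attn[0]))  # attention map shape: channels x channels
print(len(out), len(out[0]))    # output shape: tokens x channels
```

Because the only matrix whose size depends on N is the input itself, the cost of this sketch grows linearly with the number of tokens, in contrast to the N x N map of token self-attention.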

**PermuteFormer** is a [Performer](https://paperswithcode.com/method/performer)-based model with relative position encoding that scales linearly with sequence length. PermuteFormer applies a position-dependent transformation to queries and keys to encode positional information into the attention module. This transformation is constructed so that the final output of self-attention is not affected by the absolute positions of tokens, only by their relative positions.

Each token’s query / key feature is illustrated as a row of blocks in the figure, with its elements marked in different colors. The position-aware permutation reorders the elements of each token’s query / key feature along the head-size dimension within each attention head, and the permutation applied depends on the token’s position.
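The effect of a position-aware permutation can be illustrated with a small sketch. It assumes that the token at position i has a fixed permutation applied i times to both its query and its key; the permutation `perm`, the vectors and the function names are hypothetical, and the Performer feature map and any additional scaling used by the actual model are left out. Under these assumptions the query-key dot product depends only on the offset between the two positions.

```python
def apply_permutation(vec, perm):
    """Return a new list in which slot perm[c] holds vec[c]."""
    out = [0] * len(vec)
    for src, dst in enumerate(perm):
        out[dst] = vec[src]
    return out


def permute_by_position(vec, perm, position):
    """Apply the same permutation `position` times."""
    for _ in range(position):
        vec = apply_permutation(vec, perm)
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def position_aware_score(q, k, perm, q_pos, k_pos):
    """Dot product of a query and a key after position-dependent permutation."""
    return dot(permute_by_position(q, perm, q_pos),
               permute_by_position(k, perm, k_pos))


perm = [2, 0, 3, 1]  # hypothetical permutation of a head of size 4
q = [1, 2, 3, 4]     # integer features keep the comparison exact
k = [5, 6, 7, 8]

# Same relative offset (key one position after the query), different absolute positions.
near_start = position_aware_score(q, k, perm, q_pos=0, k_pos=1)
shifted = position_aware_score(q, k, perm, q_pos=5, k_pos=6)
# A different relative offset (query and key at the same position).
same_position = position_aware_score(q, k, perm, q_pos=0, k_pos=0)

print(near_start == shifted)        # shifting both positions leaves the score unchanged
print(near_start != same_position)  # changing the offset changes the score here
```

The invariance holds because a permutation only reorders coordinates: applying it to both vectors leaves their dot product unchanged, so only the difference in the number of applications matters.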

PermuteFormer

PermuteFormer: Efficient Relative Position Encoding for Long Sequences

XCiT

XCiT: Cross-Covariance Image Transformers

**CodeGen** is an autoregressive transformer trained with next-token prediction language modeling as the learning objective, on a natural language corpus and programming language data curated from GitHub.
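As a rough illustration of a next-token prediction objective, the sketch below computes the mean negative log-likelihood of each token given the model's scores at the preceding position. It is a generic formulation, not CodeGen's training code: the function names, the toy vocabulary of four tokens and the hand-written logits are all assumptions made for the example.

```python
import math


def log_softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean negative log-likelihood of token t+1 given the logits at position t."""
    if len(logits_per_position) != len(token_ids):
        raise ValueError("need one logits vector per input token")
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)


token_ids = [0, 2, 1, 3]  # toy sequence over a vocabulary of 4 tokens

# A model with no preference: every next token is equally likely.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]

# A model that puts most of its score on the token that actually comes next.
confident_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]
for t in range(len(token_ids) - 1):
    confident_logits[t][token_ids[t + 1]] = 5.0

uniform_loss = next_token_loss(uniform_logits, token_ids)
confident_loss = next_token_loss(confident_logits, token_ids)
print(uniform_loss, confident_loss)
```

The logits at the final position have no target in this sequence, so they do not contribute to the loss.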

Source	XCiT: Cross-Covariance Image Transformers
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com