**T2T-ViT** (Tokens-To-Token Vision Transformer) is a type of [Vision Transformer](https://paperswithcode.com/method/vision-transformer) that incorporates two components: 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structures the image into tokens by recursively aggregating neighboring tokens into a single token, so that the local structure represented by surrounding tokens is modeled and the token sequence length is reduced; and 2) an efficient deep-narrow backbone for the vision [transformer](https://paperswithcode.com/method/transformer), motivated by CNN architecture design and chosen after an empirical study.
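
The aggregation step can be illustrated with a minimal NumPy sketch. It only covers the "merge neighboring tokens" part: tokens are placed back on their 2-D grid and overlapping windows are concatenated along the channel axis. The function name `soft_split` and the `kernel`, `stride`, and `padding` values are illustrative assumptions, and the transformer layer that a full T2T module would apply to the tokens before this step is omitted.

```python
import numpy as np

def soft_split(tokens, height, width, kernel=3, stride=2, padding=1):
    """Merge neighboring tokens: (N, C) -> (N', C * kernel * kernel).

    Hypothetical helper for illustration; kernel/stride/padding are assumed values.
    """
    n, c = tokens.shape
    assert n == height * width, "tokens must cover the full height x width grid"
    # Put the token sequence back on its 2-D spatial grid and zero-pad the border.
    grid = tokens.reshape(height, width, c)
    grid = np.pad(grid, ((padding, padding), (padding, padding), (0, 0)))
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    merged = []
    for i in range(out_h):
        for j in range(out_w):
            # Each overlapping window of neighboring tokens becomes one longer token.
            window = grid[i * stride:i * stride + kernel,
                          j * stride:j * stride + kernel, :]
            merged.append(window.reshape(-1))
    return np.stack(merged), out_h, out_w

tokens = np.random.default_rng(0).normal(size=(56 * 56, 64))
new_tokens, h, w = soft_split(tokens, 56, 56)
print(new_tokens.shape, h, w)
```

With these assumed settings, the number of tokens drops from 56 × 56 to 28 × 28 while each token grows from 64 to 576 channels, which is the sense in which the token length is reduced while local structure is kept.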

Diffusion models generate samples by gradually removing noise from a signal, and their training objective can be expressed as a reweighted variational lower bound (https://arxiv.org/abs/2006.11239).
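
As a rough illustration, the forward noising process and a simplified noise-prediction loss (one common reweighting of the variational bound) can be sketched in NumPy. The linear schedule values, the step count, and the helper names `make_alpha_bar`, `q_sample`, and `simple_loss` are assumptions made for this sketch, and the "model prediction" is a zero placeholder rather than a trained network.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, noise):
    """Noise a clean sample x0 to step t in closed form."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def simple_loss(noise, predicted_noise):
    """Mean squared error between the true and the predicted noise."""
    return float(np.mean((noise - predicted_noise) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.normal(size=(4, 8))          # stand-in for a batch of clean signals
noise = rng.normal(size=x0.shape)
x_t = q_sample(x0, 500, alpha_bar, noise)
predicted = np.zeros_like(noise)      # placeholder for a network's noise estimate
print(x_t.shape, simple_loss(noise, predicted))
```

Sampling would run the process in reverse, starting from pure noise and repeatedly subtracting the predicted noise; that loop is left out here because it needs a trained model.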

Diffusion

Denoising Diffusion Probabilistic Models

T2T-ViT

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

**CReLU**, or **Concatenated Rectified Linear Units**, is a type of activation function that preserves both positive and negative phase information while enforcing non-saturated non-linearity. It is computed by applying ReLU to the layer output $h$ and to its negation $-h$, then concatenating the two results:

$$ \left[\text{ReLU}\left(h\right), \text{ReLU}\left(-h\right)\right] $$
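
A minimal NumPy sketch of this formula follows. The function name `crelu` and the choice of concatenating along the last axis are assumptions; in a convolutional layer the channel axis would typically be used.

```python
import numpy as np

def crelu(h, axis=-1):
    """Concatenate ReLU(h) and ReLU(-h) along the given axis."""
    return np.concatenate([np.maximum(h, 0), np.maximum(-h, 0)], axis=axis)

h = np.array([[1.0, -2.0, 0.5]])
out = crelu(h)
print(out.tolist())  # [[1.0, 0.0, 0.5, 0.0, 2.0, 0.0]]
```

Note that the output has twice as many features as the input along the concatenation axis, so the following layer has to be sized accordingly.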

Source	Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com