
What is: Re-Attention Module?

Source: DeepViT: Towards Deeper Vision Transformer
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

The Re-Attention Module is an attention layer used in the DeepViT architecture that mixes the attention map with a learnable matrix before multiplying it with the values. The motivation is to regenerate the attention maps and increase their diversity across layers at negligible computation and memory cost. The authors observe that standard self-attention fails to learn effective concepts for representation learning in the deeper layers of ViT: attention maps become increasingly similar and less diverse with depth (attention collapse), which prevents the model from achieving the expected performance gains. Re-Attention is implemented as:

$$\operatorname{Re-Attention}(Q, K, V)=\operatorname{Norm}\left(\Theta^{\top}\left(\operatorname{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right)\right)\right) V$$

where the transformation matrix $\Theta$ is multiplied with the self-attention map $\mathbf{A}$ along the head dimension.
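The sketch below illustrates the idea in PyTorch: a standard multi-head attention map is mixed across heads by a learnable $H \times H$ matrix (here called `reattn_weight`, standing in for $\Theta$) and normalized before being applied to the values. The layer names, the choice of `BatchNorm2d` for the Norm step, and the hyperparameters are assumptions of this sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Minimal Re-Attention sketch: mix per-head attention maps with a
    learnable head-to-head matrix (Theta) before applying them to V."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable transformation Theta, applied along the head dimension.
        self.reattn_weight = nn.Parameter(torch.randn(num_heads, num_heads))
        # Norm over the re-mixed attention maps (BatchNorm over the head
        # channel is an assumption of this sketch).
        self.reattn_norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, H, N, d)

        # Standard scaled dot-product attention map, per head.
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, H, N, N)
        attn = attn.softmax(dim=-1)

        # Re-Attention: mix attention maps across heads with Theta, then normalize.
        attn = torch.einsum('hg,bgij->bhij', self.reattn_weight, attn)
        attn = self.reattn_norm(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Example usage (shapes only): 2 images, 197 tokens, embedding dim 384.
x = torch.randn(2, 197, 384)
y = ReAttention(dim=384, num_heads=6)(x)
print(y.shape)  # torch.Size([2, 197, 384])
```

Because the mixing matrix is only $H \times H$ and is shared across all tokens, the extra parameters and compute per layer are negligible compared with the attention and projection weights themselves.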