[Transformer](https://paperswithcode.com/method/transformer) is a type of self-attention-based neural networks originally applied for NLP tasks. Recently, pure transformer-based models are proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches while they ignore the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. By stacking the TNT blocks, we build the TNT model for image recognition.

Image source: [Han et al.](https://arxiv.org/pdf/2103.00112v1.pdf)

**Polyak Averaging** is an optimization technique that sets final parameters to an average of (recent) parameters visited in the optimization trajectory. Specifically if in $t$ iterations we have parameters $\theta\_{1}, \theta\_{2}, \dots, \theta\_{t}$, then Polyak Averaging suggests setting 

$$ \theta\_t =\frac{1}{t}\sum\_{i}\theta\_{i} $$

Image Credit: [Shubhendu Trivedi & Risi Kondor](https://ttic.uchicago.edu/~shubhendu/Pages/Files/Lecture6_flat.pdf)

Polyak Averaging

Transformer in Transformer

**Meena** is a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. A seq2seq model is used with the Evolved [Transformer](https://paperswithcode.com/method/transformer) as the main architecture. The model is trained on multi-turn conversations where the input sequence is all turns of the context and the output sequence is the response.

Source	Transformer in Transformer
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com