
What is: GPT-3?

Source: Language Models are Few-Shot Learners
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

GPT-3 is an autoregressive transformer language model with 175 billion parameters. It uses the same architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization, with one exception: GPT-3 alternates dense and locally banded sparse attention patterns across the layers of the transformer, similar to the Sparse Transformer.
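The alternation between dense and locally banded attention can be illustrated with the attention masks involved. The sketch below is a minimal illustration, not GPT-3's actual implementation: the layer parity rule, window size, and sequence length are assumptions chosen for readability (GPT-3 uses far longer contexts and wider bands).

```python
import numpy as np

def attention_mask(seq_len: int, layer_idx: int, window: int = 4) -> np.ndarray:
    """Causal attention mask for one transformer layer.

    Illustrative scheme: even-indexed layers use dense (full causal)
    attention, odd-indexed layers use a locally banded pattern where
    each token attends only to the previous `window` positions.
    Returns a boolean matrix; True means position i may attend to j.
    """
    # Full causal mask: each token attends to itself and all earlier tokens.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if layer_idx % 2 == 0:
        return causal  # dense layer
    # Banded layer: additionally restrict attention to a local window.
    rows = np.arange(seq_len)[:, None]
    cols = np.arange(seq_len)[None, :]
    return causal & (rows - cols < window)

# Dense layer: the last token can attend all the way back to position 0.
dense = attention_mask(8, layer_idx=0)
# Banded layer: the last token only sees the previous `window` positions.
local = attention_mask(8, layer_idx=1)
```

Sparse patterns like the banded mask reduce the cost of attention from quadratic toward linear in sequence length, which is part of what makes very long contexts tractable at this scale.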