**VirText**, or **Visual representations from Textual annotations** is a pretraining approach using semantically dense captions to learn visual representations. First a ConvNet and [Transformer](https://paperswithcode.com/method/transformer) are jointly trained from scratch to generate natural language captions for images. Then, the learned features are transferred to downstream visual recognition tasks.

**GPT-2** is a [Transformer](https://paperswithcode.com/methods/category/transformers) architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous [GPT](https://paperswithcode.com/method/gpt) architecture with some modifications:

- [Layer normalization](https://paperswithcode.com/method/layer-normalization) is moved to the input of each sub-block, similar to a
pre-activation residual network and an additional layer normalization was added after the final self-attention block. 

- A modified initialization which accounts for the accumulation on the residual path with model depth
is used. Weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers. 

- The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and
a larger batch size of 512 is used.

GPT-2

Language Models are Unsupervised Multitask Learners

VirTex

VirTex: Learning Visual Representations from Textual Annotations

OODformer is a [transformer](https://paperswithcode.com/method/transformer)-based OOD detection architecture that leverages the contextualization capabilities of the transformer. Incorporating the transformer as the principal feature extractor allows to exploit the object concepts and their discriminate attributes along with their co-occurrence via [visual attention](https://paperswithcode.com/method/visual-attention). 

OODformer employs [ViT](method/vision-transformer) and its data efficient variant [DeiT](/method/deit). Each encoder layer consist of multi-head self attention and a multi-layer perception block. The combination of MSA and MLP layers in the encoder jointly encode the attributes' importance, associated correlation, and co-occurrence. The [class] token (a representative of an image $x$) consolidated multiple attributes and their related features via the global context. The [class] token from the final layer is used for OOD detection in two ways; first, it is passed to $
F_{\text {classifier }}\left(x_{\text {feat }}\right)$  for softmax confidence score, and second it is used for latent space distance calculation.

Source	VirTex: Learning Visual Representations from Textual Annotations
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com