OODformer is a [transformer](https://paperswithcode.com/method/transformer)-based OOD detection architecture that leverages the contextualization capabilities of the transformer. Using the transformer as the principal feature extractor allows the model to exploit object concepts and their discriminative attributes, along with their co-occurrence, via [visual attention](https://paperswithcode.com/method/visual-attention).

OODformer employs [ViT](method/vision-transformer) and its data-efficient variant [DeiT](/method/deit). Each encoder layer consists of a multi-head self-attention (MSA) block and a multi-layer perceptron (MLP) block. Together, the MSA and MLP layers in the encoder encode the attributes' importance, their associated correlation, and their co-occurrence. The [class] token (a representative of an image $x$) consolidates multiple attributes and their related features via the global context. The [class] token from the final layer is used for OOD detection in two ways: first, it is passed to $F_{\text{classifier}}\left(x_{\text{feat}}\right)$ to obtain a softmax confidence score, and second, it is used for latent-space distance calculation.
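The two uses of the final-layer [class] token can be illustrated with a minimal NumPy sketch. Several things here are assumptions made for illustration and are not taken from the description above: $F_{\text{classifier}}$ is modeled as a single linear layer, the latent-space distance is taken to be the Euclidean distance to the nearest class mean, and random synthetic vectors stand in for the [class] token embeddings $x_{\text{feat}}$. All function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def softmax_confidence(x_feat, weight, bias):
    """Maximum softmax probability; a low value suggests an OOD input.

    Assumption: F_classifier is a single linear layer (weight, bias).
    """
    logits = x_feat @ weight + bias
    return softmax(logits).max(axis=-1)


def class_means(train_feats, train_labels, num_classes):
    """Mean in-distribution embedding for each class."""
    return np.stack(
        [train_feats[train_labels == c].mean(axis=0) for c in range(num_classes)]
    )


def latent_distance(x_feat, means):
    """Distance to the nearest class mean; a high value suggests an OOD input.

    Assumption: plain Euclidean distance is used as the latent-space metric.
    """
    diffs = x_feat[:, None, :] - means[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=-1)


# Synthetic stand-ins for [class] token embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim, num_classes = 8, 2
centers = np.stack([np.full(dim, 3.0), np.full(dim, -3.0)])
train_labels = np.repeat(np.arange(num_classes), 50)
train_feats = centers[train_labels] + rng.normal(scale=0.5, size=(100, dim))
means = class_means(train_feats, train_labels, num_classes)

# Hypothetical linear head whose class directions follow the class centers.
weight, bias = centers.T, np.zeros(num_classes)

in_dist = centers[0][None, :] + rng.normal(scale=0.5, size=(1, dim))
ood = np.tile([10.0, -10.0], dim // 2)[None, :]  # far from both classes

for name, feat in [("in-distribution", in_dist), ("OOD", ood)]:
    conf = softmax_confidence(feat, weight, bias)[0]
    dist = latent_distance(feat, means)[0]
    print(f"{name}: confidence={conf:.3f}, distance={dist:.3f}")
```

In this toy setup the OOD embedding should receive both a lower softmax confidence and a larger latent distance than the in-distribution one. In practice a threshold on either score would be chosen on held-out data.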

**VirTex**, or **Visual representations from Textual annotations**, is a pretraining approach that uses semantically dense captions to learn visual representations. First, a ConvNet and a [Transformer](https://paperswithcode.com/method/transformer) are jointly trained from scratch to generate natural language captions for images. Then, the learned features are transferred to downstream visual recognition tasks.

VirTex

VirTex: Learning Visual Representations from Textual Annotations

OODformer

OODformer: Out-Of-Distribution Detection Transformer

**Stacked Hourglass Networks** are a type of convolutional neural network for pose estimation. They are based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.

Source	OODformer: Out-Of-Distribution Detection Transformer
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com