
What is: Vision-and-Language Transformer?

Source: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

ViLT is a minimal vision-and-language pre-training transformer in which the processing of visual inputs is simplified to the same convolution-free manner used for text inputs. Its modality-specific components require less computation than the transformer component that handles multimodal interactions. The model is pre-trained on three objectives: image-text matching, masked language modeling, and word-patch alignment.
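The convolution-free idea can be sketched in a few lines: an image is sliced into fixed-size patches and linearly projected, mirroring how text tokens are embedded, and both sequences are concatenated for a single transformer. The code below is an illustrative sketch with made-up dimensions and randomly initialized parameters, not the actual ViLT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32          # shared embedding dimension (toy value)
P = 8           # patch size
H = W = 16     # image height and width
C = 3           # color channels
vocab = 100     # toy vocabulary size

# "Learnable" parameters, randomly initialized for this sketch
patch_proj = rng.normal(size=(P * P * C, D))   # linear patch projection
word_embed = rng.normal(size=(vocab, D))       # token embedding table

def embed_image(img):
    """Slice img (H, W, C) into P x P patches and project each to D dims.

    No CNN backbone or region detector is involved -- just a reshape
    and one linear projection, exactly like a word embedding lookup.
    """
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)   # (num_patches, P*P*C)
    return patches @ patch_proj                # (num_patches, D)

def embed_text(token_ids):
    """Look up token embeddings -- the same treatment as image patches."""
    return word_embed[token_ids]               # (seq_len, D)

img = rng.normal(size=(H, W, C))
tokens = np.array([1, 5, 42])

# Concatenate both modalities into one sequence for a single transformer.
sequence = np.concatenate([embed_text(tokens), embed_image(img)], axis=0)
print(sequence.shape)  # (3 text tokens + 4 patches, D) -> (7, 32)
```

Because the visual side is reduced to a reshape plus one matrix multiply, almost all of the compute lives in the shared transformer that models the multimodal interactions, which is the point the paper makes.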