UNITER (UNiversal Image-TExt Representation) is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained on four image-text datasets: COCO, Visual Genome, Conceptual Captions, and SBU Captions. The resulting joint embeddings can power heterogeneous downstream vision-and-language (V+L) tasks.
UNITER takes the visual regions of an image and the textual tokens of a sentence as inputs. In the Image Embedder, Faster R-CNN extracts the visual features of each region; in the Text Embedder, the input sentence is tokenized into WordPieces.

UNITER proposes Word-Region Alignment (WRA) via Optimal Transport (OT) to encourage more fine-grained alignment between word tokens and image regions. OT computes the minimum cost of transporting the contextualized image embeddings to the word embeddings, and vice versa.
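
The transport cost described above can be illustrated with a small, self-contained sketch. It assumes a cosine-distance cost between embeddings, uniform weights over words and regions, and Sinkhorn iterations as the solver. Sinkhorn is a standard entropic OT approximation swapped in here for illustration and is not necessarily the solver the authors use; the function names, `eps`, and `n_iters` are hypothetical.

```python
import math


def cosine_cost(u, v):
    """Cost of matching two embeddings: 1 - cosine similarity (assumes nonzero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Approximate the OT plan between two uniform distributions (entropic regularisation)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over words (assumption)
    b = [1.0 / m] * m  # uniform mass over regions (assumption)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_distance(words, regions, eps=0.1):
    """Return (transport cost, transport plan) between word and region embeddings."""
    cost = [[cosine_cost(w, r) for r in regions] for w in words]
    plan = sinkhorn_plan(cost, eps)
    total = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(words))
        for j in range(len(regions))
    )
    return total, plan
```

For example, `ot_distance([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` yields a much smaller cost than a pairing in which every word is orthogonal to every region, and the returned plan indicates which word is matched to which region.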

Four pre-training tasks are designed for this model: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous models, UNITER uses conditional masking in its pre-training tasks: masked language or region modeling is conditioned on full observation of the other modality (image or text), rather than masking both modalities at random.
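
Conditional masking can be sketched as follows: only the modality belonging to the current task is corrupted, while the other is passed through untouched. This is a minimal illustration rather than the authors' implementation. The function name `conditional_mask`, the `"mlm"`/`"mrm"` task labels, the default 15% masking rate, and the choice to zero out masked region features are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"


def conditional_mask(tokens, regions, task, mask_prob=0.15, rng=None):
    """Mask one modality only, keeping the other fully observed.

    tokens  : list of WordPiece strings
    regions : list of region feature vectors (lists of floats)
    task    : "mlm" masks tokens only, "mrm" masks regions only
    """
    rng = rng or random.Random()
    tokens = list(tokens)                      # copy so the inputs stay intact
    regions = [list(r) for r in regions]
    masked_indices = []

    if task == "mlm":
        for i in range(len(tokens)):
            if rng.random() < mask_prob:
                tokens[i] = MASK_TOKEN
                masked_indices.append(i)
    elif task == "mrm":
        for i in range(len(regions)):
            if rng.random() < mask_prob:
                regions[i] = [0.0] * len(regions[i])  # assumed: zero out the features
                masked_indices.append(i)
    else:
        raise ValueError(f"unknown task: {task!r}")

    return tokens, regions, masked_indices
```

With `task="mlm"` the region features are returned unchanged, so the model must recover the masked words from the complete image; with `task="mrm"` the sentence is returned unchanged instead.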

**VQ-VAE** is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation. It differs from [VAEs](https://paperswithcode.com/method/vae) in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. To learn a discrete latent representation, the model incorporates ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent the posterior collapse typically observed in the VAE framework, where the latents are ignored when they are paired with a powerful autoregressive decoder. Paired with an autoregressive prior, these representations let the model generate high-quality images, videos, and speech, and perform high-quality speaker conversion and unsupervised learning of phonemes.
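
The quantisation step itself can be sketched as a nearest-neighbour lookup: each continuous encoder output is replaced by the closest vector in a codebook, and the index of that vector is the discrete code. The sketch below shows only this lookup under a squared Euclidean distance; the names `quantize`, `z_e`, and `codebook` are illustrative, and training details such as codebook updates and gradient handling are deliberately left out.

```python
def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e      : list of continuous encoder output vectors
    codebook : list of embedding vectors, all of the same dimension as z_e's
    Returns (indices, z_q): the discrete codes and the quantised vectors.
    """
    indices = []
    z_q = []
    for z in z_e:
        # Squared Euclidean distance from z to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        k = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(k)
        z_q.append(list(codebook[k]))
    return indices, z_q


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
codes, quantised = quantize([[0.1, -0.2], [0.9, 1.2], [4.0, 4.5]], codebook)
```

Here each input lands on a different codebook entry, so `codes` is a sequence of integer indices that an autoregressive prior could then model.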

VQ-VAE

Neural Discrete Representation Learning

UNITER

UNITER: UNiversal Image-TExt Representation Learning

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Source: [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
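
The abstract above does not spell out the compression technique. In the cited paper, the single model is trained to match the ensemble's class probabilities after they have been softened with a temperature. The sketch below illustrates that idea under simplifying assumptions: the ensemble's logits are averaged to form the teacher, and the loss is the cross-entropy between softened teacher and student distributions. The function names and the default temperature are hypothetical.

```python
import math


def softmax_t(logits, temperature=1.0):
    """Softmax with a temperature; a higher temperature gives a softer distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(per_model_logits):
    """Average the logits of several models (a simplifying assumption for this sketch)."""
    n = len(per_model_logits)
    return [sum(col) / n for col in zip(*per_model_logits)]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax_t(teacher_logits, temperature)
    p_student = softmax_t(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


teacher = ensemble_logits([[4.0, 1.0, 0.0], [2.0, 1.0, 0.0]])
close_loss = distillation_loss([3.0, 1.0, 0.0], teacher)
far_loss = distillation_loss([0.0, 1.0, 3.0], teacher)
```

A student whose logits match the teacher's gets the lower loss (`close_loss`), while one that ranks the classes in the opposite order is penalised more heavily (`far_loss`).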

Source	UNITER: UNiversal Image-TExt Representation Learning
Year	2019
Data Source	CC BY-SA - https://paperswithcode.com