**VQ-VAE** is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation. It differs from [VAEs](https://paperswithcode.com/method/vae) in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, ideas from vector quantisation (VQ) are incorporated. Using the VQ method allows the model to circumvent issues of posterior collapse - where the latents are ignored when they are paired with a powerful autoregressive decoder - typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes.

**Fawkes** is an image cloaking system that helps individuals inoculate their images against unauthorized facial recognition models. Fawkes achieves this by helping users add imperceptible pixel-level changes ("cloaks") to their own photos before releasing them. When used to train facial recognition models, these "cloaked" images produce functional models that consistently cause normal images of the user to be misidentified.

Fawkes

Fawkes: Protecting Privacy against Unauthorized Deep Learning Models

VQ-VAE

Neural Discrete Representation Learning

UNITER or UNiversal Image-TExt Representation model is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained using four image-text datasets COCO, Visual Genome, Conceptual Captions, and SBU Captions. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. 
UNITER takes the visual regions of the image and textual tokens of the sentence as inputs. A faster R-CNN is used in Image Embedder to extract the visual features of each region and a Text Embedder is used to tokenize the input sentence into WordPieces.  

It proposes WRA via the Optimal Transport to provide more fine-grained alignment between word tokens and image regions that is effective in calculating the minimum cost of transporting the contextualized image embeddings to word embeddings and vice versa. 

Four pretraining tasks were designed for this model. They are Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). This model is different from the previous models because it uses conditional masking on pre-training tasks.

Source	Neural Discrete Representation Learning
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com