
What is: Unified VLP?

Source: Unified Vision-Language Pre-Training for Image Captioning and VQA
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Unified VLP is a unified encoder-decoder model for general vision-language pre-training. The model uses a shared multi-layer Transformer network for both encoding and decoding, and is pre-trained on a large amount of image-text pairs with the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction.

Figure: Model architecture for pre-training.

For pre-training, the input consists of an image, a sentence, and three special tokens ([CLS], [SEP], [STOP]). The image is processed as N Regions of Interest (RoIs), from which region features are extracted. The sentence is tokenized and masked with [MASK] tokens for the masked language modeling task.

The model consists of 12 Transformer blocks, each with a masked self-attention layer and a feed-forward module, where the self-attention mask controls what input context the prediction conditions on. Two self-attention masks are used, depending on whether the objective is bidirectional or seq2seq; a small sketch of how such masks could be constructed is given below. The model is then fine-tuned for image captioning and visual question answering.
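The sketch below illustrates the idea of the two self-attention masks, assuming the input sequence is laid out as image region features followed by sentence tokens. It is not the authors' implementation; the function name `build_mask` and the parameters `num_regions` and `num_tokens` are illustrative, and the special tokens ([CLS], [SEP], [STOP]) are folded into the two segments for simplicity.

```python
# Minimal sketch (assumed, not from the VLP codebase) of the two
# self-attention masks: "bidirectional" lets every position see the full
# context, while "seq2seq" lets image regions attend to each other and
# forces sentence tokens to attend only to the regions and to earlier tokens.
import torch


def build_mask(num_regions: int, num_tokens: int, objective: str) -> torch.Tensor:
    """Return an (L, L) mask with L = num_regions + num_tokens.

    mask[i, j] == 1 means position i may attend to position j; 0 means blocked.
    """
    L = num_regions + num_tokens
    if objective == "bidirectional":
        # Full context: every position attends to every other position.
        return torch.ones(L, L)
    if objective == "seq2seq":
        mask = torch.zeros(L, L)
        # Image regions (the context segment) attend to each other freely.
        mask[:num_regions, :num_regions] = 1
        # Each text position attends to all regions and to text positions
        # up to and including itself (left-to-right generation).
        for i in range(num_regions, L):
            mask[i, :num_regions] = 1
            mask[i, num_regions:i + 1] = 1
        return mask
    raise ValueError(f"unknown objective: {objective}")


# Example: 3 region features and 4 sentence tokens.
print(build_mask(3, 4, "seq2seq"))
```

In this view, switching between the bidirectional and seq2seq objectives only changes the mask fed to the shared Transformer blocks, which is what allows one set of weights to serve both encoding and decoding.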