SOHO (“See Out of tHe bOx”) takes a whole image as input and learns vision-language representations in an end-to-end manner. Because SOHO does not require bounding box annotations, its inference is 10 times faster than that of region-based approaches. Text embeddings provide the textual features, and a trainable CNN extracts the visual representations. SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. The VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and used in the proposed pre-training task, Masked Visual Modeling (MVM).
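The role of the visual dictionary can be illustrated with a small sketch. The description above does not say how features are matched to dictionary entries or how entries are updated, so the nearest-neighbour lookup, the moving-average update, and the names `nearest_entry`, `quantize_and_update`, and `momentum` are all assumptions made for illustration, not the SOHO implementation.

```python
def nearest_entry(feature, dictionary):
    """Index of the dictionary vector closest to `feature` (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(dictionary)), key=lambda k: sq_dist(feature, dictionary[k]))


def quantize_and_update(features, dictionary, momentum=0.9):
    """Assign each feature to its nearest entry, then move every used entry
    toward the mean of the features assigned to it (assumed update rule)."""
    # All assignments are computed against the dictionary before any update.
    assignments = [nearest_entry(f, dictionary) for f in features]
    for k in set(assignments):
        members = [f for f, a in zip(features, assignments) if a == k]
        mean = [sum(col) / len(members) for col in zip(*members)]
        dictionary[k] = [momentum * d + (1 - momentum) * m
                         for d, m in zip(dictionary[k], mean)]
    return assignments


# Toy data: a two-entry dictionary and three 2-D "visual features".
dictionary = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]]
assignments = quantize_and_update(features, dictionary, momentum=0.5)
```

In this toy run, features that lie close together share one dictionary index, which is the sense in which an entry stands for a group of visually similar features.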

A **Neural Probabilistic Language Model** is an early language modelling architecture. It is a feedforward network that takes as input the vector representations (i.e. word embeddings) of the previous $n$ words, which are looked up in a table $C$.

The word embeddings are concatenated and fed into a hidden layer, which then feeds into a [softmax](https://paperswithcode.com/method/softmax) layer to estimate the probability of the next word given the context.
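The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `tanh` activation, the random initialisation, the toy sizes, and the names `H`, `U`, `b_h`, and `b_o` are not given in the text, and only the table $C$ and the context size $n$ come from it.

```python
import math
import random


def nplm_forward(context_ids, C, H, b_h, U, b_o):
    """Return a probability for every vocabulary word given n context word ids."""
    # Look up each context word in the embedding table C and concatenate.
    x = [value for idx in context_ids for value in C[idx]]
    # Hidden layer (tanh is an assumed activation).
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(H, b_h)]
    # One output score per vocabulary word.
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(U, b_o)]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
vocab_size, embed_dim, n, hidden_dim = 6, 4, 3, 5
C = random_matrix(vocab_size, embed_dim, rng)      # embedding table
H = random_matrix(hidden_dim, n * embed_dim, rng)  # hidden-layer weights
b_h = [0.0] * hidden_dim
U = random_matrix(vocab_size, hidden_dim, rng)     # output-layer weights
b_o = [0.0] * vocab_size

probs = nplm_forward([2, 0, 5], C, H, b_h, U, b_o)
```

Here `probs` holds one probability per vocabulary word for the context `[2, 0, 5]`. In a trained model, `C`, `H`, and `U` would be learned jointly rather than drawn at random.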

Neural Probabilistic Language Model

SOHO

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
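As a small illustration of selecting a subset of features, the sketch below drops columns whose values never vary. Variance thresholding is only one simple filter-style criterion, chosen here as an assumption for the example; the function name `select_by_variance` and the toy data are hypothetical.

```python
from statistics import pvariance


def select_by_variance(rows, threshold=0.0):
    """Keep the columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))
    kept = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced


# Toy data: column 0 is constant and therefore carries no information.
rows = [
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 2.5],
    [1.0, 0.0, 4.0],
]
kept, reduced = select_by_variance(rows)
```

`kept` lists the indices of the retained columns and `reduced` is the data restricted to them. Criteria that score relevance against a target variable follow the same pattern with a different scoring function.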

Source	Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com