**VC R-CNN** is an unsupervised feature representation learning method, which uses Region-based Convolutional Neural Network ([R-CNN](https://paperswithcode.com/method/r-cnn)) as the visual backbone, and the causal intervention as the training objective. Given a set of detected object regions in an image (e.g., using [Faster R-CNN](https://paperswithcode.com/method/faster-r-cnn)), like any other unsupervised feature learning methods (e.g., word2vec), the proxy training objective of VC R-CNN is to predict the contextual objects of a region. However, they are fundamentally different: the prediction of VC R-CNN is by using causal intervention: P(Y|do(X)), while others are by using the conventional likelihood: P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge like chair can be sat -- while not just "common" co-occurrences such as the chair is likely to exist if table is observed.

**Mixture Normalization** is normalization technique that relies on an approximation of the probability density function of the internal representations. Any continuous distribution can be approximated with arbitrary precision using a Gaussian Mixture Model (GMM). Hence, instead of computing one set of statistical measures from the entire population (of instances in the mini-batch) as [Batch Normalization](https://paperswithcode.com/method/batch-normalization) does, Mixture Normalization works on sub-populations which can be identified by disentangling modes of the distribution, estimated via GMM. 

While BN can only scale and/or shift the whole underlying probability density function, mixture normalization operates like a soft piecewise normalizing transform, capable of completely re-structuring the data distribution by independently scaling and/or shifting individual modes of distribution.

Mixture Normalization

Training Faster by Separating Modes of Variation in Batch-normalized Models

VC R-CNN

Visual Commonsense R-CNN

**PGNet** is a point-gathering network for reading arbitrarily-shaped text in real-time. It is a single-shot text spotter, where the pixel-level character classification map is learned with proposed PG-CTC loss avoiding the usage of character-level annotations. With a PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS and RoI operations involved, which guarantees high efficiency. Additionally, reasoning the relations between each character and its neighbors, a graph refinement module (GRM) is proposed to optimize the coarse recognition and improve the end-to-end performance.

Source	Visual Commonsense R-CNN
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com