OSCAR is a new learning method that uses object tags detected in images as anchor points to ease the learning of image-text alignment. The model take a triple as input (word-tag-region) and pre-trained with two losses (masked token loss over words and tags, and a contrastive loss between tags and others). OSCAR represents an image-text pair into semantic space via dictionary lookup. Object tags are used as anchor points to align image regions with word embeddings of pre-trained language models. The model is then fine-tuned for understanding and generation tasks.

Spatial pooling usually operates on a small region which limits its capability to capture long-range dependencies and focus on distant regions. To overcome this, Hou et al. proposed  strip pooling, a novel pooling method capable of encoding long-range context in either horizontal or vertical spatial domains.  

Strip pooling has two branches for horizontal  and vertical strip pooling. The horizontal strip pooling part first pools the input feature $F \in \mathcal{R}^{C \times H \times W}$ in the horizontal direction:
\begin{align}
y^1 = \text{GAP}^w (X) 
\end{align}
Then a 1D convolution with kernel size 3 is applied in $y$ to capture the relationship between different rows and channels. This is repeated $W$ times to make  the output $y_v$  consistent with the input shape:
\begin{align}
    y_h = \text{Expand}(\text{Conv1D}(y^1))
\end{align}
Vertical strip pooling is performed in a similar way. Finally, the outputs of the two branches are fused using element-wise summation to produce the attention map:
\begin{align}
s &= \sigma(Conv^{1\times 1}(y_{v} + y_{h}))
\end{align}
\begin{align}
Y &= s  X
\end{align}

The strip pooling module (SPM) is further developed in the mixed pooling module (MPM). Both consider  spatial  and channel relationships to overcome the locality of convolutional neural networks.  SPNet achieves  state-of-the-art results for several complex semantic segmentation benchmarks.

SPNet

Strip Pooling: Rethinking Spatial Pooling for Scene Parsing

OSCAR

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Non-parametric approximation of Q-values by storing all visited states and doing inference through k-Nearest Neighbors.

Source	Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com