
What is: Focal Transformers?

Source: Focal Self-attention for Local-Global Interactions in Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Focal self-attention is designed to make Transformer layers scalable to high-resolution inputs. Instead of attending to all tokens at fine granularity, each query attends to fine-grained tokens only locally and to summarized (coarse-grained) tokens globally. As a result, it can cover as many regions as standard self-attention but at a much lower cost. An image is first partitioned into patches, yielding visual tokens. A patch embedding layer, a convolutional layer whose filter size equals its stride, then projects the patches into hidden features. The resulting spatial feature map is passed through four stages of focal Transformer blocks, where each block consists of $N_i$ focal Transformer layers. Patch embedding layers between stages reduce the spatial size of the feature map by a factor of 2 while doubling the feature dimension.
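To make the local-fine / global-coarse pattern concrete, below is a minimal, single-head PyTorch sketch, not the paper's implementation: each non-overlapping query window attends to its own fine-grained tokens plus a pooled summary of the whole map. The function name, the window size, the pooling size, and the use of plain average pooling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def focal_attention_sketch(x, window_size=4, pool_size=4):
    """Single-head toy version of the focal attention idea.

    Each query token attends to (a) the fine-grained tokens inside its own
    local window and (b) a coarse, pooled summary of the whole feature map.
    x: (B, H, W, C) with H and W divisible by window_size.
    """
    B, H, W, C = x.shape
    scale = C ** -0.5
    nh, nw = H // window_size, W // window_size

    # Fine-grained local tokens: partition the map into non-overlapping windows.
    fine = x.view(B, nh, window_size, nw, window_size, C)
    fine = fine.permute(0, 1, 3, 2, 4, 5).reshape(B, nh * nw, window_size * window_size, C)

    # Coarse global tokens: pool the whole map into pool_size x pool_size
    # summaries, shared by every window.
    coarse = F.adaptive_avg_pool2d(x.permute(0, 3, 1, 2), pool_size)   # (B, C, p, p)
    coarse = coarse.flatten(2).transpose(1, 2)                         # (B, p*p, C)
    coarse = coarse.unsqueeze(1).expand(-1, nh * nw, -1, -1)           # (B, nWin, p*p, C)

    q = fine                                      # queries: fine tokens of each window
    kv = torch.cat([fine, coarse], dim=2)         # keys/values: local fine + global coarse

    attn = (q @ kv.transpose(-2, -1)) * scale     # (B, nWin, win^2, win^2 + p^2)
    attn = attn.softmax(dim=-1)
    out = attn @ kv                               # (B, nWin, win^2, C)

    # Merge the windows back into a (B, H, W, C) feature map.
    out = out.view(B, nh, nw, window_size, window_size, C)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    return out


if __name__ == "__main__":
    x = torch.randn(2, 16, 16, 96)                # toy feature map: 16x16 tokens, 96 channels
    print(focal_attention_sketch(x).shape)        # torch.Size([2, 16, 16, 96])
```

The full method also uses learned query/key/value projections, multiple heads, and several focal levels with different pooling granularities; the sketch keeps only the core pattern of gathering local fine-grained and global coarse-grained tokens for each window.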