
What is: LocalViT?

Source: LocalViT: Bringing Locality to Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

LocalViT introduces depth-wise convolutions to enhance the local feature modeling capability of ViTs. The network, as shown in Figure (c), brings a locality mechanism into transformers through depth-wise convolution (denoted by "DW"). To accommodate the convolution operation, conversion between the token sequence and the image feature map is added via "Seq2Img" and "Img2Seq". The computation is as follows:

$$\mathbf{Y}^{r}=f\left(f\left(\mathbf{Z}^{r} \circledast \mathbf{W}_{1}^{r} \right) \circledast \mathbf{W}_{d} \right) \circledast \mathbf{W}_{2}^{r}$$

where $\mathbf{W}_{d} \in \mathbb{R}^{\gamma d \times 1 \times k \times k}$ is the kernel of the depth-wise convolution.

The input (a sequence of tokens) is first reshaped into a feature map arranged on a 2D lattice. Two 1×1 convolutions, with a depth-wise convolution between them, are applied to the feature map. The feature map is then reshaped back into a sequence of tokens, which is consumed by the self-attention of the next transformer layer.
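The Seq2Img → conv → DW conv → conv → Img2Seq pipeline described above can be sketched as a PyTorch module. This is a minimal illustration, not the official LocalViT code: the class name, the expansion ratio (playing the role of $\gamma$), and the assumption of a square token lattice with no class token are all simplifications made here for clarity.

```python
import torch
import torch.nn as nn

class LocalityFeedForward(nn.Module):
    """Sketch of a LocalViT-style feed-forward block:
    Seq2Img -> 1x1 conv (expand) -> depth-wise k x k conv -> 1x1 conv (reduce) -> Img2Seq.
    Hypothetical names and defaults; activation placement follows the equation above.
    """
    def __init__(self, dim, expansion=4, kernel_size=3):
        super().__init__()
        hidden = dim * expansion  # gamma * d in the paper's notation
        self.conv1 = nn.Conv2d(dim, hidden, 1)                 # W_1: 1x1 conv
        self.dw = nn.Conv2d(hidden, hidden, kernel_size,
                            padding=kernel_size // 2,
                            groups=hidden)                     # W_d: depth-wise conv
        self.conv2 = nn.Conv2d(hidden, dim, 1)                 # W_2: 1x1 conv
        self.act = nn.GELU()                                   # f(.)

    def forward(self, tokens):
        # tokens: (B, N, d); assume N = H*W on a square lattice, class token excluded
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, d, h, w)         # Seq2Img
        x = self.act(self.conv1(x))                            # f(Z * W_1)
        x = self.act(self.dw(x))                               # f(... * W_d)
        x = self.conv2(x)                                      # ... * W_2
        return x.reshape(b, d, n).transpose(1, 2)              # Img2Seq
```

Because the depth-wise convolution uses `groups=hidden`, each channel is filtered independently with a k×k kernel, which is what injects local spatial context while keeping the parameter cost low.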