What is: Positional Encoding Generator?

Source: Conditional Positional Encodings for Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Positional Encoding Generator, or PEG, is the module that implements the Conditional Positional Encoding (CPE) scheme. It dynamically produces the positional encoding for each token, conditioned on the token's local neighborhood. To condition on the local neighbors, the flattened input sequence $X \in \mathbb{R}^{B \times N \times C}$ of DeiT is first reshaped back to $X' \in \mathbb{R}^{B \times H \times W \times C}$ in the 2-D image space. Then a function (denoted by $\mathcal{F}$ in the paper's figure) is applied repeatedly to local patches of $X'$ to produce the conditional positional encodings $E \in \mathbb{R}^{B \times H \times W \times C}$. PEG can be efficiently implemented with a 2-D convolution of kernel size $k$ ($k \geq 3$) and $\frac{k-1}{2}$ zero padding. The zero padding is important: it is what makes the model aware of absolute positions. $\mathcal{F}$ can take various forms, such as separable convolutions, among others.
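
As a concrete illustration, here is a minimal PyTorch sketch of a PEG layer, assuming $\mathcal{F}$ is a depthwise $3 \times 3$ convolution (one of the forms mentioned above) and omitting DeiT's class token for brevity. The class and argument names are illustrative, not the paper's official code:

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Sketch of a Positional Encoding Generator with a depthwise conv as F."""

    def __init__(self, dim, k=3):
        super().__init__()
        # Zero padding of (k-1)/2 preserves the spatial size and, per the
        # paper, is what lets the model infer absolute positions.
        self.proj = nn.Conv2d(dim, dim, kernel_size=k,
                              stride=1, padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):
        # x: flattened token sequence of shape (B, N, C) with N = H * W
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)  # back to 2-D image space
        pos = self.proj(feat)                         # conditional encodings E
        return x + pos.flatten(2).transpose(1, 2)     # add E, reflatten to (B, N, C)

# Usage: 14x14 = 196 tokens of width 192
tokens = torch.randn(2, 14 * 14, 192)
out = PEG(192)(tokens, 14, 14)   # same shape as tokens
```

In CPVT, a PEG is typically inserted after the first encoder block, and the depthwise convolution keeps the extra cost negligible since it mixes information only spatially, not across channels.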