
What is: ConViT?

Source: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

ConViT is a type of vision transformer that uses a gated positional self-attention (GPSA) module, a form of positional self-attention that can be equipped with a "soft" convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers; each attention head is then free to escape locality by adjusting a gating parameter that regulates how much attention is paid to positional versus content information.
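
The sketch below illustrates the gating idea: each head blends a content-based attention map with a positional attention map through a learnable per-head gate. This is a minimal, simplified version for illustration only, not the authors' implementation; the module name `GatedPositionalSelfAttention`, the 3-dimensional relative-position encoding, and the gate initialization are assumptions made for this example.

```python
import torch
import torch.nn as nn


class GatedPositionalSelfAttention(nn.Module):
    """Simplified GPSA-style layer: blends content attention and positional
    attention with a learnable per-head gate (illustrative sketch only)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.qk = nn.Linear(dim, dim * 2, bias=False)   # content queries/keys
        self.v = nn.Linear(dim, dim, bias=False)        # values
        self.proj = nn.Linear(dim, dim)

        # Positional attention: one score per head for each pair of patches,
        # computed from a simple relative-position feature (assumed here to be
        # [delta_x, delta_y, squared distance]).
        self.pos_proj = nn.Linear(3, num_heads)

        # Per-head gating parameter: sigmoid(gate) near 1 favours positional
        # (convolution-like, local) attention, near 0 favours content attention.
        self.gate = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x, rel_pos):
        # x: (B, N, dim) patch embeddings; rel_pos: (N, N, 3) relative encodings
        B, N, dim = x.shape
        q, k = self.qk(x).chunk(2, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Content attention: standard scaled dot-product between patches.
        content_attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, H, N, N)
        content_attn = content_attn.softmax(dim=-1)

        # Positional attention: depends only on relative patch positions.
        pos_attn = self.pos_proj(rel_pos).permute(2, 0, 1)           # (H, N, N)
        pos_attn = pos_attn.softmax(dim=-1).unsqueeze(0)             # (1, H, N, N)

        # Gate the two attention maps per head.
        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        attn = (1.0 - g) * content_attn + g * pos_attn

        out = (attn @ v).transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)
```

In the paper, the gates are initialized so that positional attention dominates and each head initially attends to a distinct local neighbourhood, mimicking a convolution; during training the heads can learn to lower their gates and rely more on content, escaping locality when it helps.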