
What is: Twins-PCPVT?

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Twins-PCPVT is a vision transformer that combines global attention, specifically the global sub-sampled attention (GSA) proposed in the Pyramid Vision Transformer (PVT), with conditional position encodings (CPE) that replace the absolute position encodings used in PVT.
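To make the attention side concrete, here is a minimal single-head sketch of global sub-sampled attention in NumPy. It is illustrative only: query/key/value projections are omitted (identity mappings), and the spatial reduction is done with average pooling over r x r windows, an assumed simplification (PVT performs the reduction with a strided convolution). The key point is that keys and values come from a sub-sampled token map, so the attention matrix has N * N / r^2 entries instead of N * N.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def global_subsampled_attention(x, H, W, r):
    """Single-head global sub-sampled attention sketch (no projections).

    x: (N, C) tokens for one image, N = H * W.
    r: spatial reduction ratio; keys/values are computed from an
       (H/r) x (W/r) sub-sampled map (here via average pooling,
       an assumption for brevity).
    """
    N, C = x.shape
    feat = x.T.reshape(C, H, W)
    # Sub-sample the token map: each summary token covers an r x r window.
    pooled = feat.reshape(C, H // r, r, W // r, r).mean(axis=(2, 4))
    kv = pooled.reshape(C, -1).T                 # (N / r^2, C) keys/values
    attn = softmax(x @ kv.T / np.sqrt(C))        # (N, N / r^2) attention weights
    return attn @ kv                             # (N, C) output tokens
```

With r = 2 the attention cost drops by a factor of 4 relative to full global attention, which is what makes the design practical on high-resolution feature maps.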

The position encoding generator (PEG), which produces the CPE, is placed after the first encoder block of each stage. The simplest form of PEG is used: a 2D depth-wise convolution without batch normalization. For image-level classification, following CPVT, the class token is removed and global average pooling is applied at the end of the final stage. For other vision tasks, the design of PVT is followed.
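The PEG described above can be sketched in a few lines of NumPy. The kernel size of 3 is an assumption (a common default for this kind of depth-wise convolution); tokens are reshaped back to their 2D layout, convolved channel-by-channel, and added back through a residual connection, with no batch normalization, matching the "simplest form" described in the text.

```python
import numpy as np

def peg(x, H, W, weight):
    """Positional Encoding Generator (PEG) sketch.

    x:      (N, C) token sequence for one image, N = H * W.
    weight: (C, k, k) depth-wise kernels, one k x k filter per channel
            (k = 3 assumed here).
    Returns tokens of the same shape with the conditional position
    encoding added via a residual connection (no batch norm).
    """
    N, C = x.shape
    k = weight.shape[-1]
    pad = k // 2
    feat = x.T.reshape(C, H, W)                       # tokens back to a 2D map
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(feat)
    for c in range(C):                                # depth-wise: each channel alone
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + k, j:j + k] * weight[c])
    out += feat                                       # residual connection
    return out.reshape(C, N).T                        # back to (N, C)
```

Because the encoding is computed from the tokens' 2D neighborhood rather than from a fixed table, it adapts to the input resolution, which is the practical advantage of CPE over absolute position encodings.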