**DINO** (self-distillation with no labels) is a self-supervised learning method in which a student network directly predicts the output of a teacher network (built with a momentum encoder) using a standard cross-entropy loss.

In the example to the right, DINO is illustrated in the case of one single pair of views $\left(x\_{1}, x\_{2}\right)$ for simplicity.
The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters.
The output of the teacher network is centered with a mean computed over the batch. Each network outputs a $K$-dimensional feature that is normalized with a temperature [softmax](https://paperswithcode.com/method/softmax) over the feature dimension.
Their similarity is then measured with a cross-entropy loss.
A stop-gradient (sg) operator is applied to the teacher to propagate gradients only through the student.
The teacher parameters are updated with an exponential moving average (ema) of the student parameters.
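
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperatures `tau_s` and `tau_t`, and the momentum value are illustrative choices, and the two networks are replaced by random outputs. Because NumPy has no automatic differentiation, the stop-gradient is only indicated in a comment.

```python
import numpy as np

def log_softmax(z, temperature):
    """Temperature-scaled log-softmax over the last (feature) axis."""
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions for one pair of views."""
    # Center the teacher output with its mean over the batch.
    center = teacher_out.mean(axis=0, keepdims=True)
    # Teacher probabilities are fixed targets (this is where sg would be applied).
    p_teacher = np.exp(log_softmax(teacher_out - center, tau_t))
    log_p_student = log_softmax(student_out, tau_s)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter toward the matching student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Hypothetical outputs for a batch of 4 images with K = 8 output dimensions.
rng = np.random.default_rng(0)
student_out = rng.normal(size=(4, 8))  # student applied to view x1
teacher_out = rng.normal(size=(4, 8))  # teacher applied to view x2
loss = dino_loss(student_out, teacher_out)

# Toy parameters showing a single ema step with momentum 0.9.
teacher_w = ema_update([np.zeros(3)], [np.ones(3)], momentum=0.9)
```

In a real training loop, the loss would be backpropagated through the student only, and `ema_update` would run after each optimizer step.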

**Gaussian Processes** are non-parametric models for approximating functions. They rely upon a measure of similarity between points (the kernel function) to predict the value for an unseen point from training data. The models are fully probabilistic so uncertainty bounds are baked in with the model.
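
As a minimal sketch of how a kernel turns training data into a prediction with uncertainty, the example below performs Gaussian process regression on 1-D inputs. The squared-exponential (RBF) kernel, its hyperparameters, the small `noise` term, and the function names are assumptions chosen for illustration; other kernels follow the same pattern.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential similarity between two sets of 1-D points."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and standard deviation at x_test given the training data."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)  # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # clip tiny negative round-off
    return mean, std

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.0, 5.0])  # one training point, one point far from the data
mean, std = gp_posterior(x_train, y_train, x_test)
```

The predicted standard deviation is close to zero at the training point and close to the prior value far from the data, which is the sense in which the uncertainty bounds come with the model.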

Image Source: Gaussian Processes for Machine Learning, C. E. Rasmussen & C. K. I. Williams

Emerging Properties in Self-Supervised Vision Transformers

The **SAGAN Self-Attention Module** is a self-attention module used in the [Self-Attention GAN](https://paperswithcode.com/method/sagan) architecture for image synthesis. In the module, image features from the previous hidden layer $\textbf{x} \in \mathbb{R}^{C \times N}$ are first transformed into two feature spaces $\textbf{f}$, $\textbf{g}$ to calculate the attention, where $\textbf{f}(\textbf{x}) = \textbf{W}\_{\textbf{f}}\textbf{x}$, $\textbf{g}(\textbf{x})=\textbf{W}\_{\textbf{g}}\textbf{x}$. We then calculate:

$$\beta_{j, i} = \frac{\exp\left(s_{ij}\right)}{\sum^{N}\_{i=1}\exp\left(s_{ij}\right)} $$

$$ \text{where } s_{ij} = \textbf{f}(\textbf{x}\_{i})^{T}\textbf{g}(\textbf{x}\_{j}) $$

and $\beta_{j, i}$ indicates the extent to which the model attends to the $i$th location when synthesizing the $j$th region. Here, $C$ is the number of channels and $N$ is the number of feature locations in the feature map from the previous hidden layer. The output of the attention layer is $\textbf{o} = \left(\textbf{o}\_{\textbf{1}}, \textbf{o}\_{\textbf{2}}, \ldots, \textbf{o}\_{\textbf{j}} , \ldots, \textbf{o}\_{\textbf{N}}\right) \in \mathbb{R}^{C \times N}$, where

$$ \textbf{o}\_{\textbf{j}} = \textbf{v}\left(\sum^{N}\_{i=1}\beta_{j, i}\textbf{h}\left(\textbf{x}\_{\textbf{i}}\right)\right) $$

$$ \textbf{h}\left(\textbf{x}\_{\textbf{i}}\right) = \textbf{W}\_{\textbf{h}}\textbf{x}\_{\textbf{i}} $$

$$ \textbf{v}\left(\textbf{x}\_{\textbf{i}}\right) = \textbf{W}\_{\textbf{v}}\textbf{x}\_{\textbf{i}} $$

In the above formulation, $\textbf{W}\_{\textbf{g}} \in \mathbb{R}^{\bar{C} \times C}$, $\textbf{W}\_{\textbf{f}} \in \mathbb{R}^{\bar{C} \times C}$, $\textbf{W}\_{\textbf{h}} \in \mathbb{R}^{\bar{C} \times C}$ and $\textbf{W}\_{\textbf{v}} \in \mathbb{R}^{C \times \bar{C}}$ are the learned weight matrices, which are implemented as $1 \times 1$ convolutions. The authors choose $\bar{C} = C/8$.

In addition, the module further multiplies the output of the attention layer by a scale parameter and adds back the input feature map. Therefore, the final output is given by,

$$\textbf{y}\_{\textbf{i}} = \gamma\textbf{o}\_{\textbf{i}} + \textbf{x}\_{\textbf{i}}$$

where $\gamma$ is a learnable scalar initialized to 0. Introducing $\gamma$ allows the network to first rely on the cues in the local neighborhood, since this is easier, and then gradually learn to assign more weight to the non-local evidence.
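
The module can be sketched with plain matrix products, since a $1 \times 1$ convolution applied to features flattened to shape $C \times N$ is a matrix multiplication. The sketch below takes $s_{ij} = \textbf{f}(\textbf{x}\_{i})^{T}\textbf{g}(\textbf{x}\_{j})$ and is an illustration under assumptions: the function names, random weights, and sizes are hypothetical, and in a real model the weights and $\gamma$ would be learned.

```python
import numpy as np

def softmax(z, axis):
    """Softmax along the given axis, shifted by the max for stability."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sagan_self_attention(x, W_f, W_g, W_h, W_v, gamma=0.0):
    """x has shape (C, N); returns the output y of shape (C, N) and the attention map."""
    f = W_f @ x  # (C_bar, N)
    g = W_g @ x  # (C_bar, N)
    h = W_h @ x  # (C_bar, N)
    s = f.T @ g  # s[i, j] = f(x_i)^T g(x_j), shape (N, N)
    # Normalize over i for each j, so beta[i, j] plays the role of beta_{j,i}.
    beta = softmax(s, axis=0)
    # Column j of (h @ beta) is sum_i beta_{j,i} h(x_i); W_v maps it back to C channels.
    o = W_v @ (h @ beta)
    return gamma * o + x, beta

rng = np.random.default_rng(0)
C, N = 16, 6    # channels and feature locations (illustrative sizes)
C_bar = C // 8  # reduced channel count, following C_bar = C / 8
x = rng.normal(size=(C, N))
W_f, W_g, W_h = (rng.normal(size=(C_bar, C)) for _ in range(3))
W_v = rng.normal(size=(C, C_bar))

y0, beta = sagan_self_attention(x, W_f, W_g, W_h, W_v, gamma=0.0)
y1, _ = sagan_self_attention(x, W_f, W_g, W_h, W_v, gamma=0.5)
```

With `gamma=0.0` the module returns its input unchanged, which matches the initialization described above; a non-zero `gamma` mixes in the attended features.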

Source	Emerging Properties in Self-Supervised Vision Transformers
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com