
What is: Stochastic Depth?

Source: Deep Networks with Stochastic Depth
Year: 2016
Data Source: CC BY-SA - https://paperswithcode.com

Stochastic Depth aims to shrink the depth of a network during training, while keeping it unchanged during testing. This is achieved by randomly dropping entire ResBlocks during training and bypassing their transformations through skip connections.

Let $b_l \in \{0, 1\}$ denote a Bernoulli random variable, which indicates whether the $l$th ResBlock is active ($b_l = 1$) or inactive ($b_l = 0$). Further, let us denote the "survival" probability of ResBlock $l$ as $p_l = \Pr(b_l = 1)$. With this definition we can bypass the $l$th ResBlock by multiplying its function $f_l$ with $b_l$, and we extend the update rule to:

$$H_l = \text{ReLU}\left(b_l f_l(H_{l-1}) + \text{id}(H_{l-1})\right)$$

If $b_l = 1$, this reduces to the original ResNet update and the ResBlock remains unchanged. If $b_l = 0$, the ResBlock reduces to the identity function, $H_l = \text{id}(H_{l-1})$.
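The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `f` stands in for the ResBlock's transformation $f_l$, and the test-time branch scales the residual by the survival probability $p_l$, which is the calibration the paper uses so that the expected transformation matches training.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def stochastic_depth_block(h_prev, f, survival_prob, training):
    """One ResBlock update: H_l = ReLU(b_l * f_l(H_{l-1}) + id(H_{l-1})).

    During training, b_l ~ Bernoulli(p_l) decides whether the whole
    residual branch is dropped; if b_l = 0 only the skip connection
    (plus ReLU) remains. At test time the branch is kept but scaled
    by p_l (the paper's test-time calibration).
    """
    if training:
        b = rng.binomial(1, survival_prob)  # b_l in {0, 1}
        return relu(b * f(h_prev) + h_prev)
    return relu(survival_prob * f(h_prev) + h_prev)
```

With `survival_prob=1.0` every block is always active and the rule reduces to the plain ResNet update; with `survival_prob=0.0` the block is always bypassed and the input passes through the skip connection unchanged (up to the ReLU).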