
What is: Shake-Shake Regularization?

Source: Shake-Shake regularization
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

Shake-Shake Regularization aims to improve the generalization ability of multi-branch networks by replacing the standard summation of parallel branches with a stochastic affine combination. A typical pre-activation ResNet with two residual branches computes:

$$x_{i+1} = x_i + \mathcal{F}\left(x_i, \mathcal{W}_i^{(1)}\right) + \mathcal{F}\left(x_i, \mathcal{W}_i^{(2)}\right)$$

Shake-shake regularization introduces a random variable $\alpha_i$, drawn from a uniform distribution between 0 and 1 during training:

$$x_{i+1} = x_i + \alpha_i \mathcal{F}\left(x_i, \mathcal{W}_i^{(1)}\right) + \left(1 - \alpha_i\right) \mathcal{F}\left(x_i, \mathcal{W}_i^{(2)}\right)$$

Following the same logic as for dropout, all $\alpha_i$ are set to their expected value of 0.5 at test time.
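
To make the train/test behavior concrete, below is a minimal PyTorch sketch of this rule. The branch definition (`_make_branch`) and the per-forward-pass sampling of $\alpha_i$ are illustrative assumptions, not the exact architecture or sampling scheme from the paper.

```python
import torch
import torch.nn as nn


class ShakeShakeBlock(nn.Module):
    """Two-branch residual block with shake-shake regularization.

    The branch architecture below is a simplified stand-in for the
    pre-activation residual branches used in the paper.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.branch1 = self._make_branch(channels)
        self.branch2 = self._make_branch(channels)

    @staticmethod
    def _make_branch(channels: int) -> nn.Sequential:
        # Placeholder residual branch F(x, W): two ReLU-Conv-BN stages.
        return nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.branch1(x)
        f2 = self.branch2(x)
        if self.training:
            # Training: alpha_i ~ Uniform(0, 1), resampled on every forward pass.
            alpha = torch.rand(1, device=x.device)
        else:
            # Test time: alpha_i is fixed to its expected value of 0.5.
            alpha = 0.5
        # Stochastic affine combination of the two branches.
        return x + alpha * f1 + (1 - alpha) * f2


# Usage: the same module behaves differently in train and eval modes.
block = ShakeShakeBlock(channels=16)
block.train()   # random alpha per forward pass
out_train = block(torch.randn(2, 16, 32, 32))
block.eval()    # alpha fixed at 0.5
out_eval = block(torch.randn(2, 16, 32, 32))
```

Note that the granularity at which $\alpha_i$ is sampled (per mini-batch or per image) and whether the backward pass uses an independent random coefficient are further design choices explored in the paper; the sketch above implements only the forward rule described in this entry.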