
What is: InterBERT?

Source: InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

InterBERT aims to model the interaction between information flows from different modalities. The architecture builds multi-modal interaction while preserving the independence of each single-modal representation. InterBERT consists of an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained with three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching. A structural sketch of these components is given below.
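The paper describes the architecture at a block level rather than in code. The following is a minimal PyTorch sketch of how the four named components could fit together; the layer counts, dimensions, and class/parameter names are assumptions for illustration and are not taken from the paper, and the pre-training heads for the three tasks are omitted.

```python
import torch
import torch.nn as nn


class InterBERTSketch(nn.Module):
    """Structural sketch only: image/text embeddings, a single-stream
    interaction module, and a two-stream extraction module."""

    def __init__(self, vocab_size=30522, hidden=768, region_feat_dim=2048,
                 n_interaction_layers=6, n_extraction_layers=3, n_heads=12):
        super().__init__()
        # Text embedding layer: token ids -> hidden vectors
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Image embedding layer: projects region features into the same space
        self.image_embed = nn.Linear(region_feat_dim, hidden)

        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)

        # Single-stream interaction module: joint self-attention over the
        # concatenated text and image sequences
        self.interaction = nn.TransformerEncoder(make_layer(), n_interaction_layers)
        # Two-stream extraction module: separate encoders keep
        # modality-specific representations independent
        self.text_stream = nn.TransformerEncoder(make_layer(), n_extraction_layers)
        self.image_stream = nn.TransformerEncoder(make_layer(), n_extraction_layers)

    def forward(self, token_ids, region_feats):
        txt = self.text_embed(token_ids)       # (batch, text_len, hidden)
        img = self.image_embed(region_feats)   # (batch, n_regions, hidden)
        # Fuse both modalities in one stream
        joint = self.interaction(torch.cat([txt, img], dim=1))
        t_len = txt.size(1)
        # Split back into modality-specific streams
        text_out = self.text_stream(joint[:, :t_len])
        image_out = self.image_stream(joint[:, t_len:])
        return text_out, image_out
```

Usage under these assumptions is a single forward pass, e.g. `model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))`, which returns one output sequence per modality for the downstream pre-training objectives.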