**ReZero** is a [normalization](https://paperswithcode.com/methods/category/normalization) approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer, a [residual connection](https://paperswithcode.com/method/residual-connectio) is introduced for the input signal $\mathbf{x}$, together with one trainable parameter $\alpha$ that modulates the layer's non-trivial transformation $F(\mathbf{x})$:

$$
\mathbf{x}\_{i+1}=\mathbf{x}\_{i}+\alpha_{i} F\left(\mathbf{x}\_{i}\right)
$$

where $\alpha=0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but they dynamically evolve to suitable values during the initial stages of training. The architecture is illustrated in the Figure.
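The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: `make_transform`, `rezero_layer`, and `rezero_stack` are hypothetical names, and the `tanh`-based function merely stands in for the layer transformation $F$. With $\alpha=0$ in every layer, the stack returns its input unchanged regardless of the parameters of $F$, which is why gradients with respect to those parameters vanish at initialization.

```python
import math


def make_transform(scale, shift):
    """Build a toy stand-in for F; `scale` and `shift` play the role of its parameters."""
    def transform(x):
        return [math.tanh(scale * v + shift) for v in x]
    return transform


def rezero_layer(x, alpha, transform):
    """One ReZero layer: x_{i+1} = x_i + alpha_i * F(x_i)."""
    fx = transform(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


def rezero_stack(x, alphas, transform):
    """Apply one ReZero layer per entry in `alphas`."""
    for alpha in alphas:
        x = rezero_layer(x, alpha, transform)
    return x


x0 = [0.3, -1.2, 0.7]
depth = 16
f_a = make_transform(2.0, 0.5)
f_b = make_transform(-3.0, 1.0)

# At initialization every alpha is zero, so the whole stack is the identity,
# whatever parameters F happens to have.
out_init_a = rezero_stack(x0, [0.0] * depth, f_a)
out_init_b = rezero_stack(x0, [0.0] * depth, f_b)

# Once alpha moves away from zero, F starts to contribute to the output.
out_trained = rezero_stack(x0, [0.1] * depth, f_a)
```

In a real model $\alpha$ would be a trainable scalar per layer and $F$ a self-attention or feed-forward block; the fixed values here only make the identity-at-initialization behavior visible.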

**Electric** is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context.

Specifically, like BERT, Electric also models $p\_{\text {data }}\left(x\_{t} \mid \mathbf{x}\_{\backslash t}\right)$, but does not use masking or a softmax layer. Electric first maps the unmasked input $\mathbf{x}=\left[x\_{1}, \ldots, x\_{n}\right]$ into contextualized vector representations $\mathbf{h}(\mathbf{x})=\left[\mathbf{h}\_{1}, \ldots, \mathbf{h}\_{n}\right]$ using a transformer network. The model assigns a given position $t$ an energy score

$$
E(\mathbf{x})\_{t}=\mathbf{w}^{T} \mathbf{h}(\mathbf{x})\_{t}
$$

using a learned weight vector $\mathbf{w}$. The energy function defines a distribution over the possible tokens at position $t$ as

$$
p\_{\theta}\left(x\_{t} \mid \mathbf{x}\_{\backslash t}\right)=\exp \left(-E(\mathbf{x})\_{t}\right) / Z\_{\theta}\left(\mathbf{x}\_{\backslash t}\right)
$$

$$
=\frac{\exp \left(-E(\mathbf{x})\_{t}\right)}{\sum\_{x^{\prime} \in \mathcal{V}} \exp \left(-E\left(\operatorname{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)\right)\_{t}\right)}
$$

where $\text{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)$ denotes replacing the token at position $t$ with $x^{\prime}$ and $\mathcal{V}$ is the vocabulary, in practice usually word pieces. Unlike BERT, which produces the probabilities for all possible tokens $x^{\prime}$ using a softmax layer, Electric takes a candidate $x^{\prime}$ as input to the transformer. As a result, computing $p\_{\theta}$ is prohibitively expensive because the partition function $Z\_{\theta}\left(\mathbf{x}\_{\backslash t}\right)$ requires running the transformer $|\mathcal{V}|$ times; unlike most EBMs, the intractability of $Z\_{\theta}\left(\mathbf{x}\_{\backslash t}\right)$ is due more to the expensive scoring function than to a large sample space.
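The cost of the partition function can be made concrete with a small sketch. Everything here is an assumption for illustration: the four-token vocabulary, the `EMBED` table, the weight vector `W`, and the `encode` function, which is a crude stand-in for the transformer and not Electric's actual encoder. The point is only that normalizing the distribution at one position calls the encoder once per vocabulary entry.

```python
import math

# Hypothetical toy vocabulary, embeddings, and learned weight vector w.
VOCAB = ["a", "b", "c", "d"]
EMBED = {"a": [0.1, 0.4], "b": [-0.3, 0.2], "c": [0.5, -0.1], "d": [0.0, 0.3]}
W = [0.7, -1.1]

encoder_calls = 0  # counts how often the (stand-in) transformer is run


def encode(tokens):
    """Stand-in for the transformer: mixes each token embedding with the sequence mean."""
    global encoder_calls
    encoder_calls += 1
    n = len(tokens)
    mean = [sum(EMBED[tok][k] for tok in tokens) / n for k in range(2)]
    return [[math.tanh(EMBED[tok][k] + mean[k]) for k in range(2)] for tok in tokens]


def energy(tokens, t):
    """E(x)_t = w^T h(x)_t."""
    h_t = encode(tokens)[t]
    return sum(wk * hk for wk, hk in zip(W, h_t))


def replace(tokens, t, new_token):
    """REPLACE(x, t, x'): copy of the sequence with position t set to x'."""
    return tokens[:t] + [new_token] + tokens[t + 1:]


def token_distribution(tokens, t):
    """p(x_t | x_{\\t}) obtained by scoring every candidate token at position t."""
    unnormalized = {
        cand: math.exp(-energy(replace(tokens, t, cand), t)) for cand in VOCAB
    }
    z = sum(unnormalized.values())  # partition function Z(x_{\t})
    return {cand: value / z for cand, value in unnormalized.items()}


sentence = ["a", "c", "b"]
probs = token_distribution(sentence, 1)
```

After this runs, `encoder_calls` equals `len(VOCAB)`: one encoder pass per candidate token for a single position. With a word-piece vocabulary of tens of thousands of entries, this per-position cost is what makes exact normalization impractical.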

Electric

Pre-Training Transformers as Energy-Based Cloze Models

ReZero

ReZero is All You Need: Fast Convergence at Large Depth

**Sticker Response Selector**, or **SRS**, is a model for multi-turn dialog that automatically selects a sticker response. SRS first employs a convolution-based sticker image encoder and a self-attention-based multi-turn dialog encoder to obtain representations of the stickers and utterances. Next, a deep interaction network conducts deep matching between the sticker and each utterance in the dialog history. SRS then uses a fusion network to learn the short-term and long-term dependencies among all interaction results and output the final matching score.

Source	ReZero is All You Need: Fast Convergence at Large Depth
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com