What is: Electric?

Source: Pre-Training Transformers as Energy-Based Cloze Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Electric is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context.
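To make the contrast in output shape concrete, here is a minimal sketch in plain Python. It assumes the contextual vectors are already available and compares a softmax head, which returns one distribution over the vocabulary per position, with an energy head, which returns one scalar per position. The function names, the toy hidden vectors, the vocabulary matrix, and the weight vector are all hypothetical values chosen for illustration; the energy head uses the dot product with a weight vector that the definition further down describes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_head(hidden, vocab_matrix):
    """BERT-style output: one distribution over the vocabulary per position."""
    return [softmax([dot(row, h) for row in vocab_matrix]) for h in hidden]

def energy_head(hidden, w):
    """Electric-style output: one scalar energy per position."""
    return [dot(w, h) for h in hidden]

# Toy values (assumptions): 3 positions, hidden size 2, vocabulary of 4 tokens.
hidden = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
vocab_matrix = [[0.1, 0.4], [-0.2, 0.3], [0.7, -0.5], [0.0, 0.9]]
w = [1.0, -2.0]

token_distributions = softmax_head(hidden, vocab_matrix)  # 3 rows of 4 probabilities
energies = energy_head(hidden, w)                         # 3 scalars
```

The softmax head produces `len(hidden) * len(vocab_matrix)` numbers, while the energy head produces only `len(hidden)`.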

Specifically, like BERT, Electric also models $p_{\text{data}}(x_t \mid \mathbf{x}_{\backslash t})$, but does not use masking or a softmax layer. Electric first maps the unmasked input $\mathbf{x} = [x_1, \ldots, x_n]$ into contextualized vector representations $\mathbf{h}(\mathbf{x}) = [\mathbf{h}_1, \ldots, \mathbf{h}_n]$ using a transformer network. The model assigns a given position $t$ an energy score

$$E(\mathbf{x})_t = \mathbf{w}^{T} \mathbf{h}(\mathbf{x})_t$$

using a learned weight vector $\mathbf{w}$. The energy function defines a distribution over the possible tokens at position $t$ as

$$
\begin{aligned}
p_\theta(x_t \mid \mathbf{x}_{\backslash t}) &= \exp\left(-E(\mathbf{x})_t\right) / Z_\theta(\mathbf{x}_{\backslash t}) \\
&= \frac{\exp\left(-E(\mathbf{x})_t\right)}{\sum_{x' \in \mathcal{V}} \exp\left(-E\left(\operatorname{REPLACE}(\mathbf{x}, t, x')\right)_t\right)}
\end{aligned}
$$

where $\operatorname{REPLACE}(\mathbf{x}, t, x')$ denotes replacing the token at position $t$ with $x'$, and $\mathcal{V}$ is the vocabulary, in practice usually word pieces. Unlike BERT, which produces the probabilities for all possible tokens $x'$ using a softmax layer, Electric takes each candidate $x'$ as input to the transformer. As a result, computing $p_\theta$ is prohibitively expensive because the partition function $Z_\theta(\mathbf{x}_{\backslash t})$ requires running the transformer $|\mathcal{V}|$ times; unlike most EBMs, the intractability of $Z_\theta(\mathbf{x}_{\backslash t})$ is due more to the expensive scoring function than to a large sample space.
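The cost argument can be illustrated with a brute-force sketch in plain Python. This is not the paper's implementation: `toy_encoder` is a hypothetical stand-in for the transformer, and `VOCAB` and `W` are made-up values. The sketch only shows how the energy, the REPLACE operation, and the partition function fit together, and it counts encoder calls to show that one exact distribution needs one forward pass per vocabulary entry.

```python
import math

VOCAB = ["the", "cat", "dog", "sat", "ran"]  # hypothetical toy vocabulary
W = [0.8, -1.2]                              # hypothetical learned weight vector w
forward_passes = 0                           # counts calls to the stand-in encoder

def toy_encoder(tokens):
    """Stand-in for the transformer: one 2-d vector per position.

    Each vector depends on the token and its neighbours, so swapping a
    token changes the representation at that position.
    """
    global forward_passes
    forward_passes += 1
    ids = [VOCAB.index(tok) for tok in tokens]
    hidden = []
    for t, tok_id in enumerate(ids):
        left = ids[t - 1] if t > 0 else 0
        right = ids[t + 1] if t < len(ids) - 1 else 0
        hidden.append([math.sin(tok_id + 0.5 * left),
                       math.cos(tok_id - 0.3 * right)])
    return hidden

def energy(tokens, t):
    """E(x)_t = w^T h(x)_t, using one encoder call."""
    h_t = toy_encoder(tokens)[t]
    return sum(w_i * h_i for w_i, h_i in zip(W, h_t))

def replace(tokens, t, candidate):
    """REPLACE(x, t, x'): a copy of x with position t set to x'."""
    return tokens[:t] + [candidate] + tokens[t + 1:]

def cloze_distribution(tokens, t):
    """Exact p_theta over all candidates at position t, by brute force."""
    unnormalized = {c: math.exp(-energy(replace(tokens, t, c), t)) for c in VOCAB}
    z = sum(unnormalized.values())  # partition function: |V| encoder calls
    return {c: u / z for c, u in unnormalized.items()}

sentence = ["the", "cat", "sat"]
probs = cloze_distribution(sentence, t=1)
print(probs)
print("encoder calls:", forward_passes)
```

With a five-token vocabulary the sketch makes five encoder calls for a single position. At the scale of a real word-piece vocabulary and a real transformer, the same loop is what makes the exact distribution impractical. The result also does not depend on which token originally occupied position `t`, since every candidate overwrites it.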