
What is: Feedback Memory?

Source: Addressing Some Limitations of Transformers with Feedback Memory
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Feedback Memory is a type of attention module used in the Feedback Transformer architecture. It allows a transformer to use the most abstract representations from the past directly as inputs for the current timestep. This means that the model does not form its representations in parallel, but sequentially, token by token. More precisely, the context inputs to the attention modules are replaced with memory vectors computed over the past, i.e.:

$$\mathbf{z}^{l}_{t} = \text{Attn}\left(\mathbf{x}^{l}_{t}, \left[\mathbf{m}_{t-\tau}, \dots, \mathbf{m}_{t-1}\right]\right)$$

where the memory vector $\mathbf{m}_{t}$ is computed by summing the representations of each layer at the $t$-th timestep:

$$\mathbf{m}_{t} = \sum^{L}_{l=0}\text{Softmax}\left(w^{l}\right)\mathbf{x}^{l}_{t}$$

where the $w^{l}$ are learnable scalar parameters and $l = 0$ corresponds to the token embeddings. Weighting the layers with a softmax gives the model flexibility: it can average the layers or select a single one. This modification of the self-attention input changes the Transformer's computation from parallel to sequential, as summarized in the figure from the paper. In particular, it lets the model build the representation $\mathbf{x}^{l}_{t+1}$ from past representations of any layer $l'$, whereas in a standard Transformer this is only possible for $l > l'$. The change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. This capacity allows much shallower models to capture the same level of abstraction as a deeper architecture.
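
To make the two formulas concrete, here is a minimal PyTorch sketch of the sequential computation. It is only an illustration under stated assumptions, not the authors' implementation: the class name `FeedbackMemoryDecoder`, the sizes `d_model`, `n_layers`, `n_heads`, and `mem_len` (the memory span $\tau$), and the simplified residual feed-forward blocks are all choices made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedbackMemoryDecoder(nn.Module):
    """Toy decoder whose layers all attend to one shared feedback memory."""

    def __init__(self, d_model=64, n_layers=4, n_heads=4, mem_len=32):
        super().__init__()
        self.n_layers = n_layers
        self.mem_len = mem_len
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.ff = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers)
        ])
        # Learnable scalars w^l for l = 0..L (l = 0 is the token embedding).
        self.layer_weights = nn.Parameter(torch.zeros(n_layers + 1))

    def step(self, x_t, memory):
        """One sequential timestep.

        x_t:    (batch, 1, d_model) token embedding at time t.
        memory: list of (batch, 1, d_model) memory vectors m_{t-tau}..m_{t-1}.
        """
        if memory:
            ctx = torch.cat(memory, dim=1)
        else:
            # No past memory at t = 0; attend to the current input instead
            # (a choice made for this sketch, not dictated by the formulas).
            ctx = x_t
        layer_outputs = [x_t]  # index 0 holds the token embedding x^0_t
        h = x_t
        for attn, ff in zip(self.attn, self.ff):
            # z^l_t = Attn(x^l_t, [m_{t-tau}, ..., m_{t-1}])
            z, _ = attn(h, ctx, ctx)
            h = ff(h + z)          # simplified residual + feed-forward block
            layer_outputs.append(h)
        # m_t = softmax-weighted sum of the layer representations x^l_t.
        w = F.softmax(self.layer_weights, dim=0)
        m_t = sum(w[l] * layer_outputs[l] for l in range(self.n_layers + 1))
        return h, m_t

    def forward(self, x):
        """x: (batch, seq_len, d_model). Tokens are processed one by one."""
        memory, outputs = [], []
        for t in range(x.size(1)):
            z_t, m_t = self.step(x[:, t:t + 1, :], memory)
            outputs.append(z_t)
            memory = (memory + [m_t])[-self.mem_len:]  # keep last tau vectors
        return torch.cat(outputs, dim=1)


model = FeedbackMemoryDecoder()
out = model(torch.randn(2, 10, 64))   # batch of 2, sequence length 10
print(out.shape)                      # torch.Size([2, 10, 64])
```

Because every layer reads from the same memory, each token must be fully processed before the next one can start, which is exactly the parallel-to-sequential trade-off described above.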