
What is: Compressive Transformer?

Source: Compressive Transformers for Long-Range Sequence Modelling
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

The Compressive Transformer is an extension of the Transformer that maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). It uses the same attention mechanism over both its memories and its compressed memories, learning to query both its short-term granular memory and its longer-term coarse memory. It builds on the ideas of Transformer-XL, which maintains a memory of past activations at each layer to preserve a longer history of context. Transformer-XL discards past activations once they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories instead of discarding them, and to store them in an additional compressed memory.
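To make the attention pattern concrete, here is a minimal sketch in PyTorch of one layer attending over the concatenation of compressed memory, ordinary memory, and the current segment. The dimensions and the use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's implementation (which uses relative positional encodings, omitted here):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

batch, seg_len, mem_len, cmem_len = 2, 64, 128, 32
h = torch.randn(batch, seg_len, d_model)      # current segment's hidden states
mem = torch.randn(batch, mem_len, d_model)    # short-term granular memory
cmem = torch.randn(batch, cmem_len, d_model)  # long-term coarse memory

# Keys/values span [compressed memory | memory | current segment];
# queries come only from the current segment, as in Transformer-XL.
kv = torch.cat([cmem, mem, h], dim=1)
out, _ = attn(query=h, key=kv, value=kv)
print(out.shape)  # torch.Size([2, 64, 512])
```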

At each time step t, the oldest compressed memories are discarded (FIFO) and the oldest n states from ordinary memory are compressed and shifted into the newly freed slots of compressed memory. During training, the compression network is optimized with an auxiliary reconstruction loss rather than through the main language-model objective, so the two components are effectively trained separately.
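Below is a minimal sketch of this memory-update step, assuming PyTorch tensors and mean pooling as the compression function (one of the compression functions explored in the paper); the names `update_memories` and `mean_pool_compress` are illustrative, not from the authors' code:

```python
import torch

def mean_pool_compress(x, c=2):
    """Average every c adjacent states: (B, T, D) -> (B, T // c, D).
    Mean pooling is one of the compression functions explored in the paper."""
    B, T, D = x.shape
    return x.reshape(B, T // c, c, D).mean(dim=2)

def update_memories(h, mem, cmem, compress_fn, n):
    """One memory-update step (illustrative sketch).

    h:    current segment's hidden states, shape (B, S, D)
    mem:  ordinary (granular) memory,      shape (B, mem_len, D)
    cmem: compressed (coarse) memory,      shape (B, cmem_len, D)
    n:    number of oldest memory states to compress
    """
    cmem_len = cmem.size(1)
    old, mem = mem[:, :n], mem[:, n:]  # oldest n states fall out of memory ...
    compressed = compress_fn(old)      # ... and are compressed, not discarded
    # FIFO: append new compressed states, drop the oldest compressed memories.
    cmem = torch.cat([cmem, compressed], dim=1)[:, -cmem_len:]
    # The current segment's activations enter ordinary memory.
    mem = torch.cat([mem, h], dim=1)
    return mem, cmem

B, D = 2, 512
mem, cmem = torch.randn(B, 128, D), torch.randn(B, 64, D)
h = torch.randn(B, 32, D)
mem, cmem = update_memories(h, mem, cmem, mean_pool_compress, n=32)
print(mem.shape, cmem.shape)  # torch.Size([2, 128, 512]) torch.Size([2, 64, 512])
```

With a compression rate of c = 2, every two evicted states become one compressed state, so the compressed memory covers a history twice as long as an ordinary memory of the same size.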