
What is: Longformer?

Source: Longformer: The Long-Document Transformer
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Longformer is a modified Transformer architecture. Traditional Transformer-based models are unable to process long sequences because their self-attention operation scales quadratically with the sequence length. To address this, Longformer uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. The attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention.
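To make the scaling claim concrete, the following is a minimal NumPy sketch (illustrative, not the paper's code) that counts how many attention scores are computed under full self-attention versus a sliding window of width w: the former grows as n * n, the latter as roughly n * w.

import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean mask: True where query position i may attend to key position j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.abs(i - j) <= w // 2  # window of ~w tokens centred on each position

n, w = 4096, 512
mask = sliding_window_mask(n, w)
print("full attention scores:    ", n * n)            # 16,777,216 (quadratic)
print("windowed attention scores:", int(mask.sum()))  # ~2,000,000 (roughly n * w)

Note that a real implementation never materialises the dense n x n mask (that would defeat the purpose); the mask here only visualises which (query, key) pairs are computed.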

The attention patterns used include sliding window attention, dilated sliding window attention, and global + sliding window attention; the dilated and global variants are sketched below.
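Below is a matching sketch of the other two patterns (again illustrative; the parameter names w, d and global_idx are ours, not from the paper's code). Dilation inserts gaps of size d into the window so that coverage grows without extra computation, and global attention lets a few designated tokens attend to, and be attended by, every position.

import numpy as np

def dilated_window_mask(n: int, w: int, d: int) -> np.ndarray:
    """Sliding window with dilation: attend to every d-th position within a widened window."""
    offset = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return (offset <= (w // 2) * d) & (offset % d == 0)

def global_sliding_mask(n: int, w: int, global_idx: list) -> np.ndarray:
    """Sliding window plus full, symmetric attention for a few selected tokens."""
    offset = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    mask = offset <= w // 2
    mask[global_idx, :] = True  # global tokens attend to every position
    mask[:, global_idx] = True  # every position attends to the global tokens
    return mask

# e.g. for a classification task, give the first ([CLS]) token global attention
print(global_sliding_mask(n=16, w=4, global_idx=[0]).astype(int))

In practice, the Hugging Face transformers library ships a Longformer implementation (for example the allenai/longformer-base-4096 checkpoint) in which global attention is selected at run time through a global_attention_mask argument.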