
What is: Sparse Transformer?

Source: Generating Long Sequences with Sparse Transformers
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

A Sparse Transformer is a Transformer-based architecture that uses sparse factorizations of the attention matrix to reduce the time and memory cost of attention from O(n²) to O(n√n). Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels which efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backwards pass to reduce memory usage.
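To make the factorization idea concrete, here is a minimal NumPy sketch of one of the patterns described in the paper, strided attention: each query attends to a local window of the previous `stride` positions plus every `stride`-th earlier position. With `stride ≈ √n`, each row of the mask has O(√n) entries, giving the O(n√n) total. This sketch computes the mask densely for clarity; the function names and the dense masked-softmax are illustrative, not the paper's custom kernels, which avoid materializing the full n×n matrix.

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean mask: query i may attend to key j (j <= i) if j is in the
    local window of the previous `stride` positions, or i - j is a
    multiple of `stride` (the strided head)."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            if i - j < stride:            # local window
                mask[i, j] = True
            elif (i - j) % stride == 0:   # strided positions
                mask[i, j] = True
    return mask

def sparse_attention(q, k, v, mask):
    """Masked scaled dot-product attention (dense compute for clarity)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # disallowed pairs get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 16, 8
stride = int(np.ceil(np.sqrt(n)))  # stride ~ sqrt(n)
mask = strided_sparse_mask(n, stride)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sparse_attention(q, k, v, mask)
print(mask.sum(), "of", n * n, "query-key pairs attended")
```

For n = 16 and stride = 4 the mask keeps 82 of the 256 possible query-key pairs, versus 136 for full causal attention; the gap widens as n grows, since each row stays O(√n) instead of O(n).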