
What is: Multi-Query Attention?

Source: Fast Transformer Decoding: One Write-Head is All You Need
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Multi-head attention consists of multiple attention layers (heads) run in parallel, each with its own linear transformations of the queries, keys, values, and outputs. Multi-query attention is identical except that the different heads share a single set of keys and values, which reduces the memory bandwidth spent loading keys and values during incremental decoding.
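Below is a minimal sketch of the idea in PyTorch. It is not the paper's reference implementation; the class and parameter names (MultiQueryAttention, d_model, n_heads) are illustrative, and masking, dropout, and the incremental-decoding cache are omitted to keep the core structure visible: per-head query projections, but a single shared key/value projection.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MultiQueryAttention(nn.Module):
    """Attention with per-head queries but one shared key/value head (sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Queries keep one projection per head (packed into d_model)...
        self.q_proj = nn.Linear(d_model, d_model)
        # ...while keys and values are projected once and shared by all heads.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (b, n_heads, t, d_head): one query head per attention head.
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # (b, 1, t, d_head): a single key/value head, broadcast across query heads.
        k = self.k_proj(x).unsqueeze(1)
        v = self.v_proj(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.out_proj(out)


if __name__ == "__main__":
    mqa = MultiQueryAttention(d_model=64, n_heads=8)
    x = torch.randn(2, 10, 64)
    print(mqa(x).shape)  # torch.Size([2, 10, 64])
```

Compared with standard multi-head attention, the only change is that k and v have a single head dimension that is broadcast against all query heads, so the key/value tensors (and any decoding cache built from them) are n_heads times smaller.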