What is: Patch Merger Module?

Source: Learning to Merge Tokens in Vision Transformers
Year: 2022
Data Source: CC BY-SA - https://paperswithcode.com

PatchMerger is a module for Vision Transformers that reduces the number of tokens (patches) passed on to the subsequent transformer encoder blocks, lowering compute while maintaining performance. It linearly transforms an input X of shape N patches × D dimensions with a learnable weight matrix whose transpose W^T has shape M output patches × D. This yields an M × N matrix of scores, one row per output patch, and a softmax is applied to each row so that the weights over the N input patches sum to 1. The resulting M × N weight matrix is then multiplied by the original input to give an output of shape M × D.

Mathematically, $Y = \text{softmax}(W^T X^T)\,X$
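
To make the shapes concrete, here is a minimal NumPy sketch of the formula. It is not the authors' implementation: the function names `softmax` and `patch_merger`, the sizes N=16, D=8, M=4, and the random inputs are all illustrative assumptions. It also assumes that W is stored with shape D × M (so that W^T is M × D) and that the softmax normalises over the N input patches for each output patch.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max along the axis for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def patch_merger(X, W):
    """Merge N input patches into M output patches.

    X: (N, D) input patches.
    W: (D, M) learnable weights (assumed layout, so W.T is (M, D)).
    Returns Y: (M, D).
    """
    scores = W.T @ X.T                   # (M, N): one row of scores per output patch
    weights = softmax(scores, axis=-1)   # assumed: normalise over the N input patches
    return weights @ X                   # (M, D): weighted average of input patches

# Illustrative sizes and random data (assumptions, not from the paper).
rng = np.random.default_rng(0)
N, D, M = 16, 8, 4
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))

Y = patch_merger(X, W)
print(Y.shape)  # (4, 8)
```

Because each row of the softmax output sums to 1, every output patch in this sketch is a convex combination of the input patches, and the encoder blocks that follow operate on M tokens instead of N.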

Image and formula from: Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B., Puigcerver, J., & Riquelme, C. (2022). Learning to Merge Tokens in Vision Transformers. arXiv preprint arXiv:2202.12015.