
What is: EsViT?

Source: Efficient Self-supervised Vision Transformers for Representation Learning
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

EsViT proposes two techniques for developing efficient self-supervised vision transformers for visual representation learning: a multi-stage architecture with sparse self-attention and a new pre-training task of region matching. The multi-stage architecture reduces modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. The new pre-training task compensates for this loss: by matching regions across augmented views, the model learns fine-grained region dependencies, which significantly improves the quality of the learned visual representations.
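
To make the region-matching idea concrete, below is a minimal sketch of a region-level matching loss in PyTorch, assuming a DINO-style teacher-student setup in which each view yields per-region features plus projection-head outputs over K prototypes. The function name, temperature values, and the direction of matching (teacher regions matched to their most similar student regions) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def region_matching_loss(student_regions, teacher_regions,
                         student_logits, teacher_logits,
                         student_temp=0.1, teacher_temp=0.04):
    """Sketch of a region-level matching loss (hypothetical names/values).

    student_regions / teacher_regions: (B, T, D) region features from
    two augmented views (T regions per image, D-dim features).
    student_logits / teacher_logits: (B, T, K) projection-head outputs
    over K prototypes, as in a DINO-style objective.
    """
    # Cosine similarity between every teacher region and every student region.
    s = F.normalize(student_regions, dim=-1)
    t = F.normalize(teacher_regions, dim=-1)
    sim = torch.einsum("btd,bsd->bts", t, s)       # (B, T_teacher, T_student)

    # For each teacher region, pick the best-matching student region.
    match = sim.argmax(dim=-1)                     # (B, T_teacher)

    # Teacher targets: sharpened softmax, no gradient flows to the teacher.
    targets = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()

    # Gather the matched student predictions and apply cross-entropy.
    idx = match.unsqueeze(-1).expand(-1, -1, student_logits.size(-1))
    matched = torch.gather(student_logits, 1, idx)  # (B, T_teacher, K)
    log_p = F.log_softmax(matched / student_temp, dim=-1)
    return -(targets * log_p).sum(dim=-1).mean()
```

In practice such a region-level term would be added to the usual view-level (global) self-distillation loss, so the model is trained to agree both on whole-image representations and on corresponding local regions across views.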