
What is: Multiscale Vision Transformer?

Source: Multiscale Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Multiscale Vision Transformer, or MViT, is a transformer architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, MViT is organized into several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features: early layers operate at high spatial resolution to model simple low-level visual information, while deeper layers operate at coarse spatial resolution on complex, high-dimensional features.
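
The stage structure described above can be illustrated with a small sketch that only tracks feature shapes and does no attention or pooling. The function name `multiscale_schedule`, the `Stage` record, and the default values (four stages, 96 starting channels, doubling channels and halving resolution at each stage boundary) are illustrative assumptions, not values taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int     # 1-based stage number
    channels: int  # channel dimension used inside this stage
    height: int    # spatial height of the feature map in this stage
    width: int     # spatial width of the feature map in this stage


def multiscale_schedule(height, width, base_channels=96, num_stages=4,
                        channel_mult=2, spatial_stride=2):
    """Return the per-stage (channels, resolution) pyramid.

    Hypothetical helper: every default here is an assumption chosen
    for illustration. Each stage transition multiplies the channel
    dimension by `channel_mult` and divides the spatial resolution
    by `spatial_stride`.
    """
    stages = []
    channels = base_channels
    for i in range(num_stages):
        stages.append(Stage(i + 1, channels, height, width))
        # Moving to the next stage: more channels, coarser resolution.
        channels *= channel_mult
        height //= spatial_stride
        width //= spatial_stride
    return stages


if __name__ == "__main__":
    for s in multiscale_schedule(56, 56):
        print(f"stage {s.index}: {s.channels} channels at {s.height}x{s.width}")
```

With these assumed defaults, the first stage is the widest spatially and the narrowest in channels, and the last stage is the reverse, which is the pyramid shape the description refers to. A conventional single-scale transformer would correspond to `channel_mult=1` and `spatial_stride=1`, leaving every stage identical.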