What is RegionViT?

Source: RegionViT: Regional-to-Local Attention for Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

RegionViT uses two tokenization processes that convert an image into regional tokens (upper path) and local tokens (lower path). Each tokenization is a convolution with a different patch size: $28^2$ for regional tokens and $4^2$ for local tokens, with both projected to dimension $C$. One regional token therefore covers $7^2$ local tokens based on spatial locality, which sets the window size of a local region to $7^2$. At stage 1, the two sets of tokens are passed through the proposed regional-to-local transformer encoders. For the later stages, to balance the computational load and to produce feature maps at different resolutions, a downsampling process halves the spatial resolution and doubles the channel dimension, as in a CNN, for both regional and local tokens before they enter the next stage. At the end of the network, the remaining regional tokens are averaged to form the final embedding for classification, whereas detection uses all local tokens at each stage because they provide more fine-grained location information. With this pyramid structure, the ViT generates multi-scale features and can therefore be extended to vision applications beyond image classification, such as object detection.
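
The token arithmetic described above can be checked with a small sketch. It only tracks grid sizes and channel counts and does not implement the attention layers. The 224-pixel input, the base channel dimension of 96, and the four-stage depth are illustrative assumptions that the text does not state, and the names `StageShape`, `regionvit_shapes`, and `window_size` are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class StageShape:
    stage: int
    regional_grid: int  # regional tokens per side
    local_grid: int     # local tokens per side
    channels: int       # channel dimension shared by both token sets


def regionvit_shapes(image_size=224, regional_patch=28, local_patch=4,
                     base_channels=96, num_stages=4):
    """Token-grid sizes per stage for a square input image.

    image_size, base_channels and num_stages are assumed values;
    only the patch sizes (28 and 4) come from the description.
    """
    if image_size % regional_patch or regional_patch % local_patch:
        raise ValueError("patch sizes must tile the image evenly")
    regional = image_size // regional_patch
    local = image_size // local_patch
    channels = base_channels
    shapes = []
    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Later stages: halve the spatial resolution and double
            # the channel dimension for both token sets.
            if regional % 2 or local % 2:
                raise ValueError("grid cannot be halved evenly")
            regional //= 2
            local //= 2
            channels *= 2
        shapes.append(StageShape(stage, regional, local, channels))
    return shapes


def window_size(shape):
    """Local tokens per side covered by one regional token."""
    return shape.local_grid // shape.regional_grid


for s in regionvit_shapes():
    print(s, "window:", window_size(s))
```

Under these assumptions, both grids shrink together, so each regional token keeps covering a 7 x 7 window of local tokens at every stage, and the last stage is left with a single regional token over a 7 x 7 local grid.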