What is: RegionViT?

RegionViT consists of two tokenization processes that convert an image into regional (upper path) and local tokens (lower path). Each tokenization is a convolution with different patch sizes, the patch size of regional tokens is $28^2$ while $4^2$ is used for local tokens with dimensions projected to $C$ , which means that one regional token covers $7^2$ local tokens based on the spatial locality, leading to the window size of a local region to $7^2$ . At stage 1, two set of tokens are passed through the proposed regional-to-local transformer encoders. However, for the later stages, to balance the computational load and to have feature maps at different resolution, the approach uses a downsampling process to halve the spatial resolution while doubling the channel dimension like CNN on both regional and local tokens before going to the next stage. Finally, at the end of the network, it simply averages the remaining regional tokens as the final embedding for the classification while the detection uses all local tokens at each stage since it provides more fine-grained location information. By having the pyramid structure, the ViT can generate multi-scale features and hence it could be easily extended to more vision applications, e.g., object detection, rather than image classification only.

Source	RegionViT: Regional-to-Local Attention for Vision Transformers
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com