The **Vision Transformer**, or **ViT**, is a model for image classification that employs a [Transformer](https://paperswithcode.com/method/transformer)-like architecture over patches of the image.  An image is split into fixed-size patches, each of them are then linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard [Transformer](https://paperswithcode.com/method/transformer) encoder. In order to perform classification, the standard approach of adding an extra learnable “classification token” to the sequence is used.

**Position-Sensitive RoI Pooling layer** aggregates the outputs of the last convolutional layer and generates scores for each RoI. Unlike [RoI Pooling](https://paperswithcode.com/method/roi-pooling), PS RoI Pooling conducts selective pooling, and each of the $k$ × $k$ bin aggregates responses from only one score map out of the bank of $k$ × $k$ score maps. With end-to-end training, this RoI layer shepherds the last convolutional layer to learn specialized position-sensitive score maps.

Position-Sensitive RoI Pooling

R-FCN: Object Detection via Region-based Fully Convolutional Networks

Vision Transformer

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

**BAGUA** is a communication framework whose design goal is to provide a system abstraction that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. The abstraction goes beyond parameter server and Allreduce paradigms, and provides a collection of MPI-style collective operations to facilitate communications with different precision and centralization strategies.

Source	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com