**CrossViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses a dual-branch architecture to extract multi-scale feature representations for image classification. The architecture combines image patches (i.e. tokens in a [transformer](https://paperswithcode.com/method/transformer)) of different sizes to produce stronger visual features for image classification. It processes small and large patch tokens with two separate branches of different computational complexities and these tokens are fused together multiple times to complement each other.

Fusion is achieved by an efficient [cross-attention module](https://paperswithcode.com/method/cross-attention-module), in which each transformer branch creates a non-patch token as an agent to exchange information with the other branch by attention. This allows for linear-time generation of the attention map in fusion instead of quadratic time otherwise.

**Position-Wise Feed-Forward Layer** is a type of [feedforward layer](https://www.paperswithcode.com/method/category/feedforwad-networks) consisting of two [dense layers](https://www.paperswithcode.com/method/dense-connections) that applies to the last dimension, which means the same dense layers are used for each position item in the sequence, so called position-wise.

Position-Wise Feed-Forward Layer

Attention Is All You Need

CrossViT

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

**Cycle-CenterNet** is a table structure parsing approach built on [CenterNet](https://paperswithcode.com/method/centernet) that uses a cycle-pairing module to simultaneously detect and group tabular cells into structured tables. It also utilizes a pairing loss which enables the grouping of discrete cells into the structured tables.

Source	CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com