**Convolution-enhanced image Transformer** (**CeiT**) combines the strengths of CNNs (extracting low-level features and strengthening locality) with the strength of Transformers in establishing long-range dependencies. It makes three modifications to the original Transformer: 1) instead of tokenizing the raw input image directly, an **Image-to-Tokens** (**I2T**) module extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a **Locally-enhanced Feed-Forward** (**LeFF**) layer that promotes the correlation among neighbouring tokens in the spatial dimension; 3) a **Layer-wise Class token Attention** (**LCA**) module is attached at the top of the Transformer to utilize the multi-level representations.
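The idea behind LeFF, mixing each token with its spatial neighbours, can be illustrated with a minimal sketch. It is not the CeiT implementation. It assumes the class token comes first, the patch tokens form a square grid, and a single small kernel is shared by all channels. The function name `leff_local_mix` is hypothetical, and the linear projections and nonlinearities of a full feed-forward layer are left out.

```python
import math

def leff_local_mix(tokens, kernel):
    """Mix each patch token with its spatial neighbours (zero-padded).

    tokens: list of vectors, class token first (assumption), then patch tokens.
    kernel: k x k list of weights shared by every channel (simplification).
    """
    cls_token, patches = tokens[0], tokens[1:]
    side = math.isqrt(len(patches))
    if side * side != len(patches):
        raise ValueError("patch tokens must form a square grid")
    dim = len(cls_token)
    k = len(kernel)
    pad = k // 2

    # Restore the 1D token sequence to a side x side grid of vectors.
    grid = [patches[r * side:(r + 1) * side] for r in range(side)]

    mixed = []
    for r in range(side):
        for c in range(side):
            out = [0.0] * dim
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr - pad, c + dc - pad
                    if 0 <= rr < side and 0 <= cc < side:
                        w = kernel[dr][dc]
                        for ch in range(dim):
                            out[ch] += w * grid[rr][cc][ch]
            mixed.append(out)

    # The class token skips the spatial mixing and is re-attached in front.
    return [cls_token] + mixed
```

With a 2 x 2 grid and a 3 x 3 kernel, every patch token sees all four patches, so an all-ones kernel makes each output the sum of the patch tokens, while a kernel with a single 1 at the centre leaves the tokens unchanged.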

**ZoomNet** is a 2D human whole-body pose estimation technique. It aims to localize dense landmarks on the entire human body, including the face, hands, body, and feet. ZoomNet follows the top-down paradigm. Given a bounding box for each person, ZoomNet first localizes the easy-to-detect body keypoints and estimates the rough position of the hands and face. It then zooms in on the hand and face areas and predicts their keypoints from higher-resolution features for accurate localization. Unlike previous approaches, which usually assemble multiple networks, ZoomNet is a single network that is end-to-end trainable. It unifies five network heads (the human body pose estimator, the hand and face detectors, and the hand and face pose estimators) into a single network with shared low-level features.
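The zoom-in step implies a change of coordinates: keypoints predicted inside a hand or face region have to be mapped back to the full image. The sketch below shows only that bookkeeping, assuming the rough region is an axis-aligned box `(x0, y0, x1, y1)` and that keypoints inside it are expressed in normalized coordinates in [0, 1]. Both function names are hypothetical, and the feature cropping that ZoomNet performs inside the network is not modelled.

```python
def crop_to_image(points, box):
    """Map (u, v) points in normalized crop coordinates back to image coordinates."""
    x0, y0, x1, y1 = box
    return [(x0 + u * (x1 - x0), y0 + v * (y1 - y0)) for u, v in points]

def image_to_crop(points, box):
    """Map (x, y) image points into normalized coordinates of the zoomed-in box."""
    x0, y0, x1, y1 = box
    return [((x - x0) / (x1 - x0), (y - y0) / (y1 - y0)) for x, y in points]

# Hypothetical rough hand box and one keypoint predicted at the crop centre.
hand_box = (10.0, 20.0, 30.0, 60.0)
hand_keypoints = crop_to_image([(0.5, 0.5)], hand_box)
```

Because the mapping is a per-axis scale and shift, the two functions invert each other, which makes it easy to check that keypoints from the zoomed-in heads land in the same frame as the body keypoints.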

ZoomNet

Whole-Body Human Pose Estimation in the Wild

CeiT

Incorporating Convolution Designs into Visual Transformers

**Conditional Relation Network**, or **CRN**, is a building block for constructing more sophisticated structures for representation and reasoning over video. A CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building then becomes a simple exercise of replicating, rearranging, and stacking these reusable units for diverse modalities and contextual information. The design thus supports high-order relational and multi-step reasoning.
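The input and output contract described above (an array of objects plus a conditioning feature in, an array of encoded objects out) can be sketched as follows. This is one plausible organisation, not the published architecture: it assumes relations are computed over subsets of the inputs, with one output per subset size. The name `crn_unit`, the mean used as the relation, and the elementwise sum used for conditioning are placeholders for learned sub-networks.

```python
from itertools import combinations

def mean_vec(vectors):
    """Elementwise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]

def crn_unit(objects, condition, max_k):
    """Return one encoded output vector per subset size k = 1..max_k."""
    outputs = []
    for k in range(1, max_k + 1):
        encoded = []
        for subset in combinations(objects, k):
            # Placeholder relation over the subset (a learned network in practice).
            relation = mean_vec(subset)
            # Placeholder conditioning: combine the relation with the feature.
            encoded.append([r + c for r, c in zip(relation, condition)])
        # Aggregate every size-k subset into a single output object.
        outputs.append(mean_vec(encoded))
    return outputs

# Units are stackable because the output is again an array of objects.
first = crn_unit([[0.0, 0.0], [2.0, 4.0]], [1.0, 1.0], max_k=2)
second = crn_unit(first, [0.0, 0.0], max_k=1)
```

Since the output has the same form as the input, a second unit can consume the first unit's outputs under a different conditioning feature, which is the replication and stacking pattern the description refers to.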

Source	Incorporating Convolution Designs into Visual Transformers
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com