**PVT**, or **Pyramid Vision Transformer**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) that uses a pyramid structure to serve as an effective backbone for dense prediction tasks. Specifically, it accepts more fine-grained inputs (4 × 4 pixels per patch) while shrinking the sequence length of the Transformer as it deepens, which reduces the computational cost. Additionally, a [spatial-reduction attention](https://paperswithcode.com/method/spatial-reduction-attention) (SRA) layer further reduces resource consumption when learning high-resolution features.
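To make the cost argument concrete, the following is a minimal sketch rather than the authors' implementation. It counts tokens for 4 × 4 patches and estimates how many query-key pairs attention has to score with and without spatial reduction. The 224 × 224 input size, the 16 × 16 comparison patch, the reduction ratio of 8, and the helper names `num_tokens` and `attention_pairs` are all assumptions chosen for illustration.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patches (tokens) cut from an image."""
    return (height // patch_size) * (width // patch_size)


def attention_pairs(n_queries, reduction_ratio=1):
    """Query-key pairs scored by one attention head.

    Assumption: spatial reduction shrinks the key/value grid by
    `reduction_ratio` along each spatial axis, so the number of keys
    drops by a factor of reduction_ratio ** 2.
    """
    n_keys = n_queries // (reduction_ratio ** 2)
    return n_queries * n_keys


# Illustrative 224 x 224 input (assumed, not taken from the text above).
fine = num_tokens(224, 224, 4)     # 56 * 56 tokens with 4 x 4 patches
coarse = num_tokens(224, 224, 16)  # 14 * 14 tokens with 16 x 16 patches

full = attention_pairs(fine)                        # no spatial reduction
reduced = attention_pairs(fine, reduction_ratio=8)  # assumed ratio of 8

print(fine, coarse, full // reduced)
```

Under these assumptions the fine-grained input produces 16 times as many tokens as the coarse one, which is why reducing the key/value grid matters most at high resolution.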

The entire model is divided into four stages, each composed of a patch embedding layer and an $\mathcal{L}\_{i}$-layer Transformer encoder. Following a pyramid structure, the output resolution of the four stages progressively shrinks from high (4-stride) to low (32-stride).
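The four-stage pyramid can be sketched as a simple shape calculation. This is an illustration only: the text gives the first and last strides (4 and 32), and the sketch assumes the stride doubles at each stage (4, 8, 16, 32). The 224 × 224 input and the helper name `pyramid_shapes` are likewise hypothetical.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return (stage, stride, feature_h, feature_w, tokens) per stage.

    Assumption: each stage's feature map is the input size divided by
    that stage's total stride, and every spatial position is one token.
    """
    shapes = []
    for stage, stride in enumerate(strides, start=1):
        h, w = height // stride, width // stride
        shapes.append((stage, stride, h, w, h * w))
    return shapes


for stage, stride, h, w, tokens in pyramid_shapes(224, 224):
    print(f"stage {stage}: stride {stride:>2}, map {h}x{w}, {tokens} tokens")
```

Each stage has a quarter of the tokens of the previous one, so the deeper (and typically wider) stages operate on much shorter sequences.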

**YOLOv2**, or [**YOLO9000**](https://www.youtube.com/watch?v=QsDDXSmGJZA), is a single-stage real-time object detection model. It improves upon [YOLOv1](https://paperswithcode.com/method/yolov1) in several ways, including the use of [Darknet-19](https://paperswithcode.com/method/darknet-19) as a backbone, [batch normalization](https://paperswithcode.com/method/batch-normalization), a high-resolution classifier, and anchor boxes for predicting bounding boxes.

YOLOv2

YOLO9000: Better, Faster, Stronger

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

**CheXNet** is a 121-layer [DenseNet](https://paperswithcode.com/method/densenet) trained on ChestX-ray14 for pneumonia detection.

Source	Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com