**GPT-2** is a [Transformer](https://paperswithcode.com/methods/category/transformers) architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous [GPT](https://paperswithcode.com/method/gpt) architecture with some modifications:

- [Layer normalization](https://paperswithcode.com/method/layer-normalization) is moved to the input of each sub-block, similar to a
pre-activation residual network and an additional layer normalization was added after the final self-attention block. 

- A modified initialization which accounts for the accumulation on the residual path with model depth
is used. Weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers. 

- The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and
a larger batch size of 512 is used.

**Panoptic-PolarNet** is a point cloud segmentation framework for LiDAR point clouds. It learns both semantic segmentation and class-agnostic instance clustering in a single inference network using a polar Bird's Eye View (BEV) representation, enabling the authors to circumvent the issue of occlusion among instances in urban street scenes. We first encode the raw point cloud data with $K$ features into a fixed-size representation on the polar BEV map. Next, we use a single backbone encoder-decoder network to generate semantic prediction, center [heatmap](https://paperswithcode.com/method/heatmap) and offset regression. Finally, we merge these outputs via a voting-based fusion to yield the panoptic segmentation result.

Panoptic-PolarNet

Panoptic-PolarNet: Proposal-free LiDAR Point Cloud Panoptic Segmentation

GPT-2

Language Models are Unsupervised Multitask Learners

**VirText**, or **Visual representations from Textual annotations** is a pretraining approach using semantically dense captions to learn visual representations. First a ConvNet and [Transformer](https://paperswithcode.com/method/transformer) are jointly trained from scratch to generate natural language captions for images. Then, the learned features are transferred to downstream visual recognition tasks.

Source	Language Models are Unsupervised Multitask Learners
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com