
What is: LayoutLMv2?

Source: LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

LayoutLMv2 is an architecture and pre-training method for document understanding. The model is pre-trained on a large number of unlabeled scanned document images from the IIT-CDIP dataset; during pre-training, the image in some text-image pairs is randomly replaced with another document image, and the model must predict whether the image and the OCR text are matched, which teaches it cross-modal alignment. In addition, LayoutLMv2 integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully capture the relative positional relationships among different text blocks.
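
The spatial-aware self-attention mechanism adds learned bias terms for the relative 1D token distance and the relative 2D (x, y) distances between token bounding boxes to the ordinary attention scores. The PyTorch sketch below illustrates the idea under simplifying assumptions: relative distances are clamped rather than log-bucketed as in the paper, and the module and parameter names (`SpatialAwareSelfAttention`, `max_rel_pos`, `max_rel_2d`) are illustrative, not the released implementation.

```python
# Minimal sketch of spatial-aware self-attention: standard scaled dot-product
# attention plus learned relative-position biases (1D token index, 2D x/y).
import torch
import torch.nn as nn

class SpatialAwareSelfAttention(nn.Module):
    def __init__(self, dim, num_heads, max_rel_pos=128, max_rel_2d=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learned bias per head for each clamped relative offset.
        self.rel_1d = nn.Embedding(2 * max_rel_pos + 1, num_heads)
        self.rel_x = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.rel_y = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.max_rel_pos, self.max_rel_2d = max_rel_pos, max_rel_2d

    def forward(self, hidden, x_pos, y_pos):
        # hidden: (B, T, dim); x_pos, y_pos: (B, T) integer (quantized) box coordinates.
        B, T, _ = hidden.shape
        q, k, v = self.qkv(hidden).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, T, T)

        # 1D relative position bias over token indices.
        idx = torch.arange(T, device=hidden.device)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_rel_pos, self.max_rel_pos)
        scores = scores + self.rel_1d(rel + self.max_rel_pos).permute(2, 0, 1)

        # 2D relative position biases over bounding-box x and y coordinates.
        dx = (x_pos[:, None, :] - x_pos[:, :, None]).clamp(-self.max_rel_2d, self.max_rel_2d)
        dy = (y_pos[:, None, :] - y_pos[:, :, None]).clamp(-self.max_rel_2d, self.max_rel_2d)
        scores = scores + self.rel_x(dx + self.max_rel_2d).permute(0, 3, 1, 2)
        scores = scores + self.rel_y(dy + self.max_rel_2d).permute(0, 3, 1, 2)

        attn = scores.softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, T, -1)

# Tiny usage check with random inputs:
attn = SpatialAwareSelfAttention(dim=64, num_heads=4)
hidden = torch.randn(2, 10, 64)
x_pos = torch.randint(0, 1000, (2, 10))
y_pos = torch.randint(0, 1000, (2, 10))
print(attn(hidden, x_pos, y_pos).shape)  # torch.Size([2, 10, 64])
```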

Specifically, an enhanced Transformer architecture is used: a multi-modal Transformer serves as the backbone of LayoutLMv2. It accepts inputs from three modalities: text, image, and layout. The input of each modality is converted to an embedding sequence and fused by the encoder, and the model establishes deep interactions within and across modalities by leveraging the powerful Transformer layers.
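
Because the fusion happens inside the encoder, using the model mostly reduces to preparing the three input sequences. Below is a minimal sketch using the Hugging Face transformers implementation of LayoutLMv2, under the assumption that its extra dependencies (detectron2 for the visual backbone, pytesseract for OCR) are installed; `scanned_document.png` is a hypothetical input file.

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2Model

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("scanned_document.png").convert("RGB")  # hypothetical file

# The processor runs OCR to obtain words and bounding boxes (text + layout)
# and resizes the page image, so the encoding carries all three modalities.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)
```

The returned hidden states are the fused sequence: the text tokens come first, followed by the 49 visual tokens from the 7x7-pooled image feature map produced by the CNN backbone.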