
What is: LayoutLMv2?

Source: LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

LayoutLMv2 is an architecture and pre-training method for document understanding. The model is pre-trained on a large number of unlabeled scanned document images from the IIT-CDIP dataset; during pre-training, the image in some text-image pairs is randomly replaced with another document image, and the model must predict whether the image and the OCR text are matched, which teaches it cross-modal alignment. In addition, LayoutLMv2 integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully capture the relative positional relationships among different text blocks.
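
The spatial-aware self-attention mechanism adds learned bias terms for the relative 1D token distance and the relative 2D (x, y) distances between token bounding boxes to the ordinary attention scores. The PyTorch sketch below illustrates the idea under simplifying assumptions: relative distances are clamped rather than log-bucketed as in the paper, and the module and parameter names (`SpatialAwareSelfAttention`, `max_rel_pos`, `max_rel_2d`) are illustrative, not the released implementation.

```python
# Minimal sketch of spatial-aware self-attention: standard scaled dot-product
# attention plus learned relative-position biases (1D token index, 2D x/y).
import torch
import torch.nn as nn

class SpatialAwareSelfAttention(nn.Module):
    def __init__(self, dim, num_heads, max_rel_pos=128, max_rel_2d=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learned bias per head for each clamped relative offset.
        self.rel_1d = nn.Embedding(2 * max_rel_pos + 1, num_heads)
        self.rel_x = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.rel_y = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.max_rel_pos, self.max_rel_2d = max_rel_pos, max_rel_2d

    def forward(self, hidden, x_pos, y_pos):
        # hidden: (B, T, dim); x_pos, y_pos: (B, T) integer (quantized) box coordinates.
        B, T, _ = hidden.shape
        q, k, v = self.qkv(hidden).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, T, T)

        # 1D relative position bias over token indices.
        idx = torch.arange(T, device=hidden.device)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_rel_pos, self.max_rel_pos)
        scores = scores + self.rel_1d(rel + self.max_rel_pos).permute(2, 0, 1)

        # 2D relative position biases over bounding-box x and y coordinates.
        dx = (x_pos[:, None, :] - x_pos[:, :, None]).clamp(-self.max_rel_2d, self.max_rel_2d)
        dy = (y_pos[:, None, :] - y_pos[:, :, None]).clamp(-self.max_rel_2d, self.max_rel_2d)
        scores = scores + self.rel_x(dx + self.max_rel_2d).permute(0, 3, 1, 2)
        scores = scores + self.rel_y(dy + self.max_rel_2d).permute(0, 3, 1, 2)

        attn = scores.softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, T, -1)

# Tiny usage check with random inputs:
attn = SpatialAwareSelfAttention(dim=64, num_heads=4)
hidden = torch.randn(2, 10, 64)
x_pos = torch.randint(0, 1000, (2, 10))
y_pos = torch.randint(0, 1000, (2, 10))
print(attn(hidden, x_pos, y_pos).shape)  # torch.Size([2, 10, 64])
```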

Specifically, an enhanced Transformer architecture is used: a multi-modal Transformer serves as the backbone of LayoutLMv2. It accepts inputs from three modalities: text, image, and layout. The input of each modality is converted to an embedding sequence and fused by the encoder, and the model establishes deep interactions within and across modalities by leveraging the powerful Transformer layers.
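
Because the fusion happens inside the encoder, using the model mostly reduces to preparing the three input sequences. Below is a minimal sketch using the Hugging Face transformers implementation of LayoutLMv2, under the assumption that its extra dependencies (detectron2 for the visual backbone, pytesseract for OCR) are installed; `scanned_document.png` is a hypothetical input file.

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2Model

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("scanned_document.png").convert("RGB")  # hypothetical file

# The processor runs OCR to obtain words and bounding boxes (text + layout)
# and resizes the page image, so the encoding carries all three modalities.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)
```

The returned hidden states are the fused sequence: the text tokens come first, followed by the 49 visual tokens from the 7x7-pooled image feature map produced by the CNN backbone.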