
What is: InterBERT?

Source: InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

InterBERT aims to model the interaction between information flows from different modalities. The architecture builds multi-modal interaction while preserving the independence of each single-modal representation. InterBERT consists of an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained with three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching. A structural sketch of these components is given below.
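The paper describes the architecture at a block level rather than in code. The following is a minimal PyTorch sketch of how the four named components could fit together; the layer counts, dimensions, and class/parameter names are assumptions for illustration and are not taken from the paper, and the pre-training heads for the three tasks are omitted.

```python
import torch
import torch.nn as nn


class InterBERTSketch(nn.Module):
    """Structural sketch only: image/text embeddings, a single-stream
    interaction module, and a two-stream extraction module."""

    def __init__(self, vocab_size=30522, hidden=768, region_feat_dim=2048,
                 n_interaction_layers=6, n_extraction_layers=3, n_heads=12):
        super().__init__()
        # Text embedding layer: token ids -> hidden vectors
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Image embedding layer: projects region features into the same space
        self.image_embed = nn.Linear(region_feat_dim, hidden)

        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)

        # Single-stream interaction module: joint self-attention over the
        # concatenated text and image sequences
        self.interaction = nn.TransformerEncoder(make_layer(), n_interaction_layers)
        # Two-stream extraction module: separate encoders keep
        # modality-specific representations independent
        self.text_stream = nn.TransformerEncoder(make_layer(), n_extraction_layers)
        self.image_stream = nn.TransformerEncoder(make_layer(), n_extraction_layers)

    def forward(self, token_ids, region_feats):
        txt = self.text_embed(token_ids)       # (batch, text_len, hidden)
        img = self.image_embed(region_feats)   # (batch, n_regions, hidden)
        # Fuse both modalities in one stream
        joint = self.interaction(torch.cat([txt, img], dim=1))
        t_len = txt.size(1)
        # Split back into modality-specific streams
        text_out = self.text_stream(joint[:, :t_len])
        image_out = self.image_stream(joint[:, t_len:])
        return text_out, image_out
```

Usage under these assumptions is a single forward pass, e.g. `model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))`, which returns one output sequence per modality for the downstream pre-training objectives.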