
What is: Vision-and-Language BERT?

Source: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model: visual and textual inputs are processed in separate streams that interact through co-attentional transformer layers, allowing each modality to incorporate information from the other.
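To make the co-attention idea concrete, here is a minimal PyTorch sketch of one co-attentional block: queries come from one stream while keys and values come from the other, so image regions attend over text tokens and vice versa. The class name, layer sizes, and single-block structure are illustrative assumptions, not ViLBERT's exact implementation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """A simplified co-attentional transformer block (sketch, not ViLBERT's code)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention: visual queries over text keys/values, and vice versa.
        self.vis_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, v: torch.Tensor, t: torch.Tensor):
        # v: (batch, num_regions, dim) image-region features
        # t: (batch, num_tokens,  dim) text-token features
        v_out, _ = self.vis_attends_txt(query=v, key=t, value=t)
        t_out, _ = self.txt_attends_vis(query=t, key=v, value=v)
        # Residual connection + layer norm, as in a standard transformer block.
        return self.norm_v(v + v_out), self.norm_t(t + t_out)


# Example: 36 image regions and 20 text tokens exchange information.
v = torch.randn(2, 36, 768)
t = torch.randn(2, 20, 768)
v2, t2 = CoAttentionBlock()(v, t)
print(v2.shape, t2.shape)  # torch.Size([2, 36, 768]) torch.Size([2, 20, 768])
```

Swapping the query source between streams is what distinguishes a co-attentional layer from ordinary self-attention, where queries, keys, and values all come from the same sequence.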