ALBEF introduces a contrastive loss to align the image and text representations before fusing them through cross-modal attention, which enables more grounded vision and language representation learning. ALBEF does not require bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. The image-text contrastive loss aligns the unimodal representations of an image-text pair before fusion. An image-text matching loss and a masked language modeling loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation is used to generate pseudo-targets, which improves learning from noisy data.
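
The image-text contrastive objective described above can be illustrated with a minimal sketch. It assumes a symmetric softmax cross-entropy over the cosine similarities in a batch, where the i-th image and the i-th text form the matching pair. The function names and the temperature value of 0.07 are assumptions made for illustration, and the sketch leaves out the momentum-distilled pseudo-targets, so it should not be read as the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over one batch.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature default is an illustrative assumption.
    """
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Image-to-text similarity logits (cosine similarity / temperature).
    i2t = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    # Text-to-image logits are the transpose.
    t2i = [list(col) for col in zip(*i2t)]
    return 0.5 * (cross_entropy_diag(i2t) + cross_entropy_diag(t2i))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(itc_loss(images, aligned_texts))   # close to zero
    print(itc_loss(images, swapped_texts))   # much larger
```

A batch whose pairs are already aligned gives a loss near zero, while swapping the texts gives a large loss. That gap is the signal that pulls the unimodal embeddings together before they reach the multimodal encoder.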

This method works as a two-level optimization algorithm.
The outermost level uses grammatical evolution to evolve a grammar to build the agent.
Then, [Q-learning](https://paperswithcode.com/method/q-learning) is used in the fitness evaluation phase to allow the agent to perform online learning.
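
As a rough illustration of the inner level, the sketch below shows the standard tabular Q-learning update that an evolved agent could apply during fitness evaluation. The state labels, the action set, and the values of `alpha` and `gamma` are hypothetical, and how the evolved structure maps observations to states is not specified here.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the updated value.

    q maps (state, action) pairs to estimated returns. alpha (learning rate)
    and gamma (discount factor) are illustrative defaults, not source values.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q_table = defaultdict(float)   # unseen pairs start at 0.0
    actions = [0, 1]               # hypothetical action set
    # "s0" and "s1" are hypothetical state labels.
    print(q_update(q_table, "s0", 0, 1.0, "s1", actions))
```

In a two-level scheme of this kind, the outer evolutionary loop would score each candidate agent by the reward it collects while updates like this one run online.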

Grammatical evolution + Q-learning

Evolutionary learning of interpretable decision trees

ALBEF

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Original paper : Integrating Constraints and Metric Learning in Semi-Supervised Clustering, Bilenko et al. 2004

Source	Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com