**MDETR** is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. It uses a [transformer](https://paperswithcode.com/method/transformer)-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs mined from pre-existing multi-modal datasets that have explicit alignment between phrases in the text and objects in the image. It is then fine-tuned on several downstream tasks, such as phrase grounding, referring expression comprehension, and segmentation.

The **ENet Initial Block** is an image model block used in the [ENet](https://paperswithcode.com/method/enet) semantic segmentation architecture. It applies two branches in parallel to the input image: [Max Pooling](https://paperswithcode.com/method/max-pooling) with non-overlapping 2 × 2 windows, and a [convolution](https://paperswithcode.com/method/convolution) with 13 filters. Concatenating the 13 convolution outputs with the 3 pooled channels of an RGB input gives 16 feature maps. This parallel-branch design is heavily inspired by Inception Modules.
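The channel arithmetic of the initial block can be checked with a minimal pure-Python sketch. The 3 × 3 kernel, stride of 2, and padding of 1 for the convolution branch are assumptions that are not stated in the text above, and the function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after a strided convolution (standard floor formula).

    The kernel, stride, and padding defaults are assumptions of this sketch.
    """
    return (size + 2 * padding - kernel) // stride + 1


def pool_out_size(size, window=2):
    """Spatial size after non-overlapping pooling (stride equals the window)."""
    return size // window


def max_pool_2x2(channel):
    """Non-overlapping 2 x 2 max pooling over one channel (a list of rows)."""
    rows, cols = len(channel) // 2, len(channel[0]) // 2
    return [
        [
            max(
                channel[2 * r][2 * c],
                channel[2 * r][2 * c + 1],
                channel[2 * r + 1][2 * c],
                channel[2 * r + 1][2 * c + 1],
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def initial_block_shape(in_channels, height, width, conv_filters=13):
    """Return (channels, height, width) after concatenating both branches."""
    conv_shape = (conv_filters, conv_out_size(height), conv_out_size(width))
    pool_shape = (in_channels, pool_out_size(height), pool_out_size(width))
    # Channel-wise concatenation needs both branches to agree spatially.
    if conv_shape[1:] != pool_shape[1:]:
        raise ValueError("branch outputs differ in spatial size")
    return (conv_shape[0] + pool_shape[0],) + conv_shape[1:]


# A 3-channel 512 x 512 input: 13 conv maps + 3 pooled maps = 16 maps.
print(initial_block_shape(3, 512, 512))
```

Under these assumptions, both branches halve the spatial resolution, so the 16 concatenated feature maps are produced at half the input height and width.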

ENet Initial Block

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

MDETR

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

In contrast to typical GANs, a U-Net GAN uses a segmentation network as the discriminator. This segmentation network classifies each pixel as one of two classes: real or fake. In doing so, the discriminator gives the generator region-specific feedback. This discriminator design also enables a [CutMix](https://paperswithcode.com/method/cutmix)-based consistency regularization on the two-dimensional output of the U-Net GAN discriminator, which further improves image synthesis quality.
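The idea behind the CutMix-based consistency regularization can be sketched in pure Python: the per-pixel output of the discriminator on a CutMix of a real and a fake image should match the CutMix of its per-pixel outputs on each image separately. This is a toy illustration and not the authors' implementation. The squared-error penalty, the toy discriminators, and all function names are assumptions of this sketch.

```python
def box_mask(height, width, top, left, box_h, box_w):
    """Binary mask: 0.0 inside the rectangle, 1.0 everywhere else."""
    return [
        [
            0.0 if top <= r < top + box_h and left <= c < left + box_w else 1.0
            for c in range(width)
        ]
        for r in range(height)
    ]


def cutmix(a, b, mask):
    """Per-pixel mix: keep `a` where the mask is 1, take `b` where it is 0."""
    return [
        [m * x + (1.0 - m) * y for x, y, m in zip(row_a, row_b, row_m)]
        for row_a, row_b, row_m in zip(a, b, mask)
    ]


def consistency_loss(disc, real, fake, mask):
    """Squared error between D(cutmix(real, fake)) and cutmix(D(real), D(fake))."""
    pred_on_mix = disc(cutmix(real, fake, mask))
    mix_of_preds = cutmix(disc(real), disc(fake), mask)
    return sum(
        (p - q) ** 2
        for row_p, row_q in zip(pred_on_mix, mix_of_preds)
        for p, q in zip(row_p, row_q)
    )


def pixelwise_disc(img):
    """Toy discriminator that judges every pixel independently."""
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in img]


def global_disc(img):
    """Toy discriminator that outputs the image mean at every pixel."""
    mean = sum(sum(row) for row in img) / (len(img) * len(img[0]))
    return [[mean for _ in row] for row in img]


real = [[1.0] * 4 for _ in range(4)]   # stand-in for a real image
fake = [[0.0] * 4 for _ in range(4)]   # stand-in for a generated image
mask = box_mask(4, 4, top=0, left=0, box_h=2, box_w=2)

print(consistency_loss(pixelwise_disc, real, fake, mask))
print(consistency_loss(global_disc, real, fake, mask))
```

In this toy setup, the pixel-wise discriminator incurs no penalty because its output follows the pasted region exactly, while the discriminator that ignores spatial layout is penalized. This is the sense in which the regularizer pushes the two-dimensional discriminator output toward localized real/fake decisions.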

Source	MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com