VL-T5 is a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation. The model learns to generate labels in text based on the visual and textual inputs. In contrast to other existing methods, the framework unifies tasks as generating text labels conditioned on multimodal inputs. This allows the model to tackle vision-and-language tasks with unified text generation objective. The models use text prefixes to adapt to different tasks.

**Cascade Mask R-CNN** extends [Cascade R-CNN](https://paperswithcode.com/method/cascade-r-cnn) to instance segmentation, by adding a
mask head to the cascade.

In the [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn), the segmentation branch is inserted in parallel to the detection branch. However, the Cascade [R-CNN](https://paperswithcode.com/method/r-cnn) has multiple detection branches. This raises the questions of 1) where to add the segmentation branch and 2) how many segmentation branches to add. The authors consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question, adding a single mask prediction head at either the first or last stage of the Cascade R-CNN. Since the instances used to train the segmentation branch are the positives of the detection branch, their number varies in these two strategies. Placing the segmentation head later on the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as for object detection, which is a patch-based operation. The third strategy addresses the second question, adding a segmentation branch to each
cascade stage. This maximizes the diversity of samples used to learn the mask prediction task. 

At inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are.

Cascade Mask R-CNN

Cascade R-CNN: Delving into High Quality Object Detection

VL-T5

Unifying Vision-and-Language Tasks via Text Generation

**AdaRNN** is an adaptive [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks) that learns an adaptive model through two modules: [Temporal Distribution Characterization](https://paperswithcode.com/method/temporal-distribution-characterization) (TDC) and [Temporal Distribution Matching](https://paperswithcode.com/method/temporal-distribution-matching) (TDM) algorithms. Firstly, to better characterize the distribution information in time-series, TDC splits the training data into $K$ most diverse periods that have a large distribution gap inspired by the principle of maximum entropy. After that, a temporal distribution matching (TDM) algorithm is used to dynamically reduce distribution divergence using a [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks)-based model.

Source	Unifying Vision-and-Language Tasks via Text Generation
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com