
What is: Support-set Based Cross-Supervision?

Source: Support-Set Based Cross-Supervision for Video Grounding
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

SSCS, or Support-set Based Cross-Supervision, is a module for video grounding that consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective learns effective representations through contrastive learning, while the caption objective trains a powerful video encoder supervised by text. However, because some visual entities co-exist in both the ground-truth and background intervals, the mutual-exclusion assumption behind naive contrastive learning is violated, making it unsuitable for video grounding. SSCS addresses this problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.
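The two objectives can be combined as a weighted sum of a batch-wise contrastive loss and a token-level captioning loss. Below is a minimal PyTorch-style sketch of that combination; the temperature, the `lambda_cap` weight, and the teacher-forced captioning setup are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style objective: matched video-text pairs in the batch are
    pulled together, mismatched pairs are pushed apart."""
    video_emb = F.normalize(video_emb, dim=-1)        # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)          # (B, D)
    logits = video_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Symmetric cross-entropy over both video->text and text->video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def caption_loss(caption_logits, caption_tokens, pad_id=0):
    """Generative objective: the video encoder is supervised by reconstructing
    the query text token by token (teacher forcing)."""
    return F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                           caption_tokens.reshape(-1), ignore_index=pad_id)

def sscs_loss(video_emb, text_emb, caption_logits, caption_tokens, lambda_cap=1.0):
    # Hypothetical combination: discriminative + generative supervision.
    return (contrastive_loss(video_emb, text_emb)
            + lambda_cap * caption_loss(caption_logits, caption_tokens))
```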

Specifically, in the figure from the paper, two video-text pairs {V_i, L_i} and {V_j, L_j} from the batch are shown for clarity. After feeding them into a video encoder and a text encoder, clip-level and sentence-level embeddings ({X_i, Y_i} and {X_j, Y_j}) in a shared space are obtained. Based on the support-set module, weighted averages of X_i and X_j are computed to obtain X̄_i and X̄_j, respectively. Finally, the contrastive and caption objectives are combined to pull close the representations of clips and text from the same pair and push apart those from different pairs.
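One simple way to realise the support-set weighted average is to attend over all clip embeddings of a video, using the sentence embedding as the query. The sketch below follows that assumption; the actual weighting scheme in the paper may differ.

```python
import torch
import torch.nn.functional as F

def support_set_average(clip_emb, sent_emb):
    """clip_emb: (B, T, D) clip-level video embeddings X_i
       sent_emb: (B, D)    sentence-level embeddings Y_i
       returns:  (B, D)    support-set averaged video embeddings X̄_i"""
    # Similarity of each clip to the sentence query.
    scores = torch.einsum('btd,bd->bt', clip_emb, sent_emb)
    weights = F.softmax(scores, dim=-1)                    # (B, T) attention weights
    # Weighted average over all clips of the whole video.
    x_bar = torch.einsum('bt,btd->bd', weights, clip_emb)
    return x_bar
```

The resulting X̄_i would then stand in for X_i in the contrastive objective, so that each positive pair is supported by visual context drawn from the whole video rather than from the ground-truth interval alone.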