**VisTR** is a [Transformer](https://paperswithcode.com/method/transformer)-based video instance segmentation model. It treats video instance segmentation as a direct, end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR directly outputs the sequence of masks for each instance in the video, in order. At its core is an instance sequence matching and segmentation strategy that supervises and segments instances at the sequence level as a whole. VisTR frames instance segmentation and tracking from the same perspective of similarity learning, which considerably simplifies the overall pipeline and sets it apart from existing approaches.
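To make sequence-level matching concrete, here is a minimal sketch that assigns each ground-truth instance to one predicted sequence by minimizing a cost summed over all frames of the clip. Everything in it is an assumption made for illustration: the function names `sequence_cost` and `match_sequences`, the L1 box distance used as the cost, and the brute-force search over permutations (practical only for a handful of instances) are not taken from the description above.

```python
from itertools import permutations

def sequence_cost(pred_seq, gt_seq):
    """Hypothetical cost: L1 distance between boxes, summed over all frames."""
    return sum(
        sum(abs(p - g) for p, g in zip(pred_box, gt_box))
        for pred_box, gt_box in zip(pred_seq, gt_seq)
    )

def match_sequences(pred_seqs, gt_seqs):
    """Find the one-to-one assignment with the lowest total sequence cost.

    Returns (assignment, cost), where assignment[i] is the index of the
    predicted sequence matched to ground-truth instance i.
    """
    best_perm, best_cost = None, float("inf")
    # Brute force over all assignments; fine for a toy example only.
    for perm in permutations(range(len(pred_seqs)), len(gt_seqs)):
        cost = sum(
            sequence_cost(pred_seqs[p], gt_seqs[i]) for i, p in enumerate(perm)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost

# Two frames, two instances; each box is (x, y, w, h).
gt_seqs = [
    [(0.1, 0.1, 0.2, 0.2), (0.2, 0.1, 0.2, 0.2)],   # instance A, moving right
    [(0.7, 0.7, 0.2, 0.2), (0.7, 0.6, 0.2, 0.2)],   # instance B, moving up
]
pred_seqs = [
    [(0.7, 0.7, 0.2, 0.2), (0.7, 0.6, 0.2, 0.2)],   # identical to B
    [(0.1, 0.1, 0.2, 0.2), (0.25, 0.1, 0.2, 0.2)],  # close to A
]

assignment, cost = match_sequences(pred_seqs, gt_seqs)
print(assignment, round(cost, 3))
```

Because the cost is accumulated over the whole clip before the assignment is chosen, each instance keeps a single identity across frames, so no separate per-frame tracking step is needed in this toy setting.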

**lda2vec** builds representations over both words and documents by mixing word2vec’s skipgram architecture with Dirichlet-optimized sparse topic mixtures. 

The Skipgram Negative-Sampling (SGNS) objective of word2vec is modified to use document-wide feature vectors while simultaneously learning continuous document weights that load onto topic vectors. The total loss $L$ is the sum of the SGNS loss $L^{neg}\_{ij}$ and a Dirichlet-likelihood term over document weights, $L^{d}$. The loss is computed using a context vector $\overrightarrow{c\_{j}}$, pivot word vector $\overrightarrow{w\_{j}}$, target word vector $\overrightarrow{w\_{i}}$, and negatively-sampled word vector $\overrightarrow{w\_{l}}$:

$$ L = L^{d} + \sum\_{ij} L^{neg}\_{ij} $$

$$L^{neg}\_{ij} = \log\sigma\left(\overrightarrow{c\_{j}}\cdot\overrightarrow{w\_{i}}\right) + \sum^{n}\_{l=0}\log\sigma\left(-\overrightarrow{c\_{j}}\cdot\overrightarrow{w\_{l}}\right)$$
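The SGNS term can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the reference implementation. It uses the standard SGNS form, in which every negative-sample term is also passed through the logarithm. It also assumes the context vector is the pivot word vector plus a document vector built as a softmax-weighted mix of topic vectors; that composition follows the lda2vec paper rather than the text above. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(weights):
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(pivot_vec, doc_weights, topic_vecs):
    """Assumed composition: c_j = w_j + sum_k p_jk * t_k, p_j = softmax(doc_weights)."""
    proportions = softmax(doc_weights)
    doc_vec = [
        sum(p * topic[d] for p, topic in zip(proportions, topic_vecs))
        for d in range(len(pivot_vec))
    ]
    return [w + d for w, d in zip(pivot_vec, doc_vec)]

def sgns_term(context_vec, target_vec, negative_vecs):
    """L^neg_ij = log sigma(c_j . w_i) + sum_l log sigma(-c_j . w_l)."""
    positive = math.log(sigmoid(dot(context_vec, target_vec)))
    negative = sum(
        math.log(sigmoid(-dot(context_vec, w_l))) for w_l in negative_vecs
    )
    return positive + negative

# Toy 2-dimensional example with two topics and one negative sample.
pivot = [0.5, -0.2]
doc_weights = [0.0, 0.0]              # equal weights -> 50/50 topic mix
topics = [[1.0, 0.0], [0.0, 1.0]]
c = context_vector(pivot, doc_weights, topics)

value = sgns_term(c, target_vec=[1.0, 0.0], negative_vecs=[[-1.0, 0.0]])
print(c, value)
```

Both logarithms are of values in (0, 1), so the term is always negative. It moves toward zero as the target word scores higher against the context vector and the negative samples score lower.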

lda2vec

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec

VisTR

End-to-End Video Instance Segmentation with Transformers

A capsule is an activation vector that performs complex internal computations on its inputs. The length of the activation vector represents the probability that a feature is present, while the direction in which the vector points encodes the state of the recognized entity. Traditional CNNs, by contrast, use max pooling to make neuron activities invariant: a minor change in the input leaves the output signal of the neurons unchanged.
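The contrast between the two ideas can be illustrated with a small sketch. The `squash` function below is the squashing nonlinearity commonly used in the capsule literature; it is an assumption here, since the text above does not say how the vector length is constrained to behave like a probability. The helper names and the toy signals are hypothetical.

```python
import math

def squash(s):
    """Keep the direction of s, but map its length into the range [0, 1)."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / norm
    return [scale * x for x in s]

def length(v):
    return math.sqrt(sum(x * x for x in v))

def max_pool_1d(signal, size):
    """Non-overlapping max pooling over windows of `size` elements."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]

# Capsule view: length acts as a presence probability, direction is preserved.
capsule_out = squash([3.0, 4.0])      # raw vector of length 5
presence = length(capsule_out)        # 25 / 26, roughly 0.96

# Max-pooling view: a small shift inside a pooling window is discarded.
original = [0, 9, 0, 0, 0, 0, 3, 0]
shifted = [9, 0, 0, 0, 0, 0, 0, 3]    # each peak moved one step within its window
pooled_original = max_pool_1d(original, 2)
pooled_shifted = max_pool_1d(shifted, 2)

print(presence, pooled_original, pooled_shifted)
```

Max pooling returns the same output for both inputs, so the information about where each peak sat is lost. The squashed capsule vector keeps its direction, which is where a capsule can retain that kind of detail.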

Source	End-to-End Video Instance Segmentation with Transformers
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com