**Segmentation Transformer**, or **SETR**, is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based segmentation model. Its pure-transformer encoder treats an input image as a sequence of image patches represented by learned patch embeddings, and transforms the sequence with global self-attention to learn discriminative feature representations. Concretely, we first decompose an image into a grid of fixed-size patches, forming a sequence of patches. Applying a linear embedding layer to the flattened pixel vector of every patch, we then obtain a sequence of feature embedding vectors as the input to a transformer. Given the learned features from the encoder transformer, a decoder is then used to recover the original image resolution. Crucially, the encoder transformer performs no downsampling in spatial resolution and instead models global context at every layer.
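The patch decomposition and linear embedding step can be illustrated with a minimal NumPy sketch. It is not the SETR implementation: the image size, `patch_size`, `embed_dim`, and the function names `patchify` and `embed_patches` are assumptions chosen for illustration, and the random weights stand in for parameters that would be learned.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, weight, bias):
    """Apply one linear layer to every flattened patch vector."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # hypothetical 64x64 RGB input
patch_size = 16                  # assumed patch size
embed_dim = 32                   # assumed embedding width

patches = patchify(image, patch_size)  # one row per patch, row-major over the grid
# Random stand-ins for the learned embedding parameters.
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)
tokens = embed_patches(patches, weight, bias)  # sequence fed to the transformer
```

With these assumed sizes the 4 x 4 patch grid yields a sequence of 16 tokens, and that sequence length stays fixed through the encoder because no layer downsamples it.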

Temporal attention can be seen as a dynamic time-selection mechanism that determines when to pay attention, and it is therefore typically used in video processing.
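As a rough illustration of time selection, the sketch below pools per-frame features with softmax weights over the time axis. It is a generic attention-pooling example, not the mechanism of any specific paper; the function name `temporal_attention_pool`, the scaled dot-product scoring, and the `query` vector are all assumptions.

```python
import numpy as np


def temporal_attention_pool(frame_features, query):
    """Pool (T, D) frame features into one (D,) vector with attention over time.

    Returns the pooled vector and the per-frame attention weights.
    """
    dim = frame_features.shape[1]
    # One relevance score per frame (scaled dot product with the query).
    scores = frame_features @ query / np.sqrt(dim)
    # Numerically stable softmax over the time axis.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Weighted average of frames: frames with larger weights dominate.
    pooled = weights @ frame_features
    return pooled, weights


# Hypothetical clip: 3 frames with 2-dimensional features.
frames = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [4.0, 0.0]])
query = np.array([1.0, 0.0])
pooled, weights = temporal_attention_pool(frames, query)
```

Here the third frame aligns most strongly with the query, so it receives the largest weight and contributes most to the pooled clip representation.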

Temporal attention

Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification

SETR

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

TWEC is an efficient method for generating temporal word embeddings, based on a simple heuristic: we first train an atemporal word embedding, the compass, and then use it to freeze one of the layers of the CBOW architecture. The frozen architecture is then used to train time-specific slices, whose embeddings are all comparable after training.
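The compass heuristic can be sketched with a tiny full-softmax CBOW in NumPy. This is a toy under stated assumptions, not the TWEC implementation: the token-id corpora, the hyperparameters, and the function `train_cbow` are hypothetical, and freezing the output layer while training the input embeddings per slice is an assumed choice of which layer to freeze.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def train_cbow(examples, w_in, w_out, freeze_out=False, lr=0.1, epochs=50):
    """Train CBOW in place on (context_ids, target_id) examples with plain SGD."""
    for _ in range(epochs):
        for context_ids, target_id in examples:
            hidden = w_in[context_ids].mean(axis=0)  # average context vectors
            probs = softmax(w_out @ hidden)          # distribution over the vocabulary
            grad_scores = probs.copy()
            grad_scores[target_id] -= 1.0            # cross-entropy gradient w.r.t. scores
            grad_hidden = w_out.T @ grad_scores
            if not freeze_out:
                w_out -= lr * np.outer(grad_scores, hidden)
            # Each context word receives an equal share of the hidden gradient.
            np.subtract.at(w_in, context_ids, lr * grad_hidden / len(context_ids))
    return w_in, w_out


rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
# Hypothetical token-id corpora for two time slices.
slice_a = [([0, 2], 1), ([1, 3], 2)]
slice_b = [([0, 2], 4), ([4, 3], 2)]

# Step 1: train the atemporal compass on all slices together.
compass_in = rng.normal(scale=0.1, size=(vocab_size, dim))
compass_out = rng.normal(scale=0.1, size=(vocab_size, dim))
train_cbow(slice_a + slice_b, compass_in, compass_out)
compass_out_snapshot = compass_out.copy()

# Step 2: train each slice against the same frozen output layer.
slice_embeddings = {}
for name, examples in {"a": slice_a, "b": slice_b}.items():
    w_in = compass_in.copy()  # assumed initialization from the compass
    train_cbow(examples, w_in, compass_out, freeze_out=True)
    slice_embeddings[name] = w_in
```

Because every slice is trained against the same frozen layer, the slice-specific vectors live in one shared coordinate system and can be compared directly, for example by cosine similarity between a word's vectors in slices `a` and `b`.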

Source	Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com