AICurious Logo

What is: Segmentation Transformer?

SourceRethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Year2000
Data SourceCC BY-SA - https://paperswithcode.com

Segmentation Transformer, or SETR, is a Transformer-based segmentation model. The transformer-alone encoder treats an input image as a sequence of image patches represented by learned patch embedding, and transforms the sequence with global self-attention modeling for discriminative feature representation learning. Concretely, we first decompose an image into a grid of fixed-sized patches, forming a sequence of patches. With a linear embedding layer applied to the flattened pixel vectors of every patch, we then obtain a sequence of feature embedding vectors as the input to a transformer. Given the learned features from the encoder transformer, a decoder is then used to recover the original image resolution. Crucially there is no downsampling in spatial resolution but global context modeling at every layer of the encoder transformer.