What is: Convolution-enhanced image Transformer?

Source: Incorporating Convolution Designs into Visual Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Convolution-enhanced image Transformer (CeiT) combines the strengths of CNNs (extracting low-level features and strengthening locality) with the strength of Transformers in establishing long-range dependencies. It makes three modifications to the original Transformer: 1) instead of tokenizing the raw input image directly, an Image-to-Tokens (I2T) module extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes correlation among neighbouring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) module is attached at the top of the Transformer to make use of the multi-level representations.
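To make the LeFF idea more concrete, here is a minimal PyTorch sketch, not the reference implementation. It assumes the layer reshapes the patch tokens back into a 2D grid and applies a depthwise convolution between the two linear projections, while the class token passes through unchanged; the summary above does not spell out these details. The class name `LeFFSketch` and the defaults `expand_ratio=4` and `kernel_size=3` are illustrative choices, and normalization layers are left out to keep the example short.

```python
import torch
import torch.nn as nn


class LeFFSketch(nn.Module):
    """Illustrative locally-enhanced feed-forward layer (hypothetical name)."""

    def __init__(self, dim: int, expand_ratio: int = 4, kernel_size: int = 3):
        super().__init__()
        hidden = dim * expand_ratio  # assumed expansion, as in a standard FFN
        self.expand = nn.Linear(dim, hidden)
        # groups=hidden makes the convolution depthwise: one filter per channel.
        self.depthwise = nn.Conv2d(hidden, hidden, kernel_size,
                                   padding=kernel_size // 2, groups=hidden)
        self.project = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + N, dim), class token first, N a perfect square.
        cls_token, patches = tokens[:, :1], tokens[:, 1:]
        batch, num_patches, _ = patches.shape
        side = int(round(num_patches ** 0.5))
        if side * side != num_patches:
            raise ValueError("patch tokens must form a square grid")

        x = self.act(self.expand(patches))                    # (B, N, hidden)
        x = x.transpose(1, 2).reshape(batch, -1, side, side)  # (B, hidden, side, side)
        x = self.act(self.depthwise(x))                       # mixes neighbouring tokens
        x = x.flatten(2).transpose(1, 2)                      # (B, N, hidden)
        x = self.project(x)                                   # (B, N, dim)
        # The class token bypasses the layer and is re-attached in front.
        return torch.cat([cls_token, x], dim=1)


layer = LeFFSketch(dim=32)
tokens = torch.randn(2, 1 + 14 * 14, 32)  # 14 x 14 patch grid plus class token
out = layer(tokens)
print(out.shape)  # torch.Size([2, 197, 32])
```

The only difference from a plain Transformer feed-forward block is the depthwise convolution in the middle: with a 3 x 3 kernel, each output token depends on its own input and on its eight spatial neighbours, which is one way to realize the locality described above.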