
What is: Class Attention?

Source: Going deeper with Image Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

A Class Attention layer, or CA layer, is an attention mechanism for vision transformers, used in CaiT, that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$.

Considering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_q, W_k, W_v, W_o \in \mathbf{R}^{d \times d}$, and the corresponding biases $b_q, b_k, b_v, b_o \in \mathbf{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z = \left[x_{\text{class}}, x_{\text{patches}}\right]$. We then perform the projections:

$$Q = W_q \, x_{\text{class}} + b_q$$

$$K = W_k \, z + b_k$$

$$V = W_v \, z + b_v$$
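
To make the projection step concrete, here is a minimal single-head sketch in PyTorch using plain tensors. The sizes ($d = 64$, $p = 16$) and the random weights are illustrative assumptions, not values from the paper; row-vector conventions are used, so $W_q x_{\text{class}}$ becomes `x_class @ W_q.T`.

```python
import torch

d, p = 64, 16                              # illustrative embedding size and patch count

x_class = torch.randn(1, d)                # class embedding (CLS), shape (1, d)
x_patches = torch.randn(p, d)              # frozen patch embeddings, shape (p, d)
z = torch.cat([x_class, x_patches], 0)     # z = [x_class, x_patches]

W_q, b_q = torch.randn(d, d), torch.zeros(d)
W_k, b_k = torch.randn(d, d), torch.zeros(d)
W_v, b_v = torch.randn(d, d), torch.zeros(d)

Q = x_class @ W_q.T + b_q                  # query comes from the class token only: (1, d)
K = z @ W_k.T + b_k                        # keys over class + patches
V = z @ W_v.T + b_v                        # values over class + patches
```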

The class-attention weights are given by

$$A = \operatorname{Softmax}\left(Q \cdot K^{T} / \sqrt{d / h}\right)$$

where $Q \cdot K^{T} \in \mathbf{R}^{h \times 1 \times p}$. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector

$$\mathrm{out}_{\mathrm{CA}} = W_o \, A V + b_o$$

which is in turn added to $x_{\text{class}}$ for subsequent processing.
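
Putting the whole residual block together, the following PyTorch module is a sketch of multi-head class attention following the equations above; the class name `ClassAttention` and the batch-first tensor layout are assumptions for illustration, not the official CaiT code.

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Multi-head class-attention residual block (illustrative sketch)."""

    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0, "embedding size must be divisible by the number of heads"
        self.h, self.d_head = h, d // h
        # The four projections W_q, W_k, W_v, W_o with their biases b_q, b_k, b_v, b_o.
        self.W_q = nn.Linear(d, d)
        self.W_k = nn.Linear(d, d)
        self.W_v = nn.Linear(d, d)
        self.W_o = nn.Linear(d, d)

    def forward(self, x_class: torch.Tensor, x_patches: torch.Tensor) -> torch.Tensor:
        # x_class: (B, 1, d), x_patches: (B, p, d)
        B, _, d = x_class.shape
        z = torch.cat([x_class, x_patches], dim=1)            # z = [x_class, x_patches]

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (B, tokens, d) -> (B, h, tokens, d / h)
            return t.view(B, -1, self.h, self.d_head).transpose(1, 2)

        Q = split_heads(self.W_q(x_class))                    # (B, h, 1, d / h)
        K = split_heads(self.W_k(z))
        V = split_heads(self.W_v(z))

        # A = Softmax(Q K^T / sqrt(d / h)): one attention row per head.
        A = torch.softmax(Q @ K.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)

        out = (A @ V).transpose(1, 2).reshape(B, 1, d)        # weighted sum A x V, heads merged
        out_CA = self.W_o(out)                                # W_o A V + b_o
        return x_class + out_CA                               # residual added to x_class
```

For example, with $d = 64$, $h = 4$, and a batch of one image with $p = 16$ patches, `ClassAttention(64, 4)(torch.randn(1, 1, 64), torch.randn(1, 16, 64))` returns the updated class embedding of shape `(1, 1, 64)`; the patch embeddings are only read, never updated, which is what makes class attention cheaper than full self-attention in the final layers.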