
What is: Semantic Cross Attention?

Source: SCAM! Transferring humans between images with Semantic Cross Attention Modulation
Year: 2022
Data Source: CC BY-SA - https://paperswithcode.com

Semantic Cross Attention (SCA) is a form of cross attention that is restricted with respect to a semantic mask.

The goal of SCA is two-fold, depending on which input acts as the query and which as the key. Either it injects information from a semantically restricted set of latents into the feature map, or, conversely, it lets a set of latents retrieve information from a semantically restricted region of the feature map.

SCA is defined as:

\begin{equation} \text{SCA}(I_{1}, I_{2}, I_{3}) = \sigma\left(\frac{QK^T\odot I_{3} +\tau \left(1-I_{3}\right)}{\sqrt{d_{in}}}\right)V \quad , \end{equation}

where $I_{1}, I_{2}, I_{3}$ are the inputs, with $I_{1}$ attending $I_{2}$, and $I_{3}$ the mask that forces tokens from $I_{1}$ to attend only specific tokens from $I_{2}$. The attention values requiring masking are filled with $-\infty$ before the softmax (in practice $\tau = -10^{9}$). $Q = W_{Q}I_{1}$, $K = W_{K}I_{2}$ and $V = W_{V}I_{2}$ are the queries, keys and values, $d_{in}$ is the internal attention dimension, and $\sigma(\cdot)$ is the softmax operation.
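Below is a minimal PyTorch sketch of this masked cross attention. It assumes the projections act on the feature dimension of each input (the usual linear-layer convention); the class name, argument names and default $\tau$ are illustrative, not taken from the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SemanticCrossAttention(nn.Module):
    """Cross attention whose logits are masked by a binary semantic mask I3 (sketch)."""

    def __init__(self, dim_q, dim_kv, dim_inner, tau=-1e9):
        super().__init__()
        # Projections act on the feature dimension of each input (assumed convention).
        self.w_q = nn.Linear(dim_q, dim_inner, bias=False)
        self.w_k = nn.Linear(dim_kv, dim_inner, bias=False)
        self.w_v = nn.Linear(dim_kv, dim_inner, bias=False)
        self.scale = dim_inner ** -0.5
        self.tau = tau  # large negative constant standing in for -infinity

    def forward(self, i1, i2, i3):
        # i1: (n1, dim_q) query source, i2: (n2, dim_kv) key/value source,
        # i3: (n1, n2) binary mask with 1 where attention is allowed.
        q, k, v = self.w_q(i1), self.w_k(i2), self.w_v(i2)
        logits = (q @ k.transpose(-2, -1)) * i3 + self.tau * (1.0 - i3)
        attn = F.softmax(logits * self.scale, dim=-1)  # softmax over the key axis
        return attn @ v  # (n1, dim_inner)
```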

Let $X \in \mathbb{R}^{n \times C}$ be the feature map, with $n$ the number of pixels and $C$ the number of channels. Let $Z \in \mathbb{R}^{m \times d}$ be a set of $m$ latents of dimension $d$, and $s$ the number of semantic labels. Each semantic label is attributed $k$ latents, such that $m = k \times s$. Each semantic label mask is assigned $k$ copies in $S \in \{0;1\}^{n \times m}$.
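As an illustration of how $S$ can be built, here is a hypothetical helper that turns a per-pixel label map and the per-label latent assignment into the $n \times m$ binary mask; the function name and arguments are assumptions, not part of the paper.

```python
import torch

def build_semantic_mask(pixel_labels, s, k):
    """pixel_labels: (n,) integer semantic label per pixel, values in [0, s)."""
    # Latent l carries label l // k, i.e. k latents per semantic label (m = k * s).
    latent_labels = torch.arange(s).repeat_interleave(k)  # (m,)
    # S[p, l] = 1 iff pixel p and latent l share the same semantic label.
    return (pixel_labels[:, None] == latent_labels[None, :]).float()

# Example: n = 6 pixels, s = 3 labels, k = 2 latents per label -> S has shape (6, 6).
S = build_semantic_mask(torch.tensor([0, 0, 1, 2, 2, 1]), s=3, k=2)
```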

We can distinguish three types of SCA (a combined usage sketch follows the list):

(a) SCA with pixels $X$ attending latents $Z$: $\text{SCA}(X, Z, S)$, where $W_{Q} \in \mathbb{R}^{n \times d_{in}}$ and $W_{K}, W_{V} \in \mathbb{R}^{m \times d_{in}}$. The idea is to force the pixels from a semantic region to attend latents that are associated with the same label.

(b) SCA with latents $Z$ attending pixels $X$: $\text{SCA}(Z, X, S)$, where $W_{Q} \in \mathbb{R}^{m \times d_{in}}$ and $W_{K}, W_{V} \in \mathbb{R}^{n \times d_{in}}$. The idea is to semantically mask the attention values so that the latents attend only semantically corresponding pixels.

(c) SCA with latents $Z$ attending themselves: $\text{SCA}(Z, Z, M)$, where $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{m \times d_{in}}$. We denote this mask by $M \in \{0;1\}^{m \times m}$, with $M(i,j) = 1$ if the semantic label of latent $i$ is the same as that of latent $j$, and $0$ otherwise. The idea is to let the latents attend only latents that share the same semantic label.
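Putting the three variants together, the sketch below builds the latent-to-latent mask $M$ and instantiates the modules from the previous sketches for (a), (b) and (c). All shapes and names are illustrative assumptions; in particular, $S$ is transposed for (b) so that query rows index latents.

```python
import torch

# Hypothetical sizes: n pixels, C channels, s labels, k latents per label, latent dim d.
n, C, s, k, d, d_in = 64, 32, 3, 4, 16, 8
m = k * s

X = torch.randn(n, C)                      # feature map: n pixels, C channels
Z = torch.randn(m, d)                      # m latents of dimension d
pixel_labels = torch.randint(0, s, (n,))   # hypothetical per-pixel semantic labels
S = build_semantic_mask(pixel_labels, s, k)                       # (n, m)

# Latent-to-latent mask: 1 where two latents share the same semantic label.
latent_labels = torch.arange(s).repeat_interleave(k)
M = (latent_labels[:, None] == latent_labels[None, :]).float()    # (m, m)

sca_xz = SemanticCrossAttention(C, d, d_in)  # (a) pixels attend latents
sca_zx = SemanticCrossAttention(d, C, d_in)  # (b) latents attend pixels
sca_zz = SemanticCrossAttention(d, d, d_in)  # (c) latents attend themselves

out_a = sca_xz(X, Z, S)      # (n, d_in)
out_b = sca_zx(Z, X, S.T)    # (m, d_in); S transposed so query rows index latents
out_c = sca_zz(Z, Z, M)      # (m, d_in)
```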