
What is: Graph Self-Attention?

Source: BP-Transformer: Modelling Long-Range Context via Binary Partitioning
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Graph Self-Attention (GSA) is a self-attention module used in the BP-Transformer architecture, and is based on the graph attentional layer.

For a given node $u$, we update its representation according to its neighbour nodes, formulated as $\mathbf{h}_{u} \leftarrow \text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right)$.

Let $\mathcal{A}\left(u\right)$ denote the set of the neighbour nodes of $u$ in $\mathcal{G}$. $\text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right)$ is detailed as follows:

$$\mathbf{A}^{u} = \text{concat}\left(\left\{\mathbf{h}_{v} \mid v \in \mathcal{A}\left(u\right)\right\}\right)$$

$$\mathbf{Q}^{u}_{i} = \mathbf{h}_{u}\mathbf{W}^{Q}_{i}, \quad \mathbf{K}^{u}_{i} = \mathbf{A}^{u}\mathbf{W}^{K}_{i}, \quad \mathbf{V}^{u}_{i} = \mathbf{A}^{u}\mathbf{W}^{V}_{i}$$

$$\text{head}^{u}_{i} = \text{softmax}\left(\frac{\mathbf{Q}^{u}_{i}{\mathbf{K}^{u}_{i}}^{T}}{\sqrt{d}}\right)\mathbf{V}^{u}_{i}$$

$$\text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right) = \left[\text{head}^{u}_{1}, \dots, \text{head}^{u}_{h}\right]\mathbf{W}^{O}$$

where $d$ is the dimension of $\mathbf{h}$, and $\mathbf{W}^{Q}_{i}$, $\mathbf{W}^{K}_{i}$ and $\mathbf{W}^{V}_{i}$ are trainable parameters of the $i$-th attention head.
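
To make the steps concrete, below is a minimal sketch of the computation in PyTorch. It is not the authors' implementation: the class and argument names (`GraphSelfAttention`, `h_u`, `neighbours`) are illustrative, the fused Q/K/V projections and per-head scaling follow standard multi-head-attention conventions rather than details confirmed by the paper, and batching plus the surrounding Transformer machinery (residuals, layer normalisation, feed-forward blocks) are omitted.

```python
import math
import torch
import torch.nn as nn


class GraphSelfAttention(nn.Module):
    """Multi-head attention of one node over its neighbour nodes (a sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # W^Q_i, W^K_i, W^V_i for all heads, fused into single projections
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # W^O

    def forward(self, h_u: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        # h_u:        (d_model,)            representation of node u
        # neighbours: (|A(u)|, d_model)     stacked neighbour representations A^u
        n, _ = neighbours.shape

        # Q^u_i = h_u W^Q_i,  K^u_i = A^u W^K_i,  V^u_i = A^u W^V_i
        q = self.w_q(h_u).view(self.n_heads, 1, self.d_head)
        k = self.w_k(neighbours).view(n, self.n_heads, self.d_head).transpose(0, 1)
        v = self.w_v(neighbours).view(n, self.n_heads, self.d_head).transpose(0, 1)

        # head^u_i = softmax(Q^u_i K^u_i^T / sqrt(d)) V^u_i
        # (the sqrt(d) scaling is applied per head here, a common convention)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (heads, 1, n)
        heads = torch.softmax(scores, dim=-1) @ v                  # (heads, 1, d_head)

        # GSA(G, h_u) = [head^u_1, ..., head^u_h] W^O
        return self.w_o(heads.reshape(-1))                         # (d_model,)


# Usage: update every node of a small hypothetical graph from its neighbour set A(u).
if __name__ == "__main__":
    d_model, n_heads = 16, 4
    gsa = GraphSelfAttention(d_model, n_heads)
    h = torch.randn(5, d_model)  # representations of 5 nodes
    neighbours_of = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
    h_new = torch.stack([gsa(h[u], h[nbrs]) for u, nbrs in neighbours_of.items()])
    print(h_new.shape)  # torch.Size([5, 16])
```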