
What is: Mixed Attention Block?

Source: ConvBERT: Improving BERT with Span-based Dynamic Convolution
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Mixed Attention Block is an attention module used in the ConvBERT architecture. It is a mixture of self-attention and span-based dynamic convolution. The two branches share the same query but use different keys to generate the attention map and the dynamic convolution kernel, respectively. The number of attention heads is reduced by directly projecting the input into a smaller embedding space, forming a bottleneck structure for both the self-attention and the span-based dynamic convolution branches. In the original figure, the input and output dimensions of the blocks are labeled to illustrate the overall framework, where d is the embedding size of the input and γ is the reduction ratio.
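To make the structure concrete, here is a minimal PyTorch-style sketch of such a block, written under stated assumptions rather than as the authors' implementation: the class and parameter names (MixedAttentionBlock, num_heads, gamma, kernel_size) and the exact way the per-position convolution kernels are generated are illustrative. The intent is only to show the shared query, the separate keys for the two branches, the d/γ bottleneck, and the concatenation of the two outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedAttentionBlock(nn.Module):
    """Sketch: self-attention + span-based dynamic convolution sharing one query,
    both operating in a reduced embedding space d // gamma (bottleneck)."""

    def __init__(self, d_model, num_heads=6, gamma=2, kernel_size=9):
        super().__init__()
        d_small = d_model // gamma            # bottleneck dimension
        self.num_heads = num_heads
        self.head_dim = d_small // num_heads
        self.kernel_size = kernel_size

        # Shared query projection; each branch has its own key (and value) projection.
        self.q_proj = nn.Linear(d_model, d_small)
        self.k_attn = nn.Linear(d_model, d_small)   # key for the attention map
        self.v_attn = nn.Linear(d_model, d_small)
        self.k_conv = nn.Linear(d_model, d_small)   # key for the convolution kernel
        self.v_conv = nn.Linear(d_model, d_small)

        # Generates one local kernel per head and per position (illustrative choice).
        self.kernel_gen = nn.Linear(d_small, num_heads * kernel_size)

        # Concatenated branch outputs (2 * d_small) are projected back to d_model.
        self.out_proj = nn.Linear(2 * d_small, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x)                          # shared query: (B, T, d_small)

        # ---- self-attention branch ----
        kh = self.k_attn(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        vh = self.v_attn(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        qh = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(qh @ kh.transpose(-1, -2) / self.head_dim ** 0.5, dim=-1)
        attn_out = (attn @ vh).transpose(1, 2).reshape(B, T, -1)

        # ---- span-based dynamic convolution branch ----
        k_conv = self.k_conv(x)
        v_conv = self.v_conv(x)
        # Kernel depends on query * conv-key, so it is conditioned on the local span.
        kernels = torch.softmax(
            self.kernel_gen(q * k_conv).view(B, T, self.num_heads, self.kernel_size), dim=-1
        )
        # Gather a local span of length kernel_size around each position.
        pad = self.kernel_size // 2
        v_unf = F.pad(v_conv.transpose(1, 2), (pad, pad))      # (B, d_small, T + 2*pad)
        v_unf = v_unf.unfold(-1, self.kernel_size, 1)          # (B, d_small, T, K)
        v_unf = v_unf.permute(0, 2, 1, 3).reshape(
            B, T, self.num_heads, self.head_dim, self.kernel_size
        )
        conv_out = torch.einsum("bthk,bthdk->bthd", kernels, v_unf).reshape(B, T, -1)

        # Concatenate the two branches and project back to the model dimension.
        return self.out_proj(torch.cat([attn_out, conv_out], dim=-1))


# Example usage with assumed sizes: input and output both have shape (2, 128, 768).
block = MixedAttentionBlock(d_model=768, num_heads=6, gamma=2, kernel_size=9)
out = block(torch.randn(2, 128, 768))
```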