**CuBERT**, or **Code Understanding BERT**, is a [BERT](https://paperswithcode.com/method/bert)-based model for code understanding. To train it, the authors curate a massive corpus of Python programs collected from GitHub. GitHub projects are known to contain a large amount of duplicate code. To avoid biasing the model toward such duplicated code, the authors perform deduplication using the method of [Allamanis (2018)](https://arxiv.org/abs/1812.06469). The resulting corpus has 7.4 million files with a total of 9.3 billion tokens (16 million unique).

The **Mogrifier LSTM** is an extension to the [LSTM](https://paperswithcode.com/method/lstm) where the LSTM’s input $\mathbf{x}$ is gated conditioned on the output of the previous step $\mathbf{h}\_{prev}$. Next, the gated input is used in a similar manner to gate the output of the previous time step. After a couple of rounds of this mutual gating, the last updated $\mathbf{x}$ and $\mathbf{h}\_{prev}$ are fed to an LSTM.

In detail, the Mogrifier is an LSTM where two inputs $\mathbf{x}$ and $\mathbf{h}\_{prev}$ modulate one another in an alternating fashion before the usual LSTM computation takes place. That is: $ \text{Mogrify}\left(\mathbf{x}, \mathbf{c}\_{prev}, \mathbf{h}\_{prev}\right) = \text{LSTM}\left(\mathbf{x}^{↑}, \mathbf{c}\_{prev}, \mathbf{h}^{↑}\_{prev}\right)$ where the modulated inputs $\mathbf{x}^{↑}$ and $\mathbf{h}^{↑}\_{prev}$ are defined as the highest indexed $\mathbf{x}^{i}$ and $\mathbf{h}^{i}\_{prev}$, respectively, from the interleaved sequences:

$$ \mathbf{x}^{i} = 2\sigma\left(\mathbf{Q}^{i}\mathbf{h}^{i-1}\_{prev}\right) \odot \mathbf{x}^{i-2} \text{ for odd } i \in \left[1 \dots r\right] $$

$$ \mathbf{h}^{i}\_{prev}  = 2\sigma\left(\mathbf{R}^{i}\mathbf{x}^{i-1}\right) \odot \mathbf{h}^{i-2}\_{prev} \text{ for even } i \in \left[1 \dots r\right] $$

with $\mathbf{x}^{-1} = \mathbf{x}$ and $\mathbf{h}^{0}\_{prev} = \mathbf{h}\_{prev}$. The number of "rounds", $r \in \mathbb{N}$, is a hyperparameter; $r = 0$ recovers the LSTM. Multiplication with the constant 2 ensures that randomly initialized $\mathbf{Q}^{i}$, $\mathbf{R}^{i}$ matrices result in transformations close to identity. To reduce the number of additional model parameters, we typically factorize the $\mathbf{Q}^{i}$, $\mathbf{R}^{i}$ matrices as products of low-rank matrices: $\mathbf{Q}^{i} = \mathbf{Q}^{i}\_{left}\mathbf{Q}^{i}\_{right}$ with $\mathbf{Q}^{i} \in \mathbb{R}^{m\times{n}}$, $\mathbf{Q}^{i}\_{left} \in \mathbb{R}^{m\times{k}}$, $\mathbf{Q}^{i}\_{right} \in \mathbb{R}^{k\times{n}}$, where $k < \min\left(m, n\right)$ is the rank.
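
The alternating gating can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`sigmoid`, `matvec`, `matmul`, `mogrify`), the dict-of-matrices layout for $\mathbf{Q}^{i}$ and $\mathbf{R}^{i}$, and the example values are all hypothetical, and the final LSTM step is left out.

```python
import math


def sigmoid(z):
    """Logistic function applied to a single float."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


def matmul(left, right):
    """Multiply an (m x k) matrix by a (k x n) matrix, e.g. Q_left @ Q_right."""
    cols = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in left]


def mogrify(x, h_prev, Q, R, rounds):
    """Apply `rounds` steps of mutual gating and return the modulated (x, h_prev).

    Assumed layout (hypothetical): Q and R are dicts keyed by round index i.
    Q[i] has shape (len(x), len(h_prev)) and is used for odd i;
    R[i] has shape (len(h_prev), len(x)) and is used for even i.
    """
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            # Odd round: gate x with the current h_prev.
            gate = matvec(Q[i], h_prev)
            x = [2.0 * sigmoid(g) * xi for g, xi in zip(gate, x)]
        else:
            # Even round: gate h_prev with the current x.
            gate = matvec(R[i], x)
            h_prev = [2.0 * sigmoid(g) * hi for g, hi in zip(gate, h_prev)]
    return x, h_prev


# Example with m = 3 (input size), n = 2 (state size), rank k = 1.
x = [1.0, -2.0, 0.5]
h_prev = [0.3, -0.1]
Q1 = matmul([[0.1], [0.2], [-0.3]], [[1.0, -1.0]])  # 3 x 2, built from rank-1 factors
R2 = matmul([[0.5], [-0.5]], [[0.2, 0.0, 0.1]])     # 2 x 3, built from rank-1 factors
x_up, h_up = mogrify(x, h_prev, {1: Q1}, {2: R2}, rounds=2)
```

With all-zero $\mathbf{Q}^{i}$ and $\mathbf{R}^{i}$ every gate equals $2\sigma(0) = 1$, so the inputs pass through unchanged; this is the "close to identity" behaviour that the constant 2 is meant to give near initialization. The returned pair would then be fed to an ordinary LSTM cell together with $\mathbf{c}\_{prev}$.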


Our proposed loss function is a combination of BCE Loss, Focal Loss, and Dice Loss. Each of them contributes individually to improving performance; the details of each loss function are given below.

(1) BCE Loss compares each predicted probability with the actual class label, which is either 0 or 1. It is based on the Bernoulli distribution and is mostly used when only two classes are available; in our case there are exactly two classes, background and foreground. In the proposed method it is used for pixel-level classification.

(2) Focal Loss is a variant of BCE. It enables the model to focus on learning hard examples by decreasing the weights of easy examples, and it works well when the data is highly imbalanced.

(3) Dice Loss is inspired by the Dice Coefficient Score, an evaluation metric for image segmentation results. The Dice Coefficient is non-convex in nature, so it has been modified to make it more tractable. It is used to calculate the similarity between two images. Dice Loss is represented as
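
The Dice Loss expression itself does not survive in this text. As a hedged reference only, the three losses are commonly written in the following per-pixel forms. These are standard textbook definitions given here as an assumption, not necessarily the exact variants used by the authors; $p\_i$ is the predicted foreground probability, $y\_i \in \{0, 1\}$ the label, $p\_t = p$ if $y = 1$ and $1 - p$ otherwise, and $\alpha\_t$ is defined analogously from $\alpha$.

$$ \text{BCE}(p, y) = -\left[ y \log p + (1 - y) \log (1 - p) \right] $$

$$ \text{FL}(p\_t) = -\alpha\_t \left(1 - p\_t\right)^{\gamma} \log p\_t $$

$$ \text{Dice Loss} = 1 - \frac{2 \sum\_i p\_i y\_i}{\sum\_i p\_i + \sum\_i y\_i} $$

In practice a small smoothing constant is often added to the numerator and denominator of the Dice term to avoid division by zero on empty masks.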


We propose a loss function that combines all three of the loss functions above to benefit from each: BCE is used for pixel-wise classification; Focal Loss is used for learning hard examples, with 0.25 as the value of alpha and 2.0 as the value of gamma; and Dice Loss is used for learning a better boundary representation. Our proposed loss function is represented as
\begin{equation}
\text{Loss} = \left( \text{BCE Loss} + \text{Focal Loss} \right) + \text{Dice Loss}
\end{equation}
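
The combination can be sketched in plain Python over flattened lists of pixel probabilities and binary labels. This is an illustrative sketch under assumptions, not the authors' code: the function names, the mean reduction, the clipping constant `EPS`, the alpha-balanced form of the focal term, and the `smooth` term in the Dice loss are all choices made here for the example. Only the equal-weight sum and the values alpha = 0.25 and gamma = 2.0 come from the text.

```python
import math

EPS = 1e-7  # assumed clipping constant that keeps log() finite


def _clip(p):
    """Clamp a probability into [EPS, 1 - EPS]."""
    return min(max(p, EPS), 1.0 - EPS)


def bce_loss(probs, targets):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = _clip(p)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Mean alpha-balanced focal loss; easy pixels are down-weighted by (1 - p_t)^gamma."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = _clip(p)
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 minus a smoothed soft Dice coefficient (smooth is an assumed stabilizer)."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets) + smooth
    return 1.0 - (2.0 * intersection + smooth) / denom


def combined_loss(probs, targets):
    """Equal-weight sum: (BCE + Focal) + Dice."""
    return (bce_loss(probs, targets) + focal_loss(probs, targets)) + dice_loss(probs, targets)


# Example: four pixels, two foreground (1) and two background (0).
targets = [1, 0, 1, 0]
good = [0.9, 0.2, 0.7, 0.1]  # mostly correct predictions
bad = [0.1, 0.8, 0.3, 0.9]   # mostly wrong predictions
loss_good = combined_loss(good, targets)
loss_bad = combined_loss(bad, targets)
```

Here `loss_good` comes out smaller than `loss_bad`, since all three terms penalize the inverted predictions. Setting `gamma=0.0` and `alpha=0.5` in `focal_loss` reduces it to half of the BCE value, which shows that the focal term is a reweighted BCE.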

Source	Learning and Evaluating Contextual Embedding of Source Code
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com