
What is: CodeBERT?

Source: CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

CodeBERT is a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search and code documentation generation. CodeBERT is built on a Transformer-based neural architecture and is trained with a hybrid objective function that incorporates the pre-training task of replaced token detection: deciding, for each token, whether it is the original or a plausible alternative sampled from a generator. This enables the model to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training and the latter helps to learn better generators.
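To make the replaced token detection objective concrete, the following is a minimal PyTorch sketch, not CodeBERT's actual implementation: the model sizes and the TinyGenerator/TinyDiscriminator names are illustrative assumptions, and a single toy generator stands in for CodeBERT's separate NL and code generators, which are trained on unimodal data. A small generator samples plausible replacements at masked positions, and the discriminator (the role CodeBERT itself plays) classifies every token as original or replaced.

import torch
import torch.nn as nn

# Illustrative sizes only; CodeBERT uses a RoBERTa-scale Transformer.
VOCAB_SIZE, HIDDEN, SEQ_LEN = 100, 32, 8

class TinyGenerator(nn.Module):
    """Proposes plausible alternative tokens for masked positions."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                HIDDEN, nhead=4, dim_feedforward=64, batch_first=True),
            num_layers=1)
        self.lm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, tokens):
        return self.lm_head(self.encoder(self.embed(tokens)))

class TinyDiscriminator(nn.Module):
    """Predicts, per token, whether it was replaced by the generator."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                HIDDEN, nhead=4, dim_feedforward=64, batch_first=True),
            num_layers=1)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens))).squeeze(-1)

def rtd_loss(generator, discriminator, tokens, mask_prob=0.15):
    """One replaced-token-detection loss on a batch of token ids."""
    mask = torch.rand(tokens.shape) < mask_prob

    # Generator samples plausible replacements at the masked positions.
    with torch.no_grad():
        logits = generator(tokens)
        sampled = torch.distributions.Categorical(logits=logits).sample()
    corrupted = torch.where(mask, sampled, tokens)

    # Label 1 if the token differs from the original, else 0; if the
    # generator happens to sample the original token, it counts as original.
    labels = (corrupted != tokens).float()
    return nn.functional.binary_cross_entropy_with_logits(
        discriminator(corrupted), labels)

gen, disc = TinyGenerator(), TinyDiscriminator()
# In CodeBERT the input would be a concatenated NL-PL pair.
batch = torch.randint(0, VOCAB_SIZE, (2, SEQ_LEN))
print(rtd_loss(gen, disc, batch))

Because the discriminator must judge every token rather than only the masked ones, this objective yields a training signal at each position, which is one motivation for using replaced token detection alongside masked language modeling.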