
What is: Disentangled Attention Mechanism?

Source: DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

The Disentangled Attention Mechanism is the attention mechanism used in the DeBERTa architecture. Unlike BERT, where each word in the input layer is represented by a single vector that sums its word (content) embedding and position embedding, each word in DeBERTa is represented by two vectors that encode its content and its position, respectively. Attention weights between words are then computed from disentangled matrices based on contents and relative positions: the standard content-to-content term is augmented with content-to-position and position-to-content terms. This design is motivated by the observation that the attention weight of a word pair depends not only on the words' contents but also on their relative positions. For example, the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they appear in different sentences.
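
To make the three-term computation concrete, here is a minimal single-head sketch in PyTorch. It is an illustrative reading of the paper's formulation, not DeBERTa's official implementation: the function names, the separate position projections `W_qr`/`W_kr`, and the maximum relative distance `k` are assumptions of this sketch, and the value projection and multi-head machinery are omitted.

```python
import math
import torch

def relative_positions(n, k):
    """delta(i, j): signed distance i - j, bucketed into [0, 2k)."""
    pos = torch.arange(n)
    rel = pos[:, None] - pos[None, :]          # i - j for every token pair
    return rel.clamp(-k, k - 1) + k            # clip long-range distances

def disentangled_attention(H, P, rel_idx, W_q, W_k, W_qr, W_kr):
    """Single-head sketch of DeBERTa-style disentangled attention.

    H:       (n, d)  content vectors, one per token
    P:       (2k, d) shared relative-position embeddings
    rel_idx: (n, n)  bucketed relative positions from relative_positions()
    """
    d = H.size(-1)
    Qc, Kc = H @ W_q, H @ W_k                  # content queries / keys
    Qr, Kr = P @ W_qr, P @ W_kr                # position queries / keys

    c2c = Qc @ Kc.T                            # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)  # content-to-position: Qc_i . Kr_{d(i,j)}
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T  # position-to-content: Kc_j . Qr_{d(j,i)}

    # Three terms are summed, so the scale is sqrt(3d) rather than sqrt(d).
    scores = (c2c + c2p + p2c) / math.sqrt(3 * d)
    return torch.softmax(scores, dim=-1)       # (n, n) attention weights

# Usage with toy shapes (all values random, for illustration only):
n, d, k = 8, 16, 4                             # tokens, hidden size, max distance
H = torch.randn(n, d)
P = torch.randn(2 * k, d)
W = [torch.randn(d, d) / math.sqrt(d) for _ in range(4)]
A = disentangled_attention(H, P, relative_positions(n, k), *W)
print(A.shape)                                 # torch.Size([8, 8]); rows sum to 1
```

Note how the position embeddings `P` are shared across all token pairs and indexed by relative distance, which is why the `c2p` and `p2c` terms are computed once against the `2k` position vectors and then gathered per pair, rather than materializing a separate position vector for every pair.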