
What is: Disentangled Attention Mechanism?

Source: DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

The Disentangled Attention Mechanism is the attention mechanism used in the DeBERTa architecture. Unlike BERT, where each word in the input layer is represented by a single vector that sums its word (content) embedding and position embedding, each word in DeBERTa is represented by two vectors that encode its content and its position, respectively. Attention weights between words are then computed from disentangled matrices based on contents and relative positions: the standard content-to-content term is augmented with content-to-position and position-to-content terms. This design is motivated by the observation that the attention weight of a word pair depends not only on the words' contents but also on their relative positions. For example, the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they appear in different sentences.
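
To make the three-term computation concrete, here is a minimal single-head sketch in PyTorch. It is an illustrative reading of the paper's formulation, not DeBERTa's official implementation: the function names, the separate position projections `W_qr`/`W_kr`, and the maximum relative distance `k` are assumptions of this sketch, and the value projection and multi-head machinery are omitted.

```python
import math
import torch

def relative_positions(n, k):
    """delta(i, j): signed distance i - j, bucketed into [0, 2k)."""
    pos = torch.arange(n)
    rel = pos[:, None] - pos[None, :]          # i - j for every token pair
    return rel.clamp(-k, k - 1) + k            # clip long-range distances

def disentangled_attention(H, P, rel_idx, W_q, W_k, W_qr, W_kr):
    """Single-head sketch of DeBERTa-style disentangled attention.

    H:       (n, d)  content vectors, one per token
    P:       (2k, d) shared relative-position embeddings
    rel_idx: (n, n)  bucketed relative positions from relative_positions()
    """
    d = H.size(-1)
    Qc, Kc = H @ W_q, H @ W_k                  # content queries / keys
    Qr, Kr = P @ W_qr, P @ W_kr                # position queries / keys

    c2c = Qc @ Kc.T                            # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)  # content-to-position: Qc_i . Kr_{d(i,j)}
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T  # position-to-content: Kc_j . Qr_{d(j,i)}

    # Three terms are summed, so the scale is sqrt(3d) rather than sqrt(d).
    scores = (c2c + c2p + p2c) / math.sqrt(3 * d)
    return torch.softmax(scores, dim=-1)       # (n, n) attention weights

# Usage with toy shapes (all values random, for illustration only):
n, d, k = 8, 16, 4                             # tokens, hidden size, max distance
H = torch.randn(n, d)
P = torch.randn(2 * k, d)
W = [torch.randn(d, d) / math.sqrt(d) for _ in range(4)]
A = disentangled_attention(H, P, relative_positions(n, k), *W)
print(A.shape)                                 # torch.Size([8, 8]); rows sum to 1
```

Note how the position embeddings `P` are shared across all token pairs and indexed by relative distance, which is why the `c2p` and `p2c` terms are computed once against the `2k` position vectors and then gathered per pair, rather than materializing a separate position vector for every pair.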