AICurious Logo

What is: Byte Pair Encoding?

SourceNeural Machine Translation of Rare Words with Subword Units
Year2000
Data SourceCC BY-SA - https://paperswithcode.com

Byte Pair Encoding, or BPE, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).

Lei Mao has a detailed blog post that explains how this works.