
What is: SentencePiece?

Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting both the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts the segmented text into an ID sequence, which guarantees perfect reproducibility of the normalization and subword segmentation.
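
The sketch below shows the typical train/encode/decode round trip with the official `sentencepiece` Python package. The corpus file name, model prefix, and vocabulary size are illustrative placeholders, not values from the paper.

```python
# Minimal sketch of training and using a SentencePiece model
# (pip install sentencepiece). File names and sizes are assumptions.
import sentencepiece as spm

# Train a subword model; model_type can be "bpe" or "unigram".
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # raw text, one sentence per line (hypothetical file)
    model_prefix="spm_demo",  # writes spm_demo.model and spm_demo.vocab
    vocab_size=8000,
    model_type="unigram",
)

# Load the trained model and round-trip a sentence through IDs.
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
ids = sp.encode("Hello world.", out_type=int)  # text -> ID sequence
print(ids)
print(sp.decode(ids))                          # IDs -> text, reproducing the input
```

Because normalization and segmentation are stored in the single `.model` file, decoding the ID sequence recovers the original text without any external, language-specific pre- or post-processing.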