Wordpiece tokenization

Wordpiece tokenization is a subword tokenization method in natural language processing that handles rare words and large vocabularies using a fixed set of tokens. It does so by breaking words into subword units, which preserves character-level information that typical word-level tokenizers miss.[1] Like byte-pair encoding, it splits whole words into subwords.

Background

Before the Wordpiece model, two types of tokenization prevailed: character-level tokenization, which splits text into individual characters, and word-level tokenization, which splits text into words at spaces and punctuation. Word-level tokenization struggles with out-of-vocabulary (OOV) words, which it cannot represent, while character-level tokenization is inefficient for common words that could be treated as single units, because it produces much longer token sequences.
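
The trade-off between the two older approaches can be illustrated with a minimal sketch. The function names, the "[UNK]" placeholder, and the toy vocabulary below are assumptions made for illustration and are not taken from any particular library.

```python
import re

def word_level_tokenize(text, vocab, unk_token="[UNK]"):
    """Split on spaces and punctuation; words outside the vocabulary become unk_token."""
    # \w+ captures runs of word characters; [^\w\s] captures single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [tok if tok in vocab else unk_token for tok in tokens]

def char_level_tokenize(text):
    """Split text into individual characters (spaces included)."""
    return list(text)

# Hypothetical toy vocabulary, chosen only for this example.
toy_vocab = {"the", "cat", "sat", "."}

word_tokens = word_level_tokenize("The cat meowed.", toy_vocab)
char_tokens = char_level_tokenize("the cat")
print(word_tokens)       # "meowed" is out of vocabulary, so its content is lost
print(len(char_tokens))  # one token per character, even for common words
```

In this sketch the word-level tokenizer discards the unknown word entirely, while the character-level tokenizer never fails but needs seven tokens for two short words.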

Methodology

Wordpiece tokenization begins with word-level segmentation and then splits each word into smaller pieces until every piece is found in a predetermined vocabulary. In BERT, a piece that continues a word is marked with a "##" prefix, so a word such as "playing" may be split into "play" and "##ing". The resulting subword units strike a balance between character-level granularity and word-level efficiency. The approach was popularized by Google in models such as BERT, which uses Wordpiece with a fixed-size vocabulary.[1]
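
One common way to perform this splitting is greedy longest-match-first lookup, the strategy used by the BERT reference tokenizer: at each position, take the longest substring that appears in the vocabulary, then continue from where it ends. The following is a minimal sketch of that idea, not the BERT implementation itself. The function names and the toy vocabulary are assumptions for illustration; the "##" continuation prefix and the "[UNK]" token follow BERT's conventions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into vocabulary pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to unk_token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab):
    """Word-level segmentation on whitespace, then Wordpiece splitting per word."""
    pieces = []
    for word in text.lower().split():
        pieces.extend(wordpiece_tokenize(word, vocab))
    return pieces

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of pieces.
toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##ed", "the"}

print(wordpiece_tokenize("unaffable", toy_vocab))
print(tokenize("the playing", toy_vocab))
```

With this toy vocabulary, "unaffable" is split into "un", "##aff", and "##able", while a word with no matching pieces, such as "xyz", falls back to "[UNK]". Real vocabularies usually include single characters, which makes that fallback rare.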

References

[1] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics.

External links

GitHub repository for BERT, which includes the Wordpiece Tokenization implementation

This article "Wordpiece tokenization" is from Wikipedia. The list of its authors can be seen in its edit history.
