Wordpiece tokenization
Wordpiece tokenization is a subword tokenization method in natural language processing that copes with large vocabularies and rare words. It does so by breaking words into smaller sub-tokens, which preserves character-level information that a typical word-level tokenizer would miss.[1] Wordpiece tokenization is similar to byte-pair encoding in that both break whole words down into subwords.
Background
Before the advent of the Wordpiece model, there were two prevailing types of tokenization methods: character-level tokenization, which splits text into individual characters, and word-level tokenization, which splits text into words based on spaces and punctuation. Word-level tokenization struggles with out-of-vocabulary (OOV) words, which it can only map to a generic unknown token, while character-level tokenization produces long sequences and is inefficient for common words that could easily be treated as single units.
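The trade-off between the two approaches can be illustrated with a minimal sketch. The function names, the toy vocabulary, and the `[UNK]` placeholder below are assumptions chosen for illustration and are not taken from any particular library.

```python
import re

def char_tokenize(text):
    # Character-level: every non-whitespace character becomes a token.
    return [ch for ch in text if not ch.isspace()]

def word_tokenize(text, vocab, unk="[UNK]"):
    # Word-level: split on whitespace and punctuation, then replace
    # any word missing from the vocabulary with an unknown marker.
    words = re.findall(r"\w+|[^\w\s]", text)
    return [w if w in vocab else unk for w in words]

# Hypothetical toy vocabulary; "purred" is deliberately absent.
toy_vocab = {"the", "cat", "."}

word_tokens = word_tokenize("the cat purred.", toy_vocab)
char_tokens = char_tokenize("the cat")
```

Here `word_tokens` loses the word "purred" entirely, since it is replaced by the unknown marker, while `char_tokens` keeps all information but needs six tokens for two short words.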
Methodology
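Wordpiece tokenization first segments text into words and then splits each word into smaller pieces until every piece is found in a predetermined vocabulary. In the implementation used by BERT, the split is a greedy longest-match-first search from the start of the word: pieces that continue a word are marked with a "##" prefix, and a word that cannot be fully covered by vocabulary entries is mapped to an unknown token. The resulting subword units strike a balance between character-level granularity and word-level efficiency. The approach was popularized by Google in models such as BERT, which uses a fixed Wordpiece vocabulary of about 30,000 tokens.[1]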
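The splitting step can be sketched as a greedy longest-match-first search, which is the strategy used by BERT-style Wordpiece tokenizers. The code below is a simplified illustration, not a reference implementation: the function names and the toy vocabulary are hypothetical, the "##" continuation prefix and `[UNK]` token follow BERT's convention, and details such as text normalization and punctuation splitting are omitted.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    # Split one word into vocabulary pieces, always taking the longest
    # piece that matches at the current position.
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate by one character and retry
        if match is None:
            # No piece matches here, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

def wordpiece_tokenize(text, vocab):
    # Word-level segmentation (whitespace only, for brevity),
    # followed by subword splitting of each word.
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_split(word, vocab))
    return tokens

# Hypothetical toy vocabulary for illustration.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}

tokens = wordpiece_tokenize("playing unaffable xyz", toy_vocab)
# tokens == ["play", "##ing", "un", "##aff", "##able", "[UNK]"]
```

Because the search is greedy, the result depends on which pieces the vocabulary contains: "unaffable" is covered by three known pieces, whereas "xyz" has no matching piece at its first character and so collapses to the unknown token.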
References
- Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics.
External links
This article "Wordpiece tokenization" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Wordpiece tokenization. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
