AIExplainer
Large Language Models Intermediate 1 min read

What is a subword token?

In language models, a token that is a substring of a word, which may be the entire word.

For example, a word like "itemize" might be split into the pieces "item" (a root word) and "ize" (a suffix), each of which is represented by its own token. Splitting uncommon words into such pieces, called subwords, lets a language model operate on a word's more common constituent parts, such as prefixes and suffixes. Conversely, a common word like "going" might not be split at all and might be represented by a single token.
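To make the splitting concrete, here is a minimal sketch in Python. It assumes a tiny hand-written vocabulary (VOCAB) and a hypothetical helper named split_into_subwords that uses greedy longest-match splitting. Real tokenizers learn their vocabularies from data and may split words differently, so treat this as an illustration rather than how any particular model works.

```python
# Toy vocabulary (an assumption for illustration); real vocabularies
# are learned from large text corpora and contain many thousands of entries.
VOCAB = {"item", "ize", "going", "go", "ing"}


def split_into_subwords(word, vocab=VOCAB):
    """Split a word into pieces by greedy longest-match against vocab.

    Characters that match no vocabulary entry become single-character pieces.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until an entry matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry starts here: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces


print(split_into_subwords("itemize"))  # ['item', 'ize']
print(split_into_subwords("going"))    # ['going']
```

With this toy vocabulary, "itemize" comes out as two pieces because the whole word is not an entry, while "going" stays whole because it is.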

Practitioners refer to subword tokens when building, training, or evaluating language models. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.