What is a subword token?
In language models, a token that is a substring of a word, which may be the entire word.
Subword token explained in plain English
A subword token is a piece of a word that a language model handles as a single unit. For example, a word like "itemize" might be broken up into the pieces "item" (a root word) and "ize" (a suffix), each of which is represented by its own token. Splitting uncommon words into such pieces, called subwords, allows language models to operate on the word's more common constituent parts, such as prefixes and suffixes. Conversely, common words like "going" might not be broken up and might be represented by a single token, in which case the subword token is the entire word.
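The splitting described above can be illustrated with a small sketch. It assumes a tiny hand-written vocabulary (TOY_VOCAB, hypothetical) and a greedy longest-match rule; real tokenizers learn their vocabularies from data and may split words differently, so treat this as an illustration rather than how any particular model works.

```python
# Hypothetical toy vocabulary, chosen only to mirror the examples in the text.
TOY_VOCAB = {"item", "ize", "going", "go", "ing"}


def split_into_subwords(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of `word` into pieces from `vocab`.

    Falls back to single characters when no vocabulary piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a match is found.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if candidate in vocab:
                pieces.append(candidate)
                start = end
                break
        else:
            # No vocabulary piece starts here; emit one character instead.
            pieces.append(word[start])
            start += 1
    return pieces


print(split_into_subwords("itemize"))  # ['item', 'ize']
print(split_into_subwords("going"))    # ['going']
```

Because "going" is in the toy vocabulary as a whole word, it stays a single token, while "itemize" is not and is split into two pieces.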
Example
Practitioners refer to subword tokens when building, training, or evaluating machine learning systems. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- agent orchestration
The centralized management and routing of tasks across multiple sub-agents or LLM calls.
- AI slop
Output from a generative AI system that favors quantity over quality.
- Attention
A mechanism that lets a model focus on the most relevant parts of its input when producing an output, weighting what matters most in context.
- auto-regressive model
A model that infers a prediction based on its own previous predictions.
- autoencoder
A system that learns to extract the most important information from the input.
- automatic evaluation
Using software to judge the quality of a model's output.
- autorater evaluation
A hybrid mechanism for judging the quality of a generative AI model's output that combines human evaluation with automatic evaluation.
- average precision at k
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations.
- bag of words
A representation of the words in a phrase or passage, irrespective of order.
- BERT
A model architecture for text representation.