What is a tokenizer?
A system or algorithm that translates a sequence of input data into tokens.
Tokenizer explained in plain English
Most modern foundation models are multimodal, so a tokenizer for such a system must translate each input type into the appropriate format. For example, given input data consisting of both text and images, the tokenizer might translate the text into subwords and the images into small patches. The tokenizer must then map all the tokens into a single unified embedding space, which enables the model to "understand" a stream of multimodal input.
Example
Practitioners refer to tokenizers when building, training, or evaluating machine learning systems. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
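As a rough illustration of the text-and-image case described above, the following Python sketch splits text into subwords and an image into small patches. It is a toy under stated assumptions, not how any particular model's tokenizer works: the vocabulary `VOCAB`, the greedy longest-match rule, and the function names `tokenize_text` and `patchify` are all hypothetical, and the learned projection into a shared embedding space is left out.

```python
# Hypothetical subword vocabulary, chosen only for this illustration.
VOCAB = {"token", "izer", "s", "un", "related"}


def tokenize_text(text, vocab=VOCAB):
    """Split each whitespace-separated word into subwords by greedy longest match."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining piece first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
            else:
                # No known subword starts here: emit an unknown marker
                # and skip one character.
                tokens.append("<unk>")
                start += 1
    return tokens


def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order; rows or columns that do not
    fill a whole patch are dropped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


text_tokens = tokenize_text("tokenizers unrelated")
# A 4x4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
image_tokens = patchify(image, patch_size=2)

print(text_tokens)   # ['token', 'izer', 's', 'un', 'related']
print(image_tokens[0])  # [[0, 1], [4, 5]]
```

In a real multimodal system, each subword and each patch would then be mapped to a vector of the same dimensionality, typically by learned embedding or projection layers, so that the model can process both as one sequence.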
People also read
- automatic evaluation
Using software to judge the quality of a model's output.
- bag of words
A representation of the words in a phrase or passage, irrespective of order.
- BERT
A model architecture for text representation.
- bigram
An N-gram in which N=2.
- BLEU
A metric between 0.0 and 1.0 for evaluating machine translations.
- BLEURT
A metric for evaluating machine translations from one language to another, particularly to and from English.
- Character N-gram F-score
A metric to evaluate machine translation models.
- constituency parsing
Dividing a sentence into smaller grammatical structures ("constituents").
- crash blossom
A sentence or phrase with an ambiguous meaning.
- decoder
In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.