AIExplainer

What is a tokenizer?

A system or algorithm that translates a sequence of input data into tokens.

Most modern foundation models are multimodal, so a tokenizer for such a system must translate each input type into an appropriate token format. For example, given input consisting of both text and images, the tokenizer might split the text into subwords and the images into small patches. All of these tokens must then be mapped into a single unified embedding space, which enables the model to "understand" a stream of multimodal input.
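The two splitting steps described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how any particular model's tokenizer works: the vocabulary `VOCAB`, the helper names `subword_tokenize` and `image_to_patches`, and the greedy longest-match strategy are all hypothetical choices made for illustration. Production tokenizers typically learn their vocabulary from data and operate on real image tensors.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from training data.
VOCAB = {"token", "izer", "s", "un", "believ", "able"}


def subword_tokenize(word, vocab=VOCAB):
    """Greedy longest-match split of one word into subwords."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(word[start])
            start += 1
    return tokens


def image_to_patches(image, patch_size):
    """Split a 2D grid of pixel values into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


text_tokens = subword_tokenize("tokenizers")
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 dummy image
image_tokens = image_to_patches(image, patch_size=2)

print(text_tokens)        # ['token', 'izer', 's']
print(len(image_tokens))  # 4
```

The sketch stops at producing tokens. Mapping the subwords and patches into one shared embedding space is a separate, learned step that is not shown here.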

Practitioners refer to tokenizers when building, training, or evaluating machine learning systems. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.