
What is a tokenizer?

A system or algorithm that translates a sequence of input data into tokens.

Tokenizer explained in plain English

Many modern foundation models are multimodal, and a tokenizer for a multimodal system must translate each input type into the appropriate format. For example, given input data consisting of both text and graphics, the tokenizer might split the input text into subwords and the input images into small patches. All of the resulting tokens must then be mapped into a single unified embedding space, which enables the model to "understand" a stream of multimodal input.
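The two kinds of splitting described above can be sketched in a few lines of Python. This is a minimal illustration and not how any particular model's tokenizer works: the vocabulary `TOY_VOCAB`, the greedy longest-match strategy, and the helper names `subword_tokenize` and `image_to_patches` are all assumptions made for this example. Production tokenizers learn their vocabulary from data, and the embedding step is left out here.

```python
from typing import List, Set

# Hypothetical toy vocabulary; real tokenizers learn theirs from training data.
TOY_VOCAB: Set[str] = {"token", "izer", "un", "believ", "able", "s"}


def subword_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split one word into subwords by greedy longest-match against vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


def image_to_patches(image: List[List[int]], patch: int) -> List[List[int]]:
    """Cut a 2D grid of pixel values into flattened patch x patch tiles.

    Tiles are returned in row-major order; leftover rows or columns that
    do not fill a whole tile are dropped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tile = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(tile)
    return patches


text_tokens = subword_tokenize("tokenizers", TOY_VOCAB)

# A 4x4 "image" whose pixel values are simply 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
image_tokens = image_to_patches(toy_image, patch=2)
```

With this toy vocabulary, "tokenizers" splits into "token", "izer", and "s", and the 4x4 grid yields four 2x2 patches. In a real multimodal model, each subword and each patch would then be turned into a vector in the shared embedding space.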

Example

Practitioners refer to tokenizers when building, training, or evaluating machine learning systems. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
