What is latency?
The time it takes for a model to process input and generate a response.
Latency explained in plain English
The time it takes for a model to process input and generate a response. A high-latency response takes longer to generate than a low-latency response. Factors that influence the latency of large language models include:
- Input and output token lengths
- Model complexity
- The infrastructure the model runs on
Optimizing for latency is crucial for creating responsive and user-friendly applications.
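To make the definition concrete, here is a minimal Python sketch of how latency might be measured around a single call. The names `measure_latency` and `fake_model` are hypothetical and exist only for illustration. `fake_model` is a stand-in that sleeps once per output token, not a real model or API, so the timings only show that longer outputs take longer under that assumption.

```python
import time


def measure_latency(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_model(prompt, output_tokens=5, seconds_per_token=0.01):
    """Hypothetical stand-in for a model call.

    Assumption: each output token costs a fixed amount of time.
    Real models also spend time processing the input prompt.
    """
    tokens = []
    for i in range(output_tokens):
        time.sleep(seconds_per_token)  # simulate per-token generation time
        tokens.append(f"tok{i}")
    return " ".join(tokens)


if __name__ == "__main__":
    for n in (5, 20):
        _, elapsed = measure_latency(fake_model, "hello", output_tokens=n)
        print(f"{n} output tokens: {elapsed:.3f} s")
```

In a real application, the same wrapper could be placed around an actual model call. Measured values would then also reflect input length, model complexity, and infrastructure, as listed above.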
Example
Practitioners refer to latency when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- agent orchestration
The centralized management and routing of tasks across multiple sub-agents or LLM calls.
- AI slop
Output from a generative AI system that favors quantity over quality.
- Attention
A mechanism that lets a model focus on the most relevant parts of its input when producing an output, weighting what matters most in context.
- auto-regressive model
A model that infers a prediction based on its own previous predictions.
- autoencoder
A system that learns to extract the most important information from the input.
- automatic evaluation
Using software to judge the quality of a model's output.
- autorater evaluation
A hybrid mechanism for judging the quality of a generative AI model's output that combines human evaluation with automatic evaluation.
- average precision at k
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations.
- bag of words
A representation of the words in a phrase or passage, irrespective of order.
- BERT
A model architecture for text representation.