What is an evaluation?
The process of measuring a model's quality or comparing different models against each other.
Evaluation explained in plain English
To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set: the validation set guides choices made during development, and the test set is held out for a final check. Evaluating an LLM typically involves broader quality and safety assessments.
Example
Practitioners use the term evaluation when building, training, or comparing machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
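To make the validation set and test set idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy data, the threshold "model", the candidate thresholds, and the choice of accuracy as the metric. It is not a recommended evaluation procedure, only a small instance of the pattern: tune on the validation set, then report one number on the held-out test set.

```python
# Hypothetical toy data: (feature, label) pairs. Invented for illustration only.
validation_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.5, 0)]
test_set = [(0.2, 0), (0.7, 1), (0.8, 1), (0.3, 0), (0.5, 1)]


def predict(x, threshold):
    """A stand-in 'model': predict 1 when the feature reaches the threshold."""
    return 1 if x >= threshold else 0


def accuracy(examples, threshold):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for x, y in examples if predict(x, threshold) == y)
    return correct / len(examples)


# Use the validation set to compare candidate settings (assumed values).
candidate_thresholds = [0.3, 0.5, 0.55, 0.7]
best_threshold = max(
    candidate_thresholds, key=lambda t: accuracy(validation_set, t)
)

# Report the chosen setting on the held-out test set, which played no part
# in choosing the threshold.
validation_accuracy = accuracy(validation_set, best_threshold)
test_accuracy = accuracy(test_set, best_threshold)

print(f"best threshold:      {best_threshold}")
print(f"validation accuracy: {validation_accuracy:.2f}")
print(f"test accuracy:       {test_accuracy:.2f}")
```

The test score comes out lower than the validation score here. That gap is the reason for keeping a separate test set: the validation score is optimistic because the threshold was picked to maximize it.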
People also read
- average precision at k
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations.
- BERT
A model architecture for text representation.
- bias
1.
- bias (math) or bias term
An intercept or offset from an origin.
- Character N-gram F-score
A metric to evaluate machine translation models.
- citation precision
A metric that answers the following question: What percentage of the citations in an LLM's response were actually correct and supportive?
- citation recall
A metric that answers the following question: What percentage of the source documents the LLM used to compose its response are actually cited in the response?
- Confabulation
When an AI produces a confident, fluent answer that sounds true but is factually wrong — generating plausible language without a reliable link to reality.
- confirmation bias
The tendency to search for, interpret, favor, and recall information in a way that confirms one's pre-existing beliefs or hypotheses.
- counterfactual fairness
A fairness metric that checks whether a classification model produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes.