What is automatic evaluation?
Using software to judge the quality of a model's output.
Automatic evaluation explained in plain English
When a model's output is relatively straightforward, a script or program can compare that output to a golden response. This type of automatic evaluation is sometimes called programmatic evaluation, and metrics such as ROUGE or BLEU are often useful for it. When the output is complex or has no single right answer, a separate ML program called an autorater sometimes performs the automatic evaluation. Contrast with human evaluation.
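To make programmatic evaluation concrete, here is a minimal sketch that compares a model's output to a golden response in two ways: an exact-match check and a clipped unigram-overlap F1 score. The score is similar in spirit to ROUGE-1 but is not the official ROUGE implementation. The function names (normalize, exact_match, unigram_f1), the whitespace tokenizer, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def exact_match(output: str, golden: str) -> bool:
    """Return True when the normalized token sequences are identical."""
    return normalize(output) == normalize(golden)


def unigram_f1(output: str, golden: str) -> float:
    """Clipped unigram-overlap F1 between output and golden response.

    Illustrative only: similar in spirit to ROUGE-1, not a reference
    implementation of any published metric.
    """
    out_counts = Counter(normalize(output))
    gold_counts = Counter(normalize(golden))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((out_counts & gold_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_counts.values())
    recall = overlap / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    golden = "The cat sat on the mat"   # hypothetical golden response
    output = "the cat is on the mat"    # hypothetical model output
    print(exact_match(output, golden))            # False
    print(round(unigram_f1(output, golden), 3))   # 0.833
```

Exact match suits outputs with one correct form, such as a label or a number. An overlap score gives partial credit when the wording may vary. For outputs with no single right answer, neither check is enough, which is where an autorater or human evaluation comes in.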
Example
Practitioners use the term automatic evaluation when building, training, or evaluating machine learning systems. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- bag of words
A representation of the words in a phrase or passage, irrespective of order.
- BERT
A model architecture for text representation.
- bigram
An N-gram in which N=2.
- BLEU
A metric between 0.0 and 1.0 for evaluating machine translations.
- BLEURT
A metric for evaluating machine translations from one language to another, particularly to and from English.
- Character N-gram F-score
A metric to evaluate machine translation models.
- constituency parsing
Dividing a sentence into smaller grammatical structures ("constituents").
- crash blossom
A sentence or phrase with an ambiguous meaning.
- decoder
In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.
- embedding
A numerical representation of text, images, or other data that captures semantic meaning.