What are LLM evaluations?
A set of metrics and benchmarks for assessing the performance of large language models (LLMs).
LLM evaluations explained in plain English
At a high level, LLM evaluations:
- Help researchers identify areas where LLMs need improvement.
- Are useful for comparing different LLMs and identifying the best LLM for a particular task.
- Help ensure that LLMs are safe and ethical to use.
See Large language models (LLMs) in Machine Learning Crash Course for more information.
Example
Practitioners refer to LLM evaluations when building, training, or evaluating machine learning systems. The term appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
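As a concrete illustration, the sketch below scores two stand-in models on a tiny benchmark using exact-match accuracy, one of the simplest evaluation metrics. Everything here is an assumption made for illustration: the benchmark questions, the names `model_a` and `model_b`, and the `exact_match_accuracy` helper are hypothetical, and the "models" are plain Python functions rather than calls to a real LLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical benchmark: (prompt, reference answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("What is the chemical formula for water?", "H2O"),
]


def normalize(text: str) -> str:
    """Lowercase and trim so trivial formatting differences are not counted as errors."""
    return text.strip().lower()


def exact_match_accuracy(
    model: Callable[[str], str], benchmark: List[Tuple[str, str]]
) -> float:
    """Fraction of prompts whose normalized answer equals the normalized reference."""
    correct = sum(
        normalize(model(prompt)) == normalize(reference)
        for prompt, reference in benchmark
    )
    return correct / len(benchmark)


# Stand-in "models" (assumptions): lookup tables instead of real LLM calls.
def model_a(prompt: str) -> str:
    answers = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "paris ",
        "What is the chemical formula for water?": "HO",
    }
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    answers = {
        "What is 2 + 2?": "5",
        "What is the capital of France?": "Lyon",
        "What is the chemical formula for water?": "H2O",
    }
    return answers.get(prompt, "")


# Score each model on the same benchmark, then pick the highest scorer.
scores: Dict[str, float] = {
    "model_a": exact_match_accuracy(model_a, BENCHMARK),
    "model_b": exact_match_accuracy(model_b, BENCHMARK),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Exact match is only one possible metric. Real evaluations typically combine several metrics and much larger benchmarks, and safety-oriented evaluations need different measurements altogether.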
People also read
- average precision at k
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations.
- BERT
A model architecture for text representation.
- Character N-gram F-score
A metric to evaluate machine translation models.
- citation precision
A metric that answers the following question: What percentage of the citations in an LLM's response were actually correct and supportive?
- citation recall
A metric that answers the following question: What percentage of the source documents the LLM used to compose its response are actually cited in the response?
- cross-entropy
A generalization of Log Loss to multi-class classification problems.
- denoising
A common approach to self-supervised learning.
- depth
The sum of the following in a neural network: the number of hidden layers, the number of output layers (typically 1), and the number of any embedding layers. For example, a neural network with five hidden layers and one output layer has a depth of 6.
- Embedding
A numerical representation of text, images, or other data that captures semantic meaning.
- embedding layer
A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector.
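
Three of the related metrics listed above (average precision at k, citation precision, and citation recall) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function names and sample data are hypothetical, the average is taken over the relevant items found in the top k (other definitions use a different denominator), and judging whether a citation is "supportive" is assumed to have been done already, for example by a human rater.

```python
from typing import Sequence, Set


def average_precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Mean of precision-at-rank values, taken at each relevant item in the top k.

    Assumption: the denominator is the number of relevant items in the top k.
    Returns 0.0 when no relevant item appears in the top k.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def citation_precision(cited: Set[str], supportive: Set[str]) -> float:
    """Share of the response's citations that are correct and supportive."""
    if not cited:
        return 0.0
    return len(cited & supportive) / len(cited)


def citation_recall(used: Set[str], cited: Set[str]) -> float:
    """Share of the source documents the model used that it actually cited."""
    if not used:
        return 0.0
    return len(used & cited) / len(used)


# Hypothetical ranked list: items 1 and 3 are relevant, items 2 and 4 are not.
ap = average_precision_at_k([True, False, True, False], k=4)

# Hypothetical response: cites doc1, doc2, doc4; only doc1 and doc2 support it.
precision = citation_precision({"doc1", "doc2", "doc4"}, {"doc1", "doc2"})

# The model drew on doc1, doc2, doc3 but cited only doc1 and doc2 of those.
recall = citation_recall({"doc1", "doc2", "doc3"}, {"doc1", "doc2", "doc4"})

print(round(ap, 4), round(precision, 4), round(recall, 4))
```

In the ranked-list example, precision is 1/1 at rank 1 and 2/3 at rank 3, so the average is 5/6. Both citation metrics come out to 2/3 for the sample sets.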