What is an average precision at k?
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations.
average precision at k explained in plain English
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is, well, the average of the precision at k values for each relevant result. The formula for average precision at k is therefore: \[{\text{average precision at k}} = \frac{1}{n} \sum_{i=1}^n {\text{precision at k for each relevant item} } \] where: - \(n\) is the number of relevant items in the list. Contrast with recall at k.
Suppose a large language model is given the following query:
And the large language model returns the following list: 1. The General 2. Mean Girls 3. Platoon 4. Bridesmaids 5. Citizen Kane 6. This is Spinal Tap Four of the movies in the returned list are very funny (that is, they are relevant) but two movies are dramas (not relevant). The following table details the results: Relevant? | Precision at k | --- | --- | Yes | 1.0 | Yes | 1.0 | No | not relevant | Yes | 0.75 | No | not relevant | Yes | 0.67 | The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows: ---
Example
Practitioners refer to average precision at k when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- few-shot learning
A machine learning approach, often used for object classification, designed to train effective classification models from only a small number of training examples.
- Inference
The phase when a trained model is actually used — taking new input and producing a prediction or response.
- one-shot learning
A machine learning approach, often used for object classification, designed to learn effective classification model from a single training example.
- pass at k
A metric to determine the quality of code (for example, Python) that a large language model generates.
- BERT
A model architecture for text representation.
- black box model
A model whose "reasoning" is impossible or difficult for humans to understand.
- Chain-of-Thought Prompting
Asking an AI to show its reasoning step by step before giving a final answer, which often improves accuracy on complex tasks.
- Character N-gram F-score
A metric to evaluate machine translation models.
- citation precision
A metric that answers the following question: What percentage of the citations in an LLM's response were actually correct and supportive?
- citation recall
A metric that answers the following question: What percentage of the source documents the LLM used to compose its response are actually cited in the response?