What is pass at k?
A metric to determine the quality of code (for example, Python) that a large language model generates.
pass at k explained in plain English
Pass at k tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.
Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple (k) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:
- If one or more of those solutions pass the unit tests, then the LLM Passes that code generation challenge.
- If none of the solutions pass the unit tests, then the LLM Fails that code generation challenge.
The formula for pass at k is as follows:
\[\text{pass at k} = \frac{\text{total number of passes}} {\text{total number of challenges}}\]
In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.
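One way to see why a larger k tends to raise the score is a simplified model. It is an assumption made here for illustration, not part of the definition: suppose each generated solution for a given challenge passes its unit tests independently with the same probability p. The chance that the LLM Passes that challenge is then:
\[P(\text{at least one of } k \text{ solutions passes}) = 1 - (1 - p)^k\]
With a hypothetical p = 0.1, this gives 0.1 for k = 1 and roughly 0.65 for k = 10. The value can only grow as k grows, but every extra solution must still be generated and tested.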
Suppose a software engineer asks a large language model to generate k=10 solutions for each of n=50 challenging coding problems. Here are the results:
- 30 Passes
- 20 Fails
The pass at 10 score is therefore:
\[\text{pass at 10} = \frac{30}{50} = 0.6\]
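The scoring procedure, applied to the numbers in this example, might be sketched in Python as follows. The function name `pass_at_k` and the nested-list layout of the unit-test outcomes are assumptions made for illustration; they do not come from any particular evaluation library.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]]) -> float:
    """Return the fraction of challenges that the LLM Passes.

    outcomes[i][j] is True when generated solution j for challenge i
    passed all of its unit tests (hypothetical data layout).
    """
    if not outcomes:
        raise ValueError("need at least one challenge")
    # A challenge counts as a Pass if any one of its k solutions passed.
    passes = sum(1 for solutions in outcomes if any(solutions))
    return passes / len(outcomes)


# Made-up outcomes that mirror the example: 50 challenges, k=10 solutions each.
k = 10
passed = [[False] * (k - 1) + [True] for _ in range(30)]  # 30 Passes
failed = [[False] * k for _ in range(20)]                 # 20 Fails

score = pass_at_k(passed + failed)
print(f"pass at {k} = {score}")  # 30 / 50 = 0.6
```

Only whether a challenge has at least one passing solution matters here. A challenge with ten passing solutions counts the same as a challenge with one.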
Example
Practitioners refer to pass at k when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.
People also read
- average precision at k
A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations.
- few-shot learning
A machine learning approach, often used for object classification, designed to train effective classification models from only a small number of training examples.
- Inference
The phase when a trained model is actually used — taking new input and producing a prediction or response.
- one-shot learning
A machine learning approach, often used for object classification, designed to learn an effective classification model from a single training example.
- BERT
A model architecture for text representation.
- black box model
A model whose "reasoning" is impossible or difficult for humans to understand.
- Chain-of-Thought Prompting
Asking an AI to show its reasoning step by step before giving a final answer, which often improves accuracy on complex tasks.
- Character N-gram F-score
A metric to evaluate machine translation models.
- citation precision
A metric that answers the following question: What percentage of the citations in an LLM's response were actually correct and supportive?
- citation recall
A metric that answers the following question: What percentage of the source documents the LLM used to compose its response are actually cited in the response?