AIExplainer

pass at k

A metric to determine the quality of code (for example, Python) that a large language model generates.

More specifically, pass at k tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.

Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple (k) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:

- If one or more of those solutions pass the unit tests, then the LLM Passes that code generation challenge.
- If none of the solutions pass the unit tests, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

\[\text{pass at k} = \frac{\text{total number of passes}}{\text{total number of challenges}}\]

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.
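The pass/fail rule and the ratio can be sketched in a few lines of Python. This is a minimal illustration of the definition given here, not a reference implementation: the function names `challenge_passes` and `pass_at_k`, the nested-list layout of `results`, and the sample outcomes are all assumptions made for the example.

```python
from typing import Sequence


def challenge_passes(solution_results: Sequence[bool]) -> bool:
    """A challenge Passes if at least one generated solution passed its unit tests."""
    return any(solution_results)


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Return (number of Passes) / (number of challenges) using the first k solutions.

    results[i][j] is True when solution j for challenge i passed all of its unit tests.
    """
    if not results:
        raise ValueError("need at least one challenge")
    if k < 1:
        raise ValueError("k must be at least 1")
    if any(len(r) < k for r in results):
        raise ValueError("every challenge needs at least k generated solutions")
    passes = sum(challenge_passes(r[:k]) for r in results)
    return passes / len(results)


# Hypothetical outcomes: 4 challenges, 3 generated solutions each.
results = [
    [False, True, False],   # Passes once k >= 2
    [False, False, False],  # Fails for every k
    [True, False, False],   # Passes at k = 1
    [False, False, True],   # Passes only at k = 3
]

for k in (1, 2, 3):
    print(f"pass at {k} = {pass_at_k(results, k):.2f}")
# pass at 1 = 0.25
# pass at 2 = 0.50
# pass at 3 = 0.75
```

In this made-up data the score rises from 0.25 to 0.75 as k grows, because allowing more solutions per challenge can only turn a Fail into a Pass, never the reverse. The cost is that each extra solution must be generated and run against the unit tests.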

Suppose a software engineer asks a large language model to generate k=10 solutions for n=50 challenging coding problems. Here are the results:

- 30 Passes
- 20 Fails

The pass at 10 score is therefore:

\[\text{pass at 10} = \frac{30}{50} = 0.6\]
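As a quick check of this example's arithmetic, the sketch below computes the score from the two tallies. The variable names are chosen for illustration only.

```python
# Tallies taken from the example: k = 10 solutions for each of n = 50 challenges.
k = 10
passes = 30
fails = 20

# Every challenge either Passes or Fails, so the two tallies sum to n.
total_challenges = passes + fails
score = passes / total_challenges

print(f"pass at {k} = {score:.2f}")  # prints "pass at 10 = 0.60"
```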

Practitioners refer to pass at k when building, training, or evaluating machine learning systems. It appears in research papers, product documentation, and technical discussions about AI capabilities and limitations.