How Code is Evaluated in Benchmarks
a historical tour
Cursor got acquired last week and it reminded me to survey how coding benchmarking is done. If you come from the world of knowledge benchmarks like MMLU or reasoning benchmarks like GPQA, coding evals look familiar at first: give the model a prompt, check the output, assign a score. But the similarity is shallow, and the differences explain why this field has churned through four generations of benchmarks in five years while the knowledge-testing world mostly just kept adding harder questions to the same format.
What makes coding evaluation different is that correctness is necessary but not close to sufficient. An MMLU question has one right answer. A math problem has one right answer. A coding problem can have dozens of valid implementations, and two solutions that both pass the test suite can differ wildly in quality, maintainability, and whether anyone would actually ship them. The grading problem is also recursive in a way other domains don’t face: writing good tests to evaluate code is itself a hard engineering task, which means the evaluator needs to be nearly as capable as the thing being evaluated. On top of that, code lives inside a codebase. The “right” solution depends on existing patterns, conventions, and architecture that an isolated prompt can’t capture.
Every lesson the coding benchmark community has learned over the past five years circles back to these differences. Here’s the evolution, what each generation taught us, and what broke.
Generation 1: Can the model write a function? (2021)
HumanEval and MBPP were the starting point. HumanEval gave a model 164 Python function signatures with docstrings and checked completions against hidden unit tests. MBPP did roughly the same thing with about 1,000 entry-level problems.
The grading method was purely deterministic: run the code, compare output to expected output, report pass@k. Clean, cheap, reproducible. For other evaluation domains, this would be fine. In math, either 42 is the answer or it’s not. But code has a property that math answers don’t: it exists in a space of valid alternatives. When your grader can only check “does this exact output match?”, you’re testing one narrow path through a wide solution space.
What we learned: Models can autocomplete functions. pass@k gives you a clean metric to compare generation quality across models.
What broke: Everything else. Top models now score above 99% on HumanEval, making it useless for frontier comparisons. The static task set leaked into training data. When researchers created EvoEval by transforming HumanEval tasks using seven types of semantic and syntactic rewrites, top models saw pass@1 drops of 19 to 47 percentage points. That gap is a contamination fingerprint: the models had memorized solutions, not learned to solve problems.
Generation 2: Harder puzzles, rolling tasks (2024)
The first lesson was: static benchmarks get memorized. LiveCodeBench, published at ICLR 2025, responded by continuously pulling new problems from LeetCode, AtCoder, and Codeforces. The task set refreshes over time, so the contamination window stays small. It also tested more than just generation: code repair, test output prediction, and execution reasoning.
BigCodeBench attacked a different gap. Its 1,140 Python tasks require calling real library APIs across many packages, testing whether models know how to use pandas, requests, matplotlib, and other tools developers actually reach for. HumanEval tested algorithm design. BigCodeBench tested practical API fluency.
The grading was still deterministic: run the code, check the tests. But the tests were better designed, with higher branch coverage and more realistic input patterns.
What we learned: Rolling task sets are the strongest defense against contamination. And testing API usage turns out to be a different capability from testing algorithmic problem-solving. Models that ace competitive programming puzzles sometimes fumble basic library calls.
These benchmarks still evaluate isolated, self-contained problems. They don’t test whether a model can read an unfamiliar 50,000-line codebase, figure out what a vague bug report is actually asking, and make a targeted fix that doesn’t break six other things. That’s the gap the next generation went after.
Generation 3: Real repositories, real issues (2023-2025)
SWE-bench out of Princeton changed the game. Take real GitHub issues from real Python repos. Give the model the issue description and the full codebase. Have it produce a patch. Grade it by running the repo’s test suite: previously failing tests should now pass, and previously passing tests should still pass.
This was where coding evaluation separated most sharply from every other benchmark domain. An MMLU question is self-contained. SWE-bench tasks require navigating an existing system, reasoning about how different components interact, and producing a change that fits into the existing architecture.
The grading was still deterministic, fail-to-pass and pass-to-pass test checks, but now the tests were the repository’s own test suite rather than benchmark-author-written checks. This felt like a huge upgrade. Real tests written by real developers testing real behavior.
What we learned: Models can make multi-file edits in real codebases. The best agents of 2024-2025 were solving 40-50% of issues, which was a dramatic jump from where function-completion benchmarks had started.
What broke: Three things, each more damaging than the last.
First, contamination. SWE-bench tasks come from public GitHub repositories, and the gold patches were merged before the benchmark existed. Models trained on GitHub data may have seen the exact fix. The evidence is damning: Claude Opus 4.5 scores 80.9% on SWE-bench Verified but drops to 45.9% on SWE-bench Pro using the same scaffolding on tasks the model couldn’t have seen during training. GPT-5 High shows a similar pattern, from roughly 55% to 23.3%. A 35-point drop on the same model doing the same type of task tells you the original numbers were inflated by memorization.
Second, grading gaps. SWE-bench only runs modified test files, not the full suite.
An ICSE 2026 study found that 7.2% to 8.4% of patches the benchmark accepted as correct were functionally broken when run against the complete developer test suite. That’s enough to shuffle leaderboard positions.
Third, and this is the big one: passing tests turns out to be a terrible proxy for code quality. METR published an analysis in March 2026 showing that more than half of SWE-bench-passing patches would not survive human code review. They had actual open-source maintainers evaluate agent-generated patches and found core functionality failures, regressions, and quality issues that automated grading completely missed. Here’s a finding from their study that stuck with me: improvements between Claude 3.5 Sonnet and Claude 4 Opus mostly showed up as moving issues from “fails the automated grader” to “passes the grader but has bad code quality.” The models got better at passing CI. They didn’t get proportionally better at writing code a human would accept.
This is where the coding evaluation world confronted something other benchmark domains don’t face. In knowledge benchmarks, if the answer is right, it’s right. There’s no separate “quality” dimension to check. In coding, “passes the tests” and “is actually good” are two very different claims, and the gap between them turned out to be enormous.
SWE-bench Pro responded with 1,865 tasks from 41 professional repos, held-out and private subsets, Docker environments, and human verification. A May 2026 independent audit still found a 32% error rate in its verifiers. Better than the original, but far from clean.
The grading dilemma: deterministic tests vs. LLM judges
Before looking at the newest benchmarks, there’s a tension that runs through every generation of the evolution above, and it maps directly to a choice every coding benchmark has to make.
Deterministic test-based grading (run the code, check the output) has three strengths: it’s reproducible, it’s cheap, and it’s unambiguous. Every run gives the same result. There’s no argument about whether a solution “passed.” For knowledge benchmarks and math benchmarks, this is mostly good enough because the answer space is constrained. 2 + 2 = 4, and that’s the end of the conversation.
For code, deterministic tests miss everything that matters beyond basic correctness. They can’t evaluate whether the solution is readable, whether it handles edge cases the test author didn’t think of, whether it follows the repo’s patterns, or whether it would introduce tech debt. They also have a false-positive problem: a solution can game the test suite (returning hardcoded values for known inputs, for instance) and still “pass.”
LLM-as-judge grading flips those tradeoffs. An LLM judge can evaluate code quality, style, maintainability, and whether the approach fits the architecture. It can recognize valid alternative solutions that don’t match the reference implementation.
But it introduces three known failure modes. Position bias: Zheng et al. found that swapping the order of two candidate responses shifted GPT-4’s preference rating by more than 10% even when the responses were identical. Verbosity bias: longer outputs score higher regardless of content. Self-preference: a judge from the same model family tends to favor its own family’s outputs.
There’s also a deeper problem specific to code. If you’re using an LLM to judge whether code is well-written, the judge needs to understand the codebase well enough to evaluate architectural fit. For large repos, that’s a hard task even for the best models. You’re asking the judge to be approximately as skilled as the agent being tested.
The emerging consensus, visible in Anthropic’s own guidance for 2026, is hybrid grading: deterministic tests for “did it work?”, rubric-based LLM evaluation for “is it good?” Neither alone is sufficient. This is the approach the 2026-era benchmarks are converging on, and it’s why coding evaluation methodology now looks different from every other eval domain.
Generation 4: Beyond passing tests (May-June 2026)
Three benchmarks launched within weeks of each other, each attacking a different failure mode from the SWE-bench era.
DeepSWE: Contamination-proof by design
DeepSWE, released May 26, 2026 by Datacurve, contains 113 original tasks spanning 91 open-source repositories across TypeScript, Go, Python, JavaScript, and Rust. The key design decision: every task is written from scratch. Reference solutions are never merged upstream. The median repository contributes a single task, preventing any well-known framework from dominating scores.
The grading stays deterministic (behavioral verifiers), but the verifiers are much tighter than SWE-bench’s. A 0.3% false positive rate and 1.1% false negative rate, compared to the 7-8% error rates seen in earlier benchmarks.
The results reshuffled the leaderboard dramatically. GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. Claude Haiku 4.5, which scores 39% on SWE-bench Pro, collapsed to zero on DeepSWE. When the same model shows a 39-point swing across two benchmarks that supposedly measure the same thing, at least one of those numbers is wrong.
DeepSWE also found that Claude Opus was exploiting embedded git history in some benchmarks to look up existing solutions rather than reasoning about the code. That kind of reward hacking is invisible to deterministic graders.
When you remove contamination, model rankings change dramatically. Benchmarks built on public data overstate the capabilities of models trained on that same data. This shouldn’t be surprising, but the magnitude of the gaps was.
FrontierCode: Grading like a tech lead
FrontierCode, released June 8, 2026 by Cognition, makes the boldest methodological bet in the field. It was built with over 20 open-source maintainers, each investing 40+ hours per task. The benchmark has 150 tasks across 36 repos, with a 50-task Diamond subset that is brutally hard.
The grading is where FrontierCode breaks from everything that came before. Instead of just running CI, it evaluates code across behavioral correctness, regression safety, test quality, scope discipline, style adherence, and compliance with repository standards, using a mix of deterministic tests, rubrics, and custom verifiers. Over 3,000 rubrics specifically target reward hacking. This is the hybrid approach in full form: deterministic checks for correctness, expert-authored rubrics for quality.
The headline number: Opus 4.8, the best model, scores 13.8% on Diamond. That’s an order of magnitude lower than the 80%+ scores on SWE-bench Verified for comparable models. FrontierCode also reports roughly 80% fewer false positives compared to SWE-bench Pro.
One quote from the maintainers captures the philosophy: “Where others grade like a CI, FrontierCode grades like a tech lead.”
The tradeoffs are real, though. Tasks aren’t public, limiting reproducibility. Rubric-based grading carries the subjectivity problems of any LLM-judge system. And Cognition makes Devin (a commercial coding agent), so they have commercial interests in how the benchmark conversation plays out.
The gap between “passes CI” and “would be merged by a senior engineer” is massive. When you measure the second thing instead of the first, the numbers collapse. This is evidence that the coding evaluation community had been measuring the wrong dimension for years.
SWE-Lancer: Tying code to dollars
SWE-Lancer, built by OpenAI, takes a completely different approach. It uses over 1,400 real freelance software engineering tasks sourced from Upwork, collectively worth $1 million in actual payouts. Tasks range from $50 bug fixes to $32,000 feature implementations. Independent tasks are graded with end-to-end tests triple-verified by experienced engineers. Managerial tasks, where the model chooses between competing technical proposals, are scored against the decisions of the original hiring managers.
This is the only benchmark that makes economic value explicit. When the best model earns $400,000 out of $1,000,000 possible, you can have a concrete conversation about what AI coding agents are actually worth to a team.
Performance tied to dollar amounts forces a different kind of honesty. A 26% task completion rate sounds abstract. “$208,050 out of $1,000,000” sounds like what it is: useful but far from replacing a human engineer.
The questions nobody else is asking
A few benchmarks are testing capabilities that the main evaluation track ignores entirely.
SlopCodeBench (March 2026) measures something uniquely important: degradation over time. Its 36 problems with 196 checkpoints force agents to repeatedly extend their own prior solutions under evolving specifications. The result is brutal. No agent solves any problem end-to-end. The best agent passes 14.8% of checkpoints. Verbosity (redundant code) rises in 90% of trajectories. Structural erosion (complexity concentrating in a few massive functions) rises in 80%.
This matters because real software development is iterative. You don’t write code once and walk away. You extend it, refactor it, add features.
Every other benchmark tests single-shot solutions. SlopCodeBench found that prompt-side interventions (asking the agent to write clean code) shift the starting quality but don’t change the rate of degradation. The code rots at the same speed regardless of how you prompt. That’s a finding with direct implications for anyone deploying coding agents on long-running projects.
CodeClash runs multi-round tournaments where agents iteratively improve codebases toward open-ended objectives, competing head-to-head. It’s the only benchmark testing strategic iteration, making tradeoffs over time rather than solving isolated tickets.
ProjDevBench tests whether agents can build entire projects from requirements. A February 2026 evaluation found the best agent (Codex with GPT-5) topped out at 77.85% overall. The most telling finding: extended interaction correlated negatively with performance. Agents that debugged longer did worse. They couldn’t convert debugging time into progress.
MLE-bench and RE-Bench measure ML engineering and research capabilities under compute and time limits, scored using Kaggle-style metrics with human baselines.
What this evolution tells you
Each generation of coding benchmarks has been an argument about what “good code” means, and each argument has gotten closer to how working engineers actually think about it.
Generation 1 said: good code produces the right output.
Generation 2 said: good code produces the right output on problems the model hasn’t memorized.
Generation 3 said: good code fixes real bugs in real repos.
Generation 4 is saying: good code gets merged. And the benchmarks on the frontier (SlopCodeBench, CodeClash) are pushing even further: good code stays good over time.
The grading methodology has followed a parallel track. Pure deterministic testing was enough when the tasks were toy-sized. As tasks got more realistic, the grading needed to get more realistic too, which is why we’re now seeing hybrid approaches that combine automated checks with expert rubrics and LLM judgment.
This trajectory is unique to code. Knowledge benchmarks don’t need it because there’s no “quality” dimension to a factual answer. Reasoning benchmarks don’t need it because the reasoning chain either reaches the right conclusion or it doesn’t. Code is the only domain where “correct but terrible” is a real and common outcome.
If you’re building your own evaluation stack, I’d layer it:
Use LiveCodeBench or BigCodeBench for basic coding ability and API fluency. They catch models that can’t do the fundamentals.
Check SWE-bench Pro or DeepSWE for repo-level work. Running both gives you a contamination cross-reference: if a model scores high on Pro but collapses on DeepSWE, that’s a memorization signal.
Layer in FrontierCode-style rubric evaluation for merge quality. Even if you don’t use FrontierCode itself, build rubric-based LLM judging into your evaluation. Deterministic tests alone will overstate your agent’s readiness for production.
If your agents work on long-running projects, add SlopCodeBench-style iterative evaluation. Single-shot benchmarks will miss the degradation problem entirely.
And the option that no public benchmark covers: evaluate on your own repos, with your own test suite, against your own review bar. An analysis of FrontierCode put it well: “the better move is to copy the spirit of the benchmark inside your own codebase.” If a vendor shows you a benchmark score, ask what the agent does on your code. That’s the only number that actually predicts your team’s experience.
The benchmark field will keep churning. HumanEval lasted three years before it was useless. SWE-bench lasted two. FrontierCode and DeepSWE are weeks old. Whatever you build your evaluation on today, plan to rebuild it.


