The Audit

Why some tasks in this corpus cannot be trusted to score the agent

Most eval harnesses score the agent. This one also scores the benchmark. Before trusting any number, it audits SWE-bench Verified's tasks for defects that would make a score meaningless. Of the 25 tasks in this corpus, 5 are flagged, across three defect types.

Three defect types

Each defect below quarantines or discounts the affected task

Broken grading data (broken-tests)

The grading data is damaged. Each task carries a list of tests to run for grading, each referenced by its name (its id). For some tasks, SWE-bench's data pipeline truncated those names wherever they contain a space — the real id test_stem[png-w/ line collection] is stored as the fragment test_stem[png-w/. A truncated id points to no real test; hand one to pytest and it aborts the entire run, so the task would falsely score 0 even for a perfect fix. The harness detects these (unbalanced brackets, or stray progress markers like [100%]) and quarantines them before scoring.

Flagged: matplotlib-13989, pytest-10081, pytest-5262.
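The detection described in this section can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names and the exact progress-marker pattern are assumptions made for the example, and a real check may need to tolerate legitimate ids that contain odd bracket sequences.

```python
import re

# Assumed pattern for stray pytest progress markers such as "[100%]" or "[ 50%]".
PROGRESS_MARKER = re.compile(r"\[\s*\d{1,3}%\]")


def has_unbalanced_brackets(test_id: str) -> bool:
    """Return True if '[' and ']' in the id do not pair up."""
    depth = 0
    for ch in test_id:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:  # a closing bracket with no opener
                return True
    return depth != 0  # an opener that was never closed


def is_broken_test_id(test_id: str) -> bool:
    """Heuristic: the id looks truncated or polluted by progress output."""
    return bool(PROGRESS_MARKER.search(test_id)) or has_unbalanced_brackets(test_id)


def should_quarantine(test_ids: list[str]) -> bool:
    """Hypothetical policy: one broken id is enough to quarantine the task."""
    return any(is_broken_test_id(t) for t in test_ids)


if __name__ == "__main__":
    for tid in ["test_stem[png-w/ line collection]", "test_stem[png-w/"]:
        print(tid, "->", "broken" if is_broken_test_id(tid) else "ok")
```

Quarantining on a single broken id is a conservative choice in this sketch: since one bad id can abort the whole pytest run, the rest of the task's test list cannot rescue the score.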

Too broad to be one bug (broad)

The task is too large to be one bug. Every task lists the tests that should flip from failing to passing once the bug is fixed (its FAIL_TO_PASS set). A focused bug fix flips one or two; django-10097 flips 438. That is not a bug fix — it is a sweeping change, and any score on it is dominated by sheer test volume rather than the quality of the fix.

Flagged: django-10097.
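A breadth check of this kind might look like the sketch below. The text does not state the cutoff the harness uses, so BROAD_THRESHOLD is a purely hypothetical value; the sketch also assumes FAIL_TO_PASS may arrive either as a list or as a JSON-encoded string.

```python
import json

# Hypothetical cutoff; the harness's actual threshold is not given in the text.
BROAD_THRESHOLD = 50


def fail_to_pass_count(task: dict) -> int:
    """Count the tests expected to flip from failing to passing."""
    raw = task.get("FAIL_TO_PASS", [])
    if isinstance(raw, str):  # assumed: the list may be stored as a JSON string
        raw = json.loads(raw)
    return len(raw)


def is_too_broad(task: dict, threshold: int = BROAD_THRESHOLD) -> bool:
    """Flag tasks whose FAIL_TO_PASS set is far larger than a focused fix."""
    return fail_to_pass_count(task) > threshold


if __name__ == "__main__":
    focused = {"FAIL_TO_PASS": '["test_a", "test_b"]'}
    sweeping = {"FAIL_TO_PASS": [f"test_{i}" for i in range(438)]}  # synthetic ids
    print(is_too_broad(focused), is_too_broad(sweeping))
```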

Contaminated prompt (contaminated)

The answer is in the question. The agent is given only the bug report — but in scikit-learn-12585 the bug report quotes the gold fix verbatim, a line of the actual solution pasted into the prompt. The agent can copy it instead of solving anything. A pass here measures whether the model can transcribe a line, not whether it can fix a bug.

Flagged: scikit-learn-12585.
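One way to look for this kind of leak is to test whether any line added by the gold patch appears verbatim in the bug report. The sketch below is an assumption about how such a check could work, not the harness's own method; the minimum line length is a hypothetical guard against trivial matches such as a bare "return", and the example inputs are synthetic.

```python
# Hypothetical guard: ignore very short added lines, which match by accident.
MIN_LINE_LENGTH = 20


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so indentation differences do not matter."""
    return " ".join(text.split())


def added_lines(patch: str) -> list[str]:
    """Collect the content of '+' lines in a unified diff, skipping '+++' file headers."""
    lines = []
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            content = normalize(line[1:])
            if len(content) >= MIN_LINE_LENGTH:
                lines.append(content)
    return lines


def leaked_lines(problem_statement: str, patch: str) -> list[str]:
    """Return the added patch lines that occur verbatim in the bug report."""
    haystack = normalize(problem_statement)
    return [line for line in added_lines(patch) if line in haystack]


if __name__ == "__main__":
    # Synthetic example, not taken from any real task.
    patch = (
        "--- a/m.py\n+++ b/m.py\n@@ -1 +1 @@\n"
        "-    total = compute_total(items)\n"
        "+    total = compute_total(items, include_tax=True)\n"
    )
    report = "Changing the call to `total = compute_total(items, include_tax=True)` fixes it."
    print(leaked_lines(report, patch))
```

A verbatim match is a deliberately strict criterion: it misses paraphrased hints, but anything it does flag is hard to explain as coincidence.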