Agent Eval Harness / SWE-bench Verified

deepseek-v4-pro  ·  25 tasks across 12 repositories  ·  run run-20260522-172308

deepseek-v4-pro was run on 25 real bug-fix tasks from SWE-bench Verified, each drawn from a genuine GitHub issue in a major Python project. The agent sees only the bug report and must produce a fix; hidden tests that it never sees decide the score. The harness also audits the tasks themselves for benchmark-quality defects.

Resolved
14 / 25
all FAIL→PASS and PASS→PASS tests green
Mean score
0.975
partial credit · fraction of graded tests passing
Flagged by audit
5
tasks with a benchmark-quality defect
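
The two headline numbers above can be related with a short sketch. This is not the harness's own code: the `TaskResult` record and its field names are hypothetical, and it assumes the mean score is the average of per-task pass fractions (the page does not say whether tests are pooled across tasks instead).

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record: test name -> whether it passed after the agent's patch.
    fail_to_pass: dict[str, bool]  # tests that failed before the fix and should now pass
    pass_to_pass: dict[str, bool]  # tests that passed before and must keep passing


def graded_outcomes(result: TaskResult) -> list[bool]:
    """All graded test outcomes for one task, from both test groups."""
    return list(result.fail_to_pass.values()) + list(result.pass_to_pass.values())


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every graded test is green."""
    outcomes = graded_outcomes(result)
    return bool(outcomes) and all(outcomes)


def partial_score(result: TaskResult) -> float:
    """Partial credit: fraction of graded tests passing (0.0 if none are graded)."""
    outcomes = graded_outcomes(result)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def summarize(results: list[TaskResult]) -> tuple[int, float]:
    """Return (resolved count, mean partial score), assuming a per-task mean."""
    resolved = sum(is_resolved(r) for r in results)
    mean_score = sum(partial_score(r) for r in results) / len(results)
    return resolved, mean_score
```

Under these assumptions a high mean score can sit alongside a much lower resolved count: a task with many passing PASS→PASS tests and a single failing FAIL→PASS test scores close to 1.0 but is still unresolved.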

Leaderboard

Per-task results

Each row opens the full task detail: problem statement, gold patch, the agent's trace and the captured diff.