Agent Eval Harness / SWE-bench Verified

deepseek-v4-pro · 25 tasks across 12 repositories · run run-20260522-172308

deepseek-v4-pro was run on 25 real bug-fix tasks from SWE-bench Verified, each a genuine GitHub issue from a major Python project. The agent sees only the bug report and must produce a fix; hidden tests that it never sees decide the score. The harness also audits the tasks themselves for benchmark-quality defects.

Resolved

14 / 25

all FAIL→PASS and PASS→PASS tests green

Mean score

0.975

partial credit · fraction of graded tests passing

Flagged by audit

tasks with a benchmark-quality defect
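
The two headline numbers follow different rules, which a small sketch can make concrete. The Python below is an illustration only, not the harness's code: the `TaskResult` record, its field names, and the choice to average per-task fractions (rather than pool all tests across tasks) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    fail_to_pass: dict[str, bool]  # test name -> passed after the agent's patch
    pass_to_pass: dict[str, bool]  # test name -> still passing after the patch


def graded_outcomes(result: TaskResult) -> list[bool]:
    """All graded tests for one task: FAIL→PASS plus PASS→PASS."""
    return list(result.fail_to_pass.values()) + list(result.pass_to_pass.values())


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every graded test is green (all-or-nothing)."""
    outcomes = graded_outcomes(result)
    return bool(outcomes) and all(outcomes)


def task_score(result: TaskResult) -> float:
    """Partial credit: fraction of graded tests passing."""
    outcomes = graded_outcomes(result)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def summarize(results: list[TaskResult]) -> tuple[int, float]:
    """Return (resolved count, mean of per-task scores)."""
    if not results:
        return 0, 0.0
    resolved = sum(is_resolved(r) for r in results)
    mean_score = sum(task_score(r) for r in results) / len(results)
    return resolved, mean_score


if __name__ == "__main__":
    # Made-up example data, not taken from the run above.
    demo = [
        TaskResult("task-a", {"t1": True}, {"t2": True, "t3": True, "t4": True}),
        TaskResult("task-b", {"t1": False}, {"t2": True, "t3": True, "t4": True}),
    ]
    print(summarize(demo))
```

Under these assumptions, a task that misses a single test out of many is not resolved but still scores close to 1, so a mean score near 1 can sit alongside a much lower resolved count.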

Leaderboard

Per-task results

Each row opens the full task detail: problem statement, gold patch, the agent's trace and the captured diff.