deepseek-v4-pro was run on 25 real bug-fix tasks from SWE-bench Verified — each a genuine GitHub issue from a major Python project. The agent sees only the bug report and must produce a fix; hidden tests it never sees decide the score. The harness also audits the tasks themselves.
Per-task results