Ranking LLMs on Hypothesis-Driven Debugging
| Rank | Model | Tier 1 (Easy) | Tier 2 (Med) | Tier 3 (Hard) | Mean Score |
|---|---|---|---|---|---|
| 🥇 1 | GPT-4o | 89.0% | 71.0% | 38.0% | 0.742 |
| 🥈 2 | Llama-3.1-70B-Instruct (Baseline) | 21.0% | 21.5% | 21.5% | 0.210 |
| ⏳ - | AgentDebugger-Qwen2.5-7B (Training) | - | - | - | TBD |
Models are evaluated on 90 hand-validated Python bugs across 3 difficulty tiers. They must formulate a specific hypothesis before proposing a fix. Blind guessing is heavily penalized by the grading environment.
A hybrid deterministic/semantic grader scores four aspects of each submission: hypothesis quality (judged by Llama-3.1-70B), format compliance, bug localization, and execution correctness inside a secure sandbox.
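As a rough illustration of how such a hybrid grader could combine its four signals, here is a minimal sketch. The weights, the `guess_cap` penalty, and the names `Submission`, `ASSUMED_WEIGHTS`, and `grade` are all assumptions made for this example. The leaderboard does not publish its actual scoring formula, so this should not be read as the real grader.

```python
from dataclasses import dataclass

# Hypothetical weights: the source does not say how the four components
# are combined, so these values are placeholders for illustration only.
ASSUMED_WEIGHTS = {
    "hypothesis": 0.4,
    "format": 0.1,
    "localization": 0.2,
    "execution": 0.3,
}


@dataclass
class Submission:
    hypothesis_score: float  # semantic judge score in [0, 1]; 0.0 if no hypothesis
    format_ok: bool          # deterministic check: output follows the required format
    localized: bool          # deterministic check: the buggy location was identified
    tests_pass: bool         # deterministic check: the fix passes when executed


def grade(sub: Submission, weights=ASSUMED_WEIGHTS, guess_cap: float = 0.2) -> float:
    """Combine semantic and deterministic signals into one score in [0, 1]."""
    if not 0.0 <= sub.hypothesis_score <= 1.0:
        raise ValueError("hypothesis_score must be in [0, 1]")

    score = (
        weights["hypothesis"] * sub.hypothesis_score
        + weights["format"] * float(sub.format_ok)
        + weights["localization"] * float(sub.localized)
        + weights["execution"] * float(sub.tests_pass)
    )

    # Assumed blind-guessing penalty: a fix submitted with no usable
    # hypothesis is capped at guess_cap, even if it happens to pass.
    if sub.hypothesis_score == 0.0:
        score = min(score, guess_cap)

    return round(score, 3)


if __name__ == "__main__":
    reasoned = Submission(hypothesis_score=1.0, format_ok=True, localized=True, tests_pass=True)
    blind = Submission(hypothesis_score=0.0, format_ok=True, localized=True, tests_pass=True)
    print(grade(reasoned), grade(blind))
```

Under these assumed weights, a correct fix with no hypothesis is capped well below a fully reasoned one. That is one way a grader could penalize blind guessing.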