AgentDebuggerEnv

Ranking LLMs on Hypothesis-Driven Debugging

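| Rank | Model | Tier 1 (Easy) | Tier 2 (Medium) | Tier 3 (Hard) | Mean Score |
|------|-------|---------------|-----------------|---------------|------------|
| 🥇 1 | GPT-4o | 89.0% | 71.0% | 38.0% | 0.742 |
| 🥈 2 | Llama-3.1-70B-Instruct (Baseline) | 21.0% | 21.5% | 21.5% | 0.210 |
| ⏳ - | AgentDebugger-Qwen2.5-7B (Training) | - | - | - | TBD |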

🧪 The Benchmark

Models are evaluated on 90 hand-validated Python bugs across 3 difficulty tiers. They must formulate a specific hypothesis before proposing a fix. Blind guessing is heavily penalized by the grading environment.
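The hypothesis-before-fix rule can be illustrated with a minimal sketch. The source does not describe the submission format or the size of the penalty, so the `Attempt` class, the `BLIND_GUESS_PENALTY` value, and both function names below are assumptions made purely for illustration, not the environment's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical penalty; the real environment's value is not stated in the source.
BLIND_GUESS_PENALTY = 0.5


@dataclass
class Attempt:
    """One debugging attempt (hypothetical shape, for illustration only)."""
    hypothesis: Optional[str]  # the model's stated explanation of the bug
    patch: str                 # the proposed fix


def is_blind_guess(attempt: Attempt) -> bool:
    """An attempt counts as a blind guess if no non-empty hypothesis is given."""
    return attempt.hypothesis is None or not attempt.hypothesis.strip()


def penalized_score(raw_score: float, attempt: Attempt) -> float:
    """Subtract the penalty from blind guesses, never going below zero."""
    if is_blind_guess(attempt):
        return max(0.0, raw_score - BLIND_GUESS_PENALTY)
    return raw_score
```

Under these assumptions, a fix submitted without a hypothesis loses part of its score even if the patch itself is correct, which is one simple way to discourage trial-and-error patching.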

⚖️ The Grading

A hybrid deterministic/semantic grader scores four components: hypothesis quality (judged by Llama-3.1-70B), format compliance, bug localization, and execution correctness, with fixes executed inside a secure sandbox.
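One plausible way to fold the four graded components into a single number is a weighted sum. The source does not say how the components are combined, so the weights, the `ComponentScores` class, and the `combine` function below are hypothetical and serve only as a sketch.

```python
from dataclasses import dataclass

# Hypothetical weights (sum to 1.0); the actual weighting is not given in the source.
WEIGHTS = {
    "hypothesis": 0.3,    # semantic: hypothesis quality from the LLM judge
    "format": 0.1,        # deterministic: output format compliance
    "localization": 0.2,  # deterministic: was the buggy location identified?
    "execution": 0.4,     # deterministic: does the fix pass in the sandbox?
}


@dataclass
class ComponentScores:
    """Per-component scores, each expected to lie in [0, 1]."""
    hypothesis: float
    format: float
    localization: float
    execution: float


def combine(scores: ComponentScores) -> float:
    """Return the weighted sum of the four component scores."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
        total += weight * value
    return total
```

With these assumed weights, a submission with a correct fix but a weak hypothesis still scores well below a fully reasoned one, reflecting the emphasis on hypothesis-driven debugging.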
