AgentDebuggerEnv Benchmark Leaderboard

Rank	Model	Tier 1 (Easy)	Tier 2 (Medium)	Tier 3 (Hard)	Mean Score
🥇 1	GPT-4o	89.0%	71.0%	38.0%	0.742
🥈 2	Llama-3.1-70B-Instruct (Baseline)	21.0%	21.5%	21.5%	0.210
⏳ -	AgentDebugger-Qwen2.5-7B (Training)	-	-	-	TBD

🧪 The Benchmark

Models are evaluated on 90 hand-validated Python bugs spread across three difficulty tiers. For each bug, a model must state a specific hypothesis about the cause before proposing a fix. The grading environment heavily penalizes blind guessing.
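The hypothesis-before-fix rule can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Submission` fields, the word-count check, and the 50% penalty are hypothetical and are not taken from the benchmark's actual environment.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical shape of one model answer (not the benchmark's real schema)."""
    hypothesis: str  # the model's stated explanation of the bug
    patch: str       # the proposed fix


def has_specific_hypothesis(sub: Submission, min_words: int = 8) -> bool:
    # Crude stand-in for "specific": the hypothesis must be present
    # and longer than a trivial phrase. The threshold is an assumption.
    return len(sub.hypothesis.split()) >= min_words


def apply_guess_penalty(sub: Submission, raw_score: float, penalty: float = 0.5) -> float:
    # A fix submitted without a usable hypothesis keeps only part of its score.
    # The penalty size is illustrative, not the benchmark's real value.
    if has_specific_hypothesis(sub):
        return raw_score
    return raw_score * (1.0 - penalty)
```

Under these assumptions, a correct patch submitted with an empty hypothesis would keep only half of its raw score, while the same patch with a reasoned hypothesis would keep all of it.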

⚖️ The Grading

A hybrid deterministic/semantic grader scores each submission on four criteria: hypothesis quality (judged by Llama-3.1-70B), format compliance, bug localization, and execution correctness. Proposed fixes are executed inside a secure sandbox.
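One way to picture how the four criteria might combine into a single score is a weighted average. This is only a sketch: the source does not say how the components are weighted or aggregated, so the equal weights, the 0-to-1 component scale, and all names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeComponents:
    """Per-submission component scores, each assumed to lie in [0, 1]."""
    hypothesis_quality: float     # semantic: would come from the LLM judge
    format_compliance: float      # did the answer follow the required format
    bug_localization: float       # did the model point at the right location
    execution_correctness: float  # did the patched code pass in the sandbox


# Hypothetical equal weights; the benchmark's real weighting is not stated.
WEIGHTS = {
    "hypothesis_quality": 0.25,
    "format_compliance": 0.25,
    "bug_localization": 0.25,
    "execution_correctness": 0.25,
}


def combine(components: GradeComponents, weights: dict = WEIGHTS) -> float:
    # Weighted average of the component scores, normalized by the total weight
    # so the result stays in [0, 1] even if the weights do not sum to 1.
    for name in weights:
        value = getattr(components, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = sum(weights.values())
    return sum(w * getattr(components, name) for name, w in weights.items()) / total
```

With these assumed weights, a submission that localizes and fixes the bug but offers a weak hypothesis would still lose a quarter of the attainable score, which matches the stated intent of rewarding reasoning and not only a passing patch.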