Passing rate on TB2 vs TB2-Fn, by model, Terminus 2, Kira, and each model's native CLI agent. The red bracket marks the TB2 → TB2-Fn drop.

Native CLI = each model run under its own agent (GPT-5.5 / Codex, Opus 4.7 / Claude Code, Gemini 3.1 Pro / Gemini CLI). † Error bars show the margin of error (MOE): 95% CI half-width (±1.96σ).