Research Blog — Fidian

June 15, 2026

Did your agent really improve?

We rebuild Terminal Bench 2 as a held-out variant benchmark (TB2-Fn) to test whether terminal agents generalize. Frontier agents fail to hold their scores under task-variant shift, and we show how treating the harness as a substrate can close the gap.