June 15, 2026
Did your agent really improve?
We rebuild Terminal Bench 2 as a held-out variant benchmark (TB2-Fn) to test whether terminal agents generalize. Frontier agents fail to hold their scores under task-variant shift, and we show how treating the harness as a substrate can close the gap.