Fidian

Research Blog

  • June 15, 2026

    Did your agent really improve?

    We rebuild Terminal Bench 2 as a held-out variant benchmark (TB2-Fn) to test whether terminal agents generalize. Frontier agents fail to hold their scores under task-variant shift, and we show how treating the harness as a substrate can close the gap.

© 2026 Fidian, Inc.