Pass rate on TB2 vs. TB2-Fn by task group, pooled over models and harnesses (including native CLI agents). Brackets mark the TB2 → TB2-Fn change (red = drop, green = gain); n = number of tasks.
† Error bars show the margin of error (MOE), i.e. the half-width of the 95% confidence interval (±1.96σ).
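
For readers who want to reproduce an error bar of this kind, the sketch below computes a 95% MOE for a single pass rate. It assumes σ is the binomial standard error of the rate, sqrt(p(1 − p)/N), which the caption does not state; the function name `margin_of_error` and the example counts are hypothetical and are not taken from the figure.

```python
import math


def margin_of_error(passes: int, trials: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a pass rate.

    Assumption: sigma is the binomial standard error sqrt(p * (1 - p) / trials).
    The caption only says "±1.96σ", so the actual estimator may differ.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    sigma = math.sqrt(p * (1.0 - p) / trials)
    return z * sigma


# Hypothetical counts, for illustration only: 50 passes out of 100 attempts.
moe = margin_of_error(passes=50, trials=100)
print(f"pass rate = {50 / 100:.2f} ± {moe:.3f}")  # pass rate = 0.50 ± 0.098
```

The normal approximation becomes unreliable when the number of trials is small or the rate is close to 0 or 1; an interval such as Wilson's may be preferable in those cases.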