Kira: passing rate per eval iteration on large-scale-text-editing
Patch 1 (terminus_kira.py + prompt): When the shell stalls (5 identical outputs), send a raw Ctrl-C/D instead of echoing a marker, and add a prompt rule: never use heredocs.
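The stall check in Patch 1 could look roughly like the sketch below. It is not the code in terminus_kira.py: `StallDetector` and `recovery_keystrokes` are hypothetical names, and sending Ctrl-C followed by Ctrl-D is only one reading of "Ctrl-C/D". The 5-output threshold comes from the patch description, and `\x03` and `\x04` are the standard control characters for Ctrl-C and Ctrl-D.

```python
from collections import deque

STALL_THRESHOLD = 5   # "5 identical outputs", per the patch description
CTRL_C = "\x03"       # ETX, what a terminal sends for Ctrl-C
CTRL_D = "\x04"       # EOT, what a terminal sends for Ctrl-D


class StallDetector:
    """Hypothetical helper that remembers the most recent polled outputs."""

    def __init__(self, threshold: int = STALL_THRESHOLD) -> None:
        self._recent = deque(maxlen=threshold)

    def observe(self, output: str) -> bool:
        """Record one output; return True once the last `threshold` are identical."""
        self._recent.append(output)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and len(set(self._recent)) == 1


def recovery_keystrokes() -> str:
    """Assumed recovery sequence: interrupt first, then end-of-input."""
    return CTRL_C + CTRL_D
```

A polling loop would call `observe` on each captured screen and, once it returns True, send `recovery_keystrokes()` to the terminal instead of echoing another marker.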
Patch 2 (terminus-kira.txt, prompt): Add a rule: validate the macros on 1 row, then 1% → 10% → 100% of the file before the full run.
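The escalation schedule in Patch 2 can be illustrated with a small sketch. Patch 2 itself is a prompt rule, not code, so everything here is an assumption made for illustration: the helper names `stage_sizes` and `first_failing_stage`, the choice to sample from the top of the file, and the per-row `transform` and `is_valid` callbacks standing in for "apply the macro" and "check the result".

```python
from typing import Callable, List, Optional, Sequence


def stage_sizes(total_rows: int) -> List[int]:
    """Row counts for the 1 row -> 1% -> 10% -> 100% schedule."""
    if total_rows < 1:
        raise ValueError("total_rows must be at least 1")
    sizes = {1}
    for pct in (1, 10, 100):
        sizes.add(max(1, total_rows * pct // 100))
    return sorted(sizes)  # duplicates collapse for very small files


def first_failing_stage(
    rows: Sequence[str],
    transform: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> Optional[int]:
    """Run each stage on a prefix of `rows`; return the size of the first
    stage with an invalid result, or None if every stage passes."""
    for size in stage_sizes(len(rows)):
        if not all(is_valid(transform(row)) for row in rows[:size]):
            return size
    return None
```

For a 1M-row file, `stage_sizes(1_000_000)` gives stages of 1, 10,000, 100,000, and 1,000,000 rows, so a broken macro is likely to surface on a cheap stage before the full run.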
Patch 3 (terminus_kira.py): Add _sanitize_command: regex-detect << EOF as a command is sent, drop those keystrokes, and inject a [SYSTEM] hint.
Eval 1: 60% overall (t-001 60%, t-002 40%, t-003 80%, t-004 40%, t-005 80%, t-006 60%)
Eval 2: 77% overall (t-001 100%, t-002 60%, t-003 100%, t-004 40%, t-005 80%, t-006 80%)
Eval 3: 77% overall (t-001 60%, t-002 20%, t-003 80%, t-004 100%, t-005 100%, t-006 80%)
Eval 4: 90% overall (t-001 80%, t-002 60%, t-003 100%, t-004 100%, t-005 100%, t-006 100%) ✓ approved
Analysis 1: The agent opened a cat << EOF heredoc but never closed it, so the shell hung waiting for input, and Kira’s marker-echo polling kept re-entering the same trap.
Analysis 2: On the two failing tasks (t-002 and t-004), the agent burned its whole turn budget debugging macros directly on the full 1M-row file instead of a sample.
Analysis 3: Whack-a-mole. The “test small first” rule pushed t-004 to 100%, but three other tasks (t-001, t-002, t-003) regressed, and heredocs still slipped through because the LLM ignored the prompt rule.

Kira's passing rate on large-scale-text-editing across the loop's four eval iterations, on the variant set the loop tuned against vs. a held-out variant set never seen during patching.