Kira

Patch 1

terminus_kira.py + prompt

When the shell stalls (5 identical outputs), send a raw Ctrl-C/D instead of echoing a marker, and add a prompt rule: never use heredocs.
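
The stall check described in this patch can be sketched roughly as follows. This is a minimal illustration, not the code in terminus_kira.py: the names `StallDetector` and `recovery_keys` are hypothetical, and sending Ctrl-C first and Ctrl-D second is an assumption about how "Ctrl-C/D" is applied.

```python
from collections import deque
from typing import List

CTRL_C = "\x03"  # ETX: interrupts the foreground process
CTRL_D = "\x04"  # EOT: signals end-of-input to a process reading stdin


class StallDetector:
    """Hypothetical helper: flags a stall once `threshold` consecutive
    terminal outputs are identical (the patch uses 5)."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        # Only the last `threshold` outputs are kept.
        self._recent = deque(maxlen=threshold)

    def observe(self, output: str) -> bool:
        """Record one polled output; return True if the shell looks stalled."""
        self._recent.append(output)
        return (
            len(self._recent) == self.threshold
            and len(set(self._recent)) == 1
        )

    def reset(self) -> None:
        """Forget history, e.g. after recovery keystrokes were sent."""
        self._recent.clear()


def recovery_keys() -> List[str]:
    """Raw keystrokes to send on a stall instead of echoing a marker.

    Assumption: Ctrl-C first, then Ctrl-D.
    """
    return [CTRL_C, CTRL_D]
```

A caller would poll the terminal, pass each output to `observe`, and on `True` send the keystrokes from `recovery_keys()` before calling `reset()`.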

Patch 2

terminus-kira.txt (prompt)

Add a rule: validate the macros on 1 row, then 1% → 10% → 100% of the file before the full run.
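
To make the schedule in this rule concrete, the sketch below computes how many rows each validation stage would cover. The patch itself is a prompt rule, not code; the function name `staged_row_counts` and the round-up behaviour for small files are assumptions made for illustration.

```python
from typing import List, Sequence


def staged_row_counts(
    total_rows: int, percents: Sequence[int] = (1, 10, 100)
) -> List[int]:
    """Row counts to validate on: 1 row first, then each percentage of the file.

    Hypothetical helper; percentages are rounded up to whole rows.
    """
    if total_rows < 1:
        raise ValueError("total_rows must be positive")
    stages = [1]
    for percent in percents:
        # Integer ceiling division avoids floating-point rounding.
        rows = min(total_rows, (total_rows * percent + 99) // 100)
        if rows > stages[-1]:  # skip stages that would not grow the sample
            stages.append(rows)
    return stages


# For a 1M-row file the stages are 1, 10,000, 100,000, and 1,000,000 rows.
print(staged_row_counts(1_000_000))
```

Each stage only runs if the previous one produced the expected edits, so a broken macro is caught on a small sample rather than on the full file.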

Patch 3

terminus_kira.py

Add _sanitize_command: regex-detect << EOF as a command is sent, drop those keystrokes, and inject a [SYSTEM] hint.

Eval (pass rate per task, by eval iteration)

Task      Eval 1   Eval 2   Eval 3   Eval 4
Overall   60%      77%      77%      90%
t-001     60%      100%     60%      80%
t-002     40%      60%      20%      60%
t-003     80%      100%     80%      100%
t-004     40%      40%      100%     100%
t-005     80%      80%      100%     100%
t-006     60%      80%      80%      100%

✓ approved

Analysis

The agent opened a cat << EOF heredoc but never closed it, so the shell hung waiting for input, and Kira’s marker-echo polling kept re-entering the same trap.

Analysis

In the two failing tasks (t-002 and t-004), the agent burned its whole turn budget debugging macros directly on the full 1M-row file instead of on a sample.

Analysis

Whack-a-mole: the “test small first” rule pushed t-004 to 100%, but three other tasks (t-001, t-002, and t-003) regressed, and heredocs still slipped through because the LLM ignored the prompt rule.

Kira's pass rate on large-scale-text-editing across the loop's four eval iterations, on the variant set the loop tuned against vs. a held-out variant set never seen during patching.