Two test infrastructure bugs in the initial Codex eval landed in the
prior commit:
1. sandbox: 'read-only' (the default) blocked Codex from writing
$OUT_FILE. Test reported "STATUS: BLOCKED" and exited 0 without
a capture file. Fixed: sandbox: 'workspace-write' for all 4 cases,
allowing writes inside the tempdir.
2. recordCodexResult called a non-existent evalCollector.record()
API (I invented it). The real surface is addTest() with a
different field schema. Aligned with test/codex-e2e.test.ts
pattern.
With both fixed, the eval now actually measures Codex AskUserQuestion
format compliance. All 4 cases pass on v1.6.2.0 with the gpt.md
carve-out: RECOMMENDATION always, Completeness: N/10 only on coverage,
"options differ in kind" note on kind, ELI10 explanation present.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>