Five categories of fixes surfaced by the /ship pre-landing reviews
(testing + maintainability + security + performance + adversarial Claude),
applied as one review-iteration commit.
Refactor — collapse 5x duplicated judge-assertion block:
- Add assertRecommendationQuality() + RECOMMENDATION_SUBSTANCE_THRESHOLD
constant to test/helpers/e2e-helpers.ts.
- Plan-format (4 cases) and Phase 4 (1 case) collapse from ~22 lines each
to a single helper call. Future rubric tweaks land in one place instead
of five.
Performance — extract Phase 4 slice instead of copying full SKILL.md:
- Phase 4 test fixture now reads office-hours/SKILL.md and writes only the
AskUserQuestion Format section + Phase 4 section to the tmpdir, per
CLAUDE.md "extract, don't copy" rule. Verified locally: cost dropped
from $0.51 → $0.36/run, turn count 8 → 4, latency 50s → 36s. Reduces
Opus context bloat without weakening the regression check.
- Add `if (!workDir) return` guard to Phase 4 afterAll cleanup so a
skipped describe block doesn't silently fs.rmSync(undefined) under the
empty catch.
Defense — judge prompt + output:
- Wrap captured AskUserQuestion text in clearly delimited UNTRUSTED_CONTEXT
block with explicit instruction to treat its content as data, not commands.
Cheap defense against the (unlikely but real) injection vector where a
captured AskUserQuestion contains "Ignore previous instructions" text.
- Bump captured-text budget from 4000 → 8000 chars; real plan-format menus
with 4 options × ~800 chars exceed 4000 and were silently truncating
Haiku context mid-option.
Cleanup — abbreviation rule + dead imports + touchfile consistency:
- AUQ → AskUserQuestion in 3 sites (office-hours/SKILL.md.tmpl Phase 4
footer, two test comments) per the always-write-in-full memory rule.
Regenerated office-hours/SKILL.md.
- Drop unused `describe`/`test` imports in 2 new test files (only
describeIfSelected/testConcurrentIfSelected wrappers are used).
- Add `test/skill-e2e-office-hours-phase4.test.ts` to its own touchfile
entry for consistency with other entries that include their test file.
- Fix misleading comment in fixture test about LLM short-circuiting (it's
has_because, not commits, that skips the API call).
Verified: build clean, free `bun test` exits 0, fixture test 30/30
expect() calls pass, Phase 4 paid eval passes substance 5 in 36s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>