Five categories of fixes surfaced by the /ship pre-landing reviews
(testing + maintainability + security + performance + adversarial Claude),
applied as one review-iteration commit.
Refactor — collapse 5x duplicated judge-assertion block:
- Add assertRecommendationQuality() + RECOMMENDATION_SUBSTANCE_THRESHOLD
constant to test/helpers/e2e-helpers.ts.
- Plan-format (4 cases) and Phase 4 (1 case) collapse from ~22 lines each
to a single helper call. Future rubric tweaks land in one place instead
of five.
Performance — extract Phase 4 slice instead of copying full SKILL.md:
- Phase 4 test fixture now reads office-hours/SKILL.md and writes only the
AskUserQuestion Format section + Phase 4 section to the tmpdir, per
CLAUDE.md "extract, don't copy" rule. Verified locally: cost dropped
from $0.51 → $0.36/run, turn count 8 → 4, latency 50s → 36s. Reduces
Opus context bloat without weakening the regression check.
- Add `if (!workDir) return` guard to Phase 4 afterAll cleanup so a
skipped describe block doesn't silently fs.rmSync(undefined) under the
empty catch.
Defense — judge prompt + output:
- Wrap captured AskUserQuestion text in clearly delimited UNTRUSTED_CONTEXT
block with explicit instruction to treat its content as data, not commands.
Cheap defense against the (unlikely but real) injection vector where a
captured AskUserQuestion contains "Ignore previous instructions" text.
- Bump captured-text budget from 4000 → 8000 chars; real plan-format menus
with 4 options × ~800 chars exceed 4000 and were silently truncating
Haiku context mid-option.
Cleanup — abbreviation rule + dead imports + touchfile consistency:
- AUQ → AskUserQuestion in 3 sites (office-hours/SKILL.md.tmpl Phase 4
footer, two test comments) per the always-write-in-full memory rule.
Regenerated office-hours/SKILL.md.
- Drop unused `describe`/`test` imports in 2 new test files (only
describeIfSelected/testConcurrentIfSelected wrappers are used).
- Add `test/skill-e2e-office-hours-phase4.test.ts` to its own touchfile
entry for consistency with other entries that include their test file.
- Fix misleading comment in fixture test about LLM short-circuiting (it's
has_because, not commits, that skips the API call).
Verified: build clean, free `bun test` exits 0, fixture test 30/30
expect() calls pass, Phase 4 paid eval passes substance 5 in 36s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Periodic-tier eval surfaced that Opus 4.7 writes "Recommendation: A) SCOPE
EXPANSION because..." (option label, no "Choose" prefix), which the
generate-ask-user-format.ts spec actually mandates — `Recommendation: <choice>
because <reason>` where <choice> is the bare option label. The legacy regex
`/[Rr]ecommendation:[*\s]*Choose/` pinned down a per-skill template-example
phrasing that the canonical spec doesn't require, so it false-failed on
correctly-formatted captures.
judgeRecommendation.present (deterministic regex over the canonical shape)
plus has_because and reason_substance >= 4 cover the recommendation surface
end-to-end. Drop the redundant strict regex from all five wired call sites
(four plan-format cases + new office-hours Phase 4 test).
Verified by re-reading the captured AUQs from both failing periodic runs:
both contained substantive Recommendation lines that the spec accepts and
the judge correctly grades at substance >= 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reproduces the production bug: agent in builder mode reaches Phase 4, presents
A/B/C alternatives, writes "Recommendation: C" in chat prose, and starts
editing the design doc immediately — never calls AskUserQuestion. The Phase 4
STOP-gate fix is the production-side change; this test traps regressions.
SDK + captureInstruction pattern (mirrors skill-e2e-plan-format). The PTY
harness can't seed builder mode + accept-premises to reach Phase 4
(runPlanSkillObservation only sends /skill\\r and waits), so we instruct the
agent to dump the verbatim Phase 4 AskUserQuestion to a file and assert on it
directly. The captured file IS the question — no false-pass risk on which
question got asked, since earlier-phase AUQs cannot satisfy the Phase-4-vocab
regex (approach / alternative / architecture / implementation).
Periodic-tier: Phase 4 requires the agent to invent 2-3 distinct architectures,
more open-ended than the 4 plan-format cases. Reclassify to gate if stable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>