test: drop strict "Choose" regex from AUQ format checks; judge covers presence

Periodic-tier eval surfaced that Opus 4.7 writes "Recommendation: A) SCOPE
EXPANSION because..." (option label, no "Choose" prefix). That is the shape
the generate-ask-user-format.ts spec actually mandates: `Recommendation:
<choice> because <reason>`, where <choice> is the bare option label. The
legacy regex `/[Rr]ecommendation:[*\s]*Choose/` pinned down a per-skill
template-example phrasing that the canonical spec doesn't require, so it
false-failed on correctly formatted captures.
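
A minimal sketch of the false failure described above. LEGACY_RE is copied
from the diff; CANONICAL_SKETCH_RE is a hypothetical stand-in for the
canonical shape and is not the judge's actual regex. The reason text in the
sample lines is invented for illustration.

```typescript
// Legacy predicate removed by this commit (copied from the diff).
const LEGACY_RE = /[Rr]ecommendation:[*\s]*Choose/;

// Hypothetical approximation of `Recommendation: <choice> because <reason>`.
// Assumption: any non-empty choice followed later by the word "because".
const CANONICAL_SKETCH_RE = /[Rr]ecommendation:[*\s]*\S.*\bbecause\b/i;

// Illustrative captures; the wording after "because" is made up.
const bareLabel =
  'Recommendation: A) SCOPE EXPANSION because it covers the adjacent cases.';
const templateStyle =
  'Recommendation: Choose A because it covers the adjacent cases.';

function check(line: string): { legacy: boolean; canonical: boolean } {
  return {
    legacy: LEGACY_RE.test(line),
    canonical: CANONICAL_SKETCH_RE.test(line),
  };
}

console.log(check(bareLabel));     // { legacy: false, canonical: true }
console.log(check(templateStyle)); // { legacy: true, canonical: true }
```

The bare-label line satisfies the spec's shape but trips the legacy check,
which is the failure mode seen in the periodic runs.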

judgeRecommendation.present (deterministic regex over the canonical shape)
plus has_because and reason_substance >= 4 cover the recommendation surface
end-to-end. Drop the redundant strict regex from all five wired call sites
(four plan-format cases + new office-hours Phase 4 test).
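
How the three judge signals combine can be sketched roughly as below. The
field names present, has_because, and reason_substance come from this
message; the RecommendationJudgement interface and the
recommendationFailures helper are hypothetical and may not match the real
judge's return type.

```typescript
// Hypothetical result shape; only the field names are taken from the commit.
interface RecommendationJudgement {
  present: boolean;         // canonical Recommendation: line was found
  has_because: boolean;     // the line carries a "because" clause
  reason_substance: number; // graded substance of the reason
}

// Threshold named in the commit message (reason_substance >= 4).
const MIN_SUBSTANCE = 4;

// Returns the names of the checks that failed; empty means the
// recommendation surface is covered.
function recommendationFailures(j: RecommendationJudgement): string[] {
  const failures: string[] = [];
  if (!j.present) failures.push('present');
  if (!j.has_because) failures.push('has_because');
  if (j.reason_substance < MIN_SUBSTANCE) failures.push('reason_substance');
  return failures;
}

console.log(
  recommendationFailures({ present: true, has_because: true, reason_substance: 4 }),
); // []
```

With all three checks in place, a separate strict regex on the wording of
<choice> adds no coverage and only introduces false failures.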

Verified by re-reading the captured AUQs from both failing periodic runs:
both contained substantive Recommendation lines that the spec accepts and
the judge correctly grades at substance >= 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-05-01 14:23:07 -07:00
parent ed0e00daab
commit 91c0b31a78
2 changed files with 23 additions and 19 deletions

@@ -35,8 +35,12 @@ import * as os from 'os';
 const evalCollector = createEvalCollector('e2e-office-hours-phase4');
-// Format predicates — same shape as skill-e2e-plan-format.test.ts.
-const RECOMMENDATION_RE = /[Rr]ecommendation:[*\s]*Choose/;
+// Format predicates. The strict `Recommendation:[*\s]*Choose` regex used by
+// skill-e2e-plan-format pins down a specific template-example wording ("Choose
+// [X]"). The format spec at scripts/resolvers/preamble/generate-ask-user-format.ts
+// only requires `Recommendation: <choice> because <reason>` — `<choice>` can
+// be the bare option label. judgeRecommendation.present (deterministic) checks
+// this canonical shape correctly; we don't need a redundant strict regex here.
 const BECAUSE_RE = /\bbecause\b/i;
 // At least 2 numbered/lettered options (A/B or 1/2). Office-hours Phase 4 says
 // "2-3 distinct alternatives," so 2+ is the minimum bar.
@@ -123,8 +127,8 @@ After writing the file with that ONE Phase 4 question, stop. Do not continue to
 const captured = fs.readFileSync(outFile, 'utf-8');
 expect(captured.length).toBeGreaterThan(100);
-// Format-spec compliance.
-expect(captured).toMatch(RECOMMENDATION_RE);
+// Format-spec compliance. judgeRecommendation below covers the
+// Recommendation: line itself; these regexes catch cheap structural shape.
 expect(captured).toMatch(BECAUSE_RE);
 expect(captured).toMatch(TWO_OPTIONS_RE);
 // Phase-4 specificity: prevents a stray earlier-phase AUQ from false-passing.