v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255)

* test: extract classifyVisible() + permission-dialog filter in PTY runner

Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:

- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
  gstack-config does not yet honor env overrides; plumbing is in place for a
  future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
  number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
  in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
  requires a file-edit context co-trigger. Other clauses unchanged. Skill
  questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
  plan_ready → asked → null. Permission dialogs filtered out of the
  'asked' classification so a permission prompt cannot pose as a Step 0
  skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.

* test: tighten plan-ceo-review smoke to require Step 0 fires first

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.

* chore: capture v1.21.1.0 follow-ups in TODOS.md

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.

* chore: bump version and changelog (v1.21.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract MODE_RE + optionsSignature into PTY runner exports

Refactor prep for the upcoming per-finding AskUserQuestion count test
across plan-{ceo,eng,design,devex}-review. Both new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe — pulling them into one source of truth in
test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the
fingerprint shape) updates everywhere instead of drifting per-test.

Mechanical: no behavior change in the mode-routing test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add per-finding count primitives + unit tests

Pure helpers landing ahead of runPlanSkillCounting:

  - parseQuestionPrompt(visible) — extract the 1-3 line prompt above
    the latest "❯ 1." cursor, normalize to a 240-char snippet
  - auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
    options signature; distinct prompts with shared option labels
    (the generic A/B/C TODO menu) get distinct fingerprints
  - COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
    plan-review skills' completion / verdict markers
  - assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
    REPORT" is present and is the last "## " heading in a plan file
  - Step0BoundaryPredicate type + four per-skill predicates
    (ceo / eng / design / devex) — fire on the answered AUQ's
    fingerprint, marking the end of Step 0 deterministically
    (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add runPlanSkillCounting PTY helper

Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:

  - Boot + auto-trust handler (existing launchClaudePty)
  - Send slash command alone, sleep 3s, send plan content as follow-up
    message (proven pattern from skill-e2e-plan-design-with-ui)
  - Poll loop with permission-dialog auto-grant, same-redraw skip,
    empty-prompt re-poll
  - Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
    the answered AUQ's fingerprint (Codex F7 — boundary is observed
    event, not later rendered content)
  - Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
    plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.

No E2E tests yet — those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: register four finding-count tests in touchfiles + tier map

Each new test depends on its skill template, the runner, and three
preamble resolvers (preamble.ts, generate-ask-user-format.ts,
generate-completion-status.ts) — those affect question cadence and
completion rendering, which is exactly what the test asserts on.

All four classified periodic. Sequential execution during calibration;
opt-in to concurrent only after measured comparison agrees (plan §D15).

Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests
(was 18) because plan-ceo-finding-count joins the family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)

Each test drives its plan-* skill through Step 0 then asserts the
review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5
seeded plan, plus D19: produced plan file ends with
"## GSTACK REVIEW REPORT" as its last "## " heading.

plan-ceo also runs a paired-finding positive control: 2 deliberately
related findings should still produce 2 distinct AUQs, not 1 batched.

Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic).
Sequential execution by plan §D15. Each fixture is inline TypeScript
content delivered as a follow-up message after the slash command, per
the proven pattern at skill-e2e-plan-design-with-ui.test.ts.

Calibration loop (5 runs per skill) and the manual pre-merge negative
check (D7 + D12) are required before merge per plan §Verification.
NOT yet run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: fix parseNumberedOptions for inline-cursor box-layout AUQs

Calibration run 1 timed out with step0=0 review=0 because the parser
could not find the cursor in /plan-ceo-review's scope-selection AUQ.
The TTY's box-layout rendering inlines divider + header + prompt +
"1." onto one logical line — cursor escapes get stripped, leaving
text crushed onto a single line.

Cursor anchor regex changed from anchored to unanchored so it matches
mid-line. Cursor-line option extraction uses a non-anchored regex;
subsequent options stay with the original start-of-line parser.

parseQuestionPrompt picks up the inline prompt text BEFORE the cursor
on the cursor line (after stripping box-drawing chars + sigil) and
appends it after any walked-up multi-line prompt above.

Three new unit tests: clean-cursor still works, inline-cursor
extracts all 7 options, prompt extraction strips box chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add firstAUQPick + plan-ceo skip-interview routing

Calibration run 1 surfaced a second issue beyond the parser bug: the
default pick of 1 on /plan-ceo-review's scope-selection AUQ routes
the agent to "branch diff vs main" — so it reviews the gstack PR
itself (recursive!) instead of the seeded fixture plan we sent.

Added firstAUQPick callback to runPlanSkillCounting. Override applies
only to the FIRST AUQ; subsequent presses keep using defaultPick.

ceoStep0Boundary now fires on either the mode-pick AUQ (existing path)
or any AUQ containing "Skip interview and plan immediately" — which
is the scope-selection AUQ. Picking that option bypasses Step 0 and
routes straight to review-phase using the chat-paste plan as context.

Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the
"Skip interview" option by label. Falls back to "describe inline" if
the option labels change.

Two new unit tests: ceoStep0Boundary fires on the scope-selection
fixture; existing mode-pick fixture still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-30 02:50:09 -07:00
committed by GitHub
parent e8893a18b1
commit 454423aeb3
14 changed files with 2271 additions and 78 deletions

View File

@@ -0,0 +1,749 @@
/**
* Deterministic unit tests for claude-pty-runner.ts behavior changes.
*
* Free-tier (no EVALS=1 needed). Runs in <1s on every `bun test`. Catches
* harness plumbing bugs before stochastic PTY runs surface them.
*
* Two surface areas tested:
*
* 1. Permission-dialog short-circuit in 'asked' classification: a TTY frame
* that matches BOTH isPermissionDialogVisible AND isNumberedOptionListVisible
* must NOT be classified as a skill question — permission dialogs render
* as numbered lists too, but they're not what we're guarding.
*
* 2. Env passthrough surface: runPlanSkillObservation accepts an `env`
* option and threads it to launchClaudePty. We can't fully exercise the
* spawn pipeline without paying for a PTY session, but we CAN verify the
* option exists in the type signature and that calling without env still
* works (no regression).
*
* The PTY test (skill-e2e-plan-ceo-plan-mode.test.ts) is the integration
* check; this file is the cheap deterministic guard for the harness primitives
* those tests stand on.
*/
import { describe, test, expect } from 'bun:test';
import {
isPermissionDialogVisible,
isNumberedOptionListVisible,
isPlanReadyVisible,
parseNumberedOptions,
classifyVisible,
TAIL_SCAN_BYTES,
optionsSignature,
parseQuestionPrompt,
auqFingerprint,
COMPLETION_SUMMARY_RE,
assertReviewReportAtBottom,
ceoStep0Boundary,
engStep0Boundary,
designStep0Boundary,
devexStep0Boundary,
type ClaudePtyOptions,
type AskUserQuestionFingerprint,
} from './claude-pty-runner';
describe('isPermissionDialogVisible', () => {
test('matches "Bash command requires permission" prompts', () => {
const sample = `
Some preamble output
Bash command \`gstack-config get telemetry\` requires permission to run.
1. Yes
2. Yes, and always allow
3. No, abort
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('matches "allow all edits" file-edit prompts', () => {
// Isolated to the "allow all edits" clause only — no overlapping
// "Do you want to proceed?" co-trigger, so this asserts the clause works.
const sample = `
Edit to ~/.gstack/config.yaml
1. Yes
2. Yes, allow all edits during this session
3. No
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('matches the "Do you want to proceed?" file-edit confirmation by itself', () => {
// Separate fixture so weakening this clause is detected by a dedicated test.
const sample = `
Edit to ~/.gstack/config.yaml
Do you want to proceed?
1. Yes
2. No
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('matches workspace-trust "always allow access to" prompt', () => {
const sample = `
Do you trust the files in this folder?
1. Yes, proceed
2. Yes, and always allow access to /Users/me/repo
3. No, exit
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('does NOT match a skill AskUserQuestion list', () => {
const sample = `
D1 — Premise challenge: do users actually want this?
1. Yes, validated
2. No, premise is wrong
3. Need more info
`;
expect(isPermissionDialogVisible(sample)).toBe(false);
});
test('does NOT match a plan-ready confirmation', () => {
const sample = `
Ready to execute the plan?
1. Yes
2. No, keep planning
`;
expect(isPermissionDialogVisible(sample)).toBe(false);
});
test('does NOT match a skill question that contains the bare phrase "Do you want to proceed?"', () => {
// Co-trigger requirement: "Do you want to proceed?" alone is not enough.
// It must appear with "Edit to <path>" or "Write to <path>" to count as
// a permission dialog. This guards against a skill question like
// "Do you want to proceed with HOLD SCOPE?" being mis-classified.
const sample = `
Choose your scope mode for this review.
Do you want to proceed?
1. HOLD SCOPE
2. SCOPE EXPANSION
3. SELECTIVE EXPANSION
`;
expect(isPermissionDialogVisible(sample)).toBe(false);
});
test('does NOT mis-match when adversarial prose includes "Edit to <path>" alongside the bare proceed phrase', () => {
// Adversarial fixture: a skill question whose body legitimately mentions
// "Edit to <path>" in prose AND ends with "Do you want to proceed?". The
// current co-trigger regex would mis-classify this as a permission
// dialog. We DO want this test to fail until the regex is tightened
// further (e.g., proximity constraint, or anchoring "Edit to" to a
// line-start). For now this is documented as a known limitation: a
// skill question that talks about "Edit to" in prose IS still treated
// as a permission dialog. The test asserts the current behavior so a
// future fix can flip it intentionally.
const sample = `
Plan: I will Edit to ./plan.md to capture the decision.
Do you want to proceed?
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
// KNOWN LIMITATION: the co-trigger fires here. Documented as a
// post-merge follow-up. Flip this assertion once the regex tightens.
expect(isPermissionDialogVisible(sample)).toBe(true);
});
});
describe('isNumberedOptionListVisible', () => {
test('matches a basic 1. + 2. cursor list', () => {
const sample = `
1. Option one
2. Option two
3. Option three
`;
expect(isNumberedOptionListVisible(sample)).toBe(true);
});
test('returns false on a single-option prompt', () => {
const sample = `
1. Only option
`;
expect(isNumberedOptionListVisible(sample)).toBe(false);
});
test('returns false when no cursor renders', () => {
const sample = `
Just some prose with 1. a numbered point and 2. another.
`;
expect(isNumberedOptionListVisible(sample)).toBe(false);
});
test('overlaps permission dialogs (this is why D5 short-circuits)', () => {
// The whole point of D5: this string matches BOTH classifiers, so the
// runner must consult isPermissionDialogVisible to disambiguate.
const sample = `
Bash command \`do-thing\` requires permission to run.
1. Yes
2. No
`;
expect(isNumberedOptionListVisible(sample)).toBe(true);
expect(isPermissionDialogVisible(sample)).toBe(true);
});
});
describe('classifyVisible (runtime path through the runner classifier)', () => {
// These tests call the actual classifier so a future contributor who
// reorders branches (e.g. moves the permission short-circuit before
// isPlanReadyVisible) is caught deterministically.
test('skill question → returns asked', () => {
const visible = `
D1 — Choose your scope mode
1. HOLD SCOPE
2. SCOPE EXPANSION
3. SELECTIVE EXPANSION
4. SCOPE REDUCTION
`;
const result = classifyVisible(visible);
expect(result?.outcome).toBe('asked');
});
test('permission dialog (Bash) → returns null (skip, keep polling)', () => {
const visible = `
Bash command \`gstack-update-check\` requires permission to run.
1. Yes
2. No
`;
expect(isNumberedOptionListVisible(visible)).toBe(true); // pre-filter
expect(classifyVisible(visible)).toBeNull(); // post-filter
});
test('plan-ready confirmation → returns plan_ready (wins over asked)', () => {
const visible = `
Ready to execute the plan?
1. Yes, proceed
2. No, keep planning
`;
const result = classifyVisible(visible);
expect(result?.outcome).toBe('plan_ready');
});
test('silent write to unsanctioned path → returns silent_write', () => {
const visible = `
⏺ Write(src/app/dangerous-write.ts)
⎿ Wrote 42 lines
`;
const result = classifyVisible(visible);
expect(result?.outcome).toBe('silent_write');
expect(result?.summary).toContain('src/app/dangerous-write.ts');
});
test('write to sanctioned path (.claude/plans) → returns null (allowed)', () => {
const visible = `
⏺ Write(/Users/me/.claude/plans/some-plan.md)
⎿ Wrote 42 lines
`;
expect(classifyVisible(visible)).toBeNull();
});
test('write while a permission dialog is on screen → returns null (gated, not silent, not asked)', () => {
const visible = `
⏺ Write(src/app/edit-with-permission.ts)
Edit to src/app/edit-with-permission.ts
Do you want to proceed?
1. Yes
2. No
`;
// The numbered prompt is a permission dialog (Edit to + Do you want to proceed?);
// silent_write is suppressed because a numbered prompt is visible, AND
// 'asked' is suppressed because the prompt is a permission dialog.
expect(classifyVisible(visible)).toBeNull();
});
test('write while a real skill question is on screen → returns asked (write is captured but not silent)', () => {
const visible = `
⏺ Write(src/app/foo.ts)
D1 — Choose your scope mode
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
// The numbered prompt is a skill question, not a permission dialog;
// silent_write is suppressed (numbered prompt is visible) and the
// outcome is 'asked' — Step 0 fired.
const result = classifyVisible(visible);
expect(result?.outcome).toBe('asked');
});
test('idle / no signals → returns null', () => {
const visible = `
Some prose without any classifier signals.
`;
expect(classifyVisible(visible)).toBeNull();
});
test('TAIL_SCAN_BYTES is exported as 1500', () => {
// Shared between runner and routing test; a regression that desyncs the
// recent-tail window would surface here.
expect(TAIL_SCAN_BYTES).toBe(1500);
});
});
describe('parseNumberedOptions', () => {
test('extracts options from a clean cursor list', () => {
const visible = `
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
const opts = parseNumberedOptions(visible);
expect(opts).toHaveLength(2);
expect(opts[0]).toEqual({ index: 1, label: 'HOLD SCOPE' });
expect(opts[1]).toEqual({ index: 2, label: 'SCOPE EXPANSION' });
});
test('returns empty array on prose-with-numbers (no cursor)', () => {
expect(parseNumberedOptions('text 1. one 2. two')).toEqual([]);
});
test('extracts options when the cursor is INLINE with prompt header (box-layout)', () => {
// Real /plan-ceo-review rendering: the TTY's cursor-positioning escapes
// collapse divider + header + prompt + cursor onto one logical line.
// Subsequent options (2..7) still start their own lines.
const visible = [
'────────────────────────────────────────',
'☐ Review scope What scope do you want me to CEO-review? 1. The branch\'s diff vs main',
' Review the full branch: ~10K LOC.',
'2. A specific plan file or design doc',
' You point me at a file (path) and I review that.',
'3. An idea you\'ll describe inline',
'4. Cancel — wrong skill',
'5. Type something.',
'────────────────────────────────────────',
'6. Chat about this',
'7. Skip interview and plan immediately',
].join('\n');
const opts = parseNumberedOptions(visible);
expect(opts).toHaveLength(7);
expect(opts[0]).toEqual({ index: 1, label: "The branch's diff vs main" });
expect(opts[1]?.index).toBe(2);
expect(opts[6]?.index).toBe(7);
expect(opts[6]?.label).toBe('Skip interview and plan immediately');
});
test('inline-cursor and start-of-line cursor both produce 7 options for the box-layout case', () => {
// The inline path captures option 1 from the cursor line itself; the
// subsequent-lines path captures 2..7 with the existing optionRe.
const inlineLayout = [
'header text 1. first option',
'2. second',
'3. third',
].join('\n');
expect(parseNumberedOptions(inlineLayout)).toEqual([
{ index: 1, label: 'first option' },
{ index: 2, label: 'second' },
{ index: 3, label: 'third' },
]);
const cleanLayout = [
' 1. first option',
' 2. second',
' 3. third',
].join('\n');
expect(parseNumberedOptions(cleanLayout)).toEqual([
{ index: 1, label: 'first option' },
{ index: 2, label: 'second' },
{ index: 3, label: 'third' },
]);
});
});
describe('runPlanSkillObservation env passthrough surface', () => {
test('ClaudePtyOptions exposes env: Record<string, string>', () => {
// Type-level guard: this file would fail to compile if the env field
// were removed or its shape regressed. The actual env merge happens in
// launchClaudePty's spawn call (`env: { ...process.env, ...opts.env }`),
// so a regression where `env: opts.env` gets dropped from the
// runPlanSkillObservation -> launchClaudePty handoff is only caught by
// the live PTY test, not here.
const opts: ClaudePtyOptions = {
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
};
expect(opts.env).toEqual({ QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' });
});
});
// ────────────────────────────────────────────────────────────────────────────
// Per-finding count primitives — Section 3 unit tests #1#5, #7, #12.
// ────────────────────────────────────────────────────────────────────────────
describe('optionsSignature', () => {
test('returns a "|"-joined `index:label` string for a clean list', () => {
const sig = optionsSignature([
{ index: 1, label: 'HOLD SCOPE' },
{ index: 2, label: 'SCOPE EXPANSION' },
]);
expect(sig).toBe('1:HOLD SCOPE|2:SCOPE EXPANSION');
});
test('order-independent: shuffled inputs produce the same signature', () => {
// parseNumberedOptions already returns sorted, but defensive sort means
// a future caller that hands us shuffled input still produces a stable
// dedupe signature.
const a = optionsSignature([
{ index: 2, label: 'B' },
{ index: 1, label: 'A' },
{ index: 3, label: 'C' },
]);
const b = optionsSignature([
{ index: 1, label: 'A' },
{ index: 2, label: 'B' },
{ index: 3, label: 'C' },
]);
expect(a).toBe(b);
});
test('empty list returns empty string', () => {
expect(optionsSignature([])).toBe('');
});
test('single-item list returns just that entry', () => {
expect(optionsSignature([{ index: 1, label: 'Only' }])).toBe('1:Only');
});
});
describe('parseQuestionPrompt', () => {
test('captures 1-line prompt above the cursor', () => {
const visible = `
D1 — Pick a mode
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
const prompt = parseQuestionPrompt(visible);
expect(prompt).toBe('D1 — Pick a mode');
});
test('captures multi-line prompt above the cursor', () => {
const visible = `
D2 — Approach selection
Which architecture should we follow?
1. Bypass existing helper
2. Reuse existing helper
`;
const prompt = parseQuestionPrompt(visible);
// Multi-line prompts get joined with single spaces.
expect(prompt).toContain('D2 — Approach selection');
expect(prompt).toContain('Which architecture should we follow?');
});
test('returns "" when no cursor is rendered', () => {
expect(parseQuestionPrompt('Just some prose.\nNo cursor.')).toBe('');
});
test('truncates to 240 chars', () => {
const longPrompt = 'A'.repeat(500);
const visible = `${longPrompt}\n\n 1. yes\n 2. no`;
expect(parseQuestionPrompt(visible).length).toBeLessThanOrEqual(240);
});
test('does not pull text from a previous numbered list above', () => {
const visible = `
1. previous answered question
2. previous option two
D2 — A new question text
1. fresh option A
2. fresh option B
`;
const prompt = parseQuestionPrompt(visible);
// Stops at the previous numbered-list line; should NOT contain "previous answered question".
expect(prompt).toContain('D2 — A new question text');
expect(prompt).not.toContain('previous answered question');
});
test('normalizes whitespace (collapses runs of spaces and tabs)', () => {
const visible = `D1 — Spaced out
1. yes
2. no`;
expect(parseQuestionPrompt(visible)).toBe('D1 — Spaced out');
});
test('inline-cursor box-layout: extracts prompt text BEFORE 1. on the cursor line', () => {
// Real /plan-ceo-review rendering: divider + ☐ header + prompt text +
// cursor are all on one logical line because TTY cursor-positioning
// escapes collapse the box layout under stripAnsi.
const visible = [
'──────────────────',
'☐ Review scope What scope do you want me to CEO-review? 1. The branch\'s diff vs main',
'2. A specific plan file',
'3. An idea inline',
].join('\n');
const prompt = parseQuestionPrompt(visible);
// Should extract "Review scope" and the prompt text, dropping the ☐ box-drawing sigil.
expect(prompt).toContain('Review scope');
expect(prompt).toContain('What scope do you want me to CEO-review?');
expect(prompt).not.toContain('');
expect(prompt).not.toMatch(/^☐/);
});
});
describe('auqFingerprint', () => {
test('returns the same fingerprint for identical inputs', () => {
const opts = [
{ index: 1, label: 'A' },
{ index: 2, label: 'B' },
];
expect(auqFingerprint('hello', opts)).toBe(auqFingerprint('hello', opts));
});
test('different prompts with shared option labels produce DIFFERENT fingerprints', () => {
// The collision regression Codex F1 caught: option-label-only fingerprints
// collapsed multiple distinct findings into one when they shared menu shape.
const sharedOpts = [
{ index: 1, label: 'Add to plan' },
{ index: 2, label: 'Defer' },
{ index: 3, label: 'Build now' },
];
const fpFinding1 = auqFingerprint('D5 — Architecture: bypass helper?', sharedOpts);
const fpFinding2 = auqFingerprint('D6 — Tests: zero coverage?', sharedOpts);
expect(fpFinding1).not.toBe(fpFinding2);
});
test('same prompt with different options produces DIFFERENT fingerprints', () => {
const prompt = 'D1 — Pick a mode';
const fpA = auqFingerprint(prompt, [
{ index: 1, label: 'HOLD SCOPE' },
{ index: 2, label: 'SCOPE EXPANSION' },
]);
const fpB = auqFingerprint(prompt, [
{ index: 1, label: 'HOLD SCOPE' },
{ index: 2, label: 'SCOPE REDUCTION' },
]);
expect(fpA).not.toBe(fpB);
});
test('whitespace-only differences in prompt do NOT change the fingerprint', () => {
// Same content, different rendering whitespace (TTY redraw artifact)
// must produce the same fingerprint so dedupe survives reflow.
const opts = [{ index: 1, label: 'A' }, { index: 2, label: 'B' }];
const fpA = auqFingerprint('Pick a mode', opts);
const fpB = auqFingerprint('Pick a mode', opts);
expect(fpA).toBe(fpB);
});
test('empty prompt + same options collide (caller must guard against this)', () => {
// Documents the contract: empty-prompt fingerprints WILL collide if the
// caller fingerprints them. runPlanSkillCounting must skip empty-prompt
// AUQs and re-poll instead.
const opts = [{ index: 1, label: 'A' }];
expect(auqFingerprint('', opts)).toBe(auqFingerprint('', opts));
});
});
describe('COMPLETION_SUMMARY_RE', () => {
test('matches GSTACK REVIEW REPORT heading', () => {
expect(COMPLETION_SUMMARY_RE.test('## GSTACK REVIEW REPORT')).toBe(true);
});
test('matches Completion Summary heading (ceo + eng)', () => {
expect(COMPLETION_SUMMARY_RE.test('## Completion Summary')).toBe(true);
expect(COMPLETION_SUMMARY_RE.test('## Completion summary')).toBe(true);
});
test('matches Status: clean (CEO review-log shape)', () => {
expect(COMPLETION_SUMMARY_RE.test('Status: clean')).toBe(true);
expect(COMPLETION_SUMMARY_RE.test('Status: issues_open')).toBe(true);
});
test('matches VERDICT: line', () => {
expect(COMPLETION_SUMMARY_RE.test('VERDICT: CLEARED — Eng Review passed')).toBe(true);
});
test('does NOT match prose mentions of "verdict" mid-line', () => {
// VERDICT must be at the start of a line to count.
expect(COMPLETION_SUMMARY_RE.test('the final verdict: undecided')).toBe(false);
});
});
describe('assertReviewReportAtBottom', () => {
test('passes when REVIEW REPORT is the only/last ## heading', () => {
const content = `# Plan
## Context
stuff
## Approach
more stuff
## GSTACK REVIEW REPORT
| col | col |
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(true);
});
test('fails when REVIEW REPORT is missing', () => {
const content = `# Plan
## Context
stuff
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(false);
expect(r.reason).toMatch(/no GSTACK REVIEW REPORT/);
});
test('fails when REVIEW REPORT exists but a ## heading follows it', () => {
const content = `# Plan
## GSTACK REVIEW REPORT
| col | col |
## Late Section
oops
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(false);
expect(r.reason).toMatch(/trailing ## heading/);
expect(r.trailingHeadings).toEqual(['## Late Section']);
});
test('passes when only ### subheadings follow REVIEW REPORT (deeper nesting allowed)', () => {
const content = `## GSTACK REVIEW REPORT
### Cross-model tension
- F1: resolved
- F2: resolved
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(true);
});
test('fails with multiple trailing ## headings reported', () => {
const content = `## GSTACK REVIEW REPORT
## First trailing
## Second trailing
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(false);
expect(r.trailingHeadings).toHaveLength(2);
});
});
describe('Step0BoundaryPredicate per-skill', () => {
// Helper to build a synthetic fingerprint for predicate tests.
function fp(promptSnippet: string, optionLabels: string[]): AskUserQuestionFingerprint {
const options = optionLabels.map((label, i) => ({ index: i + 1, label }));
return {
signature: auqFingerprint(promptSnippet, options),
promptSnippet,
options,
observedAtMs: 0,
preReview: true,
};
}
describe('ceoStep0Boundary', () => {
test('FIRES on Step 0F mode-pick AUQ (HOLD SCOPE in options)', () => {
const f = fp('Pick a mode', ['HOLD SCOPE', 'SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'SCOPE REDUCTION']);
expect(ceoStep0Boundary(f)).toBe(true);
});
test('FIRES on scope-selection AUQ with "Skip interview" option (skip-interview path)', () => {
// After calibration run 1: plan-ceo's first AUQ is scope-selection,
// and we route via "Skip interview and plan immediately" to bypass
// Step 0 entirely. Boundary must fire on this AUQ so subsequent
// AUQs go to reviewCount.
const f = fp(
'What scope do you want me to CEO-review?',
[
"The branch's diff vs main",
'A specific plan file',
"An idea you'll describe inline",
'Cancel — wrong skill',
'Type something.',
'Chat about this',
'Skip interview and plan immediately',
],
);
expect(ceoStep0Boundary(f)).toBe(true);
});
test('does NOT fire on premise challenge AUQs', () => {
const f = fp('D1 — Premise check: is this the right problem?', ['Yes', 'No', 'Other']);
expect(ceoStep0Boundary(f)).toBe(false);
});
test('does NOT fire on review-section AUQs', () => {
const f = fp('Architecture: bypass helper?', ['Reuse existing', 'Roll new', 'Defer']);
expect(ceoStep0Boundary(f)).toBe(false);
});
});
describe('engStep0Boundary', () => {
test('FIRES on cross-project learnings prompt', () => {
const f = fp('Enable cross-project learnings on this machine?', ['Yes', 'No']);
expect(engStep0Boundary(f)).toBe(true);
});
test('FIRES on scope reduction recommendation', () => {
const f = fp('Scope reduction recommendation: cut to MVP?', ['Reduce', 'Proceed', 'Modify']);
expect(engStep0Boundary(f)).toBe(true);
});
test('does NOT fire on review-section AUQs', () => {
const f = fp('Architecture: shared mutable state?', ['Refactor', 'Defer', 'Skip']);
expect(engStep0Boundary(f)).toBe(false);
});
});
describe('designStep0Boundary', () => {
test('FIRES on design system / posture mention', () => {
const f = fp('Pick a design posture for this review', ['Polish', 'Triage', 'Expansion']);
expect(designStep0Boundary(f)).toBe(true);
});
test('FIRES on first-dimension prompt', () => {
const f = fp('First dimension: visual hierarchy. Score?', ['7', '8', '9']);
expect(designStep0Boundary(f)).toBe(true);
});
test('does NOT fire on later dimension AUQs', () => {
const f = fp('Spacing dimension score?', ['7', '8', '9']);
expect(designStep0Boundary(f)).toBe(false);
});
});
describe('devexStep0Boundary', () => {
test('FIRES on developer persona selection', () => {
const f = fp('Pick the target persona for this review', ['Senior backend', 'Junior frontend', 'Other']);
expect(devexStep0Boundary(f)).toBe(true);
});
test('FIRES on TTHW target prompt', () => {
const f = fp('What is the TTHW target for first run?', ['<5 min', '<15 min', '<30 min']);
expect(devexStep0Boundary(f)).toBe(true);
});
test('does NOT fire on review-section AUQs', () => {
const f = fp('Friction point: 5-min CI wait. Address?', ['Now', 'Defer', 'Skip']);
expect(devexStep0Boundary(f)).toBe(false);
});
});
});