mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-13 07:53:04 +08:00
v1.31.0.0 fix: delete AskUserQuestion fallback (root cause of forever war) + harness primitives (#1390)
* test: add multi-finding batching regression test (periodic tier) Adds a periodic-tier E2E that catches the May 2026 transcript bug shape the existing single-finding gate-tier floor test cannot detect: a model that fires one AskUserQuestion and then batches the remaining findings into a single "## Decisions to confirm" plan write + ExitPlanMode. Why a separate test from skill-e2e-plan-eng-finding-floor: the gate-tier floor (runPlanSkillFloorCheck) exits on the first AUQ render and returns success, so a once-then-batch model would pass it trivially. This test uses runPlanSkillCounting at periodic tier with N-AUQ tracking and asserts >= 3 distinct review-phase AUQs on a 4-finding seeded plan. - test/fixtures/forcing-finding-seeds.ts: FORCING_BATCHING_ENG fixture (4 distinct non-trivial findings spread across Architecture, Code Quality, Tests, Performance — mirrors the D1-D4 transcript shape) - test/skill-e2e-plan-eng-multi-finding-batching.test.ts: new test - test/helpers/touchfiles.ts: registered in BOTH E2E_TOUCHFILES and E2E_TIERS (touchfiles.test.ts asserts exact equality) Test will fail on baseline today because today's model uses the preamble fallback to batch findings; passes after the architectural fix lands in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: expand plan-mode pass envelopes to accept BLOCKED path Three existing plan-mode regression tests previously codified the preamble fallback as a valid PASS path under --disallowedTools AskUserQuestion: outcome=plan_ready was accepted only when the model wrote a "## Decisions to confirm" section. The forever-war fix deletes that fallback, so this assertion would fail post-deletion. Expanded envelope accepts EITHER: - 'plan_ready' WITH (## Decisions section [legacy] OR BLOCKED string visible in TTY [post-fix]) - 'exited' WITH BLOCKED string visible in TTY [post-fix] The legacy ## Decisions branch stays in the envelope so these tests keep passing on today's code (where the fallback still exists) and on tomorrow's code (where the model reports BLOCKED instead). Once the deletion has been on main long enough that the cache flushes, the legacy branch can be removed in a follow-up. Failure signals (regression we DO want to catch) unchanged: auto_decided / silent_write / timeout / exited-without-BLOCKED / plan_ready-without-(decisions OR BLOCKED). - test/skill-e2e-plan-ceo-plan-mode.test.ts (test 2 only) - test/skill-e2e-autoplan-auto-mode.test.ts - test/skill-e2e-plan-design-plan-mode.test.ts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: delete AskUserQuestion fallback (root cause of forever war) The /plan-eng-review skill failed to fire AskUserQuestion on a real plan review and surfaced 4 calibration decisions via prose instead. Investigation traced this to a "fallback when neither variant is callable" clause in the preamble that the model rationalizes around as a general escape hatch from "fanning out round-trip AUQs," even when an AUQ variant IS callable. Codex review confirmed the fallback exists in 8 inline sites with 2 surviving escape hatches the original narrowing missed (a "genuinely trivial" exception duplicated across all 4 plan-* templates, and a "outside plan mode, output as prose and stop" branch in the preamble itself). Net deletion in skill text. Closes both branches of the deleted fallback (plan-file write AND prose-and-stop) and the trivial-fix exception with a single hard rule: If no AskUserQuestion variant appears in your tool list, this skill is BLOCKED. Stop, report `BLOCKED — AskUserQuestion unavailable`, and wait for the user. Honest about being a model directive, not a runtime guard — none of the PTY harness helpers enforce BLOCKED today. The architectural improvement is that the model has fewer alternatives to obey it against. Runtime enforcement is a follow-up TODO. Sources changed: - scripts/resolvers/preamble/generate-ask-user-format.ts: delete both fallback branches; replace with 1-line BLOCKED rule - scripts/resolvers/preamble/generate-completion-status.ts: delete fallback in generatePlanModeInfo - plan-eng-review/SKILL.md.tmpl: delete fallback at Step 0 + Sections 1-4 (5 instances) + delete trivial-fix exception - office-hours/SKILL.md.tmpl: delete fallback in approach-selection - plan-ceo-review/SKILL.md.tmpl: delete trivial-fix exception - plan-design-review/SKILL.md.tmpl: delete trivial-fix exception - plan-devex-review/SKILL.md.tmpl: delete trivial-fix exception Generated SKILL.md regen lands in a follow-up commit per the bisect convention (template changes separate from regenerated output). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md after fallback deletion Regenerates all 47 generated SKILL.md files (default + 7 host adapters) after the template/resolver edits in the prior commit. Pure mechanical output of `bun run gen:skill-docs`; no hand-edits. Verifies fallback deletion landed across the entire skill surface: - zero hits for "Decisions to confirm" in canonical SKILL.md / .tmpl - zero hits for "no AskUserQuestion variant is callable" - zero hits for "genuinely trivial" - BLOCKED rule present in 42 generated SKILL.md (every Tier-2+ skill) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): detect prose-rendered AskUserQuestion in plan mode When --disallowedTools AskUserQuestion is set and no MCP variant is callable, the model surfaces decisions as visible prose options ("A) ... B) ... C) ..." or "1. ... 2. ... 3. ...") rather than via the native numbered-prompt UI. isNumberedOptionListVisible doesn't catch these because the ❯ cursor sits on the empty input prompt rather than on option 1, so runPlanSkillObservation and runPlanSkillFloorCheck would time out at 5-10 minutes per test even though the model was correctly waiting for user input. This was exposed by the v1.28 fallback deletion: pre-deletion the model used the preamble fallback to silently auto-resolve to plan_ready in this scenario. Post-deletion the model correctly surfaces the question and waits, but the harness couldn't tell. isProseAUQVisible matches: - 2+ distinct lettered options at line starts (A/B/C/D form) - 3+ distinct numbered options at line starts WITHOUT a `❯ 1.` cursor (so it doesn't double-fire on native numbered prompts) Wired into: - classifyVisible (used by runPlanSkillObservation) → returns outcome='asked' instead of timeout - runPlanSkillFloorCheck → counts as auq_observed (floor met) 8 new unit tests in claude-pty-runner.unit.test.ts cover the lettered shape, numbered shape, threshold edges, native-cursor exclusion, and mid-prose false-positive guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): LLM judge for waiting-vs-working PTY state + snapshot logs Regex detectors (isNumberedOptionListVisible, isProseAUQVisible) are fast and free, but PTY rendering quirks fragment prose AUQ option lists across logical lines that no regex can reliably reassemble. When detection misses, polling loops time out at the full budget even though the model is correctly waiting for user input. Adds judgePtyState — a Haiku-graded trichotomy classifier: - waiting: agent surfaced a question/options, sitting at input prompt - working: spinner / tool calls / generation in progress - hung: stopped without surfacing anything (rare crash signal) Wired as a fallback into the polling loops of runPlanSkillObservation and runPlanSkillFloorCheck: after 60s with no regex hit, snapshot the TTY every 30s and call the judge. On 'waiting' verdict, return outcome=asked / auq_observed early. On 'working' or 'hung', enrich the eventual timeout summary with the verdict so failures are diagnosable. Implementation: - Spawns `claude -p --model claude-haiku-4-5 --max-turns 1` synchronously with prompt piped via stdin (subscription auth, no API key env required) - In-process cache keyed by SHA-1 of normalized last-4KB so identical spinner-frame snapshots don't re-charge - Best-effort JSONL log to ~/.gstack/analytics/pty-judge.jsonl with timestamp, testName, state, reasoning, hash, judge wall time - 30s timeout per call; returns state='unknown' with diagnostic on any failure mode (timeout, malformed JSON, missing claude binary) Snapshot logging: when GSTACK_PTY_LOG=1 is set, dump last 4KB of visible TTY at every judge tick to ~/.gstack/analytics/pty-snapshots/<test>- <elapsed>ms.txt — postmortem trail for debugging flakes. Cost: ~$0.0005 per call; ~10 calls per 5-min test budget; ~$0.005 per test added in worst case (only when regex detectors miss). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: accept prose-AUQ visible as third valid surface in plan-mode envelopes The first re-run after wiring the LLM judge revealed that the model also emits a third surface I hadn't anticipated: a properly-formatted question with options ("Pick A, B, or C in your reply") rendered as prose AND followed by ExitPlanMode (outcome=plan_ready). The migrated tests only accepted (## Decisions section) OR (BLOCKED string) — neither matched this case, so the test failed even though the user clearly saw the question. Three valid surfaces now: 1. `## Decisions to confirm` section in plan file (legacy fallback path, still valid through migration window) 2. `BLOCKED — AskUserQuestion` string in TTY (post-v1.28 BLOCKED rule) 3. Numbered/lettered options visible in TTY as prose (post-v1.28 prose rendering — uses the existing isProseAUQVisible detector) Also fixes assertReportAtBottomIfPlanWritten to be tolerant of: - Missing files (path detected from TTY but file not persisted) — was throwing ENOENT on plan_design_plan_mode and plan_ceo_plan_mode test 1 - 'asked' outcome (smoke test exited at first AUQ before the model reached the report-writing step) — was throwing on the 1 fail in the plan-eng-plan-mode --disallowedTools test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: drop GSTACK REVIEW REPORT contract from --disallowedTools migrations The plan-ceo / plan-design --disallowedTools migrated tests called assertReportAtBottomIfPlanWritten as the final assertion, but that contract is for full multi-section review completions. Under --disallowedTools AskUserQuestion the model can't run the full review (no AUQ tools to ask findings questions through), so it exits at Step 0 with either prose-AUQ rendering or the legacy decisions fallback. A plan file written in that mode WON'T have a GSTACK REVIEW REPORT section — the workflow never reached the report-writing step. The contract is still enforced by the periodic finding-count tests (skill-e2e-plan-{ceo,eng,design,devex}-finding-count.test.ts), which DO run the full review end-to-end and assert report-at-bottom there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): high-water-mark prose-AUQ tracking across polling iterations The autoplan E2E surfaces a brief prose-AUQ window (model emits options, waits ~30s for non-existent test responder, then resumes thinking) that the existing polling loop misses: by judge-tick time the buffer has moved into spinner state, so the LLM judge correctly reports 'working' and the loop times out at 5min. Adds two flags tracked across polling iterations: - proseAUQEverObserved: set true the first tick isProseAUQVisible returns true on the recent buffer - waitingEverObserved: set true on the first LLM judge 'waiting' verdict At timeout, if either flag is set, return outcome='asked' with a summary explaining the historical signal. The model DID surface the question — we just missed the live-state window. Snapshot logged with tag='prose-auq-surfaced' when GSTACK_PTY_LOG=1 for postmortem trace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: migrate plan-eng-plan-mode test 2 envelope to match other plan-mode tests The plan-ceo, plan-design, and autoplan plan-mode tests under --disallowedTools all moved to the same surface-visibility envelope (decisions section OR BLOCKED string OR prose-AUQ visible) and dropped the GSTACK REVIEW REPORT contract because the workflow can't complete without AUQ tools. plan-eng-plan-mode test 2 had been left on the old envelope and was the last failing test. This commit migrates it to match. Also lifts 'exited' out of the failure list and into a guarded path (acceptable when surface-visible). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): isProseAUQVisible — gate numbered path on tail, not full buffer The numbered-options branch of isProseAUQVisible deferred to isNumberedOptionListVisible whenever a `❯ 1.` cursor was visible in the full buffer. But the boot trust dialog (`❯ 1. Yes, trust`) lives in scrollback for the entire run, so this gate suppressed prose-numbered detection for any session that had the trust prompt at startup — i.e., every E2E run after the first user-trust acceptance. Fix: check only the last 4KB tail. Native-UI deferral applies when the cursor list is CURRENTLY rendered, not historically present in scrollback. Adds a regression test that puts the trust dialog in early scrollback + 5KB filler + a current prose-AUQ render, asserts true. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): isProseAUQVisible — lower numbered threshold to 2 (matches lettered) The 4KB tail window often contains only options 2-4 of a 4-option numbered prose AUQ because the model emits the question header + option 1 several KB earlier in the buffer. The threshold of 3 distinct numbered markers caused the detector to miss real prose AUQs whenever option 1 had scrolled out. Threshold 2 matches the lettered branch and is still tightly gated by: - Line-start anchoring (no false positives on inline `1.` references) - No-cursor gate (defers to native UI when ❯ 1. is currently rendered) - The 4KB tail window itself (prose-AUQ rendering happens at the end of the model's response, so options are clustered in the tail) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: expose high-water-mark flags through PlanSkillObservation The 2KB obs.evidence window often misses the prose-AUQ moment because ExitPlanMode UI ("Ready to execute" + numbered approve/reject prompt) pushes the model's earlier option list out of the tail by the time outcome=plan_ready fires. Tests checking "did the user see a question" need to consult historical state, not just the truncated final tail. Adds two optional fields to PlanSkillObservation: - proseAUQEverObserved: true if isProseAUQVisible was true at any tick - waitingEverObserved: true if the LLM judge ever returned 'waiting' The 4 plan-mode --disallowedTools tests now check these flags as part of the surfaceVisible computation: isProseAUQVisible(obs.evidence) || obs.proseAUQEverObserved === true blockedVisible || proseAUQVisible || obs.waitingEverObserved === true This catches the autoplan / plan-ceo / plan-eng case where the model surfaces options briefly, fails to get a response, then keeps thinking — eventually emitting ExitPlanMode and pushing options out of evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(plan-ceo): bump --disallowedTools test timeout to 10 min Last 5 runs showed the model under --disallowedTools spending the full 5-min budget in 'high effort thinking' before surfacing options. The LLM judge correctly reports state=working at every 30s tick, so the high-water-mark fallback never fires. 10-min budget gives the model 20 judge windows to eventually surface the question. Outer bun timeout bumped accordingly to 660s (inner +60s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(plan-ceo): pre-prime --disallowedTools test with concrete plan content Root cause of the persistent timeout: under --disallowedTools, the model can't fire the AUQ tool to ask "what should I review?" — it has to prose-render that question. Prose-rendering a 4-option choice requires the model to first enumerate every option, which spent the full 5min budget in 'high effort thinking' (8 consecutive 'state=working' verdicts from the LLM judge). Fix: pass initialPlanContent (already supported by runPlanSkillObservation) with a CEO-review-shaped seed plan (vague success metric, missing premise, scope creep smell). The model now has concrete material to critique on entry, bypasses the scope-deliberation loop, and moves directly to surfacing Step 0 / Section 1 findings — the actual behavior we want to regression-test. Reverted timeout from 600_000 back to 300_000 since the 5-min budget is plenty when the model has a real plan to work with. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: delete --disallowedTools AskUserQuestion-blocked test variants These tests simulated a fictional environment that doesn't exist in production. Real Conductor sessions launch claude with `--disallowedTools AskUserQuestion` AND register `mcp__conductor__AskUserQuestion` — the model has the MCP variant. But the tests passed `--disallowedTools` without standing up any MCP server, so they tested "model behavior with NO AUQ available," which no real user state produces. Combined with bare `/plan-ceo-review` invocation (no follow-up content), this forced the model into a 5+ minute deliberation loop trying to prose-render a question with options it had to first invent. The result was persistent flakes that consumed nine paid E2E runs trying to fix "the model takes too long" — but the actual problem was the test configuration, not the model. Removals: - test/skill-e2e-autoplan-auto-mode.test.ts (deleted; the entire file was a single AUQ-blocked test) - test/skill-e2e-plan-ceo-plan-mode.test.ts test 2 (the migrated --disallowedTools test); test 1 (baseline plan-mode smoke) stays - test/skill-e2e-plan-design-plan-mode.test.ts test 2 (same shape); test 1 stays - test/skill-e2e-plan-eng-plan-mode.test.ts test 2 (same shape); test 1 (baseline) and test 3 (STOP-gate with seeded plan, different contract) stay - test/helpers/touchfiles.ts: autoplan-auto-mode entry removed - test/touchfiles.test.ts: assertion count + commentary updated Coverage retained: test 1 of each plan-mode file already verifies the model fires AUQ; the periodic finding-count tests verify per-finding AUQ cadence end-to-end. The harness improvements landed during this debugging cycle (isProseAUQVisible regex, LLM judge, snapshot logging, high-water-mark tracking, ENOENT-tolerant assertReportAtBottomIfPlanWritten) all stay — they're useful for the remaining plan-mode tests that can also encounter prose rendering and slow-thinking phases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.31.0.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -297,6 +297,250 @@ export function isNumberedOptionListVisible(visible: string): boolean {
|
||||
return /❯\s*1\./.test(visible) && /(^|[^0-9])2\./.test(visible);
|
||||
}
|
||||
|
||||
// ────────────────────────────────────────────────────────────────────────────
|
||||
// LLM judge — "is the model waiting for user input, working, or hung?"
|
||||
//
|
||||
// Regex detectors (isNumberedOptionListVisible, isProseAUQVisible) are fast
|
||||
// and deterministic but brittle to PTY rendering quirks (cursor-positioning
|
||||
// escapes that collapse multi-line option lists onto a single logical line).
|
||||
// When they miss, the polling loop times out at the full budget — even
|
||||
// though the model is correctly surfacing a question via a format the regex
|
||||
// can't reassemble.
|
||||
//
|
||||
// This LLM judge takes a TTY snapshot and answers a trichotomy:
|
||||
// - 'waiting' — agent surfaced a question/options, sitting at input prompt
|
||||
// - 'working' — agent is still generating (spinner, tool calls, "Musing")
|
||||
// - 'hung' — agent stopped without surfacing anything (rare)
|
||||
//
|
||||
// Used by polling loops as a fallback after N seconds with no terminal
|
||||
// classification. On 'waiting' verdict, return outcome='asked' early.
|
||||
//
|
||||
// Cost: ~$0.0005 per call using claude haiku 4.5. Cached by snapshot hash so
|
||||
// identical TTY frames don't re-charge. All verdicts logged to
|
||||
// ~/.gstack/analytics/pty-judge.jsonl for offline analysis.
|
||||
// ────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
import { spawnSync as nodeSpawnSync } from 'node:child_process';
|
||||
import { createHash } from 'node:crypto';
|
||||
|
||||
export interface PtyStateVerdict {
|
||||
state: 'waiting' | 'working' | 'hung' | 'unknown';
|
||||
reasoning: string;
|
||||
/** SHA-1 of the normalized snapshot input (for caching/dedup). */
|
||||
hash: string;
|
||||
/** Wall time (ms) the judge call took. */
|
||||
elapsedMs: number;
|
||||
}
|
||||
|
||||
const PTY_VERDICT_CACHE = new Map<string, PtyStateVerdict>();
|
||||
|
||||
/**
|
||||
* Persist a verdict (or snapshot dump) to the analytics JSONL log.
|
||||
* Best-effort — failures (disk full, permission denied, etc.) are swallowed
|
||||
* so the harness never fails on logging.
|
||||
*/
|
||||
function logPtyJudge(record: Record<string, unknown>): void {
|
||||
try {
|
||||
const dir = `${process.env.HOME}/.gstack/analytics`;
|
||||
fs.mkdirSync(dir, { recursive: true });
|
||||
fs.appendFileSync(`${dir}/pty-judge.jsonl`, JSON.stringify(record) + '\n');
|
||||
} catch {
|
||||
/* best-effort */
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Snapshot dump for postmortem debugging when GSTACK_PTY_LOG=1.
|
||||
* Writes the last 4KB of visible TTY plus context to
|
||||
* ~/.gstack/analytics/pty-snapshots/<testName>-<elapsed>ms.txt.
|
||||
*/
|
||||
export function logPtySnapshot(visible: string, ctx: { testName: string; elapsedMs: number; tag?: string }): void {
|
||||
if (process.env.GSTACK_PTY_LOG !== '1') return;
|
||||
try {
|
||||
const dir = `${process.env.HOME}/.gstack/analytics/pty-snapshots`;
|
||||
fs.mkdirSync(dir, { recursive: true });
|
||||
const tag = ctx.tag ? `-${ctx.tag}` : '';
|
||||
const file = `${dir}/${ctx.testName}-${ctx.elapsedMs}ms${tag}.txt`;
|
||||
fs.writeFileSync(
|
||||
file,
|
||||
`# testName: ${ctx.testName}\n# elapsedMs: ${ctx.elapsedMs}\n# tag: ${ctx.tag ?? ''}\n# visible.length: ${visible.length}\n\n${visible.slice(-4096)}`,
|
||||
);
|
||||
} catch {
|
||||
/* best-effort */
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Ask Claude Haiku 4.5 to classify a TTY snapshot as waiting/working/hung.
|
||||
*
|
||||
* Implementation: spawns `claude -p --model claude-haiku-4-5` synchronously
|
||||
* with the prompt piped via stdin. Uses subscription auth (no API key env
|
||||
* required). 30-second timeout; returns 'unknown' on any failure mode
|
||||
* (timeout, malformed JSON, missing claude binary).
|
||||
*
|
||||
* Cache: identical snapshot hashes return the cached verdict without
|
||||
* re-calling. Cache lives in-process; resets between test runs.
|
||||
*/
|
||||
export function judgePtyState(
|
||||
visible: string,
|
||||
ctx?: { testName?: string },
|
||||
): PtyStateVerdict {
|
||||
// Normalize: strip trailing whitespace lines + take last 4KB. Hash the
|
||||
// normalized form so spinner-frame-only diffs (which all look "working")
|
||||
// don't bust the cache and rack up cost.
|
||||
const tail = visible.slice(-4096).replace(/[ \t]+$/gm, '');
|
||||
const hash = createHash('sha1').update(tail).digest('hex').slice(0, 16);
|
||||
|
||||
const cached = PTY_VERDICT_CACHE.get(hash);
|
||||
if (cached) return cached;
|
||||
|
||||
const judgeStart = Date.now();
|
||||
const prompt = `You are reading a snapshot of a terminal where Claude Code is running in plan mode for an automated test. Your job: classify the agent's current state.
|
||||
|
||||
Pick exactly ONE:
|
||||
- WAITING — agent surfaced a question or option list and is sitting at the input prompt waiting for user reply. Signs: numbered/lettered options visible (1./2./3. or A)/B)/C)), "Recommendation:" line, cursor at empty input prompt with no recent generation activity.
|
||||
- WORKING — agent is actively generating or running tools. Signs: spinner glyphs (✻ ✶ ✳ ✢ ✽), "Musing..." or "Churned for ..." text, recent tool-call blocks (Read/Edit/Bash/Grep), in-flight token output.
|
||||
- HUNG — agent has stopped without surfacing a question and without any spinner/work activity. Rare; usually means a crash.
|
||||
|
||||
Respond with strict JSON ONLY (no markdown fences, no prose):
|
||||
{"state":"waiting","reasoning":"one short sentence"}
|
||||
|
||||
Terminal snapshot (last 4KB):
|
||||
\`\`\`
|
||||
${tail}
|
||||
\`\`\``;
|
||||
|
||||
let verdict: PtyStateVerdict = {
|
||||
state: 'unknown',
|
||||
reasoning: 'judge call did not complete',
|
||||
hash,
|
||||
elapsedMs: 0,
|
||||
};
|
||||
|
||||
try {
|
||||
const result = nodeSpawnSync(
|
||||
'claude',
|
||||
['-p', '--model', 'claude-haiku-4-5', '--max-turns', '1'],
|
||||
{
|
||||
input: prompt,
|
||||
stdio: ['pipe', 'pipe', 'pipe'],
|
||||
timeout: 30_000,
|
||||
encoding: 'utf-8',
|
||||
},
|
||||
);
|
||||
const elapsedMs = Date.now() - judgeStart;
|
||||
if (result.status === 0 && result.stdout) {
|
||||
// Pull the first {...} JSON object out of stdout. Haiku occasionally
|
||||
// wraps in ```json ...``` despite the prompt; tolerate that.
|
||||
const match = result.stdout.match(/\{[\s\S]*?"state"[\s\S]*?\}/);
|
||||
if (match) {
|
||||
try {
|
||||
const parsed = JSON.parse(match[0]);
|
||||
const state = ['waiting', 'working', 'hung'].includes(parsed.state)
|
||||
? (parsed.state as 'waiting' | 'working' | 'hung')
|
||||
: 'unknown';
|
||||
verdict = {
|
||||
state,
|
||||
reasoning: typeof parsed.reasoning === 'string' ? parsed.reasoning.slice(0, 200) : '',
|
||||
hash,
|
||||
elapsedMs,
|
||||
};
|
||||
} catch {
|
||||
verdict = { state: 'unknown', reasoning: 'malformed JSON', hash, elapsedMs };
|
||||
}
|
||||
} else {
|
||||
verdict = { state: 'unknown', reasoning: 'no JSON in response', hash, elapsedMs };
|
||||
}
|
||||
} else {
|
||||
verdict = {
|
||||
state: 'unknown',
|
||||
reasoning: `claude exited ${result.status} (${(result.stderr ?? '').slice(0, 80)})`,
|
||||
hash,
|
||||
elapsedMs,
|
||||
};
|
||||
}
|
||||
} catch (err) {
|
||||
verdict = {
|
||||
state: 'unknown',
|
||||
reasoning: `judge spawn failed: ${(err as Error).message}`.slice(0, 200),
|
||||
hash,
|
||||
elapsedMs: Date.now() - judgeStart,
|
||||
};
|
||||
}
|
||||
|
||||
PTY_VERDICT_CACHE.set(hash, verdict);
|
||||
logPtyJudge({
|
||||
ts: new Date().toISOString(),
|
||||
testName: ctx?.testName ?? 'unknown',
|
||||
state: verdict.state,
|
||||
reasoning: verdict.reasoning,
|
||||
hash: verdict.hash,
|
||||
judgeMs: verdict.elapsedMs,
|
||||
});
|
||||
return verdict;
|
||||
}
|
||||
|
||||
/**
|
||||
* Detect a prose-rendered AskUserQuestion in plan mode.
|
||||
*
|
||||
* Plan-mode AUQs sometimes render as visible model output rather than via
|
||||
* the native numbered-prompt UI — e.g., when --disallowedTools AskUserQuestion
|
||||
* is set and no MCP variant is callable, the model surfaces the question as
|
||||
* lettered or numbered options in plain text. isNumberedOptionListVisible
|
||||
* doesn't catch these because the `❯` cursor sits on the empty input prompt,
|
||||
* not on option 1.
|
||||
*
|
||||
* Detection patterns:
|
||||
* - 2+ distinct lettered options (A) B) C) D)) at line starts — typical
|
||||
* for plan-eng / plan-design / plan-devex prose AUQ
|
||||
* - 3+ distinct numbered options (1. 2. 3.) at line starts WITHOUT a
|
||||
* `❯<spaces>1.` cursor — typical for autoplan / office-hours prose AUQ
|
||||
*
|
||||
* Used by classifyVisible and runPlanSkillFloorCheck to return outcome='asked'
|
||||
* (or auq_observed) instead of letting the harness time out when the model
|
||||
* is correctly surfacing the question and waiting for user input via prose.
|
||||
*
|
||||
* The 4KB tail window avoids matching stale options from earlier prompts in
|
||||
* scrollback. Permission dialogs are filtered out by the caller (see
|
||||
* isPermissionDialogVisible callers in classifyVisible).
|
||||
*/
|
||||
export function isProseAUQVisible(visible: string): boolean {
|
||||
const tail = visible.length > 4096 ? visible.slice(-4096) : visible;
|
||||
|
||||
// Pattern 1: 2+ distinct lettered options at line starts. Allow leading
|
||||
// whitespace or `❯` cursor before the marker. PTY may collapse multiple
|
||||
// option lines onto one logical line via stripped cursor-positioning
|
||||
// escapes, but the NEWLINE before each option survives.
|
||||
const letteredRe = /(?:^|\n)[ \t❯]*([A-D])\)/g;
|
||||
const letteredHits = new Set<string>();
|
||||
let lm: RegExpExecArray | null;
|
||||
while ((lm = letteredRe.exec(tail)) !== null) {
|
||||
if (lm[1]) letteredHits.add(lm[1]);
|
||||
}
|
||||
if (letteredHits.size >= 2) return true;
|
||||
|
||||
// Pattern 2: 2+ distinct numbered options at line starts, AND no
|
||||
// `❯<spaces>1.` cursor IN THE RECENT TAIL (not the full buffer — a
|
||||
// trust-dialog `❯ 1. Yes` at boot is in scrollback forever and
|
||||
// would otherwise suppress this path for the rest of the run).
|
||||
// The native-UI deferral only applies when the cursor list is
|
||||
// currently rendered, not historically.
|
||||
//
|
||||
// Threshold 2 (matching the lettered branch): the tail is a 4KB window,
|
||||
// and by the time the polling loop sees it, the model may have emitted
|
||||
// option 1 several KB earlier and only 2/3/4 remain in tail. False
|
||||
// positives on prose ("First, x. Second, y.") are extremely rare given
|
||||
// the line-start anchor + the no-cursor gate.
|
||||
if (/❯\s*1\./.test(tail)) return false;
|
||||
const numberedRe = /(?:^|\n)[ \t❯]*([1-9])\./g;
|
||||
const numberedHits = new Set<string>();
|
||||
let nm: RegExpExecArray | null;
|
||||
while ((nm = numberedRe.exec(tail)) !== null) {
|
||||
if (nm[1]) numberedHits.add(nm[1]);
|
||||
}
|
||||
return numberedHits.size >= 2;
|
||||
}
|
||||
|
||||
/**
|
||||
* Parse a rendered numbered-option list out of the visible TTY text.
|
||||
*
|
||||
@@ -570,6 +814,21 @@ export function classifyVisible(
|
||||
summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)',
|
||||
};
|
||||
}
|
||||
// Prose-rendered AUQ: model surfaced the question as lettered or numbered
|
||||
// options in plain text (typical under --disallowedTools AskUserQuestion
|
||||
// when no MCP variant is callable). The model is waiting for user input
|
||||
// via the plan-mode input prompt rather than via the AUQ tool UI; this
|
||||
// is still a legitimate "asked" surface — semantically equivalent to a
|
||||
// tool-call AUQ from the test's perspective.
|
||||
if (isProseAUQVisible(visible)) {
|
||||
if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
|
||||
return null;
|
||||
}
|
||||
return {
|
||||
outcome: 'asked',
|
||||
summary: 'skill rendered a prose-style AskUserQuestion (model waiting for user input)',
|
||||
};
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
@@ -784,9 +1043,22 @@ export function assertReviewReportAtBottom(
|
||||
* `'wrote_findings_before_asking'` when a plan was already written.
|
||||
*/
|
||||
export function assertReportAtBottomIfPlanWritten(
|
||||
obs: { planFile?: string; evidence: string },
|
||||
obs: { planFile?: string; evidence: string; outcome?: string },
|
||||
): void {
|
||||
if (!obs.planFile) return;
|
||||
// Skip when the plan file path was detected from TTY output but no file
|
||||
// exists on disk. This happens when the model mentions a path mid-stream
|
||||
// (e.g., as a tool-call argument that was interrupted, or in a draft that
|
||||
// was never persisted). The report-at-bottom contract is for fully-written
|
||||
// plan files; ENOENT means there's no file content to enforce against.
|
||||
if (!fs.existsSync(obs.planFile)) return;
|
||||
// Skip on 'asked' outcomes — these are smoke tests that exited at the
|
||||
// first AUQ render (Step 0 only). The model never reached the workflow's
|
||||
// report-writing step, so a partial plan file without the report section
|
||||
// is the expected mid-flight state, not a contract violation. The
|
||||
// report-at-bottom check applies to outcomes that imply the workflow
|
||||
// ran end-to-end (plan_ready, completion_summary, etc.).
|
||||
if (obs.outcome === 'asked') return;
|
||||
const content = fs.readFileSync(obs.planFile, 'utf-8');
|
||||
const verdict = assertReviewReportAtBottom(content);
|
||||
if (!verdict.ok) {
|
||||
@@ -1130,6 +1402,27 @@ export interface PlanSkillObservation {
|
||||
* the section, and that's the regression we want to catch.
|
||||
*/
|
||||
planFile?: string;
|
||||
/**
|
||||
* High-water-mark flag: did the polling loop ever observe a
|
||||
* prose-rendered AskUserQuestion (lettered or numbered options visible)
|
||||
* during the run? Set true the first poll iteration that
|
||||
* isProseAUQVisible returns true on the recent buffer; remains true
|
||||
* for the rest of the observation.
|
||||
*
|
||||
* The 2KB `evidence` window often misses the prose-AUQ moment because
|
||||
* by the time outcome=plan_ready fires, the ExitPlanMode "Ready to
|
||||
* execute" UI has pushed the options out of the tail. Tests that need
|
||||
* to assert "the user saw the question at SOME point" should check
|
||||
* this flag rather than re-running isProseAUQVisible on the truncated
|
||||
* evidence.
|
||||
*/
|
||||
proseAUQEverObserved?: boolean;
|
||||
/**
|
||||
* High-water-mark flag: did the LLM judge ever return state='waiting'
|
||||
* during the run? Same shape as proseAUQEverObserved but driven by the
|
||||
* Haiku judge fallback rather than the regex detector.
|
||||
*/
|
||||
waitingEverObserved?: boolean;
|
||||
}
|
||||
|
||||
/**
|
||||
@@ -1220,6 +1513,17 @@ export async function runPlanSkillObservation(opts: {
|
||||
|
||||
const budgetMs = opts.timeoutMs ?? 180_000;
|
||||
const start = Date.now();
|
||||
let lastJudgeAt = 0;
|
||||
let lastJudgeVerdict: PtyStateVerdict | null = null;
|
||||
// High-water marks: did we EVER see a prose-AUQ surface or a judge
|
||||
// 'waiting' verdict during the run? Models may surface options
|
||||
// briefly, then resume thinking when no user response comes (test
|
||||
// env has no responder). At timeout we trust historical signals
|
||||
// even if the current state is 'working'.
|
||||
let proseAUQEverObserved = false;
|
||||
let waitingEverObserved = false;
|
||||
const JUDGE_AFTER_MS = 60_000;
|
||||
const JUDGE_INTERVAL_MS = 30_000;
|
||||
while (Date.now() - start < budgetMs) {
|
||||
await Bun.sleep(2000);
|
||||
const visible = session.visibleSince(since);
|
||||
@@ -1240,6 +1544,18 @@ export async function runPlanSkillObservation(opts: {
|
||||
elapsedMs: Date.now() - startedAt,
|
||||
};
|
||||
}
|
||||
|
||||
// Cheap surface-tracking: did the model ever surface a prose AUQ in
|
||||
// this tick's recent buffer? Track once-true (high water).
|
||||
if (!proseAUQEverObserved && isProseAUQVisible(visible)) {
|
||||
proseAUQEverObserved = true;
|
||||
logPtySnapshot(visible, {
|
||||
testName: opts.skillName,
|
||||
elapsedMs: Date.now() - start,
|
||||
tag: 'prose-auq-surfaced',
|
||||
});
|
||||
}
|
||||
|
||||
const classified = classifyVisible(visible, {
|
||||
strictPlanWrites: !!opts.initialPlanContent,
|
||||
});
|
||||
@@ -1248,6 +1564,8 @@ export async function runPlanSkillObservation(opts: {
|
||||
...classified,
|
||||
evidence: visible.slice(-2000),
|
||||
elapsedMs: Date.now() - startedAt,
|
||||
proseAUQEverObserved,
|
||||
waitingEverObserved,
|
||||
};
|
||||
// Capture the plan file path on any outcome where one may have been
|
||||
// written. Gating only on 'plan_ready' missed two cases: (1) the
|
||||
@@ -1260,13 +1578,60 @@ export async function runPlanSkillObservation(opts: {
|
||||
if (planFile) obs.planFile = planFile;
|
||||
return obs;
|
||||
}
|
||||
|
||||
// LLM judge fallback: if regex detectors didn't classify and we've
|
||||
// burned >60s with periodic ticks, ask Haiku "is the model waiting,
|
||||
// working, or hung?" Treat 'waiting' as 'asked' (model surfaced a
|
||||
// question via prose the regex couldn't reassemble). Snapshot the
|
||||
// visible buffer at each judge call when GSTACK_PTY_LOG=1.
|
||||
const elapsed = Date.now() - start;
|
||||
if (elapsed > JUDGE_AFTER_MS && Date.now() - lastJudgeAt > JUDGE_INTERVAL_MS) {
|
||||
lastJudgeAt = Date.now();
|
||||
logPtySnapshot(visible, { testName: opts.skillName, elapsedMs: elapsed, tag: 'judge-tick' });
|
||||
lastJudgeVerdict = judgePtyState(visible, { testName: opts.skillName });
|
||||
if (lastJudgeVerdict.state === 'waiting') {
|
||||
waitingEverObserved = true;
|
||||
return {
|
||||
outcome: 'asked',
|
||||
summary: `LLM judge: ${lastJudgeVerdict.reasoning} (state=waiting after ${Math.round(elapsed / 1000)}s)`,
|
||||
evidence: visible.slice(-2000),
|
||||
elapsedMs: Date.now() - startedAt,
|
||||
};
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Timeout fallback: if we observed a prose-AUQ surface OR a judge
|
||||
// 'waiting' verdict at any point during the run, treat as 'asked'.
|
||||
// This catches the model-surfaced-then-resumed-thinking case where
|
||||
// by the time the timeout fires, the buffer has moved past the
|
||||
// options into spinner state but the question DID surface earlier.
|
||||
const finalVisible = session.visibleSince(since);
|
||||
if (proseAUQEverObserved || waitingEverObserved) {
|
||||
return {
|
||||
outcome: 'asked',
|
||||
summary:
|
||||
`prose-AUQ surface observed during run (proseAUQEverObserved=${proseAUQEverObserved}, waitingEverObserved=${waitingEverObserved}); model surfaced the question and the test budget elapsed without a follow-up classification` +
|
||||
(lastJudgeVerdict
|
||||
? ` (last LLM judge: ${lastJudgeVerdict.state} — ${lastJudgeVerdict.reasoning})`
|
||||
: ''),
|
||||
evidence: finalVisible.slice(-2000),
|
||||
elapsedMs: Date.now() - startedAt,
|
||||
proseAUQEverObserved,
|
||||
waitingEverObserved,
|
||||
};
|
||||
}
|
||||
return {
|
||||
outcome: 'timeout',
|
||||
summary: `no terminal outcome within ${budgetMs}ms`,
|
||||
evidence: session.visibleSince(since).slice(-2000),
|
||||
summary:
|
||||
`no terminal outcome within ${budgetMs}ms` +
|
||||
(lastJudgeVerdict
|
||||
? ` (last LLM judge: state=${lastJudgeVerdict.state} — ${lastJudgeVerdict.reasoning})`
|
||||
: ''),
|
||||
evidence: finalVisible.slice(-2000),
|
||||
elapsedMs: Date.now() - startedAt,
|
||||
proseAUQEverObserved,
|
||||
waitingEverObserved,
|
||||
};
|
||||
} finally {
|
||||
await session.close();
|
||||
@@ -1629,6 +1994,10 @@ export async function runPlanSkillFloorCheck(opts: {
|
||||
session.send(`${opts.followUpPrompt}\r`);
|
||||
|
||||
const start = Date.now();
|
||||
let lastJudgeAt = 0;
|
||||
let lastJudgeVerdict: PtyStateVerdict | null = null;
|
||||
const JUDGE_AFTER_MS = 60_000;
|
||||
const JUDGE_INTERVAL_MS = 30_000;
|
||||
while (Date.now() - start < timeoutMs) {
|
||||
await Bun.sleep(2000);
|
||||
const visible = session.visibleSince(since);
|
||||
@@ -1652,12 +2021,15 @@ export async function runPlanSkillFloorCheck(opts: {
|
||||
};
|
||||
}
|
||||
|
||||
// Success: ANY non-permission numbered-option list is an AUQ render.
|
||||
// The bug we're catching is "fired zero AUQs," so observing one is
|
||||
// sufficient — we don't need to fingerprint or navigate past it.
|
||||
// Success: ANY non-permission numbered-option list is an AUQ render —
|
||||
// either via the native numbered-prompt UI (isNumberedOptionListVisible)
|
||||
// OR via prose-rendered options under --disallowedTools when no MCP
|
||||
// variant is callable (isProseAUQVisible). Both surface the question
|
||||
// to the user; the bug we're catching is "fired zero AUQs."
|
||||
const tail = visible.slice(-TAIL_SCAN_BYTES);
|
||||
if (
|
||||
isNumberedOptionListVisible(visible) &&
|
||||
!isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))
|
||||
(isNumberedOptionListVisible(visible) || isProseAUQVisible(visible)) &&
|
||||
!isPermissionDialogVisible(tail)
|
||||
) {
|
||||
return {
|
||||
auqObserved: true,
|
||||
@@ -1668,6 +2040,28 @@ export async function runPlanSkillFloorCheck(opts: {
|
||||
};
|
||||
}
|
||||
|
||||
// LLM judge fallback: same shape as runPlanSkillObservation. After 60s
|
||||
// of polling without a regex hit, ask Haiku to classify the snapshot.
|
||||
// 'waiting' verdict counts as floor met (model surfaced a question via
|
||||
// prose the regex couldn't catch). 'working' / 'hung' / 'unknown' don't
|
||||
// change the outcome — they enrich the eventual timeout summary so the
|
||||
// failure diagnostic is more actionable than "no AUQ render."
|
||||
const elapsed = Date.now() - start;
|
||||
if (elapsed > JUDGE_AFTER_MS && Date.now() - lastJudgeAt > JUDGE_INTERVAL_MS) {
|
||||
lastJudgeAt = Date.now();
|
||||
logPtySnapshot(visible, { testName: opts.skillName, elapsedMs: elapsed, tag: 'floor-judge-tick' });
|
||||
lastJudgeVerdict = judgePtyState(visible, { testName: opts.skillName });
|
||||
if (lastJudgeVerdict.state === 'waiting') {
|
||||
return {
|
||||
auqObserved: true,
|
||||
outcome: 'auq_observed',
|
||||
summary: `LLM judge: ${lastJudgeVerdict.reasoning} (state=waiting after ${Math.round(elapsed / 1000)}s; floor met)`,
|
||||
evidence: visible.slice(-3000),
|
||||
elapsedMs: Date.now() - startedAt,
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
// Silent write outside sanctioned dirs is the transcript-bug shape.
|
||||
const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
|
||||
let m: RegExpExecArray | null;
|
||||
|
||||
@@ -26,6 +26,7 @@ import { describe, test, expect } from 'bun:test';
|
||||
import {
|
||||
isPermissionDialogVisible,
|
||||
isNumberedOptionListVisible,
|
||||
isProseAUQVisible,
|
||||
isPlanReadyVisible,
|
||||
parseNumberedOptions,
|
||||
classifyVisible,
|
||||
@@ -192,6 +193,105 @@ describe('isNumberedOptionListVisible', () => {
|
||||
});
|
||||
});
|
||||
|
||||
describe('isProseAUQVisible', () => {
|
||||
test('matches 4 lettered options A) B) C) D) at line starts (plan-eng prose AUQ shape)', () => {
|
||||
const sample = `
|
||||
What would you like me to review? Options:
|
||||
A) Point me at an existing design doc or plan file (path).
|
||||
B) Describe new work you're planning — I'll explore the codebase.
|
||||
C) You meant /review for the diff already on this branch.
|
||||
D) Something else (tell me).
|
||||
Recommendation: A if you have a doc in mind, otherwise B.
|
||||
❯
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(true);
|
||||
});
|
||||
|
||||
test('matches 2 lettered options (minimum threshold)', () => {
|
||||
const sample = `
|
||||
A) First option
|
||||
B) Second option
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(true);
|
||||
});
|
||||
|
||||
test('matches 3 numbered options 1. 2. 3. without ❯ 1. cursor (autoplan prose AUQ shape)', () => {
|
||||
const sample = `
|
||||
What's the task? A few options:
|
||||
1. You have a plan idea in mind — describe it.
|
||||
2. You want to review an existing plan elsewhere.
|
||||
3. You meant a different command — /plan-ceo-review etc.
|
||||
❯
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(true);
|
||||
});
|
||||
|
||||
test('returns false when ❯ 1. cursor is present in the recent tail (native UI handled by isNumberedOptionListVisible)', () => {
|
||||
const sample = `
|
||||
❯ 1. First option
|
||||
2. Second option
|
||||
3. Third option
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(false);
|
||||
});
|
||||
|
||||
test('does NOT suppress numbered-prose detection when ❯ 1. is only in early scrollback (trust dialog)', () => {
|
||||
// Boot trust dialog rendered ❯ 1. Yes at startup, then a long body of
|
||||
// model output, then prose-rendered numbered options now. The historic
|
||||
// ❯ 1. is in the full buffer but NOT in the recent tail. Should detect
|
||||
// the prose AUQ.
|
||||
const trustHeader = '❯ 1. Yes, trust\n 2. No\n';
|
||||
const filler = 'x'.repeat(5000); // pushes trust dialog out of last 4KB tail
|
||||
const proseAUQ = `\n 1. Review the docs\n 2. Investigate the code\n 3. Defer to next session\n❯ \n`;
|
||||
const sample = trustHeader + filler + proseAUQ;
|
||||
expect(isProseAUQVisible(sample)).toBe(true);
|
||||
});
|
||||
|
||||
test('returns false on single lettered option', () => {
|
||||
const sample = `
|
||||
A) Only one option mentioned in passing.
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(false);
|
||||
});
|
||||
|
||||
test('matches 2 numbered options (threshold matches lettered branch — tails miss option 1)', () => {
|
||||
const sample = `
|
||||
1. First note.
|
||||
2. Second note.
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(true);
|
||||
});
|
||||
|
||||
test('returns false on a single numbered option', () => {
|
||||
const sample = `
|
||||
1. Only one option mentioned.
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(false);
|
||||
});
|
||||
|
||||
test('does not match mid-prose lettered text like "(see option B) above"', () => {
|
||||
const sample = `
|
||||
This refers to (see option B) above and also to point A) earlier.
|
||||
`;
|
||||
// The B) and A) markers are mid-line, not at line starts, so they don't count.
|
||||
expect(isProseAUQVisible(sample)).toBe(false);
|
||||
});
|
||||
|
||||
test('matches with leading whitespace and ❯ prefix on options', () => {
|
||||
const sample = `
|
||||
A) Option with whitespace prefix
|
||||
❯ B) Option with cursor prefix
|
||||
C) Another option
|
||||
`;
|
||||
expect(isProseAUQVisible(sample)).toBe(true);
|
||||
});
|
||||
|
||||
test('returns false on plain text with no option markers', () => {
|
||||
expect(isProseAUQVisible('Just some plain text output from the model.')).toBe(false);
|
||||
expect(isProseAUQVisible('')).toBe(false);
|
||||
});
|
||||
});
|
||||
|
||||
describe('classifyVisible (runtime path through the runner classifier)', () => {
|
||||
// These tests call the actual classifier so a future contributor who
|
||||
// reorders branches (e.g. moves the permission short-circuit before
|
||||
|
||||
@@ -103,7 +103,6 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
|
||||
// INSIDE the existing 4 plan-X-review-plan-mode test files (covered
|
||||
// transitively by the entries above). Two new standalone files exist for
|
||||
// skills with no prior plan-mode test:
|
||||
'autoplan-auto-mode': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'office-hours-auto-mode': ['office-hours/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'office-hours-phase4-fork': ['office-hours/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/question-tuning.ts', 'test/helpers/llm-judge.ts', 'test/skill-e2e-office-hours-phase4.test.ts'],
|
||||
'llm-judge-recommendation': ['test/helpers/llm-judge.ts', 'test/llm-judge-recommendation.test.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'codex/SKILL.md.tmpl', 'scripts/resolvers/review.ts'],
|
||||
@@ -143,6 +142,13 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
|
||||
'plan-ceo-finding-floor': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-ceo-finding-floor.test.ts'],
|
||||
'plan-design-finding-floor': ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-design-finding-floor.test.ts'],
|
||||
'plan-devex-finding-floor': ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-devex-finding-floor.test.ts'],
|
||||
|
||||
// Multi-finding batching regression — periodic tier complement to the
|
||||
// gate-tier finding-floor. Catches the May 2026 transcript shape where
|
||||
// a model fires one AUQ then batches the rest into a "## Decisions to
|
||||
// confirm" plan write. runPlanSkillFloorCheck cannot detect that shape
|
||||
// (it exits on first AUQ); runPlanSkillCounting can.
|
||||
'plan-eng-multi-finding-batching': ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-eng-multi-finding-batching.test.ts'],
|
||||
'brain-privacy-gate': ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-artifacts-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'],
|
||||
|
||||
// /setup-gbrain Path 4 (Remote MCP) — happy + bad-token end-to-end via
|
||||
@@ -416,7 +422,6 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
|
||||
'plan-devex-review-plan-mode': 'gate',
|
||||
'plan-mode-no-op': 'gate',
|
||||
// v1.21+ auto-mode regression tests
|
||||
'autoplan-auto-mode': 'gate',
|
||||
'office-hours-auto-mode': 'gate',
|
||||
'auto-decide-preserved': 'periodic',
|
||||
'e2e-harness-audit': 'gate',
|
||||
@@ -443,6 +448,7 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
|
||||
'plan-ceo-finding-floor': 'gate',
|
||||
'plan-design-finding-floor': 'gate',
|
||||
'plan-devex-finding-floor': 'gate',
|
||||
'plan-eng-multi-finding-batching': 'periodic',
|
||||
|
||||
// Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call,
|
||||
// costs ~$0.30-$0.50 per run, not needed on every commit)
|
||||
|
||||
Reference in New Issue
Block a user