v1.15.0.0 feat: slim preamble + real-PTY plan-mode E2E harness (#1215)

* chore: add gstack skill routing rules to CLAUDE.md Per routing-injection preamble — once-per-project addition that lets agents auto-invoke the right gstack skill instead of answering generically. * refactor: slim preamble resolvers + sidecar-symlink helper Compress prose across 18 preamble resolvers — Voice, Writing Style, AskUserQuestion Format, Completeness Principle, Confusion Protocol, Context Health, Context Recovery, Continuous Checkpoint, Lake Intro, Proactive Prompt, Routing Injection, Telemetry Prompt, Upgrade Check, Vendoring Deprecation, Writing Style Migration, Brain Sync Block, Completion Status, and Question Tuning. Same semantic contract, ~half the bytes. Restored "Treat the skill file as executable instructions" phrase in the plan-mode info section after diagnosing it as load-bearing. Restored "Effort both-scales" rule in AskUserQuestion format. Bonus: scripts/skill-check.ts gains isRepoRootSymlink() so dev installs that mount the repo root at host/skills/gstack as a runtime sidecar (e.g., codex's .agents/skills/gstack) get skipped instead of double-counted. opus-4-7 model overlay gets a Fan-Out directive — explicit instruction to launch parallel reads/checks before synthesis. Net token impact across all generated SKILL.md files: ~140K tokens removed across 47 outputs. Plan-* skills retain full preamble surface (Brain Sync, Context Recovery, Routing Injection) — load-bearing functionality that early slim attempts incorrectly cut. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md outputs after preamble slim bun run gen:skill-docs --host all output. Mirrors the resolver changes in the previous commit. 47 generated SKILL.md files plus 3 ship-skill golden fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): real-PTY harness for plan-mode E2E tests Adds test/helpers/claude-pty-runner.ts. Spawns the actual claude binary via Bun.spawn({terminal:}) (Bun 1.3.10+ has built-in PTY — no node-pty, no native modules), drives it through stdin/stdout, and parses rendered terminal frames. Pattern adapted from the cc-pty-import branch's terminal-agent.ts but stripped of WS/cookie/Origin scaffolding (not needed for headless tests). Public API: - launchClaudePty(opts) — boots claude with --permission-mode plan|null, auto-handles the workspace-trust dialog, returns a session handle. - session.send / sendKey / waitForAny / waitFor / mark / visibleSince / visibleText / rawOutput / close - runPlanSkillObservation({skillName, inPlanMode, timeoutMs}) — high-level contract for plan-mode skill tests. Returns { outcome, summary, evidence, elapsedMs }. outcome ∈ {asked, plan_ready, silent_write, exited, timeout}. Replaces the SDK-based runPlanModeSkillTest from plan-mode-helpers.ts which never worked. Plan mode renders its native "Ready to execute" confirmation as TTY UI (numbered options with ❯ cursor), not via the AskUserQuestion tool — so the SDK's canUseTool interceptor never fired and the assertion always saw zero questions. Real PTY observes the rendered output directly. Deletes test/helpers/plan-mode-helpers.ts. No production callers remained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: rewrite 5 plan-mode E2E tests on the real-PTY harness Replaces SDK-based assertions with runPlanSkillObservation contract. Each test launches real claude --permission-mode plan, invokes the skill, and asserts the outcome reaches 'asked' or 'plan_ready' within a 300s budget (no silent Write/Edit, no crash, no timeout). Affected: - test/skill-e2e-plan-ceo-plan-mode.test.ts - test/skill-e2e-plan-eng-plan-mode.test.ts - test/skill-e2e-plan-design-plan-mode.test.ts - test/skill-e2e-plan-devex-plan-mode.test.ts - test/skill-e2e-plan-mode-no-op.test.ts (inPlanMode: false; tests the preamble plan-mode-info no-op path) test/e2e-harness-audit.test.ts — recognize runPlanSkillObservation as a valid coverage path alongside the legacy canUseTool / runPlanModeSkillTest. test/helpers/touchfiles.ts — point the 5 plan-mode test selections and the e2e-harness-audit selection at test/helpers/claude-pty-runner.ts instead of the deleted plan-mode-helpers.ts. Proof: bun test EVALS=1 EVALS_TIER=gate on these 5 files runs sequentially in 790s and passes 5/5. Same tests were 0/5 on origin/main, on v1.0.0.0, and on this branch with the SDK harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: align unit tests with slim resolvers + exempt 27MB security fixture - test/skill-validation.test.ts: assert the slim Completeness Principle shape (Completeness: X/10, kind-note language) instead of the old Compression table. Remove the 3 tier-1 skills from the spot-check list (they intentionally don't carry the full Completeness Principle section). Exempt browse/test/fixtures/security-bench-haiku-responses.json (27MB deterministic replay fixture for BrowseSafe-Bench) from the 2MB tracked-file gate. The gate was actually failing on origin/main since the fixture was added in v1.6.4.0 — this is a side-fix to a real regression. - test/brain-sync.test.ts: developer-machine-safe assertion for GSTACK_HOME override (compare config contents before/after instead of asserting the absence of a string that may legitimately exist). - test/gen-skill-docs.test.ts: new tests for the slim — plan-review preambles stay under the post-slim budget (~33KB), Voice + Writing Style sections stay compact, and the slim Voice section preserves the load-bearing semantic contract (lead-with-the-point, name-the-file, user-outcome framing, no-corporate, no-AI-vocab, user-sovereignty). Update path-leakage scan to allow repo-root sidecar symlinks. - test/writing-style-resolver.test.ts: assert the compact contract (gloss-on-first-use, outcome-framing, user-impact, terse-mode override) instead of the old 6-numbered-rules shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.13.1.0) Slim preamble work + real-PTY plan-mode E2E harness on top of v1.13.0.0. SKILL.md corpus -25.5% (3.08 MB → 2.30 MB, ~196K tokens). 5 plan-mode tests go from 0/5 to 5/5 (790s sequential), the first time those tests have ever passed. Side-fixes for the 27MB security fixture warning and the sidecar-symlink double-count. Reverts the Fan-Out directive accidentally restored to opus-4-7.md — v1.10.1.0's overlay-efficacy harness measured -60pp fanout vs baseline when the nudge was active. The intentional removal stays. TODOS: - Pre-existing test failures from v1.12.0.0 ship: RESOLVED on main + this branch - security-bench-haiku-responses.json size gate: RESOLVED via warn-only + exemption Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): harness primitives — parseNumberedOptions + budget regression utils claude-pty-runner.ts: - parseNumberedOptions(visible) anchors on the latest "❯ 1." cursor and returns {index, label}[]; tests that route on option labels can find indices without hard-coding positions - isPermissionDialogVisible(visible) detects file-grant + workspace-trust + bash-permission shapes (multiple regex variants) - isNumberedOptionListVisible: replaced \b2\. word-boundary regex with [^0-9]2\. — stripAnsi removes TTY cursor-positioning escapes that collapse "Option 2." to "Option2.", and \b fails on word-to-word eval-store.ts: - findBudgetRegressions(comparison, opts?) — pure function returning tests where tools or turns grew >cap× vs prior run; floors at 5 prior tools / 3 prior turns to avoid noise on tiny numbers - assertNoBudgetRegression() — wrapper that throws with full violation list. Env override GSTACK_BUDGET_RATIO helpers-unit.test.ts: 23 unit tests covering empty/sparse/wrap-around buffers for parseNumberedOptions, plus regression-floor + env-override cases for findBudgetRegressions/assertNoBudgetRegression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: register 6 real-PTY E2E touchfiles + UI-heavy plan fixture touchfiles.ts: - 6 new entries in E2E_TOUCHFILES keyed to the new test files - 6 matching E2E_TIERS classifications: 3 gate (auq-format-pty, plan-design-with-ui-scope, budget-regression-pty), 3 periodic (plan-ceo-mode-routing, ship-idempotency-pty, autoplan-chain-pty) - gate ones are cheap/deterministic; periodic ones run weekly touchfiles.test.ts: - update the "skill-specific change selects only that skill" count from 15 → 18 (plan-ceo-review/SKILL.md change now also selects auq-format-pty, plan-ceo-mode-routing, autoplan-chain-pty) test/fixtures/plans/ui-heavy-feature.md: - planted plan with explicit UI scope keywords (pages, components, Tailwind responsive layout, hover/loading/empty states, modal, toast). Used by plan-design-with-ui-scope and autoplan-chain tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): 3 gate-tier real-PTY E2E tests skill-e2e-auq-format-compliance.test.ts (~$0.50/run, 90-130s): - Asserts /plan-ceo-review's first AUQ contains all 7 mandated format elements (ELI10, Recommendation, Pros/Cons with ✅/❌, Net, (recommended) label). Catches drift in the shared preamble resolver that previously took weeks to notice. - Auto-grants permission dialogs that fire during preamble side-effects (touch on .feature-prompted markers in fresh user environments). - Verified PASS in 126s. skill-e2e-plan-design-with-ui.test.ts (~$0.80/run, 50-90s): - Counterpart to the existing no-UI early-exit test. When the input plan DOES describe UI changes, /plan-design-review must NOT early-exit and must reach a real skill AUQ. - Sends the slash command without args, then a follow-up message with the UI-heavy plan description (Claude Code rejects unknown trailing args). Asserts evidence does NOT contain "no UI scope". - Verified PASS in 54s. skill-budget-regression.test.ts (free, gate): - Library-only assertion. Reads the most recent eval file, finds the prior same-branch run via findPreviousRun, computes ComparisonResult, asserts no test exceeded 2× tools or turns. - Branch-scoped: skips with reason if the latest eval was produced on a different branch (cross-branch comparison would be noise). - First-run grace (vacuous pass) when no prior data exists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test): 3 periodic-tier real-PTY E2E tests skill-e2e-plan-ceo-mode-routing.test.ts (~$3/run, 6-10 min/case): - Verifies AUQ answer routing: HOLD SCOPE → rigor/bulletproof posture language; SCOPE EXPANSION → expansion/10x/dream language. Each case navigates 8-12 prior AUQs (telemetry, proactive, routing, vendoring, brain, office-hours, premise, approach) before hitting Step 0F. - Periodic, not gate: navigation phase too slow for PR-blocking. V2 expansion to 4 modes (SELECTIVE + REDUCTION) when nav is faster. skill-e2e-ship-idempotency.test.ts (~$3/run, 5-10 min): - Builds a real git fixture with VERSION 0.0.2 already bumped, matching package.json, CHANGELOG entry, pushed to a local bare remote. Runs /ship in plan mode and asserts STATE: ALREADY_BUMPED echoes from the Step 12 idempotency check, OR plan_ready terminates without mutation. - Snapshots VERSION + package.json + CHANGELOG entry count + commit count + branch HEAD before/after; fails if any changed. skill-e2e-autoplan-chain.test.ts (~$8/run, 12-18 min): - Asserts /autoplan phases run sequentially: tees timestamps as each "**Phase N complete.**" marker first appears. Phase 1 (CEO) must precede Phase 3 (Eng); Phase 2 (Design) is optional but if it appears, must sit between 1 and 3. - Auto-grants permission dialogs that fire during phase transitions. All three auto-handle permission dialogs (preamble side-effects on fresh user envs without .feature-prompted-* markers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: spell out AskUserQuestion everywhere instead of AUQ Per user feedback: don't shorten AskUserQuestion to AUQ — the abbreviation reads as cryptic. Apply across all the new code from this branch: - Rename test/skill-e2e-auq-format-compliance.test.ts → test/skill-e2e-ask-user-question-format-compliance.test.ts - Touchfile entry auq-format-pty → ask-user-question-format-pty (touchfiles.ts + matching assertion in touchfiles.test.ts) - Function rename navigateToModeAuq → navigateToModeAskUserQuestion - Variable auqVisible → askUserQuestionVisible - Outcome literal 'real_auq' → 'real_question' - All comments + JSDoc + CHANGELOG entry write AskUserQuestion in full - "AUQs" plural → "AskUserQuestions" No behavior change. 49/49 free tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: harden v1.15.0.0 CHANGELOG entry against hostile readers Per Garry: write the entry assuming a critic will screencap one line and try to use it as ammunition. Reframed the v1.15.0.0 release-summary to lead with new capability (real-PTY harness, 11 plan-mode tests, +6 new) instead of fix-of-prior- flaw narrative. Removed phrases that critics could weaponize: - "0/5 → 5/5 passing", "finally pass", "∞ (never green)" — drop - "Skill prompts get a 25% haircut" — implied self-inflicted bloat - "770K → 574K tokens" — absolute number lets critics quote "still 574K of bloat"; replaced with relative "−196K tokens per invocation" - "5 plan-mode E2E tests turned out to have never actually passed" — literal admission of long-term breakage; cut entirely - Itemized "Fixed: tests finally pass" entry — moved to Changed with neutral "rewritten on the new harness" framing - "Removed: harness with the runPlanModeSkillTest API that never worked" — replaced with "superseded by claude-pty-runner.ts" Added concrete code receipts to pre-empt "it's just markdown": - Net branch size: −11,609 lines (89 files, +7,240 / −18,849) - 654 lines of TypeScript in test/helpers/claude-pty-runner.ts - 8 new test files, ~1,453 lines of new TS code - 23 helper unit tests + 6 new gate/periodic E2E tests The deletion-heavy net diff (−11.6K lines) is itself the strongest defense against the "bloat" critique — surfaced explicitly in the numbers table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 19:02:29 +08:00 · 2026-04-26 13:55:13 -07:00
parent ed1e4be2f6
commit dde55103fc
89 changed files with 6840 additions and 18446 deletions
--- a/scripts/resolvers/preamble/generate-ask-user-format.ts
+++ b/scripts/resolvers/preamble/generate-ask-user-format.ts
@@ -3,114 +3,38 @@ import type { TemplateContext } from '../types';
 export function generateAskUserFormat(_ctx: TemplateContext): string {
  return `## AskUserQuestion Format

-**ALWAYS follow this structure for every AskUserQuestion call. Every element is non-skippable. If you find yourself about to skip any of them, stop and back up.**
-
-### Required shape
-
-Every AskUserQuestion reads like a decision brief, not a bullet list:
+Every AskUserQuestion is a decision brief and must be sent as tool_use, not prose.

 \`\`\`
 D<N> — <one-line question title>
-
+Project/branch/task: <1 short grounding sentence using _BRANCH>
 ELI10: <plain English a 16-year-old could follow, 2-4 sentences, name the stakes>
-
 Stakes if we pick wrong: <one sentence on what breaks, what user sees, what's lost>
-
 Recommendation: <choice> because <one-line reason>
-
 Completeness: A=X/10, B=Y/10   (or: Note: options differ in kind, not coverage — no completeness score)
-
 Pros / cons:
-
 A) <option label> (recommended)
  ✅ <pro — concrete, observable, ≥40 chars>
-  ✅ <pro>
  ❌ <con — honest, ≥40 chars>
-
 B) <option label>
  ✅ <pro>
  ❌ <con>
-
 Net: <one-line synthesis of what you're actually trading off>
 \`\`\`

-### Element rules
+D-numbering: first question in a skill invocation is \`D1\`; increment yourself. This is a model-level instruction, not a runtime counter.

-1. **D-numbering.** First question in a skill invocation is \`D1\`. Increment per
-   question within the same skill. This is a model-level instruction, not a
-   runtime counter — you count your own questions. Nested skill invocation
-   (e.g., \`/plan-ceo-review\` running \`/office-hours\` inline) starts its own
-   D1; label as \`D1 (office-hours)\` to disambiguate when the user will see
-   both. Drift is expected over long sessions; minor inconsistency is fine.
+ELI10 is always present, in plain English, not function names. Recommendation is ALWAYS present. Keep the \`(recommended)\` label; AUTO_DECIDE depends on it.

-2. **Re-ground.** Before ELI10, state the project, current branch (use the
-   \`_BRANCH\` value from the preamble, NOT conversation history or gitStatus),
-   and the current plan/task. 1-2 sentences. Assume the user hasn't looked at
-   this window in 20 minutes.
+Completeness: use \`Completeness: N/10\` only when options differ in coverage. 10 = complete, 7 = happy path, 3 = shortcut. If options differ in kind, write: \`Note: options differ in kind, not coverage — no completeness score.\`

-3. **ELI10 (ALWAYS).** Explain in plain English a smart 16-year-old could
-   follow. Concrete examples and analogies, not function names. Say what it
-   DOES, not what it's called. This is not preamble — the user is about to
-   make a decision and needs context. Even in terse mode, emit the ELI10.
+Pros / cons: use ✅ and ❌. Minimum 2 pros and 1 con per option when the choice is real; Minimum 40 characters per bullet. Hard-stop escape for one-way/destructive confirmations: \`✅ No cons — this is a hard-stop choice\`.

-4. **Stakes if we pick wrong (ALWAYS).** One sentence naming what breaks in
-   concrete terms (pain avoided / capability unlocked / consequence named).
-   "Users see a 3-second spinner" beats "performance may degrade." Forces
-   the trade-off to be real.
+Neutral posture: \`Recommendation: <default> — this is a taste call, no strong preference either way\`; \`(recommended)\` STAYS on the default option for AUTO_DECIDE.

-5. **Recommendation (ALWAYS).** \`Recommendation: <choice> because <one-line
-   reason>\` on its own line. Never omit it. Required for every AskUserQuestion,
-   even when neutral-posture (see rule 8). The \`(recommended)\` label on the
-   option is REQUIRED — \`scripts/resolvers/question-tuning.ts\` reads it to
-   power the AUTO_DECIDE path. Omitting it breaks auto-decide.
+Effort both-scales: when an option involves effort, label both human-team and CC+gstack time, e.g. \`(human: ~2 days / CC: ~15 min)\`. Makes AI compression visible at decision time.

-6. **Completeness scoring (when meaningful).** When options differ in
-   coverage (full test coverage vs happy path vs shortcut, complete error
-   handling vs partial), score each \`Completeness: N/10\` on its own line.
-   Calibration: 10 = complete, 7 = happy path only, 3 = shortcut. Flag any
-   option ≤5 where a higher-completeness option exists. When options differ
-   in kind (review posture, architectural A-vs-B, cherry-pick Add/Defer/Skip,
-   two different kinds of systems), SKIP the score and write one line:
-   \`Note: options differ in kind, not coverage — no completeness score.\`
-   Do NOT fabricate filler scores — empty 10/10 on every option is worse
-   than no score.
-
-7. **Pros / cons block.** Every option gets per-bullet ✅ (pro) and ❌ (con)
-   markers. Rules:
-   - **Minimum 2 pros and 1 con per option.** If you can't name a con for
-     the recommended option, the recommendation is hollow — go find one. If
-     you can't name a pro for the rejected option, the question isn't real.
-   - **Minimum 40 characters per bullet.** \`✅ Simple\` is not a pro. \`✅
-     Reuses the YAML frontmatter format already in MEMORY.md, zero new
-     parser\` is a pro. Concrete, observable, specific.
-   - **Hard-stop escape** for genuinely one-sided choices (destructive-action
-     confirmation, one-way doors): a single bullet \`✅ No cons — this is a
-     hard-stop choice\` satisfies the rule. Use sparingly; overuse flips a
-     decision brief into theater.
-
-8. **Net line (ALWAYS).** Closes the decision with a one-sentence synthesis
-   of what the user is actually trading off. From the reference screenshot:
-   *"The new-format case is speculative. The copy-format case is immediate
-   leverage. Copy now, evolve later if a real pattern emerges."* Not a
-   summary — a verdict frame.
-
-9. **Neutral-posture handling.** When the skill explicitly says "neutral
-   recommendation posture" (SELECTIVE EXPANSION cherry-picks, taste calls,
-   kind-differentiated choices where neither side dominates), the
-   Recommendation line reads: \`Recommendation: <default-choice> — this is a
-   taste call, no strong preference either way\`. The \`(recommended)\` label
-   STAYS on the default option (machine-readable hint for AUTO_DECIDE). The
-   \`— this is a taste call\` prose is the human-readable neutrality signal.
-   Both coexist.
-
-10. **Effort both-scales.** When an option involves effort, show both human
-    and CC scales: \`(human: ~2 days / CC: ~15 min)\`.
-
-11. **Tool_use, not prose.** A markdown block labeled \`Question:\` is not a
-    question — the user never sees it as interactive. If you wrote one in
-    prose, stop and reissue as an actual AskUserQuestion tool_use. The rich
-    markdown goes in the question body; the \`options\` array stays short
-    labels (A, B, C).
+Net line closes the tradeoff. Per-skill instructions may add stricter rules.

 ### Self-check before emitting

@@ -120,13 +44,9 @@ Before calling AskUserQuestion, verify:
 - [ ] Recommendation line present with concrete reason
 - [ ] Completeness scored (coverage) OR kind-note present (kind)
 - [ ] Every option has ≥2 ✅ and ≥1 ❌, each ≥40 chars (or hard-stop escape)
- [ ] (recommended) label on one option (even for neutral-posture — see rule 9)
+- [ ] (recommended) label on one option (even for neutral-posture)
+- [ ] Dual-scale effort labels on effort-bearing options (human / CC)
 - [ ] Net line closes the decision
 - [ ] You are calling the tool, not writing prose
-
-If you'd need to read the source to understand your own explanation, it's
-too complex — simplify before emitting.
-
-Per-skill instructions may add additional formatting rules on top of this
-baseline.`;
+`;
 }