mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-08 13:39:45 +08:00
v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement (#1313)
* fix(plan-eng-review): tighten STOP gates with anti-rationalization clause
Five sites in SKILL.md.tmpl uplift to the office-hours b512be71 pattern:
the four review-section gates (Architecture, Code Quality, Test, Performance)
plus the Step 0 complexity-check trigger. Adds tool_use reminder ("call the
tool directly"), names blocked next steps explicitly, anti-rationalization
clause naming the precise failure mode (loading the schema via ToolSearch
and writing the recommendation as chat prose).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(test/helpers): initialPlanContent + wrote_findings_before_asking + shared report-at-bottom assertion
Three additions to claude-pty-runner.ts:
1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a
user message containing the seeded plan before invoking the skill, with
a 3s gap so the message renders before the slash command. claude has no
--plan-file flag (verified via claude --help), so message-pump is the
route. Lets STOP-gate regression tests force complexity findings.
2. ClassifyResult gains wrote_findings_before_asking with companion
strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/
Edit to .claude/plans/* precedes any AskUserQuestion render in the
session window. Default off — preserves zero-findings → write plan →
plan_ready as legitimate for unseeded smokes. Six new unit tests cover
before/after-AUQ ordering, permission-dialog edge case, strict-off path.
3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing
assertReviewReportAtBottom(content) and gates on obs.planFile (artifact
existing), so the assertion fires under both 'asked' and 'plan_ready'
when a plan was actually written.
Also: runPlanSkillObservation now captures obs.planFile on every classifier
outcome, not just 'plan_ready'. Catches the case where the skill wrote a
plan partway through then paused on a question.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: wire assertReportAtBottomIfPlanWritten into 4 plan-mode E2E tests + add seeded-plan STOP-gate case
Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts
that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last
## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract;
nothing tested it until now.
Plan-eng additionally gains a third test case: STOP gate fires when seeded
plan forces Step 0 findings. Combines the new initialPlanContent runner
option with --disallowedTools AskUserQuestion to force the Conductor
MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in
{wrote_findings_before_asking, auto_decided, silent_write, exited, timeout}
and that plan_ready outcomes carry a ## Decisions section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(touchfiles): delete duplicate plan-design-review-plan-mode keys
Verified duplicates in test/helpers/touchfiles.ts:
- E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps)
AND line 243 (smaller deps); JS object literals: later wins.
- E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same
later-wins rule.
Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings
ran on every PR; design ran weekly only.
Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and
line 399 ('gate'). Also extend the four plan-mode-test entries to include
scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}}
trigger all four siblings in bun run eval:select.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v1.26.2.0)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: tighten CHANGELOG voice for v1.26.2.0
Move contributor-flavored bullet (runPlanSkillObservation seeding) into
For contributors. Drop branch-internal narrative (Codex review pass,
plan iteration tracking) per CHANGELOG-for-users style.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
50
CHANGELOG.md
50
CHANGELOG.md
@@ -1,5 +1,55 @@
|
||||
# Changelog
|
||||
|
||||
## [1.26.2.0] - 2026-05-03
|
||||
|
||||
## **`/plan-eng-review` always asks. Never silently writes findings to your plan first.**
|
||||
|
||||
Plan-mode review skills now have a hard STOP gate before any AskUserQuestion. The bug
|
||||
this closes: a `/plan-eng-review` session would do Step 0 scope challenge, find real
|
||||
issues, write the findings into the plan file as prose, then call `ExitPlanMode` —
|
||||
never invoking AskUserQuestion. The user only saw "ready to execute" with the model's
|
||||
opinions already baked in. The tool to surface the question existed, the prompt
|
||||
told the model to use it, and the model still routed around it.
|
||||
|
||||
Five sites in `plan-eng-review/SKILL.md.tmpl` now use the office-hours `b512be71`
|
||||
pattern verbatim: "the AskUserQuestion call is a tool_use, not prose — call the
|
||||
tool directly," named blockers ("do not edit the plan file, do not call
|
||||
ExitPlanMode"), and an anti-rationalization clause ("loading the schema via
|
||||
ToolSearch and writing the recommendation as chat prose is the failure mode this
|
||||
gate exists to prevent"). The four review-section gates (Architecture, Code
|
||||
Quality, Test, Performance) and the Step 0 complexity-check trigger all use the
|
||||
same language.
|
||||
|
||||
### What you can now do
|
||||
|
||||
- **Trust that any plan-* review skill that produces a plan file ends with the review report.** All four plan-mode E2E tests (`plan-eng`, `plan-ceo`, `plan-design`, `plan-devex`) now assert `## GSTACK REVIEW REPORT` is the last `## ` section of the plan file whenever one was written. The `{{PLAN_FILE_REVIEW_REPORT}}` resolver mandated this contract; nothing tested it until now.
|
||||
- **Catch the "writes findings to plan as prose before asking" failure mode.** New `wrote_findings_before_asking` classifier outcome fires when a `Write`/`Edit` to `.claude/plans/*` precedes any AskUserQuestion render in the session window. Opt-in via `strictPlanWrites: true` so existing tests where zero-findings → write plan → plan_ready stays legitimate.
|
||||
- **Run `plan-design-review-plan-mode` on PR CI again.** The touchfiles entry was duplicated — `plan-design-review-plan-mode` appeared at line 94 (gate, full deps) and line 243 (smaller deps). JS object literals: later wins. The effective tier was `periodic`, not `gate`. Three of four plan-mode siblings ran on every PR; design didn't.
|
||||
|
||||
### Itemized changes
|
||||
|
||||
#### Added
|
||||
|
||||
- `runPlanSkillObservation`'s `initialPlanContent?: string` option. Pre-pumps a user message containing the seeded plan before invoking the skill, with a 3s gap so the message renders before the slash command.
|
||||
- `ClassifyResult` outcome `wrote_findings_before_asking` with companion `strictPlanWrites?` opt on `classifyVisible`. Six new unit tests in `claude-pty-runner.unit.test.ts` cover before/after-AUQ ordering plus the strict-off legacy path.
|
||||
- Shared test helper `assertReportAtBottomIfPlanWritten(obs)` in `claude-pty-runner.ts`. Wraps the existing `assertReviewReportAtBottom(content)` and gates on `obs.planFile` (artifact existing), so the assertion fires under `'asked'` and `'plan_ready'` both — wherever a plan file was actually written.
|
||||
- New seeded-plan test case in `skill-e2e-plan-eng-plan-mode.test.ts`: `STOP gate fires when seeded plan forces Step 0 findings`. Combines `initialPlanContent` + `--disallowedTools AskUserQuestion` to force the Conductor MCP-variant path through `mcp__*__AskUserQuestion`.
|
||||
|
||||
#### Changed
|
||||
|
||||
- `plan-eng-review/SKILL.md.tmpl` lines 116, 139, 152, 160, 169 ported from soft "STOP." prose to the office-hours pattern. Adds tool_use reminder, names blocked next steps explicitly, anti-rationalization clause.
|
||||
- `runPlanSkillObservation` now captures `obs.planFile` on every classifier outcome (was: only `'plan_ready'`). Catches the case where the skill wrote a plan partway through then paused on a question.
|
||||
|
||||
#### Fixed
|
||||
|
||||
- `test/helpers/touchfiles.ts` duplicate `plan-design-review-plan-mode` keys deleted (line 243 in `E2E_TOUCHFILES`, line 524 in `E2E_TIERS`). Effective tier is now `gate` again, matching the other three siblings.
|
||||
- `scripts/resolvers/review.ts` added to all four plan-mode-test touchfiles entries so changes to the `{{PLAN_FILE_REVIEW_REPORT}}` resolver text trigger all four sibling tests in `bun run eval:select`.
|
||||
|
||||
#### For contributors
|
||||
|
||||
- 6 new classifier unit tests in `test/helpers/claude-pty-runner.unit.test.ts` (70 → 76).
|
||||
- New `initialPlanContent?: string` option on `runPlanSkillObservation` for seeding a draft plan into a test run before invoking the skill. Lets STOP-gate regression tests pre-pump guaranteed-finding-triggering complexity (8+ files, custom-vs-builtin smell) so the skill has something concrete to react to.
|
||||
|
||||
## [1.26.1.0] - 2026-05-03
|
||||
|
||||
## **`gstack-gbrain-sync` ships host-agnostic. Curated artifacts push from Claude Code, Codex CLI, or a dev workspace — same orchestrator, same install, same result.**
|
||||
|
||||
Reference in New Issue
Block a user