gstack

hai/gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-08 21:49:45 +08:00

Author	SHA1	Message	Date
Garry Tan	7b4738bca0	v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills (#1354 ) * feat(test/helpers): runPlanSkillFloorCheck — minimal AskUserQuestion-floor observer Adds a focused PTY observer that exits at the first non-permission numbered-option render. Catches the May 2026 transcript-bug class (model wrote plan + ExitPlanMode without firing any AUQ) without needing to fingerprint or navigate past the AUQ. Why separate from runPlanSkillCounting: plan-mode AUQs render every option on a single logical line via cursor-positioning escapes that stripAnsi can't simulate, so parseNumberedOptions returns < 2 options and never records a fingerprint. Counting tests work on 25-min budgets because eventually one frame parses cleanly; gate-tier floor tests need to exit early on the first observation. Trades fingerprint precision for early-exit reliability. Also drops COMPLETION_SUMMARY_RE check from this helper — it matches "GSTACK REVIEW REPORT" anywhere in the buffer including when the agent does recon by reading existing plan files. plan_ready (claude's actual "Ready to execute" confirmation) is the reliable terminal signal for "agent finished without asking." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(resolvers): generateAntiShortcutClause shared resolver Adds {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver function in scripts/resolvers/review.ts. Plan-* review skills can now include the clause via one placeholder line in their .tmpl rather than cloning the paragraph four times. Future tightening edits one resolver, all four skills update on next gen-skill-docs. Wired into the existing RESOLVERS map alongside generateReviewDashboard and generatePlanFileReviewReport — no gen-skill-docs.ts change needed because the generator already does generic placeholder substitution against that map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(plan--review): anti-shortcut clause in all four review skills Inserts {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the Anti-skip rule:* paragraph in plan-{eng,ceo,design,devex}-review SKILL.md.tmpl. The four templates use different surrounding section headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex variants), so anchoring on the paragraph rather than the heading works across all four. Closes the May 2026 transcript-bug loophole: existing STOP gates name forbidden actions only AFTER a per-section finding is identified. The anti-shortcut clause adds the pre-emptive rule — "the plan file is the OUTPUT of the interactive review, not a substitute for it" — covering the case the transcript exhibited (skip per-section walk, dump every finding into one plan write, call ExitPlanMode). Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: gate-tier AskUserQuestion floor tests for all plan-* review skills Adds 4 finding-floor tests (one per plan-* skill) that catch the May 2026 transcript-bug class — model wrote a plan and called ExitPlanMode without firing any review-phase AskUserQuestion. Asserts via runPlanSkillFloorCheck that ANY non-permission AUQ render fires before the agent reaches plan_ready. Verified: - Eng floor: passed in 59s - CEO floor: passed in 197s - Design floor: passed - Devex floor: passed - Total ~$2-6 per CI run; only triggers on diff against the 4 plan-* templates, the shared resolver review.ts, the seeds fixture, or the PTY runner helper. Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant per skill. Each seed is engineered to force at least one obvious finding under that skill's review focus (architectural smell for eng, scope-creep for ceo, UI-slop for design, painful onboarding for devex). Touchfiles wiring: - E2E_TOUCHFILES: 4 plan--finding-floor entries with deps on the matching skill template, the shared resolver, the seeds fixture, and the PTY runner helper - E2E_TIERS: all 4 entries marked 'gate' - touchfiles.test.ts: count assertion bumped 21→22 with explicit plan-ceo-finding-floor containment check Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore: bump version and changelog (v1.27.1.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 20:27:20 -07:00
Garry Tan	9e244c0bed	v1.11.1.0 fix: plan-mode handshake + canUseTool test harness (#1182 ) * feat: plan-mode handshake for interactive review skills Add a preamble-level STOP-Ask handshake that fires when the user invokes any of the 4 interactive review skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review) while their Claude Code session is in plan mode. Without this gate, plan mode's "this supercedes any other instructions" system-reminder outranked the skills' interactive STOP gates and the skills silently wrote plan files without any per-finding AskUserQuestion. The handshake offers 2 options (exit-and-rerun, cancel) — the original third "stay and batch" option was dropped after two independent reviewers flagged it as a silent bypass of the skills' anti-skip rule. Architecture decisions (CEO+Eng review): - Preamble-level resolver, not per-template injection (Codex finding #2) - Position 1 in preamble composition: after bash block (_SESSION_ID live), before onboarding AskUserQuestion gates (so fresh-install users see the handshake first, not drowned in telemetry/proactive/routing prompts) - Generator-only `interactive: true` frontmatter flag, following the `preamble-tier` precedent (no host-config frontmatter allowlist edits) - Host-scoped to Claude via `ctx.host === 'claude'` check inside the resolver (simpler than `suppressedResolvers` which only gates `{{}}` placeholders) - One-way-door classification in scripts/question-registry.ts for all 4 skills so question-tuning `never-ask` preferences can't suppress the gate - Synchronous telemetry write to ~/.gstack/analytics/skill-usage.jsonl on handshake fire (captures A-exit and C-cancel outcomes that terminate the skill before end-of-run telemetry runs) Also adds an explicit STOP block to plan-ceo-review Step 0C-bis so the approach-selection question can't silently skip to mode selection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: extend agent-sdk-runner with canUseTool for AskUserQuestion interception Test harness at test/helpers/agent-sdk-runner.ts gains an optional `canUseTool` callback parameter. When a test supplies it, the harness flips `permissionMode` from `bypassPermissions` (overlay-harness default) to `default` so the SDK actually invokes the callback on every tool use, and auto-adds `AskUserQuestion` to `allowedTools` so Claude can fire it at all. Exports a `passThroughNonAskUserQuestion` helper so tests that only want to intercept AskUserQuestion can auto-allow every other tool with one line: `return passThroughNonAskUserQuestion(toolName, input)`. This is the foundation for D14 — every future interactive-skill E2E test can now assert on AskUserQuestion shape and routing. Previous E2E tests at `test/skill-e2e.test.ts` explicitly instructed the model to skip AskUserQuestion ("non-interactive run") which meant no test could actually verify the question content or routing. 6 new unit tests in test/agent-sdk-runner.test.ts cover: - permissionMode flips to 'default' when canUseTool supplied - permissionMode stays 'bypassPermissions' when canUseTool absent - canUseTool callback reaches the SDK options - AskUserQuestion auto-added to allowedTools when canUseTool supplied - AskUserQuestion NOT added when canUseTool absent - passThroughNonAskUserQuestion helper returns allow+updatedInput Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test: plan-mode handshake E2E coverage and unit assertions Adds 6 E2E test files and 8 new unit assertions to verify the plan-mode handshake works end-to-end and stays correct under regeneration. E2E tests (gate-tier, paid, EVALS=1 EVALS_TIER=gate): - test/skill-e2e-plan-ceo-plan-mode.test.ts — handshake fires before any Write/Edit when plan-mode distinctive phrase is present; 2-option shape (Exit/Cancel); option A routes to ExitPlanMode cleanly - test/skill-e2e-plan-eng-plan-mode.test.ts — same contract for plan-eng - test/skill-e2e-plan-design-plan-mode.test.ts — same contract for plan-design; exercises C-cancel branch instead of A-exit - test/skill-e2e-plan-devex-plan-mode.test.ts — same contract for plan-devex - test/skill-e2e-plan-mode-no-op.test.ts — negative regression: handshake must NOT fire when distinctive phrase is absent; skill proceeds normally through Step 0 (REGRESSION RULE guardrail against breaking existing interactive-review sessions) - test/e2e-harness-audit.test.ts — free unit test asserting every `interactive: true` skill has at least one canUseTool-using test file (prevents future drift where a skill opts in without coverage) Shared helper test/helpers/plan-mode-handshake-helpers.ts centralizes the canUseTool interceptor + distinctive-phrase injection so the 4 sibling E2E tests are thin wiring (~20 LOC each) and can't drift out of sync. Unit assertions added to test/gen-skill-docs.test.ts: - handshake section present in all 4 Claude-generated SKILL.md files - handshake section absent from non-interactive Claude skills (ship, review, qa, office-hours, codex, retro, cso) - handshake section absent from non-Claude host outputs (.agents, etc.) - 0C-bis STOP block present in plan-ceo-review/SKILL.md at correct position (between the "Present these approach options" line and "### 0D-prelude" header) - handshake resolver wired BEFORE generateUpgradeCheck in preamble composition order 6 new gate-tier entries added to test/helpers/touchfiles.ts so any change to the handshake resolver, preamble composition, skill templates, question registry, one-way-door classifier, or agent-sdk-runner fires the relevant E2E tests. test/touchfiles.test.ts updated for the new selection count (plan-ceo-review/** now triggers 15 tests, up from 8). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(v1.11.1.0): VERSION bump + CHANGELOG entry + TODOS follow-ups Bumps from main's v1.11.0.0 to v1.11.1.0 (PATCH — bug-fix release, no new user-facing artifacts). CHANGELOG entry covers the plan-mode handshake, agent-sdk-runner canUseTool extension, and the 2 follow-up TODOs. CHANGELOG order: v1.11.1.0 (this) → v1.11.0.0 (workspace-aware ship, merged from main) → v1.10.1.0 (overlay efficacy harness). No duplicate headers. Syncs package.json version to match VERSION per the Step 12 idempotency invariant (both files must agree or /ship halts). TODOS.md: - Preserves the Testing/security-bench-haiku-responses P1 added on main - Adds P1 "Structural STOP-Ask forcing function" — broader class of the bug this release fixes - Adds P2 "Apply interactive: true to non-review skills (office-hours, codex, investigate, qa, retro, cso)" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-24 00:04:53 -07:00
Garry Tan	a81be53621	v1.10.0.0: fix AskUserQuestion cadence + Pros/Cons format upgrade (#1178 ) * fix(preamble): reorder AskUserQuestion Format above model overlay + rewrite Opus 4.7 pacing directive Root cause of plan-review regression (v1.6.4.0): model overlays rendered ABOVE the pacing rule in every SKILL.md, so Opus 4.7 read "Batch your questions" first and absorbed it as the ambient default. The overlay's claimed subordination ("skill wins on pacing, always") didn't stick — literal-interpretation mode reads physical order, not claimed hierarchy. Part 1 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md): scripts/resolvers/preamble.ts - Move generateAskUserFormat above generateModelOverlay in section array - Comment explains why — prevents future refactors from silently reverting model-overlays/opus-4-7.md - Replace "Batch your questions" block with "Pace questions to the skill" - New wording makes one-question-per-turn the default when the skill contains STOP directives; batching becomes the explicit exception Regenerated 30 SKILL.md files via bun run gen:skill-docs. Verified: - With --model opus-4-7: Format renders at line 359, Model-Specific Patch at 373, "Pace questions" at 419 (Format comes first, overlay second, pacing directive intact). - bun test passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(plan-reviews): tighten STOP/escape-hatch directives across 4 templates Part 2 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md). Codex caught that v1.6.3.0's reasoning collapsed on Opus 4.7: the old escape-hatch wording ("If no issues or fix is obvious, state what you'll do and move on — don't waste a question") let the literal interpreter classify every finding as having an "obvious fix" and skip AskUserQuestion entirely. Reviews became reports. Per-template hardening (16 sites total, verified by rg): plan-ceo-review/SKILL.md.tmpl (13 sites): - 12 inline STOP directives: replace the full escape-hatch clause with "zero findings → say so and proceed; findings → MUST call AskUserQuestion as a tool_use, including for obvious fixes." - 1 Escape hatch bullet in CRITICAL RULE section: tightened. plan-eng-review, plan-design-review, plan-devex-review (1 site each): - Each template's Escape hatch bullet tightened to match the new CEO wording, adapted for each review's domain (issue/gap, decision/design/DX alternatives). After regeneration: rg "don't waste a question" returns 0 across all SKILL.md.tmpl and SKILL.md files. "zero findings, state" wording present 16 times (matches prior count of escape-hatch sites). bun test passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(preamble): upgrade AskUserQuestion format to Pros/Cons decision brief Part 4 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md). Every AskUserQuestion now renders as a decision brief, not a bullet list: D-numbered header, ELI10, Stakes-if-we-pick-wrong, Recommendation, Pros/Cons with ✅/❌ markers per option, closing Net: tradeoff synthesis. scripts/resolvers/preamble/generate-ask-user-format.ts - Full rewrite. Preserves prior rules (Re-ground, ELI10, Recommend, Completeness, Options) and adds: - D-numbering per skill invocation (model-level, not runtime state) - Stakes line (pain avoided / capability unlocked / consequence named) - Pros/Cons block with min 2 ✅ + 1 ❌ per option, min 40 chars/bullet - Hard-stop escape: "✅ No cons — this is a hard-stop choice" for genuine one-sided choices (destructive-action confirmations) - Neutral-posture handling (CT1-compliant): (recommended) label STAYS on default option to preserve AUTO_DECIDE contract; neutrality expressed as prose in Recommendation line only - Net line closes the decision with a one-sentence tradeoff frame - Rule 11: tool_use mandate (prose "Question:" blocks don't count) - Self-check list before emitting test/skill-validation.test.ts - Update format assertions to check for new Pros/Cons tokens (Pros / cons:, Recommendation: <choice>, Net:, ELI10, Stakes if we pick wrong:, ✅, ❌) across all tier-2+ skills - Old "RECOMMENDATION: Choose" expectation removed (the new format uses mixed-case "Recommendation:" with no literal "Choose") test/skill-e2e-plan-format.test.ts - Add v1.7.0.0 format token regexes (PROS_CONS_HEADER_RE, PRO_BULLET_RE, CON_BULLET_RE, NET_LINE_RE, D_NUMBER_RE, STAKES_RE) - Existing RECOMMENDATION_RE loosened to accept mixed-case "Recommendation:" (canonical v1.7.0.0 form) alongside all-caps (legacy). Tests are additive — the strict new-format gate is the upcoming cadence eval. Regenerated 30 SKILL.md files via bun run gen:skill-docs. Verified: - bun test: 319 pass (1 pre-existing security-bench fixture oversize failure on main, unrelated — confirmed via git stash test on main HEAD) - New format tokens render in all tier-2+ skills (plan-ceo-review, plan-eng-review, ship, office-hours verified) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md). Gate-tier (E1, free, runs on every `bun test`): test/preamble-compose.test.ts — pins the composition order Asserts AskUserQuestion Format section renders BEFORE Model-Specific Behavioral Patch in tier-≥2 preamble output. Covers claude default, opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit to scripts/resolvers/preamble.ts that silently reverts the order. test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract 14 assertions against generateAskUserFormat output: D<N>, ELI10, Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:, ✅/❌ markers, min 2 pros + 1 con rules, hard-stop escape exact phrase, neutral-posture CT1 rule ((recommended) label preserved for AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate (rule 11), self-check list, D-numbering model-level caveat. test/model-overlay-opus-4-7.test.ts — pins the pacing directive Asserts raw overlay file + resolved overlay output contain "Pace questions to the skill" and NOT "Batch your questions". Verifies INHERIT:claude chain still works (Todo-list, subordination wrapper), Fan out / Effort-match / Literal interpretation nudges preserved. Also asserts claude base overlay does NOT carry the Opus-specific pacing directive (no cross-contamination). Periodic-tier (E2, Opus-dependent, ~$1-2/run): test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness 1. Format positive — every token present when plan has real tradeoff 2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to "No cons — hard-stop choice" escape 3. Neutral-posture NEGATIVE — plan where one option dominates must emit (recommended) label + "because <reason>", must NOT dodge to "taste call" / "no preference" 4. Hard-stop POSITIVE — destructive-action plan may legitimately use the hard-stop escape test/helpers/touchfiles.ts — entries for all new eval cases Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and the 4 plan-review templates. Diff-based selection triggers the evals whenever those files change. Also added entries for 7 expanded-coverage cases (ship, office-hours, investigate, qa, review, design-review, document-release) — test cases will land in follow-up PRs per skill. Follow-ups noted in test file header: - True multi-turn cadence eval (3 findings → 3 distinct asks) — current harness captures one $OUT_FILE per session; multi-turn capture needs new harness support. - Expanded-coverage test cases for the 7 non-plan-review skills. Verified: - bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench oversize failure on main (unrelated, unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: regenerate golden fixtures + update ELI10 phrase check for v1.7.0.0 Pros/Cons format rewrite (`6b99df9d`) changed the resolver output across all tier-2+ SKILL.md files. Three golden-file regression tests in test/host-config.test.ts and one phrase-check test in test/gen-skill-docs.test.ts were failing as expected. - test/fixtures/golden/claude-ship-SKILL.md - test/fixtures/golden/codex-ship-SKILL.md - test/fixtures/golden/factory-ship-SKILL.md Regenerated via `bun run gen:skill-docs --host all` + cp into fixtures. - test/gen-skill-docs.test.ts line 244: rename test from "ELI16 simplification rules" to "ELI10 simplification rules" and match the new phrase pattern. v1.7.0.0 uses "ELI10 (ALWAYS)" rather than legacy "Simplify (ELI10, ALWAYS)". bun test: 744 pass, 1 fail (pre-existing security-bench fixture oversize, unrelated to this branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v1.7.0.0: plan reviews walk you through each issue with Pros/Cons Restores AskUserQuestion cadence on Opus 4.7 (v1.6.4.0 regression) and upgrades the format to a numbered decision brief — D<N> header, ELI10, Stakes, Recommendation, per-option ✅/❌ bullets, Net: closing line. Fix: composition reorder + overlay rewrite + 16-site escape-hatch hardening across the 4 plan-review templates. Feature: Pros/Cons format in the preamble resolver, inherited by every tier-2+ skill automatically. 30 new gate-tier unit tests pin the format contract (runs in <100ms, $0). 4 new periodic-tier eval cases defend against escape-hatch abuse (2 positive, 2 negative). Golden fixtures regenerated. CEO + Eng + Codex reviews completed. 5 of 8 Codex findings incorporated; CT2 (16 sites, not 31) and CT1 (AUTO_DECIDE contract break) were load-bearing catches the primary reviews missed. bun test: 774 pass, 1 fail (pre-existing security-bench oversize, unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v1.10.0.0: bump VERSION (was v1.7.0.0, align with branch discipline) Per user direction — jumping to 1.10.0.0 for versioning alignment. No functional changes from the prior ship commit (`5f038ab7`). The regression fix + Pros/Cons format are identical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:25:34 -07:00
Garry Tan	b805aa0113	feat: Confusion Protocol, Hermes + GBrain hosts, brain-first resolver (v0.18.0.0) (#1005 ) * feat: add Confusion Protocol to preamble resolver Injects a high-stakes ambiguity gate at preamble tier >= 2 so all workflow skills get it. Fires when Claude encounters architectural decisions, data model changes, destructive operations, or contradictory requirements. Does NOT fire on routine coding. Addresses Karpathy failure mode #1 (wrong assumptions) with an inline STOP gate instead of relying on workflow skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Hermes and GBrain host configs Hermes: tool rewrites for terminal/read_file/patch/delegate_task, paths to ~/.hermes/skills/gstack, AGENTS.md config file. GBrain: coding skills become brain-aware when GBrain mod is installed. Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP). GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain host, enabling brain-first lookup and save-to-brain behavior. Both registered in hosts/index.ts with setup script redirect messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver — brain-first lookup and save-to-brain New scripts/resolvers/gbrain.ts with two resolver functions: - GBRAIN_CONTEXT_LOAD: search brain for context before skill starts - GBRAIN_SAVE_RESULTS: save skill output to brain after completion Placeholders added to 4 thinking skill templates (office-hours, investigate, plan-ceo-review, retro). Resolves to empty string on all hosts except gbrain via suppressedResolvers. GBRAIN suppression added to all 9 non-gbrain host configs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: wire slop:diff into /review as advisory diagnostic Adds Step 3.5 to the review template: runs bun run slop:diff against the base branch to catch AI code quality issues (empty catches, redundant return await, overcomplicated abstractions). Advisory only, never blocking. Skips silently if slop-scan is not installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Karpathy compatibility note to README Positions gstack as the workflow enforcement layer for Karpathy-style CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills. Maps each Karpathy failure mode to the gstack skill that addresses it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: improve native OpenClaw thinking skills office-hours: add design doc path visibility message after writing ceo-review: add HARD GATE reminder at review section transitions retro: add non-git context support (check memory for meeting notes) Mirrors template improvements to hand-crafted native skills. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: update tests and golden fixtures for new hosts - Host count: 8 → 10 (hermes, gbrain) - OpenClaw adapter test: expects undefined (dead code removed) - Golden ship fixtures: updated with Confusion Protocol + vendoring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files Regenerated from templates after Confusion Protocol, GBrain resolver placeholders, slop:diff in review, HARD GATE reminders, investigation learnings, design doc visibility, and retro non-git context changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.18.0.0 - CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain, slop in review, Karpathy note, skill improvements) - CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing - README.md: update agent count 8→10, add Hermes + GBrain to table - VERSION: bump to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: sync package.json version to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: extract Step 0 from review SKILL.md in E2E test The review-base-branch E2E test was copying the full 1493-line review/SKILL.md into the test fixture. The agent spent 8+ turns reading it in chunks, leaving only 7 turns for actual work, causing error_max_turns on every attempt. Now extracts only Step 0 (base branch detection, ~50 lines) which is all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy a full SKILL.md file into an E2E test fixture." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: update GBrain and Hermes host configs for v0.10.0 integration GBrain: add 'triggers' to keepFields so generated skills pass checkResolvable() validation. Add version compat comment. Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS. The resolvers handle GBrain-not-installed gracefully, so Hermes agents with GBrain as a mod get brain features automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver DX improvements and preamble health check Resolver changes: - gbrain query → gbrain search (fast keyword search, not expensive hybrid) - Add keyword extraction guidance for agents - Show explicit gbrain put_page syntax with --title, --tags, heredoc - Add entity enrichment with false-positive filter - Name throttle error patterns (exit code 1, stderr keywords) - Add data-research routing for investigate skill - Expand skillSaveMap from 4 to 8 entries - Add brain operation telemetry summary Preamble changes: - Add gbrain doctor --fast --json health check for gbrain/hermes hosts - Parse check failures/warnings count - Show failing check details when score < 50 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: preserve keepFields in allowlist frontmatter mode The allowlist mode hard-coded name + description reconstruction but never iterated keepFields for additional fields. Adding 'triggers' to keepFields was a no-op because the field was silently stripped. Now iterates keepFields and preserves any field beyond name/description from the source template frontmatter, including YAML arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add triggers to all 38 skill templates Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md router. Each skill gets 3-6 triggers derived from its "Use when asked to..." description text. Avoids single generic words that would collide across skills (e.g., "debug this" not "debug"). These are distinct from voice-triggers (speech-to-text aliases) and serve GBrain's checkResolvable() validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files and update golden fixtures Regenerated from updated templates (triggers, brain placeholders, resolver DX improvements, preamble health check). Golden fixtures updated to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: settings-hook remove exits 1 when nothing to remove gstack-settings-hook remove was exiting 0 when settings.json didn't exist, causing gstack-uninstall to report "SessionStart hook" as removed on clean systems where nothing was installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for GBrain v0.10.0 integration ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS to resolver table. CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration details (triggers, expanded brain-awareness, DX improvements, Hermes brain support), updated date. CLAUDE.md: added gbrain to resolvers/ directory comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: routing E2E stops writing to user's ~/.claude/skills/ installSkills() was copying SKILL.md files to both project-level (.claude/skills/ in tmpDir) and user-level (~/.claude/skills/). Writing to the user's real install fails when symlinks point to different worktrees or dangling targets (ENOENT on copyFileSync). Now installs to project-level only. The test already sets cwd to the tmpDir, so project-level discovery works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: scale Gemini E2E back to smoke test Gemini CLI gets lost in worktrees on complex tasks (review times out at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack skill execution. Replace the two failing tests (gemini-discover-skill and gemini-review-findings) with a single smoke test that verifies Gemini can start and read the README. 90s timeout, no skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 10:41:38 -07:00
Garry Tan	31943b2f02	feat: anti-skip rule for all review skills (v0.15.6.1) (#804 ) * feat: anti-skip rule for all review skills Review skills sometimes skip sections when reviewing strategy or spec plans. This adds an explicit anti-skip rule to CEO (1-11), eng (1-4), design (1-7), and DX (1-8) review skills. Also fixes CEO header from "10 sections" to "11 sections" to match actual count. * chore: bump version and changelog (v0.15.6.1) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-04 21:22:40 -07:00
Garry Tan	447851452a	feat: interactive /plan-devex-review + plan mode skill fix (v0.15.5.0) (#796 ) * fix: skill invocation during plan mode takes precedence over generic plan mode Adds a "Skill Invocation During Plan Mode" section to the preamble resolver so all generated SKILL.md files include it. Fixes a bug where Claude treats loaded skill content as reference material instead of executable instructions, and keeps trying to ExitPlanMode instead of following the skill workflow step by step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: interactive /plan-devex-review with persona, benchmarks, and forcing questions Complete rewrite of the DX review skill to match CEO/eng review depth. New flow: investigate (persona, empathy, competitors, magical moment, journey tracing) then force decisions, then score with evidence. Three modes: DX EXPANSION, DX POLISH, DX TRIAGE. 20-45 interactive STOP points vs 10-12 before. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: autoplan DX POLISH mode + review log schema for new devex fields Adds mode selection, persona, competitive, and magical moment override rules to autoplan Phase 3.5. Documents new review log fields (mode, persona, competitive_tier) in the plan-file-review-report schema. Syncs package.json version to VERSION. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.15.5.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 14:36:23 -07:00
Garry Tan	be96ff5ce7	feat: /plan-devex-review + /devex-review — DX review skills (v0.15.3.0) (#784 ) * feat: add DX framework resolver for shared principles and scoring rubric New {{DX_FRAMEWORK}} resolver provides compact (~150 lines) shared content for /plan-devex-review and /devex-review: Addy Osmani's 8 DX principles, 7 characteristics table, 10 cognitive patterns, scoring rubric, and TTHW benchmarks. Hall of Fame examples loaded on-demand per pass to avoid bloat. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add DX Review row to review dashboard Adds plan-devex-review and devex-review schema entries to the review dashboard resolver and placeholder table in the preamble. All existing SKILL.md files regenerated to include the new DX Review row. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /plan-devex-review skill — DX plan review with Osmani framework Plan-stage developer experience review. Rates 8 DX dimensions 0-10: getting started, API/CLI/SDK design, error messages, docs, upgrade path, dev environment, community, and DX measurement. Includes developer empathy simulation, auto-detect product type with applicability gate, DX scorecard with trend tracking, and a conditional Claude Code Skill DX checklist. Hall of Fame examples loaded on-demand per pass from dx-hall-of-fame.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /devex-review skill — live DX audit with browse Live-system developer experience audit using browse tool. Tests all 8 dimensions aligned with /plan-devex-review for boomerang comparison (plan said 3 min TTHW, reality says 8). Each dimension marked TESTED, INFERRED, or N/A with evidence. Scope-aware: declares what browse can and cannot test, falls back to file artifacts for untestable dimensions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.15.3.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 16:22:57 -07:00

7 Commits