7 Commits

Author SHA1 Message Date
Garry Tan
7b4738bca0 v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills (#1354)
* feat(test/helpers): runPlanSkillFloorCheck — minimal AskUserQuestion-floor observer

Adds a focused PTY observer that exits at the first non-permission
numbered-option render. Catches the May 2026 transcript-bug class
(model wrote plan + ExitPlanMode without firing any AUQ) without
needing to fingerprint or navigate past the AUQ.

Why separate from runPlanSkillCounting: plan-mode AUQs render every
option on a single logical line via cursor-positioning escapes that
stripAnsi can't simulate, so parseNumberedOptions returns < 2 options
and never records a fingerprint. Counting tests work on 25-min budgets
because eventually one frame parses cleanly; gate-tier floor tests
need to exit early on the first observation. Trades fingerprint
precision for early-exit reliability.

Also drops COMPLETION_SUMMARY_RE check from this helper — it matches
"GSTACK REVIEW REPORT" anywhere in the buffer including when the
agent does recon by reading existing plan files. plan_ready
(claude's actual "Ready to execute" confirmation) is the reliable
terminal signal for "agent finished without asking."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resolvers): generateAntiShortcutClause shared resolver

Adds {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver
function in scripts/resolvers/review.ts. Plan-* review skills can now
include the clause via one placeholder line in their .tmpl rather than
cloning the paragraph four times. Future tightening edits one resolver,
all four skills update on next gen-skill-docs.

Wired into the existing RESOLVERS map alongside generateReviewDashboard
and generatePlanFileReviewReport — no gen-skill-docs.ts change needed
because the generator already does generic placeholder substitution
against that map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(plan-*-review): anti-shortcut clause in all four review skills

Inserts {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the
**Anti-skip rule:** paragraph in plan-{eng,ceo,design,devex}-review
SKILL.md.tmpl. The four templates use different surrounding section
headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex
variants), so anchoring on the paragraph rather than the heading works
across all four.

Closes the May 2026 transcript-bug loophole: existing STOP gates name
forbidden actions only AFTER a per-section finding is identified. The
anti-shortcut clause adds the pre-emptive rule — "the plan file is the
OUTPUT of the interactive review, not a substitute for it" — covering
the case the transcript exhibited (skip per-section walk, dump every
finding into one plan write, call ExitPlanMode).

Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: gate-tier AskUserQuestion floor tests for all plan-* review skills

Adds 4 finding-floor tests (one per plan-* skill) that catch the May
2026 transcript-bug class — model wrote a plan and called ExitPlanMode
without firing any review-phase AskUserQuestion. Asserts via
runPlanSkillFloorCheck that ANY non-permission AUQ render fires before
the agent reaches plan_ready.

Verified:
- Eng floor: passed in 59s
- CEO floor: passed in 197s
- Design floor: passed
- Devex floor: passed
- Total ~$2-6 per CI run; only triggers on diff against the 4 plan-*
  templates, the shared resolver review.ts, the seeds fixture, or the
  PTY runner helper.

Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant
per skill. Each seed is engineered to force at least one obvious
finding under that skill's review focus (architectural smell for eng,
scope-creep for ceo, UI-slop for design, painful onboarding for devex).

Touchfiles wiring:
- E2E_TOUCHFILES: 4 plan-*-finding-floor entries with deps on the
  matching skill template, the shared resolver, the seeds fixture,
  and the PTY runner helper
- E2E_TIERS: all 4 entries marked 'gate'
- touchfiles.test.ts: count assertion bumped 21→22 with explicit
  plan-ceo-finding-floor containment check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.27.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 20:27:20 -07:00
Garry Tan
9e244c0bed v1.11.1.0 fix: plan-mode handshake + canUseTool test harness (#1182)
* feat: plan-mode handshake for interactive review skills

Add a preamble-level STOP-Ask handshake that fires when the user invokes any
of the 4 interactive review skills (plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review) while their Claude Code session is
in plan mode. Without this gate, plan mode's "this supercedes any other
instructions" system-reminder outranked the skills' interactive STOP gates
and the skills silently wrote plan files without any per-finding AskUserQuestion.

The handshake offers 2 options (exit-and-rerun, cancel) — the original
third "stay and batch" option was dropped after two independent reviewers
flagged it as a silent bypass of the skills' anti-skip rule.

Architecture decisions (CEO+Eng review):
- Preamble-level resolver, not per-template injection (Codex finding #2)
- Position 1 in preamble composition: after bash block (_SESSION_ID live),
  before onboarding AskUserQuestion gates (so fresh-install users see the
  handshake first, not drowned in telemetry/proactive/routing prompts)
- Generator-only `interactive: true` frontmatter flag, following the
  `preamble-tier` precedent (no host-config frontmatter allowlist edits)
- Host-scoped to Claude via `ctx.host === 'claude'` check inside the
  resolver (simpler than `suppressedResolvers` which only gates `{{}}`
  placeholders)
- One-way-door classification in scripts/question-registry.ts for all 4
  skills so question-tuning `never-ask` preferences can't suppress the gate
- Synchronous telemetry write to ~/.gstack/analytics/skill-usage.jsonl on
  handshake fire (captures A-exit and C-cancel outcomes that terminate the
  skill before end-of-run telemetry runs)

Also adds an explicit STOP block to plan-ceo-review Step 0C-bis so the
approach-selection question can't silently skip to mode selection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: extend agent-sdk-runner with canUseTool for AskUserQuestion interception

Test harness at test/helpers/agent-sdk-runner.ts gains an optional
`canUseTool` callback parameter. When a test supplies it, the harness
flips `permissionMode` from `bypassPermissions` (overlay-harness default)
to `default` so the SDK actually invokes the callback on every tool use,
and auto-adds `AskUserQuestion` to `allowedTools` so Claude can fire it
at all.

Exports a `passThroughNonAskUserQuestion` helper so tests that only want
to intercept AskUserQuestion can auto-allow every other tool with one
line: `return passThroughNonAskUserQuestion(toolName, input)`.

This is the foundation for D14 — every future interactive-skill E2E test
can now assert on AskUserQuestion shape and routing. Previous E2E tests
at `test/skill-e2e.test.ts` explicitly instructed the model to skip
AskUserQuestion ("non-interactive run") which meant no test could actually
verify the question content or routing.

6 new unit tests in test/agent-sdk-runner.test.ts cover:
- permissionMode flips to 'default' when canUseTool supplied
- permissionMode stays 'bypassPermissions' when canUseTool absent
- canUseTool callback reaches the SDK options
- AskUserQuestion auto-added to allowedTools when canUseTool supplied
- AskUserQuestion NOT added when canUseTool absent
- passThroughNonAskUserQuestion helper returns allow+updatedInput

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: plan-mode handshake E2E coverage and unit assertions

Adds 6 E2E test files and 8 new unit assertions to verify the plan-mode
handshake works end-to-end and stays correct under regeneration.

E2E tests (gate-tier, paid, EVALS=1 EVALS_TIER=gate):
- test/skill-e2e-plan-ceo-plan-mode.test.ts — handshake fires before any
  Write/Edit when plan-mode distinctive phrase is present; 2-option shape
  (Exit/Cancel); option A routes to ExitPlanMode cleanly
- test/skill-e2e-plan-eng-plan-mode.test.ts — same contract for plan-eng
- test/skill-e2e-plan-design-plan-mode.test.ts — same contract for
  plan-design; exercises C-cancel branch instead of A-exit
- test/skill-e2e-plan-devex-plan-mode.test.ts — same contract for plan-devex
- test/skill-e2e-plan-mode-no-op.test.ts — negative regression: handshake
  must NOT fire when distinctive phrase is absent; skill proceeds normally
  through Step 0 (REGRESSION RULE guardrail against breaking existing
  interactive-review sessions)
- test/e2e-harness-audit.test.ts — free unit test asserting every
  `interactive: true` skill has at least one canUseTool-using test file
  (prevents future drift where a skill opts in without coverage)

Shared helper test/helpers/plan-mode-handshake-helpers.ts centralizes the
canUseTool interceptor + distinctive-phrase injection so the 4 sibling
E2E tests are thin wiring (~20 LOC each) and can't drift out of sync.

Unit assertions added to test/gen-skill-docs.test.ts:
- handshake section present in all 4 Claude-generated SKILL.md files
- handshake section absent from non-interactive Claude skills (ship,
  review, qa, office-hours, codex, retro, cso)
- handshake section absent from non-Claude host outputs (.agents, etc.)
- 0C-bis STOP block present in plan-ceo-review/SKILL.md at correct
  position (between the "Present these approach options" line and
  "### 0D-prelude" header)
- handshake resolver wired BEFORE generateUpgradeCheck in preamble
  composition order

6 new gate-tier entries added to test/helpers/touchfiles.ts so any change
to the handshake resolver, preamble composition, skill templates, question
registry, one-way-door classifier, or agent-sdk-runner fires the relevant
E2E tests. test/touchfiles.test.ts updated for the new selection count
(plan-ceo-review/** now triggers 15 tests, up from 8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(v1.11.1.0): VERSION bump + CHANGELOG entry + TODOS follow-ups

Bumps from main's v1.11.0.0 to v1.11.1.0 (PATCH — bug-fix release, no new
user-facing artifacts). CHANGELOG entry covers the plan-mode handshake,
agent-sdk-runner canUseTool extension, and the 2 follow-up TODOs.

CHANGELOG order: v1.11.1.0 (this) → v1.11.0.0 (workspace-aware ship,
merged from main) → v1.10.1.0 (overlay efficacy harness). No duplicate
headers.

Syncs package.json version to match VERSION per the Step 12 idempotency
invariant (both files must agree or /ship halts).

TODOS.md:
- Preserves the Testing/security-bench-haiku-responses P1 added on main
- Adds P1 "Structural STOP-Ask forcing function" — broader class of the
  bug this release fixes
- Adds P2 "Apply interactive: true to non-review skills (office-hours,
  codex, investigate, qa, retro, cso)"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 00:04:53 -07:00
Garry Tan
a81be53621 v1.10.0.0: fix AskUserQuestion cadence + Pros/Cons format upgrade (#1178)
* fix(preamble): reorder AskUserQuestion Format above model overlay + rewrite Opus 4.7 pacing directive

Root cause of plan-review regression (v1.6.4.0): model overlays rendered
ABOVE the pacing rule in every SKILL.md, so Opus 4.7 read "Batch your
questions" first and absorbed it as the ambient default. The overlay's
claimed subordination ("skill wins on pacing, always") didn't stick —
literal-interpretation mode reads physical order, not claimed hierarchy.

Part 1 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md):

scripts/resolvers/preamble.ts
- Move generateAskUserFormat above generateModelOverlay in section array
- Comment explains why — prevents future refactors from silently reverting

model-overlays/opus-4-7.md
- Replace "Batch your questions" block with "Pace questions to the skill"
- New wording makes one-question-per-turn the default when the skill
  contains STOP directives; batching becomes the explicit exception

Regenerated 30 SKILL.md files via bun run gen:skill-docs.

Verified:
- With --model opus-4-7: Format renders at line 359, Model-Specific
  Patch at 373, "Pace questions" at 419 (Format comes first, overlay
  second, pacing directive intact).
- bun test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plan-reviews): tighten STOP/escape-hatch directives across 4 templates

Part 2 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).

Codex caught that v1.6.3.0's reasoning collapsed on Opus 4.7: the old
escape-hatch wording ("If no issues or fix is obvious, state what
you'll do and move on — don't waste a question") let the literal
interpreter classify every finding as having an "obvious fix" and skip
AskUserQuestion entirely. Reviews became reports.

Per-template hardening (16 sites total, verified by rg):

plan-ceo-review/SKILL.md.tmpl (13 sites):
- 12 inline STOP directives: replace the full escape-hatch clause with
  "zero findings → say so and proceed; findings → MUST call AskUserQuestion
  as a tool_use, including for obvious fixes."
- 1 Escape hatch bullet in CRITICAL RULE section: tightened.

plan-eng-review, plan-design-review, plan-devex-review (1 site each):
- Each template's Escape hatch bullet tightened to match the new CEO wording,
  adapted for each review's domain (issue/gap, decision/design/DX alternatives).

After regeneration: rg "don't waste a question" returns 0 across all
*SKILL.md.tmpl and *SKILL.md files. "zero findings, state" wording
present 16 times (matches prior count of escape-hatch sites).

bun test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(preamble): upgrade AskUserQuestion format to Pros/Cons decision brief

Part 4 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).

Every AskUserQuestion now renders as a decision brief, not a bullet list:
D-numbered header, ELI10, Stakes-if-we-pick-wrong, Recommendation, Pros/Cons
with / markers per option, closing Net: tradeoff synthesis.

scripts/resolvers/preamble/generate-ask-user-format.ts
- Full rewrite. Preserves prior rules (Re-ground, ELI10, Recommend,
  Completeness, Options) and adds:
  - D-numbering per skill invocation (model-level, not runtime state)
  - Stakes line (pain avoided / capability unlocked / consequence named)
  - Pros/Cons block with min 2  + 1  per option, min 40 chars/bullet
  - Hard-stop escape: " No cons — this is a hard-stop choice" for
    genuine one-sided choices (destructive-action confirmations)
  - Neutral-posture handling (CT1-compliant): (recommended) label
    STAYS on default option to preserve AUTO_DECIDE contract; neutrality
    expressed as prose in Recommendation line only
  - Net line closes the decision with a one-sentence tradeoff frame
  - Rule 11: tool_use mandate (prose "Question:" blocks don't count)
  - Self-check list before emitting

test/skill-validation.test.ts
- Update format assertions to check for new Pros/Cons tokens
  (Pros / cons:, Recommendation: <choice>, Net:, ELI10, Stakes if we
  pick wrong:, , ) across all tier-2+ skills
- Old "RECOMMENDATION: Choose" expectation removed (the new format uses
  mixed-case "Recommendation:" with no literal "Choose")

test/skill-e2e-plan-format.test.ts
- Add v1.7.0.0 format token regexes (PROS_CONS_HEADER_RE, PRO_BULLET_RE,
  CON_BULLET_RE, NET_LINE_RE, D_NUMBER_RE, STAKES_RE)
- Existing RECOMMENDATION_RE loosened to accept mixed-case "Recommendation:"
  (canonical v1.7.0.0 form) alongside all-caps (legacy). Tests are
  additive — the strict new-format gate is the upcoming cadence eval.

Regenerated 30 SKILL.md files via bun run gen:skill-docs.

Verified:
- bun test: 319 pass (1 pre-existing security-bench fixture oversize
  failure on main, unrelated — confirmed via git stash test on main HEAD)
- New format tokens render in all tier-2+ skills (plan-ceo-review,
  plan-eng-review, ship, office-hours verified)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format

Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).

Gate-tier (E1, free, runs on every `bun test`):

test/preamble-compose.test.ts — pins the composition order
  Asserts AskUserQuestion Format section renders BEFORE Model-Specific
  Behavioral Patch in tier-≥2 preamble output. Covers claude default,
  opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit
  to scripts/resolvers/preamble.ts that silently reverts the order.

test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract
  14 assertions against generateAskUserFormat output: D<N>, ELI10,
  Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:,
  / markers, min 2 pros + 1 con rules, hard-stop escape exact
  phrase, neutral-posture CT1 rule ((recommended) label preserved for
  AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate
  (rule 11), self-check list, D-numbering model-level caveat.

test/model-overlay-opus-4-7.test.ts — pins the pacing directive
  Asserts raw overlay file + resolved overlay output contain "Pace
  questions to the skill" and NOT "Batch your questions". Verifies
  INHERIT:claude chain still works (Todo-list, subordination wrapper),
  Fan out / Effort-match / Literal interpretation nudges preserved.
  Also asserts claude base overlay does NOT carry the Opus-specific
  pacing directive (no cross-contamination).

Periodic-tier (E2, Opus-dependent, ~$1-2/run):

test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness
  1. Format positive — every token present when plan has real tradeoff
  2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to
     "No cons — hard-stop choice" escape
  3. Neutral-posture NEGATIVE — plan where one option dominates must emit
     (recommended) label + "because <reason>", must NOT dodge to
     "taste call" / "no preference"
  4. Hard-stop POSITIVE — destructive-action plan may legitimately use
     the hard-stop escape

test/helpers/touchfiles.ts — entries for all new eval cases
  Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and
  the 4 plan-review templates. Diff-based selection triggers the evals
  whenever those files change. Also added entries for 7 expanded-coverage
  cases (ship, office-hours, investigate, qa, review, design-review,
  document-release) — test cases will land in follow-up PRs per skill.

Follow-ups noted in test file header:
- True multi-turn cadence eval (3 findings → 3 distinct asks) — current
  harness captures one $OUT_FILE per session; multi-turn capture needs
  new harness support.
- Expanded-coverage test cases for the 7 non-plan-review skills.

Verified:
- bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench
  oversize failure on main (unrelated, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: regenerate golden fixtures + update ELI10 phrase check for v1.7.0.0

Pros/Cons format rewrite (6b99df9d) changed the resolver output across all
tier-2+ SKILL.md files. Three golden-file regression tests in
test/host-config.test.ts and one phrase-check test in test/gen-skill-docs.test.ts
were failing as expected.

- test/fixtures/golden/claude-ship-SKILL.md
- test/fixtures/golden/codex-ship-SKILL.md
- test/fixtures/golden/factory-ship-SKILL.md
  Regenerated via `bun run gen:skill-docs --host all` + cp into fixtures.

- test/gen-skill-docs.test.ts line 244: rename test from "ELI16 simplification
  rules" to "ELI10 simplification rules" and match the new phrase pattern.
  v1.7.0.0 uses "ELI10 (ALWAYS)" rather than legacy "Simplify (ELI10, ALWAYS)".

bun test: 744 pass, 1 fail (pre-existing security-bench fixture oversize,
unrelated to this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.7.0.0: plan reviews walk you through each issue with Pros/Cons

Restores AskUserQuestion cadence on Opus 4.7 (v1.6.4.0 regression) and
upgrades the format to a numbered decision brief — D<N> header, ELI10,
Stakes, Recommendation, per-option / bullets, Net: closing line.

Fix: composition reorder + overlay rewrite + 16-site escape-hatch hardening
across the 4 plan-review templates.
Feature: Pros/Cons format in the preamble resolver, inherited by every
tier-2+ skill automatically.

30 new gate-tier unit tests pin the format contract (runs in <100ms, $0).
4 new periodic-tier eval cases defend against escape-hatch abuse
(2 positive, 2 negative). Golden fixtures regenerated.

CEO + Eng + Codex reviews completed. 5 of 8 Codex findings incorporated;
CT2 (16 sites, not 31) and CT1 (AUTO_DECIDE contract break) were
load-bearing catches the primary reviews missed.

bun test: 774 pass, 1 fail (pre-existing security-bench oversize, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.10.0.0: bump VERSION (was v1.7.0.0, align with branch discipline)

Per user direction — jumping to 1.10.0.0 for versioning alignment.
No functional changes from the prior ship commit (5f038ab7). The
regression fix + Pros/Cons format are identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:25:34 -07:00
Garry Tan
b805aa0113 feat: Confusion Protocol, Hermes + GBrain hosts, brain-first resolver (v0.18.0.0) (#1005)
* feat: add Confusion Protocol to preamble resolver

Injects a high-stakes ambiguity gate at preamble tier >= 2 so all
workflow skills get it. Fires when Claude encounters architectural
decisions, data model changes, destructive operations, or contradictory
requirements. Does NOT fire on routine coding.

Addresses Karpathy failure mode #1 (wrong assumptions) with an
inline STOP gate instead of relying on workflow skill invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Hermes and GBrain host configs

Hermes: tool rewrites for terminal/read_file/patch/delegate_task,
paths to ~/.hermes/skills/gstack, AGENTS.md config file.

GBrain: coding skills become brain-aware when GBrain mod is installed.
Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP).
GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain
host, enabling brain-first lookup and save-to-brain behavior.

Both registered in hosts/index.ts with setup script redirect messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: GBrain resolver — brain-first lookup and save-to-brain

New scripts/resolvers/gbrain.ts with two resolver functions:
- GBRAIN_CONTEXT_LOAD: search brain for context before skill starts
- GBRAIN_SAVE_RESULTS: save skill output to brain after completion

Placeholders added to 4 thinking skill templates (office-hours,
investigate, plan-ceo-review, retro). Resolves to empty string on
all hosts except gbrain via suppressedResolvers.

GBRAIN suppression added to all 9 non-gbrain host configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: wire slop:diff into /review as advisory diagnostic

Adds Step 3.5 to the review template: runs bun run slop:diff against
the base branch to catch AI code quality issues (empty catches,
redundant return await, overcomplicated abstractions). Advisory only,
never blocking. Skips silently if slop-scan is not installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Karpathy compatibility note to README

Positions gstack as the workflow enforcement layer for Karpathy-style
CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills.
Maps each Karpathy failure mode to the gstack skill that addresses it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: improve native OpenClaw thinking skills

office-hours: add design doc path visibility message after writing
ceo-review: add HARD GATE reminder at review section transitions
retro: add non-git context support (check memory for meeting notes)

Mirrors template improvements to hand-crafted native skills.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update tests and golden fixtures for new hosts

- Host count: 8 → 10 (hermes, gbrain)
- OpenClaw adapter test: expects undefined (dead code removed)
- Golden ship fixtures: updated with Confusion Protocol + vendoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate all SKILL.md files

Regenerated from templates after Confusion Protocol, GBrain resolver
placeholders, slop:diff in review, HARD GATE reminders, investigation
learnings, design doc visibility, and retro non-git context changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.18.0.0

- CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain,
  slop in review, Karpathy note, skill improvements)
- CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing
- README.md: update agent count 8→10, add Hermes + GBrain to table
- VERSION: bump to 0.18.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: sync package.json version to 0.18.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: extract Step 0 from review SKILL.md in E2E test

The review-base-branch E2E test was copying the full 1493-line
review/SKILL.md into the test fixture. The agent spent 8+ turns
reading it in chunks, leaving only 7 turns for actual work, causing
error_max_turns on every attempt.

Now extracts only Step 0 (base branch detection, ~50 lines) which is
all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy
a full SKILL.md file into an E2E test fixture."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: update GBrain and Hermes host configs for v0.10.0 integration

GBrain: add 'triggers' to keepFields so generated skills pass
checkResolvable() validation. Add version compat comment.

Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS.
The resolvers handle GBrain-not-installed gracefully, so Hermes
agents with GBrain as a mod get brain features automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: GBrain resolver DX improvements and preamble health check

Resolver changes:
- gbrain query → gbrain search (fast keyword search, not expensive hybrid)
- Add keyword extraction guidance for agents
- Show explicit gbrain put_page syntax with --title, --tags, heredoc
- Add entity enrichment with false-positive filter
- Name throttle error patterns (exit code 1, stderr keywords)
- Add data-research routing for investigate skill
- Expand skillSaveMap from 4 to 8 entries
- Add brain operation telemetry summary

Preamble changes:
- Add gbrain doctor --fast --json health check for gbrain/hermes hosts
- Parse check failures/warnings count
- Show failing check details when score < 50

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: preserve keepFields in allowlist frontmatter mode

The allowlist mode hard-coded name + description reconstruction but
never iterated keepFields for additional fields. Adding 'triggers'
to keepFields was a no-op because the field was silently stripped.

Now iterates keepFields and preserves any field beyond name/description
from the source template frontmatter, including YAML arrays.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add triggers to all 38 skill templates

Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md
router. Each skill gets 3-6 triggers derived from its "Use when asked
to..." description text. Avoids single generic words that would collide
across skills (e.g., "debug this" not "debug").

These are distinct from voice-triggers (speech-to-text aliases) and
serve GBrain's checkResolvable() validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate all SKILL.md files and update golden fixtures

Regenerated from updated templates (triggers, brain placeholders,
resolver DX improvements, preamble health check). Golden fixtures
updated to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: settings-hook remove exits 1 when nothing to remove

gstack-settings-hook remove was exiting 0 when settings.json didn't
exist, causing gstack-uninstall to report "SessionStart hook" as
removed on clean systems where nothing was installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for GBrain v0.10.0 integration

ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS
to resolver table.

CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration
details (triggers, expanded brain-awareness, DX improvements, Hermes
brain support), updated date.

CLAUDE.md: added gbrain to resolvers/ directory comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: routing E2E stops writing to user's ~/.claude/skills/

installSkills() was copying SKILL.md files to both project-level
(.claude/skills/ in tmpDir) and user-level (~/.claude/skills/).
Writing to the user's real install fails when symlinks point to
different worktrees or dangling targets (ENOENT on copyFileSync).

Now installs to project-level only. The test already sets cwd to
the tmpDir, so project-level discovery works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: scale Gemini E2E back to smoke test

Gemini CLI gets lost in worktrees on complex tasks (review times out
at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack
skill execution. Replace the two failing tests (gemini-discover-skill
and gemini-review-findings) with a single smoke test that verifies
Gemini can start and read the README. 90s timeout, no skill invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:41:38 -07:00
Garry Tan
31943b2f02 feat: anti-skip rule for all review skills (v0.15.6.1) (#804)
* feat: anti-skip rule for all review skills

Review skills sometimes skip sections when reviewing strategy or spec
plans. This adds an explicit anti-skip rule to CEO (1-11), eng (1-4),
design (1-7), and DX (1-8) review skills. Also fixes CEO header from
"10 sections" to "11 sections" to match actual count.

* chore: bump version and changelog (v0.15.6.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-04 21:22:40 -07:00
Garry Tan
447851452a feat: interactive /plan-devex-review + plan mode skill fix (v0.15.5.0) (#796)
* fix: skill invocation during plan mode takes precedence over generic plan mode

Adds a "Skill Invocation During Plan Mode" section to the preamble resolver so
all generated SKILL.md files include it. Fixes a bug where Claude treats loaded
skill content as reference material instead of executable instructions, and keeps
trying to ExitPlanMode instead of following the skill workflow step by step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: interactive /plan-devex-review with persona, benchmarks, and forcing questions

Complete rewrite of the DX review skill to match CEO/eng review depth. New flow:
investigate (persona, empathy, competitors, magical moment, journey tracing) then
force decisions, then score with evidence. Three modes: DX EXPANSION, DX POLISH,
DX TRIAGE. 20-45 interactive STOP points vs 10-12 before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: autoplan DX POLISH mode + review log schema for new devex fields

Adds mode selection, persona, competitive, and magical moment override rules to
autoplan Phase 3.5. Documents new review log fields (mode, persona, competitive_tier)
in the plan-file-review-report schema. Syncs package.json version to VERSION.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.15.5.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 14:36:23 -07:00
Garry Tan
be96ff5ce7 feat: /plan-devex-review + /devex-review — DX review skills (v0.15.3.0) (#784)
* feat: add DX framework resolver for shared principles and scoring rubric

New {{DX_FRAMEWORK}} resolver provides compact (~150 lines) shared content
for /plan-devex-review and /devex-review: Addy Osmani's 8 DX principles,
7 characteristics table, 10 cognitive patterns, scoring rubric, and TTHW
benchmarks. Hall of Fame examples loaded on-demand per pass to avoid bloat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add DX Review row to review dashboard

Adds plan-devex-review and devex-review schema entries to the review
dashboard resolver and placeholder table in the preamble. All existing
SKILL.md files regenerated to include the new DX Review row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /plan-devex-review skill — DX plan review with Osmani framework

Plan-stage developer experience review. Rates 8 DX dimensions 0-10:
getting started, API/CLI/SDK design, error messages, docs, upgrade path,
dev environment, community, and DX measurement. Includes developer empathy
simulation, auto-detect product type with applicability gate, DX scorecard
with trend tracking, and a conditional Claude Code Skill DX checklist.
Hall of Fame examples loaded on-demand per pass from dx-hall-of-fame.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /devex-review skill — live DX audit with browse

Live-system developer experience audit using browse tool. Tests all 8
dimensions aligned with /plan-devex-review for boomerang comparison
(plan said 3 min TTHW, reality says 8). Each dimension marked TESTED,
INFERRED, or N/A with evidence. Scope-aware: declares what browse can
and cannot test, falls back to file artifacts for untestable dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.15.3.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:22:57 -07:00