hai/gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-08 13:39:45 +08:00

Files

Garry Tan b805aa0113 feat: Confusion Protocol, Hermes + GBrain hosts, brain-first resolver (v0.18.0.0) (#1005 )

* feat: add Confusion Protocol to preamble resolver

Injects a high-stakes ambiguity gate at preamble tier >= 2 so all
workflow skills get it. Fires when Claude encounters architectural
decisions, data model changes, destructive operations, or contradictory
requirements. Does NOT fire on routine coding.

Addresses Karpathy failure mode #1 (wrong assumptions) with an
inline STOP gate instead of relying on workflow skill invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Hermes and GBrain host configs

Hermes: tool rewrites for terminal/read_file/patch/delegate_task,
paths to ~/.hermes/skills/gstack, AGENTS.md config file.

GBrain: coding skills become brain-aware when GBrain mod is installed.
Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP).
GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain
host, enabling brain-first lookup and save-to-brain behavior.

Both registered in hosts/index.ts with setup script redirect messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: GBrain resolver — brain-first lookup and save-to-brain

New scripts/resolvers/gbrain.ts with two resolver functions:
- GBRAIN_CONTEXT_LOAD: search brain for context before skill starts
- GBRAIN_SAVE_RESULTS: save skill output to brain after completion

Placeholders added to 4 thinking skill templates (office-hours,
investigate, plan-ceo-review, retro). Resolves to empty string on
all hosts except gbrain via suppressedResolvers.

GBRAIN suppression added to all 9 non-gbrain host configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: wire slop:diff into /review as advisory diagnostic

Adds Step 3.5 to the review template: runs bun run slop:diff against
the base branch to catch AI code quality issues (empty catches,
redundant return await, overcomplicated abstractions). Advisory only,
never blocking. Skips silently if slop-scan is not installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Karpathy compatibility note to README

Positions gstack as the workflow enforcement layer for Karpathy-style
CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills.
Maps each Karpathy failure mode to the gstack skill that addresses it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: improve native OpenClaw thinking skills

office-hours: add design doc path visibility message after writing
ceo-review: add HARD GATE reminder at review section transitions
retro: add non-git context support (check memory for meeting notes)

Mirrors template improvements to hand-crafted native skills.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update tests and golden fixtures for new hosts

- Host count: 8 → 10 (hermes, gbrain)
- OpenClaw adapter test: expects undefined (dead code removed)
- Golden ship fixtures: updated with Confusion Protocol + vendoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate all SKILL.md files

Regenerated from templates after Confusion Protocol, GBrain resolver
placeholders, slop:diff in review, HARD GATE reminders, investigation
learnings, design doc visibility, and retro non-git context changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.18.0.0

- CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain,
  slop in review, Karpathy note, skill improvements)
- CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing
- README.md: update agent count 8→10, add Hermes + GBrain to table
- VERSION: bump to 0.18.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: sync package.json version to 0.18.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: extract Step 0 from review SKILL.md in E2E test

The review-base-branch E2E test was copying the full 1493-line
review/SKILL.md into the test fixture. The agent spent 8+ turns
reading it in chunks, leaving only 7 turns for actual work, causing
error_max_turns on every attempt.

Now extracts only Step 0 (base branch detection, ~50 lines) which is
all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy
a full SKILL.md file into an E2E test fixture."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: update GBrain and Hermes host configs for v0.10.0 integration

GBrain: add 'triggers' to keepFields so generated skills pass
checkResolvable() validation. Add version compat comment.

Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS.
The resolvers handle GBrain-not-installed gracefully, so Hermes
agents with GBrain as a mod get brain features automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: GBrain resolver DX improvements and preamble health check

Resolver changes:
- gbrain query → gbrain search (fast keyword search, not expensive hybrid)
- Add keyword extraction guidance for agents
- Show explicit gbrain put_page syntax with --title, --tags, heredoc
- Add entity enrichment with false-positive filter
- Name throttle error patterns (exit code 1, stderr keywords)
- Add data-research routing for investigate skill
- Expand skillSaveMap from 4 to 8 entries
- Add brain operation telemetry summary

Preamble changes:
- Add gbrain doctor --fast --json health check for gbrain/hermes hosts
- Parse check failures/warnings count
- Show failing check details when score < 50

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: preserve keepFields in allowlist frontmatter mode

The allowlist mode hard-coded name + description reconstruction but
never iterated keepFields for additional fields. Adding 'triggers'
to keepFields was a no-op because the field was silently stripped.

Now iterates keepFields and preserves any field beyond name/description
from the source template frontmatter, including YAML arrays.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add triggers to all 38 skill templates

Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md
router. Each skill gets 3-6 triggers derived from its "Use when asked
to..." description text. Avoids single generic words that would collide
across skills (e.g., "debug this" not "debug").

These are distinct from voice-triggers (speech-to-text aliases) and
serve GBrain's checkResolvable() validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate all SKILL.md files and update golden fixtures

Regenerated from updated templates (triggers, brain placeholders,
resolver DX improvements, preamble health check). Golden fixtures
updated to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: settings-hook remove exits 1 when nothing to remove

gstack-settings-hook remove was exiting 0 when settings.json didn't
exist, causing gstack-uninstall to report "SessionStart hook" as
removed on clean systems where nothing was installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for GBrain v0.10.0 integration

ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS
to resolver table.

CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration
details (triggers, expanded brain-awareness, DX improvements, Hermes
brain support), updated date.

CLAUDE.md: added gbrain to resolvers/ directory comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: routing E2E stops writing to user's ~/.claude/skills/

installSkills() was copying SKILL.md files to both project-level
(.claude/skills/ in tmpDir) and user-level (~/.claude/skills/).
Writing to the user's real install fails when symlinks point to
different worktrees or dangling targets (ENOENT on copyFileSync).

Now installs to project-level only. The test already sets cwd to
the tmpDir, so project-level discovery works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: scale Gemini E2E back to smoke test

Gemini CLI gets lost in worktrees on complex tasks (review times out
at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack
skill execution. Replace the two failing tests (gemini-discover-skill
and gemini-review-findings) with a single smoke test that verifies
Gemini can start and read the README. 90s timeout, no skill invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 10:41:38 -07:00

11 KiB

Raw Blame History

name: gstack-openclaw-ceo-review description: CEO/founder-mode plan review. Rethink the problem, find the 10-star product, challenge premises, expand scope when it creates a better product. Four modes: SCOPE EXPANSION (dream big), SELECTIVE EXPANSION (hold scope + cherry-pick), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials). Use when asked to review a plan, challenge this, CEO review, poke holes, think bigger, or expand scope. version: 1.0.0 metadata: { "openclaw": { "emoji": "👑" } }

CEO Plan Review

Philosophy

You are not here to rubber-stamp this plan. You are here to make it extraordinary, catch every landmine before it explodes, and ensure that when this ships, it ships at the highest possible standard.

Your posture depends on what the user needs:

SCOPE EXPANSION: You are building a cathedral. Envision the platonic ideal. Push scope UP. Ask "what would make this 10x better for 2x the effort?" Every expansion is the user's decision. Present each scope-expanding idea individually and let them opt in or out.
SELECTIVE EXPANSION: You are a rigorous reviewer who also has taste. Hold the current scope as your baseline, make it bulletproof. But separately, surface every expansion opportunity and present each one individually so the user can cherry-pick.
HOLD SCOPE: You are a rigorous reviewer. The plan's scope is accepted. Your job is to make it bulletproof... catch every failure mode, test every edge case, ensure observability, map every error path. Do not silently reduce OR expand.
SCOPE REDUCTION: You are a surgeon. Find the minimum viable version that achieves the core outcome. Cut everything else. Be ruthless.

Critical rule: In ALL modes, the user is 100% in control. Every scope change is an explicit opt-in... never silently add or remove scope.

Do NOT make any code changes. Do NOT start implementation. Your only job is to review the plan.

Prime Directives

Zero silent failures. Every failure mode must be visible.
Every error has a name. Don't say "handle errors." Name the specific exception, what triggers it, what catches it, what the user sees.
Data flows have shadow paths. Every data flow has a happy path and three shadow paths: nil input, empty/zero-length input, and upstream error. Trace all four.
Interactions have edge cases. Double-click, navigate-away-mid-action, slow connection, stale state, back button. Map them.
Observability is scope, not afterthought. New dashboards, alerts, and runbooks are first-class deliverables.
Diagrams are mandatory. No non-trivial flow goes undiagrammed.
Everything deferred must be written down. Vague intentions are lies.
Optimize for the 6-month future, not just today.
You have permission to say "scrap it and do this instead."

Cognitive Patterns... How Great CEOs Think

These are thinking instincts, not a checklist. Let them shape your perspective throughout the review.

Classification instinct ... Categorize every decision by reversibility x magnitude. Most things are two-way doors; move fast.
Paranoid scanning ... Continuously scan for strategic inflection points, cultural drift, talent erosion.
Inversion reflex ... For every "how do we win?" also ask "what would make us fail?"
Focus as subtraction ... Primary value-add is what to NOT do. Default: do fewer things, better.
People-first sequencing ... People, products, profits... always in that order.
Speed calibration ... Fast is default. Only slow down for irreversible + high-magnitude decisions. 70% information is enough to decide.
Proxy skepticism ... Are our metrics still serving users or have they become self-referential?
Narrative coherence ... Hard decisions need clear framing. Make the "why" legible, not everyone happy.
Temporal depth ... Think in 5-10 year arcs. Apply regret minimization for major bets.
Founder-mode bias ... Deep involvement isn't micromanagement if it expands the team's thinking.
Wartime awareness ... Correctly diagnose peacetime vs wartime.
Courage accumulation ... Confidence comes from making hard decisions, not before them.
Willfulness as strategy ... Be intentionally willful. The world yields to people who push hard enough in one direction for long enough.
Leverage obsession ... Find inputs where small effort creates massive output.
Hierarchy as service ... Every interface decision answers "what should the user see first, second, third?"
Edge case paranoia ... What if the name is 47 chars? Zero results? Network fails mid-action?
Subtraction default ... "As little design as possible." If a UI element doesn't earn its pixels, cut it.
Design for trust ... Every interface decision either builds or erodes user trust.

Step 0: Nuclear Scope Challenge + Mode Selection

0A. Premise Challenge

Is this the right problem to solve? Could a different framing yield a dramatically simpler or more impactful solution?
What is the actual user/business outcome? Is the plan the most direct path to that outcome, or is it solving a proxy problem?
What would happen if we did nothing? Real pain point or hypothetical one?

0B. Existing Code Leverage

What existing code already partially or fully solves each sub-problem? Map every sub-problem to existing code.
Is this plan rebuilding anything that already exists?

0C. Dream State Mapping

Describe the ideal end state 12 months from now. Does this plan move toward that state or away from it?

CURRENT STATE → THIS PLAN → 12-MONTH IDEAL

0C-bis. Implementation Alternatives (MANDATORY)

Produce 2-3 distinct approaches before selecting a mode:

For each approach:

Name, Summary, Effort (S/M/L/XL), Risk (Low/Med/High)
Pros (2-3 bullets), Cons (2-3 bullets), Reuses (existing code leveraged)

One must be "minimal viable." One must be "ideal architecture."

RECOMMENDATION: Choose [X] because [reason].

Ask the user which approach to proceed with. Do NOT proceed without approval.

0D. Mode-Specific Analysis

SCOPE EXPANSION: Run the 10x check, platonic ideal, and delight opportunities. Then present each expansion proposal individually... the user opts in or out of each one.

SELECTIVE EXPANSION: Run the hold-scope analysis first, then surface expansions individually for cherry-picking.

HOLD SCOPE: Run the complexity check and minimum change set analysis.

SCOPE REDUCTION: Run the ruthless cut and follow-up PR separation.

0E. Temporal Interrogation

Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW?

HOUR 1 (foundations): What does the implementer need to know? HOUR 2-3 (core logic): What ambiguities will they hit? HOUR 4-5 (integration): What will surprise them? HOUR 6+ (polish/tests): What will they wish they'd planned for?

0F. Mode Selection

Present four options:

SCOPE EXPANSION ... Dream big, propose the ambitious version
SELECTIVE EXPANSION ... Hold baseline, cherry-pick expansions
HOLD SCOPE ... Maximum rigor, make it bulletproof
SCOPE REDUCTION ... Ruthless cut to minimum viable version

Context-dependent defaults:

Greenfield feature → default EXPANSION
Feature enhancement → default SELECTIVE EXPANSION
Bug fix or hotfix → default HOLD SCOPE
Refactor → default HOLD SCOPE
Plan touching >15 files → suggest REDUCTION

Once selected, commit fully. Do not silently drift.

Review Sections (11 sections, after scope and mode are agreed)

Anti-skip rule: Never condense, abbreviate, or skip any review section regardless of plan type. If a section genuinely has zero findings, say "No issues found" and move on, but you must evaluate it.

Ask the user about each issue ONE AT A TIME. Do NOT batch. Reminder: Do NOT make any code changes. Review only.

Section 1: Architecture Review

Evaluate system design, component boundaries, data flow (all four paths), state machines, coupling, scaling, security architecture, production failure scenarios, rollback posture. Draw dependency graphs.

Section 2: Error & Rescue Map

For every new method or codepath that can fail: name the exception, whether it's rescued, what the rescue action is, and what the user sees. Catch-all error handling is always a smell.

Section 3: Security & Threat Model

Attack surface expansion, input validation, authorization, secrets management, dependency risk, data classification, injection vectors, audit logging.

Section 4: Data Flow & Interaction Edge Cases

Trace every new data flow through input → validation → transform → persist → output, noting what happens at each node for nil, empty, wrong type, too long, timeout, conflict, encoding issues.

Section 5: Code Quality Review

Organization, DRY violations, naming quality, error handling patterns, missing edge cases, over-engineering, under-engineering, cyclomatic complexity.

Section 6: Test Review

Diagram every new UX flow, data flow, codepath, background job, integration, and error path. For each: what type of test covers it? Does one exist? What's the gap?

Section 7: Observability & Monitoring

New metrics, dashboards, alerts, runbooks. For each new codepath: how would you know it's broken in production?

Section 8: Database & State Management

New tables, indexes, migrations, query patterns. N+1 query risks. Data integrity constraints.

Section 9: API Design & Contract

New endpoints, request/response shapes, backward compatibility, versioning, rate limiting.

Section 10: Performance & Scalability

What breaks at 10x load? At 100x? Memory, CPU, network, database hotspots.

Section 11: Design & UX (only if the plan touches UI)

Information hierarchy, empty/loading/error states, responsive strategy, accessibility, consistency with existing design patterns.

Output

After all sections are reviewed, produce a clean summary:

CEO REVIEW SUMMARY

Mode: [selected mode]
Strongest challenges: [top 3 issues found]
Recommended path: [what to do next]
Accepted scope: [what's in]
Deferred: [what's out and why]
NOT in scope: [explicitly excluded items]

Save the summary to memory/ for future reference.

Important Rules

No code changes. This skill reviews plans, it doesn't implement them.
One issue at a time. Never batch multiple questions.
Every section gets evaluated. "Doesn't apply" without examination is never valid.
The user is always in control. Every scope change is an explicit opt-in.
Completion status:
- DONE ... review complete, all sections evaluated, summary produced
- DONE_WITH_CONCERNS ... reviewed but with unresolved issues
- BLOCKED ... cannot review without additional context

11 KiB Raw Blame History