mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-12 23:43:05 +08:00
* fix(codex): use resume-compatible flags * fix: V-001 security vulnerability Automated security fix generated by Orbis Security AI * docs: align prompt-injection thresholds to security.ts (v1.6.4.0 catch-up) CLAUDE.md:290 and ARCHITECTURE.md:159 were missed when WARN was bumped 0.60 → 0.75 ind75402bb(v1.6.4.0, "cut Haiku classifier FP from 44% to 23%, gate now enforced", #1135). browse/src/security.ts:37 has WARN: 0.75 and BROWSER.md:743 was updated alongside that commit; CLAUDE.md and ARCHITECTURE.md still read 0.60. Also adds the SOLO_CONTENT_BLOCK: 0.92 entry to CLAUDE.md (already in security.ts:50 and BROWSER.md:745, missing from CLAUDE.md's threshold table). No code change. No behavior change. Pure doc-vs-code alignment. Verification: $ grep -n "WARN" browse/src/security.ts CLAUDE.md ARCHITECTURE.md BROWSER.md browse/src/security.ts:37: WARN: 0.75, CLAUDE.md:290: - \`WARN: 0.75\` ... ARCHITECTURE.md:159: ...>= \`WARN\` (0.75)... BROWSER.md:743: - \`WARN: 0.75\` ... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: Korean/CJK IME input and rendering in Sidebar Terminal Fixes #1272 This commit addresses three separate Korean/CJK bugs in the Sidebar Terminal: **Bug 1 - IME Input**: Korean text typed via IME composition was not reaching the PTY correctly. Added compositionstart/compositionend event listeners to suppress partial jamo fragments and only send the final composed string. **Bug 2a - Font Rendering**: Added CJK monospace font fallbacks ("Noto Sans Mono CJK KR", "Malgun Gothic") to both the xterm.js fontFamily config and the CSS --font-mono variable. This ensures consistent cell-width calculations for Korean characters. **Bug 2b - UTF-8 Boundary Detection**: Added buffering logic to prevent multi-byte UTF-8 characters (Korean is 3 bytes) from being split across WebSocket chunks. This follows the same pattern as PR #1007 which fixed the sidebar-agent path, but extends it to the terminal-agent path. Special thanks to @ldybob for the excellent root cause analysis and proposed solutions in issue #1272. Tested on WSL2 + Windows 11 with Korean IME. * fix(ship): tighten Plan Completion gate (VAS-449 remediation) VAS-446 shipped with a PLAN.md acceptance criterion (domain-hq has /docs/dashboard.md) silently skipped. /ship's Plan Completion subagent existed at ship time (added in v1.4.1.0) but the gate let the failure through. Four structural fixes: 1. Path concreteness rule: items naming a concrete filesystem path MUST be classified DONE/NOT DONE via [ -f <path> ], never UNVERIFIABLE. 2. Validator detection: CONTENT-SHAPE items scan target repo's package.json for validate-* scripts and run them before falling back to UNVERIFIABLE. 3. Per-item UNVERIFIABLE confirmation: replaces blanket "I've checked each one" with per-item Y/N/D loop. The blanket-confirm path is the exact failure VAS-449 surfaced. 4. Subagent fail-closed: if Plan Completion subagent + inline fallback both fail, surface explicit AskUserQuestion instead of silent pass. Replaces the prior "Never block /ship on subagent failure" fail-open. Locked in by test/ship-plan-completion-invariants.test.ts (5 assertions, no LLM dependency, ~60ms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(browse): bash.exe wrap for telemetry on Windows reportAttemptTelemetry() in browse/src/security.ts calls spawn(bin, args) where bin is the gstack-telemetry-log bash script. On Windows this fails silently with ENOENT — CreateProcess can't dispatch on shebang lines. Adopts v1.24.0.0's Bun.which + GSTACK_*_BIN override pattern (from browse/src/claude-bin.ts:resolveClaudeCommand, introduced in #1252) for resolving bash.exe. resolveBashBinary() honors GSTACK_BASH_BIN absolute-path or PATH-resolvable override, falling back to Bun.which('bash') which finds Git Bash on the standard Windows install. buildTelemetrySpawnCommand() wraps the script invocation on win32 only; POSIX path is bit-identical. Returns null when bash can't be resolved on Windows so caller skips spawn — local attempts.jsonl audit trail keeps working without surfacing a Windows-only failure. 8 new unit tests cover resolveBashBinary (POSIX bash, absolute override, quote-stripping, BASH_BIN fallback, empty-PATH null) and buildTelemetrySpawnCommand (POSIX pass-through, win32 bash wrap, win32 null on unresolvable, arg-array immutability). POSIX path is bit-identical — Bun.which('bash') on Linux/macOS returns the same /bin/bash or /usr/bin/bash that the old hardcoded spawn relied on. * fix(make-pdf): Bun.which-based binary resolution for browse + pdftotext on Windows Extends v1.24.0.0's Bun.which + GSTACK_*_BIN override pattern (introduced in browse/src/claude-bin.ts via #1252) to the two other binary resolvers in the codebase: make-pdf/src/browseClient.ts:resolveBrowseBin and make-pdf/src/pdftotext.ts:resolvePdftotext. Same Windows quirks (fs.accessSync(X_OK) degrades to existence-check; `which` isn't available outside Git Bash; bun --compile --outfile X emits X.exe), same Bun.which-based fix shape, same env override convention. Changes: - GSTACK_BROWSE_BIN / GSTACK_PDFTOTEXT_BIN as the v1.24-aligned overrides; BROWSE_BIN / PDFTOTEXT_BIN remain as back-compat aliases. - Bun.which() replaces execFileSync('which', ...) for PATH lookup. Handles Windows PATHEXT natively; no more `where`-vs-`which` branch. - findExecutable(base) helper exported from each module, probes .exe/.cmd/.bat after the bare-path miss on win32. Linux/macOS behavior is bit-identical (isExecutable short-circuits before the win32 branch ever runs). - macCandidates renamed posixCandidates (always was — /opt/homebrew, /usr/local, /usr/bin). No Windows candidates added; Poppler installs scatter across Scoop/Chocolatey/portable zips and guessing causes false positives. - Error messages get a Windows install hint (scoop install poppler / oschwartz10612) and `setx` example for GSTACK_*_BIN. - Pre-existing test 'honors BROWSE_BIN when it points at a real executable' was hardcoded /bin/sh — made cross-platform via a REAL_EXE constant (cmd.exe on win32, /bin/sh on POSIX). Was a Windows-CI blocker on its own. Coordination: PR #1094 (@BkashJEE) covered browseClient.ts independently with a narrower scope; this PR's pdftotext + cross-platform tests + GSTACK_*_BIN naming are additive. Either order of merge works. Test plan: - bun test make-pdf/test/browseClient.test.ts make-pdf/test/pdftotext.test.ts on win32 — 29 pass, 0 fail (12 new assertions: findExecutable POSIX/win32/null, resolveBrowseBin GSTACK_BROWSE_BIN + BROWSE_BIN + precedence + quote-strip, same shape for resolvePdftotext + Windows install hint in error message). - POSIX branch unchanged — fs.accessSync(X_OK) on Linux/macOS short-circuits before any win32 logic runs, matching the v1.24 claude-bin.ts pattern. * fix(browse): NTFS ACL hardening for Windows state files via icacls gstack's ~/.gstack/ state directory holds bearer tokens, canary tokens, agent queue contents (with prompt history), session state, security-decision logs, and saved cookie bundles — all written with { mode: 0o600 } / 0o700. On Windows, those mode bits are a silent no-op: Node's fs module doesn't translate POSIX modes to NTFS ACLs, and inherited ACLs leave every "restricted" file readable by other principals on the machine (verified via icacls — six ACEs, the intended user is the LAST of six). Threat model is non-trivial on: - Self-hosted CI runners (different service account on the same Windows box can read developer tokens, canary tokens, prompt history) - Shared development machines (agencies, studios, lab environments) - Multi-tenant servers with shared home directories Orthogonal to v1.24.0.0's binary-resolution work — complementary at the write side. v1.24's bin/gstack-paths resolves ~/.gstack/ correctly across plugin / global / local installs; this PR ensures files written into those resolved paths actually get the POSIX 0o600 semantic translated to NTFS. The fix: - New browse/src/file-permissions.ts (158 LOC, 5 public + 1 test-reset). restrictFilePermissions / restrictDirectoryPermissions wrap chmod (POSIX) or icacls /inheritance:r /grant:r <user>:(F) (Windows). writeSecureFile / appendSecureFile / mkdirSecure are drop-in wrappers for the common patterns. - 19 call sites converted across 9 source files: browser-manager.ts, browser-skill-write.ts, cli.ts, config.ts, meta-commands.ts, security-classifier.ts, security.ts (4 sites), server.ts (5 sites), terminal-agent.ts (8 sites), tunnel-denial-log.ts. - (OI)(CI) inheritance flags on directories mean files created via fs.write* *inside* an mkdirSecure-created dir inherit the owner-only ACL automatically — important for tunnel-denial-log.ts where appends use async fsp.appendFile. Error handling: icacls failures (nonexistent path, missing icacls.exe, hardened environments) log a one-shot warning to stderr and proceed. Once-per-process gating prevents log spam if the condition persists. Filesystem stays functional; the file just ends up with inherited ACLs. Test plan: - bun test browse/test/file-permissions.test.ts — 13 pass, 0 fail (POSIX mode-bit assertions, Windows no-throw, mkdir idempotence, recursive creation, Buffer payloads, append-creates-then-reapplies-once semantics) - bun test browse/test/security.test.ts — 38 pass, 0 fail (existing security test suite plus the bash-binary resolution tests added in fix #1119; the converted writeFileSync/appendFileSync/mkdirSync sites in security.ts integrate cleanly) - Empirical icacls before/after on a real file — 6 ACEs → 1 ACE - bun build typecheck on all modified files — clean (server.ts has a pre-existing playwright-core/electron resolution issue unrelated to this PR) POSIX behavior is bit-identical to old code — fs.chmodSync(path, 0o6XX) on the helper's POSIX branch matches the inline { mode: 0o6XX } it replaces. Linux and macOS see no behavior change. Inviting pushback on three judgment calls (in PR description): 1. icacls vs npm library 2. ACL scope — just user, or user + SYSTEM? 3. Graceful degradation — once-per-process warn, not silent, not hard-fail. * fix(browse): declare lastConsoleFlushed to restore console-log persistence flushBuffers() references a `lastConsoleFlushed` cursor at server.ts:337 and assigns it at :344, but the `let lastConsoleFlushed = 0;` declaration is missing — only the network and dialog siblings are declared at lines 327-328. Result: every 1-second flushBuffers tick (line 376) throws `ReferenceError: lastConsoleFlushed is not defined`, gets swallowed by the catch at line 369 ("[browse] Buffer flush failed: ..."), and the console branch's append never runs. browse-console.log is never written in any production deployment since this regressed. Discovered by stress-testing the daemon with 15 concurrent CLIs against cold state — the race surfaced the buffer-flush error spam in one spawned daemon's stderr. Verified by running the daemon against a real file:// page with console.log events: in-memory `browse console` returns the entries, but `.gstack/browse-console.log` is never created on disk. Regression introduced by1a100a2a"fix: eliminate duplicate command sets in chain, improve flush perf and type safety" — the flush refactor switched from `Bun.write` to `fs.appendFileSync` and added the `lastConsoleFlushed` cursor pattern alongside its network/dialog siblings, but missed the matching `let` declaration. Tests don't currently exercise flushBuffers, so the regression shipped silently. Fix: - Declare `let lastConsoleFlushed = 0;` next to `lastNetworkFlushed` and `lastDialogFlushed` (browse/src/server.ts:327) - Add a source-level guard test (browse/test/server-flush-trackers.test.ts) that fails any future refactor that adds a fourth `last*Flushed` cursor without the matching declaration. Same pattern as terminal-agent.test.ts and dual-listener.test.ts — read source as text, assert invariant, no daemon required. Test plan: - [x] New regression test fails on current main, passes with the fix - [x] `bun run build` clean - [x] Manual smoke: spawn daemon -> goto file:// page with console.log -> wait 4s -> .gstack/browse-console.log now exists with the expected entries (163 bytes vs zero before) 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(browse): per-process state-file temp path to fix concurrent-write ENOENT The daemon writes `.gstack/browse.json` via the standard atomic-rename pattern: `writeFileSync(tmp, …) → renameSync(tmp, stateFile)`. Four sites in server.ts use this pattern (initial daemon-startup state at :2002, /tunnel/start handler at :1479, BROWSE_TUNNEL=1 inline tunnel update at :2083, BROWSE_TUNNEL_LOCAL_ONLY=1 update at :2113), and all four hard-code the same temp filename `${stateFile}.tmp`. Under concurrent writers the shared filename races on the rename: t0 Writer A: writeFileSync(stateFile + '.tmp', payloadA) t1 Writer B: writeFileSync(stateFile + '.tmp', payloadB) // overwrites A t2 Writer A: renameSync(stateFile + '.tmp', stateFile) // moves B's payload t3 Writer B: renameSync(stateFile + '.tmp', stateFile) // ENOENT — file gone Reproduced empirically with 15 concurrent CLIs against a fresh `.gstack/`: [browse] Failed to start: ENOENT: no such file or directory, rename '…/.gstack/browse.json.tmp' -> '…/.gstack/browse.json' Pre-fix success rate: **0 / 15** under cold-start race. Post-fix success rate: **15 / 15**, zero ENOENT. Fix: - New `tmpStatePath()` helper (server.ts:333) returns `${stateFile}.tmp.${pid}.${randomBytes(4).toString('hex')}` - All 4 call sites use `tmpStatePath()` instead of the shared literal - Atomic rename still gives last-writer-wins semantics on the final state.json content; only behavior change is that concurrent writers no longer kill each other on the rename step Source-level guard test (browse/test/server-tmp-state-path.test.ts) locks two invariants: (1) no remaining `stateFile + '.tmp'` literals, (2) every state-write `writeFileSync` call uses `tmpStatePath()`. Same read-source-as-text pattern as terminal-agent.test.ts and dual-listener.test.ts — no daemon required, runs in tier-1 free. Test plan: - [x] Targeted source-level guard test passes (3 / 0) - [x] `bun run build` clean - [x] Live regression: 15 concurrent CLIs against cold state → 15 / 15 healthy, 0 ENOENT (vs 0 / 15 pre-fix) - [x] No `.tmp.*` orphans left behind after rename succeeds - [x] Related test cluster (server-auth, dual-listener, cdp-mutex, findport) — same pre-existing flakes as `main`, no new regressions introduced 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(browse): clear refs when iframe auto-detaches in getActiveFrameOrPage Asymmetric cleanup between two equivalent staleness conditions: onMainFrameNavigated() → clearRefs() + activeFrame = null ✓ getActiveFrameOrPage() → activeFrame = null (refs NOT cleared) ✗ Both paths see the same staleness condition — refs were captured against a frame that no longer exists. The main-frame path correctly clears both pieces of state. The iframe-detach path nulls the frame but leaves the refMap intact. The lazy click-time check in `resolveRef` (tab-session.ts:97) partially saves us — `entry.locator.count()` on a detached-frame locator throws or returns 0, so the click errors out as "Ref X is stale". But the user has no signal that frame context silently changed underfoot: the next `snapshot` runs against `this.page` (main) while old iframe refs still litter `refMap` with the same role+name keys. New refs collide with stale ones, the resolver picks one at random, the user clicks the wrong element. TODOS.md line 816-820 documents "Detached frame auto-recovery" as a shipped iframe-support feature in v0.12.1.0. This restores the documented intent — the recovery should leave the session in a clean state, not a half-cleared one. Fix: 1 line — add `this.clearRefs()` next to `this.activeFrame = null` inside the if-branch. Test plan: - [x] New regression test: 4/4 pass - refs cleared when getActiveFrameOrPage detects detached iframe - refs preserved when active frame is still attached (no regression) - refs preserved when no frame set (page-level path untouched) - matches onMainFrameNavigated symmetry — both paths reach the same clean end state - [x] `bun run build` clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(codex): resolve python for JSON parser * fix: add fail-fast probe for base branch in ship step 12 * fix(plan-devex-review): remove contradictory plan-mode handshake * fix(design): honor Retry-After header in variants 429 handler Closes #1244. The 429 handler in `generateVariant` discarded the `Retry-After` response header and fell straight through to a local exponential schedule (2s/4s/8s). In image-generation batches, that burns retry attempts inside the provider's cooldown window and the request never recovers. Now we parse `Retry-After` per RFC 7231 — both delta-seconds (`Retry-After: 5`) and HTTP-date (`Retry-After: Fri, 31 Dec 1999 23:59:59 GMT`). Honored waits are capped at 60s to bound stalls from hostile or buggy headers. Delta-seconds are validated as digits-only (rejects `2abc`). When `Retry-After` is honored (including 0 / past-date "retry now"), the next iteration's leading exponential sleep is skipped so we don't double-wait. Invalid or missing headers fall through to the existing exponential schedule unchanged. Behavior matrix: | Header | Behavior | |---------------------------------|-------------------------------------------| | Retry-After: 5 | wait 5s, skip leading on next attempt | | Retry-After: 999999 | capped to 60s, skip leading | | Retry-After: 2abc | invalid, fall through to exponential | | Retry-After: 0 | wait 0, skip leading (retry immediately) | | Retry-After: <past HTTP-date> | wait 0, skip leading | | Retry-After: <future date> | wait diff capped at 60s, skip leading | | no header | fall through to existing exponential | `generateVariant` now accepts an optional `fetchFn` parameter (defaults to `globalThis.fetch`) so tests can inject a stub. Production call sites are unchanged. Tests cover the five behavior buckets above, asserting both the 1st-to-2nd call timing gap and call counts. All five pass in ~8s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(docs): correct per-skill symlink removal snippet in README uninstall Closes #1130. The manual-uninstall fallback in `## Uninstall` → `### Option 2` used `find ~/.claude/skills -maxdepth 1 -type l`, which finds nothing on real installs. Each `~/.claude/skills/<name>/` is a real directory, and only `<name>/SKILL.md` inside it is a symlink into `gstack/`. The find never matched, so the snippet silently removed nothing. Replace with a directory walk that inspects each `<name>/SKILL.md`: find ~/.claude/skills -mindepth 1 -maxdepth 1 -type d ! -name gstack → check $dir/SKILL.md is a symlink → readlink it → if target is gstack/* or */gstack/*: rm -f the link, rmdir the dir (only if empty — preserves any user-added files) Excludes the top-level `gstack/` dir from the walk; that's removed by step 3 of the same uninstall block. `bin/gstack-uninstall` (the script-mode path) already handles the layout correctly via its own walk; only this manual fallback needed updating. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: reject partial browse client env integers * fix(gemini-adapter): detect new ~/.gemini/oauth_creds.json auth path gemini-cli >=0.30 stores OAuth credentials at ~/.gemini/oauth_creds.json instead of the legacy ~/.config/gemini/ directory. The benchmark adapter's availability check now succeeds for users on recent gemini-cli releases who have authenticated via interactive login. Both paths are accepted so users on older versions still work. * fix(browser): add --no-sandbox for root user on Linux/WSL2 Chromium's sandbox can't initialize when running as root on Linux, causing an immediate exit. Extend the existing CI/CONTAINER check to also cover this case, keeping the Windows-safe `typeof getuid` guard. * security: pass cwd to git via execFileSync, not interpolation through /bin/sh `bin/gstack-memory-ingest.ts:632-643` ran `execSync(\`git -C ${JSON.stringify(cwd)} remote get-url origin 2>/dev/null\`, ...)`. JSON.stringify escapes `"` and `\` but not `$` or backticks, so a `cwd` of `"$(touch /tmp/marker)"` survived JSON quoting and detonated under /bin/sh's command-substitution-inside-double-quotes. `cwd` originates from transcript JSONL records under `~/.claude/projects/<encoded-cwd>/<uuid>.jsonl` and `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl`. The walker grabs the first `.cwd` it sees per session. That's an untrusted surface in the gstack threat model — the L1-L6 sidebar security stack exists exactly because agent transcripts can carry attacker-influenced text. Two pivots above the local same-uid bar: (a) prompt-injection appending `cwd="$(...)"` to the active session log turns the next /sync-gbrain run into RCE under the user's uid; (b) cross-machine transcript share (a colleague's `.claude/projects` snippet untar'd into HOME, a documented gbrain dogfooding shape) → RCE on first sync. Fix swaps the one execSync for `execFileSync("git", ["-C", cwd, "remote", "get-url", "origin"], ...)`. No shell, argv passed directly to git. The same module already uses execFileSync for `gbrainAvailable()` (line 762 pre-patch) and `gbrainPutPage()` (line 816 pre-patch) — this single execSync was the outlier. Test: `gstack-memory-ingest security: untrusted cwd cannot trigger shell substitution` plants a Claude-Code-shaped JSONL with cwd=`$(touch <marker>)` and asserts the marker file is not created after `--incremental --quiet`. Negative control: with the patch reverted, the test fails (marker created); with the patch applied, it passes (18/18 in test/gstack-memory-ingest.test.ts). * security: gate domain-skill auto-promote on classifier_score > 0 `browse/src/domain-skill-commands.ts:140` (handleSave) writes `classifier_score: 0` with the comment "L4 deferred to load-time / sidebar-agent fills this in on first prompt-injection load." But CLAUDE.md "Sidebar architecture" documents that sidebar-agent.ts was ripped, and grep for recordSkillUse + classifierFlagged callers across browse/src/ returns zero hits outside the module under test. Net effect: every quarantined skill that survives three benign uses without flag (`recordSkillUse(... , classifierFlagged: false)` x3) auto-promotes to `active` and lands in prompt context wrapped as UNTRUSTED on every subsequent visit to that host. The L4 score that was supposed to gate the promotion was never written — the production save path puts 0 on disk and nothing later updates it. Threat model: a domain-skill body authored by an agent under the influence of a poisoned page (the new `gstackInjectToTerminal` PTY path runs no L1-L3 either) would lose its auto-promote barrier after three uses. The exploit isn't single-step but the bar is exactly N=3 prompt-injection-shaped uses on a hostile page, which is well within reach. Fix adds a single condition to the auto-promote gate in `recordSkillUse`: if (state === 'quarantined' && useCount >= PROMOTE_THRESHOLD && flagCount === 0 && current.classifier_score > 0) { state = 'active'; } `classifier_score` is set once at writeSkill and never updated. Production saves it as 0 (handleSave), so the gate stays closed; existing tests that explicitly pass `classifierScore: 0.1` still auto-promote (the auto-promote path is preserved for the day L4 is rewired). Manual promotion via `domain-skill promote-to-global` is unaffected (it goes through `promoteToGlobal` which has its own state-machine guard at line 337+). Test: new regression case `does NOT auto-promote when classifier_score is 0 (production handleSave shape)` plants a skill with classifierScore=0 (matches domain-skill-commands.ts:140), runs three uses without flag, asserts the skill stays quarantined and readSkill returns null. Negative control: revert the patch, the test fails with `Received: "active"`. With the patch: 15/15 pass. * fix(ship): port #1302 SKILL.md edits to .tmpl + resolver source PR #1302 added Verification Mode + UNVERIFIABLE classification + per-item confirmation gate to ship/SKILL.md, but only the generated SKILL.md was edited — not the .tmpl source or scripts/resolvers/review.ts. The next `bun run gen:skill-docs` run would have wiped the changes. Port the same content into the resolver and .tmpl so regeneration produces the intended output. * ci(windows): extend free-tests lane to cover icacls + Bun.which resolvers from fix-wave PRs Closes #1306/#1307/#1308 validation gap. The four newly-added test files already have process.platform guards so they run safely on both POSIX and Windows lanes — only platform-relevant assertions execute on each. Tests added to the windows-latest lane: - browse/test/file-permissions.test.ts (#1308 icacls + writeSecureFile) - browse/test/security.test.ts (#1306 bash.exe wrap pure-function path) - make-pdf/test/browseClient.test.ts (#1307 Bun.which browse resolver) - make-pdf/test/pdftotext.test.ts (#1307 Bun.which pdftotext resolver) * test(codex): live flag-semantics smoke for codex exec resume Closes #1270's regex-only test gap. PR #1270 asserted that codex/SKILL.md's `codex exec resume` invocation drops -C/-s and uses sandbox_mode config. That regex catches the skill template regressing, but not codex CLI itself flipping flag semantics again. This test probes `codex exec resume --help` and asserts the surface gstack relies on: -c/sandbox_mode is accepted, top-level -C is absent. Skips silently when codex isn't on PATH, so dev machines without codex installed never see it fail. * chore: regen SKILL.md after fix wave One regen commit at the end of the merge wave per the plan. plan-devex-review loses the contradictory plan-mode handshake (#1333). review/SKILL.md picks up the Verification Mode + UNVERIFIABLE classification additions that #1302 authored against ship/SKILL.md (same resolver shared between ship and review modes). * fix(server.ts): keep fs.writeFileSync for state-file writes #1308's writeSecureFile wrapper added Windows icacls hardening for the 4 state-file write sites in server.ts, but #1310's regression test grep's for fs.writeFileSync(tmpStatePath()) calls. The two changes are technically compatible only if the test relaxes — keeping the test strict (the safer choice for catching regressions on the cold-start race) means the 4 state- file sites stay on fs.writeFileSync(..., { mode: 0o600 }). POSIX 0o600 hardening is preserved on those 4 sites. Windows icacls hardening still applies to all the other writeSecureFile call sites #1308 added (auth.json, mkdirSecure, etc.). Also refreshes golden baselines after #1302 / port + minor wording tweak in scripts/resolvers/review.ts to keep gen-skill-docs.test.ts assertion 'Cite the specific file' satisfied. * v1.30.0.0: fix wave — 21 community PRs + 2 closing fixes for Windows + codex CI gaps Headline release. Browse stops dropping console logs, cold-start race fixed, codex resume works without python3, Windows hardening (icacls + Bun.which + bash.exe wrap), ship gate gets VAS-449 remediation, two closing fixes that put icacls/Bun.which/codex flag semantics under CI. * test(domain-skills): cover #1369 classifier_score=0 quarantine + score>0 promote path The pre-existing T6 test seeded skills via writeSkill (which defaults classifier_score to 0 until L4 is rewired) and then expected 3 uses to auto-promote. PR #1369 added `current.classifier_score > 0` to the gate specifically to block that path — a quarantined skill written under the influence of a poisoned page would otherwise auto-promote after three benign uses. Updated test asserts both halves of the new contract: - classifier_score=0 + 3 uses → stays quarantined (the security guarantee) - classifier_score>0 + 3 more uses → promotes to active (unblock path) Catches both regressions: the gate going away (would re-allow the bypass) and the unblock path breaking (would silently quarantine all skills forever once L4 is rewired). --------- Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com> Co-authored-by: orbisai0security <mediratta01.pally@gmail.com> Co-authored-by: Bryce Alan <brycealan.eth@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Terry Carson YM <cym3118288@gmail.com> Co-authored-by: Vasko Ckorovski <vckorovski@gmail.com> Co-authored-by: Samuel Carson <samuel.carson@gmail.com> Co-authored-by: Yashwant Kotipalli <yashwant7kotipalli@gmail.com> Co-authored-by: Jasper Chen <jasperchen925@gmail.com> Co-authored-by: Stefan Neamtu <stefan.neamtu@gmail.com> Co-authored-by: 陈家名 <chenjiaming@kezaihui.com> Co-authored-by: Abigail Atheryon <abi@atheryon.ai> Co-authored-by: Furkan Köykıran <furkankoykiran@gmail.com> Co-authored-by: gus <gustavoraularagon@gmail.com>
781 lines
42 KiB
Markdown
781 lines
42 KiB
Markdown
# gstack development
|
|
|
|
## Commands
|
|
|
|
```bash
|
|
bun install # install dependencies
|
|
bun test # run free tests (browse + snapshot + skill validation)
|
|
bun run test:evals # run paid evals: LLM judge + E2E (diff-based, ~$4/run max)
|
|
bun run test:evals:all # run ALL paid evals regardless of diff
|
|
bun run test:gate # run gate-tier tests only (CI default, blocks merge)
|
|
bun run test:periodic # run periodic-tier tests only (weekly cron / manual)
|
|
bun run test:e2e # run E2E tests only (diff-based, ~$3.85/run max)
|
|
bun run test:e2e:all # run ALL E2E tests regardless of diff
|
|
bun run eval:select # show which tests would run based on current diff
|
|
bun run dev <cmd> # run CLI in dev mode, e.g. bun run dev goto https://example.com
|
|
bun run build # gen docs + compile binaries
|
|
bun run gen:skill-docs # regenerate SKILL.md files from templates
|
|
bun run skill:check # health dashboard for all skills
|
|
bun run dev:skill # watch mode: auto-regen + validate on change
|
|
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
|
|
bun run eval:compare # compare two eval runs (auto-picks most recent)
|
|
bun run eval:summary # aggregate stats across all eval runs
|
|
bun run slop # full slop-scan report (all files)
|
|
bun run slop:diff # slop findings in files changed on this branch only
|
|
```
|
|
|
|
`test:evals` requires `ANTHROPIC_API_KEY`. Codex E2E tests (`test/codex-e2e.test.ts`)
|
|
use Codex's own auth from `~/.codex/` config — no `OPENAI_API_KEY` env var needed.
|
|
|
|
**Where the keys live on this machine.** Conductor workspaces don't inherit the
|
|
user's interactive shell env, so `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` aren't
|
|
in the default process env. Before running any paid eval / E2E, source them from
|
|
`~/.zshrc` (that's where Garry keeps them):
|
|
|
|
```bash
|
|
bash -c '
|
|
eval "$(grep -E "^export (ANTHROPIC_API_KEY|OPENAI_API_KEY)=" ~/.zshrc)"
|
|
export ANTHROPIC_API_KEY OPENAI_API_KEY
|
|
EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-<whatever>.test.ts
|
|
'
|
|
```
|
|
|
|
Do not echo the key value anywhere (stdout, logs, shell history). The grep+eval
|
|
pattern keeps it in process env only. When passing to a test's Agent SDK, do NOT
|
|
pass `env: {...}` to `runAgentSdkTest` — the SDK's auth pipeline doesn't pick up
|
|
the key the same way when env is supplied as an object (confirmed failure mode).
|
|
Instead, mutate `process.env.ANTHROPIC_API_KEY` ambiently before the call and
|
|
restore in `finally`.
|
|
E2E tests stream progress in real-time (tool-by-tool via `--output-format stream-json
|
|
--verbose`). Results are persisted to `~/.gstack-dev/evals/` with auto-comparison
|
|
against the previous run.
|
|
|
|
**Diff-based test selection:** `test:evals` and `test:e2e` auto-select tests based
|
|
on `git diff` against the base branch. Each test declares its file dependencies in
|
|
`test/helpers/touchfiles.ts`. Changes to global touchfiles (session-runner, eval-store,
|
|
touchfiles.ts itself) trigger all tests. Use `EVALS_ALL=1` or the `:all` script
|
|
variants to force all tests. Run `eval:select` to preview which tests would run.
|
|
|
|
**Two-tier system:** Tests are classified as `gate` or `periodic` in `E2E_TIERS`
|
|
(in `test/helpers/touchfiles.ts`). CI runs only gate tests (`EVALS_TIER=gate`);
|
|
periodic tests run weekly via cron or manually. Use `EVALS_TIER=gate` or
|
|
`EVALS_TIER=periodic` to filter. When adding new E2E tests, classify them:
|
|
1. Safety guardrail or deterministic functional test? -> `gate`
|
|
2. Quality benchmark, Opus model test, or non-deterministic? -> `periodic`
|
|
3. Requires external service (Codex, Gemini)? -> `periodic`
|
|
|
|
## Testing
|
|
|
|
```bash
|
|
bun test # run before every commit — free, <2s
|
|
bun run test:evals # run before shipping — paid, diff-based (~$4/run max)
|
|
```
|
|
|
|
`bun test` runs skill validation, gen-skill-docs quality checks, and browse
|
|
integration tests. `bun run test:evals` runs LLM-judge quality evals and E2E
|
|
tests via `claude -p`. Both must pass before creating a PR.
|
|
|
|
## Project structure
|
|
|
|
```
|
|
gstack/
|
|
├── browse/ # Headless browser CLI (Playwright)
|
|
│ ├── src/ # CLI + server + commands
|
|
│ │ ├── commands.ts # Command registry (single source of truth)
|
|
│ │ └── snapshot.ts # SNAPSHOT_FLAGS metadata array
|
|
│ ├── test/ # Integration tests + fixtures
|
|
│ └── dist/ # Compiled binary
|
|
├── hosts/ # Typed host configs (one per AI agent)
|
|
│ ├── claude.ts # Primary host config
|
|
│ ├── codex.ts, factory.ts, kiro.ts # Existing hosts
|
|
│ ├── opencode.ts, slate.ts, cursor.ts, openclaw.ts # IDE hosts
|
|
│ ├── hermes.ts, gbrain.ts # Agent runtime hosts
|
|
│ └── index.ts # Registry: exports all, derives Host type
|
|
├── scripts/ # Build + DX tooling
|
|
│ ├── gen-skill-docs.ts # Template → SKILL.md generator (config-driven)
|
|
│ ├── host-config.ts # HostConfig interface + validator
|
|
│ ├── host-config-export.ts # Shell bridge for setup script
|
|
│ ├── host-adapters/ # Host-specific adapters (OpenClaw tool mapping)
|
|
│ ├── resolvers/ # Template resolver modules (preamble, design, review, gbrain, etc.)
|
|
│ ├── skill-check.ts # Health dashboard
|
|
│ └── dev-skill.ts # Watch mode
|
|
├── test/ # Skill validation + eval tests
|
|
│ ├── helpers/ # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts
|
|
│ ├── fixtures/ # Ground truth JSON, planted-bug fixtures, eval baselines
|
|
│ ├── skill-validation.test.ts # Tier 1: static validation (free, <1s)
|
|
│ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s)
|
|
│ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run)
|
|
│ └── skill-e2e-*.test.ts # Tier 2: E2E via claude -p (~$3.85/run, split by category)
|
|
├── qa-only/ # /qa-only skill (report-only QA, no fixes)
|
|
├── plan-design-review/ # /plan-design-review skill (report-only design audit)
|
|
├── design-review/ # /design-review skill (design audit + fix loop)
|
|
├── ship/ # Ship workflow skill
|
|
├── review/ # PR review skill
|
|
├── plan-ceo-review/ # /plan-ceo-review skill
|
|
├── plan-eng-review/ # /plan-eng-review skill
|
|
├── autoplan/ # /autoplan skill (auto-review pipeline: CEO → design → eng)
|
|
├── benchmark/ # /benchmark skill (performance regression detection)
|
|
├── canary/ # /canary skill (post-deploy monitoring loop)
|
|
├── codex/ # /codex skill (multi-AI second opinion via OpenAI Codex CLI)
|
|
├── land-and-deploy/ # /land-and-deploy skill (merge → deploy → canary verify)
|
|
├── office-hours/ # /office-hours skill (YC Office Hours — startup diagnostic + builder brainstorm)
|
|
├── investigate/ # /investigate skill (systematic root-cause debugging)
|
|
├── retro/ # Retrospective skill (includes /retro global cross-project mode)
|
|
├── bin/ # CLI utilities (gstack-repo-mode, gstack-slug, gstack-config, etc.)
|
|
├── document-release/ # /document-release skill (post-ship doc updates)
|
|
├── cso/ # /cso skill (OWASP Top 10 + STRIDE security audit)
|
|
├── design-consultation/ # /design-consultation skill (design system from scratch)
|
|
├── design-shotgun/ # /design-shotgun skill (visual design exploration)
|
|
├── open-gstack-browser/ # /open-gstack-browser skill (launch GStack Browser)
|
|
├── connect-chrome/ # symlink → open-gstack-browser (backwards compat)
|
|
├── design/ # Design binary CLI (GPT Image API)
|
|
│ ├── src/ # CLI + commands (generate, variants, compare, serve, etc.)
|
|
│ ├── test/ # Integration tests
|
|
│ └── dist/ # Compiled binary
|
|
├── extension/ # Chrome extension (side panel + activity feed + CSS inspector)
|
|
├── lib/ # Shared libraries (worktree.ts)
|
|
├── docs/designs/ # Design documents
|
|
├── setup-deploy/ # /setup-deploy skill (one-time deploy config)
|
|
├── .github/ # CI workflows + Docker image
|
|
│ ├── workflows/ # evals.yml (E2E on Ubicloud), skill-docs.yml, actionlint.yml
|
|
│ └── docker/ # Dockerfile.ci (pre-baked toolchain + Playwright/Chromium)
|
|
├── contrib/ # Contributor-only tools (never installed for users)
|
|
│ └── add-host/ # /gstack-contrib-add-host skill
|
|
├── setup # One-time setup: build binary + symlink skills
|
|
├── SKILL.md # Generated from SKILL.md.tmpl (don't edit directly)
|
|
├── SKILL.md.tmpl # Template: edit this, run gen:skill-docs
|
|
├── ETHOS.md # Builder philosophy (Boil the Lake, Search Before Building)
|
|
└── package.json # Build scripts for browse
|
|
```
|
|
|
|
## SKILL.md workflow
|
|
|
|
SKILL.md files are **generated** from `.tmpl` templates. To update docs:
|
|
|
|
1. Edit the `.tmpl` file (e.g. `SKILL.md.tmpl` or `browse/SKILL.md.tmpl`)
|
|
2. Run `bun run gen:skill-docs` (or `bun run build` which does it automatically)
|
|
3. Commit both the `.tmpl` and generated `.md` files
|
|
|
|
To add a new browse command: add it to `browse/src/commands.ts` and rebuild.
|
|
To add a snapshot flag: add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts` and rebuild.
|
|
|
|
**Token ceiling:** Generated SKILL.md files trip a warning above 160KB (~40K tokens).
|
|
This is a "watch for feature bloat" guardrail, not a hard gate. Modern flagship
|
|
models have 200K-1M context windows, so 40K is 4-20% of window, and prompt caching
|
|
makes the marginal cost of larger skills small. The ceiling exists to catch runaway
|
|
preamble/resolver growth, not to force compression on carefully-tuned big skills
|
|
(`ship`, `plan-ceo-review`, `office-hours` legitimately pack 25-35K tokens of
|
|
behavior). If you blow past 40K, the right fix is usually: (1) look at WHAT grew,
|
|
(2) if one resolver added 10K+ in a single PR, question whether it belongs inline
|
|
or as a reference doc, (3) only compress carefully-tuned prose as a last resort —
|
|
cuts to the coverage audit, review army, or voice directive have real quality cost.
|
|
|
|
**Merge conflicts on SKILL.md files:** NEVER resolve conflicts on generated SKILL.md
|
|
files by accepting either side. Instead: (1) resolve conflicts on the `.tmpl` templates
|
|
and `scripts/gen-skill-docs.ts` (the sources of truth), (2) run `bun run gen:skill-docs`
|
|
to regenerate all SKILL.md files, (3) stage the regenerated files. Accepting one side's
|
|
generated output silently drops the other side's template changes.
|
|
|
|
## Platform-agnostic design
|
|
|
|
Skills must NEVER hardcode framework-specific commands, file patterns, or directory
|
|
structures. Instead:
|
|
|
|
1. **Read CLAUDE.md** for project-specific config (test commands, eval commands, etc.)
|
|
2. **If missing, AskUserQuestion** — let the user tell you or let gstack search the repo
|
|
3. **Persist the answer to CLAUDE.md** so we never have to ask again
|
|
|
|
This applies to test commands, eval commands, deploy commands, and any other
|
|
project-specific behavior. The project owns its config; gstack reads it.
|
|
|
|
## Writing SKILL templates
|
|
|
|
SKILL.md.tmpl files are **prompt templates read by Claude**, not bash scripts.
|
|
Each bash code block runs in a separate shell — variables do not persist between blocks.
|
|
|
|
Rules:
|
|
- **Use natural language for logic and state.** Don't use shell variables to pass
|
|
state between code blocks. Instead, tell Claude what to remember and reference
|
|
it in prose (e.g., "the base branch detected in Step 0").
|
|
- **Don't hardcode branch names.** Detect `main`/`master`/etc dynamically via
|
|
`gh pr view` or `gh repo view`. Use `{{BASE_BRANCH_DETECT}}` for PR-targeting
|
|
skills. Use "the base branch" in prose, `<base>` in code block placeholders.
|
|
- **Keep bash blocks self-contained.** Each code block should work independently.
|
|
If a block needs context from a previous step, restate it in the prose above.
|
|
- **Express conditionals as English.** Instead of nested `if/elif/else` in bash,
|
|
write numbered decision steps: "1. If X, do Y. 2. Otherwise, do Z."
|
|
|
|
## Writing style (V1)
|
|
|
|
Default output from every tier-≥2 skill follows the Writing Style section in
|
|
`scripts/resolvers/preamble.ts`: jargon glossed on first use (curated list in
|
|
`scripts/jargon-list.json`, baked at gen-skill-docs time), questions framed in
|
|
outcome terms ("what breaks for your users if...") not implementation terms,
|
|
short sentences, decisions close with user impact. Power users who want the
|
|
tighter V0 prose set `gstack-config set explain_level terse` (binary switch,
|
|
no middle mode). See `docs/designs/PLAN_TUNING_V1.md` for the full design
|
|
rationale. The review pacing overhaul that originally tried to ride alongside
|
|
writing-style was extracted to V1.1 — see `docs/designs/PACING_UPDATES_V0.md`.
|
|
|
|
## Browser interaction
|
|
|
|
When you need to interact with a browser (QA, dogfooding, cookie setup), use the
|
|
`/browse` skill or run the browse binary directly via `$B <command>`. NEVER use
|
|
`mcp__claude-in-chrome__*` tools — they are slow, unreliable, and not what this
|
|
project uses.
|
|
|
|
**Sidebar architecture:** Before modifying `sidepanel.js`, `background.js`,
|
|
`content.js`, `terminal-agent.ts`, or sidebar-related server endpoints,
|
|
read `docs/designs/SIDEBAR_MESSAGE_FLOW.md`. The sidebar has one primary
|
|
surface — the **Terminal** pane (interactive `claude` PTY) — with
|
|
Activity / Refs / Inspector as debug overlays behind the footer's
|
|
`debug` toggle. The chat queue path was ripped once the PTY proved out;
|
|
`sidebar-agent.ts` and the `/sidebar-command` / `/sidebar-chat` /
|
|
`/sidebar-agent/event` endpoints are gone. The doc covers the WS auth
|
|
flow, dual-token model, and threat-model boundary — silent failures
|
|
here usually trace to not understanding the cross-component flow.
|
|
|
|
**WebSocket auth uses Sec-WebSocket-Protocol, not cookies.** Browsers
|
|
can't set `Authorization` on a WebSocket upgrade, but they CAN set
|
|
`Sec-WebSocket-Protocol` via `new WebSocket(url, [token])`. The agent
|
|
reads it, validates against `validTokens`, and MUST echo the protocol
|
|
back in the upgrade response — without the echo, Chromium closes the
|
|
connection immediately. `Set-Cookie: gstack_pty=...` is kept as a
|
|
fallback for non-browser callers (the cross-port `SameSite=Strict`
|
|
cookie path doesn't survive from a chrome-extension origin).
|
|
|
|
**Cross-pane PTY injection.** The toolbar's Cleanup button and the
|
|
Inspector's "Send to Code" action both pipe text into the live claude
|
|
PTY via `window.gstackInjectToTerminal(text)`, exposed by
|
|
`sidepanel-terminal.js`. No `/sidebar-command` POST — the live REPL is
|
|
the only execution surface in the sidebar now.
|
|
|
|
**`/health` MUST NOT surface any shell-grant token.** It already leaks
|
|
`AUTH_TOKEN` to localhost callers in headed mode (a v1.1+ TODO). Don't
|
|
make that worse by adding the PTY session token there. PTY auth flows
|
|
through `POST /pty-session` only.
|
|
|
|
**Transport-layer security** (v1.6.0.0+). When `pair-agent` starts an ngrok tunnel,
|
|
the daemon binds two HTTP listeners: a local listener (127.0.0.1, full command
|
|
surface, never forwarded) and a tunnel listener (locked allowlist: `/connect`,
|
|
`/command` with a scoped token + 26-command browser-driving allowlist,
|
|
`/sidebar-chat`). ngrok forwards only the tunnel port. Root tokens over the tunnel
|
|
return 403. SSE endpoints use a 30-minute HttpOnly `gstack_sse` cookie minted via
|
|
`POST /sse-session` (never valid against `/command`). Tunnel-surface rejections go
|
|
to `~/.gstack/security/attempts.jsonl` via `tunnel-denial-log.ts`. Before editing
|
|
`server.ts`, `sse-session-cookie.ts`, or `tunnel-denial-log.ts`, read
|
|
[ARCHITECTURE.md](ARCHITECTURE.md#dual-listener-tunnel-architecture-v1600) —
|
|
the module boundary (no imports from `token-registry.ts` into `sse-session-cookie.ts`)
|
|
is load-bearing for scope isolation.
|
|
|
|
**Sidebar security stack** (layered defense against prompt injection):
|
|
|
|
| Layer | Module | Lives in |
|
|
|-------|--------|----------|
|
|
| L1-L3 | `content-security.ts` | both server and agent — datamarking, hidden element strip, ARIA regex, URL blocklist, envelope wrapping |
|
|
| L4 | `security-classifier.ts` (TestSavantAI ONNX) | **sidebar-agent only** |
|
|
| L4b | `security-classifier.ts` (Claude Haiku transcript) | **sidebar-agent only** |
|
|
| L5 | `security.ts` (canary) | both — inject in compiled, check in agent |
|
|
| L6 | `security.ts` (combineVerdict ensemble) | both |
|
|
|
|
**Critical constraint:** `security-classifier.ts` CANNOT be imported from the
|
|
compiled browse binary. `@huggingface/transformers` v4 requires `onnxruntime-node`
|
|
which fails to `dlopen` from Bun compile's temp extract dir. Only `security.ts`
|
|
(pure-string operations — canary, verdict combiner, attack log, status) is safe
|
|
for `server.ts`. See `~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-prompt-injection-guard.md`
|
|
§"Pre-Impl Gate 1 Outcome" for full architectural decision.
|
|
|
|
**Thresholds** (in `security.ts`):
|
|
- `BLOCK: 0.85` — single-layer score that would cause BLOCK if cross-confirmed
|
|
- `WARN: 0.75` — cross-confirm threshold. When L4 AND L4b both >= 0.75 → BLOCK
|
|
- `LOG_ONLY: 0.40` — gates transcript classifier (skip Haiku when all layers < 0.40)
|
|
- `SOLO_CONTENT_BLOCK: 0.92` — single-layer threshold for label-less content classifiers
|
|
(testsavant, deberta). Intentionally higher than `BLOCK` because these layers can't
|
|
distinguish "this is an injection" from "this looks like phishing aimed at the user."
|
|
The transcript classifier keeps a separate, label-gated solo path at `BLOCK` (0.85).
|
|
|
|
**Ensemble rule:** BLOCK only when the ML content classifier AND the transcript
|
|
classifier both report >= WARN. Single-layer high confidence degrades to WARN —
|
|
this is the Stack Overflow instruction-writing FP mitigation. Canary leak
|
|
always BLOCKs (deterministic).
|
|
|
|
**Env knobs:**
|
|
- `GSTACK_SECURITY_OFF=1` — emergency kill switch. Classifier stays off even if
|
|
warmed. Canary is still injected; just the ML scan is skipped.
|
|
- `GSTACK_SECURITY_ENSEMBLE=deberta` — opt-in DeBERTa-v3 ensemble. Adds
|
|
ProtectAI DeBERTa-v3-base-injection-onnx as L4c classifier for cross-model
|
|
agreement. 721MB first-run download. With ensemble enabled, BLOCK requires
|
|
2-of-3 ML classifiers agreeing at >= WARN (testsavant, deberta, transcript).
|
|
Without ensemble (default), BLOCK requires testsavant + transcript at >= WARN.
|
|
- Classifier model cache: `~/.gstack/models/testsavant-small/` (112MB, first run only)
|
|
plus `~/.gstack/models/deberta-v3-injection/` (721MB, only when ensemble enabled)
|
|
- Attack log: `~/.gstack/security/attempts.jsonl` (salted sha256 + domain only,
|
|
rotates at 10MB, 5 generations)
|
|
- Per-device salt: `~/.gstack/security/device-salt` (0600)
|
|
- Session state: `~/.gstack/security/session-state.json` (cross-process, atomic)
|
|
|
|
## Dev symlink awareness
|
|
|
|
When developing gstack, `.claude/skills/gstack` may be a symlink back to this
|
|
working directory (gitignored). This means skill changes are **live immediately**,
|
|
great for rapid iteration, risky during big refactors where half-written skills
|
|
could break other Claude Code sessions using gstack concurrently.
|
|
|
|
**Check once per session:** Run `ls -la .claude/skills/gstack` to see if it's a
|
|
symlink or a real copy. If it's a symlink to your working directory, be aware that:
|
|
- Template changes + `bun run gen:skill-docs` immediately affect all gstack invocations
|
|
- Breaking changes to SKILL.md.tmpl files can break concurrent gstack sessions
|
|
- During large refactors, remove the symlink (`rm .claude/skills/gstack`) so the
|
|
global install at `~/.claude/skills/gstack/` is used instead
|
|
|
|
**Prefix setting:** Setup creates real directories (not symlinks) at the top level
|
|
with a SKILL.md symlink inside (e.g., `qa/SKILL.md -> gstack/qa/SKILL.md`). This
|
|
ensures Claude discovers them as top-level skills, not nested under `gstack/`.
|
|
Names are either short (`qa`) or namespaced (`gstack-qa`), controlled by
|
|
`skill_prefix` in `~/.gstack/config.yaml`. Pass `--no-prefix` or `--prefix` to
|
|
skip the interactive prompt.
|
|
|
|
**Note:** Vendoring gstack into a project's repo is deprecated. Use global install
|
|
+ `./setup --team` instead. See README.md for team mode instructions.
|
|
|
|
**For plan reviews:** When reviewing plans that modify skill templates or the
|
|
gen-skill-docs pipeline, consider whether the changes should be tested in isolation
|
|
before going live (especially if the user is actively using gstack in other windows).
|
|
|
|
**Upgrade migrations:** When a change modifies on-disk state (directory structure,
|
|
config format, stale files) in ways that could break existing user installs, add a
|
|
migration script to `gstack-upgrade/migrations/`. Read CONTRIBUTING.md's "Upgrade
|
|
migrations" section for the format and testing requirements. The upgrade skill runs
|
|
these automatically after `./setup` during `/gstack-upgrade`.
|
|
|
|
## Compiled binaries — NEVER commit browse/dist/ or design/dist/
|
|
|
|
The `browse/dist/` and `design/dist/` directories contain compiled Bun binaries
|
|
(`browse`, `find-browse`, `design`, ~58MB each). These are Mach-O arm64 only — they
|
|
do NOT work on Linux, Windows, or Intel Macs. The `./setup` script already builds
|
|
from source for every platform, so the checked-in binaries are redundant. They are
|
|
tracked by git due to a historical mistake and should eventually be removed with
|
|
`git rm --cached`.
|
|
|
|
**NEVER stage or commit these files.** They show up as modified in `git status`
|
|
because they're tracked despite `.gitignore` — ignore them. When staging files,
|
|
always use specific filenames (`git add file1 file2`) — never `git add .` or
|
|
`git add -A`, which will accidentally include the binaries.
|
|
|
|
## Commit style
|
|
|
|
**Always bisect commits.** Every commit should be a single logical change. When
|
|
you've made multiple changes (e.g., a rename + a rewrite + new tests), split them
|
|
into separate commits before pushing. Each commit should be independently
|
|
understandable and revertable.
|
|
|
|
Examples of good bisection:
|
|
- Rename/move separate from behavior changes
|
|
- Test infrastructure (touchfiles, helpers) separate from test implementations
|
|
- Template changes separate from generated file regeneration
|
|
- Mechanical refactors separate from new features
|
|
|
|
When the user says "bisect commit" or "bisect and push," split staged/unstaged
|
|
changes into logical commits and push.
|
|
|
|
## Slop-scan: AI code quality, not AI code hiding
|
|
|
|
We use [slop-scan](https://github.com/benvinegar/slop-scan) to catch patterns where
|
|
AI-generated code is genuinely worse than what a human would write. We are NOT trying
|
|
to pass as human code. We are AI-coded and proud of it. The goal is code quality.
|
|
|
|
```bash
|
|
npx slop-scan scan . # human-readable report
|
|
npx slop-scan scan . --json # machine-readable for diffing
|
|
```
|
|
|
|
Config: `slop-scan.config.json` at repo root (currently excludes `**/vendor/**`).
|
|
|
|
### What to fix (genuine quality improvements)
|
|
|
|
- **Empty catches around file ops** — use `safeUnlink()` (ignores ENOENT, rethrows
|
|
EPERM/EIO). A swallowed EPERM in cleanup means silent data loss.
|
|
- **Empty catches around process kills** — use `safeKill()` (ignores ESRCH, rethrows
|
|
EPERM). A swallowed EPERM means you think you killed something you didn't.
|
|
- **Redundant `return await`** — remove when there's no enclosing try block. Saves a
|
|
microtask, signals intent.
|
|
- **Typed exception catches** — `catch (err) { if (!(err instanceof TypeError)) throw err }`
|
|
is genuinely better than `catch {}` when the try block does URL parsing or DOM work.
|
|
You know what error you expect, so say so.
|
|
|
|
### What NOT to fix (linter gaming, not quality)
|
|
|
|
- **String-matching on error messages** — `err.message.includes('closed')` is brittle.
|
|
Playwright/Chrome can change wording anytime. If a fire-and-forget operation can fail
|
|
for ANY reason and you don't care, `catch {}` is the correct pattern.
|
|
- **Adding comments to exempt pass-through wrappers** — "alias for active session" above
|
|
a method just to trip slop-scan's exemption rule is noise, not documentation.
|
|
- **Converting extension catch-and-log to selective rethrow** — Chrome extensions crash
|
|
entirely on uncaught errors. If the catch logs and continues, that IS the right pattern
|
|
for extension code. Don't make it throw.
|
|
- **Tightening best-effort cleanup paths** — shutdown, emergency cleanup, and disconnect
|
|
code should use `safeUnlinkQuiet()` (swallows ALL errors). A cleanup path that throws
|
|
on EPERM means the rest of cleanup doesn't run. That's worse.
|
|
|
|
### Utilities in `browse/src/error-handling.ts`
|
|
|
|
| Function | Use when | Behavior |
|
|
|----------|----------|----------|
|
|
| `safeUnlink(path)` | Normal file deletion | Ignores ENOENT, rethrows others |
|
|
| `safeUnlinkQuiet(path)` | Shutdown/emergency cleanup | Swallows all errors |
|
|
| `safeKill(pid, signal)` | Sending signals | Ignores ESRCH, rethrows others |
|
|
| `isProcessAlive(pid)` | Boolean process checks | Returns true/false, never throws |
|
|
|
|
### Score tracking
|
|
|
|
Baseline (2026-04-09, before cleanup): 100 findings, 432.8 score, 2.38 score/file.
|
|
After cleanup: 90 findings, 358.1 score, 1.96 score/file.
|
|
|
|
Don't chase the number. Fix patterns that represent actual code quality problems.
|
|
Accept findings where the "sloppy" pattern is the correct engineering choice.
|
|
|
|
## Community PR guardrails
|
|
|
|
When reviewing or merging community PRs, **always AskUserQuestion** before accepting
|
|
any commit that:
|
|
|
|
1. **Touches ETHOS.md** — this file is Garry's personal builder philosophy. No edits
|
|
from external contributors or AI agents, period.
|
|
2. **Removes or softens promotional material** — YC references, founder perspective,
|
|
and product voice are intentional. PRs that frame these as "unnecessary" or
|
|
"too promotional" must be rejected.
|
|
3. **Changes Garry's voice** — the tone, humor, directness, and perspective in skill
|
|
templates, CHANGELOG, and docs are not generic. PRs that rewrite voice to be
|
|
more "neutral" or "professional" must be rejected.
|
|
|
|
Even if the agent strongly believes a change improves the project, these three
|
|
categories require explicit user approval via AskUserQuestion. No exceptions.
|
|
No auto-merging. No "I'll just clean this up."
|
|
|
|
## CHANGELOG + VERSION style
|
|
|
|
**Versioning invariant (workspace-aware ship).** VERSION is a monotonic ordered
|
|
release identifier, not a strict semver commitment. The bump level
|
|
(major/minor/patch/micro) expresses intent at ship time. Queue-advancing past a
|
|
claimed version within the same bump level is explicitly permitted — if branch A
|
|
claims v1.7.0.0 as a MINOR and branch B is also a MINOR, B lands at v1.8.0.0
|
|
(still a MINOR relative to main). Downstream consumers must NOT rely on
|
|
"MINOR = feature-only, PATCH = fix-only" as a strict contract. This is why
|
|
`bin/gstack-next-version` advances within the chosen bump level rather than
|
|
repicking the level when collisions happen.
|
|
|
|
**Scale-aware bumps — use common sense.** When the diff is big, bump MINOR (or
|
|
MAJOR), not PATCH. PATCH is for bug fixes and small additions; MINOR is for
|
|
substantial new capability or substantial reduction; MAJOR is for breaking
|
|
changes. Rough guideposts (don't treat as rules, treat as smell-checks):
|
|
|
|
- **PATCH (X.Y.Z+1.0)**: bug fix, doc tweak, small additive change, single
|
|
test/file added. Net diff under ~500 lines, no new user-facing capability.
|
|
- **MINOR (X.Y+1.0.0)**: new capability shipped (skill, harness, command, big
|
|
refactor), substantial code reduction (compression, migration), or coordinated
|
|
multi-file change. Net diff over ~2000 lines added/removed, OR a user-visible
|
|
feature you'd put in a tweet.
|
|
- **MAJOR (X+1.0.0.0)**: breaking change to public surface (CLI flag rename,
|
|
skill removed, config format changed), OR a release big enough to be the
|
|
headline of a blog post.
|
|
|
|
If you find yourself debating "is 10K added + 24K removed really a PATCH?" — it
|
|
isn't. Bump MINOR. Same for "this adds a whole new test harness with 6 new E2E
|
|
tests + helper utilities" — MINOR. The bump level is communication to the user
|
|
about what kind of release this is; don't undersell it.
|
|
|
|
When merging origin/main brings a higher VERSION, re-evaluate the bump level
|
|
against the SCALE of your branch's work, not just whether main moved forward.
|
|
If main bumped MINOR and your branch is also a substantial change, you bump
|
|
MINOR again on top (e.g., main at v1.14.0.0, your branch lands v1.15.0.0).
|
|
|
|
**VERSION and CHANGELOG are branch-scoped.** Every feature branch that ships gets its
|
|
own version bump and CHANGELOG entry. The entry describes what THIS branch adds —
|
|
not what was already on main.
|
|
|
|
**The CHANGELOG entry is the diff between main and the shipping branch — what users
|
|
get when they upgrade. NOT how the branch got there.** A reader landing on the entry
|
|
should learn what they can do now that they couldn't before; they should not learn
|
|
about the branch's internal version bumps, the bugs we caught and fixed mid-branch,
|
|
the plan reviews we ran, or the commits we squashed. That is branch development
|
|
narrative. It belongs in PR descriptions and commit messages, not CHANGELOG.
|
|
|
|
**Never reference branch-internal versions in a CHANGELOG entry.** If your branch
|
|
bumped VERSION from v1.5.0.0 → v1.5.1.0 → v1.6.0.0 during development and only the
|
|
final v1.6.0.0 ships to main, the entry must read as if v1.5.1.0 never existed.
|
|
Concretely, NEVER write:
|
|
- "v1.5.1.0 had a bug that v1.6.0.0 fixes" — readers don't know about v1.5.1.0; it's
|
|
a branch-internal artifact.
|
|
- "The shipping headline of v1.5.1.0 was broken because..." — same reason. From main's
|
|
perspective, v1.5.1.0 was never released.
|
|
- "Pre-fix tests encoded the broken behavior" — that's a contributor's victory lap,
|
|
not a user benefit.
|
|
- "Two surgical edits, both in the dispatch path" — micro-narrative of the patch.
|
|
|
|
Instead, describe the released system: "Browser-skills run end-to-end with the
|
|
expected tab-access semantics." If a property of the shipped system is worth calling
|
|
out (e.g., "skill spawns get permissive tab access; pair-agent tunnel tokens require
|
|
ownership"), document it as a property, not as a fix. The shipped system is what
|
|
the user gets; the path to that system is invisible to them.
|
|
|
|
**When to write the CHANGELOG entry:**
|
|
- At `/ship` time (Step 13), not during development or mid-branch.
|
|
- The entry covers ALL commits on this branch vs the base branch.
|
|
- Never fold new work into an existing CHANGELOG entry from a prior version that
|
|
already landed on main. If main has v0.10.0.0 and your branch adds features,
|
|
bump to v0.10.1.0 with a new entry — don't edit the v0.10.0.0 entry.
|
|
|
|
**Key questions before writing:**
|
|
1. What branch am I on? What did THIS branch change?
|
|
2. Is the base branch version already released? (If yes, bump and create new entry.)
|
|
3. Does an existing entry on this branch already cover earlier work? (If yes, replace
|
|
it with one unified entry for the final version.)
|
|
|
|
**Merging main does NOT mean adopting main's version.** When you merge origin/main into
|
|
a feature branch, main may bring new CHANGELOG entries and a higher VERSION. Your branch
|
|
still needs its OWN version bump on top. If main is at v0.13.8.0 and your branch adds
|
|
features, bump to v0.13.9.0 with a new entry. Never jam your changes into an entry that
|
|
already landed on main. Your entry goes on top because your branch lands next.
|
|
|
|
**After merging main, always check:**
|
|
- Does CHANGELOG have your branch's own entry separate from main's entries?
|
|
- Is VERSION higher than main's VERSION?
|
|
- Is your entry the topmost entry in CHANGELOG (above main's latest)?
|
|
If any answer is no, fix it before continuing.
|
|
|
|
**After any CHANGELOG edit that moves, adds, or removes entries,** immediately run
|
|
`grep "^## \[" CHANGELOG.md` to verify no duplicates and a sensible reverse-chronological
|
|
order. Gaps between version numbers are fine. A branch that ships at v1.6.4.0 without
|
|
a prior v1.5.2.0 or v1.5.3.0 entry on main is correct — those were branch-internal
|
|
version numbers that never landed. Do not back-fill gaps with placeholder entries.
|
|
|
|
**Never orphan branch-internal versions.** If your branch bumped VERSION several times
|
|
during development (v1.5.1.0 → v1.5.2.0 → v1.6.4.0, say) and those earlier entries were
|
|
never released to main, the final ship consolidates ALL of them into a single entry at
|
|
the final version (v1.6.4.0). Collapse them — delete the old entries and move their
|
|
content into the final entry, re-version table columns accordingly. Readers see one
|
|
release, not a branch diary. Gaps are fine (v1.6.3.0 → v1.6.4.0 with no v1.5.x
|
|
in between on main is correct).
|
|
|
|
CHANGELOG.md is **for users**, not contributors. Write it like product release notes:
|
|
|
|
- Lead with what the user can now **do** that they couldn't before. Sell the feature.
|
|
- Use plain language, not implementation details. "You can now..." not "Refactored the..."
|
|
- **Never mention TODOS.md, internal tracking, eval infrastructure, or contributor-facing
|
|
details.** These are invisible to users and meaningless to them.
|
|
- Put contributor/internal changes in a separate "For contributors" section at the bottom.
|
|
- Every entry should make someone think "oh nice, I want to try that."
|
|
- No jargon: say "every question now tells you which project and branch you're in" not
|
|
"AskUserQuestion format standardized across skill templates via preamble resolver."
|
|
|
|
**Only document what shipped between main and this change.** Readers do not care how
|
|
we got here. Keep out of the CHANGELOG, always:
|
|
|
|
- Branch resyncs, merge commits with main, rebase activity.
|
|
- Plan approvals, review outcomes (CEO / eng / design / outside-voice / codex findings),
|
|
AskUserQuestion decisions, scope negotiations.
|
|
- "Work queued," "plan approved," "in-progress," "will ship later" — the CHANGELOG
|
|
documents what DID ship, not what MIGHT ship.
|
|
- Version-bump housekeeping when no user-facing work actually landed.
|
|
|
|
If the diff between the base branch version and this version has no user-facing change
|
|
(only merges, only CHANGELOG edits, only placeholder work), the honest entry is one
|
|
sentence: "Version bump for branch-ahead discipline. No user-facing changes yet." Stop
|
|
there. Do not pad. Do not explain the plan that will ship eventually. Do not narrate
|
|
the branch's history. When real work lands, the entry will replace this at /ship time.
|
|
|
|
### Release-summary format (every `## [X.Y.Z]` entry)
|
|
|
|
Every version entry in `CHANGELOG.md` MUST start with a release-summary section in
|
|
the GStack/Garry voice, one viewport's worth of prose + tables that lands like a
|
|
verdict, not marketing. The itemized changelog (subsections, bullets, files) goes
|
|
BELOW that summary, separated by a `### Itemized changes` header.
|
|
|
|
The release-summary section gets read by humans, by the auto-update agent, and by
|
|
anyone deciding whether to upgrade. The itemized list is for agents that need to
|
|
know exactly what changed.
|
|
|
|
Structure for the top of every `## [X.Y.Z]` entry:
|
|
|
|
1. **Two-line bold headline** (10-14 words total). Should land like a verdict, not
|
|
marketing. Sound like someone who shipped today and cares whether it works.
|
|
2. **Lead paragraph** (3-5 sentences). What shipped, what changed for the user.
|
|
Specific, concrete, no AI vocabulary, no em dashes, no hype.
|
|
3. **A "The X numbers that matter" section** with:
|
|
- One short setup paragraph naming the source of the numbers (real production
|
|
deployment OR a reproducible benchmark, name the file/command to run).
|
|
- A table of 3-6 key metrics with BEFORE / AFTER / Δ columns.
|
|
- A second optional table for per-category breakdown if relevant.
|
|
- 1-2 sentences interpreting the most striking number in concrete user terms.
|
|
4. **A "What this means for [audience]" closing paragraph** (2-4 sentences) tying
|
|
the metrics to a real workflow shift. End with what to do.
|
|
|
|
Voice rules for the release summary:
|
|
- No em dashes (use commas, periods, "...").
|
|
- No AI vocabulary (delve, robust, comprehensive, nuanced, fundamental, etc.) or
|
|
banned phrases ("here's the kicker", "the bottom line", etc.).
|
|
- Real numbers, real file names, real commands. Not "fast" but "~30s on 30K pages."
|
|
- Short paragraphs, mix one-sentence punches with 2-3 sentence runs.
|
|
- Connect to user outcomes: "the agent does ~3x less reading" beats "improved precision."
|
|
- Be direct about quality. "Well-designed" or "this is a mess." No dancing.
|
|
|
|
Source material:
|
|
- CHANGELOG previous entry for prior context.
|
|
- Benchmark files or `/retro` output for headline numbers.
|
|
- Recent commits (`git log <prev-version>..HEAD --oneline`) for what shipped.
|
|
- Don't make up numbers. If a metric isn't in a benchmark or production data,
|
|
don't include it. Say "no measurement yet" if asked.
|
|
|
|
Target length: ~250-350 words for the summary. Should render as one viewport.
|
|
|
|
### Itemized changes (below the release summary)
|
|
|
|
Write `### Itemized changes` and continue with the detailed subsections (Added,
|
|
Changed, Fixed, For contributors). Same rules as the user-facing voice guidance
|
|
above, plus:
|
|
|
|
- **Always credit community contributions.** When an entry includes work from a
|
|
community PR, name the contributor with `Contributed by @username`. Contributors
|
|
did real work. Thank them publicly every time, no exceptions.
|
|
|
|
## AI effort compression
|
|
|
|
When estimating or discussing effort, always show both human-team and CC+gstack time:
|
|
|
|
| Task type | Human team | CC+gstack | Compression |
|
|
|-----------|-----------|-----------|-------------|
|
|
| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
|
|
| Test writing | 1 day | 15 min | ~50x |
|
|
| Feature implementation | 1 week | 30 min | ~30x |
|
|
| Bug fix + regression test | 4 hours | 15 min | ~20x |
|
|
| Architecture / design | 2 days | 4 hours | ~5x |
|
|
| Research / exploration | 1 day | 3 hours | ~3x |
|
|
|
|
Completeness is cheap. Don't recommend shortcuts when the complete implementation
|
|
is a "lake" (achievable) not an "ocean" (multi-quarter migration). See the
|
|
Completeness Principle in the skill preamble for the full philosophy.
|
|
|
|
## Search before building
|
|
|
|
Before designing any solution that involves concurrency, unfamiliar patterns,
|
|
infrastructure, or anything where the runtime/framework might have a built-in:
|
|
|
|
1. Search for "{runtime} {thing} built-in"
|
|
2. Search for "{thing} best practice {current year}"
|
|
3. Check official runtime/framework docs
|
|
|
|
Three layers of knowledge: tried-and-true (Layer 1), new-and-popular (Layer 2),
|
|
first-principles (Layer 3). Prize Layer 3 above all. See ETHOS.md for the full
|
|
builder philosophy.
|
|
|
|
## Local plans
|
|
|
|
Contributors can store long-range vision docs and design documents in `~/.gstack-dev/plans/`.
|
|
These are local-only (not checked in). When reviewing TODOS.md, check `plans/` for candidates
|
|
that may be ready to promote to TODOs or implement.
|
|
|
|
## E2E eval failure blame protocol
|
|
|
|
When an E2E eval fails during `/ship` or any other workflow, **never claim "not
|
|
related to our changes" without proving it.** These systems have invisible couplings —
|
|
a preamble text change affects agent behavior, a new helper changes timing, a
|
|
regenerated SKILL.md shifts prompt context.
|
|
|
|
**Required before attributing a failure to "pre-existing":**
|
|
1. Run the same eval on main (or base branch) and show it fails there too
|
|
2. If it passes on main but fails on the branch — it IS your change. Trace the blame.
|
|
3. If you can't run on main, say "unverified — may or may not be related" and flag it
|
|
as a risk in the PR body
|
|
|
|
"Pre-existing" without receipts is a lazy claim. Prove it or don't say it.
|
|
|
|
## Long-running tasks: don't give up
|
|
|
|
When running evals, E2E tests, or any long-running background task, **poll until
|
|
completion**. Use `sleep 180 && echo "ready"` + `TaskOutput` in a loop every 3
|
|
minutes. Never switch to blocking mode and give up when the poll times out. Never
|
|
say "I'll be notified when it completes" and stop checking — keep the loop going
|
|
until the task finishes or the user tells you to stop.
|
|
|
|
The full E2E suite can take 30-45 minutes. That's 10-15 polling cycles. Do all of
|
|
them. Report progress at each check (which tests passed, which are running, any
|
|
failures so far). The user wants to see the run complete, not a promise that
|
|
you'll check later.
|
|
|
|
## E2E test fixtures: extract, don't copy
|
|
|
|
**NEVER copy a full SKILL.md file into an E2E test fixture.** SKILL.md files are
|
|
1500-2000 lines. When `claude -p` reads a file that large, context bloat causes
|
|
timeouts, flaky turn limits, and tests that take 5-10x longer than necessary.
|
|
|
|
Instead, extract only the section the test actually needs:
|
|
|
|
```typescript
|
|
// BAD — agent reads 1900 lines, burns tokens on irrelevant sections
|
|
fs.copyFileSync(path.join(ROOT, 'ship', 'SKILL.md'), path.join(dir, 'ship-SKILL.md'));
|
|
|
|
// GOOD — agent reads ~60 lines, finishes in 38s instead of timing out
|
|
const full = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
|
|
const start = full.indexOf('## Review Readiness Dashboard');
|
|
const end = full.indexOf('\n---\n', start);
|
|
fs.writeFileSync(path.join(dir, 'ship-SKILL.md'), full.slice(start, end > start ? end : undefined));
|
|
```
|
|
|
|
Also when running targeted E2E tests to debug failures:
|
|
- Run in **foreground** (`bun test ...`), not background with `&` and `tee`
|
|
- Never `pkill` running eval processes and restart — you lose results and waste money
|
|
- One clean run beats three killed-and-restarted runs
|
|
|
|
## Publishing native OpenClaw skills to ClawHub
|
|
|
|
Native OpenClaw skills live in `openclaw/skills/gstack-openclaw-*/SKILL.md`. These are
|
|
hand-crafted methodology skills (not generated by the pipeline) published to ClawHub
|
|
so any OpenClaw user can install them.
|
|
|
|
**Publishing:** The command is `clawhub publish` (NOT `clawhub skill publish`):
|
|
|
|
```bash
|
|
clawhub publish openclaw/skills/gstack-openclaw-office-hours \
|
|
--slug gstack-openclaw-office-hours --name "gstack Office Hours" \
|
|
--version 1.0.0 --changelog "description of changes"
|
|
```
|
|
|
|
Repeat for each skill: `gstack-openclaw-ceo-review`, `gstack-openclaw-investigate`,
|
|
`gstack-openclaw-retro`. Bump `--version` on each update.
|
|
|
|
**Auth:** `clawhub login` (opens browser for GitHub auth). `clawhub whoami` to verify.
|
|
|
|
**Updating:** Same `clawhub publish` command with a higher `--version` and `--changelog`.
|
|
|
|
**Verification:** `clawhub search gstack` to confirm they're live.
|
|
|
|
## Deploying to the active skill
|
|
|
|
The active skill lives at `~/.claude/skills/gstack/`. After making changes:
|
|
|
|
1. Push your branch
|
|
2. Fetch and reset in the skill directory: `cd ~/.claude/skills/gstack && git fetch origin && git reset --hard origin/main`
|
|
3. Rebuild: `cd ~/.claude/skills/gstack && bun run build`
|
|
|
|
Or copy the binaries directly:
|
|
- `cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse`
|
|
- `cp design/dist/design ~/.claude/skills/gstack/design/dist/design`
|
|
|
|
## Skill routing
|
|
|
|
When the user's request matches an available skill, invoke it via the Skill tool. When in doubt, invoke the skill.
|
|
|
|
Key routing rules:
|
|
- Product ideas/brainstorming → invoke /office-hours
|
|
- Strategy/scope → invoke /plan-ceo-review
|
|
- Architecture → invoke /plan-eng-review
|
|
- Design system/plan review → invoke /design-consultation or /plan-design-review
|
|
- Full review pipeline → invoke /autoplan
|
|
- Bugs/errors → invoke /investigate
|
|
- QA/testing site behavior → invoke /qa or /qa-only
|
|
- Code review/diff check → invoke /review
|
|
- Visual polish → invoke /design-review
|
|
- Ship/deploy/PR → invoke /ship or /land-and-deploy
|
|
- Save progress → invoke /context-save
|
|
- Resume context → invoke /context-restore
|