mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-10 06:28:23 +08:00
* fix(codex): use resume-compatible flags * fix: V-001 security vulnerability Automated security fix generated by Orbis Security AI * docs: align prompt-injection thresholds to security.ts (v1.6.4.0 catch-up) CLAUDE.md:290 and ARCHITECTURE.md:159 were missed when WARN was bumped 0.60 → 0.75 ind75402bb(v1.6.4.0, "cut Haiku classifier FP from 44% to 23%, gate now enforced", #1135). browse/src/security.ts:37 has WARN: 0.75 and BROWSER.md:743 was updated alongside that commit; CLAUDE.md and ARCHITECTURE.md still read 0.60. Also adds the SOLO_CONTENT_BLOCK: 0.92 entry to CLAUDE.md (already in security.ts:50 and BROWSER.md:745, missing from CLAUDE.md's threshold table). No code change. No behavior change. Pure doc-vs-code alignment. Verification: $ grep -n "WARN" browse/src/security.ts CLAUDE.md ARCHITECTURE.md BROWSER.md browse/src/security.ts:37: WARN: 0.75, CLAUDE.md:290: - \`WARN: 0.75\` ... ARCHITECTURE.md:159: ...>= \`WARN\` (0.75)... BROWSER.md:743: - \`WARN: 0.75\` ... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: Korean/CJK IME input and rendering in Sidebar Terminal Fixes #1272 This commit addresses three separate Korean/CJK bugs in the Sidebar Terminal: **Bug 1 - IME Input**: Korean text typed via IME composition was not reaching the PTY correctly. Added compositionstart/compositionend event listeners to suppress partial jamo fragments and only send the final composed string. **Bug 2a - Font Rendering**: Added CJK monospace font fallbacks ("Noto Sans Mono CJK KR", "Malgun Gothic") to both the xterm.js fontFamily config and the CSS --font-mono variable. This ensures consistent cell-width calculations for Korean characters. **Bug 2b - UTF-8 Boundary Detection**: Added buffering logic to prevent multi-byte UTF-8 characters (Korean is 3 bytes) from being split across WebSocket chunks. This follows the same pattern as PR #1007 which fixed the sidebar-agent path, but extends it to the terminal-agent path. Special thanks to @ldybob for the excellent root cause analysis and proposed solutions in issue #1272. Tested on WSL2 + Windows 11 with Korean IME. * fix(ship): tighten Plan Completion gate (VAS-449 remediation) VAS-446 shipped with a PLAN.md acceptance criterion (domain-hq has /docs/dashboard.md) silently skipped. /ship's Plan Completion subagent existed at ship time (added in v1.4.1.0) but the gate let the failure through. Four structural fixes: 1. Path concreteness rule: items naming a concrete filesystem path MUST be classified DONE/NOT DONE via [ -f <path> ], never UNVERIFIABLE. 2. Validator detection: CONTENT-SHAPE items scan target repo's package.json for validate-* scripts and run them before falling back to UNVERIFIABLE. 3. Per-item UNVERIFIABLE confirmation: replaces blanket "I've checked each one" with per-item Y/N/D loop. The blanket-confirm path is the exact failure VAS-449 surfaced. 4. Subagent fail-closed: if Plan Completion subagent + inline fallback both fail, surface explicit AskUserQuestion instead of silent pass. Replaces the prior "Never block /ship on subagent failure" fail-open. Locked in by test/ship-plan-completion-invariants.test.ts (5 assertions, no LLM dependency, ~60ms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(browse): bash.exe wrap for telemetry on Windows reportAttemptTelemetry() in browse/src/security.ts calls spawn(bin, args) where bin is the gstack-telemetry-log bash script. On Windows this fails silently with ENOENT — CreateProcess can't dispatch on shebang lines. Adopts v1.24.0.0's Bun.which + GSTACK_*_BIN override pattern (from browse/src/claude-bin.ts:resolveClaudeCommand, introduced in #1252) for resolving bash.exe. resolveBashBinary() honors GSTACK_BASH_BIN absolute-path or PATH-resolvable override, falling back to Bun.which('bash') which finds Git Bash on the standard Windows install. buildTelemetrySpawnCommand() wraps the script invocation on win32 only; POSIX path is bit-identical. Returns null when bash can't be resolved on Windows so caller skips spawn — local attempts.jsonl audit trail keeps working without surfacing a Windows-only failure. 8 new unit tests cover resolveBashBinary (POSIX bash, absolute override, quote-stripping, BASH_BIN fallback, empty-PATH null) and buildTelemetrySpawnCommand (POSIX pass-through, win32 bash wrap, win32 null on unresolvable, arg-array immutability). POSIX path is bit-identical — Bun.which('bash') on Linux/macOS returns the same /bin/bash or /usr/bin/bash that the old hardcoded spawn relied on. * fix(make-pdf): Bun.which-based binary resolution for browse + pdftotext on Windows Extends v1.24.0.0's Bun.which + GSTACK_*_BIN override pattern (introduced in browse/src/claude-bin.ts via #1252) to the two other binary resolvers in the codebase: make-pdf/src/browseClient.ts:resolveBrowseBin and make-pdf/src/pdftotext.ts:resolvePdftotext. Same Windows quirks (fs.accessSync(X_OK) degrades to existence-check; `which` isn't available outside Git Bash; bun --compile --outfile X emits X.exe), same Bun.which-based fix shape, same env override convention. Changes: - GSTACK_BROWSE_BIN / GSTACK_PDFTOTEXT_BIN as the v1.24-aligned overrides; BROWSE_BIN / PDFTOTEXT_BIN remain as back-compat aliases. - Bun.which() replaces execFileSync('which', ...) for PATH lookup. Handles Windows PATHEXT natively; no more `where`-vs-`which` branch. - findExecutable(base) helper exported from each module, probes .exe/.cmd/.bat after the bare-path miss on win32. Linux/macOS behavior is bit-identical (isExecutable short-circuits before the win32 branch ever runs). - macCandidates renamed posixCandidates (always was — /opt/homebrew, /usr/local, /usr/bin). No Windows candidates added; Poppler installs scatter across Scoop/Chocolatey/portable zips and guessing causes false positives. - Error messages get a Windows install hint (scoop install poppler / oschwartz10612) and `setx` example for GSTACK_*_BIN. - Pre-existing test 'honors BROWSE_BIN when it points at a real executable' was hardcoded /bin/sh — made cross-platform via a REAL_EXE constant (cmd.exe on win32, /bin/sh on POSIX). Was a Windows-CI blocker on its own. Coordination: PR #1094 (@BkashJEE) covered browseClient.ts independently with a narrower scope; this PR's pdftotext + cross-platform tests + GSTACK_*_BIN naming are additive. Either order of merge works. Test plan: - bun test make-pdf/test/browseClient.test.ts make-pdf/test/pdftotext.test.ts on win32 — 29 pass, 0 fail (12 new assertions: findExecutable POSIX/win32/null, resolveBrowseBin GSTACK_BROWSE_BIN + BROWSE_BIN + precedence + quote-strip, same shape for resolvePdftotext + Windows install hint in error message). - POSIX branch unchanged — fs.accessSync(X_OK) on Linux/macOS short-circuits before any win32 logic runs, matching the v1.24 claude-bin.ts pattern. * fix(browse): NTFS ACL hardening for Windows state files via icacls gstack's ~/.gstack/ state directory holds bearer tokens, canary tokens, agent queue contents (with prompt history), session state, security-decision logs, and saved cookie bundles — all written with { mode: 0o600 } / 0o700. On Windows, those mode bits are a silent no-op: Node's fs module doesn't translate POSIX modes to NTFS ACLs, and inherited ACLs leave every "restricted" file readable by other principals on the machine (verified via icacls — six ACEs, the intended user is the LAST of six). Threat model is non-trivial on: - Self-hosted CI runners (different service account on the same Windows box can read developer tokens, canary tokens, prompt history) - Shared development machines (agencies, studios, lab environments) - Multi-tenant servers with shared home directories Orthogonal to v1.24.0.0's binary-resolution work — complementary at the write side. v1.24's bin/gstack-paths resolves ~/.gstack/ correctly across plugin / global / local installs; this PR ensures files written into those resolved paths actually get the POSIX 0o600 semantic translated to NTFS. The fix: - New browse/src/file-permissions.ts (158 LOC, 5 public + 1 test-reset). restrictFilePermissions / restrictDirectoryPermissions wrap chmod (POSIX) or icacls /inheritance:r /grant:r <user>:(F) (Windows). writeSecureFile / appendSecureFile / mkdirSecure are drop-in wrappers for the common patterns. - 19 call sites converted across 9 source files: browser-manager.ts, browser-skill-write.ts, cli.ts, config.ts, meta-commands.ts, security-classifier.ts, security.ts (4 sites), server.ts (5 sites), terminal-agent.ts (8 sites), tunnel-denial-log.ts. - (OI)(CI) inheritance flags on directories mean files created via fs.write* *inside* an mkdirSecure-created dir inherit the owner-only ACL automatically — important for tunnel-denial-log.ts where appends use async fsp.appendFile. Error handling: icacls failures (nonexistent path, missing icacls.exe, hardened environments) log a one-shot warning to stderr and proceed. Once-per-process gating prevents log spam if the condition persists. Filesystem stays functional; the file just ends up with inherited ACLs. Test plan: - bun test browse/test/file-permissions.test.ts — 13 pass, 0 fail (POSIX mode-bit assertions, Windows no-throw, mkdir idempotence, recursive creation, Buffer payloads, append-creates-then-reapplies-once semantics) - bun test browse/test/security.test.ts — 38 pass, 0 fail (existing security test suite plus the bash-binary resolution tests added in fix #1119; the converted writeFileSync/appendFileSync/mkdirSync sites in security.ts integrate cleanly) - Empirical icacls before/after on a real file — 6 ACEs → 1 ACE - bun build typecheck on all modified files — clean (server.ts has a pre-existing playwright-core/electron resolution issue unrelated to this PR) POSIX behavior is bit-identical to old code — fs.chmodSync(path, 0o6XX) on the helper's POSIX branch matches the inline { mode: 0o6XX } it replaces. Linux and macOS see no behavior change. Inviting pushback on three judgment calls (in PR description): 1. icacls vs npm library 2. ACL scope — just user, or user + SYSTEM? 3. Graceful degradation — once-per-process warn, not silent, not hard-fail. * fix(browse): declare lastConsoleFlushed to restore console-log persistence flushBuffers() references a `lastConsoleFlushed` cursor at server.ts:337 and assigns it at :344, but the `let lastConsoleFlushed = 0;` declaration is missing — only the network and dialog siblings are declared at lines 327-328. Result: every 1-second flushBuffers tick (line 376) throws `ReferenceError: lastConsoleFlushed is not defined`, gets swallowed by the catch at line 369 ("[browse] Buffer flush failed: ..."), and the console branch's append never runs. browse-console.log is never written in any production deployment since this regressed. Discovered by stress-testing the daemon with 15 concurrent CLIs against cold state — the race surfaced the buffer-flush error spam in one spawned daemon's stderr. Verified by running the daemon against a real file:// page with console.log events: in-memory `browse console` returns the entries, but `.gstack/browse-console.log` is never created on disk. Regression introduced by1a100a2a"fix: eliminate duplicate command sets in chain, improve flush perf and type safety" — the flush refactor switched from `Bun.write` to `fs.appendFileSync` and added the `lastConsoleFlushed` cursor pattern alongside its network/dialog siblings, but missed the matching `let` declaration. Tests don't currently exercise flushBuffers, so the regression shipped silently. Fix: - Declare `let lastConsoleFlushed = 0;` next to `lastNetworkFlushed` and `lastDialogFlushed` (browse/src/server.ts:327) - Add a source-level guard test (browse/test/server-flush-trackers.test.ts) that fails any future refactor that adds a fourth `last*Flushed` cursor without the matching declaration. Same pattern as terminal-agent.test.ts and dual-listener.test.ts — read source as text, assert invariant, no daemon required. Test plan: - [x] New regression test fails on current main, passes with the fix - [x] `bun run build` clean - [x] Manual smoke: spawn daemon -> goto file:// page with console.log -> wait 4s -> .gstack/browse-console.log now exists with the expected entries (163 bytes vs zero before) 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(browse): per-process state-file temp path to fix concurrent-write ENOENT The daemon writes `.gstack/browse.json` via the standard atomic-rename pattern: `writeFileSync(tmp, …) → renameSync(tmp, stateFile)`. Four sites in server.ts use this pattern (initial daemon-startup state at :2002, /tunnel/start handler at :1479, BROWSE_TUNNEL=1 inline tunnel update at :2083, BROWSE_TUNNEL_LOCAL_ONLY=1 update at :2113), and all four hard-code the same temp filename `${stateFile}.tmp`. Under concurrent writers the shared filename races on the rename: t0 Writer A: writeFileSync(stateFile + '.tmp', payloadA) t1 Writer B: writeFileSync(stateFile + '.tmp', payloadB) // overwrites A t2 Writer A: renameSync(stateFile + '.tmp', stateFile) // moves B's payload t3 Writer B: renameSync(stateFile + '.tmp', stateFile) // ENOENT — file gone Reproduced empirically with 15 concurrent CLIs against a fresh `.gstack/`: [browse] Failed to start: ENOENT: no such file or directory, rename '…/.gstack/browse.json.tmp' -> '…/.gstack/browse.json' Pre-fix success rate: **0 / 15** under cold-start race. Post-fix success rate: **15 / 15**, zero ENOENT. Fix: - New `tmpStatePath()` helper (server.ts:333) returns `${stateFile}.tmp.${pid}.${randomBytes(4).toString('hex')}` - All 4 call sites use `tmpStatePath()` instead of the shared literal - Atomic rename still gives last-writer-wins semantics on the final state.json content; only behavior change is that concurrent writers no longer kill each other on the rename step Source-level guard test (browse/test/server-tmp-state-path.test.ts) locks two invariants: (1) no remaining `stateFile + '.tmp'` literals, (2) every state-write `writeFileSync` call uses `tmpStatePath()`. Same read-source-as-text pattern as terminal-agent.test.ts and dual-listener.test.ts — no daemon required, runs in tier-1 free. Test plan: - [x] Targeted source-level guard test passes (3 / 0) - [x] `bun run build` clean - [x] Live regression: 15 concurrent CLIs against cold state → 15 / 15 healthy, 0 ENOENT (vs 0 / 15 pre-fix) - [x] No `.tmp.*` orphans left behind after rename succeeds - [x] Related test cluster (server-auth, dual-listener, cdp-mutex, findport) — same pre-existing flakes as `main`, no new regressions introduced 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(browse): clear refs when iframe auto-detaches in getActiveFrameOrPage Asymmetric cleanup between two equivalent staleness conditions: onMainFrameNavigated() → clearRefs() + activeFrame = null ✓ getActiveFrameOrPage() → activeFrame = null (refs NOT cleared) ✗ Both paths see the same staleness condition — refs were captured against a frame that no longer exists. The main-frame path correctly clears both pieces of state. The iframe-detach path nulls the frame but leaves the refMap intact. The lazy click-time check in `resolveRef` (tab-session.ts:97) partially saves us — `entry.locator.count()` on a detached-frame locator throws or returns 0, so the click errors out as "Ref X is stale". But the user has no signal that frame context silently changed underfoot: the next `snapshot` runs against `this.page` (main) while old iframe refs still litter `refMap` with the same role+name keys. New refs collide with stale ones, the resolver picks one at random, the user clicks the wrong element. TODOS.md line 816-820 documents "Detached frame auto-recovery" as a shipped iframe-support feature in v0.12.1.0. This restores the documented intent — the recovery should leave the session in a clean state, not a half-cleared one. Fix: 1 line — add `this.clearRefs()` next to `this.activeFrame = null` inside the if-branch. Test plan: - [x] New regression test: 4/4 pass - refs cleared when getActiveFrameOrPage detects detached iframe - refs preserved when active frame is still attached (no regression) - refs preserved when no frame set (page-level path untouched) - matches onMainFrameNavigated symmetry — both paths reach the same clean end state - [x] `bun run build` clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(codex): resolve python for JSON parser * fix: add fail-fast probe for base branch in ship step 12 * fix(plan-devex-review): remove contradictory plan-mode handshake * fix(design): honor Retry-After header in variants 429 handler Closes #1244. The 429 handler in `generateVariant` discarded the `Retry-After` response header and fell straight through to a local exponential schedule (2s/4s/8s). In image-generation batches, that burns retry attempts inside the provider's cooldown window and the request never recovers. Now we parse `Retry-After` per RFC 7231 — both delta-seconds (`Retry-After: 5`) and HTTP-date (`Retry-After: Fri, 31 Dec 1999 23:59:59 GMT`). Honored waits are capped at 60s to bound stalls from hostile or buggy headers. Delta-seconds are validated as digits-only (rejects `2abc`). When `Retry-After` is honored (including 0 / past-date "retry now"), the next iteration's leading exponential sleep is skipped so we don't double-wait. Invalid or missing headers fall through to the existing exponential schedule unchanged. Behavior matrix: | Header | Behavior | |---------------------------------|-------------------------------------------| | Retry-After: 5 | wait 5s, skip leading on next attempt | | Retry-After: 999999 | capped to 60s, skip leading | | Retry-After: 2abc | invalid, fall through to exponential | | Retry-After: 0 | wait 0, skip leading (retry immediately) | | Retry-After: <past HTTP-date> | wait 0, skip leading | | Retry-After: <future date> | wait diff capped at 60s, skip leading | | no header | fall through to existing exponential | `generateVariant` now accepts an optional `fetchFn` parameter (defaults to `globalThis.fetch`) so tests can inject a stub. Production call sites are unchanged. Tests cover the five behavior buckets above, asserting both the 1st-to-2nd call timing gap and call counts. All five pass in ~8s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(docs): correct per-skill symlink removal snippet in README uninstall Closes #1130. The manual-uninstall fallback in `## Uninstall` → `### Option 2` used `find ~/.claude/skills -maxdepth 1 -type l`, which finds nothing on real installs. Each `~/.claude/skills/<name>/` is a real directory, and only `<name>/SKILL.md` inside it is a symlink into `gstack/`. The find never matched, so the snippet silently removed nothing. Replace with a directory walk that inspects each `<name>/SKILL.md`: find ~/.claude/skills -mindepth 1 -maxdepth 1 -type d ! -name gstack → check $dir/SKILL.md is a symlink → readlink it → if target is gstack/* or */gstack/*: rm -f the link, rmdir the dir (only if empty — preserves any user-added files) Excludes the top-level `gstack/` dir from the walk; that's removed by step 3 of the same uninstall block. `bin/gstack-uninstall` (the script-mode path) already handles the layout correctly via its own walk; only this manual fallback needed updating. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: reject partial browse client env integers * fix(gemini-adapter): detect new ~/.gemini/oauth_creds.json auth path gemini-cli >=0.30 stores OAuth credentials at ~/.gemini/oauth_creds.json instead of the legacy ~/.config/gemini/ directory. The benchmark adapter's availability check now succeeds for users on recent gemini-cli releases who have authenticated via interactive login. Both paths are accepted so users on older versions still work. * fix(browser): add --no-sandbox for root user on Linux/WSL2 Chromium's sandbox can't initialize when running as root on Linux, causing an immediate exit. Extend the existing CI/CONTAINER check to also cover this case, keeping the Windows-safe `typeof getuid` guard. * security: pass cwd to git via execFileSync, not interpolation through /bin/sh `bin/gstack-memory-ingest.ts:632-643` ran `execSync(\`git -C ${JSON.stringify(cwd)} remote get-url origin 2>/dev/null\`, ...)`. JSON.stringify escapes `"` and `\` but not `$` or backticks, so a `cwd` of `"$(touch /tmp/marker)"` survived JSON quoting and detonated under /bin/sh's command-substitution-inside-double-quotes. `cwd` originates from transcript JSONL records under `~/.claude/projects/<encoded-cwd>/<uuid>.jsonl` and `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl`. The walker grabs the first `.cwd` it sees per session. That's an untrusted surface in the gstack threat model — the L1-L6 sidebar security stack exists exactly because agent transcripts can carry attacker-influenced text. Two pivots above the local same-uid bar: (a) prompt-injection appending `cwd="$(...)"` to the active session log turns the next /sync-gbrain run into RCE under the user's uid; (b) cross-machine transcript share (a colleague's `.claude/projects` snippet untar'd into HOME, a documented gbrain dogfooding shape) → RCE on first sync. Fix swaps the one execSync for `execFileSync("git", ["-C", cwd, "remote", "get-url", "origin"], ...)`. No shell, argv passed directly to git. The same module already uses execFileSync for `gbrainAvailable()` (line 762 pre-patch) and `gbrainPutPage()` (line 816 pre-patch) — this single execSync was the outlier. Test: `gstack-memory-ingest security: untrusted cwd cannot trigger shell substitution` plants a Claude-Code-shaped JSONL with cwd=`$(touch <marker>)` and asserts the marker file is not created after `--incremental --quiet`. Negative control: with the patch reverted, the test fails (marker created); with the patch applied, it passes (18/18 in test/gstack-memory-ingest.test.ts). * security: gate domain-skill auto-promote on classifier_score > 0 `browse/src/domain-skill-commands.ts:140` (handleSave) writes `classifier_score: 0` with the comment "L4 deferred to load-time / sidebar-agent fills this in on first prompt-injection load." But CLAUDE.md "Sidebar architecture" documents that sidebar-agent.ts was ripped, and grep for recordSkillUse + classifierFlagged callers across browse/src/ returns zero hits outside the module under test. Net effect: every quarantined skill that survives three benign uses without flag (`recordSkillUse(... , classifierFlagged: false)` x3) auto-promotes to `active` and lands in prompt context wrapped as UNTRUSTED on every subsequent visit to that host. The L4 score that was supposed to gate the promotion was never written — the production save path puts 0 on disk and nothing later updates it. Threat model: a domain-skill body authored by an agent under the influence of a poisoned page (the new `gstackInjectToTerminal` PTY path runs no L1-L3 either) would lose its auto-promote barrier after three uses. The exploit isn't single-step but the bar is exactly N=3 prompt-injection-shaped uses on a hostile page, which is well within reach. Fix adds a single condition to the auto-promote gate in `recordSkillUse`: if (state === 'quarantined' && useCount >= PROMOTE_THRESHOLD && flagCount === 0 && current.classifier_score > 0) { state = 'active'; } `classifier_score` is set once at writeSkill and never updated. Production saves it as 0 (handleSave), so the gate stays closed; existing tests that explicitly pass `classifierScore: 0.1` still auto-promote (the auto-promote path is preserved for the day L4 is rewired). Manual promotion via `domain-skill promote-to-global` is unaffected (it goes through `promoteToGlobal` which has its own state-machine guard at line 337+). Test: new regression case `does NOT auto-promote when classifier_score is 0 (production handleSave shape)` plants a skill with classifierScore=0 (matches domain-skill-commands.ts:140), runs three uses without flag, asserts the skill stays quarantined and readSkill returns null. Negative control: revert the patch, the test fails with `Received: "active"`. With the patch: 15/15 pass. * fix(ship): port #1302 SKILL.md edits to .tmpl + resolver source PR #1302 added Verification Mode + UNVERIFIABLE classification + per-item confirmation gate to ship/SKILL.md, but only the generated SKILL.md was edited — not the .tmpl source or scripts/resolvers/review.ts. The next `bun run gen:skill-docs` run would have wiped the changes. Port the same content into the resolver and .tmpl so regeneration produces the intended output. * ci(windows): extend free-tests lane to cover icacls + Bun.which resolvers from fix-wave PRs Closes #1306/#1307/#1308 validation gap. The four newly-added test files already have process.platform guards so they run safely on both POSIX and Windows lanes — only platform-relevant assertions execute on each. Tests added to the windows-latest lane: - browse/test/file-permissions.test.ts (#1308 icacls + writeSecureFile) - browse/test/security.test.ts (#1306 bash.exe wrap pure-function path) - make-pdf/test/browseClient.test.ts (#1307 Bun.which browse resolver) - make-pdf/test/pdftotext.test.ts (#1307 Bun.which pdftotext resolver) * test(codex): live flag-semantics smoke for codex exec resume Closes #1270's regex-only test gap. PR #1270 asserted that codex/SKILL.md's `codex exec resume` invocation drops -C/-s and uses sandbox_mode config. That regex catches the skill template regressing, but not codex CLI itself flipping flag semantics again. This test probes `codex exec resume --help` and asserts the surface gstack relies on: -c/sandbox_mode is accepted, top-level -C is absent. Skips silently when codex isn't on PATH, so dev machines without codex installed never see it fail. * chore: regen SKILL.md after fix wave One regen commit at the end of the merge wave per the plan. plan-devex-review loses the contradictory plan-mode handshake (#1333). review/SKILL.md picks up the Verification Mode + UNVERIFIABLE classification additions that #1302 authored against ship/SKILL.md (same resolver shared between ship and review modes). * fix(server.ts): keep fs.writeFileSync for state-file writes #1308's writeSecureFile wrapper added Windows icacls hardening for the 4 state-file write sites in server.ts, but #1310's regression test grep's for fs.writeFileSync(tmpStatePath()) calls. The two changes are technically compatible only if the test relaxes — keeping the test strict (the safer choice for catching regressions on the cold-start race) means the 4 state- file sites stay on fs.writeFileSync(..., { mode: 0o600 }). POSIX 0o600 hardening is preserved on those 4 sites. Windows icacls hardening still applies to all the other writeSecureFile call sites #1308 added (auth.json, mkdirSecure, etc.). Also refreshes golden baselines after #1302 / port + minor wording tweak in scripts/resolvers/review.ts to keep gen-skill-docs.test.ts assertion 'Cite the specific file' satisfied. * v1.30.0.0: fix wave — 21 community PRs + 2 closing fixes for Windows + codex CI gaps Headline release. Browse stops dropping console logs, cold-start race fixed, codex resume works without python3, Windows hardening (icacls + Bun.which + bash.exe wrap), ship gate gets VAS-449 remediation, two closing fixes that put icacls/Bun.which/codex flag semantics under CI. * test(domain-skills): cover #1369 classifier_score=0 quarantine + score>0 promote path The pre-existing T6 test seeded skills via writeSkill (which defaults classifier_score to 0 until L4 is rewired) and then expected 3 uses to auto-promote. PR #1369 added `current.classifier_score > 0` to the gate specifically to block that path — a quarantined skill written under the influence of a poisoned page would otherwise auto-promote after three benign uses. Updated test asserts both halves of the new contract: - classifier_score=0 + 3 uses → stays quarantined (the security guarantee) - classifier_score>0 + 3 more uses → promotes to active (unblock path) Catches both regressions: the gate going away (would re-allow the bypass) and the unblock path breaking (would silently quarantine all skills forever once L4 is rewired). --------- Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com> Co-authored-by: orbisai0security <mediratta01.pally@gmail.com> Co-authored-by: Bryce Alan <brycealan.eth@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Terry Carson YM <cym3118288@gmail.com> Co-authored-by: Vasko Ckorovski <vckorovski@gmail.com> Co-authored-by: Samuel Carson <samuel.carson@gmail.com> Co-authored-by: Yashwant Kotipalli <yashwant7kotipalli@gmail.com> Co-authored-by: Jasper Chen <jasperchen925@gmail.com> Co-authored-by: Stefan Neamtu <stefan.neamtu@gmail.com> Co-authored-by: 陈家名 <chenjiaming@kezaihui.com> Co-authored-by: Abigail Atheryon <abi@atheryon.ai> Co-authored-by: Furkan Köykıran <furkankoykiran@gmail.com> Co-authored-by: gus <gustavoraularagon@gmail.com>
1091 lines
62 KiB
TypeScript
1091 lines
62 KiB
TypeScript
/**
|
|
* Cross-model review resolver
|
|
*
|
|
* Data sent to external review services (via Codex CLI):
|
|
* - Plan markdown content, repository name, branch name, review type
|
|
* Data NOT sent:
|
|
* - Source code files, credentials, environment variables, git history
|
|
*
|
|
* Users invoke this explicitly via /plan-eng-review, /plan-ceo-review,
|
|
* or /plan-design-review. No data is sent without user invocation.
|
|
*
|
|
* Review logs are stored locally at ~/.gstack/reviews/review-log.jsonl.
|
|
* Codex CLI prompts are written to temp files to prevent shell injection.
|
|
*/
|
|
import type { TemplateContext } from './types';
|
|
import { generateInvokeSkill } from './composition';
|
|
|
|
const CODEX_BOUNDARY = 'IMPORTANT: Do NOT read or execute any files under ~/.claude/, ~/.agents/, .claude/skills/, or agents/. These are Claude Code skill definitions meant for a different AI system. They contain bash scripts and prompt templates that will waste your time. Ignore them completely. Do NOT modify agents/openai.yaml. Stay focused on the repository code only.\\n\\n';
|
|
|
|
export function generateReviewDashboard(_ctx: TemplateContext): string {
|
|
return `## Review Readiness Dashboard
|
|
|
|
After completing the review, read the review log and config to display the dashboard.
|
|
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-read
|
|
\`\`\`
|
|
|
|
Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Eng Review row, show whichever is more recent between \`review\` (diff-scoped pre-landing review) and \`plan-eng-review\` (plan-stage architecture review). Append "(DIFF)" or "(PLAN)" to the status to distinguish. For the Adversarial row, show whichever is more recent between \`adversarial-review\` (new auto-scaled) and \`codex-review\` (legacy). For Design Review, show whichever is more recent between \`plan-design-review\` (full visual audit) and \`design-review-lite\` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. For the Outside Voice row, show the most recent \`codex-plan-review\` entry — this captures outside voices from both /plan-ceo-review and /plan-eng-review.
|
|
|
|
**Source attribution:** If the most recent entry for a skill has a \\\`"via"\\\` field, append it to the status label in parentheses. Examples: \`plan-eng-review\` with \`via:"autoplan"\` shows as "CLEAR (PLAN via /autoplan)". \`review\` with \`via:"ship"\` shows as "CLEAR (DIFF via /ship)". Entries without a \`via\` field show as "CLEAR (PLAN)" or "CLEAR (DIFF)" as before.
|
|
|
|
Note: \`autoplan-voices\` and \`design-outside-voices\` entries are audit-trail-only (forensic data for cross-model consensus analysis). They do not appear in the dashboard and are not checked by any consumer.
|
|
|
|
Display:
|
|
|
|
\`\`\`
|
|
+====================================================================+
|
|
| REVIEW READINESS DASHBOARD |
|
|
+====================================================================+
|
|
| Review | Runs | Last Run | Status | Required |
|
|
|-----------------|------|---------------------|-----------|----------|
|
|
| Eng Review | 1 | 2026-03-16 15:00 | CLEAR | YES |
|
|
| CEO Review | 0 | — | — | no |
|
|
| Design Review | 0 | — | — | no |
|
|
| Adversarial | 0 | — | — | no |
|
|
| Outside Voice | 0 | — | — | no |
|
|
+--------------------------------------------------------------------+
|
|
| VERDICT: CLEARED — Eng Review passed |
|
|
+====================================================================+
|
|
\`\`\`
|
|
|
|
**Review tiers:**
|
|
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \\\`gstack-config set skip_eng_review true\\\` (the "don't bother me" setting).
|
|
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
|
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
|
- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed.
|
|
- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping.
|
|
|
|
**Verdict logic:**
|
|
- **CLEARED**: Eng Review has >= 1 entry within 7 days from either \\\`review\\\` or \\\`plan-eng-review\\\` with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`)
|
|
- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
|
|
- CEO, Design, and Codex reviews are shown for context but never block shipping
|
|
- If \\\`skip_eng_review\\\` config is \\\`true\\\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED
|
|
|
|
**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
|
|
- Parse the \\\`---HEAD---\\\` section from the bash output to get the current HEAD commit hash
|
|
- For each review entry that has a \\\`commit\\\` field: compare it against the current HEAD. If different, count elapsed commits: \\\`git rev-list --count STORED_COMMIT..HEAD\\\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
|
|
- For entries without a \\\`commit\\\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
|
|
- If all reviews match the current HEAD, do not display any staleness notes`;
|
|
}
|
|
|
|
export function generatePlanFileReviewReport(_ctx: TemplateContext): string {
|
|
return `## Plan File Review Report
|
|
|
|
After displaying the Review Readiness Dashboard in conversation output, also update the
|
|
**plan file** itself so review status is visible to anyone reading the plan.
|
|
|
|
### Detect the plan file
|
|
|
|
1. Check if there is an active plan file in this conversation (the host provides plan file
|
|
paths in system messages — look for plan file references in the conversation context).
|
|
2. If not found, skip this section silently — not every review runs in plan mode.
|
|
|
|
### Generate the report
|
|
|
|
Read the review log output you already have from the Review Readiness Dashboard step above.
|
|
Parse each JSONL entry. Each skill logs different fields:
|
|
|
|
- **plan-ceo-review**: \\\`status\\\`, \\\`unresolved\\\`, \\\`critical_gaps\\\`, \\\`mode\\\`, \\\`scope_proposed\\\`, \\\`scope_accepted\\\`, \\\`scope_deferred\\\`, \\\`commit\\\`
|
|
→ Findings: "{scope_proposed} proposals, {scope_accepted} accepted, {scope_deferred} deferred"
|
|
→ If scope fields are 0 or missing (HOLD/REDUCTION mode): "mode: {mode}, {critical_gaps} critical gaps"
|
|
- **plan-eng-review**: \\\`status\\\`, \\\`unresolved\\\`, \\\`critical_gaps\\\`, \\\`issues_found\\\`, \\\`mode\\\`, \\\`commit\\\`
|
|
→ Findings: "{issues_found} issues, {critical_gaps} critical gaps"
|
|
- **plan-design-review**: \\\`status\\\`, \\\`initial_score\\\`, \\\`overall_score\\\`, \\\`unresolved\\\`, \\\`decisions_made\\\`, \\\`commit\\\`
|
|
→ Findings: "score: {initial_score}/10 → {overall_score}/10, {decisions_made} decisions"
|
|
- **plan-devex-review**: \\\`status\\\`, \\\`initial_score\\\`, \\\`overall_score\\\`, \\\`product_type\\\`, \\\`tthw_current\\\`, \\\`tthw_target\\\`, \\\`mode\\\`, \\\`persona\\\`, \\\`competitive_tier\\\`, \\\`unresolved\\\`, \\\`commit\\\`
|
|
→ Findings: "score: {initial_score}/10 → {overall_score}/10, TTHW: {tthw_current} → {tthw_target}"
|
|
- **devex-review**: \\\`status\\\`, \\\`overall_score\\\`, \\\`product_type\\\`, \\\`tthw_measured\\\`, \\\`dimensions_tested\\\`, \\\`dimensions_inferred\\\`, \\\`boomerang\\\`, \\\`commit\\\`
|
|
→ Findings: "score: {overall_score}/10, TTHW: {tthw_measured}, {dimensions_tested} tested/{dimensions_inferred} inferred"
|
|
- **codex-review**: \\\`status\\\`, \\\`gate\\\`, \\\`findings\\\`, \\\`findings_fixed\\\`
|
|
→ Findings: "{findings} findings, {findings_fixed}/{findings} fixed"
|
|
|
|
All fields needed for the Findings column are now present in the JSONL entries.
|
|
For the review you just completed, you may use richer details from your own Completion
|
|
Summary. For prior reviews, use the JSONL fields directly — they contain all required data.
|
|
|
|
Produce this markdown table:
|
|
|
|
\\\`\\\`\\\`markdown
|
|
## GSTACK REVIEW REPORT
|
|
|
|
| Review | Trigger | Why | Runs | Status | Findings |
|
|
|--------|---------|-----|------|--------|----------|
|
|
| CEO Review | \\\`/plan-ceo-review\\\` | Scope & strategy | {runs} | {status} | {findings} |
|
|
| Codex Review | \\\`/codex review\\\` | Independent 2nd opinion | {runs} | {status} | {findings} |
|
|
| Eng Review | \\\`/plan-eng-review\\\` | Architecture & tests (required) | {runs} | {status} | {findings} |
|
|
| Design Review | \\\`/plan-design-review\\\` | UI/UX gaps | {runs} | {status} | {findings} |
|
|
| DX Review | \\\`/plan-devex-review\\\` | Developer experience gaps | {runs} | {status} | {findings} |
|
|
\\\`\\\`\\\`
|
|
|
|
Below the table, add these lines (omit any that are empty/not applicable):
|
|
|
|
- **CODEX:** (only if codex-review ran) — one-line summary of codex fixes
|
|
- **CROSS-MODEL:** (only if both Claude and Codex reviews exist) — overlap analysis
|
|
- **UNRESOLVED:** total unresolved decisions across all reviews
|
|
- **VERDICT:** list reviews that are CLEAR (e.g., "CEO + ENG CLEARED — ready to implement").
|
|
If Eng Review is not CLEAR and not skipped globally, append "eng review required".
|
|
|
|
### Write to the plan file
|
|
|
|
**PLAN MODE EXCEPTION — ALWAYS RUN:** This writes to the plan file, which is the one
|
|
file you are allowed to edit in plan mode. The plan file review report is part of the
|
|
plan's living status.
|
|
|
|
The report must always be the LAST section of the plan file — never mid-file.
|
|
Use a single delete-then-append flow:
|
|
|
|
1. Read the plan file (Read tool) to see its full current content. Search the read
|
|
output for a \\\`## GSTACK REVIEW REPORT\\\` heading anywhere in the file.
|
|
2. If found, use the Edit tool to DELETE the entire existing section. Match from
|
|
\\\`## GSTACK REVIEW REPORT\\\` through either the next \\\`## \\\` heading or end of
|
|
file, whichever comes first. Replace with the empty string. This applies
|
|
regardless of where the section currently lives — mid-file deletion is
|
|
intentional, not a special case. If the Edit fails (e.g., concurrent edit
|
|
changed the content), re-read the plan file and retry once.
|
|
3. After the delete (or skipped, if no section existed), append the new
|
|
\\\`## GSTACK REVIEW REPORT\\\` section at the END of the file. Use the Edit
|
|
tool to match the file's current last paragraph and add the section after it,
|
|
or use Write to re-emit the whole file with the section at the end.
|
|
4. Verify with the Read tool that \\\`## GSTACK REVIEW REPORT\\\` is the last
|
|
\\\`## \\\` heading in the file before continuing. If it isn't, repeat steps
|
|
2-3 once.
|
|
|
|
Do NOT replace the section in place. The "replace mid-file" path is what allowed
|
|
prior versions to leave the report mid-file when an older report already lived
|
|
there — the user then sees a plan whose review report is not at the bottom and
|
|
(correctly) rejects it.`;
|
|
}
|
|
|
|
export function generateAntiShortcutClause(_ctx: TemplateContext): string {
|
|
return `**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.`;
|
|
}
|
|
|
|
export function generateSpecReviewLoop(_ctx: TemplateContext): string {
|
|
return `## Spec Review Loop
|
|
|
|
Before presenting the document to the user for approval, run an adversarial review.
|
|
|
|
**Step 1: Dispatch reviewer subagent**
|
|
|
|
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
|
|
and cannot see the brainstorming conversation — only the document. This ensures genuine
|
|
adversarial independence.
|
|
|
|
Prompt the subagent with:
|
|
- The file path of the document just written
|
|
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
|
|
list specific issues with suggested fixes. At the end, output a quality score (1-10)
|
|
across all dimensions."
|
|
|
|
**Dimensions:**
|
|
1. **Completeness** — Are all requirements addressed? Missing edge cases?
|
|
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
|
|
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
|
|
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
|
|
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
|
|
|
|
The subagent should return:
|
|
- A quality score (1-10)
|
|
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
|
|
|
|
**Step 2: Fix and re-dispatch**
|
|
|
|
If the reviewer returns issues:
|
|
1. Fix each issue in the document on disk (use Edit tool)
|
|
2. Re-dispatch the reviewer subagent with the updated document
|
|
3. Maximum 3 iterations total
|
|
|
|
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
|
|
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
|
|
and persist those issues as "Reviewer Concerns" in the document rather than looping
|
|
further.
|
|
|
|
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
|
|
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
|
|
already written to disk; the review is a quality bonus, not a gate.
|
|
|
|
**Step 3: Report and persist metrics**
|
|
|
|
After the loop completes (PASS, max iterations, or convergence guard):
|
|
|
|
1. Tell the user the result — summary by default:
|
|
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
|
|
Quality score: X/10."
|
|
If they ask "what did the reviewer find?", show the full reviewer output.
|
|
|
|
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
|
|
section to the document listing each unresolved issue. Downstream skills will see this.
|
|
|
|
3. Append metrics:
|
|
\`\`\`bash
|
|
mkdir -p ~/.gstack/analytics
|
|
echo '{"skill":"${_ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
|
|
\`\`\`
|
|
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.`;
|
|
}
|
|
|
|
export function generateBenefitsFrom(ctx: TemplateContext): string {
|
|
if (!ctx.benefitsFrom || ctx.benefitsFrom.length === 0) return '';
|
|
|
|
const skillList = ctx.benefitsFrom.map(s => `\`/${s}\``).join(' or ');
|
|
const first = ctx.benefitsFrom[0];
|
|
|
|
// Reuse the INVOKE_SKILL resolver for the actual loading instructions
|
|
const invokeBlock = generateInvokeSkill(ctx, [first]);
|
|
|
|
return `## Prerequisite Skill Offer
|
|
|
|
When the design doc check above prints "No design doc found," offer the prerequisite
|
|
skill before proceeding.
|
|
|
|
Say to the user via AskUserQuestion:
|
|
|
|
> "No design doc found for this branch. ${skillList} produces a structured problem
|
|
> statement, premise challenge, and explored alternatives — it gives this review much
|
|
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
|
|
> not per-product — it captures the thinking behind this specific change."
|
|
|
|
Options:
|
|
- A) Run /${first} now (we'll pick up the review right after)
|
|
- B) Skip — proceed with standard review
|
|
|
|
If they skip: "No worries — standard review. If you ever want sharper input, try
|
|
/${first} first next time." Then proceed normally. Do not re-offer later in the session.
|
|
|
|
If they choose A:
|
|
|
|
Say: "Running /${first} inline. Once the design doc is ready, I'll pick up
|
|
the review right where we left off."
|
|
|
|
${invokeBlock}
|
|
|
|
After /${first} completes, re-run the design doc check:
|
|
\`\`\`bash
|
|
setopt +o nomatch 2>/dev/null || true # zsh compat
|
|
SLUG=$(~/.claude/skills/gstack/browse/bin/remote-slug 2>/dev/null || basename "$(git rev-parse --show-toplevel 2>/dev/null || pwd)")
|
|
BRANCH=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-' || echo 'no-branch')
|
|
DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head -1)
|
|
[ -z "$DESIGN" ] && DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-design-*.md 2>/dev/null | head -1)
|
|
[ -n "$DESIGN" ] && echo "Design doc found: $DESIGN" || echo "No design doc found"
|
|
\`\`\`
|
|
|
|
If a design doc is now found, read it and continue the review.
|
|
If none was produced (user may have cancelled), proceed with standard review.`;
|
|
}
|
|
|
|
export function generateCodexSecondOpinion(ctx: TemplateContext): string {
|
|
// Codex host: strip entirely — Codex should never invoke itself
|
|
if (ctx.host === 'codex') return '';
|
|
|
|
return `## Phase 3.5: Cross-Model Second Opinion (optional)
|
|
|
|
**Binary check first:**
|
|
|
|
\`\`\`bash
|
|
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
|
\`\`\`
|
|
|
|
Use AskUserQuestion (regardless of codex availability):
|
|
|
|
> Want a second opinion from an independent AI perspective? It will review your problem statement, key answers, premises, and any landscape findings from this session without having seen this conversation — it gets a structured summary. Usually takes 2-5 minutes.
|
|
> A) Yes, get a second opinion
|
|
> B) No, proceed to alternatives
|
|
|
|
If B: skip Phase 3.5 entirely. Remember that the second opinion did NOT run (affects design doc, founder signals, and Phase 4 below).
|
|
|
|
**If A: Run the Codex cold read.**
|
|
|
|
1. Assemble a structured context block from Phases 1-3:
|
|
- Mode (Startup or Builder)
|
|
- Problem statement (from Phase 1)
|
|
- Key answers from Phase 2A/2B (summarize each Q&A in 1-2 sentences, include verbatim user quotes)
|
|
- Landscape findings (from Phase 2.75, if search was run)
|
|
- Agreed premises (from Phase 3)
|
|
- Codebase context (project name, languages, recent activity)
|
|
|
|
2. **Write the assembled prompt to a temp file** (prevents shell injection from user-derived content):
|
|
|
|
\`\`\`bash
|
|
CODEX_PROMPT_FILE=$(mktemp /tmp/gstack-codex-oh-XXXXXXXX.txt)
|
|
\`\`\`
|
|
|
|
Write the full prompt to this file. **Always start with the filesystem boundary:**
|
|
"${CODEX_BOUNDARY}"
|
|
Then add the context block and mode-appropriate instructions:
|
|
|
|
**Startup mode instructions:** "You are an independent technical advisor reading a transcript of a startup brainstorming session. [CONTEXT BLOCK HERE]. Your job: 1) What is the STRONGEST version of what this person is trying to build? Steelman it in 2-3 sentences. 2) What is the ONE thing from their answers that reveals the most about what they should actually build? Quote it and explain why. 3) Name ONE agreed premise you think is wrong, and what evidence would prove you right. 4) If you had 48 hours and one engineer to build a prototype, what would you build? Be specific — tech stack, features, what you'd skip. Be direct. Be terse. No preamble."
|
|
|
|
**Builder mode instructions:** "You are an independent technical advisor reading a transcript of a builder brainstorming session. [CONTEXT BLOCK HERE]. Your job: 1) What is the COOLEST version of this they haven't considered? 2) What's the ONE thing from their answers that reveals what excites them most? Quote it. 3) What existing open source project or tool gets them 50% of the way there — and what's the 50% they'd need to build? 4) If you had a weekend to build this, what would you build first? Be specific. Be direct. No preamble."
|
|
|
|
3. Run Codex:
|
|
|
|
\`\`\`bash
|
|
TMPERR_OH=$(mktemp /tmp/codex-oh-err-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
codex exec "$(cat "$CODEX_PROMPT_FILE")" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_OH"
|
|
\`\`\`
|
|
|
|
Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr:
|
|
\`\`\`bash
|
|
cat "$TMPERR_OH"
|
|
rm -f "$TMPERR_OH" "$CODEX_PROMPT_FILE"
|
|
\`\`\`
|
|
|
|
**Error handling:** All errors are non-blocking — second opinion is a quality enhancement, not a prerequisite.
|
|
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run \\\`codex login\\\` to authenticate." Fall back to Claude subagent.
|
|
- **Timeout:** "Codex timed out after 5 minutes." Fall back to Claude subagent.
|
|
- **Empty response:** "Codex returned no response." Fall back to Claude subagent.
|
|
|
|
On any Codex error, fall back to the Claude subagent below.
|
|
|
|
**If CODEX_NOT_AVAILABLE (or Codex errored):**
|
|
|
|
Dispatch via the Agent tool. The subagent has fresh context — genuine independence.
|
|
|
|
Subagent prompt: same mode-appropriate prompt as above (Startup or Builder variant).
|
|
|
|
Present findings under a \`SECOND OPINION (Claude subagent):\` header.
|
|
|
|
If the subagent fails or times out: "Second opinion unavailable. Continuing to Phase 4."
|
|
|
|
4. **Presentation:**
|
|
|
|
If Codex ran:
|
|
\`\`\`
|
|
SECOND OPINION (Codex):
|
|
════════════════════════════════════════════════════════════
|
|
<full codex output, verbatim — do not truncate or summarize>
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
If Claude subagent ran:
|
|
\`\`\`
|
|
SECOND OPINION (Claude subagent):
|
|
════════════════════════════════════════════════════════════
|
|
<full subagent output, verbatim — do not truncate or summarize>
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
5. **Cross-model synthesis:** After presenting the second opinion output, provide 3-5 bullet synthesis:
|
|
- Where Claude agrees with the second opinion
|
|
- Where Claude disagrees and why
|
|
- Whether the challenged premise changes Claude's recommendation
|
|
|
|
6. **Premise revision check:** If Codex challenged an agreed premise, use AskUserQuestion:
|
|
|
|
> Codex challenged premise #{N}: "{premise text}". Their argument: "{reasoning}".
|
|
> A) Revise this premise based on Codex's input
|
|
> B) Keep the original premise — proceed to alternatives
|
|
|
|
If A: revise the premise and note the revision. If B: proceed (and note that the user defended this premise with reasoning — this is a founder signal if they articulate WHY they disagree, not just dismiss).`;
|
|
}
|
|
|
|
// ─── Scope Drift Detection (shared between /review and /ship) ────────
|
|
|
|
export function generateScopeDrift(ctx: TemplateContext): string {
|
|
const isShip = ctx.skillName === 'ship';
|
|
const stepNum = isShip ? '8.2' : '1.5';
|
|
|
|
return `## Step ${stepNum}: Scope Drift Detection
|
|
|
|
Before reviewing code quality, check: **did they build what was requested — nothing more, nothing less?**
|
|
|
|
1. Read \`TODOS.md\` (if it exists). Read PR description (\`gh pr view --json body --jq .body 2>/dev/null || true\`).
|
|
Read commit messages (\`git log origin/<base>..HEAD --oneline\`).
|
|
**If no PR exists:** rely on commit messages and TODOS.md for stated intent — this is the common case since /review runs before /ship creates the PR.
|
|
2. Identify the **stated intent** — what was this branch supposed to accomplish?
|
|
3. Run \`git diff origin/<base>...HEAD --stat\` and compare the files changed against the stated intent.
|
|
|
|
4. Evaluate with skepticism (incorporating plan completion results if available from an earlier step or adjacent section):
|
|
|
|
**SCOPE CREEP detection:**
|
|
- Files changed that are unrelated to the stated intent
|
|
- New features or refactors not mentioned in the plan
|
|
- "While I was in there..." changes that expand blast radius
|
|
|
|
**MISSING REQUIREMENTS detection:**
|
|
- Requirements from TODOS.md/PR description not addressed in the diff
|
|
- Test coverage gaps for stated requirements
|
|
- Partial implementations (started but not finished)
|
|
|
|
5. Output (before the main review begins):
|
|
\\\`\\\`\\\`
|
|
Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING]
|
|
Intent: <1-line summary of what was requested>
|
|
Delivered: <1-line summary of what the diff actually does>
|
|
[If drift: list each out-of-scope change]
|
|
[If missing: list each unaddressed requirement]
|
|
\\\`\\\`\\\`
|
|
|
|
6. This is **INFORMATIONAL** — does not block the review. Proceed to the next step.
|
|
|
|
---`;
|
|
}
|
|
|
|
// ─── Adversarial Review (always-on) ──────────────────────────────────
|
|
|
|
export function generateAdversarialStep(ctx: TemplateContext): string {
|
|
// Codex host: strip entirely — Codex should never invoke itself
|
|
if (ctx.host === 'codex') return '';
|
|
|
|
const isShip = ctx.skillName === 'ship';
|
|
const stepNum = isShip ? '11' : '5.7';
|
|
|
|
return `## Step ${stepNum}: Adversarial review (always-on)
|
|
|
|
Every diff gets adversarial review from both Claude and Codex. LOC is not a proxy for risk — a 5-line auth change can be critical.
|
|
|
|
**Detect diff size and tool availability:**
|
|
|
|
\`\`\`bash
|
|
DIFF_INS=$(git diff origin/<base> --stat | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
|
|
DIFF_DEL=$(git diff origin/<base> --stat | tail -1 | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo "0")
|
|
DIFF_TOTAL=$((DIFF_INS + DIFF_DEL))
|
|
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
|
# Legacy opt-out — only gates Codex passes, Claude always runs
|
|
OLD_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
|
|
echo "DIFF_SIZE: $DIFF_TOTAL"
|
|
echo "OLD_CFG: \${OLD_CFG:-not_set}"
|
|
\`\`\`
|
|
|
|
If \`OLD_CFG\` is \`disabled\`: skip Codex passes only. Claude adversarial subagent still runs (it's free and fast). Jump to the "Claude adversarial subagent" section.
|
|
|
|
**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size.
|
|
|
|
---
|
|
|
|
### Claude adversarial subagent (always runs)
|
|
|
|
Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to.
|
|
|
|
Subagent prompt:
|
|
"Read the diff for this branch with \`git diff origin/<base>\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment). After listing findings, end your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\` — examples: \`Recommendation: Fix the unbounded retry at queue.ts:78 because it'll DoS the worker pool under sustained 429s\` or \`Recommendation: Ship as-is because the strongest finding is a theoretical race that requires conditions we can't trigger in production\`. The reason must point to a specific finding (or no-fix rationale). Generic reasons like 'because it's safer' do not qualify."
|
|
|
|
Present findings under an \`ADVERSARIAL REVIEW (Claude subagent):\` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational.
|
|
|
|
If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing."
|
|
|
|
---
|
|
|
|
### Codex adversarial challenge (always runs when available)
|
|
|
|
If Codex is available AND \`OLD_CFG\` is NOT \`disabled\`:
|
|
|
|
\`\`\`bash
|
|
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
codex exec "${CODEX_BOUNDARY}Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems. End your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\`. Generic reasons like 'because it's safer' do not qualify; the reason must point to a specific finding or no-fix rationale." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_ADV"
|
|
\`\`\`
|
|
|
|
Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. After the command completes, read stderr:
|
|
\`\`\`bash
|
|
cat "$TMPERR_ADV"
|
|
\`\`\`
|
|
|
|
Present the full output verbatim. This is informational — it never blocks shipping.
|
|
|
|
**Error handling:** All errors are non-blocking — adversarial review is a quality enhancement, not a prerequisite.
|
|
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run \\\`codex login\\\` to authenticate."
|
|
- **Timeout:** "Codex timed out after 5 minutes."
|
|
- **Empty response:** "Codex returned no response. Stderr: <paste relevant error>."
|
|
|
|
**Cleanup:** Run \`rm -f "$TMPERR_ADV"\` after processing.
|
|
|
|
If Codex is NOT available: "Codex CLI not found — running Claude adversarial only. Install Codex for cross-model coverage: \`npm install -g @openai/codex\`"
|
|
|
|
---
|
|
|
|
### Codex structured review (large diffs only, 200+ lines)
|
|
|
|
If \`DIFF_TOTAL >= 200\` AND Codex is available AND \`OLD_CFG\` is NOT \`disabled\`:
|
|
|
|
\`\`\`bash
|
|
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
cd "$_REPO_ROOT"
|
|
codex review "${CODEX_BOUNDARY}Review the diff against the base branch." --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR"
|
|
\`\`\`
|
|
|
|
Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. Present output under \`CODEX SAYS (code review):\` header.
|
|
Check for \`[P1]\` markers: found → \`GATE: FAIL\`, not found → \`GATE: PASS\`.
|
|
|
|
If GATE is FAIL, use AskUserQuestion:
|
|
\`\`\`
|
|
Codex found N critical issues in the diff.
|
|
|
|
A) Investigate and fix now (recommended)
|
|
B) Continue — review will still complete
|
|
\`\`\`
|
|
|
|
If A: address the findings${isShip ? '. After fixing, re-run tests (Step 5) since code has changed' : ''}. Re-run \`codex review\` to verify.
|
|
|
|
Read stderr for errors (same error handling as Codex adversarial above).
|
|
|
|
After stderr: \`rm -f "$TMPERR"\`
|
|
|
|
If \`DIFF_TOTAL < 200\`: skip this section silently. The Claude + Codex adversarial passes provide sufficient coverage for smaller diffs.
|
|
|
|
---
|
|
|
|
### Persist the review result
|
|
|
|
After all passes complete, persist:
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"always","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
|
\`\`\`
|
|
Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), "skipped" if diff < 200, or "informational" if Codex was unavailable. If all passes failed, do NOT persist.
|
|
|
|
---
|
|
|
|
### Cross-model synthesis
|
|
|
|
After all passes complete, synthesize findings across all sources:
|
|
|
|
\`\`\`
|
|
ADVERSARIAL REVIEW SYNTHESIS (always-on, N lines):
|
|
════════════════════════════════════════════════════════════
|
|
High confidence (found by multiple sources): [findings agreed on by >1 pass]
|
|
Unique to Claude structured review: [from earlier step]
|
|
Unique to Claude adversarial: [from subagent]
|
|
Unique to Codex: [from codex adversarial or code review, if ran]
|
|
Models used: Claude structured ✓ Claude adversarial ✓/✗ Codex ✓/✗
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
High-confidence findings (agreed on by multiple sources) should be prioritized for fixes.
|
|
|
|
---`;
|
|
}
|
|
|
|
export function generateCodexPlanReview(ctx: TemplateContext): string {
|
|
// Codex host: strip entirely — Codex should never invoke itself
|
|
if (ctx.host === 'codex') return '';
|
|
|
|
return `## Outside Voice — Independent Plan Challenge (optional, recommended)
|
|
|
|
After all review sections are complete, offer an independent second opinion from a
|
|
different AI system. Two models agreeing on a plan is stronger signal than one model's
|
|
thorough review.
|
|
|
|
**Check tool availability:**
|
|
|
|
\`\`\`bash
|
|
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
|
\`\`\`
|
|
|
|
Use AskUserQuestion:
|
|
|
|
> "All review sections are complete. Want an outside voice? A different AI system can
|
|
> give a brutally honest, independent challenge of this plan — logical gaps, feasibility
|
|
> risks, and blind spots that are hard to catch from inside the review. Takes about 2
|
|
> minutes."
|
|
>
|
|
> RECOMMENDATION: Choose A — an independent second opinion catches structural blind
|
|
> spots. Two different AI models agreeing on a plan is stronger signal than one model's
|
|
> thorough review. Completeness: A=9/10, B=7/10.
|
|
|
|
Options:
|
|
- A) Get the outside voice (recommended)
|
|
- B) Skip — proceed to outputs
|
|
|
|
**If B:** Print "Skipping outside voice." and continue to the next section.
|
|
|
|
**If A:** Construct the plan review prompt. Read the plan file being reviewed (the file
|
|
the user pointed this review at, or the branch diff scope). If a CEO plan document
|
|
was written in Step 0D-POST, read that too — it contains the scope decisions and vision.
|
|
|
|
Construct this prompt (substitute the actual plan content — if plan content exceeds 30KB,
|
|
truncate to the first 30KB and note "Plan truncated for size"). **Always start with the
|
|
filesystem boundary instruction:**
|
|
|
|
"${CODEX_BOUNDARY}You are a brutally honest technical reviewer examining a development plan that has
|
|
already been through a multi-section review. Your job is NOT to repeat that review.
|
|
Instead, find what it missed. Look for: logical gaps and unstated assumptions that
|
|
survived the review scrutiny, overcomplexity (is there a fundamentally simpler
|
|
approach the review was too deep in the weeds to see?), feasibility risks the review
|
|
took for granted, missing dependencies or sequencing issues, and strategic
|
|
miscalibration (is this the right thing to build at all?). Be direct. Be terse. No
|
|
compliments. Just the problems.
|
|
|
|
THE PLAN:
|
|
<plan content>"
|
|
|
|
**If CODEX_AVAILABLE:**
|
|
|
|
\`\`\`bash
|
|
TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
codex exec "<prompt>" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_PV"
|
|
\`\`\`
|
|
|
|
Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr:
|
|
\`\`\`bash
|
|
cat "$TMPERR_PV"
|
|
\`\`\`
|
|
|
|
Present the full output verbatim:
|
|
|
|
\`\`\`
|
|
CODEX SAYS (plan review — outside voice):
|
|
════════════════════════════════════════════════════════════
|
|
<full codex output, verbatim — do not truncate or summarize>
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
**Error handling:** All errors are non-blocking — the outside voice is informational.
|
|
- Auth failure (stderr contains "auth", "login", "unauthorized"): "Codex auth failed. Run \\\`codex login\\\` to authenticate."
|
|
- Timeout: "Codex timed out after 5 minutes."
|
|
- Empty response: "Codex returned no response."
|
|
|
|
On any Codex error, fall back to the Claude adversarial subagent.
|
|
|
|
**If CODEX_NOT_AVAILABLE (or Codex errored):**
|
|
|
|
Dispatch via the Agent tool. The subagent has fresh context — genuine independence.
|
|
|
|
Subagent prompt: same plan review prompt as above.
|
|
|
|
Present findings under an \`OUTSIDE VOICE (Claude subagent):\` header.
|
|
|
|
If the subagent fails or times out: "Outside voice unavailable. Continuing to outputs."
|
|
|
|
**Cross-model tension:**
|
|
|
|
After presenting the outside voice findings, note any points where the outside voice
|
|
disagrees with the review findings from earlier sections. Flag these as:
|
|
|
|
\`\`\`
|
|
CROSS-MODEL TENSION:
|
|
[Topic]: Review said X. Outside voice says Y. [Present both perspectives neutrally.
|
|
State what context you might be missing that would change the answer.]
|
|
\`\`\`
|
|
|
|
**User Sovereignty:** Do NOT auto-incorporate outside voice recommendations into the plan.
|
|
Present each tension point to the user. The user decides. Cross-model agreement is a
|
|
strong signal — present it as such — but it is NOT permission to act. You may state
|
|
which argument you find more compelling, but you MUST NOT apply the change without
|
|
explicit user approval.
|
|
|
|
For each substantive tension point, use AskUserQuestion:
|
|
|
|
> "Cross-model disagreement on [topic]. The review found [X] but the outside voice
|
|
> argues [Y]. [One sentence on what context you might be missing.]"
|
|
>
|
|
> RECOMMENDATION: Choose [A or B] because [one-line reason explaining which argument
|
|
> is more compelling and why]. Completeness: A=X/10, B=Y/10.
|
|
|
|
Options:
|
|
- A) Accept the outside voice's recommendation (I'll apply this change)
|
|
- B) Keep the current approach (reject the outside voice)
|
|
- C) Investigate further before deciding
|
|
- D) Add to TODOS.md for later
|
|
|
|
Wait for the user's response. Do NOT default to accepting because you agree with the
|
|
outside voice. If the user chooses B, the current approach stands — do not re-argue.
|
|
|
|
If no tension points exist, note: "No cross-model tension — both reviewers agree."
|
|
|
|
**Persist the result:**
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-plan-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
|
\`\`\`
|
|
|
|
Substitute: STATUS = "clean" if no findings, "issues_found" if findings exist.
|
|
SOURCE = "codex" if Codex ran, "claude" if subagent ran.
|
|
|
|
**Cleanup:** Run \`rm -f "$TMPERR_PV"\` after processing (if Codex was used).
|
|
|
|
---`;
|
|
}
|
|
|
|
// ─── Plan File Discovery (shared helper) ──────────────────────────────
|
|
|
|
function generatePlanFileDiscovery(): string {
|
|
return `### Plan File Discovery
|
|
|
|
1. **Conversation context (primary):** Check if there is an active plan file in this conversation. The host agent's system messages include plan file paths when in plan mode. If found, use it directly — this is the most reliable signal.
|
|
|
|
2. **Content-based search (fallback):** If no plan file is referenced in conversation context, search by content:
|
|
|
|
\`\`\`bash
|
|
setopt +o nomatch 2>/dev/null || true # zsh compat
|
|
BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-')
|
|
REPO=$(basename "$(git rev-parse --show-toplevel 2>/dev/null)")
|
|
# Compute project slug for ~/.gstack/projects/ lookup
|
|
_PLAN_SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\\([^/]*/[^/]*\\)\\.git$|\\1|;s|.*[:/]\\([^/]*/[^/]*\\)$|\\1|' | tr '/' '-' | tr -cd 'a-zA-Z0-9._-') || true
|
|
_PLAN_SLUG="\${_PLAN_SLUG:-$(basename "$PWD" | tr -cd 'a-zA-Z0-9._-')}"
|
|
# Search common plan file locations (project designs first, then personal/local)
|
|
for PLAN_DIR in "$HOME/.gstack/projects/$_PLAN_SLUG" "$HOME/.claude/plans" "$HOME/.codex/plans" ".gstack/plans"; do
|
|
[ -d "$PLAN_DIR" ] || continue
|
|
PLAN=$(ls -t "$PLAN_DIR"/*.md 2>/dev/null | xargs grep -l "$BRANCH" 2>/dev/null | head -1)
|
|
[ -z "$PLAN" ] && PLAN=$(ls -t "$PLAN_DIR"/*.md 2>/dev/null | xargs grep -l "$REPO" 2>/dev/null | head -1)
|
|
[ -z "$PLAN" ] && PLAN=$(find "$PLAN_DIR" -name '*.md' -mmin -1440 -maxdepth 1 2>/dev/null | xargs ls -t 2>/dev/null | head -1)
|
|
[ -n "$PLAN" ] && break
|
|
done
|
|
[ -n "$PLAN" ] && echo "PLAN_FILE: $PLAN" || echo "NO_PLAN_FILE"
|
|
\`\`\`
|
|
|
|
3. **Validation:** If a plan file was found via content-based search (not conversation context), read the first 20 lines and verify it is relevant to the current branch's work. If it appears to be from a different project or feature, treat as "no plan file found."
|
|
|
|
**Error handling:**
|
|
- No plan file found → skip with "No plan file detected — skipping."
|
|
- Plan file found but unreadable (permissions, encoding) → skip with "Plan file found but unreadable — skipping."`;
|
|
}
|
|
|
|
// ─── Plan Completion Audit ────────────────────────────────────────────
|
|
|
|
type PlanCompletionMode = 'ship' | 'review';
|
|
|
|
function generatePlanCompletionAuditInner(mode: PlanCompletionMode): string {
|
|
const sections: string[] = [];
|
|
|
|
// ── Plan file discovery (shared) ──
|
|
sections.push(generatePlanFileDiscovery());
|
|
|
|
// ── Item extraction ──
|
|
sections.push(`
|
|
### Actionable Item Extraction
|
|
|
|
Read the plan file. Extract every actionable item — anything that describes work to be done. Look for:
|
|
|
|
- **Checkbox items:** \`- [ ] ...\` or \`- [x] ...\`
|
|
- **Numbered steps** under implementation headings: "1. Create ...", "2. Add ...", "3. Modify ..."
|
|
- **Imperative statements:** "Add X to Y", "Create a Z service", "Modify the W controller"
|
|
- **File-level specifications:** "New file: path/to/file.ts", "Modify path/to/existing.rb"
|
|
- **Test requirements:** "Test that X", "Add test for Y", "Verify Z"
|
|
- **Data model changes:** "Add column X to table Y", "Create migration for Z"
|
|
|
|
**Ignore:**
|
|
- Context/Background sections (\`## Context\`, \`## Background\`, \`## Problem\`)
|
|
- Questions and open items (marked with ?, "TBD", "TODO: decide")
|
|
- Review report sections (\`## GSTACK REVIEW REPORT\`)
|
|
- Explicitly deferred items ("Future:", "Out of scope:", "NOT in scope:", "P2:", "P3:", "P4:")
|
|
- CEO Review Decisions sections (these record choices, not work items)
|
|
|
|
**Cap:** Extract at most 50 items. If the plan has more, note: "Showing top 50 of N plan items — full list in plan file."
|
|
|
|
**No items found:** If the plan contains no extractable actionable items, skip with: "Plan file contains no actionable items — skipping completion audit."
|
|
|
|
For each item, note:
|
|
- The item text (verbatim or concise summary)
|
|
- Its category: CODE | TEST | MIGRATION | CONFIG | DOCS`);
|
|
|
|
// ── Verification Mode (per PR #1302 — VAS-449 remediation) ──
|
|
sections.push(`
|
|
### Verification Mode
|
|
|
|
Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to \`git diff\`.
|
|
|
|
- **DIFF-VERIFIABLE** — A code change in this repo would manifest in \`git diff <base>...HEAD\`. Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears).
|
|
- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., \`domain-hq/docs/dashboard.md\`, \`~/Development/<other-repo>/...\`). The current diff CANNOT prove this.
|
|
- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this.
|
|
- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE.
|
|
|
|
**Verification dispatch:**
|
|
|
|
- **DIFF-VERIFIABLE** → cross-reference against diff (next section).
|
|
- **CROSS-REPO** → if the sibling repo is reachable on disk (try \`~/Development/<repo>/\`, \`~/code/<repo>/\`, the parent of the current repo), run \`[ -f <path> ]\` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check).
|
|
- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform.
|
|
- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm.
|
|
|
|
**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, \`~/...\`, or \`<sibling-repo>/<file>\`), it MUST be classified DONE or NOT DONE based on \`[ -f <path> ]\`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable.
|
|
|
|
**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's \`package.json\` for any script matching \`validate-*\`, \`lint-wiki\`, \`check-docs\`, or similar. If found, invoke it with the relevant path argument (e.g., \`npm run validate-wiki -- <path>\`). For multi-target validators (e.g., \`validate-wiki --all\`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE.
|
|
|
|
**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable.`);
|
|
|
|
// ── Cross-reference against diff ──
|
|
sections.push(`
|
|
### Cross-Reference Against Diff
|
|
|
|
Run \`git diff origin/<base>...HEAD\` and \`git log origin/<base>..HEAD --oneline\` to understand what was implemented.
|
|
|
|
For each extracted plan item, run the verification dispatch from the previous section, then classify:
|
|
|
|
- **DONE** — Clear evidence the item shipped. Cite the specific file(s) changed in the diff for DIFF-VERIFIABLE items, or the verified path that exists for CROSS-REPO items with a reachable sibling repo.
|
|
- **PARTIAL** — Some work toward this item exists but is incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
|
|
- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent).
|
|
- **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference.
|
|
- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo").
|
|
|
|
**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present.
|
|
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
|
|
**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE.`);
|
|
|
|
// ── Output format ──
|
|
sections.push(`
|
|
### Output Format
|
|
|
|
\`\`\`
|
|
PLAN COMPLETION AUDIT
|
|
═══════════════════════════════
|
|
Plan: {plan file path}
|
|
|
|
## Implementation Items
|
|
[DONE] Create UserService — src/services/user_service.rb (+142 lines)
|
|
[PARTIAL] Add validation — model validates but missing controller checks
|
|
[NOT DONE] Add caching layer — no cache-related changes in diff
|
|
[CHANGED] "Redis queue" → implemented with Sidekiq instead
|
|
|
|
## Test Items
|
|
[DONE] Unit tests for UserService — test/services/user_service_test.rb
|
|
[NOT DONE] E2E test for signup flow
|
|
|
|
## Migration Items
|
|
[DONE] Create users table — db/migrate/20240315_create_users.rb
|
|
|
|
## Cross-Repo / External Items
|
|
[DONE] sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md
|
|
[UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required
|
|
[UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard
|
|
|
|
─────────────────────────────────
|
|
COMPLETION: 5/9 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE
|
|
─────────────────────────────────
|
|
\`\`\``);
|
|
|
|
// ── Gate logic (mode-specific) ──
|
|
if (mode === 'ship') {
|
|
sections.push(`
|
|
### Gate Logic
|
|
|
|
After producing the completion checklist, evaluate in priority order:
|
|
|
|
1. **Any NOT DONE items** (highest priority — known missing work). Use AskUserQuestion:
|
|
- Show the completion checklist above
|
|
- "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation."
|
|
- RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A.
|
|
- Options:
|
|
A) Stop — implement the missing items before shipping
|
|
B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5)
|
|
C) These items were intentionally dropped — remove from scope
|
|
- If A: STOP. List the missing items for the user to implement.
|
|
- If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}".
|
|
- If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}."
|
|
|
|
2. **Any UNVERIFIABLE items** (silent gaps — the diff cannot prove them either way). Only fires after NOT DONE is resolved or absent.
|
|
|
|
**Per-item confirmation is mandatory.** Do NOT use a single AskUserQuestion to blanket-confirm all UNVERIFIABLE items. Blanket confirmation is the failure mode that surfaced in VAS-449 (user clicks A without opening any file). Instead:
|
|
|
|
- Loop through UNVERIFIABLE items one at a time.
|
|
- For each item, use AskUserQuestion with the item's *specific* manual check (e.g., "Confirm: does \`~/Development/domain-hq/docs/dashboard.md\` exist?", not "Have you checked all items?").
|
|
- Options per item:
|
|
Y) Confirmed done — cite what you verified (free-text, embedded in PR body)
|
|
N) Not done — block ship; treat as NOT DONE and re-enter the priority-1 gate
|
|
D) Intentionally dropped — note in PR body: "Plan item intentionally dropped: {item}"
|
|
- RECOMMENDATION per item: Y if the item is concrete and easily verified; N if it's critical-path (auth, DNS, deliverables to other repos) and the user shows hesitation.
|
|
|
|
**Exit conditions:**
|
|
- Any N: STOP. Surface the missing items, suggest re-running /ship after they're addressed.
|
|
- All Y or D: Continue. Embed \`## Plan Completion — Manual Verifications\` section in PR body listing each Y'd item with the user's free-text evidence and each D'd item with "intentionally dropped".
|
|
|
|
**Cap.** If there are more than 5 UNVERIFIABLE items, present them as a numbered list first and ask whether the user wants to (1) confirm each individually, (2) stop and reduce scope, or (3) explicitly accept blanket-confirmation with the warning that this is the VAS-449 failure shape. Default and recommended option is (1).
|
|
|
|
3. **Only PARTIAL items (no NOT DONE, no UNVERIFIABLE):** Continue with a note in the PR body. Not blocking.
|
|
|
|
4. **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue.
|
|
|
|
**No plan file found:** Skip entirely. "No plan file detected — skipping plan completion audit."
|
|
|
|
**Include in PR body (Step 8):** Add a \`## Plan Completion\` section with the checklist summary.`);
|
|
} else {
|
|
// review mode — enhanced Delivery Integrity (Release 2: Review Army)
|
|
sections.push(`
|
|
### Fallback Intent Sources (when no plan file found)
|
|
|
|
When no plan file is detected, use these secondary intent sources:
|
|
|
|
1. **Commit messages:** Run \`git log origin/<base>..HEAD --oneline\`. Use judgment to extract real intent:
|
|
- Commits with actionable verbs ("add", "implement", "fix", "create", "remove", "update") are intent signals
|
|
- Skip noise: "WIP", "tmp", "squash", "merge", "chore", "typo", "fixup"
|
|
- Extract the intent behind the commit, not the literal message
|
|
2. **TODOS.md:** If it exists, check for items related to this branch or recent dates
|
|
3. **PR description:** Run \`gh pr view --json body -q .body 2>/dev/null\` for intent context
|
|
|
|
**With fallback sources:** Apply the same Cross-Reference classification (DONE/PARTIAL/NOT DONE/CHANGED) using best-effort matching. Note that fallback-sourced items are lower confidence than plan-file items.
|
|
|
|
### Investigation Depth
|
|
|
|
For each PARTIAL or NOT DONE item, investigate WHY:
|
|
|
|
1. Check \`git log origin/<base>..HEAD --oneline\` for commits that suggest the work was started, attempted, or reverted
|
|
2. Read the relevant code to understand what was built instead
|
|
3. Determine the likely reason from this list:
|
|
- **Scope cut** — evidence of intentional removal (revert commit, removed TODO)
|
|
- **Context exhaustion** — work started but stopped mid-way (partial implementation, no follow-up commits)
|
|
- **Misunderstood requirement** — something was built but it doesn't match what the plan described
|
|
- **Blocked by dependency** — plan item depends on something that isn't available
|
|
- **Genuinely forgotten** — no evidence of any attempt
|
|
|
|
Output for each discrepancy:
|
|
\`\`\`
|
|
DISCREPANCY: {PARTIAL|NOT_DONE} | {plan item} | {what was actually delivered}
|
|
INVESTIGATION: {likely reason with evidence from git log / code}
|
|
IMPACT: {HIGH|MEDIUM|LOW} — {what breaks or degrades if this stays undelivered}
|
|
\`\`\`
|
|
|
|
### Learnings Logging (plan-file discrepancies only)
|
|
|
|
**Only for discrepancies sourced from plan files** (not commit messages or TODOS.md), log a learning so future sessions know this pattern occurred:
|
|
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-learnings-log '{
|
|
"type": "pitfall",
|
|
"key": "plan-delivery-gap-KEBAB_SUMMARY",
|
|
"insight": "Planned X but delivered Y because Z",
|
|
"confidence": 8,
|
|
"source": "observed",
|
|
"files": ["PLAN_FILE_PATH"]
|
|
}'
|
|
\`\`\`
|
|
|
|
Replace KEBAB_SUMMARY with a kebab-case summary of the gap, and fill in the actual values.
|
|
|
|
**Do NOT log learnings from commit-message-derived or TODOS.md-derived discrepancies.** These are informational in the review output but too noisy for durable memory.
|
|
|
|
### Integration with Scope Drift Detection
|
|
|
|
The plan completion results augment the existing Scope Drift Detection. If a plan file is found:
|
|
|
|
- **NOT DONE items** become additional evidence for **MISSING REQUIREMENTS** in the scope drift report.
|
|
- **Items in the diff that don't match any plan item** become evidence for **SCOPE CREEP** detection.
|
|
- **HIGH-impact discrepancies** trigger AskUserQuestion:
|
|
- Show the investigation findings
|
|
- Options: A) Stop and implement missing items, B) Ship anyway + create P1 TODOs, C) Intentionally dropped
|
|
|
|
This is **INFORMATIONAL** unless HIGH-impact discrepancies are found (then it gates via AskUserQuestion).
|
|
|
|
Update the scope drift output to include plan file context:
|
|
|
|
\`\`\`
|
|
Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING]
|
|
Intent: <from plan file — 1-line summary>
|
|
Plan: <plan file path>
|
|
Delivered: <1-line summary of what the diff actually does>
|
|
Plan items: N DONE, M PARTIAL, K NOT DONE
|
|
[If NOT DONE: list each missing item with investigation]
|
|
[If scope creep: list each out-of-scope change not in the plan]
|
|
\`\`\`
|
|
|
|
**No plan file found:** Use commit messages and TODOS.md as fallback sources (see above). If no intent sources at all, skip with: "No intent sources detected — skipping completion audit."`);
|
|
}
|
|
|
|
return sections.join('\n');
|
|
}
|
|
|
|
export function generatePlanCompletionAuditShip(_ctx: TemplateContext): string {
|
|
return generatePlanCompletionAuditInner('ship');
|
|
}
|
|
|
|
export function generatePlanCompletionAuditReview(_ctx: TemplateContext): string {
|
|
return generatePlanCompletionAuditInner('review');
|
|
}
|
|
|
|
// ─── Plan Verification Execution ──────────────────────────────────────
|
|
|
|
export function generatePlanVerificationExec(_ctx: TemplateContext): string {
|
|
return `## Step 8.1: Plan Verification
|
|
|
|
Automatically verify the plan's testing/verification steps using the \`/qa-only\` skill.
|
|
|
|
### 1. Check for verification section
|
|
|
|
Using the plan file already discovered in Step 8, look for a verification section. Match any of these headings: \`## Verification\`, \`## Test plan\`, \`## Testing\`, \`## How to test\`, \`## Manual testing\`, or any section with verification-flavored items (URLs to visit, things to check visually, interactions to test).
|
|
|
|
**If no verification section found:** Skip with "No verification steps found in plan — skipping auto-verification."
|
|
**If no plan file was found in Step 8:** Skip (already handled).
|
|
|
|
### 2. Check for running dev server
|
|
|
|
Before invoking browse-based verification, check if a dev server is reachable:
|
|
|
|
\`\`\`bash
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:3000 2>/dev/null || \\
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:8080 2>/dev/null || \\
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:5173 2>/dev/null || \\
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:4000 2>/dev/null || echo "NO_SERVER"
|
|
\`\`\`
|
|
|
|
**If NO_SERVER:** Skip with "No dev server detected — skipping plan verification. Run /qa separately after deploying."
|
|
|
|
### 3. Invoke /qa-only inline
|
|
|
|
Read the \`/qa-only\` skill from disk:
|
|
|
|
\`\`\`bash
|
|
cat \${CLAUDE_SKILL_DIR}/../qa-only/SKILL.md
|
|
\`\`\`
|
|
|
|
**If unreadable:** Skip with "Could not load /qa-only — skipping plan verification."
|
|
|
|
Follow the /qa-only workflow with these modifications:
|
|
- **Skip the preamble** (already handled by /ship)
|
|
- **Use the plan's verification section as the primary test input** — treat each verification item as a test case
|
|
- **Use the detected dev server URL** as the base URL
|
|
- **Skip the fix loop** — this is report-only verification during /ship
|
|
- **Cap at the verification items from the plan** — do not expand into general site QA
|
|
|
|
### 4. Gate logic
|
|
|
|
- **All verification items PASS:** Continue silently. "Plan verification: PASS."
|
|
- **Any FAIL:** Use AskUserQuestion:
|
|
- Show the failures with screenshot evidence
|
|
- RECOMMENDATION: Choose A if failures indicate broken functionality. Choose B if cosmetic only.
|
|
- Options:
|
|
A) Fix the failures before shipping (recommended for functional issues)
|
|
B) Ship anyway — known issues (acceptable for cosmetic issues)
|
|
- **No verification section / no server / unreadable skill:** Skip (non-blocking).
|
|
|
|
### 5. Include in PR body
|
|
|
|
Add a \`## Verification Results\` section to the PR body (Step 19):
|
|
- If verification ran: summary of results (N PASS, M FAIL, K SKIPPED)
|
|
- If skipped: reason for skipping (no plan, no server, no verification section)`;
|
|
}
|
|
|
|
// ─── Cross-Review Finding Dedup ──────────────────────────────────────
|
|
|
|
export function generateCrossReviewDedup(ctx: TemplateContext): string {
|
|
const isShip = ctx.skillName === 'ship';
|
|
const stepNum = isShip ? '9.3' : '5.0';
|
|
const findingsRef = isShip
|
|
? 'the checklist pass (Step 9) and specialist review (Step 9.1-9.2)'
|
|
: 'Step 4 critical pass and Step 4.5-4.6 specialists';
|
|
|
|
return `### Step ${stepNum}: Cross-review finding dedup
|
|
|
|
Before classifying findings, check if any were previously skipped by the user in a prior review on this branch.
|
|
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-read
|
|
\`\`\`
|
|
|
|
Parse the output: only lines BEFORE \`---CONFIG---\` are JSONL entries (the output also contains \`---CONFIG---\` and \`---HEAD---\` footer sections that are not JSONL — ignore those).
|
|
|
|
For each JSONL entry that has a \`findings\` array:
|
|
1. Collect all fingerprints where \`action: "skipped"\`
|
|
2. Note the \`commit\` field from that entry
|
|
|
|
If skipped fingerprints exist, get the list of files changed since that review:
|
|
|
|
\`\`\`bash
|
|
git diff --name-only <prior-review-commit> HEAD
|
|
\`\`\`
|
|
|
|
For each current finding (from both ${findingsRef}), check:
|
|
- Does its fingerprint match a previously skipped finding?
|
|
- Is the finding's file path NOT in the changed-files set?
|
|
|
|
If both conditions are true: suppress the finding. It was intentionally skipped and the relevant code hasn't changed.
|
|
|
|
Print: "Suppressed N findings from prior reviews (previously skipped by user)"
|
|
|
|
**Only suppress \`skipped\` findings — never \`fixed\` or \`auto-fixed\`** (those might regress and should be re-checked).
|
|
|
|
If no prior reviews exist or none have a \`findings\` array, skip this step silently.
|
|
|
|
Output a summary header: \`Pre-Landing Review: N issues (X critical, Y informational)\``;
|
|
}
|