Files
gstack/test/skill-e2e-plan.test.ts
Garry Tan 7ca04d8ef0 v1.42.0.0 Daegu wave: 23 community-filed bugs + PTY classifier enforcement (24 bisect commits) (#1594)
* fix(gstack-paths): guard CLAUDE_PLUGIN_DATA against cross-plugin contamination (#1569)

gstack-paths previously trusted CLAUDE_PLUGIN_DATA as a fallback for
GSTACK_STATE_ROOT whenever GSTACK_HOME was unset. When another plugin
(e.g. Codex) persists its own CLAUDE_PLUGIN_DATA into the session env
via CLAUDE_ENV_FILE, gstack picked it up and wrote checkpoints,
analytics, and learnings into that plugin's directory. Anyone with the
Codex plugin installed alongside gstack hit this silently.

Fix: guard the CLAUDE_PLUGIN_DATA branch so it only fires when
CLAUDE_PLUGIN_ROOT confirms we're running as the gstack plugin (path
contains "gstack"). Skill installs fall through to \$HOME/.gstack.

Contributed by @ElliotDrel via #1570. Closes #1569.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gbrain-sync): sourceLocalPath handles wrapped {sources:[...]} shape from gbrain v0.20+

gbrain v0.20+ changed `gbrain sources list --json` to return
{sources: [...]} instead of a flat array. sourceLocalPath crashed
upstream with `list.find is not a function` on every /sync-gbrain
invocation against modern gbrain. Accept both shapes for
forward/backward compat, matching probeSource/sourcePageCount in
lib/gbrain-sources.ts.

Contributed by @jakehann11 via #1571. Closes #1567. Supersedes #1564
(@tonyjzhou, same fix, different shape — credit retained).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-context-load): probe gbrain via execFile, not shell builtin (#1559)

gbrainAvailable() used `execFileSync("command", ["-v", "gbrain"])`,
which fails in any environment where the `command` builtin isn't on
the spawned process's PATH (most non-interactive shells). The probe
then reported gbrain as missing even when it was installed, and
context-load silently skipped vector/list queries.

Fix: probe `gbrain --version` directly with a 500ms timeout (matching
the rest of the file's MCP_TIMEOUT_MS). Same semantics, works
everywhere execFile works.

Contributed by @jbetala7 via #1560. Closes #1559.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(gbrain-doctor): pin schema_version:2 doctor parse path (#1418)

Adds an exec-path regression test that runs a fake gbrain shim emitting
the v0.25+ doctor JSON shape (schema_version: 2, status: "warnings",
exit 1 for health_score < 100, no top-level `engine` field). Confirms
freshDetectEngineTier recovers stdout from the non-zero exit and falls
back to GBRAIN_HOME/config.json for the engine label.

The pre-existing test for #1415 only stripped gbrain from PATH; this
test exercises the actual doctor parse path, closing the gap that
codex's plan review flagged.

Also documents the schema_version separation in
lib/gbrain-local-status.ts: the local CacheEntry stays at version 1,
distinct from the doctor-output schema_version which we accept across
versions in gstack-memory-helpers.

Closes #1418 (credit @mvanhorn for surfacing the doctor + schema_v2
collapse). The fix landed pre-emptively in v1.29.x; this commit pins
it with a stronger test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(memory-ingest): pin put_page regression + scrub stale name from --help and comments (#1346)

#1346 reported that gstack-memory-ingest still called the renamed
gbrain put_page subcommand on gbrain v0.18+. The actual code migrated
to `gbrain put` and later to batch `gbrain import <dir>` before this
report landed — only documentation lag remained.

This commit:
- Updates the --help string ("Skip gbrain put calls (still updates
  state file)") so user-facing docs match the shipped subcommand
- Updates two inline comments that still referenced the old name
- Adds test/memory-ingest-no-put_page.test.ts: a regression pin that
  strips comments from bin/gstack-memory-ingest.ts and fails the build
  if "put_page" appears in any active code or string literal, plus a
  sanity check that the file still calls a supported gbrain page-write
  verb (put or import)

Closes #1346. Reporter @kylma-code surfaced the doc lag; the original
code migration credit is on the v1.27.x wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(resolvers): rewrite all gbrain put_page instructions to canonical put <slug>

scripts/resolvers/gbrain.ts emitted user-facing copy-paste instructions
using the renamed `gbrain put_page` subcommand across 10 skills
(office-hours, investigate, plan-ceo-review, retro, plan-eng-review,
ship, cso, design-consultation, fallback, entity-stub). Every gstack
user copying those snippets hit "unknown command: put_page" on gbrain
v0.18+.

This commit:
- Rewrites all 10 instruction templates to use `gbrain put <slug>
  --content "$(cat <<EOF...EOF)"` with title/tags moved into YAML
  frontmatter inside --content, matching the v0.18+ subcommand shape
- Updates README.md and USING_GBRAIN_WITH_GSTACK.md "common commands"
  table to reference `gbrain put` and `gbrain get`
- Adds test/resolvers-gbrain-put-rewrite.test.ts pinning two
  invariants: (a) resolver source ships only canonical instructions,
  (b) every tracked SKILL.md file is free of `gbrain put_page`

CHANGELOG entries are deliberately left untouched (historical record).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(build): extract package.json build to scripts/build.sh for Windows Bun compat (#1538, #1537, #1530, #1457, #1561)

Bun's Windows shell parser rejects multiple constructs the inline
package.json build chain used: brace groups `{ cmd; }`, subshells with
redirection `( git ... ) > path/.version`, and (in Bun 1.3.x) subshells
near redirections in general. Every Windows install + every
auto-upgrade since v1.34.2.0 has failed on `bun run build`.

Extracts the build chain to scripts/build.sh and the .version writes to
scripts/write-version-files.sh. POSIX-portable, no Bun shell parsing
involved. Also adds Windows-specific bun.exe handling for non-ASCII
PATHs (a separate Windows footgun where Bun's --compile fails when the
binary lives under a path with non-ASCII chars).

Updates test/build-script-shell-compat.test.ts to assert the new shape:
no subshells with redirections anywhere in the build chain, and build
delegates to scripts/build.sh which delegates .version writes.

Contributed by @Charlie-El via #1544. Supersedes #1531 (@scarson, fixed
in build helper), #1480 (@mikepsinn, partial overlap), #1460
(@realcarsonterry, brace-group fix subsumed) — credit retained.
Closes #1538, #1537, #1530, #1457, #1561.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(windows): .exe glob in .gitignore + .exe extension resolution in find-browse (#1554)

bun build --compile on Windows appends .exe to the output filename,
producing browse.exe instead of browse. find-browse's existsSync probe
only checked the bare path and returned null on Windows even when the
binary was correctly built. .gitignore similarly only excluded the
bare bin/gstack-global-discover path, leaving the .exe variant
tracked.

This commit:
- .gitignore: changes `bin/gstack-global-discover` →
  `bin/gstack-global-discover*` so the Windows .exe variant is ignored
- browse/src/find-browse.ts: adds isExecutable + findExecutable helpers
  that fall back to .exe/.cmd/.bat probing on Windows, mirroring the
  same helper already in make-pdf/src/browseClient.ts and pdftotext.ts

Contributed by @Mike-E-Log via #1554.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(windows): add fresh-install E2E gate that runs bun run build on windows-latest

Adds .github/workflows/windows-setup-e2e.yml as the gate that catches
Bun shell-parser regressions in the build chain before they reach
users. Triggers on PRs touching package.json, scripts/build.sh,
scripts/write-version-files.sh, setup, browse cli/find-browse, or
gstack-paths.

What it verifies:
1. bun run build completes on Windows (the previously-broken path that
   #1538/#1537/#1530/#1457/#1561 reported)
2. All compiled binaries land on disk (browse.exe, find-browse.exe,
   design.exe, gstack-global-discover.exe)
3. find-browse resolves to the .exe variant on Windows (regression
   gate for #1554)
4. gstack-paths returns non-empty GSTACK_STATE_ROOT/PLAN_ROOT/TMP_ROOT
   on Windows (regression gate for #1570)

Complements the existing windows-free-tests.yml (curated unit subset);
this new workflow exercises the install path itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): move diff scope into prompt instead of --base (Codex CLI 0.130+ argv conflict) (#1209)

Codex CLI ≥ 0.130.0 rejects passing a custom prompt and --base together
(mutually exclusive at argv level). Every /codex review, /review, and
/ship structured Codex review call ended with an argv error before the
model ran.

Fix: scope the diff in prompt text using
"Run git diff origin/<base>...HEAD 2>/dev/null || git diff <base>...HEAD"
instead of `--base <base>`. Preserves the filesystem boundary
instruction across all invocations and keeps Codex's review prompt
tuning.

Touches:
- codex/SKILL.md.tmpl + regenerated codex/SKILL.md
- scripts/resolvers/review.ts + regenerated review/SKILL.md, ship/SKILL.md
- test/gen-skill-docs.test.ts: new regression that fails if any of the
  five known files still contain the prompt+--base shape
- test/skill-validation.test.ts: corresponding negative + positive pin
  on the rendered SKILL.md files

Contributed by @jbetala7 via #1209. Closes #1479. Supersedes #1527
(@mvanhorn — same intent, different patch shape, CONFLICTING) and
#1449 (@Gujiassh — broader refactor, CONFLICTING). Credit retained
in CHANGELOG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): diff from git merge-base, not git diff origin/<base> (#1492)

git diff origin/<base> shows everything since the common ancestor in
both directions — it includes commits that landed on origin/<base>
after this branch was created as deletions. That made /review and
/ship's pre-landing structured review report inflated diff totals and
flagged "removed" code that was actually still present in the working
tree.

Fix: compute DIFF_BASE via git merge-base origin/<base> HEAD and diff
the working tree against that point. Same coverage of uncommitted
edits, no phantom deletions from out-of-order base advancement.

Applies to /review's Step 1 (diff existence check), Step 3 (get the
diff), the build-on-intent scope-creep check, the structured review
DIFF_INS/DIFF_DEL stats, and the Claude adversarial subagent prompt.
Same change flows into ship/SKILL.md via the shared resolver.

Touches:
- review/SKILL.md.tmpl + regenerated review/SKILL.md, ship/SKILL.md
- scripts/resolvers/review.ts
- scripts/resolvers/review-army.ts

Contributed by @mvanhorn via #1492.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(codex): pin filesystem-boundary preservation across all codex review surfaces (#1503, #1522)

#1503 reported that the bare codex review --base path stripped the
filesystem boundary instruction, letting Codex spend tokens reading
.claude/skills/ and agents/. #1522 proposed adding a skill-path
detector that switched to the custom-instructions route when the diff
touched skill files.

After C10 (#1209) restructured codex review to always carry the
boundary in the prompt (the prompt+--base argv conflict forced the
restructure), the skill-path detector becomes redundant — every
default call already preserves the boundary.

This commit pins the post-#1209 invariant with a test that fails the
build if any future refactor strips the boundary from codex/SKILL.md,
review/SKILL.md, or ship/SKILL.md. Closes #1503 by regression test.

#1522 (@genisis0x) is superseded by #1209 (the prompt rewrite covers
its safety concern); credit retained in CHANGELOG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(skills): use command -v instead of which for codex detection (#1197)

`which` is not on PATH in every shell — some Windows shells, BusyBox-
only containers, and minimal CI images all fail when skills probe
codex availability via `which codex`. `command -v` is a POSIX builtin
and always available where the skill is running.

Touched:
- codex/SKILL.md.tmpl: CODEX_BIN=$(command -v codex || echo "")
- scripts/resolvers/review.ts and scripts/resolvers/design.ts:
  3 + 3 sites each rewritten to `command -v codex >/dev/null 2>&1`
- Regenerated all 10 affected SKILL.md files (codex, review, ship,
  design-consultation, design-review, office-hours, plan-ceo-review,
  plan-design-review, plan-devex-review, plan-eng-review)
- test/skill-validation.test.ts: updated pin + defensive regression
  test that fails if `which codex` returns to codex/SKILL.md
- test/skill-e2e-plan.test.ts: updated summary regex

Contributed by @mvanhorn via #1197.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): surface non-zero exits so wrappers stop reading as silent stalls (#1467, #1327)

When codex exits non-zero (parse errors, arg-shape breaks, model API
errors that propagate as non-zero status), the calling agent
previously saw an empty output and burned 30-60 minutes misdiagnosing
as a silent model/API stall. The hang-detection block only caught
exit 124 (the timeout-wrapper signal).

Adds elif blocks in all four codex invocation sites (Review default,
Challenge, Consult new-session, Consult resume) that:
- Echo "[codex exit N] <stderr first line>" to stdout
- Indent the first 20 stderr lines for inline context
- Log codex_nonzero_exit telemetry tagged with the call site

Contributed by @genisis0x via #1467. Closes #1327.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(design): disclose OpenAI key source + warn on cwd .env match (#1278, closes #1248)

The design binary previously called process.env.OPENAI_API_KEY without
checking where the key came from. If a user ran $D inside someone
else's project that had OPENAI_API_KEY in its .env, the resulting
generation billed that project's account. Silent and irreversible.

Fix: resolveApiKeyInfo() returns both the key and its source. When the
env-var path matches an OPENAI_API_KEY entry in the current
directory's .env, .env.<NODE_ENV>, or .env.local file, we set a
warning. requireApiKey() prints "Using OpenAI key from <source>" plus
the warning before the run — never the key itself.

Adds 6 unit tests covering: config-vs-env precedence, env-only (no
match), env+cwd .env match, quoted/exported values, value-mismatch
(no false positive), and the no-leak invariant for requireApiKey
stderr output.

Contributed by @jbetala7 via #1278. Closes #1248.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): guard full-page screenshots against Anthropic vision API >2000px brick (#1214)

Full-page screenshots of tall pages routinely exceeded 2000px on the
longest dimension, silently bricking the agent's session: the
resulting base64 reached the Anthropic vision API which rejected the
oversized image, leaving the agent burning turns on a useless blob
with no stderr trace from the browse side.

Adds browse/src/screenshot-size-guard.ts as a shared helper:
- guardScreenshotBuffer(buf) → downscales in-memory if max(w,h) > 2000
- guardScreenshotPath(path) → file-mode variant that rewrites in place
- Aspect ratio preserved via sharp's resize fit:inside
- Stderr diagnostic on any downscale so callers can see when it fired
- Lazy sharp import so non-screenshot paths pay no startup cost

Wires the guard into all three full-page callsites codex review
flagged:
- browse/src/snapshot.ts: annotated + heatmap fullPage captures
- browse/src/meta-commands.ts: screenshot command (path + base64
  fullPage modes) plus the responsive 3-viewport sweep
- browse/src/write-commands.ts: prettyscreenshot fullPage path

Covers seven unit cases (pass-through, downscale, aspect ratio,
exactly-2000px edge, file-mode rewrite) plus a static invariant test
that fails the build if any of the three callsites stops importing the
guard.

Closes #1214.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): add Node sidecar entry for L4 prompt-injection classifier (#1370)

The L4 TestSavant classifier in browse/src/security-classifier.ts
can't be imported into the compiled browse server (onnxruntime-node
dlopen fails from Bun's compile extract dir per CLAUDE.md). The agent
that used to host it (sidebar-agent.ts) was removed when the PTY
proved out — leaving the classifier file shipped but with zero
callers. Exactly the gap codex flagged in #1370.

Adds browse/src/security-sidecar-entry.ts: a Node script that runs the
classifier as a subprocess of the browse server. It reads NDJSON
requests from stdin and writes id-correlated NDJSON responses to
stdout, supporting:
  - op: "scan-page-content" — full L4 classifier scan
  - op: "ping" — liveness probe for the client's health check
  - op: "status" — classifier readiness (used by /pty-inject-scan to
    surface l4 { available: bool } in its response)

Plus browse/src/find-security-sidecar.ts: a resolver that locates
node + the bundled JS entry (browse/dist/security-sidecar.js, built in
a follow-up package.json change) or falls back to the dev TS entry.
Returns null cleanly when node isn't on PATH so the calling endpoint
can degrade per D7 (extension WARN + user confirm).

C17 of the security-stack wave. C18 adds the IPC client + lifecycle
management; C19 wires the endpoint; C20 routes the extension through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): sidecar IPC client with lifecycle + circuit breaker (#1370)

Adds browse/src/security-sidecar-client.ts to manage the Node L4
classifier subprocess from the compiled browse server:

- Lazy spawn on first scan; reuses the same process across requests
- Id-correlated request/response via NDJSON over stdio
- 5s default per-scan timeout; 64KB payload cap (short-circuits before
  spawn so oversized requests don't waste a process)
- 3-in-10-minutes respawn cap → trips circuit breaker; subsequent
  scans throw immediately so the /pty-inject-scan endpoint can surface
  l4 { available: false } to the extension and degrade to WARN+confirm
- process.on('exit') sends SIGTERM to the child for clean teardown
- isSidecarAvailable() lets the endpoint probe before scan calls so
  the response shape reflects degraded mode honestly

Unit tests cover the payload cap, the availability probe, and the
breaker-doesn't-crash invariant under repeated rejected calls.

C18 of the security-stack wave. C19 adds POST /pty-inject-scan; C20
routes the extension through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): add POST /pty-inject-scan endpoint for pre-PTY-inject scans (#1370)

The sidebar's gstackInjectToTerminal callers (toolbar Cleanup,
Inspector "Send to Code") were piping page-derived text directly into
the live claude PTY with ZERO classifier processing — the gap codex
flagged in #1370. The documented sidebar security stack had a hole
the size of every Cleanup-button click.

Adds POST /pty-inject-scan to browse/src/server.ts:
- Local-only binding (NOT in TUNNEL_PATHS — tunnel attempts get the
  general 404 path; never reaches the scan logic)
- Root-token auth via existing validateAuth() — 401 on unauth
- 64KB request cap → 413 + payload-too-large body
- 5s scan timeout via sidecar client
- URL-blocklist forced to BLOCK in PTY context (page-derived REPL
  input is higher-risk than ordinary tool output)
- L4 ML classifier via the sidecar when available; degrades to WARN
  per D7 when sidecar is unavailable
- Response goes through JSON.stringify(..., sanitizeReplacer) per
  v1.38.0.0 Unicode-egress hardening
- Imports only from security-sidecar-client.ts, never directly from
  security-classifier.ts (which would brick the compiled Bun binary)

Seven static-invariant tests pin the POST verb, auth gate, 64KB cap,
tunnel-listener exclusion, sanitizeReplacer wrapping, l4 availability
shape, and the no-direct-classifier-import rule.

C19 of the security-stack wave. C20 routes the extension through it;
C21 adds the invariant AST check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(extension): route gstackInjectToTerminal through /pty-inject-scan (#1370)

Closes the documented-vs-shipped gap codex flagged in #1370. The
sidebar's two PTY-injection call sites (Inspector "Send to Code" and
toolbar Cleanup) now pre-scan via the new /pty-inject-scan endpoint
before writing to the live claude REPL.

Adds window.gstackScanForPTYInject(text, origin) to
extension/sidepanel-terminal.js:
- Async, returns { allow, verdict, reasons, l4 }
- POST to /pty-inject-scan with the existing root-token auth
- WARN+confirm on scan failure (network down, sidecar absent, etc.)
  rather than silent PASS — D7 honest-degradation

gstackInjectToTerminal stays synchronous, returns boolean. Per D6:
keeping the inject sync means existing `const ok = ...?.()` callers
don't break, and the invariant test in
test/extension-pty-inject-invariant.test.ts can statically pin that
every call goes through the scan first.

extension/sidepanel.js call sites updated:
- inspectorSendBtn click → await scan, BLOCK drops + WARN prompts via
  window.confirm, PASS injects silently
- runCleanup() → same flow. Static cleanup prompt always PASSes but
  still routes through scan to honor the invariant.

C20 of the security-stack wave. C21 adds the static invariant test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(security): invariant — extension PTY inject must be scan-gated (#1370)

Static-analysis invariant test that fails the build if any
extension/*.js path calls window.gstackInjectToTerminal without a
preceding window.gstackScanForPTYInject in the same enclosing
function. Closes the documented-vs-shipped gap codex demanded a
machine check on.

Rules:
- Rule 1: any file that calls inject must also reference scan
- Rule 2: in the enclosing function (function declaration, arrow,
  async (), event handler), a scan call must appear before the inject
  call by source position
- Exemption: sidepanel-terminal.js (the file that DEFINES the inject
  function) is exempt from Rule 2 since the definition is not a call

Plus two structural checks:
- sidepanel-terminal.js defines both the inject and scan functions
- inject stays SYNCHRONOUS (no `async` modifier) per D6 — async would
  silently break the `const ok = ...?.()` pattern at every caller

C21 of the security-stack wave. The sidecar architecture (#1370) is
complete: server-side L1-L3 + L4-via-sidecar (C17+C18+C19), extension
pre-scan wiring (C20), and now the regression gate (C21).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): opt-in extended stealth mode with 6 detection-vector patches (#1112)

Rebases @garrytan's PR #1112 (Apr 2026, abandoned) onto the current
browse/src/stealth.ts contract. The existing minimal "codex narrowed"
stealth (webdriver-mask + AutomationControlled launch arg) stays the
default. PR #1112's six additional patches are added behind an opt-in
GSTACK_STEALTH=extended env flag.

Extended-mode patches (applied AFTER the default mask, in order):
  1. delete navigator.webdriver from prototype (not just the getter —
     detectors check `"webdriver" in navigator`)
  2. WebGL renderer spoof to Apple M1 Pro (SwiftShader was the #1
     software-GPU tell in containers)
  3. navigator.plugins returns a PluginArray-prototype-passing array
     with MimeType objects and namedItem()
  4. window.chrome populated with chrome.app, chrome.runtime,
     chrome.loadTimes(), chrome.csi() with realistic shapes
  5. navigator.mediaDevices backfilled when headless drops it
  6. CDP cdc_*-prefixed window globals cleared

Why opt-in: the default mode's contract is fingerprint CONSISTENCY,
which protects against detectors that flag spoofing mismatch. Extended
mode actively lies about the environment; sites that reflect on these
properties can break. Users who hit detection in default mode can flip
GSTACK_STEALTH=extended for SannySoft 100% pass-rate.

Twenty unit tests pin the env-flag semantics, all six patches' code
presence, and the applyStealth wiring order. Live SannySoft pass-rate
verification stays in the periodic-tier E2E suite.

Contributed by @garrytan via #1112 (rebased — original PR opened
before the codex-narrowed minimum landed; rebase preserves the
narrowed default while adding the SannySoft-passing path as opt-in).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship-SKILL.md golden baselines after C10-C13 + C16 templates

Updates the three ship-SKILL.md golden baselines (claude, codex,
factory hosts) to match the new shape produced by:
- C10 #1209 codex argv (prompt + diff scope, no --base)
- C11 #1492 merge-base diff (DIFF_BASE= preamble)
- C13 #1197 command -v for codex detection
- C12 + boundary preservation per regen-enforcing test

Per CLAUDE.md SKILL.md workflow: edit the .tmpl, run gen:skill-docs,
commit the regenerated outputs together. Goldens are part of the
regen contract — without this commit, test/host-config.test.ts'
golden-baseline checks fail with the diff codex review surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): v1.41.0.0 — Daegu wave (24 bisect commits, 14 user-facing fixes)

Bumps VERSION 1.40.0.0 → 1.41.0.0. CHANGELOG entry follows the
release-summary format in CLAUDE.md: two-line headline, lead
paragraph, "The numbers that matter" table, "What this means for
builders" closer, then itemized Added/Changed/Fixed/For contributors
with inline credit to every PR author and original issue reporter.

Scale-aware bump per CLAUDE.md: 24 commits, ~6000 LOC net,
substantial new capability across security (PTY sidecar wiring),
install (Windows build chain), compat (gbrain 0.18-0.35, Codex CLI
0.130+), and quality (screenshot guard, design key disclosure,
extended stealth opt-in). MINOR is the right call.

Closes for users: #1567, #1559, #1569, #1346, #1418, #1538, #1537,
#1530, #1457, #1561, #1554, #1479, #1503, #1248, #1214, #1370, #1327,
#1193 pattern, #1152 pattern. Credit retained inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(find-browse): resolve source-checkout layout <git-root>/browse/dist/browse[.exe]

windows-setup-e2e.yml runs `bun browse/src/find-browse.ts` against a
freshly-built repo where binaries land at browse/dist/browse.exe (no
.claude/skills/gstack/ install layout). The previous markers chain
only matched .codex/.agents/.claude prefixed paths, so find-browse
exited "not found" even when the binary was present.

Adds a source-checkout fallback after the marker scan: if no
installed layout resolves but <git-root>/browse/dist/browse[.exe]
exists, return that. Three real callers hit this path:
- gstack repo dev workflow before `./setup` runs
- windows-setup-e2e.yml CI (the breakage that surfaced this)
- make-pdf consumers running from a sibling source checkout

Smoke-verified: a fresh git repo with browse/dist/browse on disk now
resolves through the source-checkout branch (was returning null
before this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump v1.41.0.0 → v1.42.0.0 to clear queue collision with #1574

The version-gate workflow flagged a collision: PR #1574
(garrytan/colombo-v3) already claims v1.41.0.0, and #1592
(fix/audit-critical-high-bugs) claims v1.41.1.0. Per CLAUDE.md's
workspace-aware ship rule, queue-advancing past a claimed version
within the same bump level is permitted — MINOR work landing on top
of a queued MINOR still reads as MINOR relative to main.

Util's suggested next slot is v1.42.0.0; taking it. CHANGELOG entry
header bumped + dated 2026-05-19; entry body unchanged (same wave
content, same credit list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:35:01 -07:00

809 lines
31 KiB
TypeScript

import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
ROOT, browseBin, runId, evalsEnabled,
describeIfSelected, testConcurrentIfSelected,
copyDirSync, setupBrowseShims, logCost, recordE2E,
createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { judgePosture } from './helpers/llm-judge';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
const evalCollector = createEvalCollector('e2e-plan');
// --- Plan CEO Review E2E ---
describeIfSelected('Plan CEO Review E2E', ['plan-ceo-review'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
// Init git repo (CEO review SKILL.md has a "System Audit" step that runs git)
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create a simple plan document for the agent to review
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add User Dashboard
## Context
We're building a new user dashboard that shows recent activity, notifications, and quick actions.
## Changes
1. New React component \`UserDashboard\` in \`src/components/\`
2. REST API endpoint \`GET /api/dashboard\` returning user stats
3. PostgreSQL query for activity aggregation
4. Redis cache layer for dashboard data (5min TTL)
## Architecture
- Frontend: React + TailwindCSS
- Backend: Express.js REST API
- Database: PostgreSQL with existing user/activity tables
- Cache: Redis for dashboard aggregates
## Open questions
- Should we use WebSocket for real-time updates?
- How do we handle users with 100k+ activity records?
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-ceo-review skill
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Choose HOLD SCOPE mode. Skip any AskUserQuestion calls — this is non-interactive.
Write your complete review directly to ${planDir}/review-output.md
Focus on reviewing the plan content: architecture, error handling, security, and performance.`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-ceo-review',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-ceo-review', result);
recordE2E(evalCollector, '/plan-ceo-review', 'Plan CEO Review E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Accept error_max_turns — the CEO review is very thorough and may exceed turns
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify the review was written
const reviewPath = path.join(planDir, 'review-output.md');
if (fs.existsSync(reviewPath)) {
const review = fs.readFileSync(reviewPath, 'utf-8');
expect(review.length).toBeGreaterThan(200);
}
}, 420_000);
});
// --- Plan CEO Review (SELECTIVE EXPANSION) E2E ---
describeIfSelected('Plan CEO Review SELECTIVE EXPANSION E2E', ['plan-ceo-review-selective'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-sel-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add User Dashboard
## Context
We're building a new user dashboard that shows recent activity, notifications, and quick actions.
## Changes
1. New React component \`UserDashboard\` in \`src/components/\`
2. REST API endpoint \`GET /api/dashboard\` returning user stats
3. PostgreSQL query for activity aggregation
4. Redis cache layer for dashboard data (5min TTL)
## Architecture
- Frontend: React + TailwindCSS
- Backend: Express.js REST API
- Database: PostgreSQL with existing user/activity tables
- Cache: Redis for dashboard aggregates
## Open questions
- Should we use WebSocket for real-time updates?
- How do we handle users with 100k+ activity records?
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review-selective', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Choose SELECTIVE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive.
For the cherry-pick ceremony, accept all expansion proposals automatically.
Write your complete review directly to ${planDir}/review-output-selective.md
Focus on reviewing the plan content: architecture, error handling, security, and performance.`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-ceo-review-selective',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-ceo-review (SELECTIVE)', result);
recordE2E(evalCollector, '/plan-ceo-review-selective', 'Plan CEO Review SELECTIVE EXPANSION E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
const reviewPath = path.join(planDir, 'review-output-selective.md');
if (fs.existsSync(reviewPath)) {
const review = fs.readFileSync(reviewPath, 'utf-8');
expect(review.length).toBeGreaterThan(200);
}
}, 420_000);
});
// --- Plan CEO Review SCOPE EXPANSION energy (V1.1 mode-posture regression gate) ---
describeIfSelected('Plan CEO Review Expansion Energy E2E', ['plan-ceo-review-expansion-energy'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-exp-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Use the shared fixture so expansion-energy regressions are reproducible.
const fixture = fs.readFileSync(
path.join(ROOT, 'test', 'fixtures', 'mode-posture', 'expansion-plan.md'),
'utf-8',
);
fs.writeFileSync(path.join(planDir, 'plan.md'), fixture);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review-expansion-energy', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Choose SCOPE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive. Auto-approve the ideal-architecture approach in 0C-bis. For 0D, run all three analyses (10x check, platonic ideal, delight opportunities), then emit exactly 2 concrete expansion proposals in the opt-in ceremony.
Write your expansion proposals to ${planDir}/proposals.md with ONLY the proposal text — no conversational wrapper, no review summary, no mode analysis. Each proposal separated by "---".`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-ceo-review-expansion-energy',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-ceo-review (EXPANSION ENERGY)', result);
recordE2E(evalCollector, '/plan-ceo-review-expansion-energy', 'Plan CEO Review Expansion Energy E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
const proposalsPath = path.join(planDir, 'proposals.md');
if (!fs.existsSync(proposalsPath)) {
throw new Error('Agent did not emit proposals.md — expansion energy eval requires proposal output');
}
const proposalText = fs.readFileSync(proposalsPath, 'utf-8');
expect(proposalText.length).toBeGreaterThan(200);
const scores = await judgePosture('expansion', proposalText);
console.log('Expansion energy scores:', JSON.stringify(scores, null, 2));
// Pass threshold: 4/5 on both axes (good — matches posture with minor weakness).
expect(scores.axis_a).toBeGreaterThanOrEqual(4); // surface_framing
expect(scores.axis_b).toBeGreaterThanOrEqual(4); // decision_preservation
}, 600_000);
});
// --- Plan Eng Review E2E ---
describeIfSelected('Plan Eng Review E2E', ['plan-eng-review'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-eng-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create a plan with more engineering detail
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Migrate Auth to JWT
## Context
Replace session-cookie auth with JWT tokens. Currently using express-session + Redis store.
## Changes
1. Add \`jsonwebtoken\` package
2. New middleware \`auth/jwt-verify.ts\` replacing \`auth/session-check.ts\`
3. Login endpoint returns { accessToken, refreshToken }
4. Refresh endpoint rotates tokens
5. Migration script to invalidate existing sessions
## Files Modified
| File | Change |
|------|--------|
| auth/jwt-verify.ts | NEW: JWT verification middleware |
| auth/session-check.ts | DELETED |
| routes/login.ts | Return JWT instead of setting cookie |
| routes/refresh.ts | NEW: Token refresh endpoint |
| middleware/index.ts | Swap session-check for jwt-verify |
## Error handling
- Expired token: 401 with \`token_expired\` code
- Invalid token: 401 with \`invalid_token\` code
- Refresh with revoked token: 403
## Not in scope
- OAuth/OIDC integration
- Rate limiting on refresh endpoint
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-eng-review skill
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-eng-review', async () => {
const result = await runSkillTest({
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
Write your complete review directly to ${planDir}/review-output.md
Focus on architecture, code quality, tests, and performance sections.`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-eng-review',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-eng-review', result);
recordE2E(evalCollector, '/plan-eng-review', 'Plan Eng Review E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify the review was written
const reviewPath = path.join(planDir, 'review-output.md');
if (fs.existsSync(reviewPath)) {
const review = fs.readFileSync(reviewPath, 'utf-8');
expect(review.length).toBeGreaterThan(200);
}
}, 420_000);
});
// --- Plan-Eng-Review Test-Plan Artifact E2E ---
describeIfSelected('Plan-Eng-Review Test-Plan Artifact E2E', ['plan-eng-review-artifact'], () => {
let planDir: string;
let projectDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-artifact-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create base commit on main
fs.writeFileSync(path.join(planDir, 'app.ts'), 'export function greet() { return "hello"; }\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
// Create feature branch with changes
run('git', ['checkout', '-b', 'feature/add-dashboard']);
fs.writeFileSync(path.join(planDir, 'dashboard.ts'), `export function Dashboard() {
const data = fetchStats();
return { users: data.users, revenue: data.revenue };
}
function fetchStats() {
return fetch('/api/stats').then(r => r.json());
}
`);
fs.writeFileSync(path.join(planDir, 'app.ts'), `import { Dashboard } from "./dashboard";
export function greet() { return "hello"; }
export function main() { return Dashboard(); }
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'feat: add dashboard']);
// Plan document
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add Dashboard
## Changes
1. New \`dashboard.ts\` with Dashboard component and fetchStats API call
2. Updated \`app.ts\` to import and use Dashboard
## Architecture
- Dashboard fetches from \`/api/stats\` endpoint
- Returns user count and revenue metrics
`);
run('git', ['add', 'plan.md']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-eng-review skill
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
);
// Set up remote-slug shim and browse shims (plan-eng-review uses remote-slug for artifact path)
setupBrowseShims(planDir);
// Create project directory for artifacts
projectDir = path.join(os.homedir(), '.gstack', 'projects', 'test-project');
fs.mkdirSync(projectDir, { recursive: true });
// Clean up stale test-plan files from previous runs
try {
const staleFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
for (const f of staleFiles) {
fs.unlinkSync(path.join(projectDir, f));
}
} catch {}
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
// Clean up test-plan artifacts (but not the project dir itself)
try {
const files = fs.readdirSync(projectDir);
for (const f of files) {
if (f.includes('test-plan')) {
fs.unlinkSync(path.join(projectDir, f));
}
}
} catch {}
});
testConcurrentIfSelected('plan-eng-review-artifact', async () => {
// Count existing test-plan files before
const beforeFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
const result = await runSkillTest({
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Skip the preamble bash block, lake intro, telemetry, and contributor mode sections — go straight to the review.
Read plan.md — that's the plan to review. This is a standalone plan with source code in app.ts and dashboard.ts.
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
IMPORTANT: After your review, you MUST write the test-plan artifact as described in the "Test Plan Artifact" section of SKILL.md. The remote-slug shim is at ${planDir}/browse/bin/remote-slug.
Write your review to ${planDir}/review-output.md`,
workingDirectory: planDir,
maxTurns: 25,
allowedTools: ['Bash', 'Read', 'Write', 'Glob', 'Grep'],
timeout: 360_000,
testName: 'plan-eng-review-artifact',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-eng-review artifact', result);
recordE2E(evalCollector, '/plan-eng-review test-plan artifact', 'Plan-Eng-Review Test-Plan Artifact E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify test-plan artifact was written
const afterFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
const newFiles = afterFiles.filter(f => !beforeFiles.includes(f));
console.log(`Test-plan artifacts: ${beforeFiles.length} before, ${afterFiles.length} after, ${newFiles.length} new`);
if (newFiles.length > 0) {
const content = fs.readFileSync(path.join(projectDir, newFiles[0]), 'utf-8');
console.log(`Test-plan artifact (${newFiles[0]}): ${content.length} chars`);
expect(content.length).toBeGreaterThan(50);
} else {
console.warn('No test-plan artifact found — agent may not have followed artifact instructions');
}
// Soft assertion: we expect an artifact but agent compliance is not guaranteed.
// Log rather than fail — the test-plan artifact is a bonus output, not the core test.
if (newFiles.length === 0) {
console.warn('SOFT FAIL: No test-plan artifact written — agent did not follow artifact instructions');
}
}, 420_000);
});
// --- Office Hours Spec Review E2E ---
describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => {
let ohDir: string;
beforeAll(() => {
ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
// Copy office-hours skill
fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'office-hours', 'SKILL.md'),
path.join(ohDir, 'office-hours', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('office-hours-spec-review', async () => {
const result = await runSkillTest({
prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop.
Summarize what the "Spec Review Loop" section does — specifically:
1. How many dimensions does the reviewer check?
2. What tool is used to dispatch the reviewer?
3. What's the maximum number of iterations?
4. What metrics are tracked?
Write your summary to ${ohDir}/spec-review-summary.md`,
workingDirectory: ohDir,
maxTurns: 8,
timeout: 120_000,
testName: 'office-hours-spec-review',
runId,
});
logCost('/office-hours spec review', result);
recordE2E(evalCollector, '/office-hours-spec-review', 'Office Hours Spec Review E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(ohDir, 'spec-review-summary.md');
if (fs.existsSync(summaryPath)) {
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/);
expect(summary).toMatch(/agent|subagent/);
expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/);
}
}, 180_000);
});
// --- Plan CEO Review Benefits-From E2E ---
describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => {
let benefitsDir: string;
beforeAll(() => {
benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review-benefits', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found".
Summarize what happens when no design doc is found — specifically:
1. Is /office-hours offered as a prerequisite?
2. What options does the user get?
3. Is there a mid-session detection for when the user seems lost?
Write your summary to ${benefitsDir}/benefits-summary.md`,
workingDirectory: benefitsDir,
maxTurns: 8,
timeout: 120_000,
testName: 'plan-ceo-review-benefits',
runId,
});
logCost('/plan-ceo-review benefits-from', result);
recordE2E(evalCollector, '/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(benefitsDir, 'benefits-summary.md');
if (fs.existsSync(summaryPath)) {
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
expect(summary).toMatch(/office.hours/);
expect(summary).toMatch(/design doc|no design/i);
}
}, 180_000);
});
// --- Plan Review Report E2E ---
// Verifies that plan-eng-review writes a "## GSTACK REVIEW REPORT" section
// to the bottom of the plan file (the living review status footer).
describeIfSelected('Plan Review Report E2E', ['plan-review-report'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-review-report-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add Notifications System
## Context
We're building a real-time notification system for our SaaS app.
## Changes
1. WebSocket server for push notifications
2. Notification preferences API
3. Email digest fallback for offline users
4. PostgreSQL table for notification storage
## Architecture
- WebSocket: Socket.io on Express
- Queue: Bull + Redis for email digests
- Storage: PostgreSQL notifications table
- Frontend: React toast component
## Open questions
- Retry policy for failed WebSocket delivery?
- Max notifications stored per user?
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-eng-review skill
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
);
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
test('/plan-eng-review writes GSTACK REVIEW REPORT to plan file', async () => {
const result = await runSkillTest({
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
Skip the preamble bash block, lake intro, telemetry, and contributor mode sections.
CRITICAL REQUIREMENT: plan.md IS the plan file for this review session. After completing your review, you MUST write a "## GSTACK REVIEW REPORT" section to the END of plan.md, exactly as described in the "Plan File Review Report" section of SKILL.md. If gstack-review-read is not available or returns NO_REVIEWS, write the placeholder table with all four review rows (CEO, Codex, Eng, Design). Use the Edit tool to append to plan.md — do NOT overwrite the existing plan content.
This review report at the bottom of the plan is the MOST IMPORTANT deliverable of this test.`,
workingDirectory: planDir,
maxTurns: 20,
timeout: 360_000,
testName: 'plan-review-report',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-eng-review report', result);
recordE2E(evalCollector, '/plan-review-report', 'Plan Review Report E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify the review report was written to the plan file
const planContent = fs.readFileSync(path.join(planDir, 'plan.md'), 'utf-8');
// Original plan content should still be present
expect(planContent).toContain('# Plan: Add Notifications System');
expect(planContent).toContain('WebSocket');
// Review report section must exist
expect(planContent).toContain('## GSTACK REVIEW REPORT');
// Report should be at the bottom of the file
const reportIndex = planContent.lastIndexOf('## GSTACK REVIEW REPORT');
const afterReport = planContent.slice(reportIndex);
// Should contain the review table with standard rows
expect(afterReport).toMatch(/\|\s*Review\s*\|/);
expect(afterReport).toContain('CEO Review');
expect(afterReport).toContain('Eng Review');
expect(afterReport).toContain('Design Review');
console.log('Plan review report found at bottom of plan.md');
}, 420_000);
});
// --- Codex Offering E2E ---
// Verifies that Codex is properly offered (with availability check, user prompt,
// and fallback) in office-hours, plan-ceo-review, plan-design-review, plan-eng-review.
describeIfSelected('Codex Offering E2E', [
'codex-offered-office-hours', 'codex-offered-ceo-review',
'codex-offered-design-review', 'codex-offered-eng-review',
], () => {
let testDir: string;
beforeAll(() => {
testDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-codex-offer-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: testDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(testDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
// Copy all 4 SKILL.md files
for (const skill of ['office-hours', 'plan-ceo-review', 'plan-design-review', 'plan-eng-review']) {
fs.mkdirSync(path.join(testDir, skill), { recursive: true });
fs.copyFileSync(
path.join(ROOT, skill, 'SKILL.md'),
path.join(testDir, skill, 'SKILL.md'),
);
}
});
afterAll(() => {
try { fs.rmSync(testDir, { recursive: true, force: true }); } catch {}
});
async function checkCodexOffering(skill: string, testName: string, featureName: string) {
const result = await runSkillTest({
prompt: `Read ${skill}/SKILL.md. Search for ALL sections related to "codex", "outside voice", or "second opinion".
Summarize the Codex/${featureName} integration — answer these specific questions:
1. How is Codex availability checked? (what exact bash command?)
2. How is the user prompted? (via AskUserQuestion? what are the options?)
3. What happens when Codex is NOT available? (fallback to subagent? skip entirely?)
4. Is this step blocking (gates the workflow) or optional (can be skipped)?
5. What prompt/context is sent to Codex?
Write your summary to ${testDir}/${testName}-summary.md`,
workingDirectory: testDir,
maxTurns: 8,
timeout: 120_000,
testName,
runId,
});
logCost(`/${skill} codex offering`, result);
recordE2E(evalCollector, `/${testName}`, 'Codex Offering E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(testDir, `${testName}-summary.md`);
expect(fs.existsSync(summaryPath)).toBe(true);
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
// All skills should have codex availability check (command -v per #1197)
expect(summary).toMatch(/command -v codex/);
// All skills should have fallback behavior
expect(summary).toMatch(/fallback|subagent|unavailable|not available|skip/);
// All skills should show it's optional/non-blocking
expect(summary).toMatch(/optional|non.?blocking|skip|not.*required/);
console.log(`${skill}: Codex offering verified`);
}
testConcurrentIfSelected('codex-offered-office-hours', async () => {
await checkCodexOffering('office-hours', 'codex-offered-office-hours', 'second opinion');
}, 180_000);
testConcurrentIfSelected('codex-offered-ceo-review', async () => {
await checkCodexOffering('plan-ceo-review', 'codex-offered-ceo-review', 'outside voice');
}, 180_000);
testConcurrentIfSelected('codex-offered-design-review', async () => {
await checkCodexOffering('plan-design-review', 'codex-offered-design-review', 'design outside voices');
}, 180_000);
testConcurrentIfSelected('codex-offered-eng-review', async () => {
await checkCodexOffering('plan-eng-review', 'codex-offered-eng-review', 'outside voice');
}, 180_000);
});
// Module-level afterAll — finalize eval collector after all tests complete
afterAll(async () => {
await finalizeEvalCollector(evalCollector);
});