Files
gstack/bin/gstack-memory-ingest.ts
Garry Tan 7ca04d8ef0 v1.42.0.0 Daegu wave: 23 community-filed bugs + PTY classifier enforcement (24 bisect commits) (#1594)
* fix(gstack-paths): guard CLAUDE_PLUGIN_DATA against cross-plugin contamination (#1569)

gstack-paths previously trusted CLAUDE_PLUGIN_DATA as a fallback for
GSTACK_STATE_ROOT whenever GSTACK_HOME was unset. When another plugin
(e.g. Codex) persists its own CLAUDE_PLUGIN_DATA into the session env
via CLAUDE_ENV_FILE, gstack picked it up and wrote checkpoints,
analytics, and learnings into that plugin's directory. Anyone with the
Codex plugin installed alongside gstack hit this silently.

Fix: guard the CLAUDE_PLUGIN_DATA branch so it only fires when
CLAUDE_PLUGIN_ROOT confirms we're running as the gstack plugin (path
contains "gstack"). Skill installs fall through to \$HOME/.gstack.

Contributed by @ElliotDrel via #1570. Closes #1569.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gbrain-sync): sourceLocalPath handles wrapped {sources:[...]} shape from gbrain v0.20+

gbrain v0.20+ changed `gbrain sources list --json` to return
{sources: [...]} instead of a flat array. sourceLocalPath crashed
upstream with `list.find is not a function` on every /sync-gbrain
invocation against modern gbrain. Accept both shapes for
forward/backward compat, matching probeSource/sourcePageCount in
lib/gbrain-sources.ts.

Contributed by @jakehann11 via #1571. Closes #1567. Supersedes #1564
(@tonyjzhou, same fix, different shape — credit retained).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-context-load): probe gbrain via execFile, not shell builtin (#1559)

gbrainAvailable() used `execFileSync("command", ["-v", "gbrain"])`,
which fails in any environment where the `command` builtin isn't on
the spawned process's PATH (most non-interactive shells). The probe
then reported gbrain as missing even when it was installed, and
context-load silently skipped vector/list queries.

Fix: probe `gbrain --version` directly with a 500ms timeout (matching
the rest of the file's MCP_TIMEOUT_MS). Same semantics, works
everywhere execFile works.

Contributed by @jbetala7 via #1560. Closes #1559.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(gbrain-doctor): pin schema_version:2 doctor parse path (#1418)

Adds an exec-path regression test that runs a fake gbrain shim emitting
the v0.25+ doctor JSON shape (schema_version: 2, status: "warnings",
exit 1 for health_score < 100, no top-level `engine` field). Confirms
freshDetectEngineTier recovers stdout from the non-zero exit and falls
back to GBRAIN_HOME/config.json for the engine label.

The pre-existing test for #1415 only stripped gbrain from PATH; this
test exercises the actual doctor parse path, closing the gap that
codex's plan review flagged.

Also documents the schema_version separation in
lib/gbrain-local-status.ts: the local CacheEntry stays at version 1,
distinct from the doctor-output schema_version which we accept across
versions in gstack-memory-helpers.

Closes #1418 (credit @mvanhorn for surfacing the doctor + schema_v2
collapse). The fix landed pre-emptively in v1.29.x; this commit pins
it with a stronger test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(memory-ingest): pin put_page regression + scrub stale name from --help and comments (#1346)

#1346 reported that gstack-memory-ingest still called the renamed
gbrain put_page subcommand on gbrain v0.18+. The actual code migrated
to `gbrain put` and later to batch `gbrain import <dir>` before this
report landed — only documentation lag remained.

This commit:
- Updates the --help string ("Skip gbrain put calls (still updates
  state file)") so user-facing docs match the shipped subcommand
- Updates two inline comments that still referenced the old name
- Adds test/memory-ingest-no-put_page.test.ts: a regression pin that
  strips comments from bin/gstack-memory-ingest.ts and fails the build
  if "put_page" appears in any active code or string literal, plus a
  sanity check that the file still calls a supported gbrain page-write
  verb (put or import)

Closes #1346. Reporter @kylma-code surfaced the doc lag; the original
code migration credit is on the v1.27.x wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(resolvers): rewrite all gbrain put_page instructions to canonical put <slug>

scripts/resolvers/gbrain.ts emitted user-facing copy-paste instructions
using the renamed `gbrain put_page` subcommand across 10 skills
(office-hours, investigate, plan-ceo-review, retro, plan-eng-review,
ship, cso, design-consultation, fallback, entity-stub). Every gstack
user copying those snippets hit "unknown command: put_page" on gbrain
v0.18+.

This commit:
- Rewrites all 10 instruction templates to use `gbrain put <slug>
  --content "$(cat <<EOF...EOF)"` with title/tags moved into YAML
  frontmatter inside --content, matching the v0.18+ subcommand shape
- Updates README.md and USING_GBRAIN_WITH_GSTACK.md "common commands"
  table to reference `gbrain put` and `gbrain get`
- Adds test/resolvers-gbrain-put-rewrite.test.ts pinning two
  invariants: (a) resolver source ships only canonical instructions,
  (b) every tracked SKILL.md file is free of `gbrain put_page`

CHANGELOG entries are deliberately left untouched (historical record).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(build): extract package.json build to scripts/build.sh for Windows Bun compat (#1538, #1537, #1530, #1457, #1561)

Bun's Windows shell parser rejects multiple constructs the inline
package.json build chain used: brace groups `{ cmd; }`, subshells with
redirection `( git ... ) > path/.version`, and (in Bun 1.3.x) subshells
near redirections in general. Every Windows install + every
auto-upgrade since v1.34.2.0 has failed on `bun run build`.

Extracts the build chain to scripts/build.sh and the .version writes to
scripts/write-version-files.sh. POSIX-portable, no Bun shell parsing
involved. Also adds Windows-specific bun.exe handling for non-ASCII
PATHs (a separate Windows footgun where Bun's --compile fails when the
binary lives under a path with non-ASCII chars).

Updates test/build-script-shell-compat.test.ts to assert the new shape:
no subshells with redirections anywhere in the build chain, and build
delegates to scripts/build.sh which delegates .version writes.

Contributed by @Charlie-El via #1544. Supersedes #1531 (@scarson, fixed
in build helper), #1480 (@mikepsinn, partial overlap), #1460
(@realcarsonterry, brace-group fix subsumed) — credit retained.
Closes #1538, #1537, #1530, #1457, #1561.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(windows): .exe glob in .gitignore + .exe extension resolution in find-browse (#1554)

bun build --compile on Windows appends .exe to the output filename,
producing browse.exe instead of browse. find-browse's existsSync probe
only checked the bare path and returned null on Windows even when the
binary was correctly built. .gitignore similarly only excluded the
bare bin/gstack-global-discover path, leaving the .exe variant
tracked.

This commit:
- .gitignore: changes `bin/gstack-global-discover` →
  `bin/gstack-global-discover*` so the Windows .exe variant is ignored
- browse/src/find-browse.ts: adds isExecutable + findExecutable helpers
  that fall back to .exe/.cmd/.bat probing on Windows, mirroring the
  same helper already in make-pdf/src/browseClient.ts and pdftotext.ts

Contributed by @Mike-E-Log via #1554.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(windows): add fresh-install E2E gate that runs bun run build on windows-latest

Adds .github/workflows/windows-setup-e2e.yml as the gate that catches
Bun shell-parser regressions in the build chain before they reach
users. Triggers on PRs touching package.json, scripts/build.sh,
scripts/write-version-files.sh, setup, browse cli/find-browse, or
gstack-paths.

What it verifies:
1. bun run build completes on Windows (the previously-broken path that
   #1538/#1537/#1530/#1457/#1561 reported)
2. All compiled binaries land on disk (browse.exe, find-browse.exe,
   design.exe, gstack-global-discover.exe)
3. find-browse resolves to the .exe variant on Windows (regression
   gate for #1554)
4. gstack-paths returns non-empty GSTACK_STATE_ROOT/PLAN_ROOT/TMP_ROOT
   on Windows (regression gate for #1570)

Complements the existing windows-free-tests.yml (curated unit subset);
this new workflow exercises the install path itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): move diff scope into prompt instead of --base (Codex CLI 0.130+ argv conflict) (#1209)

Codex CLI ≥ 0.130.0 rejects passing a custom prompt and --base together
(mutually exclusive at argv level). Every /codex review, /review, and
/ship structured Codex review call ended with an argv error before the
model ran.

Fix: scope the diff in prompt text using
"Run git diff origin/<base>...HEAD 2>/dev/null || git diff <base>...HEAD"
instead of `--base <base>`. Preserves the filesystem boundary
instruction across all invocations and keeps Codex's review prompt
tuning.

Touches:
- codex/SKILL.md.tmpl + regenerated codex/SKILL.md
- scripts/resolvers/review.ts + regenerated review/SKILL.md, ship/SKILL.md
- test/gen-skill-docs.test.ts: new regression that fails if any of the
  five known files still contain the prompt+--base shape
- test/skill-validation.test.ts: corresponding negative + positive pin
  on the rendered SKILL.md files

Contributed by @jbetala7 via #1209. Closes #1479. Supersedes #1527
(@mvanhorn — same intent, different patch shape, CONFLICTING) and
#1449 (@Gujiassh — broader refactor, CONFLICTING). Credit retained
in CHANGELOG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): diff from git merge-base, not git diff origin/<base> (#1492)

git diff origin/<base> shows everything since the common ancestor in
both directions — it includes commits that landed on origin/<base>
after this branch was created as deletions. That made /review and
/ship's pre-landing structured review report inflated diff totals and
flagged "removed" code that was actually still present in the working
tree.

Fix: compute DIFF_BASE via git merge-base origin/<base> HEAD and diff
the working tree against that point. Same coverage of uncommitted
edits, no phantom deletions from out-of-order base advancement.

Applies to /review's Step 1 (diff existence check), Step 3 (get the
diff), the build-on-intent scope-creep check, the structured review
DIFF_INS/DIFF_DEL stats, and the Claude adversarial subagent prompt.
Same change flows into ship/SKILL.md via the shared resolver.

Touches:
- review/SKILL.md.tmpl + regenerated review/SKILL.md, ship/SKILL.md
- scripts/resolvers/review.ts
- scripts/resolvers/review-army.ts

Contributed by @mvanhorn via #1492.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(codex): pin filesystem-boundary preservation across all codex review surfaces (#1503, #1522)

#1503 reported that the bare codex review --base path stripped the
filesystem boundary instruction, letting Codex spend tokens reading
.claude/skills/ and agents/. #1522 proposed adding a skill-path
detector that switched to the custom-instructions route when the diff
touched skill files.

After C10 (#1209) restructured codex review to always carry the
boundary in the prompt (the prompt+--base argv conflict forced the
restructure), the skill-path detector becomes redundant — every
default call already preserves the boundary.

This commit pins the post-#1209 invariant with a test that fails the
build if any future refactor strips the boundary from codex/SKILL.md,
review/SKILL.md, or ship/SKILL.md. Closes #1503 by regression test.

#1522 (@genisis0x) is superseded by #1209 (the prompt rewrite covers
its safety concern); credit retained in CHANGELOG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(skills): use command -v instead of which for codex detection (#1197)

`which` is not on PATH in every shell — some Windows shells, BusyBox-
only containers, and minimal CI images all fail when skills probe
codex availability via `which codex`. `command -v` is a POSIX builtin
and always available where the skill is running.

Touched:
- codex/SKILL.md.tmpl: CODEX_BIN=$(command -v codex || echo "")
- scripts/resolvers/review.ts and scripts/resolvers/design.ts:
  3 + 3 sites each rewritten to `command -v codex >/dev/null 2>&1`
- Regenerated all 10 affected SKILL.md files (codex, review, ship,
  design-consultation, design-review, office-hours, plan-ceo-review,
  plan-design-review, plan-devex-review, plan-eng-review)
- test/skill-validation.test.ts: updated pin + defensive regression
  test that fails if `which codex` returns to codex/SKILL.md
- test/skill-e2e-plan.test.ts: updated summary regex

Contributed by @mvanhorn via #1197.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): surface non-zero exits so wrappers stop reading as silent stalls (#1467, #1327)

When codex exits non-zero (parse errors, arg-shape breaks, model API
errors that propagate as non-zero status), the calling agent
previously saw an empty output and burned 30-60 minutes misdiagnosing
as a silent model/API stall. The hang-detection block only caught
exit 124 (the timeout-wrapper signal).

Adds elif blocks in all four codex invocation sites (Review default,
Challenge, Consult new-session, Consult resume) that:
- Echo "[codex exit N] <stderr first line>" to stdout
- Indent the first 20 stderr lines for inline context
- Log codex_nonzero_exit telemetry tagged with the call site

Contributed by @genisis0x via #1467. Closes #1327.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(design): disclose OpenAI key source + warn on cwd .env match (#1278, closes #1248)

The design binary previously called process.env.OPENAI_API_KEY without
checking where the key came from. If a user ran $D inside someone
else's project that had OPENAI_API_KEY in its .env, the resulting
generation billed that project's account. Silent and irreversible.

Fix: resolveApiKeyInfo() returns both the key and its source. When the
env-var path matches an OPENAI_API_KEY entry in the current
directory's .env, .env.<NODE_ENV>, or .env.local file, we set a
warning. requireApiKey() prints "Using OpenAI key from <source>" plus
the warning before the run — never the key itself.

Adds 6 unit tests covering: config-vs-env precedence, env-only (no
match), env+cwd .env match, quoted/exported values, value-mismatch
(no false positive), and the no-leak invariant for requireApiKey
stderr output.

Contributed by @jbetala7 via #1278. Closes #1248.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): guard full-page screenshots against Anthropic vision API >2000px brick (#1214)

Full-page screenshots of tall pages routinely exceeded 2000px on the
longest dimension, silently bricking the agent's session: the
resulting base64 reached the Anthropic vision API which rejected the
oversized image, leaving the agent burning turns on a useless blob
with no stderr trace from the browse side.

Adds browse/src/screenshot-size-guard.ts as a shared helper:
- guardScreenshotBuffer(buf) → downscales in-memory if max(w,h) > 2000
- guardScreenshotPath(path) → file-mode variant that rewrites in place
- Aspect ratio preserved via sharp's resize fit:inside
- Stderr diagnostic on any downscale so callers can see when it fired
- Lazy sharp import so non-screenshot paths pay no startup cost

Wires the guard into all three full-page callsites codex review
flagged:
- browse/src/snapshot.ts: annotated + heatmap fullPage captures
- browse/src/meta-commands.ts: screenshot command (path + base64
  fullPage modes) plus the responsive 3-viewport sweep
- browse/src/write-commands.ts: prettyscreenshot fullPage path

Covers seven unit cases (pass-through, downscale, aspect ratio,
exactly-2000px edge, file-mode rewrite) plus a static invariant test
that fails the build if any of the three callsites stops importing the
guard.

Closes #1214.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): add Node sidecar entry for L4 prompt-injection classifier (#1370)

The L4 TestSavant classifier in browse/src/security-classifier.ts
can't be imported into the compiled browse server (onnxruntime-node
dlopen fails from Bun's compile extract dir per CLAUDE.md). The agent
that used to host it (sidebar-agent.ts) was removed when the PTY
proved out — leaving the classifier file shipped but with zero
callers. Exactly the gap codex flagged in #1370.

Adds browse/src/security-sidecar-entry.ts: a Node script that runs the
classifier as a subprocess of the browse server. It reads NDJSON
requests from stdin and writes id-correlated NDJSON responses to
stdout, supporting:
  - op: "scan-page-content" — full L4 classifier scan
  - op: "ping" — liveness probe for the client's health check
  - op: "status" — classifier readiness (used by /pty-inject-scan to
    surface l4 { available: bool } in its response)

Plus browse/src/find-security-sidecar.ts: a resolver that locates
node + the bundled JS entry (browse/dist/security-sidecar.js, built in
a follow-up package.json change) or falls back to the dev TS entry.
Returns null cleanly when node isn't on PATH so the calling endpoint
can degrade per D7 (extension WARN + user confirm).

C17 of the security-stack wave. C18 adds the IPC client + lifecycle
management; C19 wires the endpoint; C20 routes the extension through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): sidecar IPC client with lifecycle + circuit breaker (#1370)

Adds browse/src/security-sidecar-client.ts to manage the Node L4
classifier subprocess from the compiled browse server:

- Lazy spawn on first scan; reuses the same process across requests
- Id-correlated request/response via NDJSON over stdio
- 5s default per-scan timeout; 64KB payload cap (short-circuits before
  spawn so oversized requests don't waste a process)
- 3-in-10-minutes respawn cap → trips circuit breaker; subsequent
  scans throw immediately so the /pty-inject-scan endpoint can surface
  l4 { available: false } to the extension and degrade to WARN+confirm
- process.on('exit') sends SIGTERM to the child for clean teardown
- isSidecarAvailable() lets the endpoint probe before scan calls so
  the response shape reflects degraded mode honestly

Unit tests cover the payload cap, the availability probe, and the
breaker-doesn't-crash invariant under repeated rejected calls.

C18 of the security-stack wave. C19 adds POST /pty-inject-scan; C20
routes the extension through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): add POST /pty-inject-scan endpoint for pre-PTY-inject scans (#1370)

The sidebar's gstackInjectToTerminal callers (toolbar Cleanup,
Inspector "Send to Code") were piping page-derived text directly into
the live claude PTY with ZERO classifier processing — the gap codex
flagged in #1370. The documented sidebar security stack had a hole
the size of every Cleanup-button click.

Adds POST /pty-inject-scan to browse/src/server.ts:
- Local-only binding (NOT in TUNNEL_PATHS — tunnel attempts get the
  general 404 path; never reaches the scan logic)
- Root-token auth via existing validateAuth() — 401 on unauth
- 64KB request cap → 413 + payload-too-large body
- 5s scan timeout via sidecar client
- URL-blocklist forced to BLOCK in PTY context (page-derived REPL
  input is higher-risk than ordinary tool output)
- L4 ML classifier via the sidecar when available; degrades to WARN
  per D7 when sidecar is unavailable
- Response goes through JSON.stringify(..., sanitizeReplacer) per
  v1.38.0.0 Unicode-egress hardening
- Imports only from security-sidecar-client.ts, never directly from
  security-classifier.ts (which would brick the compiled Bun binary)

Seven static-invariant tests pin the POST verb, auth gate, 64KB cap,
tunnel-listener exclusion, sanitizeReplacer wrapping, l4 availability
shape, and the no-direct-classifier-import rule.

C19 of the security-stack wave. C20 routes the extension through it;
C21 adds the invariant AST check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(extension): route gstackInjectToTerminal through /pty-inject-scan (#1370)

Closes the documented-vs-shipped gap codex flagged in #1370. The
sidebar's two PTY-injection call sites (Inspector "Send to Code" and
toolbar Cleanup) now pre-scan via the new /pty-inject-scan endpoint
before writing to the live claude REPL.

Adds window.gstackScanForPTYInject(text, origin) to
extension/sidepanel-terminal.js:
- Async, returns { allow, verdict, reasons, l4 }
- POST to /pty-inject-scan with the existing root-token auth
- WARN+confirm on scan failure (network down, sidecar absent, etc.)
  rather than silent PASS — D7 honest-degradation

gstackInjectToTerminal stays synchronous, returns boolean. Per D6:
keeping the inject sync means existing `const ok = ...?.()` callers
don't break, and the invariant test in
test/extension-pty-inject-invariant.test.ts can statically pin that
every call goes through the scan first.

extension/sidepanel.js call sites updated:
- inspectorSendBtn click → await scan, BLOCK drops + WARN prompts via
  window.confirm, PASS injects silently
- runCleanup() → same flow. Static cleanup prompt always PASSes but
  still routes through scan to honor the invariant.

C20 of the security-stack wave. C21 adds the static invariant test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(security): invariant — extension PTY inject must be scan-gated (#1370)

Static-analysis invariant test that fails the build if any
extension/*.js path calls window.gstackInjectToTerminal without a
preceding window.gstackScanForPTYInject in the same enclosing
function. Closes the documented-vs-shipped gap codex demanded a
machine check on.

Rules:
- Rule 1: any file that calls inject must also reference scan
- Rule 2: in the enclosing function (function declaration, arrow,
  async (), event handler), a scan call must appear before the inject
  call by source position
- Exemption: sidepanel-terminal.js (the file that DEFINES the inject
  function) is exempt from Rule 2 since the definition is not a call

Plus two structural checks:
- sidepanel-terminal.js defines both the inject and scan functions
- inject stays SYNCHRONOUS (no `async` modifier) per D6 — async would
  silently break the `const ok = ...?.()` pattern at every caller

C21 of the security-stack wave. The sidecar architecture (#1370) is
complete: server-side L1-L3 + L4-via-sidecar (C17+C18+C19), extension
pre-scan wiring (C20), and now the regression gate (C21).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): opt-in extended stealth mode with 6 detection-vector patches (#1112)

Rebases @garrytan's PR #1112 (Apr 2026, abandoned) onto the current
browse/src/stealth.ts contract. The existing minimal "codex narrowed"
stealth (webdriver-mask + AutomationControlled launch arg) stays the
default. PR #1112's six additional patches are added behind an opt-in
GSTACK_STEALTH=extended env flag.

Extended-mode patches (applied AFTER the default mask, in order):
  1. delete navigator.webdriver from prototype (not just the getter —
     detectors check `"webdriver" in navigator`)
  2. WebGL renderer spoof to Apple M1 Pro (SwiftShader was the #1
     software-GPU tell in containers)
  3. navigator.plugins returns a PluginArray-prototype-passing array
     with MimeType objects and namedItem()
  4. window.chrome populated with chrome.app, chrome.runtime,
     chrome.loadTimes(), chrome.csi() with realistic shapes
  5. navigator.mediaDevices backfilled when headless drops it
  6. CDP cdc_*-prefixed window globals cleared

Why opt-in: the default mode's contract is fingerprint CONSISTENCY,
which protects against detectors that flag spoofing mismatch. Extended
mode actively lies about the environment; sites that reflect on these
properties can break. Users who hit detection in default mode can flip
GSTACK_STEALTH=extended for SannySoft 100% pass-rate.

Twenty unit tests pin the env-flag semantics, all six patches' code
presence, and the applyStealth wiring order. Live SannySoft pass-rate
verification stays in the periodic-tier E2E suite.

Contributed by @garrytan via #1112 (rebased — original PR opened
before the codex-narrowed minimum landed; rebase preserves the
narrowed default while adding the SannySoft-passing path as opt-in).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship-SKILL.md golden baselines after C10-C13 + C16 templates

Updates the three ship-SKILL.md golden baselines (claude, codex,
factory hosts) to match the new shape produced by:
- C10 #1209 codex argv (prompt + diff scope, no --base)
- C11 #1492 merge-base diff (DIFF_BASE= preamble)
- C13 #1197 command -v for codex detection
- C12 + boundary preservation per regen-enforcing test

Per CLAUDE.md SKILL.md workflow: edit the .tmpl, run gen:skill-docs,
commit the regenerated outputs together. Goldens are part of the
regen contract — without this commit, test/host-config.test.ts'
golden-baseline checks fail with the diff codex review surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): v1.41.0.0 — Daegu wave (24 bisect commits, 14 user-facing fixes)

Bumps VERSION 1.40.0.0 → 1.41.0.0. CHANGELOG entry follows the
release-summary format in CLAUDE.md: two-line headline, lead
paragraph, "The numbers that matter" table, "What this means for
builders" closer, then itemized Added/Changed/Fixed/For contributors
with inline credit to every PR author and original issue reporter.

Scale-aware bump per CLAUDE.md: 24 commits, ~6000 LOC net,
substantial new capability across security (PTY sidecar wiring),
install (Windows build chain), compat (gbrain 0.18-0.35, Codex CLI
0.130+), and quality (screenshot guard, design key disclosure,
extended stealth opt-in). MINOR is the right call.

Closes for users: #1567, #1559, #1569, #1346, #1418, #1538, #1537,
#1530, #1457, #1561, #1554, #1479, #1503, #1248, #1214, #1370, #1327,
#1193 pattern, #1152 pattern. Credit retained inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(find-browse): resolve source-checkout layout <git-root>/browse/dist/browse[.exe]

windows-setup-e2e.yml runs `bun browse/src/find-browse.ts` against a
freshly-built repo where binaries land at browse/dist/browse.exe (no
.claude/skills/gstack/ install layout). The previous markers chain
only matched .codex/.agents/.claude prefixed paths, so find-browse
exited "not found" even when the binary was present.

Adds a source-checkout fallback after the marker scan: if no
installed layout resolves but <git-root>/browse/dist/browse[.exe]
exists, return that. Three real callers hit this path:
- gstack repo dev workflow before `./setup` runs
- windows-setup-e2e.yml CI (the breakage that surfaced this)
- make-pdf consumers running from a sibling source checkout

Smoke-verified: a fresh git repo with browse/dist/browse on disk now
resolves through the source-checkout branch (was returning null
before this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump v1.41.0.0 → v1.42.0.0 to clear queue collision with #1574

The version-gate workflow flagged a collision: PR #1574
(garrytan/colombo-v3) already claims v1.41.0.0, and #1592
(fix/audit-critical-high-bugs) claims v1.41.1.0. Per CLAUDE.md's
workspace-aware ship rule, queue-advancing past a claimed version
within the same bump level is permitted — MINOR work landing on top
of a queued MINOR still reads as MINOR relative to main.

Util's suggested next slot is v1.42.0.0; taking it. CHANGELOG entry
header bumped + dated 2026-05-19; entry body unchanged (same wave
content, same credit list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:35:01 -07:00

1751 lines
60 KiB
TypeScript

#!/usr/bin/env bun
/**
* gstack-memory-ingest — V1 memory ingest helper.
*
* Walks coding-agent transcript sources + ~/.gstack/ curated artifacts and writes
* each one to gbrain as a typed page. Per plan §"Storage tiering": curated memory
* rides the existing gbrain Postgres + git pipeline; code/transcripts go to the
* Supabase tier when configured (or local PGLite otherwise) — never double-store.
*
* Usage:
* gstack-memory-ingest --probe # count what would ingest, no writes
* gstack-memory-ingest --incremental [--quiet] # default; mtime fast-path; cheap
* gstack-memory-ingest --bulk [--all-history] # first-run; full walk
* gstack-memory-ingest --bulk --benchmark # time the bulk pass + report
* gstack-memory-ingest --include-unattributed # also ingest sessions with no git remote
*
* Sources walked:
* ~/.claude/projects/<encoded-cwd>/<uuid>.jsonl — Claude Code sessions
* ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl — Codex CLI sessions
* ~/Library/Application Support/Cursor/User/*.vscdb — Cursor (V1.0.1 follow-up)
* ~/.gstack/projects/<slug>/learnings.jsonl — typed: learning
* ~/.gstack/projects/<slug>/timeline.jsonl — typed: timeline
* ~/.gstack/projects/<slug>/ceo-plans/*.md — typed: ceo-plan
* ~/.gstack/projects/<slug>/*-design-*.md — typed: design-doc
* ~/.gstack/analytics/eureka.jsonl — typed: eureka
* ~/.gstack/builder-profile.jsonl — typed: builder-profile-entry
*
* State: ~/.gstack/.transcript-ingest-state.json (LOCAL per ED1, never synced).
* Secret scanning: gitleaks via lib/gstack-memory-helpers#secretScanFile (D19).
* Concurrent-write handling: partial-flag + re-ingest on next pass (D10).
*
* V1.0 NOTE: Cursor SQLite extraction is a V1.0.1 follow-up. The plan promoted it to
* V1 scope, but full SQLite parsing requires a sqlite3 binary or library; deferred to
* keep V1 ship-tight. See TODOS.md.
*
* V1.5 NOTE: When `gbrain put_file` ships in the gbrain CLI (cross-repo P0 TODO),
* transcripts will route to Supabase Storage instead of the page-write path.
* Until then, all content rides `gbrain put <slug>` (stdin, YAML frontmatter for
* title/type/tags); gbrain's native dedup keys on session_id.
*/
import {
existsSync,
readdirSync,
readFileSync,
writeFileSync,
statSync,
mkdirSync,
appendFileSync,
renameSync,
openSync,
readSync,
closeSync,
rmSync,
} from "fs";
import { join, basename, dirname } from "path";
import { execFileSync, spawnSync, spawn, type ChildProcess } from "child_process";
import { homedir } from "os";
import { createHash } from "crypto";
import {
canonicalizeRemote,
secretScanFile,
detectEngineTier,
withErrorContext,
} from "../lib/gstack-memory-helpers";
import { execGbrainText, spawnGbrainAsync } from "../lib/gbrain-exec";
// ── Types ──────────────────────────────────────────────────────────────────
type Mode = "probe" | "incremental" | "bulk";
interface CliArgs {
mode: Mode;
quiet: boolean;
benchmark: boolean;
includeUnattributed: boolean;
allHistory: boolean;
sources: Set<MemoryType>;
limit: number | null;
noWrite: boolean;
/**
* Opt-in per-file gitleaks scan during the prepare phase. Off by
* default — the cross-machine boundary (gstack-brain-sync, git push)
* has its own scanner. Setting this adds ~4-8 min to cold runs.
*/
scanSecrets: boolean;
}
type MemoryType =
| "transcript"
| "eureka"
| "learning"
| "timeline"
| "ceo-plan"
| "design-doc"
| "retro"
| "builder-profile-entry";
interface PageRecord {
slug: string;
title: string;
type: MemoryType;
agent?: "claude-code" | "codex" | "cursor";
body: string;
tags: string[];
source_path: string;
session_id?: string;
cwd?: string;
git_remote?: string;
start_time?: string;
end_time?: string;
partial?: boolean;
size_bytes: number;
content_sha256: string;
}
interface IngestState {
schema_version: 1;
last_writer: string;
last_full_walk?: string;
sessions: Record<
string,
{
mtime_ns: number;
sha256: string;
ingested_at: string;
page_slug: string;
partial?: boolean;
}
>;
}
interface ProbeReport {
total_files: number;
total_bytes: number;
by_type: Record<MemoryType, { count: number; bytes: number }>;
new_count: number;
updated_count: number;
unchanged_count: number;
estimate_minutes: number;
}
interface BulkResult {
written: number;
skipped_secret: number;
skipped_dedup: number;
skipped_unattributed: number;
failed: number;
duration_ms: number;
partial_pages: number;
/**
* D6: when set, indicates a process-level failure (gbrain CLI missing
* or `gbrain import` crashed). Per-file errors (FILE_TOO_LARGE etc.)
* land in `failed` but do NOT set this flag — the orchestrator should
* still treat the run as OK with summary mentioning the failure count.
* Only when this is set does the verdict become ERR.
*/
system_error?: string;
}
// ── Constants ──────────────────────────────────────────────────────────────
const HOME = homedir();
const GSTACK_HOME = process.env.GSTACK_HOME || join(HOME, ".gstack");
const STATE_PATH = join(GSTACK_HOME, ".transcript-ingest-state.json");
const DEFAULT_INCREMENTAL_BUDGET_MS = 50;
const ALL_TYPES: MemoryType[] = [
"transcript",
"eureka",
"learning",
"timeline",
"ceo-plan",
"design-doc",
"retro",
"builder-profile-entry",
];
// ── CLI ────────────────────────────────────────────────────────────────────
function printUsage(): void {
console.error(`Usage: gstack-memory-ingest [--probe|--incremental|--bulk] [options]
Modes:
--probe Count what would ingest; no writes. Fastest.
--incremental Default. mtime fast-path; only walks changed files.
--bulk First-run; full walk; gates on permission elsewhere.
Options:
--quiet Suppress per-file output (still prints summary).
--benchmark Time the run; report bytes-per-second + total.
--include-unattributed Ingest sessions with no resolvable git remote.
--all-history Walk transcripts older than 90 days too.
--sources <list> Comma-separated subset: ${ALL_TYPES.join(",")}
--limit <N> Stop after N pages written (smoke testing).
--no-write Skip gbrain put calls (still updates state file).
Used by tests + dry runs without actual ingest.
--scan-secrets Opt-in per-file gitleaks scan during prepare. Off by
default; gstack-brain-sync already gates the git-push
boundary. Adds ~4-8 min to cold runs.
--help This text.
`);
}
function parseArgs(): CliArgs {
const args = process.argv.slice(2);
let mode: Mode = "incremental";
let quiet = false;
let benchmark = false;
let includeUnattributed = false;
let allHistory = false;
let limit: number | null = null;
let sources: Set<MemoryType> = new Set(ALL_TYPES);
let noWrite = process.env.GSTACK_MEMORY_INGEST_NO_WRITE === "1";
let scanSecrets = process.env.GSTACK_MEMORY_INGEST_SCAN_SECRETS === "1";
for (let i = 0; i < args.length; i++) {
const a = args[i];
switch (a) {
case "--probe": mode = "probe"; break;
case "--incremental": mode = "incremental"; break;
case "--bulk": mode = "bulk"; break;
case "--quiet": quiet = true; break;
case "--benchmark": benchmark = true; break;
case "--include-unattributed": includeUnattributed = true; break;
case "--all-history": allHistory = true; break;
case "--no-write": noWrite = true; break;
case "--scan-secrets": scanSecrets = true; break;
case "--limit":
limit = parseInt(args[++i] || "0", 10);
if (!Number.isFinite(limit) || limit <= 0) {
console.error("--limit requires a positive integer");
process.exit(1);
}
break;
case "--sources": {
const list = (args[++i] || "").split(",").map((s) => s.trim() as MemoryType);
sources = new Set(list.filter((t) => ALL_TYPES.includes(t)));
if (sources.size === 0) {
console.error(`--sources must include at least one of: ${ALL_TYPES.join(",")}`);
process.exit(1);
}
break;
}
case "--help":
case "-h":
printUsage();
process.exit(0);
default:
console.error(`Unknown argument: ${a}`);
printUsage();
process.exit(1);
}
}
return { mode, quiet, benchmark, includeUnattributed, allHistory, sources, limit, noWrite, scanSecrets };
}
// ── State file ─────────────────────────────────────────────────────────────
function loadState(): IngestState {
if (!existsSync(STATE_PATH)) {
return {
schema_version: 1,
last_writer: "gstack-memory-ingest",
sessions: {},
};
}
try {
const raw = readFileSync(STATE_PATH, "utf-8");
const parsed = JSON.parse(raw) as IngestState;
if (parsed.schema_version !== 1) {
console.error(`State file at ${STATE_PATH} has unknown schema_version ${parsed.schema_version}; backing up + resetting.`);
try {
writeFileSync(STATE_PATH + ".bak", raw, "utf-8");
} catch {
// backup failure is non-fatal
}
return { schema_version: 1, last_writer: "gstack-memory-ingest", sessions: {} };
}
return parsed;
} catch (err) {
console.error(`State file at ${STATE_PATH} corrupt; backing up + resetting.`);
try {
const raw = readFileSync(STATE_PATH, "utf-8");
writeFileSync(STATE_PATH + ".bak", raw, "utf-8");
} catch {
// best-effort
}
return { schema_version: 1, last_writer: "gstack-memory-ingest", sessions: {} };
}
}
function saveState(state: IngestState): void {
// F6 (Codex finding 6): tmp+rename atomic write so a crash mid-write
// never leaves a truncated/corrupt state file. Matches the pattern
// in gstack-gbrain-sync.ts:saveSyncState.
try {
mkdirSync(dirname(STATE_PATH), { recursive: true });
const tmp = `${STATE_PATH}.tmp.${process.pid}`;
writeFileSync(tmp, JSON.stringify(state, null, 2), "utf-8");
renameSync(tmp, STATE_PATH);
} catch (err) {
console.error(`[state] write failed: ${(err as Error).message}`);
}
}
// ── File hash + change detection ───────────────────────────────────────────
function fileSha256(path: string): string {
// F9 (Codex finding 9): full-file hash. The prior 1MB cap silently
// missed tail edits to long partial transcripts — exactly the
// recovery case this pipeline needs to handle correctly. Realistic
// max for an ingest source is ~50MB (long JSONL); fine to load in
// memory for hashing.
try {
const buf = readFileSync(path);
return createHash("sha256").update(buf).digest("hex");
} catch {
return "";
}
}
function fileChangedSinceState(path: string, state: IngestState): boolean {
const entry = state.sessions[path];
if (!entry) return true;
try {
const st = statSync(path);
const mtimeNs = Math.floor(st.mtimeMs * 1e6);
if (mtimeNs === entry.mtime_ns) return false;
const sha = fileSha256(path);
if (sha === entry.sha256) {
// mtime changed but content didn't; just refresh mtime to skip future hashing
entry.mtime_ns = mtimeNs;
return false;
}
return true;
} catch {
return true;
}
}
// ── Walkers ────────────────────────────────────────────────────────────────
interface WalkContext {
args: CliArgs;
state: IngestState;
windowStartMs: number; // ignore files older than this unless --all-history
}
function makeWalkContext(args: CliArgs, state: IngestState): WalkContext {
const ninetyDaysAgoMs = Date.now() - 90 * 24 * 60 * 60 * 1000;
return {
args,
state,
windowStartMs: args.allHistory ? 0 : ninetyDaysAgoMs,
};
}
function* walkClaudeCodeProjects(ctx: WalkContext): Generator<{ path: string; type: MemoryType }> {
const root = join(HOME, ".claude", "projects");
if (!existsSync(root)) return;
let projectDirs: string[];
try {
projectDirs = readdirSync(root);
} catch {
return;
}
for (const dir of projectDirs) {
const fullDir = join(root, dir);
let entries: string[];
try {
entries = readdirSync(fullDir);
} catch {
continue;
}
for (const entry of entries) {
if (!entry.endsWith(".jsonl")) continue;
const fullPath = join(fullDir, entry);
try {
const st = statSync(fullPath);
if (st.mtimeMs < ctx.windowStartMs) continue;
} catch {
continue;
}
yield { path: fullPath, type: "transcript" };
}
}
}
function* walkCodexSessions(ctx: WalkContext): Generator<{ path: string; type: MemoryType }> {
const root = join(HOME, ".codex", "sessions");
if (!existsSync(root)) return;
// Date-bucketed: YYYY/MM/DD/rollout-*.jsonl. Walk up to 4 levels deep.
function* recurse(dir: string, depth: number): Generator<string> {
if (depth > 4) return;
let entries: string[];
try {
entries = readdirSync(dir);
} catch {
return;
}
for (const entry of entries) {
const full = join(dir, entry);
let st;
try {
st = statSync(full);
} catch {
continue;
}
if (st.isDirectory()) {
yield* recurse(full, depth + 1);
} else if (entry.endsWith(".jsonl")) {
if (st.mtimeMs >= ctx.windowStartMs) yield full;
}
}
}
for (const path of recurse(root, 0)) {
yield { path, type: "transcript" };
}
}
function* walkGstackArtifacts(ctx: WalkContext): Generator<{ path: string; type: MemoryType }> {
const projectsRoot = join(GSTACK_HOME, "projects");
// Eureka log: ~/.gstack/analytics/eureka.jsonl
const eurekaLog = join(GSTACK_HOME, "analytics", "eureka.jsonl");
if (existsSync(eurekaLog) && ctx.args.sources.has("eureka")) {
yield { path: eurekaLog, type: "eureka" };
}
// Builder profile: ~/.gstack/builder-profile.jsonl
const builderProfile = join(GSTACK_HOME, "builder-profile.jsonl");
if (existsSync(builderProfile) && ctx.args.sources.has("builder-profile-entry")) {
yield { path: builderProfile, type: "builder-profile-entry" };
}
if (!existsSync(projectsRoot)) return;
let slugs: string[];
try {
slugs = readdirSync(projectsRoot);
} catch {
return;
}
for (const slug of slugs) {
const projDir = join(projectsRoot, slug);
let st;
try {
st = statSync(projDir);
} catch {
continue;
}
if (!st.isDirectory()) continue;
// learnings.jsonl
const learnings = join(projDir, "learnings.jsonl");
if (existsSync(learnings) && ctx.args.sources.has("learning")) {
yield { path: learnings, type: "learning" };
}
// timeline.jsonl
const timeline = join(projDir, "timeline.jsonl");
if (existsSync(timeline) && ctx.args.sources.has("timeline")) {
yield { path: timeline, type: "timeline" };
}
// ceo-plans/*.md
if (ctx.args.sources.has("ceo-plan")) {
const ceoPlans = join(projDir, "ceo-plans");
if (existsSync(ceoPlans)) {
let pe: string[];
try {
pe = readdirSync(ceoPlans);
} catch {
pe = [];
}
for (const e of pe) {
if (e.endsWith(".md")) {
yield { path: join(ceoPlans, e), type: "ceo-plan" };
}
}
}
}
// *-design-*.md (top-level in proj dir)
if (ctx.args.sources.has("design-doc")) {
let pe: string[];
try {
pe = readdirSync(projDir);
} catch {
pe = [];
}
for (const e of pe) {
if (e.endsWith(".md") && e.includes("design-")) {
yield { path: join(projDir, e), type: "design-doc" };
}
}
}
// retros — *.md under projDir/retros/ if exists, or retro-*.md at projDir
if (ctx.args.sources.has("retro")) {
const retroDir = join(projDir, "retros");
if (existsSync(retroDir)) {
let pe: string[];
try {
pe = readdirSync(retroDir);
} catch {
pe = [];
}
for (const e of pe) {
if (e.endsWith(".md")) {
yield { path: join(retroDir, e), type: "retro" };
}
}
}
}
}
}
function* walkAllSources(ctx: WalkContext): Generator<{ path: string; type: MemoryType }> {
if (ctx.args.sources.has("transcript")) {
yield* walkClaudeCodeProjects(ctx);
yield* walkCodexSessions(ctx);
}
yield* walkGstackArtifacts(ctx);
}
// ── Renderers ──────────────────────────────────────────────────────────────
interface ParsedSession {
agent: "claude-code" | "codex";
session_id: string;
cwd: string;
start_time?: string;
end_time?: string;
message_count: number;
tool_calls: number;
body: string;
partial: boolean;
}
function parseTranscriptJsonl(path: string): ParsedSession | null {
// Best-effort tolerant parser. Handles truncated last lines (D10 partial-flag).
let raw: string;
try {
raw = readFileSync(path, "utf-8");
} catch {
return null;
}
const lines = raw.split("\n").filter((l) => l.trim().length > 0);
if (lines.length === 0) return null;
// Detect partial: if the last line doesn't end with `}` or doesn't parse, mark partial.
let partial = false;
let parsedLines: any[] = [];
for (let i = 0; i < lines.length; i++) {
try {
parsedLines.push(JSON.parse(lines[i]));
} catch {
// Last-line truncation is the common case (D10).
if (i === lines.length - 1) partial = true;
else continue;
}
}
if (parsedLines.length === 0) return null;
// Detect format: Codex `session_meta` or Claude Code `type: user|assistant|tool`
const first = parsedLines[0];
const isCodex = first?.type === "session_meta" || first?.payload?.id != null;
const agent: "claude-code" | "codex" = isCodex ? "codex" : "claude-code";
let session_id = "";
let cwd = "";
let start_time: string | undefined;
let end_time: string | undefined;
if (isCodex) {
session_id = first.payload?.id || first.id || basename(path, ".jsonl");
cwd = first.payload?.cwd || first.cwd || "";
start_time = first.timestamp || first.payload?.timestamp;
} else {
// Claude Code: look for cwd in first non-queue record
for (const r of parsedLines) {
if (r?.cwd) {
cwd = r.cwd;
break;
}
}
session_id = basename(path, ".jsonl");
start_time = parsedLines.find((r) => r?.timestamp)?.timestamp;
const last = parsedLines[parsedLines.length - 1];
end_time = last?.timestamp;
}
// Render body — collapsed conversation
let messageCount = 0;
let toolCalls = 0;
const bodyParts: string[] = [];
for (const rec of parsedLines) {
if (rec?.type === "user" || rec?.message?.role === "user") {
const content = extractContentText(rec);
if (content) {
bodyParts.push(`## User\n\n${content}`);
messageCount++;
}
} else if (rec?.type === "assistant" || rec?.message?.role === "assistant") {
const content = extractContentText(rec);
if (content) {
bodyParts.push(`## Assistant\n\n${content}`);
messageCount++;
}
} else if (rec?.type === "tool" || rec?.tool_use_id || rec?.tool_call) {
toolCalls++;
// Collapse to one-line summary
const tool = rec?.name || rec?.tool || rec?.tool_call?.name || "tool";
bodyParts.push(`### Tool call: ${tool}`);
} else if (isCodex && rec?.payload?.message) {
// Codex shape: each record has payload.message
const msg = rec.payload.message;
const role = msg.role || "user";
const content = extractContentText(msg);
if (content) {
bodyParts.push(`## ${role.charAt(0).toUpperCase() + role.slice(1)}\n\n${content}`);
messageCount++;
}
}
}
const body = bodyParts.join("\n\n").slice(0, 200000); // hard cap 200KB
return {
agent,
session_id,
cwd,
start_time,
end_time,
message_count: messageCount,
tool_calls: toolCalls,
body,
partial,
};
}
function extractContentText(rec: any): string {
if (!rec) return "";
if (typeof rec.content === "string") return rec.content;
if (typeof rec.text === "string") return rec.text;
if (typeof rec.message?.content === "string") return rec.message.content;
if (Array.isArray(rec.message?.content)) {
return rec.message.content
.map((c: any) => (typeof c === "string" ? c : c?.text || ""))
.filter(Boolean)
.join("\n");
}
if (Array.isArray(rec.content)) {
return rec.content
.map((c: any) => (typeof c === "string" ? c : c?.text || ""))
.filter(Boolean)
.join("\n");
}
return "";
}
function resolveGitRemote(cwd: string): string {
if (!cwd) return "";
try {
// execFileSync (no shell) so `cwd` cannot trigger command substitution.
// Transcript JSONL records are an untrusted surface (a poisoned `.cwd`
// value containing `"$(...)"` survived `JSON.stringify` interpolation
// into a `/bin/sh -c` context, since JSON quoting does not escape `$`
// or backticks). Mirrors the execFileSync pattern this module already
// uses for `gbrainAvailable()` (line 762) and `gbrainPutPage()` (line 816).
const out = execFileSync("git", ["-C", cwd, "remote", "get-url", "origin"], {
encoding: "utf-8",
timeout: 2000,
stdio: ["ignore", "pipe", "ignore"],
});
return canonicalizeRemote(out.trim());
} catch {
return "";
}
}
function repoSlug(remote: string): string {
if (!remote) return "_unattributed";
// github.com/foo/bar → foo-bar
const parts = remote.split("/");
if (parts.length >= 3) return `${parts[parts.length - 2]}-${parts[parts.length - 1]}`;
return remote.replace(/\//g, "-");
}
function dateOnly(ts: string | undefined): string {
if (!ts) return new Date().toISOString().slice(0, 10);
try {
return new Date(ts).toISOString().slice(0, 10);
} catch {
return new Date().toISOString().slice(0, 10);
}
}
function buildTranscriptPage(path: string, session: ParsedSession): PageRecord {
const remote = resolveGitRemote(session.cwd);
const slug_repo = repoSlug(remote);
const date = dateOnly(session.start_time);
const sessionPrefix = session.session_id.slice(0, 12);
const slug = `transcripts/${session.agent}/${slug_repo}/${date}-${sessionPrefix}`;
const title = `${session.agent} session — ${slug_repo}${date}`;
const tags = [
"transcript",
`agent:${session.agent}`,
`repo:${slug_repo}`,
`date:${date}`,
];
if (session.partial) tags.push("partial:true");
const stats = statSync(path);
const sha = fileSha256(path);
const frontmatter = [
"---",
`agent: ${session.agent}`,
`session_id: ${session.session_id}`,
`cwd: ${session.cwd || ""}`,
`git_remote: ${remote || "_unattributed"}`,
`start_time: ${session.start_time || ""}`,
`end_time: ${session.end_time || ""}`,
`message_count: ${session.message_count}`,
`tool_calls: ${session.tool_calls}`,
`source_path: ${path}`,
session.partial ? "partial: true" : "",
"---",
"",
].filter((l) => l !== "").join("\n");
return {
slug,
title,
type: "transcript",
agent: session.agent,
body: frontmatter + session.body,
tags,
source_path: path,
session_id: session.session_id,
cwd: session.cwd,
git_remote: remote,
start_time: session.start_time,
end_time: session.end_time,
partial: session.partial,
size_bytes: stats.size,
content_sha256: sha,
};
}
function buildArtifactPage(path: string, type: MemoryType): PageRecord {
const stats = statSync(path);
const sha = fileSha256(path);
const raw = readFileSync(path, "utf-8");
// Extract repo slug from path: ~/.gstack/projects/<slug>/...
let slug_repo = "_unattributed";
const m = path.match(/\/\.gstack\/projects\/([^/]+)\//);
if (m) slug_repo = m[1];
const date = new Date(stats.mtimeMs).toISOString().slice(0, 10);
const baseName = basename(path, path.endsWith(".jsonl") ? ".jsonl" : ".md");
const slug = `${type}s/${slug_repo}/${date}-${baseName}`;
const title = `${type}${slug_repo}${date}${baseName}`;
const tags = [type, `repo:${slug_repo}`, `date:${date}`];
// Truncate body to 200KB
const body = raw.slice(0, 200000);
return {
slug,
title,
type,
body,
tags,
source_path: path,
git_remote: slug_repo,
size_bytes: stats.size,
content_sha256: sha,
};
}
// ── Writer (batch via `gbrain import <dir>`) ───────────────────────────────
//
// Architecture (post plan-eng-review + Codex outside-voice):
//
// walkAllSources(ctx)
// → for each path: mtime-skip / source-file gitleaks (D3) / parse / buildPage
// → renderPageBody injects title/type/tags into YAML frontmatter
// → writeStaged: mkdir -p slug subdirs (D1), write ${slug}.md
// → snapshot ~/.gbrain/sync-failures.jsonl byte-offset (D7)
// → spawnSync `gbrain import <stagingDir> --no-embed --json` (D6)
// → parseImportJson(stdout) → { imported, skipped, errors, ... } (D6 OK/ERR)
// → readNewFailures(preImportOffset, slugMap) → Set<sourcePath> (D7)
// → state.sessions[path] = { ... } for prepared files NOT in failed set
// → saveStateAtomic (F6 tmp+rename) + cleanupStagingDir
//
// We trust gbrain's content_hash idempotency (verified in
// ~/git/gbrain/src/core/import-file.ts:242-243, :478) — repeated imports
// of identical content are cheap. So we do NOT track per-file skip_reasons,
// do NOT keep a SIGTERM checkpoint, and do NOT advance a three-state verdict.
let _gbrainAvailability: boolean | null = null;
function gbrainAvailable(): boolean {
if (_gbrainAvailability !== null) return _gbrainAvailability;
try {
// Probe `--help` for the `import` subcommand. gbrain v0.20.0+ ships
// `import <dir>` (batch markdown import via path-authoritative slugs).
// If absent, we surface a single clean error here rather than failing
// the whole stage with a confusing usage message from gbrain itself.
// `gbrain --help` probes only CLI availability, not DB connectivity, so
// it doesn't strictly need DATABASE_URL. But routing through the helper
// keeps the invariant test from chasing exceptions per call site.
const help = execGbrainText(["--help"], { timeout: 5000 });
_gbrainAvailability = /^\s+import\s/m.test(help);
} catch {
_gbrainAvailability = false;
}
return _gbrainAvailability;
}
/**
* Build the markdown body with YAML frontmatter (title/type/tags) injected.
*
* Two cases:
* - Page body already starts with `---\n` (transcripts) — inject into the
* existing frontmatter block before its close fence so gbrain's frontmatter
* parser picks up the fields alongside any session-level metadata the
* transcript builder already wrote (session_id, cwd, git_remote, etc.).
* - No leading frontmatter (raw artifacts: design-docs, learnings, etc.) —
* wrap with a fresh frontmatter block carrying title/type/tags. Without
* this branch, artifact pages would land in gbrain with empty metadata.
*
* gbrain enforces slug = path-derived (slugifyPath in gbrain's sync.ts).
* We do NOT set `slug:` in frontmatter — the staging-dir filename is the
* source of truth and gbrain rejects mismatches.
*/
function renderPageBody(page: PageRecord): string {
let body = page.body;
if (body.startsWith("---\n")) {
const end = body.indexOf("\n---", 4);
if (end > 0) {
const inject = [
`title: ${JSON.stringify(page.title)}`,
`type: ${page.type}`,
`tags:`,
...page.tags.map((t) => ` - ${t}`),
].join("\n");
body = body.slice(0, end) + "\n" + inject + body.slice(end);
}
} else {
body = [
"---",
`title: ${JSON.stringify(page.title)}`,
`type: ${page.type}`,
`tags: [${page.tags.map((t) => JSON.stringify(t)).join(", ")}]`,
"---",
"",
body,
].join("\n");
}
// Strip NUL bytes — Postgres rejects 0x00 in UTF-8 text columns. Some Claude
// Code transcripts contain NUL inside user-pasted content or tool output, and
// surfacing those as `internal_error: invalid byte sequence` from the brain
// is unhelpful when we can sanitize at write time. Originally landed in v1.32.0.0
// (PR #1411) on the per-file `gbrain put` path; moved here so all staged
// pages still get the same sanitization.
body = body.replace(/\x00/g, "");
return body;
}
interface PreparedPage {
/** Page slug (path-shaped, e.g. "transcripts/claude-code/foo"). */
slug: string;
/** Original source file on disk (e.g. ~/.claude/projects/.../foo.jsonl). */
source_path: string;
/** Full markdown including frontmatter — ready to write. */
rendered_body: string;
/** Carry-through fields for state recording on success. */
page_slug: string;
partial: boolean;
}
interface StagingResult {
staging_dir: string;
written: number;
errors: Array<{ slug: string; error: string }>;
/** Map from staging-dir-relative path (e.g. "transcripts/foo.md") → source path. */
stagedPathToSource: Map<string, string>;
}
/**
* Write prepared pages to a staging dir, mirroring slug hierarchy.
*
* D1: gbrain's `slugifyPath` (sync.ts:260) derives the slug from the
* directory-aware relative path inside the import dir, so slugs containing
* slashes (e.g. "transcripts/claude-code/foo") must live in matching
* subdirectories of the staging dir. Otherwise the slug becomes flattened
* or rejected by gbrain's path-vs-frontmatter slug check (import-file.ts:429).
*
* Filename = `${slug}.md`. mkdir is recursive. Existing files overwrite.
* Errors per-file are collected; the whole batch is best-effort.
*/
function writeStaged(prepared: PreparedPage[], stagingDir: string): StagingResult {
mkdirSync(stagingDir, { recursive: true });
const stagedPathToSource = new Map<string, string>();
const errors: Array<{ slug: string; error: string }> = [];
let written = 0;
for (const p of prepared) {
const relPath = `${p.slug}.md`;
const absPath = join(stagingDir, relPath);
try {
mkdirSync(dirname(absPath), { recursive: true });
writeFileSync(absPath, p.rendered_body, "utf-8");
stagedPathToSource.set(relPath, p.source_path);
written++;
} catch (err) {
errors.push({ slug: p.slug, error: (err as Error).message });
}
}
return { staging_dir: stagingDir, written, errors, stagedPathToSource };
}
interface ImportJsonResult {
status?: string;
duration_s?: number;
imported?: number;
skipped?: number;
errors?: number;
chunks?: number;
total_files?: number;
}
/**
* Parse the `gbrain import --json` stdout payload (single JSON object on
* the last non-empty line per commands/import.ts:271-275).
*
* Returns parsed counts on success, or `null` to signal "unparseable" — the
* caller treats null as ERR (system_error) rather than silently passing
* through as zeros. Pre-2026-05-11 this returned zeros on parse failure,
* which silently masked gbrain crashes as "0 imported, 0 failed = OK".
*/
function parseImportJson(stdout: string): ImportJsonResult | null {
const lines = stdout.split("\n").map((s) => s.trim()).filter(Boolean);
for (let i = lines.length - 1; i >= 0; i--) {
const line = lines[i];
if (line.startsWith("{") && line.endsWith("}")) {
try {
const parsed = JSON.parse(line);
if (typeof parsed === "object" && parsed && "imported" in parsed) {
return parsed as ImportJsonResult;
}
} catch {
// try next line up
}
}
}
return null;
}
/**
* Read failures appended to ~/.gbrain/sync-failures.jsonl since the
* snapshotted byte offset, and map them back to source paths.
*
* D7: gbrain import writes per-file failures to sync-failures.jsonl
* (commands/import.ts:308-310) explicitly so "callers can gate state
* advances" (comment at :28). We snapshot the file size before import
* and read only the appended bytes after, so we never confuse new
* entries with prior-run leftovers.
*
* Each line is `{ path, error, code, commit, ts }`. The `path` is the
* staging-dir-relative filename gbrain saw (e.g. "transcripts/foo.md").
* stagedPathToSource maps that back to the original source file.
*/
function readNewFailures(
syncFailuresPath: string,
preImportOffset: number,
stagedPathToSource: Map<string, string>,
): Set<string> {
const failed = new Set<string>();
try {
if (!existsSync(syncFailuresPath)) return failed;
const stat = statSync(syncFailuresPath);
if (stat.size <= preImportOffset) return failed;
// Read appended bytes only. readSync with a positional offset works
// synchronously without slurping the whole file.
const fd = openSync(syncFailuresPath, "r");
try {
const buf = Buffer.alloc(stat.size - preImportOffset);
readSync(fd, buf, 0, buf.length, preImportOffset);
const text = buf.toString("utf-8");
for (const line of text.split("\n")) {
const trimmed = line.trim();
if (!trimmed) continue;
try {
const entry = JSON.parse(trimmed) as { path?: string };
if (entry.path) {
const source = stagedPathToSource.get(entry.path);
if (source) failed.add(source);
}
} catch {
// ignore malformed line
}
}
} finally {
closeSync(fd);
}
} catch {
// Best-effort. If we can't read failures, we conservatively assume
// none — caller will state-record all prepared files. Worst case:
// failed files get a retry-on-next-run shot anyway via content_hash.
}
return failed;
}
// ── Main ingest passes ─────────────────────────────────────────────────────
async function probeMode(args: CliArgs): Promise<ProbeReport> {
const state = loadState();
const ctx = makeWalkContext(args, state);
const byType: Record<MemoryType, { count: number; bytes: number }> = {
transcript: { count: 0, bytes: 0 },
eureka: { count: 0, bytes: 0 },
learning: { count: 0, bytes: 0 },
timeline: { count: 0, bytes: 0 },
"ceo-plan": { count: 0, bytes: 0 },
"design-doc": { count: 0, bytes: 0 },
retro: { count: 0, bytes: 0 },
"builder-profile-entry": { count: 0, bytes: 0 },
};
let totalFiles = 0;
let totalBytes = 0;
let newCount = 0;
let updatedCount = 0;
let unchangedCount = 0;
for (const { path, type } of walkAllSources(ctx)) {
totalFiles++;
let size = 0;
try {
size = statSync(path).size;
} catch {
continue;
}
byType[type].count++;
byType[type].bytes += size;
totalBytes += size;
const entry = state.sessions[path];
if (!entry) newCount++;
else if (fileChangedSinceState(path, state)) updatedCount++;
else unchangedCount++;
}
// Per ED2: ~25-35 min for ~11.7K transcripts = ~150ms/page synchronous
// (gitleaks + render + put + embedding). Scale linearly.
const estimateMinutes = Math.max(1, Math.round((newCount + updatedCount) * 0.15 / 60));
return {
total_files: totalFiles,
total_bytes: totalBytes,
by_type: byType,
new_count: newCount,
updated_count: updatedCount,
unchanged_count: unchangedCount,
estimate_minutes: estimateMinutes,
};
}
/**
* Prepare phase: walk sources, apply incremental + optional-secret-scan filters,
* parse transcripts/artifacts into PageRecord, render bodies with
* frontmatter. Returns the PreparedPage[] to stage + counts of files
* filtered at each gate.
*
* Secret scanning policy (post 2026-05-10 perf review):
*
* The actual cross-machine exfiltration boundary is `gstack-brain-sync`,
* which runs a regex-based secret scanner on the staged diff before
* `git commit` (see bin/gstack-brain-sync:78-110: AWS keys, GitHub
* tokens, OpenAI keys, PEM blocks, JWTs, bearer-token-in-JSON). That's
* the right place — it gates content leaving the machine.
*
* memory-ingest, by contrast, moves data from one local file to a
* local PGLite database. Scanning every source file at ingest time
* doesn't change exposure (the secret already lives in plaintext
* where the user keeps their transcripts and artifacts) but costs
* ~470s on cold runs. We removed the per-file gitleaks gate as
* redundant defense-in-depth and made it opt-in via `--scan-secrets`
* for users who want belt-and-suspenders.
*/
function preparePages(
args: CliArgs,
ctx: WalkContext,
state: IngestState,
): {
prepared: PreparedPage[];
skippedSecret: number;
skippedDedup: number;
skippedUnattributed: number;
parseFailed: number;
partialPages: number;
} {
const prepared: PreparedPage[] = [];
let skippedSecret = 0;
let skippedDedup = 0;
let skippedUnattributed = 0;
let parseFailed = 0;
let partialPages = 0;
for (const { path, type } of walkAllSources(ctx)) {
if (args.limit !== null && prepared.length >= args.limit) break;
if (args.mode === "incremental" && !fileChangedSinceState(path, state)) {
skippedDedup++;
continue;
}
// Optional belt-and-suspenders: when --scan-secrets is set, scan the
// source file with gitleaks and skip dirty ones. Off by default
// because gstack-brain-sync already gates the cross-machine boundary
// and per-file gitleaks costs ~256ms/file (4-8 min on a real corpus).
if (args.scanSecrets) {
const scan = secretScanFile(path);
if (scan.scanner === "gitleaks" && scan.findings.length > 0) {
skippedSecret++;
if (!args.quiet) {
console.error(
`[secret-scan match] ${path} (${scan.findings.length} finding${
scan.findings.length === 1 ? "" : "s"
}); skipped`,
);
}
continue;
}
}
let page: PageRecord;
try {
if (type === "transcript") {
const session = parseTranscriptJsonl(path);
if (!session) {
parseFailed++;
continue;
}
if (!args.includeUnattributed && !session.cwd) {
skippedUnattributed++;
continue;
}
page = buildTranscriptPage(path, session);
if (!args.includeUnattributed && page.git_remote === "_unattributed") {
skippedUnattributed++;
continue;
}
if (page.partial) partialPages++;
} else {
page = buildArtifactPage(path, type);
}
} catch (err) {
parseFailed++;
console.error(`[parse-error] ${path}: ${(err as Error).message}`);
continue;
}
prepared.push({
slug: page.slug,
source_path: path,
rendered_body: renderPageBody(page),
page_slug: page.slug,
partial: page.partial ?? false,
});
}
return {
prepared,
skippedSecret,
skippedDedup,
skippedUnattributed,
parseFailed,
partialPages,
};
}
/**
* Make a per-run staging directory at ~/.gstack/.staging-ingest-<pid>-<ts>/
* The pid+ts namespace avoids collisions when two ingest passes run
* concurrently (the orchestrator's lock should prevent this, but
* defense-in-depth).
*/
function makeStagingDir(): string {
const dir = join(GSTACK_HOME, `.staging-ingest-${process.pid}-${Date.now()}`);
mkdirSync(dir, { recursive: true });
return dir;
}
/**
* Persistent staging dir used in remote-http MCP mode (split-engine D11).
*
* Instead of staging to ~/.gstack/.staging-ingest-<pid>-<ts>/ and cleaning up
* after `gbrain import`, remote-http users get a stable path that survives.
* gstack-brain-sync's allowlist pushes ~/.gstack/transcripts/** to the
* artifacts repo; the brain admin's pull job indexes them into the remote
* brain. Local PGLite (if present) stays code-only.
*
* Path: ~/.gstack/transcripts/<run-id>/ (run-id pid+ts so concurrent passes
* stay separate; brain-sync push doesn't care about subdir naming).
*/
function makePersistentTranscriptDir(): string {
const dir = join(
GSTACK_HOME,
"transcripts",
`run-${process.pid}-${Date.now()}`,
);
mkdirSync(dir, { recursive: true });
return dir;
}
/**
* Detect whether the gbrain MCP is remote-http (Path 4) — and therefore we
* should NOT call `gbrain import` because we don't want the local PGLite
* polluted with transcripts (per plan D11).
*
* Reads ~/.claude.json directly (same fallback chain as gstack-gbrain-detect
* Tier 3). Cheap: one fs read, no fork-exec.
*/
function isRemoteHttpMcpMode(): boolean {
const home = process.env.HOME || homedir();
const claudeJsonPath = join(home, ".claude.json");
if (!existsSync(claudeJsonPath)) return false;
try {
const parsed = JSON.parse(readFileSync(claudeJsonPath, "utf-8")) as {
mcpServers?: {
gbrain?: { type?: string; transport?: string; url?: string };
};
};
const entry = parsed.mcpServers?.gbrain;
if (!entry) return false;
const mtype = entry.type || entry.transport || "";
if (mtype === "url" || mtype === "http" || mtype === "sse") return true;
if (entry.url) return true;
return false;
} catch {
return false;
}
}
/**
* Best-effort recursive cleanup. Failures swallowed — at worst we leak a
* staging dir to disk; the next run uses a new one and they age out via
* normal disk hygiene. We deliberately do NOT crash the pipeline on
* cleanup failure.
*/
function cleanupStagingDir(dir: string): void {
try {
rmSync(dir, { recursive: true, force: true });
} catch {
// best-effort
}
}
/**
* Track the currently-running gbrain import child + active staging dir so
* SIGTERM/SIGINT on the parent process can:
* 1. forward the signal to the child (otherwise gbrain orphans, holds the
* PGLite write lock, and burns CPU — observed during 2026-05-10 cold-run
* testing)
* 2. synchronously clean up the staging dir BEFORE process.exit (otherwise
* finally blocks in async callers don't run after process.exit from
* inside a signal handler, leaking the staging dir on every interrupt)
*/
let _activeImportChild: ChildProcess | null = null;
let _activeStagingDir: string | null = null;
let _signalHandlersInstalled = false;
function installSignalForwarder(): void {
if (_signalHandlersInstalled) return;
_signalHandlersInstalled = true;
const forward = (signal: NodeJS.Signals) => () => {
if (_activeImportChild && _activeImportChild.pid && !_activeImportChild.killed) {
try {
process.kill(_activeImportChild.pid, signal);
} catch {
// child may have already exited between the alive-check and the kill
}
}
// Synchronously clean up the active staging dir before exiting. The async
// `finally` blocks in ingestPass never run after process.exit fires from
// inside this handler, so cleanup has to happen here.
if (_activeStagingDir) {
cleanupStagingDir(_activeStagingDir);
_activeStagingDir = null;
}
// Re-raise to default action so the parent actually exits. Without this,
// a SIGTERM handler that doesn't exit holds the process alive.
process.exit(signal === "SIGINT" ? 130 : 143);
};
process.on("SIGTERM", forward("SIGTERM"));
process.on("SIGINT", forward("SIGINT"));
}
/**
* Run gbrain import as an async child so we can install signal handlers
* that kill the child on parent SIGTERM/SIGINT. Returns the same shape as
* spawnSync's result so the caller doesn't care which mode was used.
*/
function runGbrainImport(
stagingDir: string,
timeoutMs: number,
): Promise<{ status: number | null; stdout: string; stderr: string }> {
installSignalForwarder();
return new Promise((resolve) => {
// Seed DATABASE_URL from gbrain's own config so this stage works
// inside Next.js / Prisma / Rails projects with their own
// .env.local (codex review #7 — defense in depth on top of the
// parent gstack-gbrain-sync seeding the bun grandchild's env).
const child = spawnGbrainAsync(["import", stagingDir, "--no-embed", "--json"]);
_activeImportChild = child;
let stdout = "";
let stderr = "";
let timedOut = false;
const timer = setTimeout(() => {
timedOut = true;
try {
if (child.pid) process.kill(child.pid, "SIGTERM");
} catch {
// already gone
}
}, timeoutMs);
child.stdout?.on("data", (chunk) => {
stdout += chunk.toString("utf-8");
});
child.stderr?.on("data", (chunk) => {
stderr += chunk.toString("utf-8");
});
child.on("close", (status) => {
clearTimeout(timer);
_activeImportChild = null;
resolve({
status: timedOut ? null : status,
stdout,
stderr,
});
});
child.on("error", (err) => {
clearTimeout(timer);
_activeImportChild = null;
resolve({
status: null,
stdout,
stderr: stderr + `\n[spawn-error] ${(err as Error).message}`,
});
});
});
}
async function ingestPass(args: CliArgs): Promise<BulkResult> {
const t0 = Date.now();
const state = loadState();
const ctx = makeWalkContext(args, state);
// Phase 1: prepare (parse + secret-scan + filter + render frontmatter).
const prep = preparePages(args, ctx, state);
let written = 0;
let failed = 0;
if (args.noWrite) {
// --no-write: skip the gbrain import call but still record state for
// prepared pages (treat them as ingested for dedup purposes). Matches
// the prior contract from --help: "Skip gbrain put calls (still
// updates state file)".
const nowIso = new Date().toISOString();
for (const p of prep.prepared) {
try {
state.sessions[p.source_path] = {
mtime_ns: Math.floor(statSync(p.source_path).mtimeMs * 1e6),
sha256: fileSha256(p.source_path),
ingested_at: nowIso,
page_slug: p.page_slug,
partial: p.partial,
};
written++;
} catch {
// best-effort state record
}
}
state.last_full_walk = new Date().toISOString();
state.last_writer = "gstack-memory-ingest";
saveState(state);
return {
written,
skipped_secret: prep.skippedSecret,
skipped_dedup: prep.skippedDedup,
skipped_unattributed: prep.skippedUnattributed,
failed: prep.parseFailed,
duration_ms: Date.now() - t0,
partial_pages: prep.partialPages,
};
}
if (prep.prepared.length === 0) {
// Nothing to import — still touch state.last_full_walk and exit.
state.last_full_walk = new Date().toISOString();
state.last_writer = "gstack-memory-ingest";
saveState(state);
return {
written: 0,
skipped_secret: prep.skippedSecret,
skipped_dedup: prep.skippedDedup,
skipped_unattributed: prep.skippedUnattributed,
failed: prep.parseFailed,
duration_ms: Date.now() - t0,
partial_pages: prep.partialPages,
};
}
if (!gbrainAvailable()) {
const msg =
"gbrain CLI not in PATH or missing `import` subcommand. Run /setup-gbrain.";
console.error(`[memory-ingest] ERR: ${msg}`);
return {
written: 0,
skipped_secret: prep.skippedSecret,
skipped_dedup: prep.skippedDedup,
skipped_unattributed: prep.skippedUnattributed,
failed: prep.parseFailed + prep.prepared.length,
duration_ms: Date.now() - t0,
partial_pages: prep.partialPages,
system_error: msg,
};
}
// Phase 2: stage + (optionally) invoke gbrain import.
//
// Split-engine branch per plan D11: in remote-http MCP mode, we stage to a
// PERSISTENT dir under ~/.gstack/transcripts/ and SKIP `gbrain import`
// entirely. gstack-brain-sync push will pick the dir up via its allowlist
// and the brain admin's pull job will index transcripts into the remote
// brain. Local PGLite (if any) stays code-only.
const remoteHttpMode = isRemoteHttpMcpMode();
const stagingDir = remoteHttpMode
? makePersistentTranscriptDir()
: makeStagingDir();
// Register staging dir with the signal forwarder so SIGTERM/SIGINT can
// synchronously clean it up before process.exit (the async finally block
// below does NOT run after a signal-handler exit). In remote-http mode we
// skip registration — the dir is meant to persist.
if (!remoteHttpMode) {
_activeStagingDir = stagingDir;
}
try {
const staging = writeStaged(prep.prepared, stagingDir);
failed += staging.errors.length;
if (!args.quiet && staging.errors.length > 0) {
for (const e of staging.errors.slice(0, 5)) {
console.error(`[stage-error] ${e.slug}: ${e.error}`);
}
}
// D7: snapshot sync-failures.jsonl byte-offset before import so we
// can read only newly-appended failure entries afterwards.
const syncFailuresPath = join(homedir(), ".gbrain", "sync-failures.jsonl");
let preImportOffset = 0;
try {
if (existsSync(syncFailuresPath)) {
preImportOffset = statSync(syncFailuresPath).size;
}
} catch {
// best-effort; absent file → 0 offset, all future entries are "new"
}
if (!args.quiet) {
const action = remoteHttpMode
? "persisting to artifacts pipeline (skipping local gbrain import — remote-http mode)"
: "running gbrain import";
console.error(
`[memory-ingest] staged ${staging.written} pages → ${stagingDir}; ${action}...`,
);
}
// Remote-http branch (split-engine D11): no local gbrain import. The
// staged markdown lives under ~/.gstack/transcripts/<run-id>/ and the
// next gstack-brain-sync push will move it to the artifacts repo. From
// there the brain admin's pull job indexes into the remote brain.
//
// We treat ALL prepared pages as "written" since the import didn't run
// and we have no per-page failures from gbrain to filter on. The
// brain admin's pull pipeline is the authoritative gate; from this
// machine's perspective, the act of staging IS the write.
if (remoteHttpMode) {
const nowIso = new Date().toISOString();
for (const p of prep.prepared) {
try {
state.sessions[p.source_path] = {
mtime_ns: Math.floor(statSync(p.source_path).mtimeMs * 1e6),
sha256: fileSha256(p.source_path),
ingested_at: nowIso,
page_slug: p.page_slug,
partial: p.partial,
};
written++;
} catch (err) {
console.error(
`[state-record] ${p.source_path}: ${(err as Error).message}`,
);
}
}
state.last_full_walk = nowIso;
state.last_writer = "gstack-memory-ingest (remote-http mode)";
saveState(state);
if (!args.quiet) {
console.error(
`[memory-ingest] persisted ${written} pages to ${stagingDir} (brain admin will index on next pull)`,
);
}
// Skip the gbrain-import error handling + cleanupStagingDir paths
// below by short-circuiting the function.
return {
written,
skipped_secret: prep.skippedSecret,
skipped_dedup: prep.skippedDedup,
skipped_unattributed: prep.skippedUnattributed,
failed,
duration_ms: Date.now() - t0,
partial_pages: prep.partialPages,
};
}
// D6: single batch import. `--no-embed` matches the prior per-file
// behavior (we never enabled embedding); embeddings happen on-demand
// via gbrain's own pipelines. `--json` gives us structured counts.
//
// Async spawn (not spawnSync) so the signal forwarder installed in
// runGbrainImport propagates SIGTERM/SIGINT to the child. With sync
// spawn, parent termination orphans the gbrain process (observed
// during 2026-05-10 cold-run testing — gbrain kept running 15 min
// after the orchestrator timed out).
const importResult = await runGbrainImport(stagingDir, 30 * 60 * 1000);
const stdout = importResult.stdout || "";
const stderr = importResult.stderr || "";
const importJson = parseImportJson(stdout);
if (importResult.status !== 0) {
const tail = (stderr.trim().split("\n").pop() || "").slice(0, 300);
const msg = `gbrain import exited ${importResult.status}: ${tail}`;
console.error(`[memory-ingest] ERR: ${msg}`);
// We conservatively state-record nothing on a non-zero exit — per-run
// partial progress is invisible to us when the importer crashed.
// sync-failures.jsonl entries may still hold per-file detail.
failed += prep.prepared.length;
return {
written: 0,
skipped_secret: prep.skippedSecret,
skipped_dedup: prep.skippedDedup,
skipped_unattributed: prep.skippedUnattributed,
failed,
duration_ms: Date.now() - t0,
partial_pages: prep.partialPages,
system_error: msg,
};
}
if (!args.quiet) {
// Echo gbrain's own progress lines on stderr through so the user sees
// them when running interactively. Already on our stderr from the
// child via `stdio: pipe`, but we explicitly forward for clarity.
process.stderr.write(stderr);
}
if (importJson === null) {
// gbrain exited 0 but didn't emit a parseable --json line. Treat as
// ERR rather than silently passing zeros through — silent zeros let
// a future gbrain-output regression mask data loss.
const msg =
"gbrain import exited 0 but emitted no parseable --json payload. " +
"Refusing to advance state.";
console.error(`[memory-ingest] ERR: ${msg}`);
failed += prep.prepared.length;
return {
written: 0,
skipped_secret: prep.skippedSecret,
skipped_dedup: prep.skippedDedup,
skipped_unattributed: prep.skippedUnattributed,
failed,
duration_ms: Date.now() - t0,
partial_pages: prep.partialPages,
system_error: msg,
};
}
// D7: identify which staged files failed to import and exclude them
// from state recording. Source paths get a retry on the next run.
const failedSources = readNewFailures(
syncFailuresPath,
preImportOffset,
staging.stagedPathToSource,
);
failed += failedSources.size;
// Phase 3: state recording. Only files that landed in gbrain get
// their mtime+sha256 stamped. Failed source paths are deliberately
// left un-state'd so the next run re-prepares them and gbrain's
// content_hash dedup short-circuits the import.
const nowIso = new Date().toISOString();
for (const p of prep.prepared) {
if (failedSources.has(p.source_path)) continue;
try {
state.sessions[p.source_path] = {
mtime_ns: Math.floor(statSync(p.source_path).mtimeMs * 1e6),
sha256: fileSha256(p.source_path),
ingested_at: nowIso,
page_slug: p.page_slug,
partial: p.partial,
};
written++;
if (!args.quiet) {
const tag = p.partial ? " [partial]" : "";
console.log(`[${written}] ${p.page_slug}${tag}`);
}
} catch (err) {
// statSync can fail if the source file was removed mid-run; skip
// recording but don't fail the whole pass.
console.error(
`[state-record] ${p.source_path}: ${(err as Error).message}`,
);
}
}
if (!args.quiet) {
console.error(
`[memory-ingest] gbrain import: ${importJson.imported ?? 0} imported, ` +
`${importJson.skipped ?? 0} unchanged, ${importJson.errors ?? 0} failed` +
(failedSources.size > 0
? ` (see ~/.gbrain/sync-failures.jsonl for details)`
: ""),
);
}
} finally {
cleanupStagingDir(stagingDir);
_activeStagingDir = null;
}
state.last_full_walk = new Date().toISOString();
state.last_writer = "gstack-memory-ingest";
saveState(state);
return {
written,
skipped_secret: prep.skippedSecret,
skipped_dedup: prep.skippedDedup,
skipped_unattributed: prep.skippedUnattributed,
failed: failed + prep.parseFailed,
duration_ms: Date.now() - t0,
partial_pages: prep.partialPages,
};
}
// ── Output formatting ──────────────────────────────────────────────────────
function formatBytes(n: number): string {
if (n < 1024) return `${n}B`;
if (n < 1024 * 1024) return `${(n / 1024).toFixed(1)}KB`;
if (n < 1024 * 1024 * 1024) return `${(n / 1024 / 1024).toFixed(1)}MB`;
return `${(n / 1024 / 1024 / 1024).toFixed(2)}GB`;
}
function printProbeReport(r: ProbeReport, json: boolean): void {
if (json) {
console.log(JSON.stringify(r, null, 2));
return;
}
console.log("Memory ingest probe");
console.log("───────────────────");
console.log(`Total files in window: ${r.total_files}`);
console.log(`Total bytes: ${formatBytes(r.total_bytes)}`);
console.log(`New (never ingested): ${r.new_count}`);
console.log(`Updated (mtime/hash): ${r.updated_count}`);
console.log(`Unchanged: ${r.unchanged_count}`);
console.log("By type:");
for (const [t, v] of Object.entries(r.by_type)) {
if (v.count > 0) {
console.log(` ${t.padEnd(24)} ${String(v.count).padStart(6)} files ${formatBytes(v.bytes).padStart(8)}`);
}
}
console.log(`\nEstimate: ~${r.estimate_minutes} min for full --bulk pass.`);
}
function printBulkResult(r: BulkResult, args: CliArgs): void {
console.log(`\nIngest pass complete (${args.mode}):`);
console.log(` written: ${r.written}`);
console.log(` partial_pages: ${r.partial_pages} (will overwrite on next pass)`);
console.log(` skipped (dedup): ${r.skipped_dedup}`);
console.log(` skipped (secret-scan): ${r.skipped_secret}`);
console.log(` skipped (unattrib): ${r.skipped_unattributed}`);
console.log(` failed: ${r.failed}`);
console.log(` duration: ${(r.duration_ms / 1000).toFixed(1)}s`);
if (args.benchmark) {
const pps = r.duration_ms > 0 ? (r.written * 1000) / r.duration_ms : 0;
console.log(` throughput: ${pps.toFixed(2)} pages/sec`);
}
}
// ── Entry point ────────────────────────────────────────────────────────────
async function main(): Promise<void> {
const args = parseArgs();
// Engine tier detection — informational; routing happens in gbrain server-side.
const engine = detectEngineTier();
if (!args.quiet) {
console.error(`[engine] ${engine.engine}${engine.engine === "supabase" ? ` (${engine.supabase_url || "configured"})` : ""}`);
}
if (args.mode === "probe") {
const report = await probeMode(args);
printProbeReport(report, false);
return;
}
if (args.mode === "incremental" && args.quiet) {
// Steady-state fast path: log nothing unless changes happen.
const t0 = Date.now();
const result = await ingestPass(args);
const dt = Date.now() - t0;
if (result.written > 0 || result.failed > 0) {
console.error(`[memory-ingest] ${result.written} written, ${result.failed} failed in ${dt}ms`);
}
// D6: system_error → process-level failure; orchestrator sees ERR.
// Per-file errors do NOT exit non-zero.
if (result.system_error) process.exit(1);
return;
}
const result = await ingestPass(args);
printBulkResult(result, args);
if (result.system_error) process.exit(1);
}
main().catch((err) => {
console.error(`gstack-memory-ingest fatal: ${err instanceof Error ? err.message : String(err)}`);
process.exit(1);
});