gstack/docs/designs/SYNC_GBRAIN_BATCH_INGEST.md
Garry Tan d21ba06b5a v1.33.0.0 feat: /sync-gbrain memory-stage batch-import refactor (D1-D8) + F6/F9 + signal cleanup (#1432)
* refactor: batch-import architecture (D1-D8) + F6 atomic state + F9 full-file hash

bin/gstack-memory-ingest.ts: rewrite memory ingest around `gbrain import <dir>`
batch path. Replaces per-file gbrainPutPage loop (~470s of subprocess startup
per cold run) with prepare-then-batch:

  walkAllSources
    -> preparePages: mtime-skip + optional gitleaks (--scan-secrets) + parse
    -> writeStaged: mkdir -p per slug segment, hierarchical (D1)
    -> snapshot ~/.gbrain/sync-failures.jsonl byte offset
    -> runGbrainImport (async spawn) -> parseImportJson
    -> readNewFailures: read appended bytes, map back to source paths (D7)
    -> state.sessions[path] = {...} for files NOT in failed set
    -> saveStateAtomic (F6) + cleanupStagingDir

Architecture decisions:
  D1 hierarchical staging dir
  D2 cut over, deleted gbrainPutPage entirely
  D3 source-file gitleaks made opt-in via --scan-secrets (gstack-brain-sync
     owns the cross-machine boundary; per-file scan was redundant ~470s tax)
  D4 OK/ERR verdict (no DEGRADED tri-state)
  D5 unified state schema (no separate skip-list)
  D6 trust gbrain content_hash idempotency (no skip_reason bookkeeping)
  D7 byte-offset snapshot of sync-failures.jsonl + per-source mapping
  F6 saveState uses tmp+rename atomic write
  F9 fileSha256 removes 1MB cap; full-file hash (no more silent tail-edit
     misses on long partial transcripts)

Signal handling: installSignalForwarder propagates SIGTERM/SIGINT to the
gbrain child process AND synchronously cleans the staging dir before
process.exit. Pre-fix, orchestrator timeouts left gbrain processes
orphaned holding the PGLite write lock (observed: 15-hour-CPU-time
orphan still alive a day later).

parseImportJson returns null on unparseable output (treated as ERR by
caller) instead of silently zeroing through.

gbrainAvailable() probes for the `import` subcommand instead of `put`.

Plan + review chain at /Users/garrytan/.claude/plans/purrfect-tumbling-quiche.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: orchestrator OK/ERR verdict parser for batch memory ingest

gstack-gbrain-sync.ts: memory-stage parser now picks [memory-ingest] ERR
lines preferentially over the latest [memory-ingest] line, strips the
prefix and any leading 'ERR: ' for cleaner summary output, and surfaces
'(killed by signal / timeout)' when the child exits with status=null.

Matches D4's OK/ERR contract: per-file failures (FILE_TOO_LARGE etc.)
show in the summary count but only system-level failures (gbrain crash,
process kill, missing CLI) mark the stage ERR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: batch-ingest writer regressions + refresh golden ship fixtures

test/gstack-memory-ingest.test.ts: 5 new tests for the batch-import
architecture:
  1. D1 hierarchical staging slug round-trip — asserts staged file lives
     in transcripts/claude-code/<dir>/*.md, not flat at staging root
  2. Frontmatter injection — asserts title/type/tags written into the
     staged page's YAML block
  3. D7 sync-failures.jsonl exclusion — files listed as failed by
     gbrain do NOT get state-recorded; one of two test sessions lands,
     the other stays un-ingested for retry next run
  4. Missing-`import`-subcommand error path — when gbrain only advertises
     legacy `put`, memory-ingest exits 1 with [memory-ingest] ERR
  5. --scan-secrets opt-in path — verifies a dirty-source file is
     skipped via the secret-scan match when the flag is on, while a
     clean session in the same run still gets staged

Replaces the prior put-per-file shim with an import-batch shim. The
shim fails loudly (exit 99) if the new code ever regresses to per-file
`gbrain put` calls.

test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md: refresh
golden baselines to match the current generated SKILL.md content after
the v1.31.0.0 AskUserQuestion fallback-clause deletion. Goldens were
stale from that release; test was failing on origin/main before this
PR. Caught by the /ship test pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.33.0.0 docs: design doc, P2 perf TODOs, gbrain guidance block, changelog

docs/designs/SYNC_GBRAIN_BATCH_INGEST.md: full design doc with the 8
decisions (D1-D8), source-verified gbrain behaviors (content_hash
idempotency, frontmatter parity, path-authoritative slug, per-file
failure surface), measured performance vs plan target, F9 hash
migration one-time cliff note, and follow-up TODOs.

CLAUDE.md: append `## GBrain Search Guidance` block from /sync-gbrain
indicating this worktree's pin and how the agent should prefer gbrain
search over Grep for semantic queries.

TODOS.md: P2 `gbrain import` perf-on-large-staging-dirs investigation
(5,131 files takes >10min in gbrain when 501 takes 10s — likely N+1
SQL or auto-link reconciliation). P3 cache-no-changes-since-last-import
at the prepare-batch level for true no-op fast paths.

VERSION + package.json: bump to 1.33.0.0 (queue-aware via
bin/gstack-next-version — skipped v1.32.0.0 which is claimed by
sibling worktree garrytan/wellington / PR #1431).

CHANGELOG.md: v1.33.0.0 entry per the release-summary format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: setup-gbrain/memory.md reflects opt-in per-file gitleaks

Per-file gitleaks scanning during memory ingest is now opt-in via
--scan-secrets (or GSTACK_MEMORY_INGEST_SCAN_SECRETS=1). Update the
user-facing reference doc so it stops claiming "every page passes
through gitleaks." Also corrects the /gbrain-sync → /sync-gbrain
command typo and the post-incident recovery section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:47:33 -07:00


/sync-gbrain batch ingest migration

Status: Implemented on garrytan/dublin-v1 (D1-D8 decisions land in this PR)
Branch: garrytan/dublin-v1
Owner: Garry Tan
Triggered by: /investigate run, 2026-05-09
Estimated effort: human ~3 days / CC+gstack ~2 hr
Files touched: 4 source + 1 test = 5 total (under estimate)

Decisions (post-review)

This doc captures the original architecture. Final architecture lands per the 8 review decisions captured in /Users/garrytan/.claude/plans/purrfect-tumbling-quiche.md:

  • D1 hierarchical staging dir (mkdir -p per slug segment) — kept
  • D2 cut over + delete legacy in same PR (no --legacy-ingest flag) — kept
  • D3 scan source-file first, stage only clean — kept
  • D4 three-state OK/DEGRADED/ERR verdict COLLAPSED to OK/ERR per Codex finding 7 (gbrain content_hash idempotency makes the third state redundant)
  • D5 skip_reason field in state schema DROPPED per Codex finding 7 (re-runs are cheap; no need for permanent skip-tracking)
  • D6 trust gbrain's content_hash idempotency; drop bookkeeping scaffolding (skip_reason, three-state, SIGTERM checkpoint)
  • D7 per-file failure detection via ~/.gbrain/sync-failures.jsonl (byte-offset snapshot + appended-only read)
  • D8 bundle 3 in-scope pre-existing fixes: F6 atomic saveState (tmp+rename), F8 isolated-stage benchmark, F9 full-file sha256 hash (no more 1MB cap)
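The D7 mechanism can be sketched as follows (helper names are illustrative; the real implementation lives in bin/gstack-memory-ingest.ts). The idea: record the failure log's byte size before spawning the import, then read only the bytes appended past that offset afterwards, so failures from earlier runs are never re-counted.

```typescript
import { statSync, existsSync, openSync, readSync, closeSync } from "node:fs";

// Snapshot the failures log's current size before the import runs.
function snapshotOffset(logPath: string): number {
  return existsSync(logPath) ? statSync(logPath).size : 0;
}

// After the import, read only the bytes appended past the snapshot and
// parse each JSONL line into a failure record.
function readNewFailures(logPath: string, offset: number): { path?: string }[] {
  if (!existsSync(logPath)) return [];
  const size = statSync(logPath).size;
  if (size <= offset) return [];
  const buf = Buffer.alloc(size - offset);
  const fd = openSync(logPath, "r");
  try {
    readSync(fd, buf, 0, buf.length, offset);
  } finally {
    closeSync(fd);
  }
  return buf
    .toString("utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```

This assumes the log is append-only for the duration of the run, which is what the "appended-only read" wording in D7 relies on.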

Verified from gbrain source

Four properties verified by reading ~/git/gbrain/src/:

  • Idempotency at core/import-file.ts:242-243, :478 — content_hash check, skip if unchanged, overwrite if changed.
  • Frontmatter parity at core/import-file.ts:228, 297, 410-422 — title/type/tags honored; auto-inference only when frontmatter absent.
  • Path-authoritative slug at core/sync.ts:260 (slugifyPath), enforced at core/import-file.ts:429.
  • Per-file failures surface at commands/import.ts:308-310, comment at :28: "callers can gate state advances" — the intentional API for what D7 uses.

Performance: planned vs measured (post 2026-05-10 perf review)

| Metric | Plan target | Measured | Verdict |
| --- | --- | --- | --- |
| Prepare phase on 5135 files | | <10s | FAST |
| gbrain import on 5135 files | | >10 min | gbrain-side perf issue, filed |
| Loop / hang (original bug) | never | never | FIXED |
| Memory ingest exits null on SIGTERM | no | no — state writes succeed; child gbrain dies with parent | FIXED |
| FILE_TOO_LARGE blocks last_commit | no | no — failed paths excluded via D7 | FIXED |

Initial perf miss + correction. The first cold-run measurement (~12 min) was dominated by 1841 sequential gitleaks subprocess spawns at ~256ms each — a redundant security gate. The cross-machine exfiltration boundary is gstack-brain-sync (bin/gstack-brain-sync:78-110, regex-based secret scan on staged diff before git commit). Scanning every source file before ingest into a LOCAL PGLite doesn't change exposure — the secret already lives on disk in plaintext. We made per-file gitleaks opt-in via --scan-secrets. Default is off. That cut the prepare phase from ~12 min to under 10 seconds.

The remaining cold-run cost is gbrain import itself, which scales worse than linear on large staging dirs (10s for 501 files; >10 min for 5031). That's a gbrain-side perf issue, not gstack architecture. Filed as a TODO; the fix likely lives in gbrain's content_hash check loop or auto-link reconciliation phase.

F9 hash migration (one-time cliff)

F9 switched fileSha256 from a 1MB-capped hash to full-file. Existing state entries from before this change carry the old 1MB-capped hash. For any file whose mtime hasn't changed, fileChangedSinceState returns false at the mtime check and the new hash is never computed — so unchanged files behave identically. For any file whose mtime DOES change after upgrade, the full-file hash is recomputed and (correctly) treated as changed, then re-imported. The gbrain doctor probe report's updated_count may show inflated numbers on the first run post-upgrade because every touched file crosses the algorithm boundary. No data loss, but worth knowing.
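The interaction described above can be sketched as follows (a minimal version, assuming a state entry of `{ mtimeMs, sha256 }`; the real state schema and helper names may differ):

```typescript
import { createHash } from "node:crypto";
import { readFileSync, statSync } from "node:fs";

interface StateEntry { mtimeMs: number; sha256: string }

// F9: full-file sha256, no 1MB cap, so tail edits on long transcripts
// always change the hash.
function fileSha256(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

// The mtime short-circuit is why the migration has no steady-state cost:
// if mtime is unchanged, the (old, capped) stored hash is never consulted.
function fileChangedSinceState(path: string, prev?: StateEntry): boolean {
  if (!prev) return true;                        // never ingested
  const st = statSync(path);
  if (st.mtimeMs === prev.mtimeMs) return false; // hash never recomputed
  return fileSha256(path) !== prev.sha256;       // algorithm boundary crossed here
}
```

A pre-upgrade entry with a capped hash only crosses the algorithm boundary once its file's mtime moves, at which point the mismatch triggers one re-import and the state entry is rewritten with the full-file hash.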

Follow-ups (filed as TODOs)

  1. gbrain import perf on large dirs — investigate why 5031 files take >10 min when 501 takes 10s. Likely culprits: N+1 SQL for getPage(slug) content_hash check, per-page auto-link reconciliation, FTS index updates without batching. Lives in gbrain, not gstack.
  2. Optional: source-file changed-detection cache — even with the prepare phase fast, walking 5031 files takes some time. Caching the "no changes since last successful import" state at the batch level (not per-file) would skip the prepare phase entirely on a no-op incremental run.

Problem

/sync-gbrain memory stage takes 35 minutes on a fresh PGLite and exits null, losing all progress. Subsequent runs redo the same 35 minutes. Observed in two consecutive runs (gbrain 0.30.0 broken-postgres run: 712s exit-null; gbrain 0.31.2 PGLite run: 2100s exit-null with 501 pages actually persisted).

Root cause (from /investigate)

Two compounding bugs in bin/gstack-memory-ingest.ts:

  1. Subprocess-per-file architecture. The ingest loop at line 911 walks 1,841 files in ~/.gstack/projects/ and spawns two subprocesses per file:

    • gitleaks detect --no-git --source <path> — 46ms cold start (lib/gstack-memory-helpers.ts:157)
    • gbrain put <slug> — 329ms cold start (bin/gstack-memory-ingest.ts:823)
    • Per-file floor: 375ms × 1841 = 690s (11.5 min) of pure subprocess startup before any actual work happens.
  2. Kill-no-save timeout. Orchestrator at bin/gstack-gbrain-sync.ts:442 enforces a 35-min timeout. When it fires, spawnSync returns result.status === null, the child gets SIGTERM, and the in-memory ingest state never flushes to ~/.gstack/.transcript-ingest-state.json. Next run starts from the same un-progressed state — explains the redo-everything pattern.

Numbers from the field

| Metric | Value | Source |
| --- | --- | --- |
| Files in walkAllSources | 1,841 | `find ~/.gstack/projects -type f \( -name "*.md" -o -name "*.jsonl" \)` |
| gbrain put cold start | 329ms | `time (echo "test" \| gbrain put _bench)` |
| gitleaks detect cold start | 46ms | `time gitleaks detect --no-git --source <small-file>` |
| Theoretical floor (subprocess only) | 690s / 11.5 min | 375ms × 1841 |
| Observed run time | 2100s / 35 min | matches orchestrator timeout exactly |
| Pages actually persisted | 501 | `gbrain sources list` page_count |
| PGLite growth during run | 290 → 386 MB | `du -sh ~/.gbrain/brain.pglite` |

Proposed architecture

Replace the per-file subprocess loop with a prepare-then-batch pipeline:

walkAllSources(ctx)
  → prepareStage (in-process, fast):
       parse transcripts/artifacts
       build PageRecord with custom YAML frontmatter
       gitleaks scan (single subprocess on staging dir)
       write prepared .md to staging dir
  → gbrain import <staging-dir> --no-embed (single subprocess)
  → flush state file with all successes
  → cleanup staging dir

Why gbrain import <dir> is the right batch path

  • Already shipped in gbrain CLI (verified: gbrain --help shows import <dir> [--no-embed]).
  • Walks dir in-process inside gbrain's own runtime — no subprocess fan-out.
  • Honors gbrain's batch-size and embedding-batch tuning.
  • gbrain v0.31.2 import did 501 pages + 2906 chunks in 10 seconds during the observed run; the slow part was OUR per-file gbrain put loop above it.

What we keep that the current code does right

  • Custom YAML frontmatter injection (title, type, tags) — preserved by writing prepared .md files with frontmatter into the staging dir.
  • Secret scanning — preserved, but moved to ONE gitleaks detect --source <staging-dir> call after prepare, before import. Files with findings get redacted or excluded; staging dir guarantees gitleaks sees only the prepared content, not internal gbrain state.
  • Partial-transcript detection — preserved in prepare stage; partial files still get a partial: true field in frontmatter.
  • Unattributed-transcript filtering — preserved in prepare stage.
  • Per-file mtime + sha256 state tracking — preserved; the prepare stage records what got staged, the import-success result records what landed.
  • Incremental mode — the fileChangedSinceState check stays at the top of the prepare loop.

Migration steps

Step 1: extract preparePages from current ingest loop

Take everything in ingestPass (lines 899-988 of bin/gstack-memory-ingest.ts) between the walk and the gbrainPutPage call. Move into a new function preparePages(args, ctx, state) → { staged: PreparedPage[], skipped, failed }.

Output: list of { slug, body, source_path, mtime_ns, sha256, partial } where body is the full markdown including frontmatter.
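A sketch of the output record and a frontmatter-bearing body builder (field names follow this doc; the real injector, around lines 794-821, edits existing frontmatter blocks rather than always prepending a fresh one):

```typescript
// Hypothetical shape of preparePages' per-file output, per the step above.
interface PreparedPage {
  slug: string;        // path-derived; gbrain treats the path as the authoritative slug
  body: string;        // full markdown, custom YAML frontmatter included
  source_path: string;
  mtime_ns: number;
  sha256: string;
  partial: boolean;    // partial-transcript flag, carried into frontmatter
}

// Minimal frontmatter builder for the staged body.
function buildStagedBody(
  meta: { title: string; type: string; tags: string[]; partial: boolean },
  content: string,
): string {
  const lines = [
    "---",
    `title: ${JSON.stringify(meta.title)}`,
    `type: ${meta.type}`,
    `tags: [${meta.tags.join(", ")}]`,
  ];
  if (meta.partial) lines.push("partial: true");
  lines.push("---", "");
  return lines.join("\n") + content;
}
```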

Step 2: add staging dir writer

Pure function: writeStaged(prepared, stagingDir) → { written, errors }. Filename: ${slug}.md. Idempotent overwrite.

Staging dir lifecycle:

  • Created at ~/.gstack/.staging-ingest-${pid}-${ts}/
  • Cleaned in finally block, even on SIGTERM
  • One staging dir per ingest pass — never reused across runs
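A sketch of the writer, incorporating the final D1 decision (slash-bearing slugs become nested directories, mkdir -p per segment) rather than the flat `${slug}.md` described above:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";

// Idempotent staging writer: re-running a pass simply overwrites the
// previously staged files, matching the "idempotent overwrite" contract.
function writeStaged(
  prepared: { slug: string; body: string }[],
  stagingDir: string,
): { written: string[]; errors: string[] } {
  const written: string[] = [];
  const errors: string[] = [];
  for (const page of prepared) {
    try {
      const dest = join(stagingDir, `${page.slug}.md`);
      mkdirSync(dirname(dest), { recursive: true }); // D1: per-segment mkdir -p
      writeFileSync(dest, page.body);
      written.push(dest);
    } catch (err) {
      errors.push(`${page.slug}: ${String(err)}`);
    }
  }
  return { written, errors };
}
```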

Step 3: single gitleaks pass

Replace per-file secretScanFile(path) calls with one call after prepare: gitleaks detect --no-git --source <staging-dir> --report-format json --report-path -.

Parse JSON output, build Map<slug, findings[]>. Files with findings get removed from staging dir before import (or sanitized in place per existing redaction policy in lib/gstack-memory-helpers.ts).
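The report-to-slug mapping might look like this (gitleaks' JSON report is an array of findings with a `File` field; the slug-recovery logic here is an assumption about how staged paths map back):

```typescript
interface GitleaksFinding { File: string; RuleID: string }

// Group findings by staged slug so offending pages can be removed from the
// staging dir (or redacted) before the import runs.
function findingsBySlug(
  reportJson: string,
  stagingDir: string,
): Map<string, GitleaksFinding[]> {
  const findings: GitleaksFinding[] = JSON.parse(reportJson);
  const bySlug = new Map<string, GitleaksFinding[]>();
  for (const f of findings) {
    // Strip the staging prefix and .md suffix to recover the slug.
    const slug = f.File.replace(stagingDir + "/", "").replace(/\.md$/, "");
    const list = bySlug.get(slug) ?? [];
    list.push(f);
    bySlug.set(slug, list);
  }
  return bySlug;
}
```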

Step 4: replace gbrainPutPage loop with single import call

const importResult = spawnSync("gbrain", ["import", stagingDir], {
  stdio: ["ignore", "inherit", "inherit"],
  timeout: 30 * 60 * 1000, // generous; whole batch
});

Parse stdout for the Import complete line and the failed count.
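A sketch of that parser (the exact wording of gbrain's summary line is assumed here, not verified; the shipped code ended up parsing JSON output instead, with null treated as ERR):

```typescript
// Returns null on unparseable output so the caller can mark the stage ERR
// instead of silently zeroing through.
function parseImportSummary(stdout: string): { imported: number; failed: number } | null {
  const m = stdout.match(/Import complete[^\d]*(\d+)\s+imported[^\d]*(\d+)\s+failed/i);
  if (!m) return null;
  return { imported: Number(m[1]), failed: Number(m[2]) };
}
```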

Step 5: persist state on partial success

If gbrain import reports imported=N, failed=M, save state for the N successful slugs (not all of them). Failures stay un-state'd so they retry next run, but successes don't redo.
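In sketch form (names hypothetical; in the shipped code the failed set comes from the D7 failure-log read):

```typescript
// Successes advance state; failures stay un-state'd so the next run retries
// exactly them and nothing else.
function recordSuccesses(
  state: Record<string, { sha256: string }>,
  staged: { slug: string; source_path: string; sha256: string }[],
  failedSlugs: Set<string>,
): void {
  for (const page of staged) {
    if (failedSlugs.has(page.slug)) continue; // retry next run
    state[page.source_path] = { sha256: page.sha256 };
  }
}
```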

Step 6: SIGTERM handler in gstack-memory-ingest.ts

Wrap main() in:

let flushed = false;
const flush = (signal: NodeJS.Signals) => {
  if (flushed) return; // guard against a second signal re-entering
  flushed = true;
  saveState(state); // best-effort flush of whatever's accumulated
  cleanupStagingDir();
  process.exit(signal === "SIGINT" ? 130 : 143); // 128 + signal number
};
process.on("SIGTERM", () => flush("SIGTERM"));
process.on("SIGINT", () => flush("SIGINT"));

This unblocks the kill-no-save bug independently — even if the batch import runs over the orchestrator timeout, state from the prepare stage survives.

Step 7: orchestrator update

In bin/gstack-gbrain-sync.ts:444:

  • Change result.status === 0 to result.status === 0 || (parsedSummary.imported > 0 && parsedSummary.imported >= parsedSummary.skipped + parsedSummary.failed). Treat partial success (most pages imported) as OK, not ERR.
  • Surface failed_count and partial_blockers in the stage summary so the user sees Memory ... OK 487/501 imported (14 FILE_TOO_LARGE) instead of ERR exited null.
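The verdict rule above reduces to a small pure function (field names illustrative):

```typescript
interface ImportSummary { imported: number; skipped: number; failed: number }

// Partial success reads OK; only total failure, or an unparseable summary
// alongside a non-zero exit, stays ERR.
function stageVerdict(status: number | null, summary: ImportSummary | null): "OK" | "ERR" {
  if (status === 0) return "OK";
  if (summary && summary.imported > 0 && summary.imported >= summary.skipped + summary.failed) {
    return "OK";
  }
  return "ERR"; // includes status === null (killed by signal / timeout)
}
```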

Step 8: handle FILE_TOO_LARGE specifically

When gbrain reports FILE_TOO_LARGE, log to a new ~/.gstack/.ingest-skip-list.json so the next prepare stage skips that file entirely. Avoids re-staging a file that will always fail. User can review the skip list with a new gstack-memory-ingest --skip-list flag.
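A minimal skip-list persistence sketch, assuming a flat path-to-reason JSON map (the final D5 decision later dropped the separate skip-list in favor of the unified state schema):

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Record a permanently-failing file so prepare skips it next run.
function addToSkipList(skipListPath: string, sourcePath: string, reason: string): void {
  const list: Record<string, string> = existsSync(skipListPath)
    ? JSON.parse(readFileSync(skipListPath, "utf8"))
    : {};
  list[sourcePath] = reason;
  writeFileSync(skipListPath, JSON.stringify(list, null, 2));
}

// Consulted at the top of the prepare loop, before staging.
function shouldSkip(skipListPath: string, sourcePath: string): boolean {
  if (!existsSync(skipListPath)) return false;
  const list: Record<string, string> = JSON.parse(readFileSync(skipListPath, "utf8"));
  return sourcePath in list;
}
```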

Test plan

  1. Unit (free, runs in bun test):

    • preparePages against fixture corpus of 50 files: assert YAML correct, partial detection works, unattributed filtered.
    • writeStaged overwrite idempotency.
    • SIGTERM handler flush behavior using a child-process test harness.
  2. Integration (free, runs in bun test):

    • End-to-end: prepare → gitleaks → gbrain import on a temp PGLite, assert page_count matches imported count.
    • Partial-success path: inject a deliberate FILE_TOO_LARGE; assert successes still state'd, failure logged to skip list.
    • State preservation across SIGTERM: spawn ingest, kill at midpoint, restart, assert resumed state.
  3. Benchmark gate (periodic, paid):

    • Cold run on 1841-file fixture: assert under 8 min.
    • Incremental run (no changes): assert under 60 sec.
    • Test fixture: copy of ~/.gstack/projects/ snapshot for repeatable timing.

Rollback strategy

  • New --legacy-ingest flag on gstack-memory-ingest keeps the old per-file path callable for one release cycle.
  • If batch path regresses on a real corpus, set gstack-config set memory_ingest_path legacy to revert without redeploy.
  • Remove flag + legacy path one minor version after confirming batch is stable.

Risks & open questions for plan-eng-review

  1. gbrain import idempotency on overlapping slugs. If a previous run wrote slug X to PGLite with old content, does gbrain import of updated-X overwrite or duplicate? Need to test before relying on it.

  2. Frontmatter injection inside gbrain import parser. Current code knows how to inject title/type/tags into existing frontmatter blocks (line 794-821). Does gbrain import honor those fields the same way gbrain put does? Verify in unit test.

  3. Staging dir disk pressure. 1841 files × avg ~50KB = ~92MB of staging .md content. Acceptable on dev machines but worth knowing. Alternative: stream prepared content to a tar piped to import (if gbrain supports it) — likely not, ignore for V1.

  4. Cross-worktree concurrency. ~/.gstack/.staging-ingest-${pid}-${ts}/ is pid-namespaced so two concurrent /sync-gbrain runs don't collide. But the orchestrator already holds a lock at ~/.gstack/.sync-gbrain.lock so this is belt-and-suspenders. Keep it.

  5. The "memory ingest exited null" message. After this change, the orchestrator might still see status=null on real OOM kills or SIGKILL. Should the verdict block be more honest? E.g., ERR memory: killed by signal SIGTERM at 35:00 (timeout).

  6. Should we deprecate gbrain put for memory entirely? The legacy path exists for V1.5's put_file migration plan. With batch import working, do we still need single-page put as a fallback for ad-hoc ingestion? Probably yes (for ~/.gstack/.transcript-ingest-state.json updates triggered outside the orchestrator), but worth confirming.

What this isn't

  • Not a gbrain CLI change. All work is in gstack.
  • Not a CLAUDE.md voice/UX change.
  • Not a new user-facing feature. CHANGELOG entry will read: "Memory ingest is ~10× faster on cold runs and survives interruption."

Acceptance criteria

  • Cold /sync-gbrain on 1841 files completes in under 8 minutes.
  • Incremental /sync-gbrain (no file changes) completes in under 60 seconds.
  • SIGTERM mid-run flushes state; next run resumes without redoing successfully-imported files.
  • FILE_TOO_LARGE failures don't block sync.last_commit advancement.
  • All existing test fixtures (transcripts, learnings, design-docs, ceo-plans) ingest correctly with full frontmatter.
  • No regression on partial-transcript or unattributed-transcript handling.