v1.33.0.0 feat: /sync-gbrain memory-stage batch-import refactor (D1-D8) + F6/F9 + signal cleanup (#1432)

* refactor: batch-import architecture (D1-D8) + F6 atomic state + F9 full-file hash

bin/gstack-memory-ingest.ts: rewrite memory ingest around `gbrain import <dir>`
batch path. Replaces per-file gbrainPutPage loop (~470s of subprocess startup
per cold run) with prepare-then-batch:

  walkAllSources
    -> preparePages: mtime-skip + optional gitleaks (--scan-secrets) + parse
    -> writeStaged: mkdir -p per slug segment, hierarchical (D1)
    -> snapshot ~/.gbrain/sync-failures.jsonl byte offset
    -> runGbrainImport (async spawn) -> parseImportJson
    -> readNewFailures: read appended bytes, map back to source paths (D7)
    -> state.sessions[path] = {...} for files NOT in failed set
    -> saveStateAtomic (F6) + cleanupStagingDir

Architecture decisions:
  D1 hierarchical staging dir
  D2 cut over, deleted gbrainPutPage entirely
  D3 source-file gitleaks made opt-in via --scan-secrets (gstack-brain-sync
     owns the cross-machine boundary; per-file scan was redundant ~470s tax)
  D4 OK/ERR verdict (no DEGRADED tri-state)
  D5 unified state schema (no separate skip-list)
  D6 trust gbrain content_hash idempotency (no skip_reason bookkeeping)
  D7 byte-offset snapshot of sync-failures.jsonl + per-source mapping
  F6 saveState uses tmp+rename atomic write
  F9 fileSha256 removes 1MB cap; full-file hash (no more silent tail-edit
     misses on long partial transcripts)

Signal handling: installSignalForwarder propagates SIGTERM/SIGINT to the
gbrain child process AND synchronously cleans the staging dir before
process.exit. Pre-fix, orchestrator timeouts left gbrain processes
orphaned holding the PGLite write lock (observed: 15-hour-CPU-time
orphan still alive a day later).

parseImportJson returns null on unparseable output (treated as ERR by
caller) instead of silently zeroing through.

gbrainAvailable() probes for the `import` subcommand instead of `put`.

Plan + review chain at /Users/garrytan/.claude/plans/purrfect-tumbling-quiche.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: orchestrator OK/ERR verdict parser for batch memory ingest

gstack-gbrain-sync.ts: memory-stage parser now picks [memory-ingest] ERR
lines preferentially over the latest [memory-ingest] line, strips the
prefix and any leading 'ERR: ' for cleaner summary output, and surfaces
'(killed by signal / timeout)' when the child exits with status=null.

Matches D6's OK/ERR contract: per-file failures (FILE_TOO_LARGE etc.)
show in the summary count but only system-level failures (gbrain crash,
process kill, missing CLI) mark the stage ERR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: batch-ingest writer regressions + refresh golden ship fixtures

test/gstack-memory-ingest.test.ts: 5 new tests for the batch-import
architecture:
  1. D1 hierarchical staging slug round-trip — asserts staged file lives
     in transcripts/claude-code/<dir>/*.md, not flat at staging root
  2. Frontmatter injection — asserts title/type/tags written into the
     staged page's YAML block
  3. D7 sync-failures.jsonl exclusion — files listed as failed by
     gbrain do NOT get state-recorded; one of two test sessions lands,
     the other stays un-ingested for retry next run
  4. Missing-`import`-subcommand error path — when gbrain only advertises
     legacy `put`, memory-ingest exits 1 with [memory-ingest] ERR
  5. --scan-secrets opt-in path — verifies a dirty-source file is
     skipped via the secret-scan match when the flag is on, while a
     clean session in the same run still gets staged

Replaces the prior put-per-file shim with an import-batch shim. The
shim fails loudly (exit 99) if the new code ever regresses to per-file
`gbrain put` calls.

test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md: refresh
golden baselines to match the current generated SKILL.md content after
the v1.31.0.0 AskUserQuestion fallback-clause deletion. Goldens were
stale from that release; test was failing on origin/main before this
PR. Caught by the /ship test pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.33.0.0 docs: design doc, P2 perf TODOs, gbrain guidance block, changelog

docs/designs/SYNC_GBRAIN_BATCH_INGEST.md: full design doc with the 8
decisions (D1-D8), source-verified gbrain behaviors (content_hash
idempotency, frontmatter parity, path-authoritative slug, per-file
failure surface), measured performance vs plan target, F9 hash
migration one-time cliff note, and follow-up TODOs.

CLAUDE.md: append `## GBrain Search Guidance` block from /sync-gbrain
indicating this worktree's pin and how the agent should prefer gbrain
search over Grep for semantic queries.

TODOS.md: P2 `gbrain import` perf-on-large-staging-dirs investigation
(5,131 files takes >10min in gbrain when 501 takes 10s — likely N+1
SQL or auto-link reconciliation). P3 cache-no-changes-since-last-import
at the prepare-batch level for true no-op fast paths.

VERSION + package.json: bump to 1.33.0.0 (queue-aware via
bin/gstack-next-version — skipped v1.32.0.0 which is claimed by
sibling worktree garrytan/wellington / PR #1431).

CHANGELOG.md: v1.33.0.0 entry per the release-summary format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: setup-gbrain/memory.md reflects opt-in per-file gitleaks

Per-file gitleaks scanning during memory ingest is now opt-in via
--scan-secrets (or GSTACK_MEMORY_INGEST_SCAN_SECRETS=1). Update the
user-facing reference doc so it stops claiming "every page passes
through gitleaks." Also corrects the /gbrain-sync → /sync-gbrain
command typo and the post-incident recovery section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-05-11 18:47:33 -07:00
committed by GitHub
parent 74895062fb
commit d21ba06b5a
12 changed files with 1523 additions and 223 deletions

View File

@@ -37,9 +37,22 @@ happens after you say yes.
## What gets scanned for secrets
Every ingested page passes through **gitleaks** before write
(per D19 — replaces the regex scanner that previously ran only on
staged git diffs). Gitleaks is industry-standard, covers:
The cross-machine secret boundary is `gstack-brain-sync` (the git push
to your private artifacts repo), which runs its own scanner before any
content leaves this Mac. Local PGLite ingest doesn't change the exposure
surface for content that already lives on disk in plaintext.
Per-file **gitleaks** scanning during memory ingest is **opt-in** as of
v1.33.0.0 — off by default. To re-enable it (adds ~4-8 min to cold runs
on a large transcript corpus), use either:
```bash
gstack-memory-ingest --bulk --scan-secrets
# or
GSTACK_MEMORY_INGEST_SCAN_SECRETS=1 gstack-memory-ingest --bulk
```
When enabled, gitleaks covers:
- AWS / GCP / Azure access keys
- ANTHROPIC_API_KEY, OPENAI_API_KEY, GitHub tokens
@@ -50,13 +63,11 @@ A session with a positive finding is **skipped entirely** — not partially
redacted. The match line + rule ID are logged to stderr; you can see what
was skipped via `bun run bin/gstack-memory-ingest.ts --probe` (which
shows new vs. updated counts) or by reviewing the helper's output during
`/gbrain-sync --full`.
`/sync-gbrain --full`.
If gitleaks is not installed (run `brew install gitleaks` on macOS, or
`apt install gitleaks` on Linux), the helper warns once and disables
secret scanning. **In that mode, transcripts ingest unscanned. Don't run
ingest without gitleaks if you have any concern about secrets in your
sessions.**
`apt install gitleaks` on Linux) and you passed `--scan-secrets` anyway,
the helper warns once and disables secret scanning for that run.
## Where it goes
@@ -168,14 +179,14 @@ Common cases:
- Brain-sync git history shows every curated artifact push with the
user's git identity.
If you find a transcript page that contains a secret gitleaks missed,
the recovery path is:
If you find a transcript page that contains a secret (either because
per-file scanning was off, or gitleaks missed it), the recovery path is:
1. `gbrain delete_page <slug>` — removes from index immediately
2. Rotate the secret (rotate it anyway as a defensive measure)
3. If brain-sync is on: `git filter-repo --invert-paths --path <relative-path>`
on the brain remote for hard-delete from history
4. File a gitleaks issue with the pattern (or extend the gitleaks config
at `~/.gitleaks.toml`).
4. If the miss looks like a gitleaks rule gap, file a gitleaks issue
with the pattern (or extend the gitleaks config at `~/.gitleaks.toml`).
## Path 4: Remote MCP setup (v1.27.0.0+)