hai/gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-13 07:53:04 +08:00

Garry Tan d21ba06b5a v1.33.0.0 feat: /sync-gbrain memory-stage batch-import refactor (D1-D8) + F6/F9 + signal cleanup (#1432)

* refactor: batch-import architecture (D1-D8) + F6 atomic state + F9 full-file hash

bin/gstack-memory-ingest.ts: rewrite memory ingest around `gbrain import <dir>`
batch path. Replaces per-file gbrainPutPage loop (~470s of subprocess startup
per cold run) with prepare-then-batch:

  walkAllSources
    -> preparePages: mtime-skip + optional gitleaks (--scan-secrets) + parse
    -> writeStaged: mkdir -p per slug segment, hierarchical (D1)
    -> snapshot ~/.gbrain/sync-failures.jsonl byte offset
    -> runGbrainImport (async spawn) -> parseImportJson
    -> readNewFailures: read appended bytes, map back to source paths (D7)
    -> state.sessions[path] = {...} for files NOT in failed set
    -> saveStateAtomic (F6) + cleanupStagingDir
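
The byte-offset step in the pipeline above (D7) can be sketched as follows. This is a minimal illustration, not the code in bin/gstack-memory-ingest.ts: the function names `snapshotOffset` and `readAppendedRecords` and the `path` field in each record are assumptions made for the example.

```typescript
import { appendFileSync, closeSync, existsSync, fstatSync, mkdtempSync, openSync, readSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type FailureRecord = Record<string, unknown>;

// Current size of the failure log in bytes; 0 if it does not exist yet.
function snapshotOffset(logPath: string): number {
  return existsSync(logPath) ? statSync(logPath).size : 0;
}

// Parse only the JSONL records appended after `offset`.
function readAppendedRecords(logPath: string, offset: number): FailureRecord[] {
  if (!existsSync(logPath)) return []; // no log file means nothing was recorded
  const fd = openSync(logPath, "r");
  try {
    const size = fstatSync(fd).size;
    if (size <= offset) return [];
    const buf = Buffer.alloc(size - offset);
    const bytesRead = readSync(fd, buf, 0, buf.length, offset);
    const records: FailureRecord[] = [];
    for (const line of buf.subarray(0, bytesRead).toString("utf8").split("\n")) {
      if (line.trim() === "") continue;
      try {
        records.push(JSON.parse(line) as FailureRecord);
      } catch {
        // Skip a partially written trailing line instead of failing the run.
      }
    }
    return records;
  } finally {
    closeSync(fd);
  }
}

// Demo against a throwaway file (the `path` field is hypothetical).
const demoDir = mkdtempSync(join(tmpdir(), "sync-failures-demo-"));
const demoLog = join(demoDir, "sync-failures.jsonl");
writeFileSync(demoLog, JSON.stringify({ path: "old-failure.md" }) + "\n");
const offsetBefore = snapshotOffset(demoLog); // taken before the import starts
appendFileSync(demoLog, JSON.stringify({ path: "new-failure.md" }) + "\n");
const newFailures = readAppendedRecords(demoLog, offsetBefore);
console.log(newFailures.map((r) => r["path"]));
```

Reading from a saved offset means failures left over from earlier runs are never attributed to the current batch, and the log itself never needs to be truncated or rewritten.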

Architecture decisions:
  D1 hierarchical staging dir
  D2 cut over, deleted gbrainPutPage entirely
  D3 source-file gitleaks made opt-in via --scan-secrets (gstack-brain-sync
     owns the cross-machine boundary; per-file scan was redundant ~470s tax)
  D4 OK/ERR verdict (no DEGRADED tri-state)
  D5 unified state schema (no separate skip-list)
  D6 trust gbrain content_hash idempotency (no skip_reason bookkeeping)
  D7 byte-offset snapshot of sync-failures.jsonl + per-source mapping
  F6 saveState uses tmp+rename atomic write
  F9 fileSha256 removes 1MB cap; full-file hash (no more silent tail-edit
     misses on long partial transcripts)
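
As a rough sketch of what F6 and F9 describe (not the actual implementation), a tmp+rename save and an uncapped file hash might look like this. The names `saveJsonAtomic` and `sha256OfFile` and the temp-file naming scheme are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// F6-style save: write a sibling temp file, then rename it over the target.
function saveJsonAtomic(targetPath: string, state: unknown): void {
  const tmpPath = `${targetPath}.${process.pid}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(state, null, 2));
  renameSync(tmpPath, targetPath); // atomic on POSIX within one filesystem
}

// F9-style hash: digest every byte of the file, with no size cap.
function sha256OfFile(filePath: string): string {
  return createHash("sha256").update(readFileSync(filePath)).digest("hex");
}

const workDir = mkdtempSync(join(tmpdir(), "ingest-state-demo-"));
const statePath = join(workDir, "state.json");
saveJsonAtomic(statePath, { sessions: {} });
const roundTripped = JSON.parse(readFileSync(statePath, "utf8")) as { sessions: object };

// Two files that differ only in their final byte, past the 1 MB mark.
const big = Buffer.alloc(1024 * 1024 + 16, 0x61);
const fileA = join(workDir, "a.jsonl");
const fileB = join(workDir, "b.jsonl");
writeFileSync(fileA, big);
big[big.length - 1] = 0x62;
writeFileSync(fileB, big);
const hashesDiffer = sha256OfFile(fileA) !== sha256OfFile(fileB);
console.log({ roundTripped, hashesDiffer });
```

A reader of the state file sees either the old contents or the new contents, never a half-written file. A hash capped at the first 1 MB would report the two demo files as identical, which is the tail-edit miss F9 removes.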

Signal handling: installSignalForwarder propagates SIGTERM/SIGINT to the
gbrain child process AND synchronously cleans the staging dir before
process.exit. Pre-fix, orchestrator timeouts left gbrain processes
orphaned holding the PGLite write lock (observed: 15-hour-CPU-time
orphan still alive a day later).

parseImportJson returns null on unparseable output (treated as ERR by
caller) instead of silently zeroing through.

gbrainAvailable() probes for the `import` subcommand instead of `put`.

Plan + review chain at /Users/garrytan/.claude/plans/purrfect-tumbling-quiche.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: orchestrator OK/ERR verdict parser for batch memory ingest

gstack-gbrain-sync.ts: memory-stage parser now picks [memory-ingest] ERR
lines preferentially over the latest [memory-ingest] line, strips the
prefix and any leading 'ERR: ' for cleaner summary output, and surfaces
'(killed by signal / timeout)' when the child exits with status=null.

Matches D4's OK/ERR contract: per-file failures (FILE_TOO_LARGE etc.)
show in the summary count but only system-level failures (gbrain crash,
process kill, missing CLI) mark the stage ERR.
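
The parsing rule described above can be sketched like this. It is an illustration under assumptions, not the code in gstack-gbrain-sync.ts: the function name, the exact line shapes in the demo, and the rule that a non-zero exit or any ERR line yields an ERR verdict are all guesses.

```typescript
const PREFIX = "[memory-ingest] ";

interface StageSummary {
  verdict: "OK" | "ERR";
  summary: string;
}

function summarizeMemoryStage(outputLines: string[], exitStatus: number | null): StageSummary {
  // status === null means the child was killed (signal or orchestrator timeout).
  if (exitStatus === null) {
    return { verdict: "ERR", summary: "(killed by signal / timeout)" };
  }
  const tagged = outputLines.filter((l) => l.startsWith(PREFIX));
  const errLines = tagged.filter((l) => l.startsWith(PREFIX + "ERR"));
  // Prefer the last ERR line; otherwise fall back to the latest tagged line.
  const chosen = errLines.length > 0 ? errLines[errLines.length - 1] : tagged[tagged.length - 1];
  // Strip the prefix and any leading "ERR: " for a cleaner summary.
  const summary = (chosen ?? "").slice(PREFIX.length).replace(/^ERR: /, "");
  // ASSUMPTION: a non-zero exit or any ERR line marks the stage ERR.
  const verdict = exitStatus !== 0 || errLines.length > 0 ? "ERR" : "OK";
  return { verdict, summary };
}

// Hypothetical output lines, for illustration only.
const failed = summarizeMemoryStage(
  ["[memory-ingest] ERR: gbrain import crashed", "[memory-ingest] staged 12 pages"],
  1,
);
const killed = summarizeMemoryStage([], null);
const clean = summarizeMemoryStage(["noise", "[memory-ingest] OK: 12 imported"], 0);
console.log(failed, killed, clean);
```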

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: batch-ingest writer regressions + refresh golden ship fixtures

test/gstack-memory-ingest.test.ts: 5 new tests for the batch-import
architecture:
  1. D1 hierarchical staging slug round-trip — asserts staged file lives
     in transcripts/claude-code/<dir>/*.md, not flat at staging root
  2. Frontmatter injection — asserts title/type/tags written into the
     staged page's YAML block
  3. D7 sync-failures.jsonl exclusion — files listed as failed by
     gbrain do NOT get state-recorded; one of two test sessions lands,
     the other stays un-ingested for retry next run
  4. Missing-`import`-subcommand error path — when gbrain only advertises
     legacy `put`, memory-ingest exits 1 with [memory-ingest] ERR
  5. --scan-secrets opt-in path — verifies a dirty-source file is
     skipped via the secret-scan match when the flag is on, while a
     clean session in the same run still gets staged

Replaces the prior put-per-file shim with an import-batch shim. The
shim fails loudly (exit 99) if the new code ever regresses to per-file
`gbrain put` calls.

test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md: refresh
golden baselines to match the current generated SKILL.md content after
the v1.31.0.0 AskUserQuestion fallback-clause deletion. Goldens were
stale from that release; test was failing on origin/main before this
PR. Caught by the /ship test pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.33.0.0 docs: design doc, P2 perf TODOs, gbrain guidance block, changelog

docs/designs/SYNC_GBRAIN_BATCH_INGEST.md: full design doc with the 8
decisions (D1-D8), source-verified gbrain behaviors (content_hash
idempotency, frontmatter parity, path-authoritative slug, per-file
failure surface), measured performance vs plan target, F9 hash
migration one-time cliff note, and follow-up TODOs.

CLAUDE.md: append `## GBrain Search Guidance` block from /sync-gbrain
indicating this worktree's pin and how the agent should prefer gbrain
search over Grep for semantic queries.

TODOS.md: P2 `gbrain import` perf-on-large-staging-dirs investigation
(5,131 files takes >10min in gbrain when 501 takes 10s — likely N+1
SQL or auto-link reconciliation). P3 cache-no-changes-since-last-import
at the prepare-batch level for true no-op fast paths.

VERSION + package.json: bump to 1.33.0.0 (queue-aware via
bin/gstack-next-version — skipped v1.32.0.0 which is claimed by
sibling worktree garrytan/wellington / PR #1431).

CHANGELOG.md: v1.33.0.0 entry per the release-summary format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: setup-gbrain/memory.md reflects opt-in per-file gitleaks

Per-file gitleaks scanning during memory ingest is now opt-in via
--scan-secrets (or GSTACK_MEMORY_INGEST_SCAN_SECRETS=1). Update the
user-facing reference doc so it stops claiming "every page passes
through gitleaks." Also corrects the /gbrain-sync → /sync-gbrain
command typo and the post-incident recovery section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 18:47:33 -07:00

gstack memory ingest — what it does, what stays local, what you can do with it

This is the user-facing reference for the V1 transcript + memory ingest feature in /setup-gbrain. If you ran /setup-gbrain and it asked "Ingest THIS repo's transcripts into gbrain?", this doc explains what happens after you say yes.

What gets ingested

| Source | Type | Where | Sensitivity |
|---|---|---|---|
| Claude Code session JSONL | `transcript` | `~/.claude/projects/*/` | High: full conversations including tool I/O |
| Codex CLI session JSONL | `transcript` | `~/.codex/sessions/YYYY/MM/DD/` | High |
| Cursor session SQLite (V1.0.1) | `transcript` | `~/Library/Application Support/Cursor/` | High (same as above); deferred to V1.0.1 |
| Eureka log | `eureka` | `~/.gstack/analytics/eureka.jsonl` | Medium: your insights, often non-secret |
| Project learnings | `learning` | `~/.gstack/projects/<slug>/learnings.jsonl` | Medium |
| Project timeline | `timeline` | `~/.gstack/projects/<slug>/timeline.jsonl` | Low |
| CEO plans | `ceo-plan` | `~/.gstack/projects/<slug>/ceo-plans/*.md` | Medium |
| Design docs | `design-doc` | `~/.gstack/projects/<slug>/-design-.md` | Medium |
| Retros | `retro` | `~/.gstack/projects/<slug>/retros/*.md` | Medium |
| Builder profile | `builder-profile-entry` | `~/.gstack/builder-profile.jsonl` | Low |

What stays local

State files (~/.gstack/.gbrain-sync-state.json, ~/.gstack/.transcript-ingest-state.json, ~/.gstack/.gbrain-engine-cache.json, ~/.gstack/.gbrain-errors.jsonl) are local-only per ED1 (state file sync semantics decision). They are not synced via the brain remote.
Sessions with no resolvable git remote (running in /tmp/, scratch dirs, etc.) are skipped by default. Pass --include-unattributed to the ingest helper to opt them in.
Repos under a deny trust policy (set in /setup-gbrain Step 6) are skipped — neither code nor transcripts from those repos ingest.
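
The two skip rules above can be read as a small decision function. The sketch below is illustrative only: the `SessionSource` shape and the function name are assumptions, and the real helper may evaluate these checks differently.

```typescript
interface SessionSource {
  gitRemote: string | null;   // null when no git remote could be resolved
  trustPolicy: string | null; // e.g. "deny" or "read-write"; null when unset
}

interface IngestDecision {
  ingest: boolean;
  reason: string;
}

function decideIngest(source: SessionSource, includeUnattributed: boolean): IngestDecision {
  if (source.gitRemote === null) {
    // Scratch dirs, /tmp/, etc.: skipped unless --include-unattributed is passed.
    return includeUnattributed
      ? { ingest: true, reason: "unattributed, opted in" }
      : { ingest: false, reason: "no resolvable git remote" };
  }
  if (source.trustPolicy === "deny") {
    return { ingest: false, reason: "repo trust policy is deny" };
  }
  return { ingest: true, reason: "attributed and allowed" };
}

const scratch = decideIngest({ gitRemote: null, trustPolicy: null }, false);
const scratchOptIn = decideIngest({ gitRemote: null, trustPolicy: null }, true);
const denied = decideIngest({ gitRemote: "github.com/example/repo", trustPolicy: "deny" }, true);
console.log(scratch, scratchOptIn, denied);
```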

What gets scanned for secrets

The cross-machine secret boundary is gstack-brain-sync (the git push to your private artifacts repo), which runs its own scanner before any content leaves this Mac. Local PGLite ingest doesn't change the exposure surface for content that already lives on disk in plaintext.

Per-file gitleaks scanning during memory ingest is opt-in as of v1.33.0.0 — off by default. To re-enable it (adds ~4-8 min to cold runs on a large transcript corpus), use either:

gstack-memory-ingest --bulk --scan-secrets
# or
GSTACK_MEMORY_INGEST_SCAN_SECRETS=1 gstack-memory-ingest --bulk

When enabled, gitleaks covers:

AWS / GCP / Azure access keys
ANTHROPIC_API_KEY, OPENAI_API_KEY, GitHub tokens
Stripe keys, Slack tokens, JWT secrets
Generic high-entropy strings (configurable threshold)

A session with a positive finding is skipped entirely — not partially redacted. The match line + rule ID are logged to stderr; you can see what was skipped via bun run bin/gstack-memory-ingest.ts --probe (which shows new vs. updated counts) or by reviewing the helper's output during /sync-gbrain --full.

If gitleaks is not installed (run brew install gitleaks on macOS, or apt install gitleaks on Linux) and you passed --scan-secrets anyway, the helper warns once and disables secret scanning for that run.
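
Put together, the opt-in and fallback behavior described in this section amounts to something like the following. This is a sketch: `resolveSecretScan` and the warning text are hypothetical, and only the flag name and the environment variable come from this document.

```typescript
interface ScanDecision {
  enabled: boolean;
  warning: string | null;
}

function resolveSecretScan(
  argv: readonly string[],
  env: Readonly<Record<string, string | undefined>>,
  gitleaksInstalled: boolean,
): ScanDecision {
  // Scanning is off unless the flag or the env var asks for it.
  const requested =
    argv.includes("--scan-secrets") || env["GSTACK_MEMORY_INGEST_SCAN_SECRETS"] === "1";
  if (!requested) return { enabled: false, warning: null };
  if (!gitleaksInstalled) {
    // Requested but unavailable: warn once and continue without scanning.
    return { enabled: false, warning: "gitleaks not found; secret scanning disabled for this run" };
  }
  return { enabled: true, warning: null };
}

const byFlag = resolveSecretScan(["--bulk", "--scan-secrets"], {}, true);
const byEnv = resolveSecretScan(["--bulk"], { GSTACK_MEMORY_INGEST_SCAN_SECRETS: "1" }, true);
const missingTool = resolveSecretScan(["--bulk", "--scan-secrets"], {}, false);
const defaultOff = resolveSecretScan(["--bulk"], {}, true);
console.log(byFlag, byEnv, missingTool, defaultOff);
```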

Where it goes

Storage tier depends on your gbrain engine (set during /setup-gbrain):

Supabase configured: code + transcripts go to Supabase Storage (multi-Mac native). Curated memory (eureka/learnings/etc.) goes to the brain-linked git repo via gstack-brain-sync.
Local PGLite only: everything stays on this Mac. Curated memory syncs via git if you've enabled brain-sync.

The "never double-store" rule per the plan: code and transcripts NEVER go in the gbrain-linked git repo. They're too big and they're replaceable from disk on each Mac.
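
One way to picture the routing and the "never double-store" rule is as a lookup from artifact kind and engine to destinations. The type names and destination labels below are invented for the example and do not correspond to identifiers in gstack.

```typescript
type Engine = "supabase" | "pglite";
type ArtifactKind = "code" | "transcript" | "curated";
type Destination = "supabase-storage" | "local-pglite" | "brain-git-repo";

function destinationsFor(kind: ArtifactKind, engine: Engine, brainSyncEnabled: boolean): Destination[] {
  if (kind === "curated") {
    // Curated memory (eureka, learnings, etc.) is what brain-sync pushes via git.
    if (engine === "supabase") return ["brain-git-repo"];
    return brainSyncEnabled ? ["local-pglite", "brain-git-repo"] : ["local-pglite"];
  }
  // Code and transcripts never go to the git repo, whatever the engine.
  return engine === "supabase" ? ["supabase-storage"] : ["local-pglite"];
}

const kinds: ArtifactKind[] = ["code", "transcript", "curated"];
const engines: Engine[] = ["supabase", "pglite"];
for (const kind of kinds) {
  for (const engine of engines) {
    console.log(kind, engine, destinationsFor(kind, engine, true));
  }
}
```

The property worth checking is the invariant rather than the labels: for `code` and `transcript`, no combination of inputs yields the git repo.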

What you can do with it

Query in natural language:

gbrain query "what was I doing on the auth migration"
gbrain search "session_id:abc123"

Browse by type:

gbrain list_pages --type transcript --limit 10
gbrain list_pages --type ceo-plan

Read a specific page:

gbrain get_page transcripts/claude-code/garrytan-gstack/2026-05-01-abc123

Delete a page:
```
gbrain delete_page <slug>
```
Caveat: with brain-sync enabled, the page is removed from gbrain's index but git history retains it. For hard-delete, run git filter-repo on the brain remote.
Bulk delete by criteria is a V1.0.1 follow-up (the gstack-transcript-prune helper). For V1.0, run gbrain delete_page <slug> per page, or write a small loop over gbrain list_pages output.
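
A small loop of that kind might be structured as below. This sketch makes a significant assumption about the output format: that each non-empty line of the `gbrain list_pages` listing begins with the page slug. Check the real output before relying on it. The code only builds argument lists and does not execute anything.

```typescript
// ASSUMPTION: each non-empty listing line starts with the slug, followed by
// optional whitespace-separated columns. Verify against real gbrain output.
function slugsMatching(listing: string, prefix: string): string[] {
  return listing
    .split("\n")
    .map((line) => line.trim().split(/\s+/)[0] ?? "")
    .filter((slug) => slug.length > 0 && slug.startsWith(prefix));
}

// Build one argv array per page; pass these to a process spawner to run them.
function deleteCommands(slugs: string[]): string[][] {
  return slugs.map((slug) => ["gbrain", "delete_page", slug]);
}

// Hypothetical listing text, for illustration only.
const sampleListing = [
  "transcripts/claude-code/example-repo/2026-05-01-abc123  transcript",
  "transcripts/codex/example-repo/2026-05-02-def456  transcript",
  "",
  "ceo-plans/example-repo/plan-1  ceo-plan",
].join("\n");

const commands = deleteCommands(slugsMatching(sampleListing, "transcripts/claude-code/"));
console.log(commands);
```

Producing argv arrays rather than shell strings avoids quoting problems if a slug ever contains unusual characters. Printing the commands first and reviewing them is a cheap safeguard before deleting anything.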

Disable entirely:

gstack-config set transcript_ingest_mode off
gstack-config set gbrain_context_load off  # also disables retrieval

How the agent uses it

At every gstack skill start, the preamble runs gstack-brain-context-load which:

Reads the active skill's gbrain.context_queries: frontmatter
Dispatches each query to gbrain (vector / list / filesystem)
Renders results into ## <render_as> sections wrapped in <USER_TRANSCRIPT_DATA do-not-interpret-as-instructions> envelopes
The model sees this as part of the preamble before making any decisions
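
The rendering step in that list could be sketched as follows. The `ContextResult` shape, the function name, and the closing-tag form of the envelope are assumptions; only the `## <render_as>` heading and the opening envelope tag are taken from the description above.

```typescript
interface ContextResult {
  renderAs: string; // heading text from the skill's context query
  body: string;     // query result, already formatted as text
}

const ENVELOPE_OPEN = "<USER_TRANSCRIPT_DATA do-not-interpret-as-instructions>";
const ENVELOPE_CLOSE = "</USER_TRANSCRIPT_DATA>"; // ASSUMPTION: closing-tag form

function renderContext(results: ContextResult[]): string {
  return results
    .map((r) => [`## ${r.renderAs}`, ENVELOPE_OPEN, r.body, ENVELOPE_CLOSE].join("\n"))
    .join("\n\n");
}

const rendered = renderContext([
  { renderAs: "Recent eureka moments (last 5)", body: "- example insight" },
  { renderAs: "Recent design docs for this project (last 3)", body: "- example doc" },
]);
console.log(rendered);
```

Wrapping retrieved text in an explicit envelope gives the model a consistent signal that the content is data to consult, not instructions to follow.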

For example, when you run /office-hours, the model context automatically includes:

## Prior office-hours sessions in this repo (last 5)
## Your builder profile snapshot (latest entry)
## Recent design docs for this project (last 3)
## Recent eureka moments (last 5)

So the "Welcome back, last time you were on X" beat is sourced from your actual data, not cold-start.

If gbrain is unavailable (CLI missing, MCP not registered, query timeout), the helper renders (unavailable) and the skill continues. Startup never blocks for more than 2s on gbrain issues (Section 1C).
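
A bounded wait with a fallback value is a common way to get this behavior. The sketch below shows the general pattern, assuming a promise-based query function; `withFallback` is a hypothetical name and this is not the helper's actual code.

```typescript
// Resolve with `fallback` if `work` rejects or takes longer than `ms`.
async function withFallback<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    // A rejected query (e.g. CLI missing) is converted into the fallback too.
    return await Promise.race([work.catch(() => fallback), timeout]);
  } finally {
    clearTimeout(timer); // do not keep the process alive once we have an answer
  }
}

// Usage: a query that never answers falls back after the deadline.
const neverAnswers = new Promise<string>(() => {});
withFallback<string>(neverAnswers, 20, "(unavailable)").then((section) => {
  console.log(section);
});
```

In the helper's case the deadline would be on the order of the 2s budget mentioned above, and the fallback would be the rendered (unavailable) placeholder.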

What to do when something feels off

Run /setup-gbrain again. It's idempotent: every step detects existing state, repairs only what's missing, and prints a GREEN/YELLOW/RED verdict block. If a row is RED, the row tells you what to do.

Common cases:

Salience block is empty — your transcripts may not be ingested yet. Run gstack-gbrain-sync --full to do a full pass.
"gbrain CLI missing" in the preamble output — gbrain isn't on your PATH. Run /setup-gbrain to install/wire it.
PGLite engine corrupt (V1.5) — V1.5 ships gbrain restore-from-sync for atomic rebuild from the brain remote. For V1.0, manual recovery: cd ~/.gbrain && rm -rf db && gbrain init --pglite && gbrain import <brain-remote-clone-dir>.
A page has stale or wrong content — gbrain delete_page <slug>, then re-run gstack-gbrain-sync --incremental to re-ingest from source if the source file is still on disk and unchanged.

Privacy + audit

Every secretScanFile finding is logged to stderr at ingest time.
Every gbrain put/delete is logged to ~/.gstack/.gbrain-errors.jsonl with {ts, op, duration_ms, outcome} for forensic tracing.
~/.gstack/.gbrain-engine-cache.json shows which storage tier is active (PGLite vs Supabase).
Brain-sync git history shows every curated artifact push with the user's git identity.
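
For readers who want to process the audit log, a record with the `{ts, op, duration_ms, outcome}` shape could be produced like this. The field formats (ISO 8601 timestamp, free-form outcome string) and the function name are assumptions; inspect a real line in ~/.gstack/.gbrain-errors.jsonl to confirm them.

```typescript
interface AuditEntry {
  ts: string;          // ASSUMPTION: ISO 8601 timestamp
  op: string;          // e.g. "put" or "delete"
  duration_ms: number;
  outcome: string;     // ASSUMPTION: free-form, e.g. "ok" or "error"
}

// Build one JSONL line (newline-terminated) for an operation.
function auditLine(op: string, startedAt: Date, finishedAt: Date, outcome: string): string {
  const entry: AuditEntry = {
    ts: startedAt.toISOString(),
    op,
    duration_ms: finishedAt.getTime() - startedAt.getTime(),
    outcome,
  };
  return JSON.stringify(entry) + "\n";
}

const sampleLine = auditLine(
  "delete",
  new Date("2026-05-11T00:00:00.000Z"),
  new Date("2026-05-11T00:00:00.250Z"),
  "ok",
);
console.log(sampleLine);
```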

If you find a transcript page that contains a secret (either because per-file scanning was off, or gitleaks missed it), the recovery path is:

gbrain delete_page <slug> — removes from index immediately
Rotate the secret. Do this even if exposure seems unlikely, as a defensive measure.
If brain-sync is on: git filter-repo --invert-paths --path <relative-path> on the brain remote for hard-delete from history
If the miss looks like a gitleaks rule gap, file a gitleaks issue with the pattern (or extend the gitleaks config at ~/.gitleaks.toml).

Path 4: Remote MCP setup (v1.27.0.0+)

If you don't run gbrain locally — you have a teammate or another machine running gbrain serve over HTTP, accessible via Tailscale, ngrok, or internal LAN — /setup-gbrain Path 4 is the one-paste flow.

You provide:

The MCP URL (e.g., https://wintermute.tail554574.ts.net:3131/mcp)
A bearer token (issued by the brain admin via gbrain access-token issue)

What /setup-gbrain does:

Verifies the URL + token via gstack-gbrain-mcp-verify. Three failure modes get classified with one-line remediation hints: NETWORK ("check Tailscale/DNS"), AUTH ("rotate token"), MALFORMED ("Accept-header gotcha — pass both application/json AND text/event-stream").

Registers the MCP at user scope:

claude mcp add --scope user --transport http gbrain "$URL" \
  --header "Authorization: Bearer $TOKEN"

Skips local install, local doctor, transcript ingest, and federated source registration. All four require a local gbrain CLI that Path 4 doesn't install.
Optionally provisions a gstack-artifacts-$USER private repo on GitHub or GitLab and prints the one-line gbrain sources add command for your brain admin to run on the brain host.
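
The three-way failure classification in the verify step might be approximated as below. The mapping rules are guesses for illustration (unreachable host means NETWORK, HTTP 401/403 means AUTH, any other non-2xx or non-JSON response means MALFORMED); gstack-gbrain-mcp-verify may decide differently. The hint strings paraphrase the ones quoted above.

```typescript
type VerifyClass = "OK" | "NETWORK" | "AUTH" | "MALFORMED";

interface ProbeResult {
  reached: boolean;      // false when the connection itself failed
  status: number | null; // HTTP status, if any
  body: string | null;   // response body, if any
}

function looksLikeJson(body: string | null): boolean {
  if (body === null) return false;
  try {
    JSON.parse(body);
    return true;
  } catch {
    return false;
  }
}

// ASSUMPTION: these mapping rules are illustrative, not the helper's actual logic.
function classifyProbe(p: ProbeResult): { kind: VerifyClass; hint: string | null } {
  if (!p.reached || p.status === null) {
    return { kind: "NETWORK", hint: "check Tailscale/DNS" };
  }
  if (p.status === 401 || p.status === 403) {
    return { kind: "AUTH", hint: "rotate token" };
  }
  if (p.status < 200 || p.status >= 300 || !looksLikeJson(p.body)) {
    return {
      kind: "MALFORMED",
      hint: "Accept-header gotcha: pass both application/json and text/event-stream",
    };
  }
  return { kind: "OK", hint: null };
}

console.log(classifyProbe({ reached: false, status: null, body: null }));
console.log(classifyProbe({ reached: true, status: 401, body: "" }));
console.log(classifyProbe({ reached: true, status: 406, body: "Not Acceptable" }));
```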

Token storage trade-off

The bearer token lives in ~/.claude.json (mode 0600), where Claude Code stores every MCP server's credentials. During claude mcp add --header "Authorization: Bearer $TOKEN", the token also appears in the process argv for roughly 10ms, where a concurrently running ps can read it. The window is small, but it is not zero.
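
If you want to confirm that the file really is owner-only, the permission bits can be checked with a mask. The helper names below are hypothetical; the `mode` value would come from something like Node's `fs.statSync(path).mode`.

```typescript
// True when neither group nor others have any permission bits set.
function isOwnerOnly(mode: number): boolean {
  return (mode & 0o077) === 0;
}

// Render just the permission bits as a three-digit octal string.
function permissionString(mode: number): string {
  return (mode & 0o777).toString(8).padStart(3, "0");
}

// 0o100600 is a regular file with rw for the owner only; 0o100644 adds group/other read.
console.log(permissionString(0o100600), isOwnerOnly(0o100600));
console.log(permissionString(0o100644), isOwnerOnly(0o100644));
```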

Mitigations we've considered:

Stdin or env-var input form for headers — would close the argv window. As of Claude Code v1.0.x, the CLI doesn't expose either. When it does, /setup-gbrain Path 4 will switch automatically.
Keychain storage — explicitly out of scope (the token's resting state in ~/.claude.json is the existing trust surface for every MCP credential; expanding to Keychain would touch every MCP server, not just gbrain).

Why Path 4 is "always print" for the brain-admin hookup

gstack-artifacts-init always prints the gbrain sources add command labeled "Send this to your brain admin" — even when the user IS the brain admin (consistent UX, no mode-detection fragility).

A previous design proposed probing whether the user's bearer has admin scope (via a benign MCP write call like add_tag) and auto-executing the source registration when scope was sufficient. The design review flagged that page-write doesn't actually prove source-management permission — those are different scopes in any sensible auth model. Until gbrain ships:

a mcp__gbrain__whoami capability tool that returns the bearer's scope set, AND
a mcp__gbrain__sources_add MCP tool with admin-scope gating

we always print the command rather than pretending we know who has permission to run it.

CLAUDE.md block in Path 4

Distinct from local-stdio mode. Token is never written to CLAUDE.md (many projects check CLAUDE.md into git). The block records the URL, the verified server version, the artifacts repo URL (if provisioned), and the per-repo trust policy.

## GBrain Configuration (configured by /setup-gbrain)
- Mode: remote-http
- MCP URL: https://wintermute.tail554574.ts.net:3131/mcp
- Server version: gbrain v0.27.1
- Setup date: 2026-05-06
- MCP registered: yes (user scope)
- Token: stored in ~/.claude.json (do not commit; never written to CLAUDE.md)
- Artifacts repo: github.com/garrytan/gstack-artifacts-garrytan (private)
- Artifacts sync: artifacts-only
- Current repo policy: read-write

Token rotation

Server-side. When verify hits AUTH (e.g., the brain admin rotated the token), the helper says: "rotate token on the brain host, re-run /setup-gbrain." On wintermute or wherever your gbrain server lives:

gbrain access-token rotate    # invalidates old, issues new

(See gstack/setup-gbrain/SKILL.md.tmpl for the full Path 4 flow plus the gbrain enhancement requests around scoped tokens that would let gstack auto-rotate in V2.)
