docs(claude): add handoff notes for explore-overhaul arc and related sessions

Captures three development session checkpoints: per-symbol adaptive sizing (PR #569), trace relevance + cold-start fix (PR #580), and the explore-overhaul arc (explore as sole primary + Zustand store coverage + budget/render tuning). Documents root causes found, gotchas (stale-daemon foot-gun, Mac sleep corrupting benchmarks), validation methodology, and open threads for each session.
Colby McHenry 2 weeks ago
parent
commit
0bc13231f7

+ 80 - 0
.claude/handoffs/explore-overhaul-2026-06-01.md

@@ -0,0 +1,80 @@
+---
+name: explore-overhaul-2026-06-01
+date: 2026-06-01 19:50
+project: codegraph
+branch: main
+summary: Made codegraph_explore the sole primary tool (removed context + trace), added graph-connectivity ranking + 100K budget + full method bodies — then an agent-eval revealed the budget BACKFIRES and the real lever is COVERAGE (Zustand store methods aren't indexed).
+---
+
+# Handoff: codegraph_explore overhaul — explore as the one tool, and the coverage pivot
+
+## Resume here — read this first
+**Current state:** Big uncommitted working tree on `main`. `codegraph_context` and `codegraph_trace` tools are fully removed; `codegraph_explore` is the sole primary, now with graph-connectivity (RWR) ranking, a flat **100K** output budget, full method bodies, whole-central-file, and an always-on blast-radius section. A fresh-daemon agent-eval on the real repo (`~/Downloads/amniservices-mobile-app`) just proved two things: (1) the **100K budget BACKFIRES** — a broad explore hit **67K chars and overflowed the agent's per-tool token cap**, forcing it to Read; (2) the **real cause of the agent's reads is a COVERAGE gap**, not ranking/budget — Zustand store methods (`fetchUser`/`switchOrganization` inside `create((set,get)=>({...}))`) aren't indexed as nodes, and callers **destructure** them (`const {fetchUser}=useOrgUser.getState()`), so `codegraph_node`/`codegraph_callers` return "not found."
+**Immediate next step:** Revert the 100K budget (it overflows) to ~28–35K, then build the Zustand coverage fix (extract store-literal methods as nodes + resolve destructured `getState()` calls). That's what actually deletes the reads.
+
+> Suggested next message: "Revert the explore budget in getExploreOutputBudget (tools.ts) from 100K back to ~30K — the 67K response overflowed the agent's tool cap. Then build the Zustand coverage fix: extract methods inside `create((set,get)=>({...}))` as nodes, and resolve destructured store calls like `const {fetchUser}=useOrgUser.getState()`. Then kill the AmniSphere daemon and re-run the agent eval."
+
+## Goal
+Make `codegraph_explore` good enough to be a **Read-replacement** — one (maybe two) calls answer a structural/flow question with ~0 Read/Grep, for smart AND dumb models. Metric is wall-clock + tool-call count + Read count (NOT token cost). The user's golden era: one tool (`explore`), reflexively used, zero Reads.
+
+## Key findings
+- **The agent's reads are a COVERAGE gap, not ranking/budget.** Agent's own words (diagnostic eval): Zustand store actions inside the `create((set,get)=>({...}))` literal "aren't individually indexed," so `codegraph_node fetchUser` / `codegraph_callers fetchUser` → **"not found"**; callers **destructure** off `useOrgUser.getState()` so even grep needed `\bfetchUser\b`. Component-body control flow (`handleLogin`, `AppInitializer` in `src/app/index.tsx`, `src/components/providers/index.tsx`) isn't a node either.
+- **The 100K budget backfires.** A broad explore returned ~67K chars and "overflowed the token cap" → agent Read instead. Big responses are *worse*. `getExploreOutputBudget` (tools.ts ~line 140) is now a flat 100K — revert toward ~28–35K (size to the agent's per-tool output limit).
+- **Adoption is EXCELLENT — the agent WANTS codegraph.** In the fresh eval it made **16 codegraph calls** vs 5 Reads. So the problem is never "agent won't use it"; it's "the symbols aren't in the graph."
+- **Graph-connectivity ranking works in isolation but didn't address the real cause.** `computeGraphRelevance` (tools.ts, before `handleExplore`) is RWR/personalized-PageRank from the matched seeds; probe shows it ranks `org-user.storage.ts` #1 and returns it whole. But it doesn't cleanly drop noise (LensSwitcher.swift matched "switch") because real codebases share infra + generic terms — **neither graph nor text alone separates; needs IDF×graph fusion**, a tuning long tail. Park it until coverage is fixed.
+- **`context` + `trace` tools fully removed** (def + dispatch + handlers + CLI `context` command + permissions + server-instructions + tests). The shared engine `findRelevantContext` stays (explore runs on it). `synthEdgeNote` kept (shared); `handleTrace`/`sourceLineAt`/`sourceRangeAt`/`maybeInlineFlowTrace`/`handleContext`/`looksLikeFeatureRequest`/`formatTaskContext` deleted.
+- **Read-gate PreToolUse hook was built then REMOVED** (user: "ideally zero hooks"). Deleted `src/hooks/`, `src/mcp/session-consult.ts`, the `mcp-read-gate` CLI cmd, installer wiring (`InstallOptions.readGate`, claude.ts helpers), and the marker security tests. Had an unverified `CLAUDE_SESSION_ID`==hook-`session_id` assumption.
+- **Precision fix landed earlier (keeper):** `isDistinctiveIdentifier` (query-utils.ts) gates the exact-name bonus in `findRelevantContext` Step 5a so a common word ("flat") can't hijack ranking (was surfacing a python `FLAT` constant). Lives in the shared engine → benefits explore.
+- **Blast-radius section added to explore** (`buildBlastRadiusSection`, tools.ts): per entry symbol, who-depends-on-it + covering test files, locations only. Always-on, compact. (2 tests in `__tests__/explore-blast-radius.test.ts`.)
+
+## Gotchas
+- **STALE-DAEMON FOOT-GUN (cost us hours).** `codegraph serve --mcp` connects to a per-repo daemon (`<repo>/.codegraph/daemon.sock`, 5-min idle timeout) that holds the loaded code. **A `npm run build` does NOT take effect until you kill the daemon.** Every agent-eval before the kill was testing STALE code (agent got 2277 chars where a fresh in-process probe got 54K). **Before ANY agent eval:** `pkill -f "serve --mcp"; rm -f <repo>/.codegraph/daemon.sock`. Worth fixing in the product (a rebuild should invalidate the daemon).
+- **probe ≠ agent.** `probe-explore.mjs` loads `dist/` in-process (always current code); the agent uses the daemon (can be stale). Don't trust a probe result as "what the agent sees" unless the daemon was just killed.
+- **Validating with a favorable query lies.** My probe query (`"org user storage…"`) returned the whole central file; the agent's near-identical query behaved totally differently. Use the agent's EXACT query, on a fresh daemon.
+- **n=1 variance is large** — never conclude from one agent run (CLAUDE.md). The "4 vs 5 reads" between runs is noise.
+- **Budget-table repos (excalidraw/django/etc.) NOT validated** — they're not on this machine. The ranking/budget changes could regress them; the CLAUDE.md "do-not-regress explore budget" table is now obsolete (flat 100K) and needs reconciling.
+- All work is **uncommitted on `main`** — branch before committing (PR policy: main is REVIEW_REQUIRED).
+
+## How to test & validate
+- Build: `npm run build` (must exit 0).
+- Cheap probe (current code, NOT what a stale daemon serves): `node scripts/agent-eval/probe-explore.mjs /Users/colby/Downloads/amniservices-mobile-app "<query>"`.
+- Agent A/B (real metric, ~$2, KILL DAEMON FIRST): `pkill -f "serve --mcp"; rm -f /Users/colby/Downloads/amniservices-mobile-app/.codegraph/daemon.sock; CG_BIN=$(pwd)/dist/bin/codegraph.js AGENT_EVAL_OUT=/tmp/agent-eval-amni bash scripts/agent-eval/run-agent.sh /Users/colby/Downloads/amniservices-mobile-app <label> "<prompt>"` → parse `/tmp/agent-eval-amni/run-<label>.jsonl` for tool order + Read count.
+- Diagnostic prompt that worked: append "for EACH Read/Grep note WHY codegraph wasn't enough; end with '## Why I read'." The agent's self-report is the best diagnostic.
+- Affected unit tests (NOT the full suite — user is cost-conscious): `npx vitest run __tests__/{context-ranking,explore-blast-radius,context,mcp-tool-allowlist,security,worktree-detection,installer-targets}.test.ts __tests__/integration/mcp-input-limits.test.ts`.
+- Pass bar: a flow question reaches ~0 Read within the explore-call budget, faster than without-codegraph, no regression on a control repo.
+
+## Repo state
+- branch `main`, last commit `8629f7a docs(changelog): promote [Unreleased] into [0.9.8]`
+- uncommitted (all this session, none committed): `M src/mcp/tools.ts` (the big one — explore ranking/RWR/budget, context+trace removal, blast radius), `M src/context/index.ts` (precision fix), `?? src/context/markers.ts` (LOW_CONFIDENCE_MARKER leaf), `M src/search/query-utils.ts` (isDistinctiveIdentifier), `M src/mcp/server-instructions.ts`, `M src/installer/targets/shared.ts` (permissions), `M src/bin/codegraph.ts` (CLI context/trace removed), `M src/types.ts`, `M CHANGELOG.md`, `?? __tests__/context-ranking.test.ts`, `?? __tests__/explore-blast-radius.test.ts`, `M __tests__/{security,worktree-detection,mcp-tool-allowlist}.test.ts`, `M __tests__/integration/mcp-input-limits.test.ts`. (read-gate hook + session-consult.ts were created then deleted → no trace.)
+
+## Open threads / TODO
+- [ ] **Revert the 100K budget** in `getExploreOutputBudget` (tools.ts ~140) to ~28–35K — it overflows the agent tool cap at 67K.
+- [ ] **Reconsider "whole central file always"** (tools.ts whole-file rule + `centralFiles`) — a 791-line whole file is what overflowed. Prefer the relevant methods full, not whole-huge-file.
+- [ ] **Build the Zustand coverage fix (the real lever):** (a) extract methods defined inside `create((set,get)=>({...}))` as nodes (extraction); (b) resolve destructured store calls `const {fetchUser}=useOrgUser.getState()` (reference resolution). Then re-eval.
+- [ ] Decide: keep `computeGraphRelevance` (graph ranking) or park it — it didn't address the coverage cause. Probe-validated, not agent-validated.
+- [ ] Re-run the agent eval (FRESH DAEMON) after the coverage fix; confirm reads of `org-user.storage.ts`/components drop.
+- [ ] Reconcile CHANGELOG `[Unreleased]` with the final direction (explore primary; context+trace removed; budget/ranking may change).
+- [ ] Validate budget-table repos (excalidraw/django) before merging — not on this machine.
+- [ ] Run the FULL test suite once (only ran affected files this session).
+- [ ] Branch off `main` before committing this work.
+
+## Recent transcript (oldest → newest)
+### Turn — "do it" (build graph ranking)
+- **User:** Build graph-connectivity ranking (RWR/PageRank fused with bm25) as explore's relevance core; keep the output wins; fix the whole-file path.
+- **Claude:** `Edit(tools.ts)` → added `computeGraphRelevance` (RWR, undirected, α=0.25, 25 iters); rewired file ranking/central/gate to graph-primary + text-secondary. `probe` → `org-user.storage.ts` #1 + whole (796 lines), but LensSwitcher.swift/capture noise still present.
+- **Outcome:** RWR surfaces the central file but doesn't cleanly drop noise; reported the honest limit (needs IDF×graph fusion).
+
+### Turn — "how does amniservices do now?"
+- **User:** Run the agent eval on the real repo with everything in place.
+- **Claude:** `run-agent.sh (amni-with-q2)` → codegraph 5 / Read 5 / Grep 3; agent Read `org-user.storage.ts` anyway. `parse` → agent's first explore returned only **2277 chars** while `probe` returned 54K for the same query.
+- **Outcome:** 2277-vs-54K discrepancy → suspected the agent wasn't running my build.
+
+### Turn — "do the test again, ask the agent WHY it read"
+- **User:** Re-run, but make the agent explain each Read / why codegraph wasn't enough.
+- **Claude:** `ps` → found stale daemon `pid 8947` on AmniSphere socket + pile of lingering `serve --mcp`. `pkill -f "serve --mcp"; rm daemon.sock` → fresh. `run-agent.sh (amni-diag-q3)` with the "## Why I read" prompt, fresh daemon.
+- **Outcome:** First valid run on my build. codegraph **16** / Read 5 / Grep 3. Agent's "Why I read": Zustand store methods not indexed (`node`/`callers` → not found), callers destructure off `getState()`, broad explore **overflowed at 67K**. → real cause = COVERAGE; budget BACKFIRES. Pivot the priorities.
+
+### Turn — /handoff save (this)
+- **User:** `/handoff save`.
+- **Claude:** Wrote this file.
+- **Outcome:** Checkpoint captured; next edge = budget revert + Zustand coverage fix.
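The coverage gap this handoff pivots to can be illustrated with a small TypeScript sketch. It is not codegraph's extractor. `topLevelDeclarations` is a hypothetical regex stand-in for an indexer that records only top-level declarations, and both source snippets are invented to match the shape the notes describe: a `create((set,get)=>({...}))` store literal plus a destructured `getState()` call. `loadUser` is made up as well.

```typescript
// Invented source text shaped like the store in the notes. Only fetchUser,
// switchOrganization, useOrgUser and handleLogin are names taken from the handoff.
const storeSource = `
export const useOrgUser = create((set, get) => ({
  user: null,
  fetchUser: async () => { set({ user: await loadUser() }); },
  switchOrganization: async (id) => { await get().fetchUser(); },
}));
`;

const callerSource = `
export async function handleLogin() {
  const { fetchUser } = useOrgUser.getState();
  await fetchUser();
}
`;

// Hypothetical stand-in for "index top-level declarations only".
// This is an assumption for illustration, NOT how codegraph extracts nodes.
function topLevelDeclarations(source: string): string[] {
  const pattern =
    /^(?:export\s+)?(?:async\s+)?(?:function|const|let|var|class)\s+([A-Za-z_$][\w$]*)/gm;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    names.push(match[1]);
  }
  return names;
}

// Word-boundary text search, the fallback the agent reported needing.
// Assumes `name` is a plain identifier with no regex metacharacters.
function wordBoundaryHits(name: string, source: string): number {
  const hits = source.match(new RegExp("\\b" + name + "\\b", "g"));
  return hits === null ? 0 : hits.length;
}

console.log(topLevelDeclarations(storeSource));
console.log(topLevelDeclarations(callerSource));
console.log(wordBoundaryHits("fetchUser", storeSource + callerSource));
```

Under these assumptions the declaration scan sees only `useOrgUser` and `handleLogin`, so a name lookup for `fetchUser` has nothing to return, while the text search still finds the definition and the destructured call site. That is the gap the planned extraction and reference-resolution work would have to close.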

+ 73 - 0
.claude/handoffs/explore-overhaul-bench-2026-06-02.md

@@ -0,0 +1,73 @@
+---
+name: explore-overhaul-bench-2026-06-02
+date: 2026-06-02 06:30
+project: codegraph
+branch: feat/explore-overhaul-store-coverage
+summary: Finished the explore-overhaul arc (explore as sole primary + store coverage + overload disambiguation + method-atomic render + node file/line selector + explore reshaped to native-read windows) and validated it — all 7 README repos hit 0 Read/0 Grep at effort=high; only the README benchmark write-up remains.
+---
+
+# Handoff: explore-overhaul arc — validated 0-reads across all 7 README repos; README write-up is the last step
+
+## Resume here — read this first
+**Current state:** All code is committed + pushed on `feat/explore-overhaul-store-coverage` (4 commits, working tree clean). The why-Read agent sweep is DONE: **all 7 README repos × 4 runs = 28/28 runs hit 0 Read / 0 Grep on `--effort high`**, every run "codegraph was sufficient." WITH-`high` medians are captured (~59% fewer tool calls · 51% fewer tokens · ~15% cheaper · 0 reads vs the existing README WITHOUT) — the earlier cost REGRESSION (-3%) is recovered. The only open item is **updating the README benchmark section**, which is blocked on one methodology decision.
+**Immediate next step:** Decide how to publish: (A) do a CLEAN both-arms run on `effort=high` with the PLAIN prompt (no why-Read) for an apples-to-apples table, or (B) write the WITH-`high` deltas in against the existing WITHOUT with a cross-effort caveat. Then edit `README.md` (benchmark table + per-repo breakdowns + average line + methodology date) and open the PR.
+
+> Suggested next message: "Do the clean both-arms run on effort=high with the plain prompt for all 7 repos, then update the README benchmark table + per-repo breakdowns from those medians and open the PR."
+
+## Goal
+Make `codegraph_explore` a true Read-replacement — flow/architecture questions answered with ~0 Read/Grep — then re-validate the README benchmark on the current build and update its numbers. Definition of done: README benchmark reflects the current build with defensible (same-effort) numbers; branch merged via PR.
+
+## Key findings
+- **The arc (all shipped on the branch):** explore is the SOLE primary tool (`codegraph_context` + `codegraph_trace` removed in the prior session, this branch); store-action **coverage** (object-literal method extraction — a GENERAL AST rule in `tree-sitter.ts` `extractVariable`/`extractObjectLiteralFunctions`/`findInitializerReturnedObject`, covers Zustand/Redux/Pinia, not a per-lib hack); graph-ranking **gate fix** (a named/≥2-term file is never pruned); **`node` all-overloads + `file`/`line` selector**; **method-atomic render** (never half a method — drop whole methods/files); **explore reshape** to native-read windows.
+- **Native-read ground truth (from the WITHOUT transcripts):** the agent natively reads **~6–9 files as ~100-line windows** (77% ranged, median 100 lines, 51–250 dominant), located by `func X(` signature greps. That's the unit explore now mimics.
+- **Explore reshape (commit 50401a6, the latest mechanism):** `getExploreOutputBudget` caps EVERY tier at **~24K** (was 28/35/38K) + absolute **25K** hard ceiling (was 1.5×-of-budget) — because a bigger response gets **externalized** by the host to a file the agent Reads back (a 35K vscode explore did exactly that) AND costs cache-writes. Repo size scales the CALL budget, not the response. Per-file = one ~150–250 line window: per-symbol `bodyCap` 2×→1.5× and the spine is windowed too (so tokio's big-spine `worker.rs` doesn't starve `harness.rs`'s `poll`); central whole-file 4×→1.5× / 400→280 lines. Explore's named-symbol injection now uses **`cg.getNodesByName`** (direct index, not FTS) so a 50+-overload name (`poll`) surfaces the wanted def (`Harness::poll`) for the PascalCase-type-token bias to pick.
+- **`node` file/line selector (commit 5bf6ad8):** `codegraph_node` takes optional `file`/`line` to pin an overload (the `file:line` a trail showed). `findSymbolMatches` (replaced `findSymbol`) enumerates ALL overloads via `cg.getNodesByName` (new passthrough `index.ts` → QueryBuilder), then file/line filters. The agent USES it in runs (`run file:worker.rs line:508`, `poll file:harness.rs`).
+- **Cost regression was REAL, now recovered.** The pre-reshape n=4 benchmark (on `max` effort, bloated 35-42K explores) was **−3% cost avg** (vscode −52%) and reads were **NOT 0** (vscode 6,4,0,7; tokio 3,4,2,2) — which corrected my earlier n=2 "0 reads everywhere" optimism. The reshape (≤25K, no externalization) + 0 reads flipped cost back to **~15% cheaper**.
+
+## Gotchas
+- **STALE-DAEMON foot-gun:** before ANY agent eval, `pkill -f "serve --mcp"; rm -f <repo>/.codegraph/daemon.sock` so it serves the current `dist/`. `bench-why-repo.sh` does this per-run. A `npm run build` does NOT take effect until the daemon is killed.
+- **Mac SLEEP corrupts long runs:** the first overnight re-bench (5h on `max`) was sleep-corrupted — the Mac napped 16–42 min BETWEEN runs (~3h of the 5h was paused), inflating wall-clock for the later repos. **Always wrap long runs in `caffeinate -dimsu`.** Cost/tokens/reads are sleep-INDEPENDENT (billed API totals), so the cost regression was real (confirmed on vscode which ran fully awake before any sleep); only TIME is corrupted.
+- **`--effort` matters:** the user's Claude default is `max`, which is "too much." The eval is pinned to `--effort high` (levels: low/medium/high/xhigh/max). `bench-why-repo.sh` honors `EFFORT` (default `high`). The MAX-mode runs were discarded and redone on `high`.
+- **why-Read prompt biases reads down (Hawthorne) + adds <0.3% to WITH cost/tokens.** So the 28/28 0-read sweep proves codegraph is *sufficient* (it CAN answer with 0 reads); it slightly understates a natural run's reads. Keep it OUT of any published benchmark numbers (use plain prompt for the table).
+- **README methodology mismatch:** WITH numbers are `effort=high` + why-Read; the existing README WITHOUT is the user's OLD default effort + plain. Cross-effort → can't publish cleanly without same-effort both arms. The user does NOT want to re-run WITHOUT repeatedly, but the effort CHANGED, so a one-time WITHOUT-on-high is a new (justified) measurement.
+- **PR policy:** `main` is REVIEW_REQUIRED — work on the branch, open a PR, `gh pr merge --squash --admin` for self-review. Branch + push only so far; **PR not opened** (user asked branch+push).
+
+## How to test & validate
+- Build: `npm run build` (exit 0). Full suite: `npx vitest run` → **1112 pass, 2 skip, 0 fail** (npm-shim network tests can flake offline — pre-existing).
+- Affected tests: `npx vitest run __tests__/{explore-output-budget,adaptive-explore-sizing,context-ranking,explore-blast-radius,symbol-lookup,pr19-improvements,object-literal-methods}.test.ts`.
+- Deterministic probe (current `dist/`, in-process — NOT the daemon): `node scripts/agent-eval/probe-explore.mjs /tmp/codegraph-corpus/<repo> "<query>"` → confirm ≤~25K chars + the flow files render. `node scripts/agent-eval/probe-node.mjs <repo> <symbol> code` (e.g. `poll file:harness.rs` via a small script).
+- Agent why-Read sweep (the real metric): `EFFORT=high caffeinate -dimsu bash scripts/agent-eval/bench-why-repo.sh /tmp/codegraph-corpus/<repo> "<readme query>" 4` → parse `/tmp/ab-why/<repo>/with*.jsonl` for `Read`/`Grep` tool_use + the trailing `## Why I read` section.
+- All 7 repos are cloned + indexed on the current build at `/tmp/codegraph-corpus/{vscode,excalidraw,django,tokio,okhttp,gin,alamofire}`. README queries are in `scripts/agent-eval/bench-readme.sh`.
+- **Pass bar:** flow question → ~0 Read at the explore-call budget, faster than WITHOUT, no control regression.
+
+## Repo state
+- branch `feat/explore-overhaul-store-coverage`, last commit `9cf671a chore(agent-eval): add per-repo WITH-only why-Read benchmark harness`. Pushed, in sync with origin.
+- 4 commits: `22333c1` (explore-primary + store coverage + overload disambiguation + docs) · `5bf6ad8` (method-atomic render + node file/line selector) · `50401a6` (explore reshape: inline-cap + concentrated flow-windows + direct-index injection) · `9cf671a` (why-Read eval harness).
+- uncommitted: clean (only `.claude/handoffs/` scratch, intentionally not committed).
+- CHANGELOG `[Unreleased]` already updated for the whole arc. README + CLAUDE.md + docs/design banners updated to drop removed context/trace.
+
+## Open threads / TODO
+- [ ] **README benchmark update** — decide (A) clean both-arms `effort=high` plain run vs (B) publish WITH-`high` deltas with caveat; then edit `README.md` (table + per-repo WITH/WITHOUT breakdowns + average + methodology date). WITH-`high` medians: vscode 4cg/742k/$0.70, excalidraw 6.5/834k/$0.74, django 3/509k/$0.59, tokio 6/921k/$0.77, okhttp 4.5/528k/$0.41, gin 4.5/573k/$0.48, alamofire 7.5/976k/$0.65 — all 0 reads.
+- [ ] **Open the PR** (`gh pr create`) once the README is updated.
+- [ ] Optional: delete dead dev probes `scripts/agent-eval/probe-{trace,context}.mjs` (their tools were removed; noted, not done).
+- [ ] Memory `project_codegraph_read_displacement` is updated through the reshape; add the 28/28-on-high + cost-recovery result if you want it durable.
+
+## Recent transcript (oldest → newest)
+### Turn — "re-run the full benchmark to update README; ask why on reads"
+- **Claude:** cloned+indexed all 7 repos; launched `RUNS=4 bench-readme.sh` (both arms). → ran overnight.
+### Turn — "what's still running? 5h overnight"
+- **Claude:** bench was on the LAST repo (alamofire), ~90%. n=4 aggregate showed **cost −3% (regression)** + reads **NOT 0** (vscode 6-7, tokio 2-4) — corrected my n=2 "0 everywhere." Diagnosed reads from transcripts: explore was 35-42K, too spread, dropped named flow methods.
+### Turn — "is it the lock screen?"
+- **Claude:** Yes for TIME (Mac slept 16-42min between runs → 5h mostly paused) but NOT cost (vscode ran awake; cost is billed API). → recommend `caffeinate`.
+### Turn — "how do I stop Mac sleeping" → answered (Battery→Options / `caffeinate`).
+### Turn — "do the queries WITHOUT codegraph, see exactly what it Reads, mimic it"
+- **Claude:** characterized native reads (77% ranged, median 100 lines, signature greps) → spec: ~100-line flow-windows, ~6-9 files, ≤25K.
+### Turn — "do it" (the reshape)
+- **Claude:** `Edit(tools.ts)` → inline-cap 24K + concentrated windows + `getNodesByName` injection; updated budget tests; full suite green; probe vscode 35K→16K, tokio fits harness.rs. Committed `50401a6`, pushed.
+### Turn — "why AST not LSP" (×2) + "reply to hd_nvim (ast-grep author)"
+- **Claude:** explained the trade-off (breadth/speed/no-build/determinism/19-langs vs LSP type-precision-but-heavy); drafted the X reply (concede precision, reframe on goal/cost).
+### Turn — "re-run, ask Why on each WITH read, report per repo" → "no WITHOUT" → "MAX too much, use high, start over"
+- **Claude:** built `bench-why-repo.sh` (WITH-only, why-Read, `EFFORT` default high); ran all 7 on `high` one repo at a time, reporting each. **Result: 28/28 runs 0R/0G, all "sufficient."** Aggregated WITH-`high` medians → ~15% cheaper (regression recovered).
+### Turn — "commit and push this"
+- **Claude:** committed `bench-why-repo.sh` (`9cf671a`), pushed. → this handoff.
+- **Outcome:** Arc complete + validated; README write-up + PR are all that remain.
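The method-atomic render rule from this handoff (never emit half a method, drop whole methods instead) can be sketched as a budgeted selection. This is a minimal illustration under stated assumptions: `MethodBlock`, `renderMethodAtomic`, the `importance` score, and the importance-first greedy policy are all hypothetical. The notes only state that whole methods are dropped and that the response is capped at roughly 24K characters.

```typescript
interface MethodBlock {
  name: string;
  importance: number; // hypothetical relevance score, higher is better
  text: string;       // full method source, never truncated
}

interface RenderResult {
  rendered: string;   // kept methods, in original source order
  dropped: string[];  // names of methods omitted entirely
}

// Assumed policy: visit methods from most to least important and keep each one
// only if its WHOLE text still fits. A method that does not fit is skipped
// outright, and smaller lower-ranked methods may still be kept after it.
function renderMethodAtomic(blocks: MethodBlock[], budgetChars: number): RenderResult {
  const order = blocks
    .map((block, index) => ({ block, index }))
    .sort((a, b) => b.block.importance - a.block.importance || a.index - b.index);

  const keep: boolean[] = blocks.map(() => false);
  let used = 0;
  for (const { block, index } of order) {
    const cost = block.text.length + 1; // +1 accounts for the joining newline
    if (used + cost <= budgetChars) {
      keep[index] = true;
      used += cost;
    }
  }

  const rendered = blocks
    .filter((_, i) => keep[i])
    .map((b) => b.text)
    .join("\n");
  const dropped = blocks.filter((_, i) => !keep[i]).map((b) => b.name);
  return { rendered, dropped };
}

// Budget value taken from the notes (~24K characters per response).
const EXPLORE_BUDGET_CHARS = 24000;
```

Calling `renderMethodAtomic(blocks, EXPLORE_BUDGET_CHARS)` keeps the output within the cap by construction, because every kept method is charged for its full length before it is accepted.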

+ 70 - 0
.claude/handoffs/explore-per-symbol-sizing.md

@@ -0,0 +1,70 @@
+---
+name: explore-per-symbol-sizing
+date: 2026-05-29 23:20
+project: codegraph
+branch: main
+summary: Shipped per-symbol adaptive codegraph_explore sizing (PR #569) — show the answer (named methods + mechanism) in full, collapse redundant interchangeable siblings to signatures, keep named methods alive in non-sibling god-files; flipped Django/OkHttp from cost laggards to clear wins and lifted the README averages to 25%/57%/23%/62%.
+---
+
+# Handoff: per-symbol adaptive codegraph_explore sizing (shipped)
+
+## Resume here — read this first
+**Current state:** **DONE + shipped.** PR #569 squash-merged to `main` (`b026e64`); local is on `main`, `dist/` rebuilt, working tree clean. README benchmarks + averages + header, CHANGELOG, and `docs/design/adaptive-explore-sizing.md` all updated with the new full-7-repo sweep. The only loose end: **two squash-merged feature branches still linger** (`feat/adaptive-explore-sizing` from #564, `feat/explore-per-symbol-sizing` from #569) — local **and** remote — because squash-merges don't register as "merged" in git's ancestor sense.
+**Immediate next step:** Delete those two merged branches (local + remote), or pick up one of the Open-threads frontiers (Gin's small WITH-cost bump, alamofire DataRequest residual, or stabilizing per-repo benchmark numbers with median-of-8).
+
+> Suggested next message: "Delete the merged branches feat/adaptive-explore-sizing and feat/explore-per-symbol-sizing — local and remote."
+
+## Goal
+Make `codegraph_explore`'s cost a clear win on **every** README benchmark repo, especially the two laggards the README showed thinnest (Django 9% cheaper, OkHttp 4%). The optimization target per CLAUDE.md is **tool-calls/reads + latency** (NOT raw cost) — but the user explicitly wanted the cost margins up too. Definition of done = both laggards clearly cheaper with ~0 reads, no regression elsewhere, README refreshed, shipped. **Achieved.**
+
+## Key findings
+- **The feature, in `src/mcp/tools.ts` (`handleExplore` + `buildFlowFromNamedSymbols`):** explore sizes output to the *answer*, not the file count. Builds on PR #564's gate (off-spine + polymorphic-sibling, with a named-callable *spare* + supertype-family *override*).
+- **PR #569 added four things** (all in `tools.ts`):
+  1. **Uniqueness-aware spare** — `buildFlowFromNamedSymbols` now returns `uniqueNamedNodeIds` (callables whose token had ≤3 defs). The whole-file spare uses it, so `as_sql` (110 defs) no longer keeps every Compiler/Expression variant full; `getResponseWithInterceptorChain` (1 def) still spares RealCall.
+  2. **Per-symbol focused view** — a collapsed family file renders FULL bodies for symbols with `prio()` < 99 (on-spine=0, unique-named=1, `fileDefinesSuper && named`=2), signatures for the rest. Bounded: `bodyCap = maxCharsPerFile*2`, `SIG_MAX = max(12, maxSymbolsInFileHeader*2)`. Header tag flips to `· focused (…)` when any body shown, else `· skeleton (…)`.
+  3. **All-tier test-file exclusion** — removed the `budget.excludeLowValueFiles` gate on the `isLowValue` hard-exclude (was <500-file tiers only); guards (query-mentions-tests, ≥2 non-test remain) kept.
+  4. **Named-cluster survival in non-sibling god-files** — inject agent-named method defs into `rangeNodes` even if the gather missed them; rank named ranges at importance **9** (above glue 6 / connected 3); `fileBudget = min(maxCharsPerFile, maxOutputChars - totalChars - 200)` in cluster selection so high-importance named clusters survive instead of being source-order-trimmed.
+- **Validated (headless A/B, Opus 4.8, median of 4, full 7-repo sweep) — now in README:** avg **25% cheaper · 57% fewer tokens · 23% faster · 62% fewer tool calls** (was 22/47/20/50). Per-repo cost: VS Code 33, Excalidraw 27, Django **23** (was 9, median 0 reads), Tokio 35, OkHttp **11** (was 4, 0 RealCall read-backs), Gin 15, Alamofire 28.
+- **PR #564 (already merged, `f1b14f0`)** was the prior round: named-callable spare + supertype-family override (fixed the read-back regression where RealCall.kt / compiler.py were skeletonized then Read back).
+
+## Gotchas
+- **A/B per-repo variance is large (±~10–13 pts).** The WITHOUT arm swings run-to-run, depending on how hard the native agent greps. Excalidraw/Gin look *lower* than the prior README purely because the native baseline was cheaper this batch; these are NOT regressions (reads still 0/low). **Averages are the stable signal.** Never conclude from n=1; the README is median-of-4.
+- **The alamofire `DataRequest` residual is NOT cleanly closable.** A "spare a file when the agent names its class" type-spare *broke OkHttp* (it spared all 5 interceptor classes → 0 skeletons). A named sibling class is structurally indistinguishable from "the one main type." Left as-is (alamofire is 28% cheaper; ~1 DataRequest read/run).
+- **Gin's WITH-cost ticked up ($0.36→$0.48 across batches)** — partly the named-injection adding content to an already-0-read repo. Still 15% cheaper. Possible over-eager named-injection on small repos.
+- **Validate retrieval changes with a real-agent A/B, not just the probe.** The deterministic `probe-explore.mjs` query forms a *different spine* than the agent's real query → it hid both the Django and the OkHttp read-backs. (Dead-end #6 in the design doc.)
+- **Always `npm run build` before probing/A/B** — probes + the A/B MCP server load `dist/`, not `src/`. Corpus indexes (`/tmp/codegraph-corpus/*`) are valid without re-index since all changes are query-time.
+- **`adaptive-sizing-skeletonizing.md` handoff is gone from `main`'s working dir** — it was untracked, got swept into commit `3c38729` on `feat/adaptive-explore-sizing`, so it lives only on that branch now. Deleting that branch deletes it (it's obsolete — that work shipped).
+- **5 `npm-shim` test failures are pre-existing/network** (lack `--probe-net` on the global binary) — not a regression; don't let them block.
+
+## How to test & validate
+- Build first: `npm run build` (must be green).
+- Deterministic probe: `node scripts/agent-eval/probe-explore.mjs /tmp/codegraph-corpus/<repo> "<symbol-bag query>"` → inspect `#### file — … · focused/skeleton` headers + sizes. okhttp = 5 `· skeleton`; django compiler.py `· focused` with `def execute_sql`/`def as_sql`/`def _fetch_all` bodies present; excalidraw/tokio/vscode/gin = 0 skeleton/focused (inert).
+- A/B one repo: `bash /tmp/ab-one.sh <repo> <runs> "<question>"` → writes `/tmp/ab-readme/<repo>/run<n>/`. Aggregate one repo: `node /tmp/one-agg.mjs <repo>`. Full 7: `RUNS=4 bash scripts/agent-eval/bench-readme.sh` then `node scripts/agent-eval/parse-bench-readme.mjs /tmp/ab-readme` (averages) + `node /tmp/full-agg.mjs` (per-repo reads/grep/tools/cost/time).
+- Unit: `npx vitest run __tests__/adaptive-explore-sizing.test.ts` → **8/8** (skeleton, named-callable spare=RealCall, supertype-family override→focused=codec.ts, uniqueness/shared-method, on-spine exemplar full, distinct step full, flag=0 disables).
+- **Methodology:** a real win = cost DOWN **and** reads NOT up vs the same build's WITHOUT arm; confirm inert repos stay 0 skeleton/focused (the change only *adds* spare conditions + per-symbol rendering of already-collapsed files → strict subset of the original gate).
+
+## Repo state
+- branch `main`, last commit `b026e64 feat(mcp): per-symbol adaptive codegraph_explore sizing (#569)`.
+- uncommitted: clean (this handoff file will be a new untracked `.claude/handoffs/` entry).
+- merged-but-undeleted branches: `feat/adaptive-explore-sizing` (#564) + `feat/explore-per-symbol-sizing` (#569), both local + remote.
+
+## Open threads / TODO
+- [ ] Delete the two squash-merged branches (local + remote): `feat/adaptive-explore-sizing`, `feat/explore-per-symbol-sizing`.
+- [ ] (optional) Stabilize the README per-repo numbers with a 2nd full-7 batch → publish median-of-8 (smooths the WITHOUT-arm variance that makes Excalidraw/Gin look lower).
+- [ ] (frontier) Gin's small WITH-cost bump from named-injection on an already-0-read repo — consider gating named-injection by repo size / whether reads were happening.
+- [ ] (frontier) Alamofire `DataRequest` ~1 read/run — only closable by distinguishing "main type" from "interchangeable sibling," which has no clean static signal (don't re-attempt the type-spare; it breaks okhttp).
+- [ ] (frontier) Big non-sibling god-file clustering generally (#185 family) — `query.py`/`Session.swift`-style files; the fileBudget fix helped Django but the merge/selection logic is still the lever for other repos.
+
+## Recent transcript (oldest → newest)
+### Turn — "Add the regression test for the skeletonization"
+- **Claude:** Wrote `__tests__/adaptive-explore-sizing.test.ts` (OkHttp-in-miniature fixture); validated via probe; 5 cases. (This was for PR #564's gate.)
+- **Outcome:** Test landed; led into re-validating the README.
+### Turn — re-validate README, found Django/OkHttp cost laggards → built fixes
+- **Claude:** A/B sweeps via `bench-readme.sh`; diagnosed read-backs (RealCall via `Lockable` 9-impl mixin; compiler.py base+subclasses). Shipped PR #564 (named-callable spare + supertype-family override). Then A+B+C (uniqueness + per-symbol focused + test exclusion).
+- **Outcome:** Django/OkHttp flipped to wins; alamofire type-spare attempt broke okhttp → reverted.
+### Turn — "Pursue the deeper Django fix"
+- **Claude:** Cluster-debug showed `_fetch_all` (L2237) never formed a cluster + got source-order-trimmed. Fixed via named-injection + importance 9 + `fileBudget`. A/B: Django reads tail 3→1, 11%→17%; controls (okhttp 14%, excalidraw 31% / 0 reads) held. Committed `e71eb90`.
+- **Outcome:** Django residual closed.
+### Turn — "push + PR + squash-merge, update README incl. averages + header"
+- **Claude:** Ran final full-7 sweep (`bench-readme.sh`, 56/56). Clarified a confusing "WITHOUT got cheaper" phrasing (WITH cg is cheaper in all 7; the % is the *gap*, which shrinks when the native baseline is cheap that batch). User said publish this batch.
+- **Outcome:** Updated README (headline 25%/62%, average line, 7 summary rows, 7 detail tables, methodology date) + CHANGELOG + design doc. Built clean branch off `origin/main` (dropping the already-squashed commits + the handoff artifact), pushed, opened PR #569, squash-merged → `b026e64`. Synced local to main, rebuilt dist. Offered branch cleanup → user ran `/handoff save`.

+ 86 - 0
.claude/handoffs/trace-relevance-coldstart-2026-05-30.md

@@ -0,0 +1,86 @@
+---
+name: trace-relevance-coldstart-2026-05-30
+date: 2026-05-30 23:30
+project: codegraph
+branch: feat/trace-relevance-closure-collection
+summary: Turned Alamofire (README's weakest repo) into a clean win via a trace endpoint-disambiguation fix + god-file explore rendering, then eliminated the MCP cold-start race that was causing benchmark inconsistency (handshake ~811ms→~90ms); PR #580 has 6 commits, all that's left is a clean README sweep + squash-merge.
+---
+
+# Handoff: trace-relevance + closure-collection + cold-start (PR #580)
+
+## Resume here — read this first
+**Current state:** PR #580 (branch `feat/trace-relevance-closure-collection`, 6 commits, pushed, in sync with remote) is feature-complete and validated — full suite 1090 pass (only the 5 pre-existing npm-shim network fails), 28/28 MCP+daemon tests. The MCP cold-start race (the dominant benchmark-inconsistency source) is ELIMINATED via the proxy-local-handshake (tool registration ~90ms cold+warm, was ~811ms). The README benchmark table still shows the OLD pre-fix numbers.
+**Immediate next step:** Run a median-of-4 README sweep on this build (the race is gone, so numbers should be naturally consistent), update the README table/averages/headline, then squash-merge PR #580.
+
+> Suggested next message: "Run `RUNS=4 bash scripts/agent-eval/bench-readme.sh` on this build, parse with `node scripts/agent-eval/parse-bench-readme.mjs /tmp/ab-readme` (race-aware), update the README benchmark table + averages + the 7 per-repo detail tables + methodology date, then squash-merge PR #580 with `gh pr merge 580 --squash --admin`."
+
+## Goal
+Started as "Alamofire is the README's weakest benchmark repo (13% fewer tool calls vs the ~62% average) — fix it." Became: make CodeGraph's retrieval **consistent and faster**. Definition of done = PR #580 merged (trace fix + dynamic-dispatch coverage + god-file rendering + cold-start elimination), README refreshed with stable median-of-4 numbers. Optimization target per CLAUDE.md is **tool-calls/reads + latency**, NOT raw cost.
+
+## Key findings
+The 6 commits on the branch (oldest→newest):
+- `e86d573` **Trace endpoint relevance** (THE Alamofire win) + closure-collection synthesizer + explore synth-links.
+- `c64c4b3` **God-file multi-phase explore rendering** (6 sub-layers).
+- `5d7388c` Skeleton/focused tag steers to `codegraph_explore`, not Read (spiral fix #1).
+- `dc19eab` Bench parser race-aware (excludes "No such tool available" runs).
+- `91e28df` serve --mcp cold-start ~811ms→~600ms (defer CodeGraph load + 25ms poll).
+- `82ae484` **Proxy-local-handshake** — handshake ~600ms→~90ms, cold-start race eliminated.
+
+Root-causes found by reading A/B TRANSCRIPTS (not the noisy median):
+- **Trace bug:** `handleTrace`'s `scorePair` ranked only by shared-dir-prefix, so overloaded names (`request`=44 defs, `task`=8) resolved to empty `EventMonitor.request(){}` / `RedirectHandler.task` STUBS over the real `Session.request` → agent saw garbage, said "the trace collided with same-named symbols", read by hand. Fix: `nodeRelevance` term in `handleTrace` (penalize ≤1-line stubs −40, test files −150). Result n=8: WITH tools 12→8 median, read variance 0–12→1–4 (the meltdowns WERE the trace-collision flounder). General bug (Swift/Java/C#/Go protocol-stub flooding).
+- **Closure-collection synthesizer** (`src/resolution/callback-synthesizer.ts` `closureCollectionEdges`): Swift `validators.write{$0.append}`…`didCompleteTask` `validators.forEach{$0()}`. The element-invoke `$0(`/`it(` is the precision gate → 9 edges on Alamofire, **0 on every non-Swift control**. Surfaced inline in trace + a "Dynamic-dispatch links" section in `buildFlowFromNamedSymbols` (so it shows when the agent named only `validate`, not `didCompleteTask`).
+- **God-file rendering** (`handleExplore` in `src/mcp/tools.ts`, 6 layers): (1) on-spine god-files render spine-full + off-path methods as signatures (true-spine); (2) named-seed gather — inject each named token's substantive def into the subgraph (FTS buried `validate` → Validation.swift was never gathered); (3) a file that DEFINES a named symbol scores +50 (beats incidental Combine.swift's +23 connected-node score); (4) the 90%-budget early-break and (5) the total-output cap both EXEMPT necessary (entry/spine/uniqueNamed) files; (6) final ceiling 1.5×maxOutputChars. Renders build+validators-exec+validate in ONE explore.
+- **Spiral cause #1 (fixed):** the skeleton tag said "Read for a full body" → agent Read the skeletonized central files → over-investigation spiral. Now steers to `codegraph_explore`.
+- **Spiral cause #2 / the BIG inconsistency (fixed):** MCP **cold-start race**. `serve --mcp` wasn't ready when the headless agent fired → "No such tool available" → grep/Read flounder (19–30 tool spirals). Root-caused: NOT module load (mcp/index 38ms, CodeGraph chain 30ms), NOT the `--liftoff-only` re-exec (NO_RELAUNCH ≈ same) — it's the proxy WAITING for the spawned daemon to bind. Fixed: proxy answers initialize/tools-list from STATIC constants (`runLocalHandshakeProxy` in `proxy.ts`), forwards tool CALLS to the daemon (connected in background), lazy in-process engine fallback preserves the old fall-back-to-direct robustness. `connectWithHello` distinguishes 'version-mismatch' (fail fast → local) from 'not-yet' (poll). Handshake 91ms cold / 88ms warm.
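
The trace fix above can be illustrated with a minimal TypeScript sketch. Only the two penalties (−40 for ≤1-line stubs, −150 for test files) come from these notes. The `Candidate` shape, the `isTestPath` heuristic, `pickEndpoint`, and the sample scores are hypothetical stand-ins for illustration, not the actual `handleTrace` code.

```typescript
// Hypothetical candidate shape; the real node type in codegraph may differ.
interface Candidate {
  name: string;
  filePath: string;
  startLine: number;
  endLine: number;
  pairScore: number; // stand-in for the existing shared-dir-prefix score
}

// Assumed heuristic for "test file"; the real detection is not shown in these notes.
function isTestPath(filePath: string): boolean {
  return (
    /(^|\/)(__tests__|tests?|spec)\//i.test(filePath) ||
    /\.(test|spec)\.\w+$/i.test(filePath)
  );
}

// Penalties taken from the notes: one-line stubs -40, test files -150.
function nodeRelevance(c: Candidate): number {
  let score = 0;
  const lineCount = c.endLine - c.startLine + 1;
  if (lineCount <= 1) score -= 40;
  if (isTestPath(c.filePath)) score -= 150;
  return score;
}

// Pick the candidate with the highest combined score (undefined if the list is empty).
function pickEndpoint(candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const total = c.pairScore + nodeRelevance(c);
    if (total > bestScore) {
      best = c;
      bestScore = total;
    }
  }
  return best;
}

// Illustrative data only: an empty protocol stub, the real definition, and a test double.
const stub: Candidate = { name: "EventMonitor.request", filePath: "Source/EventMonitor.swift", startLine: 10, endLine: 10, pairScore: 25 };
const real: Candidate = { name: "Session.request", filePath: "Source/Session.swift", startLine: 100, endLine: 140, pairScore: 20 };
const testDouble: Candidate = { name: "SessionTests.request", filePath: "Tests/SessionTests.swift", startLine: 5, endLine: 30, pairScore: 50 };

console.log(pickEndpoint([stub, real, testDouble])?.name); // "Session.request"
```

In this made-up example, the stub and the test double both outscore the real definition on the directory term alone, and the relevance term is what flips the ranking.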
+
+## Gotchas
+- **A/B variance is HUGE — never conclude from n=1, or even one n=4 batch.** The median-of-4 caught regressions the lucky dedicated batches HID (the god-file rework looked great in one batch at 0.5 reads/5.5 tools; the median showed 13 tools dragged by 2 spirals). Report ranges.
+- **Kill stale daemons before any cold-start measurement:** `pkill -9 -f "dist/bin/codegraph.js"; rm -f /tmp/codegraph-corpus/<repo>/.codegraph/daemon.*`. A zombie daemon holding the lock causes a 6s retry-exhaust that looks like a 7× regression (it bit me — the "6239ms" false alarm).
+- **`timeout` is NOT available on stock macOS** (it ships with GNU coreutils, which is not installed by default): measure cold-start with a `node` spawn + a `setTimeout` kill-timer (see the transcript's measurement snippets).
+- Corpus repos: `/tmp/codegraph-corpus/<repo>` (all 7 README repos indexed). Explore/trace changes are **query-time** (no re-index). The closure-collection synthesizer is **index-time** but produces 0 edges on non-Swift, so it's inert there.
+- Global `codegraph` is npm-linked to the dev dist (`node dist/bin/codegraph.js`). **Always `npm run build` before any probe/A/B** (they load `dist/`, not `src/`).
+- `engine.ts`/`tools.ts` now `import type CodeGraph` + lazy `require('../index')` (CommonJS, cached) so the daemon binds before the sqlite/query chain loads; `findNearestCodeGraphRoot` now comes from the light `../directory`.
+- The old `runProxy`/`pipeUntilClose` in `proxy.ts` are now DEAD (superseded by `runLocalHandshakeProxy`) — left in place; safe to prune in a follow-up.
+- 5 `npm-shim.test.ts` failures are pre-existing/network (need `--probe-net`) — NOT regressions; ignore.
+- Uncommitted `.gitignore` change (`tmux-web/`) is unrelated/not mine — do NOT commit it on this branch.
+- `parse-bench-readme.mjs` excludes raced runs by default; `CG_INCLUDE_RACED=1` keeps them to see the raw distribution. Now a safety net (race eliminated at source).
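
The "report ranges, exclude raced runs" advice above could be sketched roughly as follows. This is an illustrative TypeScript aggregator, not the logic of `parse-bench-readme.mjs`. The `RunResult` shape and the `summarize` helper are assumptions, and the `includeRaced` parameter only mirrors the idea behind `CG_INCLUDE_RACED=1`.

```typescript
// Hypothetical per-run record; the real bench output format is not shown in these notes.
interface RunResult {
  toolCalls: number;
  transcript: string; // raw run log text
}

// The marker string the notes associate with a cold-start race.
const RACE_MARKER = "No such tool available";

// Median of a non-empty list (mean of the two middle values for even lengths).
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? (sorted[mid] as number)
    : ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2;
}

interface Summary {
  n: number;
  median: number;
  min: number;
  max: number;
}

// Summarize tool calls as median plus range, dropping raced runs unless asked to keep them.
function summarize(runs: RunResult[], includeRaced = false): Summary | null {
  const kept = includeRaced
    ? runs
    : runs.filter((r) => !r.transcript.includes(RACE_MARKER));
  if (kept.length === 0) return null;
  const calls = kept.map((r) => r.toolCalls);
  return {
    n: kept.length,
    median: median(calls),
    min: Math.min(...calls),
    max: Math.max(...calls),
  };
}

// Illustrative data only: three clean runs and one raced spiral.
const sampleRuns: RunResult[] = [
  { toolCalls: 6, transcript: "ok" },
  { toolCalls: 8, transcript: "ok" },
  { toolCalls: 10, transcript: "ok" },
  { toolCalls: 25, transcript: "Error: No such tool available" },
];

console.log(summarize(sampleRuns)); // { n: 3, median: 8, min: 6, max: 10 }
console.log(summarize(sampleRuns, true)); // { n: 4, median: 9, min: 6, max: 25 }
```

Reporting `min` and `max` next to the median makes a single spiral visible instead of letting it hide inside one lucky or unlucky batch.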
+
+## How to test & validate
+- `npm run build` → must be clean (exit 0).
+- `npx vitest run` → **1090 pass**, only the 5 npm-shim network fails.
+- `npx vitest run __tests__/mcp-daemon.test.ts` → **7/7** (sharing, #277 survive-client-death, version-mismatch fallback, idle-timeout).
+- Cold-start handshake (after killing daemons): node-spawn a `serve --mcp`, send `initialize`, time the id:1 response → **~90ms** (was ~811ms). Then a `tools/call` (e.g. `codegraph_status`) returns a real result (forwarded to the daemon, ~3.4s on vscode's first index load — a call that returns LATE, not a missing-tool error).
+- A/B sweep: `RUNS=4 bash scripts/agent-eval/bench-readme.sh` → `node scripts/agent-eval/parse-bench-readme.mjs /tmp/ab-readme`.
+- **Methodology:** handshake <150ms = race eliminated; in an A/B, grep the WITH jsonls for "No such tool available" (should be 0 now); WITH reads/tools < WITHOUT with no control regression.
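
For the cold-start handshake check above, a `node` spawn with a `setTimeout` kill-timer could look roughly like this TypeScript sketch. It is not the snippet from the original session. `timeFirstResponse` and `measureColdStart` are hypothetical names, and the `initialize` params (including the `protocolVersion` string and `clientInfo`) are assumptions about what the server accepts. It also assumes newline-delimited JSON-RPC on stdio.

```typescript
import { spawn } from "node:child_process";

// Spawn a process, send one newline-delimited JSON-RPC request on stdin, and resolve
// with the milliseconds until the first stdout line whose `id` matches the request.
// A setTimeout kill-timer replaces the `timeout` binary.
function timeFirstResponse(
  command: string,
  args: string[],
  request: { id: number; [key: string]: unknown },
  killAfterMs: number,
): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    const child = spawn(command, args);
    const started = Date.now();
    let buffer = "";
    let settled = false;

    const killTimer = setTimeout(
      () => finish(new Error(`no response within ${killAfterMs}ms`)),
      killAfterMs,
    );

    // Settle exactly once, then always stop the timer and kill the child.
    function finish(err: Error | null, ms?: number): void {
      if (settled) return;
      settled = true;
      clearTimeout(killTimer);
      child.kill("SIGKILL");
      if (err) reject(err);
      else resolve(ms as number);
    }

    child.on("error", (e) => finish(e));
    child.on("close", () => finish(new Error("process closed before responding")));
    child.stdin.on("error", () => { /* ignore EPIPE if the child died early */ });
    child.stderr.resume(); // drain stderr so a chatty child cannot block on a full pipe

    child.stdout.on("data", (chunk: Buffer) => {
      buffer += chunk.toString("utf8");
      let newline = buffer.indexOf("\n");
      while (newline >= 0) {
        const line = buffer.slice(0, newline).trim();
        buffer = buffer.slice(newline + 1);
        newline = buffer.indexOf("\n");
        if (!line) continue;
        try {
          const message = JSON.parse(line);
          if (message.id === request.id) return finish(null, Date.now() - started);
        } catch {
          // Not JSON (e.g. a log line); keep reading.
        }
      }
    });

    child.stdin.write(JSON.stringify(request) + "\n");
  });
}

// Usage sketch (not invoked here): time the id:1 `initialize` response of `serve --mcp`.
// The params below are assumed values, not taken from the notes.
function measureColdStart(): Promise<number> {
  return timeFirstResponse(
    "node",
    ["dist/bin/codegraph.js", "serve", "--mcp"],
    {
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2024-11-05",
        capabilities: {},
        clientInfo: { name: "cold-start-probe", version: "0.0.0" },
      },
    },
    10_000,
  );
}
```

Kill any stale daemons first, as described in the Gotchas, so the measurement reflects a true cold start. A result well under 150ms would match the "race eliminated" threshold in the methodology line.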
+
+## Repo state
+- branch `feat/trace-relevance-closure-collection`, last commit `82ae484 perf(mcp): proxy answers initialize/tools-list locally — cold-start handshake ~600ms→~90ms`. In sync with remote (0/0). PR **#580** open.
+- uncommitted: ` M .gitignore` (unrelated tmux-web/ edit, not mine) + the new untracked `.claude/handoffs/` files. All code committed.
+
+## Open threads / TODO
+- [ ] Run median-of-4 README sweep on THIS build (race gone → should be naturally consistent, no exclusions needed).
+- [ ] Update README: benchmark summary table (7 rows × cost/tokens/time/tools), the 7 per-repo detail tables (Time/Reads/Grep/Tools/Tokens/Cost), the average line, the headline (line 7), and the methodology date (line 86).
+- [ ] Squash-merge PR #580 (`gh pr merge 580 --squash --admin` per the PR-merge-policy memory).
+- [ ] Add CHANGELOG entries under `[Unreleased]` for the cold-start fixes (trace fix + closure-collection + god-file already have entries from earlier commits — verify).
+- [ ] (cleanup) Prune dead `runProxy`/`pipeUntilClose` in `proxy.ts`.
+- [ ] (frontier, deferred) Alamofire `task` overload still seeds `RedirectHandler.task` (a harmless false-positive in named-seed ranking) — same nodeRelevance idea could de-rank it.
+
+## Recent transcript (oldest → newest)
+### Turn — "which repos drag the averages? Alamofire looks worst"
+- **Claude:** Computed per-column deviations from the README table → Alamofire #1 drag (tool-calls 13% vs 62% avg, time 7% vs 23%), gin #2; both ~110-file repos.
+- **Outcome:** Targeted Alamofire.
+### Turn — "dig into Alamofire A/B logs"
+- **Claude:** Read `/tmp/ab-readme/alamofire/run*/run-headless-with.jsonl` → found validate-flow break (`validators` closure-array) + per-run reasoning. Built closure-collection synthesizer + explore synth-links. A/B: flat (adoption variance).
+- **Outcome:** Synthesizer correct + sufficient (forced codegraph-only 3/3) but didn't move the noisy median.
+### Turn — "keep working till you figure it out"
+- **Claude:** Read transcripts again → "the trace collided with same-named symbols (44 requests, 8 tasks)". Found `scorePair` picked empty stubs. Added `nodeRelevance`. A/B n=8: tools 12→8, variance collapsed.
+- **Outcome:** The real Alamofire win. Committed `e86d573`; opened PR #580; squash-merge deferred.
+### Turn — "do the file-ranking + gather layers"
+- **Claude:** Built the 6-layer god-file rendering (`c64c4b3`). A/B flat-to-worse → debugged spirals → found the skeleton tag said "Read for more" (fixed `5d7388c`) → spirals persisted → root-caused the MCP cold-start race ("No such tool available").
+- **Outcome:** Race is the dominant inconsistency, not the code.
+### Turn — "keep iterating until consistent / do both (parser + startup)"
+- **Claude:** Bench parser race-aware (`dc19eab`). Deferred CodeGraph load off the bind path + 25ms poll (`91e28df`, 811→600ms).
+- **Outcome:** Consistency handled (parser) + startup ~25% faster (811→600ms).
+### Turn — "how to optimize cold-start? / implement it"
+- **Claude:** Profiled: module loads take only 30–40ms and the re-exec is not the cause; the time goes to waiting for the spawned daemon. Implemented `runLocalHandshakeProxy` (`82ae484`): proxy answers initialize/tools-list locally + forwards calls + lazy local-engine fallback. Fixed 4 daemon tests (emit "Attached to shared daemon" + fast-fail version-mismatch + updated 1 assertion). Handshake 90ms; 28/28 MCP tests; full suite 1090 pass.
+- **Outcome:** Cold-start race ELIMINATED. All cold-start work committed + pushed. README sweep + squash-merge pending.