--- name: explore-overhaul-bench-2026-06-02 date: 2026-06-02 06:30 project: codegraph branch: feat/explore-overhaul-store-coverage summary: Finished the explore-overhaul arc (explore as sole primary + store coverage + overload disambiguation + method-atomic render + node file/line selector + explore reshaped to native-read windows) and validated it — all 7 README repos hit 0 Read/0 Grep at effort=high; only the README benchmark write-up remains. --- # Handoff: explore-overhaul arc — validated 0-reads across all 7 README repos; README write-up is the last step ## Resume here — read this first **Current state:** All code is committed + pushed on `feat/explore-overhaul-store-coverage` (4 commits, working tree clean). The why-Read agent sweep is DONE: **all 7 README repos × 4 runs = 28/28 runs hit 0 Read / 0 Grep on `--effort high`**, every run "codegraph was sufficient." WITH-`high` medians are captured (~59% fewer tool calls · 51% fewer tokens · ~15% cheaper · 0 reads vs the existing README WITHOUT) — the earlier cost REGRESSION (-3%) is recovered. The only open item is **updating the README benchmark section**, which is blocked on one methodology decision. **Immediate next step:** Decide how to publish: (A) do a CLEAN both-arms run on `effort=high` with the PLAIN prompt (no why-Read) for an apples-to-apples table, or (B) write the WITH-`high` deltas in against the existing WITHOUT with a cross-effort caveat. Then edit `README.md` (benchmark table + per-repo breakdowns + average line + methodology date) and open the PR. > Suggested next message: "Do the clean both-arms run on effort=high with the plain prompt for all 7 repos, then update the README benchmark table + per-repo breakdowns from those medians and open the PR." ## Goal Make `codegraph_explore` a true Read-replacement — flow/architecture questions answered with ~0 Read/Grep — then re-validate the README benchmark on the current build and update its numbers. Definition of done: README benchmark reflects the current build with defensible (same-effort) numbers; branch merged via PR. ## Key findings - **The arc (all shipped on the branch):** explore is the SOLE primary tool (`codegraph_context` + `codegraph_trace` removed in the prior session, this branch); store-action **coverage** (object-literal method extraction — a GENERAL AST rule in `tree-sitter.ts` `extractVariable`/`extractObjectLiteralFunctions`/`findInitializerReturnedObject`, covers Zustand/Redux/Pinia, not a per-lib hack); graph-ranking **gate fix** (a named/≥2-term file is never pruned); **`node` all-overloads + `file`/`line` selector**; **method-atomic render** (never half a method — drop whole methods/files); **explore reshape** to native-read windows. - **Native-read ground truth (from the WITHOUT transcripts):** the agent natively reads **~6–9 files as ~100-line windows** (77% ranged, median 100 lines, 51–250 dominant), located by `func X(` signature greps. That's the unit explore now mimics. - **Explore reshape (commit 50401a6, the latest mechanism):** `getExploreOutputBudget` caps EVERY tier at **~24K** (was 28/35/38K) + absolute **25K** hard ceiling (was 1.5×-of-budget) — because a bigger response gets **externalized** by the host to a file the agent Reads back (a 35K vscode explore did exactly that) AND costs cache-writes. Repo size scales the CALL budget, not the response. Per-file = one ~150–250 line window: per-symbol `bodyCap` 2×→1.5× and the spine is windowed too (so tokio's big-spine `worker.rs` doesn't starve `harness.rs`'s `poll`); central whole-file 4×→1.5× / 400→280 lines. Explore's named-symbol injection now uses **`cg.getNodesByName`** (direct index, not FTS) so a 50+-overload name (`poll`) surfaces the wanted def (`Harness::poll`) for the PascalCase-type-token bias to pick. - **`node` file/line selector (commit 5bf6ad8):** `codegraph_node` takes optional `file`/`line` to pin an overload (the `file:line` a trail showed). `findSymbolMatches` (replaced `findSymbol`) enumerates ALL overloads via `cg.getNodesByName` (new passthrough `index.ts` → QueryBuilder), then file/line filters. The agent USES it in runs (`run file:worker.rs line:508`, `poll file:harness.rs`). - **Cost regression was REAL, now recovered.** The pre-reshape n=4 benchmark (on `max` effort, bloated 35-42K explores) was **−3% cost avg** (vscode −52%) and reads were **NOT 0** (vscode 6,4,0,7; tokio 3,4,2,2) — which corrected my earlier n=2 "0 reads everywhere" optimism. The reshape (≤25K, no externalization) + 0 reads flipped cost back to **~15% cheaper**. ## Gotchas - **STALE-DAEMON foot-gun:** before ANY agent eval, `pkill -f "serve --mcp"; rm -f /.codegraph/daemon.sock` so it serves the current `dist/`. `bench-why-repo.sh` does this per-run. A `npm run build` does NOT take effect until the daemon is killed. - **Mac SLEEP corrupts long runs:** the first overnight re-bench (5h on `max`) was sleep-corrupted — the Mac napped 16–42 min BETWEEN runs (~3h of the 5h was paused), inflating wall-clock for the later repos. **Always wrap long runs in `caffeinate -dimsu`.** Cost/tokens/reads are sleep-INDEPENDENT (billed API totals), so the cost regression was real (confirmed on vscode which ran fully awake before any sleep); only TIME is corrupted. - **`--effort` matters:** the user's Claude default is `max`, which is "too much." The eval is pinned to `--effort high` (levels: low/medium/high/xhigh/max). `bench-why-repo.sh` honors `EFFORT` (default `high`). The MAX-mode runs were discarded and redone on `high`. - **why-Read prompt biases reads down (Hawthorne) + adds <0.3% to WITH cost/tokens.** So the 28/28 0-read sweep proves codegraph is *sufficient* (it CAN answer with 0 reads); it slightly understates a natural run's reads. Keep it OUT of any published benchmark numbers (use plain prompt for the table). - **README methodology mismatch:** WITH numbers are `effort=high` + why-Read; the existing README WITHOUT is the user's OLD default effort + plain. Cross-effort → can't publish cleanly without same-effort both arms. The user does NOT want to re-run WITHOUT repeatedly, but the effort CHANGED, so a one-time WITHOUT-on-high is a new (justified) measurement. - **PR policy:** `main` is REVIEW_REQUIRED — work on the branch, open a PR, `gh pr merge --squash --admin` for self-review. Branch + push only so far; **PR not opened** (user asked branch+push). ## How to test & validate - Build: `npm run build` (exit 0). Full suite: `npx vitest run` → **1112 pass, 2 skip, 0 fail** (npm-shim network tests can flake offline — pre-existing). - Affected tests: `npx vitest run __tests__/{explore-output-budget,adaptive-explore-sizing,context-ranking,explore-blast-radius,symbol-lookup,pr19-improvements,object-literal-methods}.test.ts`. - Deterministic probe (current `dist/`, in-process — NOT the daemon): `node scripts/agent-eval/probe-explore.mjs /tmp/codegraph-corpus/ ""` → confirm ≤~25K chars + the flow files render. `node scripts/agent-eval/probe-node.mjs code` (e.g. `poll file:harness.rs` via a small script). - Agent why-Read sweep (the real metric): `EFFORT=high caffeinate -dimsu bash scripts/agent-eval/bench-why-repo.sh /tmp/codegraph-corpus/ "" 4` → parse `/tmp/ab-why//with*.jsonl` for `Read`/`Grep` tool_use + the trailing `## Why I read` section. - All 7 repos are cloned + indexed on the current build at `/tmp/codegraph-corpus/{vscode,excalidraw,django,tokio,okhttp,gin,alamofire}`. README queries are in `scripts/agent-eval/bench-readme.sh`. - **Pass bar:** flow question → ~0 Read at the explore-call budget, faster than WITHOUT, no control regression. ## Repo state - branch `feat/explore-overhaul-store-coverage`, last commit `9cf671a chore(agent-eval): add per-repo WITH-only why-Read benchmark harness`. Pushed, in sync with origin. - 4 commits: `22333c1` (explore-primary + store coverage + overload disambiguation + docs) · `5bf6ad8` (method-atomic render + node file/line selector) · `50401a6` (explore reshape: inline-cap + concentrated flow-windows + direct-index injection) · `9cf671a` (why-Read eval harness). - uncommitted: clean (only `.claude/handoffs/` scratch, intentionally not committed). - CHANGELOG `[Unreleased]` already updated for the whole arc. README + CLAUDE.md + docs/design banners updated to drop removed context/trace. ## Open threads / TODO - [ ] **README benchmark update** — decide (A) clean both-arms `effort=high` plain run vs (B) publish WITH-`high` deltas with caveat; then edit `README.md` (table + per-repo WITH/WITHOUT breakdowns + average + methodology date). WITH-`high` medians: vscode 4cg/742k/$0.70, excalidraw 6.5/834k/$0.74, django 3/509k/$0.59, tokio 6/921k/$0.77, okhttp 4.5/528k/$0.41, gin 4.5/573k/$0.48, alamofire 7.5/976k/$0.65 — all 0 reads. - [ ] **Open the PR** (`gh pr create`) once the README is updated. - [ ] Optional: delete dead dev probes `scripts/agent-eval/probe-{trace,context}.mjs` (their tools were removed; noted, not done). - [ ] Memory `project_codegraph_read_displacement` is updated through the reshape; add the 28/28-on-high + cost-recovery result if you want it durable. ## Recent transcript (oldest → newest) ### Turn — "re-run the full benchmark to update README; ask why on reads" - **Claude:** cloned+indexed all 7 repos; launched `RUNS=4 bench-readme.sh` (both arms). → ran overnight. ### Turn — "what's still running? 5h overnight" - **Claude:** bench was on the LAST repo (alamofire), ~90%. n=4 aggregate showed **cost −3% (regression)** + reads **NOT 0** (vscode 6-7, tokio 2-4) — corrected my n=2 "0 everywhere." Diagnosed reads from transcripts: explore was 35-42K, too spread, dropped named flow methods. ### Turn — "is it the lock screen?" - **Claude:** Yes for TIME (Mac slept 16-42min between runs → 5h mostly paused) but NOT cost (vscode ran awake; cost is billed API). → recommend `caffeinate`. ### Turn — "how do I stop Mac sleeping" → answered (Battery→Options / `caffeinate`). ### Turn — "do the queries WITHOUT codegraph, see exactly what it Reads, mimic it" - **Claude:** characterized native reads (77% ranged, median 100 lines, signature greps) → spec: ~100-line flow-windows, ~6-9 files, ≤25K. ### Turn — "do it" (the reshape) - **Claude:** `Edit(tools.ts)` → inline-cap 24K + concentrated windows + `getNodesByName` injection; updated budget tests; full suite green; probe vscode 35K→16K, tokio fits harness.rs. Committed `50401a6`, pushed. ### Turn — "why AST not LSP" (×2) + "reply to hd_nvim (ast-grep author)" - **Claude:** explained the trade-off (breadth/speed/no-build/determinism/19-langs vs LSP type-precision-but-heavy); drafted the X reply (concede precision, reframe on goal/cost). ### Turn — "re-run, ask Why on each WITH read, report per repo" → "no WITHOUT" → "MAX too much, use high, start over" - **Claude:** built `bench-why-repo.sh` (WITH-only, why-Read, `EFFORT` default high); ran all 7 on `high` one repo at a time, reporting each. **Result: 28/28 runs 0R/0G, all "sufficient."** Aggregated WITH-`high` medians → ~15% cheaper (regression recovered). ### Turn — "commit and push this" - **Claude:** committed `bench-why-repo.sh` (`9cf671a`), pushed. → this handoff. - **Outcome:** Arc complete + validated; README write-up + PR are all that remain.