1 month ago · 03069c9118
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -92,7 +92,7 @@ Cursor launches MCP subprocesses with the wrong cwd and doesn't pass `rootUri` i
 
															 ## Retrieval performance & dynamic-dispatch coverage (do not regress)
														
 
															-CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *don't optimize for token cost*. (Cost is **neutral-to-lower** in practice, not "flat" as earlier framing claimed: a with-vs-without A/B on excalidraw (n=3) ran **3× faster, 15× fewer tool calls, ~0 vs 23 reads, and −40% cost** — compact codegraph answers cache well across turns, while the without-arm's read/grep thrash creates fresh, poorly-cacheable input. See `docs/benchmarks/call-sequence-analysis.md`.) The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
														
 
															+CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *don't optimize for token cost*. (Cost is **lower**, not "flat" as earlier framing claimed: a current-build with-vs-without A/B across the 7 README repos, median of 4, saved on average **35% cost · 57% tokens · 46% time · 71% tool calls** — reproducing the published README. The mechanism is **far fewer turns over a much smaller accumulated context** — NOT cache-ability: the without-arm's huge token volume is *mostly* cheap cache-reads, which is why token-count savings (57%) look bigger than cost savings (35%). Measure tokens by **summing per-turn assistant usage**, not `result.usage` (last-turn only in current Claude Code). See `docs/benchmarks/call-sequence-analysis.md`.) The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
														
 
															 **Target behavior:** a flow question resolves in **1 codegraph call on small repos, scaling to 3–5 on large**, with **Read/Grep = 0**. When reviewing a PR or trying something new, do not regress this.
														
--- a/docs/benchmarks/call-sequence-analysis.md
+++ b/docs/benchmarks/call-sequence-analysis.md
@@ -389,29 +389,38 @@ node scripts/agent-eval/parse-arms.mjs
 
															 ---
														
 
															-# Current-build with/without A/B (excalidraw, 2026-05-24)
														
 
															+# Current-build with/without A/B — the 7 README repos (2026-05-24)
														
 
															-After this session's changes (self-sufficient trace + explore-flow + line numbers), a fresh
														
 
															-with-vs-without A/B on excalidraw — same flow question, **n=3 per arm** (headless: codegraph-only MCP
														
 
															-vs empty MCP):
														
 
															+Re-ran the published README benchmark on the **current build** (all 7 repos freshly reindexed),
														
 
															+same queries, **median of 4 runs/arm** (headless: codegraph-only MCP vs empty MCP):
														
 
-| metric | with codegraph | without | delta |
-|---|--:|--:|--:|
-| duration | 49s [43–53] | 145s [88–184] | **3.0× faster** |
-| total tool calls | 3.3 [3–4] | 49.3 [20–85] | 15× fewer |
-| Reads | 0.3 [0–1] | 23.3 [9–39] | ~0 vs 23 |
-| Grep/Glob | 0.0 | 14.3 [11–21] | eliminated |
-| codegraph calls | 3.0 | 0 | the trade |
-| tokens in | 120k [105–149] | 181k [78–384] | −34% |
-| tokens out | 5.9k | 8.8k | −33% |
-| **cost** | **$0.405** | **$0.678** | **−40%** |
														
 
															-
														
 
															-**Cost is neutral-to-lower, NOT flat** — correcting the earlier "cost stays ~flat" framing. Every one
														
 
															-of the 3 without-runs cost more than every with-run. Mechanism = caching: compact codegraph answers
														
 
															-(3 calls) cache well across turns, while the without-arm's 23 reads + 14 greps create fresh,
														
 
															-poorly-cacheable input that's re-paid each turn. The without-arm also has large tail variance
														
 
															-(88–184s, 20–85 tools, up to 384k tokens) that codegraph removes. n=3 — the direction is unambiguous
														
 
															-(with beat without on every metric in every pair); treat magnitudes as a range.
														
 
															-
														
 
															-Reproduce: `AGENT_EVAL_OUT=<dir> scripts/agent-eval/run-all.sh <repo> "<Q>" headless` per run;
														
 
															-`scripts/agent-eval/parse-run.mjs <jsonl>` for per-run reads/tools/tokens/cost.
														
 
+| repo | time with→without | tool calls with→without | tokens with→without (saved) | cost with→without (saved) |
+|---|---|--:|--:|--:|
+| vscode | 1m10s→2m26s | 8→55 | 601k→2.8M (78%) | $0.60→$0.80 (26%) |
+| excalidraw | 48s→2m58s | 3→79 | 344k→3.5M (90%) | $0.43→$0.90 (52%) |
+| django | 1m19s→1m38s | 9→19 | 739k→1.2M (36%) | $0.59→$0.67 (12%) |
+| tokio | 53s→3m2s | 4→53 | 379k→2.6M (86%) | $0.42→$2.41 (82%) |
+| okhttp | 42s→1m1s | 6→11 | 636k→730k (13%) | $0.47→$0.47 (2%) |
+| gin | 44s→1m0s | 6→10 | 444k→675k (34%) | $0.37→$0.47 (21%) |
+| alamofire | 1m17s→2m27s | 12→69 | 1.0M→2.8M (64%) | $0.61→$1.14 (47%) |
														
 
															+
														
 
															+**Average saved: 35% cost · 57% tokens · 46% time · 71% tool calls** — reproduces the published
														
 
															+README headline (35% / 59% / 49% / 70%); the current build holds the benchmark with no regression.
														
 
															+
														
 
+**Cost is lower, not "flat"** (corrects the earlier note). But the **mechanism is volume, not
+cache-ability**: codegraph answers in far fewer turns over a much smaller accumulated context, while
+the without-arm fans out across many more turns (53–79 tool calls on vscode, excalidraw, tokio, and
+alamofire), each re-processing a large, growing context. The without-arm's token volume is *mostly*
+cheap cache-reads, which is why **token-count savings (57%) look bigger than cost savings (35%)**.
+Per-repo margin tracks how hard the without-arm thrashes on that repo (tokio blew up to $2.41 and
+3m2s; django thrashed less).
														
 
															+
														
 
															+**Measurement gotcha:** `result.usage` in this Claude Code version is the **last turn only**, not
														
 
															+cumulative — using it under-counts tokens badly (an earlier excalidraw cut reported "−34% tokens"
														
 
															+off this bug; the real figure is ~90%). Sum **per-turn assistant `usage`** for the true total.
														
 
															+`total_cost_usd` and `duration_ms` are already cumulative/correct.
														
 
															+
														
 
+Reproduce:
+```bash
+bash scripts/agent-eval/bench-readme.sh      # 7 repos × with/without × 4 runs (RUNS=4) → /tmp/ab-readme
+node scripts/agent-eval/parse-bench-readme.mjs   # medians + % saved (summed per-turn tokens)
+```
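
Reviewer note: the headline averages can be re-derived from the table itself. The sketch below is only a sanity check, not part of the commit. The arrays are the rounded per-repo "saved" percentages copied from the cost and tokens columns, and `avg` is assumed to behave like the helper of the same name in `parse-bench-readme.mjs` (mean of the rounded percentages, rounded again).

```javascript
// Rounded per-repo "saved" percentages, copied from the table
// (order: vscode, excalidraw, django, tokio, okhttp, gin, alamofire).
const saved = {
  cost: [26, 52, 12, 82, 2, 21, 47],
  tokens: [78, 90, 36, 86, 13, 34, 64],
};

// Mean of the percentages, rounded to a whole percent; 0 for an empty list.
const avg = (a) => (a.length ? Math.round(a.reduce((s, x) => s + x, 0) / a.length) : 0);

console.log(`cost ${avg(saved.cost)}% · tokens ${avg(saved.tokens)}%`);
// prints "cost 35% · tokens 57%"
```

Both values match the "Average saved" line, which suggests the headline is an unweighted mean over the 7 repos rather than a ratio of summed totals.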
														
--- a/scripts/agent-eval/bench-readme.sh
+++ b/scripts/agent-eval/bench-readme.sh
@@ -0,0 +1,28 @@
 
+#!/usr/bin/env bash
+# Re-run the README "Benchmark Results" A/B (with vs without codegraph) on the
+# current build: the 7 README repos, same queries, RUNS per arm (default 4).
+# Output → /tmp/ab-readme/<repo>/run<n>/run-headless-{with,without}.jsonl
+# Aggregate with parse-bench-readme.mjs. Repos must be cloned + indexed under
+# $CORPUS (default /tmp/codegraph-corpus) by the build under test.
+set -uo pipefail
+H="$(cd "$(dirname "$0")" && pwd)"
+C="${CORPUS:-/tmp/codegraph-corpus}"
+RUNS="${RUNS:-4}"
+ROWS=(
+"vscode|How does the extension host communicate with the main process?"
+"excalidraw|How does Excalidraw render and update canvas elements?"
+"django|How does Django's ORM build and execute a query from a QuerySet?"
+"tokio|How does tokio schedule and run async tasks on its runtime?"
+"okhttp|How does OkHttp process a request through its interceptor chain?"
+"gin|How does gin route requests through its middleware chain?"
+"alamofire|How does Alamofire build, send, and validate a request?"
+)
+echo "### README A/B START $(date) RUNS=$RUNS"
+for row in "${ROWS[@]}"; do
+  repo="${row%%|*}"; q="${row#*|}"
+  echo "===== $repo ====="
+  for run in $(seq 1 "$RUNS"); do
+    AGENT_EVAL_OUT="/tmp/ab-readme/$repo/run$run" bash "$H/run-all.sh" "$C/$repo" "$q" headless 2>&1 | grep -E "exit [0-9]" || echo "  run$run: (no exit line)"
+  done
+done
+echo "### README A/B DONE $(date)"
														
--- a/scripts/agent-eval/parse-bench-readme.mjs
+++ b/scripts/agent-eval/parse-bench-readme.mjs
@@ -0,0 +1,67 @@
 
+#!/usr/bin/env node
+// Aggregate the README A/B (bench-readme.sh output): per repo, median of N runs
+// per arm → time, tool calls, tokens, cost, and % saved. Plus an average row.
+//
+// Tokens = SUM of per-turn assistant `usage` (input + output + cache read +
+// cache creation), i.e. the cumulative "total tokens processed". NOTE: `result.usage`
+// is last-turn-only in current Claude Code, so it under-counts badly; don't use it.
+// `total_cost_usd` and `duration_ms` are already cumulative.
+//
+// Usage: node parse-bench-readme.mjs [/tmp/ab-readme]
+import { readFileSync, existsSync, readdirSync } from 'fs';
+import { join } from 'path';
+const ROOT = process.argv[2] || '/tmp/ab-readme';
+const REPOS = ['vscode', 'excalidraw', 'django', 'tokio', 'okhttp', 'gin', 'alamofire'];
+
+function parse(file) {
+  if (!existsSync(file)) return null;
+  const L = readFileSync(file, 'utf8').split('\n').filter(Boolean);
+  let tools = 0, reads = 0, grep = 0, cg = 0, tokens = 0, r = null;
+  for (const l of L) { let e; try { e = JSON.parse(l); } catch { continue; }
+    if (e.type === 'assistant') {
+      const u = e.message?.usage;
+      if (u) tokens += (u.input_tokens || 0) + (u.output_tokens || 0) + (u.cache_read_input_tokens || 0) + (u.cache_creation_input_tokens || 0);
+      for (const b of (e.message?.content || [])) if (b.type === 'tool_use') {
+        const n = b.name;
+        if (n === 'ToolSearch') continue;
+        tools++;
+        if (n === 'Read') reads++;
+        else if (n === 'Grep' || n === 'Glob') grep++;
+        else if (/codegraph/.test(n)) cg++;
+      }
+    }
+    if (e.type === 'result') r = e;
+  }
+  if (!r || r.subtype !== 'success') return null;
+  return { dur: r.duration_ms / 1000, tools, reads, grep, cg, tokens, cost: r.total_cost_usd || 0 };
+}
														
 
+const median = (arr) => { const v = [...arr].sort((a, b) => a - b); const n = v.length; return n === 0 ? 0 : n % 2 ? v[(n - 1) / 2] : (v[n / 2 - 1] + v[n / 2]) / 2; };
+// Round to whole seconds first so 119.6s prints "2m 0s", never "1m 60s".
+const fmtTime = (s) => { const t = Math.round(s); return t >= 60 ? `${Math.floor(t / 60)}m ${t % 60}s` : `${t}s`; };
+const fmtTok = (t) => t >= 1e6 ? `${(t / 1e6).toFixed(1)}M` : `${Math.round(t / 1000)}k`;
+// Percent saved by the with-arm relative to the without-arm (0 when there is no baseline).
+const pct = (w, wo) => wo > 0 ? Math.round((1 - w / wo) * 100) : 0;
+
+console.log('repo        n(w/wo)  time WITH→WITHOUT      tools W→WO   tokens W→WO (saved)     cost W→WO (saved)');
+const savings = { cost: [], tokens: [], time: [], tools: [] };
+for (const repo of REPOS) {
+  const dir = join(ROOT, repo);
+  const runDirs = existsSync(dir) ? readdirSync(dir).filter(d => /^run\d+$/.test(d)) : [];
+  const W = [], WO = [];
+  for (const rd of runDirs) {
+    const w = parse(join(dir, rd, 'run-headless-with.jsonl')); if (w) W.push(w);
+    const wo = parse(join(dir, rd, 'run-headless-without.jsonl')); if (wo) WO.push(wo);
+  }
+  if (!W.length || !WO.length) { console.log(`${repo.padEnd(11)} (incomplete: w=${W.length} wo=${WO.length})`); continue; }
+  const m = (arr, k) => median(arr.map(x => x[k]));
+  const wT = m(W, 'dur'), woT = m(WO, 'dur'), wTok = m(W, 'tokens'), woTok = m(WO, 'tokens');
+  const wC = m(W, 'cost'), woC = m(WO, 'cost'), wTl = m(W, 'tools'), woTl = m(WO, 'tools');
+  savings.time.push(pct(wT, woT)); savings.tokens.push(pct(wTok, woTok)); savings.cost.push(pct(wC, woC)); savings.tools.push(pct(wTl, woTl));
+  console.log(
+    `${repo.padEnd(11)} ${W.length}/${WO.length}      ` +
+    `${(fmtTime(wT) + '→' + fmtTime(woT)).padEnd(22)}` +
+    `${(Math.round(wTl) + '→' + Math.round(woTl)).padEnd(12)}` +
+    `${(fmtTok(wTok) + '→' + fmtTok(woTok) + ' (' + pct(wTok, woTok) + '%)').padEnd(24)}` +
+    `$${wC.toFixed(2)}→$${woC.toFixed(2)} (${pct(wC, woC)}%)`
+  );
+}
+const avg = (a) => a.length ? Math.round(a.reduce((s, x) => s + x, 0) / a.length) : 0;
+console.log(`\nAVERAGE saved:  cost ${avg(savings.cost)}%  ·  tokens ${avg(savings.tokens)}%  ·  time ${avg(savings.time)}%  ·  tool calls ${avg(savings.tools)}%`);
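
Reviewer note: the measurement gotcha is easiest to see on a tiny transcript. The sketch below is illustrative only and is not part of the commit. `countTokens` is a hypothetical helper, and the event shape (`type`, `message.usage`, and a top-level `usage` on the `result` event) is assumed from what `parse-bench-readme.mjs` reads. The sample numbers are made up.

```javascript
// Hypothetical helper: contrast the summed per-turn total with the value
// carried on the final `result` event (assumed to describe the last turn only).
function countTokens(jsonlText) {
  const tokensIn = (u) =>
    (u.input_tokens || 0) + (u.output_tokens || 0) +
    (u.cache_read_input_tokens || 0) + (u.cache_creation_input_tokens || 0);
  let total = 0;
  let lastTurnOnly = 0;
  for (const line of jsonlText.split('\n').filter(Boolean)) {
    let event;
    try { event = JSON.parse(line); } catch { continue; } // skip non-JSON lines
    if (event.type === 'assistant' && event.message?.usage) total += tokensIn(event.message.usage);
    if (event.type === 'result' && event.usage) lastTurnOnly = tokensIn(event.usage);
  }
  return { total, lastTurnOnly };
}

// Made-up two-turn transcript; the result event repeats the second turn's usage.
const sample = [
  { type: 'assistant', message: { usage: { input_tokens: 100, output_tokens: 10, cache_read_input_tokens: 1000 } } },
  { type: 'assistant', message: { usage: { input_tokens: 50, output_tokens: 20, cache_read_input_tokens: 2000 } } },
  { type: 'result', usage: { input_tokens: 50, output_tokens: 20, cache_read_input_tokens: 2000 } },
].map((e) => JSON.stringify(e)).join('\n');

const counts = countTokens(sample);
console.log(counts); // prints { total: 3180, lastTurnOnly: 2070 }
```

In this toy case the last-turn figure misses about a third of the tokens. On a run with dozens of turns, as in the without-arm, the gap would be far larger, which is consistent with the "−34%" versus ~90% discrepancy described in the benchmark notes.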