
docs(benchmarks): current-build A/B on all 7 README repos + fix token-measurement bug

Re-ran the README benchmark on the current build (7 repos reindexed, median of 4): avg 35% cost / 57% tokens / 46% time / 71% tool calls saved — reproduces the published README (35/59/49/70), no regression. Adds bench-readme.sh + parse-bench-readme.mjs harness.

Fixes a token-measurement bug: result.usage is last-turn-only in current Claude Code, so cumulative tokens must be computed by summing per-turn assistant usage. Corrects the earlier excalidraw note (its '-34% tokens' was an artifact of this bug; the real figure is ~90%) and the cost MECHANISM (volume/fewer turns, not cache-ability: the without-arm's huge token volume is mostly cheap cache-reads, so token savings of 57% exceed cost savings of 35%). Cost and time were always correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Colby McHenry 1 month ago
parent
commit
03069c9118

+ 1 - 1
CLAUDE.md

@@ -92,7 +92,7 @@ Cursor launches MCP subprocesses with the wrong cwd and doesn't pass `rootUri` i
 
 
 ## Retrieval performance & dynamic-dispatch coverage (do not regress)
 
 
-CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *don't optimize for token cost*. (Cost is **neutral-to-lower** in practice, not "flat" as earlier framing claimed: a with-vs-without A/B on excalidraw (n=3) ran **3× faster, 15× fewer tool calls, ~0 vs 23 reads, and −40% cost** — compact codegraph answers cache well across turns, while the without-arm's read/grep thrash creates fresh, poorly-cacheable input. See `docs/benchmarks/call-sequence-analysis.md`.) The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
+CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *don't optimize for token cost*. (Cost is **lower**, not "flat" as earlier framing claimed: a current-build with-vs-without A/B across the 7 README repos, median of 4, saved on average **35% cost · 57% tokens · 46% time · 71% tool calls** — reproducing the published README. The mechanism is **far fewer turns over a much smaller accumulated context** — NOT cache-ability: the without-arm's huge token volume is *mostly* cheap cache-reads, which is why token-count savings (57%) look bigger than cost savings (35%). Measure tokens by **summing per-turn assistant usage**, not `result.usage` (last-turn only in current Claude Code). See `docs/benchmarks/call-sequence-analysis.md`.) The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
 
 
 **Target behavior:** a flow question resolves in **1 codegraph call on small repos, scaling to 3–5 on large**, with **Read/Grep = 0**. When reviewing a PR or trying something new, do not regress this.
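
The claim in the changed paragraph, that token-count savings exceed cost savings because the without-arm's volume is mostly cheap cache-reads, can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the token mixes are invented, and the relative prices (cache reads at a tenth of fresh input, cache writes at 1.25×, output at 5×) are assumptions for the sake of the example, not figures from the benchmark.

```javascript
// Hypothetical relative prices per token (fresh input = 1 unit). Assumptions, not measured values.
const PRICE = { input: 1.0, cacheRead: 0.1, cacheWrite: 1.25, output: 5.0 };

// Invented token mixes for one run of each arm; the without-arm is dominated by cache reads.
const withArm = { input: 1000, cacheRead: 300000, cacheWrite: 40000, output: 5000 };
const withoutArm = { input: 2000, cacheRead: 2000000, cacheWrite: 60000, output: 9000 };

// Raw token volume: every token counts the same.
const totalTokens = (a) => a.input + a.cacheRead + a.cacheWrite + a.output;

// Cost: each token class is weighted by its (assumed) price.
const cost = (a) =>
  a.input * PRICE.input + a.cacheRead * PRICE.cacheRead +
  a.cacheWrite * PRICE.cacheWrite + a.output * PRICE.output;

// Percent saved by the with-arm relative to the without-arm.
const saved = (w, wo) => Math.round((1 - w / wo) * 100);

const tokenSaved = saved(totalTokens(withArm), totalTokens(withoutArm));
const costSaved = saved(cost(withArm), cost(withoutArm));
console.log(`tokens saved: ${tokenSaved}%  cost saved: ${costSaved}%`);
```

Because cache reads dominate the without-arm's volume but carry the lowest price, removing them shrinks the token count much more than it shrinks the bill, which matches the direction of the 57% vs 35% gap reported in the diff.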
 
 

+ 33 - 24
docs/benchmarks/call-sequence-analysis.md

@@ -389,29 +389,38 @@ node scripts/agent-eval/parse-arms.mjs
 
 
 ---
 
 
-# Current-build with/without A/B (excalidraw, 2026-05-24)
+# Current-build with/without A/B — the 7 README repos (2026-05-24)
 
 
-After this session's changes (self-sufficient trace + explore-flow + line numbers), a fresh
-with-vs-without A/B on excalidraw — same flow question, **n=3 per arm** (headless: codegraph-only MCP
-vs empty MCP):
+Re-ran the published README benchmark on the **current build** (all 7 repos freshly reindexed),
+same queries, **median of 4 runs/arm** (headless: codegraph-only MCP vs empty MCP):
 
 
-| metric | with codegraph | without | delta |
-|---|--:|--:|--:|
-| duration | 49s [43–53] | 145s [88–184] | **3.0× faster** |
-| total tool calls | 3.3 [3–4] | 49.3 [20–85] | 15× fewer |
-| Reads | 0.3 [0–1] | 23.3 [9–39] | ~0 vs 23 |
-| Grep/Glob | 0.0 | 14.3 [11–21] | eliminated |
-| codegraph calls | 3.0 | 0 | the trade |
-| tokens in | 120k [105–149] | 181k [78–384] | −34% |
-| tokens out | 5.9k | 8.8k | −33% |
-| **cost** | **$0.405** | **$0.678** | **−40%** |
-
-**Cost is neutral-to-lower, NOT flat** — correcting the earlier "cost stays ~flat" framing. Every one
-of the 3 without-runs cost more than every with-run. Mechanism = caching: compact codegraph answers
-(3 calls) cache well across turns, while the without-arm's 23 reads + 14 greps create fresh,
-poorly-cacheable input that's re-paid each turn. The without-arm also has large tail variance
-(88–184s, 20–85 tools, up to 384k tokens) that codegraph removes. n=3 — the direction is unambiguous
-(with beat without on every metric in every pair); treat magnitudes as a range.
-
-Reproduce: `AGENT_EVAL_OUT=<dir> scripts/agent-eval/run-all.sh <repo> "<Q>" headless` per run;
-`scripts/agent-eval/parse-run.mjs <jsonl>` for per-run reads/tools/tokens/cost.
+| repo | time with→without | tools w→wo | tokens w→wo (saved) | cost w→wo (saved) |
+|---|---|--:|--:|--:|
+| vscode | 1m10s→2m26s | 8→55 | 601k→2.8M (78%) | $0.60→$0.80 (26%) |
+| excalidraw | 48s→2m58s | 3→79 | 344k→3.5M (90%) | $0.43→$0.90 (52%) |
+| django | 1m19s→1m38s | 9→19 | 739k→1.2M (36%) | $0.59→$0.67 (12%) |
+| tokio | 53s→3m2s | 4→53 | 379k→2.6M (86%) | $0.42→$2.41 (82%) |
+| okhttp | 42s→1m1s | 6→11 | 636k→730k (13%) | $0.47→$0.47 (2%) |
+| gin | 44s→1m0s | 6→10 | 444k→675k (34%) | $0.37→$0.47 (21%) |
+| alamofire | 1m17s→2m27s | 12→69 | 1.0M→2.8M (64%) | $0.61→$1.14 (47%) |
+
+**Average saved: 35% cost · 57% tokens · 46% time · 71% tool calls** — reproduces the published
+README headline (35% / 59% / 49% / 70%); the current build holds the benchmark with no regression.
+
+**Cost is lower, not "flat"** (corrects the earlier note). But the **mechanism is volume, not
+cache-ability**: codegraph answers in far fewer turns over a much smaller accumulated context, while
+the without-arm fans out across many more turns (55–79 tool calls on the big repos), each
+re-processing a large, growing context. The without-arm's token volume is *mostly* cheap cache-reads,
+which is why **token-count savings (57%) look bigger than cost savings (35%)**. Per-repo margin tracks
+how hard the without-arm thrashes that run (tokio blew up to $2.41/3m; django thrashed less).
+
+**Measurement gotcha:** `result.usage` in this Claude Code version covers the **last turn only**, not
+the whole run, so using it badly under-counts tokens (an earlier excalidraw cut reported "−34% tokens"
+because of this bug; the real figure is ~90%). Sum **per-turn assistant `usage`** for the true total.
+`total_cost_usd` and `duration_ms` are already cumulative and correct.
+
+Reproduce:
+```bash
+bash scripts/agent-eval/bench-readme.sh      # 7 repos × with/without × 4 runs (RUNS=4) → /tmp/ab-readme
+node scripts/agent-eval/parse-bench-readme.mjs   # medians + % saved (summed per-turn tokens)
+```
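
The measurement gotcha described in this hunk can be illustrated with a small sketch. The event shapes (`type: 'assistant'`, `message.usage`, `type: 'result'`) follow the fields read by `parse-bench-readme.mjs` in this commit; the token numbers are made up, and the helper names `sumPerTurnTokens` and `resultUsageTokens` are hypothetical.

```javascript
// Illustrative stream-json events. All token counts are invented.
const turn = (input, output, cacheRead, cacheCreate) => ({
  input_tokens: input,
  output_tokens: output,
  cache_read_input_tokens: cacheRead,
  cache_creation_input_tokens: cacheCreate,
});
const lines = [
  { type: 'assistant', message: { usage: turn(10, 200, 50000, 4000) } },
  { type: 'assistant', message: { usage: turn(8, 300, 54000, 2000) } },
  // Assumption: result.usage mirrors only the final turn, as the note describes.
  { type: 'result', subtype: 'success', usage: turn(8, 300, 54000, 2000) },
].map((e) => JSON.stringify(e));

// Total tokens in one usage object (missing fields count as 0).
const totalOf = (u) =>
  (u?.input_tokens || 0) + (u?.output_tokens || 0) +
  (u?.cache_read_input_tokens || 0) + (u?.cache_creation_input_tokens || 0);

// Cumulative tokens: sum usage over every assistant turn.
function sumPerTurnTokens(jsonl) {
  let tokens = 0;
  for (const l of jsonl) {
    const e = JSON.parse(l);
    if (e.type === 'assistant') tokens += totalOf(e.message?.usage);
  }
  return tokens;
}

// What you get by reading only the result event's usage.
function resultUsageTokens(jsonl) {
  const r = jsonl.map((l) => JSON.parse(l)).find((e) => e.type === 'result');
  return totalOf(r?.usage);
}

console.log('summed per turn:', sumPerTurnTokens(lines)); // 110518
console.log('result.usage:   ', resultUsageTokens(lines)); // 56308
```

Even in this two-turn toy run the last-turn figure is roughly half the true total; the gap widens as the number of turns grows, which is why the without-arm was under-counted the most.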

+ 28 - 0
scripts/agent-eval/bench-readme.sh

@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+# Re-run the README "Benchmark Results" A/B (with vs without codegraph) on the
+# current build: the 7 README repos, same queries, RUNS per arm (default 4).
+# Output → /tmp/ab-readme/<repo>/run<n>/run-headless-{with,without}.jsonl
+# Aggregate with parse-bench-readme.mjs. Repos must be cloned + indexed under
+# $CORPUS (default /tmp/codegraph-corpus) by the build under test.
+set -uo pipefail
+H="$(cd "$(dirname "$0")" && pwd)"
+C="${CORPUS:-/tmp/codegraph-corpus}"
+RUNS="${RUNS:-4}"
+ROWS=(
+"vscode|How does the extension host communicate with the main process?"
+"excalidraw|How does Excalidraw render and update canvas elements?"
+"django|How does Django's ORM build and execute a query from a QuerySet?"
+"tokio|How does tokio schedule and run async tasks on its runtime?"
+"okhttp|How does OkHttp process a request through its interceptor chain?"
+"gin|How does gin route requests through its middleware chain?"
+"alamofire|How does Alamofire build, send, and validate a request?"
+)
+echo "### README A/B START $(date) RUNS=$RUNS"
+for row in "${ROWS[@]}"; do
+  repo="${row%%|*}"; q="${row#*|}"
+  echo "===== $repo ====="
+  for run in $(seq 1 "$RUNS"); do
+    AGENT_EVAL_OUT="/tmp/ab-readme/$repo/run$run" bash "$H/run-all.sh" "$C/$repo" "$q" headless 2>&1 | grep -E "exit [0-9]" || echo "  run$run: (no exit line)"
+  done
+done
+echo "### README A/B DONE $(date)"
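
The script splits each `repo|question` row with two parameter expansions: `${row%%|*}` strips the longest suffix beginning at a `|` (leaving the repo name), and `${row#*|}` strips the shortest prefix ending at a `|` (leaving the question, even if it contains further `|` characters). A rough JavaScript equivalent is sketched below; `splitRow` and `outDir` are hypothetical helpers written for illustration, not part of the commit.

```javascript
// Split a "repo|question" row at the FIRST '|', mirroring ${row%%|*} and ${row#*|}.
function splitRow(row) {
  const i = row.indexOf('|');
  // With no '|' present, both bash expansions leave the string unchanged.
  if (i === -1) return { repo: row, q: row };
  return { repo: row.slice(0, i), q: row.slice(i + 1) };
}

// Output directory the loop assigns to AGENT_EVAL_OUT for a given repo and run number.
const outDir = (repo, run) => `/tmp/ab-readme/${repo}/run${run}`;

const { repo, q } = splitRow('gin|How does gin route requests through its middleware chain?');
console.log(repo);            // gin
console.log(q);               // How does gin route requests through its middleware chain?
console.log(outDir(repo, 1)); // /tmp/ab-readme/gin/run1
```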

+ 67 - 0
scripts/agent-eval/parse-bench-readme.mjs

@@ -0,0 +1,67 @@
+#!/usr/bin/env node
+// Aggregate the README A/B (bench-readme.sh output): per repo, median of N runs
+// per arm → time, tool calls, tokens, cost, and % saved. Plus an average row.
+//
+// Tokens = SUM of per-turn assistant `usage` (input + output + cache read +
+// cache creation) — the cumulative "total tokens processed". NOTE: `result.usage`
+// is last-turn-only in current Claude Code, so it under-counts badly; don't use it.
+// `total_cost_usd` and `duration_ms` are already cumulative.
+//
+// Usage: node parse-bench-readme.mjs [/tmp/ab-readme]
+import { readFileSync, existsSync, readdirSync } from 'fs';
+import { join } from 'path';
+const ROOT = process.argv[2] || '/tmp/ab-readme';
+const REPOS = ['vscode', 'excalidraw', 'django', 'tokio', 'okhttp', 'gin', 'alamofire'];
+
+function parse(file) {
+  if (!existsSync(file)) return null;
+  const L = readFileSync(file, 'utf8').split('\n').filter(Boolean);
+  let tools = 0, reads = 0, grep = 0, cg = 0, tokens = 0, r = null;
+  for (const l of L) { let e; try { e = JSON.parse(l); } catch { continue; }
+    if (e.type === 'assistant') {
+      const u = e.message?.usage;
+      if (u) tokens += (u.input_tokens || 0) + (u.output_tokens || 0) + (u.cache_read_input_tokens || 0) + (u.cache_creation_input_tokens || 0);
+      for (const b of (e.message?.content || [])) if (b.type === 'tool_use') {
+        const n = b.name;
+        if (n === 'ToolSearch') continue;
+        tools++;
+        if (n === 'Read') reads++;
+        else if (n === 'Grep' || n === 'Glob') grep++;
+        else if (/codegraph/.test(n)) cg++;
+      }
+    }
+    if (e.type === 'result') r = e;
+  }
+  if (!r || r.subtype !== 'success') return null;
+  return { dur: r.duration_ms / 1000, tools, reads, grep, cg, tokens, cost: r.total_cost_usd || 0 };
+}
+const median = (arr) => { const v = [...arr].sort((a, b) => a - b); const n = v.length; return n === 0 ? 0 : n % 2 ? v[(n - 1) / 2] : (v[n / 2 - 1] + v[n / 2]) / 2; };
+const fmtTime = (s) => { const t = Math.round(s); return t >= 60 ? `${Math.floor(t / 60)}m ${t % 60}s` : `${t}s`; }; // round first so 119.6s prints as "2m 0s", not "1m 60s"
+const fmtTok = (t) => t >= 1e6 ? `${(t / 1e6).toFixed(1)}M` : `${Math.round(t / 1000)}k`;
+const pct = (w, wo) => wo > 0 ? Math.round((1 - w / wo) * 100) : 0;
+
+console.log('repo        n(w/wo)  time WITH→WITHOUT      tools W→WO   tokens W→WO (saved)     cost W→WO (saved)');
+const savings = { cost: [], tokens: [], time: [], tools: [] };
+for (const repo of REPOS) {
+  const dir = join(ROOT, repo);
+  const runDirs = existsSync(dir) ? readdirSync(dir).filter(d => /^run\d+$/.test(d)) : [];
+  const W = [], WO = [];
+  for (const rd of runDirs) {
+    const w = parse(join(dir, rd, 'run-headless-with.jsonl')); if (w) W.push(w);
+    const wo = parse(join(dir, rd, 'run-headless-without.jsonl')); if (wo) WO.push(wo);
+  }
+  if (!W.length || !WO.length) { console.log(`${repo.padEnd(11)} (incomplete: w=${W.length} wo=${WO.length})`); continue; }
+  const m = (arr, k) => median(arr.map(x => x[k]));
+  const wT = m(W, 'dur'), woT = m(WO, 'dur'), wTok = m(W, 'tokens'), woTok = m(WO, 'tokens');
+  const wC = m(W, 'cost'), woC = m(WO, 'cost'), wTl = m(W, 'tools'), woTl = m(WO, 'tools');
+  savings.time.push(pct(wT, woT)); savings.tokens.push(pct(wTok, woTok)); savings.cost.push(pct(wC, woC)); savings.tools.push(pct(wTl, woTl));
+  console.log(
+    `${repo.padEnd(11)} ${W.length}/${WO.length}      ` +
+    `${(fmtTime(wT) + '→' + fmtTime(woT)).padEnd(22)}` +
+    `${(Math.round(wTl) + '→' + Math.round(woTl)).padEnd(12)}` +
+    `${(fmtTok(wTok) + '→' + fmtTok(woTok) + ' (' + pct(wTok, woTok) + '%)').padEnd(24)}` +
+    `$${wC.toFixed(2)}→$${woC.toFixed(2)} (${pct(wC, woC)}%)`
+  );
+}
+const avg = (a) => a.length ? Math.round(a.reduce((s, x) => s + x, 0) / a.length) : 0;
+console.log(`\nAVERAGE saved:  cost ${avg(savings.cost)}%  ·  tokens ${avg(savings.tokens)}%  ·  time ${avg(savings.time)}%  ·  tool calls ${avg(savings.tools)}%`);
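
As a sanity check on the aggregation, the sketch below copies the `median`, `pct`, and `avg` helpers from the script and feeds `avg` the rounded per-repo percentages from the table in `call-sequence-analysis.md`. It assumes the published table values are the rounded outputs of `pct`. The headline figure is a mean of per-repo percentages, not a percentage computed from pooled totals.

```javascript
// Helpers copied from parse-bench-readme.mjs.
const median = (arr) => {
  const v = [...arr].sort((a, b) => a - b);
  const n = v.length;
  return n === 0 ? 0 : n % 2 ? v[(n - 1) / 2] : (v[n / 2 - 1] + v[n / 2]) / 2;
};
const pct = (w, wo) => (wo > 0 ? Math.round((1 - w / wo) * 100) : 0);
const avg = (a) => (a.length ? Math.round(a.reduce((s, x) => s + x, 0) / a.length) : 0);

// Per-repo "% saved" columns as published in the table (vscode ... alamofire).
const saved = {
  cost: [26, 52, 12, 82, 2, 21, 47],
  tokens: [78, 90, 36, 86, 13, 34, 64],
};

console.log(avg(saved.cost), avg(saved.tokens)); // 35 57
console.log(median([3, 1, 2]), median([4, 1, 3, 2])); // 2 2.5
console.log(pct(0.43, 0.9)); // 52 (excalidraw cost row)
```

With an even run count such as the default `RUNS=4`, `median` averages the two middle runs, so a single outlier run in either arm does not move the reported figure.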