il y a 1 mois · e0454988a6
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -92,7 +92,7 @@ Cursor launches MCP subprocesses with the wrong cwd and doesn't pass `rootUri` i
 
				 
			
 
				 ## Retrieval performance & dynamic-dispatch coverage (do not regress)
			
 
				 
			
 
				-CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *not* token cost (cost stays ~flat; codegraph calls trade for reads). The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
			
 
				+CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *don't optimize for token cost*. (Cost is **neutral-to-lower** in practice, not "flat" as earlier framing claimed: a with-vs-without A/B on excalidraw (n=3) ran **3× faster, 15× fewer tool calls, ~0 vs 23 reads, and −40% cost** — compact codegraph answers cache well across turns, while the without-arm's read/grep thrash creates fresh, poorly-cacheable input. See `docs/benchmarks/call-sequence-analysis.md`.) The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
			
 
				 
			
 
				 **Target behavior:** a flow question resolves in **1 codegraph call on small repos, scaling to 3–5 on large**, with **Read/Grep = 0**. When reviewing a PR or trying something new, do not regress this.
			
 
				 
			
--- a/docs/benchmarks/call-sequence-analysis.md
+++ b/docs/benchmarks/call-sequence-analysis.md
@@ -386,3 +386,32 @@ ships and needs no steering.
 
				 ARM=I bash scripts/agent-eval/arms-F.sh    # body-trace + destination callees, no steering
			
 
				 node scripts/agent-eval/parse-arms.mjs
			
 
				 ```
			
 
				+
			
 
				+---
			
 
				+
			
 
				+# Current-build with/without A/B (excalidraw, 2026-05-24)
			
 
				+
			
 
				+After this session's changes (self-sufficient trace + explore-flow + line numbers), a fresh
			
 
				+with-vs-without A/B on excalidraw — same flow question, **n=3 per arm** (headless: codegraph-only MCP
			
 
				+vs empty MCP):
			
 
				+
			
 
				+| metric | with codegraph | without | delta |
			
 
				+|---|--:|--:|--:|
			
 
				+| duration | 49s [43–53] | 145s [88–184] | **3.0× faster** |
			
 
				+| total tool calls | 3.3 [3–4] | 49.3 [20–85] | 15× fewer |
			
 
				+| Reads | 0.3 [0–1] | 23.3 [9–39] | ~0 vs 23 |
			
 
				+| Grep/Glob | 0.0 | 14.3 [11–21] | eliminated |
			
 
				+| codegraph calls | 3.0 | 0 | the trade |
			
 
				+| tokens in | 120k [105–149] | 181k [78–384] | −34% |
			
 
				+| tokens out | 5.9k | 8.8k | −33% |
			
 
				+| **cost** | **$0.405** | **$0.678** | **−40%** |
			
 
				+
			
 
				+**Cost is neutral-to-lower, NOT flat** — correcting the earlier "cost stays ~flat" framing. Every one
			
 
				+of the 3 without-runs cost more than every with-run. Mechanism = caching: compact codegraph answers
			
 
				+(3 calls) cache well across turns, while the without-arm's 23 reads + 14 greps create fresh,
			
 
				+poorly-cacheable input that's re-paid each turn. The without-arm also has large tail variance
			
 
				+(88–184s, 20–85 tools, up to 384k tokens) that codegraph removes. n=3 — the direction is unambiguous
			
 
				+(with beat without on every metric in every pair); treat magnitudes as a range.
			
 
				+
			
 
				+Reproduce: `AGENT_EVAL_OUT=<dir> scripts/agent-eval/run-all.sh <repo> "<Q>" headless` per run;
			
 
				+`scripts/agent-eval/parse-run.mjs <jsonl>` for per-run reads/tools/tokens/cost.