docs: correct 'cost stays flat' → neutral-to-lower (excalidraw with/without A/B)

Fresh with-vs-without A/B on excalidraw (current build, n=3): 3x faster (49s vs 145s), 15x fewer tool calls, ~0 vs 23 reads, and -40% cost ($0.41 vs $0.68). Cost is neutral-to-lower, not flat — compact codegraph answers cache across turns while the without-arm's read/grep thrash is fresh, poorly-cacheable input. Recorded in call-sequence-analysis.md; corrected the CLAUDE.md optimization-target note (still: don't optimize for cost; target wall-clock + tool-call count).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Colby McHenry, 1 month ago
Parent
commit
e0454988a6
2 files changed, with 30 additions and 1 deletion
  1. CLAUDE.md (1 addition, 1 deletion)
  2. docs/benchmarks/call-sequence-analysis.md (29 additions, 0 deletions)

+ 1 - 1
CLAUDE.md

@@ -92,7 +92,7 @@ Cursor launches MCP subprocesses with the wrong cwd and doesn't pass `rootUri` i
 
 ## Retrieval performance & dynamic-dispatch coverage (do not regress)
 
-CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *not* token cost (cost stays ~flat; codegraph calls trade for reads). The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
+CodeGraph's core value is letting an agent answer **structural/flow** questions ("how does X reach Y", trace, impact, callers) with a few **fast** codegraph calls and **zero Read/Grep**. The optimization target is **wall-clock latency + tool-call count** — *don't optimize for token cost*. (Cost is **neutral-to-lower** in practice, not "flat" as earlier framing claimed: a with-vs-without A/B on excalidraw (n=3) ran **3× faster, 15× fewer tool calls, ~0 vs 23 reads, and −40% cost** — compact codegraph answers cache well across turns, while the without-arm's read/grep thrash creates fresh, poorly-cacheable input. See `docs/benchmarks/call-sequence-analysis.md`.) The mechanism that drives everything here: **an agent falls back to Read/Grep the instant a codegraph answer is insufficient.** So every change is judged by one question — is codegraph's answer sufficient enough to *stop* the agent from reading?
 
 **Target behavior:** a flow question resolves in **1 codegraph call on small repos, scaling to 3–5 on large**, with **Read/Grep = 0**. When reviewing a PR or trying something new, do not regress this.
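
The target behavior stated above can be written as a small check over one run's tool-call tally. This is a minimal sketch, not code from the repository: `RunTally`, its fields, the `large_repo` flag, and `meets_target` are all hypothetical names. The thresholds come straight from the note (1 codegraph call on small repos, up to 5 on large, Read/Grep = 0). Treating fewer than 3 calls on a large repo as acceptable is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RunTally:
    """Hypothetical per-run tool-call counts (field names are assumptions)."""
    codegraph_calls: int
    reads: int
    greps: int


def meets_target(run: RunTally, large_repo: bool) -> bool:
    """Return True if a flow-question run matches the stated target behavior."""
    # Ceiling from the note: 1 codegraph call on small repos, 3-5 on large.
    # Assumption: fewer than 3 calls on a large repo still counts as a pass.
    max_calls = 5 if large_repo else 1
    # Any Read or Grep means the agent fell back, which the target forbids.
    no_fallback = run.reads == 0 and run.greps == 0
    return no_fallback and 1 <= run.codegraph_calls <= max_calls


if __name__ == "__main__":
    clean = RunTally(codegraph_calls=3, reads=0, greps=0)
    fell_back = RunTally(codegraph_calls=3, reads=1, greps=0)
    print(meets_target(clean, large_repo=True))
    print(meets_target(fell_back, large_repo=True))
```

A check like this could gate a PR review: a run that needs even one Read counts as a regression, regardless of how few codegraph calls it made.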
 

+ 29 - 0
docs/benchmarks/call-sequence-analysis.md

@@ -386,3 +386,32 @@ ships and needs no steering.
 ARM=I bash scripts/agent-eval/arms-F.sh    # body-trace + destination callees, no steering
 node scripts/agent-eval/parse-arms.mjs
 ```
+
+---
+
+# Current-build with/without A/B (excalidraw, 2026-05-24)
+
+After this session's changes (self-sufficient trace + explore-flow + line numbers), a fresh
+with-vs-without A/B on excalidraw — same flow question, **n=3 per arm** (headless: codegraph-only MCP
+vs empty MCP):
+
+| metric | with codegraph | without | delta |
+|---|--:|--:|--:|
+| duration | 49s [43–53] | 145s [88–184] | **3.0× faster** |
+| total tool calls | 3.3 [3–4] | 49.3 [20–85] | 15× fewer |
+| Reads | 0.3 [0–1] | 23.3 [9–39] | ~0 vs 23 |
+| Grep/Glob | 0.0 | 14.3 [11–21] | eliminated |
+| codegraph calls | 3.0 | 0 | the trade |
+| tokens in | 120k [105–149] | 181k [78–384] | −34% |
+| tokens out | 5.9k | 8.8k | −33% |
+| **cost** | **$0.405** | **$0.678** | **−40%** |
+
+**Cost is neutral-to-lower, NOT flat** — correcting the earlier "cost stays ~flat" framing. Every one
+of the 3 without-runs cost more than every with-run. Mechanism = caching: compact codegraph answers
+(3 calls) cache well across turns, while the without-arm's 23 reads + 14 greps create fresh,
+poorly-cacheable input that's re-paid each turn. The without-arm also has large tail variance
+(88–184s, 20–85 tools, up to 384k tokens) that codegraph removes. With n=3, the direction is consistent
+(with beat without on duration, tool calls, and reads in every pair; the tokens-in ranges overlap, 105–149k vs 78–384k); treat magnitudes as a range.
+
+Reproduce: `AGENT_EVAL_OUT=<dir> scripts/agent-eval/run-all.sh <repo> "<Q>" headless` per run;
+`scripts/agent-eval/parse-run.mjs <jsonl>` for per-run reads/tools/tokens/cost.
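
The delta column in the table can be rechecked from the per-arm means. The sketch below is illustrative only and is not the repository's `parse-run.mjs`: the dictionary keys and helper names are hypothetical, the numbers are copied from the table, and the bracketed values are assumed to be min-max ranges across the three runs.

```python
# Per-arm means copied from the A/B table (keys are hypothetical labels).
WITH = {"duration_s": 49.0, "tool_calls": 3.3, "tokens_in_k": 120.0,
        "tokens_out_k": 5.9, "cost_usd": 0.405}
WITHOUT = {"duration_s": 145.0, "tool_calls": 49.3, "tokens_in_k": 181.0,
           "tokens_out_k": 8.8, "cost_usd": 0.678}


def speedup(metric: str) -> float:
    """Ratio of the without-arm mean to the with-arm mean."""
    return WITHOUT[metric] / WITH[metric]


def pct_change(metric: str) -> float:
    """Percent change of the with-arm mean relative to the without-arm mean."""
    return (WITH[metric] - WITHOUT[metric]) / WITHOUT[metric] * 100.0


def ranges_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed [min, max] ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    for metric in ("duration_s", "tool_calls"):
        print(f"{metric}: {speedup(metric):.1f}x")
    for metric in ("tokens_in_k", "tokens_out_k", "cost_usd"):
        print(f"{metric}: {pct_change(metric):.0f}%")
    # Assumed min-max ranges from the table: duration (s) and tokens in (k).
    print("duration ranges overlap:", ranges_overlap((43, 53), (88, 184)))
    print("tokens-in ranges overlap:", ranges_overlap((105, 149), (78, 384)))
```

The means reproduce the table's rounded deltas. The overlap check separates metrics where every with-run beat every without-run from metrics where only the averages differ, which matters at n=3.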