1 month ago · a6183d7c83
--- a/docs/benchmarks/call-sequence-analysis.md
+++ b/docs/benchmarks/call-sequence-analysis.md
@@ -0,0 +1,388 @@
 
				+# Call-sequence analysis — why read savings don't convert to wall-clock
			
 
				+
			
 
				+**Date:** 2026-05-23 · **Branch:** `architectural-improvements` · **Source data:** the surviving
			
 
				+stream-json logs from the A/B matrix (`/tmp/ab-matrix/<Cell>/run-headless-{with,without}.jsonl`,
			
 
				+37 cells × 2 arms). Re-mined — **no re-runs** — with `scripts/agent-eval/seq-matrix.mjs`.
			
 
				+
			
 
				+## Why this exists
			
 
				+
			
 
				+The [A/B matrix](codegraph-ab-matrix.md) showed codegraph cuts **reads 75%** but **wall-clock only
			
 
				+~16%**, and 63% of the wall-clock win comes from just 3 large-repo cells. Reads are at the floor
			
 
				+(~0), so the remaining wall-clock is **round-trips + the synthesis turn** — neither of which read
			
 
				+count can explain. The matrix records tool *counts*, not the call **sequence** or per-call
			
 
				+**payload size**. This analysis recovers both, to find where the wall-clock actually goes.
			
 
				+
			
 
				+## TL;DR — the bottleneck is trace ADOPTION, not trace completeness
			
 
				+
			
 
				+1. **Trace is called in 3 of 37 cells** — even though every question is a canonical flow question
			
 
				+   ("trace the controller → service → repository", "how does X reach Y"). The agent overwhelmingly
			
 
				+   reaches for **`context → search → search → explore`** instead — the exact path-reconstruction
			
 
				+   anti-pattern the instructions tell it to avoid.
			
 
				+2. **`explore` averages 17.9K chars/call; `trace` averages 0.8K** — a **22× payload difference**.
			
 
				+   The path-scoped tool that solves the small-repo-bloat problem exists and is tiny. It's just not
			
 
				+   being invoked.
			
 
				+3. **Small repos still get bloated payloads** because of the explore-default: a **6-file** repo
			
 
				+   (`flutter_module_books`) pulls **17.4K**; a 10-file repo pulls 18.0K. This is precisely the
			
 
				+   "too much context on small codebases" failure mode — happening right now, via explore.
			
 
				+4. **Round-trips are 25% fewer with codegraph (283 vs 375 turns)** but wall-clock is only 16%
			
 
				+   faster — because the with-arm's turns each carry a ~18K explore payload, inflating TTFT and
			
 
				+   eroding the turn savings.
			
 
				+5. **Root cause:** `src/mcp/server-instructions.ts` leads with *"answer directly … `codegraph_context`
			
 
				+   first, then ONE `codegraph_explore`"* as the headline pattern. The trace-first guidance is buried
			
 
				+   in a table + a chain list below it. Agents anchor on the prominent headline → context→explore.
			
 
				+
			
 
				+**Decision:** the next experiment is **trace-first steering / adoption**, not enriching trace. We
			
 
				+can't evaluate trace's completeness when it's used 3/37 times. Get adoption up first, then measure
			
 
				+whether the residual `node`/`explore` follow-ups need a richer trace.
			
 
				+
			
 
				+## Finding 1 — trace adoption: 3/37
			
 
				+
			
 
				+| metric | value |
			
 
				+|---|---|
			
 
				+| flow-question cells | 37 (all of them) |
			
 
				+| cells that called `codegraph_trace` | **3** (`cpp-leveldb`, `excalidraw`, `c-redis`) |
			
 
				+| dominant pattern instead | `context` → `search`×N → `explore` |
			
 
				+
			
 
				+The 3 trace cells, and what followed the trace call:
			
 
				+
			
 
				+| repo | files | cg sequence | turns (with/without) |
			
 
				+|---|--:|---|---|
			
 
				+| cpp-leveldb | 134 | `trace, node, node` | 5 / 8 |
			
 
				+| excalidraw | 643 | `context, trace, trace, explore` | 6 / **19** |
			
 
				+| c-redis | 884 | `context, trace, explore, node` | 10 / 15 |
			
 
				+
			
 
				+Even when trace *is* used, the agent follows it with `node`/`explore` to fetch bodies — so a
			
 
				+secondary lever (after adoption) is making one trace call self-sufficient enough to kill those
			
 
				+follow-ups. But that's step 2.
			
 
				+
			
 
				+## Finding 2 — payload size: path-scoped trace (0.8K) vs breadth-scoped explore (17.9K)
			
 
				+
			
 
				+Across all cells, per codegraph tool — call count and **average payload per call**:
			
 
				+
			
 
				+| tool | calls | avg/call | total |
			
 
				+|---|--:|--:|--:|
			
 
				+| `explore` | 32 | **17.9K** | 573K |
			
 
				+| `context` | 36 | 4.3K | 156K |
			
 
				+| `search` | 39 | 1.3K | 50K |
			
 
				+| `files` | 5 | 3.4K | 17K |
			
 
				+| `node` | 19 | 2.0K | 38K |
			
 
				+| `trace` | 4 | **0.8K** | 3.4K |
			
 
				+
			
 
				+`context` (used in 36/37 cells) is the default opener; `explore` is the default closer. Together
			
 
				+they are the ~22K breadth dump. `trace` — the tool that would replace that with the actual path —
			
 
				+is 22× smaller and barely used. This is the user's premise confirmed in numbers: explore is
			
 
				+breadth-scoped (returns the neighborhood), trace is path-scoped (returns the line).
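The per-tool averages in the table above can be recomputed from a flat list of call records. This is a minimal sketch, assuming each record looks like `{ name, out }` (tool name and result length in characters), which mirrors what `parse-arms.mjs` collects. The `summarize` helper and the sample records are hypothetical and are not taken from `seq-matrix.mjs`.

```javascript
// Hypothetical helper: group call records by tool and report count, total, and average.
// Assumes records shaped like { name: string, out: number } (out = result length in chars).
function summarize(calls) {
  const byTool = new Map();
  for (const { name, out } of calls) {
    const row = byTool.get(name) ?? { tool: name, calls: 0, total: 0 };
    row.calls += 1;
    row.total += out;
    byTool.set(name, row);
  }
  return [...byTool.values()]
    .map((r) => ({ ...r, avg: r.total / r.calls }))
    .sort((a, b) => b.avg - a.avg); // heaviest tool first
}

// Illustrative sample only, not the measured data.
const sample = [
  { name: 'explore', out: 18000 },
  { name: 'explore', out: 17800 },
  { name: 'trace', out: 800 },
];
const table = summarize(sample);
console.log(table);
```

Sorting by average rather than total keeps a rarely used but heavy tool visible, which is the comparison this finding relies on.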
			
 
				+
			
 
				+## Finding 3 — payload grows with repo size, and over-returns on small repos
			
 
				+
			
 
				+With-arm **total** codegraph payload by repo-size tier:
			
 
				+
			
 
				+| tier | cells | avg total payload | range |
			
 
				+|---|--:|--:|--:|
			
 
				+| S (<200 files) | 19 | 12.7K | 3.0–31.2K |
			
 
				+| M (200–1999 files) | 9 | 32.4K | 5.4–58.2K |
			
 
				+| L (≥2000) | 9 | 34.0K | 20.2–43.1K |
			
 
				+
			
 
				+The small-repo waste is concrete — these all have a 2–3 file flow but pull a full neighborhood:
			
 
				+
			
 
				+| repo | files | with-arm payload | sequence |
			
 
				+|---|--:|--:|---|
			
 
				+| flutter_module_books | 6 | 17.4K | `context, explore` |
			
 
				+| computer-database | 10 | 18.0K | `context, search, status, explore` |
			
 
				+| aspnet-realworld | 78 | 22.2K | `context, explore` |
			
 
				+| django-realworld | 44 | 14.8K | `context, explore` |
			
 
				+
			
 
				+`explore`'s per-call budget is already adaptive (#185), but it doesn't help here because the agent
			
 
				+isn't choosing the path-scoped tool — it's choosing breadth.
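The tiering used in this finding can be sketched as follows. The boundaries (S under 200 files, M under 2000, L at 2000 or more) come from the tier table; the `tierOf` and `avgPayloadByTier` names are hypothetical, and the input rows are the four small-repo examples listed above with payloads converted to characters.

```javascript
// Tier boundaries taken from the table: S < 200 files, M < 2000, L >= 2000.
function tierOf(files) {
  if (files < 200) return 'S';
  if (files < 2000) return 'M';
  return 'L';
}

// Hypothetical aggregation: cell count and average total payload (chars) per tier.
function avgPayloadByTier(cells) {
  const acc = {};
  for (const { files, payload } of cells) {
    const t = tierOf(files);
    acc[t] ??= { cells: 0, total: 0 };
    acc[t].cells += 1;
    acc[t].total += payload;
  }
  return Object.fromEntries(
    Object.entries(acc).map(([t, a]) => [t, { cells: a.cells, avg: a.total / a.cells }]),
  );
}

// The four small-repo rows from the table above (payload in chars, rounded).
const smallRepos = [
  { repo: 'flutter_module_books', files: 6, payload: 17400 },
  { repo: 'computer-database', files: 10, payload: 18000 },
  { repo: 'aspnet-realworld', files: 78, payload: 22200 },
  { repo: 'django-realworld', files: 44, payload: 14800 },
];
const byTier = avgPayloadByTier(smallRepos);
console.log(byTier);
```

These four rows are only a subset of the 19 S-tier cells, so their average is not expected to match the 12.7K tier average.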
			
 
				+
			
 
				+## Finding 4 — round-trips, and the ToolSearch tax
			
 
				+
			
 
				+| metric | with | without |
			
 
				+|---|--:|--:|
			
 
				+| total turns (37 cells) | 283 | 375 |
			
 
				+| avg turns / cell | 7.6 | 10.1 |
			
 
				+
			
 
				+25% fewer turns, but only ~16% faster wall-clock — the gap is the per-turn cost of the big explore
			
 
				+payloads. Also: **every with-arm run opens with a `ToolSearch` round-trip** (MCP tools are deferred
			
 
				+in this harness), a fixed 1-turn tax before any codegraph call. Worth confirming whether the
			
 
				+production install defers codegraph tools the same way.
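The turn arithmetic behind this finding is small enough to check directly. A minimal sketch using only the totals from the table above; the `reduction` helper is a hypothetical name.

```javascript
// Relative reduction of the with-arm against the without-arm, as a fraction.
function reduction(withArm, withoutArm) {
  return (withoutArm - withArm) / withoutArm;
}

const cells = 37;
const turns = { with: 283, without: 375 };
const turnCut = reduction(turns.with, turns.without);

console.log(`turn reduction: ${(turnCut * 100).toFixed(1)}%`); // turn reduction: 24.5%
console.log(
  `avg turns/cell: ${(turns.with / cells).toFixed(1)} vs ${(turns.without / cells).toFixed(1)}`,
); // avg turns/cell: 7.6 vs 10.1
```

The 24.5% figure is what the text rounds to 25%; the gap to the ~16% wall-clock gain is the part attributed to per-turn payload cost.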
			
 
				+
			
 
				+## Conclusion → the experiment to run next
			
 
				+
			
 
				+Measure-first changed the plan. The hypothesis was "enrich trace so one call is self-sufficient."
			
 
				+The data says trace is **used 3/37 times**, so completeness is moot until adoption is fixed.
			
 
				+
			
 
				+**Experiment: trace-first steering A/B.**
			
 
				+- **Change:** rewrite the `server-instructions.ts` headline so a *flow* question (how does X reach Y
			
 
				+  / trace / from→to) routes to `codegraph_trace` **first**, demoting the context→explore pattern to
			
 
				+  non-flow/onboarding questions. Mirror into `instructions-template.ts` + `.cursor/rules/codegraph.mdc`.
			
 
				+- **Metric:** trace-adoption rate (target ≫ 3/37), with-arm total payload (expect ↓ sharply,
			
 
				+  especially small repos), turns (expect ↓), wall-clock (expect the 16% gap to widen toward the
			
 
				+  25% turn gap as 18K explore payloads are replaced by <1K traces).
			
 
				+- **Control:** a non-flow "what's the deal with module X" question must still go context→explore —
			
 
				+  don't over-steer everything to trace.
			
 
				+- **Then, step 2:** with adoption up, measure the `node`/`explore` follow-ups after trace
			
 
				+  (cpp-leveldb/excalidraw/c-redis all had them). If they're frequent, enrich trace (per-hop body
			
 
				+  snippet, capped per hop) so one trace call ends the flow investigation.
			
 
				+
			
 
				+## Reproduce
			
 
				+
			
 
				+```bash
			
 
				+node scripts/agent-eval/seq-matrix.mjs            # regenerates every table above from /tmp/ab-matrix
			
 
				+```
			
 
				+
			
 
				+---
			
 
				+
			
 
				+# Ablation experiment — do `context`, `explore`, and `trace` compete? Is `trace` enough?
			
 
				+
			
 
				+**Date:** 2026-05-23 · 52 runs, ~$20. Tool surface trimmed **server-side** via the new
			
 
				+`CODEGRAPH_MCP_TOOLS` allowlist (so an ablated tool is genuinely absent from ListTools, not
			
 
				+denied-on-call); trace-first steering injected with `--append-system-prompt`. 6 repos (2 S / 2 M /
			
 
				+2 L) × 2 runs; arm E is a **non-flow** survey question on 2 repos. Driver `arms-matrix.sh`,
			
 
				+analysis `parse-arms.mjs`.
			
 
				+
			
 
				+| arm | tools | steering | adoption | reads | cgOut | turns | dur |
			
 
				+|---|---|---|--:|--:|--:|--:|--:|
			
 
				+| **A** control | all | none | 2/12 | 1.25 | 28.8K | 7.6 | 38s |
			
 
				+| **B** steer | all | trace-first | **8/12** | 1.00 | **32.0K** | 7.9 | 43s |
			
 
				+| **C** no-explore | hide explore | trace-first | 8/12 | **2.08** | **9.2K** | 9.0 | 44s |
			
 
				+| **D** trace-centric | hide explore+context | trace-first | 8/12 | 2.00 | 6.6K | 10.5 | 46s |
			
 
				+| **E** control-probe | hide explore+context | trace-first | 0/4 | 2.50 | 27.8K | **20.0** | **72s** |
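The server-side allowlist idea (a hidden tool is absent from the tool list rather than rejected on call) can be sketched like this. This is not the actual implementation: the value format of `CODEGRAPH_MCP_TOOLS` is not stated here, so the comma-separated short-name format, the "unset means expose everything" rule, and the `parseAllowlist` / `visibleTools` helpers are all assumptions made for illustration.

```javascript
// Hypothetical: parse an allowlist string into a Set of short tool names.
// ASSUMPTION: comma-separated values; unset or blank means "expose everything".
function parseAllowlist(raw) {
  if (!raw || !raw.trim()) return null;
  return new Set(raw.split(',').map((s) => s.trim()).filter(Boolean));
}

// Filter before the tools are listed, so a hidden tool never appears at all.
function visibleTools(allTools, raw) {
  const allow = parseAllowlist(raw);
  return allow ? allTools.filter((t) => allow.has(t.name)) : allTools;
}

const allTools = ['context', 'search', 'explore', 'trace', 'node', 'files'].map((name) => ({ name }));

// In this sketch, an arm-C-like surface is everything except explore.
const armC = visibleTools(allTools, 'context, search, trace, node, files');
console.log(armC.map((t) => t.name));
```

In a real server the string would presumably come from `process.env.CODEGRAPH_MCP_TOOLS`; it is passed as an argument here so the sketch stays deterministic.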
			
 
				+
			
 
				+## What it says
			
 
				+
			
 
				+1. **Steering works for adoption, not for payload.** B lifted trace use **2/12 → 8/12** runs: all 4
				+   genuinely path-shaped questions adopted it in both runs, and the 2 non-adopters, flutter
				+   ("what widgets") and vapor ("name the route"), aren't from→to questions. But B's payload
				+   (32.0K) is *bigger* than control (28.8K)
			
 
				+   and it's slightly slower — because the agent calls trace **and still calls explore**. Steering
			
 
				+   adds a trace hop without displacing the explore dump.
			
 
				+2. **`explore` is the payload, and it's load-bearing — but 3–5× too heavy.** Removing it (C) cuts
			
 
				+   payload **71%** (32K→9.2K) — confirming it's the bloat. But reads **double** (1.0→2.1) and turns
			
 
				+   rise: the agent Reads files to recover the bodies explore had inlined. So explore isn't
			
 
				+   redundant; it's the only one-call body-supplier, just delivered with a 32K sledgehammer.
			
 
				+3. **`context` is the most redundant of the three — as a body-supplier.** Removing it on top of
			
 
				+   explore (D vs C) left reads flat (2.08→2.00) but raised turns (9.0→10.5). It supplies no unique
			
 
				+   bodies; it earns its keep only as a round-trip-saver (the composed orient call).
			
 
				+4. **Removing tools makes flow questions SLOWER, not faster.** Turns climb monotonically
			
 
				+   A→D (7.6→10.5) and duration with them — the Read + trace-follow-up round-trips cost more
			
 
				+   wall-clock than the saved payload. Leaner payload ≠ faster.
			
 
				+5. **`trace` is definitively NOT sufficient.** The non-flow probe (E) thrashed without the survey
			
 
				+   tools — **20 turns, 72s** reconstructing an overview from search/node/files. Survey questions
			
 
				+   need a survey tool; trace can't substitute.
			
 
				+
			
 
				+## Verdict on the three design questions
			
 
				+
			
 
				+- **Do we need all three?** Yes — but for different reasons. trace = flow tool (real, under-adopted).
			
 
				+  explore = the one-call body-supplier (load-bearing, over-heavy). context = round-trip-saving
			
 
				+  opener (redundant for bodies, useful for orientation).
			
 
				+- **Are they competing?** Yes: explore competes with trace and *wins by default* — even when steered,
			
 
				+  the agent traces **and** explores, so the payload win never lands until explore is displaced.
			
 
				+- **Could trace be all we need?** No. E rules it out for non-flow questions; C/D rule it out even
			
 
				+  for flow (reads double without explore's bodies).
			
 
				+
			
 
				+**Three cheap fixes are now ruled out by data:** "trace is all we need" (false), "just steer to
			
 
				+trace" (B: slower + bigger than control), and "remove explore" (C/D: more reads/turns, slower).
			
 
				+
			
 
				+## The fix the data points to → next experiment
			
 
				+
			
 
				+The only path that wins: **make `trace` self-sufficient by inlining per-hop bodies** (capped per
			
 
				+hop → still path-scoped) so one trace call supplies what explore does *and* what the Read fallback
			
 
				+recovers — displacing both for flow questions. Keep **one** survey tool (context; demote explore to
			
 
				+deep-survey, not the flow default) for the non-flow class E proved is load-bearing.
			
 
				+
			
 
				+- **Experiment:** enriched body-inlining `trace` + steering vs control.
			
 
				+- **Target:** C/D's lean payload (~7–9K, not 32K) **without** C/D's extra reads/turns, and **beat A
			
 
				+  on wall-clock** (the bar B/C/D all failed).
			
 
				+- **Metric:** payload, reads (must stay ≈ A's 1.25, not rise to ~2.0), turns, duration.
			
 
				+
			
 
				+## Reproduce (ablation)
			
 
				+
			
 
				+```bash
			
 
				+bash scripts/agent-eval/arms-matrix.sh     # 52 runs into /tmp/arms (RUNS=2 default)
			
 
				+node scripts/agent-eval/parse-arms.mjs     # the arm-comparison tables above
			
 
				+```
			
 
				+
			
 
				+---
			
 
				+
			
 
				+# Validation — body-inlining trace (arm F)
			
 
				+
			
 
				+The ablation pointed to one fix: make `trace` self-sufficient by inlining per-hop **bodies**
			
 
				+(capped per hop → still path-scoped) so one trace call displaces both the explore dump and the
			
 
				+Read fallback. Implemented in `handleTrace` (`sourceRangeAt`, 28 lines / 1200 chars per hop, with a
			
 
				+`… (+N more lines)` marker). Arm **F** = arm B's surface (all tools + trace-first steering) run on
			
 
				+the body-inlining build, so **F vs B isolates the enrichment**.
			
 
				+
			
 
				+| arm | adoption | reads | cgOut | turns | dur | cost |
			
 
				+|---|--:|--:|--:|--:|--:|--:|
			
 
				+| A all/none | 2/12 | 1.25 | 28.8K | 7.6 | 38s | $0.390 |
			
 
				+| B all/steer (thin trace) | 8/12 | 1.00 | 32.0K | 7.9 | 43s | $0.411 |
			
 
				+| **F all/steer (body trace)** | 5/12 | **1.17** | **25.1K** | **6.8** | **37s** | **$0.348** |
			
 
				+| C no-explore | 8/12 | 2.08 | 9.2K | 9.0 | 44s | $0.356 |
			
 
				+| D trace-centric | 8/12 | 2.00 | 6.6K | 10.5 | 46s | $0.368 |
			
 
				+
			
 
				+**F is the best-balanced arm:** lowest turns (6.8), fastest (37s), cheapest, payload leaner than
			
 
				+A/B — and it hits the target the ablation set: **C/D-class efficiency without C/D's Read penalty**
			
 
				+(F reads 1.17 vs C/D's ~2.0). It gets there not by *removing* a tool but by giving the agent a
			
 
				+complete trace so it *stops early*.
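The per-hop cap mentioned earlier (28 lines / 1200 chars, with a `… (+N more lines)` marker) could look roughly like the sketch below. It is not the code in `handleTrace`: how the two caps interact is not described here, so this version assumes whole lines are kept until either cap would be exceeded. `clipBody` is a hypothetical name.

```javascript
// Hypothetical per-hop clip: keep at most maxLines lines and maxChars characters,
// then append a marker saying how many lines were dropped.
// ASSUMPTION: both caps apply, whichever is hit first wins, and only whole lines are kept.
function clipBody(source, maxLines = 28, maxChars = 1200) {
  const lines = source.split('\n');
  const kept = [];
  let chars = 0;
  for (const line of lines) {
    const cost = line.length + 1; // +1 for the newline
    if (kept.length >= maxLines || chars + cost > maxChars) break;
    kept.push(line);
    chars += cost;
  }
  const dropped = lines.length - kept.length;
  if (dropped > 0) kept.push(`… (+${dropped} more lines)`);
  return kept.join('\n');
}

// A 40-line body of short lines hits the line cap, not the char cap.
const body = Array.from({ length: 40 }, (_, i) => `  step${i}();`).join('\n');
const clipped = clipBody(body);
console.log(clipped.split('\n').length); // 29: 28 kept lines + 1 marker line
```

Capping per hop rather than per call is what keeps the output path-scoped: the payload grows with path length, not with the size of the surrounding neighborhood.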
			
 
				+
			
 
				+**The win is clearest where trace connects** — excalidraw (the validated 6-hop path):
			
 
				+
			
 
				+| arm | sequence | turns | reads | dur |
			
 
				+|---|---|--:|--:|--:|
			
 
				+| B (thin) | `trace → context → explore → Grep → Read` | 7 | 1 | 47s |
			
 
				+| **F (body) r1** | `trace → context` | **4** | **0** | **31s** |
			
 
				+| F (body) r2 | `trace → trace → explore` | 5 | 0 | 42s |
			
 
				+
			
 
				+The body-trace ended the investigation in `trace → context` (run 1) — 0 reads, 0 grep, 0 explore.
			
 
				+
			
 
				+**Connectivity is the cap.** On flows that break at *unbridged* dynamic dispatch — aspnet-realworld
			
 
				+(MediatR `_mediator.Send → Handle`), vapor-spi (closure routing) — trace returns "no path" and the
			
 
				+agent falls back to explore, so F ≈ B (no regression, no gain). F's aggregate lift is therefore
			
 
				+**gated by dynamic-dispatch coverage**: the more flows the graph connects end-to-end, the more often
			
 
				+the self-sufficient trace fires. (n=2/arm — adoption and per-repo numbers are noisy; excalidraw and
			
 
				+spring-halo, the connecting repos, are 2/2 trace in both B and F.)
			
 
				+
			
 
				+## Verdict & ship list
			
 
				+
			
 
				+1. **Ship the body-inlining trace** — strict improvement (best-balanced arm; clean 0-read/4-turn win
			
 
				+   on connecting traces; no regression on non-connecting ones).
			
 
				+2. **Strengthen the steering.** Arm A (shipped server-instructions, which *already* say "trace first
			
 
				+   for flow") adopted trace only 2/12 — the guidance is too buried. The explicit
			
 
				+   `--append-system-prompt` used in B–F lifted it. Port that into `server-instructions.ts` +
			
 
				+   `instructions-template.ts` + `.cursor/rules/codegraph.mdc` (house rule: all three together),
			
 
				+   flow-gated so non-flow survey questions still go context/explore (arm E proved they must).
			
 
				+3. **Next frontier to widen F's reach:** bridge more dynamic dispatch (MediatR/.NET, Vapor routing) —
			
 
				+   every newly-connected flow converts an F≈B repo into an F-win repo.
			
 
				+
			
 
				+## Reproduce (arm F)
			
 
				+
			
 
				+```bash
			
 
				+bash scripts/agent-eval/arms-F.sh          # 12 runs (RUNS=2); needs the body-inlining build
			
 
				+node scripts/agent-eval/parse-arms.mjs     # F appears alongside A/B/C/D/E
			
 
				+```
			
 
				+
			
 
				+---
			
 
				+
			
 
				+# Steering port — the negative result (arm G)
			
 
				+
			
 
				+F's win used `--append-system-prompt`, which real users don't get. Arm **G** = arm A's invocation
			
 
				+(NO append-prompt) on a build where the steering was ported into the production channels
			
 
				+(`server-instructions.ts` + the `context`/`trace` tool descriptions + `instructions-template.ts` +
			
 
				+`.cursor/rules`). Three wording iterations, 12 runs each:
			
 
				+
			
 
				+| arm | adoption | reads | payload | turns | dur |
			
 
				+|---|--:|--:|--:|--:|--:|
			
 
				+| A (shipped instructions) | 2/12 | 1.25 | 28.8K | 7.6 | **38s** |
			
 
				+| F (body-trace + append-prompt) | 5/12 | **1.17** | 25.1K | 6.8 | **37s** |
			
 
				+| G v1 — anti-explore wording | 6/12 | 2.08 | 13.8K | 8.8 | 46s |
			
 
				+| G v2 — restore explore as fallback | 6/12 | 1.67 | 22.0K | 7.8 | 46s |
			
 
				+| G v3 — restore context as opener | 6/12 | 2.08 | 11.7K | 8.9 | 46s |
			
 
				+
			
 
				+**Production-instruction steering does not reproduce F, and regresses the A baseline.** All three G
			
 
				+variants pin at **~46s** (slower than A's 38s and F's 37s) with reads at 1.7–2.1 (vs A 1.25, F 1.17).
			
 
				+Wording only shuffled the slack between Read and explore — v1 suppressed explore → Read; v2/v3
			
 
				+restored explore → over-investigation — never landing F's lean `trace → context`.
			
 
				+
			
 
				+**Two root causes:**
			
 
				+1. **Salience.** The same trace-first wording works as a top-of-prompt `--append-system-prompt` (F)
			
 
				+   but not as an MCP `initialize` instruction / tool description (G). An MCP server has no
			
 
				+   higher-salience channel — this is an architectural limit, not a wording bug.
			
 
				+2. **Forcing trace-first backfires where trace doesn't connect.** Steering pushed trace onto
			
 
				+   MediatR (`_mediator.Send`) and Spring interface-DI (`@Autowired` iface → impl) flows, where trace
			
 
				+   returns no-path; the forced trace is then a wasted round-trip *before* the fallback → slower.
			
 
				+   The **unsteered** agent (A) is better-calibrated: it traces only when trace will obviously
			
 
				+   connect (2/12) and explores otherwise.
			
 
				+
			
 
				+## Arm H — body-trace alone (the ship candidate) regresses
			
 
				+
			
 
				+The clean ship test: body-inlining trace + ORIGINAL instructions + no steering (= A's invocation,
			
 
				+only the trace *tool* changed). H vs A isolates the body-trace feature with nothing else moving.
			
 
				+
			
 
				+| arm | adoption | reads | payload | turns | dur |
			
 
				+|---|--:|--:|--:|--:|--:|
			
 
				+| A (no body-trace) | 2/12 | 1.25 | 28.8K | 7.6 | **38s** |
			
 
				+| H (body-trace, no steering) | 3/12 | 1.50 | 29.7K | 8.0 | **45s** |
			
 
				+| F (body-trace + append-prompt) | 5/12 | 1.17 | 25.1K | 6.8 | 37s |
			
 
				+
			
 
				+**Body-trace alone does NOT beat A — it mildly regresses** (45s vs 38s). The sequences show why:
			
 
				+unsteered, the agent treats trace as just one more call in its usual loop — excalidraw H was
			
 
				+`context → trace → explore → node×3 → Grep → Read` (77s) — so the bigger body-trace payload is pure
			
 
				+added cost, not offset by fewer follow-ups. The body-trace only pays off when the agent **leads with
			
 
				+trace and stops after it**, which only the append-prompt (F) achieved.
			
 
				+
			
 
				+## Final verdict
			
 
				+
			
 
				+The body-inlining trace is a real win (F) but its value is **entirely contingent on
			
 
				+lead-with-and-stop-after-trace steering we cannot deliver through any production MCP channel**
			
 
				+(append-prompt salience ≫ server-instructions / tool-descriptions; G failed three times). On its own
			
 
				+(H) it regresses. So:
			
 
				+
			
 
				+- **SHIP: the `CODEGRAPH_MCP_TOOLS` allowlist** — independent, clean, validated.
			
 
				+- **DON'T ship the body-inlining trace or the steering as-is** — measured neutral-to-negative
			
 
				+  without a steering channel we don't have.
			
 
				+- **The real lever is connectivity, not steering** — trace earns its keep only when flows connect
			
 
				+  end-to-end; dynamic-dispatch synthesizers (MediatR/.NET, Spring interface-DI, Vapor closures) help
			
 
				+  the *unsteered* agent, which already traces when trace will connect.
			
 
				+- **One untested lever** to rescue the body-trace: steer via the trace tool's OWN OUTPUT (the
			
 
				+  highest-salience channel — the agent reads it fresh, right at the decision point) with a strong
			
 
				+  leading "complete flow — answer from this, don't explore" banner. Instructions/descriptions are
			
 
				+  too far from the action; the tool result is not. Unproven; the only remaining shot at making the
			
 
				+  body-trace pay off in production.
			
 
				+
			
 
				+Measure-first paid off three times: it killed three cheap fixes in the ablation, stopped a steering
			
 
				+change that would have shipped an ~8s/query regression (G), and stopped shipping the body-trace
			
 
				+itself on a confounded assumption (H showed it needs steering we can't deliver).
			
 
				+
			
 
				+## Reproduce (arm G)
			
 
				+
			
 
				+```bash
			
 
				+ARM=G bash scripts/agent-eval/arms-F.sh    # production-instruction steering, no append-prompt
			
 
				+node scripts/agent-eval/parse-arms.mjs
			
 
				+```
			
 
				+
			
 
				+---
			
 
				+
			
 
				+# Arm I — sufficiency, not steering (the shippable win)
			
 
				+
			
 
				+An LLM stops investigating when its context is *sufficient*, not when it's told to stop. So arm I
			
 
				+makes the trace OUTPUT complete instead of steering — same invocation as H (original instructions,
			
 
				+**no steering**), only the trace tool changed:
			
 
				+1. **Hop bodies no longer clipped** at 28 lines (that clip is why H re-fetched `mutateElement`).
			
 
				+2. **The destination's own callees are inlined** — the "last mile" the agent otherwise explores/Reads
			
 
				+   for (excalidraw: `renderStaticScene → _renderStaticScene / renderStaticSceneThrottled`).
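The second change, inlining the destination's own callees, can be sketched over a plain adjacency map. The graph shape (`Map` of symbol to direct callees), the `traceWithLastMile` name, and the choice to add only one level of callees are assumptions for illustration. The two-node path is shortened; the real excalidraw path has more hops.

```javascript
// Hypothetical graph shape: Map<symbol, string[]> of direct callees.
// ASSUMPTION: only the destination's direct (one-level) callees are added,
// and callees already on the path are not repeated.
function traceWithLastMile(path, calleesOf) {
  const dest = path[path.length - 1];
  const onPath = new Set(path);
  const lastMile = (calleesOf.get(dest) ?? []).filter((c) => !onPath.has(c));
  return { path, dest, lastMile };
}

const calleesOf = new Map([
  ['renderStaticScene', ['_renderStaticScene', 'renderStaticSceneThrottled']],
]);

// Illustrative, shortened path: intermediate hops are omitted.
const out = traceWithLastMile(['mutateElement', 'renderStaticScene'], calleesOf);
console.log(out.lastMile);
```

The point of the extra step is that the agent's usual next question after reaching the destination ("what does it call?") is already answered in the same result.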
			
 
				+
			
 
				+| arm | adoption | reads | greps | payload | turns | dur | cost |
			
 
				+|---|--:|--:|--:|--:|--:|--:|--:|
			
 
				+| A baseline | 2/12 | 1.25 | 1.17 | 28.8K | 7.6 | 38s | $0.390 |
			
 
				+| H body-trace alone | 3/12 | 1.50 | 0.42 | 29.7K | 8.0 | 45s | $0.398 |
			
 
				+| **I body-trace + dest callees** | 2/12 | **1.17** | **0.25** | 27.2K | **7.0** | 39s | **$0.359** |
			
 
				+| F body-trace + append-steer | 5/12 | 1.17 | 0.17 | 25.1K | 6.8 | 37s | $0.348 |
			
 
				+
			
 
				+**I ≥ A on every axis** (reads, greps, payload, turns, cost down; wall-clock flat within noise, 39s vs 38s) and **≈ F on outcomes with
			
 
				+zero steering** — despite *lower* trace adoption (2/12 vs F's 5/12). The destination-callees fix
			
 
				+turned the body-trace from a net-negative (H, 45s) into a net-positive (I, 39s): one richer trace
			
 
				+call now displaces the explore+node+Read follow-ups it used to trigger. excalidraw I-r2 was
			
 
				+`context → trace → explore` — **0 reads, 5 turns**, stopped because the data was present. The residual
			
 
				+reads (I-r1) are the `canvasNonce` data-flow — the def-use frontier the graph deliberately omits.
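The axis-by-axis comparison of I against A can be checked mechanically. A minimal sketch with the two rows copied from the table above; the field names and the `compare` helper are hypothetical, and every axis is treated as lower-is-better.

```javascript
// Rows copied from the table above; lower is better on every listed axis.
const armA = { reads: 1.25, greps: 1.17, payloadK: 28.8, turns: 7.6, durS: 38, costUsd: 0.39 };
const armI = { reads: 1.17, greps: 0.25, payloadK: 27.2, turns: 7.0, durS: 39, costUsd: 0.359 };

// Classify each axis as better, worse, or tie for `candidate` against `baseline`.
function compare(candidate, baseline) {
  const verdict = {};
  for (const axis of Object.keys(baseline)) {
    const d = candidate[axis] - baseline[axis];
    verdict[axis] = d < 0 ? 'better' : d > 0 ? 'worse' : 'tie';
  }
  return verdict;
}

const verdict = compare(armI, armA);
console.log(verdict);
```

A strict comparison flags duration as worse by 1s (39s vs 38s). With n=2 per arm that difference is within the noise already noted for these runs, which is why the text calls wall-clock flat.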
			
 
				+
			
 
				+This confirms the thesis: **completeness stops the agent; steering doesn't.** Every steering arm
			
 
				+(B/F append-prompt, G instructions) was either unshippable or a regression; the sufficiency arm (I)
			
 
				+ships and needs no steering.
			
 
				+
			
 
				+## Revised final verdict (supersedes the arm-G/H verdict above)
			
 
				+
			
 
				+- **SHIP: body-inlining trace + destination callees** (arm I) — ≥ A on all axes, no steering, no
			
 
				+  regression; makes the self-sufficient-trace property real (one trace call answers the flow).
			
 
				+- **SHIP: the `CODEGRAPH_MCP_TOOLS` allowlist** — independent, validated.
			
 
				+- **DON'T ship steering** (instructions or tool descriptions) — three variants regressed; MCP can't
			
 
				+  deliver append-prompt salience, and forcing trace where it doesn't connect backfires.
			
 
				+- **Connectivity is the multiplier** — arm I helps most where the trace connects; MediatR/.NET,
			
 
				+  Spring interface-DI, and Vapor closures are the next synthesizers, and they help the *unsteered*
			
 
				+  agent (which already traces when trace will connect).
			
 
				+
			
 
				+## Reproduce (arm I)
			
 
				+
			
 
				+```bash
			
 
				+ARM=I bash scripts/agent-eval/arms-F.sh    # body-trace + destination callees, no steering
			
 
				+node scripts/agent-eval/parse-arms.mjs
			
 
				+```
			
--- a/scripts/agent-eval/arms-F.sh
+++ b/scripts/agent-eval/arms-F.sh
@@ -0,0 +1,21 @@
 
				+#!/usr/bin/env bash
			
 
				+# Arm F (body-inlining trace + trace-first steering) across the same 6 repos as
			
 
				+# arms-matrix.sh, so F vs B isolates the trace-enrichment effect (same surface,
			
 
				+# old thin trace in B vs body-inlining trace here).
			
 
				+set -uo pipefail
			
 
				+H="$(cd "$(dirname "$0")" && pwd)"; RUNS="${RUNS:-2}"; C="${CORPUS:-/tmp/codegraph-corpus}"
			
 
				+ROWS=(
			
 
				+"$C/flutter-samples/add_to_app/books/flutter_module_books|How does the books UI build and what child widgets does it show?"
			
 
				+"$C/aspnet-realworld|How is creating an article handled? Trace the controller to the service."
			
 
				+"$C/spring-mall|How is a product-list request handled? Trace the controller to the service."
			
 
				+"$C/vapor-spi|How is a package-show request handled? Name the route and controller."
			
 
				+"$C/excalidraw|How does updating an element re-render the canvas on screen? Trace the flow."
			
 
				+"$C/spring-halo|How is publishing a post handled? Trace the controller to the service."
			
 
				+)
			
 
				+ARM="${ARM:-F}"
			
 
				+echo "### ARM $ARM START $(date) RUNS=$RUNS"
			
 
				+for row in "${ROWS[@]}"; do
			
 
				+  repo="${row%%|*}"; q="${row#*|}"
			
 
				+  for r in $(seq 1 "$RUNS"); do bash "$H/run-arms.sh" "$repo" "$q" "$ARM" "$r"; done
			
 
				+done
			
 
				+echo "### ARM $ARM COMPLETE $(date)"
			
--- a/scripts/agent-eval/arms-matrix.sh
+++ b/scripts/agent-eval/arms-matrix.sh
@@ -0,0 +1,37 @@
 
				+#!/usr/bin/env bash
			
 
				+# Drive the tool-surface ablation across the chosen repos × arms (A–E).
			
 
				+# Arms A–D ask the canonical FLOW question; arm E asks a NON-flow survey
			
 
				+# question (the control probe — should degrade without explore+context).
			
 
				+# Output: /tmp/arms/<repo>/<arm>-r<n>.jsonl  (parse with parse-arms.mjs).
			
 
				+set -uo pipefail
			
 
				+HARNESS="$(cd "$(dirname "$0")" && pwd)"
			
 
				+RUNS="${RUNS:-2}"
			
 
				+C="${CORPUS:-/tmp/codegraph-corpus}"
			
 
				+NFQ='What are the main modules/components of this codebase and what does each one do? Give an overview of how it is organized.'
			
 
				+
			
 
				+# repo-path|flow-question  (2 small, 2 medium, 2 large — spans the size range)
			
 
				+ROWS=(
			
 
				+"$C/flutter-samples/add_to_app/books/flutter_module_books|How does the books UI build and what child widgets does it show?"
			
 
				+"$C/aspnet-realworld|How is creating an article handled? Trace the controller to the service."
			
 
				+"$C/spring-mall|How is a product-list request handled? Trace the controller to the service."
			
 
				+"$C/vapor-spi|How is a package-show request handled? Name the route and controller."
			
 
				+"$C/excalidraw|How does updating an element re-render the canvas on screen? Trace the flow."
			
 
				+"$C/spring-halo|How is publishing a post handled? Trace the controller to the service."
			
 
				+)
			
 
				+
			
 
				+echo "### ARMS MATRIX START $(date) RUNS=$RUNS"
			
 
				+for row in "${ROWS[@]}"; do
			
 
				+  repo="${row%%|*}"; q="${row#*|}"
			
 
				+  for arm in A B C D; do
			
 
				+    for r in $(seq 1 "$RUNS"); do
			
 
				+      bash "$HARNESS/run-arms.sh" "$repo" "$q" "$arm" "$r"
			
 
				+    done
			
 
				+  done
			
 
				+done
			
 
				+# E: non-flow control probe on two repos (must degrade without explore+context)
			
 
				+for repo in "$C/excalidraw" "$C/spring-mall"; do
			
 
				+  for r in $(seq 1 "$RUNS"); do
			
 
				+    bash "$HARNESS/run-arms.sh" "$repo" "$NFQ" E "$r"
			
 
				+  done
			
 
				+done
			
 
				+echo "### ARMS MATRIX COMPLETE $(date)"
			
--- a/scripts/agent-eval/parse-arms.mjs
+++ b/scripts/agent-eval/parse-arms.mjs
@@ -0,0 +1,116 @@
 
				+#!/usr/bin/env node
			
 
				+// Analyze the tool-surface ablation (/tmp/arms/<repo>/<arm>-r<n>.jsonl).
			
 
				+// Compares arms A–I (A–E ablation, F–I follow-ups) on trace adoption, Read/Grep fallback, codegraph payload,
			
 
				+// round-trips, and duration — averaged across runs per arm.
			
 
				+//
			
 
				+// The decisive signal is READS: if removing a tool raises Reads on a question
			
 
				+// class, that tool was load-bearing for it (not redundant). If removing it
			
 
				+// changes nothing, it was redundant.
			
 
				+//
			
 
				+//   A control       all tools            no steering   (baseline)
			
 
				+//   B steer         all tools            trace-first   (adoption)
			
 
				+//   C no-explore    hide explore         trace-first   (is explore redundant?)
			
 
				+//   D trace-centric hide explore+context trace-first   (is the survey pair redundant?)
			
 
				+//   E control-probe hide explore+context trace-first   (NON-flow Q — should degrade)
			
 
				+//
			
 
				+// Usage: node scripts/agent-eval/parse-arms.mjs [/tmp/arms]
			
 
				+import { readFileSync, readdirSync, existsSync, statSync } from 'fs';
			
 
				+import { join } from 'path';
			
 
				+
			
 
				+const ROOT = process.argv[2] || '/tmp/arms';
			
 
				+const cgShort = (n) => n.replace('mcp__codegraph__codegraph_', '').replace('mcp__codegraph__', '');
			
 
				+
			
 
				+function parse(file) {
			
 
				+  if (!existsSync(file)) return null;
			
 
				+  const lines = readFileSync(file, 'utf8').split('\n').filter(Boolean);
			
 
				+  const calls = []; let result = null, initCg = 0;
			
 
				+  for (const l of lines) {
			
 
				+    let ev; try { ev = JSON.parse(l); } catch { continue; }
			
 
				+    if (ev.type === 'system' && ev.subtype === 'init') initCg = (ev.tools || []).filter(t => /codegraph/.test(t)).length;
			
 
				+    if (ev.type === 'assistant') for (const b of (ev.message?.content || [])) if (b.type === 'tool_use')
			
 
				+      calls.push({ id: b.id, name: b.name, out: 0 });
			
 
				+    if (ev.type === 'user') for (const b of (ev.message?.content || [])) if (b.type === 'tool_result') {
			
 
				+      const c = b.content;
			
 
				+      const txt = typeof c === 'string' ? c : Array.isArray(c) ? c.map(x => x?.text || '').join('') : '';
			
 
				+      const call = calls.find(k => k.id === b.tool_use_id); if (call) call.out = txt.length;
			
 
				+    }
			
 
				+    if (ev.type === 'result') result = ev;
			
 
				+  }
			
 
				+  const cg = calls.filter(c => c.name.includes('codegraph'));
			
 
				+  return {
			
 
				+    initCg,
			
 
				+    reads: calls.filter(c => c.name === 'Read').length,
			
 
				+    greps: calls.filter(c => c.name === 'Grep').length + calls.filter(c => c.name === 'Glob').length,
			
 
				+    cgCalls: cg.length,
			
 
				+    cgSeq: cg.map(c => cgShort(c.name)),
			
 
				+    cgOut: cg.reduce((s, c) => s + c.out, 0),
			
 
				+    traceUsed: cg.some(c => c.name.includes('trace')),
			
 
				+    turns: result?.num_turns ?? null,
			
 
				+    dur: result?.duration_ms ? Math.round(result.duration_ms / 1000) : null,
			
 
				+    cost: result?.total_cost_usd || 0,
			
 
				+    ok: result?.subtype === 'success',
			
 
				+  };
			
 
				+}
			
 
				+
			
 
				+// repo -> arm -> [runs]
			
 
				+const data = {};
			
 
				+if (!existsSync(ROOT)) { console.error(`no ${ROOT}`); process.exit(1); }
			
 
				+for (const repo of readdirSync(ROOT)) {
			
 
				+  const rdir = join(ROOT, repo);
			
 
				+  if (!statSync(rdir).isDirectory()) continue;
			
 
				+  for (const f of readdirSync(rdir)) {
			
 
				+    const m = f.match(/^([A-I])-r(\d+)\.jsonl$/); if (!m) continue;
			
 
				+    const p = parse(join(rdir, f)); if (!p || !p.ok) continue;
			
 
				+    (((data[repo] ??= {})[m[1]]) ??= []).push(p);
			
 
				+  }
			
 
				+}
			
 
				+
			
 
				+const avg = (a, f) => a.length ? a.reduce((s, x) => s + (f(x) || 0), 0) / a.length : 0;
			
 
				+const k = (n) => (n / 1000).toFixed(1);
			
 
				+const pad = (s, n) => String(s).padEnd(n);
			
 
				+const ARMS = ['A', 'H', 'I', 'B', 'F', 'G', 'C', 'D', 'E'];
			
 
				+const LABEL = { A: 'A all/none(old)', H: 'H body-trace/none', I: 'I bodytrace+dest', B: 'B all/steer(thin)', F: 'F all/steer(body)', G: 'G ported(noprompt)', C: 'C no-explore', D: 'D trace-centric', E: 'E nonflow-probe' };
			
 
				+
			
 
				+// ---- per repo × arm ----
			
 
				+console.log('\n=== PER REPO × ARM (avg over runs) ===');
			
 
				+console.log(pad('repo', 22), pad('arm', 16), 'tools', 'trace', pad('reads', 6), pad('cgOutK', 7), pad('turns', 6), 'dur');
			
 
				+for (const repo of Object.keys(data).sort()) {
			
 
				+  for (const arm of ARMS) {
			
 
				+    const runs = data[repo][arm]; if (!runs?.length) continue;
			
 
				+    console.log(
			
 
				+      pad(repo, 22), pad(LABEL[arm], 16),
			
 
				+      pad(runs[0].initCg, 5),
			
 
				+      pad(runs.filter(r => r.traceUsed).length + '/' + runs.length, 5),
			
 
				+      pad(avg(runs, r => r.reads).toFixed(1), 6),
			
 
				+      pad(k(avg(runs, r => r.cgOut)), 7),
			
 
				+      pad(avg(runs, r => r.turns).toFixed(1), 6),
			
 
				+      avg(runs, r => r.dur).toFixed(0) + 's',
			
 
				+    );
			
 
				+  }
			
 
				+}
			
 
				+
			
 
				+// ---- aggregate per arm (every arm, in ARMS order; E is the non-flow probe, so read its row separately) ----
			
 
				+console.log('\n=== AGGREGATE PER ARM (mean over all runs, pooled across repos) ===');
			
 
				+console.log(pad('arm', 16), pad('adoption', 9), pad('reads', 7), pad('greps', 7), pad('cgOutK', 8), pad('turns', 7), pad('dur', 6), 'cost');
			
 
				+for (const arm of ARMS) {
			
 
				+  const all = [];
			
 
				+  for (const repo of Object.keys(data)) for (const r of (data[repo][arm] || [])) all.push({ ...r, repo });
			
 
				+  if (!all.length) continue;
			
 
				+  const repos = new Set(all.map(r => r.repo)).size;
			
 
				+  const adopt = all.filter(r => r.traceUsed).length;
			
 
				+  console.log(
			
 
				+    pad(LABEL[arm], 16),
			
 
				+    pad(`${adopt}/${all.length}`, 9),
			
 
				+    pad(avg(all, r => r.reads).toFixed(2), 7),
			
 
				+    pad(avg(all, r => r.greps).toFixed(2), 7),
			
 
				+    pad(k(avg(all, r => r.cgOut)), 8),
			
 
				+    pad(avg(all, r => r.turns).toFixed(1), 7),
			
 
				+    pad(avg(all, r => r.dur).toFixed(0) + 's', 6),
			
 
				+    '$' + avg(all, r => r.cost).toFixed(3),
			
 
				+    `  (${repos} repos)`,
			
 
				+  );
			
 
				+}
			
 
				+
			
 
				+console.log('\nRead the signal: B vs A = does steering alone fix adoption + cut payload.');
			
 
				+console.log('C vs B = is explore redundant (reads should NOT jump). D vs C = is context redundant.');
			
 
				+console.log('E = non-flow under trace-centric; reads SHOULD jump (proves survey tools are load-bearing).');
			
--- a/scripts/agent-eval/run-arms.sh
+++ b/scripts/agent-eval/run-arms.sh
@@ -0,0 +1,56 @@
 
				+#!/usr/bin/env bash
			
 
				+# Tool-surface ablation — run ONE repo+question under ONE arm.
			
 
				+#
			
 
				+# Arms vary (exposed codegraph tools, trace-first steering). Tools are trimmed
			
 
				+# SERVER-SIDE via CODEGRAPH_MCP_TOOLS in the MCP config's `env` block, so an
			
 
				+# ablated tool is genuinely absent from ListTools — no deferred-ToolSearch or
			
 
				+# denied-call confound (which --disallowedTools would introduce). Steering is
			
 
				+# injected with --append-system-prompt, so no rebuild of the shipped
			
 
				+# server-instructions is needed to A/B it.
			
 
				+#
			
 
				+#   A control       all tools            no steering
			
 
				+#   B steer         all tools            trace-first
			
 
				+#   C no-explore    hide explore         trace-first
			
 
				+#   D trace-centric hide explore+context trace-first
			
 
				+#   E control-probe hide explore+context trace-first  (caller passes a NON-flow Q)
			
 
				+#
			
 
				+# Usage: run-arms.sh <repo-path> "<question>" <A|B|C|D|E|F|G|H|I> [run-id]
			
 
				+set -uo pipefail
			
 
				+REPO="${1:?repo path}"; Q="${2:?question}"; ARM="${3:?arm A-I}"; RID="${4:-1}"
			
 
				+CG_BIN="${CG_BIN:-$(command -v codegraph)}"
			
 
				+OUT="${ARMS_OUT:-/tmp/arms}/$(basename "$REPO")"
			
 
				+mkdir -p "$OUT"
			
 
				+[ -n "$CG_BIN" ] || { echo "no codegraph binary (set CG_BIN)"; exit 1; }
			
 
				+[ -d "$REPO/.codegraph" ] || { echo "no .codegraph index at $REPO"; exit 1; }
			
 
				+
			
 
				+STEER='Flow questions ("how does X reach/become Y", "trace the flow", request to handler, state to render): call codegraph_trace(from,to) FIRST — one call returns the whole path. Use codegraph_context/search only to locate the two endpoint symbols if you do not know them. Do NOT reconstruct the path with repeated search/callers/explore.'
			
 
				+KEEP_NO_EXPLORE="trace,search,node,context,callers,callees,impact,files,status"
			
 
				+KEEP_TRACE_CENTRIC="trace,search,node,callers,callees,impact,files,status"
			
 
				+
			
 
				+case "$ARM" in
			
 
				+  A|G|H|I) TOOLS="";            STEERING="" ;;  # no steering; H = body-trace, I = body-trace + destination callees (sufficiency)
			
 
				+  B|F) TOOLS="";                STEERING="$STEER" ;;  # F = B's surface, run on the body-inlining trace build
			
 
				+  C) TOOLS="$KEEP_NO_EXPLORE";  STEERING="$STEER" ;;
			
 
				+  D|E) TOOLS="$KEEP_TRACE_CENTRIC"; STEERING="$STEER" ;;
			
 
				+  *) echo "bad arm '$ARM' (want A|B|C|D|E|F|G|H|I)"; exit 1 ;;
			
 
				+esac
			
 
				+
			
 
				+CFG="$OUT/mcp-$ARM.json"
			
 
				+if [ -n "$TOOLS" ]; then
			
 
				+  cat > "$CFG" <<JSON
			
 
				+{"mcpServers":{"codegraph":{"command":"$CG_BIN","args":["serve","--mcp","--path","$REPO"],"env":{"CODEGRAPH_MCP_TOOLS":"$TOOLS"}}}}
			
 
				+JSON
			
 
				+else
			
 
				+  cat > "$CFG" <<JSON
			
 
				+{"mcpServers":{"codegraph":{"command":"$CG_BIN","args":["serve","--mcp","--path","$REPO"]}}}
			
 
				+JSON
			
 
				+fi
			
 
				+
			
 
				+LOG="$OUT/$ARM-r$RID.jsonl"; ERR="$OUT/$ARM-r$RID.err"
			
 
				+ARGS=( -p "$Q" --output-format stream-json --verbose
			
 
				+       --permission-mode bypassPermissions --model opus --max-budget-usd 4
			
 
				+       --strict-mcp-config --mcp-config "$CFG" )
			
 
				+[ -n "$STEERING" ] && ARGS+=( --append-system-prompt "$STEERING" )
			
 
				+
			
 
				+( cd "$REPO" && claude "${ARGS[@]}" > "$LOG" 2>"$ERR" ); RC=$?  # capture the run's status before any later command substitution can change $?
			
 
				+echo "[$(basename "$REPO") $ARM r$RID] exit $RC -> $LOG ($(wc -l < "$LOG" | tr -d ' ') lines)"
			
--- a/scripts/agent-eval/seq-matrix.mjs
+++ b/scripts/agent-eval/seq-matrix.mjs
@@ -0,0 +1,137 @@
 
				+#!/usr/bin/env node
			
 
				+// Mine the surviving A/B stream-json logs (/tmp/ab-matrix/<Cell>/run-headless-*.jsonl)
			
 
				+// for what the aggregate matrix can't see: the call SEQUENCE and per-call output SIZE.
			
 
				+//
			
 
				+// Answers three questions:
			
 
				+//   1. Trace adoption — on a flow question, does the with-arm actually call codegraph_trace?
			
 
				+//   2. Payload size vs repo size — is trace path-scoped (tiny, size-independent) while
			
 
				+//      explore is breadth-scoped (grows with the repo / over-returns on small repos)?
			
 
				+//   3. Round-trips — num_turns with vs without (the real wall-clock driver).
			
 
				+//
			
 
				+// Usage: node scripts/agent-eval/seq-matrix.mjs [/tmp/ab-matrix]
			
 
				+import { readFileSync, readdirSync, existsSync } from 'fs';
			
 
				+import { join } from 'path';
			
 
				+
			
 
				+const AB = process.argv[2] || '/tmp/ab-matrix';
			
 
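				+const MD = decodeURIComponent(new URL('../../docs/benchmarks/codegraph-ab-matrix.md', import.meta.url).pathname); // URL.pathname is percent-encoded; decode so checkouts with spaces in the path resolve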
			
 
				+
			
 
				+// repo -> {lang,size,files} from the published matrix table
			
 
				+const repoMeta = {};
			
 
				+if (existsSync(MD)) for (const line of readFileSync(MD, 'utf8').split('\n')) {
			
 
				+  const m = line.match(/^\|\s*([^|]+?)\s*\|\s*(S|M|L)\s*\|\s*`([^`]+)`\s*\|\s*(\d+)\s*\|/);
			
 
				+  if (m) repoMeta[m[3]] = { lang: m[1].trim(), size: m[2], files: +m[4] };
			
 
				+}
			
 
				+
			
 
				+const cgShort = (n) => n.replace('mcp__codegraph__codegraph_', '').replace('mcp__codegraph__', '');
			
 
				+const tag = (n) => n === 'Read' ? 'R' : n === 'Grep' ? 'G' : n === 'Glob' ? 'Gl'
			
 
				+  : n === 'Bash' ? 'B' : n === 'Task' ? 'Ag' : n === 'ToolSearch' ? 'TS'
			
 
				+  : n.includes('codegraph') ? cgShort(n) : n;
			
 
				+
			
 
				+function parse(file) {
			
 
				+  if (!existsSync(file)) return null;
			
 
				+  const lines = readFileSync(file, 'utf8').split('\n').filter(Boolean);
			
 
				+  const calls = []; let result = null, initCg = 0;
			
 
				+  for (const l of lines) {
			
 
				+    let ev; try { ev = JSON.parse(l); } catch { continue; }
			
 
				+    if (ev.type === 'system' && ev.subtype === 'init') initCg = (ev.tools || []).filter(t => /codegraph/.test(t)).length;
			
 
				+    if (ev.type === 'assistant') for (const b of (ev.message?.content || [])) if (b.type === 'tool_use') {
			
 
				+      const i = b.input || {};
			
 
				+      const q = i.query ?? i.symbol ?? i.task ?? (i.from && i.to ? `${i.from}->${i.to}` : (i.file_path || i.command || ''));
			
 
				+      calls.push({ id: b.id, name: b.name, q: String(q ?? '').slice(0, 38), out: 0 });
			
 
				+    }
			
 
				+    if (ev.type === 'user') for (const b of (ev.message?.content || [])) if (b.type === 'tool_result') {
			
 
				+      const c = b.content;
			
 
				+      const txt = typeof c === 'string' ? c : Array.isArray(c) ? c.map(x => x?.text || '').join('') : '';
			
 
				+      const call = calls.find(k => k.id === b.tool_use_id); if (call) call.out = txt.length;
			
 
				+    }
			
 
				+    if (ev.type === 'result') result = ev;
			
 
				+  }
			
 
				+  const cg = calls.filter(c => c.name.includes('codegraph'));
			
 
				+  const perTool = {};
			
 
				+  for (const c of cg) { const k = cgShort(c.name); (perTool[k] ??= { n: 0, out: 0 }); perTool[k].n++; perTool[k].out += c.out; }
			
 
				+  const traceIdx = cg.findIndex(c => c.name.includes('trace'));
			
 
				+  const u = result?.usage || {};
			
 
				+  return {
			
 
				+    initCg, cg, perTool,
			
 
				+    cgSeq: cg.map(c => cgShort(c.name)),
			
 
				+    seq: calls.map(c => tag(c.name)),
			
 
				+    reads: calls.filter(c => c.name === 'Read').length,
			
 
				+    greps: calls.filter(c => c.name === 'Grep').length,
			
 
				+    cgOut: cg.reduce((s, c) => s + c.out, 0),
			
 
				+    traceUsed: traceIdx >= 0,
			
 
				+    afterTrace: traceIdx >= 0 ? cg.slice(traceIdx + 1).map(c => cgShort(c.name)) : null,
			
 
				+    turns: result?.num_turns ?? null,
			
 
				+    dur: result?.duration_ms ? Math.round(result.duration_ms / 1000) : null,
			
 
				+    cost: result?.total_cost_usd || 0,
			
 
				+  };
			
 
				+}
			
 
				+
			
 
				+const cells = []; if (!existsSync(AB)) { console.error(`no ${AB}`); process.exit(1); }
			
 
				+for (const d of readdirSync(AB)) {
			
 
				+  const dir = join(AB, d);
			
 
				+  if (!existsSync(join(dir, 'run-headless-with.jsonl'))) continue;
			
 
				+  const log = existsSync(join(AB, d + '.log')) ? readFileSync(join(AB, d + '.log'), 'utf8') : '';
			
 
				+  const repo = (log.match(/repo:\s*\S*\/([^\s/]+)/) || [])[1] || d;
			
 
				+  const question = (log.match(/question:\s*(.+)/) || [])[1] || '';
			
 
				+  cells.push({ cell: d, repo, question, ...(repoMeta[repo] || {}),
			
 
				+    with: parse(join(dir, 'run-headless-with.jsonl')),
			
 
				+    without: parse(join(dir, 'run-headless-without.jsonl')) });
			
 
				+}
			
 
				+cells.sort((a, b) => (a.files || 0) - (b.files || 0));
			
 
				+
			
 
				+const k = (n) => (n / 1000).toFixed(1);
			
 
				+const pad = (s, n) => String(s).padEnd(n);
			
 
				+
			
 
				+// ---- per-cell sequence table ----
			
 
				+console.log('\n=== PER-CELL: with-arm codegraph sequence + payload (sorted by repo size) ===');
			
 
				+console.log(pad('repo', 22), pad('files', 6), 'trace', pad('cg-call sequence', 40), pad('cgOutK', 7), 'turns(w/wo)');
			
 
				+for (const c of cells) {
			
 
				+  const w = c.with;
			
 
				+  console.log(
			
 
				+    pad(c.repo, 22), pad(c.files ?? '?', 6),
			
 
				+    pad(w.traceUsed ? 'YES' : 'no', 5),
			
 
				+    pad(w.cgSeq.join(',') || '(none)', 40),
			
 
				+    pad(k(w.cgOut), 7),
			
 
				+    `${w.turns}/${c.without?.turns}`,
			
 
				+  );
			
 
				+}
			
 
				+
			
 
				+// ---- trace adoption ----
			
 
				+const flow = cells; // every matrix question is a canonical flow question by design
			
 
				+const used = flow.filter(c => c.with.traceUsed);
			
 
				+console.log(`\n=== TRACE ADOPTION (all ${flow.length} cells are flow questions) ===`);
			
 
				+console.log(`trace called in ${used.length}/${flow.length} cells`);
			
 
				+console.log('used trace:', used.map(c => c.repo).join(', ') || '(none)');
			
 
				+if (used.length) console.log('after-trace follow-ups:', used.map(c => `${c.repo}[${c.with.afterTrace.join(',') || 'none'}]`).join('  '));
			
 
				+
			
 
				+// ---- payload size by repo-size tier ----
			
 
				+const tier = (f) => f < 200 ? 'S(<200)' : f < 2000 ? 'M(<2000)' : 'L(>=2000)';
			
 
				+const byTier = {};
			
 
				+for (const c of cells) { (byTier[tier(c.files || 0)] ??= []).push(c.with.cgOut); }
			
 
				+console.log('\n=== with-arm TOTAL codegraph payload by repo-size tier ===');
			
 
				+for (const t of ['S(<200)', 'M(<2000)', 'L(>=2000)']) {
			
 
				+  const a = byTier[t] || []; if (!a.length) continue;
			
 
				+  const avg = a.reduce((s, x) => s + x, 0) / a.length;
			
 
				+  console.log(`  ${pad(t, 10)} n=${a.length}  avg cgOut=${k(avg)}K  range ${k(Math.min(...a))}-${k(Math.max(...a))}K`);
			
 
				+}
			
 
				+
			
 
				+// ---- per-tool usage + avg payload (breadth vs path evidence) ----
			
 
				+const tot = {};
			
 
				+for (const c of cells) for (const [name, v] of Object.entries(c.with.perTool)) {
			
 
				+  (tot[name] ??= { n: 0, out: 0 }); tot[name].n += v.n; tot[name].out += v.out;
			
 
				+}
			
 
				+console.log('\n=== codegraph tool usage across all cells (n calls, avg payload/call) ===');
			
 
				+for (const [name, v] of Object.entries(tot).sort((a, b) => b[1].n - a[1].n)) {
			
 
				+  console.log(`  ${pad(name, 10)} calls=${pad(v.n, 4)} avg=${k(v.out / v.n)}K/call  total=${k(v.out)}K`);
			
 
				+}
			
 
				+
			
 
				+// ---- round-trips ----
			
 
				+const sum = (arr, f) => arr.reduce((s, x) => s + (f(x) || 0), 0);
			
 
				+const wTurns = sum(cells, c => c.with.turns), woTurns = sum(cells, c => c.without?.turns);
			
 
				+const wCalls = sum(cells, c => c.with.cg.length);
			
 
				+const tsAll = cells.every(c => c.with.seq[0] === 'TS');
			
 
				+console.log('\n=== ROUND-TRIPS ===');
			
 
				+console.log(`turns: with=${wTurns}  without=${woTurns}  (${((1 - wTurns / woTurns) * 100).toFixed(0)}% fewer with)`);
			
 
				+console.log(`avg turns/cell: with=${(wTurns / cells.length).toFixed(1)}  without=${(woTurns / cells.length).toFixed(1)}`);
			
 
				+console.log(`total codegraph calls=${wCalls} (avg ${(wCalls / cells.length).toFixed(1)}/cell)`);
			
 
				+console.log(`every with-arm opens with a ToolSearch round-trip (deferred tools): ${tsAll ? 'YES — 1 fixed tax/run' : 'no'}`);