Просмотр исходного кода

docs(benchmarks): call-sequence + tool-ablation analysis; agent-eval arms harness

Records why codegraph read savings (-75%) under-convert to wall-clock (-16%): the bottleneck is round-trips + the synthesis turn, not reads. Ablation (arms A-I) shows explore is 68% of payload but load-bearing, trace is path-scoped but under-adopted, instruction/description steering cannot match an append-prompt's salience (and regresses), and the shippable win is making the trace output sufficient (arm I). Adds harness: seq-matrix, run-arms/arms-*, parse-arms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Colby McHenry 1 месяц назад
Родитель
Сommit
a6183d7c83

+ 388 - 0
docs/benchmarks/call-sequence-analysis.md

@@ -0,0 +1,388 @@
+# Call-sequence analysis — why read savings don't convert to wall-clock
+
+**Date:** 2026-05-23 · **Branch:** `architectural-improvements` · **Source data:** the surviving
+stream-json logs from the A/B matrix (`/tmp/ab-matrix/<Cell>/run-headless-{with,without}.jsonl`,
+37 cells × 2 arms). Re-mined — **no re-runs** — with `scripts/agent-eval/seq-matrix.mjs`.
+
+## Why this exists
+
+The [A/B matrix](codegraph-ab-matrix.md) showed codegraph cuts **reads 75%** but **wall-clock only
+~16%**, and 63% of the wall-clock win comes from just 3 large-repo cells. Reads are at the floor
+(~0), so the remaining wall-clock is **round-trips + the synthesis turn** — neither of which read
+count can explain. The matrix records tool *counts*, not the call **sequence** or per-call
+**payload size**. This analysis recovers both, to find where the wall-clock actually goes.
+
+## TL;DR — the bottleneck is trace ADOPTION, not trace completeness
+
+1. **Trace is called in 3 of 37 cells** — even though every question is a canonical flow question
+   ("trace the controller → service → repository", "how does X reach Y"). The agent overwhelmingly
+   reaches for **`context → search → search → explore`** instead — the exact path-reconstruction
+   anti-pattern the instructions tell it to avoid.
+2. **`explore` averages 17.9K chars/call; `trace` averages 0.8K** — a **22× payload difference**.
+   The path-scoped tool that solves the small-repo-bloat problem exists and is tiny. It's just not
+   being invoked.
+3. **Small repos still get bloated payloads** because of the explore-default: a **6-file** repo
+   (`flutter_module_books`) pulls **17.4K**; a 10-file repo pulls 18.0K. This is precisely the
+   "too much context on small codebases" failure mode — happening right now, via explore.
+4. **Round-trips are 25% fewer with codegraph (283 vs 375 turns)** but wall-clock is only 16%
+   faster — because the with-arm's turns each carry a ~18K explore payload, inflating TTFT and
+   eroding the turn savings.
+5. **Root cause:** `src/mcp/server-instructions.ts` leads with *"answer directly … `codegraph_context`
+   first, then ONE `codegraph_explore`"* as the headline pattern. The trace-first guidance is buried
+   in a table + a chain list below it. Agents anchor on the prominent headline → context→explore.
+
+**Decision:** the next experiment is **trace-first steering / adoption**, not enriching trace. We
+can't evaluate trace's completeness when it's used 3/37 times. Get adoption up first, then measure
+whether the residual `node`/`explore` follow-ups need a richer trace.
+
+## Finding 1 — trace adoption: 3/37
+
+| metric | value |
+|---|---|
+| flow-question cells | 37 (all of them) |
+| cells that called `codegraph_trace` | **3** (`cpp-leveldb`, `excalidraw`, `c-redis`) |
+| dominant pattern instead | `context` → `search`×N → `explore` |
+
+The 3 trace cells, and what followed the trace call:
+
+| repo | files | cg sequence | turns (with/without) |
+|---|--:|---|---|
+| cpp-leveldb | 134 | `trace, node, node` | 5 / 8 |
+| excalidraw | 643 | `context, trace, trace, explore` | 6 / **19** |
+| c-redis | 884 | `context, trace, explore, node` | 10 / 15 |
+
+Even when trace *is* used, the agent follows it with `node`/`explore` to fetch bodies — so a
+secondary lever (after adoption) is making one trace call self-sufficient enough to kill those
+follow-ups. But that's step 2.
+
+## Finding 2 — payload size: path-scoped trace (0.8K) vs breadth-scoped explore (17.9K)
+
+Across all cells, per codegraph tool — call count and **average payload per call**:
+
+| tool | calls | avg/call | total |
+|---|--:|--:|--:|
+| `explore` | 32 | **17.9K** | 573K |
+| `context` | 36 | 4.3K | 156K |
+| `search` | 39 | 1.3K | 50K |
+| `files` | 5 | 3.4K | 17K |
+| `node` | 19 | 2.0K | 38K |
+| `trace` | 4 | **0.8K** | 3.4K |
+
+`context` (used in 36/37 cells) is the default opener; `explore` is the default closer. Together
+they are the ~22K breadth dump. `trace` — the tool that would replace that with the actual path —
+is 22× smaller and barely used. This is the user's premise confirmed in numbers: explore is
+breadth-scoped (returns the neighborhood), trace is path-scoped (returns the line).
+
+## Finding 3 — payload grows with repo size, and over-returns on small repos
+
+With-arm **total** codegraph payload by repo-size tier:
+
+| tier | cells | avg total payload | range |
+|---|--:|--:|--:|
+| S (<200 files) | 19 | 12.7K | 3.0–31.2K |
+| M (<2000) | 9 | 32.4K | 5.4–58.2K |
+| L (≥2000) | 9 | 34.0K | 20.2–43.1K |
+
+The small-repo waste is concrete — these all have a 2–3 file flow but pull a full neighborhood:
+
+| repo | files | with-arm payload | sequence |
+|---|--:|--:|---|
+| flutter_module_books | 6 | 17.4K | `context, explore` |
+| computer-database | 10 | 18.0K | `context, search, status, explore` |
+| aspnet-realworld | 78 | 22.2K | `context, explore` |
+| django-realworld | 44 | 14.8K | `context, explore` |
+
+`explore`'s per-call budget is already adaptive (#185), but it doesn't help here because the agent
+isn't choosing the path-scoped tool — it's choosing breadth.
+
+## Finding 4 — round-trips, and the ToolSearch tax
+
+| metric | with | without |
+|---|--:|--:|
+| total turns (37 cells) | 283 | 375 |
+| avg turns / cell | 7.6 | 10.1 |
+
+25% fewer turns, but only ~16% faster wall-clock — the gap is the per-turn cost of the big explore
+payloads. Also: **every with-arm run opens with a `ToolSearch` round-trip** (MCP tools are deferred
+in this harness), a fixed 1-turn tax before any codegraph call. Worth confirming whether the
+production install defers codegraph tools the same way.
+
+## Conclusion → the experiment to run next
+
+Measure-first changed the plan. The hypothesis was "enrich trace so one call is self-sufficient."
+The data says trace is **used 3/37 times**, so completeness is moot until adoption is fixed.
+
+**Experiment: trace-first steering A/B.**
+- **Change:** rewrite the `server-instructions.ts` headline so a *flow* question (how does X reach Y
+  / trace / from→to) routes to `codegraph_trace` **first**, demoting the context→explore pattern to
+  non-flow/onboarding questions. Mirror into `instructions-template.ts` + `.cursor/rules/codegraph.mdc`.
+- **Metric:** trace-adoption rate (target ≫ 3/37), with-arm total payload (expect ↓ sharply,
+  especially small repos), turns (expect ↓), wall-clock (expect the 16% gap to widen toward the
+  25% turn gap as 18K explore payloads are replaced by <1K traces).
+- **Control:** a non-flow "what's the deal with module X" question must still go context→explore —
+  don't over-steer everything to trace.
+- **Then, step 2:** with adoption up, measure the `node`/`explore` follow-ups after trace
+  (cpp-leveldb/excalidraw/c-redis all had them). If they're frequent, enrich trace (per-hop body
+  snippet, capped per hop) so one trace call ends the flow investigation.
+
+## Reproduce
+
+```bash
+node scripts/agent-eval/seq-matrix.mjs            # regenerates every table above from /tmp/ab-matrix
+```
+
+---
+
+# Ablation experiment — do `context`, `explore`, and `trace` compete? Is `trace` enough?
+
+**Date:** 2026-05-23 · 52 runs, ~$20. Tool surface trimmed **server-side** via the new
+`CODEGRAPH_MCP_TOOLS` allowlist (so an ablated tool is genuinely absent from ListTools, not
+denied-on-call); trace-first steering injected with `--append-system-prompt`. 6 repos (2 S / 2 M /
+2 L) × 2 runs; arm E is a **non-flow** survey question on 2 repos. Driver `arms-matrix.sh`,
+analysis `parse-arms.mjs`.
+
+| arm | tools | steering | adoption | reads | cgOut | turns | dur |
+|---|---|---|--:|--:|--:|--:|--:|
+| **A** control | all | none | 2/12 | 1.25 | 28.8K | 7.6 | 38s |
+| **B** steer | all | trace-first | **8/12** | 1.00 | **32.0K** | 7.9 | 43s |
+| **C** no-explore | hide explore | trace-first | 8/12 | **2.08** | **9.2K** | 9.0 | 44s |
+| **D** trace-centric | hide explore+context | trace-first | 8/12 | 2.00 | 6.6K | 10.5 | 46s |
+| **E** control-probe | hide explore+context | trace-first | 0/4 | 2.50 | 27.8K | **20.0** | **72s** |
+
+## What it says
+
+1. **Steering works for adoption, not for payload.** B lifted trace use **2/12 → 8/12** (and 4/4 on
+   the genuinely path-shaped questions — the 2 non-adopters, flutter "what widgets" and vapor "name
+   the route", aren't from→to questions). But B's payload (32.0K) is *bigger* than control (28.8K)
+   and it's slightly slower — because the agent calls trace **and still calls explore**. Steering
+   adds a trace hop without displacing the explore dump.
+2. **`explore` is the payload, and it's load-bearing — but 3–5× too heavy.** Removing it (C) cuts
+   payload **71%** (32K→9.2K) — confirming it's the bloat. But reads **double** (1.0→2.1) and turns
+   rise: the agent Reads files to recover the bodies explore had inlined. So explore isn't
+   redundant; it's the only one-call body-supplier, just delivered with a 32K sledgehammer.
+3. **`context` is the most redundant of the three — as a body-supplier.** Removing it on top of
+   explore (D vs C) left reads flat (2.08→2.00) but raised turns (9.0→10.5). It supplies no unique
+   bodies; it earns its keep only as a round-trip-saver (the composed orient call).
+4. **Removing tools makes flow questions SLOWER, not faster.** Turns climb monotonically
+   A→D (7.6→10.5) and duration with them — the Read + trace-follow-up round-trips cost more
+   wall-clock than the saved payload. Leaner payload ≠ faster.
+5. **`trace` is definitively NOT sufficient.** The non-flow probe (E) thrashed without the survey
+   tools — **20 turns, 72s** reconstructing an overview from search/node/files. Survey questions
+   need a survey tool; trace can't substitute.
+
+## Verdict on the three design questions
+
+- **Do we need all three?** Yes — but for different reasons. trace = flow tool (real, under-adopted).
+  explore = the one-call body-supplier (load-bearing, over-heavy). context = round-trip-saving
+  opener (redundant for bodies, useful for orientation).
+- **Are they competing?** Yes: explore competes with trace and *wins by default* — even when steered,
+  the agent traces **and** explores, so the payload win never lands until explore is displaced.
+- **Could trace be all we need?** No. E rules it out for non-flow questions; C/D rule it out even
+  for flow (reads double without explore's bodies).
+
+**Three cheap fixes are now ruled out by data:** "trace is all we need" (false), "just steer to
+trace" (B: slower + bigger than control), and "remove explore" (C/D: more reads/turns, slower).
+
+## The fix the data points to → next experiment
+
+The only path that wins: **make `trace` self-sufficient by inlining per-hop bodies** (capped per
+hop → still path-scoped) so one trace call supplies what explore does *and* what the Read fallback
+recovers — displacing both for flow questions. Keep **one** survey tool (context; demote explore to
+deep-survey, not the flow default) for the non-flow class E proved is load-bearing.
+
+- **Experiment:** enriched body-inlining `trace` + steering vs control.
+- **Target:** C/D's lean payload (~7–9K, not 32K) **without** C/D's extra reads/turns, and **beat A
+  on wall-clock** (the bar B/C/D all failed).
+- **Metric:** payload, reads (must stay ≈ A's ~1.0, not rise to 2.0), turns, duration.
+
+## Reproduce (ablation)
+
+```bash
+bash scripts/agent-eval/arms-matrix.sh     # 52 runs into /tmp/arms (RUNS=2 default)
+node scripts/agent-eval/parse-arms.mjs     # the arm-comparison tables above
+```
+
+---
+
+# Validation — body-inlining trace (arm F)
+
+The ablation pointed to one fix: make `trace` self-sufficient by inlining per-hop **bodies**
+(capped per hop → still path-scoped) so one trace call displaces both the explore dump and the
+Read fallback. Implemented in `handleTrace` (`sourceRangeAt`, 28 lines / 1200 chars per hop, with a
+`… (+N more lines)` marker). Arm **F** = arm B's surface (all tools + trace-first steering) run on
+the body-inlining build, so **F vs B isolates the enrichment**.
+
+| arm | adoption | reads | cgOut | turns | dur | cost |
+|---|--:|--:|--:|--:|--:|--:|
+| A all/none | 2/12 | 1.25 | 28.8K | 7.6 | 38s | $0.390 |
+| B all/steer (thin trace) | 8/12 | 1.00 | 32.0K | 7.9 | 43s | $0.411 |
+| **F all/steer (body trace)** | 5/12 | **1.17** | **25.1K** | **6.8** | **37s** | **$0.348** |
+| C no-explore | 8/12 | 2.08 | 9.2K | 9.0 | 44s | $0.356 |
+| D trace-centric | 8/12 | 2.00 | 6.6K | 10.5 | 46s | $0.368 |
+
+**F is the best-balanced arm:** lowest turns (6.8), fastest (37s), cheapest, payload leaner than
+A/B — and it hits the target the ablation set: **C/D-class efficiency without C/D's Read penalty**
+(F reads 1.17 vs C/D's ~2.0). It gets there not by *removing* a tool but by giving the agent a
+complete trace so it *stops early*.
+
+**The win is clearest where trace connects** — excalidraw (the validated 6-hop path):
+
+| arm | sequence | turns | reads | dur |
+|---|---|--:|--:|--:|
+| B (thin) | `trace → context → explore → Grep → Read` | 7 | 1 | 47s |
+| **F (body) r1** | `trace → context` | **4** | **0** | **31s** |
+| F (body) r2 | `trace → trace → explore` | 5 | 0 | 42s |
+
+The body-trace ended the investigation in `trace → context` (run 1) — 0 reads, 0 grep, 0 explore.
+
+**Connectivity is the cap.** On flows that break at *unbridged* dynamic dispatch — aspnet-realworld
+(MediatR `_mediator.Send → Handle`), vapor-spi (closure routing) — trace returns "no path" and the
+agent falls back to explore, so F ≈ B (no regression, no gain). F's aggregate lift is therefore
+**gated by dynamic-dispatch coverage**: the more flows the graph connects end-to-end, the more often
+the self-sufficient trace fires. (n=2/arm — adoption and per-repo numbers are noisy; excalidraw and
+spring-halo, the connecting repos, are 2/2 trace in both B and F.)
+
+## Verdict & ship list
+
+1. **Ship the body-inlining trace** — strict improvement (best-balanced arm; clean 0-read/4-turn win
+   on connecting traces; no regression on non-connecting ones).
+2. **Strengthen the steering.** Arm A (shipped server-instructions, which *already* say "trace first
+   for flow") adopted trace only 2/12 — the guidance is too buried. The explicit
+   `--append-system-prompt` used in B–F lifted it. Port that into `server-instructions.ts` +
+   `instructions-template.ts` + `.cursor/rules/codegraph.mdc` (house rule: all three together),
+   flow-gated so non-flow survey questions still go context/explore (arm E proved they must).
+3. **Next frontier to widen F's reach:** bridge more dynamic dispatch (MediatR/.NET, Vapor routing) —
+   every newly-connected flow converts an F≈B repo into an F-win repo.
+
+## Reproduce (arm F)
+
+```bash
+bash scripts/agent-eval/arms-F.sh          # 12 runs (RUNS=2); needs the body-inlining build
+node scripts/agent-eval/parse-arms.mjs     # F appears alongside A/B/C/D/E
+```
+
+---
+
+# Steering port — the negative result (arm G)
+
+F's win used `--append-system-prompt`, which real users don't get. Arm **G** = arm A's invocation
+(NO append-prompt) on a build where the steering was ported into the production channels
+(`server-instructions.ts` + the `context`/`trace` tool descriptions + `instructions-template.ts` +
+`.cursor/rules`). Three wording iterations, 12 runs each:
+
+| arm | adoption | reads | payload | turns | dur |
+|---|--:|--:|--:|--:|--:|
+| A (shipped instructions) | 2/12 | 1.25 | 28.8K | 7.6 | **38s** |
+| F (body-trace + append-prompt) | 5/12 | **1.17** | 25.1K | 6.8 | **37s** |
+| G v1 — anti-explore wording | 6/12 | 2.08 | 13.8K | 8.8 | 46s |
+| G v2 — restore explore as fallback | 6/12 | 1.67 | 22.0K | 7.8 | 46s |
+| G v3 — restore context as opener | 6/12 | 2.08 | 11.7K | 8.9 | 46s |
+
+**Production-instruction steering does not reproduce F, and regresses the A baseline.** All three G
+variants pin at **~46s** (slower than A's 38s and F's 37s) with reads at 1.7–2.1 (vs A 1.25, F 1.17).
+Wording only shuffled the slack between Read and explore — v1 suppressed explore → Read; v2/v3
+restored explore → over-investigation — never landing F's lean `trace → context`.
+
+**Two root causes:**
+1. **Salience.** The same trace-first wording works as a top-of-prompt `--append-system-prompt` (F)
+   but not as an MCP `initialize` instruction / tool description (G). An MCP server has no
+   higher-salience channel — this is an architectural limit, not a wording bug.
+2. **Forcing trace-first backfires where trace doesn't connect.** Steering pushed trace onto
+   MediatR (`_mediator.Send`) and Spring interface-DI (`@Autowired` iface → impl) flows, where trace
+   returns no-path; the forced trace is then a wasted round-trip *before* the fallback → slower.
+   The **unsteered** agent (A) is better-calibrated: it traces only when trace will obviously
+   connect (2/12) and explores otherwise.
+
+## Arm H — body-trace alone (the ship candidate) regresses
+
+The clean ship test: body-inlining trace + ORIGINAL instructions + no steering (= A's invocation,
+only the trace *tool* changed). H vs A isolates the body-trace feature with nothing else moving.
+
+| arm | adoption | reads | payload | turns | dur |
+|---|--:|--:|--:|--:|--:|
+| A (no body-trace) | 2/12 | 1.25 | 28.8K | 7.6 | **38s** |
+| H (body-trace, no steering) | 3/12 | 1.50 | 29.7K | 8.0 | **45s** |
+| F (body-trace + append-prompt) | 5/12 | 1.17 | 25.1K | 6.8 | 37s |
+
+**Body-trace alone does NOT beat A — it mildly regresses** (45s vs 38s). The sequences show why:
+unsteered, the agent treats trace as just one more call in its usual loop — excalidraw H was
+`context → trace → explore → node×3 → Grep → Read` (77s) — so the bigger body-trace payload is pure
+added cost, not offset by fewer follow-ups. The body-trace only pays off when the agent **leads with
+trace and stops after it**, which only the append-prompt (F) achieved.
+
+## Final verdict
+
+The body-inlining trace is a real win (F) but its value is **entirely contingent on
+lead-with-and-stop-after-trace steering we cannot deliver through any production MCP channel**
+(append-prompt salience ≫ server-instructions / tool-descriptions; G failed three times). On its own
+(H) it regresses. So:
+
+- **SHIP: the `CODEGRAPH_MCP_TOOLS` allowlist** — independent, clean, validated.
+- **DON'T ship the body-inlining trace or the steering as-is** — measured neutral-to-negative
+  without a steering channel we don't have.
+- **The real lever is connectivity, not steering** — trace earns its keep only when flows connect
+  end-to-end; dynamic-dispatch synthesizers (MediatR/.NET, Spring interface-DI, Vapor closures) help
+  the *unsteered* agent, which already traces when trace will connect.
+- **One untested lever** to rescue the body-trace: steer via the trace tool's OWN OUTPUT (the
+  highest-salience channel — the agent reads it fresh, right at the decision point) with a strong
+  leading "complete flow — answer from this, don't explore" banner. Instructions/descriptions are
+  too far from the action; the tool result is not. Unproven; the only remaining shot at making the
+  body-trace pay off in production.
+
+measure-first paid off three times: it killed three cheap fixes in the ablation, stopped a steering
+change that would have shipped an ~8s/query regression (G), and stopped shipping the body-trace
+itself on a confounded assumption (H showed it needs steering we can't deliver).
+
+## Reproduce (arm G)
+
+```bash
+ARM=G bash scripts/agent-eval/arms-F.sh    # production-instruction steering, no append-prompt
+node scripts/agent-eval/parse-arms.mjs
+```
+
+---
+
+# Arm I — sufficiency, not steering (the shippable win)
+
+An LLM stops investigating when its context is *sufficient*, not when it's told to stop. So arm I
+makes the trace OUTPUT complete instead of steering — same invocation as H (original instructions,
+**no steering**), only the trace tool changed:
+1. **Hop bodies no longer clipped** at 28 lines (that clip is why H re-fetched `mutateElement`).
+2. **The destination's own callees are inlined** — the "last mile" the agent otherwise explores/Reads
+   for (excalidraw: `renderStaticScene → _renderStaticScene / renderStaticSceneThrottled`).
+
+| arm | adoption | reads | greps | payload | turns | dur | cost |
+|---|--:|--:|--:|--:|--:|--:|--:|
+| A baseline | 2/12 | 1.25 | 1.17 | 28.8K | 7.6 | 38s | $0.390 |
+| H body-trace alone | 3/12 | 1.50 | 0.42 | 29.7K | 8.0 | 45s | $0.398 |
+| **I body-trace + dest callees** | 2/12 | **1.17** | **0.25** | 27.2K | **7.0** | 39s | **$0.359** |
+| F body-trace + append-steer | 5/12 | 1.17 | 0.17 | 25.1K | 6.8 | 37s | $0.348 |
+
+**I ≥ A on every axis** (reads, greps, turns, cost down; wall-clock flat) and **≈ F on outcomes with
+zero steering** — despite *lower* trace adoption (2/12 vs F's 5/12). The destination-callees fix
+turned the body-trace from a net-negative (H, 45s) into a net-positive (I, 39s): one richer trace
+call now displaces the explore+node+Read follow-ups it used to trigger. excalidraw I-r2 was
+`context → trace → explore` — **0 reads, 5 turns**, stopped because the data was present. The residual
+reads (I-r1) are the `canvasNonce` data-flow — the def-use frontier the graph deliberately omits.
+
+This confirms the thesis: **completeness stops the agent; steering doesn't.** Every steering arm
+(B/F append-prompt, G instructions) was either unshippable or a regression; the sufficiency arm (I)
+ships and needs no steering.
+
+## Revised final verdict (supersedes the arm-G/H verdict above)
+
+- **SHIP: body-inlining trace + destination callees** (arm I) — ≥ A on all axes, no steering, no
+  regression; makes the self-sufficient-trace property real (one trace call answers the flow).
+- **SHIP: the `CODEGRAPH_MCP_TOOLS` allowlist** — independent, validated.
+- **DON'T ship steering** (instructions or tool descriptions) — three variants regressed; MCP can't
+  deliver append-prompt salience, and forcing trace where it doesn't connect backfires.
+- **Connectivity is the multiplier** — arm I helps most where the trace connects; MediatR/.NET,
+  Spring interface-DI, and Vapor closures are the next synthesizers, and they help the *unsteered*
+  agent (which already traces when trace will connect).
+
+## Reproduce (arm I)
+
+```bash
+ARM=I bash scripts/agent-eval/arms-F.sh    # body-trace + destination callees, no steering
+node scripts/agent-eval/parse-arms.mjs
+```

+ 21 - 0
scripts/agent-eval/arms-F.sh

@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+# Arm F (body-inlining trace + trace-first steering) across the same 6 repos as
+# arms-matrix.sh, so F vs B isolates the trace-enrichment effect (same surface,
+# old thin trace in B vs body-inlining trace here).
+set -uo pipefail
+H="$(cd "$(dirname "$0")" && pwd)"; RUNS="${RUNS:-2}"; C="${CORPUS:-/tmp/codegraph-corpus}"
+ROWS=(
+"$C/flutter-samples/add_to_app/books/flutter_module_books|How does the books UI build and what child widgets does it show?"
+"$C/aspnet-realworld|How is creating an article handled? Trace the controller to the service."
+"$C/spring-mall|How is a product-list request handled? Trace the controller to the service."
+"$C/vapor-spi|How is a package-show request handled? Name the route and controller."
+"$C/excalidraw|How does updating an element re-render the canvas on screen? Trace the flow."
+"$C/spring-halo|How is publishing a post handled? Trace the controller to the service."
+)
+ARM="${ARM:-F}"
+echo "### ARM $ARM START $(date) RUNS=$RUNS"
+for row in "${ROWS[@]}"; do
+  repo="${row%%|*}"; q="${row#*|}"
+  for r in $(seq 1 "$RUNS"); do bash "$H/run-arms.sh" "$repo" "$q" "$ARM" "$r"; done
+done
+echo "### ARM $ARM COMPLETE $(date)"

+ 37 - 0
scripts/agent-eval/arms-matrix.sh

@@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+# Drive the tool-surface ablation across the chosen repos × arms (A–E).
+# Arms A–D ask the canonical FLOW question; arm E asks a NON-flow survey
+# question (the control probe — should degrade without explore+context).
+# Output: /tmp/arms/<repo>/<arm>-r<n>.jsonl  (parse with parse-arms.mjs).
+set -uo pipefail
+HARNESS="$(cd "$(dirname "$0")" && pwd)"
+RUNS="${RUNS:-2}"
+C="${CORPUS:-/tmp/codegraph-corpus}"
+NFQ='What are the main modules/components of this codebase and what does each one do? Give an overview of how it is organized.'
+
+# repo-path|flow-question  (2 small, 2 medium, 2 large — spans the size range)
+ROWS=(
+"$C/flutter-samples/add_to_app/books/flutter_module_books|How does the books UI build and what child widgets does it show?"
+"$C/aspnet-realworld|How is creating an article handled? Trace the controller to the service."
+"$C/spring-mall|How is a product-list request handled? Trace the controller to the service."
+"$C/vapor-spi|How is a package-show request handled? Name the route and controller."
+"$C/excalidraw|How does updating an element re-render the canvas on screen? Trace the flow."
+"$C/spring-halo|How is publishing a post handled? Trace the controller to the service."
+)
+
+echo "### ARMS MATRIX START $(date) RUNS=$RUNS"
+for row in "${ROWS[@]}"; do
+  repo="${row%%|*}"; q="${row#*|}"
+  for arm in A B C D; do
+    for r in $(seq 1 "$RUNS"); do
+      bash "$HARNESS/run-arms.sh" "$repo" "$q" "$arm" "$r"
+    done
+  done
+done
+# E: non-flow control probe on two repos (must degrade without explore+context)
+for repo in "$C/excalidraw" "$C/spring-mall"; do
+  for r in $(seq 1 "$RUNS"); do
+    bash "$HARNESS/run-arms.sh" "$repo" "$NFQ" E "$r"
+  done
+done
+echo "### ARMS MATRIX COMPLETE $(date)"

+ 116 - 0
scripts/agent-eval/parse-arms.mjs

@@ -0,0 +1,116 @@
+#!/usr/bin/env node
+// Analyze the tool-surface ablation (/tmp/arms/<repo>/<arm>-r<n>.jsonl).
+// Compares arms A–E on trace adoption, Read/Grep fallback, codegraph payload,
+// round-trips, and duration — averaged across runs per arm.
+//
+// The decisive signal is READS: if removing a tool raises Reads on a question
+// class, that tool was load-bearing for it (not redundant). If removing it
+// changes nothing, it was redundant.
+//
+//   A control       all tools            no steering   (baseline)
+//   B steer         all tools            trace-first   (adoption)
+//   C no-explore    hide explore         trace-first   (is explore redundant?)
+//   D trace-centric hide explore+context trace-first   (is the survey pair redundant?)
+//   E control-probe hide explore+context trace-first   (NON-flow Q — should degrade)
+//
+// Usage: node scripts/agent-eval/parse-arms.mjs [/tmp/arms]
+import { readFileSync, readdirSync, existsSync, statSync } from 'fs';
+import { join } from 'path';
+
+const ROOT = process.argv[2] || '/tmp/arms';
+const cgShort = (n) => n.replace('mcp__codegraph__codegraph_', '').replace('mcp__codegraph__', '');
+
+function parse(file) {
+  if (!existsSync(file)) return null;
+  const lines = readFileSync(file, 'utf8').split('\n').filter(Boolean);
+  const calls = []; let result = null, initCg = 0;
+  for (const l of lines) {
+    let ev; try { ev = JSON.parse(l); } catch { continue; }
+    if (ev.type === 'system' && ev.subtype === 'init') initCg = (ev.tools || []).filter(t => /codegraph/.test(t)).length;
+    if (ev.type === 'assistant') for (const b of (ev.message?.content || [])) if (b.type === 'tool_use')
+      calls.push({ id: b.id, name: b.name, out: 0 });
+    if (ev.type === 'user') for (const b of (ev.message?.content || [])) if (b.type === 'tool_result') {
+      const c = b.content;
+      const txt = typeof c === 'string' ? c : Array.isArray(c) ? c.map(x => x?.text || '').join('') : '';
+      const call = calls.find(k => k.id === b.tool_use_id); if (call) call.out = txt.length;
+    }
+    if (ev.type === 'result') result = ev;
+  }
+  const cg = calls.filter(c => c.name.includes('codegraph'));
+  return {
+    initCg,
+    reads: calls.filter(c => c.name === 'Read').length,
+    greps: calls.filter(c => c.name === 'Grep').length + calls.filter(c => c.name === 'Glob').length,
+    cgCalls: cg.length,
+    cgSeq: cg.map(c => cgShort(c.name)),
+    cgOut: cg.reduce((s, c) => s + c.out, 0),
+    traceUsed: cg.some(c => c.name.includes('trace')),
+    turns: result?.num_turns ?? null,
+    dur: result?.duration_ms ? Math.round(result.duration_ms / 1000) : null,
+    cost: result?.total_cost_usd || 0,
+    ok: result?.subtype === 'success',
+  };
+}
+
+// repo -> arm -> [runs]
+const data = {};
+if (!existsSync(ROOT)) { console.error(`no ${ROOT}`); process.exit(1); }
+for (const repo of readdirSync(ROOT)) {
+  const rdir = join(ROOT, repo);
+  if (!statSync(rdir).isDirectory()) continue;
+  for (const f of readdirSync(rdir)) {
+    const m = f.match(/^([A-I])-r(\d+)\.jsonl$/); if (!m) continue;
+    const p = parse(join(rdir, f)); if (!p || !p.ok) continue;
+    (((data[repo] ??= {})[m[1]]) ??= []).push(p);
+  }
+}
+
+const avg = (a, f) => a.length ? a.reduce((s, x) => s + (f(x) || 0), 0) / a.length : 0;
+const k = (n) => (n / 1000).toFixed(1);
+const pad = (s, n) => String(s).padEnd(n);
+const ARMS = ['A', 'H', 'I', 'B', 'F', 'G', 'C', 'D', 'E'];
+const LABEL = { A: 'A all/none(old)', H: 'H body-trace/none', I: 'I bodytrace+dest', B: 'B all/steer(thin)', F: 'F all/steer(body)', G: 'G ported(noprompt)', C: 'C no-explore', D: 'D trace-centric', E: 'E nonflow-probe' };
+
+// ---- per repo × arm ----
+console.log('\n=== PER REPO × ARM (avg over runs) ===');
+console.log(pad('repo', 22), pad('arm', 16), 'tools', 'trace', pad('reads', 6), pad('cgOutK', 7), pad('turns', 6), 'dur');
+for (const repo of Object.keys(data).sort()) {
+  for (const arm of ARMS) {
+    const runs = data[repo][arm]; if (!runs?.length) continue;
+    console.log(
+      pad(repo, 22), pad(LABEL[arm], 16),
+      pad(runs[0].initCg, 5),
+      pad(runs.filter(r => r.traceUsed).length + '/' + runs.length, 5),
+      pad(avg(runs, r => r.reads).toFixed(1), 6),
+      pad(k(avg(runs, r => r.cgOut)), 7),
+      pad(avg(runs, r => r.turns).toFixed(1), 6),
+      avg(runs, r => r.dur).toFixed(0) + 's',
+    );
+  }
+}
+
+// ---- aggregate per arm (flow arms A–D over the flow repos; E shown apart) ----
+console.log('\n=== AGGREGATE PER ARM (mean across repos) ===');
+console.log(pad('arm', 16), pad('adoption', 9), pad('reads', 7), pad('greps', 7), pad('cgOutK', 8), pad('turns', 7), pad('dur', 6), 'cost');
+for (const arm of ARMS) {
+  const all = [];
+  for (const repo of Object.keys(data)) for (const r of (data[repo][arm] || [])) all.push({ ...r, repo });
+  if (!all.length) continue;
+  const repos = new Set(all.map(r => r.repo)).size;
+  const adopt = all.filter(r => r.traceUsed).length;
+  console.log(
+    pad(LABEL[arm], 16),
+    pad(`${adopt}/${all.length}`, 9),
+    pad(avg(all, r => r.reads).toFixed(2), 7),
+    pad(avg(all, r => r.greps).toFixed(2), 7),
+    pad(k(avg(all, r => r.cgOut)), 8),
+    pad(avg(all, r => r.turns).toFixed(1), 7),
+    pad(avg(all, r => r.dur).toFixed(0) + 's', 6),
+    '$' + avg(all, r => r.cost).toFixed(3),
+    `  (${repos} repos)`,
+  );
+}
+
+console.log('\nRead the signal: B vs A = does steering alone fix adoption + cut payload.');
+console.log('C vs B = is explore redundant (reads should NOT jump). D vs C = is context redundant.');
+console.log('E = non-flow under trace-centric; reads SHOULD jump (proves survey tools are load-bearing).');

+ 56 - 0
scripts/agent-eval/run-arms.sh

@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Tool-surface ablation — run ONE repo+question under ONE arm.
+#
+# Arms vary (exposed codegraph tools, trace-first steering). Tools are trimmed
+# SERVER-SIDE via CODEGRAPH_MCP_TOOLS in the MCP config's `env` block, so an
+# ablated tool is genuinely absent from ListTools — no deferred-ToolSearch or
+# denied-call confound (which --disallowedTools would introduce). Steering is
+# injected with --append-system-prompt, so no rebuild of the shipped
+# server-instructions is needed to A/B it.
+#
+#   A control       all tools            no steering
+#   B steer         all tools            trace-first
+#   C no-explore    hide explore         trace-first
+#   D trace-centric hide explore+context trace-first
+#   E control-probe hide explore+context trace-first  (caller passes a NON-flow Q)
+#
+# Usage: run-arms.sh <repo-path> "<question>" <A|B|C|D|E> [run-id]
+set -uo pipefail
+REPO="${1:?repo path}"; Q="${2:?question}"; ARM="${3:?arm A-E}"; RID="${4:-1}"
+CG_BIN="${CG_BIN:-$(command -v codegraph)}"
+OUT="${ARMS_OUT:-/tmp/arms}/$(basename "$REPO")"
+mkdir -p "$OUT"
+[ -n "$CG_BIN" ] || { echo "no codegraph binary (set CG_BIN)"; exit 1; }
+[ -d "$REPO/.codegraph" ] || { echo "no .codegraph index at $REPO"; exit 1; }
+
+STEER='Flow questions ("how does X reach/become Y", "trace the flow", request to handler, state to render): call codegraph_trace(from,to) FIRST — one call returns the whole path. Use codegraph_context/search only to locate the two endpoint symbols if you do not know them. Do NOT reconstruct the path with repeated search/callers/explore.'
+KEEP_NO_EXPLORE="trace,search,node,context,callers,callees,impact,files,status"
+KEEP_TRACE_CENTRIC="trace,search,node,callers,callees,impact,files,status"
+
+case "$ARM" in
+  A|G|H|I) TOOLS="";            STEERING="" ;;  # no steering; H = body-trace, I = body-trace + destination callees (sufficiency)
+  B|F) TOOLS="";                STEERING="$STEER" ;;  # F = B's surface, run on the body-inlining trace build
+  C) TOOLS="$KEEP_NO_EXPLORE";  STEERING="$STEER" ;;
+  D|E) TOOLS="$KEEP_TRACE_CENTRIC"; STEERING="$STEER" ;;
+  *) echo "bad arm '$ARM' (want A|B|C|D|E)"; exit 1 ;;
+esac
+
+CFG="$OUT/mcp-$ARM.json"
+if [ -n "$TOOLS" ]; then
+  cat > "$CFG" <<JSON
+{"mcpServers":{"codegraph":{"command":"$CG_BIN","args":["serve","--mcp","--path","$REPO"],"env":{"CODEGRAPH_MCP_TOOLS":"$TOOLS"}}}}
+JSON
+else
+  cat > "$CFG" <<JSON
+{"mcpServers":{"codegraph":{"command":"$CG_BIN","args":["serve","--mcp","--path","$REPO"]}}}
+JSON
+fi
+
+LOG="$OUT/$ARM-r$RID.jsonl"; ERR="$OUT/$ARM-r$RID.err"
+ARGS=( -p "$Q" --output-format stream-json --verbose
+       --permission-mode bypassPermissions --model opus --max-budget-usd 4
+       --strict-mcp-config --mcp-config "$CFG" )
+[ -n "$STEERING" ] && ARGS+=( --append-system-prompt "$STEERING" )
+
+( cd "$REPO" && claude "${ARGS[@]}" > "$LOG" 2>"$ERR" )
+echo "[$(basename "$REPO") $ARM r$RID] exit $? -> $LOG ($(wc -l < "$LOG" | tr -d ' ') lines)"

+ 137 - 0
scripts/agent-eval/seq-matrix.mjs

@@ -0,0 +1,137 @@
+#!/usr/bin/env node
+// Mine the surviving A/B stream-json logs (/tmp/ab-matrix/<Cell>/run-headless-*.jsonl)
+// for what the aggregate matrix can't see: the call SEQUENCE and per-call output SIZE.
+//
+// Answers three questions:
+//   1. Trace adoption — on a flow question, does the with-arm actually call codegraph_trace?
+//   2. Payload size vs repo size — is trace path-scoped (tiny, size-independent) while
+//      explore is breadth-scoped (grows with the repo / over-returns on small repos)?
+//   3. Round-trips — num_turns with vs without (the real wall-clock driver).
+//
+// Usage: node scripts/agent-eval/seq-matrix.mjs [/tmp/ab-matrix]
+import { readFileSync, readdirSync, existsSync } from 'fs';
+import { join } from 'path';
+
+const AB = process.argv[2] || '/tmp/ab-matrix';
+const MD = new URL('../../docs/benchmarks/codegraph-ab-matrix.md', import.meta.url).pathname;
+
+// repo -> {lang,size,files} from the published matrix table
+const repoMeta = {};
+if (existsSync(MD)) for (const line of readFileSync(MD, 'utf8').split('\n')) {
+  const m = line.match(/^\|\s*([^|]+?)\s*\|\s*(S|M|L)\s*\|\s*`([^`]+)`\s*\|\s*(\d+)\s*\|/);
+  if (m) repoMeta[m[3]] = { lang: m[1].trim(), size: m[2], files: +m[4] };
+}
+
+const cgShort = (n) => n.replace('mcp__codegraph__codegraph_', '').replace('mcp__codegraph__', '');
+const tag = (n) => n === 'Read' ? 'R' : n === 'Grep' ? 'G' : n === 'Glob' ? 'Gl'
+  : n === 'Bash' ? 'B' : n === 'Task' ? 'Ag' : n === 'ToolSearch' ? 'TS'
+  : n.includes('codegraph') ? cgShort(n) : n;
+
+function parse(file) {
+  if (!existsSync(file)) return null;
+  const lines = readFileSync(file, 'utf8').split('\n').filter(Boolean);
+  const calls = []; let result = null, initCg = 0;
+  for (const l of lines) {
+    let ev; try { ev = JSON.parse(l); } catch { continue; }
+    if (ev.type === 'system' && ev.subtype === 'init') initCg = (ev.tools || []).filter(t => /codegraph/.test(t)).length;
+    if (ev.type === 'assistant') for (const b of (ev.message?.content || [])) if (b.type === 'tool_use') {
+      const i = b.input || {};
+      const q = i.query ?? i.symbol ?? i.task ?? (i.from && i.to ? `${i.from}->${i.to}` : (i.file_path || i.command || ''));
+      calls.push({ id: b.id, name: b.name, q: String(q ?? '').slice(0, 38), out: 0 });
+    }
+    if (ev.type === 'user') for (const b of (ev.message?.content || [])) if (b.type === 'tool_result') {
+      const c = b.content;
+      const txt = typeof c === 'string' ? c : Array.isArray(c) ? c.map(x => x?.text || '').join('') : '';
+      const call = calls.find(k => k.id === b.tool_use_id); if (call) call.out = txt.length;
+    }
+    if (ev.type === 'result') result = ev;
+  }
+  const cg = calls.filter(c => c.name.includes('codegraph'));
+  const perTool = {};
+  for (const c of cg) { const k = cgShort(c.name); (perTool[k] ??= { n: 0, out: 0 }); perTool[k].n++; perTool[k].out += c.out; }
+  const traceIdx = cg.findIndex(c => c.name.includes('trace'));
+  const u = result?.usage || {};
+  return {
+    initCg, cg, perTool,
+    cgSeq: cg.map(c => cgShort(c.name)),
+    seq: calls.map(c => tag(c.name)),
+    reads: calls.filter(c => c.name === 'Read').length,
+    greps: calls.filter(c => c.name === 'Grep').length,
+    cgOut: cg.reduce((s, c) => s + c.out, 0),
+    traceUsed: traceIdx >= 0,
+    afterTrace: traceIdx >= 0 ? cg.slice(traceIdx + 1).map(c => cgShort(c.name)) : null,
+    turns: result?.num_turns ?? null,
+    dur: result?.duration_ms ? Math.round(result.duration_ms / 1000) : null,
+    cost: result?.total_cost_usd || 0,
+  };
+}
+
+const cells = [];
+for (const d of readdirSync(AB)) {
+  const dir = join(AB, d);
+  if (!existsSync(join(dir, 'run-headless-with.jsonl'))) continue;
+  const log = existsSync(join(AB, d + '.log')) ? readFileSync(join(AB, d + '.log'), 'utf8') : '';
+  const repo = (log.match(/repo:\s*\S*\/([^\s/]+)/) || [])[1] || d;
+  const question = (log.match(/question:\s*(.+)/) || [])[1] || '';
+  cells.push({ cell: d, repo, question, ...(repoMeta[repo] || {}),
+    with: parse(join(dir, 'run-headless-with.jsonl')),
+    without: parse(join(dir, 'run-headless-without.jsonl')) });
+}
+cells.sort((a, b) => (a.files || 0) - (b.files || 0));
+
+const k = (n) => (n / 1000).toFixed(1);
+const pad = (s, n) => String(s).padEnd(n);
+
+// ---- per-cell sequence table ----
+console.log('\n=== PER-CELL: with-arm codegraph sequence + payload (sorted by repo size) ===');
+console.log(pad('repo', 22), pad('files', 6), 'trace', pad('cg-call sequence', 40), pad('cgOutK', 7), 'turns(w/wo)');
+for (const c of cells) {
+  const w = c.with;
+  console.log(
+    pad(c.repo, 22), pad(c.files ?? '?', 6),
+    pad(w.traceUsed ? 'YES' : 'no', 5),
+    pad(w.cgSeq.join(',') || '(none)', 40),
+    pad(k(w.cgOut), 7),
+    `${w.turns}/${c.without?.turns}`,
+  );
+}
+
+// ---- trace adoption ----
+const flow = cells; // every matrix question is a canonical flow question by design
+const used = flow.filter(c => c.with.traceUsed);
+console.log(`\n=== TRACE ADOPTION (all ${flow.length} cells are flow questions) ===`);
+console.log(`trace called in ${used.length}/${flow.length} cells`);
+console.log('used trace:', used.map(c => c.repo).join(', ') || '(none)');
+if (used.length) console.log('after-trace follow-ups:', used.map(c => `${c.repo}[${c.with.afterTrace.join(',') || 'none'}]`).join('  '));
+
+// ---- payload size by repo-size tier ----
+const tier = (f) => f < 200 ? 'S(<200)' : f < 2000 ? 'M(<2000)' : 'L(>=2000)';
+const byTier = {};
+for (const c of cells) { (byTier[tier(c.files || 0)] ??= []).push(c.with.cgOut); }
+console.log('\n=== with-arm TOTAL codegraph payload by repo-size tier ===');
+for (const t of ['S(<200)', 'M(<2000)', 'L(>=2000)']) {
+  const a = byTier[t] || []; if (!a.length) continue;
+  const avg = a.reduce((s, x) => s + x, 0) / a.length;
+  console.log(`  ${pad(t, 10)} n=${a.length}  avg cgOut=${k(avg)}K  range ${k(Math.min(...a))}-${k(Math.max(...a))}K`);
+}
+
+// ---- per-tool usage + avg payload (breadth vs path evidence) ----
+const tot = {};
+for (const c of cells) for (const [name, v] of Object.entries(c.with.perTool)) {
+  (tot[name] ??= { n: 0, out: 0 }); tot[name].n += v.n; tot[name].out += v.out;
+}
+console.log('\n=== codegraph tool usage across all cells (n calls, avg payload/call) ===');
+for (const [name, v] of Object.entries(tot).sort((a, b) => b[1].n - a[1].n)) {
+  console.log(`  ${pad(name, 10)} calls=${pad(v.n, 4)} avg=${k(v.out / v.n)}K/call  total=${k(v.out)}K`);
+}
+
+// ---- round-trips ----
+const sum = (arr, f) => arr.reduce((s, x) => s + (f(x) || 0), 0);
+const wTurns = sum(cells, c => c.with.turns), woTurns = sum(cells, c => c.without?.turns);
+const wCalls = sum(cells, c => c.with.cg.length);
+const tsAll = cells.every(c => c.with.seq[0] === 'TS');
+console.log('\n=== ROUND-TRIPS ===');
+console.log(`turns: with=${wTurns}  without=${woTurns}  (${((1 - wTurns / woTurns) * 100).toFixed(0)}% fewer with)`);
+console.log(`avg turns/cell: with=${(wTurns / cells.length).toFixed(1)}  without=${(woTurns / cells.length).toFixed(1)}`);
+console.log(`total codegraph calls=${wCalls} (avg ${(wCalls / cells.length).toFixed(1)}/cell)`);
+console.log(`every with-arm opens with a ToolSearch round-trip (deferred tools): ${tsAll ? 'YES — 1 fixed tax/run' : 'no'}`);