docs(readme): answer directly with codegraph, not via an Explore agent (#367)

Replace the stale "## CodeGraph" example block (NEVER call explore directly /
ALWAYS spawn an Explore agent) and the How-It-Works diagram with the validated
"answer directly" guidance, and add codegraph_context/trace/explore to the tool
table. Interactive A/B (Excalidraw + VS Code, n=3/arm) shows direct codegraph
answering beats Explore-agent delegation at every scale: main-session context is
~scale-invariant (~50k), with 0 reads vs 17-26 and ~28% fewer tokens. Record the
writeup under docs/benchmarks/answer-directly-vs-explore-agent.md.

Docs-only; stays on 0.9.4 (no version bump).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Colby Mchenry, 1 month ago
parent
commit
1f3625a3e9
2 changed files with 114 additions and 41 deletions
  1. README.md (+26 -41)
  2. docs/benchmarks/answer-directly-vs-explore-agent.md (+88 -0)

+ 26 - 41
README.md

@@ -263,25 +263,21 @@ CodeGraph builds a semantic knowledge graph of codebases for faster, smarter cod
 
 ### If `.codegraph/` exists in the project
 
-**NEVER call `codegraph_explore` or `codegraph_context` directly in the main session.** These tools return large amounts of source code that fills up main session context. Instead, ALWAYS spawn an Explore agent for any exploration question (e.g., "how does X work?", "explain the Y system", "where is Z implemented?").
+**Answer directly with CodeGraph — don't delegate exploration to a file-reading sub-agent or a grep/read loop.** CodeGraph *is* the pre-built search index; re-deriving its answers with grep + Read repeats work it already did and costs more for the same result. For "how does X work?", architecture, trace, or where-is-X questions, answer in a handful of CodeGraph calls and stop — typically with **zero file reads**. The returned source is complete and authoritative: treat it as already read and do not re-open those files. Reach for raw Read/Grep only to confirm a specific detail CodeGraph didn't cover.
 
-**When spawning Explore agents**, include this instruction in the prompt:
-
-> This project has CodeGraph initialized (.codegraph/ exists). Use `codegraph_explore` as your PRIMARY tool — it returns full source code sections from all relevant files in one call.
->
-> **Rules:**
-> 1. Follow the explore call budget in the `codegraph_explore` tool description — it scales automatically based on project size.
-> 2. Do NOT re-read files that codegraph_explore already returned source code for. The source sections are complete and authoritative.
-> 3. Only fall back to grep/glob/read for files listed under "Additional relevant files" if you need more detail, or if codegraph returned no results.
-
-**The main session may only use these lightweight tools directly** (for targeted lookups before making edits, not for exploration):
+**Tool selection by intent:**
 
 | Tool | Use For |
 |------|---------|
-| `codegraph_search` | Find symbols by name |
-| `codegraph_callers` / `codegraph_callees` | Trace call flow |
+| `codegraph_context` | Map a task / feature / area first — composes search + node + callers + callees in one call |
+| `codegraph_trace` | "How does X reach Y" — the call path, each hop's body inline (follows dynamic-dispatch hops grep can't) |
+| `codegraph_explore` | Survey several related symbols' source in ONE budget-capped call |
+| `codegraph_search` | Find a symbol by name |
+| `codegraph_callers` / `codegraph_callees` | Walk call flow one hop at a time |
 | `codegraph_impact` | Check what's affected before editing |
-| `codegraph_node` | Get a single symbol's details |
+| `codegraph_node` | Get a single symbol's source / signature |
+
+A direct CodeGraph answer is a handful of calls; a grep/read exploration is dozens.
 
 ### If `.codegraph/` does NOT exist
 
@@ -297,34 +293,23 @@ At the start of a session, ask the user if they'd like to initialize CodeGraph:
 ## How It Works
 
 ```
-┌─────────────────────────────────────────────────────────────────┐
-│                        Claude Code                               │
-│                                                                  │
-│  "Implement user authentication"                                 │
-│           │                                                      │
-│           ▼                                                      │
-│  ┌─────────────────┐      ┌─────────────────┐                   │
-│  │  Explore Agent  │ ──── │  Explore Agent  │                   │
-│  └────────┬────────┘      └────────┬────────┘                   │
-│           │                        │                             │
-└───────────┼────────────────────────┼─────────────────────────────┘
-            │                        │
-            ▼                        ▼
 ┌───────────────────────────────────────────────────────────────────┐
-│                     CodeGraph MCP Server                          │
-│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐               │
-│  │   Search    │  │   Callers   │  │   Context   │               │
-│  │  "auth"     │  │  "login()"  │  │  for task   │               │
-│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘               │
-│         │                │                │                       │
-│         └────────────────┼────────────────┘                       │
-│                          ▼                                        │
-│              ┌───────────────────────┐                            │
-│              │   SQLite Graph DB     │                            │
-│              │   • 387 symbols       │                            │
-│              │   • 1,204 edges       │                            │
-│              │   • Instant lookups   │                            │
-│              └───────────────────────┘                            │
+│                            Claude Code                            │
+│                                                                   │
+│   "How does a request reach the database?"                        │
+│       calls CodeGraph tools directly — no Explore sub-agent       │
+│                                 │                                 │
+└─────────────────────────────────┬─────────────────────────────────┘
+                                  │
+                                  ▼
+┌───────────────────────────────────────────────────────────────────┐
+│                        CodeGraph MCP Server                       │
+│                                                                   │
+│       context · trace · explore · callers · callees · impact      │
+│                                 │                                 │
+│                                 ▼                                 │
+│                       SQLite knowledge graph                      │
+│          symbols · edges · files · FTS5 full-text search          │
 └───────────────────────────────────────────────────────────────────┘
 ```
 

+ 88 - 0
docs/benchmarks/answer-directly-vs-explore-agent.md

@@ -0,0 +1,88 @@
+# Answer directly vs. delegate to an Explore agent (interactive A/B)
+
+**Question:** Does answering a "how does X work?" question *directly* with CodeGraph in the
+main session bloat main-session context — and would Claude Code be better off delegating that
+exploration to a disposable **Explore agent** (which keeps main context lean by absorbing the
+file reads in a sub-transcript)? And critically: **does the answer change at scale**, on a
+codebase far larger than Excalidraw?
+
+**Short answer:** No. With CodeGraph, main-session context is roughly **scale-invariant (~50k)**
+because the retrieval is targeted and the `explore` payload is budget-capped — it does not
+balloon on a 16× larger repo. Answering directly wins at **every** scale: same-or-leaner main
+context than the delegation path, **zero file reads**, and ~28% fewer tokens. The
+delegation-for-hygiene advantage stays marginal even on a large codebase.
+
+## Methodology
+
+- **Harness:** interactive Claude Code TUI driven via `scripts/agent-eval/itrun.sh` (tmux),
+  **not** headless `claude -p`. This matters: headless spawns **0** Explore agents, so it cannot
+  measure delegation behavior at all; only the interactive TUI does.
+- **Arms:** `WITH` = CodeGraph in the MCP config; `WITHOUT` = empty MCP config (`--strict-mcp-config`).
+- **Model:** `opus`. **n = 3 runs per arm.** Main **and** sub-agent transcripts parsed
+  (`scripts/agent-eval/parse-session.mjs`); reads/bash are summed across main + sub-agents.
+- **Repos:** Excalidraw (643 files, medium) and VS Code (~10.7k files, large — ~16× Excalidraw).
+- **Build:** 0.9.4. **Date:** 2026-05-24.
+- "main-session context" is the TUI's reported `Context X/Y` for the *main* thread (sub-agent
+  context does not count against it). "billable tokens" = summed per-turn assistant usage
+  (input + output + cache read + cache creation).
+
+## Excalidraw (643 files, medium)
+
+Question: *"How does Excalidraw render and update canvas elements?"*
+
+| metric | WITH codegraph | WITHOUT |
+|---|---|---|
+| Explore agents spawned | 0 / 0 / 0 | 0 / 1 / 1 (delegated 2 of 3) |
+| main-session context | 51k / 49k / 50k (~50k) | 48k / 34k / 26k (~36k) |
+| total tool calls | 4 / 4 / 4 | 16 / 55 / 37 |
+| Reads (main+sub) | 0 / 0 / 0 | 6 / 25 / 16 |
+| billable tokens | ~127k | ~175k |
+
+## VS Code (~10.7k files, large — ~16× Excalidraw)
+
+Question: *"How does the extension host communicate with the main process?"*
+
+| metric | WITH codegraph | WITHOUT |
+|---|---|---|
+| main-session context | 47k / 43k / 50k (~47k) | 54k / 29k / 31k (~38k) |
+| Explore agents | 0 / 0 / 0 | 0 / 1 / 1 (delegated 2/3) |
+| codegraph calls | ~8 (search + explore×2–3 + context) | 0 |
+| Reads (main+sub) | 0 / 1 / 0 | 6 / 26 / 19 |
+| billable tokens | ~126k | ~176k |
+
+## Findings
+
+**Main-session context is scale-invariant with CodeGraph.** With codegraph, main-session
+context was **~47k on VS Code — essentially identical to Excalidraw's ~50k**, despite a 16×
+bigger repo. It didn't balloon. Reason: codegraph's `explore` payload is **budget-capped** and
+retrieval is **targeted** — answering one question pulls in the relevant *flow/area*, not more
+just because the repo is huge. So codegraph makes main-session context roughly scale-invariant
+(~50k). The delegation-for-hygiene advantage stays marginal even on a large codebase — exactly
+the opposite of "it gets significant at scale."
+
+The thing that *would* balloon at scale is reading many big files directly into main — and
+Claude Code avoids that **without** codegraph by delegating to an Explore agent (29–31k main),
+but at the cost of **17–26 reads** and more tokens (direct answering uses ~28% fewer). CodeGraph keeps main lean a *better*
+way: a capped, targeted payload — no delegation, **0 reads**.
+
+**On "the Explore agents use codegraph."** I couldn't reproduce it: across **6/6**
+with-codegraph runs (both repos), Claude Code **never delegated** — it answered directly every
+time. The Explore-agent path only appeared in the `without` arm (using grep/read, since codegraph
+wasn't in that config). So with the current instructions + codegraph present, Claude Code stays
+in the main session — the lean-main-via-Explore-agent best case simply isn't what happens;
+lean-main-via-capped-codegraph is, and it's cheaper.
+
+## Verdict
+
+**"Answer directly with codegraph" wins for Claude Code too — at every scale.** No per-agent
+split is needed; the unified "answer directly" instruction is right for Claude Code *and* for
+Codex / Cursor / opencode (which have no Explore-agent mechanism and would otherwise read files
+directly). This conclusion drove updating the README's `## CodeGraph` example block, which
+previously told agents to "NEVER call `codegraph_explore` directly / ALWAYS spawn an Explore
+agent", i.e., it steered Claude Code toward the *worse* path (17–26 reads; the direct path uses ~28% fewer tokens).
+
+**Caveat / future work (not a blocker):** an Explore agent that *itself uses codegraph* could in
+principle get lean-main *and* low-work. But the "answer directly" instruction prevents delegation
+in practice (0 delegations observed across 6 runs), the main-context gain would be marginal
+(~50k → ~30k, both a few percent of a 1M window), and it adds a sub-agent round-trip. Worth a
+future experiment, not a default.
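
The headline figures in the benchmark writeup can be re-derived from its own tables. The sketch below is a sanity check, not part of the commit: it copies the per-run main-session context and the approximate billable-token totals from the two tables and recomputes the per-arm means and the token difference. The function and variable names are hypothetical. The token totals are the rounded "~" values from the tables, so the percentages are approximate as well.

```python
# Figures copied from the benchmark tables (thousands of tokens).
# All names here are illustrative; they do not come from the repository.
CONTEXT_K = {
    ("Excalidraw", "WITH"): [51, 49, 50],
    ("Excalidraw", "WITHOUT"): [48, 34, 26],
    ("VS Code", "WITH"): [47, 43, 50],
    ("VS Code", "WITHOUT"): [54, 29, 31],
}
BILLABLE_K = {
    "Excalidraw": {"WITH": 127, "WITHOUT": 175},
    "VS Code": {"WITH": 126, "WITHOUT": 176},
}


def mean_context_k(repo: str, arm: str) -> float:
    """Mean main-session context across the three runs of one arm."""
    runs = CONTEXT_K[(repo, arm)]
    return sum(runs) / len(runs)


def token_savings_pct(repo: str) -> float:
    """Tokens saved by the WITH arm, as a percentage of the WITHOUT total."""
    with_k = BILLABLE_K[repo]["WITH"]
    without_k = BILLABLE_K[repo]["WITHOUT"]
    return 100 * (without_k - with_k) / without_k


def extra_tokens_pct(repo: str) -> float:
    """Extra tokens spent by the WITHOUT arm, as a percentage of the WITH total."""
    with_k = BILLABLE_K[repo]["WITH"]
    without_k = BILLABLE_K[repo]["WITHOUT"]
    return 100 * (without_k - with_k) / with_k


if __name__ == "__main__":
    for repo in BILLABLE_K:
        print(
            f"{repo}: context WITH ~{mean_context_k(repo, 'WITH'):.0f}k, "
            f"WITHOUT ~{mean_context_k(repo, 'WITHOUT'):.0f}k; "
            f"WITH saves {token_savings_pct(repo):.1f}% of tokens; "
            f"WITHOUT spends {extra_tokens_pct(repo):.1f}% more"
        )
```

The two percentage helpers use different baselines on purpose. "Fewer tokens" is measured against the WITHOUT total and "more tokens" against the WITH total, so the same absolute gap gives a larger number in the second form.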