
4 commits c9e207a0f2 ... 291b200ece

Author SHA1 Message Date
  Colby McHenry 291b200ece chore: stop tracking .claude/handoffs (local session notes only) 4 days ago
  Colby McHenry f82a662ddb feat(mcp): pare default tool surface to codegraph_explore alone + redux-thunk synthesizer 4 days ago
  Colby McHenry 7ddd3fa7eb test(agent-eval): persist offload accuracy/adoption eval harness + front-load hook 4 days ago
  Colby McHenry 6d5cb6b25c feat(reasoning): add `CODEGRAPH_OFFLOAD_DISABLE` kill-switch and per-call usage log 4 days ago

File diff suppressed because it is too large
+ 0 - 175
.claude/handoffs/cross-language-impact-coverage-2026-06-04.md


+ 7 - 22
.cursor/rules/codegraph.mdc

@@ -1,37 +1,22 @@
 ---
-description: CodeGraph MCP usage guide — when to use which tool
+description: CodeGraph MCP usage guide — one tool, codegraph_explore
 alwaysApply: true
 ---
 <!-- CODEGRAPH_START -->
 ## CodeGraph
 
-This project has a CodeGraph MCP server (`codegraph_*` tools) configured. CodeGraph is a tree-sitter-parsed knowledge graph of every symbol, edge, and file. Reads are sub-millisecond and return structural information grep cannot.
+This project has a CodeGraph MCP server configured, exposing a single tool: `codegraph_explore`. CodeGraph is a tree-sitter-parsed knowledge graph of every symbol, edge, and file. Reads are sub-millisecond and return structural information grep cannot.
 
-### When to prefer codegraph over native search
+### Use codegraph_explore instead of reading files
 
-Use codegraph for **structural** questions — what calls what, what would break, where is X defined, what is X's signature. Use native grep/read only for **literal text** queries (string contents, comments, log messages) or after you already have a specific file open.
-
-| Question | Tool |
-|---|---|
-| "Where is X defined?" / "Find symbol named X" | `codegraph_search` |
-| "What calls function Y?" | `codegraph_callers` |
-| "What does Y call?" | `codegraph_callees` |
-| "How does X reach/become Y? / trace the flow from X to Y" | `codegraph_trace` (one call = the whole path, incl. callback/React/JSX dynamic hops) |
-| "What would break if I changed Z?" | `codegraph_impact` |
-| "Show me Y's signature / source / docstring" | `codegraph_node` |
-| "Give me focused context for a task/area" | `codegraph_context` |
-| "See several related symbols' source at once" | `codegraph_explore` |
-| "What files exist under path/" | `codegraph_files` |
-| "Is the index healthy?" | `codegraph_status` |
+Reach for `codegraph_explore` before grep/find or Read for any **structural** question — how does X work, how does X reach Y, what calls what, where is X defined, or surveying an area. It takes a natural-language question or a bag of symbol/file names and returns the relevant symbols' **verbatim, line-numbered source** grouped by file (the same `<n>\t<line>` shape Read gives you, safe to Edit from), plus the call paths between them — including dynamic-dispatch hops (callbacks, React re-render, JSX children) grep can't follow — and a blast-radius summary of what depends on them. Name a file or symbol in the query to read its current source.
 
 ### Rules of thumb
 
-- **Answer directly — don't delegate exploration.** For "how does X work" / architecture questions, answer with 2-3 codegraph calls: `codegraph_context` first, then ONE `codegraph_explore` for the source of the symbols it surfaces. For a specific **flow** ("how does X reach Y") start with `codegraph_trace` from→to — one call returns the whole path with dynamic hops bridged — then ONE `codegraph_explore` for the bodies; don't rebuild the path with `codegraph_search` + `codegraph_callers`. Codegraph IS the pre-built index, so spawning a separate file-reading sub-task/agent — or running a grep + read loop — repeats work codegraph already did and costs more for the same answer.
+- **Answer directly — don't delegate exploration.** ONE `codegraph_explore` usually answers the whole question; follow up with another `codegraph_explore` naming more specific symbols if you need more. Codegraph IS the pre-built index, so spawning a separate file-reading sub-task/agent — or running a grep + read loop — repeats work codegraph already did and costs more for the same answer.
 - **Trust codegraph results.** They come from a full AST parse. Do NOT re-verify them with grep — that's slower, less accurate, and wastes context.
-- **Don't grep first** when looking up a symbol by name. `codegraph_search` is faster and returns kind + location + signature in one call.
-- **Don't chain `codegraph_search` + `codegraph_node`** when you just want context — `codegraph_context` is one call.
-- **Don't loop `codegraph_node` over many symbols** — one `codegraph_explore` call returns several symbols' source grouped in a single capped call, while each separate node/Read call re-reads the whole context and costs far more.
-- **Index lag — check the staleness banner, don't guess a wait.** When a codegraph response starts with "⚠️ Some files referenced below were edited since the last index sync…", the listed files are pending re-index — Read those specific files for accurate content. Files NOT in that banner are fresh and codegraph is authoritative for them. `codegraph_status` also lists pending files under "Pending sync".
+- **Don't grep or Read first** to find or understand indexed code — one `codegraph_explore` returns the relevant source in a single round-trip. Reach for raw Read/Grep only to confirm a specific detail codegraph didn't cover, or for what it doesn't index (configs, docs).
+- **Index lag — check the staleness banner, don't guess a wait.** When a codegraph response starts with "⚠️ Some files referenced below were edited since the last index sync…", the listed files are pending re-index — Read those specific files for accurate content. Files NOT in that banner are fresh and codegraph is authoritative for them.
 
 ### If `.codegraph/` doesn't exist
 

+ 1 - 0
.gitignore

@@ -40,6 +40,7 @@ npm-debug.log*
 # Local Claude settings
 .claude/settings.local.json
 .claude/scheduled_tasks.lock
+.claude/handoffs/
 
 # Parallels Windows VM SSH/connection config (local machine, see CLAUDE.md)
 .parallels

+ 12 - 20
README.md

@@ -262,7 +262,7 @@ agent writes src/Widget.ts
   → next agent query sees it
 ```
 
-**Verify any time** with `codegraph_status` (via MCP) or `codegraph status` (CLI). If anything is pending, you'll see a `### Pending sync:` section naming the files and their edit age.
+**Verify any time** with `codegraph status` (CLI). If anything is pending, you'll see a `### Pending sync:` section naming the files and their edit age.
 
 The handful of cases where manual `codegraph sync` makes sense: the watcher is disabled (sandboxed environments, or `CODEGRAPH_NO_DAEMON=1`), or you're scripting against the index outside an agent session and want a pre-flight sync at the start of your script.
 
@@ -300,7 +300,7 @@ CodeGraph detects web-framework routing files and emits `route` nodes linked by
 
 ## Mixed iOS / React Native / Expo bridging
 
-Real iOS and React Native codebases live across multiple languages — a Swift caller invokes an Objective-C selector that's been auto-bridged, a JS file calls into a native module via the React Native bridge, a JSX component delegates to a native view manager. Static tree-sitter extraction stops at each language boundary. CodeGraph bridges them so `trace`, `callers`, `callees`, and `impact` connect end-to-end across the gap.
+Real iOS and React Native codebases live across multiple languages — a Swift caller invokes an Objective-C selector that's been auto-bridged, a JS file calls into a native module via the React Native bridge, a JSX component delegates to a native view manager. Static tree-sitter extraction stops at each language boundary. CodeGraph bridges them so `codegraph_explore` connects the flow end-to-end across the gap — call paths and blast radius cross the boundary instead of stopping at it.
 
 | Boundary | JS / Swift side | Native side | How |
 |---|---|---|---|
@@ -339,7 +339,7 @@ The installer will:
 - Ask which agent(s) to configure — auto-detects installed ones from: **Claude Code**, **Cursor**, **Codex CLI**, **opencode**, **Hermes Agent**, **Gemini CLI**, **Antigravity IDE**, **Kiro**
 - Prompt to install `codegraph` on your PATH (so agents can launch the MCP server)
 - Ask whether configs apply to all your projects or just this one
-- Write each chosen agent's MCP server config, plus a small marker-fenced CodeGraph section in the agent's instructions file (`CLAUDE.md` / `AGENTS.md` / `GEMINI.md`) — that's how subagents and non-MCP agents learn the `codegraph explore` / `codegraph node` commands, since the MCP server's own guidance only reaches the main agent. Removed cleanly by `codegraph uninstall`.
+- Write each chosen agent's MCP server config, plus a small marker-fenced CodeGraph section in the agent's instructions file (`CLAUDE.md` / `AGENTS.md` / `GEMINI.md`) — that's how subagents and non-MCP agents learn the `codegraph explore` command, since the MCP server's own guidance only reaches the main agent. Removed cleanly by `codegraph uninstall`.
 - Set up auto-allow permissions when Claude Code is one of the targets
 - Initialize your current project (local installs only)
 
@@ -401,19 +401,14 @@ npm install -g @colbymchenry/codegraph
 {
   "permissions": {
     "allow": [
-      "mcp__codegraph__codegraph_search",
-      "mcp__codegraph__codegraph_explore",
-      "mcp__codegraph__codegraph_callers",
-      "mcp__codegraph__codegraph_callees",
-      "mcp__codegraph__codegraph_impact",
-      "mcp__codegraph__codegraph_node",
-      "mcp__codegraph__codegraph_status",
-      "mcp__codegraph__codegraph_files"
+      "mcp__codegraph__*"
     ]
   }
 }
 ```
 
+<sub>One wildcard auto-approves every CodeGraph tool — `codegraph_explore` is the only one listed by default, but if you re-enable others via `CODEGRAPH_MCP_TOOLS` they're already permitted, no prompt.</sub>
+
 </details>
 
 <details>
@@ -422,11 +417,11 @@ npm install -g @colbymchenry/codegraph
 CodeGraph's MCP server delivers its usage guidance to your agent **automatically**, in the MCP `initialize` response. In short, it tells the agent to:
 
 - **Answer structural questions directly with CodeGraph** — it *is* the pre-built index, so a grep/read loop just repeats work it already did. Treat the returned source as already read.
-- **Pick the tool by intent:** `codegraph_explore` for almost anything — "how does X work", a flow/"how does X reach Y", or surveying an area (one call returns the relevant symbols' source grouped by file); `codegraph_search` to just locate a symbol; `codegraph_callers` for every call site (including callback registrations); `codegraph_node` for one symbol's full source + callers, or to read a file like the Read tool.
+- **Reach for `codegraph_explore` for almost anything** — "how does X work", a flow/"how does X reach Y", or surveying an area. One call returns the relevant symbols' verbatim source grouped by file, the call paths between them (dynamic-dispatch hops included), and a blast-radius summary. Name a file or symbol in the query to read its current line-numbered source.
 - **Trust the results — don't re-verify with grep**, and check the staleness banner after edits.
 - In a workspace with no index, CodeGraph announces itself inactive and serves no tools — indexing stays your decision.
 
-The exact text is `src/mcp/server-instructions.ts` — the single source of truth for the main agent. Because subagents and non-MCP harnesses never see the MCP guidance, the installer also writes a four-line marker-fenced section into the agent's instructions file pointing at the `codegraph explore` / `codegraph node` CLI equivalents.
+The exact text is `src/mcp/server-instructions.ts` — the single source of truth for the main agent. Because subagents and non-MCP harnesses never see the MCP guidance, the installer also writes a short marker-fenced section into the agent's instructions file pointing at the `codegraph explore` CLI equivalent.
 
 </details>
 
@@ -447,7 +442,7 @@ The exact text is `src/mcp/server-instructions.ts` — the single source of trut
 ┌───────────────────────────────────────────────────────────────────┐
 │                        CodeGraph MCP Server                       │
 │                                                                   │
-│       explore · search · callers · callees · impact · node       
+│ explore  ·  one call → verbatim source + call flow + blast radius
 │                                 │                                 │
 │                                 ▼                                 │
 │                       SQLite knowledge graph                      │
@@ -524,16 +519,13 @@ fi
 
 ## MCP Tools
 
-When running as an MCP server, CodeGraph exposes a focused set of four tools — measured agent behavior showed a leaner list steers agents to the right tool and saves context every session:
+When running as an MCP server, CodeGraph exposes a **single tool** — `codegraph_explore`. Measured agent behavior showed that one strong tool steers agents better than a menu of narrower ones — fewer mis-picks, and it saves context every session:
 
 | Tool | Purpose |
 |------|---------|
-| `codegraph_explore` | **Primary.** Answer almost any question in one call — "how does X work", a flow ("how does X reach Y"), or surveying an area — returning the relevant symbols' verbatim source grouped by file, plus a relationship map and blast radius. Surfaces dynamic-dispatch hops (callbacks, React re-render, interface→impl) grep can't follow. |
-| `codegraph_node` | One symbol's full source + caller/callee trail (every overload for an ambiguous name) — or pass a file path to **read a whole file like the Read tool** (same line-numbered output, `offset`/`limit`), with its dependents attached. |
-| `codegraph_search` | Find symbols by name across the codebase |
-| `codegraph_callers` | Every call site of a function — including where it's registered as a callback — with one section per definition when several share a name |
+| `codegraph_explore` | Answer almost any question in one call — "how does X work", a flow ("how does X reach Y"), or surveying an area — returning the relevant symbols' verbatim source grouped by file, plus the call paths between them and a blast-radius summary. Surfaces dynamic-dispatch hops (callbacks, React re-render, interface→impl) grep can't follow. Name a file or symbol in the query to read its current line-numbered source, the same shape the Read tool gives you. |
 
-Four more tools (`codegraph_callees`, `codegraph_impact`, `codegraph_files`, `codegraph_status`) stay fully functional but unlisted by default — measured across eval runs, agents never or rarely picked them, and their information already arrives inline on the four above (explore's blast-radius section, node's dependents note, a symbol's body as its callee list). Re-enable any of them with the `CODEGRAPH_MCP_TOOLS` environment variable (e.g. `CODEGRAPH_MCP_TOOLS=explore,node,search,callers,impact`), or use their CLI equivalents (`codegraph callees` / `impact` / `files` / `status`).
+The other tools (`codegraph_node`, `codegraph_search`, `codegraph_callers`, `codegraph_callees`, `codegraph_impact`, `codegraph_files`, `codegraph_status`) stay fully functional but **unlisted by default** — everything they return already arrives inline on `codegraph_explore` (its blast-radius section, the relationship map, a symbol's body as its callee list). Re-enable any of them for the MCP surface with the `CODEGRAPH_MCP_TOOLS` environment variable (e.g. `CODEGRAPH_MCP_TOOLS=explore,node,search,callers`), or use their CLI equivalents (`codegraph node` / `query` / `callers` / `callees` / `impact` / `files` / `status`).
 
 In a workspace with no `.codegraph/` index, the server announces itself inactive and lists **no** tools — agents work normally with their built-in tools, and indexing stays your decision.
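Review note: the README hunk above describes the `CODEGRAPH_MCP_TOOLS` allowlist only in prose. A minimal sketch of how such an allowlist could be resolved is shown here, assuming short names map to a `codegraph_` prefix and unknown names are dropped. `resolveListedTools`, `ALL_TOOLS`, and `DEFAULT_TOOLS` are hypothetical names for illustration and are not taken from the CodeGraph source; only the default (explore alone) and the "empty or whitespace means unset" behavior come from this diff.

```typescript
// Hypothetical sketch: these names do not come from the CodeGraph codebase.
const ALL_TOOLS: readonly string[] = [
  'explore', 'node', 'search', 'callers', 'callees', 'impact', 'files', 'status',
];
// Per this diff, the default surface is explore alone.
const DEFAULT_TOOLS: readonly string[] = ['explore'];

function resolveListedTools(raw: string | undefined): string[] {
  const trimmed = (raw ?? '').trim();
  // An empty or whitespace-only value is treated as unset (default surface).
  const requested =
    trimmed === ''
      ? DEFAULT_TOOLS.slice()
      : trimmed.split(',').map((s) => s.trim()).filter((s) => s.length > 0);
  // Assumption: unknown names are silently ignored and repeats are collapsed.
  const unique = Array.from(new Set(requested));
  return unique
    .filter((name) => ALL_TOOLS.indexOf(name) !== -1)
    .map((name) => `codegraph_${name}`)
    .sort();
}
```

Under these assumptions, `resolveListedTools('explore,node,search,callers')` would list the four tools from the README example, sorted by name, while an unset variable lists only `codegraph_explore`.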
 

+ 1 - 1
__tests__/installer-targets.test.ts

@@ -1031,7 +1031,7 @@ describe('Installer targets — partial-state idempotency', () => {
     // The unrelated GitKraken hook survives untouched.
     expect(stopCommands.some((c: string) => c.includes('gk') && c.includes('ai hook run'))).toBe(true);
     // Permissions still written as normal alongside the cleanup.
-    expect(after.permissions?.allow).toContain('mcp__codegraph__codegraph_search');
+    expect(after.permissions?.allow).toContain('mcp__codegraph__*');
   });
 
   it('claude: cleanupLegacyHooks preserves a sibling hook sharing our matcher group', () => {

+ 7 - 13
__tests__/mcp-tool-allowlist.test.ts

@@ -17,18 +17,13 @@ describe('CODEGRAPH_MCP_TOOLS allowlist', () => {
 
   const listed = () => new ToolHandler(null).getTools().map(t => t.name).sort();
 
-  it('exposes the default 4-tool surface when unset', () => {
+  it('exposes ONLY codegraph_explore by default when unset', () => {
     delete process.env[ENV];
-    // The default set (see DEFAULT_MCP_TOOLS): explore + node are the
-    // validated workhorses, search the cheap lookup, callers the one
-    // irreplaceable enumerator. callees/impact/files/status stay defined
-    // and executable but unlisted — impact appeared in ZERO recorded runs.
-    expect(listed()).toEqual([
-      'codegraph_callers',
-      'codegraph_explore',
-      'codegraph_node',
-      'codegraph_search',
-    ]);
+    // The default set (see DEFAULT_MCP_TOOLS) is pared to explore alone — the one
+    // tool that earns its place (verbatim source grouped by file, plus the reasoned
+    // flow map under the offload). node/search/callers/callees/impact/files/status
+    // stay defined and executable but unlisted; CODEGRAPH_MCP_TOOLS re-enables them.
+    expect(listed()).toEqual(['codegraph_explore']);
   });
 
   it('re-enables an unlisted tool via the allowlist (impact)', () => {
@@ -48,8 +43,7 @@ describe('CODEGRAPH_MCP_TOOLS allowlist', () => {
 
   it('treats an empty/whitespace value as unset (default surface)', () => {
     process.env[ENV] = '   ';
-    expect(listed()).toHaveLength(4);
-    expect(listed()).toContain('codegraph_explore');
+    expect(listed()).toEqual(['codegraph_explore']);
   });
 
   it('rejects a disabled tool on execute (defense in depth)', async () => {

+ 7 - 7
__tests__/mcp-unindexed.test.ts

@@ -116,7 +116,7 @@ describe('Unindexed-workspace session policy', () => {
     expect(instructions).toMatch(/inactive/i);
     expect(instructions).toMatch(/codegraph init/);
     // The full playbook must NOT be sent into a session where every call fails
-    expect(instructions).not.toMatch(/Tool selection by intent/);
+    expect(instructions).not.toMatch(/How to query/);
     expect(instructions).not.toMatch(/codegraph_explore/);
   });
 
@@ -128,7 +128,7 @@ describe('Unindexed-workspace session policy', () => {
     expect((res.result as { tools: unknown[] }).tools).toEqual([]);
   });
 
-  it('an INDEXED workspace still gets the full playbook and all tools', async () => {
+  it('an INDEXED workspace still gets the full playbook and the explore tool', async () => {
     fs.writeFileSync(path.join(tempDir, 'index.ts'), 'export function hello(): string { return "hi"; }\n');
     const cg = await CodeGraph.init(tempDir, { index: true });
     cg.close();
@@ -136,15 +136,15 @@ describe('Unindexed-workspace session policy', () => {
     child = spawnServer(tempDir);
     const init = await request(child, { id: 0, method: 'initialize', params: initializeParams(tempDir) });
     const instructions = (init.result as { instructions: string }).instructions;
-    expect(instructions).toMatch(/Tool selection by intent/);
+    expect(instructions).toMatch(/How to query/);
     expect(instructions).not.toMatch(/inactive/i);
 
     const list = await request(child, { id: 1, method: 'tools/list' });
     const tools = (list.result as { tools: Array<{ name: string }> }).tools;
-    // A 1-file project triggers the pre-existing tiny-repo tool gating (a
-    // reduced core set) — the contract under test is "indexed → tools are
-    // PRESENT", in contrast to the unindexed empty list above.
-    expect(tools.length).toBeGreaterThanOrEqual(3);
+    // The default surface is pared to explore alone (see DEFAULT_MCP_TOOLS) — the
+    // contract under test is "indexed → tools are PRESENT", in contrast to the
+    // unindexed empty list above.
+    expect(tools.length).toBeGreaterThanOrEqual(1);
     expect(tools.map((t) => t.name)).toContain('codegraph_explore');
   });
 });

+ 59 - 1
__tests__/offload.test.ts

@@ -34,7 +34,7 @@ describe('reasoning offload', () => {
     'CODEGRAPH_OFFLOAD_URL', 'CODEGRAPH_OFFLOAD_MODEL', 'CODEGRAPH_OFFLOAD_KEY',
     'CODEGRAPH_OFFLOAD_EFFORT', 'CODEGRAPH_OFFLOAD_STYLE', 'CODEGRAPH_OFFLOAD_TIMEOUT_MS',
     'CODEGRAPH_OFFLOAD_MAXTOKENS', 'CODEGRAPH_OFFLOAD_STRIP', 'CODEGRAPH_OFFLOAD_DEBUG',
-    'CEREBRAS_API_KEY',
+    'CODEGRAPH_OFFLOAD_DISABLE', 'CODEGRAPH_OFFLOAD_USAGE_LOG', 'CEREBRAS_API_KEY',
   ];
   let saved: Record<string, string | undefined>;
 
@@ -118,6 +118,64 @@ describe('reasoning offload', () => {
     });
   });
 
+  describe('CODEGRAPH_OFFLOAD_DISABLE kill-switch', () => {
+    it('forces the offload off even when managed + signed in', () => {
+      writeOffloadConfig({ managed: true });
+      writeOffloadToken('cgai_live');
+      expect(resolveOffload().enabled).toBe(true); // sanity: on without the flag
+      process.env.CODEGRAPH_OFFLOAD_DISABLE = '1';
+      const c = resolveOffload();
+      expect(c.enabled).toBe(false);
+      expect(c.managed).toBe(false);
+      expect(c.origin).toBe('none');
+      expect(isOffloadEnabled()).toBe(false);
+    });
+
+    it('forces the offload off even with a BYO endpoint + key', () => {
+      process.env.CODEGRAPH_OFFLOAD_URL = 'https://env.example/v1';
+      process.env.CODEGRAPH_OFFLOAD_KEY = 'sk-direct';
+      expect(resolveOffload().enabled).toBe(true);
+      process.env.CODEGRAPH_OFFLOAD_DISABLE = '1';
+      expect(resolveOffload().enabled).toBe(false);
+    });
+  });
+
+  describe('per-call usage log (CODEGRAPH_OFFLOAD_USAGE_LOG)', () => {
+    const okResponse = () => ({
+      ok: true, status: 200,
+      headers: { get: (h: string) => (h === 'x-cg-credits-charged' ? '127' : null) },
+      json: async () => ({
+        choices: [{ message: { content: 'Coverage: full.\nThe answer.' }, finish_reason: 'stop' }],
+        usage: { prompt_tokens: 700, completion_tokens: 80, total_tokens: 780 },
+      }),
+    });
+
+    it('appends one JSON line with tokens + charged credits when the log path is set', async () => {
+      writeOffloadConfig({ url: 'https://api.cerebras.ai/v1', keyEnv: 'CEREBRAS_API_KEY' });
+      process.env.CEREBRAS_API_KEY = 'sk-live';
+      vi.stubGlobal('fetch', vi.fn().mockResolvedValue(okResponse()));
+      const logPath = path.join(home, 'usage.jsonl');
+      process.env.CODEGRAPH_OFFLOAD_USAGE_LOG = logPath;
+
+      await synthesizeOffload({ query: 'q', context: 'src' });
+      const line = JSON.parse(fs.readFileSync(logPath, 'utf8').trim());
+      expect(line.totalTokens).toBe(780);
+      expect(line.promptTokens).toBe(700);
+      expect(line.creditsCharged).toBe(127);
+      expect(line.costUsd).toBeCloseTo(0.00127, 6); // 100k credits = $1
+      expect(line.answerLen).toBeGreaterThan(0);
+    });
+
+    it('is a no-op (and never throws) when the log path is unset', async () => {
+      writeOffloadConfig({ url: 'https://api.cerebras.ai/v1', keyEnv: 'CEREBRAS_API_KEY' });
+      process.env.CEREBRAS_API_KEY = 'sk-live';
+      vi.stubGlobal('fetch', vi.fn().mockResolvedValue(okResponse()));
+      // no CODEGRAPH_OFFLOAD_USAGE_LOG set → answer still returns fine
+      const out = await synthesizeOffload({ query: 'q', context: 'src' });
+      expect(out).toContain('Coverage: full.');
+    });
+  });
+
   describe('strict degradation (never throws, returns null to fall back)', () => {
     it('returns null when no endpoint is configured', async () => {
       expect(await synthesizeOffload({ query: 'q', context: 'ctx' })).toBeNull();
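Review note: the usage-log tests above pin down the shape of one JSONL record (`promptTokens`, `totalTokens`, `creditsCharged`, `costUsd`, `answerLen`) and the conversion stated in the test comment, 100k credits = $1. The sketch below shows one way that record could be built. It is an illustration only: `buildUsageLine`, `formatUsageLine`, and the `completionTokens` field are assumptions, not the implementation under test.

```typescript
// Hypothetical sketch; only the asserted field names come from the test above.
const CREDITS_PER_USD = 100000; // the test comment states 100k credits = $1

interface OffloadUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface UsageLogLine {
  promptTokens: number;
  completionTokens: number; // assumed field, not asserted by the test
  totalTokens: number;
  creditsCharged: number | null;
  costUsd: number | null;
  answerLen: number;
}

function buildUsageLine(
  usage: OffloadUsage,
  creditsHeader: string | null, // e.g. the x-cg-credits-charged header value
  answer: string,
): UsageLogLine {
  const hasHeader = creditsHeader !== null && creditsHeader.trim() !== '';
  const parsed = hasHeader ? Number(creditsHeader) : NaN;
  const creditsCharged = isFinite(parsed) ? parsed : null;
  return {
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.total_tokens,
    creditsCharged,
    costUsd: creditsCharged === null ? null : creditsCharged / CREDITS_PER_USD,
    answerLen: answer.length,
  };
}

// Returns the JSONL text to append, or null when no log path is configured,
// so the caller can skip the write entirely (the "no-op when unset" case).
function formatUsageLine(logPath: string | undefined, line: UsageLogLine): string | null {
  if (!logPath) return null;
  return JSON.stringify(line) + '\n';
}
```

With the mocked response from the test (127 credits, 780 total tokens), this sketch yields a `costUsd` of 127 / 100000, which matches the `toBeCloseTo(0.00127, 6)` assertion.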

+ 82 - 0
__tests__/redux-thunk-synthesizer.test.ts

@@ -0,0 +1,82 @@
+import { describe, it, expect, beforeEach, afterEach } from 'vitest';
+import * as fs from 'node:fs';
+import * as path from 'node:path';
+import * as os from 'node:os';
+import { CodeGraph } from '../src';
+
+/**
+ * End-to-end test for the redux-thunk dispatch-chain synthesizer.
+ *
+ * `createAsyncThunk(prefix, async (a, api) => {...})` passes the async body as an argument, so
+ * tree-sitter never makes it its own function node — the thunk `constant`'s body calls (incl.
+ * `dispatch(nextThunk(...))`) are orphaned and `callees(thunk)` is empty. Verify the synthesizer
+ * body-scans each thunk constant and links it → each dispatched thunk, so the chain
+ * `outer → inner → deep` connects end-to-end; and that a non-thunk constant is skipped.
+ */
+describe('redux-thunk synthesizer', () => {
+  let dir: string;
+  beforeEach(() => {
+    dir = fs.mkdtempSync(path.join(os.tmpdir(), 'redux-thunk-fixture-'));
+  });
+  afterEach(() => {
+    fs.rmSync(dir, { recursive: true, force: true });
+  });
+
+  it('links each thunk constant to the thunks it dispatches, and skips non-thunks', async () => {
+    fs.writeFileSync(
+      path.join(dir, 'package.json'),
+      JSON.stringify({ name: 'app', dependencies: { '@reduxjs/toolkit': '^2' } })
+    );
+    fs.writeFileSync(
+      path.join(dir, 'thunks.ts'),
+      `import { createAsyncThunk } from '@reduxjs/toolkit';
+
+export const deepThunk = createAsyncThunk('app/deep', async (n: number) => {
+  return n * 2;
+});
+
+export const innerThunk = createAsyncThunk('app/inner', async (n: number, { dispatch }) => {
+  return dispatch(deepThunk(n));
+});
+
+export const outerThunk = createAsyncThunk('app/outer', async (n: number, { dispatch }) => {
+  await dispatch(innerThunk(n));
+});
+
+// Non-thunk constant that only MENTIONS dispatch in a string — must be skipped.
+export const notAThunk = 'dispatch(innerThunk())';
+`
+    );
+
+    const cg = await CodeGraph.init(dir, { silent: true });
+    await cg.indexAll();
+
+    const db = (cg as any).db.db;
+    const rows = db
+      .prepare(
+        `SELECT s.name source_name, s.kind source_kind, t.name target_name,
+                json_extract(e.metadata,'$.via') via,
+                json_extract(e.metadata,'$.registeredAt') registeredAt
+         FROM edges e
+         JOIN nodes s ON s.id = e.source
+         JOIN nodes t ON t.id = e.target
+         WHERE json_extract(e.metadata,'$.synthesizedBy') = 'redux-thunk'`
+      )
+      .all();
+    cg.close?.();
+
+    // The dispatch chain connects: outer → inner → deep.
+    const pairs = new Set(rows.map((r: any) => `${r.source_name}>${r.target_name}`));
+    expect(pairs.has('outerThunk>innerThunk')).toBe(true);
+    expect(pairs.has('innerThunk>deepThunk')).toBe(true);
+
+    // Sources are thunk constants; the non-thunk string constant is never a source.
+    expect(rows.every((r: any) => r.source_kind === 'constant')).toBe(true);
+    expect(rows.some((r: any) => r.source_name === 'notAThunk')).toBe(false);
+
+    // Edges are 'calls' with the wiring site surfaced for the agent.
+    const outer = rows.find((r: any) => r.source_name === 'outerThunk');
+    expect(outer.via).toBe('innerThunk');
+    expect(outer.registeredAt).toMatch(/thunks\.ts:\d+/);
+  });
+});

+ 65 - 0
scripts/agent-eval/offload-eval-3arm.sh

@@ -0,0 +1,65 @@
+#!/usr/bin/env bash
+# 3-arm offload eval for ONE indexed repo + ONE question, n reps each.
+#   ARM offload : codegraph attached, managed offload ON  (per-run AI usage log)
+#   ARM raw     : codegraph attached, CODEGRAPH_OFFLOAD_DISABLE=1 (raw source)
+#   ARM nocg    : no codegraph (empty MCP config) -> Read/Grep baseline
+# All arms: claude -p sonnet --effort high. One JSON metrics line/run -> $RESULTS.
+#
+# Usage: offload-eval-3arm.sh <indexed-repo> <tier> <reps> "<question>"
+# Env:   MODEL=sonnet EFFORT=high  RESULTS=<file>  AGENT_EVAL_OUT=<scratch dir>
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"
+ENGINE="$(cd "$HERE/../.." && pwd)"
+BIN="$ENGINE/dist/bin/codegraph.js"
+OUT="${AGENT_EVAL_OUT:-/tmp/cg-offload-eval}"
+TARGET="${1:?usage: offload-eval-3arm.sh <indexed-repo> <tier> <reps> \"<question>\"}"
+TIER="${2:?tier}"; REPS="${3:?reps}"; Q="${4:?question}"
+RUNS="$OUT/runs"
+EXTRACT="$HERE/offload-eval-metrics.mjs"
+RESULTS="${RESULTS:-$OUT/results.jsonl}"
+REPO=$(basename "$TARGET")
+mkdir -p "$RUNS"
+command -v claude >/dev/null || { echo "no claude on PATH"; exit 1; }
+[ -d "$TARGET/.codegraph" ] || { echo "not indexed: $TARGET (run offload-eval-setup.sh first)"; exit 1; }
+# Physical path so pkill matches the daemon's real cmdline (macOS /tmp->/private/tmp symlink
+# otherwise makes the kill miss the daemon, and the next arm connects to the SURVIVING daemon
+# — contaminating the raw arm with offload).
+TARGET=$(cd "$TARGET" && pwd -P)
+
+prewarm() { # path  extra-env (e.g. "FOO=bar")
+  pkill -9 -f "serve --mcp --path $1" 2>/dev/null; rm -f "$1/.codegraph/daemon.sock" 2>/dev/null; sleep 0.6
+  env ${2:-} CODEGRAPH_DAEMON_IDLE_TIMEOUT_MS=1800000 node "$BIN" serve --mcp --path "$1" </dev/null >/dev/null 2>&1 &
+  node -e 'const fs=require("fs");let n=0;const t=setInterval(()=>{if(fs.existsSync(process.argv[1]+"/.codegraph/daemon.sock")){clearInterval(t);process.exit(0)}if(n++>150){clearInterval(t);process.exit(1)}},100)' "$1" \
+    && echo "  daemon warm" || echo "  WARN daemon never bound"
+}
+
+run() { # arm rep mcp-config usage-log-or-dash
+  local arm="$1" rep="$2" cfg="$3" usage="$4" tag="$REPO-$1-$2"
+  [ "$usage" != "-" ] && : > "$usage"
+  ( cd "$TARGET" && claude -p "$Q" \
+      --output-format stream-json --verbose --permission-mode bypassPermissions \
+      --model "${MODEL:-sonnet}" --effort "${EFFORT:-high}" --max-budget-usd 4 \
+      --strict-mcp-config --mcp-config "$cfg" \
+      </dev/null > "$RUNS/$tag.jsonl" 2>"$RUNS/$tag.err" )
+  node "$EXTRACT" --run "$RUNS/$tag.jsonl" --usage "$usage" --arm "$arm" --rep "$rep" \
+      --repo "$REPO" --tier "$TIER" --q "$Q" >> "$RESULTS"
+  node -e 'const o=JSON.parse(require("fs").readFileSync(process.argv[1],"utf8").trim().split("\n").pop());console.log(`  [${o.arm} #${o.rep}] ${o.durationSec}s | main $${o.costUsdMain} ${o.tokBillable} tok | read=${o.read} grep=${o.grep} explore=${o.explore} offload=${o.offloadFired} | AI ${o.ai.calls}call/${o.ai.totalTokens}tok/$${o.ai.costUsd.toFixed(4)} | ok=${o.ok}`)' "$RESULTS"
+}
+
+CFG_OFF="$RUNS/mcp-offload-$REPO.json"; CFG_RAW="$RUNS/mcp-raw-$REPO.json"; CFG_NOCG="$RUNS/mcp-nocg.json"
+USAGE="$RUNS/$REPO-usage.jsonl"
+printf '{"mcpServers":{"codegraph":{"command":"env","args":["CODEGRAPH_WASM_RELAUNCHED=1","CODEGRAPH_OFFLOAD_USAGE_LOG=%s","node","%s","serve","--mcp","--path","%s"]}}}' "$USAGE" "$BIN" "$TARGET" > "$CFG_OFF"
+printf '{"mcpServers":{"codegraph":{"command":"env","args":["CODEGRAPH_WASM_RELAUNCHED=1","CODEGRAPH_OFFLOAD_DISABLE=1","node","%s","serve","--mcp","--path","%s"]}}}' "$BIN" "$TARGET" > "$CFG_RAW"
+printf '{"mcpServers":{}}' > "$CFG_NOCG"
+
+echo "###### repo=$REPO tier=$TIER reps=$REPS model=${MODEL:-sonnet}/${EFFORT:-high}"
+echo "###### Q=$Q"
+echo "== ARM offload =="; prewarm "$TARGET" "CODEGRAPH_OFFLOAD_USAGE_LOG=$USAGE"
+for r in $(seq 1 "$REPS"); do run offload "$r" "$CFG_OFF" "$USAGE"; done
+pkill -9 -f "serve --mcp --path $TARGET" 2>/dev/null; rm -f "$TARGET/.codegraph/daemon.sock" 2>/dev/null; sleep 1
+echo "== ARM raw =="; prewarm "$TARGET" "CODEGRAPH_OFFLOAD_DISABLE=1"
+for r in $(seq 1 "$REPS"); do run raw "$r" "$CFG_RAW" "-"; done
+pkill -9 -f "serve --mcp --path $TARGET" 2>/dev/null; rm -f "$TARGET/.codegraph/daemon.sock" 2>/dev/null; sleep 1
+echo "== ARM nocg =="
+for r in $(seq 1 "$REPS"); do run nocg "$r" "$CFG_NOCG" "-"; done
+echo "###### DONE $REPO"
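
The three MCP configs above are written with `printf` and `%s`, which produces valid JSON only while the paths contain no quotes or backslashes. A minimal sketch of the same config shape built with `JSON.stringify` is shown here for comparison; the helper name `mcpConfig` and the sample paths are hypothetical and not part of the commit.

```javascript
// Hypothetical helper: builds the same {"mcpServers":{"codegraph":{...}}} shape
// that the printf lines emit, but lets JSON.stringify do the escaping.
function mcpConfig(bin, target, envPairs = []) {
  return JSON.stringify({
    mcpServers: {
      codegraph: {
        command: 'env',
        args: [
          'CODEGRAPH_WASM_RELAUNCHED=1',
          ...envPairs, // e.g. ['CODEGRAPH_OFFLOAD_DISABLE=1'] for the raw arm
          'node', bin, 'serve', '--mcp', '--path', target,
        ],
      },
    },
  });
}

// Sample paths are made up; the second one contains a double quote on purpose.
const raw = mcpConfig('/opt/engine/dist/bin/codegraph.js', '/tmp/repos/my "odd" repo', [
  'CODEGRAPH_OFFLOAD_DISABLE=1',
]);
console.log(JSON.parse(raw).mcpServers.codegraph.args.at(-1)); // the target path survives intact
```

The no-codegraph arm needs no helper, since its config is the constant `{"mcpServers":{}}`.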

+ 108 - 0
scripts/agent-eval/offload-eval-effort.mjs

@@ -0,0 +1,108 @@
+#!/usr/bin/env node
+// Effort A/B — does CODEGRAPH_OFFLOAD_EFFORT=high improve offload SYNTHESIS FIDELITY vs low?
+// Probe-based (no agent): for each repo × effort × rep, run codegraph_explore with the offload
+// ON on the canonical question, capture the synthesized answer + AI tokens/cost/latency, then
+// Sonnet-judge that answer's fidelity vs source-verified ground truth. Isolates the synthesis
+// from agent/adoption noise. Requires `codegraph login` (managed offload) + indexed repos.
+//
+// Env: REPS (default 3) · CG_ENGINE (engine repo) · AGENT_EVAL_OUT (repos under /repos) · CONC (judge concurrency)
+import { pathToFileURL, fileURLToPath } from 'node:url';
+import { resolve, dirname, join } from 'node:path';
+import { readFileSync, writeFileSync, existsSync, rmSync } from 'node:fs';
+import { execFile } from 'node:child_process';
+import { tmpdir } from 'node:os';
+
+const HERE = dirname(fileURLToPath(import.meta.url));
+const ENGINE = process.env.CG_ENGINE || resolve(HERE, '..', '..');
+const OUT = process.env.AGENT_EVAL_OUT || '/tmp/cg-offload-eval';
+const REPOS = join(OUT, 'repos');
+const GT = JSON.parse(readFileSync(resolve(HERE, 'offload-eval-ground-truth.json'), 'utf8'));
+const REPS = Number(process.env.REPS || 3);
+const CONC = Number(process.env.CONC || 4);
+const EFFORTS = (process.env.EFFORTS_FILTER || 'low,high').split(',');
+const ONLY = process.env.REPOS_FILTER ? new Set(process.env.REPOS_FILTER.split(',')) : null;
+const TIER = { mtkruto: 'small', postybirb: 'medium', shapeshift: 'complex', trezor: 'large' };
+
+const load = async (rel) => import(pathToFileURL(resolve(ENGINE, rel)).href);
+const idx = await load('dist/index.js');
+const toolsMod = await load('dist/mcp/tools.js');
+const CodeGraph = idx.default?.default ?? idx.default ?? idx.CodeGraph;
+const ToolHandler = toolsMod.ToolHandler ?? toolsMod.default?.ToolHandler;
+if (typeof CodeGraph?.openSync !== 'function' || typeof ToolHandler !== 'function') {
+  console.error('could not load engine from', ENGINE); process.exit(2);
+}
+
+const fidPrompt = (gt, ans) => `You are scoring the FIDELITY of a machine-synthesized code-exploration answer against verified ground truth. Do NOT use any tools.
+
+QUESTION: ${gt.question}
+
+VERIFIED GROUND TRUTH (the actual call path + files):
+${gt.truth}
+
+SYNTHESIZED ANSWER (to score):
+${ans || '(empty)'}
+
+Judge: (1) is the traced call path correct vs ground truth? (2) are the cited files/symbols correct (not fabricated)? (3) if it gave a "Coverage:" verdict, was it honest? A confident WRONG trace is the worst outcome — penalize it harder than an honest partial.
+Output ONLY minified JSON: {"verdict":"pass|partial|fail","score":<0-100>,"fabrication":<true|false>,"coverageHonest":<true|false>,"note":"<=20 words"}`;
+
+const askJudge = (prompt) => new Promise((res) => {
+  execFile('claude', ['-p', prompt, '--model', 'sonnet', '--effort', 'high', '--max-budget-usd', '0.5',
+    '--strict-mcp-config', '--mcp-config', '{"mcpServers":{}}'],
+    { cwd: OUT, maxBuffer: 1 << 24, timeout: 120000 }, (err, stdout) => {
+      const m = (stdout || '').match(/\{[\s\S]*\}/);
+      if (!m) return res({ verdict: 'error', score: null, note: (err ? err.message : 'no json').slice(0, 60) });
+      try { res(JSON.parse(m[0])); } catch { res({ verdict: 'error', score: null }); }
+    });
+});
+
+// ---- 1. Probe: collect synthesized answers at each effort -------------------
+const records = [];
+for (const repo of Object.keys(GT)) {
+  if (ONLY && !ONLY.has(repo)) continue;
+  const dir = join(REPOS, repo);
+  if (!existsSync(join(dir, '.codegraph'))) { console.error('skip (not indexed):', repo); continue; }
+  const cg = CodeGraph.openSync(dir);
+  const h = new ToolHandler(cg);
+  for (const effort of EFFORTS) {
+    for (let rep = 1; rep <= REPS; rep++) {
+      process.env.CODEGRAPH_OFFLOAD_EFFORT = effort;
+      const usageLog = join(tmpdir(), `effort-${repo}-${effort}-${rep}.jsonl`);
+      try { rmSync(usageLog); } catch { /* none */ }
+      process.env.CODEGRAPH_OFFLOAD_USAGE_LOG = usageLog;
+      let answer = '';
+      try { answer = (await h.execute('codegraph_explore', { query: GT[repo].question }))?.content?.[0]?.text ?? ''; }
+      catch (e) { console.error(`  ${repo}/${effort}#${rep} explore failed: ${e?.message}`); }
+      const fired = /Synthesized by CodeGraph/.test(answer);
+      const ai = { tokens: 0, cost: 0, ms: 0 };
+      if (existsSync(usageLog)) for (const e of readFileSync(usageLog, 'utf8').split('\n').filter(Boolean).map(JSON.parse)) {
+        ai.tokens += e.totalTokens || 0; ai.cost += e.costUsd || 0; ai.ms += e.ms || 0;
+      }
+      records.push({ repo, tier: TIER[repo], effort, rep, fired, ai, answer });
+      console.error(`  ${repo}/${effort}#${rep}: fired=${fired} ${ai.tokens}tok $${ai.cost.toFixed(4)} ${ai.ms}ms`);
+    }
+  }
+  try { cg.close?.(); } catch { /* none */ }
+}
+
+// ---- 2. Judge fidelity (concurrency) ---------------------------------------
+console.error(`\njudging ${records.length} answers (concurrency ${CONC})...`);
+let done = 0;
+const q = [...records];
+async function worker() { while (q.length) { const r = q.shift(); r.fid = await askJudge(fidPrompt(GT[r.repo], r.answer)); console.error(`  [${++done}/${records.length}] ${r.repo}/${r.effort}#${r.rep}: ${r.fid.verdict} ${r.fid.score ?? ''}`); } }
+await Promise.all(Array.from({ length: CONC }, worker));
+writeFileSync(join(OUT, 'effort-results.jsonl'), records.map((r) => JSON.stringify(r)).join('\n') + '\n');
+
+// ---- 3. Aggregate: low vs high per repo ------------------------------------
+const med = (a) => { a = a.filter((x) => x != null).sort((x, y) => x - y); return a.length ? (a.length % 2 ? a[(a.length - 1) / 2] : (a[a.length / 2 - 1] + a[a.length / 2]) / 2) : null; };
+console.log(`\n${'='.repeat(80)}\nEFFORT A/B — offload synthesis fidelity (probe, n=${REPS}/cell)\n${'='.repeat(80)}`);
+console.log(`${'repo'.padEnd(11)} ${'tier'.padEnd(8)} ${'effort'.padEnd(6)} fired  ${'fid(med)'.padStart(8)} ${'fab%'.padStart(5)} ${'AItok'.padStart(7)} ${'AIcost'.padStart(8)} ${'ms(med)'.padStart(8)}`);
+for (const repo of Object.keys(GT)) {
+  for (const effort of EFFORTS) {
+    const rs = records.filter((r) => r.repo === repo && r.effort === effort);
+    if (!rs.length) continue;
+    const fids = rs.map((r) => r.fid?.score).filter((x) => x != null);
+    const fab = rs.filter((r) => r.fid?.fabrication === true).length;
+    console.log(`${repo.padEnd(11)} ${TIER[repo].padEnd(8)} ${effort.padEnd(6)} ${rs.filter((r) => r.fired).length}/${rs.length}   ${String(med(fids) ?? '—').padStart(8)} ${String(Math.round(100 * fab / rs.length) + '%').padStart(5)} ${String(Math.round(med(rs.map((r) => r.ai.tokens)) / 1000) + 'k').padStart(7)} ${('$' + (med(rs.map((r) => r.ai.cost)) ?? 0).toFixed(4)).padStart(8)} ${String(med(rs.map((r) => r.ai.ms)) ?? '—').padStart(8)}`);
+  }
+}
+console.log('');
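
The one-line `med` helper above packs filtering, sorting and the odd/even split into a single expression. An expanded sketch of the same calculation follows, assuming the same conventions (null scores from judge errors are dropped, and an empty sample yields `null`); the name `median` is hypothetical.

```javascript
// Median over numeric samples, ignoring null/undefined entries.
function median(values) {
  const sorted = values.filter((x) => x != null).sort((a, b) => a - b);
  if (sorted.length === 0) return null; // no usable scores in this cell
  const mid = Math.floor(sorted.length / 2);
  // Odd count: the middle value. Even count: mean of the two middle values.
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log(median([90, 40, 70]));       // 70
console.log(median([90, 40, 70, null])); // 70 (the null is skipped)
console.log(median([80, 60]));           // 70
console.log(median([]));                 // null
```

With `n=3` reps per cell, the median is simply the middle run, so one outlier run does not move the reported figure.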

+ 25 - 0
scripts/agent-eval/offload-eval-frontload-matrix.sh

@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+# Run the FRONTLOAD arm across all 4 tiers (n reps), then judge + merge with the existing
+# matrix (offload/raw/nocg in $OUT/judged.jsonl, if present) + emit a combined summary.
+# Env: REPS (default 3)  AGENT_EVAL_OUT=<scratch dir>
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"
+OUT="${AGENT_EVAL_OUT:-/tmp/cg-offload-eval}"
+GT="$HERE/offload-eval-ground-truth.json"
+REPS="${REPS:-3}"
+export RESULTS="$OUT/results-fl.jsonl"
+: > "$RESULTS"; rm -f "$OUT/runs/hook-debug.log"
+for repo in mtkruto postybirb shapeshift trezor; do
+  case "$repo" in mtkruto) tier=small;; postybirb) tier=medium;; shapeshift) tier=complex;; trezor) tier=large;; esac
+  Q=$(node -e "console.log(JSON.parse(require('fs').readFileSync(process.argv[1],'utf8'))[process.argv[2]].question)" "$GT" "$repo")
+  echo ""; echo "### $repo ($tier)  $(date +%H:%M:%S)"
+  bash "$HERE/offload-eval-frontload.sh" "$OUT/repos/$repo" "$tier" "$REPS" "$Q"
+done
+echo ""
+echo "frontload: $(wc -l < "$RESULTS") runs | hook injections: $(grep -c INJECTED "$OUT/runs/hook-debug.log" 2>/dev/null) | errors: $(grep -c ERROR "$OUT/runs/hook-debug.log" 2>/dev/null)"
+echo "=== JUDGE frontload ==="
+node "$HERE/offload-eval-judge.mjs" --results "$RESULTS" --truth "$GT" --out "$OUT/judged-fl.jsonl" --concurrency 4 2>&1 | tail -4
+if [ -f "$OUT/judged.jsonl" ]; then cat "$OUT/judged.jsonl" "$OUT/judged-fl.jsonl" > "$OUT/judged-all.jsonl"; else cp "$OUT/judged-fl.jsonl" "$OUT/judged-all.jsonl"; fi
+echo "=== COMBINED SUMMARY ==="
+node "$HERE/offload-eval-summarize.mjs" "$OUT/judged-all.jsonl"
+echo "###### FRONTLOAD MATRIX DONE"

+ 47 - 0
scripts/agent-eval/offload-eval-frontload.sh

@@ -0,0 +1,47 @@
+#!/usr/bin/env bash
+# FRONTLOAD arm (approach 1): codegraph attached (offload-disabled) + the front-load
+# UserPromptSubmit hook (offload-eval-hook.mjs), n reps, appended to $RESULTS. Compare against
+# the matrix's raw/nocg baselines. Usage: offload-eval-frontload.sh <indexed-repo> <tier> <reps> "<Q>"
+# Env: MODEL=sonnet EFFORT=high  RESULTS=<file>  AGENT_EVAL_OUT=<scratch dir>
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"
+ENGINE="$(cd "$HERE/../.." && pwd)"
+BIN="$ENGINE/dist/bin/codegraph.js"
+OUT="${AGENT_EVAL_OUT:-/tmp/cg-offload-eval}"
+TARGET="${1:?repo}"; TIER="${2:?tier}"; REPS="${3:?reps}"; Q="${4:?question}"
+RUNS="$OUT/runs"
+EXTRACT="$HERE/offload-eval-metrics.mjs"
+RESULTS="${RESULTS:-$OUT/results-fl.jsonl}"
+REPO=$(basename "$TARGET")
+mkdir -p "$RUNS"
+[ -d "$TARGET/.codegraph" ] || { echo "not indexed: $TARGET"; exit 1; }
+TARGET=$(cd "$TARGET" && pwd -P)
+
+CFG="$RUNS/mcp-fl-$REPO.json"
+printf '{"mcpServers":{"codegraph":{"command":"env","args":["CODEGRAPH_WASM_RELAUNCHED=1","CODEGRAPH_OFFLOAD_DISABLE=1","node","%s","serve","--mcp","--path","%s"]}}}' "$BIN" "$TARGET" > "$CFG"
+# Generate the hook settings pointing at the persisted hook; enable its debug log so we can
+# count injections (claude passes this env down to the spawned hook process).
+HOOKCFG="$RUNS/frontload-settings.json"
+printf '{"hooks":{"UserPromptSubmit":[{"hooks":[{"type":"command","command":"node %s/offload-eval-hook.mjs"}]}]}}' "$HERE" > "$HOOKCFG"
+export CG_FRONTLOAD_DEBUG="$RUNS/hook-debug.log"
+
+prewarm() {
+  pkill -9 -f "serve --mcp --path $1" 2>/dev/null; rm -f "$1/.codegraph/daemon.sock" 2>/dev/null; sleep 0.6
+  env CODEGRAPH_OFFLOAD_DISABLE=1 CODEGRAPH_DAEMON_IDLE_TIMEOUT_MS=1800000 node "$BIN" serve --mcp --path "$1" </dev/null >/dev/null 2>&1 &
+  node -e 'const fs=require("fs");let n=0;const t=setInterval(()=>{if(fs.existsSync(process.argv[1]+"/.codegraph/daemon.sock")){clearInterval(t);process.exit(0)}if(n++>150){clearInterval(t);process.exit(1)}},100)' "$1" \
+    && echo "  daemon warm" || echo "  WARN no daemon"
+}
+
+echo "###### FRONTLOAD repo=$REPO tier=$TIER reps=$REPS"
+prewarm "$TARGET"
+for r in $(seq 1 "$REPS"); do
+  tag="$REPO-frontload-$r"
+  ( cd "$TARGET" && claude -p "$Q" --output-format stream-json --verbose --permission-mode bypassPermissions \
+      --model "${MODEL:-sonnet}" --effort "${EFFORT:-high}" --max-budget-usd 4 \
+      --strict-mcp-config --mcp-config "$CFG" --settings "$HOOKCFG" \
+      </dev/null > "$RUNS/$tag.jsonl" 2>"$RUNS/$tag.err" )
+  node "$EXTRACT" --run "$RUNS/$tag.jsonl" --usage "-" --arm frontload --rep "$r" --repo "$REPO" --tier "$TIER" --q "$Q" >> "$RESULTS"
+  node -e 'const o=JSON.parse(require("fs").readFileSync(process.argv[1],"utf8").trim().split("\n").pop());console.log(`  [frontload #${o.rep}] ${o.durationSec}s | main $${o.costUsdMain} ${o.tokBillable}tok | read=${o.read} grep=${o.grep} agentExplore=${o.explore} | ok=${o.ok}`)' "$RESULTS"
+done
+pkill -9 -f "serve --mcp --path $TARGET" 2>/dev/null; rm -f "$TARGET/.codegraph/daemon.sock" 2>/dev/null
+echo "###### FRONTLOAD DONE $REPO (cumulative hook injections: $(grep -c INJECTED "$CG_FRONTLOAD_DEBUG" 2>/dev/null))"

File diff suppressed because it is too large
+ 7 - 0
scripts/agent-eval/offload-eval-ground-truth.json


+ 84 - 0
scripts/agent-eval/offload-eval-hook.mjs

@@ -0,0 +1,84 @@
+#!/usr/bin/env node
+// UserPromptSubmit hook — APPROACH 1: additive context-injection.
+// Front-loads codegraph's structural answer for flow/impact/"how/where" prompts so the
+// agent's reflex grep/read has nothing left to find. Strictly additive (never blocks),
+// gated to structural prompts (no cost otherwise), and uses RAW explore (offload disabled)
+// so the injected context is accurate — never the (currently low-fidelity) synthesis.
+//
+// Reads {prompt, cwd} as JSON on stdin; prints the explore result to stdout (which Claude
+// Code injects into the agent's context). Any failure -> silent exit 0 (degradable).
+import { pathToFileURL, fileURLToPath } from 'node:url';
+import { resolve, join, dirname } from 'node:path';
+import { existsSync, readFileSync, appendFileSync } from 'node:fs';
+
+// Resolve the engine repo from this script's own location (scripts/agent-eval/ -> ../..),
+// overridable with CG_ENGINE. The hook ships inside the repo, so it finds its own dist.
+const HERE = dirname(fileURLToPath(import.meta.url));
+const ENGINE = process.env.CG_ENGINE || resolve(HERE, '..', '..');
+const BUDGET = Number(process.env.CG_FRONTLOAD_BUDGET || 16000);
+
+// Debug log only when CG_FRONTLOAD_DEBUG is set to a file path (the harness points it at a
+// log to count injections); off by default so the shipped hook writes nothing extra.
+const DBG = process.env.CG_FRONTLOAD_DEBUG;
+const dbg = (m) => { if (!DBG) return; try { appendFileSync(DBG, `[${new Date().toISOString()}] ${m}\n`); } catch { /* ignore */ } };
+
+let input = {};
+try { input = JSON.parse(readFileSync(0, 'utf8')); } catch (e) { dbg('stdin parse fail: ' + e.message); }
+const prompt = String(input.prompt || '');
+const cwd = String(input.cwd || process.cwd());
+dbg(`invoked: promptLen=${prompt.length} cwd=${cwd}`);
+
+// Gate: only structural / flow / impact / where-how questions. Cheap regex; silent no-op
+// otherwise so non-structural prompts ("fix this typo") cost nothing.
+const STRUCTURAL = /\b(how|where|trace|flow|path|reach(es|ed)?|call(s|ed|er|ers|ee)?|depend|impact|affect|wire[ds]?|connect|implement|architect|structure|breaks?|what calls|why does)\b/i;
+if (!prompt || !STRUCTURAL.test(prompt)) { dbg('gate: non-structural, no-op'); process.exit(0); }
+dbg('gate: structural PASS');
+
+// Find the index: cwd, then walk up a few levels.
+let root = cwd, found = null;
+for (let i = 0; i < 6 && root; i++) {
+  if (existsSync(join(root, '.codegraph'))) { found = root; break; }
+  const parent = resolve(root, '..'); if (parent === root) break; root = parent;
+}
+if (!found) { dbg(`no .codegraph found from cwd=${cwd}`); process.exit(0); }
+dbg(`found index at ${found}`);
+
+try {
+  process.env.CODEGRAPH_OFFLOAD_DISABLE = '1'; // raw, accurate — never the unfixed offload
+  process.env.CODEGRAPH_TELEMETRY = '0'; process.env.DO_NOT_TRACK = '1';
+  const load = async (rel) => import(pathToFileURL(resolve(ENGINE, rel)).href);
+  const idx = await load('dist/index.js');
+  const tools = await load('dist/mcp/tools.js');
+  const CodeGraph = idx.default?.default ?? idx.default ?? idx.CodeGraph;
+  const ToolHandler = tools.ToolHandler ?? tools.default?.ToolHandler;
+  if (typeof CodeGraph?.openSync !== 'function' || typeof ToolHandler !== 'function') process.exit(0);
+
+  // Retry once on a transient busy/locked index (the hook's openSync can race a
+  // freshly-warming daemon on the first prompt of a session).
+  let text = '';
+  for (let attempt = 1; attempt <= 2; attempt++) {
+    try {
+      const cg = CodeGraph.openSync(found);
+      const h = new ToolHandler(cg);
+      const res = await h.execute('codegraph_explore', { query: prompt });
+      text = res?.content?.[0]?.text ?? '';
+      try { cg.close?.(); } catch { /* ignore */ }
+      dbg(`explore attempt ${attempt} returned ${text.length} chars`);
+      break;
+    } catch (e) {
+      dbg(`explore attempt ${attempt} failed: ${e?.message || e}`);
+      if (attempt === 2) throw e;
+      await new Promise((r) => setTimeout(r, 800));
+    }
+  }
+  if (!text.trim()) { dbg('empty explore result, no-op'); process.exit(0); }
+  if (text.length > BUDGET) text = text.slice(0, BUDGET) + '\n…[front-load truncated to budget]';
+
+  process.stdout.write(
+    `## CodeGraph structural context (auto-retrieved for this question)\n` +
+    `The code graph was queried for your question; the relevant symbols, source, and call flow are below. ` +
+    `Treat the quoted source as already read. If you need more, call codegraph_explore with specific symbol names rather than grepping or reading files.\n\n` +
+    text + '\n'
+  );
+  dbg(`INJECTED ${text.length} chars`);
+} catch (e) { dbg('ERROR: ' + (e?.stack || e?.message || e)); process.exit(0); } // degradable
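
To see which prompts the hook's gate lets through, the sketch below copies the `STRUCTURAL` regex verbatim and runs it on a few made-up prompts; `passesGate` is a hypothetical wrapper. Note that, as written, the trailing `\b` applies to every alternative, so stems such as `depend` or `impact` match only as whole words and inflected forms like "depends" fall through unless another keyword is present.

```javascript
// Regex copied from the hook above; the sample prompts are illustrative only.
const STRUCTURAL = /\b(how|where|trace|flow|path|reach(es|ed)?|call(s|ed|er|ers|ee)?|depend|impact|affect|wire[ds]?|connect|implement|architect|structure|breaks?|what calls|why does)\b/i;

// Hypothetical wrapper mirroring the hook's `!prompt || !STRUCTURAL.test(prompt)` check.
const passesGate = (prompt) => Boolean(prompt) && STRUCTURAL.test(prompt);

console.log(passesGate('How does a request reach the handler?')); // true  ("how", "reach")
console.log(passesGate('What calls parseConfig?'));               // true  ("what calls")
console.log(passesGate('fix this typo'));                         // false (no keyword)
console.log(passesGate('What depends on the auth module?'));      // false ("depends" is not "depend")
console.log(passesGate(''));                                      // false (empty prompt)
```

Whether the whole-word behavior for those stems is intended is not stated in the diff, so the fourth case is worth checking against the questions the harness actually asks.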

+ 103 - 0
scripts/agent-eval/offload-eval-judge.mjs

@@ -0,0 +1,103 @@
+#!/usr/bin/env node
+// Accuracy judge. For each run in results.jsonl:
+//   - end-to-end: agent finalAnswer vs verified ground truth (all arms)
+//   - fidelity:   offload synthesized answer vs ground truth (offload arm only)
+// Judge = claude -p sonnet --effort high, no tools, run from a neutral cwd,
+// JSON-only verdicts. Writes judged.jsonl (one line per run, verdicts merged).
+//
+// Usage: offload-eval-judge.mjs --results <f> --truth <f> --out <f> [--concurrency 4]
+import { readFileSync, writeFileSync, existsSync } from 'fs';
+import { execFile } from 'child_process';
+
+const A = {};
+for (let i = 2; i < process.argv.length; i += 2) A[process.argv[i].replace(/^--/, '')] = process.argv[i + 1];
+const results = readFileSync(A.results, 'utf8').split('\n').filter(Boolean).map(l => JSON.parse(l));
+const truth = JSON.parse(readFileSync(A.truth, 'utf8'));
+const OUT = A.out || '/tmp/cg-offload-eval/judged.jsonl';
+const CONC = Number(A.concurrency || 4);
+
+function askJudge(prompt) {
+  return new Promise((resolve) => {
+    execFile('claude', ['-p', prompt, '--model', 'sonnet', '--effort', 'high',
+      '--max-budget-usd', '0.5', '--strict-mcp-config', '--mcp-config', '{"mcpServers":{}}'],
+      // Run from a neutral dir with no repo files so the judge can't "cheat" by reading source.
+      { cwd: process.env.AGENT_EVAL_OUT || '/tmp', maxBuffer: 1 << 24, timeout: 120000 },
+      (err, stdout) => {
+        const raw = (stdout || '').trim();
+        const m = raw.match(/\{[\s\S]*\}/);
+        if (!m) return resolve({ verdict: 'error', score: null, note: (err ? 'exec ' + err.message : 'no json').slice(0, 80) });
+        try { resolve(JSON.parse(m[0])); } catch { resolve({ verdict: 'error', score: null, note: 'parse fail' }); }
+      });
+  });
+}
+
+const e2ePrompt = (gt, ans) => `You are scoring whether an AI coding agent correctly answered a code-flow question about a repository. Judge ONLY against the verified ground truth. Do NOT use any tools.
+
+QUESTION: ${gt.question}
+
+VERIFIED GROUND TRUTH (the actual call path + files):
+${gt.truth}
+
+AGENT'S ANSWER:
+${ans || '(empty)'}
+
+Score how correct the agent's answer is vs the ground truth. A "pass" means it identifies the core mechanism and the major hops with the right files/symbols and makes no materially wrong claim. "partial" = right area but misses major hops or has notable errors. "fail" = wrong layer, fabricated, or misses the mechanism.
+Output ONLY minified JSON, no prose, no code fences:
+{"verdict":"pass|partial|fail","score":<0-100>,"missedHops":["..."],"wrongClaims":["..."],"note":"<=20 words"}`;
+
+const fidPrompt = (gt, ans) => `You are scoring the FIDELITY of a machine-synthesized code-exploration answer against verified ground truth. The synthesized answer claims to trace a flow and cite file:line locations. Do NOT use any tools.
+
+QUESTION: ${gt.question}
+
+VERIFIED GROUND TRUTH (the actual call path + files):
+${gt.truth}
+
+SYNTHESIZED ANSWER (to score):
+${ans || '(empty)'}
+
+Judge: (1) is the traced call path correct vs ground truth? (2) are the cited files/symbols correct (not fabricated)? (3) if it gave a "Coverage:" verdict, was that verdict honest about what it actually covered? A confident WRONG trace is the worst outcome — penalize it harder than an honest "partial/not found".
+Output ONLY minified JSON, no prose, no code fences:
+{"verdict":"pass|partial|fail","score":<0-100>,"fabrication":<true|false>,"coverageHonest":<true|false>,"missedHops":["..."],"note":"<=20 words"}`;
+
+// Build the job list
+const jobs = [];
+for (const r of results) {
+  const gt = truth[r.repo];
+  if (!gt) { r._nojudge = true; continue; }
+  jobs.push({ r, kind: 'e2e', prompt: e2ePrompt(gt, r.finalAnswer) });
+  if (r.arm === 'offload' && Array.isArray(r.offloadAnswers))
+    r.offloadAnswers.forEach((ans, i) => { if (ans && ans.trim()) jobs.push({ r, kind: 'fid', idx: i, prompt: fidPrompt(gt, ans) }); });
+}
+console.error(`judging ${jobs.length} verdicts across ${results.length} runs (concurrency ${CONC})...`);
+
+let done = 0;
+async function worker(queue) {
+  while (queue.length) {
+    const job = queue.shift();
+    const v = await askJudge(job.prompt);
+    if (job.kind === 'e2e') job.r.e2e = v; else (job.r._fid ??= []).push(v);
+    console.error(`  [${++done}/${jobs.length}] ${job.r.repo}/${job.r.arm}#${job.r.rep} ${job.kind}: ${v.verdict}${v.score != null ? ' ' + v.score : ''}`);
+  }
+}
+const q = [...jobs];
+await Promise.all(Array.from({ length: CONC }, () => worker(q)));
+
+// Aggregate per-answer fidelity verdicts into one fidelity object per offload run.
+const medOf = (a) => { a = [...a].sort((x, y) => x - y); return a.length ? (a.length % 2 ? a[(a.length - 1) / 2] : (a[a.length / 2 - 1] + a[a.length / 2]) / 2) : null; };
+for (const r of results) {
+  if (r._fid?.length) {
+    const scores = r._fid.map(v => v.score).filter(x => x != null);
+    r.fidelity = {
+      n: r._fid.length, scores,
+      max: scores.length ? Math.max(...scores) : null,
+      min: scores.length ? Math.min(...scores) : null,
+      median: medOf(scores),
+      anyFabrication: r._fid.some(v => v.fabrication === true),
+      allCoverageHonest: r._fid.every(v => v.coverageHonest !== false),
+      verdicts: r._fid.map(v => v.verdict),
+    };
+  }
+  delete r._fid;
+}
+writeFileSync(OUT, results.map(r => JSON.stringify(r)).join('\n') + '\n');
+console.error(`wrote ${OUT}`);
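
The judge's reply is recovered with a greedy `/\{[\s\S]*\}/` match followed by `JSON.parse`, which tolerates prose or code fences around a single JSON object. A minimal sketch of that step in isolation is shown here; `parseVerdict` is a hypothetical name, and the sample replies are invented.

```javascript
// Extract one JSON verdict from free-form model output, falling back to an
// 'error' verdict in the same shape the judge script uses.
function parseVerdict(stdout) {
  const m = (stdout || '').match(/\{[\s\S]*\}/); // greedy: first "{" through last "}"
  if (!m) return { verdict: 'error', score: null, note: 'no json' };
  try {
    return JSON.parse(m[0]);
  } catch {
    return { verdict: 'error', score: null, note: 'parse fail' };
  }
}

console.log(parseVerdict('Verdict: {"verdict":"pass","score":92}').verdict);          // pass
console.log(parseVerdict('```json\n{"verdict":"partial","score":55}\n```').score);    // 55
console.log(parseVerdict('I cannot judge this.').note);                               // no json
console.log(parseVerdict('{"verdict":"pass"} and also {oops}').note);                 // parse fail
```

The last case shows the cost of the greedy match: if the reply contains a second brace pair after the verdict, the span covers both and parsing fails, which the script records as an `error` verdict rather than a score.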

+ 20 - 0
scripts/agent-eval/offload-eval-matrix.sh

@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+# Drive the 3-arm campaign (offload/raw/nocg) across all 4 tiers, n reps each, into one
+# results.jsonl. Reads the canonical question per repo from offload-eval-ground-truth.json.
+# Env: REPS (default 3)  AGENT_EVAL_OUT=<scratch dir>
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"
+OUT="${AGENT_EVAL_OUT:-/tmp/cg-offload-eval}"
+GT="$HERE/offload-eval-ground-truth.json"
+REPS="${REPS:-3}"
+export RESULTS="$OUT/results.jsonl"
+: > "$RESULTS"
+for repo in mtkruto postybirb shapeshift trezor; do
+  case "$repo" in mtkruto) tier=small;; postybirb) tier=medium;; shapeshift) tier=complex;; trezor) tier=large;; esac
+  Q=$(node -e "console.log(JSON.parse(require('fs').readFileSync(process.argv[1],'utf8'))[process.argv[2]].question)" "$GT" "$repo")
+  echo ""; echo "### $repo ($tier)  $(date +%H:%M:%S)"
+  bash "$HERE/offload-eval-3arm.sh" "$OUT/repos/$repo" "$tier" "$REPS" "$Q"
+done
+echo ""; echo "###### MATRIX DONE -> $RESULTS ($(wc -l < "$RESULTS") runs).  Judge + summarize with:"
+echo "  node $HERE/offload-eval-judge.mjs --results $RESULTS --truth $GT --out $OUT/judged.jsonl"
+echo "  node $HERE/offload-eval-summarize.mjs $OUT/judged.jsonl"

+ 94 - 0
scripts/agent-eval/offload-eval-metrics.mjs

@@ -0,0 +1,94 @@
+#!/usr/bin/env node
+// Extract one eval run's metrics from its Claude stream-json transcript + the
+// offload usage sidecar log, emit ONE merged JSON line.
+//
+// Usage: offload-eval-metrics.mjs --run <run.jsonl> --usage <usage.jsonl|-> \
+//          --arm <a> --rep <n> --repo <r> --tier <t> --q <question>
+import { readFileSync, existsSync } from 'fs';
+
+const args = {};
+for (let i = 2; i < process.argv.length; i += 2) args[process.argv[i].replace(/^--/, '')] = process.argv[i + 1];
+
+const runFile = args.run;
+const lines = existsSync(runFile) ? readFileSync(runFile, 'utf8').split('\n').filter(Boolean) : [];
+
+const toolCounts = {};
+let result = null;
+const tok = { gen: 0, fresh: 0, cached: 0 };
+const offloadAnswers = [];
+let exploreResults = 0; // tool_results from explore (offload or raw)
+let lastAssistantText = '';
+
+for (const line of lines) {
+  let ev; try { ev = JSON.parse(line); } catch { continue; }
+
+  // per-turn token usage (authoritative token measure; result.usage is last-turn only)
+  const u = ev.message?.usage;
+  if (u) {
+    tok.gen += u.output_tokens || 0;
+    tok.fresh += (u.input_tokens || 0) + (u.cache_creation_input_tokens || 0);
+    tok.cached += u.cache_read_input_tokens || 0;
+  }
+
+  if (ev.type === 'assistant' && Array.isArray(ev.message?.content)) {
+    for (const b of ev.message.content) {
+      if (b.type === 'tool_use') toolCounts[b.name] = (toolCounts[b.name] || 0) + 1;
+      if (b.type === 'text' && b.text?.trim()) lastAssistantText = b.text.trim();
+    }
+  }
+  // tool_results arrive in user messages
+  if (ev.type === 'user' && Array.isArray(ev.message?.content)) {
+    for (const b of ev.message.content) {
+      if (b.type !== 'tool_result') continue;
+      const text = Array.isArray(b.content)
+        ? b.content.map(c => (typeof c === 'string' ? c : c.text || '')).join('')
+        : (typeof b.content === 'string' ? b.content : '');
+      // An offload answer is either the 'plain'/'report' synthesis (carries the
+      // "Synthesized by CodeGraph" footer) or a 'refs' answer (carries the re-expanded
+      // "### Referenced source — verbatim" appendix). A refs call that cited nothing
+      // valid falls back to RAW source, which is correctly counted as a raw explore below.
+      if (/Synthesized by CodeGraph|### Referenced source — verbatim/.test(text)) { offloadAnswers.push(text); exploreResults++; }
+      else if (/Found \d+ symbols? across|## Exploration:/.test(text)) exploreResults++;
+    }
+  }
+  if (ev.type === 'result') result = ev;
+}
+
+// offload usage sidecar (CodeGraph AI tokens + cost) — one JSON line per offload call
+const ai = { calls: 0, promptTokens: 0, completionTokens: 0, totalTokens: 0, credits: 0, costUsd: 0, ms: 0 };
+if (args.usage && args.usage !== '-' && existsSync(args.usage)) {
+  for (const line of readFileSync(args.usage, 'utf8').split('\n').filter(Boolean)) {
+    let e; try { e = JSON.parse(line); } catch { continue; }
+    ai.calls++;
+    ai.promptTokens += e.promptTokens || 0;
+    ai.completionTokens += e.completionTokens || 0;
+    ai.totalTokens += e.totalTokens || 0;
+    ai.credits += e.creditsCharged || 0;
+    ai.costUsd += e.costUsd || 0;
+    ai.ms += e.ms || 0;
+  }
+}
+
+// front-load hook fired iff its injected header appears in the transcript
+const frontload = lines.some(l => l.includes('auto-retrieved for this question'));
+const get = (n) => toolCounts[n] || 0;
+const read = get('Read');
+const grep = get('Grep') + get('Bash') + get('Glob');
+const explore = get('mcp__codegraph__codegraph_explore');
+const cgAny = Object.keys(toolCounts).filter(k => /mcp__codegraph__/.test(k)).reduce((s, k) => s + toolCounts[k], 0);
+
+const out = {
+  repo: args.repo, tier: args.tier, arm: args.arm, rep: Number(args.rep), question: args.q,
+  ok: result?.subtype === 'success',
+  durationSec: result ? +(result.duration_ms / 1000).toFixed(1) : null,
+  numTurns: result?.num_turns ?? null,
+  costUsdMain: result ? +(result.total_cost_usd || 0).toFixed(4) : null,
+  tokGen: tok.gen, tokFresh: tok.fresh, tokCached: tok.cached, tokBillable: tok.gen + tok.fresh,
+  read, grep, explore, cgAny, frontload,
+  offloadFired: offloadAnswers.length,
+  ai,
+  // text payloads for the accuracy judge (kept separate; large)
+  finalAnswer: (result?.result || lastAssistantText || '').slice(0, 8000),
+  offloadAnswers: offloadAnswers.map(a => a.slice(0, 6000)),
+};
+process.stdout.write(JSON.stringify(out) + '\n');
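
Reviewer sketch (not part of the commit): the tool-result loop above sorts each `tool_result` into an offload answer, a raw explore result, or neither, based on marker strings in the text. The two regexes below are copied from the script; the helper names `classifyToolResult` and `flattenContent` are hypothetical and only illustrate the rule in isolation.

```javascript
// Marker regexes copied from the metrics script above.
const OFFLOAD_RE = /Synthesized by CodeGraph|### Referenced source — verbatim/;
const RAW_EXPLORE_RE = /Found \d+ symbols? across|## Exploration:/;

// Hypothetical helper: flatten a tool_result `content` field, which may be a
// plain string or an array of strings / { text } parts.
function flattenContent(content) {
  if (Array.isArray(content)) {
    return content.map(c => (typeof c === 'string' ? c : c.text || '')).join('');
  }
  return typeof content === 'string' ? content : '';
}

// Hypothetical helper: the offload check runs first, so a synthesized answer is
// never double-counted as a raw explore.
function classifyToolResult(text) {
  if (OFFLOAD_RE.test(text)) return 'offload';
  if (RAW_EXPLORE_RE.test(text)) return 'raw-explore';
  return 'other';
}

const samples = [
  [{ type: 'text', text: 'flow map...\nSynthesized by CodeGraph' }],
  'Found 12 symbols across 4 files',
  'plain grep output',
];
for (const s of samples) console.log(classifyToolResult(flattenContent(s)));
```

In the script itself both the offload and raw-explore branches increment `exploreResults`; only the offload branch also stores the text in `offloadAnswers`.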

+ 50 - 0
scripts/agent-eval/offload-eval-refs1.sh

@@ -0,0 +1,50 @@
+#!/usr/bin/env bash
+# ONE offload run on ONE indexed repo at a given offload STYLE (plain|refs), so we can
+# watch a single agent transcript at a time (the user's one-run-at-a-time methodology).
+# The OFFLOAD reasoning runs in the prewarmed DAEMON process, so the style env must be
+# set on BOTH the daemon and the client MCP config. Writes one metrics line to RESULTS
+# and leaves the raw stream-json at $RUNS/<repo>-<style>-<n>.jsonl for inspection.
+#
+# Usage: offload-eval-refs1.sh <indexed-repo> <style> <n> "<question>"
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"; ENGINE="$(cd "$HERE/../.." && pwd)"; BIN="$ENGINE/dist/bin/codegraph.js"
+OUT="${AGENT_EVAL_OUT:-/tmp/cg-offload-eval}"; RUNS="$OUT/runs"; EXTRACT="$HERE/offload-eval-metrics.mjs"
+TARGET="${1:?repo}"; STYLE="${2:?style}"; N="${3:?run-tag}"; Q="${4:?question}"
+RESULTS="${RESULTS:-$OUT/results-refs.jsonl}"; REPO=$(basename "$TARGET"); TARGET=$(cd "$TARGET" && pwd -P)
+mkdir -p "$RUNS"; command -v claude >/dev/null || { echo "no claude"; exit 1; }
+USAGE="$RUNS/$REPO-$STYLE-usage.jsonl"; : > "$USAGE"
+CFG="$RUNS/mcp-$REPO-$STYLE.json"
+# `raw` is a pseudo-style: codegraph attached but the offload DISABLED (the ceiling —
+# verbatim source, no reasoning model). Any other value is an offload style (plain|refs).
+if [ "$STYLE" = "raw" ]; then
+  DAEMON_ENV="CODEGRAPH_OFFLOAD_DISABLE=1"
+  printf '{"mcpServers":{"codegraph":{"command":"env","args":["CODEGRAPH_WASM_RELAUNCHED=1","CODEGRAPH_OFFLOAD_DISABLE=1","node","%s","serve","--mcp","--path","%s"]}}}' \
+    "$BIN" "$TARGET" > "$CFG"
+  USAGE="-"
+else
+  DAEMON_ENV="CODEGRAPH_OFFLOAD_STYLE=$STYLE CODEGRAPH_OFFLOAD_USAGE_LOG=$USAGE"
+  printf '{"mcpServers":{"codegraph":{"command":"env","args":["CODEGRAPH_WASM_RELAUNCHED=1","CODEGRAPH_OFFLOAD_STYLE=%s","CODEGRAPH_OFFLOAD_USAGE_LOG=%s","node","%s","serve","--mcp","--path","%s"]}}}' \
+    "$STYLE" "$USAGE" "$BIN" "$TARGET" > "$CFG"
+fi
+
+# Prewarm a persistent daemon carrying the SAME offload config (it does the reasoning).
+pkill -9 -f "serve --mcp --path $TARGET" 2>/dev/null; rm -f "$TARGET/.codegraph/daemon.sock" 2>/dev/null; sleep 0.6
+env $DAEMON_ENV CODEGRAPH_DAEMON_IDLE_TIMEOUT_MS=1800000 \
+  node "$BIN" serve --mcp --path "$TARGET" </dev/null >/dev/null 2>&1 &
+node -e 'const fs=require("fs");let n=0;const t=setInterval(()=>{if(fs.existsSync(process.argv[1]+"/.codegraph/daemon.sock")){clearInterval(t);process.exit(0)}if(n++>150){clearInterval(t);process.exit(1)}},100)' "$TARGET" \
+  && echo "daemon warm ($STYLE)" || echo "WARN daemon never bound"
+
+tag="$REPO-$STYLE-$N"
+echo "== run $tag =="
+# DISALLOW (optional): block tools that confound the offload-sufficiency signal —
+# chiefly "Agent" (sub-agent delegation: the spawned Explore subagent has low MCP
+# salience, ignores codegraph, and thrashes via Bash+Read, making the A/B noisy).
+( cd "$TARGET" && claude -p "$Q" --output-format stream-json --verbose --permission-mode bypassPermissions \
+    --model "${MODEL:-sonnet}" --effort "${EFFORT:-high}" --max-budget-usd 4 \
+    ${DISALLOW:+--disallowedTools "$DISALLOW"} \
+    --strict-mcp-config --mcp-config "$CFG" </dev/null > "$RUNS/$tag.jsonl" 2>"$RUNS/$tag.err" )
+node "$EXTRACT" --run "$RUNS/$tag.jsonl" --usage "$USAGE" --arm "offload-$STYLE" --rep "$N" \
+    --repo "$REPO" --tier "complex" --q "$Q" >> "$RESULTS"
+node -e 'const o=JSON.parse(require("fs").readFileSync(process.argv[1],"utf8").trim().split("\n").pop());console.log(`  [${o.arm} #${o.rep}] ${o.durationSec}s | main $${o.costUsdMain} ${o.tokBillable} tok | read=${o.read} grep=${o.grep} explore=${o.explore} offload=${o.offloadFired} | AI ${o.ai.calls}call/${o.ai.totalTokens}tok/$${o.ai.costUsd.toFixed(4)} | ok=${o.ok}`)' "$RESULTS"
+pkill -9 -f "serve --mcp --path $TARGET" 2>/dev/null; rm -f "$TARGET/.codegraph/daemon.sock" 2>/dev/null
+echo "raw transcript: $RUNS/$tag.jsonl"
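
Reviewer sketch (not part of the commit): the script writes the MCP config by substituting paths into a JSON string with `printf`, which assumes `$BIN`, `$TARGET` and `$USAGE` contain no characters that need JSON escaping. A possible alternative, assuming one is willing to build the config in Node, is to assemble the same object and let `JSON.stringify` handle the escaping. The function name `buildMcpConfig` and its parameter names are hypothetical; the environment variable names and the argument order are taken from the script.

```javascript
// Hypothetical helper: build the per-run MCP config as an object instead of a
// printf template. `style` is "raw" (offload disabled) or an offload style.
function buildMcpConfig({ bin, target, style, usageLog }) {
  const env = ['CODEGRAPH_WASM_RELAUNCHED=1'];
  if (style === 'raw') {
    env.push('CODEGRAPH_OFFLOAD_DISABLE=1');
  } else {
    env.push(`CODEGRAPH_OFFLOAD_STYLE=${style}`, `CODEGRAPH_OFFLOAD_USAGE_LOG=${usageLog}`);
  }
  return {
    mcpServers: {
      codegraph: {
        command: 'env',
        args: [...env, 'node', bin, 'serve', '--mcp', '--path', target],
      },
    },
  };
}

// Example paths are made up; note the space and quote in the target path.
const example = buildMcpConfig({
  bin: '/engine/dist/bin/codegraph.js',
  target: '/tmp/my "quoted" repo',
  style: 'refs',
  usageLog: '/tmp/runs/usage.jsonl',
});
console.log(JSON.stringify(example));
```

This only covers the client config; the daemon in the script still receives its environment through `env $DAEMON_ENV`, which relies on shell word splitting.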

+ 24 - 0
scripts/agent-eval/offload-eval-setup.sh

@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+# Clone + index the 4 "not-trained-on" eval repos into $AGENT_EVAL_OUT/repos. These were
+# selected via a no-tools memory-probe gate (Sonnet cannot answer their flow questions from
+# memory — so the no-codegraph baseline is honest). Env: AGENT_EVAL_OUT=<scratch dir>
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"
+ENGINE="$(cd "$HERE/../.." && pwd)"
+BIN="$ENGINE/dist/bin/codegraph.js"
+OUT="${AGENT_EVAL_OUT:-/tmp/cg-offload-eval}"
+ROOT="$OUT/repos"; mkdir -p "$ROOT"
+export CODEGRAPH_TELEMETRY=0 DO_NOT_TRACK=1
+[ -f "$BIN" ] || { echo "engine not built: run 'npm run build' in $ENGINE first"; exit 1; }
+
+clone_index() { # url name
+  echo "=== $2: clone ==="; rm -rf "$ROOT/$2"
+  git clone --quiet --depth 1 "$1" "$ROOT/$2" || { echo "  clone FAILED"; return 1; }
+  echo "=== $2: index ==="
+  node "$BIN" init "$ROOT/$2" 2>&1 | grep -iE 'indexed|nodes|edges|error' | tail -2
+}
+clone_index https://github.com/MTKruto/MTKruto.git mtkruto          # small  (~322 TS)
+clone_index https://github.com/mvdicarlo/postybirb-plus.git postybirb  # medium (~608 TS)
+clone_index https://github.com/shapeshift/web.git shapeshift        # complex (~3.2k TS, 35-pkg monorepo)
+clone_index https://github.com/trezor/trezor-suite.git trezor       # large  (~8k TS monorepo)
+echo "###### SETUP DONE -> $ROOT"

+ 68 - 0
scripts/agent-eval/offload-eval-summarize.mjs

@@ -0,0 +1,68 @@
+#!/usr/bin/env node
+// Aggregate judged.jsonl (or results.jsonl) into a per-repo, per-arm report:
+// time, main tokens/cost, AI tokens/cost, total cost, tool mix, accuracy.
+// Usage: offload-eval-summarize.mjs <judged-or-results.jsonl>
+import { readFileSync } from 'fs';
+const rows = readFileSync(process.argv[2], 'utf8').split('\n').filter(Boolean).map(l => JSON.parse(l));
+
+const med = (xs) => { const a = xs.filter(x => x != null).sort((p, q) => p - q); if (!a.length) return null; const m = Math.floor(a.length / 2); return a.length % 2 ? a[m] : (a[m - 1] + a[m]) / 2; };
+const rng = (xs) => { const a = xs.filter(x => x != null); return a.length ? `${Math.min(...a)}–${Math.max(...a)}` : '—'; };
+const d2 = (x) => x == null ? '—' : (+x).toFixed(2);
+const d3 = (x) => x == null ? '—' : (+x).toFixed(3);
+const d4 = (x) => x == null ? '—' : (+x).toFixed(4);
+
+const ARM_ORDER = ['frontload', 'offload', 'raw', 'nocg'];
+const byRepo = {};
+for (const r of rows) (byRepo[r.repo] ??= {});
+for (const r of rows) ((byRepo[r.repo][r.arm] ??= []).push(r));
+
+const verdictTally = (rs, field) => {
+  const t = { pass: 0, partial: 0, fail: 0, error: 0 };
+  for (const r of rs) { const v = r[field]?.verdict; if (v in t) t[v]++; }
+  return t;
+};
+
+for (const repo of Object.keys(byRepo)) {
+  const tier = byRepo[repo][Object.keys(byRepo[repo])[0]][0].tier;
+  console.log(`\n${'='.repeat(78)}\n${repo}  [${tier}]\n${'='.repeat(78)}`);
+  console.log(`${'arm'.padEnd(9)} n  ${'time(s)'.padStart(9)} ${'mainCost'.padStart(9)} ${'aiCost'.padStart(8)} ${'totCost'.padStart(8)} ${'mainTok'.padStart(8)} ${'aiTok'.padStart(7)} ${'rd'.padStart(3)} ${'gr'.padStart(3)} ${'exp'.padStart(3)} ${'off'.padStart(3)}  e2e(P/p/F)  fidScore`);
+  for (const arm of ARM_ORDER) {
+    const rs = byRepo[repo][arm]; if (!rs) continue;
+    const n = rs.length;
+    const mainCost = med(rs.map(r => r.costUsdMain));
+    const aiCost = med(rs.map(r => r.ai?.costUsd ?? 0));
+    const totCost = (mainCost ?? 0) + (aiCost ?? 0);
+    const e2e = verdictTally(rs, 'e2e');
+    const fidScores = arm === 'offload' ? rs.flatMap(r => r.fidelity?.scores ?? []) : [];
+    const fid = fidScores.length ? med(fidScores) : null;
+    const fab = arm === 'offload' && rs.some(r => r.fidelity?.anyFabrication);
+    const e2eScore = med(rs.map(r => r.e2e?.score).filter(x => x != null));
+    console.log(
+      `${arm.padEnd(9)} ${String(n).padStart(1)}  ${String(med(rs.map(r => r.durationSec))).padStart(9)} ` +
+      `${('$' + d3(mainCost)).padStart(9)} ${('$' + d3(aiCost)).padStart(8)} ${('$' + d3(totCost)).padStart(8)} ` +
+      `${String(Math.round(med(rs.map(r => r.tokBillable)) / 1000) + 'k').padStart(8)} ${String(Math.round(med(rs.map(r => r.ai?.totalTokens ?? 0)) / 1000) + 'k').padStart(7)} ` +
+      `${String(med(rs.map(r => r.read))).padStart(3)} ${String(med(rs.map(r => r.grep))).padStart(3)} ${String(med(rs.map(r => r.explore))).padStart(3)} ${String(med(rs.map(r => r.offloadFired))).padStart(3)}  ` +
+      `${(e2e.pass + '/' + e2e.partial + '/' + e2e.fail).padStart(9)}  ${e2eScore != null ? 'e2e=' + e2eScore : ''} ${fid != null ? 'fid=' + fid + (fab ? ' FAB!' : '') : ''}`
+    );
+  }
+  // ranges line for the key metrics (variance matters)
+  for (const arm of ARM_ORDER) {
+    const rs = byRepo[repo][arm]; if (!rs) continue;
+    console.log(`   ${arm} ranges: time ${rng(rs.map(r => r.durationSec))}s · mainCost $${rng(rs.map(r => r.costUsdMain))} · read ${rng(rs.map(r => r.read))} · explore ${rng(rs.map(r => r.explore))} · offloadFired ${rng(rs.map(r => r.offloadFired))}`);
+  }
+}
+
+// Cross-repo roll-up: one row per repo and arm (medians), for side-by-side comparison
+console.log(`\n${'='.repeat(78)}\nCROSS-REPO SUMMARY (per-repo, per-arm medians)\n${'='.repeat(78)}`);
+console.log(`${'repo'.padEnd(12)} ${'arm'.padEnd(8)} ${'time'.padStart(7)} ${'totCost'.padStart(8)} ${'read'.padStart(5)} ${'e2e pass%'.padStart(9)} ${'fid'.padStart(5)}`);
+for (const repo of Object.keys(byRepo)) {
+  for (const arm of ARM_ORDER) {
+    const rs = byRepo[repo][arm]; if (!rs) continue;
+    const e2e = verdictTally(rs, 'e2e');
+    const passPct = Math.round(100 * e2e.pass / rs.length);
+    const totCost = (med(rs.map(r => r.costUsdMain)) ?? 0) + (med(rs.map(r => r.ai?.costUsd ?? 0)) ?? 0);
+    const fid = arm === 'offload' ? med(rs.flatMap(r => r.fidelity?.scores ?? [])) : null;
+    console.log(`${repo.padEnd(12)} ${arm.padEnd(8)} ${(med(rs.map(r => r.durationSec)) + 's').padStart(7)} ${('$' + d3(totCost)).padStart(8)} ${String(med(rs.map(r => r.read))).padStart(5)} ${(passPct + '%').padStart(9)} ${String(fid ?? '—').padStart(5)}`);
+  }
+}
+console.log('');
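
Reviewer sketch (not part of the commit): the summarizer reports `totCost` as the median main-session cost plus the median AI cost. That is not the same quantity as the median of each run's combined cost, which can matter at n=3. The `med` helper below is copied from the script; the three runs are made-up numbers chosen only to show the difference.

```javascript
// Median helper copied from the summarizer: drops null/undefined, sorts numerically.
const med = (xs) => {
  const a = xs.filter(x => x != null).sort((p, q) => p - q);
  if (!a.length) return null;
  const m = Math.floor(a.length / 2);
  return a.length % 2 ? a[m] : (a[m - 1] + a[m]) / 2;
};

// Made-up runs (illustrative values, not eval results).
const runs = [
  { costUsdMain: 1, ai: { costUsd: 5 } },
  { costUsdMain: 2, ai: { costUsd: 0 } },
  { costUsdMain: 10, ai: { costUsd: 1 } },
];

// What the summarizer prints: median(main) + median(ai).
const sumOfMedians =
  (med(runs.map(r => r.costUsdMain)) ?? 0) + (med(runs.map(r => r.ai?.costUsd ?? 0)) ?? 0);

// Alternative: median of each run's combined cost.
const medianOfSums = med(runs.map(r => r.costUsdMain + (r.ai?.costUsd ?? 0)));

console.log({ sumOfMedians, medianOfSums }); // { sumOfMedians: 3, medianOfSums: 6 }
```

Either definition is defensible; the point is only that the `totCost` column should be read as a sum of two per-arm medians rather than as a typical per-run total.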

+ 76 - 0
scripts/agent-eval/offload-eval.md

@@ -0,0 +1,76 @@
+# CodeGraph AI offload — accuracy & adoption eval harness
+
+Measures the managed **offload** (`codegraph_explore` → reasoning model synthesis) and the
+**front-load hook** (approach 1) against plain codegraph and no-codegraph, across repo sizes,
+on **time · main-session tokens/cost · CodeGraph-AI tokens/cost · accuracy**.
+
+All agent arms run `claude -p --model sonnet --effort high` (the deliberate floor model — an
+affordance that lands on Sonnet generalizes up). Everything writes to a scratch dir
+(`AGENT_EVAL_OUT`, default `/tmp/cg-offload-eval`); nothing here is shipped to users.
+
+## Repos (selected via a memory-probe gate — NOT trained on)
+
+Famous repos (express, excalidraw, n8n, …) are useless for *accuracy* evals: Sonnet answers their
+flow questions from memory, so the no-codegraph baseline is dishonest. These four passed a no-tools
+probe (Sonnet could not name their real flow internals) and are cloned fresh by `offload-eval-setup.sh`:
+
+| tier | repo | ~src files | canonical flow |
+|---|---|---|---|
+| small | MTKruto/MTKruto | 322 TS | `sendMessage` → invoke → TL serialize → transport |
+| medium | mvdicarlo/postybirb-plus | 608 TS | submission → queue → per-website `.post()` |
+| complex | shapeshift/web | 3.2k TS (35-pkg monorepo) | swap → swapper registry → concrete swapper |
+| large | trezor/trezor-suite | 8k TS monorepo | send-form → sign thunk → `@trezor/connect` |
+
+Verified ground-truth flows (the judge's reference) live in `offload-eval-ground-truth.json`.
+
+## Arms
+
+- **offload** — codegraph + managed offload ON (requires `codegraph login`); records AI tokens/credits via `CODEGRAPH_OFFLOAD_USAGE_LOG`.
+- **raw** — codegraph, `CODEGRAPH_OFFLOAD_DISABLE=1` (returns raw source).
+- **nocg** — empty MCP config; Read/Grep baseline.
+- **frontload** — codegraph (offload-disabled) + a `UserPromptSubmit` hook (`offload-eval-hook.mjs`) that runs raw explore on the prompt and injects the result into context (approach 1).
+
+## Run it
+
+```bash
+npm run build                       # the harness shells out to dist/
+codegraph login                     # only needed for the offload arm
+export AGENT_EVAL_OUT=/tmp/cg-offload-eval
+
+bash scripts/agent-eval/offload-eval-setup.sh            # clone + index the 4 repos
+bash scripts/agent-eval/offload-eval-matrix.sh           # 3 arms × 4 tiers × REPS (default 3)
+node scripts/agent-eval/offload-eval-judge.mjs \
+     --results $AGENT_EVAL_OUT/results.jsonl \
+     --truth  scripts/agent-eval/offload-eval-ground-truth.json \
+     --out    $AGENT_EVAL_OUT/judged.jsonl
+node scripts/agent-eval/offload-eval-summarize.mjs $AGENT_EVAL_OUT/judged.jsonl
+
+bash scripts/agent-eval/offload-eval-frontload-matrix.sh # frontload arm + judge + merged summary
+```
+
+Single repo: `offload-eval-3arm.sh <indexed-repo> <tier> <reps> "<question>"` (or `-frontload.sh`).
+
+## Files
+
+- `offload-eval-setup.sh` — clone + index the 4 repos.
+- `offload-eval-3arm.sh` / `-frontload.sh` — one repo, the arms.
+- `offload-eval-matrix.sh` / `-frontload-matrix.sh` — drive all 4 tiers.
+- `offload-eval-hook.mjs` — the front-load `UserPromptSubmit` hook (resolves its own engine; `CG_FRONTLOAD_DEBUG=<path>` to log injections; `CG_FRONTLOAD_BUDGET` to cap injected chars).
+- `offload-eval-metrics.mjs` — one run's stream-json + usage log → one JSON metrics line.
+- `offload-eval-judge.mjs` — Sonnet judge: end-to-end (agent final vs ground truth) + per-answer offload fidelity.
+- `offload-eval-summarize.mjs` — per-tier, per-arm table + cross-repo roll-up.
+- `offload-eval-ground-truth.json` — source-verified canonical flows.
+
+## Findings (2026-06, n=3 — direction consistent, magnitudes noisy)
+
+- **Raw codegraph is the efficiency win** — ~nocg accuracy, fewer reads, faster, no AI cost.
+- **The offload is the least-accurate arm in all 4 tiers** — synthesized fidelity 12–27/100 with
+  fabrication in 3/4 (e.g. invented website services; traced `ClientPlain`/`SessionPlain` instead of
+  the real encrypted path). Its speed/cost win is narrow (medium-only) and inversely correlated with
+  accuracy. **Use raw until offload fidelity is fixed.**
+- **The front-load hook SOLVES adoption** — reads → 0–1 in every tier (incl. large, where the agent
+  otherwise read 12–24 files); fired 12/12, 0 errors. Wins on medium/complex (100% pass). But it
+  **regresses small/large to partial** — it suppresses the reads that compensate for explore's gaps at
+  **dynamic boundaries** (async queues, redux thunks, facade/factory indirection).
+- **Master lever for BOTH:** explore's dynamic-dispatch coverage. Fix it → front-load is complete
+  everywhere and the offload has the full flow to synthesize.

+ 8 - 2
src/bin/codegraph.ts

@@ -1382,8 +1382,14 @@ program
      success('Signed in to CodeGraph AI — managed reasoning is on.');
      try {
         const usage = await fetchUsage();
-        if (usage && typeof usage.remaining === 'number') {
-          info(`  credits: ${usage.remaining.toLocaleString()} remaining`);
+        if (usage) {
+          // Mirror `codegraph usage`'s precedence: a comped/internal account is
+          // flagged `unlimited` (often with remaining:0 when no allowance is set),
+          // so check that before the numeric balance or it reads "0 remaining".
+          if (usage.banned) warn('  Account suspended — contact support.');
+          else if (usage.unlimited) info('  credits: unlimited');
+          else if (typeof usage.remaining === 'number')
+            info(`  credits: ${usage.remaining.toLocaleString()} remaining`);
         }
       } catch {
         /* balance is best-effort */
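
Reviewer sketch (not part of the commit): the precedence in the hunk above can be read as a small pure function, which makes the `unlimited`-before-`remaining` ordering easy to check. The function name `creditLine` and the short return labels are hypothetical, and the sketch pins the `en-US` locale for a stable example, whereas the committed code calls `toLocaleString()` with the default locale.

```javascript
// Hypothetical pure version of the balance line printed after login.
// Order matters: banned, then unlimited, then the numeric balance.
function creditLine(usage) {
  if (!usage) return null;
  if (usage.banned) return 'suspended';
  if (usage.unlimited) return 'credits: unlimited';
  if (typeof usage.remaining === 'number') {
    // Assumption: en-US pinned here only so the example output is predictable.
    return `credits: ${usage.remaining.toLocaleString('en-US')} remaining`;
  }
  return null;
}

// A comped account flagged unlimited with remaining: 0 no longer reads "0 remaining".
console.log(creditLine({ unlimited: true, remaining: 0 }));
console.log(creditLine({ remaining: 1234 }));
console.log(creditLine({ banned: true, unlimited: true }));
```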

+ 4 - 4
src/installer/instructions-template.ts

@@ -17,8 +17,8 @@
  *    runs without this block, and consistently with it — including runs
  *    with zero Read/grep fallback.
  *  - **Non-MCP harnesses** — agents with no MCP client at all can still
- *    run the `codegraph explore` / `codegraph node` CLI, which prints the
- *    same output as the MCP tools.
+ *    run the `codegraph explore` CLI, which prints the same output as the
+ *    MCP tool.
  *
  * Keep this block SHORT. The main agent reads it every turn on top of the
  * server instructions — the #529 duplication-cost argument still bounds
@@ -44,8 +44,8 @@ export const CODEGRAPH_INSTRUCTIONS_BLOCK = `${CODEGRAPH_SECTION_START}
 
 In repositories indexed by CodeGraph (a \`.codegraph/\` directory exists at the repo root), reach for it BEFORE grep/find or reading files when you need to understand or locate code:
 
-- **MCP tools** (when available): \`codegraph_explore\` answers most code questions in one call — the relevant symbols' verbatim source plus the call paths between them. \`codegraph_node\` returns one symbol's source + callers, or reads a whole file with line numbers. If the tools are listed but deferred, load them by name via tool search.
-- **Shell** (always works): \`codegraph explore "<symbol names or question>"\` and \`codegraph node <symbol-or-file>\` print the same output.
+- **MCP tool** (when available): \`codegraph_explore\` answers most code questions in one call — the relevant symbols' verbatim source plus the call paths between them, including dynamic-dispatch hops grep can't follow. Name a file or symbol in the query to read its current line-numbered source. If it's listed but deferred, load it by name via tool search.
+- **Shell** (always works): \`codegraph explore "<symbol names or question>"\` prints the same output.
 
 If there is no \`.codegraph/\` directory, skip CodeGraph entirely — indexing is the user's decision.
 ${CODEGRAPH_SECTION_END}`;

+ 13 - 12
src/installer/targets/shared.ts

@@ -31,20 +31,21 @@ export function getMcpServerConfig(): { type: string; command: string; args: str
 
 /**
  * Permissions list for Claude `settings.json`. Other targets that
- * have a permissions concept can compose this list directly. The
- * permission strings follow Claude's `mcp__<server>__<tool>` format.
+ * have a permissions concept can compose this list directly.
+ *
+ * One server-scoped wildcard rather than a per-tool list. By default only
+ * `codegraph_explore` is even LISTED to the agent (see DEFAULT_MCP_TOOLS in
+ * mcp/tools.ts), so in practice explore is the only tool this auto-approves —
+ * but the wildcard means that if a user re-enables another tool via
+ * CODEGRAPH_MCP_TOOLS, it's already pre-approved (no permission prompt, no
+ * hand-editing settings.json), and future tools are covered too. Claude only
+ * honors globs after a literal `mcp__<server>__` prefix, so this exact string
+ * is the way to allow-all for one server; a bare `mcp__codegraph` or `*` is
+ * ignored. The allowlist gates PROMPTING, not visibility, so a superset here
+ * never makes a hidden tool appear.
  */
 export function getCodeGraphPermissions(): string[] {
-  return [
-    'mcp__codegraph__codegraph_explore',
-    'mcp__codegraph__codegraph_search',
-    'mcp__codegraph__codegraph_node',
-    'mcp__codegraph__codegraph_callers',
-    'mcp__codegraph__codegraph_callees',
-    'mcp__codegraph__codegraph_impact',
-    'mcp__codegraph__codegraph_files',
-    'mcp__codegraph__codegraph_status',
-  ];
+  return ['mcp__codegraph__*'];
 }
 
 /**

+ 29 - 36
src/mcp/server-instructions.ts

@@ -7,13 +7,15 @@
  * before it sees individual tool descriptions.
  *
  * Goals when editing this:
- *   - Tool selection by intent (which tool for which question)
- *   - Common chains (refactor planning = X then Y)
- *   - Anti-patterns (don't grep when codegraph_search is faster)
+ *   - Lead the agent to codegraph_explore for any structural/flow question
+ *   - Reinforce "explore instead of Read/Grep" for indexed code
+ *   - Anti-patterns (don't re-verify with grep; don't hand-reconstruct flows)
  *
  * Keep it tight. The agent reads this every session — long instructions
- * burn tokens. Reference only tools that exist on `main`; gate any
- * conditional tools behind feature checks if/when they ship.
+ * burn tokens. The DEFAULT MCP surface is `codegraph_explore` ALONE (see
+ * DEFAULT_MCP_TOOLS in tools.ts) — reference only that tool here. The other
+ * tools (node/search/callers/…) stay defined and are re-enablable via
+ * CODEGRAPH_MCP_TOOLS, but they are NOT listed to agents, so don't name them.
  */
 export const SERVER_INSTRUCTIONS = `# Codegraph — code intelligence over an indexed knowledge graph
 
@@ -27,45 +29,36 @@ verbatim source PLUS who calls it and what it affects, so you edit with the
 blast radius in view. More accurate context, in far fewer tokens and
 round-trips than reading files yourself.
 
-## Use codegraph instead of reading files — for questions AND edits
+## One tool: codegraph_explore — use it instead of reading files
 
-Whether you're answering "how does X work" or implementing a change (fixing
-a bug, adding a feature), reach for codegraph before you Read. For
-understanding, answer DIRECTLY — usually with ONE \`codegraph_explore\` call.
-\`codegraph_explore\` takes either a natural-language question or a bag of
-symbol/file names and returns the verbatim source of the relevant symbols
-grouped by file, so it is Read-equivalent and most often the ONLY
-codegraph call you need. Codegraph IS the pre-built search index — so
-delegating the lookup to a separate file-reading sub-task/agent, or
-running your own grep + read loop, repeats work codegraph already did and
-costs more for the same answer. Reach for raw Read/Grep only to confirm a
-specific detail codegraph didn't cover. A direct codegraph answer is
-typically one to a few calls; a grep/read exploration is dozens.
+There is a single tool, \`codegraph_explore\`, and it is Read-equivalent. It
+takes either a natural-language question or a bag of symbol/file names and
+returns the **verbatim, line-numbered source** of the relevant symbols
+grouped by file — the same \`<n>\\t<line>\` shape \`Read\` gives you, safe to
+\`Edit\` from — PLUS the call path among them (including dynamic-dispatch hops
+like callbacks, React re-render, and JSX children that grep can't follow) and
+a blast-radius summary of what depends on them.
 
 
-## Tool selection by intent
+Whether you're answering "how does X work" or implementing a change (fixing a
+bug, adding a feature), call \`codegraph_explore\` before you Read. ONE call
+usually answers the whole question. Codegraph IS the pre-built search index —
+so running your own grep + read loop, or delegating the lookup to a separate
+file-reading sub-task/agent, repeats work codegraph already did and costs more
+for the same answer. A direct codegraph answer is typically one to a few
+calls; a grep/read exploration is dozens.
 
-- **Almost any question — "how does X work", architecture, a bug, "what/where is X", or surveying an area** → \`codegraph_explore\` (PRIMARY — call FIRST; ONE capped call returns the verbatim source of the relevant symbols grouped by file; most often the ONLY call you need)
-- **"How does X reach/become Y? / the flow / the path from X to Y"** → \`codegraph_explore\`, naming the symbols that span the flow (e.g. \`mutateElement renderScene\`) — it surfaces the call path among them, including dynamic-dispatch hops (callbacks, React re-render, JSX children) grep can't follow
-- **"What is the symbol named X?" (just its location)** → \`codegraph_search\`
-- **"What calls this?" / "What would changing this break?"** → \`codegraph_callers\` — EVERY call site with file:line, including where a function is **registered as a callback** (passed as an argument, assigned to a function pointer/field, listed in a handler table) — labeled "via callback registration" — so a function with no direct calls is NOT dead if it's wired up somewhere. When several UNRELATED symbols share a name (one \`UserService\` per monorepo app), it reports **one section per definition** (never a merged list) — pass \`file\` to focus the definition you mean. The wider blast radius arrives automatically on \`codegraph_explore\` (its "Blast radius" section) and \`codegraph_node\` (the dependents note)
-- **"What does this call?"** → \`codegraph_node\` with that symbol and \`includeCode: true\` — the body IS the callee list, and the caller/callee trail comes with it
-- **Reading a source FILE (any time you'd use the \`Read\` tool)** → \`codegraph_node\` with a \`file\` path and no \`symbol\`. It returns the file's **current source with line numbers — the same \`<n>\\t<line>\` shape \`Read\` gives you, safe to \`Edit\` from** — narrowable with \`offset\`/\`limit\` exactly like \`Read\`, PLUS a one-line note of which files depend on it. Same bytes as \`Read\`, faster (served from the index), with the blast radius attached. Use it **instead of \`Read\`** for indexed source files; fall back to \`Read\` only for what codegraph doesn't index (configs, docs). Pass \`symbolsOnly: true\` for just the file's structure.
-- **About to read or edit a symbol you can name** → \`codegraph_node\` with that \`symbol\` (SECONDARY — the after-explore depth tool): the verbatim source (\`includeCode: true\`) PLUS its caller/callee trail, so before changing it you see what calls it and what your edit would break. For an OVERLOADED name it returns EVERY matching definition's body in one call, so you never Read a file to find the right overload
+## How to query
 
-## Common chains
-
-- **Flow / "how does X reach Y"**: ONE \`codegraph_explore\` with the symbol names spanning the flow — it surfaces the call path among them (riding dynamic-dispatch hops) AND returns their source. No need to reconstruct the path with \`codegraph_search\` + \`codegraph_callers\`.
-- **Onboarding / understanding any area**: ONE \`codegraph_explore\` is usually the whole answer. Only follow up — \`codegraph_node\` for a specific symbol — if something is still unclear.
-- **Refactor planning**: \`codegraph_callers\` for the complete call-site list to update; the wider blast radius is already attached to \`codegraph_explore\` / \`codegraph_node\` output.
-- **Debugging a regression**: \`codegraph_callers\` of the suspected symbol; \`codegraph_node\` on anything unexpected that appears.
+- **Almost any question — "how does X work", architecture, a bug, "what/where is X", or surveying an area** → \`codegraph_explore\` with a natural-language question or the relevant names. ONE capped call returns the verbatim source grouped by file; most often the ONLY call you need.
+- **"How does X reach/become Y? / the flow / the path from X to Y"** → \`codegraph_explore\`, naming the symbols that span the flow (e.g. \`mutateElement renderScene\`) — it surfaces the call path among them, riding dynamic-dispatch hops, and returns their source.
+- **Reading or editing a file/symbol you can name** → put its name or file path in the \`codegraph_explore\` query — it returns that current line-numbered source (safe to \`Edit\` from) with the call path and blast radius attached, so you don't Read it separately. For an overloaded name it returns every matching definition's body in one call.
+- **Need more?** Call \`codegraph_explore\` again with more specific names — treat the source it returns as already Read.
 
 ## Anti-patterns
 
 - **Trust codegraph's results — don't re-verify them with grep.** They come from a full AST parse; re-checking with grep is slower, less accurate, and wastes context.
-- **Don't grep first** when looking up a symbol by name — \`codegraph_search\` is faster and returns kind + location + signature.
-- **Don't chain \`codegraph_search\` + \`codegraph_node\`** to understand an area — ONE \`codegraph_explore\` returns the relevant symbols' source together in a single round-trip.
-- **Don't loop \`codegraph_node\` over many symbols** — one \`codegraph_explore\` call returns them all grouped by file, while each separate call re-reads the whole context and costs far more. Use \`codegraph_node\` for a single symbol.
-- **Don't reach for the \`Read\` tool on an indexed source file** — \`codegraph_node\` with a \`file\` reads it for you (same \`<n>\\t<line>\` source, \`offset\`/\`limit\` like Read, faster, with its blast radius), and with a \`symbol\` it returns the source plus the caller/callee trail. Reach for raw \`Read\` only for what codegraph doesn't index (configs, docs) or when the staleness banner flags a file as pending re-index.
+- **Don't grep or Read first** to find or understand indexed code — ONE \`codegraph_explore\` returns the relevant symbols' source together in a single round-trip. Reach for raw \`Read\`/\`Grep\` only to confirm a specific detail codegraph didn't cover, or for what codegraph doesn't index (configs, docs).
+- **Don't reconstruct a flow by hand** — name the endpoints in one \`codegraph_explore\` and it surfaces the path between them, dynamic-dispatch hops included.
 - **After editing, check the staleness banner.** When a tool response starts with "⚠️ Some files referenced below were edited since the last index sync…", the listed files are pending re-index — Read those specific files for accurate content. Every file NOT in that banner is fresh, so still trust codegraph. A different, rarer banner — "⚠️ CodeGraph auto-sync is DISABLED…" — means live watching stopped entirely (the whole index is frozen, not just a few files); until it's resolved, Read files directly to confirm anything that may have changed.
 
 
 ## Limitations

+ 10 - 20
src/mcp/tools.ts

@@ -633,28 +633,18 @@ export function getStaticTools(): ToolDefinition[] {
 }
 }
 
 
 /**
 /**
- * The MCP tools served by DEFAULT (short names). The other defined tools
- * (callees, impact, files, status) remain fully functional — handlers stay,
- * the library API and CLI are untouched, and `CODEGRAPH_MCP_TOOLS` re-enables
- * any of them — they just aren't LISTED to agents anymore.
+ * The MCP tools served by DEFAULT (short names). Pared to ONLY `codegraph_explore`
+ * — the single tool that reliably earns its place: one capped call returns the
+ * verbatim source of the relevant symbols grouped by file (and, with the offload,
+ * a reasoned flow map over that source). Every other tool is a narrower slice of
+ * what explore already does, and presence itself steers mis-picks, so they are no
+ * longer LISTED to agents.
  *
  *
- * Evidence for the cut (the "adapt the tool to the agent" principle —
- * fewer tools = fewer mis-picks, and presence itself steers):
- * - `codegraph_impact` appears in ZERO recorded eval runs ever — its
- *   blast-radius info already arrives inline on explore (the "Blast radius"
- *   section) and node (the dependents note), so agents never need the
- *   standalone tool.
- * - `codegraph_callees` is redundant by construction: a symbol's body (which
- *   node returns) IS its callee list, plus the caller/callee trail.
- * - `codegraph_files` / `codegraph_status`: the tiny-repo audit (see
- *   getTools) found they "reduce to one grep"; staleness banners already
- *   inline the pending-sync info on every read tool, and the CLI covers
- *   diagnostics.
- * - `codegraph_callers` stays: exhaustive call-site enumeration (every
- *   caller with file:line, callback registrations labeled, one section per
- *   same-named definition) is the one job explore/node don't replicate.
+ * The other defined tools (`node`, `search`, `callers`, plus callees/impact/files/
+ * status) remain fully functional — handlers stay, the library API and CLI are
+ * untouched, and `CODEGRAPH_MCP_TOOLS=explore,node,...` re-enables any of them.
  */
  */
-const DEFAULT_MCP_TOOLS = new Set(['explore', 'node', 'search', 'callers']);
+const DEFAULT_MCP_TOOLS = new Set(['explore']);
 
 
 /**
 /**
  * Tool handler that executes tools against a CodeGraph instance
  * Tool handler that executes tools against a CodeGraph instance
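Reviewer note on the `src/mcp/tools.ts` change: the comment says `CODEGRAPH_MCP_TOOLS=explore,node,...` re-enables the unlisted tools, but the parsing code is not part of this hunk. The sketch below is one plausible way such an allowlist could be resolved. The helper name `resolveListedTools`, the `KNOWN_TOOLS` set, and the fallback behaviour are assumptions made for illustration, not the repository's implementation.

```typescript
// Hypothetical sketch: how a comma-separated env allowlist could widen the default
// tool surface. Only DEFAULT_MCP_TOOLS and the short tool names come from the diff.
const DEFAULT_MCP_TOOLS = new Set(['explore']);

// Short names mentioned in the diff's doc comment.
const KNOWN_TOOLS = new Set([
  'explore', 'node', 'search', 'callers', 'callees', 'impact', 'files', 'status',
]);

function resolveListedTools(raw: string | undefined): Set<string> {
  // Unset or blank: serve the pared-down default.
  if (!raw || raw.trim() === '') return new Set(DEFAULT_MCP_TOOLS);
  const picked = raw
    .split(',')
    // Accept both `node` and `codegraph_node` (an assumption, for convenience).
    .map((name) => name.trim().replace(/^codegraph_/, ''))
    // Unknown names are dropped silently rather than failing startup.
    .filter((name) => KNOWN_TOOLS.has(name));
  // If nothing valid survives, fall back to the default instead of listing no tools.
  return picked.length > 0 ? new Set(picked) : new Set(DEFAULT_MCP_TOOLS);
}
```

Falling back to the default on an empty or invalid list keeps a typo in the env var from leaving the agent with no tools at all.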

+ 11 - 0
src/reasoning/config.ts

@@ -102,6 +102,17 @@ const trimmed = (v: string | undefined): string | undefined => {
 
 
 /** Merge the persisted config with `CODEGRAPH_OFFLOAD_*` env overrides (env wins). */
 /** Merge the persisted config with `CODEGRAPH_OFFLOAD_*` env overrides (env wins). */
 export function resolveOffload(env: NodeJS.ProcessEnv = process.env): ResolvedOffload {
 export function resolveOffload(env: NodeJS.ProcessEnv = process.env): ResolvedOffload {
+  // Hard kill-switch: disable the offload for this process/session without touching
+  // the persisted config or the stored login — e.g. one A/B arm, or a user who wants
+  // codegraph_explore to return raw source for a session. Env-only by design.
+  if (env.CODEGRAPH_OFFLOAD_DISABLE === '1') {
+    return {
+      enabled: false, managed: false, url: undefined, model: MANAGED_DEFAULT_MODEL,
+      apiKey: undefined, keySource: undefined, effort: 'low', style: 'plain',
+      timeoutMs: 20000, maxTokens: 12000, strip: false,
+      debug: env.CODEGRAPH_OFFLOAD_DEBUG === '1', origin: 'none',
+    };
+  }
   const c = readOffloadConfig();
   const c = readOffloadConfig();
   const managed = !!c.managed;
   const managed = !!c.managed;
   const envUrl = trimmed(env.CODEGRAPH_OFFLOAD_URL);
   const envUrl = trimmed(env.CODEGRAPH_OFFLOAD_URL);
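Reviewer note on the kill-switch: because `CODEGRAPH_OFFLOAD_DISABLE` is env-only and checked before `readOffloadConfig()`, an eval harness can flip one A/B arm to raw source without touching the persisted config. A minimal sketch of building the per-arm environment follows. The `Arm` type and the `envForArm` helper are hypothetical; only the two env var names and the exact `'1'` comparison come from the diff.

```typescript
// Hypothetical harness helper: derive a child-process environment per A/B arm.
type Arm = 'offload' | 'raw';

function envForArm(
  arm: Arm,
  base: Record<string, string | undefined>,
  usageLogPath: string,
): Record<string, string | undefined> {
  // Copy so the caller's environment object is never mutated.
  const env: Record<string, string | undefined> = { ...base };
  // Each arm writes its own per-call usage log (see recordUsage in reasoner.ts).
  env['CODEGRAPH_OFFLOAD_USAGE_LOG'] = usageLogPath;
  if (arm === 'raw') {
    // resolveOffload compares against the exact string '1', so 'true' would not work.
    env['CODEGRAPH_OFFLOAD_DISABLE'] = '1';
  } else {
    // Make sure an inherited kill-switch does not leak into the offload arm.
    delete env['CODEGRAPH_OFFLOAD_DISABLE'];
  }
  return env;
}
```

The returned object could then be passed as the `env` option when spawning the MCP server for that arm.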

+ 44 - 1
src/reasoning/reasoner.ts

@@ -28,6 +28,7 @@
  * result — a broken offload must be invisible to the agent (one isError early in a
  * result — a broken offload must be invisible to the agent (one isError early in a
  * session and an agent can abandon the tool entirely).
  * session and an agent can abandon the tool entirely).
  */
  */
+import * as fs from 'fs';
 import { resolveOffload } from './config';
 import { resolveOffload } from './config';
 
 
 interface SynthArgs {
 interface SynthArgs {
@@ -87,6 +88,23 @@ function debug(...args: unknown[]): void {
   }
   }
 }
 }
 
 
+/**
+ * Append one JSON line of per-call offload usage to `CODEGRAPH_OFFLOAD_USAGE_LOG`
+ * when that env var is set (otherwise a no-op). Lets a harness attribute CodeGraph AI
+ * tokens + cost to a single run without depending on the metered server's cumulative
+ * totals. Best-effort: a write failure is logged under debug and never disrupts the
+ * tool call (the offload is strictly degradable, and so is its bookkeeping).
+ */
+function recordUsage(entry: Record<string, unknown>): void {
+  const logPath = process.env.CODEGRAPH_OFFLOAD_USAGE_LOG;
+  if (!logPath) return;
+  try {
+    fs.appendFileSync(logPath, JSON.stringify(entry) + '\n');
+  } catch (err) {
+    debug('usage-log write failed', (err as Error)?.message);
+  }
+}
+
 // Shared preamble: the model is a pure analysis function, never an agent.
 // Shared preamble: the model is a pure analysis function, never an agent.
 // CORRECTNESS-FIRST — a synthesized answer is only useful if it is never wrong,
 // CORRECTNESS-FIRST — a synthesized answer is only useful if it is never wrong,
 // and NEVER confidently wrong. The calibration below is the load-bearing part.
 // and NEVER confidently wrong. The calibration below is the load-bearing part.
@@ -215,14 +233,39 @@ export async function synthesizeOffload({ query, context }: SynthArgs): Promise<
     }
     }
     const data = (await res.json()) as {
     const data = (await res.json()) as {
       choices?: Array<{ message?: { content?: string }; finish_reason?: string }>;
       choices?: Array<{ message?: { content?: string }; finish_reason?: string }>;
+      usage?: { prompt_tokens?: number; completion_tokens?: number; total_tokens?: number };
     };
     };
+    // Per-call usage/cost capture. The managed gateway returns the spend in the
+    // `x-cg-credits-charged` header (100k credits = $1) and the token counts in the
+    // standard OpenAI `usage` block; a BYO endpoint typically returns `usage` only.
+    // This is the source of truth for "CodeGraph AI tokens + cost" per run.
+    // Optional chaining: usage bookkeeping must NEVER break the degradable path, even if
+    // a response/mock lacks a headers object. `?? NaN` keeps a missing header from reading as 0.
+    const creditsCharged = Number(res.headers?.get?.('x-cg-credits-charged') ?? NaN);
     const answer = data.choices?.[0]?.message?.content?.trim();
     const answer = data.choices?.[0]?.message?.content?.trim();
+    recordUsage({
+      ts: new Date().toISOString(),
+      ms: Date.now() - started,
+      model: cfg.model,
+      style: cfg.style,
+      managed: cfg.managed,
+      promptTokens: data.usage?.prompt_tokens ?? null,
+      completionTokens: data.usage?.completion_tokens ?? null,
+      totalTokens: data.usage?.total_tokens ?? null,
+      creditsCharged: Number.isFinite(creditsCharged) ? creditsCharged : null,
+      costUsd: Number.isFinite(creditsCharged) ? creditsCharged / 100_000 : null,
+      queryLen: query.length,
+      ctxLen: ctx.length,
+      rawCtxLen: context.length,
+      answerLen: answer?.length ?? 0,
+      finishReason: data.choices?.[0]?.finish_reason ?? null,
+    });
     if (!answer) {
     if (!answer) {
       debug('empty answer', JSON.stringify(data).slice(0, 200));
       debug('empty answer', JSON.stringify(data).slice(0, 200));
       return null;
       return null;
     }
     }
     debug(
     debug(
-      `ok in ${Date.now() - started}ms [${cfg.style}] — answer ${answer.length} chars (ctx ${ctx.length} of ${context.length}, finish=${data.choices?.[0]?.finish_reason})`
+      `ok in ${Date.now() - started}ms [${cfg.style}] — answer ${answer.length} chars (ctx ${ctx.length} of ${context.length}, finish=${data.choices?.[0]?.finish_reason}), ${data.usage?.total_tokens ?? '?'} tok, ${Number.isFinite(creditsCharged) ? creditsCharged + ' cr' : 'no-charge-hdr'}`
     );
     );
     return answer + footer;
     return answer + footer;
   } catch (err) {
   } catch (err) {
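Reviewer note on the usage log: `recordUsage` appends one JSON object per line, so a harness can attribute tokens and cost to a run by reading the file back. The sketch below totals a log's contents. The `summarizeUsage` name and the `UsageTotals` shape are assumptions; the `totalTokens` and `costUsd` field names are taken from the diff. Entries with a `null` cost (for example a BYO endpoint that sends no charge header) are counted separately rather than treated as free.

```typescript
// Hypothetical reader for the JSONL file written via CODEGRAPH_OFFLOAD_USAGE_LOG.
interface UsageTotals {
  calls: number;
  totalTokens: number;
  costUsd: number;
  unpricedCalls: number; // entries whose costUsd was null or missing
}

function summarizeUsage(jsonl: string): UsageTotals {
  const totals: UsageTotals = { calls: 0, totalTokens: 0, costUsd: 0, unpricedCalls: 0 };
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue;
    let entry: unknown;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip a torn or partial line instead of failing the whole summary
    }
    if (typeof entry !== 'object' || entry === null) continue;
    const rec = entry as { totalTokens?: number | null; costUsd?: number | null };
    totals.calls++;
    totals.totalTokens += rec.totalTokens ?? 0;
    if (typeof rec.costUsd === 'number') totals.costUsd += rec.costUsd;
    else totals.unpricedCalls++;
  }
  return totals;
}
```

In practice the input string would come from reading the log path with `fs.readFileSync(path, 'utf8')` after the run finishes.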

+ 61 - 1
src/resolution/callback-synthesizer.ts

@@ -1646,10 +1646,68 @@ function svelteKitLoadEdges(ctx: ResolutionContext): Edge[] {
   return edges;
   return edges;
 }
 }
 
 
+/**
+ * Redux-thunk dispatch chain. `export const X = createAsyncThunk(prefix, async (a, api) => {...})`
+ * (or a wrapper like trezor's `createThunk(...)`) passes the async body as an ARGUMENT, so
+ * tree-sitter never extracts it as a function node: `X` is a `constant` whose body's calls are
+ * ORPHANED. The `dispatch(nextThunk(...))` calls that drive a thunk chain forward therefore produce
+ * no edges, so `callees(X)` is empty and a flow `dispatch(X(...)) → X → nextThunk` dead-ends at the
+ * constant (validated on trezor-suite: the signXxxThunk constants had ZERO outgoing edges). Bridge
+ * it: body-scan each thunk constant for `dispatch(Y(...))` and link `X → Y`, so the dispatch chain
+ * connects. High-precision — the `dispatch(` keyword plus `Y` must resolve to a function/constant/
+ * method node; capped; gated on thunk constants existing so it never runs on non-RTK repos.
+ * Cross-file by design (a suite thunk dispatches a wallet-core thunk). Provenance `heuristic`,
+ * `synthesizedBy:'redux-thunk'`; `registeredAt` is the dispatch site.
+ */
+const THUNK_DECL_RE = /create(?:Async)?Thunk/;
+const THUNK_DISPATCH_RE = /\bdispatch\s*\(\s*([A-Za-z_]\w*)\s*[(),]/g;
+const THUNK_FANOUT_CAP = 24;
+
+function reduxThunkEdges(queries: QueryBuilder, ctx: ResolutionContext): Edge[] {
+  const edges: Edge[] = [];
+  const seen = new Set<string>();
+  for (const node of queries.iterateNodesByKind('constant')) {
+    // Cheap gate: the initializer (captured in `signature`) must be a create(Async)Thunk call —
+    // avoids reading every constant's body on a large repo.
+    if (!node.signature || !THUNK_DECL_RE.test(node.signature)) continue;
+    const content = ctx.readFile(node.filePath);
+    const src = content && sliceLines(content, node.startLine, node.endLine);
+    if (!src) continue;
+    // Thunks are TS/JS-family (same // and /* */ comment syntax); map to a CommentLang.
+    const safe = stripCommentsForRegex(src, node.language === 'javascript' || node.language === 'jsx' ? 'javascript' : 'typescript');
+    THUNK_DISPATCH_RE.lastIndex = 0;
+    let m: RegExpExecArray | null;
+    let added = 0;
+    while ((m = THUNK_DISPATCH_RE.exec(safe)) && added < THUNK_FANOUT_CAP) {
+      const name = m[1]!;
+      if (name === node.name) continue; // self-dispatch (recursive thunk) — skip
+      const target = ctx
+        .getNodesByName(name)
+        .find((n) => n.kind === 'constant' || n.kind === 'function' || n.kind === 'method');
+      if (!target || target.id === node.id) continue;
+      const key = `${node.id}>${target.id}`;
+      if (seen.has(key)) continue;
+      seen.add(key);
+      const line = node.startLine + safe.slice(0, m.index).split('\n').length - 1;
+      edges.push({
+        source: node.id,
+        target: target.id,
+        kind: 'calls',
+        line,
+        provenance: 'heuristic',
+        metadata: { synthesizedBy: 'redux-thunk', via: name, registeredAt: `${node.filePath}:${line}` },
+      });
+      added++;
+    }
+  }
+  return edges;
+}
+
 /**
 /**
  * Synthesize dispatcher→callback edges (field observers + EventEmitters +
  * Synthesize dispatcher→callback edges (field observers + EventEmitters +
  * React re-render + JSX children + Vue templates + SvelteKit load + RN event
  * React re-render + JSX children + Vue templates + SvelteKit load + RN event
- * channel + Fabric native-impl + MyBatis Java↔XML + Gin middleware chain).
+ * channel + Fabric native-impl + MyBatis Java↔XML + Gin middleware chain +
+ * Redux-thunk dispatch chain).
  * Returns the count added. Never throws into indexing — callers wrap in try/catch.
  * Returns the count added. Never throws into indexing — callers wrap in try/catch.
  */
  */
 export function synthesizeCallbackEdges(queries: QueryBuilder, ctx: ResolutionContext): number {
 export function synthesizeCallbackEdges(queries: QueryBuilder, ctx: ResolutionContext): number {
@@ -1687,6 +1745,7 @@ export function synthesizeCallbackEdges(queries: QueryBuilder, ctx: ResolutionCo
   const rnXPlatEdges = rnCrossPlatformEdges(queries);
   const rnXPlatEdges = rnCrossPlatformEdges(queries);
   const mybatisEdges = mybatisJavaXmlEdges(queries);
   const mybatisEdges = mybatisJavaXmlEdges(queries);
   const ginEdges = ginMiddlewareChainEdges(queries, ctx);
   const ginEdges = ginMiddlewareChainEdges(queries, ctx);
+  const thunkEdges = reduxThunkEdges(queries, ctx);
 
 
   const merged: Edge[] = [];
   const merged: Edge[] = [];
   const seen = new Set<string>();
   const seen = new Set<string>();
@@ -1710,6 +1769,7 @@ export function synthesizeCallbackEdges(queries: QueryBuilder, ctx: ResolutionCo
     ...rnXPlatEdges,
     ...rnXPlatEdges,
     ...mybatisEdges,
     ...mybatisEdges,
     ...ginEdges,
     ...ginEdges,
+    ...thunkEdges,
   ]) {
   ]) {
     const key = `${e.source}>${e.target}`;
     const key = `${e.source}>${e.target}`;
     if (seen.has(key)) continue;
     if (seen.has(key)) continue;
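Reviewer note on the thunk synthesizer: the body scan relies on `THUNK_DISPATCH_RE` matching only `dispatch(Identifier(`-style calls. The standalone sketch below runs the same regex over a made-up thunk body to show what it does and does not capture. The `dispatchedNames` helper and the sample source are hypothetical, and the comment stripping and node resolution that `reduxThunkEdges` performs are left out.

```typescript
// Same pattern as THUNK_DISPATCH_RE in the diff.
const THUNK_DISPATCH_RE = /\bdispatch\s*\(\s*([A-Za-z_]\w*)\s*[(),]/g;

// Hypothetical helper: unique dispatched identifiers, excluding self-dispatch.
function dispatchedNames(body: string, self: string): string[] {
  const names: string[] = [];
  THUNK_DISPATCH_RE.lastIndex = 0; // the regex is global, so reset before each scan
  let m: RegExpExecArray | null;
  while ((m = THUNK_DISPATCH_RE.exec(body)) !== null) {
    const name = m[1]!;
    if (name !== self && !names.includes(name)) names.push(name);
  }
  return names;
}

// Made-up thunk body for illustration only.
const sample = `
export const signThunk = createThunk('sign', async (args, { dispatch }) => {
  await dispatch(prepareThunk(args));
  dispatch(signThunk(args));
  dispatch({ type: 'plain/action' });
  return dispatch(broadcastThunk(args)).unwrap();
});`;

const found = dispatchedNames(sample, 'signThunk');
// found is ['prepareThunk', 'broadcastThunk']: the destructured `{ dispatch }`,
// the self-dispatch, and the plain action object are all skipped.
```

Plain action objects never match because the capture group requires an identifier immediately after `dispatch(`, which is what keeps the heuristic high-precision.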

Some files are not shown because too many files changed in this diff