chore(agent-eval): add agent-behavior eval harness for codegraph MCP usage

Tooling to measure how a Claude Code agent actually uses the codegraph
MCP tools on a real repo — does it lead with codegraph_explore, how many
Read/Grep follow-ups, token cost — for validating tool-guidance changes
(server-instructions, tool descriptions) against real agent behavior.

- itrun.sh drives the real interactive TUI via tmux (the faithful
  Explore path). Hardened for unattended runs: type-and-verify prompt
  delivery (the ❯ glyph is drawn ~6s before the input accepts keys),
  auto-accepts the "trust this folder" dialog, busy-detection keys on
  the universal "(Ns · …)" spinner so the pre-stream thinking phase
  counts as busy, and fails loudly instead of capturing an empty pane.
- parse-session.mjs reports the tool breakdown + token accounting
  (gen / fresh-in / cached-in / billable) from the session and subagent
  logs, consistent across main-thread and subagent runs; counts
  main-thread Bash in the grep verdict.
- run-agent.sh / parse-run.mjs are the headless stream-json complement
  (exact per-tool tokens/cost via claude -p).
- run-interactive-test.md documents how to run it and how completion is
  detected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Colby McHenry 1 month ago
parent
commit
1b527aaf0f

+ 131 - 0
run-interactive-test.md

@@ -0,0 +1,131 @@
+# Running the agent-behavior test (how agents actually use codegraph)
+
+This explains how to measure **how a Claude Code agent uses the codegraph MCP
+tools** on a real repo: which tools it calls (does it lead with
+`codegraph_explore`?), how many follow-up `Read`/`Grep` calls it makes, and the
+token cost. Use it when changing tool guidance (`server-instructions.ts`,
+`instructions-template.ts`, tool descriptions) or retrieval, to verify the
+change actually shifts agent behavior.
+
+Scripts live in `scripts/agent-eval/`.
+
+## Why two harnesses (read this first)
+
+| | Interactive (`itrun.sh`) | Headless (`run-agent.sh`) |
+|---|---|---|
+| Drives | the real TUI via tmux | `claude -p` print mode |
+| Subagent it picks | **Explore** (matches real UX) | general-purpose (diverges) |
+| Metrics | tool breakdown (from session logs) + `Done(…)` token summary | exact per-tool calls + tokens/cost (stream-json) |
+| Cost | Claude Max subscription | API $ (`total_cost_usd`) |
+
+**Headless `claude -p` does NOT reproduce what users see** — it silently picks
+the general-purpose subagent, while interactive sessions delegate to the
+read-first **Explore** subagent. So for "what does my session actually do," use
+the interactive harness. For a clean per-tool/token breakdown in one shot, use
+headless (and ask for the Explore subagent in the prompt if you want that path).
+
+## Prerequisites
+
+- **tmux 3.0+**
+- A logged-in `claude` CLI (Claude Max or API).
+- codegraph configured as an MCP server (`claude mcp list` shows `codegraph`).
+  The interactive harness uses your global config, so it runs whatever
+  `codegraph` resolves to — point that at your dev build (`npm link` / the
+  symlinked global) to test local changes.
+- A target repo, cloned and indexed:
+  ```bash
+  git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp
+  cd /tmp/corpus/okhttp && codegraph init -i
+  ```
+  Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600),
+  OkHttp (~640), VS Code (~10k).
+
+## Interactive test (the faithful one)
+
+```bash
+scripts/agent-eval/itrun.sh <repo-path> <label> "<question>"
+```
+
+Example:
+```bash
+scripts/agent-eval/itrun.sh /tmp/corpus/vscode vscode \
+  "How does the extension host communicate with the main process?"
+```
+
+It opens `claude` in a tmux session, types the question, waits for the agent to
+finish, then prints:
+- the `Done (N tool uses · Xk tokens · Ym)` subagent summary (from the pane),
+- the `Context Xk/1.0M` main-session size,
+- a **tool breakdown** parsed from the session logs (main + subagents), then a
+  `VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N` line and a `TOKENS: …` summary line.
+
+### Startup robustness (so unattended runs don't silently no-op)
+
+Two things bite an unattended driver before the prompt even runs:
+- **The `❯` glyph is drawn ~6s before the input accepts keystrokes.** Waiting
+  for `❯` is necessary but not sufficient. The harness sends the prompt, then
+  **verifies a chunk of it actually landed in the input box**, retrying until it
+  does — so it can't type into a not-yet-live input and submit nothing.
+- **First time claude opens a repo it shows "Is this a project you trust?"**
+  (which also contains `❯`). The harness detects that dialog and presses Enter
+  to accept it before typing.
+
+If the prompt never lands or work never starts, the harness now **fails loudly**
+(non-zero exit) instead of capturing an empty pane and reporting a bogus run.
+
+### How completion is detected (the tricky part)
+
+Claude's TUI redraws in place, so you can't just wait for output to stop. The
+harness polls `tmux capture-pane` and treats the pane as **busy** when it shows
+the spinner's elapsed-time-in-parens — `(8s · …)` / `(1m 3s · …)`, matched by
+`\(([0-9]+m )?[0-9]+s ·`. That's the *universal* working signal: it shows during
+the pre-stream **thinking** phase (`(8s · thinking with max effort)`, which has
+no token arrow yet) *and* during streaming. The `↓ N`/`↑ N` token arrow,
+`esc to interrupt`, and `Initializing…` are OR'd in as belt-and-braces (some TUI
+versions show one but not the others). It declares **idle** when the `❯` prompt
+is present and not busy for 10 consecutive polls (~5s, long enough to ride out
+mid-conversation thinking gaps that briefly drop the spinner). (Technique
+adapted from devpit's `WaitForIdle`.)
+
+### Where the breakdown comes from
+
+`parse-session.mjs` reads the newest session log under
+`~/.claude/projects/<escaped-cwd>/<session>.jsonl` and its subagent transcripts
+under `<session>/subagents/*.jsonl`. The **subagent** file is where the real
+tool calls are — the main log only shows the `Agent` delegation. You can run it
+standalone:
+```bash
+node scripts/agent-eval/parse-session.mjs /tmp/corpus/vscode
+```
+
+## Headless test (clean tokens, forceable Explore path)
+
+```bash
+scripts/agent-eval/run-agent.sh <repo-path> <label> "<question>"
+```
+Writes stream-json and prints the tool sequence + exact tokens/cost. To
+reproduce the Explore-subagent path headlessly, ask for it:
+`"Use an Explore subagent to investigate, then answer: …"`.
+
+## Running a sweep
+
+Single runs vary a lot (the VS Code question has ranged 26–37 tool uses /
+88–105k tokens across runs). For a real signal, run N≥3 and take the median:
+```bash
+for i in 1 2 3; do
+  scripts/agent-eval/itrun.sh /tmp/corpus/vscode "vscode-$i" "<question>"
+done
+```
+
+## What "good" looks like
+
+After the explore-first guidance (PR #191), an understanding question should
+show the agent **leading with `codegraph_explore`** and using `search`/`node`
+to fill gaps — not a wall of `Read`/`Grep`. Example faithful run:
+`VERDICT: codegraph_explore used 3x | Read 8 | Grep/Bash 1`. If `explore` is 0
+and `Read`/`Grep` dominate, the guidance regressed.
+
+## Output artifacts
+
+Transcripts and logs go to `$AGENT_EVAL_OUT` (default `/tmp/agent-eval/`):
+`itrun-<label>.txt` (pane capture), `run-<label>.jsonl` (headless stream-json), `run-<label>.err` (headless stderr).
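
Note on run-interactive-test.md: the busy/idle rule described under "How completion is detected" can be sketched as two small pure functions. This is an illustrative JavaScript sketch, not the harness's code (itrun.sh does this in bash against `tmux capture-pane`). The pattern is the one the doc quotes, OR'd with the secondary signals; `isBusy` and `firstIdlePoll` are hypothetical names, and each array element stands for one captured pane.

```javascript
// Same alternation the harness uses as its busy signal.
const BUSY_RE = /esc to interrupt|↓ [0-9]|↑ [0-9]|Initializing|\(([0-9]+m )?[0-9]+s ·/;

// Hypothetical helper: true when a captured pane shows any "working" signal.
function isBusy(pane) {
  return BUSY_RE.test(pane);
}

// Hypothetical helper: index of the poll at which the pane counts as idle
// (prompt glyph present and not busy for `needed` consecutive polls), or -1.
function firstIdlePoll(panes, needed = 10) {
  let consec = 0;
  for (let i = 0; i < panes.length; i++) {
    const pane = panes[i];
    if (isBusy(pane)) {
      consec = 0;                      // any busy signal resets the streak
    } else if (pane.includes('❯')) {
      consec += 1;
      if (consec >= needed) return i;  // stable idle reached
    } else {
      consec = 0;                      // neither busy nor at a prompt: keep waiting
    }
  }
  return -1;
}
```

Requiring a streak rather than a single idle poll is what lets the loop ride out short gaps where the spinner disappears mid-run.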

+ 107 - 0
scripts/agent-eval/itrun.sh

@@ -0,0 +1,107 @@
+#!/usr/bin/env bash
+# Drive an INTERACTIVE Claude Code session in tmux, send a prompt, wait for the
+# agent to finish, then print the tool-call breakdown from the session logs.
+#
+# Why interactive (not `claude -p`): headless print-mode picks the
+# general-purpose subagent, while real interactive sessions delegate to the
+# Explore subagent (or drive codegraph from the main thread). Only the
+# interactive TUI reproduces the behavior users actually see. (Idle-detection
+# technique borrowed from devpit's WaitForIdle.)
+#
+# Usage: itrun.sh <repo-path> <label> "<prompt>"
+# Output dir: $AGENT_EVAL_OUT (default /tmp/agent-eval)
+# Requires: tmux 3.0+, a logged-in `claude` CLI, codegraph MCP configured.
+set -uo pipefail
+REPO="$1"; LABEL="$2"; PROMPT="$3"
+SESSION="cgt_${LABEL}"
+OUT_DIR="${AGENT_EVAL_OUT:-/tmp/agent-eval}"; mkdir -p "$OUT_DIR"
+OUT="$OUT_DIR/itrun-${LABEL}.txt"
+HERE="$(cd "$(dirname "$0")" && pwd)"
+
+cap() { tmux capture-pane -p -t "$SESSION" -S -40; }
+
+tmux kill-session -t "$SESSION" 2>/dev/null
+
+# Wide pane so the TUI doesn't hard-wrap tool lines.
+tmux new-session -d -s "$SESSION" -x 230 -y 60
+tmux send-keys -t "$SESSION" "cd $(printf '%q' "$REPO") && claude --dangerously-skip-permissions" Enter
+
+# Wait for the ❯ prompt (claude drew its UI), up to 60s. NOTE: ❯ appears on the
+# welcome screen seconds before the input actually accepts keystrokes, so this is
+# necessary but NOT sufficient — the type-and-verify loop below is what proves
+# the input is live.
+ready=0
+for _ in $(seq 1 120); do
+  cap | grep -q "❯" && { ready=1; break; }
+  sleep 0.5
+done
+[ "$ready" = 1 ] || { echo "claude never drew its UI"; cap; tmux kill-session -t "$SESSION" 2>/dev/null; exit 1; }
+
+# Accept the per-folder "Is this a project you trust?" dialog if it shows (first
+# time claude opens a given repo). Option 1 ("Yes, I trust this folder") is
+# pre-selected, so Enter accepts. This dialog also contains ❯, so it must be
+# cleared before the type-and-verify loop or keystrokes land on the menu.
+for _ in $(seq 1 20); do
+  cap | grep -q "trust this folder" || break
+  tmux send-keys -t "$SESSION" Enter
+  sleep 1
+done
+
+# Type-and-verify: send the prompt, confirm a distinctive chunk of it actually
+# landed in the input box, retry if it didn't (handles the early-❯ race where
+# the welcome screen shows the prompt glyph but MCP init is still eating keys).
+needle="${PROMPT:0:24}"
+typed=0
+for _ in $(seq 1 30); do
+  tmux send-keys -l -t "$SESSION" "$PROMPT"
+  sleep 1
+  if cap | grep -Fq "$needle"; then typed=1; break; fi
+  # Clear whatever partial text may have landed, then retry.
+  tmux send-keys -t "$SESSION" C-u
+  sleep 1
+done
+[ "$typed" = 1 ] || { echo "prompt never landed in the input box"; cap; tmux kill-session -t "$SESSION" 2>/dev/null; exit 1; }
+sleep 0.5
+tmux send-keys -t "$SESSION" Enter
+
+# Busy signals. The robust one is the spinner's elapsed-time-in-parens, which
+# EVERY working state shows — both the pre-stream thinking phase
+# "(8s · thinking with max effort)" and the streaming phase
+# "(24s · ↑ 2.5k tokens · …)", and it still matches after the timer rolls over
+# from seconds to minutes ("1m 3s"). The token arrows, "esc to interrupt", and
+# "Initializing" are OR'd in as belt-and-braces (some TUI versions/states show one but not the others).
+BUSY_RE='esc to interrupt|↓ [0-9]|↑ [0-9]|Initializing|\(([0-9]+m )?[0-9]+s ·'
+
+# Wait for work to START (busy indicator appears), up to 60s. If it never starts,
+# fail loudly rather than silently reporting an empty run.
+started=0
+for _ in $(seq 1 120); do
+  cap | grep -qE "$BUSY_RE" && { started=1; break; }
+  sleep 0.5
+done
+[ "$started" = 1 ] || { echo "agent never started working"; cap; tmux kill-session -t "$SESSION" 2>/dev/null; exit 1; }
+
+# Poll for idle: not busy AND ❯ present, for 10 consecutive polls (~5s) to ride
+# out mid-conversation thinking gaps that briefly drop the spinner. Up to ~15min.
+consec=0
+for _ in $(seq 1 1800); do
+  pane=$(cap)
+  if echo "$pane" | grep -qE "$BUSY_RE"; then
+    consec=0
+  elif echo "$pane" | grep -q "❯"; then
+    consec=$((consec+1)); [ "$consec" -ge 10 ] && break
+  else
+    consec=0
+  fi
+  sleep 0.5
+done
+sleep 1
+
+tmux capture-pane -p -t "$SESSION" -S - > "$OUT"
+echo "captured $(wc -l < "$OUT") lines -> $OUT"
+grep -oE "Done \([^)]*\)" "$OUT" | tail -1
+grep -oE "[0-9.]+k?/[0-9.]+M" "$OUT" | tail -1 | sed 's/^/Context /'
+tmux kill-session -t "$SESSION" 2>/dev/null
+
+# Clean tool breakdown from the session logs (main + subagents).
+node "$HERE/parse-session.mjs" "$REPO" 2>/dev/null || true
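
Note on itrun.sh: the script greps the last `Done (…)` line out of the pane capture and prints it verbatim. If the numbers are needed for aggregation, that line could be parsed along these lines. This is a sketch only: it assumes the summary follows the `Done (N tool uses · Xk tokens · Ym)` template quoted in run-interactive-test.md, `parseDone` is a hypothetical name, and the TUI's exact wording may differ between versions.

```javascript
// Hypothetical helper: extract the LAST "Done (...)" summary from a pane capture,
// mirroring the script's `grep -oE ... | tail -1`.
function parseDone(paneText) {
  const re = /Done \((\d+) tool uses? · ([0-9.]+)(k?) tokens · ([^)]*)\)/g;
  let last = null;
  for (const m of paneText.matchAll(re)) last = m;
  if (!last) return null;
  // "88.5k" -> 88500; a bare number is taken as-is.
  const tokens = Math.round(parseFloat(last[2]) * (last[3] === 'k' ? 1000 : 1));
  return { toolUses: Number(last[1]), tokens, elapsed: last[4] };
}
```

Returning `null` when no summary is present keeps a main-thread run (no subagent, so no `Done` line) distinguishable from a run with zero tool uses.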

+ 45 - 0
scripts/agent-eval/parse-run.mjs

@@ -0,0 +1,45 @@
+#!/usr/bin/env node
+// Parse a Claude Code stream-json run log: tool-call sequence + token usage.
+import { readFileSync } from 'fs';
+const file = process.argv[2];
+const lines = readFileSync(file, 'utf8').split('\n').filter(Boolean);
+
+const toolCalls = [];
+let result = null;
+let initTools = null;
+
+for (const line of lines) {
+  let ev;
+  try { ev = JSON.parse(line); } catch { continue; }
+  if (ev.type === 'system' && ev.subtype === 'init') {
+    initTools = (ev.tools || []).filter(t => /codegraph/.test(t));
+  }
+  if (ev.type === 'assistant' && ev.message?.content) {
+    for (const block of ev.message.content) {
+      if (block.type === 'tool_use') {
+        let detail = '';
+        if (block.name === 'Task' || block.name === 'Agent') detail = ` [subagent_type=${block.input?.subagent_type ?? '?'}] ${(block.input?.description ?? '').slice(0,40)}`;
+        else if (/codegraph/.test(block.name)) detail = ` ${JSON.stringify(block.input?.query ?? block.input?.task ?? block.input?.symbol ?? '').slice(0,60)}`;
+        else if (block.name === 'Bash') detail = ` ${(block.input?.command ?? '').slice(0,50)}`;
+        else if (block.name === 'Read') detail = ` ${(block.input?.file_path ?? '').split('/').slice(-1)[0]}`;
+        toolCalls.push(`${block.name}${detail}`);
+      }
+    }
+  }
+  if (ev.type === 'result') result = ev;
+}
+
+console.log(`\n=== ${file.split('/').pop()} ===`);
+console.log(`codegraph tools exposed: ${initTools ? initTools.length : '?'}`);
+console.log(`\nTool calls (${toolCalls.length}):`);
+const counts = {};
+for (const tc of toolCalls) { const n = tc.split(' ')[0]; counts[n] = (counts[n]||0)+1; }
+console.log('  by type:', JSON.stringify(counts));
+toolCalls.forEach((tc, i) => console.log(`  ${i+1}. ${tc}`));
+
+if (result) {
+  const u = result.usage || {};
+  const totalIn = (u.input_tokens||0) + (u.cache_read_input_tokens||0) + (u.cache_creation_input_tokens||0);
+  console.log(`\nResult: ${result.subtype} | duration ${(result.duration_ms/1000).toFixed(0)}s | turns ${result.num_turns}`);
+  console.log(`  tokens: in=${totalIn} out=${u.output_tokens||0} | cost $${(result.total_cost_usd||0).toFixed(3)}`);
+}
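
Note on parse-run.mjs: the core of the parser is a tolerant line-by-line scan for `tool_use` blocks in `assistant` events. The sketch below isolates that step so it can be exercised without a log file. `toolNames` is a hypothetical name, and the sample events are made up to match only the fields the script reads; real stream-json events carry more fields.

```javascript
// Hypothetical helper: list tool_use names from stream-json lines, in call order.
function toolNames(lines) {
  const names = [];
  for (const line of lines) {
    let ev;
    try { ev = JSON.parse(line); } catch { continue; } // skip truncated or garbled lines
    if (ev.type !== 'assistant' || !Array.isArray(ev.message?.content)) continue;
    for (const block of ev.message.content) {
      if (block.type === 'tool_use') names.push(block.name);
    }
  }
  return names;
}

// Made-up sample events (illustrative only).
const sample = [
  JSON.stringify({ type: 'system', subtype: 'init', tools: ['Read'] }),
  JSON.stringify({ type: 'assistant', message: { content: [
    { type: 'tool_use', name: 'mcp__codegraph__codegraph_explore', input: { query: 'ipc' } },
  ] } }),
  '{"type":"assistant","message":{"content":[{"type":"te', // truncated line, ignored
  JSON.stringify({ type: 'assistant', message: { content: [
    { type: 'text', text: 'reading a file' },
    { type: 'tool_use', name: 'Read', input: { file_path: '/a/b.ts' } },
  ] } }),
];
console.log(toolNames(sample));
```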

+ 93 - 0
scripts/agent-eval/parse-session.mjs

@@ -0,0 +1,93 @@
+#!/usr/bin/env node
+// Parse the newest Claude Code session log for a project + its subagent logs,
+// and report the tool-call breakdown (main + subagents). Works for interactive
+// runs (driven via itrun.sh) — Claude Code writes full transcripts to
+// ~/.claude/projects/<escaped-cwd>/<session>.jsonl with subagents/ alongside.
+import { readFileSync, readdirSync, statSync, existsSync, realpathSync } from 'fs';
+import { join } from 'path';
+import { homedir } from 'os';
+
+const projectArg = process.argv[2];
+if (!projectArg) { console.error('usage: parse-session.mjs <project-dir>'); process.exit(1); }
+
+// Claude Code escapes the (real) cwd by replacing every "/" with "-".
+const real = realpathSync(projectArg);
+const escaped = real.replace(/\//g, '-');
+const projDir = join(homedir(), '.claude', 'projects', escaped);
+if (!existsSync(projDir)) { console.error('no session logs at', projDir); process.exit(1); }
+
+// Newest top-level session .jsonl
+const sessions = readdirSync(projDir)
+  .filter(f => f.endsWith('.jsonl'))
+  .map(f => ({ f, m: statSync(join(projDir, f)).mtimeMs }))
+  .sort((a, b) => b.m - a.m);
+if (sessions.length === 0) { console.error('no .jsonl sessions in', projDir); process.exit(1); }
+const sessionId = sessions[0].f.replace('.jsonl', '');
+
+function tally(file) {
+  const counts = {};
+  for (const line of readFileSync(file, 'utf8').split('\n')) {
+    if (!line) continue;
+    let ev; try { ev = JSON.parse(line); } catch { continue; }
+    const content = ev.message?.content;
+    if (!Array.isArray(content)) continue;
+    for (const b of content) {
+      if (b.type === 'tool_use') counts[b.name] = (counts[b.name] || 0) + 1;
+    }
+  }
+  return counts;
+}
+
+// Sum token usage from a transcript. The TUI's "Done (…Xk tokens…)" line only
+// covers a subagent's throughput; this works for main-thread runs too and is
+// consistent across both paths. `gen` = output, `fresh` = uncached input
+// (input + cache_creation), `cached` = cache reads (≈free); billable ≈ gen + fresh.
+function sumTokens(file) {
+  const t = { gen: 0, fresh: 0, cached: 0 };
+  for (const line of readFileSync(file, 'utf8').split('\n')) {
+    if (!line) continue;
+    let ev; try { ev = JSON.parse(line); } catch { continue; }
+    const u = ev.message?.usage;
+    if (!u) continue;
+    t.gen += u.output_tokens || 0;
+    t.fresh += (u.input_tokens || 0) + (u.cache_creation_input_tokens || 0);
+    t.cached += u.cache_read_input_tokens || 0;
+  }
+  return t;
+}
+
+const mainCounts = tally(join(projDir, sessionId + '.jsonl'));
+
+// Subagent transcripts live under <session>/subagents/*.jsonl
+const subDir = join(projDir, sessionId, 'subagents');
+const subCounts = {};
+let subAgentFiles = 0;
+if (existsSync(subDir)) {
+  for (const f of readdirSync(subDir).filter(f => f.endsWith('.jsonl'))) {
+    subAgentFiles++;
+    const c = tally(join(subDir, f));
+    for (const [k, v] of Object.entries(c)) subCounts[k] = (subCounts[k] || 0) + v;
+  }
+}
+
+const fmt = (counts) => Object.entries(counts).sort((a, b) => b[1] - a[1])
+  .map(([k, v]) => `    ${String(v).padStart(3)}  ${k}`).join('\n') || '    (none)';
+
+console.log(`session: ${sessionId}`);
+console.log(`\nMAIN thread tools:\n${fmt(mainCounts)}`);
+console.log(`\nSUBAGENT tools (${subAgentFiles} subagent transcript${subAgentFiles === 1 ? '' : 's'}):\n${fmt(subCounts)}`);
+
+const explore = (subCounts['mcp__codegraph__codegraph_explore'] || 0) + (mainCounts['mcp__codegraph__codegraph_explore'] || 0);
+const reads = (subCounts['Read'] || 0) + (mainCounts['Read'] || 0);
+const greps = (subCounts['Grep'] || 0) + (mainCounts['Grep'] || 0) + (subCounts['Bash'] || 0) + (mainCounts['Bash'] || 0);
+console.log(`\nVERDICT: codegraph_explore used ${explore}x | Read ${reads} | Grep/Bash ${greps}`);
+
+// Token totals (main + subagents), consistent across main-thread and subagent runs.
+const tok = { gen: 0, fresh: 0, cached: 0 };
+const addTok = (t) => { tok.gen += t.gen; tok.fresh += t.fresh; tok.cached += t.cached; };
+addTok(sumTokens(join(projDir, sessionId + '.jsonl')));
+if (existsSync(subDir)) {
+  for (const f of readdirSync(subDir).filter(f => f.endsWith('.jsonl'))) addTok(sumTokens(join(subDir, f)));
+}
+const k = (n) => (n / 1000).toFixed(1) + 'k';
+console.log(`TOKENS: gen ${k(tok.gen)} | fresh-in ${k(tok.fresh)} | cached-in ${k(tok.cached)} | billable≈ ${k(tok.gen + tok.fresh)}`);
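
Note on parse-session.mjs: run-interactive-test.md recommends running N≥3 and taking the median, but nothing in this commit aggregates across runs. A minimal sketch of that step is below. It assumes the `VERDICT:` lines from each run have been collected into an array; `parseVerdict`, `median`, and `medianVerdict` are hypothetical names, not part of the harness.

```javascript
// Hypothetical helper: parse one VERDICT line as printed by parse-session.mjs.
function parseVerdict(line) {
  const m = /VERDICT: codegraph_explore used (\d+)x \| Read (\d+) \| Grep\/Bash (\d+)/.exec(line);
  return m ? { explore: Number(m[1]), reads: Number(m[2]), greps: Number(m[3]) } : null;
}

// Median of a non-empty numeric array (mean of the two middle values when even).
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Per-metric median across runs; lines that are not VERDICT lines are ignored.
function medianVerdict(lines) {
  const runs = lines.map(parseVerdict).filter(Boolean);
  if (runs.length === 0) return null;
  return {
    explore: median(runs.map(r => r.explore)),
    reads: median(runs.map(r => r.reads)),
    greps: median(runs.map(r => r.greps)),
  };
}
```

Taking the median per metric (rather than picking the "median run") is a design choice of this sketch; either is defensible for three runs.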

+ 34 - 0
scripts/agent-eval/run-agent.sh

@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+# Headless Claude Code run against a repo with codegraph MCP, capturing the
+# full stream-json so we can see tool calls + token usage. Complements the
+# interactive itrun.sh: headless gives a clean per-tool breakdown + exact
+# tokens/cost, but defaults to the general-purpose subagent (not Explore).
+# To force the Explore path, ask for it in the prompt.
+#
+# Usage: run-agent.sh <repo-path> <label> "<prompt>"
+# Env: AGENT_EVAL_OUT (default /tmp/agent-eval), CG_BIN (codegraph dist binary)
+set -uo pipefail
+
+REPO="$1"; LABEL="$2"; PROMPT="$3"
+CG_BIN="${CG_BIN:-$(command -v codegraph || echo /usr/local/bin/codegraph)}"
+OUT_DIR="${AGENT_EVAL_OUT:-/tmp/agent-eval}"; mkdir -p "$OUT_DIR"
+OUT="$OUT_DIR/run-${LABEL}.jsonl"
+
+MCP_CONFIG=$(cat <<JSON
+{"mcpServers":{"codegraph":{"command":"${CG_BIN}","args":["serve","--mcp","--path","${REPO}"]}}}
+JSON
+)
+
+echo "→ running [$LABEL] in $REPO"
+cd "$REPO" || exit 1
+
+claude -p "$PROMPT" \
+  --output-format stream-json --verbose \
+  --permission-mode bypassPermissions \
+  --model opus \
+  --max-budget-usd 2 \
+  --strict-mcp-config --mcp-config "$MCP_CONFIG" \
+  > "$OUT" 2>"$OUT_DIR/run-${LABEL}.err"
+
+echo "exit: $? | wrote $OUT ($(wc -l < "$OUT") lines)"
+node "$(cd "$(dirname "$0")" && pwd)/parse-run.mjs" "$OUT" 2>/dev/null || true
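
Note on run-agent.sh: the `MCP_CONFIG` heredoc interpolates `$CG_BIN` and `$REPO` straight into a JSON string, so a path containing a double quote or backslash would produce invalid JSON. One way to avoid that is to let a JSON serializer do the escaping. The sketch below shows the idea in JavaScript; `mcpConfig` is a hypothetical helper, not part of this commit, and for ordinary paths it yields the same string as the heredoc.

```javascript
// Hypothetical helper: build the --mcp-config payload with proper JSON escaping.
function mcpConfig(cgBin, repo) {
  return JSON.stringify({
    mcpServers: {
      codegraph: { command: cgBin, args: ['serve', '--mcp', '--path', repo] },
    },
  });
}

console.log(mcpConfig('/usr/local/bin/codegraph', '/tmp/corpus/okhttp'));
```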