# Running the agent-behavior test (how agents actually use codegraph)

This explains how to measure **how a Claude Code agent uses the codegraph MCP tools** on a real repo: which tools it calls (does it lead with `codegraph_explore`?), how many follow-up `Read`/`Grep`s it does, and the token cost. Use it when changing tool guidance (`server-instructions.ts`, `instructions-template.ts`, tool descriptions) or retrieval, to verify the change actually shifts agent behavior.

Scripts live in `scripts/agent-eval/`.

## Why two harnesses (read this first)

| | Interactive (`itrun.sh`) | Headless (`run-agent.sh`) |
|---|---|---|
| Drives | the real TUI via tmux | `claude -p` print mode |
| Subagent it picks | **Explore** (matches real UX) | general-purpose (diverges) |
| Metrics | tool breakdown (from session logs) + `Done(…)` token summary | exact per-tool calls + tokens/cost (stream-json) |
| Cost | Claude Max subscription | API $ (`total_cost_usd`) |

**Headless `claude -p` does NOT reproduce what users see**: it silently picks the general-purpose subagent, while interactive sessions delegate to the read-first **Explore** subagent. So for "what does my session actually do," use the interactive harness. For a clean per-tool/token breakdown in one shot, use headless (and ask for the Explore subagent in the prompt if you want that path).

## Prerequisites

- **tmux 3.0+**
- A logged-in `claude` CLI (Claude Max or API).
- codegraph configured as an MCP server (`claude mcp list` shows `codegraph`). The interactive harness uses your global config, so it runs whatever `codegraph` resolves to. Point that at your dev build (`npm link` / the symlinked global) to test local changes.
- A target repo, cloned and indexed:
  ```bash
  git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp
  cd /tmp/corpus/okhttp && codegraph init -i
  ```

Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600), OkHttp (~640), VS Code (~10k).
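To prepare all four sweep repos in one pass, the clone-and-index step can be looped. The following is a minimal sketch, not part of `scripts/agent-eval/`: the `CORPUS_DIR` and `DRY_RUN` variables and the `run`/`prepare_repo` helpers are hypothetical names, and the repository URLs other than OkHttp's are the projects' public GitHub locations rather than anything this guide specifies. It defaults to a dry run that only prints the commands. Whether re-running `codegraph init -i` on an already indexed checkout is harmless is not confirmed here.

```bash
#!/usr/bin/env bash
# Sketch: clone and index the sweep repos under /tmp/corpus.
# CORPUS_DIR, DRY_RUN, run and prepare_repo are illustrative names only.
set -euo pipefail

CORPUS_DIR="${CORPUS_DIR:-/tmp/corpus}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = actually run them

# name|url pairs, smallest repo first (URLs assumed, except okhttp).
REPOS=(
  "alamofire|https://github.com/Alamofire/Alamofire"
  "excalidraw|https://github.com/excalidraw/excalidraw"
  "okhttp|https://github.com/square/okhttp"
  "vscode|https://github.com/microsoft/vscode"
)

# Print a command, and execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Clone one repo (if missing) and index it with codegraph.
prepare_repo() {
  local name="$1" url="$2"
  local dest="$CORPUS_DIR/$name"
  # Skip the clone when a checkout already exists.
  if [ ! -d "$dest/.git" ]; then
    run git clone --depth 1 "$url" "$dest"
  fi
  # Same indexing command as the single-repo example, run inside the repo.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ (cd $dest && codegraph init -i)"
  else
    (cd "$dest" && codegraph init -i)
  fi
}

for entry in "${REPOS[@]}"; do
  # Split "name|url" on the first "|".
  prepare_repo "${entry%%|*}" "${entry#*|}"
done
```

Run it once as-is to review the printed commands, then again with `DRY_RUN=0` to clone and index for real. A shallow clone of VS Code is still large, so expect that step to dominate the setup time.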
## Interactive test (the faithful one)

```bash
scripts/agent-eval/itrun.sh