# Running the agent-behavior test (how agents actually use codegraph)

This explains how to measure **how a Claude Code agent uses the codegraph MCP
tools** on a real repo — which tools it calls (does it lead with
`codegraph_explore`?), how many follow-up `Read`/`Grep`s it does, and the token
cost. Use it when changing tool guidance (`server-instructions.ts`,
`instructions-template.ts`, tool descriptions) or retrieval, to verify the
change actually shifts agent behavior.

Scripts live in `scripts/agent-eval/`.

## Why two harnesses (read this first)

| | Interactive (`itrun.sh`) | Headless (`run-agent.sh`) |
|---|---|---|
| Drives | the real TUI via tmux | `claude -p` print mode |
| Subagent it picks | **Explore** (matches real UX) | general-purpose (diverges) |
| Metrics | tool breakdown (from session logs) + `Done(…)` token summary | exact per-tool calls + tokens/cost (stream-json) |
| Cost | Claude Max subscription | API $ (`total_cost_usd`) |

**Headless `claude -p` does NOT reproduce what users see** — it silently picks
the general-purpose subagent, while interactive sessions delegate to the
read-first **Explore** subagent. So for "what does my session actually do," use
the interactive harness. For a clean per-tool/token breakdown in one shot, use
headless (and ask for the Explore subagent in the prompt if you want that path).

## Prerequisites

- **tmux 3.0+**
- A logged-in `claude` CLI (Claude Max or API).
- codegraph configured as an MCP server (`claude mcp list` shows `codegraph`).
  The interactive harness uses your global config, so it runs whatever
  `codegraph` resolves to — point that at your dev build (`npm link` / the
  symlinked global) to test local changes.
- A target repo, cloned and indexed:
  ```bash
  git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp
  cd /tmp/corpus/okhttp && codegraph init -i
  ```
  Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600),
  OkHttp (~640), VS Code (~10k).

## Interactive test (the faithful one)

```bash
scripts/agent-eval/itrun.sh <repo-path> <label> "<question>"
```

Example:
```bash
scripts/agent-eval/itrun.sh /tmp/corpus/vscode vscode \
  "How does the extension host communicate with the main process?"
```

It opens `claude` in a tmux session, types the question, waits for the agent to
finish, then prints:
- the `Done (N tool uses · Xk tokens · Ym)` subagent summary (from the pane),
- the `Context Xk/1.0M` main-session size,
- a **tool breakdown** parsed from the session logs (main + subagents), ending
  in a `VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N` line.

### Startup robustness (so unattended runs don't silently no-op)

Two things bite an unattended driver before the prompt even runs:
- **The `❯` glyph is drawn ~6s before the input accepts keystrokes.** Waiting
  for `❯` is necessary but not sufficient. The harness sends the prompt, then
  **verifies a chunk of it actually landed in the input box**, retrying until it
  does — so it can't type into a not-yet-live input and submit nothing.
- **First time claude opens a repo it shows "Is this a project you trust?"**
  (which also contains `❯`). The harness detects that dialog and presses Enter
  to accept it before typing.

If the prompt never lands or work never starts, the harness now **fails loudly**
(non-zero exit) instead of capturing an empty pane and reporting a bogus run.

### How completion is detected (the tricky part)

Claude's TUI redraws in place, so you can't just wait for output to stop. The
harness polls `tmux capture-pane` and treats the pane as **busy** when it shows
the spinner's elapsed-time-in-parens — `(8s · …)` / `(1m 3s · …)`, matched by
`\(([0-9]+m )?[0-9]+s ·`. That's the *universal* working signal: it shows during
the pre-stream **thinking** phase (`(8s · thinking with max effort)`, which has
no token arrow yet) *and* during streaming. The `↓ N`/`↑ N` token arrow,
`esc to interrupt`, and `Initializing…` are OR'd in as belt-and-braces (some TUI
versions show one but not the others). It declares **idle** when the `❯` prompt
is present and not busy for 10 consecutive polls (~5s, long enough to ride out
mid-conversation thinking gaps that briefly drop the spinner). (Technique
adapted from devpit's `WaitForIdle`.)

### Where the breakdown comes from

`parse-session.mjs` reads the newest session log under
`~/.claude/projects/<escaped-cwd>/<session>.jsonl` and its subagent transcripts
under `<session>/subagents/*.jsonl`. The **subagent** file is where the real
tool calls are — the main log only shows the `Agent` delegation. You can run it
standalone:
```bash
node scripts/agent-eval/parse-session.mjs /tmp/corpus/vscode
```

## Headless test (clean tokens, forceable Explore path)

```bash
scripts/agent-eval/run-agent.sh <repo-path> <label> "<question>"
```
Writes stream-json and prints the tool sequence + exact tokens/cost. To
reproduce the Explore-subagent path headlessly, ask for it:
`"Use an Explore subagent to investigate, then answer: …"`.

## Running a sweep

Single runs vary a lot (the VS Code question has ranged 26–37 tool uses /
88–105k tokens across runs). For a real signal, run N≥3 and take the median:
```bash
for i in 1 2 3; do
  scripts/agent-eval/itrun.sh /tmp/corpus/vscode "vscode-$i" "<question>"
done
```

## What "good" looks like

After the explore-first guidance (PR #191), an understanding question should
show the agent **leading with `codegraph_explore`** and using `search`/`node`
to fill gaps — not a wall of `Read`/`Grep`. Example faithful run:
`VERDICT: codegraph_explore used 3x | Read 8 | Grep/Bash 1`. If `explore` is 0
and `Read`/`Grep` dominate, the guidance regressed.

## Output artifacts

Transcripts and logs go to `$AGENT_EVAL_OUT` (default `/tmp/agent-eval/`):
`itrun-<label>.txt` (pane capture), `run-<label>.jsonl` (headless stream-json).