name: agent-eval

description: Benchmark CodeGraph retrieval quality on a real codebase by comparing agent behavior with vs without CodeGraph. Use when the user runs /agent-eval or asks to test, benchmark, audit, or validate a codegraph version (the local dev build or a published npm version) against a language's repo.

CodeGraph Quality Audit

Measures how much CodeGraph helps an agent versus plain grep/read, for a chosen codegraph version on a chosen real-world repo. Drives the harness in scripts/agent-eval/.

Prerequisites

  • tmux 3+, a logged-in claude CLI, node, git (macOS/Linux).
  • Run from the codegraph repo root.

Workflow

Copy this checklist:

- [ ] 1. Pick version (local or npm)
- [ ] 2. Pick language
- [ ] 3. Pick repo by size
- [ ] 4. Pick harness (headless / tmux / both)
- [ ] 5. Run audit.sh in the background
- [ ] 6. Report results

Step 1 — version. Ask with AskUserQuestion: which codegraph version to test. Offer "Local dev build" and "Latest published"; the free-text "Other" lets the user type a specific version (e.g. 0.7.10). Map the answer to a VERSION token:

  • "Local dev build" → local
  • "Latest published" → latest
  • a typed version → that string (e.g. 0.7.10)
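
The mapping above can be sketched as a small lookup with a pass-through for typed versions. The helper name `toVersionToken` is hypothetical and not part of the harness; only the labels and tokens come from this step.

```javascript
// Hypothetical helper (not part of the harness): maps the AskUserQuestion
// answer from Step 1 to the VERSION token passed to audit.sh.
const VERSION_TOKENS = new Map([
  ["Local dev build", "local"],
  ["Latest published", "latest"],
]);

function toVersionToken(answer) {
  const trimmed = answer.trim();
  // A free-text "Other" answer (e.g. "0.7.10") is passed through unchanged.
  return VERSION_TOKENS.get(trimmed) ?? trimmed;
}
```

For example, `toVersionToken("Latest published")` yields `latest`, while `toVersionToken("0.7.10")` yields `0.7.10`.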

Step 2 — language. Read .claude/skills/agent-eval/corpus.json. Ask with AskUserQuestion which language to test, listing the languages that have entries.

Step 3 — repo. From the chosen language's entries, ask which repo. Label each option with its size and file count, e.g. excalidraw — Medium (~600 files). Each entry carries the repo URL and a representative question.
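
As a rough sketch of how the option labels might be built, the function below formats each entry the way the example label is written. It assumes corpus.json is an object keyed by language whose values are arrays of entries; that layout, the function name `repoOptions`, and every sample value are assumptions for illustration, not the real corpus.

```javascript
// Hypothetical sketch: build one AskUserQuestion option per corpus entry.
// Field names (name, repo, size, files, question) follow the Notes section;
// the "language -> array of entries" layout is an assumption.
function repoOptions(corpus, language) {
  const entries = corpus[language] ?? [];
  return entries.map((entry) => ({
    // \u2014 is the dash used in the example label in Step 3.
    label: `${entry.name} \u2014 ${entry.size} (~${entry.files} files)`,
    repoUrl: entry.repo,
    question: entry.question,
  }));
}

// Placeholder data for illustration only; not taken from the real corpus.json.
const sampleCorpus = {
  typescript: [
    {
      name: "excalidraw",
      repo: "https://example.com/placeholder.git",
      size: "Medium",
      files: 600,
      question: "<representative question>",
    },
  ],
};

const options = repoOptions(sampleCorpus, "typescript");
```

Returning the repo URL and question alongside the label keeps everything Step 5 needs in one place once the user picks an option.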

Step 4 — harness. Ask with AskUserQuestion which harness to run, and map the answer to a MODE token:

  • "Headless" → headless: runs claude -p with stream-json, giving exact tokens/cost and a clean tool sequence (2 runs, fast, no TTY).
  • "Interactive (tmux)" → tmux: drives the real Claude TUI in tmux, giving faithful Explore-subagent behavior, with metrics from session logs (2 runs, slower).
  • "Both" → all: headless + interactive (4 runs).
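
A minimal sketch of the harness mapping, including the run count each choice implies. The names `HARNESS_MODES` and `toModeToken` are hypothetical; the labels, tokens, and run counts are the ones listed above.

```javascript
// Hypothetical helper (not part of the harness): maps the Step 4 answer to
// the MODE token and the number of runs that choice implies.
const HARNESS_MODES = new Map([
  ["Headless", { mode: "headless", runs: 2 }],
  ["Interactive (tmux)", { mode: "tmux", runs: 2 }],
  ["Both", { mode: "all", runs: 4 }],
]);

function toModeToken(answer) {
  const entry = HARNESS_MODES.get(answer.trim());
  // There is no free-text option for the harness, so an unknown answer is an error.
  if (!entry) throw new Error(`Unknown harness choice: ${answer}`);
  return entry;
}
```

Failing on an unrecognized answer, rather than passing it through, avoids starting a multi-minute run with a mode the script may not understand.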

Step 5 — run. Launch in the background (sets the version, clones if missing, wipes + re-indexes, runs the chosen arms — several minutes):

scripts/agent-eval/audit.sh <VERSION> <repo-name> <repo-url> "<question>" <MODE>
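
One way to assemble that invocation from the earlier answers is sketched below. `buildAuditArgs` is a hypothetical helper; only the script path and the argument order come from the command line above.

```javascript
// Hypothetical helper: assemble the audit.sh argument vector in the order
// <VERSION> <repo-name> <repo-url> "<question>" <MODE>.
function buildAuditArgs({ version, repoName, repoUrl, question, mode }) {
  const args = [version, repoName, repoUrl, question, mode];
  const missing = args.findIndex((arg) => typeof arg !== "string" || arg.length === 0);
  if (missing !== -1) {
    throw new Error(`audit.sh argument ${missing + 1} is missing or empty`);
  }
  // Keeping the question as a single array element means it needs no shell quoting.
  return ["scripts/agent-eval/audit.sh", ...args];
}
```

Passing the result as an argument vector (for instance to Node's `child_process.spawn` with `detached: true`) is one option for running the job in the background without quoting the question by hand; when typing the command into a shell directly, the quotes around the question are still required.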

Step 6 — report. When the job finishes, read the log and report per arm:

  • Headless (parse-run.mjs): total tool calls, file Reads, Grep/Bash, codegraph-tool calls, duration, total cost.
  • Interactive (parse-session.mjs): the VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N and TOKENS: lines.

Lead with cost and tool/Read counts, which are the reliable signals; raw input/output token counts are confounded by subagent delegation and prompt caching. State whether codegraph reduced effort and whether both arms reached a correct answer.
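
For the interactive arm, the counts can be pulled out of the VERDICT line with a small parser like the sketch below. It assumes each N in the template is a plain integer and that the line otherwise matches the template exactly; `parseVerdict` is a hypothetical name, and the actual output of parse-session.mjs may differ.

```javascript
// Assumed line shape, taken from the template in Step 6:
//   VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N
const VERDICT_RE =
  /VERDICT: codegraph_explore used (\d+)x \| Read (\d+) \| Grep\/Bash (\d+)/;

// Hypothetical helper: returns the three counts, or null if no VERDICT line is found.
function parseVerdict(logText) {
  const match = VERDICT_RE.exec(logText);
  if (!match) return null;
  // match[0] is the whole line; the three capture groups follow it.
  const [, explore, read, grepBash] = match.map(Number);
  return { explore, read, grepBash };
}
```

Returning `null` when no line matches makes a missing or reformatted VERDICT visible in the report instead of silently reading as zero tool calls.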

Notes

  • The index is rebuilt every run (audit.sh wipes .codegraph) — different versions extract differently, so an index must be served by the same binary that built it.
  • audit.sh temporarily mutates the global codegraph install for the test, then restores your dev link via local-install.sh.
  • Corpus repos are cloned to /tmp/codegraph-corpus (reused if already present).
  • Add or edit repos in corpus.json (fields: name, repo, size, files, question).
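
When adding an entry, a quick completeness check along these lines may help catch a missing field before a run. It is only a sketch: `missingCorpusFields` is a hypothetical helper, and it checks presence only, not field types, since those are not specified here.

```javascript
// Field names listed in the Notes section.
const REQUIRED_FIELDS = ["name", "repo", "size", "files", "question"];

// Hypothetical helper: returns the required fields that are absent or empty.
function missingCorpusFields(entry) {
  return REQUIRED_FIELDS.filter(
    (field) => entry[field] === undefined || entry[field] === ""
  );
}
```

An empty result means the entry has all five fields; for example, an entry with only `name` and `repo` set reports `size`, `files`, and `question` as missing.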