Measures the managed offload (`codegraph_explore` → reasoning model synthesis) and the front-load hook (approach 1) against plain codegraph and no-codegraph, across repo sizes, on four axes: time · main-session tokens/cost · CodeGraph-AI tokens/cost · accuracy.

All agent arms run `claude -p --model sonnet --effort high`. Sonnet is the deliberate floor model: an affordance that lands on Sonnet generalizes up. Everything writes to a scratch dir (`AGENT_EVAL_OUT`, default `/tmp/cg-offload-eval`); nothing here is shipped to users.

Famous repos (express, excalidraw, n8n, …) are useless for accuracy evals: Sonnet answers their flow questions from memory, so the no-codegraph baseline would measure recall rather than exploration. These four passed a no-tools probe (Sonnet could not name their real flow internals) and are cloned fresh by `offload-eval-setup.sh`:

| tier | repo | ~src files | canonical flow |
|---|---|---|---|
| small | MTKruto/MTKruto | 322 TS | sendMessage → invoke → TL serialize → transport |
| medium | mvdicarlo/postybirb-plus | 608 TS | submission → queue → per-website .post() |
| complex | shapeshift/web | 3.2k TS (35-pkg monorepo) | swap → swapper registry → concrete swapper |
| large | trezor/trezor-suite | 8k TS monorepo | send-form → sign thunk → @trezor/connect |

Verified ground-truth flows (the judge's reference) live in `offload-eval-ground-truth.json`.

- offload (needs `codegraph login`); records AI tokens/credits via `CODEGRAPH_OFFLOAD_USAGE_LOG`.
- raw: `CODEGRAPH_OFFLOAD_DISABLE=1` (returns raw source).
- frontload: `UserPromptSubmit` hook (`offload-eval-hook.mjs`) that runs raw explore on the prompt and injects the result into context (approach 1).

```bash
npm run build                                      # the harness shells out to dist/
codegraph login                                    # only needed for the offload arm
export AGENT_EVAL_OUT=/tmp/cg-offload-eval
bash scripts/agent-eval/offload-eval-setup.sh      # clone + index the 4 repos
bash scripts/agent-eval/offload-eval-matrix.sh     # 3 arms × 4 tiers × REPS (default 3)
node scripts/agent-eval/offload-eval-judge.mjs \
  --results "$AGENT_EVAL_OUT/results.jsonl" \
  --truth scripts/agent-eval/offload-eval-ground-truth.json \
  --out "$AGENT_EVAL_OUT/judged.jsonl"
node scripts/agent-eval/offload-eval-summarize.mjs "$AGENT_EVAL_OUT/judged.jsonl"
bash scripts/agent-eval/offload-eval-frontload-matrix.sh  # frontload arm + judge + merged summary
```
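
The summarize step turns `judged.jsonl` into a per-tier, per-arm table. As a rough illustration of that kind of roll-up (not the actual `offload-eval-summarize.mjs`), the sketch below groups JSONL records and averages them. The record fields `tier`, `arm`, `seconds`, and `correct` are hypothetical; the real schema produced by the metrics and judge scripts may differ.

```javascript
// Sketch only: the record shape { tier, arm, seconds, correct } is an assumption,
// not the documented output of offload-eval-judge.mjs.
function summarize(jsonlText) {
  const groups = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const rec = JSON.parse(line);
    const key = `${rec.tier}/${rec.arm}`;
    const g = groups.get(key) ?? {
      tier: rec.tier,
      arm: rec.arm,
      runs: 0,
      seconds: 0,
      correct: 0,
    };
    g.runs += 1;
    g.seconds += rec.seconds;
    g.correct += rec.correct ? 1 : 0;
    groups.set(key, g);
  }
  // One row per (tier, arm) pair, in first-seen order.
  return [...groups.values()].map((g) => ({
    tier: g.tier,
    arm: g.arm,
    runs: g.runs,
    meanSeconds: g.seconds / g.runs,
    accuracy: g.correct / g.runs,
  }));
}
```

Grouping by a combined `tier/arm` key keeps each cell of the table independent, so a cross-repo roll-up could reuse the same function with the key reduced to `arm` alone.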
Single repo: `offload-eval-3arm.sh <indexed-repo> <tier> <reps> "<question>"` (or `-frontload.sh`).

- `offload-eval-setup.sh`: clone + index the 4 repos.
- `offload-eval-3arm.sh` / `-frontload.sh`: one repo, the arms.
- `offload-eval-matrix.sh` / `-frontload-matrix.sh`: drive all 4 tiers.
- `offload-eval-hook.mjs`: the front-load `UserPromptSubmit` hook (resolves its own engine; `CG_FRONTLOAD_DEBUG=<path>` to log injections; `CG_FRONTLOAD_BUDGET` to cap injected chars).
- `offload-eval-metrics.mjs`: one run's stream-json + usage log → one JSON metrics line.
- `offload-eval-judge.mjs`: Sonnet judge: end-to-end (agent final vs ground truth) + per-answer offload fidelity.
- `offload-eval-summarize.mjs`: per-tier, per-arm table + cross-repo roll-up.
- `offload-eval-ground-truth.json`: source-verified canonical flows.

… `ClientPlain`/`SessionPlain` instead of the real encrypted path). Its speed/cost win is narrow (medium-only) and inversely correlated with accuracy. Use raw until offload fidelity is fixed.
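
For the front-load hook, `CG_FRONTLOAD_BUDGET` caps how many characters are injected into context. One way such a cap could work is sketched below. The helper names `readBudget` and `capInjection`, the "no cap when unset" default, and the `[truncated]` marker are all assumptions for illustration; `offload-eval-hook.mjs` may behave differently.

```javascript
// Read the character budget from an environment-like object.
// Assumption: an unset or invalid value means "no cap"; the real hook may use another default.
function readBudget(env) {
  const n = Number.parseInt(env.CG_FRONTLOAD_BUDGET ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : Infinity;
}

// Truncate text so the result never exceeds `budget` characters.
// A marker (hypothetical) is appended when there is room for it.
function capInjection(text, budget) {
  if (text.length <= budget) return text;
  const marker = "\n[truncated]";
  if (budget <= marker.length) return text.slice(0, budget);
  return text.slice(0, budget - marker.length) + marker;
}
```

A hook would then emit something like `capInjection(exploreOutput, readBudget(process.env))`, where `exploreOutput` stands for whatever the raw explore call returned.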