# CodeGraph AI offload — accuracy & adoption eval harness Measures the managed **offload** (`codegraph_explore` → reasoning model synthesis) and the **front-load hook** (approach 1) against plain codegraph and no-codegraph, across repo sizes, on **time · main-session tokens/cost · CodeGraph-AI tokens/cost · accuracy**. All agent arms run `claude -p --model sonnet --effort high` (the deliberate floor model — an affordance that lands on Sonnet generalizes up). Everything writes to a scratch dir (`AGENT_EVAL_OUT`, default `/tmp/cg-offload-eval`); nothing here is shipped to users. ## Repos (selected via a memory-probe gate — NOT trained on) Famous repos (express, excalidraw, n8n, …) are useless for *accuracy* evals: Sonnet answers their flow questions from memory, so the no-codegraph baseline is dishonest. These four passed a no-tools probe (Sonnet could not name their real flow internals) and are cloned fresh by `offload-eval-setup.sh`: | tier | repo | ~src files | canonical flow | |---|---|---|---| | small | MTKruto/MTKruto | 322 TS | `sendMessage` → invoke → TL serialize → transport | | medium | mvdicarlo/postybirb-plus | 608 TS | submission → queue → per-website `.post()` | | complex | shapeshift/web | 3.2k TS (35-pkg monorepo) | swap → swapper registry → concrete swapper | | large | trezor/trezor-suite | 8k TS monorepo | send-form → sign thunk → `@trezor/connect` | Verified ground-truth flows (the judge's reference) live in `offload-eval-ground-truth.json`. ## Arms - **offload** — codegraph + managed offload ON (requires `codegraph login`); records AI tokens/credits via `CODEGRAPH_OFFLOAD_USAGE_LOG`. - **raw** — codegraph, `CODEGRAPH_OFFLOAD_DISABLE=1` (returns raw source). - **nocg** — empty MCP config; Read/Grep baseline. - **frontload** — codegraph (offload-disabled) + a `UserPromptSubmit` hook (`offload-eval-hook.mjs`) that runs raw explore on the prompt and injects the result into context (approach 1). ## Run it ```bash npm run build # the harness shells out to dist/ codegraph login # only needed for the offload arm export AGENT_EVAL_OUT=/tmp/cg-offload-eval bash scripts/agent-eval/offload-eval-setup.sh # clone + index the 4 repos bash scripts/agent-eval/offload-eval-matrix.sh # 3 arms × 4 tiers × REPS (default 3) node scripts/agent-eval/offload-eval-judge.mjs \ --results $AGENT_EVAL_OUT/results.jsonl \ --truth scripts/agent-eval/offload-eval-ground-truth.json \ --out $AGENT_EVAL_OUT/judged.jsonl node scripts/agent-eval/offload-eval-summarize.mjs $AGENT_EVAL_OUT/judged.jsonl bash scripts/agent-eval/offload-eval-frontload-matrix.sh # frontload arm + judge + merged summary ``` Single repo: `offload-eval-3arm.sh ""` (or `-frontload.sh`). ## Files - `offload-eval-setup.sh` — clone + index the 4 repos. - `offload-eval-3arm.sh` / `-frontload.sh` — one repo, the arms. - `offload-eval-matrix.sh` / `-frontload-matrix.sh` — drive all 4 tiers. - `offload-eval-hook.mjs` — the front-load `UserPromptSubmit` hook (resolves its own engine; `CG_FRONTLOAD_DEBUG=` to log injections; `CG_FRONTLOAD_BUDGET` to cap injected chars). - `offload-eval-metrics.mjs` — one run's stream-json + usage log → one JSON metrics line. - `offload-eval-judge.mjs` — Sonnet judge: end-to-end (agent final vs ground truth) + per-answer offload fidelity. - `offload-eval-summarize.mjs` — per-tier, per-arm table + cross-repo roll-up. - `offload-eval-ground-truth.json` — source-verified canonical flows. ## Findings (2026-06, n=3 — direction consistent, magnitudes noisy) - **Raw codegraph is the efficiency win** — ~nocg accuracy, fewer reads, faster, no AI cost. - **The offload is the least-accurate arm in all 4 tiers** — synthesized fidelity 12–27/100 with fabrication in 3/4 (e.g. invented website services; traced `ClientPlain`/`SessionPlain` instead of the real encrypted path). Its speed/cost win is narrow (medium-only) and inversely correlated with accuracy. **Use raw until offload fidelity is fixed.** - **The front-load hook SOLVES adoption** — reads → 0–1 in every tier (incl. large, where the agent otherwise read 12–24 files); fired 12/12, 0 errors. Wins on medium/complex (100% pass). But it **regresses small/large to partial** — it suppresses the reads that compensate for explore's gaps at **dynamic boundaries** (async queues, redux thunks, facade/factory indirection). - **Master lever for BOTH:** explore's dynamic-dispatch coverage. Fix it → front-load is complete everywhere and the offload has the full flow to synthesize.