2 tygodni temu · 0e2789ab71
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -137,7 +137,7 @@ For each **language × framework**, validate on **small, medium, and large** rea
 
				 1. **Pick the canonical flow** for the framework ("how does X reach Y": state→render, request→handler→view, query→SQL, action→reducer→store…).
			
 
				 2. **Deterministic probes** (`scripts/agent-eval/probe-{node,explore}.mjs` against the built `dist/`): `codegraph_explore` with the flow's symbol names connects from→to end-to-end with no break (its Flow section shows the path); **no node explosion** (`select count(*) from nodes` stable before/after re-index); synthesized-edge **precision** spot-check (`select … where provenance='heuristic'`).
			
 
				 3. **Agent A/B** (`scripts/agent-eval/run-all.sh <repo> "<Q>"`): with vs without codegraph, **≥2 runs/arm** (run-to-run variance is large — never conclude from n=1). Record **duration, total tool calls, Read, Grep**. Optional forced-Read-0 sufficiency proof via the block-read hook (`scripts/agent-eval/hook-settings.json`).
			
 
				-   - **Run agent-evals in a REAL terminal — NEVER nested inside a Claude Code session** (don't spawn `claude -p` from a Bash tool call). The codegraph MCP server is healthy (full `initialize`→`tools/list` handshake ~165ms, daemon and in-process modes both fine), but a nested `claude -p` marks it `status:"pending"` / 0 tools under CPU/timing contention and the agent silently runs with no codegraph — it can connect early in a session, then degrade to consistent failure as nested spawns pile up. `CODEGRAPH_NO_DAEMON=1` and `< /dev/null` do NOT fix it (it's the nested client, not the server). Confirm via `parse-run.mjs` (`codegraph tools exposed: 0` = void run). To isolate a change — **new-build vs baseline-build, both codegraph-on** (vs run-all.sh's with-vs-without) — use `scripts/agent-eval/ab-new-vs-baseline.sh <indexed-repo> "<task>" [baseline-ref]`.
			
 
				+   - **MCP attach is a startup-latency issue, not a hard block.** On a multi-step task the agent dives into Read/grep before codegraph finishes its ~2-3s startup (worse when the eval is itself run nested inside a Claude session, under CPU contention), so it runs with no codegraph. Fix: **pre-warm a persistent daemon** for the target (`CODEGRAPH_DAEMON_IDLE_TIMEOUT_MS` high; spawn `serve --mcp --path <target> </dev/null &`; wait for `.codegraph/daemon.sock`) **and skip the startup re-exec** (`CODEGRAPH_WASM_RELAUNCHED=1`) so claude connects before the agent's first turn. Don't trust claude's `init` snapshot — it can read `status:"pending"` / 0 tools even when it then connects; judge by actual codegraph usage in `parse-run.mjs`'s `by type`. To isolate a change — **new-build vs baseline-build, both codegraph-on** (vs run-all.sh's with-vs-without) — use `scripts/agent-eval/ab-new-vs-baseline.sh <indexed-repo> "<task>" [baseline-ref]` (it bakes in the pre-warm).
			
 
				 4. **Pass bar:** a normal flow question reaches **~0 Read/Grep within the repo's explore-call budget**, runs **faster** than without-codegraph, and shows **no regression on a control repo**. Record the numbers in `docs/design/dynamic-dispatch-coverage-playbook.md` (the coverage matrix).
			
 
				 
			
 
				 Full playbook + per-mechanism design: `docs/design/dynamic-dispatch-coverage-playbook.md` and `docs/design/callback-edge-synthesis.md`.
			
--- a/scripts/agent-eval/ab-new-vs-baseline.sh
+++ b/scripts/agent-eval/ab-new-vs-baseline.sh
@@ -2,18 +2,20 @@
 
				 # A/B a codegraph retrieval/steering change: the NEW build (current HEAD) vs a
			
 
				 # BASELINE build (a git ref) — BOTH with codegraph attached — on the same
			
 
				 # implementation task, measuring how many Read vs codegraph calls the agent
			
 
				-# makes. This ISOLATES the change (unlike run-all.sh, which is with-vs-without
			
 
				-# codegraph). The agent works on a throwaway copy of the target, so its edits
			
 
				-# never touch your repos.
			
 
				+# makes. ISOLATES the change (unlike run-all.sh's with-vs-without). The agent
			
 
				+# works on a throwaway copy of the target, so your repos are never touched.
			
 
				 #
			
 
				-# *** RUN THIS IN A REAL TERMINAL — NOT nested inside a Claude Code session. ***
			
 
				-# A `claude -p` spawned from within another Claude session (e.g. from a Bash
			
 
				-# tool call) cannot reliably attach the codegraph MCP server: the server is
			
 
				-# healthy (full handshake ~165ms) but the nested client marks it
			
 
				-# status:"pending" / 0 tools under CPU/timing contention, and degrades to
			
 
				-# consistent failure over a long session. NO_DAEMON + `< /dev/null` do NOT fix
			
 
				-# it — it's the nested client, not the server. See codegraph/CLAUDE.md
			
 
				-# ("Running agent-evals — do NOT nest").
			
 
				+# Reliable attach (works even when this is itself run nested inside a Claude
			
 
				+# session): each arm PRE-WARMS a persistent codegraph daemon for its target so
			
 
				+# claude connects to an already-bound, index-loaded daemon instantly — before
			
 
				+# the agent's first turn — and SKIPS codegraph's startup re-exec via
			
 
				+# CODEGRAPH_WASM_RELAUNCHED=1. Without this, on a multi-step task the agent
			
 
				+# dives into Read/grep before codegraph finishes its ~2-3s startup (worse under
			
 
				+# the CPU contention of a nested run) and runs with NO codegraph.
			
 
				+#
			
 
				+# Gotcha: claude's `system/init` snapshot can read status:"pending" / 0 tools
			
 
				+# even when the server then connects fine — judge by ACTUAL codegraph usage in
			
 
				+# parse-run.mjs's "by type", not the init line.
			
 
				 #
			
 
				 # Usage: ab-new-vs-baseline.sh <indexed-repo> "<task>" [baseline-ref]
			
 
				 #   <indexed-repo>  a repo with a .codegraph index (copied per arm)
			
@@ -38,9 +40,13 @@ fi
 
				 CHANGED=$(git -C "$ENGINE" diff --name-only "$BASE_REF" HEAD -- src 2>/dev/null)
			
 
				 [ -n "$CHANGED" ] || { echo "no src/ changes between $BASE_REF and HEAD — nothing to A/B"; exit 1; }
			
 
				 
			
 
				-# Always restore the engine to HEAD on exit, even if interrupted mid-arm.
			
 
				-restore() { git -C "$ENGINE" checkout HEAD -- $CHANGED 2>/dev/null; ( cd "$ENGINE" && npm run build >/dev/null 2>&1 ); }
			
 
				-trap restore EXIT
			
 
				+# On exit: kill any eval daemons + restore the engine to HEAD.
			
 
				+cleanup() {
			
 
				+  pkill -9 -f "serve --mcp --path $OUT/" 2>/dev/null
			
 
				+  git -C "$ENGINE" checkout HEAD -- $CHANGED 2>/dev/null
			
 
				+  ( cd "$ENGINE" && npm run build >/dev/null 2>&1 )
			
 
				+}
			
 
				+trap cleanup EXIT
			
 
				 
			
 
				 mkdir -p "$OUT"
			
 
				 echo "###### engine=$ENGINE  baseline=$BASE_REF"
			
@@ -54,17 +60,25 @@ rm -rf "$OUT/t-new" "$OUT/t-base"
 
				 rsync -a --exclude node_modules --exclude .git --exclude dist --exclude .codegraph "$TARGET/" "$OUT/t-new/"
			
 
				 cp -R "$OUT/t-new" "$OUT/t-base"
			
 
				 
			
 
				-cfg() { printf '{"mcpServers":{"codegraph":{"command":"%s","args":["serve","--mcp","--path","%s"]}}}' "$BIN" "$1" > "$2"; }
			
 
				+prewarm() { # target — spawn a persistent daemon (current $BIN) and wait for its socket
			
 
				+  pkill -9 -f "serve --mcp --path $1" 2>/dev/null
			
 
				+  CODEGRAPH_DAEMON_IDLE_TIMEOUT_MS=1800000 node "$BIN" serve --mcp --path "$1" </dev/null >/dev/null 2>&1 &
			
 
				+  node -e 'const fs=require("fs");let n=0;const t=setInterval(()=>{if(fs.existsSync(process.argv[1]+"/.codegraph/daemon.sock")){clearInterval(t);process.exit(0)}if(n++>150){clearInterval(t);process.exit(1)}},100)' "$1" \
			
 
				+    && echo "  daemon warm: $1" || echo "  WARN: daemon never bound for $1 (arm may run without codegraph)"
			
 
				+}
			
 
				 
			
 
				 run_arm() { # label, target-copy
			
 
				   local label="$1" tgt="$2" c="$OUT/mcp-$1.json"
			
 
				-  cfg "$tgt" "$c"
			
 
				+  # Connect to the pre-warmed daemon; skip the startup re-exec for a fast attach.
			
 
				+  printf '{"mcpServers":{"codegraph":{"command":"env","args":["CODEGRAPH_WASM_RELAUNCHED=1","node","%s","serve","--mcp","--path","%s"]}}}' "$BIN" "$tgt" > "$c"
			
 
				+  prewarm "$tgt"
			
 
				   echo "############## ARM [$label] ##############"
			
 
				   ( cd "$tgt" && claude -p "$TASK" \
			
 
				       --output-format stream-json --verbose --permission-mode bypassPermissions \
			
 
				       --model opus --max-budget-usd 4 --strict-mcp-config --mcp-config "$c" \
			
 
				-      < /dev/null > "$OUT/run-$label.jsonl" 2>"$OUT/run-$label.err" )
			
 
				-  node "$PARSE" "$OUT/run-$label.jsonl" 2>&1 | grep -E "tools exposed|by type|Result" || echo "  (parse failed — see $OUT/run-$label.jsonl)"
			
 
				+      </dev/null > "$OUT/run-$label.jsonl" 2>"$OUT/run-$label.err" )
			
 
				+  node "$PARSE" "$OUT/run-$label.jsonl" 2>&1 | grep -E "by type|Result" || echo "  (parse failed — see $OUT/run-$label.jsonl)"
			
 
				+  pkill -9 -f "serve --mcp --path $tgt" 2>/dev/null
			
 
				   echo
			
 
				 }