trace-relevance-coldstart-2026-05-30.md 12 KB


name: trace-relevance-coldstart-2026-05-30 date: 2026-05-30 23:30 project: codegraph branch: feat/trace-relevance-closure-collection

summary: Turned Alamofire (README's weakest repo) into a clean win via a trace endpoint-disambiguation fix + god-file explore rendering, then eliminated the MCP cold-start race that was causing benchmark inconsistency (handshake ~811ms→~90ms); PR #580 has 6 commits, all that's left is a clean README sweep + squash-merge.

Handoff: trace-relevance + closure-collection + cold-start (PR #580)

Resume here — read this first

Current state: PR #580 (branch feat/trace-relevance-closure-collection, 6 commits, pushed, in sync with remote) is feature-complete and validated — full suite 1090 pass (only the 5 pre-existing npm-shim network fails), 28/28 MCP+daemon tests. The MCP cold-start race (the dominant benchmark-inconsistency source) is ELIMINATED via the proxy-local-handshake (tool registration ~90ms cold+warm, was ~811ms). The README benchmark table still shows the OLD pre-fix numbers. Immediate next step: Run a median-of-4 README sweep on this build (the race is gone, so numbers should be naturally consistent), update the README table/averages/headline, then squash-merge PR #580.

Suggested next message: "Run RUNS=4 bash scripts/agent-eval/bench-readme.sh on this build, parse with node scripts/agent-eval/parse-bench-readme.mjs /tmp/ab-readme (race-aware), update the README benchmark table + averages + the 7 per-repo detail tables + methodology date, then squash-merge PR #580 with gh pr merge 580 --squash --admin."

Goal

Started as "Alamofire is the README's weakest benchmark repo (13% fewer tool calls vs the ~62% average) — fix it." Became: make CodeGraph's retrieval consistent and faster. Definition of done = PR #580 merged (trace fix + dynamic-dispatch coverage + god-file rendering + cold-start elimination), README refreshed with stable median-of-4 numbers. Optimization target per CLAUDE.md is tool-calls/reads + latency, NOT raw cost.

Key findings

The 6 commits on the branch (oldest→newest):

  • e86d573 Trace endpoint relevance (THE Alamofire win) + closure-collection synthesizer + explore synth-links.
  • c64c4b3 God-file multi-phase explore rendering (6 sub-layers).
  • 5d7388c Skeleton/focused tag steers to codegraph_explore, not Read (spiral fix #1).
  • dc19eab Bench parser race-aware (excludes "No such tool available" runs).
  • 91e28df serve --mcp cold-start ~811ms→~600ms (defer CodeGraph load + 25ms poll).
  • 82ae484 Proxy-local-handshake — handshake ~600ms→~90ms, cold-start race eliminated.

Root-causes found by reading A/B TRANSCRIPTS (not the noisy median):

  • Trace bug: handleTrace's scorePair ranked only by shared-dir-prefix, so overloaded names (request=44 defs, task=8) resolved to empty EventMonitor.request(){} / RedirectHandler.task STUBS over the real Session.request → agent saw garbage, said "the trace collided with same-named symbols", read by hand. Fix: nodeRelevance term in handleTrace (penalize ≤1-line stubs −40, test files −150). Result n=8: WITH tools 12→8 median, read variance 0–12→1–4 (the meltdowns WERE the trace-collision flounder). General bug (Swift/Java/C#/Go protocol-stub flooding).
  • Closure-collection synthesizer (src/resolution/callback-synthesizer.ts closureCollectionEdges): Swift validators.write{$0.append}didCompleteTask validators.forEach{$0()}. The element-invoke $0(/it( is the precision gate → 9 edges on Alamofire, 0 on every non-Swift control. Surfaced inline in trace + a "Dynamic-dispatch links" section in buildFlowFromNamedSymbols (so it shows when the agent named only validate, not didCompleteTask).
  • God-file rendering (handleExplore in src/mcp/tools.ts, 6 layers): (1) on-spine god-files render spine-full + off-path methods as signatures (true-spine); (2) named-seed gather — inject each named token's substantive def into the subgraph (FTS buried validate → Validation.swift was never gathered); (3) a file that DEFINES a named symbol scores +50 (beats incidental Combine.swift's +23 connected-node score); (4) the 90%-budget early-break and (5) the total-output cap both EXEMPT necessary (entry/spine/uniqueNamed) files; (6) final ceiling 1.5×maxOutputChars. Renders build+validators-exec+validate in ONE explore.
  • Spiral cause #1 (fixed): the skeleton tag said "Read for a full body" → agent Read the skeletonized central files → over-investigation spiral. Now steers to codegraph_explore.
  • Spiral cause #2 / the BIG inconsistency (fixed): MCP cold-start race. serve --mcp wasn't ready when the headless agent fired → "No such tool available" → grep/Read flounder (19–30 tool spirals). Root-caused: NOT module load (mcp/index 38ms, CodeGraph chain 30ms), NOT the --liftoff-only re-exec (NO_RELAUNCH ≈ same) — it's the proxy WAITING for the spawned daemon to bind. Fixed: proxy answers initialize/tools-list from STATIC constants (runLocalHandshakeProxy in proxy.ts), forwards tool CALLS to the daemon (connected in background), lazy in-process engine fallback preserves the old fall-back-to-direct robustness. connectWithHello distinguishes 'version-mismatch' (fail fast → local) from 'not-yet' (poll). Handshake 91ms cold / 88ms warm.

Gotchas

  • A/B variance is HUGE — never conclude from n=1, or even one n=4 batch. The median-of-4 caught regressions the lucky dedicated batches HID (the god-file rework looked great in one batch at 0.5 reads/5.5 tools; the median showed 13 tools dragged by 2 spirals). Report ranges.
  • Kill stale daemons before any cold-start measurement: pkill -9 -f "dist/bin/codegraph.js"; rm -f /tmp/codegraph-corpus/<repo>/.codegraph/daemon.*. A zombie daemon holding the lock causes a 6s retry-exhaust that looks like a 7× regression (it bit me — the "6239ms" false alarm).
  • timeout is NOT on macOS (no coreutils) — measure cold-start with a node spawn + a setTimeout kill-timer (see the transcript's measurement snippets).
  • Corpus repos: /tmp/codegraph-corpus/<repo> (all 7 README repos indexed). Explore/trace changes are query-time (no re-index). The closure-collection synthesizer is index-time but produces 0 edges on non-Swift, so it's inert there.
  • Global codegraph is npm-linked to the dev dist (node dist/bin/codegraph.js). Always npm run build before any probe/A/B (they load dist/, not src/).
  • engine.ts/tools.ts now import type CodeGraph + lazy require('../index') (CommonJS, cached) so the daemon binds before the sqlite/query chain loads; findNearestCodeGraphRoot now comes from the light ../directory.
  • The old runProxy/pipeUntilClose in proxy.ts are now DEAD (superseded by runLocalHandshakeProxy) — left in place; safe to prune in a follow-up.
  • 5 npm-shim.test.ts failures are pre-existing/network (need --probe-net) — NOT regressions; ignore.
  • Uncommitted .gitignore change (tmux-web/) is unrelated/not mine — do NOT commit it on this branch.
  • parse-bench-readme.mjs excludes raced runs by default; CG_INCLUDE_RACED=1 keeps them to see the raw distribution. Now a safety net (race eliminated at source).

How to test & validate

  • npm run build → must be clean (exit 0).
  • npx vitest run1090 pass, only the 5 npm-shim network fails.
  • npx vitest run __tests__/mcp-daemon.test.ts7/7 (sharing, #277 survive-client-death, version-mismatch fallback, idle-timeout).
  • Cold-start handshake (after killing daemons): node-spawn a serve --mcp, send initialize, time the id:1 response → ~90ms (was ~811ms). Then a tools/call (e.g. codegraph_status) returns a real result (forwarded to the daemon, ~3.4s on vscode's first index load — a call that returns LATE, not a missing-tool error).
  • A/B sweep: RUNS=4 bash scripts/agent-eval/bench-readme.shnode scripts/agent-eval/parse-bench-readme.mjs /tmp/ab-readme.
  • Methodology: handshake <150ms = race eliminated; in an A/B, grep the WITH jsonls for "No such tool available" (should be 0 now); WITH reads/tools < WITHOUT with no control regression.

Repo state

  • branch feat/trace-relevance-closure-collection, last commit 82ae484 perf(mcp): proxy answers initialize/tools-list locally — cold-start handshake ~600ms→~90ms. In sync with remote (0/0). PR #580 open.
  • uncommitted: M .gitignore (unrelated tmux-web/ edit, not mine) + the new untracked .claude/handoffs/ files. All code committed.

Open threads / TODO

  • Run median-of-4 README sweep on THIS build (race gone → should be naturally consistent, no exclusions needed).
  • Update README: benchmark summary table (7 rows × cost/tokens/time/tools), the 7 per-repo detail tables (Time/Reads/Grep/Tools/Tokens/Cost), the average line, the headline (line 7), and the methodology date (line 86).
  • Squash-merge PR #580 (gh pr merge 580 --squash --admin per the PR-merge-policy memory).
  • Add CHANGELOG entries under [Unreleased] for the cold-start fixes (trace fix + closure-collection + god-file already have entries from earlier commits — verify).
  • Prune dead runProxy/pipeUntilClose in proxy.ts.
  • Alamofire task overload still seeds RedirectHandler.task (a harmless false-positive in named-seed ranking) — same nodeRelevance idea could de-rank it.

Recent transcript (oldest → newest)

Turn — "which repos drag the averages? Alamofire looks worst"

  • Claude: Computed per-column deviations from the README table → Alamofire #1 drag (tool-calls 13% vs 62% avg, time 7% vs 23%), gin #2; both ~110-file repos.
  • Outcome: Targeted Alamofire.

    Turn — "dig into Alamofire A/B logs"

  • Claude: Read /tmp/ab-readme/alamofire/run*/run-headless-with.jsonl → found validate-flow break (validators closure-array) + per-run reasoning. Built closure-collection synthesizer + explore synth-links. A/B: flat (adoption variance).

  • Outcome: Synthesizer correct + sufficient (forced codegraph-only 3/3) but didn't move the noisy median.

    Turn — "keep working till you figure it out"

  • Claude: Read transcripts again → "the trace collided with same-named symbols (44 requests, 8 tasks)". Found scorePair picked empty stubs. Added nodeRelevance. A/B n=8: tools 12→8, variance collapsed.

  • Outcome: The real Alamofire win. Committed e86d573; opened PR #580; squash-merge deferred.

    Turn — "do the file-ranking + gather layers"

  • Claude: Built the 6-layer god-file rendering (c64c4b3). A/B flat-to-worse → debugged spirals → found the skeleton tag said "Read for more" (fixed 5d7388c) → spirals persisted → root-caused the MCP cold-start race ("No such tool available").

  • Outcome: Race is the dominant inconsistency, not the code.

    Turn — "keep iterating until consistent / do both (parser + startup)"

  • Claude: Bench parser race-aware (dc19eab). Deferred CodeGraph load off the bind path + 25ms poll (91e28df, 811→600ms).

  • Outcome: Consistency handled (parser) + 25% startup.

    Turn — "how to optimize cold-start? / implement it"

  • Claude: Profiled (module loads 30-40ms, re-exec not it — it's the daemon-spawn wait). Implemented runLocalHandshakeProxy (82ae484): proxy answers initialize/tools-list locally + forwards calls + lazy local-engine fallback. Fixed 4 daemon tests (emit "Attached to shared daemon" + fast-fail version-mismatch + updated 1 assertion). Handshake 90ms; 28/28 MCP tests; full suite 1090 pass.

  • Outcome: Cold-start race ELIMINATED. All cold-start work committed + pushed. README sweep + squash-merge pending.