fix(mcp): stop the first tool call hanging on a huge-repo catch-up reconcile (#905) (#950)

On a very large repo (the report is a ~93k-file / 5.7GB-DB Java monorepo) the
first MCP `tools/call` after a fresh `serve --mcp` could hang for 10+ minutes
with zero output, and with the liveness watchdog on, the daemon was SIGKILLed
mid-query instead. Root cause: the post-open catch-up reconcile that the first
tool call is gated on does ~2*N synchronous `fs.existsSync`/`fs.statSync` calls
plus a load-all-files query in two non-yielding loops. On a huge repo that wedges
the event loop for minutes, which (a) trips the 60s watchdog (it SIGKILLs a
process whose loop stops turning) and (b) blocks the first call the whole time.

Two complementary fixes:

- Make the reconcile yield. `ExtractionOrchestrator.sync()` now uses the
  yielding `scanDirectoryAsync`, and both O(files) reconcile loops await a
  `setImmediate`-resolved promise every SYNC_RECONCILE_YIELD_INTERVAL (1000)
  files. The loops can no longer wedge the main thread, so the watchdog stays
  fed and the socket and any concurrent read stay responsive while a big
  reconcile runs. Results are unchanged; only yield points are added.

- Time-box the catch-up gate. The first `tools/call` now waits on the reconcile
  for at most CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS (default 3000ms), then serves and
  lets the reconcile finish in the background (which now yields, so the served
  call runs concurrently). `=0` restores the old unbounded wait. On a normal repo
  the reconcile finishes well under the budget, so behavior is unchanged.
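
The yield pattern from the first fix can be sketched in isolation. This is a
minimal illustration, not the code in `src/extraction/index.ts`:
`yieldToEventLoop`, `forEachYielding`, and `demoYield` are hypothetical names,
and the per-item callback stands in for the synchronous `fs` calls. Only the
`setImmediate`-based yield and the interval of 1000 are taken from this commit.

```typescript
// Illustrative sketch only: the helper names are hypothetical, not CodeGraph's API.
const YIELD_INTERVAL = 1000; // mirrors SYNC_RECONCILE_YIELD_INTERVAL

/** Resolves on a later turn of the event loop, letting queued callbacks run first. */
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setImmediate(resolve));
}

/**
 * Runs a synchronous `check` over every item, yielding to the event loop
 * every `interval` items. Returns how many times it yielded.
 */
async function forEachYielding<T>(
  items: readonly T[],
  check: (item: T) => void,
  interval: number = YIELD_INTERVAL,
): Promise<number> {
  let processed = 0;
  let yields = 0;
  for (const item of items) {
    check(item); // stand-in for the existsSync / statSync work
    if (++processed % interval === 0) {
      await yieldToEventLoop();
      yields++;
    }
  }
  return yields;
}

/** Shows that a callback queued before the loop runs before the loop finishes. */
async function demoYield(): Promise<{ yields: number; checked: number; ranDuringLoop: boolean }> {
  let loopDone = false;
  let ranDuringLoop = false;
  // Stand-in for a watchdog heartbeat or a concurrent request.
  setImmediate(() => { ranDuringLoop = !loopDone; });

  const items = Array.from({ length: 2500 }, (_, i) => i);
  let checked = 0;
  const yields = await forEachYielding(items, () => { checked++; });
  loopDone = true;
  return { yields, checked, ranDuringLoop };
}
```

With 2500 items and an interval of 1000, `demoYield()` yields twice, and the
queued callback runs at the first yield rather than after the whole loop. A
loop without the yield would leave `ranDuringLoop` false, which is the
starvation the watchdog was reacting to.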

Tests: adds two time-box cases to mcp-catchup-gate (serves promptly when the
reconcile runs long; `=0` restores the unbounded wait). Full suite green
(1655 passed). Validated end-to-end through the real daemon: the first call
returns at the ~3s time-box instead of waiting out an injected 8s reconcile; the
no-delay control is unchanged; the `=0` opt-out waits for the full reconcile.
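
The time-box behavior these cases exercise can be sketched as a race between
the gate and a timer. This is a simplified stand-in, not the handler's actual
method: `awaitWithTimeBox`, `sleep`, and `demoTimeBox` are hypothetical names,
and the sketch returns the outcome instead of logging to stderr. It assumes the
same three rules described above: a timeout of 0 waits without bound, a
rejected gate counts as settled, and the timer is always cleared.

```typescript
// Illustrative sketch only: names are hypothetical, not CodeGraph's API.
type GateOutcome = 'done' | 'timeout';

/** Waits on `gate` for at most `timeoutMs`; `timeoutMs <= 0` waits without bound. */
async function awaitWithTimeBox(gate: Promise<void>, timeoutMs: number): Promise<GateOutcome> {
  // Map both fulfilment and rejection to 'done' so this function never throws.
  const settled = gate.then(() => 'done' as const, () => 'done' as const);
  if (timeoutMs <= 0) return settled;

  let cancelTimer = (): void => {};
  const timedOut = new Promise<'timeout'>((resolve) => {
    const timer = setTimeout(() => resolve('timeout'), timeoutMs);
    cancelTimer = () => clearTimeout(timer);
  });
  try {
    // Whichever settles first wins; the gate keeps running if the timer wins.
    return await Promise.race([settled, timedOut]);
  } finally {
    cancelTimer();
  }
}

/** Resolves after `ms` milliseconds; stands in for a reconcile of that length. */
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

async function demoTimeBox(): Promise<Record<string, GateOutcome>> {
  return {
    slowGate: await awaitWithTimeBox(sleep(300), 20),   // timer wins
    fastGate: await awaitWithTimeBox(sleep(5), 300),    // gate wins
    unbounded: await awaitWithTimeBox(sleep(50), 0),    // 0 disables the time-box
    failedGate: await awaitWithTimeBox(Promise.reject(new Error('sync failed')), 100),
  };
}
```

In `demoTimeBox()`, only `slowGate` resolves to `'timeout'`; the other three
resolve to `'done'`. Note that losing the race does not cancel the gate: the
slow promise still runs to completion, which matches the commit's intent of
letting the reconcile finish in the background.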

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry, 21 hours ago
parent
commit
ace8d8a0d0
4 files changed, with 145 additions and 6 deletions
  1. CHANGELOG.md (+ 1 - 0)
  2. __tests__/mcp-catchup-gate.test.ts (+ 51 - 0)
  3. src/extraction/index.ts (+ 25 - 1)
  4. src/mcp/tools.ts (+ 68 - 5)

+ 1 - 0
CHANGELOG.md

@@ -31,6 +31,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### Fixes
 
+- An MCP server pointed at a very large repository (tens of thousands of files) no longer hangs on the first tool call after a fresh start. On startup CodeGraph reconciles its index against the current files on disk, and on a huge repo that reconcile could run for minutes while blocking the very first request — long enough that the background server was sometimes force-restarted mid-scan, so the first query never came back at all. The reconcile now yields as it runs (keeping the server responsive instead of pinning it), and the first tool call waits only briefly for it before answering and letting the rest finish in the background — so you get a fast first response and the index still catches up. Set `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS` to tune how long that first call waits (default 3000ms), or `=0` to always wait for the full reconcile. (#905)
 - `codegraph install` now wires up your agents and stops there — it no longer indexes the current directory. Building a project's graph is always the explicit `codegraph init` (or `codegraph index`), so you decide what gets indexed and when, and the steps are the same whether you installed globally or just for one project. This clears up the confusion where a project-local install silently indexed but a global one didn't, and where the docs and the tool disagreed about whether you still had to run `init`. (#826)
 - React components declared with `forwardRef`, `memo`, or styled-components / emotion (`const Button = forwardRef(...)`, `const Card = memo(...)`, `const Box = styled.button\`…\``) are now recognized as components, so finding where they're used works. Before, they were indexed as plain constants, so `codegraph callers` and impact analysis reported "no callers found" even when the component was rendered across dozens of files — a dangerous false "safe to change" right before refactoring a shared component. Now every `<Button/>` usage links back to the component, so callers and blast radius are complete. This is the standard shadcn/ui declaration style, so for typical React design systems the whole UI layer is no longer invisible to impact analysis. Thanks @Arlandaren for the report and @maxmilian for the root-cause. (#841)
 - React Router and Next.js routes defined in `.tsx` / `.jsx` files are now indexed. Routes written as JSX — `<Route path="/users" element={<UsersPage/>}/>`, `createBrowserRouter([...])`, and Next.js `app/`/`pages/` page files — were being skipped entirely (only routes that happened to live in plain `.ts`/`.js` were picked up), so "what renders at this path?" and the route → page-component link were missing for most React apps. Now those routes show up in `codegraph search`/`codegraph_explore` and connect to the component they render, just like the backend route → handler links on other frameworks.

+ 51 - 0
__tests__/mcp-catchup-gate.test.ts

@@ -110,6 +110,57 @@ describe('MCP catch-up gate', () => {
     expect(cg.getStats().fileCount).toBe(0);
   });
 
+  it('does not hang the first call when catch-up runs past the timeout (#905)', async () => {
+    // The issue #905 hang: on a huge repo the post-open reconcile takes minutes,
+    // and gating the first tool call on all of it reads as a multi-minute hang.
+    // With the time-box, the call is served promptly and the reconcile finishes
+    // in the background.
+    const prev = process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS;
+    process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS = '50';
+    let timer: NodeJS.Timeout | undefined;
+    try {
+      let gateResolved = false;
+      const gate = new Promise<void>((resolve) => {
+        timer = setTimeout(() => { gateResolved = true; resolve(); }, 5000);
+      });
+      handler.setCatchUpGate(gate);
+
+      const started = Date.now();
+      const res = await handler.execute('codegraph_search', { query: 'survivor' });
+      const elapsed = Date.now() - started;
+
+      expect(res.isError).toBeFalsy();
+      expect(res.content[0].text).toMatch(/survivor/);
+      // Served on the timeout (~50ms), NOT after the 5s reconcile.
+      expect(gateResolved).toBe(false);
+      expect(elapsed).toBeLessThan(2000);
+    } finally {
+      if (timer) clearTimeout(timer);
+      if (prev === undefined) delete process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS;
+      else process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS = prev;
+    }
+  });
+
+  it('CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS=0 restores the unbounded wait', async () => {
+    const prev = process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS;
+    process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS = '0';
+    try {
+      let gateResolved = false;
+      const gate = new Promise<void>((resolve) => {
+        setTimeout(() => { gateResolved = true; resolve(); }, 80);
+      });
+      handler.setCatchUpGate(gate);
+
+      const res = await handler.execute('codegraph_search', { query: 'survivor' });
+      // With the time-box disabled, the call waits for the full reconcile.
+      expect(gateResolved).toBe(true);
+      expect(res.isError).toBeFalsy();
+    } finally {
+      if (prev === undefined) delete process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS;
+      else process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS = prev;
+    }
+  });
+
   it('gate that rejects does not break the tool call', async () => {
     // A catch-up sync failure (lock contention, transient FS error) must
     // not poison tool dispatch — the engine logs it, the handler proceeds.

+ 25 - 1
src/extraction/index.ts

@@ -32,6 +32,18 @@ import type { ResolutionContext } from '../resolution/types';
  */
 const FILE_IO_BATCH_SIZE = 10;
 
+/**
+ * How many files the `sync()` reconcile processes between cooperative yields to
+ * the event loop. The reconcile runs two O(files) loops of synchronous `fs`
+ * calls (existsSync for removals, statSync for adds/mods); on a very large repo
+ * (~100k files) an un-yielded run wedges the main thread for minutes, which both
+ * trips the liveness watchdog (it SIGKILLs a process whose loop stops turning)
+ * and blocks the first MCP tool call behind the catch-up gate (issue #905).
+ * Yielding every N files keeps the socket, the watchdog heartbeat, and any
+ * concurrent read query responsive while the reconcile runs.
+ */
+const SYNC_RECONCILE_YIELD_INTERVAL = 1000;
+
 // PARSER_RESET_INTERVAL moved to parse-worker.ts (runs in worker thread)
 
 /**
@@ -1774,7 +1786,7 @@ export class ExtractionOrchestrator {
     // whether or not the project uses git, and crucially also catches committed
     // changes from `git pull`/`checkout`/`merge`/`rebase` — which `git status`
     // cannot see, because the working tree is clean afterward.
-    const currentFiles = scanDirectory(this.rootDir);
+    const currentFiles = await scanDirectoryAsync(this.rootDir);
     filesChecked = currentFiles.length;
     const currentSet = new Set(currentFiles);
 
@@ -1787,15 +1799,27 @@ export class ExtractionOrchestrator {
     // Removals: tracked in the DB but no longer a present source file. Check the
     // filesystem directly — `scanDirectory` (via `git ls-files`) still lists a
     // file deleted from disk but not yet staged, so set membership alone misses it.
+    // `reconcileChecks` drives the cooperative yield shared with the adds/mods loop
+    // below (see SYNC_RECONCILE_YIELD_INTERVAL / issue #905).
+    let reconcileChecks = 0;
     for (const tracked of trackedFiles) {
       if (!currentSet.has(tracked.path) || !fs.existsSync(path.join(this.rootDir, tracked.path))) {
         this.queries.deleteFile(tracked.path);
         filesRemoved++;
       }
+      if (++reconcileChecks % SYNC_RECONCILE_YIELD_INTERVAL === 0) {
+        await new Promise<void>((resolve) => setImmediate(resolve));
+      }
     }
 
     // Adds / modifications.
     for (const filePath of currentFiles) {
+      // Same cooperative yield as the removals loop — this is the other O(files)
+      // synchronous-stat loop that wedges the main thread on a large repo (#905).
+      // Yield at the top of the body so the `continue` fast-paths below still hit it.
+      if (++reconcileChecks % SYNC_RECONCILE_YIELD_INTERVAL === 0) {
+        await new Promise<void>((resolve) => setImmediate(resolve));
+      }
       const fullPath = path.join(this.rootDir, filePath);
       const tracked = trackedMap.get(filePath);
 

+ 68 - 5
src/mcp/tools.ts

@@ -288,6 +288,27 @@ function adaptiveExploreEnabled(): boolean {
   return process.env.CODEGRAPH_ADAPTIVE_EXPLORE !== '0' && process.env.CODEGRAPH_ADAPTIVE_EXPLORE !== 'false';
 }
 
+/**
+ * How long the FIRST tool call waits on the post-open catch-up reconcile before
+ * giving up and serving anyway (issue #905). On a normal repo the reconcile
+ * finishes in well under this, so the gate is fully honored and nothing changes.
+ * On a very large repo (~100k files) the reconcile takes minutes — blocking the
+ * first call on all of it presents as a multi-minute hang — so we wait briefly
+ * for a clean answer, then serve and let the reconcile finish in the background
+ * (it yields to the event loop, so a concurrent read still runs).
+ *
+ * `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS` overrides the default; `0` restores the
+ * old unbounded-wait behavior (always block until the reconcile completes).
+ */
+const DEFAULT_CATCHUP_GATE_TIMEOUT_MS = 3000;
+function resolveCatchUpGateTimeoutMs(): number {
+  const raw = process.env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS;
+  if (raw === undefined || raw === '') return DEFAULT_CATCHUP_GATE_TIMEOUT_MS;
+  const n = Number(raw);
+  if (!Number.isFinite(n) || n < 0) return DEFAULT_CATCHUP_GATE_TIMEOUT_MS;
+  return Math.floor(n);
+}
+
 /**
  * Prefix each line of a source slice with its 1-based line number, matching
  * the Read tool's `cat -n` convention (number + tab) so the agent treats it
@@ -667,7 +688,9 @@ export class ToolHandler {
   // this, a tool call that races past `catchUpSync()` serves rows for files
   // that were deleted (or edited) while no MCP server was running — and the
   // per-file staleness banner can't help, because `getPendingFiles()` is
-  // populated by the watcher, not by catch-up. Cleared on first await so
+  // populated by the watcher, not by catch-up. The wait is time-boxed
+  // (see {@link resolveCatchUpGateTimeoutMs}) so a minutes-long reconcile on a
+  // huge repo can't hang the first call (#905); cleared on first await so
   // subsequent calls don't pay any cost.
   private catchUpGate: Promise<void> | null = null;
 
@@ -691,6 +714,43 @@ export class ToolHandler {
     this.catchUpGate = p;
   }
 
+  /**
+   * Await the catch-up gate, but no longer than the configured timeout (#905).
+   * If the reconcile settles first, we got the fully-reconciled answer. If the
+   * timeout wins, we serve the call now and let the reconcile finish in the
+   * background — it yields to the event loop (see SYNC_RECONCILE_YIELD_INTERVAL),
+   * so a concurrent read still runs against the same connection. Never throws:
+   * a failed reconcile is logged by the engine, and we serve best-effort over
+   * the same potentially-stale data the un-gated path would have.
+   */
+  private async awaitCatchUpGate(gate: Promise<void>): Promise<void> {
+    const timeoutMs = resolveCatchUpGateTimeoutMs();
+    if (timeoutMs <= 0) {
+      // 0 = opt back into the original unbounded wait.
+      try { await gate; } catch { /* engine already logged */ }
+      return;
+    }
+    let timer: NodeJS.Timeout | undefined;
+    const timedOut = new Promise<'timeout'>((resolve) => {
+      timer = setTimeout(() => resolve('timeout'), timeoutMs);
+      timer.unref?.();
+    });
+    try {
+      const outcome = await Promise.race([
+        gate.then(() => 'done' as const, () => 'done' as const),
+        timedOut,
+      ]);
+      if (outcome === 'timeout') {
+        process.stderr.write(
+          `[CodeGraph MCP] Catch-up reconcile still running after ${timeoutMs}ms; serving this tool call now and finishing the reconcile in the background (#905). ` +
+          `Set CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS=0 to always wait for it.\n`
+        );
+      }
+    } finally {
+      if (timer) clearTimeout(timer);
+    }
+  }
+
   /**
    * Record the directory the server tried to resolve the default project from.
    * Used only to make the "no default project" error actionable.
@@ -1128,13 +1188,16 @@ export class ToolHandler {
     try {
       // Block the first tool call on the engine's post-open reconcile so we
       // never serve rows for files deleted/edited while no MCP server was
-      // running. The gate is cleared after first await — subsequent calls
-      // pay nothing. Catch-up failures are logged by the engine; we
-      // proceed regardless so a transient sync error never breaks tools.
+      // running. The wait is time-boxed (#905): a huge-repo reconcile takes
+      // minutes, and blocking the first call on all of it reads as a hang, so
+      // we wait briefly then serve and let it finish in the background. The
+      // gate is cleared after first await — subsequent calls pay nothing.
+      // Catch-up failures are logged by the engine; we proceed regardless so a
+      // transient sync error never breaks tools.
       if (this.catchUpGate) {
         const gate = this.catchUpGate;
         this.catchUpGate = null;
-        try { await gate; } catch { /* engine already logged */ }
+        await this.awaitCatchUpGate(gate);
       }
       // Honor the optional tool allowlist (CODEGRAPH_MCP_TOOLS): a trimmed
       // surface rejects ablated tools defensively even if a client cached them.