Преглед на файлове

fix(mcp): prevent "Transport closed" from a stray daemon-socket error (#974) (#983)

The client-facing MCP proxy could exit with "Transport closed" when its
connection to the shared daemon hit a socket 'error' with no listener
attached — common on WSL2 /mnt (DrvFs), where AF_UNIX is flaky. The global
fatal handler turned that uncaughtException into process.exit(1), which the
MCP client saw as a bare transport close even though the index was healthy.

proxy.ts now keeps an 'error' listener on the daemon socket for its whole
life (and skips a socket destroyed in the connect window), so a stray error
degrades to the existing in-process fallback instead of crashing. daemon.ts
releases the lockfile it acquired when it fails to bind, so the next launch
doesn't spin on a stale lock (the duplicate serve --mcp pileup).

No default behavior change for anyone; WSL /mnt users who still hit trouble
can set CODEGRAPH_NO_DAEMON=1 to skip the shared daemon entirely. Validated
on macOS (unit + live serve probe) and Linux (Docker, --init): 64/64 across
the daemon/socket/lifecycle suites, incl. real AF_UNIX.

Closes #974

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry преди 14 часа
родител
ревизия
7c6417ef8f
променени са 6 файла, в които са добавени 164 реда и са изтрити 1 реда
  1. 2 0
      CHANGELOG.md
  2. 2 0
      README.md
  3. 55 0
      __tests__/daemon-bind-failure.test.ts
  4. 78 0
      __tests__/proxy-connect.test.ts
  5. 13 0
      src/mcp/daemon.ts
  6. 14 1
      src/mcp/proxy.ts

+ 2 - 0
CHANGELOG.md

@@ -12,6 +12,8 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 ### Fixes
 
 - CodeGraph respects your `.gitignore` again when looking for nested git repositories. A directory you've gitignored — a `resource/` or `.repos/` folder of cloned reference projects, a large vendored data dir — is no longer walked into and indexed: version 1.0.0 started searching inside *every* gitignored directory for embedded git repos and pulling them all in, which could multiply a graph many times over and slow indexing to a crawl on a multi-gigabyte folder of clones, even though you'd explicitly excluded it. Indexing the repos inside a gitignored directory is now opt-in — add an `includeIgnored` list to a `codegraph.json` at your repo root, e.g. `{ "includeIgnored": ["packages/", "services/"] }`, to index the embedded repos under the directories you name. The "super-repo of independent clones" layout from 1.0.0 still works, just declared explicitly. Nested repos you *haven't* gitignored (untracked clones) are indexed as before, and a project without this layout is unaffected. (#970, #976, #622)
+- The MCP server no longer drops out with `Transport closed` when its connection to the shared background server hits a transient socket error. To serve all your editor/agent sessions from a single index, CodeGraph runs one shared background server they reach over a local socket; a stray error on that socket while a session was connecting used to take the whole server process down, so your agent saw a bare `Transport closed` even though `codegraph status` and `codegraph sync` were perfectly healthy. Now such an error is handled gracefully — the session falls back to serving itself in-process instead of crashing. This is most common on WSL2 when your project lives on a Windows drive (a `/mnt/c` or `/mnt/d` path), where that socket is flaky; if you still hit problems there, set `CODEGRAPH_NO_DAEMON=1` to skip the shared server entirely and run each session in its own process. Affects Codex and any other MCP client. (#974)
+- A shared background server that can't bind its socket now cleans up its own lock file before exiting, instead of leaving a stale lock that the next launch trips over — which previously could let duplicate `serve --mcp` processes pile up for the same project. (#974)
 
 
 ## [1.1.0] - 2026-06-23

+ 2 - 0
README.md

@@ -747,6 +747,8 @@ Framework routing is validated the same way, on a canonical app per framework: E
 
 **MCP server not connecting** — Your agent starts the server itself, so you don't launch it by hand. Make sure the project is initialized and indexed (`codegraph status`) and that the path in your MCP config is correct. If it still won't connect, re-run `codegraph install` to rewrite the config.
 
+**MCP tool calls fail with `Transport closed` while `codegraph status`/`sync` are healthy** — almost always WSL2 with the project on a Windows drive (a `/mnt/c` or `/mnt/d` path), where the local socket CodeGraph uses to share one background server across sessions is unreliable. CodeGraph now falls back to serving the session in-process instead of dropping the connection, but if you still hit it, set `CODEGRAPH_NO_DAEMON=1` in your MCP server's environment to skip the shared server entirely (each session runs in its own process). Moving the project onto the Linux-native filesystem (e.g. under `~/` instead of `/mnt/`) restores the shared server.
+
 **Missing symbols** — The MCP server auto-syncs on save (wait a couple seconds). Run `codegraph sync` manually if needed. Check that the file's language is supported and isn't inside a `.gitignore`d or default-excluded directory (e.g. `node_modules`, `dist`).
 
 **Sharing one checkout between Windows and WSL** — Don't point both at the same `.codegraph/`: the background-server lock and the SQLite index are tied to the OS that wrote them, and SQLite locking across the WSL2/Windows filesystem boundary is unreliable. Give each side its own index in the same tree by setting `CODEGRAPH_DIR` to a distinct name on one of them — e.g. `CODEGRAPH_DIR=.codegraph-win` on Windows, leaving WSL on the default `.codegraph`. CodeGraph skips any sibling `.codegraph-*` directory when indexing and watching, so the two never trip over each other.

+ 55 - 0
__tests__/daemon-bind-failure.test.ts

@@ -0,0 +1,55 @@
+/**
+ * Daemon bind-failure cleanup — issue #974.
+ *
+ * A detached daemon acquires the `.codegraph/daemon.pid` lock (via
+ * `tryAcquireDaemonLock`) BEFORE it binds its socket. If the bind then fails —
+ * e.g. AF_UNIX is unsupported/unreliable on the filesystem (the WSL2 DrvFs
+ * hazard behind #974) — `Daemon.start()` must release that lockfile before it
+ * propagates the error and exits. Otherwise the next launcher reads a stale lock
+ * pointing at the now-dead pid and the process pileup the issue reported recurs.
+ *
+ * We force a deterministic bind failure by planting a *directory* at the socket
+ * path: `unlinkSync` (the daemon's stale-socket clear) can't remove a directory,
+ * so it survives and `listen()` fails with EADDRINUSE.
+ */
+
+import { afterEach, describe, expect, it } from 'vitest';
+import * as fs from 'fs';
+import * as os from 'os';
+import * as path from 'path';
+import { Daemon, tryAcquireDaemonLock } from '../src/mcp/daemon';
+import { getDaemonPidPath, getDaemonSocketPath } from '../src/mcp/daemon-paths';
+
+const tmpRoots: string[] = [];
+afterEach(() => {
+  while (tmpRoots.length) {
+    const root = tmpRoots.pop()!;
+    try { fs.rmSync(root, { recursive: true, force: true }); } catch { /* best-effort */ }
+  }
+});
+
+describe('Daemon.start() bind failure (#974)', () => {
+  it.runIf(process.platform !== 'win32')('releases the lockfile it acquired when the socket cannot bind', async () => {
+    const root = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-bind-'));
+    tmpRoots.push(root);
+
+    // Acquire the lock exactly as the detached-daemon startup does.
+    const lock = tryAcquireDaemonLock(root);
+    expect(lock.kind).toBe('acquired');
+    const pidPath = getDaemonPidPath(root);
+    expect(fs.existsSync(pidPath)).toBe(true);
+
+    // Make the socket path un-bindable: a directory can't be unlink'd by the
+    // daemon's stale-socket clear, and listen() on it fails with EADDRINUSE.
+    const sockPath = getDaemonSocketPath(root);
+    fs.mkdirSync(sockPath, { recursive: true });
+    // The tmpdir-fallback socket path can live outside `root`; clean it too.
+    tmpRoots.push(sockPath);
+
+    const daemon = new Daemon(root);
+    await expect(daemon.start()).rejects.toThrow();
+
+    // The lockfile must be gone so the next launcher doesn't spin on a stale lock.
+    expect(fs.existsSync(pidPath)).toBe(false);
+  });
+});

+ 78 - 0
__tests__/proxy-connect.test.ts

@@ -0,0 +1,78 @@
+/**
+ * Proxy connect resilience — issue #974.
+ *
+ * `connectWithHello` returns a live socket to the caller, which then attaches
+ * its own onDaemonLost handler. Before #974, `readHelloLine` attached an
+ * 'error' listener and REMOVED it on success, leaving a window where the socket
+ * had no 'error' listener — and a socket 'error' with no listener is re-thrown
+ * by Node as an uncaughtException, which the global fatal handler turns into
+ * process.exit(1). To an MCP client that is a bare "Transport closed". The fix
+ * keeps a guard 'error' listener attached for the socket's whole life.
+ *
+ * AF_UNIX over WSL2/DrvFs makes that window common; here we just prove the
+ * invariant on a normal socket: the returned socket always has an 'error'
+ * listener, and emitting an error on it never throws.
+ */
+
+import { afterEach, describe, expect, it } from 'vitest';
+import * as net from 'net';
+import * as fs from 'fs';
+import * as os from 'os';
+import * as path from 'path';
+import { connectWithHello } from '../src/mcp/proxy';
+import { CodeGraphPackageVersion } from '../src/mcp/version';
+
+const cleanups: Array<() => void> = [];
+afterEach(() => {
+  while (cleanups.length) {
+    try { cleanups.pop()!(); } catch { /* best-effort */ }
+  }
+});
+
+/** Stand up a fake daemon that emits a valid hello line on connect. */
+async function fakeDaemon(version: string): Promise<{ sockPath: string; server: net.Server }> {
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-proxy-'));
+  const sockPath = path.join(dir, 'd.sock');
+  const server = net.createServer((socket) => {
+    const hello = { codegraph: version, pid: process.pid, socketPath: sockPath, protocol: 1 };
+    socket.write(JSON.stringify(hello) + '\n');
+  });
+  await new Promise<void>((resolve) => server.listen(sockPath, resolve));
+  cleanups.push(() => server.close());
+  cleanups.push(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch { /* ignore */ } });
+  return { sockPath, server };
+}
+
+describe('connectWithHello — socket is never left without an error listener (#974)', () => {
+  it.runIf(process.platform !== 'win32')('returns a socket that has an error listener and never throws on error', async () => {
+    const { sockPath } = await fakeDaemon(CodeGraphPackageVersion);
+
+    const result = await connectWithHello(sockPath);
+    expect(result).not.toBeNull();
+    expect(result).not.toBe('version-mismatch');
+
+    const socket = result as net.Socket;
+    cleanups.push(() => socket.destroy());
+
+    // The invariant: a guard 'error' listener is attached for the socket's whole
+    // life, so a stray socket error can't escalate to an uncaughtException.
+    expect(socket.listenerCount('error')).toBeGreaterThanOrEqual(1);
+
+    // Emitting an error must NOT throw. Without the guard this is exactly the
+    // path that crashed the proxy with "Transport closed".
+    expect(() => socket.emit('error', new Error('simulated ECONNRESET'))).not.toThrow();
+  });
+
+  it.runIf(process.platform !== 'win32')('still reports version-mismatch (and that path does not throw)', async () => {
+    const { sockPath } = await fakeDaemon('0.0.0-not-our-version');
+    const result = await connectWithHello(sockPath);
+    expect(result).toBe('version-mismatch');
+  });
+
+  it.runIf(process.platform !== 'win32')('returns null when no daemon is listening', async () => {
+    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-proxy-none-'));
+    cleanups.push(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch { /* ignore */ } });
+    const result = await connectWithHello(path.join(dir, 'missing.sock'));
+    expect(result).toBeNull();
+  });
+});

+ 13 - 0
src/mcp/daemon.ts

@@ -183,6 +183,19 @@ export class Daemon {
         this.server = server;
         resolve();
       });
+    }).catch((err) => {
+      // Bind failed — e.g. AF_UNIX is unsupported/unreliable on this filesystem
+      // (the WSL2 DrvFs hazard behind #974), or a stale socket we couldn't clear.
+      // We already hold the lockfile that `tryAcquireDaemonLock` wrote; release it
+      // and any partial socket so the NEXT launcher doesn't spin respawning us on
+      // a stale lock that points at our now-dying pid. Then re-throw so the caller
+      // (the bin's try/catch) exits this detached daemon cleanly and every
+      // launcher falls back to direct mode.
+      this.cleanupLockfile();
+      if (process.platform !== 'win32') {
+        try { fs.unlinkSync(this.socketPath); } catch { /* may not exist */ }
+      }
+      throw err;
     });
 
     const lock: DaemonLockInfo = {

+ 14 - 1
src/mcp/proxy.ts

@@ -135,6 +135,16 @@ export async function connectWithHello(
   if (process.platform !== 'win32' && !fs.existsSync(socketPath)) return null;
   const socket = net.createConnection(socketPath);
   socket.setEncoding('utf8');
+  // Keep an 'error' listener attached for the socket's ENTIRE life. readHelloLine
+  // attaches its own and then REMOVES it on success (its cleanup()), which left a
+  // window — from here until the caller attaches its onDaemonLost handler — where
+  // a socket 'error' had NO listener. In Node an unhandled socket 'error' is
+  // re-thrown as an uncaughtException, which the global fatal handler turns into
+  // process.exit(1); to an MCP client that surfaces as a bare "Transport closed"
+  // (#974). The window is rarely hit on a healthy FS but is common on flaky
+  // AF_UNIX-over-DrvFs (WSL2 /mnt drives). A no-op guard makes the error
+  // recoverable: the follow-up 'close' drives the caller's normal fallback.
+  socket.on('error', () => { /* absorbed — see #974; 'close' drives the fallback */ });
   const hello = await readHelloLine(socket).catch(() => null);
   if (!hello) {
     socket.destroy();
@@ -323,7 +333,10 @@ export async function runLocalHandshakeProxy(deps: LocalHandshakeDeps): Promise<
   let socket: net.Socket | null = null;
   try { socket = await deps.getDaemonSocket(); } catch { socket = null; }
 
-  if (socket && !shuttingDown) {
+  // `!socket.destroyed`: the connect-window error guard above can absorb an
+  // 'error' that already destroyed the socket before we got here (#974) — treat
+  // a dead socket as "no daemon" so we cleanly fall back to the in-process engine.
+  if (socket && !socket.destroyed && !shuttingDown) {
     daemonSocket = socket;
     daemonStatus = 'ready';
     let sockBuf = '';