
fix(mcp): start the daemon on ExFAT/FAT/network filesystems (#997) (#1022)

A project kept on an ExFAT/FAT external volume (or some network mounts /
WSL2 DrvFs) broke the background auto-sync daemon at two points, both
because the filesystem lacks POSIX features the daemon relied on:

1. Lock acquisition hard-links a temp file onto .codegraph/daemon.pid for
   race-free exclusivity (#411) — these filesystems have no hard links.
2. The Unix-domain socket listen() fails regardless of path length, so the
   old length-only tmpdir fallback never triggered.

Both surface as a capability error, but each OS reports a DIFFERENT errno
for the same gap (macOS ENOTSUP, Linux EPERM, Windows EISDIR), so the fix
is policy-based rather than an enumerated code-set:

- Lock: fall back to an O_EXCL create on any non-EEXIST link error. The
  temp write already proved the directory is writable, so the fallback
  either succeeds (still atomic + exclusive, "first writer wins") or
  surfaces its own genuine error.
- Socket: an ordered candidate list [in-project, tmpdir] walked by BOTH
  the daemon (binds) and the proxy (connects) — they converge on the
  fallback with zero coordination. Relocate past any non-EADDRINUSE bind
  error; EADDRINUSE still rethrows, preserving the #974 contract.
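The lock policy above can be sketched in isolation. This is a minimal illustration, not the code from this commit: `acquireLockSketch` and its injectable `link` parameter are hypothetical names introduced here so the fallback branch can be exercised on a normal filesystem, and the record is a plain string rather than the daemon's encoded lock info.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical seam: lets a caller substitute a link() that fails the way
// ExFAT/FAT does. Defaults to the real fs.linkSync.
type LinkFn = (existing: string, target: string) => void;

/** Sketch only: returns true if this caller took the lock, false if it is held. */
function acquireLockSketch(pidPath: string, record: string, link: LinkFn = fs.linkSync): boolean {
  const tmp = `${pidPath}.${process.pid}.tmp`;
  // Writing the temp file first proves the directory is writable.
  fs.writeFileSync(tmp, record, { mode: 0o600 });
  try {
    link(tmp, pidPath); // atomic + exclusive where hard links exist
    return true;
  } catch (err) {
    // EEXIST is the only "someone else holds it" signal on the link path.
    if ((err as NodeJS.ErrnoException).code === 'EEXIST') return false;
    // Any other link error: fall back to an exclusive create (O_CREAT | O_EXCL).
    let fd: number;
    try {
      fd = fs.openSync(pidPath, 'wx', 0o600);
    } catch (openErr) {
      if ((openErr as NodeJS.ErrnoException).code === 'EEXIST') return false;
      throw openErr; // a genuine error from the fallback itself
    }
    try {
      fs.writeSync(fd, record);
    } finally {
      fs.closeSync(fd);
    }
    return true;
  } finally {
    try { fs.unlinkSync(tmp); } catch { /* temp already gone */ }
  }
}
```

Passing a `link` that throws an error whose `code` is `ENOTSUP`, `EPERM`, or any unanticipated value takes the same fallback branch, which is the point of gating on "anything but EEXIST" instead of a fixed list.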

Normal repos are unaffected: the in-project candidate binds first, and the
hard-link lock path is unchanged.
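The candidate walk can be sketched as follows. This is an assumption-laden illustration rather than the commit's `bindFirstUsableSocket`: the name `bindFirstUsableSketch`, the generic server type, and the positional `onRelocate` argument are introduced here, and the injected `listen` stands in for the real `net.Server.listen` call.

```typescript
interface Bound<S> {
  server: S;
  socketPath: string;
}

type RelocateHook = (from: string, to: string, code: string) => void;

/**
 * Sketch only: try each candidate in order and return the first that binds.
 * EADDRINUSE, or any error on the last candidate, is rethrown to the caller.
 */
async function bindFirstUsableSketch<S>(
  candidates: string[],
  listen: (socketPath: string) => Promise<S>,
  onRelocate?: RelocateHook,
): Promise<Bound<S>> {
  if (candidates.length === 0) throw new Error('no socket candidates');
  for (let i = 0; i < candidates.length; i++) {
    const socketPath = candidates[i]!;
    try {
      return { server: await listen(socketPath), socketPath };
    } catch (err) {
      const code = (err as { code?: string }).code ?? 'UNKNOWN';
      const next = candidates[i + 1];
      // A conflict means the path is genuinely occupied: do not relocate.
      // With no candidate left, there is nowhere to relocate to.
      if (code === 'EADDRINUSE' || next === undefined) throw err;
      onRelocate?.(socketPath, next, code);
    }
  }
  throw new Error('unreachable: the loop returns or throws');
}
```

On a normal repository the first `listen` resolves, so the loop returns on its first iteration and the fallback path is never touched.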

Validated end-to-end on real removable-drive filesystems: macOS ExFAT
(hdiutil image), Linux FAT32 (Docker loop mount), Windows exFAT (diskpart
VHD) — each acquires the lock, relocates (or binds a named pipe on
Windows), and serves a real client.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 20 hours ago
parent
commit
f83a1ecc8e

+ 1 - 0
CHANGELOG.md

@@ -21,6 +21,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - A `codegraph index` or `codegraph init` that gets orphaned or wedged now stops itself instead of pinning a CPU core forever. If you killed the command (or the terminal/agent that launched it), the underlying indexer process used to keep running in the background — the parent couldn't pass the signal along — and a genuinely stuck index had nothing watching it either, since the self-recovery watchdogs were wired only into the background MCP server. Both gaps are closed: indexing now self-terminates when its parent goes away, and a main thread that stops making progress is killed so it can't hang indefinitely. Opt out with `CODEGRAPH_NO_WATCHDOG=1` (liveness) or `CODEGRAPH_PPID_POLL_MS=0` (orphan detection), matching the server. (#999)
 - Indexing no longer hangs at "Resolving refs" on a repo that commits a large JavaScript/TypeScript theme or SDK. A vendored admin theme (Metronic is the classic case — ~1,300 committed `.js` files) re-declares the same method names (`init`, `update`, `render`, `destroy`, …) on hundreds of widgets, and resolution used to score *every* same-named definition against *every* call — work that grows with the square of how many times a name repeats. On such a repo it pinned a CPU core for 15–30 minutes and effectively never finished. Resolution now declines to guess when a name is defined more times than any real codebase ever repeats one (the cutoff is generous — normal projects top out far below it and are completely unaffected), since no proximity heuristic can pick the one true target among thousands anyway. Indexing that previously wedged now completes in seconds, and precise resolution (imports, qualified names, class-name matches) is unchanged. This is the same class of slowdown as the 1.1.0 import-name fix, now closed for repeated method/symbol names. Tune the cutoff with `CODEGRAPH_AMBIGUOUS_NAME_CEILING` if you ever need to. Thanks @DANOX2 for the detailed report and repro. (#999)
 - Claude Code's front-load prompt hook now fires for non-English prompts. The optional hook that injects CodeGraph context for structural questions only recognized English keywords, so a structural question written in Chinese — or any non-Latin-script language — silently injected nothing: the hook looked like it wasn't wired up despite a correct setup, with no error to explain why. The gate is now language-aware. It recognizes Chinese structural keywords (如何/流程/调用/依赖/实现/架构…), and — in any language — a prompt that names a real code symbol from your project, such as `getUserId`, `article_publish`, `user.login`, or `parseToken()` (the name is checked against the index, so an ordinary word that merely looks like code doesn't trigger it). Non-structural prompts ("fix this typo", in any language) stay a no-op as before, so nothing fires where there's no structural answer to give. Thanks @whinc for the detailed report and repro. (#994)
+- The background auto-sync server now starts for projects kept on an ExFAT or FAT external drive (and some network mounts). Those filesystems don't support the operations the server relies on to coordinate and to listen locally, so it failed immediately and re-logged the same error on every retry — background indexing was broken, so you had to run `codegraph sync` by hand after changes. (The MCP tools, the prompt hook, and manual `codegraph index`/`sync` were unaffected, since none of them need the server.) The server now works around those limitations automatically — falling back to a different coordination method and relocating its local socket to your system temp directory — so background indexing works there exactly like anywhere else, with no configuration needed. Verified end-to-end on real removable-drive filesystems on macOS, Linux, and Windows. Thanks @zengwenliang416 for the detailed report. (#997)
 
 
 ## [1.1.1] - 2026-06-24

+ 246 - 0
__tests__/daemon-socket-fallback.test.ts

@@ -0,0 +1,246 @@
+/**
+ * Daemon support on socket-incapable filesystems — issue #997 (and the adjacent
+ * #974 WSL2 DrvFs hazard).
+ *
+ * A project on an ExFAT/FAT external volume (or some network mounts / WSL2 DrvFs)
+ * breaks the daemon at TWO points, BOTH surfacing as ENOTSUP on macOS (verified
+ * on a real fskit ExFAT volume):
+ *
+ *   1. Lock acquisition `link()`s a temp file onto `.codegraph/daemon.pid` for
+ *      race-free exclusivity (#411). ExFAT has no hard links, so this throws
+ *      first — before the socket is ever reached. The fix falls back to an
+ *      O_EXCL create (`acquireLockViaExclusiveOpen`).
+ *   2. The socket `listen()` then throws ENOTSUP regardless of path length, so
+ *      the old length-only tmpdir fallback never triggered. The fix makes the
+ *      socket path an ORDERED candidate list (in-project, then a deterministic
+ *      tmpdir path); the daemon binds the first that works and the proxy connects
+ *      the first that answers, so both converge on the fallback with zero
+ *      coordination.
+ *
+ * Both failures report a DIFFERENT errno per OS — ENOTSUP (macOS), EPERM (Linux),
+ * EISDIR (Windows) — so the fix deliberately does NOT gate on an enumerated set:
+ * the lock falls back on ANY non-EEXIST link error, the socket relocates on ANY
+ * non-EADDRINUSE bind error. These tests pin that policy (incl. a deliberately
+ * unanticipated errno), the candidate list, the candidate-walk binder, and the
+ * exclusive-open lock primitive. (Throwaway scripts drove the full daemon end-to-
+ * end on a real macOS ExFAT image, a Linux FAT loopback mount, and a Windows
+ * exFAT VHD — relocate, serve a real client, rewrite the pidfile — none of which
+ * can run in CI.)
+ */
+
+import { afterEach, describe, expect, it } from 'vitest';
+import * as fs from 'fs';
+import * as net from 'net';
+import * as os from 'os';
+import * as path from 'path';
+import {
+  getDaemonPidPath,
+  getDaemonSocketCandidates,
+  getDaemonSocketPath,
+} from '../src/mcp/daemon-paths';
+import type { DaemonLockInfo } from '../src/mcp/daemon-paths';
+import { decodeLockInfo } from '../src/mcp/daemon-paths';
+import {
+  acquireLockViaExclusiveOpen,
+  bindFirstUsableSocket,
+  tryAcquireDaemonLock,
+} from '../src/mcp/daemon';
+
+const POSIX = process.platform !== 'win32';
+
+const tmpFiles: string[] = [];
+const tmpDirs: string[] = [];
+afterEach(() => {
+  while (tmpFiles.length) {
+    try { fs.rmSync(tmpFiles.pop()!, { force: true }); } catch { /* best-effort */ }
+  }
+  while (tmpDirs.length) {
+    try { fs.rmSync(tmpDirs.pop()!, { recursive: true, force: true }); } catch { /* best-effort */ }
+  }
+});
+
+/** A stand-in net.Server — bindFirstUsableSocket only ever passes it through. */
+const fakeServer = (tag: string): net.Server => ({ tag } as unknown as net.Server);
+
+/** Build an ErrnoException carrying a specific code, like a real listen() error. */
+function errno(code: string): NodeJS.ErrnoException {
+  const e = new Error(`listen ${code}`) as NodeJS.ErrnoException;
+  e.code = code;
+  return e;
+}
+
+describe('getDaemonSocketCandidates (#997)', () => {
+  it.runIf(POSIX)('returns [in-project, tmpdir] for a normal short path', () => {
+    const root = path.join(os.tmpdir(), 'cg-cand-short');
+    const candidates = getDaemonSocketCandidates(root);
+    expect(candidates).toHaveLength(2);
+    expect(candidates[0]).toBe(path.join(root, '.codegraph', 'daemon.sock'));
+    expect(candidates[1]!.startsWith(os.tmpdir())).toBe(true);
+    expect(path.basename(candidates[1]!)).toMatch(/^codegraph-[0-9a-f]{16}\.sock$/);
+  });
+
+  it.runIf(POSIX)('drops straight to [tmpdir] when the in-project path is too long', () => {
+    // A deep root pushes `.codegraph/daemon.sock` past the POSIX socket limit.
+    const root = path.join('/tmp', 'x'.repeat(120));
+    const candidates = getDaemonSocketCandidates(root);
+    expect(candidates).toHaveLength(1);
+    expect(candidates[0]!.startsWith(os.tmpdir())).toBe(true);
+  });
+
+  it.runIf(POSIX)('is deterministic and project-scoped: same root → same tmpdir fallback', () => {
+    const root = path.join(os.tmpdir(), 'cg-cand-determinism');
+    const a = getDaemonSocketCandidates(root);
+    const b = getDaemonSocketCandidates(root);
+    expect(a).toEqual(b);
+    // A different root yields a different (hashed) tmpdir fallback.
+    const other = getDaemonSocketCandidates(root + '-other');
+    expect(other[other.length - 1]).not.toBe(a[a.length - 1]);
+  });
+
+  it.runIf(!POSIX)('returns a single named pipe on Windows', () => {
+    const candidates = getDaemonSocketCandidates('C:/dev/proj');
+    expect(candidates).toHaveLength(1);
+    expect(candidates[0]!.startsWith('\\\\.\\pipe\\codegraph-')).toBe(true);
+  });
+
+  it('getDaemonSocketPath returns the preferred candidate (index 0)', () => {
+    const root = path.join(os.tmpdir(), 'cg-cand-primary');
+    expect(getDaemonSocketPath(root)).toBe(getDaemonSocketCandidates(root)[0]);
+  });
+});
+
+describe('bindFirstUsableSocket (#997)', () => {
+  it('binds the first candidate when it works, without relocating', async () => {
+    const tried: string[] = [];
+    const relocations: string[] = [];
+    const result = await bindFirstUsableSocket(
+      ['/proj/.codegraph/daemon.sock', '/tmp/fallback.sock'],
+      (p) => { tried.push(p); return Promise.resolve(fakeServer(p)); },
+      { onRelocate: (from, to) => relocations.push(`${from}->${to}`) },
+    );
+    expect(result.socketPath).toBe('/proj/.codegraph/daemon.sock');
+    expect(tried).toEqual(['/proj/.codegraph/daemon.sock']); // never touched the fallback
+    expect(relocations).toEqual([]);
+  });
+
+  it('relocates to the tmpdir fallback when the in-project bind throws ENOTSUP', async () => {
+    const tried: string[] = [];
+    const relocations: Array<[string, string, string]> = [];
+    const result = await bindFirstUsableSocket(
+      ['/exfat/proj/.codegraph/daemon.sock', '/tmp/fallback.sock'],
+      (p) => {
+        tried.push(p);
+        if (p.includes('/exfat/')) return Promise.reject(errno('ENOTSUP'));
+        return Promise.resolve(fakeServer(p));
+      },
+      { onRelocate: (from, to, code) => relocations.push([from, to, code]) },
+    );
+    expect(result.socketPath).toBe('/tmp/fallback.sock');
+    expect(tried).toEqual(['/exfat/proj/.codegraph/daemon.sock', '/tmp/fallback.sock']);
+    expect(relocations).toEqual([
+      ['/exfat/proj/.codegraph/daemon.sock', '/tmp/fallback.sock', 'ENOTSUP'],
+    ]);
+  });
+
+  it('does NOT relocate on EADDRINUSE — it propagates even with a fallback present', async () => {
+    const tried: string[] = [];
+    await expect(
+      bindFirstUsableSocket(
+        ['/proj/.codegraph/daemon.sock', '/tmp/fallback.sock'],
+        (p) => { tried.push(p); return Promise.reject(errno('EADDRINUSE')); },
+      ),
+    ).rejects.toMatchObject({ code: 'EADDRINUSE' });
+    expect(tried).toEqual(['/proj/.codegraph/daemon.sock']); // fallback never tried
+  });
+
+  it('propagates a capability error on the LAST candidate (nowhere left to go)', async () => {
+    // When tmpdir itself can't host a socket, the single-candidate long-path list
+    // (or the exhausted tail of a longer one) has no fallback — the daemon must
+    // surface the error so the launcher drops to direct mode (#974).
+    await expect(
+      bindFirstUsableSocket(
+        ['/tmp/only.sock'],
+        () => Promise.reject(errno('ENOTSUP')),
+      ),
+    ).rejects.toMatchObject({ code: 'ENOTSUP' });
+  });
+
+  it('walks past multiple unusable candidates to the first that binds', async () => {
+    const tried: string[] = [];
+    const result = await bindFirstUsableSocket(
+      ['/a.sock', '/b.sock', '/c.sock'],
+      (p) => {
+        tried.push(p);
+        if (p === '/a.sock') return Promise.reject(errno('ENOTSUP'));
+        if (p === '/b.sock') return Promise.reject(errno('EACCES'));
+        return Promise.resolve(fakeServer(p));
+      },
+    );
+    expect(result.socketPath).toBe('/c.sock');
+    expect(tried).toEqual(['/a.sock', '/b.sock', '/c.sock']);
+  });
+
+  it('relocates on an UNEXPECTED errno too — the policy is "anything but EADDRINUSE", not a fixed list', async () => {
+    // ExFAT/FAT report different bind errnos per OS (ENOTSUP macOS, EPERM Linux),
+    // so we must NOT gate relocation on an enumerated set — a code we never
+    // anticipated must still fall through to tmpdir. 'EWEIRD' stands in for any
+    // such surprise.
+    const result = await bindFirstUsableSocket(
+      ['/odd/proj/.codegraph/daemon.sock', '/tmp/fallback.sock'],
+      (p) => p.includes('/odd/') ? Promise.reject(errno('EWEIRD')) : Promise.resolve(fakeServer(p)),
+    );
+    expect(result.socketPath).toBe('/tmp/fallback.sock');
+  });
+});
+
+describe('lock acquisition without hard links (#997)', () => {
+  // The hard-link-FAILS path (link() → O_EXCL fallback) can't be forced on a
+  // normal FS — fs.linkSync's namespace export is non-configurable, so it can't
+  // be spied. It's proven instead end-to-end on real ExFAT/FAT/exFAT volumes
+  // (macOS ENOTSUP, Linux EPERM, Windows EISDIR — all acquire via the fallback).
+  // Here we just guard that the refactored catch block didn't break the normal
+  // link path: a clean acquire, and a second caller correctly sees it held.
+  it.runIf(POSIX)('tryAcquireDaemonLock still acquires on a normal FS, and a second caller is told it is taken', () => {
+    const root = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-lock-'));
+    tmpDirs.push(root);
+
+    const first = tryAcquireDaemonLock(root);
+    expect(first.kind).toBe('acquired');
+    const pidPath = getDaemonPidPath(root);
+    expect(fs.existsSync(pidPath)).toBe(true);
+    expect(decodeLockInfo(fs.readFileSync(pidPath, 'utf8'))?.pid).toBe(process.pid);
+
+    const second = tryAcquireDaemonLock(root); // link() → EEXIST → taken
+    expect(second.kind).toBe('taken');
+    if (second.kind === 'taken') expect(second.existing?.pid).toBe(process.pid);
+  });
+
+  it.runIf(POSIX)('acquireLockViaExclusiveOpen creates the pidfile with a complete, parseable record', () => {
+    const pidPath = path.join(os.tmpdir(), `cg-excl-${process.pid}-${Date.now()}.pid`);
+    tmpFiles.push(pidPath);
+    const info: DaemonLockInfo = {
+      pid: 4242,
+      version: '9.9.9-test',
+      socketPath: '/tmp/whatever.sock',
+      startedAt: 1_700_000_000_000,
+    };
+
+    const acquired = acquireLockViaExclusiveOpen(pidPath, info);
+    expect(acquired).toBe(true);
+    // The file is non-empty and decodes back to exactly what we wrote — i.e. no
+    // empty-file window left behind for a reader to mistake for a corrupt lock.
+    expect(decodeLockInfo(fs.readFileSync(pidPath, 'utf8'))).toEqual(info);
+  });
+
+  it.runIf(POSIX)('acquireLockViaExclusiveOpen is exclusive: the second caller loses (EEXIST → false)', () => {
+    const pidPath = path.join(os.tmpdir(), `cg-excl2-${process.pid}-${Date.now()}.pid`);
+    tmpFiles.push(pidPath);
+    const winner: DaemonLockInfo = { pid: 1, version: 'a', socketPath: '/s1', startedAt: 1 };
+    const loser: DaemonLockInfo = { pid: 2, version: 'b', socketPath: '/s2', startedAt: 2 };
+
+    expect(acquireLockViaExclusiveOpen(pidPath, winner)).toBe(true);
+    expect(acquireLockViaExclusiveOpen(pidPath, loser)).toBe(false); // does not clobber
+    // The winner's record is intact — the loser never overwrote it.
+    expect(decodeLockInfo(fs.readFileSync(pidPath, 'utf8'))).toEqual(winner);
+  });
+});

+ 50 - 9
src/mcp/daemon-paths.ts

@@ -16,6 +16,16 @@
  * an absolute-path hash under `os.tmpdir()`. The pidfile always stays in the
  * project (it doesn't have a length limit) — and acts as the authoritative
  * pointer to the socket path the daemon chose.
+ *
+ * Second special-case (#997, #974): some filesystems can't host an AF_UNIX node
+ * AT ALL — ExFAT/FAT external volumes, certain network mounts, WSL2 DrvFs — so
+ * `listen()` throws ENOTSUP/EACCES regardless of path length. We can't cheaply
+ * tell those apart from a normal volume up front, so instead of guessing we
+ * expose an ORDERED candidate list (`getDaemonSocketCandidates`): the in-project
+ * path first, the deterministic tmpdir path as the fallback of last resort. The
+ * daemon binds the first that works (relocating past a capability error); the
+ * proxy connects the first that answers. Both walk the SAME list, so they still
+ * converge on whichever the daemon bound with zero coordination.
  */
 
 import * as crypto from 'crypto';
@@ -32,19 +42,50 @@ function projectHash(projectRoot: string): string {
 }
 
 /**
- * Compute the socket / named-pipe path the daemon should listen on (and the
- * proxy should connect to) for `projectRoot`. Deterministic given a project
- * root, so independent processes converge without coordination.
+ * The deterministic tmpdir socket path for `projectRoot` — the fallback used
+ * when the in-project location can't host a socket (too long, or an FS that
+ * doesn't support AF_UNIX). Hash keeps it project-scoped, and being purely a
+ * function of the root means the daemon and the proxy compute the identical
+ * path without talking to each other.
  */
-export function getDaemonSocketPath(projectRoot: string): string {
+function tmpdirSocketPath(projectRoot: string): string {
+  return path.join(os.tmpdir(), `codegraph-${projectHash(projectRoot)}.sock`);
+}
+
+/**
+ * Ordered socket / named-pipe path candidates the daemon should try to bind (and
+ * the proxy should try to connect) for `projectRoot`, most-preferred first.
+ * Deterministic given a project root, so independent processes converge without
+ * coordination — even when the preferred candidate is unusable and both fall
+ * through to the same fallback.
+ *
+ *   - Windows: a single named pipe (lives in the kernel pipe namespace, not on
+ *     the project FS, so neither the length nor the ExFAT hazard applies).
+ *   - Short in-project path: `[ .codegraph/daemon.sock , <tmpdir> ]` — try the
+ *     project first, fall back to tmpdir if its FS can't host a socket (#997).
+ *   - Long in-project path (deep monorepos, Bazel out dirs): `[ <tmpdir> ]` only
+ *     — bind would throw ENAMETOOLONG, so we skip straight to tmpdir.
+ */
+export function getDaemonSocketCandidates(projectRoot: string): string[] {
   if (process.platform === 'win32') {
-    return `\\\\.\\pipe\\codegraph-${projectHash(projectRoot)}`;
+    return [`\\\\.\\pipe\\codegraph-${projectHash(projectRoot)}`];
   }
   const inProject = path.join(getCodeGraphDir(projectRoot), 'daemon.sock');
-  if (inProject.length <= POSIX_SOCKET_PATH_LIMIT) return inProject;
-  // Long project paths (deep monorepos, Bazel out dirs) need tmpdir fallback
-  // or `bind` returns EADDRINUSE / ENAMETOOLONG. Hash keeps it project-scoped.
-  return path.join(os.tmpdir(), `codegraph-${projectHash(projectRoot)}.sock`);
+  const tmp = tmpdirSocketPath(projectRoot);
+  if (inProject.length > POSIX_SOCKET_PATH_LIMIT) return [tmp];
+  return [inProject, tmp];
+}
+
+/**
+ * The PREFERRED (primary) socket path — candidate 0. Use this only where a
+ * single representative path is wanted (the lockfile's informational
+ * `socketPath` field, status display). For binding/connecting, walk the full
+ * {@link getDaemonSocketCandidates} list — the daemon may bind a fallback when
+ * candidate 0 is unusable.
+ */
+export function getDaemonSocketPath(projectRoot: string): string {
+  // The candidate list is never empty (≥1 on every platform), so [0] is safe.
+  return getDaemonSocketCandidates(projectRoot)[0]!;
 }
 
 /** Absolute path to the daemon pid lockfile for `projectRoot`. */

+ 6 - 2
src/mcp/daemon-registry.ts

@@ -22,7 +22,7 @@ import * as fs from 'fs';
 import * as os from 'os';
 import * as path from 'path';
 import * as crypto from 'crypto';
-import { getDaemonPidPath, getDaemonSocketPath, decodeLockInfo } from './daemon-paths';
+import { getDaemonPidPath, getDaemonSocketCandidates, decodeLockInfo } from './daemon-paths';
 
 export interface DaemonRecord {
   /** Realpath'd project root the daemon serves. */
@@ -118,8 +118,12 @@ export function listDaemons(opts: { prune?: boolean } = {}): DaemonRecord[] {
 function cleanupDaemonArtifacts(root: string): void {
   try { fs.unlinkSync(getDaemonPidPath(root)); } catch { /* gone */ }
   // POSIX sockets are real files; Windows named pipes vanish with the process.
+  // Sweep every candidate — a daemon that relocated past an unusable in-project
+  // FS (ExFAT/FAT; #997) left its socket at the tmpdir fallback, not candidate 0.
   if (process.platform !== 'win32') {
-    try { fs.unlinkSync(getDaemonSocketPath(root)); } catch { /* gone */ }
+    for (const candidate of getDaemonSocketCandidates(root)) {
+      try { fs.unlinkSync(candidate); } catch { /* gone */ }
+    }
   }
   deregisterDaemon(root);
 }

+ 162 - 28
src/mcp/daemon.ts

@@ -51,6 +51,7 @@ import {
   decodeLockInfo,
   encodeLockInfo,
   getDaemonPidPath,
+  getDaemonSocketCandidates,
   getDaemonSocketPath,
 } from './daemon-paths';
 import { CodeGraphPackageVersion } from './version';
@@ -169,39 +170,60 @@ export class Daemon {
     // (cross-project tool calls only) shouldn't pay any open cost.
     void this.engine.ensureInitialized(this.projectRoot);
 
-    // Stale socket file (left over from a SIGKILL'd previous daemon) will
-    // wedge `listen` with EADDRINUSE. We arrived here holding the lockfile,
-    // which means there's no live daemon, so it's safe to clear.
-    if (process.platform !== 'win32') {
-      try { fs.unlinkSync(this.socketPath); } catch { /* not-exists is fine */ }
-    }
-
-    await new Promise<void>((resolve, reject) => {
-      const server = net.createServer((socket) => this.handleConnection(socket));
-      server.once('error', (err) => reject(err));
-      server.listen(this.socketPath, () => {
-        // POSIX: tighten permissions to user-only — the socket lives under
-        // `.codegraph/`, which is git-ignored but may be on a shared FS.
+    // Walk the ordered socket candidates and bind the first that works. The
+    // in-project path comes first; the deterministic tmpdir path is the fallback
+    // for a filesystem that can't host an AF_UNIX node at all (ExFAT/FAT external
+    // volumes, some network mounts, WSL2 DrvFs → ENOTSUP/EACCES; #997, #974). The
+    // `listen` closure clears a stale socket (left by a SIGKILL'd previous daemon)
+    // before each attempt — safe because we hold the lockfile, so no live daemon
+    // owns it; without it `listen` would wedge on EADDRINUSE.
+    const candidates = getDaemonSocketCandidates(this.projectRoot);
+    const listen = (socketPath: string): Promise<net.Server> =>
+      new Promise<net.Server>((resolve, reject) => {
         if (process.platform !== 'win32') {
-          try { fs.chmodSync(this.socketPath, 0o600); } catch { /* best-effort */ }
+          try { fs.unlinkSync(socketPath); } catch { /* not-exists is fine */ }
         }
-        this.server = server;
-        resolve();
+        const server = net.createServer((socket) => this.handleConnection(socket));
+        server.once('error', reject);
+        server.listen(socketPath, () => {
+          // POSIX: tighten permissions to user-only — the socket lives under
+          // `.codegraph/` (git-ignored, maybe a shared FS) or tmpdir.
+          if (process.platform !== 'win32') {
+            try { fs.chmodSync(socketPath, 0o600); } catch { /* best-effort */ }
+          }
+          resolve(server);
+        });
+      });
+
+    let bound: { server: net.Server; socketPath: string };
+    try {
+      bound = await bindFirstUsableSocket(candidates, listen, {
+        onRelocate: (from, to, code) =>
+          process.stderr.write(
+            `[CodeGraph daemon] Socket ${from} unusable (${code}); relocating to ${to}.\n`
+          ),
       });
-    }).catch((err) => {
-      // Bind failed — e.g. AF_UNIX is unsupported/unreliable on this filesystem
-      // (the WSL2 DrvFs hazard behind #974), or a stale socket we couldn't clear.
-      // We already hold the lockfile that `tryAcquireDaemonLock` wrote; release it
-      // and any partial socket so the NEXT launcher doesn't spin respawning us on
-      // a stale lock that points at our now-dying pid. Then re-throw so the caller
-      // (the bin's try/catch) exits this detached daemon cleanly and every
-      // launcher falls back to direct mode.
+    } catch (err) {
+      // Every candidate failed (the last one, or a non-relocatable error like a
+      // racing EADDRINUSE). We already hold the lockfile `tryAcquireDaemonLock`
+      // wrote; release it and any partial sockets so the NEXT launcher doesn't
+      // spin respawning us on a stale lock pointing at our now-dying pid. Then
+      // re-throw so the caller (the bin's try/catch) exits this detached daemon
+      // cleanly and every launcher falls back to direct mode (#974).
       this.cleanupLockfile();
       if (process.platform !== 'win32') {
-        try { fs.unlinkSync(this.socketPath); } catch { /* may not exist */ }
+        for (const candidate of candidates) {
+          try { fs.unlinkSync(candidate); } catch { /* may not exist */ }
+        }
       }
       throw err;
-    });
+    }
+
+    this.server = bound.server;
+    // Adopt the path we ACTUALLY bound — it may be a tmpdir fallback past an
+    // unusable in-project location. Everything downstream (lockfile, registry,
+    // chmod, cleanup, status) keys off this real path, not the preferred guess.
+    this.socketPath = bound.socketPath;
 
     const lock: DaemonLockInfo = {
       pid: process.pid,
@@ -210,6 +232,19 @@ export class Daemon {
       startedAt: Date.now(),
     };
 
+    // `tryAcquireDaemonLock` wrote the pidfile with the PREFERRED path (candidate
+    // 0) before we knew which one would bind. If we relocated, rewrite it so the
+    // per-project record is honest. Atomic temp+rename; safe because we hold the
+    // lock and we're alive — `clearStaleDaemonLock` pid-verifies, so no racing
+    // candidate clears or clobbers a live daemon's lock.
+    if (this.socketPath !== candidates[0]) {
+      try {
+        const tmpPid = `${this.pidPath}.${process.pid}.relocate`;
+        fs.writeFileSync(tmpPid, encodeLockInfo(lock), { mode: 0o600 });
+        fs.renameSync(tmpPid, this.pidPath);
+      } catch { /* best-effort; the registry record below carries the real path */ }
+    }
+
     // Drop a discovery record so `codegraph list` / `stop --all` can find us.
     // Best-effort; a missing record only means list's liveness prune covers it.
     registerDaemon({ root: this.projectRoot, ...lock });
@@ -433,6 +468,17 @@ export type AcquireResult =
  * the pidfile becomes visible in one step already containing a full record.
  * Whoever links first wins; everyone else gets EEXIST and reads a complete file.
  * There is no empty-file window at all.
+ *
+ * Filesystems without hard links (#997): ExFAT/FAT external volumes and some
+ * network mounts can't `link()` at all — it throws ENOTSUP/EPERM, which would
+ * otherwise kill the daemon before it ever reaches the socket bind. There we
+ * fall back to an O_EXCL create (`acquireLockViaExclusiveOpen`): still exclusive
+ * ("first writer wins"), but the full record is written through the fd in a
+ * second step, so the empty-file window the link approach removed is reopened —
+ * only on these filesystems, only for the microseconds between create and write
+ * (far narrower than the original bug, which the file watcher's startup latency
+ * widened). The race's worst case is two daemons briefly; on a single external
+ * drive that's strictly better than the daemon never starting at all.
  */
 export function tryAcquireDaemonLock(projectRoot: string): AcquireResult {
   const pidPath = getDaemonPidPath(projectRoot);
@@ -453,10 +499,21 @@ export function tryAcquireDaemonLock(projectRoot: string): AcquireResult {
   try {
     fs.writeFileSync(tmp, encodeLockInfo(info), { mode: 0o600 });
     try {
-      fs.linkSync(tmp, pidPath); // atomic + exclusive
+      fs.linkSync(tmp, pidPath); // atomic + exclusive (race-free; see must-fix 1)
       acquired = true;
     } catch (err: unknown) {
-      if ((err as NodeJS.ErrnoException).code !== 'EEXIST') throw err;
+      if ((err as NodeJS.ErrnoException).code === 'EEXIST') {
+        // Lost the race — another candidate already holds it. Fall through to read.
+      } else {
+        // link() failed for a non-conflict reason — nearly always "this filesystem
+        // has no hard links" (ExFAT/FAT external volumes, some network mounts),
+        // which surfaces as a DIFFERENT errno on every OS: ENOTSUP on macOS, EPERM
+        // on Linux, EISDIR on Windows (#997). Enumerating them is whack-a-mole and
+        // unnecessary: the `tmp` write above already proved this directory is
+        // writable, so an O_EXCL create is a valid atomic+exclusive substitute. If
+        // IT fails too, that's a genuine error and propagates. EEXIST ⇒ taken.
+        acquired = acquireLockViaExclusiveOpen(pidPath, info);
+      }
     }
   } finally {
     try { fs.unlinkSync(tmp); } catch { /* temp already gone */ }
@@ -474,6 +531,31 @@ export function tryAcquireDaemonLock(projectRoot: string): AcquireResult {
   return { kind: 'taken', existing, pidPath };
 }
 
+/**
+ * Exclusive-create the pidfile (O_CREAT|O_EXCL via the `wx` flag) and write the
+ * full record through the same fd — the hard-link-free fallback used by
+ * {@link tryAcquireDaemonLock} on filesystems without `link()`. Returns true if
+ * we created it (acquired the lock), false on EEXIST (another candidate holds
+ * it). Any other error propagates. Still exclusive, so "first writer wins" holds
+ * exactly as the link path does; the only difference is the brief empty-file
+ * window between create and write. Exported for testing.
+ */
+export function acquireLockViaExclusiveOpen(pidPath: string, info: DaemonLockInfo): boolean {
+  let fd: number;
+  try {
+    fd = fs.openSync(pidPath, 'wx', 0o600); // O_CREAT | O_EXCL | O_WRONLY
+  } catch (err: unknown) {
+    if ((err as NodeJS.ErrnoException).code === 'EEXIST') return false;
+    throw err;
+  }
+  try {
+    fs.writeSync(fd, encodeLockInfo(info));
+  } finally {
+    fs.closeSync(fd);
+  }
+  return true;
+}
+
 /**
  * Remove a stale pidfile, but only if it still names a dead process. Re-reads
  * the file immediately before unlinking so we never delete a lock that a live
@@ -520,6 +602,58 @@ export function isProcessAlive(pid: number): boolean {
   }
 }
 
+/**
+ * The one `listen()` error we must NOT relocate past. EADDRINUSE means the path
+ * is genuinely occupied — a racing daemon that legitimately owns it, or a
+ * leftover node we couldn't clear (the #974 planted-dir case) — so relocating
+ * would abandon a path another daemon owns; the caller instead releases its lock
+ * and falls back to direct mode. EVERY OTHER bind error just means "this path
+ * didn't work," almost always a filesystem that can't host an AF_UNIX node at all
+ * (ExFAT/FAT, network mounts, WSL2 DrvFs), which reports a DIFFERENT errno per OS
+ * (ENOTSUP macOS, EPERM Linux; #997). Enumerating the "unsupported" codes is
+ * whack-a-mole, so we relocate on anything-but-conflict instead — robust and
+ * self-correcting: if the deterministic tmpdir fallback ALSO fails, that error
+ * propagates from the last candidate. (ENAMETOOLONG never reaches here — the
+ * candidate list already routes over-long paths straight to tmpdir.)
+ */
+const SOCKET_BIND_CONFLICT_CODE = 'EADDRINUSE';
+
+/**
+ * Bind the first usable socket from an ordered candidate list, relocating past
+ * any path that fails to bind for a non-conflict reason (see {@link
+ * SOCKET_BIND_CONFLICT_CODE}). The injected `listen` does the real
+ * `net.Server.listen` (and stale-socket clear); abstracted so the relocation
+ * policy is unit-testable without a real unsupported filesystem. Returns the
+ * server plus the path actually bound. An EADDRINUSE, or any error on the LAST
+ * candidate, propagates — the caller releases the lockfile and falls back to
+ * direct mode (#974). Exported for testing.
+ */
+export async function bindFirstUsableSocket(
+  candidates: string[],
+  listen: (socketPath: string) => Promise<net.Server>,
+  opts: { onRelocate?: (from: string, to: string, code: string) => void } = {},
+): Promise<{ server: net.Server; socketPath: string }> {
+  let lastErr: unknown;
+  for (let i = 0; i < candidates.length; i++) {
+    const socketPath = candidates[i]!; // i < length, so always defined
+    const isLast = i === candidates.length - 1;
+    try {
+      const server = await listen(socketPath);
+      return { server, socketPath };
+    } catch (err) {
+      lastErr = err;
+      const code = (err as NodeJS.ErrnoException).code;
+      if (!isLast && code !== SOCKET_BIND_CONFLICT_CODE) {
+        opts.onRelocate?.(socketPath, candidates[i + 1]!, code ?? ''); // !isLast ⇒ i+1 in range
+        continue;
+      }
+      throw err;
+    }
+  }
+  // Only reachable with an empty candidate list — a programmer error.
+  throw lastErr ?? new Error('no socket candidates to bind');
+}
+
 function resolveIdleTimeoutMs(): number {
   const raw = process.env.CODEGRAPH_DAEMON_IDLE_TIMEOUT_MS;
   if (raw === undefined || raw === '') return DEFAULT_IDLE_TIMEOUT_MS;
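
Note on the lock fallback added in this file: the exclusive-create path can be sketched in isolation as below. This is a minimal illustration that assumes only Node's `fs`, `os`, and `path` modules. `tryExclusiveCreate`, the temp directory, and the JSON payload are hypothetical stand-ins, not the project's `acquireLockViaExclusiveOpen` or `encodeLockInfo`.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical stand-in for the diff's fallback: the 'wx' flag maps to
// O_CREAT | O_EXCL, so the open fails with EEXIST if the file already exists.
function tryExclusiveCreate(lockPath: string, payload: string): boolean {
  let fd: number;
  try {
    fd = fs.openSync(lockPath, 'wx', 0o600);
  } catch (err: unknown) {
    // EEXIST means another caller created the file first; anything else is a real error.
    if ((err as { code?: string }).code === 'EEXIST') return false;
    throw err;
  }
  try {
    fs.writeSync(fd, payload);
  } finally {
    fs.closeSync(fd);
  }
  return true;
}

// Two attempts against the same path in a scratch directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'lock-demo-'));
const lockPath = path.join(dir, 'daemon.pid');
const first = tryExclusiveCreate(lockPath, JSON.stringify({ pid: process.pid }));
const second = tryExclusiveCreate(lockPath, JSON.stringify({ pid: -1 }));
const stored = fs.readFileSync(lockPath, 'utf8');
fs.rmSync(dir, { recursive: true, force: true });
```

The first call creates the file and writes its record; the second call sees EEXIST and reports the lock as taken without touching the stored record. No hard link is involved, which is why this pattern can stand in for `link()` on a filesystem that lacks it.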

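Note on the socket relocation policy in `bindFirstUsableSocket`: the sketch below replays the same rule (relocate past any non-EADDRINUSE error unless the candidate is the last one) against a fake `listen`, so it runs without a real ExFAT/FAT volume. Everything here is an assumption for illustration: `bindFirstUsable`, `fakeListen`, `errnoError`, and both socket paths are hypothetical, and the "server" is just a string rather than a `net.Server`.

```typescript
type Listen<S> = (socketPath: string) => Promise<S>;

// Simplified re-statement of the relocation policy, generic over the server type.
async function bindFirstUsable<S>(
  candidates: string[],
  listen: Listen<S>,
  onRelocate?: (from: string, to: string, code: string) => void,
): Promise<{ server: S; socketPath: string }> {
  for (let i = 0; i < candidates.length; i++) {
    const socketPath = candidates[i]!;
    try {
      return { server: await listen(socketPath), socketPath };
    } catch (err: unknown) {
      const code = (err as { code?: string }).code;
      const isLast = i === candidates.length - 1;
      // A conflict, or any failure on the last candidate, propagates to the caller.
      if (isLast || code === 'EADDRINUSE') throw err;
      onRelocate?.(socketPath, candidates[i + 1]!, code ?? '');
    }
  }
  throw new Error('no socket candidates to bind');
}

// Build an Error carrying an errno-style `code`, like Node's system errors do.
function errnoError(code: string): Error {
  return Object.assign(new Error(code), { code });
}

// Hypothetical paths: an in-project socket on an ExFAT volume, then a tmpdir fallback.
const inProject = '/Volumes/EXFAT/proj/.codegraph/demo.sock';
const fallback = '/tmp/codegraph-demo.sock';

// Fake listen: the ExFAT path cannot host a socket; any other path "binds".
const fakeListen: Listen<string> = async (p) => {
  if (p.startsWith('/Volumes/EXFAT/')) throw errnoError('ENOTSUP');
  return `server@${p}`;
};

async function demo() {
  const relocations: string[] = [];
  const bound = await bindFirstUsable([inProject, fallback], fakeListen, (from, to, code) =>
    relocations.push(`${code}: ${from} -> ${to}`),
  );
  // EADDRINUSE on the first candidate must not relocate; it rethrows immediately.
  let conflict: string | undefined;
  try {
    await bindFirstUsable([inProject, fallback], async (): Promise<string> => {
      throw errnoError('EADDRINUSE');
    });
  } catch (err: unknown) {
    conflict = (err as { code?: string }).code;
  }
  return { bound, relocations, conflict };
}

void demo().then((result) => console.log(result));
```

Because the policy keys on "anything but conflict" rather than a list of errnos, the same fake works whether the unsupported-filesystem error is ENOTSUP, EPERM, or something else entirely.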
+ 22 - 5
src/mcp/index.ts

@@ -48,7 +48,7 @@ import {
   tryAcquireDaemonLock,
 } from './daemon';
 import { connectWithHello, runLocalHandshakeProxy } from './proxy';
-import { getDaemonSocketPath } from './daemon-paths';
+import { getDaemonSocketCandidates } from './daemon-paths';
 import { getTelemetry } from '../telemetry';
 import { supervisionLostReason, parsePpidPollMs, parseHostPpid } from './ppid-watchdog';
 import { installMainThreadWatchdog, WatchdogHandle } from './liveness-watchdog';
@@ -376,17 +376,34 @@ export class MCPServer {
    * never wedges a session.
    */
   private async runProxyWithLocalHandshake(root: string): Promise<void> {
-    const socketPath = getDaemonSocketPath(root);
+    // The daemon may relocate its socket past an in-project filesystem that can't
+    // host one (ExFAT/FAT volumes, WSL2 DrvFs; #997) to the deterministic tmpdir
+    // fallback. We don't read the bound path from the lockfile — both sides walk
+    // the SAME ordered candidate list, so we converge on whichever the daemon
+    // bound with zero coordination. The in-project candidate is tried first, so a
+    // normal repo pays nothing extra (it connects on the very first probe).
+    const candidates = getDaemonSocketCandidates(root);
+    const connectAnyCandidate = async (): Promise<Awaited<ReturnType<typeof connectWithHello>>> => {
+      for (const candidate of candidates) {
+        const s = await connectWithHello(candidate);
+        // A wrong-version daemon IS up — definitive; propagate so the caller
+        // serves in-process instead of spawning + polling for 6s. Don't keep
+        // probing fallbacks past it.
+        if (s === 'version-mismatch') return s;
+        if (s) return s;
+      }
+      return null;
+    };
     const getDaemonSocket = async () => {
-      // Fast path: a daemon may already be listening.
-      const probe = await connectWithHello(socketPath);
+      // Fast path: a daemon may already be listening (on either candidate).
+      const probe = await connectAnyCandidate();
       if (probe === 'version-mismatch') return null; // definitive — serve in-process, don't poll for 6s
       if (probe) return probe;
       // None reachable — spawn one (detached) and poll for its bind.
       spawnDetachedDaemon(root);
       for (let attempt = 0; attempt < DAEMON_CONNECT_MAX_RETRIES; attempt++) {
         await sleep(DAEMON_CONNECT_RETRY_DELAY_MS);
-        const s = await connectWithHello(socketPath);
+        const s = await connectAnyCandidate();
         if (s === 'version-mismatch') return null;
         if (s) return s;
       }