瀏覽代碼

fix(sync): degrade auto-sync on a persistent non-lock sync failure (#1127) (#1128)

FileWatcher.flush() bounded only two failure modes — lock contention
(backoff + degrade past MAX_LOCK_RETRIES) and watch-resource exhaustion
(degrade at setup). Its generic catch branch — any *other* sync error —
reset the only circuit breaker (lockRetryCount = 0) and fell through to
scheduleSync() at the normal debounce cadence, forever, with no backoff
and no degrade().

The trigger is realistic, not synthetic: CodeGraph.sync() runs the whole
extract -> resolve -> maintenance pipeline inside try/finally(release) with
no catch, so a deterministic failure (a tree-sitter extractor that crashes
on one file, SQLITE_FULL, an OOM in batched resolution) propagates straight
into that unbounded branch — wedging a long-running daemon/MCP session into
~1,800 failing syncs + log lines/hour while the auto-update guarantee is
silently dead.

Mirror the lock circuit breaker for the generic branch: a separate
consecutive-failure counter (syncFailureRetryCount) reset only by a clean
sync, exponential backoff via the shared finally, and degrade() past
MAX_SYNC_FAILURE_RETRIES with an actionable reason naming the underlying
error. degrade() -> onDegraded/isDegraded() is what surfaces the dead
guarantee (the staleness banner already consumes it) — a lighter flat-retry
would keep it hidden, which is the core of the #876/#1127 complaint.
Reset-on-success means a transient hiccup never degrades.

The lock path is behaviorally unchanged: in any pure-lock scenario
syncFailureRetryCount stays 0, so Math.max(lockRetryCount,
syncFailureRetryCount) and the degrade threshold behave exactly as before.
Renamed MAX_LOCK_RETRY_DELAY_MS -> MAX_RETRY_BACKOFF_MS (shared cap).

Adds two regression tests mirroring the lock-contention ones: a persistent
non-lock failure degrades past the budget; a transient one recovers without
degrading.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 17 小時之前
父節點
當前提交
7c7514f43f
共有 3 個文件被更改,包括 124 次插入17 次删除
  1. 1 0
      CHANGELOG.md
  2. 65 0
      __tests__/watcher.test.ts
  3. 58 17
      src/sync/watcher.ts

+ 1 - 0
CHANGELOG.md

@@ -30,6 +30,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - The set of C++ libraries whose macros are recognized for full return-type recovery was expanded well beyond Unreal Engine — now spanning Mozilla, Protobuf, {fmt}, nlohmann/json, GLM, Bullet, Skia, OpenCV, EASTL, Cocos2d-x, GLib, SQLite, and the common Windows calling conventions (so `HRESULT WINAPI CreateThing(...)` indexes as `CreateThing` returning `HRESULT`). Functions from libraries not on the list still get their name recovered automatically; being listed additionally recovers the return type. (#1103)
 - Graph traversal and blast-radius results no longer drop or miscount relationships in a handful of edge cases. When a symbol could be reached by more than one path, an impact/blast-radius query could leave out a direct dependency between two symbols that were already linked another way; separately, the lower-level graph traversal used by the library API could keep only one of several relationships between the same pair of symbols (for example a symbol that both calls and references another), count a caller reached through two different call sites twice, or return slightly more results than the requested size limit on a very highly-connected symbol. These were long-standing and mostly masked by later de-duplication, so day-to-day query results were largely unaffected, but the traversal now returns the complete, correctly-bounded set. Thanks @inth3shadows for the precise, individually-traced reports. (#1086, #1087, #1088, #1089, #1090)
 - Method calls to same-named classes in different files now resolve to the right definition. If two files each declared, say, a `Logger` class with its own `log()` method, a call could be linked to whichever definition happened to be indexed first — so a call in one file wrongly pointed at the class in another, mixing up that method's callers and blast radius. This affected calls written as `obj.log()`, `Logger.log()`, and `Logger::log()` across many languages, including C++, Python, TypeScript, Java, C#, and Rust. When a method name is ambiguous, CodeGraph now prefers the definition in the calling file itself — the correct target in the common case — while Java/Kotlin calls that an `import` already pins to another file are unaffected. Thanks @inth3shadows for the minimal repro and root-cause analysis. (#1079)
+- Live auto-sync now gives up cleanly if indexing keeps failing, instead of retrying forever in the background. If a repeatable indexing error kept every sync from completing — for example a file that reliably crashes one language's parser, a full disk, or a corrupt database — a long-running server or daemon would retry every couple of seconds indefinitely, quietly wasting work and filling the logs while the graph silently stopped updating. Auto-sync now backs off after repeated failures and, if they persist, disables itself with an actionable message (naming the underlying error and pointing you to run `codegraph sync`), so the stalled index is surfaced rather than hidden — matching how prolonged file-lock contention and watch-limit exhaustion already degrade. A single successful sync resets everything, so an occasional transient hiccup is unaffected. Thanks @inth3shadows for the precise, code-traced report. (#1127)
 
 ## [1.1.6] - 2026-06-30
 

+ 65 - 0
__tests__/watcher.test.ts

@@ -324,6 +324,71 @@ describe('FileWatcher', () => {
     });
   });
 
+  describe('persistent sync-failure degradation (#1127)', () => {
+    it('disables auto-sync after a persistent non-lock sync failure, with bounded retries', async () => {
+      // A deterministic pipeline failure (broken extractor on a file, DB
+      // corruption, SQLITE_FULL, OOM) recurs every cycle. Unbounded it retried
+      // forever at the debounce cadence; it must now back off and degrade.
+      const syncFn = vi.fn().mockRejectedValue(new Error('extractor crashed on src/bad.ts'));
+      const onSyncComplete = vi.fn();
+      const onSyncError = vi.fn();
+      const onDegraded = vi.fn();
+      const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
+      const watcher = newWatcher(syncFn, {
+        debounceMs: 25,
+        onSyncComplete,
+        onSyncError,
+        onDegraded,
+      });
+      watcher.start();
+      await watcher.waitUntilReady();
+      __emitWatchEventForTests(testDir, 'src/persistent-fail.ts');
+
+      // 5 backoff retries (25·1,2,4,8,16 ms), then degrade on the 6th attempt.
+      await waitFor(() => !watcher.isActive(), 8000, 20);
+
+      expect(syncFn.mock.calls.length).toBeGreaterThanOrEqual(6); // MAX_SYNC_FAILURE_RETRIES + 1
+      expect(watcher.isDegraded()).toBe(true);
+      expect(onDegraded).toHaveBeenCalledTimes(1);
+      expect(onDegraded).toHaveBeenCalledWith(expect.stringContaining('auto-sync disabled'));
+      // The degrade reason carries the underlying error so the user can act.
+      expect(onDegraded).toHaveBeenCalledWith(expect.stringContaining('extractor crashed'));
+      // Unlike a held lock, a generic failure IS surfaced per-attempt.
+      expect(onSyncError.mock.calls.length).toBeGreaterThanOrEqual(6);
+      expect(onSyncComplete).not.toHaveBeenCalled();
+      // Degrade stops the watcher, which clears pending state.
+      expect(watcher.getPendingFiles()).toEqual([]);
+      const disableWarnings = warnSpy.mock.calls.filter(
+        (c) => typeof c[0] === 'string' && c[0].includes('File watcher disabled')
+      );
+      expect(disableWarnings).toHaveLength(1);
+    });
+
+    it('does NOT degrade on a transient sync failure — backoff resets after a clean sync', async () => {
+      const syncFn = vi
+        .fn()
+        .mockRejectedValueOnce(new Error('transient blip'))
+        .mockRejectedValueOnce(new Error('transient blip'))
+        .mockRejectedValueOnce(new Error('transient blip'))
+        .mockResolvedValue({ filesChanged: 1, durationMs: 5 });
+      const onDegraded = vi.fn();
+      const onSyncComplete = vi.fn();
+      const watcher = newWatcher(syncFn, { debounceMs: 25, onDegraded, onSyncComplete });
+      watcher.start();
+      await watcher.waitUntilReady();
+      __emitWatchEventForTests(testDir, 'src/transient-fail.ts');
+
+      await waitFor(() => onSyncComplete.mock.calls.length > 0, 4000, 20);
+
+      expect(onDegraded).not.toHaveBeenCalled();
+      expect(watcher.isDegraded()).toBe(false);
+      expect(watcher.isActive()).toBe(true);
+      expect(watcher.getPendingFiles().some((p) => p.path === 'src/transient-fail.ts')).toBe(false);
+
+      watcher.stop();
+    });
+  });
+
   describe('debounced sync', () => {
     it('should trigger sync after file change', async () => {
       const syncFn = vi.fn().mockResolvedValue({ filesChanged: 1, durationMs: 10 });

+ 58 - 17
src/sync/watcher.ts

@@ -46,8 +46,18 @@ import { watchDisabledReason } from './watch-policy';
  * few cycles) stays under this; a long-lived external writer crosses it.
  */
 const MAX_LOCK_RETRIES = 5;
-/** Cap on the exponential lock-retry backoff so it never sleeps absurdly long. */
-const MAX_LOCK_RETRY_DELAY_MS = 30_000;
+/**
+ * Number of consecutive GENERIC (non-lock) sync failures the watcher tolerates
+ * before it degrades auto-sync. A deterministic failure — a tree-sitter
+ * extractor that crashes on one file, DB corruption, `SQLITE_FULL`, an OOM in
+ * batched resolution — recurs every debounce cycle, so left unbounded it would
+ * retry forever (log + work spam) while the auto-update guarantee is silently
+ * dead (#1127). A single clean sync resets the streak, so a transient hiccup
+ * that recovers within the budget never degrades.
+ */
+const MAX_SYNC_FAILURE_RETRIES = 5;
+/** Cap on the exponential retry backoff (either mode) so it never sleeps absurdly long. */
+const MAX_RETRY_BACKOFF_MS = 30_000;
 
 /** Actionable degrade message; both exhaustion paths share it verbatim. */
 const EXHAUSTION_REASON =
@@ -167,8 +177,9 @@ export interface WatchOptions {
 
   /**
    * Callback fired ONCE when live watching degrades permanently and auto-sync
-   * is disabled — OS watch-resource exhaustion (EMFILE/ENFILE), or a write lock
-   * held past the retry budget. The string is an actionable, human-readable
+   * is disabled — OS watch-resource exhaustion (EMFILE/ENFILE), a write lock
+   * held past the retry budget, or a generic sync failure that persists past
+   * the retry budget (#1127). The string is an actionable, human-readable
    * reason. Lets a host (MCP server, daemon, CLI) tell the user that the index
    * will no longer auto-update instead of silently serving stale results.
    */
@@ -250,12 +261,15 @@ export class FileWatcher {
   private inotifyLimitWarned = false;
   /**
    * One-way latch: the reason live watching was permanently disabled at runtime
-   * (watch-resource exhaustion, or lock contention past the retry budget), or
-   * null while healthy. Set by {@link degrade}; cleared only by a fresh start().
+   * (watch-resource exhaustion, lock contention past the retry budget, or a
+   * persistent generic sync failure past the retry budget), or null while
+   * healthy. Set by {@link degrade}; cleared only by a fresh start().
    */
   private degradedReason: string | null = null;
   /** Consecutive lock-contention retries for watcher-triggered syncs. */
   private lockRetryCount = 0;
+  /** Consecutive generic (non-lock) sync failures; reset only by a clean sync. */
+  private syncFailureRetryCount = 0;
   /** Test-only inert mode: started, but with no OS watcher installed. */
   private inert = false;
   private debounceTimer: ReturnType<typeof setTimeout> | null = null;
@@ -329,6 +343,7 @@ export class FileWatcher {
     this.stopped = false;
     this.degradedReason = null;
     this.lockRetryCount = 0;
+    this.syncFailureRetryCount = 0;
 
     // Some environments make filesystem watching unusable — most notably
     // WSL2 /mnt/ drives, where the underlying fs.watch calls block long
@@ -590,7 +605,8 @@ export class FileWatcher {
 
   /**
    * Permanently disable live watching after a terminal runtime failure
-   * (watch-resource exhaustion, or lock contention past the retry budget).
+   * (watch-resource exhaustion, lock contention past the retry budget, or a
+   * persistent generic sync failure past the retry budget).
    * Idempotent: logs one actionable warning, fires {@link WatchOptions.onDegraded}
    * once, and stops the watcher. A subsequent start() clears the latch.
    */
@@ -661,6 +677,7 @@ export class FileWatcher {
     this.dirCapWarned = false;
     this.inotifyLimitWarned = false;
     this.lockRetryCount = 0;
+    this.syncFailureRetryCount = 0;
     // NB: degradedReason is intentionally NOT reset here — it must survive the
     // stop() that degrade() triggers so isDegraded() stays true. start() clears it.
     this.inert = false;
@@ -762,6 +779,7 @@ export class FileWatcher {
     try {
       const result = await this.syncFn();
       this.lockRetryCount = 0; // a clean sync clears any contention backoff
+      this.syncFailureRetryCount = 0; // ...and any generic-failure backoff
       // Remove entries whose most recent event predates this sync — those
       // edits are now in the DB. Entries with lastSeenMs > syncStartedMs
       // arrived mid-sync; whether the in-flight sync captured them depends
@@ -796,10 +814,30 @@ export class FileWatcher {
           );
         }
       } else {
-        this.lockRetryCount = 0; // a non-lock failure isn't contention; reset backoff
+        this.lockRetryCount = 0; // a non-lock failure isn't contention; reset that streak
+        this.syncFailureRetryCount += 1;
         const error = err instanceof Error ? err : new Error(String(err));
-        logWarn('Watch sync failed', { error: error.message });
+        logWarn('Watch sync failed', {
+          error: error.message,
+          retryCount: this.syncFailureRetryCount,
+        });
         this.onSyncError?.(error);
+        // A persistent (deterministic) sync failure — a broken extractor on a
+        // specific file, DB corruption, SQLITE_FULL, an OOM in resolution —
+        // would otherwise retry forever at the debounce cadence, spamming logs
+        // and work while the auto-update guarantee is silently dead (#1127).
+        // Bound it exactly like lock contention: the `finally` block backs off
+        // exponentially, and past the budget we degrade so the dead guarantee
+        // is surfaced (onDegraded / isDegraded) instead of hidden. A single
+        // clean sync resets the streak, so a transient hiccup never degrades.
+        if (this.syncFailureRetryCount > MAX_SYNC_FAILURE_RETRIES) {
+          this.degrade(
+            `CodeGraph auto-sync failed ${this.syncFailureRetryCount} times in a row; ` +
+              'auto-sync disabled. Run `codegraph sync` (or install git sync hooks) to ' +
+              `refresh the graph after changes. Last error: ${error.message}`,
+            { error: error.message, retryCount: this.syncFailureRetryCount }
+          );
+        }
       }
       // Failure: leave pendingFiles untouched. Every edit it tracks is
       // still unindexed; the rescheduled sync sees the same set.
@@ -807,16 +845,19 @@ export class FileWatcher {
       this.syncing = false;
 
       // If pending files remain (mid-sync events, or this sync failed),
-      // schedule another pass. After lock contention, back off exponentially
-      // (debounceMs · 2^(n-1), capped) instead of retrying at the normal
-      // debounce cadence; a clean sync resets lockRetryCount so normal edits
-      // keep the fast debounce. A degrade() above already set `stopped`, so
-      // this won't reschedule a watcher that has given up.
+      // schedule another pass. After EITHER failure mode — lock contention or a
+      // generic sync failure — back off exponentially (debounceMs · 2^(n-1),
+      // capped) instead of retrying at the normal debounce cadence; a clean
+      // sync resets both counters so normal edits keep the fast debounce. Use
+      // the larger streak so interleaved failures still back off. A degrade()
+      // above already set `stopped`, so this won't reschedule a watcher that
+      // has given up.
       if (this.pendingFiles.size > 0 && !this.stopped) {
-        if (this.lockRetryCount > 0) {
+        const retryCount = Math.max(this.lockRetryCount, this.syncFailureRetryCount);
+        if (retryCount > 0) {
           const retryDelayMs = Math.min(
-            this.debounceMs * 2 ** Math.max(0, this.lockRetryCount - 1),
-            MAX_LOCK_RETRY_DELAY_MS
+            this.debounceMs * 2 ** Math.max(0, retryCount - 1),
+            MAX_RETRY_BACKOFF_MS
           );
           this.scheduleRetrySync(retryDelayMs);
         } else {