
fix(index): rebuild a poisoned/oversized index by recreating the DB, not row-DELETE (#1067) (#1073)

Follow-up to #1065/#1066. Those stopped a *new* index from scanning an
ignored gitlink corpus, but a project that had already built the multi-GB
graph before upgrading still couldn't recover: `codegraph index` printed
only "Indexing project" and was then SIGKILLed (137) by the #850 watchdog
~60s later, before scanning even started.

Root cause is not the scanner. `index` cleared the old graph with a
synchronous `DELETE FROM nodes/edges/files`. `nodes` carries an FTS5
`AFTER DELETE` trigger, so deleting ~1.6M rows fires ~1.6M FTS
delete-markers — O(rows), and it grows the WAL further before it can
finish. A deterministic probe puts the DELETE-clear at 20.4s on 1.5M
synthetic nodes (WAL 1.16->2.14GB); at the report's denser ~2.6KB/node WAL
that crosses the 60s main-thread watchdog. `open()` was never the wedge.
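
One way to read the probe figures above is as a rough extrapolation. The sketch below is only that: it assumes (this is not stated in the commit) that the DELETE-clear time scales linearly with WAL bytes per node, that "GB" and "KB" are decimal units, and that 1.16GB is the probe's WAL size before the clear. The function name is hypothetical.

```typescript
// Figures quoted in the commit message.
const probeSeconds = 20.4;          // DELETE-clear time measured by the probe
const probeNodes = 1.5e6;           // synthetic nodes in the probe
const probeWalBytes = 1.16e9;       // probe WAL size before the clear (decimal GB assumed)
const reportWalBytesPerNode = 2600; // "~2.6KB/node" WAL density from the report
const watchdogSeconds = 60;         // main-thread watchdog window

// ASSUMPTION (not from the source): clear time grows linearly with WAL bytes per node.
export function estimateClearSeconds(): number {
  const probeWalBytesPerNode = probeWalBytes / probeNodes;
  const densityRatio = reportWalBytesPerNode / probeWalBytesPerNode;
  return probeSeconds * densityRatio;
}

const estimate = estimateClearSeconds();
console.log(
  `estimated clear: ${estimate.toFixed(1)}s, exceeds watchdog: ${estimate > watchdogSeconds}`,
);
```

Under those assumptions the estimate lands above the 60s window, which is consistent with the observed SIGKILL; the real scaling may well differ.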

A full re-index is documented as "same result as a fresh init", so make it
one: discard the database files and re-initialize, instead of opening the
old DB and DELETE-ing every row.

- db: add removeDatabaseFiles(dbPath) — unlinks codegraph.db + its
  -wal/-shm sidecars (O(1) regardless of size; sidecars best-effort).
- index: add CodeGraph.recreate(projectRoot) — discards the files and
  returns a fresh, empty instance. Never opens or migrates the poisoned
  DB. POSIX unlinks an open file fine (a live daemon heals via
  reopenIfReplaced, #925); a Windows file lock becomes an actionable
  "stop the daemon / remove .codegraph" error.
- cli: `codegraph index` now calls recreate() instead of open()+clear();
  both clear() calls dropped. The public clear() API is unchanged.
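
The first two bullets can be sketched with plain Node `fs` calls. This is a minimal illustration, not the commit's implementation: `discardDatabase`, `discardOrExplain`, and `demoDiscard` are hypothetical names, and the error wording is invented for the example.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical names for illustration only; the commit's own helpers are
// removeDatabaseFiles() and CodeGraph.recreate().
const SIDECAR_SUFFIXES = ['-wal', '-shm'] as const;

/** Unlink the main DB file (must succeed or throw), then the sidecars (best-effort). */
export function discardDatabase(dbPath: string): void {
  // force: true treats an already-missing file as done instead of throwing ENOENT.
  fs.rmSync(dbPath, { force: true });
  for (const suffix of SIDECAR_SUFFIXES) {
    try {
      fs.rmSync(dbPath + suffix, { force: true });
    } catch {
      // A leftover sidecar is ignored here, matching the "best-effort" note.
    }
  }
}

/** Wrap a failed discard (e.g. a Windows file lock) in an actionable message. */
export function discardOrExplain(dbPath: string): void {
  try {
    discardDatabase(dbPath);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(
      `Database file is in use (${reason}). Stop the daemon or remove .codegraph, then retry.`,
    );
  }
}

/** Build a throwaway DB-shaped file set, discard it, and return what is left. */
export function demoDiscard(): string[] {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'discard-demo-'));
  const dbPath = path.join(dir, 'codegraph.db');
  try {
    for (const p of [dbPath, dbPath + '-wal', dbPath + '-shm']) fs.writeFileSync(p, 'x');
    discardOrExplain(dbPath);
    discardOrExplain(dbPath); // second call is a no-op: the files are already gone
    return fs.readdirSync(dir);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}

console.log(demoDiscard()); // []
```

The cost of each `rmSync` does not depend on how many rows the database holds, which is the property the fix relies on.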

This also reclaims the disk the bloated db/-wal were holding.

Validated: deterministic probe (DELETE O(rows) vs recreate O(1)); an
end-to-end run through the built binary recovering a real 800K-node /
419MB poisoned DB in 0.3s with no wedge and the correct small graph; new
unit + CLI regression tests; existing #874 index tests still green.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 1 day ago
parent
commit
9684b3b5a5
6 files changed with 291 additions and 13 deletions
  1. CHANGELOG.md (+3 -0)
  2. __tests__/foundation.test.ts (+89 -1)
  3. __tests__/index-command.test.ts (+100 -0)
  4. src/bin/codegraph.ts (+11 -11)
  5. src/db/index.ts (+39 -0)
  6. src/index.ts (+49 -1)

+ 3 - 0
CHANGELOG.md

@@ -9,6 +9,9 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ## [Unreleased]
 
+### Fixes
+
+- `codegraph index` can now rebuild an existing oversized index from an older version, instead of hanging until the watchdog kills it. The previous fix (#1065) stopped *new* indexes from sweeping in a gitignored corpus of nested repos, but a project that had already built the multi-gigabyte graph before upgrading couldn't recover: `codegraph index` is meant to rebuild from scratch, yet it cleared the old graph by deleting every row one at a time, and on a graph of well over a million symbols that took longer than the 60-second responsiveness watchdog allows — so the command was killed before indexing even started, leaving the bad index in place. A full re-index now discards the old database outright and starts fresh, which is near-instant regardless of the old size and also frees the disk the bloated database was holding. Thanks @AriaShishegaran for the detailed follow-up report. (#1067)
 
 ## [1.1.5] - 2026-06-30
 

+ 89 - 1
__tests__/foundation.test.ts

@@ -11,7 +11,7 @@ import * as os from 'os';
 import { CodeGraph } from '../src';
 import { Node, Edge } from '../src/types';
 import { isInitialized, getCodeGraphDir, validateDirectory, codeGraphDirName, isCodeGraphDataDir } from '../src/directory';
-import { DatabaseConnection, getDatabasePath } from '../src/db';
+import { DatabaseConnection, getDatabasePath, removeDatabaseFiles } from '../src/db';
 
 // Create a temporary directory for each test
 function createTempDir(): string {
@@ -25,6 +25,13 @@ function cleanupTempDir(dir: string): void {
   }
 }
 
+/** Normalize a PRAGMA read across return shapes (array | object | scalar). */
+function pragmaValue(raw: unknown, key: string): unknown {
+  const row = Array.isArray(raw) ? raw[0] : raw;
+  if (row !== null && typeof row === 'object') return (row as Record<string, unknown>)[key];
+  return row;
+}
+
 describe('CodeGraph Foundation', () => {
   let tempDir: string;
 
@@ -144,6 +151,87 @@ describe('CodeGraph Foundation', () => {
     });
   });
 
+  // recreate() backs `codegraph index`: it discards the existing DB and returns
+  // a fresh, empty instance rather than DELETE-clearing in place — the path that
+  // recovers a poisoned/oversized prior index without wedging (#1067).
+  describe('Recreate (#1067)', () => {
+    it('returns a fresh, empty, usable instance', async () => {
+      const cg = CodeGraph.initSync(tempDir);
+      // Give the DB some content so "empty afterwards" is meaningful.
+      fs.writeFileSync(path.join(tempDir, 'a.ts'), 'export function f() { return 1; }\n');
+      await cg.indexAll();
+      expect(cg.getStats().nodeCount).toBeGreaterThan(0);
+      cg.close();
+
+      const fresh = await CodeGraph.recreate(tempDir);
+      try {
+        // Empty graph, but a working instance: re-indexing repopulates it.
+        expect(fresh.getStats().nodeCount).toBe(0);
+        const result = await fresh.indexAll();
+        expect(result.success).toBe(true);
+        expect(fresh.getStats().nodeCount).toBeGreaterThan(0);
+      } finally {
+        fresh.close();
+      }
+    });
+
+    it('discards the old database file rather than emptying it in place', async () => {
+      const cg = CodeGraph.initSync(tempDir);
+      await cg.indexAll();
+      cg.close();
+
+      // Stamp a sentinel into the existing DB header. PRAGMA user_version is
+      // untouched by DELETE, so an in-place clear() would preserve it — but a
+      // from-scratch recreate cannot. (An inode-equality check is unreliable:
+      // ext4/overlayfs recycle the inode number after unlink+recreate, so a
+      // "new inode" assertion false-fails on Linux while passing on macOS.)
+      const dbPath = getDatabasePath(tempDir);
+      const stamp = DatabaseConnection.open(dbPath);
+      stamp.getDb().pragma('user_version = 4242');
+      stamp.close();
+
+      const fresh = await CodeGraph.recreate(tempDir);
+      fresh.close();
+
+      // The file exists, and the sentinel is gone — proof the old DB was
+      // discarded and rebuilt, not row-DELETE'd in place (the path that wedged
+      // on a poisoned graph, #1067).
+      expect(fs.existsSync(dbPath)).toBe(true);
+      const check = DatabaseConnection.open(dbPath);
+      const userVersion = pragmaValue(check.getDb().pragma('user_version'), 'user_version');
+      check.close();
+      expect(Number(userVersion)).not.toBe(4242);
+    });
+
+    it('throws a clear error when the project is not initialized', async () => {
+      await expect(CodeGraph.recreate(tempDir)).rejects.toThrow(/not initialized/i);
+    });
+  });
+
+  describe('removeDatabaseFiles (#1067)', () => {
+    it('deletes the database and its -wal/-shm sidecars', () => {
+      const cg = CodeGraph.initSync(tempDir);
+      cg.close();
+      const dbPath = getDatabasePath(tempDir);
+      // Materialise the WAL sidecars so we can prove they're cleaned up too.
+      fs.writeFileSync(dbPath + '-wal', 'x');
+      fs.writeFileSync(dbPath + '-shm', 'x');
+      expect(fs.existsSync(dbPath)).toBe(true);
+
+      removeDatabaseFiles(dbPath);
+
+      expect(fs.existsSync(dbPath)).toBe(false);
+      expect(fs.existsSync(dbPath + '-wal')).toBe(false);
+      expect(fs.existsSync(dbPath + '-shm')).toBe(false);
+    });
+
+    it('is a no-op (does not throw) when the files are already gone', () => {
+      const dbPath = getDatabasePath(tempDir);
+      expect(fs.existsSync(dbPath)).toBe(false);
+      expect(() => removeDatabaseFiles(dbPath)).not.toThrow();
+    });
+  });
+
   describe('Directory Management', () => {
     it('should validate directory structure', () => {
       const cg = CodeGraph.initSync(tempDir);
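
The `pragmaValue` helper added in this file exists because a PRAGMA read may come back in different shapes. The standalone sketch below copies the helper from the diff and feeds it hand-written inputs; those inputs are illustrative assumptions, not captured driver output.

```typescript
/** Normalize a PRAGMA read across return shapes (array | object | scalar). */
export function pragmaValue(raw: unknown, key: string): unknown {
  // An array result is treated as a row list; only the first row matters here.
  const row = Array.isArray(raw) ? raw[0] : raw;
  // A row object is indexed by the pragma name; anything else is returned as-is.
  if (row !== null && typeof row === 'object') return (row as Record<string, unknown>)[key];
  return row;
}

// Illustrative inputs covering the three shapes the doc comment names.
console.log(pragmaValue([{ user_version: 4242 }], 'user_version')); // 4242
console.log(pragmaValue({ user_version: 7 }, 'user_version'));      // 7
console.log(pragmaValue(0, 'user_version'));                        // 0
```

Because the tests wrap the result in `Number(...)` before comparing against the sentinel, any of the three shapes ends up in the same comparison.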

+ 100 - 0
__tests__/index-command.test.ts

@@ -19,9 +19,17 @@ import * as fs from 'fs';
 import * as path from 'path';
 import * as os from 'os';
 import { CodeGraph } from '../src';
+import { DatabaseConnection } from '../src/db';
 
 const BIN = path.resolve(__dirname, '../dist/bin/codegraph.js');
 
+/** Normalize a PRAGMA read across return shapes (array | object | scalar). */
+function pragmaValue(raw: unknown, key: string): unknown {
+  const row = Array.isArray(raw) ? raw[0] : raw;
+  if (row !== null && typeof row === 'object') return (row as Record<string, unknown>)[key];
+  return row;
+}
+
 function runCodegraph(args: string[], cwd: string): string {
   return execFileSync(process.execPath, [BIN, ...args], {
     cwd,
@@ -105,3 +113,95 @@ describe('codegraph index — full re-index keeps the graph populated (#874)', (
     expect(afterIndex.edges).toBe(afterInit.edges);
   });
 });
+
+/**
+ * Regression coverage for issue #1067: a full re-index must RECOVER an existing
+ * oversized/stale index from earlier versions, not wedge on it.
+ *
+ * Root cause: `index` opened the old database and DELETE-d every row to clear
+ * it. With FTS triggers firing per deleted node, a pre-fix poisoned graph (an
+ * ignored gitlink corpus scanned into ~1.6M nodes + a multi-GB WAL, #1065) took
+ * well over the 60s liveness-watchdog window to clear, so the process was
+ * SIGKILLed before scanning even began and the bad state could never be rebuilt
+ * away. The fix discards (unlinks) the database files and re-initializes a fresh
+ * one — O(1) regardless of size — so `index` recovers any prior state.
+ */
+describe('codegraph index — recovers a stale/oversized prior index (#1067)', () => {
+  let tempDir: string;
+  const dbPath = (dir: string) => path.join(dir, '.codegraph', 'codegraph.db');
+
+  beforeEach(() => {
+    tempDir = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-index-recover-'));
+    fs.writeFileSync(
+      path.join(tempDir, 'a.ts'),
+      `export function greet(name: string) { return hello(name); }\n` +
+        `export function hello(n: string) { return 'hi ' + n; }\n`,
+    );
+  });
+
+  afterEach(() => {
+    fs.rmSync(tempDir, { recursive: true, force: true });
+  });
+
+  it('rebuilds to the current disk state, discarding content for files that no longer exist', () => {
+    // Stand in for the "old graph indexed an ignored corpus" shape: index a tree
+    // that also has a junk/ directory, then delete junk/ from disk so the DB now
+    // carries stale nodes for paths that should no longer be indexed.
+    const junkDir = path.join(tempDir, 'junk');
+    fs.mkdirSync(junkDir);
+    for (let i = 0; i < 12; i++) {
+      fs.writeFileSync(path.join(junkDir, `j${i}.ts`), `export function j${i}() { return ${i}; }\n`);
+    }
+    runCodegraph(['init'], tempDir);
+    const withJunk = graphCounts(tempDir);
+
+    // Remove the corpus from disk. The DB still holds its nodes — the stale,
+    // oversized prior state #1067 is about.
+    fs.rmSync(junkDir, { recursive: true, force: true });
+
+    runCodegraph(['index'], tempDir);
+    const recovered = graphCounts(tempDir);
+
+    // The rebuild reflects only what's on disk now — the junk nodes are gone…
+    expect(recovered.nodes).toBeLessThan(withJunk.nodes);
+
+    // …and the result is identical to a fresh init of the same (now-smaller) tree.
+    const fresh = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-index-fresh-'));
+    try {
+      fs.copyFileSync(path.join(tempDir, 'a.ts'), path.join(fresh, 'a.ts'));
+      runCodegraph(['init'], fresh);
+      const freshCounts = graphCounts(fresh);
+      expect(recovered.nodes).toBe(freshCounts.nodes);
+      expect(recovered.edges).toBe(freshCounts.edges);
+    } finally {
+      fs.rmSync(fresh, { recursive: true, force: true });
+    }
+  });
+
+  // The fix rebuilds a fresh DB rather than DELETE-ing rows in place. Prove it
+  // with a header sentinel: PRAGMA user_version survives an in-place clear but
+  // not a from-scratch recreate. (An inode check is unreliable — ext4/overlayfs
+  // recycle the inode number after unlink+recreate.)
+  it('rebuilds a fresh database rather than clearing the old one in place', () => {
+    runCodegraph(['init'], tempDir);
+
+    const stamp = DatabaseConnection.open(dbPath(tempDir));
+    stamp.getDb().pragma('user_version = 4242');
+    stamp.close();
+
+    runCodegraph(['index'], tempDir);
+
+    const check = DatabaseConnection.open(dbPath(tempDir));
+    const userVersion = pragmaValue(check.getDb().pragma('user_version'), 'user_version');
+    check.close();
+
+    // Sentinel gone → `index` discarded the old DB and rebuilt it, the path that
+    // avoids the per-row FTS delete wedge on a poisoned graph (#1067).
+    expect(Number(userVersion)).not.toBe(4242);
+
+    // …and the graph is intact afterwards.
+    const counts = graphCounts(tempDir);
+    expect(counts.nodes).toBeGreaterThan(0);
+    expect(counts.edges).toBeGreaterThan(0);
+  });
+});

+ 11 - 11
src/bin/codegraph.ts

@@ -632,16 +632,23 @@ program
       }
 
       const { default: CodeGraph } = await loadCodeGraph();
-      const cg = await CodeGraph.open(projectPath);
+      // `index` is a FULL re-index — identical to a fresh `init`. RECREATE the
+      // database from scratch (discard .codegraph/codegraph.db + its WAL) rather
+      // than opening the old graph and DELETE-ing every row. Indexing without a
+      // clear reported "0 nodes" against the populated graph (#874); recreate keeps
+      // that fixed AND avoids the failure mode where, on a large or pre-fix
+      // poisoned index, the per-row FTS delete churn wedged the main thread long
+      // enough to trip the liveness watchdog before scanning even began (#1067).
+      // recreate() hands back a fresh, empty instance — no clear() needed. For
+      // fast incremental updates use `sync`.
+      const cg = await CodeGraph.recreate(projectPath);
 
       // Supervise the indexer: self-terminate if orphaned (parent shim killed)
       // or if the main thread wedges — neither was guarded on this path (#999).
       const supervision = installCommandSupervision('index');
       try {
         if (options.quiet) {
-          // Quiet mode: no UI, just run. `index` is a full re-index, so clear the
-          // existing graph and rebuild from scratch (see the note below — #874).
-          cg.clear();
+          // Quiet mode: no UI, just run against the freshly-recreated graph.
           const result = await cg.indexAll();
           if (!result.success) process.exit(1);
           cg.destroy();
@@ -651,13 +658,6 @@ program
         const clack = await importESM('@clack/prompts');
         clack.intro('Indexing project');
 
-        // `index` is a FULL re-index: clear the existing graph and rebuild it from
-        // scratch so the result is identical to a fresh `init`. Without the clear,
-        // indexAll() skips every unchanged file by its content hash and reports
-        // "0 nodes, 0 edges" against the already-populated graph — which reads as
-        // "index wiped my index" (#874). For fast incremental updates use `sync`.
-        cg.clear();
-
         let result: IndexResult;
 
         if (options.verbose) {

+ 39 - 0
src/db/index.ts

@@ -281,9 +281,48 @@ function statInode(p: string): string | null {
  */
 export const DATABASE_FILENAME = 'codegraph.db';
 
+/**
+ * SQLite's sidecar files in WAL mode — the write-ahead log and its shared-memory
+ * index. They sit beside the main DB file and are removed alongside it when the
+ * database is discarded (see `removeDatabaseFiles`).
+ */
+const WAL_SIDECAR_SUFFIXES = ['-wal', '-shm'] as const;
+
 /**
  * Get the default database path for a project
  */
 export function getDatabasePath(projectRoot: string): string {
   return path.join(getCodeGraphDir(projectRoot), DATABASE_FILENAME);
 }
+
+/**
+ * Delete a database file and its WAL sidecars (`-wal`/`-shm`).
+ *
+ * This is how a FULL re-index discards an existing database — rather than
+ * opening the old graph and DELETE-ing every row. On a large or pre-fix
+ * poisoned index (e.g. an old graph that scanned an ignored gitlink corpus into
+ * ~1.6M nodes with a multi-GB WAL, #1065) the per-row `nodes_fts` delete-trigger
+ * churn blocks the main thread long enough to trip the #850 liveness watchdog
+ * before indexing even starts, so the rebuild could never recover the bad state
+ * (#1067). Unlinking is O(1) regardless of DB size and also reclaims the disk
+ * the bloated WAL would otherwise keep.
+ *
+ * POSIX removes the directory entry even while another process (a daemon/MCP
+ * server) still holds the file open; that holder heals via `reopenIfReplaced`
+ * (#925). On Windows a live holder can make the unlink fail with EBUSY/EPERM —
+ * that is thrown for the caller to surface ("stop the other process and retry").
+ * The `-wal`/`-shm` sidecars are best-effort: SQLite recreates them on the next
+ * open, so a leftover sidecar is harmless.
+ */
+export function removeDatabaseFiles(dbPath: string): void {
+  // The main DB file first — its removal is the operation that must succeed (or
+  // report why it couldn't). force:true treats an already-missing file as done.
+  fs.rmSync(dbPath, { force: true });
+  for (const suffix of WAL_SIDECAR_SUFFIXES) {
+    try {
+      fs.rmSync(dbPath + suffix, { force: true });
+    } catch {
+      // A sidecar still held/locked is harmless — SQLite rebuilds it on open.
+    }
+  }
+}
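
The POSIX behaviour the doc comment above relies on can be observed with plain Node `fs` calls. This is a small sketch under stated assumptions: `unlinkWhileOpen` is a hypothetical name, the file is an ordinary text file standing in for the database, and the function returns `null` on Windows, where an open handle can make the unlink fail instead.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

/**
 * On POSIX, unlinking a file removes its directory entry even while a
 * descriptor is still open; the holder keeps reading the old contents
 * until it closes the descriptor.
 */
export function unlinkWhileOpen(): { pathGone: boolean; stillReadable: boolean } | null {
  if (process.platform === 'win32') return null; // behaviour differs on Windows

  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'unlink-demo-'));
  const file = path.join(dir, 'demo.db');
  fs.writeFileSync(file, 'old contents');

  // Stand-in for a daemon that still holds the database open.
  const fd = fs.openSync(file, 'r');
  try {
    fs.rmSync(file, { force: true });
    const pathGone = !fs.existsSync(file);

    // The open descriptor still sees the unlinked file's data.
    const buf = Buffer.alloc(12);
    const bytesRead = fs.readSync(fd, buf, 0, buf.length, 0);
    const stillReadable = buf.toString('utf8', 0, bytesRead) === 'old contents';

    return { pathGone, stillReadable };
  } finally {
    fs.closeSync(fd);
    fs.rmSync(dir, { recursive: true, force: true });
  }
}

console.log(unlinkWhileOpen());
```

A holder in that state is looking at a file that no longer has a path, which is why the commit pairs the unlink with the existing `reopenIfReplaced` healing (#925).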

+ 49 - 1
src/index.ts

@@ -22,7 +22,7 @@ import {
   BuildContextOptions,
   FindRelevantContextOptions,
 } from './types';
-import { DatabaseConnection, getDatabasePath } from './db';
+import { DatabaseConnection, getDatabasePath, removeDatabaseFiles } from './db';
 import { QueryBuilder } from './db/queries';
 import {
   isInitialized,
@@ -319,6 +319,54 @@ export class CodeGraph {
     return instance;
   }
 
+  /**
+   * Rebuild the project's database from scratch and return a fresh, empty
+   * instance — the "same result as a fresh init" semantics that `codegraph
+   * index` documents.
+   *
+   * Unlike `open()` followed by `clear()`, this DISCARDS the existing
+   * `.codegraph/codegraph.db` (and its `-wal`/`-shm` sidecars) before
+   * re-initializing, instead of opening the old database and DELETE-ing every
+   * row. On a large or pre-fix poisoned index — e.g. an old graph that scanned
+   * an ignored gitlink corpus (#1065) into ~1.6M nodes with a multi-GB WAL —
+   * the per-row `nodes_fts` delete-trigger churn blocks the main thread long
+   * enough to trip the #850 liveness watchdog before indexing even starts, so a
+   * full re-index could never recover the bad state (#1067). Discarding the
+   * files is O(1) regardless of size, reclaims the disk, and sidesteps opening
+   * (and running migrations against) the poisoned database entirely.
+   */
+  static async recreate(projectRoot: string): Promise<CodeGraph> {
+    await initGrammars();
+    const resolvedRoot = path.resolve(projectRoot);
+
+    // Check if initialized — recreate REBUILDS an existing project; it is not a
+    // first-time `init`.
+    if (!isInitialized(resolvedRoot)) {
+      throw new Error(`CodeGraph not initialized in ${resolvedRoot}. Run init() first.`);
+    }
+
+    const dbPath = getDatabasePath(resolvedRoot);
+    try {
+      removeDatabaseFiles(dbPath);
+    } catch (err) {
+      // POSIX unlinks an open file fine; this fires mainly on Windows when a
+      // live daemon/MCP server still holds the database. Turn the raw EBUSY into
+      // an actionable instruction instead of a generic failure.
+      const reason = err instanceof Error ? err.message : String(err);
+      throw new Error(
+        `Could not rebuild the index — the database file is in use (${reason}). ` +
+          `Stop any running CodeGraph MCP server/daemon for this project and retry, ` +
+          `or remove the ${getCodeGraphDir(resolvedRoot)} directory and run "codegraph init".`
+      );
+    }
+
+    // Re-create an empty, freshly-schema'd database at the same path.
+    const db = DatabaseConnection.initialize(dbPath);
+    const queries = new QueryBuilder(db.getDb());
+
+    return new CodeGraph(db, queries, resolvedRoot);
+  }
+
   /**
    * Open synchronously (without sync)
    */