Просмотр исходного кода

fix(db): dedup edges with a UNIQUE identity index so INSERT OR IGNORE works (#1034) (#1050)

`insertEdge` has always used `INSERT OR IGNORE`, but the edges table carried
no UNIQUE constraint — only an autoincrement PK and non-unique indexes — so
`OR IGNORE` had nothing to conflict on and behaved like a plain INSERT.
Whenever two extraction/resolution passes emitted the same edge (e.g. a
return type captured by both a type-reference and a value-reference pass),
the graph stored byte-identical duplicate rows: ~527 on this repo, inflating
edge counts and letting callers/impact list the same relationship twice.

Add a UNIQUE identity index on (source, target, kind, IFNULL(line,-1),
IFNULL(col,-1)) — in schema.sql for fresh databases and migration v6 (dedup
existing rows, then create the index) for existing ones. IFNULL folds the
nullable line/col so coordinate-less edges (synthesized / file-level) dedup
too; SQLite otherwise treats each NULL as distinct. Distinct call sites
(same source/target/kind, different line/col) are preserved — only
byte-identical structural duplicates collapse. This is the storage-layer
invariant the reporter identified: it makes OR IGNORE keep its promise and
catches every double-emit, present and future, rather than chasing each
emitting pass.

Migration v6 is deterministic (keeps the lowest id per identity group) and
idempotent (IF NOT EXISTS index; no-op DELETE once unique). The DELETE's
GROUP BY matches the index expression exactly so creation can't fail on a
leftover pair.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 9 часов назад
Родитель
Сommit
0da2dcec8e

+ 1 - 0
CHANGELOG.md

@@ -16,6 +16,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - On Windows, CodeGraph's background server now shuts down cleanly instead of occasionally aborting with a crash error. When the indexed project contained a nested repository (a submodule or embedded clone), stopping the server could race the file watcher's teardown and exit with a Windows crash code rather than a clean exit. Shutdown now lets that teardown finish first, so the server stops cleanly and promptly. (Windows only; other platforms were unaffected.) (#1033)
 - C++ classes that inherit from a templated base — `class Widget : public Base<int>`, a CRTP base like `class App : public CRTPBase<App>`, or a struct inheriting a template — are now linked to that base class in the graph. Previously the template arguments (`<int>`) made the inheritance go unrecognized, so these classes looked like they inherited from nothing and impact/callers analysis stopped at the boundary; the connection is now followed like any other base class. Thanks @ryancu7 for the report. (#1043)
 - C++ objects constructed on the stack — `Calculator calc(0)` or `Widget w{1, 2}` — now record that the enclosing function instantiates that class, the same as heap construction (`new Calculator(0)`) already did. Previously only the `new` form was tracked, so a function that built objects with the ordinary stack syntax looked like it didn't construct them and the dependency was missing from impact/callers. Thanks @Dshuishui for the report. (#1035)
+- The graph no longer stores duplicate copies of the same relationship. The same dependency between the same two symbols at the same spot could be recorded more than once, which inflated edge counts and let callers/impact results list a relationship twice. Each relationship is now stored exactly once, and existing projects are de-duplicated automatically the next time CodeGraph opens them. Thanks @inth3shadows for the detailed report. (#1034)
 
 
 ## [1.1.2] - 2026-06-28

+ 110 - 1
__tests__/db-perf.test.ts

@@ -16,7 +16,8 @@ import * as path from 'path';
 import * as os from 'os';
 import { DatabaseConnection } from '../src/db';
 import { QueryBuilder } from '../src/db/queries';
-import { Node } from '../src/types';
+import { runMigrations, getCurrentVersion } from '../src/db/migrations';
+import { Node, Edge } from '../src/types';
 
 function makeNode(id: string, name = id): Node {
   return {
@@ -249,3 +250,111 @@ describe('runMaintenance', () => {
     expect(() => db.runMaintenance()).not.toThrow();
   });
 });
+
+// The edges table carried no UNIQUE constraint, so `insertEdge`'s
+// `INSERT OR IGNORE` had nothing to conflict on and silently admitted
+// byte-identical duplicate rows when two passes emitted the same edge (#1034).
+// A UNIQUE identity index — `(source, target, kind, IFNULL(line,-1),
+// IFNULL(col,-1))` — makes OR IGNORE actually dedup.
+describe('edge identity uniqueness (#1034)', () => {
+  let dir: string;
+  let db: DatabaseConnection;
+  let q: QueryBuilder;
+
+  beforeEach(() => {
+    dir = fs.mkdtempSync(path.join(os.tmpdir(), 'db-edge-uniq-'));
+    db = DatabaseConnection.initialize(path.join(dir, 'test.db'));
+    q = new QueryBuilder(db.getDb());
+    q.insertNodes([makeNode('A'), makeNode('B')]);
+  });
+
+  afterEach(() => {
+    db.close();
+    if (fs.existsSync(dir)) fs.rmSync(dir, { recursive: true, force: true });
+  });
+
+  const edgeCount = () =>
+    (db.getDb().prepare('SELECT count(*) AS c FROM edges').get() as { c: number }).c;
+  const mk = (over: Partial<Edge> = {}): Edge => ({
+    source: 'A',
+    target: 'B',
+    kind: 'references',
+    line: 153,
+    column: 12,
+    metadata: { resolvedBy: 'exact-match' },
+    ...over,
+  });
+
+  it('a fresh database has the identity index', () => {
+    const idx = db
+      .getDb()
+      .prepare("SELECT name FROM sqlite_master WHERE type='index' AND name='idx_edges_identity'")
+      .get();
+    expect(idx).toBeTruthy();
+  });
+
+  it('collapses byte-identical edges to a single row', () => {
+    q.insertEdges([mk(), mk(), mk()]);
+    expect(edgeCount()).toBe(1);
+  });
+
+  it('dedups even when only the metadata differs (same structural identity)', () => {
+    q.insertEdges([mk({ metadata: { resolvedBy: 'exact-match' } }), mk({ metadata: { resolvedBy: 'import' } })]);
+    expect(edgeCount()).toBe(1);
+  });
+
+  it('keeps edges that differ in line/col — distinct call sites are not duplicates', () => {
+    q.insertEdges([mk({ column: 12 }), mk({ column: 99 }), mk({ line: 200, column: 1 })]);
+    expect(edgeCount()).toBe(3);
+  });
+
+  it('dedups coordinate-less edges, folding NULL line/col via IFNULL', () => {
+    q.insertEdges([mk({ line: undefined, column: undefined }), mk({ line: undefined, column: undefined })]);
+    expect(edgeCount()).toBe(1);
+  });
+
+  it('dedups across separate insert calls (storage constraint, not a per-batch dedup)', () => {
+    q.insertEdges([mk()]);
+    q.insertEdges([mk()]);
+    expect(edgeCount()).toBe(1);
+  });
+});
+
+describe('migration v6: dedup edges + add identity index on upgrade (#1034)', () => {
+  it('collapses pre-existing duplicate rows, keeps distinct ones, and restores the constraint', () => {
+    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'db-mig6-'));
+    const db = DatabaseConnection.initialize(path.join(dir, 'test.db'));
+    const raw = db.getDb();
+    const q = new QueryBuilder(raw);
+    q.insertNodes([makeNode('A'), makeNode('B')]);
+
+    // Recreate a pre-v6 database: without the identity index, `INSERT OR IGNORE`
+    // admits duplicates. Revert the recorded version so migration v6 will re-run.
+    raw.exec('DROP INDEX IF EXISTS idx_edges_identity');
+    raw.prepare('DELETE FROM schema_versions WHERE version >= 6').run();
+    q.insertEdges([
+      { source: 'A', target: 'B', kind: 'references', line: 153, column: 12, metadata: { resolvedBy: 'exact-match' } },
+      { source: 'A', target: 'B', kind: 'references', line: 153, column: 12, metadata: { resolvedBy: 'exact-match' } },
+      { source: 'A', target: 'B', kind: 'calls', line: 200, column: 4 },
+    ]);
+    const count = () => (raw.prepare('SELECT count(*) AS c FROM edges').get() as { c: number }).c;
+    expect(count()).toBe(3); // duplicate admitted while the index was absent
+
+    runMigrations(raw, 5);
+
+    expect(count()).toBe(2); // duplicate collapsed, the distinct `calls` edge kept
+    expect(getCurrentVersion(raw)).toBe(6);
+    const idx = raw
+      .prepare("SELECT name FROM sqlite_master WHERE type='index' AND name='idx_edges_identity'")
+      .get();
+    expect(idx).toBeTruthy();
+    // The constraint now holds — re-inserting the duplicate is a no-op.
+    q.insertEdges([
+      { source: 'A', target: 'B', kind: 'references', line: 153, column: 12, metadata: { resolvedBy: 'x' } },
+    ]);
+    expect(count()).toBe(2);
+
+    db.close();
+    fs.rmSync(dir, { recursive: true, force: true });
+  });
+});

+ 1 - 1
__tests__/foundation.test.ts

@@ -282,7 +282,7 @@ describe('Database Connection', () => {
 
     const version = db.getSchemaVersion();
     expect(version).not.toBeNull();
-    expect(version?.version).toBe(5);
+    expect(version?.version).toBe(6);
 
     db.close();
   });

+ 1 - 1
__tests__/pr19-improvements.test.ts

@@ -299,7 +299,7 @@ describe('Best-Candidate Resolution', () => {
 describe('Schema v2 Migration', () => {
   it.skipIf(!HAS_SQLITE)('should have correct current schema version', async () => {
     const { CURRENT_SCHEMA_VERSION } = await import('../src/db/migrations');
-    expect(CURRENT_SCHEMA_VERSION).toBe(5);
+    expect(CURRENT_SCHEMA_VERSION).toBe(6);
   });
 
   it.skipIf(!HAS_SQLITE)('should have migration for version 2', async () => {

+ 26 - 1
src/db/migrations.ts

@@ -9,7 +9,7 @@ import { SqliteDatabase } from './sqlite-adapter';
 /**
  * Current schema version
  */
-export const CURRENT_SCHEMA_VERSION = 5;
+export const CURRENT_SCHEMA_VERSION = 6;
 
 /**
  * Migration definition
@@ -75,6 +75,31 @@ const migrations: Migration[] = [
       `);
     },
   },
+  {
+    version: 6,
+    description:
+      'Dedup duplicate edge rows and add a UNIQUE identity index so INSERT OR IGNORE actually dedups (#1034)',
+    up: (db) => {
+      // `insertEdge` has always used `INSERT OR IGNORE`, but the edges table had
+      // no UNIQUE constraint, so nothing conflicted and byte-identical rows
+      // accumulated whenever two passes emitted the same edge. Collapse each
+      // identity group to its lowest id, then add the constraint that makes
+      // `OR IGNORE` keep its promise. IFNULL folds nullable line/col so
+      // coordinate-less edges dedup too (SQLite treats each NULL as distinct) —
+      // and it MUST match the GROUP BY exactly, or the index creation would
+      // fail on a pair the DELETE left behind. Idempotent: the index is
+      // `IF NOT EXISTS` and the DELETE is a no-op once the table is unique.
+      db.exec(`
+        DELETE FROM edges
+        WHERE id NOT IN (
+          SELECT MIN(id) FROM edges
+          GROUP BY source, target, kind, IFNULL(line, -1), IFNULL(col, -1)
+        );
+        CREATE UNIQUE INDEX IF NOT EXISTS idx_edges_identity
+          ON edges(source, target, kind, IFNULL(line, -1), IFNULL(col, -1));
+      `);
+    },
+  },
 ];
 
 /**

+ 11 - 0
src/db/schema.sql

@@ -133,6 +133,17 @@ CREATE INDEX IF NOT EXISTS idx_edges_kind ON edges(kind);
 CREATE INDEX IF NOT EXISTS idx_edges_source_kind ON edges(source, kind);
 CREATE INDEX IF NOT EXISTS idx_edges_target_kind ON edges(target, kind);
 
+-- Edge identity uniqueness. An edge IS uniquely (source, target, kind, line,
+-- col); insertEdge uses `INSERT OR IGNORE`, but without something UNIQUE to
+-- conflict on it behaved like a plain INSERT, so two passes emitting the same
+-- edge produced byte-identical duplicate rows that inflated counts and flowed
+-- into callers/impact (#1034). IFNULL folds the nullable line/col so
+-- coordinate-less edges (synthesized / file-level) dedup too — SQLite treats
+-- each NULL as distinct otherwise. Migration v6 dedups existing rows + adds
+-- this on older databases.
+CREATE UNIQUE INDEX IF NOT EXISTS idx_edges_identity
+  ON edges(source, target, kind, IFNULL(line, -1), IFNULL(col, -1));
+
 -- File indexes
 CREATE INDEX IF NOT EXISTS idx_files_language ON files(language);
 CREATE INDEX IF NOT EXISTS idx_files_modified_at ON files(modified_at);