perf(resolution): fix O(K²) import-node blowup in "Resolving refs" (#915) (#965)

* perf(resolution): resolve imports to definitions, not sibling import nodes (#915)

"Resolving refs" crawled (tens of minutes) on large projects — most painfully
ones mixing a big front-end and back-end. An external package or module imported
across hundreds/thousands of files (react, a shared UI package, Python
logging/typing) is re-declared as an `import` node in every importing file, so
its unresolved import ref fell through to the exact-name matcher, which scored
all K same-named import nodes via findBestMatch — K refs x K candidates = O(K^2)
per package, producing only meaningless import->import edges.
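
To make the K x K growth concrete, here is a minimal counting sketch. It is not the project's code: `GraphNode` and `countScoredCandidates` are hypothetical names, and the node shape is reduced to the few fields the argument needs. It only counts how many candidates would be handed to scoring for one package imported in `k` files.

```typescript
// Hypothetical, simplified node shape; the real node type carries more fields.
type GraphNode = { name: string; kind: 'import' | 'function'; filePath: string };

// Count the candidates that would be scored for one package imported in `k`
// files, with or without `import`-kind nodes allowed as name-match targets.
function countScoredCandidates(k: number, excludeImports: boolean): number {
  // One `import` node per importing file, all sharing the package name.
  const nodes: GraphNode[] = [];
  for (let i = 0; i < k; i++) {
    nodes.push({ name: 'react', kind: 'import', filePath: `src/file${i}.tsx` });
  }

  let scored = 0;
  // One unresolved import ref per importing file.
  for (let ref = 0; ref < k; ref++) {
    // The kind check is cheap; the expensive part is scoring each survivor.
    const candidates = excludeImports ? nodes.filter((n) => n.kind !== 'import') : nodes;
    scored += candidates.length;
  }
  return scored;
}

const scoredBefore = countScoredCandidates(1000, false); // k * k candidates scored
const scoredAfter = countScoredCandidates(1000, true); // no import nodes left to score
```

In this toy setup every same-named node is an `import`, so excluding them leaves nothing to score; in a real repo the remaining candidates would be actual definitions with that name.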

Fix: exclude `import`-kind nodes as name-match targets (they're statements, not
definitions; real import->definition resolution is the import resolver's job).
Plus two safe constant-factor wins in findBestMatch: hoist the per-candidate
ref.filePath split, and skip cross-language candidates when a same-language one
exists (provably the same winner — same-language scores >=50, cross-language
maxes at 35).
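
The "same winner" claim can be checked with a small sketch. Only the weights quoted in this commit (+50 / -80 for language, proximity capped at 80, +25 kind, +10 exported) come from the source; the `Candidate` shape, `score`, and `best` are hypothetical simplifications, and the same-file bonus is left out because a cross-language candidate can never earn it.

```typescript
// Hypothetical, simplified candidate; field names are illustrative only.
type Candidate = {
  id: string;
  language: string;
  proximity: number; // already-computed directory proximity
  kindMatch: boolean;
  exported: boolean;
};

// Simplified scorer using the weights quoted in the commit message.
function score(c: Candidate, refLanguage: string): number {
  let s = Math.min(c.proximity, 80);
  s += c.language === refLanguage ? 50 : -80;
  if (c.kindMatch) s += 25;
  if (c.exported) s += 10;
  return s;
}

// Pick the best candidate, optionally skipping cross-language candidates
// whenever at least one same-language candidate exists.
function best(cands: Candidate[], refLanguage: string, skipCrossLanguage: boolean): string | null {
  const hasSameLanguage = cands.some((c) => c.language === refLanguage);
  let bestScore = -Infinity;
  let bestId: string | null = null;
  for (const c of cands) {
    if (skipCrossLanguage && hasSameLanguage && c.language !== refLanguage) continue;
    const s = score(c, refLanguage);
    if (s > bestScore) {
      bestScore = s;
      bestId = c.id;
    }
  }
  return bestId;
}

// Weakest possible same-language candidate vs. strongest possible cross-language one.
const weakestSame: Candidate = { id: 'py-def', language: 'python', proximity: 0, kindMatch: false, exported: false };
const strongestCross: Candidate = { id: 'ts-def', language: 'typescript', proximity: 80, kindMatch: true, exported: true };

const sameScore = score(weakestSame, 'python'); // 0 + 50 = 50
const crossScore = score(strongestCross, 'python'); // 80 - 80 + 25 + 10 = 35
const winnerWithSkip = best([strongestCross, weakestSame], 'python', true);
const winnerWithoutSkip = best([strongestCross, weakestSame], 'python', false);
```

Because the floor for a same-language candidate (50) sits above the ceiling for a cross-language one (35), skipping the cross-language candidates cannot change which node wins; it only avoids scoring them.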

Measured: superset (Py+TS) candidates scored 7.5M -> 833K (9x), non-import edges
preserved (+1618 now resolve to real defs), ~22K useless import->import edges
removed; kubernetes (Go) computePathProximity 37.2s -> 5.0s; synthetic 8k-file
mixed repo (K=4000) resolution 16.0s -> 1.7s. Full suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: correct stale better-sqlite3/wasm references to node:sqlite

The SQLite backend has been Node's built-in node:sqlite (real SQLite, WAL + FTS5,
from the bundled runtime) for a while — there is no native build step and no
node-sqlite3-wasm fallback. README and the docs site were already updated; this
catches the stragglers:

- CLAUDE.md: the src/db/ backend description and the sqlite-backend test note.
- src/db/index.ts, src/mcp/tools.ts: two code comments that still blamed "the
  wasm backend" for non-WAL behavior (reworded to "when WAL isn't in effect").

Leaves tree-sitter grammar wasm (web-tree-sitter / --liftoff-only) untouched —
that's a different, still-current use of wasm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(telemetry): drop the dead sqlite_backend field (schema v2)

node:sqlite is now the only backend, so the `index` event's `sqlite_backend`
field was a constant ("native") carrying no signal — and the `install` event
never actually sent it. Remove the field and the backendKind() helper, bump the
telemetry SCHEMA_VERSION 1 -> 2, and update TELEMETRY.md + docs/design/telemetry.md.

The ingest worker is deliberately left tolerant: `index` doesn't require the
field and schema_version validates as nonNegInt(99), so v2 events ingest fine and
old clients still sending v1 + sqlite_backend keep validating too. Added a legacy
comment there explaining it's safe to drop once old-client share is negligible.
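
A rough sketch of why both client generations keep validating is below. It is an assumption-laden illustration, not the worker's code: the `Sanitize` signature, the `oneOf` helper, the reading of `nonNegInt(99)` as "integer from 0 to 99", the choice of required fields, and the `validateIndex` function are all hypothetical.

```typescript
// Hypothetical sanitizer type: returns the cleaned value, or null to reject.
type Sanitize = (v: unknown) => string | number | null;

// Assumed meaning of nonNegInt(max): an integer between 0 and max inclusive.
const nonNegInt = (max: number): Sanitize => (v) =>
  typeof v === 'number' && Number.isInteger(v) && v >= 0 && v <= max ? v : null;

// Hypothetical enum helper for bucket-style string props.
const oneOf = (...allowed: string[]): Sanitize => (v) =>
  typeof v === 'string' && allowed.includes(v) ? v : null;

// Illustrative spec for the `index` event; the required list is an assumption.
const INDEX_EVENT: { required: readonly string[]; props: Record<string, Sanitize> } = {
  required: ['file_count_bucket', 'duration_bucket'],
  props: {
    file_count_bucket: oneOf('<100', '100-1k', '1k-10k', '10k+'),
    duration_bucket: oneOf('<10s', '10-60s', '1-5m', '5m+'),
    sqlite_backend: oneOf('native', 'wasm'), // legacy: accepted, never required
  },
};

function validateIndex(schemaVersion: unknown, props: Record<string, unknown>): boolean {
  if (nonNegInt(99)(schemaVersion) === null) return false;
  for (const key of INDEX_EVENT.required) {
    if (!(key in props)) return false;
  }
  for (const [key, value] of Object.entries(props)) {
    const sanitize = INDEX_EVENT.props[key];
    if (!sanitize) continue; // unlisted props are ignored in this sketch
    if (sanitize(value) === null) return false;
  }
  return true;
}

const v2Event = { file_count_bucket: '1k-10k', duration_bucket: '10-60s' };
const v1Event = { ...v2Event, sqlite_backend: 'native' };
const v2Ok = validateIndex(2, v2Event); // new client, field omitted
const v1Ok = validateIndex(1, v1Event); // old client, legacy field present
```

Keeping the legacy field optional rather than required is what lets the client bump to schema v2 ship without a coordinated worker change.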

telemetry.test.ts: the assertion pinning schema_version and a stale-claim fixture
line updated 1 -> 2. All telemetry tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 17 hours ago
parent
commit
0a91d0f512

+ 1 - 0
CHANGELOG.md

@@ -35,6 +35,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### Fixes
 
+- `codegraph index` and `codegraph init` no longer crawl during the "Resolving refs" phase on large projects — most painfully ones that mix a big front-end and back-end, where the phase could stretch to many minutes. A package or module imported across hundreds or thousands of files (`react`, a shared UI package, Python `logging` / `typing`) was being treated as if every one of those import statements might be its definition, so the resolver compared each import against all the others — work that grows with the *square* of how widely a package is imported, which is why it blew up only on big, import-heavy repos. Imports now resolve straight to the definitions they actually point at, so those redundant comparisons are gone (reference resolution is dramatically faster on large repos), and the graph no longer accumulates the meaningless import-to-import links the old fallback created. (#915)
 - MCP tool results no longer show up as oversized headings in Markdown-rendering clients (such as the Claude Code VSCode extension). Results used Markdown headings (`##`/`###`) for things like the status summary, each search hit, and every file section in an exploration, so a normal query filled the transcript with large-font lines — worst with `codegraph_search` and `codegraph_explore`, where the noise grew with the number of results. Section headers are now bold labels, which render at normal text size while keeping the same structure. Terminal/CLI output is unchanged. (#778)
 - An MCP server pointed at a very large repository (tens of thousands of files) no longer hangs on the first tool call after a fresh start. On startup CodeGraph reconciles its index against the current files on disk, and on a huge repo that reconcile could run for minutes while blocking the very first request — long enough that the background server was sometimes force-restarted mid-scan, so the first query never came back at all. The reconcile now yields as it runs (keeping the server responsive instead of pinning it), and the first tool call waits only briefly for it before answering and letting the rest finish in the background — so you get a fast first response and the index still catches up. Set `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS` to tune how long that first call waits (default 3000ms), or `=0` to always wait for the full reconcile. (#905)
 - `codegraph install` now wires up your agents and stops there — it no longer indexes the current directory. Building a project's graph is always the explicit `codegraph init` (or `codegraph index`), so you decide what gets indexed and when, and the steps are the same whether you installed globally or just for one project. This clears up the confusion where a project-local install silently indexed but a global one didn't, and where the docs and the tool disagreed about whether you still had to run `init`. (#826)

+ 2 - 2
CLAUDE.md

@@ -50,7 +50,7 @@ The public API surface is `src/index.ts` — the `CodeGraph` class wires all the
 ### Module layout
 
 - `src/index.ts` — `CodeGraph` class: `init`/`open`/`close`, `indexAll`, `sync`, `searchNodes`, `getCallers`/`getCallees`, `getImpactRadius`, `buildContext`, `watch`/`unwatch`.
-- `src/db/` — `DatabaseConnection`, `QueryBuilder` (prepared statements), `schema.sql`. Backed by `better-sqlite3` (native) when available, transparently falls back to `node-sqlite3-wasm`. `codegraph status` surfaces which backend is live; wasm is the slow path.
+- `src/db/` — `DatabaseConnection`, `QueryBuilder` (prepared statements), `schema.sql`, `sqlite-adapter.ts`. Backed by Node's built-in **`node:sqlite`** (`DatabaseSync`) — real SQLite with WAL + FTS5, exposed through a thin better-sqlite3-shaped adapter. The bundled runtime always ships Node ≥22.5, so `node:sqlite` is always available: **no native build step and no wasm fallback**. (Running from source needs Node ≥22.5.) `codegraph status` reports the live backend (`node-sqlite`, the sole backend).
 - `src/extraction/` — `ExtractionOrchestrator`, tree-sitter wrappers, per-language extractors under `languages/` (one file per language), plus standalone extractors for non-tree-sitter formats (`svelte-extractor.ts`, `vue-extractor.ts`, `liquid-extractor.ts`, `dfm-extractor.ts` for Delphi). `parse-worker.ts` runs heavy parsing off the main thread.
 - `src/resolution/` — `ReferenceResolver` orchestrates `import-resolver.ts` (with `path-aliases.ts` for tsconfig path aliases + cargo workspace member globs), `name-matcher.ts`, and `frameworks/` (Express, Laravel, Rails, FastAPI, Django, Flask, Spring, Gin, Axum, ASP.NET, Vapor, React Router, SvelteKit, Vue/Nuxt, Cargo workspaces). Frameworks emit `route` nodes and `references` edges.
 - `src/graph/` — `GraphTraverser` (BFS/DFS, impact radius, path finding) and `GraphQueryManager` (high-level queries).
@@ -163,7 +163,7 @@ Tests live in `__tests__/` and mirror the module they cover. Notable ones beyond
 
 - `installer-targets.test.ts` — parameterized contract suite across all 4 agent targets (see installer notes above).
 - `evaluation/` — `runner.ts` + `test-cases.ts` exercise codegraph against synthetic projects and score the results; run via `npm run eval` (builds first). Not part of `npm test`.
-- `sqlite-backend.test.ts` — covers native + wasm backend selection and fallback.
+- `sqlite-backend.test.ts` / `node-sqlite-backend.test.ts` — pin that `node:sqlite` is the sole backend: `getBackend()` reports `node-sqlite` and the DB comes up in WAL.
 - `pr19-improvements.test.ts`, `frameworks-integration.test.ts` — regression coverage for specific past PRs/incidents; don't rename these, the names anchor to git history.
 
 Tests create temp dirs with `fs.mkdtempSync` and clean up in `afterEach`. They write real files and exercise real SQLite — there is no DB mocking.

+ 2 - 3
TELEMETRY.md

@@ -39,7 +39,7 @@ Every payload carries this envelope:
 | `os` / `arch` | `darwin` / `arm64` | platform identifiers only |
 | `node_major` | `22` | major version only |
 | `ci` | `false` | whether the `CI` env var was set |
-| `schema_version` | `1` | bumped when this page changes |
+| `schema_version` | `2` | bumped when this page changes (v2 dropped the `index` event's `sqlite_backend` field) |
 
 And one of four events:
 
@@ -48,8 +48,7 @@ And one of four events:
   an upgrade, or a re-run.
 - **`index`** — when a full index completes: the **language names** present (e.g.
   `["typescript","go"]`), the file count as a **coarse bucket** (`<100`, `100-1k`,
-  `1k-10k`, `10k+`), the duration as a bucket (`<10s`, `10-60s`, `1-5m`, `5m+`), and the
-  SQLite backend (`native`/`wasm`).
+  `1k-10k`, `10k+`), and the duration as a bucket (`<10s`, `10-60s`, `1-5m`, `5m+`).
 - **`usage_rollup`** — one line per day per tool: the tool or CLI command **name** (e.g.
   `codegraph_explore`, `init`), how many times it ran, how many errored, and — for MCP
   tools — the connecting agent's name and version from the MCP handshake (e.g.

+ 2 - 2
__tests__/telemetry.test.ts

@@ -173,7 +173,7 @@ describe('Telemetry', () => {
       expect(calls).toHaveLength(1);
       const body = calls[0]!.body;
       expect(body.machine_id).toBe(t.getStatus().machineId);
-      expect(body.schema_version).toBe(1);
+      expect(body.schema_version).toBe(2);
       const events = body.events as Array<{ event: string; ts: string; props: Record<string, unknown> }>;
       expect(events).toHaveLength(2);
       const explore = events.find((e) => e.props.name === 'codegraph_explore')!;
@@ -265,7 +265,7 @@ describe('Telemetry', () => {
       const t = make();
       const stale = path.join(dir, 'telemetry-queue.sending.99999.jsonl');
       fs.mkdirSync(dir, { recursive: true });
-      fs.writeFileSync(stale, JSON.stringify({ v: 1, ev: 'uninstall', ts: '2026-06-11T00:00:00.000Z', props: {} }) + '\n');
+      fs.writeFileSync(stale, JSON.stringify({ v: 2, ev: 'uninstall', ts: '2026-06-11T00:00:00.000Z', props: {} }) + '\n');
       const old = new Date(nowValue.getTime() - 2 * 60 * 60_000);
       fs.utimesSync(stale, old, old);
       t.setEnabled(true, 'cli'); // config so send() has a machine id

+ 3 - 4
docs/design/telemetry.md

@@ -20,7 +20,7 @@ Answer, in aggregate and anonymously:
 - Which install targets people pick, local vs global, fresh vs upgrade.
 - Which MCP tools and CLI commands get used, how often, and how often they error.
 - Which languages people index (prioritize extractor/framework work by real usage).
-- Version adoption speed, OS/arch/Node mix, native-vs-wasm SQLite backend share.
+- Version adoption speed, OS/arch/Node mix. (The SQLite backend is always the built-in `node:sqlite` now — there is no native-vs-wasm split left to measure.)
 
 ## Non-goals / never collected
 
@@ -63,11 +63,10 @@ Common envelope on every batch (computed once per process):
 Event types:
 
 - **`install`** — one per installer run. Props: `targets` (e.g. `["claude","cursor"]`),
-  `scope` (`local`/`global`), `kind` (`fresh`/`upgrade`/`reinstall`), `sqlite_backend`
-  (`native`/`wasm`).
+  `scope` (`local`/`global`), `kind` (`fresh`/`upgrade`/`reinstall`).
 - **`index`** — one per full index (`init`/`index`, not per `sync`). Props: `languages`
   (names only, e.g. `["typescript","go"]`), `file_count_bucket` (`<100`, `100-1k`, `1k-10k`,
-  `10k+`), `duration_bucket` (`<10s`, `10-60s`, `1-5m`, `5m+`), `sqlite_backend`.
+  `10k+`), `duration_bucket` (`<10s`, `10-60s`, `1-5m`, `5m+`).
 - **`usage_rollup`** — the workhorse. One event per `(day, kind, name)` per machine,
   aggregated locally. Props: `kind` (`mcp_tool`/`cli_command`), `name`
   (e.g. `codegraph_explore`, `affected`), `count`, `error_count`, and for MCP:

+ 1 - 1
src/db/index.ts

@@ -143,7 +143,7 @@ export class DatabaseConnection {
    *
    * SQLite silently keeps the prior mode if WAL can't be enabled — e.g. on
    * filesystems without shared-memory support (some network/virtualized mounts,
-   * WSL2 /mnt), and always on the wasm backend. So the effective mode can differ
+   * WSL2 /mnt). So the effective mode can differ
    * from what `configureConnection` requested. Surfaced in `codegraph status` so
    * a "database is locked" report is triageable: 'wal' ⇒ readers never block on a
    * writer; anything else ⇒ they can. See issue #238.

+ 2 - 1
src/mcp/tools.ts

@@ -936,7 +936,8 @@ export class ToolHandler {
     // If the path resolves to the default project, reuse the already-open
     // default instance rather than opening a SECOND connection to the same DB.
     // A duplicate connection serializes reads against the watcher's auto-sync
-    // writes; on the wasm backend (no WAL) that surfaces as intermittent
+    // writes; when WAL isn't in effect (e.g. a filesystem without shared-memory
+    // support) that surfaces as intermittent
     // "database is locked" on concurrent tool calls. See issue #238. The
     // default instance is owned/closed by the server, so it's never cached.
     if (this.cg && this.cg.getProjectRoot() === resolvedRoot) {

+ 51 - 8
src/resolution/name-matcher.ts

@@ -317,7 +317,17 @@ export function matchByExactName(
   ref: UnresolvedRef,
   context: ResolutionContext
 ): ResolvedRef | null {
-  const candidates = applyLanguageGate(context.getNodesByName(ref.referenceName), ref);
+  // `import`-kind nodes are import STATEMENTS, not definitions, so a reference
+  // resolving to a sibling file's `import` is a meaningless edge — the real
+  // import→definition resolution is the import resolver's job (resolveViaImport),
+  // never name-matching here. Excluding them also removes a quadratic blow-up:
+  // a ubiquitous package (`react`, `@superset-ui/core`, Python `logging`/`typing`)
+  // is re-declared as an `import` node in every file that imports it, so K
+  // unresolved import refs each scored K same-named import candidates through
+  // findBestMatch — O(K²) per package, the dominant cost of "Resolving refs" on
+  // large import-heavy (front-end + back-end) repos (#915).
+  const candidates = applyLanguageGate(context.getNodesByName(ref.referenceName), ref)
+    .filter((n) => n.kind !== 'import');
 
   if (candidates.length === 0) {
     return null;
@@ -1119,16 +1129,22 @@ function splitCamelCase(str: string): string[] {
 }
 
 /**
- * Compute directory proximity between two file paths.
- * Returns a score based on the number of shared directory segments.
+ * Compute directory proximity from a pre-split list of directory segments
+ * (`filePath1` minus its filename) and a second file path.
+ * Returns a score based on the number of shared leading directory segments.
  * Higher score = closer in directory tree.
+ *
+ * Split into a pre-split variant because findBestMatch scores every candidate
+ * against the SAME `ref.filePath`; re-splitting it per candidate was a hot spot
+ * on large repos (#915), so the caller splits it once and passes the segments.
  */
-function computePathProximity(filePath1: string, filePath2: string): number {
-  const dir1 = filePath1.split('/').slice(0, -1);
-  const dir2 = filePath2.split('/').slice(0, -1);
+function pathProximityFromDirs(dir1: string[], filePath2: string): number {
+  const dir2 = filePath2.split('/');
+  dir2.pop(); // drop filename — matches the original slice(0, -1) on both paths
 
   let shared = 0;
-  for (let i = 0; i < Math.min(dir1.length, dir2.length); i++) {
+  const limit = Math.min(dir1.length, dir2.length);
+  for (let i = 0; i < limit; i++) {
     if (dir1[i] === dir2[i]) {
       shared++;
     } else {
@@ -1140,6 +1156,16 @@ function computePathProximity(filePath1: string, filePath2: string): number {
   return Math.min(shared * 15, 80);
 }
 
+/**
+ * Compute directory proximity between two file paths.
+ * Returns a score based on the number of shared directory segments.
+ */
+function computePathProximity(filePath1: string, filePath2: string): number {
+  const dir1 = filePath1.split('/');
+  dir1.pop();
+  return pathProximityFromDirs(dir1, filePath2);
+}
+
 /**
  * Find the best matching node when there are multiple candidates
  */
@@ -1158,7 +1184,24 @@ function findBestMatch(
   let bestScore = -1;
   let bestNode: Node | null = null;
 
+  // Split the ref's path once (it's the same across every candidate) instead of
+  // re-splitting it inside computePathProximity per candidate (#915 hot spot).
+  const refDirs = ref.filePath.split('/');
+  refDirs.pop();
+
+  // A same-language candidate ALWAYS outscores a cross-language one: same-language
+  // scores at least +50 (language bonus), while a cross-language candidate maxes
+  // out at +35 (−80 language, +80 proximity, +25 kind, +10 exported; it can never
+  // be in the same file). So when any same-language candidate exists, skip the
+  // cross-language ones — provably the same winner, without paying the per-candidate
+  // scoring. Cuts the candidate set to same-language size on mixed front-end +
+  // back-end repos (#915). When ALL candidates are cross-language (a legitimate
+  // cross-language `calls` bridge), none are skipped and behavior is unchanged.
+  const hasSameLanguage = candidates.some((c) => c.language === ref.language);
+
   for (const candidate of candidates) {
+    if (hasSameLanguage && candidate.language !== ref.language) continue;
+
     let score = 0;
 
     // Same file bonus
@@ -1167,7 +1210,7 @@ function findBestMatch(
     }
 
     // Directory proximity bonus — strongly prefer same module/package
-    score += computePathProximity(ref.filePath, candidate.filePath);
+    score += pathProximityFromDirs(refDirs, candidate.filePath);
 
     // Language matching: strongly prefer same language, penalize cross-language
     if (candidate.language === ref.language) {

+ 5 - 8
src/telemetry/index.ts

@@ -30,7 +30,10 @@ import { randomUUID } from 'crypto';
 export const TELEMETRY_ENDPOINT = 'https://telemetry.getcodegraph.com/v1/events';
 export const TELEMETRY_DOCS = 'https://github.com/colbymchenry/codegraph/blob/main/TELEMETRY.md';
 
-const SCHEMA_VERSION = 1;
+// v2: dropped the `sqlite_backend` field from the `index` event — node:sqlite is
+// now the only backend (the better-sqlite3-native / wasm-fallback split is gone),
+// so the value was a constant carrying no signal. See TELEMETRY.md.
+const SCHEMA_VERSION = 2;
 const MAX_BUFFER_BYTES = 256 * 1024;
 const MAX_EVENTS_PER_REQUEST = 100;
 const DEFAULT_FLUSH_TIMEOUT_MS = 1500;
@@ -55,18 +58,13 @@ export function bucketDuration(ms: number): '<10s' | '10-60s' | '1-5m' | '5m+' {
   return '5m+';
 }
 
-/** Collapse a backend identifier (e.g. `node-sqlite`) to the schema's enum. */
-export function backendKind(backend: string): 'native' | 'wasm' {
-  return backend.toLowerCase().includes('wasm') ? 'wasm' : 'native';
-}
-
 /**
  * Shared "a full index completed" event (CLI init/index + installer local
  * init): language names and coarse buckets only — never paths, file names,
  * or exact counts. Structurally typed so callers don't need engine imports.
  */
 export function recordIndexEvent(
-  cg: { getStats(): { filesByLanguage: Record<string, number> }; getBackend(): string },
+  cg: { getStats(): { filesByLanguage: Record<string, number> } },
   result: { filesIndexed: number; durationMs: number },
 ): void {
   try {
@@ -77,7 +75,6 @@ export function recordIndexEvent(
       languages,
       file_count_bucket: bucketFileCount(result.filesIndexed),
       duration_bucket: bucketDuration(result.durationMs),
-      sqlite_backend: backendKind(cg.getBackend()),
     });
   } catch {
     /* telemetry must never break indexing */

+ 4 - 0
telemetry-worker/src/index.ts

@@ -67,6 +67,10 @@ const nonNegInt =
  * without the other is a bug. Anything not listed here does not exist as far
  * as this endpoint is concerned.
  */
+// `sqlite_backend` (`native`/`wasm`) below is a LEGACY field: pre-schema-v2 clients
+// (≤ June 2026) sent it, but node:sqlite is now the only backend so current clients
+// omit it. Kept here so old clients' events still validate; safe to drop once their
+// share is negligible. Never `required`.
 const EVENTS: Record<string, { required: readonly string[]; props: Record<string, Sanitize> }> = {
   install: {
     required: ['scope', 'kind'],