Procházet zdrojové kódy

fix(prompt-hook): close the segment-vocab integrity gaps (#1141, #1142, #1144, #1145, #1146) (#1150)

Five hardening fixes to the #1136 MEDIUM (graph-derived) tier:

- #1141: updateNode() now writes the segment vocabulary like insertNode()
  does — framework post-extract renames (NestJS route prefixing) left the
  new name permanently unsearchable (the old rows orphaned, the backfill
  gated on an EMPTY vocab, so even a full re-index re-created the drift).
- #1142: new CodeGraph.healSegmentVocabIfEmpty() — the hook opens the
  graph without sync, so a database migrated from pre-vocab schema kept
  the MEDIUM tier dormant until some unrelated sync ran. The hook heals
  on first use (one SELECT when populated; lock-aware, defers to a
  running sync) and records noop-vocab-empty when it can't.
- #1144: a name whose only nodes are file/import kind is skipped instead
  of falling back to surfacing an import statement as a matched symbol;
  import specifiers no longer enter the vocab at all (shared
  isSegmentableKind gate across insertNode/updateNode/rebuild page query)
  since they can never be surfaced and only inflate rarity statistics.
- #1145: plural variant folding is keyed on English plural spelling —
  bare-s plurals no longer mint a bogus -es sibling (services→servic),
  unambiguous sibilant-es plurals no longer mint a bogus -s sibling
  (classes→classe), trailing -ss singulars no longer strip (class→clas);
  genuinely ambiguous endings (caches/databases) still emit both keys.
- #1146: getSegmentCoOccurrence folds variants to their original word
  inside the SQL (CASE mapping + COUNT(DISTINCT word)) so a plural pair
  of ONE word can't tie with a genuine two-word match and crowd it past
  the pre-fold ORDER BY/LIMIT; the JS re-check stays as the honesty layer.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Colby Mchenry před 1 dnem
rodič
revize
35611b92bb

+ 2 - 0
CHANGELOG.md

@@ -22,6 +22,8 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - Removed dead code left behind by the discontinued managed-reasoning feature. Its `codegraph login` flow was unplugged before ever shipping in a release, but the unused module still shipped inside the platform bundles, and a security review flagged its Windows browser-open step (it routed the login URL through `cmd`, which would have been unsafe had the flow ever been wired back up). The leftover module and its tests are now fully deleted. Thanks @inth3shadows for the report. (#1114)
 - The Claude Code context hook no longer treats ordinary English words that merely start with "call", "trace", "affect", or "connect" — callus, calligraphy, Connecticut, connective, affectionate, Tracey — as structural questions, which used to inject full CodeGraph context into prompts that had nothing to do with code structure. Genuinely structural forms (calls, callers, callbacks, call site, traced, tracing, affected, connections, connectivity, …) still fire exactly as before. Thanks @inth3shadows for the report. (#1138)
 - A stuck git command can no longer hang CodeGraph indefinitely. The git checks behind worktree detection and git-hook setup, and the installer's optional `npm install -g` step, now time out and fail gracefully instead of blocking forever — this matters most for the background MCP server, where an unbounded git hang (network filesystems, a wedged fsmonitor) could previously freeze it long enough for the safety watchdog to kill it. Thanks @inth3shadows for the report. (#1139)
+- The context hook's new plain-words matching works immediately on projects indexed by an older CodeGraph version. The word lookup it relies on is built at index time, so a project indexed before the upgrade had an empty one, and the hook would silently find nothing until something else happened to refresh the index; the hook now fills it in on first use (a one-time step — normally the background MCP server's startup catch-up gets there first). Thanks @inth3shadows for the report. (#1142)
+- Several accuracy fixes to the plain-words matching: a renamed symbol (for example a NestJS route after its module prefix is applied) stays findable under its new name (#1141); a word that only appears in your code as an import statement's package name is no longer presented as a matched symbol (#1144); plural words no longer generate garbled lookup keys ("services" no longer also looks up "servic") (#1145); and a name matching both the singular and plural of one word can no longer squeeze out a genuine two-word match (#1146). Thanks @inth3shadows for the reports.
 
 ## [1.2.0] - 2026-07-02
 

+ 22 - 0
__tests__/identifier-segments.test.ts

@@ -73,9 +73,31 @@ describe('segmentLookupVariants — light plural folding', () => {
   it('folds trailing s/es so plurals hit singular segments', () => {
     expect(segmentLookupVariants('services')).toContain('service');
     expect(segmentLookupVariants('machines')).toContain('machine');
+    expect(segmentLookupVariants('classes')).toContain('class');
+  });
+
+  it('bare-s plurals no longer mint a bogus -es sibling (#1145)', () => {
+    expect(segmentLookupVariants('services')).toEqual(['services', 'service']);
+    expect(segmentLookupVariants('machines')).toEqual(['machines', 'machine']);
+  });
+
+  it('unambiguous sibilant-es plurals no longer mint a bogus -s sibling (#1145)', () => {
+    expect(segmentLookupVariants('classes')).toEqual(['classes', 'class']);
+    expect(segmentLookupVariants('hashes')).toEqual(['hashes', 'hash']);
+  });
+
+  it('a trailing -ss is a singular, not a plural — no strip (#1145)', () => {
+    expect(segmentLookupVariants('class')).toEqual(['class']);
+    expect(segmentLookupVariants('process')).toEqual(['process']);
+  });
+
+  it('ambiguous endings emit BOTH candidate keys — a wrong exclusive guess would lose the real match', () => {
+    expect(segmentLookupVariants('caches')).toEqual(['caches', 'cach', 'cache']);       // cache + s
+    expect(segmentLookupVariants('databases')).toEqual(['databases', 'databas', 'database']); // database + s
   });
 
   it('never strips a word below the minimum', () => {
     expect(segmentLookupVariants('bus')).toEqual(['bus']);
+    expect(segmentLookupVariants('boxes')).toEqual(['boxes']); // -es strip would go sub-minimum
   });
 });

+ 77 - 0
__tests__/segment-vocab.test.ts

@@ -141,4 +141,81 @@ export function writeConfig(): void {}
     // UNTOUCHED file's names must come from the backfill.
     expect(cg.getSegmentMatches(['checkout']).map((m) => m.name)).toContain('CheckoutService');
   });
+
+  it('healSegmentVocabIfEmpty backfills WITHOUT a sync — the prompt-hook open path (#1142)', async () => {
+    // The hook opens the graph without syncing, and a database migrated from
+    // before the vocab table existed starts with it empty — sync's backfill
+    // never runs on that path, leaving the MEDIUM tier permanently dormant.
+    const queries = (cg as unknown as {
+      queries: { clearNameSegmentVocab(): void; isNameSegmentVocabEmpty(): boolean };
+    }).queries;
+    queries.clearNameSegmentVocab();
+    expect(queries.isNameSegmentVocabEmpty()).toBe(true);
+    await expect(cg.healSegmentVocabIfEmpty()).resolves.toBe(true);
+    expect(queries.isNameSegmentVocabEmpty()).toBe(false);
+    expect(cg.getSegmentMatches(['state', 'machine']).map((m) => m.name)).toContain('OrderStateMachine');
+    // Populated vocab: the fast path (one SELECT) still reports usable.
+    await expect(cg.healSegmentVocabIfEmpty()).resolves.toBe(true);
+  });
+
+  it('a rename through updateNode reaches the vocab — the framework post-extract path (#1141)', () => {
+    // Framework resolvers rewrite node names after extraction (NestJS route
+    // prefixing) via updateNode. The new name must become prose-searchable;
+    // the old name's rows become orphans the honesty gate drops.
+    const queries = (cg as unknown as {
+      queries: {
+        getNodesByName(name: string): Array<Record<string, unknown>>;
+        updateNode(node: Record<string, unknown>): void;
+      };
+    }).queries;
+    const node = queries.getNodesByName('OrderStateMachine')[0]!;
+    queries.updateNode({ ...node, name: 'RenamedWorkflowEngine', qualifiedName: 'RenamedWorkflowEngine' });
+    expect(cg.getSegmentMatches(['renamed', 'workflow']).map((m) => m.name)).toContain('RenamedWorkflowEngine');
+    expect(cg.getSegmentMatches(['state', 'machine'])).toEqual([]);
+  });
+
+  it('a name that exists only as an import statement is never surfaced (#1144)', async () => {
+    // Import nodes are named after module specifiers, not symbols. The write
+    // path no longer segments them; and even against legacy vocab rows (a DB
+    // populated before that exclusion), the representative picker must skip
+    // the name rather than surface the import line as a matched symbol.
+    fs.writeFileSync(
+      path.join(dir, 'src', 'consumer.ts'),
+      `import { Thing } from 'external-unindexed-pkg';\nexport function useIt(): void {}\n`,
+    );
+    await cg.sync();
+    expect(cg.getSegmentMatches(['external', 'unindexed'])).toEqual([]);
+    // Legacy rows: plant the vocab entries a pre-exclusion version wrote.
+    const queries = (cg as unknown as { queries: { insertNameSegmentsBatch(names: string[]): void } }).queries;
+    queries.insertNameSegmentsBatch(['external-unindexed-pkg']);
+    expect(cg.getSegmentMatches(['external', 'unindexed'])).toEqual([]);
+  });
+
+  it('co-occurrence counts distinct WORDS, not variants — plural pairs cannot pose as two words (#1146)', () => {
+    const queries = (cg as unknown as {
+      queries: {
+        insertNameSegmentsBatch(names: string[]): void;
+        getSegmentCoOccurrence(
+          variants: Array<{ segment: string; word: string }>,
+          minWords: number,
+          limit: number,
+        ): Array<{ name: string; matches: number }>;
+      };
+    }).queries;
+    // BillingServicesService carries BOTH the `services` and `service`
+    // segments — two variants of ONE prompt word. It must not meet minWords=2.
+    queries.insertNameSegmentsBatch(['BillingServicesService']);
+    const hits = queries.getSegmentCoOccurrence(
+      [
+        { segment: 'services', word: 'services' },
+        { segment: 'service', word: 'services' },
+        { segment: 'checkout', word: 'checkout' },
+      ],
+      2,
+      24,
+    );
+    const names = hits.map((h) => h.name);
+    expect(names).toContain('CheckoutService'); // checkout + service(s) — two real words
+    expect(names).not.toContain('BillingServicesService'); // services + service — one word
+  });
 });

+ 10 - 0
src/bin/codegraph.ts

@@ -1156,6 +1156,16 @@ program
           // scored, each hit re-verified to exist (see getSegmentMatches). The
           // payload names the symbols but does NOT run explore — the agent owns
           // the query where the hook's confidence is only "these are related".
+          //
+          // A database indexed before the vocab table existed starts with it
+          // EMPTY, and only sync() backfills it — which this hook never runs
+          // (#1142). Heal it here: on a populated vocab this is one SELECT;
+          // the actual backfill is a one-time batched pass whose cost the MCP
+          // server's own catch-up sync usually pays first (it runs at every
+          // session start). A distinct noop outcome keeps a dormant vocab
+          // from polluting the noop-unverified recall signal.
+          const vocabReady = await cg.healSegmentVocabIfEmpty().catch(() => false);
+          if (!vocabReady) { gate('noop-vocab-empty'); return; }
           const related = cg.getSegmentMatches(proseWords);
           if (related.length === 0) { gate('noop-unverified'); return; }
           const lines = related

+ 50 - 11
src/db/queries.ts

@@ -319,8 +319,19 @@ export class QueryBuilder {
     // before use, and a full index clears the table at its start. File nodes
     // are excluded: a file's basename duplicates the symbols inside it
     // (state-machine.ts / OrderStateMachine), which double-counts every
-    // concept and defeats the singleton-vs-cluster rarity statistics.
-    if (node.kind !== 'file') this.insertNameSegments(node.name);
+    // concept and defeats the singleton-vs-cluster rarity statistics. Import
+    // nodes are excluded too (#1144): they're named after module specifiers
+    // ("external-unindexed-pkg", "./utils/helpers"), not symbols — an
+    // import-only name can never be surfaced (getSegmentMatches requires a
+    // real definition), so its rows would only inflate the rarity statistics.
+    if (this.isSegmentableKind(node.kind)) this.insertNameSegments(node.name);
+  }
+
+  /** Which node kinds contribute their name to the segment vocabulary — the
+   *  single gate shared by insertNode, updateNode, and the rebuild page query
+   *  (getDistinctNodeNames), so the write paths can't drift apart. */
+  private isSegmentableKind(kind: string): boolean {
+    return kind !== 'file' && kind !== 'import';
   }
 
   /** Write `name`'s segments into name_segment_vocab (idempotent). */
@@ -412,6 +423,16 @@ export class QueryBuilder {
       returnType: node.returnType ?? null,
       updatedAt: node.updatedAt ?? Date.now(),
     });
+
+    // updateNode is a second real write path to `nodes` — framework
+    // post-extract passes rewrite names through it (NestJS route prefixing),
+    // and a renamed node's new name must reach the segment vocabulary just
+    // like an inserted one's (#1141). Without this the rename left the new
+    // name permanently unsearchable: the old name's rows became honest-gate
+    // orphans and the only backfill is gated on the vocab being EMPTY.
+    // insertNameSegments is idempotent (in-memory set + INSERT OR IGNORE),
+    // so no name-changed check is needed.
+    if (this.isSegmentableKind(node.kind)) this.insertNameSegments(node.name);
   }
 
   /**
@@ -461,11 +482,12 @@ export class QueryBuilder {
     return row === undefined;
   }
 
-  /** One page of distinct non-file node names, for batched vocab rebuilds
-   *  (file basenames are excluded from the vocab — see insertNode). */
+  /** One page of distinct segmentable node names, for batched vocab rebuilds
+   *  (file basenames and import specifiers are excluded from the vocab — see
+   *  insertNode). */
   getDistinctNodeNames(limit: number, offset: number): string[] {
     const rows = this.db
-      .prepare("SELECT DISTINCT name FROM nodes WHERE kind != 'file' ORDER BY name LIMIT ? OFFSET ?")
+      .prepare("SELECT DISTINCT name FROM nodes WHERE kind NOT IN ('file', 'import') ORDER BY name LIMIT ? OFFSET ?")
       .all(limit, offset) as Array<{ name: string }>;
     return rows.map((r) => r.name);
   }
@@ -478,17 +500,29 @@ export class QueryBuilder {
   }
 
   /**
-   * Names whose segments cover at least `minSegments` of the given segments
+   * Names whose segments cover at least `minWords` distinct PROMPT WORDS
    * the co-occurrence probe behind the prompt hook's medium tier: the words
    * "state" and "machine" both being segments of `OrderStateMachine` is strong
    * evidence the prompt names that symbol in prose. Ordered by coverage.
+   *
+   * Takes (segment variant → original word) pairs and folds variants back to
+   * their word INSIDE the SQL: a name matching both `service` and `services`
+   * counts ONE word, not two. Counting raw variants let plural-variant pairs
+   * of a single word tie with genuine two-word matches and — because ORDER
+   * BY/LIMIT run here, before any JS-side re-check — crowd a real match past
+   * the LIMIT on vocab-heavy repos (#1146).
    */
-  getSegmentCoOccurrence(segments: string[], minSegments: number, limit: number): Array<{ name: string; matches: number }> {
-    if (segments.length === 0) return [];
-    const placeholders = segments.map(() => '?').join(', ');
+  getSegmentCoOccurrence(
+    variants: Array<{ segment: string; word: string }>,
+    minWords: number,
+    limit: number,
+  ): Array<{ name: string; matches: number }> {
+    if (variants.length === 0) return [];
+    const placeholders = variants.map(() => '?').join(', ');
+    const whens = variants.map(() => 'WHEN ? THEN ?').join(' ');
     const rows = this.db
       .prepare(
-        `SELECT name, COUNT(DISTINCT segment) AS matches
+        `SELECT name, COUNT(DISTINCT CASE segment ${whens} END) AS matches
          FROM name_segment_vocab
          WHERE segment IN (${placeholders})
          GROUP BY name
@@ -496,7 +530,12 @@ export class QueryBuilder {
          ORDER BY matches DESC, length(name) ASC
          LIMIT ?`,
       )
-      .all(...segments, minSegments, limit) as Array<{ name: string; matches: number }>;
+      .all(
+        ...variants.flatMap((v) => [v.segment, v.word]),
+        ...variants.map((v) => v.segment),
+        minWords,
+        limit,
+      ) as Array<{ name: string; matches: number }>;
     return rows;
   }
 

+ 49 - 6
src/index.ts

@@ -935,10 +935,14 @@ export class CodeGraph {
     }
     const variants = [...variantToWord.keys()];
 
-    // Tier A: co-occurrence. minSegments=2 counts VARIANTS, so fold a name's
-    // matched variants back to distinct words before trusting the coverage.
+    // Tier A: co-occurrence. The SQL folds variants back to their original
+    // word (#1146), so minWords=2 means two distinct PROMPT WORDS — a name
+    // matching both `service` and `services` can't tie with (or crowd past
+    // the LIMIT) a genuine two-word match. The JS re-check below recomputes
+    // the fold from live segments as the honesty layer.
+    const variantPairs = [...variantToWord.entries()].map(([segment, word]) => ({ segment, word }));
     const candidates: Array<{ name: string; matchedWords: Set<string> }> = [];
-    for (const hit of this.queries.getSegmentCoOccurrence(variants, 2, 24)) {
+    for (const hit of this.queries.getSegmentCoOccurrence(variantPairs, 2, 24)) {
       const matched = this.wordsMatchingName(hit.name, variantToWord);
       if (matched.size >= 2) candidates.push({ name: hit.name, matchedWords: matched });
     }
@@ -970,7 +974,12 @@ export class CodeGraph {
     }
 
     // Verify against nodes (the honesty gate) and pick a representative
-    // definition per name — prefer a real symbol over a file/import node.
+    // definition per name. A name whose only nodes are file/import kind has
+    // no real definition to point at — surfacing the import statement instead
+    // reads as a matched symbol but isn't one (#1144) — so it's skipped, the
+    // same way an orphaned vocab row is. (Import names no longer enter the
+    // vocab at write time, but rows written before that exclusion persist
+    // until the next full index.)
     const out: SegmentMatch[] = [];
     const seen = new Set<string>();
     candidates.sort((a, b) => b.matchedWords.size - a.matchedWords.size || a.name.length - b.name.length);
@@ -980,7 +989,8 @@ export class CodeGraph {
       seen.add(candidate.name);
       const nodes = this.queries.getNodesByName(candidate.name);
       if (nodes.length === 0) continue; // orphaned vocab row — name no longer exists
-      const rep = nodes.find((n) => n.kind !== 'file' && n.kind !== 'import') ?? nodes[0]!;
+      const rep = nodes.find((n) => n.kind !== 'file' && n.kind !== 'import');
+      if (!rep) continue; // no real definition — don't surface an import/file as one
       out.push({
         name: candidate.name,
         kind: rep.kind,
@@ -1009,10 +1019,43 @@ export class CodeGraph {
     return matched;
   }
 
+  /**
+   * One-shot upgrade heal for callers that open the graph WITHOUT syncing —
+   * concretely the prompt hook, whose MEDIUM tier reads the segment
+   * vocabulary: a database migrated from before the vocab table existed
+   * starts with it empty, and the only other backfill lives inside `sync()`,
+   * which such callers never run (#1142). Returns true when the vocab is
+   * usable (already populated — the overwhelmingly common one-SELECT case —
+   * or healed here); false when it isn't (empty graph, or another process
+   * holds the index lock — that process's own sync heals it).
+   */
+  async healSegmentVocabIfEmpty(): Promise<boolean> {
+    const empty = (() => {
+      try { return this.queries.isNameSegmentVocabEmpty(); } catch { return false; }
+    })();
+    if (!empty) return true;
+    if (this.queries.getNodeAndEdgeCount().nodes === 0) return false;
+    return this.indexMutex.withLock(async () => {
+      try {
+        this.fileLock.acquire();
+      } catch {
+        return false; // an index/sync is running — it backfills the vocab itself
+      }
+      try {
+        if (!this.queries.isNameSegmentVocabEmpty()) return true; // raced: healed meanwhile
+        await this.rebuildNameSegmentVocab();
+        return true;
+      } finally {
+        this.fileLock.release();
+      }
+    });
+  }
+
   /**
    * Rebuild the segment vocabulary from the current graph, batched and
    * yielding — the upgrade-heal path for indexes built before the vocab table
-   * existed. Runs inside sync's mutex/lock (callers hold them).
+   * existed. Runs inside the index mutex/lock (sync and
+   * healSegmentVocabIfEmpty hold them).
    */
   private async rebuildNameSegmentVocab(): Promise<void> {
     const maybeYield = createYielder();

+ 23 - 3
src/search/identifier-segments.ts

@@ -129,12 +129,32 @@ export function extractProseCandidates(prompt: string): string[] {
 /**
  * Lookup variants for a prose word: the word itself plus light plural folding
  * ("services" → service, "dependencies" → dependencie/dependency is NOT
- * attempted — only trailing s/es strip), so common plurals still hit their
+ * attempted — only a trailing s/es strip), so common plurals still hit their
  * singular segment. Returned variants map back to the same original word.
+ *
+ * The strips are keyed on English plural spelling (#1145), in three classes:
+ * - UNAMBIGUOUS `-es` (after x/sh/ss/zz: boxes, hashes, classes, quizzes) —
+ *   strip 2 only. Stripping 1 minted a bogus sibling ("classes" → classe).
+ * - AMBIGUOUS endings (`-ches`/`-ses`/`-zes`/`-oes`): spelling alone can't
+ *   split patches(+es) from caches(+s), lenses from databases, heroes from
+ *   shoes — emit BOTH candidate keys and let the vocab lookup decide; a miss
+ *   is an ignored key, a wrong exclusive guess would LOSE the real match.
+ * - Everything else ending in `-s` — a bare `-s` plural (services, machines,
+ *   cookies): strip 1 only. Stripping 2 minted "services" → servic.
+ * A trailing `-ss` is a singular (class, process), not a plural: no strip —
+ * that used to mint "class" → clas.
  */
 export function segmentLookupVariants(word: string): string[] {
   const variants = [word];
-  if (word.endsWith('es') && word.length >= MIN_PROSE_CHARS + 2) variants.push(word.slice(0, -2));
-  if (word.endsWith('s') && word.length >= MIN_PROSE_CHARS + 1) variants.push(word.slice(0, -1));
+  const canStrip2 = word.length >= MIN_PROSE_CHARS + 2;
+  const canStrip1 = word.length >= MIN_PROSE_CHARS + 1;
+  if (/(?:x|sh|ss|zz)es$/.test(word)) {
+    if (canStrip2) variants.push(word.slice(0, -2));
+  } else if (/(?:ch|s|z|o)es$/.test(word)) {
+    if (canStrip2) variants.push(word.slice(0, -2));
+    if (canStrip1) variants.push(word.slice(0, -1));
+  } else if (word.endsWith('s') && !word.endsWith('ss')) {
+    if (canStrip1) variants.push(word.slice(0, -1));
+  }
   return variants;
 }