
fix(search): down-weight the project name in ranking — completes #720 (#748)

The per-word path fix (#745) brought the backend to parity but not above:
the project name still gave the lexically-matching stack a residual dir
match + an FTS class-name match, so a backend query that included the
project name still ranked the frontend at/above the backend.

Derive the project name from go.mod module / package.json name / repo dir,
and treat a query word matching it as non-discriminative: drop it from path
relevance and from codegraph_explore's PascalCase type-disambiguation bias
(reporter's suggestions #1/#2) — unless it's the only query word, so a bare
project-name search still scores.
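
A minimal sketch of the derivation step described above, written against in-memory inputs so it runs without a repository on disk. The `ManifestInputs` type and the `projectNameTokensFrom` helper are hypothetical names used only for this illustration; the commit's own function reads `go.mod` and `package.json` from the project root.

```typescript
// Hypothetical input shape for illustration: manifest contents passed as
// strings instead of being read from the project root.
interface ManifestInputs {
  goMod?: string;       // raw go.mod text, if the repo has one
  packageJson?: string; // raw package.json text, if the repo has one
  repoDirName: string;  // basename of the repo root directory
}

// Names shorter than this are ignored (the commit uses a 5-character floor).
const MIN_NAME_LENGTH = 5;

// Lowercase and keep alphanumerics only, so "SuperBizAgent" and
// "superbizagent-web" become directly comparable tokens.
function normalize(raw: string): string {
  return raw.toLowerCase().replace(/[^a-z0-9]/g, '');
}

function projectNameTokensFrom(inputs: ManifestInputs): Set<string> {
  const tokens = new Set<string>();
  const add = (raw: string | undefined): void => {
    if (!raw) return;
    const norm = normalize(raw);
    if (norm.length >= MIN_NAME_LENGTH) tokens.add(norm);
  };

  // go.mod: last path segment of the module directive.
  if (inputs.goMod) {
    const m = inputs.goMod.match(/^\s*module\s+(\S+)/m);
    if (m && m[1]) add(m[1].split('/').pop());
  }

  // package.json: the name field with any "@scope/" prefix stripped.
  if (inputs.packageJson) {
    try {
      const pkg: unknown = JSON.parse(inputs.packageJson);
      const name = (pkg as { name?: unknown } | null)?.name;
      if (typeof name === 'string') add(name.replace(/^@[^/]+\//, ''));
    } catch {
      // Invalid JSON: skip this source.
    }
  }

  // Repo directory name as a fallback source.
  add(inputs.repoDirName);
  return tokens;
}

const tokens = projectNameTokensFrom({
  goMod: 'module example.com/SuperBizAgent\n\ngo 1.21\n',
  packageJson: JSON.stringify({ name: '@acme/superbizagent-web' }),
  repoDirName: 'app', // too short, so it is not added
});
```

With these inputs the set holds the two manifest-derived names, while the three-character directory name is skipped by the length floor.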

Narrow by construction: the down-weighting fires ONLY when a query word
matches the derived project name (≥5 chars), so every query that doesn't
name the project is byte-identical. On the reporter's repro the backend
controllers now top a backend question that includes the project name;
queries without it, bare project-name queries, and normal symbol queries
are unchanged. Query-time only (no re-index).
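
The three cases named above (a query that includes the project name, a bare project-name query, and a query that omits it) can be sketched in isolation. `wordsToScore` and `normalizeWord` are hypothetical names for this illustration; in the commit the same filter sits inside `scorePathRelevance`.

```typescript
// Same normalization the commit applies: lowercase, alphanumerics only.
function normalizeWord(raw: string): string {
  return raw.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Returns the query words that should contribute to path scoring.
// A word matching a project-name token is dropped, unless dropping would
// leave nothing to score, in which case the original words are kept.
function wordsToScore(query: string, projectNameTokens: ReadonlySet<string>): string[] {
  const allWords = query.split(/\s+/).filter((w) => w.length > 0);
  const kept = allWords.filter((w) => !projectNameTokens.has(normalizeWord(w)));
  return kept.length > 0 ? kept : allWords;
}

const projectNames = new Set(['superbizagent']);

// Project name plus other words: the name is dropped.
const mixed = wordsToScore('SuperBizAgent backend routes', projectNames);

// Bare project-name query: nothing else remains, so the name is kept.
const bare = wordsToScore('SuperBizAgent', projectNames);

// Query that never names the project: returned unchanged.
const unrelated = wordsToScore('controller chat', projectNames);
```

Because the filter only removes words found in the token set, a query with no such word takes the same path as before the change, which is the sense in which the fix is narrow by construction.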

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry, 2 weeks ago
Parent
Commit
75ae1e8bd9
6 changed files with 155 additions and 11 deletions
  1. CHANGELOG.md (+1 -1)
  2. __tests__/context-ranking.test.ts (+42 -1)
  3. src/db/queries.ts (+18 -1)
  4. src/index.ts (+19 -0)
  5. src/mcp/tools.ts (+9 -3)
  6. src/search/query-utils.ts (+66 -5)

+ 1 - 1
CHANGELOG.md

@@ -29,7 +29,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### Fixes
 
-- Search relevance: a multi-word PascalCase query token — typically a project name a user naturally includes (searching `MyApp backend routes`, say) — no longer over-weights a file whose path or class name embeds it. Such a token was scored once per sub-token (`my` / `app` / `myapp`), so a single concept boosted a lexically-matching file's path score several times over — enough, in a mixed-stack repo, to bury the stack the query was actually about. Path relevance now counts each query word once per path level (still splitting it so it matches across naming conventions), so the rest of the query's terms decide the ranking. Thanks @MiNuo1. (#720)
+- Search relevance: including the project name in a query (a user naturally writes `MyApp backend routes`) no longer buries the part of the codebase the query is actually about. The project name lexically matches whatever stack embeds it — a `MyAppFrontend/` directory, a `MyAppApp` class — and it was over-weighted two ways: a single PascalCase word was scored once per sub-token (`my` / `app` / `myapp`), so one concept boosted that path several times over; and the name carried full path / disambiguation weight even though it names the whole repo, not any symbol. Now path relevance counts each query word once, and a word matching the project name (derived from `go.mod`, `package.json`, or the repo directory) is dropped from path scoring and from `codegraph_explore`'s type-disambiguation bias — unless it's the only term, so a bare project-name search still works. In a mixed-stack repo, a backend question now surfaces the backend even with the project name in the query. Thanks @MiNuo1. (#720)
 - Go: a function called only from inside an anonymous closure — a cobra `RunE: func(…) {…}` handler, a goroutine literal, or a callback closure stored in a package-level `var` — now shows its real caller. Previously the call leaked to the file node, so `codegraph_callers` and `codegraph_impact` reported such a function as having no meaningful caller; the call is now attributed to the enclosing declaration, so editing the function surfaces the closures that use it. Existing Go indexes should be re-indexed (`codegraph index -f`) to benefit. Thanks @Cyclone1070. (#693) (Go)
 - Indexing no longer aborts when a `.gitignore` contains non-UTF-8 bytes or an unparseable pattern. A `.gitignore` transparently encrypted in place by corporate DLP / endpoint-security software (a common enterprise scenario) — or one with a stray pattern the matcher can't compile (`\[`, producing "Unterminated character class") — used to crash the entire `sync` / `index` with a screen of garbled bytes and never name the offending file, leaving `Files: 0 / Nodes: 0`. CodeGraph now skips a `.gitignore` that isn't valid UTF-8 text whole, drops only the individual unparseable patterns from a text one, and logs a warning naming the file — indexing continues either way. Thanks @zhanghang-9527. (#682)
 - C++ method calls made through a singleton, factory, or chained getter now resolve to the correct class. A call like `Foo::instance().bar()`, `WidgetFactory::create().draw()`, `openSession()->run()`, or the same stored in an `auto` local first, used to lose the receiver's type — so when two classes had a same-named method the call silently attached to whichever was indexed first (or didn't resolve at all), corrupting callers, impact, and trace. CodeGraph now infers the receiver's type from what the inner call returns (capturing C++ return types for the first time) and creates the edge only when that class genuinely has the method, so a wrong guess produces no edge instead of a misleading one. Covers singletons and self-returning accessors, factories that return a different type, free-function factories, `make_unique` / `make_shared` / `new` / direct construction, and single-level member chains. Existing C/C++ indexes should be re-indexed (`codegraph index -f`) to benefit. Thanks @stabey. (#645) (C/C++)

+ 42 - 1
__tests__/context-ranking.test.ts

@@ -16,7 +16,7 @@ import * as path from 'path';
 import * as os from 'os';
 import CodeGraph from '../src/index';
 import { LOW_CONFIDENCE_MARKER } from '../src/context';
-import { isDistinctiveIdentifier, scorePathRelevance } from '../src/search/query-utils';
+import { isDistinctiveIdentifier, scorePathRelevance, deriveProjectNameTokens } from '../src/search/query-utils';
 
 describe('isDistinctiveIdentifier', () => {
   it('treats plain dictionary words as non-distinctive', () => {
@@ -64,6 +64,47 @@ describe('scorePathRelevance per-word scoring (#720)', () => {
   });
 });
 
+// The project name is context, not a discriminator: dropping it from path
+// scoring stops every file under a `<ProjectName>…/` tree from winning on the
+// name alone, so the rest of the query decides the ranking (#720).
+describe('project-name down-weighting in path relevance (#720)', () => {
+  it('derives the project name from go.mod / package.json, skipping short names', () => {
+    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-projname-'));
+    try {
+      fs.writeFileSync(path.join(dir, 'go.mod'), 'module example.com/SuperBizAgent\n\ngo 1.21\n');
+      fs.writeFileSync(path.join(dir, 'package.json'), JSON.stringify({ name: '@acme/superbizagent-web' }));
+      const tokens = deriveProjectNameTokens(dir);
+      expect(tokens.has('superbizagent')).toBe(true);
+      expect(tokens.has('superbizagentweb')).toBe(true);
+    } finally {
+      fs.rmSync(dir, { recursive: true, force: true });
+    }
+  });
+
+  it('drops a project-name query word from path scoring when other words remain', () => {
+    const proj = new Set(['superbizagent']);
+    // Without the project name dropped, the frontend path wins on it (+5).
+    // With it dropped, only "backend" is left — and it doesn't match this path.
+    const withDrop = scorePathRelevance('SuperBizAgentFrontend/app.js', 'SuperBizAgent backend', proj);
+    const noDrop = scorePathRelevance('SuperBizAgentFrontend/app.js', 'SuperBizAgent backend');
+    expect(withDrop).toBeLessThan(noDrop);
+    expect(withDrop).toBe(0);
+  });
+
+  it('keeps the project-name word when it is the ONLY query word (bare query still scores)', () => {
+    const proj = new Set(['superbizagent']);
+    expect(scorePathRelevance('SuperBizAgentFrontend/app.js', 'SuperBizAgent', proj)).toBe(5);
+  });
+
+  it('does not affect a query that omits the project name', () => {
+    const proj = new Set(['superbizagent']);
+    const path0 = 'internal/controller/chat/chat.go';
+    expect(scorePathRelevance(path0, 'controller chat', proj)).toBe(
+      scorePathRelevance(path0, 'controller chat')
+    );
+  });
+});
+
 describe('Context ranking — common-word precision & confidence', () => {
   let testDir: string;
   let cg: CodeGraph;

+ 18 - 1
src/db/queries.ts

@@ -176,6 +176,12 @@ function rowToFileRecord(row: FileRow): FileRecord {
 export class QueryBuilder {
   private db: SqliteDatabase;
 
+  // Project-name tokens (go.mod / package.json / repo dir), normalized. A query
+  // word matching one is dropped from path-relevance scoring — it names the
+  // whole project, not a symbol, so it carries no discriminative signal (#720).
+  // Set once by the CodeGraph instance; empty by default (no down-weighting).
+  private projectNameTokens: Set<string> = new Set();
+
   // Node cache for frequently accessed nodes (LRU-style, max 1000 entries)
   private nodeCache: Map<string, Node> = new Map();
   private readonly maxCacheSize = 1000;
@@ -219,6 +225,17 @@ export class QueryBuilder {
     this.db = db;
   }
 
+  /** Set the normalized project-name tokens used to down-weight non-discriminative
+   * query words in path scoring (#720). Called once when the project opens. */
+  setProjectNameTokens(tokens: Set<string>): void {
+    this.projectNameTokens = tokens;
+  }
+
+  /** The normalized project-name tokens (#720); empty if none were derived. */
+  getProjectNameTokens(): Set<string> {
+    return this.projectNameTokens;
+  }
+
   // ===========================================================================
   // Node Operations
   // ===========================================================================
@@ -842,7 +859,7 @@ export class QueryBuilder {
         ...r,
         score: r.score
           + kindBonus(r.node.kind)
-          + scorePathRelevance(r.node.filePath, scoringQuery)
+          + scorePathRelevance(r.node.filePath, scoringQuery, this.projectNameTokens)
           + nameMatchBonus(r.node.name, scoringQuery),
       }));
       results.sort((a, b) => b.score - a.score);

+ 19 - 0
src/index.ts

@@ -49,6 +49,7 @@ import { Mutex, FileLock } from './utils';
 import { FileWatcher, WatchOptions, PendingFile, LockUnavailableError } from './sync';
 import { EXTRACTION_VERSION } from './extraction/extraction-version';
 import { getCodeGraphDir } from './directory';
+import { deriveProjectNameTokens } from './search/query-utils';
 import { CodeGraphPackageVersion } from './mcp/version';
 
 // Re-export types for consumers
@@ -154,6 +155,13 @@ export class CodeGraph {
     this.db = db;
     this.queries = queries;
     this.projectRoot = projectRoot;
+    // Down-weight the project name as a query term in search ranking — it names
+    // the whole repo, not a symbol, so it has no discriminative value (#720).
+    try {
+      this.queries.setProjectNameTokens(deriveProjectNameTokens(projectRoot));
+    } catch {
+      // Best-effort: ranking still works without it.
+    }
     this.fileLock = new FileLock(
       path.join(getCodeGraphDir(projectRoot), 'codegraph.lock')
     );
@@ -747,6 +755,17 @@ export class CodeGraph {
     return this.queries.searchNodes(query, options);
   }
 
+  /**
+   * Normalized project-name tokens (go.mod / package.json / repo dir) used to
+   * down-weight the non-discriminative project name in search ranking (#720).
+   * Exposed so explore can exclude it from the PascalCase type-disambiguation
+   * bias, which would otherwise pull overloaded tokens toward whichever stack
+   * embeds the project name.
+   */
+  getProjectNameTokens(): Set<string> {
+    return this.queries.getProjectNameTokens();
+  }
+
   /**
    * Find the project's "primary route file" — the file with the densest
    * concentration of framework-emitted `route` nodes (≥3 routes, ≥30%

+ 9 - 3
src/mcp/tools.ts

@@ -21,7 +21,7 @@ import {
 } from '../sync/worktree';
 import type { PendingFile } from '../sync';
 import type { Node, Edge, SearchResult, Subgraph, NodeKind } from '../types';
-import { isTestFile } from '../search/query-utils';
+import { isTestFile, normalizeNameToken } from '../search/query-utils';
 import {
   existsSync,
   readFileSync,
@@ -1661,8 +1661,14 @@ export class ToolHandler {
       // agent writes "DataRequest task validate", the `task`/`validate` it wants
       // are DataRequest's, NOT the same-named overloads in Validation.swift /
       // Concurrency.swift / the abstract base. Used below to bias overloaded
-      // names toward the file/class the query also names.
-      const typeTokens = tokens.filter((o) => /^[A-Z][A-Za-z0-9]{3,}/.test(o));
+      // names toward the file/class the query also names. EXCLUDE the project
+      // name (a PascalCase token a user naturally includes) — it names the whole
+      // repo, so biasing toward it just pulls overloads to whichever stack
+      // embeds it, re-burying the rest (#720).
+      const projectNameTokens = cg.getProjectNameTokens();
+      const typeTokens = tokens.filter(
+        (o) => /^[A-Z][A-Za-z0-9]{3,}/.test(o) && !projectNameTokens.has(normalizeNameToken(o)),
+      );
       const inNamedContext = (n: Node) =>
         typeTokens.some((ct) => {
           const lc = ct.toLowerCase();

+ 66 - 5
src/search/query-utils.ts

@@ -4,9 +4,55 @@
  * Shared module for search term extraction and scoring.
  */
 
+import * as fs from 'fs';
 import * as path from 'path';
 import { Node } from '../types';
 
+/** Normalize a name to a comparable token: lowercase, alphanumerics only. */
+export function normalizeNameToken(raw: string): string {
+  return raw.toLowerCase().replace(/[^a-z0-9]/g, '');
+}
+
+/**
+ * Tokens that name the PROJECT as a whole — its `go.mod` module, `package.json`
+ * name, or repo root directory — rather than any specific symbol. A user
+ * naturally puts the project name in a query as context ("MyApp backend
+ * routes"), but it carries no discriminative signal: when it's also a substring
+ * of a symbol or path on one stack (a `MyAppFrontend/` dir, a `MyAppApp` class)
+ * it lexically inflates that stack and buries the rest (#720).
+ *
+ * Returned normalized (lowercase, alphanumerics only) so a query word can be
+ * compared by its normalized form. Only names ≥5 chars are kept — short ones
+ * (`api`, `app`, `core`, `web`) collide with real query terms too often to
+ * safely down-weight.
+ */
+export function deriveProjectNameTokens(projectRoot: string): Set<string> {
+  const tokens = new Set<string>();
+  const add = (raw: string | undefined | null): void => {
+    if (!raw) return;
+    const norm = normalizeNameToken(raw);
+    if (norm.length >= 5) tokens.add(norm);
+  };
+
+  // go.mod module last segment (the most reliable signal for Go repos).
+  try {
+    const gomod = fs.readFileSync(path.join(projectRoot, 'go.mod'), 'utf-8');
+    const m = gomod.match(/^\s*module\s+(\S+)/m);
+    if (m && m[1]) add(m[1].split('/').pop());
+  } catch { /* no go.mod */ }
+
+  // package.json name (strip an `@scope/` prefix).
+  try {
+    const pkg = JSON.parse(fs.readFileSync(path.join(projectRoot, 'package.json'), 'utf-8'));
+    if (typeof pkg.name === 'string') add(pkg.name.replace(/^@[^/]+\//, ''));
+  } catch { /* no / invalid package.json */ }
+
+  // Repo root directory name — a fallback when neither manifest names the project.
+  add(path.basename(path.resolve(projectRoot)));
+
+  return tokens;
+}
+
 /**
  * Common stop words to filter from search queries.
  * Includes generic English + code-specific noise words.
@@ -172,7 +218,11 @@ export function extractSearchTerms(query: string, options?: { stems?: boolean })
  * Score path relevance to a query
  * Higher score = more relevant path
  */
-export function scorePathRelevance(filePath: string, query: string): number {
+export function scorePathRelevance(
+  filePath: string,
+  query: string,
+  projectNameTokens?: Set<string>,
+): number {
   const pathLower = filePath.toLowerCase();
   const fileName = path.basename(filePath).toLowerCase();
   const dirName = path.dirname(filePath).toLowerCase();
@@ -187,10 +237,21 @@ export function scorePathRelevance(filePath: string, query: string): number {
   // Split the ORIGINAL-case query into words; extractSearchTerms does the
   // camelCase/snake split per word (so `getUserName` still matches a
   // `get_user_name` path) — we just attribute each word's matches once.
-  const words = query.split(/\s+/).filter((w) => w.length > 0);
-  if (words.length === 0) return 0;
-
-  for (const word of words) {
+  const allWords = query.split(/\s+/).filter((w) => w.length > 0);
+  if (allWords.length === 0) return 0;
+
+  // A query word that just names the PROJECT (its go.mod / package.json / repo
+  // name) carries no discriminative path signal — drop it so the rest of the
+  // query decides the ranking, instead of every file under a `<ProjectName>…/`
+  // tree winning on the project name alone (#720). Only when OTHER words remain,
+  // so a bare project-name query still scores on its path.
+  const words =
+    projectNameTokens && projectNameTokens.size > 0
+      ? allWords.filter((w) => !projectNameTokens.has(normalizeNameToken(w)))
+      : allWords;
+  const scored = words.length > 0 ? words : allWords;
+
+  for (const word of scored) {
     // Use base terms only — stem variants inflate path scores by generating
     // many near-duplicate terms that all match the same path segments.
     const subtokens = extractSearchTerms(word, { stems: false });