fix(search): score path relevance per query word, not per sub-token (#720) (#745)

A multi-word PascalCase query token — typically a project name a user
includes (`SuperBizAgent backend routes`) — splits into sub-tokens
(superbizagent / super / biz / agent) that ALL match the same path segment,
so path relevance summed +5 four times for one concept. In a mixed-stack
repo that roughly doubled the score of every file in the lexically matching
stack, burying the stack the query was actually about.

Score path relevance per original query WORD instead: a word matches a path
level if any of its sub-tokens do, and counts once — while still splitting
the word (via extractSearchTerms on the original case) so it matches across
naming conventions (`getUserName` → `get_user_name`). Distinct words each
still contribute.
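
A minimal sketch of the difference, for illustration only. It is not the
code in src/search/query-utils.ts: `splitWord` is a hypothetical stand-in for
extractSearchTerms, and both scoring functions look at the directory level
only, with the +5 weight taken from the description above.

```typescript
// Hypothetical stand-in for extractSearchTerms: splits one word on
// camelCase / snake_case boundaries and keeps the whole word, all lowercased.
function splitWord(word: string): string[] {
  const parts = word
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    .split(/[\s_\-]+/)
    .filter((p) => p.length > 0)
    .map((p) => p.toLowerCase());
  return Array.from(new Set([word.toLowerCase(), ...parts]));
}

// Per sub-token (simplified old behaviour): every sub-token that hits the
// directory adds +5, so one PascalCase word can score several times.
function dirScorePerSubToken(dirName: string, query: string): number {
  const dir = dirName.toLowerCase();
  let score = 0;
  for (const word of query.split(/\s+/).filter((w) => w.length > 0)) {
    for (const t of splitWord(word)) {
      if (dir.includes(t)) score += 5;
    }
  }
  return score;
}

// Per word (simplified new behaviour): a word adds +5 once if ANY of its
// sub-tokens hits the directory; distinct words still add separately.
function dirScorePerWord(dirName: string, query: string): number {
  const dir = dirName.toLowerCase();
  let score = 0;
  for (const word of query.split(/\s+/).filter((w) => w.length > 0)) {
    if (splitWord(word).some((t) => dir.includes(t))) score += 5;
  }
  return score;
}

console.log(dirScorePerSubToken('SuperBizAgentFrontend', 'SuperBizAgent backend routes'));
console.log(dirScorePerWord('SuperBizAgentFrontend', 'SuperBizAgent backend routes'));
```

Under these assumptions the first call counts the project name four times
(20) and the second counts it once (5), while a query word such as
`getUserName` still reaches a `get_user_name` directory through its
sub-tokens.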

Partial fix: this removes the dominant path over-counting (backend rises
from absent-in-top-6 to parity on the reporter's repro). The project name
still gives a residual lexical edge through the FTS class-name match and the
directory match; removing it needs a deeper down-weighting change, tracked
separately. No re-index needed (the scoring happens at query time).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry, 2 weeks ago
parent
commit
afec1282e1
3 changed files with 47 additions and 10 deletions
  1. CHANGELOG.md (+1 -0)
  2. __tests__/context-ranking.test.ts (+26 -1)
  3. src/search/query-utils.ts (+20 -9)

+ 1 - 0
CHANGELOG.md

@@ -29,6 +29,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### Fixes
 
+- Search relevance: a multi-word PascalCase query token — typically a project name a user naturally includes (searching `MyApp backend routes`, say) — no longer over-weights a file whose path or class name embeds it. Such a token was scored once per sub-token (`my` / `app` / `myapp`), so a single concept boosted a lexically-matching file's path score several times over — enough, in a mixed-stack repo, to bury the stack the query was actually about. Path relevance now counts each query word once per path level (still splitting it so it matches across naming conventions), so the rest of the query's terms decide the ranking. Thanks @MiNuo1. (#720)
 - Go: a function called only from inside an anonymous closure — a cobra `RunE: func(…) {…}` handler, a goroutine literal, or a callback closure stored in a package-level `var` — now shows its real caller. Previously the call leaked to the file node, so `codegraph_callers` and `codegraph_impact` reported such a function as having no meaningful caller; the call is now attributed to the enclosing declaration, so editing the function surfaces the closures that use it. Existing Go indexes should be re-indexed (`codegraph index -f`) to benefit. Thanks @Cyclone1070. (#693) (Go)
 - Indexing no longer aborts when a `.gitignore` contains non-UTF-8 bytes or an unparseable pattern. A `.gitignore` transparently encrypted in place by corporate DLP / endpoint-security software (a common enterprise scenario) — or one with a stray pattern the matcher can't compile (`\[`, producing "Unterminated character class") — used to crash the entire `sync` / `index` with a screen of garbled bytes and never name the offending file, leaving `Files: 0 / Nodes: 0`. CodeGraph now skips a `.gitignore` that isn't valid UTF-8 text whole, drops only the individual unparseable patterns from a text one, and logs a warning naming the file — indexing continues either way. Thanks @zhanghang-9527. (#682)
 - C++ method calls made through a singleton, factory, or chained getter now resolve to the correct class. A call like `Foo::instance().bar()`, `WidgetFactory::create().draw()`, `openSession()->run()`, or the same stored in an `auto` local first, used to lose the receiver's type — so when two classes had a same-named method the call silently attached to whichever was indexed first (or didn't resolve at all), corrupting callers, impact, and trace. CodeGraph now infers the receiver's type from what the inner call returns (capturing C++ return types for the first time) and creates the edge only when that class genuinely has the method, so a wrong guess produces no edge instead of a misleading one. Covers singletons and self-returning accessors, factories that return a different type, free-function factories, `make_unique` / `make_shared` / `new` / direct construction, and single-level member chains. Existing C/C++ indexes should be re-indexed (`codegraph index -f`) to benefit. Thanks @stabey. (#645) (C/C++)

+ 26 - 1
__tests__/context-ranking.test.ts

@@ -16,7 +16,7 @@ import * as path from 'path';
 import * as os from 'os';
 import CodeGraph from '../src/index';
 import { LOW_CONFIDENCE_MARKER } from '../src/context';
-import { isDistinctiveIdentifier } from '../src/search/query-utils';
+import { isDistinctiveIdentifier, scorePathRelevance } from '../src/search/query-utils';
 
 describe('isDistinctiveIdentifier', () => {
   it('treats plain dictionary words as non-distinctive', () => {
@@ -39,6 +39,31 @@ describe('isDistinctiveIdentifier', () => {
   });
 });
 
+// A single PascalCase query word (notably a project name a user naturally
+// includes) splits into sub-tokens that all match the SAME path segment; summed
+// per sub-token it boosted that path 4×, burying the rest of the query's stack
+// (#720). Path relevance must count each original WORD once per level, while
+// still splitting it for cross-convention matching.
+describe('scorePathRelevance per-word scoring (#720)', () => {
+  it('counts a single PascalCase word once per path level, not once per sub-token', () => {
+    // "SuperBizAgent" → super/biz/agent/superbizagent all hit the dir, but it's
+    // one concept: +5 (dir) once, not +20.
+    expect(scorePathRelevance('SuperBizAgentFrontend/app.js', 'SuperBizAgent')).toBe(5);
+  });
+
+  it('still splits a word so it matches across naming conventions', () => {
+    // getUserName must still match a snake_case path via its sub-tokens.
+    expect(scorePathRelevance('get_user_name.go', 'getUserName')).toBeGreaterThanOrEqual(10);
+  });
+
+  it('still credits distinct query words matching different path segments', () => {
+    // auth (dir) and handler (filename) are separate concepts — each counts.
+    expect(scorePathRelevance('src/auth/login_handler.go', 'auth handler')).toBeGreaterThan(
+      scorePathRelevance('src/auth/login_handler.go', 'auth')
+    );
+  });
+});
+
 describe('Context ranking — common-word precision & confidence', () => {
   let testDir: string;
   let cg: CodeGraph;

+ 20 - 9
src/search/query-utils.ts

@@ -173,23 +173,34 @@ export function extractSearchTerms(query: string, options?: { stems?: boolean })
  * Higher score = more relevant path
  */
 export function scorePathRelevance(filePath: string, query: string): number {
-  // Use base terms only — stem variants inflate path scores by generating
-  // many near-duplicate terms that all match the same path segments.
-  const terms = extractSearchTerms(query, { stems: false });
-  if (terms.length === 0) return 0;
-
   const pathLower = filePath.toLowerCase();
   const fileName = path.basename(filePath).toLowerCase();
   const dirName = path.dirname(filePath).toLowerCase();
   let score = 0;
 
-  for (const term of terms) {
+  // Score per original query WORD, not per sub-token. A single PascalCase word
+  // splits into many sub-tokens (a project name "SuperBizAgent" →
+  // superbizagent / super / biz / agent) that all match the SAME path segment,
+  // so summing per sub-token boosted that path 4× for one concept — enough to
+  // bury the rest of the query's stack (#720). A word matches a path level if
+  // ANY of its sub-tokens do, and counts ONCE; distinct words still each add.
+  // Split the ORIGINAL-case query into words; extractSearchTerms does the
+  // camelCase/snake split per word (so `getUserName` still matches a
+  // `get_user_name` path) — we just attribute each word's matches once.
+  const words = query.split(/\s+/).filter((w) => w.length > 0);
+  if (words.length === 0) return 0;
+
+  for (const word of words) {
+    // Use base terms only — stem variants inflate path scores by generating
+    // many near-duplicate terms that all match the same path segments.
+    const subtokens = extractSearchTerms(word, { stems: false });
+    if (subtokens.length === 0) continue;
     // Exact filename match (strongest)
-    if (fileName.includes(term)) score += 10;
+    if (subtokens.some((t) => fileName.includes(t))) score += 10;
     // Directory match
-    if (dirName.includes(term)) score += 5;
+    if (subtokens.some((t) => dirName.includes(t))) score += 5;
     // General path match
-    else if (pathLower.includes(term)) score += 3;
+    else if (subtokens.some((t) => pathLower.includes(t))) score += 3;
   }
 
   // Deprioritize test files unless the query is explicitly about tests