Преглед изворни кода

docs: Add search quality improvement guide for CodeGraph language extractors

Documents the systematic process for testing and improving search result relevance when LLMs query CodeGraph. Provides step-by-step loop for diagnosing issues with method search ranking, implementing getReceiverType hooks for languages where methods appear outside their owner type in the AST, and validating fixes with real codebases.
Colby McHenry пре 2 месеци
родитељ
комит
6598904d06
1 измењених фајлова са 191 додато и 0 уклоњено
  1. 191 0
      docs/SEARCH_QUALITY_LOOP.md

+ 191 - 0
docs/SEARCH_QUALITY_LOOP.md

@@ -0,0 +1,191 @@
+# CodeGraph Search Quality Loop
+
+You are testing and improving CodeGraph's search quality for a specific language. The user will give you a real-world codebase path to test against.
+
+## What You're Fixing
+
+When an LLM queries CodeGraph via MCP tools (`codegraph_search`, `codegraph_explore`, `codegraph_callees`), the results must be relevant. The main failure mode is: methods with common names (like `run`, `get`, `handle`) flood results and bury the actual target. The fix is usually adding `getReceiverType` to the language extractor so methods include their owner type in the FTS-indexed `qualified_name`.
+
+**Example:** Go's `func (sl *scrapeLoop) run()` was indexed as `scrape.go::scrape.go::run`. After adding `getReceiverType`, it became `scrape.go::scrapeLoop::run` — now FTS can rank it above unrelated `run` methods when the query mentions "scrapeLoop".
+
+## The Loop
+
+### 1. Pick a test query
+
+Choose a query that exercises the language's method-on-type pattern. Good queries mention:
+- A specific type/class/struct name
+- A method on that type
+- A broader topic connecting multiple files
+
+Example for Go: `"scrapeLoop run scrape lifecycle TSDB storage"`
+
+### 2. Index the codebase
+
+```bash
+rm -rf <codebase_path>/.codegraph
+node dist/bin/codegraph.js init -iv <codebase_path>
+```
+
+The `-iv` flag gives verbose output showing extraction progress, node/edge counts, and timing.
+
+### 3. Check what the DB produced
+
+```bash
+# Does the method have its owner type in qualified_name?
+sqlite3 <codebase_path>/.codegraph/codegraph.db \
+  "SELECT name, kind, qualified_name FROM nodes WHERE name = '<method>' AND file_path LIKE '%<file>%';"
+
+# GOOD: file.rs::StructName::method_name
+# BAD:  file.rs::file.rs::method_name  ← owner type missing, FTS can't find it
+```
+
+### 4. Test search ranking
+
+```bash
+node -e "
+const { CodeGraph } = require('./dist/index.js');
+async function test() {
+  const cg = await CodeGraph.open('<codebase_path>');
+
+  // Does the target method rank #1?
+  console.log('=== searchNodes ===');
+  const results = cg.searchNodes('<OwnerType> <method>', { limit: 10, kinds: ['method'] });
+  for (const r of results) {
+    console.log(\`\${r.score.toFixed(2)} | \${r.node.name} (\${r.node.kind}) | \${r.node.filePath}:\${r.node.startLine}\`);
+  }
+
+  // Does explore find the right file?
+  console.log('\n=== findRelevantContext ===');
+  const subgraph = await cg.findRelevantContext('<your natural language query>', {
+    searchLimit: 8, traversalDepth: 3, maxNodes: 80, minScore: 0.2,
+  });
+  const fileGroups = new Map();
+  for (const node of subgraph.nodes.values()) {
+    if (!fileGroups.has(node.filePath)) fileGroups.set(node.filePath, []);
+    fileGroups.get(node.filePath).push(node.name);
+  }
+  console.log('Entry points:');
+  for (const rootId of subgraph.roots.slice(0, 8)) {
+    const node = subgraph.nodes.get(rootId);
+    if (node) console.log(\`  \${node.name} (\${node.kind}) - \${node.filePath}:\${node.startLine}\`);
+  }
+  console.log('Top files:');
+  for (const [file, nodes] of [...fileGroups.entries()].sort((a,b) => b[1].length - a[1].length).slice(0, 5)) {
+    console.log(\`  \${file} (\${nodes.length}): \${nodes.slice(0, 5).join(', ')}\`);
+  }
+
+  // Does qualified lookup resolve correctly?
+  console.log('\n=== qualified lookup ===');
+  const qr = cg.searchNodes('<OwnerType>.<method>', { limit: 50 });
+  const exact = qr.filter(r => r.node.qualifiedName.includes('<OwnerType>::<method>'));
+  console.log(\`\${exact.length} match(es) for <OwnerType>.<method>\`);
+  if (exact[0]) {
+    const callees = cg.getCallees(exact[0].node.id);
+    console.log('Callees:', callees.map(c => c.node.name).join(', '));
+  }
+
+  await cg.close();
+}
+test().catch(console.error);
+"
+```
+
+### 5. If results are bad, diagnose and fix
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| Target method not in top 10 of `searchNodes` | Owner type missing from `qualified_name` | Add `getReceiverType` to `src/extraction/languages/<lang>.ts` |
+| Explore returns irrelevant files | Common method name flooding exact matches | Check co-location boost in `src/db/queries.ts: findNodesByExactName` |
+| A key term is being dropped from search | It's in the STOP_WORDS list | Edit `src/search/query-utils.ts` |
+| `<OwnerType>.<method>` returns "not found" | `qualified_name` doesn't contain `OwnerType::method` | Fix `getReceiverType` output |
+
+### 6. Rebuild and re-test
+
+```bash
+npm run build
+# If you changed extraction (getReceiverType), must re-index:
+rm -rf <codebase_path>/.codegraph
+node dist/bin/codegraph.js init -iv <codebase_path>
+# Then re-run Step 4
+```
+
+### 7. Run the test suite before finishing
+
+```bash
+npm test
+```
+
+All 378+ tests must pass.
+
+## How to Add `getReceiverType` for a Language
+
+**Only needed for languages where methods are top-level or outside their owner type in the AST.** If the language nests methods inside class/struct bodies (Python, Java, TypeScript, C#), the qualified name already includes the parent — verify with Step 3 before adding anything.
+
+### 1. Add the hook to the language extractor
+
+In `src/extraction/languages/<lang>.ts`, add `getReceiverType` to the extractor object:
+
+```typescript
+getReceiverType: (node, source) => {
+  // Extract the owner type name from the method's AST node.
+  // Return the type name string, or undefined if not applicable.
+  // 
+  // The core extractMethod() in tree-sitter.ts will use this to set:
+  //   qualifiedName = `${filePath}::${receiverType}::${methodName}`
+},
+```
+
+### 2. Reference: Go implementation
+
+```typescript
+// src/extraction/languages/go.ts
+getReceiverType: (node, source) => {
+  const receiver = getChildByField(node, 'receiver');
+  if (!receiver) return undefined;
+  const text = getNodeText(receiver, source);
+  const match = text.match(/\*?\s*([A-Za-z_][A-Za-z0-9_]*)\s*\)/);
+  return match?.[1];
+},
+```
+
+### 3. Where it's consumed
+
+`src/extraction/tree-sitter.ts` in `extractMethod()`:
+
+```typescript
+const receiverType = this.extractor.getReceiverType?.(node, this.source);
+if (receiverType) {
+  extraProps.qualifiedName = `${this.filePath}::${receiverType}::${name}`;
+}
+```
+
+## Key Files
+
+| File | Role |
+|------|------|
+| `src/extraction/languages/<lang>.ts` | Language extractor — implement `getReceiverType` here |
+| `src/extraction/tree-sitter.ts` | Core extraction — `extractMethod()` uses the hook |
+| `src/extraction/tree-sitter-types.ts` | `LanguageExtractor` interface definition |
+| `src/search/query-utils.ts` | `STOP_WORDS`, `extractSearchTerms`, `scorePathRelevance` |
+| `src/db/queries.ts` | `searchNodesFTS` (BM25), `findNodesByExactName` (co-location) |
+| `src/context/index.ts` | `findRelevantContext` — hybrid search + co-location boost |
+| `src/mcp/tools.ts` | MCP handlers — `matchesSymbol` uses `qualifiedName.includes("Type::method")` |
+
+## Languages Completed
+
+- [x] **Go** — `getReceiverType` extracts receiver from `func (sl *Type) method()`
+
+## Languages To Do
+
+Check these — only add `getReceiverType` if methods are top-level (not nested inside their owner type in the AST):
+
+- [ ] Rust — methods in `impl Type { }` blocks
+- [ ] C++ — out-of-class method definitions `Type::method()`
+- [ ] Swift — methods in `extension Type { }` blocks
+- [ ] Kotlin — extension functions `fun Type.method()`
+
+Verify these DON'T need it (methods nested in class body → qualified name should already be correct):
+- [ ] Python — verify `qualified_name` includes class name
+- [ ] Java — verify `qualified_name` includes class name
+- [ ] TypeScript — verify `qualified_name` includes class name
+- [ ] C# — verify `qualified_name` includes class name