| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369 |
- /**
- * Search Query Utilities
- *
- * Shared module for search term extraction and scoring.
- */
- import * as path from 'path';
- import { Node } from '../types';
- /**
- * Common stop words to filter from search queries.
- * Includes generic English + code-specific noise words.
- */
- export const STOP_WORDS = new Set([
- // English
- 'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
- 'of', 'with', 'by', 'from', 'is', 'it', 'that', 'this', 'are', 'was',
- 'be', 'has', 'had', 'have', 'do', 'does', 'did', 'will', 'would', 'could',
- 'should', 'may', 'might', 'can', 'shall', 'not', 'no', 'all', 'each',
- 'every', 'how', 'what', 'where', 'when', 'who', 'which', 'why',
- 'i', 'me', 'my', 'we', 'our', 'you', 'your', 'he', 'she', 'they',
- 'show', 'give', 'tell',
- 'been', 'done', 'made', 'used', 'using', 'work', 'works', 'found',
- 'also', 'into', 'then', 'than', 'just', 'more', 'some', 'such',
- 'over', 'only', 'out', 'its', 'so', 'up', 'as', 'if',
- 'look', 'need', 'needs', 'want', 'happen', 'happens',
- 'affect', 'affected', 'break', 'breaks', 'failing',
- 'implemented', 'implement',
- // Code-specific noise (avoid filtering common symbol names like get/set/add/build/find/list)
- 'code', 'file', 'files', 'function', 'method', 'class', 'type',
- 'fix', 'bug', 'called',
- ]);
- /**
- * Generate stem variants of a search term by removing common English suffixes.
- * Used for FTS query expansion so "caching" also finds "cache", "eviction" finds "evict", etc.
- * Stems are used as PREFIX matches in FTS, so they don't need to be perfect English words.
- */
- export function getStemVariants(term: string): string[] {
- const variants = new Set<string>();
- const t = term.toLowerCase();
- // -ing: caching→cach/cache, handling→handl/handle, running→run
- if (t.endsWith('ing') && t.length > 5) {
- const base = t.slice(0, -3);
- variants.add(base);
- variants.add(base + 'e');
- if (base.length >= 2 && base[base.length - 1] === base[base.length - 2]) {
- variants.add(base.slice(0, -1));
- }
- }
- // -tion/-sion: eviction→evict, expression→express
- if ((t.endsWith('tion') || t.endsWith('sion')) && t.length > 5) {
- variants.add(t.slice(0, -3));
- }
- // -ment: management→manage
- if (t.endsWith('ment') && t.length > 6) {
- variants.add(t.slice(0, -4));
- }
- // -ies: entries→entry
- if (t.endsWith('ies') && t.length > 4) {
- variants.add(t.slice(0, -3) + 'y');
- }
- // -es: processes→process, classes→class
- else if (t.endsWith('es') && t.length > 4) {
- variants.add(t.slice(0, -2));
- }
- // -s: errors→error (skip -ss endings like "class")
- else if (t.endsWith('s') && !t.endsWith('ss') && t.length > 4) {
- variants.add(t.slice(0, -1));
- }
- // -ed: handled→handle, propagated→propagate, carried→carry
- if (t.endsWith('ed') && !t.endsWith('eed') && t.length > 4) {
- variants.add(t.slice(0, -1));
- variants.add(t.slice(0, -2));
- if (t.endsWith('ied') && t.length > 5) {
- variants.add(t.slice(0, -3) + 'y');
- }
- }
- // -er: builder→build/builde, handler→handl/handle, getter→get
- if (t.endsWith('er') && t.length > 4) {
- const base = t.slice(0, -2);
- variants.add(base);
- variants.add(base + 'e');
- if (base.length >= 2 && base[base.length - 1] === base[base.length - 2]) {
- variants.add(base.slice(0, -1));
- }
- }
- return [...variants].filter(v => v.length >= 3 && v !== t);
- }
- /**
- * Extract meaningful search terms from a natural language query.
- * Splits camelCase, PascalCase, snake_case, SCREAMING_SNAKE, and dot.notation
- * into individual tokens before filtering.
- *
- * Preserves original compound identifiers (e.g., "scrapeLoop") alongside
- * their split parts so that FTS can match both the full symbol name and
- * individual words within it.
- *
- * Also generates stem variants (e.g., "caching"→"cache", "eviction"→"evict")
- * so FTS prefix matching can find related code symbols.
- */
- export function extractSearchTerms(query: string, options?: { stems?: boolean }): string[] {
- const includeStems = options?.stems !== false;
- const tokens = new Set<string>();
- // First, extract and preserve compound identifiers before splitting
- // CamelCase: scrapeLoop, UserService, getCallGraph
- const compoundPattern = /\b([a-zA-Z][a-zA-Z0-9]*(?:[A-Z][a-z]+)+|[A-Z][a-z]+(?:[A-Z][a-z]*)+)\b/g;
- let match;
- while ((match = compoundPattern.exec(query)) !== null) {
- if (match[1] && match[1].length >= 3) {
- tokens.add(match[1].toLowerCase()); // preserve full compound: "scrapeloop"
- }
- }
- // snake_case: scrape_loop, user_service
- const snakePattern = /\b([a-zA-Z][a-zA-Z0-9]*(?:_[a-zA-Z0-9]+)+)\b/g;
- while ((match = snakePattern.exec(query)) !== null) {
- if (match[1] && match[1].length >= 3) {
- tokens.add(match[1].toLowerCase());
- }
- }
- // Split camelCase / PascalCase: "getUserName" → "get User Name"
- const camelSplit = query
- .replace(/([a-z])([A-Z])/g, '$1 $2')
- .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2');
- // Replace underscores and dots with spaces (snake_case, dot.notation)
- const normalised = camelSplit.replace(/[_.]+/g, ' ');
- // Split on any non-alphanumeric character
- const words = normalised.split(/[^a-zA-Z0-9]+/).filter(Boolean);
- for (const word of words) {
- const lower = word.toLowerCase();
- if (lower.length < 3) continue;
- if (STOP_WORDS.has(lower)) continue;
- tokens.add(lower);
- }
- // Generate stem variants for broader FTS matching.
- // "caching" → "cache" finds CacheBuilder; "eviction" → "evict" finds evictEntries.
- // Also enables co-occurrence dampening by increasing term count above 1.
- // Stems are skipped when scoring path relevance (stems inflate path scores).
- if (includeStems) {
- const stems = new Set<string>();
- for (const token of tokens) {
- for (const variant of getStemVariants(token)) {
- if (!tokens.has(variant) && !STOP_WORDS.has(variant)) {
- stems.add(variant);
- }
- }
- }
- for (const stem of stems) {
- tokens.add(stem);
- }
- }
- return [...tokens];
- }
- /**
- * Score path relevance to a query
- * Higher score = more relevant path
- */
- export function scorePathRelevance(filePath: string, query: string): number {
- // Use base terms only — stem variants inflate path scores by generating
- // many near-duplicate terms that all match the same path segments.
- const terms = extractSearchTerms(query, { stems: false });
- if (terms.length === 0) return 0;
- const pathLower = filePath.toLowerCase();
- const fileName = path.basename(filePath).toLowerCase();
- const dirName = path.dirname(filePath).toLowerCase();
- let score = 0;
- for (const term of terms) {
- // Exact filename match (strongest)
- if (fileName.includes(term)) score += 10;
- // Directory match
- if (dirName.includes(term)) score += 5;
- // General path match
- else if (pathLower.includes(term)) score += 3;
- }
- // Deprioritize test files unless the query is explicitly about tests
- const queryLower = query.toLowerCase();
- const isTestQuery = queryLower.includes('test') || queryLower.includes('spec');
- if (!isTestQuery && isTestFile(filePath)) {
- score -= 15;
- }
- return score;
- }
- /**
- * Check if a file path looks like a test file
- */
- export function isTestFile(filePath: string): boolean {
- const lower = filePath.toLowerCase();
- const fileName = path.basename(filePath); // original case — needed for camelCase boundaries
- const lowerName = fileName.toLowerCase();
- // --- Filename patterns ---
- if (
- lowerName.startsWith('test_') || // python: test_foo.py
- lowerName.startsWith('test.') ||
- // separator-delimited: foo_test.go, foo.test.ts, foo-spec.rb, bar_spec.py
- /[._-](test|tests|spec|specs)\.[a-z0-9]+$/.test(lowerName) ||
- // CamelCase suffix (Java/Kotlin/Swift/C#/Scala): FooTest.kt, BarTests.swift,
- // BazSpec.scala, QuxTestCase.java. Capital-led so "latest.kt"/"manifest.kt"
- // (lowercase "test") are NOT matched.
- /(?:Test|Tests|TestCase|Tester|Spec|Specs)\.[A-Za-z0-9]+$/.test(fileName)
- ) {
- return true;
- }
- // --- Directory patterns ---
- if (
- lower.includes('/tests/') || lower.includes('/test/') ||
- lower.includes('/__tests__/') || lower.includes('/spec/') ||
- lower.includes('/specs/') || lower.includes('/testlib/') ||
- lower.includes('/testing/') ||
- lower.startsWith('test/') || lower.startsWith('tests/') ||
- lower.startsWith('spec/') || lower.startsWith('specs/') ||
- // CamelCase test source-set dirs (Kotlin Multiplatform / Gradle / Xcode):
- // jvmTest/, commonTest/, androidTest/, iosTest/, integrationTest/. Capital-led
- // so "latest/" / "manifest/" are not matched.
- /(?:^|\/)[A-Za-z0-9]*(?:Test|Tests|Spec)\//.test(filePath)
- ) {
- return true;
- }
- // Non-production directories: examples, samples, benchmarks, fixtures, demos.
- // Check both mid-path (/integration/) and start-of-path (integration/) since
- // file paths may be stored as relative paths without a leading slash.
- return matchesNonProductionDir(lower);
- }
- /**
- * Check if a path is in a non-production directory (integration, sample, example, etc.)
- * Handles both absolute paths (/foo/integration/bar) and relative paths (integration/bar).
- */
- function matchesNonProductionDir(lowerPath: string): boolean {
- const dirs = [
- 'integration', 'sample', 'samples', 'example', 'examples',
- 'fixture', 'fixtures', 'benchmark', 'benchmarks', 'demo', 'demos',
- ];
- for (const dir of dirs) {
- if (lowerPath.includes('/' + dir + '/') || lowerPath.startsWith(dir + '/')) {
- return true;
- }
- }
- return false;
- }
- /**
- * Bonus when a node's name matches the search query.
- * Exact matches get the largest boost; prefix matches get smaller boosts.
- * Multi-word queries also check individual term matches against the name.
- */
- export function nameMatchBonus(nodeName: string, query: string): number {
- const nameLower = nodeName.toLowerCase();
- // Split query into word-level terms (handles "CacheBuilder build" → ["cache","builder","build"])
- const rawTerms = query
- .replace(/([a-z])([A-Z])/g, '$1 $2')
- .split(/[\s_.\-]+/)
- .map(t => t.toLowerCase())
- .filter(t => t.length >= 2);
- // Also keep original space-separated tokens for exact-term matching
- const queryTokens = query.split(/\s+/).map(t => t.toLowerCase()).filter(t => t.length >= 2);
- // Full query as a single token (for compound identifiers like "CacheBuilder")
- const queryLower = query.replace(/[\s]+/g, '').toLowerCase();
- // Exact match: query exactly equals the node name
- if (nameLower === queryLower) return 80;
- // Exact match on a query token: "CacheBuilder build" and node name is "build"
- if (queryTokens.length > 1 && queryTokens.includes(nameLower)) return 60;
- // Name starts with query — scale by length ratio so "Pod"→"Pod" (exact, handled above)
- // scores much higher than "Pod"→"PodGCControllerOptions" (ratio 0.125).
- if (nameLower.startsWith(queryLower)) {
- const ratio = queryLower.length / nameLower.length;
- return Math.round(10 + 30 * ratio);
- }
- // All camelCase-split terms appear in the name
- if (rawTerms.length > 1) {
- const allMatch = rawTerms.every(t => nameLower.includes(t));
- if (allMatch) return 15;
- }
- // Name contains the full query as substring
- if (nameLower.includes(queryLower)) return 10;
- return 0;
- }
- /**
- * Kind-based bonus for search ranking
- * Functions and classes are typically more relevant than variables/imports
- */
- export function kindBonus(kind: Node['kind']): number {
- const bonuses: Record<string, number> = {
- function: 10,
- method: 10,
- class: 8,
- interface: 9,
- type_alias: 6,
- struct: 6,
- trait: 9,
- enum: 5,
- component: 8,
- route: 9,
- module: 4,
- property: 3,
- field: 3,
- variable: 2,
- constant: 3,
- import: 1,
- export: 1,
- parameter: 0,
- namespace: 4,
- file: 0,
- protocol: 9,
- enum_member: 3,
- };
- return bonuses[kind] ?? 0;
- }
- /**
- * Whether a query token looks like a code identifier the user deliberately typed
- * (camelCase / PascalCase-with-internal-caps / snake_case / has a digit) rather
- * than a plain dictionary word ("flat", "object", "screen").
- *
- * Used to decide whether an EXACT name match earns the "the user named this
- * symbol" exemption from single-term dampening. A common English word that
- * happens to exact-match an unrelated symbol — the query "flat object" matching
- * a constant named `FLAT` — must NOT get that exemption, or the +exact-name
- * bonus floats it to the top of a prose query on its own.
- *
- * Classifies the token AS THE USER TYPED IT, not the matched symbol's name:
- * "flat" (lowercase, descriptive) is non-distinctive even though it matches
- * `FLAT`. A leading-capital-only word ("Screen", "Zustand") is also treated as
- * a plain word — sentence-start capitalization and proper nouns aren't reliable
- * identifier signals.
- */
- export function isDistinctiveIdentifier(token: string): boolean {
- if (!token) return false;
- // snake_case / SCREAMING_SNAKE, or an embedded digit → a deliberate identifier.
- if (/[_0-9]/.test(token)) return true;
- // An uppercase letter anywhere AFTER the first char → a camelCase/PascalCase
- // boundary (setLastEmail, OrgUserStore) or an acronym (REST, HTTP).
- if (/[A-Z]/.test(token.slice(1))) return true;
- return false;
- }
|