
feat(config): map custom file extensions to languages via codegraph.json (#906) (#955)

The extension → language table was hardcoded, so a codebase using a
non-standard extension for a supported language (e.g. `.dota_lua` for Lua)
had those files silently skipped — no way to opt them in short of patching
the source.

Add an opt-in, project-scoped `codegraph.json` at the repo root:

    { "extensions": { ".dota_lua": "lua", ".tpl": "php" } }

Mappings merge on top of the built-in defaults and take precedence (so a
built-in can be re-pointed, e.g. `.h` → `cpp`). A missing or malformed config
falls back to the zero-config default, byte-identical to prior behavior; an
invalid target language or an unparseable file is warned about and skipped,
never fatal.
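
The precedence rule can be illustrated with a small sketch. It is not the
project's code: `BUILT_INS` is a hypothetical, trimmed-down stand-in for the
real `EXTENSION_MAP`, `resolveLanguage` is an illustrative name, and the
content-based `.h` sniffing that the real `detectLanguage` performs is left out.

```typescript
// Hypothetical stand-in for the built-in table; the real EXTENSION_MAP in
// src/extraction/grammars.ts covers many more extensions and languages.
type Language = 'c' | 'cpp' | 'lua' | 'php' | 'unknown';

const BUILT_INS: Record<string, Language> = { '.h': 'c', '.lua': 'lua', '.php': 'php' };

// Overrides are consulted first, then the built-ins, then 'unknown'.
function resolveLanguage(filePath: string, overrides?: Record<string, Language>): Language {
  const dot = filePath.lastIndexOf('.');
  if (dot < 0) return 'unknown';
  const ext = filePath.slice(dot).toLowerCase();
  return (overrides && overrides[ext]) || BUILT_INS[ext] || 'unknown';
}

const projectOverrides: Record<string, Language> = { '.dota_lua': 'lua', '.h': 'cpp' };

console.log(resolveLanguage('addon/ability.dota_lua'));                   // unknown
console.log(resolveLanguage('addon/ability.dota_lua', projectOverrides)); // lua
console.log(resolveLanguage('include/api.h'));                            // c
console.log(resolveLanguage('include/api.h', projectOverrides));          // cpp
```

Omitting the second argument leaves only the built-in lookup, which is why
callers that pass no overrides see no change in behavior.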

Implementation:
- New `src/project-config.ts` — `loadExtensionOverrides(rootDir)`, validated
  against `isLanguageSupported`, mtime-cached per root.
- `detectLanguage` / `isSourceFile` gain an optional `overrides` arg
  (omitting it is the existing behavior).
- Overrides threaded per-operation through every extraction call site
  (scan/walk gates, git change-detection, grammar selection, extraction,
  the file watcher), resolved from the project root — no process-global
  state, so the multi-project daemon stays isolated. The parse worker
  receives the resolved language in its message.
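
The last point, the main thread resolving the language and handing it to the
worker, might look roughly like the sketch below. The message shape is cut
down from the one in the diff, and `detectBuiltIn`, `buildParseRequest`, and
`languageFor` are hypothetical names used only for illustration.

```typescript
type Language = string;

// Cut-down parse message; `language` is optional so older callers still work.
interface ParseRequest {
  type: 'parse';
  id: number;
  filePath: string;
  content: string;
  language?: Language;
}

// Hypothetical built-in detection, standing in for detectLanguage with no overrides.
function detectBuiltIn(filePath: string): Language {
  return filePath.toLowerCase().endsWith('.lua') ? 'lua' : 'unknown';
}

// Main-thread side: resolve once with the project's overrides and attach the result.
function buildParseRequest(
  id: number,
  filePath: string,
  content: string,
  overrides: Record<string, Language>
): ParseRequest {
  const dot = filePath.lastIndexOf('.');
  const ext = dot < 0 ? '' : filePath.slice(dot).toLowerCase();
  const language = overrides[ext] || detectBuiltIn(filePath);
  return { type: 'parse', id, filePath, content, language };
}

// Worker side: trust the resolved language, fall back to detection if it is absent.
function languageFor(msg: ParseRequest): Language {
  return msg.language ?? detectBuiltIn(msg.filePath);
}

const request = buildParseRequest(1, 'scripts/hero.dota_lua', 'return {}', { '.dota_lua': 'lua' });
console.log(languageFor(request)); // lua
console.log(languageFor({ type: 'parse', id: 2, filePath: 'scripts/hero.dota_lua', content: '' })); // unknown
```

The second call shows why the language travels in the message: a worker that
only has built-in detection cannot recognize the custom extension by itself.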

Tests: 13 new cases (unit, loader validation/normalization/caching, and a
full-index integration proving a custom-extension file is extracted while
the zero-config path indexes nothing). Worker path smoke-tested via the
built CLI.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 13 hours ago
parent
commit
d1121e46f0

+ 1 - 0
CHANGELOG.md

@@ -30,6 +30,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - Java `static final` constants, C# `const` / `static readonly` constants, Scala `object` vals, and Kotlin top-level / `object` / `companion object` `val`s are now classified as constants rather than generic fields, so they participate in the constant-reader impact analysis above — change a `public static final` table, a `const string`, a Scala `object Config { val Timeout = … }`, or a Kotlin `companion object { const val … }` and the methods that read it now show up as affected. (Per-object Java `final` / C# `readonly` / Scala & Kotlin `class` instance properties are unchanged.) Kotlin constants were previously not indexed as their own symbols at all, so they now also appear in `codegraph search`.
 - Swift top-level `let`s and `static let` constants (including those namespaced in an `enum`/`struct`, the common Swift pattern) are now indexed as constants and participate in the constant-reader impact analysis above — change a `static let defaultRetryLimit` or an `enum Constants { static let … }` and the same-file code that reads it shows up as affected. Computed properties and per-instance `let`s are not treated as constants.
 - Dart top-level `const`/`final` and class `static const`/`static final` constants are now indexed as constants and participate in the constant-reader impact analysis above. Instance fields, `var`s, and locals are not treated as constants. (Generated Dart code with the standard `.g.dart`/`.freezed.dart`/`.pb.dart` suffixes is already skipped.)
+- You can now teach CodeGraph about custom file extensions. Drop a `codegraph.json` at your repo root with an `extensions` map, e.g. `{ "extensions": { ".dota_lua": "lua", ".tpl": "php" } }`, and files with those extensions are indexed under the language you name instead of being silently skipped because the extension isn't one of the built-in defaults. The file is opt-in and committed alongside your code, so the whole team shares it. Your mappings layer on top of the built-ins and win on conflict (you can even re-point a built-in, e.g. `.h` → `cpp`). A typo'd language or a malformed config is warned about and skipped rather than breaking indexing. Projects without a `codegraph.json` behave exactly as before. (#906)
 
 ### Fixes
 

+ 27 - 3
README.md

@@ -585,9 +585,10 @@ that drive the graph directly: `DatabaseConnection`, `QueryBuilder`,
 
 ## Configuration
 
-There isn't any — CodeGraph is zero-config, with **no config file** to write or
-keep in sync. Language support is automatic from the file extension; there's
-nothing to wire up per language.
+Next to none — CodeGraph is **zero-config by default**, with nothing to write or
+keep in sync to get started. Language support is automatic from the file
+extension; there's nothing to wire up per language. The one optional file is for
+mapping [custom file extensions](#custom-file-extensions).
 
 What it skips out of the box:
 
@@ -605,6 +606,29 @@ add a negation — `!vendor/`. The defaults apply uniformly, so committing a
 dependency or build directory doesn't force it into the graph; the `.gitignore`
 negation is the explicit opt-in.
 
+### Custom file extensions
+
+If your project uses a non-standard extension for a [supported
+language](#supported-languages) — say `.dota_lua` for Lua, or `.tpl` for PHP —
+those files are skipped by default, because the extension isn't one CodeGraph
+recognizes. Map them with an optional **`codegraph.json`** at your project root:
+
+```json
+{
+  "extensions": {
+    ".dota_lua": "lua",
+    ".tpl": "php"
+  }
+}
+```
+
+Each value is a supported language id. The mappings merge on top of the built-in
+defaults and win on conflict, so you can also re-point a built-in (e.g.
+`".h": "cpp"`). Commit the file to share the mapping with your team. A typo'd
+language or a malformed file is warned about and skipped — it never breaks
+indexing — and a project with no `codegraph.json` behaves exactly as before.
+Re-index (`codegraph index`) after adding or changing mappings.
+
 ## Telemetry
 
 CodeGraph collects **anonymous usage statistics** — which tools and commands get

+ 157 - 0
__tests__/extension-mapping.test.ts

@@ -0,0 +1,157 @@
+/**
+ * Custom extension → language mapping (#906).
+ *
+ * A project can map non-standard file extensions to a supported language via a
+ * committed `codegraph.json` at the repo root, so files that would otherwise be
+ * silently skipped get indexed under the right grammar. These tests cover the
+ * two choke-point functions (detectLanguage / isSourceFile) honoring an override
+ * map, the loader's validation/normalization/caching of `codegraph.json`, and a
+ * full index proving a custom-extension file is actually extracted — while the
+ * zero-config path stays byte-identical (the file is NOT indexed without config).
+ */
+import { describe, it, expect, beforeEach, afterEach } from 'vitest';
+import * as fs from 'node:fs';
+import * as path from 'node:path';
+import * as os from 'node:os';
+import { CodeGraph } from '../src';
+import { detectLanguage, isSourceFile } from '../src/extraction/grammars';
+import { loadExtensionOverrides, clearProjectConfigCache } from '../src/project-config';
+
+describe('custom extension → language mapping (#906)', () => {
+  describe('detectLanguage / isSourceFile overrides argument', () => {
+    it('maps a custom extension only when present in the overrides', () => {
+      expect(detectLanguage('a/b.foo')).toBe('unknown');
+      expect(isSourceFile('a/b.foo')).toBe(false);
+
+      expect(detectLanguage('a/b.foo', undefined, { '.foo': 'typescript' })).toBe('typescript');
+      expect(isSourceFile('a/b.foo', { '.foo': 'typescript' })).toBe(true);
+    });
+
+    it('lets a user mapping take precedence over a built-in extension', () => {
+      expect(detectLanguage('x.h')).toBe('c');
+      expect(detectLanguage('x.h', undefined, { '.h': 'cpp' })).toBe('cpp');
+    });
+
+    it('is byte-identical to zero-config behavior when no overrides are passed', () => {
+      expect(detectLanguage('x.ts')).toBe('typescript');
+      expect(detectLanguage('x.py')).toBe('python');
+      expect(isSourceFile('x.ts')).toBe(true);
+      expect(isSourceFile('x.unknownext')).toBe(false);
+    });
+  });
+
+  describe('loadExtensionOverrides (codegraph.json)', () => {
+    let dir: string;
+    beforeEach(() => {
+      dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-extmap-'));
+      clearProjectConfigCache();
+    });
+    afterEach(() => {
+      clearProjectConfigCache();
+      fs.rmSync(dir, { recursive: true, force: true });
+    });
+    const writeConfig = (obj: unknown) =>
+      fs.writeFileSync(
+        path.join(dir, 'codegraph.json'),
+        typeof obj === 'string' ? obj : JSON.stringify(obj)
+      );
+
+    it('returns an empty map when there is no codegraph.json', () => {
+      expect(loadExtensionOverrides(dir)).toEqual({});
+    });
+
+    it('loads and validates a well-formed extensions map', () => {
+      writeConfig({ extensions: { '.foo': 'typescript', '.bar': 'python' } });
+      expect(loadExtensionOverrides(dir)).toEqual({ '.foo': 'typescript', '.bar': 'python' });
+    });
+
+    it('normalizes keys (adds a leading dot, lowercases)', () => {
+      writeConfig({ extensions: { foo: 'lua', '.BAR': 'go' } });
+      expect(loadExtensionOverrides(dir)).toEqual({ '.foo': 'lua', '.bar': 'go' });
+    });
+
+    it('skips entries whose target is not a supported language', () => {
+      writeConfig({ extensions: { '.foo': 'typescript', '.bad': 'pyhton', '.x': 'unknown' } });
+      expect(loadExtensionOverrides(dir)).toEqual({ '.foo': 'typescript' });
+    });
+
+    it('skips multi-part and otherwise unusable extension keys', () => {
+      writeConfig({ extensions: { '.d.ts': 'typescript', 'a/b': 'go', '.': 'lua', '.ok': 'rust' } });
+      expect(loadExtensionOverrides(dir)).toEqual({ '.ok': 'rust' });
+    });
+
+    it('ignores malformed JSON without throwing', () => {
+      writeConfig('{ not: valid json ');
+      expect(loadExtensionOverrides(dir)).toEqual({});
+    });
+
+    it('ignores a non-object extensions field', () => {
+      writeConfig({ extensions: 'nope' });
+      expect(loadExtensionOverrides(dir)).toEqual({});
+    });
+
+    it('picks up a changed config (mtime-invalidated cache)', () => {
+      writeConfig({ extensions: { '.foo': 'typescript' } });
+      expect(loadExtensionOverrides(dir)).toEqual({ '.foo': 'typescript' });
+
+      writeConfig({ extensions: { '.foo': 'go' } });
+      // Force a distinct mtime in case the filesystem clock is coarse.
+      const future = new Date(Date.now() + 2000);
+      fs.utimesSync(path.join(dir, 'codegraph.json'), future, future);
+
+      expect(loadExtensionOverrides(dir)).toEqual({ '.foo': 'go' });
+    });
+  });
+
+  describe('indexAll honors codegraph.json end-to-end', () => {
+    let dir: string;
+    beforeEach(() => {
+      dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-extmap-idx-'));
+      clearProjectConfigCache();
+    });
+    afterEach(() => {
+      clearProjectConfigCache();
+      fs.rmSync(dir, { recursive: true, force: true });
+    });
+    const write = (rel: string, body: string) => {
+      const p = path.join(dir, rel);
+      fs.mkdirSync(path.dirname(p), { recursive: true });
+      fs.writeFileSync(p, body);
+    };
+    const indexAndQuery = async () => {
+      const cg = await CodeGraph.init(dir, { silent: true });
+      await cg.indexAll();
+      const db = (cg as any).db.db;
+      const nodes = db
+        .prepare('SELECT name, kind, file_path, language FROM nodes WHERE file_path = ?')
+        .all('widget.foo');
+      const files = db
+        .prepare('SELECT path, language FROM files WHERE path = ?')
+        .all('widget.foo');
+      cg.close?.();
+      return { nodes, files };
+    };
+
+    const SOURCE = 'export function widgetHandler(x: number): number { return x + 1; }\n';
+
+    it('indexes a custom-extension file mapped to a supported language', async () => {
+      write('codegraph.json', JSON.stringify({ extensions: { '.foo': 'typescript' } }));
+      write('widget.foo', SOURCE);
+
+      const { nodes, files } = await indexAndQuery();
+
+      expect(files.length).toBe(1);
+      expect(files[0].language).toBe('typescript');
+      expect(nodes.some((n: any) => n.name === 'widgetHandler' && n.language === 'typescript')).toBe(true);
+    });
+
+    it('does NOT index the same file without codegraph.json (zero-config preserved)', async () => {
+      write('widget.foo', SOURCE);
+
+      const { nodes, files } = await indexAndQuery();
+
+      expect(files.length).toBe(0);
+      expect(nodes.length).toBe(0);
+    });
+  });
+});

+ 19 - 2
site/src/content/docs/getting-started/configuration.md

@@ -1,9 +1,9 @@
 ---
 title: Configuration
-description: CodeGraph is zero-config — there are no config files.
+description: CodeGraph is zero-config by default, with one optional file for mapping custom extensions.
 ---
 
-There isn't any — CodeGraph is **zero-config**, with **no config file** to write or keep in sync. Language support is automatic from the file extension; there's nothing to wire up per language.
+Next to none — CodeGraph is **zero-config by default**, with nothing to write or keep in sync to get started. Language support is automatic from the file extension; there's nothing to wire up per language. The one optional file is for mapping [custom file extensions](#custom-file-extensions).
 
 ## What it skips out of the box
 
@@ -17,6 +17,23 @@ To keep something else out, add it to `.gitignore`. To pull a default-excluded d
 
 The defaults apply uniformly, so committing a dependency or build directory doesn't force it into the graph — the `.gitignore` negation is the explicit opt-in.
 
+## Custom file extensions
+
+If your project uses a non-standard extension for a [supported language](/codegraph/reference/languages/) — say `.dota_lua` for Lua, or `.tpl` for PHP — those files are skipped by default, because the extension isn't one CodeGraph recognizes. Map them with an optional `codegraph.json` at your project root:
+
+```json
+{
+  "extensions": {
+    ".dota_lua": "lua",
+    ".tpl": "php"
+  }
+}
+```
+
+Each value is a supported language id. The mappings merge on top of the built-in defaults and win on conflict, so you can also re-point a built-in (e.g. `".h": "cpp"`). Commit the file to share the mapping with your team.
+
+A typo'd language or a malformed file is warned about and skipped — it never breaks indexing — and a project with no `codegraph.json` behaves exactly as before. Re-index (`codegraph index`) after adding or changing mappings.
+
 ## Where data lives
 
 Per-project data lives in a `.codegraph/` directory at your project root, containing the SQLite database (`codegraph.db`). Nothing leaves your machine.

+ 14 - 5
src/extraction/grammars.ts

@@ -121,13 +121,18 @@ export const EXTENSION_MAP: Record<string, Language> = {
  * Whether a file is one CodeGraph can parse, based purely on its extension.
  * This is the single source of truth for "should we index this file" — derived
  * from EXTENSION_MAP so parser support and indexing selection never drift.
+ *
+ * `overrides` is the project's validated custom extension → language map (from
+ * `codegraph.json`); when present its extensions count as indexable in addition
+ * to the built-ins. Omitting it is byte-identical to the zero-config behavior.
  */
-export function isSourceFile(filePath: string): boolean {
+export function isSourceFile(filePath: string, overrides?: Record<string, Language>): boolean {
   if (isPlayRoutesFile(filePath)) return true; // Play `conf/routes` is extensionless
   if (isShopifyLiquidJson(filePath)) return true; // Shopify OS 2.0 JSON templates / section groups
   const dot = filePath.lastIndexOf('.');
   if (dot < 0) return false;
-  return filePath.slice(dot).toLowerCase() in EXTENSION_MAP;
+  const ext = filePath.slice(dot).toLowerCase();
+  return ext in EXTENSION_MAP || (!!overrides && ext in overrides);
 }
 
 /**
@@ -266,9 +271,13 @@ export function getParser(language: Language): Parser | null {
 }
 
 /**
- * Detect language from file extension
+ * Detect language from file extension.
+ *
+ * `overrides` is the project's validated custom extension → language map (from
+ * `codegraph.json`); when present its mappings take precedence over the built-in
+ * `EXTENSION_MAP`. Omitting it is byte-identical to the zero-config behavior.
  */
-export function detectLanguage(filePath: string, source?: string): Language {
+export function detectLanguage(filePath: string, source?: string, overrides?: Record<string, Language>): Language {
   // Play `conf/routes` has no grammar — route through the no-symbol path; the
   // Play framework resolver extracts route nodes from it.
   if (isPlayRoutesFile(filePath)) return 'yaml';
@@ -276,7 +285,7 @@ export function detectLanguage(filePath: string, source?: string): Language {
   // Shopify OS 2.0 JSON templates / section groups → the Liquid extractor (it
   // links each section `"type"` to its `sections/<type>.liquid`).
   if (isShopifyLiquidJson(filePath)) return 'liquid';
-  const lang = EXTENSION_MAP[ext] || 'unknown';
+  const lang = (overrides && overrides[ext]) || EXTENSION_MAP[ext] || 'unknown';
 
   // .h files could be C, C++, or Objective-C — check source content
   if (lang === 'c' && ext === '.h' && source) {

+ 42 - 19
src/extraction/index.ts

@@ -19,6 +19,7 @@ import {
 import { QueryBuilder } from '../db/queries';
 import { extractFromSource } from './tree-sitter';
 import { detectLanguage, isSourceFile, isLanguageSupported, isFileLevelOnlyLanguage, initGrammars, loadGrammarsForLanguages } from './grammars';
+import { loadExtensionOverrides } from '../project-config';
 import { isCodeGraphDataDir } from '../directory';
 import { logDebug, logWarn } from '../errors';
 import { validatePathWithinRoot, normalizePath } from '../utils';
@@ -637,14 +638,17 @@ interface GitChanges {
 function getGitChangedFiles(rootDir: string): GitChanges | null {
   try {
     const changes: GitChanges = { modified: [], added: [], deleted: [] };
-    collectGitStatus(rootDir, '', changes);
+    // Custom extension → language overrides from the project's codegraph.json,
+    // so change detection sees the same custom-extension files the full index does.
+    const overrides = loadExtensionOverrides(rootDir);
+    collectGitStatus(rootDir, '', changes, overrides);
     return changes;
   } catch {
     return null;
   }
 }
 
-function collectGitStatus(repoDir: string, prefix: string, out: GitChanges): void {
+function collectGitStatus(repoDir: string, prefix: string, out: GitChanges, overrides?: Record<string, Language>): void {
   const output = execFileSync(
     'git',
     ['status', '--porcelain', '--no-renames'],
@@ -678,7 +682,7 @@ function collectGitStatus(repoDir: string, prefix: string, out: GitChanges): voi
     }
 
     const filePath = normalizePath(prefix + rel);
-    if (!isSourceFile(filePath)) continue;
+    if (!isSourceFile(filePath, overrides)) continue;
 
     if (statusCode.includes('D')) {
       // Deletions stay unfiltered: getChangedFiles acts on one only when the
@@ -704,11 +708,11 @@ function collectGitStatus(repoDir: string, prefix: string, out: GitChanges): voi
   // nested deeper) and under this repo's gitignored dirs.
   for (const rel of untrackedDirs) {
     for (const repoRel of findNestedGitRepos(path.join(repoDir, rel), rel)) {
-      collectGitStatus(path.join(repoDir, repoRel), prefix + repoRel, out);
+      collectGitStatus(path.join(repoDir, repoRel), prefix + repoRel, out, overrides);
     }
   }
   for (const rel of findIgnoredEmbeddedRepos(repoDir)) {
-    collectGitStatus(path.join(repoDir, rel), prefix + rel, out);
+    collectGitStatus(path.join(repoDir, rel), prefix + rel, out, overrides);
   }
 }
 
@@ -723,13 +727,16 @@ export function scanDirectory(
   rootDir: string,
   onProgress?: (current: number, file: string) => void
 ): string[] {
+  // Custom extension → language overrides from the project's codegraph.json.
+  const overrides = loadExtensionOverrides(rootDir);
+
   // Fast path: use git to get all visible files (respects .gitignore everywhere)
   const gitFiles = getGitVisibleFiles(rootDir);
   if (gitFiles) {
     const files: string[] = [];
     let count = 0;
     for (const filePath of gitFiles) {
-      if (isSourceFile(filePath)) {
+      if (isSourceFile(filePath, overrides)) {
         files.push(filePath);
         count++;
         onProgress?.(count, filePath);
@@ -750,12 +757,15 @@ export async function scanDirectoryAsync(
   rootDir: string,
   onProgress?: (current: number, file: string) => void
 ): Promise<string[]> {
+  // Custom extension → language overrides from the project's codegraph.json.
+  const overrides = loadExtensionOverrides(rootDir);
+
   const gitFiles = getGitVisibleFiles(rootDir);
   if (gitFiles) {
     const files: string[] = [];
     let count = 0;
     for (const filePath of gitFiles) {
-      if (isSourceFile(filePath)) {
+      if (isSourceFile(filePath, overrides)) {
         files.push(filePath);
         count++;
         onProgress?.(count, filePath);
@@ -781,6 +791,8 @@ function scanDirectoryWalk(
   const files: string[] = [];
   let count = 0;
   const visitedDirs = new Set<string>();
+  // Custom extension → language overrides from the project's codegraph.json.
+  const overrides = loadExtensionOverrides(rootDir);
 
   // A .gitignore matcher scoped to the directory that declared it. Patterns in
   // a nested .gitignore are relative to that directory, so we keep the dir
@@ -857,7 +869,7 @@ function scanDirectoryWalk(
               walk(fullPath, active);
             }
           } else if (stat.isFile()) {
-            if (!isIgnored(fullPath, false, active) && isSourceFile(relativePath)) {
+            if (!isIgnored(fullPath, false, active) && isSourceFile(relativePath, overrides)) {
               files.push(relativePath);
               count++;
               onProgress?.(count, relativePath);
@@ -874,7 +886,7 @@ function scanDirectoryWalk(
           walk(fullPath, active);
         }
       } else if (entry.isFile()) {
-        if (!isIgnored(fullPath, false, active) && isSourceFile(relativePath)) {
+        if (!isIgnored(fullPath, false, active) && isSourceFile(relativePath, overrides)) {
           files.push(relativePath);
           count++;
           onProgress?.(count, relativePath);
@@ -994,6 +1006,11 @@ export class ExtractionOrchestrator {
     let totalNodes = 0;
     let totalEdges = 0;
 
+    // Custom extension → language overrides from the project's codegraph.json.
+    // Threaded into language detection so custom-extension files load the right
+    // grammar and store under the mapped language.
+    const overrides = loadExtensionOverrides(this.rootDir);
+
     const log = verbose
       ? (msg: string) => { console.log(`[worker] ${msg}`); }
       : (_msg: string) => {};
@@ -1050,7 +1067,7 @@ export class ExtractionOrchestrator {
     await new Promise(resolve => setImmediate(resolve));
 
     // Detect needed languages and load grammars in the parse worker
-    const neededLanguages = [...new Set(files.map((f) => detectLanguage(f)))];
+    const neededLanguages = [...new Set(files.map((f) => detectLanguage(f, undefined, overrides)))];
     // .h files default to 'c' but may be C++ — ensure cpp grammar is loaded when c is needed
     if (neededLanguages.includes('c') && !neededLanguages.includes('cpp')) {
       neededLanguages.push('cpp');
@@ -1161,12 +1178,17 @@ export class ExtractionOrchestrator {
     }
 
     async function requestParse(filePath: string, content: string): Promise<ExtractionResult> {
+      // Resolve the language on the main thread (where the project's
+      // codegraph.json overrides are loaded) and hand it to the worker, so the
+      // worker never needs the override map itself.
+      const language = detectLanguage(filePath, content, overrides);
+
       if (!WorkerClass) {
         // In-process fallback
         return extractFromSource(
           filePath,
           content,
-          detectLanguage(filePath, content),
+          language,
           frameworkNames
         );
       }
@@ -1198,7 +1220,7 @@ export class ExtractionOrchestrator {
         }, timeoutMs);
 
         pendingParses.set(id, { resolve, reject, timer });
-        worker.postMessage({ type: 'parse', id, filePath, content, frameworkNames });
+        worker.postMessage({ type: 'parse', id, filePath, content, frameworkNames, language });
       });
     }
 
@@ -1312,7 +1334,7 @@ export class ExtractionOrchestrator {
 
         // Store in database on main thread (SQLite is not thread-safe)
         if (result.nodes.length > 0 || result.errors.length === 0) {
-          const language = detectLanguage(filePath, content);
+          const language = detectLanguage(filePath, content, overrides);
           this.storeExtractionResult(filePath, content, language, stats, result);
         }
 
@@ -1333,7 +1355,7 @@ export class ExtractionOrchestrator {
           // Files with no symbols but no errors (yaml, twig, properties) are
           // tracked at the file level — count them as indexed so the CLI
           // doesn't misleadingly report "No files found to index".
-          const lang = detectLanguage(filePath, content);
+          const lang = detectLanguage(filePath, content, overrides);
           if (isFileLevelOnlyLanguage(lang)) {
             filesIndexed++;
           } else {
@@ -1393,7 +1415,7 @@ export class ExtractionOrchestrator {
         }
 
         if (result.nodes.length > 0 || result.errors.length === 0) {
-          const language = detectLanguage(filePath, content);
+          const language = detectLanguage(filePath, content, overrides);
           const stats = await fsp.stat(path.join(this.rootDir, filePath));
           this.storeExtractionResult(filePath, content, language, stats, result);
 
@@ -1444,7 +1466,7 @@ export class ExtractionOrchestrator {
           }
 
           if (result.nodes.length > 0 || result.errors.length === 0) {
-            const language = detectLanguage(filePath, fullContent);
+            const language = detectLanguage(filePath, fullContent, overrides);
             const stats = await fsp.stat(path.join(this.rootDir, filePath));
             this.storeExtractionResult(filePath, fullContent, language, stats, result);
 
@@ -1607,8 +1629,8 @@ export class ExtractionOrchestrator {
       };
     }
 
-    // Detect language
-    const language = detectLanguage(relativePath, content);
+    // Detect language (honoring the project's codegraph.json extension overrides)
+    const language = detectLanguage(relativePath, content, loadExtensionOverrides(this.rootDir));
     if (!isLanguageSupported(language)) {
       return {
         nodes: [],
@@ -1863,7 +1885,8 @@ export class ExtractionOrchestrator {
 
     // Load only grammars needed for changed files
     if (filesToIndex.length > 0) {
-      const neededLanguages = [...new Set(filesToIndex.map((f) => detectLanguage(f)))];
+      const overrides = loadExtensionOverrides(this.rootDir);
+      const neededLanguages = [...new Set(filesToIndex.map((f) => detectLanguage(f, undefined, overrides)))];
       // .h files default to 'c' but may be C++ — ensure cpp grammar is loaded
       if (neededLanguages.includes('c') && !neededLanguages.includes('cpp')) {
         neededLanguages.push('cpp');

+ 5 - 2
src/extraction/parse-worker.ts

@@ -55,14 +55,17 @@ import type { Language, ExtractionResult } from '../types';
 const PARSER_RESET_INTERVAL = 5000;
 const parseCounts = new Map<Language, number>();
 
-parentPort!.on('message', async (msg: { type: string; id?: number; filePath?: string; content?: string; languages?: Language[]; frameworkNames?: string[] }) => {
+parentPort!.on('message', async (msg: { type: string; id?: number; filePath?: string; content?: string; languages?: Language[]; frameworkNames?: string[]; language?: Language }) => {
   if (msg.type === 'load-grammars') {
     await loadGrammarsForLanguages(msg.languages!);
     parentPort!.postMessage({ type: 'grammars-loaded' });
   } else if (msg.type === 'parse') {
     const { id, filePath, content, frameworkNames } = msg;
     try {
-      const language = detectLanguage(filePath!, content);
+      // The main thread resolves the language (it holds the project's
+      // codegraph.json extension overrides) and sends it; fall back to detection
+      // for older callers / safety.
+      const language = msg.language ?? detectLanguage(filePath!, content);
       const result: ExtractionResult = extractFromSource(filePath!, content!, language, frameworkNames);
 
       // Periodic parser reset to reclaim WASM heap memory

+ 155 - 0
src/project-config.ts

@@ -0,0 +1,155 @@
+/**
+ * Project-scoped configuration: a committed `codegraph.json` at the project
+ * root that a team shares through version control.
+ *
+ * Today it carries one thing — `extensions`, an opt-in map from a custom file
+ * extension to one of CodeGraph's supported languages. The built-in
+ * extension → language table (`EXTENSION_MAP` in `extraction/grammars.ts`) is
+ * otherwise hardcoded, so a codebase that uses a non-standard extension for a
+ * supported language (e.g. `.dota_lua` for Lua) sees those files silently
+ * skipped. This lets the project map them once, in a version-controlled file:
+ *
+ *   {
+ *     "extensions": {
+ *       ".dota_lua": "lua",
+ *       ".tpl": "php"
+ *     }
+ *   }
+ *
+ * User mappings merge on TOP of the built-ins and win on conflict, so a project
+ * can also re-point a built-in extension (e.g. force `.h` → `cpp`). Absent or
+ * malformed config is the zero-config default — no overrides, no error. Invalid
+ * individual entries are warned-and-skipped (never fatal): an unparseable
+ * project file must not break indexing.
+ */
+import * as fs from 'fs';
+import * as path from 'path';
+import { Language } from './types';
+import { isLanguageSupported } from './extraction/grammars';
+import { logWarn } from './errors';
+
+/** Filename of the project-scoped config, resolved relative to the project root. */
+export const PROJECT_CONFIG_FILENAME = 'codegraph.json';
+
+export interface ProjectConfig {
+  /** Map of custom file extension (`.foo`) to a supported language id. */
+  extensions?: Record<string, string>;
+}
+
+interface CacheEntry {
+  mtimeMs: number;
+  overrides: Record<string, Language>;
+}
+
+/**
+ * Cache keyed by project root. The loader is called once per indexing/scan/sync
+ * operation (and per watch event), so the mtime guard keeps repeat calls to one
+ * `stat` while a single `codegraph.json` is in force. Keying by root keeps two
+ * projects in the same process (the daemon / multi-project MCP server) isolated.
+ */
+const overridesCache = new Map<string, Record<string, Language>>();
+const cacheMeta = new Map<string, CacheEntry>();
+
+/** Shared frozen empty map so the no-config path allocates nothing. */
+const EMPTY: Record<string, Language> = Object.freeze({});
+
+/**
+ * Normalize a user-provided extension key to the `.ext` lowercase form used by
+ * the built-in map. Returns null for keys that can never match a real file
+ * extension (so the caller warns and skips):
+ *   - empty / just "."
+ *   - multi-part (".d.ts") — language detection keys off the FINAL extension
+ *     only (`lastIndexOf('.')`), so a multi-dot key would never be consulted.
+ *   - anything containing a path separator.
+ */
+function normalizeExtKey(raw: string): string | null {
+  if (typeof raw !== 'string') return null;
+  let ext = raw.trim().toLowerCase();
+  if (!ext) return null;
+  if (!ext.startsWith('.')) ext = '.' + ext;
+  const body = ext.slice(1);
+  if (!body) return null;
+  if (body.includes('.') || body.includes('/') || body.includes('\\')) return null;
+  return ext;
+}
+
+/**
+ * Parse and validate the `extensions` map out of a `codegraph.json` file.
+ * Every failure mode degrades to "no overrides from this entry" — a bad file or
+ * a typo'd language never throws.
+ */
+function parseExtensionOverrides(file: string): Record<string, Language> {
+  let raw: string;
+  try {
+    raw = fs.readFileSync(file, 'utf-8');
+  } catch {
+    return EMPTY;
+  }
+
+  let parsed: unknown;
+  try {
+    parsed = JSON.parse(raw);
+  } catch (err) {
+    logWarn(`Ignoring ${PROJECT_CONFIG_FILENAME}: not valid JSON`, {
+      file,
+      error: err instanceof Error ? err.message : String(err),
+    });
+    return EMPTY;
+  }
+
+  if (!parsed || typeof parsed !== 'object') return EMPTY;
+  const exts = (parsed as ProjectConfig).extensions;
+  if (!exts || typeof exts !== 'object' || Array.isArray(exts)) return EMPTY;
+
+  const out: Record<string, Language> = {};
+  for (const [rawKey, rawVal] of Object.entries(exts)) {
+    const key = normalizeExtKey(rawKey);
+    if (!key) {
+      logWarn(`Ignoring extension mapping in ${PROJECT_CONFIG_FILENAME}: "${rawKey}" is not a valid file extension`, { file });
+      continue;
+    }
+    if (typeof rawVal !== 'string' || !isLanguageSupported(rawVal as Language)) {
+      logWarn(`Ignoring extension "${rawKey}" in ${PROJECT_CONFIG_FILENAME}: "${String(rawVal)}" is not a supported language`, { file });
+      continue;
+    }
+    out[key] = rawVal as Language;
+  }
+
+  return Object.keys(out).length > 0 ? out : EMPTY;
+}
+
+/**
+ * Load the validated extension overrides for a project, mtime-cached.
+ *
+ * Returns a map of `.ext` → supported language id. The result merges on top of
+ * the built-in extension map at the point of use (see `detectLanguage` /
+ * `isSourceFile`), with these user mappings taking precedence. Returns an empty
+ * map when there is no `codegraph.json` (the zero-config default).
+ */
+export function loadExtensionOverrides(rootDir: string): Record<string, Language> {
+  const file = path.join(rootDir, PROJECT_CONFIG_FILENAME);
+
+  let mtimeMs: number;
+  try {
+    mtimeMs = fs.statSync(file).mtimeMs;
+  } catch {
+    // No config file — drop any stale cache entry and return the default.
+    cacheMeta.delete(rootDir);
+    overridesCache.delete(rootDir);
+    return EMPTY;
+  }
+
+  const meta = cacheMeta.get(rootDir);
+  if (meta && meta.mtimeMs === mtimeMs) return meta.overrides;
+
+  const overrides = parseExtensionOverrides(file);
+  cacheMeta.set(rootDir, { mtimeMs, overrides });
+  overridesCache.set(rootDir, overrides);
+  return overrides;
+}
+
+/** Test/maintenance hook: forget cached config (e.g. after rewriting it in a test). */
+export function clearProjectConfigCache(): void {
+  cacheMeta.clear();
+  overridesCache.clear();
+}
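
The per-root, mtime-guarded caching used by `loadExtensionOverrides` can be
exercised without touching disk. The sketch below is a simplified
illustration, not the module above: `ConfigSource`, `makeLoader`, and the
in-memory `disk` map are hypothetical, and validation is omitted.

```typescript
type Overrides = Record<string, string>;

// Hypothetical I/O seam standing in for fs.statSync / readFileSync + JSON.parse.
interface ConfigSource {
  mtimeMs(root: string): number | null; // null means "no codegraph.json"
  parse(root: string): Overrides;
}

function makeLoader(source: ConfigSource) {
  const cache = new Map<string, { mtimeMs: number; overrides: Overrides }>();
  let parses = 0;
  return {
    load(root: string): Overrides {
      const mtimeMs = source.mtimeMs(root);
      if (mtimeMs === null) {
        cache.delete(root); // config removed: drop any stale entry
        return {};
      }
      const hit = cache.get(root);
      if (hit && hit.mtimeMs === mtimeMs) return hit.overrides; // unchanged file
      parses++;
      const overrides = source.parse(root);
      cache.set(root, { mtimeMs, overrides });
      return overrides;
    },
    parseCount: () => parses,
  };
}

// Two in-memory "projects" that map the same extension differently.
const disk = new Map<string, { mtimeMs: number; overrides: Overrides }>([
  ['/repo/a', { mtimeMs: 1, overrides: { '.foo': 'typescript' } }],
  ['/repo/b', { mtimeMs: 1, overrides: { '.foo': 'go' } }],
]);

const loader = makeLoader({
  mtimeMs: (root) => disk.get(root)?.mtimeMs ?? null,
  parse: (root) => disk.get(root)?.overrides ?? {},
});

loader.load('/repo/a');
loader.load('/repo/a'); // served from cache, no second parse
console.log(loader.parseCount());          // 1
console.log(loader.load('/repo/b')['.foo']); // go
```

Keying the cache by root is what keeps the two projects apart: each root has
its own entry, so one project's mapping never leaks into the other.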

+ 2 - 1
src/sync/watcher.ts

@@ -34,6 +34,7 @@
 import * as fs from 'fs';
 import * as path from 'path';
 import { isSourceFile, buildScopeIgnore, type ScopeIgnore } from '../extraction';
+import { loadExtensionOverrides } from '../project-config';
 import { logDebug, logWarn } from '../errors';
 import { normalizePath } from '../utils';
 import { isCodeGraphDataDir } from '../directory';
@@ -535,7 +536,7 @@ export class FileWatcher {
     if (!rel || rel === '.' || rel.startsWith('..')) return;
     if (this.isAlwaysIgnored(rel)) return;
     if (this.ignoreMatcher && this.ignoreMatcher.ignores(rel)) return;
-    if (!isSourceFile(rel)) return;
+    if (!isSourceFile(rel, loadExtensionOverrides(this.projectRoot))) return;
 
     logDebug('File change detected', { file: rel });
     if (this.ready) {