fix(extraction): respect .gitignore by default for embedded-repo discovery (#970, #976) (#980)

#514 (v1.0.0) began walking into gitignored directories to discover and
index the git repos nested inside them. That broke users who rely on
.gitignore to exclude a directory: a gitignored folder of cloned
reference repos blew graphs up (one report went from 10k to 500k edges,
#976) and stalled indexing on multi-gigabyte trees of clones (#970).

Respect .gitignore by default again. Discovering embedded repos inside a
gitignored directory is now opt-in via codegraph.json:

    { "includeIgnored": ["packages/", "services/"] }

The single choke point findIgnoredEmbeddedRepos now returns nothing
unless a gitignored dir matches the project's includeIgnored patterns,
and the matcher is threaded from the scan root through the full-index,
incremental-sync, and watcher-scope paths. Downstream ScopeIgnore and the
watcher are unchanged: they key off the discovered embedded roots, so
gating discovery fixes the indexer, sync, and watcher together. Untracked
embedded repos (#193) stay indexed by default.
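
As a rough illustration of the gate described above, here is a minimal sketch. It assumes plain directory-prefix matching, and the helper names (`normalizeDir`, `isOptedIn`, `gateIgnoredDirs`) are hypothetical; the real matcher behind findIgnoredEmbeddedRepos may well use full gitignore-style pattern semantics.

```typescript
// Hypothetical sketch: helper names and prefix matching are assumptions,
// not CodeGraph's actual implementation.

/** Normalize to forward slashes, no leading "./" or "/", one trailing slash. */
function normalizeDir(p: string): string {
  const cleaned = p
    .replace(/\\/g, '/')
    .replace(/^\.?\//, '')
    .replace(/\/+$/, '');
  return cleaned + '/';
}

/** True when a gitignored directory sits at or under an opted-in pattern. */
function isOptedIn(ignoredDir: string, includeIgnored: readonly string[]): boolean {
  const dir = normalizeDir(ignoredDir);
  return includeIgnored.some((pattern) => dir.startsWith(normalizeDir(pattern)));
}

/**
 * The gate: with no patterns configured, no gitignored directory is
 * searched for embedded repos, so .gitignore is fully respected.
 */
function gateIgnoredDirs(
  ignoredDirs: readonly string[],
  includeIgnored: readonly string[]
): string[] {
  if (includeIgnored.length === 0) return [];
  return ignoredDirs.filter((dir) => isOptedIn(dir, includeIgnored));
}
```

With `includeIgnored` set to `['packages/']`, a call such as `gateIgnoredDirs(['packages/', 'scratch/'], ['packages/'])` keeps only `packages/`, which mirrors the "only re-includes the opted-in dir" test in this commit.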

This restores the super-repo-of-clones behavior (#622, #699) for the
people who want it, while making the default match what every other tool
(and CodeGraph's own git ls-files foundation) does: .gitignore excludes.

project-config.ts now parses codegraph.json once (loadParsedConfig) and
exposes loadIncludeIgnoredPatterns alongside the existing extension map.
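
The loader contract that the new tests pin down (trim entries, drop blanks and non-strings, fall back to an empty list on any failure) could look roughly like the sketch below. The names `parseIncludeIgnored` and `readIncludeIgnored` are hypothetical stand-ins, not the real loadParsedConfig or loadIncludeIgnoredPatterns, and the mtime cache is left out.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical sketch of the loader's validation rules; the real
// project-config.ts also caches the parsed file by mtime.

/** Extract includeIgnored patterns from raw JSON text; never throws. */
function parseIncludeIgnored(rawJson: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return []; // malformed JSON degrades to the zero-config default
  }
  if (typeof parsed !== 'object' || parsed === null) return [];
  const value = (parsed as Record<string, unknown>).includeIgnored;
  if (!Array.isArray(value)) return []; // e.g. a bare string is ignored
  return value
    .filter((entry): entry is string => typeof entry === 'string')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

/** Read codegraph.json from a project root; a missing file yields []. */
function readIncludeIgnored(projectRoot: string): string[] {
  try {
    const raw = fs.readFileSync(path.join(projectRoot, 'codegraph.json'), 'utf8');
    return parseIncludeIgnored(raw);
  } catch {
    return [];
  }
}
```

Returning an empty list on every failure path keeps the default safe: a broken config file can only make CodeGraph index less, never walk into directories the user gitignored.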

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 18 hours ago
parent
commit
73bcc1afb4

+ 1 - 0
CHANGELOG.md

@@ -40,6 +40,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### Fixes
 
+- CodeGraph respects your `.gitignore` again when looking for nested git repositories. A directory you've gitignored — a `resource/` or `.repos/` folder of cloned reference projects, a large vendored data dir — is no longer walked into and indexed: version 1.0.0 started searching inside *every* gitignored directory for embedded git repos and pulling them all in, which could multiply a graph many times over and slow indexing to a crawl on a multi-gigabyte folder of clones, even though you'd explicitly excluded it. Indexing the repos inside a gitignored directory is now opt-in — add an `includeIgnored` list to a `codegraph.json` at your repo root, e.g. `{ "includeIgnored": ["packages/", "services/"] }`, to index the embedded repos under the directories you name. The "super-repo of independent clones" layout from 1.0.0 still works, just declared explicitly. Nested repos you *haven't* gitignored (untracked clones) are indexed as before, and a project without this layout is unaffected. (#970, #976, #622)
 - `codegraph index` and `codegraph init` no longer crawl during the "Resolving refs" phase on large projects — most painfully ones that mix a big front-end and back-end, where the phase could stretch to many minutes. A package or module imported across hundreds or thousands of files (`react`, a shared UI package, Python `logging` / `typing`) was being treated as if every one of those import statements might be its definition, so the resolver compared each import against all the others — work that grows with the *square* of how widely a package is imported, which is why it blew up only on big, import-heavy repos. Imports now resolve straight to the definitions they actually point at, so those redundant comparisons are gone (reference resolution is dramatically faster on large repos), and the graph no longer accumulates the meaningless import-to-import links the old fallback created. (#915)
 - MCP tool results no longer show up as oversized headings in Markdown-rendering clients (such as the Claude Code VSCode extension). Results used Markdown headings (`##`/`###`) for things like the status summary, each search hit, and every file section in an exploration, so a normal query filled the transcript with large-font lines — worst with `codegraph_search` and `codegraph_explore`, where the noise grew with the number of results. Section headers are now bold labels, which render at normal text size while keeping the same structure. Terminal/CLI output is unchanged. (#778)
 - An MCP server pointed at a very large repository (tens of thousands of files) no longer hangs on the first tool call after a fresh start. On startup CodeGraph reconciles its index against the current files on disk, and on a huge repo that reconcile could run for minutes while blocking the very first request — long enough that the background server was sometimes force-restarted mid-scan, so the first query never came back at all. The reconcile now yields as it runs (keeping the server responsive instead of pinning it), and the first tool call waits only briefly for it before answering and letting the rest finish in the background — so you get a fast first response and the index still catches up. Set `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS` to tune how long that first call waits (default 3000ms), or `=0` to always wait for the full reconcile. (#905)

+ 89 - 0
__tests__/include-ignored-config.test.ts

@@ -0,0 +1,89 @@
+/**
+ * `codegraph.json` `includeIgnored` loader (#970, #976 / #622, #699).
+ *
+ * Parsing, validation, and mtime-caching of the opt-in patterns that re-include
+ * gitignored directories for embedded-repo discovery. The behavioral end of this
+ * feature (scanDirectory / discoverEmbeddedRepoRoots / sync honoring the patterns)
+ * lives in `multi-repo-workspace.test.ts`; these are the loader unit tests,
+ * mirroring the `extensions` loader coverage in `extension-mapping.test.ts`.
+ *
+ * Invariant under test: every failure mode degrades to the zero-config default
+ * (empty patterns → `.gitignore` fully respected), never a throw.
+ */
+import { describe, it, expect, beforeEach, afterEach } from 'vitest';
+import * as fs from 'node:fs';
+import * as path from 'node:path';
+import * as os from 'node:os';
+import { loadIncludeIgnoredPatterns, loadExtensionOverrides, clearProjectConfigCache } from '../src/project-config';
+
+describe('includeIgnored loader (codegraph.json)', () => {
+  let dir: string;
+  beforeEach(() => {
+    dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-includeignored-'));
+    clearProjectConfigCache();
+  });
+  afterEach(() => {
+    clearProjectConfigCache();
+    fs.rmSync(dir, { recursive: true, force: true });
+  });
+  const writeConfig = (obj: unknown) =>
+    fs.writeFileSync(
+      path.join(dir, 'codegraph.json'),
+      typeof obj === 'string' ? obj : JSON.stringify(obj)
+    );
+
+  it('returns an empty list when there is no codegraph.json (the default)', () => {
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual([]);
+  });
+
+  it('loads a well-formed pattern array', () => {
+    writeConfig({ includeIgnored: ['packages/', 'services/'] });
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual(['packages/', 'services/']);
+  });
+
+  it('trims whitespace and drops blank / non-string entries', () => {
+    writeConfig({ includeIgnored: ['  packages/  ', '', '   ', 42, null, 'services/'] });
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual(['packages/', 'services/']);
+  });
+
+  it('ignores a non-array includeIgnored value without throwing', () => {
+    writeConfig({ includeIgnored: 'packages/' });
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual([]);
+  });
+
+  it('ignores malformed JSON without throwing', () => {
+    writeConfig('{ not: valid json ');
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual([]);
+  });
+
+  it('returns [] when the field is absent but other config is present', () => {
+    writeConfig({ extensions: { '.foo': 'typescript' } });
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual([]);
+  });
+
+  it('coexists with extensions in one file (shared single parse)', () => {
+    writeConfig({ extensions: { '.foo': 'typescript' }, includeIgnored: ['vendor/'] });
+    expect(loadExtensionOverrides(dir)).toEqual({ '.foo': 'typescript' });
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual(['vendor/']);
+  });
+
+  it('picks up a changed config (mtime-invalidated cache)', () => {
+    writeConfig({ includeIgnored: ['packages/'] });
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual(['packages/']);
+
+    writeConfig({ includeIgnored: ['services/'] });
+    // Force a distinct mtime in case the filesystem clock is coarse.
+    const future = new Date(Date.now() + 2000);
+    fs.utimesSync(path.join(dir, 'codegraph.json'), future, future);
+
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual(['services/']);
+  });
+
+  it('drops the patterns again when the config file is removed', () => {
+    writeConfig({ includeIgnored: ['packages/'] });
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual(['packages/']);
+
+    fs.rmSync(path.join(dir, 'codegraph.json'));
+    expect(loadIncludeIgnoredPatterns(dir)).toEqual([]);
+  });
+});

+ 307 - 227
__tests__/multi-repo-workspace.test.ts

@@ -1,18 +1,24 @@
 /**
- * Multi-repo workspaces (#514): a directory holding several independent git
- * repositories must index as a whole.
+ * Multi-repo workspaces (#514) — and the `.gitignore`-respect default (#970, #976).
  *
- * Two enumeration paths are exercised:
+ * A directory holding several independent git repositories can be indexed as a
+ * whole, but ONLY when the project opts the gitignored directories in. The
+ * default is the universal one: `.gitignore` excludes. Walking into a gitignored
+ * directory to index embedded repos there is OPT-IN via `codegraph.json`
+ * `includeIgnored` (#622, #699) — without it a gitignored `node_modules`-style
+ * reference/data dir full of nested clones is left untouched, instead of blowing
+ * the graph up or stalling the scan (#970, #976).
+ *
+ * Two enumeration paths are exercised under opt-in:
  *  - git path: the workspace root is itself a git repo (a "super-repo") whose
- *    `.gitignore` hides the child repos to keep `git status` quiet. git never
- *    lists ignored dirs, so the embedded repos were invisible (0 files). They
- *    are now discovered via the ignored-directories listing and enumerated by
- *    their own `git ls-files`. (#193 covered the *untracked* embedded case.)
+ *    `.gitignore` hides the child repos. They are discovered via the ignored-
+ *    directories listing and enumerated by their own `git ls-files`. (#193
+ *    covered the *untracked* embedded case, which stays on by default.)
  *  - sync path: `git status` in the parent says nothing about embedded repos;
- *    change detection now recurses into them.
+ *    change detection recurses into the opted-in ones.
  *
- * The non-git-parent case (plain folder of repos) already worked via the
- * filesystem walk — locked in here so it stays that way.
+ * The non-git-parent case (plain folder of repos) works via the filesystem walk
+ * regardless — locked in here so it stays that way.
  */
 import { describe, it, expect, beforeEach, afterEach } from 'vitest';
 import * as fs from 'fs';
@@ -21,6 +27,7 @@ import * as os from 'os';
 import { execFileSync } from 'child_process';
 import CodeGraph from '../src/index';
 import { scanDirectory, buildScopeIgnore, discoverEmbeddedRepoRoots } from '../src/extraction';
+import { clearProjectConfigCache } from '../src/project-config';
 
 function git(cwd: string, ...args: string[]): void {
   execFileSync('git', args, { cwd, stdio: ['ignore', 'ignore', 'ignore'] });
@@ -38,247 +45,320 @@ function write(file: string, content: string): void {
   fs.writeFileSync(file, content);
 }
 
-describe('multi-repo workspaces (#514)', () => {
+describe('multi-repo workspaces (#514) + .gitignore-respect default (#970, #976)', () => {
   let ws: string;
 
   beforeEach(() => {
     ws = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-multirepo-'));
+    clearProjectConfigCache();
   });
 
   afterEach(() => {
+    clearProjectConfigCache();
     fs.rmSync(ws, { recursive: true, force: true });
   });
 
-  it('indexes embedded repos hidden by the super-repo .gitignore', () => {
-    write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() { return 1; }\n');
-    write(path.join(ws, 'packages/proj-b/src/billing.ts'), 'export function charge() { return 2; }\n');
-    makeRepo(path.join(ws, 'packages/proj-a'));
-    makeRepo(path.join(ws, 'packages/proj-b'));
-    write(path.join(ws, '.gitignore'), '/packages/\n');
-    write(path.join(ws, 'tools.ts'), 'export function tool() { return 0; }\n');
-    makeRepo(ws);
-
-    const files = scanDirectory(ws);
-    expect(files).toContain('packages/proj-a/src/auth.ts');
-    expect(files).toContain('packages/proj-b/src/billing.ts');
-    expect(files).toContain('tools.ts'); // the parent's own tracked code still indexes
-  });
-
-  it('keeps respecting the parent .gitignore for the parent own (non-repo) dirs', () => {
-    write(path.join(ws, 'scratch/junk.ts'), 'export function junk() { return 9; }\n');
-    write(path.join(ws, 'src/app.ts'), 'export function app() { return 1; }\n');
-    write(path.join(ws, '.gitignore'), '/scratch/\n');
-    makeRepo(ws);
+  /** Drop a `codegraph.json` at the workspace root. */
+  const writeConfig = (obj: unknown) =>
+    fs.writeFileSync(path.join(ws, 'codegraph.json'),
+      typeof obj === 'string' ? obj : JSON.stringify(obj));
+
+  describe('default: .gitignore is respected (#970, #976)', () => {
+    it('does NOT index embedded repos inside a gitignored dir without opt-in', () => {
+      // The exact #976 layout: nested clones under a directory the user
+      // explicitly gitignored. They must stay out of the index — no graph blowup.
+      write(path.join(ws, '.repos/lib-a/src/a.ts'), 'export function fromLibA() { return 1; }\n');
+      write(path.join(ws, '.repos/lib-b/src/b.ts'), 'export function fromLibB() { return 2; }\n');
+      makeRepo(path.join(ws, '.repos/lib-a'));
+      makeRepo(path.join(ws, '.repos/lib-b'));
+      write(path.join(ws, '.gitignore'), '/.repos/\n');
+      write(path.join(ws, 'app.ts'), 'export function app() { return 0; }\n');
+      makeRepo(ws);
 
-    const files = scanDirectory(ws);
-    expect(files).toContain('src/app.ts');
-    // scratch/ is gitignored and contains NO embedded repo — stays excluded.
-    expect(files.some((f) => f.startsWith('scratch/'))).toBe(false);
+      const files = scanDirectory(ws);
+      expect(files).toContain('app.ts'); // the project's own code still indexes
+      expect(files.some((f) => f.startsWith('.repos/'))).toBe(false);
+    });
+
+    it('does NOT discover gitignored embedded roots without opt-in', () => {
+      write(path.join(ws, 'resource/ref/src/x.ts'), 'export const x = 1;\n');
+      makeRepo(path.join(ws, 'resource/ref'));
+      write(path.join(ws, '.gitignore'), '/resource/\n');
+      makeRepo(ws);
+
+      // The #970 perf fix: a gitignored dir of reference repos is never walked.
+      expect(discoverEmbeddedRepoRoots(ws)).toEqual([]);
+    });
+
+    it('ScopeIgnore: a gitignored dir is fully pruned without opt-in', () => {
+      write(path.join(ws, 'resource/ref/src/x.ts'), 'export const x = 1;\n');
+      makeRepo(path.join(ws, 'resource/ref'));
+      write(path.join(ws, '.gitignore'), '/resource/\n');
+      makeRepo(ws);
+
+      const scope = buildScopeIgnore(ws);
+      // Both the dir and its contents are ignored — the watcher won't descend.
+      expect(scope.ignores('resource/')).toBe(true);
+      expect(scope.ignores('resource/ref/src/x.ts')).toBe(true);
+    });
   });
 
-  it('never descends into git repos inside node_modules (npm git-dependencies)', () => {
-    // Embedded repo first (clean), node_modules dropped in afterwards —
-    // matching reality, where node_modules is never committed.
-    write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() {}\n');
-    makeRepo(path.join(ws, 'packages/proj-a'));
-    write(path.join(ws, 'packages/proj-a/node_modules/inner/src/evil2.ts'), 'export function evil2() {}\n');
-    makeRepo(path.join(ws, 'packages/proj-a/node_modules/inner')); // npm git-dep: has commits
-    // Workspace-level git-dep too.
-    write(path.join(ws, 'node_modules/git-dep/src/evil.ts'), 'export function evil() {}\n');
-    makeRepo(path.join(ws, 'node_modules/git-dep'));
-    write(path.join(ws, '.gitignore'), '/packages/\nnode_modules\n');
-    makeRepo(ws);
-
-    const files = scanDirectory(ws);
-    expect(files).toContain('packages/proj-a/src/auth.ts');
-    expect(files.some((f) => f.includes('node_modules'))).toBe(false);
-  });
+  describe('opt-in: codegraph.json includeIgnored re-includes a gitignored dir (#622, #699)', () => {
+    it('indexes embedded repos hidden by the super-repo .gitignore', () => {
+      write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() { return 1; }\n');
+      write(path.join(ws, 'packages/proj-b/src/billing.ts'), 'export function charge() { return 2; }\n');
+      makeRepo(path.join(ws, 'packages/proj-a'));
+      makeRepo(path.join(ws, 'packages/proj-b'));
+      write(path.join(ws, '.gitignore'), '/packages/\n');
+      write(path.join(ws, 'tools.ts'), 'export function tool() { return 0; }\n');
+      writeConfig({ includeIgnored: ['packages/'] });
+      makeRepo(ws);
 
-  it('still indexes UNTRACKED embedded repos (#193 regression)', () => {
-    write(path.join(ws, 'vendor-src/lib/src/util.ts'), 'export function util() {}\n');
-    makeRepo(path.join(ws, 'vendor-src/lib'));
-    write(path.join(ws, 'main.ts'), 'export function main() {}\n');
-    makeRepo(ws); // vendor-src/ is untracked (not ignored) — committed ws has only main.ts + nothing else
-    // NOTE: makeRepo committed vendor-src too via add -A… recreate untracked state:
-    git(ws, 'rm', '-r', '--cached', '-q', 'vendor-src');
-    git(ws, '-c', 'user.email=t@t', '-c', 'user.name=t', 'commit', '-qm', 'untrack');
-
-    const files = scanDirectory(ws);
-    expect(files).toContain('vendor-src/lib/src/util.ts');
-    expect(files).toContain('main.ts');
-  });
+      const files = scanDirectory(ws);
+      expect(files).toContain('packages/proj-a/src/auth.ts');
+      expect(files).toContain('packages/proj-b/src/billing.ts');
+      expect(files).toContain('tools.ts'); // the parent's own tracked code still indexes
+    });
+
+    it('only re-includes the opted-in dir, not every gitignored dir', () => {
+      // `packages/` is opted in; `scratch/` (also holding a repo) is NOT.
+      write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() {}\n');
+      makeRepo(path.join(ws, 'packages/proj-a'));
+      write(path.join(ws, 'scratch/throwaway/src/junk.ts'), 'export function junk() {}\n');
+      makeRepo(path.join(ws, 'scratch/throwaway'));
+      write(path.join(ws, '.gitignore'), '/packages/\n/scratch/\n');
+      writeConfig({ includeIgnored: ['packages/'] });
+      makeRepo(ws);
 
-  it('skips nested git worktrees instead of indexing them as duplicate embedded repos (#848)', () => {
-    // Claude Code (and others) create worktrees under a gitignored path like
-    // `.claude/worktrees/<name>/`. A worktree's `.git` is a FILE pointing into
-    // the host repo's own `.git/worktrees/`, so it is the SAME repo already
-    // indexed — sweeping it in as an embedded repo multiplies the whole graph.
-    // A genuine embedded clone (a `.git` *directory*) must still be indexed.
-    write(path.join(ws, 'src/app.ts'), 'export function app() { return 1; }\n');
-    write(path.join(ws, '.gitignore'), '.claude/\nvendored/\n');
-    makeRepo(ws);
-    // A real linked worktree under the gitignored .claude/worktrees/.
-    git(ws, 'worktree', 'add', '-q', '.claude/worktrees/feature', '-b', 'feature');
-    // A genuine embedded clone, also gitignored — must STAY indexed (#514).
-    write(path.join(ws, 'vendored/lib.ts'), 'export function vendoredFn() { return 9; }\n');
-    makeRepo(path.join(ws, 'vendored'));
-
-    const files = scanDirectory(ws);
-    expect(files).toContain('src/app.ts');
-    // The worktree is a duplicate working view — never indexed.
-    expect(files.some((f) => f.includes('.claude/worktrees'))).toBe(false);
-    // The genuine embedded clone is still indexed (#514/#622 preserved).
-    expect(files).toContain('vendored/lib.ts');
+      const files = scanDirectory(ws);
+      expect(files).toContain('packages/proj-a/src/auth.ts');
+      expect(files.some((f) => f.startsWith('scratch/'))).toBe(false);
+    });
+
+    it('discovers the opted-in ignored root alongside untracked roots', () => {
+      write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() {}\n');
+      makeRepo(path.join(ws, 'packages/proj-a'));
+      write(path.join(ws, 'vendor-src/lib/util.ts'), 'export function util() {}\n');
+      makeRepo(path.join(ws, 'vendor-src/lib'));
+      write(path.join(ws, '.gitignore'), '/packages/\n'); // vendor-src stays untracked
+      writeConfig({ includeIgnored: ['packages/'] });
+      makeRepo(ws);
+      git(ws, 'rm', '-r', '--cached', '-q', 'vendor-src');
+      git(ws, '-c', 'user.email=t@t', '-c', 'user.name=t', 'commit', '-qm', 'untrack');
+
+      const roots = discoverEmbeddedRepoRoots(ws);
+      expect(roots).toContain('packages/proj-a/'); // opted-in ignored kind
+      expect(roots).toContain('vendor-src/lib/');   // untracked kind (always on)
+    });
+
+    it('ScopeIgnore: opted-in embedded files use the child rules; the watcher can descend', () => {
+      write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() {}\n');
+      write(path.join(ws, 'packages/proj-a/.gitignore'), 'build/\n');
+      makeRepo(path.join(ws, 'packages/proj-a'));
+      write(path.join(ws, '.gitignore'), '/packages/\n');
+      writeConfig({ includeIgnored: ['packages/'] });
+      makeRepo(ws);
+
+      const scope = buildScopeIgnore(ws);
+      // Inside the opted-in embedded repo: the CHILD's rules decide.
+      expect(scope.ignores('packages/proj-a/src/auth.ts')).toBe(false);
+      expect(scope.ignores('packages/proj-a/build/out.ts')).toBe(true);
+      // Under the ignored dir but NOT in any embedded repo: parent rules apply.
+      expect(scope.ignores('packages/stray.ts')).toBe(true);
+      // Directory form: ancestors of an embedded root are never pruned —
+      // the Linux per-directory watcher must descend through `packages/`.
+      expect(scope.ignores('packages/')).toBe(false);
+      // Ordinary paths: unchanged semantics.
+      expect(scope.ignores('node_modules/dep/index.ts')).toBe(true);
+      expect(scope.ignores('src/app.ts')).toBe(false);
+    });
+
+    it('sync picks up a change inside an opted-in gitignored embedded repo', async () => {
+      write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() { return 1; }\n');
+      makeRepo(path.join(ws, 'packages/proj-a'));
+      write(path.join(ws, '.gitignore'), '/packages/\n');
+      writeConfig({ includeIgnored: ['packages/'] });
+      makeRepo(ws);
+
+      const cg = CodeGraph.initSync(ws, { config: { include: ['**/*.ts'], exclude: [] } });
+      try {
+        await cg.indexAll();
+        expect(cg.searchNodes('login', { limit: 5 }).length).toBeGreaterThan(0);
+
+        // Change inside the embedded repo — invisible to the parent's `git status`.
+        write(path.join(ws, 'packages/proj-a/src/auth.ts'),
+          'export function login() { return 1; }\nexport function logout() { return 0; }\n');
+        await cg.sync();
+
+        expect(cg.searchNodes('logout', { limit: 5 }).length).toBeGreaterThan(0);
+      } finally {
+        cg.destroy();
+      }
+    });
   });
 
-  it('skips a submodule worktree instead of indexing it as a duplicate (#945)', () => {
-    // A worktree OF A SUBMODULE points its `.git` into
-    // `.git/modules/<module>/worktrees/<name>` — not the top-level repo's
-    // `.git/worktrees/`. The detector used to miss that extra `modules/<name>`
-    // segment, so the worktree fell through to "embedded" and every symbol it
-    // shared with the real submodule checkout got indexed twice. The submodule's
-    // own checkout (`.git/modules/<module>`, no `worktrees/`) is distinct code
-    // and must stay indexed (#514).
-    const upstream = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-945-up-'));
-    try {
-      // The repo that becomes the submodule's origin.
-      write(path.join(upstream, 'lib.ts'), 'export function libFn() { return 1; }\n');
-      makeRepo(upstream);
-
+  describe('discovery/classifier machinery (exercised under opt-in)', () => {
+    it('keeps respecting the parent .gitignore for the parent own (non-repo) dirs', () => {
+      write(path.join(ws, 'scratch/junk.ts'), 'export function junk() { return 9; }\n');
       write(path.join(ws, 'src/app.ts'), 'export function app() { return 1; }\n');
-      write(path.join(ws, '.gitignore'), '.worktrees/\n');
-      git(ws, 'init', '-q');
-      // protocol.file.allow=always: modern git refuses a local-path submodule otherwise.
-      git(ws, '-c', 'protocol.file.allow=always', 'submodule', 'add', '-q', upstream, 'common');
-      git(ws, '-c', 'user.email=t@t', '-c', 'user.name=t', 'commit', '-qm', 'add submodule');
-
-      // A worktree of the submodule, under the gitignored .worktrees/ — its `.git`
-      // points into `.git/modules/common/worktrees/<name>`.
-      git(path.join(ws, 'common'), 'worktree', 'add', '-q', '../.worktrees/common-feature', '-b', 'feature');
+      write(path.join(ws, '.gitignore'), '/scratch/\n');
+      makeRepo(ws);
 
       const files = scanDirectory(ws);
       expect(files).toContain('src/app.ts');
-      // The real submodule checkout is distinct code — still indexed (#514).
-      expect(files).toContain('common/lib.ts');
-      // The submodule worktree is a duplicate working view — never indexed (#945).
-      expect(files.some((f) => f.includes('.worktrees'))).toBe(false);
-    } finally {
-      fs.rmSync(upstream, { recursive: true, force: true });
-    }
-  });
-
-  it('non-git workspace: walks children and respects each child own .gitignore', () => {
-    write(path.join(ws, 'proj-a/src/auth.ts'), 'export function login() {}\n');
-    write(path.join(ws, 'proj-a/build/out.ts'), 'export function generated() {}\n');
-    write(path.join(ws, 'proj-a/.gitignore'), 'build/\n');
-    write(path.join(ws, 'proj-b/src/billing.ts'), 'export function charge() {}\n');
-    makeRepo(path.join(ws, 'proj-a'));
-    makeRepo(path.join(ws, 'proj-b'));
-    // ws itself is NOT a git repo.
-
-    const files = scanDirectory(ws);
-    expect(files).toContain('proj-a/src/auth.ts');
-    expect(files).toContain('proj-b/src/billing.ts');
-    expect(files.some((f) => f.includes('build/'))).toBe(false);
-  });
+      // scratch/ is gitignored and contains NO embedded repo — stays excluded.
+      expect(files.some((f) => f.startsWith('scratch/'))).toBe(false);
+    });
+
+    it('never descends into git repos inside node_modules (npm git-dependencies)', () => {
+      // Embedded repo first (clean), node_modules dropped in afterwards —
+      // matching reality, where node_modules is never committed.
+      write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() {}\n');
+      makeRepo(path.join(ws, 'packages/proj-a'));
+      write(path.join(ws, 'packages/proj-a/node_modules/inner/src/evil2.ts'), 'export function evil2() {}\n');
+      makeRepo(path.join(ws, 'packages/proj-a/node_modules/inner')); // npm git-dep: has commits
+      // Workspace-level git-dep too.
+      write(path.join(ws, 'node_modules/git-dep/src/evil.ts'), 'export function evil() {}\n');
+      makeRepo(path.join(ws, 'node_modules/git-dep'));
+      write(path.join(ws, '.gitignore'), '/packages/\nnode_modules\n');
+      writeConfig({ includeIgnored: ['packages/'] });
+      makeRepo(ws);
 
-  it('does not search beyond the embedded-repo depth cap', () => {
-    // Repo buried 5 levels under the ignored dir — past EMBEDDED_REPO_SEARCH_DEPTH (4).
-    const deep = path.join(ws, 'pkgs/a/b/c/d/e');
-    write(path.join(deep, 'src/deep.ts'), 'export function deep() {}\n');
-    makeRepo(deep);
-    write(path.join(ws, 'main.ts'), 'export function main() {}\n');
-    write(path.join(ws, '.gitignore'), '/pkgs/\n');
-    makeRepo(ws);
-
-    const files = scanDirectory(ws);
-    expect(files).toContain('main.ts');
-    expect(files.some((f) => f.includes('deep.ts'))).toBe(false);
-  });
-
-  it('discovers embedded roots (ignored + untracked kinds); none for non-git roots', () => {
-    write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() {}\n');
-    makeRepo(path.join(ws, 'packages/proj-a'));
-    write(path.join(ws, 'vendor-src/lib/util.ts'), 'export function util() {}\n');
-    makeRepo(path.join(ws, 'vendor-src/lib'));
-    write(path.join(ws, '.gitignore'), '/packages/\n'); // vendor-src stays untracked
-    makeRepo(ws);
-    git(ws, 'rm', '-r', '--cached', '-q', 'vendor-src');
-    git(ws, '-c', 'user.email=t@t', '-c', 'user.name=t', 'commit', '-qm', 'untrack');
-
-    const roots = discoverEmbeddedRepoRoots(ws);
-    expect(roots).toContain('packages/proj-a/');
-    expect(roots).toContain('vendor-src/lib/');
-
-    const plain = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-nongit-'));
-    try {
-      expect(discoverEmbeddedRepoRoots(plain)).toEqual([]);
-    } finally {
-      fs.rmSync(plain, { recursive: true, force: true });
-    }
-  });
+      const files = scanDirectory(ws);
+      expect(files).toContain('packages/proj-a/src/auth.ts');
+      // node_modules is a built-in default exclude — never re-included, even though
+      // `packages/` is opted in and node_modules is gitignored.
+      expect(files.some((f) => f.includes('node_modules'))).toBe(false);
+    });
+
+    it('still indexes UNTRACKED embedded repos by default (#193 regression)', () => {
+      write(path.join(ws, 'vendor-src/lib/src/util.ts'), 'export function util() {}\n');
+      makeRepo(path.join(ws, 'vendor-src/lib'));
+      write(path.join(ws, 'main.ts'), 'export function main() {}\n');
+      makeRepo(ws); // vendor-src/ is untracked (not ignored) — committed ws has only main.ts + nothing else
+      // NOTE: makeRepo committed vendor-src too via add -A… recreate untracked state:
+      git(ws, 'rm', '-r', '--cached', '-q', 'vendor-src');
+      git(ws, '-c', 'user.email=t@t', '-c', 'user.name=t', 'commit', '-qm', 'untrack');
+
+      // No codegraph.json: the untracked path is unaffected by the opt-in gate.
+      const files = scanDirectory(ws);
+      expect(files).toContain('vendor-src/lib/src/util.ts');
+      expect(files).toContain('main.ts');
+    });
+
+    it('skips nested git worktrees instead of indexing them as duplicate embedded repos (#848)', () => {
+      // Claude Code (and others) create worktrees under a gitignored path like
+      // `.claude/worktrees/<name>/`. A worktree's `.git` is a FILE pointing into
+      // the host repo's own `.git/worktrees/`, so it is the SAME repo already
+      // indexed — sweeping it in as an embedded repo multiplies the whole graph.
+      // A genuine embedded clone (a `.git` *directory*) must still be indexed.
+      // Both dirs are opted in so the classifier (not the gitignore gate) is what
+      // decides: the worktree is skipped, the genuine clone is kept.
+      write(path.join(ws, 'src/app.ts'), 'export function app() { return 1; }\n');
+      write(path.join(ws, '.gitignore'), '.claude/\nvendored/\n');
+      writeConfig({ includeIgnored: ['.claude/', 'vendored/'] });
+      makeRepo(ws);
+      // A real linked worktree under the gitignored .claude/worktrees/.
+      git(ws, 'worktree', 'add', '-q', '.claude/worktrees/feature', '-b', 'feature');
+      // A genuine embedded clone, also gitignored — must STAY indexed under opt-in.
+      write(path.join(ws, 'vendored/lib.ts'), 'export function vendoredFn() { return 9; }\n');
+      makeRepo(path.join(ws, 'vendored'));
 
-  it('ScopeIgnore: embedded files use the child rules; the watcher can descend to them', () => {
-    write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() {}\n');
-    write(path.join(ws, 'packages/proj-a/.gitignore'), 'build/\n');
-    makeRepo(path.join(ws, 'packages/proj-a'));
-    write(path.join(ws, '.gitignore'), '/packages/\n');
-    makeRepo(ws);
-
-    const scope = buildScopeIgnore(ws);
-    // Inside the embedded repo: the CHILD's rules decide.
-    expect(scope.ignores('packages/proj-a/src/auth.ts')).toBe(false);
-    expect(scope.ignores('packages/proj-a/build/out.ts')).toBe(true);
-    // Under the ignored dir but NOT in any embedded repo: parent rules apply.
-    expect(scope.ignores('packages/stray.ts')).toBe(true);
-    // Directory form: ancestors of an embedded root are never pruned —
-    // the Linux per-directory watcher must descend through `packages/`.
-    expect(scope.ignores('packages/')).toBe(false);
-    // Ordinary paths: unchanged semantics.
-    expect(scope.ignores('node_modules/dep/index.ts')).toBe(true);
-    expect(scope.ignores('src/app.ts')).toBe(false);
-  });
+      const files = scanDirectory(ws);
+      expect(files).toContain('src/app.ts');
+      // The worktree is a duplicate working view — never indexed (#848).
+      expect(files.some((f) => f.includes('.claude/worktrees'))).toBe(false);
+      // The genuine embedded clone is still indexed under opt-in (#514/#622).
+      expect(files).toContain('vendored/lib.ts');
+    });
+
+    it('skips a submodule worktree instead of indexing it as a duplicate (#945)', () => {
+      // A worktree OF A SUBMODULE points its `.git` into
+      // `.git/modules/<module>/worktrees/<name>` — not the top-level repo's
+      // `.git/worktrees/`. The detector used to miss that extra `modules/<name>`
+      // segment, so the worktree fell through to "embedded" and every symbol it
+      // shared with the real submodule checkout got indexed twice. The submodule's
+      // own checkout (`.git/modules/<module>`, no `worktrees/`) is distinct code
+      // and must stay indexed. The worktree dir is opted in so the classifier is
+      // what skips it (not the gitignore gate).
+      const upstream = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-945-up-'));
+      try {
+        // The repo that becomes the submodule's origin.
+        write(path.join(upstream, 'lib.ts'), 'export function libFn() { return 1; }\n');
+        makeRepo(upstream);
+
+        write(path.join(ws, 'src/app.ts'), 'export function app() { return 1; }\n');
+        write(path.join(ws, '.gitignore'), '.worktrees/\n');
+        writeConfig({ includeIgnored: ['.worktrees/'] });
+        git(ws, 'init', '-q');
+        // protocol.file.allow=always: modern git refuses a local-path submodule otherwise.
+        git(ws, '-c', 'protocol.file.allow=always', 'submodule', 'add', '-q', upstream, 'common');
+        git(ws, '-c', 'user.email=t@t', '-c', 'user.name=t', 'commit', '-qm', 'add submodule');
+
+        // A worktree of the submodule, under the gitignored .worktrees/ — its `.git`
+        // points into `.git/modules/common/worktrees/<name>`.
+        git(path.join(ws, 'common'), 'worktree', 'add', '-q', '../.worktrees/common-feature', '-b', 'feature');
+
+        const files = scanDirectory(ws);
+        expect(files).toContain('src/app.ts');
+        // The real submodule checkout is distinct code — still indexed (#514).
+        expect(files).toContain('common/lib.ts');
+        // The submodule worktree is a duplicate working view — never indexed (#945).
+        expect(files.some((f) => f.includes('.worktrees'))).toBe(false);
+      } finally {
+        fs.rmSync(upstream, { recursive: true, force: true });
+      }
+    });
+
+    it('non-git workspace: walks children and respects each child\'s own .gitignore', () => {
+      write(path.join(ws, 'proj-a/src/auth.ts'), 'export function login() {}\n');
+      write(path.join(ws, 'proj-a/build/out.ts'), 'export function generated() {}\n');
+      write(path.join(ws, 'proj-a/.gitignore'), 'build/\n');
+      write(path.join(ws, 'proj-b/src/billing.ts'), 'export function charge() {}\n');
+      makeRepo(path.join(ws, 'proj-a'));
+      makeRepo(path.join(ws, 'proj-b'));
+      // ws itself is NOT a git repo.
 
-  it('buildScopeIgnore: indexed root is itself a gitignored subdir of an enclosing repo (#936)', () => {
-    // `child/` is NOT its own repo, so `git` resolves the ENCLOSING repo from
-    // inside it — and `git ls-files --directory`, whose cwd is then a wholly
-    // ignored directory, emits the literal `./` ("this entire directory").
-    // That sentinel used to reach the `ignore` matcher and throw
-    // ("path should be a `path.relative()`d string, but got "./""), aborting
-    // buildScopeIgnore → the MCP daemon's watcher never started and auto-sync
-    // silently stalled until a manual `codegraph sync`.
-    write(path.join(ws, 'child/src/a.ts'), 'export const x = 1;\n');
-    write(path.join(ws, '.gitignore'), '/child/\n');
-    makeRepo(ws);
-
-    const child = path.join(ws, 'child');
-    // The crux: building scope for the ignored subdir must not throw.
-    const scope = buildScopeIgnore(child);
-    // The subdir's own source is watchable/indexable, not ignored.
-    expect(scope.ignores('src/a.ts')).toBe(false);
-    // And the `./` self entry must not be mistaken for a nested embedded repo.
-    expect(discoverEmbeddedRepoRoots(child)).toEqual([]);
-  });
+      const files = scanDirectory(ws);
+      expect(files).toContain('proj-a/src/auth.ts');
+      expect(files).toContain('proj-b/src/billing.ts');
+      expect(files.some((f) => f.includes('build/'))).toBe(false);
+    });
+
+    it('does not search beyond the embedded-repo depth cap (opted-in dir)', () => {
+      // Repo buried 5 levels under the ignored dir — past EMBEDDED_REPO_SEARCH_DEPTH (4).
+      const deep = path.join(ws, 'pkgs/a/b/c/d/e');
+      write(path.join(deep, 'src/deep.ts'), 'export function deep() {}\n');
+      makeRepo(deep);
+      write(path.join(ws, 'main.ts'), 'export function main() {}\n');
+      write(path.join(ws, '.gitignore'), '/pkgs/\n');
+      writeConfig({ includeIgnored: ['pkgs/'] });
+      makeRepo(ws);
 
-  it('sync picks up a change inside a gitignored embedded repo', async () => {
-    write(path.join(ws, 'packages/proj-a/src/auth.ts'), 'export function login() { return 1; }\n');
-    makeRepo(path.join(ws, 'packages/proj-a'));
-    write(path.join(ws, '.gitignore'), '/packages/\n');
-    makeRepo(ws);
-
-    const cg = CodeGraph.initSync(ws, { config: { include: ['**/*.ts'], exclude: [] } });
-    try {
-      await cg.indexAll();
-      expect(cg.searchNodes('login', { limit: 5 }).length).toBeGreaterThan(0);
-
-      // Change inside the embedded repo — invisible to the parent's `git status`.
-      write(path.join(ws, 'packages/proj-a/src/auth.ts'),
-        'export function login() { return 1; }\nexport function logout() { return 0; }\n');
-      await cg.sync();
-
-      expect(cg.searchNodes('logout', { limit: 5 }).length).toBeGreaterThan(0);
-    } finally {
-      cg.destroy();
-    }
+      const files = scanDirectory(ws);
+      expect(files).toContain('main.ts');
+      expect(files.some((f) => f.includes('deep.ts'))).toBe(false);
+    });
+
+    it('buildScopeIgnore: indexed root is itself a gitignored subdir of an enclosing repo (#936)', () => {
+      // `child/` is NOT its own repo, so `git` resolves the ENCLOSING repo from
+      // inside it — and `git ls-files --directory`, whose cwd is then a wholly
+      // ignored directory, emits the literal `./` ("this entire directory").
+      // That sentinel used to reach the `ignore` matcher and throw
+      // ("path should be a `path.relative()`d string, but got "./""), aborting
+      // buildScopeIgnore → the MCP daemon's watcher never started and auto-sync
+      // silently stalled until a manual `codegraph sync`.
+      write(path.join(ws, 'child/src/a.ts'), 'export const x = 1;\n');
+      write(path.join(ws, '.gitignore'), '/child/\n');
+      makeRepo(ws);
+
+      const child = path.join(ws, 'child');
+      // The crux: building scope for the ignored subdir must not throw.
+      const scope = buildScopeIgnore(child);
+      // The subdir's own source is watchable/indexable, not ignored.
+      expect(scope.ignores('src/a.ts')).toBe(false);
+      // And the `./` self entry must not be mistaken for a nested embedded repo.
+      expect(discoverEmbeddedRepoRoots(child)).toEqual([]);
+    });
   });
 });
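
The two worktree tests above (#848, #945) hinge on where a `.git` file points. The following is a minimal sketch of that distinction, assuming the pointer has git's usual `gitdir: <path>` form. `classifyGitdirPointer` and its `'worktree' | 'other'` labels are hypothetical names for illustration; they are not the project's `classifyGitDir`, which takes a directory and whose implementation is not shown in this diff.

```typescript
// Hypothetical helper (not the project's classifyGitDir): classify the text of
// a `.git` FILE by the admin directory it points at.
type PointerKind = 'worktree' | 'other';

function classifyGitdirPointer(dotGitFileContent: string): PointerKind {
  const match = /^gitdir:\s*(.+)$/m.exec(dotGitFileContent.trim());
  if (!match) return 'other';

  // Normalize Windows separators so one pattern covers both layouts.
  const target = (match[1] ?? '').trim().replace(/\\/g, '/');

  // Matches `.git/worktrees/<name>` (a worktree of the host repo) and
  // `.git/modules/<module>/worktrees/<name>` (a worktree of a submodule).
  const worktreeAdminDir = /(?:^|\/)\.git\/(?:modules\/.+\/)?worktrees\/[^/]+\/?$/;
  return worktreeAdminDir.test(target) ? 'worktree' : 'other';
}
```

A submodule's own checkout points at `.git/modules/<module>` with no `worktrees/` segment, so this sketch reports it as `'other'`, which is the case the #945 test expects to stay indexed.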

+ 24 - 2
site/src/content/docs/getting-started/configuration.md

@@ -1,9 +1,9 @@
 ---
 title: Configuration
-description: CodeGraph is zero-config by default, with one optional file for mapping custom extensions.
+description: CodeGraph is zero-config by default, with one optional codegraph.json for custom extensions and indexing nested git repositories.
 ---
 
-Next to none — CodeGraph is **zero-config by default**, with nothing to write or keep in sync to get started. Language support is automatic from the file extension; there's nothing to wire up per language. The one optional file is for mapping [custom file extensions](#custom-file-extensions).
+Next to none — CodeGraph is **zero-config by default**, with nothing to write or keep in sync to get started. Language support is automatic from the file extension; there's nothing to wire up per language. The one optional file, `codegraph.json`, covers [custom file extensions](#custom-file-extensions) and [indexing nested git repositories](#indexing-nested-git-repositories).
 
 ## What it skips out of the box
 
@@ -34,6 +34,28 @@ Each value is a supported language id. The mappings merge on top of the built-in
 
 A typo'd language or a malformed file is warned about and skipped — it never breaks indexing — and a project with no `codegraph.json` behaves exactly as before. Re-index (`codegraph index`) after adding or changing mappings.
 
+## Indexing nested git repositories
+
+CodeGraph respects your `.gitignore`, so a directory you've gitignored stays out of the graph — **including any git repositories nested inside it.** If you keep cloned reference projects, vendored copies, or a folder of unrelated repos in a gitignored directory (a `resource/`, `.repos/`, or `examples/` dir), CodeGraph leaves it untouched: it won't walk in, discover the embedded repos, or index them.
+
+If instead you run a **"super-repo" of independent clones** — a workspace whose own `.gitignore` lists its child repos to keep `git status` quiet, where you genuinely want every child indexed into one graph — opt those directories back in with `includeIgnored`:
+
+```json
+{
+  "includeIgnored": ["packages/", "services/"]
+}
+```
+
+Each entry is a gitignore-style pattern naming a gitignored directory whose nested git repositories should be indexed anyway. CodeGraph descends into the directories you list and indexes each embedded repo by its own `git ls-files`, so every child repo's own `.gitignore` is still honored. Directories you don't list stay excluded.
+
+A few things to know:
+
+- **Untracked** nested repositories (ones you haven't gitignored) are indexed automatically — `includeIgnored` is only for the ones your `.gitignore` excludes.
+- Built-in skips like `node_modules` are never re-included, even inside an opted-in directory.
+- A project without this layout needs no `codegraph.json` at all.
+
+Re-index (`codegraph index`) after adding or changing `includeIgnored`.
+
 ## Where data lives
 
 Per-project data lives in a `.codegraph/` directory at your project root, containing the SQLite database (`codegraph.db`). Nothing leaves your machine.

+ 72 - 33
src/extraction/index.ts

@@ -19,7 +19,7 @@ import {
 import { QueryBuilder } from '../db/queries';
 import { extractFromSource } from './tree-sitter';
 import { detectLanguage, isSourceFile, isLanguageSupported, isFileLevelOnlyLanguage, initGrammars, loadGrammarsForLanguages } from './grammars';
-import { loadExtensionOverrides } from '../project-config';
+import { loadExtensionOverrides, loadIncludeIgnoredPatterns } from '../project-config';
 import { isCodeGraphDataDir } from '../directory';
 import { logDebug, logWarn } from '../errors';
 import { validatePathWithinRoot, normalizePath } from '../utils';
@@ -269,6 +269,20 @@ function defaultsOnlyIgnore(): Ignore {
   return ignore().add(DEFAULT_IGNORE_PATTERNS);
 }
 
+/**
+ * Matcher for the project's `codegraph.json` `includeIgnored` patterns — the
+ * explicit opt-in to index embedded git repos living inside gitignored
+ * directories (#622, #699). Returns `null` when the project opted in nothing,
+ * which is the zero-config DEFAULT: `.gitignore` is then fully respected and a
+ * gitignored directory (even one holding nested repos) is never walked or
+ * indexed (#970, #976). Built once per scan/sync/scope operation from the scan
+ * root and threaded down — never global, so multi-project daemons stay isolated.
+ */
+function loadIncludeIgnoredMatcher(rootDir: string): Ignore | null {
+  const patterns = loadIncludeIgnoredPatterns(rootDir);
+  return patterns.length > 0 ? ignore().add(patterns) : null;
+}
+
 /**
  * `git ls-files --directory` collapses a wholly-untracked/ignored directory into
  * one entry — and when the command's own cwd is such a directory (the indexed
@@ -446,9 +460,12 @@ export function buildScopeIgnore(rootDir: string, embeddedRoots?: Iterable<strin
 
 /**
  * Standalone discovery of every embedded repo root under `rootDir` (relative,
- * trailing-slashed) — both the untracked kind (#193) and the gitignored kind
- * (#514), recursively (an embedded repo can embed further repos). Returns []
- * for non-git roots: the filesystem walk handles nested repos there already.
+ * trailing-slashed) — the untracked kind (#193) always, and the gitignored kind
+ * (#514) only for directories the project opted in via `codegraph.json`
+ * `includeIgnored` (#622, #699); otherwise `.gitignore` is respected and they
+ * are not discovered (#970, #976). Recursive (an embedded repo can embed further
+ * repos). Returns [] for non-git roots: the filesystem walk handles nested repos
+ * there already.
  */
 export function discoverEmbeddedRepoRoots(rootDir: string): string[] {
   try {
@@ -458,6 +475,7 @@ export function discoverEmbeddedRepoRoots(rootDir: string): string[] {
   }
   const out: string[] = [];
   const defaults = defaultsOnlyIgnore();
+  const includeIgnored = loadIncludeIgnoredMatcher(rootDir);
   const visit = (repoAbs: string, prefix: string): void => {
     const candidates: string[] = [];
     try {
@@ -472,7 +490,7 @@ export function discoverEmbeddedRepoRoots(rootDir: string): string[] {
         }
       }
     } catch { /* untracked listing failed — ignored-side discovery still runs */ }
-    candidates.push(...findIgnoredEmbeddedRepos(repoAbs));
+    candidates.push(...findIgnoredEmbeddedRepos(repoAbs, includeIgnored, prefix));
     for (const rel of candidates) {
       const full = normalizePath(prefix + rel);
       out.push(full);
@@ -484,15 +502,27 @@ export function discoverEmbeddedRepoRoots(rootDir: string): string[] {
 }
 
 /**
- * Discover embedded repos hidden by `repoDir`'s OWN ignore rules: for each
- * gitignored directory (skipping built-in default excludes), search for nested
- * `.git` roots. Returns repo paths relative to `repoDir`, trailing-slashed.
+ * Discover embedded repos hidden by `repoDir`'s OWN gitignore rules: for each
+ * gitignored directory, search for nested `.git` roots. Returns repo paths
+ * relative to `repoDir`, trailing-slashed.
+ *
+ * OPT-IN ONLY. Walking into a gitignored directory contradicts what every other
+ * tool (and CodeGraph's own `git ls-files` foundation) does — `.gitignore`
+ * excludes. So this returns `[]` unless the project opted the directory in via
+ * `codegraph.json` `includeIgnored`; without that, a gitignored dir — including
+ * a huge reference/data dir full of nested clones — is left untouched (#970,
+ * #976). When opted in, it restores the super-repo-of-clones behavior (#622,
+ * #699). `prefix` is the scan-root-relative path of `repoDir`, so a pattern like
+ * `services/` opts that whole subtree in at any recursion depth. Built-in
+ * default excludes (`node_modules`, …) are always skipped.
  */
-function findIgnoredEmbeddedRepos(repoDir: string): string[] {
+function findIgnoredEmbeddedRepos(repoDir: string, includeIgnored: Ignore | null, prefix: string): string[] {
+  if (!includeIgnored) return [];
   const defaults = defaultsOnlyIgnore();
   const repos: string[] = [];
   for (const dir of listIgnoredDirs(repoDir)) {
     if (defaults.ignores(dir)) continue;
+    if (!includeIgnored.ignores(normalizePath(prefix + dir))) continue;
     repos.push(...findNestedGitRepos(path.join(repoDir, dir), dir));
   }
   return repos;
@@ -509,12 +539,15 @@ function findIgnoredEmbeddedRepos(repoDir: string): string[] {
  * skips them entirely, and untracked output reports them only as an opaque
  * "subdir/" entry (trailing slash) rather than expanding their files. Each
  * embedded repo is its own git boundary, so we re-run `git ls-files` inside it.
- * (See issue #193.) GITIGNORED embedded repos are invisible even to that —
- * they're discovered separately via `findIgnoredEmbeddedRepos` (#514); every
- * embedded repo root (however found) is recorded in `embeddedRoots` so callers
- * can exempt its files from the parent's own gitignore rules.
+ * (See issue #193.) GITIGNORED embedded repos are invisible even to that; they
+ * are discovered separately via `findIgnoredEmbeddedRepos` (#514) but ONLY for
+ * directories the project opted in through `codegraph.json` `includeIgnored`
+ * (the `includeIgnored` parameter, threaded from the scan root); by default `.gitignore`
+ * is respected and they stay out (#970, #976). Every embedded repo root (however
+ * found) is recorded in `embeddedRoots` so callers can exempt its files from the
+ * parent's own gitignore rules.
  */
-function collectGitFiles(repoDir: string, prefix: string, files: Set<string>, embeddedRoots?: Set<string>): void {
+function collectGitFiles(repoDir: string, prefix: string, files: Set<string>, embeddedRoots?: Set<string>, includeIgnored: Ignore | null = null): void {
   const gitOpts = { cwd: repoDir, encoding: 'utf-8' as const, timeout: 30000, maxBuffer: 50 * 1024 * 1024, stdio: ['pipe', 'pipe', 'pipe'] as ['pipe', 'pipe', 'pipe'], windowsHide: true };
 
   // Tracked files. --recurse-submodules pulls in files from active submodules,
@@ -548,7 +581,7 @@ function collectGitFiles(repoDir: string, prefix: string, files: Set<string>, em
       // it's a duplicate working view of an already-indexed repo (#848).
       if (classifyGitDir(childDir) === 'embedded' && !defaultsOnlyIgnore().ignores(rel)) {
         embeddedRoots?.add(normalizePath(prefix + rel));
-        collectGitFiles(childDir, prefix + rel, files, embeddedRoots);
+        collectGitFiles(childDir, prefix + rel, files, embeddedRoots, includeIgnored);
       }
       continue;
     }
@@ -556,11 +589,13 @@ function collectGitFiles(repoDir: string, prefix: string, files: Set<string>, em
   }
 
   // Embedded repos hidden by THIS repo's ignore rules (`/packages/` in a
-  // super-repo .gitignore) never appear in any listing above — discover and
-  // recurse into them too. (#514)
-  for (const rel of findIgnoredEmbeddedRepos(repoDir)) {
+  // super-repo .gitignore) never appear in any listing above. By default they
+  // stay hidden — `.gitignore` is respected (#970, #976). They are recursed into
+  // only when the project opted the directory in via `codegraph.json`
+  // `includeIgnored` (#622, #699), which `findIgnoredEmbeddedRepos` enforces.
+  for (const rel of findIgnoredEmbeddedRepos(repoDir, includeIgnored, prefix)) {
     embeddedRoots?.add(normalizePath(prefix + rel));
-    collectGitFiles(path.join(repoDir, rel), prefix + rel, files, embeddedRoots);
+    collectGitFiles(path.join(repoDir, rel), prefix + rel, files, embeddedRoots, includeIgnored);
   }
 }
 
@@ -598,7 +633,7 @@ function getGitVisibleFiles(rootDir: string): Set<string> | null {
 
     const files = new Set<string>();
     const embeddedRoots = new Set<string>();
-    collectGitFiles(rootDir, '', files, embeddedRoots);
+    collectGitFiles(rootDir, '', files, embeddedRoots, loadIncludeIgnoredMatcher(rootDir));
     // Apply built-in default ignores uniformly — to tracked files too, since
     // committing a dependency/build dir doesn't make it project code. A
     // `.gitignore` negation (e.g. `!vendor/`) is the explicit opt-in. (issue #407)
@@ -627,13 +662,15 @@ interface GitChanges {
  * Use `git status` to detect changed files instead of scanning every file.
  * Returns null on failure so callers fall back to full scan.
  *
- * Recurses into embedded repos — both the untracked kind (#193: the parent's
- * status collapses them to an opaque `?? subdir/` entry) and the gitignored
- * kind (#514: they never appear in the parent's status at all) — running
- * `git status` inside each, so changes in a multi-repo workspace sync without
- * a full rescan. Deleting an ENTIRE embedded repo dir is the one case this
- * cannot see (the child status that would report the deletions is gone with
- * it); a full `codegraph index` reconciles that.
+ * Recurses into embedded repos — the untracked kind (#193: the parent's status
+ * collapses them to an opaque `?? subdir/` entry) always, and the gitignored
+ * kind (#514: they never appear in the parent's status at all) only for
+ * directories opted in via `codegraph.json` `includeIgnored` (#622, #699) —
+ * running `git status` inside each, so changes in a multi-repo workspace sync
+ * without a full rescan. By default a gitignored dir is left alone, matching the
+ * full-index scan (#970, #976). Deleting an ENTIRE embedded repo dir is the one
+ * case this cannot see (the child status that would report the deletions is gone
+ * with it); a full `codegraph index` reconciles that.
  */
 function getGitChangedFiles(rootDir: string): GitChanges | null {
   try {
@@ -641,14 +678,14 @@ function getGitChangedFiles(rootDir: string): GitChanges | null {
     // Custom extension → language overrides from the project's codegraph.json,
     // so change detection sees the same custom-extension files the full index does.
     const overrides = loadExtensionOverrides(rootDir);
-    collectGitStatus(rootDir, '', changes, overrides);
+    collectGitStatus(rootDir, '', changes, overrides, loadIncludeIgnoredMatcher(rootDir));
     return changes;
   } catch {
     return null;
   }
 }
 
-function collectGitStatus(repoDir: string, prefix: string, out: GitChanges, overrides?: Record<string, Language>): void {
+function collectGitStatus(repoDir: string, prefix: string, out: GitChanges, overrides?: Record<string, Language>, includeIgnored: Ignore | null = null): void {
   const output = execFileSync(
     'git',
     ['status', '--porcelain', '--no-renames'],
@@ -705,14 +742,16 @@ function collectGitStatus(repoDir: string, prefix: string, out: GitChanges, over
   }
 
   // Recurse embedded repos found under untracked dirs (at the dir itself or
-  // nested deeper) and under this repo's gitignored dirs.
+  // nested deeper). Gitignored dirs are walked only for the directories the
+  // project opted in via `includeIgnored`; by default `.gitignore` is respected
+  // and they are left alone (#970, #976), mirroring the full-index scan.
   for (const rel of untrackedDirs) {
     for (const repoRel of findNestedGitRepos(path.join(repoDir, rel), rel)) {
-      collectGitStatus(path.join(repoDir, repoRel), prefix + repoRel, out, overrides);
+      collectGitStatus(path.join(repoDir, repoRel), prefix + repoRel, out, overrides, includeIgnored);
     }
   }
-  for (const rel of findIgnoredEmbeddedRepos(repoDir)) {
-    collectGitStatus(path.join(repoDir, rel), prefix + rel, out, overrides);
+  for (const rel of findIgnoredEmbeddedRepos(repoDir, includeIgnored, prefix)) {
+    collectGitStatus(path.join(repoDir, rel), prefix + rel, out, overrides, includeIgnored);
   }
 }
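
The opt-in gate that `findIgnoredEmbeddedRepos` applies can be sketched in isolation. This is a simplified illustration, not the code above: the real matcher is an `Ignore` instance with full gitignore semantics, whereas the hypothetical `makeIncludeMatcher` below treats each pattern as a plain directory prefix anchored at the scan root. `selectOptedInDirs` is likewise a made-up name.

```typescript
// Simplified stand-in for the `Ignore` matcher: prefix matching only.
type DirMatcher = (relDir: string) => boolean;

// Returns null when nothing is opted in, mirroring the zero-config default.
function makeIncludeMatcher(patterns: string[]): DirMatcher | null {
  if (patterns.length === 0) return null;
  // Strip a leading slash and force exactly one trailing slash: `/pkgs` -> `pkgs/`.
  const prefixes = patterns.map((p) => p.replace(/^\/+/, '').replace(/\/*$/, '/'));
  return (relDir) => {
    const dir = relDir.replace(/\/*$/, '/');
    return prefixes.some((prefix) => dir.startsWith(prefix));
  };
}

// Keep only the gitignored dirs that are opted in. `prefix` is the
// scan-root-relative path of the repo whose ignored dirs are being listed.
function selectOptedInDirs(
  ignoredDirs: string[],
  include: DirMatcher | null,
  prefix: string,
): string[] {
  if (!include) return []; // default: .gitignore is respected, nothing is walked
  return ignoredDirs.filter((dir) => include(prefix + dir));
}
```

Because the match runs against `prefix + dir`, an entry such as `services/` also covers a gitignored directory found inside a repo nested under `services/`, which is the "any recursion depth" behavior the doc comment describes.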
 

+ 104 - 32
src/project-config.ts

@@ -34,11 +34,25 @@ export const PROJECT_CONFIG_FILENAME = 'codegraph.json';
 export interface ProjectConfig {
   /** Map of custom file extension (`.foo`) to a supported language id. */
   extensions?: Record<string, string>;
+  /**
+   * Gitignore-style patterns naming gitignored directories whose embedded git
+   * repositories should be indexed anyway — the explicit opt-in to override
+   * `.gitignore` for nested-repo discovery (#622, #699). Absent/empty (the
+   * default) means `.gitignore` is fully respected: gitignored embedded repos
+   * are never discovered or indexed (#970, #976).
+   */
+  includeIgnored?: string[];
+}
+
+/** Parsed, validated view of a project's `codegraph.json`. */
+interface ParsedConfig {
+  extensions: Record<string, Language>;
+  includeIgnored: string[];
 }
 
 interface CacheEntry {
   mtimeMs: number;
-  overrides: Record<string, Language>;
+  config: ParsedConfig;
 }
 
 /**
@@ -47,11 +61,14 @@ interface CacheEntry {
  * `stat` while a single `codegraph.json` is in force. Keying by root keeps two
  * projects in the same process (the daemon / multi-project MCP server) isolated.
  */
-const overridesCache = new Map<string, Record<string, Language>>();
-const cacheMeta = new Map<string, CacheEntry>();
+const cache = new Map<string, CacheEntry>();
 
-/** Shared frozen empty map so the no-config path allocates nothing. */
-const EMPTY: Record<string, Language> = Object.freeze({});
+/** Shared frozen empties so the no-config path allocates nothing. */
+const EMPTY_EXTENSIONS: Record<string, Language> = Object.freeze({});
+const EMPTY_CONFIG: ParsedConfig = Object.freeze({
+  extensions: EMPTY_EXTENSIONS,
+  includeIgnored: Object.freeze([]) as unknown as string[],
+});
 
 /**
  * Normalize a user-provided extension key to the `.ext` lowercase form used by
@@ -74,16 +91,16 @@ function normalizeExtKey(raw: string): string | null {
 }
 
 /**
- * Parse and validate the `extensions` map out of a `codegraph.json` file.
- * Every failure mode degrades to "no overrides from this entry" — a bad file or
- * a typo'd language never throws.
+ * Read + JSON-parse a `codegraph.json` once and return its validated view.
+ * Every failure mode degrades to the zero-config default — a missing file, bad
+ * JSON, or a typo'd value never throws.
  */
-function parseExtensionOverrides(file: string): Record<string, Language> {
+function parseConfig(file: string): ParsedConfig {
   let raw: string;
   try {
     raw = fs.readFileSync(file, 'utf-8');
   } catch {
-    return EMPTY;
+    return EMPTY_CONFIG;
   }
 
   let parsed: unknown;
@@ -94,12 +111,24 @@ function parseExtensionOverrides(file: string): Record<string, Language> {
       file,
       error: err instanceof Error ? err.message : String(err),
     });
-    return EMPTY;
+    return EMPTY_CONFIG;
   }
 
-  if (!parsed || typeof parsed !== 'object') return EMPTY;
+  if (!parsed || typeof parsed !== 'object') return EMPTY_CONFIG;
+
+  const extensions = extractExtensions(parsed, file);
+  const includeIgnored = extractIncludeIgnored(parsed, file);
+  if (extensions === EMPTY_EXTENSIONS && includeIgnored.length === 0) return EMPTY_CONFIG;
+  return { extensions, includeIgnored };
+}
+
+/**
+ * Validate the `extensions` map. Every failure mode degrades to "no overrides
+ * from this entry" — a bad value or a typo'd language never throws.
+ */
+function extractExtensions(parsed: object, file: string): Record<string, Language> {
   const exts = (parsed as ProjectConfig).extensions;
-  if (!exts || typeof exts !== 'object' || Array.isArray(exts)) return EMPTY;
+  if (!exts || typeof exts !== 'object' || Array.isArray(exts)) return EMPTY_EXTENSIONS;
 
   const out: Record<string, Language> = {};
   for (const [rawKey, rawVal] of Object.entries(exts)) {
@@ -115,18 +144,40 @@ function parseExtensionOverrides(file: string): Record<string, Language> {
     out[key] = rawVal as Language;
   }
 
-  return Object.keys(out).length > 0 ? out : EMPTY;
+  return Object.keys(out).length > 0 ? out : EMPTY_EXTENSIONS;
 }
 
 /**
- * Load the validated extension overrides for a project, mtime-cached.
- *
- * Returns a map of `.ext` → supported language id. The result merges on top of
- * the built-in extension map at the point of use (see `detectLanguage` /
- * `isSourceFile`), with these user mappings taking precedence. Returns an empty
- * map when there is no `codegraph.json` (the zero-config default).
+ * Validate the `includeIgnored` patterns: an array of non-empty gitignore-style
+ * strings. A non-array value or a non-string/blank entry warns-and-skips; never
+ * throws. Patterns are kept verbatim (trimmed) so they match exactly as a
+ * `.gitignore` line would.
  */
-export function loadExtensionOverrides(rootDir: string): Record<string, Language> {
+function extractIncludeIgnored(parsed: object, file: string): string[] {
+  const raw = (parsed as ProjectConfig).includeIgnored;
+  if (raw === undefined) return [];
+  if (!Array.isArray(raw)) {
+    logWarn(`Ignoring "includeIgnored" in ${PROJECT_CONFIG_FILENAME}: must be an array of gitignore-style patterns`, { file });
+    return [];
+  }
+
+  const out: string[] = [];
+  for (const entry of raw) {
+    if (typeof entry !== 'string' || !entry.trim()) {
+      logWarn(`Ignoring an "includeIgnored" entry in ${PROJECT_CONFIG_FILENAME}: every pattern must be a non-empty string`, { file });
+      continue;
+    }
+    out.push(entry.trim());
+  }
+  return out;
+}
+
+/**
+ * Load the parsed `codegraph.json` for a project, mtime-cached. A missing or
+ * malformed file yields the zero-config default. One `stat` (and at most one
+ * read/parse) while a single config file is in force, shared across every field.
+ */
+function loadParsedConfig(rootDir: string): ParsedConfig {
   const file = path.join(rootDir, PROJECT_CONFIG_FILENAME);
 
   let mtimeMs: number;
@@ -134,22 +185,43 @@ export function loadExtensionOverrides(rootDir: string): Record<string, Language
     mtimeMs = fs.statSync(file).mtimeMs;
   } catch {
     // No config file — drop any stale cache entry and return the default.
-    cacheMeta.delete(rootDir);
-    overridesCache.delete(rootDir);
-    return EMPTY;
+    cache.delete(rootDir);
+    return EMPTY_CONFIG;
   }
 
-  const meta = cacheMeta.get(rootDir);
-  if (meta && meta.mtimeMs === mtimeMs) return meta.overrides;
+  const entry = cache.get(rootDir);
+  if (entry && entry.mtimeMs === mtimeMs) return entry.config;
+
+  const config = parseConfig(file);
+  cache.set(rootDir, { mtimeMs, config });
+  return config;
+}
 
-  const overrides = parseExtensionOverrides(file);
-  cacheMeta.set(rootDir, { mtimeMs, overrides });
-  overridesCache.set(rootDir, overrides);
-  return overrides;
+/**
+ * Load the validated extension overrides for a project, mtime-cached.
+ *
+ * Returns a map of `.ext` → supported language id. The result merges on top of
+ * the built-in extension map at the point of use (see `detectLanguage` /
+ * `isSourceFile`), with these user mappings taking precedence. Returns an empty
+ * map when there is no `codegraph.json` (the zero-config default).
+ */
+export function loadExtensionOverrides(rootDir: string): Record<string, Language> {
+  return loadParsedConfig(rootDir).extensions;
+}
+
+/**
+ * Load the validated `includeIgnored` patterns for a project, mtime-cached.
+ *
+ * These name gitignored directories whose embedded git repositories should be
+ * indexed despite `.gitignore` (#622, #699). An empty result — the zero-config
+ * default — means `.gitignore` is fully respected: gitignored embedded repos
+ * are never discovered or indexed (#970, #976).
+ */
+export function loadIncludeIgnoredPatterns(rootDir: string): string[] {
+  return loadParsedConfig(rootDir).includeIgnored;
 }
 
 /** Test/maintenance hook: forget cached config (e.g. after rewriting it in a test). */
 export function clearProjectConfigCache(): void {
-  cacheMeta.clear();
-  overridesCache.clear();
+  cache.clear();
 }
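
The mtime-keyed caching in `loadParsedConfig` can be illustrated with a small generic sketch. To keep it free of filesystem access, the stat and parse steps are injected as functions; `makeMtimeCachedLoader` and its parameters are hypothetical names, and the sketch keys the cache by file path rather than by project root as the code above does.

```typescript
interface MtimeCacheEntry<T> {
  mtimeMs: number;
  value: T;
}

// Builds a loader that re-parses only when the reported mtime changes.
// `statMtime` returns null when the file does not exist.
function makeMtimeCachedLoader<T>(
  statMtime: (file: string) => number | null,
  parse: (file: string) => T,
  fallback: T,
): (file: string) => T {
  const cache = new Map<string, MtimeCacheEntry<T>>();
  return (file) => {
    const mtimeMs = statMtime(file);
    if (mtimeMs === null) {
      // No config file: drop any stale entry and return the default.
      cache.delete(file);
      return fallback;
    }
    const hit = cache.get(file);
    if (hit && hit.mtimeMs === mtimeMs) return hit.value;

    const value = parse(file);
    cache.set(file, { mtimeMs, value });
    return value;
  };
}
```

Caching the whole parsed config in one entry is what lets several accessors share a single parse, as `loadExtensionOverrides` and `loadIncludeIgnoredPatterns` do through `loadParsedConfig`.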