fix(extraction): skip submodule worktrees instead of indexing them as duplicates (#945) (#947)

A worktree of a submodule points its `.git` into
`.git/modules/<module>/worktrees/<name>`, but `classifyGitDir` only matched
the top-level `.git/worktrees/` shape — so submodule worktrees fell through
to "embedded" and every symbol they shared with the real submodule checkout
got indexed twice (one report: ~28% of the index was duplicates, inflating
both query results and the DB). Broaden the worktree detector to allow the
optional `modules/<module>` segment. The submodule's own checkout
(`.git/modules/<module>`, no `worktrees/`) is unaffected and stays indexed as
distinct code.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 22 hours ago
parent
commit
2bdc169ce4
3 changed files with 46 additions and 4 deletions
  1. CHANGELOG.md (+ 1 - 0)
  2. __tests__/multi-repo-workspace.test.ts (+ 36 - 0)
  3. src/extraction/index.ts (+ 9 - 4)

+ 1 - 0
CHANGELOG.md

@@ -41,6 +41,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - A long-running MCP server now notices when a git worktree gains its own index. Before, if the server (or shared daemon) first saw a worktree before you ran `codegraph init` in it — so the lookup walked up to the main checkout's index — it pinned that decision for its whole life: even after the worktree had its own `.codegraph/`, every query kept hitting the main checkout's index and every result carried a false "this index belongs to a different git working tree" warning, until you restarted the server. The CLI got it right but the MCP server didn't, and re-indexing didn't help. The server now re-checks which index a path belongs to on each call, so the worktree's own index is picked up — and the stale warning drops — without a restart. (#926)
 - A long-running MCP server now recovers when your index is deleted and rebuilt at the same path. If `.codegraph/` was removed and recreated while the server held it open — most easily by recreating a git worktree at the same path, or `rm`-ing `.codegraph/` and running `codegraph init` again — the server kept reading the old, deleted database file and served a frozen snapshot: renamed or removed symbols still showed as live, new ones were missing, and `codegraph sync` couldn't refresh it — only restarting the server fixed it. The server now detects that the database file was swapped out from under it and reopens the live one in place, so results stay correct without a restart. (On Linux and macOS; Windows doesn't allow deleting an open file, so it isn't affected.) (#925)
 - The MCP server now opens and auto-syncs a project that lives inside a folder an enclosing git repository ignores. Before, if the directory you indexed sat within a larger repo that gitignored it, the shared MCP daemon failed to open the project — its log repeated `Failed to open project … path should be a` `path.relative()` `d string, but got "./"` — so the file watcher never started and the index silently went stale until you ran `codegraph sync` by hand (setting `CODEGRAPH_NO_DAEMON=1` was the only workaround). The daemon now opens the project and starts watching as expected. Most visible with Codex on Windows, but the cause wasn't platform-specific. (#936)
+- A git worktree of a submodule is no longer indexed as a duplicate copy of that submodule's code. CodeGraph already skips ordinary worktrees (a second working view of a repo it indexes), but a worktree created *from a submodule* — common in monorepos that check submodules out into worktrees for parallel feature work — was mistaken for a genuine embedded repo and swept in, duplicating every symbol it shared with the real submodule checkout (one report had ~28% of its index as duplicates, inflating both query results and the database). These submodule worktrees are now recognized and skipped, while the submodule's own checkout stays indexed as distinct code. Thanks @charlesxu2026-ship-it. (#945)
 
 
 ## [1.0.1] - 2026-06-13

+ 36 - 0
__tests__/multi-repo-workspace.test.ts

@@ -131,6 +131,42 @@ describe('multi-repo workspaces (#514)', () => {
     expect(files).toContain('vendored/lib.ts');
   });
 
+  it('skips a submodule worktree instead of indexing it as a duplicate (#945)', () => {
+    // A worktree OF A SUBMODULE points its `.git` into
+    // `.git/modules/<module>/worktrees/<name>` — not the top-level repo's
+    // `.git/worktrees/`. The detector used to miss that extra `modules/<name>`
+    // segment, so the worktree fell through to "embedded" and every symbol it
+    // shared with the real submodule checkout got indexed twice. The submodule's
+    // own checkout (`.git/modules/<module>`, no `worktrees/`) is distinct code
+    // and must stay indexed (#514).
+    const upstream = fs.mkdtempSync(path.join(os.tmpdir(), 'cg-945-up-'));
+    try {
+      // The repo that becomes the submodule's origin.
+      write(path.join(upstream, 'lib.ts'), 'export function libFn() { return 1; }\n');
+      makeRepo(upstream);
+
+      write(path.join(ws, 'src/app.ts'), 'export function app() { return 1; }\n');
+      write(path.join(ws, '.gitignore'), '.worktrees/\n');
+      git(ws, 'init', '-q');
+      // protocol.file.allow=always: modern git refuses a local-path submodule otherwise.
+      git(ws, '-c', 'protocol.file.allow=always', 'submodule', 'add', '-q', upstream, 'common');
+      git(ws, '-c', 'user.email=t@t', '-c', 'user.name=t', 'commit', '-qm', 'add submodule');
+
+      // A worktree of the submodule, under the gitignored .worktrees/ — its `.git`
+      // points into `.git/modules/common/worktrees/<name>`.
+      git(path.join(ws, 'common'), 'worktree', 'add', '-q', '../.worktrees/common-feature', '-b', 'feature');
+
+      const files = scanDirectory(ws);
+      expect(files).toContain('src/app.ts');
+      // The real submodule checkout is distinct code — still indexed (#514).
+      expect(files).toContain('common/lib.ts');
+      // The submodule worktree is a duplicate working view — never indexed (#945).
+      expect(files.some((f) => f.includes('.worktrees'))).toBe(false);
+    } finally {
+      fs.rmSync(upstream, { recursive: true, force: true });
+    }
+  });
+
   it('non-git workspace: walks children and respects each child own .gitignore', () => {
     write(path.join(ws, 'proj-a/src/auth.ts'), 'export function login() {}\n');
     write(path.join(ws, 'proj-a/build/out.ts'), 'export function generated() {}\n');

+ 9 - 4
src/extraction/index.ts

@@ -304,8 +304,10 @@ const EMBEDDED_REPO_SEARCH_ENTRIES = 2000;
  * - A `.git` **file** is a pointer (`gitdir: …`). A git **worktree** points into
  *   the host repo's own `.git/worktrees/<name>`, so it is a second working view
  *   of a repo CodeGraph already indexes — indexing it just duplicates the whole
- *   graph N times; skip it (#848). A **submodule** points into `.git/modules/`
- *   and is distinct code, so index it as before.
+ *   graph N times; skip it (#848). A **submodule worktree** points into
+ *   `.git/modules/<module>/worktrees/<name>` — same duplication, so skip it too
+ *   (#945). A **submodule** checkout points into `.git/modules/<module>` (no
+ *   `worktrees/` segment) and is distinct code, so index it as before.
  *
  * Returns `'none'` when there is no `.git` entry here.
  */
@@ -320,9 +322,12 @@ function classifyGitDir(absDir: string): 'embedded' | 'worktree' | 'none' {
   if (!st.isFile()) return 'none';
   try {
     const gitdir = fs.readFileSync(path.join(absDir, '.git'), 'utf8').match(/^gitdir:\s*(.+)$/m)?.[1]?.trim();
-    // A linked worktree's gitdir lives under some repo's `.git/worktrees/`.
+    // A worktree's gitdir lives under some repo's `.git/worktrees/<name>` —
+    // either the top-level repo's (`.git/worktrees/`) or, for a worktree of a
+    // submodule, that submodule's gitdir (`.git/modules/<module>/worktrees/`).
+    // The optional `modules/<module>` segment covers the submodule case (#945).
     // Match both separators so a Windows-style pointer is recognized too.
-    if (gitdir && /(^|[\\/])\.git[\\/]worktrees[\\/]/.test(gitdir)) return 'worktree';
+    if (gitdir && /(^|[\\/])\.git[\\/](modules[\\/][^\\/]+[\\/])?worktrees[\\/]/.test(gitdir)) return 'worktree';
   } catch {
     // Unreadable `.git` pointer — fall back to the prior "index it" behavior.
   }