Kaynağa Gözat

fix(extraction): index files reached through in-root symlinks that point outside the repo (#935) (#956)

The directory walk deliberately follows an in-root symlink whose target
lives outside the repo root (the standard Dota custom-game layout, where
`game/` and `content/` link into the SDK tree) and enumerates the files
under it. But the read path then rejected every one of them via the
strict symlink-escape guard, logging `Path traversal blocked in batch
reader` and indexing nothing — discovery and the reader disagreed.

Add an opt-in `allowSymlinkEscape` to validatePathWithinRoot that waives
only the realpath-escape rejection (the lexical `../` guard still
applies) and pass it at the three indexing read sites (batch reader,
indexFile, indexFileWithContent). The content-serving sinks
(ContextBuilder, MCP tools) keep the strict guard, so this stays inside
the #527 model: indexing now follows the symlink, getCode still refuses
to serve out-of-root contents.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 13 saat önce
ebeveyn
işleme
6459ead6aa
4 değiştirilmiş dosya ile 79 ekleme ve 6 silme
  1. 1 0
      CHANGELOG.md
  2. 46 0
      __tests__/security.test.ts
  3. 8 4
      src/extraction/index.ts
  4. 24 2
      src/utils.ts

+ 1 - 0
CHANGELOG.md

@@ -49,6 +49,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - A git worktree of a submodule is no longer indexed as a duplicate copy of that submodule's code. CodeGraph already skips ordinary worktrees (a second working view of a repo it indexes), but a worktree created *from a submodule* — common in monorepos that check submodules out into worktrees for parallel feature work — was mistaken for a genuine embedded repo and swept in, duplicating every symbol it shared with the real submodule checkout (one report had ~28% of its index as duplicates, inflating both query results and the database). These submodule worktrees are now recognized and skipped, while the submodule's own checkout stays indexed as distinct code. Thanks @charlesxu2026-ship-it. (#945)
 - A C++ class or struct annotated with an export/visibility macro — `class MYLIB_EXPORT Foo : public Bar { … }`, the common DLL-export / visibility pattern in headers — is no longer mis-indexed as a function spanning the whole class body. Not knowing the macro is a macro, the parser read it as a type and the whole declaration as a function, so the class surfaced as a phantom `function` that showed up as a false caller in `codegraph callers`, `codegraph impact`, and blast-radius analysis, and skewed symbol counts. CodeGraph now recognizes this misparse and drops the bogus node. Thanks @spwlyzx. (#946)
 - `codegraph status` no longer reports phantom pending changes for files CodeGraph deliberately keeps out of the index — a tracked file inside a committed dependency dir (a checked-in `vendor/` or `node_modules/`), or a tracked file under a `.gitignore`d directory. A full index correctly excludes these, and `codegraph sync` never indexes them, but the fast change-detector behind `codegraph status` flagged every edit to such a file as "modified" (and a new one as "added") — so the pending-changes count never cleared no matter how many times you synced. Change detection now applies the same ignore rules the full index does, so `status` agrees with `sync`, and tools built on CodeGraph's change-detection API get the same corrected list. (#766)
+- Files reached through a symlinked directory that points outside your project now index instead of being silently skipped. When a folder in your repo is a symlink to a location outside the repo root — the standard layout for Dota 2 custom games and similar SDK-linked projects, where `game/` and `content/` link into the engine tree — CodeGraph followed the symlink to discover those files but then blocked every one of them while reading, logging `Path traversal blocked in batch reader` and indexing nothing under them. The reader now agrees with the directory scan and indexes those files. The guard against `../` path escapes is unchanged, and the protection that keeps your agent from being served file contents from outside the repo is also unchanged — only the indexer's own read path follows these in-repo symlinks. (#935)
 
 
 ## [1.0.1] - 2026-06-13

+ 46 - 0
__tests__/security.test.ts

@@ -232,6 +232,26 @@ describe('Symlink escape prevention (#527)', () => {
     expect(validatePathWithinRoot(root, 'src/inlink.ts')).not.toBeNull();
   });
 
+  // The INDEXING read path opts into following in-root symlinks the directory
+  // walk already descended into — discovery and the reader must agree, or files
+  // reached via an in-root symlink-to-outside fail to index (#935). The lexical
+  // `../` guard is NOT waived, and content-serving sinks never pass the flag.
+  it('allowSymlinkEscape follows an in-repo symlink to an out-of-root FILE (indexing read)', () => {
+    if (!link(path.join(root, 'escape'), path.join(outside, 'pkg', 'secret.txt'))) return;
+    expect(validatePathWithinRoot(root, 'escape', { allowSymlinkEscape: true })).not.toBeNull();
+  });
+
+  it('allowSymlinkEscape follows a path through an in-repo out-of-root DIR symlink (indexing read)', () => {
+    if (!link(path.join(root, 'escapedir'), path.join(outside, 'pkg'))) return;
+    expect(validatePathWithinRoot(root, 'escapedir/secret.txt', { allowSymlinkEscape: true })).not.toBeNull();
+  });
+
+  it('allowSymlinkEscape STILL rejects a lexical ../ traversal (guard not waived)', () => {
+    expect(
+      validatePathWithinRoot(root, `../${path.basename(outside)}/pkg/secret.txt`, { allowSymlinkEscape: true })
+    ).toBeNull();
+  });
+
   it('end-to-end: getCode never serves an out-of-root file reached via a dir symlink', async () => {
     fs.writeFileSync(path.join(outside, 'pkg', 'leak.ts'),
       'export function leaked() { return "LEAKED-ZZZ-9"; }\n');
@@ -250,6 +270,32 @@ describe('Symlink escape prevention (#527)', () => {
       cg.close();
     }
   });
+
+  it('end-to-end (#935): indexes source reached through an in-root dir symlink to outside', async () => {
+    // The Dota custom-game layout symlinks `game/` and `content/` into an SDK
+    // tree outside the repo. Before #935 the batch reader's strict symlink-escape
+    // guard blocked every file under them, so nothing indexed — even though the
+    // directory walk deliberately followed the symlink to enumerate them. The
+    // reader now agrees with discovery: the file indexes.
+    fs.writeFileSync(path.join(outside, 'pkg', 'vendored.ts'),
+      'export function vendoredHelper() { return "LEAKED-ZZZ-9"; }\n');
+    if (!link(path.join(root, 'game'), path.join(outside, 'pkg'))) return;
+
+    const cg = CodeGraph.initSync(root, { config: { include: ['**/*.ts'], exclude: [] } });
+    try {
+      await cg.indexAll();
+      // The symlinked-in file is now part of the graph...
+      const names = cg.getNodesByKind('function').map((n) => n.name);
+      expect(names).toContain('vendoredHelper');
+      // ...but its out-of-root contents are STILL never served (the #527 sink
+      // stays strict — indexing relaxes only the read path, not getCode).
+      for (const n of cg.getNodesByKind('function')) {
+        expect((await cg.getCode(n.id)) ?? '').not.toContain('LEAKED-ZZZ-9');
+      }
+    } finally {
+      cg.close();
+    }
+  });
 });
 
 describe('validateProjectPath — sensitive directory blocking', () => {

+ 8 - 4
src/extraction/index.ts

@@ -1245,7 +1245,10 @@ export class ExtractionOrchestrator {
       const fileContents = await Promise.all(
         batch.map(async (fp) => {
           try {
-            const fullPath = validatePathWithinRoot(this.rootDir, fp);
+            // Indexing read: follow in-root symlinks the directory walk already
+            // descended into (the `../` guard still applies) so files reached
+            // via an in-root symlink-to-outside still index (#935).
+            const fullPath = validatePathWithinRoot(this.rootDir, fp, { allowSymlinkEscape: true });
             if (!fullPath) {
               logWarn('Path traversal blocked in batch reader', { filePath: fp });
               return { filePath: fp, content: null as string | null, stats: null as fs.Stats | null, error: new Error('Path traversal blocked') };
@@ -1551,7 +1554,8 @@ export class ExtractionOrchestrator {
    * Index a single file
    */
   async indexFile(relativePath: string): Promise<ExtractionResult> {
-    const fullPath = validatePathWithinRoot(this.rootDir, relativePath);
+    // Indexing read: follow in-root symlinks (the `../` guard still applies), #935.
+    const fullPath = validatePathWithinRoot(this.rootDir, relativePath, { allowSymlinkEscape: true });
 
     if (!fullPath) {
       return {
@@ -1598,8 +1602,8 @@ export class ExtractionOrchestrator {
     content: string,
     stats: fs.Stats
   ): Promise<ExtractionResult> {
-    // Prevent path traversal
-    const fullPath = validatePathWithinRoot(this.rootDir, relativePath);
+    // Prevent `../` traversal; follow in-root symlinks like the directory walk (#935).
+    const fullPath = validatePathWithinRoot(this.rootDir, relativePath, { allowSymlinkEscape: true });
     if (!fullPath) {
       logWarn('Path traversal blocked in indexFileWithContent', { relativePath });
       return {

+ 24 - 2
src/utils.ts

@@ -91,25 +91,47 @@ function isWithinDir(child: string, parent: string): boolean {
  * (codegraph_node `includeCode`, codegraph_explore source) go through here, so
  * this is the chokepoint that keeps out-of-root file contents from leaking.
  *
+ * `allowSymlinkEscape` waives **only** the realpath-escape rejection (the
+ * lexical `../` guard still applies) for the INDEXING read path. The directory
+ * walk deliberately descends into in-root symlinks whose targets live outside
+ * the root (e.g. a `game/` symlink in a Dota custom-game tree, #935); discovery
+ * and the reader must agree, or every file the walk enumerated fails to index.
+ * Indexing only reads paths it just discovered, into a local index — it never
+ * serves them to an agent, so this does not widen the #527 leak surface. The
+ * content-serving sinks must never pass this flag.
+ *
  * @param projectRoot - The project root directory
  * @param filePath - The (relative or absolute) file path to validate
+ * @param options.allowSymlinkEscape - Follow in-root symlinks out of the root
+ *   (indexing read path only); defaults to the strict, leak-safe behavior.
  * @returns The resolved absolute path (realpath when it exists), or null if it
  *   escapes the root
  */
-export function validatePathWithinRoot(projectRoot: string, filePath: string): string | null {
+export function validatePathWithinRoot(
+  projectRoot: string,
+  filePath: string,
+  options?: { allowSymlinkEscape?: boolean }
+): string | null {
   const resolved = path.resolve(projectRoot, filePath);
   const normalizedRoot = path.resolve(projectRoot);
 
-  // 1. Lexical containment — cheap, catches `../` traversal.
+  // 1. Lexical containment — cheap, catches `../` traversal. Applies even on
+  //    the indexing read path: a crafted `../` escape is still rejected.
   if (!isWithinDir(resolved, normalizedRoot)) {
     return null;
   }
 
   // 2. Symlink-aware containment — resolve symlinks on both sides and re-check,
   //    so an in-repo symlink whose real target escapes the root is rejected.
+  //    The indexing read path (allowSymlinkEscape) skips only this rejection so
+  //    it stays consistent with the directory walk, which already followed the
+  //    in-root symlink to enumerate these files (#935).
   try {
     const realRoot = fs.realpathSync(normalizedRoot);
     const realResolved = fs.realpathSync(resolved);
+    if (options?.allowSymlinkEscape) {
+      return realResolved;
+    }
     return isWithinDir(realResolved, realRoot) ? realResolved : null;
   } catch (err) {
     // ENOENT: the path doesn't exist yet (a file about to be written, or an