
fix(extraction): drop the phantom C++ function from a macro-annotated class misparse (#946) (#948)

A C++ class/struct annotated with an export/visibility macro —
`class MYLIB_EXPORT Foo : public Bar { … }` — makes tree-sitter read
`class MYLIB_EXPORT` as an elaborated type specifier and the whole declaration
as a `function_definition` named after the class, spanning the entire body. That
phantom `function` polluted callers/impact/blast-radius and skewed kind stats.

Detect the misparse structurally in cppExtractor.isMisparsedFunction — a
function_definition whose `type` field is a *bodyless* class/struct specifier
(the elaborated-type macro) and whose declarator is not a function_declarator —
and drop the bogus node, matching how macro-prefixed C prototypes are already
handled. The body is mangled by the same misparse and is unrecoverable. The check
is precise enough to leave genuine code alone: `struct P { int x; } makeP() {}` (real
inline-defined return type, has a field list) and `class Foo f() {}` (elaborated
return type on a real function, has a function_declarator) are untouched. The
leading macro alone triggers the misparse; a base clause is not required.
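
A minimal sketch of the structural check described above, run against hand-built stand-ins rather than a real tree-sitter parse. `MockNode`, `node`, and `looksLikeMacroMisparse` are hypothetical names for illustration only, and the assumption that the misparsed declarator is a plain `identifier` node is mine, not something the commit states:

```typescript
// Hypothetical stand-in for a tree-sitter SyntaxNode: only the parts the check reads.
interface MockNode {
  type: string;
  namedChildren: MockNode[];
  fields: Record<string, MockNode | undefined>;
}

// Small builder so the three example trees below stay readable.
const node = (
  type: string,
  fields: Record<string, MockNode | undefined> = {},
  namedChildren: MockNode[] = [],
): MockNode => ({ type, namedChildren, fields });

// The same three-step test the commit message describes, over the mock shape.
function looksLikeMacroMisparse(fn: MockNode): boolean {
  const typeNode = fn.fields['type'];
  if (!typeNode) return false;
  if (typeNode.type !== 'class_specifier' && typeNode.type !== 'struct_specifier') return false;
  // A field list means a real inline-defined return type, not an elaborated type.
  if (typeNode.namedChildren.some((c) => c.type === 'field_declaration_list')) return false;
  // A genuine function definition carries a function_declarator.
  return fn.fields['declarator']?.type !== 'function_declarator';
}

// `class MYLIB_EXPORT Foo { … }`: bodyless class_specifier, declarator assumed to be an identifier.
const misparsed = node('function_definition', {
  type: node('class_specifier'),
  declarator: node('identifier'),
});

// `struct P { int x; } makeP() {}`: the struct specifier has a field list.
const inlineReturnType = node('function_definition', {
  type: node('struct_specifier', {}, [node('field_declaration_list')]),
  declarator: node('function_declarator'),
});

// `class Foo f() {}`: elaborated return type on a real function.
const elaboratedReturnType = node('function_definition', {
  type: node('class_specifier'),
  declarator: node('function_declarator'),
});

const results = [misparsed, inlineReturnType, elaboratedReturnType].map(looksLikeMacroMisparse);
console.log(results); // [ true, false, false ]
```

Only the first shape is flagged; each of the two genuine definitions is cleared by a different one of the two signals.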

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry, 22 hours ago
Parent
Commit
f63e5db2cc
3 changed files with 97 additions and 2 deletions
  1. CHANGELOG.md (+1 -0)
  2. __tests__/extraction.test.ts (+57 -0)
  3. src/extraction/languages/c-cpp.ts (+39 -2)

+ 1 - 0
CHANGELOG.md

@@ -42,6 +42,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - A long-running MCP server now recovers when your index is deleted and rebuilt at the same path. If `.codegraph/` was removed and recreated while the server held it open — most easily by recreating a git worktree at the same path, or `rm`-ing `.codegraph/` and running `codegraph init` again — the server kept reading the old, deleted database file and served a frozen snapshot: renamed or removed symbols still showed as live, new ones were missing, and `codegraph sync` couldn't refresh it — only restarting the server fixed it. The server now detects that the database file was swapped out from under it and reopens the live one in place, so results stay correct without a restart. (On Linux and macOS; Windows doesn't allow deleting an open file, so it isn't affected.) (#925)
 - The MCP server now opens and auto-syncs a project that lives inside a folder an enclosing git repository ignores. Before, if the directory you indexed sat within a larger repo that gitignored it, the shared MCP daemon failed to open the project — its log repeated ``Failed to open project … path should be a `path.relative()`d string, but got "./"`` — so the file watcher never started and the index silently went stale until you ran `codegraph sync` by hand (setting `CODEGRAPH_NO_DAEMON=1` was the only workaround). The daemon now opens the project and starts watching as expected. Most visible with Codex on Windows, but the cause wasn't platform-specific. (#936)
 - A git worktree of a submodule is no longer indexed as a duplicate copy of that submodule's code. CodeGraph already skips ordinary worktrees (a second working view of a repo it indexes), but a worktree created *from a submodule* — common in monorepos that check submodules out into worktrees for parallel feature work — was mistaken for a genuine embedded repo and swept in, duplicating every symbol it shared with the real submodule checkout (one report had ~28% of its index as duplicates, inflating both query results and the database). These submodule worktrees are now recognized and skipped, while the submodule's own checkout stays indexed as distinct code. Thanks @charlesxu2026-ship-it. (#945)
+- A C++ class or struct annotated with an export/visibility macro — `class MYLIB_EXPORT Foo : public Bar { … }`, the common DLL-export / visibility pattern in headers — is no longer mis-indexed as a function spanning the whole class body. Not knowing the macro is a macro, the parser read it as a type and the whole declaration as a function, so the class surfaced as a phantom `function` that showed up as a false caller in `codegraph callers`, `codegraph impact`, and blast-radius analysis, and skewed symbol counts. CodeGraph now recognizes this misparse and drops the bogus node. Thanks @spwlyzx. (#946)
 
 
 ## [1.0.1] - 2026-06-13

+ 57 - 0
__tests__/extraction.test.ts

@@ -2582,6 +2582,63 @@ std::unique_ptr<Widget> makeWidget() { return nullptr; }
     });
   });
 
+  describe('C++ macro-prefixed class/struct misparse (#946)', () => {
+    // An export/visibility macro before the class name plus a base clause
+    // (`class MACRO Name : public Base { … }`) makes tree-sitter read `class
+    // MACRO` as an elaborated type and the whole declaration as a
+    // function_definition named after the class, spanning the entire body — a
+    // phantom `function` that polluted callers/impact/blast-radius. It's dropped.
+    it('does not mint a phantom function for a macro-annotated class that inherits', () => {
+      const code = `#pragma once
+#define MAPCORE_EXPORT __attribute__((visibility("default")))
+
+class DataProvider {
+public:
+    virtual bool Request(void* param) = 0;
+};
+
+class MAPCORE_EXPORT LocalDataProvider : public DataProvider
+{
+public:
+    LocalDataProvider(int dataType);
+    virtual bool Request(void* param) override;
+};
+`;
+      // A header rich in C++ (class / public: / virtual) detects as C++ — the
+      // issue's exact scenario (a `.h` file). Guard it so a detection regression
+      // can't make this test pass for the wrong reason.
+      expect(detectLanguage('provider.h', code)).toBe('cpp');
+      const result = extractFromSource('provider.h', code);
+
+      // The misparse used to surface as `function | LocalDataProvider` spanning
+      // the whole class body — a false caller in the graph. It's gone now.
+      expect(
+        result.nodes.find((n) => n.name === 'LocalDataProvider' && n.kind === 'function')
+      ).toBeUndefined();
+
+      // The sibling class without the macro is unaffected — still a class.
+      expect(result.nodes.find((n) => n.name === 'DataProvider')?.kind).toBe('class');
+    });
+
+    it('drops the struct variant too, without dropping a genuine class', () => {
+      const code = `
+#define API __declspec(dllexport)
+struct API Widget : public Base { int x; };
+class Plain : public Base { public: int y; };
+`;
+      const result = extractFromSource('widget.cpp', code);
+
+      // `struct MACRO Name : Base { … }` misparses the same way — no phantom function.
+      expect(
+        result.nodes.find((n) => n.name === 'Widget' && n.kind === 'function')
+      ).toBeUndefined();
+
+      // A normal class with a base clause and no macro must still be a class — the
+      // drop is precise, not a blanket "class with inheritance" filter.
+      expect(result.nodes.find((n) => n.name === 'Plain')?.kind).toBe('class');
+    });
+  });
+
   describe('C/C++ imports', () => {
     it('should extract system include', () => {
       const code = `#include <iostream>`;

+ 39 - 2
src/extraction/languages/c-cpp.ts

@@ -148,6 +148,40 @@ export const cExtractor: LanguageExtractor = {
   },
 };
 
+/**
+ * Detect tree-sitter's misparse of a macro-annotated class/struct, e.g.
+ * `class MACRO Name { … }` or `class MACRO Name : public Base { … }` (#946).
+ * Not knowing `MACRO` is a macro, tree-sitter reads `class MACRO` as an
+ * *elaborated type specifier* (a bodyless `class_specifier`/`struct_specifier`
+ * whose "type name" is the macro) and the rest as a function: `Name` becomes the
+ * declarator and the `{ … }` a function body — so the whole declaration surfaces
+ * as a `function_definition` named after the class, with a line range spanning
+ * the entire class body. (A base clause, when present, additionally lands in an
+ * `ERROR` node, but it isn't required — the leading macro alone triggers this.)
+ *
+ * Two structural signals pin it down with no risk to genuine code:
+ *  - the `type` field is a *bodyless* class/struct specifier — an elaborated
+ *    type, not a real inline-defined return type like
+ *    `struct P { int x; } makeP() { … }` (which carries a field list); and
+ *  - the declarator is not a `function_declarator` — a real function definition
+ *    always has one, which also leaves the legal-but-rare `class Foo f() { … }`
+ *    (an elaborated return type on a genuine function) alone.
+ *
+ * The class body is mangled by the same misparse and is unrecoverable, so —
+ * matching how macro-prefixed C prototypes are handled — we drop the spurious
+ * node rather than mint a misleading whole-body `function` that pollutes
+ * callers/impact and skews kind statistics.
+ */
+function isMacroMisparsedTypeDecl(node: SyntaxNode): boolean {
+  const typeNode = getChildByField(node, 'type');
+  if (!typeNode) return false;
+  if (typeNode.type !== 'class_specifier' && typeNode.type !== 'struct_specifier') return false;
+  if (typeNode.namedChildren.some((c: SyntaxNode) => c.type === 'field_declaration_list')) return false;
+  const declarator = getChildByField(node, 'declarator');
+  if (declarator && declarator.type === 'function_declarator') return false;
+  return true;
+}
+
 export const cppExtractor: LanguageExtractor = {
   functionTypes: ['function_definition'],
   classTypes: ['class_specifier'],
@@ -192,14 +226,17 @@ export const cppExtractor: LanguageExtractor = {
     }
     return undefined;
   },
-  isMisparsedFunction: (name) => {
+  isMisparsedFunction: (name, node) => {
     // C++ macros like NLOHMANN_JSON_NAMESPACE_BEGIN cause tree-sitter to misparse
     // namespace blocks as function_definitions (e.g. name = "namespace detail").
     // Also filter C++ keywords that tree-sitter occasionally misinterprets as
     // function/method names (e.g. switch statements inside macro-confused scopes).
     if (name.startsWith('namespace')) return true;
     const cppKeywords = ['switch', 'if', 'for', 'while', 'do', 'case', 'return'];
-    return cppKeywords.includes(name);
+    if (cppKeywords.includes(name)) return true;
+    // `class MACRO Name : public Base { … }` misparses to a function_definition
+    // named after the class — drop that phantom (#946).
+    return isMacroMisparsedTypeDecl(node);
   },
   extractImport: (node, source) => {
     const importText = source.substring(node.startIndex, node.endIndex).trim();