
feat(extraction): universal recovery of macro-mangled C/C++ function names (#1102)

* feat(extraction): universal recovery of macro-mangled C/C++ function names

The curated inline-macro blank list (#1100/#1101) can't enumerate every
library's macro. Add a universal post-parse net so a function is findable by
name regardless of which macro decorates it, plus a batch of common libraries
to the curated list for full name+return-type recovery.

- recoverMangledCppName: after extraction, recover the real identifier from a
  name still mangled by an un-blanked macro (`MACRO Ret name(…)` misparses to
  the name "Ret name"). It is wired in through a new `recoverMangledName`
  extractor hook, set only on the C and C++ extractors and applied to every
  name they produce. Safe by construction: it only touches an already-mangled
  name (one with an internal space that isn't a legit `operator …` or
  destructor name), so a clean name is returned unchanged. It is also guarded
  against the `Ret (name)` parenthesized-name idiom and against picking a bare
  primitive. Scoped to C/C++ so Kotlin/Scala backtick identifiers (which
  legitimately contain spaces) are never touched.
- Curated list extended past UE/pugixml/Godot/Boost to Qt (Q_INVOKABLE, …),
  Folly, Abseil, LLVM, V8, Eigen, and rapidjson.
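
For illustration, the recovery rule described in the first bullet can be exercised on its own. The sketch below restates `recoverMangledCppName` and its primitive set as they appear in the `src/extraction/languages/c-cpp.ts` hunk of this commit; the inputs are taken from the tests added in `__tests__/extraction.test.ts`. It is a standalone copy for reading, not the wired-in extractor.

```typescript
// Standalone restatement of the function from this commit's c-cpp.ts hunk.
const CPP_PRIMITIVE_NAMES = new Set<string>([
  'bool', 'void', 'int', 'char', 'short', 'long', 'float', 'double', 'unsigned',
  'signed', 'wchar_t', 'char8_t', 'char16_t', 'char32_t', 'char_t', 'size_t',
  'auto', 'const', 'struct', 'class', 'enum', 'union', 'typename',
]);

function recoverMangledCppName(name: string): string {
  // Clean names (no whitespace), operators, and destructors pass through.
  if (!/\s/.test(name) || name.startsWith('operator') || name.startsWith('~')) return name;
  // `Ret (name)` parenthesized-name idiom: ambiguous, so leave it alone.
  if (/^\S+\s+\([A-Za-z_]\w*\)/.test(name)) return name;
  // Take the last token before the parameter list (or the last token overall).
  const beforeParams = name.includes('(') ? name.slice(0, name.indexOf('(')) : name;
  const tokens = beforeParams.trim().split(/\s+/);
  const candidate = tokens[tokens.length - 1];
  // Reject non-identifiers (e.g. `*`) and bare primitives/keywords.
  if (!candidate || !/^[A-Za-z_]\w*$/.test(candidate) || CPP_PRIMITIVE_NAMES.has(candidate)) return name;
  return candidate;
}

console.log(recoverMangledCppName('WTFString computeThing'));   // computeThing
console.log(recoverMangledCppName('char_t* to_str(double v)')); // to_str
console.log(recoverMangledCppName('bool (likely)'));            // bool (likely)
console.log(recoverMangledCppName('QDockWidget *'));            // QDockWidget *
```

The first two inputs are mangled and get reduced to the identifier; the last two hit the guards and come back unchanged.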

Validated on CARLA (large UE project, 1131 C++/h files) against the pre-fix
baseline: function-name mangles 440 -> 6, 431 fixed, and, critically, 0
regressions. The salvage also recovers names that would otherwise be re-mangled
by the non-local error-recovery shifts the pre-parse itself causes, which
erases the 7 shifts seen in #1101. The 6 residual mangles are all the
moodycamel `Ret (name)` idiom, correctly left alone. On a made-up macro with no
list entry (`WEBKIT_EXPORT WTFString compute()`), the name `compute` is still
recovered. Full suite green; eleven regression/safety tests added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(changelog): note universal C++ macro-mangled name recovery (#1102)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 15 hours ago
commit
cb20a3bf7f

+ 1 - 0
CHANGELOG.md

@@ -15,6 +15,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - C++ methods that return a reference, and user-defined conversion operators, are now indexed under their correct names. An inline getter like `const FGameplayTagContainer& GetActiveTags() const` — everywhere in Unreal Engine headers — was indexed as `& GetActiveTags() const` instead of `GetActiveTags`, and a conversion operator like `operator EALSMovementState() const` kept its trailing `() const` instead of reading `operator EALSMovementState`. In both cases the garbled name meant you couldn't find the symbol by name and its callers weren't linked. Both now read cleanly, matching how pointer-returning and value-returning methods already worked. (#1096)
 - C++ functions written with an inline-specifier macro before the return type are now indexed correctly. In Unreal Engine, inline helpers are commonly written `FORCEINLINE FString GetEnumerationToString(...)`; the `FORCEINLINE` macro made the parser read the return type as part of the function's name (`FString GetEnumerationToString` instead of `GetEnumerationToString`) and lose the real return type, so the function couldn't be found by name and its callers weren't linked. CodeGraph now recognizes the standard Unreal inline macros (`FORCEINLINE`, `FORCENOINLINE`, `FORCEINLINE_DEBUGGABLE`), so both the name and the return type are captured. (#1100)
 - The same function-name recovery now covers inline macros from common third-party C++ libraries, not just Unreal Engine — including pugixml (`PUGI__FN`, `PUGIXML_FUNCTION`), Godot (`_FORCE_INLINE_`), Boost (`BOOST_FORCEINLINE`), and generic `ALWAYS_INLINE` / `FORCE_INLINE`. Functions decorated with these are now indexed under their real names. On a large Unreal project vendoring these libraries this cleaned up the large majority of remaining function-name garbling. (#1101)
+- C++ function names are now recovered even when decorated with a macro CodeGraph doesn't specifically know about. A function written `SOME_LIBRARY_MACRO ReturnType doWork(...)` previously had the macro or return type absorbed into its name whenever the macro wasn't one CodeGraph recognized; now the real name (`doWork`) is recovered regardless of the macro, so it's findable and its callers link — no per-library configuration needed. The recognized-macro list was also broadened (Qt, Folly, Abseil, LLVM, V8, Eigen, rapidjson) so those additionally capture the return type. This only ever cleans up an already-garbled name and is limited to C and C++, so ordinary names — and languages like Kotlin and Scala where identifiers can legitimately contain spaces — are unaffected. (#1102)
 
 
 ## [1.1.6] - 2026-06-30

+ 56 - 1
__tests__/extraction.test.ts

@@ -11,7 +11,7 @@ import * as os from 'os';
 import { CodeGraph } from '../src';
 import { extractFromSource, scanDirectory, buildDefaultIgnore, discoverEmbeddedRepoRoots, buildScopeIgnore } from '../src/extraction';
 import { detectLanguage, isLanguageSupported, getSupportedLanguages, initGrammars, loadAllGrammars, isSourceFile } from '../src/extraction/grammars';
-import { stripCppTemplateArgs, blankCppExportMacros, blankCppInlineMacros } from '../src/extraction/languages/c-cpp';
+import { stripCppTemplateArgs, blankCppExportMacros, blankCppInlineMacros, recoverMangledCppName } from '../src/extraction/languages/c-cpp';
 import { normalizePath } from '../src/utils';
 
 beforeAll(async () => {
@@ -2995,6 +2995,61 @@ class APXCharacter {  // the one real definition
     });
   });
 
+  describe('C++ universal macro-mangled name recovery', () => {
+    // Curated pre-parse blanking can't list every library's inline macro, so a
+    // post-parse salvage recovers the real function name from ANY leftover
+    // `MACRO Ret name(…)` mangle — no list needed. It only ever touches an
+    // already-mangled name, so it can't corrupt a clean one.
+    const namesOf = (code: string, file = 's.cpp') =>
+      extractFromSource(file, code).nodes
+        .filter((n) => n.kind === 'method' || n.kind === 'function')
+        .map((n) => n.name);
+
+    it('recovers the name from a completely unknown macro (no list entry)', () => {
+      expect(namesOf('WEBKIT_EXPORT WTFString computeThing(int x) { return H(x); }')).toContain('computeThing');
+      expect(namesOf('SOMELIB_INLINE MyResult doWork(int x) { return H(x); }')).toContain('doWork');
+      expect(namesOf('MZ_FORCEINLINE char_t* to_str(double v) { return H(v); }')).toContain('to_str');
+    });
+
+    it('recoverMangledCppName only touches already-mangled names, with guards', () => {
+      // Recovered:
+      expect(recoverMangledCppName('WTFString computeThing')).toBe('computeThing');
+      expect(recoverMangledCppName('char_t* to_str(double v)')).toBe('to_str');
+      expect(recoverMangledCppName('unspecified_bool_type() const')).toBe('unspecified_bool_type');
+      // Left unchanged — clean names, operators, destructors, the `Ret (name)`
+      // idiom, and non-identifier tails:
+      expect(recoverMangledCppName('computeThing')).toBe('computeThing');
+      expect(recoverMangledCppName('operator EALSMovementState')).toBe('operator EALSMovementState');
+      expect(recoverMangledCppName('~Widget')).toBe('~Widget');
+      expect(recoverMangledCppName('bool (likely)')).toBe('bool (likely)');
+      expect(recoverMangledCppName('void (free)')).toBe('void (free)');
+      expect(recoverMangledCppName('QDockWidget *')).toBe('QDockWidget *');
+    });
+
+    it('does not disturb clean C++ names or non-C++ (Kotlin backtick) names', () => {
+      expect(namesOf('int foo(int x) { return x; }')).toEqual(['foo']);
+      // Kotlin backtick identifiers legitimately contain spaces; the salvage is
+      // C/C++-only, so they are untouched.
+      const kt = extractFromSource('T.kt', 'class T {\n  fun `decode simple cert`() { }\n}').nodes
+        .filter((n) => n.kind === 'method' || n.kind === 'function')
+        .map((n) => n.name);
+      expect(kt).toContain('`decode simple cert`');
+    });
+
+    it('curated list now also covers Qt / Folly / Abseil / LLVM / V8 / Eigen / rapidjson (full recovery)', () => {
+      const info = (c: string) =>
+        extractFromSource('x.cpp', c).nodes
+          .filter((n) => n.kind === 'method' || n.kind === 'function')
+          .map((n) => ({ name: n.name, ret: n.returnType }));
+      expect(info('FOLLY_ALWAYS_INLINE Str f(int x) { return H(x); }')).toEqual([{ name: 'f', ret: 'Str' }]);
+      expect(namesOf('Q_INVOKABLE void onClicked() { H(); }')).toContain('onClicked');
+      expect(namesOf('ABSL_ATTRIBUTE_ALWAYS_INLINE int hash(int x) { return H(x); }')).toContain('hash');
+      expect(namesOf('EIGEN_STRONG_INLINE Scalar dot(const V& v) { return H(v); }')).toContain('dot');
+      expect(namesOf('V8_INLINE MaybeLocal Get(int i) { return H(i); }')).toContain('Get');
+      expect(namesOf('RAPIDJSON_FORCEINLINE bool Parse(const char* s) { return H(s); }')).toContain('Parse');
+    });
+  });
+
   describe('C++ templated base-class inheritance (#1043)', () => {
     // Inheriting from a template (`class D : public Base<int>`) recorded the base
     // ref as the full instantiation `Base<int>`, which never name-matched the

+ 47 - 0
src/extraction/languages/c-cpp.ts

@@ -123,6 +123,8 @@ function extractCppReturnType(node: SyntaxNode, source: string): string | undefi
 }
 
 export const cExtractor: LanguageExtractor = {
+  // Universal net: recover a real name from any macro-mangled function name.
+  recoverMangledName: recoverMangledCppName,
   functionTypes: ['function_definition'],
   classTypes: [],
   methodTypes: [],
@@ -275,6 +277,15 @@ const CPP_INLINE_MACROS = [
   '_ALWAYS_INLINE_', '_FORCE_INLINE_',
   // Boost
   'BOOST_FORCEINLINE', 'BOOST_NOINLINE',
+  // Qt (per-method markers + inline)
+  'Q_INVOKABLE', 'Q_SCRIPTABLE', 'Q_ALWAYS_INLINE', 'Q_SLOT', 'Q_SIGNAL',
+  // Folly / Abseil / LLVM / V8 / Eigen / rapidjson
+  'FOLLY_ALWAYS_INLINE', 'FOLLY_NOINLINE',
+  'ABSL_ATTRIBUTE_ALWAYS_INLINE', 'ABSL_ATTRIBUTE_NOINLINE',
+  'LLVM_ATTRIBUTE_ALWAYS_INLINE', 'LLVM_ATTRIBUTE_NOINLINE',
+  'V8_INLINE', 'V8_NOINLINE',
+  'EIGEN_STRONG_INLINE', 'EIGEN_ALWAYS_INLINE', 'EIGEN_DEVICE_FUNC',
+  'RAPIDJSON_FORCEINLINE',
   // Common cross-ecosystem inline/attribute hints
   'ALWAYS_INLINE', 'FORCE_INLINE', 'NOINLINE',
 ] as const;
@@ -288,6 +299,40 @@ export function blankCppInlineMacros(source: string): string {
   return source.replace(CPP_INLINE_MACRO_RE, (m) => ' '.repeat(m.length));
 }
 
+// Bare C/C++ type/qualifier tokens that must never be taken as a recovered
+// function name (guards `recoverMangledCppName` against the `Ret (name)` idiom,
+// where the token before the params is the return type, not the name).
+const CPP_PRIMITIVE_NAMES = new Set([
+  'bool', 'void', 'int', 'char', 'short', 'long', 'float', 'double', 'unsigned',
+  'signed', 'wchar_t', 'char8_t', 'char16_t', 'char32_t', 'char_t', 'size_t',
+  'auto', 'const', 'struct', 'class', 'enum', 'union', 'typename',
+]);
+
+/**
+ * Universal fallback (any macro, no list) for a C/C++ function name still mangled
+ * because a macro we don't blank sat in front of the return type: `MACRO Ret
+ * name(…)` / `Ret MACRO name(…)` misparse so the return type is glued onto the
+ * name ("Ret name", "char_t* to_str(double v)"). Recover the real identifier —
+ * the token immediately before the parameter list (or the last token). This runs
+ * AFTER the curated pre-parse blank, so it only ever sees the residual tail that
+ * blanking didn't already fix cleanly (which also recovers the return type).
+ *
+ * Safe by construction: only touches an ALREADY-mangled name — one with an
+ * internal space that isn't a legit `operator …`/destructor — so a well-formed
+ * name is returned unchanged. Guarded against the two ways it could mis-pick:
+ * the `Ret (name)` parenthesized-name idiom (left as-is, ambiguous), and a token
+ * that is a bare primitive/keyword rather than a real identifier.
+ */
+export function recoverMangledCppName(name: string): string {
+  if (!/\s/.test(name) || name.startsWith('operator') || name.startsWith('~')) return name;
+  if (/^\S+\s+\([A-Za-z_]\w*\)/.test(name)) return name; // `Ret (name)` idiom — leave alone
+  const beforeParams = name.includes('(') ? name.slice(0, name.indexOf('(')) : name;
+  const tokens = beforeParams.trim().split(/\s+/);
+  const candidate = tokens[tokens.length - 1];
+  if (!candidate || !/^[A-Za-z_]\w*$/.test(candidate) || CPP_PRIMITIVE_NAMES.has(candidate)) return name;
+  return candidate;
+}
+
 /** C/C++ source pre-processing before tree-sitter: recover both macro-annotated
  * class definitions and macro-prefixed function definitions. Offset-preserving. */
 function preParseCppSource(source: string): string {
@@ -299,6 +344,8 @@ export const cppExtractor: LanguageExtractor = {
   // #1061/#946) and macro-prefixed functions (`FORCEINLINE FString Foo()`, #1093
   // follow-up) that tree-sitter otherwise misparses.
   preParse: preParseCppSource,
+  // Universal net for any macro the curated blank list misses.
+  recoverMangledName: recoverMangledCppName,
   functionTypes: ['function_definition'],
   classTypes: ['class_specifier'],
   // A bodiless `class_specifier` is a forward declaration (`class Foo;`) or an

+ 10 - 0
src/extraction/tree-sitter-types.ts

@@ -133,6 +133,16 @@ export interface LanguageExtractor {
   /** Override symbol name extraction (e.g. ObjC multi-part selectors). */
   resolveName?: (node: SyntaxNode, source: string) => string | undefined;
 
+  /**
+   * Post-process an already-extracted name to recover a real identifier from a
+   * name still mangled by a macro the pre-parse didn't blank (C/C++:
+   * `MACRO Ret name(` misparses to the name "Ret name"). Applied to every name
+   * this extractor produces, so it MUST be a no-op on a well-formed name — only
+   * C/C++ set it, because a mangled name there is unambiguous (an internal space),
+   * whereas e.g. Kotlin/Scala backtick identifiers legitimately contain spaces.
+   */
+  recoverMangledName?: (name: string) => string;
+
   /** Extract property name when the generic name walk fails (e.g. ObjC @property). */
   extractPropertyName?: (node: SyntaxNode, source: string) => string | null;
 

+ 8 - 0
src/extraction/tree-sitter.ts

@@ -63,6 +63,14 @@ const VUE_STORE_FILE_SIGNAL = /\bdefineStore\b|\bcreateStore\b|\bVuex\b|\bmutati
  * Extract the name from a node based on language
  */
 function extractName(node: SyntaxNode, source: string, extractor: LanguageExtractor): string {
+  const name = extractNameRaw(node, source, extractor);
+  // Universal fallback: recover a real identifier from a name still mangled by a
+  // macro the pre-parse didn't blank (C/C++ only — see recoverMangledName). A
+  // no-op on well-formed names, so a clean name is never altered.
+  return extractor.recoverMangledName ? extractor.recoverMangledName(name) : name;
+}
+
+function extractNameRaw(node: SyntaxNode, source: string, extractor: LanguageExtractor): string {
   const hookName = extractor.resolveName?.(node, source);
   if (hookName) return hookName;