
fix(extraction): capture & clean docstrings across all README languages (#780) (#806)

* fix(extraction): capture docstrings for export/const/decorator-wrapped symbols (#780)

getPrecedingDocstring walked previousNamedSibling from the EMITTED
declaration node, so it only found a leading comment when the comment was
a direct sibling of that node. For a declaration nested under a wrapper —
`export class X` / `export const f = () => {}` (export_statement /
lexical_declaration), a plain const arrow (variable_declarator), or a
decorated Python def/class (decorated_definition) — the comment is a
sibling of the WRAPPER, so the inner node had no preceding comment and
the docstring was stored as NULL.

Climb out through the wrapper node(s) before scanning for the comment.
Each wrapper holds exactly one declaration, so this can't mis-attribute a
comment to a sibling (verified: an uncommented method does NOT inherit its
class's comment). Also strip leading `#` from Python/Ruby/shell line
comments, which the cleanup chain missed (Python docstrings used to keep
their `#`).
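
The climb described above can be sketched in isolation. This is only an
illustration under assumptions: `MiniNode`, `WRAPPERS`, and `climbToAnchor`
are hypothetical names, and the hand-built tree stands in for what
tree-sitter would produce for a comment followed by `export class X {}`.

```typescript
// Hypothetical minimal node shape; the real helper works on tree-sitter's SyntaxNode.
interface MiniNode {
  type: string;
  parent: MiniNode | null;
  previousNamedSibling: MiniNode | null;
}

// Wrapper node types named in this commit message (a subset, for illustration).
const WRAPPERS = new Set([
  'export_statement',
  'lexical_declaration',
  'variable_declarator',
  'decorated_definition',
]);

// Climb from the emitted node to the outermost wrapper that directly encloses it.
function climbToAnchor(node: MiniNode): MiniNode {
  let anchor = node;
  while (anchor.parent && WRAPPERS.has(anchor.parent.type)) {
    anchor = anchor.parent;
  }
  return anchor;
}

// Hand-built tree: program -> [comment, export_statement -> class_declaration -> class_body -> method]
const program: MiniNode = { type: 'program', parent: null, previousNamedSibling: null };
const comment: MiniNode = { type: 'comment', parent: program, previousNamedSibling: null };
const exportStmt: MiniNode = { type: 'export_statement', parent: program, previousNamedSibling: comment };
const classDecl: MiniNode = { type: 'class_declaration', parent: exportStmt, previousNamedSibling: null };
const classBody: MiniNode = { type: 'class_body', parent: classDecl, previousNamedSibling: null };
const method: MiniNode = { type: 'method_definition', parent: classBody, previousNamedSibling: null };

// The class climbs to its export wrapper, whose previous sibling is the comment.
const anchor = climbToAnchor(classDecl);
console.log(anchor.type, anchor.previousNamedSibling?.type);

// The method's parent (class_body) is not a wrapper, so it does not climb
// and cannot pick up the class's comment.
console.log(climbToAnchor(method) === method);
```

Because the loop stops at the first parent that is not a wrapper type, a
member nested inside a class body keeps itself as the anchor, which is the
"no over-walk" property the commit verifies.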

Query/extraction-layer change to a parse helper; re-index to pick up
docstrings on already-indexed files. Verified on the reporter's JS/TS and
Python repros (8/8 now captured) plus over-walk controls; +3 tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(extraction): clean comment markers across all supported languages (#780)

Validating docstring capture across every README language surfaced that
the marker cleanup only knew C-style `//` and `/* */`, plus the `#` added
earlier in this branch. Doc comments in other styles were captured but left
their markers in the stored text:

  - Rust/Swift/Kotlin doc lines `///` and `//!`  -> leading `/` / `!` leaked
  - Lua/Luau `--` and `--[[ ]]`                  -> not stripped
  - Pascal `{ }` and `(* *)`                     -> not stripped

Extract the cleanup into cleanCommentMarkers() and handle every style.
Paired block delimiters are stripped only when the comment OPENS with one,
so a line comment that happens to end with `}` / `*)` / `]]` is never
truncated; per-line markers stay anchored at line start.
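
Why the opener check matters can be shown with a small comparison. This is
a sketch under assumptions, not the helper from this commit: `stripNaive`
and `stripGated` are hypothetical names, and only the Pascal brace style
and the `//` line marker are handled.

```typescript
// Strip a leading `//`, `///`, or `//!` marker at the start of each line.
function stripLineMarker(text: string): string {
  return text.replace(/^\/\/[/!]?\s?/gm, '');
}

// Naive: always strip `{` / `}` at the ends, whatever kind of comment it is.
function stripNaive(comment: string): string {
  const c = comment.trim().replace(/^\{/, '').replace(/\}$/, '');
  return stripLineMarker(c).trim();
}

// Gated: strip the brace pair only when the comment OPENS with `{`.
function stripGated(comment: string): string {
  let c = comment.trim();
  if (c.startsWith('{')) c = c.replace(/^\{/, '').replace(/\}$/, '');
  return stripLineMarker(c).trim();
}

const lineComment = '// returns {}';
const pascalBlock = '{ pascal brace }';

console.log(stripNaive(lineComment)); // the closing `}` is lost
console.log(stripGated(lineComment)); // text kept intact
console.log(stripGated(pascalBlock)); // delimiters removed
```

The naive version turns `// returns {}` into `returns {`, while the gated
version leaves the line comment's text alone and still cleans the Pascal
block.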

Validated end-to-end (extract -> index -> codegraph_node output) across
all 19 tree-sitter code languages plus Svelte/Vue `<script>` blocks: every
one now stores and returns a clean docstring. +1 cross-language test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry, 1 week ago
parent
commit
0df9246752
3 changed files with 142 additions and 12 deletions
  1. CHANGELOG.md (+1 -0)
  2. __tests__/extraction.test.ts (+83 -0)
  3. src/extraction/tree-sitter-helpers.ts (+58 -12)

+ 1 - 0
CHANGELOG.md

@@ -29,6 +29,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### Fixes
 
+- Doc comments are now captured for exported, `const`-assigned, and decorated declarations, and the documentation a symbol carries is now clean across every supported language. Previously a comment above `export class X`, `export const fn = () => …`, a plain `const fn = () => …`, or a decorated Python `def`/`class` (`@app.route(...)`, `@dataclass`) was dropped entirely — only comments directly above a plain declaration were kept. CodeGraph now finds the comment through the `export` / `const` / decorator wrapper. Comment-marker cleanup was also rounded out for every language CodeGraph supports: Rust/Swift/Kotlin doc lines (`///`, `//!`), Python/Ruby/shell `#`, Lua/Luau (`--` and `--[[ ]]`), and Pascal (`{ }` and `(* *)`) no longer leave stray markers in the stored text — validated end-to-end across all 19 code languages plus Svelte/Vue `<script>` blocks. (#780). Thanks @caleb-kaiser.
 - Go method calls made through a chained factory function now resolve to the correct type. A call like `New().Method()` used to drop the receiver, so the chained method attached to a same-named method on an unrelated type — or didn't resolve. CodeGraph now captures Go return types (a pointer `*Foo` resolves to `Foo`, and a multi-return `(*Foo, error)` to its first result), infers the chained receiver's type from what the factory function returns, and resolves the method on it — including methods promoted from an embedded struct — creating the edge only when the type or an embedded type genuinely has the method. Existing Go indexes should be re-indexed (`codegraph index -f`) to benefit. (#750) (Go)
 - Scala method calls made through a companion-object factory, a fluent chain, or a case-class `apply` now resolve to the correct type. A call like `Foo.create().bar()` or `Builder(cfg).bar()` used to drop the receiver, so the chained method silently attached to a same-named method on an unrelated type — most often mis-attributing a standard-library `Option` / `Iterator` `.map` / `.flatMap` / `.foreach` onto your own same-named class. CodeGraph now captures Scala return types (a generic `List[Foo]` resolves to its container `List`, a qualified `pkg.Foo` to `Foo`), infers the chained receiver's type from what the inner call returns or constructs, and resolves the method on it — including methods inherited from a trait the type extends — creating the edge only when that type or one of its traits genuinely has the method (so a wrong inference produces no edge instead of a misleading one). Existing Scala indexes should be re-indexed (`codegraph index -f`) to benefit. (#750) (Scala)
 - Rust method calls made through a chained associated function now resolve to the correct type. A call like `Foo::new().bar()` or `Foo::with(cfg).build()` used to drop the receiver, so the chained method silently attached to a same-named method on an unrelated type — or didn't resolve. CodeGraph now captures Rust return types (`-> Self` resolves to the implementing type), infers the chained receiver's type from what the associated function returns, and resolves the method on it — including methods provided by a trait the type implements (via the new `impl Trait for Type` relationships) — creating the edge only when the type or one of its traits genuinely has the method. Existing Rust indexes should be re-indexed (`codegraph index -f`) to benefit. (#750) (Rust)

+ 83 - 0
__tests__/extraction.test.ts

@@ -184,6 +184,89 @@ export class PaymentService {
     expect(chargeMethod).toBeDefined();
   });
 
+  it('captures docstrings for export- and const-wrapped declarations (#780)', () => {
+    const code = `
+// plain class control
+class Ledger {}
+
+// exported class
+export class Invoice {}
+
+// export default
+export default function settle() { return true; }
+
+// exported arrow const
+export const refund = (amount: number) => amount;
+
+// non-export arrow const
+const audit = (amount: number) => amount;
+`;
+    const byName = new Map(extractFromSource('doc.ts', code).nodes.map((n) => [n.name, n]));
+    expect(byName.get('Ledger')?.docstring).toBe('plain class control'); // control still works
+    expect(byName.get('Invoice')?.docstring).toBe('exported class');
+    expect(byName.get('settle')?.docstring).toBe('export default');
+    expect(byName.get('refund')?.docstring).toBe('exported arrow const');
+    expect(byName.get('audit')?.docstring).toBe('non-export arrow const');
+  });
+
+  it('does not mis-attribute a class comment to an uncommented member (#780)', () => {
+    const code = `
+// Comment for Box
+export class Box {
+  noComment() {}
+  // own comment
+  withComment() {}
+}
+`;
+    const byName = new Map(extractFromSource('box.ts', code).nodes.map((n) => [n.name, n]));
+    expect(byName.get('Box')?.docstring).toBe('Comment for Box');
+    expect(byName.get('noComment')?.docstring ?? null).toBeNull(); // no over-walk
+    expect(byName.get('withComment')?.docstring).toBe('own comment');
+  });
+
+  it('captures docstrings for decorated Python declarations, stripping `#` (#780)', () => {
+    const code = [
+      '# decorated function',
+      '@app.route("/x")',
+      'def py_handler():',
+      '    return 1',
+      '',
+      '',
+      '# plain function control',
+      'def py_plain():',
+      '    return 1',
+      '',
+      '',
+      '# decorated class',
+      '@dataclass',
+      'class PyModel:',
+      '    pass',
+      '',
+    ].join('\n');
+    const byName = new Map(extractFromSource('mod.py', code).nodes.map((n) => [n.name, n]));
+    expect(byName.get('py_handler')?.docstring).toBe('decorated function');
+    expect(byName.get('py_plain')?.docstring).toBe('plain function control'); // `#` stripped
+    expect(byName.get('PyModel')?.docstring).toBe('decorated class');
+  });
+
+  it('cleans comment markers across language styles (#780)', () => {
+    const doc = (file: string, code: string, name: string) =>
+      new Map(extractFromSource(file, code).nodes.map((n) => [n.name, n])).get(name)?.docstring;
+
+    // Rust doc lines (`///`, `//!`): the extra `/` or `!` after `//` used to leak through.
+    expect(doc('m.rs', '/// rust doc line\nfn rs_fn() {}', 'rs_fn')).toBe('rust doc line');
+    // Lua line + long-bracket comments.
+    expect(doc('m.lua', '-- lua line\nfunction lua_fn() end', 'lua_fn')).toBe('lua line');
+    expect(doc('b.lua', '--[[ lua block ]]\nfunction lua_b() end', 'lua_b')).toBe('lua block');
+    // Pascal brace and paren-star comments.
+    const pasUnit = (c: string) =>
+      `unit U;\ninterface\n${c}\nprocedure P;\nimplementation\nprocedure P;\nbegin\nend;\nend.\n`;
+    expect(doc('a.pas', pasUnit('{ pascal brace }'), 'P')).toBe('pascal brace');
+    expect(doc('c.pas', pasUnit('(* pascal paren *)'), 'P')).toBe('pascal paren');
+    // C block comment still clean (no regression).
+    expect(doc('m.c', '/* c block */\nvoid c_fn(void) {}', 'c_fn')).toBe('c block');
+  });
+
   it('should extract interfaces', () => {
     const code = `
 export interface User {

+ 58 - 12
src/extraction/tree-sitter-helpers.ts

@@ -43,11 +43,66 @@ export function getChildByField(node: SyntaxNode, fieldName: string): SyntaxNode
   return node.childForFieldName(fieldName);
 }
 
+/**
+ * Node types that *wrap* a declaration so a leading comment is a sibling of the
+ * wrapper, not of the emitted (inner) declaration node. CodeGraph emits the
+ * inner node, so before looking for its preceding comment we climb out through
+ * these. Examples: `export class X {}` (export_statement), `@dec\ndef f()`
+ * (decorated_definition), `const f = () => {}` (lexical_declaration →
+ * variable_declarator). Each wraps exactly one declaration, so climbing can't
+ * mis-attribute a comment to a sibling. (#780)
+ */
+const DOCSTRING_WRAPPER_TYPES = new Set([
+  'export_statement', // JS/TS: export class/function/const ...
+  'decorated_definition', // Python: @decorator over def/class
+  'lexical_declaration', // JS/TS: const/let x = () => {}
+  'variable_declaration', // JS/TS: var x = ...
+  'variable_declarator', // JS/TS: the `x = () => {}` inside the declaration
+  'ambient_declaration', // TS: declare ...
+]);
+
+/**
+ * Strip comment-syntax markers from a raw comment so the stored docstring is
+ * just the prose. Covers the marker styles across every supported language:
+ * C-family line and block comments and their doc variants, Rust/Swift/Kotlin
+ * triple-slash and bang doc lines, hash lines (Python/Ruby/shell), Lua/Luau
+ * line and long-bracket comments, and Pascal brace and paren-star comments.
+ * (#780)
+ *
+ * Paired block delimiters are stripped only when the comment OPENS with one,
+ * so a line comment that merely happens to END with a closing delimiter is
+ * never truncated. The per-line markers are anchored at line start, so
+ * they're safe to apply to any comment.
+ */
+function cleanCommentMarkers(comment: string): string {
+  let c = comment.trim();
+  if (c.startsWith('/*')) c = c.replace(/^\/\*+!?/, '').replace(/\*+\/$/, '');
+  else if (c.startsWith('--[')) c = c.replace(/^--\[=*\[/, '').replace(/\]=*\]$/, '');
+  else if (c.startsWith('(*')) c = c.replace(/^\(\*/, '').replace(/\*\)$/, '');
+  else if (c.startsWith('{')) c = c.replace(/^\{/, '').replace(/\}$/, '');
+  return c
+    .replace(/^\/\/[/!]?\s?/gm, '') // // , and Rust/Swift doc lines /// //!
+    .replace(/^--\s?/gm, '') //        Lua/Luau line comments
+    .replace(/^#\s?/gm, '') //         Python/Ruby/shell line comments
+    .replace(/^\s*\*\s?/gm, '') //     block-comment continuation (* foo)
+    .trim();
+}
+
 /**
  * Get the docstring/comment preceding a node
  */
 export function getPrecedingDocstring(node: SyntaxNode, source: string): string | undefined {
-  let sibling = node.previousNamedSibling;
+  // Climb out of any wrapper(s) so a comment preceding the WHOLE construct
+  // (export-, decorator-, or const-arrow-wrapped) is reachable as a sibling.
+  // The emitted node's own `previousNamedSibling` is empty (export/const) or a
+  // decorator (Python) in those cases, so without this the docstring was
+  // dropped. (#780)
+  let anchor = node;
+  while (anchor.parent && DOCSTRING_WRAPPER_TYPES.has(anchor.parent.type)) {
+    anchor = anchor.parent;
+  }
+
+  let sibling = anchor.previousNamedSibling;
   const comments: string[] = [];
 
   while (sibling) {
@@ -66,15 +121,6 @@ export function getPrecedingDocstring(node: SyntaxNode, source: string): string
 
   if (comments.length === 0) return undefined;
 
-  // Clean up comment markers
-  return comments
-    .map((c) =>
-      c
-        .replace(/^\/\*\*?|\*\/$/g, '')
-        .replace(/^\/\/\s?/gm, '')
-        .replace(/^\s*\*\s?/gm, '')
-        .trim()
-    )
-    .join('\n')
-    .trim();
+  // Strip each comment's syntax markers (language-aware), then join.
+  return comments.map(cleanCommentMarkers).join('\n').trim();
 }