Sfoglia il codice sorgente

feat(extraction): add Ruby to value-reference edges (class-scope constants)

Ruby keeps almost all constants inside a class/module (jekyll's lib: 0 top-level,
58 class-internal), so this needed the class-scope extension flagged since Go.
Three Ruby-specific changes:
- Extract Ruby constants as symbols at all: a `CONST = …` has a `constant`-typed
  LHS (not `identifier`), so they were never extracted — including class-internal
  ones (the variableTypes dispatch skipped class-internal assignments; it now
  makes an exception for `constant`-LHS assignments).
- The value-ref target gate accepts `class:`/`module:` parents, not just `file:`.
- The reader-scan matches `constant` nodes (Ruby represents both a constant's
  definition AND its references as `constant`, not `identifier`).

Effectively Ruby-only: Rust impl consts are already `file:`-parented (ripgrep
stayed at 144 edges) and TS/Python class members aren't constant/variable kind.

Class-scope precision uses a FILE-WIDE target map (not strict same-class) because
Ruby constant lookup is lexical+ancestor — a nested class legitimately reads an
enclosing class's constant (jekyll ERBRenderer→ThemeBuilder::SCAFFOLD_DIRECTORIES;
sinatra AcceptEntry→Request::HEADER_PARAM), and strict matching would drop those.
The only real FP is the same name in sibling classes in one file (~1.7% of
targets on rails; a NameError in valid Ruby, so rare). Net ~98–100% precision.

Validated small/medium/large on public Ruby OSS (sinatra, jekyll, rails): node
count identical on/off, ~98–100% precision, impact win holds (HEADER_PARAM 1→5).
Adds a Ruby test (top-level + class-internal const reads). Design matrix,
playbook (class-scope section now resolved), and CHANGELOG updated. Suite green
(1,558 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby McHenry 1 settimana fa
parent
commit
b92576f660

+ 1 - 1
CHANGELOG.md

@@ -11,7 +11,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### New Features
 
-- Impact and blast-radius analysis for TypeScript, JavaScript, Go, Python, and Rust now understands the readers of a constant. When you change a file-scope, package-level, or module-level constant — a config object, a lookup table, a shared constant — the other symbols in that file that read it now show up as affected, where before they were invisible (impact only followed calls, imports, and inheritance, so a constant's consumers looked like "nothing depends on this"). This makes `codegraph impact`, and the impact trail in `codegraph_explore`/`codegraph_node`, catch the "change this table, break its readers" class of change. It's on by default and adds no nodes to your graph; bundled/minified files and ambiguously-shadowed names are skipped to keep results precise. Set `CODEGRAPH_VALUE_REFS=0` to turn it off.
+- Impact and blast-radius analysis for TypeScript, JavaScript, Go, Python, Rust, and Ruby now understands the readers of a constant. When you change a file-scope, package-level, module-level, or class-level constant — a config object, a lookup table, a shared constant — the other symbols in that file that read it now show up as affected, where before they were invisible (impact only followed calls, imports, and inheritance, so a constant's consumers looked like "nothing depends on this"). This makes `codegraph impact`, and the impact trail in `codegraph_explore`/`codegraph_node`, catch the "change this table, break its readers" class of change. It's on by default and adds no nodes to your graph; bundled/minified files and ambiguously-shadowed names are skipped to keep results precise. Set `CODEGRAPH_VALUE_REFS=0` to turn it off.
 
 ### Fixes
 

+ 31 - 0
__tests__/value-reference-edges.test.ts

@@ -227,6 +227,37 @@ describe('value-reference edges', () => {
     expect(valueRefReaders(cg, 'HAS_SSL')).toEqual(['uses_ssl']);
   });
 
+  it('edges readers to a top-level AND a class-internal constant (Ruby)', async () => {
+    // Ruby keeps almost all constants inside a class/module. Both the top-level
+    // `MAX_RETRIES` and the class-internal `Config::TIMEOUT` must be targets, and
+    // their same-file readers edged (TIMEOUT is read by two methods of Config).
+    fs.writeFileSync(
+      path.join(dir, 'app.rb'),
+      [
+        'MAX_RETRIES = 3',
+        '',
+        'def retry_count',
+        '  MAX_RETRIES',
+        'end',
+        '',
+        'class Config',
+        '  TIMEOUT = 30',
+        '  def self.get_timeout',
+        '    TIMEOUT',
+        '  end',
+        '  def describe',
+        '    "timeout=#{TIMEOUT}"',
+        '  end',
+        'end',
+      ].join('\n'),
+    );
+    cg = index();
+    await cg.indexAll();
+
+    expect(valueRefReaders(cg, 'MAX_RETRIES')).toEqual(expect.arrayContaining(['retry_count']));
+    expect(valueRefReaders(cg, 'TIMEOUT')).toEqual(expect.arrayContaining(['get_timeout', 'describe']));
+  });
+
   it('emits nothing when CODEGRAPH_VALUE_REFS=0', async () => {
     const prev = process.env.CODEGRAPH_VALUE_REFS;
     process.env.CODEGRAPH_VALUE_REFS = '0';

+ 22 - 12
docs/design/value-reference-edges-playbook.md

@@ -45,7 +45,7 @@ agent read-reduction (see §4.3).
 
 | Symbol | Role |
 |---|---|
-| `VALUE_REF_LANGS` (static Set) | languages the feature runs for. Currently `typescript`, `javascript`, `tsx`, `go`, `python`, `rust`. **Add the new language here.** |
+| `VALUE_REF_LANGS` (static Set) | languages the feature runs for. Currently `typescript`, `javascript`, `tsx`, `go`, `python`, `rust`, `ruby`. **Add the new language here.** |
 | `valueRefsEnabled` | `process.env.CODEGRAPH_VALUE_REFS !== '0'` — default ON, env opts out. |
 | `MAX_VALUE_REF_NODES` (20_000) | per-scope traversal cap (and the shadow-scan cap). |
 | `captureValueRefScope(kind, name, id, node)` | called from `createNode` on every node. Records **targets** (file-scope `const`/`var`) and **reader scopes** (`function`/`method`/`const`/`var`). |
@@ -55,7 +55,7 @@ agent read-reduction (see §4.3).
 
 - **Target gate:** `kind ∈ {constant, variable}` **and** `name.length >= 3` **and**
   `/[A-Z_]/.test(name)` (distinctive name — dodges single-letter / all-lowercase shadowing)
-  **and** the node's parent id starts with `file:` (file/module scope).
+  **and** the node's parent id starts with `file:`, `class:`, or `module:` (file/class/module scope).
 - **Reader gate:** `kind ∈ {function, method, constant, variable}`.
 
 **The emit loop in `flushValueRefs`:** same-file only (targets + scopes are per-file, reset
@@ -66,10 +66,10 @@ targets** (see §3).
 
 ## 2. Current state (what's shipped + validated)
 
-- **Default ON** for TS/JS/tsx + Go + Python + Rust (`CODEGRAPH_VALUE_REFS=0` disables). Shipped in **PR #895**
+- **Default ON** for TS/JS/tsx + Go + Python + Rust + Ruby (`CODEGRAPH_VALUE_REFS=0` disables). Shipped in **PR #895**
   (flip-on + the shadow prune); Go added in a later PR (the shadow-prune declarator switch +
   `VALUE_REF_LANGS`).
-- **Validated S/M/L** in **TypeScript, JavaScript, tsx, Go, Python, and Rust** — see the matrix in the
+- **Validated S/M/L** in **TS, JS, tsx, Go, Python, Rust, and Ruby** — see the matrix in the
   design doc. All clean: node count identical on/off, precision guards held, impact win
   reproduced. Go required extending the shadow prune (per-grammar declarators) — the worked
   example of "step B is load-bearing."
@@ -196,16 +196,25 @@ reads — the win is blast-radius correctness** (impact API / CodeGraph Pro's ve
 
 ### A. Where do "constants worth tracking" live? (decide FIRST)
 
-value-refs captures **file/module-scope** `const`/`var` (target gate requires parent id
-`file:`). Before anything:
+The target gate now accepts **`file:`, `class:`, and `module:`** parents. Before anything:
 
 - If the language puts shareable constants at **file/module scope** (TS/JS, Python module
-  consts, Go package vars, Rust module `const`/`static`) → the existing scope check fits;
-  proceed.
-- If constants live at **class scope** (Java `static final`, C# `const`/`static readonly`,
-  Swift `static let`) → the `file:`-parent check **won't match**, and the feature is a silent
-  no-op. Extending to class-scope targets is a **bigger change** (capture class-scope values,
-  decide same-file semantics). Flag this to the maintainer before building.
+  consts, Go package vars, Rust module/impl `const`/`static`) → fits as-is; proceed.
+- If constants live **inside a class/module** (Ruby — done) → the `class:`/`module:` gate now
+  covers them, BUT two things may need fixing first: (1) the extractor must actually *extract*
+  the class-internal constant as a node (the dispatch at the `variableTypes` branch skips
+  class-internal assignments — Ruby needed an exception for `constant`-LHS assignments); (2) the
+  reader-scan must match however the grammar represents a constant *reference* (Ruby uses
+  `constant` nodes, not `identifier`). See the Ruby block in the design doc.
+- **Class-scope precision** uses a **file-wide** target map (one target per name per file), NOT
+  strict same-class matching — because lexical-scope languages (Ruby) let a nested class read an
+  enclosing class's constant, and strict matching would drop those valid reads. The only real FP
+  is the same constant name in *sibling* classes in one file (~1.7% of Ruby targets on rails);
+  valid code rarely hits it (a bare sibling-class constant is a NameError in Ruby).
+- **Still untested:** Java `static final`, C# `const`, Swift `static let`. The gate covers them
+  now, but confirm the extractor emits them as `constant`/`variable` nodes with a `class:`/
+  `struct:` parent (Swift stored properties, for one, aren't extracted as their own nodes) — and
+  if the parent kind is `struct:`/`interface:` rather than `class:`/`module:`, widen the gate.
 
 ### B. Confirm the declarator node type (for the shadow prune)
 
@@ -223,6 +232,7 @@ silently does nothing for the new language and intra-file shadowing produces fal
 | Go | `const_spec`, `var_spec`, `short_var_declaration` | spec → `namedChild(0)`; short-var → identifiers in the `left` field | **done** |
 | Python | `assignment` | `left` field: identifier, or iterate a `pattern_list`/`tuple_pattern` | **done** |
 | Rust | `const_item`, `static_item`, `let_declaration` | const/static → `name` field; let → `pattern` field | **done** |
+| Ruby | `assignment` (LHS is a `constant` node) | already in the switch; Ruby can't local-shadow a constant, so the prune is effectively a no-op for it | **done** (class-scope) |
 | Ruby | `assignment` with constant LHS (`CONST`) | LHS | to verify |
 | C/C++ | `init_declarator` in a file-scope `declaration` | declarator id | to verify |
 

+ 33 - 4
docs/design/value-reference-edges.md

@@ -1,6 +1,6 @@
 # Design + status: same-file value-reference edges
 
-**Status:** SHIPPED (default-on for TS/JS/tsx + Go + Python + Rust; `CODEGRAPH_VALUE_REFS=0` disables). The
+**Status:** SHIPPED (default-on for TS/JS/tsx + Go + Python + Rust + Ruby; `CODEGRAPH_VALUE_REFS=0` disables). The
 emitter lives in `TreeSitterExtractor.flushValueRefs` (`src/extraction/tree-sitter.ts`).
 **Motivation:** close the impact-analysis hole for *value consumers*. Static
 extraction edges calls, imports, and inheritance, but never edges a constant to the
@@ -13,7 +13,7 @@ readers" class of change (the ReScript-PR false positive that motivated the work
 ## TL;DR for a new session
 
 We emit a `references` edge (`metadata: { valueRef: true }`) from a reader symbol to
-the **file/package-scope `const`/`var` it reads**, same-file only, for TS/JS/tsx + Go + Python + Rust. Those edges
+the **file/package-scope `const`/`var` it reads**, same-file only, for TS/JS/tsx + Go + Python + Rust + Ruby. Those edges
 flow straight into `getImpactRadius` / `codegraph impact` and the impact trail in
 `codegraph_explore` / `codegraph_node` — no agent-behaviour change required.
 
@@ -46,7 +46,7 @@ The win is **impact-radius correctness**, not agent read-reduction (see "Agent A
    the content-minified bundles guard #1 misses.
 3. **Distinctive-name + same-file** as above.
 
-## Validation matrix — TypeScript / JavaScript / Go / Python / Rust
+## Validation matrix — TS / JS / Go / Python / Rust / Ruby
 
 Method per repo: index the same tree twice (value-refs on vs `CODEGRAPH_VALUE_REFS=0`),
 diff node/edge counts, spot-check precision, and measure `codegraph impact` on a few
@@ -93,7 +93,15 @@ file-scope consts. Node count must be **identical** on/off (edges-only feature).
 | tokio-rs/tokio | medium | 795 | 13,281 (stable) | +476 (1.1%) | all sampled TP; `#[cfg]`-conditional consts kept | `PERMIT_SHIFT` 1→**97**, `LOCAL_QUEUE_CAPACITY` 2→46 |
 | rust-lang/rust-analyzer | large | 1,530 | 38,780 (stable) | +475 (0.25%) | all sampled TP; 0 real shadow leaks | `INLINE_CAP` 2→**183**, `SPAN_PARTS_BIT` 2→18 |
 
-Across S/M/L in all five languages: node count never moved, the precision guards held, and
+**Ruby** (`CONST = …`, almost always **inside a class/module** — needed the class-scope extension)
+
+| Repo | size | files | nodes (on=off) | +value-ref edges | precision | `impact` on→off example |
+|---|---|---|---|---|---|---|
+| sinatra/sinatra | small | 96 | 1,800 (stable) | +73 (2.1%) | ~100% TP (flags are valid nested reads) | `HEADER_PARAM` 1→**5** |
+| jekyll/jekyll | medium | 218 | 1,906 (stable) | +100 (2.4%) | ~100% TP | `DEFAULT_PRIORITY` 1→3, `LOG_LEVELS` 4→5 |
+| rails/rails | large | 1,452 | 61,911 (stable) | +2,255 (1.2%) | ~98% TP (same-file ambiguity 21/1208 targets) | `Post` (Struct const) 75 readers |
+
+Across S/M/L in all six languages: node count never moved, the precision guards held, and
 the `impact` OFF column is the bug — a const that 80–140 symbols read reports "1 affected"
 without value-refs.
 
@@ -126,6 +134,27 @@ tokio's `io/interest.rs` cfg-gated flags). One nice property fell out: consts wr
 config macro (`cfg_aio! { … }`) live in an unparsed token tree, so the prune's syntax walk
 doesn't even see them.
 
+**Ruby is the class-scope case — and required three changes.** Ruby keeps almost all constants
+*inside* a class/module (jekyll's `lib/`: 0 top-level vs 58 class-internal), so the original
+file-scope-only target gate covered ~nothing. Three Ruby-specific fixes: (1) the extractor now
+creates nodes for constant assignments (`CONST = …` has a `constant`-typed LHS, not
+`identifier`, so they were never extracted at all) — including class-internal ones; (2) the
+value-ref target gate accepts `class:`/`module:` parents, not just `file:`; (3) the reader-scan
+matches `constant` nodes, since in Ruby both a constant's definition and its references are
+`constant`-typed. **Effectively Ruby-only:** Rust impl consts are parented to `file:` already
+(so the gate change doesn't touch them — ripgrep stayed at 144 edges), and TS/Python class
+members aren't `constant`/`variable` kind.
+
+The interesting precision question — *which* class does a class-scope target belong to — turns
+out to favor a **file-wide** target map (a name maps to one target per file), because Ruby's
+constant lookup is **lexical + ancestor**: a method in a nested class legitimately reads an
+enclosing class's constant (verified on jekyll's `ERBRenderer→ThemeBuilder::SCAFFOLD_DIRECTORIES`
+and sinatra's `AcceptEntry→Request::HEADER_PARAM`). Strict same-class matching would wrongly drop
+those. The only real false positive is the same constant name defined in *sibling* (un-nested)
+classes in one file — 21 of 1,208 targets (1.7%) on rails, and most of those resolve fine too;
+referencing a sibling class's bare constant is a NameError in real Ruby, so valid code rarely
+hits it. Net precision ~98–100%.
+
 **`tsx` is covered by the TS rows** — excalidraw is a React/.tsx codebase, so the headline
 `tablerIconProps` (1→170) and most of its targets live in `.tsx` files. The one
 tsx-specific path — a const read *only* inside JSX (`<Foo x={CONST}/>`) — relies on the

+ 35 - 9
src/extraction/tree-sitter.ts

@@ -224,7 +224,7 @@ export class TreeSitterExtractor {
   // Value-reference edges (default ON; set CODEGRAPH_VALUE_REFS=0 to disable; see flushValueRefs).
   // Same-file reads of file-scope const/var symbols → `references` edges so impact analysis catches
   // value consumers ("change this constant/table, affect its readers").
-  private static readonly VALUE_REF_LANGS = new Set<string>(['typescript', 'javascript', 'tsx', 'go', 'python', 'rust']);
+  private static readonly VALUE_REF_LANGS = new Set<string>(['typescript', 'javascript', 'tsx', 'go', 'python', 'rust', 'ruby']);
   private static readonly MAX_VALUE_REF_NODES = 20_000;
   private readonly valueRefsEnabled = process.env.CODEGRAPH_VALUE_REFS !== '0';
   private fileScopeValues = new Map<string, string>();
@@ -534,11 +534,14 @@ export class TreeSitterExtractor {
   private captureValueRefScope(kind: NodeKind, name: string, id: string, node: SyntaxNode): void {
     if ((kind === 'constant' || kind === 'variable') && name.length >= 3 && /[A-Z_]/.test(name)) {
       const parentId = this.nodeStack[this.nodeStack.length - 1];
-      if (parentId?.startsWith('file:')) {
+      // file-scope OR class/module-scope constants are targets. Class/module
+      // scope matters for languages (Ruby) that keep nearly all constants inside
+      // a class or module; readers are same-file methods of that type.
+      if (parentId && (parentId.startsWith('file:') || parentId.startsWith('class:') || parentId.startsWith('module:'))) {
         this.fileScopeValues.set(name, id);
-        // How many file-scope nodes carry this name. A conditional module-level
-        // def (`try: X = a; except: X = b`) makes >1 — distinct from a local
-        // shadow, which adds a binding the prune must catch (see flushValueRefs).
+        // How many target nodes carry this name. A conditional def
+        // (`try: X = a; except: X = b`) makes >1 — distinct from a local shadow,
+        // which adds a binding the prune must catch (see flushValueRefs).
         this.fileScopeValueCounts.set(name, (this.fileScopeValueCounts.get(name) ?? 0) + 1);
       }
     }
@@ -629,7 +632,9 @@ export class TreeSitterExtractor {
       while (stack.length > 0 && visited < TreeSitterExtractor.MAX_VALUE_REF_NODES) {
         const n = stack.pop()!;
         visited++;
-        if (n.type === 'identifier') {
+        // `constant` covers Ruby, where both a constant's definition and its
+        // references are `constant`-typed nodes, not `identifier`.
+        if (n.type === 'identifier' || n.type === 'constant') {
           const refName = getNodeText(n, this.source);
           const targetId = targets.get(refName);
           // Skip self and same-name targets: a symbol referencing a file-scope
@@ -788,8 +793,15 @@ export class TreeSitterExtractor {
       skipChildren = true;
     }
     // Check for variable declarations (const, let, var, etc.)
-    // Only extract top-level variables (not inside functions/methods)
-    else if (this.extractor.variableTypes.includes(nodeType) && !this.isInsideClassLikeNode()) {
+    // Only extract top-level variables (not inside functions/methods) — plus
+    // class/module-scope CONSTANTS, which Ruby (and other const-in-class
+    // languages) keep almost exclusively inside a class/module. A Ruby `CONST =
+    // …` has a `constant`-typed LHS; other languages don't put one here, so this
+    // is effectively Ruby-only and doesn't disturb their class-internal locals.
+    else if (
+      this.extractor.variableTypes.includes(nodeType) &&
+      (!this.isInsideClassLikeNode() || this.isClassScopeConstantAssignment(node))
+    ) {
       this.extractVariable(node);
       // extractVariable doesn't walk every initializer shape (object literals
       // are deliberately skipped; Python/Ruby don't walk at all), so scan the
@@ -1098,6 +1110,18 @@ export class TreeSitterExtractor {
     );
   }
 
+  /**
+   * Ruby `CONST = …` assignment whose LHS is a `constant` node — a class/module
+   * (or top-level) constant worth extracting as a symbol even inside a class.
+   * Other languages don't give an assignment a `constant`-typed LHS, so this
+   * gate is effectively Ruby-only.
+   */
+  private isClassScopeConstantAssignment(node: SyntaxNode): boolean {
+    if (node.type !== 'assignment') return false;
+    const left = getChildByField(node, 'left') ?? node.namedChild(0);
+    return left?.type === 'constant';
+  }
+
   /**
    * Extract a function
    */
@@ -1851,7 +1875,9 @@ export class TreeSitterExtractor {
       const left = getChildByField(node, 'left') || node.namedChild(0);
       const right = getChildByField(node, 'right') || node.namedChild(1);
 
-      if (left && left.type === 'identifier') {
+      // Ruby constant assignments (`MAX = 3`) have a `constant`-typed LHS, not
+      // `identifier`; without this they were never extracted as symbols at all.
+      if (left && (left.type === 'identifier' || left.type === 'constant')) {
         const name = getNodeText(left, this.source);
         // Skip if name starts with lowercase and looks like a function call result
         // Python constants are usually UPPER_CASE