Browse Source

feat(extraction): add Go to value-reference edges (+ per-grammar shadow prune)

Extend value-reference edges to Go. Go's package-level const/var are file-scope,
so the target gate fits as-is. But Go's declarators are const_spec / var_spec /
short_var_declaration, not variable_declarator, so the shadow prune was a no-op
for Go — a package `const Timeout` shadowed by a local `Timeout := …` produced a
false positive (a synthetic probe caught it immediately). Generalise the prune's
declarator scan into a per-node-type switch (a file only contains its own
grammar's nodes, so listing all languages in one switch is safe).

Validated small/medium/large/very-large on public Go OSS (gin, hugo, prometheus,
kubernetes): node count identical on/off at every size (incl. 251k nodes on k8s),
precision samples all true positives, guard invariant clean, and the impact win
reproduces (k8s KubeletSubsystem 3→138, prometheus rdsLabelInstance 1→82).

Adds 2 Go test cases (package-const read; the shadow case, verified to fail
without the prune extension). Design matrix, playbook, and CHANGELOG updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby McHenry 1 tuần trước cách đây
mục cha
commit
c369d5d226

+ 1 - 1
CHANGELOG.md

@@ -11,7 +11,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ### New Features
 
-- Impact and blast-radius analysis for TypeScript/JavaScript now understands the readers of a constant. When you change a file-scope `const`/`var` — a config object, a lookup table, a shared constant — the other symbols in that file that read it now show up as affected, where before they were invisible (impact only followed calls, imports, and inheritance, so a constant's consumers looked like "nothing depends on this"). This makes `codegraph impact`, and the impact trail in `codegraph_explore`/`codegraph_node`, catch the "change this table, break its readers" class of change. It's on by default for TS/JS and adds no nodes to your graph; bundled/minified files and ambiguously-shadowed names are skipped to keep results precise. Set `CODEGRAPH_VALUE_REFS=0` to turn it off.
+- Impact and blast-radius analysis for TypeScript, JavaScript, and Go now understands the readers of a constant. When you change a file-scope or package-level `const`/`var` — a config object, a lookup table, a shared constant — the other symbols in that file that read it now show up as affected, where before they were invisible (impact only followed calls, imports, and inheritance, so a constant's consumers looked like "nothing depends on this"). This makes `codegraph impact`, and the impact trail in `codegraph_explore`/`codegraph_node`, catch the "change this table, break its readers" class of change. It's on by default and adds no nodes to your graph; bundled/minified files and ambiguously-shadowed names are skipped to keep results precise. Set `CODEGRAPH_VALUE_REFS=0` to turn it off.
 
 ### Fixes
 

+ 44 - 0
__tests__/value-reference-edges.test.ts

@@ -117,6 +117,50 @@ describe('value-reference edges', () => {
     expect(valueRefReaders(cg, 'THEME_TOKENS')).toEqual(expect.arrayContaining(['Label', 'Box']));
   });
 
+  it('edges same-file readers to a package-level const/var (Go)', async () => {
+    fs.writeFileSync(
+      path.join(dir, 'main.go'),
+      [
+        'package main',
+        '',
+        'const MaxRetries = 3',
+        'var DefaultLabels = map[string]string{"env": "prod"}',
+        '',
+        'func retry() int { return MaxRetries }',
+        'func labels() map[string]string { return DefaultLabels }',
+      ].join('\n'),
+    );
+    cg = index();
+    await cg.indexAll();
+
+    expect(valueRefReaders(cg, 'MaxRetries')).toEqual(expect.arrayContaining(['retry']));
+    expect(valueRefReaders(cg, 'DefaultLabels')).toEqual(expect.arrayContaining(['labels']));
+  });
+
+  it('does NOT edge a Go package const shadowed by a local := of the same name', async () => {
+    // `Timeout` is a package const AND a local `:=` (short_var_declaration) in
+    // shadows(). The local read resolves to the inner binding, so a file-scope
+    // edge would be a false positive — the shadow prune drops the whole target.
+    fs.writeFileSync(
+      path.join(dir, 'shadow.go'),
+      [
+        'package main',
+        '',
+        'const Timeout = 30',
+        '',
+        'func usesConst() int { return Timeout }',
+        'func shadows() int {',
+        '\tTimeout := 5',
+        '\treturn Timeout',
+        '}',
+      ].join('\n'),
+    );
+    cg = index();
+    await cg.indexAll();
+
+    expect(valueRefReaders(cg, 'Timeout')).toEqual([]);
+  });
+
   it('emits nothing when CODEGRAPH_VALUE_REFS=0', async () => {
     const prev = process.env.CODEGRAPH_VALUE_REFS;
     process.env.CODEGRAPH_VALUE_REFS = '0';

+ 33 - 21
docs/design/value-reference-edges-playbook.md

@@ -66,11 +66,13 @@ targets** (see §3).
 
 ## 2. Current state (what's shipped + validated)
 
-- **Default ON** for TS/JS/tsx (`CODEGRAPH_VALUE_REFS=0` disables). Shipped in **PR #895**
-  (flip-on + the shadow prune).
-- **Validated S/M/L** in **TypeScript, JavaScript, and tsx** — **PR #897** (the design doc +
-  matrix). All clean: node count identical on/off, precision guards held, impact win
-  reproduced. No FP class beyond excalidraw's bundled-file shadowing (fixed by the prune).
+- **Default ON** for TS/JS/tsx + Go (`CODEGRAPH_VALUE_REFS=0` disables). Shipped in **PR #895**
+  (flip-on + the shadow prune); Go added in a later PR (the shadow-prune declarator switch +
+  `VALUE_REF_LANGS`).
+- **Validated S/M/L** in **TypeScript, JavaScript, tsx, and Go** — see the matrix in the
+  design doc. All clean: node count identical on/off, precision guards held, impact win
+  reproduced. Go required extending the shadow prune (per-grammar declarators) — the worked
+  example of "step B is load-bearing."
 - **Tests:** `__tests__/value-reference-edges.test.ts` — same-file readers edged; surfaced in
   impact radius; shadowed const NOT edged (verified to fail without the guard); JSX-only read
   edged (tsx); `CODEGRAPH_VALUE_REFS=0` emits nothing.
@@ -206,22 +208,32 @@ value-refs captures **file/module-scope** `const`/`var` (target gate requires pa
 
 ### B. Confirm the declarator node type (for the shadow prune)
 
-The shadow prune scans for `variable_declarator`. **Verify the tree-sitter node type for the
-new grammar** and update the scan if different. Starting points **to verify against the actual
-grammar** (don't trust this table — confirm by parsing a sample, e.g. with a probe or
-`tree-sitter parse`):
-
-| Language | likely declarator node | notes |
-|---|---|---|
-| TS/JS/tsx | `variable_declarator` | done |
-| Python | `assignment` (module-level `NAME = …`) | SCREAMING_CASE fits the distinctive gate |
-| Go | `const_spec` / `var_spec` | inside `const_declaration`/`var_declaration` |
-| Rust | `const_item` / `static_item` | `let_declaration` is locals |
-| Ruby | `assignment` with constant LHS | Ruby CONSTs are `CONST` |
-| C/C++ | `init_declarator` in a file-scope `declaration` | |
-
-If the new grammar's declarator differs, generalise the prune (e.g. a small per-language set
-of declarator node types) rather than hard-coding one.
+The shadow prune (in `flushValueRefs`) counts declarator names via a `switch (n.type)` over
+declarator node types — a file only has its own grammar's nodes, so it's safe to list all
+languages' types in one switch. **Add the new grammar's declarator types there**, with the
+right way to pull the bound name(s). **Verify against the actual grammar** (don't trust this
+table — confirm by parsing a sample). **This step is load-bearing:** if you skip it, the prune
+silently does nothing for the new language and intra-file shadowing produces false positives
+(this is exactly what happened on the first Go pass — see §5-Go below).
+
+| Language | declarator node(s) | name extraction | status |
+|---|---|---|---|
+| TS/JS/tsx | `variable_declarator` | `namedChild(0)` | done |
+| Go | `const_spec`, `var_spec`, `short_var_declaration` | spec → `namedChild(0)`; short-var → identifiers in the `left` field | **done** |
+| Python | `assignment` (module-level `NAME = …`) | LHS identifier(s) | to verify |
+| Rust | `const_item` / `static_item` (`let_declaration` = locals) | `name` field | to verify |
+| Ruby | `assignment` with constant LHS (`CONST`) | LHS | to verify |
+| C/C++ | `init_declarator` in a file-scope `declaration` | declarator id | to verify |
+
+**Go was the worked example of "step B matters":** the first pass added `go` to
+`VALUE_REF_LANGS` only, and a synthetic probe immediately showed a false positive —
+`func withShadow() { TimeoutSeconds := 5; return TimeoutSeconds }` got edged to the package
+`const TimeoutSeconds`, because the prune scanned `variable_declarator` (which Go doesn't
+have). Fix: add Go's `const_spec`/`var_spec`/`short_var_declaration` to the switch. Note the
+**precision-first tradeoff** this inherits from TS/JS — a shadowed target is dropped for the
+*whole file*, so a legit reader elsewhere in that file loses its edge too. On the Go sweep
+(gin/hugo/prometheus) this over-pruning was negligible (guard invariant clean, no LEAKs), so
+it wasn't worth per-reader analysis — but re-check it per language.
 
 ### C. Confirm what kind the extractor assigns
 

+ 24 - 7
docs/design/value-reference-edges.md

@@ -1,6 +1,6 @@
 # Design + status: same-file value-reference edges
 
-**Status:** SHIPPED (default-on for TS/JS; `CODEGRAPH_VALUE_REFS=0` disables). The
+**Status:** SHIPPED (default-on for TS/JS/tsx + Go; `CODEGRAPH_VALUE_REFS=0` disables). The
 emitter lives in `TreeSitterExtractor.flushValueRefs` (`src/extraction/tree-sitter.ts`).
 **Motivation:** close the impact-analysis hole for *value consumers*. Static
 extraction edges calls, imports, and inheritance, but never edges a constant to the
@@ -13,7 +13,7 @@ readers" class of change (the ReScript-PR false positive that motivated the work
 ## TL;DR for a new session
 
 We emit a `references` edge (`metadata: { valueRef: true }`) from a reader symbol to
-the **file-scope `const`/`var` it reads**, same-file only, for TS/JS/TSX. Those edges
+the **file/package-scope `const`/`var` it reads**, same-file only, for TS/JS/tsx + Go. Those edges
 flow straight into `getImpactRadius` / `codegraph impact` and the impact trail in
 `codegraph_explore` / `codegraph_node` — no agent-behaviour change required.
 
@@ -41,7 +41,7 @@ The win is **impact-radius correctness**, not agent read-reduction (see "Agent A
    the content-minified bundles guard #1 misses.
 3. **Distinctive-name + same-file** as above.
 
-## Validation matrix — TypeScript/JavaScript
+## Validation matrix — TypeScript / JavaScript / Go
 
 Method per repo: index the same tree twice (value-refs on vs `CODEGRAPH_VALUE_REFS=0`),
 diff node/edge counts, spot-check precision, and measure `codegraph impact` on a few
@@ -63,10 +63,27 @@ file-scope consts. Node count must be **identical** on/off (edges-only feature).
 | eslint/eslint | medium | 1,420 | 7,167 (stable) | +1,192 (4.2%) | all sampled TP; guard holds; no minified-file FPs | `internalSlotsMap` 1→**32**, `INDEX_MAP` 1→27 |
 | webpack/webpack | large | 9,371 | 28,922 (stable) | +3,521 (4.8%) | all sampled TP; guard holds; no minified-file FPs | `LogType` 1→**89**, `LOG_SYMBOL` 1→90, `UsageState` 2→52 |
 
-Across S/M/L on both languages: node count never moved, the precision guards held, and the
-`impact` OFF column is the bug — a const that 85–90 symbols read reports "1 affected"
-without value-refs. The only false positives ever seen were excalidraw's 23 (one bundled
-file, fixed by the shadow prune); no new FP class surfaced in JS.
+**Go** (package-level `const`/`var`; required extending the shadow prune — see below)
+
+| Repo | size | files | nodes (on=off) | +value-ref edges | precision | `impact` on→off example |
+|---|---|---|---|---|---|---|
+| gin-gonic/gin | small | 110 | 2,599 (stable) | +166 (1.9%) | all sampled TP; guard holds | `abortIndex` 1→**24**, `jsonContentType` 1→8 |
+| gohugoio/hugo | medium | 952 | 19,160 (stable) | +1,616 (2.5%) | all sampled TP; guard holds | `filepathSeparator` 2→**26** |
+| prometheus/prometheus | large | 1,329 | 23,322 (stable) | +3,466 (3.3%) | all sampled TP; guard holds | `rdsLabelInstance` 1→**82**, `ec2Label` 1→24 |
+| kubernetes/kubernetes | very large | 19,160 | 251,086 (stable) | +20,574 (1.9%) | all sampled TP; guard holds on 250 targets | `KubeletSubsystem` 3→**138**, `LEVEL_0` 1→102 |
+
+Across S/M/L in all three languages: node count never moved, the precision guards held, and
+the `impact` OFF column is the bug — a const that 80–90 symbols read reports "1 affected"
+without value-refs.
+
+**Go required a code change** (unlike JS/tsx, which the existing guards covered unchanged).
+Go puts its constants at package = file scope (good — the target gate fits), but its
+declarators are `const_spec`/`var_spec`/`short_var_declaration`, not `variable_declarator`, so
+the shadow prune was a no-op for Go and a package `const Timeout` shadowed by a local
+`Timeout := …` produced a false positive. Extending the prune's declarator switch to Go's node
+types fixed it (one synthetic repro, then clean across gin/hugo/prometheus). This is the
+template for the next language: **the shadow prune is per-grammar and must be wired per
+language** (see the playbook).
 
 **`tsx` is covered by the TS rows** — excalidraw is a React/.tsx codebase, so the headline
 `tablerIconProps` (1→170) and most of its targets live in `.tsx` files. The one

+ 29 - 14
src/extraction/tree-sitter.ts

@@ -224,7 +224,7 @@ export class TreeSitterExtractor {
   // Value-reference edges (default ON; set CODEGRAPH_VALUE_REFS=0 to disable; see flushValueRefs).
   // Same-file reads of file-scope const/var symbols → `references` edges so impact analysis catches
   // value consumers ("change this constant/table, affect its readers").
-  private static readonly VALUE_REF_LANGS = new Set<string>(['typescript', 'javascript', 'tsx']);
+  private static readonly VALUE_REF_LANGS = new Set<string>(['typescript', 'javascript', 'tsx', 'go']);
   private static readonly MAX_VALUE_REF_NODES = 20_000;
   private readonly valueRefsEnabled = process.env.CODEGRAPH_VALUE_REFS !== '0';
   private fileScopeValues = new Map<string, string>();
@@ -557,26 +557,41 @@ export class TreeSitterExtractor {
     if (targets.size === 0 || scopes.length === 0 || isGeneratedFile(this.filePath)) return;
 
     // Prune SHADOWED targets. A name bound more than once in the file (e.g. a
-    // bundled/Emscripten `const Module` re-declared as an inner `var Module` /
-    // function param) resolves to the INNER binding for nested readers, so a
-    // file-scope edge to it is a false positive. Those inner re-declarations
-    // aren't extracted as graph nodes, so detect them at the syntax level:
-    // count `variable_declarator` names across the tree and drop any target
-    // bound twice or more. Single-binding (unambiguous) names are kept. This
-    // complements the path-based isGeneratedFile() check for content-minified
-    // bundles it can't catch by suffix.
+    // bundled/Emscripten `const Module` re-declared as an inner `var Module`, or
+    // a Go package `const Timeout` shadowed by a local `Timeout := …`) may
+    // resolve to the INNER binding for nested readers, so a file-scope edge to
+    // it is a false positive. Those inner re-declarations aren't extracted as
+    // graph nodes, so detect them at the syntax level: count every declarator
+    // name across the tree and drop any target bound twice or more. Single-
+    // binding (unambiguous) names are kept. Complements the path-based
+    // isGeneratedFile() check, which can't catch content-minified bundles.
+    //
+    // Declarator node types are per-grammar; a file only contains its own
+    // language's nodes, so matching all of them in one switch is safe.
     if (this.tree) {
       const declCounts = new Map<string, number>();
+      const bump = (nameNode: SyntaxNode | null) => {
+        if (nameNode && nameNode.type === 'identifier') {
+          const nm = getNodeText(nameNode, this.source);
+          if (targets.has(nm)) declCounts.set(nm, (declCounts.get(nm) ?? 0) + 1);
+        }
+      };
       const dstack: SyntaxNode[] = [this.tree.rootNode];
       let dvisited = 0;
       while (dstack.length > 0 && dvisited < TreeSitterExtractor.MAX_VALUE_REF_NODES) {
         const n = dstack.pop()!;
         dvisited++;
-        if (n.type === 'variable_declarator') {
-          const nameNode = n.namedChild(0);
-          if (nameNode && nameNode.type === 'identifier') {
-            const nm = getNodeText(nameNode, this.source);
-            if (targets.has(nm)) declCounts.set(nm, (declCounts.get(nm) ?? 0) + 1);
+        switch (n.type) {
+          case 'variable_declarator': // TS/JS/tsx
+          case 'const_spec':          // Go  `const X = …`
+          case 'var_spec':            // Go  `var X = …`
+            bump(n.namedChild(0));
+            break;
+          case 'short_var_declaration': { // Go  `x, Y := …`
+            const left = getChildByField(n, 'left') ?? n.namedChild(0);
+            if (left?.type === 'identifier') bump(left);
+            else if (left) for (const c of left.namedChildren) bump(c);
+            break;
           }
         }
         for (let i = 0; i < n.namedChildCount; i++) {