
fix(index): exclude Android resource XML from the index by default (#1047) (#1054)

An Android `res/` tree (layouts, value bags, drawables, menus, navigation
graphs) holds only non-code resources that yield zero symbols, yet on an
Android app it dominates the file count (one report: 26k XML = 97% of
files, 0 symbols) — bloating the DB, slowing indexing, and padding
explore/search results and file counts with entries that have nothing to
find.

Default-ignore the Android resource type directories (`res/layout/`,
`res/values/`, `res/drawable/`, … and their `-<qualifier>` variants) at
discovery, via DEFAULT_IGNORE_PATTERNS so it applies uniformly to the git
index, the non-git walk, and change detection. The `res/<type>/`
structure is self-identifying, so non-Android projects are untouched, and
the only XML that carries symbols — MyBatis mappers under
`src/main/resources/` — never lives under `res/`, so nothing useful is
dropped. `res/raw/` is deliberately kept (arbitrary bundled assets), and
a `.gitignore` negation re-includes anything.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Colby Mchenry 9 hours ago
parent
commit
ffff9c2b56
3 files changed with 109 additions and 0 deletions
  1. CHANGELOG.md (+1, -0)
  2. __tests__/android-res-exclusion.test.ts (+85, -0)
  3. src/extraction/index.ts (+23, -0)

+ 1 - 0
CHANGELOG.md

@@ -20,6 +20,7 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - `codegraph node` can now read a file from the command line. File-read mode — pass `-f`/`--file` to get a file's source with line numbers plus the files that depend on it, the same output as the `codegraph_node` MCP tool — was rejected with "missing required argument 'name'", because the command always demanded a symbol name even though file mode has none, leaving the feature unreachable from the CLI. The symbol name is now optional: `codegraph node -f src/auth.ts` (or `codegraph node src/auth.ts`) reads the file, `codegraph node parseToken` looks up a symbol, and running it with neither prints a short usage hint instead of a cryptic error. Thanks @jcrabapple for the report. (#1044)
 - `codegraph query` no longer prints meaningless relevance percentages like "12042%" next to each result. The number was a raw full-text search score — useful only for ordering the results, not as a real 0–100% figure — so multiplying it by 100 produced wild values that made the output look broken. Results are already listed best-match first, so the CLI now just shows them in that order with no score, matching what the search tool reports to AI agents. If you script against `codegraph query --json`, the raw `score` is still included for sorting or thresholding. Thanks @jcrabapple for the report. (#1045)
 - `codegraph explore` no longer reports an alarming, inflated result count on broad natural-language queries. The "Found N symbols across M files" summary used to count every symbol the search swept in while ranking, so a broad query (for example "publish status to the API") on a large project could announce hundreds of symbols across a big fraction of the codebase — reading as if you had to wade through all of them — even though only the most relevant handful are actually shown with their source. The summary now counts just the files explore returns source for, so the number matches what you see. Ranking and results are unchanged: the right symbols still come first, and any further relevant files are still listed by name under "Not shown above" so nothing is hidden. Thanks @jcrabapple for the report. (#1046)
+- Android resource files no longer bloat the index. A `res/` tree — layouts, drawables, value bags (strings, colors, styles), menus, navigation graphs — contains no code symbols, but on an Android app it can be the overwhelming majority of files (one project: 26,000+ XML files, ~97% of everything, contributing zero symbols), which inflated the database, slowed indexing, and padded file counts and `codegraph explore`/search results with entries that have nothing to find. CodeGraph now skips Android resource directories by default — `res/layout/`, `res/values/`, `res/drawable/`, `res/menu/`, and the rest, including their locale/density/version variants like `res/values-es/` or `res/drawable-hdpi/`. Your actual code is untouched, and so is the one kind of XML that does carry symbols — MyBatis mapper files, which live under `src/main/resources/`, not `res/`. `res/raw/` is deliberately kept (it can hold real assets), and you can re-include any excluded directory with a `.gitignore` negation such as `!res/values/`. Thanks @jcrabapple for the report. (#1047)
 
 
 ## [1.1.2] - 2026-06-28

+ 85 - 0
__tests__/android-res-exclusion.test.ts

@@ -0,0 +1,85 @@
+/**
+ * Android resource XML is excluded from the index by default (#1047).
+ *
+ * A `res/` tree holds only non-code resources (layouts, value bags, drawables,
+ * menus) split into typed, optionally qualified subdirectories. None of it yields
+ * a code symbol, yet on an Android app it dominates the file count (one report:
+ * 26k XML = 97% of files, 0 symbols), bloating the DB, slowing indexing, and
+ * skewing explore results. CodeGraph now default-ignores the Android resource
+ * type directories — `res/layout/`, `res/values/`, `res/drawable/`, … and their
+ * `-<qualifier>` variants — at discovery.
+ *
+ * Guardrails this locks in:
+ *  - Real code (`.java`) is still indexed.
+ *  - The one XML that DOES carry symbols — a MyBatis mapper under
+ *    `src/main/resources/` — is untouched (it never lives under `res/`).
+ *  - Plain non-`res/` XML (`pom.xml`) is unaffected.
+ *  - `res/raw/` is deliberately KEPT — it holds arbitrary bundled assets that can
+ *    be code-ish, so we don't drop it.
+ */
+import { describe, it, expect, beforeEach, afterEach } from 'vitest';
+import * as fs from 'fs';
+import * as path from 'path';
+import * as os from 'os';
+import CodeGraph from '../src/index';
+
+describe('Android resource XML exclusion (#1047)', () => {
+  let dir: string;
+  let cg: CodeGraph;
+
+  const write = (rel: string, body: string) => {
+    const p = path.join(dir, rel);
+    fs.mkdirSync(path.dirname(p), { recursive: true });
+    fs.writeFileSync(p, body);
+  };
+
+  beforeEach(async () => {
+    dir = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-android-res-'));
+
+    // Android resource files (a sample of typed subdirs, incl. a locale qualifier);
+    // all should be EXCLUDED.
+    write('app/src/main/res/layout/activity_main.xml', '<LinearLayout><TextView/></LinearLayout>\n');
+    write('app/src/main/res/values/strings.xml', '<resources><string name="app_name">App</string></resources>\n');
+    write('app/src/main/res/values-es/strings.xml', '<resources><string name="app_name">App</string></resources>\n');
+    write('app/src/main/res/drawable/ic_foo.xml', '<vector android:height="24dp"/>\n');
+    write('app/src/main/res/menu/main_menu.xml', '<menu><item android:id="@+id/x"/></menu>\n');
+
+    // Real code, a MyBatis mapper (the one XML with symbols), plain XML, and a
+    // res/raw asset — all should be KEPT.
+    write('app/src/main/java/com/example/Main.java', 'package com.example;\npublic class Main { void run(){} }\n');
+    write('src/main/resources/FooMapper.xml',
+      '<mapper namespace="com.example.FooDao"><select id="findAll">SELECT * FROM foo</select></mapper>\n');
+    write('pom.xml', '<project><artifactId>demo</artifactId></project>\n');
+    write('app/src/main/res/raw/payload.xml', '<data><item>1</item></data>\n');
+
+    cg = CodeGraph.initSync(dir);
+    await cg.indexAll();
+  });
+
+  afterEach(() => {
+    if (cg) cg.destroy();
+    if (fs.existsSync(dir)) fs.rmSync(dir, { recursive: true, force: true });
+  });
+
+  it('excludes Android resource XML but keeps code, MyBatis mappers, plain XML, and res/raw', () => {
+    const indexed = new Set(cg.getFiles().map((f) => f.path));
+
+    // Excluded: each sampled resource type dir, including the qualifier variant.
+    expect(indexed).not.toContain('app/src/main/res/layout/activity_main.xml');
+    expect(indexed).not.toContain('app/src/main/res/values/strings.xml');
+    expect(indexed).not.toContain('app/src/main/res/values-es/strings.xml');
+    expect(indexed).not.toContain('app/src/main/res/drawable/ic_foo.xml');
+    expect(indexed).not.toContain('app/src/main/res/menu/main_menu.xml');
+
+    // Kept: real code, plain XML, and the deliberately-spared res/raw asset.
+    expect(indexed).toContain('app/src/main/java/com/example/Main.java');
+    expect(indexed).toContain('pom.xml');
+    expect(indexed).toContain('app/src/main/res/raw/payload.xml');
+
+    // Kept AND still carries symbols: the MyBatis mapper (non-regression — the
+    // only valuable XML, and it never lives under res/).
+    const mapper = cg.getFiles().find((f) => f.path === 'src/main/resources/FooMapper.xml');
+    expect(mapper).toBeDefined();
+    expect(mapper!.nodeCount).toBeGreaterThan(1); // file node + ≥1 statement
+  });
+});

+ 23 - 0
src/extraction/index.ts

@@ -168,12 +168,35 @@ const DEFAULT_IGNORE_DIRS: ReadonlySet<string> = new Set([
   '.cache',
 ]);
 
+/**
+ * Android resource directory types. A `res/` tree holds ONLY non-code resources —
+ * layouts, drawables, value bags (strings/colors/styles), menus, navigation
+ * graphs — split into one typed subdirectory per kind, optionally density/locale/
+ * version-qualified (`values-es`, `drawable-hdpi`, `layout-v21`, …). None of it
+ * yields an extractable code symbol, yet on an Android app it DOMINATES the tree
+ * (one report: 26k XML files = 97% of the project, 0 symbols), bloating the DB,
+ * slowing indexing, and skewing both the file count and `codegraph_explore`
+ * results (#1047). So these are excluded by default. The structure is
+ * self-identifying — a non-Android project has no `res/layout/` etc., so it's
+ * untouched — and the only XML that DOES produce symbols (MyBatis mappers) lives
+ * under `src/main/resources/`, never `res/`, so nothing useful is dropped.
+ * `res/raw/` is deliberately NOT here: it holds arbitrary bundled assets that can
+ * be code-ish (a `.sql` schema, a `.js`), so we leave it indexed. Override any of
+ * these with a `.gitignore` negation (e.g. `!res/values/`).
+ */
+const ANDROID_RES_TYPES: readonly string[] = [
+  'anim', 'animator', 'color', 'drawable', 'font', 'layout',
+  'menu', 'mipmap', 'navigation', 'transition', 'values', 'xml',
+];
+
 /** Gitignore-style patterns for the `ignore` matcher: the dirs above plus a few globs. */
 const DEFAULT_IGNORE_PATTERNS: string[] = [
   ...Array.from(DEFAULT_IGNORE_DIRS, (d) => `${d}/`),
   '*.egg-info/',     // Python packaging metadata
   'cmake-build-*/',  // CLion / CMake build trees
   'bazel-*/',        // Bazel output symlink trees
+  // Android resource dirs at any depth; the trailing `*` covers qualifier variants like `values-es` (#1047).
+  ...ANDROID_RES_TYPES.map((t) => `**/res/${t}*/`),
 ];
 
 /** True if `buf` decodes as strict UTF-8 (no invalid byte sequences). */
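
To make the effect of the new patterns concrete, here is a minimal sketch of how the list expands and which paths it would catch. It is not the project's matcher: the source says the patterns are fed to the `ignore` matcher, whereas `isAndroidResPath` below is a hypothetical hand-rolled approximation written only for illustration. It assumes forward-slash relative paths and treats the last segment as a file name.

```typescript
// Copy of the list added in src/extraction/index.ts.
const ANDROID_RES_TYPES: readonly string[] = [
  'anim', 'animator', 'color', 'drawable', 'font', 'layout',
  'menu', 'mipmap', 'navigation', 'transition', 'values', 'xml',
];

// Same expansion as in the diff: one gitignore-style glob per resource type.
const patterns: string[] = ANDROID_RES_TYPES.map((t) => `**/res/${t}*/`);

// Hypothetical approximation of those globs (NOT the real `ignore` package):
// a path counts as excluded when a directory named `res` is immediately
// followed by a directory whose name starts with a resource type.
function isAndroidResPath(relPath: string): boolean {
  const segments = relPath.split('/');
  // Stop before the final segment, which is the file name, not a directory.
  for (let i = 0; i + 2 < segments.length; i++) {
    if (segments[i] !== 'res') continue;
    const typeDir = segments[i + 1] ?? '';
    if (ANDROID_RES_TYPES.some((t) => typeDir.startsWith(t))) return true;
  }
  return false;
}

const samples = [
  'app/src/main/res/layout/activity_main.xml', // excluded
  'app/src/main/res/values-es/strings.xml',    // excluded (qualifier variant)
  'app/src/main/res/raw/payload.xml',          // kept (`raw` is not in the list)
  'src/main/resources/FooMapper.xml',          // kept (`resources` is not `res`)
  'pom.xml',                                   // kept
];
for (const p of samples) {
  console.log(`${isAndroidResPath(p) ? 'excluded' : 'kept    '}  ${p}`);
}
```

One consequence worth keeping in mind: because each pattern ends in `*` rather than `-*`, it is a prefix match. Under this sketch a directory such as `res/valuesfoo/` would also be excluded, and `anim*` already covers `animator`, so the match is somewhat broader than the strict `-<qualifier>` form.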