Browse Source

New skill to add languages

Colby McHenry 1 month ago
parent
commit
9f1a951642

+ 193 - 0
.claude/skills/add-lang/SKILL.md

@@ -0,0 +1,193 @@
+---
+name: add-lang
+description: Add tree-sitter language support to codegraph end-to-end — wire the grammar + extractor, write tests, then benchmark extraction quality and retrieval value on 3 popular real-world repos. Use when the user runs /add-lang <language> or asks to add/support a new language (e.g. Lua, Elixir, Zig, OCaml) in codegraph.
+---
+
+# Add a language to CodeGraph
+
+Wire a new tree-sitter language into codegraph's extraction pipeline, prove it
+extracts real symbols on popular repos, and prove it beats no-codegraph for an
+agent. Runs **fully autonomously** — pick repos, benchmark, update docs, then
+report. **Never commit, push, publish, or tag** (house rule); leave all changes
+for the user to review.
+
+The argument is the language token used throughout the `Language` union, e.g.
+`lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase
+single-token form everywhere (`csharp`, not `c#`).
+
+## Prerequisites
+- Run from the codegraph repo root. `node`, `git`, `gh`, and a logged-in
+  `claude` CLI (the benchmark spawns real `claude -p` runs).
+- The benchmark uses the local dev build — Step 8 builds + links it on PATH.
+
+## Workflow
+
+Copy this checklist and work through it in order:
+```
+- [ ] 1. Resolve language; bail early if already supported (just benchmark)
+- [ ] 2. Find a grammar (tree-sitter-wasms vs vendor a .wasm)
+- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
+- [ ] 4. Wire the language (4 source edits)
+- [ ] 5. Build + verify-extraction loop until PASS
+- [ ] 6. Add extraction tests; make them green
+- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
+- [ ] 8. Benchmark all 3: extraction + with/without A/B
+- [ ] 9. Update README + CHANGELOG
+- [ ] 10. Report; do NOT commit
+```
+
+### Step 1 — Resolve + short-circuit
+
+Check whether the language is already wired: look for the token in the
+`LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map
+(`src/extraction/languages/index.ts`). If it is already supported (e.g.
+`typescript`, `rust`), **skip Steps 2–6** and go straight to benchmarking
+(Steps 7–8) to validate/measure it — note in the report that no code changed.
+
+### Step 2 — Find a grammar
+
+```bash
+ls node_modules/tree-sitter-wasms/out/ | grep -i <lang>   # csharp -> c_sharp
+```
+- **Present** → off-the-shelf. No vendoring; `grammars.ts` resolves it from
+  `tree-sitter-wasms` automatically. (Most popular languages are here: lua,
+  elixir, zig, ocaml, solidity, toml, yaml, …)
+- **Absent** → you must vendor a `.wasm` into `src/extraction/wasm/` (like
+  `pascal`/`scala`) and add the token to the vendored branch in Step 4. Get a
+  wasm from the grammar's npm package (a prebuilt `*.wasm`) or by building one
+  (`npx tree-sitter-cli build --wasm`, which needs emscripten/Docker — the
+  `tree-sitter` CLI is usually not on PATH here). **If you cannot obtain a
+  wasm, STOP and tell the user** — the language can't be added without it.
+
+### Step 3 — Discover AST node types
+
+Get a representative source file (write a small sample covering functions,
+classes/structs, imports, enums; or `curl` a raw file from a known repo), then:
+```bash
+node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
+# vendored grammar: pass the wasm path instead of the token
+node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>
+```
+The frequency table + field names (`name:`, `parameters:`, `body:`,
+`return_type:`) tell you what to map. Open the existing extractor closest to the
+language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits),
+`java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting), `go.ts`
+(top-level methods + receivers).
+
+### Step 4 — Wire the language (4 edits)
+
+These are exact, fragile wiring — match the existing style precisely:
+
+1. **`src/types.ts`** — add `'<lang>',` to the `LANGUAGES` const (before
+   `'unknown'`).
+2. **`src/extraction/grammars.ts`** — three maps:
+   - `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',`
+   - `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`)
+   - `getLanguageDisplayName`: `<lang>: '<Display Name>',`
+   - **vendored only**: add `<lang>` to the
+     `(lang === 'pascal' || lang === 'scala')` wasm-path branch.
+3. **`src/extraction/languages/<lang>.ts`** — new file exporting
+   `export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types
+   from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`,
+   `interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`,
+   `importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`,
+   `paramsField`. Add hooks as the grammar needs them (`getSignature`,
+   `getVisibility`, `isExported`, `extractImport`, `getReceiverType`,
+   `interfaceKind`, `enumMemberTypes`, etc. — see
+   `src/extraction/tree-sitter-types.ts`).
+4. **`src/extraction/languages/index.ts`** — `import { <lang>Extractor } from
+   './<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`.
+
+### Step 5 — Build + verify loop
+
+```bash
+npm run build            # tsc + copy-assets (copies any vendored *.wasm into dist/)
+```
+Index a small sample repo and check extraction:
+```bash
+( cd <sample-repo> && codegraph init -i )
+node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>
+```
+`verify-extraction.mjs` fails (exit 1) if the language isn't detected or only
+`file`/`import` nodes were produced — the classic symptom of wrong node-type
+names. On FAIL or a thin WARN: re-run `dump-ast.mjs` on a richer file, fix the
+mappings in `<lang>.ts`, `npm run build`, re-index, re-verify. **Repeat until
+PASS.**
+
+### Step 6 — Tests
+
+Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block:
+- a `detectLanguage` assertion in `describe('Language Detection')`
+- a `describe('<Lang> Extraction')` block asserting functions/classes/imports
+  are extracted from an inline source string.
+```bash
+npx vitest run __tests__/extraction.test.ts
+```
+Green before continuing.
+
+### Step 7 — Auto-pick 3 repos + corpus
+
+Pick **without asking**. Find candidates, then curate 3 that are genuinely
+`<lang>`-dominant, one per size tier:
+```bash
+gh search repos --language=<lang> --sort=stars --limit 40 \
+  --json fullName,stargazerCount,description
+```
+Tiers (match `corpus.json`): **Small** <~150 files · **Medium** ~150–1500 ·
+**Large** >~1500. Skip repos that are tagged `<lang>` but mostly another
+language. Write one cross-file architecture **question** per repo (the kind that
+needs tracing across files). Add a `"<Language>"` block to
+`.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`,
+`files`, `question`) so `/agent-eval` can reuse them.
+
+### Step 8 — Benchmark all 3 (extraction + A/B)
+
+Make the dev build the codegraph on PATH **once**, then loop:
+```bash
+npm run build && ./scripts/local-install.sh
+scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless   # ×3
+```
+`bench.sh` clones (shared `/tmp/codegraph-corpus`), wipes + indexes, runs
+`verify-extraction.mjs`, then the with/without retrieval A/B via
+`scripts/agent-eval/run-all.sh` (skips the paid A/B if extraction is broken).
+Read each `parse-run.mjs` summary printed by `run-all.sh`: tool calls, file
+`Read`s, Grep/Bash, codegraph-tool calls, duration, and **cost** — for both the
+`with` and `without` arms. After the loop, restore the dev link if needed:
+`./scripts/local-install.sh`.
+
+### Step 9 — Docs + CHANGELOG
+
+- **README.md**: add `<Lang>` to the "19+ Languages" feature bullet, and add a
+  row to the **Supported Languages** table:
+  `| <Lang> | \`.ext\` | Full support (classes, methods, …) |`.
+- **CHANGELOG.md**: add an `## [Unreleased]` section at the top (above the
+  latest version) with `### Added` → a user-perspective bullet, e.g.
+  *"CodeGraph now indexes **<Lang>** (`.ext`) — functions, classes, imports, and
+  call edges."* If `## [Unreleased]` already exists, append under it. (`/publish`
+  folds this into the next versioned block at release time.)
+
+### Step 10 — Report (do NOT commit)
+
+Summarize for review:
+- **Files changed**: the 4 wiring edits + new extractor + tests + README +
+  CHANGELOG + corpus.json (+ any vendored `.wasm`).
+- **Extraction** per repo: files / nodes / edges / `verify-extraction` result.
+- **A/B** per repo: `with` vs `without` (tool calls, file Reads, cost) and a
+  one-line verdict — did codegraph reduce effort, and did both arms reach a
+  correct answer?
+- **Gaps / follow-ups** (node types not yet mapped, resolution edges missing,
+  framework routes, etc.).
+
+Hand the changes to the user. **Do not** run `git commit`/`push`,
+`npm publish`, or `scripts/release.sh`.
+
+## Notes
+- The A/B spawns real **paid** `claude -p` runs (opus, `--max-budget-usd`),
+  2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with
+  `/agent-eval`, so clones are reused across runs.
+- Any new `*.wasm` must live in `src/extraction/wasm/` — `copy-assets` (run by
+  `npm run build`) ships it; otherwise it won't be in `dist/`.
+- An index must be served by the **same** binary that built it. Step 8 builds +
+  links the dev build first, so this holds.
+- If a grammar can't be obtained, or extraction can't reach PASS, **STOP and
+  report** — don't ship a half-wired language.

+ 60 - 0
scripts/add-lang/bench.sh

@@ -0,0 +1,60 @@
+#!/usr/bin/env bash
+# Add-lang benchmark for ONE repo:
+#   clone -> wipe+index (with the codegraph on PATH) -> verify extraction ->
+#   with/without retrieval A/B (reuses scripts/agent-eval/run-all.sh).
+#
+# Assumes the codegraph dev build is already built + linked on PATH — the skill
+# runs `npm run build && ./scripts/local-install.sh` ONCE before looping repos.
+# The A/B is skipped if extraction fails its critical checks (don't burn $ on a
+# broken extractor); set FORCE_AB=1 to run it anyway.
+#
+# Usage: bench.sh <lang> <repo-name> <repo-url> "<question>" [headless|tmux|all]
+# Env:   CORPUS   corpus dir (default /tmp/codegraph-corpus, shared with agent-eval)
+set -uo pipefail
+
+LANG_TOKEN="${1:?usage: bench.sh <lang> <repo-name> <repo-url> \"<question>\" [mode]}"
+NAME="${2:?repo-name required}"
+URL="${3:?repo-url required}"
+Q="${4:?question required}"
+MODE="${5:-headless}"
+
+HARNESS="$(cd "$(dirname "$0")" && pwd)"
+AGENT_EVAL="$(cd "$HARNESS/../agent-eval" && pwd)"
+CORPUS="${CORPUS:-/tmp/codegraph-corpus}"
+REPO="$CORPUS/$NAME"
+
+command -v codegraph >/dev/null || { echo "no codegraph on PATH (build + ./scripts/local-install.sh first)"; exit 1; }
+
+echo "==================== add-lang bench: $NAME ($LANG_TOKEN) ===================="
+echo "codegraph: $(command -v codegraph) -> $(codegraph --version 2>/dev/null || echo '?')"
+
+# 1. Ensure the repo (shallow clone, reuse if present).
+mkdir -p "$CORPUS"
+if [ -d "$REPO/.git" ]; then
+  echo "→ reusing checkout: $REPO"
+else
+  echo "→ cloning $URL"
+  git clone --depth 1 "$URL" "$REPO" || { echo "git clone failed"; exit 1; }
+fi
+
+# 2. Wipe + index with the binary under test.
+echo "→ wiping .codegraph and indexing"
+rm -rf "$REPO/.codegraph"
+( cd "$REPO" && codegraph init -i ) || { echo "indexing failed"; exit 1; }
+
+# 3. Verify extraction (cheap guard before the paid A/B).
+echo "→ verifying extraction"
+node "$HARNESS/verify-extraction.mjs" "$REPO" "$LANG_TOKEN"
+VERIFY=$?
+
+# 4. Retrieval A/B (skipped if extraction is broken, unless FORCE_AB=1).
+if [ "$VERIFY" -ne 0 ] && [ "${FORCE_AB:-0}" != "1" ]; then
+  echo "→ SKIPPING A/B — extraction failed critical checks (set FORCE_AB=1 to override)"
+else
+  echo "→ retrieval A/B (mode=$MODE)"
+  bash "$AGENT_EVAL/run-all.sh" "$REPO" "$Q" "$MODE"
+fi
+
+echo "==================== bench complete: $NAME (verify exit=$VERIFY) ===================="
+# Exit reflects extraction: 0 = pass/warn, 1 = critical fail, 2 = couldn't read status.
+exit "$VERIFY"

+ 103 - 0
scripts/add-lang/dump-ast.mjs

@@ -0,0 +1,103 @@
+#!/usr/bin/env node
+// Dump the tree-sitter AST for a sample file so you can write a LanguageExtractor
+// mapping. Loads a grammar .wasm directly via web-tree-sitter (the same runtime
+// codegraph uses) — you do NOT need to register the language first.
+//
+// Usage:
+//   node scripts/add-lang/dump-ast.mjs <lang|wasm-path> <sample-file> [--depth=N] [--full]
+// Examples:
+//   node scripts/add-lang/dump-ast.mjs lua sample.lua
+//   node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-zig.wasm a.zig --depth=4
+//
+// Output: an indented AST (named nodes, with field names) followed by a
+// node-type FREQUENCY table. The frequency table is the payoff — it tells you
+// which node types to map to functionTypes / classTypes / importTypes / etc.
+
+import { readFileSync, existsSync } from 'node:fs';
+import { createRequire } from 'node:module';
+import { Parser, Language } from 'web-tree-sitter';
+
+const require = createRequire(import.meta.url);
+const fail = (msg) => { console.error(`[dump-ast] ${msg}`); process.exit(1); };
+
+const argv = process.argv.slice(2);
+const positional = argv.filter((a) => !a.startsWith('--'));
+const [langOrWasm, sampleFile] = positional;
+const depthFlag = argv.find((a) => a.startsWith('--depth='));
+const showAll = argv.includes('--full'); // also print anonymous (token) nodes
+const maxDepth = depthFlag ? parseInt(depthFlag.split('=')[1], 10) : (showAll ? Infinity : 8);
+
+if (!langOrWasm || !sampleFile) {
+  fail('usage: dump-ast.mjs <lang|wasm-path> <sample-file> [--depth=N] [--full]');
+}
+if (!existsSync(sampleFile)) fail(`sample file not found: ${sampleFile}`);
+
+// Language tokens whose tree-sitter-wasms filename differs from the token.
+const WASM_SPECIAL = { csharp: 'c_sharp', 'c#': 'c_sharp' };
+
+function resolveWasm(token) {
+  if (token.endsWith('.wasm')) {
+    if (!existsSync(token)) fail(`wasm not found: ${token}`);
+    return token;
+  }
+  const base = WASM_SPECIAL[token.toLowerCase()] ?? token.toLowerCase();
+  try {
+    return require.resolve(`tree-sitter-wasms/out/tree-sitter-${base}.wasm`);
+  } catch {
+    /* not in tree-sitter-wasms — try a vendored copy */
+  }
+  const vendored = `src/extraction/wasm/tree-sitter-${base}.wasm`;
+  if (existsSync(vendored)) return vendored;
+  fail(
+    `no grammar for "${token}" — not in tree-sitter-wasms and not vendored at ` +
+      `${vendored}. Pass an explicit .wasm path, or vendor one (see SKILL.md "Find a grammar").`
+  );
+}
+
+const wasmPath = resolveWasm(langOrWasm);
+const source = readFileSync(sampleFile, 'utf8');
+
+try {
+  await Parser.init();
+} catch {
+  await Parser.init({ locateFile: () => require.resolve('web-tree-sitter/tree-sitter.wasm') });
+}
+
+let language;
+try {
+  language = await Language.load(wasmPath);
+} catch (e) {
+  fail(`failed to load grammar ${wasmPath}: ${e.message}`);
+}
+
+const parser = new Parser();
+parser.setLanguage(language);
+const tree = parser.parse(source);
+
+const freq = new Map();
+const snippet = (node) => {
+  const t = node.text.replace(/\s+/g, ' ').trim();
+  return t.length > 48 ? `${t.slice(0, 48)}…` : t;
+};
+
+function walk(node, depth, fieldName) {
+  if (node.isNamed) freq.set(node.type, (freq.get(node.type) || 0) + 1);
+  if ((node.isNamed || showAll) && depth <= maxDepth) {
+    const field = fieldName ? `${fieldName}: ` : '';
+    const leaf = node.childCount === 0 ? `  "${snippet(node)}"` : '';
+    console.log(`${'  '.repeat(depth)}${field}${node.type}  @${node.startPosition.row + 1}:${node.startPosition.column}${leaf}`);
+  }
+  for (let i = 0; i < node.childCount; i++) {
+    const child = node.child(i);
+    if (child) walk(child, depth + 1, node.fieldNameForChild(i));
+  }
+}
+
+console.log(`\n# AST for ${sampleFile}  (grammar: ${wasmPath.split('/').pop()})\n`);
+walk(tree.rootNode, 0, null);
+
+console.log('\n# Node-type frequency (named nodes) — map the relevant ones in your extractor:\n');
+[...freq.entries()]
+  .sort((a, b) => b[1] - a[1])
+  .forEach(([type, n]) => console.log(`  ${String(n).padStart(5)}  ${type}`));
+console.log();

+ 70 - 0
scripts/add-lang/verify-extraction.mjs

@@ -0,0 +1,70 @@
+#!/usr/bin/env node
+// Sanity-check that codegraph extracted REAL symbols (not just file/import nodes)
+// from a repo for a given language. Exits non-zero on a critical failure so it
+// can drive a write-extractor -> build -> re-check loop.
+//
+// Usage: node scripts/add-lang/verify-extraction.mjs <repo-path> <lang>
+// Reads `codegraph status <repo> --json` using whatever codegraph is on PATH,
+// so it reflects the binary that built the index.
+//
+// Exit codes: 0 = pass or soft-warn, 1 = critical fail, 2 = could not run.
+
+import { execFileSync } from 'node:child_process';
+
+const [repo, lang] = process.argv.slice(2);
+if (!repo || !lang) {
+  console.error('usage: verify-extraction.mjs <repo-path> <lang>');
+  process.exit(2);
+}
+
+let status;
+try {
+  const out = execFileSync('codegraph', ['status', repo, '--json'], { encoding: 'utf8' });
+  status = JSON.parse(out);
+} catch (e) {
+  console.error(`[verify] could not read codegraph status for ${repo}: ${e.message}`);
+  process.exit(2);
+}
+
+// Kinds that prove the extractor mapped AST node types (everything except
+// 'file' and 'import', which codegraph creates structurally for any language).
+const SYMBOL_KINDS = new Set([
+  'module', 'class', 'struct', 'interface', 'trait', 'protocol', 'function',
+  'method', 'property', 'field', 'variable', 'constant', 'enum', 'enum_member',
+  'type_alias', 'namespace', 'route', 'component',
+]);
+
+const byKind = status.nodesByKind || {};
+const langs = status.languages || [];
+const files = status.fileCount || 0;
+const edges = status.edgeCount || 0;
+const symbolKinds = Object.keys(byKind).filter((k) => SYMBOL_KINDS.has(k));
+const symbolCount = symbolKinds.reduce((s, k) => s + byKind[k], 0);
+
+const checks = [];
+const add = (severity, ok, label, detail) => checks.push({ severity, ok, label, detail });
+
+add('critical', status.initialized === true, 'index initialized', `initialized=${status.initialized}`);
+add('critical', langs.includes(lang), `language "${lang}" detected`, `languages=[${langs.join(', ')}]`);
+add('critical', symbolCount > 0, 'structural symbols extracted', `${symbolCount} symbols (${symbolKinds.join(', ') || 'NONE — only file/import nodes!'})`);
+add('soft', symbolCount >= files, 'symbol density >= 1/file', `${symbolCount} symbols across ${files} files`);
+add('soft', edges > files, 'edges resolved', `${edges} edges across ${files} files`);
+
+console.log(`\n# Extraction check — ${repo}  (lang=${lang}, backend=${status.backend})`);
+console.log(`  files=${files} nodes=${status.nodeCount} edges=${edges}`);
+console.log(`  nodesByKind: ${JSON.stringify(byKind)}\n`);
+for (const c of checks) console.log(`  ${c.ok ? '✓' : '✗'} ${c.label} — ${c.detail}`);
+
+const critical = checks.filter((c) => !c.ok && c.severity === 'critical');
+const soft = checks.filter((c) => !c.ok && c.severity === 'soft');
+console.log();
+if (critical.length) {
+  console.log(`RESULT: FAIL (${critical.length} critical) — extractor or grammar wiring is broken. Re-run dump-ast.mjs and fix the node-type mappings.`);
+  process.exit(1);
+}
+if (soft.length) {
+  console.log(`RESULT: WARN (${soft.length} soft) — extraction works but looks thin; inspect the counts above.`);
+  process.exit(0);
+}
+console.log('RESULT: PASS — extraction looks healthy.');
+process.exit(0);