name: add-lang

description: Add tree-sitter language support to codegraph end-to-end — wire the grammar + extractor, write tests, then benchmark extraction quality and retrieval value on 3 popular real-world repos. Use when the user runs /add-lang or asks to add/support a new language (e.g. Lua, Elixir, Zig, OCaml) in codegraph.

Add a language to CodeGraph

Wire a new tree-sitter language into codegraph's extraction pipeline, prove it extracts real symbols on popular repos, and prove it beats no-codegraph for an agent. Runs fully autonomously — pick repos, benchmark, update docs, then report. Never commit, push, publish, or tag (house rule); leave all changes for the user to review.

The argument is the language token used throughout the Language union, e.g. lua, elixir, zig. If none was given, ask which language. Use the lowercase single-token form everywhere (csharp, not c#).

Prerequisites

  • Run from the codegraph repo root. Requires node, git, gh, and a logged-in claude CLI (the benchmark spawns real claude -p runs).
  • The benchmark uses the local dev build — Step 8 builds + links it on PATH.

Workflow

Copy this checklist and work through it in order:

- [ ] 1. Resolve language; bail early if already supported (just benchmark)
- [ ] 2. Find a grammar + health-check it (ABI / heap corruption)
- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
- [ ] 4. Wire the language (4 files; sometimes a 5th core touch)
- [ ] 5. Build + verify-extraction loop until PASS
- [ ] 6. Add extraction tests; make them green
- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
- [ ] 8. Benchmark all 3: extraction + with/without A/B
- [ ] 9. Update README + CHANGELOG
- [ ] 10. Report; do NOT commit

Step 1 — Resolve + short-circuit

Check whether the language is already wired: look for the token in the LANGUAGES const (src/types.ts) and the EXTRACTORS map (src/extraction/languages/index.ts). If it is already supported (e.g. typescript, rust), skip Steps 2–6 and go straight to benchmarking (Steps 7–8) to validate/measure it — note in the report that no code changed.
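
The short-circuit above is a membership check against both registries. A minimal sketch, assuming LANGUAGES behaves like a list of tokens and EXTRACTORS like an object keyed by token; the sample contents and the helper name isAlreadyWired are hypothetical stand-ins, not codegraph's actual code:

```typescript
// Hypothetical stand-ins for the registries in src/types.ts and
// src/extraction/languages/index.ts; the contents are illustrative only.
const SAMPLE_LANGUAGES: readonly string[] = ["typescript", "rust", "unknown"];
const SAMPLE_EXTRACTORS: Record<string, object> = { typescript: {}, rust: {} };

// True only when the token is present in BOTH places, which is the
// condition for skipping Steps 2-6 and going straight to benchmarking.
function isAlreadyWired(
  lang: string,
  languages: readonly string[],
  extractors: Record<string, object>,
): boolean {
  const inUnion = languages.includes(lang);
  const hasExtractor = Object.prototype.hasOwnProperty.call(extractors, lang);
  return inUnion && hasExtractor;
}

console.log(isAlreadyWired("rust", SAMPLE_LANGUAGES, SAMPLE_EXTRACTORS)); // true
console.log(isAlreadyWired("lua", SAMPLE_LANGUAGES, SAMPLE_EXTRACTORS)); // false
```

Requiring both registries keeps a half-wired language (token present, extractor missing) from being mistaken for a supported one.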

Step 2 — Find a grammar, then health-check it

ls node_modules/tree-sitter-wasms/out/ | grep -i <lang>   # csharp -> c_sharp
  • Present → likely off-the-shelf; grammars.ts resolves it from tree-sitter-wasms automatically. (Many languages: elixir, zig, ocaml, solidity, toml, yaml, …)
  • Absent → vendor a .wasm into src/extraction/wasm/ (like pascal / scala / lua) and add the token to the vendored branch in Step 4.

Always health-check before writing an extractor — a present grammar can still be unusable:

node scripts/add-lang/check-grammar.mjs <lang> path/to/valid-sample.<ext>

It prints the grammar's ABI version and parses a valid sample many times in a multi-grammar runtime. A FAIL means ERROR trees on valid code: an old ABI is corrupting the shared WASM heap, which silently drops nested calls/imports on every file after the first (e.g. the tree-sitter-wasms Lua grammar is ABI 13 and fails). If it FAILs, do NOT use that wasm. Vendor a newer (ABI 14/15) build instead:

npm pack @tree-sitter-grammars/tree-sitter-<lang>   # often ships a prebuilt *.wasm
# or build one: npx tree-sitter build --wasm   (needs Docker/emscripten)
cp <the>.wasm src/extraction/wasm/tree-sitter-<lang>.wasm

then add the token to the vendored branch in Step 4 and re-run check-grammar on the vendored path until it PASSes. If you cannot obtain a healthy wasm, STOP and tell the user.
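
The reason for parsing the same valid sample many times is that heap corruption only shows up after the first parse. A minimal sketch of that loop, assuming a hypothetical ParseSample callback that reports whether the tree contains ERROR nodes; this illustrates the idea and is not the actual check-grammar.mjs implementation (the default of 20 runs is also an assumption):

```typescript
// Hypothetical parser shape: reports whether the parsed tree contains ERROR nodes.
type ParseSample = (source: string) => { hasError: boolean };

interface HealthResult {
  verdict: "PASS" | "FAIL";
  failedAtRun: number | null; // 1-based index of the first bad parse, if any
}

// Parses the same known-valid source repeatedly; any ERROR tree is a FAIL,
// because valid code should never produce one with a healthy grammar.
function checkGrammarHealth(
  parse: ParseSample,
  validSource: string,
  runs = 20, // assumed repeat count
): HealthResult {
  for (let run = 1; run <= runs; run++) {
    if (parse(validSource).hasError) {
      return { verdict: "FAIL", failedAtRun: run };
    }
  }
  return { verdict: "PASS", failedAtRun: null };
}
```

A grammar that only breaks from the second parse onward would pass a single-parse smoke test, which is why one clean parse is not enough evidence.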

Step 3 — Discover AST node types

Get a representative source file (write a small sample covering functions, classes/structs, imports, enums; or curl a raw file from a known repo), then:

node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
# vendored grammar: pass the wasm path instead of the token
node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>

The frequency table + field names (name:, parameters:, body:, return_type:) tell you what to map. Open the existing extractor closest to the language's paradigm as a model: rust.ts/scala.ts (functional, traits), java.ts/csharp.ts (OO), python.ts/ruby.ts (scripting), go.ts (top-level methods + receivers).
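
To make the frequency table concrete, here is a rough sketch of how node types can be counted over a parsed tree. The AstNode shape is a simplified, hypothetical stand-in for a tree-sitter node, and this is not the code in dump-ast.mjs:

```typescript
// Simplified, hypothetical stand-in for a tree-sitter syntax node.
interface AstNode {
  type: string;
  children: AstNode[];
}

// Walks the whole tree iteratively and returns [nodeType, count] pairs,
// most frequent first (ties broken alphabetically for stable output).
function nodeTypeFrequencies(root: AstNode): Array<[string, number]> {
  const counts = new Map<string, number>();
  const stack: AstNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    counts.set(node.type, (counts.get(node.type) ?? 0) + 1);
    stack.push(...node.children);
  }
  return Array.from(counts.entries()).sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0]),
  );
}
```

The types that dominate the table on a representative file are usually the ones worth mapping first.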

Step 4 — Wire the language (4 files)

This wiring is exact and fragile; match the existing style precisely:

  1. src/types.ts — TWO edits:
    • add '<lang>', to the LANGUAGES const (before 'unknown');
    • add '**/*.<ext>', to DEFAULT_CONFIG.include. Don't skip this — it's the file-scan allowlist; without the glob, codegraph init finds 0 files even though detection/extraction are wired.
  2. src/extraction/grammars.ts — three maps:
    • WASM_GRAMMAR_FILES: <lang>: 'tree-sitter-<lang>.wasm',
    • EXTENSION_MAP: each file extension → '<lang>' (e.g. '.lua': 'lua',)
    • getLanguageDisplayName: <lang>: '<Display Name>',
    • vendored only: add <lang> to the (lang === 'pascal' || lang === 'scala' || …) wasm-path branch.
  3. src/extraction/languages/<lang>.ts — new file exporting export const <lang>Extractor: LanguageExtractor = { … }. Map the node types from Step 3. Required fields: functionTypes, classTypes, methodTypes, interfaceTypes, structTypes, enumTypes, typeAliasTypes, importTypes, callTypes, variableTypes, nameField, bodyField, paramsField. Add hooks as the grammar needs them (getSignature, getVisibility, isExported, extractImport, visitNode, getReceiverType, interfaceKind, enumMemberTypes, etc. — see src/extraction/tree-sitter-types.ts).
  4. src/extraction/languages/index.ts: import { <lang>Extractor } from './<lang>'; and add <lang>: <lang>Extractor, to EXTRACTORS.
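
As an illustration of item 3, the sketch below shows the rough shape of an extractor object with the thirteen required fields. LanguageExtractorSketch is a local, hypothetical stand-in (the real LanguageExtractor type lives in src/extraction/tree-sitter-types.ts and may use different field types), and every node-type name is a placeholder to be replaced with what dump-ast.mjs reports for the grammar:

```typescript
// Local stand-in for LanguageExtractor; the field types are assumptions.
interface LanguageExtractorSketch {
  functionTypes: string[];
  classTypes: string[];
  methodTypes: string[];
  interfaceTypes: string[];
  structTypes: string[];
  enumTypes: string[];
  typeAliasTypes: string[];
  importTypes: string[];
  callTypes: string[];
  variableTypes: string[];
  nameField: string;
  bodyField: string;
  paramsField: string;
}

// Placeholder node-type names for a Lua-like grammar; confirm each one
// against the dump-ast.mjs output before relying on it.
const luaExtractorSketch: LanguageExtractorSketch = {
  functionTypes: ["function_declaration"],
  classTypes: [],
  methodTypes: [],
  interfaceTypes: [],
  structTypes: [],
  enumTypes: [],
  typeAliasTypes: [],
  importTypes: [], // no distinct import node when imports are plain calls
  callTypes: ["function_call"],
  variableTypes: ["variable_declaration"],
  nameField: "name",
  bodyField: "body",
  paramsField: "parameters",
};
```

Leaving a category as an empty array is a reasonable starting point when the language has no such construct; the optional hooks can be added once the basic mapping passes verification.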

Sometimes a 5th, core touch in src/extraction/tree-sitter.ts — variable extraction has per-language branches in extractVariable (the generic fallback only finds direct identifier/variable_declarator children). If the grammar nests declared names (e.g. Lua's variable_declaration → variable_list), add a } else if (this.language === '<lang>') branch there, mirroring the existing ts/python/go ones. Import forms that aren't a distinct node (Lua/Ruby require is a call) are handled in the extractor's visitNode hook instead.
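
The nesting problem can be sketched as follows. NamedNode is a simplified, hypothetical node shape, and declaredNames is an illustrative helper rather than the real extractVariable logic; it shows why a lookup limited to direct identifier children finds nothing when names sit under a wrapper such as variable_list:

```typescript
// Simplified, hypothetical stand-in for a tree-sitter node.
interface NamedNode {
  type: string;
  text: string;
  namedChildren: NamedNode[];
}

// Collects declared names, descending only through the wrapper node types
// that the grammar is known to nest names under.
function declaredNames(node: NamedNode, wrapperTypes: ReadonlySet<string>): string[] {
  const names: string[] = [];
  for (const child of node.namedChildren) {
    if (child.type === "identifier") {
      names.push(child.text);
    } else if (wrapperTypes.has(child.type)) {
      names.push(...declaredNames(child, wrapperTypes));
    }
  }
  return names;
}

// Assumed shape for a declaration of two names: variable_declaration -> variable_list -> identifier x2.
const sampleDeclaration: NamedNode = {
  type: "variable_declaration",
  text: "local a, b",
  namedChildren: [
    {
      type: "variable_list",
      text: "a, b",
      namedChildren: [
        { type: "identifier", text: "a", namedChildren: [] },
        { type: "identifier", text: "b", namedChildren: [] },
      ],
    },
  ],
};

console.log(declaredNames(sampleDeclaration, new Set<string>())); // []
console.log(declaredNames(sampleDeclaration, new Set(["variable_list"]))); // [ 'a', 'b' ]
```

With no wrapper types the helper behaves like the generic fallback and misses both names; naming the wrapper is the equivalent of adding the per-language branch.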

Step 5 — Build + verify loop

npm run build            # tsc + copy-assets (copies any vendored *.wasm into dist/)

Index a small sample repo and check extraction:

( cd <sample-repo> && codegraph init -i )
node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>

verify-extraction.mjs fails (exit 1) if the language isn't detected or only file/import nodes were produced — the classic symptom of wrong node-type names. On FAIL or a thin WARN: re-run dump-ast.mjs on a richer file, fix the mappings in <lang>.ts, npm run build, re-index, re-verify. Repeat until PASS.
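
The PASS/WARN/FAIL decision described above can be sketched like this. The function name, the input shape, and especially the thinSymbolsPerFile threshold are assumptions for illustration; the real thresholds live in verify-extraction.mjs:

```typescript
type Verdict = "PASS" | "WARN" | "FAIL";

// kindCounts maps a node kind (e.g. "file", "import", "function") to how
// many nodes of that kind were extracted for the language.
function classifyExtraction(
  fileCount: number,
  kindCounts: Record<string, number>,
  thinSymbolsPerFile = 1, // hypothetical WARN threshold
): Verdict {
  // No files at all means the language was never detected.
  if (fileCount === 0) return "FAIL";

  // Count everything that is a real symbol, i.e. not a file or import node.
  const symbolCount = Object.entries(kindCounts)
    .filter(([kind]) => kind !== "file" && kind !== "import")
    .reduce((sum, [, count]) => sum + count, 0);

  // Only file/import nodes: the classic sign of wrong node-type names.
  if (symbolCount === 0) return "FAIL";

  return symbolCount / fileCount < thinSymbolsPerFile ? "WARN" : "PASS";
}
```

For example, 10 files yielding only file and import nodes would FAIL, while 10 files yielding 3 functions would WARN under the assumed threshold.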

Step 6 — Tests

Add to __tests__/extraction.test.ts, modeled on the Rust Extraction block:

  • a detectLanguage assertion in describe('Language Detection')
  • a describe('<Lang> Extraction') block asserting functions/classes/imports are extracted from an inline source string.

npx vitest run __tests__/extraction.test.ts

The tests must be green before continuing.

Step 7 — Auto-pick 3 repos + corpus

Pick without asking. Find candidates, then curate 3 that are genuinely <lang>-dominant, one per size tier:

gh search repos --language=<lang> --sort=stars --limit 40 \
  --json fullName,stargazerCount,description

Tiers (match corpus.json): Small <~150 files · Medium ~150–1500 · Large >~1500. Skip repos that are tagged <lang> but mostly another language. Write one cross-file architecture question per repo (the kind that needs tracing across files). Add a "<Language>" block to .claude/skills/agent-eval/corpus.json (fields: name, repo, size, files, question) so /agent-eval can reuse them.
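
The tier boundaries and the corpus entry shape can be sketched as below. The exact cutoffs (the source gives them as approximate), the string values used for size, and the Candidate.files field (gh search does not return a file count, so it would have to be measured after cloning) are assumptions; fullName and stargazerCount mirror the --json fields requested above:

```typescript
type SizeTier = "Small" | "Medium" | "Large";

// Fields match the list above; the value format for size is an assumption.
interface CorpusEntry {
  name: string;
  repo: string;
  size: SizeTier;
  files: number;
  question: string;
}

// Approximate tiers: Small < 150 files, Medium 150 to 1500, Large > 1500.
function sizeTier(files: number): SizeTier {
  if (files < 150) return "Small";
  if (files <= 1500) return "Medium";
  return "Large";
}

interface Candidate {
  fullName: string;
  stargazerCount: number;
  files: number; // hypothetical: counted locally, not returned by gh
}

// Keeps the highest-starred candidate in each tier. Checking that a repo is
// genuinely dominated by the language still has to happen before this step.
function pickOnePerTier(candidates: Candidate[]): Partial<Record<SizeTier, Candidate>> {
  const picks: Partial<Record<SizeTier, Candidate>> = {};
  for (const candidate of candidates) {
    const tier = sizeTier(candidate.files);
    const current = picks[tier];
    if (current === undefined || candidate.stargazerCount > current.stargazerCount) {
      picks[tier] = candidate;
    }
  }
  return picks;
}
```

If a tier ends up empty, widening the search limit is likely a better fix than picking two repos from the same tier.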

Step 8 — Benchmark all 3 (extraction + A/B)

Make the dev build the codegraph on PATH once, then loop:

npm run build && ./scripts/local-install.sh
scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless   # ×3

bench.sh clones (shared /tmp/codegraph-corpus), wipes + indexes, runs verify-extraction.mjs, then the with/without retrieval A/B via scripts/agent-eval/run-all.sh (skips the paid A/B if extraction is broken). Read each parse-run.mjs summary printed by run-all.sh: tool calls, file Reads, Grep/Bash, codegraph-tool calls, duration, and cost — for both the with and without arms. After the loop, restore the dev link if needed: ./scripts/local-install.sh.
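
When reading the two summaries side by side, it can help to reduce them to percentage reductions. A small sketch, assuming a hypothetical ArmSummary shape with four of the metrics listed above; the field names are illustrative and are not parse-run.mjs's actual output format:

```typescript
// Hypothetical per-arm summary; field names are illustrative.
interface ArmSummary {
  toolCalls: number;
  fileReads: number;
  durationSec: number;
  costUsd: number;
}

// Percent saved relative to the without-codegraph arm (positive = codegraph
// used less). Returns null when the baseline is zero and no ratio exists.
function percentReduction(withoutValue: number, withValue: number): number | null {
  if (withoutValue === 0) return null;
  return Math.round(((withoutValue - withValue) / withoutValue) * 100);
}

function compareArms(
  withArm: ArmSummary,
  withoutArm: ArmSummary,
): Record<keyof ArmSummary, number | null> {
  return {
    toolCalls: percentReduction(withoutArm.toolCalls, withArm.toolCalls),
    fileReads: percentReduction(withoutArm.fileReads, withArm.fileReads),
    durationSec: percentReduction(withoutArm.durationSec, withArm.durationSec),
    costUsd: percentReduction(withoutArm.costUsd, withArm.costUsd),
  };
}
```

A negative value means the codegraph arm used more of that resource, which is worth calling out in the report rather than averaging away.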

Step 9 — Docs + CHANGELOG

  • README.md: add <Lang> to the "19+ Languages" feature bullet, and add a row to the Supported Languages table: | <Lang> | `.ext` | Full support (classes, methods, …) |
  • CHANGELOG.md: add an ## [Unreleased] section at the top (above the latest version) with an ### Added subsection containing a user-perspective bullet, e.g. "CodeGraph now indexes <Lang> (.ext): functions, classes, imports, and call edges." If ## [Unreleased] already exists, append under it. (/publish folds this into the next versioned block at release time.)
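
The CHANGELOG rule (create ## [Unreleased] above the latest version, or append under it if it exists) can be sketched as a pure string transformation. The helper name addUnreleasedEntry and the "- " bullet style are assumptions; check the existing CHANGELOG.md for its actual conventions before mirroring this:

```typescript
// Inserts a bullet under "## [Unreleased]" -> "### Added", creating either
// heading when it is missing. Assumes "- " bullets and "## [x.y.z]" headings.
function addUnreleasedEntry(changelog: string, bullet: string): string {
  const lines = changelog.split("\n");
  const unreleased = lines.findIndex((l) => l.trim() === "## [Unreleased]");

  if (unreleased === -1) {
    // No Unreleased section yet: place one above the first version heading.
    const firstVersion = lines.findIndex((l) => /^## \[/.test(l));
    const at = firstVersion === -1 ? lines.length : firstVersion;
    lines.splice(at, 0, "## [Unreleased]", "", "### Added", "", `- ${bullet}`, "");
    return lines.join("\n");
  }

  // The Unreleased section ends at the next "## " heading (or end of file).
  let end = lines.findIndex((l, i) => i > unreleased && /^## /.test(l));
  if (end === -1) end = lines.length;

  const added = lines.findIndex(
    (l, i) => i > unreleased && i < end && l.trim() === "### Added",
  );
  if (added === -1) {
    lines.splice(unreleased + 1, 0, "", "### Added", "", `- ${bullet}`);
    return lines.join("\n");
  }

  // Append after the last existing bullet of the Added subsection.
  let insertAt = added + 1;
  for (let i = added + 1; i < end; i++) {
    const line = lines[i] ?? "";
    if (/^### /.test(line)) break;
    if (line.startsWith("- ")) insertAt = i + 1;
  }
  lines.splice(insertAt, 0, `- ${bullet}`);
  return lines.join("\n");
}
```

Keeping the transformation pure (string in, string out) makes it easy to diff the result before writing the file, which fits the rule that all changes are left for the user to review.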
Step 10 — Report (do NOT commit)

Summarize for review:

  • Files changed: the Step 4 wiring edits (including the new extractor, plus src/extraction/tree-sitter.ts if touched) + tests + README + CHANGELOG + corpus.json (+ any vendored .wasm).
  • Extraction per repo: files / nodes / edges / verify-extraction result.
  • A/B per repo: with vs without (tool calls, file Reads, cost) and a one-line verdict: did codegraph reduce effort, and did both arms reach a correct answer?
  • Gaps / follow-ups (node types not yet mapped, resolution edges missing, framework routes, etc.).

Hand the changes to the user. Do not run git commit/push, npm publish, or scripts/release.sh.

Notes

  • The A/B spawns real paid claude -p runs (opus, --max-budget-usd), 2 arms × 3 repos. The corpus dir /tmp/codegraph-corpus is shared with /agent-eval, so clones are reused across runs.
  • Any new *.wasm must live in src/extraction/wasm/; copy-assets (run by npm run build) ships it from there, otherwise it won't be in dist/.
  • An index must be served by the same binary that built it. Step 8 builds + links the dev build first, so this holds.
  • If a grammar can't be obtained, or extraction can't reach PASS, STOP and report; don't ship a half-wired language.