feat: 3-tier eval suite with planted-bug outcome testing (EVALS=1)

Adds comprehensive eval infrastructure:
- Tier 1 (free): 13 new static tests covering cross-skill path consistency, QA structure validation, greptile format, planted-bug fixture validation
- Tier 2 (Agent SDK E2E): /qa quick, /review with pre-built git repo, 3 planted-bug outcome evals (static, SPA, checkout; each with 5 bugs)
- Tier 3 (LLM judge): QA workflow quality, health rubric clarity, cross-skill consistency, baseline score pinning

New fixtures: 3 HTML pages with 15 total planted bugs, ground truth JSON, review-eval-vuln.rb, eval-baselines.json. Shared llm-judge.ts helper (DRY).

Unified EVALS=1 flag replaces SKILL_E2E + ANTHROPIC_API_KEY checks. `bun run test:evals` runs everything that costs money (~$4/run).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
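
A minimal sketch of how a single EVALS=1 gate for the paid tiers might look. This is not the code from this commit: `evalsEnabled`, `gatePaidEvals`, and `GateResult` are hypothetical names, and the extra ANTHROPIC_API_KEY check is an assumption based on the CLAUDE.md note that all eval commands need the key.

```typescript
// Hypothetical helpers for illustration only; not taken from the commit.
type Env = Record<string, string | undefined>;

interface GateResult {
  run: boolean;   // whether paid evals should execute
  reason: string; // human-readable explanation for logs
}

// True only when the unified flag is set to exactly "1".
function evalsEnabled(env: Env): boolean {
  return env.EVALS === "1";
}

// Decide whether the paid tiers (Agent SDK E2E, LLM judge) should run.
// Assumption: the API key is still required once EVALS=1 is set.
function gatePaidEvals(env: Env): GateResult {
  if (!evalsEnabled(env)) {
    return { run: false, reason: "EVALS=1 not set; skipping paid evals" };
  }
  if (!env.ANTHROPIC_API_KEY) {
    return { run: false, reason: "EVALS=1 set but ANTHROPIC_API_KEY is missing" };
  }
  return { run: true, reason: "paid evals enabled" };
}
```

In a test file, the result of `evalsEnabled(process.env)` could feed a conditional skip (for example Bun's `test.skipIf`), so that plain `bun test` stays free and only `EVALS=1` runs incur cost.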
2026-05-08 21:49:45 +08:00 · 2026-03-14 01:17:36 -05:00
parent 5155fe3a28
commit 76803d789a
17 changed files with 1352 additions and 94 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,9 +4,11 @@

 ```bash
 bun install          # install dependencies
-bun test             # run tests (browse + snapshot + skill validation)
-bun run test:eval    # run LLM-as-judge evals (needs ANTHROPIC_API_KEY)
-bun run test:e2e     # run E2E skill tests (needs SKILL_E2E=1, ~$0.50/run)
+bun test             # run free tests (browse + snapshot + skill validation)
+bun run test:evals   # run ALL paid evals: LLM judge + Agent SDK E2E (~$4/run)
+bun run test:eval    # run LLM-as-judge evals only (~$0.15/run)
+bun run test:e2e     # run Agent SDK E2E tests only (~$3.85/run)
+bun run test:all     # free tests + all evals
 bun run dev <cmd>    # run CLI in dev mode, e.g. bun run dev goto https://example.com
 bun run build        # gen docs + compile binaries
 bun run gen:skill-docs  # regenerate SKILL.md files from templates
@@ -14,6 +16,9 @@ bun run skill:check  # health dashboard for all skills
 bun run dev:skill    # watch mode: auto-regen + validate on change
 ```

+All eval commands require `ANTHROPIC_API_KEY` in your environment. E2E tests must
+be run from a plain terminal (not inside Claude Code — nested sessions hang).
+
 ## Project structure

 ```
@@ -29,11 +34,12 @@ gstack/
 │   ├── skill-check.ts     # Health dashboard
 │   └── dev-skill.ts       # Watch mode
 ├── test/            # Skill validation + eval tests
-│   ├── helpers/     # skill-parser.ts, session-runner.ts
-│   ├── skill-validation.test.ts  # Tier 1: static command validation
-│   ├── gen-skill-docs.test.ts    # Tier 1: generator + quality evals
-│   ├── skill-e2e.test.ts         # Tier 2: Agent SDK E2E
-│   └── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge
+│   ├── helpers/     # skill-parser.ts, session-runner.ts, llm-judge.ts
+│   ├── fixtures/    # Ground truth JSON, planted-bug fixtures, eval baselines
+│   ├── skill-validation.test.ts  # Tier 1: static validation (free, <1s)
+│   ├── gen-skill-docs.test.ts    # Tier 1: generator quality (free, <1s)
+│   ├── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge (~$0.15/run)
+│   └── skill-e2e.test.ts         # Tier 2: Agent SDK E2E (~$3.85/run)
 ├── ship/            # Ship workflow skill
 ├── review/          # PR review skill
 ├── plan-ceo-review/ # /plan-ceo-review skill