feat: 3-tier eval suite with planted-bug outcome testing (EVALS=1)

Adds comprehensive eval infrastructure:
- Tier 1 (free): 13 new static tests — cross-skill path consistency, QA
  structure validation, greptile format, planted-bug fixture validation
- Tier 2 (Agent SDK E2E): /qa quick, /review with pre-built git repo,
  3 planted-bug outcome evals (static, SPA, checkout — each with 5 bugs)
- Tier 3 (LLM judge): QA workflow quality, health rubric clarity,
  cross-skill consistency, baseline score pinning
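
As a rough illustration of what "baseline score pinning" could look like, the sketch below compares current judge scores against pinned ones and reports any metric that dropped. The `Scores` shape and the `findRegressions` name are hypothetical; the actual layout of eval-baselines.json is not shown in this commit.

```typescript
// Hypothetical shape: metric name -> pinned judge score (assumed, not from the repo).
type Scores = Record<string, number>;

// Returns the metrics whose current score fell below the pinned baseline
// by more than `tolerance`, or that are missing from the current run.
function findRegressions(
  baseline: Scores,
  current: Scores,
  tolerance: number = 0
): string[] {
  const regressions: string[] = [];
  for (const name of Object.keys(baseline)) {
    const pinned = baseline[name];
    if (pinned === undefined) continue;
    const score = current[name];
    if (score === undefined || score < pinned - tolerance) {
      regressions.push(name);
    }
  }
  return regressions;
}
```

A test could then assert that the returned list is empty, so a prompt change that lowers a pinned score fails the run.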

New fixtures: 3 HTML pages with 15 total planted bugs, ground truth JSON,
review-eval-vuln.rb, eval-baselines.json. Shared llm-judge.ts helper (DRY).
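
A minimal sketch of how a planted-bug outcome eval might score a QA report against ground truth. The `PlantedBug` schema, the keyword matching, and the `scoreReport` name are assumptions for illustration; the real fixture JSON and matching logic may differ.

```typescript
// Hypothetical ground-truth entry; the real fixture schema may differ.
interface PlantedBug {
  id: string;
  keywords: string[]; // any keyword found in the report counts as a detection
}

interface DetectionResult {
  detected: string[];
  missed: string[];
  rate: number; // fraction of planted bugs detected, 0 when there are none
}

// Case-insensitive keyword match of each planted bug against the report text.
function scoreReport(report: string, bugs: PlantedBug[]): DetectionResult {
  const text = report.toLowerCase();
  const detected: string[] = [];
  const missed: string[] = [];
  for (const bug of bugs) {
    const hit = bug.keywords.some((k) => text.includes(k.toLowerCase()));
    (hit ? detected : missed).push(bug.id);
  }
  const rate = bugs.length === 0 ? 0 : detected.length / bugs.length;
  return { detected, missed, rate };
}
```

An eval could then require `rate` to meet a minimum threshold for each of the three fixture pages.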

Unified EVALS=1 flag replaces SKILL_E2E + ANTHROPIC_API_KEY checks.
`bun run test:evals` runs everything that costs money (~$4/run).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Garry Tan
2026-03-14 01:17:36 -05:00
parent 5155fe3a28
commit 76803d789a
17 changed files with 1352 additions and 94 deletions


@@ -4,9 +4,11 @@
```bash
bun install # install dependencies
bun test # run tests (browse + snapshot + skill validation)
bun run test:eval # run LLM-as-judge evals (needs ANTHROPIC_API_KEY)
bun run test:e2e # run E2E skill tests (needs SKILL_E2E=1, ~$0.50/run)
bun test # run free tests (browse + snapshot + skill validation)
bun run test:evals # run ALL paid evals: LLM judge + Agent SDK E2E (~$4/run)
bun run test:eval # run LLM-as-judge evals only (~$0.15/run)
bun run test:e2e # run Agent SDK E2E tests only (~$3.85/run)
bun run test:all # free tests + all evals
bun run dev <cmd> # run CLI in dev mode, e.g. bun run dev goto https://example.com
bun run build # gen docs + compile binaries
bun run gen:skill-docs # regenerate SKILL.md files from templates
@@ -14,6 +16,9 @@ bun run skill:check # health dashboard for all skills
bun run dev:skill # watch mode: auto-regen + validate on change
```
All eval commands require `ANTHROPIC_API_KEY` in your environment. E2E tests must
be run from a plain terminal (not inside Claude Code — nested sessions hang).
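
The gating described above could be sketched roughly as follows. This is an assumption about how a test file might check the two variables, not the repository's actual code; `evalGate` and `EvalGate` are hypothetical names.

```typescript
// Environment passed in explicitly so the helper is easy to test;
// in a real test file this would typically be process.env.
type Env = Record<string, string | undefined>;

interface EvalGate {
  run: boolean;
  reason: string;
}

// Hypothetical helper: paid evals run only when EVALS=1 is set
// and an API key is present.
function evalGate(env: Env): EvalGate {
  if (env["EVALS"] !== "1") {
    return { run: false, reason: "EVALS=1 not set; skipping paid evals" };
  }
  if (!env["ANTHROPIC_API_KEY"]) {
    return { run: false, reason: "EVALS=1 set but ANTHROPIC_API_KEY is missing" };
  }
  return { run: true, reason: "paid evals enabled" };
}
```

In a Bun test file, the result could feed something like `test.skipIf(!evalGate(process.env).run)` so that a plain `bun test` stays free.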
## Project structure
```
@@ -29,11 +34,12 @@ gstack/
│ ├── skill-check.ts # Health dashboard
│ └── dev-skill.ts # Watch mode
├── test/ # Skill validation + eval tests
│ ├── helpers/ # skill-parser.ts, session-runner.ts
│ ├── skill-validation.test.ts # Tier 1: static command validation
│ ├── gen-skill-docs.test.ts # Tier 1: generator + quality evals
│ ├── skill-e2e.test.ts # Tier 2: Agent SDK E2E
│ └── skill-llm-eval.test.ts # Tier 3: LLM-as-judge
│ ├── helpers/ # skill-parser.ts, session-runner.ts, llm-judge.ts
│ ├── fixtures/ # Ground truth JSON, planted-bug fixtures, eval baselines
│ ├── skill-validation.test.ts # Tier 1: static validation (free, <1s)
│ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s)
│ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run)
│ └── skill-e2e.test.ts # Tier 2: Agent SDK E2E (~$3.85/run)
├── ship/ # Ship workflow skill
├── review/ # PR review skill
├── plan-ceo-review/ # /plan-ceo-review skill