feat: eval CLI tools + docs cleanup

Add eval:list, eval:compare, eval:summary CLI scripts for exploring eval
history from ~/.gstack-dev/evals/. eval:compare reuses the shared comparison
functions from eval-store.ts.

- eval:list: sorted table with branch/tier/cost filters
- eval:compare: thin wrapper around compareEvalResults + formatComparison
- eval:summary: aggregate stats, flaky test detection, branch rankings
- Remove unused @anthropic-ai/claude-agent-sdk from devDependencies
- Update CLAUDE.md: streaming docs, eval CLI commands, remove Agent SDK refs
- Add GH Actions eval upload (P2) and web dashboard (P3) to TODOS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-20 11:19:56 +08:00 · 2026-03-14 03:49:57 -05:00
parent 84f52f3bad
commit ed802d0c7f
6 changed files with 373 additions and 11 deletions
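
The commit message mentions flaky test detection in eval:summary, but the script itself is not part of the diff shown here. As a rough illustration only, the sketch below shows one way such detection could work, assuming a test counts as flaky when it both passed and failed across the stored runs. The `EvalRun` and `EvalTestResult` shapes and the `findFlakyTests` name are hypothetical and are not taken from eval-store.ts.

```typescript
// Hypothetical record shapes; the real eval-store.ts schema is not shown in this commit.
interface EvalTestResult {
  name: string;
  passed: boolean;
}

interface EvalRun {
  timestamp: string; // assumed ISO 8601 string
  branch: string;
  tests: EvalTestResult[];
}

// Assumed definition of "flaky": the same test name has at least one pass
// and at least one failure across the given runs.
function findFlakyTests(runs: EvalRun[]): string[] {
  const outcomes = new Map<string, Set<boolean>>();
  for (const run of runs) {
    for (const test of run.tests) {
      const seen = outcomes.get(test.name) ?? new Set<boolean>();
      seen.add(test.passed);
      outcomes.set(test.name, seen);
    }
  }
  // A set holding both true and false means mixed outcomes.
  return Array.from(outcomes.entries())
    .filter(([, seen]) => seen.size === 2)
    .map(([name]) => name)
    .sort();
}
```

Under this assumption, a test that fails in every run is reported as consistently broken rather than flaky, which keeps the two cases separate in a summary.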
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -5,20 +5,21 @@
 ```bash
 bun install          # install dependencies
 bun test             # run free tests (browse + snapshot + skill validation)
-bun run test:evals   # run paid evals: LLM judge + Agent SDK E2E (~$4/run)
-bun run test:e2e     # run Agent SDK E2E tests only (~$3.85/run)
+bun run test:evals   # run paid evals: LLM judge + E2E (~$4/run)
+bun run test:e2e     # run E2E tests only (~$3.85/run)
 bun run dev <cmd>    # run CLI in dev mode, e.g. bun run dev goto https://example.com
 bun run build        # gen docs + compile binaries
 bun run gen:skill-docs  # regenerate SKILL.md files from templates
 bun run skill:check  # health dashboard for all skills
 bun run dev:skill    # watch mode: auto-regen + validate on change
+bun run eval:list    # list all eval runs from ~/.gstack-dev/evals/
+bun run eval:compare # compare two eval runs (auto-picks most recent)
+bun run eval:summary # aggregate stats across all eval runs
 ```

-`test:evals` requires `ANTHROPIC_API_KEY` and must be run from a plain terminal
-(not inside Claude Code — nested Agent SDK sessions hang).
-
-**Update (v0.3.5):** The session runner now strips CLAUDE* env vars automatically,
-so `test:evals` may work inside Claude Code. If E2E tests hang, run from a plain terminal.
+`test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time
+(tool-by-tool via `--output-format stream-json --verbose`). Results are persisted
+to `~/.gstack-dev/evals/` with auto-comparison against the previous run.

 ## Project structure

@@ -35,12 +36,12 @@ gstack/
 │   ├── skill-check.ts     # Health dashboard
 │   └── dev-skill.ts       # Watch mode
 ├── test/            # Skill validation + eval tests
-│   ├── helpers/     # skill-parser.ts, session-runner.ts, llm-judge.ts
+│   ├── helpers/     # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts
 │   ├── fixtures/    # Ground truth JSON, planted-bug fixtures, eval baselines
 │   ├── skill-validation.test.ts  # Tier 1: static validation (free, <1s)
 │   ├── gen-skill-docs.test.ts    # Tier 1: generator quality (free, <1s)
 │   ├── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge (~$0.15/run)
-│   └── skill-e2e.test.ts         # Tier 2: Agent SDK E2E (~$3.85/run)
+│   └── skill-e2e.test.ts         # Tier 2: E2E via claude -p (~$3.85/run)
 ├── ship/            # Ship workflow skill
 ├── review/          # PR review skill
 ├── plan-ceo-review/ # /plan-ceo-review skill