chore: bump v0.3.10, update CHANGELOG and docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-20 11:19:56 +08:00 · 2026-03-15 16:55:34 -05:00
parent 33c9552870
commit e28033353d
4 changed files with 24 additions and 2 deletions
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -134,6 +134,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
 bun run eval:list            # list all eval runs
 bun run eval:compare         # compare two runs (auto-picks most recent)
 bun run eval:summary         # aggregate stats across all runs
+bun run eval:trend           # per-test pass rate over last N runs (flaky detection)
+bun run eval:cache stats     # check LLM judge cache hit rate
 ```

 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
@@ -152,7 +154,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
 # Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
 ```

- Uses `claude-sonnet-4-6` for scoring stability
+- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
+- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
 - Tests live in `test/skill-llm-eval.test.ts`
 - Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code