Mirror of https://github.com/garrytan/gstack.git (synced 2026-05-18 10:31:30 +08:00)
Merge remote-tracking branch 'origin/garrytan/team-supabase-store' into garrytan/dev-mode
@@ -169,6 +169,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
bun run eval:list # list all eval runs
bun run eval:compare # compare two runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all runs
bun run eval:trend # per-test pass rate over last N runs (flaky detection)
bun run eval:cache stats # check LLM judge cache hit rate
```
Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
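As a rough illustration of the trend analysis these accumulated artifacts make possible, the sketch below computes a per-test pass rate over the last N runs and flags tests that neither always pass nor always fail. The `RunResult` shape and the function names `passRates` and `flakyTests` are hypothetical; the actual artifact format in `~/.gstack-dev/` and the implementation behind `eval:trend` are not shown here and may differ.

```typescript
// Hypothetical shape: one record per eval run, mapping test name -> passed?
type RunResult = Record<string, boolean>;

// Pass rate per test over the most recent `lastN` runs (runs are oldest-first).
function passRates(runs: RunResult[], lastN: number): Map<string, number> {
  const recent = lastN > 0 ? runs.slice(-lastN) : [];
  const tally = new Map<string, { passed: number; total: number }>();
  for (const run of recent) {
    for (const name of Object.keys(run)) {
      const t = tally.get(name) ?? { passed: 0, total: 0 };
      t.total += 1;
      if (run[name]) t.passed += 1;
      tally.set(name, t);
    }
  }
  const rates = new Map<string, number>();
  tally.forEach((t, name) => rates.set(name, t.passed / t.total));
  return rates;
}

// A test is treated as flaky here if it sometimes passes and sometimes fails.
function flakyTests(rates: Map<string, number>): string[] {
  const names: string[] = [];
  rates.forEach((rate, name) => {
    if (rate > 0 && rate < 1) names.push(name);
  });
  return names.sort();
}
```

Under these assumptions, a test that passed in two of the last three runs gets a rate of about 0.67 and is reported as flaky, while a test that passed in all three is not.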
@@ -187,7 +189,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
# Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
```
- Uses `claude-sonnet-4-6` for scoring stability
- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
- Tests live in `test/skill-llm-eval.test.ts`
- Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code
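The tier override and SHA-keyed cache described in the list above could look roughly like the following. This is a minimal sketch, not the code in `test/skill-llm-eval.test.ts`: the helper names (`resolveJudgeModel`, `cacheKey`, `judgeWithCache`) are hypothetical, the haiku and opus model IDs are placeholders, and including the model in the cache key is an assumption. The judge call is passed in as a callback so the caching logic stays independent of the API client.

```typescript
import { createHash } from "node:crypto";

type Env = Record<string, string | undefined>;
type Tier = "haiku" | "sonnet" | "opus";

// Only the sonnet ID is named in the notes above; the others are placeholders.
const MODEL_BY_TIER: Record<Tier, string> = {
  haiku: "<haiku-model-id>",
  sonnet: "claude-sonnet-4-6",
  opus: "<opus-model-id>",
};

// EVAL_JUDGE_TIER=haiku|opus overrides the default; anything else falls back to sonnet.
function resolveJudgeModel(env: Env): string {
  const tier = env.EVAL_JUDGE_TIER;
  return tier === "haiku" || tier === "opus" ? MODEL_BY_TIER[tier] : MODEL_BY_TIER.sonnet;
}

// SHA-256 over model + content (assumption: the key covers both).
function cacheKey(model: string, skillContent: string): string {
  return createHash("sha256").update(`${model}\n${skillContent}`).digest("hex");
}

// Returns the cached result for unchanged content unless EVAL_CACHE=0.
// `callJudge` stands in for the real API call; it may return a Promise.
function judgeWithCache<T>(
  skillContent: string,
  env: Env,
  cache: Map<string, T>,
  callJudge: (model: string, content: string) => T,
): T {
  const model = resolveJudgeModel(env);
  const key = cacheKey(model, skillContent);
  if (env.EVAL_CACHE !== "0") {
    const hit = cache.get(key);
    if (hit !== undefined) return hit; // cache hit: no API call
  }
  const result = callJudge(model, skillContent);
  cache.set(key, result); // refresh the entry even on a forced re-run
  return result;
}
```

In practice `env` would be `process.env` and the cache would be persisted to disk rather than held in a `Map`; both are simplified here to keep the sketch self-contained.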