mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-19 10:52:28 +08:00
feat: diff-based test selection for E2E and LLM-judge evals (v0.6.1.0) (#139)
* feat: diff-based test selection for E2E and LLM-judge evals Each test declares file dependencies in a TOUCHFILES map. The test runner checks git diff against the base branch and only runs tests whose dependencies were modified. Global touchfiles (session-runner, eval-store, gen-skill-docs) trigger all tests. New scripts: test:e2e:all, test:evals:all, eval:select Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.6.1.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: plan-design-review-audit eval — bump turns to 30, add efficiency hints The test was flaky at 20 turns because the agent reads a 300-line SKILL.md, navigates, extracts design data, and writes a report. Added hints to skip preamble/batch commands/write early while still testing the real SKILL.md. Now completes in ~13 turns consistently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
13
CLAUDE.md
13
CLAUDE.md
@@ -5,8 +5,11 @@
|
||||
```bash
|
||||
bun install # install dependencies
|
||||
bun test # run free tests (browse + snapshot + skill validation)
|
||||
bun run test:evals # run paid evals: LLM judge + E2E (~$4/run)
|
||||
bun run test:e2e # run E2E tests only (~$3.85/run)
|
||||
bun run test:evals # run paid evals: LLM judge + E2E (diff-based, ~$4/run max)
|
||||
bun run test:evals:all # run ALL paid evals regardless of diff
|
||||
bun run test:e2e # run E2E tests only (diff-based, ~$3.85/run max)
|
||||
bun run test:e2e:all # run ALL E2E tests regardless of diff
|
||||
bun run eval:select # show which tests would run based on current diff
|
||||
bun run dev <cmd> # run CLI in dev mode, e.g. bun run dev goto https://example.com
|
||||
bun run build # gen docs + compile binaries
|
||||
bun run gen:skill-docs # regenerate SKILL.md files from templates
|
||||
@@ -21,6 +24,12 @@ bun run eval:summary # aggregate stats across all eval runs
|
||||
(tool-by-tool via `--output-format stream-json --verbose`). Results are persisted
|
||||
to `~/.gstack-dev/evals/` with auto-comparison against the previous run.
|
||||
|
||||
**Diff-based test selection:** `test:evals` and `test:e2e` auto-select tests based
|
||||
on `git diff` against the base branch. Each test declares its file dependencies in
|
||||
`test/helpers/touchfiles.ts`. Changes to global touchfiles (session-runner, eval-store,
|
||||
llm-judge, gen-skill-docs) trigger all tests. Use `EVALS_ALL=1` or the `:all` script
|
||||
variants to force all tests. Run `eval:select` to preview which tests would run.
|
||||
|
||||
## Project structure
|
||||
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user