mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-22 04:38:24 +08:00
test: add gate-tier mode-posture regression tests
Three gate-tier E2E tests detect when preamble / template changes flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or /office-hours (startup Q3, builder mode). The V1 regression that this PR fixes shipped without anyone catching it at ship time; this is the ongoing signal so the same thing doesn't happen again.

Pieces:

- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet judge with mode-specific dual-axis rubric (expansion: surface_framing + decision_preservation; forcing: stacking_preserved + domain_matched_consequence; builder: unexpected_combinations + excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/`: deterministic input for expansion proposal generation, Q3 forcing question, and builder adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge: Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with `office-hours-forcing-energy` + `office-hours-builder-wildness` cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts`: all three as `gate` tier in `E2E_TIERS`, triggered by changes to `scripts/resolvers/preamble.ts`, the relevant skill template, the judge helper, or any mode-posture fixture.

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: judges test prose markers more than deep behavior. V1.2 candidate is a cross-provider (Codex) adversarial judge on the same output to decouple house-style bias.
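For reviewers, the "4/5 on both axes" rule might look roughly like the sketch below. It is illustrative only: the axis names come from this commit message, but `passesPosture`, `RUBRIC_AXES`, `PASS_THRESHOLD`, and the score shape are hypothetical and are not taken from `test/helpers/llm-judge.ts`.

```typescript
// Hypothetical sketch of the dual-axis pass rule; not the actual judge helper.
type PostureMode = "expansion" | "forcing" | "builder";

// Axis names per mode, as listed in the commit message.
const RUBRIC_AXES: Record<PostureMode, readonly [string, string]> = {
  expansion: ["surface_framing", "decision_preservation"],
  forcing: ["stacking_preserved", "domain_matched_consequence"],
  builder: ["unexpected_combinations", "excitement_over_optimization"],
};

// Assumed scale: integer scores from 1 to 5, pass at 4 or above.
const PASS_THRESHOLD = 4;

// Assumed judge output shape: axis name -> score.
type AxisScores = Record<string, number>;

function passesPosture(mode: PostureMode, scores: AxisScores): boolean {
  // Both axes must clear the threshold; a missing axis fails instead of passing silently.
  return RUBRIC_AXES[mode].every((axis) => {
    const score = scores[axis];
    return typeof score === "number" && score >= PASS_THRESHOLD;
  });
}
```

Requiring both axes, with no averaging, means a strong score on one axis cannot mask a flattened posture on the other, which matches the "on both axes" wording above.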
test/fixtures/mode-posture/builder-idea.md
@@ -0,0 +1,15 @@
# Weekend Project: Dependency Graph Visualizer

I want to build a tool that takes a codebase and visualizes its dependency graph — modules, imports, which files depend on which. For fun, for learning. Maybe open-source it.

## What I have so far

- Rough idea: point it at a repo, get an interactive graph
- Stack I'm leaning toward: TypeScript + D3 or Cytoscape for rendering
- Potential: could work for JS/TS first, maybe Python later

## What I don't know yet

- How to make the visualization actually useful vs just pretty
- Whether this should be a CLI, a web tool, or a VS Code extension
- What would make someone else want to use it