mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-18 10:31:30 +08:00
test: add gate-tier mode-posture regression tests
Three gate-tier E2E tests detect when preamble / template changes flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or /office-hours (startup Q3, builder mode). The V1 regression that this PR fixes shipped without anyone catching it at ship time; this is the ongoing signal so the same thing doesn't happen again.

Pieces:

- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet judge with mode-specific dual-axis rubric (expansion: surface_framing + decision_preservation; forcing: stacking_preserved + domain_matched_consequence; builder: unexpected_combinations + excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/`: deterministic input for expansion proposal generation, Q3 forcing question, and builder adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge: Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with `office-hours-forcing-energy` + `office-hours-builder-wildness` cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts`: all three as `gate` tier in `E2E_TIERS`, triggered by changes to `scripts/resolvers/preamble.ts`, the relevant skill template, the judge helper, or any mode-posture fixture.

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: judges test prose markers more than deep behavior. V1.2 candidate is a cross-provider (Codex) adversarial judge on the same output to decouple house-style bias.
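For illustration only, the dual-axis pass rule described in the commit message might look roughly like the sketch below. It is not the code in `test/helpers/llm-judge.ts`: the names `RUBRIC_AXES`, `PASS_THRESHOLD`, `AxisScores`, and `passesPosture` are hypothetical, the score shape is assumed, and "4/5 on both axes" is read here as "each axis scores at least 4 out of 5". Only the mode and axis labels come from the message itself.

```typescript
// Hypothetical sketch of the pass rule; not the repository's implementation.
type PostureMode = "expansion" | "forcing" | "builder";

// Axis names per mode, copied from the commit message.
const RUBRIC_AXES: Record<PostureMode, readonly [string, string]> = {
  expansion: ["surface_framing", "decision_preservation"],
  forcing: ["stacking_preserved", "domain_matched_consequence"],
  builder: ["unexpected_combinations", "excitement_over_optimization"],
};

// Assumed reading of "4/5 on both axes": each axis must score >= 4 out of 5.
const PASS_THRESHOLD = 4;

// Assumed shape of the judge's scores: axis name -> score.
type AxisScores = Record<string, number>;

// Returns true only when both axes for the given mode meet the threshold.
// A missing axis counts as a failure rather than a pass.
function passesPosture(mode: PostureMode, scores: AxisScores): boolean {
  return RUBRIC_AXES[mode].every((axis) => {
    const score = scores[axis];
    return typeof score === "number" && score >= PASS_THRESHOLD;
  });
}
```

Under these assumptions, a strong score on one axis cannot offset a weak score on the other, which is the point of requiring the threshold on both rather than averaging them.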
test/fixtures/mode-posture/builder-idea.md
@@ -0,0 +1,15 @@
# Weekend Project: Dependency Graph Visualizer

I want to build a tool that takes a codebase and visualizes its dependency graph — modules, imports, which files depend on which. For fun, for learning. Maybe open-source it.

## What I have so far

- Rough idea: point it at a repo, get an interactive graph
- Stack I'm leaning toward: TypeScript + D3 or Cytoscape for rendering
- Potential: could work for JS/TS first, maybe Python later

## What I don't know yet

- How to make the visualization actually useful vs just pretty
- Whether this should be a CLI, a web tool, or a VS Code extension
- What would make someone else want to use it
test/fixtures/mode-posture/expansion-plan.md
@@ -0,0 +1,23 @@
# Plan: Team Velocity Dashboard

## Context

We're building a dashboard for engineering managers to track team code velocity — commits per engineer, PR cycle time, review latency, CI pass rate. The data already lives in GitHub; we're just aggregating it for a manager's single-pane view.

## Changes

1. New React component `TeamVelocityDashboard` in `src/dashboard/`
2. REST API endpoint `GET /api/team/velocity?days=30` returning aggregated metrics
3. Background job pulling GitHub data every 15 minutes into Postgres
4. Simple filter UI: team, date range, metric

## Architecture

- Frontend: React + shadcn/ui
- Backend: Express + PostgreSQL
- Data source: GitHub REST API (cached 15min)

## Open questions

- Should we support multiple repos per team?
- Do we show individual engineer names or aggregate only?
test/fixtures/mode-posture/forcing-pitch.md
@@ -0,0 +1,13 @@
# Our Idea: AI Tools for Product Managers

We're building AI tools for product managers at mid-market SaaS companies. The product combines a bunch of the things PMs already do — writing PRDs, gathering user feedback, analyzing usage data, drafting roadmaps — and uses LLMs to speed each of them up.

## Who we're targeting

Product managers at SaaS companies with 50-500 engineers. These PMs are stretched thin, juggle a lot of surface area, and would benefit from AI assistance.

## What we've done so far

- Talked to a few PMs we know from prior jobs
- Built a prototype that summarizes Zoom customer calls into a PRD stub
- Got on a waitlist of about 40 signups from LinkedIn posts