Files
gstack/test/fixtures/mode-posture/expansion-plan.md
Garry Tan a647064734 test: add gate-tier mode-posture regression tests
Three gate-tier E2E tests detect when preamble / template changes
flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or
/office-hours (startup Q3, builder mode). The V1 regression that this
PR fixes shipped without anyone catching it at ship time — this is the
ongoing signal so the same thing doesn't happen again.

Pieces:
- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet
  judge with mode-specific dual-axis rubric (expansion: surface_framing
  + decision_preservation; forcing: stacking_preserved +
  domain_matched_consequence; builder: unexpected_combinations +
  excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/` — deterministic input
  for expansion proposal generation, Q3 forcing question, and builder
  adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to
  `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge:
  Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with
  `office-hours-forcing-energy` + `office-hours-builder-wildness`
  cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts` — all three as
  `gate` tier in `E2E_TIERS`, triggered by changes to
  `scripts/resolvers/preamble.ts`, the relevant skill template, the
  judge helper, or any mode-posture fixture.

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus
generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: judges test prose markers more than deep behavior.
V1.2 candidate is a cross-provider (Codex) adversarial judge on the
same output to decouple house-style bias.
2026-04-18 23:46:00 +08:00

24 lines
817 B
Markdown

# Plan: Team Velocity Dashboard
## Context
We're building a dashboard for engineering managers to track team code velocity — commits per engineer, PR cycle time, review latency, CI pass rate. The data already lives in GitHub; we're just aggregating it for a manager's single-pane view.
## Changes
1. New React component `TeamVelocityDashboard` in `src/dashboard/`
2. REST API endpoint `GET /api/team/velocity?days=30` returning aggregated metrics
3. Background job pulling GitHub data every 15 minutes into Postgres
4. Simple filter UI: team, date range, metric
## Architecture
- Frontend: React + shadcn/ui
- Backend: Express + PostgreSQL
- Data source: GitHub REST API (cached 15min)
## Open questions
- Should we support multiple repos per team?
- Do we show individual engineer names or aggregate only?