merge: integrate origin/main (v0.5.1-v0.6.4) into team-supabase-store

Resolves conflicts in package.json (keep unified cli-eval.ts + add
eval:select) and test/skill-llm-eval.test.ts (keep judgeCost/judgeCosts
helpers + add diff-based test selection).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-17 23:01:35 -07:00
58 changed files with 6736 additions and 1719 deletions

View File

@@ -263,6 +263,30 @@
**Effort:** S
**Priority:** P3
### CI/CD generation for non-GitHub providers
**What:** Extend CI/CD bootstrap to generate GitLab CI (`.gitlab-ci.yml`), CircleCI (`.circleci/config.yml`), and Bitrise pipelines.
**Why:** Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone.
**Context:** v1 ships with GitHub Actions only. Detection logic already checks for `.gitlab-ci.yml`, `.circleci/`, `bitrise.yml` and skips with an informational note. Each provider needs ~20 lines of template text in `generateTestBootstrap()`.
**Effort:** M
**Priority:** P3
**Depends on:** Test bootstrap (shipped)
### Auto-upgrade weak tests (★) to strong tests (★★★)
**What:** When Step 3.4 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths.
**Why:** Many codebases have tests that technically exist but don't catch real bugs — `expect(component).toBeDefined()` isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests."
**Context:** Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original.
**Effort:** M
**Priority:** P3
**Depends on:** Test quality scoring (shipped)
## Retro
### Deployment health tracking (retro + browse)
@@ -488,6 +512,16 @@ Implemented as `supabase/functions/weekly-digest/index.ts`. pg_cron Monday 9am U
**Priority:** P2
**Depends on:** None
### Cross-platform URL open helper
**What:** `gstack-open-url` helper script — detect platform, use `open` (macOS) or `xdg-open` (Linux).
**Why:** The first-time Completeness Principle intro uses macOS `open` to launch the essay. If gstack ever supports Linux, this silently fails.
**Effort:** S (human: ~30 min / CC: ~2 min)
**Priority:** P4
**Depends on:** Nothing
### CDP-based DOM mutation detection for ref staleness
**What:** Use Chrome DevTools Protocol `DOM.documentUpdated` / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit `snapshot` call.
@@ -534,6 +568,46 @@ Shipped as `/design-consultation` on garrytan/design branch. Renamed from `/setu
**Priority:** P2
**Depends on:** None
## Ship Confidence Dashboard
### Smart review relevance detection — PARTIALLY SHIPPED
~~**What:** Auto-detect which of the 4 reviews are relevant based on branch changes (skip Design Review if no CSS/view changes, skip Code Review if plan-only).~~
`bin/gstack-diff-scope` shipped — categorizes diff into SCOPE_FRONTEND, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Used by design-review-lite to skip when no frontend files changed. Dashboard integration for conditional row display is a follow-up.
**Remaining:** Dashboard conditional row display (hide "Design Review: NOT YET RUN" when SCOPE_FRONTEND=false). Extend to Eng Review (skip for docs-only) and CEO Review (skip for config-only).
**Effort:** S
**Priority:** P3
**Depends on:** gstack-diff-scope (shipped)
### /merge skill — review-gated PR merge
**What:** Create a `/merge` skill that merges an approved PR, but first checks the Review Readiness Dashboard and runs `/review` (Fix-First) if code review hasn't been done. Separates "ship" (create PR) from "merge" (land it).
**Why:** Currently `/review` runs inside `/ship` Step 3.5 but isn't tracked as a gate. A `/merge` skill ensures code review always happens before landing, and enables workflows where someone else reviews the PR first.
**Context:** `/ship` creates the PR. `/merge` would: check dashboard → run `/review` if needed → `gh pr merge`. This is where code review tracking belongs — at merge time, not at plan time.
**Effort:** M
**Priority:** P2
**Depends on:** Ship Confidence Dashboard (shipped)
## Completeness
### Completeness metrics dashboard
**What:** Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time.
**Why:** Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts).
**Context:** Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence.
**Effort:** M (human) / S (CC)
**Priority:** P3
**Depends on:** Boil the Lake shipped (v0.6.1)
## Completed
### Phase 1: Foundations (v0.2.0)