merge: integrate origin/main (v0.5.1-v0.6.4) into team-supabase-store

Resolves conflicts in package.json (keep unified cli-eval.ts + add eval:select) and test/skill-llm-eval.test.ts (keep judgeCost/judgeCosts helpers + add diff-based test selection).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 04:38:24 +08:00 · 2026-03-17 23:01:35 -07:00
parent 2670c96040 716e4c934a
commit 82cf085213
58 changed files with 6736 additions and 1719 deletions
--- a/TODOS.md
+++ b/TODOS.md
@@ -263,6 +263,30 @@
 **Effort:** S
 **Priority:** P3

+### CI/CD generation for non-GitHub providers
+
+**What:** Extend CI/CD bootstrap to generate GitLab CI (`.gitlab-ci.yml`), CircleCI (`.circleci/config.yml`), and Bitrise pipelines.
+
+**Why:** Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone.
+
+**Context:** v1 ships with GitHub Actions only. Detection logic already checks for `.gitlab-ci.yml`, `.circleci/`, `bitrise.yml` and skips with an informational note. Each provider needs ~20 lines of template text in `generateTestBootstrap()`.
+
+**Effort:** M
+**Priority:** P3
+**Depends on:** Test bootstrap (shipped)
+
+### Auto-upgrade weak tests (★) to strong tests (★★★)
+
+**What:** When Step 3.4 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths.
+
+**Why:** Many codebases have tests that technically exist but don't catch real bugs — `expect(component).toBeDefined()` isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests."
+
+**Context:** Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original.
+
+**Effort:** M
+**Priority:** P3
+**Depends on:** Test quality scoring (shipped)
+
 ## Retro

 ### Deployment health tracking (retro + browse)
@@ -488,6 +512,16 @@ Implemented as `supabase/functions/weekly-digest/index.ts`. pg_cron Monday 9am U
 **Priority:** P2
 **Depends on:** None

+### Cross-platform URL open helper
+
+**What:** `gstack-open-url` helper script — detect platform, use `open` (macOS) or `xdg-open` (Linux).
+
+**Why:** The first-time Completeness Principle intro uses macOS `open` to launch the essay. If gstack ever supports Linux, this silently fails.
+
+**Effort:** S (human: ~30 min / CC: ~2 min)
+**Priority:** P4
+**Depends on:** Nothing
+
 ### CDP-based DOM mutation detection for ref staleness

 **What:** Use Chrome DevTools Protocol `DOM.documentUpdated` / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit `snapshot` call.
@@ -534,6 +568,46 @@ Shipped as `/design-consultation` on garrytan/design branch. Renamed from `/setu
 **Priority:** P2
 **Depends on:** None

+## Ship Confidence Dashboard
+
+### Smart review relevance detection — PARTIALLY SHIPPED
+
+~~**What:** Auto-detect which of the 4 reviews are relevant based on branch changes (skip Design Review if no CSS/view changes, skip Code Review if plan-only).~~
+
+`bin/gstack-diff-scope` shipped — categorizes diff into SCOPE_FRONTEND, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Used by design-review-lite to skip when no frontend files changed. Dashboard integration for conditional row display is a follow-up.
+
+**Remaining:** Dashboard conditional row display (hide "Design Review: NOT YET RUN" when SCOPE_FRONTEND=false). Extend to Eng Review (skip for docs-only) and CEO Review (skip for config-only).
+
+**Effort:** S
+**Priority:** P3
+**Depends on:** gstack-diff-scope (shipped)
+
+### /merge skill — review-gated PR merge
+
+**What:** Create a `/merge` skill that merges an approved PR, but first checks the Review Readiness Dashboard and runs `/review` (Fix-First) if code review hasn't been done. Separates "ship" (create PR) from "merge" (land it).
+
+**Why:** Currently `/review` runs inside `/ship` Step 3.5 but isn't tracked as a gate. A `/merge` skill ensures code review always happens before landing, and enables workflows where someone else reviews the PR first.
+
+**Context:** `/ship` creates the PR. `/merge` would: check dashboard → run `/review` if needed → `gh pr merge`. This is where code review tracking belongs — at merge time, not at plan time.
+
+**Effort:** M
+**Priority:** P2
+**Depends on:** Ship Confidence Dashboard (shipped)
+
+## Completeness
+
+### Completeness metrics dashboard
+
+**What:** Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time.
+
+**Why:** Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts).
+
+**Context:** Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence.
+
+**Effort:** M (human) / S (CC)
+**Priority:** P3
+**Depends on:** Boil the Lake shipped (v0.6.1)
+
 ## Completed

 ### Phase 1: Foundations (v0.2.0)