159 Commits

Author SHA1 Message Date
Garry Tan
1f3b691411 feat: /gstack-upgrade detects and syncs stale vendored copies (v0.5.4.1) (#137)
When the global gstack is already up to date, standalone /gstack-upgrade
now checks if the local vendored copy in the current project is at a
different version and syncs it automatically. Also adds rollback on
setup failure and update-check fallback matching the preamble pattern.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 13:06:14 -05:00
Garry Tan
a2d756f945 feat: Test Bootstrap + Regression Tests + Coverage Audit (v0.6.0) (#136)
* feat: test bootstrap, regression tests, coverage audit, retro test health

- Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts
- Add Phase 8e.5 regression test generation to /qa and /qa-design-review
- Add Step 3.4 test coverage audit with quality scoring to /ship
- Add test health tracking to /retro
- Add 2 E2E evals (bootstrap + coverage audit)
- Add 26 validation tests
- Update ARCHITECTURE.md placeholder table
- Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests)

* chore: bump version and changelog (v0.6.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: make coverage audit trace actual codepaths, not just syntax patterns

Step 3.4 now instructs Claude to read full files, trace data flow through
every branch, diagram the execution, and check each branch against tests.
Phase 8e.5 regression tests now trace the bug's codepath before writing
the test, catching adjacent edge cases.

* feat: coverage audit now maps user flows, interactions, and error states

Step 3.4 now covers the full picture: code branches AND user-facing behavior.
Maps user flows (complete journey through the feature), interaction edge cases
(double-click, back button, stale state, slow connection), error states
(what does the user actually see?), and boundary states (zero results,
10k results, max-length input). Coverage diagram splits into Code Path
Coverage and User Flow Coverage sections with separate percentages.

* fix: raise test gen cap to 20, add validation tests for user flow coverage

- Raise Step 3.4 test generation cap from 10 to 20 (code + user flow combined)
- Add 3 validation tests: codepath tracing, user flow mapping, diagram sections

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 13:05:18 -05:00
Garry Tan
b65a464d37 feat: always-full eng review + ship review gate persistence (v0.5.4) (#135)
Remove SMALL/BIG CHANGE menu from /plan-eng-review — every plan gets the
full interactive review. Scope reduction is now proactive (only when
complexity check triggers) rather than a menu item.

Add review gate override persistence to /ship — when the user says "ship
anyway" or "not relevant", that decision is saved to the branch's
reviews.jsonl so subsequent /ship runs don't re-ask.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 12:41:44 -05:00
Garry Tan
5e9f0e78f2 feat: SELECTIVE EXPANSION + smarter ship gates (v0.5.3) (#134)
* feat: SELECTIVE EXPANSION mode + user control for CEO review

Add 4th mode to /plan-ceo-review: SELECTIVE EXPANSION holds current scope
as baseline but surfaces expansion opportunities one by one for cherry-picking.
All modes now present every scope-expanding idea as individual AskUserQuestion
calls — user opts in or out of each one. EXPANSION recommends enthusiastically,
SELECTIVE recommends neutrally. CEO plan persistence writes decisions to disk.

* feat: review dashboard — eng required, CEO/design optional

Only Eng Review gates shipping. CEO Review recommended for big product
changes, Design Review for UI work — both informational only. Adds
skip_eng_review global config to disable the gate entirely.

* chore: bump version and changelog (v0.5.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 12:22:10 -05:00
Garry Tan
c99757b522 feat: /design-consultation — risk-taking, visual research, ambitious preview (v0.5.2) (#131)
* feat: /design-consultation — risk-taking thesis, visual research, ambitious preview

Add SAFE/RISK breakdown to design proposals so users see which choices
match category conventions vs. which are deliberate creative departures.

Wire browse binary for visual competitive research — agent browses
competitor sites, takes screenshots, and analyzes fonts/colors/spacing
with graceful degradation to WebSearch-only or built-in knowledge.

Upgrade preview page instructions to render realistic product mockups
(dashboards, marketing pages, settings forms) instead of just swatches.

Rewrite README section with the thesis: "coherence is table stakes —
the real question is where you take risks."

* chore: bump version and changelog (v0.5.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restore SKILL.md files to match main

Prior commit included SKILL.md files regenerated from stale templates.
Restore to match origin/main content.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:49:22 -05:00
Garry Tan
73b00b4e29 feat: Review Readiness Dashboard + gstack-slug helper (v0.5.1) (#130)
* feat: add bin/gstack-slug helper + migrate all inline SLUG computation

Extract the opaque SLUG sed pipeline into a shared 5-line shell script.
Replace 8 inline copies across templates with eval $(gstack-slug).
Sanitizes branch names (/ → -) to prevent subdirectory creation.
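
The sanitization step can be illustrated with a minimal TypeScript sketch. The real helper is a shell script whose contents are not shown in this log, so `sanitizeBranch` is a hypothetical name and only the `/` to `-` mapping comes from the commit message.

```typescript
// Hypothetical illustration; the actual bin/gstack-slug is a shell script.
// Replaces every "/" with "-" so a branch name can be used as a single
// path segment without creating subdirectories.
function sanitizeBranch(branch: string): string {
  return branch.replace(/\//g, "-");
}

console.log(sanitizeBranch("garrytan/team-supabase-store"));
```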

* feat: review readiness dashboard — track CEO/Eng/Design reviews per branch

Each review skill logs its result to JSONL. A shared {{REVIEW_DASHBOARD}}
placeholder displays run counts, timestamps, and a CLEARED TO SHIP verdict.
/ship pre-flight reads the dashboard and prompts when reviews are missing.

* chore: bump version and changelog (v0.5.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:33:46 -05:00
Garry Tan
5f41cd9ad7 feat: show screenshots to user during QA and browse sessions (v0.5.0.1) (#129)
Add rule 11 to QA and Design methodologies in gen-skill-docs.ts
instructing Claude to Read screenshot PNGs after taking them.
This makes screenshots visible as clickable elements in Conductor
and other Claude Code UIs. Also added to browse and gstack SKILL
templates.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:30:19 -05:00
Garry Tan
2670c96040 merge: integrate origin/main (v0.4.3-v0.5.0) into team-supabase-store
Resolves conflict in scripts/gen-skill-docs.ts (keep both setup-team-sync
and new design/document-release skill templates).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 08:29:30 -07:00
xianren
1a100a2a23 fix: eliminate duplicate command sets in chain, improve flush perf and type safety
- Remove duplicate CHAIN_READ/CHAIN_WRITE/CHAIN_META sets from meta-commands.ts
  and import from commands.ts (single source of truth). The duplicated sets would
  silently fail to route new commands added to commands.ts.
- Replace read+concat+write log flush with fs.appendFileSync — O(new entries)
  instead of O(total log size) per flush cycle.
- Replace `any` types for contextOptions with Playwright's BrowserContextOptions
  and add proper types for storage state in recreateContext().
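
A minimal sketch of the append-based flush described in the second bullet. The `pending` buffer, the `flushLog` name, and the temp-file path are assumptions made for illustration; only the use of `fs.appendFileSync` comes from the commit message.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical buffer of log entries that have not been written yet.
const pending: string[] = [];

function flushLog(logPath: string): void {
  if (pending.length === 0) return;
  // Append only the new entries; the existing file is never read back,
  // so the cost per flush depends on the new entries, not the file size.
  fs.appendFileSync(logPath, pending.join("\n") + "\n");
  pending.length = 0;
}

// Usage: two flush cycles against a throwaway file.
const logPath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "flush-")), "console.log");
pending.push("first", "second");
flushLog(logPath);
pending.push("third");
flushLog(logPath);
```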

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:05:02 +08:00
Garry Tan
c8c2cbba33 docs: add /design-consultation skill to README (#127)
The skill was fully implemented but completely absent from the README.
Add it to the skill table, write a detailed section with usage example,
and include it in install/uninstall instructions.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 08:10:01 -05:00
Garry Tan
4a77cc2c34 feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102)
* feat: add {{DESIGN_METHODOLOGY}} resolver and register design review skills

Add generateDesignMethodology() to gen-skill-docs.ts with 10-category, 80-item
design audit checklist. Register plan-design-review and qa-design-review templates
in findTemplates(). Add both skills to skill-check.ts SKILL_FILES. Add command
and snapshot flag validation tests for both skills in skill-validation.test.ts.

* feat: add /plan-design-review and /qa-design-review skills

/plan-design-review: report-only designer audit with letter grades, AI slop
scoring, structured first impression, design system extraction, DESIGN.md
inference and export offer. Never modifies code.

/qa-design-review: same audit, then iterative fix loop with style(design):
commits, CSS-safe WTF heuristic, before/after screenshots, final re-audit.

* chore: bump version and changelog (v0.5.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update README, ARCHITECTURE for design review skills (v0.5.0)

- Update skill count to 11, add /plan-design-review and /qa-design-review
  to skill table, install/uninstall commands, and demo walkthrough
- Add narrative sections: "senior designer mode" and "designer who codes mode"
  with compelling examples showing AI Slop detection and design system inference
- Add {{DESIGN_METHODOLOGY}} to ARCHITECTURE.md placeholder table
- Extend demo to show full plan→eng→review→ship→qa→design-review pipeline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: regenerate design review SKILL.md files after merge from main

Picks up BASE_BRANCH_DETECT resolver and updated contributor mode from main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /design-consultation skill — design consultant that creates DESIGN.md

6-phase consultant flow: product context → competitive research (WebSearch) →
complete coherent proposal → drill-downs on demand → font+color preview page →
write DESIGN.md + update CLAUDE.md. Opinionated recommendations grounded in
product context, not menu-driven forms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add E2E tests for design skill family (7 tests + LLM quality judge)

Tests 1-4: /design-consultation (core flow, research integration, existing
DESIGN.md handling, font+color preview generation).
Tests 5-6: /plan-design-review (audit report, DESIGN.md export).
Test 7: /qa-design-review (audit + fix loop).
LLM judge validates font blacklist compliance, coherence, and AI slop avoidance.
Also adds plan-design-review + qa-design-review to ALL_SKILLS test array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: mark /design-consultation as shipped in TODOS.md

Renamed from /setup-design-md to reflect the consultant approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 21:55:07 -05:00
Garry Tan
a30f7079da feat: Fix-First Review — auto-fix obvious issues, ask about hard ones (v0.4.5) (#116)
* feat: Fix-First Review — auto-fix obvious issues, ask about hard ones

Replace the CRITICAL-only AskUserQuestion flow with Fix-First:
- Every finding gets action (not just critical ones)
- AUTO-FIX items (dead code, N+1, stale comments) applied directly
- ASK items (security, race conditions, design decisions) batched
  into at most one AskUserQuestion
- Fix-First Heuristic in checklist.md (single source of truth)
- Gate Classification → Severity Classification rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: polish CHANGELOG v0.4.5 voice — lead with user benefit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:52:50 -05:00
Garry Tan
318ffdbdf0 fix: js statement wrapping + click auto-routes option to selectOption (v0.4.5) (#117)
* fix: js statement wrapping + click auto-routes option to selectOption

Bug 1: js command wrapped all code as expressions — const, semicolons,
and multi-line code broke with SyntaxError. Added needsBlockWrapper()
and wrapForEvaluate() helpers (shared with eval) to detect statements
and use block wrapper {…} instead of expression wrapper (…).
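
The two wrapper shapes can be sketched as follows. The helper names come from the commit message, but the detection heuristic and the exact wrapper strings are assumptions, not the project's code.

```typescript
// Assumed heuristic: treat the code as statements when it spans several
// lines, contains a semicolon, or starts with a statement keyword.
function needsBlockWrapper(code: string): boolean {
  const trimmed = code.trim();
  return (
    /[;\n]/.test(trimmed) ||
    /^(const|let|var|function|class|if|for|while|return|throw|try)\b/.test(trimmed)
  );
}

function wrapForEvaluate(code: string): string {
  return needsBlockWrapper(code)
    ? `(() => {\n${code}\n})()` // statements: block body, caller must `return`
    : `(() => (${code}))()`; // expression: the value is returned implicitly
}

console.log(wrapForEvaluate("document.title"));
console.log(wrapForEvaluate("const n = 2; return n * 3"));
```

A heuristic like this misclassifies an expression whose string literal contains a semicolon, which is one reason a real implementation may need to be more careful.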

Bug 2: clicking <option> refs hung forever because Playwright can't
.click() native select UI. Click handler now checks ARIA role + DOM
tagName and auto-routes to selectOption() via parent <select>.

Bug 3: click timeouts on <option> elements gave no guidance. Now
throws helpful error: "Use browse select instead of click."

* chore: bump version and changelog (v0.4.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 21:50:43 -05:00
Garry Tan
4e9c30076a fix: rename {{UPDATE_CHECK}} to {{PREAMBLE}} in setup-team-sync template
Aligns with main branch rename. Regenerated stale qa/SKILL.md and
ship/SKILL.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:53 -07:00
Garry Tan
238e89db9a docs: cross-reference leaderboard duplication, service-role-key warning
- Add cross-reference comments between dashboard-queries.ts computeLeaderboard()
  and dashboard/ui.ts renderLeaderboard() so maintainers know to update both
- Add security note in setup-team-dashboard about service-role-key visibility
  in pg_cron job table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:49 -07:00
Garry Tan
4093c5e031 fix: DRY getValidToken — cli-team delegates to sync.ts, remove phantom Joined column
- Export getValidToken from sync.ts (was private)
- cli-team.ts now uses sync.ts version (supports auto-refresh, was missing)
- Remove unused isTokenExpired/getAuthTokens imports from cli-team
- Remove "Joined" column from formatMembersTable (team_members has no created_at)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:44 -07:00
Garry Tan
03e61be488 test: add 24 tests for edge function pure functions
Tests for computePassRate, shouldAlert, formatSlackMessage (regression-alert)
and formatDigestMessage (weekly-digest). Covers null inputs, threshold
boundaries, delta formatting, quiet week fallback. Uses Deno/supabase mock
for bun compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:40 -07:00
Garry Tan
c86faa7968 fix: update check cache — 60min UP_TO_DATE TTL + --force flag (v0.4.4) (#110)
* fix: split update check cache TTL + add --force flag

UP_TO_DATE cache now expires after 60 min (was 720 min / 12 hours).
UPGRADE_AVAILABLE keeps 720 min TTL to keep nagging.

--force flag deletes cache before checking, used by /gstack-upgrade
standalone invocation to always get a fresh result from GitHub.
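
A minimal sketch of the split TTL, assuming a cache entry that records its status and age. The TTL values are from the commit message; `isCacheFresh`, the strict `<` boundary, and modelling `--force` as a boolean are assumptions.

```typescript
type CheckStatus = "UP_TO_DATE" | "UPGRADE_AVAILABLE";

// TTLs in minutes, taken from the commit message.
const TTL_MINUTES: Record<CheckStatus, number> = {
  UP_TO_DATE: 60,
  UPGRADE_AVAILABLE: 720,
};

function isCacheFresh(status: CheckStatus, ageMinutes: number, force = false): boolean {
  // Assumption: --force is modelled as "never trust the cache";
  // the real flag deletes the cache file before checking.
  if (force) return false;
  return ageMinutes < TTL_MINUTES[status];
}

console.log(isCacheFresh("UP_TO_DATE", 90)); // false
console.log(isCacheFresh("UPGRADE_AVAILABLE", 90)); // true
```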

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /gstack-upgrade standalone uses --force for fresh check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 14:14:15 -05:00
Garry Tan
a68244ab57 feat: /document-release skill — post-ship doc updates (v0.4.3) (#109)
* docs: update project documentation for v0.4.2

- README: skill count 9→10, added /document-release to skills table,
  install/uninstall sections, and dedicated section with example
- CHANGELOG: added /document-release bullet to v0.4.2 entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /document-release skill with smart VERSION handling

New skill runs after /ship but before PR merge. Reads every doc file,
cross-references the diff, auto-updates factual changes, asks about
risky edits. CHANGELOG clobber protection: never uses Write tool on
CHANGELOG.md, only Edit with exact old_string matches.

Smart VERSION logic: instead of silently skipping already-bumped
versions, compares CHANGELOG entry scope against full diff and asks
if significant uncovered changes exist.

Also fixes gstack-upgrade/SKILL.md missing from skill-check.ts
SKILL_FILES array (existing inconsistency with gen-skill-docs.ts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /review Step 5.6 — documentation staleness check

Review skill now cross-references code changes against doc files.
If a doc describes a feature that changed but the doc wasn't updated,
flags it as INFORMATIONAL with a pointer to /document-release.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: /document-release E2E with CHANGELOG clobber guard

E2E test creates a repo with existing CHANGELOG entries, runs
/document-release, and asserts original entries survive. Critical
guardrail against the incident where an agent replaced CHANGELOG
entries during conflict resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump to v0.4.3 — /document-release skill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files after merge

* chore: regenerate SKILL.md files after merge

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 12:30:22 -05:00
Garry Tan
276d0cc6cb feat: always-on ELI16 + branch detection (v0.4.3) (#108)
* feat: always-on ELI16 + branch detection in preamble

- Add _BRANCH detection to preamble bash block (git branch --show-current)
- Merge ELI16 rules into default AskUserQuestion format (always-on)
- Remove _SESSIONS >= 3 conditional — better questions always
- Add simplification rules: plain English, no jargon, no raw function names
- Update tests for branch detection and simplification regression guard

* chore: bump version and changelog (v0.4.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 12:27:47 -05:00
Garry Tan
8ef73a7508 Merge remote-tracking branch 'origin/main' into garrytan/team-supabase-store
2026-03-16 11:29:57 -05:00
Garry Tan
78e519e3b7 feat: await support in browse js/eval + contributor mode v2 (#104)
* feat: support await in $B js and eval commands

Auto-wrap await expressions in async IIFE context so
$B js "await fetch(...)" works without SyntaxError.

- hasAwait() strips comments before detection
- js: expression wrapping (async()=>(expr))()
- eval: smart wrapping — single-line=expression, multi-line=block
- 6 new unit tests covering async, false-positive, and return semantics
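
The detection and expression wrapping listed above can be sketched like this. `hasAwait` and the `(async()=>(expr))()` shape are named in the commit message; the comment-stripping regexes and the `wrapJs` name are assumptions.

```typescript
// Assumed implementation: drop /* */ and // comments, then look for `await`.
function hasAwait(code: string): boolean {
  const withoutComments = code
    .replace(/\/\*[\s\S]*?\*\//g, "")
    .replace(/\/\/.*$/gm, "");
  return /\bawait\b/.test(withoutComments);
}

// Hypothetical helper: wrap only when an await is actually present.
function wrapJs(expr: string): string {
  return hasAwait(expr) ? `(async()=>(${expr}))()` : expr;
}

console.log(wrapJs("await Promise.resolve(1)"));
console.log(wrapJs("1 + 1 // await later"));
```

This sketch still reports a false positive for `await` inside a string literal; handling that case would need tokenization rather than regexes.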

* feat: redesign contributor mode — periodic reflection with 0-10 rating

Replace passive "report when things break" with active reflection:
- Rate gstack experience 0-10 at workflow step boundaries
- Historical calibration example (await bug) anchors the reporting bar
- "What would make this a 10" field focuses on actionable improvements
- Removed category lists in favor of judgment-based assessment

* test: add deterministic contributor mode preamble validation

40 new skill-validation tests (4 checks × 10 skills) verify:
- 0-10 rating scale present
- Calibration example present
- "What would make this a 10" field present
- Periodic reflection (not per-command)

Update existing E2E contributor eval for new report format.

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve contributor mode + qa-quick E2E reliability

Contributor mode:
- Add "do not truncate" directive to template — agent was stopping
  after "My rating" without completing Steps/Raw output/What would
  make this a 10 sections
- Restore assertions for Steps to reproduce and Date footer

QA quick:
- Make test server URL prominent: top of prompt, explicit "already
  running" and "do NOT discover ports" instructions
- Bump session timeout 180s→240s and test timeout 240s→300s
- Set B= at top of prompt (was buried in prose)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use flexible assertions for contributor mode E2E

Agent writes thorough reports with creative section names
("Repro Steps" vs "Steps to reproduce"). Match intent not formatting:
- /repro|steps to reproduce/ for reproduction steps
- /date.*2026/ for date footer presence
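
As a sketch, the two patterns above could be applied to a report like this. The patterns are from the commit message; the `i` flag and the `looksLikeReport` helper are assumptions added so that a heading such as "Repro Steps" matches.

```typescript
// Patterns from the commit message; case-insensitivity is an assumption.
const reproPattern = /repro|steps to reproduce/i;
const datePattern = /date.*2026/i;

// Hypothetical helper combining both checks.
function looksLikeReport(report: string): boolean {
  return reproPattern.test(report) && datePattern.test(report);
}

console.log(looksLikeReport("## Repro Steps\n1. Run the command\n\nDate: 2026-03-16"));
```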

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add E2E eval failure blame protocol

"Not related to our changes" is an extraordinary claim that requires
extraordinary proof. When evals fail during /ship:

1. Run the same eval on main — prove it fails there too
2. If it passes on main, it IS your change — trace the blame
3. If you can't verify, say "unverified" not "pre-existing"

Added to CLAUDE.md and as a comment in skill-e2e.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md and BROWSER.md for v0.4.2

CONTRIBUTING.md: update contributor mode description — now describes
periodic 0-10 reflection loop instead of passive friction detection.

BROWSER.md: add js/eval async documentation — await expressions are
auto-wrapped in async context, single-line eval returns values directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restore v0.4.2 changelog entries lost during cherry-pick conflict

The base branch detection entries from main were dropped when resolving
the CHANGELOG conflict — should have merged both sets, not replaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 11:28:58 -05:00
Garry Tan
1e06b6a5c6 fix: dynamic base branch detection across all SKILL templates (v0.3.10) (#81)
* feat: add {{BASE_BRANCH_DETECT}} resolver to gen-skill-docs

DRY placeholder for dynamic base branch detection across PR-targeting
skills. Detects via gh pr view (existing PR base) → gh repo view
(repo default) → fallback to main.
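
The fallback chain can be sketched as a list of detectors tried in order. The resolver itself emits prompt text for the agent rather than running code, so everything here is illustrative: `detectBaseBranch` and the stubbed detectors are hypothetical, and no real `gh` invocation is shown.

```typescript
// Each detector returns a branch name, or null when it cannot decide.
type Detector = () => string | null;

function detectBaseBranch(detectors: Detector[], fallback = "main"): string {
  for (const detect of detectors) {
    const branch = detect()?.trim();
    if (branch) return branch; // first non-empty answer wins
  }
  return fallback;
}

// Usage with stubs standing in for the two gh lookups.
const fromExistingPr: Detector = () => null; // no PR exists for this branch
const fromRepoDefault: Detector = () => "develop\n"; // raw command output
console.log(detectBaseBranch([fromExistingPr, fromRepoDefault])); // develop
```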

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: ship skill detects base branch instead of hardcoding main

Replaces ~14 hardcoded 'main' references with dynamic detection via
{{BASE_BRANCH_DETECT}}. Fixes stacked branches and Conductor workspaces
targeting non-main branches. Adds --base <base> to gh pr create.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review, qa, plan-ceo-review detect base branch dynamically

Same pattern as ship: replaces hardcoded 'main' with {{BASE_BRANCH_DETECT}}.
Also cleans up qa bash-isms (REPORT_DIR variable, port chaining).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: retro detects default branch instead of hardcoding origin/main

Retro queries commit history (not PR targets), so uses simpler detection:
gh repo view defaultBranchRef. Replaces ~11 origin/main refs with
origin/<default>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add explicit cross-step references in gstack-upgrade template

Bash blocks are self-contained, but cross-block variable references
(INSTALL_DIR from Step 2) were implicit. Adds prose making them explicit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs+test: SKILL authoring guidance + regression tests

Adds "Writing SKILL templates" section to CLAUDE.md explaining that
templates are prompts, not scripts. Adds validation test catching
hardcoded 'main' in git commands, and resolver content test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update ARCHITECTURE + CONTRIBUTING for new placeholders

Add {{BASE_BRANCH_DETECT}} to ARCHITECTURE.md placeholder list.
Cross-reference CLAUDE.md template authoring guidance from CONTRIBUTING.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.10)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing blank line between resolver functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add 3 E2E smoke tests for base branch detection

- /review: verifies Step 0 detection + git diff against detected base
- /ship: truncated dry-run (Steps 0-1 only, no push/PR), asserts no
  destructive actions
- /retro: verifies default branch detection for git log queries

Covers the {{BASE_BRANCH_DETECT}} resolver path (review), the ship
template's dual abort check, and retro's inline detection pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 10:59:13 -05:00
Garry Tan
9e67d71f72 docs: add 8 team dashboard TODOs from CEO review, mark weekly digest shipped
New TODOs: regression alert links, projected monthly cost, ship-to-Slack
notifications, dynamic favicon, server-side aggregation, SSE streaming,
GitHub Check Runs, ship_logs index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 10:00:36 -05:00
Garry Tan
721abce5a5 fix: review-driven hardening — env guards, token expiry, slug validation, dashboard UX
From CEO plan review:
- Edge functions: early guard on missing env vars instead of non-null assert crash
- cli-team: wire isTokenExpired check (was imported but unused)
- Migration 007: CHECK constraint on team slug (a-z0-9 hyphens, 2-50 chars)
- Dashboard: streak badges on leaderboard, repo slug in who's-online,
  contextual empty states that teach, 60s refresh (was 30s)
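
A client-side mirror of the slug rule might look like the sketch below. The character set and length bounds are from the bullet above; the exact SQL of the CHECK constraint is not shown in this log, so the regex and the `isValidTeamSlug` name are assumptions.

```typescript
// Assumed pattern: lowercase letters, digits, and hyphens, 2 to 50 characters.
const TEAM_SLUG_PATTERN = /^[a-z0-9-]{2,50}$/;

function isValidTeamSlug(slug: string): boolean {
  return TEAM_SLUG_PATTERN.test(slug);
}

console.log(isValidTeamSlug("my-team")); // true
console.log(isValidTeamSlug("My Team")); // false
```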

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 09:59:20 -05:00
Garry Tan
2357f134ce merge: integrate origin/main (v0.4.0, v0.4.1) into team-supabase-store
Resolves conflicts in CHANGELOG.md (ordering), CONTRIBUTING.md (eval
tools list merge), VERSION (take main's 0.4.1), qa/SKILL.md.tmpl
(keep full methodology + baseline line), eval-store.test.ts (drop
redundant comment).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 07:49:27 -05:00
Lucas Braud
54677117cc Fix build script failure on Windows
The `rm -f .*.bun-build` glob cleanup step fails on Windows/Git Bash
when no files match the pattern, causing `bun run build` to exit with
code 1. Since the setup script uses `set -e`, this aborts the entire
setup before skill symlinks are created.

Adding `|| true` makes the cleanup step non-fatal, which matches the
intent — it's just removing stale build artifacts if they exist.
2026-03-16 09:54:24 +01:00
Garry Tan
83bfc7f88d feat: add /setup-team-dashboard skill, post-ship leaderboard callout
Interactive 8-step setup skill for deploying dashboard + edge functions.
Post-ship callout shows team leaderboard after successful sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:59 -05:00
Garry Tan
78840c64a8 feat: add shared team dashboard, regression alerts, weekly digest edge functions
Dashboard: Supabase edge function serving self-contained HTML with
PKCE OAuth, 6 parallel client-side REST queries, SVG charts, dark
theme, auto-refresh, who's-online from heartbeats. Public URL.

Regression alert: webhook on eval_runs INSERT, 5-min cooldown dedup
via alert_cooldowns, Slack notification on >5% pass rate drop.
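
The alert decision can be sketched as a pure function. The 5% threshold and 5-minute cooldown are from the commit message, and `computePassRate` and `shouldAlert` are named in a later test commit, but the signatures, the `EvalRun` shape, and reading "5%" as absolute percentage points are all assumptions.

```typescript
interface EvalRun {
  passed: number;
  total: number;
}

// Returns null instead of dividing by zero when a run has no tests.
function computePassRate(run: EvalRun): number | null {
  return run.total > 0 ? run.passed / run.total : null;
}

const DROP_THRESHOLD = 0.05; // assumed to mean percentage points
const COOLDOWN_MS = 5 * 60 * 1000;

function shouldAlert(
  prev: EvalRun,
  curr: EvalRun,
  lastAlertAt: number | null,
  now: number
): boolean {
  const before = computePassRate(prev);
  const after = computePassRate(curr);
  if (before === null || after === null) return false;
  // Suppress duplicates inside the cooldown window.
  if (lastAlertAt !== null && now - lastAlertAt < COOLDOWN_MS) return false;
  return before - after > DROP_THRESHOLD;
}

console.log(shouldAlert({ passed: 100, total: 100 }, { passed: 90, total: 100 }, null, Date.now()));
```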

Weekly digest: pg_cron Monday 9am UTC, aggregates 7-day team data,
Slack message with evals/ships/sessions/costs. 15 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:47 -05:00
Garry Tan
46c82ce8ec feat: add team admin CLI + migration 007 (settings, cooldowns, create_team RPC)
New `gstack team` CLI with create, members, set subcommands.
Migration adds team_settings (admin-only), alert_cooldowns (edge-fn
dedup), and create_team() SECURITY DEFINER RPC for atomic team +
first member creation. 9 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:24 -05:00
Garry Tan
4985c8e7e9 feat: add CLI leaderboard, refactor formatTeamSummary to use dashboard-queries
New `gstack eval leaderboard` subcommand pulls team data and renders
weekly stats per contributor. Refactored formatTeamSummary to use
computeVelocity from dashboard-queries (DRY). 4 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:12 -05:00
Garry Tan
e969c6dadf feat: add dashboard query functions — pure transforms for team analytics
6 functions: detectRegressions, computeVelocity, computeCostTrend,
computeLeaderboard, computeQATrend, computeEvalTrend. All pure,
no I/O, with division-by-zero guards. 28 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:43:52 -05:00
Garry Tan
3e3843c4a9 feat: contributor mode, session awareness, recommendation format (#90)
* feat: contributor mode, session awareness, universal RECOMMENDATION format

- Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates
- Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions)
- ELI16 mode when 3+ concurrent sessions detected (re-ground user on context)
- Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/
- Universal AskUserQuestion format: context → question → RECOMMENDATION → options
- Update plan-ceo-review and plan-eng-review to reference preamble baseline
- Add vendored symlink awareness section to CLAUDE.md
- Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing
- Add tests for contributor mode and session awareness in generated output
- Add E2E eval for contributor mode report filing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Enum & Value Completeness to /review critical checklist

New CRITICAL review category that traces new enum values, status strings,
and type constants through every consumer outside the diff. Catches the
class of bugs where a new value is added but not handled in all switch/case
chains, allowlists, or frontend-backend contracts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add CHANGELOG style guide — user-facing, sell the feature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite v0.4.1 changelog to be user-facing and sell the features

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add evals for RECOMMENDATION format, session awareness, and enum completeness

Free tests (Tier 1): RECOMMENDATION format + session awareness in all
preamble SKILL.md files, enum completeness checklist structure and CRITICAL
classification.

E2E eval: /review catches missed enum handlers when a new status value
is added but not handled in case/switch and notify methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add E2E eval for session awareness ELI16 mode

Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments
branch, verifies the output re-grounds the user with project, branch,
context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: contributor mode eval marked FAIL due to expected browse error

The test intentionally runs a nonexistent binary to trigger contributor
mode. The session runner's browse error detection catches "no such file
or directory...browse" and sets browseErrors, causing recordE2E to mark
passed=false. Override passed to check only exitReason since the browse
error is the expected scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 01:45:50 -05:00
Garry Tan
6e14689f0e docs: add team sync TODOs — streaming parser, effectiveness scoring, weekly digest
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:40 -05:00
Garry Tan
3a57a3f59e feat: add /setup-team-sync skill, auto-push transcript hooks in skills
- setup-team-sync/SKILL.md.tmpl: idempotent guided setup (create config,
  OAuth, verify connectivity, configure settings, summary)
- ship/retro/qa SKILL.md.tmpl: add push-transcript hook after existing
  push-ship/push-retro/push-qa hooks (silent, non-fatal)
- scripts/gen-skill-docs.ts: add setup-team-sync to template list
- Regenerated all SKILL.md files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:36 -05:00
Garry Tan
a104471272 feat: add push-transcript CLI, show sessions, interactive setup, 36 tests
- cli-sync.ts: push-transcript command, show sessions with formatSessionTable(),
  upgrade cmdSetup() to interactively create .gstack-sync.json if missing
- bin/gstack-sync: add push-transcript case and help text
- test/lib-llm-summarize.test.ts: 10 tests with mocked fetch (429 retry,
  5xx backoff, malformed response, no API key, cache)
- test/lib-transcript-sync.test.ts: 22 tests for parsing, grouping,
  session file extraction, marker management, slug resolution
- test/lib-sync-show.test.ts: 4 tests for formatSessionTable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:26 -05:00
Garry Tan
0e29d7d1a3 feat: add enriched transcript sync — Haiku summaries, session file enrichment
Add session intelligence pipeline for team transcript sync:
- lib/transcript-sync.ts: parse history.jsonl, enrich with Claude session
  file data (tools_used, full turn count), sync marker management,
  10-concurrent push with 5-concurrent Haiku summarization
- lib/llm-summarize.ts: raw fetch() to Anthropic Messages API (no SDK dep),
  retry-after on 429, exponential backoff on 5xx, SHA-based eval-cache
- lib/sync.ts: pushTranscript() and pullTranscripts() following existing patterns
- 006_transcript_sync.sql: unique index on (team_id, session_id) for
  idempotent upsert, RLS changed from admin-only to team-wide read
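
The retry policy listed for lib/llm-summarize.ts above (retry-after on 429, exponential backoff on 5xx) could be sketched roughly as follows. This is an illustration, not the code in the commit: `retryDelayMs`, `requestWithRetry`, the 500 ms base delay, and the cap of 4 attempts are all assumed names and values.

```typescript
// Hypothetical sketch: helper names, the base delay, and the attempt cap are
// assumptions for illustration, not values taken from lib/llm-summarize.ts.

// Minimal response shape; the Response returned by fetch() satisfies it.
interface HttpResponse {
  status: number;
  headers: { get(name: string): string | null };
}

// Returns how long to wait before retrying, or null when the status is not retryable.
function retryDelayMs(
  status: number,
  retryAfter: string | null,
  attempt: number, // 1-based attempt number
  baseDelayMs = 500,
): number | null {
  const backoff = baseDelayMs * 2 ** (attempt - 1);
  if (status === 429) {
    // Only the delta-seconds form of Retry-After is handled here;
    // a missing or non-numeric header falls back to the backoff delay.
    const seconds = Number(retryAfter);
    const usable =
      retryAfter !== null && retryAfter.trim() !== "" && Number.isFinite(seconds) && seconds >= 0;
    return usable ? seconds * 1000 : backoff;
  }
  if (status >= 500 && status <= 599) return backoff;
  return null;
}

// Re-sends the request until it succeeds, is not retryable, or attempts run out.
async function requestWithRetry(
  send: () => Promise<HttpResponse>,
  maxAttempts = 4,
): Promise<HttpResponse> {
  for (let attempt = 1; ; attempt++) {
    const res = await send();
    const delay = retryDelayMs(res.status, res.headers.get("retry-after"), attempt);
    if (delay === null || attempt >= maxAttempts) return res;
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
  }
}
```

A caller would wrap a raw request, for example `requestWithRetry(() => fetch(url, init))`. Keeping the delay calculation in a pure function makes the policy testable without a network or timers.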

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:19 -05:00
Garry Tan
f3ee0ee28a feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)
* feat: browser ref staleness detection via async count() validation

resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.
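
As a rough illustration of the staleness check described above, the sketch below validates a stored ref by counting matching elements before handing it back. Only the count() check and the role+name metadata come from the commit message; `CountableLocator`, `resolveRefSketch`, the `@e1`-style ref keys, and the error wording are assumptions.

```typescript
// Hypothetical sketch: type and function names are illustrative, not gstack's API.

// Minimal locator shape; Playwright's Locator exposes a count() method like this.
interface CountableLocator {
  count(): Promise<number>;
}

// A stored ref keeps role and name so a stale-ref error can say what went missing.
interface RefEntry {
  locator: CountableLocator;
  role: string;
  name: string;
}

async function resolveRefSketch(
  refs: Map<string, RefEntry>,
  ref: string,
): Promise<CountableLocator> {
  const entry = refs.get(ref);
  if (!entry) throw new Error(`Unknown ref ${ref}`);

  // Zero matches means the page changed after the snapshot (e.g. SPA navigation).
  const matches = await entry.locator.count();
  if (matches === 0) {
    throw new Error(
      `Ref ${ref} (${entry.role} "${entry.name}") is stale: element no longer on the page`,
    );
  }
  return entry.locator;
}
```

Failing at resolve time with the role and name attached gives a clearer message than letting a later click or fill time out on an element that no longer exists.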

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: qa-only skill, qa fix loop, plan-to-QA artifact flow

Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: eval efficiency metrics — turns, duration, commentary across all surfaces

Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0

- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add user-facing benefit descriptions to v0.4.0 changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 23:55:39 -05:00
Garry Tan
87cb769c35 feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests
- 005_sync_heartbeats.sql migration for connectivity testing
- eval:trend --team flag pulls team eval data (graceful fallback)
- docs/TEAM_SYNC_SETUP.md step-by-step setup guide
- Design doc status updated to Phase 2 complete
- 10 new tests for sync show formatting functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:43:03 -05:00
Garry Tan
06f2da2019 feat: wire team sync push into ship, retro, qa, and greptile skills
Add non-fatal sync steps to all 4 skill templates:
- /ship Step 8.5: write ship log JSON + push after PR creation
- /retro Step 13: push snapshot after JSON save
- /qa Phase 6.7: write qa-sync.json + push after health score
- greptile-triage: push each triage entry after history file writes

All calls use || true for zero disruption. Silent when sync not
configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:42:54 -05:00
Garry Tan
dc3fcc8611 feat: DRY push functions, add push-greptile + sync test/show commands
Extract pushWithSync() helper to eliminate boilerplate across 6 push
functions. Add pushHeartbeat() for connectivity testing. Add push-greptile
to CLI. New commands: gstack-sync test (validates full push/pull flow
via sync_heartbeats table), gstack-sync show (terminal team data
dashboard with summary/evals/ships/retros views). Guard main block
with import.meta.main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:42:45 -05:00
Garry Tan
704fe34e98 docs: clean up sync example, add team sync section to README
Remove _comment hacks from JSON example file. Add short team sync
section to README explaining what it is, that it's optional, and
how to set it up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:06:51 -05:00
Garry Tan
14320469b0 docs: CHANGELOG covers full branch scope including team sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:05:45 -05:00
Garry Tan
eb7ef2153b docs: add setup comments to .gstack-sync.json.example
Explain what team sync gives you, that it's optional, and how to
set it up. Points to TEAM_COORDINATION_STORE.md for full guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:04:49 -05:00
Garry Tan
e28033353d chore: bump v0.3.10, update CHANGELOG and docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:55:34 -05:00
Garry Tan
33c9552870 chore: update gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:46 -05:00
Garry Tan
daea165333 feat: add eval:trend CLI for per-test pass rate tracking
computeTrends() classifies tests as stable-pass/stable-fail/flaky/
improving/degrading based on pass rate, flip count, and recent streak.
gstack eval trend shows sparkline table with --limit, --tier, --test
filters. Guard CLI main block with import.meta.main to prevent
execution on import.
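
A minimal sketch of how such a classification might work, using the three signals named above (pass rate, flip count, recent streak). The thresholds, the `classifyTrend` name, and the boolean-history input are assumptions; the commit does not state the actual rules in computeTrends().

```typescript
// Hypothetical sketch: thresholds and the function name are assumptions,
// not the actual rules used by computeTrends().
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// history is ordered oldest to newest; true means the test passed in that run.
function classifyTrend(history: boolean[]): Trend {
  if (history.length === 0) throw new Error("history must not be empty");

  const passRate = history.filter(Boolean).length / history.length;

  // Flip count: how many times the result changed between consecutive runs.
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }

  // Recent streak: length of the run of identical results at the end.
  let streak = 1;
  for (let i = history.length - 1; i > 0 && history[i] === history[i - 1]; i--) {
    streak++;
  }
  const last = history[history.length - 1];

  if (passRate === 1) return "stable-pass";
  if (passRate === 0) return "stable-fail";
  if (flips >= 3) return "flaky"; // assumed threshold
  if (streak >= 3) return last ? "improving" : "degrading"; // assumed threshold
  return "flaky";
}
```

For example, `classifyTrend([false, false, true, true, true])` yields "improving" under these assumed thresholds, because the mixed history ends in a streak of three passes with only one flip.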

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:41 -05:00
Garry Tan
59752fc510 feat: wire eval-cache + eval-tier into LLM judge, pin E2E model
callJudge/judge now return {result, meta} with SHA-based caching
(~$0.18/run savings when SKILL.md unchanged) and dynamic model
selection via EVAL_JUDGE_TIER env var. E2E tests pass --model from
EVAL_TIER to claude -p. outcomeJudge retains simple return type.
All 8 LLM eval test sites updated with real costs and costs[].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:35 -05:00
Garry Tan
02925cfc7a feat: wire costs[] from modelUsage into eval results
Extract per-model token usage from resultLine.modelUsage (including
cache tokens and exact API cost), flow CostEntry[] through EvalCollector,
aggregate in finalize(). Extend CostEntry with cache_read_input_tokens,
cache_creation_input_tokens, cost_usd. computeCosts() prefers exact
cost_usd over MODEL_PRICING when available (~4x more accurate with
prompt caching).
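
The preference for exact cost over table-based estimates might look roughly like the sketch below. The `cost_usd` and cache token fields are named in the commit; `input_tokens`, `output_tokens`, the helper names, and the placeholder rates in the pricing table are assumptions and not real pricing.

```typescript
// Hypothetical sketch: helper names, input_tokens/output_tokens, and the
// pricing figures are assumptions for illustration only.
interface CostEntry {
  model: string;
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
  cost_usd?: number; // exact cost reported by the API, when available
}

// Placeholder USD rates per million tokens; not real prices.
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  "example-model": { input: 1.0, output: 5.0 },
};

function estimateCostUsd(entry: CostEntry): number {
  // Prefer the exact figure: a flat per-token table cannot account for
  // cache reads and cache writes being billed at different rates.
  if (entry.cost_usd !== undefined) return entry.cost_usd;

  const pricing = MODEL_PRICING[entry.model];
  if (!pricing) return 0; // unknown model: no estimate
  return (
    (entry.input_tokens * pricing.input + entry.output_tokens * pricing.output) / 1_000_000
  );
}

function totalCostUsd(entries: CostEntry[]): number {
  return entries.reduce((sum, entry) => sum + estimateCostUsd(entry), 0);
}
```

With prompt caching, most input tokens may be billed at a cache rate rather than the list rate, which is why a table-only estimate can drift far from the reported figure.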

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:27 -05:00
Garry Tan
4ad73f7362 feat: unified gstack eval CLI with list, compare, push, cache, cost
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch
  subcommands. Ports logic from 4 separate scripts into unified entry.
  Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list.
- bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern
- package.json: eval:* scripts now point to lib/cli-eval.ts
- supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS
- docs/eval-result-format.md: public format spec for any language
- test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess)
  including 3 push failure modes (file-not-found, invalid schema,
  sync unavailable)

215 tests passing across 13 files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:36 -05:00