feat: design outside voices — cross-model design critique (v0.11.3.0) (#347)

* feat(gen-skill-docs): add design outside voices + hard rules resolvers

Add generateDesignOutsideVoices() — parallel Codex + Claude subagent
dispatch for cross-model design critique with litmus scorecard synthesis.
Branches per skillName (plan-design-review, design-review, design-consultation)
with task-specific reasoning effort (high for analytical, medium for creative).

Add generateDesignHardRules() — OpenAI Frontend Skill hard rules + gstack
AI slop blacklist unified into one shared block with classifier step
(landing page vs app UI vs hybrid).

Extract AI_SLOP_BLACKLIST constant from inline prose in generateDesignMethodology()
for DRY. Extend generateDesignReviewLite() with lightweight Codex block.
Extend generateDesignSketch() with outside voices opt-in after wireframe.

Source: OpenAI "Designing Delightful Frontends with GPT-5.4" (Mar 2026)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(design skills): add outside voices + hard rules to all design templates

Insert {{DESIGN_OUTSIDE_VOICES}} in plan-design-review (between Step 0D
and Pass 1), design-review (between Phase 6 and Phase 7), and
design-consultation (between Phase 2 and Phase 3).

Insert {{DESIGN_HARD_RULES}} in plan-design-review Pass 4 and design-review
Phase 3 checklist.

DESIGN_REVIEW_LITE in /ship and /review now includes a Codex design voice
block with litmus checks.

DESIGN_SKETCH in /office-hours now includes outside voices opt-in after
wireframe approval.

Regenerated all SKILL.md files (both Claude and Codex hosts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add resolver tests + touchfiles for design outside voices

Add 18 test cases across 4 new describe blocks:
- DESIGN_OUTSIDE_VOICES: host guard, skillName branching, reasoning effort
- DESIGN_HARD_RULES: classifier, 3 rule sets, slop blacklist, OpenAI criteria
- DESIGN_SKETCH extended: outside voices step, original wireframe preserved
- DESIGN_REVIEW_LITE extended: Codex block, codex host exclusion

Update touchfiles: add scripts/gen-skill-docs.ts to design skill E2E
test dependencies for accurate diff-based test selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.11.3.0)

Design outside voices — parallel Codex + Claude subagent for cross-model
design critique with litmus scorecard synthesis. OpenAI hard rules + gstack
slop blacklist unified. Classifier for landing page vs app UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: generate .agents/ on demand in tests (not checked in since v0.11.2.0)

.agents/ is gitignored since v0.11.2.0 — tests that read Codex-host
SKILL.md files now generate them on demand via `bun run gen-skill-docs.ts
--host codex` before reading. Fixes test failures on fresh clones.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-22 20:22:23 -07:00
committed by GitHub
parent b7a3bf108d
commit 4cd4d11cb0
16 changed files with 999 additions and 23 deletions

View File

@@ -856,6 +856,75 @@ Tie everything to user goals and product objectives. Always suggest specific imp
10. **Depth over breadth.** 5-10 well-documented findings with screenshots and specific suggestions > 20 vague observations.
11. **Show screenshots to the user.** After every `$B screenshot`, `$B snapshot -a -o`, or `$B responsive` command, use the Read tool on the output file(s) so the user can see them inline. For `responsive` (3 files), Read all three. This is critical — without it, screenshots are invisible to the user.
### Design Hard Rules
**Classifier — determine rule set before evaluating:**
- **MARKETING/LANDING PAGE** (hero-driven, brand-forward, conversion-focused) → apply Landing Page Rules
- **APP UI** (workspace-driven, data-dense, task-focused: dashboards, admin, settings) → apply App UI Rules
- **HYBRID** (marketing shell with app-like sections) → apply Landing Page Rules to hero/marketing sections, App UI Rules to functional sections
**Hard rejection criteria** (instant-fail patterns — flag if ANY apply):
1. Generic SaaS card grid as first impression
2. Beautiful image with weak brand
3. Strong headline with no clear action
4. Busy imagery behind text
5. Sections repeating same mood statement
6. Carousel with no narrative purpose
7. App UI made of stacked cards instead of layout
**Litmus checks** (answer YES/NO for each — used for cross-model consensus scoring):
1. Brand/product unmistakable in first screen?
2. One strong visual anchor present?
3. Page understandable by scanning headlines only?
4. Each section has one job?
5. Are cards actually necessary?
6. Does motion improve hierarchy or atmosphere?
7. Would design feel premium with all decorative shadows removed?
**Landing page rules** (apply when classifier = MARKETING/LANDING):
- First viewport reads as one composition, not a dashboard
- Brand-first hierarchy: brand > headline > body > CTA
- Typography: expressive, purposeful — no default stacks (Inter, Roboto, Arial, system)
- No flat single-color backgrounds — use gradients, images, subtle patterns
- Hero: full-bleed, edge-to-edge, no inset/tiled/rounded variants
- Hero budget: brand, one headline, one supporting sentence, one CTA group, one image
- No cards in hero. Cards only when card IS the interaction
- One job per section: one purpose, one headline, one short supporting sentence
- Motion: 2-3 intentional motions minimum (entrance, scroll-linked, hover/reveal)
- Color: define CSS variables, avoid purple-on-white defaults, one accent color default
- Copy: product language not design commentary. "If deleting 30% improves it, keep deleting"
- Beautiful defaults: composition-first, brand as loudest text, two typefaces max, cardless by default, first viewport as poster not document
**App UI rules** (apply when classifier = APP UI):
- Calm surface hierarchy, strong typography, few colors
- Dense but readable, minimal chrome
- Organize: primary workspace, navigation, secondary context, one accent
- Avoid: dashboard-card mosaics, thick borders, decorative gradients, ornamental icons
- Copy: utility language — orientation, status, action. Not mood/brand/aspiration
- Cards only when card IS the interaction
- Section headings state what area is or what user can do ("Selected KPIs", "Plan status")
**Universal rules** (apply to ALL types):
- Define CSS variables for color system
- No default font stacks (Inter, Roboto, Arial, system)
- One job per section
- "If deleting 30% of the copy improves it, keep deleting"
- Cards earn their existence — no decorative card grids
**AI Slop blacklist** (the 10 patterns that scream "AI-generated"):
1. Purple/violet/indigo gradient backgrounds or blue-to-purple color schemes
2. **The 3-column feature grid:** icon-in-colored-circle + bold title + 2-line description, repeated 3x symmetrically. THE most recognizable AI layout.
3. Icons in colored circles as section decoration (SaaS starter template look)
4. Centered everything (`text-align: center` on all headings, descriptions, cards)
5. Uniform bubbly border-radius on every element (same large radius on everything)
6. Decorative blobs, floating circles, wavy SVG dividers (if a section feels empty, it needs better content, not decoration)
7. Emoji as design elements (rockets in headings, emoji as bullet points)
8. Colored left-border on cards (`border-left: 3px solid <accent>`)
9. Generic hero copy ("Welcome to [X]", "Unlock the power of...", "Your all-in-one solution for...")
10. Cookie-cutter section rhythm (hero → 3 features → testimonials → pricing → CTA, every section same height)
Source: [OpenAI "Designing Delightful Frontends with GPT-5.4"](https://developers.openai.com/blog/designing-delightful-frontends-with-gpt-5-4) (Mar 2026) + gstack design methodology.
Record baseline design score and AI slop score at end of Phase 6.
---
@@ -879,6 +948,87 @@ Record baseline design score and AI slop score at end of Phase 6.
---
## Design Outside Voices (parallel)
**Automatic:** Outside voices run automatically when Codex is available. No opt-in needed.
**Check Codex availability:**
```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```
**If Codex is available**, launch both voices simultaneously:
1. **Codex design voice** (via Bash):
```bash
TMPERR_DESIGN=$(mktemp /tmp/codex-design-XXXXXXXX)
codex exec "Review the frontend source code in this repo. Evaluate against these design hard rules:
- Spacing: systematic (design tokens / CSS variables) or magic numbers?
- Typography: expressive purposeful fonts or default stacks?
- Color: CSS variables with defined system, or hardcoded hex scattered?
- Responsive: breakpoints defined? calc(100svh - header) for heroes? Mobile tested?
- A11y: ARIA landmarks, alt text, contrast ratios, 44px touch targets?
- Motion: 2-3 intentional animations, or zero / ornamental only?
- Cards: used only when card IS the interaction? No decorative card grids?
First classify as MARKETING/LANDING PAGE vs APP UI vs HYBRID, then apply matching rules.
LITMUS CHECKS — answer YES/NO:
1. Brand/product unmistakable in first screen?
2. One strong visual anchor present?
3. Page understandable by scanning headlines only?
4. Each section has one job?
5. Are cards actually necessary?
6. Does motion improve hierarchy or atmosphere?
7. Would design feel premium with all decorative shadows removed?
HARD REJECTION — flag if ANY apply:
1. Generic SaaS card grid as first impression
2. Beautiful image with weak brand
3. Strong headline with no clear action
4. Busy imagery behind text
5. Sections repeating same mood statement
6. Carousel with no narrative purpose
7. App UI made of stacked cards instead of layout
Be specific. Reference file:line for every finding." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DESIGN"
```
Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr:
```bash
cat "$TMPERR_DESIGN" && rm -f "$TMPERR_DESIGN"
```
2. **Claude design subagent** (via Agent tool):
Dispatch a subagent with this prompt:
"Review the frontend source code in this repo. You are an independent senior product designer doing a source-code design audit. Focus on CONSISTENCY PATTERNS across files rather than individual violations:
- Are spacing values systematic across the codebase?
- Is there ONE color system or scattered approaches?
- Do responsive breakpoints follow a consistent set?
- Is the accessibility approach consistent or spotty?
For each finding: what's wrong, severity (critical/high/medium), and the file:line."
**Error handling (all non-blocking):**
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run `codex login` to authenticate."
- **Timeout:** "Codex timed out after 5 minutes."
- **Empty response:** "Codex returned no response."
- On any Codex error: proceed with Claude subagent output only, tagged `[single-model]`.
- If Claude subagent also fails: "Outside voices unavailable — continuing with primary review."
Present Codex output under a `CODEX SAYS (design source audit):` header.
Present subagent output under a `CLAUDE SUBAGENT (design consistency):` header.
**Synthesis — Litmus scorecard:**
Use the same scorecard format as /plan-design-review (shown above). Fill in from both outputs.
Merge findings into the triage with `[codex]` / `[subagent]` / `[cross-model]` tags.
**Log the result:**
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-outside-voices","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}'
```
Replace STATUS with "clean" or "issues_found", SOURCE with "codex+subagent", "codex-only", "subagent-only", or "unavailable".
## Phase 7: Triage
Sort all discovered findings by impact, then decide which to fix:

View File

@@ -84,6 +84,8 @@ mkdir -p "$REPORT_DIR/screenshots"
{{DESIGN_METHODOLOGY}}
{{DESIGN_HARD_RULES}}
Record baseline design score and AI slop score at end of Phase 6.
---
@@ -107,6 +109,8 @@ Record baseline design score and AI slop score at end of Phase 6.
---
{{DESIGN_OUTSIDE_VOICES}}
## Phase 7: Triage
Sort all discovered findings by impact, then decide which to fix: