merge: origin/main v1.1.2.0 (mode-posture energy fix) into fix-checkpoints

Main shipped /plan-ceo-review + /office-hours mode-posture fixes as v1.1.2.0 — same version slot this branch used. Bumped ours to v1.1.3.0. Resolved conflicts: - VERSION / package.json: 1.1.2.0 → 1.1.3.0 - CHANGELOG.md: our entry moved up to [1.1.3.0], main's mode-posture entry kept at [1.1.2.0]. Sequence: 1.1.3.0 → 1.1.2.0 → 1.1.1.0 → 1.1.0.0 → 1.0.0.0 - Migration renamed v1.1.2.0.sh → v1.1.3.0.sh (version string inside + test path reference + hardening test describe block all updated) Also expanded our CHANGELOG entry to reflect the testing work that landed over the debugging loop: 8 live-fire E2E tests, 21 free-tier hardening tests, collision sentinel, env: param on runSkillTest, and GSTACK_HOME storage-path fix. No code overlap with main's changes (main edited preamble writing-style rules + plan-ceo-review and office-hours templates; ours edited context-save / context-restore / migration / tests). SKILL.md files regenerated via bun run gen:skill-docs --host all. Golden fixtures updated. bun test: 0 failures after renaming v1.1.2.0 → v1.1.3.0 in the hardening test describe block and MIGRATION constant.
2026-05-17 01:31:26 +08:00 · 2026-04-19 06:42:56 +08:00
parent 45df469909 8ee16b867b
commit ffa279b507
50 changed files with 776 additions and 128 deletions
--- a/test/fixtures/golden/claude-ship-SKILL.md
+++ b/test/fixtures/golden/claude-ship-SKILL.md
@@ -410,9 +410,15 @@ Per-skill instructions may add additional formatting rules on top of this baseli
 These rules apply to every AskUserQuestion, every response you write to the user, and every review finding. They compose with the AskUserQuestion Format section above: Format = *how* a question is structured; Writing Style = *the prose quality of the content inside it*.

 1. **Jargon gets a one-sentence gloss on first use per skill invocation.** Even if the user's own prompt already contained the term — users often paste jargon from someone else's plan. Gloss unconditionally on first use. No cross-invocation memory: a new skill fire is a new first-use opportunity. Example: "race condition (two things happen at the same time and step on each other)".
-2. **Frame questions in outcome terms, not implementation terms.** Bad: "Is this endpoint idempotent?" Good: "If someone double-clicks the button, is it OK for the action to run twice?" Ask the question the user would actually want to answer.
-3. **Short sentences. Concrete nouns. Active voice.** Standard advice from any good writing guide. Prefer "the cache stores the result for 60s" over "results will have been cached for a period of 60s."
-4. **Close every decision with user impact.** Connect the technical call back to who's affected. "If we skip this, your users will see a 3-second spinner on every page load." Make the user's user real.
+2. **Frame questions in outcome terms, not implementation terms.** Ask the question the user would actually want to answer. Outcome framing covers three families — match the framing to the mode:
+   - **Pain reduction** (default for diagnostic / HOLD SCOPE / rigor review): "If someone double-clicks the button, is it OK for the action to run twice?" (instead of "Is this endpoint idempotent?")
+   - **Upside / delight** (for expansion / builder / vision contexts): "When the workflow finishes, does the user see the result instantly, or are they still refreshing a dashboard?" (instead of "Should we add webhook notifications?")
+   - **Interrogative pressure** (for forcing-question / founder-challenge contexts): "Can you name the actual person whose career gets better if this ships and whose career gets worse if it doesn't?" (instead of "Who's the target user?")
+3. **Short sentences. Concrete nouns. Active voice.** Standard advice from any good writing guide. Prefer "the cache stores the result for 60s" over "results will have been cached for a period of 60s." *Exception:* stacked, multi-part questions are a legitimate forcing device — "Title? Gets them promoted? Gets them fired? Keeps them up at night?" is longer than one short sentence, and it should be, because the pressure IS in the stacking. Don't collapse a stack into a single neutral ask when the skill's posture is forcing.
+4. **Close every decision with user impact.** Connect the technical call back to who's affected. Make the user's user real. Impact has three shapes — again, match the mode:
+   - **Pain avoided:** "If we skip this, your users will see a 3-second spinner on every page load."
+   - **Capability unlocked:** "If we ship this, users get instant feedback the moment a workflow finishes — no tabs to refresh, no polling."
+   - **Consequence named** (for forcing questions): "If you can't name the person whose career this helps, you don't know who you're building for — and 'users' isn't an answer."
 5. **User-turn override.** If the user's current message says "be terse" / "no explanations" / "brutally honest, just the answer" / similar, skip this entire Writing Style block for your next response, regardless of config. User's in-turn request wins.
 6. **Glossary boundary is the curated list.** Terms below get glossed. Terms not on the list are assumed plain-English enough. If you see a term that genuinely needs glossing but isn't listed, note it (once) in your response so it can be added via PR.

--- a/test/fixtures/golden/codex-ship-SKILL.md
+++ b/test/fixtures/golden/codex-ship-SKILL.md
@@ -399,9 +399,15 @@ Per-skill instructions may add additional formatting rules on top of this baseli
 These rules apply to every AskUserQuestion, every response you write to the user, and every review finding. They compose with the AskUserQuestion Format section above: Format = *how* a question is structured; Writing Style = *the prose quality of the content inside it*.

 1. **Jargon gets a one-sentence gloss on first use per skill invocation.** Even if the user's own prompt already contained the term — users often paste jargon from someone else's plan. Gloss unconditionally on first use. No cross-invocation memory: a new skill fire is a new first-use opportunity. Example: "race condition (two things happen at the same time and step on each other)".
-2. **Frame questions in outcome terms, not implementation terms.** Bad: "Is this endpoint idempotent?" Good: "If someone double-clicks the button, is it OK for the action to run twice?" Ask the question the user would actually want to answer.
-3. **Short sentences. Concrete nouns. Active voice.** Standard advice from any good writing guide. Prefer "the cache stores the result for 60s" over "results will have been cached for a period of 60s."
-4. **Close every decision with user impact.** Connect the technical call back to who's affected. "If we skip this, your users will see a 3-second spinner on every page load." Make the user's user real.
+2. **Frame questions in outcome terms, not implementation terms.** Ask the question the user would actually want to answer. Outcome framing covers three families — match the framing to the mode:
+   - **Pain reduction** (default for diagnostic / HOLD SCOPE / rigor review): "If someone double-clicks the button, is it OK for the action to run twice?" (instead of "Is this endpoint idempotent?")
+   - **Upside / delight** (for expansion / builder / vision contexts): "When the workflow finishes, does the user see the result instantly, or are they still refreshing a dashboard?" (instead of "Should we add webhook notifications?")
+   - **Interrogative pressure** (for forcing-question / founder-challenge contexts): "Can you name the actual person whose career gets better if this ships and whose career gets worse if it doesn't?" (instead of "Who's the target user?")
+3. **Short sentences. Concrete nouns. Active voice.** Standard advice from any good writing guide. Prefer "the cache stores the result for 60s" over "results will have been cached for a period of 60s." *Exception:* stacked, multi-part questions are a legitimate forcing device — "Title? Gets them promoted? Gets them fired? Keeps them up at night?" is longer than one short sentence, and it should be, because the pressure IS in the stacking. Don't collapse a stack into a single neutral ask when the skill's posture is forcing.
+4. **Close every decision with user impact.** Connect the technical call back to who's affected. Make the user's user real. Impact has three shapes — again, match the mode:
+   - **Pain avoided:** "If we skip this, your users will see a 3-second spinner on every page load."
+   - **Capability unlocked:** "If we ship this, users get instant feedback the moment a workflow finishes — no tabs to refresh, no polling."
+   - **Consequence named** (for forcing questions): "If you can't name the person whose career this helps, you don't know who you're building for — and 'users' isn't an answer."
 5. **User-turn override.** If the user's current message says "be terse" / "no explanations" / "brutally honest, just the answer" / similar, skip this entire Writing Style block for your next response, regardless of config. User's in-turn request wins.
 6. **Glossary boundary is the curated list.** Terms below get glossed. Terms not on the list are assumed plain-English enough. If you see a term that genuinely needs glossing but isn't listed, note it (once) in your response so it can be added via PR.

--- a/test/fixtures/golden/factory-ship-SKILL.md
+++ b/test/fixtures/golden/factory-ship-SKILL.md
@@ -401,9 +401,15 @@ Per-skill instructions may add additional formatting rules on top of this baseli
 These rules apply to every AskUserQuestion, every response you write to the user, and every review finding. They compose with the AskUserQuestion Format section above: Format = *how* a question is structured; Writing Style = *the prose quality of the content inside it*.

 1. **Jargon gets a one-sentence gloss on first use per skill invocation.** Even if the user's own prompt already contained the term — users often paste jargon from someone else's plan. Gloss unconditionally on first use. No cross-invocation memory: a new skill fire is a new first-use opportunity. Example: "race condition (two things happen at the same time and step on each other)".
-2. **Frame questions in outcome terms, not implementation terms.** Bad: "Is this endpoint idempotent?" Good: "If someone double-clicks the button, is it OK for the action to run twice?" Ask the question the user would actually want to answer.
-3. **Short sentences. Concrete nouns. Active voice.** Standard advice from any good writing guide. Prefer "the cache stores the result for 60s" over "results will have been cached for a period of 60s."
-4. **Close every decision with user impact.** Connect the technical call back to who's affected. "If we skip this, your users will see a 3-second spinner on every page load." Make the user's user real.
+2. **Frame questions in outcome terms, not implementation terms.** Ask the question the user would actually want to answer. Outcome framing covers three families — match the framing to the mode:
+   - **Pain reduction** (default for diagnostic / HOLD SCOPE / rigor review): "If someone double-clicks the button, is it OK for the action to run twice?" (instead of "Is this endpoint idempotent?")
+   - **Upside / delight** (for expansion / builder / vision contexts): "When the workflow finishes, does the user see the result instantly, or are they still refreshing a dashboard?" (instead of "Should we add webhook notifications?")
+   - **Interrogative pressure** (for forcing-question / founder-challenge contexts): "Can you name the actual person whose career gets better if this ships and whose career gets worse if it doesn't?" (instead of "Who's the target user?")
+3. **Short sentences. Concrete nouns. Active voice.** Standard advice from any good writing guide. Prefer "the cache stores the result for 60s" over "results will have been cached for a period of 60s." *Exception:* stacked, multi-part questions are a legitimate forcing device — "Title? Gets them promoted? Gets them fired? Keeps them up at night?" is longer than one short sentence, and it should be, because the pressure IS in the stacking. Don't collapse a stack into a single neutral ask when the skill's posture is forcing.
+4. **Close every decision with user impact.** Connect the technical call back to who's affected. Make the user's user real. Impact has three shapes — again, match the mode:
+   - **Pain avoided:** "If we skip this, your users will see a 3-second spinner on every page load."
+   - **Capability unlocked:** "If we ship this, users get instant feedback the moment a workflow finishes — no tabs to refresh, no polling."
+   - **Consequence named** (for forcing questions): "If you can't name the person whose career this helps, you don't know who you're building for — and 'users' isn't an answer."
 5. **User-turn override.** If the user's current message says "be terse" / "no explanations" / "brutally honest, just the answer" / similar, skip this entire Writing Style block for your next response, regardless of config. User's in-turn request wins.
 6. **Glossary boundary is the curated list.** Terms below get glossed. Terms not on the list are assumed plain-English enough. If you see a term that genuinely needs glossing but isn't listed, note it (once) in your response so it can be added via PR.

--- a/test/fixtures/mode-posture/builder-idea.md
+++ b/test/fixtures/mode-posture/builder-idea.md
@@ -0,0 +1,15 @@
+# Weekend Project: Dependency Graph Visualizer
+
+I want to build a tool that takes a codebase and visualizes its dependency graph — modules, imports, which files depend on which. For fun, for learning. Maybe open-source it.
+
+## What I have so far
+
+- Rough idea: point it at a repo, get an interactive graph
+- Stack I'm leaning toward: TypeScript + D3 or Cytoscape for rendering
+- Potential: could work for JS/TS first, maybe Python later
+
+## What I don't know yet
+
+- How to make the visualization actually useful vs just pretty
+- Whether this should be a CLI, a web tool, or a VS Code extension
+- What would make someone else want to use it
--- a/test/fixtures/mode-posture/expansion-plan.md
+++ b/test/fixtures/mode-posture/expansion-plan.md
@@ -0,0 +1,23 @@
+# Plan: Team Velocity Dashboard
+
+## Context
+
+We're building a dashboard for engineering managers to track team code velocity — commits per engineer, PR cycle time, review latency, CI pass rate. The data already lives in GitHub; we're just aggregating it for a manager's single-pane view.
+
+## Changes
+
+1. New React component `TeamVelocityDashboard` in `src/dashboard/`
+2. REST API endpoint `GET /api/team/velocity?days=30` returning aggregated metrics
+3. Background job pulling GitHub data every 15 minutes into Postgres
+4. Simple filter UI: team, date range, metric
+
+## Architecture
+
+- Frontend: React + shadcn/ui
+- Backend: Express + PostgreSQL
+- Data source: GitHub REST API (cached 15min)
+
+## Open questions
+
+- Should we support multiple repos per team?
+- Do we show individual engineer names or aggregate only?
--- a/test/fixtures/mode-posture/forcing-pitch.md
+++ b/test/fixtures/mode-posture/forcing-pitch.md
@@ -0,0 +1,13 @@
+# Our Idea: AI Tools for Product Managers
+
+We're building AI tools for product managers at mid-market SaaS companies. The product combines a bunch of the things PMs already do — writing PRDs, gathering user feedback, analyzing usage data, drafting roadmaps — and uses LLMs to speed each of them up.
+
+## Who we're targeting
+
+Product managers at SaaS companies with 50-500 engineers. These PMs are stretched thin, juggle a lot of surface area, and would benefit from AI assistance.
+
+## What we've done so far
+
+- Talked to a few PMs we know from prior jobs
+- Built a prototype that summarizes Zoom customer calls into a PRD stub
+- Got on a waitlist of about 40 signups from LinkedIn posts