feat: mode-posture energy fix for /plan-ceo-review and /office-hours (v1.1.2.0) (#1065)

* feat: restore mode-posture energy to expansion + forcing + builder output

  Rewrites Writing Style rule 2-4 examples in scripts/resolvers/preamble.ts to cover three framing families (pain reduction, upside/delight, forcing pressure) instead of diagnostic-pain only. Adds inline exemplars to plan-ceo-review (0D-prelude shared between SCOPE + SELECTIVE EXPANSION) and office-hours (Q3 forcing exemplar with career/day/weekend domain gating, builder operating principles wild exemplar).

  V1 shipped rule 2-4 examples that all pointed to diagnostic-pain framing ("3-second spinner", "double-click button"). Models follow concrete examples over abstract taxonomies, so any skill with a non-diagnostic mode posture (expansion, forcing, delight) got flattened at runtime even when the template itself said "dream big" or "direct to the point of discomfort." This change targets the actual lever: swap the single diagnostic example for three paired framings, one per posture family.

  Preserves V1 clarity gains: rules 2, 3, 4 principles unchanged, only examples expanded. Terse mode (EXPLAIN_LEVEL: terse) still skips the block entirely.

* chore: regenerate SKILL.md after preamble + template changes

  Mechanical cascade from `bun run gen:skill-docs --host all` after the Writing Style rule 2-4 example rewrite and the plan-ceo-review / office-hours template exemplar additions. No hand edits; every change flows from the prior commit's templates.

* test: add gate-tier mode-posture regression tests

  Three gate-tier E2E tests detect when preamble / template changes flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or /office-hours (startup Q3, builder mode). The V1 regression that this PR fixes shipped without anyone catching it at ship time; this is the ongoing signal so the same thing doesn't happen again.

  Pieces:
  - `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet judge with mode-specific dual-axis rubric (expansion: surface_framing + decision_preservation; forcing: stacking_preserved + domain_matched_consequence; builder: unexpected_combinations + excitement_over_optimization). Pass threshold 4/5 on both axes.
  - Three fixtures in `test/fixtures/mode-posture/`: deterministic input for expansion proposal generation, Q3 forcing question, and builder adjacent-unlock riffing.
  - `plan-ceo-review-expansion-energy` case appended to `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge: Sonnet.
  - New `test/skill-e2e-office-hours.test.ts` with `office-hours-forcing-energy` + `office-hours-builder-wildness` cases. Generator: Sonnet. Judge: Sonnet.
  - Touchfile registration in `test/helpers/touchfiles.ts`: all three as `gate` tier in `E2E_TIERS`, triggered by changes to `scripts/resolvers/preamble.ts`, the relevant skill template, the judge helper, or any mode-posture fixture.

  Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus generator for the plan-ceo-review case dominates.

  Known V1.1 tradeoff: judges test prose markers more than deep behavior. V1.2 candidate is a cross-provider (Codex) adversarial judge on the same output to decouple house-style bias.

* test: update golden ship baselines + touchfile count for mode-posture entries

  Mechanical test updates after the mode-posture work:
  - Golden ship SKILL.md baselines (claude + codex + factory hosts) regenerate with the rewritten Writing Style rule 2-4 examples from preamble.ts.
  - Touchfile selection test expects 6 matches for a plan-ceo-review/ change (was 5) because E2E_TOUCHFILES now includes plan-ceo-review-expansion-energy.

* chore: bump version and changelog (v1.1.2.0)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
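The touchfile registration described in the commit message can be pictured roughly as follows. This is only a sketch under stated assumptions: the real `E2E_TOUCHFILES` and `E2E_TIERS` in `test/helpers/touchfiles.ts` are not shown in this diff, so the names `TOUCHFILES_SKETCH`, `TIERS_SKETCH`, `matchesPattern`, and `selectTests`, the trailing-slash directory convention, and the `office-hours/` template directory are all hypothetical. Only the three test names, the `gate` tier, and the other trigger paths come from the message itself.

```typescript
// Hypothetical mirror of the registration; the real tables may have a different shape.
const TOUCHFILES_SKETCH: Record<string, string[]> = {
  'plan-ceo-review-expansion-energy': [
    'scripts/resolvers/preamble.ts',
    'plan-ceo-review/',              // skill template directory
    'test/helpers/llm-judge.ts',
    'test/fixtures/mode-posture/',
  ],
  'office-hours-forcing-energy': [
    'scripts/resolvers/preamble.ts',
    'office-hours/',                 // assumed directory name
    'test/helpers/llm-judge.ts',
    'test/fixtures/mode-posture/',
  ],
  'office-hours-builder-wildness': [
    'scripts/resolvers/preamble.ts',
    'office-hours/',                 // assumed directory name
    'test/helpers/llm-judge.ts',
    'test/fixtures/mode-posture/',
  ],
};

// All three cases are registered as gate tier, per the commit message.
const TIERS_SKETCH: Record<string, string> = {
  'plan-ceo-review-expansion-energy': 'gate',
  'office-hours-forcing-energy': 'gate',
  'office-hours-builder-wildness': 'gate',
};

// Assumed convention: a pattern ending in "/" matches any file under that
// directory; any other pattern must match the file path exactly.
function matchesPattern(pattern: string, file: string): boolean {
  return pattern.endsWith('/') ? file.startsWith(pattern) : file === pattern;
}

// Returns the sorted names of tests triggered by at least one changed file.
function selectTests(changedFiles: string[]): string[] {
  return Object.keys(TOUCHFILES_SKETCH)
    .filter((name) =>
      TOUCHFILES_SKETCH[name].some((pattern) =>
        changedFiles.some((file) => matchesPattern(pattern, file)),
      ),
    )
    .sort();
}
```

Under these assumptions, a change to `scripts/resolvers/preamble.ts` or to any mode-posture fixture selects all three cases, while a change under `plan-ceo-review/` selects only the expansion-energy case from this table.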
2026-05-17 17:51:27 +08:00 · 2026-04-19 05:44:39 +08:00
parent e3c961d00f
commit 8ee16b867b
46 changed files with 746 additions and 107 deletions
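For orientation, the dual-axis pass rule from the commit message (4/5 on both axes) might be expressed as the sketch below. It is not the contents of `test/helpers/llm-judge.ts`, which this diff does not show. The types `PostureMode` and `PostureScores`, the table `AXIS_NAMES`, and the helper `failingAxes` are hypothetical names introduced for illustration. The axis names are taken from the commit message; mapping the first-listed axis to `axis_a` is confirmed only for the expansion mode and is an assumption for the other two.

```typescript
// Hypothetical types; the real judge helper may return more fields.
type PostureMode = 'expansion' | 'forcing' | 'builder';

interface PostureScores {
  axis_a: number; // score out of 5
  axis_b: number; // score out of 5
}

// Axis names per mode as listed in the commit message.
// The axis_a/axis_b ordering for forcing and builder is assumed.
const AXIS_NAMES: Record<PostureMode, { axis_a: string; axis_b: string }> = {
  expansion: { axis_a: 'surface_framing', axis_b: 'decision_preservation' },
  forcing: { axis_a: 'stacking_preserved', axis_b: 'domain_matched_consequence' },
  builder: { axis_a: 'unexpected_combinations', axis_b: 'excitement_over_optimization' },
};

// Pass threshold from the commit message: 4/5 on both axes.
const PASS_THRESHOLD = 4;

// Returns the names of the axes that score below the threshold.
// An empty array means the output passes the posture gate.
function failingAxes(mode: PostureMode, scores: PostureScores): string[] {
  const names = AXIS_NAMES[mode];
  const failing: string[] = [];
  if (scores.axis_a < PASS_THRESHOLD) failing.push(names.axis_a);
  if (scores.axis_b < PASS_THRESHOLD) failing.push(names.axis_b);
  return failing;
}
```

Reporting the failing axis by name, rather than a bare boolean, would make a gate failure easier to read in CI logs; whether the real helper does this is not stated in the source.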
--- a/test/skill-e2e-plan.test.ts
+++ b/test/skill-e2e-plan.test.ts
@@ -6,6 +6,7 @@ import {
  copyDirSync, setupBrowseShims, logCost, recordE2E,
  createEvalCollector, finalizeEvalCollector,
 } from './helpers/e2e-helpers';
+import { judgePosture } from './helpers/llm-judge';
 import { spawnSync } from 'child_process';
 import * as fs from 'fs';
 import * as path from 'path';
@@ -183,6 +184,79 @@ Focus on reviewing the plan content: architecture, error handling, security, and
  }, 420_000);
 });

+// --- Plan CEO Review SCOPE EXPANSION energy (V1.1 mode-posture regression gate) ---
+
+describeIfSelected('Plan CEO Review Expansion Energy E2E', ['plan-ceo-review-expansion-energy'], () => {
+  let planDir: string;
+
+  beforeAll(() => {
+    planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-exp-'));
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
+
+    run('git', ['init', '-b', 'main']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+
+    // Use the shared fixture so expansion-energy regressions are reproducible.
+    const fixture = fs.readFileSync(
+      path.join(ROOT, 'test', 'fixtures', 'mode-posture', 'expansion-plan.md'),
+      'utf-8',
+    );
+    fs.writeFileSync(path.join(planDir, 'plan.md'), fixture);
+
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'add plan']);
+
+    fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
+    fs.copyFileSync(
+      path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
+      path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
+    );
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
+  });
+
+  testConcurrentIfSelected('plan-ceo-review-expansion-energy', async () => {
+    const result = await runSkillTest({
+      prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
+
+Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
+
+Choose SCOPE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive. Auto-approve the ideal-architecture approach in 0C-bis. For 0D, run all three analyses (10x check, platonic ideal, delight opportunities), then emit exactly 2 concrete expansion proposals in the opt-in ceremony.
+
+Write your expansion proposals to ${planDir}/proposals.md with ONLY the proposal text — no conversational wrapper, no review summary, no mode analysis. Each proposal separated by "---".`,
+      workingDirectory: planDir,
+      maxTurns: 15,
+      timeout: 360_000,
+      testName: 'plan-ceo-review-expansion-energy',
+      runId,
+      model: 'claude-opus-4-6',
+    });
+
+    logCost('/plan-ceo-review (EXPANSION ENERGY)', result);
+    recordE2E(evalCollector, '/plan-ceo-review-expansion-energy', 'Plan CEO Review Expansion Energy E2E', result, {
+      passed: ['success', 'error_max_turns'].includes(result.exitReason),
+    });
+    expect(['success', 'error_max_turns']).toContain(result.exitReason);
+
+    const proposalsPath = path.join(planDir, 'proposals.md');
+    if (!fs.existsSync(proposalsPath)) {
+      throw new Error('Agent did not emit proposals.md — expansion energy eval requires proposal output');
+    }
+    const proposalText = fs.readFileSync(proposalsPath, 'utf-8');
+    expect(proposalText.length).toBeGreaterThan(200);
+
+    const scores = await judgePosture('expansion', proposalText);
+    console.log('Expansion energy scores:', JSON.stringify(scores, null, 2));
+    // Pass threshold: 4/5 on both axes (good — matches posture with minor weakness).
+    expect(scores.axis_a).toBeGreaterThanOrEqual(4);  // surface_framing
+    expect(scores.axis_b).toBeGreaterThanOrEqual(4);  // decision_preservation
+  }, 600_000);
+});
+
 // --- Plan Eng Review E2E ---

 describeIfSelected('Plan Eng Review E2E', ['plan-eng-review'], () => {