gstack/test/skill-e2e-autoplan-dual-voice.test.ts
Garry Tan 4bf5ae12c0 feat: codex/autoplan hardening + Apple Silicon coreutils auto-install
Hardens /codex and /autoplan against silent failures surfaced by the #972
stdin fix and #1003 Apple Silicon codesign. Six-layer defense:

1. **Multi-signal auth probe** (new Step 0.5 / Phase 0.5): env-based auth
   ($CODEX_API_KEY, $OPENAI_API_KEY) OR file-based auth
   (${CODEX_HOME:-~/.codex}/auth.json). Eliminates the false negatives that
   the old file-only check produced for CI / platform-engineer users.

2. **Timeout wrapper** around every codex exec / codex review invocation:
   gtimeout → timeout → unwrapped fallback chain. On exit 124, surfaces
   common causes + actionable next step. Guards against model-API stalls
   not covered by the #972 stdin fix.

3. **Stderr capture in Challenge mode** (codex/SKILL.md.tmpl:208):
   2>/dev/null → 2>$TMPERR. Post-invocation grep for auth/login/unauthorized
   surfaces errors that were previously dropped silently.

4. **Completeness check** in the Python JSON parser: tracks turn.completed
   events and warns on zero (possible mid-stream disconnect).

5. **Version warning** for known-bad Codex CLI (0.120.0-0.120.2, the range
   that introduced the stdin deadlock fixed by #972). Anchored regex
   `(^|[^0-9.])0\.120\.(0|1|2)([^0-9.]|$)` prevents 0.120.10 / 0.120.20
   false positives.

6. **Failure telemetry + operational learnings**: codex_timeout,
   codex_auth_failed, codex_cli_missing, codex_version_warning events
   land in ~/.gstack/analytics/skill-usage.jsonl behind the existing
   telemetry opt-in. On timeout (exit 124), auto-logs an operational
   learning via gstack-learnings-log so future /investigate sessions
   surface prior hang patterns automatically.
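
As an illustration of item 5, the anchored pattern can be exercised directly. This is a minimal sketch: the regex is quoted from the commit message, but the helper name `isKnownBadCodexVersion` and the sample version strings are assumptions made for the example, and the real check may run in bash rather than TypeScript.

```typescript
// Pattern quoted from the commit message. The leading and trailing groups
// require a non-digit, non-dot character (or string boundary) around the
// version, so longer versions such as 0.120.10 are not matched.
const KNOWN_BAD_CODEX = /(^|[^0-9.])0\.120\.(0|1|2)([^0-9.]|$)/;

// Hypothetical helper name, used here for illustration only.
function isKnownBadCodexVersion(versionOutput: string): boolean {
  return KNOWN_BAD_CODEX.test(versionOutput);
}

// Sample inputs (assumed output shapes, not captured from a real CLI).
const samples = [
  'codex-cli 0.120.1',  // flagged
  'v0.120.2',           // flagged (v-prefixed)
  'codex-cli 0.120.10', // not flagged
  'codex-cli 0.121.0',  // not flagged
];
for (const s of samples) {
  console.log(`${s} -> ${isKnownBadCodexVersion(s)}`);
}
```

The boundary groups do the work that a plain substring match cannot: without them, `0.120.1` would also match inside `0.120.10`.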

**Shared helper** (bin/gstack-codex-probe): consolidates all four pieces
(auth probe, version check, timeout wrapper, telemetry logger) into one
bash file that /codex and /autoplan source. Namespace-prefixed
(_gstack_codex_*) with a unit test that verifies sourcing does not leak
shell options into the caller. pathRewrites in host configs rewrite
~/.claude/skills/gstack → $GSTACK_ROOT for Codex, $GSTACK_BIN for
Factory/Cursor/etc.
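
The multi-signal auth probe that the shared helper consolidates can be sketched as a pure function. The actual helper is a bash file; this TypeScript version only mirrors the described logic. The names `probeCodexAuth`, `AuthSource`, and the injected `fileExists` callback are hypothetical, and treating a whitespace-only key as unset is an assumption based on the "whitespace" test cases mentioned in this commit.

```typescript
type AuthSource = 'env' | 'file' | 'none';

interface ProbeInput {
  env: Record<string, string | undefined>;
  // Injected so the sketch stays free of filesystem access (assumption).
  fileExists: (path: string) => boolean;
}

// Hypothetical name; mirrors "env-based auth OR file-based auth".
function probeCodexAuth({ env, fileExists }: ProbeInput): AuthSource {
  // Signal 1: either API key variable is set to a non-blank value.
  const hasEnvKey = [env.CODEX_API_KEY, env.OPENAI_API_KEY].some(
    (v) => typeof v === 'string' && v.trim().length > 0,
  );
  if (hasEnvKey) return 'env';

  // Signal 2: ${CODEX_HOME:-~/.codex}/auth.json, where an unset or empty
  // CODEX_HOME falls back to ~/.codex, as the bash expansion does.
  const codexHome =
    env.CODEX_HOME && env.CODEX_HOME.length > 0
      ? env.CODEX_HOME
      : `${env.HOME ?? ''}/.codex`;
  return fileExists(`${codexHome}/auth.json`) ? 'file' : 'none';
}

// Usage: a CI-style environment with only an API key and no auth.json.
console.log(probeCodexAuth({ env: { OPENAI_API_KEY: 'sk-example' }, fileExists: () => false }));
```

A file-only check would report this CI-style environment as unauthenticated, which is the false negative the commit message describes.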

**Apple Silicon coreutils auto-install** (setup:264): macOS lacks GNU
timeout by default; Homebrew's coreutils installs it as gtimeout to
avoid shadowing BSD utilities. ./setup now auto-installs coreutils on
Darwin (arch-agnostic — applies to Intel + Apple Silicon) when neither
gtimeout nor timeout is present. Opt-out via GSTACK_SKIP_COREUTILS=1
for CI, managed machines, or offline envs.
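
The gtimeout → timeout → unwrapped chain and the setup decision described above can be modelled roughly as follows. This is a sketch under assumptions: `wrapWithTimeout`, `shouldInstallCoreutils`, and the `isOnPath` callback are hypothetical names, the OS name is assumed to come from `uname -s`, and the real logic lives in bash.

```typescript
// GNU timeout exits with 124 when the wrapped command times out.
const TIMEOUT_EXIT_CODE = 124;

// Prefer gtimeout (Homebrew coreutils on macOS), then timeout, else run
// the command unwrapped. `isOnPath` is injected for testability.
function wrapWithTimeout(
  cmd: string[],
  seconds: number,
  isOnPath: (bin: string) => boolean,
): string[] {
  for (const bin of ['gtimeout', 'timeout']) {
    if (isOnPath(bin)) return [bin, String(seconds), ...cmd];
  }
  return cmd; // unwrapped fallback: no hang protection available
}

// Setup-side decision: install coreutils only on Darwin, only when neither
// binary exists, and only when the opt-out variable is not set to 1.
function shouldInstallCoreutils(
  osName: string,
  isOnPath: (bin: string) => boolean,
  env: Record<string, string | undefined>,
): boolean {
  if (osName !== 'Darwin') return false;
  if (env.GSTACK_SKIP_COREUTILS === '1') return false;
  return !isOnPath('gtimeout') && !isOnPath('timeout');
}

console.log(wrapWithTimeout(['codex', 'exec'], 600, (b) => b === 'timeout').join(' '));
console.log(`timeout exit code to watch for: ${TIMEOUT_EXIT_CODE}`);
```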

**25 deterministic unit tests** (test/codex-hardening.test.ts):
- 8 auth probe combinations (env precedence, whitespace, alternate
  $CODEX_HOME, corrupt file paths)
- 10 version regex cases including 0.120.10 false-positive guards
  and v-prefixed / multiline output
- 4 timeout wrapper + namespace hygiene (bash -n, gtimeout
  preference, set-option leak check)
- 3 telemetry payload schema checks (confirms env values + auth
  tokens never leak into emitted events)
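
A telemetry payload check of the kind listed above might look like the sketch below. The four event names come from the commit message; the payload fields (`event`, `exit_code`, `duration_s`) and the helpers `toJsonl` and `leakedSecrets` are hypothetical and are not the actual schema in test/codex-hardening.test.ts.

```typescript
type CodexEvent =
  | 'codex_timeout'
  | 'codex_auth_failed'
  | 'codex_cli_missing'
  | 'codex_version_warning';

// Hypothetical payload shape, chosen only for illustration.
interface CodexTelemetry {
  event: CodexEvent;
  exit_code: number;
  duration_s: number;
}

// One JSON object per line, as in a .jsonl log.
function toJsonl(payload: CodexTelemetry): string {
  return JSON.stringify(payload);
}

// Returns the names of auth variables whose values appear verbatim in the
// emitted line. An empty result means nothing leaked.
function leakedSecrets(
  line: string,
  env: Record<string, string | undefined>,
): string[] {
  return ['CODEX_API_KEY', 'OPENAI_API_KEY'].filter((name) => {
    const value = env[name];
    return typeof value === 'string' && value.length > 0 && line.includes(value);
  });
}

const line = toJsonl({ event: 'codex_timeout', exit_code: 124, duration_s: 600 });
console.log(line, leakedSecrets(line, { OPENAI_API_KEY: 'sk-example-123' }));
```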

**1 periodic-tier E2E** (test/skill-e2e-autoplan-dual-voice.test.ts):
gates the /autoplan dual-voice path — asserts both Claude subagent
and Codex voices produce output in Phase 1, OR that [codex-unavailable]
is logged when Codex is absent. ~$1/run, not a CI gate.

Golden baseline + gen-skill-docs exclusion list updated for the new
codex path references and the 16 < /dev/null redirects from #972.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:10:21 +08:00

102 lines
4.1 KiB
TypeScript

import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
  ROOT, runId, evalsEnabled,
  describeIfSelected, logCost, recordE2E,
  copyDirSync, createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

// E2E for /autoplan's dual-voice (Claude subagent + Codex). Periodic tier:
// non-deterministic, costs ~$1/run, not a gate. The purpose is to catch
// regressions where one of the two voices fails silently post-hardening.

const evalCollector = createEvalCollector('e2e-autoplan-dual-voice');

describeIfSelected('Autoplan dual-voice E2E', ['autoplan-dual-voice'], () => {
  let workDir: string;
  let planPath: string;

  beforeAll(() => {
    workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-autoplan-dv-'));
    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: workDir, stdio: 'pipe', timeout: 10000 });
    run('git', ['init', '-b', 'main']);
    run('git', ['config', 'user.email', 'test@test.com']);
    run('git', ['config', 'user.name', 'Test']);
    fs.writeFileSync(path.join(workDir, 'README.md'), '# test repo\n');
    run('git', ['add', '.']);
    run('git', ['commit', '-m', 'initial']);

    // Copy /autoplan + its review-skill dependencies (they're loaded from disk).
    copyDirSync(path.join(ROOT, 'autoplan'), path.join(workDir, 'autoplan'));
    copyDirSync(path.join(ROOT, 'plan-ceo-review'), path.join(workDir, 'plan-ceo-review'));
    copyDirSync(path.join(ROOT, 'plan-eng-review'), path.join(workDir, 'plan-eng-review'));
    copyDirSync(path.join(ROOT, 'plan-design-review'), path.join(workDir, 'plan-design-review'));
    copyDirSync(path.join(ROOT, 'plan-devex-review'), path.join(workDir, 'plan-devex-review'));

    // Write a tiny plan file for /autoplan to review.
    planPath = path.join(workDir, 'TEST_PLAN.md');
    fs.writeFileSync(planPath, `# Test Plan: add /greet skill
## Context
Add a new /greet skill that prints a welcome message.
## Scope
- Create greet/SKILL.md with a simple "hello" flow
- Add to gen-skill-docs pipeline
- One unit test
`);
  });

  afterAll(() => {
    finalizeEvalCollector(evalCollector);
    if (workDir && fs.existsSync(workDir)) {
      fs.rmSync(workDir, { recursive: true, force: true });
    }
  });

  // Skip entirely unless evals enabled (periodic tier).
  test.skipIf(!evalsEnabled)(
    'both Claude + Codex voices produce output in Phase 1 (within timeout)',
    async () => {
      // Fire /autoplan with a 5-min hard timeout on the spawn itself.
      // The skill itself has 10-min phase timeouts + auth-gate failfast.
      // If Codex is unavailable on the test machine, the skill should print
      // [codex-unavailable] and still complete the Claude subagent half.
      const result = await runSkillTest({
        name: 'autoplan-dual-voice',
        workdir: workDir,
        prompt: `/autoplan ${planPath}`,
        timeoutMs: 300_000, // 5 min
        evalCollector,
      });

      // Accept EITHER outcome as success:
      // (a) Both voices produced output (ideal case)
      // (b) Codex unavailable + Claude voice produced output (graceful degrade)
      const out = result.stdout + result.stderr;
      const claudeVoiceFired = /Claude\s+(CEO|subagent)|claude-subagent/i.test(out);
      const codexVoiceFired = /codex\s+(exec|review|CEO\s+voice)|\[via:codex\]/i.test(out);
      const codexUnavailable = /\[codex-unavailable\]|AUTH_FAILED|codex_cli_missing/i.test(out);
      expect(claudeVoiceFired).toBe(true);
      expect(codexVoiceFired || codexUnavailable).toBe(true);

      // Hang protection: if the skill reached Phase 1 at all, our hardening worked.
      // If it didn't, this is a regression from the pre-wave stdin-deadlock era.
      const reachedPhase1 = /Phase 1|CEO\s+Review|Strategy\s*&\s*Scope/i.test(out);
      expect(reachedPhase1).toBe(true);

      logCost(result);
      recordE2E('autoplan-dual-voice', result);
    },
    330_000, // per-test timeout slightly > spawn timeout so cleanup can run
  );
});
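
The acceptance logic in the test above can be read as a small pure classifier, which makes the two accepted outcomes easier to see in isolation. The regular expressions are copied from the test; the function name `classifyDualVoice`, the `DualVoiceOutcome` shape, and the sample transcripts are hypothetical and do not represent real /autoplan output.

```typescript
interface DualVoiceOutcome {
  claudeVoiceFired: boolean;
  codexVoiceFired: boolean;
  codexUnavailable: boolean;
  reachedPhase1: boolean;
  pass: boolean;
}

// Hypothetical helper: same patterns and same pass condition as the test.
function classifyDualVoice(out: string): DualVoiceOutcome {
  const claudeVoiceFired = /Claude\s+(CEO|subagent)|claude-subagent/i.test(out);
  const codexVoiceFired = /codex\s+(exec|review|CEO\s+voice)|\[via:codex\]/i.test(out);
  const codexUnavailable = /\[codex-unavailable\]|AUTH_FAILED|codex_cli_missing/i.test(out);
  const reachedPhase1 = /Phase 1|CEO\s+Review|Strategy\s*&\s*Scope/i.test(out);
  return {
    claudeVoiceFired,
    codexVoiceFired,
    codexUnavailable,
    reachedPhase1,
    // Claude must fire; Codex must either fire or be reported unavailable.
    pass: claudeVoiceFired && (codexVoiceFired || codexUnavailable) && reachedPhase1,
  };
}

// Invented transcripts covering both accepted outcomes and one failure.
const bothVoices = 'Phase 1: CEO Review\nClaude subagent: looks fine\ncodex exec finished';
const degraded = 'Phase 1\nClaude CEO voice: ok\n[codex-unavailable] skipping';
const silentCodex = 'Phase 1\nClaude subagent: ok';

console.log(classifyDualVoice(bothVoices).pass);  // true
console.log(classifyDualVoice(degraded).pass);    // true
console.log(classifyDualVoice(silentCodex).pass); // false
```

The third transcript is the regression the test targets: Codex neither produces output nor logs that it is unavailable, so the run fails instead of passing silently.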