mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-19 19:02:29 +08:00
Hardens /codex and /autoplan against silent failures surfaced by the #972 stdin fix and #1003 Apple Silicon codesign.

Six-layer defense:

1. **Multi-signal auth probe** (new Step 0.5 / Phase 0.5): env-based auth ($CODEX_API_KEY, $OPENAI_API_KEY) OR file-based auth (${CODEX_HOME:-~/.codex}/auth.json). Avoids the false negatives that the old file-only check produced for CI / platform-engineer users.
2. **Timeout wrapper** around every codex exec / codex review invocation: gtimeout → timeout → unwrapped fallback chain. On exit 124, surfaces common causes + an actionable next step. Guards against model-API stalls not covered by the #972 stdin fix.
3. **Stderr capture in Challenge mode** (codex/SKILL.md.tmpl:208): 2>/dev/null → 2>$TMPERR. A post-invocation grep for auth/login/unauthorized surfaces errors that were previously dropped silently.
4. **Completeness check** in the Python JSON parser: tracks turn.completed events and warns on zero (possible mid-stream disconnect).
5. **Version warning** for known-bad Codex CLI (0.120.0-0.120.2, the range that introduced the stdin deadlock #972 fixes). Anchored regex `(^|[^0-9.])0\.120\.(0|1|2)([^0-9.]|$)` prevents 0.120.10 / 0.120.20 false positives.
6. **Failure telemetry + operational learnings**: codex_timeout, codex_auth_failed, codex_cli_missing, codex_version_warning events land in ~/.gstack/analytics/skill-usage.jsonl behind the existing telemetry opt-in. On timeout (exit 124), auto-logs an operational learning via gstack-learnings-log so future /investigate sessions surface prior hang patterns automatically.

**Shared helper** (bin/gstack-codex-probe): consolidates all four pieces (auth probe, version check, timeout wrapper, telemetry logger) into one bash file that /codex and /autoplan source. Namespace-prefixed (_gstack_codex_*) with a unit test that verifies sourcing does not leak shell options into the caller. pathRewrites in host configs rewrite ~/.claude/skills/gstack → $GSTACK_ROOT for Codex, $GSTACK_BIN for Factory/Cursor/etc.

**Apple Silicon coreutils auto-install** (setup:264): macOS lacks GNU timeout by default; Homebrew's coreutils installs it as gtimeout to avoid shadowing BSD utilities. ./setup now auto-installs coreutils on Darwin (arch-agnostic: applies to Intel + Apple Silicon) when neither gtimeout nor timeout is present. Opt-out via GSTACK_SKIP_COREUTILS=1 for CI, managed machines, or offline envs.

**25 deterministic unit tests** (test/codex-hardening.test.ts):

- 8 auth probe combinations (env precedence, whitespace, alternate $CODEX_HOME, corrupt file paths)
- 10 version regex cases including 0.120.10 false-positive guards and v-prefixed / multiline output
- 4 timeout wrapper + namespace hygiene (bash -n, gtimeout preference, set-option leak check)
- 3 telemetry payload schema checks (confirms env values + auth tokens never leak into emitted events)

**1 periodic-tier E2E** (test/skill-e2e-autoplan-dual-voice.test.ts): gates the /autoplan dual-voice path. Asserts both Claude subagent and Codex voices produce output in Phase 1, OR that [codex-unavailable] is logged when Codex is absent. ~$1/run, not a CI gate.

Golden baseline + gen-skill-docs exclusion list updated for the new codex path references and the 16 `< /dev/null` redirects from #972.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
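The anchored version regex from item 5 of the commit message can be exercised in isolation. The sketch below is only an illustration in TypeScript: the real check lives in the bash helper bin/gstack-codex-probe, the function name `isKnownBadCodexVersion` is hypothetical, and the sample version strings are made-up inputs rather than captured `codex --version` output.

```typescript
// Pattern copied verbatim from the commit message. The leading and trailing
// groups require that "0.120.N" is not embedded in a longer dotted number.
const KNOWN_BAD_CODEX = /(^|[^0-9.])0\.120\.(0|1|2)([^0-9.]|$)/;

// Hypothetical helper name, used here for illustration only.
function isKnownBadCodexVersion(versionOutput: string): boolean {
  return KNOWN_BAD_CODEX.test(versionOutput);
}

// Illustrative inputs (assumed formats, not real CLI output).
console.log(isKnownBadCodexVersion('codex-cli 0.120.1'));   // true
console.log(isKnownBadCodexVersion('v0.120.2'));            // true
console.log(isKnownBadCodexVersion('codex-cli 0.120.0\n')); // true
console.log(isKnownBadCodexVersion('codex-cli 0.120.10'));  // false
console.log(isKnownBadCodexVersion('codex-cli 10.120.1'));  // false
console.log(isKnownBadCodexVersion('codex-cli 0.121.0'));   // false
```

The trailing `([^0-9.]|$)` group is what keeps 0.120.10 and 0.120.20 from matching: after `0.120.1` the next character is a digit, so the match fails instead of stopping early.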
102 lines
4.1 KiB
TypeScript
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
  ROOT, runId, evalsEnabled,
  describeIfSelected, logCost, recordE2E,
  copyDirSync, createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

// E2E for /autoplan's dual-voice (Claude subagent + Codex). Periodic tier:
// non-deterministic, costs ~$1/run, not a gate. The purpose is to catch
// regressions where one of the two voices fails silently post-hardening.

const evalCollector = createEvalCollector('e2e-autoplan-dual-voice');

describeIfSelected('Autoplan dual-voice E2E', ['autoplan-dual-voice'], () => {
  let workDir: string;
  let planPath: string;

  beforeAll(() => {
    workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-autoplan-dv-'));

    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: workDir, stdio: 'pipe', timeout: 10000 });

    run('git', ['init', '-b', 'main']);
    run('git', ['config', 'user.email', 'test@test.com']);
    run('git', ['config', 'user.name', 'Test']);
    fs.writeFileSync(path.join(workDir, 'README.md'), '# test repo\n');
    run('git', ['add', '.']);
    run('git', ['commit', '-m', 'initial']);

    // Copy /autoplan + its review-skill dependencies (they're loaded from disk).
    copyDirSync(path.join(ROOT, 'autoplan'), path.join(workDir, 'autoplan'));
    copyDirSync(path.join(ROOT, 'plan-ceo-review'), path.join(workDir, 'plan-ceo-review'));
    copyDirSync(path.join(ROOT, 'plan-eng-review'), path.join(workDir, 'plan-eng-review'));
    copyDirSync(path.join(ROOT, 'plan-design-review'), path.join(workDir, 'plan-design-review'));
    copyDirSync(path.join(ROOT, 'plan-devex-review'), path.join(workDir, 'plan-devex-review'));

    // Write a tiny plan file for /autoplan to review.
    planPath = path.join(workDir, 'TEST_PLAN.md');
    fs.writeFileSync(planPath, `# Test Plan: add /greet skill

## Context
Add a new /greet skill that prints a welcome message.

## Scope
- Create greet/SKILL.md with a simple "hello" flow
- Add to gen-skill-docs pipeline
- One unit test
`);
  });

  afterAll(() => {
    finalizeEvalCollector(evalCollector);
    if (workDir && fs.existsSync(workDir)) {
      fs.rmSync(workDir, { recursive: true, force: true });
    }
  });

  // Skip entirely unless evals enabled (periodic tier).
  test.skipIf(!evalsEnabled)(
    'both Claude + Codex voices produce output in Phase 1 (within timeout)',
    async () => {
      // Fire /autoplan with a 5-min hard timeout on the spawn itself.
      // The skill itself has 10-min phase timeouts + auth-gate failfast.
      // If Codex is unavailable on the test machine, the skill should print
      // [codex-unavailable] and still complete the Claude subagent half.
      const result = await runSkillTest({
        name: 'autoplan-dual-voice',
        workdir: workDir,
        prompt: `/autoplan ${planPath}`,
        timeoutMs: 300_000, // 5 min
        evalCollector,
      });

      // Accept EITHER outcome as success:
      //   (a) Both voices produced output (ideal case)
      //   (b) Codex unavailable + Claude voice produced output (graceful degrade)
      const out = result.stdout + result.stderr;
      const claudeVoiceFired = /Claude\s+(CEO|subagent)|claude-subagent/i.test(out);
      const codexVoiceFired = /codex\s+(exec|review|CEO\s+voice)|\[via:codex\]/i.test(out);
      const codexUnavailable = /\[codex-unavailable\]|AUTH_FAILED|codex_cli_missing/i.test(out);

      expect(claudeVoiceFired).toBe(true);
      expect(codexVoiceFired || codexUnavailable).toBe(true);

      // Hang protection: if the skill reached Phase 1 at all, our hardening worked.
      // If it didn't, this is a regression from the pre-wave stdin-deadlock era.
      const reachedPhase1 = /Phase 1|CEO\s+Review|Strategy\s*&\s*Scope/i.test(out);
      expect(reachedPhase1).toBe(true);

      logCost(result);
      recordE2E('autoplan-dual-voice', result);
    },
    330_000, // per-test timeout slightly > spawn timeout so cleanup can run
  );
});
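The acceptance rule in the test above (either both voices fire, or Codex is reported unavailable while the Claude voice still fires) can be factored into a pure function for experimenting with sample transcripts. This is only a sketch: the three regexes are copied from the test, while `classifyDualVoice`, the `DualVoiceOutcome` labels, and the sample strings are hypothetical names and inputs introduced for illustration and are not part of the repository.

```typescript
// Hypothetical outcome labels; the real test only asserts booleans.
type DualVoiceOutcome =
  | 'both-voices'       // (a) ideal case
  | 'graceful-degrade'  // (b) Codex unavailable, Claude voice present
  | 'claude-missing'    // would fail the first expect()
  | 'codex-silent';     // would fail the second expect()

function classifyDualVoice(out: string): DualVoiceOutcome {
  // Regexes copied from the E2E test above.
  const claudeVoiceFired = /Claude\s+(CEO|subagent)|claude-subagent/i.test(out);
  const codexVoiceFired = /codex\s+(exec|review|CEO\s+voice)|\[via:codex\]/i.test(out);
  const codexUnavailable = /\[codex-unavailable\]|AUTH_FAILED|codex_cli_missing/i.test(out);

  if (!claudeVoiceFired) return 'claude-missing';
  if (codexVoiceFired) return 'both-voices';
  if (codexUnavailable) return 'graceful-degrade';
  return 'codex-silent';
}

// Made-up transcripts, not real /autoplan output.
console.log(classifyDualVoice('Claude subagent: done\n[via:codex] done'));    // both-voices
console.log(classifyDualVoice('Claude CEO review\n[codex-unavailable]'));      // graceful-degrade
console.log(classifyDualVoice('Claude subagent: done'));                       // codex-silent
console.log(classifyDualVoice('no recognizable voice markers'));               // claude-missing
```

Only the first two outcomes correspond to a passing test. The `codex-silent` case is the silent failure this E2E is meant to catch: Codex neither produced output nor reported itself unavailable.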