feat: /codex skill — multi-AI second opinion + proactive suggestions (#197)

* feat: /codex skill — multi-AI second opinion (review, challenge, consult) Three modes: code review with pass/fail gate, adversarial challenge mode, and conversational consult with session continuity. First multi-AI skill in gstack, wrapping OpenAI's Codex CLI. * feat: integrate /codex into /review, /ship, /plan-eng-review + dashboard /review offers Codex second opinion after completing its own review. /ship offers Codex review as optional gate before pushing. /plan-eng-review offers Codex plan critique after scope challenge. Review Readiness Dashboard shows Codex Review as optional row. * chore: bump version and changelog (v0.8.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: codex skill validation (12 stub tests) + E2E eval test Stub tests (free tier): verify template content — three modes, gate verdict, session continuity, cost tracking, cross-model comparison, binary discovery, error handling, mktemp usage, and integrations into /review, /ship, /plan-eng-review. E2E test (paid tier): runs /codex review on vulnerable fixture repo via session-runner, verifies output contains findings and GATE verdict. * fix: codex auth error message — use codex login, not OPENAI_API_KEY Codex authenticates via ChatGPT OAuth (codex login), not an env var. * feat: codex uses high reasoning effort by default gpt-5.2-codex is the only model available with ChatGPT login. All commands now use model_reasoning_effort="high" for maximum depth — the whole point is a thorough second opinion. * feat: crank codex reasoning to xhigh (maximum) * feat: per-mode reasoning (high for review/consult, xhigh for challenge) + web search Review and consult use high reasoning — thorough but not slow. Challenge (adversarial) uses xhigh — maximum depth for breaking code. All modes enable web_search_cached so Codex can look up docs/APIs. * refactor: don't hardcode model — use codex default (always latest) * feat: JSONL output for codex challenge + consult modes Use --json flag to parse codex's JSONL events, extracting reasoning traces ([codex thinking]), tool calls ([codex ran]), and token counts. This gives richer output than the -o flag alone — you can see what codex thought through before its answer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: only persist codex-review log when code review actually ran Don't write a codex-review entry to reviews.jsonl when only the adversarial challenge (option B) was selected — there's no gate verdict to record, and a false entry misleads the Review Readiness Dashboard into thinking a code review happened. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: add codex plan review option to /plan-eng-review After scope challenge (Step 0), offer to have Codex independently review the plan with a brutally honest tech reviewer persona. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: update e2e test for codex skill Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: codex integration bugs — plan content, review persistence, quoting, stderr - plan-eng-review: Codex now reads the plan file itself instead of inlining content as a CLI arg (avoids ARG_MAX for large plans) - review: add missing echo to persist codex-review results to reviews.jsonl - codex: consult mode uses $TMPERR (mktemp) instead of hardcoded stderr path - codex + review: quote $SLUG/$BRANCH_SLUG in review log paths - codex: scope plan lookup to current project, warn on cross-project fallback Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add .context/ to .gitignore to prevent session ID leaks Codex consult mode stores session IDs in .context/codex-session-id. Without this ignore rule, session IDs could leak into commits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: proactive skill suggestions + opt-out + trigger phrase tests - Preamble reads proactive config via gstack-config - Root SKILL.md.tmpl has lifecycle map (stage → skill suggestion) - Users can opt out ("stop suggesting") / opt in ("be proactive again") - Restored trigger phrase validation tests (16 skills × "Use when" check) - Added missing "Use when" trigger phrases to /debug and /office-hours Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: update changelog for v0.8.0 — add proactive suggestions note Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-08 21:49:45 +08:00 · 2026-03-19 00:22:52 -05:00
parent 823772ff0b
commit d85233017b
29 changed files with 1372 additions and 63 deletions
--- a/plan-eng-review/SKILL.md
+++ b/plan-eng-review/SKILL.md
@@ -29,14 +29,14 @@ touch ~/.gstack/sessions/"$PPID"
 _SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
 find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
 _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
+_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
 _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
 echo "BRANCH: $_BRANCH"
+echo "PROACTIVE: $_PROACTIVE"
 _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
 echo "LAKE_INTRO: $_LAKE_SEEN"
 mkdir -p ~/.gstack/analytics
 echo '{"skill":"plan-eng-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}'  >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
-_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
-echo "PROACTIVE: $_PROACTIVE"
 ```

 If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
@@ -220,6 +220,29 @@ Before reviewing anything, answer these questions:

 If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1.

+### Step 0.5: Codex plan review (optional)
+
+Check if the Codex CLI is available: `which codex 2>/dev/null`
+
+If available, after presenting Step 0 findings, use AskUserQuestion:
+```
+Want an independent Codex (OpenAI) review of this plan before the detailed review?
+A) Yes — let Codex critique the plan independently
+B) No — proceed with the Claude review only
+```
+
+If the user chooses A: tell Codex to read the plan file itself (avoids ARG_MAX limits for large plans):
+```bash
+codex exec "You are a brutally honest technical reviewer. Read the plan file at <plan-file-path> and review it for: logical gaps and unstated assumptions, missing error handling or edge cases, overcomplexity (is there a simpler approach?), feasibility risks (what could go wrong?), and missing dependencies or sequencing issues. Be direct. Be terse. No compliments. Just the problems." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached
+```
+
+Replace `<plan-file-path>` with the actual path to the plan file detected earlier. Codex has filesystem access in read-only mode and will read the file itself.
+
+Present the full output under a `CODEX SAYS (plan review):` header. Note any concerns
+that should inform the subsequent engineering review sections.
+
+If Codex is not available, skip silently.
+
 Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section.

 **Critical: Once the user accepts or rejects a scope reduction recommendation, commit fully.** Do not re-argue for smaller scope during later review sections. Do not silently reduce scope or skip planned components.
@@ -393,7 +416,7 @@ echo "---CONFIG---"
 ~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
 ```

-Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite). Ignore entries with timestamps older than 7 days. For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display:
+Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, codex-review). Ignore entries with timestamps older than 7 days. For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display:

 ```
 +====================================================================+
@@ -404,6 +427,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 | Eng Review      |  1   | 2026-03-16 15:00    | CLEAR     | YES      |
 | CEO Review      |  0   | —                   | —         | no       |
 | Design Review   |  0   | —                   | —         | no       |
+| Codex Review    |  0   | —                   | —         | no       |
 +--------------------------------------------------------------------+
 | VERDICT: CLEARED — Eng Review passed                                |
 +====================================================================+
@@ -413,11 +437,12 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
+- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.

 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
 - **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
- CEO and Design reviews are shown for context but never block shipping
+- CEO, Design, and Codex reviews are shown for context but never block shipping
 - If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED

 ## Unresolved decisions
--- a/plan-eng-review/SKILL.md.tmpl
+++ b/plan-eng-review/SKILL.md.tmpl
@@ -84,6 +84,29 @@ Before reviewing anything, answer these questions:

 If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1.

+### Step 0.5: Codex plan review (optional)
+
+Check if the Codex CLI is available: `which codex 2>/dev/null`
+
+If available, after presenting Step 0 findings, use AskUserQuestion:
+```
+Want an independent Codex (OpenAI) review of this plan before the detailed review?
+A) Yes — let Codex critique the plan independently
+B) No — proceed with the Claude review only
+```
+
+If the user chooses A: tell Codex to read the plan file itself (avoids ARG_MAX limits for large plans):
+```bash
+codex exec "You are a brutally honest technical reviewer. Read the plan file at <plan-file-path> and review it for: logical gaps and unstated assumptions, missing error handling or edge cases, overcomplexity (is there a simpler approach?), feasibility risks (what could go wrong?), and missing dependencies or sequencing issues. Be direct. Be terse. No compliments. Just the problems." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached
+```
+
+Replace `<plan-file-path>` with the actual path to the plan file detected earlier. Codex has filesystem access in read-only mode and will read the file itself.
+
+Present the full output under a `CODEX SAYS (plan review):` header. Note any concerns
+that should inform the subsequent engineering review sections.
+
+If Codex is not available, skip silently.
+
 Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section.

 **Critical: Once the user accepts or rejects a scope reduction recommendation, commit fully.** Do not re-argue for smaller scope during later review sections. Do not silently reduce scope or skip planned components.