mirror of https://github.com/garrytan/gstack.git (synced 2026-05-16 09:12:13 +08:00)
test: demote setup-gbrain Path 4 E2E to periodic-tier

The Agent SDK E2E tests for Path 4 (skill-e2e-setup-gbrain-remote and
skill-e2e-setup-gbrain-bad-token) are inherently non-deterministic: the
model interprets "follow Path 4 only" prompts flexibly and can skip
Step 8 (the CLAUDE.md write) or shortcut past the verify helper, which
makes the gate-tier assertions flaky.

The deterministic gate coverage for Path 4 is in
test/setup-gbrain-path4-structure.test.ts: a fast structural lint that
catches AUQ-pacing regressions and prose contract drift in <200ms with
zero token spend. That test is the right tool for catching the failure
mode the gate-tier was meant to guard against.
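
To illustrate what a structural lint of this kind can look like, here is a minimal sketch. It is not the contents of test/setup-gbrain-path4-structure.test.ts; the function name `lintStructure`, the `LintResult` shape, and the sample document are all hypothetical. The idea is only that checking a skill document for required markers in a required order needs no model call, so it is fast and deterministic.

```typescript
// Hypothetical sketch of a structural lint. Names and sample text are
// illustrative assumptions, not taken from the gstack repository.

interface LintResult {
  missing: string[];    // required markers that never appear in the document
  outOfOrder: string[]; // markers that appear, but only before their predecessor
}

function lintStructure(doc: string, requiredInOrder: string[]): LintResult {
  const result: LintResult = { missing: [], outOfOrder: [] };
  let cursor = 0; // index just past the last marker that was found in order
  for (const marker of requiredInOrder) {
    if (doc.indexOf(marker) === -1) {
      result.missing.push(marker);
      continue;
    }
    const inOrder = doc.indexOf(marker, cursor);
    if (inOrder === -1) {
      result.outOfOrder.push(marker);
    } else {
      cursor = inOrder + marker.length;
    }
  }
  return result;
}

// Made-up stand-in for a skill document.
const sampleDoc = [
  "## Path 4",
  "Run the verify helper. STOP on failure.",
  "Step 8: write CLAUDE.md.",
].join("\n");

const ok = lintStructure(sampleDoc, ["verify helper", "STOP", "Step 8"]);
const drifted = lintStructure(sampleDoc, ["Step 8", "verify helper", "Step 9"]);

console.log(ok);      // no missing or out-of-order markers
console.log(drifted); // "verify helper" is out of order, "Step 9" is missing
```

A check like this only reads text, which is why it can run on every commit without token spend.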

The Agent SDK E2E tests stay available on demand for periodic-tier runs
(EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-setup-gbrain-*.test.ts).

Also tightened the verify-error assertion to the literal field shape
("error_class": "AUTH") instead of a substring match that false-matches
the parent claude session's "needs-auth" MCP discovery markers.
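
The difference between the two assertion styles can be sketched as follows. The helper names and both sample strings are hypothetical; only the `"error_class": "AUTH"` field shape and the "needs-auth" marker come from the commit message. The strict matcher here tolerates whitespace around the colon, which is an assumption on my part rather than something the commit states.

```typescript
// Hypothetical helpers contrasting a loose substring check with a check
// on the literal field shape. Sample payloads are made up for illustration.

/** Loose check: true if "auth" appears anywhere, case-insensitively. */
function looseAuthMatch(transcript: string): boolean {
  return /auth/i.test(transcript);
}

/** Strict check: true only if the field "error_class": "AUTH" appears. */
function strictAuthMatch(transcript: string): boolean {
  return /"error_class"\s*:\s*"AUTH"/.test(transcript);
}

// Unrelated discovery output that merely mentions "needs-auth".
const discoveryNoise = '{"mcp_servers":[{"name":"gbrain","status":"needs-auth"}]}';
// The failure the assertion is actually meant to detect.
const verifyFailure = '{"ok": false, "error_class": "AUTH", "message": "bad bearer token"}';

console.log(looseAuthMatch(discoveryNoise), strictAuthMatch(discoveryNoise)); // true false
console.log(looseAuthMatch(verifyFailure), strictAuthMatch(verifyFailure));   // true true
```

The loose check passes on the discovery output alone, so a test built on it could go green without the verify helper ever reporting an auth error.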

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -434,11 +434,15 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
   // costs ~$0.30-$0.50 per run, not needed on every commit)
   'brain-privacy-gate': 'periodic',

-  // /setup-gbrain Path 4 (Remote MCP) — gate-tier. Stub HTTP server is
-  // deterministic; Path 4's STOP gates are the failure mode this catches
-  // (token in CLAUDE.md, partial registration on bad bearer).
-  'setup-gbrain-remote': 'gate',
-  'setup-gbrain-bad-token': 'gate',
+  // /setup-gbrain Path 4 (Remote MCP) — periodic-tier. The stub HTTP
+  // server is deterministic but the model's interpretation of "follow
+  // Path 4 only" is not — assertions on which steps the model ran are
+  // flaky. The deterministic gate-tier coverage for Path 4 lives in
+  // test/setup-gbrain-path4-structure.test.ts (free, <200ms). These
+  // E2E tests stay available for on-demand verification of the live
+  // model's behavior against a stub MCP server.
+  'setup-gbrain-remote': 'periodic',
+  'setup-gbrain-bad-token': 'periodic',

   // AskUserQuestion format regression — periodic (Opus 4.7 non-deterministic benchmark)
   'plan-ceo-review-format-mode': 'periodic',
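
As a rough illustration of how a tier map like `E2E_TIERS` can drive test selection, the sketch below filters test names by tier. The `selectTests` helper and the `example-gate-test` entry are hypothetical, and the exact-match rule is an assumption: the diff does not show how the real harness reads `EVALS_TIER`, or whether a periodic run also includes gate-tier tests.

```typescript
// Hypothetical sketch of tier-based selection. Only the map's type and the
// four real entries mirror the diff; everything else is illustrative.

type Tier = "gate" | "periodic";

const E2E_TIERS: Record<string, Tier> = {
  "brain-privacy-gate": "periodic",
  "setup-gbrain-remote": "periodic",
  "setup-gbrain-bad-token": "periodic",
  "plan-ceo-review-format-mode": "periodic",
  "example-gate-test": "gate", // hypothetical entry, added for contrast
};

/** Returns the sorted names of tests whose tier equals the requested one. */
function selectTests(tiers: Record<string, Tier>, requested: Tier): string[] {
  return Object.keys(tiers)
    .filter((name) => tiers[name] === requested)
    .sort();
}

const gateTests = selectTests(E2E_TIERS, "gate");
const periodicTests = selectTests(E2E_TIERS, "periodic");

console.log(gateTests);
console.log(periodicTests);
```

Under this assumed rule, moving the two setup-gbrain entries from 'gate' to 'periodic' takes them out of the per-commit set without deleting the tests themselves.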