test: lite E2E coverage for benchmark, taste engine, publish

Fills real coverage gaps in v0.19.0.0 primitives. 44 new deterministic tests (gate tier, ~3s) + 8 live-API tests (periodic tier).

New gate-tier test files (free, <3s total):

- test/taste-engine.test.ts: 24 tests against gstack-taste-update: schema shape, Laplace-smoothed confidence, 5%/week decay clamped at 0, multi-dimension extraction, case-insensitive matching, session cap, legacy profile migration with session truncation, taste-drift conflict warning, malformed-JSON recovery, missing-variant exit code.
- test/publish-dry-run.test.ts: 13 tests against gstack-publish --dry-run: manifest parsing, missing/malformed JSON, per-skill validation errors (missing source file / slug / version / marketplaces), slug filter, unknown-skill exit, per-marketplace auth isolation (fake marketplaces with always-pass / always-fail / missing-binary CLIs), and a sanity check against the real repo manifest.
- test/benchmark-cli.test.ts: 11 tests against gstack-model-benchmark --dry-run: provider default, unknown-provider WARN, empty list fallback, flag passthrough (timeout/workdir/judge/output), long-prompt truncation, prompt resolution (inline vs file vs positional), missing prompt exit.

New periodic-tier test file (paid, gated EVALS=1):

- test/skill-e2e-benchmark-providers.test.ts: 8 tests hitting real claude, codex, gemini CLIs with a trivial prompt (~$0.001/provider). Verifies output parsing, token accounting, cost estimation, timeout error.code semantics, Promise.allSettled parallel isolation. Per-provider availability gate: unauthed providers skip cleanly.

This suite already caught one real bug (codex adapter missing --skip-git-repo-check, fixed in 5260987d).

Registered `benchmark-providers-live` in touchfiles.ts (periodic tier, triggered by changes to bin/gstack-model-benchmark, providers/**, benchmark-runner.ts, pricing.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
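
For readers unfamiliar with the "Promise.allSettled parallel isolation" property mentioned above, here is a minimal sketch of what such a test could be checking: one provider failing must not prevent the others from reporting results. The `Adapter` and `ProviderResult` shapes and the `runAll` helper are hypothetical names chosen for illustration; the actual adapter types are not shown in this commit.

```typescript
// Hypothetical shapes for illustration; not the repository's real adapter types.
interface ProviderResult {
  provider: string;
  ok: boolean;
  output?: string;
  error?: string;
}

interface Adapter {
  name: string;
  run: (prompt: string) => Promise<string>;
}

// Run every adapter in parallel and report one result per adapter,
// regardless of whether any individual adapter rejects or throws.
async function runAll(adapters: Adapter[], prompt: string): Promise<ProviderResult[]> {
  // The async wrapper turns a synchronous throw inside run() into a rejection,
  // so it is captured by allSettled instead of aborting the map() call.
  const settled = await Promise.allSettled(
    adapters.map(async (adapter) => adapter.run(prompt)),
  );

  // allSettled preserves input order, so index i maps back to adapters[i].
  return settled.map((outcome, i): ProviderResult =>
    outcome.status === "fulfilled"
      ? { provider: adapters[i].name, ok: true, output: outcome.value }
      : {
          provider: adapters[i].name,
          ok: false,
          error: outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason),
        },
  );
}
```

Unlike `Promise.all`, which rejects as soon as any input rejects, `Promise.allSettled` always resolves with one outcome per input, which is what makes per-provider isolation straightforward to assert.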
2026-05-19 02:42:29 +08:00 · 2026-04-18 06:45:06 +08:00
parent 42715188c2
commit c875e0c3fc
5 changed files with 966 additions and 0 deletions
--- a/test/helpers/touchfiles.ts
+++ b/test/helpers/touchfiles.ts
@@ -171,6 +171,9 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  // Autoplan
  'autoplan-core':  ['autoplan/**', 'plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**'],

+  // Multi-provider benchmark adapters — live API smoke against real claude/codex/gemini CLIs
+  'benchmark-providers-live': ['bin/gstack-model-benchmark', 'test/helpers/providers/**', 'test/helpers/benchmark-runner.ts', 'test/helpers/pricing.ts'],
+
  // Skill routing — journey-stage tests (depend on ALL skill descriptions)
  'journey-ideation':       ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
  'journey-plan-eng':       ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
@@ -316,6 +319,9 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  // Autoplan — periodic (not yet implemented)
  'autoplan-core': 'periodic',

+  // Multi-provider benchmark — periodic (requires external CLIs + auth, paid)
+  'benchmark-providers-live': 'periodic',
+
  // Skill routing — periodic (LLM routing is non-deterministic)
  'journey-ideation': 'periodic',
  'journey-plan-eng': 'periodic',
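
The two hunks above register the new suite in a glob map and a tier map. As a rough sketch of how such maps could drive test selection, the snippet below matches changed file paths against the globs and filters by tier. The `globToRegExp` and `selectTests` helpers and the example path `test/helpers/providers/claude.ts` are assumptions made for illustration; the repository's real selection logic is not part of this diff and may differ.

```typescript
type Tier = "gate" | "periodic";

// Entries copied from the diff above; only the new suite is included here.
const TOUCHFILES: Record<string, string[]> = {
  "benchmark-providers-live": [
    "bin/gstack-model-benchmark",
    "test/helpers/providers/**",
    "test/helpers/benchmark-runner.ts",
    "test/helpers/pricing.ts",
  ],
};

const TIERS: Record<string, Tier> = {
  "benchmark-providers-live": "periodic",
};

// Hypothetical minimal glob support: `**` spans directories, `*` stays in one segment.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except `*`, which is translated below.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .split("**")
    .map((part) => part.replace(/\*/g, "[^/]*"))
    .join(".*");
  return new RegExp(`^${pattern}$`);
}

// Hypothetical selector: names of suites in `tier` whose globs match any changed file.
function selectTests(changedFiles: string[], tier: Tier): string[] {
  return Object.entries(TOUCHFILES)
    .filter(([name]) => TIERS[name] === tier)
    .filter(([, globs]) =>
      globs.some((glob) => {
        const re = globToRegExp(glob);
        return changedFiles.some((file) => re.test(file));
      }),
    )
    .map(([name]) => name);
}

// A change under providers/ selects the suite only when the periodic tier is requested.
console.log(selectTests(["test/helpers/providers/claude.ts"], "periodic"));
console.log(selectTests(["test/helpers/providers/claude.ts"], "gate"));
```

Keeping the tier in a separate map from the globs means a suite can be moved between gate and periodic without touching its trigger paths.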