gstack/test/skill-e2e-plan-format.test.ts at 14b1ba07e99878f39be730fde36664dfe3efe8c2

hai/gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-08 21:49:45 +08:00

Garry Tan 14b1ba07e9 test: wire judgeRecommendation into plan-format E2E with threshold >= 4

All four plan-format cases (CEO mode, CEO approach, eng coverage, eng kind)
now run the judge after the existing regex assertions. A threshold of
reason_substance >= 4 catches both boilerplate ("because it's better") and
generic ("because it's faster") tier reasoning, which are exactly the
failure modes the regex couldn't catch.
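The threshold gate described above could be sketched roughly as follows. This is a minimal illustration, not the code in the test file: the `JudgeResult` shape, the `passesJudge` helper, and the numeric scale are assumptions, since the actual return type of judgeRecommendation is not shown here.

```typescript
// Hypothetical result shape; the real judgeRecommendation output may differ.
interface JudgeResult {
  reason_substance: number; // assumed numeric score, higher means more substantive
  reasoning: string;        // assumed free-text explanation from the judge
}

// Threshold named in the commit message.
const REASON_SUBSTANCE_THRESHOLD = 4;

// Hypothetical helper: true when the judged reasoning meets the threshold.
function passesJudge(
  result: JudgeResult,
  threshold: number = REASON_SUBSTANCE_THRESHOLD,
): boolean {
  return result.reason_substance >= threshold;
}

// Illustrative inputs only; these scores are made up for the sketch.
const boilerplate: JudgeResult = { reason_substance: 2, reasoning: "because it's better" };
const substantive: JudgeResult = { reason_substance: 5, reasoning: "cuts two review rounds" };

const boilerplatePasses = passesJudge(boilerplate);
const substantivePasses = passesJudge(substantive);
```

In a test, the gate would presumably run after the regex assertions, so a response that matches the expected format can still fail on weak reasoning.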

Move recordE2E to after the judge call so judge_scores and judge_reasoning
land in the eval-store JSON for diagnostics. Booleans are encoded as 0/1 to
fit the Record<string, number> shape EvalTestEntry.judge_scores expects.
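The 0/1 encoding mentioned above can be illustrated with a small sketch. The `toJudgeScores` helper and the field names in the example are hypothetical; only the `Record<string, number>` target shape comes from the commit message.

```typescript
// Target shape stated in the commit message for EvalTestEntry.judge_scores.
type JudgeScores = Record<string, number>;

// Hypothetical helper: flattens a mix of numeric scores and boolean flags
// into numbers, encoding true as 1 and false as 0.
function toJudgeScores(raw: Record<string, number | boolean>): JudgeScores {
  const scores: JudgeScores = {};
  for (const [key, value] of Object.entries(raw)) {
    scores[key] = typeof value === "boolean" ? (value ? 1 : 0) : value;
  }
  return scores;
}

// Illustrative input; these field names are assumptions, not the real judge output.
const encoded = toJudgeScores({
  reason_substance: 4,
  has_recommendation: true,
  cites_tradeoff: false,
});
```

Keeping every value numeric means the stored JSON needs no special handling for flags, at the cost of readers having to know that 0 and 1 in those fields stand for false and true.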

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 14:18:06 -07:00

16 KiB
