mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-08 21:49:45 +08:00
All four plan-format cases (CEO mode, CEO approach, eng coverage, eng kind)
now run the judge after the existing regex assertions. A threshold of
reason_substance >= 4 catches both boilerplate ("because it's better") and
generic ("because it's faster") tier reasoning, exactly the failure modes the
regex assertions couldn't catch.
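The threshold check described above might look roughly like the sketch below. Only the reason_substance name and the >= 4 cutoff come from this commit. The JudgeVerdict shape, the meetsSubstanceThreshold helper, and the 1-5 scale are assumptions made for illustration, not the repository's actual code.

```typescript
// Hypothetical verdict shape; only `reason_substance` is named in the commit.
interface JudgeVerdict {
  reason_substance: number; // assumed to be a 1-5 score
  reasoning: string;
}

// Cutoff taken from the commit message.
const REASON_SUBSTANCE_THRESHOLD = 4;

// Hypothetical helper: true when the judge rates the stated reason as
// substantive enough to pass.
function meetsSubstanceThreshold(
  verdict: JudgeVerdict,
  threshold: number = REASON_SUBSTANCE_THRESHOLD,
): boolean {
  return verdict.reason_substance >= threshold;
}

// Illustrative verdicts, not real judge output.
const boilerplateVerdict: JudgeVerdict = {
  reason_substance: 2,
  reasoning: "Reason restates the choice without justifying it.",
};
const substantiveVerdict: JudgeVerdict = {
  reason_substance: 4,
  reasoning: "Reason names a concrete tradeoff.",
};
```

A test would run this check only after the regex assertions pass, so the regex still guards the format and the judge guards the substance.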
Move recordE2E to after the judge call so judge_scores and judge_reasoning
land in the eval-store JSON for diagnostics. Booleans are encoded as 0/1 to
fit the Record<string, number> shape EvalTestEntry.judge_scores expects.
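The 0/1 encoding can be sketched as follows. The Record<string, number> target shape is from this commit. The encodeJudgeScores helper and the names_tradeoff field are hypothetical, used only to show the conversion.

```typescript
// Target shape named in the commit: EvalTestEntry.judge_scores.
type JudgeScores = Record<string, number>;

// Hypothetical helper: copies numeric scores through and maps booleans
// to 1 (true) or 0 (false) so every value is a number.
function encodeJudgeScores(
  raw: Record<string, number | boolean>,
): JudgeScores {
  const encoded: JudgeScores = {};
  for (const [key, value] of Object.entries(raw)) {
    encoded[key] = typeof value === "boolean" ? (value ? 1 : 0) : value;
  }
  return encoded;
}

// `names_tradeoff` is an invented field for illustration.
const encodedExample = encodeJudgeScores({
  reason_substance: 4,
  names_tradeoff: true,
});
```

Encoding at the boundary keeps the stored JSON uniformly numeric, so diagnostics can aggregate scores without special-casing booleans.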
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>