mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-08 13:39:45 +08:00
feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)
* refactor: export readOverlay from model-overlay resolver

  Needed by the overlay-efficacy eval harness to resolve INHERIT directives without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

  Pinned exact for SDK event-shape stability. Used by the overlay-efficacy harness to drive the model through a closer-to-real Claude Code harness than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

  Verifies: the SDK loads, claude-opus-4-7 is a live API model, the SDKMessage event shape matches assumptions, and readOverlay resolves INHERIT directives and includes the expected content. Run with `bun run scripts/preflight-agent-sdk.ts`. PREFLIGHT OK on first run, $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

  `test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk with explicit `AgentSdkResult` types, a process-level API concurrency semaphore, and 3-shape 429 retry (thrown error, result-message error, mid-stream SDKRateLimitEvent). Pins the local claude binary via `pathToClaudeCodeExecutable`.

  `test/fixtures/overlay-nudges.ts` holds the typed registry. Two fixtures for the first measurement: `opus-4-7-fanout-toy` (3-file read) and `opus-4-7-fanout-realistic` (mixed-tool audit). The strict validator rejects duplicate ids, non-integer trials, unsafe overlay paths, unsafe id characters, and missing overlay files at module load. Adding a future overlay nudge eval = one fixture entry.

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

  A stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers: happy-path shape, all 3 rate-limit shapes + retry, workspace reset on retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation, process-level concurrency cap, options propagation, artifact path uniqueness, cost/turn mapping, and every validator rejection case.

* test(eval): paid periodic overlay-efficacy harness

  `test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES and runs two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with bounded concurrency. Arms use the SDK preset `claude_code` so both include the real Claude Code system prompt; overlay-ON appends the resolved overlay text. Saves per-trial raw event streams to `~/.gstack/projects/<slug>/transcripts/` for forensic recovery. Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

  Entries for `overlay-harness-opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/, the fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes the `test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

  Measured counterproductive under the new SDK harness. Baseline Opus 4.7 emits first-turn parallel tool_use blocks 70% of the time on a 3-file read prompt. With the custom nudge: 10%. With Anthropic's own canonical `<use_parallel_tool_calls>` block from their parallel-tool-use docs: 0%. Both overlays suppress fanout; neither improves it.

  On realistic multi-tool prompts (audit a project: read files + glob + summarize), Opus 4.7 never fans out in the first turn regardless of overlay. Zero of 20 trials. Not a prompt problem.

  Keeping the other three nudges (effort-match, batch questions, literal interpretation) pending their own measurement. The harness is ready for follow-up fixtures — add one entry to `test/fixtures/overlay-nudges.ts` to measure any overlay bullet. Cost of investigation: ~$7 total across 3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

  A per-fixture tool allowlist unblocks measuring nudges that need Edit/Write (e.g. literal-interpretation 'fix the failing tests' needs write access). Per-fixture maxTurns lets harder prompts run longer without changing the default. `direction` is cosmetic metadata for test output labeling.

  Also adds reusable predicates and metrics:
  - lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
  - bashToolCallCount — count of Bash tool_use across the session
  - turnsToCompletion — SDK-reported num_turns at result
  - uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

  test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools and fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

  Measures three overlay bullets that haven't been tested yet:
  - claude-dedicated-tools-vs-bash — claude.md says 'prefer Read/Edit/Write/Glob/Grep over cat/sed/find/grep'. The fixture prompts 'list every TypeScript file under src/ and tell me what each exports' and counts Bash tool_use across the session. Overlay-ON should drop it by >=20%.
  - opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads don't need deep reasoning.' The fixture uses a trivial one-file prompt (config.json lookup) and measures turns_used. Overlay-ON should be <=80% of baseline turns.
  - opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing tests, not just the obvious one.' The fixture seeds three failing test files with deliberately distinct failure modes and counts unique files edited. Overlay-ON should touch >=20% more files.

  Adding a fourth fixture for any remaining overlay nudge is a single entry. The harness is now proven on: fanout (deleted after measurement), dedicated tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

  Some @anthropic-ai/claude-agent-sdk versions throw from the query generator when maxTurns is reached, instead of emitting a result message with subtype='error_max_turns'. The runner treated that as a non-retryable error and killed the whole periodic run on the first fixture that exceeded its turn cap.

  Added an isMaxTurnsError() detector and a catch branch that synthesizes an AgentSdkResult from the events captured before the throw, with exitReason='error_max_turns' and costUsd=0 (unknown from the thrown path). The metric function still runs against whatever assistant turns were collected, so the trial produces a usable number.

  Hoisted events/assistantTurns/toolCalls/assistantTextParts and the timing counters out of the inner try so the catch branch can read them. No behavior change on the success path or on rate-limit retry paths.

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

  The prompt 'list every TypeScript file under src/ and tell me what each exports' needs 1 turn for Glob + ~5 for Reads + 1 for the summary. The default maxTurns=5 was not enough; a prior run threw from the SDK on this fixture and tanked the whole periodic eval. Bumping to 15 gives headroom. The runner now also handles max-turns gracefully even if a future fixture underestimates, so this is belt and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

  Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`. Tests whether the overlays behave differently on a weaker Claude model where baseline behavior is shakier. Sonnet trials cost ~3-4x less than Opus, so these 5 add ~$4.50 to a full run.

  Measurement result from the first paired run (100 trials total, ~$14.55):
  - **Sonnet + effort-match shows a real overlay benefit.** With the overlay on, Sonnet averages 2.5 turns on a trivial `What's the version in config.json?` prompt. Without it, it takes exactly 3.0 turns in all 10 trials. That's a ~17% reduction, below the 20% pass threshold, but the signal is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF [3,3,3,3,3,3,3,3,3,3].
  - All other Sonnet dimensions are flat (fanout, dedicated-tools, literal interpretation). Same as Opus on those axes.
  - Opus effort-match remains flat (2.60 vs 2.50, +4% slower with the overlay).

  Implication: model-stratified. The overlay stack helps Sonnet on some axes where it does nothing on Opus. Wholesale removal would hurt Sonnet. Per-nudge per-model measurement is the right move going forward.

* chore: bump version to 1.10.1.0

  Updates VERSION, package.json, the CHANGELOG header, and the TODOS completion marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
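The 20% lift predicates named above (`lowerIsBetter20Pct` / `higherIsBetter20Pct`) reduce to a mean comparison. A minimal sketch, assuming the real helpers take per-trial metric arrays; the actual signatures in `test/fixtures/overlay-nudges.ts` may differ:

```typescript
// Sketch of the 20%-lift predicates; signatures are assumptions.
const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// Passes when the overlay-ON mean is at least 20% below baseline,
// e.g. turns-to-completion on a trivial prompt.
function lowerIsBetter20Pct(overlayOn: number[], overlayOff: number[]): boolean {
  return mean(overlayOn) <= 0.8 * mean(overlayOff);
}

// Passes when the overlay-ON mean is at least 20% above baseline,
// e.g. unique files edited on the literal-interpretation fixture.
function higherIsBetter20Pct(overlayOn: number[], overlayOff: number[]): boolean {
  return mean(overlayOn) >= 1.2 * mean(overlayOff);
}

// The Sonnet effort-match distributions reported above:
const on = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3];  // mean 2.5
const off = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]; // mean 3.0
console.log(lowerIsBetter20Pct(on, off)); // false: ~17% misses the 20% bar
```

This is exactly why the Sonnet effort-match result reads as "clean signal, below threshold": a 17% reduction fails a predicate calibrated at 20%.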
This commit is contained in:
95 CHANGELOG.md
# Changelog

## [1.10.1.0] - 2026-04-23

## **We tried to make Opus 4.7 faster with a prompt. Measurement said it got slower. Pulled the bullet.**

gstack shipped a "Fan out explicitly" overlay nudge in `model-overlays/opus-4-7.md`
back in v1.5.2.0. The idea: tell Opus 4.7 to emit multiple tool calls in one
assistant turn instead of one per turn, so "read three files" takes one API
round-trip instead of three. Sounded obvious. This release removes that
bullet after measuring that it actively hurt performance, and ships the eval
harness we used to prove it so you can measure your own overlay changes.
### The numbers that matter

Source: new `test/skill-e2e-overlay-harness.test.ts`, N=10 trials per arm per
fixture, 40 trials per run, ~$3 per run. Pinned to `claude-opus-4-7` via
Anthropic's published Agent SDK (`@anthropic-ai/claude-agent-sdk@0.2.117`)
with `pathToClaudeCodeExecutable` set to the locally-installed `claude` binary
(2.1.118). Metric: number of parallel `tool_use` blocks in the first assistant
turn.
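The metric is a pure function over the captured event stream. A minimal sketch, using simplified stand-ins for the SDK message shapes rather than the real `SDKMessage` types:

```typescript
// Simplified stand-ins; the real SDKMessage types from
// @anthropic-ai/claude-agent-sdk are richer than this.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; name: string };
type AssistantTurn = { role: 'assistant'; content: ContentBlock[] };

// Fanout metric: how many tool_use blocks the model emitted in its
// FIRST assistant turn. >1 means it fanned out; 0 or 1 means it didn't.
function firstTurnFanout(turns: AssistantTurn[]): number {
  const first = turns[0];
  if (!first) return 0;
  return first.content.filter((b) => b.type === 'tool_use').length;
}

// A fanned-out first turn: three parallel Read calls in one turn.
const fannedOut: AssistantTurn[] = [
  { role: 'assistant', content: [
    { type: 'tool_use', name: 'Read' },
    { type: 'tool_use', name: 'Read' },
    { type: 'tool_use', name: 'Read' },
  ]},
];
console.log(firstTurnFanout(fannedOut)); // 3
```

A trial counts as "fanned out" when this number exceeds 1, which is how the per-arm rates in the table below are tallied.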
| Prompt text in overlay | First-turn fanout rate (toy: read 3 files) | Lift vs baseline |
|---|---|---|
| No overlay (default Claude Code system prompt only) | **70%** (7/10) | baseline |
| gstack's original "Fan out explicitly" nudge (v1.5.2.0 through v1.6.3.0) | 10% (1/10) | **-60%** |
| Anthropic's own canonical `<use_parallel_tool_calls>` text from their parallel-tool-use docs | **0%** (0/10) | **-70%** |

On a realistic multi-file audit prompt (`read app.ts + config.ts + README.md,
glob src/*.ts, summarize`), Opus 4.7 never fanned out in the first turn at all,
regardless of overlay. Zero of 20 trials. The nudge had nothing to grip.

Total cost of the investigation: **$7** across three eval runs.
### What this means for you

If you ship system-prompt nudges for Claude, measure them. Anthropic's own
published best-practice text dropped our fanout rate to zero. That's not a
claim about Anthropic; it's a claim about measurement: the model, the SDK,
the binary, and the context all move under the advice, while the advice sits
still. The harness is in the repo now. Run
`EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-overlay-harness.test.ts`.
Three dollars per run.
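The A/B comparison itself is simple once per-trial metric values exist. A sketch of the rate tally; the per-trial arrays here are illustrative (only the 7/10 and 1/10 rates come from the measurement):

```typescript
// Fraction of trials where the model fanned out, i.e. emitted more
// than one tool_use block in its first assistant turn.
function fanoutRate(metricValues: number[]): number {
  return metricValues.filter((v) => v > 1).length / metricValues.length;
}

// Illustrative per-trial fanout counts matching the measured rates:
// baseline fanned out in 7 of 10 trials, overlay-ON in 1 of 10.
const offTrials = [3, 3, 3, 1, 3, 3, 1, 3, 3, 1];
const onTrials = [1, 1, 1, 1, 3, 1, 1, 1, 1, 1];
console.log(fanoutRate(offTrials)); // 0.7 baseline
console.log(fanoutRate(onTrials));  // 0.1 with the nudge: the -60% lift
```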
### Itemized changes

#### Fixed

- `model-overlays/opus-4-7.md` — removed the "Fan out explicitly" block. The
  other three nudges (effort-match, batch questions, literal interpretation)
  are untested and stay in for now. They're candidates for their own
  measurement in a follow-up PR.
#### Added

- `test/skill-e2e-overlay-harness.test.ts` — periodic-tier eval that iterates a
  typed fixture registry and runs A/B arms through `@anthropic-ai/claude-agent-sdk`.
  Uses SDK preset `claude_code` so the arms include Claude Code's real system
  prompt; overlay-ON appends the resolved overlay text. Saves per-trial raw
  event streams for forensic recovery. Gated on both `EVALS=1` and
  `EVALS_TIER=periodic`.
- `test/fixtures/overlay-nudges.ts` — typed `OverlayFixture` registry with a
  strict validator. Adding a future nudge to measure = one fixture entry.
  First two fixtures: `opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic`.
- `test/helpers/agent-sdk-runner.ts` — parametric SDK wrapper with explicit
  `AgentSdkResult` types, a process-level API concurrency semaphore, and
  three-shape 429 retry (thrown error, result-message error, mid-stream
  `SDKRateLimitEvent`). Binary pinning via `pathToClaudeCodeExecutable`.
- `test/agent-sdk-runner.test.ts` — 36 free-tier unit tests covering the happy
  path, all three rate-limit shapes, persistent-429 `RateLimitExhaustedError`,
  non-429 propagation, options propagation, the concurrency cap, and every
  validator rejection case.
- `scripts/preflight-agent-sdk.ts` — 20-line sanity check that confirms the
  SDK loads, `claude-opus-4-7` is a live API model, the `SDKMessage` event
  shape matches assumptions, and the overlay resolver produces the expected
  text. Run manually before paid runs if you suspect drift. Costs ~$0.013.
- `@anthropic-ai/claude-agent-sdk@0.2.117` in `devDependencies`. Exact pin,
  no caret — SDK event shapes can drift on minor versions.
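The three-shape 429 handling can be collapsed into one classifier. A sketch under stated assumptions: the thrown-error, result-message, and mid-stream shapes below are modeled on the description above, not on the SDK's published types, and the discriminating field values are hypothetical:

```typescript
// Hypothetical shapes for the three ways a rate limit can surface;
// the real SDK types and field values may differ.
type Candidate =
  | { kind: 'thrown'; error: { status?: number; message: string } } // shape 1
  | { kind: 'result'; subtype: string }                             // shape 2
  | { kind: 'event'; type: string };                                // shape 3

// True when the candidate is a 429 in any of the three shapes, so the
// runner can retry it; anything else propagates and fails the trial.
function isRateLimited(c: Candidate): boolean {
  switch (c.kind) {
    case 'thrown':
      return c.error.status === 429 || /rate.?limit/i.test(c.error.message);
    case 'result':
      return c.subtype === 'error_rate_limit';
    case 'event':
      return c.type === 'rate_limit_event';
  }
}
```

Funneling all three shapes through one predicate is what lets the retry loop stay shape-agnostic while non-429 errors still propagate.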
#### Changed

- `scripts/resolvers/model-overlay.ts` — exported `readOverlay` so the eval
  harness can resolve `{{INHERIT:claude}}` directives without synthesizing a
  full `TemplateContext`.
#### For contributors

- `test/helpers/touchfiles.ts` — registered the new eval in both
  `E2E_TOUCHFILES` (deps: `model-overlays/**`, `overlay-nudges.ts`, runner,
  resolver) and `E2E_TIERS` (`periodic`). Passes the
  `test/touchfiles.test.ts` completeness check.
- The harness is deliberately parametric. Adding a second overlay nudge
  measurement (for the remaining three nudges in `opus-4-7.md`, or any
  future nudge in any overlay file) is a single entry in
  `test/fixtures/overlay-nudges.ts`. Total incremental effort: ~15 minutes
  per fixture.
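For orientation, a new fixture entry might look roughly like this. The field names and the validator slice are inferred from this changelog and the commit message (id, trials, overlay path, allowedTools, maxTurns, direction), so treat the exact shape as an assumption; the real type lives in `test/fixtures/overlay-nudges.ts`:

```typescript
// Hypothetical OverlayFixture shape, inferred from the changelog text.
interface OverlayFixture {
  id: string;              // validator rejects duplicates and unsafe chars
  model: string;
  overlayPath: string;     // validator rejects unsafe/missing paths
  prompt: string;
  trials: number;          // validator rejects non-integers
  allowedTools?: string[]; // per-fixture allowlist for Edit/Write nudges
  maxTurns?: number;       // headroom for harder prompts (default 5)
  direction?: string;      // cosmetic label for test output
}

// A minimal slice of the load-time validation described above.
function validateFixture(f: OverlayFixture): string[] {
  const errors: string[] = [];
  if (!/^[a-z0-9-]+$/.test(f.id)) errors.push(`unsafe id: ${f.id}`);
  if (!Number.isInteger(f.trials) || f.trials <= 0) errors.push('trials must be a positive integer');
  if (f.overlayPath.includes('..')) errors.push('unsafe overlay path');
  return errors;
}

const effortMatch: OverlayFixture = {
  id: 'opus-4-7-effort-match-trivial',
  model: 'claude-opus-4-7',
  overlayPath: 'model-overlays/opus-4-7.md',
  prompt: "What's the version in config.json?",
  trials: 10,
  maxTurns: 5,
  direction: 'lower turns is better',
};
console.log(validateFixture(effortMatch)); // []
```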
## [1.10.0.0] - 2026-04-23

## **Plan reviews walk you through each issue again, and every question is now a real decision brief.**