feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)

* refactor: export readOverlay from model-overlay resolver

Needed by the overlay-efficacy eval harness to resolve INHERIT directives
without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
harness to drive the model through a closer-to-real Claude Code harness
than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

Verifies: SDK loads, claude-opus-4-7 is a live API model, SDKMessage
event shape matches assumptions, readOverlay resolves INHERIT directives
and includes expected content. Run with `bun run scripts/preflight-agent-sdk.ts`.

PREFLIGHT OK on first run, $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

`test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
with explicit `AgentSdkResult` types, process-level API concurrency
semaphore, and 3-shape 429 retry (thrown error, result-message error,
mid-stream SDKRateLimitEvent). Pins the local claude binary via
`pathToClaudeCodeExecutable`.
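
The two core mechanisms can be sketched as below. This is an illustrative reconstruction under assumptions, not the actual runner: `Semaphore`, `is429`, and `runWithRetry` are hypothetical names, and the event shapes are simplified stand-ins for the SDK's.

```typescript
// Process-level concurrency semaphore: callers queue when permits run out.
class Semaphore {
  private queue: Array<() => void> = [];
  constructor(private permits: number) {}
  async acquire(): Promise<void> {
    if (this.permits > 0) { this.permits--; return; }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }
  release(): void {
    const next = this.queue.shift();
    if (next) next(); else this.permits++;
  }
}

// Simplified stand-ins for the SDK event shapes.
type RunnerEvent =
  | { type: "result"; subtype: "success" | "error"; errorCode?: number }
  | { type: "rate_limit" };

// Normalize the three 429 shapes (thrown error, error result message,
// mid-stream rate-limit event) into one retryable signal.
function is429(e: unknown): boolean {
  if (e instanceof Error) return /429/.test(e.message);
  const ev = e as RunnerEvent;
  return (
    ev.type === "rate_limit" ||
    (ev.type === "result" && ev.subtype === "error" && ev.errorCode === 429)
  );
}

async function runWithRetry<T>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (!is429(err)) throw err; // non-429 errors propagate immediately
      // 429: reset the workspace here, then retry.
    }
  }
  throw new Error(`rate limit exhausted after ${maxAttempts} attempts`);
}
```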

`test/fixtures/overlay-nudges.ts` holds the typed registry. Two
fixtures for the first measurement: `opus-4-7-fanout-toy` (3-file read)
and `opus-4-7-fanout-realistic` (mixed-tool audit). Strict validator
rejects duplicate ids, non-integer trial counts, unsafe overlay paths, unsafe
id characters, and missing overlay files at module load.

Adding a future overlay nudge eval = one fixture entry.
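
A single entry might look like the sketch below. The `OverlayFixture` field names here are assumptions inferred from this description, not the actual type in the repo:

```typescript
// Hypothetical fixture shape; validator constraints noted inline.
interface OverlayFixture {
  id: string;                 // must be unique, safe chars only
  overlayPath: string;        // must exist under model-overlays/
  prompt: string;
  trials: number;             // must be a positive integer
  metric: (toolUseNames: string[]) => number;
  passes: (overlayOn: number, overlayOff: number) => boolean;
}

const opus47FanoutToy: OverlayFixture = {
  id: "opus-4-7-fanout-toy",
  overlayPath: "model-overlays/opus-4-7.md",
  prompt: "Read the three files and summarize each.",
  trials: 10,
  // First-turn parallelism proxy: count tool_use blocks.
  metric: (toolUseNames) => toolUseNames.length,
  passes: (on, off) => on >= off,
};
```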

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

Stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
happy-path shape, all 3 rate-limit shapes + retry, workspace reset on
retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation,
process-level concurrency cap, options propagation, artifact path
uniqueness, cost/turn mapping, and every validator rejection case.

* test(eval): paid periodic overlay-efficacy harness

`test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES,
runs two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with
bounded concurrency. Arms use SDK preset `claude_code` so both include
the real Claude Code system prompt; overlay-ON appends the resolved
overlay text. Saves per-trial raw event streams to
`~/.gstack/projects/<slug>/transcripts/` for forensic recovery.

Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).
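
The two-arm, N-trials design can be sketched as follows, with a deterministic stub in place of the real SDK call. `runTrial` and `runFixture` are hypothetical names, not the harness's actual API:

```typescript
type Arm = "overlay-on" | "overlay-off";

// Stub standing in for the real SDK call; returns 1 when the first
// assistant turn fanned out, 0 otherwise. Deterministic for the sketch.
async function runTrial(arm: Arm, trial: number): Promise<number> {
  return arm === "overlay-off" ? (trial < 7 ? 1 : 0) : (trial < 1 ? 1 : 0);
}

// Run both arms at N trials and report the per-arm fanout rate.
async function runFixture(trials: number): Promise<Record<Arm, number>> {
  const rates = { "overlay-on": 0, "overlay-off": 0 } as Record<Arm, number>;
  for (const arm of ["overlay-on", "overlay-off"] as const) {
    let hits = 0;
    for (let t = 0; t < trials; t++) hits += await runTrial(arm, t);
    rates[arm] = hits / trials;
  }
  return rates;
}
```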

* test: register overlay harness in touchfiles (both maps)

Entries for `overlay-harness-opus-4-7-fanout-toy` and
`opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes
`test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

Measured counterproductive under the new SDK harness. Baseline Opus 4.7
emits first-turn parallel tool_use blocks 70% of the time on a 3-file
read prompt. With the custom nudge: 10%. With Anthropic's own canonical
`<use_parallel_tool_calls>` block from their parallel-tool-use docs:
0%. Both overlays suppress fanout; neither improves it.

On realistic multi-tool prompts (audit a project: read files + glob +
summarize), Opus 4.7 never fans out in first turn regardless of overlay.
Zero of 20 trials. Not a prompt problem.

Keeping the other three nudges (effort-match, batch questions, literal
interpretation) pending their own measurement. Harness is ready for
follow-up fixtures — add one entry to
`test/fixtures/overlay-nudges.ts` to measure any overlay bullet.

Cost of investigation: ~$7 total across 3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
(e.g. literal-interpretation 'fix the failing tests' needs write access).
Per-fixture maxTurns lets harder prompts run longer without changing the
default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
- bashToolCallCount — count of Bash tool_use across the session
- turnsToCompletion — SDK-reported num_turns at result
- uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
and fixture.maxTurns through runArm.
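
The predicates and one of the metrics listed above can be sketched like this; the exact semantics (20% lift measured against the overlay-OFF arm) are assumptions:

```typescript
// 20% lift thresholds: overlay-ON must beat the overlay-OFF baseline
// by at least 20% in the stated direction (assumed semantics).
const lowerIsBetter20Pct = (on: number, off: number): boolean => on <= 0.8 * off;
const higherIsBetter20Pct = (on: number, off: number): boolean => on >= 1.2 * off;

// Bash usage metric: count Bash tool_use blocks across the session.
interface ToolUseEvent { type: "tool_use"; name: string; }
const bashToolCallCount = (events: ToolUseEvent[]): number =>
  events.filter((e) => e.name === "Bash").length;
```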

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

Measures three overlay bullets that haven't been tested yet:

- claude-dedicated-tools-vs-bash — claude.md says 'prefer Read/Edit/Write/
  Glob/Grep over cat/sed/find/grep'. Fixture prompts 'list every TypeScript
  file under src/ and tell me what each exports' and counts Bash tool_use
  across the session. Overlay-ON should drop it by >=20%.
- opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads
  don't need deep reasoning.' Fixture uses a trivial one-file prompt
  (config.json lookup) and measures turns_used. Overlay-ON should be
  <=80% of baseline turns.
- opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing
  tests, not just the obvious one.' Fixture seeds three failing test
  files with deliberately distinct failure modes and counts unique files
  edited. Overlay-ON should touch >=20% more files.

Adding a fourth fixture for any remaining overlay nudge is a single entry.
The harness is now proven on: fanout (nudge removed after measurement),
dedicated tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

Some @anthropic-ai/claude-agent-sdk versions throw from the query
generator when maxTurns is reached, instead of emitting a result
message with subtype='error_max_turns'. The runner treated that as
a non-retryable error and killed the whole periodic run on the first
fixture that exceeded its turn cap.

Added isMaxTurnsError() detector and a catch branch that synthesizes
an AgentSdkResult from events captured before the throw, with
exitReason='error_max_turns' and costUsd=0 (unknown from the thrown
path). The metric function still runs against whatever assistant
turns were collected, so the trial produces a usable number.

Hoisted events/assistantTurns/toolCalls/assistantTextParts and the
timing counters out of the inner try so the catch branch can read
them. No behavior change on the success path or on rate-limit retry
paths.
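
A minimal sketch of the recovery path, assuming the thrown error's message identifies max-turns (the real detector may match differently; all names here are illustrative):

```typescript
interface AgentSdkResult {
  exitReason: "success" | "error_max_turns";
  costUsd: number;
  events: unknown[];
}

// Assumed detection heuristic; the real check may inspect error class/code.
function isMaxTurnsError(err: unknown): boolean {
  return err instanceof Error && /max[ _-]?turns/i.test(err.message);
}

async function collectRun(
  run: () => AsyncIterable<unknown>,
): Promise<AgentSdkResult> {
  // Hoisted out of the try so the catch branch can read what was captured.
  const events: unknown[] = [];
  try {
    for await (const ev of run()) events.push(ev);
    return { exitReason: "success", costUsd: 0 /* from result message */, events };
  } catch (err) {
    if (!isMaxTurnsError(err)) throw err; // everything else still propagates
    // Synthesize a usable result from pre-throw events; cost is unknown
    // on the thrown path, so report 0.
    return { exitReason: "error_max_turns", costUsd: 0, events };
  }
}
```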

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

The prompt 'list every TypeScript file under src/ and tell me what
each exports' needs 1 turn for Glob + ~5 for Reads + 1 for summary.
Default maxTurns=5 was not enough; prior run threw from the SDK on
this fixture and tanked the whole periodic eval.

Bumping to 15 gives headroom. The runner now also handles max-turns
gracefully even if a future fixture underestimates, so this is belt
and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
Tests whether the overlays behave differently on a weaker Claude model
where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
Opus so these 5 add ~$4.50 to a full run.

Measurement result from the first paired run (100 trials total,
~$14.55):

- **Sonnet + effort-match shows real overlay benefit.** With the overlay
  on, Sonnet takes 2.5 turns on a trivial `What's the version in
  config.json?` prompt. Without, it takes exactly 3.0 turns in all 10
  trials. ~17% reduction, below the 20% pass threshold but the signal
  is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF
  [3,3,3,3,3,3,3,3,3,3].
- All other Sonnet dimensions flat (fanout, dedicated-tools, literal
  interpretation). Same as Opus on those axes.
- Opus effort-match remains flat (2.60 vs 2.50, +4% slower with overlay).

Implication: overlay efficacy is model-stratified. The overlay stack helps
Sonnet on some axes where it does nothing on Opus; wholesale removal would
hurt Sonnet.
Per-nudge per-model measurement is the right move going forward.

* chore: bump version to 1.10.1.0

Updates VERSION, package.json, CHANGELOG header, and TODOS completion
marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: Garry Tan
Date: 2026-04-23 18:42:58 -07:00 (committed via GitHub)
Parent: a81be53621
Commit: e3d7f49c74
13 changed files with 2489 additions and 41 deletions

# Changelog
## [1.10.1.0] - 2026-04-23
## **We tried to make Opus 4.7 faster with a prompt. Measurement said it got slower. Pulled the bullet.**
gstack shipped a "Fan out explicitly" overlay nudge in `model-overlays/opus-4-7.md`
back in v1.5.2.0. The idea: tell Opus 4.7 to emit multiple tool calls in one
assistant turn instead of one per turn, so "read three files" takes one API
round-trip instead of three. Sounded obvious. This release removes that
bullet after measuring that it actively hurt performance, and ships the eval
harness we used to prove it so you can measure your own overlay changes.
### The numbers that matter
Source: new `test/skill-e2e-overlay-harness.test.ts`, N=10 trials per arm per
fixture, 40 trials per run, ~$3 per run. Pinned to `claude-opus-4-7` via
Anthropic's published Agent SDK (`@anthropic-ai/claude-agent-sdk@0.2.117`)
with `pathToClaudeCodeExecutable` set to the locally-installed `claude` binary
(2.1.118). Metric: number of parallel `tool_use` blocks in the first assistant
turn.

| Prompt text in overlay | First-turn fanout rate (toy: read 3 files) | Lift vs baseline |
|---|---|---|
| No overlay (default Claude Code system prompt only) | **70%** (7/10) | baseline |
| gstack's original "Fan out explicitly" nudge (v1.5.2.0 through v1.6.3.0) | 10% (1/10) | **-60%** |
| Anthropic's own canonical `<use_parallel_tool_calls>` text from their parallel-tool-use docs | **0%** (0/10) | **-70%** |

On a realistic multi-file audit prompt (`read app.ts + config.ts + README.md,
glob src/*.ts, summarize`), Opus 4.7 never fanned out in the first turn at all,
regardless of overlay. Zero of 20 trials. The nudge had nothing to grip.
Total cost of the investigation: **$7** across three eval runs.
### What this means for you
If you ship system-prompt nudges for Claude, measure them. Anthropic's own
published best-practice text dropped our fanout rate to zero. That's not a
claim about Anthropic, it's a claim about measurement: the model, the SDK,
the binary, and the context all move under the advice, and the advice sits
still. The harness is in the repo now. Run
`EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-overlay-harness.test.ts`.
Three dollars per run.
### Itemized changes
#### Fixed
- `model-overlays/opus-4-7.md` — removed the "Fan out explicitly" block. The
other three nudges (effort-match, batch questions, literal interpretation)
are untested and stay in for now. They're candidates for their own
measurement in a follow-up PR.
#### Added
- `test/skill-e2e-overlay-harness.test.ts` — periodic-tier eval that iterates a
typed fixture registry and runs A/B arms through `@anthropic-ai/claude-agent-sdk`.
Uses SDK preset `claude_code` so the arms include Claude Code's real system
prompt; overlay-ON appends the resolved overlay text. Saves per-trial raw
event streams for forensic recovery. Gated on both `EVALS=1` and
`EVALS_TIER=periodic`.
- `test/fixtures/overlay-nudges.ts` — typed `OverlayFixture` registry with
strict validator. Adding a future nudge to measure = one fixture entry.
First two fixtures: `opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic`.
- `test/helpers/agent-sdk-runner.ts` — parametric SDK wrapper with explicit
`AgentSdkResult` types, process-level API concurrency semaphore, and
three-shape 429 retry (thrown error, result-message error, mid-stream
`SDKRateLimitEvent`). Binary pinning via `pathToClaudeCodeExecutable`.
- `test/agent-sdk-runner.test.ts` — 36 free-tier unit tests covering happy
path, all three rate-limit shapes, persistent-429 `RateLimitExhaustedError`,
non-429 propagation, options propagation, concurrency cap, and every
validator rejection case.
- `scripts/preflight-agent-sdk.ts` — 20-line sanity check that confirms the
SDK loads, `claude-opus-4-7` is a live API model, the `SDKMessage` event
shape matches assumptions, and the overlay resolver produces the expected
text. Run manually before paid runs if you suspect drift. Costs ~$0.013.
- `@anthropic-ai/claude-agent-sdk@0.2.117` in `devDependencies`. Exact pin,
no caret — SDK event shapes can drift on minor versions.
#### Changed
- `scripts/resolvers/model-overlay.ts` — exported `readOverlay` so the eval
harness can resolve `{{INHERIT:claude}}` directives without synthesizing a
full `TemplateContext`.
#### For contributors
- `test/helpers/touchfiles.ts` — registered the new eval in both
`E2E_TOUCHFILES` (deps: `model-overlays/**`, `overlay-nudges.ts`, runner,
resolver) and `E2E_TIERS` (`periodic`). Passes the
`test/touchfiles.test.ts` completeness check.
- The harness is deliberately parametric. Adding a second overlay nudge
measurement (for the remaining three nudges in `opus-4-7.md`, or any
future nudge in any overlay file) is a single entry in
`test/fixtures/overlay-nudges.ts`. Total incremental effort: ~15 minutes
per fixture.
## [1.10.0.0] - 2026-04-23
## **Plan reviews walk you through each issue again, and every question is now a real decision brief.**