feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)

* refactor: export readOverlay from model-overlay resolver

Needed by the overlay-efficacy eval harness to resolve INHERIT directives
without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
harness to drive the model through a closer-to-real Claude Code harness
than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

Verifies: SDK loads, claude-opus-4-7 is a live API model, SDKMessage
event shape matches assumptions, readOverlay resolves INHERIT directives
and includes expected content. Run with `bun run scripts/preflight-agent-sdk.ts`.

PREFLIGHT OK on first run, $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

`test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
with explicit `AgentSdkResult` types, process-level API concurrency
semaphore, and 3-shape 429 retry (thrown error, result-message error,
mid-stream SDKRateLimitEvent). Pins the local claude binary via
`pathToClaudeCodeExecutable`.

`test/fixtures/overlay-nudges.ts` holds the typed registry. Two
fixtures for the first measurement: `opus-4-7-fanout-toy` (3-file read)
and `opus-4-7-fanout-realistic` (mixed-tool audit). Strict validator
rejects duplicate ids, non-integer trials, unsafe overlay paths, non-safe
id chars, and missing overlay files at module load.

Adding a future overlay nudge eval = one fixture entry.

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

Stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
happy-path shape, all 3 rate-limit shapes + retry, workspace reset on
retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation,
process-level concurrency cap, options propagation, artifact path
uniqueness, cost/turn mapping, and every validator rejection case.

* test(eval): paid periodic overlay-efficacy harness

`test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES,
runs two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with
bounded concurrency. Arms use SDK preset `claude_code` so both include
the real Claude Code system prompt; overlay-ON appends the resolved
overlay text. Saves per-trial raw event streams to
`~/.gstack/projects/<slug>/transcripts/` for forensic recovery.

Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

Entries for `overlay-harness-opus-4-7-fanout-toy` and
`opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes
`test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

Measured counterproductive under the new SDK harness. Baseline Opus 4.7
emits first-turn parallel tool_use blocks 70% of the time on a 3-file
read prompt. With the custom nudge: 10%. With Anthropic's own canonical
`<use_parallel_tool_calls>` block from their parallel-tool-use docs:
0%. Both overlays suppress fanout; neither improves it.

On realistic multi-tool prompts (audit a project: read files + glob +
summarize), Opus 4.7 never fans out in first turn regardless of overlay.
Zero of 20 trials. Not a prompt problem.

Keeping the other three nudges (effort-match, batch questions, literal
interpretation) pending their own measurement. Harness is ready for
follow-up fixtures — add one entry to
`test/fixtures/overlay-nudges.ts` to measure any overlay bullet.

Cost of investigation: ~$7 total across 3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
(e.g. literal-interpretation 'fix the failing tests' needs write access).
Per-fixture maxTurns lets harder prompts run longer without changing the
default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
- bashToolCallCount — count of Bash tool_use across the session
- turnsToCompletion — SDK-reported num_turns at result
- uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
and fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

Measures three overlay bullets that haven't been tested yet:

- claude-dedicated-tools-vs-bash — claude.md says 'prefer Read/Edit/Write/
  Glob/Grep over cat/sed/find/grep'. Fixture prompts 'list every TypeScript
  file under src/ and tell me what each exports' and counts Bash tool_use
  across the session. Overlay-ON should drop it by >=20%.
- opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads
  don't need deep reasoning.' Fixture uses a trivial one-file prompt
  (config.json lookup) and measures turns_used. Overlay-ON should be
  <=80% of baseline turns.
- opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing
  tests, not just the obvious one.' Fixture seeds three failing test
  files with deliberately distinct failure modes and counts unique files
  edited. Overlay-ON should touch >=20% more files.

Adding a fourth fixture for any remaining overlay nudge is a single entry.
The harness is now proven on: fanout (deleted after measurement), dedicated
tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

Some @anthropic-ai/claude-agent-sdk versions throw from the query
generator when maxTurns is reached, instead of emitting a result
message with subtype='error_max_turns'. The runner treated that as
a non-retryable error and killed the whole periodic run on the first
fixture that exceeded its turn cap.

Added isMaxTurnsError() detector and a catch branch that synthesizes
an AgentSdkResult from events captured before the throw, with
exitReason='error_max_turns' and costUsd=0 (unknown from the thrown
path). The metric function still runs against whatever assistant
turns were collected, so the trial produces a usable number.

Hoisted events/assistantTurns/toolCalls/assistantTextParts and the
timing counters out of the inner try so the catch branch can read
them. No behavior change on the success path or on rate-limit retry
paths.

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

The prompt 'list every TypeScript file under src/ and tell me what
each exports' needs 1 turn for Glob + ~5 for Reads + 1 for summary.
Default maxTurns=5 was not enough; prior run threw from the SDK on
this fixture and tanked the whole periodic eval.

Bumping to 15 gives headroom. The runner now also handles max-turns
gracefully even if a future fixture underestimates, so this is belt
and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
Tests whether the overlays behave differently on a weaker Claude model
where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
Opus so these 5 add ~$4.50 to a full run.

Measurement result from the first paired run (100 trials total,
~$14.55):

- **Sonnet + effort-match shows real overlay benefit.** With the overlay
  on, Sonnet takes 2.5 turns on a trivial `What's the version in
  config.json?` prompt. Without, it takes exactly 3.0 turns in all 10
  trials. ~17% reduction, below the 20% pass threshold but the signal
  is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF
  [3,3,3,3,3,3,3,3,3,3].
- All other Sonnet dimensions flat (fanout, dedicated-tools, literal
  interpretation). Same as Opus on those axes.
- Opus effort-match remains flat (2.60 vs 2.50, +4% slower with overlay).

Implication: model-stratified. The overlay stack helps Sonnet on some
axes where it does nothing on Opus. Wholesale removal would hurt Sonnet.
Per-nudge per-model measurement is the right move going forward.

* chore: bump version to 1.10.1.0

Updates VERSION, package.json, CHANGELOG header, and TODOS completion
marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-23 18:42:58 -07:00
committed by GitHub
parent a81be53621
commit e3d7f49c74
13 changed files with 2489 additions and 41 deletions

View File

@@ -0,0 +1,320 @@
/**
* Overlay-efficacy harness (periodic tier, paid).
*
* Measures whether a model-specific overlay nudge actually changes model
* behavior when run through the real Claude Agent SDK — the harness
* Claude Code itself is built on. This complements test/skill-e2e-opus-47.test.ts
* which measures the same thing via `claude -p` subprocess (a different
* harness with different prompt composition).
*
* For each fixture in test/fixtures/overlay-nudges.ts, runs two arms at
* `fixture.trials` trials per arm with bounded concurrency:
* - overlay-on: SDK systemPrompt = resolved overlay content
* - overlay-off: SDK systemPrompt = "" (empty)
*
* Both arms have no CLAUDE.md, no skills directory, no setting-source
* inheritance (settingSources: []). This is the TRUE bare comparison —
* the only variable is the overlay text.
*
* Budget ~$20 per run at 40 trials (2 fixtures × 2 arms × 10 trials).
* Gated by EVALS=1 AND EVALS_TIER=periodic. Never runs under test:gate.
*/
import { describe, test, expect, afterAll } from 'bun:test';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
import {
runAgentSdkTest,
resolveClaudeBinary,
type AgentSdkResult,
type SystemPromptOption,
} from './helpers/agent-sdk-runner';
import { EvalCollector, getProjectEvalDir } from './helpers/eval-store';
import {
OVERLAY_FIXTURES,
type OverlayFixture,
} from './fixtures/overlay-nudges';
import { readOverlay } from '../scripts/resolvers/model-overlay';
const evalsEnabled = !!process.env.EVALS;
const periodicTier = process.env.EVALS_TIER === 'periodic';
const shouldRun = evalsEnabled && periodicTier;
const describeE2E = shouldRun ? describe : describe.skip;
// EvalCollector's tier must be 'e2e' | 'llm-judge' per its type signature.
// The existing paid evals violate this by passing descriptive names like
// 'e2e-opus-47' — a pre-existing pattern that only works because bun-test
// runs without strict typechecking. We stay conforming here.
const evalCollector = shouldRun ? new EvalCollector('e2e') : null;
const REPO_ROOT = path.resolve(import.meta.dir, '..');
const runId = new Date()
.toISOString()
.replace(/[:.]/g, '')
.replace('T', '-')
.slice(0, 15);
const TRANSCRIPTS_DIR = path.join(
path.dirname(getProjectEvalDir()),
'transcripts',
`overlay-harness-${runId}`,
);
// ---------------------------------------------------------------------------
// Per-arm helpers
// ---------------------------------------------------------------------------
type Arm = 'overlay-on' | 'overlay-off';
function mkTrialDir(fixtureId: string, arm: Arm, n: number): string {
const dir = fs.mkdtempSync(
path.join(os.tmpdir(), `overlay-harness-${fixtureId}-${arm}-${n}-`),
);
return dir;
}
function saveRawTranscript(
fixtureId: string,
arm: Arm,
n: number,
result: AgentSdkResult,
): void {
fs.mkdirSync(TRANSCRIPTS_DIR, { recursive: true });
const out = path.join(TRANSCRIPTS_DIR, `${fixtureId}-${arm}-${n}.jsonl`);
const lines = result.events.map((e) => JSON.stringify(e));
fs.writeFileSync(out, lines.join('\n') + '\n');
}
function overlayContentFor(fixture: OverlayFixture): string {
const family = path.basename(fixture.overlayPath, '.md');
const resolved = readOverlay(family);
if (!resolved) {
throw new Error(
`fixture ${fixture.id}: resolver returned empty content for ${family}`,
);
}
return resolved;
}
// ---------------------------------------------------------------------------
// Per-fixture runner
// ---------------------------------------------------------------------------
interface ArmResult {
metrics: number[];
costs: number[];
durations: number[];
rateLimitExhausted: number;
sdkClaudeCodeVersions: Set<string>;
}
async function runArm(
fixture: OverlayFixture,
arm: Arm,
systemPrompt: SystemPromptOption,
claudeBinary: string | null,
): Promise<ArmResult> {
const result: ArmResult = {
metrics: [],
costs: [],
durations: [],
rateLimitExhausted: 0,
sdkClaudeCodeVersions: new Set(),
};
const trials = fixture.trials;
const concurrency = fixture.concurrency ?? 3;
// Simple bounded executor: run trials in chunks of `concurrency`.
// The process-level semaphore in agent-sdk-runner.ts enforces the true cap.
let nextTrial = 0;
const workers = Array.from({ length: concurrency }, async () => {
while (true) {
const n = nextTrial++;
if (n >= trials) return;
const dir = mkTrialDir(fixture.id, arm, n);
fixture.setupWorkspace(dir);
try {
const sdkResult = await runAgentSdkTest({
systemPrompt,
userPrompt: fixture.userPrompt,
workingDirectory: dir,
model: fixture.model,
maxTurns: fixture.maxTurns ?? 5,
allowedTools: fixture.allowedTools ?? ['Read', 'Glob', 'Grep', 'Bash'],
permissionMode: 'bypassPermissions',
settingSources: [],
env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY ?? '' },
pathToClaudeCodeExecutable: claudeBinary ?? undefined,
testName: `${fixture.id}-${arm}-${n}`,
runId,
fixtureId: fixture.id,
onRetry: (_) => {
// Reset the workspace before the retry so partial Bash side effects
// from the failed attempt don't contaminate.
fs.rmSync(dir, { recursive: true, force: true });
fs.mkdirSync(dir, { recursive: true });
fixture.setupWorkspace(dir);
},
});
saveRawTranscript(fixture.id, arm, n, sdkResult);
const metric = fixture.metric(sdkResult);
result.metrics.push(metric);
result.costs.push(sdkResult.costUsd);
result.durations.push(sdkResult.durationMs);
result.sdkClaudeCodeVersions.add(sdkResult.sdkClaudeCodeVersion);
evalCollector?.addTest({
name: `${fixture.id}-${arm}-${n}`,
suite: 'overlay-harness',
tier: 'e2e',
passed: true,
duration_ms: sdkResult.durationMs,
cost_usd: sdkResult.costUsd,
transcript: sdkResult.events,
prompt: fixture.userPrompt,
output: sdkResult.output,
turns_used: sdkResult.turnsUsed,
browse_errors: sdkResult.browseErrors,
exit_reason: sdkResult.exitReason,
model: sdkResult.model,
first_response_ms: sdkResult.firstResponseMs,
max_inter_turn_ms: sdkResult.maxInterTurnMs,
});
} catch (err) {
if (err instanceof Error && err.name === 'RateLimitExhaustedError') {
result.rateLimitExhausted++;
// Record a failed trial so the collector captures the attempt.
evalCollector?.addTest({
name: `${fixture.id}-${arm}-${n}`,
suite: 'overlay-harness',
tier: 'e2e',
passed: false,
duration_ms: 0,
cost_usd: 0,
exit_reason: 'rate_limit_exhausted',
error: err.message,
});
} else {
throw err;
}
} finally {
try {
fs.rmSync(dir, { recursive: true, force: true });
} catch {
// best-effort cleanup
}
}
}
});
await Promise.all(workers);
return result;
}
function mean(xs: number[]): number {
if (xs.length === 0) return 0;
return xs.reduce((a, b) => a + b, 0) / xs.length;
}
function sum(xs: number[]): number {
return xs.reduce((a, b) => a + b, 0);
}
// ---------------------------------------------------------------------------
// Test bodies
// ---------------------------------------------------------------------------
describeE2E('overlay efficacy harness (SDK)', () => {
// Resolve binary once
const claudeBinary = resolveClaudeBinary();
if (!claudeBinary) {
test.skip(
'no local `claude` binary on PATH — cannot pin for harness parity',
() => {},
);
return;
}
for (const fixture of OVERLAY_FIXTURES) {
test(
`${fixture.id}: overlay-ON vs overlay-OFF, N=${fixture.trials} per arm`,
async () => {
const overlayText = overlayContentFor(fixture);
expect(overlayText.length).toBeGreaterThan(100);
// Arm composition: both arms use the real Claude Code default system
// prompt (preset). Overlay-ON APPENDS the overlay text; overlay-OFF
// uses the default alone. This measures the overlay's marginal effect
// ON TOP of Claude Code's normal behavioral scaffolding — which is
// the only measurement that matches how real Claude Code composes
// overlays into its system prompt stack.
const [onArm, offArm] = await Promise.all([
runArm(
fixture,
'overlay-on',
{ type: 'preset', preset: 'claude_code', append: overlayText },
claudeBinary,
),
runArm(
fixture,
'overlay-off',
{ type: 'preset', preset: 'claude_code' },
claudeBinary,
),
]);
const arms = {
overlay: onArm.metrics,
off: offArm.metrics,
};
const meanOn = mean(arms.overlay);
const meanOff = mean(arms.off);
const lift = meanOn - meanOff;
const floorHits = arms.overlay.filter((n) => n >= 2).length;
const totalCost = sum(onArm.costs) + sum(offArm.costs);
const versionSet = new Set([
...onArm.sdkClaudeCodeVersions,
...offArm.sdkClaudeCodeVersions,
]);
// Loud output for the next person reading the eval JSON:
// eslint-disable-next-line no-console
console.log(
`\n[${fixture.id}]\n` +
` binary: ${claudeBinary}\n` +
` claude_code_version(s): ${[...versionSet].join(', ')}\n` +
` overlay-ON metrics: [${arms.overlay.join(', ')}] mean=${meanOn.toFixed(2)}\n` +
` overlay-OFF metrics: [${arms.off.join(', ')}] mean=${meanOff.toFixed(2)}\n` +
` lift: ${lift.toFixed(2)} floor_hits(>=2): ${floorHits}/${fixture.trials}\n` +
` rate_limit_exhausted: on=${onArm.rateLimitExhausted} off=${offArm.rateLimitExhausted}\n` +
` total_cost_usd: $${totalCost.toFixed(4)}\n` +
` transcripts: ${TRANSCRIPTS_DIR}`,
);
// Demand enough trials actually completed to make the assertion
// meaningful. If rate-limit exhaustion took out more than half of an
// arm, fail loudly rather than pass/fail on a fragment.
const minTrials = Math.ceil(fixture.trials / 2);
expect(arms.overlay.length).toBeGreaterThanOrEqual(minTrials);
expect(arms.off.length).toBeGreaterThanOrEqual(minTrials);
expect(fixture.pass(arms)).toBe(true);
},
30 * 60 * 1000, // 30 minute timeout per fixture
);
}
});
afterAll(async () => {
if (evalCollector) {
const filepath = await evalCollector.finalize();
// eslint-disable-next-line no-console
console.log(`\n[overlay-harness] eval results: ${filepath}`);
}
});