feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
(e.g. literal-interpretation 'fix the failing tests' needs write access).
Per-fixture maxTurns lets harder prompts run longer without changing the
default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
- bashToolCallCount — count of Bash tool_use across the session
- turnsToCompletion — SDK-reported num_turns at result
- uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
and fixture.maxTurns through runArm.
This commit is contained in:
Garry Tan
2026-04-23 11:07:37 -07:00
parent f3b40b12d7
commit 76829b76dc
2 changed files with 72 additions and 2 deletions

View File

@@ -141,8 +141,8 @@ async function runArm(
userPrompt: fixture.userPrompt,
workingDirectory: dir,
model: fixture.model,
maxTurns: 5,
allowedTools: ['Read', 'Glob', 'Grep', 'Bash'],
maxTurns: fixture.maxTurns ?? 5,
allowedTools: fixture.allowedTools ?? ['Read', 'Glob', 'Grep', 'Bash'],
permissionMode: 'bypassPermissions',
settingSources: [],
env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY ?? '' },