mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-18 18:32:28 +08:00
feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction
Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write (e.g. literal-interpretation 'fix the failing tests' needs write access). Per-fixture maxTurns lets harder prompts run longer without changing the default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct: 20% lift threshold vs baseline
- bashToolCallCount: count of Bash tool_use across the session
- turnsToCompletion: SDK-reported num_turns at result
- uniqueFilesEdited: Edit/Write/MultiEdit file_path set size

test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools and fixture.maxTurns through runArm.
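The commit message names the predicates and metrics but does not show their bodies, so the following is only a rough sketch of what they might look like. The `(baseline, treatment)` signature, the inclusive thresholds, and the `ToolUse` shape are all assumptions made for illustration; the real helpers may aggregate over multiple runs or read the SDK's own message types.

```typescript
// Hypothetical signature: the commit does not show how the predicates receive values.
type Predicate = (baseline: number, treatment: number) => boolean;

// Passes when the treatment value is at least 20% below the baseline
// (inclusive boundary is an assumption).
const lowerIsBetter20Pct: Predicate = (baseline, treatment) =>
  treatment <= baseline * 0.8;

// Passes when the treatment value is at least 20% above the baseline.
const higherIsBetter20Pct: Predicate = (baseline, treatment) =>
  treatment >= baseline * 1.2;

// Assumed minimal shape of a tool_use record; not the SDK's real message type.
interface ToolUse {
  name: string;
  input: { file_path?: string };
}

// Count Bash tool calls across a session.
function bashToolCallCount(calls: ToolUse[]): number {
  return calls.filter((c) => c.name === 'Bash').length;
}

// Number of distinct file paths touched by Edit, Write, or MultiEdit calls.
function uniqueFilesEdited(calls: ToolUse[]): number {
  const editTools = new Set(['Edit', 'Write', 'MultiEdit']);
  const paths = new Set<string>();
  for (const c of calls) {
    if (editTools.has(c.name) && c.input.file_path) {
      paths.add(c.input.file_path);
    }
  }
  return paths.size;
}
```

Under these assumptions, a fixture whose nudge should reduce shell usage would pair `bashToolCallCount` with `lowerIsBetter20Pct`, while one that should encourage edits would pair `uniqueFilesEdited` with `higherIsBetter20Pct`.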
@@ -141,8 +141,8 @@ async function runArm(
     userPrompt: fixture.userPrompt,
     workingDirectory: dir,
     model: fixture.model,
-    maxTurns: 5,
-    allowedTools: ['Read', 'Glob', 'Grep', 'Bash'],
+    maxTurns: fixture.maxTurns ?? 5,
+    allowedTools: fixture.allowedTools ?? ['Read', 'Glob', 'Grep', 'Bash'],
     permissionMode: 'bypassPermissions',
     settingSources: [],
     env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY ?? '' },
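The fallback behaviour in this hunk can be sketched in isolation. `OverlayFixtureSketch`, `resolveArmOptions`, and the two constants are hypothetical names introduced only for this example; the real `OverlayFixture` type has more fields, and `runArm` inlines the expressions rather than calling a helper.

```typescript
// Hypothetical subset of the fixture type; only the two new optional fields.
interface OverlayFixtureSketch {
  allowedTools?: string[];
  maxTurns?: number;
}

// Defaults taken from the values in the diff.
const DEFAULT_TOOLS = ['Read', 'Glob', 'Grep', 'Bash'];
const DEFAULT_MAX_TURNS = 5;

// Mirrors the `??` fallbacks in runArm: use the fixture's value when it is
// set, otherwise fall back to the previous hard-coded default.
function resolveArmOptions(fixture: OverlayFixtureSketch) {
  return {
    maxTurns: fixture.maxTurns ?? DEFAULT_MAX_TURNS,
    allowedTools: fixture.allowedTools ?? DEFAULT_TOOLS,
  };
}

// A fixture that needs write access and a longer run.
const writeFixture = resolveArmOptions({
  allowedTools: ['Read', 'Glob', 'Grep', 'Bash', 'Edit', 'Write'],
  maxTurns: 12,
});

// A fixture that sets neither field keeps the old behaviour.
const defaultFixture = resolveArmOptions({});
```

Because `??` only falls back on `null` or `undefined`, a fixture that sets `allowedTools: []` gets an empty allowlist rather than the default, which would not be the case with `||` for a falsy value such as `maxTurns: 0`.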