fix: browse binary discovery broken for agents (v0.3.5) (#44)

* fix: replace find-browse with direct path in SKILL.md setup blocks

  Agents were skipping the find-browse binary and guessing bin/browse (wrong path). Now the setup block explicitly checks browse/dist/browse with workspace-local priority, global fallback. Also adds || true to update check to prevent misleading exit code 1.

  Adds {{UPDATE_CHECK}} and {{BROWSE_SETUP}} template placeholders to gen-skill-docs.ts so all skills share a single source of truth.

* refactor: convert qa/ and setup-browser-cookies/ to .tmpl templates

  Replaces hardcoded update check and find-browse blocks with {{UPDATE_CHECK}} and {{BROWSE_SETUP}} placeholders. Both skills are now generated from templates via gen-skill-docs.

* test: add e2e and LLM eval tests for SKILL.md setup block

  - 3 Agent SDK e2e tests: happy path, NEEDS_SETUP, non-git-repo
  - LLM eval: setup block clarity + actionability >= 4
  - New error pattern: 'no such file or directory.*browse'

  These tests catch the exact failure mode where agents can't discover the browse binary via SKILL.md instructions.

* chore: bump version and changelog (v0.3.5)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-20 19:29:56 +08:00 · 2026-03-14 00:24:06 -07:00
parent 6b69c46a27
commit 1717ed2891
15 changed files with 627 additions and 47 deletions
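
The lookup order described in the commit message (workspace-local browse/dist/browse first, global install as fallback, NEEDS_SETUP when neither exists) can be sketched roughly as below. This is an illustration only, not the shell block that SKILL.md actually ships: the function name `findBrowse`, the `READY` status, the injected `exists` check, and the example directories are all assumptions made for the sketch. A `null` workspace root stands in for the non-git-repo case.

```typescript
// Hypothetical sketch of the discovery order. Only the relative path
// browse/dist/browse and the NEEDS_SETUP state come from the commit;
// every other name here is an assumption for illustration.
type Exists = (path: string) => boolean;

interface BrowseLookup {
  status: 'READY' | 'NEEDS_SETUP'; // 'READY' is an assumed label
  binary: string | null;
}

function findBrowse(
  workspaceRoot: string | null, // null when not inside a git repo
  globalRoot: string,           // assumed location of the global install
  exists: Exists,               // injected so the logic is testable without a filesystem
): BrowseLookup {
  const relative = 'browse/dist/browse';
  const candidates: string[] = [];

  // Workspace-local copy takes priority when a workspace is known.
  if (workspaceRoot !== null) {
    candidates.push(`${workspaceRoot}/${relative}`);
  }
  // Global install is the fallback.
  candidates.push(`${globalRoot}/${relative}`);

  for (const candidate of candidates) {
    if (exists(candidate)) {
      return { status: 'READY', binary: candidate };
    }
  }
  // Neither location has the binary: the agent should be told to run setup.
  return { status: 'NEEDS_SETUP', binary: null };
}

// Usage with a fake filesystem in which only the global copy is installed.
const installed = new Set(['/home/me/skills/browse/dist/browse']);
const result = findBrowse('/repo/skills', '/home/me/skills', (p) => installed.has(p));
console.log(result.status, result.binary);
```

Passing the existence check in as a parameter keeps the three cases named in the test list (happy path, NEEDS_SETUP, non-git-repo) easy to exercise without touching disk.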
--- a/test/skill-llm-eval.test.ts
+++ b/test/skill-llm-eval.test.ts
@@ -115,6 +115,19 @@ describeEval('LLM-as-judge quality evals', () => {
    expect(scores.actionability).toBeGreaterThanOrEqual(4);
  }, 30_000);

+  test('setup block scores >= 4 on actionability and clarity', async () => {
+    const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    const setupStart = content.indexOf('## SETUP');
+    const setupEnd = content.indexOf('## IMPORTANT');
+    const section = content.slice(setupStart, setupEnd);
+
+    const scores = await judge('setup/binary discovery instructions', section);
+    console.log('Setup block scores:', JSON.stringify(scores, null, 2));
+
+    expect(scores.actionability).toBeGreaterThanOrEqual(4);
+    expect(scores.clarity).toBeGreaterThanOrEqual(4);
+  }, 30_000);
+
  test('regression check: compare branch vs baseline quality', async () => {
    // This test compares the generated output against the hand-maintained
    // baseline from main. The generated version should score equal or higher.