hai/gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-18 02:22:04 +08:00

Files

Garry Tan 063a99809a docs(designs): add BROWSER_SKILLS_V1 design doc

Captures the 13 locked decisions, two-axis trust model (daemon-side
scoped tokens + process-side env access), 3-tier lookup, file
layout, and full responses to all 8 Codex outside-voice findings.
Includes Phase 2-4 sketches for future branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 05:09:44 -07:00

15 KiB

Raw Blame History

Browser-Skills v1 — codifying repeated browser flows

Status: Phase 1 shipped on garrytan/browserharness. Phases 2-4 enumerated below. Last updated: 2026-04-26 Authors: garrytan (with /plan-eng-review and /codex outside-voice review)

What this is

Browser-skills are per-task directories that codify a repeated browser flow into a deterministic Playwright script. Each skill has:

browser-skills/<name>/
├── SKILL.md                        # frontmatter + prose contract
├── script.ts                       # deterministic logic
├── _lib/browse-client.ts           # vendored copy of the SDK
├── fixtures/<host>-<date>.html     # captured page for tests
└── script.test.ts                  # parser tests against the fixture

A user (or, in Phase 2, an agent that just got a flow right) creates a skill once. Future invocations run the script, returning JSON in 200ms instead of the 30 seconds an agent would burn re-exploring via $B primitives.

The shipped reference is hackernews-frontpage: scrapes the HN front page, returns 30 stories as JSON. Try $B skill list and $B skill run hackernews-frontpage.

Why this is different from domain-skills (v1.8.0.0)

Domain-skills = "agent remembers facts about a site." JSONL notes keyed by hostname, injected into prompts at session start. State machine handles quarantine → active → global promotion.
Browser-skills = "agent codifies procedures into deterministic scripts." Per-task directories, executed via $B skill run, scoped tokens at the daemon for per-spawn capability isolation.

Both use the same mental model (per-host, three-tier scoping). The procedure layer is where the bigger productivity gain lives because it pushes scraping and form automation out of latent space and into reproducible code.

Why this is not the existing P1 ("self-authoring `$B` commands")

The original P1 was blocked on Codex's T1 objection: agent-authored TypeScript cannot run safely inside the daemon (ambient globals, constructor gadgets, top-level-await TOCTOU between approval and execution). The right design was "out-of-process worker isolation with capability-passing IPC." That's a hard project that may never ship.

Browser-skills sidestep the entire problem by running scripts outside the daemon as standalone Bun processes. The daemon never imports or evals skill code. Skills talk to the daemon over loopback HTTP — same wire format any external client would use.

The plan as approved replaces the existing P1.

Phasing

Phase	Branch	Scope
1	`garrytan/browserharness` (this)	SDK, storage, `$B skill list/run/show/test/rm` subcommands, scoped-token model, bundled `hackernews-frontpage` reference. Shipped.
2	new (`browser-skills-scrape-automate`)	`/scrape` and `/automate` skill templates that prototype a flow then offer the skillify approval gate.
3	new (`browser-skills-resolver`)	Resolver injection at session start (per-host browser-skill discovery). Mirrors domain-skill injection. `gstack-config browser_skillify_prompts` knob.
4	new	Eval test infrastructure (LLM-judge), fixture-staleness detection, periodic re-validation against live pages.

Phase 1 architecture

Decisions locked (13)

Phase 1 = full storage + SDK + subcommands + bundled reference. No agent authoring yet. Phase 2 lands /scrape and /automate.
Two verbs in Phase 2: /scrape (read-only) and /automate (mutating). They share skillify approval-gate machinery but live as separate skill templates.
Replaces the existing self-authoring-$B P1 in TODOS.md. Same user-visible goal, no in-daemon isolation problem.
SDK distribution: sibling file inside each skill (Option E). The canonical SDK lives at browse/src/browse-client.ts (~250 LOC). Each skill ships a copy at <skill>/_lib/browse-client.ts. Phase 2's generator copies the current SDK alongside every generated script. Each skill is fully self-contained: copy the directory anywhere, it runs. Version drift impossible (the SDK is frozen at the version the skill was authored against). Disk cost: ~3KB per skill.
Three-tier lookup: bundled → global → project. Bundled skills ship read-only with the gstack install (<gstack-install>/browser-skills/<name>/). Global at ~/.gstack/browser-skills/<name>/. Per-project at <project>/.gstack/browser-skills/<name>/. Lookup walks tiers in priority order project → global → bundled; first hit wins. $B skill list prints the resolved tier alongside each skill name so "why did it run that one?" is never a debugging mystery.
Trust model: scoped tokens at spawn time, NOT env-scrub-as-sandbox. See "Trust model" below. (Revised from original env-scrub plan after Codex flagged it as security theater.)
Single source of truth: SKILL.md frontmatter only. No meta.json. Frontmatter holds host, triggers, args, version, source, trusted. SHA256/staleness deferred to Phase 4 as a separate .checksum sidecar if it lands at all.
No INDEX.json. Walk the directory. $B skill list enumerates the three tiers and parses each SKILL.md frontmatter. ~5-10ms for 50 skills. Eliminates the entire "index drifted from disk" bug class.
$B skill run output protocol. stdout = JSON. stderr = streaming logs. Exit 0 / nonzero. Default 60s timeout, override via --timeout=Ns. Max stdout 1MB (truncate + nonzero exit if exceeded). Matches gh / kubectl / docker conventions.
Fixture replay: two patterns for two test types. SDK unit test stands up an in-test mock HTTP server. End-to-end skill tests parse bundled HTML fixtures via the script's exported parser function (no daemon required). Phase 1 fixture-only is adequate for hackernews-frontpage; Phase 2 /automate will need richer fixtures.
Reference skill: hackernews-frontpage. Scrapes HN front page (titles, points, comments). No auth, stable HTML, ideal fixture-test target.
Token/port discovery: scoped-token env-only for spawned skills; state-file fallback for standalone debug runs. When spawned via $B skill run, the SDK reads GSTACK_PORT + GSTACK_SKILL_TOKEN from env. For standalone bun run script.ts, the SDK falls back to <project>/.gstack/browse.json (the actual state-file path per config.ts:50).
CHANGELOG honesty. Phase 1 lead: humans can hand-write deterministic browser scripts that gstack runs. Phase 1 explicitly notes that agent authoring lands in next release. No fabricated perf numbers — Phase 1 has no before/after.

Trust model (decision #6 in detail)

Two orthogonal axes:

Axis	Mechanism	Default
Daemon-side capability	Per-spawn scoped token bound to `read+write` scope (the 17-cmd browser-driving surface, minus admin commands like `eval`/`js`/`cookies`/`storage`). Single-use clientId encodes skill name + spawn id. Revoked when the spawn exits.	Always scoped (never the daemon root token).
Process-side env access	SKILL.md frontmatter `trusted: true` passes `process.env` minus `GSTACK_TOKEN`. `trusted: false` (default) drops everything except a minimal allowlist (LANG, LC_ALL, TERM, TZ, locked PATH) and explicitly strips secret-pattern keys (TOKEN/KEY/SECRET/PASSWORD, AWS_, AZURE_, GCP_, ANTHROPIC_, OPENAI_, GITHUB_, etc.).	Untrusted (must opt in).

GSTACK_PORT and GSTACK_SKILL_TOKEN are always injected last so a parent process cannot override them by setting them in env.

What this gets right: the daemon-side scoped token is enforceable by the daemon. A skill that tries to call eval (admin scope) gets a 403 even though the SDK exposes it. The capability boundary is in the right place.

What this does NOT close: Bun has no built-in FS sandbox. An untrusted skill can still import 'fs' and read whatever the OS user can read (e.g. ~/.ssh/id_rsa). The env scrub is hygiene, not a sandbox. OS-level isolation (sandbox-exec, namespaces) is Phase 4 work and drops in cleanly behind the existing trusted/untrusted contract.

The original plan called env-scrub a sandbox. Codex correctly flagged that as theater. The revised plan calls it what it is: best-effort hygiene plus defense-in-depth, with the real boundary at the daemon-side scoped token.

File layout

browse/src/
├── browse-client.ts                # canonical SDK (~250 LOC)
├── browser-skills.ts               # 3-tier walk + frontmatter parser + tombstones
├── browser-skill-commands.ts       # $B skill list/show/run/test/rm + spawnSkill
└── skill-token.ts                  # mintSkillToken / revokeSkillToken wrappers

browser-skills/
└── hackernews-frontpage/           # bundled reference skill
    ├── SKILL.md
    ├── script.ts
    ├── _lib/browse-client.ts        # byte-identical copy of canonical
    ├── fixtures/hn-2026-04-26.html
    └── script.test.ts

browse/test/
├── skill-token.test.ts              # mint/revoke lifecycle, scope assertions
├── browse-client.test.ts            # mock HTTP server, wire format, auth
├── browser-skills-storage.test.ts   # 3-tier walk, frontmatter, tombstones
└── browser-skill-commands.test.ts   # parseRunArgs, dispatch, env scrub, spawn

test/skill-validation.test.ts       # extended: bundled-skill contract checks

What does NOT change

Domain-skills storage, state machine, or injection. Untouched.
Tunnel-surface allowlist (server.ts:118-123). Same 17 commands.
L1-L6 security stack. Browser-skills don't inject text into prompts in Phase 1; Phase 3's resolver injection will ride the existing UNTRUSTED envelope.
The cli.ts HTTP client at sendCommand(). The SDK is a separate module with a different concern (library vs CLI process).

Codex outside-voice findings (post-review responses)

The /codex review flagged 8 findings. The plan addresses them as follows:

#	Finding	Phase 1 response
1	Trust model is fake without FS sandbox	Closed by decision #6 (scoped tokens) above.
2	Phase 1 is overbuilt for one bundled skill (lookup tiers, tombstones, etc.)	Acknowledged but kept. User chose full Phase 1 to lock the architecture before Phase 2 lands agent authoring. Each subsystem is small enough to remove cleanly if data later says it's unused.
3	Existing client pattern in `cli.ts:398` may make sibling SDK redundant	Verified false. Line 398 is the end of `extractTabId()` (a flag-parser). The actual HTTP client is `sendCommand()` at cli.ts:401-467, but it's CLI-coupled (`process.stdout.write`, `process.exit`, server-restart recovery). Not reusable as a library. The new `browse-client.ts` mirrors its wire format but is library-shaped.
4	"First hit wins" lookup is opaque	Mitigated by listing the resolved tier inline in `$B skill list` and `$B skill show`. Future: optional `--source bundled\|global\|project` flag if the tier override proves confusing.
5	Atomic skill packaging matters more than the index question; symlink defenses	Closed for Phase 1: bundled skills ship as part of the gstack install (no live writes; atomic by virtue of being read-only files in the install dir). Phase 2's `writeBrowserSkill` will write to a temp dir then rename, and use `realpath`/`lstat` discipline (existing `browse/src/path-security.ts`).
6	Phase 2 synthesis from activity feed is weak (lossy ring buffer)	Open issue for Phase 2 design. The activity feed is telemetry, not a replay IR. Phase 2 will need a structured recorder OR re-prompting the agent to write the script from scratch using its own context. Decide in Phase 2's design pass.
7	Bun runtime regression: skill scripts as standalone Bun reintroduce a Bun runtime requirement	Open issue for Phase 2 distribution. Phase 1 sidesteps this because the bundled reference skill ships inside the gstack install (which already builds with Bun). Phase 2 needs to decide between (a) shipping a Bun binary with each generated skill, (b) compiling skills to self-contained executables, or (c) using Node.js with `cli.ts`'s HTTP pattern.
8	`file://` fixtures don't prove timing/auth/navigation/lazy hydration	Documented limit. Adequate for `hackernews-frontpage`. Phase 2 `/automate` will need richer fixtures (mock daemon with timing, recorded HAR replay, etc.).

Phase 2 sketch (for reference)

Two skill templates: /scrape and /automate. Both run a prototype flow via $B primitives, get the user's "looks right" signal, then offer the skillify approval gate that writes a Phase-1-shaped browser-skill to disk.

Open design questions deferred to Phase 2:

Where do user-authored skills live by default — global or per-project? Lean global for procedures, with per-project override (mirrors domain-skill scope). Phase 1 storage helpers already support both lookup paths.
How does the agent synthesize the script? Codex finding #6: the activity feed is lossy. Options: (a) structured recorder that captures full $B invocations to a separate buffer; (b) re-prompt the agent to write from scratch using its own context, with the activity feed as a reference.
Bun runtime distribution. Codex finding #7. Options: (a) ship Bun binary with each skill; (b) compile each skill to a self-contained binary; (c) use Node + the existing cli.ts pattern.

Phase 3 sketch

Resolver injection at session start. Mirror the domain-skill injection at server.ts:722-743:

const browserSkillsBlock = await renderBrowserSkillsForHost(hostname, projectSlug);
if (browserSkillsBlock) {
  systemPrompt += `\n\n${browserSkillsBlock}`;
}

renderBrowserSkillsForHost() reads the 3 tiers, filters to skills whose host field matches, and emits an UNTRUSTED-wrapped block listing them.

gstack-config browser_skillify_prompts (default off): when on, end-of-task nudges in /qa, /design-review, etc. fire when activity feed shows ≥N commands on a single host AND no skill exists yet for that host+intent.

Phase 4 sketch

LLM-judge eval ("did the agent reach for the skill instead of re-exploring?").
Fixture-staleness detection — compare bundled fixture against live page.
OS-level FS sandbox for untrusted spawns (sandbox-exec on macOS, namespaces / seccomp on Linux).
$B skill upgrade <name> — regenerate the sibling SDK copy when the canonical SDK changes.

Verification (Phase 1)

bun test passes the new test files:

browse/test/skill-token.test.ts — 15 assertions
browse/test/browse-client.test.ts — 26 assertions
browse/test/browser-skills-storage.test.ts — 31 assertions
browse/test/browser-skill-commands.test.ts — 29 assertions
browser-skills/hackernews-frontpage/script.test.ts — 13 assertions
test/skill-validation.test.ts — 7 new bundled-skill assertions

End-to-end with the daemon running:

$B skill list                            # shows hackernews-frontpage (bundled)
$B skill show hackernews-frontpage       # prints SKILL.md
$B skill run hackernews-frontpage        # returns JSON of 30 stories
$B skill test hackernews-frontpage       # runs script.test.ts

15 KiB Raw Blame History