mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-08 13:39:45 +08:00
docs(browser-skills): TODOS Phase 2a + design doc D1-D4 decisions
TODOS.md:
- Narrows existing P1 (was "/scrape and /automate") to "/scrape and /skillify"; the /scrape + /skillify wedge ships in this branch. Codex finding #6 (synthesis) removed from Cons (resolved by D2); finding #7 (Bun runtime) stays as the open carry-over.
- Adds new ## P0 above PACING_UPDATES_V0 for the /automate follow-up. Same skillify pattern as /scrape, different trust profile (per-step confirmation gate when running non-codified). Reuses /skillify and the D3 helper as-is. Effort M.

BROWSER_SKILLS_V1.md:
- Phase table re-organized into 1, 2a, 2b, 3, 4. Phase 1 + Phase 2a consolidate into v1.19.0.0 ship (the v1.16.0.0 branch-internal bump never landed on main).
- New "Phase 2a" sub-section captures the four decisions locked during /plan-eng-review:
  - D1: provenance guard (≤10 turn walk-back, refuse if cold)
  - D2: synthesis input slice (final-attempt $B calls only, closes Codex finding #6)
  - D3: atomic write discipline (temp-dir-then-rename via new browse/src/browser-skill-write.ts helper)
  - D4: full test scope (5 gate E2E + 1 unit + smoke)
- New "Phase 2b" sketch for /automate: same skillify machinery, per-mutating-step confirmation gate, deferred to next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
36
TODOS.md
@@ -2,21 +2,21 @@
|
||||
|
||||
## Browser-skills follow-on (Phases 2-4)
|
||||
|
||||
### P1: Browser-skills Phase 2 — `/scrape` and `/automate` skill templates
|
||||
### P1: Browser-skills Phase 2 — `/scrape` and `/skillify` skill templates
|
||||
|
||||
**What:** Phase 2 of the browser-skills design (`docs/designs/BROWSER_SKILLS_V1.md`). Two new gstack skills (`/scrape` for read-only flows, `/automate` for mutating flows) that prototype a flow via `$B` primitives, then offer the skillify approval gate that writes a Phase-1-shaped browser-skill to disk. Each generated skill carries its own copy of `browse-client.ts` in `_lib/`.
|
||||
**What:** Phase 2a of the browser-skills design (`docs/designs/BROWSER_SKILLS_V1.md`). Two new gstack skills: `/scrape <intent>` (read-only) is the single entry point for pulling page data — first call prototypes via `$B` primitives, subsequent calls on a matching intent route to a codified browser-skill in ~200ms. `/skillify` codifies the most recent successful prototype into a permanent browser-skill on disk: synthesizes `script.ts` + `script.test.ts` + fixture from the agent's own context (final-attempt $B calls only), runs the test in a temp dir, asks before committing, atomic rename to `~/.gstack/browser-skills/<name>/`. The mutating-flow sibling `/automate` is split out as its own P0 (below) — same skillify pattern, different trust profile.
|
||||
|
||||
**Why:** Phase 1 shipped the runtime — humans can hand-write deterministic browser scripts that gstack runs. Phase 2 unlocks the productivity gain: an agent that gets a flow right once via 20+ `$B` commands says "skillify it" and the script becomes a 200ms call forever after. Same skillify pattern Garry's articles describe, applied to the two browser activities (scraping + automation) most amenable to deterministic compression.
|
||||
**Why:** Phase 1 shipped the runtime — humans can hand-write deterministic browser scripts that gstack runs. Phase 2a unlocks the productivity gain: an agent that gets a flow right once via 20+ `$B` commands says `/skillify` and the script becomes a 200ms call forever after. Same skillify pattern Garry's articles describe, applied to the read-only browser activity (scraping) most amenable to deterministic compression. Mutating actions ship next as `/automate` because the failure mode (unintended writes) needs stronger gates.
|
||||
|
||||
**Pros:** The 100x productivity gain lives here. Closes the loop: agents prototype, codify, then reach for the codified skill in future sessions instead of re-exploring. Replaces the original "self-authoring `$B` commands" P1 — same user-visible goal, no in-daemon isolation problem (skill scripts run as standalone Bun processes, never imported into the daemon).
|
||||
**Pros:** The 100x productivity gain lives here. Closes the loop: agents prototype, codify, then reach for the codified skill in future sessions instead of re-exploring. Replaces the original "self-authoring `$B` commands" P1 — same user-visible goal, no in-daemon isolation problem (skill scripts run as standalone Bun processes, never imported into the daemon). Synthesis question (Codex finding #6) is resolved by re-prompting from the agent's own conversation context (option b in the design doc), bounded to final-attempt `$B` calls per `/plan-eng-review` D2.
|
||||
|
||||
**Cons:** Two open design questions Phase 2 must close. (a) **How to synthesize the script.** The activity feed is lossy (Codex finding #6: in-memory ring buffer, redacted args, truncated results). Pick between a structured recorder OR re-prompting the agent to write from scratch using its own context. (b) **Bun runtime distribution** (Codex finding #7). Phase 1 sidesteps this because the bundled reference skill ships inside the gstack install. User-authored skills land on machines without Bun unless we ship a runtime alongside, compile to a self-contained binary, or use Node + the existing `cli.ts` pattern.
|
||||
**Cons:** **Bun runtime distribution** (Codex finding #7). Phase 1 sidesteps this because the bundled reference skill ships inside the gstack install. User-authored skills land on machines without Bun unless we ship a runtime alongside, compile to a self-contained binary, or use Node + the existing `cli.ts` pattern. Deferred to Phase 4 — `/skillify` documents the assumption that gstack is installed (which means Bun is on PATH).
|
||||
|
||||
**Context:** The Phase 1 architecture (3-tier lookup, scoped tokens, sibling SDK, frontmatter contract) is locked and exercised by the bundled `hackernews-frontpage` reference skill. Phase 2 plugs `/scrape` and `/automate` into that runtime — no new storage primitives needed.
|
||||
**Context:** The Phase 1 architecture (3-tier lookup, scoped tokens, sibling SDK, frontmatter contract) is locked and exercised by the bundled `hackernews-frontpage` reference skill. Phase 2a plugs `/scrape` and `/skillify` into that runtime via two skill templates plus one new helper (`browse/src/browser-skill-write.ts` for atomic temp-dir-then-rename per `/plan-eng-review` D3) — no new storage primitives.
|
||||
|
||||
**Effort:** L (human: ~1-2 weeks / CC: ~2-3 days)
|
||||
**Priority:** P1
|
||||
**Depends on:** Phase 1 shipped (this branch). Open: synthesis design + Bun runtime distribution.
|
||||
**Effort:** M (human: ~1 week / CC: ~1 day)
|
||||
**Priority:** P1 (this branch — `garrytan/browserharness` shipping as v1.19.0.0)
|
||||
**Depends on:** Phase 1 shipped (this branch).
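The temp-dir-then-rename flow this entry describes could look roughly like the sketch below. It is illustrative only: the function names `stageSkill`, `commitSkill`, and `discardStaged` come from the design doc's D3 decision, but their signatures, the `tmpRoot` and `files` parameters, and the path-escape check are assumptions, not the actual contents of `browse/src/browser-skill-write.ts`. The `$B skill test` run and the user approval prompt happen between staging and committing and are not modeled here.

```typescript
// Sketch under stated assumptions; the real helper module may differ.
import * as fs from "node:fs";
import * as path from "node:path";

/** Write synthesized files into a fresh staging dir: <tmpRoot>/skillify-<spawnId>/ */
function stageSkill(
  tmpRoot: string,
  spawnId: string,
  files: Record<string, string>, // hypothetical: relative file name -> contents
): string {
  const stagedDir = path.resolve(tmpRoot, `skillify-${spawnId}`);
  fs.mkdirSync(stagedDir, { recursive: true });
  for (const [name, body] of Object.entries(files)) {
    const target = path.resolve(stagedDir, name);
    // Refuse names such as "../x" that would land outside the staging dir.
    if (!target.startsWith(stagedDir + path.sep)) {
      throw new Error(`file escapes staging dir: ${name}`);
    }
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.writeFileSync(target, body);
  }
  return stagedDir;
}

/** Move a staged skill into its final path. rename() is atomic only within one filesystem. */
function commitSkill(stagedDir: string, finalDir: string): void {
  // lstat does not follow symlinks, so a symlinked staging path is rejected.
  if (!fs.lstatSync(stagedDir).isDirectory()) {
    throw new Error(`staged path is not a plain directory: ${stagedDir}`);
  }
  if (fs.existsSync(finalDir)) {
    throw new Error(`skill already exists: ${finalDir}`);
  }
  fs.mkdirSync(path.dirname(finalDir), { recursive: true });
  fs.renameSync(stagedDir, finalDir);
}

/** Remove a staged skill entirely (test failure or approval rejection). */
function discardStaged(stagedDir: string): void {
  fs.rmSync(stagedDir, { recursive: true, force: true });
}
```

A caller would stage, run the test against the returned directory, ask the user, and then call either `commitSkill` or `discardStaged`, so a rejected skill never appears under the final tier path.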
|
||||
|
||||
---
|
||||
|
||||
@@ -246,6 +246,24 @@ scope of that PR; deliberately deferred to keep PTY-import small.
|
||||
**Priority:** P3 (nice-to-have, not blocking anyone yet)
|
||||
**Depends on:** `/context-save` + `/context-restore` rename stable in production (v1.0.1.0+). Research: does Conductor expose a spawn-workspace CLI?
|
||||
|
||||
## P0: Browser-skills Phase 2 follow-up — `/automate` skill
|
||||
|
||||
**What:** The mutating-flow sibling of `/scrape` (Phase 2b). `/automate <intent>` codifies form fills, click sequences, and multi-step interactions into permanent browser-skills. Reuses Phase 2a's skillify machinery (`/skillify` is shared) and the D3 atomic-write helper. Adds: per-mutating-step UNTRUSTED-wrapped summary + `AskUserQuestion` confirmation gate when running non-codified (codified skills run unattended after the initial human approval). Defaults to `trusted: false` per Phase 1 — env-scrubbed spawn, scoped-token capability, no admin scope.
|
||||
|
||||
**Why:** Read-only scraping is the safer wedge to validate the skillify pattern (failure mode: wrong data = benign). Mutating actions are the other half of the 100x productivity gain — agents that codify "log into example.com → click Settings → toggle X" save real time on every future session. Splitting from Phase 2a means we ship the productivity loop first, validate the architecture, then add the higher-trust surface with confidence.
|
||||
|
||||
**Pros:** Unlocks deterministic automation authoring without self-authoring safety concerns — Phase 1's scoped-token model applies equally to mutating skills. The codified script enumerates exactly which `$B click`/`$B fill`/`$B type` calls run; nothing else is possible at runtime. Reuses 100% of `/skillify`, the D3 helper, and the storage tier. Per-step confirmation gate surfaces the actions to the user before they run for the first time.
|
||||
|
||||
**Cons:** Mutating intents have higher blast radius (the wrong selector clicks "Delete Account" instead of "Delete Comment"). Phase 4 OS-level FS sandbox is a stronger answer; until then, the user trust burden is real. Confirmation-gate UX needs care — too many prompts and users hit "yes" reflexively. Mitigation: only gate first-run; after `/skillify` codifies, the skill runs unattended.
|
||||
|
||||
**Context:** Original Phase 2 plan in `docs/designs/BROWSER_SKILLS_V1.md` bundled `/scrape` + `/automate`. Split during the v1.19.0.0 plan review (`/plan-eng-review` on `garrytan/browserharness`) — the user's source doc framed both as primary, but in practice scraping is where users start because the failure mode is benign. Ship `/scrape` + `/skillify` first (this branch), validate the skillify pattern works, then `/automate` lands on top of the same machinery.
|
||||
|
||||
**Effort:** M (human: ~3-5 days / CC: ~1 day)
|
||||
**Priority:** P0 (next branch after v1.19.0.0)
|
||||
**Depends on:** Phase 2a (`/scrape` + `/skillify`) shipped at v1.19.0.0. The D3 atomic-write helper (`browse/src/browser-skill-write.ts`) and the bundled SDK pattern are reused as-is.
|
||||
|
||||
---
|
||||
|
||||
## P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1)
|
||||
|
||||
**What:** Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in `docs/designs/PACING_UPDATES_V0.md`. Requires: session-state model, `phase` field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), `bin/gstack-flip-decision` command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values.
|
||||
|
||||
@@ -59,10 +59,11 @@ The plan as approved replaces the existing P1.
|
||||
|
||||
| Phase | Branch | Scope |
|
||||
|-------|--------|-------|
|
||||
| **1** | `garrytan/browserharness` (this) | SDK, storage, `$B skill list/run/show/test/rm` subcommands, scoped-token model, bundled `hackernews-frontpage` reference. **Shipped.** |
|
||||
| **2** | new (`browser-skills-scrape-automate`) | `/scrape` and `/automate` skill templates that prototype a flow then offer the skillify approval gate. |
|
||||
| **1** | `garrytan/browserharness` | SDK, storage, `$B skill list/run/show/test/rm` subcommands, scoped-token model, bundled `hackernews-frontpage` reference. **Shipped (v1.19.0.0, consolidated with Phase 2a).** |
|
||||
| **2a** | `garrytan/browserharness` (continues) | `/scrape <intent>` (read-only, single entry point with match/prototype paths) + `/skillify` (codifies prototype into permanent skill). Adds `browse/src/browser-skill-write.ts` D3 atomic-write helper. **Shipping v1.19.0.0.** |
|
||||
| **2b** | new (`browser-skills-automate`) | `/automate` skill template (mutating-flow sibling of `/scrape`). Reuses `/skillify` and the D3 helper. Per-mutating-step confirmation gate when running non-codified. P0 in TODOS. |
|
||||
| **3** | new (`browser-skills-resolver`) | Resolver injection at session start (per-host browser-skill discovery). Mirrors domain-skill injection. `gstack-config browser_skillify_prompts` knob. |
|
||||
| **4** | new | Eval test infrastructure (LLM-judge), fixture-staleness detection, periodic re-validation against live pages. |
|
||||
| **4** | new | Eval test infrastructure (LLM-judge), fixture-staleness detection, periodic re-validation against live pages, OS-level FS sandbox for untrusted spawns. |
|
||||
|
||||
---
|
||||
|
||||
@@ -205,24 +206,40 @@ The /codex review flagged 8 findings. The plan addresses them as follows:
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 sketch (for reference)
|
||||
## Phase 2a — `/scrape` + `/skillify` (shipping v1.19.0.0)
|
||||
|
||||
Two skill templates: `/scrape` and `/automate`. Both run a prototype flow via
|
||||
`$B` primitives, get the user's "looks right" signal, then offer the skillify
|
||||
approval gate that writes a Phase-1-shaped browser-skill to disk.
|
||||
Two skill templates plus one helper module. `/scrape <intent>` is the single
|
||||
entry point for pulling page data; first call on a new intent prototypes via
|
||||
`$B` primitives and returns JSON, subsequent calls on a matching intent route
|
||||
to a codified browser-skill in ~200ms. `/skillify` codifies the most recent
|
||||
successful prototype into a permanent browser-skill on disk. Mutating-flow
|
||||
sibling `/automate` deferred to Phase 2b (P0 in TODOS).
|
||||
|
||||
Open design questions deferred to Phase 2:
|
||||
### Decisions locked during the v1.19.0.0 plan review (`/plan-eng-review`)
|
||||
|
||||
- **Where do user-authored skills live by default — global or per-project?**
|
||||
Lean global for procedures, with per-project override (mirrors domain-skill
|
||||
scope). Phase 1 storage helpers already support both lookup paths.
|
||||
- **How does the agent synthesize the script?** Codex finding #6: the activity
|
||||
feed is lossy. Options: (a) structured recorder that captures full $B
|
||||
invocations to a separate buffer; (b) re-prompt the agent to write from
|
||||
scratch using its own context, with the activity feed as a reference.
|
||||
- **Bun runtime distribution.** Codex finding #7. Options: (a) ship Bun
|
||||
binary with each skill; (b) compile each skill to a self-contained binary;
|
||||
(c) use Node + the existing `cli.ts` pattern.
|
||||
| ID | Decision | Locked behavior |
|
||||
|----|----------|-----------------|
|
||||
| **D1** | `/skillify` provenance guard | Walk back ≤10 agent turns looking for a clearly-bounded `/scrape` invocation (the prototype's intent line + its trailing JSON output). If not found, refuse with: *"No recent /scrape result found in this conversation. Run /scrape <intent> first, then say /skillify."* No silent fallback. |
|
||||
| **D2** | Synthesis input slice | Template instructs the agent to extract ONLY the final-attempt `$B` calls that produced the JSON the user accepted, plus the user's stated intent string. Drop failed selector attempts, drop unrelated chat, drop earlier-session content. Closes Codex finding #6 by picking option (b) (re-prompt from agent's own context, not a structured recorder). |
|
||||
| **D3** | Atomic write discipline | `/skillify` writes to `~/.gstack/.tmp/skillify-<spawnId>/`, runs `$B skill test` against the temp dir, and only renames into the final tier path on success + user approval. On test failure or approval rejection: `rm -rf` the temp dir entirely (no tombstone for never-approved skills). New module `browse/src/browser-skill-write.ts` (`stageSkill` / `commitSkill` / `discardStaged`) with `realpath`/`lstat` discipline per Codex finding #5. |
|
||||
| **D4** | Test scope | 5 gate-tier E2E (scrape match, scrape prototype, skillify happy, skillify provenance refusal, approval-gate reject) + 1 unit test (atomic-write helper failure cleanup) + 1 hand-verified smoke (mutating-intent refusal). Registered in `test/helpers/touchfiles.ts`. |
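The D1 provenance guard can be pictured as a bounded backwards scan over the conversation. The sketch below is a hypothetical model, not the template's actual logic: the `Turn` shape, the regex used to spot a `/scrape` intent line, and the rule that "trailing JSON output" means a later turn whose text parses as a JSON object or array are all assumptions. It also counts every turn toward the 10-turn limit, which is a simplification of "agent turns". Only the limit and the refusal message are taken from the table above.

```typescript
// Hypothetical transcript shape; the design doc does not specify one.
interface Turn {
  role: "user" | "assistant";
  text: string;
}

const MAX_WALK_BACK = 10;
const REFUSAL =
  "No recent /scrape result found in this conversation. " +
  "Run /scrape <intent> first, then say /skillify.";

type Provenance =
  | { ok: true; intent: string; json: object }
  | { ok: false; message: string };

/** Parse text as JSON, accepting only objects and arrays. */
function parseJsonPayload(text: string): object | undefined {
  try {
    const value: unknown = JSON.parse(text);
    return typeof value === "object" && value !== null ? value : undefined;
  } catch {
    return undefined;
  }
}

/** Find the most recent /scrape invocation within the window, plus its JSON output. */
function findScrapeProvenance(turns: Turn[]): Provenance {
  const start = Math.max(0, turns.length - MAX_WALK_BACK);
  for (let i = turns.length - 1; i >= start; i--) {
    const turn = turns[i];
    if (!turn) continue;
    const match = /^\/scrape\s+(.+)$/m.exec(turn.text);
    if (!match) continue;
    // Most recent invocation found: it must be followed by JSON output.
    for (const later of turns.slice(i + 1)) {
      const json = parseJsonPayload(later.text);
      if (json) return { ok: true, intent: (match[1] ?? "").trim(), json };
    }
    // No silent fallback to an older invocation.
    return { ok: false, message: REFUSAL };
  }
  return { ok: false, message: REFUSAL };
}
```

Returning the refusal as soon as the newest `/scrape` lacks output, rather than continuing to older invocations, is one way to read "no silent fallback".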
|
||||
|
||||
### Carry-overs
|
||||
|
||||
- **Default tier: global.** Lean global for procedures, with per-project
|
||||
override at `/skillify` time (mirrors domain-skill scope). Phase 1 storage
|
||||
helpers support both lookup paths.
|
||||
- **Bun runtime distribution.** Codex finding #7 stays open. Phase 2a assumes
|
||||
Bun is on PATH (gstack already requires it via `setup:6-15`). Documented
|
||||
in `/skillify` SKILL.md "Limits". Real fix lands in Phase 4.
|
||||
|
||||
## Phase 2b — `/automate` sketch
|
||||
|
||||
Mutating-flow sibling of `/scrape`. Same skillify pattern (reuses `/skillify`
|
||||
and the D3 helper as-is). Difference: per-mutating-step UNTRUSTED-wrapped
|
||||
summary + `AskUserQuestion` confirmation gate when run non-codified. After
|
||||
codification, the skill runs unattended (the codified script enumerates exactly
|
||||
which `$B click`/`fill`/`type` calls run). See P0 entry in `TODOS.md`.
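As a rough illustration of the gate described above, the sketch below runs a list of steps and asks for confirmation before each mutating one unless the flow is already codified. Everything in it is hypothetical: the `Step` shape, the `<UNTRUSTED>` wrapper format, and the `confirm` callback standing in for `AskUserQuestion` are assumptions made for the example, and the callbacks are synchronous only to keep it short. The set of mutating commands (`click`, `fill`, `type`) is the one named in the text; any other command is treated as read-only here.

```typescript
// Hypothetical model of the per-mutating-step confirmation gate.
interface Step {
  command: string; // e.g. "click", "fill", "type"
  args: string[];
}

const MUTATING_COMMANDS = new Set(["click", "fill", "type"]);

interface GateOptions {
  codified: boolean; // codified skills run unattended
  confirm: (summary: string) => boolean; // stand-in for AskUserQuestion
  execute: (step: Step) => void; // stand-in for issuing the $B call
}

/** Wrap page-derived text so it is clearly marked as untrusted (format is an assumption). */
function summarizeStep(step: Step): string {
  return `<UNTRUSTED>\n$B ${step.command} ${step.args.join(" ")}\n</UNTRUSTED>`;
}

/** Run steps in order; stop at the first mutating step the user declines. Returns steps executed. */
function runWithGate(steps: Step[], options: GateOptions): number {
  let executed = 0;
  for (const step of steps) {
    const needsGate = !options.codified && MUTATING_COMMANDS.has(step.command);
    if (needsGate && !options.confirm(summarizeStep(step))) break;
    options.execute(step);
    executed++;
  }
  return executed;
}
```

Stopping at the first declined step, instead of skipping it and continuing, keeps later steps from running against a page state the user did not approve.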
|
||||
|
||||
## Phase 3 sketch
|
||||
|
||||
|
||||