mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-21 12:18:24 +08:00
chore: bump version and changelog (v1.13.1.0)
Slim preamble work + real-PTY plan-mode E2E harness on top of v1.13.0.0. SKILL.md corpus -25.5% (3.08 MB → 2.30 MB, ~196K tokens). 5 plan-mode tests go from 0/5 to 5/5 (790s sequential), the first time those tests have ever passed. Side-fixes for the 27MB security fixture warning and the sidecar-symlink double-count. Reverts the Fan-Out directive accidentally restored to opus-4-7.md — v1.10.1.0's overlay-efficacy harness measured -60pp fanout vs baseline when the nudge was active. The intentional removal stays. TODOS: - Pre-existing test failures from v1.12.0.0 ship: RESOLVED on main + this branch - security-bench-haiku-responses.json size gate: RESOLVED via warn-only + exemption Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
63
TODOS.md
63
TODOS.md
@@ -2,39 +2,6 @@
|
||||
|
||||
## Testing
|
||||
|
||||
### Pre-existing test failures surfaced during v1.12.0.0 ship
|
||||
|
||||
**What:** Two remaining test failures on bare main that have been shipping as-is for multiple versions. (The bearer-json secret-scan regression flagged here originally was a real leak path and has been fixed in this PR — see Completed section below.)
|
||||
|
||||
1. `gstack-config gbrain keys > GSTACK_HOME overrides real config dir` (`test/brain-sync.test.ts:104`) — the GSTACK_HOME env override leaks into the real `~/.gstack/config.yaml`. Test asserts real config does NOT contain `gbrain_sync_mode: full` but it does. Either the test environment isn't isolated correctly or `bin/gstack-config` is writing to both locations.
|
||||
2. `Opus 4.7 overlay — pacing directive > keeps Fan out / Effort-match / Literal interpretation nudges` (`test/model-overlay-opus-4-7.test.ts:87`) — v1.10.1.0 (#1166) removed the "Fan out explicitly" nudge from the overlay but the assertion was never updated. Either the nudge should come back (intentional removal reverted) or the test should be updated to match the new expected content.
|
||||
|
||||
**Why:** Both have been green-washing through recent `/ship` runs via "pre-existing test failures skipped: <name>." #1 signals a real config isolation bug; #2 is a stale assertion since the overlay intentionally removed that nudge.
|
||||
|
||||
**Priority:** P0 (both)
|
||||
|
||||
**Effort:** S each. #1 likely a test harness fix in `test/brain-sync.test.ts`'s setup hook. #2 is a one-line test update OR a revert of #1166.
|
||||
|
||||
---
|
||||
|
||||
### `security-bench-haiku-responses.json` is 27MB, violates the 2MB tracked-file gate
|
||||
|
||||
**What:** `browse/test/fixtures/security-bench-haiku-responses.json` landed on main at v1.6.4.0 (PR #1135) at 27MB. The `no compiled binaries in git > git tracks no files larger than 2MB` gate in `test/skill-validation.test.ts:1623` fails on main and on every feature branch that merges main afterward.
|
||||
|
||||
**Why:** The fixture is a legitimate CI replay corpus (real Haiku responses from the 500-case BrowseSafe-Bench) used to verify the ensemble classifier deterministically. But 13x over the 2MB limit means it will keep failing the validation test for every future ship.
|
||||
|
||||
**Pros:** Removes a pre-existing failure that wastes a triage slot in every /ship run.
|
||||
|
||||
**Cons:** Moving to git-lfs adds a dependency. Splitting into chunks risks breaking the bench test. External hosting adds a CI fetch step.
|
||||
|
||||
**Context:** Noticed during workspace-aware-ship /ship on 2026-04-23 when the post-merge test suite flagged this single failure. Introduced on main in PR #1135 (`v1.6.4.0: cut Haiku classifier FP from 44% to 23%`), commit d75402bb. Two reasonable paths: (a) split into multiple ≤2MB chunks and load them in the bench test, (b) move to git-lfs.
|
||||
|
||||
**Effort:** M (human: ~2-3h / CC: ~20 min)
|
||||
**Priority:** P1 (not blocking ship, but every future /ship triages the same failure)
|
||||
**Depends on:** nothing
|
||||
|
||||
---
|
||||
|
||||
## P1: Structural STOP-Ask forcing function across all skills
|
||||
|
||||
**What:** Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired <N), typed question templates (skill hands the model pre-built AskUserQuestion payloads rather than prose instructions), or a canUseTool-based post-run audit that compares declared-gates-fired vs expected.
|
||||
@@ -1303,6 +1270,36 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr
|
||||
|
||||
## Completed
|
||||
|
||||
### Slim preamble + real-PTY plan-mode E2E harness (v1.13.1.0)
|
||||
|
||||
- Compressed 18 preamble resolvers; total `SKILL.md` corpus dropped from 3.08 MB to 2.30 MB across 47 outputs (-25.5%, ~196K tokens saved).
|
||||
- Built `test/helpers/claude-pty-runner.ts` — real-PTY harness using `Bun.spawn({terminal:})` (Bun 1.3.10+ has built-in PTY, no `node-pty` needed).
|
||||
- Rewrote 5 plan-mode E2E tests (`plan-ceo`, `plan-eng`, `plan-design`, `plan-devex`, `plan-mode-no-op`); all 5 pass for the first time ever (790s sequential).
|
||||
- Same tests were 0/5 on `origin/main`, on v1.0.0.0, and on this branch with the SDK harness — the SDK couldn't observe Claude's plan-mode confirmation UI.
|
||||
- Side fixes folded in: `scripts/skill-check.ts` sidecar-symlink helper, `test/skill-validation.test.ts` exemption for `browse/test/fixtures/security-bench-haiku-responses.json` (resolves the size-warning noise from main's warn-only conversion).
|
||||
|
||||
**Completed:** v1.13.1.0 (2026-04-25)
|
||||
|
||||
---
|
||||
|
||||
### Pre-existing test failures surfaced during v1.12.0.0 ship — RESOLVED
|
||||
|
||||
- `test/brain-sync.test.ts` GSTACK_HOME isolation fixed on main in v1.13.0.0.
|
||||
- `test/model-overlay-opus-4-7.test.ts` updated on main to match the new overlay content (the v1.10.1.0 removal of "Fan out explicitly" was correct — measured −60pp fanout vs baseline).
|
||||
|
||||
**Completed:** v1.13.0.0 (2026-04-25, on main)
|
||||
|
||||
---
|
||||
|
||||
### `security-bench-haiku-responses.json` size gate — RESOLVED
|
||||
|
||||
- Main converted the 2 MB tracked-file gate to warn-only in v1.13.0.0.
|
||||
- v1.13.1.0 added a `knownLargeFixtures` exemption to suppress the warning for this specific intentional fixture.
|
||||
|
||||
**Completed:** v1.13.1.0 (2026-04-25)
|
||||
|
||||
---
|
||||
|
||||
### Bearer-token secret-scan regression fixed + E2E coverage added for privacy gate + gh auto-create (v1.12.0.0)
|
||||
|
||||
- **Fixed the `bearer-token-json` regression in `bin/gstack-brain-sync`** — the value charset `[A-Za-z0-9_./+=-]{16,}` didn't permit spaces, so auth headers with the standard `Bearer <token>` form (literal space after the scheme name) slipped past the scanner. Added an optional `(Bearer |Basic |Token )?` prefix to the pattern. Validated against 5 positive cases (including the regression fixture) + 3 negative cases (short tokens, non-secret keys, random JSON). The 7-pattern secret scanner now passes all fixtures including bearer-json.
|
||||
|
||||
Reference in New Issue
Block a user