feat(security): v2 ensemble tuning — label-first voting + SOLO_CONTENT_BLOCK

Cuts the Haiku classifier false-positive rate from 44.1% to 22.9% on
BrowseSafe-Bench smoke. Detection drops from 67.3% to 56.2%; the
lost TPs are all cases Haiku correctly labeled verdict=warn
(phishing targeting users, not agent hijack). They still surface
in the WARN banner meta but no longer kill the session.

Key changes:
- combineVerdict: label-first voting for transcript_classifier. Only
  meta.verdict==='block' block-votes; verdict==='warn' is a soft
  signal. Missing meta.verdict never block-votes (backward-compat).
- Hallucination guard: verdict='block' at confidence < LOG_ONLY (0.40)
  drops to warn-vote — prevents malformed low-conf blocks from going
  authoritative.
- New THRESHOLDS.SOLO_CONTENT_BLOCK = 0.92 decoupled from BLOCK (0.85).
  Label-less content classifiers (testsavant, deberta) need a higher
  solo-BLOCK bar because they can't distinguish injection from
  phishing-targeting-user. Transcript keeps label-gated solo path
  (verdict=block AND conf >= BLOCK).
- THRESHOLDS.WARN bumped 0.60 → 0.75 — borderline fires drop out of
  the 2-of-N ensemble pool.
- Haiku model pinned (claude-haiku-4-5-20251001). `claude -p` spawns
  from os.tmpdir() so project CLAUDE.md doesn't poison the classifier
  context (measured 44k cache_creation tokens per call before the fix,
  and Haiku refusing to classify because it read "security system"
  from CLAUDE.md and went meta).
- Haiku timeout 15s → 45s. Measured real latency is 17-33s end-to-end
  (Claude Code session startup + Haiku); v1's 15s caused 100% timeout
  when re-measured — v1's ensemble was effectively L4-only in prod.
- Haiku prompt rewritten: explicit block/warn/safe criteria, 8 few-shot
  exemplars (instruction-override → block; social engineering → warn;
  discussion-of-injection → safe).
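
The voting rules above can be sketched roughly as follows. This is an
illustration, not the code in this commit: the threshold values and layer
names come from this message and the diff, while `voteFor`, `isSoloBlock`,
`combineVerdictSketch`, the string return value, the treatment of a
label-less transcript signal as a soft warn, and the "single block-vote
yields warn" rule are assumptions.

```typescript
// Hypothetical sketch of label-first voting. Not the real combineVerdict.
interface LayerSignal {
  layer: string;
  confidence: number;
  meta?: { verdict?: 'block' | 'warn' | 'safe' };
}

// Values taken from the commit message.
const THRESHOLDS = { BLOCK: 0.85, SOLO_CONTENT_BLOCK: 0.92, WARN: 0.75, LOG_ONLY: 0.40 };

type Vote = 'block' | 'warn' | 'none';

// One ensemble vote per ML layer.
function voteFor(s: LayerSignal): Vote {
  if (s.layer === 'transcript_classifier') {
    const label = s.meta?.verdict;
    if (label === 'block') {
      // Hallucination guard: a block label below LOG_ONLY is only a warn-vote.
      return s.confidence >= THRESHOLDS.LOG_ONLY ? 'block' : 'warn';
    }
    if (label === 'warn') return 'warn';
    if (label === 'safe') return 'none';
    // Missing label never block-votes. Assumption: it is a soft warn at >= WARN.
    return s.confidence >= THRESHOLDS.WARN ? 'warn' : 'none';
  }
  // Label-less content classifiers vote purely on confidence (assumed >= WARN).
  return s.confidence >= THRESHOLDS.WARN ? 'block' : 'none';
}

// Can this single signal block without any other layer agreeing?
function isSoloBlock(s: LayerSignal): boolean {
  if (s.layer === 'canary') return s.confidence >= 1.0;
  if (s.layer === 'transcript_classifier') {
    return s.meta?.verdict === 'block' && s.confidence >= THRESHOLDS.BLOCK;
  }
  return s.confidence >= THRESHOLDS.SOLO_CONTENT_BLOCK;
}

function combineVerdictSketch(signals: LayerSignal[]): 'block' | 'warn' | 'safe' {
  if (signals.some(isSoloBlock)) return 'block';
  const votes = signals.filter((s) => s.layer !== 'canary').map(voteFor);
  const blockVotes = votes.filter((v) => v === 'block').length;
  if (blockVotes >= 2) return 'block'; // 2-of-N ensemble agreement
  // Assumption: one block-vote or any warn-vote surfaces as a warning.
  if (blockVotes === 1 || votes.some((v) => v === 'warn')) return 'warn';
  return 'safe';
}
```

Under these assumptions, content at 0.88 alone yields warn (below
SOLO_CONTENT_BLOCK), content at 0.80 plus a transcript verdict=warn also
yields warn, and content at 0.80 plus a transcript verdict=block at 0.50
yields block through the 2-of-N path.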

Test updates:
- 5 existing combineVerdict tests adapted for label-first semantics
  (transcript signals now need meta.verdict to block-vote).
- 6 new tests, including warn-soft-signal,
  three-way-block-with-warn-transcript,
  hallucination-guard-below-floor, above-floor-label-first, and
  backward-compat-missing-meta.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-21 20:31:53 -07:00
parent 97584f9a59
commit 6cedecd585
7 changed files with 296 additions and 88 deletions


@@ -89,8 +89,10 @@ describe('defense-in-depth — layer coexistence', () => {
// produce a BLOCK-worthy verdict.
const baseSignals: LayerSignal[] = [
- { layer: 'testsavant_content', confidence: 0.88 },
- { layer: 'transcript_classifier', confidence: 0.75 },
+ // content at 0.95 clears the SOLO_CONTENT_BLOCK threshold (0.92) so
+ // that the "content alone" case below still hits single_layer_high.
+ { layer: 'testsavant_content', confidence: 0.95 },
+ { layer: 'transcript_classifier', confidence: 0.75, meta: { verdict: 'block' } },
{ layer: 'canary', confidence: 1.0 },
];
@@ -174,8 +176,8 @@ describe('defense-in-depth — regression guards', () => {
// still be BLOCK, not crash or produce nonsense. Canary uses >= 1.0
// which matches; ML layers also register.
const overflow: LayerSignal[] = [
- { layer: 'testsavant_content', confidence: 5.5 }, // above BLOCK
- { layer: 'transcript_classifier', confidence: 3.2 }, // above BLOCK
+ { layer: 'testsavant_content', confidence: 5.5 }, // above BLOCK, block-vote
+ { layer: 'transcript_classifier', confidence: 3.2, meta: { verdict: 'block' } }, // label-first block-vote
];
expect(combineVerdict(overflow).verdict).toBe('block');
});