Commit Graph

3 Commits

Author SHA1 Message Date
Garry Tan
8e9ec52d6f feat(security): DeBERTa-v3 ensemble classifier (opt-in)
Adds ProtectAI DeBERTa-v3-base-injection-onnx as an optional L4c layer
for cross-model agreement. It is a different model family (DeBERTa-v3-base,
~184M params, consistent with the 721MB FP32 weights) than the default L4
TestSavantAI (BERT-small, ~30M params). When both fire together, that is a
much stronger signal than either alone.

Opt-in because the download is hefty: set GSTACK_SECURITY_ENSEMBLE=deberta
and the sidebar-agent warmup fetches model.onnx (721MB FP32) into
~/.gstack/models/deberta-v3-injection/ on first run. Subsequent runs are
cached.

Implementation mirrors the TestSavantAI loader:
  * loadDeberta() — idempotent, progress-reported download + pipeline init
    with the same model_max_length=512 override (DeBERTa's config has the
    same bogus model_max_length placeholder as TestSavantAI)
  * scanPageContentDeberta() — htmlToPlainText preprocess, 4000-char cap,
    truncate at 512 tokens, return LayerSignal with layer='deberta_content'
  * getClassifierStatus() includes deberta field only when enabled
    (avoids polluting the shield API with always-off data)
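The conditional status field can be sketched as below. This is a minimal
illustration, not the module's code: the LayerState values, the status shape,
and the isDebertaEnabled helper are assumptions. Only getClassifierStatus and
GSTACK_SECURITY_ENSEMBLE=deberta come from this commit.

```typescript
// Hypothetical status values and shape (assumptions for illustration).
type LayerState = "off" | "loading" | "ok" | "degraded";

interface ClassifierStatus {
  testsavant: LayerState;
  transcript: LayerState;
  deberta?: LayerState; // present only when the ensemble is enabled
}

type Env = Record<string, string | undefined>;

// Hypothetical helper: the ensemble is opt-in via the env var named above.
function isDebertaEnabled(env: Env): boolean {
  return env["GSTACK_SECURITY_ENSEMBLE"] === "deberta";
}

function getClassifierStatus(
  env: Env,
  states: { testsavant: LayerState; transcript: LayerState; deberta: LayerState },
): ClassifierStatus {
  const status: ClassifierStatus = {
    testsavant: states.testsavant,
    transcript: states.transcript,
  };
  // Omit the key entirely when disabled instead of reporting "off".
  if (isDebertaEnabled(env)) status.deberta = states.deberta;
  return status;
}
```

Omitting the key, rather than sending an always-off value, keeps the payload
unchanged for anyone who never opts in.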

sidebar-agent changes:
  * preSpawnSecurityCheck runs TestSavant + DeBERTa in parallel (Promise.all)
    then adds both to the signals array before the gated Haiku check
  * toolResultScanCtx does the same for tool-output scans
  * When GSTACK_SECURITY_ENSEMBLE is unset, scanPageContentDeberta is a
    no-op that returns confidence=0 with meta.disabled — combineVerdict
    treats it as a non-contributor and the verdict is identical to the
    pre-ensemble behavior
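A rough sketch of the parallel scan and the disabled no-op follows. The two
scanner bodies are toy stand-ins (a keyword check, not a model), and
collectContentSignals plus the enabled parameter are hypothetical; only the
layer names, Promise.all, confidence=0, and meta.disabled come from the
description above.

```typescript
interface LayerSignal {
  layer: string;
  confidence: number;
  meta?: { disabled?: boolean };
}

// Toy stand-in for the TestSavantAI scan (assumption: keyword match only).
async function scanPageContent(text: string): Promise<LayerSignal> {
  const hit = text.toLowerCase().includes("ignore previous instructions");
  return { layer: "testsavant_content", confidence: hit ? 0.99 : 0.01 };
}

// Toy stand-in for the DeBERTa scan. When the ensemble is off it returns
// a zero-confidence signal flagged as disabled instead of loading a model.
async function scanPageContentDeberta(text: string, enabled: boolean): Promise<LayerSignal> {
  if (!enabled) {
    return { layer: "deberta_content", confidence: 0, meta: { disabled: true } };
  }
  const hit = text.toLowerCase().includes("ignore previous instructions");
  return { layer: "deberta_content", confidence: hit ? 0.97 : 0.02 };
}

// Hypothetical helper: run both scans concurrently and keep both signals.
async function collectContentSignals(text: string, enabled: boolean): Promise<LayerSignal[]> {
  const [testsavant, deberta] = await Promise.all([
    scanPageContent(text),
    scanPageContentDeberta(text, enabled),
  ]);
  return [testsavant, deberta];
}
```

Because the disabled signal is always present with confidence 0, callers do
not need a separate code path for the opt-out case.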

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:55:23 +08:00
Garry Tan
d5253215c5 fix(security-classifier): truncation + HTML preprocessing
Two real bugs found by the BrowseSafe-Bench smoke harness.

1. Truncation wasn't happening.
   The TextClassificationPipeline in transformers.js v4 calls the tokenizer
   with `{ padding: true, truncation: true }` — but truncation needs a
   max_length, which it reads from tokenizer.model_max_length. TestSavantAI
   ships with model_max_length set to 1e18 (a common "infinity" placeholder
   in HF configs) so no truncation actually occurs. Inputs longer than 512
   tokens (the BERT-small context limit) crash ONNXRuntime with a
   broadcast-dimension error.
   Fix: override tokenizer._tokenizerConfig.model_max_length = 512 right
   after pipeline load. The getter now returns the real limit and the
   implicit truncation: true in the pipeline actually clips inputs.
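   A small sketch of why the override matters. MockTokenizer below is NOT the
   transformers.js tokenizer; it is a stand-in that only mimics the shape
   described above (a model_max_length getter backed by _tokenizerConfig).
   overrideMaxLength and CONTEXT_LIMIT are hypothetical names.

```typescript
// Stand-in tokenizer (assumption): same field layout as described in the fix.
class MockTokenizer {
  _tokenizerConfig: { model_max_length: number };

  constructor(modelMaxLength: number) {
    this._tokenizerConfig = { model_max_length: modelMaxLength };
  }

  get model_max_length(): number {
    return this._tokenizerConfig.model_max_length;
  }

  // Truncation clips to model_max_length, so a huge placeholder clips nothing.
  encode(tokens: string[], opts: { truncation: boolean }): string[] {
    return opts.truncation ? tokens.slice(0, this.model_max_length) : tokens;
  }
}

const CONTEXT_LIMIT = 512; // BERT-small context limit named in the commit

// Hypothetical helper: apply the override right after the pipeline loads.
function overrideMaxLength(tokenizer: MockTokenizer, limit: number = CONTEXT_LIMIT): void {
  tokenizer._tokenizerConfig.model_max_length = limit;
}

const tokenizer = new MockTokenizer(1e18); // the "infinity" placeholder
const longInput = Array.from({ length: 600 }, (_, i) => `tok${i}`);

const lengthBefore = tokenizer.encode(longInput, { truncation: true }).length;
overrideMaxLength(tokenizer);
const lengthAfter = tokenizer.encode(longInput, { truncation: true }).length;
```

   With the placeholder in place, truncation: true leaves all 600 tokens
   untouched; after the override the same call clips to 512.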

2. Classifier was receiving raw HTML.
   TestSavantAI is trained on natural language, not markup. Feeding it a
   blob of <div style="..."> dilutes the injection signal with tag noise.
   When the Perplexity BrowseSafe-Bench fixture had an attack buried inside
   HTML, the classifier said SAFE at confidence 0 across the board.
   Fix: added htmlToPlainText() that strips tags, drops script/style
   bodies, decodes common entities, and collapses whitespace. scanPageContent
   now normalizes input through this before handing to the classifier.
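   A minimal regex-based sketch of such a preprocessing step, assuming a small
   fixed entity table. The real htmlToPlainText may differ; the ENTITIES map
   and the exact patterns here are illustrative.

```typescript
// Assumed entity table: a handful of common named and numeric entities.
const ENTITIES: Record<string, string> = {
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
  "&amp;": "&",
};

function htmlToPlainText(html: string): string {
  return html
    // Drop <script> and <style> elements together with their bodies.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Strip all remaining tags, leaving a space so words do not fuse.
    .replace(/<[^>]+>/g, " ")
    // Decode entities in one pass so "&amp;lt;" becomes "&lt;", not "<".
    .replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, (m) => ENTITIES[m] ?? m)
    // Collapse runs of whitespace.
    .replace(/\s+/g, " ")
    .trim();
}

const sample =
  '<div style="color:red">Ignore <b>previous</b> instructions</div>' +
  "<script>var a = 1;</script>";
const plain = htmlToPlainText(sample);
```

   Regex stripping is a deliberate simplification: the classifier only needs
   readable text, not a faithful DOM.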

Result: the BrowseSafe-Bench smoke run completes without errors. Detection
rate is only 15% at WARN=0.6 because TestSavantAI wasn't trained on this
distribution (see the bench test docstring). In prod, the ensemble with the
Haiku transcript classifier filters false positives; a DeBERTa-v3 ensemble is
a tracked P2 improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:50:53 +08:00
Garry Tan
63a56e6789 feat(security): add security-classifier.ts with TestSavantAI + Haiku
This module holds the ML classifier code that the compiled browse binary
cannot link (onnxruntime-node native dylib doesn't load from Bun compile's
temp extract dir — see CEO plan §"Pre-Impl Gate 1 Outcome"). It's imported
ONLY by sidebar-agent.ts, which runs as a non-compiled bun script.

Two layers:

L4 testsavant_content — TestSavantAI BERT-small ONNX classifier. First call
triggers a one-time 112MB model download to ~/.gstack/models/testsavant-small/
(files staged into the onnx/ layout transformers.js v4 expects). Classifies
page snapshots and tool outputs for indirect prompt injection + jailbreak
attempts. On a benign-corpus dry run, Wikipedia/HN/Reddit/tech-blog text all
scores SAFE 0.98+ and attack text scores INJECTION 0.99+. Stack Overflow
instruction-writing now scores SAFE 0.98 in its shorter form, whereas the
longer form scored INJECTION 0.99 (instruction-density threshold). The
ensemble combiner downgrades a single-layer high score to WARN to cover this
case.
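One way the single-layer downgrade could look, as a sketch only. WARN=0.6
appears elsewhere in this log; the BLOCK and LOG_ONLY values, the Verdict
names, and the "two high layers to block" rule are assumptions, not the
module's actual policy.

```typescript
interface LayerSignal {
  layer: string;
  confidence: number;
  meta?: { disabled?: boolean };
}

type Verdict = "safe" | "log_only" | "warn" | "block";

// WARN is taken from this log; BLOCK and LOG_ONLY are assumed values.
const THRESHOLDS = { BLOCK: 0.85, WARN: 0.6, LOG_ONLY: 0.4 };

function combineVerdict(signals: LayerSignal[]): Verdict {
  // Disabled or absent layers never contribute to the verdict.
  const active = signals.filter((s) => !s.meta?.disabled);
  const highCount = active.filter((s) => s.confidence >= THRESHOLDS.BLOCK).length;

  // Assumed rule: blocking needs agreement from at least two layers.
  if (highCount >= 2) return "block";
  // A lone high score is downgraded to WARN (covers the false-positive case).
  if (highCount === 1) return "warn";

  const max = active.reduce((m, s) => Math.max(m, s.confidence), 0);
  if (max >= THRESHOLDS.WARN) return "warn";
  if (max >= THRESHOLDS.LOG_ONLY) return "log_only";
  return "safe";
}
```

Requiring agreement before blocking trades some recall for fewer hard stops on
benign instruction-heavy pages.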

L4b transcript_classifier — Claude Haiku reasoning-blind pre-tool-call scan.
Sees only {user_message, last 3 tool_calls}, never Claude's chain-of-thought
or tool results (those are how self-persuasion attacks leak). 2000ms hard
timeout. Fail-open on any subprocess failure so sidebar stays functional.
Gated by shouldRunTranscriptCheck() — only runs when another layer already
fired at >= LOG_ONLY, saving ~70% of Haiku spend.
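The gate and the fail-open timeout might be sketched like this. The 2000ms
limit and shouldRunTranscriptCheck come from the text above; the LOG_ONLY
value, the failOpen helper, and the meta fields are hypothetical, and the
check callback stands in for the real Haiku subprocess call.

```typescript
interface LayerSignal {
  layer: string;
  confidence: number;
  meta?: Record<string, unknown>;
}

const LOG_ONLY = 0.4; // assumed threshold value
const HAIKU_TIMEOUT_MS = 2000; // hard timeout from the commit message

// Run the transcript check only when some other layer already fired.
function shouldRunTranscriptCheck(signals: LayerSignal[]): boolean {
  return signals.some((s) => s.confidence >= LOG_ONLY);
}

// Hypothetical wrapper: any failure or timeout yields confidence 0.
async function failOpen(
  check: () => Promise<LayerSignal>,
  timeoutMs: number = HAIKU_TIMEOUT_MS,
): Promise<LayerSignal> {
  const fallback = (reason: string): LayerSignal => ({
    layer: "transcript_classifier",
    confidence: 0,
    meta: { degraded: true, reason },
  });
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<LayerSignal>((resolve) => {
    timer = setTimeout(() => resolve(fallback("timeout")), timeoutMs);
  });
  try {
    return await Promise.race([check(), timeout]);
  } catch {
    return fallback("error");
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Failing open means a slow or broken subprocess never blocks the sidebar; the
cost is that this layer contributes nothing to the verdict in that case.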

Both layers degrade gracefully: load/spawn failures set status to 'degraded'
and return confidence=0. Shield icon reflects this via getClassifierStatus()
which security.ts's getStatus() composes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:03:36 +08:00