Gap caught by user: the review-flow tests verified the decision path
(POST, file write, agent_error emission) but not the actual security
property — that Block stops subsequent tool calls and Allow lets them
continue.
Mock-claude's tool_result_injection scenario now emits a second tool_use
~8s after the injected tool_result, targeting
post-block-followup.example.com. If Block really blocks, that event
never reaches the chat feed (SIGTERM kills the subprocess before it can
emit it). If Allow really allows, it does.
Allow test asserts the followup tool_use DOES appear → session lives.
Block test asserts the followup tool_use does NOT appear after 12s →
kill actually stopped further work. Both tests previously proved the
control plane (decision file → agent poll → agent_error); they now
prove the data plane too.
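
A minimal sketch of the quiet-window assertion described above, not the
actual test code. The FeedEvent shape, the readFeed callback, and the
helper names are hypothetical; only the followup hostname and the 12s
window come from this message.

```typescript
// Hypothetical event shape; real chat-feed entries likely carry more fields.
interface FeedEvent {
  type: string;
  url?: string;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `readFeed` until `match` finds an event or `windowMs` elapses.
// Resolves true if the event appeared, false if the window stayed quiet.
async function sawEventWithin(
  readFeed: () => Promise<FeedEvent[]>,
  match: (e: FeedEvent) => boolean,
  windowMs: number,
  pollMs = 250,
): Promise<boolean> {
  const deadline = Date.now() + windowMs;
  while (Date.now() < deadline) {
    const events = await readFeed();
    if (events.some(match)) return true;
    await sleep(pollMs);
  }
  return false;
}

// Matches the second tool_use that mock-claude emits after the injection.
const isFollowup = (e: FeedEvent) =>
  e.type === "tool_use" &&
  (e.url ?? "").includes("post-block-followup.example.com");
```

Under these assumptions the Allow test would expect
sawEventWithin(readFeed, isFollowup, 12_000) to resolve true, and the
Block test would expect false. Only the negative case has to wait out
the full window, which is why the quiet window drives the timeout bump.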
Test timeout bumped 60s → 90s to accommodate the 12s quiet window.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tests, ~12s hot / ~30s cold (first-run model download). Skips
gracefully if ~/.gstack/models/testsavant-small/ isn't populated.
Spins up real server + real sidebar-agent + PATH-shimmed mock-claude,
HOME re-rooted so neither the chat history nor the attempts log leak
from the user's live /open-gstack-browser session. Models dir
symlinked through to the real warmed cache so the test doesn't
re-download 112MB per run.
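
One way the HOME re-rooting could look, as a sketch only. makeTestHome,
childEnv, the temp-dir prefix, and shimDir are hypothetical names; the
.gstack/models/testsavant-small path is the one named above. The env
is meant to be handed to both the server and the sidebar-agent.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Creates a throwaway HOME and symlinks the warmed model cache into it.
// Returns null when the cache is missing or empty so the caller can skip.
function makeTestHome(realHome: string): string | null {
  const realModels = path.join(realHome, ".gstack", "models");
  const modelDir = path.join(realModels, "testsavant-small");
  if (!fs.existsSync(modelDir) || fs.readdirSync(modelDir).length === 0) {
    return null;
  }

  const testHome = fs.mkdtempSync(path.join(os.tmpdir(), "gstack-e2e-"));
  fs.mkdirSync(path.join(testHome, ".gstack"), { recursive: true });
  // Link the whole models dir so nothing is copied or re-downloaded.
  fs.symlinkSync(realModels, path.join(testHome, ".gstack", "models"), "dir");
  return testHome;
}

// Env for every child process: re-rooted HOME, shim dir first on PATH
// so the mock-claude shim is found before any real binary.
function childEnv(
  testHome: string,
  shimDir: string,
): Record<string, string | undefined> {
  return {
    ...process.env,
    HOME: testHome,
    PATH: `${shimDir}${path.delimiter}${process.env.PATH ?? ""}`,
  };
}
```

Building the env once and passing the same object to every spawn keeps
the processes from disagreeing about where HOME is.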
Covers the half that hermetic tests can't:
- real classifier (not a stub) fires on real injection text
- sidebar-agent emits a reviewable security_event end-to-end
- server writes the on-disk decision file
- sidebar-agent's poll loop reads the file and acts
- attempts.jsonl gets both block + user_overrode with matching
payloadHash (dashboard can aggregate)
- the raw payload never appears in attempts.jsonl (privacy contract)
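
The last two checks could be expressed roughly as below. This is a
sketch under assumptions: the `verdict` field name and the
checkAttemptsLog helper are hypothetical; only payloadHash and the
block / user_overrode values come from the list above.

```typescript
// Hypothetical record shape for one line of attempts.jsonl.
interface Attempt {
  verdict: string;
  payloadHash: string;
}

// Returns a list of problems; an empty list means the log passed.
function checkAttemptsLog(jsonl: string, rawPayload: string): string[] {
  const problems: string[] = [];

  // Privacy contract: the injected text must not be stored verbatim.
  if (jsonl.includes(rawPayload)) {
    problems.push("raw payload leaked into the log");
  }

  const attempts: Attempt[] = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Attempt);

  const block = attempts.find((a) => a.verdict === "block");
  const override = attempts.find((a) => a.verdict === "user_overrode");
  if (!block) problems.push("no block entry");
  if (!override) problems.push("no user_overrode entry");

  // Matching hashes are what let a dashboard pair the two records.
  if (block && override && block.payloadHash !== override.payloadHash) {
    problems.push("payloadHash mismatch");
  }
  return problems;
}
```

The substring check is deliberately crude: a payload containing quotes
or newlines would be JSON-escaped on disk, so a stricter version would
compare against parsed field values as well.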
Caught a real bug while writing: the server loads pre-existing chat
history from ~/.gstack/sidebar-sessions/, so re-rooting HOME for only
the agent leaked ghost security_events from the live session into the
test. Fix: re-root HOME for both processes. The harness is cleaner for
future full-stack tests because of it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>