diff --git a/docs/ECC-2.0-GA-ROADMAP.md b/docs/ECC-2.0-GA-ROADMAP.md
index dce9481e..0cf0b351 100644
--- a/docs/ECC-2.0-GA-ROADMAP.md
+++ b/docs/ECC-2.0-GA-ROADMAP.md
@@ -99,11 +99,20 @@ As of 2026-05-12:
   scope, expiry, and days-until-expiry reporting; terminal output and GitHub
   Action job-summary evidence; README docs; rebuilt action bundles; and
   1,708-test validation.
+- AgentShield PR #63 exposed baseline drift in the GitHub Action with
+  `baseline` / `save-baseline` inputs, baseline drift outputs, job-summary
+  evidence, regression annotations, README/API docs, rebuilt action bundles,
+  and green remote action/self-scan/Node verification.
 - AgentShield PDF-export decision: defer a native PDF writer for now. The
   self-contained HTML executive report remains the exportable buyer artifact
   and can be printed to PDF when needed; native PDF generation should wait for
   explicit enterprise/compliance demand or a print-fidelity gap in the HTML
   report.
+- `docs/architecture/agentshield-enterprise-research-roadmap.md` identifies
+  the next AgentShield enterprise signal: move from scanner/report/policy gate
+  to a team control plane with baseline drift, evidence packs, multi-harness
+  adapters, corpus accuracy gates, remediation routing, threat intelligence,
+  and ECC-Tools/GitHub App integration.
 - ECC PR #1778 recovered the useful stale #1413 network/homelab architect-agent
   concepts.
 - ECC-Tools PR #26 added cost/token-risk predictive follow-ups for AI routing,
@@ -208,7 +217,7 @@ is not complete unless the evidence column exists and has been freshly verified.
 | Naming and rename readiness | Naming matrix across package/plugin/docs/social surfaces | `docs/releases/2.0.0-rc.1/naming-and-publication-matrix.md` records current package, repo, Claude plugin, Codex plugin, OpenCode, and npm availability evidence | Complete for rc.1; post-rc rename remains future work |
 | Claude and Codex plugin publication | Contact/submission path with required artifacts and status | Publication readiness, naming matrix, and May 12 dry-run evidence document plugin validation, clean-checkout Claude tag/install smoke, and Codex marketplace CLI shape | Needs explicit approval for real tag/push and marketplace submission |
 | Articles, tweets, and announcements | X thread, LinkedIn copy, GitHub release copy, push checklist | Draft launch collateral exists under rc.1 release docs | Needs URL-backed refresh |
-| AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence; native PDF export deferred in favor of self-contained HTML plus print-to-PDF until explicit enterprise demand appears | Needs next enterprise signal |
+| AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit, baseline drift action surface, enterprise research roadmap | PRs #53, #55-#63 landed with test evidence; native PDF export deferred in favor of self-contained HTML plus print-to-PDF until explicit enterprise demand appears; `docs/architecture/agentshield-enterprise-research-roadmap.md` selects baseline drift as the first control-plane slice | Baseline-drift Action surface landed; CLI/evidence-pack routing remains |
 | ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog, evaluator/RAG corpus | PRs #26-#40 landed with test evidence | Needs capacity-backed Linear rollout |
 | GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
 | Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md`, `examples/evaluator-rag-prototype/`, and ECC-Tools PR #40 define read-only stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, skill-quality evidence, deep-analyzer evidence, and RAG/evaluator comparison scenarios with trace, report, playbook, verifier, and predictive-check artifacts | Local corpus complete; hosted integration remains future |
@@ -231,7 +240,7 @@ back to the repo evidence and merge commits.
 | Release and publication | rc.1 release docs, publication readiness doc | Naming matrix and plugin submission/contact checklist | Before any tag |
 | Harness OS core | Audit, adapter matrix, observability docs, `ecc2/` | HUD/session-control acceptance spec | Weekly until GA |
 | Evaluation and RAG | Reference-set validation, harness audit, traces, ECC-Tools corpus | Read-only evaluator/RAG prototype plus stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, skill-quality evidence, deep-analyzer evidence, and RAG/evaluator comparison fixtures | Hosted retrieval/check-run automation plan |
-| AgentShield enterprise | AgentShield PR evidence and roadmap notes | Next enterprise signal | After PDF/export decision |
+| AgentShield enterprise | AgentShield PR evidence and roadmap notes | Baseline-drift CLI/evidence-pack follow-up | Next implementation batch |
 | ECC Tools app | ECC-Tools PR evidence, billing audit, risk taxonomy, evaluator/RAG corpus | Capacity-backed Linear rollout | Next implementation batch |
 | Linear progress | Linear project status updates and this mirror | Status update with queue/evidence/missing gates | Every significant merge batch |

@@ -432,9 +441,11 @@ Acceptance:

 ## Next Engineering Slices

-1. Identify the next AgentShield enterprise signal beyond the merged HTML
-   executive report, corpus benchmark output, exception lifecycle audit, and
-   deferred native-PDF decision.
+1. Finish the AgentShield baseline-drift control-plane slice from
+   `docs/architecture/agentshield-enterprise-research-roadmap.md`: PR #63
+   shipped the GitHub Action baseline outputs and job-summary evidence; the
+   remaining work is CLI baseline UX, evidence-pack routing, and ECC-Tools
+   backlog sync integration.
 2. Enable/configure the merged Linear backlog sync path after workspace issue
    capacity clears or the Linear workspace is upgraded.
 3. Use the ECC-Tools evaluator/RAG corpus as the promotion gate before adding
diff --git a/docs/architecture/agentshield-enterprise-research-roadmap.md b/docs/architecture/agentshield-enterprise-research-roadmap.md
new file mode 100644
index 00000000..8c3be0f5
--- /dev/null
+++ b/docs/architecture/agentshield-enterprise-research-roadmap.md
@@ -0,0 +1,328 @@
+# AgentShield Enterprise Research Roadmap
+
+Generated: 2026-05-12
+
+This is a planning artifact for the next AgentShield enterprise iteration. It
+does not modify AgentShield code. The goal is to turn the current scanner,
+policy gate, corpus, and reporting surface into a security control plane for
+teams running AI coding agents across multiple harnesses.
+
+## Evidence Reviewed
+
+Current AgentShield repository state:
+
+- AgentShield checkout on clean `main`.
+- `README.md`, `API.md`, `package.json`, `.github/workflows/*`, and
+  `src/`/`tests/` module layout.
+- Current supported user surfaces: `agentshield scan`, `agentshield init`,
+  `agentshield miniclaw start`, scanner JSON, MiniClaw API, GitHub Action,
+  HTML, SARIF, markdown, terminal, and JSON reports.
+- Current enterprise-like surfaces: policy packs, GitHub Action policy
+  enforcement, SARIF policy violations, supply-chain provenance, corpus
+  benchmark, HTML executive reports, and exception lifecycle audit.
+
+External references checked from official GitHub repos or README sources:
+
+- [stablyai/orca](https://github.com/stablyai/orca): multi-agent IDE,
+  worktree isolation, live agent status, GitHub integration, diff review, and
+  notifications.
+- [superset-sh/superset](https://github.com/superset-sh/superset): AI-agent
+  editor with worktree orchestration, built-in diff review, workspace presets,
+  and universal CLI-agent compatibility.
+- [standardagents/dmux](https://github.com/standardagents/dmux): tmux/worktree
+  multiplexer with lifecycle hooks, multi-agent launches, pane visibility, and
+  merge/PR workflows.
+- [jarrodwatts/claude-hud](https://github.com/jarrodwatts/claude-hud): Claude
+  Code statusline, context health, tool activity, agent tracking, todo
+  progress, transcript parsing, and usage telemetry.
+- [stanford-iris-lab/meta-harness](https://github.com/stanford-iris-lab/meta-harness):
+  harness optimization through repeatable tasks, logged proposer interactions,
+  and evaluated scaffold changes.
+- [greyhaven-ai/autocontext](https://github.com/greyhaven-ai/autocontext):
+  recursive improvement loop with traces, scored generations, playbooks,
+  persisted knowledge, scenario evaluation, and optional production traces.
+- [NousResearch/hermes-agent](https://github.com/NousResearch/hermes-agent):
+  self-improving skills, memory, session search, multi-platform gateway,
+  scheduled automation, terminal backends, and trajectory generation.
+- [anthropics/claude-code](https://github.com/anthropics/claude-code):
+  terminal, IDE, GitHub, plugin, permission, MCP, and data-retention surfaces.
+- [anomalyco/opencode](https://github.com/anomalyco/opencode): provider-agnostic
+  open-source coding agent with build/plan agents, desktop beta,
+  client/server architecture, and LSP support.
+- [opencode-ai/opencode](https://github.com/opencode-ai/opencode): earlier
+  archived Go-based terminal agent with sessions, providers, LSP, file change
+  tracking, custom commands, and auto-compact.
+- [zed-industries/zed](https://github.com/zed-industries/zed): high-performance
+  multiplayer editor with strict license/compliance CI expectations.
+- [aidenybai/ghast](https://github.com/aidenybai/ghast): native terminal
+  multiplexer built around Ghostty, workspace grouping, split panes, drag/drop,
+  notifications, and terminal search.
+
+Local Claude Code source inspection:
+
+- Reviewed only non-secret local file/module shape from a private Claude Code
+  source snapshot.
+- Relevant surfaces observed: `tools/`, `utils/permissions/`, `utils/mcp/`,
+  `utils/hooks/`, `utils/plugins/`, `types/permissions.ts`,
+  `types/plugin.ts`, `remote/`, `tasks/`, `assistant/sessionHistory.ts`,
+  and session/history utilities.
+- No code was copied. The takeaway is that AgentShield should track permissions,
+  plugins, MCP, hooks, remote sessions, task/subagent activity, and history as
+  first-class audit domains rather than treating a `.claude/` tree as the only
+  source of truth.
+
+## Current AgentShield Position
+
+AgentShield is already more than a static lint tool:
+
+- Rule coverage spans secrets, permissions, hooks, MCP servers, agent configs,
+  prompt injection, supply chain, taint analysis, sandbox execution, policy
+  evaluation, runtime repair/status, corpus validation, MiniClaw, and Opus
+  analysis.
+- Reports are usable by humans and machines: terminal, JSON, markdown, HTML,
+  SARIF, scan logs, and GitHub Action outputs.
+- Enterprise hooks exist: policy packs, exception metadata, expiring/expired
+  exception reporting, SARIF code scanning, and job-summary output.
+- Accuracy work is active: `runtimeConfidence`, template/example weighting,
+  docs-example downgrades, hook-manifest resolution, false-positive audit
+  guidance, and corpus readiness.
+
+The next iteration should not be "add more regex rules" by default. The higher
+leverage move is to make AgentShield remember, compare, route, and enforce
+security posture across time, repos, teams, and harnesses.
+
+## Enterprise Gaps
+
+### 1. Organization Baselines And Drift
+
+Enterprise buyers need to know whether a repo, team, or agent fleet is getting
+safer or riskier over time. AgentShield has scan logs and baseline comparison
+modules, and PR #63 now exposes that drift through GitHub Action inputs,
+outputs, annotations, and job-summary evidence. The remaining product surface
+should make baseline snapshots, CLI drift summaries, and owner-ready deltas
+explicit.
+
+Target capability:
+
+- `agentshield baseline write --output agentshield-baseline.json`
+- `agentshield scan --baseline agentshield-baseline.json`
+- Report sections for new, fixed, unchanged, suppressed, and policy-excepted
+  findings.
+- GitHub Action output that posts "security posture changed" rather than only a
+  point-in-time grade.
+
+### 2. Multi-Harness Security Adapters
+
+The market is moving toward many parallel agent harnesses, not one tool. Orca,
+Superset, dmux, OpenCode, Claude Code, Codex, Gemini, Zed, and terminal
+multiplexers all create different security surfaces.
+
+Target capability:
+
+- A small adapter registry for `claude-code`, `opencode`, `codex`, `gemini`,
+  `zed`, `dmux`, `orca`, `superset`, and `generic-terminal`.
+- Each adapter declares config paths, permission concepts, plugin surfaces,
+  MCP/tooling conventions, history/session surfaces, and CI evidence.
+- Report output groups findings by harness and confidence, so template/docs
+  findings do not look like active runtime exposure.
+
+### 3. Session And Worktree Awareness
+
+Worktree-native orchestrators change the risk model. A team can run many agents
+in parallel, each with its own branch, shell, MCP config, and local state.
+
+Target capability:
+
+- Optional scan metadata for branch, worktree path, agent name, session id,
+  provider, and orchestrator.
+- A scan-history table that answers: which worktree introduced a new permission,
+  which agent run added a risky MCP, which branch relaxed policy, and whether
+  the final merged branch fixed it.
+- A compact "security HUD" summary usable by statuslines, GitHub checks, and
+  local dashboards.
+
+### 4. Evidence Packs For Buyers And Auditors
+
+HTML reports are the right buyer-facing artifact today; native PDF is deferred.
+The deeper need is a portable evidence bundle that can be attached to audits,
+security reviews, and customer questionnaires.
+
+Target capability:
+
+- `agentshield scan --evidence-pack out/agentshield-evidence`
+- Bundle includes JSON report, HTML report, SARIF, policy evaluation,
+  exception audit, baseline diff, dependency/provenance summary, and a short
+  README explaining how to interpret the artifacts.
+- Optional redaction mode for secrets, local paths, usernames, and project names.
+
+### 5. Regression Corpus And Reference Sets
+
+Meta-Harness and Autocontext point to the same lesson: improvements need scored
+scenarios, traces, and playbooks. AgentShield already has a corpus benchmark,
+but enterprise trust needs a curated reference set for false positives,
+false negatives, and policy regressions.
+
+Target capability:
+
+- Versioned scenario fixtures for critical rules, false-positive suppressions,
+  policy exceptions, template/docs examples, plugin manifests, and hook-code
+  resolution.
+- Per-category precision/coverage reporting, not just aggregate readiness.
+- A "no accuracy regression" gate that must pass before releases.
+- Playbook notes for why a suppression exists and when it should expire.
+
+### 6. Remediation Workflow
+
+Security tools become enterprise-grade when they turn findings into accountable
+work without flooding maintainers.
+
+Target capability:
+
+- One-click or CLI-generated remediation branch for safe transforms.
+- Policy comments that group findings by owner and risk rather than by file
+  order.
+- GitHub App support for check-run annotations, issue caps, Linear sync, and
+  deferred backlog export.
+- Finding fingerprints that avoid duplicate issues across repeated scans.
+
+### 7. Threat Intelligence And Package Reputation
+
+Agent security depends on MCP packages, plugin repositories, action bundles,
+and rapidly changing CLI ecosystems. Static checks need a maintained external
+reputation layer.
+
+Target capability:
+
+- A local-first threat-intel cache for known MCP/package risks, CVEs, malware
+  package names, suspicious install scripts, mutable git dependencies, and
+  known-good packages.
+- Offline deterministic mode remains available.
+- Online enrichment is opt-in and produces clear provenance for every external
+  claim.
+
+### 8. Commercial And Team Controls
+
+AgentShield is already connected conceptually to the ECC Tools GitHub App.
+Native GitHub payments make the product path more concrete: free local scans,
+paid org policy gates, paid evidence bundles, and paid drift/history.
+
+Target capability:
+
+- Tier-aware GitHub App checks: free static scan, paid org policy enforcement,
+  paid evidence packs, paid historical drift, and paid deep analysis.
+- Seat/team mapping for policy owners and exception approvers.
+- Billing readiness checks shared with ECC-Tools so payment state never changes
+  enforcement behavior silently.
+
+## Recommended Build Order
+
+### Slice 1: Baseline Drift MVP
+
+Implement the smallest enterprise control-plane primitive: compare this scan to
+the last accepted baseline.
+
+Artifacts:
+
+- Baseline JSON schema.
+- Baseline writer and comparator.
+- Terminal and JSON report sections for new/fixed/unchanged findings.
+- Tests covering stable fingerprints, fixed findings, new findings, and policy
+  exception carry-forward.
+
+Why first:
+
+- It reuses existing scan output.
+- It improves CLI, GitHub Action, and GitHub App value at once.
+- It does not require a hosted service.
+
+### Slice 2: Evidence Pack Bundle
+
+Bundle the existing machine and human reports into a portable audit artifact.
+
+Artifacts:
+
+- `--evidence-pack