docs: add evaluator rag prototype (#1824)

This commit is contained in:
Affaan Mustafa
2026-05-12 17:04:39 -04:00
committed by GitHub
parent f2deedcf3d
commit dcf5668b27
9 changed files with 494 additions and 4 deletions

View File

@@ -54,6 +54,11 @@ As of 2026-05-12:
dry-run without `--force`, local marketplace discovery, temp-home local
install, enabled plugin listing, and clean uninstall for `ecc@ecc`
`2.0.0-rc.1`.
- `docs/architecture/evaluator-rag-prototype.md` and
`examples/evaluator-rag-prototype/` define the first read-only
self-improving harness prototype: scenario spec, trace, report, candidate
playbook, verifier result, accepted maintainer-salvage candidate, and
rejected blind-translation candidate.
- The npm package surface now excludes Python bytecode/cache artifacts through
package `files` negation rules and a publish-surface regression test.
- `docs/legacy-artifact-inventory.md` records that no `_legacy-documents-*`
@@ -194,7 +199,7 @@ is not complete unless the evidence column exists and has been freshly verified.
| AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal |
| ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus |
| GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates exist | Needs evaluation/RAG prototype |
| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define the first read-only scenario, trace, report, playbook, and verifier result | Needs broader evaluator corpus |
| Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch |
| Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active |
| Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout |
@@ -213,7 +218,7 @@ back to the repo evidence and merge commits.
| Queue hygiene and salvage | GitHub PR/issue state, salvage ledger | Append ledger entries for any future stale closures | Every cleanup batch |
| Release and publication | rc.1 release docs, publication readiness doc | Naming matrix and plugin submission/contact checklist | Before any tag |
| Harness OS core | Audit, adapter matrix, observability docs, `ecc2/` | HUD/session-control acceptance spec | Weekly until GA |
| Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype design | Before deep analyzer expansion |
| Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype plus fixture contract | Expand to CI, billing, harness-config, and AgentShield scenarios |
| AgentShield enterprise | AgentShield PR evidence and roadmap notes | PDF-export decision or next enterprise signal | After value decision |
| ECC Tools app | ECC-Tools PR evidence, billing audit, risk taxonomy | Capacity-backed Linear rollout or broader evaluator/RAG corpus slice | Next implementation batch |
| Linear progress | Linear project status updates and this mirror | Status update with queue/evidence/missing gates | Every significant merge batch |
@@ -407,5 +412,6 @@ Acceptance:
executive report, corpus benchmark output, and exception lifecycle audit.
2. Enable/configure the merged Linear backlog sync path after workspace issue
capacity clears or the Linear workspace is upgraded.
3. Expand the evaluator/RAG corpus with real cleanup-batch cases as future
maintainer-owned examples land.
3. Expand the evaluator/RAG corpus beyond the first stale-salvage prototype to
CI failure diagnosis, harness-config drift, billing readiness, and
AgentShield policy exception scenarios.

View File

@@ -136,6 +136,13 @@ Repo work:
- `agentshield`: feed prompt-injection and config-risk findings into regression
suites.
Current prototype:
- `docs/architecture/evaluator-rag-prototype.md` defines the read-only
evaluator/RAG artifact contract.
- `examples/evaluator-rag-prototype/` records the first scenario spec, trace,
report, candidate playbook, and verifier result for stale-PR salvage.
Verification:
- read-only prototype that emits a trace, report, candidate playbook, and

View File

@@ -0,0 +1,122 @@
# Evaluator RAG Prototype
ECC 2.0 needs a self-improving harness loop that can learn from real work
without blindly mutating a user's Claude, Codex, OpenCode, dmux, Zed, or
terminal setup. This prototype defines the smallest read-only artifact set for
that loop.
The fixture set lives in
[`examples/evaluator-rag-prototype/`](../../examples/evaluator-rag-prototype/).
It uses the May 2026 stale-PR cleanup and salvage lane as the first concrete
scenario because that lane has real inputs, real accepted work, and real
rejected work.
## Reference Pressure
- Meta-Harness: treat the harness itself as an experiment with scenario specs,
verifier results, and promoted playbooks.
- Autocontext: store traces, reports, artifacts, and reusable improvements
before changing installed agent assets.
- Claude HUD: expose context, tools, todos, agent activity, checks, and risk so
an evaluator can judge a run after the fact.
- Hermes Agent: keep skills, memories, scheduler-like follow-ups, and terminal
gateway behavior explicit instead of hiding local commands.
- dmux, Orca, Superset, and Ghast: preserve worktree/session state so parallel
agent work can be compared, resumed, or closed cleanly.
- ECC Tools: route evaluator findings into PR comments, check runs, and Linear
backlog items without flooding GitHub.
## Artifact Contract
Every evaluator/RAG run is read-only until a verifier promotes a playbook.
| Artifact | Purpose | Fixture |
| --- | --- | --- |
| Scenario spec | Declares the objective, allowed evidence, forbidden actions, and pass/fail gates. | `scenario.json` |
| Trace | Captures observation, retrieval, proposal, verification, and promotion events. | `trace.json` |
| Report | Summarizes scores, evidence coverage, risks, and recommended next action. | `report.json` |
| Candidate playbook | Describes the maintainer-owned workflow that could be reused later. | `candidate-playbook.md` |
| Verifier result | Accepts or rejects candidates with concrete reasons and rollback notes. | `verifier-result.json` |
The prototype deliberately separates retrieval from action. A run can retrieve
closed PR diffs, Linear status, CI history, and local docs, but it cannot close,
merge, publish, tag, or rewrite configs as part of the evaluator pass.
## Phase Model
1. Observe the current queue, dirty worktrees, branch state, open PRs/issues,
discussions, CI state, and release gates.
2. Retrieve relevant reference evidence: stale-salvage ledger rows, prior
maintainer PRs, current docs, analyzer findings, CI failures, and harness
adapter rules.
3. Propose one or more playbooks with source attribution and expected
validation gates.
4. Verify each playbook against explicit acceptance and rejection rules.
5. Promote only the candidate that improves the scenario without widening blast
radius.
6. Record rollback guidance and unresolved manual-review tails.
## First Scenario
The first scenario is `stale-pr-salvage-maintainer-branch`.
It models the rule Affaan set during the May 2026 cleanup: stale closure is
queue hygiene, not loss of useful work. Useful closed PR work should be ported
into maintainer-owned PRs with attribution/backlinks, while generated churn,
bulk localization, and ambiguous translator work stay out of blind
cherry-picks.
The verifier accepts a maintainer salvage branch that:
- credits source PRs;
- avoids raw private context and personal paths;
- does not import stale bulk localization without translator review;
- records a durable ledger update;
- runs the same validation gates as a normal code, docs, or catalog change;
- leaves release publication actions approval-gated.
The verifier rejects a blind cherry-pick proposal that:
- imports stale translation/doc churn wholesale;
- skips the current catalog/install architecture;
- lacks attribution;
- lacks tests or ledger updates;
- mutates release or plugin publication state.
## ECC Tools Mapping
ECC Tools already flags missing RAG/evaluator evidence for retrieval,
embedding, ranking, and evaluator changes. This prototype gives those checks a
target shape:
- `scenario.json` maps to analyzer corpus inputs.
- `trace.json` maps to golden traces and run telemetry.
- `report.json` maps to PR comment summaries and Linear backlog summaries.
- `candidate-playbook.md` maps to the suggested follow-up PR body.
- `verifier-result.json` maps to pass/fail check-run evidence.
Future ECC Tools work should consume these artifacts as fixture shape before it
adds hosted retrieval or model-backed judging. The local prototype is enough to
prove the contract before any paid API or vector store is introduced.
## Promotion Rules
A candidate can be promoted only when:
- the verifier result is `accepted`;
- at least one rejected candidate proves the verifier can say no;
- every source PR or reference artifact has attribution;
- the proposed action is maintainer-owned and reversible;
- validation commands are named;
- unresolved translator, release, billing, or publication items remain blocked
until separately approved.
## Next Expansion
The next evaluator/RAG corpus should add:
- a CI-failure diagnosis scenario with captured logs and a known fix;
- a harness-config quality scenario covering MCP/plugin/hook drift;
- a billing-readiness scenario that separates verified Marketplace claims from
launch-copy assumptions;
- an AgentShield policy exception scenario with SARIF and report evidence.