docs: add evaluator CI failure scenario (#1826)

2026-05-13 16:13:03 +08:00 · 2026-05-12 17:44:00 -04:00
parent 863519eecf
commit cd90c84c32
8 changed files with 274 additions and 6 deletions
--- a/docs/ECC-2.0-GA-ROADMAP.md
+++ b/docs/ECC-2.0-GA-ROADMAP.md
@@ -56,9 +56,10 @@ As of 2026-05-12:
  `2.0.0-rc.1`.
 - `docs/architecture/evaluator-rag-prototype.md` and
  `examples/evaluator-rag-prototype/` define the first read-only
-  self-improving harness prototype: scenario spec, trace, report, candidate
-  playbook, verifier result, accepted maintainer-salvage candidate, and
-  rejected blind-translation candidate.
+  self-improving harness prototype: scenario specs, traces, reports,
+  candidate playbooks, verifier results, accepted maintainer-salvage,
+  billing-readiness, and CI-failure-diagnosis candidates, plus rejected
+  unsafe candidates.
 - The npm package surface now excludes Python bytecode/cache artifacts through
  package `files` negation rules and a publish-surface regression test.
 - `docs/legacy-artifact-inventory.md` records that no `_legacy-documents-*`
@@ -199,7 +200,7 @@ is not complete unless the evidence column exists and has been freshly verified.
 | AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal |
 | ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus |
 | GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
-| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage and billing-readiness scenarios with trace, report, playbook, and verifier result artifacts | Needs broader evaluator corpus |
+| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, and CI-failure-diagnosis scenarios with trace, report, playbook, and verifier result artifacts | Needs broader evaluator corpus |
 | Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch |
 | Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active |
 | Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout |
--- a/docs/architecture/evaluator-rag-prototype.md
+++ b/docs/architecture/evaluator-rag-prototype.md
@@ -10,7 +10,9 @@ The fixture set lives in
 It started with the May 2026 stale-PR cleanup and salvage lane because that
 lane has real inputs, real accepted work, and real rejected work. The corpus now
 also includes a billing/Marketplace readiness scenario so launch copy cannot
-treat dry-run release evidence or roadmap intent as live billing state.
+treat dry-run release evidence or roadmap intent as live billing state. A
+CI-failure diagnosis scenario adds the log-first workflow needed before an
+agent proposes fixes for red checks.

 ## Reference Pressure

@@ -96,6 +98,8 @@ Current corpus:
  maintainer-owned branches with attribution and validation.
 - `billing-marketplace-readiness`: verifies billing, App, and Marketplace
  launch claims before public copy says they are live.
+- `ci-failure-diagnosis`: requires failed-job logs, changed-file scope, and a
+  named regression command before a CI fix playbook can be promoted.

 ## ECC Tools Mapping

@@ -129,6 +133,5 @@ A candidate can be promoted only when:

 The next evaluator/RAG corpus should add:

- a CI-failure diagnosis scenario with captured logs and a known fix;
 - a harness-config quality scenario covering MCP/plugin/hook drift;
 - an AgentShield policy exception scenario with SARIF and report evidence.
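
The `ci-failure-diagnosis` entry added in this commit names three pieces of evidence a CI fix playbook needs before promotion: failed-job logs, changed-file scope, and a named regression command. A minimal sketch of that check might look like the following. The `CiFixCandidate` class, its field names, and both functions are hypothetical illustrations, not code from the repository, and the sketch covers only these three scenario-specific requirements, not the full promotion criteria in `docs/architecture/evaluator-rag-prototype.md`.

```python
from dataclasses import dataclass, field


@dataclass
class CiFixCandidate:
    """Hypothetical shape of a candidate CI fix playbook (not from the repo)."""

    failed_job_logs: list[str] = field(default_factory=list)
    changed_files: list[str] = field(default_factory=list)
    regression_command: str = ""


def missing_evidence(candidate: CiFixCandidate) -> list[str]:
    """Return the required evidence items the candidate does not provide."""
    missing = []
    # At least one non-empty log excerpt from the failed job.
    if not any(log.strip() for log in candidate.failed_job_logs):
        missing.append("failed-job logs")
    # The files the proposed fix is scoped to.
    if not candidate.changed_files:
        missing.append("changed-file scope")
    # The command that reproduces the failure and guards against regression.
    if not candidate.regression_command.strip():
        missing.append("named regression command")
    return missing


def can_promote(candidate: CiFixCandidate) -> bool:
    """Allow promotion only when no required evidence is missing."""
    return not missing_evidence(candidate)
```

Returning the list of missing items, rather than a bare boolean, lets a verifier result say why a candidate was rejected.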