
v2.0: integrate SkillLens rubric + SkillOpt validation-gated design

- Rubric expanded 8 -> 9 dimensions (Failure Mechanism Encoding /
  Actionable Specificity / High-Risk Action Blacklist) per SkillLens
  73.8% recipe
- Multi-judge independent review, non-reuse, early stop, dry-run cap
  aligned with SkillOpt validation-gated edits
- Three human-in-the-loop checkpoints documented
- 8-item anti-pattern blacklist surfaced in README
- Empirical validation: huashu-gpt-image 80.8 -> 91.65,
  darwin-skill self-eval 86.05 -> 92.7
- Credits SkillLens (arXiv:2605.23899) + SkillOpt (arXiv:2605.23904)
  + autoresearch
alchain 3 weeks ago
parent
commit
d1bb98e19f
2 changed files with 195 additions and 21 deletions
  1. README.md (+97 -10)
  2. README_EN.md (+98 -11)

+ 97 - 10
README.md

@@ -20,7 +20,10 @@
 
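 Inspired by [Andrej Karpathy's autoresearch](https://github.com/karpathy/autoresearch), this project moves the autonomous experiment loop from model training to skill optimization. A ratchet that only turns forward.
 
+**v2.0** · Updated 2026-05-28 · A systematic upgrade that absorbs two Microsoft Research papers, [SkillLens](https://arxiv.org/abs/2605.23899) and [SkillOpt](https://arxiv.org/abs/2605.23904).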
+
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+[![Version](https://img.shields.io/badge/version-2.0-blue.svg)](#whats-new-in-20)
 [![Agent Skill](https://img.shields.io/badge/Agent%20Skill-Compatible-blueviolet)](https://skills.sh)
 [![Skills](https://img.shields.io/badge/skills.sh-Compatible-green)](https://skills.sh)
 
@@ -32,6 +35,48 @@ npx skills add alchaincyf/darwin-skill
 
 ---
 
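+## What's New in 2.0
+
+2.0 is not a patch-up release; it is a structural upgrade that systematically absorbs two Microsoft Research papers from 2026-05-22. Five changes:
+
+**1. Rubric: 8 dimensions → 9 dimensions** (adopting the 73.8% rubric recipe validated empirically by [SkillLens](https://arxiv.org/abs/2605.23899))
+
+- The former "error handling" dimension becomes **Failure Mechanism Encoding**: rather than just telling the agent not to make mistakes, known failure paths are encoded explicitly in the skill
+- The former "clarity" dimension becomes **Actionable Specificity**: vague wording such as "suggest / could consider / depending on the situation / use flexible judgment / case by case" is explicitly banned
+- New ninth dimension, **High-Risk Action Blacklist**: destructive operations such as rm / git reset --hard / force push must be explicitly listed as forbidden in the skill
+
+**2. Validation aligned with SkillOpt's validation-gated design**
+
+- Multi-judge independent review: each round launches 2 independent judges
+- Judges are not reused: the next round launches entirely new judges to avoid anchoring
+- Early stopping: a round that gains < 1 point stops automatically, to avoid padding the score with redundancy
+- Dry-run control: a dry-run ratio > 30% triggers an automatic warning
+
+**3. Human in the Loop: three layers of gatekeeping** (the core difference between Darwin and SkillOpt's fully automatic design)
+
+- Phase 1 baseline evaluation: automatic, plus human review of the report to decide what to change
+- Phase 2 single-dimension optimization: 🔴 CHECKPOINT forces a pause until the user confirms
+- Phase 2.5 test-prompt run (optional)
+- Phase 3 regression test: 🛑 STOP forces a halt when the gain falls below the threshold
+
+**4. Anti-pattern blacklist, 8 items** (explicitly forbidden anti-patterns)
+
+1. The same AI both edits and scores (SkillLens finding: LLM self-evaluation accuracy is only 46.4%)
+2. Using `git reset --hard` as a rollback method (use `git revert` instead)
+3. Piling on redundancy to pad the score
+4. Scoring directly without running the test prompts
+5. Changing multiple dimensions in one round
+6. Dry-run ratio > 30%
+7. Silently skipping exceptions
+8. Ignoring clusters of correlated dimensions
+
+**5. Measured validation data**
+
+- huashu-gpt-image skill: **80.8 → 91.5 → 91.65** (+10.85, consensus of 6 independent judges)
+- darwin-skill self-evaluation: **86.05 → 92.05 → 92.7**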
+
+---
+
 ## The Core Loop
 
 ![Core Loop](assets/chart-loop.png)
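@@ -58,7 +103,7 @@ The Agent Skill ecosystem is expanding rapidly. Claude Code, Codex, OpenClaw, Trae, Code
 |:---|:---|:---|
 | `program.md` | This SKILL.md | Defines evaluation criteria and constraints |
 | `train.py` | Each SKILL.md to be optimized | The asset being optimized; each experiment edits only this file |
-| `val_bpb` | 8-dimension weighted total score (max 100) | Quantifiable optimization target |
+| `val_bpb` | 9-dimension weighted total score (max 100) | Quantifiable optimization target |
 | `git ratchet` | keep / revert mechanism | Only commits that improve the score are kept |
 | `test set` | test-prompts.json | Verifies whether an improvement is real |
 | Fully autonomous operation | **Human in the loop** | Skill quality is subtler than loss and needs human judgment |
@@ -72,18 +117,26 @@ The Agent Skill ecosystem is expanding rapidly. Claude Code, Codex, OpenClaw, Trae, Code
 | 01 | **Single editable asset** | Only one SKILL.md is changed at a time, so variables stay controlled and improvements are attributable |
 | 02 | **Dual evaluation** | Structure scoring (static analysis) + effectiveness validation (run the tests and inspect the output) |
 | 03 | **Ratchet mechanism** | Only improvements are kept and regressions are rolled back automatically; the score only goes up |
-| 04 | **Independent scoring** | Scoring is done by a sub-agent to avoid the bias of "editing and grading your own work" |
+| 04 | **Independent scoring** | Scoring is done by a sub-agent to avoid the bias of "editing and grading your own work" (SkillLens finding: LLM self-evaluation is only 46.4% accurate) |
 | 05 | **Human in the loop** | Pause after each skill is optimized; continue to the next only after the user confirms |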
 
 ---
 
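-## 8-Dimension Evaluation System
+## 9-Dimension Evaluation System
 
-Total score: 100. Structure dimensions rely on static analysis (60 points); effectiveness dimensions must be tested live (40 points).
+Total score: 100. Structure dimensions rely on static analysis; effectiveness dimensions must be tested live. The three dimensions added or upgraded in v2.0 come directly from the empirical rubric in the SkillLens paper.
 
 ![Evaluation Rubric](assets/chart-rubric.png)
 
-> Live test performance carries the highest weight (25 points). However beautifully a skill is written, it counts for nothing if its output is poor.
+The three new or upgraded dimensions (SkillLens 73.8% rubric recipe):
+
+| Dimension | Description |
+|:---|:---|
+| **Failure Mechanism Encoding** | Explicitly encode known failure paths, not just "don't make mistakes" reminders |
+| **Actionable Specificity** | Ban vague wording such as "suggest / could consider / depending on the situation / use flexible judgment / case by case" |
+| **High-Risk Action Blacklist** | Destructive operations such as rm / git reset --hard / force push must be explicitly listed as forbidden |
+
+> Live test performance carries the highest weight. However beautifully a skill is written, it counts for nothing if its output is poor.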
 
 ---
 
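@@ -93,14 +146,15 @@ The Agent Skill ecosystem is expanding rapidly. Claude Code, Codex, OpenClaw, Trae, Code
 
 ![Optimization Lifecycle](assets/chart-phases.png)
 
-**Core logic of Phase 2**:
+**Core logic of Phase 2** (hardened in v2.0)
 
 1. Find the lowest-scoring dimension
-2. Generate 1 concrete improvement for that dimension
+2. Generate 1 concrete improvement for that dimension (only one dimension per round; anti-pattern blacklist item 5)
 3. Edit SKILL.md, git commit
-4. A sub-agent independently re-scores
-5. New score > old score → keep; otherwise → git revert
-6. Pause after each skill is finished, show the diff + score change, and wait for user confirmation
+4. Launch **2 independent sub-agents** to re-score (the next round uses entirely new judges to avoid anchoring)
+5. New score > old score → keep; otherwise → `git revert` (`git reset --hard` is forbidden; anti-pattern blacklist item 2)
+6. Round gain < 1 point → automatic early stop (avoids padding the score with redundancy)
+7. 🔴 CHECKPOINT: pause, show the diff + score change, and wait for user confirmation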
 
 ---
 
@@ -132,6 +186,39 @@ npx skills add alchaincyf/darwin-skill
 
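 The core mechanism is exactly the same: **keep only measurable improvements and roll back everything else.**
 
+v2.0 builds on this by absorbing two papers that Microsoft Research published on 2026-05-22: [SkillLens](https://arxiv.org/abs/2605.23899) provides an empirically validated rubric design, and [SkillOpt](https://arxiv.org/abs/2605.23904) provides a formal framework for validation-gated edits.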
+
+---
+
+## References & Credits
+
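+v2.0's design is based directly on the following academic work. Strongly recommended reading for researchers and engineers in the skill ecosystem: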
+
+### SkillLens
+
+> Microsoft Research. *From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills.* arXiv:2605.23899, 2026.
+
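+- Paper: https://arxiv.org/abs/2605.23899
+- **Contribution**: The empirically validated 73.8% rubric recipe. The three new or upgraded dimensions in darwin.skill v2.0 (Failure Mechanism Encoding / Actionable Specificity / High-Risk Action Blacklist) come directly from this paper. It is also the empirical source for the "same AI both edits and scores" anti-pattern: LLM self-evaluation accuracy is only 46.4%.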
+
+### SkillOpt
+
+> Microsoft Research. *SkillOpt: Executive Strategy for Self-Evolving Agent Skills.* arXiv:2605.23904, 2026.
+
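+- Paper: https://arxiv.org/abs/2605.23904
+- Project page: https://microsoft.github.io/SkillOpt/
+- Code: https://github.com/microsoft/SkillOpt
+- **Contribution**: A formal framework for validation-gated edits. It treats a skill as the "external trainable state" of a frozen model, and every edit must pass independent validation before it is kept. darwin.skill v2.0's multi-judge independent review, non-reuse of judges, early stopping, and dry-run ratio control are all aligned with this framework.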
+
+### autoresearch
+
+> Andrej Karpathy. *autoresearch.* GitHub repository, 2026.
+
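+- Code: https://github.com/karpathy/autoresearch
+- **Contribution**: The original inspiration for darwin.skill 1.0. The mapping of the core mechanisms (program.md / train.py / val_bpb / git ratchet / test set) is inherited entirely from autoresearch.
+
+**The key difference between Darwin and SkillOpt**: SkillOpt is a fully autonomous system, whereas darwin.skill emphasizes human-in-the-loop. Skill quality is subtler than validation loss, so the critical phases (baseline evaluation, single-dimension optimization, regression test) force a pause and leave the final judgment to a human.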
+
 ---
 
 ## About the Author

+ 98 - 11
README_EN.md

@@ -14,7 +14,10 @@ English | **[中文](README.md)**
 
 Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). Autonomous experiment loops, applied to skill optimization. A ratchet that only turns forward.
 
+**v2.0** · Updated 2026-05-28 · A structural upgrade integrating Microsoft Research's [SkillLens](https://arxiv.org/abs/2605.23899) and [SkillOpt](https://arxiv.org/abs/2605.23904) papers.
+
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+[![Version](https://img.shields.io/badge/version-2.0-blue.svg)](#whats-new-in-20)
 [![Agent Skill](https://img.shields.io/badge/Agent%20Skill-Compatible-blueviolet)](https://skills.sh)
 [![Skills](https://img.shields.io/badge/skills.sh-Compatible-green)](https://skills.sh)
 
@@ -26,6 +29,48 @@ npx skills add alchaincyf/darwin-skill
 
 ---
 
+## What's New in 2.0
+
+v2.0 is not a patch release. It's a structural upgrade absorbing two Microsoft Research papers published on 2026-05-22. Five concrete changes:
+
+**1. Rubric expanded from 8 → 9 dimensions** (integrating [SkillLens](https://arxiv.org/abs/2605.23899)'s empirically validated 73.8% rubric recipe)
+
+- The legacy "error handling" dimension is upgraded to **Failure Mechanism Encoding**: not just "tell the agent to be careful," but explicitly encode known failure paths into the skill.
+- The legacy "clarity" dimension is upgraded to **Actionable Specificity**: explicitly bans vague hedge words like "suggest / could consider / depending on / use judgment / case by case."
+- A new ninth dimension **High-Risk Action Blacklist**: destructive operations like `rm` / `git reset --hard` / `force push` must be explicitly listed as forbidden in the skill.
+
+**2. Validation aligned with SkillOpt's validation-gated design**
+
+- Multi-judge independent review: 2 independent judges per round
+- Judges never reused: each new round spawns fresh judges to avoid anchoring bias
+- Early stopping: if a round's score gain < 1 point, automatically halt to prevent padding for score
+- Dry-run control: warn when dry-run ratio exceeds 30%
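
The judge-rotation and dry-run rules in this list can be sketched in a few lines of Python. This is a minimal illustration, not darwin-skill's actual implementation: the function names, the `judge-N` id scheme, and the reading of "dry-run ratio" as dry runs divided by total evaluation runs are all assumptions; only the constants 2 and 30% come from the README.

```python
from itertools import count

JUDGES_PER_ROUND = 2        # from the README: 2 independent judges per round
DRY_RUN_WARN_RATIO = 0.30   # from the README: warn when the ratio exceeds 30%

_judge_ids = count(1)       # hypothetical id source, shared across rounds


def spawn_judges(n=JUDGES_PER_ROUND):
    """Return n judge ids that have never been handed out before."""
    return [f"judge-{next(_judge_ids)}" for _ in range(n)]


def dry_run_warning(dry_runs, total_runs):
    """True when the share of dry runs is strictly above the warning ratio."""
    if total_runs == 0:
        return False
    return dry_runs / total_runs > DRY_RUN_WARN_RATIO


round_one = spawn_judges()
round_two = spawn_judges()
print(round_one, round_two)
print(dry_run_warning(4, 10))
```

Drawing ids from one shared counter is one simple way to guarantee that a later round never sees a judge from an earlier one.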
+
+**3. Human-in-the-loop at three checkpoints** (the core differentiator from SkillOpt's fully autonomous design)
+
+- Phase 1 baseline eval: auto + human review the report, decide what to optimize
+- Phase 2 single-dimension edit: 🔴 CHECKPOINT mandatory pause for user confirmation
+- Phase 2.5 test-prompt run (optional)
+- Phase 3 regression test: 🛑 STOP if gain falls below threshold
+
+**4. Anti-pattern blacklist with 8 explicit forbidden behaviors**
+
+1. The same AI both edits and scores (SkillLens finding: LLM self-evaluation accuracy is only 46.4%)
+2. Using `git reset --hard` as a rollback mechanism (use `git revert`)
+3. Padding edits just to push the score up
+4. Skipping test prompts and scoring directly
+5. Changing multiple dimensions in one round
+6. Dry-run ratio > 30%
+7. Silently swallowing exceptions
+8. Ignoring correlated dimension clusters
+
+**5. Empirical validation data**
+
+- huashu-gpt-image skill: **80.8 → 91.5 → 91.65** (+10.85, consensus across 6 independent judges)
+- darwin-skill self-eval: **86.05 → 92.05 → 92.7**
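
The reported gains can be checked with exact decimal arithmetic. The sketch below assumes each arrow marks one successive scoring step, which the README does not state explicitly; the helper name `gains` is illustrative.

```python
from decimal import Decimal


def gains(scores):
    """Return the per-step gains and the total gain for a score sequence."""
    vals = [Decimal(s) for s in scores]
    steps = [b - a for a, b in zip(vals, vals[1:])]
    return steps, vals[-1] - vals[0]


huashu_steps, huashu_total = gains(["80.8", "91.5", "91.65"])
darwin_steps, darwin_total = gains(["86.05", "92.05", "92.7"])

print(huashu_steps, huashu_total)
print(darwin_steps, darwin_total)
```

Under that reading, the huashu-gpt-image total matches the stated +10.85, the darwin-skill total works out to +6.65, and the final step in both runs is below 1 point.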
+
+---
+
 ## The Core Loop
 
 ![Core Loop](assets/chart-loop-en.png)
@@ -52,7 +97,7 @@ This project maps Karpathy's autoresearch directly onto skill optimization:
 |:---|:---|:---|
 | `program.md` | This SKILL.md | Defines evaluation criteria and constraints |
 | `train.py` | Each target SKILL.md | The single editable asset per experiment |
-| `val_bpb` | 8-dimension weighted score (max 100) | Quantifiable optimization target |
+| `val_bpb` | 9-dimension weighted score (max 100) | Quantifiable optimization target |
 | `git ratchet` | keep / revert mechanism | Only improving commits survive |
 | `test set` | test-prompts.json | Validates whether improvements are real |
 | Fully autonomous | **Human in the loop** | Skill quality is more subjective than loss |
@@ -68,35 +113,44 @@ The key difference: autoresearch is fully autonomous (loss is just a number). Sk
 | 01 | **Single editable asset** | One SKILL.md per experiment. One change, one measurement, one decision |
 | 02 | **Dual evaluation** | Structure scoring (static analysis) + effectiveness scoring (live test execution) |
 | 03 | **Ratchet mechanism** | Score can only go up. Regressions are auto-reverted |
-| 04 | **Independent scoring** | The agent that edits is never the agent that scores |
+| 04 | **Independent scoring** | The agent that edits is never the agent that scores (SkillLens: LLM self-eval is only 46.4% accurate) |
 | 05 | **Human in the loop** | System pauses after each skill. You review, then continue |
 
 ---
 
-## 8-Dimension Evaluation Rubric
+## 9-Dimension Evaluation Rubric
 
-Total: 100 points. Structure (60) + Effectiveness (40).
+Total: 100 points. Structure + Effectiveness. The three dimensions added or upgraded in v2.0 come directly from SkillLens's empirically validated rubric.
 
 ![Evaluation Rubric](assets/chart-rubric-en.png)
 
-> Live test performance has the highest weight (25 points). A beautifully written skill that produces bad output is still a bad skill.
+The three new or upgraded dimensions (SkillLens 73.8% rubric recipe):
+
+| Dimension | Description |
+|:---|:---|
+| **Failure Mechanism Encoding** | Explicitly encode known failure paths, not just "be careful" reminders |
+| **Actionable Specificity** | Ban vague hedge words like "suggest / could consider / depending on / use judgment / case by case" |
+| **High-Risk Action Blacklist** | Destructive operations (rm / git reset --hard / force push) must be explicitly forbidden |
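
Two of these dimensions lend themselves to a crude static check. The sketch below is an assumption about how such a check could look, not the scoring logic darwin-skill uses: it only tests whether a phrase appears in the skill text, so it cannot tell a forbidden operation from a recommended one, and the function names are hypothetical. The phrase lists are copied from the table.

```python
import re

HEDGE_PHRASES = ["suggest", "could consider", "depending on",
                 "use judgment", "case by case"]
DESTRUCTIVE_OPS = ["rm", "git reset --hard", "force push"]


def _mentions(phrase, text):
    """Whole-word, case-insensitive search for phrase in text."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text.lower()) is not None


def find_hedges(text):
    """Hedge phrases that appear in the skill text (Actionable Specificity)."""
    return [p for p in HEDGE_PHRASES if _mentions(p, text)]


def unlisted_destructive_ops(text):
    """Destructive operations the skill text never mentions at all."""
    return [op for op in DESTRUCTIVE_OPS if not _mentions(op, text)]


skill = "Never run rm -rf or git reset --hard. You could consider caching results."
print(find_hedges(skill))
print(unlisted_destructive_ops(skill))
```

A check like this can only flag candidates for a judge to review; deciding whether an operation is actually forbidden still needs a reader.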
+
+> Live test performance has the highest weight. A beautifully written skill that produces bad output is still a bad skill.
 
 ---
 
 ## The Optimization Cycle
 
-Five phases. Only one is the core.
+Five phases. The system runs autonomously within each phase but pauses between phases for human confirmation.
 
 ![Optimization Lifecycle](assets/chart-phases-en.png)
 
-**Phase 2 (the heart):**
+**Phase 2 (the heart, hardened in v2.0):**
 
 1. Find the lowest-scoring dimension
-2. Generate one targeted improvement
+2. Generate one targeted improvement (one dimension per round, blacklist #5)
 3. Edit SKILL.md, git commit
-4. Independent sub-agent re-scores
-5. Score up → keep. Score down → git revert
-6. Pause. Show diff + score delta. Wait for human confirmation
+4. **Spawn 2 independent sub-agents** to re-score (next round spawns fresh judges to avoid anchoring)
+5. Score up → keep. Otherwise → `git revert` (never `git reset --hard`, blacklist #2)
+6. Round gain < 1 point → early-stop automatically (no padding for score)
+7. 🔴 CHECKPOINT pauses, shows diff + score delta, waits for human confirmation
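
Steps 1 and 4 to 6 can be sketched as a small decision function. This is an illustration under stated assumptions: the README does not say how the two judges' scores are combined, so a plain average is assumed here, and the function names, example scores, and commit id are hypothetical. `git revert --no-edit <commit>` is the standard git command for undoing a commit without rewriting history.

```python
from statistics import mean

MIN_GAIN = 1.0  # from the README: a round gaining < 1 point triggers early stop


def lowest_dimension(scores):
    """Step 1: name of the lowest-scoring dimension."""
    return min(scores, key=scores.get)


def judge_round(old_total, judge_totals, commit):
    """Steps 4-6: combine judges, keep or revert, then check early stop.

    Returns (kept, rollback_cmd, early_stop).
    """
    new_total = mean(judge_totals)  # assumption: plain average of the judges
    kept = new_total > old_total
    # A tie counts as "not improved" and is rolled back with git revert.
    rollback_cmd = None if kept else ["git", "revert", "--no-edit", commit]
    early_stop = (new_total - old_total) < MIN_GAIN
    return kept, rollback_cmd, early_stop


# Hypothetical per-dimension scores, for illustration only.
dims = {"Failure Mechanism Encoding": 6,
        "Actionable Specificity": 4,
        "High-Risk Action Blacklist": 8}
print(lowest_dimension(dims))
print(judge_round(80.8, [91.0, 92.0], "abc1234"))
print(judge_round(91.5, [90.0, 91.0], "abc1234"))
```

The function only decides; running the returned command and pausing at the checkpoint in step 7 are left to the caller, which keeps the human confirmation outside the automated path.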
 
 ---
 
@@ -128,6 +182,39 @@ Directly inspired by **Andrej Karpathy's [autoresearch](https://github.com/karpa
 
 The core mechanism is identical: **keep only measurable improvements, revert everything else.**
 
+v2.0 builds on this foundation by integrating two Microsoft Research papers (published 2026-05-22): [SkillLens](https://arxiv.org/abs/2605.23899) provides the empirically validated rubric design, and [SkillOpt](https://arxiv.org/abs/2605.23904) provides the formal framework of validation-gated edits.
+
+---
+
+## References & Credits
+
+v2.0's design directly builds on the following academic work. Recommended reading for researchers and engineers working on the skill ecosystem:
+
+### SkillLens
+
+> Microsoft Research. *From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills.* arXiv:2605.23899, 2026.
+
+- Paper: https://arxiv.org/abs/2605.23899
+- **Contribution**: The empirically validated 73.8% rubric recipe. The three new or upgraded dimensions in darwin.skill v2.0 (Failure Mechanism Encoding / Actionable Specificity / High-Risk Action Blacklist) come directly from this paper. It is also the empirical source for the "same AI edits and scores" anti-pattern: LLM self-evaluation accuracy is only 46.4%.
+
+### SkillOpt
+
+> Microsoft Research. *SkillOpt: Executive Strategy for Self-Evolving Agent Skills.* arXiv:2605.23904, 2026.
+
+- Paper: https://arxiv.org/abs/2605.23904
+- Project page: https://microsoft.github.io/SkillOpt/
+- Code: https://github.com/microsoft/SkillOpt
+- **Contribution**: The formal framework of validation-gated edits. Treats a skill as the "external trainable state" of a frozen model: every edit must pass independent validation to be kept. darwin.skill v2.0's multi-judge independent review, non-reuse of judges, early stopping, and dry-run ratio control all align with this framework.
+
+### autoresearch
+
+> Andrej Karpathy. *autoresearch.* GitHub repository, 2026.
+
+- Code: https://github.com/karpathy/autoresearch
+- **Contribution**: The original inspiration for darwin.skill 1.0. The mapping of core mechanisms (program.md / train.py / val_bpb / git ratchet / test set) is inherited directly from autoresearch.
+
+**The key difference between darwin and SkillOpt**: SkillOpt is fully autonomous, whereas darwin.skill emphasizes human-in-the-loop, because skill quality is more subjective than validation loss. The critical phases (baseline eval, single-dimension edit, regression test) force a pause so that a human makes the final judgment.
+
 ---
 
 ## About the Author