3 weeks ago · d1bb98e19f
--- a/README.md
+++ b/README.md
@@ -20,7 +20,10 @@
 
				 
			
 
				 Inspired by [Andrej Karpathy's autoresearch](https://github.com/karpathy/autoresearch), this project moves the autonomous experiment loop from model training to Skill optimization. A ratchet that only turns forward.
			
 
				 
			
 
				+**v2.0** · Updated 2026-05-28 · A systematic upgrade that integrates two Microsoft Research papers, [SkillLens](https://arxiv.org/abs/2605.23899) and [SkillOpt](https://arxiv.org/abs/2605.23904).
			
 
				+
			
 
				 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
			
 
				+[![Version](https://img.shields.io/badge/version-2.0-blue.svg)](#whats-new-in-20)
			
 
				 [![Agent Skill](https://img.shields.io/badge/Agent%20Skill-Compatible-blueviolet)](https://skills.sh)
			
 
				 [![Skills](https://img.shields.io/badge/skills.sh-Compatible-green)](https://skills.sh)
			
 
				 
			
@@ -32,6 +35,48 @@ npx skills add alchaincyf/darwin-skill
 
				 
			
 
				 ---
			
 
				 
			
 
				+## What's New in 2.0
			
 
				+
			
 
				+2.0 is not a patch release. It is a structural upgrade that systematically absorbs two Microsoft Research papers dated 2026-05-22. Five changes:
			
 
				+
			
 
				+**1. Rubric expanded from 8 to 9 dimensions** (adopting the 73.8% rubric recipe empirically validated by [SkillLens](https://arxiv.org/abs/2605.23899))
			
 
				+
			
 
				+- The former "error handling" dimension becomes **Failure Mechanism Encoding**: rather than merely telling the agent not to make mistakes, the skill explicitly encodes known failure paths
			
 
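				+- The former "clarity" dimension becomes **Actionable Specificity**: vague wording such as "suggest / could consider / depending on / use judgment / case by case" is explicitly banned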
			
 
				+- A new ninth dimension, **High-Risk Action Blacklist**: destructive operations such as rm / git reset --hard / force push must be explicitly listed as forbidden in the skill
			
 
				+
			
 
				+**2. Validation aligned with SkillOpt's validation-gated design**
			
 
				+
			
 
				+- Multi-judge independent review: each round launches 2 independent judges
			
 
				+- Judges are never reused: the next round launches fresh judges to avoid anchoring
			
 
				+- Early stopping: a round that gains < 1 point halts automatically, to avoid padding for score
			
 
				+- Dry-run control: a dry-run ratio > 30% triggers an automatic warning
			
 
				+
			
 
				+**3. Human in the Loop with three layers of gates** (the core difference between darwin and SkillOpt's fully automatic design)
			
 
				+
			
 
				+- Phase 1 baseline evaluation: automatic run plus human review of the report, to decide what to change
			
 
				+- Phase 2 single-dimension optimization: 🔴 CHECKPOINT forces a pause until the user confirms
			
 
				+- Phase 2.5 test-prompt run (optional)
			
 
				+- Phase 3 regression test: 🛑 STOP forces a halt when the gain falls below the threshold
			
 
				+
			
 
				+**4. Anti-pattern blacklist with 8 entries** (explicitly forbidden anti-patterns)
			
 
				+
			
 
				+1. The same AI both edits and scores (SkillLens finding: LLM self-evaluation is only 46.4% accurate)
			
 
				+2. Using `git reset --hard` as a rollback (use `git revert` instead)
			
 
				+3. Padding with redundant content just to raise the score
			
 
				+4. Skipping the test prompts and scoring directly
			
 
				+5. Changing multiple dimensions in one round
			
 
				+6. Dry-run ratio > 30%
			
 
				+7. Silently skipping exceptions
			
 
				+8. Ignoring correlated dimension clusters
			
 
				+
			
 
				+**5. Measured validation data**
			
 
				+
			
 
				+- huashu-gpt-image skill: **80.8 → 91.5 → 91.65** (+10.85, consensus of 6 independent judges)
			
 
				+- darwin-skill self-evaluation: **86.05 → 92.05 → 92.7**
			
 
				+
			
 
				+---
			
 
				+
			
 
				 ## The Core Loop
			
 
				 
			
 
				 ![Core Loop](assets/chart-loop.png)
			
@@ -58,7 +103,7 @@ The Agent Skill ecosystem is expanding fast. Claude Code, Codex, OpenClaw, Trae, Code
 
				 |:---|:---|:---|
			
 
				 | `program.md` | This SKILL.md | Defines evaluation criteria and constraint rules |
			
 
				 | `train.py` | Each SKILL.md to be optimized | The asset being optimized; each experiment edits only this file |
			
 
				-| `val_bpb` | 8-dimension weighted total (max 100) | Quantifiable optimization target |
			
 
				+| `val_bpb` | 9-dimension weighted total (max 100) | Quantifiable optimization target |
			
 
				 | `git ratchet` | keep / revert mechanism | Only commits that improve the score are kept |
			
 
				 | `test set` | test-prompts.json | Verifies that an improvement is real |
			
 
				 | Fully autonomous | **Human in the loop** | Skill quality is subtler than a loss value and needs human judgment |
			
@@ -72,18 +117,26 @@ The Agent Skill ecosystem is expanding fast. Claude Code, Codex, OpenClaw, Trae, Code
 
				 | 01 | **Single editable asset** | Only one SKILL.md is edited at a time, so variables stay controlled and improvements are attributable |
			
 
				 | 02 | **Dual evaluation** | Structure scoring (static analysis) + effectiveness validation (run the tests and inspect the output) |
			
 
				 | 03 | **Ratchet mechanism** | Only improvements are kept and regressions are rolled back automatically, so the score never goes down |
			
 
				-| 04 | **Independent scoring** | Scoring uses a sub-agent, avoiding the bias of "grading your own edits" |
			
 
				+| 04 | **Independent scoring** | Scoring uses a sub-agent, avoiding the bias of "grading your own edits" (SkillLens finding: LLM self-evaluation is only 46.4% accurate) |
			
 
				 | 05 | **Human in the loop** | After each Skill is optimized the system pauses; the user confirms before the next one starts |
			
 
				 
			
 
				 ---
			
 
				 
			
 
				-## 8-Dimension Evaluation Rubric
			
 
				+## 9-Dimension Evaluation Rubric
			
 
				 
			
 
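				-Total: 100 points. Structure dimensions rely on static analysis (60 points); effectiveness dimensions must be measured in live tests (40 points).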
			
 
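				+Total: 100 points. Structure dimensions rely on static analysis; effectiveness dimensions must be measured in live tests. v2.0's three new dimensions come directly from the empirically validated rubric in the SkillLens paper.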
			
 
				 
			
 
				 ![Evaluation Rubric](assets/chart-rubric.png)
			
 
				 
			
 
				-> Live test performance carries the highest weight (25 points). However well a Skill is written, it scores zero if its output is poor.
			
 
				+The three new dimensions (SkillLens 73.8% rubric recipe):
			
 
				+
			
 
				+| Dimension | Description |
			
 
				+|:---|:---|
			
 
				+| **Failure Mechanism Encoding** | Explicitly encode known failure paths rather than a simple "don't make mistakes" reminder |
			
 
				+| **Actionable Specificity** | Bans vague wording such as "suggest / could consider / depending on / use judgment / case by case" |
			
 
				+| **High-Risk Action Blacklist** | Destructive operations such as rm / git reset --hard / force push must be explicitly forbidden |
			
 
				+
			
 
				+> Live test performance carries the highest weight. However well a Skill is written, it scores zero if its output is poor.
			
 
				 
			
 
				 ---
			
 
				 
			
@@ -93,14 +146,15 @@ The Agent Skill ecosystem is expanding fast. Claude Code, Codex, OpenClaw, Trae, Code
 
				 
			
 
				 ![Optimization Lifecycle](assets/chart-phases.png)
			
 
				 
			
 
				-**Core logic of Phase 2**:
			
 
				+**Core logic of Phase 2** (hardened in v2.0):
			
 
				 
			
 
				 1. Find the lowest-scoring dimension
			
 
				-2. Generate 1 concrete improvement for that dimension
			
 
				+2. Generate 1 concrete improvement for that dimension (one dimension per round, blacklist entry 5)
			
 
				 3. Edit SKILL.md, git commit
			
 
				-4. A sub-agent independently re-scores
			
 
				-5. New score > old score → keep; otherwise → git revert
			
 
				-6. Pause after each Skill, show the diff + score change, and wait for user confirmation
			
 
				+4. Launch **2 independent sub-agents** to re-score (the next round uses fresh judges to avoid anchoring)
			
 
				+5. New score > old score → keep; otherwise → `git revert` (`git reset --hard` is forbidden, blacklist entry 2)
			
 
				+6. Round gain < 1 point → automatic early stop (to avoid padding for score)
			
 
				+7. 🔴 CHECKPOINT: pause, show the diff + score change, and wait for user confirmation
			
 
				 
			
 
				 ---
			
 
				 
			
@@ -132,6 +186,39 @@ npx skills add alchaincyf/darwin-skill
 
				 
			
 
				 The core mechanism is identical: **keep only measurable improvements and roll back everything else.**
			
 
				 
			
 
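				+On top of this, v2.0 integrates two papers published by Microsoft Research on 2026-05-22: [SkillLens](https://arxiv.org/abs/2605.23899) provides an empirically validated rubric design, and [SkillOpt](https://arxiv.org/abs/2605.23904) provides a formal framework for validation-gated edits.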
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## References & Credits
			
 
				+
			
 
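				+v2.0's design builds directly on the following academic work, which is strongly recommended to researchers and engineers in the skill ecosystem: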
			
 
				+
			
 
				+### SkillLens
			
 
				+
			
 
				+> Microsoft Research. *From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills.* arXiv:2605.23899, 2026.
			
 
				+
			
 
				+- Paper: https://arxiv.org/abs/2605.23899
			
 
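				+- **Contribution**: The empirically validated 73.8% rubric recipe. darwin.skill v2.0's three new dimensions (Failure Mechanism Encoding / Actionable Specificity / High-Risk Action Blacklist) come directly from this paper. It is also the empirical source for the "same AI edits and scores" anti-pattern: LLM self-evaluation is only 46.4% accurate.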
			
 
				+
			
 
				+### SkillOpt
			
 
				+
			
 
				+> Microsoft Research. *SkillOpt: Executive Strategy for Self-Evolving Agent Skills.* arXiv:2605.23904, 2026.
			
 
				+
			
 
				+- Paper: https://arxiv.org/abs/2605.23904
			
 
				+- Project page: https://microsoft.github.io/SkillOpt/
			
 
				+- Code: https://github.com/microsoft/SkillOpt
			
 
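				+- **Contribution**: A formal framework for validation-gated edits. It treats a skill as the "external trainable state" of a frozen model, so every edit must pass independent validation before it is kept. darwin.skill v2.0's multi-judge independent review, non-reuse of judges, early stopping, and dry-run ratio control all align with this framework.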
			
 
				+
			
 
				+### autoresearch
			
 
				+
			
 
				+> Andrej Karpathy. *autoresearch.* GitHub repository, 2026.
			
 
				+
			
 
				+- Code: https://github.com/karpathy/autoresearch
			
 
				+- **Contribution**: The original inspiration for darwin.skill 1.0. The mapping of the core mechanisms (program.md / train.py / val_bpb / git ratchet / test set) is inherited entirely from autoresearch.
			
 
				+
			
 
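				+**The key difference between darwin and SkillOpt**: SkillOpt is a fully autonomous system, while darwin.skill emphasizes human-in-the-loop. Skill quality is subtler than validation loss, so the critical phases (baseline evaluation, single-dimension optimization, regression test) force a pause and leave the final judgment to a human.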
			
 
				+
			
 
				 ---
			
 
				 
			
 
				 ## About the Author
			
--- a/README_EN.md
+++ b/README_EN.md
@@ -14,7 +14,10 @@ English | **[中文](README.md)**
 
				 
			
 
				 Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). Autonomous experiment loops, applied to skill optimization. A ratchet that only turns forward.
			
 
				 
			
 
				+**v2.0** · Updated 2026-05-28 · A structural upgrade integrating Microsoft Research's [SkillLens](https://arxiv.org/abs/2605.23899) and [SkillOpt](https://arxiv.org/abs/2605.23904) papers.
			
 
				+
			
 
				 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
			
 
				+[![Version](https://img.shields.io/badge/version-2.0-blue.svg)](#whats-new-in-20)
			
 
				 [![Agent Skill](https://img.shields.io/badge/Agent%20Skill-Compatible-blueviolet)](https://skills.sh)
			
 
				 [![Skills](https://img.shields.io/badge/skills.sh-Compatible-green)](https://skills.sh)
			
 
				 
			
@@ -26,6 +29,48 @@ npx skills add alchaincyf/darwin-skill
 
				 
			
 
				 ---
			
 
				 
			
 
				+## What's New in 2.0
			
 
				+
			
 
				+v2.0 is not a patch release. It's a structural upgrade absorbing two Microsoft Research papers published on 2026-05-22. Five concrete changes:
			
 
				+
			
 
				+**1. Rubric expanded from 8 → 9 dimensions** (integrating [SkillLens](https://arxiv.org/abs/2605.23899)'s empirically validated 73.8% rubric recipe)
			
 
				+
			
 
				+- The legacy "error handling" dimension is upgraded to **Failure Mechanism Encoding**: not just "tell the agent to be careful," but explicitly encode known failure paths into the skill.
			
 
				+- The legacy "clarity" dimension is upgraded to **Actionable Specificity**: explicitly bans vague hedge words like "suggest / could consider / depending on / use judgment / case by case."
			
 
				+- A new ninth dimension, **High-Risk Action Blacklist**: destructive operations like `rm` / `git reset --hard` / `git push --force` must be explicitly listed as forbidden in the skill.
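
As a rough illustration of the last two dimensions, the sketch below lints a skill's text for the vague phrases named above and reports which destructive operations the skill never mentions. The function names, the regular expressions, and the substring matching are assumptions made for this example, not darwin-skill's actual checker. Mentioning an operation is also weaker than forbidding it, so a judge would still need to read the surrounding text.

```python
import re

# Vague phrases quoted in the Actionable Specificity bullet above.
VAGUE_PHRASES = ["suggest", "could consider", "depending on", "use judgment", "case by case"]

# Hypothetical patterns for the destructive operations named in the blacklist bullet.
HIGH_RISK_PATTERNS = {
    "rm": r"\brm\b",
    "git reset --hard": r"git reset --hard",
    "force push": r"force[- ]push|push --force",
}


def find_vague_phrases(skill_text):
    """Return the vague phrases that occur in the skill text (naive substring match)."""
    lowered = skill_text.lower()
    return [phrase for phrase in VAGUE_PHRASES if phrase in lowered]


def unlisted_high_risk_actions(skill_text):
    """Return the destructive operations the skill text never mentions at all."""
    return [
        name
        for name, pattern in HIGH_RISK_PATTERNS.items()
        if not re.search(pattern, skill_text, re.IGNORECASE)
    ]


sample = "You could consider cleaning up. Never run rm -rf or git reset --hard."
vague = find_vague_phrases(sample)
unlisted = unlisted_high_risk_actions(sample)
```

For the sample text, `vague` flags "could consider" and `unlisted` reports that force pushing is never addressed.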
			
 
				+
			
 
				+**2. Validation aligned with SkillOpt's validation-gated design**
			
 
				+
			
 
				+- Multi-judge independent review: 2 independent judges per round
			
 
				+- Judges never reused: each new round spawns fresh judges to avoid anchoring bias
			
 
				+- Early stopping: if a round's score gain is < 1 point, halt automatically to prevent score padding
			
 
				+- Dry-run control: warn when dry-run ratio exceeds 30%
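
The judge and dry-run rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the README does not say how the two judges' scores are combined, so the plain average here is an assumption, and the function names are hypothetical.

```python
from statistics import mean

DRY_RUN_WARN_RATIO = 0.30  # threshold from the list above: warn when the ratio exceeds 30%


def consensus_score(judge_scores):
    """Combine independent judges' totals by averaging (assumed aggregation rule)."""
    if len(judge_scores) < 2:
        raise ValueError("each round needs at least 2 independent judges")
    return mean(judge_scores)


def dry_run_warning(dry_runs, total_runs):
    """Return True when the share of dry runs exceeds the 30% threshold."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return dry_runs / total_runs > DRY_RUN_WARN_RATIO
```

With 4 dry runs out of 10 the warning fires, while exactly 3 out of 10 does not, because the rule is strictly "greater than 30%".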
			
 
				+
			
 
				+**3. Human-in-the-loop at three checkpoints** (the core differentiator from SkillOpt's fully autonomous design)
			
 
				+
			
 
				+- Phase 1 baseline eval: the evaluation runs automatically, then a human reviews the report and decides what to optimize
			
 
				+- Phase 2 single-dimension edit: 🔴 CHECKPOINT mandatory pause for user confirmation
			
 
				+- Phase 2.5 test-prompt run (optional)
			
 
				+- Phase 3 regression test: 🛑 STOP if gain falls below threshold
			
 
				+
			
 
				+**4. Anti-pattern blacklist with 8 explicit forbidden behaviors**
			
 
				+
			
 
				+1. The same AI both edits and scores (SkillLens finding: LLM self-eval is only 46.4% accurate)
			
 
				+2. Using `git reset --hard` as a rollback mechanism (use `git revert`)
			
 
				+3. Padding edits just to push the score up
			
 
				+4. Skipping test prompts and scoring directly
			
 
				+5. Changing multiple dimensions in one round
			
 
				+6. Dry-run ratio > 30%
			
 
				+7. Silently swallowing exceptions
			
 
				+8. Ignoring correlated dimension clusters
			
 
				+
			
 
				+**5. Empirical validation data**
			
 
				+
			
 
				+- huashu-gpt-image skill: **80.8 → 91.5 → 91.65** (+10.85, consensus across 6 independent judges)
			
 
				+- darwin-skill self-eval: **86.05 → 92.05 → 92.7**
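
The per-round gains behind these totals can be checked with a short calculation. The sketch below uses `Decimal` so the two-decimal scores subtract exactly; the helper name `round_gains` is made up for this example.

```python
from decimal import Decimal


def round_gains(scores):
    """Return the gain of each round from a sequence of score strings."""
    values = [Decimal(s) for s in scores]
    return [later - earlier for earlier, later in zip(values, values[1:])]


huashu = round_gains(["80.8", "91.5", "91.65"])   # huashu-gpt-image skill
darwin = round_gains(["86.05", "92.05", "92.7"])  # darwin-skill self-eval

huashu_total = sum(huashu)
darwin_total = sum(darwin)
```

The huashu-gpt-image gains are 10.7 and 0.15, which sum to the reported +10.85. The darwin-skill gains are 6.00 and 0.65, for a total of 6.65. In both cases the second round gains less than 1 point, which is the condition under which the early-stopping rule halts further rounds.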
			
 
				+
			
 
				+---
			
 
				+
			
 
				 ## The Core Loop
			
 
				 
			
 
				 ![Core Loop](assets/chart-loop-en.png)
			
@@ -52,7 +97,7 @@ This project maps Karpathy's autoresearch directly onto skill optimization:
 
				 |:---|:---|:---|
			
 
				 | `program.md` | This SKILL.md | Defines evaluation criteria and constraints |
			
 
				 | `train.py` | Each target SKILL.md | The single editable asset per experiment |
			
 
				-| `val_bpb` | 8-dimension weighted score (max 100) | Quantifiable optimization target |
			
 
				+| `val_bpb` | 9-dimension weighted score (max 100) | Quantifiable optimization target |
			
 
				 | `git ratchet` | keep / revert mechanism | Only improving commits survive |
			
 
				 | `test set` | test-prompts.json | Validates whether improvements are real |
			
 
				 | Fully autonomous | **Human in the loop** | Skill quality is more subjective than loss |
			
@@ -68,35 +113,44 @@ The key difference: autoresearch is fully autonomous (loss is just a number). Sk
 
				 | 01 | **Single editable asset** | One SKILL.md per experiment. One change, one measurement, one decision |
			
 
				 | 02 | **Dual evaluation** | Structure scoring (static analysis) + effectiveness scoring (live test execution) |
			
 
				 | 03 | **Ratchet mechanism** | Score can only go up. Regressions are auto-reverted |
			
 
				-| 04 | **Independent scoring** | The agent that edits is never the agent that scores |
			
 
				+| 04 | **Independent scoring** | The agent that edits is never the agent that scores (SkillLens: LLM self-eval is only 46.4% accurate) |
			
 
				 | 05 | **Human in the loop** | System pauses after each skill. You review, then continue |
			
 
				 
			
 
				 ---
			
 
				 
			
 
				-## 8-Dimension Evaluation Rubric
			
 
				+## 9-Dimension Evaluation Rubric
			
 
				 
			
 
				-Total: 100 points. Structure (60) + Effectiveness (40).
			
 
				+Total: 100 points. Structure + Effectiveness. v2.0's three new dimensions come directly from SkillLens's empirically validated rubric.
			
 
				 
			
 
				 ![Evaluation Rubric](assets/chart-rubric-en.png)
			
 
				 
			
 
				-> Live test performance has the highest weight (25 points). A beautifully written skill that produces bad output is still a bad skill.
			
 
				+The three new dimensions (SkillLens 73.8% rubric recipe):
			
 
				+
			
 
				+| Dimension | Description |
			
 
				+|:---|:---|
			
 
				+| **Failure Mechanism Encoding** | Explicitly encode known failure paths, not just "be careful" reminders |
			
 
				+| **Actionable Specificity** | Ban vague hedge words like "suggest / could consider / depending on / use judgment / case by case" |
			
 
				+| **High-Risk Action Blacklist** | Destructive operations (rm / git reset --hard / force push) must be explicitly forbidden |
			
 
				+
			
 
				+> Live test performance has the highest weight. A beautifully written skill that produces bad output is still a bad skill.
			
 
				 
			
 
				 ---
			
 
				 
			
 
				 ## The Optimization Cycle
			
 
				 
			
 
				-Five phases. Only one is the core.
			
 
				+Five phases. The system runs autonomously within each phase but pauses between phases for human confirmation.
			
 
				 
			
 
				 ![Optimization Lifecycle](assets/chart-phases-en.png)
			
 
				 
			
 
				-**Phase 2 (the heart):**
			
 
				+**Phase 2 (the heart, hardened in v2.0):**
			
 
				 
			
 
				 1. Find the lowest-scoring dimension
			
 
				-2. Generate one targeted improvement
			
 
				+2. Generate one targeted improvement (one dimension per round, blacklist #5)
			
 
				 3. Edit SKILL.md, git commit
			
 
				-4. Independent sub-agent re-scores
			
 
				-5. Score up → keep. Score down → git revert
			
 
				-6. Pause. Show diff + score delta. Wait for human confirmation
			
 
				+4. **Spawn 2 independent sub-agents** to re-score (next round spawns fresh judges to avoid anchoring)
			
 
				+5. Score up → keep. Otherwise → `git revert` (never `git reset --hard`, blacklist #2)
			
 
				+6. Round gain < 1 point → early-stop automatically (no padding for score)
			
 
				+7. 🔴 CHECKPOINT pauses, shows diff + score delta, waits for human confirmation
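
Steps 5 and 6 amount to a small decision rule, sketched below. The names `RoundDecision` and `decide_round` are hypothetical, and treating a reverted round as an early stop (its gain is also below 1 point) is this example's reading of the list, not something the README states. The function only builds the `git revert` command; running it, and the human checkpoint in step 7, are left to the caller.

```python
from dataclasses import dataclass, field

MIN_GAIN = 1.0  # early-stop threshold from step 6


@dataclass
class RoundDecision:
    keep: bool                                        # True when the commit survives the ratchet
    stop: bool                                        # True when the loop halts after this round
    git_command: list = field(default_factory=list)   # command to run when reverting


def decide_round(old_score, new_score, min_gain=MIN_GAIN):
    """Apply the keep/revert ratchet and the early-stop rule to one round."""
    gain = new_score - old_score
    stop = gain < min_gain
    if gain > 0:
        return RoundDecision(keep=True, stop=stop)
    # No improvement: undo with a new revert commit instead of rewriting history.
    return RoundDecision(keep=False, stop=stop, git_command=["git", "revert", "--no-edit", "HEAD"])


first = decide_round(80.8, 91.5)     # large gain: keep and continue
second = decide_round(91.5, 91.65)   # small gain: keep, then stop early
third = decide_round(92.0, 91.0)     # regression: revert
```

Returning the command rather than executing it keeps the rule easy to test and leaves the actual git call behind the checkpoint.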
			
 
				 
			
 
				 ---
			
 
				 
			
@@ -128,6 +182,39 @@ Directly inspired by **Andrej Karpathy's [autoresearch](https://github.com/karpa
 
				 
			
 
				 The core mechanism is identical: **keep only measurable improvements, revert everything else.**
			
 
				 
			
 
				+v2.0 builds on this foundation by integrating two Microsoft Research papers (published 2026-05-22): [SkillLens](https://arxiv.org/abs/2605.23899) provides the empirically validated rubric design, and [SkillOpt](https://arxiv.org/abs/2605.23904) provides the formal framework of validation-gated edits.
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## References & Credits
			
 
				+
			
 
				+v2.0's design directly builds on the following academic work. Recommended reading for researchers and engineers working on the skill ecosystem:
			
 
				+
			
 
				+### SkillLens
			
 
				+
			
 
				+> Microsoft Research. *From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills.* arXiv:2605.23899, 2026.
			
 
				+
			
 
				+- Paper: https://arxiv.org/abs/2605.23899
			
 
				+- **Contribution**: The empirically validated 73.8% rubric recipe. darwin.skill v2.0's three new dimensions (Failure Mechanism Encoding / Actionable Specificity / High-Risk Action Blacklist) come directly from this paper. It is also the empirical source for the "same AI edits and scores" anti-pattern — LLM self-eval accuracy is only 46.4%.
			
 
				+
			
 
				+### SkillOpt
			
 
				+
			
 
				+> Microsoft Research. *SkillOpt: Executive Strategy for Self-Evolving Agent Skills.* arXiv:2605.23904, 2026.
			
 
				+
			
 
				+- Paper: https://arxiv.org/abs/2605.23904
			
 
				+- Project page: https://microsoft.github.io/SkillOpt/
			
 
				+- Code: https://github.com/microsoft/SkillOpt
			
 
				+- **Contribution**: The formal framework of validation-gated edits. Treats a skill as the "external trainable state" of a frozen model: every edit must pass independent validation to be kept. darwin.skill v2.0's multi-judge independent review, non-reuse of judges, early stopping, and dry-run ratio control all align with this framework.
			
 
				+
			
 
				+### autoresearch
			
 
				+
			
 
				+> Andrej Karpathy. *autoresearch.* GitHub repository, 2026.
			
 
				+
			
 
				+- Code: https://github.com/karpathy/autoresearch
			
 
				+- **Contribution**: The original inspiration for darwin.skill 1.0. The mapping of core mechanisms (program.md / train.py / val_bpb / git ratchet / test set) is inherited directly from autoresearch.
			
 
				+
			
 
				+**The key difference between darwin and SkillOpt**: SkillOpt is fully autonomous, while darwin.skill emphasizes human-in-the-loop because skill quality is more subjective than validation loss. Critical phases (baseline eval, single-dimension edit, regression test) force a pause so that a human makes the final judgment.
			
 
				+
			
 
				 ---
			
 
				 
			
 
				 ## About the Author