2 months ago · 10ce74f711
--- a/README.md
+++ b/README.md
@@ -0,0 +1,234 @@
 
				+# Auto Skill Optimizer
			
 
				+
			
 
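+**Optimize your Claude Code Skills the way you would train a model.**
+
+Inspired by [Andrej Karpathy's autoresearch](https://github.com/karpathy/autoresearch), this project moves the autonomous experiment loop from model training to Skill optimization. The core idea is the same: evaluate, improve, verify with real runs, then keep or roll back. A ratchet that only turns forward.
+
+![Hero](assets/aso-hero.png)
+
+> "The core idea of autoresearch is simple: let the system run experiments autonomously, evaluate the results, and keep only the improvements that work. A ratchet that only turns forward."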
			
 
				+> — Andrej Karpathy
			
 
				+
			
 
				+---
			
 
				+
			
 
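+## Why this exists
+
+The Claude Code Skill ecosystem is expanding quickly. With 10 Skills you can maintain them by hand; with 60+ Skills you need a system.
+
+Traditional Skill review is **purely structural**: it checks whether the format is right, whether steps are numbered, and whether paths resolve. But a perfectly formatted Skill can still produce poor results.
+
+Auto Skill Optimizer works differently: it evaluates both **structural quality** and **real-world effectiveness**, then keeps only the changes that actually improve the score.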
			
 
				+
			
 
				+---
			
 
				+
			
 
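+## From autoresearch to Skill Optimizer
+
+This project is directly inspired by Karpathy's autoresearch. Its approach: write a `program.md` that defines goals and constraints, let an agent generate and test code changes autonomously, and keep only measurable improvements.
+
+We apply the same idea to Skill optimization:
+
+| autoresearch | Auto Skill Optimizer | Why it maps this way |
+|:---|:---|:---|
+| `program.md` | This project's SKILL.md | Defines the evaluation criteria and constraint rules |
+| `train.py` | Each SKILL.md being optimized | The asset under optimization; each experiment edits only this file |
+| `val_bpb` | 8-dimension weighted total (max 100) | A quantifiable optimization target |
+| `git ratchet` | keep / revert mechanism | Only commits that improve the score are kept |
+| `test set` | test-prompts.json | Verifies that an improvement is real |
+| Fully autonomous | **Human in the loop** | Skill quality is subtler than a loss value and needs human judgment |
+
+The key difference: autoresearch is fully autonomous because loss values can be compared automatically. Skill optimization adds a **human in the loop**: after each Skill is optimized, the run pauses, shows the diff and score changes, and waits for confirmation before continuing. Whether a Skill is "good" cannot be judged by numbers alone the way a loss can.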
			
 
				+
			
 
				+---
			
 
				+
			
 
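+## Five core principles
+
+| # | Principle | Description |
+|:---|:---|:---|
+| 01 | **Single editable asset** | Change one SKILL.md at a time, so variables stay controlled and improvements are attributable |
+| 02 | **Dual evaluation** | Structural scoring (static analysis) + effectiveness verification (run tests and inspect the output) |
+| 03 | **Ratchet mechanism** | Keep only improvements and roll back regressions automatically; the score never goes down |
+| 04 | **Independent scoring** | A sub-agent does the scoring, avoiding the bias of grading your own edits |
+| 05 | **Human in the loop** | Pause after each Skill; continue to the next only after the user confirms |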
			
 
				+
			
 
				+---
			
 
				+
			
 
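+## 8-dimension rubric
+
+Total: 100 points. Structure dimensions are scored by static analysis (60 points); effectiveness dimensions must be tested in practice (40 points).
+
+![Rubric](assets/aso-rubric.png)
+
+### Structure dimensions (60 points)
+
+| # | Dimension | Weight | Criteria |
+|:---|:---|:---:|:---|
+| 1 | Frontmatter quality | 8 | Well-formed name, description includes trigger phrases, length within limits |
+| 2 | Workflow clarity | 15 | Steps are explicit and executable; each has clear inputs and outputs |
+| 3 | Edge-case coverage | 10 | Exception handling, fallback paths, error recovery |
+| 4 | Checkpoint design | 7 | User confirmation before key decisions |
+| 5 | Instruction specificity | 15 | Concrete parameters, formats, and examples; directly executable |
+| 6 | Resource integration | 5 | references/scripts/assets paths are reachable |
+
+### Effectiveness dimensions (40 points)
+
+| # | Dimension | Weight | Criteria |
+|:---|:---|:---:|:---|
+| 7 | Overall architecture | 15 | Clear structural hierarchy, consistent with the ecosystem |
+| 8 | **Live-test performance** | **25** | Run 2-3 test prompts and compare output quality with vs. without the Skill |
+
+> Live-test performance carries the highest weight (25 points). However polished a Skill looks, it counts for nothing if its real output is poor.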
			
 
				+
			
 
				+---
			
 
				+
			
 
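+## Optimization loop
+
+Five phases, from initialization to the summary report. The system runs autonomously within each phase but pauses between phases for human confirmation.
+
+![Optimization Cycle](assets/aso-cycle.png)
+
+```
+Phase 0    Initialize       Set the scope, create a git branch, load history
+Phase 0.5  Test design      Design 2-3 test prompts per Skill
+Phase 1    Baseline         Score all 8 dimensions to set the pre-optimization baseline
+Phase 2    Optimize         Diagnose → improve → re-score → keep/revert, up to 3 rounds
+Phase 3    Summary report   Before/After score table + key improvements
+```
+
+**Core logic of Phase 2**:
+
+1. Find the lowest-scoring dimension
+2. Generate one concrete improvement targeting that dimension
+3. Edit SKILL.md and git commit
+4. A sub-agent re-scores independently
+5. New score > old score → keep; otherwise → git revert
+6. After each Skill is finished, pause, show the diff and score changes, and wait for user confirmation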
			
 
				+
			
 
				+---
			
 
				+
			
 
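+## Ratchet mechanism
+
+The score can only go up. Each round either improves the Skill or rolls back cleanly, so small regressions never accumulate over time.
+
+![Ratchet](assets/aso-ratchet.png)
+
+```
+72 (baseline) → 78 (keep) → 75 (revert!) → 84 (keep) → 87 (keep)
+```
+
+The 75 in round 2 is below the current best of 78, so it is rolled back automatically. The effective baseline stays locked at 78, and the next improvement builds on 78. The sequence illustrates the ratchet across attempts; within a single run, SKILL.md caps each Skill at 3 rounds and moves on after the first revert.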
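The keep-or-revert rule in the example above can be replayed in a few lines of Python. This is only a sketch of the ratchet idea: the function name `run_ratchet` is hypothetical, and unlike the loop in SKILL.md it keeps going after a revert so that the whole example sequence can be replayed.

```python
def run_ratchet(baseline: float, candidates: list[float]) -> tuple[float, list[str]]:
    """Replay candidate scores against a ratchet and return (best, decisions)."""
    best = baseline
    decisions = []
    for score in candidates:
        if score > best:
            # Strict improvement: keep the change and raise the bar.
            best = score
            decisions.append("keep")
        else:
            # Tie or regression: revert, the bar stays where it was.
            decisions.append("revert")
    return best, decisions


best, decisions = run_ratchet(72, [78, 75, 84, 87])
print(best, decisions)  # 87 ['keep', 'revert', 'keep', 'keep']
```

Because the comparison is strict, a candidate that merely ties the current best is also reverted.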
			
 
				+
			
 
				+---
			
 
				+
			
 
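+## Why dual evaluation
+
+![Comparison](assets/aso-comparison.png)
+
+**Structure-only review** can only tell you whether a Skill is written to spec. **Dual evaluation** also tells you whether it works well when run.
+
+| Structure-only review | Auto Skill Optimizer |
+|:---|:---|
+| Checks frontmatter format | Structural scoring and live testing together |
+| Validates step numbering and descriptions | Runs real prompts and compares output quality |
+| Confirms file paths are valid | Independent sub-agent scoring to avoid bias |
+| Cannot judge actual output quality | One dimension changed per round, for precise attribution |
+| Cannot detect over-constraint | Rolls back if the score does not rise |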
			
 
				+
			
 
				+---
			
 
				+
			
 
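+## Quick start
+
+### Install
+
+Put `SKILL.md` into your Claude Code Skills directory:
+
+```bash
+# Create the Skill directory first; cp fails if it does not exist
+mkdir -p ~/.claude/skills/huashu-auto-skill-optimizer
+cp SKILL.md ~/.claude/skills/huashu-auto-skill-optimizer/SKILL.md
+```
+
+### Usage
+
+```
+# Evaluate the quality of all Skills, score only, no edits (prompt: "evaluate all skills")
+> 评估所有 skills
+
+# Optimize a specific Skill (prompt: "optimize the huashu-slides skill")
+> 优化 huashu-slides 这个 skill
+
+# Full optimization, recommended for first use (prompt: "optimize all skills")
+> 优化所有 skills
+
+# View history (prompt: "show the skill optimization history")
+> 看看 skill 优化历史
+```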
			
 
				+
			
 
+### Example output
			
 
				+
			
 
				+```
			
 
				+┌──────────────────────────┬────────┬────────┬────────┐
			
 
				+│ Skill                    │ Before │ After  │ Δ      │
			
 
				+├──────────────────────────┼────────┼────────┼────────┤
			
 
				+│ huashu-proofreading      │ 78     │ 87     │ +9     │
			
 
				+│ huashu-slides            │ 72     │ 83     │ +11    │
			
 
				+│ huashu-publish           │ 81     │ 88     │ +7     │
			
 
				+├──────────────────────────┼────────┼────────┼────────┤
			
 
+│ Average                  │ 77     │ 86     │ +9     │
			
 
				+└──────────────────────────┴────────┴────────┴────────┘
			
 
				+```
			
 
				+
			
 
				+---
			
 
				+
			
 
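+## Design inspiration
+
+The design of this project is directly inspired by **Andrej Karpathy's [autoresearch](https://github.com/karpathy/autoresearch)**.
+
+autoresearch demonstrates an elegant idea: model-training research can be turned into an autonomous experiment loop. Define the goal (`program.md`), let an agent keep generating and testing changes (`train.py`), and use a quantifiable metric (`val_bpb`) to decide whether to keep or roll back. The ratchet ensures quality only goes up.
+
+Auto Skill Optimizer applies the same idea to Claude Code Skill optimization. The differences:
+
+1. **Evaluation is more complex**: Skill quality cannot be captured by a single number the way a loss can; it takes a weighted score across 8 dimensions
+2. **Live testing is required**: structural scoring covers only part of the total (60 points); the remaining 40 points require running real prompts and checking the results
+3. **Human in the loop**: what makes a Skill "good" is subjective, so a person makes the final call
+
+But the core mechanism is identical: **keep only measurable improvements and roll back everything else.**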
			
 
				+
			
 
				+---
			
 
				+
			
 
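+## Constraints
+
+1. Do not change a Skill's core function or purpose
+2. Do not introduce new dependencies
+3. Change only one dimension per round, so that improvements stay attributable
+4. The optimized SKILL.md must not exceed 150% of its original size
+5. All changes live on a git branch; roll back with git revert
+6. Effectiveness dimensions must be scored by a sub-agent, never by the agent that made the edit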
			
 
				+
			
 
				+---
			
 
				+
			
 
+## File structure
+
+```
+auto-skill-optimizer/
+├── README.md              # The file you are reading
+├── SKILL.md               # Core: evaluation criteria + optimization workflow + constraints
+├── showcase.html          # Pentagram-style visual showcase page
+├── assets/                # README images
+│   ├── aso-hero.png
+│   ├── aso-rubric.png
+│   ├── aso-cycle.png
+│   ├── aso-ratchet.png
+│   ├── aso-comparison.png
+│   └── aso-mapping.png
+└── examples/              # Sample optimization logs (to be added)
+```
			
 
				+
			
 
				+---
			
 
				+
			
 
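+## Acknowledgements
+
+- [autoresearch](https://github.com/karpathy/autoresearch) by [Andrej Karpathy](https://github.com/karpathy) provided the core design inspiration
+- The Skill ecosystem of [Claude Code](https://claude.ai/code) provided the optimization use case
+- [花叔](https://x.com/AlchainHust)'s practice with 60+ Skills provided a real-world test environment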
			
 
				+
			
 
				+---
			
 
				+
			
 
				+**License**: MIT
			
--- a/SKILL.md
+++ b/SKILL.md
@@ -0,0 +1,322 @@
 
				+---
			
 
				+name: huashu-auto-skill-optimizer
			
 
				+description: Autonomous skill optimizer inspired by Karpathy's autoresearch. Evaluates SKILL.md files using an 8-dimension rubric (structure + effectiveness), runs hill-climbing with git version control, and validates improvements through test prompts. Use when user mentions "优化skill", "skill评分", "自动优化", "auto optimize skills", "skill质量检查", "这个skill写得不好", "帮我改改skill", "skill怎么样", "提升skill质量", "skill review", "skill打分".
			
 
				+---
			
 
				+
			
 
				+# Auto Skill Optimizer
			
 
				+
			
 
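+> Applies the autonomous experiment loop from Karpathy's autoresearch to the continuous optimization of skills.
+> Core idea: **evaluate → improve → live-test → human confirmation → keep or roll back**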
			
 
				+
			
 
				+---
			
 
				+
			
 
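+## Design philosophy
+
+Principles adapted from autoresearch:
+1. **Single editable asset**: change only one SKILL.md at a time
+2. **Dual evaluation**: structural scoring (static analysis) + effectiveness verification (run tests and inspect the output)
+3. **Ratchet mechanism**: keep only improvements and roll back regressions automatically
+4. **Independent scoring**: a sub-agent does the scoring, avoiding the bias of grading your own edits
+5. **Human in the loop**: pause after each skill; continue only after the user confirms
+
+How this differs from structure-only review: it looks not only at whether SKILL.md is written to spec, but at whether **the actual output is better** after the change.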
			
 
				+
			
 
				+---
			
 
				+
			
 
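+## Evaluation rubric (8 dimensions, 100 points total)
+
+### Structure dimensions (60 points): static analysis
+
+| # | Dimension | Weight | Criteria |
+|---|------|------|---------|
+| 1 | **Frontmatter quality** | 8 | Well-formed name; description covers what it does + when to use it + trigger phrases; ≤1024 characters |
+| 2 | **Workflow clarity** | 15 | Steps are explicit and executable, numbered, each with clear inputs/outputs |
+| 3 | **Edge-case coverage** | 10 | Handles exceptions, has fallback paths, error recovery |
+| 4 | **Checkpoint design** | 7 | User confirmation before key decisions; prevents runaway autonomy |
+| 5 | **Instruction specificity** | 15 | Unambiguous; concrete parameters/formats/examples; directly executable |
+| 6 | **Resource integration** | 5 | references/scripts/assets are referenced correctly and paths are reachable |
+
+### Effectiveness dimensions (40 points): live testing required
+
+| # | Dimension | Weight | Criteria |
+|---|------|------|---------|
+| 7 | **Overall architecture** | 15 | Clear hierarchy, nothing redundant or missing, consistent with the 花叔 ecosystem |
+| 8 | **Live-test performance** | 25 | Run the test prompts and check whether output quality matches what the skill claims |
+
+### Scoring rules
+- Dimensions 1-7: rate each 1-10, then multiply by its weight to get the dimension score
+- Dimension 8 (live-test performance): run 2-3 test prompts and rate output quality 1-10
+- **Total = Σ(rating × weight) / 10**, max 100
+- A change is kept only if the new total is **strictly higher** than the old one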
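The scoring rules above amount to a weighted sum. A minimal Python sketch follows; the weights are copied from the rubric tables, while the dictionary keys and the function names `total_score` and `should_keep` are illustrative assumptions rather than identifiers the skill defines.

```python
# Weights copied from the rubric; the key names are illustrative only.
WEIGHTS = {
    "frontmatter": 8,
    "workflow_clarity": 15,
    "edge_cases": 10,
    "checkpoints": 7,
    "specificity": 15,
    "resources": 5,
    "architecture": 15,
    "live_performance": 25,
}


def total_score(ratings: dict[str, int]) -> float:
    """Return sum(rating * weight) / 10 for ratings on a 1-10 scale."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the 8 rubric dimensions")
    for name, rating in ratings.items():
        if not 1 <= rating <= 10:
            raise ValueError(f"{name}: rating {rating} is outside 1-10")
    return sum(ratings[name] * weight for name, weight in WEIGHTS.items()) / 10


def should_keep(old_total: float, new_total: float) -> bool:
    """A change is kept only when the total strictly improves."""
    return new_total > old_total
```

Since every rating is at least 1, the lowest possible total under these rules is 10, not 0.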
			
 
				+
			
 
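+### About the "live-test performance" dimension
+
+This is the biggest difference from structure-only scoring. How it is scored:
+
+1. Design 2-3 **typical user prompts** per skill (the most common use cases, not edge cases)
+2. Run them with sub-agents: one with the skill, one without (baseline)
+3. Compare output quality and score it on these questions:
+   - Does the output fulfil the user's intent?
+   - Is the quality clearly better than the no-skill baseline?
+   - Does the skill introduce side effects (excess verbosity, drifting off-task, odd formatting)?
+
+If sub-agents cannot be run (time or resource limits), fall back to a "dry run": after reading the skill, walk through how a typical prompt would execute and judge whether the flow is sound. Mark such rows as `dry_run` in results.tsv.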
			
 
				+
			
 
				+---
			
 
				+
			
 
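+## Autonomous optimization loop
+
+### Phase 0: Initialize
+
+```
+1. Confirm the optimization scope:
+   - All skills → scan .claude/skills/*/SKILL.md
+   - Specific skills → the list the user provides
+2. Create a git branch: auto-optimize/YYYYMMDD-HHMM
+3. Initialize results.tsv (if it does not exist)
+4. Read the existing results.tsv to learn the optimization history
+```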
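Steps 1 and 2 of Phase 0 can be sketched in Python as below. The helper names `branch_name` and `find_skill_files` are hypothetical; only the branch pattern `auto-optimize/YYYYMMDD-HHMM` and the glob `.claude/skills/*/SKILL.md` come from the text above.

```python
from datetime import datetime
from pathlib import Path


def branch_name(now: datetime) -> str:
    """Format the Phase 0 branch name auto-optimize/YYYYMMDD-HHMM."""
    return now.strftime("auto-optimize/%Y%m%d-%H%M")


def find_skill_files(skills_root: Path) -> list[Path]:
    """Return every <skill>/SKILL.md directly under skills_root, sorted."""
    return sorted(skills_root.glob("*/SKILL.md"))
```

For instance, `find_skill_files(Path(".claude/skills"))` would list the files in scope for a full run, and `branch_name(datetime.now())` gives the name to pass to git when creating the branch.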
			
 
				+
			
 
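+### Phase 0.5: Test prompt design
+
+Before evaluation, design test prompts for each skill. This step is critical: without test prompts, the "live-test performance" dimension cannot be scored.
+
+```
+for each skill:
+  1. Read SKILL.md and understand what it does
+  2. Design 2-3 test prompts covering:
+     - The most typical use case (happy path)
+     - A slightly more complex or ambiguous case
+  3. Save to <skill directory>/test-prompts.json:
+     [
+       {"id": 1, "prompt": "what the user would say", "expected": "short description of the expected output"},
+       {"id": 2, "prompt": "...", "expected": "..."}
+     ]
+```
+
+Show all test prompts to the user and **get confirmation before starting evaluation**. The quality of the test prompts determines whether optimization heads in the right direction.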
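Before showing the prompts to the user, it may help to check that `test-prompts.json` has the shape described above. The sketch below assumes the three keys from the example (`id`, `prompt`, `expected`) are all required and that ids must be unique; the function name `load_test_prompts` is hypothetical.

```python
import json

REQUIRED_KEYS = {"id", "prompt", "expected"}


def load_test_prompts(text: str) -> list[dict]:
    """Parse test-prompts.json content and check the shape shown above."""
    prompts = json.loads(text)
    if not isinstance(prompts, list) or not 2 <= len(prompts) <= 3:
        raise ValueError("expected a JSON array of 2-3 test prompts")
    for entry in prompts:
        if not isinstance(entry, dict):
            raise ValueError("each test prompt must be a JSON object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"prompt {entry.get('id')!r} is missing {sorted(missing)}")
    ids = [entry["id"] for entry in prompts]
    if len(set(ids)) != len(ids):
        raise ValueError("test prompt ids must be unique")
    return prompts
```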
			
 
				+
			
 
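+### Phase 1: Baseline evaluation
+
+```
+for each skill in scope:
+
+  # Structural scoring (the main agent can do this)
+  1. Read the full SKILL.md
+  2. Score dimensions 1-7 one by one (with a brief rationale)
+
+  # Effectiveness scoring (done by sub-agents, independent of the main agent)
+  3. For each test prompt, spawn sub-agents:
+     - with_skill: run the test prompt with SKILL.md loaded
+     - baseline: run the same prompt without the skill
+  4. Compare the two sets of outputs and score dimension 8
+
+  # Summary
+  5. Compute the weighted total
+  6. Record it in results.tsv
+```
+
+**If sub-agents are unavailable** (timeouts, environment limits), score dimension 8 with a dry run and mark it `dry_run`. Do not skip the dimension just because tests cannot run: even a simulated walkthrough is better than ignoring effectiveness entirely.
+
+After the baseline evaluation, show the scorecard:
+
+```
+┌──────────────────────────┬───────┬───────────────┬───────────────────┐
+│ Skill                    │ Score │ Structure gap │ Effectiveness gap │
+├──────────────────────────┼───────┼───────────────┼───────────────────┤
+│ huashu-proofreading      │ 78    │ Edge cases    │ Test prompt 2     │
+│ huashu-slides            │ 72    │ Specificity   │ Same as baseline  │
+├──────────────────────────┼───────┼───────────────┼───────────────────┤
+│ Average                  │ 75    │               │                   │
+└──────────────────────────┴───────┴───────────────┴───────────────────┘
+```
+
+**Pause for user confirmation before entering the optimization loop.**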
			
 
				+
			
 
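+### Phase 2: Optimization loop
+
+After the user confirms, sort skills by baseline score from low to high and optimize the weakest first.
+
+```
+for each skill:
+  round = 0
+  while round < MAX_ROUNDS (default 3):
+    round += 1
+
+    # Step 1: Diagnose
+    Find the lowest-scoring dimension (structure or effectiveness)
+
+    # Step 2: Propose an improvement
+    For the lowest dimension, generate one concrete improvement:
+      - What to change (specific paragraph/lines)
+      - Why (which rubric item it addresses)
+      - Expected score gain
+
+    # Step 3: Apply the improvement
+    Edit SKILL.md
+    git add + commit (message: "optimize {skill}: {summary of the change}")
+
+    # Step 4: Re-evaluate
+    - Structure dimensions: the main agent re-scores
+    - Effectiveness dimensions: spawn an independent sub-agent to rerun the test prompts (critical: never grade your own work)
+
+    # Step 5: Decide
+    if new_total > old_total:
+      status = "keep"; update old_total
+    else:
+      status = "revert"
+      git revert HEAD (creates a new commit that undoes the change; do not use reset --hard)
+      Record the failed attempt in results.tsv
+      break  # this skill has hit a plateau; move on to the next one
+
+    # Step 6: Log
+    Append a row to results.tsv
+
+  # === Human checkpoint after each skill ===
+  Show a summary of the changes to this skill:
+    - git diff (before vs. after)
+    - Score changes (which dimensions went up or down)
+    - Test prompt output comparison (if tests were run)
+  Wait for the user to confirm before continuing to the next skill.
+  If the user rejects the result, roll back to the skill's pre-optimization version.
+```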
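The control flow of the per-skill loop above (cap at `MAX_ROUNDS`, keep on strict improvement, revert and stop otherwise) can be sketched in Python. The three callbacks are hypothetical stand-ins for the agent's actions: `propose_and_commit` covers Steps 1-3, `evaluate` covers Step 4, and `revert` stands for `git revert HEAD`; none of them is an API defined by the skill.

```python
from typing import Callable

MAX_ROUNDS = 3  # default from the pseudocode above


def optimize_skill(
    baseline: float,
    propose_and_commit: Callable[[int], None],
    evaluate: Callable[[], float],
    revert: Callable[[], None],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[float, list[tuple[int, float, float, str]]]:
    """Hill-climb one skill; return (best score, log of rounds)."""
    best = baseline
    log = []  # entries are (round, old_score, new_score, status)
    for round_no in range(1, max_rounds + 1):
        propose_and_commit(round_no)  # Steps 1-3: diagnose, propose, edit + commit
        new_score = evaluate()        # Step 4: re-score after the change
        if new_score > best:          # Step 5: strict improvement only
            log.append((round_no, best, new_score, "keep"))
            best = new_score
        else:
            revert()                  # undo the commit, then stop on this skill
            log.append((round_no, best, new_score, "revert"))
            break
    return best, log
```

Passing the actions in as callbacks keeps the decision logic testable without touching git or spawning sub-agents.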
			
 
				+
			
 
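+### Phase 2.5: Exploratory rewrite (optional)
+
+When hill-climbing breaks at round 1 (no gain) for 2 consecutive skills, propose one "exploratory rewrite":
+
+```
+1. Pick one plateaued skill
+2. Note the commit of the current best version (it is already committed, so git stash would have nothing to save)
+3. Rewrite SKILL.md from scratch (not a tweak: reorganize the structure and wording) and commit
+4. Re-evaluate
+5. if rewrite > current best: keep the rewrite
+   else: git revert HEAD to restore the current best
+```
+
+This addresses hill-climbing's local-optimum problem: sometimes you have to tear down before rebuilding to break through a plateau.
+**Only do this with the user's consent.**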
			
 
				+
			
 
+### Phase 3: Summary report
+
+```
+## Optimization report
+
+### Overview
+- Skills optimized: N
+- Total experiments: M
+- Improvements kept: X (Y%)
+- Reverts: Z
+- Live verification: A full tests / B dry runs
+
+### Score changes
+┌──────────────────────────┬────────┬────────┬────────┐
+│ Skill                    │ Before │ After  │ Δ      │
+├──────────────────────────┼────────┼────────┼────────┤
+│ huashu-proofreading      │ 78     │ 87     │ +9     │
+│ huashu-slides            │ 72     │ 83     │ +11    │
+├──────────────────────────┼────────┼────────┼────────┤
+│ Average                  │ 75     │ 85     │ +10    │
+└──────────────────────────┴────────┴────────┴────────┘
+
+### Key improvements
+1. [skill-A] Added edge-case handling; test output quality improved noticeably
+2. [skill-B] Restructured the workflow; the advantage over baseline widened
+```
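The averages and the kept-improvement percentage in the report are plain arithmetic. A small Python sketch, with hypothetical helper names `summarize` and `keep_rate`, reproduces the numbers in the sample table:

```python
def summarize(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Average the (before, after) scores and report the average change."""
    if not scores:
        raise ValueError("no skills to summarize")
    count = len(scores)
    avg_before = sum(before for before, _ in scores.values()) / count
    avg_after = sum(after for _, after in scores.values()) / count
    return {"before": avg_before, "after": avg_after, "delta": avg_after - avg_before}


def keep_rate(kept: int, experiments: int) -> float:
    """Percentage of experiments that were kept (Y% in the overview)."""
    return 100 * kept / experiments if experiments else 0.0


report = summarize({"huashu-proofreading": (78, 87), "huashu-slides": (72, 83)})
print(report)  # {'before': 75.0, 'after': 85.0, 'delta': 10.0}
```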
			
 
				+
			
 
				+---
			
 
				+
			
 
+## results.tsv format
+
+```tsv
+timestamp	commit	skill	old_score	new_score	status	dimension	note	eval_mode
+2026-03-31T10:00	baseline	huashu-proofreading	-	78	baseline	-	Initial evaluation	full_test
+2026-03-31T10:05	a1b2c3d	huashu-proofreading	78	84	keep	Edge cases	Added fallback	full_test
+2026-03-31T10:10	b2c3d4e	huashu-proofreading	84	82	revert	Specificity	Over-specified	dry_run
+```
+
+The `eval_mode` column is either `full_test` (sub-agent tests were run) or `dry_run` (simulated walkthrough).
+File location: `.claude/skills/auto-optimize-results.tsv`
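Appending a row in this format can be done with Python's standard `csv` module. The sketch below takes the column order from the header shown above; the function name `append_result` and the check on `eval_mode` are assumptions added for illustration.

```python
import csv
from typing import TextIO

COLUMNS = [
    "timestamp", "commit", "skill", "old_score", "new_score",
    "status", "dimension", "note", "eval_mode",
]
EVAL_MODES = {"full_test", "dry_run"}


def append_result(stream: TextIO, row: dict[str, str], write_header: bool = False) -> None:
    """Write one tab-separated results row, optionally preceded by the header."""
    if row.get("eval_mode") not in EVAL_MODES:
        raise ValueError("eval_mode must be 'full_test' or 'dry_run'")
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    if write_header:
        writer.writeheader()
    writer.writerow(row)
```

When writing to the real file, opening it with `open(path, "a", newline="", encoding="utf-8")` and passing `write_header=True` only for a new file keeps the log append-only.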
			
 
				+
			
 
				+---
			
 
				+
			
 
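+## Optimization strategy library
+
+Ordered by priority; each round applies only the single highest-priority item:
+
+### P0: Effectiveness problems (found by live testing)
+- Test output drifts from user intent → check whether the skill contains misleading instructions
+- Output is worse with the skill than without → the skill may be over-constraining; consider trimming it
+- Output format is not as expected → add an explicit output template
+
+### P1: Structural problems
+- Frontmatter lacks trigger phrases → add trigger phrases in Chinese and English
+- No Phase/Step structure → reorganize into a linear flow
+- No user-confirmation checkpoints → insert them at key decisions
+
+### P2: Specificity problems
+- Vague steps ("process the image") → replace with concrete operations and parameters
+- Missing input/output specs → add formats, paths, examples
+- Missing error handling → add "if X fails, then Y"
+
+### P3: Readability problems
+- Overlong paragraphs → split them and use tables
+- Repeated descriptions → merge and deduplicate
+- No quick reference → add a TL;DR or decision tree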
			
 
				+
			
 
				+---
			
 
				+
			
 
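+## Constraints
+
+1. **Do not change a skill's core function or purpose**: optimize only "how it is written" and "how it executes", never "what it does"
+2. **No new dependencies**: do not add scripts or references files the skill did not already have
+3. **One dimension per round**: avoids multiple changes that cannot be attributed
+4. **Keep file size reasonable**: the optimized SKILL.md should not exceed 150% of its original size
+5. **Respect 花叔's style**: primarily Chinese, concise above all
+6. **Reversible**: all changes live on a git branch; use git revert rather than reset --hard
+7. **Scoring independence**: effectiveness dimensions must be scored by a sub-agent, or at minimum a dry run; never "edit, then immediately grade" in the same context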
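Constraint 4 is easy to check mechanically. The sketch below assumes "size" means the UTF-8 byte length of the file content, which the text does not specify; the function name `within_size_limit` is hypothetical.

```python
def within_size_limit(original: str, optimized: str, ratio: float = 1.5) -> bool:
    """Return True if the optimized text is at most ratio x the original size.

    Size is measured in UTF-8 bytes here; that choice is an assumption.
    """
    original_size = len(original.encode("utf-8"))
    optimized_size = len(optimized.encode("utf-8"))
    return optimized_size <= ratio * original_size
```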
			
 
				+
			
 
				+---
			
 
				+
			
 
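+## Usage
+
+### Full optimization (recommended for first use)
+```
+User: "优化所有skills" (optimize all skills)
+→ Full Phase 0-3 workflow
+→ Tip: run the baseline evaluation first, then focus on the 5-10 lowest-scoring skills
+```
+
+### Single skill
+```
+User: "优化 huashu-slides 这个skill" (optimize the huashu-slides skill)
+→ Run Phase 0.5-2 for the specified skill only
+```
+
+### Evaluate only
+```
+User: "评估所有skills的质量" (evaluate the quality of all skills)
+→ Run Phase 0.5-1 only (test prompt design + baseline evaluation); do not enter the optimization loop
+```
+
+### View history
+```
+User: "看看skill优化历史" (show the skill optimization history)
+→ Read and display results.tsv
+```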
			
 
				+
			
 
				+---
			
 
				+
			
 
+## Design inspiration
			
 
				+
			
 
				+> "You write the goals and constraints in program.md; let an agent generate and test code deltas indefinitely; keep only what measurably improves the objective."
			
 
				+> — Karpathy, autoresearch
			
 
				+
			
 
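+How this skill maps onto it:
+- **program.md** → this file (evaluation rubric and constraints)
+- **train.py** → each SKILL.md
+- **val_bpb** → 8-dimension weighted total (including live-test performance)
+- **git ratchet** → keep only commits that improve the score
+- **test set** → each skill's test-prompts.json
+
+Differences: it adds a human in the loop (autoresearch is fully autonomous; skill optimization needs human judgment) and a dual evaluation mechanism (structure + effectiveness), because whether a skill is "good" is subtler than a loss value.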
			
--- a/assets/aso-comparison.png
+++ b/assets/aso-comparison.png
--- a/assets/aso-cycle.png
+++ b/assets/aso-cycle.png
--- a/assets/aso-hero.png
+++ b/assets/aso-hero.png
--- a/assets/aso-mapping.png
+++ b/assets/aso-mapping.png
--- a/assets/aso-ratchet.png
+++ b/assets/aso-ratchet.png
--- a/assets/aso-rubric.png
+++ b/assets/aso-rubric.png
--- a/showcase.html
+++ b/showcase.html
@@ -0,0 +1,1059 @@
 
				+<!DOCTYPE html>
			
 
				+<html lang="zh-CN">
			
 
				+<head>
			
 
				+<meta charset="UTF-8">
			
 
				+<meta name="viewport" content="width=device-width, initial-scale=1.0">
			
 
				+<title>自主技能优化系统</title>
			
 
				+<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800;900&display=swap" rel="stylesheet">
			
 
				+<style>
			
 
				+  :root {
			
 
				+    --accent: #D4532B;
			
 
				+    --black: #111111;
			
 
				+    --dark: #1a1a1a;
			
 
				+    --mid: #666666;
			
 
				+    --light: #999999;
			
 
				+    --border: #d0d0d0;
			
 
				+    --bg: #fafafa;
			
 
				+    --white: #ffffff;
			
 
				+    --col: calc((100% - 11 * 24px) / 12);
			
 
				+  }
			
 
				+
			
 
				+  * { margin: 0; padding: 0; box-sizing: border-box; }
			
 
				+
			
 
				+  body {
			
 
				+    font-family: 'Inter', -apple-system, sans-serif;
			
 
				+    background: var(--bg);
			
 
				+    color: var(--black);
			
 
				+    font-size: 15px;
			
 
				+    line-height: 1.6;
			
 
				+    -webkit-font-smoothing: antialiased;
			
 
				+  }
			
 
				+
			
 
				+  .container {
			
 
				+    max-width: 1200px;
			
 
				+    margin: 0 auto;
			
 
				+    padding: 0 48px;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ HERO ═══════ */
			
 
				+  .hero {
			
 
				+    padding: 120px 0 80px;
			
 
				+    border-bottom: 1px solid var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .hero-label {
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 600;
			
 
				+    letter-spacing: 3px;
			
 
				+    text-transform: uppercase;
			
 
				+    color: var(--accent);
			
 
				+    margin-bottom: 32px;
			
 
				+  }
			
 
				+
			
 
				+  .hero h1 {
			
 
				+    font-size: 88px;
			
 
				+    font-weight: 900;
			
 
				+    line-height: 0.95;
			
 
				+    letter-spacing: -3px;
			
 
				+    margin-bottom: 40px;
			
 
				+    max-width: 900px;
			
 
				+  }
			
 
				+
			
 
				+  .hero-subtitle {
			
 
				+    font-size: 20px;
			
 
				+    font-weight: 400;
			
 
				+    color: var(--mid);
			
 
				+    line-height: 1.5;
			
 
				+    max-width: 640px;
			
 
				+    margin-bottom: 56px;
			
 
				+  }
			
 
				+
			
 
				+  .hero-subtitle strong {
			
 
				+    color: var(--black);
			
 
				+    font-weight: 600;
			
 
				+  }
			
 
				+
			
 
				+  .hero-quote {
			
 
				+    border-left: 3px solid var(--accent);
			
 
				+    padding: 20px 0 20px 24px;
			
 
				+    max-width: 600px;
			
 
				+  }
			
 
				+
			
 
				+  .hero-quote p {
			
 
				+    font-size: 16px;
			
 
				+    font-weight: 400;
			
 
				+    font-style: italic;
			
 
				+    color: var(--dark);
			
 
				+    line-height: 1.7;
			
 
				+  }
			
 
				+
			
 
				+  .hero-quote cite {
			
 
				+    display: block;
			
 
				+    margin-top: 12px;
			
 
				+    font-size: 12px;
			
 
				+    font-weight: 600;
			
 
				+    letter-spacing: 1px;
			
 
				+    text-transform: uppercase;
			
 
				+    font-style: normal;
			
 
				+    color: var(--light);
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ SECTION HEADERS ═══════ */
			
 
				+  .section {
			
 
				+    padding: 80px 0;
			
 
				+    border-bottom: 1px solid var(--border);
			
 
				+  }
			
 
				+
			
 
				+  .section:last-child {
			
 
				+    border-bottom: none;
			
 
				+  }
			
 
				+
			
 
				+  .section-num {
			
 
				+    font-size: 12px;
			
 
				+    font-weight: 700;
			
 
				+    letter-spacing: 2px;
			
 
				+    color: var(--accent);
			
 
				+    margin-bottom: 16px;
			
 
				+    font-variant-numeric: tabular-nums;
			
 
				+  }
			
 
				+
			
 
				+  .section-title {
			
 
				+    font-size: 48px;
			
 
				+    font-weight: 800;
			
 
				+    line-height: 1.05;
			
 
				+    letter-spacing: -1.5px;
			
 
				+    margin-bottom: 16px;
			
 
				+  }
			
 
				+
			
 
				+  .section-lead {
			
 
				+    font-size: 17px;
			
 
				+    color: var(--mid);
			
 
				+    max-width: 560px;
			
 
				+    line-height: 1.6;
			
 
				+    margin-bottom: 48px;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ PRINCIPLES ═══════ */
			
 
				+  .principles-grid {
			
 
				+    display: grid;
			
 
				+    grid-template-columns: 1fr 1fr;
			
 
				+    gap: 0;
			
 
				+  }
			
 
				+
			
 
				+  .principle {
			
 
				+    padding: 32px 32px 32px 0;
			
 
				+    border-top: 1px solid var(--border);
			
 
				+  }
			
 
				+
			
 
				+  .principle:nth-child(even) {
			
 
				+    padding-left: 32px;
			
 
				+    border-left: 1px solid var(--border);
			
 
				+  }
			
 
				+
			
 
				+  .principle:nth-child(1),
			
 
				+  .principle:nth-child(2) {
			
 
				+    border-top: 1px solid var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .principle-num {
			
 
				+    font-size: 36px;
			
 
				+    font-weight: 800;
			
 
				+    color: var(--accent);
			
 
				+    margin-bottom: 12px;
			
 
				+    line-height: 1;
			
 
				+  }
			
 
				+
			
 
				+  .principle h3 {
			
 
				+    font-size: 18px;
			
 
				+    font-weight: 700;
			
 
				+    margin-bottom: 8px;
			
 
				+    letter-spacing: -0.3px;
			
 
				+  }
			
 
				+
			
 
				+  .principle p {
			
 
				+    font-size: 14px;
			
 
				+    color: var(--mid);
			
 
				+    line-height: 1.6;
			
 
				+  }
			
 
				+
			
 
				+  .principle--full {
			
 
				+    grid-column: 1 / -1;
			
 
				+    padding-left: 0;
			
 
				+    border-left: none;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ RUBRIC ═══════ */
			
 
				+  .rubric-header {
			
 
				+    display: flex;
			
 
				+    gap: 48px;
			
 
				+    margin-bottom: 48px;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-stat {
			
 
				+    display: flex;
			
 
				+    align-items: baseline;
			
 
				+    gap: 12px;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-stat-num {
			
 
				+    font-size: 64px;
			
 
				+    font-weight: 900;
			
 
				+    line-height: 1;
			
 
				+    letter-spacing: -2px;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-stat-num--accent {
			
 
				+    color: var(--accent);
			
 
				+  }
			
 
				+
			
 
				+  .rubric-stat-label {
			
 
				+    font-size: 13px;
			
 
				+    font-weight: 600;
			
 
				+    text-transform: uppercase;
			
 
				+    letter-spacing: 1.5px;
			
 
				+    color: var(--mid);
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table {
			
 
				+    width: 100%;
			
 
				+    border-collapse: collapse;
			
 
				+    margin-bottom: 40px;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table caption {
			
 
				+    text-align: left;
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 700;
			
 
				+    letter-spacing: 2.5px;
			
 
				+    text-transform: uppercase;
			
 
				+    color: var(--light);
			
 
				+    padding-bottom: 16px;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table th {
			
 
				+    text-align: left;
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 600;
			
 
				+    letter-spacing: 1.5px;
			
 
				+    text-transform: uppercase;
			
 
				+    color: var(--light);
			
 
				+    padding: 12px 16px 12px 0;
			
 
				+    border-bottom: 2px solid var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table td {
			
 
				+    padding: 14px 16px 14px 0;
			
 
				+    border-bottom: 1px solid var(--border);
			
 
				+    font-size: 14px;
			
 
				+    vertical-align: top;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table tr:last-child td {
			
 
				+    border-bottom: none;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table .dim-num {
			
 
				+    font-weight: 700;
			
 
				+    color: var(--accent);
			
 
				+    font-variant-numeric: tabular-nums;
			
 
				+    width: 36px;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table .dim-name {
			
 
				+    font-weight: 600;
			
 
				+    white-space: nowrap;
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table .dim-weight {
			
 
				+    font-weight: 800;
			
 
				+    font-size: 20px;
			
 
				+    font-variant-numeric: tabular-nums;
			
 
				+    text-align: center;
			
 
				+    width: 60px;
			
 
				+    color: var(--dark);
			
 
				+  }
			
 
				+
			
 
				+  .rubric-table .dim-desc {
			
 
				+    color: var(--mid);
			
 
				+    line-height: 1.5;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ PHASES ═══════ */
			
 
				+  .phases {
			
 
				+    display: flex;
			
 
				+    flex-direction: column;
			
 
				+    gap: 0;
			
 
				+  }
			
 
				+
			
 
				+  .phase {
			
 
				+    display: grid;
			
 
				+    grid-template-columns: 160px 1fr;
			
 
				+    gap: 40px;
			
 
				+    padding: 40px 0;
			
 
				+    border-top: 1px solid var(--border);
			
 
				+  }
			
 
				+
			
 
				+  .phase:first-child {
			
 
				+    border-top: 1px solid var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .phase-id {
			
 
				+    font-size: 48px;
			
 
				+    font-weight: 900;
			
 
				+    color: var(--accent);
			
 
				+    line-height: 1;
			
 
				+    letter-spacing: -1px;
			
 
				+  }
			
 
				+
			
 
				+  .phase-id span {
			
 
				+    display: block;
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 600;
			
 
				+    letter-spacing: 2px;
			
 
				+    text-transform: uppercase;
			
 
				+    color: var(--light);
			
 
				+    margin-top: 8px;
			
 
				+  }
			
 
				+
			
 
				+  .phase-body h3 {
			
 
				+    font-size: 22px;
			
 
				+    font-weight: 700;
			
 
				+    margin-bottom: 12px;
			
 
				+    letter-spacing: -0.3px;
			
 
				+  }
			
 
				+
			
 
				+  .phase-body p {
			
 
				+    font-size: 14px;
			
 
				+    color: var(--mid);
			
 
				+    line-height: 1.6;
			
 
				+    margin-bottom: 16px;
			
 
				+    max-width: 560px;
			
 
				+  }
			
 
				+
			
 
				+  .phase-steps {
			
 
				+    list-style: none;
			
 
				+    counter-reset: step;
			
 
				+  }
			
 
				+
			
 
				+  .phase-steps li {
			
 
				+    counter-increment: step;
			
 
				+    padding: 8px 0 8px 32px;
			
 
				+    position: relative;
			
 
				+    font-size: 14px;
			
 
				+    line-height: 1.5;
			
 
				+    color: var(--dark);
			
 
				+  }
			
 
				+
			
 
				+  .phase-steps li::before {
			
 
				+    content: counter(step);
			
 
				+    position: absolute;
			
 
				+    left: 0;
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 700;
			
 
				+    color: var(--accent);
			
 
				+    width: 20px;
			
 
				+    height: 20px;
			
 
				+    display: flex;
			
 
				+    align-items: center;
			
 
				+    justify-content: center;
			
 
				+    top: 9px;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ RATCHET ═══════ */
			
 
				+  .ratchet-viz {
			
 
				+    display: flex;
			
 
				+    align-items: flex-end;
			
 
				+    gap: 0;
			
 
				+    padding: 48px 0;
			
 
				+    position: relative;
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-viz::before {
			
 
				+    content: '';
			
 
				+    position: absolute;
			
 
				+    bottom: 48px;
			
 
				+    left: 0;
			
 
				+    right: 0;
			
 
				+    height: 1px;
			
 
				+    background: var(--border);
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-step {
			
 
				+    flex: 1;
			
 
				+    display: flex;
			
 
				+    flex-direction: column;
			
 
				+    align-items: center;
			
 
				+    position: relative;
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-bar {
			
 
				+    width: 80px;
			
 
				+    background: var(--black);
			
 
				+    position: relative;
			
 
				+    z-index: 1;
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-bar--revert {
			
 
				+    background: none;
			
 
				+    border: 2px solid var(--border);
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-score {
			
 
				+    font-size: 36px;
			
 
				+    font-weight: 900;
			
 
				+    margin-bottom: 8px;
			
 
				+    letter-spacing: -1px;
			
 
				+    line-height: 1;
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-score--revert {
			
 
				+    color: var(--light);
			
 
				+    text-decoration: line-through;
			
 
				+    text-decoration-color: var(--accent);
			
 
				+    text-decoration-thickness: 2px;
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-label {
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 700;
			
 
				+    letter-spacing: 1.5px;
			
 
				+    text-transform: uppercase;
			
 
				+    margin-top: 12px;
			
 
				+    padding: 4px 10px;
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-label--keep {
			
 
				+    background: var(--black);
			
 
				+    color: var(--white);
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-label--revert {
			
 
				+    background: none;
			
 
				+    border: 1px solid var(--accent);
			
 
				+    color: var(--accent);
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-label--baseline {
			
 
				+    background: var(--accent);
			
 
				+    color: var(--white);
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-arrow {
			
 
				+    position: absolute;
			
 
				+    top: 50%;
			
 
				+    right: -12px;
			
 
				+    width: 24px;
			
 
				+    height: 2px;
			
 
				+    background: var(--border);
			
 
				+    z-index: 2;
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-arrow::after {
			
 
				+    content: '';
			
 
				+    position: absolute;
			
 
				+    right: -1px;
			
 
				+    top: -4px;
			
 
				+    border: solid var(--border);
			
 
				+    border-width: 0 2px 2px 0;
			
 
				+    padding: 3px;
			
 
				+    transform: rotate(-45deg);
			
 
				+  }
			
 
				+
			
 
				+  .ratchet-round {
			
 
				+    font-size: 12px;
			
 
				+    color: var(--light);
			
 
				+    margin-top: 8px;
			
 
				+    font-weight: 500;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ COMPARISON ═══════ */
			
 
				+  .comparison {
			
 
				+    display: grid;
			
 
				+    grid-template-columns: 1fr 1fr;
			
 
				+    gap: 0;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col {
			
 
				+    padding: 40px;
			
 
				+    border: 1px solid var(--border);
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col:first-child {
			
 
				+    border-right: none;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col--highlight {
			
 
				+    background: var(--black);
			
 
				+    color: var(--white);
			
 
				+    border-color: var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .comparison-tag {
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 700;
			
 
				+    letter-spacing: 2px;
			
 
				+    text-transform: uppercase;
			
 
				+    margin-bottom: 16px;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col:first-child .comparison-tag {
			
 
				+    color: var(--light);
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col--highlight .comparison-tag {
			
 
				+    color: var(--accent);
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col h3 {
			
 
				+    font-size: 24px;
			
 
				+    font-weight: 800;
			
 
				+    margin-bottom: 20px;
			
 
				+    letter-spacing: -0.5px;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-list {
			
 
				+    list-style: none;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-list li {
			
 
				+    padding: 10px 0;
			
 
				+    font-size: 14px;
			
 
				+    line-height: 1.5;
			
 
				+    border-bottom: 1px solid;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col:first-child .comparison-list li {
			
 
				+    border-color: var(--border);
			
 
				+    color: var(--mid);
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col--highlight .comparison-list li {
			
 
				+    border-color: #333;
			
 
				+    color: #ccc;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-list li:last-child {
			
 
				+    border-bottom: none;
			
 
				+  }
			
 
				+
			
 
				+  .comparison-list li strong {
			
 
				+    color: var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .comparison-col--highlight .comparison-list li strong {
			
 
				+    color: var(--white);
			
 
				+  }
			
 
				+
			
 
				+  .check-icon {
			
 
				+    display: inline-block;
			
 
				+    width: 16px;
			
 
				+    height: 16px;
			
 
				+    margin-right: 8px;
			
 
				+    vertical-align: middle;
			
 
				+    position: relative;
			
 
				+    top: -1px;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ MAPPING TABLE ═══════ */
			
 
				+  .mapping-table {
			
 
				+    width: 100%;
			
 
				+    border-collapse: collapse;
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table th {
			
 
				+    text-align: left;
			
 
				+    font-size: 11px;
			
 
				+    font-weight: 700;
			
 
				+    letter-spacing: 2px;
			
 
				+    text-transform: uppercase;
			
 
				+    padding: 16px 24px 16px 0;
			
 
				+    border-bottom: 2px solid var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table th:first-child {
			
 
				+    color: var(--light);
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table th:nth-child(2) {
			
 
				+    color: var(--accent);
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table th:last-child {
			
 
				+    color: var(--light);
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table td {
			
 
				+    padding: 16px 24px 16px 0;
			
 
				+    border-bottom: 1px solid var(--border);
			
 
				+    font-size: 14px;
			
 
				+    vertical-align: top;
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table td:first-child {
			
 
				+    font-weight: 600;
			
 
				+    color: var(--dark);
			
 
				+    white-space: nowrap;
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table td:nth-child(2) {
			
 
				+    font-weight: 600;
			
 
				+    color: var(--black);
			
 
				+  }
			
 
				+
			
 
				+  .mapping-table td:last-child {
			
 
				+    color: var(--mid);
			
 
				+    line-height: 1.5;
			
 
				+  }
			
 
				+
			
 
				+  .mapping-arrow {
			
 
				+    display: inline-block;
			
 
				+    color: var(--accent);
			
 
				+    font-weight: 400;
			
 
				+    margin: 0 4px;
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ FOOTER ═══════ */
			
 
				+  .footer {
			
 
				+    padding: 48px 0;
			
 
				+    border-top: 1px solid var(--black);
			
 
				+    display: flex;
			
 
				+    justify-content: space-between;
			
 
				+    align-items: center;
			
 
				+  }
			
 
				+
			
 
				+  .footer-left {
			
 
				+    font-size: 12px;
			
 
				+    font-weight: 600;
			
 
				+    letter-spacing: 1px;
			
 
				+    text-transform: uppercase;
			
 
				+    color: var(--light);
			
 
				+  }
			
 
				+
			
 
				+  .footer-right {
			
 
				+    font-size: 12px;
			
 
				+    color: var(--light);
			
 
				+  }
			
 
				+
			
 
				+  /* ═══════ RESPONSIVE ═══════ */
			
 
				+  @media (max-width: 768px) {
			
 
				+    .container { padding: 0 24px; }
			
 
				+    .hero { padding: 64px 0 48px; }
			
 
				+    .hero h1 { font-size: 48px; letter-spacing: -1.5px; }
			
 
				+    .hero-subtitle { font-size: 17px; }
			
 
				+    .section { padding: 48px 0; }
			
 
				+    .section-title { font-size: 32px; }
			
 
				+    .principles-grid { grid-template-columns: 1fr; }
			
 
				+    .principle:nth-child(even) { padding-left: 0; border-left: none; }
			
 
				+    .principle:nth-child(2) { border-top: 1px solid var(--border); }
			
 
				+    .phase { grid-template-columns: 1fr; gap: 16px; }
			
 
				+    .comparison { grid-template-columns: 1fr; }
			
 
				+    .comparison-col:first-child { border-right: 1px solid var(--border); border-bottom: none; }
			
 
				+    .ratchet-viz { flex-wrap: wrap; gap: 24px; }
			
 
				+    .ratchet-step { flex: none; width: calc(33% - 16px); }
			
 
				+    .rubric-stat-num { font-size: 48px; }
			
 
				+    .mapping-table td:first-child { white-space: normal; }
			
 
				+  }
			
 
				+</style>
			
 
				+</head>
			
 
				+<body>
			
 
				+
			
 
				+<!-- ═══════════════════════════ HERO ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <section class="hero">
			
 
				+    <div class="hero-label">Autonomous Skill Optimization System</div>
			
 
				+    <h1>Auto Skill<br>Optimizer</h1>
			
 
				+    <p class="hero-subtitle">
			
 
				+      <strong>Evaluate</strong> &rarr; <strong>Improve</strong> &rarr; <strong>Live-test</strong> &rarr; <strong>Human confirmation</strong> &rarr; <strong>Keep or revert</strong>
			
 
				+    </p>
			
 
				+    <div class="hero-quote">
			
 
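				+      <p>"The core idea of autoresearch is simple: let the system run experiments autonomously, evaluate the results, and keep only the improvements that work. A ratchet that only turns forward."</p>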
			
 
				+      <cite>Andrej Karpathy &mdash; on autonomous experiment loops</cite>
			
 
				+    </div>
			
 
				+  </section>
			
 
				+</div>
			
 
				+
			
 
				+<!-- ═══════════════════════════ 01 PRINCIPLES ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <section class="section">
			
 
				+    <div class="section-num">01</div>
			
 
				+    <h2 class="section-title">Core Principles</h2>
			
 
				+    <p class="section-lead">Five rules that keep the optimizer from drifting off course, gaming its own score, or introducing regressions.</p>
			
 
				+
			
 
				+    <div class="principles-grid">
			
 
				+      <div class="principle">
			
 
				+        <div class="principle-num">01</div>
			
 
				+        <h3>Single Editable Asset</h3>
			
 
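				+        <p>Each optimization round targets exactly one SKILL.md file. One change, one measurement, one decision. No cross-file edits, so every score change can be attributed to a single edit.</p>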
			
 
				+      </div>
			
 
				+      <div class="principle">
			
 
				+        <div class="principle-num">02</div>
			
 
				+        <h3>Dual Evaluation</h3>
			
 
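				+        <p>Static structural analysis catches formatting and completeness problems. Live execution catches behavioral regressions. Neither is sufficient on its own.</p>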
			
 
				+      </div>
			
 
				+      <div class="principle">
			
 
				+        <div class="principle-num">03</div>
			
 
				+        <h3>Ratchet Mechanism</h3>
			
 
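				+        <p>Changes that raise the total score are committed. Changes that lower it are reverted automatically. The score can only rise or hold steady; it never drops.</p>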
			
 
				+      </div>
			
 
				+      <div class="principle">
			
 
				+        <div class="principle-num">04</div>
			
 
				+        <h3>Independent Scoring</h3>
			
 
				+        <p>The agent that edits a Skill never scores its own work. An independent sub-agent evaluates output quality, which prevents self-congratulatory bias.</p>
			
 
				+      </div>
			
 
				+      <div class="principle principle--full">
			
 
				+        <div class="principle-num">05</div>
			
 
				+        <h3>Human in the Loop</h3>
			
 
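				+        <p>After the optimization loop for each Skill finishes, the system pauses and shows the human a diff summary, the score changes, and a comparison of test outputs. No change takes effect without explicit confirmation.</p>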
			
 
				+      </div>
			
 
				+    </div>
			
 
				+  </section>
			
 
				+</div>
			
 
				+
			
 
				+<!-- ═══════════════════════════ 02 RUBRIC ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <section class="section">
			
 
				+    <div class="section-num">02</div>
			
 
				+    <h2 class="section-title">8-Dimension<br>Rubric</h2>
			
 
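				+    <p class="section-lead">A 100-point rubric. Structural dimensions catch problems you can see by reading the file; effectiveness dimensions catch problems that only show up at runtime.</p>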
			
 
				+
			
 
				+    <div class="rubric-header">
			
 
				+      <div class="rubric-stat">
			
 
				+        <div class="rubric-stat-num">60</div>
			
 
				+        <div class="rubric-stat-label">Structure<br>points</div>
			
 
				+      </div>
			
 
				+      <div class="rubric-stat">
			
 
				+        <div class="rubric-stat-num rubric-stat-num--accent">40</div>
			
 
				+        <div class="rubric-stat-label">Effectiveness<br>points</div>
			
 
				+      </div>
			
 
				+    </div>
			
 
				+
			
 
				+    <table class="rubric-table">
			
 
				+      <caption>Structural dimensions &mdash; static analysis</caption>
			
 
				+      <thead>
			
 
				+        <tr>
			
 
				+          <th style="width:36px">#</th>
			
 
				+          <th style="width:180px">Dimension</th>
			
 
				+          <th style="width:60px">Weight</th>
			
 
				+          <th>Scoring criteria</th>
			
 
				+        </tr>
			
 
				+      </thead>
			
 
				+      <tbody>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">1</td>
			
 
				+          <td class="dim-name">Frontmatter quality</td>
			
 
				+          <td class="dim-weight">8</td>
			
 
				+          <td class="dim-desc">Correct name; description covers function, trigger conditions, and usage scenarios; no more than 1024 characters</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">2</td>
			
 
				+          <td class="dim-name">Workflow clarity</td>
			
 
				+          <td class="dim-weight">15</td>
			
 
				+          <td class="dim-desc">Steps are numbered and actionable, each with explicit inputs and outputs</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">3</td>
			
 
				+          <td class="dim-name">Edge-case coverage</td>
			
 
				+          <td class="dim-weight">10</td>
			
 
				+          <td class="dim-desc">Error handling, fallback options, recovery from common failures</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">4</td>
			
 
				+          <td class="dim-name">Checkpoint design</td>
			
 
				+          <td class="dim-weight">7</td>
			
 
				+          <td class="dim-desc">User confirmation before key decisions, to prevent runaway autonomy</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">5</td>
			
 
				+          <td class="dim-name">Instruction specificity</td>
			
 
				+          <td class="dim-weight">15</td>
			
 
				+          <td class="dim-desc">Unambiguous, with concrete parameters, formats, and examples; directly executable</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">6</td>
			
 
				+          <td class="dim-name">Resource integration</td>
			
 
				+          <td class="dim-weight">5</td>
			
 
				+          <td class="dim-desc">All referenced script and asset paths exist and are accessible</td>
			
 
				+        </tr>
			
 
				+      </tbody>
			
 
				+    </table>
			
 
				+
			
 
				+    <table class="rubric-table">
			
 
				+      <caption>Effectiveness dimensions &mdash; require live testing</caption>
			
 
				+      <thead>
			
 
				+        <tr>
			
 
				+          <th style="width:36px">#</th>
			
 
				+          <th style="width:180px">Dimension</th>
			
 
				+          <th style="width:60px">Weight</th>
			
 
				+          <th>Scoring criteria</th>
			
 
				+        </tr>
			
 
				+      </thead>
			
 
				+      <tbody>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">7</td>
			
 
				+          <td class="dim-name">Overall architecture</td>
			
 
				+          <td class="dim-weight">15</td>
			
 
				+          <td class="dim-desc">Clear hierarchy, no redundancy or omissions, follows ecosystem conventions</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td class="dim-num">8</td>
			
 
				+          <td class="dim-name">Live-test performance</td>
			
 
				+          <td class="dim-weight">25</td>
			
 
				+          <td class="dim-desc">Run 2-3 test prompts and compare output quality with the Skill enabled against the baseline</td>
			
 
				+        </tr>
			
 
				+      </tbody>
			
 
				+    </table>
			
 
				+  </section>
			
 
				+</div>
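The rubric above can be turned into a total score with a few lines of arithmetic. The following is a minimal sketch, not the project's actual scorer: the weights come from the two tables, while the dimension keys and the 0-10 rating scale per dimension are assumptions made for illustration.

```python
# Weights copied from the two rubric tables (they sum to 100).
# The dictionary keys are hypothetical identifiers, not names used by the project.
WEIGHTS = {
    "frontmatter_quality": 8,
    "workflow_clarity": 15,
    "edge_case_coverage": 10,
    "checkpoint_design": 7,
    "instruction_specificity": 15,
    "resource_integration": 5,
    "overall_architecture": 15,
    "live_test_performance": 25,
}

# Assumption: every dimension is rated on a 0-10 scale before weighting.
MAX_RATING = 10


def weighted_total(ratings: dict[str, float]) -> float:
    """Return the 0-100 total for one Skill from its per-dimension ratings."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(WEIGHTS[name] * ratings[name] / MAX_RATING for name in WEIGHTS)


example = {name: 10 for name in WEIGHTS}
example["edge_case_coverage"] = 5      # loses half of its 10 points
example["live_test_performance"] = 8   # loses one fifth of its 25 points
print(weighted_total(example))         # 90.0
```

Because the weights already sum to 100, no extra normalization step is needed as long as each rating stays within the assumed scale.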
			
 
				+
			
 
				+<!-- ═══════════════════════════ 03 PHASES ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <section class="section">
			
 
				+    <div class="section-num">03</div>
			
 
				+    <h2 class="section-title">Optimization Loop</h2>
			
 
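				+    <p class="section-lead">Five phases, from initialization to the final report. The system runs autonomously within each phase but pauses between phases for human review.</p>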
			
 
				+
			
 
				+    <div class="phases">
			
 
				+      <div class="phase">
			
 
				+        <div class="phase-id">
			
 
				+          0
			
 
				+          <span>Initialize</span>
			
 
				+        </div>
			
 
				+        <div class="phase-body">
			
 
				+          <h3>Scope and Branch Setup</h3>
			
 
				+          <p>Determine the optimization scope, set up version control, and load prior history.</p>
			
 
				+          <ol class="phase-steps">
			
 
				+            <li>Confirm scope: all Skills or a user-specified subset</li>
			
 
				+            <li>Scan .claude/skills/*/SKILL.md to build the target list</li>
			
 
				+            <li>Create a git branch: auto-optimize/YYYYMMDD-HHMM</li>
			
 
				+            <li>Initialize or load results.tsv for history tracking</li>
			
 
				+          </ol>
			
 
				+        </div>
			
 
				+      </div>
			
 
				+
			
 
				+      <div class="phase">
			
 
				+        <div class="phase-id">
			
 
				+          0.5
			
 
				+          <span>Design</span>
			
 
				+        </div>
			
 
				+        <div class="phase-body">
			
 
				+          <h3>Test Prompt Engineering</h3>
			
 
				+          <p>Before any scoring, design the test prompts that will measure effectiveness. Without good tests, the optimizer is flying blind.</p>
			
 
				+          <ol class="phase-steps">
			
 
				+            <li>Read each SKILL.md to understand the capabilities it claims</li>
			
 
				+            <li>Design 2-3 prompts per Skill, covering a happy path and an ambiguous scenario</li>
			
 
				+            <li>Save them to test-prompts.json in each Skill's directory</li>
			
 
				+            <li>Submit all test prompts for human approval before continuing</li>
			
 
				+          </ol>
			
 
				+        </div>
			
 
				+      </div>
			
 
				+
			
 
				+      <div class="phase">
			
 
				+        <div class="phase-id">
			
 
				+          1
			
 
				+          <span>Baseline</span>
			
 
				+        </div>
			
 
				+        <div class="phase-body">
			
 
				+          <h3>Full-Dimension Scoring</h3>
			
 
				+          <p>Establish a starting score for every Skill. The main agent does the structural scoring; an independent sub-agent does the effectiveness scoring.</p>
			
 
				+          <ol class="phase-steps">
			
 
				+            <li>Read SKILL.md and score dimensions 1-7, with a rationale for each</li>
			
 
				+            <li>Launch a sub-agent to run the test prompts with and without the Skill enabled</li>
			
 
				+            <li>Compare the outputs and score dimension 8 (mark the run as dry_run if no sub-agent is available)</li>
			
 
				+            <li>Compute the weighted total and record it in results.tsv</li>
			
 
				+            <li>Present the scorecard and pause for human confirmation</li>
			
 
				+          </ol>
			
 
				+        </div>
			
 
				+      </div>
			
 
				+
			
 
				+      <div class="phase">
			
 
				+        <div class="phase-id">
			
 
				+          2
			
 
				+          <span>Optimize</span>
			
 
				+        </div>
			
 
				+        <div class="phase-body">
			
 
				+          <h3>Hill-Climbing Loop</h3>
			
 
				+          <p>Process Skills from lowest score to highest. Each round: diagnose the weakest dimension, propose one targeted fix, apply it, re-score, and decide.</p>
			
 
				+          <ol class="phase-steps">
			
 
				+            <li>Identify the Skill's lowest-scoring dimension</li>
			
 
				+            <li>Generate one concrete improvement (what to change, why, and the expected score change)</li>
			
 
				+            <li>Edit SKILL.md and git commit with a structured message</li>
			
 
				+            <li>Re-score: structure by the main agent, effectiveness by an independent sub-agent</li>
			
 
				+            <li>New score &gt; old score: keep. Otherwise: git revert and move on to the next Skill</li>
			
 
				+            <li>After each Skill: show the diff and score changes, then wait for human confirmation</li>
			
 
				+          </ol>
			
 
				+        </div>
			
 
				+      </div>
			
 
				+
			
 
				+      <div class="phase">
			
 
				+        <div class="phase-id">
			
 
				+          3
			
 
				+          <span>Report</span>
			
 
				+        </div>
			
 
				+        <div class="phase-body">
			
 
				+          <h3>Summary and Metrics</h3>
			
 
				+          <p>Consolidate all results into a final optimization report with before/after scores, experiment counts, and key improvements.</p>
			
 
				+          <ol class="phase-steps">
			
 
				+            <li>Tally total experiments, keeps, reverts, and the evaluation mode used</li>
			
 
				+            <li>Generate a before/after score table for every Skill</li>
			
 
				+            <li>List the highest-impact improvements and the dimensions they affected</li>
			
 
				+            <li>Archive results.tsv as a reference for future baselines</li>
			
 
				+          </ol>
			
 
				+        </div>
			
 
				+      </div>
			
 
				+    </div>
			
 
				+  </section>
			
 
				+</div>
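Three small pieces of the phases above are mechanical enough to sketch: the Phase 0 branch name, the Phase 2 processing order, and the choice of the weakest dimension. This is an illustrative sketch only. The function names and the sample Skill names are hypothetical, and it assumes dimension ratings share a common scale so that the lowest rating can be picked directly.

```python
from datetime import datetime


def branch_name(now: datetime) -> str:
    """Phase 0: build a branch name in the auto-optimize/YYYYMMDD-HHMM pattern."""
    return now.strftime("auto-optimize/%Y%m%d-%H%M")


def processing_order(totals: dict[str, float]) -> list[str]:
    """Phase 2: visit Skills from the lowest total score to the highest."""
    return sorted(totals, key=lambda skill: totals[skill])


def weakest_dimension(ratings: dict[str, float]) -> str:
    """Phase 2, step 1: pick the lowest-rated dimension of one Skill."""
    return min(ratings, key=lambda dim: ratings[dim])


# Hypothetical baseline totals for three Skills.
totals = {"pdf-export": 81.0, "code-review": 64.5, "release-notes": 72.0}

print(branch_name(datetime(2025, 3, 14, 9, 5)))   # auto-optimize/20250314-0905
print(processing_order(totals))                   # ['code-review', 'release-notes', 'pdf-export']
print(weakest_dimension({"workflow_clarity": 7, "edge_case_coverage": 4}))  # edge_case_coverage
```

Starting with the lowest-scoring Skill puts the largest potential gains first, which matches the order the phase description gives.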
			
 
				+
			
 
				+<!-- ═══════════════════════════ 04 RATCHET ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <section class="section">
			
 
				+    <div class="section-num">04</div>
			
 
				+    <h2 class="section-title">Ratchet Mechanism</h2>
			
 
				+    <p class="section-lead">The score only goes up. Each round either improves the Skill or is cleanly reverted, so partial regressions never accumulate over time.</p>
			
 
				+
			
 
				+    <div class="ratchet-viz">
			
 
				+      <div class="ratchet-step">
			
 
				+        <div class="ratchet-score">72</div>
			
 
				+        <div style="height:144px" class="ratchet-bar"></div>
			
 
				+        <div class="ratchet-label ratchet-label--baseline">Baseline</div>
			
 
				+        <div class="ratchet-round">Round 0</div>
			
 
				+        <div class="ratchet-arrow"></div>
			
 
				+      </div>
			
 
				+      <div class="ratchet-step">
			
 
				+        <div class="ratchet-score">78</div>
			
 
				+        <div style="height:156px" class="ratchet-bar"></div>
			
 
				+        <div class="ratchet-label ratchet-label--keep">Keep</div>
			
 
				+        <div class="ratchet-round">Round 1</div>
			
 
				+        <div class="ratchet-arrow"></div>
			
 
				+      </div>
			
 
				+      <div class="ratchet-step">
			
 
				+        <div class="ratchet-score ratchet-score--revert">75</div>
			
 
				+        <div style="height:150px" class="ratchet-bar ratchet-bar--revert"></div>
			
 
				+        <div class="ratchet-label ratchet-label--revert">Revert</div>
			
 
				+        <div class="ratchet-round">Round 2</div>
			
 
				+        <div class="ratchet-arrow"></div>
			
 
				+      </div>
			
 
				+      <div class="ratchet-step">
			
 
				+        <div class="ratchet-score">84</div>
			
 
				+        <div style="height:168px" class="ratchet-bar"></div>
			
 
				+        <div class="ratchet-label ratchet-label--keep">Keep</div>
			
 
				+        <div class="ratchet-round">Round 3</div>
			
 
				+        <div class="ratchet-arrow"></div>
			
 
				+      </div>
			
 
				+      <div class="ratchet-step">
			
 
				+        <div class="ratchet-score">87</div>
			
 
				+        <div style="height:174px" class="ratchet-bar"></div>
			
 
				+        <div class="ratchet-label ratchet-label--keep">Keep</div>
			
 
				+        <div class="ratchet-round">Round 4</div>
			
 
				+      </div>
			
 
				+    </div>
			
 
				+  </section>
			
 
				+</div>
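The keep/revert rule behind the chart can be replayed in a few lines. This is a minimal sketch under stated assumptions: the scores are the ones shown in the chart, the function name is hypothetical, and the loop keeps going after a revert as the chart does (the Phase 2 steps instead move on to the next Skill after a revert).

```python
def ratchet(baseline: int, candidates: list[int]) -> list[tuple[int, int, str, int]]:
    """Replay keep/revert decisions for one Skill.

    Returns one (round, candidate_score, decision, kept_score) tuple per round.
    """
    kept = baseline
    history = []
    for round_no, candidate in enumerate(candidates, start=1):
        if candidate > kept:
            # Strictly better than the current best: keep the commit.
            kept = candidate
            decision = "keep"
        else:
            # Equal or worse: revert, so the kept score stays where it was.
            decision = "revert"
        history.append((round_no, candidate, decision, kept))
    return history


for row in ratchet(72, [78, 75, 84, 87]):
    print(row)
# (1, 78, 'keep', 78)
# (2, 75, 'revert', 78)
# (3, 84, 'keep', 84)
# (4, 87, 'keep', 87)
```

The last element of each tuple never decreases, which is the ratchet property the section describes.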
			
 
				+
			
 
				+<!-- ═══════════════════════════ 05 COMPARISON ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <section class="section">
			
 
				+    <div class="section-num">05</div>
			
 
				+    <h2 class="section-title">Why Dual<br>Evaluation</h2>
			
 
				+    <p class="section-lead">Structure alone cannot tell you whether a Skill actually works well. Effectiveness alone cannot tell you why it fails.</p>
			
 
				+
			
 
				+    <div class="comparison">
			
 
				+      <div class="comparison-col">
			
 
				+        <div class="comparison-tag">Traditional approach</div>
			
 
				+        <h3>Structure-Only Review</h3>
			
 
				+        <ul class="comparison-list">
			
 
				+          <li>Checks that frontmatter exists and is well-formed</li>
			
 
				+          <li>Verifies that steps are numbered and described</li>
			
 
				+          <li>Confirms that file paths and references are valid</li>
			
 
				+          <li>Cannot detect whether the Skill <strong>actually improves</strong> output quality</li>
			
 
				+          <li>Cannot detect misleading instructions that <strong>look correct</strong> but produce poor results</li>
			
 
				+          <li>Cannot detect over-constraint that <strong>does more harm than good</strong></li>
			
 
				+        </ul>
			
 
				+      </div>
			
 
				+      <div class="comparison-col comparison-col--highlight">
			
 
				+        <div class="comparison-tag">Auto Skill Optimizer</div>
			
 
				+        <h3>Dual Evaluation</h3>
			
 
				+        <ul class="comparison-list">
			
 
				+          <li><strong>Structural scoring</strong> catches formatting, completeness, and readability problems</li>
			
 
				+          <li><strong>Live execution</strong> reveals behavioral impact in real scenarios</li>
			
 
				+          <li><strong>Baseline comparison</strong> measures whether the Skill adds or subtracts value</li>
			
 
				+          <li><strong>An independent sub-agent</strong> prevents self-congratulatory scoring bias</li>
			
 
				+          <li><strong>Test prompt design</strong> keeps the evaluation aimed at real user scenarios</li>
			
 
				+          <li><strong>Dry-run fallback</strong> provides coverage when live testing is unavailable</li>
			
 
				+        </ul>
			
 
				+      </div>
			
 
				+    </div>
			
 
				+  </section>
			
 
				+</div>
			
 
				+
			
 
				+<!-- ═══════════════════════════ 06 MAPPING ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <section class="section">
			
 
				+    <div class="section-num">06</div>
			
 
				+    <h2 class="section-title">Concept Mapping</h2>
			
 
				+    <p class="section-lead">How the core abstractions of autoresearch translate to Skill optimization. Same machine, different domain.</p>
			
 
				+
			
 
				+    <table class="mapping-table">
			
 
				+      <thead>
			
 
				+        <tr>
			
 
				+          <th style="width:220px">Autoresearch</th>
			
 
				+          <th style="width:220px">Skill Optimizer</th>
			
 
				+          <th>Implementation details</th>
			
 
				+        </tr>
			
 
				+      </thead>
			
 
				+      <tbody>
			
 
				+        <tr>
			
 
				+          <td>Training script (train.py)</td>
			
 
				+          <td>SKILL.md file</td>
			
 
				+          <td>The only editable artifact. Every improvement takes the form of an edit to this one file.</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td>Evaluation metric</td>
			
 
				+          <td>8-dimension rubric</td>
			
 
				+          <td>Weighted scoring across structure (60 points) and effectiveness (40 points), 100 points in total.</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td>Experiment loop</td>
			
 
				+          <td>Phase 2 hill-climbing</td>
			
 
				+          <td>Diagnose the weakest dimension, propose a fix, apply it, re-score, then keep or revert. At most 3 rounds per Skill.</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td>Version control</td>
			
 
				+          <td>Git branch + revert</td>
			
 
				+          <td>Every edit is a commit. Regressions are rolled back with revert (a new commit), leaving a complete audit trail.</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td>Automated evaluation</td>
			
 
				+          <td>Sub-agent test execution</td>
			
 
				+          <td>An independent agent runs the test prompts with and without the Skill enabled and compares output quality.</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td>Human review gate</td>
			
 
				+          <td>Pause at phase transitions</td>
			
 
				+          <td>The system pauses after baseline scoring and after each Skill is optimized, showing the diff and score changes.</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td>Exploration vs. exploitation</td>
			
 
				+          <td>Phase 2.5 exploratory rewrite</td>
			
 
				+          <td>When hill-climbing stalls (the loop breaks off in round 1 twice in a row), propose a complete structural rewrite.</td>
			
 
				+        </tr>
			
 
				+        <tr>
			
 
				+          <td>Experiment log</td>
			
 
				+          <td>results.tsv</td>
			
 
				+          <td>Timestamped records: commit hash, Skill name, old and new scores, keep/revert status, and evaluation mode.</td>
			
 
				+        </tr>
			
 
				+      </tbody>
			
 
				+    </table>
			
 
				+  </section>
			
 
				+</div>
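The results.tsv row in the table above lists the fields of each record but not the column headers or value formats. The sketch below shows one way such a record could be written with Python's standard `csv` module. The column names, the ISO timestamp format, and the `append_result` helper are all assumptions for illustration, as are the sample commit hash and Skill name.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column names; the page lists the fields but not their headers.
FIELDS = ["timestamp", "commit", "skill", "old_score", "new_score", "status", "eval_mode"]


def append_result(stream, *, commit, skill, old_score, new_score, status, eval_mode, now=None):
    """Write one tab-separated experiment record to an open text stream."""
    now = now or datetime.now(timezone.utc)
    writer = csv.DictWriter(stream, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writerow({
        "timestamp": now.isoformat(timespec="seconds"),
        "commit": commit,
        "skill": skill,
        "old_score": old_score,
        "new_score": new_score,
        "status": status,        # "keep" or "revert"
        "eval_mode": eval_mode,  # e.g. "dry_run" when no sub-agent is available
    })


# An in-memory stream stands in for the real results.tsv file here.
log = io.StringIO()
append_result(
    log,
    commit="abc1234",            # made-up hash
    skill="code-review",         # made-up Skill name
    old_score=72,
    new_score=78,
    status="keep",
    eval_mode="dry_run",
    now=datetime(2025, 3, 14, 9, 5, tzinfo=timezone.utc),
)
print(log.getvalue(), end="")
```

For an on-disk log, the same function could be passed a file opened in append mode with `newline=""`, which is what the `csv` module documentation recommends for file objects.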
			
 
				+
			
 
				+<!-- ═══════════════════════════ FOOTER ═══════════════════════════ -->
			
 
				+<div class="container">
			
 
				+  <footer class="footer">
			
 
				+    <div class="footer-left">Auto Skill Optimizer</div>
			
 
				+    <div class="footer-right">Inspired by Karpathy's autoresearch &mdash; built for the Claude Code Skill ecosystem</div>
			
 
				+  </footer>
			
 
				+</div>
			
 
				+
			
 
				+</body>
			
 
				+</html>