[!NOTE] 🤝 Microsoft Research lists darwin-skill as an official SkillOpt integration. On 2026-06-03, the SkillOpt repo noted: *"gbrain, gbrain-evals, and darwin-skill have all integrated SkillOpt."* We absorbed its validation-gated framework; it added darwin to its integration list. A two-way nod. 👉 Visit the SkillOpt repo
v2.0 is not a patch release. It's a structural upgrade absorbing two Microsoft Research papers published on 2026-05-22. Five concrete changes:
1. Rubric expanded from 8 → 9 dimensions (integrating SkillLens's empirically validated 73.8% rubric recipe)
   - `rm` / `git reset --hard` / force push must be explicitly listed as forbidden in the skill.
2. Validation aligned with SkillOpt's validation-gated design
3. Human-in-the-loop at three checkpoints (the core differentiator from SkillOpt's fully autonomous design)
4. Anti-pattern blacklist with 8 explicit forbidden behaviors
   - `git reset --hard` as a rollback mechanism (use `git revert`)
5. Empirical validation data
Evaluate → Improve → Test → Human Confirm → Keep or Revert. Repeat.
Agent skill ecosystems are expanding fast. Claude Code, Codex, OpenClaw, Trae, CodeBuddy and more all support the SKILL.md format. When you have 10 skills, you can maintain them by hand. When you have 60+, you need a system.
Traditional skill review is purely structural: does the frontmatter look right? Are the steps numbered? Do the file paths exist? But a perfectly formatted skill can still produce terrible output.
darwin.skill evaluates both structure and real-world effectiveness, then keeps only the changes that actually improve things.
This project maps Karpathy's autoresearch directly onto skill optimization:
| autoresearch | darwin.skill | Why |
|---|---|---|
| `program.md` | This SKILL.md | Defines evaluation criteria and constraints |
| `train.py` | Each target SKILL.md | The single editable asset per experiment |
| `val_bpb` | 9-dimension weighted score (max 100) | Quantifiable optimization target |
| `git` ratchet | keep / revert mechanism | Only improving commits survive |
| test set | `test-prompts.json` | Validates whether improvements are real |
| Fully autonomous | Human in the loop | Skill quality is more subjective than loss |
The key difference: autoresearch is fully autonomous (loss is just a number). Skill quality sometimes needs human judgment. So darwin.skill pauses after each skill's optimization cycle, shows you the diff and score delta, and waits for your confirmation.
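As a rough illustration of that pause, the sketch below builds what a reviewer might be shown (a unified diff plus the score delta) and turns their answer into a keep or revert decision. The function names `review_summary` and `human_gate` are hypothetical and are not part of darwin.skill; only the standard-library `difflib` module is real.

```python
import difflib
from typing import Callable


def review_summary(old_text: str, new_text: str, old_score: float, new_score: float) -> str:
    """Build the text shown to the reviewer: unified diff followed by the score delta."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="SKILL.md (before)",
        tofile="SKILL.md (after)",
        lineterm="",
    )
    delta = new_score - old_score
    # "{delta:+}" always prints the sign, e.g. +3 or -3.
    return "\n".join(diff_lines) + f"\nscore: {old_score} -> {new_score} ({delta:+})"


def human_gate(summary: str, confirm: Callable[[str], bool]) -> str:
    """Return "keep" only if the reviewer confirms; anything else means "revert"."""
    return "keep" if confirm(summary) else "revert"


if __name__ == "__main__":
    summary = review_summary("step one\nbe careful", "step one\nnever run rm", 78, 81)
    print(summary)
    # In an interactive session, `confirm` could wrap input(); here it is stubbed.
    print(human_gate(summary, confirm=lambda text: True))
```

Passing `confirm` as a callable keeps the decision with the human while leaving the rest of the cycle scriptable and testable.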
| # | Principle | Details |
|---|---|---|
| 01 | Single editable asset | One SKILL.md per experiment. One change, one measurement, one decision |
| 02 | Dual evaluation | Structure scoring (static analysis) + effectiveness scoring (live test execution) |
| 03 | Ratchet mechanism | Score can only go up. Regressions are auto-reverted |
| 04 | Independent scoring | The agent that edits is never the agent that scores (SkillLens: LLM self-eval is only 46.4% accurate) |
| 05 | Human in the loop | System pauses after each skill. You review, then continue |
Total: 100 points. Structure + Effectiveness. v2.0's three new dimensions come directly from SkillLens's empirically validated rubric.
The three new dimensions (SkillLens 73.8% rubric recipe):
| Dimension | Description |
|---|---|
| Failure Mechanism Encoding | Explicitly encode known failure paths, not just "be careful" reminders |
| Actionable Specificity | Ban vague hedge words like "suggest / could consider / depending on / use judgment / case by case" |
| High-Risk Action Blacklist | Destructive operations (rm / git reset --hard / force push) must be explicitly forbidden |
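The last two dimensions lend themselves to a simple static check. The sketch below is a heuristic illustration only, not the scorer darwin.skill uses: the phrase lists are copied from the table above, while the `FORBID_WORDS` tuple and both function names are assumptions made for this example.

```python
import re

# Phrases and actions taken from the table above.
HEDGE_PHRASES = ["suggest", "could consider", "depending on", "use judgment", "case by case"]
HIGH_RISK_ACTIONS = ["rm", "git reset --hard", "force push"]

# Assumption: a line "forbids" an action if it contains one of these words.
FORBID_WORDS = ("never", "forbidden", "do not", "must not")


def find_hedges(skill_text: str) -> list[str]:
    """Return the hedge phrases that occur anywhere in the skill (case-insensitive)."""
    lower = skill_text.lower()
    return [phrase for phrase in HEDGE_PHRASES if phrase in lower]


def missing_blacklist_entries(skill_text: str) -> list[str]:
    """Return high-risk actions that no line both mentions and forbids."""
    lines = skill_text.lower().splitlines()
    missing = []
    for action in HIGH_RISK_ACTIONS:
        # Match the action as a whole token so "rm" does not match inside "format".
        pattern = re.compile(r"(?<!\w)" + re.escape(action) + r"(?!\w)")
        forbidden = any(
            pattern.search(line) and any(word in line for word in FORBID_WORDS)
            for line in lines
        )
        if not forbidden:
            missing.append(action)
    return missing


if __name__ == "__main__":
    sample = "Never run rm on user files.\nYou could consider using git reset --hard.\n"
    print(find_hedges(sample))
    print(missing_blacklist_entries(sample))
```

A keyword match like this can only flag candidates; judging whether a failure path is genuinely encoded still needs a scorer that reads the skill.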
Live test performance has the highest weight. A beautifully written skill that produces bad output is still a bad skill.
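To make the weighting concrete, here is a minimal sketch of a weighted score that tops out at 100. The weights below are invented for illustration and are not darwin.skill's actual rubric; the only property borrowed from the text is that live test performance carries the largest single weight. The `remaining_dimensions` entry is a placeholder that lumps together the dimensions not named in this section.

```python
# Hypothetical weights (points per dimension). They must sum to 100.
EXAMPLE_WEIGHTS = {
    "live_test_performance": 25,       # largest single weight, per the text
    "failure_mechanism_encoding": 10,
    "actionable_specificity": 10,
    "high_risk_action_blacklist": 10,
    "remaining_dimensions": 45,        # placeholder for the other dimensions
}


def weighted_score(ratings: dict[str, int], weights: dict[str, int]) -> float:
    """Combine per-dimension ratings (0 to 10) into a score out of 100."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if ratings.keys() != weights.keys():
        raise ValueError("ratings and weights must cover the same dimensions")
    for name, rating in ratings.items():
        if not 0 <= rating <= 10:
            raise ValueError(f"rating for {name} must be between 0 and 10")
    # A rating of 10 earns the dimension's full weight.
    return sum(weights[name] * ratings[name] for name in weights) / 10


if __name__ == "__main__":
    polished_but_ineffective = {name: 10 for name in EXAMPLE_WEIGHTS}
    polished_but_ineffective["live_test_performance"] = 0
    print(weighted_score(polished_but_ineffective, EXAMPLE_WEIGHTS))
```

Under these assumed weights, a skill that is flawless on paper but fails every live test cannot score above 75.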
Five phases. The system runs autonomously within each phase but pauses between phases for human confirmation.
Phase 2 (the heart, hardened in v2.0):
- `git revert` (never `git reset --hard`, blacklist #2)

Scores can only go up. Failed experiments are cleanly reverted. No regressions accumulate over time.
Round 2 scored 75, below the current best of 78. Auto-reverted. Effective baseline stays at 78. Subsequent improvements build from 78, not 75.
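The ratchet in that example can be sketched in a few lines. This is an illustrative model, not darwin.skill's implementation: the function name `run_ratchet`, the starting baseline of 72, and the round 3 score of 81 are made up, and treating a tie as a revert is an assumption. Only the 78 and 75 come from the example above.

```python
def run_ratchet(baseline: int, round_scores: list[int]) -> tuple[int, list[tuple[int, int, str, int]]]:
    """Keep a round only if it beats the current best; otherwise revert it.

    Returns the final best score and a log of
    (round number, score, decision, best score after the round).
    """
    best = baseline
    log = []
    for round_number, score in enumerate(round_scores, start=1):
        if score > best:  # assumption: a tie is not an improvement
            best = score
            decision = "keep"
        else:
            decision = "revert"
        log.append((round_number, score, decision, best))
    return best, log


if __name__ == "__main__":
    final_best, history = run_ratchet(baseline=72, round_scores=[78, 75, 81])
    for entry in history:
        print(entry)
    print("final best:", final_best)
```

Because round 2 is reverted, round 3 is measured against 78 rather than 75, which is what stops regressions from compounding.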
npx skills add alchaincyf/darwin-skill
After installation, tell your agent: "optimize all skills" or "optimize [skill-name]". Works with any tool that supports the SKILL.md format.
Can't access GitHub? Download the zip: darwin-skill.zip. Extract and place SKILL.md in ~/.claude/skills/darwin-skill/.
Directly inspired by Andrej Karpathy's autoresearch.
The core mechanism is identical: keep only measurable improvements, revert everything else.
v2.0 builds on this foundation by integrating two Microsoft Research papers (published 2026-05-22): SkillLens provides the empirically validated rubric design, and SkillOpt provides the formal framework of validation-gated edits.
v2.0's design directly builds on the following academic work. Recommended reading for researchers and engineers working on the skill ecosystem:
Microsoft Research. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills. arXiv:2605.23899, 2026.
Microsoft Research. SkillOpt: Executive Strategy for Self-Evolving Agent Skills. arXiv:2605.23904, 2026. (`pip install skillopt`, v0.1.0 on PyPI)
Andrej Karpathy. autoresearch. GitHub repository, 2026.
The key difference between darwin.skill and SkillOpt: SkillOpt is fully autonomous, while darwin.skill keeps a human in the loop, because skill quality is more subjective than validation loss. Each critical phase (baseline eval, single-dimension edit, regression test) ends with a mandatory pause so that the human makes the final judgment.
| 🌐 Website | bookai.top · huasheng.ai |
| 𝕏 Twitter | @AlchainHust |
| 📺 Bilibili | 花叔 |
| ▶️ YouTube | @Alchain |
| 📕 Xiaohongshu | 花叔 |
| Search "花叔" |
MIT