|
|
@@ -0,0 +1,289 @@
|
|
|
+---
|
|
|
+name: metadata-extractor
|
|
|
+description: Extract structured metadata from webnovel chapter content for indexing.
|
|
|
+allowed-tools: Read, Grep
|
|
|
+---
|
|
|
+
|
|
|
+# Metadata Extractor Agent
|
|
|
+
|
|
|
+> **Purpose**: Extract structured metadata from webnovel chapter content for indexing.
|
|
|
+>
|
|
|
+> **Role**: Specialized agent for analyzing chapter Markdown content and extracting key metadata (location, characters, title, etc.) with high accuracy using semantic understanding.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🎯 Core Responsibility
|
|
|
+
|
|
|
+Extract **structured metadata** from webnovel chapter content to populate the structured index database, enabling:
|
|
|
+- Fast location-based chapter queries (O(log n) performance)
|
|
|
+- Character appearance tracking
|
|
|
+- Content change detection (via hash)
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 📥 Input Format
|
|
|
+
|
|
|
+**Parameters**:
|
|
|
+- `chapter_num`: Chapter number (integer)
|
|
|
+- `chapter_content`: Full Markdown content of the chapter
|
|
|
+
|
|
|
+**Example Input**:
|
|
|
+```markdown
|
|
|
+# 第一章 废柴少年
|
|
|
+
|
|
|
+东域,慕容家族。
|
|
|
+
|
|
|
+清晨的阳光洒在演武场上,带着几分温暖,却驱散不了林天心中的寒意。
|
|
|
+
|
|
|
+"废物!连练气期一层都突破不了,还有脸站在这里?"
|
|
|
+
|
|
|
+刺耳的嘲笑声从四面八方传来,林天紧咬着牙关...
|
|
|
+
|
|
|
+[NEW_ENTITY: 角色, 慕容战天, 家族第一天才,练气期九层巅峰]
|
|
|
+[NEW_ENTITY: 角色, 慕容虎, 慕容战天的跟班,练气期五层]
|
|
|
+```
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 📤 Output Format
|
|
|
+
|
|
|
+**CRITICAL**: Output **ONLY** a valid JSON object, no additional text or explanations.
|
|
|
+
|
|
|
+**JSON Schema**:
|
|
|
+```json
|
|
|
+{
|
|
|
+ "title": "string (chapter title, taken from the first # heading line)",
|
|
|
+ "location": "string (primary location, inferred from context)",
|
|
|
+ "characters": ["array of strings (names of characters who appear; at most 5 main characters)"],
|
|
|
+ "word_count": "integer (total character count)",
|
|
|
+ "hash": "string (MD5 hash of content)",
|
|
|
+ "metadata_quality": "string (high/medium/low - confidence in the extracted metadata)"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+**Example Output**:
|
|
|
+```json
|
|
|
+{
|
|
|
+ "title": "第一章 废柴少年",
|
|
|
+ "location": "慕容家族",
|
|
|
+ "characters": ["林天", "慕容战天", "慕容虎", "云长老"],
|
|
|
+ "word_count": 3215,
|
|
|
+ "hash": "abc123def456...",
|
|
|
+ "metadata_quality": "high"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🔍 Extraction Guidelines
|
|
|
+
|
|
|
+### 1. Title Extraction
|
|
|
+
|
|
|
+**Strategy**:
|
|
|
+- Extract from the first Markdown heading in the content (normally `# Heading`; deeper levels such as `##` are also accepted)
|
|
|
+- Remove `#` symbols and leading/trailing whitespace
|
|
|
+- Typical format: "第N章 章节名" (chapter number, then chapter name); headings in other styles are kept verbatim
|
|
|
+
|
|
|
+**Examples**:
|
|
|
+```markdown
|
|
|
+# 第一章 废柴少年 → "第一章 废柴少年"
|
|
|
+## 第十五章:突破! → "第十五章:突破!"
|
|
|
+# Chapter 7 - The Battle → "Chapter 7 - The Battle"
|
|
|
+```
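The title strategy above can be sketched in Python as follows. This is a minimal illustration, not the agent's actual implementation: the helper name `extract_title` is hypothetical, and it assumes that any heading level from `#` to `######` followed by whitespace counts as a heading.

```python
import re

# Hypothetical helper; the name and the exact regex are assumptions for illustration.
HEADING_RE = re.compile(r"^\s*#{1,6}\s+(.+?)\s*$")


def extract_title(content: str) -> str:
    """Return the text of the first Markdown heading, or "" if there is none."""
    for line in content.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # group(1) is the heading text without '#' marks or surrounding whitespace
            return match.group(1)
    return ""
```

Returning an empty string for a missing heading lets the caller decide how to report a missing title, for example by lowering `metadata_quality`.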
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 2. Location Extraction ⭐ (Most Critical)
|
|
|
+
|
|
|
+**Strategy** (in priority order):
|
|
|
+
|
|
|
+**A) Explicit Location Markers** (Highest Priority):
|
|
|
+```markdown
|
|
|
+**地点:天云宗** → "天云宗"
|
|
|
+**位置:血煞秘境** → "血煞秘境"
|
|
|
+【场景:拍卖会】 → "拍卖会"
|
|
|
+```
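Explicit markers like the three shown above lend themselves to a regular-expression pass before any semantic analysis. The sketch below is one possible way to do that; `find_explicit_location` is a hypothetical name, and the label set (地点, 位置, 场景) is assumed from the examples rather than taken from an exhaustive list.

```python
import re
from typing import Optional

# Assumed marker shapes: **label:value** or 【label:value】, with a full-width
# or ASCII colon. The labels are the three that appear in the examples.
MARKER_RE = re.compile(r"(?:\*\*|【)(?:地点|位置|场景)[::]\s*(.+?)\s*(?:\*\*|】)")


def find_explicit_location(content: str) -> Optional[str]:
    """Return the first explicitly marked location, or None if no marker exists."""
    match = MARKER_RE.search(content)
    return match.group(1) if match else None
```

When this returns `None`, the lower-priority strategies (context clues, then semantic analysis) would apply.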
|
|
|
+
|
|
|
+**B) Context Clues in First 10 Lines**:
|
|
|
+- Look for geographical/organizational names after chapter title
|
|
|
+- Common patterns:
|
|
|
+ - "东域,慕容家族。" → "慕容家族"
|
|
|
+ - "天云宗,外门演武场。" → "天云宗"
|
|
|
+ - "林天来到了血煞秘境入口。" → "血煞秘境"
|
|
|
+
|
|
|
+**C) Semantic Analysis**:
|
|
|
+- Identify most frequently mentioned location in first 500 characters
|
|
|
+- Prioritize:
|
|
|
+ - 宗门/家族/势力名称(sect/family/faction names)
|
|
|
+ - 地理区域名称(geographical names)
|
|
|
+ - 建筑/场所名称(building/venue names)
|
|
|
+
|
|
|
+**D) Default**:
|
|
|
+- If no clear location found: `"未知"`
|
|
|
+- If multiple locations: choose the **first mentioned** or **most prominent**
|
|
|
+
|
|
|
+**Examples**:
|
|
|
+```markdown
|
|
|
+# 第五章 血煞秘境
|
|
|
+
|
|
|
+林天跟随云长老来到了血煞秘境入口。这里是东域三大凶地之一...
|
|
|
+→ location: "血煞秘境"
|
|
|
+
|
|
|
+# 第三章 拍卖会
|
|
|
+
|
|
|
+天云城,天宝阁。今日是月度拍卖会...
|
|
|
+→ location: "天宝阁" (prefer the specific venue over the city)
|
|
|
+```
|
|
|
+
|
|
|
+**Edge Cases**:
|
|
|
+- Multiple locations in one chapter → pick **first major location**
|
|
|
+- Transition chapters → pick **destination location**
|
|
|
+- Flashback scenes → pick the **current-timeline location**, not the flashback's setting (the current schema has no field for flashback locations)
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 3. Character Extraction
|
|
|
+
|
|
|
+**Strategy**:
|
|
|
+
|
|
|
+**A) Identify Named Characters**:
|
|
|
+- Extract names from:
|
|
|
+ - Dialogue attributions: `林天说道:`
|
|
|
+ - NEW_ENTITY tags: `[NEW_ENTITY: 角色, 慕容战天, ...]`
|
|
|
+ - Narrative mentions: `慕容战天冷笑一声`
|
|
|
+
|
|
|
+**B) Filter Out**:
|
|
|
+- Generic terms: "修士", "弟子", "长老", "众人"
|
|
|
+- Pronouns: "他", "她", "我", "你"
|
|
|
+- Exception: keep a generic term when it is part of a character identifier, e.g. "云长老"
|
|
|
+
|
|
|
+**C) Ranking (Select Top 5)**:
|
|
|
+- **Priority 1**: Protagonist (usually the most-mentioned character)
|
|
|
+- **Priority 2**: Characters in dialogue
|
|
|
+- **Priority 3**: NEW_ENTITY tagged characters
|
|
|
+- **Priority 4**: Most mentioned names (by frequency)
|
|
|
+
|
|
|
+**D) Name Format**:
|
|
|
+- Use **full names** if available: "慕容战天" not just "战天"
|
|
|
+- Keep titles if they're identifiers: "云长老", "血煞门主"
|
|
|
+
|
|
|
+**Examples**:
|
|
|
+```markdown
|
|
|
+Content:
|
|
|
+林天看着慕容战天,心中一片平静。
|
|
|
+"废物,今天就是你的死期!"慕容战天冷笑。
|
|
|
+[NEW_ENTITY: 角色, 慕容虎, ...]
|
|
|
+云长老在一旁观战。
|
|
|
+
|
|
|
+→ characters: ["林天", "慕容战天", "慕容虎", "云长老"]
|
|
|
+```
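The mechanical parts of character extraction (reading `NEW_ENTITY` tags, dropping generic terms, capping the list at five) can be sketched as below. Both function names are hypothetical, the tag layout is assumed from the examples in this document, and ranking by priority is assumed to have happened before `filter_characters` is called.

```python
import re
from typing import List

# Generic terms and pronouns listed under "Filter Out".
GENERIC_TERMS = {"修士", "弟子", "长老", "众人", "他", "她", "我", "你"}

# Assumed tag layout: [NEW_ENTITY: 角色, <name>, <description>]
ENTITY_RE = re.compile(r"\[NEW_ENTITY:\s*角色\s*,\s*([^,\]]+)")


def tagged_characters(content: str) -> List[str]:
    """Return the names declared in NEW_ENTITY character tags, in order."""
    return [name.strip() for name in ENTITY_RE.findall(content)]


def filter_characters(ranked_names: List[str], limit: int = 5) -> List[str]:
    """Drop generic terms and duplicates, then keep the first `limit` names."""
    seen = set()
    result = []
    for name in ranked_names:
        if name in GENERIC_TERMS or name in seen:
            continue
        seen.add(name)
        result.append(name)
    return result[:limit]
```

Because the filter matches whole strings, an identifier such as "云长老" survives even though "长老" on its own is removed.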
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 4. Word Count
|
|
|
+
|
|
|
+**Strategy**:
|
|
|
+- Count **total characters** in Markdown content (including Chinese/English/punctuation)
|
|
|
+- Python equivalent: `len(content)` (counts Unicode code points, so each Chinese character counts as 1)
|
|
|
+- **Do NOT** exclude Markdown syntax
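As a small illustration of the counting rule above, the sketch below contrasts the character count with the UTF-8 byte count, which is a common source of mismatched numbers. The function name `word_count` is chosen here to mirror the JSON field and is an assumption.

```python
def word_count(content: str) -> int:
    """Count every character, including Markdown syntax, spaces, and punctuation."""
    return len(content)


sample = "# 第一章"
chars = word_count(sample)               # '#', ' ', '第', '一', '章'
utf8_bytes = len(sample.encode("utf-8"))  # CJK characters take 3 bytes each in UTF-8
```

Here `chars` is 5 while `utf8_bytes` is 11, so the two measures should not be mixed when comparing against stored values.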
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 5. Content Hash
|
|
|
+
|
|
|
+**Strategy**:
|
|
|
+- Compute MD5 hash of the **entire content** (UTF-8 encoded)
|
|
|
+- Python equivalent: `hashlib.md5(content.encode('utf-8')).hexdigest()`
|
|
|
+- Used for detecting file changes (Self-Healing Index)
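The hashing rule above, together with the change check it enables, might look like the following sketch. `content_hash` and `has_changed` are hypothetical names; how the index actually stores and compares hashes is not described in this document.

```python
import hashlib


def content_hash(content: str) -> str:
    """MD5 hex digest of the full chapter text, encoded as UTF-8."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def has_changed(content: str, stored_hash: str) -> bool:
    """True when the chapter text no longer matches the previously stored hash."""
    return content_hash(content) != stored_hash
```

MD5 is used here only as a change fingerprint, as the specification states, not for any security purpose.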
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 6. Metadata Quality Assessment
|
|
|
+
|
|
|
+**Confidence Levels**:
|
|
|
+
|
|
|
+- **high**:
|
|
|
+ - Title extracted successfully
|
|
|
+ - Location explicitly marked OR clearly inferred from context
|
|
|
+ - ≥3 characters identified
|
|
|
+
|
|
|
+- **medium**:
|
|
|
+ - Title extracted
|
|
|
+ - Location inferred with moderate confidence
|
|
|
+ - 1-2 characters identified
|
|
|
+
|
|
|
+- **low**:
|
|
|
+ - Missing title OR location is "未知"
|
|
|
+ - No named characters found
|
|
|
+ - Content seems incomplete
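One way to read the three confidence levels as a decision rule is sketched below. This is an interpretation, not the specification itself: it assumes that any single "low" condition is enough to return `low`, and it represents "location inferred with moderate confidence" with a hypothetical boolean parameter `location_confident`.

```python
from typing import List


def assess_quality(
    title: str,
    location: str,
    characters: List[str],
    location_confident: bool = True,  # assumption: supplied by the caller's own judgment
) -> str:
    """Map the extracted fields to high / medium / low."""
    # Any missing essential field is treated as low confidence (assumed reading).
    if not title or location == "未知" or not characters:
        return "low"
    # High requires a confident location and at least three named characters.
    if location_confident and len(characters) >= 3:
        return "high"
    return "medium"
```

The "content seems incomplete" condition is left out because it needs a judgment about the text that a simple rule cannot make.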
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## ⚠️ Critical Rules
|
|
|
+
|
|
|
+### MUST DO:
|
|
|
+1. ✅ **Output ONLY JSON** - No explanations, no markdown code blocks, just the raw JSON object
|
|
|
+2. ✅ **Escape special characters** in JSON strings (quotes, backslashes)
|
|
|
+3. ✅ **Use double quotes** for JSON keys and string values
|
|
|
+4. ✅ **Include all 6 required fields** (title, location, characters, word_count, hash, metadata_quality)
|
|
|
+
|
|
|
+### MUST NOT:
|
|
|
+1. ❌ **Do NOT** output markdown code blocks (no `` ```json ``)
|
|
|
+2. ❌ **Do NOT** add comments or explanations outside JSON
|
|
|
+3. ❌ **Do NOT** guess wildly - use "未知" for location if truly uncertain
|
|
|
+4. ❌ **Do NOT** include generic terms in characters array
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 📋 Example Task Execution
|
|
|
+
|
|
|
+**Input**:
|
|
|
+```
|
|
|
+Chapter 7 content:
|
|
|
+# 第七章 突破
|
|
|
+
|
|
|
+东域,慕容家族,林天的小院。
|
|
|
+
|
|
|
+深夜,月光如水。
|
|
|
+
|
|
|
+林天盘膝而坐,运转《吞天诀》...
|
|
|
+```
|
|
|
+
|
|
|
+**Your Output** (emit the raw JSON only; the code fence below is just for display in this document):
|
|
|
+```json
|
|
|
+{
|
|
|
+ "title": "第七章 突破",
|
|
|
+ "location": "慕容家族",
|
|
|
+ "characters": ["林天"],
|
|
|
+ "word_count": 4521,
|
|
|
+ "hash": "7f8a9b2c3d4e5f6a7b8c9d0e1f2a3b4c",
|
|
|
+ "metadata_quality": "medium"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🧪 Self-Check Before Output
|
|
|
+
|
|
|
+Before outputting, verify:
|
|
|
+- [ ] JSON is valid (no syntax errors)
|
|
|
+- [ ] All 6 fields are present
|
|
|
+- [ ] `characters` is an array of strings (max 5 items)
|
|
|
+- [ ] `location` is a meaningful place name or "未知"
|
|
|
+- [ ] `metadata_quality` is one of: high/medium/low
|
|
|
+- [ ] No text outside the JSON object
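A caller could run the same checklist mechanically before passing the output on. The sketch below is a hypothetical validator (`validate_output` is not part of the described toolchain) built only on Python's standard `json` module. It returns a list of problems, which is empty when every check passes.

```python
import json
from typing import List

REQUIRED_FIELDS = {"title", "location", "characters", "word_count", "hash", "metadata_quality"}
QUALITY_LEVELS = {"high", "medium", "low"}


def validate_output(raw: str) -> List[str]:
    """Return a list of checklist violations for the agent's raw output."""
    try:
        # json.loads rejects code fences and any other text around the object.
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    chars = data.get("characters")
    if not (isinstance(chars, list) and len(chars) <= 5
            and all(isinstance(c, str) for c in chars)):
        problems.append("characters must be a list of at most 5 strings")
    location = data.get("location")
    if not isinstance(location, str) or not location.strip():
        problems.append("location must be a non-empty string")
    if data.get("metadata_quality") not in QUALITY_LEVELS:
        problems.append("metadata_quality must be high, medium, or low")
    return problems
```

Whether a location string is a "meaningful place name" still needs human or model judgment; the validator only checks that it is present and non-empty.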
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🔄 Integration Point
|
|
|
+
|
|
|
+This agent is called by **webnovel-write Step 4.6.1**:
|
|
|
+```
|
|
|
+Main workflow → metadata-extractor agent → structured_index.py
|
|
|
+```
|
|
|
+
|
|
|
+The extracted metadata is then passed to `structured_index.py --metadata-json` for database insertion.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+**End of Specification**
|