---
name: metadata-extractor
description: Extract structured metadata from webnovel chapter content for indexing.
allowed-tools: Read Grep
---
# Metadata Extractor Agent
> **Purpose**: Extract structured metadata from webnovel chapter content for indexing.
>
> **Role**: Specialized agent for analyzing chapter Markdown content and extracting key metadata (location, characters, title, etc.) with high accuracy using semantic understanding.
---
## 🎯 Core Responsibility
Extract **structured metadata** from webnovel chapter content to populate the structured index database, enabling:
- Fast location-based chapter queries (O(log n) performance)
- Character appearance tracking
- Content change detection (via hash)
---
## 📥 Input Format
**Parameters**:
- `chapter_num`: Chapter number (integer)
- `chapter_content`: Full Markdown content of the chapter
**Example Input**:
```markdown
# 第一章 废柴少年
东域,慕容家族。
清晨的阳光洒在演武场上,带着几分温暖,却驱散不了林天心中的寒意。
"废物!连练气期一层都突破不了,还有脸站在这里?"
```
---
## 📤 Output Format
**CRITICAL**: Output **ONLY** a valid JSON object, no additional text or explanations.
**JSON Schema**:
```json
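{
"title": "string (chapter title, taken from the first `#` heading line)",
"location": "string (primary location, inferred from context)",
"characters": ["array of strings (names of characters who appear; at most 5 main characters)"],
"word_count": "integer (total character count)",
"hash": "string (MD5 hash of content)",
"metadata_quality": "string (high/medium/low: confidence in the extracted metadata)"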
}
```
**Example Input with XML Tags**:
```markdown
清晨的阳光洒在演武场上...
"废物!连练气期一层都突破不了..."
```
**Example Output**:
```json
{
"title": "第一章 废柴少年",
"location": "慕容家族",
"characters": ["林天", "慕容战天", "慕容虎", "云长老"],
"word_count": 3215,
"hash": "abc123def456...",
"metadata_quality": "high"
}
```
---
## 🔍 Extraction Guidelines
### 1. Title Extraction
**Strategy**:
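- Extract from the first Markdown heading in the content (normally a level-1 `# Heading`)
- Remove the leading `#` symbols and any leading/trailing whitespace
- Typical format: "第N章 章节名" (chapter number, then chapter name); keep any other title format exactly as written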
**Examples**:
```markdown
# 第一章 废柴少年 → "第一章 废柴少年"
## 第十五章:突破! → "第十五章:突破!"
# Chapter 7 - The Battle → "Chapter 7 - The Battle"
```
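The title rule above can be sketched in Python as follows. This is a minimal illustration, not the agent's actual implementation: the name `extract_title` and the choice to accept any heading level (the second example uses `##`) are assumptions.

```python
import re
from typing import Optional

# An ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^\s*#{1,6}\s+(.*?)\s*$")


def extract_title(chapter_content: str) -> Optional[str]:
    """Return the text of the first Markdown heading, or None if none exists.

    Hypothetical helper: accepting heading levels other than '#' is an assumption.
    """
    for line in chapter_content.splitlines():
        match = HEADING_RE.match(line)
        if match and match.group(1):
            # group(1) is the heading text without '#' or surrounding whitespace.
            return match.group(1)
    return None
```

Returning `None` rather than an empty string makes a missing title easy to detect when assessing metadata quality.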
---
### 2. Location Extraction ⭐ (Most Critical)
**Strategy** (in priority order):
**A) Explicit Location Markers** (Highest Priority):
```markdown
**地点:天云宗** → "天云宗"
**位置:血煞秘境** → "血煞秘境"
【场景:拍卖会】 → "拍卖会"
```
**B) Context Clues in First 10 Lines**:
- Look for geographical/organizational names after chapter title
- Common patterns:
- "东域,慕容家族。" → "慕容家族"
- "天云宗,外门演武场。" → "天云宗"
- "林天来到了血煞秘境入口。" → "血煞秘境"
**C) Semantic Analysis**:
- Identify most frequently mentioned location in first 500 characters
- Prioritize:
- Sect/family/faction names (宗门/家族/势力名称)
- Geographical names (地理区域名称)
- Building/venue names (建筑/场所名称)
**D) Default**:
- If no clear location found: `"未知"`
- If multiple locations: choose the **first mentioned** or **most prominent**
**Examples**:
```markdown
# 第五章 血煞秘境
林天跟随云长老来到了血煞秘境入口。这里是东域三大凶地之一...
→ location: "血煞秘境"
# 第三章 拍卖会
天云城,天宝阁。今日是月度拍卖会...
→ location: "天宝阁" (prefer the specific venue over the city)
```
**Edge Cases**:
- Multiple locations in one chapter → pick **first major location**
- Transition chapters → pick **destination location**
- Flashback scenes → pick the **current timeline location**, not the flashback's setting (the current schema has no field for recording the flashback location)
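Step A (explicit location markers) is the only part of this strategy that is purely mechanical, so it can be sketched with a regular expression. The sketch below assumes the marker labels are limited to the three shown in the examples (地点, 位置, 场景) and accepts both ASCII and full-width colons; the name `find_explicit_location` is hypothetical. Steps B and C rely on semantic judgment and are not covered here.

```python
import re
from typing import Optional

# Marker labels come from the examples above; treating them as the full set is an assumption.
# The capture stops at '*' (bold markup), '】' (bracket close), or end of line.
MARKER_RE = re.compile(r"(?:地点|位置|场景)[:：]\s*([^*】\n]+)")


def find_explicit_location(chapter_content: str) -> Optional[str]:
    """Return the location named by the first explicit marker, or None."""
    match = MARKER_RE.search(chapter_content)
    return match.group(1).strip() if match else None
```

When this returns `None`, the remaining steps (context clues, semantic analysis, then the `"未知"` default) still apply.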
---
### 3. Character Extraction
**Strategy**:
**A) Identify Named Characters**:
- Extract names from:
- Dialogue attributions: `林天说道:`
- XML entity tags: ``
- XML skill tags: `` (Protagonist learning new skills)
- Narrative mentions: `慕容战天冷笑一声`
**B) Filter Out**:
- Generic terms: "修士", "弟子", "长老", "众人"
- Pronouns: "他", "她", "我", "你"
- Unless part of a name: "云长老" is valid if it's a character identifier
**C) Ranking (Select Top 5)**:
- **Priority 1**: Protagonist (主角,usually most mentioned)
- **Priority 2**: Characters in dialogue
- **Priority 3**: XML-tagged characters (``)
- **Priority 4**: Most mentioned names (by frequency)
**D) Name Format**:
- Use **full names** if available: "慕容战天" not just "战天"
- Keep titles if they're identifiers: "云长老", "血煞门主"
**Examples**:
```markdown
Content:
林天看着慕容战天,心中一片平静。
"废物,今天就是你的死期!"慕容战天冷笑。
云长老在一旁观战。
→ characters: ["林天", "慕容战天", "云长老"]
```
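The filtering and top-5 cap can be illustrated with a short sketch. It assumes the candidate names have already been collected (one entry per mention) and ranks purely by mention frequency, so it covers only the frequency-based priorities, not the dialogue or tag priorities. The names `select_characters`, `GENERIC_TERMS`, and `PRONOUNS` are hypothetical, and the term lists are just the examples given above.

```python
from collections import Counter
from typing import Iterable, List

# Example filter lists from the guidelines; a real list would likely be longer.
GENERIC_TERMS = {"修士", "弟子", "长老", "众人"}
PRONOUNS = {"他", "她", "我", "你"}
MAX_CHARACTERS = 5


def select_characters(mentions: Iterable[str]) -> List[str]:
    """Drop generic terms and pronouns, then keep the most-mentioned names."""
    counts = Counter(
        name
        for name in mentions
        # Exact-match filtering keeps identifiers such as "云长老".
        if name not in GENERIC_TERMS and name not in PRONOUNS
    )
    return [name for name, _ in counts.most_common(MAX_CHARACTERS)]
```

Because the filter matches whole strings, "长老" is dropped while "云长老" survives as a character identifier.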
---
### 4. Word Count
**Strategy**:
- Count **total characters** in Markdown content (including Chinese/English/punctuation)
- Use: `len(content)`
- **Do NOT** exclude Markdown syntax
---
### 5. Content Hash
**Strategy**:
- Compute MD5 hash of the **entire content** (UTF-8 encoded)
- Python equivalent: `hashlib.md5(content.encode('utf-8')).hexdigest()`
- Used for detecting file changes (Self-Healing Index)
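Both computed fields (the character count from the previous section and the hash described here) follow directly from the Python expressions already given. A minimal sketch, with hypothetical function names:

```python
import hashlib


def compute_word_count(content: str) -> int:
    """Count every character, including Markdown syntax and punctuation."""
    return len(content)


def compute_content_hash(content: str) -> str:
    """Return the hex MD5 digest of the UTF-8 encoded content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()
```

The digest is always 32 lowercase hexadecimal characters, so any change to the chapter text, including whitespace, produces a different `hash` value.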
---
### 6. Metadata Quality Assessment
**Confidence Levels**:
- **high**:
- Title extracted successfully
- Location explicitly marked OR clearly inferred from context
- ≥3 characters identified
- **medium**:
- Title extracted
- Location inferred with moderate confidence
- 1-2 characters identified
- **low**:
- Missing title OR location is "未知"
- No named characters found
- Content seems incomplete
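One possible reading of these levels, sketched in Python. The criteria above do not say whether the conditions under each level combine with AND or OR, so this sketch assumes that any single `low` condition is enough to yield `low`, and it does not model "moderate confidence" or "content seems incomplete", which need semantic judgment. The function name `assess_quality` is hypothetical.

```python
def assess_quality(title: str, location: str, character_count: int) -> str:
    """Map extraction results to high/medium/low under the stated assumptions."""
    # Assumption: a missing title, an unknown location, or zero characters each force "low".
    if not title or location == "未知" or character_count == 0:
        return "low"
    # Title and location are present; three or more characters gives "high".
    if character_count >= 3:
        return "high"
    # Title and location are present with 1-2 characters.
    return "medium"
```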
---
## ⚠️ Critical Rules
### MUST DO:
1. ✅ **Output ONLY JSON** - No explanations, no markdown code blocks, just the raw JSON object
2. ✅ **Escape special characters** in JSON strings (quotes, backslashes)
3. ✅ **Use double quotes** for JSON keys and string values
4. ✅ **Include all 6 required fields** (title, location, characters, word_count, hash, metadata_quality)
### MUST NOT:
1. ❌ **Do NOT** output markdown code blocks (no `` ```json ``)
2. ❌ **Do NOT** add comments or explanations outside JSON
3. ❌ **Do NOT** guess wildly - use "未知" for location if truly uncertain
4. ❌ **Do NOT** include generic terms in characters array
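To illustrate what correctly escaped raw JSON looks like, the sketch below serializes a metadata object with Python's standard `json` module. The helper name `to_raw_json` is hypothetical; `ensure_ascii=False` is used so that Chinese text stays readable instead of being emitted as `\uXXXX` escapes.

```python
import json


def to_raw_json(metadata: dict) -> str:
    """Serialize metadata as a single raw JSON object with proper escaping."""
    # json.dumps escapes embedded double quotes and backslashes automatically.
    return json.dumps(metadata, ensure_ascii=False)
```

A title that contains a double quote is emitted with the quote escaped as `\"`, and the result parses back to the original string.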
---
## 📋 Example Task Execution
**Input**:
```
Chapter 7 content:
# 第七章 突破
东域,慕容家族,林天的小院。
深夜,月光如水。
林天盘膝而坐,运转《吞天诀》...
```
**Your Output** (emit as raw JSON; the code fence below is only for display in this document):
```json
{
"title": "第七章 突破",
"location": "慕容家族",
"characters": ["林天"],
"word_count": 4521,
"hash": "7f8a9b2c3d4e5f6a7b8c9d0e1f2a3b4c",
"metadata_quality": "medium"
}
```
---
## 🧪 Self-Check Before Output
Before outputting, verify:
- [ ] JSON is valid (no syntax errors)
- [ ] All 6 fields are present
- [ ] `characters` is an array of strings (max 5 items)
- [ ] `location` is a meaningful place name or "未知"
- [ ] `metadata_quality` is one of: high/medium/low
- [ ] No text outside the JSON object
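The checklist can also be expressed as a small validator, for example on the consuming side of the workflow. This is a sketch under assumptions: the function name `check_output` and the wording of the problem messages are hypothetical, and "meaningful place name" is approximated as "non-empty string".

```python
import json
from typing import List

REQUIRED_FIELDS = ("title", "location", "characters", "word_count", "hash", "metadata_quality")
QUALITY_LEVELS = ("high", "medium", "low")
MAX_CHARACTERS = 5


def check_output(raw_output: str) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    try:
        # Fails on code fences or any text outside the JSON object.
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    characters = data.get("characters")
    if "characters" in data and not (
        isinstance(characters, list)
        and len(characters) <= MAX_CHARACTERS
        and all(isinstance(name, str) for name in characters)
    ):
        problems.append("characters must be a list of at most 5 strings")

    location = data.get("location")
    if "location" in data and not (isinstance(location, str) and location.strip()):
        problems.append("location must be a non-empty string")

    if "metadata_quality" in data and data["metadata_quality"] not in QUALITY_LEVELS:
        problems.append("metadata_quality must be one of: high, medium, low")

    return problems
```

Output wrapped in a Markdown code fence fails the first check, because the fence makes the text invalid JSON.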
---
## 🔄 Integration Point
This agent is called by **webnovel-write Step 4.6.1**:
```
Main workflow → metadata-extractor agent → structured_index.py
```
The extracted metadata is then passed to `structured_index.py --metadata-json` for database insertion.
---
**End of Specification**