fix: add recovery procedures and enhance API retry mechanism

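Follow-up fixes from the review report:

1. data-agent.md: add failure recovery procedure docs
   - Index rebuild command (rebuild-index)
   - Vector rebuild command (rebuild-vectors)
   - State sync commands

2. api_client.py: enhance the retry mechanism
   - Add max_retries setting (default 3)
   - Add exponential backoff
   - Handle status codes 429/500/502/503/504
   - Retry on timeout

3. config.py: add retry settings
   - api_max_retries: 3
   - api_retry_delay: 1.0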

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
lingfengQAQ, 5 months ago
Parent
Commit
afa7dfbc8f

+ 72 - 7
.claude/agents/data-agent.md

@@ -1,6 +1,6 @@
 ---
 name: data-agent
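-description: Data processing agent (v5.1), responsible for AI entity extraction, scene slicing, and index building. Uses the entities_v3 format and one-to-many aliases. Invoked automatically after a chapter is completed to handle the write side of the data chain. Supports optimized incremental SQLite writes.
+description: Data processing agent (v5.1), responsible for AI entity extraction, scene slicing, and index building. Uses the v5.1 entity format and one-to-many aliases, and writes to index.db. Invoked automatically after a chapter is completed to handle the write side of the data chain. Supports optimized incremental SQLite writes.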
 tools: Read, Write, Bash
 ---
 
@@ -321,6 +321,65 @@ python -m data_modules.style_sampler extract --chapter 100 --score 85 --scenes '
 
 ---
 
+## Failure recovery procedures (v5.1)
+
+### Index rebuild
+
+When index.db is corrupted or out of sync with the actual data, rebuild the index:
+
+```bash
+# Full index rebuild (re-extract all entities from the chapter text)
+python -m data_modules.index_manager rebuild-index --project-root "."
+
+# Rebuild only a specific chapter range
+python -m data_modules.index_manager rebuild-index --start 1 --end 50 --project-root "."
+
+# Verify index integrity
+python -m data_modules.index_manager verify-index --project-root "."
+```
+
+### Vector rebuild
+
+When vectors.db is corrupted or the embedding model has been replaced, rebuild the vectors:
+
+```bash
+# Full rebuild of the vector store
+python -m data_modules.rag_adapter rebuild-vectors --project-root "."
+
+# Rebuild only a specific chapter range
+python -m data_modules.rag_adapter rebuild-vectors --start 1 --end 50 --project-root "."
+
+# Check vector coverage
+python -m data_modules.rag_adapter check-coverage --project-root "."
+```
+
+### State sync
+
+When state.json and index.db disagree:
+
+```bash
+# Sync the protagonist state from index.db into state.json
+python -m data_modules.index_manager sync-protagonist --project-root "."
+
+# Export a snapshot of the current state
+python -m data_modules.index_manager export-state --output snapshot.json --project-root "."
+```
+
+### Data consistency check
+
+```bash
+# Check the consistency of the whole data chain
+python -m data_modules.index_manager health-check --project-root "."
+
+# Example output:
+# ✅ index.db: 256 entities, 512 aliases, 1024 scenes
+# ✅ vectors.db: 1024 vectors (100% coverage)
+# ⚠️ state.json: protagonist_state.entity_id missing in index.db
+#    → Recommended: run sync-protagonist
+```
+
+---
+
 ## Success criteria
 
 1. ✅ All entities that appear are correctly identified (accuracy > 90%)
@@ -343,14 +402,20 @@ python -m data_modules.style_sampler extract --chapter 100 --score 85 --scenes '
 Context Agent (read) ←→ data store ←→ Data Agent (write)
 ```
 
-**Data flow (v5.0)**:
+**Data flow (v5.1)**:
 ```
-Chapter text → Data Agent → state.json
-                          ├── entities_v3.{type}.{id}
-                          ├── alias_index (one-to-many)
+Chapter text → Data Agent → state.json (slimmed down)
+                          └── protagonist_state (snapshot)
+
+                          → index.db (v5.1 schema)
+                          ├── entities (id, canonical_name, current_json)
+                          ├── aliases (one-to-many)
                           ├── relationships
-                          └── state_changes
-                          → index.db
+                          ├── state_changes
+                          └── scenes
+
+                          → vectors.db
+                          └── scene vector embeddings
                           Context Agent → next-chapter context
 ```
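
The recovery section above says to rebuild when index.db is corrupted, but the diff does not show how corruption is detected. One possible pre-check, sketched here under assumptions, is SQLite's built-in `PRAGMA integrity_check`. The helper names `sqlite_is_healthy` and `recovery_command` are hypothetical and not part of `data_modules`; how the real `health-check` command works internally is not shown in this commit. Note that this only catches file-level damage, not an index that is valid but out of sync with the chapter text.

```python
import sqlite3
from typing import List


def sqlite_is_healthy(db_path: str) -> bool:
    """Return True if SQLite's own integrity check reports no problems."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a valid SQLite database.
        return False
    # A healthy database yields exactly one row containing the string "ok".
    return rows == [("ok",)]


def recovery_command(db_path: str, project_root: str = ".") -> List[str]:
    """Return the documented rebuild-index command if the check fails, else []."""
    if sqlite_is_healthy(db_path):
        return []
    return [
        "python", "-m", "data_modules.index_manager",
        "rebuild-index", "--project-root", project_root,
    ]
```

The returned list is shaped for `subprocess.run`, so a caller can log the command or ask for confirmation before a full rebuild is started.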

+ 108 - 52
.claude/scripts/data_modules/api_client.py

@@ -114,47 +114,75 @@ class EmbeddingAPIClient:
             return None
 
     async def embed(self, texts: List[str]) -> Optional[List[List[float]]]:
-        """Call the Embedding service"""
+        """Call the Embedding service (with retries)"""
         if not texts:
             return []
 
         timeout = self.config.cold_start_timeout if not self._warmed_up else self.config.normal_timeout
+        max_retries = getattr(self.config, 'api_max_retries', 3)
+        base_delay = getattr(self.config, 'api_retry_delay', 1.0)
 
         async with self.sem:
             start = time.time()
             session = await self._get_session()
 
-            try:
-                url = self._build_url()
-                headers = self._build_headers()
-                payload = self._build_payload(texts)
-
-                async with session.post(
-                    url,
-                    json=payload,
-                    headers=headers,
-                    timeout=aiohttp.ClientTimeout(total=timeout)
-                ) as resp:
-                    if resp.status == 200:
-                        text = await resp.text()
-                        import json as json_module
-                        data = json_module.loads(text)
-                        embeddings = self._parse_response(data)
-
-                        if embeddings:
-                            self.stats.total_calls += 1
-                            self.stats.total_time += time.time() - start
-                            return embeddings
+            for attempt in range(max_retries):
+                try:
+                    url = self._build_url()
+                    headers = self._build_headers()
+                    payload = self._build_payload(texts)
+
+                    async with session.post(
+                        url,
+                        json=payload,
+                        headers=headers,
+                        timeout=aiohttp.ClientTimeout(total=timeout)
+                    ) as resp:
+                        if resp.status == 200:
+                            text = await resp.text()
+                            import json as json_module
+                            data = json_module.loads(text)
+                            embeddings = self._parse_response(data)
+
+                            if embeddings:
+                                self.stats.total_calls += 1
+                                self.stats.total_time += time.time() - start
+                                self._warmed_up = True
+                                return embeddings
+
+                        # Retryable status codes: 429 (rate limited), 500, 502, 503, 504
+                        if resp.status in (429, 500, 502, 503, 504) and attempt < max_retries - 1:
+                            delay = base_delay * (2 ** attempt)  # exponential backoff
+                            print(f"[WARN] Embed {resp.status}, retrying in {delay:.1f}s ({attempt + 1}/{max_retries})")
+                            await asyncio.sleep(delay)
+                            continue
+
+                        self.stats.errors += 1
+                        err_text = await resp.text()
+                        print(f"[ERR] Embed {resp.status}: {err_text[:200]}")
+                        return None
 
+                except asyncio.TimeoutError:
+                    if attempt < max_retries - 1:
+                        delay = base_delay * (2 ** attempt)
+                        print(f"[WARN] Embed timeout, retrying in {delay:.1f}s ({attempt + 1}/{max_retries})")
+                        await asyncio.sleep(delay)
+                        continue
                     self.stats.errors += 1
-                    err_text = await resp.text()
-                    print(f"[ERR] Embed {resp.status}: {err_text[:200]}")
+                    print(f"[ERR] Embed: Timeout after {max_retries} attempts")
                     return None
 
-            except Exception as e:
-                self.stats.errors += 1
-                print(f"[ERR] Embed: {e}")
-                return None
+                except Exception as e:
+                    if attempt < max_retries - 1:
+                        delay = base_delay * (2 ** attempt)
+                        print(f"[WARN] Embed error: {e}, retrying in {delay:.1f}s ({attempt + 1}/{max_retries})")
+                        await asyncio.sleep(delay)
+                        continue
+                    self.stats.errors += 1
+                    print(f"[ERR] Embed: {e}")
+                    return None
+
+            return None
 
     async def embed_batch(
         self, texts: List[str], *, skip_failures: bool = True
@@ -277,44 +305,72 @@ class RerankAPIClient:
         documents: List[str],
         top_n: Optional[int] = None
     ) -> Optional[List[Dict[str, Any]]]:
-        """Call the Rerank service"""
+        """Call the Rerank service (with retries)"""
         if not documents:
             return []
 
         timeout = self.config.cold_start_timeout if not self._warmed_up else self.config.normal_timeout
+        max_retries = getattr(self.config, 'api_max_retries', 3)
+        base_delay = getattr(self.config, 'api_retry_delay', 1.0)
 
         async with self.sem:
             start = time.time()
             session = await self._get_session()
 
-            try:
-                url = self._build_url()
-                headers = self._build_headers()
-                payload = self._build_payload(query, documents, top_n)
-
-                async with session.post(
-                    url,
-                    json=payload,
-                    headers=headers,
-                    timeout=aiohttp.ClientTimeout(total=timeout)
-                ) as resp:
-                    if resp.status == 200:
-                        data = await resp.json()
-
-                        self.stats.total_calls += 1
-                        self.stats.total_time += time.time() - start
-
-                        return self._parse_response(data)
-                    else:
+            for attempt in range(max_retries):
+                try:
+                    url = self._build_url()
+                    headers = self._build_headers()
+                    payload = self._build_payload(query, documents, top_n)
+
+                    async with session.post(
+                        url,
+                        json=payload,
+                        headers=headers,
+                        timeout=aiohttp.ClientTimeout(total=timeout)
+                    ) as resp:
+                        if resp.status == 200:
+                            data = await resp.json()
+
+                            self.stats.total_calls += 1
+                            self.stats.total_time += time.time() - start
+                            self._warmed_up = True
+
+                            return self._parse_response(data)
+
+                        # Retryable status codes
+                        if resp.status in (429, 500, 502, 503, 504) and attempt < max_retries - 1:
+                            delay = base_delay * (2 ** attempt)
+                            print(f"[WARN] Rerank {resp.status}, retrying in {delay:.1f}s ({attempt + 1}/{max_retries})")
+                            await asyncio.sleep(delay)
+                            continue
+
                         self.stats.errors += 1
                         err_text = await resp.text()
                         print(f"[ERR] Rerank {resp.status}: {err_text[:200]}")
                         return None
 
-            except Exception as e:
-                self.stats.errors += 1
-                print(f"[ERR] Rerank: {e}")
-                return None
+                except asyncio.TimeoutError:
+                    if attempt < max_retries - 1:
+                        delay = base_delay * (2 ** attempt)
+                        print(f"[WARN] Rerank timeout, retrying in {delay:.1f}s ({attempt + 1}/{max_retries})")
+                        await asyncio.sleep(delay)
+                        continue
+                    self.stats.errors += 1
+                    print(f"[ERR] Rerank: Timeout after {max_retries} attempts")
+                    return None
+
+                except Exception as e:
+                    if attempt < max_retries - 1:
+                        delay = base_delay * (2 ** attempt)
+                        print(f"[WARN] Rerank error: {e}, retrying in {delay:.1f}s ({attempt + 1}/{max_retries})")
+                        await asyncio.sleep(delay)
+                        continue
+                    self.stats.errors += 1
+                    print(f"[ERR] Rerank: {e}")
+                    return None
+
+            return None
 
     async def warmup(self):
         """Warm up the service"""
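
In the hunks above, both the Embedding and Rerank clients loop over `range(max_retries)`, so the configured value counts total attempts. With the defaults (3 and 1.0), a call that keeps failing waits 1.0 s, then 2.0 s, and then gives up. The control flow can be sketched without any HTTP transport as follows. The names `call_with_retry`, `backoff_delays`, `RETRYABLE_STATUSES`, and the `send` callable are hypothetical and do not appear in `api_client.py`; `send` stands in for the `session.post(...)` call and is assumed to return a `(status, body)` pair.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

# Status codes the commit treats as retryable.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


def backoff_delays(max_retries: int, base_delay: float) -> List[float]:
    """Delays slept between attempts; there is no sleep after the final attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


async def call_with_retry(
    send: Callable[[], Awaitable[Tuple[int, str]]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[str]:
    """Call send() up to max_retries times; return the body on HTTP 200, else None."""
    for attempt in range(max_retries):
        is_last = attempt == max_retries - 1
        try:
            status, body = await send()
        except Exception:
            # Timeouts and connection errors both land here, as in the diff.
            if is_last:
                return None
        else:
            if status == 200:
                return body
            if status not in RETRYABLE_STATUSES or is_last:
                return None
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        await sleep(base_delay * (2 ** attempt))
    return None
```

Passing `sleep` as a parameter is a choice made for this sketch only: it lets a test substitute a recorder for `asyncio.sleep` and check the delay schedule without waiting.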

+ 3 - 0
.claude/scripts/data_modules/config.py

@@ -104,6 +104,9 @@ class DataModulesConfig:
     cold_start_timeout: int = 300
     normal_timeout: int = 180
 
+    # ================= Retry settings =================
+    api_max_retries: int = 3  # maximum number of retries
+    api_retry_delay: float = 1.0  # initial retry delay (seconds), with exponential backoff
 
     # ================= Retrieval settings =================
     vector_top_k: int = 30
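
The client code in this commit reads the new fields with `getattr(self.config, 'api_max_retries', 3)` rather than direct attribute access, so a config object created before these fields existed still works. A small sketch of that fallback, using hypothetical stand-in classes (`LegacyConfig`, `ConfigWithRetry`) and a hypothetical helper `retry_settings` instead of the real `DataModulesConfig`:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LegacyConfig:
    """Stand-in for a config object that predates the retry fields."""
    normal_timeout: int = 180


@dataclass
class ConfigWithRetry:
    """Stand-in carrying the two fields added in this commit."""
    api_max_retries: int = 3
    api_retry_delay: float = 1.0


def retry_settings(config: object) -> Tuple[int, float]:
    """Read the retry fields the way the client does, with the same defaults."""
    max_retries = getattr(config, "api_max_retries", 3)
    base_delay = getattr(config, "api_retry_delay", 1.0)
    return max_retries, base_delay
```

Here `retry_settings(LegacyConfig())` falls back to `(3, 1.0)`, while `retry_settings(ConfigWithRetry(api_max_retries=5, api_retry_delay=0.5))` returns the overridden values.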