feat: add reference CSV search infrastructure

- reference_search.py: BM25 keyword search over the reference CSV data, with --skill/--genre/--table filters
- 命名规则.csv / 场景写法.csv: seed data (UTF-8 with BOM)
- references/csv/README.md: CSV schema documentation
- test_reference_search.py: all 7 pytest cases pass
- pytest.ini: add the scripts/tests test path
lingfengQAQ, 2 months ago
parent
commit
3bc52a3bf2

+ 1 - 1
pytest.ini

@@ -1,4 +1,4 @@
 [pytest]
-testpaths = webnovel-writer/scripts/data_modules/tests
+testpaths = webnovel-writer/scripts/data_modules/tests webnovel-writer/scripts/tests
 pythonpath = webnovel-writer/scripts
 addopts = -q --cov --cov-report=term-missing --cov-fail-under=90

+ 58 - 0
webnovel-writer/references/csv/README.md

@@ -0,0 +1,58 @@
+# Reference CSV Data Specification
+
+This directory holds structured reference data for the skill system. All CSV files are encoded as **UTF-8 with BOM**.
+
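+## Common columns (shared by all tables)
+
+| Column | Description | Example |
+|------|------|------|
+| `编号` | Unique ID with a table prefix | `NR-001`, `SP-002`, `WT-003` |
+| `适用技能` | Applicable skills, pipe-separated | `write\|init\|plan` |
+| `分类` | Category tag | `角色`, `战斗`, `叙事` |
+| `层级` | Tier: one of `提醒` / `缺陷补偿` / `知识补充` | `缺陷补偿` |
+| `关键词` | Search keywords, comma-separated | `角色命名,人名,玄幻命名` |
+| `适用题材` | Applicable genres: `全部` (all) or a comma-separated list of genre names | `玄幻,仙侠` |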
+
+## ID prefixes
+
+| Prefix | Table |
+|------|------|
+| `NR-` | 命名规则 (naming rules) |
+| `SP-` | 场景写法 (scene writing) |
+| `WT-` | 写作技法 (writing techniques) |
+
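+## Table-specific columns
+
+### 命名规则.csv
+
+| Column | Description |
+|------|------|
+| `命名对象` | Kind of thing being named (character name, book title, etc.) |
+| `规则` | Description of the naming rule |
+| `正例` | Examples of good names |
+| `反例` | Examples of poor names |
+
+### 场景写法.csv
+
+| Column | Description |
+|------|------|
+| `场景类型` | Scene category (battle scene, dialogue scene, etc.) |
+| `模式名称` | Name of the writing pattern |
+| `说明` | Detailed description of the pattern |
+| `示例片段` | Sample passage showing the pattern done well |
+| `反面写法` | Sample passage showing how not to write it |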
+
+## Applicable genres (Fanqie categories)
+
+**Male-oriented (男频):** 都市, 玄幻, 仙侠, 奇幻, 武侠, 历史, 军事, 科幻, 悬疑, 游戏, 体育, 轻小说
+
+**Female-oriented (女频):** 现言, 古言, 幻言, 悬疑, 轻小说
+
+## Searching
+
+Use the `reference_search.py` script to search:
+
+```bash
+python reference_search.py --skill write --query "角色命名" --genre 玄幻
+python reference_search.py --skill write --table 场景写法 --query "战斗描写" --max-results 3
+```
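
The column rules in this README could be checked mechanically before a CSV is committed. The following is a minimal sketch, not part of the commit: `validate_table` and the sample row are hypothetical, and the sketch assumes the text has already been decoded with `utf-8-sig` so that the BOM is gone.

```python
import csv
import io
from typing import List

# Values taken from the README tables above.
COMMON_COLUMNS = ["编号", "适用技能", "分类", "层级", "关键词", "适用题材"]
ID_PREFIXES = {"命名规则": "NR-", "场景写法": "SP-", "写作技法": "WT-"}
LEVELS = {"提醒", "缺陷补偿", "知识补充"}


def validate_table(table_name: str, text: str) -> List[str]:
    """Return a list of schema problems for one table (empty if none)."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in COMMON_COLUMNS if c not in header]
    if problems:
        return problems  # row checks need the common columns to exist
    prefix = ID_PREFIXES.get(table_name)
    for row_no, row in enumerate(reader, start=1):
        row_id = row["编号"] or ""
        if prefix and not row_id.startswith(prefix):
            problems.append(f"row {row_no}: unexpected ID prefix {row_id!r}")
        if row["层级"] not in LEVELS:
            problems.append(f"row {row_no}: unknown 层级 {row['层级']!r}")
    return problems


# Made-up sample row, for illustration only.
sample = (
    "编号,适用技能,分类,层级,关键词,适用题材\n"
    "NR-001,write|init,角色,提醒,人名,全部\n"
)
print(validate_table("命名规则", sample))
```

A check like this would catch a row filed under the wrong table or a misspelled `层级` value, neither of which the search script reports on its own.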

+ 4 - 0
webnovel-writer/references/csv/命名规则.csv

@@ -0,0 +1,4 @@
+编号,适用技能,分类,层级,关键词,适用题材,命名对象,规则,正例,反例
+NR-001,write|init|plan,角色,缺陷补偿,"角色命名,人名,玄幻命名,仙侠命名","玄幻,仙侠",角色人名,玄幻角色命名应体现修仙意境,避免现代感过强的名字;姓氏宜用古风单姓或复姓,名字可含天、云、剑、灵等意象字,"萧炎、叶凌天、慕容紫烟","张伟、李明、王小花"
+NR-002,write|init,角色,缺陷补偿,"角色命名,人名,都市命名,现代命名","都市,现言",角色人名,都市角色命名应贴合现实,避免过于中二或古风;可用常见姓氏搭配有个性但不夸张的名字,"陈默、林晚晴、顾南城","上官逸尘、慕容天帝、龙傲天"
+NR-003,write|init,书名,提醒,"书名命名,标题,取名,书名规则",全部,书名,书名应简短有力(2-6字为佳),能体现核心卖点或世界观;避免过长或含义模糊的标题,"《斗破苍穹》、《完美世界》、《赘婿》",《我在异世界当上了最强的那个人之后发生的故事》

+ 4 - 0
webnovel-writer/references/csv/场景写法.csv

@@ -0,0 +1,4 @@
+编号,适用技能,分类,层级,关键词,适用题材,场景类型,模式名称,说明,示例片段,反面写法
+SP-001,write|plan,战斗,知识补充,"战斗描写,打斗,动作,战斗场景",全部,战斗场景,节奏递进式战斗,战斗描写应遵循试探→对抗→转折→高潮的节奏递进,每个阶段篇幅递增;多用短句加快节奏,穿插感官描写增加沉浸感,他侧身一闪,拳风擦着耳畔呼啸而过。脚下猛然发力,整个人如离弦之箭冲了出去——,两人打了起来。你一拳我一脚,打了很久。最后主角赢了。
+SP-002,write,对话,知识补充,"对话声线,对话描写,人物对话,声线区分",全部,对话场景,声线差异化,不同角色的对话应有明显的语言风格差异:用词习惯、句式长短、口头禅等都应体现性格;避免所有角色说话方式雷同,"""老夫修行三百载,何曾怕过谁!""(长老)""切,又吹。""(少年)",所有角色都用相同的书面语说话,无法通过对话区分说话人
+SP-003,plan,叙事,知识补充,"卷级叙事,叙事功能,卷结构,情节节奏",全部,叙事结构,卷级叙事功能分配,每卷应有明确的叙事功能:引入卷(世界观+目标)、发展卷(升级+冲突)、高潮卷(Boss战+转折)、过渡卷(新地图铺垫);避免各卷功能重复或缺失,第一卷:废柴觉醒→拜师→初战告捷(引入期:建立世界观与主角目标),连续三卷都是打怪升级没有变化,读者审美疲劳
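
Two details of these seed files are easy to get wrong when editing them by hand: the BOM at the start of the file, and cells that contain commas or quotation marks. The sketch below illustrates both with Python's standard `csv` module. The row `SP-X` is a made-up example modelled on SP-002, not real seed data.

```python
import codecs
import csv
import io
from typing import Dict, List

# One header line and one row, encoded the way the seed files are (UTF-8 with BOM).
# A cell containing commas is wrapped in quotes; a literal quote is doubled.
raw = codecs.BOM_UTF8 + (
    "编号,适用技能,关键词,适用题材,示例片段\n"
    'SP-X,write|plan,"对话描写,声线区分",全部,"""切,又吹。""(少年)"\n'
).encode("utf-8")


def parse(data: bytes, encoding: str) -> List[Dict[str, str]]:
    """Decode *data* with *encoding* and parse it as a CSV with a header row."""
    return list(csv.DictReader(io.StringIO(data.decode(encoding), newline="")))


rows_sig = parse(raw, "utf-8-sig")   # BOM stripped: first column is "编号"
rows_plain = parse(raw, "utf-8")     # BOM kept: first column is "\ufeff编号"

print(list(rows_sig[0].keys())[0] == "编号")
print(list(rows_plain[0].keys())[0] == "编号")
print(rows_sig[0]["关键词"].split(","))
print(rows_sig[0]["示例片段"])
```

Reading with plain `utf-8` leaves the BOM glued to the first header name, so a lookup of `编号` would silently miss. That is presumably why the loader in this commit opens files with `utf-8-sig`.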

+ 298 - 0
webnovel-writer/scripts/reference_search.py

@@ -0,0 +1,298 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Reference CSV search tool.
+
+Runs a BM25 keyword search over the CSV files in references/csv/,
+filters by skill and, optionally, genre, and returns results as JSON.
+
+Usage:
+    python reference_search.py --skill write --query "角色命名" --genre 玄幻
+    python reference_search.py --skill write --table 场景写法 --query "战斗描写" --max-results 3
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import math
+import sys
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+
+# ---------------------------------------------------------------------------
+# CSV loading
+# ---------------------------------------------------------------------------
+
+def _load_csv(path: Path) -> List[Dict[str, str]]:
+    """Load a single CSV file (UTF-8 with BOM)."""
+    with open(path, "r", encoding="utf-8-sig", newline="") as f:
+        reader = csv.DictReader(f)
+        return list(reader)
+
+
+def load_tables(csv_dir: Path, table: Optional[str] = None) -> Dict[str, List[Dict[str, str]]]:
+    """
+    Load CSV tables from *csv_dir*.
+
+    If *table* is given, load only that file (``<table>.csv``).
+    Otherwise load every ``.csv`` file in the directory.
+
+    Returns ``{table_name: [row_dict, ...]}``.
+    """
+    tables: Dict[str, List[Dict[str, str]]] = {}
+    if table:
+        target = csv_dir / f"{table}.csv"
+        if target.is_file():
+            tables[table] = _load_csv(target)
+    else:
+        for p in sorted(csv_dir.glob("*.csv")):
+            tables[p.stem] = _load_csv(p)
+    return tables
+
+
+# ---------------------------------------------------------------------------
+# Filtering
+# ---------------------------------------------------------------------------
+
+def _skill_matches(row: Dict[str, str], skill: str) -> bool:
+    """Return True if *skill* appears in the pipe-separated ``适用技能`` column."""
+    cell = row.get("适用技能", "")
+    return skill in cell.split("|")
+
+
+def _genre_matches(row: Dict[str, str], genre: Optional[str]) -> bool:
+    """Return True if *genre* is None, or matches ``适用题材`` (``全部`` always matches)."""
+    if genre is None:
+        return True
+    cell = row.get("适用题材", "")
+    if cell.strip() == "全部":
+        return True
+    return genre in [g.strip() for g in cell.split(",")]
+
+
+# ---------------------------------------------------------------------------
+# BM25-lite scoring
+# ---------------------------------------------------------------------------
+
+def _tokenize(text: str) -> List[str]:
+    """Split text into terms on commas (ASCII or full-width), pipes, and whitespace."""
+    # The 关键词 field stores comma-separated terms; queries may use any of these
+    # separators. Terms are kept whole: there is no per-character splitting.
+    normalized = text
+    for sep in (",", "\uff0c", "|"):
+        normalized = normalized.replace(sep, " ")
+    return normalized.split()
+
+
+def _bm25_score(query_terms: List[str], doc_terms: List[str],
+                avg_dl: float, k1: float = 1.5, b: float = 0.75,
+                idf_map: Optional[Dict[str, float]] = None) -> float:
+    """
+    Simplified BM25 score for a single document.
+
+    *idf_map* maps each query term to its IDF value.
+    """
+    if not doc_terms:
+        return 0.0
+    dl = len(doc_terms)
+    score = 0.0
+    tf_map: Dict[str, int] = {}
+    for t in doc_terms:
+        tf_map[t] = tf_map.get(t, 0) + 1
+    for qt in query_terms:
+        tf = tf_map.get(qt, 0)
+        if tf == 0:
+            # Also check substring match (important for Chinese compound words)
+            for dt in tf_map:
+                if qt in dt or dt in qt:
+                    tf = max(tf, 1)
+                    break
+        if tf == 0:
+            continue
+        idf = idf_map.get(qt, 1.0) if idf_map else 1.0
+        numerator = tf * (k1 + 1)
+        denominator = tf + k1 * (1 - b + b * dl / max(avg_dl, 1))
+        score += idf * numerator / denominator
+    return score
+
+
+def _compute_idf(query_terms: List[str], all_docs: List[List[str]]) -> Dict[str, float]:
+    """Compute IDF for each query term across all documents."""
+    n = len(all_docs)
+    if n == 0:
+        return {}
+    idf: Dict[str, float] = {}
+    for qt in query_terms:
+        df = 0
+        for doc in all_docs:
+            for dt in doc:
+                if qt in dt or dt in qt:
+                    df += 1
+                    break
+        # BM25 IDF: log((N - df + 0.5) / (df + 0.5) + 1)
+        idf[qt] = math.log((n - df + 0.5) / (df + 0.5) + 1)
+    return idf
+
+
+# ---------------------------------------------------------------------------
+# Content summary builder
+# ---------------------------------------------------------------------------
+
+# Columns used for building 内容摘要, in priority order.
+_CONTENT_COLUMNS = [
+    "规则", "说明", "模式名称",
+    "正例", "示例片段",
+    "反例", "反面写法",
+    "命名对象", "场景类型",
+]
+
+
+def _build_summary(row: Dict[str, str]) -> str:
+    """Merge key content columns into a single summary string."""
+    parts: List[str] = []
+    for col in _CONTENT_COLUMNS:
+        val = row.get(col, "").strip()
+        if val:
+            parts.append(val)
+    return ";".join(parts) if parts else ""
+
+
+# ---------------------------------------------------------------------------
+# Search entry point
+# ---------------------------------------------------------------------------
+
+def search(
+    csv_dir: Path,
+    skill: str,
+    query: str,
+    table: Optional[str] = None,
+    genre: Optional[str] = None,
+    max_results: int = 5,
+) -> Dict[str, Any]:
+    """
+    Run a BM25 keyword search across CSV reference tables.
+
+    Returns a result dict suitable for JSON serialisation.
+    """
+    if not csv_dir.is_dir():
+        return {
+            "status": "error",
+            "error": {
+                "code": "CSV_DIR_NOT_FOUND",
+                "message": f"CSV directory not found: {csv_dir}",
+            },
+        }
+
+    tables = load_tables(csv_dir, table=table)
+    if not tables:
+        return {
+            "status": "success",
+            "message": "search_results",
+            "data": {
+                "query": query,
+                "skill": skill,
+                "genre": genre,
+                "total": 0,
+                "results": [],
+            },
+        }
+
+    # 1) Collect filtered rows with table name annotation
+    candidates: List[tuple] = []  # (table_name, row)
+    for tbl_name, rows in tables.items():
+        for row in rows:
+            if _skill_matches(row, skill) and _genre_matches(row, genre):
+                candidates.append((tbl_name, row))
+
+    if not candidates:
+        return {
+            "status": "success",
+            "message": "search_results",
+            "data": {
+                "query": query,
+                "skill": skill,
+                "genre": genre,
+                "total": 0,
+                "results": [],
+            },
+        }
+
+    # 2) Tokenize
+    query_terms = _tokenize(query)
+    doc_terms_list = [_tokenize(row.get("关键词", "")) for _, row in candidates]
+    avg_dl = sum(len(d) for d in doc_terms_list) / len(doc_terms_list) if doc_terms_list else 1.0
+    idf_map = _compute_idf(query_terms, doc_terms_list)
+
+    # 3) Score
+    scored: List[tuple] = []
+    for idx, (tbl_name, row) in enumerate(candidates):
+        score = _bm25_score(query_terms, doc_terms_list[idx], avg_dl, idf_map=idf_map)
+        if score > 0:
+            scored.append((score, tbl_name, row))
+
+    scored.sort(key=lambda x: x[0], reverse=True)
+    top = scored[:max_results]
+
+    # 4) Format results
+    results: List[Dict[str, Any]] = []
+    for _score, tbl_name, row in top:
+        results.append({
+            "编号": row.get("编号", ""),
+            "表": tbl_name,
+            "分类": row.get("分类", ""),
+            "层级": row.get("层级", ""),
+            "适用题材": row.get("适用题材", ""),
+            "内容摘要": _build_summary(row),
+        })
+
+    return {
+        "status": "success",
+        "message": "search_results",
+        "data": {
+            "query": query,
+            "skill": skill,
+            "genre": genre,
+            "total": len(results),
+            "results": results,
+        },
+    }
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+def _default_csv_dir() -> Path:
+    """Auto-detect the csv directory relative to this script's location."""
+    return Path(__file__).resolve().parent.parent / "references" / "csv"
+
+
+def main(argv: Optional[List[str]] = None) -> None:
+    parser = argparse.ArgumentParser(
+        description="BM25 keyword search over reference CSV files",
+    )
+    parser.add_argument("--skill", required=True, help="Filter by 适用技能 column")
+    parser.add_argument("--table", default=None, help="Target specific CSV file name (without .csv)")
+    parser.add_argument("--query", required=True, help="BM25 search keywords")
+    parser.add_argument("--genre", default=None, help="Filter by 适用题材 column")
+    parser.add_argument("--max-results", type=int, default=5, help="Max results (default 5)")
+    parser.add_argument("--csv-dir", default=None, help="Override CSV directory path")
+
+    args = parser.parse_args(argv)
+    csv_dir = Path(args.csv_dir) if args.csv_dir else _default_csv_dir()
+
+    result = search(
+        csv_dir=csv_dir,
+        skill=args.skill,
+        query=args.query,
+        table=args.table,
+        genre=args.genre,
+        max_results=args.max_results,
+    )
+    print(json.dumps(result, ensure_ascii=False))
+
+
+if __name__ == "__main__":
+    main()
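
To make the scoring in `_bm25_score` and `_compute_idf` easier to follow, here is a small worked sketch of the same formulas on the three keyword lists from 命名规则.csv. It is a simplification for illustration: the helper names `idf` and `bm25_term` are hypothetical, and matching here is exact, whereas the script also counts substring matches.

```python
import math
from typing import List


def idf(term: str, docs: List[List[str]]) -> float:
    """BM25 IDF with +1 inside the log, so the value is never negative."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)  # exact match only (assumption)
    return math.log((n - df + 0.5) / (df + 0.5) + 1)


def bm25_term(tf: int, dl: int, avg_dl: float, idf_value: float,
              k1: float = 1.5, b: float = 0.75) -> float:
    """Contribution of one query term to one document's score."""
    if tf == 0:
        return 0.0
    length_norm = k1 * (1 - b + b * dl / avg_dl)
    return idf_value * tf * (k1 + 1) / (tf + length_norm)


# Keyword lists of NR-001, NR-002 and NR-003.
docs = [
    ["角色命名", "人名", "玄幻命名", "仙侠命名"],
    ["角色命名", "人名", "都市命名", "现代命名"],
    ["书名命名", "标题", "取名", "书名规则"],
]
avg_dl = sum(len(d) for d in docs) / len(docs)

for term in ("角色命名", "玄幻命名"):
    term_idf = idf(term, docs)
    scores = [bm25_term(doc.count(term), len(doc), avg_dl, term_idf) for doc in docs]
    print(term, round(term_idf, 3), [round(s, 3) for s in scores])
```

Because every keyword list here has the average length and each term occurs at most once, the length normalisation cancels and a matching document's score equals the term's IDF. The rarer term (`玄幻命名`, in one list) therefore outranks the common one (`角色命名`, in two).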

+ 0 - 0
webnovel-writer/scripts/tests/__init__.py


+ 128 - 0
webnovel-writer/scripts/tests/test_reference_search.py

@@ -0,0 +1,128 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Tests for reference_search.py — BM25 keyword search over CSV reference files.
+"""
+
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+import pytest
+
+SCRIPT = str(Path(__file__).resolve().parents[1] / "reference_search.py")
+CSV_DIR = str(Path(__file__).resolve().parents[2] / "references" / "csv")
+
+
+def run_search(*args: str) -> dict:
+    """Run reference_search.py as a subprocess and return parsed JSON."""
+    result = subprocess.run(
+        [sys.executable, SCRIPT, "--csv-dir", CSV_DIR, *args],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode == 0, f"Script failed: {result.stderr}"
+    return json.loads(result.stdout)
+
+
+class TestSkillAndGenreFiltering:
+    """Test filtering by skill and genre."""
+
+    def test_skill_write_genre_xuanhuan_returns_nr001_not_nr002(self):
+        """--skill write --table 命名规则 --query 角色命名 --genre 玄幻 → NR-001, not NR-002."""
+        out = run_search(
+            "--skill", "write",
+            "--table", "命名规则",
+            "--query", "角色命名",
+            "--genre", "玄幻",
+        )
+        assert out["status"] == "success"
+        ids = [r["编号"] for r in out["data"]["results"]]
+        assert "NR-001" in ids
+        assert "NR-002" not in ids
+
+    def test_skill_write_cross_table_search(self):
+        """--skill write --query 战斗描写 → SP-001 from 场景写法."""
+        out = run_search(
+            "--skill", "write",
+            "--query", "战斗描写",
+        )
+        assert out["status"] == "success"
+        assert out["data"]["total"] >= 1
+        ids = [r["编号"] for r in out["data"]["results"]]
+        assert "SP-001" in ids
+        # Verify it comes from the right table
+        tables = [r["表"] for r in out["data"]["results"] if r["编号"] == "SP-001"]
+        assert tables[0] == "场景写法"
+
+    def test_nonexistent_query_returns_empty(self):
+        """--skill plan --query nonexistent → empty results, no error."""
+        out = run_search(
+            "--skill", "plan",
+            "--query", "nonexistent",
+        )
+        assert out["status"] == "success"
+        assert out["data"]["total"] == 0
+        assert out["data"]["results"] == []
+
+
+class TestErrorHandling:
+    """Test error cases."""
+
+    def test_missing_csv_dir_returns_error(self):
+        """Missing CSV dir → error JSON."""
+        result = subprocess.run(
+            [sys.executable, SCRIPT,
+             "--csv-dir", "/nonexistent/path/that/does/not/exist",
+             "--skill", "write",
+             "--query", "test"],
+            capture_output=True,
+            text=True,
+        )
+        out = json.loads(result.stdout)
+        assert out["status"] == "error"
+        assert "CSV_DIR_NOT_FOUND" in out["error"]["code"]
+
+
+class TestOutputFormat:
+    """Test output JSON structure."""
+
+    def test_result_has_required_fields(self):
+        """Each result has 编号, 表, 分类, 层级, 适用题材, 内容摘要."""
+        out = run_search(
+            "--skill", "write",
+            "--table", "命名规则",
+            "--query", "角色命名",
+        )
+        assert out["status"] == "success"
+        for r in out["data"]["results"]:
+            assert "编号" in r
+            assert "表" in r
+            assert "分类" in r
+            assert "层级" in r
+            assert "适用题材" in r
+            assert "内容摘要" in r
+
+    def test_data_envelope_fields(self):
+        """Data envelope has query, skill, genre, total, results."""
+        out = run_search(
+            "--skill", "write",
+            "--query", "命名",
+            "--genre", "玄幻",
+        )
+        data = out["data"]
+        assert data["query"] == "命名"
+        assert data["skill"] == "write"
+        assert data["genre"] == "玄幻"
+        assert isinstance(data["total"], int)
+        assert isinstance(data["results"], list)
+
+    def test_max_results_limits_output(self):
+        """--max-results 1 limits to 1 result."""
+        out = run_search(
+            "--skill", "write",
+            "--query", "命名",
+            "--max-results", "1",
+        )
+        assert out["data"]["total"] <= 1
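
These tests run against the committed seed CSVs, so their expectations may shift as rows are added. One possible complement is to write a small fixture directory per test and pass it through `--csv-dir`. The sketch below shows only the fixture-writing half; `write_fixture_csv` and the sample row are hypothetical, and inside pytest the temporary directory would normally come from the `tmp_path` fixture instead of `tempfile`.

```python
import codecs
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def write_fixture_csv(csv_dir: Path, table: str, rows: List[Dict[str, str]]) -> Path:
    """Write *rows* to <csv_dir>/<table>.csv as UTF-8 with BOM, like the seed files."""
    path = csv_dir / f"{table}.csv"
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Made-up row for illustration; real tests would choose their own data.
rows = [{
    "编号": "NR-901", "适用技能": "write", "分类": "角色", "层级": "提醒",
    "关键词": "角色命名,人名", "适用题材": "全部",
}]

with tempfile.TemporaryDirectory() as tmp:
    path = write_fixture_csv(Path(tmp), "命名规则", rows)
    raw_prefix = path.read_bytes()[:3]          # first three bytes of the file
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        round_trip = list(csv.DictReader(f))    # same way _load_csv reads it

print(raw_prefix == codecs.BOM_UTF8, round_trip == rows)
```

The directory written this way could then be handed to `run_search` in place of `CSV_DIR`, keeping each test's expected IDs independent of the seed data.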