Commit 167f33b4481a46dbfb4e131a03c41b1fe45ace6f

Authored by tangwang
1 parent d172c259

Evaluation framework frontend

scripts/evaluation/README.md
... ... @@ -34,7 +34,7 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS
34 34 # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
35 35 ./scripts/evaluation/quick_start_eval.sh batch
36 36  
37   -# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive)
  37 +# Deep rebuild: per-query full-corpus rerank (outside the search top-500 pool) + LLM in 50-doc batches along the global sort order (early stop; expensive)
38 38 ./scripts/evaluation/quick_start_eval.sh batch-rebuild
39 39  
40 40 # UI: http://127.0.0.1:6010/
... ... @@ -69,9 +69,24 @@ Explicit equivalents:
69 69 --port 6010
70 70 ```
71 71  
72   -Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM.
  72 +Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
73 73  
74   -**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those SKUs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches.
  74 +### `quick_start_eval.sh batch-rebuild` (deep annotation rebuild)
  75 +
  76 +This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.
  77 +
  78 +For **each** query in `queries.txt`, in order:
  79 +
  80 +1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
  81 +2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
  82 +3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
  83 +4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
  84 +5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
  85 +6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged.
  86 +
  87 +**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
  88 +
  89 +**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
75 90  
76 91 ## Artifacts
77 92  
... ... @@ -95,7 +110,7 @@ Default root: `artifacts/search_evaluation/`
95 110  
96 111 **Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
97 112  
98   -**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`.
  113 +**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.
99 114  
100 115 **Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
101 116  
... ...
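The batched LLM labeling with early stop described in step 6 of the README hunk above can be sketched as follows. This is a minimal sketch, not the real `build_annotation_set.py` code: `label_with_early_stop` and `judge_batch` are hypothetical names, and the constants mirror the defaults stated in the README and in `eval_framework/constants.py`.

```python
from typing import Callable, Dict, List

# Defaults mirroring eval_framework/constants.py after this commit.
BATCH_SIZE = 50       # DEFAULT_REBUILD_LLM_BATCH_SIZE
MIN_BATCHES = 20      # DEFAULT_REBUILD_MIN_LLM_BATCHES (raised from 15 here)
MAX_BATCHES = 40      # DEFAULT_REBUILD_MAX_LLM_BATCHES
STOP_RATIO = 0.92     # DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO
STOP_STREAK = 3       # DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK


def label_with_early_stop(
    docs_sorted: List[str],
    judge_batch: Callable[[List[str]], List[str]],
) -> Dict[str, str]:
    """Walk the globally sorted doc list from the head in fixed-size batches,
    stopping once STOP_STREAK consecutive batches are almost all Irrelevant
    (but never before MIN_BATCHES, never past MAX_BATCHES)."""
    labels: Dict[str, str] = {}
    streak = 0
    for i in range(MAX_BATCHES):
        batch = docs_sorted[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
        if not batch:
            break  # ran out of docs before hitting MAX_BATCHES
        judged = judge_batch(batch)  # one label per doc in this batch
        labels.update(zip(batch, judged))
        irrelevant_ratio = judged.count("Irrelevant") / len(judged)
        streak = streak + 1 if irrelevant_ratio > STOP_RATIO else 0
        if i + 1 >= MIN_BATCHES and streak >= STOP_STREAK:
            break  # tail of the sorted list is never judged
    return labels
```

With these defaults, a query labels at most `40 * 50 = 2000` docs, and a long irrelevant tail stops labeling at batch 20 at the earliest.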
scripts/evaluation/eval_framework/cli.py
... ... @@ -53,7 +53,7 @@ def build_cli_parser() -> argparse.ArgumentParser:
53 53 help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).",
54 54 )
55 55 build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).")
56   - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).")
  56 + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).")
57 57 build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).")
58 58 build.add_argument(
59 59 "--rebuild-irrelevant-stop-ratio",
... ...
scripts/evaluation/eval_framework/clients.py
... ... @@ -21,11 +21,19 @@ class SearchServiceClient:
21 21 self.tenant_id = str(tenant_id)
22 22 self.session = requests.Session()
23 23  
24   - def search(self, query: str, size: int, from_: int = 0, language: str = "en") -> Dict[str, Any]:
  24 + def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
  25 + payload: Dict[str, Any] = {
  26 + "query": query,
  27 + "size": size,
  28 + "from": from_,
  29 + "language": language,
  30 + }
  31 + if debug:
  32 + payload["debug"] = True
25 33 response = self.session.post(
26 34 f"{self.base_url}/search/",
27 35 headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id},
28   - json={"query": query, "size": size, "from": from_, "language": language},
  36 + json=payload,
29 37 timeout=120,
30 38 )
31 39 response.raise_for_status()
... ...
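The new `debug` keyword in `SearchServiceClient.search` is additive: when it is false, the request body is exactly what older callers sent. A minimal sketch of the same conditional-payload pattern, with `build_search_payload` as a hypothetical stand-in for the body-building part of the method:

```python
from typing import Any, Dict


def build_search_payload(query: str, size: int, from_: int = 0,
                         language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
    """Mirror the diff above: the 'debug' key only appears in the wire
    payload when explicitly requested, so legacy request shapes are untouched."""
    payload: Dict[str, Any] = {
        "query": query,
        "size": size,
        "from": from_,
        "language": language,
    }
    if debug:
        payload["debug"] = True
    return payload
```

Keyword-only `debug` (after the `*`) also means existing positional call sites cannot accidentally enable it.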
scripts/evaluation/eval_framework/constants.py
... ... @@ -23,7 +23,7 @@ DEFAULT_SEARCH_RECALL_TOP_K = 500
23 23 DEFAULT_RERANK_HIGH_THRESHOLD = 0.5
24 24 DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000
25 25 DEFAULT_REBUILD_LLM_BATCH_SIZE = 50
26   -DEFAULT_REBUILD_MIN_LLM_BATCHES = 15
  26 +DEFAULT_REBUILD_MIN_LLM_BATCHES = 20
27 27 DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
28 28 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92
29 29 DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3
... ...
scripts/evaluation/eval_framework/framework.py
... ... @@ -45,9 +45,27 @@ from .utils import (
45 45 sha1_text,
46 46 utc_now_iso,
47 47 utc_timestamp,
  48 + zh_title_from_multilingual,
48 49 )
49 50  
50 51  
  52 +def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
  53 + """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``."""
  54 + out: Dict[str, str] = {}
  55 + if not isinstance(debug_info, dict):
  56 + return out
  57 + for entry in debug_info.get("per_result") or []:
  58 + if not isinstance(entry, dict):
  59 + continue
  60 + spu_id = str(entry.get("spu_id") or "").strip()
  61 + if not spu_id:
  62 + continue
  63 + zh = zh_title_from_multilingual(entry.get("title_multilingual"))
  64 + if zh:
  65 + out[spu_id] = zh
  66 + return out
  67 +
  68 +
51 69 class SearchEvaluationFramework:
52 70 def __init__(
53 71 self,
... ... @@ -893,7 +911,10 @@ class SearchEvaluationFramework:
893 911 language: str = "en",
894 912 force_refresh_labels: bool = False,
895 913 ) -> Dict[str, Any]:
896   - search_payload = self.search_client.search(query=query, size=max(top_k, 100), from_=0, language=language)
  914 + search_payload = self.search_client.search(
  915 + query=query, size=max(top_k, 100), from_=0, language=language, debug=True
  916 + )
  917 + zh_by_spu = _zh_titles_from_debug_per_result(search_payload.get("debug_info"))
897 918 results = list(search_payload.get("results") or [])
898 919 if auto_annotate:
899 920 self.annotate_missing_labels(query=query, docs=results[:top_k], force_refresh=force_refresh_labels)
... ... @@ -906,11 +927,16 @@ class SearchEvaluationFramework:
906 927 label = labels.get(spu_id)
907 928 if label not in VALID_LABELS:
908 929 unlabeled_hits += 1
  930 + primary_title = build_display_title(doc)
  931 + title_zh = zh_by_spu.get(spu_id) or ""
  932 + if not title_zh and isinstance(doc.get("title"), dict):
  933 + title_zh = zh_title_from_multilingual(doc.get("title"))
909 934 labeled.append(
910 935 {
911 936 "rank": rank,
912 937 "spu_id": spu_id,
913   - "title": build_display_title(doc),
  938 + "title": primary_title,
  939 + "title_zh": title_zh if title_zh and title_zh != primary_title else "",
914 940 "image_url": doc.get("image_url"),
915 941 "label": label,
916 942 "option_values": list(compact_option_values(doc.get("skus") or [])),
... ... @@ -934,12 +960,15 @@ class SearchEvaluationFramework:
934 960 doc = missing_docs_map.get(spu_id)
935 961 if not doc:
936 962 continue
  963 + miss_title = build_display_title(doc)
  964 + miss_zh = zh_title_from_multilingual(doc.get("title")) if isinstance(doc.get("title"), dict) else ""
937 965 missing_relevant.append(
938 966 {
939 967 "spu_id": spu_id,
940 968 "label": labels[spu_id],
941 969 "rerank_score": rerank_scores.get(spu_id),
942   - "title": build_display_title(doc),
  970 + "title": miss_title,
  971 + "title_zh": miss_zh if miss_zh and miss_zh != miss_title else "",
943 972 "image_url": doc.get("image_url"),
944 973 "option_values": list(compact_option_values(doc.get("skus") or [])),
945 974 "product": compact_product_payload(doc),
... ...
scripts/evaluation/eval_framework/static/eval_web.css
... ... @@ -40,7 +40,8 @@
40 40 .Irrelevant { background: var(--irrelevant); }
41 41 .Unknown { background: #637381; }
42 42 .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; }
43   - .title { font-size: 16px; font-weight: 700; margin-bottom: 8px; }
  43 + .title { font-size: 16px; font-weight: 700; margin-bottom: 4px; }
  44 + .title-zh { font-size: 14px; font-weight: 500; color: var(--muted); margin-bottom: 8px; line-height: 1.4; }
44 45 .options { color: var(--muted); line-height: 1.5; font-size: 14px; }
45 46 .section { margin-bottom: 28px; }
46 47 .history { font-size: 13px; line-height: 1.5; }
... ...
scripts/evaluation/eval_framework/static/eval_web.js
... ... @@ -25,6 +25,7 @@
25 25 <img class="thumb" src="${item.image_url || ''}" alt="" />
26 26 <div>
27 27 <div class="title">${item.title || ''}</div>
  28 + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''}
28 29 <div class="options">
29 30 <div>${(item.option_values || [])[0] || ''}</div>
30 31 <div>${(item.option_values || [])[1] || ''}</div>
... ...
scripts/evaluation/eval_framework/utils.py
... ... @@ -42,6 +42,14 @@ def pick_text(value: Any, preferred_lang: str = "en") -> str:
42 42 return str(value).strip()
43 43  
44 44  
  45 +def zh_title_from_multilingual(title_multilingual: Any) -> str:
  46 + """Chinese title string from API debug ``title_multilingual`` (ES-style dict)."""
  47 + if not isinstance(title_multilingual, dict):
  48 + return ""
  49 + zh = str(title_multilingual.get("zh") or "").strip()
  50 + return zh
  51 +
  52 +
45 53 def safe_json_dumps(data: Any) -> str:
46 54 return json.dumps(data, ensure_ascii=False, separators=(",", ":"))
47 55  
... ...
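Taken together, the two new helpers map a debug payload to per-SPU Chinese titles. A self-contained sketch: the sample `debug_info` shape is inferred from this diff, and `zh_titles_from_debug_per_result` here is a hypothetical module-level variant of the private `_zh_titles_from_debug_per_result` in `framework.py`.

```python
from typing import Any, Dict


def zh_title_from_multilingual(title_multilingual: Any) -> str:
    """Chinese title from an ES-style multilingual dict, else ''."""
    if not isinstance(title_multilingual, dict):
        return ""
    return str(title_multilingual.get("zh") or "").strip()


def zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
    """spu_id -> Chinese title, silently skipping malformed entries."""
    out: Dict[str, str] = {}
    if not isinstance(debug_info, dict):
        return out
    for entry in debug_info.get("per_result") or []:
        if not isinstance(entry, dict):
            continue
        spu_id = str(entry.get("spu_id") or "").strip()
        if spu_id and (zh := zh_title_from_multilingual(entry.get("title_multilingual"))):
            out[spu_id] = zh
    return out


# Assumed debug_info shape, based on the fields read in the diff above.
debug_info = {"per_result": [
    {"spu_id": "123", "title_multilingual": {"en": "Red shoes", "zh": " 红色鞋子 "}},
    {"spu_id": "", "title_multilingual": {"zh": "ignored: no spu_id"}},
    "not-a-dict",
]}
```

Note that the UI only renders `title_zh` when it differs from the primary title, which is why `framework.py` blanks it out on equality.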