Commit 167f33b4481a46dbfb4e131a03c41b1fe45ace6f
1 parent d172c259
Eval framework frontend
Showing 8 changed files with 74 additions and 12 deletions
scripts/evaluation/README.md
| ... | ... | @@ -34,7 +34,7 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS |
| 34 | 34 | # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM |
| 35 | 35 | ./scripts/evaluation/quick_start_eval.sh batch |
| 36 | 36 | |
| 37 | -# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive) | |
| 37 | +# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive) | |
| 38 | 38 | ./scripts/evaluation/quick_start_eval.sh batch-rebuild |
| 39 | 39 | |
| 40 | 40 | # UI: http://127.0.0.1:6010/ |
| ... | ... | @@ -69,9 +69,24 @@ Explicit equivalents: |
| 69 | 69 | --port 6010 |
| 70 | 70 | ``` |
| 71 | 71 | |
| 72 | -Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM. | |
| 72 | +Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline). | |
| 73 | 73 | |
| 74 | -**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those SKUs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches. | |
| 74 | +### `quick_start_eval.sh batch-rebuild` (deep annotation rebuild) | |
| 75 | + | |
| 76 | +This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`. | |
| 77 | + | |
| 78 | +For **each** query in `queries.txt`, in order: | |
| 79 | + | |
| 80 | +1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. | |
| 81 | +2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load). | |
| 82 | +3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API. | |
| 83 | +4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query. | |
| 84 | +5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins). | |
| 85 | +6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged. | |
| 86 | + | |
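The six rebuild steps above can be sketched as a small Python loop. The names (`pool_hits`, `rerank_scored`, `judge_batch`) are illustrative, not the framework's real identifiers; only the thresholds mirror the documented defaults.

```python
BATCH_SIZE = 50       # docs judged per LLM call
MIN_BATCHES = 20      # never early-stop before this many batches
MAX_BATCHES = 40      # hard cap: 40 * 50 = 2000 docs per query
STOP_RATIO = 0.92     # a batch "counts" toward stopping above this ratio
STOP_STREAK = 3       # consecutive near-all-irrelevant batches to stop

def rebuild_order(pool_hits, rerank_scored):
    """Pool in search rank order first, then non-pool docs by rerank
    score descending; dedupe by spu_id with the pool winning."""
    seen = {d["spu_id"] for d in pool_hits}
    tail = sorted(
        (d for d in rerank_scored if d["spu_id"] not in seen),
        key=lambda d: d["rerank_score"],
        reverse=True,
    )
    return pool_hits + tail

def label_with_early_stop(ordered_docs, judge_batch):
    """Walk the globally sorted list from the head, 50 docs per batch,
    stopping after 3 consecutive >92%-irrelevant batches (min 20, max 40)."""
    streak = 0
    for i in range(MAX_BATCHES):
        batch = ordered_docs[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
        if not batch:
            break
        labels = judge_batch(batch)  # one LLM call per batch
        irrelevant_ratio = labels.count("Irrelevant") / len(labels)
        streak = streak + 1 if irrelevant_ratio > STOP_RATIO else 0
        if i + 1 >= MIN_BATCHES and streak >= STOP_STREAK:
            break
```

The tail of the sorted list is never judged once the stop condition fires, which is why coverage per query tops out at 2000 labels.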
| 87 | +**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop. | |
| 88 | + | |
| 89 | +**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`). | |
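A minimal sketch of the 80-per-request rerank scoring with cache reuse, as described in step 3. `rerank_api` and the dict-as-cache are stand-ins for the real reranker client and the `rerank_scores` SQLite table.

```python
CHUNK = 80  # rerank API chunk size used in the rebuild path

def score_outside_pool(query, docs, cache, rerank_api, force_refresh=False):
    """Return {spu_id: score} for non-pool docs. Without force_refresh,
    cached (query, spu_id) rows are reused and only missing rows hit
    the API; with it, everything is recomputed and rewritten."""
    scores, todo = {}, []
    for d in docs:
        key = (query, d["spu_id"])
        if not force_refresh and key in cache:
            scores[d["spu_id"]] = cache[key]
        else:
            todo.append(d)
    for i in range(0, len(todo), CHUNK):
        chunk = todo[i:i + CHUNK]
        for d, s in zip(chunk, rerank_api(query, chunk)):
            cache[(query, d["spu_id"])] = s
            scores[d["spu_id"]] = s
    return scores
```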
| 75 | 90 | |
| 76 | 91 | ## Artifacts |
| 77 | 92 | |
| ... | ... | @@ -95,7 +110,7 @@ Default root: `artifacts/search_evaluation/` |
| 95 | 110 | |
| 96 | 111 | **Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`. |
| 97 | 112 | |
| 98 | -**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`. | |
| 113 | +**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead. | |
| 99 | 114 | |
| 100 | 115 | **Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`). |
| 101 | 116 | ... | ... |
scripts/evaluation/eval_framework/cli.py
| ... | ... | @@ -53,7 +53,7 @@ def build_cli_parser() -> argparse.ArgumentParser: |
| 53 | 53 | help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).", |
| 54 | 54 | ) |
| 55 | 55 | build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).") |
| 56 | - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).") | |
| 56 | + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).") | |
| 57 | 57 | build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).") |
| 58 | 58 | build.add_argument( |
| 59 | 59 | "--rebuild-irrelevant-stop-ratio", | ... | ... |
scripts/evaluation/eval_framework/clients.py
| ... | ... | @@ -21,11 +21,19 @@ class SearchServiceClient: |
| 21 | 21 | self.tenant_id = str(tenant_id) |
| 22 | 22 | self.session = requests.Session() |
| 23 | 23 | |
| 24 | - def search(self, query: str, size: int, from_: int = 0, language: str = "en") -> Dict[str, Any]: | |
| 24 | + def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]: | |
| 25 | + payload: Dict[str, Any] = { | |
| 26 | + "query": query, | |
| 27 | + "size": size, | |
| 28 | + "from": from_, | |
| 29 | + "language": language, | |
| 30 | + } | |
| 31 | + if debug: | |
| 32 | + payload["debug"] = True | |
| 25 | 33 | response = self.session.post( |
| 26 | 34 | f"{self.base_url}/search/", |
| 27 | 35 | headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id}, |
| 28 | - json={"query": query, "size": size, "from": from_, "language": language}, | |
| 36 | + json=payload, | |
| 29 | 37 | timeout=120, |
| 30 | 38 | ) |
| 31 | 39 | response.raise_for_status() | ... | ... |
scripts/evaluation/eval_framework/constants.py
| ... | ... | @@ -23,7 +23,7 @@ DEFAULT_SEARCH_RECALL_TOP_K = 500 |
| 23 | 23 | DEFAULT_RERANK_HIGH_THRESHOLD = 0.5 |
| 24 | 24 | DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000 |
| 25 | 25 | DEFAULT_REBUILD_LLM_BATCH_SIZE = 50 |
| 26 | -DEFAULT_REBUILD_MIN_LLM_BATCHES = 15 | |
| 26 | +DEFAULT_REBUILD_MIN_LLM_BATCHES = 20 | |
| 27 | 27 | DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 |
| 28 | 28 | DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92 |
| 29 | 29 | DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3 | ... | ... |
scripts/evaluation/eval_framework/framework.py
| ... | ... | @@ -45,9 +45,27 @@ from .utils import ( |
| 45 | 45 | sha1_text, |
| 46 | 46 | utc_now_iso, |
| 47 | 47 | utc_timestamp, |
| 48 | + zh_title_from_multilingual, | |
| 48 | 49 | ) |
| 49 | 50 | |
| 50 | 51 | |
| 52 | +def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]: | |
| 53 | + """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``.""" | |
| 54 | + out: Dict[str, str] = {} | |
| 55 | + if not isinstance(debug_info, dict): | |
| 56 | + return out | |
| 57 | + for entry in debug_info.get("per_result") or []: | |
| 58 | + if not isinstance(entry, dict): | |
| 59 | + continue | |
| 60 | + spu_id = str(entry.get("spu_id") or "").strip() | |
| 61 | + if not spu_id: | |
| 62 | + continue | |
| 63 | + zh = zh_title_from_multilingual(entry.get("title_multilingual")) | |
| 64 | + if zh: | |
| 65 | + out[spu_id] = zh | |
| 66 | + return out | |
| 67 | + | |
| 68 | + | |
| 51 | 69 | class SearchEvaluationFramework: |
| 52 | 70 | def __init__( |
| 53 | 71 | self, |
| ... | ... | @@ -893,7 +911,10 @@ class SearchEvaluationFramework: |
| 893 | 911 | language: str = "en", |
| 894 | 912 | force_refresh_labels: bool = False, |
| 895 | 913 | ) -> Dict[str, Any]: |
| 896 | - search_payload = self.search_client.search(query=query, size=max(top_k, 100), from_=0, language=language) | |
| 914 | + search_payload = self.search_client.search( | |
| 915 | + query=query, size=max(top_k, 100), from_=0, language=language, debug=True | |
| 916 | + ) | |
| 917 | + zh_by_spu = _zh_titles_from_debug_per_result(search_payload.get("debug_info")) | |
| 897 | 918 | results = list(search_payload.get("results") or []) |
| 898 | 919 | if auto_annotate: |
| 899 | 920 | self.annotate_missing_labels(query=query, docs=results[:top_k], force_refresh=force_refresh_labels) |
| ... | ... | @@ -906,11 +927,16 @@ class SearchEvaluationFramework: |
| 906 | 927 | label = labels.get(spu_id) |
| 907 | 928 | if label not in VALID_LABELS: |
| 908 | 929 | unlabeled_hits += 1 |
| 930 | + primary_title = build_display_title(doc) | |
| 931 | + title_zh = zh_by_spu.get(spu_id) or "" | |
| 932 | + if not title_zh and isinstance(doc.get("title"), dict): | |
| 933 | + title_zh = zh_title_from_multilingual(doc.get("title")) | |
| 909 | 934 | labeled.append( |
| 910 | 935 | { |
| 911 | 936 | "rank": rank, |
| 912 | 937 | "spu_id": spu_id, |
| 913 | - "title": build_display_title(doc), | |
| 938 | + "title": primary_title, | |
| 939 | + "title_zh": title_zh if title_zh and title_zh != primary_title else "", | |
| 914 | 940 | "image_url": doc.get("image_url"), |
| 915 | 941 | "label": label, |
| 916 | 942 | "option_values": list(compact_option_values(doc.get("skus") or [])), |
| ... | ... | @@ -934,12 +960,15 @@ class SearchEvaluationFramework: |
| 934 | 960 | doc = missing_docs_map.get(spu_id) |
| 935 | 961 | if not doc: |
| 936 | 962 | continue |
| 963 | + miss_title = build_display_title(doc) | |
| 964 | + miss_zh = zh_title_from_multilingual(doc.get("title")) if isinstance(doc.get("title"), dict) else "" | |
| 937 | 965 | missing_relevant.append( |
| 938 | 966 | { |
| 939 | 967 | "spu_id": spu_id, |
| 940 | 968 | "label": labels[spu_id], |
| 941 | 969 | "rerank_score": rerank_scores.get(spu_id), |
| 942 | - "title": build_display_title(doc), | |
| 970 | + "title": miss_title, | |
| 971 | + "title_zh": miss_zh if miss_zh and miss_zh != miss_title else "", | |
| 943 | 972 | "image_url": doc.get("image_url"), |
| 944 | 973 | "option_values": list(compact_option_values(doc.get("skus") or [])), |
| 945 | 974 | "product": compact_product_payload(doc), | ... | ... |
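The shape of `debug_info` that the new helper consumes can be illustrated with a standalone restatement; the payload below is inferred from the diff (`per_result[].spu_id` / `title_multilingual`), not from API documentation.

```python
def zh_titles_from_debug(debug_info):
    """Map spu_id -> Chinese title, mirroring the helper added in
    framework.py; restated here so the snippet runs standalone."""
    out = {}
    if not isinstance(debug_info, dict):
        return out
    for entry in debug_info.get("per_result") or []:
        if not isinstance(entry, dict):
            continue
        spu_id = str(entry.get("spu_id") or "").strip()
        tm = entry.get("title_multilingual")
        zh = str(tm.get("zh") or "").strip() if isinstance(tm, dict) else ""
        if spu_id and zh:
            out[spu_id] = zh
    return out

# Hypothetical debug payload returned when search() is called with debug=True
sample = {
    "per_result": [
        {"spu_id": 101, "title_multilingual": {"en": "Red dress", "zh": "红色连衣裙"}},
        {"spu_id": 102, "title_multilingual": {"en": "Blue shirt"}},  # no zh: skipped
    ]
}
```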
scripts/evaluation/eval_framework/static/eval_web.css
| ... | ... | @@ -40,7 +40,8 @@ |
| 40 | 40 | .Irrelevant { background: var(--irrelevant); } |
| 41 | 41 | .Unknown { background: #637381; } |
| 42 | 42 | .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; } |
| 43 | - .title { font-size: 16px; font-weight: 700; margin-bottom: 8px; } | |
| 43 | + .title { font-size: 16px; font-weight: 700; margin-bottom: 4px; } | |
| 44 | + .title-zh { font-size: 14px; font-weight: 500; color: var(--muted); margin-bottom: 8px; line-height: 1.4; } | |
| 44 | 45 | .options { color: var(--muted); line-height: 1.5; font-size: 14px; } |
| 45 | 46 | .section { margin-bottom: 28px; } |
| 46 | 47 | .history { font-size: 13px; line-height: 1.5; } | ... | ... |
scripts/evaluation/eval_framework/static/eval_web.js
| ... | ... | @@ -25,6 +25,7 @@ |
| 25 | 25 | <img class="thumb" src="${item.image_url || ''}" alt="" /> |
| 26 | 26 | <div> |
| 27 | 27 | <div class="title">${item.title || ''}</div> |
| 28 | + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''} | |
| 28 | 29 | <div class="options"> |
| 29 | 30 | <div>${(item.option_values || [])[0] || ''}</div> |
| 30 | 31 | <div>${(item.option_values || [])[1] || ''}</div> | ... | ... |
scripts/evaluation/eval_framework/utils.py
| ... | ... | @@ -42,6 +42,14 @@ def pick_text(value: Any, preferred_lang: str = "en") -> str: |
| 42 | 42 | return str(value).strip() |
| 43 | 43 | |
| 44 | 44 | |
| 45 | +def zh_title_from_multilingual(title_multilingual: Any) -> str: | |
| 46 | + """Chinese title string from API debug ``title_multilingual`` (ES-style dict).""" | |
| 47 | + if not isinstance(title_multilingual, dict): | |
| 48 | + return "" | |
| 49 | + zh = str(title_multilingual.get("zh") or "").strip() | |
| 50 | + return zh | |
| 51 | + | |
| 52 | + | |
| 45 | 53 | def safe_json_dumps(data: Any) -> str: |
| 46 | 54 | return json.dumps(data, ensure_ascii=False, separators=(",", ":")) |
| 47 | 55 | ... | ... |