diff --git a/scripts/evaluation/README.md b/scripts/evaluation/README.md
index 0488a27..9366411 100644
--- a/scripts/evaluation/README.md
+++ b/scripts/evaluation/README.md
@@ -34,7 +34,7 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS
 # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
 ./scripts/evaluation/quick_start_eval.sh batch
 
-# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive)
+# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive)
 ./scripts/evaluation/quick_start_eval.sh batch-rebuild
 
 # UI: http://127.0.0.1:6010/
@@ -69,9 +69,24 @@ Explicit equivalents:
   --port 6010
 ```
 
-Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM.
+Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
 
-**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those SKUs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches.
+### `quick_start_eval.sh batch-rebuild` (deep annotation rebuild)
+
+This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.
+
+For **each** query in `queries.txt`, in order:
+
+1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
+2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
+3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
+4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
+5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
+6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged.
+
+**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
+
+**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
 
 ## Artifacts
 
@@ -95,7 +110,7 @@
 Default root: `artifacts/search_evaluation/`
 
 **Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
 
-**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`.
+**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.
 
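The early-stop labeling loop described above (batches of 50, at least 20 and at most 40 batches, stop after 3 consecutive batches with irrelevant_ratio above 0.92) can be sketched in a few lines of Python. This is an editorial illustration, not code from the patch; `label_with_early_stop` and `judge_batch` are hypothetical names, with the defaults taken from `constants.py`:

```python
def label_with_early_stop(sorted_docs, judge_batch,
                          batch_size=50, min_batches=20, max_batches=40,
                          stop_ratio=0.92, stop_streak=3):
    """Walk sorted_docs head-first in fixed-size batches; judge each batch,
    and stop once `stop_streak` consecutive batches exceed `stop_ratio`
    irrelevant, but only after `min_batches` and never past `max_batches`."""
    labeled = []
    streak = 0
    for i in range(max_batches):
        batch = sorted_docs[i * batch_size:(i + 1) * batch_size]
        if not batch:
            break  # ran out of documents before hitting max_batches
        labels = judge_batch(batch)  # e.g. one LLM call per batch
        labeled.extend(labels)
        irrelevant_ratio = sum(1 for l in labels if l == "Irrelevant") / len(labels)
        streak = streak + 1 if irrelevant_ratio > stop_ratio else 0
        if i + 1 >= min_batches and streak >= stop_streak:
            break  # tail of the sorted list is never judged
    return labeled
```

Because labeling follows the globally sorted order, the early stop only ever discards the lowest-scoring tail.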
 **Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
diff --git a/scripts/evaluation/eval_framework/cli.py b/scripts/evaluation/eval_framework/cli.py
index c561639..dfdbbd7 100644
--- a/scripts/evaluation/eval_framework/cli.py
+++ b/scripts/evaluation/eval_framework/cli.py
@@ -53,7 +53,7 @@ def build_cli_parser() -> argparse.ArgumentParser:
         help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).",
     )
     build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).")
-    build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).")
+    build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).")
     build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).")
     build.add_argument(
         "--rebuild-irrelevant-stop-ratio",
diff --git a/scripts/evaluation/eval_framework/clients.py b/scripts/evaluation/eval_framework/clients.py
index a7a5065..05fbe10 100644
--- a/scripts/evaluation/eval_framework/clients.py
+++ b/scripts/evaluation/eval_framework/clients.py
@@ -21,11 +21,19 @@ class SearchServiceClient:
         self.tenant_id = str(tenant_id)
         self.session = requests.Session()
 
-    def search(self, query: str, size: int, from_: int = 0, language: str = "en") -> Dict[str, Any]:
+    def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
+        payload: Dict[str, Any] = {
+            "query": query,
+            "size": size,
+            "from": from_,
+            "language": language,
+        }
+        if debug:
+            payload["debug"] = True
         response = self.session.post(
             f"{self.base_url}/search/",
             headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id},
-            json={"query": query, "size": size, "from": from_, "language": language},
+            json=payload,
             timeout=120,
         )
         response.raise_for_status()
diff --git a/scripts/evaluation/eval_framework/constants.py b/scripts/evaluation/eval_framework/constants.py
index f0c64f6..f3fcf87 100644
--- a/scripts/evaluation/eval_framework/constants.py
+++ b/scripts/evaluation/eval_framework/constants.py
@@ -23,7 +23,7 @@ DEFAULT_SEARCH_RECALL_TOP_K = 500
 DEFAULT_RERANK_HIGH_THRESHOLD = 0.5
 DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000
 DEFAULT_REBUILD_LLM_BATCH_SIZE = 50
-DEFAULT_REBUILD_MIN_LLM_BATCHES = 15
+DEFAULT_REBUILD_MIN_LLM_BATCHES = 20
 DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92
 DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3
diff --git a/scripts/evaluation/eval_framework/framework.py b/scripts/evaluation/eval_framework/framework.py
index 4706894..9fea5f2 100644
--- a/scripts/evaluation/eval_framework/framework.py
+++ b/scripts/evaluation/eval_framework/framework.py
@@ -45,9 +45,27 @@ from .utils import (
     sha1_text,
     utc_now_iso,
     utc_timestamp,
+    zh_title_from_multilingual,
 )
 
 
+def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
+    """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``."""
+    out: Dict[str, str] = {}
+    if not isinstance(debug_info, dict):
+        return out
+    for entry in debug_info.get("per_result") or []:
+        if not isinstance(entry, dict):
+            continue
+        spu_id = str(entry.get("spu_id") or "").strip()
+        if not spu_id:
+            continue
+        zh = zh_title_from_multilingual(entry.get("title_multilingual"))
+        if zh:
+            out[spu_id] = zh
+    return out
+
+
 class SearchEvaluationFramework:
     def __init__(
         self,
@@ -893,7 +911,10 @@
         language: str = "en",
         force_refresh_labels: bool = False,
     ) -> Dict[str, Any]:
-        search_payload = self.search_client.search(query=query, size=max(top_k, 100), from_=0, language=language)
+        search_payload = self.search_client.search(
+            query=query, size=max(top_k, 100), from_=0, language=language, debug=True
+        )
+        zh_by_spu = _zh_titles_from_debug_per_result(search_payload.get("debug_info"))
         results = list(search_payload.get("results") or [])
         if auto_annotate:
             self.annotate_missing_labels(query=query, docs=results[:top_k], force_refresh=force_refresh_labels)
@@ -906,11 +927,16 @@
             label = labels.get(spu_id)
             if label not in VALID_LABELS:
                 unlabeled_hits += 1
+            primary_title = build_display_title(doc)
+            title_zh = zh_by_spu.get(spu_id) or ""
+            if not title_zh and isinstance(doc.get("title"), dict):
+                title_zh = zh_title_from_multilingual(doc.get("title"))
             labeled.append(
                 {
                     "rank": rank,
                     "spu_id": spu_id,
-                    "title": build_display_title(doc),
+                    "title": primary_title,
+                    "title_zh": title_zh if title_zh and title_zh != primary_title else "",
                     "image_url": doc.get("image_url"),
                     "label": label,
                     "option_values": list(compact_option_values(doc.get("skus") or [])),
@@ -934,12 +960,15 @@
             doc = missing_docs_map.get(spu_id)
             if not doc:
                 continue
+            miss_title = build_display_title(doc)
+            miss_zh = zh_title_from_multilingual(doc.get("title")) if isinstance(doc.get("title"), dict) else ""
             missing_relevant.append(
                 {
                     "spu_id": spu_id,
                     "label": labels[spu_id],
                     "rerank_score": rerank_scores.get(spu_id),
-                    "title": build_display_title(doc),
+                    "title": miss_title,
+                    "title_zh": miss_zh if miss_zh and miss_zh != miss_title else "",
                     "image_url": doc.get("image_url"),
                     "option_values": list(compact_option_values(doc.get("skus") or [])),
                     "product": compact_product_payload(doc),
diff --git a/scripts/evaluation/eval_framework/static/eval_web.css b/scripts/evaluation/eval_framework/static/eval_web.css
index fbb75ad..ece16ed 100644
--- a/scripts/evaluation/eval_framework/static/eval_web.css
+++ b/scripts/evaluation/eval_framework/static/eval_web.css
@@ -40,7 +40,8 @@
 .Irrelevant { background: var(--irrelevant); }
 .Unknown { background: #637381; }
 .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; }
-.title { font-size: 16px; font-weight: 700; margin-bottom: 8px; }
+.title { font-size: 16px; font-weight: 700; margin-bottom: 4px; }
+.title-zh { font-size: 14px; font-weight: 500; color: var(--muted); margin-bottom: 8px; line-height: 1.4; }
 .options { color: var(--muted); line-height: 1.5; font-size: 14px; }
 .section { margin-bottom: 28px; }
 .history { font-size: 13px; line-height: 1.5; }
diff --git a/scripts/evaluation/eval_framework/static/eval_web.js b/scripts/evaluation/eval_framework/static/eval_web.js
index f4d1276..4d63e68 100644
--- a/scripts/evaluation/eval_framework/static/eval_web.js
+++ b/scripts/evaluation/eval_framework/static/eval_web.js
@@ -25,6 +25,7 @@
         <div class="title">${item.title || ''}</div>
+        ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''}
         <div class="options">${(item.option_values || [])[0] || ''}</div>
         <div class="options">${(item.option_values || [])[1] || ''}</div>
diff --git a/scripts/evaluation/eval_framework/utils.py b/scripts/evaluation/eval_framework/utils.py
index 0425097..dbe613c 100644
--- a/scripts/evaluation/eval_framework/utils.py
+++ b/scripts/evaluation/eval_framework/utils.py
@@ -42,6 +42,14 @@ def pick_text(value: Any, preferred_lang: str = "en") -> str:
     return str(value).strip()
 
 
+def zh_title_from_multilingual(title_multilingual: Any) -> str:
+    """Chinese title string from API debug ``title_multilingual`` (ES-style dict)."""
+    if not isinstance(title_multilingual, dict):
+        return ""
+    zh = str(title_multilingual.get("zh") or "").strip()
+    return zh
+
+
 def safe_json_dumps(data: Any) -> str:
     return json.dumps(data, ensure_ascii=False, separators=(",", ":"))
 
-- 
libgit2 0.21.2
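The new `zh_title_from_multilingual` helper is small enough to sanity-check standalone. Below is a minimal copy of it (behaviorally identical to the version in the `utils.py` hunk above) with a few illustrative inputs; the sample titles are made up:

```python
def zh_title_from_multilingual(title_multilingual):
    """Chinese title string from an ES-style multilingual dict (minimal copy of the utils.py helper)."""
    if not isinstance(title_multilingual, dict):
        return ""
    return str(title_multilingual.get("zh") or "").strip()

# Shape of debug_info.per_result[].title_multilingual (sample values):
print(zh_title_from_multilingual({"en": "Red Dress", "zh": " 红色连衣裙 "}))  # whitespace stripped
print(zh_title_from_multilingual({"en": "Red Dress"}))   # "" when no zh entry
print(zh_title_from_multilingual(None))                  # "" for non-dict input
```

The `or ""` guard is what keeps an explicit `{"zh": None}` from leaking the string `"None"` into the UI.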