diff --git a/scripts/evaluation/README.md b/scripts/evaluation/README.md
index 0488a27..9366411 100644
--- a/scripts/evaluation/README.md
+++ b/scripts/evaluation/README.md
@@ -34,7 +34,7 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS
# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/quick_start_eval.sh batch
-# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive)
+# Deep rebuild: per-query full-corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive)
./scripts/evaluation/quick_start_eval.sh batch-rebuild
# UI: http://127.0.0.1:6010/
@@ -69,9 +69,24 @@ Explicit equivalents:
--port 6010
```
-Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM.
+Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
-**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those SKUs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches.
+### `quick_start_eval.sh batch-rebuild` (deep annotation rebuild)
+
+This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.
+
+For **each** query in `queries.txt`, in order:
+
+1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
+2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
+3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
+4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
+5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
+6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged.
+
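The step-6 batching and early-stop rule can be sketched as below. This is a minimal illustration only: `label_with_early_stop` and `judge_batch` are hypothetical names, and the real loop (with logging and persistence) lives in the eval framework, not here.

```python
# Sketch of the step-6 early-stop loop (hypothetical helper names;
# constants mirror the --rebuild-* flag defaults).
BATCH_SIZE = 50          # --rebuild-llm-batch-size
MIN_BATCHES = 20         # --rebuild-min-batches
MAX_BATCHES = 40         # --rebuild-max-batches
STOP_RATIO = 0.92        # --rebuild-irrelevant-stop-ratio
STOP_STREAK = 3          # --rebuild-irrelevant-stop-streak

def label_with_early_stop(sorted_docs, judge_batch):
    """Walk `sorted_docs` head-first in 50-doc batches; `judge_batch`
    returns one label per doc, e.g. "Exact" / "Irrelevant"."""
    labels, streak = [], 0
    for i in range(MAX_BATCHES):
        batch = sorted_docs[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
        if not batch:
            break
        batch_labels = judge_batch(batch)
        labels.extend(batch_labels)
        irrelevant_ratio = sum(l == "Irrelevant" for l in batch_labels) / len(batch_labels)
        streak = streak + 1 if irrelevant_ratio > STOP_RATIO else 0
        if i + 1 >= MIN_BATCHES and streak >= STOP_STREAK:
            break  # tail judged irrelevant several batches in a row
    return labels
```

Note the interaction of the limits: even if every batch is irrelevant from the start, at least `MIN_BATCHES` batches (1000 docs) are judged before the streak can trigger a stop.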
+**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
+
+**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
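Steps 3–5 can be sketched as below. Helper names (`build_label_order`, `score_chunk`) are illustrative, not the real API; the actual chunking lives in `full_corpus_rerank_outside_exclude`.

```python
# Illustrative sketch of steps 3-5: 80-doc rerank chunks, the
# "too easy" skip rule, and the pool-first global sort.
RERANK_CHUNK = 80            # docs per rerank API request
HIGH_THRESHOLD = 0.5         # --rerank-high-threshold
HIGH_SKIP_COUNT = 1000       # --rerank-high-skip-count

def build_label_order(pool, corpus, score_chunk):
    """`pool`: search top-500 in rank order; `corpus`: all tenant docs.
    `score_chunk(docs)` returns one rerank score per doc."""
    pool_ids = {d["spu_id"] for d in pool}
    outside = [d for d in corpus if d["spu_id"] not in pool_ids]
    scores = {}
    for i in range(0, len(outside), RERANK_CHUNK):
        chunk = outside[i:i + RERANK_CHUNK]
        for doc, score in zip(chunk, score_chunk(chunk)):
            scores[doc["spu_id"]] = score
    # Skip rule: too many highly relevant docs outside the pool.
    if sum(s > HIGH_THRESHOLD for s in scores.values()) > HIGH_SKIP_COUNT:
        return None  # query skipped: tail too relevant
    ranked_tail = sorted(outside, key=lambda d: scores[d["spu_id"]], reverse=True)
    # Pool wins on spu_id collisions by construction: outside excludes pool ids.
    return pool + ranked_tail
```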
## Artifacts
@@ -95,7 +110,7 @@ Default root: `artifacts/search_evaluation/`
**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
-**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`.
+**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.
**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
diff --git a/scripts/evaluation/eval_framework/cli.py b/scripts/evaluation/eval_framework/cli.py
index c561639..dfdbbd7 100644
--- a/scripts/evaluation/eval_framework/cli.py
+++ b/scripts/evaluation/eval_framework/cli.py
@@ -53,7 +53,7 @@ def build_cli_parser() -> argparse.ArgumentParser:
help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).",
)
build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).")
- build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).")
+ build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).")
build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).")
build.add_argument(
"--rebuild-irrelevant-stop-ratio",
diff --git a/scripts/evaluation/eval_framework/clients.py b/scripts/evaluation/eval_framework/clients.py
index a7a5065..05fbe10 100644
--- a/scripts/evaluation/eval_framework/clients.py
+++ b/scripts/evaluation/eval_framework/clients.py
@@ -21,11 +21,19 @@ class SearchServiceClient:
self.tenant_id = str(tenant_id)
self.session = requests.Session()
- def search(self, query: str, size: int, from_: int = 0, language: str = "en") -> Dict[str, Any]:
+ def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
+ payload: Dict[str, Any] = {
+ "query": query,
+ "size": size,
+ "from": from_,
+ "language": language,
+ }
+ if debug:
+ payload["debug"] = True
response = self.session.post(
f"{self.base_url}/search/",
headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id},
- json={"query": query, "size": size, "from": from_, "language": language},
+ json=payload,
timeout=120,
)
response.raise_for_status()
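For context, the new `debug` keyword is opt-in: the key is added to the POST body only when set, so existing callers see no payload change. A minimal sketch of just the payload logic (no network I/O; `build_search_payload` is a hypothetical stand-in mirroring the diff):

```python
# Mirrors the payload construction in SearchServiceClient.search:
# "debug" is included only when explicitly requested.
from typing import Any, Dict

def build_search_payload(query: str, size: int, from_: int = 0,
                         language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
    payload: Dict[str, Any] = {
        "query": query,
        "size": size,
        "from": from_,
        "language": language,
    }
    if debug:
        payload["debug"] = True  # omitted entirely when debug=False
    return payload
```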
diff --git a/scripts/evaluation/eval_framework/constants.py b/scripts/evaluation/eval_framework/constants.py
index f0c64f6..f3fcf87 100644
--- a/scripts/evaluation/eval_framework/constants.py
+++ b/scripts/evaluation/eval_framework/constants.py
@@ -23,7 +23,7 @@ DEFAULT_SEARCH_RECALL_TOP_K = 500
DEFAULT_RERANK_HIGH_THRESHOLD = 0.5
DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000
DEFAULT_REBUILD_LLM_BATCH_SIZE = 50
-DEFAULT_REBUILD_MIN_LLM_BATCHES = 15
+DEFAULT_REBUILD_MIN_LLM_BATCHES = 20
DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92
DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3
diff --git a/scripts/evaluation/eval_framework/framework.py b/scripts/evaluation/eval_framework/framework.py
index 4706894..9fea5f2 100644
--- a/scripts/evaluation/eval_framework/framework.py
+++ b/scripts/evaluation/eval_framework/framework.py
@@ -45,9 +45,27 @@ from .utils import (
sha1_text,
utc_now_iso,
utc_timestamp,
+ zh_title_from_multilingual,
)
+def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
+ """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``."""
+ out: Dict[str, str] = {}
+ if not isinstance(debug_info, dict):
+ return out
+ for entry in debug_info.get("per_result") or []:
+ if not isinstance(entry, dict):
+ continue
+ spu_id = str(entry.get("spu_id") or "").strip()
+ if not spu_id:
+ continue
+ zh = zh_title_from_multilingual(entry.get("title_multilingual"))
+ if zh:
+ out[spu_id] = zh
+ return out
+
+
class SearchEvaluationFramework:
def __init__(
self,
@@ -893,7 +911,10 @@ class SearchEvaluationFramework:
language: str = "en",
force_refresh_labels: bool = False,
) -> Dict[str, Any]:
- search_payload = self.search_client.search(query=query, size=max(top_k, 100), from_=0, language=language)
+ search_payload = self.search_client.search(
+ query=query, size=max(top_k, 100), from_=0, language=language, debug=True
+ )
+ zh_by_spu = _zh_titles_from_debug_per_result(search_payload.get("debug_info"))
results = list(search_payload.get("results") or [])
if auto_annotate:
self.annotate_missing_labels(query=query, docs=results[:top_k], force_refresh=force_refresh_labels)
@@ -906,11 +927,16 @@ class SearchEvaluationFramework:
label = labels.get(spu_id)
if label not in VALID_LABELS:
unlabeled_hits += 1
+ primary_title = build_display_title(doc)
+ title_zh = zh_by_spu.get(spu_id) or ""
+ if not title_zh and isinstance(doc.get("title"), dict):
+ title_zh = zh_title_from_multilingual(doc.get("title"))
labeled.append(
{
"rank": rank,
"spu_id": spu_id,
- "title": build_display_title(doc),
+ "title": primary_title,
+ "title_zh": title_zh if title_zh and title_zh != primary_title else "",
"image_url": doc.get("image_url"),
"label": label,
"option_values": list(compact_option_values(doc.get("skus") or [])),
@@ -934,12 +960,15 @@ class SearchEvaluationFramework:
doc = missing_docs_map.get(spu_id)
if not doc:
continue
+ miss_title = build_display_title(doc)
+ miss_zh = zh_title_from_multilingual(doc.get("title")) if isinstance(doc.get("title"), dict) else ""
missing_relevant.append(
{
"spu_id": spu_id,
"label": labels[spu_id],
"rerank_score": rerank_scores.get(spu_id),
- "title": build_display_title(doc),
+ "title": miss_title,
+ "title_zh": miss_zh if miss_zh and miss_zh != miss_title else "",
"image_url": doc.get("image_url"),
"option_values": list(compact_option_values(doc.get("skus") or [])),
"product": compact_product_payload(doc),
diff --git a/scripts/evaluation/eval_framework/static/eval_web.css b/scripts/evaluation/eval_framework/static/eval_web.css
index fbb75ad..ece16ed 100644
--- a/scripts/evaluation/eval_framework/static/eval_web.css
+++ b/scripts/evaluation/eval_framework/static/eval_web.css
@@ -40,7 +40,8 @@
.Irrelevant { background: var(--irrelevant); }
.Unknown { background: #637381; }
.thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; }
- .title { font-size: 16px; font-weight: 700; margin-bottom: 8px; }
+ .title { font-size: 16px; font-weight: 700; margin-bottom: 4px; }
+ .title-zh { font-size: 14px; font-weight: 500; color: var(--muted); margin-bottom: 8px; line-height: 1.4; }
.options { color: var(--muted); line-height: 1.5; font-size: 14px; }
.section { margin-bottom: 28px; }
.history { font-size: 13px; line-height: 1.5; }
diff --git a/scripts/evaluation/eval_framework/static/eval_web.js b/scripts/evaluation/eval_framework/static/eval_web.js
index f4d1276..4d63e68 100644
--- a/scripts/evaluation/eval_framework/static/eval_web.js
+++ b/scripts/evaluation/eval_framework/static/eval_web.js
@@ -25,6 +25,7 @@