Commit 167f33b4481a46dbfb4e131a03c41b1fe45ace6f
1 parent d172c259
Eval framework frontend
Showing 8 changed files with 74 additions and 12 deletions
scripts/evaluation/README.md
| ... | ... | @@ -34,7 +34,7 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS |
| 34 | 34 | # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM |
| 35 | 35 | ./scripts/evaluation/quick_start_eval.sh batch |
| 36 | 36 | |
| 37 | -# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive) | |
| 37 | +# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive) | |
| 38 | 38 | ./scripts/evaluation/quick_start_eval.sh batch-rebuild |
| 39 | 39 | |
| 40 | 40 | # UI: http://127.0.0.1:6010/ |
| ... | ... | @@ -69,9 +69,24 @@ Explicit equivalents: |
| 69 | 69 | --port 6010 |
| 70 | 70 | ``` |
| 71 | 71 | |
| 72 | -Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM. | |
| 72 | +Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline). | |
| 73 | 73 | |
| 74 | -**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those SKUs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches. | |
| 74 | +### `quick_start_eval.sh batch-rebuild` (deep annotation rebuild) | |
| 75 | + | |
| 76 | +This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`. | |
| 77 | + | |
| 78 | +For **each** query in `queries.txt`, in order: | |
| 79 | + | |
| 80 | +1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. | |
| 81 | +2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load). | |
| 82 | +3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API. | |
| 83 | +4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query. | |
| 84 | +5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins). | |
| 85 | +6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged. | |
| 86 | + | |
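The six rebuild steps above can be sketched as a small Python loop. The names (`pool_hits`, `rerank_scored`, `judge_batch`) are illustrative, not the framework's real identifiers; only the thresholds mirror the documented defaults.

```python
BATCH_SIZE = 50       # docs judged per LLM call
MIN_BATCHES = 20      # never early-stop before this many batches
MAX_BATCHES = 40      # hard cap: 40 * 50 = 2000 docs per query
STOP_RATIO = 0.92     # a batch "counts" toward stopping above this ratio
STOP_STREAK = 3       # consecutive near-all-irrelevant batches to stop

def rebuild_order(pool_hits, rerank_scored):
    """Pool in search rank order first, then non-pool docs by rerank
    score descending; dedupe by spu_id with the pool winning."""
    seen = {d["spu_id"] for d in pool_hits}
    tail = sorted(
        (d for d in rerank_scored if d["spu_id"] not in seen),
        key=lambda d: d["rerank_score"],
        reverse=True,
    )
    return pool_hits + tail

def label_with_early_stop(ordered_docs, judge_batch):
    """Walk the globally sorted list from the head, 50 docs per batch,
    stopping after 3 consecutive >92%-irrelevant batches (min 20, max 40)."""
    streak = 0
    for i in range(MAX_BATCHES):
        batch = ordered_docs[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
        if not batch:
            break
        labels = judge_batch(batch)  # one LLM call per batch
        irrelevant_ratio = labels.count("Irrelevant") / len(labels)
        streak = streak + 1 if irrelevant_ratio > STOP_RATIO else 0
        if i + 1 >= MIN_BATCHES and streak >= STOP_STREAK:
            break
```

The tail of the sorted list is never judged once the stop condition fires, which is why coverage per query tops out at 2000 labels.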
| 87 | +**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop. | |
| 88 | + | |
| 89 | +**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`). | |
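A minimal sketch of the 80-per-request rerank scoring with cache reuse, as described in step 3. `rerank_api` and the dict-as-cache are stand-ins for the real reranker client and the `rerank_scores` SQLite table.

```python
CHUNK = 80  # rerank API chunk size used in the rebuild path

def score_outside_pool(query, docs, cache, rerank_api, force_refresh=False):
    """Return {spu_id: score} for non-pool docs. Without force_refresh,
    cached (query, spu_id) rows are reused and only missing rows hit
    the API; with it, everything is recomputed and rewritten."""
    scores, todo = {}, []
    for d in docs:
        key = (query, d["spu_id"])
        if not force_refresh and key in cache:
            scores[d["spu_id"]] = cache[key]
        else:
            todo.append(d)
    for i in range(0, len(todo), CHUNK):
        chunk = todo[i:i + CHUNK]
        for d, s in zip(chunk, rerank_api(query, chunk)):
            cache[(query, d["spu_id"])] = s
            scores[d["spu_id"]] = s
    return scores
```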
| 75 | 90 | |
| 76 | 91 | ## Artifacts |
| 77 | 92 | |
| ... | ... | @@ -95,7 +110,7 @@ Default root: `artifacts/search_evaluation/` |
| 95 | 110 | |
| 96 | 111 | **Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`. |
| 97 | 112 | |
| 98 | -**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`. | |
| 113 | +**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead. | |
| 99 | 114 | |
| 100 | 115 | **Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`). |
| 101 | 116 | ... | ... |
scripts/evaluation/eval_framework/cli.py
| ... | ... | @@ -53,7 +53,7 @@ def build_cli_parser() -> argparse.ArgumentParser: |
| 53 | 53 | help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).", |
| 54 | 54 | ) |
| 55 | 55 | build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).") |
| 56 | - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).") | |
| 56 | + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).") | |
| 57 | 57 | build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).") |
| 58 | 58 | build.add_argument( |
| 59 | 59 | "--rebuild-irrelevant-stop-ratio", | ... | ... |
scripts/evaluation/eval_framework/clients.py
| ... | ... | @@ -21,11 +21,19 @@ class SearchServiceClient: |
| 21 | 21 | self.tenant_id = str(tenant_id) |
| 22 | 22 | self.session = requests.Session() |
| 23 | 23 | |
| 24 | - def search(self, query: str, size: int, from_: int = 0, language: str = "en") -> Dict[str, Any]: | |
| 24 | + def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]: | |
| 25 | + payload: Dict[str, Any] = { | |
| 26 | + "query": query, | |
| 27 | + "size": size, | |
| 28 | + "from": from_, | |
| 29 | + "language": language, | |
| 30 | + } | |
| 31 | + if debug: | |
| 32 | + payload["debug"] = True | |
| 25 | 33 | response = self.session.post( |
| 26 | 34 | f"{self.base_url}/search/", |
| 27 | 35 | headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id}, |
| 28 | - json={"query": query, "size": size, "from": from_, "language": language}, | |
| 36 | + json=payload, | |
| 29 | 37 | timeout=120, |
| 30 | 38 | ) |
| 31 | 39 | response.raise_for_status() | ... | ... |
scripts/evaluation/eval_framework/constants.py
| ... | ... | @@ -23,7 +23,7 @@ DEFAULT_SEARCH_RECALL_TOP_K = 500 |
| 23 | 23 | DEFAULT_RERANK_HIGH_THRESHOLD = 0.5 |
| 24 | 24 | DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000 |
| 25 | 25 | DEFAULT_REBUILD_LLM_BATCH_SIZE = 50 |
| 26 | -DEFAULT_REBUILD_MIN_LLM_BATCHES = 15 | |
| 26 | +DEFAULT_REBUILD_MIN_LLM_BATCHES = 20 | |
| 27 | 27 | DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 |
| 28 | 28 | DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92 |
| 29 | 29 | DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3 | ... | ... |
scripts/evaluation/eval_framework/framework.py
| ... | ... | @@ -45,9 +45,27 @@ from .utils import ( |
| 45 | 45 | sha1_text, |
| 46 | 46 | utc_now_iso, |
| 47 | 47 | utc_timestamp, |
| 48 | + zh_title_from_multilingual, | |
| 48 | 49 | ) |
| 49 | 50 | |
| 50 | 51 | |
| 52 | +def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]: | |
| 53 | + """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``.""" | |
| 54 | + out: Dict[str, str] = {} | |
| 55 | + if not isinstance(debug_info, dict): | |
| 56 | + return out | |
| 57 | + for entry in debug_info.get("per_result") or []: | |
| 58 | + if not isinstance(entry, dict): | |
| 59 | + continue | |
| 60 | + spu_id = str(entry.get("spu_id") or "").strip() | |
| 61 | + if not spu_id: | |
| 62 | + continue | |
| 63 | + zh = zh_title_from_multilingual(entry.get("title_multilingual")) | |
| 64 | + if zh: | |
| 65 | + out[spu_id] = zh | |
| 66 | + return out | |
| 67 | + | |
| 68 | + | |
| 51 | 69 | class SearchEvaluationFramework: |
| 52 | 70 | def __init__( |
| 53 | 71 | self, |
| ... | ... | @@ -893,7 +911,10 @@ class SearchEvaluationFramework: |
| 893 | 911 | language: str = "en", |
| 894 | 912 | force_refresh_labels: bool = False, |
| 895 | 913 | ) -> Dict[str, Any]: |
| 896 | - search_payload = self.search_client.search(query=query, size=max(top_k, 100), from_=0, language=language) | |
| 914 | + search_payload = self.search_client.search( | |
| 915 | + query=query, size=max(top_k, 100), from_=0, language=language, debug=True | |
| 916 | + ) | |
| 917 | + zh_by_spu = _zh_titles_from_debug_per_result(search_payload.get("debug_info")) | |
| 897 | 918 | results = list(search_payload.get("results") or []) |
| 898 | 919 | if auto_annotate: |
| 899 | 920 | self.annotate_missing_labels(query=query, docs=results[:top_k], force_refresh=force_refresh_labels) |
| ... | ... | @@ -906,11 +927,16 @@ class SearchEvaluationFramework: |
| 906 | 927 | label = labels.get(spu_id) |
| 907 | 928 | if label not in VALID_LABELS: |
| 908 | 929 | unlabeled_hits += 1 |
| 930 | + primary_title = build_display_title(doc) | |
| 931 | + title_zh = zh_by_spu.get(spu_id) or "" | |
| 932 | + if not title_zh and isinstance(doc.get("title"), dict): | |
| 933 | + title_zh = zh_title_from_multilingual(doc.get("title")) | |
| 909 | 934 | labeled.append( |
| 910 | 935 | { |
| 911 | 936 | "rank": rank, |
| 912 | 937 | "spu_id": spu_id, |
| 913 | - "title": build_display_title(doc), | |
| 938 | + "title": primary_title, | |
| 939 | + "title_zh": title_zh if title_zh and title_zh != primary_title else "", | |
| 914 | 940 | "image_url": doc.get("image_url"), |
| 915 | 941 | "label": label, |
| 916 | 942 | "option_values": list(compact_option_values(doc.get("skus") or [])), |
| ... | ... | @@ -934,12 +960,15 @@ class SearchEvaluationFramework: |
| 934 | 960 | doc = missing_docs_map.get(spu_id) |
| 935 | 961 | if not doc: |
| 936 | 962 | continue |
| 963 | + miss_title = build_display_title(doc) | |
| 964 | + miss_zh = zh_title_from_multilingual(doc.get("title")) if isinstance(doc.get("title"), dict) else "" | |
| 937 | 965 | missing_relevant.append( |
| 938 | 966 | { |
| 939 | 967 | "spu_id": spu_id, |
| 940 | 968 | "label": labels[spu_id], |
| 941 | 969 | "rerank_score": rerank_scores.get(spu_id), |
| 942 | - "title": build_display_title(doc), | |
| 970 | + "title": miss_title, | |
| 971 | + "title_zh": miss_zh if miss_zh and miss_zh != miss_title else "", | |
| 943 | 972 | "image_url": doc.get("image_url"), |
| 944 | 973 | "option_values": list(compact_option_values(doc.get("skus") or [])), |
| 945 | 974 | "product": compact_product_payload(doc), | ... | ... |
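The shape of `debug_info` that the new helper consumes can be illustrated with a standalone restatement; the payload below is inferred from the diff (`per_result[].spu_id` / `title_multilingual`), not from API documentation.

```python
def zh_titles_from_debug(debug_info):
    """Map spu_id -> Chinese title, mirroring the helper added in
    framework.py; restated here so the snippet runs standalone."""
    out = {}
    if not isinstance(debug_info, dict):
        return out
    for entry in debug_info.get("per_result") or []:
        if not isinstance(entry, dict):
            continue
        spu_id = str(entry.get("spu_id") or "").strip()
        tm = entry.get("title_multilingual")
        zh = str(tm.get("zh") or "").strip() if isinstance(tm, dict) else ""
        if spu_id and zh:
            out[spu_id] = zh
    return out

# Hypothetical debug payload returned when search() is called with debug=True
sample = {
    "per_result": [
        {"spu_id": 101, "title_multilingual": {"en": "Red dress", "zh": "红色连衣裙"}},
        {"spu_id": 102, "title_multilingual": {"en": "Blue shirt"}},  # no zh: skipped
    ]
}
```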
scripts/evaluation/eval_framework/static/eval_web.css
| ... | ... | @@ -40,7 +40,8 @@ |
| 40 | 40 | .Irrelevant { background: var(--irrelevant); } |
| 41 | 41 | .Unknown { background: #637381; } |
| 42 | 42 | .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; } |
| 43 | - .title { font-size: 16px; font-weight: 700; margin-bottom: 8px; } | |
| 43 | + .title { font-size: 16px; font-weight: 700; margin-bottom: 4px; } | |
| 44 | + .title-zh { font-size: 14px; font-weight: 500; color: var(--muted); margin-bottom: 8px; line-height: 1.4; } | |
| 44 | 45 | .options { color: var(--muted); line-height: 1.5; font-size: 14px; } |
| 45 | 46 | .section { margin-bottom: 28px; } |
| 46 | 47 | .history { font-size: 13px; line-height: 1.5; } | ... | ... |
scripts/evaluation/eval_framework/static/eval_web.js
| ... | ... | @@ -25,6 +25,7 @@ |
| 25 | 25 | <img class="thumb" src="${item.image_url || ''}" alt="" /> |
| 26 | 26 | <div> |
| 27 | 27 | <div class="title">${item.title || ''}</div> |
| 28 | + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''} | |
| 28 | 29 | <div class="options"> |
| 29 | 30 | <div>${(item.option_values || [])[0] || ''}</div> |
| 30 | 31 | <div>${(item.option_values || [])[1] || ''}</div> | ... | ... |
scripts/evaluation/eval_framework/utils.py
| ... | ... | @@ -42,6 +42,14 @@ def pick_text(value: Any, preferred_lang: str = "en") -> str: |
| 42 | 42 | return str(value).strip() |
| 43 | 43 | |
| 44 | 44 | |
| 45 | +def zh_title_from_multilingual(title_multilingual: Any) -> str: | |
| 46 | + """Chinese title string from API debug ``title_multilingual`` (ES-style dict).""" | |
| 47 | + if not isinstance(title_multilingual, dict): | |
| 48 | + return "" | |
| 49 | + zh = str(title_multilingual.get("zh") or "").strip() | |
| 50 | + return zh | |
| 51 | + | |
| 52 | + | |
| 45 | 53 | def safe_json_dumps(data: Any) -> str: |
| 46 | 54 | return json.dumps(data, ensure_ascii=False, separators=(",", ":")) |
| 47 | 55 | ... | ... |