Commit dedd31c5fc6c065c3fedf3c361939b9483a91cb8
1 parent: 90de78aa
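1. Size of the search recall pool that is scored 1 (DEFAULT_SEARCH_RECALL_TOP_K)
   scripts/evaluation/eval_framework/constants.py: 500 → 200.
   In rebuild, the rerank_score: 1.0 given to hits with rank <= recall_n still follows this K.

2. LLM batch bounds
   - Minimum batches: DEFAULT_REBUILD_MIN_LLM_BATCHES 20 → 10.
   - Maximum batches: still 40 (unchanged).

3. Early-stop condition (_annotate_rebuild_batches)
   Once min_batches have been run, for each batch:
   - The batch counts as a bad batch if it has no Exact (exact_n == 0) and meets either of:
     - irrelevant_ratio >= 0.94, or
     - (irrelevant + Low Relevant) / n >= 0.96 (weak relevance uses RELEVANCE_LOW).
   - 2 consecutive bad batches trigger early stop (previously 3 consecutive batches with irrelevant > 0.92).
   Batch logs now include low_ratio and irrelevant_plus_low_ratio; rebuild metadata now includes rebuild_irrel_low_combined_stop_ratio.

4. CLI
   - --search-recall-top-k: help text now states default 200.
   - --rebuild-min-batches: help text now states default 10.
   - --rebuild-irrelevant-stop-ratio / --rebuild-irrelevant-stop-streak: help text aligned with the new logic.
   - New flag --rebuild-irrel-low-combined-stop-ratio (default 0.96).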
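The early-stop rule in item 3 can be sketched in isolation as follows. This is a simplified illustration, not the code in `_annotate_rebuild_batches`: the helper names `is_bad_batch` and `batches_labeled` are hypothetical, the label strings are assumed to match `RELEVANCE_EXACT`, `RELEVANCE_LOW`, and `RELEVANCE_IRRELEVANT`, and each batch is represented only by its list of labels.

```python
from typing import List, Sequence

# Assumed label strings; the real values live in the framework's RELEVANCE_* constants.
EXACT = "Exact Match"
LOW = "Low Relevant"
IRRELEVANT = "Irrelevant"


def is_bad_batch(
    labels: Sequence[str],
    irrel_stop: float = 0.94,
    combined_stop: float = 0.96,
) -> bool:
    """A batch is bad when it has no Exact label and either ratio reaches its threshold."""
    n = len(labels)
    if n == 0 or EXACT in labels:
        return False
    irrel_n = sum(1 for label in labels if label == IRRELEVANT)
    low_n = sum(1 for label in labels if label == LOW)
    return irrel_n / n >= irrel_stop or (irrel_n + low_n) / n >= combined_stop


def batches_labeled(
    batches: List[Sequence[str]],
    min_batches: int = 10,
    max_batches: int = 40,
    stop_streak: int = 2,
) -> int:
    """Return how many batches would be labeled before the loop stops."""
    streak = 0
    done = 0
    for labels in batches[:max_batches]:
        done += 1
        # The streak is only tracked once the warm-up of min_batches is reached.
        if done >= min_batches:
            streak = streak + 1 if is_bad_batch(labels) else 0
            if streak >= stop_streak:
                break
    return done
```

For example, ten batches that each contain one Exact label followed by all-Irrelevant batches would stop after batch 12 in this sketch, since the streak only starts counting at batch 11.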
Showing 4 changed files with 105 additions and 22 deletions
scripts/evaluation/README.md
| ... | ... | @@ -23,7 +23,7 @@ This directory holds the offline annotation builder, the evaluation web UI/API, |
| 23 | 23 | | `fusion_experiments_round1.json` | Broader first-round experiments | |
| 24 | 24 | | `queries/queries.txt` | Canonical evaluation queries | |
| 25 | 25 | | `README_Requirement.md` | Product/requirements reference | |
| 26 | -| `start_eval.sh.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | | |
| 26 | +| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | | |
| 27 | 27 | | `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. | |
| 28 | 28 | |
| 29 | 29 | ## Quick start (repo root) |
| ... | ... | @@ -32,13 +32,13 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS |
| 32 | 32 | |
| 33 | 33 | ```bash |
| 34 | 34 | # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM |
| 35 | -./scripts/evaluation/start_eval.sh.sh batch | |
| 35 | +./scripts/evaluation/start_eval.sh batch | |
| 36 | 36 | |
| 37 | -# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive) | |
| 38 | -./scripts/evaluation/start_eval.sh.sh batch-rebuild | |
| 37 | +# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive) | |
| 38 | +./scripts/evaluation/start_eval.sh batch-rebuild | |
| 39 | 39 | |
| 40 | 40 | # UI: http://127.0.0.1:6010/ |
| 41 | -./scripts/evaluation/start_eval.sh.sh serve | |
| 41 | +./scripts/evaluation/start_eval.sh serve | |
| 42 | 42 | # or: ./scripts/service_ctl.sh start eval-web |
| 43 | 43 | ``` |
| 44 | 44 | |
| ... | ... | @@ -71,22 +71,34 @@ Explicit equivalents: |
| 71 | 71 | |
| 72 | 72 | Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline). |
| 73 | 73 | |
| 74 | -### `start_eval.sh.sh batch-rebuild` (deep annotation rebuild) | |
| 74 | +### `start_eval.sh batch-rebuild` (deep annotation rebuild) | |
| 75 | 75 | |
| 76 | 76 | This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`. |
| 77 | 77 | |
| 78 | 78 | For **each** query in `queries.txt`, in order: |
| 79 | 79 | |
| 80 | -1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. | |
| 80 | +1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. | |
| 81 | 81 | 2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load). |
| 82 | 82 | 3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API. |
| 83 | 83 | 4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query. |
| 84 | 84 | 5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins). |
| 85 | -6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged. | |
| 85 | +6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**. | |
| 86 | + | |
| 87 | + **Early stop** (defaults in `eval_framework.constants`; overridable via CLI): | |
| 88 | + | |
| 89 | + - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed. | |
| 90 | + - After that, define a **bad batch** as one where the batch has **no** **Exact Match** label **and** either: | |
| 91 | + - **Irrelevant** proportion **≥ 0.94** (`--rebuild-irrelevant-stop-ratio`), or | |
| 92 | + - **(Irrelevant + Low Relevant)** proportion **≥ 0.96** (`--rebuild-irrel-low-combined-stop-ratio`). | |
| 93 | + (“Low Relevant” is the weak tier; **High Relevant** does not count toward this combined ratio.) | |
| 94 | + - Count **consecutive** bad batches. **Reset** the count to 0 on any batch that is not bad. | |
| 95 | + - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**2** by default), or when **`--rebuild-max-batches`** (**40**) is reached—whichever comes first (up to **2000** docs per query at default batch size). | |
| 96 | + | |
| 97 | + So labeling follows best-first order but **stops early** when the model sees two consecutive “dead” batches; the tail may never be judged. | |
| 86 | 98 | |
| 87 | 99 | **Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop. |
| 88 | 100 | |
| 89 | -**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`). | |
| 101 | +**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`). | |
| 90 | 102 | |
| 91 | 103 | ## Artifacts |
| 92 | 104 | ... | ... |
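The README steps 1, 4, and 5 (recall pool, "too easy" skip, global sort) can be illustrated with a small sketch. It assumes search hits and corpus documents are represented only by `spu_id` strings and that rerank scores are already available in a dict; `global_label_order` and `too_easy` are hypothetical helper names, not functions from `eval_framework`.

```python
from typing import Dict, List, Set


def global_label_order(
    search_hits: List[str],
    rerank_scores: Dict[str, float],
    recall_top_k: int = 200,
) -> List[str]:
    """Pool in search rank order, then the rest by descending rerank score."""
    pool: List[str] = []
    seen: Set[str] = set()
    for spu_id in search_hits[:recall_top_k]:
        if spu_id not in seen:  # dedupe by spu_id; the pool position wins
            seen.add(spu_id)
            pool.append(spu_id)
    tail = sorted(
        (spu_id for spu_id in rerank_scores if spu_id not in seen),
        key=lambda spu_id: rerank_scores[spu_id],
        reverse=True,
    )
    return pool + tail


def too_easy(
    rerank_scores: Dict[str, float],
    pool: Set[str],
    threshold: float = 0.5,
    skip_count: int = 1000,
) -> bool:
    """Skip the query when more than skip_count non-pool docs score above threshold."""
    high = sum(
        1 for spu_id, score in rerank_scores.items()
        if spu_id not in pool and score > threshold
    )
    return high > skip_count
```

With `search_hits=["a", "b", "c"]`, `recall_top_k=2`, and scores `{"a": 0.1, "c": 0.9, "d": 0.3, "e": 0.7}`, the sketch yields `["a", "b", "c", "e", "d"]`: `a` stays in the pool despite its low rerank score, and `c` falls into the reranked tail because it is outside the top 2.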
scripts/evaluation/eval_framework/cli.py
| ... | ... | @@ -9,6 +9,7 @@ from typing import Any, Dict |
| 9 | 9 | |
| 10 | 10 | from .constants import ( |
| 11 | 11 | DEFAULT_QUERY_FILE, |
| 12 | + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, | |
| 12 | 13 | DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, |
| 13 | 14 | DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, |
| 14 | 15 | DEFAULT_REBUILD_LLM_BATCH_SIZE, |
| ... | ... | @@ -70,7 +71,7 @@ def build_cli_parser() -> argparse.ArgumentParser: |
| 70 | 71 | "--search-recall-top-k", |
| 71 | 72 | type=int, |
| 72 | 73 | default=None, |
| 73 | - help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 500).", | |
| 74 | + help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 200).", | |
| 74 | 75 | ) |
| 75 | 76 | build.add_argument( |
| 76 | 77 | "--rerank-high-threshold", |
| ... | ... | @@ -85,19 +86,25 @@ def build_cli_parser() -> argparse.ArgumentParser: |
| 85 | 86 | help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).", |
| 86 | 87 | ) |
| 87 | 88 | build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).") |
| 88 | - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).") | |
| 89 | + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 10).") | |
| 89 | 90 | build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).") |
| 90 | 91 | build.add_argument( |
| 91 | 92 | "--rebuild-irrelevant-stop-ratio", |
| 92 | 93 | type=float, |
| 93 | 94 | default=None, |
| 94 | - help="Rebuild only: irrelevant ratio above this counts toward early-stop streak (default 0.92).", | |
| 95 | + help="Rebuild only: irrelevant-only branch threshold (>=) for early-stop streak, requires no Exact (default 0.94).", | |
| 96 | + ) | |
| 97 | + build.add_argument( | |
| 98 | + "--rebuild-irrel-low-combined-stop-ratio", | |
| 99 | + type=float, | |
| 100 | + default=None, | |
| 101 | + help="Rebuild only: (irrelevant+low)/n threshold (>=) for early-stop streak, requires no Exact (default 0.96).", | |
| 95 | 102 | ) |
| 96 | 103 | build.add_argument( |
| 97 | 104 | "--rebuild-irrelevant-stop-streak", |
| 98 | 105 | type=int, |
| 99 | 106 | default=None, |
| 100 | - help="Rebuild only: stop after this many consecutive batches above irrelevant ratio (default 3).", | |
| 107 | + help="Rebuild only: consecutive bad batches before early stop (default 2).", | |
| 101 | 108 | ) |
| 102 | 109 | build.add_argument("--language", default="en") |
| 103 | 110 | build.add_argument("--force-refresh-rerank", action="store_true") |
| ... | ... | @@ -147,6 +154,9 @@ def run_build(args: argparse.Namespace) -> None: |
| 147 | 154 | "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio |
| 148 | 155 | if args.rebuild_irrelevant_stop_ratio is not None |
| 149 | 156 | else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, |
| 157 | + "rebuild_irrel_low_combined_stop_ratio": args.rebuild_irrel_low_combined_stop_ratio | |
| 158 | + if args.rebuild_irrel_low_combined_stop_ratio is not None | |
| 159 | + else DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, | |
| 150 | 160 | "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak |
| 151 | 161 | if args.rebuild_irrelevant_stop_streak is not None |
| 152 | 162 | else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, | ... | ... |
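The `default=None` plus `is not None` fallback used in `run_build` can be shown in a standalone sketch. The `DEFAULTS` dict and the `resolve` helper are illustrative stand-ins for the `DEFAULT_REBUILD_*` constants, not part of `cli.py`; only two of the flags are reproduced.

```python
import argparse

# Stand-ins for the values in eval_framework.constants after this commit.
DEFAULTS = {
    "rebuild_min_batches": 10,
    "rebuild_irrel_low_combined_stop_ratio": 0.96,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # default=None lets the caller tell "flag omitted" apart from an explicit value.
    parser.add_argument("--rebuild-min-batches", type=int, default=None)
    parser.add_argument("--rebuild-irrel-low-combined-stop-ratio", type=float, default=None)
    return parser


def resolve(args: argparse.Namespace, name: str):
    """Return the CLI value when given, otherwise the constant default."""
    value = getattr(args, name)
    return value if value is not None else DEFAULTS[name]


args = build_parser().parse_args(["--rebuild-min-batches", "15"])
min_batches = resolve(args, "rebuild_min_batches")
combined_ratio = resolve(args, "rebuild_irrel_low_combined_stop_ratio")
```

Comparing against `None` rather than relying on truthiness means an explicit `--rebuild-min-batches 0` is kept instead of silently falling back to the default.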
scripts/evaluation/eval_framework/constants.py
| ... | ... | @@ -41,12 +41,25 @@ DEFAULT_JUDGE_DASHSCOPE_BATCH = False |
| 41 | 41 | DEFAULT_JUDGE_BATCH_COMPLETION_WINDOW = "24h" |
| 42 | 42 | DEFAULT_JUDGE_BATCH_POLL_INTERVAL_SEC = 10.0 |
| 43 | 43 | |
| 44 | -# Rebuild annotation pool (build --force-refresh-labels): search recall + full-corpus rerank + LLM batches | |
| 45 | -DEFAULT_SEARCH_RECALL_TOP_K = 500 | |
| 44 | +# --- Rebuild annotation pool (``build --force-refresh-labels``) --- | |
| 45 | +# Flow: search recall pool (rerank_score=1, no rerank API) + rerank rest of corpus + | |
| 46 | +# LLM labels in fixed-size batches along global order (see ``framework._annotate_rebuild_batches``). | |
| 47 | +DEFAULT_SEARCH_RECALL_TOP_K = 200 | |
| 46 | 48 | DEFAULT_RERANK_HIGH_THRESHOLD = 0.5 |
| 47 | 49 | DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000 |
| 48 | 50 | DEFAULT_REBUILD_LLM_BATCH_SIZE = 50 |
| 49 | -DEFAULT_REBUILD_MIN_LLM_BATCHES = 20 | |
| 51 | +# At least this many LLM batches run before early-stop is considered. | |
| 52 | +DEFAULT_REBUILD_MIN_LLM_BATCHES = 10 | |
| 53 | +# Hard cap on LLM batches per query (each batch labels up to ``DEFAULT_REBUILD_LLM_BATCH_SIZE`` docs). | |
| 50 | 54 | DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 |
| 51 | -DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92 | |
| 52 | -DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3 | |
| 55 | + | |
| 56 | +# LLM early-stop (only after ``DEFAULT_REBUILD_MIN_LLM_BATCHES`` completed): | |
| 57 | +# A batch is "bad" when it has **no** ``Exact Match`` label AND either: | |
| 58 | +# - irrelevant_ratio >= DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, or | |
| 59 | +# - (Irrelevant + Low Relevant) / n >= DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO. | |
| 60 | +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Low Relevant"). | |
| 61 | +# If a batch is bad, increment a streak; otherwise reset streak to 0. Stop when streak reaches | |
| 62 | +# ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (consecutive bad batches). | |
| 63 | +DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.94 | |
| 64 | +DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO = 0.96 | |
| 65 | +DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 2 | ... | ... |
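To make the new defaults concrete, the sketch below converts the ratio thresholds into document counts for a full batch of 50. `min_count` is a hypothetical helper; the last figure assumes every batch is full and that the streak is only counted from batch `min_batches` onward, as in the early-stop comments above.

```python
def min_count(n: int, ratio: float, strict: bool = False) -> int:
    """Smallest k in 0..n whose share k/n meets the threshold (>= by default, > if strict)."""
    for k in range(n + 1):
        share = k / n
        if (share > ratio) if strict else (share >= ratio):
            return k
    raise ValueError("threshold is not reachable within the batch")


BATCH_SIZE = 50   # DEFAULT_REBUILD_LLM_BATCH_SIZE
MIN_BATCHES = 10  # DEFAULT_REBUILD_MIN_LLM_BATCHES
MAX_BATCHES = 40  # DEFAULT_REBUILD_MAX_LLM_BATCHES
STOP_STREAK = 2   # DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK

new_irrel = min_count(BATCH_SIZE, 0.94)               # 47 Irrelevant docs
new_combined = min_count(BATCH_SIZE, 0.96)            # 48 Irrelevant + Low Relevant docs
old_irrel = min_count(BATCH_SIZE, 0.92, strict=True)  # 47 under the previous "> 0.92" rule

max_docs = MAX_BATCHES * BATCH_SIZE                                  # 2000
earliest_stop_docs = (MIN_BATCHES + STOP_STREAK - 1) * BATCH_SIZE    # 550
```

At batch size 50 the Irrelevant-only branch therefore needs the same count as before (47); what changes in practice is the added no-Exact requirement, the new combined branch, and the shorter warm-up and streak.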
scripts/evaluation/eval_framework/framework.py
| ... | ... | @@ -21,6 +21,7 @@ from .constants import ( |
| 21 | 21 | DEFAULT_JUDGE_DASHSCOPE_BATCH, |
| 22 | 22 | DEFAULT_JUDGE_ENABLE_THINKING, |
| 23 | 23 | DEFAULT_JUDGE_MODEL, |
| 24 | + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, | |
| 24 | 25 | DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, |
| 25 | 26 | DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, |
| 26 | 27 | DEFAULT_REBUILD_LLM_BATCH_SIZE, |
| ... | ... | @@ -326,10 +327,28 @@ class SearchEvaluationFramework: |
| 326 | 327 | min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, |
| 327 | 328 | max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, |
| 328 | 329 | irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, |
| 330 | + irrelevant_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, | |
| 329 | 331 | stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, |
| 330 | 332 | force_refresh: bool = True, |
| 331 | 333 | ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]: |
| 332 | - """LLM-label ``ordered_docs`` in fixed-size batches with early stop after enough irrelevant-heavy batches.""" | |
| 334 | + """LLM-label ``ordered_docs`` in fixed-size batches along list order. | |
| 335 | + | |
| 336 | + **Early stop** (only after ``min_batches`` full batches have completed): | |
| 337 | + | |
| 338 | + Per batch, let *n* = batch size, and count labels among docs in that batch only. | |
| 339 | + | |
| 340 | + - *bad batch* iff there is **no** ``Exact Match`` in the batch **and** at least one of: | |
| 341 | + | |
| 342 | + - ``irrelevant_ratio = #(Irrelevant)/n >= irrelevant_stop_ratio`` (default 0.94), or | |
| 343 | + - ``( #(Irrelevant) + #(Low Relevant) ) / n >= irrelevant_low_combined_stop_ratio`` | |
| 344 | + (default 0.96; weak relevance = ``RELEVANCE_LOW``). | |
| 345 | + | |
| 346 | + Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0. | |
| 347 | + Stop labeling when ``streak >= stop_streak`` (default 2) or when ``max_batches`` is reached | |
| 348 | + or the ordered list is exhausted. | |
| 349 | + | |
| 350 | + Constants for defaults: ``eval_framework.constants`` (``DEFAULT_REBUILD_*``). | |
| 351 | + """ | |
| 333 | 352 | batch_logs: List[Dict[str, Any]] = [] |
| 334 | 353 | streak = 0 |
| 335 | 354 | labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query)) |
| ... | ... | @@ -357,32 +376,46 @@ class SearchEvaluationFramework: |
| 357 | 376 | n = len(batch_docs) |
| 358 | 377 | exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT) |
| 359 | 378 | irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT) |
| 379 | + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW) | |
| 360 | 380 | exact_ratio = exact_n / n if n else 0.0 |
| 361 | 381 | irrelevant_ratio = irrel_n / n if n else 0.0 |
| 382 | + low_ratio = low_n / n if n else 0.0 | |
| 383 | + irrel_low_ratio = (irrel_n + low_n) / n if n else 0.0 | |
| 362 | 384 | log_entry = { |
| 363 | 385 | "batch_index": batch_idx + 1, |
| 364 | 386 | "size": n, |
| 365 | 387 | "exact_ratio": round(exact_ratio, 6), |
| 366 | 388 | "irrelevant_ratio": round(irrelevant_ratio, 6), |
| 389 | + "low_ratio": round(low_ratio, 6), | |
| 390 | + "irrelevant_plus_low_ratio": round(irrel_low_ratio, 6), | |
| 367 | 391 | "offset_start": start, |
| 368 | 392 | "offset_end": min(start + n, total_ordered), |
| 369 | 393 | } |
| 370 | 394 | batch_logs.append(log_entry) |
| 371 | 395 | print( |
| 372 | 396 | f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} " |
| 373 | - f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f}", | |
| 397 | + f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f} " | |
| 398 | + f"irrel_plus_low_ratio={irrel_low_ratio:.4f}", | |
| 374 | 399 | flush=True, |
| 375 | 400 | ) |
| 376 | 401 | |
| 402 | + # Early-stop streak: only evaluated after min_batches (warm-up before trusting tail quality). | |
| 377 | 403 | if batch_idx + 1 >= min_batches: |
| 378 | - if irrelevant_ratio > irrelevant_stop_ratio: | |
| 404 | + no_exact = exact_n == 0 | |
| 405 | + # Branch 1: high Irrelevant share, no Exact in this batch. | |
| 406 | + heavy_irrel = irrelevant_ratio >= irrelevant_stop_ratio | |
| 407 | + # Branch 2: Irrelevant + Low Relevant combined share, still no Exact. | |
| 408 | + heavy_irrel_low = irrel_low_ratio >= irrelevant_low_combined_stop_ratio | |
| 409 | + bad_batch = no_exact and (heavy_irrel or heavy_irrel_low) | |
| 410 | + if bad_batch: | |
| 379 | 411 | streak += 1 |
| 380 | 412 | else: |
| 381 | 413 | streak = 0 |
| 382 | 414 | if streak >= stop_streak: |
| 383 | 415 | print( |
| 384 | 416 | f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches " |
| 385 | - f"({stop_streak} consecutive batches with irrelevant_ratio > {irrelevant_stop_ratio})", | |
| 417 | + f"({stop_streak} consecutive batches: no Exact and " | |
| 418 | + f"(irrelevant>={irrelevant_stop_ratio} or irrel+low>={irrelevant_low_combined_stop_ratio}))", | |
| 386 | 419 | flush=True, |
| 387 | 420 | ) |
| 388 | 421 | break |
| ... | ... | @@ -407,8 +440,19 @@ class SearchEvaluationFramework: |
| 407 | 440 | rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, |
| 408 | 441 | rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, |
| 409 | 442 | rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, |
| 443 | + rebuild_irrel_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, | |
| 410 | 444 | rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, |
| 411 | 445 | ) -> QueryBuildResult: |
| 446 | + """Build per-query annotation pool and write ``query_builds/*.json``. | |
| 447 | + | |
| 448 | + Normal mode unions search + rerank windows and fills missing labels once. | |
| 449 | + | |
| 450 | + **Rebuild mode** (``force_refresh_labels=True``): full recall pool + corpus rerank outside | |
| 451 | + pool, optional skip for "easy" queries, then batched LLM labeling with **early stop**; | |
| 452 | + see ``_build_query_annotation_set_rebuild`` and ``_annotate_rebuild_batches`` (docstring | |
| 453 | + spells out the bad-batch / streak rule). Rebuild tuning knobs: ``rebuild_*`` and | |
| 454 | + ``search_recall_top_k`` parameters below; CLI mirrors them under ``build --force-refresh-labels``. | |
| 455 | + """ | |
| 412 | 456 | if force_refresh_labels: |
| 413 | 457 | return self._build_query_annotation_set_rebuild( |
| 414 | 458 | query=query, |
| ... | ... | @@ -423,6 +467,7 @@ class SearchEvaluationFramework: |
| 423 | 467 | rebuild_min_batches=rebuild_min_batches, |
| 424 | 468 | rebuild_max_batches=rebuild_max_batches, |
| 425 | 469 | rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, |
| 470 | + rebuild_irrel_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio, | |
| 426 | 471 | rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak, |
| 427 | 472 | ) |
| 428 | 473 | |
| ... | ... | @@ -538,6 +583,7 @@ class SearchEvaluationFramework: |
| 538 | 583 | rebuild_min_batches: int, |
| 539 | 584 | rebuild_max_batches: int, |
| 540 | 585 | rebuild_irrelevant_stop_ratio: float, |
| 586 | + rebuild_irrel_low_combined_stop_ratio: float, | |
| 541 | 587 | rebuild_irrelevant_stop_streak: int, |
| 542 | 588 | ) -> QueryBuildResult: |
| 543 | 589 | search_size = max(int(search_depth), int(search_recall_top_k)) |
| ... | ... | @@ -570,6 +616,7 @@ class SearchEvaluationFramework: |
| 570 | 616 | "rebuild_min_batches": rebuild_min_batches, |
| 571 | 617 | "rebuild_max_batches": rebuild_max_batches, |
| 572 | 618 | "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio, |
| 619 | + "rebuild_irrel_low_combined_stop_ratio": rebuild_irrel_low_combined_stop_ratio, | |
| 573 | 620 | "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak, |
| 574 | 621 | } |
| 575 | 622 | |
| ... | ... | @@ -611,6 +658,7 @@ class SearchEvaluationFramework: |
| 611 | 658 | min_batches=rebuild_min_batches, |
| 612 | 659 | max_batches=rebuild_max_batches, |
| 613 | 660 | irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, |
| 661 | + irrelevant_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio, | |
| 614 | 662 | stop_streak=rebuild_irrelevant_stop_streak, |
| 615 | 663 | force_refresh=True, |
| 616 | 664 | ) | ... | ... |