Commit dedd31c5fc6c065c3fedf3c361939b9483a91cb8

Authored by tangwang
1 parent 90de78aa

1. Search recall pool "score 1" count (DEFAULT_SEARCH_RECALL_TOP_K)

scripts/evaluation/eval_framework/constants.py: 500 → 200
In rebuild mode, the rerank_score: 1.0 assignment for hits with rank <= recall_n still follows this K.
2. LLM batch bounds
Minimum batches: DEFAULT_REBUILD_MIN_LLM_BATCHES 20 → 10
Maximum batches: still 40 (unchanged)
3. Early-stop condition (_annotate_rebuild_batches)
Once min_batches have completed, for each subsequent batch:

A batch with no Exact Match (exact_n == 0) that also satisfies either condition below counts as a bad batch:
irrelevant_ratio >= 0.94
or (Irrelevant + Low Relevant) / n >= 0.96 (weak relevance uses RELEVANCE_LOW)
Two consecutive bad batches trigger early stop (previously three
consecutive batches with irrelevant > 0.92).

Batch logs now include low_ratio and irrelevant_plus_low_ratio; rebuild
metadata now includes rebuild_irrel_low_combined_stop_ratio.
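The bad-batch / streak rule above can be sketched as two small functions (a minimal sketch with illustrative names; the actual implementation lives in `_annotate_rebuild_batches` and also enforces `max_batches`):

```python
from typing import Iterable


def is_bad_batch(exact_n: int, irrel_n: int, low_n: int, n: int,
                 irrel_stop: float = 0.94,
                 combined_stop: float = 0.96) -> bool:
    """A batch is 'bad' iff it has no Exact Match AND is dominated by
    Irrelevant (>= irrel_stop) or Irrelevant + Low Relevant (>= combined_stop)."""
    if n == 0 or exact_n > 0:
        return False
    return (irrel_n / n >= irrel_stop) or ((irrel_n + low_n) / n >= combined_stop)


def should_early_stop(bad_flags: Iterable[bool],
                      min_batches: int = 10,
                      stop_streak: int = 2) -> bool:
    """Walk per-batch bad flags in order. The streak is only evaluated once
    at least min_batches have run; any non-bad batch resets it to 0."""
    streak = 0
    for idx, bad in enumerate(bad_flags, start=1):
        if idx < min_batches:
            continue
        streak = streak + 1 if bad else 0
        if streak >= stop_streak:
            return True
    return False
```

For example, a 50-doc batch with 40 Irrelevant and 8 Low Relevant labels is bad via the combined branch (48/50 = 0.96), while a single Exact Match disqualifies the batch regardless of the ratios.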

4. CLI
--search-recall-top-k help text now states default 200
--rebuild-min-batches help text now states default 10
--rebuild-irrelevant-stop-ratio / --rebuild-irrelevant-stop-streak
help texts updated to match the new logic
New: --rebuild-irrel-low-combined-stop-ratio (default 0.96)
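The new flag follows the same default-resolution pattern as the other rebuild knobs in `run_build`: the argparse default is `None`, and "not passed" falls back to the constant (a sketch of that pattern; the constant value matches the diff below):

```python
DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO = 0.96


def resolve_combined_stop_ratio(cli_value: "float | None") -> float:
    """argparse defaults the flag to None so an unset flag falls back to
    the constant, while an explicit 0.0 still overrides it."""
    if cli_value is not None:
        return cli_value
    return DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO
```

Using `default=None` (rather than baking the constant into argparse) keeps the single source of truth in `eval_framework.constants` and lets downstream code detect whether the user set the flag at all.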
scripts/evaluation/README.md
... ... @@ -23,7 +23,7 @@ This directory holds the offline annotation builder, the evaluation web UI/API,
23 23 | `fusion_experiments_round1.json` | Broader first-round experiments |
24 24 | `queries/queries.txt` | Canonical evaluation queries |
25 25 | `README_Requirement.md` | Product/requirements reference |
26   -| `start_eval.sh.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
  26 +| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
27 27 | `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
28 28  
29 29 ## Quick start (repo root)
... ... @@ -32,13 +32,13 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS
32 32  
33 33 ```bash
34 34 # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
35   -./scripts/evaluation/start_eval.sh.sh batch
  35 +./scripts/evaluation/start_eval.sh batch
36 36  
37   -# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive)
38   -./scripts/evaluation/start_eval.sh.sh batch-rebuild
  37 +# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive)
  38 +./scripts/evaluation/start_eval.sh batch-rebuild
39 39  
40 40 # UI: http://127.0.0.1:6010/
41   -./scripts/evaluation/start_eval.sh.sh serve
  41 +./scripts/evaluation/start_eval.sh serve
42 42 # or: ./scripts/service_ctl.sh start eval-web
43 43 ```
44 44  
... ... @@ -71,22 +71,34 @@ Explicit equivalents:
71 71  
72 72 Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
73 73  
74   -### `start_eval.sh.sh batch-rebuild` (deep annotation rebuild)
  74 +### `start_eval.sh batch-rebuild` (deep annotation rebuild)
75 75  
76 76 This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.
77 77  
78 78 For **each** query in `queries.txt`, in order:
79 79  
80   -1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
  80 +1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
81 81 2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
82 82 3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
83 83 4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
84 84 5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
85   -6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged.
  85 +6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**.
  86 +
  87 + **Early stop** (defaults in `eval_framework.constants`; overridable via CLI):
  88 +
  89 + - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed.
  90 + - After that, define a **bad batch** as one where the batch has **no** **Exact Match** label **and** either:
  91 + - **Irrelevant** proportion **≥ 0.94** (`--rebuild-irrelevant-stop-ratio`), or
  92 + - **(Irrelevant + Low Relevant)** proportion **≥ 0.96** (`--rebuild-irrel-low-combined-stop-ratio`).
  93 + (“Low Relevant” is the weak tier; **High Relevant** does not count toward this combined ratio.)
  94 + - Count **consecutive** bad batches. **Reset** the count to 0 on any batch that is not bad.
  95 + - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**2** by default), or when **`--rebuild-max-batches`** (**40**) is reached—whichever comes first (up to **2000** docs per query at default batch size).
  96 +
  97 + So labeling follows best-first order but **stops early** when the model sees two consecutive “dead” batches; the tail may never be judged.
86 98  
87 99 **Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
88 100  
89   -**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
  101 +**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
90 102  
91 103 ## Artifacts
92 104  
... ...
scripts/evaluation/eval_framework/cli.py
... ... @@ -9,6 +9,7 @@ from typing import Any, Dict
9 9  
10 10 from .constants import (
11 11 DEFAULT_QUERY_FILE,
  12 + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
12 13 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
13 14 DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
14 15 DEFAULT_REBUILD_LLM_BATCH_SIZE,
... ... @@ -70,7 +71,7 @@ def build_cli_parser() -> argparse.ArgumentParser:
70 71 "--search-recall-top-k",
71 72 type=int,
72 73 default=None,
73   - help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 500).",
  74 + help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 200).",
74 75 )
75 76 build.add_argument(
76 77 "--rerank-high-threshold",
... ... @@ -85,19 +86,25 @@ def build_cli_parser() -&gt; argparse.ArgumentParser:
85 86 help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).",
86 87 )
87 88 build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).")
88   - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).")
  89 + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 10).")
89 90 build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).")
90 91 build.add_argument(
91 92 "--rebuild-irrelevant-stop-ratio",
92 93 type=float,
93 94 default=None,
94   - help="Rebuild only: irrelevant ratio above this counts toward early-stop streak (default 0.92).",
  95 + help="Rebuild only: irrelevant-only branch threshold (>=) for early-stop streak, requires no Exact (default 0.94).",
  96 + )
  97 + build.add_argument(
  98 + "--rebuild-irrel-low-combined-stop-ratio",
  99 + type=float,
  100 + default=None,
  101 + help="Rebuild only: (irrelevant+low)/n threshold (>=) for early-stop streak, requires no Exact (default 0.96).",
95 102 )
96 103 build.add_argument(
97 104 "--rebuild-irrelevant-stop-streak",
98 105 type=int,
99 106 default=None,
100   - help="Rebuild only: stop after this many consecutive batches above irrelevant ratio (default 3).",
  107 + help="Rebuild only: consecutive bad batches before early stop (default 2).",
101 108 )
102 109 build.add_argument("--language", default="en")
103 110 build.add_argument("--force-refresh-rerank", action="store_true")
... ... @@ -147,6 +154,9 @@ def run_build(args: argparse.Namespace) -> None:
147 154 "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio
148 155 if args.rebuild_irrelevant_stop_ratio is not None
149 156 else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  157 + "rebuild_irrel_low_combined_stop_ratio": args.rebuild_irrel_low_combined_stop_ratio
  158 + if args.rebuild_irrel_low_combined_stop_ratio is not None
  159 + else DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
150 160 "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak
151 161 if args.rebuild_irrelevant_stop_streak is not None
152 162 else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
... ...
scripts/evaluation/eval_framework/constants.py
... ... @@ -41,12 +41,25 @@ DEFAULT_JUDGE_DASHSCOPE_BATCH = False
41 41 DEFAULT_JUDGE_BATCH_COMPLETION_WINDOW = "24h"
42 42 DEFAULT_JUDGE_BATCH_POLL_INTERVAL_SEC = 10.0
43 43  
44   -# Rebuild annotation pool (build --force-refresh-labels): search recall + full-corpus rerank + LLM batches
45   -DEFAULT_SEARCH_RECALL_TOP_K = 500
  44 +# --- Rebuild annotation pool (``build --force-refresh-labels``) ---
  45 +# Flow: search recall pool (rerank_score=1, no rerank API) + rerank rest of corpus +
  46 +# LLM labels in fixed-size batches along global order (see ``framework._annotate_rebuild_batches``).
  47 +DEFAULT_SEARCH_RECALL_TOP_K = 200
46 48 DEFAULT_RERANK_HIGH_THRESHOLD = 0.5
47 49 DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000
48 50 DEFAULT_REBUILD_LLM_BATCH_SIZE = 50
49   -DEFAULT_REBUILD_MIN_LLM_BATCHES = 20
  51 +# At least this many LLM batches run before early-stop is considered.
  52 +DEFAULT_REBUILD_MIN_LLM_BATCHES = 10
  53 +# Hard cap on LLM batches per query (each batch labels up to ``DEFAULT_REBUILD_LLM_BATCH_SIZE`` docs).
50 54 DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
51   -DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92
52   -DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3
  55 +
  56 +# LLM early-stop (only after ``DEFAULT_REBUILD_MIN_LLM_BATCHES`` completed):
  57 +# A batch is "bad" when it has **no** ``Exact Match`` label AND either:
  58 +# - irrelevant_ratio >= DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, or
  59 +# - (Irrelevant + Low Relevant) / n >= DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO.
  60 +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Low Relevant").
  61 +# If a batch is bad, increment a streak; otherwise reset streak to 0. Stop when streak reaches
  62 +# ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (consecutive bad batches).
  63 +DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.94
  64 +DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO = 0.96
  65 +DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 2
... ...
scripts/evaluation/eval_framework/framework.py
... ... @@ -21,6 +21,7 @@ from .constants import (
21 21 DEFAULT_JUDGE_DASHSCOPE_BATCH,
22 22 DEFAULT_JUDGE_ENABLE_THINKING,
23 23 DEFAULT_JUDGE_MODEL,
  24 + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
24 25 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
25 26 DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
26 27 DEFAULT_REBUILD_LLM_BATCH_SIZE,
... ... @@ -326,10 +327,28 @@ class SearchEvaluationFramework:
326 327 min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
327 328 max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
328 329 irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  330 + irrelevant_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
329 331 stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
330 332 force_refresh: bool = True,
331 333 ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]:
332   - """LLM-label ``ordered_docs`` in fixed-size batches with early stop after enough irrelevant-heavy batches."""
  334 + """LLM-label ``ordered_docs`` in fixed-size batches along list order.
  335 +
  336 + **Early stop** (only after ``min_batches`` full batches have completed):
  337 +
  338 + Per batch, let *n* = batch size, and count labels among docs in that batch only.
  339 +
  340 + - *bad batch* iff there is **no** ``Exact Match`` in the batch **and** at least one of:
  341 +
  342 + - ``irrelevant_ratio = #(Irrelevant)/n >= irrelevant_stop_ratio`` (default 0.94), or
  343 + - ``( #(Irrelevant) + #(Low Relevant) ) / n >= irrelevant_low_combined_stop_ratio``
  344 + (default 0.96; weak relevance = ``RELEVANCE_LOW``).
  345 +
  346 + Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0.
  347 + Stop labeling when ``streak >= stop_streak`` (default 2) or when ``max_batches`` is reached
  348 + or the ordered list is exhausted.
  349 +
  350 + Constants for defaults: ``eval_framework.constants`` (``DEFAULT_REBUILD_*``).
  351 + """
333 352 batch_logs: List[Dict[str, Any]] = []
334 353 streak = 0
335 354 labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query))
... ... @@ -357,32 +376,46 @@ class SearchEvaluationFramework:
357 376 n = len(batch_docs)
358 377 exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT)
359 378 irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT)
  379 + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW)
360 380 exact_ratio = exact_n / n if n else 0.0
361 381 irrelevant_ratio = irrel_n / n if n else 0.0
  382 + low_ratio = low_n / n if n else 0.0
  383 + irrel_low_ratio = (irrel_n + low_n) / n if n else 0.0
362 384 log_entry = {
363 385 "batch_index": batch_idx + 1,
364 386 "size": n,
365 387 "exact_ratio": round(exact_ratio, 6),
366 388 "irrelevant_ratio": round(irrelevant_ratio, 6),
  389 + "low_ratio": round(low_ratio, 6),
  390 + "irrelevant_plus_low_ratio": round(irrel_low_ratio, 6),
367 391 "offset_start": start,
368 392 "offset_end": min(start + n, total_ordered),
369 393 }
370 394 batch_logs.append(log_entry)
371 395 print(
372 396 f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} "
373   - f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f}",
  397 + f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f} "
  398 + f"irrel_plus_low_ratio={irrel_low_ratio:.4f}",
374 399 flush=True,
375 400 )
376 401  
  402 + # Early-stop streak: only evaluated after min_batches (warm-up before trusting tail quality).
377 403 if batch_idx + 1 >= min_batches:
378   - if irrelevant_ratio > irrelevant_stop_ratio:
  404 + no_exact = exact_n == 0
  405 + # Branch 1: high Irrelevant share, no Exact in this batch.
  406 + heavy_irrel = irrelevant_ratio >= irrelevant_stop_ratio
  407 + # Branch 2: Irrelevant + Low Relevant combined share, still no Exact.
  408 + heavy_irrel_low = irrel_low_ratio >= irrelevant_low_combined_stop_ratio
  409 + bad_batch = no_exact and (heavy_irrel or heavy_irrel_low)
  410 + if bad_batch:
379 411 streak += 1
380 412 else:
381 413 streak = 0
382 414 if streak >= stop_streak:
383 415 print(
384 416 f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches "
385   - f"({stop_streak} consecutive batches with irrelevant_ratio > {irrelevant_stop_ratio})",
  417 + f"({stop_streak} consecutive batches: no Exact and "
  418 + f"(irrelevant>={irrelevant_stop_ratio} or irrel+low>={irrelevant_low_combined_stop_ratio}))",
386 419 flush=True,
387 420 )
388 421 break
... ... @@ -407,8 +440,19 @@ class SearchEvaluationFramework:
407 440 rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
408 441 rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
409 442 rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  443 + rebuild_irrel_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
410 444 rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
411 445 ) -> QueryBuildResult:
  446 + """Build per-query annotation pool and write ``query_builds/*.json``.
  447 +
  448 + Normal mode unions search + rerank windows and fills missing labels once.
  449 +
  450 + **Rebuild mode** (``force_refresh_labels=True``): full recall pool + corpus rerank outside
  451 + pool, optional skip for "easy" queries, then batched LLM labeling with **early stop**;
  452 + see ``_build_query_annotation_set_rebuild`` and ``_annotate_rebuild_batches`` (docstring
  453 + spells out the bad-batch / streak rule). Rebuild tuning knobs: ``rebuild_*`` and
  454 + ``search_recall_top_k`` parameters below; CLI mirrors them under ``build --force-refresh-labels``.
  455 + """
412 456 if force_refresh_labels:
413 457 return self._build_query_annotation_set_rebuild(
414 458 query=query,
... ... @@ -423,6 +467,7 @@ class SearchEvaluationFramework:
423 467 rebuild_min_batches=rebuild_min_batches,
424 468 rebuild_max_batches=rebuild_max_batches,
425 469 rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  470 + rebuild_irrel_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio,
426 471 rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak,
427 472 )
428 473  
... ... @@ -538,6 +583,7 @@ class SearchEvaluationFramework:
538 583 rebuild_min_batches: int,
539 584 rebuild_max_batches: int,
540 585 rebuild_irrelevant_stop_ratio: float,
  586 + rebuild_irrel_low_combined_stop_ratio: float,
541 587 rebuild_irrelevant_stop_streak: int,
542 588 ) -> QueryBuildResult:
543 589 search_size = max(int(search_depth), int(search_recall_top_k))
... ... @@ -570,6 +616,7 @@ class SearchEvaluationFramework:
570 616 "rebuild_min_batches": rebuild_min_batches,
571 617 "rebuild_max_batches": rebuild_max_batches,
572 618 "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio,
  619 + "rebuild_irrel_low_combined_stop_ratio": rebuild_irrel_low_combined_stop_ratio,
573 620 "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak,
574 621 }
575 622  
... ... @@ -611,6 +658,7 @@ class SearchEvaluationFramework:
611 658 min_batches=rebuild_min_batches,
612 659 max_batches=rebuild_max_batches,
613 660 irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  661 + irrelevant_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio,
614 662 stop_streak=rebuild_irrelevant_stop_streak,
615 663 force_refresh=True,
616 664 )
... ...