diff --git a/scripts/evaluation/README.md b/scripts/evaluation/README.md index 59b43a8..411936e 100644 --- a/scripts/evaluation/README.md +++ b/scripts/evaluation/README.md @@ -23,7 +23,7 @@ This directory holds the offline annotation builder, the evaluation web UI/API, | `fusion_experiments_round1.json` | Broader first-round experiments | | `queries/queries.txt` | Canonical evaluation queries | | `README_Requirement.md` | Product/requirements reference | -| `start_eval.sh.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | +| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | | `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. | ## Quick start (repo root) @@ -32,13 +32,13 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS ```bash # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM -./scripts/evaluation/start_eval.sh.sh batch +./scripts/evaluation/start_eval.sh batch -# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive) -./scripts/evaluation/start_eval.sh.sh batch-rebuild +# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive) +./scripts/evaluation/start_eval.sh batch-rebuild # UI: http://127.0.0.1:6010/ -./scripts/evaluation/start_eval.sh.sh serve +./scripts/evaluation/start_eval.sh serve # or: ./scripts/service_ctl.sh start eval-web ``` @@ -71,22 +71,34 @@ Explicit equivalents: Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. 
With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline). -### `start_eval.sh.sh batch-rebuild` (deep annotation rebuild) +### `start_eval.sh batch-rebuild` (deep annotation rebuild) This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`. For **each** query in `queries.txt`, in order: -1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. +1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. 2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load). 3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API. 4. 
**Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query. 5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins). -6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged. +6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**. + + **Early stop** (defaults in `eval_framework.constants`; overridable via CLI): + + - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed. + - After that, define a **bad batch** as one where the batch has **no** **Exact Match** label **and** either: + - **Irrelevant** proportion **≥ 0.94** (`--rebuild-irrelevant-stop-ratio`), or + - **(Irrelevant + Low Relevant)** proportion **≥ 0.96** (`--rebuild-irrel-low-combined-stop-ratio`). + (“Low Relevant” is the weak tier; **High Relevant** does not count toward this combined ratio.) + - Count **consecutive** bad batches. **Reset** the count to 0 on any batch that is not bad. + - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**2** by default), or when **`--rebuild-max-batches`** (**40**) is reached—whichever comes first (up to **2000** docs per query at default batch size). 
+ + So labeling follows best-first order but **stops early** after two consecutive bad batches (at the default streak of 2); the tail may never be judged. **Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop. -**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`). +**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
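+ The bad-batch / streak rule above can be summarized in a minimal standalone sketch. This is illustrative only, not the `eval_framework` implementation: the helper names are hypothetical, the thresholds mirror the documented defaults, and the label strings are the README's tier names.
+
+ ```python
+ # Sketch of the rebuild early-stop rule (defaults as documented above).
+ from typing import List
+
+ IRREL_STOP = 0.94        # --rebuild-irrelevant-stop-ratio
+ IRREL_LOW_STOP = 0.96    # --rebuild-irrel-low-combined-stop-ratio
+ MIN_BATCHES = 10         # --rebuild-min-batches
+ MAX_BATCHES = 40         # --rebuild-max-batches
+ STOP_STREAK = 2          # --rebuild-irrelevant-stop-streak
+
+ def is_bad_batch(labels: List[str]) -> bool:
+     """Bad iff the batch has no Exact Match and is irrelevant-heavy."""
+     n = len(labels)
+     if n == 0 or "Exact Match" in labels:
+         return False
+     irrel = labels.count("Irrelevant") / n
+     irrel_low = (labels.count("Irrelevant") + labels.count("Low Relevant")) / n
+     return irrel >= IRREL_STOP or irrel_low >= IRREL_LOW_STOP
+
+ def batches_labeled(batch_labels: List[List[str]]) -> int:
+     """Walk batches in order; return how many get labeled before stopping."""
+     streak = 0
+     for i, labels in enumerate(batch_labels[:MAX_BATCHES], start=1):
+         if i >= MIN_BATCHES:  # warm-up: no early stop before min batches
+             streak = streak + 1 if is_bad_batch(labels) else 0
+             if streak >= STOP_STREAK:
+                 return i  # early stop: enough consecutive bad batches
+     return min(len(batch_labels), MAX_BATCHES)
+ ```
+
+ With these defaults the earliest possible stop is after batch 11 (batches 10 and 11 both bad), and a single non-bad batch resets the streak.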
## Artifacts diff --git a/scripts/evaluation/eval_framework/cli.py b/scripts/evaluation/eval_framework/cli.py index 9417776..6113beb 100644 --- a/scripts/evaluation/eval_framework/cli.py +++ b/scripts/evaluation/eval_framework/cli.py @@ -9,6 +9,7 @@ from typing import Any, Dict from .constants import ( DEFAULT_QUERY_FILE, + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, DEFAULT_REBUILD_LLM_BATCH_SIZE, @@ -70,7 +71,7 @@ def build_cli_parser() -> argparse.ArgumentParser: "--search-recall-top-k", type=int, default=None, - help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 500).", + help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 200).", ) build.add_argument( "--rerank-high-threshold", @@ -85,19 +86,25 @@ def build_cli_parser() -> argparse.ArgumentParser: help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).", ) build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).") - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).") + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 10).") build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).") build.add_argument( "--rebuild-irrelevant-stop-ratio", type=float, default=None, - help="Rebuild only: irrelevant ratio above this counts toward early-stop streak (default 0.92).", + help="Rebuild only: irrelevant-only branch threshold (>=) for early-stop streak, requires no Exact (default 0.94).", + ) + build.add_argument( + "--rebuild-irrel-low-combined-stop-ratio", + type=float, + 
default=None, + help="Rebuild only: (irrelevant+low)/n threshold (>=) for early-stop streak, requires no Exact (default 0.96).", ) build.add_argument( "--rebuild-irrelevant-stop-streak", type=int, default=None, - help="Rebuild only: stop after this many consecutive batches above irrelevant ratio (default 3).", + help="Rebuild only: consecutive bad batches before early stop (default 2).", ) build.add_argument("--language", default="en") build.add_argument("--force-refresh-rerank", action="store_true") @@ -147,6 +154,9 @@ def run_build(args: argparse.Namespace) -> None: "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio if args.rebuild_irrelevant_stop_ratio is not None else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, + "rebuild_irrel_low_combined_stop_ratio": args.rebuild_irrel_low_combined_stop_ratio + if args.rebuild_irrel_low_combined_stop_ratio is not None + else DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak if args.rebuild_irrelevant_stop_streak is not None else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, diff --git a/scripts/evaluation/eval_framework/constants.py b/scripts/evaluation/eval_framework/constants.py index 395d96c..a701a47 100644 --- a/scripts/evaluation/eval_framework/constants.py +++ b/scripts/evaluation/eval_framework/constants.py @@ -41,12 +41,25 @@ DEFAULT_JUDGE_DASHSCOPE_BATCH = False DEFAULT_JUDGE_BATCH_COMPLETION_WINDOW = "24h" DEFAULT_JUDGE_BATCH_POLL_INTERVAL_SEC = 10.0 -# Rebuild annotation pool (build --force-refresh-labels): search recall + full-corpus rerank + LLM batches -DEFAULT_SEARCH_RECALL_TOP_K = 500 +# --- Rebuild annotation pool (``build --force-refresh-labels``) --- +# Flow: search recall pool (rerank_score=1, no rerank API) + rerank rest of corpus + +# LLM labels in fixed-size batches along global order (see ``framework._annotate_rebuild_batches``). 
+DEFAULT_SEARCH_RECALL_TOP_K = 200 DEFAULT_RERANK_HIGH_THRESHOLD = 0.5 DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000 DEFAULT_REBUILD_LLM_BATCH_SIZE = 50 -DEFAULT_REBUILD_MIN_LLM_BATCHES = 20 +# At least this many LLM batches run before early-stop is considered. +DEFAULT_REBUILD_MIN_LLM_BATCHES = 10 +# Hard cap on LLM batches per query (each batch labels up to ``DEFAULT_REBUILD_LLM_BATCH_SIZE`` docs). DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 -DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92 -DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3 + +# LLM early-stop (only after ``DEFAULT_REBUILD_MIN_LLM_BATCHES`` completed): +# A batch is "bad" when it has **no** ``Exact Match`` label AND either: +# - irrelevant_ratio >= DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, or +# - (Irrelevant + Low Relevant) / n >= DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO. +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Low Relevant"). +# If a batch is bad, increment a streak; otherwise reset streak to 0. Stop when streak reaches +# ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (consecutive bad batches). 
+DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.94 +DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO = 0.96 +DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 2 diff --git a/scripts/evaluation/eval_framework/framework.py b/scripts/evaluation/eval_framework/framework.py index 32c815a..f7973a9 100644 --- a/scripts/evaluation/eval_framework/framework.py +++ b/scripts/evaluation/eval_framework/framework.py @@ -21,6 +21,7 @@ from .constants import ( DEFAULT_JUDGE_DASHSCOPE_BATCH, DEFAULT_JUDGE_ENABLE_THINKING, DEFAULT_JUDGE_MODEL, + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, DEFAULT_REBUILD_LLM_BATCH_SIZE, @@ -326,10 +327,28 @@ class SearchEvaluationFramework: min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, + irrelevant_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, force_refresh: bool = True, ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]: - """LLM-label ``ordered_docs`` in fixed-size batches with early stop after enough irrelevant-heavy batches.""" + """LLM-label ``ordered_docs`` in fixed-size batches along list order. + + **Early stop** (only after ``min_batches`` full batches have completed): + + Per batch, let *n* = batch size, and count labels among docs in that batch only. + + - *bad batch* iff there is **no** ``Exact Match`` in the batch **and** at least one of: + + - ``irrelevant_ratio = #(Irrelevant)/n >= irrelevant_stop_ratio`` (default 0.94), or + - ``( #(Irrelevant) + #(Low Relevant) ) / n >= irrelevant_low_combined_stop_ratio`` + (default 0.96; weak relevance = ``RELEVANCE_LOW``). + + Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0. 
+ Stop labeling when ``streak >= stop_streak`` (default 2) or when ``max_batches`` is reached + or the ordered list is exhausted. + + Constants for defaults: ``eval_framework.constants`` (``DEFAULT_REBUILD_*``). + """ batch_logs: List[Dict[str, Any]] = [] streak = 0 labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query)) @@ -357,32 +376,46 @@ class SearchEvaluationFramework: n = len(batch_docs) exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT) irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT) + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW) exact_ratio = exact_n / n if n else 0.0 irrelevant_ratio = irrel_n / n if n else 0.0 + low_ratio = low_n / n if n else 0.0 + irrel_low_ratio = (irrel_n + low_n) / n if n else 0.0 log_entry = { "batch_index": batch_idx + 1, "size": n, "exact_ratio": round(exact_ratio, 6), "irrelevant_ratio": round(irrelevant_ratio, 6), + "low_ratio": round(low_ratio, 6), + "irrelevant_plus_low_ratio": round(irrel_low_ratio, 6), "offset_start": start, "offset_end": min(start + n, total_ordered), } batch_logs.append(log_entry) print( f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} " - f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f}", + f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f} " + f"irrel_plus_low_ratio={irrel_low_ratio:.4f}", flush=True, ) + # Early-stop streak: only evaluated after min_batches (warm-up before trusting tail quality). if batch_idx + 1 >= min_batches: - if irrelevant_ratio > irrelevant_stop_ratio: + no_exact = exact_n == 0 + # Branch 1: high Irrelevant share, no Exact in this batch. + heavy_irrel = irrelevant_ratio >= irrelevant_stop_ratio + # Branch 2: Irrelevant + Low Relevant combined share, still no Exact. 
+ heavy_irrel_low = irrel_low_ratio >= irrelevant_low_combined_stop_ratio + bad_batch = no_exact and (heavy_irrel or heavy_irrel_low) + if bad_batch: streak += 1 else: streak = 0 if streak >= stop_streak: print( f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches " - f"({stop_streak} consecutive batches with irrelevant_ratio > {irrelevant_stop_ratio})", + f"({stop_streak} consecutive batches: no Exact and " + f"(irrelevant>={irrelevant_stop_ratio} or irrel+low>={irrelevant_low_combined_stop_ratio}))", flush=True, ) break @@ -407,8 +440,19 @@ class SearchEvaluationFramework: rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, + rebuild_irrel_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, ) -> QueryBuildResult: + """Build per-query annotation pool and write ``query_builds/*.json``. + + Normal mode unions search + rerank windows and fills missing labels once. + + **Rebuild mode** (``force_refresh_labels=True``): full recall pool + corpus rerank outside + pool, optional skip for "easy" queries, then batched LLM labeling with **early stop**; + see ``_build_query_annotation_set_rebuild`` and ``_annotate_rebuild_batches`` (docstring + spells out the bad-batch / streak rule). Rebuild tuning knobs: ``rebuild_*`` and + ``search_recall_top_k`` parameters below; CLI mirrors them under ``build --force-refresh-labels``. 
+ """ if force_refresh_labels: return self._build_query_annotation_set_rebuild( query=query, @@ -423,6 +467,7 @@ class SearchEvaluationFramework: rebuild_min_batches=rebuild_min_batches, rebuild_max_batches=rebuild_max_batches, rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, + rebuild_irrel_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio, rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak, ) @@ -538,6 +583,7 @@ class SearchEvaluationFramework: rebuild_min_batches: int, rebuild_max_batches: int, rebuild_irrelevant_stop_ratio: float, + rebuild_irrel_low_combined_stop_ratio: float, rebuild_irrelevant_stop_streak: int, ) -> QueryBuildResult: search_size = max(int(search_depth), int(search_recall_top_k)) @@ -570,6 +616,7 @@ class SearchEvaluationFramework: "rebuild_min_batches": rebuild_min_batches, "rebuild_max_batches": rebuild_max_batches, "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio, + "rebuild_irrel_low_combined_stop_ratio": rebuild_irrel_low_combined_stop_ratio, "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak, } @@ -611,6 +658,7 @@ class SearchEvaluationFramework: min_batches=rebuild_min_batches, max_batches=rebuild_max_batches, irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, + irrelevant_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio, stop_streak=rebuild_irrelevant_stop_streak, force_refresh=True, ) -- libgit2 0.21.2