diff --git a/scripts/evaluation/README.md b/scripts/evaluation/README.md index 59b43a8..411936e 100644 --- a/scripts/evaluation/README.md +++ b/scripts/evaluation/README.md @@ -23,7 +23,7 @@ This directory holds the offline annotation builder, the evaluation web UI/API, | `fusion_experiments_round1.json` | Broader first-round experiments | | `queries/queries.txt` | Canonical evaluation queries | | `README_Requirement.md` | Product/requirements reference | -| `start_eval.sh.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | +| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | | `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. | ## Quick start (repo root) @@ -32,13 +32,13 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS ```bash # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM -./scripts/evaluation/start_eval.sh.sh batch +./scripts/evaluation/start_eval.sh batch -# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive) -./scripts/evaluation/start_eval.sh.sh batch-rebuild +# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive) +./scripts/evaluation/start_eval.sh batch-rebuild # UI: http://127.0.0.1:6010/ -./scripts/evaluation/start_eval.sh.sh serve +./scripts/evaluation/start_eval.sh serve # or: ./scripts/service_ctl.sh start eval-web ``` @@ -71,22 +71,34 @@ Explicit equivalents: Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. 
With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline). -### `start_eval.sh.sh batch-rebuild` (deep annotation rebuild) +### `start_eval.sh batch-rebuild` (deep annotation rebuild) This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`. For **each** query in `queries.txt`, in order: -1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. +1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker. 2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load). 3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API. 4. 
**Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query. 5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins). -6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged. +6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**. + + **Early stop** (defaults in `eval_framework.constants`; overridable via CLI): + + - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed. + - After that, define a **bad batch** as one where the batch has **no** **Exact Match** label **and** either: + - **Irrelevant** proportion **≥ 0.94** (`--rebuild-irrelevant-stop-ratio`), or + - **(Irrelevant + Low Relevant)** proportion **≥ 0.96** (`--rebuild-irrel-low-combined-stop-ratio`). + (“Low Relevant” is the weak tier; **High Relevant** does not count toward this combined ratio.) + - Count **consecutive** bad batches. **Reset** the count to 0 on any batch that is not bad. + - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**2** by default), or when **`--rebuild-max-batches`** (**40**) is reached—whichever comes first (up to **2000** docs per query at default batch size). 
+ + So labeling follows best-first order but **stops early** after two consecutive bad batches (at the default streak of 2); the tail may never be judged. **Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop. -**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`). +**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
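+ The bad-batch / streak rule above can be summarized in a minimal standalone sketch. This is illustrative only, not the `eval_framework` implementation: the helper names are hypothetical, the thresholds mirror the documented defaults, and the label strings are the README's tier names.
+
+ ```python
+ # Sketch of the rebuild early-stop rule (defaults as documented above).
+ from typing import List
+
+ IRREL_STOP = 0.94        # --rebuild-irrelevant-stop-ratio
+ IRREL_LOW_STOP = 0.96    # --rebuild-irrel-low-combined-stop-ratio
+ MIN_BATCHES = 10         # --rebuild-min-batches
+ MAX_BATCHES = 40         # --rebuild-max-batches
+ STOP_STREAK = 2          # --rebuild-irrelevant-stop-streak
+
+ def is_bad_batch(labels: List[str]) -> bool:
+     """Bad iff the batch has no Exact Match and is irrelevant-heavy."""
+     n = len(labels)
+     if n == 0 or "Exact Match" in labels:
+         return False
+     irrel = labels.count("Irrelevant") / n
+     irrel_low = (labels.count("Irrelevant") + labels.count("Low Relevant")) / n
+     return irrel >= IRREL_STOP or irrel_low >= IRREL_LOW_STOP
+
+ def batches_labeled(batch_labels: List[List[str]]) -> int:
+     """Walk batches in order; return how many get labeled before stopping."""
+     streak = 0
+     for i, labels in enumerate(batch_labels[:MAX_BATCHES], start=1):
+         if i >= MIN_BATCHES:  # warm-up: no early stop before min batches
+             streak = streak + 1 if is_bad_batch(labels) else 0
+             if streak >= STOP_STREAK:
+                 return i  # early stop: enough consecutive bad batches
+     return min(len(batch_labels), MAX_BATCHES)
+ ```
+
+ With these defaults the earliest possible stop is after batch 11 (batches 10 and 11 both bad), and a single non-bad batch resets the streak.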
## Artifacts diff --git a/scripts/evaluation/eval_framework/cli.py b/scripts/evaluation/eval_framework/cli.py index 9417776..6113beb 100644 --- a/scripts/evaluation/eval_framework/cli.py +++ b/scripts/evaluation/eval_framework/cli.py @@ -9,6 +9,7 @@ from typing import Any, Dict from .constants import ( DEFAULT_QUERY_FILE, + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, DEFAULT_REBUILD_LLM_BATCH_SIZE, @@ -70,7 +71,7 @@ def build_cli_parser() -> argparse.ArgumentParser: "--search-recall-top-k", type=int, default=None, - help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 500).", + help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 200).", ) build.add_argument( "--rerank-high-threshold", @@ -85,19 +86,25 @@ def build_cli_parser() -> argparse.ArgumentParser: help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).", ) build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).") - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 20).") + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 10).") build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).") build.add_argument( "--rebuild-irrelevant-stop-ratio", type=float, default=None, - help="Rebuild only: irrelevant ratio above this counts toward early-stop streak (default 0.92).", + help="Rebuild only: irrelevant-only branch threshold (>=) for early-stop streak, requires no Exact (default 0.94).", + ) + build.add_argument( + "--rebuild-irrel-low-combined-stop-ratio", + type=float, + 
default=None, + help="Rebuild only: (irrelevant+low)/n threshold (>=) for early-stop streak, requires no Exact (default 0.96).", ) build.add_argument( "--rebuild-irrelevant-stop-streak", type=int, default=None, - help="Rebuild only: stop after this many consecutive batches above irrelevant ratio (default 3).", + help="Rebuild only: consecutive bad batches before early stop (default 2).", ) build.add_argument("--language", default="en") build.add_argument("--force-refresh-rerank", action="store_true") @@ -147,6 +154,9 @@ def run_build(args: argparse.Namespace) -> None: "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio if args.rebuild_irrelevant_stop_ratio is not None else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, + "rebuild_irrel_low_combined_stop_ratio": args.rebuild_irrel_low_combined_stop_ratio + if args.rebuild_irrel_low_combined_stop_ratio is not None + else DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak if args.rebuild_irrelevant_stop_streak is not None else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, diff --git a/scripts/evaluation/eval_framework/constants.py b/scripts/evaluation/eval_framework/constants.py index 395d96c..a701a47 100644 --- a/scripts/evaluation/eval_framework/constants.py +++ b/scripts/evaluation/eval_framework/constants.py @@ -41,12 +41,25 @@ DEFAULT_JUDGE_DASHSCOPE_BATCH = False DEFAULT_JUDGE_BATCH_COMPLETION_WINDOW = "24h" DEFAULT_JUDGE_BATCH_POLL_INTERVAL_SEC = 10.0 -# Rebuild annotation pool (build --force-refresh-labels): search recall + full-corpus rerank + LLM batches -DEFAULT_SEARCH_RECALL_TOP_K = 500 +# --- Rebuild annotation pool (``build --force-refresh-labels``) --- +# Flow: search recall pool (rerank_score=1, no rerank API) + rerank rest of corpus + +# LLM labels in fixed-size batches along global order (see ``framework._annotate_rebuild_batches``). 
+DEFAULT_SEARCH_RECALL_TOP_K = 200 DEFAULT_RERANK_HIGH_THRESHOLD = 0.5 DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000 DEFAULT_REBUILD_LLM_BATCH_SIZE = 50 -DEFAULT_REBUILD_MIN_LLM_BATCHES = 20 +# At least this many LLM batches run before early-stop is considered. +DEFAULT_REBUILD_MIN_LLM_BATCHES = 10 +# Hard cap on LLM batches per query (each batch labels up to ``DEFAULT_REBUILD_LLM_BATCH_SIZE`` docs). DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 -DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92 -DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3 + +# LLM early-stop (only after ``DEFAULT_REBUILD_MIN_LLM_BATCHES`` completed): +# A batch is "bad" when it has **no** ``Exact Match`` label AND either: +# - irrelevant_ratio >= DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, or +# - (Irrelevant + Low Relevant) / n >= DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO. +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Low Relevant"). +# If a batch is bad, increment a streak; otherwise reset streak to 0. Stop when streak reaches +# ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (consecutive bad batches). 
+DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.94 +DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO = 0.96 +DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 2 diff --git a/scripts/evaluation/eval_framework/framework.py b/scripts/evaluation/eval_framework/framework.py index 32c815a..f7973a9 100644 --- a/scripts/evaluation/eval_framework/framework.py +++ b/scripts/evaluation/eval_framework/framework.py @@ -21,6 +21,7 @@ from .constants import ( DEFAULT_JUDGE_DASHSCOPE_BATCH, DEFAULT_JUDGE_ENABLE_THINKING, DEFAULT_JUDGE_MODEL, + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, DEFAULT_REBUILD_LLM_BATCH_SIZE, @@ -326,10 +327,28 @@ class SearchEvaluationFramework: min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, + irrelevant_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, force_refresh: bool = True, ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]: - """LLM-label ``ordered_docs`` in fixed-size batches with early stop after enough irrelevant-heavy batches.""" + """LLM-label ``ordered_docs`` in fixed-size batches along list order. + + **Early stop** (only after ``min_batches`` full batches have completed): + + Per batch, let *n* = batch size, and count labels among docs in that batch only. + + - *bad batch* iff there is **no** ``Exact Match`` in the batch **and** at least one of: + + - ``irrelevant_ratio = #(Irrelevant)/n >= irrelevant_stop_ratio`` (default 0.94), or + - ``( #(Irrelevant) + #(Low Relevant) ) / n >= irrelevant_low_combined_stop_ratio`` + (default 0.96; weak relevance = ``RELEVANCE_LOW``). + + Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0. 
+ Stop labeling when ``streak >= stop_streak`` (default 2) or when ``max_batches`` is reached + or the ordered list is exhausted. + + Constants for defaults: ``eval_framework.constants`` (``DEFAULT_REBUILD_*``). + """ batch_logs: List[Dict[str, Any]] = [] streak = 0 labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query)) @@ -357,32 +376,46 @@ class SearchEvaluationFramework: n = len(batch_docs) exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT) irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT) + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW) exact_ratio = exact_n / n if n else 0.0 irrelevant_ratio = irrel_n / n if n else 0.0 + low_ratio = low_n / n if n else 0.0 + irrel_low_ratio = (irrel_n + low_n) / n if n else 0.0 log_entry = { "batch_index": batch_idx + 1, "size": n, "exact_ratio": round(exact_ratio, 6), "irrelevant_ratio": round(irrelevant_ratio, 6), + "low_ratio": round(low_ratio, 6), + "irrelevant_plus_low_ratio": round(irrel_low_ratio, 6), "offset_start": start, "offset_end": min(start + n, total_ordered), } batch_logs.append(log_entry) print( f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} " - f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f}", + f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f} " + f"irrel_plus_low_ratio={irrel_low_ratio:.4f}", flush=True, ) + # Early-stop streak: only evaluated after min_batches (warm-up before trusting tail quality). if batch_idx + 1 >= min_batches: - if irrelevant_ratio > irrelevant_stop_ratio: + no_exact = exact_n == 0 + # Branch 1: high Irrelevant share, no Exact in this batch. + heavy_irrel = irrelevant_ratio >= irrelevant_stop_ratio + # Branch 2: Irrelevant + Low Relevant combined share, still no Exact. 
+ heavy_irrel_low = irrel_low_ratio >= irrelevant_low_combined_stop_ratio + bad_batch = no_exact and (heavy_irrel or heavy_irrel_low) + if bad_batch: streak += 1 else: streak = 0 if streak >= stop_streak: print( f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches " - f"({stop_streak} consecutive batches with irrelevant_ratio > {irrelevant_stop_ratio})", + f"({stop_streak} consecutive batches: no Exact and " + f"(irrelevant>={irrelevant_stop_ratio} or irrel+low>={irrelevant_low_combined_stop_ratio}))", flush=True, ) break @@ -407,8 +440,19 @@ class SearchEvaluationFramework: rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, + rebuild_irrel_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO, rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, ) -> QueryBuildResult: + """Build per-query annotation pool and write ``query_builds/*.json``. + + Normal mode unions search + rerank windows and fills missing labels once. + + **Rebuild mode** (``force_refresh_labels=True``): full recall pool + corpus rerank outside + pool, optional skip for "easy" queries, then batched LLM labeling with **early stop**; + see ``_build_query_annotation_set_rebuild`` and ``_annotate_rebuild_batches`` (docstring + spells out the bad-batch / streak rule). Rebuild tuning knobs: ``rebuild_*`` and + ``search_recall_top_k`` parameters below; CLI mirrors them under ``build --force-refresh-labels``. 
+ """ if force_refresh_labels: return self._build_query_annotation_set_rebuild( query=query, @@ -423,6 +467,7 @@ class SearchEvaluationFramework: rebuild_min_batches=rebuild_min_batches, rebuild_max_batches=rebuild_max_batches, rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, + rebuild_irrel_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio, rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak, ) @@ -538,6 +583,7 @@ class SearchEvaluationFramework: rebuild_min_batches: int, rebuild_max_batches: int, rebuild_irrelevant_stop_ratio: float, + rebuild_irrel_low_combined_stop_ratio: float, rebuild_irrelevant_stop_streak: int, ) -> QueryBuildResult: search_size = max(int(search_depth), int(search_recall_top_k)) @@ -570,6 +616,7 @@ class SearchEvaluationFramework: "rebuild_min_batches": rebuild_min_batches, "rebuild_max_batches": rebuild_max_batches, "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio, + "rebuild_irrel_low_combined_stop_ratio": rebuild_irrel_low_combined_stop_ratio, "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak, } @@ -611,6 +658,7 @@ class SearchEvaluationFramework: min_batches=rebuild_min_batches, max_batches=rebuild_max_batches, irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, + irrelevant_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio, stop_streak=rebuild_irrelevant_stop_streak, force_refresh=True, ) -- libgit2 0.21.2