docs/Usage-Guide.md
... ... @@ -202,13 +202,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
202 202 ./scripts/service_ctl.sh restart backend
203 203 sleep 3
204 204 ./scripts/service_ctl.sh status backend
205   -python ./scripts/eval_search_quality.py
  205 +./scripts/evaluation/quick_start_eval.sh batch
206 206 ```
207 207  
208   -Evaluation results are written to `artifacts/search_eval/` and include:
209   -
210   -- `search_eval_*.json`: for scripted follow-up analysis
211   -- `search_eval_*.md`: for manually browsing the top-20 results, scores, and matched clauses
  208 +Offline batch evaluation writes annotations and reports to `artifacts/search_evaluation/` (SQLite, plus JSON/Markdown under `batch_reports/`). See [scripts/evaluation/README.md](../scripts/evaluation/README.md) for commands and details.
212 209  
213 210 ### Method 4: Multi-environment examples (prod / uat)
214 211  
... ...
docs/issue-2026-03-31-评估框架-done-0331.md 0 → 100644
... ... @@ -0,0 +1,151 @@
  1 +
  2 +
  3 +Reference material:
  4 +
  5 +1. Search API:
  6 +
  7 +```bash
  8 +export BASE_URL="${BASE_URL:-http://localhost:6002}"
  9 +export TENANT_ID="${TENANT_ID:-163}" # change to your tenant ID
  10 +```
  11 +```bash
  12 +curl -sS "$BASE_URL/search/" \
  13 + -H "Content-Type: application/json" \
  14 + -H "X-Tenant-ID: $TENANT_ID" \
  15 + -d '{
  16 + "query": "芭比娃娃",
  17 + "size": 20,
  18 + "from": 0,
  19 + "language": "zh"
  20 + }'
  21 +```
  22 +
  23 +response:
  24 +{
  25 + "results": [
  26 + {
  27 + "spu_id": "12345",
  28 + "title": "芭比时尚娃娃",
  29 + "image_url": "https://example.com/image.jpg",
  30 + "specifications":[],
  31 + "skus":[{"sku_id":" ...
  32 +...
  33 +
  34 +2. Rerank service:
  35 +curl -X POST "http://localhost:6007/rerank" \
  36 + -H "Content-Type: application/json" \
  37 + -d '{
  38 + "query": "玩具 芭比",
  39 + "docs": ["12PCS 6 Types of Dolls with Bottles", "纯棉T恤 短袖"],
  40 + "top_n":386,
  41 + "normalize": true
  42 + }'
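The rerank endpoint above is later required to be called with 80 docs per request and with every score cached by query+title. A minimal sketch of that batching/caching wrapper; `rerank_fn` is a hypothetical stand-in for the HTTP call to `/rerank`, and the cache is any mutable mapping, not the framework's actual client:

```python
def rerank_with_cache(query, titles, rerank_fn, cache, batch_size=80):
    """Score titles for one query, skipping (query, title) pairs already cached.

    rerank_fn(query, titles) -> list of scores, e.g. a thin wrapper around
    POST /rerank (hypothetical signature); cache maps (query, title) -> score.
    """
    # Only titles without a cached score go to the reranker.
    pending = [t for t in titles if (query, t) not in cache]
    for i in range(0, len(pending), batch_size):
        chunk = pending[i:i + batch_size]
        for title, score in zip(chunk, rerank_fn(query, chunk)):
            cache[(query, title)] = score
    return [cache[(query, t)] for t in titles]
```

With 200 titles and one already cached, this issues three requests (80 + 80 + 39 docs) and returns scores in input order.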
  43 +
  44 +
  45 +3. Query by specified fields: es_debug_search.py
  46 +
  47 +
  48 +Main tasks:
  49 +1. Build the evaluation tool:
  50 +Note: judge result quality with one unified evaluation tool. Do not hand-write keyword-matching rules per query to decide whether results qualify; that approach is complex, prone to misjudgment, and does not scale to other queries.
  51 +So build a search-result evaluation tool and a multi-result comparison tool for the later annotation-set builder to call. Internally the tool can call an LLM to judge, with clear definitions of fully relevant, basically relevant, and irrelevant:
  52 +
  53 +prompt:
  54 +```text
  55 +You are an e-commerce search relevance evaluation assistant. Based on the user query and each product's information, output each product's relevance grade.
  56 +
  57 +## Relevance grade criteria
  58 +Exact (fully relevant): fully matches the user's search need.
  59 +Partial (partially relevant): the main intent is satisfied (same or similar category, broadly matching the search intent), but secondary attributes (color, style, size, etc.) deviate from the need or cannot be confirmed.
  60 +Irrelevant: category or use does not match; the main need is not satisfied.
  61 +
  62 +1. {title1} {option1_value1} {option2_value1} {option3_value1}
  63 +2. {title2} {option1_value2} {option2_value2}, {option3_value2}
  64 +...
  65 +50. {title50} {option1_value50} {option2_value50} {option3_value50}
  66 +
  67 +## Output format
  68 +Strictly output {input_nums} lines, each containing only one of Exact / Partial / Irrelevant, corresponding in order to the 50 products above. Do not output anything else.
  69 +```
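Since the judge must emit exactly one label per line, the caller needs a strict parser so a malformed response can trigger a retry. A minimal sketch (function and constant names are illustrative, not the framework's API):

```python
VALID_LABELS = {"Exact", "Partial", "Irrelevant"}

def parse_judge_output(text, expected_n):
    """Parse the judge response: one label per line, in input order.

    Raises ValueError on any mismatch so the caller can retry the batch;
    the retry policy itself is left to the caller.
    """
    labels = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(labels) != expected_n:
        raise ValueError(f"expected {expected_n} labels, got {len(labels)}")
    invalid = [lab for lab in labels if lab not in VALID_LABELS]
    if invalid:
        raise ValueError(f"invalid labels: {invalid[:3]}")
    return labels
```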
  70 +
  71 +
  72 +2. Build the test set (result annotation):
  73 +@queries/queries.txt
  74 +
  75 +For each query in it:
  76 +1. Recall:
  77 +1) Recall results via the search API. The top 500 search results enter the recall pool, all marked with score 1.
  78 +2) Call the rerank model over the full corpus (tenant_id=163); skip anything already in the recall pool (already scored 1) and send everything else through the reranker API, 80 docs per request. Rerank scores must be cached (a local file cache is fine: query+title -> rerank_score).
  79 +3) If a query has more than 1000 results with reranker score above 0.5, log one line and skip the query: too many relevant results, too easily satisfied.
  80 +
  81 +
  82 +2. Fully sort the recalled content, then LLM-judge it batch by batch (50 per batch), logging the Exact ratio and Irrelevant ratio for every batch.
  83 +Stop once three consecutive batches have an Irrelevant ratio above 92%.
  84 +Run at least 15 batches and at most 40.
  85 +
  86 +3. Think through how to store the results so they are easy to compare, reuse, and display later.
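The batched judging flow above (50 per batch, early stop on three consecutive batches above 92% irrelevant, between 15 and 40 batches) can be sketched as follows; `judge_batch` is a hypothetical wrapper around the LLM judge, not the real implementation:

```python
def label_sorted_pool(sorted_docs, judge_batch, batch_size=50,
                      min_batches=15, max_batches=40,
                      stop_ratio=0.92, stop_streak=3):
    """LLM-label a rerank-sorted pool batch by batch with early stopping."""
    labels = []
    streak = 0
    for b in range(max_batches):
        chunk = sorted_docs[b * batch_size:(b + 1) * batch_size]
        if not chunk:
            break
        batch_labels = judge_batch(chunk)
        labels.extend(batch_labels)
        exact = batch_labels.count("Exact") / len(batch_labels)
        irrelevant = batch_labels.count("Irrelevant") / len(batch_labels)
        print(f"batch {b + 1}: exact={exact:.2f} irrelevant={irrelevant:.2f}")
        # Count consecutive mostly-irrelevant batches toward the stop streak.
        streak = streak + 1 if irrelevant > stop_ratio else 0
        if b + 1 >= min_batches and streak >= stop_streak:
            break  # tail is clearly irrelevant; stop labeling
    return labels
```

Note the minimum-batch floor dominates: even if the streak condition is met early, labeling continues through batch 15.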
  87 +
  88 +
  89 +
  90 +
  91 +3. Evaluation tool page:
  92 +Design an interactive search evaluation page on port 6010.
  93 +Page layout: a search box on top; when a search is issued, show the overall metrics for this run and the top-100 results below (with pagination).
  94 +
  95 +Overall metrics:
  96 +| Metric | Meaning |
  97 +|------|------|
  98 +| **P@5, P@10, P@20, P@50** | Precision over the top K results, counting only grade 3 as relevant |
  99 +| **P@5_2_3 ~ P@50_2_3** | Precision over the top K, counting grades 2 and 3 as relevant |
  100 +| **MAP_3** | Average Precision (single query) with only grade 3 relevant |
  101 +| **MAP_2_3** | Average Precision with grades 2 and 3 relevant |
  102 +
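Under the grade convention in the table above (3 = Exact, 2 = Partial, 1 = Irrelevant), the metrics reduce to a few lines. One assumption to flag: AP here is normalized by the number of relevant results retrieved, a common simplification when the total relevant count in the corpus is unknown:

```python
def precision_at_k(grades, k, relevant=frozenset({3})):
    """P@K: fraction of the top-K results whose grade counts as relevant."""
    return sum(g in relevant for g in grades[:k]) / k

def average_precision(grades, relevant=frozenset({3})):
    """AP over the ranked list, normalized by relevant hits retrieved."""
    hits = 0
    total = 0.0
    for rank, g in enumerate(grades, start=1):
        if g in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant rank
    return total / hits if hits else 0.0
```

`MAP_3` is then the mean of `average_precision(..., relevant={3})` across queries, and the `_2_3` variants pass `relevant={2, 3}`.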
  103 +Result list:
  104 +One result per row: on the left, each result's annotation value (three grades; results may also be color-coded) and its image; next to the image, three stacked lines showing title.en + title.en + the first SKU's option1/2/3_value.
  105 +
  106 +
  107 +Leftmost panel of the evaluation page:
  108 +Queries default to queries/queries.txt, loaded into the left-hand list; clicking any entry issues a search.
  109 +
  110 +4. Batch evaluation tool
  111 +
  112 +Provide a batch execution script.
  113 +
  114 +Add a batch-evaluation page. Clicking the batch-evaluation button searches every query in turn, aggregates the overall metrics, and generates a report whose name carries a timestamp and key information. Also record the main search program's config.yaml at the time.
  115 +Design carefully how to switch between the two modes so that one port serves both interactions.
  116 +Batch evaluation focuses on the aggregate metrics across all queries.
  117 +Record the test time, the config file in effect, and the corresponding results. Keep the history of evaluation runs so that each result can be traced back to its config file and metrics.
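The batch-report requirement above (timestamped report name plus a snapshot of the config in effect) can be sketched as below; the paths and field names are illustrative, not the framework's actual schema:

```python
import json
import time
from pathlib import Path

def save_batch_report(metrics, config_text,
                      out_dir="artifacts/search_evaluation/batch_reports"):
    """Write a timestamped metrics JSON next to the config.yaml snapshot
    that produced it, so every historical run stays traceable."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    report = {"timestamp": stamp, "metrics": metrics}
    (root / f"batch_{stamp}.json").write_text(
        json.dumps(report, ensure_ascii=False, indent=2))
    (root / f"config_{stamp}.yaml").write_text(config_text)
    return stamp
```

Storing the config beside the metrics (rather than a pointer into a mutable file) is what makes later run-to-run comparison reliable.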
  118 +
  119 +The above is my overall design, but it has gaps. Understand the requirements at a higher level; you are free to adjust the design, drawing on best practices for automated search evaluation frameworks to produce a better design and implementation.
  120 +
  121 +
  122 +
  123 +
  124 +
  125 +
  126 +1. Carefully verify the quality of this annotation set. If it falls short, improve the tooling and iterate until the labels are good enough to serve as an automated evaluator of retrieval quality and to guide tuning.
  127 +2. Once the annotation set is good enough, the batch evaluation tool is solid, and your own trial runs show it can tell good search quality from bad, start the actual retrieval tuning: using the 50-query annotation set and the batch evaluation tool, tune the fusion formula. Design the experiments carefully first; for each parameter set, edit config.yaml, restart (./restart.sh backend), run batch evaluation, and collect the results.
  128 +During evaluation, if the tools are awkward or the logs are incomplete and fixing them would improve efficiency, do that first.
  129 +You are the overall code owner, with full authority to do whatever retrieval tuning requires. If you spot other changes with bigger potential gains, experiment with them too; you may even modify the fusion and rerank funnel code in pursuit of better metrics.
  130 +But note: due to performance and latency constraints, do not increase the reranker's input count and do not enable the fine-ranking stage; latency cannot absorb two rounds of reranker calls.
  131 +
  132 +
  133 +
  134 +
  135 +
  136 +
  137 +
  138 +
  139 +
  140 +@scripts/evaluation/README.md @scripts/evaluation/eval_framework/framework.py
  141 +@quick_start_eval.sh (29-35)
  142 +Refactor to follow this flow:
  143 +When rebuilding, for each query:
  144 +each query's results should come from a full-corpus scan,
  145 +1. The top 500 search results enter the recall pool, all marked with score 1.
  146 +2. Call the rerank model over the full corpus (tenant_id=163); skip anything already in the recall pool (already scored 1) and send everything else through the reranker.
  147 +3. If more than 1000 docs have reranker score above 0.5, log one line and skip the query: too many relevant results, too easily satisfied.
  148 +
  149 +Fully sort the recalled content, then LLM-judge it batch by batch (50 per batch), logging the Exact ratio and Irrelevant ratio for every batch.
  150 +Stop once three consecutive batches have an Irrelevant ratio above 92%.
  151 +Run at least 15 batches and at most 40.
... ...
docs/相关性检索优化说明.md
... ... @@ -240,15 +240,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
240 240 ./scripts/service_ctl.sh restart backend
241 241 sleep 3
242 242 ./scripts/service_ctl.sh status backend
243   -python ./scripts/eval_search_quality.py
  243 +./scripts/evaluation/quick_start_eval.sh batch
244 244 ```
245 245  
246   -The evaluation script generates:
247   -
248   -- `artifacts/search_eval/search_eval_*.json`
249   -- `artifacts/search_eval/search_eval_*.md`
250   -
251   -Query-level and result-level debug fields can be extracted directly from the JSON for analysis.
  246 +Evaluation artifacts live in `artifacts/search_evaluation/` (e.g. `search_eval.sqlite3`, JSON/Markdown under `batch_reports/`). See [scripts/evaluation/README.md](../scripts/evaluation/README.md) for the flow and parameters.
252 247  
253 248 ## 11. Suggested test checklist
254 249  
... ...
scripts/evaluation/README.md
1 1 # Search Evaluation Framework
2 2  
3   -This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
  3 +This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
4 4  
5   -It is designed around one core rule:
  5 +**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
6 6  
7   -- Annotation should be built offline first.
8   -- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
9   -- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
  7 +## What it does
10 8  
11   -## Goals
  9 +1. Build an annotation set for a fixed query set.
  10 +2. Evaluate live search results against cached labels.
  11 +3. Run batch evaluation and keep historical reports with config snapshots.
  12 +4. Tune fusion parameters in a reproducible loop.
12 13  
13   -The framework supports four related tasks:
  14 +## Layout
14 15  
15   -1. Build an annotation set for a fixed query set.
16   -2. Evaluate a live search result list against that annotation set.
17   -3. Run batch evaluation and store historical reports with config snapshots.
18   -4. Tune fusion parameters reproducibly.
19   -
20   -## Files
21   -
22   -- `eval_framework/` (Python package)
23   - Modular layout: `framework.py` (orchestration), `store.py` (SQLite), `clients.py` (search/rerank/LLM), `prompts.py` (judge templates), `metrics.py`, `reports.py`, `web_app.py`, `cli.py`, and `static/` (evaluation UI HTML/CSS/JS).
24   -- `build_annotation_set.py`
25   - Thin CLI entrypoint into `eval_framework`.
26   -- `serve_eval_web.py`
27   - Thin web entrypoint into `eval_framework`.
28   -- `tune_fusion.py`
29   - Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
30   -- `fusion_experiments_shortlist.json`
31   - A compact experiment set for practical tuning.
32   -- `fusion_experiments_round1.json`
33   - A broader first-round experiment set.
34   -- `queries/queries.txt`
35   - The canonical evaluation query set.
36   -- `README_Requirement.md`
37   - Requirement reference document.
38   -- `quick_start_eval.sh`
39   - Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
40   -- `../start_eval_web.sh`
41   - Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of core services.
42   -
43   -## Quick start (from repo root)
44   -
45   -Set tenant if needed (`export TENANT_ID=163`). Requires live search API, DashScope key when the batch step needs new LLM labels, and a working backend.
  16 +| Path | Role |
  17 +|------|------|
  18 +| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
  19 +| `build_annotation_set.py` | CLI entry (build / batch / audit) |
  20 +| `serve_eval_web.py` | Web server for the evaluation UI |
  21 +| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
  22 +| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
  23 +| `fusion_experiments_round1.json` | Broader first-round experiments |
  24 +| `queries/queries.txt` | Canonical evaluation queries |
  25 +| `README_Requirement.md` | Product/requirements reference |
  26 +| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
  27 +| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
  28 +
  29 +## Quick start (repo root)
  30 +
  31 +Set tenant if needed (`export TENANT_ID=163`). You need a live search API, a DashScope API key when new LLM labels are required, and a running backend.
46 32  
47 33 ```bash
48   -# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
  34 +# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
49 35 ./scripts/evaluation/quick_start_eval.sh batch
50 36  
51   -# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
  37 +# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive)
52 38 ./scripts/evaluation/quick_start_eval.sh batch-rebuild
53 39  
54   -# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
  40 +# UI: http://127.0.0.1:6010/
55 41 ./scripts/evaluation/quick_start_eval.sh serve
56   -# Or: ./scripts/service_ctl.sh start eval-web
  42 +# or: ./scripts/service_ctl.sh start eval-web
57 43 ```
58 44  
59   -Equivalent explicit commands:
  45 +Explicit equivalents:
60 46  
61 47 ```bash
62   -# Safe default: no --force-refresh-labels
63 48 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
64 49 --tenant-id "${TENANT_ID:-163}" \
65 50 --queries-file scripts/evaluation/queries/queries.txt \
... ... @@ -67,207 +52,54 @@ Equivalent explicit commands:
67 52 --language en \
68 53 --labeler-mode simple
69 54  
70   -# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
71   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
72   - --tenant-id "${TENANT_ID:-163}" \
73   - --queries-file scripts/evaluation/queries/queries.txt \
74   - --top-k 50 \
75   - --language en \
76   - --labeler-mode simple \
77   - --force-refresh-labels
78   -
79   -./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  55 +./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
80 56 --tenant-id "${TENANT_ID:-163}" \
81 57 --queries-file scripts/evaluation/queries/queries.txt \
82   - --host 127.0.0.1 \
83   - --port 6010
84   -```
85   -
86   -**Batch behavior:** There is no “skip queries already processed”. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
87   -
88   -## Storage Layout
89   -
90   -All generated artifacts are under:
91   -
92   -- `/data/saas-search/artifacts/search_evaluation`
93   -
94   -Important subpaths:
95   -
96   -- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
97   - Main cache and annotation store.
98   -- `/data/saas-search/artifacts/search_evaluation/query_builds`
99   - Per-query pooled annotation-set build artifacts.
100   -- `/data/saas-search/artifacts/search_evaluation/batch_reports`
101   - Batch evaluation JSON, Markdown reports, and config snapshots.
102   -- `/data/saas-search/artifacts/search_evaluation/audits`
103   - Audit summaries for label quality checks.
104   -- `/data/saas-search/artifacts/search_evaluation/tuning_runs`
105   - Fusion experiment summaries and per-experiment config snapshots.
106   -
107   -## SQLite Schema Summary
108   -
109   -The main tables in `search_eval.sqlite3` are:
110   -
111   -- `corpus_docs`
112   - Cached product corpus for the tenant.
113   -- `rerank_scores`
114   - Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
115   -- `relevance_labels`
116   - Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
117   -- `query_profiles`
118   - Structured query-intent profiles extracted before labeling.
119   -- `build_runs`
120   - Per-query pooled-build records.
121   -- `batch_runs`
122   - Batch evaluation history.
123   -
124   -## Label Semantics
125   -
126   -Three labels are used throughout:
127   -
128   -- `Exact`
129   - Fully matches the intended product type and all explicit required attributes.
130   -- `Partial`
131   - Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
132   -- `Irrelevant`
133   - Product type mismatches, or explicit required attributes conflict.
134   -
135   -The framework always uses:
136   -
137   -- LLM-based batched relevance classification
138   -- caching and retry logic for robust offline labeling
139   -
140   -There are now two labeler modes:
141   -
142   -- `simple`
143   - Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
144   -- `complex`
145   - Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
146   -
147   -## Offline-First Workflow
148   -
149   -### 1. Refresh labels for the evaluation query set
150   -
151   -For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
152   -
153   -Example (fills missing labels only; recommended default):
154   -
155   -```bash
156   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
157   - --tenant-id 163 \
158   - --queries-file scripts/evaluation/queries/queries.txt \
159   - --top-k 50 \
  58 + --search-depth 500 \
  59 + --rerank-depth 10000 \
  60 + --force-refresh-rerank \
  61 + --force-refresh-labels \
160 62 --language en \
161 63 --labeler-mode simple
162   -```
163   -
164   -To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
165   -
166   -This command does two things:
167   -
168   -- runs **every** query in the file against the live backend (no skip list)
169   -- with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
170   -
171   -After this step, single-query evaluation can run in cached mode without calling the LLM again.
172   -
173   -### 2. Optional pooled build
174   -
175   -The framework also supports a heavier pooled build that combines:
176   -
177   -- top search results
178   -- top full-corpus reranker results
179   -
180   -Example:
181 64  
182   -```bash
183   -./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
184   - --tenant-id 163 \
185   - --queries-file scripts/evaluation/queries/queries.txt \
186   - --search-depth 1000 \
187   - --rerank-depth 10000 \
188   - --annotate-search-top-k 100 \
189   - --annotate-rerank-top-k 120 \
190   - --language en
191   -```
192   -
193   -This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
194   -
195   -## Why Single-Query Evaluation Was Slow
196   -
197   -If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
198   -
199   -- perform live search
200   -- detect recalled but unlabeled products
201   -- call the LLM to label them
202   -
203   -That is not the intended steady-state evaluation path.
204   -
205   -The UI/API is now configured to prefer cached evaluation:
206   -
207   -- default single-query evaluation uses `auto_annotate=false`
208   -- unlabeled recalled results are treated as `Irrelevant`
209   -- the response includes tips explaining that coverage gap
210   -
211   -If you want stable, fast evaluation:
212   -
213   -1. prebuild labels offline
214   -2. use cached single-query evaluation
215   -
216   -## Web UI
217   -
218   -Start the evaluation UI:
219   -
220   -```bash
221 65 ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
222   - --tenant-id 163 \
  66 + --tenant-id "${TENANT_ID:-163}" \
223 67 --queries-file scripts/evaluation/queries/queries.txt \
224 68 --host 127.0.0.1 \
225 69 --port 6010
226 70 ```
227 71  
228   -The UI provides:
  72 +Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM.
229 73  
230   -- query list loaded from `queries.txt`
231   -- single-query evaluation
232   -- batch evaluation
233   -- history of batch reports
234   -- top recalled results
235   -- missed `Exact` and `Partial` products that were not recalled
236   -- tips about unlabeled hits treated as `Irrelevant`
  74 +**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those docs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have a rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches.
237 75  
238   -### Single-query response behavior
  76 +## Artifacts
239 77  
240   -For a single query:
  78 +Default root: `artifacts/search_evaluation/`
241 79  
242   -1. live search returns recalled `spu_id` values
243   -2. the framework looks up cached labels by `(query, spu_id)`
244   -3. unlabeled recalled items are counted as `Irrelevant`
245   -4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`
  80 +- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  81 +- `query_builds/` — per-query pooled build outputs
  82 +- `batch_reports/` — batch JSON, Markdown, config snapshots
  83 +- `audits/` — label-quality audit summaries
  84 +- `tuning_runs/` — fusion experiment outputs and config snapshots
246 85  
247   -This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
  86 +## Labels
248 87  
249   -## CLI Commands
  88 +- **Exact** — Matches intended product type and all explicit required attributes.
  89 +- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
  90 +- **Irrelevant** — Type mismatch or conflicting required attributes.
250 91  
251   -### Build pooled annotation artifacts
  92 +**Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
252 93  
253   -```bash
254   -./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
255   -```
  94 +## Flows
256 95  
257   -### Run batch evaluation
  96 +**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
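The cached single-query path described above reduces to a lookup plus a coverage report. A minimal sketch, with a plain dict standing in for the SQLite label store (names are illustrative):

```python
def grade_recalled(spu_ids, label_cache):
    """Map recalled spu_ids to cached labels.

    Unlabeled hits are scored as Irrelevant and also returned separately so
    the UI can surface the coverage gap as a tip.
    """
    labels = []
    unlabeled = []
    for spu_id in spu_ids:
        label = label_cache.get(spu_id)
        if label is None:
            unlabeled.append(spu_id)
            label = "Irrelevant"
        labels.append(label)
    return labels, unlabeled
```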
258 97  
259   -```bash
260   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
261   - --tenant-id 163 \
262   - --queries-file scripts/evaluation/queries/queries.txt \
263   - --top-k 50 \
264   - --language en \
265   - --labeler-mode simple
266   -```
  98 +**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`.
267 99  
268   -Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
  100 +**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
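The tuning loop above is, in outline, apply / restart / evaluate / record per variant. A sketch with the backend interactions abstracted into callables (both signatures are hypothetical stand-ins for what `tune_fusion.py` actually does):

```python
def run_fusion_experiments(variants, apply_config, run_batch_eval,
                           score_metric="MAP_3"):
    """Run each config variant and rank them by the chosen metric.

    apply_config(variant): write the variant into config.yaml and restart
    the backend; run_batch_eval(): run batch evaluation over the query set
    and return a metrics dict.
    """
    results = []
    for name, variant in variants.items():
        apply_config(variant)
        metrics = run_batch_eval()
        results.append({"name": name,
                        "score": metrics.get(score_metric, 0.0),
                        "metrics": metrics})
    # Best-scoring variant first; with --apply-best it would be re-applied.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results
```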
269 101  
270   -### Audit annotation quality
  102 +### Audit
271 103  
272 104 ```bash
273 105 ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
... ... @@ -278,69 +110,20 @@ Use `--force-refresh-labels` if you want to rebuild the offline label cache for
278 110 --labeler-mode simple
279 111 ```
280 112  
281   -This checks cached labels against current guardrails and reports suspicious cases.
282   -
283   -## Batch Reports
284   -
285   -Each batch run stores:
286   -
287   -- aggregate metrics
288   -- per-query metrics
289   -- label distribution
290   -- timestamp
291   -- config snapshot from `/admin/config`
292   -
293   -Reports are written as:
294   -
295   -- Markdown for easy reading
296   -- JSON for downstream processing
297   -
298   -## Fusion Tuning
299   -
300   -The tuning runner applies experiment configs sequentially and records the outcome.
301   -
302   -Example:
303   -
304   -```bash
305   -./.venv/bin/python scripts/evaluation/tune_fusion.py \
306   - --tenant-id 163 \
307   - --queries-file scripts/evaluation/queries/queries.txt \
308   - --top-k 50 \
309   - --language en \
310   - --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
311   - --score-metric MAP_3 \
312   - --apply-best
313   -```
314   -
315   -What it does:
316   -
317   -1. writes an experiment config into `config/config.yaml`
318   -2. restarts backend
319   -3. runs batch evaluation
320   -4. stores the per-experiment result
321   -5. optionally applies the best experiment at the end
  113 +## Web UI
322 114  
323   -## Current Practical Recommendation
  115 +Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
324 116  
325   -For day-to-day evaluation:
  117 +## Batch reports
326 118  
327   -1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
328   -2. run the web UI or normal batch evaluation in cached mode
329   -3. only force-refresh labels again when:
330   - - the query set changes
331   - - the product corpus changes materially
332   - - the labeling logic changes
  119 +Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
333 120  
334 121 ## Caveats
335 122  
336   -- The current label cache is query-specific, not a full all-products all-queries matrix.
337   -- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
338   -- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
339   -- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
340   -
341   -## Related Requirement Docs
  123 +- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
  124 +- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  125 +- Backend restarts in automated tuning may need a short settle time before requests.
342 126  
343   -- `README_Requirement.md`
344   -- `README_Requirement_zh.md`
  127 +## Related docs
345 128  
346   -These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.
  129 +- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.
... ...
scripts/evaluation/README_Requirement_zh.md
... ... @@ -72,12 +72,20 @@ Irrelevant 不相关 — 品类或用途不符,主诉求未满足。
72 72  
73 73 For each query in it:
74 74 1. Recall:
75   -1) Recall 1k results via the search API.
76   -2) Walk the full corpus, take each SPU's title, call the rerank model for a full sort, and keep the top 10k. Rerank scores must be cached (a local file cache is fine: query+title -> rerank_score).
77   -2. Split the above results into batched LLM requests for labeling.
  75 +1) Recall results via the search API. The top 500 search results enter the recall pool, all marked with score 1.
  76 +2) Call the rerank model over the full corpus (tenant_id=163); skip anything already in the recall pool (already scored 1) and send everything else through the reranker API, 80 docs per request. Rerank scores must be cached (a local file cache is fine: query+title -> rerank_score).
  77 +3) If a query has more than 1000 results with reranker score above 0.5, log one line and skip the query: too many relevant results, too easily satisfied.
  78 +
  79 +
  80 +2. Fully sort the recalled content, then LLM-judge it batch by batch (50 per batch), logging the Exact ratio and Irrelevant ratio for every batch.
  81 +Stop once three consecutive batches have an Irrelevant ratio above 92%.
  82 +Run at least 15 batches and at most 40.
  83 +
78 84 3. Think through how to store the results so they are easy to compare, reuse, and display later.
79 85  
80 86  
  87 +
  88 +
81 89 3. Evaluation tool page:
82 90 Design an interactive search evaluation page on port 6010.
83 91 Page layout: a search box on top; when a search is issued, show the overall metrics for this run and the top-100 results below (with pagination).
... ...
scripts/evaluation/eval_framework/cli.py
... ... @@ -6,7 +6,18 @@ import argparse
6 6 import json
7 7 from pathlib import Path
8 8  
9   -from .constants import DEFAULT_LABELER_MODE, DEFAULT_QUERY_FILE
  9 +from .constants import (
  10 + DEFAULT_LABELER_MODE,
  11 + DEFAULT_QUERY_FILE,
  12 + DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  13 + DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  14 + DEFAULT_REBUILD_LLM_BATCH_SIZE,
  15 + DEFAULT_REBUILD_MAX_LLM_BATCHES,
  16 + DEFAULT_REBUILD_MIN_LLM_BATCHES,
  17 + DEFAULT_RERANK_HIGH_SKIP_COUNT,
  18 + DEFAULT_RERANK_HIGH_THRESHOLD,
  19 + DEFAULT_SEARCH_RECALL_TOP_K,
  20 +)
10 21 from .framework import SearchEvaluationFramework
11 22 from .utils import ensure_dir, utc_now_iso, utc_timestamp
12 23 from .web_app import create_web_app
... ... @@ -23,6 +34,39 @@ def build_cli_parser() -> argparse.ArgumentParser:
23 34 build.add_argument("--rerank-depth", type=int, default=10000)
24 35 build.add_argument("--annotate-search-top-k", type=int, default=120)
25 36 build.add_argument("--annotate-rerank-top-k", type=int, default=200)
  37 + build.add_argument(
  38 + "--search-recall-top-k",
  39 + type=int,
  40 + default=None,
  41 + help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 500).",
  42 + )
  43 + build.add_argument(
  44 + "--rerank-high-threshold",
  45 + type=float,
  46 + default=None,
  47 + help="Rebuild only: count rerank scores above this on non-pool docs (default 0.5).",
  48 + )
  49 + build.add_argument(
  50 + "--rerank-high-skip-count",
  51 + type=int,
  52 + default=None,
  53 + help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).",
  54 + )
  55 + build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).")
  56 + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).")
  57 + build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).")
  58 + build.add_argument(
  59 + "--rebuild-irrelevant-stop-ratio",
  60 + type=float,
  61 + default=None,
  62 + help="Rebuild only: irrelevant ratio above this counts toward early-stop streak (default 0.92).",
  63 + )
  64 + build.add_argument(
  65 + "--rebuild-irrelevant-stop-streak",
  66 + type=int,
  67 + default=None,
  68 + help="Rebuild only: stop after this many consecutive batches above irrelevant ratio (default 3).",
  69 + )
26 70 build.add_argument("--language", default="en")
27 71 build.add_argument("--force-refresh-rerank", action="store_true")
28 72 build.add_argument("--force-refresh-labels", action="store_true")
... ... @@ -59,6 +103,22 @@ def run_build(args: argparse.Namespace) -> None:
59 103 framework = SearchEvaluationFramework(tenant_id=args.tenant_id, labeler_mode=args.labeler_mode)
60 104 queries = framework.queries_from_file(Path(args.queries_file))
61 105 summary = []
  106 + rebuild_kwargs = {}
  107 + if args.force_refresh_labels:
  108 + rebuild_kwargs = {
  109 + "search_recall_top_k": args.search_recall_top_k if args.search_recall_top_k is not None else DEFAULT_SEARCH_RECALL_TOP_K,
  110 + "rerank_high_threshold": args.rerank_high_threshold if args.rerank_high_threshold is not None else DEFAULT_RERANK_HIGH_THRESHOLD,
  111 + "rerank_high_skip_count": args.rerank_high_skip_count if args.rerank_high_skip_count is not None else DEFAULT_RERANK_HIGH_SKIP_COUNT,
  112 + "rebuild_llm_batch_size": args.rebuild_llm_batch_size if args.rebuild_llm_batch_size is not None else DEFAULT_REBUILD_LLM_BATCH_SIZE,
  113 + "rebuild_min_batches": args.rebuild_min_batches if args.rebuild_min_batches is not None else DEFAULT_REBUILD_MIN_LLM_BATCHES,
  114 + "rebuild_max_batches": args.rebuild_max_batches if args.rebuild_max_batches is not None else DEFAULT_REBUILD_MAX_LLM_BATCHES,
  115 + "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio
  116 + if args.rebuild_irrelevant_stop_ratio is not None
  117 + else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  118 + "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak
  119 + if args.rebuild_irrelevant_stop_streak is not None
  120 + else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  121 + }
62 122 for query in queries:
63 123 result = framework.build_query_annotation_set(
64 124 query=query,
... ... @@ -69,6 +129,7 @@ def run_build(args: argparse.Namespace) -> None:
69 129 language=args.language,
70 130 force_refresh_rerank=args.force_refresh_rerank,
71 131 force_refresh_labels=args.force_refresh_labels,
  132 + **rebuild_kwargs,
72 133 )
73 134 summary.append(
74 135 {
... ...
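The repeated `value if value is not None else DEFAULT` chain in `rebuild_kwargs` above follows a common fallback pattern. A minimal standalone sketch (the helper name is illustrative, not from the repo):

```python
def with_defaults(overrides: dict, defaults: dict) -> dict:
    """Fill each default key from overrides, treating None as 'use the default'."""
    return {
        key: overrides.get(key) if overrides.get(key) is not None else default
        for key, default in defaults.items()
    }

defaults = {"search_recall_top_k": 500, "rebuild_max_batches": 40}
print(with_defaults({"search_recall_top_k": 200, "rebuild_max_batches": None}, defaults))
# → {'search_recall_top_k': 200, 'rebuild_max_batches': 40}
```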
scripts/evaluation/eval_framework/constants.py
... ... @@ -17,3 +17,13 @@ DEFAULT_QUERY_FILE = _SCRIPTS_EVAL_DIR / "queries" / "queries.txt"
17 17 JUDGE_PROMPT_VERSION_SIMPLE = "v3_simple_20260331"
18 18 JUDGE_PROMPT_VERSION_COMPLEX = "v2_structured_20260331"
19 19 DEFAULT_LABELER_MODE = "simple"
  20 +
  21 +# Rebuild annotation pool (build --force-refresh-labels): search recall + full-corpus rerank + LLM batches
  22 +DEFAULT_SEARCH_RECALL_TOP_K = 500
  23 +DEFAULT_RERANK_HIGH_THRESHOLD = 0.5
  24 +DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000
  25 +DEFAULT_REBUILD_LLM_BATCH_SIZE = 50
  26 +DEFAULT_REBUILD_MIN_LLM_BATCHES = 15
  27 +DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
  28 +DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92
  29 +DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3
... ...
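To make the new `DEFAULT_REBUILD_*` knobs concrete, here is a minimal sketch (assuming the early-stop semantics of `_annotate_rebuild_batches` in `framework.py`) of how many LLM batches get labeled for a given sequence of per-batch irrelevant ratios:

```python
def batches_to_label(irrelevant_ratios, *, min_batches=15, max_batches=40,
                     irrelevant_stop_ratio=0.92, stop_streak=3):
    """Return the number of batches labeled before the early stop fires.

    The streak only starts counting once min_batches batches are done;
    without a qualifying streak, labeling runs to max_batches (or the
    end of the data), mirroring the defaults above.
    """
    streak = 0
    for idx, ratio in enumerate(irrelevant_ratios[:max_batches], start=1):
        if idx >= min_batches:
            streak = streak + 1 if ratio > irrelevant_stop_ratio else 0
            if streak >= stop_streak:
                return idx
    return min(len(irrelevant_ratios), max_batches)

# One good batch, then 19 almost entirely irrelevant ones: the streak can
# only begin at batch 15, so labeling stops after batch 17 (streak of 3).
print(batches_to_label([0.1] + [0.95] * 19))  # → 17
```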
scripts/evaluation/eval_framework/framework.py
... ... @@ -17,6 +17,14 @@ from .clients import DashScopeLabelClient, RerankServiceClient, SearchServiceCli
17 17 from .constants import (
18 18 DEFAULT_ARTIFACT_ROOT,
19 19 DEFAULT_LABELER_MODE,
  20 + DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  21 + DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  22 + DEFAULT_REBUILD_LLM_BATCH_SIZE,
  23 + DEFAULT_REBUILD_MAX_LLM_BATCHES,
  24 + DEFAULT_REBUILD_MIN_LLM_BATCHES,
  25 + DEFAULT_RERANK_HIGH_SKIP_COUNT,
  26 + DEFAULT_RERANK_HIGH_THRESHOLD,
  27 + DEFAULT_SEARCH_RECALL_TOP_K,
20 28 JUDGE_PROMPT_VERSION_COMPLEX,
21 29 RELEVANCE_EXACT,
22 30 RELEVANCE_IRRELEVANT,
... ... @@ -345,7 +353,7 @@ class SearchEvaluationFramework:
345 353 self,
346 354 query: str,
347 355 docs: Sequence[Dict[str, Any]],
348   - batch_size: int = 24,
  356 + batch_size: int = 80,
349 357 force_refresh: bool = False,
350 358 ) -> List[Dict[str, Any]]:
351 359 cached = {} if force_refresh else self.store.get_rerank_scores(self.tenant_id, query)
... ... @@ -374,6 +382,52 @@ class SearchEvaluationFramework:
374 382 ranked.sort(key=lambda item: item["score"], reverse=True)
375 383 return ranked
376 384  
  385 + def full_corpus_rerank_outside_exclude(
  386 + self,
  387 + query: str,
  388 + docs: Sequence[Dict[str, Any]],
  389 + exclude_spu_ids: set[str],
  390 + batch_size: int = 80,
  391 + force_refresh: bool = False,
  392 + ) -> List[Dict[str, Any]]:
  393 + """Rerank all corpus docs whose spu_id is not in ``exclude_spu_ids``; excluded IDs are not scored via API."""
  394 + exclude_spu_ids = {str(x) for x in exclude_spu_ids}
  395 + cached = {} if force_refresh else self.store.get_rerank_scores(self.tenant_id, query)
  396 + pending: List[Dict[str, Any]] = [
  397 + doc
  398 + for doc in docs
  399 + if str(doc.get("spu_id") or "") not in exclude_spu_ids
  400 + and str(doc.get("spu_id") or "")
  401 + and (force_refresh or str(doc.get("spu_id")) not in cached)
  402 + ]
  403 + if pending:
  404 + new_scores: Dict[str, float] = {}
  405 + for start in range(0, len(pending), batch_size):
  406 + batch = pending[start : start + batch_size]
  407 + scores = self._rerank_batch_with_retry(query=query, docs=batch)
  408 + if len(scores) != len(batch):
  409 + raise RuntimeError(f"rerank returned {len(scores)} scores for {len(batch)} docs")
  410 + for doc, score in zip(batch, scores):
  411 + new_scores[str(doc.get("spu_id"))] = float(score)
  412 + self.store.upsert_rerank_scores(
  413 + self.tenant_id,
  414 + query,
  415 + new_scores,
  416 + model_name="qwen3_vllm_score",
  417 + )
  418 + cached.update(new_scores)
  419 +
  420 + ranked: List[Dict[str, Any]] = []
  421 + for doc in docs:
  422 + spu_id = str(doc.get("spu_id") or "")
  423 + if not spu_id or spu_id in exclude_spu_ids:
  424 + continue
  425 + ranked.append(
  426 + {"spu_id": spu_id, "score": float(cached.get(spu_id, float("-inf"))), "doc": doc}
  427 + )
  428 + ranked.sort(key=lambda item: item["score"], reverse=True)
  429 + return ranked
  430 +
377 431 def _rerank_batch_with_retry(self, query: str, docs: Sequence[Dict[str, Any]]) -> List[float]:
378 432 if not docs:
379 433 return []
... ... @@ -447,6 +501,78 @@ class SearchEvaluationFramework:
447 501 mid = len(docs) // 2
448 502 return self._classify_with_retry(query, docs[:mid], force_refresh=force_refresh) + self._classify_with_retry(query, docs[mid:], force_refresh=force_refresh)
449 503  
  504 + def _annotate_rebuild_batches(
  505 + self,
  506 + query: str,
  507 + ordered_docs: Sequence[Dict[str, Any]],
  508 + *,
  509 + batch_size: int = DEFAULT_REBUILD_LLM_BATCH_SIZE,
  510 + min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
  511 + max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
  512 + irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  513 + stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  514 + force_refresh: bool = True,
  515 + ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]:
  516 + """LLM-label ``ordered_docs`` in fixed-size batches with early stop after enough irrelevant-heavy batches."""
  517 + batch_logs: List[Dict[str, Any]] = []
  518 + streak = 0
  519 + labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query))
  520 + total_ordered = len(ordered_docs)
  521 +
  522 + for batch_idx in range(max_batches):
  523 + start = batch_idx * batch_size
  524 + batch_docs = list(ordered_docs[start : start + batch_size])
  525 + if not batch_docs:
  526 + break
  527 +
  528 + batch_pairs = self._classify_with_retry(query, batch_docs, force_refresh=force_refresh)
  529 + for sub_labels, raw_response, sub_batch in batch_pairs:
  530 + to_store = {str(doc.get("spu_id")): label for doc, label in zip(sub_batch, sub_labels)}
  531 + self.store.upsert_labels(
  532 + self.tenant_id,
  533 + query,
  534 + to_store,
  535 + judge_model=self.label_client.model,
  536 + raw_response=raw_response,
  537 + )
  538 + labels.update(to_store)
  539 + time.sleep(0.1)
  540 +
  541 + n = len(batch_docs)
  542 + exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT)
  543 + irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT)
  544 + exact_ratio = exact_n / n if n else 0.0
  545 + irrelevant_ratio = irrel_n / n if n else 0.0
  546 + log_entry = {
  547 + "batch_index": batch_idx + 1,
  548 + "size": n,
  549 + "exact_ratio": round(exact_ratio, 6),
  550 + "irrelevant_ratio": round(irrelevant_ratio, 6),
  551 + "offset_start": start,
  552 + "offset_end": min(start + n, total_ordered),
  553 + }
  554 + batch_logs.append(log_entry)
  555 + print(
  556 + f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} "
  557 + f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f}",
  558 + flush=True,
  559 + )
  560 +
  561 + if batch_idx + 1 >= min_batches:
  562 + if irrelevant_ratio > irrelevant_stop_ratio:
  563 + streak += 1
  564 + else:
  565 + streak = 0
  566 + if streak >= stop_streak:
  567 + print(
  568 + f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches "
  569 + f"({stop_streak} consecutive batches with irrelevant_ratio > {irrelevant_stop_ratio})",
  570 + flush=True,
  571 + )
  572 + break
  573 +
  574 + return labels, batch_logs
  575 +
450 576 def build_query_annotation_set(
451 577 self,
452 578 query: str,
... ... @@ -458,7 +584,32 @@ class SearchEvaluationFramework:
458 584 language: str = "en",
459 585 force_refresh_rerank: bool = False,
460 586 force_refresh_labels: bool = False,
  587 + search_recall_top_k: int = DEFAULT_SEARCH_RECALL_TOP_K,
  588 + rerank_high_threshold: float = DEFAULT_RERANK_HIGH_THRESHOLD,
  589 + rerank_high_skip_count: int = DEFAULT_RERANK_HIGH_SKIP_COUNT,
  590 + rebuild_llm_batch_size: int = DEFAULT_REBUILD_LLM_BATCH_SIZE,
  591 + rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
  592 + rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
  593 + rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  594 + rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
461 595 ) -> QueryBuildResult:
  596 + if force_refresh_labels:
  597 + return self._build_query_annotation_set_rebuild(
  598 + query=query,
  599 + search_depth=search_depth,
  600 + rerank_depth=rerank_depth,
  601 + language=language,
  602 + force_refresh_rerank=force_refresh_rerank,
  603 + search_recall_top_k=search_recall_top_k,
  604 + rerank_high_threshold=rerank_high_threshold,
  605 + rerank_high_skip_count=rerank_high_skip_count,
  606 + rebuild_llm_batch_size=rebuild_llm_batch_size,
  607 + rebuild_min_batches=rebuild_min_batches,
  608 + rebuild_max_batches=rebuild_max_batches,
  609 + rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  610 + rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak,
  611 + )
  612 +
462 613 search_payload = self.search_client.search(query=query, size=search_depth, from_=0, language=language)
463 614 search_results = list(search_payload.get("results") or [])
464 615 corpus = self.corpus_docs(refresh=False)
... ... @@ -558,6 +709,182 @@ class SearchEvaluationFramework:
558 709 output_json_path=output_json_path,
559 710 )
560 711  
  712 + def _build_query_annotation_set_rebuild(
  713 + self,
  714 + query: str,
  715 + *,
  716 + search_depth: int,
  717 + rerank_depth: int,
  718 + language: str,
  719 + force_refresh_rerank: bool,
  720 + search_recall_top_k: int,
  721 + rerank_high_threshold: float,
  722 + rerank_high_skip_count: int,
  723 + rebuild_llm_batch_size: int,
  724 + rebuild_min_batches: int,
  725 + rebuild_max_batches: int,
  726 + rebuild_irrelevant_stop_ratio: float,
  727 + rebuild_irrelevant_stop_streak: int,
  728 + ) -> QueryBuildResult:
  729 + search_size = max(int(search_depth), int(search_recall_top_k))
  730 + search_payload = self.search_client.search(query=query, size=search_size, from_=0, language=language)
  731 + search_results = list(search_payload.get("results") or [])
  732 + recall_n = min(int(search_recall_top_k), len(search_results))
  733 + pool_search_docs = search_results[:recall_n]
  734 + pool_spu_ids = {str(d.get("spu_id")) for d in pool_search_docs if str(d.get("spu_id") or "").strip()}
  735 +
  736 + corpus = self.corpus_docs(refresh=False)
  737 + corpus_by_id = {str(d.get("spu_id")): d for d in corpus if str(d.get("spu_id") or "").strip()}
  738 +
  739 + ranked_outside = self.full_corpus_rerank_outside_exclude(
  740 + query=query,
  741 + docs=corpus,
  742 + exclude_spu_ids=pool_spu_ids,
  743 + force_refresh=force_refresh_rerank,
  744 + )
  745 + rerank_high_n = sum(1 for item in ranked_outside if float(item["score"]) > float(rerank_high_threshold))
  746 +
  747 + rebuild_meta: Dict[str, Any] = {
  748 + "mode": "rebuild_v1",
  749 + "search_recall_top_k": search_recall_top_k,
  750 + "recall_pool_size": len(pool_spu_ids),
  751 + "pool_rerank_score_assigned": 1.0,
  752 + "rerank_high_threshold": rerank_high_threshold,
  753 + "rerank_high_count_outside_pool": rerank_high_n,
  754 + "rerank_high_skip_count": rerank_high_skip_count,
  755 + "rebuild_llm_batch_size": rebuild_llm_batch_size,
  756 + "rebuild_min_batches": rebuild_min_batches,
  757 + "rebuild_max_batches": rebuild_max_batches,
  758 + "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio,
  759 + "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak,
  760 + }
  761 +
  762 + batch_logs: List[Dict[str, Any]] = []
  763 + skipped = False
  764 + skip_reason: str | None = None
  765 + labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query))
  766 + llm_labeled_total = 0
  767 +
  768 + if rerank_high_n > int(rerank_high_skip_count):
  769 + skipped = True
  770 + skip_reason = "too_many_high_rerank_scores"
  771 + print(
  772 + f"[eval-rebuild] query={query!r} skip: rerank_score>{rerank_high_threshold} "
  773 + f"outside recall pool count={rerank_high_n} > {rerank_high_skip_count} "
  774 + f"(relevant tail too large / query too easy to satisfy)",
  775 + flush=True,
  776 + )
  777 + else:
  778 + ordered_docs: List[Dict[str, Any]] = []
  779 + seen_ordered: set[str] = set()
  780 + for doc in pool_search_docs:
  781 + sid = str(doc.get("spu_id") or "")
  782 + if not sid or sid in seen_ordered:
  783 + continue
  784 + seen_ordered.add(sid)
  785 + ordered_docs.append(corpus_by_id.get(sid, doc))
  786 + for item in ranked_outside:
  787 + sid = str(item["spu_id"])
  788 + if sid in seen_ordered:
  789 + continue
  790 + seen_ordered.add(sid)
  791 + ordered_docs.append(item["doc"])
  792 +
  793 + labels, batch_logs = self._annotate_rebuild_batches(
  794 + query,
  795 + ordered_docs,
  796 + batch_size=rebuild_llm_batch_size,
  797 + min_batches=rebuild_min_batches,
  798 + max_batches=rebuild_max_batches,
  799 + irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  800 + stop_streak=rebuild_irrelevant_stop_streak,
  801 + force_refresh=True,
  802 + )
  803 + llm_labeled_total = sum(int(entry.get("size") or 0) for entry in batch_logs)
  804 +
  805 + rebuild_meta["skipped"] = skipped
  806 + rebuild_meta["skip_reason"] = skip_reason
  807 + rebuild_meta["llm_batch_logs"] = batch_logs
  808 + rebuild_meta["llm_labeled_total"] = llm_labeled_total
  809 +
  810 + rerank_depth_effective = min(int(rerank_depth), len(ranked_outside))
  811 + search_labeled_results: List[Dict[str, Any]] = []
  812 + for rank, doc in enumerate(search_results, start=1):
  813 + spu_id = str(doc.get("spu_id"))
  814 + in_pool = rank <= recall_n
  815 + search_labeled_results.append(
  816 + {
  817 + "rank": rank,
  818 + "spu_id": spu_id,
  819 + "title": build_display_title(doc),
  820 + "image_url": doc.get("image_url"),
  821 + "rerank_score": 1.0 if in_pool else None,
  822 + "label": labels.get(spu_id),
  823 + "option_values": list(compact_option_values(doc.get("skus") or [])),
  824 + "product": compact_product_payload(doc),
  825 + }
  826 + )
  827 +
  828 + rerank_top_results: List[Dict[str, Any]] = []
  829 + for rank, item in enumerate(ranked_outside[:rerank_depth_effective], start=1):
  830 + doc = item["doc"]
  831 + spu_id = str(item["spu_id"])
  832 + rerank_top_results.append(
  833 + {
  834 + "rank": rank,
  835 + "spu_id": spu_id,
  836 + "title": build_display_title(doc),
  837 + "image_url": doc.get("image_url"),
  838 + "rerank_score": round(float(item["score"]), 8),
  839 + "label": labels.get(spu_id),
  840 + "option_values": list(compact_option_values(doc.get("skus") or [])),
  841 + "product": compact_product_payload(doc),
  842 + }
  843 + )
  844 +
  845 + top100_labels = [
  846 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  847 + for item in search_labeled_results[:100]
  848 + ]
  849 + metrics = compute_query_metrics(top100_labels)
  850 + output_dir = ensure_dir(self.artifact_root / "query_builds")
  851 + run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
  852 + output_json_path = output_dir / f"{run_id}.json"
  853 + pool_docs_count = len(pool_spu_ids) + len(ranked_outside)
  854 + payload = {
  855 + "run_id": run_id,
  856 + "created_at": utc_now_iso(),
  857 + "tenant_id": self.tenant_id,
  858 + "query": query,
  859 + "config_meta": requests.get("http://localhost:6002/admin/config/meta", timeout=20).json(),
  860 + "search_total": int(search_payload.get("total") or 0),
  861 + "search_depth_requested": search_depth,
  862 + "search_depth_effective": len(search_results),
  863 + "rerank_depth_requested": rerank_depth,
  864 + "rerank_depth_effective": rerank_depth_effective,
  865 + "corpus_size": len(corpus),
  866 + "annotation_pool": {
  867 + "rebuild": rebuild_meta,
  868 + "ordered_union_size": pool_docs_count,
  869 + },
  870 + "labeler_mode": self.labeler_mode,
  871 + "query_profile": self.get_query_profile(query, force_refresh=False) if self.labeler_mode == "complex" else None,
  872 + "metrics_top100": metrics,
  873 + "search_results": search_labeled_results,
  874 + "full_rerank_top": rerank_top_results,
  875 + }
  876 + output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
  877 + self.store.insert_build_run(run_id, self.tenant_id, query, output_json_path, payload["metrics_top100"])
  878 + return QueryBuildResult(
  879 + query=query,
  880 + tenant_id=self.tenant_id,
  881 + search_total=int(search_payload.get("total") or 0),
  882 + search_depth=len(search_results),
  883 + rerank_corpus_size=len(corpus),
  884 + annotated_count=llm_labeled_total if not skipped else 0,
  885 + output_json_path=output_json_path,
  886 + )
  887 +
561 888 def evaluate_live_query(
562 889 self,
563 890 query: str,
... ...
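The rebuild path's pool ordering (search-recall docs first in search order, then the remaining corpus by descending rerank score, de-duplicated on `spu_id`) can be sketched in isolation. This is an illustrative reduction, not the repo function:

```python
from typing import Any, Dict, List, Sequence

def ordered_annotation_pool(
    pool_search_docs: Sequence[Dict[str, Any]],
    ranked_outside: Sequence[Dict[str, Any]],  # assumed pre-sorted by score desc
) -> List[str]:
    ordered: List[str] = []
    seen: set = set()
    for doc in pool_search_docs:  # keep original search order
        sid = str(doc.get("spu_id") or "")
        if sid and sid not in seen:
            seen.add(sid)
            ordered.append(sid)
    for item in ranked_outside:  # then rerank order for everything else
        sid = str(item["spu_id"])
        if sid and sid not in seen:
            seen.add(sid)
            ordered.append(sid)
    return ordered

pool = [{"spu_id": "a"}, {"spu_id": "b"}, {"spu_id": "a"}]  # duplicate "a"
outside = [{"spu_id": "c"}, {"spu_id": "b"}]                # "b" already pooled
print(ordered_annotation_pool(pool, outside))  # → ['a', 'b', 'c']
```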
scripts/evaluation/eval_framework/prompts.py
... ... @@ -5,46 +5,46 @@ from __future__ import annotations
5 5 import json
6 6 from typing import Any, Dict, Sequence
7 7  
8   -_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
  8 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
9 9 Given the user query and each product's information, assign one relevance label to each product.
10 10  
11 11 ## Relevance Labels
12 12  
13 13 ### Exact
14   -The product fully satisfies the user's search intent.
  14 +The product fully satisfies the user's search intent: the core product type matches, and all explicitly stated key attributes are supported by the product information.
15 15  
16   -Use Exact when:
17   -- The product matches the core product type named in the query.
18   -- The key requirements explicitly stated in the query are satisfied.
19   -- There is no clear conflict with any explicit user requirement.
20   -
21   -Typical cases:
22   -- The query is only a product type, and the product is exactly that product type.
23   -- The query includes product type + attributes, and the product matches the type and those attributes.
  16 +Typical use cases:
  17 +- The query contains only a product type, and the product is exactly that type.
  18 +- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.
24 19  
25 20 ### Partial
26   -The product satisfies the user's primary intent, but does not fully satisfy all specified details.
  21 +The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are unmet, cannot be confirmed from the product information, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.
27 22  
28 23 Use Partial when:
29 24 - The core product type matches, but some requested attributes cannot be confirmed.
30   -- The core product type matches, but only some secondary attributes are satisfied.
31   -- The core product type matches, and there are minor or non-critical deviations from the query.
32   -- The product does not clearly contradict the user's explicit requirements, but it also cannot be considered a full match.
  25 +- The core product type matches, but some secondary requirements deviate or are inconsistent.
  26 +- The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.
33 27  
34 28 Typical cases:
35 29 - Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
36 30 - Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
37   -- Query: "cotton long sleeve blouse", product: "Long Sleeve Blouse" → material not confirmed.
38 31  
39   -Important:
40   -Partial should mainly be used when the core product type is correct, but the detailed requirements are incomplete, uncertain, or only partially matched.
  32 +Detailed example:
  33 +- Query: "cotton long sleeve shirt"
  34 +- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
  35 +
  36 +Analysis:
  37 +- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
  38 +- However, the core product type still matches: both are long sleeve shirts.
  39 +- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
  40 +- Therefore, it should be labeled Partial as a non-target but acceptable substitute.
41 41  
42 42 ### Irrelevant
43 43 The product does not satisfy the user's main shopping intent.
44 44  
45 45 Use Irrelevant when:
46 46 - The core product type does not match the query.
47   -- The product matches the general category but is a different product type that shoppers would not consider interchangeable.
  47 +- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
48 48 - The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.
49 49  
50 50 Typical cases:
... ... @@ -53,6 +53,8 @@ Typical cases:
53 53 - Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
54 54 - Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.
55 55  
  56 +This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
  57 +
56 58 ## Decision Principles
57 59  
58 60 1. Product type is the highest-priority factor.
... ... @@ -71,16 +73,13 @@ Typical cases:
71 73 If the user explicitly searched for one of these, the others should usually be judged Irrelevant.
72 74  
73 75 3. If the core product type matches, then evaluate attributes.
74   - - If attributes fully match → Exact
75   - - If attributes are missing, uncertain, or only partially matched → Partial
76   - - If attributes clearly contradict an explicit important requirement → Irrelevant
  76 + - If all explicit attributes match → Exact
  77 + - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
  78 + - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant
77 79  
78 80 4. Distinguish carefully between "not mentioned" and "contradicted".
79 81 - If an attribute is not mentioned or cannot be verified, prefer Partial.
80   - - If an attribute is explicitly opposite to the query, use Irrelevant.
81   -
82   -5. Do not overuse Exact.
83   - Exact requires strong evidence that the product satisfies the user's stated intent, not just the general category.
  82 + - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.
84 83  
85 84 Query: {query}
86 85  
... ... @@ -97,96 +96,93 @@ The lines must correspond sequentially to the products above.
97 96 Do not output any other information.
98 97 """
99 98  
100   -_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = """你是一个服装电商搜索系统的相关性评估助手。
101   -给定用户查询和每个产品的信息,为每个产品分配一个相关性标签。
  99 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
  100 +给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
102 101  
103 102 ## 相关性标签
104 103  
105 104 ### 完全相关
106   -该产品完全满足用户的搜索意图。
107   -
108   -在以下情况使用完全相关:
109   -- 产品与查询中指定的核心产品类型相匹配。
110   -- 满足了查询中明确说明的关键要求。
111   -- 与用户明确的任何要求没有明显冲突。
  105 +核心产品类型匹配,所有明确提及的关键属性均有产品信息支撑。
112 106  
113   -典型情况:
114   -- 查询仅包含产品类型,而产品恰好是该产品类型。
115   -- 查询包含产品类型 + 属性,而产品与该类型及这些属性相匹配。
  107 +典型适用场景:
  108 +- 查询仅包含产品类型,产品即为该类型。
  109 +- 查询包含“产品类型 + 属性”,产品在类型及所有明确属性上均符合。
116 110  
117 111 ### 部分相关
118   -该产品满足了用户的主要意图,但并未完全满足所有指定的细节
  112 +产品满足用户的主要意图(核心产品类型匹配),但查询中明确的部分要求未体现,或存在偏差。虽然有不一致,但仍属于“非目标但可接受”的替代品
119 113  
120 114 在以下情况使用部分相关:
121   -- 核心产品类型匹配,但部分请求的属性无法确认。
122   -- 核心产品类型匹配,但仅满足了部分次要属性。
123   -- 核心产品类型匹配,但与查询存在微小或非关键的偏差。
124   -- 产品未明显违背用户的明确要求,但也不能视为完全匹配。
  115 +- 核心产品类型匹配,但部分请求的属性在商品信息中缺失、未提及或无法确认。
  116 +- 核心产品类型匹配,但材质、版型、风格等次要要求存在偏差或不一致。
  117 +- 商品不是用户最理想的目标,但从电商购物角度看,仍可能被用户视为可接受的替代品。
125 118  
126 119 典型情况:
127   -- 查询:"红色修身T恤",产品:"女士T恤" → 颜色/版型无法确认。
128   -- 查询:"红色修身T恤",产品:"蓝色修身T恤" → 产品类型和版型匹配,但颜色不同。
129   -- 查询:"棉质长袖衬衫",产品:"长袖衬衫" → 材质未确认。
  120 +- 查询:“红色修身T恤”,产品:“女士T恤” → 颜色/版型无法确认。
  121 +- 查询:“红色修身T恤”,产品:“蓝色修身T恤” → 产品类型和版型匹配,但颜色不同。
130 122  
131   -重要提示:
132   -部分相关主要应在核心产品类型正确,但详细要求不完整、不确定或仅部分匹配时使用。
  123 +详细案例:
  124 +- 查询:“棉质长袖衬衫”
  125 +- 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
133 126  
134   -### 不相关
135   -该产品不满足用户的主要购物意图。
  127 +分析:
  128 +- 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为完全相关。
  129 +- 但核心品类仍然匹配:两者都是“长袖衬衫”。
  130 +- 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
  131 +- 因此应判为部分相关,即“非目标但可接受”的替代品。
136 132  
137   -在以下情况使用不相关:
  133 +### 不相关
  134 +产品未满足用户的主要购物意图,主要表现为以下情形之一:
138 135 - 核心产品类型与查询不匹配。
139   -- 产品匹配了大致类别,但属于购物者不会认为可互换的不同产品类型。
140   -- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的要求。
  136 +- 产品虽属大致相关的大类,但与查询指定的具体子类不可互换。
  137 +- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的属性要求。
141 138  
142 139 典型情况:
143   -- 查询:"裤子",产品:"鞋子" → 错误的产品类型。
144   -- 查询:"连衣裙",产品:"半身裙" → 不同的产品类型。
145   -- 查询:"修身裤",产品:"宽松阔腿裤" → 版型上明显矛盾。
146   -- 查询:"无袖连衣裙",产品:"长袖连衣裙" → 袖型上明显矛盾。
  140 +- 查询:“裤子”,产品:“鞋子” → 产品类型错误。
  141 +- 查询:“连衣裙”,产品:“半身裙” → 具体产品类型不同。
  142 +- 查询:“修身裤”,产品:“宽松阔腿裤” → 与版型要求明显冲突。
  143 +- 查询:“无袖连衣裙”,产品:“长袖连衣裙” → 与袖型要求明显冲突。
147 144  
148   -## 决策原则
  145 +该标签强调用户意图的明确性。当查询指向具体类型或关键属性时,即使产品在更高层级类别上相关,也应按不相关处理。
149 146  
150   -1. 产品类型是最高优先级的因素。
151   - 如果查询明确指定了具体产品类型,结果必须匹配该产品类型才能被评为完全相关或部分相关。
152   - 不同的产品类型通常是不相关,而非部分相关。
  147 +## 判断原则
153 148  
154   -2. 当查询明确时,相似或相关的产品类型不可互换。
  149 +1. 产品类型是最高优先级因素。
  150 + 如果查询明确指定了具体产品类型,那么结果必须匹配该产品类型,才可能判为“完全相关”或“部分相关”。
  151 + 不同产品类型通常应判为“不相关”,而不是“部分相关”。
  152 +
  153 +2. 相似或相关的产品类型,在查询明确时通常不可互换。
155 154 例如:
156 155 - 连衣裙 vs 半身裙 vs 连体裤
157 156 - 牛仔裤 vs 裤子
158   - - T恤 vs 衬衫
  157 + - T恤 vs 衬衫/上衣
159 158 - 开衫 vs 毛衣
160 159 - 靴子 vs 鞋子
161 160 - 文胸 vs 上衣
162 161 - 双肩包 vs 包
163   - 如果用户明确搜索了其中一种,其他的通常应判断为不相关。
164   -
165   -3. 如果核心产品类型匹配,则评估属性。
166   - - 如果属性完全匹配 → 完全相关
167   - - 如果属性缺失、不确定或仅部分匹配 → 部分相关
168   - - 如果属性明显违背明确的重点要求 → 不相关
  162 + 如果用户明确搜索其中一种,其他类型通常应判为“不相关”。
169 163  
170   -4. 仔细区分“未提及”和“矛盾”。
171   - - 如果属性未提及或无法验证,倾向于部分相关。
172   - - 如果属性与查询明确相反,使用不相关。
  164 +3. 当核心产品类型匹配后,再评估属性。
  165 + - 所有明确属性都匹配 → 完全相关
  166 + - 部分属性缺失、无法确认,或存在一定偏差,但仍是可接受替代品 → 部分相关
  167 + - 明确且重要的属性被明显违背,且不能作为合理替代品 → 不相关
173 168  
174   -5. 不要过度使用完全相关。
175   - 完全相关需要强有力的证据表明产品满足了用户声明的意图,而不仅仅是通用类别。
  169 +4. 要严格区分“未提及/无法确认”和“明确冲突”。
  170 + - 如果某属性没有提及,或无法验证,优先判为“部分相关”。
  171 + - 如果某属性与查询要求明确相反,则判为“不相关”;除非在购物语境下它仍明显属于可接受替代品。
176 172  
177   -查询: {query}
  173 +查询:{query}
178 174  
179   -产品:
  175 +商品:
180 176 {lines}
181 177  
182 178 ## 输出格式
183   -严格输出 {n} 行,每行包含以下之一:
184   -Exact
185   -Partial
186   -Irrelevant
  179 +严格输出 {n} 行,每行只能是以下三者之一:
  180 +完全相关
  181 +部分相关
  182 +不相关
187 183  
188   -这些行必须按顺序对应上面的产品。
189   -不要输出任何其他信息。
  184 +输出行必须与上方商品顺序一一对应。
  185 +不要输出任何其他内容。
190 186 """
191 187  
192 188  
... ...
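The zh template now demands exactly `{n}` lines drawn from 完全相关/部分相关/不相关, so the caller has to map them back to the canonical labels. A hypothetical parsing helper (not in this diff) could look like:

```python
# Map the zh template's output vocabulary back to canonical labels.
_ZH_TO_CANONICAL = {"完全相关": "Exact", "部分相关": "Partial", "不相关": "Irrelevant"}

def parse_zh_labels(raw: str, n: int) -> list:
    """Parse the strict n-line judge output; raise on count or vocabulary drift."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if len(lines) != n:
        raise ValueError(f"expected {n} label lines, got {len(lines)}")
    try:
        return [_ZH_TO_CANONICAL[line] for line in lines]
    except KeyError as exc:
        raise ValueError(f"unknown label: {exc.args[0]!r}") from exc

print(parse_zh_labels("完全相关\n不相关\n部分相关\n", 3))  # → ['Exact', 'Irrelevant', 'Partial']
```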
scripts/evaluation/quick_start_eval.sh
... ... @@ -11,7 +11,7 @@ QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}"
11 11 usage() {
12 12 echo "Usage: $0 batch|batch-rebuild|serve"
13 13 echo " batch — batch eval: live search every query, LLM only for missing labels (top_k=50, simple)"
14   - echo " batch-rebuild — same as batch but --force-refresh-labels (re-LLM all top_k hits; expensive, overwrites cache)"
  14 + echo " batch-rebuild — deep rebuild: build --force-refresh-labels (search recall pool + full-corpus rerank + batched LLM; expensive)"
15 15 echo " serve — eval UI (default http://0.0.0.0:\${EVAL_WEB_PORT:-6010}/; also: ./scripts/start_eval_web.sh)"
16 16 echo "Env: TENANT_ID (default 163), REPO_EVAL_QUERIES, EVAL_WEB_HOST, EVAL_WEB_PORT (default 6010)"
17 17 }
... ... @@ -26,13 +26,15 @@ case "${1:-}" in
26 26 --labeler-mode simple
27 27 ;;
28 28 batch-rebuild)
29   - exec "$PY" scripts/evaluation/build_annotation_set.py batch \
  29 + exec "$PY" scripts/evaluation/build_annotation_set.py build \
30 30 --tenant-id "$TENANT_ID" \
31 31 --queries-file "$QUERIES" \
32   - --top-k 50 \
  32 + --search-depth 500 \
  33 + --rerank-depth 10000 \
  34 + --force-refresh-rerank \
  35 + --force-refresh-labels \
33 36 --language en \
34   - --labeler-mode simple \
35   - --force-refresh-labels
  37 + --labeler-mode simple
36 38 ;;
37 39 serve)
38 40 EVAL_WEB_PORT="${EVAL_WEB_PORT:-6010}"
... ...
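The usage text above documents `TENANT_ID` and `REPO_EVAL_QUERIES` as env overrides; the script relies on standard POSIX `${VAR:-default}` expansion for them. The snippet below reproduces that defaulting on its own (values mirror the usage text, nothing else is assumed):

```shell
#!/usr/bin/env sh
# Mirrors the defaulting quick_start_eval.sh performs before dispatching.
TENANT_ID="${TENANT_ID:-163}"
QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}"
echo "tenant=$TENANT_ID queries=$QUERIES"
```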