docs/Usage-Guide.md
... ... @@ -202,13 +202,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
202 202 ./scripts/service_ctl.sh restart backend
203 203 sleep 3
204 204 ./scripts/service_ctl.sh status backend
205   -python ./scripts/eval_search_quality.py
  205 +./scripts/evaluation/quick_start_eval.sh batch
206 206 ```
207 207  
208   -Evaluation results are written to `artifacts/search_eval/` and include:
209   -
210   -- `search_eval_*.json`: for scripted follow-up analysis
211   -- `search_eval_*.md`: for manually browsing the top-20 results, scores, and matched clauses
  208 +Offline batch evaluation writes annotations and reports to `artifacts/search_evaluation/` (SQLite, plus JSON/Markdown under `batch_reports/`). See [scripts/evaluation/README.md](../scripts/evaluation/README.md) for commands and details.
212 209  
213 210 ### Method 4: Multi-environment examples (prod / uat)
214 211  
... ...
docs/issue-2026-03-31-评估框架-done-0331.md 0 → 100644
... ... @@ -0,0 +1,151 @@
  1 +
  2 +
  3 +Reference material:
  4 +
  5 +1. Search API:
  6 +
  7 +```bash
  8 +export BASE_URL="${BASE_URL:-http://localhost:6002}"
  9 +export TENANT_ID="${TENANT_ID:-163}" # change to your tenant ID
  10 +```
  11 +```bash
  12 +curl -sS "$BASE_URL/search/" \
  13 + -H "Content-Type: application/json" \
  14 + -H "X-Tenant-ID: $TENANT_ID" \
  15 + -d '{
  16 + "query": "芭比娃娃",
  17 + "size": 20,
  18 + "from": 0,
  19 + "language": "zh"
  20 + }'
  21 +```
  22 +
  23 +response:
  24 +{
  25 + "results": [
  26 + {
  27 + "spu_id": "12345",
  28 + "title": "芭比时尚娃娃",
  29 + "image_url": "https://example.com/image.jpg",
  30 + "specifications":[],
  31 + "skus":[{"sku_id":" ...
  32 +...
  33 +
  34 +2. Rerank service:
  35 +curl -X POST "http://localhost:6007/rerank" \
  36 + -H "Content-Type: application/json" \
  37 + -d '{
  38 + "query": "玩具 芭比",
  39 + "docs": ["12PCS 6 Types of Dolls with Bottles", "纯棉T恤 短袖"],
  40 + "top_n":386,
  41 + "normalize": true
  42 + }'
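The rerank endpoint above is later required to be called with 80 docs per request and with every score cached by query+title. A minimal sketch of that batching/caching wrapper; `rerank_fn` is a hypothetical stand-in for the HTTP call to `/rerank`, and the cache is any mutable mapping, not the framework's actual client:

```python
def rerank_with_cache(query, titles, rerank_fn, cache, batch_size=80):
    """Score titles for one query, skipping (query, title) pairs already cached.

    rerank_fn(query, titles) -> list of scores, e.g. a thin wrapper around
    POST /rerank (hypothetical signature); cache maps (query, title) -> score.
    """
    # Only titles without a cached score go to the reranker.
    pending = [t for t in titles if (query, t) not in cache]
    for i in range(0, len(pending), batch_size):
        chunk = pending[i:i + batch_size]
        for title, score in zip(chunk, rerank_fn(query, chunk)):
            cache[(query, title)] = score
    return [cache[(query, t)] for t in titles]
```

With 200 titles and one already cached, this issues three requests (80 + 80 + 39 docs) and returns scores in input order.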
  43 +
  44 +
  45 +3. Query by specified fields: es_debug_search.py
  46 +
  47 +
  48 +Main tasks:
  49 +1. Build the evaluation tool:
  50 +Note: judge result quality with one unified evaluation tool. Do not hand-write keyword-matching rules per query to decide whether results qualify; that approach is complex, prone to misjudgment, and does not scale to other queries.
  51 +So build a search-result evaluation tool and a multi-result comparison tool for the later annotation-set builder to call. Internally the tool can call an LLM to judge, with clear definitions of fully relevant, basically relevant, and irrelevant:
  52 +
  53 +prompt:
  54 +```text
  55 +You are an e-commerce search relevance evaluation assistant. Based on the user query and each product's information, output each product's relevance grade.
  56 +
  57 +## Relevance grade criteria
  58 +Exact (fully relevant): fully matches the user's search need.
  59 +Partial (partially relevant): the main intent is satisfied (same or similar category, broadly matching the search intent), but secondary attributes (color, style, size, etc.) deviate from the need or cannot be confirmed.
  60 +Irrelevant: category or use does not match; the main need is not satisfied.
  61 +
  62 +1. {title1} {option1_value1} {option2_value1} {option3_value1}
  63 +2. {title2} {option1_value2} {option2_value2}, {option3_value2}
  64 +...
  65 +50. {title50} {option1_value50} {option2_value50} {option3_value50}
  66 +
  67 +## Output format
  68 +Strictly output {input_nums} lines, each containing only one of Exact / Partial / Irrelevant, corresponding in order to the 50 products above. Do not output anything else.
  69 +```
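Since the judge must emit exactly one label per line, the caller needs a strict parser so a malformed response can trigger a retry. A minimal sketch (function and constant names are illustrative, not the framework's API):

```python
VALID_LABELS = {"Exact", "Partial", "Irrelevant"}

def parse_judge_output(text, expected_n):
    """Parse the judge response: one label per line, in input order.

    Raises ValueError on any mismatch so the caller can retry the batch;
    the retry policy itself is left to the caller.
    """
    labels = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(labels) != expected_n:
        raise ValueError(f"expected {expected_n} labels, got {len(labels)}")
    invalid = [lab for lab in labels if lab not in VALID_LABELS]
    if invalid:
        raise ValueError(f"invalid labels: {invalid[:3]}")
    return labels
```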
  70 +
  71 +
  72 +2. Build the test set (result annotation):
  73 +@queries/queries.txt
  74 +
  75 +For each query in it:
  76 +1. Recall:
  77 +1) Recall results via the search API. The top 500 search results enter the recall pool, all marked with score 1.
  78 +2) Call the rerank model over the full corpus (tenant_id=163); skip anything already in the recall pool (already scored 1) and send everything else through the reranker API, 80 docs per request. Rerank scores must be cached (a local file cache is fine: query+title -> rerank_score).
  79 +3) If a query has more than 1000 results with reranker score above 0.5, log one line and skip the query: too many relevant results, too easily satisfied.
  80 +
  81 +
  82 +2. Fully sort the recalled content, then LLM-judge it batch by batch (50 per batch), logging the Exact ratio and Irrelevant ratio for every batch.
  83 +Stop once three consecutive batches have an Irrelevant ratio above 92%.
  84 +Run at least 15 batches and at most 40.
  85 +
  86 +3. Think through how to store the results so they are easy to compare, reuse, and display later.
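The batched judging flow above (50 per batch, early stop on three consecutive batches above 92% irrelevant, between 15 and 40 batches) can be sketched as follows; `judge_batch` is a hypothetical wrapper around the LLM judge, not the real implementation:

```python
def label_sorted_pool(sorted_docs, judge_batch, batch_size=50,
                      min_batches=15, max_batches=40,
                      stop_ratio=0.92, stop_streak=3):
    """LLM-label a rerank-sorted pool batch by batch with early stopping."""
    labels = []
    streak = 0
    for b in range(max_batches):
        chunk = sorted_docs[b * batch_size:(b + 1) * batch_size]
        if not chunk:
            break
        batch_labels = judge_batch(chunk)
        labels.extend(batch_labels)
        exact = batch_labels.count("Exact") / len(batch_labels)
        irrelevant = batch_labels.count("Irrelevant") / len(batch_labels)
        print(f"batch {b + 1}: exact={exact:.2f} irrelevant={irrelevant:.2f}")
        # Count consecutive mostly-irrelevant batches toward the stop streak.
        streak = streak + 1 if irrelevant > stop_ratio else 0
        if b + 1 >= min_batches and streak >= stop_streak:
            break  # tail is clearly irrelevant; stop labeling
    return labels
```

Note the minimum-batch floor dominates: even if the streak condition is met early, labeling continues through batch 15.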
  87 +
  88 +
  89 +
  90 +
  91 +3. Evaluation tool page:
  92 +Design an interactive search evaluation page on port 6010.
  93 +Page layout: a search box on top; when a search is issued, show the overall metrics for this run and the top-100 results below (with pagination).
  94 +
  95 +Overall metrics:
  96 +| Metric | Meaning |
  97 +|------|------|
  98 +| **P@5, P@10, P@20, P@50** | Precision over the top K results, counting only grade 3 as relevant |
  99 +| **P@5_2_3 ~ P@50_2_3** | Precision over the top K, counting grades 2 and 3 as relevant |
  100 +| **MAP_3** | Average Precision (single query) with only grade 3 relevant |
  101 +| **MAP_2_3** | Average Precision with grades 2 and 3 relevant |
  102 +
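Under the grade convention in the table above (3 = Exact, 2 = Partial, 1 = Irrelevant), the metrics reduce to a few lines. One assumption to flag: AP here is normalized by the number of relevant results retrieved, a common simplification when the total relevant count in the corpus is unknown:

```python
def precision_at_k(grades, k, relevant=frozenset({3})):
    """P@K: fraction of the top-K results whose grade counts as relevant."""
    return sum(g in relevant for g in grades[:k]) / k

def average_precision(grades, relevant=frozenset({3})):
    """AP over the ranked list, normalized by relevant hits retrieved."""
    hits = 0
    total = 0.0
    for rank, g in enumerate(grades, start=1):
        if g in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant rank
    return total / hits if hits else 0.0
```

`MAP_3` is then the mean of `average_precision(..., relevant={3})` across queries, and the `_2_3` variants pass `relevant={2, 3}`.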
  103 +Result list:
  104 +One result per row: on the left, each result's annotation value (three grades; results may also be color-coded) and its image; next to the image, three stacked lines showing title.en + title.en + the first SKU's option1/2/3_value.
  105 +
  106 +
  107 +Leftmost panel of the evaluation page:
  108 +Queries default to queries/queries.txt, loaded into the left-hand list; clicking any entry issues a search.
  109 +
  110 +4. Batch evaluation tool
  111 +
  112 +Provide a batch execution script.
  113 +
  114 +Add a batch-evaluation page. Clicking the batch-evaluation button searches every query in turn, aggregates the overall metrics, and generates a report whose name carries a timestamp and key information. Also record the main search program's config.yaml at the time.
  115 +Design carefully how to switch between the two modes so that one port serves both interactions.
  116 +Batch evaluation focuses on the aggregate metrics across all queries.
  117 +Record the test time, the config file in effect, and the corresponding results. Keep the history of evaluation runs so that each result can be traced back to its config file and metrics.
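The batch-report requirement above (timestamped report name plus a snapshot of the config in effect) can be sketched as below; the paths and field names are illustrative, not the framework's actual schema:

```python
import json
import time
from pathlib import Path

def save_batch_report(metrics, config_text,
                      out_dir="artifacts/search_evaluation/batch_reports"):
    """Write a timestamped metrics JSON next to the config.yaml snapshot
    that produced it, so every historical run stays traceable."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    report = {"timestamp": stamp, "metrics": metrics}
    (root / f"batch_{stamp}.json").write_text(
        json.dumps(report, ensure_ascii=False, indent=2))
    (root / f"config_{stamp}.yaml").write_text(config_text)
    return stamp
```

Storing the config beside the metrics (rather than a pointer into a mutable file) is what makes later run-to-run comparison reliable.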
  118 +
  119 +The above is my overall design, but it has gaps. Understand the requirements at a higher level; you are free to adjust the design, drawing on best practices for automated search evaluation frameworks to produce a better design and implementation.
  120 +
  121 +
  122 +
  123 +
  124 +
  125 +
  126 +1. Carefully verify the quality of this annotation set. If it falls short, improve the tooling and iterate until the labels are good enough to serve as an automated evaluator of retrieval quality and to guide tuning.
  127 +2. Once the annotation set is good enough, the batch evaluation tool is solid, and your own trial runs show it can tell good search quality from bad, start the actual retrieval tuning: using the 50-query annotation set and the batch evaluation tool, tune the fusion formula. Design the experiments carefully first; for each parameter set, edit config.yaml, restart (./restart.sh backend), run batch evaluation, and collect the results.
  128 +During evaluation, if the tools are awkward or the logs are incomplete and fixing them would improve efficiency, do that first.
  129 +You are the overall code owner, with full authority to do whatever retrieval tuning requires. If you spot other changes with bigger potential gains, experiment with them too; you may even modify the fusion and rerank funnel code in pursuit of better metrics.
  130 +But note: due to performance and latency constraints, do not increase the reranker's input count and do not enable the fine-ranking stage; latency cannot absorb two rounds of reranker calls.
  131 +
  132 +
  133 +
  134 +
  135 +
  136 +
  137 +
  138 +
  139 +
  140 +@scripts/evaluation/README.md @scripts/evaluation/eval_framework/framework.py
  141 +@quick_start_eval.sh (29-35)
  142 +Refactor to follow this flow:
  143 +When rebuilding, for each query:
  144 +each query's results should come from a full-corpus scan,
  145 +1. The top 500 search results enter the recall pool, all marked with score 1.
  146 +2. Call the rerank model over the full corpus (tenant_id=163); skip anything already in the recall pool (already scored 1) and send everything else through the reranker.
  147 +3. If more than 1000 docs have reranker score above 0.5, log one line and skip the query: too many relevant results, too easily satisfied.
  148 +
  149 +Fully sort the recalled content, then LLM-judge it batch by batch (50 per batch), logging the Exact ratio and Irrelevant ratio for every batch.
  150 +Stop once three consecutive batches have an Irrelevant ratio above 92%.
  151 +Run at least 15 batches and at most 40.
... ...
docs/相关性检索优化说明.md
... ... @@ -240,15 +240,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
240 240 ./scripts/service_ctl.sh restart backend
241 241 sleep 3
242 242 ./scripts/service_ctl.sh status backend
243   -python ./scripts/eval_search_quality.py
  243 +./scripts/evaluation/quick_start_eval.sh batch
244 244 ```
245 245  
246   -The evaluation script generates:
247   -
248   -- `artifacts/search_eval/search_eval_*.json`
249   -- `artifacts/search_eval/search_eval_*.md`
250   -
251   -Query-level and result-level debug fields can be extracted directly from the JSON for analysis.
  246 +Evaluation artifacts live in `artifacts/search_evaluation/` (e.g. `search_eval.sqlite3`, JSON/Markdown under `batch_reports/`). See [scripts/evaluation/README.md](../scripts/evaluation/README.md) for the flow and parameters.
252 247  
253 248 ## 11. Suggested test checklist
254 249  
... ...
scripts/evaluation/README.md
1 1 # Search Evaluation Framework
2 2  
3   -This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
  3 +This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
4 4  
5   -It is designed around one core rule:
  5 +**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
6 6  
7   -- Annotation should be built offline first.
8   -- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
9   -- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
  7 +## What it does
10 8  
11   -## Goals
  9 +1. Build an annotation set for a fixed query set.
  10 +2. Evaluate live search results against cached labels.
  11 +3. Run batch evaluation and keep historical reports with config snapshots.
  12 +4. Tune fusion parameters in a reproducible loop.
12 13  
13   -The framework supports four related tasks:
  14 +## Layout
14 15  
15   -1. Build an annotation set for a fixed query set.
16   -2. Evaluate a live search result list against that annotation set.
17   -3. Run batch evaluation and store historical reports with config snapshots.
18   -4. Tune fusion parameters reproducibly.
19   -
20   -## Files
21   -
22   -- `eval_framework/` (Python package)
23   - Modular layout: `framework.py` (orchestration), `store.py` (SQLite), `clients.py` (search/rerank/LLM), `prompts.py` (judge templates), `metrics.py`, `reports.py`, `web_app.py`, `cli.py`, and `static/` (evaluation UI HTML/CSS/JS).
24   -- `build_annotation_set.py`
25   - Thin CLI entrypoint into `eval_framework`.
26   -- `serve_eval_web.py`
27   - Thin web entrypoint into `eval_framework`.
28   -- `tune_fusion.py`
29   - Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
30   -- `fusion_experiments_shortlist.json`
31   - A compact experiment set for practical tuning.
32   -- `fusion_experiments_round1.json`
33   - A broader first-round experiment set.
34   -- `queries/queries.txt`
35   - The canonical evaluation query set.
36   -- `README_Requirement.md`
37   - Requirement reference document.
38   -- `quick_start_eval.sh`
39   - Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
40   -- `../start_eval_web.sh`
41   - Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of core services.
42   -
43   -## Quick start (from repo root)
44   -
45   -Set tenant if needed (`export TENANT_ID=163`). Requires live search API, DashScope key when the batch step needs new LLM labels, and a working backend.
  16 +| Path | Role |
  17 +|------|------|
  18 +| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
  19 +| `build_annotation_set.py` | CLI entry (build / batch / audit) |
  20 +| `serve_eval_web.py` | Web server for the evaluation UI |
  21 +| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
  22 +| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
  23 +| `fusion_experiments_round1.json` | Broader first-round experiments |
  24 +| `queries/queries.txt` | Canonical evaluation queries |
  25 +| `README_Requirement.md` | Product/requirements reference |
  26 +| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
  27 +| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
  28 +
  29 +## Quick start (repo root)
  30 +
  31 +Set tenant if needed (`export TENANT_ID=163`). You need a live search API, a DashScope API key when new LLM labels are required, and a running backend.
46 32  
47 33 ```bash
48   -# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
  34 +# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
49 35 ./scripts/evaluation/quick_start_eval.sh batch
50 36  
51   -# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
  37 +# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive)
52 38 ./scripts/evaluation/quick_start_eval.sh batch-rebuild
53 39  
54   -# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
  40 +# UI: http://127.0.0.1:6010/
55 41 ./scripts/evaluation/quick_start_eval.sh serve
56   -# Or: ./scripts/service_ctl.sh start eval-web
  42 +# or: ./scripts/service_ctl.sh start eval-web
57 43 ```
58 44  
59   -Equivalent explicit commands:
  45 +Explicit equivalents:
60 46  
61 47 ```bash
62   -# Safe default: no --force-refresh-labels
63 48 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
64 49 --tenant-id "${TENANT_ID:-163}" \
65 50 --queries-file scripts/evaluation/queries/queries.txt \
... ... @@ -67,207 +52,54 @@ Equivalent explicit commands:
67 52 --language en \
68 53 --labeler-mode simple
69 54  
70   -# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
71   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
72   - --tenant-id "${TENANT_ID:-163}" \
73   - --queries-file scripts/evaluation/queries/queries.txt \
74   - --top-k 50 \
75   - --language en \
76   - --labeler-mode simple \
77   - --force-refresh-labels
78   -
79   -./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  55 +./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
80 56 --tenant-id "${TENANT_ID:-163}" \
81 57 --queries-file scripts/evaluation/queries/queries.txt \
82   - --host 127.0.0.1 \
83   - --port 6010
84   -```
85   -
86   -**Batch behavior:** There is no “skip queries already processed”. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
87   -
88   -## Storage Layout
89   -
90   -All generated artifacts are under:
91   -
92   -- `/data/saas-search/artifacts/search_evaluation`
93   -
94   -Important subpaths:
95   -
96   -- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
97   - Main cache and annotation store.
98   -- `/data/saas-search/artifacts/search_evaluation/query_builds`
99   - Per-query pooled annotation-set build artifacts.
100   -- `/data/saas-search/artifacts/search_evaluation/batch_reports`
101   - Batch evaluation JSON, Markdown reports, and config snapshots.
102   -- `/data/saas-search/artifacts/search_evaluation/audits`
103   - Audit summaries for label quality checks.
104   -- `/data/saas-search/artifacts/search_evaluation/tuning_runs`
105   - Fusion experiment summaries and per-experiment config snapshots.
106   -
107   -## SQLite Schema Summary
108   -
109   -The main tables in `search_eval.sqlite3` are:
110   -
111   -- `corpus_docs`
112   - Cached product corpus for the tenant.
113   -- `rerank_scores`
114   - Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
115   -- `relevance_labels`
116   - Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
117   -- `query_profiles`
118   - Structured query-intent profiles extracted before labeling.
119   -- `build_runs`
120   - Per-query pooled-build records.
121   -- `batch_runs`
122   - Batch evaluation history.
123   -
124   -## Label Semantics
125   -
126   -Three labels are used throughout:
127   -
128   -- `Exact`
129   - Fully matches the intended product type and all explicit required attributes.
130   -- `Partial`
131   - Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
132   -- `Irrelevant`
133   - Product type mismatches, or explicit required attributes conflict.
134   -
135   -The framework always uses:
136   -
137   -- LLM-based batched relevance classification
138   -- caching and retry logic for robust offline labeling
139   -
140   -There are now two labeler modes:
141   -
142   -- `simple`
143   - Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
144   -- `complex`
145   - Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
146   -
147   -## Offline-First Workflow
148   -
149   -### 1. Refresh labels for the evaluation query set
150   -
151   -For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
152   -
153   -Example (fills missing labels only; recommended default):
154   -
155   -```bash
156   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
157   - --tenant-id 163 \
158   - --queries-file scripts/evaluation/queries/queries.txt \
159   - --top-k 50 \
  58 + --search-depth 500 \
  59 + --rerank-depth 10000 \
  60 + --force-refresh-rerank \
  61 + --force-refresh-labels \
160 62 --language en \
161 63 --labeler-mode simple
162   -```
163   -
164   -To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
165   -
166   -This command does two things:
167   -
168   -- runs **every** query in the file against the live backend (no skip list)
169   -- with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
170   -
171   -After this step, single-query evaluation can run in cached mode without calling the LLM again.
172   -
173   -### 2. Optional pooled build
174   -
175   -The framework also supports a heavier pooled build that combines:
176   -
177   -- top search results
178   -- top full-corpus reranker results
179   -
180   -Example:
181 64  
182   -```bash
183   -./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
184   - --tenant-id 163 \
185   - --queries-file scripts/evaluation/queries/queries.txt \
186   - --search-depth 1000 \
187   - --rerank-depth 10000 \
188   - --annotate-search-top-k 100 \
189   - --annotate-rerank-top-k 120 \
190   - --language en
191   -```
192   -
193   -This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
194   -
195   -## Why Single-Query Evaluation Was Slow
196   -
197   -If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
198   -
199   -- perform live search
200   -- detect recalled but unlabeled products
201   -- call the LLM to label them
202   -
203   -That is not the intended steady-state evaluation path.
204   -
205   -The UI/API is now configured to prefer cached evaluation:
206   -
207   -- default single-query evaluation uses `auto_annotate=false`
208   -- unlabeled recalled results are treated as `Irrelevant`
209   -- the response includes tips explaining that coverage gap
210   -
211   -If you want stable, fast evaluation:
212   -
213   -1. prebuild labels offline
214   -2. use cached single-query evaluation
215   -
216   -## Web UI
217   -
218   -Start the evaluation UI:
219   -
220   -```bash
221 65 ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
222   - --tenant-id 163 \
  66 + --tenant-id "${TENANT_ID:-163}" \
223 67 --queries-file scripts/evaluation/queries/queries.txt \
224 68 --host 127.0.0.1 \
225 69 --port 6010
226 70 ```
227 71  
228   -The UI provides:
  72 +Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM.
229 73  
230   -- query list loaded from `queries.txt`
231   -- single-query evaluation
232   -- batch evaluation
233   -- history of batch reports
234   -- top recalled results
235   -- missed `Exact` and `Partial` products that were not recalled
236   -- tips about unlabeled hits treated as `Irrelevant`
  74 +**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those docs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have a rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches.
237 75  
238   -### Single-query response behavior
  76 +## Artifacts
239 77  
240   -For a single query:
  78 +Default root: `artifacts/search_evaluation/`
241 79  
242   -1. live search returns recalled `spu_id` values
243   -2. the framework looks up cached labels by `(query, spu_id)`
244   -3. unlabeled recalled items are counted as `Irrelevant`
245   -4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`
  80 +- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  81 +- `query_builds/` — per-query pooled build outputs
  82 +- `batch_reports/` — batch JSON, Markdown, config snapshots
  83 +- `audits/` — label-quality audit summaries
  84 +- `tuning_runs/` — fusion experiment outputs and config snapshots
246 85  
247   -This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
  86 +## Labels
248 87  
249   -## CLI Commands
  88 +- **Exact** — Matches intended product type and all explicit required attributes.
  89 +- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
  90 +- **Irrelevant** — Type mismatch or conflicting required attributes.
250 91  
251   -### Build pooled annotation artifacts
  92 +**Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
252 93  
253   -```bash
254   -./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
255   -```
  94 +## Flows
256 95  
257   -### Run batch evaluation
  96 +**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
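The cached single-query path described above reduces to a lookup plus a coverage report. A minimal sketch, with a plain dict standing in for the SQLite label store (names are illustrative):

```python
def grade_recalled(spu_ids, label_cache):
    """Map recalled spu_ids to cached labels.

    Unlabeled hits are scored as Irrelevant and also returned separately so
    the UI can surface the coverage gap as a tip.
    """
    labels = []
    unlabeled = []
    for spu_id in spu_ids:
        label = label_cache.get(spu_id)
        if label is None:
            unlabeled.append(spu_id)
            label = "Irrelevant"
        labels.append(label)
    return labels, unlabeled
```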
258 97  
259   -```bash
260   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
261   - --tenant-id 163 \
262   - --queries-file scripts/evaluation/queries/queries.txt \
263   - --top-k 50 \
264   - --language en \
265   - --labeler-mode simple
266   -```
  98 +**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`.
267 99  
268   -Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
  100 +**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
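The tuning loop above is, in outline, apply / restart / evaluate / record per variant. A sketch with the backend interactions abstracted into callables (both signatures are hypothetical stand-ins for what `tune_fusion.py` actually does):

```python
def run_fusion_experiments(variants, apply_config, run_batch_eval,
                           score_metric="MAP_3"):
    """Run each config variant and rank them by the chosen metric.

    apply_config(variant): write the variant into config.yaml and restart
    the backend; run_batch_eval(): run batch evaluation over the query set
    and return a metrics dict.
    """
    results = []
    for name, variant in variants.items():
        apply_config(variant)
        metrics = run_batch_eval()
        results.append({"name": name,
                        "score": metrics.get(score_metric, 0.0),
                        "metrics": metrics})
    # Best-scoring variant first; with --apply-best it would be re-applied.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results
```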
269 101  
270   -### Audit annotation quality
  102 +### Audit
271 103  
272 104 ```bash
273 105 ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
... ... @@ -278,69 +110,20 @@ Use `--force-refresh-labels` if you want to rebuild the offline label cache for
278 110 --labeler-mode simple
279 111 ```
280 112  
281   -This checks cached labels against current guardrails and reports suspicious cases.
282   -
283   -## Batch Reports
284   -
285   -Each batch run stores:
286   -
287   -- aggregate metrics
288   -- per-query metrics
289   -- label distribution
290   -- timestamp
291   -- config snapshot from `/admin/config`
292   -
293   -Reports are written as:
294   -
295   -- Markdown for easy reading
296   -- JSON for downstream processing
297   -
298   -## Fusion Tuning
299   -
300   -The tuning runner applies experiment configs sequentially and records the outcome.
301   -
302   -Example:
303   -
304   -```bash
305   -./.venv/bin/python scripts/evaluation/tune_fusion.py \
306   - --tenant-id 163 \
307   - --queries-file scripts/evaluation/queries/queries.txt \
308   - --top-k 50 \
309   - --language en \
310   - --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
311   - --score-metric MAP_3 \
312   - --apply-best
313   -```
314   -
315   -What it does:
316   -
317   -1. writes an experiment config into `config/config.yaml`
318   -2. restarts backend
319   -3. runs batch evaluation
320   -4. stores the per-experiment result
321   -5. optionally applies the best experiment at the end
  113 +## Web UI
322 114  
323   -## Current Practical Recommendation
  115 +Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
324 116  
325   -For day-to-day evaluation:
  117 +## Batch reports
326 118  
327   -1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
328   -2. run the web UI or normal batch evaluation in cached mode
329   -3. only force-refresh labels again when:
330   - - the query set changes
331   - - the product corpus changes materially
332   - - the labeling logic changes
  119 +Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
333 120  
334 121 ## Caveats
335 122  
336   -- The current label cache is query-specific, not a full all-products all-queries matrix.
337   -- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
338   -- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
339   -- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
340   -
341   -## Related Requirement Docs
  123 +- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
  124 +- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  125 +- Backend restarts in automated tuning may need a short settle time before requests.
342 126  
343   -- `README_Requirement.md`
344   -- `README_Requirement_zh.md`
  127 +## Related docs
345 128  
346   -These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.
  129 +- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.
... ...
scripts/evaluation/README_Requirement_zh.md
... ... @@ -72,12 +72,20 @@ Irrelevant 不相关 — 品类或用途不符,主诉求未满足。
72 72  
73 73 For each query in it:
74 74 1. Recall:
75   -1) Recall 1k results via the search API.
76   -2) Walk the full corpus, take each SPU's title, call the rerank model for a full sort, and keep the top 10k. Rerank scores must be cached (a local file cache is fine: query+title -> rerank_score).
77   -2. Split the above results into batched LLM requests for labeling.
  75 +1) Recall results via the search API. The top 500 search results enter the recall pool, all marked with score 1.
  76 +2) Call the rerank model over the full corpus (tenant_id=163); skip anything already in the recall pool (already scored 1) and send everything else through the reranker API, 80 docs per request. Rerank scores must be cached (a local file cache is fine: query+title -> rerank_score).
  77 +3) If a query has more than 1000 results with reranker score above 0.5, log one line and skip the query: too many relevant results, too easily satisfied.
  78 +
  79 +
  80 +2. Fully sort the recalled content, then LLM-judge it batch by batch (50 per batch), logging the Exact ratio and Irrelevant ratio for every batch.
  81 +Stop once three consecutive batches have an Irrelevant ratio above 92%.
  82 +Run at least 15 batches and at most 40.
  83 +
78 84 3. Think through how to store the results so they are easy to compare, reuse, and display later.
79 85  
80 86  
  87 +
  88 +
81 89 3. Evaluation tool page:
82 90 Design an interactive search evaluation page on port 6010.
83 91 Page layout: a search box on top; when a search is issued, show the overall metrics for this run and the top-100 results below (with pagination).
... ...
scripts/evaluation/eval_framework/cli.py
... ... @@ -6,7 +6,18 @@ import argparse
6 6 import json
7 7 from pathlib import Path
8 8  
9   -from .constants import DEFAULT_LABELER_MODE, DEFAULT_QUERY_FILE
  9 +from .constants import (
  10 + DEFAULT_LABELER_MODE,
  11 + DEFAULT_QUERY_FILE,
  12 + DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  13 + DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  14 + DEFAULT_REBUILD_LLM_BATCH_SIZE,
  15 + DEFAULT_REBUILD_MAX_LLM_BATCHES,
  16 + DEFAULT_REBUILD_MIN_LLM_BATCHES,
  17 + DEFAULT_RERANK_HIGH_SKIP_COUNT,
  18 + DEFAULT_RERANK_HIGH_THRESHOLD,
  19 + DEFAULT_SEARCH_RECALL_TOP_K,
  20 +)
10 21 from .framework import SearchEvaluationFramework
11 22 from .utils import ensure_dir, utc_now_iso, utc_timestamp
12 23 from .web_app import create_web_app
... ... @@ -23,6 +34,39 @@ def build_cli_parser() -> argparse.ArgumentParser:
23 34 build.add_argument("--rerank-depth", type=int, default=10000)
24 35 build.add_argument("--annotate-search-top-k", type=int, default=120)
25 36 build.add_argument("--annotate-rerank-top-k", type=int, default=200)
  37 + build.add_argument(
  38 + "--search-recall-top-k",
  39 + type=int,
  40 + default=None,
  41 + help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 500).",
  42 + )
  43 + build.add_argument(
  44 + "--rerank-high-threshold",
  45 + type=float,
  46 + default=None,
  47 + help="Rebuild only: count rerank scores above this on non-pool docs (default 0.5).",
  48 + )
  49 + build.add_argument(
  50 + "--rerank-high-skip-count",
  51 + type=int,
  52 + default=None,
  53 + help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).",
  54 + )
  55 + build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).")
  56 + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).")
  57 + build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).")
  58 + build.add_argument(
  59 + "--rebuild-irrelevant-stop-ratio",
  60 + type=float,
  61 + default=None,
  62 + help="Rebuild only: irrelevant ratio above this counts toward early-stop streak (default 0.92).",
  63 + )
  64 + build.add_argument(
  65 + "--rebuild-irrelevant-stop-streak",
  66 + type=int,
  67 + default=None,
  68 + help="Rebuild only: stop after this many consecutive batches above irrelevant ratio (default 3).",
  69 + )
26 70 build.add_argument("--language", default="en")
27 71 build.add_argument("--force-refresh-rerank", action="store_true")
28 72 build.add_argument("--force-refresh-labels", action="store_true")
... ... @@ -59,6 +103,22 @@ def run_build(args: argparse.Namespace) -> None:
59 103 framework = SearchEvaluationFramework(tenant_id=args.tenant_id, labeler_mode=args.labeler_mode)
60 104 queries = framework.queries_from_file(Path(args.queries_file))
61 105 summary = []
  106 + rebuild_kwargs = {}
  107 + if args.force_refresh_labels:
  108 + rebuild_kwargs = {
  109 + "search_recall_top_k": args.search_recall_top_k if args.search_recall_top_k is not None else DEFAULT_SEARCH_RECALL_TOP_K,
  110 + "rerank_high_threshold": args.rerank_high_threshold if args.rerank_high_threshold is not None else DEFAULT_RERANK_HIGH_THRESHOLD,
  111 + "rerank_high_skip_count": args.rerank_high_skip_count if args.rerank_high_skip_count is not None else DEFAULT_RERANK_HIGH_SKIP_COUNT,
  112 + "rebuild_llm_batch_size": args.rebuild_llm_batch_size if args.rebuild_llm_batch_size is not None else DEFAULT_REBUILD_LLM_BATCH_SIZE,
  113 + "rebuild_min_batches": args.rebuild_min_batches if args.rebuild_min_batches is not None else DEFAULT_REBUILD_MIN_LLM_BATCHES,
  114 + "rebuild_max_batches": args.rebuild_max_batches if args.rebuild_max_batches is not None else DEFAULT_REBUILD_MAX_LLM_BATCHES,
  115 + "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio
  116 + if args.rebuild_irrelevant_stop_ratio is not None
  117 + else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  118 + "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak
  119 + if args.rebuild_irrelevant_stop_streak is not None
  120 + else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  121 + }
62 122 for query in queries:
63 123 result = framework.build_query_annotation_set(
64 124 query=query,
... ... @@ -69,6 +129,7 @@ def run_build(args: argparse.Namespace) -> None:
69 129 language=args.language,
70 130 force_refresh_rerank=args.force_refresh_rerank,
71 131 force_refresh_labels=args.force_refresh_labels,
  132 + **rebuild_kwargs,
72 133 )
73 134 summary.append(
74 135 {
... ...
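The repeated `value if value is not None else DEFAULT` chain in `rebuild_kwargs` above follows a common fallback pattern. A minimal standalone sketch (the helper name is illustrative, not from the repo):

```python
def with_defaults(overrides: dict, defaults: dict) -> dict:
    """Fill each default key from overrides, treating None as 'use the default'."""
    return {
        key: overrides.get(key) if overrides.get(key) is not None else default
        for key, default in defaults.items()
    }

defaults = {"search_recall_top_k": 500, "rebuild_max_batches": 40}
print(with_defaults({"search_recall_top_k": 200, "rebuild_max_batches": None}, defaults))
# → {'search_recall_top_k': 200, 'rebuild_max_batches': 40}
```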
scripts/evaluation/eval_framework/constants.py
... ... @@ -17,3 +17,13 @@ DEFAULT_QUERY_FILE = _SCRIPTS_EVAL_DIR / "queries" / "queries.txt"
17 17 JUDGE_PROMPT_VERSION_SIMPLE = "v3_simple_20260331"
18 18 JUDGE_PROMPT_VERSION_COMPLEX = "v2_structured_20260331"
19 19 DEFAULT_LABELER_MODE = "simple"
  20 +
  21 +# Rebuild annotation pool (build --force-refresh-labels): search recall + full-corpus rerank + LLM batches
  22 +DEFAULT_SEARCH_RECALL_TOP_K = 500
  23 +DEFAULT_RERANK_HIGH_THRESHOLD = 0.5
  24 +DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000
  25 +DEFAULT_REBUILD_LLM_BATCH_SIZE = 50
  26 +DEFAULT_REBUILD_MIN_LLM_BATCHES = 15
  27 +DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
  28 +DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92
  29 +DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3
... ...
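To make the new `DEFAULT_REBUILD_*` knobs concrete, here is a minimal sketch (assuming the early-stop semantics of `_annotate_rebuild_batches` in `framework.py`) of how many LLM batches get labeled for a given sequence of per-batch irrelevant ratios:

```python
def batches_to_label(irrelevant_ratios, *, min_batches=15, max_batches=40,
                     irrelevant_stop_ratio=0.92, stop_streak=3):
    """Return the number of batches labeled before the early stop fires.

    The streak only starts counting once min_batches batches are done;
    without a qualifying streak, labeling runs to max_batches (or the
    end of the data), mirroring the defaults above.
    """
    streak = 0
    for idx, ratio in enumerate(irrelevant_ratios[:max_batches], start=1):
        if idx >= min_batches:
            streak = streak + 1 if ratio > irrelevant_stop_ratio else 0
            if streak >= stop_streak:
                return idx
    return min(len(irrelevant_ratios), max_batches)

# One good batch, then 19 almost entirely irrelevant ones: the streak can
# only begin at batch 15, so labeling stops after batch 17 (streak of 3).
print(batches_to_label([0.1] + [0.95] * 19))  # → 17
```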
scripts/evaluation/eval_framework/framework.py
... ... @@ -17,6 +17,14 @@ from .clients import DashScopeLabelClient, RerankServiceClient, SearchServiceCli
17 17 from .constants import (
18 18 DEFAULT_ARTIFACT_ROOT,
19 19 DEFAULT_LABELER_MODE,
  20 + DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  21 + DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  22 + DEFAULT_REBUILD_LLM_BATCH_SIZE,
  23 + DEFAULT_REBUILD_MAX_LLM_BATCHES,
  24 + DEFAULT_REBUILD_MIN_LLM_BATCHES,
  25 + DEFAULT_RERANK_HIGH_SKIP_COUNT,
  26 + DEFAULT_RERANK_HIGH_THRESHOLD,
  27 + DEFAULT_SEARCH_RECALL_TOP_K,
20 28 JUDGE_PROMPT_VERSION_COMPLEX,
21 29 RELEVANCE_EXACT,
22 30 RELEVANCE_IRRELEVANT,
... ... @@ -345,7 +353,7 @@ class SearchEvaluationFramework:
345 353 self,
346 354 query: str,
347 355 docs: Sequence[Dict[str, Any]],
348   - batch_size: int = 24,
  356 + batch_size: int = 80,
349 357 force_refresh: bool = False,
350 358 ) -> List[Dict[str, Any]]:
351 359 cached = {} if force_refresh else self.store.get_rerank_scores(self.tenant_id, query)
... ... @@ -374,6 +382,52 @@ class SearchEvaluationFramework:
374 382 ranked.sort(key=lambda item: item["score"], reverse=True)
375 383 return ranked
376 384  
  385 + def full_corpus_rerank_outside_exclude(
  386 + self,
  387 + query: str,
  388 + docs: Sequence[Dict[str, Any]],
  389 + exclude_spu_ids: set[str],
  390 + batch_size: int = 80,
  391 + force_refresh: bool = False,
  392 + ) -> List[Dict[str, Any]]:
  393 + """Rerank all corpus docs whose spu_id is not in ``exclude_spu_ids``; excluded IDs are not scored via API."""
  394 + exclude_spu_ids = {str(x) for x in exclude_spu_ids}
  395 + cached = {} if force_refresh else self.store.get_rerank_scores(self.tenant_id, query)
  396 + pending: List[Dict[str, Any]] = [
  397 + doc
  398 + for doc in docs
  399 + if str(doc.get("spu_id") or "") not in exclude_spu_ids
  400 + and str(doc.get("spu_id") or "")
  401 + and (force_refresh or str(doc.get("spu_id")) not in cached)
  402 + ]
  403 + if pending:
  404 + new_scores: Dict[str, float] = {}
  405 + for start in range(0, len(pending), batch_size):
  406 + batch = pending[start : start + batch_size]
  407 + scores = self._rerank_batch_with_retry(query=query, docs=batch)
  408 + if len(scores) != len(batch):
  409 + raise RuntimeError(f"rerank returned {len(scores)} scores for {len(batch)} docs")
  410 + for doc, score in zip(batch, scores):
  411 + new_scores[str(doc.get("spu_id"))] = float(score)
  412 + self.store.upsert_rerank_scores(
  413 + self.tenant_id,
  414 + query,
  415 + new_scores,
  416 + model_name="qwen3_vllm_score",
  417 + )
  418 + cached.update(new_scores)
  419 +
  420 + ranked: List[Dict[str, Any]] = []
  421 + for doc in docs:
  422 + spu_id = str(doc.get("spu_id") or "")
  423 + if not spu_id or spu_id in exclude_spu_ids:
  424 + continue
  425 + ranked.append(
  426 + {"spu_id": spu_id, "score": float(cached.get(spu_id, float("-inf"))), "doc": doc}
  427 + )
  428 + ranked.sort(key=lambda item: item["score"], reverse=True)
  429 + return ranked
  430 +
377 431 def _rerank_batch_with_retry(self, query: str, docs: Sequence[Dict[str, Any]]) -> List[float]:
378 432 if not docs:
379 433 return []
... ... @@ -447,6 +501,78 @@ class SearchEvaluationFramework:
447 501 mid = len(docs) // 2
448 502 return self._classify_with_retry(query, docs[:mid], force_refresh=force_refresh) + self._classify_with_retry(query, docs[mid:], force_refresh=force_refresh)
449 503  
  504 + def _annotate_rebuild_batches(
  505 + self,
  506 + query: str,
  507 + ordered_docs: Sequence[Dict[str, Any]],
  508 + *,
  509 + batch_size: int = DEFAULT_REBUILD_LLM_BATCH_SIZE,
  510 + min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
  511 + max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
  512 + irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  513 + stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
  514 + force_refresh: bool = True,
  515 + ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]:
  516 + """LLM-label ``ordered_docs`` in fixed-size batches with early stop after enough irrelevant-heavy batches."""
  517 + batch_logs: List[Dict[str, Any]] = []
  518 + streak = 0
  519 + labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query))
  520 + total_ordered = len(ordered_docs)
  521 +
  522 + for batch_idx in range(max_batches):
  523 + start = batch_idx * batch_size
  524 + batch_docs = list(ordered_docs[start : start + batch_size])
  525 + if not batch_docs:
  526 + break
  527 +
  528 + batch_pairs = self._classify_with_retry(query, batch_docs, force_refresh=force_refresh)
  529 + for sub_labels, raw_response, sub_batch in batch_pairs:
  530 + to_store = {str(doc.get("spu_id")): label for doc, label in zip(sub_batch, sub_labels)}
  531 + self.store.upsert_labels(
  532 + self.tenant_id,
  533 + query,
  534 + to_store,
  535 + judge_model=self.label_client.model,
  536 + raw_response=raw_response,
  537 + )
  538 + labels.update(to_store)
  539 + time.sleep(0.1)
  540 +
  541 + n = len(batch_docs)
  542 + exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT)
  543 + irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT)
  544 + exact_ratio = exact_n / n if n else 0.0
  545 + irrelevant_ratio = irrel_n / n if n else 0.0
  546 + log_entry = {
  547 + "batch_index": batch_idx + 1,
  548 + "size": n,
  549 + "exact_ratio": round(exact_ratio, 6),
  550 + "irrelevant_ratio": round(irrelevant_ratio, 6),
  551 + "offset_start": start,
  552 + "offset_end": min(start + n, total_ordered),
  553 + }
  554 + batch_logs.append(log_entry)
  555 + print(
  556 + f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} "
  557 + f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f}",
  558 + flush=True,
  559 + )
  560 +
  561 + if batch_idx + 1 >= min_batches:
  562 + if irrelevant_ratio > irrelevant_stop_ratio:
  563 + streak += 1
  564 + else:
  565 + streak = 0
  566 + if streak >= stop_streak:
  567 + print(
  568 + f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches "
  569 + f"({stop_streak} consecutive batches with irrelevant_ratio > {irrelevant_stop_ratio})",
  570 + flush=True,
  571 + )
  572 + break
  573 +
  574 + return labels, batch_logs
  575 +
450 576 def build_query_annotation_set(
451 577 self,
452 578 query: str,
... ... @@ -458,7 +584,32 @@ class SearchEvaluationFramework:
458 584 language: str = "en",
459 585 force_refresh_rerank: bool = False,
460 586 force_refresh_labels: bool = False,
  587 + search_recall_top_k: int = DEFAULT_SEARCH_RECALL_TOP_K,
  588 + rerank_high_threshold: float = DEFAULT_RERANK_HIGH_THRESHOLD,
  589 + rerank_high_skip_count: int = DEFAULT_RERANK_HIGH_SKIP_COUNT,
  590 + rebuild_llm_batch_size: int = DEFAULT_REBUILD_LLM_BATCH_SIZE,
  591 + rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
  592 + rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
  593 + rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  594 + rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
461 595 ) -> QueryBuildResult:
  596 + if force_refresh_labels:
  597 + return self._build_query_annotation_set_rebuild(
  598 + query=query,
  599 + search_depth=search_depth,
  600 + rerank_depth=rerank_depth,
  601 + language=language,
  602 + force_refresh_rerank=force_refresh_rerank,
  603 + search_recall_top_k=search_recall_top_k,
  604 + rerank_high_threshold=rerank_high_threshold,
  605 + rerank_high_skip_count=rerank_high_skip_count,
  606 + rebuild_llm_batch_size=rebuild_llm_batch_size,
  607 + rebuild_min_batches=rebuild_min_batches,
  608 + rebuild_max_batches=rebuild_max_batches,
  609 + rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  610 + rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak,
  611 + )
  612 +
462 613 search_payload = self.search_client.search(query=query, size=search_depth, from_=0, language=language)
463 614 search_results = list(search_payload.get("results") or [])
464 615 corpus = self.corpus_docs(refresh=False)
... ... @@ -558,6 +709,182 @@ class SearchEvaluationFramework:
558 709 output_json_path=output_json_path,
559 710 )
560 711  
  712 + def _build_query_annotation_set_rebuild(
  713 + self,
  714 + query: str,
  715 + *,
  716 + search_depth: int,
  717 + rerank_depth: int,
  718 + language: str,
  719 + force_refresh_rerank: bool,
  720 + search_recall_top_k: int,
  721 + rerank_high_threshold: float,
  722 + rerank_high_skip_count: int,
  723 + rebuild_llm_batch_size: int,
  724 + rebuild_min_batches: int,
  725 + rebuild_max_batches: int,
  726 + rebuild_irrelevant_stop_ratio: float,
  727 + rebuild_irrelevant_stop_streak: int,
  728 + ) -> QueryBuildResult:
  729 + search_size = max(int(search_depth), int(search_recall_top_k))
  730 + search_payload = self.search_client.search(query=query, size=search_size, from_=0, language=language)
  731 + search_results = list(search_payload.get("results") or [])
  732 + recall_n = min(int(search_recall_top_k), len(search_results))
  733 + pool_search_docs = search_results[:recall_n]
  734 + pool_spu_ids = {str(d.get("spu_id")) for d in pool_search_docs if str(d.get("spu_id") or "").strip()}
  735 +
  736 + corpus = self.corpus_docs(refresh=False)
  737 + corpus_by_id = {str(d.get("spu_id")): d for d in corpus if str(d.get("spu_id") or "").strip()}
  738 +
  739 + ranked_outside = self.full_corpus_rerank_outside_exclude(
  740 + query=query,
  741 + docs=corpus,
  742 + exclude_spu_ids=pool_spu_ids,
  743 + force_refresh=force_refresh_rerank,
  744 + )
  745 + rerank_high_n = sum(1 for item in ranked_outside if float(item["score"]) > float(rerank_high_threshold))
  746 +
  747 + rebuild_meta: Dict[str, Any] = {
  748 + "mode": "rebuild_v1",
  749 + "search_recall_top_k": search_recall_top_k,
  750 + "recall_pool_size": len(pool_spu_ids),
  751 + "pool_rerank_score_assigned": 1.0,
  752 + "rerank_high_threshold": rerank_high_threshold,
  753 + "rerank_high_count_outside_pool": rerank_high_n,
  754 + "rerank_high_skip_count": rerank_high_skip_count,
  755 + "rebuild_llm_batch_size": rebuild_llm_batch_size,
  756 + "rebuild_min_batches": rebuild_min_batches,
  757 + "rebuild_max_batches": rebuild_max_batches,
  758 + "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio,
  759 + "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak,
  760 + }
  761 +
  762 + batch_logs: List[Dict[str, Any]] = []
  763 + skipped = False
  764 + skip_reason: str | None = None
  765 + labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query))
  766 + llm_labeled_total = 0
  767 +
  768 + if rerank_high_n > int(rerank_high_skip_count):
  769 + skipped = True
  770 + skip_reason = "too_many_high_rerank_scores"
  771 + print(
  772 + f"[eval-rebuild] query={query!r} skip: rerank_score>{rerank_high_threshold} "
  773 + f"outside recall pool count={rerank_high_n} > {rerank_high_skip_count} "
  774 + f"(relevant tail too large / query too easy to satisfy)",
  775 + flush=True,
  776 + )
  777 + else:
  778 + ordered_docs: List[Dict[str, Any]] = []
  779 + seen_ordered: set[str] = set()
  780 + for doc in pool_search_docs:
  781 + sid = str(doc.get("spu_id") or "")
  782 + if not sid or sid in seen_ordered:
  783 + continue
  784 + seen_ordered.add(sid)
  785 + ordered_docs.append(corpus_by_id.get(sid, doc))
  786 + for item in ranked_outside:
  787 + sid = str(item["spu_id"])
  788 + if sid in seen_ordered:
  789 + continue
  790 + seen_ordered.add(sid)
  791 + ordered_docs.append(item["doc"])
  792 +
  793 + labels, batch_logs = self._annotate_rebuild_batches(
  794 + query,
  795 + ordered_docs,
  796 + batch_size=rebuild_llm_batch_size,
  797 + min_batches=rebuild_min_batches,
  798 + max_batches=rebuild_max_batches,
  799 + irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  800 + stop_streak=rebuild_irrelevant_stop_streak,
  801 + force_refresh=True,
  802 + )
  803 + llm_labeled_total = sum(int(entry.get("size") or 0) for entry in batch_logs)
  804 +
  805 + rebuild_meta["skipped"] = skipped
  806 + rebuild_meta["skip_reason"] = skip_reason
  807 + rebuild_meta["llm_batch_logs"] = batch_logs
  808 + rebuild_meta["llm_labeled_total"] = llm_labeled_total
  809 +
  810 + rerank_depth_effective = min(int(rerank_depth), len(ranked_outside))
  811 + search_labeled_results: List[Dict[str, Any]] = []
  812 + for rank, doc in enumerate(search_results, start=1):
  813 + spu_id = str(doc.get("spu_id"))
  814 + in_pool = rank <= recall_n
  815 + search_labeled_results.append(
  816 + {
  817 + "rank": rank,
  818 + "spu_id": spu_id,
  819 + "title": build_display_title(doc),
  820 + "image_url": doc.get("image_url"),
  821 + "rerank_score": 1.0 if in_pool else None,
  822 + "label": labels.get(spu_id),
  823 + "option_values": list(compact_option_values(doc.get("skus") or [])),
  824 + "product": compact_product_payload(doc),
  825 + }
  826 + )
  827 +
  828 + rerank_top_results: List[Dict[str, Any]] = []
  829 + for rank, item in enumerate(ranked_outside[:rerank_depth_effective], start=1):
  830 + doc = item["doc"]
  831 + spu_id = str(item["spu_id"])
  832 + rerank_top_results.append(
  833 + {
  834 + "rank": rank,
  835 + "spu_id": spu_id,
  836 + "title": build_display_title(doc),
  837 + "image_url": doc.get("image_url"),
  838 + "rerank_score": round(float(item["score"]), 8),
  839 + "label": labels.get(spu_id),
  840 + "option_values": list(compact_option_values(doc.get("skus") or [])),
  841 + "product": compact_product_payload(doc),
  842 + }
  843 + )
  844 +
  845 + top100_labels = [
  846 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  847 + for item in search_labeled_results[:100]
  848 + ]
  849 + metrics = compute_query_metrics(top100_labels)
  850 + output_dir = ensure_dir(self.artifact_root / "query_builds")
  851 + run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
  852 + output_json_path = output_dir / f"{run_id}.json"
  853 + pool_docs_count = len(pool_spu_ids) + len(ranked_outside)
  854 + payload = {
  855 + "run_id": run_id,
  856 + "created_at": utc_now_iso(),
  857 + "tenant_id": self.tenant_id,
  858 + "query": query,
  859 + "config_meta": requests.get("http://localhost:6002/admin/config/meta", timeout=20).json(),
  860 + "search_total": int(search_payload.get("total") or 0),
  861 + "search_depth_requested": search_depth,
  862 + "search_depth_effective": len(search_results),
  863 + "rerank_depth_requested": rerank_depth,
  864 + "rerank_depth_effective": rerank_depth_effective,
  865 + "corpus_size": len(corpus),
  866 + "annotation_pool": {
  867 + "rebuild": rebuild_meta,
  868 + "ordered_union_size": pool_docs_count,
  869 + },
  870 + "labeler_mode": self.labeler_mode,
  871 + "query_profile": self.get_query_profile(query, force_refresh=False) if self.labeler_mode == "complex" else None,
  872 + "metrics_top100": metrics,
  873 + "search_results": search_labeled_results,
  874 + "full_rerank_top": rerank_top_results,
  875 + }
  876 + output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
  877 + self.store.insert_build_run(run_id, self.tenant_id, query, output_json_path, payload["metrics_top100"])
  878 + return QueryBuildResult(
  879 + query=query,
  880 + tenant_id=self.tenant_id,
  881 + search_total=int(search_payload.get("total") or 0),
  882 + search_depth=len(search_results),
  883 + rerank_corpus_size=len(corpus),
  884 + annotated_count=llm_labeled_total if not skipped else 0,
  885 + output_json_path=output_json_path,
  886 + )
  887 +
561 888 def evaluate_live_query(
562 889 self,
563 890 query: str,
... ...
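The rebuild path's pool ordering (search-recall docs first in search order, then the remaining corpus by descending rerank score, de-duplicated on `spu_id`) can be sketched in isolation. This is an illustrative reduction, not the repo function:

```python
from typing import Any, Dict, List, Sequence

def ordered_annotation_pool(
    pool_search_docs: Sequence[Dict[str, Any]],
    ranked_outside: Sequence[Dict[str, Any]],  # assumed pre-sorted by score desc
) -> List[str]:
    ordered: List[str] = []
    seen: set = set()
    for doc in pool_search_docs:  # keep original search order
        sid = str(doc.get("spu_id") or "")
        if sid and sid not in seen:
            seen.add(sid)
            ordered.append(sid)
    for item in ranked_outside:  # then rerank order for everything else
        sid = str(item["spu_id"])
        if sid and sid not in seen:
            seen.add(sid)
            ordered.append(sid)
    return ordered

pool = [{"spu_id": "a"}, {"spu_id": "b"}, {"spu_id": "a"}]  # duplicate "a"
outside = [{"spu_id": "c"}, {"spu_id": "b"}]                # "b" already pooled
print(ordered_annotation_pool(pool, outside))  # → ['a', 'b', 'c']
```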
scripts/evaluation/eval_framework/prompts.py
... ... @@ -5,46 +5,46 @@ from __future__ import annotations
5 5 import json
6 6 from typing import Any, Dict, Sequence
7 7  
8   -_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
  8 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
9 9 Given the user query and each product's information, assign one relevance label to each product.
10 10  
11 11 ## Relevance Labels
12 12  
13 13 ### Exact
14   -The product fully satisfies the user's search intent.
  14 +The product fully satisfies the user's search intent: the core product type matches, and all explicitly stated key attributes are supported by the product information.
15 15  
16   -Use Exact when:
17   -- The product matches the core product type named in the query.
18   -- The key requirements explicitly stated in the query are satisfied.
19   -- There is no clear conflict with any explicit user requirement.
20   -
21   -Typical cases:
22   -- The query is only a product type, and the product is exactly that product type.
23   -- The query includes product type + attributes, and the product matches the type and those attributes.
  16 +Typical use cases:
  17 +- The query contains only a product type, and the product is exactly that type.
  18 +- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.
24 19  
25 20 ### Partial
26   -The product satisfies the user's primary intent, but does not fully satisfy all specified details.
  21 +The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are unmet, cannot be confirmed from the product information, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.
27 22  
28 23 Use Partial when:
29 24 - The core product type matches, but some requested attributes cannot be confirmed.
30   -- The core product type matches, but only some secondary attributes are satisfied.
31   -- The core product type matches, and there are minor or non-critical deviations from the query.
32   -- The product does not clearly contradict the user's explicit requirements, but it also cannot be considered a full match.
  25 +- The core product type matches, but some secondary requirements deviate or are inconsistent.
  26 +- The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.
33 27  
34 28 Typical cases:
35 29 - Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
36 30 - Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
37   -- Query: "cotton long sleeve blouse", product: "Long Sleeve Blouse" → material not confirmed.
38 31  
39   -Important:
40   -Partial should mainly be used when the core product type is correct, but the detailed requirements are incomplete, uncertain, or only partially matched.
  32 +Detailed example:
  33 +- Query: "cotton long sleeve shirt"
  34 +- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
  35 +
  36 +Analysis:
  37 +- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
  38 +- However, the core product type still matches: both are long sleeve shirts.
  39 +- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
  40 +- Therefore, it should be labeled Partial as a non-target but acceptable substitute.
41 41  
42 42 ### Irrelevant
43 43 The product does not satisfy the user's main shopping intent.
44 44  
45 45 Use Irrelevant when:
46 46 - The core product type does not match the query.
47   -- The product matches the general category but is a different product type that shoppers would not consider interchangeable.
  47 +- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
48 48 - The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.
49 49  
50 50 Typical cases:
... ... @@ -53,6 +53,8 @@ Typical cases:
53 53 - Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
54 54 - Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.
55 55  
  56 +This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
  57 +
56 58 ## Decision Principles
57 59  
58 60 1. Product type is the highest-priority factor.
... ... @@ -71,16 +73,13 @@ Typical cases:
71 73 If the user explicitly searched for one of these, the others should usually be judged Irrelevant.
72 74  
73 75 3. If the core product type matches, then evaluate attributes.
74   - - If attributes fully match → Exact
75   - - If attributes are missing, uncertain, or only partially matched → Partial
76   - - If attributes clearly contradict an explicit important requirement → Irrelevant
  76 + - If all explicit attributes match → Exact
  77 + - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
  78 + - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant
77 79  
78 80 4. Distinguish carefully between "not mentioned" and "contradicted".
79 81 - If an attribute is not mentioned or cannot be verified, prefer Partial.
80   - - If an attribute is explicitly opposite to the query, use Irrelevant.
81   -
82   -5. Do not overuse Exact.
83   - Exact requires strong evidence that the product satisfies the user's stated intent, not just the general category.
  82 + - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.
84 83  
85 84 Query: {query}
86 85  
... ... @@ -97,96 +96,93 @@ The lines must correspond sequentially to the products above.
97 96 Do not output any other information.
98 97 """
99 98  
100   -_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = """你是一个服装电商搜索系统的相关性评估助手。
101   -给定用户查询和每个产品的信息,为每个产品分配一个相关性标签。
  99 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
  100 +给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
102 101  
103 102 ## 相关性标签
104 103  
105 104 ### 完全相关
106   -该产品完全满足用户的搜索意图。
107   -
108   -在以下情况使用完全相关:
109   -- 产品与查询中指定的核心产品类型相匹配。
110   -- 满足了查询中明确说明的关键要求。
111   -- 与用户明确的任何要求没有明显冲突。
  105 +核心产品类型匹配,所有明确提及的关键属性均有产品信息支撑。
112 106  
113   -典型情况:
114   -- 查询仅包含产品类型,而产品恰好是该产品类型。
115   -- 查询包含产品类型 + 属性,而产品与该类型及这些属性相匹配。
  107 +典型适用场景:
  108 +- 查询仅包含产品类型,产品即为该类型。
  109 +- 查询包含“产品类型 + 属性”,产品在类型及所有明确属性上均符合。
116 110  
117 111 ### 部分相关
118   -该产品满足了用户的主要意图,但并未完全满足所有指定的细节
  112 +产品满足用户的主要意图(核心产品类型匹配),但查询中明确的部分要求未体现,或存在偏差。虽然有不一致,但仍属于“非目标但可接受”的替代品
119 113  
120 114 在以下情况使用部分相关:
121   -- 核心产品类型匹配,但部分请求的属性无法确认。
122   -- 核心产品类型匹配,但仅满足了部分次要属性。
123   -- 核心产品类型匹配,但与查询存在微小或非关键的偏差。
124   -- 产品未明显违背用户的明确要求,但也不能视为完全匹配。
  115 +- 核心产品类型匹配,但部分请求的属性在商品信息中缺失、未提及或无法确认。
  116 +- 核心产品类型匹配,但材质、版型、风格等次要要求存在偏差或不一致。
  117 +- 商品不是用户最理想的目标,但从电商购物角度看,仍可能被用户视为可接受的替代品。
125 118  
126 119 典型情况:
127   -- 查询:"红色修身T恤",产品:"女士T恤" → 颜色/版型无法确认。
128   -- 查询:"红色修身T恤",产品:"蓝色修身T恤" → 产品类型和版型匹配,但颜色不同。
129   -- 查询:"棉质长袖衬衫",产品:"长袖衬衫" → 材质未确认。
  120 +- 查询:“红色修身T恤”,产品:“女士T恤” → 颜色/版型无法确认。
  121 +- 查询:“红色修身T恤”,产品:“蓝色修身T恤” → 产品类型和版型匹配,但颜色不同。
130 122  
131   -重要提示:
132   -部分相关主要应在核心产品类型正确,但详细要求不完整、不确定或仅部分匹配时使用。
  123 +详细案例:
  124 +- 查询:“棉质长袖衬衫”
  125 +- 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
133 126  
134   -### 不相关
135   -该产品不满足用户的主要购物意图。
  127 +分析:
  128 +- 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为完全相关。
  129 +- 但核心品类仍然匹配:两者都是“长袖衬衫”。
  130 +- 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
  131 +- 因此应判为部分相关,即“非目标但可接受”的替代品。
136 132  
137   -在以下情况使用不相关:
  133 +### 不相关
  134 +产品未满足用户的主要购物意图,主要表现为以下情形之一:
138 135 - 核心产品类型与查询不匹配。
139   -- 产品匹配了大致类别,但属于购物者不会认为可互换的不同产品类型。
140   -- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的要求。
  136 +- 产品虽属大致相关的大类,但与查询指定的具体子类不可互换。
  137 +- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的属性要求。
141 138  
142 139 典型情况:
143   -- 查询:"裤子",产品:"鞋子" → 错误的产品类型。
144   -- 查询:"连衣裙",产品:"半身裙" → 不同的产品类型。
145   -- 查询:"修身裤",产品:"宽松阔腿裤" → 版型上明显矛盾。
146   -- 查询:"无袖连衣裙",产品:"长袖连衣裙" → 袖型上明显矛盾。
  140 +- 查询:“裤子”,产品:“鞋子” → 产品类型错误。
  141 +- 查询:“连衣裙”,产品:“半身裙” → 具体产品类型不同。
  142 +- 查询:“修身裤”,产品:“宽松阔腿裤” → 与版型要求明显冲突。
  143 +- 查询:“无袖连衣裙”,产品:“长袖连衣裙” → 与袖型要求明显冲突。
147 144  
148   -## 决策原则
  145 +该标签强调用户意图的明确性。当查询指向具体类型或关键属性时,即使产品在更高层级类别上相关,也应按不相关处理。
149 146  
150   -1. 产品类型是最高优先级的因素。
151   - 如果查询明确指定了具体产品类型,结果必须匹配该产品类型才能被评为完全相关或部分相关。
152   - 不同的产品类型通常是不相关,而非部分相关。
  147 +## 判断原则
153 148  
154   -2. 当查询明确时,相似或相关的产品类型不可互换。
  149 +1. 产品类型是最高优先级因素。
  150 + 如果查询明确指定了具体产品类型,那么结果必须匹配该产品类型,才可能判为“完全相关”或“部分相关”。
  151 + 不同产品类型通常应判为“不相关”,而不是“部分相关”。
  152 +
  153 +2. 相似或相关的产品类型,在查询明确时通常不可互换。
155 154 例如:
156 155 - 连衣裙 vs 半身裙 vs 连体裤
157 156 - 牛仔裤 vs 裤子
158   - - T恤 vs 衬衫
  157 + - T恤 vs 衬衫/上衣
159 158 - 开衫 vs 毛衣
160 159 - 靴子 vs 鞋子
161 160 - 文胸 vs 上衣
162 161 - 双肩包 vs 包
163   - 如果用户明确搜索了其中一种,其他的通常应判断为不相关。
164   -
165   -3. 如果核心产品类型匹配,则评估属性。
166   - - 如果属性完全匹配 → 完全相关
167   - - 如果属性缺失、不确定或仅部分匹配 → 部分相关
168   - - 如果属性明显违背明确的重点要求 → 不相关
  162 + 如果用户明确搜索其中一种,其他类型通常应判为“不相关”。
169 163  
170   -4. 仔细区分“未提及”和“矛盾”。
171   - - 如果属性未提及或无法验证,倾向于部分相关。
172   - - 如果属性与查询明确相反,使用不相关。
  164 +3. 当核心产品类型匹配后,再评估属性。
  165 + - 所有明确属性都匹配 → 完全相关
  166 + - 部分属性缺失、无法确认,或存在一定偏差,但仍是可接受替代品 → 部分相关
  167 + - 明确且重要的属性被明显违背,且不能作为合理替代品 → 不相关
173 168  
174   -5. 不要过度使用完全相关。
175   - 完全相关需要强有力的证据表明产品满足了用户声明的意图,而不仅仅是通用类别。
  169 +4. 要严格区分“未提及/无法确认”和“明确冲突”。
  170 + - 如果某属性没有提及,或无法验证,优先判为“部分相关”。
  171 + - 如果某属性与查询要求明确相反,则判为“不相关”;除非在购物语境下它仍明显属于可接受替代品。
176 172  
177   -查询: {query}
  173 +查询:{query}
178 174  
179   -产品:
  175 +商品:
180 176 {lines}
181 177  
182 178 ## 输出格式
183   -严格输出 {n} 行,每行包含以下之一:
184   -Exact
185   -Partial
186   -Irrelevant
  179 +严格输出 {n} 行,每行只能是以下三者之一:
  180 +完全相关
  181 +部分相关
  182 +不相关
187 183  
188   -这些行必须按顺序对应上面的产品。
189   -不要输出任何其他信息。
  184 +输出行必须与上方商品顺序一一对应。
  185 +不要输出任何其他内容。
190 186 """
191 187  
192 188  
... ...
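The zh template now demands exactly `{n}` lines drawn from 完全相关/部分相关/不相关, so the caller has to map them back to the canonical labels. A hypothetical parsing helper (not in this diff) could look like:

```python
# Map the zh template's output vocabulary back to canonical labels.
_ZH_TO_CANONICAL = {"完全相关": "Exact", "部分相关": "Partial", "不相关": "Irrelevant"}

def parse_zh_labels(raw: str, n: int) -> list:
    """Parse the strict n-line judge output; raise on count or vocabulary drift."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if len(lines) != n:
        raise ValueError(f"expected {n} label lines, got {len(lines)}")
    try:
        return [_ZH_TO_CANONICAL[line] for line in lines]
    except KeyError as exc:
        raise ValueError(f"unknown label: {exc.args[0]!r}") from exc

print(parse_zh_labels("完全相关\n不相关\n部分相关\n", 3))  # → ['Exact', 'Irrelevant', 'Partial']
```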
scripts/evaluation/quick_start_eval.sh
... ... @@ -11,7 +11,7 @@ QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}"
11 11 usage() {
12 12 echo "Usage: $0 batch|batch-rebuild|serve"
13 13 echo " batch — batch eval: live search every query, LLM only for missing labels (top_k=50, simple)"
14   - echo " batch-rebuild — same as batch but --force-refresh-labels (re-LLM all top_k hits; expensive, overwrites cache)"
  14 + echo " batch-rebuild — deep rebuild: build --force-refresh-labels (search recall pool + full-corpus rerank + batched LLM; expensive)"
15 15 echo " serve — eval UI (default http://0.0.0.0:\${EVAL_WEB_PORT:-6010}/; also: ./scripts/start_eval_web.sh)"
16 16 echo "Env: TENANT_ID (default 163), REPO_EVAL_QUERIES, EVAL_WEB_HOST, EVAL_WEB_PORT (default 6010)"
17 17 }
... ... @@ -26,13 +26,15 @@ case "${1:-}" in
26 26 --labeler-mode simple
27 27 ;;
28 28 batch-rebuild)
29   - exec "$PY" scripts/evaluation/build_annotation_set.py batch \
  29 + exec "$PY" scripts/evaluation/build_annotation_set.py build \
30 30 --tenant-id "$TENANT_ID" \
31 31 --queries-file "$QUERIES" \
32   - --top-k 50 \
  32 + --search-depth 500 \
  33 + --rerank-depth 10000 \
  34 + --force-refresh-rerank \
  35 + --force-refresh-labels \
33 36 --language en \
34   - --labeler-mode simple \
35   - --force-refresh-labels
  37 + --labeler-mode simple
36 38 ;;
37 39 serve)
38 40 EVAL_WEB_PORT="${EVAL_WEB_PORT:-6010}"
... ...
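The usage text above documents `TENANT_ID` and `REPO_EVAL_QUERIES` as env overrides; the script relies on standard POSIX `${VAR:-default}` expansion for them. The snippet below reproduces that defaulting on its own (values mirror the usage text, nothing else is assumed):

```shell
#!/usr/bin/env sh
# Mirrors the defaulting quick_start_eval.sh performs before dispatching.
TENANT_ID="${TENANT_ID:-163}"
QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}"
echo "tenant=$TENANT_ID queries=$QUERIES"
```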