# Search Evaluation Framework

This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

**Design:** Build labels offline for one or more named evaluation datasets. Each dataset has a stable `dataset_id` backed by a query file registered in `config.yaml -> search_evaluation.datasets`. Single-query and batch evaluation map recalled `spu_id` values to the shared SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system with a multi-metric primary scorecard instead of a single headline metric.
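A dataset registration in `config.yaml` might look like the following minimal sketch; the exact key layout under `search_evaluation.datasets` is inferred from this README, so treat anything beyond `dataset_id` and `query_file` as an assumption:

```yaml
search_evaluation:
  datasets:
    - dataset_id: core_queries
      query_file: scripts/evaluation/queries/queries.txt
    - dataset_id: clothing_top771
      query_file: scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered
```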

## What it does

1. Build an annotation set for a named evaluation dataset.
2. Evaluate live search results against cached labels.
3. Run batch evaluation and keep historical reports with config snapshots.
4. Tune fusion parameters in a reproducible loop.

## Layout

| Path | Role |
|------|------|
| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
| `build_annotation_set.py` | CLI entry (build / batch / audit) |
| `serve_eval_web.py` | Web server for the evaluation UI |
| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
| `fusion_experiments_round1.json` | Broader first-round experiments |
| `queries/queries.txt` | Legacy core query set (`dataset_id=core_queries`) |
| `queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered` | Expanded clothing dataset (`dataset_id=clothing_top771`) |
| `README_Requirement.md` | Product/requirements reference |
| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild`, `batch-rebuild-resume` (resume from existing per-query outputs), or `serve` |
| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |

## Quick start (repo root)

Set the tenant if needed (`export TENANT_ID=163`). To switch datasets, export `REPO_EVAL_DATASET_ID` or pass `--dataset-id`. You need a live search API, a running backend, and DashScope credentials whenever new LLM labels are required.

```bash
# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/start_eval.sh batch

# Switch to the 771-query clothing dataset
REPO_EVAL_DATASET_ID=clothing_top771 ./scripts/evaluation/start_eval.sh batch

# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive)
./scripts/evaluation/start_eval.sh batch-rebuild

# Resume deep rebuild from existing query_builds (recommended for long 771-query runs)
REPO_EVAL_DATASET_ID=clothing_top771 ./scripts/evaluation/start_eval.sh batch-rebuild-resume

# UI: http://127.0.0.1:6010/
./scripts/evaluation/start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web
```

Explicit equivalents:

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --dataset-id core_queries \
  --top-k 50 \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id "${TENANT_ID:-163}" \
  --dataset-id core_queries \
  --search-depth 500 \
  --rerank-depth 10000 \
  --force-refresh-rerank \
  --force-refresh-labels \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --dataset-id core_queries \
  --host 127.0.0.1 \
  --port 6010
```

Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits, not the deep rebuild pipeline).
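The cache gate behind this can be sketched as follows (the function and in-memory dict here are hypothetical; the real labels live in `search_eval.sqlite3`, keyed the same way):

```python
# Hypothetical sketch: labels are keyed by (tenant_id, query, spu_id), and only
# pairs without a cached label are sent to the LLM judge.
def pairs_needing_llm(cached_labels, tenant_id, query, hit_spu_ids):
    """Return the spu_ids from a live search result that still need judging."""
    return [
        spu_id
        for spu_id in hit_spu_ids
        if (tenant_id, query, spu_id) not in cached_labels
    ]

cache = {(163, "red dress", "spu-1"): "Fully Relevant"}
todo = pairs_needing_llm(cache, 163, "red dress", ["spu-1", "spu-2"])
# "spu-1" is served from the cache; only "spu-2" would hit the LLM
```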

### `start_eval.sh batch-rebuild` (deep annotation rebuild)

This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block above). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.

For **each** query in the selected dataset query file (`--dataset-id` / `config.yaml search_evaluation.datasets[*].query_file`), in order:

1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
4. **Skip “too easy” queries** — If more than **1000** non-pool documents have a rerank score **> 0.5**, that query is **skipped** (one log line noting the tail is too relevant / easy to satisfy). No LLM calls are made for that query.
5. **Global sort** — The order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (deduped by `spu_id`; the pool wins).
6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**.

   **Early stop** (defaults in `eval_framework.constants`; overridable via CLI):

   - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed.
   - After that, a **bad batch** is one where **both** conditions hold (strict **>**):
     - **Irrelevant** proportion **> 93.9%** (`--rebuild-irrelevant-stop-ratio`, default `0.939`), and
     - **(Irrelevant + Weakly Relevant)** proportion **> 95.9%** (`--rebuild-irrel-low-combined-stop-ratio`, default `0.959`).
       (“Weakly Relevant” is the weak tier; **Mostly Relevant** and **Fully Relevant** do not enter this sum.)
   - Count **consecutive** bad batches, and **reset** the count to 0 on any batch that is not bad.
   - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**3** by default) or when **`--rebuild-max-batches`** (**40**) is reached, whichever comes first (up to **2000** docs per query at the default batch size).

   So labeling follows best-first order but **stops early** after **three** consecutive batches that are overwhelmingly bad on both criteria; the tail may never be judged.
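The early-stop loop above can be sketched like this (an illustration, not the framework's code; the constants mirror the documented defaults, and whether pre-minimum batches count toward the streak is an assumption):

```python
# Sketch of the rebuild early-stop rule: walk batches best-first, allow a stop
# only after the minimum batch count, and require a streak of bad batches.
IRRELEVANT_STOP_RATIO = 0.939
IRREL_LOW_COMBINED_STOP_RATIO = 0.959
MIN_BATCHES = 10
MAX_BATCHES = 40
STOP_STREAK = 3

def is_bad_batch(labels):
    """Both ratios must strictly exceed their thresholds."""
    n = len(labels)
    irrel = sum(1 for l in labels if l == "Irrelevant") / n
    irrel_low = sum(1 for l in labels if l in ("Irrelevant", "Weakly Relevant")) / n
    return irrel > IRRELEVANT_STOP_RATIO and irrel_low > IRREL_LOW_COMBINED_STOP_RATIO

def batches_labeled(batches):
    """batches: per-batch label lists in global sort order; returns batches consumed."""
    streak, consumed = 0, 0
    for i, labels in enumerate(batches, start=1):
        consumed = i
        if i >= MAX_BATCHES:
            break
        if i <= MIN_BATCHES:
            continue  # early stop is not allowed before the minimum batch count
        streak = streak + 1 if is_bad_batch(labels) else 0
        if streak >= STOP_STREAK:
            break
    return consumed
```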

**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass, with no rerank-skip rule and no LLM early-stop loop.
  
**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).

**Resuming interrupted runs:** for long jobs (for example `clothing_top771`), use `batch-rebuild-resume` or pass `build --resume-missing --continue-on-error --max-retries-per-query N`. Resume mode skips queries that already have per-query JSON under `datasets/<dataset_id>/query_builds/`.
  
## Artifacts

Default root: `artifacts/search_evaluation/`

- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
- `datasets/<dataset_id>/query_builds/` — per-query pooled build outputs
- `datasets/<dataset_id>/batch_reports/<batch_id>/` — batch JSON, Markdown, config snapshot, dataset snapshot, query snapshot
- `datasets/<dataset_id>/audits/` — label-quality audit summaries
- `tuning_runs/` — fusion experiment outputs and config snapshots

## Labels

- **Fully Relevant** — Matches the intended product type and all explicit required attributes.
- **Mostly Relevant** — Main intent matches and the item is a strong substitute, but some attributes are missing, weaker, or slightly off.
- **Weakly Relevant** — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match.
- **Irrelevant** — Type mismatch or important conflicts make it a poor search result.

## Metric design

This framework now follows graded ranking evaluation, closer to e-commerce best practice, instead of collapsing everything into binary relevance.
  
- **Primary scorecard**
  The primary evaluation set is:
  `NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`.
- **Composite tuning score: `Primary_Metric_Score`**
  For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`).
- **Gain scheme**
  `Fully Relevant=3`, `Mostly Relevant=2`, `Weakly Relevant=1`, `Irrelevant=0`.
  The relevance grades stay `3/2/1/0`, and the current implementation uses the grade values directly as gains, so the gap between the exact and high tiers is less aggressive than with exponential gains.
- **Why this is better**
  `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping a `Fully Relevant` item with a `Weakly Relevant` one is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`.
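The composite score can be written down as a short sketch (the function name is illustrative and the metric values are assumed to be precomputed per run):

```python
# Composite tuning score: mean over the eight primary metrics, with Avg_Grade@10
# rescaled into [0, 1] by the max grade (3) so every term shares the same range.
PRIMARY_SCORECARD = [
    "NDCG@20", "NDCG@50", "ERR@10", "Strong_Precision@10",
    "Strong_Precision@20", "Useful_Precision@50", "Avg_Grade@10", "Gain_Recall@20",
]
MAX_GRADE = 3.0

def primary_metric_score(metrics):
    total = 0.0
    for name in PRIMARY_SCORECARD:
        value = metrics[name]
        if name == "Avg_Grade@10":
            value /= MAX_GRADE  # normalize the only non-[0,1] metric
        total += value
    return total / len(PRIMARY_SCORECARD)
```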

The reported metrics are:

- **`Primary_Metric_Score`** — Mean score over the primary scorecard (with `Avg_Grade@10` divided by `3`).
- **`NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`** — Primary scorecard for optimization decisions.
- **`NDCG@5`, `NDCG@10`, `ERR@5`, `ERR@20`, `ERR@50`** — Secondary graded ranking quality slices.
- **`Exact_Precision@K`** — Strict top-slot quality when only `Fully Relevant` counts.
- **`Strong_Precision@K`** — Business-facing top-slot quality where `Fully Relevant + Mostly Relevant` count as strong positives.
- **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts.
- **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query.
- **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K.
- **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears.
- **`Avg_Grade@10`** — Average relevance grade of the visible first page.
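The graded metrics can be sketched as follows (a minimal illustration using the gain scheme above, not the framework's actual code; the ERR stop probability uses the standard `(2^g - 1) / 2^g_max` form):

```python
import math

GAIN = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def ndcg_at_k(labels, k, judged_pool=None):
    """labels: returned list top-down; judged_pool supplies the ideal ordering."""
    def dcg(gains):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted((GAIN[l] for l in (judged_pool or labels)), reverse=True)[:k]
    denom = dcg(ideal)
    return dcg([GAIN[l] for l in labels[:k]]) / denom if denom > 0 else 0.0

def err_at_k(labels, k, max_grade=3):
    """Expected Reciprocal Rank: graded stop probability at each rank."""
    p_continue, err = 1.0, 0.0
    for rank, label in enumerate(labels[:k], start=1):
        stop = (2 ** GAIN[label] - 1) / (2 ** max_grade)
        err += p_continue * stop / rank
        p_continue *= 1.0 - stop
    return err

def gain_recall_at_k(labels, k, judged_pool):
    """Gain captured in the returned top-k versus the full judged pool."""
    total = sum(GAIN[l] for l in judged_pool)
    return sum(GAIN[l] for l in labels[:k]) / total if total > 0 else 0.0
```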

**Labeler modes:** `simple` (default) runs one judging pass per batch with the standard relevance prompt; `complex` adds query-profile extraction plus extra guardrails (for structured experiments).

## Flows

**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.

**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.

**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).

### Audit

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```
  
## Web UI

Features: dataset selector, dataset-scoped query list, single-query and batch evaluation, dataset-scoped batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.

## Batch reports

Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.

To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON now also includes:

- `request_id` — the exact `X-Request-ID` sent by the evaluator for that live search call
- `top_label_sequence_top10` / `top_label_sequence_top20` — compact label sequence strings such as `1:L3 | 2:L1 | 3:L2`
- `top_results` — a lightweight top-20 snapshot with `rank`, `spu_id`, `label`, title fields, and `relevance_score`
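The label-sequence string can be reproduced with a one-liner like this (assuming the `L0`–`L3` codes map to the four tiers with `L3` as `Fully Relevant`; that mapping is inferred from the format, not confirmed from the code):

```python
# Build the compact "rank:tier" string, e.g. from per-rank tier numbers [3, 1, 2].
def label_sequence(tiers):
    return " | ".join(f"{rank}:L{tier}" for rank, tier in enumerate(tiers, start=1))

label_sequence([3, 1, 2])  # "1:L3 | 2:L1 | 3:L2"
```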
  
The Markdown report now surfaces the same case context in a lighter human-readable form:

- request id
- top-10 / top-20 label sequence
- top-5 result snapshot for quick scanning

This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand.

The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI.
  
## Ranking debug and LTR prep
  
`debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work:

- `retrieval_plan` — the effective text/image KNN plan for the query (`k`, `num_candidates`, boost, and whether the long-query branch was used).
- `ltr_summary` — query-level summary over the visible top results: how many docs came from translation matches, text KNN, image KNN, fallback-to-ES cases, plus mean signal values.

At the document level, `debug_info.per_result[*]` and `ranking_funnel.*.ltr_features` include stable features such as:

- `es_score`, `text_score`, `knn_score`, `rerank_score`, `fine_score`
- `source_score`, `translation_score`, `text_knn_score`, `image_knn_score`
- `has_translation_match`, `has_text_knn`, `has_image_knn`, `text_score_fallback_to_es`

This makes it easier to inspect bad cases and also gives us a near-direct feature inventory for downstream LTR experiments.
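Those per-document fields flatten naturally into feature vectors; a sketch (the `per_result` dict shape is assumed from the field list above, not taken from the response schema):

```python
# Flatten debug_info.per_result[*].ltr_features into fixed-width numeric rows;
# missing features default to 0.0 and booleans are cast to floats.
FEATURE_NAMES = [
    "es_score", "text_score", "knn_score", "rerank_score", "fine_score",
    "source_score", "translation_score", "text_knn_score", "image_knn_score",
    "has_translation_match", "has_text_knn", "has_image_knn",
    "text_score_fallback_to_es",
]

def feature_rows(per_result):
    return [
        [float(doc.get("ltr_features", {}).get(name, 0.0)) for name in FEATURE_NAMES]
        for doc in per_result
    ]
```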
  
## Caveats
  
- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
- Backend restarts in automated tuning may need a short settle time before requests.

## Related docs

- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.