  # Search Evaluation Framework
  
  This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
  
  **Design:** Build labels offline for one or more named evaluation datasets. Each dataset has a stable `dataset_id` backed by a query file registered in `config.yaml -> search_evaluation.datasets`. Single-query and batch evaluation map recalled `spu_id` values to the shared SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system with a multi-metric primary scorecard instead of a single headline metric.
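
Dataset registration in `config.yaml` might look like the sketch below. The `search_evaluation.datasets` path and the two dataset ids come from this README; the per-dataset key names are illustrative assumptions, not the verified schema:

```yaml
# Hypothetical shape only; check config.yaml for the real keys.
search_evaluation:
  datasets:
    core_queries:
      queries_file: scripts/evaluation/queries/queries.txt
    clothing_top771:
      queries_file: scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered
```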
  
  ## What it does
  
  1. Build an annotation set for a named evaluation dataset.
  2. Evaluate live search results against cached labels.
  3. Run batch evaluation and keep historical reports with config snapshots.
  4. Tune fusion parameters in a reproducible loop.
  
  ## Layout
  
  | Path | Role |
  |------|------|
  | `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
  | `build_annotation_set.py` | CLI entry (build / batch / audit) |
  | `serve_eval_web.py` | Web server for the evaluation UI |
  | `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
  | `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
  | `fusion_experiments_round1.json` | Broader first-round experiments |
  | `queries/queries.txt` | Legacy core query set (`dataset_id=core_queries`) |
  | `queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered` | Expanded clothing dataset (`dataset_id=clothing_top771`) |
  | `README_Requirement.md` | Product/requirements reference |
  | `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
  | `../start_eval_web.sh` | Equivalent to `serve` (run via `activate.sh`); use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
  
  ## Quick start (repo root)
  
  Set the tenant if needed (`export TENANT_ID=163`). To switch datasets, export `REPO_EVAL_DATASET_ID` or pass `--dataset-id`. You need a live search API, a running backend, and DashScope credentials when new LLM labels are required.
  
  ```bash
  # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
  ./scripts/evaluation/start_eval.sh batch
  
  # switch to the 771-query clothing dataset
  REPO_EVAL_DATASET_ID=clothing_top771 ./scripts/evaluation/start_eval.sh batch
  
  # Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive)
  ./scripts/evaluation/start_eval.sh batch-rebuild
  
  # UI: http://127.0.0.1:6010/
  ./scripts/evaluation/start_eval.sh serve
  # or: ./scripts/service_ctl.sh start eval-web
  ```
  
  Explicit equivalents:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id "${TENANT_ID:-163}" \
    --dataset-id core_queries \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
    --tenant-id "${TENANT_ID:-163}" \
    --dataset-id core_queries \
    --search-depth 500 \
    --rerank-depth 10000 \
    --force-refresh-rerank \
    --force-refresh-labels \
    --language en \
    --labeler-mode simple
  
  ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
    --tenant-id "${TENANT_ID:-163}" \
    --dataset-id core_queries \
    --host 127.0.0.1 \
    --port 6010
  ```
  
  Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
  
  ### `start_eval.sh batch-rebuild` (deep annotation rebuild)
  
  This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block above). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.
  
  For **each** query in `queries.txt`, in order:
  
  1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
  2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
  3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
  4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
  5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
  6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**.
  
     **Early stop** (defaults in `eval_framework.constants`; overridable via CLI):
  
     - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed.
     - After that, a **bad batch** is one where **both** are true (strict **>**):
       - **Irrelevant** proportion **> 93.9%** (`--rebuild-irrelevant-stop-ratio`, default `0.939`), and
       - **(Irrelevant + Weakly Relevant)** proportion **> 95.9%** (`--rebuild-irrel-low-combined-stop-ratio`, default `0.959`).  
         (“Weakly Relevant” is the weak tier; **Mostly Relevant** and **Exact** do not enter this sum.)
     - Count **consecutive** bad batches. **Reset** the count to 0 on any batch that is not bad.
     - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**3** by default), or when **`--rebuild-max-batches`** (**40**) is reached—whichever comes first (up to **2000** docs per query at default batch size).

   So labeling follows best-first order but **stops early** after **three** consecutive batches that are overwhelmingly Irrelevant on both the Irrelevant and Irrelevant+Weak ratios; the tail may never be judged.
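
The batch loop and early-stop rule above can be sketched as follows. This is a minimal illustration with the default thresholds; `label_batch` stands in for the LLM call, and the real `eval_framework` code may differ in detail:

```python
# Sketch of the rebuild early-stop loop (illustrative; not the eval_framework code).
def label_with_early_stop(docs, label_batch, batch_size=50, min_batches=10,
                          max_batches=40, irrel_stop=0.939,
                          irrel_low_stop=0.959, stop_streak=3):
    labeled, streak = [], 0
    for i in range(max_batches):
        batch = docs[i * batch_size:(i + 1) * batch_size]
        if not batch:
            break
        labels = label_batch(batch)  # e.g. one LLM call returning tier names
        labeled.extend(labels)
        irrel = labels.count("Irrelevant") / len(labels)
        irrel_low = irrel + labels.count("Weakly Relevant") / len(labels)
        # A "bad" batch must exceed BOTH thresholds (strict >).
        bad = irrel > irrel_stop and irrel_low > irrel_low_stop
        streak = streak + 1 if bad else 0  # reset on any non-bad batch
        if i + 1 >= min_batches and streak >= stop_streak:
            break  # enough consecutive bad batches: the tail goes unjudged
    return labeled
```

At the defaults this caps a query at 40 batches of 50, i.e. the 2000-doc ceiling mentioned above.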
  
  **Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
  
  **Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
  
  ## Artifacts
  
  Default root: `artifacts/search_evaluation/`
  
  - `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  - `datasets/<dataset_id>/query_builds/` — per-query pooled build outputs
  - `datasets/<dataset_id>/batch_reports/<batch_id>/` — batch JSON, Markdown, config snapshot, dataset snapshot, query snapshot
  - `datasets/<dataset_id>/audits/` — label-quality audit summaries
  - `tuning_runs/` — fusion experiment outputs and config snapshots
  
  ## Labels
  
  - **Fully Relevant** — Matches the intended product type and all explicit required attributes.
  - **Mostly Relevant** — The main intent matches and the item is a strong substitute, but some attributes are missing, weaker, or slightly off.
  - **Weakly Relevant** — Only a weak substitute; it may share scenario, style, or broad category but is no longer a strong match.
  - **Irrelevant** — Type mismatch or important conflicts make it a poor search result.
  
  ## Metric design
  
  This framework now follows graded ranking evaluation, closer to e-commerce best practice, rather than collapsing everything into binary relevance.
  
  - **Primary scorecard**
    The primary evaluation set is:
    `NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`.
  - **Composite tuning score: `Primary_Metric_Score`**
    For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`).
  - **Gain scheme**
    `Fully Relevant=3`, `Mostly Relevant=2`, `Weakly Relevant=1`, `Irrelevant=0`
    We keep the relevance grades `3/2/1/0`, but the current implementation uses the grade values directly as linear gains, so the gap between the exact and high tiers is less aggressive than an exponential scheme (such as `2^grade - 1`) would make it.
  - **Why this is better**
    `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping a `Fully Relevant` item with a `Weakly Relevant` one is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`.
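
That penalty asymmetry can be checked directly with a generic NDCG sketch using the grade mapping above. This is not the framework's own metric code, and the ideal ranking here is computed from the returned list only, which is a simplification:

```python
import math

# Four-tier grades used directly as gains (linear scheme).
GRADE = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def ndcg_at_k(labels, k):
    """NDCG@k with the grade used directly as gain."""
    dcg = sum(GRADE[l] / math.log2(i + 2) for i, l in enumerate(labels[:k]))
    ideal = sorted((GRADE[l] for l in labels), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Demoting a `Fully Relevant` item below a `Weakly Relevant` one costs more NDCG than the same swap starting from `Mostly Relevant`.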
  
  The reported metrics are:
  
  - **`Primary_Metric_Score`** — Mean of the primary scorecard, with `Avg_Grade@10` normalized by the max grade (`3`).
  - **`NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`** — Primary scorecard for optimization decisions.
  - **`NDCG@5`, `NDCG@10`, `ERR@5`, `ERR@20`, `ERR@50`** — Secondary graded ranking quality slices.
  - **`Exact_Precision@K`** — Strict top-slot quality when only `Fully Relevant` counts.
  - **`Strong_Precision@K`** — Business-facing top-slot quality where `Fully Relevant + Mostly Relevant` count as strong positives.
  - **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts.
  - **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query.
  - **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K.
  - **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears.
  - **`Avg_Grade@10`** — Average relevance grade of the visible first page.
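
Minimal sketches of the slice metrics and the composite score, assuming the natural reading of the definitions above (cutoff `k` as the precision denominator, the judged-pool gain as the `Gain_Recall` denominator, and the `/3` normalization); the framework's own implementations may differ:

```python
GRADE = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def strong_precision_at_k(labels, k):
    # Fully + Mostly Relevant count as strong positives.
    return sum(1 for l in labels[:k] if GRADE[l] >= 2) / k

def useful_precision_at_k(labels, k):
    # Any non-Irrelevant result counts.
    return sum(1 for l in labels[:k] if GRADE[l] >= 1) / k

def gain_recall_at_k(labels, judged_pool, k):
    # Gain captured in the top-k versus total gain in the judged label pool.
    total = sum(GRADE[l] for l in judged_pool)
    return sum(GRADE[l] for l in labels[:k]) / total if total else 0.0

def mrr_strong_at_k(labels, k=10):
    # Reciprocal rank of the first strong result in the top-k, else 0.
    for i, l in enumerate(labels[:k], start=1):
        if GRADE[l] >= 2:
            return 1.0 / i
    return 0.0

def primary_metric_score(scorecard):
    # Mean of the primary scorecard with Avg_Grade@10 normalized by max grade 3.
    vals = dict(scorecard)
    vals["Avg_Grade@10"] = vals["Avg_Grade@10"] / 3.0
    return sum(vals.values()) / len(vals)
```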
  
  **Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
  
  ## Flows
  
  **Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
  
  **Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.
  
  **Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
  
  ### Audit
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  ## Web UI
  
  Features: dataset selector, dataset-scoped query list, single-query and batch evaluation, dataset-scoped batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.
  
  ## Batch reports
  
  Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
  
  To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON now also includes:
  
  - `request_id` — the exact `X-Request-ID` sent by the evaluator for that live search call
  - `top_label_sequence_top10` / `top_label_sequence_top20` — compact label sequence strings such as `1:L3 | 2:L1 | 3:L2`
  - `top_results` — a lightweight top-20 snapshot with `rank`, `spu_id`, `label`, title fields, and `relevance_score`
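
A sketch of how such a compact label-sequence string can be built. It assumes `Ln` encodes the numeric grade of the four-tier system, which matches the example format above but is not verified against the implementation:

```python
GRADE = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def label_sequence(labels, k):
    # Produces strings like "1:L3 | 2:L1 | 3:L2" (rank:grade-code).
    return " | ".join(f"{i}:L{GRADE[l]}" for i, l in enumerate(labels[:k], start=1))
```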
  
  The Markdown report now surfaces the same case context in a lighter human-readable form:
  
  - request id
  - top-10 / top-20 label sequence
  - top 5 result snapshot for quick scanning
  
  This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand.
  
  The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI.
  
  ## Ranking debug and LTR prep
  
  `debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work:
  
  - `retrieval_plan` — the effective text/image KNN plan for the query (`k`, `num_candidates`, boost, and whether the long-query branch was used).
  - `ltr_summary` — query-level summary over the visible top results: how many docs came from translation matches, text KNN, image KNN, fallback-to-ES cases, plus mean signal values.
  
  At the document level, `debug_info.per_result[*]` and `ranking_funnel.*.ltr_features` include stable features such as:
  
  - `es_score`, `text_score`, `knn_score`, `rerank_score`, `fine_score`
  - `source_score`, `translation_score`, `text_knn_score`, `image_knn_score`
  - `has_translation_match`, `has_text_knn`, `has_image_knn`, `text_score_fallback_to_es`
  
  This makes it easier to inspect bad cases and also gives us a near-direct feature inventory for downstream LTR experiments.
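
For LTR experiments the listed fields map almost directly onto a feature vector. A sketch (the feature names are the ones listed above; the record shape, lookup fallback, and zero defaults are assumptions):

```python
# Stable feature names taken from the list above.
LTR_FEATURES = [
    "es_score", "text_score", "knn_score", "rerank_score", "fine_score",
    "source_score", "translation_score", "text_knn_score", "image_knn_score",
    "has_translation_match", "has_text_knn", "has_image_knn",
    "text_score_fallback_to_es",
]

def feature_vector(per_result_entry):
    # Flatten one debug_info.per_result entry into a fixed-order numeric vector.
    # Missing features default to 0.0; booleans become 0.0/1.0.
    feats = per_result_entry.get("ltr_features", per_result_entry)
    return [float(feats.get(name, 0.0)) for name in LTR_FEATURES]
```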
  
  ## Caveats
  
  - Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
  - Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  - Backend restarts in automated tuning may need a short settle time before requests.
  
  ## Related docs
  
  - `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.