# Search Evaluation Framework

This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation uses a graded four-tier relevance system with a multi-metric primary scorecard instead of a single headline metric.

## What it does

1. Build an annotation set for a fixed query set.
2. Evaluate live search results against cached labels.
3. Run batch evaluation and keep historical reports with config snapshots.
4. Tune fusion parameters in a reproducible loop.

## Layout

| Path | Role |
|------|------|
| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
| `build_annotation_set.py` | CLI entry (build / batch / audit) |
| `serve_eval_web.py` | Web server for the evaluation UI |
| `tune_fusion.py` | Applies config variants, restarts the backend, runs batch eval, stores experiment reports |
| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
| `fusion_experiments_round1.json` | Broader first-round experiments |
| `queries/queries.txt` | Canonical evaluation queries |
| `README_Requirement.md` | Product/requirements reference |
| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
| `../start_eval_web.sh` | Same as `serve`, but sources `activate.sh`; prefer `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |

## Quick start (repo root)

Set the tenant if needed (`export TENANT_ID=163`). Prerequisites: a live search API, a running backend, and DashScope credentials when new LLM labels are required.

```bash
# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/start_eval.sh batch

# Deep rebuild: per-query full-corpus rerank (outside the search recall pool) + LLM labeling
# in batches along the global sort order (early stop; expensive)
./scripts/evaluation/start_eval.sh batch-rebuild

# UI: http://127.0.0.1:6010/
./scripts/evaluation/start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web
```

Explicit equivalents:

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 500 \
  --rerank-depth 10000 \
  --force-refresh-rerank \
  --force-refresh-labels \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```

Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits, not the deep rebuild pipeline).

### `start_eval.sh batch-rebuild` (deep annotation rebuild)

This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block above). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.

For **each** query in `queries.txt`, in order:

1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
2. **Full corpus** — Load the tenant's product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**) via `corpus_docs()` (cached in SQLite after the first load).
3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
4. **Skip "too easy" queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, the query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls are made for that query.
5. **Global sort** — The order to label is: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (deduped by `spu_id`; the pool wins).
6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**.

   **Early stop** (defaults in `eval_framework.constants`; overridable via CLI):

   - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed.
   - After that, a **bad batch** is one where **both** are true (strict **>**):
     - **Irrelevant** proportion **> 93.9%** (`--rebuild-irrelevant-stop-ratio`, default `0.939`), and
     - **(Irrelevant + Weakly Relevant)** proportion **> 95.9%** (`--rebuild-irrel-low-combined-stop-ratio`, default `0.959`).
       ("Weakly Relevant" is the weak tier; **Mostly Relevant** and **Fully Relevant** do not enter this sum.)
   - Count **consecutive** bad batches; **reset** the count to 0 on any batch that is not bad.
   - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**3** by default), or when **`--rebuild-max-batches`** (**40**) is reached, whichever comes first (up to **2000** docs per query at the default batch size).

   So labeling follows best-first order but **stops early** after **three** consecutive batches that are overwhelmingly Irrelevant (and Irrelevant + Weakly Relevant); the tail may never be judged.
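
The early-stop rule above can be sketched roughly as follows. This is a minimal illustration, not the framework's code: the function names are hypothetical, and the exact interaction between the minimum-batch floor and the streak counter is an assumption.

```python
def is_bad_batch(labels, irrelevant_stop=0.939, combined_stop=0.959):
    """A batch is 'bad' only when BOTH thresholds are exceeded (strict >)."""
    n = len(labels)
    irrelevant = sum(l == "Irrelevant" for l in labels) / n
    combined = sum(l in ("Irrelevant", "Weakly Relevant") for l in labels) / n
    return irrelevant > irrelevant_stop and combined > combined_stop

def label_with_early_stop(batches, min_batches=10, max_batches=40, stop_streak=3):
    """Walk label batches head-first; return how many batches were judged."""
    streak = 0
    done = 0
    for labels in batches:
        done += 1                      # stand-in for the real LLM judging call
        streak = streak + 1 if is_bad_batch(labels) else 0
        if done >= max_batches:
            break                      # hard cap: ~2000 docs at batch size 50
        if done >= min_batches and streak >= stop_streak:
            break                      # N consecutive bad batches past the floor
    return done
```

Under these assumptions, eight useful batches followed by an overwhelmingly Irrelevant tail stop after the 11th batch: the streak first reaches 3 once the 10-batch floor has been passed.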

**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass, with no rerank-skip rule and no LLM early-stop loop.

**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). The rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
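
A sketch of step 3's cache-or-call behavior. The table and column names here are modeled on the description above, not the framework's actual schema, and `rerank_api` stands in for the real client:

```python
import sqlite3

CHUNK = 80  # rerank API request size noted above

def rerank_outside_pool(conn, tenant_id, query, docs, pool_ids, rerank_api,
                        force_refresh=False):
    """Score every doc whose spu_id is outside the recall pool (a sketch)."""
    todo = [d for d in docs if d["spu_id"] not in pool_ids]
    if not force_refresh:
        # Reuse existing (tenant_id, query, spu_id) scores; only missing rows hit the API.
        cached = {row[0] for row in conn.execute(
            "SELECT spu_id FROM rerank_scores WHERE tenant_id=? AND query=?",
            (tenant_id, query))}
        todo = [d for d in todo if d["spu_id"] not in cached]
    for i in range(0, len(todo), CHUNK):
        chunk = todo[i:i + CHUNK]
        scores = rerank_api(query, chunk)   # one API request per <=80 docs
        conn.executemany(
            "INSERT OR REPLACE INTO rerank_scores"
            " (tenant_id, query, spu_id, score) VALUES (?, ?, ?, ?)",
            [(tenant_id, query, d["spu_id"], s) for d, s in zip(chunk, scores)])
    conn.commit()
```

A second run over the same `(tenant_id, query)` without `force_refresh` issues no API calls at all, which is why incremental reruns are cheap.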

## Artifacts

Default root: `artifacts/search_evaluation/`

- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
- `query_builds/` — per-query pooled build outputs
- `batch_reports/` — batch JSON, Markdown, config snapshots
- `audits/` — label-quality audit summaries
- `tuning_runs/` — fusion experiment outputs and config snapshots

## Labels

- **Fully Relevant** — Matches the intended product type and all explicit required attributes.
- **Mostly Relevant** — The main intent matches and the item is a strong substitute, but some attributes are missing, weaker, or slightly off.
- **Weakly Relevant** — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match.
- **Irrelevant** — A type mismatch or important conflicts make it a poor search result.

## Metric design

This framework follows graded ranking evaluation, closer to e-commerce practice, instead of collapsing everything into binary relevance.

- **Primary scorecard**
  The primary evaluation set is:
  `NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`.
- **Composite tuning score: `Primary_Metric_Score`**
  For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`).
- **Gain scheme**
  `Fully Relevant=3`, `Mostly Relevant=2`, `Weakly Relevant=1`, `Irrelevant=0`.
  We keep the relevance grades `3/2/1/0`, and the current implementation uses the grade values directly as linear gains, so the gap between the top tier and the lower tiers is less aggressive than with exponential gains.
- **Why this is better**
  `NDCG` differentiates exact matches, strong substitutes, and weak substitutes, so swapping a `Fully Relevant` item with a `Weakly Relevant` one is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`.
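
As a concrete sketch of the gain scheme (linear gains as described above; the `log2(rank + 1)` position discount is the standard NDCG choice and an assumption here, not confirmed by this README):

```python
import math

GRADE = {"Fully Relevant": 3, "Mostly Relevant": 2,
         "Weakly Relevant": 1, "Irrelevant": 0}

def dcg(labels, k):
    # Grade used directly as gain; standard log2 position discount.
    return sum(GRADE[l] / math.log2(i + 2) for i, l in enumerate(labels[:k]))

def ndcg(labels, judged_pool, k):
    """Normalize against the ideal ordering of the judged label pool."""
    ideal = sorted(judged_pool, key=GRADE.get, reverse=True)
    best = dcg(ideal, k)
    return dcg(labels, k) / best if best else 0.0
```

Swapping the `Fully Relevant` item with a `Weakly Relevant` one costs more NDCG than swapping `Mostly Relevant` with `Weakly Relevant`, which is exactly the property the scorecard relies on.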

The reported metrics are:

- **`Primary_Metric_Score`** — Mean score over the primary scorecard (with `Avg_Grade@10` normalized by the max grade `3`).
- **`NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`** — Primary scorecard for optimization decisions.
- **`NDCG@5`, `NDCG@10`, `ERR@5`, `ERR@20`, `ERR@50`** — Secondary graded ranking-quality slices.
- **`Exact_Precision@K`** — Strict top-slot quality where only `Fully Relevant` counts.
- **`Strong_Precision@K`** — Business-facing top-slot quality where `Fully Relevant + Mostly Relevant` count as strong positives.
- **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts.
- **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query.
- **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K.
- **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears.
- **`Avg_Grade@10`** — Average relevance grade of the visible first page.
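
The composite score is then a plain mean over the primary scorecard. A sketch (metric names follow the list above; the `/3` factor is the documented max-grade normalization for `Avg_Grade@10`):

```python
PRIMARY = ["NDCG@20", "NDCG@50", "ERR@10", "Strong_Precision@10",
           "Strong_Precision@20", "Useful_Precision@50",
           "Avg_Grade@10", "Gain_Recall@20"]

def primary_metric_score(metrics):
    """Mean of the primary scorecard; Avg_Grade@10 is on a 0-3 scale,
    so it is normalized by the max grade before averaging."""
    vals = [metrics[name] / 3.0 if name == "Avg_Grade@10" else metrics[name]
            for name in PRIMARY]
    return sum(vals) / len(vals)
```

A perfect run (every ratio metric at `1.0`, `Avg_Grade@10` at `3.0`) therefore scores exactly `1.0`.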

**Labeler modes:** `simple` (default) runs one judging pass per batch with the standard relevance prompt; `complex` adds query-profile extraction plus extra guardrails (for structured experiments).

## Flows

**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API, but scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
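
The cached-scoring rule amounts to a lookup with an `Irrelevant` fallback. A sketch with hypothetical helper names (the real lookup lives in the SQLite store):

```python
def label_for(cache, tenant_id, query, spu_id):
    # Labels are keyed by (tenant_id, query, spu_id); anything unlabeled
    # counts as Irrelevant, so coverage gaps lower metrics rather than hide.
    return cache.get((tenant_id, query, spu_id), "Irrelevant")

def score_recall(cache, tenant_id, query, hits):
    """Map recalled spu_ids to labels and report judged coverage."""
    labels = [label_for(cache, tenant_id, query, h) for h in hits]
    coverage = sum((tenant_id, query, h) in cache for h in hits) / len(hits)
    return labels, coverage  # the UI surfaces a tip when coverage < 1.0
```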

**Rebuild vs incremental `build`:** The deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses the `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.

**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
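
The tuning loop amounts to the following sketch. The callables stand in for the real config/restart/eval steps, and the actual `tune_fusion.py` may differ in details; this only illustrates the apply-restart-evaluate-score cycle:

```python
def tune(variants, apply_config, restart_backend, run_batch_eval,
         score_metric="Primary_Metric_Score", apply_best=False):
    """Score each config variant, optionally leaving the winner applied."""
    results = []
    for variant in variants:
        apply_config(variant)
        restart_backend()              # may need a short settle time (see Caveats)
        report = run_batch_eval()      # batch eval against cached labels
        results.append((report[score_metric], variant))
    best_score, best_variant = max(results, key=lambda r: r[0])
    if apply_best:
        apply_config(best_variant)     # re-apply the winning variant
        restart_backend()
    return best_variant, best_score
```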

### Audit

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

## Web UI

Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.

## Batch reports

Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including the gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.

To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON also includes:

- `request_id` — the exact `X-Request-ID` sent by the evaluator for that live search call
- `top_label_sequence_top10` / `top_label_sequence_top20` — compact label-sequence strings such as `1:L3 | 2:L1 | 3:L2`
- `top_results` — a lightweight top-20 snapshot with `rank`, `spu_id`, `label`, title fields, and `relevance_score`
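
The sequence strings follow a `rank:Lgrade` pattern. A sketch of the formatter; the L-number-to-tier mapping (`L3` = Fully Relevant down to `L0` = Irrelevant) is inferred from the gain scheme, not stated explicitly:

```python
GRADE = {"Fully Relevant": 3, "Mostly Relevant": 2,
         "Weakly Relevant": 1, "Irrelevant": 0}

def label_sequence(labels, k):
    """Render e.g. ['Fully Relevant', 'Weakly Relevant', 'Mostly Relevant']
    as '1:L3 | 2:L1 | 3:L2' for the first k ranks."""
    return " | ".join(f"{rank}:L{GRADE[label]}"
                      for rank, label in enumerate(labels[:k], start=1))
```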
  
The Markdown report surfaces the same case context in a lighter, human-readable form:

- request id
- top-10 / top-20 label sequence
- top-5 result snapshot for quick scanning

This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand.

The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI.

## Ranking debug and LTR prep

`debug_info` exposes two extra layers that are useful for tuning and future learning-to-rank work:

- `retrieval_plan` — the effective text/image KNN plan for the query (`k`, `num_candidates`, boost, and whether the long-query branch was used).
- `ltr_summary` — a query-level summary over the visible top results: how many docs came from translation matches, text KNN, image KNN, fallback-to-ES cases, plus mean signal values.

At the document level, `debug_info.per_result[*]` and `ranking_funnel.*.ltr_features` include stable features such as:

- `es_score`, `text_score`, `knn_score`, `rerank_score`, `fine_score`
- `source_score`, `translation_score`, `text_knn_score`, `image_knn_score`
- `has_translation_match`, `has_text_knn`, `has_image_knn`, `text_score_fallback_to_es`

This makes it easier to inspect bad cases and also gives us a near-direct feature inventory for downstream LTR experiments.
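
For example, one `per_result` entry could be flattened into a fixed-order numeric vector for LTR training like this (a sketch; the zero default for missing fields is an assumption):

```python
LTR_FEATURES = [
    "es_score", "text_score", "knn_score", "rerank_score", "fine_score",
    "source_score", "translation_score", "text_knn_score", "image_knn_score",
    "has_translation_match", "has_text_knn", "has_image_knn",
    "text_score_fallback_to_es",
]

def feature_vector(per_result):
    # Booleans become 0.0/1.0; missing fields default to 0.0.
    return [float(per_result.get(name, 0.0)) for name in LTR_FEATURES]
```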
  
## Caveats

- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus × query matrix.
- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
- Backend restarts in automated tuning may need a short settle time before requests succeed.

## Related docs

- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.