  # Search Evaluation Framework
  
  This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
  
  It is designed around one core rule:
  
  - Annotation should be built offline first.
  - Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
  - Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
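
The cache-mapping rule above can be sketched as a plain lookup with an `Irrelevant` fallback. This is a minimal illustration: the dict-based store below stands in for the real SQLite label cache keyed by `(query, spu_id)`.

```python
# Map recalled spu_ids to cached labels; anything unlabeled counts as Irrelevant.
# `cached_labels` stands in for the SQLite label store keyed by (query, spu_id).
def evaluate_recall(query, recalled_spu_ids, cached_labels):
    labels = []
    uncovered = []
    for spu_id in recalled_spu_ids:
        label = cached_labels.get((query, spu_id))
        if label is None:
            label = "Irrelevant"      # offline-first rule: no cached label => Irrelevant
            uncovered.append(spu_id)  # surfaced to the operator as a coverage tip
        labels.append(label)
    return labels, uncovered

labels, uncovered = evaluate_recall(
    "red dress",
    ["s1", "s2", "s3"],
    {("red dress", "s1"): "Exact", ("red dress", "s3"): "Partial"},
)
# labels == ["Exact", "Irrelevant", "Partial"], uncovered == ["s2"]
```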
  
  ## Goals
  
  The framework supports four related tasks:
  
  1. Build an annotation set for a fixed query set.
  2. Evaluate a live search result list against that annotation set.
  3. Run batch evaluation and store historical reports with config snapshots.
  4. Tune fusion parameters reproducibly.
  
  ## Files
  
  - `eval_framework.py`
    Search evaluation core implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
  - `build_annotation_set.py`
    Thin CLI entrypoint into `eval_framework.py`.
  - `serve_eval_web.py`
    Thin web entrypoint into `eval_framework.py`.
  - `tune_fusion.py`
    Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
  - `fusion_experiments_shortlist.json`
    A compact experiment set for practical tuning.
  - `fusion_experiments_round1.json`
    A broader first-round experiment set.
  - `queries/queries.txt`
    The canonical evaluation query set.
  - `README_Requirement.md`
    Requirement reference document.
  - `quick_start_eval.sh`
    Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
  - `../start_eval_web.sh`
    Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of core services.
  
  ## Quick start (from repo root)
  
  Set the tenant if needed (`export TENANT_ID=163`). Requires the live search API, a working backend, and a DashScope key whenever the batch step needs new LLM labels.
  
  ```bash
  # 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
  ./scripts/evaluation/quick_start_eval.sh batch
  
  # Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
  ./scripts/evaluation/quick_start_eval.sh batch-rebuild
  
  # 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
  ./scripts/evaluation/quick_start_eval.sh serve
  # Or: ./scripts/service_ctl.sh start eval-web
  ```
  
  Equivalent explicit commands:
  
  ```bash
  # Safe default: no --force-refresh-labels
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  
  # Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple \
    --force-refresh-labels
  
  ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --host 127.0.0.1 \
    --port 6010
  ```
  
  **Batch behavior:** There is no “skip queries already processed”. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
  
  ## Storage Layout
  
  All generated artifacts are under:
  
  - `/data/saas-search/artifacts/search_evaluation`
  
  Important subpaths:
  
  - `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
    Main cache and annotation store.
  - `/data/saas-search/artifacts/search_evaluation/query_builds`
    Per-query pooled annotation-set build artifacts.
  - `/data/saas-search/artifacts/search_evaluation/batch_reports`
    Batch evaluation JSON, Markdown reports, and config snapshots.
  - `/data/saas-search/artifacts/search_evaluation/audits`
    Audit summaries for label quality checks.
  - `/data/saas-search/artifacts/search_evaluation/tuning_runs`
    Fusion experiment summaries and per-experiment config snapshots.
  
  ## SQLite Schema Summary
  
  The main tables in `search_eval.sqlite3` are:
  
  - `corpus_docs`
    Cached product corpus for the tenant.
  - `rerank_scores`
    Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
  - `relevance_labels`
    Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
  - `query_profiles`
    Structured query-intent profiles extracted before labeling.
  - `build_runs`
    Per-query pooled-build records.
  - `batch_runs`
    Batch evaluation history.
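
The upsert behavior of the label cache can be sketched with the standard library `sqlite3` module. The exact column names here are assumptions for illustration; only the composite key `(tenant_id, query_text, spu_id)` is taken from the schema summary above.

```python
import sqlite3

# Minimal sketch of the relevance_labels store; real column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE relevance_labels (
    tenant_id  TEXT NOT NULL,
    query_text TEXT NOT NULL,
    spu_id     TEXT NOT NULL,
    label      TEXT NOT NULL,
    PRIMARY KEY (tenant_id, query_text, spu_id)
)""")

def upsert_label(tenant_id, query_text, spu_id, label):
    # Upsert keyed by (tenant_id, query_text, spu_id), as batch runs do.
    conn.execute(
        "INSERT INTO relevance_labels VALUES (?, ?, ?, ?) "
        "ON CONFLICT(tenant_id, query_text, spu_id) DO UPDATE SET label = excluded.label",
        (tenant_id, query_text, spu_id, label),
    )

upsert_label("163", "red dress", "s1", "Partial")
upsert_label("163", "red dress", "s1", "Exact")  # re-labeling overwrites the row
row = conn.execute("SELECT label FROM relevance_labels").fetchone()
# row == ("Exact",) and the table still holds a single row
```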
  
  ## Label Semantics
  
  Three labels are used throughout:
  
  - `Exact`
    Fully matches the intended product type and all explicit required attributes.
  - `Partial`
    Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
  - `Irrelevant`
    Product type mismatches, or explicit required attributes conflict.
  
  The framework always uses:
  
  - LLM-based batched relevance classification
  - caching and retry logic for robust offline labeling
  
  There are now two labeler modes:
  
  - `simple`
    Default. A single, self-contained LLM judging pass per batch, using the standard relevance prompt.
  - `complex`
    Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
  
  ## Offline-First Workflow
  
  ### 1. Refresh labels for the evaluation query set
  
  For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
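
A sketch of how such metrics are computed over a ranked label list. Assumption: `MAP_3` treats only `Exact` as relevant and `MAP_2_3` treats both `Partial` and `Exact` as relevant; the authoritative definitions live in `eval_framework.py`.

```python
# P@k and average-precision variants over a ranked list of labels.
def precision_at_k(labels, k, relevant={"Exact", "Partial"}):
    return sum(1 for l in labels[:k] if l in relevant) / k

def average_precision(labels, relevant):
    hits, score = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label in relevant:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

ranked = ["Exact", "Irrelevant", "Partial", "Exact", "Irrelevant"]
p_at_5 = precision_at_k(ranked, 5)            # 3/5 = 0.6
map_3 = average_precision(ranked, {"Exact"})  # (1/1 + 2/4) / 2 = 0.75
map_2_3 = average_precision(ranked, {"Exact", "Partial"})
```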
  
  Example (fills missing labels only; recommended default):
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
  
  This command does two things:
  
  - runs **every** query in the file against the live backend (no skip list)
  - with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
  
  After this step, single-query evaluation can run in cached mode without calling the LLM again.
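
The two labeling modes reduce to a simple filter over recalled hits. An illustrative sketch, with a plain set standing in for the SQLite cache:

```python
# Which (query, spu_id) pairs get sent to the LLM on a batch run. With
# force_refresh every recalled hit is re-sent; otherwise only uncached pairs.
def pairs_to_label(query, recalled_spu_ids, cached, force_refresh=False):
    if force_refresh:
        return list(recalled_spu_ids)
    return [s for s in recalled_spu_ids if (query, s) not in cached]

cached = {("red dress", "s1"), ("red dress", "s2")}
assert pairs_to_label("red dress", ["s1", "s2", "s3"], cached) == ["s3"]
assert pairs_to_label("red dress", ["s1", "s2", "s3"], cached, force_refresh=True) == ["s1", "s2", "s3"]
```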
  
  ### 2. Optional pooled build
  
  The framework also supports a heavier pooled build that combines:
  
  - top search results
  - top full-corpus reranker results
  
  Example:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --search-depth 1000 \
    --rerank-depth 10000 \
    --annotate-search-top-k 100 \
    --annotate-rerank-top-k 120 \
    --language en
  ```
  
  This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
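
The pooling step amounts to an order-preserving union of the two candidate lists. A sketch under the assumption that deduplication keeps the first occurrence:

```python
# Annotation pool = search top-k plus full-corpus reranker top-k,
# deduplicated while preserving first-seen order.
def build_pool(search_top, rerank_top, annotate_search_top_k, annotate_rerank_top_k):
    pool, seen = [], set()
    for spu_id in search_top[:annotate_search_top_k] + rerank_top[:annotate_rerank_top_k]:
        if spu_id not in seen:
            seen.add(spu_id)
            pool.append(spu_id)
    return pool

pool = build_pool(["a", "b", "c"], ["b", "d"], 2, 2)
# pool == ["a", "b", "d"]
```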
  
  ## Why Single-Query Evaluation Was Slow
  
  If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
  
  - perform live search
  - detect recalled but unlabeled products
  - call the LLM to label them
  
  That is not the intended steady-state evaluation path.
  
  The UI/API is now configured to prefer cached evaluation:
  
  - default single-query evaluation uses `auto_annotate=false`
  - unlabeled recalled results are treated as `Irrelevant`
  - the response includes tips explaining that coverage gap
  
  If you want stable, fast evaluation:
  
  1. prebuild labels offline
  2. use cached single-query evaluation
  
  ## Web UI
  
  Start the evaluation UI:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --host 127.0.0.1 \
    --port 6010
  ```
  
  The UI provides:
  
  - query list loaded from `queries.txt`
  - single-query evaluation
  - batch evaluation
  - history of batch reports
  - top recalled results
  - missed `Exact` and `Partial` products that were not recalled
  - tips about unlabeled hits treated as `Irrelevant`
  
  ### Single-query response behavior
  
  For a single query:
  
  1. live search returns recalled `spu_id` values
  2. the framework looks up cached labels by `(query, spu_id)`
  3. unlabeled recalled items are counted as `Irrelevant`
  4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`
  
  This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
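
Step 4 above is a set difference between the cached relevant products and the recalled list. A minimal sketch, with `cached_labels` as a per-query `spu_id -> label` map:

```python
# Missed Exact/Partial: cached relevant products for the query that the live
# search did not recall.
def missed_relevant(recalled_spu_ids, cached_labels):
    recalled = set(recalled_spu_ids)
    return sorted(
        spu_id
        for spu_id, label in cached_labels.items()
        if label in {"Exact", "Partial"} and spu_id not in recalled
    )

missed = missed_relevant(
    ["s1", "s4"],
    {"s1": "Exact", "s2": "Partial", "s3": "Irrelevant", "s4": "Exact", "s5": "Exact"},
)
# missed == ["s2", "s5"]
```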
  
  ## CLI Commands
  
  ### Build pooled annotation artifacts
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
  ```
  
  ### Run batch evaluation
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
  
  ### Audit annotation quality
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  This checks cached labels against current guardrails and reports suspicious cases.
  
  ## Batch Reports
  
  Each batch run stores:
  
  - aggregate metrics
  - per-query metrics
  - label distribution
  - timestamp
  - config snapshot from `/admin/config`
  
  Reports are written as:
  
  - Markdown for easy reading
  - JSON for downstream processing
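
A sketch of emitting the paired JSON/Markdown outputs. The field names here are illustrative, not the exact schema `eval_framework.py` writes:

```python
import json
from datetime import datetime, timezone

# Render one batch report as (json_text, markdown_text); field names assumed.
def render_reports(aggregate, per_query, label_counts, config_snapshot):
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "aggregate": aggregate,
        "per_query": per_query,
        "label_distribution": label_counts,
        "config_snapshot": config_snapshot,
    }
    md_lines = ["# Batch Evaluation Report", ""]
    md_lines += [f"- {name}: {value:.4f}" for name, value in aggregate.items()]
    return json.dumps(report, indent=2), "\n".join(md_lines)

json_text, md_text = render_reports(
    {"P@5": 0.6, "MAP_3": 0.75},
    [{"query": "red dress", "P@5": 0.6}],
    {"Exact": 3, "Partial": 4, "Irrelevant": 43},
    {"fusion": {"translation_weight": 0.2}},
)
```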
  
  ## Fusion Tuning
  
  The tuning runner applies experiment configs sequentially and records the outcome.
  
  Example:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/tune_fusion.py \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
    --score-metric MAP_3 \
    --apply-best
  ```
  
  What it does:
  
  1. writes an experiment config into `config/config.yaml`
  2. restarts backend
  3. runs batch evaluation
  4. stores the per-experiment result
  5. optionally applies the best experiment at the end
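
The five steps above can be sketched as an experiment loop. `apply_config`, `restart_backend`, and `run_batch_eval` are hypothetical stand-ins injected as callables, not the real functions in `tune_fusion.py`:

```python
# Sequentially apply each experiment config, evaluate, and track the best score.
def tune(experiments, run_batch_eval, apply_config, restart_backend,
         score_metric="MAP_3", apply_best=False):
    results = []
    for exp in experiments:
        apply_config(exp["config"])   # 1) write variant into config/config.yaml
        restart_backend()             # 2) restart so the config takes effect
        metrics = run_batch_eval()    # 3) batch evaluation against cached labels
        results.append({"name": exp["name"], "score": metrics[score_metric]})
    best = max(results, key=lambda r: r["score"])
    if apply_best:                    # 5) optionally re-apply the winner
        apply_config(next(e["config"] for e in experiments if e["name"] == best["name"]))
        restart_backend()
    return best, results

scores = {"a": {"MAP_3": 0.70}, "b": {"MAP_3": 0.75}}
state = {"cfg": None}
best, results = tune(
    [{"name": "a", "config": "cfg-a"}, {"name": "b", "config": "cfg-b"}],
    run_batch_eval=lambda: scores[state["cfg"][-1]],
    apply_config=lambda c: state.update(cfg=c),
    restart_backend=lambda: None,
    apply_best=True,
)
# best["name"] == "b" and the winning config is left applied
```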
  
  ## Current Practical Recommendation
  
  For day-to-day evaluation:
  
  1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
  2. run the web UI or normal batch evaluation in cached mode
  3. only force-refresh labels again when:
     - the query set changes
     - the product corpus changes materially
     - the labeling logic changes
  
  ## Caveats
  
  - The current label cache is query-specific, not a full all-products all-queries matrix.
  - Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
  - The backend can be briefly unstable immediately after a restart; scripts should wait a few seconds before issuing requests.
  - Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
  
  ## Related Requirement Docs
  
  - `README_Requirement.md`
  - `README_Requirement_zh.md`
  
  These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.