  # Search Evaluation Framework
  
  This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
  
  **Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
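
The cache lookup described above can be sketched as follows. This is a minimal illustration, not the package's code: `score_with_cache` and the in-memory `label_cache` dict are hypothetical stand-ins for the SQLite-backed store, and the coverage ratio is one assumed way to decide when a coverage tip is worth showing.

```python
from typing import Dict, List, Tuple

IRRELEVANT = "Irrelevant"


def score_with_cache(
    query: str,
    recalled_spu_ids: List[str],
    label_cache: Dict[Tuple[str, str], str],
) -> Tuple[List[str], float]:
    """Return one label per recalled item plus the share of items that had a cached label."""
    labels: List[str] = []
    labeled = 0
    for spu_id in recalled_spu_ids:
        label = label_cache.get((query, spu_id))
        if label is None:
            # No cached judgment: score the hit as Irrelevant.
            labels.append(IRRELEVANT)
        else:
            labels.append(label)
            labeled += 1
    coverage = labeled / len(recalled_spu_ids) if recalled_spu_ids else 1.0
    return labels, coverage


# Hypothetical cache contents for one query.
cache = {("red dress", "101"): "Exact", ("red dress", "102"): "Partial"}
labels, coverage = score_with_cache("red dress", ["101", "103", "102", "104"], cache)
print(labels, coverage)
```

A coverage value below 1.0 means some hits were scored as `Irrelevant` only because they were never judged, which is the situation the coverage tips point out.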
  
  ## What it does
  
  1. Build an annotation set for a fixed query set.
  2. Evaluate live search results against cached labels.
  3. Run batch evaluation and keep historical reports with config snapshots.
  4. Tune fusion parameters in a reproducible loop.
  
  ## Layout
  
  | Path | Role |
  |------|------|
  | `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
  | `build_annotation_set.py` | CLI entry (build / batch / audit) |
  | `serve_eval_web.py` | Web server for the evaluation UI |
  | `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
  | `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
  | `fusion_experiments_round1.json` | Broader first-round experiments |
  | `queries/queries.txt` | Canonical evaluation queries |
  | `README_Requirement.md` | Product/requirements reference |
  | `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
  | `../start_eval_web.sh` | Same as `serve`, with `activate.sh`. Alternatively, run `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
  
  ## Quick start (repo root)
  
  Set the tenant ID if needed (`export TENANT_ID=163`). You need a live search API, a running backend, and DashScope access whenever new LLM labels are required.
  
  ```bash
  # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
  ./scripts/evaluation/quick_start_eval.sh batch
  
  # Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive)
  ./scripts/evaluation/quick_start_eval.sh batch-rebuild
  
  # UI: http://127.0.0.1:6010/
  ./scripts/evaluation/quick_start_eval.sh serve
  # or: ./scripts/service_ctl.sh start eval-web
  ```
  
  Explicit equivalents:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --search-depth 500 \
    --rerank-depth 10000 \
    --force-refresh-rerank \
    --force-refresh-labels \
    --language en \
    --labeler-mode simple
  
  ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --host 127.0.0.1 \
    --port 6010
  ```
  
  Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
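
To make the caching rule concrete, here is a small sketch of how the set of pairs sent to the LLM could be chosen. `pairs_to_judge` and the in-memory `cached` set are hypothetical; the real lookup goes through the SQLite store.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (query, spu_id)


def pairs_to_judge(
    hits: Iterable[Pair],
    cached: Set[Pair],
    force_refresh_labels: bool = False,
) -> List[Pair]:
    """Pick the (query, spu_id) pairs that need an LLM judgment, keeping first-seen order."""
    seen: Set[Pair] = set()
    todo: List[Pair] = []
    for pair in hits:
        if pair in seen:
            continue  # the same hit never needs two judgments in one run
        seen.add(pair)
        # Cached pairs are skipped unless a refresh is forced.
        if force_refresh_labels or pair not in cached:
            todo.append(pair)
    return todo


hits = [("q1", "a"), ("q1", "b"), ("q2", "a")]
cached = {("q1", "a")}
incremental = pairs_to_judge(hits, cached)
refreshed = pairs_to_judge(hits, cached, force_refresh_labels=True)
print(incremental)
print(refreshed)
```

In both modes the input is only the live top-`k` hits, which is why a forced refresh here stays much cheaper than the deep rebuild.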
  
  ### `quick_start_eval.sh batch-rebuild` (deep annotation rebuild)
  
  This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit `build` command in the block above). It does **not** run the `batch` subcommand, so there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.
  
  For **each** query in `queries.txt`, in order:
  
  1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **500** hits form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
  2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
  3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
  4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
  5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
  6. **LLM labeling** — Walk that list **from the head** in batches of **50** (not “take top-K then label only K”): each batch logs **exact_ratio** and **irrelevant_ratio**. After at least **20** batches, stop when **3** consecutive batches have irrelevant_ratio **> 92%**; never more than **40** batches (**2000** docs max per query). So labeling follows the best-first order but **stops early**; the tail of the sorted list may never be judged.
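
Steps 4 and 5 above can be sketched in a few lines. The function names and the plain-dict score map are assumptions made for illustration; the defaults mirror the numbers quoted in the steps (threshold **0.5**, skip count **1000**).

```python
from typing import Dict, List


def should_skip_query(
    outside_scores: Dict[str, float],
    high_threshold: float = 0.5,
    skip_count: int = 1000,
) -> bool:
    """Step 4: skip when more than `skip_count` non-pool docs score above the threshold."""
    high = sum(1 for score in outside_scores.values() if score > high_threshold)
    return high > skip_count


def labeling_order(pool: List[str], outside_scores: Dict[str, float]) -> List[str]:
    """Step 5: pool in search rank order, then the rest by descending rerank score."""
    ordered: List[str] = []
    seen = set()
    for spu_id in pool:
        if spu_id not in seen:
            seen.add(spu_id)
            ordered.append(spu_id)
    # Dedupe by spu_id: anything already in the pool keeps its pool position.
    tail = sorted(
        (spu_id for spu_id in outside_scores if spu_id not in seen),
        key=lambda spu_id: outside_scores[spu_id],
        reverse=True,
    )
    ordered.extend(tail)
    return ordered


order = labeling_order(["p2", "p1"], {"d1": 0.2, "d2": 0.9, "p1": 0.7})
print(order)
print(should_skip_query({"d1": 0.2, "d2": 0.9}, skip_count=1))
```

Because the pool is placed first regardless of score, treating pool hits as rerank score **1** and ordering them by search rank give the same head of the list.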
  
  **Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
  
  **Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
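
As a rough sketch of how these knobs interact, the loop below walks an ordered list in batches and applies the early-stop rule. It is an illustration under stated assumptions, not the code in `eval_framework`: `label_with_early_stop` and `judge_batch` are hypothetical names, the keyword arguments are modeled on the `--rebuild-*` flags, and the streak is assumed to count across all batches while the stop check only applies once the minimum batch count is reached.

```python
from typing import Callable, List, Sequence


def label_with_early_stop(
    ordered_ids: Sequence[str],
    judge_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 50,
    min_batches: int = 20,
    max_batches: int = 40,
    irrelevant_stop_ratio: float = 0.92,
    irrelevant_stop_streak: int = 3,
) -> List[str]:
    """Label from the head of the list, stopping once the tail looks mostly irrelevant."""
    labels: List[str] = []
    streak = 0
    batches_done = 0
    for start in range(0, len(ordered_ids), batch_size):
        if batches_done >= max_batches:
            break  # hard cap: max_batches * batch_size documents per query
        batch_labels = judge_batch(ordered_ids[start:start + batch_size])
        labels.extend(batch_labels)
        batches_done += 1
        irrelevant_ratio = batch_labels.count("Irrelevant") / max(len(batch_labels), 1)
        # A batch at or below the ratio resets the consecutive-batch streak.
        streak = streak + 1 if irrelevant_ratio > irrelevant_stop_ratio else 0
        if batches_done >= min_batches and streak >= irrelevant_stop_streak:
            break
    return labels


def head_only(batch: Sequence[str]) -> List[str]:
    """Stub judge: only the first 100 documents are relevant."""
    return ["Exact" if int(doc_id) < 100 else "Irrelevant" for doc_id in batch]


def always_exact(batch: Sequence[str]) -> List[str]:
    """Stub judge: everything is relevant, so only the batch cap stops the loop."""
    return ["Exact"] * len(batch)


ids = [str(i) for i in range(5000)]
stopped_early = label_with_early_stop(ids, head_only)
hit_cap = label_with_early_stop(ids, always_exact)
print(len(stopped_early), len(hit_cap))
```

With the defaults, the first stub stops at the 20-batch minimum (1000 documents) and the second runs into the 40-batch cap (2000 documents).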
  
  ## Artifacts
  
  Default root: `artifacts/search_evaluation/`
  
  - `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  - `query_builds/` — per-query pooled build outputs
  - `batch_reports/` — batch JSON, Markdown, config snapshots
  - `audits/` — label-quality audit summaries
  - `tuning_runs/` — fusion experiment outputs and config snapshots
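
To see what the SQLite file actually contains, one option is to list its tables with Python's standard `sqlite3` module. The sketch below runs against an in-memory database so it is self-contained; only the `rerank_scores` table name comes from this README, and the column layout in the demo is an assumption.

```python
import sqlite3
from typing import List


def list_tables(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demo on an in-memory database; the columns here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rerank_scores (tenant_id TEXT, query TEXT, spu_id TEXT, score REAL)"
)
tables = list_tables(conn)
print(tables)
conn.close()
```

For the real file, connect to `artifacts/search_evaluation/search_eval.sqlite3` instead. Opening it read-only with `sqlite3.connect("file:<path>?mode=ro", uri=True)` avoids creating an empty database if the path is wrong.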
  
  ## Labels
  
  - **Exact** — Matches intended product type and all explicit required attributes.
  - **Partial** — Main intent matches; attributes missing, approximate, or weaker.
  - **Irrelevant** — Type mismatch or conflicting required attributes.
  
  **Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
  
  ## Flows
  
  **Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
  
  **Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.
  
  **Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
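
The selection step behind `--apply-best` can be pictured roughly as below. This is a hypothetical sketch: the result layout (`name`, `metrics`), the metric name in the demo, and the assumption that a higher score is better are all illustrative and are not taken from `tune_fusion.py`.

```python
from typing import Any, Dict, List, Optional


def pick_best(results: List[Dict[str, Any]], score_metric: str) -> Optional[Dict[str, Any]]:
    """Return the experiment with the highest value of `score_metric`, if any reported it."""
    scored = [r for r in results if score_metric in r.get("metrics", {})]
    if not scored:
        return None
    # Assumption: larger metric values are better.
    return max(scored, key=lambda r: r["metrics"][score_metric])


# Hypothetical experiment results; names and numbers are made up for the demo.
results = [
    {"name": "baseline", "metrics": {"demo_metric": 0.61}},
    {"name": "variant_a", "metrics": {"demo_metric": 0.66}},
    {"name": "variant_b", "metrics": {}},  # run failed to report the metric
]
best = pick_best(results, "demo_metric")
print(best["name"] if best else None)
```

Skipping experiments that did not report the chosen metric keeps a failed run from being selected by accident.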
  
  ### Audit
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  ## Web UI
  
  Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
  
  ## Batch reports
  
  Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
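
As an illustration of the label-distribution field, the sketch below turns a list of judged labels into per-label shares. Whether the real report stores counts or ratios is not stated here, so treat the output shape as an assumption; `label_distribution` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("Exact", "Partial", "Irrelevant")


def label_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of each known label; unknown label strings are ignored."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


distribution = label_distribution(["Exact", "Partial", "Irrelevant", "Irrelevant"])
print(distribution)
```

Comparing this distribution across runs is a quick way to spot a config change that shifts results from `Exact` toward `Irrelevant`.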
  
  ## Caveats
  
  - Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
  - Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  - Backend restarts in automated tuning may need a short settle time before requests.
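
One way to handle the settle time after a restart is to poll a readiness probe before sending evaluation traffic. The helper below is a generic sketch, not part of `tune_fusion.py`: `wait_until_ready` is a hypothetical name and the probe is whatever check suits your deployment.

```python
import time
from typing import Callable, List


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection errors (including urllib's URLError) mean "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Demo probe that fails twice before succeeding, standing in for a restarting backend.
attempts: List[int] = []


def flaky_probe() -> bool:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionRefusedError("backend still restarting")
    return True


ready = wait_until_ready(flaky_probe, timeout_s=5.0, sleep=lambda _seconds: None)
print(ready, len(attempts))
```

In practice the probe could issue a cheap HTTP request against the backend and return `True` on a successful response; the URL to use depends on your setup.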
  
  ## Related docs
  
  - `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.