# Search Evaluation Framework

This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.

## What it does

1. Build an annotation set for a fixed query set.
2. Evaluate live search results against cached labels.
3. Run batch evaluation and keep historical reports with config snapshots.
4. Tune fusion parameters in a reproducible loop.

## Layout

| Path | Role |
|------|------|
| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
| `build_annotation_set.py` | CLI entry (build / batch / audit) |
| `serve_eval_web.py` | Web server for the evaluation UI |
| `tune_fusion.py` | Applies config variants, restarts the backend, runs batch eval, stores experiment reports |
| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
| `fusion_experiments_round1.json` | Broader first-round experiments |
| `queries/queries.txt` | Canonical evaluation queries |
| `README_Requirement.md` | Product/requirements reference |
| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild`, or `serve` |
| `../start_eval_web.sh` | Same as `serve` but sources `activate.sh`; alternatively use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` also starts eval-web |

## Quick start (repo root)

Prerequisites: a running backend with a live search API, plus DashScope credentials whenever new LLM labels must be generated. Set the tenant if needed (`export TENANT_ID=163`).

```bash
# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Full re-label of current top_k recall (expensive)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# UI: http://127.0.0.1:6010/
./scripts/evaluation/quick_start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web
```

Explicit equivalents:

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  ... same args ... \
  --force-refresh-labels

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```

Each batch run walks the full queries file. With `--force-refresh-labels`, every recalled `spu_id` in the window is re-sent to the LLM and upserted. Without it, only missing labels are filled.
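
The cache-fill rule above can be sketched as follows (an illustrative sketch; the function and variable names are not taken from the framework):

```python
def pairs_to_label(query, recalled_spu_ids, cached, force_refresh=False):
    """Return the (query, spu_id) pairs that must be sent to the LLM.

    cached: set of (query, spu_id) pairs that already have labels.
    With force_refresh, every recalled pair is re-labeled and upserted;
    otherwise only pairs missing from the cache are labeled.
    """
    if force_refresh:
        return [(query, spu) for spu in recalled_spu_ids]
    return [(query, spu) for spu in recalled_spu_ids
            if (query, spu) not in cached]
```

This is why repeated `batch` runs get cheaper over time: the cache converges and the LLM is only consulted for newly recalled items.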

## Artifacts

Default root: `artifacts/search_evaluation/`

- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
- `query_builds/` — per-query pooled build outputs
- `batch_reports/` — batch JSON, Markdown, config snapshots
- `audits/` — label-quality audit summaries
- `tuning_runs/` — fusion experiment outputs and config snapshots

## Labels

- **Exact** — Matches intended product type and all explicit required attributes.
- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
- **Irrelevant** — Type mismatch or conflicting required attributes.
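
A three-level scale like this is typically consumed by ranking metrics as graded gains. A minimal sketch of that use, assuming a 2/1/0 gain mapping (the mapping is an assumption for illustration, not taken from the framework's metrics module):

```python
import math

# Assumed gain scale; the framework may weight labels differently.
GAIN = {"Exact": 2, "Partial": 1, "Irrelevant": 0}

def ndcg(labels, k=10):
    """nDCG@k over an ordered list of label strings for one query."""
    gains = [GAIN[label] for label in labels[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted((GAIN[label] for label in labels), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, `ndcg(["Irrelevant", "Exact"], k=2)` scores lower than `ndcg(["Exact", "Irrelevant"], k=2)` because the graded hit appears later in the ranking.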

**Labeler modes:** `simple` (default) runs one judging pass per batch with the standard relevance prompt; `complex` adds query-profile extraction and extra guardrails, for structured experiments.

## Flows

**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
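
The cached-only scoring step can be sketched as follows (names are hypothetical; the real store is the SQLite database, represented here as a dict):

```python
def score_recall(tenant_id, query, recalled_spu_ids, label_store):
    """Map each recalled spu_id to its cached label, defaulting to Irrelevant.

    label_store: mapping keyed by (tenant_id, query, spu_id) -> label string.
    Returns per-hit labels plus a coverage ratio the UI can surface as a tip.
    """
    labels = [label_store.get((tenant_id, query, spu), "Irrelevant")
              for spu in recalled_spu_ids]
    covered = sum((tenant_id, query, spu) in label_store
                  for spu in recalled_spu_ids)
    coverage = covered / len(recalled_spu_ids) if recalled_spu_ids else 1.0
    return labels, coverage
```

Low coverage means the metrics are pessimistic (unlabeled hits drag scores down), which is exactly when extending the cache with another `batch` run pays off.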

**Deeper pool:** `build_annotation_set.py build` merges deep search and full-corpus rerank windows before labeling (see CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`).

**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
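
That loop can be sketched as follows (all callables and the metric key are hypothetical stand-ins; the real runner is `tune_fusion.py`):

```python
def tune(experiments, apply_config, restart_backend, run_batch_eval,
         score_metric="ndcg", apply_best=False):
    """Try each fusion-config variant and return the best-scoring one.

    Each experiment is a dict of config overrides; run_batch_eval returns
    a metrics dict for whatever configuration the backend currently has.
    """
    results = []
    for exp in experiments:
        apply_config(exp)          # write this variant's config
        restart_backend()          # pick up the new fusion parameters
        report = run_batch_eval()  # full batch over queries.txt
        results.append((report[score_metric], exp))
    best_score, best_exp = max(results, key=lambda r: r[0])
    if apply_best:
        apply_config(best_exp)     # leave the winner in place
        restart_backend()
    return best_score, best_exp
```

Because every variant is scored by the same batch over the same query file against the same label cache, the comparison between experiments stays reproducible.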

### Audit

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

## Web UI

Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.

## Batch reports

Each run stores aggregate and per-query metrics, the label distribution, a timestamp, and an `/admin/config` snapshot, saved as Markdown and JSON under `batch_reports/`.

## Caveats

- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
- Backend restarts in automated tuning may need a short settle time before requests.
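
For the settle-time caveat, a readiness wait before the first request is safer than a fixed sleep. A minimal sketch, assuming some health probe is available (the probe itself is an assumption; adapt it to the backend's actual health endpoint):

```python
import time

def wait_until_ready(probe, timeout=30.0, interval=0.5):
    """Poll `probe` (a callable returning True once the backend answers)
    until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the backend restarts
        time.sleep(interval)
    return False
```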

## Related docs

- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.