# Search Evaluation Framework
This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation look up each recalled `spu_id` in the SQLite label cache. Items without a cached label are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
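
The lookup described above can be sketched as follows. This is a minimal illustration, not the code in `eval_framework/`: the table name `relevance_labels`, its columns, and the `score_recall` helper are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema; the real SQLite store may be laid out differently.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relevance_labels ("
    "tenant_id TEXT, query TEXT, spu_id TEXT, label TEXT, "
    "PRIMARY KEY (tenant_id, query, spu_id))"
)
conn.executemany(
    "INSERT INTO relevance_labels VALUES (?, ?, ?, ?)",
    [
        ("163", "red dress", "spu-1", "Exact"),
        ("163", "red dress", "spu-2", "Partial"),
    ],
)


def score_recall(conn, tenant_id, query, recalled_spu_ids):
    """Map recalled spu_ids to cached labels; unlabeled hits count as Irrelevant."""
    rows = conn.execute(
        "SELECT spu_id, label FROM relevance_labels "
        "WHERE tenant_id = ? AND query = ?",
        (tenant_id, query),
    ).fetchall()
    cached = dict(rows)
    labels = [cached.get(spu_id, "Irrelevant") for spu_id in recalled_spu_ids]
    missing = [spu_id for spu_id in recalled_spu_ids if spu_id not in cached]
    # Share of recalled items that already have a cached label.
    coverage = 1.0 - len(missing) / len(recalled_spu_ids) if recalled_spu_ids else 1.0
    return labels, missing, coverage


labels, missing, coverage = score_recall(
    conn, "163", "red dress", ["spu-1", "spu-3", "spu-2"]
)
```

Here `missing` and `coverage` stand in for whatever the UI/API uses to decide when to show a coverage tip.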
## What it does
1. Build an annotation set for a fixed query set.
2. Evaluate live search results against cached labels.
3. Run batch evaluation and keep historical reports with config snapshots.
4. Tune fusion parameters in a reproducible loop.
## Layout
| Path | Role |
|------|------|
| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
| `build_annotation_set.py` | CLI entry (build / batch / audit) |
| `serve_eval_web.py` | Web server for the evaluation UI |
| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
| `fusion_experiments_round1.json` | Broader first-round experiments |
| `queries/queries.txt` | Canonical evaluation queries |
| `README_Requirement.md` | Product/requirements reference |
| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild`, or `serve` |
| `../start_eval_web.sh` | Same as `serve`, with `activate.sh`. Alternatively, use `./scripts/service_ctl.sh start eval-web` (default port **6010**; override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |

## Quick start (repo root)

Set the tenant if needed (`export TENANT_ID=163`). You need a running backend, a live search API, and DashScope access when new LLM labels are required.
```bash
# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Full re-label of current top_k recall (expensive)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# UI: http://127.0.0.1:6010/
./scripts/evaluation/quick_start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web
```
Explicit equivalents:
```bash
# Equivalent of `batch`: label only uncached (query, spu_id) pairs
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

# Equivalent of `batch-rebuild`: same arguments plus a forced re-label
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  ... same args ... \
  --force-refresh-labels

# Equivalent of `serve`
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```
Each batch run walks the full queries file. With `--force-refresh-labels`, every recalled `spu_id` in the window is re-sent to the LLM and upserted. Without it, only missing labels are filled.
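
The difference between the two modes can be sketched as a selection step before the LLM call. The helper name `pairs_to_label` and its arguments are hypothetical, used here only to illustrate the behavior described above.

```python
def pairs_to_label(query, recalled_spu_ids, cached_spu_ids, force_refresh=False):
    """Return the (query, spu_id) pairs that would be sent to the LLM."""
    if force_refresh:
        # Forced refresh: every recalled item in the window is re-labeled and upserted.
        return [(query, spu_id) for spu_id in recalled_spu_ids]
    # Default: only fill in labels that are missing from the cache.
    return [
        (query, spu_id)
        for spu_id in recalled_spu_ids
        if spu_id not in cached_spu_ids
    ]


recalled = ["spu-1", "spu-2", "spu-3"]
cached = {"spu-1", "spu-3"}

incremental = pairs_to_label("red dress", recalled, cached)
full = pairs_to_label("red dress", recalled, cached, force_refresh=True)
```

In this toy case the incremental run labels one pair while the forced run labels all three, which is why the forced mode is the expensive one.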
## Artifacts
Default root: `artifacts/search_evaluation/`
- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
- `query_builds/` — per-query pooled build outputs
- `batch_reports/` — batch JSON, Markdown, config snapshots
- `audits/` — label-quality audit summaries
- `tuning_runs/` — fusion experiment outputs and config snapshots
## Labels
- **Exact** — Matches intended product type and all explicit required attributes.
- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
- **Irrelevant** — Type mismatch or conflicting required attributes.

**Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
## Flows
**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.

**Deeper pool:** `build_annotation_set.py build` merges deep search and full-corpus rerank windows before labeling (see CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`).

**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
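
The fusion-tuning loop can be pictured roughly as below. This is a sketch under stated assumptions, not the logic of `tune_fusion.py`: the experiment shape (`name`, `params`), the `weight` parameter, the `toy_score` metric, and the injected `apply_config` / `run_batch_eval` callables are all hypothetical stand-ins for writing a config, restarting the backend, and running a batch evaluation.

```python
from typing import Callable, Dict, List, Tuple


def pick_best_experiment(
    experiments: List[Dict],
    apply_config: Callable[[Dict], None],
    run_batch_eval: Callable[[], Dict[str, float]],
    score_metric: str,
) -> Tuple[str, float, List[Dict]]:
    """Run every experiment and return the best one by the chosen metric."""
    results = []
    for experiment in experiments:
        # Stand-in for: write the config variant and restart the backend.
        apply_config(experiment["params"])
        metrics = run_batch_eval()
        results.append(
            {"name": experiment["name"], "params": experiment["params"], "metrics": metrics}
        )
    best = max(results, key=lambda result: result["metrics"][score_metric])
    return best["name"], best["metrics"][score_metric], results


# Toy stand-ins so the sketch runs without a backend.
state: Dict[str, float] = {}


def fake_apply(params: Dict) -> None:
    state.clear()
    state.update(params)


def fake_eval() -> Dict[str, float]:
    # Pretend quality peaks when the (hypothetical) weight is 0.6.
    return {"toy_score": 1.0 - abs(state["weight"] - 0.6)}


experiments = [
    {"name": "low", "params": {"weight": 0.2}},
    {"name": "mid", "params": {"weight": 0.6}},
    {"name": "high", "params": {"weight": 0.9}},
]
best_name, best_score, results = pick_best_experiment(
    experiments, fake_apply, fake_eval, score_metric="toy_score"
)
```

Keeping the full `results` list alongside the winner mirrors the idea of storing a report per experiment, so a run can be reviewed later rather than only yielding a single best variant.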
### Audit
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
--tenant-id 163 \
--queries-file scripts/evaluation/queries/queries.txt \
--top-k 50 \
--language en \
--labeler-mode simple
```
## Web UI
Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
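
One plausible reading of "missed Exact/Partial" is: items the cache labels as relevant for the query that do not appear in the current recall. The sketch below illustrates that reading only; the helper name and the input shapes are assumptions, and the UI may compute this differently.

```python
def missed_relevant(cached_labels, recalled_spu_ids, relevant=("Exact", "Partial")):
    """List cached relevant items, per label, that are absent from the recall."""
    recalled = set(recalled_spu_ids)
    return {
        label: sorted(
            spu_id
            for spu_id, cached_label in cached_labels.items()
            if cached_label == label and spu_id not in recalled
        )
        for label in relevant
    }


cached_labels = {
    "spu-1": "Exact",
    "spu-2": "Partial",
    "spu-3": "Exact",
    "spu-4": "Irrelevant",
}
missed = missed_relevant(cached_labels, recalled_spu_ids=["spu-1", "spu-4"])
```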
## Batch reports
Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
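
As a rough picture of what such a report could contain, the sketch below assembles a JSON-serializable structure. The field names and the `exact_rate` metric are illustrative assumptions; the actual report schema and metric set are defined by the framework.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_batch_report(per_query_labels, config_snapshot):
    """Assemble a JSON-serializable report from per-query label lists."""
    distribution = Counter(
        label for labels in per_query_labels.values() for label in labels
    )
    per_query = {
        query: {"exact_rate": labels.count("Exact") / len(labels) if labels else 0.0}
        for query, labels in per_query_labels.items()
    }
    rates = [metrics["exact_rate"] for metrics in per_query.values()]
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "aggregate": {"exact_rate": sum(rates) / len(rates) if rates else 0.0},
        "per_query": per_query,
        "label_distribution": dict(distribution),
        "config_snapshot": config_snapshot,
    }


report = build_batch_report(
    {
        "red dress": ["Exact", "Partial", "Irrelevant", "Exact"],
        "blue jeans": ["Irrelevant", "Exact"],
    },
    # Stand-in for the payload captured from /admin/config.
    config_snapshot={"fusion": {"weight": 0.6}},
)
report_json = json.dumps(report, indent=2)
```

Storing the config snapshot inside the same document is what makes a historical report interpretable later, since the metrics can be tied to the exact settings that produced them.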
## Caveats
- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
- Backend restarts in automated tuning may need a short settle time before requests.
|
881d338b
tangwang
评估框架
|
118
|
|
3ac1f8d1
tangwang
评估标准优化
|
119
|
## Related docs
- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.