# Search Evaluation Framework
This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system with a multi-metric primary scorecard instead of a single headline metric.
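
As a concrete sketch of this scoring rule (names here are illustrative, not the framework's actual API): recalled hits are looked up in the cached label map, and anything unjudged counts as `Irrelevant`:

```python
def score_recall(recalled_spu_ids, label_cache):
    """label_cache: spu_id -> cached relevance label for this (tenant, query).
    Unjudged hits default to "Irrelevant"; coverage drives the UI tips."""
    labels = [label_cache.get(spu_id, "Irrelevant") for spu_id in recalled_spu_ids]
    judged = sum(spu_id in label_cache for spu_id in recalled_spu_ids)
    coverage = judged / len(recalled_spu_ids) if recalled_spu_ids else 0.0
    return labels, coverage

labels, coverage = score_recall(
    ["a", "b", "c", "d"],
    {"a": "Fully Relevant", "c": "Weakly Relevant"},
)
# coverage is 0.5 here, so the UI would surface an incomplete-coverage tip
```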
## What it does
1. Build an annotation set for a fixed query set.
2. Evaluate live search results against cached labels.
3. Run batch evaluation and keep historical reports with config snapshots.
4. Tune fusion parameters in a reproducible loop.
## Layout
| Path | Role |
|------|------|
| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
| `build_annotation_set.py` | CLI entry (build / batch / audit) |
| `serve_eval_web.py` | Web server for the evaluation UI |
| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
| `fusion_experiments_round1.json` | Broader first-round experiments |
| `queries/queries.txt` | Canonical evaluation queries |
| `README_Requirement.md` | Product/requirements reference |
| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |

## Quick start (repo root)

Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashScope access when new LLM labels are required, and a running backend.
```bash
# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/start_eval.sh batch
# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive)
./scripts/evaluation/start_eval.sh batch-rebuild
# UI: http://127.0.0.1:6010/
./scripts/evaluation/start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web
```
Explicit equivalents:
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
--tenant-id "${TENANT_ID:-163}" \
--queries-file scripts/evaluation/queries/queries.txt \
--top-k 50 \
--language en \
--labeler-mode simple
./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
--tenant-id "${TENANT_ID:-163}" \
--queries-file scripts/evaluation/queries/queries.txt \
--search-depth 500 \
--rerank-depth 10000 \
--force-refresh-rerank \
--force-refresh-labels \
--language en \
--labeler-mode simple
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
--tenant-id "${TENANT_ID:-163}" \
--queries-file scripts/evaluation/queries/queries.txt \
--host 127.0.0.1 \
--port 6010
```
Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
### `start_eval.sh batch-rebuild` (deep annotation rebuild)
This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block above). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.

For **each** query in `queries.txt`, in order:
1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
2. **Full corpus** — Load the tenant’s product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
4. **Skip “too easy” queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**.
**Early stop** (defaults in `eval_framework.constants`; overridable via CLI):
- Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed.
- After that, a **bad batch** is one where **both** are true (strict **>**):
- **Irrelevant** proportion **> 93.9%** (`--rebuild-irrelevant-stop-ratio`, default `0.939`), and
- **(Irrelevant + Weakly Relevant)** proportion **> 95.9%** (`--rebuild-irrel-low-combined-stop-ratio`, default `0.959`).
(“Weakly Relevant” is the weak tier; **Mostly Relevant** and **Exact** do not enter this sum.)
- Count **consecutive** bad batches. **Reset** the count to 0 on any batch that is not bad.
- **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**3** by default), or when **`--rebuild-max-batches`** (**40**) is reached—whichever comes first (up to **2000** docs per query at default batch size).
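
The global sort in step 5 can be sketched as follows (a minimal illustration; the function name and data shapes are assumptions, not the package's API):

```python
def global_label_order(pool_hits, rerank_scores):
    """pool_hits: recall-pool spu_ids in search rank order.
    rerank_scores: spu_id -> reranker score for out-of-pool corpus docs.
    Returns the labeling order: pool first (search rank order), then the
    remaining docs by descending rerank score, deduped with the pool winning."""
    pool = set(pool_hits)
    tail = sorted(
        (sid for sid in rerank_scores if sid not in pool),
        key=lambda sid: -rerank_scores[sid],
    )
    return list(pool_hits) + tail

order = global_label_order(["p1", "p2"], {"p2": 0.9, "x": 0.7, "y": 0.8})
# "p2" keeps its pool slot; "y" outranks "x" by rerank score
```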
So labeling follows best-first order but **stops early** after **three** consecutive batches whose Irrelevant and Irrelevant+Weakly-Relevant proportions both exceed their thresholds; the tail may never be judged.
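
The early-stop rule can be expressed compactly (a sketch using the defaults above; not the actual implementation in `eval_framework`):

```python
def early_stop_batch(batch_stats, min_batches=10, max_batches=40,
                     irrel_stop=0.939, combined_stop=0.959, streak=3):
    """batch_stats: (irrelevant_ratio, irrelevant_plus_low_ratio) per batch,
    in labeling order. Returns the 1-based batch index at which labeling
    stops, or None if every batch runs without triggering a stop."""
    consecutive = 0
    for i, (irr, combined) in enumerate(batch_stats, start=1):
        bad = irr > irrel_stop and combined > combined_stop  # strict >
        consecutive = consecutive + 1 if bad else 0          # reset on a good batch
        if i >= min_batches and consecutive >= streak:
            return i
        if i >= max_batches:
            return i
    return None
```

Note the ratio checks are strict: a batch sitting exactly at `0.939` does not count as bad.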
**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
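
The fixed chunk size means out-of-pool scoring cost grows linearly with corpus size; a trivial sketch of the chunking (illustrative, not the `full_corpus_rerank_outside_exclude` code itself):

```python
def rerank_chunks(doc_ids, chunk_size=80):
    """Split out-of-pool docs into fixed-size rerank API requests."""
    return [doc_ids[i:i + chunk_size] for i in range(0, len(doc_ids), chunk_size)]

batches = rerank_chunks([f"spu_{i}" for i in range(200)])
# 200 docs -> requests of 80, 80, and 40
```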
## Artifacts
Default root: `artifacts/search_evaluation/`
- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
- `query_builds/` — per-query pooled build outputs
- `batch_reports/` — batch JSON, Markdown, config snapshots
- `audits/` — label-quality audit summaries
- `tuning_runs/` — fusion experiment outputs and config snapshots
## Labels
- **Fully Relevant** — Matches intended product type and all explicit required attributes.
- **Mostly Relevant** — Main intent matches and is a strong substitute, but some attributes are missing, weaker, or slightly off.
- **Weakly Relevant** — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match.
- **Irrelevant** — Type mismatch or important conflicts make it a poor search result.

## Metric design

This framework now follows graded ranking evaluation closer to e-commerce best practice instead of collapsing everything into binary relevance.
- **Primary scorecard**
The primary evaluation set is:
`NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`.
- **Composite tuning score: `Primary_Metric_Score`**
For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`).
- **Gain scheme**
`Fully Relevant=7`, `Mostly Relevant=3`, `Weakly Relevant=1`, `Irrelevant=0`
The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup.
- **Why this is better**
`NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping a `Fully Relevant` item with a `Weakly Relevant` one is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`.
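
A minimal NDCG sketch under this gain scheme (a standalone illustration, not the framework's metrics module):

```python
import math

def ndcg_at_k(grades, k):
    """grades: relevance grades (3/2/1/0) in ranked order for one query.
    gain = 2^grade - 1 gives 7/3/1/0, matching the scheme above."""
    def dcg(ordered):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(ordered[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Demoting a Fully Relevant (3) below a Weakly Relevant (1) costs more
# than the same swap between lower tiers would:
baseline = ndcg_at_k([3, 2, 1, 0], k=4)  # ideal order -> 1.0
demoted = ndcg_at_k([1, 2, 3, 0], k=4)   # grades 3 and 1 swapped -> < 1.0
```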
The reported metrics are:
- **`Primary_Metric_Score`** — Mean score over the primary scorecard (`Avg_Grade@10` normalized by `/3`).
- **`NDCG@20`, `NDCG@50`, `ERR@10`, `Strong_Precision@10`, `Strong_Precision@20`, `Useful_Precision@50`, `Avg_Grade@10`, `Gain_Recall@20`** — Primary scorecard for optimization decisions.
- **`NDCG@5`, `NDCG@10`, `ERR@5`, `ERR@20`, `ERR@50`** — Secondary graded ranking quality slices.
- **`Exact_Precision@K`** — Strict top-slot quality when only `Fully Relevant` counts.
- **`Strong_Precision@K`** — Business-facing top-slot quality where `Fully Relevant + Mostly Relevant` count as strong positives.
- **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts.
- **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query.
- **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K.
- **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears.
- **`Avg_Grade@10`** — Average relevance grade of the visible first page.
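
To make the scorecard concrete, here is a hedged sketch of `Strong_Precision@K` and the composite score (the real aggregation lives in the package; the key names here simply mirror the list above):

```python
def strong_precision_at_k(grades, k):
    """Share of the first k results graded Fully (3) or Mostly (2) Relevant."""
    return sum(g >= 2 for g in grades[:k]) / k

def primary_metric_score(scorecard, max_grade=3):
    """Mean over the primary scorecard, with Avg_Grade@10 normalized to
    [0, 1] by dividing by the max grade."""
    vals = [v / max_grade if name == "Avg_Grade@10" else v
            for name, v in scorecard.items()]
    return sum(vals) / len(vals)

score = primary_metric_score(
    {"NDCG@20": 0.8, "Strong_Precision@10": 0.6, "Avg_Grade@10": 1.8}
)
# mean of 0.8, 0.6, and 1.8/3 = 0.6
```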
**Labeler modes:** `simple` (default) runs one judging pass per batch with the standard relevance prompt; `complex` adds query-profile extraction and extra guardrails (for structured experiments).
## Flows
**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.
**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
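
The loop itself is simple; a dependency-injected sketch (the three callables stand in for the real apply/restart/evaluate steps, which are not reproduced here):

```python
def tune(experiments, apply_config, restart_backend, run_batch_eval,
         score_metric="Primary_Metric_Score"):
    """Run each config variant through the apply -> restart -> evaluate loop
    and return the best-scoring (name, score) pair."""
    results = []
    for exp in experiments:
        apply_config(exp["config"])
        restart_backend()  # may need a short settle time (see Caveats)
        report = run_batch_eval()
        results.append((exp["name"], report[score_metric]))
    return max(results, key=lambda r: r[1])
```

With stub callables this picks the highest-scoring variant; the real runner additionally snapshots configs and reports under `tuning_runs/`.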
### Audit
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
--tenant-id 163 \
--queries-file scripts/evaluation/queries/queries.txt \
--top-k 50 \
--language en \
--labeler-mode simple
```
## Web UI
Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.
## Batch reports
Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
## Ranking debug and LTR prep

`debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work:

- `retrieval_plan` — the effective text/image KNN plan for the query (`k`, `num_candidates`, boost, and whether the long-query branch was used).
- `ltr_summary` — query-level summary over the visible top results: how many docs came from translation matches, text KNN, image KNN, fallback-to-ES cases, plus mean signal values.

At the document level, `debug_info.per_result[*]` and `ranking_funnel.*.ltr_features` include stable features such as:

- `es_score`, `text_score`, `knn_score`, `rerank_score`, `fine_score`
- `source_score`, `translation_score`, `text_knn_score`, `image_knn_score`
- `has_translation_match`, `has_text_knn`, `has_image_knn`, `text_score_fallback_to_es`

This makes it easier to inspect bad cases and also gives us a near-direct feature inventory for downstream LTR experiments.
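
For example, the features above could be flattened into dense rows for offline experiments (the field names follow the list; the `debug_info` shape here is a simplified assumption):

```python
# Score features and boolean flags, in the order listed above.
FEATURES = ["es_score", "text_score", "knn_score", "rerank_score", "fine_score",
            "source_score", "translation_score", "text_knn_score", "image_knn_score"]
FLAGS = ["has_translation_match", "has_text_knn", "has_image_knn",
         "text_score_fallback_to_es"]

def ltr_rows(per_result):
    """Flatten per-result ltr_features into dense numeric rows (missing
    scores default to 0.0, flags become 0/1)."""
    rows = []
    for doc in per_result:
        feats = doc.get("ltr_features", {})
        row = [float(feats.get(name, 0.0)) for name in FEATURES]
        row += [1.0 if feats.get(name) else 0.0 for name in FLAGS]
        rows.append(row)
    return rows

rows = ltr_rows([{"ltr_features": {"es_score": 1.2, "has_text_knn": True}}])
# one row of 13 features: 9 scores then 4 flags
```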
## Caveats
- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
- Backend restarts in automated tuning may need a short settle time before requests.
## Related docs
- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.