Commit 3ac1f8d1cc9d647361028ebf9451265101457381

Authored by tangwang
1 parent 3984ec64

评估标准优化 (evaluation criteria optimization)

docs/Usage-Guide.md
@@ -202,13 +202,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
 ./scripts/service_ctl.sh restart backend
 sleep 3
 ./scripts/service_ctl.sh status backend
-python ./scripts/eval_search_quality.py
+./scripts/evaluation/quick_start_eval.sh batch
 ```

-评估结果会输出到 `artifacts/search_eval/`,包含:
-
-- `search_eval_*.json`:便于脚本二次分析
-- `search_eval_*.md`:便于人工浏览 top20 结果、分数与命中子句
+离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。

 ### 方式4: 多环境示例(prod / uat)

docs/相关性检索优化说明.md
@@ -240,15 +240,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
 ./scripts/service_ctl.sh restart backend
 sleep 3
 ./scripts/service_ctl.sh status backend
-python ./scripts/eval_search_quality.py
+./scripts/evaluation/quick_start_eval.sh batch
 ```

-评估脚本会生成:
-
-- `artifacts/search_eval/search_eval_*.json`
-- `artifacts/search_eval/search_eval_*.md`
-
-可直接从 JSON 中提取 query 级和 result 级调试字段进行分析。
+评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。

 ## 11. 建议测试清单

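Both docs above now point readers at the SQLite-backed artifacts instead of the old loose JSON/Markdown files. A quick way to sanity-check the label cache is a small query against `search_eval.sqlite3`; note this is a sketch — the column names (`tenant_id`, `query_text`, `spu_id`, `label`) are assumptions based on the key described in the new README, not verified against `store.py`, and the snippet uses an in-memory stand-in instead of the real file.

```python
import sqlite3
from collections import Counter

# Hypothetical schema mirroring the (tenant_id, query_text, spu_id) keying
# described in the README; in practice connect to
# artifacts/search_evaluation/search_eval.sqlite3 instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relevance_labels ("
    " tenant_id INTEGER, query_text TEXT, spu_id TEXT, label TEXT,"
    " PRIMARY KEY (tenant_id, query_text, spu_id))"
)
conn.executemany(
    "INSERT INTO relevance_labels VALUES (?, ?, ?, ?)",
    [
        (163, "red fitted t-shirt", "spu-1", "Exact"),
        (163, "red fitted t-shirt", "spu-2", "Partial"),
        (163, "red fitted t-shirt", "spu-3", "Irrelevant"),
    ],
)

# Per-query label distribution for one tenant.
rows = conn.execute(
    "SELECT query_text, label, COUNT(*) FROM relevance_labels"
    " WHERE tenant_id = ? GROUP BY query_text, label",
    (163,),
).fetchall()
dist = Counter({(q, lab): n for q, lab, n in rows})
print(dist[("red fitted t-shirt", "Exact")])  # 1
```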
scripts/evaluation/README.md
 # Search Evaluation Framework

-This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
+This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

-It is designed around one core rule:
+**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.

-- Annotation should be built offline first.
-- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
-- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
+## What it does

-## Goals
+1. Build an annotation set for a fixed query set.
+2. Evaluate live search results against cached labels.
+3. Run batch evaluation and keep historical reports with config snapshots.
+4. Tune fusion parameters in a reproducible loop.

-The framework supports four related tasks:
+## Layout

-1. Build an annotation set for a fixed query set.
-2. Evaluate a live search result list against that annotation set.
-3. Run batch evaluation and store historical reports with config snapshots.
-4. Tune fusion parameters reproducibly.
-
-## Files
-
-- `eval_framework/` (Python package)
-  Modular layout: `framework.py` (orchestration), `store.py` (SQLite), `clients.py` (search/rerank/LLM), `prompts.py` (judge templates), `metrics.py`, `reports.py`, `web_app.py`, `cli.py`, and `static/` (evaluation UI HTML/CSS/JS).
-- `build_annotation_set.py`
-  Thin CLI entrypoint into `eval_framework`.
-- `serve_eval_web.py`
-  Thin web entrypoint into `eval_framework`.
-- `tune_fusion.py`
-  Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
-- `fusion_experiments_shortlist.json`
-  A compact experiment set for practical tuning.
-- `fusion_experiments_round1.json`
-  A broader first-round experiment set.
-- `queries/queries.txt`
-  The canonical evaluation query set.
-- `README_Requirement.md`
-  Requirement reference document.
-- `quick_start_eval.sh`
-  Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
-- `../start_eval_web.sh`
-  Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of core services.
-
-## Quick start (from repo root)
-
-Set tenant if needed (`export TENANT_ID=163`). Requires live search API, DashScope key when the batch step needs new LLM labels, and a working backend.
+| Path | Role |
+|------|------|
+| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
+| `build_annotation_set.py` | CLI entry (build / batch / audit) |
+| `serve_eval_web.py` | Web server for the evaluation UI |
+| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
+| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
+| `fusion_experiments_round1.json` | Broader first-round experiments |
+| `queries/queries.txt` | Canonical evaluation queries |
+| `README_Requirement.md` | Product/requirements reference |
+| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild`, or `serve` |
+| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
+
+## Quick start (repo root)
+
+Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashScope when new LLM labels are required, and a running backend.

 ```bash
-# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
+# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
 ./scripts/evaluation/quick_start_eval.sh batch

-# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
+# Full re-label of current top_k recall (expensive)
 ./scripts/evaluation/quick_start_eval.sh batch-rebuild

-# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
+# UI: http://127.0.0.1:6010/
 ./scripts/evaluation/quick_start_eval.sh serve
-# Or: ./scripts/service_ctl.sh start eval-web
+# or: ./scripts/service_ctl.sh start eval-web
 ```

-Equivalent explicit commands:
+Explicit equivalents:

 ```bash
-# Safe default: no --force-refresh-labels
 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
   --tenant-id "${TENANT_ID:-163}" \
   --queries-file scripts/evaluation/queries/queries.txt \
@@ -67,13 +52,8 @@ Equivalent explicit commands:
   --language en \
   --labeler-mode simple

-# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
-  --tenant-id "${TENANT_ID:-163}" \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --labeler-mode simple \
+  ... same args ... \
   --force-refresh-labels

 ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
@@ -83,191 +63,35 @@ Equivalent explicit commands:
   --port 6010
 ```

-**Batch behavior:** There is no “skip queries already processed”. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
-
-## Storage Layout
-
-All generated artifacts are under:
-
-- `/data/saas-search/artifacts/search_evaluation`
-
-Important subpaths:
-
-- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
-  Main cache and annotation store.
-- `/data/saas-search/artifacts/search_evaluation/query_builds`
-  Per-query pooled annotation-set build artifacts.
-- `/data/saas-search/artifacts/search_evaluation/batch_reports`
-  Batch evaluation JSON, Markdown reports, and config snapshots.
-- `/data/saas-search/artifacts/search_evaluation/audits`
-  Audit summaries for label quality checks.
-- `/data/saas-search/artifacts/search_evaluation/tuning_runs`
-  Fusion experiment summaries and per-experiment config snapshots.
-
-## SQLite Schema Summary
-
-The main tables in `search_eval.sqlite3` are:
-
-- `corpus_docs`
-  Cached product corpus for the tenant.
-- `rerank_scores`
-  Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
-- `relevance_labels`
-  Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
-- `query_profiles`
-  Structured query-intent profiles extracted before labeling.
-- `build_runs`
-  Per-query pooled-build records.
-- `batch_runs`
-  Batch evaluation history.
-
-## Label Semantics
-
-Three labels are used throughout:
-
-- `Exact`
-  Fully matches the intended product type and all explicit required attributes.
-- `Partial`
-  Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
-- `Irrelevant`
-  Product type mismatches, or explicit required attributes conflict.
-
-The framework always uses:
-
-- LLM-based batched relevance classification
-- caching and retry logic for robust offline labeling
-
-There are now two labeler modes:
-
-- `simple`
-  Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
-- `complex`
-  Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
-
-## Offline-First Workflow
-
-### 1. Refresh labels for the evaluation query set
-
-For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
-
-Example (fills missing labels only; recommended default):
-
-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --labeler-mode simple
-```
-
-To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
-
-This command does two things:
-
-- runs **every** query in the file against the live backend (no skip list)
-- with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
-
-After this step, single-query evaluation can run in cached mode without calling the LLM again.
-
-### 2. Optional pooled build
-
-The framework also supports a heavier pooled build that combines:
-
-- top search results
-- top full-corpus reranker results
-
-Example:
-
-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --search-depth 1000 \
-  --rerank-depth 10000 \
-  --annotate-search-top-k 100 \
-  --annotate-rerank-top-k 120 \
-  --language en
-```
-
-This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
-
-## Why Single-Query Evaluation Was Slow
-
-If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
-
-- perform live search
-- detect recalled but unlabeled products
-- call the LLM to label them
-
-That is not the intended steady-state evaluation path.
-
-The UI/API is now configured to prefer cached evaluation:
-
-- default single-query evaluation uses `auto_annotate=false`
-- unlabeled recalled results are treated as `Irrelevant`
-- the response includes tips explaining that coverage gap
-
-If you want stable, fast evaluation:
-
-1. prebuild labels offline
-2. use cached single-query evaluation
-
-## Web UI
-
-Start the evaluation UI:
-
-```bash
-./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --host 127.0.0.1 \
-  --port 6010
-```
-
-The UI provides:
-
-- query list loaded from `queries.txt`
-- single-query evaluation
-- batch evaluation
-- history of batch reports
-- top recalled results
-- missed `Exact` and `Partial` products that were not recalled
-- tips about unlabeled hits treated as `Irrelevant`
+Each batch run walks the full queries file. With `--force-refresh-labels`, every recalled `spu_id` in the window is re-sent to the LLM and upserted. Without it, only missing labels are filled.

-### Single-query response behavior
+## Artifacts

-For a single query:
+Default root: `artifacts/search_evaluation/`

-1. live search returns recalled `spu_id` values
-2. the framework looks up cached labels by `(query, spu_id)`
-3. unlabeled recalled items are counted as `Irrelevant`
-4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`
+- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
+- `query_builds/` — per-query pooled build outputs
+- `batch_reports/` — batch JSON, Markdown, config snapshots
+- `audits/` — label-quality audit summaries
+- `tuning_runs/` — fusion experiment outputs and config snapshots

-This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
+## Labels

-## CLI Commands
+- **Exact** — Matches intended product type and all explicit required attributes.
+- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
+- **Irrelevant** — Type mismatch or conflicting required attributes.

-### Build pooled annotation artifacts
+**Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).

-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
-```
+## Flows

-### Run batch evaluation
+**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.

-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --labeler-mode simple
-```
+**Deeper pool:** `build_annotation_set.py build` merges deep search and full-corpus rerank windows before labeling (see CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`).

-Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
+**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).

-### Audit annotation quality
+### Audit

 ```bash
 ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
@@ -278,69 +102,20 @@ Use `--force-refresh-labels` if you want to rebuild the offline label cache for
   --labeler-mode simple
 ```

-This checks cached labels against current guardrails and reports suspicious cases.
-
-## Batch Reports
-
-Each batch run stores:
-
-- aggregate metrics
-- per-query metrics
-- label distribution
-- timestamp
-- config snapshot from `/admin/config`
-
-Reports are written as:
-
-- Markdown for easy reading
-- JSON for downstream processing
-
-## Fusion Tuning
-
-The tuning runner applies experiment configs sequentially and records the outcome.
-
-Example:
-
-```bash
-./.venv/bin/python scripts/evaluation/tune_fusion.py \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
-  --score-metric MAP_3 \
-  --apply-best
-```
-
-What it does:
-
-1. writes an experiment config into `config/config.yaml`
-2. restarts backend
-3. runs batch evaluation
-4. stores the per-experiment result
-5. optionally applies the best experiment at the end
+## Web UI

-## Current Practical Recommendation
+Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.

-For day-to-day evaluation:
+## Batch reports

-1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
-2. run the web UI or normal batch evaluation in cached mode
-3. only force-refresh labels again when:
-   - the query set changes
-   - the product corpus changes materially
-   - the labeling logic changes
+Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.

 ## Caveats

-- The current label cache is query-specific, not a full all-products all-queries matrix.
-- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
-- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
-- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
-
-## Related Requirement Docs
+- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
+- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
+- Backend restarts in automated tuning may need a short settle time before requests.

-- `README_Requirement.md`
-- `README_Requirement_zh.md`
+## Related docs

-These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.
+- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.
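The new README text describes cached-mode scoring: recalled `spu_id`s are mapped to cached labels, anything unlabeled counts as `Irrelevant`, and precision-style metrics are computed over the window. A minimal sketch of that scoring step follows; the function names and the half-credit gain for `Partial` are assumptions of this sketch (the real formulas for `P@k`, `MAP_3`, `MAP_2_3` live in `eval_framework/metrics.py`).

```python
from typing import Dict, List

# Assumed graded gains; Partial getting half credit is this sketch's choice.
LABEL_GAIN = {"Exact": 1.0, "Partial": 0.5, "Irrelevant": 0.0}

def precision_at_k(recalled: List[str], labels: Dict[str, str], k: int) -> float:
    """Fraction of the top-k recalled items labeled Exact.

    Unlabeled ids fall back to Irrelevant, matching the README's cached-mode rule.
    """
    window = recalled[:k]
    if not window:
        return 0.0
    hits = sum(1 for spu in window if labels.get(spu, "Irrelevant") == "Exact")
    return hits / len(window)

def graded_precision_at_k(recalled: List[str], labels: Dict[str, str], k: int) -> float:
    """Graded variant where Partial contributes half credit."""
    window = recalled[:k]
    if not window:
        return 0.0
    return sum(LABEL_GAIN[labels.get(spu, "Irrelevant")] for spu in window) / len(window)

labels = {"a": "Exact", "b": "Partial"}   # cached annotation set
recalled = ["a", "b", "x"]                # live recall; "x" has no cached label
print(precision_at_k(recalled, labels, 3))         # ≈ 0.333
print(graded_precision_at_k(recalled, labels, 3))  # 0.5
```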
scripts/evaluation/eval_framework/prompts.py
@@ -5,46 +5,46 @@ from __future__ import annotations
 import json
 from typing import Any, Dict, Sequence

-_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
+_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
 Given the user query and each product's information, assign one relevance label to each product.

 ## Relevance Labels

 ### Exact
-The product fully satisfies the user's search intent.
+The product fully satisfies the user’s search intent: the core product type matches, all explicitly stated key attributes are supported by the product information.

-Use Exact when:
-- The product matches the core product type named in the query.
-- The key requirements explicitly stated in the query are satisfied.
-- There is no clear conflict with any explicit user requirement.
-
-Typical cases:
-- The query is only a product type, and the product is exactly that product type.
-- The query includes product type + attributes, and the product matches the type and those attributes.
+Typical use cases:
+- The query contains only a product type, and the product is exactly that type.
+- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.

 ### Partial
-The product satisfies the user's primary intent, but does not fully satisfy all specified details.
+The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are missing, cannot be confirmed, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.

 Use Partial when:
 - The core product type matches, but some requested attributes cannot be confirmed.
-- The core product type matches, but only some secondary attributes are satisfied.
-- The core product type matches, and there are minor or non-critical deviations from the query.
-- The product does not clearly contradict the user's explicit requirements, but it also cannot be considered a full match.
+- The core product type matches, but some secondary requirements deviate or are inconsistent.
+- The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.

 Typical cases:
 - Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
 - Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
-- Query: "cotton long sleeve blouse", product: "Long Sleeve Blouse" → material not confirmed.

-Important:
-Partial should mainly be used when the core product type is correct, but the detailed requirements are incomplete, uncertain, or only partially matched.
+Detailed example:
+- Query: "cotton long sleeve shirt"
+- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
+
+Analysis:
+- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
+- However, the core product type still matches: both are long sleeve shirts.
+- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
+- Therefore, it should be labeled Partial as a non-target but acceptable substitute.

 ### Irrelevant
 The product does not satisfy the user's main shopping intent.

 Use Irrelevant when:
 - The core product type does not match the query.
-- The product matches the general category but is a different product type that shoppers would not consider interchangeable.
+- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
 - The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.

 Typical cases:
@@ -53,6 +53,8 @@ Typical cases:
 - Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
 - Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.

+This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
+
 ## Decision Principles

 1. Product type is the highest-priority factor.
@@ -71,16 +73,13 @@ Typical cases:
    If the user explicitly searched for one of these, the others should usually be judged Irrelevant.

 3. If the core product type matches, then evaluate attributes.
-   - If attributes fully match → Exact
-   - If attributes are missing, uncertain, or only partially matched → Partial
-   - If attributes clearly contradict an explicit important requirement → Irrelevant
+   - If all explicit attributes match → Exact
+   - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
+   - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant

 4. Distinguish carefully between "not mentioned" and "contradicted".
    - If an attribute is not mentioned or cannot be verified, prefer Partial.
-   - If an attribute is explicitly opposite to the query, use Irrelevant.
-
-5. Do not overuse Exact.
-   Exact requires strong evidence that the product satisfies the user's stated intent, not just the general category.
+   - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.

 Query: {query}

@@ -97,96 +96,93 @@ The lines must correspond sequentially to the products above. @@ -97,96 +96,93 @@ The lines must correspond sequentially to the products above.
97 Do not output any other information. 96 Do not output any other information.
98 """ 97 """
99 98
100 -_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = """你是一个服装电商搜索系统的相关性评估助手。  
101 -给定用户查询和每个产品的信息,为每个产品分配一个相关性标签。 99 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
  100 +给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
102 101
103 ## 相关性标签 102 ## 相关性标签
104 103
105 ### 完全相关 104 ### 完全相关
106 -该产品完全满足用户的搜索意图。  
107 -  
108 -在以下情况使用完全相关:  
109 -- 产品与查询中指定的核心产品类型相匹配。  
110 -- 满足了查询中明确说明的关键要求。  
111 -- 与用户明确的任何要求没有明显冲突。 105 +核心产品类型匹配,所有明确提及的关键属性均有产品信息支撑。
112 106
113 -典型情况:  
114 -- 查询仅包含产品类型,而产品恰好是该产品类型。  
115 -- 查询包含产品类型 + 属性,而产品与该类型及这些属性相匹配。 107 +典型适用场景:
  108 +- 查询仅包含产品类型,产品即为该类型。
  109 +- 查询包含“产品类型 + 属性”,产品在类型及所有明确属性上均符合。
116 110
117 ### 部分相关 111 ### 部分相关
118 -该产品满足了用户的主要意图,但并未完全满足所有指定的细节 112 +产品满足用户的主要意图(核心产品类型匹配),但查询中明确的部分要求未体现,或存在偏差。虽然有不一致,但仍属于“非目标但可接受”的替代品
119 113
120 在以下情况使用部分相关: 114 在以下情况使用部分相关:
121 -- 核心产品类型匹配,但部分请求的属性无法确认。  
122 -- 核心产品类型匹配,但仅满足了部分次要属性。  
123 -- 核心产品类型匹配,但与查询存在微小或非关键的偏差。  
124 -- 产品未明显违背用户的明确要求,但也不能视为完全匹配。 115 +- 核心产品类型匹配,但部分请求的属性在商品信息中缺失、未提及或无法确认。
  116 +- 核心产品类型匹配,但材质、版型、风格等次要要求存在偏差或不一致。
  117 +- 商品不是用户最理想的目标,但从电商购物角度看,仍可能被用户视为可接受的替代品。
125 118
126 典型情况: 119 典型情况:
127 -- 查询:"红色修身T恤",产品:"女士T恤" → 颜色/版型无法确认。  
128 -- 查询:"红色修身T恤",产品:"蓝色修身T恤" → 产品类型和版型匹配,但颜色不同。  
129 -- 查询:"棉质长袖衬衫",产品:"长袖衬衫" → 材质未确认。 120 +- 查询:“红色修身T恤”,产品:“女士T恤” → 颜色/版型无法确认。
  121 +- 查询:“红色修身T恤”,产品:“蓝色修身T恤” → 产品类型和版型匹配,但颜色不同。
130 122
131 -重要提示:  
132 -部分相关主要应在核心产品类型正确,但详细要求不完整、不确定或仅部分匹配时使用。 123 +详细案例:
  124 +- 查询:“棉质长袖衬衫”
  125 +- 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
133 126
134 -### 不相关  
135 -该产品不满足用户的主要购物意图。 127 +分析:
  128 +- 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为完全相关。
  129 +- 但核心品类仍然匹配:两者都是“长袖衬衫”。
  130 +- 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
  131 +- 因此应判为部分相关,即“非目标但可接受”的替代品。
136 132
137 -在以下情况使用不相关: 133 +### 不相关
  134 +产品未满足用户的主要购物意图,主要表现为以下情形之一:
138 - 核心产品类型与查询不匹配。 135 - 核心产品类型与查询不匹配。
139 -- 产品匹配了大致类别,但属于购物者不会认为可互换的不同产品类型。  
140 -- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的要求。 136 +- 产品虽属大致相关的大类,但与查询指定的具体子类不可互换。
  137 +- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的属性要求。
141 138
142 典型情况: 139 典型情况:
143 -- 查询:"裤子",产品:"鞋子" → 错误的产品类型。  
144 -- 查询:"连衣裙",产品:"半身裙" → 不同的产品类型。  
145 -- 查询:"修身裤",产品:"宽松阔腿裤" → 版型上明显矛盾。  
146 -- 查询:"无袖连衣裙",产品:"长袖连衣裙" → 袖型上明显矛盾。 140 +- 查询:“裤子”,产品:“鞋子” → 产品类型错误。
  141 +- 查询:“连衣裙”,产品:“半身裙” → 具体产品类型不同。
  142 +- 查询:“修身裤”,产品:“宽松阔腿裤” → 与版型要求明显冲突。
  143 +- 查询:“无袖连衣裙”,产品:“长袖连衣裙” → 与袖型要求明显冲突。
147 144
148 -## 决策原则 145 +该标签强调用户意图的明确性。当查询指向具体类型或关键属性时,即使产品在更高层级类别上相关,也应按不相关处理。
149 146
150 -1. 产品类型是最高优先级的因素。  
151 - 如果查询明确指定了具体产品类型,结果必须匹配该产品类型才能被评为完全相关或部分相关。  
152 - 不同的产品类型通常是不相关,而非部分相关。 147 +## 判断原则
153 148
154 -2. 当查询明确时,相似或相关的产品类型不可互换。 149 +1. 产品类型是最高优先级因素。
  150 + 如果查询明确指定了具体产品类型,那么结果必须匹配该产品类型,才可能判为“完全相关”或“部分相关”。
  151 + 不同产品类型通常应判为“不相关”,而不是“部分相关”。
  152 +
  153 +2. 相似或相关的产品类型,在查询明确时通常不可互换。
155 例如: 154 例如:
156 - 连衣裙 vs 半身裙 vs 连体裤 155 - 连衣裙 vs 半身裙 vs 连体裤
157 - 牛仔裤 vs 裤子 156 - 牛仔裤 vs 裤子
158 - - T恤 vs 衬衫 157 + - T恤 vs 衬衫/上衣
159 - 开衫 vs 毛衣 158 - 开衫 vs 毛衣
160 - 靴子 vs 鞋子 159 - 靴子 vs 鞋子
161 - 文胸 vs 上衣 160 - 文胸 vs 上衣
162 - 双肩包 vs 包 161 - 双肩包 vs 包
163 - 如果用户明确搜索了其中一种,其他的通常应判断为不相关。  
164 -  
165 -3. 如果核心产品类型匹配,则评估属性。  
166 - - 如果属性完全匹配 → 完全相关  
167 - - 如果属性缺失、不确定或仅部分匹配 → 部分相关  
168 - - 如果属性明显违背明确的重点要求 → 不相关 162 + 如果用户明确搜索其中一种,其他类型通常应判为“不相关”。
169 163
170 -4. 仔细区分“未提及”和“矛盾”。  
171 - - 如果属性未提及或无法验证,倾向于部分相关。  
172 - - 如果属性与查询明确相反,使用不相关。 164 +3. 当核心产品类型匹配后,再评估属性。
  165 + - 所有明确属性都匹配 → 完全相关
  166 + - 部分属性缺失、无法确认,或存在一定偏差,但仍是可接受替代品 → 部分相关
  167 + - 明确且重要的属性被明显违背,且不能作为合理替代品 → 不相关
173 168
174 -5. 不要过度使用完全相关。  
175 - 完全相关需要强有力的证据表明产品满足了用户声明的意图,而不仅仅是通用类别。 169 +4. 要严格区分“未提及/无法确认”和“明确冲突”。
  170 + - 如果某属性没有提及,或无法验证,优先判为“部分相关”。
  171 + - 如果某属性与查询要求明确相反,则判为“不相关”;除非在购物语境下它仍明显属于可接受替代品。
176 172
177 -查询: {query} 173 +查询:{query}
178 174
179 -产品: 175 +商品:
180 {lines} 176 {lines}
181 177
182 ## 输出格式 178 ## 输出格式
183 -严格输出 {n} 行,每行包含以下之一:  
184 -Exact  
185 -Partial  
186 -Irrelevant 179 +严格输出 {n} 行,每行只能是以下三者之一:
  180 +完全相关
  181 +部分相关
  182 +不相关
187 183
188 -这些行必须按顺序对应上面的产品。  
189 -不要输出任何其他信息。 184 +输出行必须与上方商品顺序一一对应。
  185 +不要输出任何其他内容。
190 """ 186 """
191 187
192 188
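The revised template enforces a strict output contract: exactly `{n}` lines, one Chinese label per product, in the same order as the `{lines}` block. A minimal sketch of how a caller might fill the placeholders and parse the model's response — the `build_prompt`/`parse_labels` names and the label-to-score mapping are illustrative assumptions, not code from this commit:

```python
# Hypothetical helper sketch around the _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH
# contract; the numeric scores assigned to each label are an assumption.
LABELS = {"完全相关": 2, "部分相关": 1, "不相关": 0}

def build_prompt(template: str, query: str, titles: list[str]) -> str:
    # One numbered line per product, filling the {lines} / {n} placeholders.
    lines = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(titles))
    return template.format(query=query, lines=lines, n=len(titles))

def parse_labels(raw: str, n: int) -> list[int]:
    # The template demands exactly n label lines, in product order;
    # anything else is a protocol violation worth failing loudly on.
    rows = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if len(rows) != n:
        raise ValueError(f"expected {n} label lines, got {len(rows)}")
    return [LABELS[row] for row in rows]
```

Failing fast on a line-count mismatch (rather than truncating or padding) keeps labels from silently shifting onto the wrong products.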
scripts/evaluation/eval_search_quality.py deleted
@@ -1,235 +0,0 @@
1 -#!/usr/bin/env python3  
2 -"""  
3 -Run search quality evaluation against real tenant indexes and emit JSON/Markdown reports.  
4 -  
5 -Usage:  
6 - source activate.sh  
7 - python scripts/eval_search_quality.py  
8 -"""  
9 -  
10 -from __future__ import annotations  
11 -  
12 -import json  
13 -import sys  
14 -from dataclasses import asdict, dataclass  
15 -from datetime import datetime, timezone  
16 -from pathlib import Path  
17 -from typing import Any, Dict, List  
18 -  
19 -PROJECT_ROOT = Path(__file__).resolve().parents[1]  
20 -if str(PROJECT_ROOT) not in sys.path:  
21 - sys.path.insert(0, str(PROJECT_ROOT))  
22 -  
23 -from api.app import get_searcher, init_service  
24 -from context import create_request_context  
25 -  
26 -  
27 -DEFAULT_QUERIES_BY_TENANT: Dict[str, List[str]] = {  
28 - "0": [  
29 - "连衣裙",  
30 - "dress",  
31 - "dress 连衣裙",  
32 - "maxi dress 长裙",  
33 - "波西米亚连衣裙",  
34 - "T恤",  
35 - "graphic tee 图案T恤",  
36 - "shirt",  
37 - "礼服衬衫",  
38 - "hoodie 卫衣",  
39 - "连帽卫衣",  
40 - "sweatshirt",  
41 - "牛仔裤",  
42 - "jeans",  
43 - "阔腿牛仔裤",  
44 - "毛衣 sweater",  
45 - "cardigan 开衫",  
46 - "jacket 外套",  
47 - "puffer jacket 羽绒服",  
48 - "飞行员夹克",  
49 - ],  
50 - "162": [  
51 - "连衣裙",  
52 - "dress",  
53 - "dress 连衣裙",  
54 - "T恤",  
55 - "shirt",  
56 - "hoodie 卫衣",  
57 - "牛仔裤",  
58 - "jeans",  
59 - "毛衣 sweater",  
60 - "jacket 外套",  
61 - "娃娃衣服",  
62 - "芭比裙子",  
63 - "连衣短裙芭比",  
64 - "公主大裙",  
65 - "晚礼服芭比",  
66 - "毛衣熊",  
67 - "服饰饰品",  
68 - "鞋子",  
69 - "军人套",  
70 - "陆军套",  
71 - ],  
72 -}  
73 -  
74 -  
75 -@dataclass  
76 -class RankedItem:  
77 - rank: int  
78 - spu_id: str  
79 - title: str  
80 - vendor: str  
81 - es_score: float | None  
82 - rerank_score: float | None  
83 - text_score: float | None  
84 - text_source_score: float | None  
85 - text_translation_score: float | None  
86 - text_primary_score: float | None  
87 - text_support_score: float | None  
88 - knn_score: float | None  
89 - fused_score: float | None  
90 - matched_queries: Any  
91 -  
92 -  
93 -def _pick_text(value: Any, language: str = "zh") -> str:  
94 - if value is None:  
95 - return ""  
96 - if isinstance(value, dict):  
97 - return str(value.get(language) or value.get("zh") or value.get("en") or "").strip()  
98 - return str(value).strip()  
99 -  
100 -  
101 -def _to_float(value: Any) -> float | None:  
102 - try:  
103 - if value is None:  
104 - return None  
105 - return float(value)  
106 - except (TypeError, ValueError):  
107 - return None  
108 -  
109 -  
110 -def _evaluate_query(searcher, tenant_id: str, query: str) -> Dict[str, Any]:  
111 - context = create_request_context(  
112 - reqid=f"eval-{tenant_id}-{abs(hash(query)) % 1000000}",  
113 - uid="codex",  
114 - )  
115 - result = searcher.search(  
116 - query=query,  
117 - tenant_id=tenant_id,  
118 - size=20,  
119 - from_=0,  
120 - context=context,  
121 - debug=True,  
122 - language="zh",  
123 - enable_rerank=True,  
124 - )  
125 -  
126 - per_result_debug = ((result.debug_info or {}).get("per_result") or [])  
127 - debug_by_spu_id = {  
128 - str(item.get("spu_id")): item  
129 - for item in per_result_debug  
130 - if isinstance(item, dict) and item.get("spu_id") is not None  
131 - }  
132 -  
133 - ranked_items: List[RankedItem] = []  
134 - for rank, spu in enumerate(result.results[:20], 1):  
135 - spu_id = str(getattr(spu, "spu_id", ""))  
136 - debug_item = debug_by_spu_id.get(spu_id, {})  
137 - ranked_items.append(  
138 - RankedItem(  
139 - rank=rank,  
140 - spu_id=spu_id,  
141 - title=_pick_text(getattr(spu, "title", None), language="zh"),  
142 - vendor=_pick_text(getattr(spu, "vendor", None), language="zh"),  
143 - es_score=_to_float(debug_item.get("es_score")),  
144 - rerank_score=_to_float(debug_item.get("rerank_score")),  
145 - text_score=_to_float(debug_item.get("text_score")),  
146 - text_source_score=_to_float(debug_item.get("text_source_score")),  
147 - text_translation_score=_to_float(debug_item.get("text_translation_score")),  
148 - text_primary_score=_to_float(debug_item.get("text_primary_score")),  
149 - text_support_score=_to_float(debug_item.get("text_support_score")),  
150 - knn_score=_to_float(debug_item.get("knn_score")),  
151 - fused_score=_to_float(debug_item.get("fused_score")),  
152 - matched_queries=debug_item.get("matched_queries"),  
153 - )  
154 - )  
155 -  
156 - return {  
157 - "query": query,  
158 - "tenant_id": tenant_id,  
159 - "total": result.total,  
160 - "max_score": result.max_score,  
161 - "took_ms": result.took_ms,  
162 - "query_analysis": ((result.debug_info or {}).get("query_analysis") or {}),  
163 - "stage_timings": ((result.debug_info or {}).get("stage_timings") or {}),  
164 - "top20": [asdict(item) for item in ranked_items],  
165 - }  
166 -  
167 -  
168 -def _render_markdown(report: Dict[str, Any]) -> str:  
169 - lines: List[str] = []  
170 - lines.append(f"# Search Quality Evaluation")  
171 - lines.append("")  
172 - lines.append(f"- Generated at: {report['generated_at']}")  
173 - lines.append(f"- Queries per tenant: {report['queries_per_tenant']}")  
174 - lines.append("")  
175 - for tenant_id, entries in report["tenants"].items():  
176 - lines.append(f"## Tenant {tenant_id}")  
177 - lines.append("")  
178 - for entry in entries:  
179 - qa = entry.get("query_analysis") or {}  
180 - lines.append(f"### Query: {entry['query']}")  
181 - lines.append("")  
182 - lines.append(  
183 - f"- total={entry['total']} max_score={entry['max_score']:.6f} took_ms={entry['took_ms']}"  
184 - )  
185 - lines.append(  
186 - f"- detected_language={qa.get('detected_language')} translations={qa.get('translations')}"  
187 - )  
188 - lines.append("")  
189 - lines.append("| rank | spu_id | title | fused | rerank | text | text_src | text_trans | knn | es | matched_queries |")  
190 - lines.append("| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |")  
191 - for item in entry.get("top20", []):  
192 - title = str(item.get("title", "")).replace("|", "/")  
193 - matched = json.dumps(item.get("matched_queries"), ensure_ascii=False)  
194 - matched = matched.replace("|", "/")  
195 - lines.append(  
196 - f"| {item.get('rank')} | {item.get('spu_id')} | {title} | "  
197 - f"{item.get('fused_score')} | {item.get('rerank_score')} | {item.get('text_score')} | "  
198 - f"{item.get('text_source_score')} | {item.get('text_translation_score')} | "  
199 - f"{item.get('knn_score')} | {item.get('es_score')} | {matched} |"  
200 - )  
201 - lines.append("")  
202 - return "\n".join(lines)  
203 -  
204 -  
205 -def main() -> None:  
206 - init_service("http://localhost:9200")  
207 - searcher = get_searcher()  
208 -  
209 - tenants_report: Dict[str, List[Dict[str, Any]]] = {}  
210 - for tenant_id, queries in DEFAULT_QUERIES_BY_TENANT.items():  
211 - tenant_entries: List[Dict[str, Any]] = []  
212 - for query in queries:  
213 - print(f"[eval] tenant={tenant_id} query={query}")  
214 - tenant_entries.append(_evaluate_query(searcher, tenant_id, query))  
215 - tenants_report[tenant_id] = tenant_entries  
216 -  
217 - report = {  
218 - "generated_at": datetime.now(timezone.utc).isoformat(),  
219 - "queries_per_tenant": {tenant: len(queries) for tenant, queries in DEFAULT_QUERIES_BY_TENANT.items()},  
220 - "tenants": tenants_report,  
221 - }  
222 -  
223 - out_dir = Path("artifacts/search_eval")  
224 - out_dir.mkdir(parents=True, exist_ok=True)  
225 - timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  
226 - json_path = out_dir / f"search_eval_{timestamp}.json"  
227 - md_path = out_dir / f"search_eval_{timestamp}.md"  
228 - json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")  
229 - md_path.write_text(_render_markdown(report), encoding="utf-8")  
230 - print(f"[done] json={json_path}")  
231 - print(f"[done] md={md_path}")  
232 -  
233 -  
234 -if __name__ == "__main__":  
235 - main()