Commit 3ac1f8d1cc9d647361028ebf9451265101457381

Authored by tangwang
1 parent 3984ec64

Optimize evaluation criteria (评估标准优化)

docs/Usage-Guide.md
... ... @@ -202,13 +202,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
202 202 ./scripts/service_ctl.sh restart backend
203 203 sleep 3
204 204 ./scripts/service_ctl.sh status backend
205   -python ./scripts/eval_search_quality.py
  205 +./scripts/evaluation/quick_start_eval.sh batch
206 206 ```
207 207  
208   -评估结果会输出到 `artifacts/search_eval/`,包含:
209   -
210   -- `search_eval_*.json`:便于脚本二次分析
211   -- `search_eval_*.md`:便于人工浏览 top20 结果、分数与命中子句
  208 +离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
212 209  
213 210 ### 方式4: 多环境示例(prod / uat)
214 211  
... ...
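The restart-then-evaluate sequence in the snippet above is also what automated runs need. A minimal Python sketch of that loop (the script paths and the 3-second settle delay come from the snippet; the helper name and injectable `runner` are illustrative, not part of the repo):

```python
import subprocess
import time

# Commands mirror the shell snippet in the usage guide.
EVAL_SEQUENCE = [
    ["./scripts/service_ctl.sh", "restart", "backend"],
    ["./scripts/service_ctl.sh", "status", "backend"],
    ["./scripts/evaluation/quick_start_eval.sh", "batch"],
]


def run_eval_sequence(settle_seconds: float = 3.0, runner=subprocess.run) -> None:
    """Restart the backend, wait for it to settle, then run batch evaluation."""
    restart, status, batch = EVAL_SEQUENCE
    runner(restart, check=True)
    time.sleep(settle_seconds)  # backend can be briefly unstable right after restart
    runner(status, check=True)
    runner(batch, check=True)
```

Passing a custom `runner` keeps the sequence testable without touching real services.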
docs/相关性检索优化说明.md
... ... @@ -240,15 +240,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
240 240 ./scripts/service_ctl.sh restart backend
241 241 sleep 3
242 242 ./scripts/service_ctl.sh status backend
243   -python ./scripts/eval_search_quality.py
  243 +./scripts/evaluation/quick_start_eval.sh batch
244 244 ```
245 245  
246   -评估脚本会生成:
247   -
248   -- `artifacts/search_eval/search_eval_*.json`
249   -- `artifacts/search_eval/search_eval_*.md`
250   -
251   -可直接从 JSON 中提取 query 级和 result 级调试字段进行分析。
  246 +评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
252 247  
253 248 ## 11. 建议测试清单
254 249  
... ...
scripts/evaluation/README.md
1 1 # Search Evaluation Framework
2 2  
3   -This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
  3 +This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
4 4  
5   -It is designed around one core rule:
  5 +**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
6 6  
7   -- Annotation should be built offline first.
8   -- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
9   -- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
  7 +## What it does
10 8  
11   -## Goals
  9 +1. Build an annotation set for a fixed query set.
  10 +2. Evaluate live search results against cached labels.
  11 +3. Run batch evaluation and keep historical reports with config snapshots.
  12 +4. Tune fusion parameters in a reproducible loop.
12 13  
13   -The framework supports four related tasks:
  14 +## Layout
14 15  
15   -1. Build an annotation set for a fixed query set.
16   -2. Evaluate a live search result list against that annotation set.
17   -3. Run batch evaluation and store historical reports with config snapshots.
18   -4. Tune fusion parameters reproducibly.
19   -
20   -## Files
21   -
22   -- `eval_framework/` (Python package)
23   - Modular layout: `framework.py` (orchestration), `store.py` (SQLite), `clients.py` (search/rerank/LLM), `prompts.py` (judge templates), `metrics.py`, `reports.py`, `web_app.py`, `cli.py`, and `static/` (evaluation UI HTML/CSS/JS).
24   -- `build_annotation_set.py`
25   - Thin CLI entrypoint into `eval_framework`.
26   -- `serve_eval_web.py`
27   - Thin web entrypoint into `eval_framework`.
28   -- `tune_fusion.py`
29   - Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
30   -- `fusion_experiments_shortlist.json`
31   - A compact experiment set for practical tuning.
32   -- `fusion_experiments_round1.json`
33   - A broader first-round experiment set.
34   -- `queries/queries.txt`
35   - The canonical evaluation query set.
36   -- `README_Requirement.md`
37   - Requirement reference document.
38   -- `quick_start_eval.sh`
39   - Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
40   -- `../start_eval_web.sh`
41   - Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of core services.
42   -
43   -## Quick start (from repo root)
44   -
45   -Set tenant if needed (`export TENANT_ID=163`). Requires live search API, DashScope key when the batch step needs new LLM labels, and a working backend.
  16 +| Path | Role |
  17 +|------|------|
  18 +| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
  19 +| `build_annotation_set.py` | CLI entry (build / batch / audit) |
  20 +| `serve_eval_web.py` | Web server for the evaluation UI |
  21 +| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
  22 +| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
  23 +| `fusion_experiments_round1.json` | Broader first-round experiments |
  24 +| `queries/queries.txt` | Canonical evaluation queries |
  25 +| `README_Requirement.md` | Product/requirements reference |
  26 +| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild`, or `serve` |
  27 +| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
  28 +
  29 +## Quick start (repo root)
  30 +
  31 +Set tenant if needed (`export TENANT_ID=163`). You need a live search API, a DashScope key when new LLM labels are required, and a running backend.
46 32  
47 33 ```bash
48   -# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
  34 +# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
49 35 ./scripts/evaluation/quick_start_eval.sh batch
50 36  
51   -# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
  37 +# Full re-label of current top_k recall (expensive)
52 38 ./scripts/evaluation/quick_start_eval.sh batch-rebuild
53 39  
54   -# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
  40 +# UI: http://127.0.0.1:6010/
55 41 ./scripts/evaluation/quick_start_eval.sh serve
56   -# Or: ./scripts/service_ctl.sh start eval-web
  42 +# or: ./scripts/service_ctl.sh start eval-web
57 43 ```
58 44  
59   -Equivalent explicit commands:
  45 +Explicit equivalents:
60 46  
61 47 ```bash
62   -# Safe default: no --force-refresh-labels
63 48 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
64 49 --tenant-id "${TENANT_ID:-163}" \
65 50 --queries-file scripts/evaluation/queries/queries.txt \
... ... @@ -67,13 +52,8 @@ Equivalent explicit commands:
67 52 --language en \
68 53 --labeler-mode simple
69 54  
70   -# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
71 55 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
72   - --tenant-id "${TENANT_ID:-163}" \
73   - --queries-file scripts/evaluation/queries/queries.txt \
74   - --top-k 50 \
75   - --language en \
76   - --labeler-mode simple \
  56 + ... same args ... \
77 57 --force-refresh-labels
78 58  
79 59 ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
... ... @@ -83,191 +63,35 @@ Equivalent explicit commands:
83 63 --port 6010
84 64 ```
85 65  
86   -**Batch behavior:** There is no “skip queries already processed”. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
87   -
88   -## Storage Layout
89   -
90   -All generated artifacts are under:
91   -
92   -- `/data/saas-search/artifacts/search_evaluation`
93   -
94   -Important subpaths:
95   -
96   -- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
97   - Main cache and annotation store.
98   -- `/data/saas-search/artifacts/search_evaluation/query_builds`
99   - Per-query pooled annotation-set build artifacts.
100   -- `/data/saas-search/artifacts/search_evaluation/batch_reports`
101   - Batch evaluation JSON, Markdown reports, and config snapshots.
102   -- `/data/saas-search/artifacts/search_evaluation/audits`
103   - Audit summaries for label quality checks.
104   -- `/data/saas-search/artifacts/search_evaluation/tuning_runs`
105   - Fusion experiment summaries and per-experiment config snapshots.
106   -
107   -## SQLite Schema Summary
108   -
109   -The main tables in `search_eval.sqlite3` are:
110   -
111   -- `corpus_docs`
112   - Cached product corpus for the tenant.
113   -- `rerank_scores`
114   - Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
115   -- `relevance_labels`
116   - Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
117   -- `query_profiles`
118   - Structured query-intent profiles extracted before labeling.
119   -- `build_runs`
120   - Per-query pooled-build records.
121   -- `batch_runs`
122   - Batch evaluation history.
123   -
124   -## Label Semantics
125   -
126   -Three labels are used throughout:
127   -
128   -- `Exact`
129   - Fully matches the intended product type and all explicit required attributes.
130   -- `Partial`
131   - Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
132   -- `Irrelevant`
133   - Product type mismatches, or explicit required attributes conflict.
134   -
135   -The framework always uses:
136   -
137   -- LLM-based batched relevance classification
138   -- caching and retry logic for robust offline labeling
139   -
140   -There are now two labeler modes:
141   -
142   -- `simple`
143   - Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
144   -- `complex`
145   - Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
146   -
147   -## Offline-First Workflow
148   -
149   -### 1. Refresh labels for the evaluation query set
150   -
151   -For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
152   -
153   -Example (fills missing labels only; recommended default):
154   -
155   -```bash
156   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
157   - --tenant-id 163 \
158   - --queries-file scripts/evaluation/queries/queries.txt \
159   - --top-k 50 \
160   - --language en \
161   - --labeler-mode simple
162   -```
163   -
164   -To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
165   -
166   -This command does two things:
167   -
168   -- runs **every** query in the file against the live backend (no skip list)
169   -- with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
170   -
171   -After this step, single-query evaluation can run in cached mode without calling the LLM again.
172   -
173   -### 2. Optional pooled build
174   -
175   -The framework also supports a heavier pooled build that combines:
176   -
177   -- top search results
178   -- top full-corpus reranker results
179   -
180   -Example:
181   -
182   -```bash
183   -./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
184   - --tenant-id 163 \
185   - --queries-file scripts/evaluation/queries/queries.txt \
186   - --search-depth 1000 \
187   - --rerank-depth 10000 \
188   - --annotate-search-top-k 100 \
189   - --annotate-rerank-top-k 120 \
190   - --language en
191   -```
192   -
193   -This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
194   -
195   -## Why Single-Query Evaluation Was Slow
196   -
197   -If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
198   -
199   -- perform live search
200   -- detect recalled but unlabeled products
201   -- call the LLM to label them
202   -
203   -That is not the intended steady-state evaluation path.
204   -
205   -The UI/API is now configured to prefer cached evaluation:
206   -
207   -- default single-query evaluation uses `auto_annotate=false`
208   -- unlabeled recalled results are treated as `Irrelevant`
209   -- the response includes tips explaining that coverage gap
210   -
211   -If you want stable, fast evaluation:
212   -
213   -1. prebuild labels offline
214   -2. use cached single-query evaluation
215   -
216   -## Web UI
217   -
218   -Start the evaluation UI:
219   -
220   -```bash
221   -./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
222   - --tenant-id 163 \
223   - --queries-file scripts/evaluation/queries/queries.txt \
224   - --host 127.0.0.1 \
225   - --port 6010
226   -```
227   -
228   -The UI provides:
229   -
230   -- query list loaded from `queries.txt`
231   -- single-query evaluation
232   -- batch evaluation
233   -- history of batch reports
234   -- top recalled results
235   -- missed `Exact` and `Partial` products that were not recalled
236   -- tips about unlabeled hits treated as `Irrelevant`
  66 +Each batch run walks the full queries file. With `--force-refresh-labels`, every recalled `spu_id` in the window is re-sent to the LLM and upserted. Without it, only missing labels are filled.
237 67  
238   -### Single-query response behavior
  68 +## Artifacts
239 69  
240   -For a single query:
  70 +Default root: `artifacts/search_evaluation/`
241 71  
242   -1. live search returns recalled `spu_id` values
243   -2. the framework looks up cached labels by `(query, spu_id)`
244   -3. unlabeled recalled items are counted as `Irrelevant`
245   -4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`
  72 +- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  73 +- `query_builds/` — per-query pooled build outputs
  74 +- `batch_reports/` — batch JSON, Markdown, config snapshots
  75 +- `audits/` — label-quality audit summaries
  76 +- `tuning_runs/` — fusion experiment outputs and config snapshots
246 77  
247   -This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
  78 +## Labels
248 79  
249   -## CLI Commands
  80 +- **Exact** — Matches intended product type and all explicit required attributes.
  81 +- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
  82 +- **Irrelevant** — Type mismatch or conflicting required attributes.
250 83  
251   -### Build pooled annotation artifacts
  84 +**Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
252 85  
253   -```bash
254   -./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
255   -```
  86 +## Flows
256 87  
257   -### Run batch evaluation
  88 +**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then run the UI or batch evaluation in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
258 89  
259   -```bash
260   -./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
261   - --tenant-id 163 \
262   - --queries-file scripts/evaluation/queries/queries.txt \
263   - --top-k 50 \
264   - --language en \
265   - --labeler-mode simple
266   -```
  90 +**Deeper pool:** `build_annotation_set.py build` merges deep search and full-corpus rerank windows before labeling (see CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`).
267 91  
268   -Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
  92 +**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
269 93  
270   -### Audit annotation quality
  94 +### Audit
271 95  
272 96 ```bash
273 97 ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
... ... @@ -278,69 +102,20 @@ Use `--force-refresh-labels` if you want to rebuild the offline label cache for
278 102 --labeler-mode simple
279 103 ```
280 104  
281   -This checks cached labels against current guardrails and reports suspicious cases.
282   -
283   -## Batch Reports
284   -
285   -Each batch run stores:
286   -
287   -- aggregate metrics
288   -- per-query metrics
289   -- label distribution
290   -- timestamp
291   -- config snapshot from `/admin/config`
292   -
293   -Reports are written as:
294   -
295   -- Markdown for easy reading
296   -- JSON for downstream processing
297   -
298   -## Fusion Tuning
299   -
300   -The tuning runner applies experiment configs sequentially and records the outcome.
301   -
302   -Example:
303   -
304   -```bash
305   -./.venv/bin/python scripts/evaluation/tune_fusion.py \
306   - --tenant-id 163 \
307   - --queries-file scripts/evaluation/queries/queries.txt \
308   - --top-k 50 \
309   - --language en \
310   - --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
311   - --score-metric MAP_3 \
312   - --apply-best
313   -```
314   -
315   -What it does:
316   -
317   -1. writes an experiment config into `config/config.yaml`
318   -2. restarts backend
319   -3. runs batch evaluation
320   -4. stores the per-experiment result
321   -5. optionally applies the best experiment at the end
  105 +## Web UI
322 106  
323   -## Current Practical Recommendation
  107 +Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
324 108  
325   -For day-to-day evaluation:
  109 +## Batch reports
326 110  
327   -1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
328   -2. run the web UI or normal batch evaluation in cached mode
329   -3. only force-refresh labels again when:
330   - - the query set changes
331   - - the product corpus changes materially
332   - - the labeling logic changes
  111 +Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
333 112  
334 113 ## Caveats
335 114  
336   -- The current label cache is query-specific, not a full all-products all-queries matrix.
337   -- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
338   -- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
339   -- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
340   -
341   -## Related Requirement Docs
  115 +- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
  116 +- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  117 +- Backend restarts in automated tuning may need a short settle time before requests.
342 118  
343   -- `README_Requirement.md`
344   -- `README_Requirement_zh.md`
  119 +## Related docs
345 120  
346   -These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.
  121 +- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.
... ...
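The cached-mode scoring rule the README describes (unlabeled recalls count as `Irrelevant`; cached `Exact`/`Partial` items that were not recalled are reported as missed) can be sketched as follows. The label store is a plain dict here, and counting both `Exact` and `Partial` as relevant for P@k is an assumption; the real metrics module may weight them differently:

```python
def evaluate_query(recalled, labels, k=5):
    """Score one query's recall against cached labels.

    recalled: ordered spu_ids from live search.
    labels:   {spu_id: "Exact" | "Partial" | "Irrelevant"} from the cache.
    """
    # Unlabeled recalled items are treated as Irrelevant.
    scored = [labels.get(spu, "Irrelevant") for spu in recalled]
    # Assumption: Exact and Partial both count as relevant for P@k.
    p_at_k = sum(lbl in ("Exact", "Partial") for lbl in scored[:k]) / k
    # Cached Exact/Partial products that were not recalled at all.
    recalled_set = set(recalled)
    missed = sorted(
        spu for spu, lbl in labels.items()
        if lbl in ("Exact", "Partial") and spu not in recalled_set
    )
    return {"p_at_k": p_at_k, "scored": scored, "missed": missed}
```

The `missed` list is what the UI shows as "Missed Exact / Partial".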
scripts/evaluation/eval_framework/prompts.py
... ... @@ -5,46 +5,46 @@ from __future__ import annotations
5 5 import json
6 6 from typing import Any, Dict, Sequence
7 7  
8   -_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
  8 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
9 9 Given the user query and each product's information, assign one relevance label to each product.
10 10  
11 11 ## Relevance Labels
12 12  
13 13 ### Exact
14   -The product fully satisfies the user's search intent.
  14 +The product fully satisfies the user's search intent: the core product type matches, and all explicitly stated key attributes are supported by the product information.
15 15  
16   -Use Exact when:
17   -- The product matches the core product type named in the query.
18   -- The key requirements explicitly stated in the query are satisfied.
19   -- There is no clear conflict with any explicit user requirement.
20   -
21   -Typical cases:
22   -- The query is only a product type, and the product is exactly that product type.
23   -- The query includes product type + attributes, and the product matches the type and those attributes.
  16 +Typical use cases:
  17 +- The query contains only a product type, and the product is exactly that type.
  18 +- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.
24 19  
25 20 ### Partial
26   -The product satisfies the user's primary intent, but does not fully satisfy all specified details.
  21 +The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are missing, cannot be confirmed, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.
27 22  
28 23 Use Partial when:
29 24 - The core product type matches, but some requested attributes cannot be confirmed.
30   -- The core product type matches, but only some secondary attributes are satisfied.
31   -- The core product type matches, and there are minor or non-critical deviations from the query.
32   -- The product does not clearly contradict the user's explicit requirements, but it also cannot be considered a full match.
  25 +- The core product type matches, but some secondary requirements deviate or are inconsistent.
  26 +- The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.
33 27  
34 28 Typical cases:
35 29 - Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
36 30 - Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
37   -- Query: "cotton long sleeve blouse", product: "Long Sleeve Blouse" → material not confirmed.
38 31  
39   -Important:
40   -Partial should mainly be used when the core product type is correct, but the detailed requirements are incomplete, uncertain, or only partially matched.
  32 +Detailed example:
  33 +- Query: "cotton long sleeve shirt"
  34 +- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
  35 +
  36 +Analysis:
  37 +- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
  38 +- However, the core product type still matches: both are long sleeve shirts.
  39 +- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
  40 +- Therefore, it should be labeled Partial as a non-target but acceptable substitute.
41 41  
42 42 ### Irrelevant
43 43 The product does not satisfy the user's main shopping intent.
44 44  
45 45 Use Irrelevant when:
46 46 - The core product type does not match the query.
47   -- The product matches the general category but is a different product type that shoppers would not consider interchangeable.
  47 +- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
48 48 - The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.
49 49  
50 50 Typical cases:
... ... @@ -53,6 +53,8 @@ Typical cases:
53 53 - Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
54 54 - Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.
55 55  
  56 +This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
  57 +
56 58 ## Decision Principles
57 59  
58 60 1. Product type is the highest-priority factor.
... ... @@ -71,16 +73,13 @@ Typical cases:
71 73 If the user explicitly searched for one of these, the others should usually be judged Irrelevant.
72 74  
73 75 3. If the core product type matches, then evaluate attributes.
74   - - If attributes fully match → Exact
75   - - If attributes are missing, uncertain, or only partially matched → Partial
76   - - If attributes clearly contradict an explicit important requirement → Irrelevant
  76 + - If all explicit attributes match → Exact
  77 + - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
  78 + - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant
77 79  
78 80 4. Distinguish carefully between "not mentioned" and "contradicted".
79 81 - If an attribute is not mentioned or cannot be verified, prefer Partial.
80   - - If an attribute is explicitly opposite to the query, use Irrelevant.
81   -
82   -5. Do not overuse Exact.
83   - Exact requires strong evidence that the product satisfies the user's stated intent, not just the general category.
  82 + - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.
84 83  
85 84 Query: {query}
86 85  
... ... @@ -97,96 +96,93 @@ The lines must correspond sequentially to the products above.
97 96 Do not output any other information.
98 97 """
99 98  
100   -_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = """你是一个服装电商搜索系统的相关性评估助手。
101   -给定用户查询和每个产品的信息,为每个产品分配一个相关性标签。
  99 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
  100 +给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
102 101  
103 102 ## 相关性标签
104 103  
105 104 ### 完全相关
106   -该产品完全满足用户的搜索意图。
107   -
108   -在以下情况使用完全相关:
109   -- 产品与查询中指定的核心产品类型相匹配。
110   -- 满足了查询中明确说明的关键要求。
111   -- 与用户明确的任何要求没有明显冲突。
  105 +核心产品类型匹配,所有明确提及的关键属性均有产品信息支撑。
112 106  
113   -典型情况:
114   -- 查询仅包含产品类型,而产品恰好是该产品类型。
115   -- 查询包含产品类型 + 属性,而产品与该类型及这些属性相匹配。
  107 +典型适用场景:
  108 +- 查询仅包含产品类型,产品即为该类型。
  109 +- 查询包含“产品类型 + 属性”,产品在类型及所有明确属性上均符合。
116 110  
117 111 ### 部分相关
118   -该产品满足了用户的主要意图,但并未完全满足所有指定的细节
  112 +产品满足用户的主要意图(核心产品类型匹配),但查询中明确的部分要求未体现,或存在偏差。虽然有不一致,但仍属于“非目标但可接受”的替代品。
119 113  
120 114 在以下情况使用部分相关:
121   -- 核心产品类型匹配,但部分请求的属性无法确认。
122   -- 核心产品类型匹配,但仅满足了部分次要属性。
123   -- 核心产品类型匹配,但与查询存在微小或非关键的偏差。
124   -- 产品未明显违背用户的明确要求,但也不能视为完全匹配。
  115 +- 核心产品类型匹配,但部分请求的属性在商品信息中缺失、未提及或无法确认。
  116 +- 核心产品类型匹配,但材质、版型、风格等次要要求存在偏差或不一致。
  117 +- 商品不是用户最理想的目标,但从电商购物角度看,仍可能被用户视为可接受的替代品。
125 118  
126 119 典型情况:
127   -- 查询:"红色修身T恤",产品:"女士T恤" → 颜色/版型无法确认。
128   -- 查询:"红色修身T恤",产品:"蓝色修身T恤" → 产品类型和版型匹配,但颜色不同。
129   -- 查询:"棉质长袖衬衫",产品:"长袖衬衫" → 材质未确认。
  120 +- 查询:“红色修身T恤”,产品:“女士T恤” → 颜色/版型无法确认。
  121 +- 查询:“红色修身T恤”,产品:“蓝色修身T恤” → 产品类型和版型匹配,但颜色不同。
130 122  
131   -重要提示:
132   -部分相关主要应在核心产品类型正确,但详细要求不完整、不确定或仅部分匹配时使用。
  123 +详细案例:
  124 +- 查询:“棉质长袖衬衫”
  125 +- 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
133 126  
134   -### 不相关
135   -该产品不满足用户的主要购物意图。
  127 +分析:
  128 +- 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为完全相关。
  129 +- 但核心品类仍然匹配:两者都是“长袖衬衫”。
  130 +- 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
  131 +- 因此应判为部分相关,即“非目标但可接受”的替代品。
136 132  
137   -在以下情况使用不相关:
  133 +### 不相关
  134 +产品未满足用户的主要购物意图,主要表现为以下情形之一:
138 135 - 核心产品类型与查询不匹配。
139   -- 产品匹配了大致类别,但属于购物者不会认为可互换的不同产品类型。
140   -- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的要求。
  136 +- 产品虽属大致相关的大类,但与查询指定的具体子类不可互换。
  137 +- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的属性要求。
141 138  
142 139 典型情况:
143   -- 查询:"裤子",产品:"鞋子" → 错误的产品类型。
144   -- 查询:"连衣裙",产品:"半身裙" → 不同的产品类型。
145   -- 查询:"修身裤",产品:"宽松阔腿裤" → 版型上明显矛盾。
146   -- 查询:"无袖连衣裙",产品:"长袖连衣裙" → 袖型上明显矛盾。
  140 +- 查询:“裤子”,产品:“鞋子” → 产品类型错误。
  141 +- 查询:“连衣裙”,产品:“半身裙” → 具体产品类型不同。
  142 +- 查询:“修身裤”,产品:“宽松阔腿裤” → 与版型要求明显冲突。
  143 +- 查询:“无袖连衣裙”,产品:“长袖连衣裙” → 与袖型要求明显冲突。
147 144  
148   -## 决策原则
  145 +该标签强调用户意图的明确性。当查询指向具体类型或关键属性时,即使产品在更高层级类别上相关,也应按不相关处理。
149 146  
150   -1. 产品类型是最高优先级的因素。
151   - 如果查询明确指定了具体产品类型,结果必须匹配该产品类型才能被评为完全相关或部分相关。
152   - 不同的产品类型通常是不相关,而非部分相关。
  147 +## 判断原则
153 148  
154   -2. 当查询明确时,相似或相关的产品类型不可互换。
  149 +1. 产品类型是最高优先级因素。
  150 + 如果查询明确指定了具体产品类型,那么结果必须匹配该产品类型,才可能判为“完全相关”或“部分相关”。
  151 + 不同产品类型通常应判为“不相关”,而不是“部分相关”。
  152 +
  153 +2. 相似或相关的产品类型,在查询明确时通常不可互换。
155 154 例如:
156 155 - 连衣裙 vs 半身裙 vs 连体裤
157 156 - 牛仔裤 vs 裤子
158   - - T恤 vs 衬衫
  157 + - T恤 vs 衬衫/上衣
159 158 - 开衫 vs 毛衣
160 159 - 靴子 vs 鞋子
161 160 - 文胸 vs 上衣
162 161 - 双肩包 vs 包
163   - 如果用户明确搜索了其中一种,其他的通常应判断为不相关。
164   -
165   -3. 如果核心产品类型匹配,则评估属性。
166   - - 如果属性完全匹配 → 完全相关
167   - - 如果属性缺失、不确定或仅部分匹配 → 部分相关
168   - - 如果属性明显违背明确的重点要求 → 不相关
  162 + 如果用户明确搜索其中一种,其他类型通常应判为“不相关”。
169 163  
170   -4. 仔细区分“未提及”和“矛盾”。
171   - - 如果属性未提及或无法验证,倾向于部分相关。
172   - - 如果属性与查询明确相反,使用不相关。
  164 +3. 当核心产品类型匹配后,再评估属性。
  165 + - 所有明确属性都匹配 → 完全相关
  166 + - 部分属性缺失、无法确认,或存在一定偏差,但仍是可接受替代品 → 部分相关
  167 + - 明确且重要的属性被明显违背,且不能作为合理替代品 → 不相关
173 168  
174   -5. 不要过度使用完全相关。
175   - 完全相关需要强有力的证据表明产品满足了用户声明的意图,而不仅仅是通用类别。
  169 +4. 要严格区分“未提及/无法确认”和“明确冲突”。
  170 + - 如果某属性没有提及,或无法验证,优先判为“部分相关”。
  171 + - 如果某属性与查询要求明确相反,则判为“不相关”;除非在购物语境下它仍明显属于可接受替代品。
176 172  
177   -查询: {query}
  173 +查询:{query}
178 174  
179   -产品:
  175 +商品:
180 176 {lines}
181 177  
182 178 ## 输出格式
183   -严格输出 {n} 行,每行包含以下之一:
184   -Exact
185   -Partial
186   -Irrelevant
  179 +严格输出 {n} 行,每行只能是以下三者之一:
  180 +完全相关
  181 +部分相关
  182 +不相关
187 183  
188   -这些行必须按顺序对应上面的产品。
189   -不要输出任何其他信息。
  184 +输出行必须与上方商品顺序一一对应。
  185 +不要输出任何其他内容。
190 186 """
191 187  
192 188  
... ...
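Both templates demand exactly `{n}` label lines in product order, and the zh template now emits Chinese labels. A hedged parser sketch for the response (the zh-to-canonical mapping mirrors the labels in the templates above; raising `ValueError` on a bad line count so the caller can retry is an assumption about how the framework validates responses):

```python
# Canonical labels plus the Chinese variants emitted by the zh template.
_LABEL_MAP = {
    "Exact": "Exact", "Partial": "Partial", "Irrelevant": "Irrelevant",
    "完全相关": "Exact", "部分相关": "Partial", "不相关": "Irrelevant",
}


def parse_label_lines(text: str, expected: int) -> list[str]:
    """Parse a batched LLM response into canonical labels, in product order."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != expected:
        raise ValueError(f"expected {expected} label lines, got {len(lines)}")
    try:
        return [_LABEL_MAP[ln] for ln in lines]
    except KeyError as exc:
        raise ValueError(f"unknown label: {exc.args[0]!r}") from None
```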
scripts/evaluation/eval_search_quality.py deleted
... ... @@ -1,235 +0,0 @@
1   -#!/usr/bin/env python3
2   -"""
3   -Run search quality evaluation against real tenant indexes and emit JSON/Markdown reports.
4   -
5   -Usage:
6   - source activate.sh
7   - python scripts/eval_search_quality.py
8   -"""
9   -
10   -from __future__ import annotations
11   -
12   -import json
13   -import sys
14   -from dataclasses import asdict, dataclass
15   -from datetime import datetime, timezone
16   -from pathlib import Path
17   -from typing import Any, Dict, List
18   -
19   -PROJECT_ROOT = Path(__file__).resolve().parents[1]
20   -if str(PROJECT_ROOT) not in sys.path:
21   - sys.path.insert(0, str(PROJECT_ROOT))
22   -
23   -from api.app import get_searcher, init_service
24   -from context import create_request_context
25   -
26   -
27   -DEFAULT_QUERIES_BY_TENANT: Dict[str, List[str]] = {
28   - "0": [
29   - "连衣裙",
30   - "dress",
31   - "dress 连衣裙",
32   - "maxi dress 长裙",
33   - "波西米亚连衣裙",
34   - "T恤",
35   - "graphic tee 图案T恤",
36   - "shirt",
37   - "礼服衬衫",
38   - "hoodie 卫衣",
39   - "连帽卫衣",
40   - "sweatshirt",
41   - "牛仔裤",
42   - "jeans",
43   - "阔腿牛仔裤",
44   - "毛衣 sweater",
45   - "cardigan 开衫",
46   - "jacket 外套",
47   - "puffer jacket 羽绒服",
48   - "飞行员夹克",
49   - ],
50   - "162": [
51   - "连衣裙",
52   - "dress",
53   - "dress 连衣裙",
54   - "T恤",
55   - "shirt",
56   - "hoodie 卫衣",
57   - "牛仔裤",
58   - "jeans",
59   - "毛衣 sweater",
60   - "jacket 外套",
61   - "娃娃衣服",
62   - "芭比裙子",
63   - "连衣短裙芭比",
64   - "公主大裙",
65   - "晚礼服芭比",
66   - "毛衣熊",
67   - "服饰饰品",
68   - "鞋子",
69   - "军人套",
70   - "陆军套",
71   - ],
72   -}
73   -
74   -
75   -@dataclass
76   -class RankedItem:
77   - rank: int
78   - spu_id: str
79   - title: str
80   - vendor: str
81   - es_score: float | None
82   - rerank_score: float | None
83   - text_score: float | None
84   - text_source_score: float | None
85   - text_translation_score: float | None
86   - text_primary_score: float | None
87   - text_support_score: float | None
88   - knn_score: float | None
89   - fused_score: float | None
90   - matched_queries: Any
91   -
92   -
93   -def _pick_text(value: Any, language: str = "zh") -> str:
94   - if value is None:
95   - return ""
96   - if isinstance(value, dict):
97   - return str(value.get(language) or value.get("zh") or value.get("en") or "").strip()
98   - return str(value).strip()
99   -
100   -
101   -def _to_float(value: Any) -> float | None:
102   - try:
103   - if value is None:
104   - return None
105   - return float(value)
106   - except (TypeError, ValueError):
107   - return None
108   -
109   -
110   -def _evaluate_query(searcher, tenant_id: str, query: str) -> Dict[str, Any]:
111   - context = create_request_context(
112   - reqid=f"eval-{tenant_id}-{abs(hash(query)) % 1000000}",
113   - uid="codex",
114   - )
115   - result = searcher.search(
116   - query=query,
117   - tenant_id=tenant_id,
118   - size=20,
119   - from_=0,
120   - context=context,
121   - debug=True,
122   - language="zh",
123   - enable_rerank=True,
124   - )
125   -
126   - per_result_debug = ((result.debug_info or {}).get("per_result") or [])
127   - debug_by_spu_id = {
128   - str(item.get("spu_id")): item
129   - for item in per_result_debug
130   - if isinstance(item, dict) and item.get("spu_id") is not None
131   - }
132   -
133   - ranked_items: List[RankedItem] = []
134   - for rank, spu in enumerate(result.results[:20], 1):
135   - spu_id = str(getattr(spu, "spu_id", ""))
136   - debug_item = debug_by_spu_id.get(spu_id, {})
137   - ranked_items.append(
138   - RankedItem(
139   - rank=rank,
140   - spu_id=spu_id,
141   - title=_pick_text(getattr(spu, "title", None), language="zh"),
142   - vendor=_pick_text(getattr(spu, "vendor", None), language="zh"),
143   - es_score=_to_float(debug_item.get("es_score")),
144   - rerank_score=_to_float(debug_item.get("rerank_score")),
145   - text_score=_to_float(debug_item.get("text_score")),
146   - text_source_score=_to_float(debug_item.get("text_source_score")),
147   - text_translation_score=_to_float(debug_item.get("text_translation_score")),
148   - text_primary_score=_to_float(debug_item.get("text_primary_score")),
149   - text_support_score=_to_float(debug_item.get("text_support_score")),
150   - knn_score=_to_float(debug_item.get("knn_score")),
151   - fused_score=_to_float(debug_item.get("fused_score")),
152   - matched_queries=debug_item.get("matched_queries"),
153   - )
154   - )
155   -
156   - return {
157   - "query": query,
158   - "tenant_id": tenant_id,
159   - "total": result.total,
160   - "max_score": result.max_score,
161   - "took_ms": result.took_ms,
162   - "query_analysis": ((result.debug_info or {}).get("query_analysis") or {}),
163   - "stage_timings": ((result.debug_info or {}).get("stage_timings") or {}),
164   - "top20": [asdict(item) for item in ranked_items],
165   - }
166   -
167   -
168   -def _render_markdown(report: Dict[str, Any]) -> str:
169   - lines: List[str] = []
170   - lines.append(f"# Search Quality Evaluation")
171   - lines.append("")
172   - lines.append(f"- Generated at: {report['generated_at']}")
173   - lines.append(f"- Queries per tenant: {report['queries_per_tenant']}")
174   - lines.append("")
175   - for tenant_id, entries in report["tenants"].items():
176   - lines.append(f"## Tenant {tenant_id}")
177   - lines.append("")
178   - for entry in entries:
179   - qa = entry.get("query_analysis") or {}
180   - lines.append(f"### Query: {entry['query']}")
181   - lines.append("")
182   - lines.append(
183   - f"- total={entry['total']} max_score={entry['max_score']:.6f} took_ms={entry['took_ms']}"
184   - )
185   - lines.append(
186   - f"- detected_language={qa.get('detected_language')} translations={qa.get('translations')}"
187   - )
188   - lines.append("")
189   - lines.append("| rank | spu_id | title | fused | rerank | text | text_src | text_trans | knn | es | matched_queries |")
190   - lines.append("| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |")
191   - for item in entry.get("top20", []):
192   - title = str(item.get("title", "")).replace("|", "/")
193   - matched = json.dumps(item.get("matched_queries"), ensure_ascii=False)
194   - matched = matched.replace("|", "/")
195   - lines.append(
196   - f"| {item.get('rank')} | {item.get('spu_id')} | {title} | "
197   - f"{item.get('fused_score')} | {item.get('rerank_score')} | {item.get('text_score')} | "
198   - f"{item.get('text_source_score')} | {item.get('text_translation_score')} | "
199   - f"{item.get('knn_score')} | {item.get('es_score')} | {matched} |"
200   - )
201   - lines.append("")
202   - return "\n".join(lines)
203   -
204   -
205   -def main() -> None:
206   - init_service("http://localhost:9200")
207   - searcher = get_searcher()
208   -
209   - tenants_report: Dict[str, List[Dict[str, Any]]] = {}
210   - for tenant_id, queries in DEFAULT_QUERIES_BY_TENANT.items():
211   - tenant_entries: List[Dict[str, Any]] = []
212   - for query in queries:
213   - print(f"[eval] tenant={tenant_id} query={query}")
214   - tenant_entries.append(_evaluate_query(searcher, tenant_id, query))
215   - tenants_report[tenant_id] = tenant_entries
216   -
217   - report = {
218   - "generated_at": datetime.now(timezone.utc).isoformat(),
219   - "queries_per_tenant": {tenant: len(queries) for tenant, queries in DEFAULT_QUERIES_BY_TENANT.items()},
220   - "tenants": tenants_report,
221   - }
222   -
223   - out_dir = Path("artifacts/search_eval")
224   - out_dir.mkdir(parents=True, exist_ok=True)
225   - timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
226   - json_path = out_dir / f"search_eval_{timestamp}.json"
227   - md_path = out_dir / f"search_eval_{timestamp}.md"
228   - json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
229   - md_path.write_text(_render_markdown(report), encoding="utf-8")
230   - print(f"[done] json={json_path}")
231   - print(f"[done] md={md_path}")
232   -
233   -
234   -if __name__ == "__main__":
235   - main()