Commit 3ac1f8d1cc9d647361028ebf9451265101457381

Authored by tangwang
1 parent 3984ec64

评估标准优化 (evaluation criteria optimization)

docs/Usage-Guide.md
@@ -202,13 +202,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
 ./scripts/service_ctl.sh restart backend
 sleep 3
 ./scripts/service_ctl.sh status backend
-python ./scripts/eval_search_quality.py
+./scripts/evaluation/quick_start_eval.sh batch
 ```

-评估结果会输出到 `artifacts/search_eval/`,包含:
-
-- `search_eval_*.json`:便于脚本二次分析
-- `search_eval_*.md`:便于人工浏览 top20 结果、分数与命中子句
+离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。

 ### 方式4: 多环境示例(prod / uat)

docs/相关性检索优化说明.md
@@ -240,15 +240,10 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
 ./scripts/service_ctl.sh restart backend
 sleep 3
 ./scripts/service_ctl.sh status backend
-python ./scripts/eval_search_quality.py
+./scripts/evaluation/quick_start_eval.sh batch
 ```

-评估脚本会生成:
-
-- `artifacts/search_eval/search_eval_*.json`
-- `artifacts/search_eval/search_eval_*.md`
-
-可直接从 JSON 中提取 query 级和 result 级调试字段进行分析。
+评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。

 ## 11. 建议测试清单

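Both docs above now point readers at the SQLite-backed artifacts instead of the old loose JSON/Markdown files. A quick way to sanity-check the label cache is a small query against `search_eval.sqlite3`; note this is a sketch — the column names (`tenant_id`, `query_text`, `spu_id`, `label`) are assumptions based on the key described in the new README, not verified against `store.py`, and the snippet uses an in-memory stand-in instead of the real file.

```python
import sqlite3
from collections import Counter

# Hypothetical schema mirroring the (tenant_id, query_text, spu_id) keying
# described in the README; in practice connect to
# artifacts/search_evaluation/search_eval.sqlite3 instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relevance_labels ("
    " tenant_id INTEGER, query_text TEXT, spu_id TEXT, label TEXT,"
    " PRIMARY KEY (tenant_id, query_text, spu_id))"
)
conn.executemany(
    "INSERT INTO relevance_labels VALUES (?, ?, ?, ?)",
    [
        (163, "red fitted t-shirt", "spu-1", "Exact"),
        (163, "red fitted t-shirt", "spu-2", "Partial"),
        (163, "red fitted t-shirt", "spu-3", "Irrelevant"),
    ],
)

# Per-query label distribution for one tenant.
rows = conn.execute(
    "SELECT query_text, label, COUNT(*) FROM relevance_labels"
    " WHERE tenant_id = ? GROUP BY query_text, label",
    (163,),
).fetchall()
dist = Counter({(q, lab): n for q, lab, n in rows})
print(dist[("red fitted t-shirt", "Exact")])  # 1
```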
scripts/evaluation/README.md
 # Search Evaluation Framework

-This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
+This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

-It is designed around one core rule:
+**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.

-- Annotation should be built offline first.
-- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
-- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
+## What it does

-## Goals
+1. Build an annotation set for a fixed query set.
+2. Evaluate live search results against cached labels.
+3. Run batch evaluation and keep historical reports with config snapshots.
+4. Tune fusion parameters in a reproducible loop.

-The framework supports four related tasks:
+## Layout

-1. Build an annotation set for a fixed query set.
-2. Evaluate a live search result list against that annotation set.
-3. Run batch evaluation and store historical reports with config snapshots.
-4. Tune fusion parameters reproducibly.
-
-## Files
-
-- `eval_framework/` (Python package)
-  Modular layout: `framework.py` (orchestration), `store.py` (SQLite), `clients.py` (search/rerank/LLM), `prompts.py` (judge templates), `metrics.py`, `reports.py`, `web_app.py`, `cli.py`, and `static/` (evaluation UI HTML/CSS/JS).
-- `build_annotation_set.py`
-  Thin CLI entrypoint into `eval_framework`.
-- `serve_eval_web.py`
-  Thin web entrypoint into `eval_framework`.
-- `tune_fusion.py`
-  Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
-- `fusion_experiments_shortlist.json`
-  A compact experiment set for practical tuning.
-- `fusion_experiments_round1.json`
-  A broader first-round experiment set.
-- `queries/queries.txt`
-  The canonical evaluation query set.
-- `README_Requirement.md`
-  Requirement reference document.
-- `quick_start_eval.sh`
-  Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
-- `../start_eval_web.sh`
-  Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of core services.
-
-## Quick start (from repo root)
-
-Set tenant if needed (`export TENANT_ID=163`). Requires live search API, DashScope key when the batch step needs new LLM labels, and a working backend.
+| Path | Role |
+|------|------|
+| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
+| `build_annotation_set.py` | CLI entry (build / batch / audit) |
+| `serve_eval_web.py` | Web server for the evaluation UI |
+| `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports |
+| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
+| `fusion_experiments_round1.json` | Broader first-round experiments |
+| `queries/queries.txt` | Canonical evaluation queries |
+| `README_Requirement.md` | Product/requirements reference |
+| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild`, or `serve` |
+| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
+
+## Quick start (repo root)
+
+Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashScope when new LLM labels are required, and a running backend.

 ```bash
-# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
+# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
 ./scripts/evaluation/quick_start_eval.sh batch

-# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
+# Full re-label of current top_k recall (expensive)
 ./scripts/evaluation/quick_start_eval.sh batch-rebuild

-# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
+# UI: http://127.0.0.1:6010/
 ./scripts/evaluation/quick_start_eval.sh serve
-# Or: ./scripts/service_ctl.sh start eval-web
+# or: ./scripts/service_ctl.sh start eval-web
 ```

-Equivalent explicit commands:
+Explicit equivalents:

 ```bash
-# Safe default: no --force-refresh-labels
 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
   --tenant-id "${TENANT_ID:-163}" \
   --queries-file scripts/evaluation/queries/queries.txt \
@@ -67,13 +52,8 @@ Equivalent explicit commands:
   --language en \
   --labeler-mode simple

-# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
 ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
-  --tenant-id "${TENANT_ID:-163}" \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --labeler-mode simple \
+  ... same args ... \
   --force-refresh-labels

 ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
@@ -83,191 +63,35 @@ Equivalent explicit commands:
   --port 6010
 ```

-**Batch behavior:** There is no “skip queries already processed”. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
-
-## Storage Layout
-
-All generated artifacts are under:
-
-- `/data/saas-search/artifacts/search_evaluation`
-
-Important subpaths:
-
-- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
-  Main cache and annotation store.
-- `/data/saas-search/artifacts/search_evaluation/query_builds`
-  Per-query pooled annotation-set build artifacts.
-- `/data/saas-search/artifacts/search_evaluation/batch_reports`
-  Batch evaluation JSON, Markdown reports, and config snapshots.
-- `/data/saas-search/artifacts/search_evaluation/audits`
-  Audit summaries for label quality checks.
-- `/data/saas-search/artifacts/search_evaluation/tuning_runs`
-  Fusion experiment summaries and per-experiment config snapshots.
-
-## SQLite Schema Summary
-
-The main tables in `search_eval.sqlite3` are:
-
-- `corpus_docs`
-  Cached product corpus for the tenant.
-- `rerank_scores`
-  Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
-- `relevance_labels`
-  Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
-- `query_profiles`
-  Structured query-intent profiles extracted before labeling.
-- `build_runs`
-  Per-query pooled-build records.
-- `batch_runs`
-  Batch evaluation history.
-
-## Label Semantics
-
-Three labels are used throughout:
-
-- `Exact`
-  Fully matches the intended product type and all explicit required attributes.
-- `Partial`
-  Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
-- `Irrelevant`
-  Product type mismatches, or explicit required attributes conflict.
-
-The framework always uses:
-
-- LLM-based batched relevance classification
-- caching and retry logic for robust offline labeling
-
-There are now two labeler modes:
-
-- `simple`
-  Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
-- `complex`
-  Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
-
-## Offline-First Workflow
-
-### 1. Refresh labels for the evaluation query set
-
-For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
-
-Example (fills missing labels only; recommended default):
-
-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --labeler-mode simple
-```
-
-To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
-
-This command does two things:
-
-- runs **every** query in the file against the live backend (no skip list)
-- with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
-
-After this step, single-query evaluation can run in cached mode without calling the LLM again.
-
-### 2. Optional pooled build
-
-The framework also supports a heavier pooled build that combines:
-
-- top search results
-- top full-corpus reranker results
-
-Example:
-
-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --search-depth 1000 \
-  --rerank-depth 10000 \
-  --annotate-search-top-k 100 \
-  --annotate-rerank-top-k 120 \
-  --language en
-```
-
-This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
-
-## Why Single-Query Evaluation Was Slow
-
-If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
-
-- perform live search
-- detect recalled but unlabeled products
-- call the LLM to label them
-
-That is not the intended steady-state evaluation path.
-
-The UI/API is now configured to prefer cached evaluation:
-
-- default single-query evaluation uses `auto_annotate=false`
-- unlabeled recalled results are treated as `Irrelevant`
-- the response includes tips explaining that coverage gap
-
-If you want stable, fast evaluation:
-
-1. prebuild labels offline
-2. use cached single-query evaluation
-
-## Web UI
-
-Start the evaluation UI:
-
-```bash
-./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --host 127.0.0.1 \
-  --port 6010
-```
-
-The UI provides:
-
-- query list loaded from `queries.txt`
-- single-query evaluation
-- batch evaluation
-- history of batch reports
-- top recalled results
-- missed `Exact` and `Partial` products that were not recalled
-- tips about unlabeled hits treated as `Irrelevant`
+Each batch run walks the full queries file. With `--force-refresh-labels`, every recalled `spu_id` in the window is re-sent to the LLM and upserted. Without it, only missing labels are filled.

-### Single-query response behavior
+## Artifacts

-For a single query:
+Default root: `artifacts/search_evaluation/`

-1. live search returns recalled `spu_id` values
-2. the framework looks up cached labels by `(query, spu_id)`
-3. unlabeled recalled items are counted as `Irrelevant`
-4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`
+- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
+- `query_builds/` — per-query pooled build outputs
+- `batch_reports/` — batch JSON, Markdown, config snapshots
+- `audits/` — label-quality audit summaries
+- `tuning_runs/` — fusion experiment outputs and config snapshots

-This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
+## Labels

-## CLI Commands
+- **Exact** — Matches intended product type and all explicit required attributes.
+- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
+- **Irrelevant** — Type mismatch or conflicting required attributes.

-### Build pooled annotation artifacts
+**Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).

-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
-```
+## Flows

-### Run batch evaluation
+**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.

-```bash
-./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --labeler-mode simple
-```
+**Deeper pool:** `build_annotation_set.py build` merges deep search and full-corpus rerank windows before labeling (see CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`).

-Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
+**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).

-### Audit annotation quality
+### Audit

 ```bash
 ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
@@ -278,69 +102,20 @@ Use `--force-refresh-labels` if you want to rebuild the offline label cache for
   --labeler-mode simple
 ```

-This checks cached labels against current guardrails and reports suspicious cases.
-
-## Batch Reports
-
-Each batch run stores:
-
-- aggregate metrics
-- per-query metrics
-- label distribution
-- timestamp
-- config snapshot from `/admin/config`
-
-Reports are written as:
-
-- Markdown for easy reading
-- JSON for downstream processing
-
-## Fusion Tuning
-
-The tuning runner applies experiment configs sequentially and records the outcome.
-
-Example:
-
-```bash
-./.venv/bin/python scripts/evaluation/tune_fusion.py \
-  --tenant-id 163 \
-  --queries-file scripts/evaluation/queries/queries.txt \
-  --top-k 50 \
-  --language en \
-  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
-  --score-metric MAP_3 \
-  --apply-best
-```
-
-What it does:
-
-1. writes an experiment config into `config/config.yaml`
-2. restarts backend
-3. runs batch evaluation
-4. stores the per-experiment result
-5. optionally applies the best experiment at the end
+## Web UI

-## Current Practical Recommendation
+Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.

-For day-to-day evaluation:
+## Batch reports

-1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
-2. run the web UI or normal batch evaluation in cached mode
-3. only force-refresh labels again when:
-   - the query set changes
-   - the product corpus changes materially
-   - the labeling logic changes
+Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.

 ## Caveats

-- The current label cache is query-specific, not a full all-products all-queries matrix.
-- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
-- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
-- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
-
-## Related Requirement Docs
+- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
+- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
+- Backend restarts in automated tuning may need a short settle time before requests.

-- `README_Requirement.md`
-- `README_Requirement_zh.md`
+## Related docs

-These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.
+- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.
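The new README text describes cached-mode scoring: recalled `spu_id`s are mapped to cached labels, anything unlabeled counts as `Irrelevant`, and precision-style metrics are computed over the window. A minimal sketch of that scoring step follows; the function names and the half-credit gain for `Partial` are assumptions of this sketch (the real formulas for `P@k`, `MAP_3`, `MAP_2_3` live in `eval_framework/metrics.py`).

```python
from typing import Dict, List

# Assumed graded gains; Partial getting half credit is this sketch's choice.
LABEL_GAIN = {"Exact": 1.0, "Partial": 0.5, "Irrelevant": 0.0}

def precision_at_k(recalled: List[str], labels: Dict[str, str], k: int) -> float:
    """Fraction of the top-k recalled items labeled Exact.

    Unlabeled ids fall back to Irrelevant, matching the README's cached-mode rule.
    """
    window = recalled[:k]
    if not window:
        return 0.0
    hits = sum(1 for spu in window if labels.get(spu, "Irrelevant") == "Exact")
    return hits / len(window)

def graded_precision_at_k(recalled: List[str], labels: Dict[str, str], k: int) -> float:
    """Graded variant where Partial contributes half credit."""
    window = recalled[:k]
    if not window:
        return 0.0
    return sum(LABEL_GAIN[labels.get(spu, "Irrelevant")] for spu in window) / len(window)

labels = {"a": "Exact", "b": "Partial"}   # cached annotation set
recalled = ["a", "b", "x"]                # live recall; "x" has no cached label
print(precision_at_k(recalled, labels, 3))         # ≈ 0.333
print(graded_precision_at_k(recalled, labels, 3))  # 0.5
```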
scripts/evaluation/eval_framework/prompts.py
@@ -5,46 +5,46 @@ from __future__ import annotations
 import json
 from typing import Any, Dict, Sequence

-_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
+_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
 Given the user query and each product's information, assign one relevance label to each product.

 ## Relevance Labels

 ### Exact
-The product fully satisfies the user's search intent.
+The product fully satisfies the user’s search intent: the core product type matches, all explicitly stated key attributes are supported by the product information.

-Use Exact when:
-- The product matches the core product type named in the query.
-- The key requirements explicitly stated in the query are satisfied.
-- There is no clear conflict with any explicit user requirement.
-
-Typical cases:
-- The query is only a product type, and the product is exactly that product type.
-- The query includes product type + attributes, and the product matches the type and those attributes.
+Typical use cases:
+- The query contains only a product type, and the product is exactly that type.
+- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.

 ### Partial
-The product satisfies the user's primary intent, but does not fully satisfy all specified details.
+The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are missing, cannot be confirmed, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.

 Use Partial when:
 - The core product type matches, but some requested attributes cannot be confirmed.
-- The core product type matches, but only some secondary attributes are satisfied.
-- The core product type matches, and there are minor or non-critical deviations from the query.
-- The product does not clearly contradict the user's explicit requirements, but it also cannot be considered a full match.
+- The core product type matches, but some secondary requirements deviate or are inconsistent.
+- The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.

 Typical cases:
 - Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
 - Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
-- Query: "cotton long sleeve blouse", product: "Long Sleeve Blouse" → material not confirmed.

-Important:
-Partial should mainly be used when the core product type is correct, but the detailed requirements are incomplete, uncertain, or only partially matched.
+Detailed example:
+- Query: "cotton long sleeve shirt"
+- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
+
+Analysis:
+- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
+- However, the core product type still matches: both are long sleeve shirts.
+- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
+- Therefore, it should be labeled Partial as a non-target but acceptable substitute.

 ### Irrelevant
 The product does not satisfy the user's main shopping intent.

 Use Irrelevant when:
 - The core product type does not match the query.
-- The product matches the general category but is a different product type that shoppers would not consider interchangeable.
+- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
 - The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.

 Typical cases:
@@ -53,6 +53,8 @@ Typical cases:
 - Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
 - Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.

+This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
+
 ## Decision Principles

 1. Product type is the highest-priority factor.
@@ -71,16 +73,13 @@ Typical cases:
    If the user explicitly searched for one of these, the others should usually be judged Irrelevant.

 3. If the core product type matches, then evaluate attributes.
-   - If attributes fully match → Exact
-   - If attributes are missing, uncertain, or only partially matched → Partial
-   - If attributes clearly contradict an explicit important requirement → Irrelevant
+   - If all explicit attributes match → Exact
+   - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
+   - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant

 4. Distinguish carefully between "not mentioned" and "contradicted".
    - If an attribute is not mentioned or cannot be verified, prefer Partial.
-   - If an attribute is explicitly opposite to the query, use Irrelevant.
-
-5. Do not overuse Exact.
-   Exact requires strong evidence that the product satisfies the user's stated intent, not just the general category.
+   - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.

 Query: {query}

@@ -97,96 +96,93 @@ The lines must correspond sequentially to the products above. @@ -97,96 +96,93 @@ The lines must correspond sequentially to the products above.
97 Do not output any other information. 96 Do not output any other information.
98 """ 97 """
99 98
100 -_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = """你是一个服装电商搜索系统的相关性评估助手。  
101 -给定用户查询和每个产品的信息,为每个产品分配一个相关性标签。 99 +_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
  100 +给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
102 101
103 ## 相关性标签 102 ## 相关性标签
104 103
105 ### 完全相关 104 ### 完全相关
106 -该产品完全满足用户的搜索意图。  
107 -  
108 -在以下情况使用完全相关:  
109 -- 产品与查询中指定的核心产品类型相匹配。  
110 -- 满足了查询中明确说明的关键要求。  
111 -- 与用户明确的任何要求没有明显冲突。 105 +核心产品类型匹配,所有明确提及的关键属性均有产品信息支撑。
112 106
113 -典型情况:  
114 -- 查询仅包含产品类型,而产品恰好是该产品类型。  
115 -- 查询包含产品类型 + 属性,而产品与该类型及这些属性相匹配。 107 +典型适用场景:
  108 +- 查询仅包含产品类型,产品即为该类型。
  109 +- 查询包含“产品类型 + 属性”,产品在类型及所有明确属性上均符合。
116 110
117 ### 部分相关 111 ### 部分相关
118 -该产品满足了用户的主要意图,但并未完全满足所有指定的细节 112 +产品满足用户的主要意图(核心产品类型匹配),但查询中明确的部分要求未体现,或存在偏差。虽然有不一致,但仍属于“非目标但可接受”的替代品
119 113
120 在以下情况使用部分相关: 114 在以下情况使用部分相关:
121 -- 核心产品类型匹配,但部分请求的属性无法确认。  
122 -- 核心产品类型匹配,但仅满足了部分次要属性。  
123 -- 核心产品类型匹配,但与查询存在微小或非关键的偏差。  
124 -- 产品未明显违背用户的明确要求,但也不能视为完全匹配。 115 +- 核心产品类型匹配,但部分请求的属性在商品信息中缺失、未提及或无法确认。
  116 +- 核心产品类型匹配,但材质、版型、风格等次要要求存在偏差或不一致。
  117 +- 商品不是用户最理想的目标,但从电商购物角度看,仍可能被用户视为可接受的替代品。
125 118
126 典型情况: 119 典型情况:
127 -- 查询:"红色修身T恤",产品:"女士T恤" → 颜色/版型无法确认。  
128 -- 查询:"红色修身T恤",产品:"蓝色修身T恤" → 产品类型和版型匹配,但颜色不同。  
129 -- 查询:"棉质长袖衬衫",产品:"长袖衬衫" → 材质未确认。 120 +- 查询:“红色修身T恤”,产品:“女士T恤” → 颜色/版型无法确认。
  121 +- 查询:“红色修身T恤”,产品:“蓝色修身T恤” → 产品类型和版型匹配,但颜色不同。
130 122
131 -重要提示:  
132 -部分相关主要应在核心产品类型正确,但详细要求不完整、不确定或仅部分匹配时使用。 123 +详细案例:
  124 +- 查询:“棉质长袖衬衫”
  125 +- 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
133 126
134 -### 不相关  
135 -该产品不满足用户的主要购物意图。 127 +分析:
  128 +- 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为完全相关。
  129 +- 但核心品类仍然匹配:两者都是“长袖衬衫”。
  130 +- 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
  131 +- 因此应判为部分相关,即“非目标但可接受”的替代品。
136 132
137 -在以下情况使用不相关: 133 +### 不相关
  134 +产品未满足用户的主要购物意图,主要表现为以下情形之一:
138 - 核心产品类型与查询不匹配。 135 - 核心产品类型与查询不匹配。
139 -- 产品匹配了大致类别,但属于购物者不会认为可互换的不同产品类型。  
140 -- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的要求。 136 +- 产品虽属大致相关的大类,但与查询指定的具体子类不可互换。
  137 +- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的属性要求。
141 138
142 典型情况: 139 典型情况:
143 -- 查询:"裤子",产品:"鞋子" → 错误的产品类型。  
144 -- 查询:"连衣裙",产品:"半身裙" → 不同的产品类型。  
145 -- 查询:"修身裤",产品:"宽松阔腿裤" → 版型上明显矛盾。  
146 -- 查询:"无袖连衣裙",产品:"长袖连衣裙" → 袖型上明显矛盾。 140 +- 查询:“裤子”,产品:“鞋子” → 产品类型错误。
  141 +- 查询:“连衣裙”,产品:“半身裙” → 具体产品类型不同。
  142 +- 查询:“修身裤”,产品:“宽松阔腿裤” → 与版型要求明显冲突。
  143 +- 查询:“无袖连衣裙”,产品:“长袖连衣裙” → 与袖型要求明显冲突。
147 144
148 -## 决策原则 145 +该标签强调用户意图的明确性。当查询指向具体类型或关键属性时,即使产品在更高层级类别上相关,也应按不相关处理。
149 146
150 -1. 产品类型是最高优先级的因素。  
151 - 如果查询明确指定了具体产品类型,结果必须匹配该产品类型才能被评为完全相关或部分相关。  
152 - 不同的产品类型通常是不相关,而非部分相关。 147 +## 判断原则
153 148
154 -2. 当查询明确时,相似或相关的产品类型不可互换。 149 +1. 产品类型是最高优先级因素。
  150 + 如果查询明确指定了具体产品类型,那么结果必须匹配该产品类型,才可能判为“完全相关”或“部分相关”。
  151 + 不同产品类型通常应判为“不相关”,而不是“部分相关”。
  152 +
  153 +2. 相似或相关的产品类型,在查询明确时通常不可互换。
155 例如: 154 例如:
156 - 连衣裙 vs 半身裙 vs 连体裤 155 - 连衣裙 vs 半身裙 vs 连体裤
157 - 牛仔裤 vs 裤子 156 - 牛仔裤 vs 裤子
158 - - T恤 vs 衬衫 157 + - T恤 vs 衬衫/上衣
159 - 开衫 vs 毛衣 158 - 开衫 vs 毛衣
160 - 靴子 vs 鞋子 159 - 靴子 vs 鞋子
161 - 文胸 vs 上衣 160 - 文胸 vs 上衣
162 - 双肩包 vs 包 161 - 双肩包 vs 包
163 - 如果用户明确搜索了其中一种,其他的通常应判断为不相关。  
164 -  
165 -3. 如果核心产品类型匹配,则评估属性。  
166 - - 如果属性完全匹配 → 完全相关  
167 - - 如果属性缺失、不确定或仅部分匹配 → 部分相关  
168 - - 如果属性明显违背明确的重点要求 → 不相关 162 + 如果用户明确搜索其中一种,其他类型通常应判为“不相关”。
169 163
170 -4. 仔细区分“未提及”和“矛盾”。  
171 - - 如果属性未提及或无法验证,倾向于部分相关。  
172 - - 如果属性与查询明确相反,使用不相关。 164 +3. 当核心产品类型匹配后,再评估属性。
  165 + - 所有明确属性都匹配 → 完全相关
  166 + - 部分属性缺失、无法确认,或存在一定偏差,但仍是可接受替代品 → 部分相关
  167 + - 明确且重要的属性被明显违背,且不能作为合理替代品 → 不相关
173 168
174 -5. 不要过度使用完全相关。  
175 - 完全相关需要强有力的证据表明产品满足了用户声明的意图,而不仅仅是通用类别。 169 +4. 要严格区分“未提及/无法确认”和“明确冲突”。
  170 + - 如果某属性没有提及,或无法验证,优先判为“部分相关”。
  171 + - 如果某属性与查询要求明确相反,则判为“不相关”;除非在购物语境下它仍明显属于可接受替代品。
176 172
177 -查询: {query} 173 +查询:{query}
178 174
179 -产品: 175 +商品:
180 {lines} 176 {lines}
181 177
182 ## 输出格式 178 ## 输出格式
183 -严格输出 {n} 行,每行包含以下之一:  
184 -Exact  
185 -Partial  
186 -Irrelevant 179 +严格输出 {n} 行,每行只能是以下三者之一:
  180 +完全相关
  181 +部分相关
  182 +不相关
187 183
188 -这些行必须按顺序对应上面的产品。  
189 -不要输出任何其他信息。 184 +输出行必须与上方商品顺序一一对应。
  185 +不要输出任何其他内容。
190 """ 186 """
191 187
192 188
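The revised template enforces a strict output contract: exactly `{n}` lines, one Chinese label per product, in the same order as the `{lines}` block. A minimal sketch of how a caller might fill the placeholders and parse the model's response — the `build_prompt`/`parse_labels` names and the label-to-score mapping are illustrative assumptions, not code from this commit:

```python
# Hypothetical helper sketch around the _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH
# contract; the numeric scores assigned to each label are an assumption.
LABELS = {"完全相关": 2, "部分相关": 1, "不相关": 0}

def build_prompt(template: str, query: str, titles: list[str]) -> str:
    # One numbered line per product, filling the {lines} / {n} placeholders.
    lines = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(titles))
    return template.format(query=query, lines=lines, n=len(titles))

def parse_labels(raw: str, n: int) -> list[int]:
    # The template demands exactly n label lines, in product order;
    # anything else is a protocol violation worth failing loudly on.
    rows = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if len(rows) != n:
        raise ValueError(f"expected {n} label lines, got {len(rows)}")
    return [LABELS[row] for row in rows]
```

Failing fast on a line-count mismatch (rather than truncating or padding) keeps labels from silently shifting onto the wrong products.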
scripts/evaluation/eval_search_quality.py deleted
@@ -1,235 +0,0 @@
1 -#!/usr/bin/env python3  
2 -"""  
3 -Run search quality evaluation against real tenant indexes and emit JSON/Markdown reports.  
4 -  
5 -Usage:  
6 - source activate.sh  
7 - python scripts/eval_search_quality.py  
8 -"""  
9 -  
10 -from __future__ import annotations  
11 -  
12 -import json  
13 -import sys  
14 -from dataclasses import asdict, dataclass  
15 -from datetime import datetime, timezone  
16 -from pathlib import Path  
17 -from typing import Any, Dict, List  
18 -  
19 -PROJECT_ROOT = Path(__file__).resolve().parents[1]  
20 -if str(PROJECT_ROOT) not in sys.path:  
21 - sys.path.insert(0, str(PROJECT_ROOT))  
22 -  
23 -from api.app import get_searcher, init_service  
24 -from context import create_request_context  
25 -  
26 -  
27 -DEFAULT_QUERIES_BY_TENANT: Dict[str, List[str]] = {  
28 - "0": [  
29 - "连衣裙",  
30 - "dress",  
31 - "dress 连衣裙",  
32 - "maxi dress 长裙",  
33 - "波西米亚连衣裙",  
34 - "T恤",  
35 - "graphic tee 图案T恤",  
36 - "shirt",  
37 - "礼服衬衫",  
38 - "hoodie 卫衣",  
39 - "连帽卫衣",  
40 - "sweatshirt",  
41 - "牛仔裤",  
42 - "jeans",  
43 - "阔腿牛仔裤",  
44 - "毛衣 sweater",  
45 - "cardigan 开衫",  
46 - "jacket 外套",  
47 - "puffer jacket 羽绒服",  
48 - "飞行员夹克",  
49 - ],  
50 - "162": [  
51 - "连衣裙",  
52 - "dress",  
53 - "dress 连衣裙",  
54 - "T恤",  
55 - "shirt",  
56 - "hoodie 卫衣",  
57 - "牛仔裤",  
58 - "jeans",  
59 - "毛衣 sweater",  
60 - "jacket 外套",  
61 - "娃娃衣服",  
62 - "芭比裙子",  
63 - "连衣短裙芭比",  
64 - "公主大裙",  
65 - "晚礼服芭比",  
66 - "毛衣熊",  
67 - "服饰饰品",  
68 - "鞋子",  
69 - "军人套",  
70 - "陆军套",  
71 - ],  
72 -}  
73 -  
74 -  
75 -@dataclass  
76 -class RankedItem:  
77 - rank: int  
78 - spu_id: str  
79 - title: str  
80 - vendor: str  
81 - es_score: float | None  
82 - rerank_score: float | None  
83 - text_score: float | None  
84 - text_source_score: float | None  
85 - text_translation_score: float | None  
86 - text_primary_score: float | None  
87 - text_support_score: float | None  
88 - knn_score: float | None  
89 - fused_score: float | None  
90 - matched_queries: Any  
91 -  
92 -  
93 -def _pick_text(value: Any, language: str = "zh") -> str:  
94 - if value is None:  
95 - return ""  
96 - if isinstance(value, dict):  
97 - return str(value.get(language) or value.get("zh") or value.get("en") or "").strip()  
98 - return str(value).strip()  
99 -  
100 -  
101 -def _to_float(value: Any) -> float | None:  
102 - try:  
103 - if value is None:  
104 - return None  
105 - return float(value)  
106 - except (TypeError, ValueError):  
107 - return None  
108 -  
109 -  
110 -def _evaluate_query(searcher, tenant_id: str, query: str) -> Dict[str, Any]:  
111 - context = create_request_context(  
112 - reqid=f"eval-{tenant_id}-{abs(hash(query)) % 1000000}",  
113 - uid="codex",  
114 - )  
115 - result = searcher.search(  
116 - query=query,  
117 - tenant_id=tenant_id,  
118 - size=20,  
119 - from_=0,  
120 - context=context,  
121 - debug=True,  
122 - language="zh",  
123 - enable_rerank=True,  
124 - )  
125 -  
126 - per_result_debug = ((result.debug_info or {}).get("per_result") or [])  
127 - debug_by_spu_id = {  
128 - str(item.get("spu_id")): item  
129 - for item in per_result_debug  
130 - if isinstance(item, dict) and item.get("spu_id") is not None  
131 - }  
132 -  
133 - ranked_items: List[RankedItem] = []  
134 - for rank, spu in enumerate(result.results[:20], 1):  
135 - spu_id = str(getattr(spu, "spu_id", ""))  
136 - debug_item = debug_by_spu_id.get(spu_id, {})  
137 - ranked_items.append(  
138 - RankedItem(  
139 - rank=rank,  
140 - spu_id=spu_id,  
141 - title=_pick_text(getattr(spu, "title", None), language="zh"),  
142 - vendor=_pick_text(getattr(spu, "vendor", None), language="zh"),  
143 - es_score=_to_float(debug_item.get("es_score")),  
144 - rerank_score=_to_float(debug_item.get("rerank_score")),  
145 - text_score=_to_float(debug_item.get("text_score")),  
146 - text_source_score=_to_float(debug_item.get("text_source_score")),  
147 - text_translation_score=_to_float(debug_item.get("text_translation_score")),  
148 - text_primary_score=_to_float(debug_item.get("text_primary_score")),  
149 - text_support_score=_to_float(debug_item.get("text_support_score")),  
150 - knn_score=_to_float(debug_item.get("knn_score")),  
151 - fused_score=_to_float(debug_item.get("fused_score")),  
152 - matched_queries=debug_item.get("matched_queries"),  
153 - )  
154 - )  
155 -  
156 - return {  
157 - "query": query,  
158 - "tenant_id": tenant_id,  
159 - "total": result.total,  
160 - "max_score": result.max_score,  
161 - "took_ms": result.took_ms,  
162 - "query_analysis": ((result.debug_info or {}).get("query_analysis") or {}),  
163 - "stage_timings": ((result.debug_info or {}).get("stage_timings") or {}),  
164 - "top20": [asdict(item) for item in ranked_items],  
165 - }  
166 -  
167 -  
168 -def _render_markdown(report: Dict[str, Any]) -> str:  
169 - lines: List[str] = []  
170 - lines.append(f"# Search Quality Evaluation")  
171 - lines.append("")  
172 - lines.append(f"- Generated at: {report['generated_at']}")  
173 - lines.append(f"- Queries per tenant: {report['queries_per_tenant']}")  
174 - lines.append("")  
175 - for tenant_id, entries in report["tenants"].items():  
176 - lines.append(f"## Tenant {tenant_id}")  
177 - lines.append("")  
178 - for entry in entries:  
179 - qa = entry.get("query_analysis") or {}  
180 - lines.append(f"### Query: {entry['query']}")  
181 - lines.append("")  
182 - lines.append(  
183 - f"- total={entry['total']} max_score={entry['max_score']:.6f} took_ms={entry['took_ms']}"  
184 - )  
185 - lines.append(  
186 - f"- detected_language={qa.get('detected_language')} translations={qa.get('translations')}"  
187 - )  
188 - lines.append("")  
189 - lines.append("| rank | spu_id | title | fused | rerank | text | text_src | text_trans | knn | es | matched_queries |")  
190 - lines.append("| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |")  
191 - for item in entry.get("top20", []):  
192 - title = str(item.get("title", "")).replace("|", "/")  
193 - matched = json.dumps(item.get("matched_queries"), ensure_ascii=False)  
194 - matched = matched.replace("|", "/")  
195 - lines.append(  
196 - f"| {item.get('rank')} | {item.get('spu_id')} | {title} | "  
197 - f"{item.get('fused_score')} | {item.get('rerank_score')} | {item.get('text_score')} | "  
198 - f"{item.get('text_source_score')} | {item.get('text_translation_score')} | "  
199 - f"{item.get('knn_score')} | {item.get('es_score')} | {matched} |"  
200 - )  
201 - lines.append("")  
202 - return "\n".join(lines)  
203 -  
204 -  
205 -def main() -> None:  
206 - init_service("http://localhost:9200")  
207 - searcher = get_searcher()  
208 -  
209 - tenants_report: Dict[str, List[Dict[str, Any]]] = {}  
210 - for tenant_id, queries in DEFAULT_QUERIES_BY_TENANT.items():  
211 - tenant_entries: List[Dict[str, Any]] = []  
212 - for query in queries:  
213 - print(f"[eval] tenant={tenant_id} query={query}")  
214 - tenant_entries.append(_evaluate_query(searcher, tenant_id, query))  
215 - tenants_report[tenant_id] = tenant_entries  
216 -  
217 - report = {  
218 - "generated_at": datetime.now(timezone.utc).isoformat(),  
219 - "queries_per_tenant": {tenant: len(queries) for tenant, queries in DEFAULT_QUERIES_BY_TENANT.items()},  
220 - "tenants": tenants_report,  
221 - }  
222 -  
223 - out_dir = Path("artifacts/search_eval")  
224 - out_dir.mkdir(parents=True, exist_ok=True)  
225 - timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  
226 - json_path = out_dir / f"search_eval_{timestamp}.json"  
227 - md_path = out_dir / f"search_eval_{timestamp}.md"  
228 - json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")  
229 - md_path.write_text(_render_markdown(report), encoding="utf-8")  
230 - print(f"[done] json={json_path}")  
231 - print(f"[done] md={md_path}")  
232 -  
233 -  
234 -if __name__ == "__main__":  
235 - main()